Using wget to recursively fetch a directory with arbitrary files in it

Question

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:

    http://mysite.com/configs/.vim/

.vim holds multiple files and directories. I want to replicate that on the client using wget. I can't seem to find the right combination of wget flags to get this done. Any ideas?

User · Answer

Recursive wget ignoring robots (for websites):

    wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'

-e robots=off causes it to ignore robots.txt for that domain.
-r makes it recursive.
-np means "no parents", so it doesn't follow links up to the parent folder.

User · Answer

For anyone else having similar issues: wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:

    wget -e robots=off http://www.example.com/

See http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html

User · Answer

Wget 1.18 may work better. For example, I got bitten by a version 1.12 bug where

    wget --recursive (...)

only retrieves index.html instead of all files. The workaround was to notice some 301 redirects and try the new location. Given the new URL, wget got all the files in the directory.
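One way to spot such a redirect before starting a recursive download is to ask wget for the response headers only and pull out the Location value. The sketch below is a hypothetical helper (the name last_location is not part of wget); it only parses header text from standard input and assumes each redirect is reported on a line starting with "Location:".

```shell
# Print the last redirect target found in HTTP headers read from stdin.
# Prints nothing when no Location header is present.
last_location() {
    sed -n 's/^[[:space:]]*[Ll]ocation:[[:space:]]*\([^[:space:]]*\).*$/\1/p' | tail -n 1
}

# Usage (needs network access; the URL is a placeholder):
#   wget --server-response --spider http://example.com/configs/.vim 2>&1 | last_location
```

If the helper prints a URL, that URL is the one to hand to the recursive wget call.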

User · Answer

    wget -r http://mysite.com/configs/.vim/

works for me. Perhaps you have a .wgetrc which is interfering with it?
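To check whether a startup file is in play, a small sketch like the following lists the candidates that actually exist. The function name list_wgetrc is hypothetical, and the system-wide paths are assumptions about a typical install; your build may use a different location.

```shell
# List wget startup files that exist and could change wget's defaults.
# Candidates: $WGETRC (if set), the per-user file, and two common
# system-wide locations (assumed, not guaranteed for every build).
list_wgetrc() {
    for f in "${WGETRC:-}" "${HOME}/.wgetrc" /etc/wgetrc /usr/local/etc/wgetrc; do
        if [ -n "$f" ] && [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

If a file shows up, look in it for settings that affect recursion, accept/reject lists or robots handling. Recent wget releases also document a --no-config option that skips startup files for a single run.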

User · Answer

To download a directory recursively, rejecting the index.html* files and saving without the hostname and parent directories (so the whole remote directory structure is not recreated locally):

    wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data

User · Answer

You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:

    wget --recursive --no-parent http://example.com/configs/.vim/

To avoid downloading the auto-generated index.html files, use the -R/--reject option:

    wget -r -np -R "index.html*" http://example.com/configs/.vim/

User · Answer

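This version downloads recursively and doesn't create parent directories.

    wgetod() {
        # Count the slashes in the URL path (scheme, host and trailing slash removed)
        NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
        # Cut every directory level except the last one
        NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
        wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
    }

Usage: add it to ~/.bashrc or paste it into a terminal, then run:

    wgetod "http://example.com/x/"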
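The slash-counting step of wgetod can also be written without perl. The following is a minimal sketch using only POSIX parameter expansion; cut_dirs_for is a hypothetical name, and it assumes the argument is a directory URL rather than a link to a single file.

```shell
# Print how many leading path components to strip so that only the last
# directory of the URL is recreated locally (the value for --cut-dirs).
cut_dirs_for() (
    path=${1#*://}                 # drop the scheme
    case $path in
        */*) path=${path#*/} ;;    # drop the host, keep e.g. "a/b/c/"
        *)   path= ;;              # URL has no path at all
    esac
    path=${path%/}                 # drop one trailing slash
    count=0
    # Each remaining slash separates one parent directory from the rest
    while [ "${path#*/}" != "$path" ]; do
        path=${path#*/}
        count=$((count + 1))
    done
    echo "$count"
)

# Usage (the URL is a placeholder):
#   wget -r -nH --no-parent --cut-dirs="$(cut_dirs_for "$url")" "$url"
```

Keeping the computation in its own function makes it easy to check the number before any download starts.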

User · Answer

If --no-parent does not help, you might use the --include option. Suppose the directory structure is:

    http://<host>/downloads/good
    http://<host>/downloads/bad

and you want to download the downloads/good directory but not downloads/bad:

    wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

User · Answer

All you need is two flags: -r for recursion and --no-parent (or -np) so that wget does not ascend into the parent directory ('..'). Like this:

    wget -r --no-parent http://example.com/configs/.vim/

That's it. It will download into the following local tree: ./example.com/configs/.vim/

However, if you do not want the first two directories (example.com and configs), add -nH to drop the host name and use --cut-dirs, as suggested in earlier replies, to drop configs. Note that --cut-dirs counts only remote path components, not the host directory:

    wget -r --no-parent -nH --cut-dirs=1 http://example.com/configs/.vim/

And it will download your file tree only into ./.vim/

In fact, I got the first line of this answer straight from the wget manual; it has a very clean example towards the end of section 4.3.
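How the host directory and --cut-dirs combine can be sketched as a small predictor that prints where a directory URL would land locally. This is an illustration of the behaviour described in the wget manual's directory options, not something wget provides; local_dir_for and its arguments are hypothetical.

```shell
# Predict the local directory for a directory URL.
#   $1 = URL, $2 = --cut-dirs value (default 0), $3 = "yes" if -nH is used
local_dir_for() (
    url=$1
    cut=${2:-0}
    no_host=${3:-no}
    rest=${url#*://}               # drop the scheme
    host=${rest%%/*}               # host name up to the first slash
    case $rest in
        */*) path=${rest#*/} ;;    # remote path without the leading slash
        *)   path= ;;
    esac
    path=${path%/}                 # drop one trailing slash
    i=0
    # Remove one leading path component per --cut-dirs level
    while [ "$i" -lt "$cut" ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   path= ;;
        esac
        i=$((i + 1))
    done
    if [ "$no_host" = yes ]; then
        out=$path
    else
        out=$host${path:+/$path}
    fi
    echo "./$out"
)
```

For http://example.com/configs/.vim/ the sketch gives ./example.com/configs/.vim with no options, ./example.com with a cut of 2 and the host directory kept, and ./.vim with a cut of 1 plus -nH.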

User · Answer

You should be able to do it simply by adding a -r:

    wget -r http://stackoverflow.com

User · Answer

First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:

    wget --recursive ${comment# self-explanatory} \
        --no-parent ${comment# will not crawl links in folders above the base of the URL} \
        --convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
        --random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
        --no-host-directories ${comment# do not create folders with the domain name} \
        --execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN} \
        --level=inf --accept '*' ${comment# do not limit to 5 levels or common file formats} \
        --reject="index.html*" ${comment# use this option if you need an exact mirror} \
        --cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
        URL

Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl has completed.

Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.
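The query-stripping step can be sketched as a rename pass over the downloaded tree. This is one possible approach, not part of the script above: strip_query_names is a hypothetical helper, it assumes file names contain no newlines, and it leaves a file alone when the stripped name already exists.

```shell
# Rename files such as "main.css?crc=12324567" to "main.css" under a directory.
# $1 = directory to process (default: current directory)
strip_query_names() {
    find "${1:-.}" -type f -name '*[?]*' | while IFS= read -r file; do
        dir=$(dirname "$file")
        base=$(basename "$file")
        clean=${base%%\?*}         # everything before the first "?"
        # Skip empty results and never overwrite an existing file
        if [ -n "$clean" ] && [ ! -e "$dir/$clean" ]; then
            mv "$file" "$dir/$clean"
        fi
    done
}

# Usage (the directory name is a placeholder):
#   strip_query_names ./mirror && cd ./mirror && python3 -m http.server
```

Running it on a copy of the mirror first is a cheap way to confirm that no two files collapse to the same name.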

User · Answer

Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

    wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

User · Answer

You should use the -m (mirror) flag, as that takes care not to mess with timestamps and to recurse indefinitely.

    wget -m http://example.com/configs/.vim/

If you add the points mentioned by others in this thread, it would be:

    wget -m -e robots=off --no-parent http://example.com/configs/.vim/

User · Answer

The following options seem to be the perfect combination when dealing with recursive downloads:

    wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

Relevant snippets from the man page, for convenience:

    -nd
    --no-directories
        Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).

    -np
    --no-parent
        Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

User · Answer

To fetch a directory recursively with a username and password, use the following command:

    wget -r --user='(put username here)' --password='(put password here)' --no-parent http://example.com/
