[shell] Using wget to recursively fetch a directory with arbitrary files in it

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:

http://mysite.com/configs/.vim/

.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?

This question is related to shell wget

The answer is


The following option seems to be the perfect combination when dealing with recursive download:

wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

Relevant snippets from man pages for convenience:

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving recursively.  With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
       filenames will get extensions .n).


   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

You should be able to do it simply by adding a -r

wget -r http://stackoverflow.com/

Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

All you need is two flags, one is "-r" for recursion and "--no-parent" (or -np) in order not to go in the '.' and ".." . Like this:

wget -r --no-parent http://example.com/configs/.vim/

That's it. It will download into the following local tree: ./example.com/configs/.vim . However if you do not want the first two directories, then use the additional flag --cut-dirs=2 as suggested in earlier replies:

wget -r --no-parent --cut-dirs=2 http://example.com/configs/.vim/

And it will download your file tree only into ./.vim/

In fact, I got the first line from this answer precisely from the wget manual, they have a very clean example towards the end of section 4.3.


For anyone else that having similar issues. Wget follows robots.txt which might not allow you to grab the site. No worries, you can turn it off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html


This version downloads recursively and doesn't create parent directories.

wgetod() {
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}

Usage:

  1. Add to ~/.bashrc or paste into terminal
  2. wgetod "http://example.com/x/"

If --no-parent not help, you might use --include option.

Directory struct:

http://<host>/downloads/good
http://<host>/downloads/bad

And you want to download downloads/good but not downloads/bad directory:

wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:

wget --recursive ${comment# self-explanatory} \
  --no-parent ${comment# will not crawl links in folders above the base of the URL} \
  --convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
  --random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
  --no-host-directories ${comment# do not create folders with the domain name} \
  --execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
  --level=inf  --accept '*' ${comment# do not limit to 5 levels or common file formats} \
  --reject="index.html*" ${comment# use this option if you need an exact mirror} \
  --cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL

Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl was completed.

Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.


You should use the -m (mirror) flag, as that takes care to not mess with timestamps and to recurse indefinitely.

wget -m http://example.com/configs/.vim/

If you add the points mentioned by others in this thread, it would be:

wget -m -e robots=off --no-parent http://example.com/configs/.vim/

Recursive wget ignoring robots (for websites)

wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'

-e robots=off causes it to ignore robots.txt for that domain

-r makes it recursive

-np = no parents, so it doesn't follow links up to the parent folder


You should be able to do it simply by adding a -r

wget -r http://stackoverflow.com/

wget -r http://mysite.com/configs/.vim/

works for me.

Perhaps you have a .wgetrc which is interfering with it?


To fetch a directory recursively with username and password, use the following command:

wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/

Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...

wget --recursive (...)

...only retrieves index.html instead of all files.

Workaround was to notice some 301 redirects and try the new location — given the new URL, wget got all the files in the directory.


All you need is two flags, one is "-r" for recursion and "--no-parent" (or -np) in order not to go in the '.' and ".." . Like this:

wget -r --no-parent http://example.com/configs/.vim/

That's it. It will download into the following local tree: ./example.com/configs/.vim . However if you do not want the first two directories, then use the additional flag --cut-dirs=2 as suggested in earlier replies:

wget -r --no-parent --cut-dirs=2 http://example.com/configs/.vim/

And it will download your file tree only into ./.vim/

In fact, I got the first line from this answer precisely from the wget manual, they have a very clean example towards the end of section 4.3.


You should be able to do it simply by adding a -r

wget -r http://stackoverflow.com/

If --no-parent not help, you might use --include option.

Directory struct:

http://<host>/downloads/good
http://<host>/downloads/bad

And you want to download downloads/good but not downloads/bad directory:

wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

wget -r http://mysite.com/configs/.vim/

works for me.

Perhaps you have a .wgetrc which is interfering with it?


Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

For anyone else that having similar issues. Wget follows robots.txt which might not allow you to grab the site. No worries, you can turn it off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html


First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:

wget --recursive ${comment# self-explanatory} \
  --no-parent ${comment# will not crawl links in folders above the base of the URL} \
  --convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
  --random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
  --no-host-directories ${comment# do not create folders with the domain name} \
  --execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
  --level=inf  --accept '*' ${comment# do not limit to 5 levels or common file formats} \
  --reject="index.html*" ${comment# use this option if you need an exact mirror} \
  --cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL

Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl was completed.

Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.


wget -r http://mysite.com/configs/.vim/

works for me.

Perhaps you have a .wgetrc which is interfering with it?


Recursive wget ignoring robots (for websites)

wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'

-e robots=off causes it to ignore robots.txt for that domain

-r makes it recursive

-np = no parents, so it doesn't follow links up to the parent folder


To fetch a directory recursively with username and password, use the following command:

wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/

This version downloads recursively and doesn't create parent directories.

wgetod() {
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}

Usage:

  1. Add to ~/.bashrc or paste into terminal
  2. wgetod "http://example.com/x/"

To download a directory recursively, which rejects index.html* files and downloads without the hostname, parent directory and the whole directory structure :

wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data

The following option seems to be the perfect combination when dealing with recursive download:

wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

Relevant snippets from man pages for convenience:

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving recursively.  With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
       filenames will get extensions .n).


   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...

wget --recursive (...)

...only retrieves index.html instead of all files.

Workaround was to notice some 301 redirects and try the new location — given the new URL, wget got all the files in the directory.