Script to get the HTTP status code of a list of urls

Question

I have a list of URLs that I need to check, to see if they still work or not. I would like to write a bash script that does that for me.

I only need the returned HTTP status code, i.e. 200, 404, 500 and so forth. Nothing more.

EDIT: Note that there is an issue if the page says "404 not found" but returns a 200 OK status. That is a misconfigured web server, but you may have to consider this case.

For more on this, see Check if a URL goes to a page containing the text "404"
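One way to handle the misconfigured-server case from the EDIT is to look at the page body whenever the status is 200. The following is only a minimal sketch: the function name classify_response and the pattern it searches for are assumptions, and a real site may word its error page differently (or legitimately contain "404" somewhere in a working page).

```shell
# classify_response STATUS
# Reads a page body on stdin and prints "ok", "soft-404", or the status code.
# Hypothetical helper: the name and the grep pattern are illustrative only.
classify_response() {
  local status=$1
  if [ "$status" != "200" ]; then
    cat > /dev/null               # drain the body, it is not needed
    printf '%s\n' "$status"
  elif grep -qiE '404|not found'; then
    printf 'soft-404\n'           # 200 status, but the body looks like an error page
  else
    printf 'ok\n'
  fi
}
```

It could be fed from curl, for example curl --silent "$url" | classify_response "$status", where $status comes from one of the commands in the answers.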

User · Answer

wget -S -i file will get you the headers from each URL in a file. Filter through grep for the status code specifically.
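The grep filter mentioned here could look like the sketch below. It assumes the status lines in wget's header dump consist of optional whitespace followed by HTTP/; the function name status_codes is made up for illustration.

```shell
# status_codes: read a header dump (such as the stderr of `wget -S`) on stdin
# and print the status code of every HTTP status line, one per line.
# Assumes status lines look like "  HTTP/1.1 200 OK".
status_codes() {
  grep -E '^[[:space:]]*HTTP/' | awk '{print $2}'
}

# Demonstration on canned input, so no network access is needed.
printf '  HTTP/1.1 301 Moved Permanently\n  Location: https://example.com/\n  HTTP/1.1 200 OK\n' | status_codes
```

With real URLs the input would come from wget, e.g. wget -S -i file 2>&1 | status_codes. The 2>&1 matters because wget prints the headers on standard error.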

User · Answer

wget --spider -S "http://url/to/be/checked" 2>&1 | grep "HTTP/" | awk '{print $2}'

prints only the status code for you.

User · Answer

Use curl to fetch the HTTP header only (not the whole file) and parse it:

$ curl -I --stderr /dev/null http://www.google.co.uk/index.html | head -1 | cut -d' ' -f2
200

User · Answer

Curl has a specific option, --write-out, for this:

$ curl -o /dev/null --silent --head --write-out '%{http_code}\n' <url>
200

-o /dev/null throws away the usual output

--silent throws away the progress meter

--head makes a HEAD HTTP request, instead of GET

--write-out '%{http_code}\n' prints the required status code

To wrap this up in a complete Bash script:

#!/bin/bash
# Print "<status code> <url>" for every URL listed in url-list.txt
while IFS= read -r LINE; do
  curl -o /dev/null --silent --head --write-out "%{http_code} $LINE\n" "$LINE"
done < url-list.txt

(Eagle-eyed readers will notice that this uses one curl process per URL, which imposes fork and TCP connection penalties. It would be faster if multiple URLs were combined in a single curl, but there isn't space to write out the monstrous repetition of options that curl requires to do this.)
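The single-process variant mentioned in the closing remark can be approximated by generating a curl config file instead of typing the options over and over. This is only a sketch under assumptions: make_curl_config is a hypothetical helper name, the input is assumed to hold one URL per line with no double quotes or backslashes, and it relies on curl's --config (-K) file syntax with next separating the transfers.

```shell
# make_curl_config: read URLs on stdin (one per line) and print a curl config
# that requests each URL with HEAD, discards the output and prints the status.
# Hypothetical helper; the options are repeated per URL because `next`
# starts a fresh set of per-transfer options.
make_curl_config() {
  local first=1 url
  while IFS= read -r url; do
    [ -n "$url" ] || continue                 # skip blank lines
    if [ "$first" -eq 1 ]; then
      first=0
    else
      printf 'next\n'                         # separate from the previous URL
    fi
    printf 'url = "%s"\n' "$url"
    printf 'output = "/dev/null"\n'
    printf 'head\n'
    printf 'silent\n'
    printf 'write-out = "%%{http_code} %%{url_effective}\\n"\n'
  done
}
```

It might then be used as make_curl_config < url-list.txt | curl --config -, which starts a single curl process for the whole list. Whether connections are actually reused depends on the hosts involved.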

User · Answer

I found a tool "webchk" written in Python. It returns a status code for a list of URLs: https://pypi.org/project/webchk/

Output looks like this:

webchk -i ./dxieu.txt | grep '200'
http://salesforce-case-status.dxi.eu/login ... 200 OK (0.108)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.389)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.401)

Hope that helps!

User · Answer

Due to https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_with_xargs_-P (output from parallel jobs in xargs risks being mixed), I would use GNU Parallel instead of xargs to parallelize:

cat url.lst | parallel -P0 -q curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' > outfile

In this particular case it may be safe to use xargs because the output is so short, so the problem with using xargs is rather that if someone later changes the code to do something bigger, it will no longer be safe. Or if someone reads this question and thinks he can replace curl with something else, then that may also not be safe.

User · Answer

This relies on widely available wget, present almost everywhere, even on Alpine Linux.

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'

The explanations are as follows:

--quiet

Turn off Wget's output.

Source - wget man pages

--spider

[ ... ] it will not download the pages, just check that they are there. [ ... ]

Source - wget man pages

--server-response

Print the headers sent by HTTP servers and responses sent by FTP servers.

Source - wget man pages

What they don't say about --server-response is that the headers are printed to standard error (stderr), hence the need to redirect them to standard output with 2>&1.

With the output on standard output, we can pipe it to awk to extract the HTTP status code. That code is:

the second ($2) non-blank group of characters: {$2}

on the very first line of the header: NR==1

And because we want to print it: {print $2}.

wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'
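The awk step can be tried on canned header text without touching the network. The sketch below also illustrates one caveat: when wget follows a redirect it prints several header blocks, so NR==1 reports the first response rather than the final one. The function names first_status and last_status, and the sample headers, are assumptions for illustration.

```shell
# first_status: status code from the very first line, as in the answer above.
first_status() {
  awk 'NR==1 {print $2}'
}

# last_status: status code from the last line whose first field starts with
# HTTP/, i.e. the final response when redirects were followed.
last_status() {
  awk '$1 ~ /^HTTP\// {code = $2} END {if (code != "") print code}'
}

# Canned header dump resembling a redirect followed by the final page.
sample='  HTTP/1.1 301 Moved Permanently
  Location: https://example.com/
  HTTP/1.1 200 OK'

printf '%s\n' "$sample" | first_status   # prints 301
printf '%s\n' "$sample" | last_status    # prints 200
```

Which of the two is wanted depends on whether a redirecting URL should count as working or as needing an update.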

User · Answer

Extending the answer already provided by Phil: adding parallelism to it is a no-brainer in bash if you use xargs for the call.

Here is the code:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' < url.lst

-n1: use just one value (from the list) as argument to the curl call

-P10: Keep 10 curl processes alive at any time (i.e. 10 parallel connections)

Check the --write-out option in the curl manual for more data you can extract with it (times, etc.).

In case it helps someone this is the call I'm currently using:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective};%{http_code};%{time_total};%{time_namelookup};%{time_connect};%{size_download};%{speed_download}\n' < url.lst | tee results.csv

It just outputs a bunch of data into a csv file that can be imported into any office tool.
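Before reaching for an office tool, the semicolon-separated rows can also be summarised with awk. A small sketch, assuming the field order from the command above (URL first, status code second) and that the URLs contain no semicolons; count_by_status and failed_urls are hypothetical helper names.

```shell
# count_by_status: read rows "url;http_code;..." on stdin and print how many
# URLs returned each status code, sorted by code.
count_by_status() {
  awk -F';' '{count[$2]++} END {for (c in count) print c, count[c]}' | sort
}

# failed_urls: print the URL and code of every row whose status is not 2xx.
failed_urls() {
  awk -F';' '$2 !~ /^2/ {print $1, $2}'
}
```

Typical use would be count_by_status < results.csv for an overview and failed_urls < results.csv for the list of links to fix.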
