[python] Python: Get HTTP headers from urllib2.urlopen call?

Actually, it appears that urllib2 can do an HTTP HEAD request.

The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.

Here's my take on it:

import urllib2

# Derive from Request class and override get_method to allow a HEAD request.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

myurl = 'http://bit.ly/doFeT'
request = HeadRequest(myurl)

try:
    response = urllib2.urlopen(request)
    response_headers = response.info()

    # This will just display all the dictionary key-value pairs.  Replace this
    # line with something useful.
    response_headers.dict

except urllib2.HTTPError, e:
    # Prints the HTTP Status code of the response but only if there was a 
    # problem.
    print ("Error code: %s" % e.code)

If you check this with something like the Wireshark network protocol analazer, you can see that it is actually sending out a HEAD request, rather than a GET.

This is the HTTP request and response from the code above, as captured by Wireshark:

HEAD /doFeT HTTP/1.1
Accept-Encoding: identity
Host: bit.ly
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 301 Moved
Server: nginx
Date: Sun, 19 Feb 2012 13:20:56 GMT
Content-Type: text/html; charset=utf-8
Cache-control: private; max-age=90
Location: http://www.kidsidebyside.org/?p=445
MIME-Version: 1.0
Content-Length: 127
Connection: close
Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly

However, as mentioned in one of the comments in the other question, if the URL in question includes a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming, if you really wanted to only make HEAD requests.

The request above involves a redirect. Here is request to the destination, as captured by Wireshark:

GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Accept-Encoding: identity
Host: www.kidsidebyside.org
Connection: close
User-Agent: Python-urllib/2.7

An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:

import httplib2

url = "http://bit.ly/doFeT"
http_interface = httplib2.Http()

try:
    response, content = http_interface.request(url, method="HEAD")
    print ("Response status: %d - %s" % (response.status, response.reason))

    # This will just display all the dictionary key-value pairs.  Replace this
    # line with something useful.
    response.__dict__

except httplib2.ServerNotFoundError, e:
    print (e.message)

This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

Here's the first request:

HEAD /doFeT HTTP/1.1
Host: bit.ly
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

Here's the second request, to the destination:

HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Host: www.kidsidebyside.org
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)