Python: Get HTTP headers from urllib2.urlopen call?

Question

Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to just read the HTTP response header without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?

    import urllib2
    myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
    page = urllib2.urlopen(myurl)  # open connection, get headers

    html = page.readlines()  # stream page

User · Answer

What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that:

    >>> import httplib
    >>> conn = httplib.HTTPConnection("www.google.com")
    >>> conn.request("HEAD", "/index.html")
    >>> res = conn.getresponse()
    >>> print res.status, res.reason
    200 OK
    >>> print res.getheaders()
    [('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
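For readers on Python 3, where `httplib` was renamed `http.client`, the same HEAD request might look roughly like the sketch below. The helper name `fetch_headers` and its parameters are illustrative assumptions, not part of the original answer.

```python
import http.client


def fetch_headers(host, path="/", port=80, timeout=10):
    """Send a HEAD request and return (status, reason, headers).

    Hypothetical helper: the name and signature are assumptions made for
    this sketch. Header names are lower-cased; if a header is repeated
    (for example Set-Cookie), only the last value is kept in the dict.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        headers = {name.lower(): value for name, value in res.getheaders()}
        return res.status, res.reason, headers
    finally:
        # A HEAD response carries no body, so the connection can be closed at once.
        conn.close()


# Example call (needs network access, so it is left commented out):
# status, reason, headers = fetch_headers("www.google.com", "/index.html")
```

Returning a plain dict keeps the sketch short; a list of pairs, as `getheaders()` itself returns, would be the safer choice when repeated headers matter.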

User · Answer

Actually, it appears that urllib2 can do an HTTP HEAD request.

The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.

Here's my take on it:

    import urllib2

    # Derive from Request class and override get_method to allow a HEAD request.
    class HeadRequest(urllib2.Request):
        def get_method(self):
            return "HEAD"

    myurl = 'http://bit.ly/doFeT'
    request = HeadRequest(myurl)

    try:
        response = urllib2.urlopen(request)
        response_headers = response.info()

        # This will just display all the dictionary key-value pairs.  Replace this
        # line with something useful.
        print (response_headers.dict)

    except urllib2.HTTPError, e:
        # Prints the HTTP status code of the response, but only if there was a
        # problem.
        print ("Error code: %s" % e.code)

If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending out a HEAD request, rather than a GET.

This is the HTTP request and response from the code above, as captured by Wireshark:

    HEAD /doFeT HTTP/1.1
    Accept-Encoding: identity
    Host: bit.ly
    Connection: close
    User-Agent: Python-urllib/2.7

    HTTP/1.1 301 Moved
    Server: nginx
    Date: Sun, 19 Feb 2012 13:20:56 GMT
    Content-Type: text/html; charset=utf-8
    Cache-control: private; max-age=90
    Location: http://www.kidsidebyside.org/?p=445
    MIME-Version: 1.0
    Content-Length: 127
    Connection: close
    Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly

However, as mentioned in one of the comments in the other question, if the URL in question includes a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming if you really wanted to make only HEAD requests.

The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:

    GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
    Accept-Encoding: identity
    Host: www.kidsidebyside.org
    Connection: close
    User-Agent: Python-urllib/2.7

An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:

    import httplib2

    url = "http://bit.ly/doFeT"
    http_interface = httplib2.Http()

    try:
        response, content = http_interface.request(url, method="HEAD")
        print ("Response status: %d - %s" % (response.status, response.reason))

        # This will just display all the dictionary key-value pairs.  Replace this
        # line with something useful.
        print (response.__dict__)

    except httplib2.ServerNotFoundError, e:
        print (e.message)

This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

Here's the first request:

    HEAD /doFeT HTTP/1.1
    Host: bit.ly
    accept-encoding: gzip, deflate
    user-agent: Python-httplib2/0.7.2 (gzip)

Here's the second request, to the destination:

    HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
    Host: www.kidsidebyside.org
    accept-encoding: gzip, deflate
    user-agent: Python-httplib2/0.7.2 (gzip)
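If you would rather stay with the standard library, one possible workaround for the redirect shortcoming is a redirect handler that copies the HEAD method onto the follow-up request. The sketch below targets Python 3, where `urllib2` became `urllib.request`; the names `HeadRedirectHandler` and `build_head_opener` are made up for illustration, and the approach assumes that overriding `redirect_request` is acceptable in your setup.

```python
import urllib.request


class HeadRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Illustrative handler that keeps HEAD as HEAD when following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock handler validate the redirect and build the new request.
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        # Then force the original method back onto it if it was a HEAD request.
        if new_req is not None and req.get_method() == "HEAD":
            new_req.method = "HEAD"
        return new_req


def build_head_opener():
    """Return an opener whose redirect handling preserves HEAD (hypothetical helper)."""
    return urllib.request.build_opener(HeadRedirectHandler)


# Example use (needs network access, so it is left commented out):
# opener = build_head_opener()
# request = urllib.request.Request("http://bit.ly/doFeT", method="HEAD")
# response = opener.open(request)
# print(response.geturl(), dict(response.headers))
```

Because `build_opener` swaps a default handler for any subclass passed to it, the custom class takes the place of the stock redirect handler rather than running alongside it.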

User · Answer

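    import urllib2

    def GetHtmlPage(self, addr):
        headers = { 'User-Agent' : self.userAgent,
                    'Cookie' : self.cookies }

        # Pass the headers to the request; the original built them but never used them.
        req = urllib2.Request(addr, headers=headers)
        response = urllib2.urlopen(req)

        # response.info() holds the HTTP response headers.
        print "ResponseInfo="
        print response.info()

        resultsHtml = unicode(response.read(), self.encoding)
        return resultsHtml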

User · Answer

Use the response.info() method to get the headers.

From the urllib2 docs:

    urllib2.urlopen(url[, data][, timeout])

    ...

    This function returns a file-like object with two additional methods:

    - geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed
    - info(): return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

So, for your example, try stepping through the result of response.info().headers for what you're looking for.

Note that the major caveat to using httplib.HTTPMessage is documented in Python issue 4773.
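In Python 3 the object returned by `info()` is an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so headers are read with `items()`, `get_all()` or dictionary-style lookup rather than a `.headers` list. The sketch below parses a hard-coded header block (values borrowed from the Wireshark capture quoted earlier in this thread) so it runs without a network connection; `parse_header_block` is a hypothetical helper name.

```python
import email.parser
import http.client

# Sample header block; the values come from the capture quoted in this thread.
RAW_HEADERS = (
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 127\r\n"
    "Connection: close\r\n"
    "\r\n"
)


def parse_header_block(raw):
    """Parse raw header text into an HTTPMessage (illustrative helper)."""
    return email.parser.Parser(_class=http.client.HTTPMessage).parsestr(raw)


message = parse_header_block(RAW_HEADERS)

# Step through every header, much as you would with response.info().
pairs = [(name.lower(), value) for name, value in message.items()]
for name, value in pairs:
    print("%s: %s" % (name, value))

# Lookups by name are case-insensitive.
content_length = message["content-length"]
charset = message.get_content_charset()
```

With a live response, `response.info()` (or `response.headers`) would stand in for `message`, and the rest of the code would stay the same.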

User · Answer

urllib2.urlopen does an HTTP GET (or POST if you supply a data argument), not an HTTP HEAD (if it did the latter, you couldn't do readlines or other accesses to the page body, of course).
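The method selection described here can be checked without sending anything over the network. The sketch below uses Python 3's `urllib.request.Request`, the successor to `urllib2.Request`, which also accepts an explicit `method` argument (available since Python 3.3). The URL is just the host and path from an earlier example, and the POST payload is a made-up placeholder.

```python
import urllib.request

URL = "http://www.google.com/index.html"  # example host and path; nothing is fetched

# No data argument: the request will be sent as GET.
get_request = urllib.request.Request(URL)

# A data argument (bytes in Python 3) turns it into POST. The payload is a placeholder.
post_request = urllib.request.Request(URL, data=b"q=headers")

# An explicit method overrides the default and asks for headers only.
head_request = urllib.request.Request(URL, method="HEAD")

methods = [r.get_method() for r in (get_request, post_request, head_request)]
print(methods)  # ['GET', 'POST', 'HEAD']
```

Passing `head_request` to `urllib.request.urlopen` would then send a HEAD request, and reading the response body would return nothing.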

User · Answer

One-liner:

    $ python -c "import urllib2; print urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1)).open(urllib2.Request('http://google.com'))"
