Retrieve links from a web page using Python and BeautifulSoup

Question

How can I retrieve the links on a web page and copy the URL of each link using Python?

User · Answer

Here's an example using @ars's accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.

    import requests
    import wget
    import os

    from bs4 import BeautifulSoup, SoupStrainer

    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
    file_type = '.tar.gz'

    response = requests.get(url)

    # Parse only the <a> tags and download every link that contains the file type
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                full_path = url + link['href']
                wget.download(full_path)

User · Answer

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

    import httplib2
    from bs4 import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.

User · Answer

This script does what you're looking for, but also resolves the relative links to absolute links (Python 2).

    import urllib
    import lxml.html
    import urlparse

    def get_dom(url):
        connection = urllib.urlopen(url)
        return lxml.html.fromstring(connection.read())

    def get_links(url):
        return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

    def guess_root(links):
        # Use the first absolute link to guess the scheme and host of the site
        for link in links:
            if link.startswith('http'):
                parsed_link = urlparse.urlparse(link)
                scheme = parsed_link.scheme + '://'
                netloc = parsed_link.netloc
                return scheme + netloc

    def resolve_links(links):
        # Materialize first: guess_root would otherwise consume part of the generator
        links = list(links)
        root = guess_root(links)
        for link in links:
            if not link.startswith('http'):
                link = urlparse.urljoin(root, link)
            yield link

    for link in get_links('http://www.google.com'):
        print(link)

User · Answer

For completeness' sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:

    from bs4 import BeautifulSoup
    import urllib.request

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
    soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

    for link in soup.find_all('a', href=True):
        print(link['href'])

or the Python 2 version:

    from bs4 import BeautifulSoup
    import urllib2

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
    soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

    for link in soup.find_all('a', href=True):
        print(link['href'])

and a version using the requests library, which as written will work in both Python 2 and 3:

    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = requests.get("http://www.gpsbasecamp.com/national-parks")
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

    for link in soup.find_all('a', href=True):
        print(link['href'])

The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and conflict with the <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
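The precedence described above (an encoding declared inside the HTML wins over the HTTP header, and the header only counts when it actually carries a charset) can be sketched with the standard library alone. This is an illustration under assumptions, not BeautifulSoup's own logic: `charset_from_header` and `choose_encoding` are hypothetical helper names, and the HTML-declared value is assumed to have been found already, for example by `EncodingDetector.find_declared_encoding`.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None when
    # the header has no charset parameter.
    return msg.get_content_charset()


def choose_encoding(html_declared, content_type):
    """Prefer the encoding declared in the HTML over the one in the HTTP header."""
    return html_declared or charset_from_header(content_type)


print(choose_encoding(None, "text/html; charset=ISO-8859-1"))
print(choose_encoding("utf-8", "text/html; charset=ISO-8859-1"))
print(choose_encoding(None, "text/html"))
```

When both sources are empty the function returns None, which leaves the detection entirely to the parser.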

User · Answer

    import urllib2
    import BeautifulSoup

    request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
    response = urllib2.urlopen(request)
    soup = BeautifulSoup.BeautifulSoup(response)
    for a in soup.findAll('a'):
        # .get() avoids a KeyError on <a> tags that have no href attribute
        if 'national-park' in a.get('href', ''):
            print('found a url with national-park in the link')

User · Answer

The following code retrieves all the links available in a web page using urllib2 and BeautifulSoup4:

    import urllib2
    from bs4 import BeautifulSoup

    url = urllib2.urlopen("http://www.espncricinfo.com/").read()
    soup = BeautifulSoup(url, 'html.parser')

    # line.get('href') returns None for <a> tags without an href
    for line in soup.find_all('a'):
        print(line.get('href'))

User · Answer

Just for getting the links, without BeautifulSoup and regex:

    import urllib2
    url = "http://www.somewhere.com"
    page = urllib2.urlopen(url)
    data = page.read().split("</a>")
    tag = "<a href=\""
    endtag = "\">"
    for item in data:
        if "<a href" in item:
            try:
                ind = item.index(tag)
                item = item[ind+len(tag):]
                end = item.index(endtag)
            except: pass
            else:
                print(item[:end])

For more complex operations, of course, BeautifulSoup is still preferred.
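If the goal is simply to avoid third-party packages, Python 3's standard library also ships an HTML parser that copes with attribute order, quoting style, and upper-case tags, which plain string splitting does not. The following is a minimal sketch rather than a drop-in replacement for the snippet above; `LinkCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


sample = (
    "<p><a class=\"x\" href='/about'>About</a> "
    "<a name=\"top\">Top</a> "
    "<A HREF=\"http://example.com/\">Home</A></p>"
)
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

To use it on a real page, feed it the decoded text of the response instead of `sample`.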

User · Answer

I found the answer by @Blairg23 working after the following correction (covering the scenario where it failed to work correctly):

    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                full_path = urlparse.urljoin(url, link['href'])  # module urlparse needs to be imported
                wget.download(full_path)

For Python 3, urllib.parse.urljoin has to be used instead in order to obtain the full URL.
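To see why joining matters, here is a small Python 3 comparison of plain concatenation against `urllib.parse.urljoin`. The base URL and the href values are invented for the illustration; only the first href happens to work with concatenation.

```python
from urllib.parse import urljoin

base = "https://example.com/data/files/"  # hypothetical page URL
hrefs = [
    "a.tar.gz",                             # relative to the page
    "/pub/b.tar.gz",                        # relative to the site root
    "../c.tar.gz",                          # one directory up
    "https://mirror.example.org/d.tar.gz",  # already absolute
]

concatenated = [base + href for href in hrefs]
joined = [urljoin(base, href) for href in hrefs]

for href, bad, good in zip(hrefs, concatenated, joined):
    print(href, "| concatenated:", bad, "| joined:", good)
```

Concatenation yields broken addresses such as `https://example.com/data/files//pub/b.tar.gz`, while `urljoin` resolves each href the way a browser would.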

User · Answer

To find all the links, this example uses the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list, with each item representing one match.

    import urllib2
    import re

    # connect to a URL (url must hold the address of the page)
    website = urllib2.urlopen(url)

    # read html code
    html = website.read()

    # use re.findall to get all the links
    links = re.findall('"((http|ftp)s?://.*?)"', html)

    print(links)
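One detail worth knowing: when a pattern contains more than one capturing group, re.findall() returns a list of tuples (one entry per group) rather than a list of strings. The sketch below, run on a made-up HTML string in Python 3, contrasts re.search(), a two-group pattern, and a variant that uses a non-capturing group `(?:...)` so that plain strings come back.

```python
import re

html = '<a href="http://a.example/x">x</a> <a href="https://b.example/y">y</a>'

two_groups = r'"((http|ftp)s?://.*?)"'
one_group = r'"((?:http|ftp)s?://.*?)"'

# re.search stops at the first match
first = re.search(one_group, html).group(1)

# Two capturing groups: each match becomes a (url, scheme) tuple
as_tuples = re.findall(two_groups, html)

# One capturing group: each match is just the URL string
as_strings = re.findall(one_group, html)

print(first)
print(as_tuples)
print(as_strings)
```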

User · Answer

Why not use regular expressions:

    import urllib2
    import re
    url = "http://www.somewhere.com"
    page = urllib2.urlopen(url)
    page = page.read()
    # Each match is a tuple: (href value, inner HTML of the link)
    links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
    for link in links:
        print('href: %s, HTML text: %s' % (link[0], link[1]))

User · Answer

BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).

    import lxml.html

    doc = lxml.html.parse(url)

    links = doc.xpath('//a[@href]')

    for link in links:
        print(link.attrib['href'])

The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths though, but so far I didn't have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.

NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't do redirects, so for this reason the version below is using urllib2 + lxml.

    #!/usr/bin/env python
    import sys
    import urllib2
    import urlparse
    import lxml.html
    import fnmatch

    try:
        import urltools as urltools
    except ImportError:
        sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
        urltools = None


    def get_host(url):
        p = urlparse.urlparse(url)
        return "{}://{}".format(p.scheme, p.netloc)


    if __name__ == '__main__':
        url = sys.argv[1]
        host = get_host(url)
        # Optional second argument: glob pattern, defaults to matching everything
        glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

        doc = lxml.html.parse(urllib2.urlopen(url))
        links = doc.xpath('//a[@href]')

        for link in links:
            href = link.attrib['href']

            if fnmatch.fnmatch(href, glob_patt):

                if not href.startswith(('http://', 'https://', 'ftp://')):

                    if href.startswith('/'):
                        href = host + href
                    else:
                        parent_url = url.rsplit('/', 1)[0]
                        href = urlparse.urljoin(parent_url, href)

                        if urltools:
                            href = urltools.normalize(href)

                print(href)

The usage is as follows:

    getlinks.py http://stackoverflow.com/a/37758066/191246
    getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
    getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
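For Python 3, the filtering and resolving steps can be sketched with `fnmatch` and `urllib.parse.urljoin` alone; `urljoin` also takes care of `./` and `../` segments. This is a simplified illustration, not a port of the script above: `select_links` is a hypothetical helper, the hrefs are invented, and fetching and parsing the page is left out.

```python
import fnmatch
from urllib.parse import urljoin


def select_links(page_url, hrefs, glob_patt="*"):
    """Keep the hrefs that match the glob pattern and return them as absolute URLs."""
    return [
        urljoin(page_url, href)
        for href in hrefs
        if fnmatch.fnmatch(href, glob_patt)
    ]


hrefs = ["song.mp3", "/media/intro.mp3", "../old/track.mp3", "index.html"]
mp3_links = select_links("http://fakedomain.mu/music/somepage.html", hrefs, "*.mp3")
for link in mp3_links:
    print(link)
```

Passing the page URL itself as the base is enough here, because `urljoin` drops the last path segment before resolving a relative href.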

User · Answer

Links can be within a variety of attributes, so you could pass a list of those attributes to select. For example, with the src and href attributes (here I am using the starts-with ^ operator to specify that either of these attributes' values starts with http). You can tailor this as required.

    from bs4 import BeautifulSoup as bs
    import requests
    r = requests.get('https://stackoverflow.com/')
    soup = bs(r.content, 'lxml')
    links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]')]
    print(links)

Attribute = value selectors

[attr^=value]

Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.

User · Answer

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and xpath would look like this (Python 2):

    import urllib
    import lxml.html
    connection = urllib.urlopen('http://www.nytimes.com')

    dom = lxml.html.fromstring(connection.read())

    for link in dom.xpath('//a/@href'):  # select the url in href for all a tags (links)
        print(link)

User · Answer

    import urllib2
    from bs4 import BeautifulSoup
    a = urllib2.urlopen('http://dir.yahoo.com')
    code = a.read()
    soup = BeautifulSoup(code, 'html.parser')
    links = soup.findAll("a")
    # To get the href part alone
    print(links[0].attrs['href'])

User · Answer

There can be many duplicate links, together with both external and internal links. To differentiate between the two and get only unique links, use sets:

    # Python 3.
    import urllib.parse
    import urllib.request
    from bs4 import BeautifulSoup

    url = "http://www.espncricinfo.com/"
    resp = urllib.request.urlopen(url)
    # Get server encoding per recommendation of Martijn Pieters.
    soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
    external_links = set()
    internal_links = set()
    for line in soup.find_all('a'):
        link = line.get('href')
        if not link:
            continue
        if link.startswith('http'):
            external_links.add(link)
        else:
            internal_links.add(link)

    # Depending on usage, full internal links may be preferred.
    full_internal_links = {
        urllib.parse.urljoin(url, internal_link)
        for internal_link in internal_links
    }

    # Print all unique external and full internal links.
    for link in external_links.union(full_internal_links):
        print(link)
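Note that testing for an `http` prefix treats an absolute link back to the same site as external. A possible refinement, sketched here under the assumption that "internal" means "same host as the page", is to resolve every href first and then compare host names. `split_links` is a hypothetical helper and the hrefs are sample data.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Return (internal, external) sets of absolute URLs, judged by host name."""
    page_host = urlparse(page_url).netloc
    internal, external = set(), set()
    for href in hrefs:
        if not href:
            continue  # skip missing or empty href values
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == page_host:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external


hrefs = [
    "/scores",
    "scores",
    "http://www.espncricinfo.com/scores",
    "https://example.org/",
    None,
    "",
]
internal, external = split_links("http://www.espncricinfo.com/", hrefs)
print(sorted(internal))
print(sorted(external))
```

Because the results are sets of resolved URLs, the three different spellings of the scores page collapse into a single entry.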

User · Answer

Under the hood, BeautifulSoup can now use lxml as its parser. Requests, lxml and list comprehensions make a killer combo.

    import requests
    import lxml.html

    dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

    [x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

In the list comprehension, the "if '//' in x and 'url.com' not in x" condition is a simple way to scrub the URL list of the site's "internal" navigation URLs, etc.
