How can I get href links from HTML using Python?

Question

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()

print html

So far so good.

But I want only href links from the plain text HTML. How can I solve this problem?

User · Accepted Answer

Try with BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))

User · Answer

Using BS4 for this specific task seems overkill.

Try instead:

import re
import urllib2

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
# Capture href values that end in tgz or tar.gz
files = re.findall(r'href="(.*tgz|.*tar.gz)"', html)
print sorted(files)

I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.

I tested it only in my own scenario: extracting a list of files from a web folder that exposes its files and folders. I got a sorted list of the files and folders under the URL.
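One caveat with the pattern above: `.*` is greedy, so when several links share a line it can run from the first `href="` to the last matching quote. The following is a minimal Python 3 sketch of a tighter pattern, not the answerer's code. The sample HTML string and the name `archive_pattern` are assumptions made for illustration.

```python
import re

# Hypothetical directory-listing HTML, standing in for the real page.
sample_html = (
    '<a href="b.tgz">b.tgz</a> <a href="a.tar.gz">a.tar.gz</a> '
    '<a href="notes.txt">notes.txt</a>'
)

# [^"]* cannot cross a closing quote, so each match stays inside one href.
# The dots are escaped so they match a literal "." only.
archive_pattern = re.compile(r'href="([^"]*\.(?:tgz|tar\.gz))"')

files = sorted(archive_pattern.findall(sample_html))
print(files)
```

This still assumes double-quoted attributes; for arbitrary HTML a real parser is likely the safer choice.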

User · Answer

This is a late answer, but it works for users of recent Python versions:

from bs4 import BeautifulSoup
import requests 


html_page = requests.get('http://www.example.com').text

soup = BeautifulSoup(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))

Don't forget to install the "requests" and "BeautifulSoup" packages, and also "lxml". Use .text on the result of get; otherwise it will throw an exception.

Passing "lxml" suppresses the warning about which parser to use. You can also use "html.parser", whichever fits your case.

User · Answer

Simplest way for me:

from urlextract import URLExtract
import requests

url = "sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
text = text.replace(' ', '').replace('=', '')
extractor = URLExtract()
print(extractor.find_urls(text))

output:

['http://google.com/', 'http://yahoo.com/']

User · Answer

Look at using the Beautiful Soup HTML parsing library.

http://www.crummy.com/software/BeautifulSoup/

You will do something like this:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")

User · Answer

My answer probably sucks compared to the real gurus out there, but using some simple math, string slicing, find and urllib, this little script will create a list containing link elements. I tested it on Google and my output seems right. Hope it helps!

import urllib

test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
while sane == 0:
    curpos = test.find("href")
    if curpos >= 0:
        # Drop everything before the next "href"
        testlen = len(test)
        test = test[curpos:testlen]
        # Skip past the opening quote of the attribute value
        curpos = test.find('"')
        testlen = len(test)
        test = test[curpos+1:testlen]
        # The value runs up to the closing quote
        curpos = test.find('"')
        needle = test[0:curpos]
        # A tuple checks both prefixes; "http" or "www" would only test "http"
        if needle.startswith(("http", "www")):
            needlestack.append(needle)
    else:
        sane = 1
for item in needlestack:
    print item

User · Answer

Here's a lazy version of @stephen's answer:

import html.parser
import itertools
import urllib.request

class LinkParser(html.parser.HTMLParser):
    def reset(self):
        super().reset()
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (name, value) in attrs:
                if name == 'href':
                    self.links = itertools.chain(self.links, [value])


def gen_links(stream, parser):
    encoding = stream.headers.get_content_charset() or 'UTF-8'
    for line in stream:
        parser.feed(line.decode(encoding))
        yield from parser.links

Use it like so:

>>> parser = LinkParser()
>>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
>>> links = gen_links(stream, parser)
>>> next(links)
'//stackoverflow.com'

User · Answer

You can use the HTMLParser module.

The code would probably look something like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value


parser = MyHTMLParser()
parser.feed(your_html_string)

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
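For readers on Python 3, here is a minimal sketch of the same idea using the renamed `html.parser` module. It collects the href values in a list instead of printing them. The class name `HrefCollector` and the sample HTML are assumptions chosen for illustration, not part of the answer above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest.
        if tag != "a":
            return
        for name, value in attrs:
            # Attributes without a value arrive as None, so skip those.
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Inline sample so the sketch runs without network access.
sample_html = (
    '<p><a href="http://example.com/">one</a> '
    '<a name="x">no link</a> <a href="/two">two</a></p>'
)
collector = HrefCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

To use it on a real page, pass the decoded response body to `feed` in place of `sample_html`.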

User · Answer

This answer is similar to others with requests and BeautifulSoup, but using list comprehension.

Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.findAll("a") and combine it with a list comprehension:

import requests
from bs4 import BeautifulSoup

URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links (the "" default avoids a TypeError on anchors without href)
ext_links = [link.get("href") for link in soup("a") if "http" in link.get("href", "")]

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
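The substring test `"http" in href` also accepts values such as `/docs/http-basics`, so it is only a rough filter for external links. As a hedged alternative, the sketch below checks the URL scheme with the standard library instead. The helper name `is_absolute_http` and the sample href list are assumptions for illustration; in practice the list would come from `all_links`.

```python
from urllib.parse import urlsplit


def is_absolute_http(href):
    """Return True when href is an absolute http(s) URL (hypothetical helper)."""
    if not href:
        # Covers None (anchor without href) and empty strings.
        return False
    return urlsplit(href).scheme in ("http", "https")


# Sample values standing in for the hrefs collected from a page.
hrefs = [
    "http://example.com/",
    "/docs/http-basics",
    None,
    "https://example.org/x",
    "mailto:someone@example.com",
]
ext_links = [h for h in hrefs if is_absolute_http(h)]
print(ext_links)
```

Note that this treats every absolute http(s) URL as external, including links back to the same site; comparing hostnames would be a further step.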

User · Answer

Using requests with BeautifulSoup and Python 3:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.findAll('a'):
    print(link.get('href'))
