[python] How can I get href links from HTML using Python?

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = getwebsite.read()

print html

So far so good.

But I want only href links from the plain text HTML. How can I solve this problem?

This question is related to python html hyperlink beautifulsoup href

The answer is


Try with Beautifulsoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))

Using BS4 for this specific task seems overkill.

Try instead:

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(x for x in (files))

I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and works for me quite well.

I tested it only on my scenario of extracting a list of files from a web folder that exposes the files\folder in it, e.g.:

enter image description here

and I got a sorted list of the files\folders under the URL


This is way late to answer but it will work for latest python users:

from bs4 import BeautifulSoup
import requests 


html_page = requests.get('http://www.example.com').text

soup = BeautifulSoup(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))

Don't forget to install "requests" and "BeautifulSoup" package and also "lxml". Use .text along with get otherwise it will throw an exception.

"lxml" is used to remove that warning of which parser to be used. You can also use "html.parser" whichever fits your case.


Simplest way for me:

from urlextract import URLExtract
from requests import get

url = "sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
text = text.replace(' ', '').replace('=','')
extractor = URLExtract()
print(extractor.find_urls(text))

output:

['http://google.com/', 'http://yahoo.com/']


Look at using the beautiful soup html parsing library.

http://www.crummy.com/software/BeautifulSoup/

You will do something like this:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")

My answer probably sucks compared to the real gurus out there, but using some simple math, string slicing, find and urllib, this little script will create a list containing link elements. I test google and my output seems right. Hope it helps!

import urllib
test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
while sane == 0:
  curpos = test.find("href")
  if curpos >= 0:
    testlen = len(test)
    test = test[curpos:testlen]
    curpos = test.find('"')
    testlen = len(test)
    test = test[curpos+1:testlen]
    curpos = test.find('"')
    needle = test[0:curpos]
    if needle.startswith("http" or "www"):
        needlestack.append(needle)
  else:
    sane = 1
for item in needlestack:
  print item

Here's a lazy version of @stephen's answer

import html.parser
import itertools
import urllib.request

class LinkParser(html.parser.HTMLParser):
    def reset(self):
        super().reset()
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (name, value) in attrs:
                if name == 'href':
                    self.links = itertools.chain(self.links, [value])


def gen_links(stream, parser):
    encoding = stream.headers.get_content_charset() or 'UTF-8'
    for line in stream:
        parser.feed(line.decode(encoding))
        yield from parser.links

Use it like so:

>>> parser = LinkParser()
>>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
>>> links = gen_links(stream, parser)
>>> next(links)
'//stackoverflow.com'

You can use the HTMLParser module.

The code would probably look something like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
           # Check the list of defined attributes.
           for name, value in attrs:
               # If href is defined, print it.
               if name == "href":
                   print name, "=", value


parser = MyHTMLParser()
parser.feed(your_html_string)

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.


This answer is similar to others with requests and BeautifulSoup, but using list comprehension.

Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut of soup.findAll("a") and using list comprehension:

import requests
from bs4 import BeautifulSoup

URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links
ext_links = [link.get("href") for link in soup("a") if "http" in link.get("href")]

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all


Using requests with BeautifulSoup and Python 3:

import requests 
from bs4 import BeautifulSoup


page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.findAll('a'):
    print(link.get('href'))

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment Wrapping a react-router Link in an html button How to make a hyperlink in telegram without using bots? React onClick and preventDefault() link refresh/redirect? How to put a link on a button with bootstrap? How link to any local file with markdown syntax? link with target="_blank" does not open in new tab in Chrome How to create a link to another PHP page How to determine the current language of a wordpress page when using polylang? How to change link color (Bootstrap) How can I make a clickable link in an NSAttributedString?

Examples related to beautifulsoup

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org What should I use to open a url instead of urlopen in urllib3 TypeError: a bytes-like object is required, not 'str' in python and CSV UnicodeEncodeError: 'ascii' codec can't encode character at special name UnicodeEncodeError: 'charmap' codec can't encode characters bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? Using BeautifulSoup to extract text without tags python BeautifulSoup parsing table install beautiful soup using pip Python BeautifulSoup extract text between element

Examples related to href

File URL "Not allowed to load local resource" in the Internet Browser Angular 2 router no base href set How to redirect to a route in laravel 5 by using href tag if I'm not using blade or any template? Html- how to disable <a href>? Add php variable inside echo statement as href link address? ASP MVC href to a controller/view href="tel:" and mobile numbers How can I disable the bootstrap hover color for links? How can I disable HREF if onclick is executed? Open link in new tab or window