[python] Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.

Here comes the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?

This question is related to python regex

The answer is


Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.


There's tonnes of them on regexlib


Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.


You can use this.

<a[^>]+href=["'](.*?)["']

this regex can help you, you should get the first group by \1 or whatever method you have in your language.

href="([^"]*)

example:

<a href="http://www.amghezi.com">amgheziName</a>

result:

http://www.amghezi.com

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.


this regex can help you, you should get the first group by \1 or whatever method you have in your language.

href="([^"]*)

example:

<a href="http://www.amghezi.com">amgheziName</a>

result:

http://www.amghezi.com

There's tonnes of them on regexlib


this should work, although there might be more elegant ways.

import re
url='<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.


Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.


You can use this.

<a[^>]+href=["'](.*?)["']

This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/

(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)

Oputput:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en


If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.


Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.


If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.


Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.


Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.


There's tonnes of them on regexlib


If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.


this should work, although there might be more elegant ways.

import re
url='<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)

There's tonnes of them on regexlib


This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/

(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)

Oputput:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en


Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.


this should work, although there might be more elegant ways.

import re
url='<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)

Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.