Regular expression to extract URL from an HTML link

Question

I'm a newbie in Python. I'm learning regexes, but I need help here.

Here comes the HTML source:

    <a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I'm trying to code a tool that only prints out http://ptop.se. Can you help me please?

User · Answer

You can use this:

    <a[^>]+href=["'](.*?)["']

User · Answer

There's tonnes of them on regexlib.

User · Answer

This should work, although there might be more elegant ways:

    import re
    url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
    r = re.compile('(?<=href=").*?(?=")')
    r.findall(url)
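For anyone wondering what the lookbehind and lookahead in this answer are doing, here is a small annotated sketch. It assumes Python 3 and a double-quoted href attribute; the variable names are only for illustration. The two lookarounds check that `href="` sits right before the match and `"` right after it, without making either part of the result.

```python
import re

# Sample HTML taken from the question.
html = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'

# (?<=href=") : lookbehind, requires href=" immediately before the match
# .*?         : as few characters as possible
# (?=")       : lookahead, requires a double quote right after the match
pattern = re.compile(r'(?<=href=").*?(?=")')

urls = pattern.findall(html)
print(urls)
```

Because the quotes are hard-coded, this variant would not pick up single-quoted or unquoted attribute values.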

User · Answer

If you're only looking for one:

    import re
    match = re.search(r'href=[\'"]?([^\'" >]+)', s)
    if match:
        print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

    import re
    urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
    print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_to_parse, 'html.parser')
    for tag in soup.find_all('a', href=True):
        print(tag['href'])

Once you've installed BeautifulSoup, anyway.
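As a quick check of the "possibly quoted" idea, the sketch below runs the pattern against the three quoting styles it is meant to tolerate. The first sample is the link from the question; the other two URLs and the names `HREF_RE`, `samples` and `found` are made up for this illustration.

```python
import re

# href=, an optional quote, then everything up to a quote, space or '>'.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

samples = [
    '<a href="http://www.ptop.se" target="_blank">ptop</a>',  # double quotes
    "<a href='http://www.ptop.se/a'>single</a>",               # single quotes
    '<a href=http://www.ptop.se/b>unquoted</a>',               # no quotes
]

# group(1) is the part inside the parentheses, i.e. the URL itself.
found = [HREF_RE.search(sample).group(1) for sample in samples]
for url in found:
    print(url)
```

All three samples yield the bare URL, because the character class stops at whichever terminator shows up first.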

User · Answer

This works pretty well using optional matches (it prints what comes after href=) and gets the link only. Tested on http://pythex.org/:

    (?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)

Output:

    Match 1. /wiki/Main_Page
    Match 2. /wiki/Portal:Contents
    Match 3. /wiki/Portal:Featured_content
    Match 4. /wiki/Portal:Current_events
    Match 5. /wiki/Special:Random
    Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

User · Answer

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you're not really trying to parse the HTML), this might be more lightweight than an HTML parser.
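To illustrate the "just grab anything that looks like a URL" approach, here is a rough sketch. Note that the pattern is a deliberately simple stand-in written for this example, not Gruber's regex, and it only recognises http and https URLs.

```python
import re

# Simplified stand-in (NOT Gruber's pattern): http:// or https:// followed
# by anything that is not whitespace, a quote, or an angle bracket.
SIMPLE_URL_RE = re.compile(r'https?://[^\s"\'<>]+')

html = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'

urls = SIMPLE_URL_RE.findall(html)   # finds the href value and the link text
unique_urls = sorted(set(urls))      # collapse the duplicate
print(unique_urls)
```

Since this ignores the markup entirely, it finds the URL twice in the question's sample (once in the attribute, once in the link text), hence the de-duplication step.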

User · Answer

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
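Of the parsers named here, HTMLParser ships with the standard library, so it needs no install. A minimal sketch for Python 3 (where it lives in `html.parser`) might look like the following; the class name `LinkCollector` is made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
```

The parser deals with quoting, attribute order and case for you, which is the part a hand-written regex tends to get wrong.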

User · Answer

Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.
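A few of the cases a simple pattern misses are easy to reproduce. The sketch below uses a plain href pattern of the kind suggested elsewhere in this thread on a made-up snippet (the `style.css` and `old.example` values are invented for the illustration) and shows it reporting things that are not live links at all.

```python
import re

# A simple href pattern: href=, optional quote, then up to a terminator.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

html_doc = (
    '<link rel="stylesheet" href="style.css">'                        # not an <a> tag
    '<!-- <a href="http://old.example/">old</a> -->'                  # commented out
    '<a title="see href=nothing" href="http://www.ptop.se">ptop</a>'  # href= inside text
)

matches = HREF_RE.findall(html_doc)
for value in matches:
    print(value)
```

Only the last of the four values is a real link on the page. The stylesheet, the commented-out anchor and the text inside the title attribute are all false positives.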

User · Answer

Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
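The "spawn it out" route could look roughly like this. It is only a sketch under a few assumptions: lynx is installed and on PATH, it is invoked with its `-dump -listonly` options, and its output is a numbered reference list with lines such as `   1. http://www.ptop.se/`. The function names are invented for the example.

```python
import re
import shutil
import subprocess


def parse_lynx_listing(text):
    """Pull URLs out of a numbered reference list (assumed lynx output format)."""
    # Matches lines that look like "   1. http://www.ptop.se/"
    return re.findall(r'^[ \t]*\d+\.[ \t]+(\S+)', text, flags=re.MULTILINE)


def links_via_lynx(url):
    """Run lynx on `url` and return the links it lists (requires lynx)."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(
        ["lynx", "-dump", "-listonly", url],
        capture_output=True, text=True, check=True,
    )
    return parse_lynx_listing(result.stdout)


# Parsing step on its own, using a hand-written sample of the assumed format.
sample_output = "References\n\n   1. http://www.ptop.se/\n   2. http://www.ptop.se/about\n"
print(parse_lynx_listing(sample_output))
```

Calling `links_via_lynx("http://www.ptop.se")` would fetch the page over the network, so only the parsing half is exercised above.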

User · Answer

This regex can help you; you should get the first group by \1 or whatever method you have in your language:

    href="([^"]*)

Example:

    <a href="http://www.amghezi.com">amgheziName</a>

Result:

    http://www.amghezi.com
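In Python, "the first group" is read with `match.group(1)`. A small sketch, assuming the pattern is `href="([^"]*)` and using the example link from this answer:

```python
import re

html = '<a href="http://www.amghezi.com">amgheziName</a>'

match = re.search(r'href="([^"]*)', html)
if match:
    whole = match.group(0)  # everything the pattern matched, including href="
    url = match.group(1)    # only the part inside the parentheses
    print(whole)
    print(url)
```

`group(0)` still carries the `href="` prefix, which is why the capture group is the one you want here.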
