Regex to extract URLs from href attribute in HTML with Python

Question

Possible Duplicate: What is the best regular expression to check if a string is a valid URL?

Considering a string as follows:

    string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:

    >>> url = getURLs(string)
    >>> url
    ['http://example.com', 'http://example2.com']

Thanks!

User · Accepted Answer

    import re

    url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

    # Raw string so the backslashes reach the regex engine unchanged.
    urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

    >>> print(urls)
    ['http://example.com', 'http://example2.com']
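The pattern above matches a scheme followed by host-like characters anywhere in the text, so it stops at the first "/" and is not tied to the href attribute at all. As a minimal sketch of a pattern anchored on the attribute itself, assuming every href value is quoted and the markup is well formed, something like the following could be used. The names HREF_RE and get_urls are hypothetical, chosen here for illustration.

```python
import re

# Hypothetical pattern: an <a ...> tag, then href = "..." or href = '...'.
# Assumes quoted attribute values; unquoted hrefs are not matched.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1',
                     re.IGNORECASE | re.DOTALL)

def get_urls(html):
    """Return the href values of <a> tags found by the pattern above."""
    return [m.group(2) for m in HREF_RE.finditer(html)]

html = ('<p>Hello World</p>'
        '<a href="http://example.com/path?x=1">More Examples</a>'
        '<a href="http://example2.com">Even More Examples</a>')

# Attribute-anchored pattern keeps the path and query string.
print(get_urls(html))

# The scheme-and-host pattern cuts the first URL off at "/".
print(re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html))
```

Even this version remains fragile: commented-out markup, unquoted attributes, or a ">" inside an attribute value would all confuse it.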

User · Answer

The best answer is: don't use a regex.

The expression in the accepted answer misses many cases. Among other things, URLs can have Unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten thousand characters long.

Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the urls, inside the anchor tag's href". Why use a ten-thousand-character-long regex when you can do something much simpler?

Parse the HTML instead. For many tasks, Beautiful Soup will be far faster and easier to use (in the examples below, s holds the HTML string from the question):

    >>> from bs4 import BeautifulSoup as Soup
    >>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
    >>> [a['href'] for a in html.find_all('a')]
    ['http://example.com', 'http://example2.com']

If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:

    from html.parser import HTMLParser

    class MyParser(HTMLParser):
        def __init__(self, output_list=None):
            HTMLParser.__init__(self)
            if output_list is None:
                self.output_list = []
            else:
                self.output_list = output_list

        def handle_starttag(self, tag, attrs):
            # Called once per opening tag; attrs is a list of (name, value) pairs.
            if tag == 'a':
                self.output_list.append(dict(attrs).get('href'))

Test:

    >>> p = MyParser()
    >>> p.feed(s)
    >>> p.output_list
    ['http://example.com', 'http://example2.com']

You could even create a new method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way than regular expressions to extract information from HTML.
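The answer's last suggestion, a method that accepts a string, calls feed, and returns output_list, might look roughly like this. It is only a sketch: the method name get_urls is hypothetical, and the class otherwise follows the HTMLParser subclass described in the answer.

```python
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        super().__init__()
        self.output_list = [] if output_list is None else output_list

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag (None if the tag has no href).
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

    def get_urls(self, html):
        """Hypothetical convenience method: feed the HTML, return the list."""
        self.feed(html)
        return self.output_list

s = ('<p>Hello World</p>'
     '<a href="http://example.com">More Examples</a>'
     '<a href="http://example2.com">Even More Examples</a>')

print(MyParser().get_urls(s))
```

Because output_list accumulates across calls, a fresh MyParser instance per document keeps the results separate. An anchor without an href contributes None, so callers may want to filter those entries out.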
