Regular expression for extracting tag attributes

Question

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:

    (?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+

which works for strings like

    <a href="test.html" class="xyz">

and (single quotes)

    <a href='test.html' class="xyz">

but not for a string without quotes:

    <a href=test.html class=xyz>

How can I modify my regex to make it work with attributes without quotes? Or is there a better way to do that?

Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me, and there is no time or money to rewrite this stuff from the bottom up.

User · Accepted Answer

Update (2020): Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (note: regex101.com did not exist when I originally wrote this answer):

    (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?

Applied to:

    <a href=test.html class=xyz>
    <a href="test.html" class="xyz">
    <a href='test.html' class="xyz">
    <script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
    <img src="test.png">
    <img src="a test.png">
    <img src=test.png />
    <img src=a test.png />
    <img src=test.png >
    <img src=a test.png >
    <img src=test.png alt=crap >
    <img src=a test.png alt=crap >

Original answer (2008): If you have an element like

    <name attribute=value attribute="value" attribute='value'>

this regex can be used to find each attribute name and value in turn:

    (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Applied to:

    <a href=test.html class=xyz>
    <a href="test.html" class="xyz">
    <a href='test.html' class="xyz">

it would yield:

    'href' => 'test.html'
    'class' => 'xyz'

Note: This does not work with numeric attribute values, e.g. <div id="1"> won't work.

Edited: Improved regex for getting attributes with no value and values with " ' " inside:

    ([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?

Applied to:

    <script type="text/javascript" defer async id="something" onload="alert('hello');"></script>

it would yield:

    'type' => 'text/javascript'
    'defer' => ''
    'async' => ''
    'id' => 'something'
    'onload' => 'alert(\'hello\');'
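To try the 2008 pattern outside regex101, here is a minimal Python sketch. The function name extract_attributes is hypothetical, and the pattern is copied unchanged, so it inherits the limitations mentioned in the answer. In this sketch a one-character quoted value comes back with its opening quote attached, which seems to be what the note about <div id="1"> is describing.

```python
import re

# The 2008 pattern from the answer, unchanged:
# name, optional opening quote, value, optional closing quote.
ATTRIBUTE = re.compile(r"""(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?""")


def extract_attributes(tag):
    """Return (name, value) pairs found in a single tag string."""
    return ATTRIBUTE.findall(tag)


if __name__ == "__main__":
    samples = (
        "<a href=test.html class=xyz>",
        '<a href="test.html" class="xyz">',
        "<a href='test.html' class=\"xyz\">",
        '<div id="1">',  # one-character value: the quote leaks into the result
    )
    for sample in samples:
        print(sample, "->", extract_attributes(sample))
```

Calling findall on the whole tag works here because the tag name never contains an equals sign; running it over a full document would also pick up name=value text outside of tags.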

User · Answer

I'd reconsider the strategy of using only a single regular expression. Sure, it's a nice game to come up with one single regular expression that does it all. But in terms of maintainability, you are about to shoot yourself in both feet.
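One way to read this advice is to split the work into two small patterns: one that isolates each opening anchor tag, and one that walks the attributes inside it. The sketch below is an illustration in Python, not the answerer's code; the names OPEN_A_TAG, ATTRIBUTE and anchor_attributes are made up for the example, and it assumes no attribute value contains a literal ">".

```python
import re

# Step 1: isolate the inside of each opening <a ...> tag.
# Assumption: attribute values do not contain a literal ">".
OPEN_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)

# Step 2: one attribute at a time: double-quoted, single-quoted, or bare value.
ATTRIBUTE = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'<>]+))""")


def anchor_attributes(html):
    """Return one dict of attributes per opening anchor tag found in html."""
    results = []
    for tag in OPEN_A_TAG.finditer(html):
        attrs = {}
        for match in ATTRIBUTE.finditer(tag.group(1)):
            name = match.group(1)
            # Exactly one of the three value groups took part in the match.
            value = next(g for g in match.groups()[1:] if g is not None)
            attrs[name] = value
        results.append(attrs)
    return results


if __name__ == "__main__":
    print(anchor_attributes("<a href=test.html class=xyz>"))
    print(anchor_attributes("<a href='test.html' class=\"xyz\">"))
```

Each pattern stays short enough to read and test on its own, which is the maintainability point being made.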

User · Answer

If you're in .NET, I recommend the HTML Agility Pack; it is very robust even with malformed HTML. Then you can use XPath.

User · Answer

This is my best regex to extract properties in an HTML tag.

Trim the match inside of the quotes (single or double):

    (\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2

Without trim:

    (\S+)\s*=\s*([']|["])([\W\w]*?)\2

Pros:

- You are able to trim the content inside of the quotes.
- It matches all the special ASCII characters inside of the quotes.
- If you have title="You're mine", the regex does not break.

Cons:

- It returns 3 groups: first the property, then the quote ("|'), and at the end the value inside of the quotes. For <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.

This is the online regex example: https://regex101.com/r/aVz4uG/13

I normally use this regex to extract the HTML tags. I recommend it if you don't need to restrict the match to a tag type like <div, <span, etc.:

    <[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

For example:

    <div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
    <span style="color: >=<red">Nothing</span>

Returns:

    <div title="a>b=c<d" data-type='a>b=c<d'>
    <span style="color: >=<red">

This is the online regex example: https://regex101.com/r/aVz4uG/15

There is a bug when the regex is restricted to one tag name:

    <div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

For this tag:

    <article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>

it returns <div '>, but it should not return any match.

To "solve" this, remove the [^/]+? pattern:

    <div(?:\".*?\"|'.*?'|.*?)*?>

The answer #317081 is good, but it does not match properly in these cases:

    <div id="a">        It returns "a instead of a
    <div style="">      It doesn't match, instead of returning an empty property
    <div title = "c">   It does not recognize the spaces around the equals sign (=)

This is the improvement:

    (\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?

vs

    (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Two things changed. Spaces around the equals sign are now allowed:

    (\S+)\s*=\s*((?:...

and the final + and . are replaced by ? and [^"']*.

This is the online regex example: https://regex101.com/r/aVz4uG/8

User · Answer

I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
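The second half of that suggestion might look like the Python sketch below. It assumes the HTML has already been converted to well-formed XHTML by a separate Tidy run (not shown) and that the result declares the standard XHTML namespace; the function name anchor_attributes is hypothetical. It uses the limited XPath subset supported by the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET

# Assumption: the tidied document declares the usual XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def anchor_attributes(xhtml):
    """Return one attribute dict per <a> element in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    # ".//x:a" selects every anchor element at any depth below the root.
    return [dict(a.attrib) for a in root.findall(".//x:a", XHTML_NS)]


if __name__ == "__main__":
    document = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<p><a href="test.html" class="xyz">link</a></p>'
        "</body></html>"
    )
    print(anchor_attributes(document))
```

If the converted document has no namespace declaration, the plain path ".//a" without the prefix would be the one to use instead.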

User · Answer

You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.

So either don't use named captures:

    (?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+

Or don't use the quantifier on this expression:

    (?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)

This also allows attribute values like bar=' baz='quux:

    foo="bar=' baz='quux"

The drawback is that you have to strip the leading and trailing quotes afterwards.
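The second variant, with the quote stripping done afterwards, can be sketched in Python as follows. Python's re module spells named groups (?P<name>...) and, in line with the first sentence of this answer, refuses a pattern that defines the same group name twice. The helper names strip_quotes, attribute_pairs and duplicate_names_allowed are invented for this example.

```python
import re

# One name/value pair per match, no outer quantifier; applied repeatedly via finditer.
PAIR = re.compile(r"""(?P<name>\b\w+\b)\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^"'<>\s]+)""")


def strip_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


def attribute_pairs(tag):
    """Return (name, value) tuples with surrounding quotes stripped."""
    return [(m["name"], strip_quotes(m["value"])) for m in PAIR.finditer(tag)]


def duplicate_names_allowed():
    """Check whether this regex engine accepts a group name defined twice."""
    try:
        re.compile(r"(?P<value>a)|(?P<value>b)")
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print(attribute_pairs("<a href=test.html class=xyz>"))
    print(attribute_pairs("foo=\"bar=' baz='quux\""))
    print(duplicate_names_allowed())
```

Iterating with finditer plays the role of the dropped quantifier: each match yields exactly one name and one value, so nothing is overwritten.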

User · Answer

Token Mantra response: you should not tweak/modify/harvest or otherwise produce HTML/XML using regular expressions. There are too many corner-case conditionals, such as \' and \", which must be accounted for. You are much better off using a proper DOM parser, XML parser, or one of the many other tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as it's recognized, tested, and you use one.

    my $foo = Someclass->parse( $xmlstring );
    my @links = $foo->getChildrenByTagName("a");
    my @srcs = map { $_->getAttribute("src") } @links;
    # @srcs now contains an array of src attributes extracted from the page

User · Answer

Something like this might be helpful:

    '(\S+)\s*?=\s*([\'"])(.*?|)\2

User · Answer

I have created a PHP function that can extract the attributes of any HTML tag. It also handles attributes like disabled that have no value, and it can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:

    // Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php>
    function extract_html_attributes($input) {
        if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
        $matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
        $results = array(
            'element' => $matches[2],
            'attributes' => null,
            'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
        );
        if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
            $results['attributes'] = array();
            foreach($attrs[1] as $i => $attr) {
                $results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
            }
        }
        return $results;
    }

Test code:

    $test = array(
        '<div class="foo" id="bar" data-test="1000">',
        '<div>',
        '<div class="foo" id="bar" data-test="1000">test content</div>',
        '<div>test content</div>',
        '<div>test content</span>',
        '<div>test content',
        '<div></div>',
        '<div class="foo" id="bar" data-test="1000"/>',
        '<div class="foo" id="bar" data-test="1000" />',
        '< div  class="foo"     id="bar"   data-test="1000"       />',
        '<div class id data-test>',
        '<id="foo" data-test="1000">',
        '<id data-test>',
        '<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
    );

    foreach($test as $t) {
        var_dump($t, extract_html_attributes($t));
        echo '<hr>';
    }

User · Answer

Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.
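As one illustration of the library route, Python's standard library ships a tolerant HTML parser. The sketch below is a minimal example, not a recommendation of one library over another; the class name AnchorAttributes and the function anchor_attributes are hypothetical. It collects the attributes of every anchor tag, whether the values are quoted or not.

```python
from html.parser import HTMLParser


class AnchorAttributes(HTMLParser):
    """Collect the attributes of every <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names and strips quotes from values.
        if tag == "a":
            self.anchors.append(dict(attrs))


def anchor_attributes(html):
    parser = AnchorAttributes()
    parser.feed(html)
    parser.close()
    return parser.anchors


if __name__ == "__main__":
    page = "<p><a href=test.html class=xyz>x</a> <a href=\"b.html\" title='t'>y</a></p>"
    print(anchor_attributes(page))
```

The quoting rules live inside the parser, so the calling code never has to distinguish between the unquoted, single-quoted and double-quoted forms.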

User · Answer

Tags and attributes in HTML have the form:

    <tag
       attrnovalue
       attrnoquote=bli
       attrdoublequote="blah 'blah'"
       attrsinglequote='bloob "bloob"' >

To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that matches are only reported within HTML tags. Assuming you have the correct regex, the total regex would be:

    attr(?=(attr)*\s*/?\s*>)

The lookahead ensures that only other attributes and the closing of the tag follow the attribute. I use the following regular expression for attr:

    \s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?

Unimportant groups are made non-capturing. The first matching group $1 gives you the name of the attribute; the value is one of $2, $3 or $4. I use $2$3$4 to extract the value. The final regex is:

    \s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)

Note: I removed all unnecessary groups in the lookahead and made all remaining groups non-capturing.
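The final regex can be exercised with a short Python sketch like the one below. The pattern is the one given in this answer, split over two string literals for readability; the function name attributes is hypothetical, and joining the three value groups mirrors the $2$3$4 trick, since at most one of them takes part in any match.

```python
import re

# The answer's final regex: one attribute, followed by a lookahead that only
# accepts further attributes and then the end of the tag.
ATTR_IN_TAG = re.compile(
    r"""\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?"""
    r"""(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)"""
)


def attributes(html):
    """Return (name, value) pairs; valueless attributes get an empty string."""
    pairs = []
    for match in ATTR_IN_TAG.finditer(html):
        name, double_quoted, single_quoted, bare = match.groups()
        pairs.append((name, double_quoted or single_quoted or bare or ""))
    return pairs


if __name__ == "__main__":
    tag = (
        "<tag attrnovalue attrnoquote=bli "
        "attrdoublequote=\"blah 'blah'\" attrsinglequote='bloob \"bloob\"' >"
    )
    for pair in attributes(tag):
        print(pair)
```

Because of the lookahead, name=value text that is not followed by the end of a tag is skipped, which is the property the answer is after.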

User · Answer

I also needed this and wrote a function for parsing attributes. You can get it from here: https://gist.github.com/4153580

Note: It doesn't use regex.

User · Answer

If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you write your perfect regexp, what happens when you get malformed HTML?

I would suggest using a library to parse the HTML, depending on the language you work with, e.g. Python's Beautiful Soup.

User · Answer

Although the advice not to parse HTML via regexp is valid, here's an expression that does pretty much what you asked:

    /
    \G                     # start where the last match left off
    (?>                    # begin non-backtracking expression
        .*?                # *anything* until...
        <[Aa]\b            # an anchor tag
    )??                    # but look ahead to see that the rest of the expression
                           #    does not match.
    \s+                    # at least one space
    ( \p{Alpha}            # Our first capture, starting with one alpha
      \p{Alnum}*           # followed by any number of alphanumeric characters
    )                      # end capture #1
    (?: \s* = \s*          # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])         # capture a single quote character
            (.*?)          # anything else
             \2            # which ever quote character we captured before
        |   ( [^>\s'"]+ )  # any number of non-( '>', space, quote ) chars
        )                  # end group
    )?                     # attribute value was optional
    /msx;

"But wait," you might say. "What about *comments*?!" Okay, then you can replace the . in the non-backtracking section with the following (it also handles CDATA sections):

    (?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)

Also, if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

User · Answer

Extract the element:

    var buttonMatcherRegExp = /<a[\s\S]*?>[\s\S]*?<\/a>/;
    htmlStr = string.match(buttonMatcherRegExp)[0];

Then use jQuery to parse it and extract the bit you want:

    $(htmlStr).attr('style');

User · Answer

splattne, @VonC's solution partly works, but there is an issue if the tag has a mix of unquoted and quoted attributes. This one works with mixed attributes:

    $pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"

To test it out:

    <?php
    $pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";

    $code = '    <IMG title=09.jpg alt=09.jpg src="http://example.com/jpg?v=185579" border=0 mce_src="example.com/jpg?v=185579"
        ';

    preg_match_all("@$pat_attributes@isU", $code, $ms);
    var_dump($ms);

    $code = '
    <a href=test.html class=xyz>
    <a href="test.html" class="xyz">
    <a href=\'test.html\' class="xyz">
    <img src="http://"/>      ';

    preg_match_all("@$pat_attributes@isU", $code, $ms);

    var_dump($ms);

$ms then contains the keys and the values. Since the second group captures the opening delimiter, the values end up in the third group:

    $keys = $ms[1];
    $values = $ms[3];

User · Answer

Have a look at this: Regex & PHP - isolate src attribute from img tag. Perhaps you can walk through the DOM and get the desired attributes. It works fine for me, getting attributes from the body tag.

User · Answer

PHP (PCRE) and Python

Simple attribute extraction:

    ((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))

Or with verification of tag opening and closure, tag name retrieval, and comment escaping. This expression foresees unquoted and quoted values, single and double quotes, escaped quotes inside attributes, spaces around equals signs, and any number of attributes; it checks only for attributes inside tags and copes with different quotes within an attribute value:

    (?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)

(Works better with the "gisx" flags.)

JavaScript

JavaScript regular expressions did not support look-behinds when this was written, so most features of the previous expressions are not available there. But in case it fits someone's needs, you could try this version:

    (\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)

User · Answer

This works for me. It also takes into consideration some edge cases I have encountered. I am using this regex in an XML parser:

    (?<=\s)[^><:\s]*=*(?=[>,\s])
