[html] Regular expression for extracting tag attributes

I'm trying to extract the attributes of a anchor tag (<a>). So far I have this expression:

(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+

which works for strings like

<a href="test.html" class="xyz">

and (single quotes)

<a href='test.html' class="xyz">

but not for a string without quotes:

<a href=test.html class=xyz>

How can I modify my regex making it work with attributes without quotes? Or is there a better way to do that?

Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.

This question is related to html regex

The answer is


have a look at this Regex & PHP - isolate src attribute from img tag

perhaps you can walk through the DOM and get the desired attributes. It works fine for me, getting attributes from the body-tag


Token Mantra response: you should not tweak/modify/harvest/or otherwise produce html/xml using regular expression.

there are too may corner case conditionals such as \' and \" which must be accounted for. You are much better off using a proper DOM Parser, XML Parser, or one of the many other dozens of tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as its recognized, tested, and you use one.

my $foo  = Someclass->parse( $xmlstring ); 
my @links = $foo->getChildrenByTagName("a"); 
my @srcs = map { $_->getAttribute("src") } @links; 
# @srcs now contains an array of src attributes extracted from the page. 

I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.


I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainabilty you are about to shoot yourself in both feet.


splattne,

@VonC solution partly works but there is some issue if the tag had a mixed of unquoted and quoted

This one works with mixed attributes

$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"

to test it out

<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"

$code = '    <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
    ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);
var_dump( $ms );

$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/>      ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);

var_dump( $ms );

$ms would then contain keys and values on the 2nd and 3rd element.

$keys = $ms[1];
$values = $ms[2];

Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.


Token Mantra response: you should not tweak/modify/harvest/or otherwise produce html/xml using regular expression.

there are too may corner case conditionals such as \' and \" which must be accounted for. You are much better off using a proper DOM Parser, XML Parser, or one of the many other dozens of tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as its recognized, tested, and you use one.

my $foo  = Someclass->parse( $xmlstring ); 
my @links = $foo->getChildrenByTagName("a"); 
my @srcs = map { $_->getAttribute("src") } @links; 
# @srcs now contains an array of src attributes extracted from the page. 

something like this might be helpful

'(\S+)\s*?=\s*([\'"])(.*?|)\2

Tags and attributes in HTML have the form

<tag 
   attrnovalue 
   attrnoquote=bli 
   attrdoublequote="blah 'blah'"
   attrsinglequote='bloob "bloob"' >

To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that only matches are reported within HTML tags. Assuming you have the correct regex, the total regex would be:

attr(?=(attr)*\s*/?\s*>)

The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?

Unimportant groups are made non capturing. The first matching group $1 gives you the name of the attribute, the value is one of $2or $3 or $4. I use $2$3$4 to extract the value. The final regex is

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)

Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.


I have created a PHP function that could extract attributes of any HTML tags. It also can handle attributes like disabled that has no value, and also can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:

/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
    if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
    $matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
    $results = array(
        'element' => $matches[2],
        'attributes' => null,
        'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
    );
    if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
        $results['attributes'] = array();
        foreach($attrs[1] as $i => $attr) {
            $results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
        }
    }
    return $results;
}

Test Code

$test = array(
    '<div class="foo" id="bar" data-test="1000">',
    '<div>',
    '<div class="foo" id="bar" data-test="1000">test content</div>',
    '<div>test content</div>',
    '<div>test content</span>',
    '<div>test content',
    '<div></div>',
    '<div class="foo" id="bar" data-test="1000"/>',
    '<div class="foo" id="bar" data-test="1000" />',
    '< div  class="foo"     id="bar"   data-test="1000"       />',
    '<div class id data-test>',
    '<id="foo" data-test="1000">',
    '<id data-test>',
    '<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);

foreach($test as $t) {
    var_dump($t, extract_html_attributes($t));
    echo '<hr>';
}

Tags and attributes in HTML have the form

<tag 
   attrnovalue 
   attrnoquote=bli 
   attrdoublequote="blah 'blah'"
   attrsinglequote='bloob "bloob"' >

To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that only matches are reported within HTML tags. Assuming you have the correct regex, the total regex would be:

attr(?=(attr)*\s*/?\s*>)

The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?

Unimportant groups are made non capturing. The first matching group $1 gives you the name of the attribute, the value is one of $2or $3 or $4. I use $2$3$4 to extract the value. The final regex is

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)

Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.


I also needed this and wrote a function for parsing attributes, you can get it from here:

https://gist.github.com/4153580

(Note: It doesn't use regex)


PHP (PCRE) and Python

Simple attribute extraction (See it working):

((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))

Or with tag opening / closure verification, tag name retrieval and comment escaping. This expression foresees unquoted / quoted, single / double quotes, escaped quotes inside attributes, spaces around equals signs, different number of attributes, check only for attributes inside tags, and manage different quotes within an attribute value. (See it working):

(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)

(Works better with the "gisx" flags.)


Javascript

As Javascript regular expressions don't support look-behinds, it won't support most features of the previous expressions I propose. But in case it might fit someone's needs, you could try this version. (See it working).

(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)

have a look at this Regex & PHP - isolate src attribute from img tag

perhaps you can walk through the DOM and get the desired attributes. It works fine for me, getting attributes from the body-tag


This is my best RegEx to extract properties in HTML Tag:

# Trim the match inside of the quotes (single or double)

(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2

# Without trim

(\S+)\s*=\s*([']|["])([\W\w]*?)\2

Pros:

  • You are able to trim the content inside of quotes.
  • Match all the special ASCII characters inside of the quotes.
  • If you have title="You're mine" the RegEx does not broken

Cons:

  • It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.: <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.

This is the online RegEx example: https://regex101.com/r/aVz4uG/13



I normally use this RegEx to extract the HTML Tags:

I recommend this if you don't use a tag type like <div, <span, etc.

<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

For example:

<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns 
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">

This is the online RegEx example: https://regex101.com/r/aVz4uG/15

The bug in this RegEx is:

<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

In this tag:

<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>

Returns <div '> but it should not return any match:

Match:  <div '>

To "solve" this remove the [^/]+? pattern:

<div(?:\".*?\"|'.*?'|.*?)*?>


The answer #317081 is good but it not match properly with these cases:

<div id="a"> # It returns "a instead of a
<div style=""> # It doesn't match instead of return only an empty property
<div title = "c"> # It not recognize the space between the equal (=)

This is the improvement:

(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?

vs

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Avoid the spaces between equal signal: (\S+)\s*=\s*((?:...

Change the last + and . for: |[>"']))?[^"']*)["']?

This is the online RegEx example: https://regex101.com/r/aVz4uG/8


I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainabilty you are about to shoot yourself in both feet.


Although the advice not to parse HTML via regexp is valid, here's a expression that does pretty much what you asked:

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # but look ahead to see that the rest of the expression
                          #    does not match.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture a single quote character
            (.*?)         # anything else
             \2           # which ever quote character we captured before
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)

(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
  • Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

This works for me. It also take into consideration some end cases I have encountered.

I am using this Regex for XML parser

(?<=\s)[^><:\s]*=*(?=[>,\s])

I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.


PHP (PCRE) and Python

Simple attribute extraction (See it working):

((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))

Or with tag opening / closure verification, tag name retrieval and comment escaping. This expression foresees unquoted / quoted, single / double quotes, escaped quotes inside attributes, spaces around equals signs, different number of attributes, check only for attributes inside tags, and manage different quotes within an attribute value. (See it working):

(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)

(Works better with the "gisx" flags.)


Javascript

As Javascript regular expressions don't support look-behinds, it won't support most features of the previous expressions I propose. But in case it might fit someone's needs, you could try this version. (See it working).

(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)

If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?

I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.


Extract the element:

var buttonMatcherRegExp=/<a[\s\S]*?>[\s\S]*?<\/a>/;
htmlStr=string.match( buttonMatcherRegExp )[0]

Then use jQuery to parse and extract the bit you want:

$(htmlStr).attr('style') 

If youre in .NET I recommend the HTML agility pack, very robust even with malformed HTML.

Then you can use XPath.


Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.


Although the advice not to parse HTML via regexp is valid, here's a expression that does pretty much what you asked:

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # but look ahead to see that the rest of the expression
                          #    does not match.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture a single quote character
            (.*?)         # anything else
             \2           # which ever quote character we captured before
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)

(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
  • Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.


You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.

So either don’t use named captures:

(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+

Or don’t use the quantifier on this expression:

(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)

This does also allow attribute values like bar=' baz='quux:

foo="bar=' baz='quux"

Well the drawback will be that you have to strip the leading and trailing quotes afterwards.


Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.


I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainabilty you are about to shoot yourself in both feet.


If youre in .NET I recommend the HTML agility pack, very robust even with malformed HTML.

Then you can use XPath.


I have created a PHP function that could extract attributes of any HTML tags. It also can handle attributes like disabled that has no value, and also can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:

/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
    if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
    $matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
    $results = array(
        'element' => $matches[2],
        'attributes' => null,
        'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
    );
    if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
        $results['attributes'] = array();
        foreach($attrs[1] as $i => $attr) {
            $results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
        }
    }
    return $results;
}

Test Code

$test = array(
    '<div class="foo" id="bar" data-test="1000">',
    '<div>',
    '<div class="foo" id="bar" data-test="1000">test content</div>',
    '<div>test content</div>',
    '<div>test content</span>',
    '<div>test content',
    '<div></div>',
    '<div class="foo" id="bar" data-test="1000"/>',
    '<div class="foo" id="bar" data-test="1000" />',
    '< div  class="foo"     id="bar"   data-test="1000"       />',
    '<div class id data-test>',
    '<id="foo" data-test="1000">',
    '<id data-test>',
    '<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);

foreach($test as $t) {
    var_dump($t, extract_html_attributes($t));
    echo '<hr>';
}

If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?

I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.


Although the advice not to parse HTML via regexp is valid, here's a expression that does pretty much what you asked:

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # but look ahead to see that the rest of the expression
                          #    does not match.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture a single quote character
            (.*?)         # anything else
             \2           # which ever quote character we captured before
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)

(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
  • Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

splattne,

@VonC solution partly works but there is some issue if the tag had a mixed of unquoted and quoted

This one works with mixed attributes

$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"

to test it out

<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"

$code = '    <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
    ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);
var_dump( $ms );

$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/>      ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);

var_dump( $ms );

$ms would then contain keys and values on the 2nd and 3rd element.

$keys = $ms[1];
$values = $ms[2];

This works for me. It also take into consideration some end cases I have encountered.

I am using this Regex for XML parser

(?<=\s)[^><:\s]*=*(?=[>,\s])

Although the advice not to parse HTML via regexp is valid, here's a expression that does pretty much what you asked:

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # but look ahead to see that the rest of the expression
                          #    does not match.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture a single quote character
            (.*?)         # anything else
             \2           # which ever quote character we captured before
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)

(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
  • Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.


If youre in .NET I recommend the HTML agility pack, very robust even with malformed HTML.

Then you can use XPath.


I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainabilty you are about to shoot yourself in both feet.


If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?

I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.


You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.

So either don’t use named captures:

(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+

Or don’t use the quantifier on this expression:

(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)

This does also allow attribute values like bar=' baz='quux:

foo="bar=' baz='quux"

Well the drawback will be that you have to strip the leading and trailing quotes afterwards.


I also needed this and wrote a function for parsing attributes, you can get it from here:

https://gist.github.com/4153580

(Note: It doesn't use regex)


Token Mantra response: you should not tweak/modify/harvest/or otherwise produce html/xml using regular expression.

there are too may corner case conditionals such as \' and \" which must be accounted for. You are much better off using a proper DOM Parser, XML Parser, or one of the many other dozens of tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as its recognized, tested, and you use one.

my $foo  = Someclass->parse( $xmlstring ); 
my @links = $foo->getChildrenByTagName("a"); 
my @srcs = map { $_->getAttribute("src") } @links; 
# @srcs now contains an array of src attributes extracted from the page. 

Token Mantra response: you should not tweak/modify/harvest/or otherwise produce html/xml using regular expression.

there are too may corner case conditionals such as \' and \" which must be accounted for. You are much better off using a proper DOM Parser, XML Parser, or one of the many other dozens of tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as its recognized, tested, and you use one.

my $foo  = Someclass->parse( $xmlstring ); 
my @links = $foo->getChildrenByTagName("a"); 
my @srcs = map { $_->getAttribute("src") } @links; 
# @srcs now contains an array of src attributes extracted from the page. 

This is my best RegEx to extract properties in HTML Tag:

# Trim the match inside of the quotes (single or double)

(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2

# Without trim

(\S+)\s*=\s*([']|["])([\W\w]*?)\2

Pros:

  • You are able to trim the content inside of quotes.
  • Match all the special ASCII characters inside of the quotes.
  • If you have title="You're mine" the RegEx does not broken

Cons:

  • It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.: <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.

This is the online RegEx example: https://regex101.com/r/aVz4uG/13



I normally use this RegEx to extract the HTML Tags:

I recommend this if you don't use a tag type like <div, <span, etc.

<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

For example:

<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns 
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">

This is the online RegEx example: https://regex101.com/r/aVz4uG/15

The bug in this RegEx is:

<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

In this tag:

<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>

Returns <div '> but it should not return any match:

Match:  <div '>

To "solve" this remove the [^/]+? pattern:

<div(?:\".*?\"|'.*?'|.*?)*?>


The answer #317081 is good but it not match properly with these cases:

<div id="a"> # It returns "a instead of a
<div style=""> # It doesn't match instead of return only an empty property
<div title = "c"> # It not recognize the space between the equal (=)

This is the improvement:

(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?

vs

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Avoid the spaces between equal signal: (\S+)\s*=\s*((?:...

Change the last + and . for: |[>"']))?[^"']*)["']?

This is the online RegEx example: https://regex101.com/r/aVz4uG/8


Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.


Extract the element:

var buttonMatcherRegExp=/<a[\s\S]*?>[\s\S]*?<\/a>/;
htmlStr=string.match( buttonMatcherRegExp )[0]

Then use jQuery to parse and extract the bit you want:

$(htmlStr).attr('style') 

If youre in .NET I recommend the HTML agility pack, very robust even with malformed HTML.

Then you can use XPath.


something like this might be helpful

'(\S+)\s*?=\s*([\'"])(.*?|)\2

If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?

I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.