RegEx match open tags except XHTML self-contained tags

Question

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except /, then
Find a greater-than

Do I have that right? And more importantly, what do you think?
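
The reading above can be checked by running the pattern against the sample tags. This is a minimal sketch, assuming Python's `re` module treats this pattern the same way as the asker's engine; the names `OPEN_TAG`, `open_tags`, and the sample strings are made up for illustration. Two things show up: `*?` is a lazy (non-greedy) quantifier, and forbidding `/` anywhere in the tag also rejects open tags whose attribute values contain a slash.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r'<([a-z]+) *[^/]*?>')


def open_tags(html):
    """Return the full text of every match of OPEN_TAG in html."""
    return [m.group(0) for m in OPEN_TAG.finditer(html)]


sample = '<p><a href="foo">link</a><br /><hr class="foo" /></p>'
print(open_tags(sample))

# An attribute value containing "/" defeats the pattern entirely.
print(open_tags('<a href="/foo">'))
```

The second call returns an empty list, which is the main practical weakness of excluding `/` with a character class.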

User · Answer

The W3C explains parsing in a pseudo regexp form:
W3C Link

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

User · Answer

RegEx match open tags except XHTML self-contained tags: all other tags (and content) are skipped.

This regex does that. If you need to match only specific open tags, make a list in an alternation (p, br, or whatever tags you want) and replace the \w construct in the appropriate place in the pattern.

The pattern, with a mixed HTML/XML test, is at https://regex101.com/r/uMvJn0/1

Its commented form has three branches:

Invisible content gets failed: script, style, object, embed, applet, noframes, noscript, noembed (group 1, end tag required), ending in (*SKIP)(*FAIL)
This is any open HTML tag we will match (group 2)
All other tags get failed: DOCTYPE, CDATA, comments (-- ... --), ATTLIST, ENTITY, ELEMENT and the remaining tag forms, ending in (*SKIP)(*FAIL)

User · Answer

As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.

User · Answer

This may do:

<.*?[^/]>

Or without the ending tags:

<[^/].*?[^/]>

What's with the flame wars on HTML parsers? HTML parsers must parse (and rebuild!) the entire document before they can categorize your search. Regular expressions may be faster and more elegant in certain circumstances. My 2 cents.

User · Answer

Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this:

//p/a[@href='foo']
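
As a rough illustration of the "convert to XML, then query" approach, here is a sketch using Python's standard `xml.etree.ElementTree`. It assumes the HTML has already been turned into well-formed XML (for example by Tidy); the sample document and the helper name `find_links` are hypothetical, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Assumed input: output of a tidy step, i.e. well-formed XML.
xml_text = (
    '<div><p><a href="foo">link</a></p>'
    '<p><a href="bar">other</a></p>'
    '<br /><hr class="foo" /></div>'
)
root = ET.fromstring(xml_text)


def find_links(root, href):
    """Return <a> elements that sit directly inside a <p> and have the given href."""
    return root.findall(f".//p/a[@href='{href}']")


for a in find_links(root, "foo"):
    print(a.tag, a.attrib)
```

Self-closing elements such as `<br />` need no special handling here, because the parser has already resolved them into ordinary element nodes.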

User · Answer

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and it also accepts h1.

User · Answer

If you're simply trying to find those tags (without ambitions of parsing), try this regular expression:

/<[^/]*?>/g

I wrote it in 30 seconds, and tested here: http://gskinner.com/RegExr/

It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.

User · Answer

The OP doesn't seem to say what he needs to do with the tags. For example, does he need to extract inner text, or just examine the tags?

I'm firmly in the camp that says a regular expression is not the be-all, end-all text parser. I've written a large amount of text-parsing code, including this code to parse HTML tags.

While it's true I'm not all that great with regular expressions, I consider regular expressions just too rigid and hard to maintain for this sort of parsing.

User · Answer

There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the ~~Underverse~~ Stack Based Regex-Verse and returned with ~~powers~~ knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:

7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=

The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human-readable regex, this should help:

// Requires: using System; using System.IO; using System.IO.Compression; using System.Text;
static string FromBase64(string str)
{
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut);
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}
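
For readers without a .NET toolchain, the same decoding step can be sketched in Python. This assumes the payload is a raw DEFLATE stream (which is what .NET's DeflateStream reads and writes), hence the negative `wbits` value; the names `from_base64` and `to_base64` are illustrative, and `to_base64` exists only so the round trip can be checked.

```python
import base64
import zlib


def from_base64(text):
    """Base64-decode text, inflate the raw DEFLATE stream, return UTF-8 text."""
    raw = base64.b64decode(text)  # ignores line breaks by default
    return zlib.decompress(raw, wbits=-15).decode("utf-8")


def to_base64(text):
    """Inverse of from_base64, used here only to verify the round trip."""
    compressor = zlib.compressobj(wbits=-15)
    raw = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")


print(from_base64(to_base64(r"<(?<ELEMENTNAME>\w+)>")))
```

Passing the multi-line blob above to `from_base64` as a single string should work the same way, since the default Base64 decoder skips the newlines.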

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an xml or the full plain regex

User · Answer

If you need this for PHP:

The PHP DOM functions won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy (will crash on large pages).

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser, which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

User · Answer

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

User · Answer

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

User · Answer

Although regular expressions are not a suitable or effective tool for this purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

User · Answer

In shell, you can parse HTML using sed:

Turing.sed
Write HTML parser (homework)
???
Profit!

Related (why you shouldn't use regex match):

If You Like Regular Expressions So Much, Why Don't You Marry Them?
Regular Expressions: Now You Have Two Problems
Hacking stackoverflow.com's HTML sanitizer

User · Answer

Here's the solution:

<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a>
    <br/>and check out.<hr />
    <h2>title</h2>
    <a name ="paragraph" rel= "I\'m an anchor"></a>
    Fine, <span title=\'highlight the "punch"\'>thanks<span>.
    <div class = "clear"></div>
    <br>';

// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);

// print the result:
print_r($matches[0]);
?>

To test it deeply, I entered in the string auto-closing tags like:

<hr />
<br/>
<br>

I also entered tags with:

one attribute
more than one attribute
attributes whose value is bound either in single quotes or in double quotes
attributes containing single quotes when the delimiter is a double quote and vice versa
"unpretty" attributes with a space before the "=" symbol, after it, and both before and after it

Should you find something which does not work in the proof of concept above, I am available to analyze the code to improve my skills.

<EDIT> I forgot that the question from the user was to avoid the parsing of self-closing tags. In this case the pattern is simpler, turning into this:

$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:

$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

</EDIT>

Understanding the pattern

If someone is interested in learning more about the pattern, I provide some notes:

the first sub-expression (\w+) matches the tag name
the second sub-expression contains the pattern of an attribute. It is composed of:
one or more whitespaces \s+
the name of the attribute (\w+)
zero or more whitespaces \s* (it is possible or not, leaving blanks here)
the "=" symbol
again, zero or more whitespaces
the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to parse the closure of the attribute; that's why it is very important.
the value of the attribute, matched by almost anything: (.*?). In this specific syntax, using the non-greedy match (the question mark after the asterisk), the RegExp engine behaves like a "look-ahead" operator, matching anything up to whatever follows this sub-expression
here comes the fun: the \4 part is a backreference operator, which refers to a sub-expression defined before in the pattern; in this case, I am referring to the fourth sub-expression, which is the first attribute delimiter found
zero or more whitespaces \s*
the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.
Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.
The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (\/>|>). The slash is, of course, escaped since it coincides with the regular expression delimiter.

Small tip: to better analyze this code it is necessary to look at the generated source code, since I did not provide any HTML special characters escaping.
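
The attribute-by-attribute structure described above can be tried out in Python as well. This is a sketch of the variant that skips self-closing tags, under the assumption that quoted attribute values are delimited by the same quote character on both sides (the `\4` backreference); the names `OPEN_TAG`, `open_tags`, and the sample string are illustrative, not the answer author's code.

```python
import re

# Tag name, then zero or more name = 'value' / "value" attributes, then ">".
# Group 4 captures the opening quote; \4 requires the same quote to close it.
OPEN_TAG = re.compile(r"""<(\w+)(\s+(\w+)\s*=\s*('|")(.*?)\4\s*)*\s*>""")


def open_tags(html):
    """Return the text of every open tag that is not self-closing."""
    return [m.group(0) for m in OPEN_TAG.finditer(html)]


sample = (
    'Hello, try clicking <a href="#paragraph">here</a> <br/>and check out <hr /> '
    '<a name ="paragraph" rel= "I\'m an anchor"></a> '
    "<span title='highlight the \"punch\"'>thanks</span> <br>"
)
for tag in open_tags(sample):
    print(tag)
```

Because the closing quote must equal the opening one, an apostrophe inside a double-quoted value (and the reverse) does not end the attribute early.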

User · Answer

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
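
A quick way to see the look-behind idea at work is the sketch below. It assumes a pattern of the shape "a <, a word, anything that is not >, no / immediately before the closing >"; the exact pattern text, the names `OPEN_TAG` and `open_tag_names`, and the sample strings are assumptions made for illustration. Python's `re` supports this kind of fixed-width negative look-behind.

```python
import re

# "<", a lowercase tag name, any run of non-">" characters,
# then ">" provided the character just before it is not "/".
OPEN_TAG = re.compile(r'<([a-z]+)[^>]*(?<!/)>')


def open_tag_names(html):
    """Return the captured tag name of every non-self-closing open tag."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]


sample = '<p><a href="/foo">link</a><br /><hr class="foo" /></p>'
print(open_tag_names(sample))

# A slash that is not directly before ">" still gets through.
print(open_tag_names('<a/ >'))
```

Unlike a pattern that bans `/` everywhere, this one accepts `href="/foo"`, since only the character directly before `>` is inspected.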

User · Answer

There are some nice regexes for replacing HTML with BBCode here. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

For example:

$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;

if (!$query->param("ascii")) {
    $html =~ s/\s\s+/\n/gi;
    $html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}

$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s/<div>/\n/gi;
$html =~ s/<\/div>/\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;

$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;

$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/<object>(.*?)<\/object>//sgmi;
$html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s/<!--(.*?)-->/\n/sgmi;

$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;
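
The same "ordered list of substitutions" idea can be sketched in a few lines of Python. This is a heavily reduced illustration, not a port of the Perl above: the rule table `RULES` and the function name `html_to_bbcode` are hypothetical, only three tags are handled, and everything else is simply dropped.

```python
import re

# Each rule is (compiled pattern, replacement); rules run in order.
RULES = [
    (re.compile(r"<b>(.*?)</b>", re.I | re.S), r"[b]\1[/b]"),
    (re.compile(r"<i>(.*?)</i>", re.I | re.S), r"[i]\1[/i]"),
    (re.compile(r"<br(.*?)>", re.I), "\n"),
    (re.compile(r"<[^>]*>"), ""),  # finally, kill any tag not understood
]


def html_to_bbcode(html):
    """Apply every rule in sequence and return the converted text."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


print(html_to_bbcode('<b>bold</b><br /><i>italic</i><span>plain</span>'))
```

The order matters: the catch-all rule has to come last, or it would remove the tags the earlier rules are meant to translate.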

User · Answer

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naïve implementation of that will end up matching <bar/></foo> in this example document:

<foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
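
The over-match mentioned above is easy to reproduce. In this sketch, `NAIVE` and `STRICTER` are two hypothetical patterns chosen to illustrate the point: the first lets `.` run across a `>`, the second restricts the tag body to non-`>` characters.

```python
import re

doc = '<foo><bar/></foo>'

# Naive: "<", anything (lazily), then ">" not preceded by "/".
NAIVE = re.compile(r'<.+?(?<!/)>')

# Stricter: the tag body may not contain ">" at all.
STRICTER = re.compile(r'<[^>]+(?<!/)>')

print(NAIVE.findall(doc))
print(STRICTER.findall(doc))
```

With the naive pattern, the look-behind rejects the `>` of `<bar/>`, so the lazy `.+?` keeps growing until it reaches the `>` of `</foo>`. The stricter pattern skips `<bar/>` but still reports `</foo>`, so closing tags need their own exclusion.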

User · Answer

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

See Matching Balanced Constructs with .NET Regular Expressions
See .NET Regular Expressions: Regex and Balanced Matching
See Microsoft's docs on Balancing Group Definitions

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

(?<group>) - pushes the captured result on the capture stack with the name group.

(?<-group>) - pops the top most capture with the name group off the capture stack.

(?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.

Consider the following regular expression:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |
   [^<>]*
)*
(?(opentag)(?!))

Use the flags:

Singleline
IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)
IgnoreCase (not necessary)

Regular Expression Explained (inline)

(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack
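
Python's `re` module has no balancing groups, but the "push is increment, pop is decrement, empty is compare to zero" reading quoted above can be sketched with a regex tokenizer plus an explicit counter. This is an assumption-laden illustration, not an equivalent of the .NET pattern: `TOKEN` and `balanced_element` are hypothetical names, and the input is assumed to be well-formed with no raw `<` or `>` inside text.

```python
import re

# One token per match: comment, self-closing tag, closing tag, opening tag, or text.
TOKEN = re.compile(r"<!--.*?-->|<[^>]*/>|</[^>]*>|<[^>]*>|[^<>]+", re.S)


def balanced_element(text, start_pattern):
    """Return the element starting at start_pattern through its matching close tag."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    depth = 0
    pos = start.start()
    while pos < len(text):
        token = TOKEN.match(text, pos)
        if token is None:
            return None  # input not tokenizable under the stated assumptions
        piece = token.group(0)
        pos = token.end()
        if piece.startswith("<!--") or piece.endswith("/>"):
            continue  # comments and self-closing tags do not change the depth
        if piece.startswith("</"):
            depth -= 1  # "pop"
            if depth == 0:
                return text[start.start():pos]
        elif piece.startswith("<"):
            depth += 1  # "push"
    return None  # ran out of input with tags still open


html = ('<div><br /><ul id="matchMe" type="square"><li>a</li>'
        '<li><div><ul><li>b</li></ul></div></li></ul></div>')
print(balanced_element(html, r'<ul\s+id="matchMe"'))
```

The counter plays the role of the `opentag` capture stack: the match ends at the first closing tag that brings it back to zero.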

You can try this at A Better .NET Regular Expression Tester.

I used the sample source of:

<html>
<body>
<div>
   <br />
   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>
</div>
</body>
</html>

This found the match:

   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>

although it actually came out like this:

<ul id="matchMe" type="square">           <li>stuff...</li>           <li>more stuff</li>           <li>               <div>                    <span>still more</span>                    <ul>                         <li>Another &gt;ul&lt;, oh my!</li>                         <li>...</li>                    </ul>               </div>           </li>        </ul>

Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funny enough, it cites the answer to this question that currently has over 4k votes.

User · Answer

Don't listen to these guys. You totally can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

Solve the Halting Problem.
Square a circle.
Work out the Traveling Salesman Problem in O(log n) or less. If it's any more than that, you'll run out of RAM and the engine will hang.
The pattern will be pretty big, so make sure you have an algorithm that losslessly compresses random data.
Almost there - just divide the whole thing by zero. Easy-peasy.

I haven't quite finished the last part myself, but I know I'm getting close. It keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions for some reason, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

User · Answer

I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.

User · Answer

About the question of the regular expression methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since nobody here spoke about recursion.

A regular expression-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".

No, holy cow, no match found. Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a regular expression parser without recursion is not enough for the purpose. It's a simple construct.

The black art of regular expressions is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

Just try it. It's written as a PHP string; the "s" modifier makes the dot match newlines as well.

Here's a sample note on the PHP manual I wrote in January: Reference

(Take care. In that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the regular expression engine, since no ^ or $ anchoring was used).

Now, we could speak about the limits of this method from a more informed point of view:

according to the specific implementation of the regular expression engine, recursion may have a limit in the number of nested patterns parsed, but it depends on the language used
corrupted (x)HTML does not cause severe errors, but it is not sanitized either.

Anyhow, it is only a regular expression pattern, but it opens up the possibility of developing a lot of powerful implementations.

I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing to do with other template engines which use the same syntax).

User · Answer

While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

User · Answer

Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>

User · Answer

<?php
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
    <p><a href="#">foo</a></p>
    <hr/>
    <br/>
    <div>name</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.

I'm sure you already know by now that you shouldn't use regex for this purpose.
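
The same filter-by-element-name idea can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not a port: the class name `OpenTagCollector`, the helper `open_tag_names`, and the set `VOID` are hypothetical, and unlike DOMDocument this parser does not add implied html and body elements.

```python
from html.parser import HTMLParser

# Element names treated as self-closing, taken from the list in the answer.
VOID = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param", "embed",
}


class OpenTagCollector(HTMLParser):
    """Collect the names of open tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...>; void elements such as <br> are skipped.
        if tag not in VOID:
            self.names.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style <tag ... />; ignored on purpose.
        pass


def open_tag_names(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names


print(open_tag_names('<p><a href="#">foo</a></p><hr/><br/><br><div>name</div>'))
```

Attribute values containing `/` or `>` are handled by the parser itself, which is the main advantage over the regex attempts earlier in the thread.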

User · Answer

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Rege x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c o rrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg ex parsers for HTML will ins tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil l devour your HT ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi ght he com e s, h i s un ho ly radian ce destro ying all enli ghtenment, HTML tags lea ki n g fr o m yo ur eye s l ik e liq uid pain, the song of re gular exp ression parsing will exti nguish the voices of mor tal man from the sp here I can see it can you see i t it is beautiful t he final snuffing of the lie s of Man ALL IS LOS T ALL I S LOST the pon y he comes he c omes he comes the ich or permeates all MY FACE MY FACE h god no NO NOO O O NT stop the an g l e s a r e n ot re a l ZA LG IS TO TH E P O N Y H E C O M E S

Have you tried using an XML parser instead?

Moderator's Note: This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
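Taking the suggestion to use a parser at face value, here is a minimal sketch in Python using the standard library's html.parser. The class name OpenTagCollector and the helper open_tag_names are made up for this illustration; the sketch assumes the input is the kind of XHTML-flavoured markup from the question, where self-contained tags are written as <br />.

```python
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collect names of opening tags, skipping self-closed ones like <br />."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary opening tags such as <p> or <a href="foo">.
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags such as <br />. The default implementation
        # forwards to handle_starttag, so it is overridden to ignore them.
        pass


def open_tag_names(markup):
    """Return the names of all non-self-closed opening tags in markup."""
    parser = OpenTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.open_tags


print(open_tag_names('<p><a href="foo">x</a><br /><hr class="foo" /></p>'))
```

The parser, not the pattern, decides where a tag ends, so quoted attribute values and similar edge cases are handled without extra work.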

User · Answer

I think this might work:

<[a-z][^<>]*(?:(?:[^/]\s*)|(?:\s*[^/]))>

And that could be tested here.

As per W3Schools...

XML Naming Rules

XML elements must follow these naming rules:

Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, Xml, etc.)
Names cannot contain spaces
Any name can be used, and no words are reserved

And the pattern I used is going to adhere to these rules.

User · Answer

If you only want the tag names, it should be possible to do this via a regular expression.

<([a-zA-Z]+)(?:[^>]*[^/] *)?>

should do what you need. But I think the solution of "moritz" is already fine. I didn't see it in the beginning.

For all downvoters: In some cases it just makes sense to use a regular expression, because it can be the easiest and quickest solution. I agree that in general you should not parse HTML with regular expressions.

But regular expressions can be a very powerful tool when you have a subset of HTML where you know the format and you just want to extract some values. I did that hundreds of times and almost always achieved what I wanted.
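As a quick check of the "known subset" case, here is a small Python sketch that runs the pattern <([a-zA-Z]+)(?:[^>]*[^/] *)?> over the four tags from the question. The sample string and the variable names are chosen for this illustration only.

```python
import re

# Pattern from this answer: capture the tag name, then require that the last
# non-space character before ">" is not a slash.
TAG_NAME = re.compile(r"<([a-zA-Z]+)(?:[^>]*[^/] *)?>")

sample = '<p> <a href="foo"> <br /> <hr class="foo" />'
names = TAG_NAME.findall(sample)
print(names)
```

Because the name group only accepts letters, a tag such as <h1> is reported as h, which is one reason this only suits input whose format is known in advance.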

User · Answer

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with a regular expression.

But many will try, and some will even claim success, until others find the fault and totally mess you up.
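A small Python illustration of where nesting bites, offered as a sketch rather than a proof: a pattern has no counter, so it cannot pair each opening tag with its own closing tag, while a parser can simply track depth. The names NESTED, SIBLINGS, DepthTracker and max_depth are invented for this example.

```python
import re
from html.parser import HTMLParser

NESTED = "<div>a<div>b</div>c</div>"
SIBLINGS = "<div>a</div><div>b</div>"

# Lazy matching stops at the first </div>, which belongs to the inner element.
lazy = re.search(r"<div>(.*?)</div>", NESTED).group(1)

# Greedy matching runs to the last </div>, which merges sibling elements.
greedy = re.search(r"<div>(.*)</div>", SIBLINGS).group(1)


class DepthTracker(HTMLParser):
    """Track how deeply elements are nested by counting open and close tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1


def max_depth(markup):
    tracker = DepthTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.max_depth


print(repr(lazy), repr(greedy), max_depth(NESTED), max_depth(SIBLINGS))
```

Neither quantifier gives the right answer for both inputs, whereas the depth counter distinguishes them.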

User · Answer

I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into different tree nodes, and you can easily use its API to get attributes out of a node. Check it out and see if this can help you.

User · Answer

<\s*(\w+)[^/>]*>

The parts explained:

<: Starting character

\s*: It may have whitespaces before the tag name (ugly, but possible).

(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious, use ([a-zA-Z0-9]+) instead.

[^/>]*: Anything except > and / until closing >

>: Closing >

UNRELATED

And to the fellows, who underestimate regular expressions, saying they are only as powerful as regular languages:

a^n b a^n b a^n, which is not regular and not even context free, can be matched with ^(a+)b\1b\1$

Backreferencing FTW
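Both claims can be tried out in Python, whose re module supports backreferences. This is a sketch with invented variable names; the sample strings are not from the answer.

```python
import re

# Tag pattern from this answer: optional whitespace, a captured name, then
# anything except "/" and ">" up to the closing ">".
TAG = re.compile(r"<\s*(\w+)[^/>]*>")
names = TAG.findall('<p> <a href="foo"> <br /> <hr class="foo" /> <h1>')

# a^n b a^n b a^n: the backreference \1 forces all three runs of "a" to have
# the same length, which a pattern without backreferences cannot express.
ANBANBAN = re.compile(r"^(a+)b\1b\1$")
balanced = bool(ANBANBAN.match("aabaabaa"))
unbalanced = bool(ANBANBAN.match("aabaaba"))

print(names, balanced, unbalanced)
```

One side effect of excluding "/" everywhere inside the tag is that an opening tag with a slash in an attribute value, such as <a href="/foo">, is not matched either.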

User · Answer

It's true that when programming it's usually best to use dedicated parsers and APIs instead of regular expressions when dealing with HTML, especially if accuracy is paramount (e.g., if your processing might have security implications). However, I don't ascribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, fixing broken XML files, or dealing with file formats that look like but aren't quite XML. There are some issues to be aware of, but they're not insurmountable or even necessarily relevant.

A simple regex like <([^>"']|"[^"]*"|'[^']*')*> is usually good enough, in cases such as those I just mentioned. It's a naive solution, all things considered, but it does correctly allow unencoded > symbols in attribute values. If you're looking for, e.g., a table tag, you could adapt it as </?table\b([^>"']|"[^"]*"|'[^']*')*>.

Just to give a sense of what a more "advanced" HTML regex would look like, the following does a fairly respectable job of emulating real-world browser behavior and the HTML5 parsing algorithm:

</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

The following matches a fairly strict definition of XML tags (although it doesn't account for the full set of Unicode characters allowed in XML names):

<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>

Granted, these don't account for surrounding context and a few edge cases, but even such things could be dealt with if you really wanted to (e.g., by searching between the matches of another regex).

At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.
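To see what "correctly allow unencoded > symbols in attribute values" buys you, here is a short Python comparison between the simple quote-aware pattern and an even more naive <[^>]*>. The sample markup and the names QUOTE_AWARE, NAIVE, aware and naive are assumptions made for this sketch.

```python
import re

# The simple pattern from this answer: quoted attribute values are consumed
# as units, so a ">" inside quotes does not end the tag.
QUOTE_AWARE = re.compile(r"""<([^>"']|"[^"]*"|'[^']*')*>""")

# For comparison only: stops at the first ">" regardless of quoting.
NAIVE = re.compile(r"<[^>]*>")

sample = """<a title="1 > 0" href='x'>text</a>"""

aware = [m.group(0) for m in QUOTE_AWARE.finditer(sample)]
naive = [m.group(0) for m in NAIVE.finditer(sample)]

print(aware)
print(naive)
```

The quote-aware pattern returns the whole opening tag, while the naive one cuts it off at the > inside the title attribute.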

User · Answer

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>
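The negative lookbehind approach can be checked against the four tags from the question with a few lines of Python, which supports fixed-width lookbehind. The sample string and variable names are invented for this sketch.

```python
import re

# (?<!/) asserts that the character immediately before the closing ">"
# is not a slash, which rejects <br /> and <hr class="foo" />.
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>")

sample = '<p> <a href="foo"> <br /> <hr class="foo" />'
names = OPEN_TAG.findall(sample)
print(names)
```

Note that the lookbehind inspects exactly one character, so a tag written as <br / > (with a space after the slash) would still be matched.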

User · Answer

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

User · Answer

I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

HTML Split

Some better regular expressions:

/(<.*?>|[^<]+)\s*/g    # Get tags and text
/(\w+)="(.*?)"/g       # Get attributes

They are good for XML / XHTML.
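For readers who would rather not use Perl, the two expressions above translate directly to Python. This is only a sketch: the function names tokens and attributes are invented here, and it inherits the same assumption of reasonably clean XML / XHTML input.

```python
import re

# Same two patterns as above, compiled once.
TOKEN = re.compile(r"(<.*?>|[^<]+)\s*")   # a tag, or a run of text
ATTR = re.compile(r'(\w+)="(.*?)"')       # name="value" pairs


def tokens(markup):
    """Split markup into a flat list of tags and text chunks."""
    return TOKEN.findall(markup)


def attributes(tag):
    """Return the double-quoted attributes of a single tag as a dict."""
    return dict(ATTR.findall(tag))


print(tokens('<p class="x">Hello <b>world</b></p>'))
print(attributes('<p class="x" id="main">'))
```

Only double-quoted attribute values are picked up, which matches the original expression.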

With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.

The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.

User · Answer

Here's a PCRE regular expression for XML/XHTML, built from a simplified EBNF syntax definition:

/(?(DEFINE)
  (?<tag>
    (?&tagempty)
    | (?&tagopen) ((?&textnode) | (?&tag) | (?&comment))* (?&tagclose)
  )
  (?<tagunnested>
    (?&tagempty)
    | (?&tagopen) ((?&textnode) | (?&comment))* (?&tagclose)
  )
  (?<textnode> [^<>]+)
  (?<comment> <!--([\s\S]*?)-->)
  (?<tagopen> < (?&tagname) (?&attrlist)? (?&ws)* >)
  (?<tagempty> < (?&tagname) (?&ws)* (?&attrlist)? (?&ws)* />)
  (?<tagclose> </ (?&tagname) (?&ws)* >)
  (?<attrlist> ((?&ws)+ (?&attr))+)
  (?<attr>
    (?&attrunquoted)
    | (?&attrsinglequoted)
    | (?&attrdoublequoted)
    | (?&attrempty)
  )
  (?<attrempty> (?&attrname))
  (?<attrunquoted> (?&attrname) (?&ws)* = (?&ws)* (?&attrunquotedvalue))
  (?<attrsinglequoted> (?&attrname) (?&ws)* = (?&ws)* ' (?&attrsinglequotedvalue) ')
  (?<attrdoublequoted> (?&attrname) (?&ws)* = (?&ws)* " (?&attrdoublequotedvalue) ")
  (?<tagname> (?&alphabets) ((?&alphabets) | (?&digits))*)
  (?<attrname>(?&alphabets)+((?&alphabets) | (?&digits) | - )*)
  (?<attrunquotedvalue> [^\s"'=<>`]+)
  (?<attrsinglequotedvalue> [^']+)
  (?<attrdoublequotedvalue> [^"]+)
  (?<alphabets> [a-zA-Z])
  (?<digits> [0-9])
  (?<ws> \s)
)
(?&tagopen)
/x

This illustrates how to build regular expressions for context-free grammars. You can match other parts of the definition by changing the match on the last line from (?&tagopen) to e.g. (?&tagunnested).

The real question is: Should you do it?

For XML/XHTML, the consensus is no!

Credits to nikic for supplying the idea.
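Python's standard re module has no (?(DEFINE)...) groups or (?&name) subroutine calls, so the expression above cannot be pasted in as-is. The same grammar-to-pattern idea can still be sketched for the non-recursive tagopen rule by assembling it from named string pieces; this is an adaptation for illustration, not the answer's own method, and it leaves out the recursive tag rule entirely.

```python
import re

# Each variable mirrors one rule of the simplified EBNF definition.
ws = r"\s"
alphabets = "[a-zA-Z]"
digits = "[0-9]"
tagname = f"{alphabets}(?:{alphabets}|{digits})*"
attrname = f"{alphabets}+(?:{alphabets}|{digits}|-)*"
attrunquotedvalue = r"""[^\s"'=<>`]+"""
attrunquoted = f"{attrname}{ws}*={ws}*{attrunquotedvalue}"
attrsinglequoted = f"{attrname}{ws}*={ws}*'[^']+'"
attrdoublequoted = f'{attrname}{ws}*={ws}*"[^"]+"'
attrempty = attrname
attr = f"(?:{attrunquoted}|{attrsinglequoted}|{attrdoublequoted}|{attrempty})"
attrlist = f"(?:{ws}+{attr})+"

# tagopen: "<" tagname attrlist? ws* ">"
TAGOPEN = re.compile(f"<{tagname}(?:{attrlist})?{ws}*>")

sample = '<p> <a href="foo"> <br /> <hr class="foo" />'
matches = TAGOPEN.findall(sample)
print(matches)
```

Since the pattern contains no capturing groups, findall returns the full text of each opening tag.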
