[regex] Regular Expression to match every new line character (\n) inside a <content> tag

I'm looking for a regular expression to match every new line character (\n) inside a XML tag which is <content>, or inside any tag which is inside that <content> tag, for example :

<blog>
<text>
(Do NOT match new lines here)
</text>
<content>
(DO match new lines here)
<p>
(Do match new lines here)
</p>
</content>
(Do NOT match new lines here)
<content>
(DO match new lines here)
</content>

This question is related to regex

The answer is


Actually... you can't use a simple regex here, at least not one. You probably need to worry about comments! Someone may write:

<!-- <content> blah </content> -->

You can take two approaches here:

  1. Strip all comments out first. Then use the regex approach.
  2. Do not use regular expressions and use a context sensitive parsing approach that can keep track of whether or not you are nested in a comment.

Be careful.

I am also not so sure you can match all new lines at once. @Quartz suggested this one:

<content>([^\n]*\n+)+</content>

This will match any content tags that have a newline character RIGHT BEFORE the closing tag... but I'm not sure what you mean by matching all newlines. Do you want to be able to access all the matched newline characters? If so, your best bet is to grab all content tags, and then search for all the newline chars that are nested in between. Something more like this:

<content>.*</content>

BUT THERE IS ONE CAVEAT: regexes are greedy, so this regex will match the first opening tag to the last closing one. Instead, you HAVE to suppress the regex so it is not greedy. In languages like python, you can do this with the "?" regex symbol.

I hope with this you can see some of the pitfalls and figure out how you want to proceed. You are probably better off using an XML parsing library, then iterating over all the content tags.

I know I may not be offering the best solution, but at least I hope you will see the difficulty in this and why other answers may not be right...

UPDATE 1:

Let me summarize a bit more and add some more detail to my response. I am going to use python's regex syntax because it is what I am more used to (forgive me ahead of time... you may need to escape some characters... comment on my post and I will correct it):

To strip out comments, use this regex: Notice the "?" suppresses the .* to make it non-greedy.

Similarly, to search for content tags, use: .*?

Also, You may be able to try this out, and access each newline character with the match objects groups():

<content>(.*?(\n))+.*?</content>

I know my escaping is off, but it captures the idea. This last example probably won't work, but I think it's your best bet at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.

UPDATE 2:

So here is python code that ought to work. I am still unsure what you mean by "find" all newlines. Do you want the entire lines? Or just to count how many newlines. To get the actual lines, try:

#!/usr/bin/python

import re

def FindContentNewlines(xml_text):
    # May want to compile these regexes elsewhere, but I do it here for brevity
    comments = re.compile(r"<!--.*?-->", re.DOTALL)
    content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
    newlines = re.compile(r"^(.*?)$", re.MULTILINE|re.DOTALL)

    # strip comments: this actually may not be reliable for "nested comments"
    # How does xml handle <!--  <!-- --> -->. I am not sure. But that COULD
    # be trouble.
    xml_text = re.sub(comments, "", xml_text)

    result = []
    all_contents = re.findall(content, xml_text)
    for c in all_contents:
        result.extend(re.findall(newlines, c))

    return result

if __name__ == "__main__":
    example = """

<!-- This stuff
ought to be omitted
<content>
  omitted
</content>
-->

This stuff is good
<content>
<p>
  haha!
</p>
</content>

This is not found
"""
    print FindContentNewlines(example)

This program prints the result:

 ['', '<p>', '  haha!', '</p>', '']

The first and last empty strings come from the newline chars immediately preceeding the first <p> and the one coming right after the </p>. All in all this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out stuff in the middle so you can see what the regexes are matching and not matching.

Hope this helps :-).

PS - I didn't have much luck trying out my regex from my first update to capture all the newlines... let me know if you do.


<content>(?:[^\n]*(\n+))+</content>