Regular Expression to match every new line character (\n) inside a <content> tag

Question

I'm looking for a regular expression to match every new line character (\n) inside an XML tag which is <content>, or inside any tag which is inside that <content> tag, for example:

    <blog>
    <text>
    (Do NOT match new lines here)
    </text>
    <content>
    (DO match new lines here)
    <p>
    (Do match new lines here)
    </p>
    </content>
    (Do NOT match new lines here)
    <content>
    (DO match new lines here)
    </content>

User · Accepted Answer

Actually... you can't use a simple regex here, at least not a single one. You probably need to worry about comments! Someone may write:

    <!-- <content> blah </content> -->

You can take two approaches here:

1. Strip all comments out first. Then use the regex approach.
2. Do not use regular expressions, and use a context-sensitive parsing approach that can keep track of whether or not you are nested in a comment.

Be careful.

I am also not so sure you can match all new lines at once. Quartz suggested this one:

    <content>([^\n]*\n+)+</content>

This will match any content tags that have a newline character RIGHT BEFORE the closing tag... but I'm not sure what you mean by matching all newlines. Do you want to be able to access all the matched newline characters? If so, your best bet is to grab all content tags, and then search for all the newline chars that are nested in between. Something more like this:

    <content>.*</content>

BUT THERE IS ONE CAVEAT: regex quantifiers are greedy by default, so this regex will match from the first opening tag to the last closing one. Instead, you HAVE to make the quantifier non-greedy. In languages like Python, you can do this with the "?" regex symbol.

I hope with this you can see some of the pitfalls and figure out how you want to proceed. You are probably better off using an XML parsing library, then iterating over all the content tags.

I know I may not be offering the best solution, but at least I hope you will see the difficulty in this and why other answers may not be right...

UPDATE 1:

Let me summarize a bit more and add some more detail to my response. I am going to use Python's regex syntax because it is what I am more used to (forgive me ahead of time... you may need to escape some characters... comment on my post and I will correct it).

To strip out comments, use this regex:

    <!--.*?-->

Notice the "?" suppresses the ".*" to make it non-greedy.

Similarly, to search for content tags, use:

    <content>.*?</content>

Also, you may be able to try this out, and access each newline character with the match object's groups():

    <content>(.*?(\n))+.*?</content>

I know my escaping is off, but it captures the idea. This last example probably won't work, but I think it's your best bet at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.

UPDATE 2:

So here is Python code that ought to work. I am still unsure what you mean by "find" all newlines. Do you want the entire lines? Or just to count how many newlines there are? To get the actual lines, try:

    #!/usr/bin/python

    import re

    def FindContentNewlines(xml_text):
        # May want to compile these regexes elsewhere, but I do it here for brevity
        comments = re.compile(r"<!--.*?-->", re.DOTALL)
        content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
        # No re.DOTALL here: "." has to stop at each newline so that
        # every line is matched separately
        newlines = re.compile(r"^(.*?)$", re.MULTILINE)

        # Strip comments. Well-formed XML does not allow "--" inside a
        # comment, so comments cannot nest, but malformed input such as
        # <!-- <!-- --> --> COULD still be trouble.
        xml_text = re.sub(comments, "", xml_text)

        result = []
        all_contents = re.findall(content, xml_text)
        for c in all_contents:
            result.extend(re.findall(newlines, c))

        return result

    if __name__ == "__main__":
        example = """
    <!-- This stuff
    ought to be omitted
    <content>
      omitted
    </content>
    -->

    This stuff is good
    <content>
    <p>
      haha
    </p>
    </content>

    This is not found
    """
        print(FindContentNewlines(example))

This program prints the result:

    ['', '<p>', '  haha', '</p>', '']

The first and last empty strings come from the newline char immediately preceding the first <p> and the one coming right after the </p>. All in all this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out stuff in the middle so you can see what the regexes are matching and not matching.

Hope this helps :-)

PS - I didn't have much luck trying out my regex from my first update to capture all the newlines... let me know if you do.
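For anyone who wants to try the "use an XML parsing library" suggestion, here is a minimal sketch with Python's standard xml.etree.ElementTree. It assumes the input is well-formed XML with a single root element (like the <blog> in the question), and count_content_newlines is just a name I made up for illustration. The default parser drops comments, so a <content> inside a comment is never seen.

```python
import xml.etree.ElementTree as ET


def count_content_newlines(xml_text):
    """Return one newline count per <content> element, in document order.

    Hypothetical helper for illustration; assumes well-formed XML.
    """
    root = ET.fromstring(xml_text)  # comments are discarded by the default parser
    counts = []
    for content in root.iter("content"):
        # itertext() yields the element's own text plus the text and tails
        # of everything nested inside it, but not the text that follows
        # the closing </content> tag.
        inner_text = "".join(content.itertext())
        counts.append(inner_text.count("\n"))
    return counts


if __name__ == "__main__":
    sample = (
        "<blog>\n<text>\nskip\n</text>\n"
        "<content>\nkeep\n<p>\nkeep too\n</p>\n</content>\n"
        "skip\n</blog>"
    )
    print(count_content_newlines(sample))
```

Note that root.iter("content") also visits a <content> nested inside another <content>, so its newlines would be counted twice. If you need the positions of the newlines rather than a count, you can enumerate inner_text instead of calling count().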

User · Answer

<content>([^\n]*\n+)+</content>
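A quick way to see what this pattern does and does not give you, sketched with Python's re module (the sample strings are made up for illustration). It only matches when a newline sits right before the closing tag, and the repeated group keeps only its last repetition, so the individual newlines are not available through the group.

```python
import re

# The pattern from this answer, unchanged
pattern = re.compile(r"<content>([^\n]*\n+)+</content>")

# Made-up samples: one ends with a newline before </content>, one does not
with_trailing = "<content>\nline one\nline two\n</content>"
without_trailing = "<content>\nline one\nline two</content>"

match = pattern.search(with_trailing)
no_match = pattern.search(without_trailing)

if match:
    # A repeated group only remembers its final repetition
    last_repetition = match.group(1)
    # Counting inside the whole match is one way to reach every newline
    newline_count = match.group(0).count("\n")
    print(repr(last_repetition), newline_count)

print(no_match)
```

So the regex can tell you that a block contains newlines, but to get at each one you still need a second pass over the matched text.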
