Regular expression matching a multiline block of text

Question

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline):

    some Varying TEXT\n
    \n
    DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
    [more of the above, ending with a newline]\n
    [yep, there is a variable number of lines here]\n
    \n
    (repeat the above a few hundred times).

I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).

I've tried a few approaches:

    re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE)  # try to capture both parts
    re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL)  # just textlines

and a lot of variations of these, with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text. I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc. until the empty line is encountered.

If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.

User · Answer

Find:

    ^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

\1 = some varying text

\2 = lines of all CAPS

Edit (proof that this works):

    import re

    text = """> some Varying TEXT

    DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
    GATACAACATAGGATACA
    GGGGGAAAAAAAATTTTTTTTT
    CCCCAAAA

    > some Varying TEXT2

    DJASDFHKJFHKSDHF
    HHASGDFTERYTERE
    GAGAGAGAGAG
    PPPPPAAAAAAAAAAAAAAAP
    """

    regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
    matches = [m.groups() for m in regex.finditer(text)]

    for m in matches:
        # m[0] is the header text, m[1] the uppercase lines (newlines included)
        print('Name: %s\nSequence:%s' % (m[0], m[1]))
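Group 2 of the pattern in this answer still contains the line breaks, including the blank line after the header. The following is a minimal sketch of collapsing them into one string, assuming the '>'-prefixed header layout used in this answer; the helper name parse_records and the sample data are made up for illustration.

```python
import re

# Same pattern as in the answer: header after '>', then uppercase lines.
RECORD = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)

def parse_records(text):
    """Return (name, sequence) pairs with line breaks removed from the sequence."""
    records = []
    for match in RECORD.finditer(text):
        name = match.group(1).strip()
        # split() with no arguments splits on any whitespace run,
        # so joining the pieces drops every \n and \r.
        sequence = "".join(match.group(2).split())
        records.append((name, sequence))
    return records

# Hypothetical sample input in the layout assumed above.
sample = "> first\n\nAAAA\nCCCC\n\n> second\n\nGGGG\n"
print(parse_records(sample))
```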

User · Answer

This will work:

    >>> import re
    >>> rx_sequence = re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)", re.MULTILINE)
    >>> rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines
    >>> text = """Some varying text1
    ...
    ... AAABBBBBBCCCCCCDDDDDDD
    ... EEEEEEEFFFFFFFFGGGGGGG
    ... HHHHHHIIIIIJJJJJJJKKKK
    ...
    ... Some varying text 2
    ...
    ... LLLLLMMMMMMNNNNNNNOOOO
    ... PPPPPPPQQQQQQRRRRRRSSS
    ... TTTTTUUUUUVVVVVVWWWWWW
    ... """
    >>> for match in rx_sequence.finditer(text):
    ...     title, sequence = match.groups()
    ...     title = title.strip()
    ...     sequence = rx_blanks.sub("", sequence)
    ...     print("Title:", title)
    ...     print("Sequence:", sequence)
    ...     print()
    ...
    Title: Some varying text1
    Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

    Title: Some varying text 2
    Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful:

    ^(.+?)\n\n((?:[A-Z]+\n)+)

The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).

Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.

[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.

((?:textline)+) means match one or more textlines, but do not put each line in a group. Instead, put all the textlines in one group.

You could add a final \n to the regular expression if you want to enforce a double newline at the end.

Also, if you are not sure about what type of newline you will get (\n or \r or \r\n), then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
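As an illustration of the newline substitution suggested in this answer, here is a sketch that builds the same pattern with every \n replaced by (?:\n|\r\n?) and checks it against LF and CRLF input. The names NEWLINE and extract are introduced only for this example. One caveat to keep in mind: in Python's re module, ^ in MULTILINE mode only treats \n as a line boundary, so input that uses bare \r line endings would need additional handling.

```python
import re

NEWLINE = r"(?:\n|\r\n?)"  # matches \n, \r\n, or a lone \r

# The answer's pattern with every \n replaced by NEWLINE.
rx_sequence = re.compile(
    r"^(.+?)" + NEWLINE + NEWLINE + r"((?:[A-Z]+" + NEWLINE + r")+)",
    re.MULTILINE,
)

def extract(text):
    """Return (title, sequence) pairs with all whitespace stripped from the sequence."""
    pairs = []
    for match in rx_sequence.finditer(text):
        title, sequence = match.groups()
        pairs.append((title.strip(), re.sub(r"\s+", "", sequence)))
    return pairs

# Hypothetical sample data with Unix and Windows line endings.
unix_text = "Title one\n\nAAAA\nBBBB\n\nTitle two\n\nCCCC\n"
windows_text = unix_text.replace("\n", "\r\n")

print(extract(unix_text))
print(extract(windows_text))
```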

User · Answer

If each file only has one sequence of amino acids, I wouldn't use regular expressions at all. Just something like this:

    def read_amino_acid_sequence(path):
        with open(path) as sequence_file:
            title = sequence_file.readline()  # read 1st line
            aminoacid_sequence = sequence_file.read()  # read the rest

        # some cleanup, if necessary
        title = title.strip()  # remove trailing white spaces and newline
        aminoacid_sequence = aminoacid_sequence.replace(" ", "").replace("\n", "")
        return title, aminoacid_sequence

User · Answer

My preference:

    lineIter = iter(aFile)  # aFile is an open file object
    for line in lineIter:
        if line.startswith(">"):
            someVaryingText = line
            break
    # the line after the header must be blank
    assert len(next(lineIter).strip()) == 0
    acids = []
    for line in lineIter:
        if len(line.strip()) == 0:
            break
        acids.append(line.strip())  # strip the trailing newline

At this point you have someVaryingText as a string, and the acids as a list of strings. You can do "".join(acids) to make a single string.

I find this less frustrating (and more flexible) than multiline regexes.
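Since the question mentions that the block repeats a few hundred times, the same line-by-line idea can be wrapped in a generator. This is only a sketch under the assumption that each record is laid out as in the question (a title line, a blank line, sequence lines, a blank line); the name iter_records and the sample data are hypothetical, and the '>' prefix check from the answer is left out.

```python
def iter_records(lines):
    """Yield (title, sequence) for each block: title, blank line, sequence lines, blank line."""
    title = None
    acids = []
    for raw in lines:
        line = raw.strip()
        if title is None:
            if line:                 # first non-blank line starts a record
                title = line
        elif line:
            acids.append(line)       # a sequence line
        elif acids:                  # blank line after the sequence ends the record
            yield title, "".join(acids)
            title, acids = None, []
        # otherwise: the blank separator right after the title, nothing to do
    if title is not None and acids:  # last record without a trailing blank line
        yield title, "".join(acids)

# Hypothetical sample data; any iterable of lines (such as an open file) works.
sample = "Title one\n\nAAAA\nBBBB\n\nTitle two\n\nCCCC\n"
for title, sequence in iter_records(sample.splitlines()):
    print(title, sequence)
```

Because the generator only keeps one record in memory at a time, it can be fed an open file object directly.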

User · Answer

The following is a regular expression matching a multiline block of text:

    import re
    result = re.findall('(startText)(.+)((?:\n.+)+)(endText)', input)
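To make the behaviour of this pattern concrete, here is a small sketch on made-up input. "startText" and "endText" are placeholders for whatever delimiters the real data uses, and the variable is named block rather than input to avoid shadowing the built-in. Note that the pattern expects at least one character after startText on its line and at least one line break before endText.

```python
import re

# Hypothetical sample: "startText" and "endText" stand in for real delimiters.
block = "startText first line\nsecond line\nthird line endText\ntrailing text"

# Groups: start marker, rest of the first line, the following lines, end marker.
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', block)
print(result)
```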

User · Answer

Try this:

    re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

    re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
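The remark about DOTALL can be checked with a short experiment. The sketch below compiles the first pattern from this answer with and without re.DOTALL and runs both on made-up sample text; the variable names are chosen only for this illustration.

```python
import re

# Hypothetical sample with two records in the question's layout.
text = "Header\n\nAAAA\nBBBB\n\nNext\n\nCCCC\n"

pattern = r"^(.+)\n((?:\n.+)+)"

plain = re.compile(pattern, re.MULTILINE)
dotall = re.compile(pattern, re.MULTILINE | re.DOTALL)

# Without DOTALL the dot stops at each newline, so every record is matched separately.
plain_matches = plain.findall(text)

# With DOTALL the dot also consumes newlines, so group 1 swallows
# almost the whole text and the records are no longer separated.
dotall_matches = dotall.findall(text)

print(plain_matches)
print(dotall_matches)
```

In the plain version, group 2 keeps the leading newline of each sequence line, so the newlines still have to be stripped afterwards.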
