[python] Regular expression matching a multiline block of text

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)

some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).

I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that comes two lines below it in one capture (i can strip out the newline characters later). I've tried with a few approaches:

re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines

and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text. I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.

If anyone's curious, its supposed to be a sequence of aminoacids that make up a protein.

This question is related to python regex multiline

The answer is


The following is a regular expression matching a multiline block of text:

import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print "Title:",title
...   print "Sequence:",sequence
...   print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
  • Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
  • [A-Z]+\n means "match as many upper case letters as possible until you reach a newline. This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).

My preference.

lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
         someVaryingText= line
         break
assert len( lineIter.next().strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )

At this point you have someVaryingText as a string, and the acids as a list of strings. You can do "".join( acids ) to make a single string.

I find this less frustrating (and more flexible) than multiline regexes.


find:

^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

\1 = some_varying_text

\2 = lines of all CAPS

Edit (proof that this works):

text = """> some_Varying_TEXT

DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA

> some_Varying_TEXT2

DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""

import re

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]

for m in matches:
    print 'Name: %s\nSequence:%s' % (m[0], m[1])

If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to multiline

Way to create multiline comments in Bash? How to get multiline input from user R: "Unary operator error" from multiline ggplot2 command Can a JSON value contain a multiline string PHP multiline string with PHP Pythonic way to create a long multi-line string How do I create a multiline Python string with inline variables? How do you write multiline strings in Go? Multiline TextView in Android? What is the proper way to format a multi-line dict in Python?