[python] How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?

So essentially I'm looking for specifically a 4 digit code within two angle brackets within a text file. I know that I need to open the text file and then parse line by line, but I am not sure the best way to go about structuring my code after checking "for line in file".

I think I can either somehow split it, strip it, or partition, but I also wrote a regex which I used compile on and so if that returns a match object I don't think I can use that with those string based operations. Also I'm not sure whether my regex is greedy enough or not...

I'd like to store all instances of those found hits as strings within either a tuple or a list.

Here is my regex:

regex = re.compile("(<(\d{4,5})>)?")

I don't think I need to include all that much code considering its fairly basic so far.

This question is related to python regex file-io text-mining string-parsing

The answer is


import re
pattern = re.compile("<(\d{4,5})>")

for i, line in enumerate(open('test.txt')):
    for match in re.finditer(pattern, line):
        print 'Found on line %s: %s' % (i+1, match.group())

A couple of notes about the regex:

  • You don't need the ? at the end and the outer (...) if you don't want to match the number with the angle brackets, but only want the number itself
  • It matches either 4 or 5 digits between the angle brackets

Update: It's important to understand that the match and capture in a regex can be quite different. The regex in my snippet above matches the pattern with angle brackets, but I ask to capture only the internal number, without the angle brackets.

More about regex in python can be found here : Regular Expression HOWTO


Doing it in one bulk read:

import re

textfile = open(filename, 'r')
filetext = textfile.read()
textfile.close()
matches = re.findall("(<(\d{4,5})>)?", filetext)

Line by line:

import re

textfile = open(filename, 'r')
matches = []
reg = re.compile("(<(\d{4,5})>)?")
for line in textfile:
    matches += reg.findall(line)
textfile.close()

But again, the matches that returns will not be useful for anything except counting unless you added an offset counter:

import re

textfile = open(filename, 'r')
matches = []
offset = 0
reg = re.compile("(<(\d{4,5})>)?")
for line in textfile:
    matches += [(reg.findall(line),offset)]
    offset += len(line)
textfile.close()

But it still just makes more sense to read the whole file in at once.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to file-io

Python, Pandas : write content of DataFrame into text File Saving response from Requests to file How to while loop until the end of a file in Python without checking for empty line? Getting "java.nio.file.AccessDeniedException" when trying to write to a folder How do I add a resources folder to my Java project in Eclipse Read and write a String from text file Python Pandas: How to read only first n rows of CSV files in? Open files in 'rt' and 'wt' modes How to write to a file without overwriting current contents? Write objects into file with Node.js

Examples related to text-mining

How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern? What is "entropy and information gain"?

Examples related to string-parsing

Convert String to Float in Swift Parse time of format hh:mm:ss Counting the number of occurences of characters in a string How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern? How do I parse a URL query parameters, in Javascript? Convert String with Dot or Comma as decimal separator to number in JavaScript How do I extract a substring from a string until the second space is encountered? How to validate an email address using a regular expression?