[c++] Regular expression to detect semi-colon terminated C++ for & while loops

In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:

for (int i = 0; i < 10; i++);

... but not this:

for (int i = 0; i < 10; i++)

This looks trivial at first glance, until you realise that the text between the opening and closing parenthesis may contain other parenthesis, for example:

for (int i = funcA(); i < funcB(); i++);

I'm using the python.re module. Right now my regular expression looks like this (I've left my comments in so you can understand it easier):

# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*

This works perfectly for all the above cases, but it breaks as soon as you try and make the third part of the for loop contain a function, like so:

for (int i = 0; i < 10; doSomethingTo(i));

I think it breaks because as soon as you put some text between the opening and closing parenthesis, the "balanced" group matches that contained text, and thus the (?P=balanced) part doesn't work any more since it won't match (due to the fact that the text inside the parenthesis is different).

In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches
    # a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""

REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE| re.VERBOSE)

Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.

This question is related to c++ python regex parsing recursion

The answer is


Another thought that ignores parentheses and treats the for as a construct holding three semicolon-delimited values:

for\s*\([^;]+;[^;]+;[^;]+\)\s*;

This option works even when split over multiple lines (once MULTILINE enabled), but assumes that for ( ... ; ... ; ... ) is the only valid construct, so wouldn't work with a for ( x in y ) construct, or other deviations.

Also assumes that there are no functions containing semi-colons as arguments, such as:

for ( var i = 0; i < ListLen('a;b;c',';') ; i++ );

Whether this is a likely case depends on what you're actually doing this for.


A little late to the party, but I think regular expressions are not the right tool for the job.

The problem is that you'll come across edge cases which would add extranous complexity to the regular expression. @est mentioned an example line:

for (int i = 0; i < 10; doSomethingTo("("));

This string literal contains an (unbalanced!) parenthesis, which breaks the logic. Apparently, you must ignore contents of string literals. In order to do this, you must take the double quotes into account. But string literals itself can contain double quotes. For instance, try this:

for (int i = 0; i < 10; doSomethingTo("\"(\\"));

If you address this using regular expressions, it'll add even more complexity to your pattern.

I think you are better off parsing the language. You could, for instance, use a language recognition tool like ANTLR. ANTLR is a parser generator tool, which can also generate a parser in Python. You must provide a grammar defining the target language, in your case C++. There are already numerous grammars for many languages out there, so you can just grab the C++ grammar.

Then you can easily walk the parser tree, searching for empty statements as while or for loop body.


As Frank suggested, this is best without regex. Here's (an ugly) one-liner:

match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

Matching the troll line est mentioned in his comment:

orig_string = "for (int i = 0; i < 10; doSomethingTo(\"(\"));"
match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

returns (int i = 0; i < 10; doSomethingTo("("))

This works by running through the string forward until it reaches the first open paren, and then backward until it reaches the first closing paren. It then uses these two indices to slice the string.


I wouldn't even pay attention to the contents of the parens.

Just match any line that starts with for and ends with semi-colon:

^\t*for.+;$

Unless you've got for statements split over multiple lines, that will work fine?


This is the kind of thing you really shouldn't do with a regular expression. Just parse the string one character at a time, keeping track of opening/closing parentheses.

If this is all you're looking for, you definitely don't need a full-blown C++ grammar lexer/parser. If you want practice, you can write a little recursive-decent parser, but even that's a bit much for just matching parentheses.


I don't know that regex would handle something like that very well. Try something like this

line = line.Trim();
if(line.StartsWith("for") && line.EndsWith(";")){
    //your code here
}

This is a great example of using the wrong tool for the job. Regular expressions do not handle arbitrarily nested sub-matches very well. What you should do instead is use a real lexer and parser (a grammar for C++ should be easy to find) and look for unexpectedly empty loop bodies.


Greg is absolutely correct. This kind of parsing cannot be done with regular expressions. I suppose it is possible to build some horrendous monstrosity that would work for many cases, but then you'll just run across something that does.

You really need to use more traditional parsing techniques. For example, its pretty simple to write a recursive decent parser to do what you need.


Try this regexp

^\s*(for|while)\s*
\(
(?P<balanced>
[^()]*
|
(?P=balanced)
\)
\s*;\s

I removed the wrapping \( \) around (?P=balanced) and moved the * to behind the any not paren sequence. I have had this work with boost xpressive, and rechecked that website (Xpressive) to refresh my memory.


Examples related to c++

Method Call Chaining; returning a pointer vs a reference? How can I tell if an algorithm is efficient? Difference between opening a file in binary vs text How can compare-and-swap be used for a wait-free mutual exclusion for any shared data structure? Install Qt on Ubuntu #include errors detected in vscode Cannot open include file: 'stdio.h' - Visual Studio Community 2017 - C++ Error How to fix the error "Windows SDK version 8.1" was not found? Visual Studio 2017 errors on standard headers How do I check if a Key is pressed on C++

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?

Examples related to recursion

List all the files and folders in a Directory with PHP recursive function Jquery Ajax beforeSend and success,error & complete Node.js - Maximum call stack size exceeded best way to get folder and file list in Javascript Recursive sub folder search and return files in a list python find all subsets that sum to a particular value jQuery - Uncaught RangeError: Maximum call stack size exceeded Find and Replace string in all files recursive using grep and sed recursion versus iteration Method to get all files within folder and subfolders that will return a list