[python] Why doesn't my regexp for hyphenated words work?

I'm writing a regular expression to match simple words and single-hyphenated words using Python's re module. For example, given:

test_case_input = """the wide-field infrared survey explorer is a nasa infrared-wavelength space telescope in an earth-orbiting satellite which performed an all-sky astronomical survey. be careful of -tricky tricky- hyphens --- be precise.""" 

it should match:

test_case_output = ['the', 'wide-field', 'infrared', 'survey', 'explorer', 'is', 'a', 'nasa', 'infrared-wavelength', 'space', 'telescope', 'in', 'an', 'earth-orbiting', 'satellite', 'which', 'performed', 'an', 'all-sky', 'astronomical', 'survey', 'be', 'careful', 'of', 'tricky', 'tricky', 'hyphens', 'be', 'precise'] 

I found a regular expression that matches single-hyphenated words, r"[a-z]+-[a-z]+", and another for simple words, r"[a-z]+". I then tried combining them with an alternation, r"[a-z]+-[a-z]+ | [a-z]+", but the output is wrong:

[' wide', ' infrared', ' survey', ' explorer', ' is', ' a', ' nasa',  'infrared-wavelength ', ' telescope', ' in', ' an', ' earth', ' satellite',  ' which', ' an', ' all', ' astronomical', ' survey', ' be', ' careful', ' of',  ' tricky', ' be', ' precise'] 

Using groups, r"(:?[a-z]+-[a-z]+) | (:?[a-z]+)", doesn't work either, and neither does another pattern that I thought should work, r"[a-z]+(:?-[a-z]+)?".

This is obviously possible, but there is something I don't clearly understand. What's wrong?

This question is related to python regex

The answer is


A couple of things:

  1. Your regexes need to be anchored by separators* or you'll match partial words, as is the case now.
  2. You're not using the proper syntax for a non-capturing group. It's written (?:...), not (:?...).

If you address the first problem, you won't need groups at all.

*That is, a blank or beginning/end of string.
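
To see both points concretely, here is a minimal sketch on a shortened sample string (the variable names are illustrative only, not from the question). As far as I can tell, it shows why the posted patterns misbehave: the spaces around | are literal characters in the pattern, and (:? opens an ordinary capturing group that starts with an optional colon, which changes what re.findall returns.

```python
import re

sample = "the wide-field infrared"

# Spaces around "|" are part of the pattern: the first alternative needs a
# trailing space and the second needs a leading space.
spaced = re.findall(r"[a-z]+-[a-z]+ | [a-z]+", sample)

# The same alternation without the literal spaces.
unspaced = re.findall(r"[a-z]+-[a-z]+|[a-z]+", sample)

# "(:?" is a capturing group beginning with an optional ":". With one
# capturing group, findall returns the group's text, not the whole match.
typo_group = re.findall(r"[a-z]+(:?-[a-z]+)?", sample)

# "(?:" is the non-capturing form, so findall returns whole matches.
fixed_group = re.findall(r"[a-z]+(?:-[a-z]+)?", sample)

print(spaced)       # [' wide', ' infrared']
print(unspaced)     # ['the', 'wide-field', 'infrared']
print(typo_group)   # ['', '-field', '']
print(fixed_group)  # ['the', 'wide-field', 'infrared']
```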


You can use this:

r'[a-z]+(?:-[a-z]+)*' 
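
A quick way to check this pattern against the test case from the question is sketched below. Note that the * quantifier also accepts words with more than one hyphen, which goes slightly beyond the "single hyphenated" requirement; swapping it for ? is one possible way to restrict it, shown here only for comparison.

```python
import re

test_case_input = """the wide-field infrared survey explorer is a nasa infrared-wavelength space telescope in an earth-orbiting satellite which performed an all-sky astronomical survey. be careful of -tricky tricky- hyphens --- be precise."""

test_case_output = ['the', 'wide-field', 'infrared', 'survey', 'explorer', 'is', 'a', 'nasa', 'infrared-wavelength', 'space', 'telescope', 'in', 'an', 'earth-orbiting', 'satellite', 'which', 'performed', 'an', 'all-sky', 'astronomical', 'survey', 'be', 'careful', 'of', 'tricky', 'tricky', 'hyphens', 'be', 'precise']

# A run of letters, optionally followed by any number of "-letters" parts.
pattern = re.compile(r"[a-z]+(?:-[a-z]+)*")

words = pattern.findall(test_case_input)
print(words == test_case_output)  # True

# "*" keeps a multi-hyphen word together; "?" stops after the first hyphenated part.
print(pattern.findall("mother-in-law"))                     # ['mother-in-law']
print(re.findall(r"[a-z]+(?:-[a-z]+)?", "mother-in-law"))   # ['mother-in', 'law']
```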

This regex matches the hyphenated words (it requires a hyphen, so it will not match the simple words on its own):

\b[a-z]+-[a-z]+\b 

\b indicates a word-boundary.
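
Here is a small sketch of what \b contributes, using a shortened sample. The pattern above requires a hyphen, so by itself it returns only the hyphenated words; the variant with an optional non-capturing group is an assumption on my part for covering simple words too, not part of the answer as posted.

```python
import re

sample = "an all-sky survey. be careful of -tricky tricky- hyphens --- be precise."

# The pattern as posted: a hyphen is mandatory, so only hyphenated words match.
hyphenated_only = re.findall(r"\b[a-z]+-[a-z]+\b", sample)
print(hyphenated_only)  # ['all-sky']

# Assumed variant: make the "-letters" part optional so simple words match too.
all_words = re.findall(r"\b[a-z]+(?:-[a-z]+)?\b", sample)
print(all_words)
# ['an', 'all-sky', 'survey', 'be', 'careful', 'of', 'tricky', 'tricky',
#  'hyphens', 'be', 'precise']

# What \b adds: a run of letters glued to digits is no longer a partial match.
print(re.findall(r"[a-z]+", "abc123 def"))      # ['abc', 'def']
print(re.findall(r"\b[a-z]+\b", "abc123 def"))  # ['def']
```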