I have technical strings such as the following:
"The thing P1 must connect to the J236 thing in the Foo position."
I would like to match, with a regular expression, the words that are entirely uppercase (here, P1 and J236). The problem is that I don't want to match the first letter of the sentence when it is a one-letter word.
For example, in:
"A thing P1 must connect ..."
I want P1 only, not A and P1. By doing that, I know that I can miss a real "word" (like in "X must connect to Y") but I can live with it.
Additionally, I don't want to match uppercase words if the sentence is all uppercase.
Example:
"THING P1 MUST CONNECT TO X2."
Of course, ideally, I would like to match the technical words P1 and X2 here, but since they are "hidden" in the all-uppercase sentence and since these technical words have no specific pattern, it's impossible. Again, I can live with it because all-uppercase sentences are not so frequent in my files.
Thanks!
Maybe you can run this regex first to see if the line is all caps:
^[A-Z \d\W]+$
That will match only if it's a line like THING P1 MUST CONNECT TO X2.
Otherwise, you should be able to pull out the individual uppercase phrases with this:
[A-Z][A-Z\d]+
That should match "P1" and "J236" in The thing P1 must connect to the J236 thing in the Foo position.
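Here is a minimal Python sketch of the two-step idea above, in case it helps to see it end to end. The function name `technical_words` is made up for illustration, and the `\b` word boundaries around the second pattern are my own addition (without them the pattern could match the uppercase part of a mixed-case word such as "FOo"). Since the second pattern needs at least two characters, a one-letter first word like "A" is never matched.

```python
import re

# Step 1: the all-caps check from the answer. A line passes only if it
# contains no lowercase letters (uppercase, digits, spaces, punctuation).
ALL_CAPS_LINE = re.compile(r"^[A-Z \d\W]+$")

# Step 2: the uppercase-word pattern from the answer, wrapped in \b...\b
# (an assumption added here, not part of the original pattern).
UPPER_WORD = re.compile(r"\b[A-Z][A-Z\d]+\b")


def technical_words(line):
    """Return the uppercase words in line, or [] for an all-caps line."""
    if ALL_CAPS_LINE.match(line):
        return []
    return UPPER_WORD.findall(line)
```

With this sketch, the question's first example yields P1 and J236, while the all-uppercase sentence yields nothing.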
Don't do things like [A-Z] or [0-9]. Use \p{Lu} and \d instead. Of course, this is only valid for Perl-based regex flavours, which include Java.
I would suggest that you don't write one huge regex. First split the text into sentences, then tokenize each sentence (split it into words). Use a regex to check each token/word. Skip the first token of the sentence. Check beforehand whether all tokens are uppercase and, if so, skip the whole sentence or alter the regex for that case.
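The steps above could be sketched in Python roughly as follows. Everything named here (`uppercase_terms`, the naive sentence splitter, the punctuation stripped from tokens) is an assumption for illustration. One deliberate variation: the first token is skipped only when it is a single character, which is the rule from the question, rather than unconditionally. Python's built-in `re` does not support \p{Lu}, so the sketch falls back to [A-Z].

```python
import re

# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# A token qualifies if it is made of uppercase letters and digits only
# and contains at least one letter (so plain numbers are ignored).
UPPER_TOKEN = re.compile(r"[A-Z0-9]*[A-Z][A-Z0-9]*")


def uppercase_terms(text):
    """Collect uppercase tokens sentence by sentence."""
    found = []
    for sentence in SENTENCE_END.split(text.strip()):
        # Skip sentences without any lowercase letter (all-caps sentences).
        if not any(c.islower() for c in sentence):
            continue
        # Tokenize on whitespace and strip surrounding punctuation.
        tokens = [t.strip(".,;:!?()\"'") for t in sentence.split()]
        tokens = [t for t in tokens if t]
        # Drop the first token only when it is a one-character word.
        if tokens and len(tokens[0]) == 1:
            tokens = tokens[1:]
        found.extend(t for t in tokens if UPPER_TOKEN.fullmatch(t))
    return found
```

Because each rule is a separate line of code, changing one of them later (for example, skipping the first token unconditionally) does not require touching the regex.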
Why do you need to do this in one monster-regex? You can use actual code to implement some of these rules, and doing so would be much easier to modify if those requirements change later.
For example:
# Sentence is all uppercase (only capitals, digits, whitespace and punctuation), so just fail out
return 0 if /^[A-Z0-9\s[:punct:]]*$/;
# Carry on with matching uppercase terms
For the first case you propose you can use: '[[:blank:]]+[A-Z0-9]+[[:blank:]]+', for example:
echo "The thing P1 must connect to the J236 thing in the Foo position" | grep -oE '[[:blank:]]+[A-Z0-9]+[[:blank:]]+'
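For comparison, here is a rough Python equivalent of that grep pattern. It is only a sketch: `[ \t]` stands in for `[[:blank:]]`, and the blanks are checked with lookarounds (my change) so that they are not part of the match and two adjacent terms can both be found. The function name `blank_delimited_terms` is hypothetical.

```python
import re

# Uppercase letters/digits that have a blank on both sides. The blanks are
# asserted with lookarounds, so they are not included in the result.
BLANK_DELIMITED = re.compile(r"(?<=[ \t])[A-Z0-9]+(?=[ \t])")


def blank_delimited_terms(line):
    """Return uppercase/digit runs that are surrounded by blanks."""
    return BLANK_DELIMITED.findall(line)
```

Two consequences of requiring a blank on each side: the first word of the line is never matched (which happens to suit the one-letter "A" case), and a term directly followed by punctuation or the end of the line is missed. Plain numbers are matched too, since the class allows digits only.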
In the second case maybe you need to use something else and not a regex, maybe a script with a dictionary of technical words...
Cheers, Fernando
I'm not a regex guru by any means. But try:
\<[A-Z0-9][A-Z0-9]+\>
\< start of word (GNU grep/vi style; Perl-style flavours use \b instead)
[A-Z0-9] one uppercase letter or digit
[A-Z0-9]+ followed by one or more of them
\> end of word
I won't try for the bonus points of the whole upper case sentence. hehe
Source: Stackoverflow.com