[java] Java regex capturing groups indexes

I have the following line,

typeName="ABC:xxxxx;";

I need to fetch the word ABC,

I wrote the following code snippet,

Pattern pattern4=Pattern.compile("(.*):");
matcher=pattern4.matcher(typeName);

String nameStr="";
if(matcher.find())
{
    nameStr=matcher.group(1);

}

So if I put group(0) I get ABC: but if I put group(1) it is ABC, so I want to know

  1. What does this 0 and 1 mean? It will be better if anyone can explain me with good examples.

  2. The regex pattern contains a : in it, so why group(1) result omits that? Does group 1 detects all the words inside the parenthesis?

  3. So, if I put two more parenthesis such as, \\s*(\d*)(.*): then, will be there two groups? group(1) will return the (\d*) part and group(2) return the (.*) part?

The code snippet was given in a purpose to clear my confusions. It is not the code I am dealing with. The code given above can be done with String.split() in a much easier way.

This question is related to java regex

The answer is


Parenthesis () are used to enable grouping of regex phrases.

The group(1) contains the string that is between parenthesis (.*) so .* in this case

And group(0) contains whole matched string.

If you would have more groups (read (...) ) it would be put into groups with next indexes (2, 3 and so on).


For The Rest Of Us

Here is a simple and clear example of how this works

Regex: ([a-zA-Z0-9]+)([\s]+)([a-zA-Z ]+)([\s]+)([0-9]+)

String: "!* UserName10 John Smith 01123 *!"

group(0): UserName10 John Smith 01123
group(1): UserName10
group(2):  
group(3): John Smith
group(4):  
group(5): 01123

As you can see, I have created FIVE groups which are each enclosed in parentheses.

I included the !* and *! on either side to make it clearer. Note that none of those characters are in the RegEx and therefore will not be produced in the results. Group(0) merely gives you the entire matched string (all of my search criteria in one single line). Group 1 stops right before the first space because the space character was not included in the search criteria. Groups 2 and 4 are simply the white space, which in this case is literally a space character, but could also be a tab or a line feed etc. Group 3 includes the space because I put it in the search criteria ... etc.

Hope this makes sense.