[java] Using Regular Expressions to Extract a Value in Java

I have several strings in the rough form:

[some text] [some number] [some more text]

I want to extract the text in [some number] using the Java Regex classes.

I know roughly what regular expression I want to use (though all suggestions are welcome). What I'm really interested in are the Java calls to take the regex string and use it on the source data to produce the value of [some number].

EDIT: I should add that I'm only interested in a single [some number] (basically, the first instance). The source strings are short and I'm not going to be looking for multiple occurrences of [some number].

This question is related to java regex

The answer is


if you are reading from file then this can help you

              try{
             InputStream inputStream = (InputStream) mnpMainBean.getUploadedBulk().getInputStream();
             BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
             String line;
             //Ref:03
             while ((line = br.readLine()) != null) {
                if (line.matches("[A-Z],\\d,(\\d*,){2}(\\s*\\d*\\|\\d*:)+")) {
                     String[] splitRecord = line.split(",");
                     //do something
                 }
                 else{
                     br.close();
                     //error
                     return;
                 }
             }
                br.close();

             }
         }
         catch (IOException  ioExpception){
             logger.logDebug("Exception " + ioExpception.getStackTrace());
         }

How about [^\\d]*([0-9]+[\\s]*[.,]{0,1}[\\s]*[0-9]*).* I think it would take care of numbers with fractional part. I included white spaces and included , as possible separator. I'm trying to get the numbers out of a string including floats and taking into account that the user might make a mistake and include white spaces while typing the number.


This function collect all matching sequences from string. In this example it takes all email addresses from string.

static final String EMAIL_PATTERN = "[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
        + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})";

public List<String> getAllEmails(String message) {      
    List<String> result = null;
    Matcher matcher = Pattern.compile(EMAIL_PATTERN).matcher(message);

    if (matcher.find()) {
        result = new ArrayList<String>();
        result.add(matcher.group());

        while (matcher.find()) {
            result.add(matcher.group());
        }
    }

    return result;
}

For message = "[email protected], <[email protected]>>>> [email protected]" it will create List of 3 elements.


Simple Solution

// Regexplanation:
// ^       beginning of line
// \\D+    1+ non-digit characters
// (\\d+)  1+ digit characters in a capture group
// .*      0+ any character
String regexStr = "^\\D+(\\d+).*";

// Compile the regex String into a Pattern
Pattern p = Pattern.compile(regexStr);

// Create a matcher with the input String
Matcher m = p.matcher(inputStr);

// If we find a match
if (m.find()) {
    // Get the String from the first capture group
    String someDigits = m.group(1);
    // ...do something with someDigits
}

Solution in a Util Class

public class MyUtil {
    private static Pattern pattern = Pattern.compile("^\\D+(\\d+).*");
    private static Matcher matcher = pattern.matcher("");

    // Assumptions: inputStr is a non-null String
    public static String extractFirstNumber(String inputStr){
        // Reset the matcher with a new input String
        matcher.reset(inputStr);

        // Check if there's a match
        if(matcher.find()){
            // Return the number (in the first capture group)
            return matcher.group(1);
        }else{
            // Return some default value, if there is no match
            return null;
        }
    }
}

...

// Use the util function and print out the result
String firstNum = MyUtil.extractFirstNumber("Testing4234Things");
System.out.println(firstNum);

Sometimes you can use simple .split("REGEXP") method available in java.lang.String. For example:

String input = "first,second,third";

//To retrieve 'first' 
input.split(",")[0] 
//second
input.split(",")[1]
//third
input.split(",")[2]

How about [^\\d]*([0-9]+[\\s]*[.,]{0,1}[\\s]*[0-9]*).* I think it would take care of numbers with fractional part. I included white spaces and included , as possible separator. I'm trying to get the numbers out of a string including floats and taking into account that the user might make a mistake and include white spaces while typing the number.


Allain basically has the java code, so you can use that. However, his expression only matches if your numbers are only preceded by a stream of word characters.

"(\\d+)"

should be able to find the first string of digits. You don't need to specify what's before it, if you're sure that it's going to be the first string of digits. Likewise, there is no use to specify what's after it, unless you want that. If you just want the number, and are sure that it will be the first string of one or more digits then that's all you need.

If you expect it to be offset by spaces, it will make it even more distinct to specify

"\\s+(\\d+)\\s+"

might be better.

If you need all three parts, this will do:

"(\\D+)(\\d+)(.*)"

EDIT The Expressions given by Allain and Jack suggest that you need to specify some subset of non-digits in order to capture digits. If you tell the regex engine you're looking for \d then it's going to ignore everything before the digits. If J or A's expression fits your pattern, then the whole match equals the input string. And there's no reason to specify it. It probably slows a clean match down, if it isn't totally ignored.


Pattern p = Pattern.compile("(\\D+)(\\d+)(.*)");
Matcher m = p.matcher("this is your number:1234 thank you");
if (m.find()) {
    String someNumberStr = m.group(2);
    int someNumberInt = Integer.parseInt(someNumberStr);
}

In Java 1.4 and up:

String input = "...";
Matcher matcher = Pattern.compile("[^0-9]+([0-9]+)[^0-9]+").matcher(input);
if (matcher.find()) {
    String someNumberStr = matcher.group(1);
    // if you need this to be an int:
    int someNumberInt = Integer.parseInt(someNumberStr);
}

if you are reading from file then this can help you

              try{
             InputStream inputStream = (InputStream) mnpMainBean.getUploadedBulk().getInputStream();
             BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
             String line;
             //Ref:03
             while ((line = br.readLine()) != null) {
                if (line.matches("[A-Z],\\d,(\\d*,){2}(\\s*\\d*\\|\\d*:)+")) {
                     String[] splitRecord = line.split(",");
                     //do something
                 }
                 else{
                     br.close();
                     //error
                     return;
                 }
             }
                br.close();

             }
         }
         catch (IOException  ioExpception){
             logger.logDebug("Exception " + ioExpception.getStackTrace());
         }

Look you can do it using StringTokenizer

String str = "as:"+123+"as:"+234+"as:"+345;
StringTokenizer st = new StringTokenizer(str,"as:");

while(st.hasMoreTokens())
{
  String k = st.nextToken();    // you will get first numeric data i.e 123
  int kk = Integer.parseInt(k);
  System.out.println("k string token in integer        " + kk);

  String k1 = st.nextToken();   //  you will get second numeric data i.e 234
  int kk1 = Integer.parseInt(k1);
  System.out.println("new string k1 token in integer   :" + kk1);

  String k2 = st.nextToken();   //  you will get third numeric data i.e 345
  int kk2 = Integer.parseInt(k2);
  System.out.println("k2 string token is in integer   : " + kk2);
}

Since we are taking these numeric data into three different variables we can use this data anywhere in the code (for further use)


Simple Solution

// Regexplanation:
// ^       beginning of line
// \\D+    1+ non-digit characters
// (\\d+)  1+ digit characters in a capture group
// .*      0+ any character
String regexStr = "^\\D+(\\d+).*";

// Compile the regex String into a Pattern
Pattern p = Pattern.compile(regexStr);

// Create a matcher with the input String
Matcher m = p.matcher(inputStr);

// If we find a match
if (m.find()) {
    // Get the String from the first capture group
    String someDigits = m.group(1);
    // ...do something with someDigits
}

Solution in a Util Class

public class MyUtil {
    private static Pattern pattern = Pattern.compile("^\\D+(\\d+).*");
    private static Matcher matcher = pattern.matcher("");

    // Assumptions: inputStr is a non-null String
    public static String extractFirstNumber(String inputStr){
        // Reset the matcher with a new input String
        matcher.reset(inputStr);

        // Check if there's a match
        if(matcher.find()){
            // Return the number (in the first capture group)
            return matcher.group(1);
        }else{
            // Return some default value, if there is no match
            return null;
        }
    }
}

...

// Use the util function and print out the result
String firstNum = MyUtil.extractFirstNumber("Testing4234Things");
System.out.println(firstNum);

In Java 1.4 and up:

String input = "...";
Matcher matcher = Pattern.compile("[^0-9]+([0-9]+)[^0-9]+").matcher(input);
if (matcher.find()) {
    String someNumberStr = matcher.group(1);
    // if you need this to be an int:
    int someNumberInt = Integer.parseInt(someNumberStr);
}

Try doing something like this:

Pattern p = Pattern.compile("^.+(\\d+).+");
Matcher m = p.matcher("Testing123Testing");

if (m.find()) {
    System.out.println(m.group(1));
}

This function collect all matching sequences from string. In this example it takes all email addresses from string.

static final String EMAIL_PATTERN = "[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
        + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})";

public List<String> getAllEmails(String message) {      
    List<String> result = null;
    Matcher matcher = Pattern.compile(EMAIL_PATTERN).matcher(message);

    if (matcher.find()) {
        result = new ArrayList<String>();
        result.add(matcher.group());

        while (matcher.find()) {
            result.add(matcher.group());
        }
    }

    return result;
}

For message = "[email protected], <[email protected]>>>> [email protected]" it will create List of 3 elements.


Look you can do it using StringTokenizer

String str = "as:"+123+"as:"+234+"as:"+345;
StringTokenizer st = new StringTokenizer(str,"as:");

while(st.hasMoreTokens())
{
  String k = st.nextToken();    // you will get first numeric data i.e 123
  int kk = Integer.parseInt(k);
  System.out.println("k string token in integer        " + kk);

  String k1 = st.nextToken();   //  you will get second numeric data i.e 234
  int kk1 = Integer.parseInt(k1);
  System.out.println("new string k1 token in integer   :" + kk1);

  String k2 = st.nextToken();   //  you will get third numeric data i.e 345
  int kk2 = Integer.parseInt(k2);
  System.out.println("k2 string token is in integer   : " + kk2);
}

Since we are taking these numeric data into three different variables we can use this data anywhere in the code (for further use)


In Java 1.4 and up:

String input = "...";
Matcher matcher = Pattern.compile("[^0-9]+([0-9]+)[^0-9]+").matcher(input);
if (matcher.find()) {
    String someNumberStr = matcher.group(1);
    // if you need this to be an int:
    int someNumberInt = Integer.parseInt(someNumberStr);
}

Sometimes you can use simple .split("REGEXP") method available in java.lang.String. For example:

String input = "first,second,third";

//To retrieve 'first' 
input.split(",")[0] 
//second
input.split(",")[1]
//third
input.split(",")[2]

Allain basically has the java code, so you can use that. However, his expression only matches if your numbers are only preceded by a stream of word characters.

"(\\d+)"

should be able to find the first string of digits. You don't need to specify what's before it, if you're sure that it's going to be the first string of digits. Likewise, there is no use to specify what's after it, unless you want that. If you just want the number, and are sure that it will be the first string of one or more digits then that's all you need.

If you expect it to be offset by spaces, it will make it even more distinct to specify

"\\s+(\\d+)\\s+"

might be better.

If you need all three parts, this will do:

"(\\D+)(\\d+)(.*)"

EDIT The Expressions given by Allain and Jack suggest that you need to specify some subset of non-digits in order to capture digits. If you tell the regex engine you're looking for \d then it's going to ignore everything before the digits. If J or A's expression fits your pattern, then the whole match equals the input string. And there's no reason to specify it. It probably slows a clean match down, if it isn't totally ignored.


In addition to Pattern, the Java String class also has several methods that can work with regular expressions, in your case the code will be:

"ab123abc".replaceFirst("\\D*(\\d*).*", "$1")

where \\D is a non-digit character.


Try doing something like this:

Pattern p = Pattern.compile("^.+(\\d+).+");
Matcher m = p.matcher("Testing123Testing");

if (m.find()) {
    System.out.println(m.group(1));
}

Pattern p = Pattern.compile("(\\D+)(\\d+)(.*)");
Matcher m = p.matcher("this is your number:1234 thank you");
if (m.find()) {
    String someNumberStr = m.group(2);
    int someNumberInt = Integer.parseInt(someNumberStr);
}

Allain basically has the java code, so you can use that. However, his expression only matches if your numbers are only preceded by a stream of word characters.

"(\\d+)"

should be able to find the first string of digits. You don't need to specify what's before it, if you're sure that it's going to be the first string of digits. Likewise, there is no use to specify what's after it, unless you want that. If you just want the number, and are sure that it will be the first string of one or more digits then that's all you need.

If you expect it to be offset by spaces, it will make it even more distinct to specify

"\\s+(\\d+)\\s+"

might be better.

If you need all three parts, this will do:

"(\\D+)(\\d+)(.*)"

EDIT The Expressions given by Allain and Jack suggest that you need to specify some subset of non-digits in order to capture digits. If you tell the regex engine you're looking for \d then it's going to ignore everything before the digits. If J or A's expression fits your pattern, then the whole match equals the input string. And there's no reason to specify it. It probably slows a clean match down, if it isn't totally ignored.


Allain basically has the java code, so you can use that. However, his expression only matches if your numbers are only preceded by a stream of word characters.

"(\\d+)"

should be able to find the first string of digits. You don't need to specify what's before it, if you're sure that it's going to be the first string of digits. Likewise, there is no use to specify what's after it, unless you want that. If you just want the number, and are sure that it will be the first string of one or more digits then that's all you need.

If you expect it to be offset by spaces, it will make it even more distinct to specify

"\\s+(\\d+)\\s+"

might be better.

If you need all three parts, this will do:

"(\\D+)(\\d+)(.*)"

EDIT The Expressions given by Allain and Jack suggest that you need to specify some subset of non-digits in order to capture digits. If you tell the regex engine you're looking for \d then it's going to ignore everything before the digits. If J or A's expression fits your pattern, then the whole match equals the input string. And there's no reason to specify it. It probably slows a clean match down, if it isn't totally ignored.


In Java 1.4 and up:

String input = "...";
Matcher matcher = Pattern.compile("[^0-9]+([0-9]+)[^0-9]+").matcher(input);
if (matcher.find()) {
    String someNumberStr = matcher.group(1);
    // if you need this to be an int:
    int someNumberInt = Integer.parseInt(someNumberStr);
}

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex1 {
    public static void main(String[]args) {
        Pattern p = Pattern.compile("\\d+");
        Matcher m = p.matcher("hello1234goodboy789very2345");
        while(m.find()) {
            System.out.println(m.group());
        }
    }
}

Output:

1234
789
2345

In addition to Pattern, the Java String class also has several methods that can work with regular expressions, in your case the code will be:

"ab123abc".replaceFirst("\\D*(\\d*).*", "$1")

where \\D is a non-digit character.