What is a word boundary in regex, does \b match hyphen '-'?

Question

I'm trying to use regexes to match space-separated numbers. I can't find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but it appears that this does not work. I'd be grateful to know of ways of doing this.

I am using Java regexes in Java 1.6.

Example:

    Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
    String plus = " 12 ";
    System.out.println("" + pattern.matcher(plus).matches());

    String minus = " -12 ";
    System.out.println("" + pattern.matcher(minus).matches());

    pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
    System.out.println("" + pattern.matcher(minus).matches());

This returns:

    true
    false
    true

User · Answer

I would like to explain Alan Moore's answer:

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Suppose I have the string "This is a cat, and she's awesome", and I am supposed to replace every occurrence of the letter 'a', but only where that letter sits at the boundary of a word; i.e. the letter 'a' inside 'cat' should not be replaced.

So I'll perform the regex (in Python) as:

    re.sub(r"\ba", "e", myString.strip())  # replace a with e

The pattern must be a raw string; in a plain Python string literal, "\b" is a backspace character rather than a word boundary. The output will be:

    This is e cat, end she's ewesome

User · Answer

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

User · Answer

When you use \\b(\\w+)+\\b, that means an exact match with a word containing only word characters ([a-zA-Z0-9_]). In your case, for example, setting \\b at the beginning of the regex will accept -12 (with space), but again it won't accept -12 (without space). For reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html

User · Answer

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
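To see those positions concretely, here is a minimal Java sketch that lists every offset at which \b matches. The class and method names are made up for illustration; only `Pattern` and `Matcher` come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Returns every offset in the input at which \b matches.
    // (Hypothetical helper, not part of any library.)
    public static List<Integer> boundaries(String input) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            offsets.add(m.start()); // \b is zero-width, so start() == end()
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Offset 1 is between '-' and '1'; offset 3 is after the '2'.
        System.out.println(boundaries("-12"));  // [1, 3]
        System.out.println(boundaries(" 12 ")); // [1, 3]
    }
}
```

Note that offset 0 in "-12" is not reported: the position before the dash has a non-word character on one side and the start of the string on the other.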

User · Answer

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl), O'Reilly.

\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
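As a quick sanity check of that equivalence, the following Java sketch compares the offsets matched by \b with those matched by the lookaround form. It is only an illustration on a few plain-ASCII sample strings (the class name, helper names, and samples are assumptions of this sketch); it is not a proof, and non-ASCII input may behave differently depending on the Java version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryEquivalence {
    static final Pattern BUILT_IN = Pattern.compile("\\b");
    // The lookaround form quoted above, as a Java string literal.
    static final Pattern LOOKAROUND =
            Pattern.compile("(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w)");

    // Collects the start offset of every (zero-width) match.
    public static List<Integer> offsets(Pattern p, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    // True if both patterns match at exactly the same offsets.
    public static boolean samePositions(String input) {
        return offsets(BUILT_IN, input).equals(offsets(LOOKAROUND, input));
    }

    public static void main(String[] args) {
        String[] samples = { "-12", " 12 ", "foo_bar baz-1", "" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + samePositions(s));
        }
    }
}
```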

User · Answer

I think it's the boundary (i.e. character following) of the last match, or the beginning or end of the string.

User · Answer

A word boundary can occur in one of three positions:

1. Before the first character in the string, if the first character is a word character.
2. After the last character in the string, if the last character is a word character.
3. Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alpha-numeric; a minus sign is not. Taken from Regex Tutorial.
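The three rules above can be sketched by hand without any regex at all. The Java below is a minimal illustration, assuming ASCII word characters only and counting the underscore as a word character, as \w does; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualBoundary {
    // ASCII word characters: letters, digits, and underscore
    // (an assumption of this sketch, mirroring [A-Za-z0-9_]).
    public static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // True if offset i (0..s.length()) is a word boundary.
    // "before" is false at the start of the string and "after" is false
    // at the end, so one comparison covers all three rules.
    public static boolean isBoundary(String s, int i) {
        boolean before = i > 0 && isWordChar(s.charAt(i - 1));
        boolean after = i < s.length() && isWordChar(s.charAt(i));
        return before != after;
    }

    // Lists every boundary offset in the string.
    public static List<Integer> boundaryOffsets(String s) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            if (isBoundary(s, i)) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(boundaryOffsets("-12")); // [1, 3]
        System.out.println(boundaryOffsets("ab"));  // [0, 2]
    }
}
```

For "-12" the only boundaries are before the 1 and after the 2, which is why a pattern that puts \b in front of the minus sign cannot match there.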

User · Answer

Word boundary \b is used where one side of the position is a word character and the other side is a non-word character. The regular expression for a negative number should be --?\b\d+\b

User · Answer

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.

Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.

Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:

    public static String grep(String regexp, String multiLineStringToSearch) {
        String result = "";
        String[] lines = multiLineStringToSearch.split("\\n");
        Pattern pattern = Pattern.compile(regexp);
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            if (matcher.find()) {
                result = result + "\n" + line;
            }
        }
        return result.trim();
    }

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
    String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text=" + text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET=" + grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET=" + grep(beforeWord + "\\.NET" + afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#=" + grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#=" + grep("C#" + afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C++=" + grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++=" + grep(beforeWord + "C\\+\\+" + afterWord, text));

    System.out.println("Should find: grep with word boundary for Java=" + grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java=" + grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C=" + grep("\\bC\\b", text)); // Works OK for this example, but see below

    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text=" + text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C=" + grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C=" + grep(beforeWord + "C" + afterWord, text));

    // Make sure the first and last cases work OK.
    text = "C is a language that should have been named differently.";
    System.out.println("text=" + text);
    System.out.println("grep exactly for C=" + grep(beforeWord + "C" + afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text=" + text);
    System.out.println("grep exactly for C=" + grep(beforeWord + "C" + afterWord, text));

    // Make sure we don't get false positives.
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text=" + text);
    System.out.println("Should be blank: grep exactly for C=" + grep(beforeWord + "C" + afterWord, text));

P.S. My thanks to http://regexpal.com/, without whom the regex world would be very miserable!

User · Answer

I talk about what \b-style regex boundaries actually are here.

The short story is that they're conditional. Their behavior depends on what they're next to.

    # same as using a \b before:
    (?(?=\w) (?<!\w) | (?<!\W) )

    # same as using a \b after:
    (?(?<=\w) (?!\w) | (?!\W) )

Sometimes that isn't what you want. See my other answer for elaboration.

User · Answer

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that a word boundary is a zero-width match.

One possible alternative is

    (?:(?:^|\s)-?)\d+\b

This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
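A minimal Java sketch of how that alternative pattern behaves with `Matcher.find()` follows. The class name, method name, and sample inputs are assumptions made for illustration. Because the pattern consumes the leading whitespace character, the sketch trims each match before collecting it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SignedNumberFinder {
    // The alternative pattern from the answer, as a Java string literal.
    static final Pattern NUMBER = Pattern.compile("(?:(?:^|\\s)-?)\\d+\\b");

    // Collects every match; trim() drops the leading whitespace
    // character that the pattern itself consumes.
    public static List<String> findNumbers(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers(" -12 "));     // [-12]
        // "x-3" is skipped: its digits are not preceded by
        // whitespace (plus optional dash) or the start of the string.
        System.out.println(findNumbers("12 -7 x-3")); // [12, -7]
    }
}
```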

User · Answer

In the course of learning regular expressions, I was really stuck on the metacharacter \b. I didn't comprehend its meaning and kept asking myself "what is it, what is it?" over and over. After some attempts using the website, I noticed the pink vertical dashes at the beginning and at the end of every word, and at that point I understood its meaning: it is exactly the word (\w) boundary. My view is merely meant to build intuition; the logic behind it should be examined in the other answers.

User · Answer

Check out the documentation on boundary conditions:

http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

Check out this sample:

    public static void main(final String[] args) {
        String x = "I found the value -12 in my string.";
        System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
    }

When you print it out, notice that the output is this:

    [I found the value -,  in my string.]

This means that the "-" character is not being picked up as being on the boundary of a word, because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
