Can anyone explain the difference between \b
and \w
regular expression metacharacters? It is my understanding that both these metacharacters are used for word boundaries. Apart from this, which meta character is efficient for multilingual content?
This question is related to
javascript
java
php
regex
perl
\w
matches a word character. \b
is a zero-width match that matches a position character that has a word character on one side, and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, beginning and end of the string, etc.)
\w
matches a
, b
, c
, d
, e
, and f
in "abc def"
\b
matches the (zero-width) position before a
, after c
, before d
, and after f
in "abc def"
\w
is not a word boundary, it matches any word character, including underscores: [a-zA-Z0-9_]
. \b
is a word boundary, that is, it matches the position between a word and a non-alphanumeric character: \W
or [^\w]
.
These implementations may vary from language to language though.
\b <= this is a word boundary.
Matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
\w <= stands for "word character".
It always matches the ASCII characters [A-Za-z0-9_]
Is there anything specific you are trying to match?
Some useful regex websites for beginners or just to wet your appetite.
I found this to be a very useful book:
@Mahender, you probably meant the difference between \W
(instead of \w
) and \b
. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W
would match any non-word character and so its easy to try to use it to match word boundaries. The problem is that it will not match the start or end of a line. \b
is more suited for matching word boundaries as it will also match the start or end of a line. Roughly speaking (more experienced users can correct me here) \b
can be thought of as (\W|^|$)
. [Edit: as @?mega mentions below, \b
is a zero-length match so (\W|^|$)
is not strictly correct, but hopefully helps explain the diff]
Quick example: For the string Hello World
, .+\W
would match Hello_
(with the space) but will not match World
. .+\b
would match both Hello
and World
.
Source: Stackoverflow.com