Regular expression for excluding special characters

Question

I am having trouble coming up with a regular expression which would essentially blacklist certain special characters. I need to use this to validate data in input fields (in a Java web app). We want to allow users to enter any digit or letter (we need to include accented characters, e.g. French or German) and some special characters such as - etc. How do I blacklist characters such as < etc.?

User · Answer

It's usually better to whitelist characters you allow, rather than to blacklist characters you don't allow, both from a security standpoint and from an ease-of-implementation standpoint.

If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.

http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07

If you want to whitelist all the accented characters, perhaps using Unicode ranges would help? Check out this link.

http://www.regular-expressions.info/unicode.html

User · Answer

The negated set of everything that is not alphanumeric and underscore, for ASCII chars: /[^\W]/g. For email or username validation I've used an expression that allows 4 standard special characters (- among them) in a character class together with a-z0-9, with the gi flags. For a strict alphanumeric-only expression, use a character class of just a-z0-9, again with the gi flags. Test at RegExr.com.

User · Answer

I guess it depends what language you are targeting. In general, something like this should work:

    [^<>...]

The [] construct defines a character class, which will match any of the listed characters. Putting ^ as the first character negates the match, i.e. any character OTHER than one of those listed. You may need to escape some of the characters within the [], depending on what language/regex engine you are using.
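A minimal Java sketch of the negated character class idea, assuming a hypothetical blacklist of <, >, % and $. The class and method names are made up for illustration, and the blacklist is not one prescribed by the answer above.

```java
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // Assumed blacklist for illustration only: <, >, % and $.
    // Inside a character class, $ and % are ordinary characters in Java.
    private static final Pattern ONLY_ALLOWED = Pattern.compile("[^<>%$]*");

    // Returns true when every character of the input is outside the blacklist.
    static boolean isClean(String input) {
        // matches() requires the whole input to match the pattern.
        return ONLY_ALLOWED.matcher(input).matches();
    }
}
```

Because the class is negated, accented letters and any other character not listed pass without having to be enumerated.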

User · Answer

Even in 2009, it seems too many had a very limited idea of what designing for the WORLDWIDE web involved. In 2015, unless designing for a specific country, a blacklist is the only way to accommodate the vast number of characters that may be valid. The characters to blacklist then need to be chosen according to what is illegal for the purpose for which the data is required.

However, sometimes it pays to break down the requirements and handle each separately. Here look-ahead is your friend. Look-aheads are sections bounded by (?=...) for positive and (?!...) for negative, and they effectively become AND blocks: a look-ahead consumes no text, so when a block has been processed without failing, the regex processor begins again at the start of the text with the next block. Effectively, each look-ahead block is preceded by the ^ and, if its pattern is greedy, includes everything up to the $. Even the ancient VB6/VBA (Office) 5.5 regex engine supports look-ahead.

So, to build up a full regular expression, start with the look-ahead blocks, then add the blacklisted character block before the final $.

For example, to limit the total number of characters to, say, between 3 and 15 inclusive, start with the positive look-ahead block (?=^.{3,15}$). Note that this needed its own ^ and $ to ensure that it covered all the text.

Now, while you might want to allow _ and -, you may not want to start or end with them, so add the two negative look-ahead blocks (?!^[_-].+) for starts and (?!.+[_-]$) for ends.

If you don't want multiple _ and -, add a negative look-ahead block of (?!.*[_-]{2,}). This will also exclude _- and -_ sequences.

If there are no more look-ahead blocks, then add the blacklist block before the $, such as [^<>...\0-\cZ]+, where the \0-\cZ excludes null and control characters, including NL (\n) and CR (\r). The final + ensures that all the text is greedily included.

Within the Unicode domain, there may well be other code points or blocks that need to be excluded as well, but certainly a lot fewer than all the blocks that would have to be included in a whitelist.

The whole regex of all of the above would then be

    ^(?=^.{3,15}$)(?!^[_-].+)(?!.+[_-]$)(?!.*[_-]{2,})[^<>...\0-\cZ]+$

which you can check out live on https://regex101.com/ for the PCRE (PHP), JavaScript and Python regex engines. I don't know where the Java regex fits in those, but you may need to modify the regex to cater for its idiosyncrasies.

If you want to include spaces, but not _, just swap them everywhere in the regex.

The most useful application for this technique is the pattern attribute of HTML input fields, where a single expression is required, returning false for failure, thus making the field invalid, allowing input:invalid CSS to highlight it, and stopping the form being submitted.
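For the Java case the question asks about, the look-ahead approach might be sketched as follows. Several things here are assumptions made for illustration: _ and - are taken as the two separator characters, the blacklist is cut down to <, > and the control range, and that range is written \x00-\x1A because Java reads \0 as the start of an octal escape. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class LookaheadValidator {
    private static final Pattern VALID = Pattern.compile(
            "^(?=^.{3,15}$)"          // total length between 3 and 15
            + "(?!^[_-].+)"           // must not start with _ or -
            + "(?!.+[_-]$)"           // must not end with _ or -
            + "(?!.*[_-]{2,})"        // no runs of _ and/or -
            + "[^<>\\x00-\\x1A]+$");  // assumed blacklist: <, > and U+0000..U+001A

    // Returns true when the input passes every look-ahead and contains
    // no blacklisted character.
    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }
}
```

Each look-ahead is tried from the start of the input and consumes nothing, so the conditions combine as a logical AND before the final character class does the actual matching.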

User · Answer

Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.
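The loop-based check could look roughly like this in Java. The blacklist string and the names are assumptions chosen only to make the sketch self-contained.

```java
final class CharLoopCheck {
    // Assumed blacklist for illustration; substitute the characters
    // that are actually illegal for your data.
    private static final String BLACKLIST = "<>%$";

    // Returns true as soon as any blacklisted character is found.
    static boolean containsBlacklisted(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (BLACKLIST.indexOf(input.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```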

User · Answer

Use this one:

    ^(?=[a-zA-Z0-9...-]*$)(?!.*[<>...])

User · Answer

Here's all the French accented characters: ... I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.

For URLs I replace accented characters with regular letters like so:

    string beforeConversion = "...";  // the accented characters, in the same order as afterConversion
    string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC n";
    for (int i = 0; i < beforeConversion.Length; i++) {
        cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
    }

There's probably a more efficient way, mind you.
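In Java, one possible alternative to a hand-written replacement table is Unicode normalization. This is a different technique from the loop above, not a translation of it: the text is decomposed with java.text.Normalizer and the combining marks are then removed. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Decomposes accented letters into base letter + combining marks (NFD),
    // then deletes the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Letters that have no canonical decomposition (for example the German sharp s) are left unchanged by this approach, so it may still need a small explicit mapping alongside it.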

User · Answer

I would just whitelist the characters:

    ^[a-zA-Z0-9...]*$

Building a blacklist is equally simple with regex, but you might need to add many more characters, since there are a lot of Chinese symbols in Unicode:

    ^[^<>...]*$

The expression [^(many characters here)] just matches any character that is not listed.

User · Answer

Do you really want to blacklist specific characters, or rather whitelist the allowed characters?

I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):

    ^(?:\p{L}\p{M}*|[\-])*$

Edit: Optimized the pattern with the input from the comments.
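A hedged Java sketch of a Unicode-aware whitelist along these lines. Digits have been added to the bracket group because the question asks for them; that addition, like the class and method names, is an assumption and not part of the answer's pattern.

```java
import java.util.regex.Pattern;

final class UnicodeWhitelist {
    // \p{L} is any letter, \p{M}* any combining marks that follow it.
    // The bracket group holds the extra allowed symbols (digits and - here).
    private static final Pattern ALLOWED =
            Pattern.compile("^(?:\\p{L}\\p{M}*|[0-9\\-])*$");

    // Returns true when the input consists only of whitelisted characters.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }
}
```

Pairing \p{L} with \p{M}* means an accented letter is accepted whether it arrives as a single precomposed code point or as a base letter followed by a combining mark.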

User · Answer

I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't, and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".

User · Answer

To exclude certain characters (such as <, > and $), you can make a regular expression like this:

    [<>\$]

This regular expression will match any input that has a blacklisted character in it. The brackets define a character class, and the \ before the dollar sign guards against its special meaning in regular expressions (inside a character class the escape is optional, but harmless). To add more characters to the blacklist, just insert them between the brackets; order does not matter.

According to the Java documentation for regular expressions, you could use the expression like this:

    Pattern p = Pattern.compile("[<>\\$]");
    Matcher m = p.matcher(unsafeInputString);
    if (m.find()) {
        // Invalid input: reject it, or remove/change the offending characters.
    } else {
        // Valid input.
    }

Note that find() is used here rather than matches(): matches() succeeds only when the entire input matches the pattern, so with this character class it would flag only a string consisting of a single blacklisted character.
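The difference between searching for a blacklisted character and matching the whole input can be seen in a small Java sketch. The blacklist of <, >, % and $ and all names are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class BlacklistFindDemo {
    // Assumed blacklist: <, >, % and $ (the escape before $ is optional here).
    private static final Pattern BLACKLISTED = Pattern.compile("[<>%\\$]");

    // find() looks for a blacklisted character anywhere in the input.
    static boolean hasBlacklisted(String s) {
        return BLACKLISTED.matcher(s).find();
    }

    // matches() is true only if the entire input is one blacklisted character.
    static boolean wholeStringMatches(String s) {
        return BLACKLISTED.matcher(s).matches();
    }

    // Removes every blacklisted character instead of rejecting the input.
    static String stripBlacklisted(String s) {
        return BLACKLISTED.matcher(s).replaceAll("");
    }
}
```

For rejecting input, hasBlacklisted is the check that corresponds to "contains an offending character"; stripBlacklisted shows the "remove the offending characters" option mentioned in the answer.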
