Concrete Javascript Regex for Accented Characters (Diacritics)

Question

I've looked on Stack Overflow (replacing characters... eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question: How can JavaScript match for accented characters (those with diacritical marks)?

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than in other languages/platforms.

This was my original version, until I wanted to add diacritic support:

/^[a-zA-Z]+,\s[a-zA-Z]+$/

Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent; I don't really know what the "extent" is of the second approach). Here they are:

Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):

var accentedCharacters = "…";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-Z…]+,\s[a-zA-Z…]+$/

This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

My other approach was to use the . character class, to have a simpler expression:

var regex = /^.+,\s.+$/;

This would match just about anything, at least in the form of: something, something. That's alright, I suppose...

The last approach, which I just found, might be simpler...

/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/

It matches a range of Unicode characters. It is tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.

Here are my concerns:

1. The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
2. The second solution is better and concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
3. The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table (and the continuation of that table), \u00C0-\u017F seems to be pretty solid, at least for my expected input.

Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don't have to worry about out-of-Latin-character-set characters.

So the real question(s): Which of these three approaches is most suited for the task? Or are there better solutions?
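To make the trade-offs between the three candidates concrete, here is a minimal sketch that runs the original pattern, the . pattern, and the \u00C0-\u017F pattern against a few inputs. The sample strings and the check helper are made up for illustration and are not part of the question.

```javascript
// Patterns taken from the question.
const asciiOnly = /^[a-zA-Z]+,\s[a-zA-Z]+$/;
const anyChars = /^.+,\s.+$/;
const latinRange = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Hypothetical helper: report which patterns accept a given input.
function check(input) {
  return {
    ascii: asciiOnly.test(input),
    dot: anyChars.test(input),
    range: latinRange.test(input),
  };
}

// "Müller, Zoë" written with escapes so the code points are unambiguous.
const accented = check("M\u00FCller, Zo\u00EB");
// Digits and punctuation: only the . pattern lets this through.
const junk = check("1234, !!!!");
// The multiplication and division signs sit inside \u00C0-\u017F.
const symbols = check("\u00D7, \u00F7");

console.log(accented, junk, symbols);
```

The last case is one gotcha of the third approach: the range is contiguous, so it also admits the two non-letter symbols U+00D7 and U+00F7.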

User · Answer

You can remove the diacritics from letters by using:

var str = "résumé";
str.normalize('NFD').replace(/[\u0300-\u036f]/g, ''); // returns "resume"

This removes all the diacritical marks; you can then run your original regex on the result.

Reference:

https://thread.engineering/2018-08-29-searching-and-sorting-text-with-diacritical-marks-in-javascript/
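As a rough sketch of how this could be wired to the name check from the question: strip the marks first, then test the ASCII-only pattern. The function names stripDiacritics and isValidName are hypothetical, and the sample names are invented.

```javascript
// ASCII-only pattern from the question.
const namePattern = /^[a-zA-Z]+,\s[a-zA-Z]+$/;

// Decompose to NFD, then drop combining marks in U+0300..U+036F.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical helper: validate the stripped form, keep the original for storage.
function isValidName(input) {
  return namePattern.test(stripDiacritics(input));
}

const stripped = stripDiacritics("r\u00E9sum\u00E9"); // "résumé"
const spanish = isValidName("Mu\u00F1oz, Jos\u00E9"); // "Muñoz, José"
const danish = isValidName("S\u00F8rensen, \u00D8yvind"); // "Sørensen, Øyvind"

console.log(stripped, spanish, danish);
```

One caveat: letters such as ø, ß, and æ have no canonical decomposition, so NFD leaves them untouched and the ASCII pattern still rejects them. That is why the last example fails.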

User · Answer

The XRegExp library has a plugin named Unicode that helps solve tasks like this.

<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
  var unicodeWord = XRegExp("^\\p{L}+$");

  unicodeWord.test("Русский"); // true
  unicodeWord.test("日本語"); // true
  unicodeWord.test("العربية"); // true
</script>

It's mentioned in the comments to the question, but it's easy to miss. I noticed it only after I submitted this answer.

User · Answer

What about this?

^([a-zA-Z]|[à-ú]|[À-Ú])+$

It will match every word, with or without accented characters.

User · Answer

The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars

I added these code blocks (\u00C0-\u024F includes three adjacent blocks at once):

\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional

Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips unprintable control signals and all symbols except for the awkwardly-placed multiply × \u00D7 and divide ÷ \u00F7.

[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷

If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.

The original regex stopping at \u017F borked on the name "Șenol." According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
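A small sketch of the difference between the ranges discussed in this answer. The variable names are illustrative, and the Vietnamese sample name is an added assumption used only to exercise Latin Extended Additional.

```javascript
// Range from the question: Latin-1 letters plus Latin Extended-A.
const narrow = /^[a-zA-Z\u00C0-\u017F]+$/;

// Extended range from this answer, with the multiply and divide signs cut out.
const extended =
  /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F\u1E00-\u1EFF]+$/;

const commaS = "\u0218enol"; // S with comma below, Latin Extended-B
const vietnamese = "Nguy\u1EC5n"; // U+1EC5, Latin Extended Additional
const multiply = "\u00D7"; // multiplication sign, not a letter

const report = {
  narrowCommaS: narrow.test(commaS),
  extendedCommaS: extended.test(commaS),
  narrowVietnamese: narrow.test(vietnamese),
  extendedVietnamese: extended.test(vietnamese),
  narrowMultiply: narrow.test(multiply),
  extendedMultiply: extended.test(multiply),
};

console.log(report);
```

The narrow range rejects both names and accepts the multiplication sign; the extended range with the two gaps does the opposite.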

User · Answer

"Which of these three approaches is most suited for the task?"

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.

"I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)"

The most basic problem I'm seeing here is not diacritics, but whitespace. There are a few names that consist of multiple words, e.g. for titles. So you should go with the most generic, that is, allowing everything but the comma that distinguishes first from last name:

/[^,]+,\s[^,]+/

But your second solution with the . character class is just as fine; you only might need to care about multiple commas then.
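A quick sketch of how the "everything but the comma" pattern behaves compared with the . pattern when a second comma shows up. The ^ and $ anchors and the capture groups are additions for this example, and the sample names are invented.

```javascript
// Anchored variant of the pattern above, with groups for last and first name.
const notComma = /^([^,]+),\s([^,]+)$/;
// The . pattern from the question.
const anyChars = /^.+,\s.+$/;

const multiWord = "de la Cruz, Mar\u00EDa Jos\u00E9"; // spaces inside both parts
const twoCommas = "Smith, John, Jr.";

const parts = notComma.exec(multiWord);
const outcome = {
  last: parts ? parts[1] : null,
  first: parts ? parts[2] : null,
  notCommaTwoCommas: notComma.test(twoCommas),
  dotTwoCommas: anyChars.test(twoCommas),
};

console.log(outcome);
```

The negated class accepts multi-word names but refuses the input with two commas, while the . pattern accepts it, which is the "multiple commas" concern raised in this answer.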

User · Answer

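From this wiki: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin

For Latin letters, I use

/^[A-zÀ-ÖØ-öø-ÿ]+$/

It avoids hyphens and most special characters. Note that A-z still admits [ \ ] ^ _ and `, which sit between Z and a; use A-Za-z to exclude them.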

User · Answer

The easier way to accept all accents is this:

[A-zÀ-ú] // accepts lowercase and uppercase characters
[A-zÀ-ÿ] // as above but including letters with an umlaut (includes [ ] ^ \ × ÷)
[A-Za-zÀ-ÿ] // as above but not including [ ] ^ \
[A-Za-zÀ-ÖØ-öø-ÿ] // as above but not including [ ] ^ \ × ÷

See https://unicode-table.com/en/ for characters listed in numeric order.
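The stray characters mentioned above come from how ranges are defined by code point. A minimal sketch, with illustrative variable names, showing what A-z and À-ÿ let through:

```javascript
// A-z spans U+0041..U+007A, which also covers [ \ ] ^ _ and the backtick.
const loose = /^[A-z]+$/;
const strict = /^[A-Za-z]+$/;

// The six ASCII characters between "Z" and "a".
const betweenZandA = "[\\]^_`";

// The whole upper half of Latin-1 versus the same range with two gaps.
const latin1All = /^[A-Za-z\u00C0-\u00FF]+$/;
const latin1Letters = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;

const mathSigns = "\u00D7\u00F7"; // multiplication and division signs

const findings = {
  looseAcceptsPunctuation: loose.test(betweenZandA),
  strictAcceptsPunctuation: strict.test(betweenZandA),
  allAcceptsMathSigns: latin1All.test(mathSigns),
  lettersAcceptsMathSigns: latin1Letters.test(mathSigns),
};

console.log(findings);
```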

User · Answer

/^[\pL\pM\p{Zs}.-]+$/u

Explanation:

\pL - matches any kind of letter from any language
\pM - matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
\p{Zs} - matches a whitespace character that is invisible, but does take up space
u - pattern and subject strings are treated as UTF-8

Unlike other proposed regexes (such as [A-Za-zÀ-ÖØ-öø-ÿ]), this will work with all language-specific characters; e.g. Šš is matched by this rule, but not by others on this page.

Unfortunately, older JavaScript engines do not natively support these classes (Unicode property escapes arrived with ES2018 and require the braced form, such as \p{L}, together with the u flag). However, you can use xregexp, e.g.

const XRegExp = require('xregexp');

const isInputRealHumanName = (input: string): boolean => {
  return XRegExp('^[\\pL\\pM-]+ [\\pL\\pM-]+$', 'u').test(input);
};
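For engines that implement ES2018 Unicode property escapes, roughly the same idea can be sketched without a library. Native JavaScript needs the braced form (\p{L}, not \pL) and the u flag. The lastFirst pattern below adapts the approach to the "last, first" format from the question; it and the sample inputs are assumptions for illustration.

```javascript
// Letters, combining marks, space separators, dot and hyphen.
const humanName = /^[\p{L}\p{M}\p{Zs}.-]+$/u;

// Hypothetical adaptation to the "last, first" format from the question.
const lastFirst = /^[\p{L}\p{M}]+,\s[\p{L}\p{M}]+$/u;

const checks = {
  // "Šimić Ana-Marija": letters outside Latin-1, a space, a hyphen.
  croatian: humanName.test("\u0160imi\u0107 Ana-Marija"),
  // "José" written as e + combining acute accent, matched through \p{M}.
  decomposed: humanName.test("Jose\u0301"),
  // Digits are not letters or marks.
  droid: humanName.test("R2-D2"),
  // S with comma below, the character that broke the \u017F range.
  commaBelow: lastFirst.test("\u0218enol, Ion"),
  digitsOnly: lastFirst.test("1234, 5678"),
};

console.log(checks);
```

Because \p{L} covers letters from every script, this also accepts Cyrillic, Greek, CJK and so on; whether that is desirable depends on the form being validated.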

User · Answer

How about this?

/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/
