[java] Is there a way to get rid of accents and convert a whole string to regular letters?

Is there a better way of getting rid of accents and making those letters regular, apart from using the String.replaceAll() method and replacing letters one by one? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to include all letters with accents, like the Russian or Chinese alphabets.

This question is related to java string diacritics

The answer is


I have faced the same issue with a string equality check: one of the compared strings contained characters in the 128–255 range.

i.e., a non-breaking space [Hex A0] versus a regular space [Hex 20]. To show a non-breaking space over HTML, I have used the following spacing entities. Their characters and UTF-8 bytes are: &emsp; is a very wide space {-30, -128, -125}, &ensp; is a somewhat wide space {-30, -128, -126}, &thinsp; is a narrow space {-30, -128, -119}, and the non-HTML no-break space is {-62, -96}.

String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));

Output in Bytes:

S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]

(-30, -128, -125 is the signed-byte form of the UTF-8 sequence E2 80 83, i.e. U+2003 EM SPACE.)

Use the code below to inspect different spaces and their byte codes (see Wikipedia's List_of_Unicode_characters):

String spacing_entities = "\u2003,\u2002, ,\u00A0"; // very wide space, narrow space, regular space, invisible separator (no-break space)
System.out.println("Space String :" + spacing_entities);
byte[] byteArray =
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( spacing_entities ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
  • ASCII transliterations of Unicode strings for Java: Unidecode.

    String initials = Unidecode.decode( s2 );
    
  • Using Guava: Google Core Libraries for Java.

    String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );
    

    For URL-encoding the space, use the Guava library.

    String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
    
  • To overcome this problem, use String.replaceAll() with some regular expressions.

    // \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    s2 = s2.replaceAll("\\p{Zs}", " ");
    
    
    s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
    s2 = s2.replaceAll(" ", " ");
    
  • Using java.text.Normalizer.Form. This enum provides constants for the four Unicode normalization forms described in Unicode Standard Annex #15 (Unicode Normalization Forms), and two methods to access them.


    s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
    

Testing the string and the outputs of the different approaches: Unidecode, Normalizer, StringUtils.

String strUni = "Thïs iš â funky Štring Æ,Ø,Ð,ß";

// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );

// The following produces: This is a funky String Æ,Ø,Ð,ß (combining marks stripped; Æ,Ø,Ð,ß untouched)
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");

String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

Using Unidecode is the best choice; my final code is shown below.

public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }

}

@David Conrad's solution is the fastest I tried using the Normalizer, but it does have a bug: it strips characters which are not accents; for example, Chinese characters and other letters like æ are all stripped. The characters we want to strip are non-spacing marks, characters which don't take up extra width in the final string. These zero-width characters basically end up combined into some other character. If you can see one isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()]; // assumes the filtered result never exceeds the input length
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    return new String(out, 0, j); // (out, 0, j) avoids trailing '\u0000' chars when marks were removed
}

I suggest Junidecode. It will handle not only 'Ł' and 'Ø', but it also works well for transcribing from other alphabets, such as Chinese, into the Latin alphabet.
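
A minimal sketch, assuming the gcardone/junidecode build and its static unidecode(String) method (the package name is an assumption; adjust it to whichever Junidecode artifact you pull in):

    import static net.gcardone.junidecode.Junidecode.unidecode;

    public class JunidecodeDemo {
        public static void main(String[] args) {
            // Transliterates letters that plain accent-stripping leaves behind
            System.out.println(unidecode("Łódź"));   // Lodz
            System.out.println(unidecode("北京"));    // approximate transliteration: Bei Jing
        }
    }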


System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

worked for me. The output of the snippet above gives "aee", which is what I wanted, but

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

didn't do any substitution.


In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and .trim(). Then I call this function:

fun stripAccents(s: String?): String {
    if (s == null) {
        return ""
    }

    val chars: CharArray = s.toCharArray()
    val sb = StringBuilder(s)
    var cont = 0

    while (chars.size > cont) {
        var c2: String = chars[cont].toString()
        // These are my needs; in case you need to convert other accents, just add new entries here
        c2 = c2.replace("Ã", "A")
        c2 = c2.replace("Õ", "O")
        c2 = c2.replace("Ç", "C")
        c2 = c2.replace("Á", "A")
        c2 = c2.replace("Ó", "O")
        c2 = c2.replace("Ê", "E")
        c2 = c2.replace("É", "E")
        c2 = c2.replace("Ú", "U")

        sb.setCharAt(cont, c2.single())
        cont++
    }

    return sb.toString()
}

To use this function, call it like this:

     var str: String
     str = editText.text.toString() //get the text from EditText
     str = str.toUpperCase().trim()

     str = stripAccents(str) //call the function

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):

    String input = StringUtils.stripAccents("Thïs iš â funky Štring");
    System.out.println(input);
    // Prints "This is a funky String"

Note:

The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article on Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
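
To see the limitation concretely, a small check (the expected results simply restate the behaviour described above for Commons Lang 3.5; other versions may differ):

    import org.apache.commons.lang3.StringUtils;

    public class StripAccentsLimits {
        public static void main(String[] args) {
            System.out.println(StringUtils.stripAccents("Łódź"));  // "Lodz" (Ł handled since 3.5)
            System.out.println(StringUtils.stripAccents("Søren")); // "Søren" (Ø left untouched)
        }
    }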


One of the best ways, using regex and Normalizer, if you have no library, is:

    public String flattenToAscii(String s) {
        if (s == null || s.trim().length() == 0)
            return "";
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
    }

This is more efficient than replaceAll("[^\\p{ASCII}]", "") when you only need to remove diacritics (just like your example).

Otherwise, you have to use the \p{ASCII} pattern.

Regards.


Depending on the language, those might not be considered accents (which change the sound of the letter), but rather diacritical marks:

https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols c, c, d, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.


I think the best solution is converting each char to hex and replacing it with another hex. It's because there are two kinds of Unicode typing:

Composite Unicode
Precomposed Unicode

For example, "Ồ" written in Composite Unicode is different from "Ồ" written in Precomposed Unicode. You can copy my sample chars and convert them to see the difference.

In Composite Unicode, "Ồ" is combined from 2 chars: Ô (U+00D4) and the combining grave accent (U+0300)
In Precomposed Unicode, "Ồ" is a single char (U+1ED2)

I have developed this feature for some banks to convert the info before sending it to the core bank (which usually doesn't support Unicode), and I faced this issue when the end users used multiple kinds of Unicode typing to input the data. So I think converting to hex and replacing it is the most reliable way.
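
For comparison, a minimal sketch showing how java.text.Normalizer reconciles the two typings without a hand-built hex table (NFC composes the pair into the single code point):

    import java.text.Normalizer;

    public class ComposedVsDecomposed {
        public static void main(String[] args) {
            String composite = "\u00D4\u0300"; // Ô followed by the combining grave accent
            String precomposed = "\u1ED2";     // Ồ as a single code point
            System.out.println(composite.equals(precomposed));  // false
            System.out.println(Normalizer.normalize(composite, Normalizer.Form.NFC)
                    .equals(precomposed));                      // true
        }
    }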


The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    return new String(out, 0, j); // (out, 0, j) avoids trailing '\u0000' chars when characters were dropped
}

This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).
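
If you want to reproduce this kind of comparison, here is a crude timing sketch (not a rigorous benchmark: there is no JIT warm-up control, and it assumes flattenToAscii above and virgo47's removeDiacritic below have been copied into the same class):

    public static void main(String[] args) {
        String input = "Thïs iš â funky Štring";
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) flattenToAscii(input);
        long t1 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) removeDiacritic(input);
        long t2 = System.nanoTime();
        System.out.printf("flattenToAscii : %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("removeDiacritic: %d ms%n", (t2 - t1) / 1_000_000);
    }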


EDIT: If you're not stuck with Java <6, and speed is not critical and/or the translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.

While this is not a "perfect" solution, it works well when you know the input range (in our case Latin-1 and Latin-2), it worked before Java 6 (not a real issue though), and it is much faster than the most-suggested version (which may or may not be an issue):

/**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}

Tests on my HW with 32bit JDK show that this performs the conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms, while the Normalizer way takes 3.7s (37x slower). If your needs are around performance and you know the input range, this may be for you.

Enjoy :-)