Converting Symbols, Accent Letters to English Alphabet

Question

The problem is that, as you know, there are thousands of characters in the Unicode chart, and I want to convert all the similar-looking characters to the letters of the English alphabet.

For instance, here are a few conversions: individual look-alike symbols should become H, V, Y, O and C, and a stylized string built from look-alike characters should become "the Family".

I saw that there are more than 20 versions of the letter A/a, and I don't know how to classify them. They are like needles in a haystack.

The complete list of Unicode characters is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html. Just try scrolling down and see the variations of the letters.

How can I convert all of these with Java? Please help me.

User · Answer

The following class does the trick: org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter

User · Answer

You could try using unidecode, which is available as a Ruby gem and as a Perl module on CPAN. Essentially, it works as a huge lookup table, where each Unicode code point maps to an ASCII character or string.
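
The lookup-table idea can be sketched in Java as follows. This only illustrates the principle and is not the actual unidecode data: the class name TinyUnidecode, the four table entries, and the choice to emit "?" for unmapped characters are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class TinyUnidecode {
    // Hypothetical, deliberately tiny table: code point -> ASCII replacement.
    // A real table would need an entry for every code point it wants to support.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x00E9, "e");   // LATIN SMALL LETTER E WITH ACUTE
        TABLE.put(0x00DF, "ss");  // LATIN SMALL LETTER SHARP S
        TABLE.put(0x00C6, "AE");  // LATIN CAPITAL LETTER AE
        TABLE.put(0x0416, "Zh");  // CYRILLIC CAPITAL LETTER ZHE
    }

    public static String toAscii(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        // Iterate by code point so characters outside the BMP count as one unit.
        input.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);                   // already ASCII
            } else {
                sb.append(TABLE.getOrDefault(cp, "?")); // assumption: "?" marks unmapped input
            }
        });
        return sb.toString();
    }
}
```

Because the replacement is a string rather than a single character, one code point can expand to several ASCII letters, which a char-to-char table cannot express.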

User · Answer

The original request has been answered already.

However, I am posting the answer below for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.

Naive meaning of transliteration: the translated string in its final form (target charset) sounds like the string in its original form. If we want to transliterate any charset to Latin (English alphabet), then the ICU4J library will do the job.

Here is the code snippet in Java:

    import com.ibm.icu.text.Transliterator; // ICU4J library import

    public static final String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
    public static final String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

    /**
     * Returns the transliterated string to convert any charset to Latin.
     */
    public static String transliterate(String input) {
        // Chain both rule sets: transliterate to Latin, then strip the accents.
        Transliterator transliterator =
                Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
        String result = transliterator.transliterate(input);
        return result;
    }

User · Answer

Reposting my post from "How do I remove diacritics (accents) from a string in .NET?"

This method works fine in Java (purely for the purpose of removing diacritical marks, aka accents).

It basically converts all accented characters into their de-accented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public String deAccent(String str) {
        // NFD splits each accented character into base letter + combining marks.
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        return pattern.matcher(nfdNormalizedString).replaceAll("");
    }

User · Answer

There is no easy or general way to do what you want, because it is just your subjective opinion that these letters look like the Latin letters you want to convert them to. They are actually separate letters with their own distinct names and sounds, which just happen to superficially resemble a Latin letter.

If you want that conversion, you have to create your own translation table based on which Latin letters you think the non-Latin letters should be converted to.

If you only want to remove diacritical marks, there are some answers in this thread: "How do I remove diacritics (accents) from a string in .NET?" However, you describe a more general problem.

User · Answer

Attempting to "convert them all" is the wrong approach to the problem.

Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language, with their own meaning, sound, etc. Removing those marks is just the same as replacing random letters in an English word. This is before you even go on to consider the Cyrillic languages and other script-based texts such as Arabic, which simply cannot be "converted" to English.

If you must, for whatever reason, convert characters, then the only sensible way to approach this is to first reduce the scope of the task at hand. Consider the source of the input: if you are coding an application for "the Western world" (to use as good a phrase as any), it is unlikely that you will ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols; there is no (easy) way for users to enter these directly, so you can assume they can be ignored.

By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary-based lookup/replace operation is feasible. It then becomes a small amount of slightly boring work to create the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find-and-replace operations tend to be blindingly quick.

This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as they were in our case) took perhaps one man-day to produce, covering all diacritic marks for all Western European languages.
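
As a rough illustration of a reduced-scope lookup array, the sketch below covers only the Latin-1 Supplement letters U+00C0 to U+00FF and passes everything else through untouched. It is not the array from the application mentioned above; the class name Latin1Folder and the replacement choices (for example "AE", "D", "Th" and "ss") are assumptions for the example.

```java
final class Latin1Folder {
    // Replacements for U+00C0..U+00FF, indexed by (c - 0x00C0).
    // null means "leave unchanged" (used for the multiplication and division signs).
    // The multi-letter choices (AE, Th, ss, ...) are assumptions, not a standard.
    private static final String[] LATIN1 = {
        "A", "A", "A", "A", "A", "A", "AE", "C", "E", "E", "E", "E", "I", "I", "I", "I",
        "D", "N", "O", "O", "O", "O", "O", null, "O", "U", "U", "U", "U", "Y", "Th", "ss",
        "a", "a", "a", "a", "a", "a", "ae", "c", "e", "e", "e", "e", "i", "i", "i", "i",
        "d", "n", "o", "o", "o", "o", "o", null, "o", "u", "u", "u", "u", "y", "th", "y"
    };

    public static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x00C0 && c <= 0x00FF && LATIN1[c - 0x00C0] != null) {
                sb.append(LATIN1[c - 0x00C0]); // in scope: constant-time array lookup
            } else {
                sb.append(c);                  // out of scope: pass through untouched
            }
        }
        return sb.toString();
    }
}
```

Widening the scope, for instance to Latin Extended-A, would mean adding another array for that range rather than changing the loop.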

User · Answer

If the need is to convert accented letters to their plain counterparts (so that the result reads, for example, "oeisoc"), you can use this as a starting point:

    public class AsciiUtils {
        private static final String PLAIN_ASCII =
              "AaEeIiOoUu"    // grave
            + "AaEeIiOoUuYy"  // acute
            + "AaEeIiOoUuYy"  // circumflex
            + "AaOoNn"        // tilde
            + "AaEeIiOoUuYy"  // umlaut
            + "Aa"            // ring
            + "Cc"            // cedilla
            + "OoUu";         // double acute

        // Must stay aligned, position by position, with PLAIN_ASCII.
        private static final String UNICODE =
              "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
            + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
            + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
            + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
            + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
            + "\u00C5\u00E5"
            + "\u00C7\u00E7"
            + "\u0150\u0151\u0170\u0171";

        // private constructor, can't be instantiated!
        private AsciiUtils() { }

        // remove accents from a string and replace them with the ASCII equivalent
        public static String convertNonAscii(String s) {
            if (s == null) return null;
            StringBuilder sb = new StringBuilder();
            int n = s.length();
            for (int i = 0; i < n; i++) {
                char c = s.charAt(i);
                int pos = UNICODE.indexOf(c);
                if (pos > -1) {
                    sb.append(PLAIN_ASCII.charAt(pos));
                } else {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            String s = "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
            System.out.println(AsciiUtils.convertNonAscii(s));
            // output :
            // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
        }
    }

JDK 1.6 provides the java.text.Normalizer class, which can be used for this task.

User · Answer

I'm late to the party, but after facing this issue today, I found this answer to be very good:

    String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");

Reference: https://stackoverflow.com/a/16283863
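
One caveat worth checking before adopting this one-liner: NFD only splits characters that have a canonical decomposition, so letters such as U+00D8 (O with stroke) or U+00DF (sharp s) are still non-ASCII after normalization and are then deleted outright by the [^\p{ASCII}] pattern. The sketch below contrasts that behaviour with removing only combining marks; the class and method names are illustrative, not part of the referenced answer.

```java
import java.text.Normalizer;

final class StripDemo {
    // Deletes everything that is still non-ASCII after NFD decomposition.
    public static String asciiOnly(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    // Deletes only combining marks (Unicode category M) and keeps other letters.
    public static String marksOnly(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }
}
```

For the input "caf\u00E9 Stra\u00DFe \u00D8resund", asciiOnly returns "cafe Strae resund" (two letters silently lost), while marksOnly only removes the accent on the e and leaves the sharp s and the stroked O in place for a later table lookup.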

User · Answer

The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, "ß" to a German-speaking person should be converted to "ss", while an English speaker would probably convert it to "B".

Add to that the fact that Unicode has multiple code points for the same glyphs.

The upshot is that the only way to do this is to create a massive table with each Unicode character and the ASCII character you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".

Here is a tiny excerpt from an app that does this:

    switch (c) {
        case 'A':
        case '\u00C0':  // À LATIN CAPITAL LETTER A WITH GRAVE
        case '\u00C1':  // Á LATIN CAPITAL LETTER A WITH ACUTE
        case '\u00C2':  // Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
        // and so on for about 20 lines...
            return "A";   // no break needed: a statement after return is unreachable

        case '\u00C6':  // Æ LATIN CAPITAL LETTER AE
            return "AE";

        // And so on for pages...
    }
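
The normalization shortcut combined with a fallback table can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not the app quoted above: the class name KdFolder and the three fallback entries are hypothetical, and a real table would be far larger.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class KdFolder {
    // Hypothetical fallback entries for characters that NFKD leaves untouched.
    private static final Map<Character, String> FALLBACK = new HashMap<>();
    static {
        FALLBACK.put('\u00DF', "ss"); // LATIN SMALL LETTER SHARP S
        FALLBACK.put('\u00C6', "AE"); // LATIN CAPITAL LETTER AE
        FALLBACK.put('\u00D8', "O");  // LATIN CAPITAL LETTER O WITH STROKE
    }

    public static String fold(String input) {
        // Step 1: compatibility decomposition, e.g. the "fi" ligature U+FB01
        // becomes "fi" and U+00C0 becomes 'A' followed by U+0300.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // Step 2: drop the combining accents
            }
            // Step 3: table lookup for what normalization could not handle
            String mapped = FALLBACK.get(c);
            sb.append(mapped != null ? mapped : String.valueOf(c));
        }
        return sb.toString();
    }
}
```

The normalization step handles the bulk of accented letters for free, so the hand-written table only has to cover the characters that have no decomposition, and that is also where the culture-dependent choices end up.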

User · Answer

It's a part of Apache Commons Lang as of version 3.0:

    org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An.

Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language

User · Answer

String tested: ÀÁÂÃÄÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß

Tested:

Output from Apache Commons Lang3: AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j: AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode: AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý and another issue)
Output from Unidecode: AAAAAAECEEEEIIIIDNOOOOOOUUUUYss

The last choice is the best.

User · Answer

Since the encoding that turns "the Family" into the stylized string in your question is effectively random and does not follow any algorithm that can be explained by the information of the Unicode code points involved, there is no general way to solve this algorithmically.

You will need to build the mapping from Unicode characters to the Latin characters they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode code points, but I think the effort for this would be greater than manually building that mapping, especially if you have a good number of examples from which you can build your mapping.

To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the Latin characters they resemble.

Examples:

"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests), but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any Latin character at all, and in your example is used to represent "a".
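
To see why the Unicode data does not help here, the three code points above can be inspected with the JDK, next to a hand-built mapping of the kind this answer recommends. This is only a sketch: the class name LookalikeDemo and its methods are made up for the example, and the three mapping targets simply follow how the characters are used in the question.

```java
import java.util.HashMap;
import java.util.Map;

final class LookalikeDemo {
    // Hand-built mapping for the three characters discussed above. The targets
    // reflect visual resemblance in the question, not any Unicode property.
    static final Map<Integer, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put(0x0452, 'h'); // CYRILLIC SMALL LETTER DJE
        LOOKALIKES.put(0x0166, 'F'); // LATIN CAPITAL LETTER T WITH STROKE
        LOOKALIKES.put(0x0E04, 'a'); // THAI CHARACTER KHO KHWAI
    }

    // Describes a code point using only the character data shipped with the JDK.
    public static String describe(int codePoint) {
        return String.format("U+%04X %s (%s)", codePoint,
                Character.getName(codePoint),
                Character.UnicodeScript.of(codePoint));
    }

    // Replaces mapped code points and copies everything else unchanged.
    public static String replaceLookalikes(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            Character mapped = LOOKALIKES.get(cp);
            if (mapped != null) {
                sb.append(mapped.charValue());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

Calling describe(0x0452) reports a Cyrillic letter named DJE; nothing in that name or script points to "h", which is why the association has to come from a manually maintained table.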
