[java] Is there a way to get rid of accents and convert a whole string to regular letters?

Is there a better way of getting rid of accents and making those letters regular, apart from using the String.replaceAll() method and replacing letters one by one? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to include all letters with accents, like the Russian or Chinese alphabets.

This question is related to java string diacritics

The answer is


I have faced the same issue with a string equality check: one of the compared strings contained characters in the 128–255 range.

i.e., a non-breaking space [Hex A0] versus a regular space [Hex 20]. To show a non-breaking space over HTML, I have used the following spacing entities. Their characters and UTF-8 bytes are: &emsp; is a very wide space {-30, -128, -125}, &ensp; is a somewhat wide space {-30, -128, -126}, &thinsp; is a narrow space {-30, -128, -119}, and the non-HTML no-break space is {-62, -96}.

String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));

Output in Bytes:

S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]

(-30, -128, -125 is the signed-byte form of the UTF-8 sequence E2 80 83, i.e. U+2003 EM SPACE.)

Use the code below to inspect different spaces and their byte codes (see Wikipedia's List_of_Unicode_characters):

String spacing_entities = "\u2003,\u2002, ,\u00A0"; // very wide space, narrow space, regular space, invisible separator (no-break space)
System.out.println("Space String :" + spacing_entities);
byte[] byteArray =
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( spacing_entities ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
  • ASCII transliterations of Unicode strings for Java: Unidecode.

    String initials = Unidecode.decode( s2 );
    
  • Using Guava: Google Core Libraries for Java.

    String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );
    

    For URL-encoding the space, use the Guava library.

    String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
    
  • To overcome this problem, use String.replaceAll() with some regular expressions.

    // \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    s2 = s2.replaceAll("\\p{Zs}", " ");
    
    
    s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
    s2 = s2.replaceAll(" ", " ");
    
  • Using java.text.Normalizer.Form. This enum provides constants for the four Unicode normalization forms described in Unicode Standard Annex #15 (Unicode Normalization Forms), and two methods to access them.


    s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
    

Testing the string and the outputs of the different approaches: Unidecode, Normalizer, StringUtils.

String strUni = "Thïs iš â funky Štring Æ,Ø,Ð,ß";

// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );

// The following produces: This is a funky String Æ,Ø,Ð,ß (combining marks stripped; Æ,Ø,Ð,ß untouched)
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");

String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

Using Unidecode is the best choice; my final code is shown below.

public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }

}

@David Conrad's solution is the fastest I tried using the Normalizer, but it does have a bug: it strips characters which are not accents; for example, Chinese characters and other letters like æ are all stripped. The characters we want to strip are non-spacing marks, characters which don't take up extra width in the final string. These zero-width characters basically end up combined into some other character. If you can see one isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()]; // assumes the filtered result never exceeds the input length
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    return new String(out, 0, j); // (out, 0, j) avoids trailing '\u0000' chars when marks were removed
}

I suggest Junidecode. It will handle not only 'Ł' and 'Ø', but it also works well for transcribing from other alphabets, such as Chinese, into the Latin alphabet.
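
A minimal sketch, assuming the gcardone/junidecode build and its static unidecode(String) method (the package name is an assumption; adjust it to whichever Junidecode artifact you pull in):

    import static net.gcardone.junidecode.Junidecode.unidecode;

    public class JunidecodeDemo {
        public static void main(String[] args) {
            // Transliterates letters that plain accent-stripping leaves behind
            System.out.println(unidecode("Łódź"));   // Lodz
            System.out.println(unidecode("北京"));    // approximate transliteration: Bei Jing
        }
    }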


System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

worked for me. The output of the snippet above gives "aee", which is what I wanted, but

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

didn't do any substitution.


In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and .trim(). Then I call this function:

fun stripAccents(s: String?): String {
    if (s == null) {
        return ""
    }

    val chars: CharArray = s.toCharArray()
    val sb = StringBuilder(s)
    var cont = 0

    while (chars.size > cont) {
        var c2: String = chars[cont].toString()
        // These are my needs; in case you need to convert other accents, just add new entries here
        c2 = c2.replace("Ã", "A")
        c2 = c2.replace("Õ", "O")
        c2 = c2.replace("Ç", "C")
        c2 = c2.replace("Á", "A")
        c2 = c2.replace("Ó", "O")
        c2 = c2.replace("Ê", "E")
        c2 = c2.replace("É", "E")
        c2 = c2.replace("Ú", "U")

        sb.setCharAt(cont, c2.single())
        cont++
    }

    return sb.toString()
}

To use this function, call it like this:

     var str: String
     str = editText.text.toString() //get the text from EditText
     str = str.toUpperCase().trim()

     str = stripAccents(str) //call the function

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):

    String input = StringUtils.stripAccents("Thïs iš â funky Štring");
    System.out.println(input);
    // Prints "This is a funky String"

Note:

The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article on Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
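
To see the limitation concretely, a small check (the expected results simply restate the behaviour described above for Commons Lang 3.5; other versions may differ):

    import org.apache.commons.lang3.StringUtils;

    public class StripAccentsLimits {
        public static void main(String[] args) {
            System.out.println(StringUtils.stripAccents("Łódź"));  // "Lodz" (Ł handled since 3.5)
            System.out.println(StringUtils.stripAccents("Søren")); // "Søren" (Ø left untouched)
        }
    }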


One of the best ways, using regex and Normalizer, if you have no library, is:

    public String flattenToAscii(String s) {
        if (s == null || s.trim().length() == 0)
            return "";
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
    }

This is more efficient than replaceAll("[^\\p{ASCII}]", "") when you only need to remove diacritics (just like your example).

Otherwise, you have to use the \p{ASCII} pattern.

Regards.


Depending on the language, those might not be considered accents (which change the sound of the letter), but rather diacritical marks:

https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols c, c, d, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.


I think the best solution is converting each char to hex and replacing it with another hex. It's because there are two kinds of Unicode typing:

Composite Unicode
Precomposed Unicode

For example, "Ồ" written in Composite Unicode is different from "Ồ" written in Precomposed Unicode. You can copy my sample chars and convert them to see the difference.

In Composite Unicode, "Ồ" is combined from 2 chars: Ô (U+00D4) and the combining grave accent (U+0300)
In Precomposed Unicode, "Ồ" is a single char (U+1ED2)

I have developed this feature for some banks to convert the info before sending it to the core bank (which usually doesn't support Unicode), and I faced this issue when the end users used multiple kinds of Unicode typing to input the data. So I think converting to hex and replacing it is the most reliable way.
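
For comparison, a minimal sketch showing how java.text.Normalizer reconciles the two typings without a hand-built hex table (NFC composes the pair into the single code point):

    import java.text.Normalizer;

    public class ComposedVsDecomposed {
        public static void main(String[] args) {
            String composite = "\u00D4\u0300"; // Ô followed by the combining grave accent
            String precomposed = "\u1ED2";     // Ồ as a single code point
            System.out.println(composite.equals(precomposed));  // false
            System.out.println(Normalizer.normalize(composite, Normalizer.Form.NFC)
                    .equals(precomposed));                      // true
        }
    }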


The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    return new String(out, 0, j); // (out, 0, j) avoids trailing '\u0000' chars when characters were dropped
}

This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).
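
If you want to reproduce this kind of comparison, here is a crude timing sketch (not a rigorous benchmark: there is no JIT warm-up control, and it assumes flattenToAscii above and virgo47's removeDiacritic below have been copied into the same class):

    public static void main(String[] args) {
        String input = "Thïs iš â funky Štring";
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) flattenToAscii(input);
        long t1 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) removeDiacritic(input);
        long t2 = System.nanoTime();
        System.out.printf("flattenToAscii : %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("removeDiacritic: %d ms%n", (t2 - t1) / 1_000_000);
    }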


EDIT: If you're not stuck with Java <6, and speed is not critical and/or the translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.

While this is not a "perfect" solution, it works well when you know the input range (in our case Latin-1 and Latin-2), it worked before Java 6 (not a real issue though), and it is much faster than the most-suggested version (which may or may not be an issue):

/**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}

Tests on my HW with 32bit JDK show that this performs the conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms, while the Normalizer way takes 3.7s (37x slower). If your needs are around performance and you know the input range, this may be for you.

Enjoy :-)