[.net] How do I remove diacritics (accents) from a string in .NET?

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter. (E.g. convert é to e, so crème brûlée would become creme brulee)

What is the best method for achieving this?

This question is related to .net string diacritics

The answer is


I really like the concise and functional code provided by azrafe7. So, I have changed it a little bit to convert it to an extension method:

public static class StringExtensions
{
    public static string RemoveDiacritics(this string text)
    {
        const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";

        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        return Encoding.ASCII.GetString(
            Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
    }
}

What this person said:

Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

It actually splits the likes of å which is one character (which is character code 00E5, not 0061 plus the modifier 030A which would look the same) into a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving the only a.


This code worked for me:

var updatedText = text.Normalize(NormalizationForm.FormD)
     .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
     .ToArray();

However, please don't do this with names. It's not only an insult to people with umlauts/accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative writings instead of just removing the accent.

Furthermore, it's simply wrong and dangerous, e.g. if the user has to provide his name exactly how it occurs on the passport.

For example my name is written Zuberbühler and in the machine readable part of my passport you will find Zuberbuehler. By removing the umlaut, the name will not match with either part. This can lead to issues for the users.

You should rather disallow umlauts/accent in an input form for names so the user can write his name correctly without its umlaut or accent.

Practical example, if the web service to apply for ESTA (https://www.application-esta.co.uk/special-characters-and) would use above code instead of transforming umlauts correctly, the ESTA application would either be refused or the traveller will have problems with the American Border Control when entering the States.

Another example would be flight tickets. Assuming you have a flight ticket booking web application, the user provides his name with an accent and your implementation is just removing the accents and then using the airline's web service to book the ticket! Your customer may not be allowed to board since the name does not match to any part of his/her passport.


Try HelperSharp package.

There is a method RemoveAccents:

 public static string RemoveAccents(this string source)
 {
     //8 bit characters 
     byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

     // 7 bit characters
     string t = Encoding.ASCII.GetString(b);
     Regex re = new Regex("[^a-zA-Z0-9]=-_/");
     string c = re.Replace(t, " ");
     return c;
 }

this did the trick for me...

string accentedStr;
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

quick&short!


I really like the concise and functional code provided by azrafe7. So, I have changed it a little bit to convert it to an extension method:

public static class StringExtensions
{
    public static string RemoveDiacritics(this string text)
    {
        const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";

        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        return Encoding.ASCII.GetString(
            Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
    }
}

It's funny such a question can get so many answers, and yet none fit my requirements :) There are so many languages around, a full language agnostic solution is AFAIK not really possible, as others has mentionned that the FormC or FormD are giving issues.

Since the original question was related to French, the simplest working answer is indeed

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

1251 should be replaced by the encoding code of the input language.

This however replace only one character by one character. Since I am also working with German as input, I did a manual convert

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a NO GO, much slower than any char/string stuff.

I also have a very simple method to remove space:

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }

Eventually, I am using a combination of all 3 above extensions:

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();            
        return keepSpace ? str : str.RemoveSpace();
    }

And a small unit test to that (not exhaustive) which pass successfully.

    [TestMethod()]
    public void LatinizeAndConvertToASCIITest()
    {
        string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
        string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
        string actual = europeanStr.LatinizeAndConvertToASCII();
        Assert.AreEqual(expected, actual);
    }

you can use string extension from MMLib.Extensions nuget package:

using MMLib.RapidPrototyping.Generators;
public void ExtensionsExample()
{
  string target = "aácceéií";
  Assert.AreEqual("aacceeii", target.RemoveDiacritics());
} 

Nuget page: https://www.nuget.org/packages/MMLib.Extensions/ Codeplex project site https://mmlib.codeplex.com/


Try HelperSharp package.

There is a method RemoveAccents:

 public static string RemoveAccents(this string source)
 {
     //8 bit characters 
     byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

     // 7 bit characters
     string t = Encoding.ASCII.GetString(b);
     Regex re = new Regex("[^a-zA-Z0-9]=-_/");
     string c = re.Replace(t, " ");
     return c;
 }

TL;DR - C# string extension method

I think the best solution to preserve the meaning of the string is to convert the characters instead of stripping them, which is well illustrated in the example crème brûlée to crme brle vs. creme brulee.

I checked out Alexander's comment above and saw the Lucene.Net code is Apache 2.0 licensed, so I've modified the class into a simple string extension method. You can use it like this:

var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength); 
// "creme brulee"

The function is too long to post in a StackOverflow answer (~139k characters of 30k allowed lol) so I made a gist and attributed the authors:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
/// 
/// <ul>
///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&amp;agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// </summary>
    /// <param name="input">     The string of characters to fold </param>
    /// <param name="length">    The length of the folded return string </param>
    /// <returns> length of output </returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}

Hope that helps someone else, this is the most robust solution I've found!


I often use an extenstion method based on another version I found here (see Replacing characters in C# (ascii)) A quick explanation:

  • Normalizing to form D splits charactes like è to an e and a nonspacing `
  • From this, the nospacing characters are removed
  • The result is normalized back to form C (I'm not sure if this is neccesary)

Code:

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }

    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}

I needed something that converts all major unicode characters and the voted answer leaved a few out so I've created a version of CodeIgniter's convert_accented_characters($str) into C# that is easily customisable:

using System;
using System.Text;
using System.Collections.Generic;

public static class Strings
{
    static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
    {
        { "äæ?", "ae" },
        { "öœ", "oe" },
        { "ü", "ue" },
        { "Ä", "Ae" },
        { "Ü", "Ue" },
        { "Ö", "Oe" },
        { "ÀÁÂÃÄÅ?AAAA??????????????", "A" },
        { "àáâãå?aaaaªa??????????????", "a" },
        { "?", "B" },
        { "?", "b" },
        { "ÇCCCC", "C" },
        { "çcccc", "c" },
        { "?", "D" },
        { "?", "d" },
        { "ÐDÐ?", "Dj" },
        { "ðddd", "dj" },
        { "ÈÉÊËEEEEE????????????", "E" },
        { "èéêëeeeee?e??????????", "e" },
        { "?", "F" },
        { "?", "f" },
        { "GGGGG??", "G" },
        { "gggg???", "g" },
        { "HH", "H" },
        { "hh", "h" },
        { "ÌÍÎÏIIIIII?????????", "I" },
        { "ìíîïiiiiii??????????", "i" },
        { "J", "J" },
        { "j", "j" },
        { "K??", "K" },
        { "k??", "k" },
        { "LLL?L??", "L" },
        { "lll?l??", "l" },
        { "?", "M" },
        { "?", "m" },
        { "ÑNNN??", "N" },
        { "ñnnn???", "n" },
        { "ÒÓÔÕOOOOOØ???O??????????????", "O" },
        { "òóôõoooooø?º?????????????????", "o" },
        { "?", "P" },
        { "?", "p" },
        { "RRR??", "R" },
        { "rrr??", "r" },
        { "SSS?ŠS?", "S" },
        { "sss?š?s??", "s" },
        { "?TTTt?", "T" },
        { "?ttt?", "t" },
        { "ÙÚÛUUUUUUUUUUUUU????????", "U" },
        { "ùúûuuuuuuuuuuuu???????????", "u" },
        { "ÝŸY????????", "Y" },
        { "ýÿy?????", "y" },
        { "?", "V" },
        { "?", "v" },
        { "W", "W" },
        { "w", "w" },
        { "ZZŽ??", "Z" },
        { "zzž??", "z" },
        { "Æ?", "AE" },
        { "ß", "ss" },
        { "?", "IJ" },
        { "?", "ij" },
        { "Œ", "OE" },
        { "ƒ", "f" },
        { "?", "ks" },
        { "p", "p" },
        { "ß", "v" },
        { "µ", "m" },
        { "?", "ps" },
        { "?", "Yo" },
        { "?", "yo" },
        { "?", "Ye" },
        { "?", "ye" },
        { "?", "Yi" },
        { "?", "Zh" },
        { "?", "zh" },
        { "?", "Kh" },
        { "?", "kh" },
        { "?", "Ts" },
        { "?", "ts" },
        { "?", "Ch" },
        { "?", "ch" },
        { "?", "Sh" },
        { "?", "sh" },
        { "?", "Shch" },
        { "?", "shch" },
        { "????", "" },
        { "?", "Yu" },
        { "?", "yu" },
        { "?", "Ya" },
        { "?", "ya" },
    };

    public static char RemoveDiacritics(this char c){
        foreach(KeyValuePair<string, string> entry in foreign_characters)
        {
            if(entry.Key.IndexOf (c) != -1)
            {
                return entry.Value[0];
            }
        }
        return c;
    }

    public static string RemoveDiacritics(this string s) 
    {
        //StringBuilder sb = new StringBuilder ();
        string text = "";


        foreach (char c in s)
        {
            int len = text.Length;

            foreach(KeyValuePair<string, string> entry in foreign_characters)
            {
                if(entry.Key.IndexOf (c) != -1)
                {
                    text += entry.Value;
                    break;
                }
            }

            if (len == text.Length) {
                text += c;  
            }
        }
        return text;
    }
}

Usage

// for strings
"crème brûlée".RemoveDiacritics (); // creme brulee

// for chars
"Ã"[0].RemoveDiacritics (); // A

I needed something that converts all major unicode characters and the voted answer leaved a few out so I've created a version of CodeIgniter's convert_accented_characters($str) into C# that is easily customisable:

using System;
using System.Text;
using System.Collections.Generic;

public static class Strings
{
    static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
    {
        { "äæ?", "ae" },
        { "öœ", "oe" },
        { "ü", "ue" },
        { "Ä", "Ae" },
        { "Ü", "Ue" },
        { "Ö", "Oe" },
        { "ÀÁÂÃÄÅ?AAAA??????????????", "A" },
        { "àáâãå?aaaaªa??????????????", "a" },
        { "?", "B" },
        { "?", "b" },
        { "ÇCCCC", "C" },
        { "çcccc", "c" },
        { "?", "D" },
        { "?", "d" },
        { "ÐDÐ?", "Dj" },
        { "ðddd", "dj" },
        { "ÈÉÊËEEEEE????????????", "E" },
        { "èéêëeeeee?e??????????", "e" },
        { "?", "F" },
        { "?", "f" },
        { "GGGGG??", "G" },
        { "gggg???", "g" },
        { "HH", "H" },
        { "hh", "h" },
        { "ÌÍÎÏIIIIII?????????", "I" },
        { "ìíîïiiiiii??????????", "i" },
        { "J", "J" },
        { "j", "j" },
        { "K??", "K" },
        { "k??", "k" },
        { "LLL?L??", "L" },
        { "lll?l??", "l" },
        { "?", "M" },
        { "?", "m" },
        { "ÑNNN??", "N" },
        { "ñnnn???", "n" },
        { "ÒÓÔÕOOOOOØ???O??????????????", "O" },
        { "òóôõoooooø?º?????????????????", "o" },
        { "?", "P" },
        { "?", "p" },
        { "RRR??", "R" },
        { "rrr??", "r" },
        { "SSS?ŠS?", "S" },
        { "sss?š?s??", "s" },
        { "?TTTt?", "T" },
        { "?ttt?", "t" },
        { "ÙÚÛUUUUUUUUUUUUU????????", "U" },
        { "ùúûuuuuuuuuuuuu???????????", "u" },
        { "ÝŸY????????", "Y" },
        { "ýÿy?????", "y" },
        { "?", "V" },
        { "?", "v" },
        { "W", "W" },
        { "w", "w" },
        { "ZZŽ??", "Z" },
        { "zzž??", "z" },
        { "Æ?", "AE" },
        { "ß", "ss" },
        { "?", "IJ" },
        { "?", "ij" },
        { "Œ", "OE" },
        { "ƒ", "f" },
        { "?", "ks" },
        { "p", "p" },
        { "ß", "v" },
        { "µ", "m" },
        { "?", "ps" },
        { "?", "Yo" },
        { "?", "yo" },
        { "?", "Ye" },
        { "?", "ye" },
        { "?", "Yi" },
        { "?", "Zh" },
        { "?", "zh" },
        { "?", "Kh" },
        { "?", "kh" },
        { "?", "Ts" },
        { "?", "ts" },
        { "?", "Ch" },
        { "?", "ch" },
        { "?", "Sh" },
        { "?", "sh" },
        { "?", "Shch" },
        { "?", "shch" },
        { "????", "" },
        { "?", "Yu" },
        { "?", "yu" },
        { "?", "Ya" },
        { "?", "ya" },
    };

    public static char RemoveDiacritics(this char c){
        foreach(KeyValuePair<string, string> entry in foreign_characters)
        {
            if(entry.Key.IndexOf (c) != -1)
            {
                return entry.Value[0];
            }
        }
        return c;
    }

    public static string RemoveDiacritics(this string s) 
    {
        //StringBuilder sb = new StringBuilder ();
        string text = "";


        foreach (char c in s)
        {
            int len = text.Length;

            foreach(KeyValuePair<string, string> entry in foreign_characters)
            {
                if(entry.Key.IndexOf (c) != -1)
                {
                    text += entry.Value;
                    break;
                }
            }

            if (len == text.Length) {
                text += c;  
            }
        }
        return text;
    }
}

Usage

// for strings
"crème brûlée".RemoveDiacritics (); // creme brulee

// for chars
"Ã"[0].RemoveDiacritics (); // A

Not having enough reputations, apparently I can not comment on Alexander's excellent link. - Lucene appears to be the only solution working in reasonably generic cases.

For those wanting a simple copy-paste solution, here it is, leveraging code in Lucene:

string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüaacÐegiLlnOorSsšzž????";

Console.WriteLine(Lucene.latinizeLucene(testbed));

AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

//////////

public static class Lucene
{
    // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
    // idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
    public static string latinizeLucene(string arg)
    {
        char[] argChar = arg.ToCharArray();

        // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ? to (10)
        char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();

        int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);

        string ret = new string(resultChar);
        ret = ret.Substring(0, outputPos);

        return ret;
    }

    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// <para/>
    /// @lucene.internal
    /// </summary>
    /// <param name="input">     The characters to fold </param>
    /// <param name="inputPos">  Index of the first character to fold </param>
    /// <param name="output">    The result of the folding. Should be of size >= <c>length * 4</c>. </param>
    /// <param name="outputPos"> Index of output where to put the result of the folding </param>
    /// <param name="length">    The number of characters to fold </param>
    /// <returns> length of output </returns>
    private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
    {
        int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos)
        {
            char c = input[pos];

            // Quick test: if it's not in range then just keep current character
            if (c < '\u0080')
            {
                output[outputPos++] = c;
            }
            else
            {
                switch (c)
                {
                    case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                    case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                    case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                    case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                    case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                    case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                    case '\u0100': // A  [LATIN CAPITAL LETTER A WITH MACRON]
                    case '\u0102': // A  [LATIN CAPITAL LETTER A WITH BREVE]
                    case '\u0104': // A  [LATIN CAPITAL LETTER A WITH OGONEK]
                    case '\u018F': // ?  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                    case '\u01CD': // A  [LATIN CAPITAL LETTER A WITH CARON]
                    case '\u01DE': // A  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E0': // ?  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FA': // ?  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0200': // ?  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                    case '\u0202': // ?  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                    case '\u0226': // ?  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                    case '\u023A': // ?  [LATIN CAPITAL LETTER A WITH STROKE]
                    case '\u1D00': // ?  [LATIN LETTER SMALL CAPITAL A]
                    case '\u1E00': // ?  [LATIN CAPITAL LETTER A WITH RING BELOW]
                    case '\u1EA0': // ?  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                    case '\u1EA2': // ?  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                    case '\u1EA4': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA6': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA8': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAA': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAC': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAE': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB0': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB2': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB4': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB6': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u24B6': // ?  [CIRCLED LATIN CAPITAL LETTER A]
                    case '\uFF21': // A  [FULLWIDTH LATIN CAPITAL LETTER A]
                        output[outputPos++] = 'A';
                        break;
                    case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                    case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                    case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                    case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                    case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                    case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                    case '\u0101': // a  [LATIN SMALL LETTER A WITH MACRON]
                    case '\u0103': // a  [LATIN SMALL LETTER A WITH BREVE]
                    case '\u0105': // a  [LATIN SMALL LETTER A WITH OGONEK]
                    case '\u01CE': // a  [LATIN SMALL LETTER A WITH CARON]
                    case '\u01DF': // a  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E1': // ?  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FB': // ?  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0201': // ?  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                    case '\u0203': // ?  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                    case '\u0227': // ?  [LATIN SMALL LETTER A WITH DOT ABOVE]
                    case '\u0250': // ?  [LATIN SMALL LETTER TURNED A]
                    case '\u0259': // ?  [LATIN SMALL LETTER SCHWA]
                    case '\u025A': // ?  [LATIN SMALL LETTER SCHWA WITH HOOK]
                    case '\u1D8F': // ?  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                    case '\u1D95': // ?  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                    case '\u1E01': // ?  [LATIN SMALL LETTER A WITH RING BELOW]
                    case '\u1E9A': // ?  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                    case '\u1EA1': // ?  [LATIN SMALL LETTER A WITH DOT BELOW]
                    case '\u1EA3': // ?  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                    case '\u1EA5': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA7': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA9': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAB': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAD': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAF': // ?  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB1': // ?  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB3': // ?  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB5': // ?  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB7': // ?  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u2090': // ?  [LATIN SUBSCRIPT SMALL LETTER A]
                    case '\u2094': // ?  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                    case '\u24D0': // ?  [CIRCLED LATIN SMALL LETTER A]
                    case '\u2C65': // ?  [LATIN SMALL LETTER A WITH STROKE]
                    case '\u2C6F': // ?  [LATIN CAPITAL LETTER TURNED A]
                    case '\uFF41': // a  [FULLWIDTH LATIN SMALL LETTER A]
                        output[outputPos++] = 'a';
                        break;
                    case '\uA732': // ?  [LATIN CAPITAL LETTER AA]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'A';
                        break;
                    case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                    case '\u01E2': // ?  [LATIN CAPITAL LETTER AE WITH MACRON]
                    case '\u01FC': // ?  [LATIN CAPITAL LETTER AE WITH ACUTE]
                    case '\u1D01': // ?  [LATIN LETTER SMALL CAPITAL AE]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'E';
                        break;
                    case '\uA734': // ?  [LATIN CAPITAL LETTER AO]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'O';
                        break;
                    case '\uA736': // ?  [LATIN CAPITAL LETTER AU]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'U';
                        break;

        // etc. etc. etc.
        // see link above for complete source code
        // 
        // unfortunately, postings are limited, as in
        // "Body is limited to 30000 characters; you entered 136098."

                    [...]

                    case '\u2053': // ?  [SWUNG DASH]
                    case '\uFF5E': // ~  [FULLWIDTH TILDE]
                        output[outputPos++] = '~';
                        break;
                    default:
                        output[outputPos++] = c;
                        break;
                }
            }
        }
        return outputPos;
    }
}

Popping this Library here if you haven't already considered it. Looks like there are a full range of unit tests with it.

https://github.com/thomasgalliker/Diacritics.NET


this did the trick for me...

string accentedStr;
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

quick&short!


This works fine in java.

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

The CodePage of Greek (ISO) can do it

The information about this codepage is into System.Text.Encoding.GetEncodings(). Learn about in: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has codepage 28597 and name iso-8859-7.

Go to the code... \o/

string text = "Você está numa situação lamentável";

string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"

string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"

So, write this function...

public string RemoveAcentuation(string text)
{
    return
        System.Web.HttpUtility.UrlDecode(
            System.Web.HttpUtility.UrlEncode(
                text, Encoding.GetEncoding("iso-8859-7")));
}

Note that... Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597) because first is the name, and second the codepage of Encoding.


Not having enough reputations, apparently I can not comment on Alexander's excellent link. - Lucene appears to be the only solution working in reasonably generic cases.

For those wanting a simple copy-paste solution, here it is, leveraging code in Lucene:

string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüaacÐegiLlnOorSsšzž????";

Console.WriteLine(Lucene.latinizeLucene(testbed));

AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

//////////

public static class Lucene
{
    // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
    // idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
    public static string latinizeLucene(string arg)
    {
        char[] argChar = arg.ToCharArray();

        // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ? to (10)
        char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();

        int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);

        string ret = new string(resultChar);
        ret = ret.Substring(0, outputPos);

        return ret;
    }

    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// <para/>
    /// @lucene.internal
    /// </summary>
    /// <param name="input">     The characters to fold </param>
    /// <param name="inputPos">  Index of the first character to fold </param>
    /// <param name="output">    The result of the folding. Should be of size >= <c>length * 4</c>. </param>
    /// <param name="outputPos"> Index of output where to put the result of the folding </param>
    /// <param name="length">    The number of characters to fold </param>
    /// <returns> length of output </returns>
    private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
    {
        int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos)
        {
            char c = input[pos];

            // Quick test: if it's not in range then just keep current character
            if (c < '\u0080')
            {
                output[outputPos++] = c;
            }
            else
            {
                switch (c)
                {
                    case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                    case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                    case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                    case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                    case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                    case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                    case '\u0100': // A  [LATIN CAPITAL LETTER A WITH MACRON]
                    case '\u0102': // A  [LATIN CAPITAL LETTER A WITH BREVE]
                    case '\u0104': // A  [LATIN CAPITAL LETTER A WITH OGONEK]
                    case '\u018F': // ?  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                    case '\u01CD': // A  [LATIN CAPITAL LETTER A WITH CARON]
                    case '\u01DE': // A  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E0': // ?  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FA': // ?  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0200': // ?  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                    case '\u0202': // ?  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                    case '\u0226': // ?  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                    case '\u023A': // ?  [LATIN CAPITAL LETTER A WITH STROKE]
                    case '\u1D00': // ?  [LATIN LETTER SMALL CAPITAL A]
                    case '\u1E00': // ?  [LATIN CAPITAL LETTER A WITH RING BELOW]
                    case '\u1EA0': // ?  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                    case '\u1EA2': // ?  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                    case '\u1EA4': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA6': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA8': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAA': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAC': // ?  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAE': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB0': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB2': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB4': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB6': // ?  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u24B6': // ?  [CIRCLED LATIN CAPITAL LETTER A]
                    case '\uFF21': // A  [FULLWIDTH LATIN CAPITAL LETTER A]
                        output[outputPos++] = 'A';
                        break;
                    case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                    case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                    case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                    case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                    case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                    case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                    case '\u0101': // a  [LATIN SMALL LETTER A WITH MACRON]
                    case '\u0103': // a  [LATIN SMALL LETTER A WITH BREVE]
                    case '\u0105': // a  [LATIN SMALL LETTER A WITH OGONEK]
                    case '\u01CE': // a  [LATIN SMALL LETTER A WITH CARON]
                    case '\u01DF': // a  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E1': // ?  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FB': // ?  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0201': // ?  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                    case '\u0203': // ?  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                    case '\u0227': // ?  [LATIN SMALL LETTER A WITH DOT ABOVE]
                    case '\u0250': // ?  [LATIN SMALL LETTER TURNED A]
                    case '\u0259': // ?  [LATIN SMALL LETTER SCHWA]
                    case '\u025A': // ?  [LATIN SMALL LETTER SCHWA WITH HOOK]
                    case '\u1D8F': // ?  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                    case '\u1D95': // ?  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                    case '\u1E01': // ?  [LATIN SMALL LETTER A WITH RING BELOW]
                    case '\u1E9A': // ?  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                    case '\u1EA1': // ?  [LATIN SMALL LETTER A WITH DOT BELOW]
                    case '\u1EA3': // ?  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                    case '\u1EA5': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA7': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA9': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAB': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAD': // ?  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAF': // ?  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB1': // ?  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB3': // ?  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB5': // ?  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB7': // ?  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u2090': // ?  [LATIN SUBSCRIPT SMALL LETTER A]
                    case '\u2094': // ?  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                    case '\u24D0': // ?  [CIRCLED LATIN SMALL LETTER A]
                    case '\u2C65': // ?  [LATIN SMALL LETTER A WITH STROKE]
                    case '\u2C6F': // ?  [LATIN CAPITAL LETTER TURNED A]
                    case '\uFF41': // a  [FULLWIDTH LATIN SMALL LETTER A]
                        output[outputPos++] = 'a';
                        break;
                    case '\uA732': // ?  [LATIN CAPITAL LETTER AA]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'A';
                        break;
                    case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                    case '\u01E2': // ?  [LATIN CAPITAL LETTER AE WITH MACRON]
                    case '\u01FC': // ?  [LATIN CAPITAL LETTER AE WITH ACUTE]
                    case '\u1D01': // ?  [LATIN LETTER SMALL CAPITAL AE]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'E';
                        break;
                    case '\uA734': // ?  [LATIN CAPITAL LETTER AO]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'O';
                        break;
                    case '\uA736': // ?  [LATIN CAPITAL LETTER AU]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'U';
                        break;

        // etc. etc. etc.
        // see link above for complete source code
        // 
        // unfortunately, postings are limited, as in
        // "Body is limited to 30000 characters; you entered 136098."

                    [...]

                    case '\u2053': // ?  [SWUNG DASH]
                    case '\uFF5E': // ~  [FULLWIDTH TILDE]
                        output[outputPos++] = '~';
                        break;
                    default:
                        output[outputPos++] = c;
                        break;
                }
            }
        }
        return outputPos;
    }
}

TL;DR - C# string extension method

I think the best solution to preserve the meaning of the string is to convert the characters instead of stripping them, which is well illustrated in the example crème brûlée to crme brle vs. creme brulee.

I checked out Alexander's comment above and saw the Lucene.Net code is Apache 2.0 licensed, so I've modified the class into a simple string extension method. You can use it like this:

var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength); 
// "creme brulee"

The function is too long to post in a StackOverflow answer (~139k characters of 30k allowed lol) so I made a gist and attributed the authors:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
/// 
/// <ul>
///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&amp;agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// </summary>
    /// <param name="input">     The string of characters to fold </param>
    /// <param name="length">    The length of the folded return string </param>
    /// <returns> length of output </returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}

Hope that helps someone else, this is the most robust solution I've found!


In case anyone's interested, here is the java equivalent:

import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i=0;i<nrml.length();++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}

Popping this Library here if you haven't already considered it. Looks like there are a full range of unit tests with it.

https://github.com/thomasgalliker/Diacritics.NET


In case someone is interested, I was looking for something similar and ended writing the following:

public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }
    string result = stringBuilder.ToString();
    return String.Join("_", result.Split(new char[] { '_' }
        , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
}

What this person said:

Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

It actually splits the likes of å which is one character (which is character code 00E5, not 0061 plus the modifier 030A which would look the same) into a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving the only a.


I often use an extenstion method based on another version I found here (see Replacing characters in C# (ascii)) A quick explanation:

  • Normalizing to form D splits charactes like è to an e and a nonspacing `
  • From this, the nospacing characters are removed
  • The result is normalized back to form C (I'm not sure if this is neccesary)

Code:

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }

    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}

The CodePage of Greek (ISO) can do it

The information about this codepage is into System.Text.Encoding.GetEncodings(). Learn about in: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has codepage 28597 and name iso-8859-7.

Go to the code... \o/

string text = "Você está numa situação lamentável";

string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"

string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"

So, write this function...

public string RemoveAcentuation(string text)
{
    return
        System.Web.HttpUtility.UrlDecode(
            System.Web.HttpUtility.UrlEncode(
                text, Encoding.GetEncoding("iso-8859-7")));
}

Note that... Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597) because first is the name, and second the codepage of Encoding.


Imports System.Text
Imports System.Globalization

 Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)  
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function

THIS IS THE VB VERSION (Works with GREEK) :

Imports System.Text

Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String)
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

Imports System.Text
Imports System.Globalization

 Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)  
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function

This works fine in java.

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

THIS IS THE VB VERSION (Works with GREEK) :

Imports System.Text

Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String)
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

you can use string extension from MMLib.Extensions nuget package:

using MMLib.RapidPrototyping.Generators;
public void ExtensionsExample()
{
  string target = "aácceéií";
  Assert.AreEqual("aacceeii", target.RemoveDiacritics());
} 

Nuget page: https://www.nuget.org/packages/MMLib.Extensions/ Codeplex project site https://mmlib.codeplex.com/


This is how i replace diacritic characters to non-diacritic ones in all my .NET program

C#:

//Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter 'é' is substituted by an 'e'
public string RemoveDiacritics(string s)
{
    string normalizedString = null;
    StringBuilder stringBuilder = new StringBuilder();
    normalizedString = s.Normalize(NormalizationForm.FormD);
    int i = 0;
    char c = '\0';

    for (i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().ToLower();
}

VB .NET:

'Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter "é" is substituted by an "e"'
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char

    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString().ToLower()
End Function

It's funny such a question can get so many answers, and yet none fit my requirements :) There are so many languages around, a full language agnostic solution is AFAIK not really possible, as others has mentionned that the FormC or FormD are giving issues.

Since the original question was related to French, the simplest working answer is indeed

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

1251 should be replaced by the encoding code of the input language.

This however replace only one character by one character. Since I am also working with German as input, I did a manual convert

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a NO GO, much slower than any char/string stuff.

I also have a very simple method to remove space:

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }

Eventually, I am using a combination of all 3 above extensions:

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();            
        return keepSpace ? str : str.RemoveSpace();
    }

And a small unit test to that (not exhaustive) which pass successfully.

    [TestMethod()]
    public void LatinizeAndConvertToASCIITest()
    {
        string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
        string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
        string actual = europeanStr.LatinizeAndConvertToASCII();
        Assert.AreEqual(expected, actual);
    }

In case someone is interested, I was looking for something similar and ended writing the following:

public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }
    string result = stringBuilder.ToString();
    return String.Join("_", result.Split(new char[] { '_' }
        , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
}

This is how i replace diacritic characters to non-diacritic ones in all my .NET program

C#:

//Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter 'é' is substituted by an 'e'
public string RemoveDiacritics(string s)
{
    string normalizedString = null;
    StringBuilder stringBuilder = new StringBuilder();
    normalizedString = s.Normalize(NormalizationForm.FormD);
    int i = 0;
    char c = '\0';

    for (i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().ToLower();
}

VB .NET:

'Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter "é" is substituted by an "e"'
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char

    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString().ToLower()
End Function

This code worked for me:

var updatedText = text.Normalize(NormalizationForm.FormD)
     .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
     .ToArray();

However, please don't do this with names. It's not only an insult to people with umlauts/accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative writings instead of just removing the accent.

Furthermore, it's simply wrong and dangerous, e.g. if the user has to provide his name exactly how it occurs on the passport.

For example my name is written Zuberbühler and in the machine readable part of my passport you will find Zuberbuehler. By removing the umlaut, the name will not match with either part. This can lead to issues for the users.

You should rather disallow umlauts/accent in an input form for names so the user can write his name correctly without its umlaut or accent.

Practical example, if the web service to apply for ESTA (https://www.application-esta.co.uk/special-characters-and) would use above code instead of transforming umlauts correctly, the ESTA application would either be refused or the traveller will have problems with the American Border Control when entering the States.

Another example would be flight tickets. Assuming you have a flight ticket booking web application, the user provides his name with an accent and your implementation is just removing the accents and then using the airline's web service to book the ticket! Your customer may not be allowed to board since the name does not match to any part of his/her passport.


Examples related to .net

You must add a reference to assembly 'netstandard, Version=2.0.0.0 How to use Bootstrap 4 in ASP.NET Core No authenticationScheme was specified, and there was no DefaultChallengeScheme found with default authentification and custom authorization .net Core 2.0 - Package was restored using .NetFramework 4.6.1 instead of target framework .netCore 2.0. The package may not be fully compatible Update .NET web service to use TLS 1.2 EF Core add-migration Build Failed What is the difference between .NET Core and .NET Standard Class Library project types? Visual Studio 2017 - Could not load file or assembly 'System.Runtime, Version=4.1.0.0' or one of its dependencies Nuget connection attempt failed "Unable to load the service index for source" Token based authentication in Web API without any user interface

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to diacritics

Is there a way to get rid of accents and convert a whole string to regular letters? Converting Symbols, Accent Letters to English Alphabet Remove accents/diacritics in a string in JavaScript What is the best way to remove accents (normalize) in a Python unicode string? How do I remove diacritics (accents) from a string in .NET? Microsoft Excel mangles Diacritics in .csv files?