How do I remove diacritics (accents) from a string in .NET?

Question

I'm trying to convert some strings that are in French Canadian and, basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter (e.g. convert é to e, so crème brûlée would become creme brulee). What is the best method for achieving this?

User · Answer

I really like the concise and functional code provided by azrafe7. So, I have changed it a little bit to convert it to an extension method:

    using System.Text;

    public static class StringExtensions
    {
        public static string RemoveDiacritics(this string text)
        {
            const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";

            if (string.IsNullOrEmpty(text))
            {
                return string.Empty;
            }

            return Encoding.ASCII.GetString(
                Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
        }
    }

User · Answer

This works fine in Java.

It basically converts all accented characters into their de-accented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public String deAccent(String str) {
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        return pattern.matcher(nfdNormalizedString).replaceAll("");
    }

User · Answer

Popping this library here if you haven't already considered it. Looks like there is a full range of unit tests with it.

https://github.com/thomasgalliker/Diacritics.NET

User · Answer

Not having enough reputation, apparently I cannot comment on Alexander's excellent link. Lucene appears to be the only solution working in reasonably generic cases.

For those wanting a simple copy-paste solution, here it is, leveraging code in Lucene:

    string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüāăčĐęğıŁłńŌōřŞşšźžșțệủ";
    Console.WriteLine(Lucene.latinizeLucene(testbed));
    // AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

    public static class Lucene
    {
        // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
        // idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
        public static string latinizeLucene(string arg)
        {
            char[] argChar = arg.ToCharArray();

            // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
            char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();

            int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);

            string ret = new string(resultChar);
            ret = ret.Substring(0, outputPos);

            return ret;
        }

        /// <summary>
        /// Converts characters above ASCII to their ASCII equivalents.  For example,
        /// accents are removed from accented characters.
        /// <para/>
        /// @lucene.internal
        /// </summary>
        /// <param name="input">     The characters to fold </param>
        /// <param name="inputPos">  Index of the first character to fold </param>
        /// <param name="output">    The result of the folding. Should be of size &gt;= <c>length * 4</c>. </param>
        /// <param name="outputPos"> Index of output where to put the result of the folding </param>
        /// <param name="length">    The number of characters to fold </param>
        /// <returns> length of output </returns>
        private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
        {
            int end = inputPos + length;
            for (int pos = inputPos; pos < end; ++pos)
            {
                char c = input[pos];

                // Quick test: if it's not in range then just keep current character
                if (c < '\u0080')
                {
                    output[outputPos++] = c;
                }
                else
                {
                    switch (c)
                    {
                        case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                        case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                        case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                        case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                        case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                        case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                        case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
                        case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
                        case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
                        case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                        case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
                        case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                        case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                        case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                        case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                        case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                        case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                        case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
                        case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
                        case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
                        case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                        case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                        case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                        case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                        case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                        case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                        case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                        case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                        case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                        case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                        case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                        case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                        case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
                        case '\uFF21': // Ａ  [FULLWIDTH LATIN CAPITAL LETTER A]
                            output[outputPos++] = 'A';
                            break;
                        case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                        case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                        case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                        case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                        case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                        case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                        case '\u0101': // ā  [LATIN SMALL LETTER A WITH MACRON]
                        case '\u0103': // ă  [LATIN SMALL LETTER A WITH BREVE]
                        case '\u0105': // ą  [LATIN SMALL LETTER A WITH OGONEK]
                        case '\u01CE': // ǎ  [LATIN SMALL LETTER A WITH CARON]
                        case '\u01DF': // ǟ  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                        case '\u01E1': // ǡ  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                        case '\u01FB': // ǻ  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                        case '\u0201': // ȁ  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                        case '\u0203': // ȃ  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                        case '\u0227': // ȧ  [LATIN SMALL LETTER A WITH DOT ABOVE]
                        case '\u0250': // ɐ  [LATIN SMALL LETTER TURNED A]
                        case '\u0259': // ə  [LATIN SMALL LETTER SCHWA]
                        case '\u025A': // ɚ  [LATIN SMALL LETTER SCHWA WITH HOOK]
                        case '\u1D8F': // ᶏ  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                        case '\u1D95': // ᶕ  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                        case '\u1E01': // ḁ  [LATIN SMALL LETTER A WITH RING BELOW]
                        case '\u1E9A': // ẚ  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                        case '\u1EA1': // ạ  [LATIN SMALL LETTER A WITH DOT BELOW]
                        case '\u1EA3': // ả  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                        case '\u1EA5': // ấ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                        case '\u1EA7': // ầ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                        case '\u1EA9': // ẩ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                        case '\u1EAB': // ẫ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                        case '\u1EAD': // ậ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                        case '\u1EAF': // ắ  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                        case '\u1EB1': // ằ  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                        case '\u1EB3': // ẳ  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                        case '\u1EB5': // ẵ  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                        case '\u1EB7': // ặ  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                        case '\u2090': // ₐ  [LATIN SUBSCRIPT SMALL LETTER A]
                        case '\u2094': // ₔ  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                        case '\u24D0': // ⓐ  [CIRCLED LATIN SMALL LETTER A]
                        case '\u2C65': // ⱥ  [LATIN SMALL LETTER A WITH STROKE]
                        case '\u2C6F': // Ɐ  [LATIN CAPITAL LETTER TURNED A]
                        case '\uFF41': // ａ  [FULLWIDTH LATIN SMALL LETTER A]
                            output[outputPos++] = 'a';
                            break;
                        case '\uA732': // Ꜳ  [LATIN CAPITAL LETTER AA]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'A';
                            break;
                        case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                        case '\u01E2': // Ǣ  [LATIN CAPITAL LETTER AE WITH MACRON]
                        case '\u01FC': // Ǽ  [LATIN CAPITAL LETTER AE WITH ACUTE]
                        case '\u1D01': // ᴁ  [LATIN LETTER SMALL CAPITAL AE]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'E';
                            break;
                        case '\uA734': // Ꜵ  [LATIN CAPITAL LETTER AO]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'O';
                            break;
                        case '\uA736': // Ꜷ  [LATIN CAPITAL LETTER AU]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'U';
                            break;

                        // etc. etc. etc.
                        // see link above for complete source code
                        //
                        // unfortunately, postings are limited, as in
                        // "Body is limited to 30000 characters; you entered 136098."

                        case '\u2053': // ⁓  [SWUNG DASH]
                        case '\uFF5E': // ～  [FULLWIDTH TILDE]
                            output[outputPos++] = '~';
                            break;
                        default:
                            output[outputPos++] = c;
                            break;
                    }
                }
            }
            return outputPos;
        }
    }

User · Answer

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others).

    static string RemoveDiacritics(string text)
    {
        var normalizedString = text.Normalize(NormalizationForm.FormD);
        var stringBuilder = new StringBuilder();

        foreach (var c in normalizedString)
        {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's just a little complicated, but really you're looking at a complicated problem.

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
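To make the limits of the normalization approach concrete, here is a minimal sketch that runs the same decompose-and-filter logic on two inputs. The class name `DiacriticsLimitsDemo` and the second sample string are made up for illustration. It relies on the fact that letters such as Ł, Ø and ß have no canonical decomposition in Unicode, so form D leaves them intact and there is no combining mark to strip.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsLimitsDemo
{
    // Same decompose-and-filter technique as in the answer above.
    public static string RemoveDiacritics(string text)
    {
        var decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (var c in decomposed)
        {
            // Keep everything except non-spacing combining marks (category Mn).
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // The French example from the question decomposes cleanly.
        Console.WriteLine(RemoveDiacritics("crème brûlée"));        // creme brulee

        // Ł, Ø and ß are standalone letters, not base letter + mark,
        // so they pass through unchanged; only ó and ź lose their accents.
        Console.WriteLine(RemoveDiacritics("Łódź Øresund Straße")); // Łodz Øresund Straße
    }
}
```

For French input this is not a concern, but for text in other languages a table-based step may still be needed for such letters.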

User · Answer

In case someone is interested, I was looking for something similar and ended up writing the following:

    public static string NormalizeStringForUrl(string name)
    {
        String normalizedString = name.Normalize(NormalizationForm.FormD);
        StringBuilder stringBuilder = new StringBuilder();

        foreach (char c in normalizedString)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                    stringBuilder.Append(c);
                    break;
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                    stringBuilder.Append('_');
                    break;
            }
        }
        string result = stringBuilder.ToString();
        return String.Join("_", result.Split(new char[] { '_' },
            StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
    }

User · Answer

This is how I replace diacritic characters with non-diacritic ones in all my .NET programs. Note that both versions also lower-case the result.

C#:

    // Transforms a letter to its equivalent representation in the 0-127 ASCII table,
    // e.g. the letter 'é' is replaced by an 'e'.
    public string RemoveDiacritics(string s)
    {
        string normalizedString = null;
        StringBuilder stringBuilder = new StringBuilder();
        normalizedString = s.Normalize(NormalizationForm.FormD);
        int i = 0;
        char c = '\0';

        for (i = 0; i <= normalizedString.Length - 1; i++)
        {
            c = normalizedString[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().ToLower();
    }

VB .NET:

    ' Transforms a letter to its equivalent representation in the 0-127 ASCII table,
    ' e.g. the letter 'é' is replaced by an 'e'.
    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char

        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString().ToLower()
    End Function

User · Answer

This did the trick for me:

    string accentedStr;
    byte[] tempBytes;
    tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
    string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

Quick and short!

User · Answer

You can use the string extension from the MMLib.Extensions NuGet package:

    using MMLib.RapidPrototyping.Generators;
    public void ExtensionsExample()
    {
        string target = "aácčeéií";
        Assert.AreEqual("aacceeii", target.RemoveDiacritics());
    }

NuGet page: https://www.nuget.org/packages/MMLib.Extensions/
CodePlex project site: https://mmlib.codeplex.com/

User · Answer

    Imports System.Text
    Imports System.Globalization

    Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function

User · Answer

THIS IS THE VB VERSION (works with GREEK):

    Imports System.Text
    Imports System.Globalization

    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char
        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString()
    End Function

User · Answer

I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

Normalizing to form D splits characters like è into an e and a nonspacing `. From this, the nonspacing characters are removed. The result is normalized back to form C (I'm not sure if this is necessary).

Code:

    using System.Linq;
    using System.Text;
    using System.Globalization;

    // namespace here
    public static class Utility
    {
        public static string RemoveDiacritics(this string str)
        {
            if (null == str) return null;
            var chars =
                from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
                let uc = CharUnicodeInfo.GetUnicodeCategory(c)
                where uc != UnicodeCategory.NonSpacingMark
                select c;

            var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

            return cleanStr;
        }

        // or, alternatively
        public static string RemoveDiacritics2(this string str)
        {
            if (null == str) return null;
            var chars = str
                .Normalize(NormalizationForm.FormD)
                .ToCharArray()
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                .ToArray();

            return new string(chars).Normalize(NormalizationForm.FormC);
        }
    }
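On whether the final normalization back to form C is necessary: one case where it seems to matter is text that form D splits into pieces that are not non-spacing marks. A precomposed Hangul syllable, for example, decomposes into several jamo letters (category OtherLetter), which survive the filter and stay split unless the result is recomposed. The sketch below illustrates this; `RecomposeDemo`, `StripMarks` and the `recompose` flag are hypothetical names introduced only for the comparison.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class RecomposeDemo
{
    // Strips non-spacing marks; optionally recomposes the result to form C.
    public static string StripMarks(string str, bool recompose)
    {
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        var stripped = new string(chars);
        return recompose ? stripped.Normalize(NormalizationForm.FormC) : stripped;
    }

    public static void Main()
    {
        const string hangul = "\uD55C"; // 한, a single precomposed syllable

        // Without recomposition the syllable stays split into three jamo.
        Console.WriteLine(StripMarks(hangul, recompose: false).Length); // 3

        // With recomposition the original single character comes back.
        Console.WriteLine(StripMarks(hangul, recompose: true).Length);  // 1
    }
}
```

For purely Latin input such as French, skipping the last step usually makes no visible difference, since nothing is left to recompose once the marks are gone.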

User · Answer

What this person said:

    Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

It actually splits the likes of å, which is one character (character code 00E5, not 0061 plus the modifier 030A, which would look the same), into a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving only the a.
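The "split, then let the ASCII conversion drop the modifier" idea can also be written out explicitly, without depending on how a particular code page maps characters. The following is only a sketch of that idea, not the one-liner above: it decomposes with `String.Normalize` and then encodes to ASCII with an empty replacement fallback, so anything outside ASCII (including the combining marks) is silently discarded. The names `AsciiFallbackDemo` and `ToAsciiDroppingMarks` are made up for this example.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    public static string ToAsciiDroppingMarks(string text)
    {
        // ASCII encoding that replaces unencodable characters with nothing
        // instead of the default '?'.
        var ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));

        // Step 1: split å (U+00E5) into a (U+0061) + combining ring (U+030A).
        var decomposed = text.Normalize(NormalizationForm.FormD);

        // Step 2: the ASCII round trip keeps the base letter and drops the mark.
        return ascii.GetString(ascii.GetBytes(decomposed));
    }

    public static void Main()
    {
        Console.WriteLine(ToAsciiDroppingMarks("å"));            // a
        Console.WriteLine(ToAsciiDroppingMarks("crème brûlée")); // creme brulee

        // Caveat: letters with no decomposition are removed entirely.
        Console.WriteLine(ToAsciiDroppingMarks("Øresund"));      // resund
    }
}
```

The trade-off is that every non-ASCII character without a decomposition disappears rather than being approximated, so this suits input that is known to be accented Latin text.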

User · Answer

I needed something that converts all major unicode characters and the voted answer leaved a few out so I ve created a version of CodeIgniter s convert accented characters  str  into C  that is easily customisable   using System  using System Text  using System Collections Generic   public static class Strings       static Dictionary lt string  string gt  foreign characters   new Dictionary lt string  string gt                            ae                        oe                      ue                      Ae                      Ue                      Oe                             AAAA                  A                           aaaa  a                  a                     B                     b                  CCCC    C                  cccc    c                     D                     d                  D       Dj                  ddd    dj                        EEEEE                E                        eeeee e              e                     F                     f                GGGGG      G                gggg       g                HH    H                hh    h                        IIIIII             I                        iiiiii              i                J    J                j    j                K      K                k      k                LLL L      L                lll l      l                     M                     m                  NNN      N                  nnn       n                        OOOOO     O                  O                        ooooo                          o                     P                     p                RRR      R                rrr      r                SSS   S     S                sss    s      s                 TTTt     T                 ttt     t                      UUUUUUUUUUUUU            U                      uuuuuuuuuuuu               u                    Y            Y                    y         y                     V                     v                W    W         
       w    w                ZZ        Z                zz        z                       AE                      ss                     IJ                     ij                      OE                      f                     ks                p    p                      v                      m                     ps                     Yo                     yo                     Ye                     ye                     Yi                     Zh                     zh                     Kh                     kh                     Ts                     ts                     Ch                     ch                     Sh                     sh                     Shch                     shch                                             Yu                     yu                     Ya                     ya                 public static char RemoveDiacritics this char c           foreach KeyValuePair lt string  string gt  entry in foreign characters                        if entry Key IndexOf  c     -1                                return entry Value 0                                   return c             public static string RemoveDiacritics this string s                   StringBuilder sb   new StringBuilder             string text                 foreach  char c in s                        int len   text Length               foreach KeyValuePair lt string  string gt  entry in foreign characters                                if entry Key IndexOf  c     -1                                        text    entry Value                      break                                               if  len    text Length                    text    c                                    return text            Usage     for strings  cr  me br  l  e  RemoveDiacritics        creme brulee     for chars      0  RemoveDiacritics        A

User · Answer

It's funny such a question can get so many answers, and yet none fit my requirements. :)

There are so many languages around that a fully language-agnostic solution is, AFAIK, not really possible, as others have mentioned that FormC and FormD give issues.

Since the original question was related to French, the simplest working answer is indeed:

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

1251 should be replaced by the encoding code of the input language.

This, however, replaces only one character with one character. Since I am also working with German as input, I did a manual conversion:

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a NO GO, much slower than any char/string stuff.

I also have a very simple method to remove spaces:

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }

Eventually, I am using a combination of all 3 above extensions:

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
        return keepSpace ? str : str.RemoveSpace();
    }

And a small unit test for that (not exhaustive), which passes successfully:

    [TestMethod()]
    public void LatinizeAndConvertToASCIITest()
    {
        string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
        string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
        string actual = europeanStr.LatinizeAndConvertToASCII();
        Assert.AreEqual(expected, actual);
    }

User · Answer

    Imports System.Text
    Imports System.Globalization

    Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function

User · Answer

TL;DR - C# string extension method

I think the best solution to preserve the meaning of the string is to convert the characters instead of stripping them, which is well illustrated in the example crème brûlée to crme brle vs. creme brulee.

I checked out Alexander's comment above and saw the Lucene.Net code is Apache 2.0 licensed, so I've modified the class into a simple string extension method. You can use it like this:

    var originalString = "crème brûlée";
    var maxLength = originalString.Length; // limit output length as necessary
    var foldedString = originalString.FoldToASCII(maxLength); // "creme brulee"

The function is too long to post in a StackOverflow answer (~139k characters of 30k allowed lol), so I made a gist and attributed the authors:

    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    /// <summary>
    /// This class converts alphabetic, numeric, and symbolic Unicode characters
    /// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
    /// block) into their ASCII equivalents, if one exists.
    /// <para/>
    /// Characters from the following Unicode blocks are converted; however, only
    /// those characters with reasonable ASCII alternatives are converted:
    ///
    /// <ul>
    ///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
    ///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
    ///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
    ///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
    ///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
    ///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
    ///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
    ///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
    ///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
    ///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
    ///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
    ///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
    ///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
    ///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
    ///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
    ///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
    /// </ul>
    /// <para/>
    /// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
    /// <para/>
    /// For example, '&amp;agrave;' will be replaced by 'a'.
    /// </summary>
    public static partial class StringExtensions
    {
        /// <summary>
        /// Converts characters above ASCII to their ASCII equivalents.  For example,
        /// accents are removed from accented characters.
        /// </summary>
        /// <param name="input">     The string of characters to fold </param>
        /// <param name="length">    The length of the folded return string </param>
        /// <returns> length of output </returns>
        public static string FoldToASCII(this string input, int? length = null)
        {
            // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
        }
    }

Hope that helps someone else, this is the most robust solution I've found.

User · Answer

What this person said:

    Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

It actually splits the likes of å, which is one character (character code 00E5, not 0061 plus the modifier 030A, which would look the same), into a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving only the a.
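
The "one character versus base letter plus modifier" distinction can be inspected directly with Unicode normalization. The sketch below only shows what `Normalize(NormalizationForm.FormD)` produces; whether the code-page conversion above works this way internally is not confirmed here. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Lists the UTF-16 code units of the canonical decomposition (FormD)
    // of the input, formatted like "U+0061 U+030A".
    public static string DescribeDecomposition(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        return string.Join(" ", decomposed.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For the single character å (U+00E5), `DecompositionDemo.DescribeDecomposition("\u00E5")` returns `U+0061 U+030A`: the plain letter a followed by COMBINING RING ABOVE. Removing the second code unit is what leaves only the a.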

User · Answer

In case anyone's interested, here is the Java equivalent:

    import java.text.Normalizer;

    public class MyClass
    {
        public static String removeDiacritics(String input)
        {
            String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
            StringBuilder stripped = new StringBuilder();
            for (int i = 0; i < nrml.length(); ++i)
            {
                if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
                {
                    stripped.append(nrml.charAt(i));
                }
            }
            return stripped.toString();
        }
    }

User · Answer

This did the trick for me:

    string accentedStr;
    byte[] tempBytes;
    tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
    string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

Quick & short!

User · Answer

The CodePage of Greek (ISO) can do it.

The information about this codepage is in System.Text.Encoding.GetEncodings(). Learn about it at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has codepage 28597 and name iso-8859-7.

Go to the code... \o/

    string text = "Você está numa situação lamentável";

    string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
    // result: "Voce+esta+numa+situacao+lamentavel"

    string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
    // result: "Voce esta numa situacao lamentavel"

So, write this function:

    public string RemoveAcentuation(string text)
    {
        return
            System.Web.HttpUtility.UrlDecode(
                System.Web.HttpUtility.UrlEncode(
                    text, Encoding.GetEncoding("iso-8859-7")));
    }

Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597), because the first is the name and the second is the codepage of the Encoding.

User · Answer

THIS IS THE VB VERSION (works with GREEK):

    Imports System.Text
    Imports System.Globalization

    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char
        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString()
    End Function

User · Answer

This code worked for me:

    var updatedText = text.Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray();

However, please don't do this with names. It's not only an insult to people with umlauts/accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative spellings that can be used instead of just removing the accent.

Furthermore, it's simply wrong and dangerous, e.g. if the user has to provide his name exactly as it appears on the passport. For example, my name is written Zuberbühler, and in the machine-readable part of my passport you will find Zuberbuehler. By removing the umlaut, the name will not match either part. This can lead to issues for the users.

You should rather disallow umlauts/accents in an input form for names, so the user can write his name correctly without its umlaut or accent.

Practical example: if the web service to apply for ESTA (https://www.application-esta.co.uk/special-characters-and) used the above code instead of transforming umlauts correctly, the ESTA application would either be refused or the traveller would have problems with the American Border Control when entering the States.

Another example would be flight tickets. Assume you have a flight ticket booking web application, the user provides his name with an accent, and your implementation just removes the accents and then uses the airline's web service to book the ticket. Your customer may not be allowed to board, since the name does not match any part of his/her passport.
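
To make the passport mismatch concrete, here is a small sketch that strips marks the same way as the snippet above and then compares the result with both spellings from the example. `NameCheckSketch`, `StripMarks`, and `MatchesPassport` are hypothetical names, and the exact ordinal comparison is an assumption about how strictly a downstream system might match names.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class NameCheckSketch
{
    // Same technique as above: decompose, then drop the combining marks.
    public static string StripMarks(string text)
    {
        char[] kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        return new string(kept);
    }

    // Assumption: the receiving system accepts only an exact match with the
    // printed name or with the machine-readable transliteration.
    public static bool MatchesPassport(string candidate, string printedName, string machineReadableName)
    {
        return string.Equals(candidate, printedName, StringComparison.Ordinal)
            || string.Equals(candidate, machineReadableName, StringComparison.Ordinal);
    }
}
```

`NameCheckSketch.StripMarks("Zuberbühler")` yields `Zuberbuhler`, and `MatchesPassport("Zuberbuhler", "Zuberbühler", "Zuberbuehler")` is false: the stripped form equals neither the printed nor the machine-readable spelling.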

User · Answer

I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

Normalizing to form D splits characters like è into an e and a nonspacing `. From this, the nonspacing characters are removed. The result is normalized back to form C (I'm not sure if this is necessary).

Code:

    using System.Linq;
    using System.Text;
    using System.Globalization;

    // namespace here
    public static class Utility
    {
        public static string RemoveDiacritics(this string str)
        {
            if (null == str) return null;
            var chars =
                from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
                let uc = CharUnicodeInfo.GetUnicodeCategory(c)
                where uc != UnicodeCategory.NonSpacingMark
                select c;

            var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

            return cleanStr;
        }

        // or, alternatively
        public static string RemoveDiacritics2(this string str)
        {
            if (null == str) return null;
            var chars = str
                .Normalize(NormalizationForm.FormD)
                .ToCharArray()
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                .ToArray();

            return new string(chars).Normalize(NormalizationForm.FormC);
        }
    }
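
On the question of whether the final normalization back to form C is needed: form D also decomposes characters whose pieces are not nonspacing marks, so skipping the recomposition can leave such text in a different shape than it started. A minimal sketch, with the hypothetical names `RecomposeDemo` and `StripMarks`, using a Hangul syllable as the example:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class RecomposeDemo
{
    // Strips nonspacing marks; optionally recomposes the result to FormC.
    public static string StripMarks(string text, bool recompose)
    {
        char[] kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        string stripped = new string(kept);
        return recompose ? stripped.Normalize(NormalizationForm.FormC) : stripped;
    }
}
```

The syllable 한 (U+D55C) decomposes in form D into three jamo (U+1112, U+1161, U+11AB), none of which is a nonspacing mark. With `recompose: false` the result is three chars long; with `recompose: true` it is the original single character again. For purely Latin input the final step makes no visible difference, which may be why it looks optional.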

User · Answer

This code worked for me:

    // Wrap the filtered characters in a new string; ToArray() alone yields a char[].
    var updatedText = new string(text.Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());

However, please don't do this with names. It's not only an insult to people with umlauts or accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative spellings that should be used instead of just removing the accent.

Furthermore, it's simply wrong and dangerous, e.g. if the user has to provide their name exactly as it appears on the passport.

For example, my name is written Zuberbühler, and in the machine-readable part of my passport you will find Zuberbuehler. If the umlaut is simply removed, the name matches neither part. This can lead to issues for the users.

You should rather disallow umlauts and accents in an input form for names, so that users can write their name correctly without its umlaut or accent.

Practical example: if the web service used to apply for ESTA (https://www.application-esta.co.uk/special-characters-and) used the code above instead of transforming umlauts correctly, the ESTA application would either be refused or the traveller would have problems with American border control when entering the States.

Another example is flight tickets. Assume you have a flight-booking web application, the user provides a name with an accent, and your implementation just removes the accents and then uses the airline's web service to book the ticket. Your customer may not be allowed to board, since the name does not match any part of the passport.

User · Answer

The code page of Greek (ISO) can do it.

The information about this code page is available through System.Text.Encoding.GetEncodings(). Learn about it at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has code page 28597 and name iso-8859-7.

On to the code:

    string text = "Você está numa situação lamentável";

    string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
    // result: "Voce+esta+numa+situacao+lamentavel"

    string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
    // result: "Voce esta numa situacao lamentavel"

So, write this function:

    public string RemoveAcentuation(string text)
    {
        return
            System.Web.HttpUtility.UrlDecode(
                System.Web.HttpUtility.UrlEncode(
                    text, Encoding.GetEncoding("iso-8859-7")));
    }

Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597), because the first is the name and the second is the code page of the encoding.
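One practical caveat for the name/code page equivalence above: on .NET Core and later, legacy code pages such as iso-8859-7 are generally not available until a code pages provider has been registered. The following is a minimal sketch under that assumption; `CodePageDemo` and `SameEncoding` are hypothetical names used only for illustration, and whether the accent-stripping best-fit behaviour is identical on every runtime is worth verifying separately.

```csharp
using System;
using System.Text;

public static class CodePageDemo // hypothetical name, for illustration only
{
    public static bool SameEncoding()
    {
        // Assumption: running on .NET Core or later, where legacy code pages
        // must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding byName = Encoding.GetEncoding("iso-8859-7");
        Encoding byCodePage = Encoding.GetEncoding(28597);

        // Both lookups should resolve to the same code page.
        return byName.CodePage == byCodePage.CodePage;
    }

    public static void Main()
    {
        Console.WriteLine(SameEncoding());
    }
}
```

On older targets the provider may need the System.Text.Encoding.CodePages package; on .NET Framework the registration step is typically unnecessary.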

User · Answer

Try the HelperSharp package.

There is a method RemoveAccents:

    public static string RemoveAccents(this string source)
    {
        // 8 bit characters
        byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

        // 7 bit characters
        string t = Encoding.ASCII.GetString(b);
        Regex re = new Regex("[^a-zA-Z0-9]=-_/");
        string c = re.Replace(t, " ");
        return c;
    }

User · Answer

I needed something that converts all major Unicode characters, and the top-voted answer left a few out, so I've created a C# version of CodeIgniter's convert_accented_characters($str) that is easily customisable:

    using System;
    using System.Text;
    using System.Collections.Generic;

    public static class Strings
    {
        // Each key is a string of accented characters; the value is their replacement.
        static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
        {
            { "äæǽ", "ae" },
            { "öœ", "oe" },
            { "ü", "ue" },
            { "Ä", "Ae" },
            { "Ü", "Ue" },
            { "Ö", "Oe" },
            // ... remaining entries follow the same pattern: accented variants of
            // A-Z and a-z, ligatures such as AE, OE, IJ, ij and ss, and Cyrillic
            // letters such as Yo, Zh, Kh, Ts, Ch, Sh, Shch, Yu and Ya ...
        };

        public static char RemoveDiacritics(this char c)
        {
            foreach (KeyValuePair<string, string> entry in foreign_characters)
            {
                if (entry.Key.IndexOf(c) != -1)
                {
                    return entry.Value[0];
                }
            }
            return c;
        }

        public static string RemoveDiacritics(this string s)
        {
            StringBuilder sb = new StringBuilder();
            foreach (char c in s)
            {
                bool replaced = false;
                foreach (KeyValuePair<string, string> entry in foreign_characters)
                {
                    if (entry.Key.IndexOf(c) != -1)
                    {
                        sb.Append(entry.Value);
                        replaced = true;
                        break;
                    }
                }

                // No table entry matched: keep the character as it is.
                if (!replaced)
                {
                    sb.Append(c);
                }
            }
            return sb.ToString();
        }
    }

Usage:

    // for strings
    "crème brûlée".RemoveDiacritics(); // creme brulee

    // for chars
    "Ã"[0].RemoveDiacritics(); // A

User · Answer

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

    static string RemoveDiacritics(string text)
    {
        var normalizedString = text.Normalize(NormalizationForm.FormD);
        var stringBuilder = new StringBuilder();

        foreach (var c in normalizedString)
        {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's just a little complicated, but really you're looking at a complicated problem.

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
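To see what the decomposition approach covers and where it stops, here is a minimal sketch. It assumes the same logic as the RemoveDiacritics method above, wrapped in a hypothetical `DiacriticsDemo` class that is not part of the original post. Letters such as ø, ß, æ and ł have no canonical decomposition in Unicode, so FormD leaves them intact and they pass through unchanged.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo // hypothetical name, for illustration only
{
    public static string RemoveDiacritics(string text)
    {
        var sb = new StringBuilder(text.Length);

        // FormD splits each accented letter into a base letter plus combining marks.
        foreach (var c in text.Normalize(NormalizationForm.FormD))
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveDiacritics("crème brûlée"));     // creme brulee
        Console.WriteLine(RemoveDiacritics("Zuberbühler"));      // Zuberbuhler
        Console.WriteLine(RemoveDiacritics("smørrebrød ß æ ł")); // unchanged
    }
}
```

For characters like these, a small lookup table applied before or after the normalization step is one way to fill the gap.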
