How can you strip non-ASCII characters from a string? (in C#)

Question

How can you strip non-ASCII characters from a string? (in C#)

User · Answer

Inspired by philcruz's Regular Expression solution, I've made a pure LINQ solution:

    public static string PureAscii(this string source, char nil = ' ')
    {
        var min = '\u0000';
        var max = '\u007F';
        return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
    }

    public static string ToText(this IEnumerable<char> source)
    {
        var buffer = new StringBuilder();
        foreach (var c in source)
            buffer.Append(c);
        return buffer.ToString();
    }

This is untested code.
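Since the answer above is marked as untested, here is a minimal self-contained sketch of the same idea that can be compiled on its own. The class name `PureAsciiSketch` and the `Main` demo are assumptions added for illustration, and the lower-bound check is dropped because no `char` is below U+0000. Note that this variant substitutes the placeholder `nil` instead of deleting, so the string keeps its length.

```csharp
using System;
using System.Linq;

public static class PureAsciiSketch
{
    // Replaces every UTF-16 code unit above U+007F with the placeholder `nil`.
    public static string PureAscii(this string source, char nil = ' ')
    {
        return new string(source.Select(c => c > '\u007F' ? nil : c).ToArray());
    }

    public static void Main()
    {
        Console.WriteLine("Räksmörgås".PureAscii());     // non-ASCII letters become spaces
        Console.WriteLine("Räksmörgås".PureAscii('?'));  // non-ASCII letters become '?'
    }
}
```

If the characters should disappear rather than be replaced, a filter such as `Where` is a closer fit than `Select`.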

User · Answer

I believe MonsCamus meant:

    parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);

User · Answer

Here is a pure .NET solution that doesn't use regular expressions:

    string inputString = "Räksmörgås";
    string asAscii = Encoding.ASCII.GetString(
        Encoding.Convert(
            Encoding.UTF8,
            Encoding.GetEncoding(
                Encoding.ASCII.EncodingName,
                new EncoderReplacementFallback(string.Empty),
                new DecoderExceptionFallback()
                ),
            Encoding.UTF8.GetBytes(inputString)
        )
    );

It may look cumbersome, but it should be intuitive. It uses the .NET ASCII encoding to convert a string. UTF8 is used during the conversion because it can represent any of the original characters. It uses an EncoderReplacementFallback to convert any non-ASCII character to an empty string.

User · Answer

I used this regex expression:

    string s = "søme string";
    Regex regex = new Regex(@"[^a-zA-Z0-9\s]", (RegexOptions)0);
    return regex.Replace(s, "");

User · Answer

If you want not to strip, but to actually convert Latin accented characters to non-accented characters, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)

User · Answer

Necromancing. Also, the method by bzlm can be used to remove characters that are not in an arbitrary charset, not just ASCII:

    // https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages
    // https://en.wikipedia.org/wiki/Windows_code_page#East_Asian_multi-byte_code_pages
    // https://en.wikipedia.org/wiki/Chinese_character_encoding
    System.Text.Encoding encRemoveAllBut = System.Text.Encoding.ASCII;
    encRemoveAllBut = System.Text.Encoding.GetEncoding(System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage); // System-encoding
    encRemoveAllBut = System.Text.Encoding.GetEncoding(1252); // Western European (iso-8859-1)
    encRemoveAllBut = System.Text.Encoding.GetEncoding(1251); // Windows-1251/KOI8-R
    encRemoveAllBut = System.Text.Encoding.GetEncoding("ISO-8859-5"); // used by less than 0.1% of websites
    encRemoveAllBut = System.Text.Encoding.GetEncoding(37); // IBM EBCDIC US-Canada
    encRemoveAllBut = System.Text.Encoding.GetEncoding(500); // IBM EBCDIC Latin 1
    encRemoveAllBut = System.Text.Encoding.GetEncoding(936); // Chinese Simplified
    encRemoveAllBut = System.Text.Encoding.GetEncoding(950); // Chinese Traditional
    encRemoveAllBut = System.Text.Encoding.ASCII; // putting ASCII again, as to answer the question

    // https://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c
    string inputString = "Räksmörgås";
    string asAscii = encRemoveAllBut.GetString(
        System.Text.Encoding.Convert(
            System.Text.Encoding.UTF8,
            System.Text.Encoding.GetEncoding(
                encRemoveAllBut.CodePage,
                new System.Text.EncoderReplacementFallback(string.Empty),
                new System.Text.DecoderExceptionFallback()
                ),
            System.Text.Encoding.UTF8.GetBytes(inputString)
        )
    );
    System.Console.WriteLine(asAscii);

AND for those that just want to remove the accents (caution, because Normalize != Latinize != Romanize):

    // string str = Latinize("…");
    public static string Latinize(string stIn)
    {
        // Special treatment for German Umlauts
        stIn = stIn.Replace("ä", "ae");
        stIn = stIn.Replace("ö", "oe");
        stIn = stIn.Replace("ü", "ue");

        stIn = stIn.Replace("Ä", "Ae");
        stIn = stIn.Replace("Ö", "Oe");
        stIn = stIn.Replace("Ü", "Ue");
        // End special treatment for German Umlauts

        string stFormD = stIn.Normalize(System.Text.NormalizationForm.FormD);
        System.Text.StringBuilder sb = new System.Text.StringBuilder();

        for (int ich = 0; ich < stFormD.Length; ich++)
        {
            System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);

            if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
            {
                sb.Append(stFormD[ich]);
            } // End if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)

        } // Next ich

        //return (sb.ToString().Normalize(System.Text.NormalizationForm.FormC));
        return (sb.ToString().Normalize(System.Text.NormalizationForm.FormKC));
    } // End Function Latinize

User · Answer

I use this regular expression to filter out bad characters in a filename:

    Regex.Replace(directory, @"[^a-zA-Z0-9\\:_\- ]", "")

It keeps only letters, digits, backslash, colon, underscore, hyphen and space. That should be all the characters allowed for filenames.

User · Answer

I found the following slightly altered range useful for parsing comment blocks out of a database. It means that you won't have to contend with tab and escape characters, which would cause a CSV field to become upset:

    parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);

If you want to avoid other special characters or particular punctuation, check the ASCII table.
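To make the difference between the two ranges in this thread concrete, the sketch below runs both patterns over the same input. The class and method names are hypothetical, chosen only for this illustration. Both patterns remove the tab, but only the printable range `\u0020-\u007E` also removes DEL (U+007F).

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeComparisonSketch
{
    // Keeps only printable ASCII: space (U+0020) through tilde (U+007E).
    public static string KeepPrintable(string s) =>
        Regex.Replace(s, @"[^\u0020-\u007E]", string.Empty);

    // Keeps U+001F through U+007F, so the unit separator and DEL survive.
    public static string KeepAlteredRange(string s) =>
        Regex.Replace(s, @"[^\u001F-\u007F]", string.Empty);

    public static void Main()
    {
        string sample = "a\tb\u007Fc\u00E9";  // tab, DEL and 'é' between the letters
        Console.WriteLine(KeepPrintable(sample).Length);     // 3: "abc"
        Console.WriteLine(KeepAlteredRange(sample).Length);  // 4: "ab", DEL, "c"
    }
}
```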

User · Answer

This is not optimal performance-wise, but a pretty straightforward LINQ approach:

    string strippedString = new string(
        yourString.Where(c => c <= sbyte.MaxValue).ToArray()
        );

The downside is that all the "surviving" characters are first put into an array of type char[], which is then thrown away after the string constructor no longer uses it.
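One way to sidestep the temporary array from `ToArray()` is to append the surviving characters to a `StringBuilder` directly. This is only a sketch with hypothetical names (`StripWithBuilderSketch`, `StripNonAscii`); the `StringBuilder` still has its own internal buffer, so whether it is actually faster would need measuring.

```csharp
using System;
using System.Text;

public static class StripWithBuilderSketch
{
    // Copies only code units in the ASCII range (U+0000..U+007F) into a StringBuilder.
    public static string StripNonAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= '\u007F')
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripNonAscii("Räksmörgås"));  // prints "Rksmrgs"
    }
}
```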

User · Answer

No need for regex, just use encoding:

    sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
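One caveat worth checking before relying on this: the default `Encoding.ASCII` does not drop characters it cannot encode, it replaces each of them with `?`. The sketch below (class and method names are made up for the example) shows the round trip on a sample string.

```csharp
using System;
using System.Text;

public static class AsciiRoundTripSketch
{
    // Round-trips through the default ASCII encoding, as in the answer above.
    public static string RoundTrip(string input) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(input));

    public static void Main()
    {
        Console.WriteLine(RoundTrip("Räksmörgås"));  // prints "R?ksm?rg?s"
    }
}
```

To remove the characters instead of replacing them, the encoding has to be created with an empty `EncoderReplacementFallback`, as in bzlm's answer.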

User · Answer

    string s = "søme string";
    s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
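If the pattern is applied in many places, it can be wrapped in a small helper that builds the `Regex` once. The names below (`RegexStripSketch`, `StripNonAscii`) are hypothetical. Because the pattern works on UTF-16 code units, both halves of a surrogate pair, such as an emoji, are removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripSketch
{
    // Built once and reused; matches runs of code units outside U+0000..U+007F.
    private static readonly Regex NonAscii =
        new Regex(@"[^\u0000-\u007F]+", RegexOptions.Compiled);

    public static string StripNonAscii(string s) => NonAscii.Replace(s, string.Empty);

    public static void Main()
    {
        Console.WriteLine(StripNonAscii("søme string"));         // prints "sme string"
        Console.WriteLine(StripNonAscii("ok \uD83D\uDE00 ok"));  // the emoji's surrogate pair is removed
    }
}
```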

User · Answer

I came here looking for a solution for extended ASCII characters, but couldn't find one. The closest I found is bzlm's solution, but that works only for ASCII codes up to 127. (Obviously you can replace the encoding type in his code, but I think it was a bit complex to understand, hence this version.)

Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1. It finds and strips out non-ASCII characters greater than 255:

    Dim str1 As String = "…"

    Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1",
                                                    New EncoderReplacementFallback(String.Empty),
                                                    New DecoderReplacementFallback())

    Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)

    Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)

    Console.WriteLine(str2)
    ' Output: …

Here's a working fiddle for the code. Replace the encoding as per the requirement; the rest should remain the same.
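Since the question asks about C#, here is a rough C# rendering of the same approach. It is a sketch, not the answerer's code: the class name, method name and sample string are assumptions for illustration. Characters up to U+00FF survive the round trip, and anything ISO-8859-1 cannot represent is replaced by the empty string.

```csharp
using System;
using System.Text;

public static class Latin1StripSketch
{
    // Drops every character that ISO-8859-1 (U+0000..U+00FF) cannot represent.
    public static string StripNonLatin1(string input)
    {
        Encoding latin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback());

        byte[] bytes = latin1.GetBytes(input);
        return latin1.GetString(bytes);
    }

    public static void Main()
    {
        // 'é' (U+00E9) is kept; the euro sign (U+20AC) and check mark (U+2713) are dropped.
        Console.WriteLine(StripNonLatin1("caf\u00E9 \u20AC5 \u2713"));
    }
}
```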
