Convert a Unicode string to an escaped ASCII string

Question

How can I convert this string:

    This string contains the Unicode character Pi(π)

into an escaped ASCII string:

    This string contains the Unicode character Pi(\u03a0)

and vice versa?

The current Encoding available in C# converts the π character to "?". I need to preserve that character.

User · Answer

Here is my current implementation:

    using System;
    using System.Globalization;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;

    public static class UnicodeStringExtensions
    {
        public static string EncodeNonAsciiCharacters(this string value)
        {
            // Two bytes per UTF-16 code unit.
            var bytes = Encoding.Unicode.GetBytes(value);
            var sb = StringBuilderCache.Acquire(value.Length);
            bool encodedsomething = false;
            for (int i = 0; i < bytes.Length; i += 2)
            {
                var c = BitConverter.ToUInt16(bytes, i);
                if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D)
                {
                    sb.Append((char)c);
                }
                else
                {
                    sb.Append($"\\u{c:x4}");
                    encodedsomething = true;
                }
            }
            if (!encodedsomething)
            {
                // Nothing was escaped, so return the original string unchanged.
                StringBuilderCache.Release(sb);
                return value;
            }
            return StringBuilderCache.GetStringAndRelease(sb);
        }

        public static string DecodeEncodedNonAsciiCharacters(this string value)
            => Regex.Replace(value, /*language=regexp*/ @"(?:\\u[a-fA-F0-9]{4})+", Decode);

        static readonly string[] Splitsequence = new[] { "\\u" };

        private static string Decode(Match m)
        {
            // Decode a whole run of \uXXXX escapes at once so surrogate pairs stay together.
            var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
                .Select(s => ushort.Parse(s, NumberStyles.HexNumber))
                .SelectMany(u => BitConverter.GetBytes(u))
                .ToArray();
            return Encoding.Unicode.GetString(bytes);
        }
    }

This passes a test:

    public void TestBigUnicode()
    {
        var s = "\U00020000";
        var encoded = s.EncodeNonAsciiCharacters();
        var decoded = encoded.DecodeEncodedNonAsciiCharacters();
        Assert.AreEqual(s, decoded); // Assert.Equals does not perform a test assertion
    }

with the encoded value: "\ud840\udc00"

This implementation makes use of a StringBuilderCache (reference source link). That class is internal to the framework, so it has to be copied into your own project.

User · Answer

To store actual Unicode code points, you have to first decode the String's UTF-16 code units to UTF-32 code units (which are currently the same as the Unicode code points). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting bytes to the StringBuilder as needed, i.e.

    static void Main(string[] args)
    {
        String originalString = "This string contains the unicode character Pi(π)";
        Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
        StringBuilder asAscii = new StringBuilder();
        for (int idx = 0; idx < bytes.Length; idx += 4)
        {
            // Four bytes per UTF-32 code unit.
            uint codepoint = BitConverter.ToUInt32(bytes, idx);
            if (codepoint <= 127)
                asAscii.Append(Convert.ToChar(codepoint));
            else
                // x4 is a minimum width: code points above U+FFFF yield more than four hex digits.
                asAscii.AppendFormat("\\u{0:x4}", codepoint);
        }
        Console.WriteLine("Final string: {0}", asAscii);
        Console.ReadKey();
    }
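As a possible variation on the code-point idea above, the sketch below walks the string directly and uses char.IsSurrogatePair and char.ConvertToUtf32 instead of a UTF-32 byte array. The class name CodePointEscaper and the choice to write supplementary code points as \UXXXXXXXX (C#'s own eight-digit escape) are assumptions made for illustration, not part of the answer.

```csharp
using System;
using System.Text;

// Hypothetical helper, not from the original answer.
public static class CodePointEscaper
{
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            if (char.IsSurrogatePair(value, i))
            {
                // Supplementary code point: combine both UTF-16 code units.
                int codePoint = char.ConvertToUtf32(value, i);
                sb.AppendFormat("\\U{0:x8}", codePoint);
                i++; // skip the low surrogate that was just consumed
            }
            else if (value[i] > 0x7F)
            {
                // Non-ASCII character in the Basic Multilingual Plane
                // (or a lone surrogate, which is escaped as-is).
                sb.AppendFormat("\\u{0:x4}", (int)value[i]);
            }
            else
            {
                sb.Append(value[i]);
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, CodePointEscaper.Escape("Pi(π)") gives Pi(\u03c0), and a character outside the BMP such as "\U00020000" is written as a single \U00020000 escape rather than two surrogate escapes. A matching decoder would need to recognize both forms.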

User · Answer

You need to use the Convert() method in the Encoding class:

1. Create an Encoding object that represents ASCII encoding
2. Create an Encoding object that represents Unicode encoding
3. Call Encoding.Convert() with the source encoding, the destination encoding, and the bytes of the string to be converted

There is an example here:

    using System;
    using System.Text;

    namespace ConvertExample
    {
        class ConvertExampleClass
        {
            static void Main()
            {
                string unicodeString = "This string contains the unicode character Pi(\u03a0)";

                // Create two different encodings.
                Encoding ascii = Encoding.ASCII;
                Encoding unicode = Encoding.Unicode;

                // Convert the string into a byte[].
                byte[] unicodeBytes = unicode.GetBytes(unicodeString);

                // Perform the conversion from one encoding to the other.
                // Note: Encoding.ASCII replaces characters it cannot represent with '?'.
                byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);

                // Convert the new byte[] into a char[] and then into a string.
                // This is a slightly different approach to converting to illustrate
                // the use of GetCharCount/GetChars.
                char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
                ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
                string asciiString = new string(asciiChars);

                // Display the strings created before and after the conversion.
                Console.WriteLine("Original string: {0}", unicodeString);
                Console.WriteLine("Ascii converted string: {0}", asciiString);
            }
        }
    }

User · Answer

As a one-liner:

    var result = Regex.Replace(input, @"[^\x00-\x7F]", c =>
        string.Format(@"\u{0:x4}", (int)c.Value[0]));

User · Answer

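    // Apply proc to every character and concatenate the results.
    string StringFold(string input, Func<char, string> proc)
    {
        return string.Concat(input.Select(proc).ToArray());
    }

    // Escape non-ASCII characters as \uXXXX; pass ASCII through unchanged.
    string FoldProc(char input)
    {
        if (input >= 128)
        {
            return string.Format(@"\u{0:x4}", (int)input);
        }
        return input.ToString();
    }

    string EscapeToAscii(string input)
    {
        return StringFold(input, FoldProc);
    }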

User · Answer

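A small patch to @Adam Sills's answer that avoids a FormatException for input strings such as "c:\u00ab\otherdirectory\". In addition, RegexOptions.Compiled makes repeated matching faster, at the cost of a slower one-time construction of the Regex:

    private static Regex DECODING_REGEX = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);

    // The exact placeholder literal did not survive in this copy of the answer;
    // "#!#" is an assumed stand-in. Any token that cannot occur in the input works.
    private const string PLACEHOLDER = @"#!#";

    public static string DecodeEncodedNonAsciiCharacters(this string value)
    {
        // Hide escaped backslashes so that "\\u00ab" is not mistaken for an escape.
        return DECODING_REGEX.Replace(
            value.Replace(@"\\", PLACEHOLDER),
            m =>
            {
                return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString();
            })
            .Replace(PLACEHOLDER, @"\\");
    }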

User · Answer

    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
            StringBuilder asAscii = new StringBuilder(); // store final ascii string and Unicode points
            foreach (char c in originalString)
            {
                // test if char is ascii, otherwise convert to Unicode Code Point
                int cint = Convert.ToInt32(c);
                if (cint <= 127 && cint >= 0)
                    asAscii.Append(c);
                else
                    asAscii.Append(String.Format("\\u{0:x4} ", cint).Trim());
            }
            Console.WriteLine("Final string: {0}", asAscii);
            Console.ReadKey();
        }
    }

All non-ASCII chars are converted to their Unicode code point representation and appended to the final string. Because the loop works on individual char values, characters outside the Basic Multilingual Plane come out as two \uXXXX surrogate escapes.

User · Answer

This goes back and forth to and from the \uXXXX format.

    using System;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string unicodeString = "This function contains a unicode character pi (\u03a0)";

            Console.WriteLine(unicodeString);

            string encoded = EncodeNonAsciiCharacters(unicodeString);
            Console.WriteLine(encoded);

            string decoded = DecodeEncodedNonAsciiCharacters(encoded);
            Console.WriteLine(decoded);
        }

        static string EncodeNonAsciiCharacters(string value)
        {
            StringBuilder sb = new StringBuilder();
            foreach (char c in value)
            {
                if (c > 127)
                {
                    // This character is too big for ASCII
                    string encodedValue = "\\u" + ((int)c).ToString("x4");
                    sb.Append(encodedValue);
                }
                else
                {
                    sb.Append(c);
                }
            }
            return sb.ToString();
        }

        static string DecodeEncodedNonAsciiCharacters(string value)
        {
            return Regex.Replace(
                value,
                @"\\u(?<Value>[a-zA-Z0-9]{4})",
                m =>
                {
                    return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString();
                });
        }
    }

Outputs:

    This function contains a unicode character pi (Π)
    This function contains a unicode character pi (\u03a0)
    This function contains a unicode character pi (Π)

User · Answer

For unescaping, you can simply use these functions:

    System.Text.RegularExpressions.Regex.Unescape(string)
    System.Uri.UnescapeDataString(string)

I suggest using this method (it works better with UTF-8):

    UnescapeDataString(string)
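To illustrate the first of these functions, here is a minimal sketch that reverses \uXXXX escapes with Regex.Unescape. The wrapper class UnescapeDemo is a hypothetical name used only for this example. Two caveats apply: Regex.Unescape also converts other backslash sequences such as \n and \t, so it fits best when the input contains no other backslashes, and Uri.UnescapeDataString decodes percent-encoded sequences (%XX) rather than \uXXXX escapes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper, shown only to demonstrate Regex.Unescape.
public static class UnescapeDemo
{
    // Converts \uXXXX (and other regex-style escapes) back to characters.
    public static string Decode(string escaped)
    {
        return Regex.Unescape(escaped);
    }
}
```

Calling UnescapeDemo.Decode(@"Pi(\u03a0)") returns the string Pi(Π).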
