So the issue is that UTF-8 code units (bytes) have been stored as a sequence of 16-bit code units in a C# string, one byte per char. You need to verify that each code unit is within the range of a byte, copy those values into bytes, and then decode the resulting UTF-8 byte sequence into a normal UTF-16 string.
// Requires: using System.Diagnostics; using System.Text;
// (As an extension method, this must be declared inside a static class.)
public static string DecodeFromUtf8(this string utf8String)
{
    // Copy the string's 16-bit code units back into the UTF-8 bytes they hold.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i)
    {
        Debug.Assert(utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    // Decode the recovered UTF-8 bytes into a proper UTF-16 string.
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}
DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
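If the Debug.Assert fires, the string contains a char above 255; that means it was not produced by widening UTF-8 bytes one-per-char in the first place, the (byte) cast would silently truncate data, and this fix-up does not apply.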
This is easy, but it would be better to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is code that converts bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
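If you can fix it at that spot, the change is simply decoding with the right encoding. A minimal sketch, assuming utf8Bytes really holds UTF-8-encoded data:

byte[] utf8Bytes = { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 }; // "déjà" encoded as UTF-8

// Wrong (on platforms where Encoding.Default is a legacy ANSI code page):
// each byte becomes its own character, producing mojibake like "dÃ©jÃ ".
string mangled = Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length);

// Right: decode with the encoding the bytes were actually written in.
string correct = Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length); // "déjà"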
Alternatively, if you know which incorrect encoding was used to produce the string, and that incorrect conversion was lossless (usually the case when the incorrect encoding is a single-byte encoding), you can simply invert that encoding step to recover the original UTF-8 bytes and then do the correct conversion from UTF-8:
public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // The inverse of `mistake.GetString(originalBytes)`: re-encode with the
    // mistaken encoding to recover the original bytes, then decode correctly.
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}
UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8); // déjà
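One caveat: on .NET Core and .NET 5+, Encoding.GetEncoding(1252) throws unless the legacy code pages have been registered. If that applies to you, add the System.Text.Encoding.CodePages package and register the provider once at startup (a small sketch, assuming that package is referenced):

// Makes Windows-1252 and other legacy code pages resolvable via Encoding.GetEncoding.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);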