The approach that finally worked for me is to decode the byte array with each candidate encoding and check the resulting string for invalid characters. If no invalid characters turn up, I assume the tested encoding is the right one for that data.
I only have basic Latin letters and German special characters to consider, so to determine the proper encoding for a byte array I check the decoded string with this method:
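    /// <summary>
    /// Detects characters outside a whitelist of expected characters.
    /// Use it to spot a string that was decoded with the wrong encoding.
    /// </summary>
    /// <param name="s">The decoded string to check.</param>
    /// <returns>true if at least one character is not in the whitelist; otherwise false.</returns>
    public static bool DetectInvalidChars(string s)
    {
        // Punctuation, whitespace and German special characters that count as valid.
        const string specialChars = "\r\n\t .,;:-_!\"'?()[]{}&%$§=*+~#@|<>äöüÄÖÜß/\\^€";
        // Requires "using System.Linq;" for Any().
        return s.Any(ch => !(
            specialChars.Contains(ch) ||
            (ch >= '0' && ch <= '9') ||
            (ch >= 'a' && ch <= 'z') ||
            (ch >= 'A' && ch <= 'Z')));
    }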
(NB: if you need to handle other Latin-based languages, extend the specialChars constant in the code accordingly.)
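To illustrate why this check catches a wrong encoding, here is a minimal sketch. It assumes the input is ISO-8859-1 text that gets decoded as UTF-8 by mistake; the class name InvalidCharDemo and the sample word are made up for the example, and the method is repeated only so the snippet compiles on its own.

```csharp
using System;
using System.Linq;
using System.Text;

public static class InvalidCharDemo
{
    // Same whitelist check as above, repeated so the example is self-contained.
    public static bool DetectInvalidChars(string s)
    {
        const string specialChars = "\r\n\t .,;:-_!\"'?()[]{}&%$§=*+~#@|<>äöüÄÖÜß/\\^€";
        return s.Any(ch => !(
            specialChars.Contains(ch) ||
            (ch >= '0' && ch <= '9') ||
            (ch >= 'a' && ch <= 'z') ||
            (ch >= 'A' && ch <= 'Z')));
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // "Grüße" as ISO-8859-1 bytes: ü and ß are the single bytes 0xFC and 0xDF.
        byte[] latin1Bytes = latin1.GetBytes("Gr\u00FC\u00DFe");

        // Those bytes are not valid UTF-8, so the UTF-8 decoder substitutes
        // U+FFFD replacement characters, which are not in the whitelist.
        string wrong = Encoding.UTF8.GetString(latin1Bytes);
        string right = latin1.GetString(latin1Bytes);

        Console.WriteLine(DetectInvalidChars(wrong)); // True
        Console.WriteLine(DetectInvalidChars(right)); // False
    }
}
```

The opposite mistake is caught as well: UTF-8 bytes decoded with a single-byte encoding typically produce characters such as Ã, which the whitelist also rejects.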
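Then I use it like this. I only expect UTF-8 or the system default encoding (Encoding.Default), which on .NET Framework is the system's ANSI code page. On .NET Core and .NET 5+, Encoding.Default is always UTF-8, so the fallback would need an explicit encoding there: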
    // Determine the encoding by detecting invalid characters in the decoded string.
    var invoiceXmlText = Encoding.UTF8.GetString(invoiceXmlBytes); // try UTF-8 first
    if (StringFuncs.DetectInvalidChars(invoiceXmlText))
    {
        invoiceXmlText = Encoding.Default.GetString(invoiceXmlBytes); // fall back to the default encoding
    }
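If more than two candidates are possible, the same idea can be written as a loop. The following is only a sketch under my own assumptions: EncodingGuesser and TryDecode are hypothetical names, the check is passed in as a delegate so that a whitelist method like DetectInvalidChars can be plugged in, and stripping a leading byte order mark is an addition that the snippet above does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingGuesser
{
    // Hypothetical helper: decodes with each candidate in order and returns
    // the first result that the supplied check does not flag as invalid.
    public static bool TryDecode(
        byte[] bytes,
        IEnumerable<Encoding> candidates,
        Func<string, bool> hasInvalidChars,
        out string text)
    {
        foreach (Encoding encoding in candidates)
        {
            // GetString keeps a leading byte order mark as U+FEFF, which a
            // whitelist check would reject, so it is removed here (assumption).
            string decoded = encoding.GetString(bytes).TrimStart('\uFEFF');
            if (!hasInvalidChars(decoded))
            {
                text = decoded;
                return true;
            }
        }

        text = string.Empty; // no candidate produced a clean string
        return false;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = latin1.GetBytes("Gr\u00FC\u00DFe");

        // Stand-in check for this demo only; in real use, pass the
        // whitelist method instead, e.g. StringFuncs.DetectInvalidChars.
        Func<string, bool> hasReplacementChar = s => s.IndexOf('\uFFFD') >= 0;

        bool ok = TryDecode(bytes, new[] { Encoding.UTF8, latin1 }, hasReplacementChar, out string text);
        Console.WriteLine(ok + ": " + text);
    }
}
```

The order of the candidates matters: UTF-8 should come first, because single-byte encodings such as ISO-8859-1 can decode any byte sequence, so they never fail on their own and only the whitelist check can rule them out.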