[encoding] Windows-1252 to UTF-8 encoding

If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.

While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid UTF-8. So:

recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

Will spit out errors for all cp1252 files, and then proceed to convert them to UTF8.

I would wrap this into a cleaner bash script, keeping a backup of every converted file.

Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8

Examples related to character-encoding

Changing PowerShell's default output encoding to UTF-8 JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10) Change the encoding of a file in Visual Studio Code What is the difference between utf8mb4 and utf8 charsets in MySQL? How to open html file? All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"? UTF-8 output from PowerShell ERROR 1115 (42000): Unknown character set: 'utf8mb4' "for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte How to make php display \t \n as tab and new line instead of characters

Examples related to windows-1252

Windows-1252 to UTF-8 encoding