[encoding] Windows-1252 to UTF-8 encoding

I've copied certain files from a Windows machine to a Linux machine. So all the Windows encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.

This question is related to encoding utf-8 character-encoding windows-1252

The answer is


Use the iconv command.

To make sure the file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests current encoding as the default; if it's Windows-1252 (or any 1-byte codepage, for that matter), it would say "ANSI".


iconv -f WINDOWS-1252 -t UTF-8 filename.txt


If you want to rename multiple files in a single command - let's say you want to convert all *.txt files - here is the command:

find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;

UTF-8 does not have a BOM as it is both superfluous and invalid. Where a BOM is helpful is in UTF-16 which may be byte swapped as in the case of Microsoft. UTF-16 if for internal representation in a memory buffer. Use UTF-8 for interchange. By default both UTF-8, anything else derived from US-ASCII and UTF-16 are natural/network byte order. The Microsoft UTF-16 requires a BOM as it is byte swapped.

To covert Windows-1252 to ISO8859-15, I first convert ISO8859-1 to US-ASCII for codes with similar glyphs. I then convert Windows-1252 up to ISO8859-15, other non-ISO8859-15 glyphs to multiple US-ASCII characters.


You can change the encoding of a file with an editor such as notepad++. Just go to Encoding and select what you want.

I always prefer the Windows 1252


Found this documentation for the TYPE command:

Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:

For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G    
CHCP 1252 >NUL    
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL    
CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt    
CHCP %_codepage%    

The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.


There's no general way to tell if a file is encoded with a specific encoding. Remember that an encoding is nothing more but an "agreement" how the bits in a file should be mapped to characters.

If you don't know which of your files are actually already encoded in UTF-8 and which ones are encoded in windows-1252, you will have to inspect all files and find out yourself. In the worst case that could mean that you have to open every single one of them with either of the two encodings and see whether they "look" correct -- i.e., all characters are displayed correctly. Of course, you may use tool support in order to do that, for instance, if you know for sure that certain characters are contained in the files that have a different mapping in windows-1252 vs. UTF-8, you could grep for them after running the files through 'iconv' as mentioned by Seva Akekseyev.

Another lucky case for you would be, if you know that the files actually contain only characters that are encoded identically in both UTF-8 and windows-1252. In that case, of course, you're done already.


Here's a transcription of another answer I gave to a similar question:

If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.

I made a function that addresses all this issues. It´s called Encoding::toUTF8().

You dont need to know what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

Update:

I've included another function, Encoding::fixUFT8(), wich will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().


If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.

While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid UTF-8. So:

recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

Will spit out errors for all cp1252 files, and then proceed to convert them to UTF8.

I would wrap this into a cleaner bash script, keeping a backup of every converted file.

Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.


Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8

Examples related to character-encoding

Changing PowerShell's default output encoding to UTF-8 JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10) Change the encoding of a file in Visual Studio Code What is the difference between utf8mb4 and utf8 charsets in MySQL? How to open html file? All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"? UTF-8 output from PowerShell ERROR 1115 (42000): Unknown character set: 'utf8mb4' "for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte How to make php display \t \n as tab and new line instead of characters

Examples related to windows-1252

Windows-1252 to UTF-8 encoding