[encoding] How do I correct the character encoding of a file?

When you see character sequences like ç and é, it's usually an indication that a UTF-8 file has been opened by a program that reads it in as ANSI (or similar). Unicode characters such as these:

U+00C2 Latin capital letter A with circumflex
U+00C3 Latin capital letter A with tilde
U+0082 Break permitted here
U+0083 No break here

tend to show up in ANSI text because of the variable-byte strategy that UTF-8 uses. This strategy is explained very well here.

The advantage for you is that the appearance of these odd characters makes it relatively easy to find, and thus replace, instances of incorrect conversion.

I believe that, since ANSI always uses 1 byte per character, you can handle this situation with a simple search-and-replace operation. Or more conveniently, with a program that includes a table mapping between the offending sequences and the desired characters, like these:

“ -> “ # should be an opening double curly quote
â€? -> ” # should be a closing double curly quote

Any given text, assuming it's in English, will have a relatively small number of different types of substitutions.

Hope that helps.

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8

Examples related to character-encoding

Changing PowerShell's default output encoding to UTF-8 JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10) Change the encoding of a file in Visual Studio Code What is the difference between utf8mb4 and utf8 charsets in MySQL? How to open html file? All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"? UTF-8 output from PowerShell ERROR 1115 (42000): Unknown character set: 'utf8mb4' "for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte How to make php display \t \n as tab and new line instead of characters

Examples related to text-files

How to edit a text file in my terminal Reading specific columns from a text file in python Read a local text file using Javascript Split text file into smaller multiple text file using command line How do I read a text file of about 2 GB? Java - How Can I Write My ArrayList to a file, and Read (load) that file to the original ArrayList? Copying from one text file to another using Python Getting all file names from a folder using C# Reading From A Text File - Batch How to import data from text file to mysql database

Examples related to codepages

What is ANSI format? How do you properly use WideCharToMultiByte How do I correct the character encoding of a file?