I have an ANSI-encoded text file that should not have been encoded as ANSI, as it contains accented characters that ANSI does not support. I would rather work with UTF-8.
Can the data be decoded correctly, or is it lost in transcoding?
What tools could I use?
Here is a sample of what I have:
Ã§ Ã©
I can tell from context (cafÃ© should be café) that these should be these two characters:
ç é
Tags: encoding, utf-8, character-encoding, text-files, codepages
I found this question when searching for a solution to a code page issue I had with Chinese characters, but in the end my problem was just an issue with Windows not displaying them correctly in the UI.
In case anyone else has that same issue, you can fix it simply by changing the locale in Windows to China and then back again.
I found the solution here:
Also upvoted Gabriel's answer, as looking at the data in Notepad++ was what tipped me off about Windows.
Follow these steps with Notepad++
1- Copy the original text
2- In Notepad++, open a new file, change Encoding -> pick an encoding you think the original text follows. Also try the encoding "ANSI", as Unicode files are sometimes read as ANSI by certain programs
3- Paste
4- Then convert to Unicode by going again to the same menu: Encoding -> "Encode in UTF-8" (not "Convert to UTF-8"), and hopefully it will become readable
The above steps apply for most languages. You just need to guess the original encoding before pasting into Notepad++, then convert through the same menu to an alternate Unicode-based encoding to see whether things become readable. The same round trip can also be scripted, as sketched below.
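If the usual Windows-1252 ("ANSI") misread is what happened to your file, a few lines of Python can perform the same round trip. This is a minimal sketch under that assumption; the codepage name may need adjusting for your system:

# Minimal sketch: undo UTF-8 text that was wrongly decoded as Windows-1252.
# Assumes "ANSI" here means the cp1252 codepage; adjust if yours differs.
def fix_mojibake(garbled: str) -> str:
    # Re-encode the wrongly decoded string back into its original bytes,
    # then decode those bytes as the UTF-8 they really were.
    return garbled.encode("cp1252").decode("utf-8")

print(fix_mojibake("cafÃ©"))  # -> café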
Most languages exist in two forms of encoding:
1- The old legacy ANSI (ASCII) form, only 8 bits, was used initially by most computers. 8 bits allowed only 256 possibilities; 128 of them were the regular Latin and control characters, and the final 128 values were read differently depending on the PC's language settings.
2- The new Unicode standard (up to 32 bits) gives a unique code to each character in all currently known languages, with plenty more to come. If a file is Unicode, it should be understood on any PC that has the language's font installed. Note that even UTF-8 goes up to 32 bits per character and is just as broad as UTF-16 and UTF-32; it merely stays at 8 bits for Latin characters to save disk space.
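You can see that trade-off by comparing the byte lengths of the same strings under different encodings; a small illustrative snippet:

# Compare how many bytes the same text needs in different encodings.
for s in ["cafe", "café", "中文"]:
    print(s, len(s.encode("utf-8")), len(s.encode("utf-16-le")), len(s.encode("utf-32-le")))
# "cafe" needs 4 bytes in UTF-8 (same as ASCII), 8 in UTF-16, 16 in UTF-32.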
When you see character sequences like Ã§ and Ã©, it's usually an indication that a UTF-8 file has been opened by a program that reads it in as ANSI (or similar). Unicode characters such as these:
U+00C2 Latin capital letter A with circumflex
U+00C3 Latin capital letter A with tilde
U+0082 Break permitted here
U+0083 No break here
tend to show up in ANSI text because of the variable-byte strategy that UTF-8 uses. This strategy is explained very well here.
The advantage for you is that the appearance of these odd characters makes it relatively easy to find, and thus replace, instances of incorrect conversion.
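To see where those odd characters come from, here is a tiny demonstration (a sketch, not from the original answer) of one UTF-8 character surfacing as two single-byte characters:

# The two UTF-8 bytes of "é" become two characters under a one-byte codepage.
utf8_bytes = "é".encode("utf-8")     # b'\xc3\xa9'
print(utf8_bytes.decode("latin-1"))  # Ã© -- one character per byte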
I believe that, since ANSI always uses 1 byte per character, you can handle this situation with a simple search-and-replace operation. Or, more conveniently, with a program that includes a table mapping the offending sequences to the desired characters, like these:
â€œ -> “ # should be an opening double curly quote
â€? -> ” # should be a closing double curly quote
Any given text, assuming it's in English, will have a relatively small number of different types of substitutions.
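As a sketch of that table-driven approach (the mapping entries below are illustrative, not exhaustive):

# Illustrative table of mojibake sequences and their intended characters.
FIXES = {
    "â€œ": "“",      # opening double curly quote
    "â€\x9d": "”",   # closing double curly quote (0x9D often shows as "?")
    "Ã©": "é",
    "Ã§": "ç",
}

def clean(text: str) -> str:
    for bad, good in FIXES.items():
        text = text.replace(bad, good)
    return text

print(clean("cafÃ©"))  # -> café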
Hope that helps.
If you see question marks in the file, or if the accents are already lost, going back to UTF-8 will not help your cause. E.g. if café became cafe, changing the encoding alone will not help (you'll need the original data).
Can you paste some text here? That'll help us answer for sure.
In the Sublime Text editor: File -> Reopen with Encoding -> choose the correct encoding.
Generally, the encoding is auto-detected, but if not, you can use the above method.
And then there is the somewhat older recode program.
There are programs that try to detect the encoding of a file, like chardet. Then you can convert it to a different encoding using iconv. But that requires that the original text is still intact and no information has been lost (for example, by removing accents or whole accented letters).
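For example, here is a minimal detect-then-transcode sketch in Python, assuming the chardet package is installed (the filenames are placeholders):

import chardet

# Read the raw bytes so the encoding can be guessed before decoding.
with open("input.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess["encoding"])

# Write the same text back out as UTF-8.
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)

Check the reported confidence before trusting the guess; detection is heuristic.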
I found a simple way to auto-detect file encodings: change the file to a text file (on a Mac, rename the file extension to .txt) and drag it into a Mozilla Firefox window (or use File -> Open). Firefox will detect the encoding; you can see what it came up with under View -> Character Encoding.
I changed my file's encoding using TextMate once I knew the correct encoding: File -> Reopen using encoding and choose your encoding. Then File -> Save As and change the encoding to UTF-8 and the line endings to LF (or whatever you want).
With Vim from the command line (this rewrites the file as UTF-8, relying on Vim having detected the original encoding correctly):
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
On OS X, Synalyze It! lets you display parts of your file in different encodings (all that are supported by the ICU library). Once you know the source encoding, you can copy the whole file (bytes) via the clipboard and paste it into a new document where the target encoding (UTF-8 or whatever you like) is selected.
Also very helpful when working with UTF-8 or other Unicode representations is UnicodeChecker.
Use iconv - see Best way to convert text files between character sets?