I have an ANSI-encoded text file that should not have been encoded as ANSI, as it contains accented characters that ANSI does not support. I would rather work with UTF-8.
Can the data be decoded correctly, or is it lost in transcoding?
What tools could I use?
Here is a sample of what I have:
Ã§ Ã©
I can tell from context (cafÃ© should be café) that these should be these two characters:
ç é
Tags: encoding, utf-8, character-encoding, text-files, codepages
I found this question when searching for a solution to a code page issue I had with Chinese characters, but in the end my problem was just an issue with Windows not displaying them correctly in the UI.
In case anyone else has that same issue, you can fix it simply by changing the locale in Windows to China and then back again.
I found the solution here:
Also upvoted Gabriel's answer, as looking at the data in Notepad++ was what tipped me off about Windows.
Follow these steps with Notepad++
1- Copy the original text
2- In Notepad++, open a new file, change Encoding -> pick an encoding you think the original text follows. Also try the encoding "ANSI", as Unicode files are sometimes read as ANSI by certain programs
3- Paste
4- Then convert to Unicode by going again to the same menu: Encoding -> "Encode in UTF-8" (not "Convert to UTF-8"), and hopefully it will become readable
The above steps apply for most languages. You just need to guess the original encoding before pasting into Notepad++, then convert through the same menu to an alternate Unicode-based encoding to see whether things become readable. The same round trip can also be scripted, as sketched below.
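If the usual Windows-1252 ("ANSI") misread is what happened to your file, a few lines of Python can perform the same round trip. This is a minimal sketch under that assumption; the codepage name may need adjusting for your system:

# Minimal sketch: undo UTF-8 text that was wrongly decoded as Windows-1252.
# Assumes "ANSI" here means the cp1252 codepage; adjust if yours differs.
def fix_mojibake(garbled: str) -> str:
    # Re-encode the wrongly decoded string back into its original bytes,
    # then decode those bytes as the UTF-8 they really were.
    return garbled.encode("cp1252").decode("utf-8")

print(fix_mojibake("cafÃ©"))  # -> café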
Most languages exist in two forms of encoding:
1- The old legacy ANSI (ASCII) form, only 8 bits, was used initially by most computers. 8 bits allowed only 256 possibilities; 128 of them were the regular Latin and control characters, and the final 128 values were read differently depending on the PC's language settings.
2- The new Unicode standard (up to 32 bits) gives a unique code to each character in all currently known languages, with plenty more to come. If a file is Unicode, it should be understood on any PC that has the language's font installed. Note that even UTF-8 goes up to 32 bits per character and is just as broad as UTF-16 and UTF-32; it merely stays at 8 bits for Latin characters to save disk space.
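You can see that trade-off by comparing the byte lengths of the same strings under different encodings; a small illustrative snippet:

# Compare how many bytes the same text needs in different encodings.
for s in ["cafe", "café", "中文"]:
    print(s, len(s.encode("utf-8")), len(s.encode("utf-16-le")), len(s.encode("utf-32-le")))
# "cafe" needs 4 bytes in UTF-8 (same as ASCII), 8 in UTF-16, 16 in UTF-32.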
When you see character sequences like Ã§ and Ã©, it's usually an indication that a UTF-8 file has been opened by a program that reads it in as ANSI (or similar). Unicode characters such as these:
U+00C2 Latin capital letter A with circumflex
U+00C3 Latin capital letter A with tilde
U+0082 Break permitted here
U+0083 No break here
tend to show up in ANSI text because of the variable-byte strategy that UTF-8 uses. This strategy is explained very well here.
The advantage for you is that the appearance of these odd characters makes it relatively easy to find, and thus replace, instances of incorrect conversion.
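To see where those odd characters come from, here is a tiny demonstration (a sketch, not from the original answer) of one UTF-8 character surfacing as two single-byte characters:

# The two UTF-8 bytes of "é" become two characters under a one-byte codepage.
utf8_bytes = "é".encode("utf-8")     # b'\xc3\xa9'
print(utf8_bytes.decode("latin-1"))  # Ã© -- one character per byte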
I believe that, since ANSI always uses 1 byte per character, you can handle this situation with a simple search-and-replace operation. Or, more conveniently, with a program that includes a table mapping the offending sequences to the desired characters, like these:
â€œ -> “ # should be an opening double curly quote
â€? -> ” # should be a closing double curly quote
Any given text, assuming it's in English, will have a relatively small number of different types of substitutions.
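As a sketch of that table-driven approach (the mapping entries below are illustrative, not exhaustive):

# Illustrative table of mojibake sequences and their intended characters.
FIXES = {
    "â€œ": "“",      # opening double curly quote
    "â€\x9d": "”",   # closing double curly quote (0x9D often shows as "?")
    "Ã©": "é",
    "Ã§": "ç",
}

def clean(text: str) -> str:
    for bad, good in FIXES.items():
        text = text.replace(bad, good)
    return text

print(clean("cafÃ©"))  # -> café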
Hope that helps.
If you see question marks in the file, or if the accents are already lost, going back to UTF-8 will not help your cause. E.g. if café became cafe, changing the encoding alone will not help (you'll need the original data).
Can you paste some text here? That'll help us answer for sure.
In the Sublime Text editor: File -> Reopen with Encoding -> choose the correct encoding.
Generally, the encoding is auto-detected, but if not, you can use the above method.
And then there is the somewhat older recode program.
There are programs that try to detect the encoding of a file, like chardet. Then you can convert it to a different encoding using iconv. But that requires that the original text is still intact and no information has been lost (for example, by removing accents or whole accented letters).
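For example, here is a minimal detect-then-transcode sketch in Python, assuming the chardet package is installed (the filenames are placeholders):

import chardet

# Read the raw bytes so the encoding can be guessed before decoding.
with open("input.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess["encoding"])

# Write the same text back out as UTF-8.
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)

Check the reported confidence before trusting the guess; detection is heuristic.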
I found a simple way to auto-detect file encodings: change the file to a text file (on a Mac, rename the file extension to .txt) and drag it into a Mozilla Firefox window (or use File -> Open). Firefox will detect the encoding; you can see what it came up with under View -> Character Encoding.
I changed my file's encoding using TextMate once I knew the correct encoding: File -> Reopen using encoding and choose your encoding. Then File -> Save As and change the encoding to UTF-8 and the line endings to LF (or whatever you want).
With Vim from the command line (this rewrites the file as UTF-8, relying on Vim having detected the original encoding correctly):
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
On OS X, Synalyze It! lets you display parts of your file in different encodings (all that are supported by the ICU library). Once you know the source encoding, you can copy the whole file (bytes) via the clipboard and paste it into a new document where the target encoding (UTF-8 or whatever you like) is selected.
Also very helpful when working with UTF-8 or other Unicode representations is UnicodeChecker.
Use iconv - see Best way to convert text files between character sets?