Windows-1252 to UTF-8 encoding

Question

I've copied certain files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?

Example usage of recode:

    recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.

User · Answer

If you want to convert multiple files in a single command (let's say you want to convert all *.txt files), here is the command:

    find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;
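As a rough Python equivalent of the find/iconv/mv pipeline, here is a minimal sketch. The function name `convert_tree` is hypothetical; the `.ren` suffix simply mirrors the temporary name used in the command. Like the command, it converts every matching file unconditionally, so it assumes all matches really are Windows-1252.

```python
import os
from pathlib import Path


def convert_tree(root: str, pattern: str = "*.txt") -> list:
    """Re-encode every matching file under root from cp1252 to UTF-8.

    Hypothetical helper: no encoding detection is attempted.
    """
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # Work on bytes so that line endings are left exactly as they are.
        text = path.read_bytes().decode("cp1252")
        tmp = path.with_name(path.name + ".ren")  # temporary converted copy
        tmp.write_bytes(text.encode("utf-8"))
        os.replace(tmp, path)  # swap the converted copy into place
        converted.append(path)
    return converted
```

Writing to a temporary file first and renaming it afterwards means the original is only replaced once the conversion has succeeded.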

User · Answer

If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.

While UTF-8 data can be read as Win-1252, the reverse is not true: Win-1252 text containing non-ASCII characters is, in practice, NOT valid UTF-8. So:

    recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

will spit out errors for all cp1252 files, and then proceed to convert them to UTF8.

I would wrap this into a cleaner bash script, keeping a backup of every converted file.

Before doing the charset conversion, you may wish to first ensure you have consistent line endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8 but just had the wrong line endings.
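The same "try UTF-8 first, fall back to cp1252" idea can be sketched in Python rather than with recode. This is only an illustration under the answer's own assumption that every file is either UTF-8 or Windows-1252; the function names and the `.bak` backup suffix are hypothetical.

```python
import shutil
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_if_cp1252(path: Path) -> bool:
    """Convert path to UTF-8 in place unless it is already valid UTF-8.

    Returns True if the file was converted. A backup copy with a
    hypothetical ".bak" suffix is kept next to the original.
    """
    data = path.read_bytes()
    if is_valid_utf8(data):
        return False  # already UTF-8 (or plain ASCII): leave untouched
    # Python's cp1252 codec raises UnicodeDecodeError for the five byte
    # values that Windows-1252 leaves undefined.
    text = data.decode("cp1252")
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_bytes(text.encode("utf-8"))
    return True
```

Because the check works on raw bytes, inconsistent line endings do not affect the decision.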

User · Answer

    iconv -f WINDOWS-1252 -t UTF-8 filename.txt

User · Answer

How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.

Now there are certainly characteristics which would strongly suggest that it's UTF-8 (if it starts with the UTF-8 BOM, for example), but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding. If you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences), it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.

Alternatively, do this programmatically rather than using the recode utility. It would be quite straightforward in C#, for example.

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.
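The BOM check and the "recode to the same encoding and compare" trick described above could look roughly like this in Python. It is a sketch, not a definitive detector: the function names and the returned labels are hypothetical, and the result is only a hint.

```python
import codecs


def roundtrips_as_utf8(data: bytes) -> bool:
    """Recode UTF-8 to UTF-8, replacing invalid sequences with U+FFFD.

    If nothing had to be replaced, the output equals the input, which
    means the input was valid UTF-8.
    """
    recoded = data.decode("utf-8", errors="replace").encode("utf-8")
    return recoded == data


def guess_encoding(data: bytes) -> str:
    """Heuristically label bytes; the answer is suggestive, never proof."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"  # strong hint that the file is UTF-8
    if data.isascii():
        return "ascii"      # identical in UTF-8 and Windows-1252
    if roundtrips_as_utf8(data):
        return "utf-8"
    return "not-utf-8"      # possibly Windows-1252
```

The heuristic nature shows up quickly: the two Windows-1252 characters "Ã©" are stored as the bytes C3 A9, which is also the valid UTF-8 encoding of "é", so `guess_encoding` reports such a file as UTF-8.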

User · Answer

You can change the encoding of a file with an editor such as Notepad++. Just go to Encoding and select what you want.

I always prefer Windows 1252.

User · Answer

Use the iconv command.

To make sure the file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests the current encoding as the default; if it's Windows-1252 (or any 1-byte codepage, for that matter), it would say "ANSI".

User · Answer

Found this documentation for the TYPE command:

Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:

    For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G
    CHCP 1252 >NUL
    CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL
    CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt
    CHCP %_codepage%

The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.
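For comparison, the same two steps (write the FF FE byte order mark, then append the content as little-endian UTF-16) can be sketched in Python. The function name is hypothetical, and the sketch assumes the input really is Windows-1252.

```python
import codecs


def cp1252_to_utf16le_with_bom(data: bytes) -> bytes:
    """Return Windows-1252 bytes re-encoded as UTF-16 LE, prefixed by a BOM."""
    text = data.decode("cp1252")
    # codecs.BOM_UTF16_LE is the two bytes FF FE written by the batch script.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

The result can be read back with Python's generic "utf-16" codec, which uses the BOM to pick the byte order.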

User · Answer

There's no general way to tell if a file is encoded with a specific encoding. Remember that an encoding is nothing more than an "agreement" on how the bits in a file should be mapped to characters.

If you don't know which of your files are actually already encoded in UTF-8 and which ones are encoded in windows-1252, you will have to inspect all files and find out yourself. In the worst case that could mean that you have to open every single one of them with either of the two encodings and see whether they "look" correct, i.e., all characters are displayed correctly. Of course, you may use tool support in order to do that. For instance, if you know for sure that certain characters are contained in the files that have a different mapping in windows-1252 vs. UTF-8, you could grep for them after running the files through 'iconv' as mentioned by Seva Akekseyev.

Another lucky case for you would be if you know that the files actually contain only characters that are encoded identically in both UTF-8 and windows-1252. In that case, of course, you're done already.
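The "look for a character you know must be there" approach might be sketched like this in Python. The function name `pick_encoding` and its parameters are hypothetical; it assumes you can name some non-ASCII text, such as a euro sign, that the file is known to contain.

```python
def pick_encoding(data: bytes, expected: str, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding under which `expected` appears.

    Returns None if the expected text is found under no candidate.
    """
    for enc in candidates:
        try:
            if expected in data.decode(enc):
                return enc
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
    return None
```

For example, a price list known to contain "€" is classified by whichever decoding actually yields that character; a file with no such marker returns None and still needs manual inspection.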

User · Answer

Here's a transcription of another answer I gave to a similar question:

If you apply utf8_encode() to an already UTF8 string, it will return garbled UTF8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download: https://github.com/neitanod/forceutf8

Update: I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples: passing "Fédération Camerounaise de Football" with its accented characters garbled to varying degrees (for example "FÃ©dÃ©ration Camerounaise de Football") through echo Encoding::fixUTF8() four times will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
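To illustrate why encoding an already-UTF-8 string again garbles it, and how such damage can sometimes be undone, here is a small Python sketch. It is not the forceutf8 library's algorithm; `garble` and `ungarble` are hypothetical helpers that only handle the specific case of UTF-8 bytes misread as Windows-1252.

```python
def garble(text: str, rounds: int = 1) -> str:
    """Simulate mis-encoding: read UTF-8 bytes as Windows-1252, `rounds` times.

    May raise UnicodeDecodeError for text whose UTF-8 bytes include one of
    the byte values that Windows-1252 leaves undefined.
    """
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text


def ungarble(text: str, max_rounds: int = 5) -> str:
    """Undo up to max_rounds layers of that mis-encoding.

    Stops as soon as a reverse step fails or changes nothing, which is
    taken as a sign that the text is no longer garbled.
    """
    for _ in range(max_rounds):
        try:
            fixed = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break
        if fixed == text:
            break
        text = fixed
    return text
```

One round turns "é" into "Ã©", and each further round makes the sequence longer. Note that `ungarble` is itself a heuristic: text that legitimately contains "Ã©" would be changed too.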

User · Answer

UTF-8 does not need a BOM: it is superfluous, because UTF-8 is a byte stream with no byte-order ambiguity (the Unicode standard permits one but does not recommend it). Where a BOM is helpful is in UTF-16, which may be byte swapped, as in the case of Microsoft. UTF-16 is for internal representation in a memory buffer. Use UTF-8 for interchange. UTF-8 and anything else derived from US-ASCII have no byte-order issue, and UTF-16 without a BOM is assumed to be in network (big-endian) byte order. The Microsoft UTF-16 requires a BOM as it is byte swapped (little-endian).

To convert Windows-1252 to ISO8859-15, I first convert ISO8859-1 to US-ASCII for codes with similar glyphs. I then convert Windows-1252 up to ISO8859-15, and other non-ISO8859-15 glyphs to multiple US-ASCII characters.
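The byte-order points can be checked with a few lines of Python. This is a minimal sketch; both helper names are hypothetical, and `utf16_byte_order` assumes the data is UTF-16 to begin with.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def utf16_byte_order(data: bytes):
    """Report the byte order announced by a UTF-16 BOM, or None without one."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return None
```

For instance, "é" is always C3 A9 in UTF-8, whereas in UTF-16 it is E9 00 in little-endian order and 00 E9 in big-endian order, which is the ambiguity a BOM resolves.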
