Force encode from US-ASCII to UTF-8 (iconv)

Question

I'm trying to transcode a bunch of files from US-ASCII to UTF-8. For that, I'm using iconv:

iconv -f US-ASCII -t UTF-8 file.php > file-utf8.php

My original files are US-ASCII encoded, which makes the conversion not happen. Apparently this is because ASCII is a subset of UTF-8 (see "iconv US-ASCII to UTF-8 or ISO-8859-15"). Quoting from there:

"There's no need for the textfile to appear otherwise until non-ASCII characters are introduced."

True. If I introduce a non-ASCII character in the file and save it, say with Eclipse, the file encoding (charset) is switched to UTF-8.

In my case, I'd like to force iconv to transcode the files to UTF-8 anyway, whether there are non-ASCII characters in them or not.

Note: The reason is that my PHP code (non-ASCII files...) deals with some non-ASCII strings, and those strings are not interpreted correctly (French):

Il était une fois... l'homme, série animée mythique d'Albert Barillé (Procidis), 1ère ...

Edit: US-ASCII is a subset of UTF-8 (see Ned's answer below), meaning that US-ASCII files are already encoded in UTF-8. My problem came from somewhere else.

User · Answer

The following converts all files in a folder.

Create backup folder of original files.

mkdir backup

Convert all files in US ASCII encoding to UTF-8 (single line command)

for f in $(file -i *.sql | grep us-ascii | cut -d ':' -f 1); do iconv -f us-ascii -t utf-8 "$f" -o "$f.utf-8" && mv "$f" backup/ && mv "$f.utf-8" "$f"; done

Convert all files in encoding ISO 8859-1 to UTF-8 (single line command)

for f in $(file -i *.sql | grep iso-8859-1 | cut -d ':' -f 1); do iconv -f iso-8859-1 -t utf-8 "$f" -o "$f.utf-8" && mv "$f" backup/ && mv "$f.utf-8" "$f"; done
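The one-liners above take the file list from a command substitution, so a filename containing spaces would be split into several words. As a rough sketch only, the same backup-then-convert idea can be written with a glob instead. The function name convert_to_utf8 and its three arguments are made up for this illustration, and it assumes you already know the source encoding rather than asking file for it.

```shell
#!/usr/bin/env bash
# convert_to_utf8 DIR EXT FROM (hypothetical helper, not part of iconv):
# convert every DIR/*.EXT file from encoding FROM to UTF-8 and keep a
# copy of each original in DIR/backup.
convert_to_utf8() {
    local dir=$1 ext=$2 from=$3 f
    mkdir -p "$dir/backup" || return 1
    for f in "$dir"/*."$ext"; do
        [ -f "$f" ] || continue          # the glob matched nothing
        # Convert into a temporary file first and replace the original
        # only if iconv accepted every byte of the input.
        if iconv -f "$from" -t UTF-8 "$f" > "$f.utf-8"; then
            cp -p "$f" "$dir/backup/" && mv "$f.utf-8" "$f"
        else
            echo "skipped (could not decode as $from): $f" >&2
            rm -f "$f.utf-8"
        fi
    done
}
```

A call such as convert_to_utf8 . sql ISO-8859-1 would then process the .sql files in the current directory. Running it twice on the same files would double-encode them, so it is meant for a one-off conversion.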

User · Answer

I accidentally encoded a file in UTF-7 and had a similar issue. When I typed file -i name_file, I would get charset=us-ascii. Running iconv -f us-ascii -t utf-8//translit name_file would not work, since, as I've gathered, a UTF-7 file consists entirely of ASCII bytes (and ASCII is in turn a subset of UTF-8), so there was nothing for iconv to change. To solve this, I entered:

iconv -f UTF-7 -t UTF-8//TRANSLIT name_file -o output_file

I'm not sure how to determine the encoding other than what others have suggested here.

User · Answer

For text that contains only ASCII characters there is no difference between US-ASCII and UTF-8, so there isn't any need to reconvert it.

But here is a little hint if you have trouble with special characters while recoding: add //TRANSLIT after the source charset parameter. Example:

iconv -f ISO-8859-1//TRANSLIT -t UTF-8 filename.sql > utf8-filename.sql

This helps me with strange types of quotes, which always break the character set re-encoding process.

User · Answer

Inspired a lot by Mathieu's answer and Marcelo's answer.

I needed file -i myfile.htm to show UTF-8 instead of US-ASCII (yes, I know it is a subset of UTF-8).

So here is a one-liner, inspired by the previous answers, that converts all *.htm files on Linux from US-ASCII to UTF-8 so that file -i will show you UTF-8. You can change *.htm (two places in the command below) to fit your needs.

mkdir backup 2>/dev/null; for f in $(file -i *.htm | grep -i us-ascii | cut -d ':' -f 1); do iconv -f "us-ascii" -t "utf-16" "$f" > "$f.tmp"; iconv -f "utf-16le" -t "utf-8" "$f.tmp" > "$f.utf8"; cp "$f" backup/; mv "$f.utf8" "$f"; rm "$f.tmp"; done; file -i *.htm

User · Answer

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
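The claim that the bytes come out exactly the same is easy to check for yourself. This is a small sketch with made-up file names in a temporary directory: it converts a pure ASCII file with iconv and compares the result byte for byte with cmp.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)

# A file that contains nothing but 7-bit ASCII characters.
printf 'Hello, plain 7-bit ASCII\n' > "$tmpdir/ascii.txt"

# "Convert" it to UTF-8.
iconv -f US-ASCII -t UTF-8 "$tmpdir/ascii.txt" > "$tmpdir/utf8.txt"

# cmp -s is silent and succeeds only if both files are byte-identical.
if cmp -s "$tmpdir/ascii.txt" "$tmpdir/utf8.txt"; then
    same_bytes=yes
else
    same_bytes=no
fi
echo "identical after conversion: $same_bytes"
```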

User · Answer

You can use file -i file_name to check what exactly your original file format is. Once you get that, you can do the following:

iconv -f old_format -t utf-8 input_file -o output_file
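Since file only inspects part of a file, a complementary check is to let iconv try to decode the whole thing: it exits with an error as soon as it meets a byte sequence that is invalid in the source encoding you named. The sketch below relies on that behaviour. The function name check_encoding and the three labels it prints are invented for this example.

```shell
#!/usr/bin/env bash
# check_encoding FILE (hypothetical helper): classify FILE by decoding
# all of it with iconv instead of sampling the beginning.
check_encoding() {
    local f=$1
    if iconv -f US-ASCII -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "us-ascii"        # every byte is below 0x80
    elif iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "utf-8"           # all multi-byte sequences are well-formed
    else
        echo "unknown-8bit"    # some single-byte or other legacy encoding
    fi
}
```

The last case cannot be narrowed down this way: every byte sequence is valid ISO-8859-1, so iconv will never reject that guess, and you still need context (or a look at the hex values) to pick the right source encoding.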

User · Answer

I think Ned's got the core of the problem: your files are not actually ASCII. Try:

iconv -f ISO-8859-1 -t UTF-8 file.php > file-utf8.php

I'm just guessing that you're actually using ISO 8859-1. It is popular with most European languages.
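To see what such a conversion does at the byte level, here is a tiny sketch using the letter Ö as an example: it is the single byte 0xD6 in ISO 8859-1 (written as octal \326 for printf) and becomes two bytes in UTF-8. The variable names are only for illustration.

```shell
#!/usr/bin/env bash
# The byte 0xD6 is "Ö" in ISO 8859-1; printf takes it as octal \326.
latin1_hex=$(printf '\326' | od -An -tx1 | tr -d ' \n')

# After conversion the same character occupies two bytes in UTF-8.
utf8_hex=$(printf '\326' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "ISO 8859-1: $latin1_hex   UTF-8: $utf8_hex"
```

If the guess about the source encoding is wrong, iconv will usually still produce output, just with the wrong characters, so it is worth spot-checking a few known words afterwards.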

User · Answer

People say you can't, and I understand you may be frustrated when asking a question and getting such an answer.

If you really want it to show as UTF-8 instead of US-ASCII, then you need to do it in two steps.

First:

iconv -f us-ascii -t utf-16 yourfile > yourfileinutf16

Second:

iconv -f utf-16le -t utf-8 yourfileinutf16 > yourfileinutf8

Then if you do a file -i, you'll see the new character set is UTF-8.

User · Answer

Short Answer

file only guesses at the file encoding and may be wrong (especially in cases where special characters only appear late in large files).

You can use hexdump to look at bytes of non-7-bit-ASCII text and compare against code tables for common encodings (ISO 8859-*, UTF-8) to decide for yourself what the encoding is.

iconv will use whatever input/output encoding you specify regardless of what the contents of the file are. If you specify the wrong input encoding, the output will be garbled.

Even after running iconv, file may not report any change due to the limited way in which file attempts to guess at the encoding. For a specific example, see my long answer.

7-bit ASCII (aka US-ASCII) is identical at a byte level to UTF-8 and the 8-bit ASCII extensions (ISO 8859-*). So if your file only has 7-bit characters, then you can call it UTF-8, ISO 8859-* or US-ASCII because at a byte level they are all identical. It only makes sense to talk about UTF-8 and other encodings (in this context) once your file has characters outside the 7-bit ASCII range.

Long Answer

I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.

ASCII

First, the term ASCII is overloaded, and that leads to confusion. 7-bit ASCII only includes 128 characters (00-7F, or 0-127 in decimal). 7-bit ASCII is also sometimes referred to as US-ASCII. (See: ASCII.)

UTF-8

UTF-8 encoding uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that range of the first 128 characters will be identical at a byte level whether encoded with UTF-8 or 7-bit ASCII. (See: Codepage layout.)

ISO 8859-* and other ASCII Extensions

"The term extended ASCII (or high ASCII) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters, plus additional characters." (See: Extended ASCII.)

ISO 8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII
extension standard that covers most characters for Western Europe  There are other ISO standards for Eastern European languages and Cyrillic languages  ISO  8859-1 includes characters like            and    for German and Spanish   quot Extension quot  means that ISO  8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit  So for the first 128 characters  it is equivalent at a byte level to ASCII and UTF-8 encoded files  However  when you start dealing with characters beyond the first 128  your are no longer UTF-8 equivalent at the byte level  and you must do a conversion if you want your  quot extended ASCII quot  file to be UTF-8 encoded  ISO 8859 and proprietary adaptations Detecting encoding with file One lesson I learned today is that we can t trust file to always give correct interpretation of a file s character encoding  file  command   The command tells only what the file looks like  not what it is  in the case where file looks at the content   It is easy to fool the program by putting a magic number into a file the content of which does not match it  Thus the command is not usable as a security tool other than in specific situations   file looks for magic numbers in the file that hint at the type  but these can be wrong  no guarantee of correctness  file also tries to guess the character encoding by looking at the bytes in the file  Basically file has a series of tests that helps it guess at the file type and encoding  My file is a large CSV file  file reports this file as US ASCII encoded  which is WRONG    ls -lh total 850832 -rw-r--r--  1 mattp  staff   415M Mar 14 16 38 source-file   file -b --mime-type source-file text plain   file -b --mime-encoding source-file us-ascii  My file has umlauts in it  ie      The first non-7-bit-ascii doesn t show up until over 100k lines into the file  I suspect this is why file doesn t realize the file encoding isn t US-ASCII    pcregrep -no     x00- x7F   source-file   head -n1 102321  
I'm on a Mac, so I'm using PCRE's grep. With GNU grep you could use the -P option. Alternatively, on a Mac, one could install GNU grep (via Homebrew or other).

I haven't dug into the source code of file, and the man page doesn't discuss the text encoding detection in detail, but I am guessing file doesn't look at the whole file before guessing the encoding.

Whatever my file's encoding is, these non-7-bit-ASCII characters break stuff. My German CSV file is ;-separated and extracting a single column doesn't work.

$ cut -d";" -f1 source-file > tmp
cut: stdin: Illegal byte sequence
$ wc -l *
 3081673 source-file
  102320 tmp
 3183993 total

Note the cut error and that my "tmp" file has only 102320 lines, with the first special character on line 102321.

Let's take a look at how these non-ASCII characters are encoded. I dump the first non-7-bit-ASCII character into hexdump, do a little formatting, remove the newlines (0a) and take just the first few.

$ pcregrep -o '[^\x00-\x7F]' source-file | head -n1 | hexdump -v -e '1/1 "%02x\n"'
d6
0a

Another way: I know the first non-7-bit-ASCII char is at position 85 on line 102321. I grab that line and tell hexdump to take the two bytes starting at position 85. You can see the special (non-7-bit-ASCII) character represented by a ".", and the next byte is "M", so this is a single-byte character encoding.

$ tail -n +102321 source-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

In both cases, we see the special character is represented by d6. Since this character is an Ö, which is a German letter, I am guessing that ISO 8859-1 should include this. Sure enough, you can see "d6" is a match (see: ISO/IEC 8859-1).

Important question: how do I know this character is an Ö without being sure of the file encoding? The answer is context. I opened the file, read the text and then determined what character it
is supposed to be. If I open it in Vim it displays as an Ö because Vim does a better job of guessing the character encoding (in this case) than file does.

So, my file seems to be ISO 8859-1. In theory I should check the rest of the non-7-bit-ASCII characters to make sure ISO 8859-1 is a good fit. There is nothing that forces a program to only use a single encoding when writing a file to disk (other than good manners).

I'll skip the check and move on to the conversion step.

$ iconv -f iso-8859-1 -t utf8 source-file > output-file
$ file -b --mime-encoding output-file
us-ascii

Hmm. file still tells me this file is US-ASCII even after conversion. Let's check with hexdump again.

$ tail -n +102321 output-file | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057

Definitely a change. Note that we have two bytes of non-7-bit-ASCII (represented by the ".." on the right) and the hex code for the two bytes is now c3 96. If we take a look, it seems we have UTF-8 now (c3 96 is the encoding of Ö in UTF-8; see: UTF-8 encoding table and Unicode characters).

But file still reports our file as us-ascii? Well, I think this goes back to the point about file not looking at the whole file and the fact that the first non-7-bit-ASCII characters don't occur until late in the file.

I'll use sed to stick an Ö at the beginning of the file and see what happens.

$ sed '1s/^/Ö\'$'\n/' source-file > test-file
$ head -n1 test-file
Ö
$ head -n1 test-file | hexdump -C
00000000  c3 96 0a                                          |...|
00000003

Cool, we have an umlaut. Note the encoding though is c3 96 (UTF-8). Hmm.

Checking our other umlauts in the same file again:

$ tail -n +102322 test-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

ISO 8859-1. Oops! It just goes to show how easy it is to get the encodings screwed up. To be clear, I've managed to create a mix of UTF-8
and ISO 8859-1 encodings in the same file  Let s try converting our new test file with the umlaut      at the front and see what happens    iconv -f iso-8859-1 -t utf8 test-file  gt  test-file-converted   head -n1 test-file-converted   hexdump -C 00000000  c3 83 c2 96 0a                                            00000005   tail -n  102322 test-file-converted   head -n1   hexdump -C -s85 -n2 00000055  c3 96                                                  00000057  Oops  The first umlaut that was UTF-8 was interpreted as ISO  8859-1 since that is what we told iconv  The second umlaut is correctly converted from d6  ISO  8859-1  to c3 96  UTF-8   I ll try again  but this time I will use Vim to do the    insertion instead of sed  Vim seemed to detect the encoding better  as  quot latin1 quot  aka ISO  8859-1  so perhaps it will insert the new    with a consistent encoding    vim source-file   head -n1 test-file-2     head -n1 test-file-2   hexdump -C 00000000  d6 0d 0a                                                00000003   tail -n  102322 test-file-2   head -n1   hexdump -C -s85 -n2 00000055  d6 4d                                               M  00000057  It looks good  It looks like ISO  8859-1 for new and old umlauts  Now the test    file -b --mime-encoding test-file-2 iso-8859-1   iconv -f iso-8859-1 -t utf8 test-file-2  gt  test-file-2-converted   file -b --mime-encoding test-file-2-converted utf-8  Boom  Moral of the story  Don t trust file to always guess your encoding right  It is easy to mix encodings within the same file  When in doubt  look at the hex  A hack  also prone to failure  that would address this specific limitation of file when dealing with large files would be to shorten the file to make sure that special  non-ascii  characters appear early in the file so file is more likely to find them    first special   pcregrep -o1 -n       x00- x7F   source-file   head -n1   cut -d quot   quot  -f1    tail -n   first special source-file  gt   tmp 
source-file-shorter   file -b --mime-encoding  tmp source-file-shorter iso-8859-1  You could then use  presumably correct  detected encoding to feed as input to iconv to ensure you are converting correctly  Update Christos Zoulas updated file to make the amount of bytes looked at configurable  One day turn-around on the feature request  awesome  http   bugs gw com view php id 533 Allow altering how many bytes to read from analyzed files from the command line The feature was released in file version 5 26  Looking at more of a large file before making a guess about encoding takes time  However  it is nice to have the option for specific use-cases where a better guess may outweigh additional time and I O  Use the following option  -P  --parameter name value      Set various parameter limits       Name    Default     Explanation     bytes   1048576     max number of bytes to read from file  Something like    file to check  quot myfile quot  bytes to scan   wc -c  lt   file to check  file -b --mime-encoding -P bytes  bytes to scan  file to check      it should do the trick if you want to force file to look at the whole file before making a guess  Of course  this only works if you have file 5 26 or newer  Forcing file to display UTF-8 instead of US-ASCII Some of the other answers seem to focus on trying to make file display UTF-8 even if the file only contains plain 7-bit ascii   If you think this through you should probably never want to do this   If a file contains only 7-bit ascii but the file command is saying the file is UTF-8  that implies that the file contains some characters with UTF-8 specific encoding   If that isn t really true  it could cause confusion or problems down the line   If file displayed UTF-8 when the file only contained 7-bit ascii characters  this would be a bug in the file program  Any software that requires UTF-8 formatted input files should not have any problem consuming plain 7-bit ascii since this is the same on a byte level as UTF-8   If 
there is software that is using the file command output before accepting a file as input and it won't process the file unless it "sees" UTF-8, well, that is pretty bad design. I would argue this is a bug in that program.

If you absolutely must take a plain 7-bit ASCII file and convert it to UTF-8, simply insert a single non-7-bit-ASCII character into the file with UTF-8 encoding for that character and you are done. But I can't imagine a use case where you would need to do this. The easiest UTF-8 character to use for this is the Byte Order Mark (BOM), which is a special non-printing character that hints that the file is non-ASCII. This is probably the best choice because it should not visually impact the file contents, as it will generally be ignored.

"Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII."

This is key: "or the file contains only ASCII".

So some tools on Windows have trouble reading UTF-8 files unless the BOM character is present. However, this does not affect plain 7-bit ASCII only files. I.e. this is not a reason for forcing plain 7-bit ASCII files to be UTF-8 by adding a BOM character.

Here is more discussion about potential pitfalls of using the BOM when not needed (it IS needed for actual UTF-8 files that are consumed by some Microsoft apps): https://stackoverflow.com/a/13398447/3616686

Nevertheless, if you still want to do it (I would be interested in hearing your use case), here is how. In UTF-8 the BOM is represented by the hex sequence 0xEF 0xBB 0xBF, and so we can easily add this character to the front of our plain 7-bit ASCII file. By adding a non-7-bit-ASCII character to the file, the file is no longer only 7-bit ASCII. Note that we have not modified or converted the original
7-bit-ASCII content at all. We have added a single non-7-bit-ASCII character to the beginning of the file, and so the file is no longer entirely composed of 7-bit-ASCII characters.

$ printf '\xEF\xBB\xBF' > bom.txt # put a UTF-8 BOM char in new file
$ file bom.txt
bom.txt: UTF-8 Unicode text, with no line terminators
$ file plain-ascii.txt # our pure 7-bit ascii file
plain-ascii.txt: ASCII text
$ cat bom.txt plain-ascii.txt > plain-ascii-with-utf8-bom.txt # put them together into one new file with the BOM first
$ file plain-ascii-with-utf8-bom.txt
plain-ascii-with-utf8-bom.txt: UTF-8 Unicode (with BOM) text
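Because an unneeded BOM can cause the kind of confusion described in this answer, it can be handy to check whether a file starts with one and to get the content back without it. The following is only a sketch; the function names has_utf8_bom and strip_utf8_bom are invented for this example, and both simply look at the first three bytes for EF BB BF.

```shell
#!/usr/bin/env bash
# has_utf8_bom FILE: succeed if FILE starts with the bytes EF BB BF.
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_utf8_bom FILE: write FILE to stdout without a leading UTF-8 BOM.
strip_utf8_bom() {
    if has_utf8_bom "$1"; then
        tail -c +4 "$1"      # start output at the 4th byte
    else
        cat "$1"
    fi
}
```

For example, strip_utf8_bom plain-ascii-with-utf8-bom.txt > plain-again.txt should give back a file with the same bytes as the original plain ASCII file.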

User · Answer

Here's a script that will find all files matching a pattern you pass it and convert them from their current file encoding to UTF-8. If the encoding is US-ASCII, then it will still show as US-ASCII, since that is a subset of UTF-8.

#!/usr/bin/env bash
# Pass a find -name pattern such as '*.txt' as the first argument.
find . -name "${1}" |
    while IFS= read -r line
    do
        echo "***************************"
        echo "Converting ${line}"
        encoding=$(file -b --mime-encoding "${line}")
        echo "Found Encoding: ${encoding}"
        # Replace the original only if iconv succeeded.
        iconv -f "${encoding}" -t "utf-8" "${line}" -o "${line}.tmp" &&
            mv "${line}.tmp" "${line}"
    done
