How do I determine file encoding in OS X?

Question

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.

Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "@" by the file listing:

    -rw-r--r--@  1 me      users      2021 Feb 11 18:05 my_file.tex

(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)

I've found iconv, but that doesn't seem to be able to tell me what the encoding is; it'll only convert once I figure it out.

User · Answer

Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of a file, using a series of algorithms and magic numbers. It's fairly useful, but don't rely on it providing concrete or reliable information. For example, a Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.

User · Answer

    vim -c 'execute "silent !echo " . &fileencoding | q' {filename}

This is aliased somewhere in my bash configuration as

    alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"

so I just type vic {filename}.

On my vanilla OS X Yosemite, it yields more precise results than "file -I":

    $ file -I pdfs/udocument0.pdf
    pdfs/udocument0.pdf: application/pdf; charset=binary
    $ vic pdfs/udocument0.pdf
    latin1

    $ file -I pdfs/t0.pdf
    pdfs/t0.pdf: application/pdf; charset=us-ascii
    $ vic pdfs/t0.pdf
    utf-8

User · Answer

Using the file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the MIME type, "text/plain", which you probably don't care about.

User · Answer

I implemented the bash script below; it works for me.

It first tries to iconv from the encoding returned by file --mime-encoding to utf-8. If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.

If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.

Tested on Darwin 15.6.0.

    #!/bin/bash

    if [[ $# -lt 1 ]]
    then
      echo "ERROR: need one input argument: file of which the encoding is to be detected."
      exit 3
    fi

    if [ ! -e "$1" ]
    then
      echo "ERROR: cannot find file '$1'"
      exit 3
    fi

    if [[ $# -ge 2 ]]
    then
      MAX_DIFF_LINES=$2
    else
      MAX_DIFF_LINES=10
    fi

    # try the easy way (-b drops the "filename:" prefix, so paths with spaces are safe)
    ENCOD=$(file -b --mime-encoding "$1")
    # check if this encoding is valid
    if iconv -f "$ENCOD" -t utf-8 "$1" &> /dev/null
    then
      echo "$ENCOD"
      exit 0
    fi

    # hard way: the user has to visually check the difference between
    # the original and the re-encoded file
    for i in $(iconv -l | awk '{print $1}')
    do
      SINK="$1.$i.$RANDOM"
      if iconv -f "$i" -t utf-8 "$1" 2> /dev/null > "$SINK"
      then
        DIFF=$(diff "$1" "$SINK")
        # skip identical output and diffs longer than MAX_DIFF_LINES
        if [ -n "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le "$MAX_DIFF_LINES" ]
        then
          echo "===== $i ====="
          echo "$DIFF"
          echo "Does that make sense [N/y]"
          read -r ANSWER
          if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
          then
            rm -f "$SINK"
            echo "$i"
            exit 0
          fi
        fi
      fi
      # clean up re-encoded file
      rm -f "$SINK"
    done

    echo "None of the encodings worked. You're stuck."
    exit 3

User · Answer

Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:

    % UTF-8 stuff
    \usepackage[notipa]{ucs}
    \usepackage[utf8x]{inputenc}
    \usepackage[T1]{fontenc}

Now that I've switched over to XeTeX from the TeX Live 2008 package, it is even simpler:

    % UTF-8 stuff
    \usepackage{fontspec}
    \usepackage{xunicode}

As for detecting a file's encoding, you could play with file(1), but it is rather limited. As someone else said, it is difficult.

User · Answer

The @ means that the file has extended file attributes associated with it. You can query them using the getxattr() function.

There's no definite way to detect the encoding of a file: the same bytes can be valid in several encodings, so any tool can only guess.

There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.

User · Answer

Using the -I (that's a capital i) option on the file command seems to show the file encoding:

    file -I {filename}

User · Answer

A brute-force way to check the encoding might just be to check the file in a hex editor or similar (or write a program to check). Look at the binary data in the file. The UTF-8 format is fairly easy to recognize. All ASCII characters are single bytes with values below 128 (0x80). Multibyte sequences follow the pattern shown in the Wikipedia article on UTF-8: a lead byte of the form 110xxxxx, 1110xxxx or 11110xxx, followed by one, two or three continuation bytes of the form 10xxxxxx.

If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.
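
As a rough sketch of the "write a program to check" route, the shell functions below lean on iconv instead of decoding the bytes by hand. The function names is_valid_utf8 and hex_head are made up for this illustration, and the approach assumes an iconv that exits non-zero on malformed input (both GNU iconv and the one shipped with OS X should).

```shell
#!/bin/sh
# is_valid_utf8 FILE: succeed if every byte sequence in FILE decodes as UTF-8.
# Converting UTF-8 to UTF-8 changes nothing, but iconv stops with a
# non-zero exit status at the first malformed sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# hex_head FILE: dump the first 64 bytes in hex for manual inspection
# (the hex-editor approach, without leaving the terminal).
hex_head() {
    od -An -tx1 -N 64 "$1"
}

# Demonstration on a scratch file (hypothetical content).
scratch=$(mktemp "${TMPDIR:-/tmp}/utf8check.XXXXXX")
# \303\251 is the two-byte UTF-8 encoding of an e with acute accent.
printf 'h\303\251llo\n' > "$scratch"
hex_head "$scratch"
if is_valid_utf8 "$scratch"; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
rm -f "$scratch"
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8, and that a positive result only says the bytes are well-formed UTF-8, not that the author intended that encoding.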

User · Answer

You can also convert a file from one encoding to another using the following command:

    iconv -f original_charset -t new_charset originalfile > newfile

e.g.

    iconv -f utf-16le -t utf-8 file1.txt > file2.txt
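
One thing to watch with the redirect form is that a failed conversion still leaves a partial output file behind. A minimal sketch of a wrapper that avoids this is shown here; convert_encoding and its argument order are invented for the example and are not part of iconv.

```shell
#!/bin/sh
# convert_encoding FROM TO SRC DEST
# Convert SRC from charset FROM to charset TO and write the result to DEST.
# iconv writes to a temporary file first, so a failed conversion never
# leaves a truncated DEST behind.
convert_encoding() {
    from=$1
    to=$2
    src=$3
    dest=$4
    tmp="$dest.tmp.$$"
    if iconv -f "$from" -t "$to" "$src" > "$tmp"; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

With this in place, the example above would be written as convert_encoding utf-16le utf-8 file1.txt file2.txt, and file2.txt only appears if iconv succeeded.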

User · Answer

Synalyze It! allows you to compare text or bytes in all encodings the ICU library offers. Using that feature, you usually see immediately which code page makes sense for your data.

User · Answer

Just use:

    file -I <filename>

That's it.

User · Answer

In Mac OS X the command file -I (capital i) will give you the proper character set, so long as the file you are testing contains characters outside of the basic ASCII range.

For instance, go into Terminal and use vi to create a file, e.g. vi test.txt. Insert some characters, including an accented character (try ALT-e followed by e), then save the file.

Then type file -I test.txt and you should get a result like this:

    test.txt: text/plain; charset=utf-8
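
The same experiment can be scripted without an editor. This is only a sketch: charset_of is a made-up helper name, the scratch files are hypothetical, and it uses the -b and --mime-encoding options of file so that only the charset is printed. A file containing nothing but ASCII is normally reported as us-ascii, and one with a UTF-8 encoded accented character as utf-8.

```shell
#!/bin/sh
# charset_of FILE: print only the charset that file(1) guesses for FILE.
# -b suppresses the "FILE:" prefix; --mime-encoding prints just the charset.
charset_of() {
    file -b --mime-encoding "$1"
}

# Scratch files for the demonstration.
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/charset.XXXXXX")
printf 'plain ascii text\n' > "$tmpdir/ascii.txt"
# \303\251 is the two-byte UTF-8 encoding of an e with acute accent.
printf 'caf\303\251\n' > "$tmpdir/accented.txt"

charset_of "$tmpdir/ascii.txt"
charset_of "$tmpdir/accented.txt"

rm -rf "$tmpdir"
```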

User · Answer

You can try loading the file into a Firefox window, then go to View - Character Encoding. There should be a check mark next to the file's encoding type.

User · Answer

Classic 8-bit LaTeX is very restricted in which UTF8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.

Since you don't give a specific example, it's hard to know exactly where the problem is: whether you're attempting to use a glyph that your font doesn't have, or whether you're not using the correct font encoding in the first place.

Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:

    \documentclass{article}
    \usepackage[T1]{fontenc}
    \usepackage{lmodern}
    \usepackage[utf8]{inputenc}
    \begin{document}
    Héllø thêrè.
    \end{document}

You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.

User · Answer

The @ sign means the file has extended attributes. xattr file shows what attributes it has; xattr -l file shows the attribute values too (which can be large sometimes; try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).
