How to find the encoding of a file via script on Linux?

I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?

The file command is not able to do this.

The encoding that is of interest to me is: ISO-8859-1. If the encoding is anything else, I want to move the file to another directory.

This question is related to: file, shell, unix, encoding


In PHP you can check it like below:

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example, you can see that I put in a list of encodings (in detection order) that might match. To get a more accurate result, you can use all possible encodings via mb_list_encodings().

Note: the mb_* functions require php-mbstring:

apt-get install php-mbstring

I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine whether standard input is ASCII. (Python 3 only: on Python 2, iterating over a binary file yields one-character strings instead of integers, so the comparison against 128 would not work.)

python3 -c 'from sys import exit, stdin; exit() if 128 > max((c for l in open(stdin.fileno(), "rb") for c in l), default=0) else exit("Not ASCII")' < myfile.txt

With Python, you can use the chardet module: https://github.com/chardet/chardet
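For example, a quick one-liner in the same spirit (a sketch; it assumes chardet has been installed, e.g. via pip install chardet, and myfile.txt is a placeholder):

python3 -c 'import sys, chardet; print(chardet.detect(open(sys.argv[1], "rb").read()))' myfile.txt

chardet.detect() returns a dictionary containing the guessed encoding together with a confidence value.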


You can extract the encoding of a single file with the file command. I have a sample.html file:

$ file sample.html
sample.html: HTML document, UTF-8 Unicode text, with very long lines

$ file -b sample.html
HTML document, UTF-8 Unicode text, with very long lines

$ file -bi sample.html
text/html; charset=utf-8

$ file -bi sample.html | awk -F'=' '{print $2}'
utf-8


If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1"?>
So you can use regular expressions (e.g. with Perl) to check every file for such a declaration.
More information can be found here: How to Determine Text File Encoding.
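For instance, a one-liner along these lines prints the declared encoding if one is present (a sketch; it only handles double-quoted declarations, and myfile.xml is a placeholder name):

perl -ne 'if (/<\?xml[^>]*encoding="([^"]+)"/i) { print "$1\n"; exit }' myfile.xml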


On Debian you can also use encguess:

$ encguess test.txt
test.txt  US-ASCII

With this command you can list all files in a directory and its subdirectories along with the corresponding encoding:

find . -type f | while read -r f; do file -i "$f"; done


file -bi <file name>

If you want to do this for a bunch of files:

find . | egrep -v Eliminate | while read -r f; do echo "$f" ' -- ' "$(file -bi "$f")"; done

In Cygwin, this looks like it works for me:

find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done

Example:

find -type f -name "*.txt" | while read file; do (file -i "$file"); done

You could pipe that to awk and create an iconv command to convert everything to utf8, from any source encoding supported by iconv.

Example:

find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash

uchardet - An encoding detector library ported from Mozilla.

Usage:

~> uchardet file.java 
UTF-8

Various Linux distributions (Debian/Ubuntu, openSUSE/Packman, ...) provide binaries.
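Coming back to the original question (move everything that is not ISO-8859-1 to another directory), a loop like the following could work. This is a minimal sketch: it assumes uchardet reports ISO-8859-1 for such files (check what it actually prints for your data first), and the target directory other_encoding/ is a made-up name.

mkdir -p other_encoding
for f in *; do
  [ -f "$f" ] || continue   # skip directories
  if [ "$(uchardet "$f")" != "ISO-8859-1" ]; then
    mv "$f" other_encoding/
  fi
done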


It is really hard to determine whether a file is ISO-8859-1. If a text contains only 7-bit characters, it could be ISO-8859-1, but you can't tell, because pure ASCII is valid in most encodings. If it contains 8-bit characters, the upper-region characters exist in other encodings as well. Therefore you would have to use a dictionary to get a better guess at which word is meant, and determine from there which letter it must be. Finally, if you detect that the file might be UTF-8, you can be fairly sure it is not ISO-8859-1.

Detecting an encoding is one of the hardest things to do, because unless something explicitly tells you the encoding, you never know for certain.
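One practical check along those lines: iconv can be used to test whether a file is valid UTF-8, since it exits non-zero on an invalid byte sequence (a sketch; note that pure ASCII passes this test too, and is also valid ISO-8859-1):

iconv -f utf-8 -t utf-8 myfile.txt > /dev/null 2>&1 && echo "valid UTF-8" || echo "not valid UTF-8"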


To convert a file's encoding from ISO-8859-1 to ASCII:

iconv -f ISO_8859-1 -t ASCII filename.txt

I am using the following script to:

  1. Find all files that match FILTER with SRC_ENCODING
  2. Create a backup of them
  3. Convert them to DST_ENCODING
  4. (optional) Remove the backups

#!/bin/bash -xe

SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"

echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | cut -d: -f1)

for FILE in $FOUND_FILES ; do
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
    echo "Backup original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"

    echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done

echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;

Here is an example script using file -I and iconv, which works on macOS. For your question you need to use mv instead of iconv:

#!/bin/bash
# 2016-02-08
# check encoding and convert files
for f in *.java
do
  encoding=$(file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=)
  case "$encoding" in
    iso-8859-1)
      iconv -f iso8859-1 -t utf-8 "$f" > "$f.utf8"
      mv "$f.utf8" "$f"
      ;;
  esac
done

This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00-0x1F or 0x7F-0x9F but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
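As a sketch of that first check (it needs GNU grep with PCRE support for -P; tab, newline and carriage return are left out of the first range because ordinary text contains them):

LC_ALL=C grep -qP '[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]' myfile.txt && echo "contains control bytes" || echo "no control bytes"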

Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.

So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of 8859-1 and see if they have a large number of occurrences within the file.

I'm not talking about literal translation such as:

English   French
-------   ------
of        de, du
and       et
the       le, la, les

although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical, I didn't mean any offense, just illustrating a point]).
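To illustrate the idea, counting hits for a handful of common French words could look like this (a sketch; the word list and whatever threshold you apply are arbitrary):

LC_ALL=C grep -oiwE 'de|du|et|le|la|les' myfile.txt | wc -l

A high count relative to the size of the file suggests French text, which in turn narrows down the plausible encodings.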


With Perl, use Encode::Detect.
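For example (a sketch; it assumes the Encode::Detect distribution from CPAN is installed, which provides the Encode::Detect::Detector module and its detect() function):

perl -MEncode::Detect::Detector -e 'local $/; print((detect(<>) // "unknown"), "\n")' myfile.txt

detect() takes the raw bytes and returns the guessed charset name, or undef if it cannot tell.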

