Detect encoding and make everything UTF-8

Question

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.
3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed wrongly.

What can I do to avoid cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

1. How do I find out what encoding the text uses?
2. How do I convert it to UTF-8, whatever the old encoding is?

Would a function like this work?

    function correct_encoding($text) {
        $current_encoding = mb_detect_encoding($text, 'auto');
        $text = iconv($current_encoding, 'UTF-8', $text);
        return $text;
    }

I've tested it, but it doesn't work. What's wrong with it?

User · Answer

Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean something different). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported wrongly. That is, people who use a different encoding are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that the input is indeed valid, using mb_check_encoding (note that being valid is not the same as actually being in that encoding: the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

Once you've detected the encoding you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
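The strategy above can be sketched roughly as follows. This is only an illustration under stated assumptions: to_utf8() and its $declared parameter are hypothetical names, and instead of calling mb_detect_encoding with the detect sequence it checks UTF-8 validity with mb_check_encoding and then falls back to Windows-1252, because every byte string is valid ISO-8859-1 and the two agree on all printable Latin-1 characters.

```php
<?php
// Hypothetical helper (not part of mbstring): normalise $text to UTF-8.
// $declared is whatever encoding label the provider reported, if any.
function to_utf8(string $text, ?string $declared = null): string
{
    // Labels that are common platform defaults and therefore often wrong.
    $untrusted = ['utf-8', 'iso-8859-1', 'windows-1252', 'cp1252', 'latin1'];
    $known = array_map('strtolower', mb_list_encodings());

    if ($declared !== null) {
        $label = strtolower(trim($declared));
        // Trust an unusual label, but only if mbstring knows it and the
        // bytes are actually valid in that encoding.
        if (!in_array($label, $untrusted, true)
            && in_array($label, $known, true)
            && mb_check_encoding($text, $label)) {
            return mb_convert_encoding($text, 'UTF-8', $label);
        }
    }

    // Valid UTF-8 is left alone; this is what prevents double encoding.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Assumption: anything else is Western European single-byte text.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

echo to_utf8("Fu\xDFball"), PHP_EOL; // the ISO-8859-1 bytes become "Fußball"
```

The Windows-1252 fallback is a guess that only suits Western European feeds; for other regions a different fallback would be needed.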

User · Answer

Get the encoding from the headers and convert it to UTF-8.

    $post_url = 'http://website.domain';

    /// Get headers ///
    function get_headers_curl($url)
    {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL,            $url);
        curl_setopt($ch, CURLOPT_HEADER,         true);
        curl_setopt($ch, CURLOPT_NOBODY,         true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT,        15);

        $r = curl_exec($ch);
        return $r;
    }

    $the_header = get_headers_curl($post_url);

    /// Check for redirect ///
    if (preg_match("/Location:/i", $the_header)) {
        $arr = explode('Location:', $the_header);
        $location = $arr[1];

        $location = explode(chr(10), $location);
        $location = $location[0];

        $the_header = get_headers_curl(trim($location));
    }

    /// Get charset ///
    $charset = null; // stays null when the header carries no charset
    if (preg_match("/charset=/i", $the_header)) {
        $arr = explode('charset=', $the_header);
        $charset = $arr[1];

        $charset = explode(chr(10), $charset);
        $charset = trim($charset[0]); // drop the trailing "\r"
    }

    // echo $charset;

    // $html is the page body, fetched separately
    if ($charset && strcasecmp($charset, 'UTF-8') != 0) {
        $html = iconv($charset, "UTF-8", $html);
    }
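The charset extraction can also be done with a single regular expression on the raw header text, which avoids the explode() bookkeeping. A minimal sketch, assuming the headers are already available as one string; charset_from_headers() is a hypothetical name, and it returns the last match so that the final response wins when redirects leave several header blocks in the string.

```php
<?php
// Hypothetical helper: pull the charset parameter out of raw HTTP headers.
// Returns null when no Content-Type line carries a charset.
function charset_from_headers(string $headers): ?string
{
    $pattern = '/^Content-Type:[^\r\n]*?charset\s*=\s*"?([^";\s]+)/im';
    if (preg_match_all($pattern, $headers, $m) > 0) {
        // Several header blocks (redirects) may be present: use the last one.
        return strtoupper(end($m[1]));
    }
    return null;
}

$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/xml; charset=iso-8859-1\r\n\r\n";
echo charset_from_headers($raw), PHP_EOL; // ISO-8859-1
```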

User · Answer

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

    // Convert everything in our vars to UTF-8 for playing nice with the database...
    // Use some auto detection here to help us not double-encode...
    // Suppress possible warnings with @'s for when encoding cannot be detected
    try
    {
        $process = array(&$_GET, &$_POST, &$_REQUEST);
        // each() was removed in PHP 8, so the growing queue is walked by index
        for ($key = 0; $key < count($process); $key++) {
            $val = $process[$key];
            foreach ($val as $k => $v) {
                unset($process[$key][$k]);
                if (is_array($v)) {
                    $process[$key][@mb_convert_encoding($k, 'UTF-8', 'auto')] = $v;
                    $process[] = &$process[$key][@mb_convert_encoding($k, 'UTF-8', 'auto')];
                } else {
                    $process[$key][@mb_convert_encoding($k, 'UTF-8', 'auto')] = @mb_convert_encoding($v, 'UTF-8', 'auto');
                }
            }
        }
        unset($process);
    }
    catch (Exception $ex) {}

User · Answer

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

Calling Encoding::fixUTF8() on four differently garbled spellings of "Fédération Camerounaise de Football" (the "é" characters encoded into UTF-8 one or more times too often) will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

User · Answer

I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:

    $html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;

mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't have any effect.

User · Answer

@harpax that worked for me. In my case, this is good enough:

    if (isUTF8($str)) {
        echo $str;
    } else {
        echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
    }

User · Answer

Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:

    $inputstring = getFromUser();
    $utf8string = iconv($current_encoding, 'utf-8', $inputstring);
    $flawedstring = iconv($current_encoding, 'utf-8', $utf8string);

You should try:

1. detect the encoding using mb_detect_encoding() or whatever you like to use
2. if it's UTF-8, convert into ISO 8859-1, and repeat step 1
3. finally, convert back into UTF-8

That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
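The "convert back and re-check" steps can be sketched as a small loop. This is an illustrative helper under assumptions, not a general repair tool: undo_double_encoding() is a hypothetical name, $middle is the encoding assumed for the flawed second conversion, and a layer is only peeled off when the result is still valid UTF-8 and re-encoding it reproduces the input exactly.

```php
<?php
// Hypothetical helper: peel off extra "X -> UTF-8" conversion layers.
// $middle is the encoding that was wrongly assumed in the repeated conversion.
function undo_double_encoding(string $text, string $middle = 'ISO-8859-1', int $maxRounds = 3): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        if (!mb_check_encoding($text, 'UTF-8')) {
            break; // not UTF-8 at all, nothing to peel
        }
        $peeled = mb_convert_encoding($text, $middle, 'UTF-8');
        if ($peeled === $text) {
            break; // pure ASCII, no layers possible
        }
        if (!mb_check_encoding($peeled, 'UTF-8')) {
            break; // one layer down is not UTF-8: $text was encoded only once
        }
        if (mb_convert_encoding($peeled, 'UTF-8', $middle) !== $text) {
            break; // lossy round trip, so $text was not produced from $middle
        }
        $text = $peeled;
    }
    return $text;
}

// "ß" encoded twice via ISO-8859-1 is the four bytes C3 83 C2 9F.
echo bin2hex(undo_double_encoding("\xC3\x83\xC2\x9F")), PHP_EOL; // c39f
```

A genuine UTF-8 text that happens to spell a sequence such as "Ã©" would also be "repaired", so a loop like this is best applied only to data already known to be double-encoded.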

User · Answer

You first have to detect what encoding has been used. As you're parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that's missing too, use UTF-8 as defined in the specification.

Edit:

Here is what I probably would do:

I'd use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we'll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that's also missing, the XML specs define to use UTF-8 as encoding.

    $url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

    $accept = array(
        'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
        'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
    );
    $header = array(
        'Accept: '.implode(', ', $accept['type']),
        'Accept-Charset: '.implode(', ', $accept['charset']),
    );
    $encoding = null;
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    $response = curl_exec($curl);
    if (!$response) {
        // error fetching the response
    } else {
        $offset = strpos($response, "\r\n\r\n");
        $header = substr($response, 0, $offset);
        // split off the body up front so it exists on every path below
        $body = substr($response, $offset + 4);
        if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
            // error parsing the response
        } else {
            if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
                // type not accepted
            }
            // the charset group is absent when the header has no charset parameter
            $encoding = isset($match[2]) ? trim($match[2], " \t\r\n\"'") : null;
        }
        if (!$encoding) {
            if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
                $encoding = trim($match[1], '"\'');
            }
        }
        if (!$encoding) {
            $encoding = 'utf-8';
        } else {
            $encoding = strtolower($encoding);
            if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
                // encoding not accepted
            }
            if ($encoding != 'utf-8') {
                $body = mb_convert_encoding($body, 'utf-8', $encoding);
            }
        }
        $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
        if (!$simpleXML) {
            // parse error
        } else {
            echo $simpleXML->asXML();
        }
    }
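The fallback order described in this answer (HTTP charset parameter, then the XML declaration, then UTF-8) can be isolated from the network code. A minimal sketch, assuming the header charset has already been extracted; xml_declared_encoding() and resolve_feed_encoding() are hypothetical names.

```php
<?php
// Hypothetical helper: read the encoding from an XML declaration such as
// <?xml version="1.0" encoding="ISO-8859-1"?>, allowing a leading UTF-8 BOM.
function xml_declared_encoding(string $body): ?string
{
    $pattern = '/^(?:\xEF\xBB\xBF)?<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
             . '\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    return preg_match($pattern, $body, $m) ? $m[1] : null;
}

// Hypothetical helper: apply the fallback chain from the answer above.
function resolve_feed_encoding(?string $headerCharset, string $body): string
{
    if ($headerCharset !== null && $headerCharset !== '') {
        return $headerCharset;                         // 1. HTTP charset parameter
    }
    return xml_declared_encoding($body) ?? 'UTF-8';    // 2. XML declaration, 3. default
}

$feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss/>';
echo resolve_feed_encoding(null, $feed), PHP_EOL; // ISO-8859-1
```

Keeping this decision in pure functions makes it possible to test the fallback order without fetching a real feed.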

User · Answer

"ÃŸ" is Mojibake for "ß". In your database, you may have hex:

- DF if the column is "latin1",
- C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded",
- C383C5B8 if double-encoded into a utf8 column.

You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.

If MySQL is involved, see: Trouble with utf8 characters; what I see is not what I stored
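The three hex values can be reproduced in PHP, which may help when comparing against what the database's HEX() output shows. A small sketch using mbstring only; the variable names are illustrative, and Windows-1252 is used for the double encoding because that is what MySQL calls latin1.

```php
<?php
$eszett = "\xC3\x9F"; // "ß" as UTF-8

// Stored as a single Latin-1 byte.
$latin1 = mb_convert_encoding($eszett, 'ISO-8859-1', 'UTF-8');

// The UTF-8 bytes misread as Windows-1252 and encoded to UTF-8 again.
$double = mb_convert_encoding($eszett, 'UTF-8', 'Windows-1252');

printf(
    "%s %s %s\n",
    strtoupper(bin2hex($latin1)),
    strtoupper(bin2hex($eszett)),
    strtoupper(bin2hex($double))
); // DF C39F C383C5B8
```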

User · Answer

Try without 'auto'.

That is:

    mb_detect_encoding($text)

instead of:

    mb_detect_encoding($text, 'auto')

More information can be found here: mb_detect_encoding

User · Answer

Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).

User · Answer

A really nice way to implement an isUTF8 function can be found on php.net:

    function isUTF8($string) {
        return (utf8_encode(utf8_decode($string)) == $string);
    }
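Since utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, the same check can be written with mbstring instead. This is a swapped-in technique rather than the php.net snippet: it validates the byte sequence directly instead of doing a Latin-1 round trip.

```php
<?php
// Alternative isUTF8() sketch: true when $string is a valid UTF-8 byte sequence.
// Plain ASCII also counts as valid UTF-8.
function isUTF8(string $string): bool
{
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(isUTF8("Fu\xC3\x9Fball")); // bool(true):  UTF-8 bytes
var_dump(isUTF8("Fu\xDFball"));     // bool(false): ISO-8859-1 bytes
```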

User · Answer

After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.

Example: set character set utf8

Passing utf8 data to a latin1 table in a latin1 I/O session gives those nasty bird's feet. I see this every other day in osCommerce shops. Going back and forth it might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.

How to recover existing scrambled MySQL data is another thread to discuss.
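With PDO, one way to tell MySQL the connection charset is the charset parameter of the DSN. A minimal sketch, assuming PDO with the MySQL driver; mysql_utf8_dsn() is a hypothetical helper, the host and database names are placeholders, and utf8mb4 (MySQL's full UTF-8 charset) is chosen here instead of the utf8 named in the answer.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that declares the connection charset.
function mysql_utf8_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = mysql_utf8_dsn('localhost', 'feeds'); // placeholder host and database
echo $dsn, PHP_EOL; // mysql:host=localhost;dbname=feeds;charset=utf8mb4

// Usage (needs a running server and real credentials, so it is left commented out):
// $pdo = new PDO($dsn, $user, $password);
```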

User · Answer

When you try to handle multiple languages like Japanese and Korean, you might get in trouble. mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.

I concluded that as long as the input strings come from HTML, it should use 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.

The snippet below extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.

    <?php
    require_once 'simple_html_dom.php';

    echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;

    function convert_title_to_utf8($contents)
    {
        $dom = str_get_html($contents);
        $title = $dom->find('title', 0);
        if (empty($title)) {
            return null;
        }
        $title = $title->plaintext;
        $metas = $dom->find('meta');
        $charset = 'auto';
        foreach ($metas as $meta) {
            if (!empty($meta->charset)) { // html5
                $charset = $meta->charset;
            } else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
                $charset = $match[1];
            }
        }
        if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
            $charset = 'auto';
        }
        return mb_convert_encoding($title, 'UTF-8', $charset);
    }
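When pulling in a DOM parser is not an option, the meta charset lookup can be approximated with a regular expression. A rough sketch only: charset_from_meta() is a hypothetical name, the pattern handles the two common meta forms but is no substitute for a real HTML parser, and, like the snippet above, it only accepts names listed by mb_list_encodings(), so aliases such as Shift_JIS are rejected.

```php
<?php
// Hypothetical helper: find the charset declared in a <meta> element.
// Covers <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
function charset_from_meta(string $html): ?string
{
    if (!preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return null;
    }
    // Only return names that mbstring lists as supported encodings.
    $known = array_map('strtolower', mb_list_encodings());
    return in_array(strtolower($m[1]), $known, true) ? $m[1] : null;
}

echo charset_from_meta('<html><head><meta charset="EUC-JP"></head></html>'), PHP_EOL; // EUC-JP
```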

User · Answer

php.net: mb_detect_encoding

    echo mb_detect_encoding($str, "auto");

or

    echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");

I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try whether mb_detect_encoding works or not.

Update: "auto" is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.

    <?php
    function convertToUTF8($str) {
        $enc = mb_detect_encoding($str);

        if ($enc && $enc != 'UTF-8') {
            return iconv($enc, 'UTF-8', $str);
        } else {
            return $str;
        }
    }
    ?>

I haven't tested it, so no guarantee. And maybe there's a simpler way.

User · Answer

This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.

    class CharsetDetector
    {
        private static $CHARSETS = array(
            "ISO_8859-1",
            "ISO_8859-15",
            "CP850"
        );

        // Characters typical for German text
        private static $TESTCHARS = array("€", "ä", "Ä", "ö", "Ö", "ü", "Ü", "ß");

        public static function convert($string)
        {
            return self::__iconv($string, self::getCharset($string));
        }

        public static function getCharset($string)
        {
            $normalized = self::__normalize($string);
            if (!strlen($normalized)) {
                return "UTF-8";
            }
            $best = "UTF-8";
            $charcountbest = 0;
            foreach (self::$CHARSETS as $charset) {
                $str = self::__iconv($normalized, $charset);
                $charcount = 0;
                $stop = mb_strlen($str, "UTF-8");

                for ($idx = 0; $idx < $stop; $idx++) {
                    $char = mb_substr($str, $idx, 1, "UTF-8");
                    foreach (self::$TESTCHARS as $testchar) {
                        if ($char == $testchar) {
                            $charcount++;
                            break;
                        }
                    }
                }
                if ($charcount > $charcountbest) {
                    $charcountbest = $charcount;
                    $best = $charset;
                }
                //echo $text."<br />";
            }
            return $best;
        }

        // Keeps only the bytes that are not part of a valid UTF-8 sequence
        private static function __normalize($str)
        {
            $len = strlen($str);
            $ret = "";
            $bytes = 1; // avoids an undefined variable on the first high byte
            for ($i = 0; $i < $len; $i++) {
                $c = ord($str[$i]);
                if ($c > 128) {
                    if (($c > 247)) $ret .= $str[$i];
                    elseif ($c > 239) $bytes = 4;
                    elseif ($c > 223) $bytes = 3;
                    elseif ($c > 191) $bytes = 2;
                    else $ret .= $str[$i];
                    if (($i + $bytes) > $len) $ret .= $str[$i];
                    $ret2 = $str[$i];
                    while ($bytes > 1) {
                        $i++;
                        $b = ord($str[$i]);
                        if ($b < 128 || $b > 191) { $ret .= $ret2; $ret2 = ""; $i += $bytes - 1; $bytes = 1; break; }
                        else $ret2 .= $str[$i];
                        $bytes--;
                    }
                }
            }
            return $ret;
        }

        private static function __iconv($string, $charset)
        {
            return iconv($charset, "UTF-8", $string);
        }
    }

User · Answer

The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:

    // $input is actually UTF-8

    mb_detect_encoding($input, "ISO-8859-9, UTF-8");
    // ISO-8859-9 (WRONG!)

    mb_detect_encoding($input, "UTF-8, ISO-8859-9");
    // UTF-8 (OK)

So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.

User · Answer

I have been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching. I tested some of the suggestions you mentioned and here are my notes.

My test string is a deliberately badly written sentence, along the lines of "this is a wrong written string but I need to put some special chars to see them converted by function & that's it", with accented vowels in place of the plain ones.

I do an INSERT to save this string in a database field that is set as utf8_general_ci.

The character set of my page is UTF-8.

If I do an INSERT just like that, in my database I have some characters probably coming from Mars.

So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but alien chars were still invading my database.

So I tried to use the function forceUTF8 posted in answer number 8, but the string saved in the database still had every accented character garbled.

So, collecting some more information on this page and merging it with other information on other pages, I solved my problem with this solution:

    $finallyIDidIt = mb_convert_encoding(
        $string,
        mysql_client_encoding($resourceID),
        mb_detect_encoding($string)
    );

Now in my database I have my string with the correct encoding.

NOTE: The only thing to take care of is the function mysql_client_encoding. You need to be connected to the database, because this function wants a resource ID as a parameter.

But well, I just do that re-encoding before my INSERT, so for me it is not a problem.

User · Answer

The most voted answer doesn't work. Here is mine; I hope it helps.

    function toUTF8($raw) {
        try {
            return mb_convert_encoding($raw, "UTF-8", "auto");
        } catch (\Exception $e) {
            return mb_convert_encoding($raw, "UTF-8", "GBK");
        }
    }

User · Answer

You need to test the character set on input, since responses can come coded with different encodings.

I force all incoming content into UTF-8 by doing detection and translation using the following function:

    function fixRequestCharset()
    {
        $ref = array(&$_GET, &$_POST, &$_REQUEST);
        foreach ($ref as &$var) {
            foreach ($var as $key => $val) {
                $encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
                if (!$encoding) {
                    continue;
                }
                if (strcasecmp($encoding, 'UTF-8') != 0) {
                    $encoding = iconv($encoding, 'UTF-8', $var[$key]);
                    if ($encoding === false) {
                        continue;
                    }
                    $var[$key] = $encoding;
                }
            }
        }
    }

That routine will turn all PHP variables that come from the remote host into UTF-8, or ignore the value if the encoding could not be detected or converted.

You can customize it to your needs. Just invoke it before using the variables.
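The routine above handles one level of request variables; nested arrays (for example from form fields named tags[]) need a recursive walk. A minimal sketch, assuming that anything which is not valid UTF-8 is Windows-1252; utf8_deep() and its $fallback parameter are hypothetical names.

```php
<?php
// Hypothetical helper: return a copy of $value with all string keys and
// string values converted to UTF-8, descending into nested arrays.
function utf8_deep($value, string $fallback = 'Windows-1252')
{
    if (is_array($value)) {
        $out = [];
        foreach ($value as $k => $v) {
            $k = is_string($k) ? utf8_deep($k, $fallback) : $k;
            $out[$k] = utf8_deep($v, $fallback);
        }
        return $out;
    }
    // Strings that are already valid UTF-8 are left untouched.
    if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
        return mb_convert_encoding($value, 'UTF-8', $fallback);
    }
    return $value;
}

$clean = utf8_deep(['q' => "Fu\xDFball", 'tags' => ["caf\xE9", 'ok']]);
echo $clean['q'], ' ', $clean['tags'][0], PHP_EOL; // Fußball café
```

It could be applied as $_GET = utf8_deep($_GET); before the variables are used.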

User · Answer

I found a solution at http://deer.org.ua/2009/10/06/1/. I think that using @ is a bad decision, so I made some changes to the solution from deer.org.ua:

    class Encoding
    {
        /**
         * http://deer.org.ua/2009/10/06/1/
         * @param $string
         * @return null
         */
        public static function detect_encoding($string)
        {
            static $list = ['utf-8', 'windows-1251'];

            foreach ($list as $item) {
                try {
                    $sample = iconv($item, $item, $string);
                } catch (\Exception $e) {
                    continue;
                }
                if (md5($sample) == md5($string)) {
                    return $item;
                }
            }
            return null;
        }
    }

    $content = file_get_contents($file['tmp_name']);
    $encoding = Encoding::detect_encoding($content);
    if ($encoding != 'utf-8') {
        $result = iconv($encoding, 'utf-8', $content);
    } else {
        $result = $content;
    }

User · Answer

A little heads up. You said that the "ß" should be displayed as "Ÿ" in your database.

This is probably because you're using a database with Latin-1 character encoding, or possibly your PHP-MySQL connection is set wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your sent data as UTF-8, causing this kind of trouble.

Take a look at mysql_set_charset. It may help you.

User · Answer

It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.

So, when you're fetching a certain feed that's ISO 8859-1, parse it through utf8_encode.

However, if you're fetching a UTF-8 feed, you don't need to do anything.

User · Answer

This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function detecting multibyte characters in a string might also prove helpful (source):

    function detectUTF8($string)
    {
        return preg_match('%(?:
            [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
            |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
            |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )+%xs',
        $string);
    }
