Fixing broken UTF-8 encoding

Question

I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL. In my database I have a few instances of bad encodings, where an accented character prints as a short run of garbled characters.

- The database collation is utf8_general_ci
- PHP is using a proper UTF-8 header
- Notepad++ is set to use UTF-8 without BOM
- Database management is handled in phpMyAdmin
- Not all cases of accented characters are broken

I need some sort of function that will help me map these garbled sequences, and others like them, to their proper accented UTF-8 characters.

User · Accepted Answer

I've had to try to "fix" a number of broken UTF-8 situations in the past, and unfortunately it's never easy, and often close to impossible. Unless you can determine exactly how the data was broken, and it was always broken in that exact same way, it's going to be hard to "undo" the damage.

If you want to try to undo the damage, your best bet is to start writing some sample code where you attempt numerous variations on calls to mb_convert_encoding(), to see if you can find a combination of "from" and "to" that fixes your data.

In the end, it's often best not to bother fixing the old data because of the pain involved, and instead to just fix things going forward.

However, before doing this, you need to make sure that you fix everything that is causing the issue in the first place. You've already mentioned that your DB table collation and editors are set properly, but there are more places to check to make sure that everything is properly UTF-8:

- Make sure that you are serving your HTML as UTF-8:

      header('Content-Type: text/html; charset=utf-8');

- Change your PHP default charset to utf-8:

      ini_set('default_charset', 'utf-8');

- If your database doesn't ALWAYS talk in utf-8, you may need to tell it on a per-connection basis to ensure it's in utf-8 mode. In MySQL you do that by issuing:

      charset utf8

- You may need to tell your webserver to always try to talk in UTF-8. In Apache this command is:

      AddDefaultCharset UTF-8

- Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* styled "multibyte aware" string functions. It also means that when calling functions such as htmlspecialchars(), you include the appropriate 'utf-8' charset parameter at the end to make sure that it doesn't encode them incorrectly.

If you mess up on any one step through your whole process, the encoding can be mangled and problems arise. Once you get in the "groove" of doing utf-8, though, this all becomes second nature. And of course, PHP6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).
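The trial-and-error approach mentioned above (trying different "from" and "to" combinations with mb_convert_encoding()) could be sketched roughly like this. It is only an illustration: the helper name tryConversions(), the list of candidate encodings and the sample string are assumptions made for the example, not something from the answer. It keeps only the results that are valid UTF-8 and differ from the input, so you still have to eyeball which candidate actually looks right.

```php
<?php
// Hypothetical helper: try every (from, to) pair from a list of candidate
// encodings and collect the results that are valid UTF-8 and not identical
// to the input. Requires the mbstring extension.
function tryConversions(string $broken, array $encodings): array
{
    $candidates = [];
    foreach ($encodings as $from) {
        foreach ($encodings as $to) {
            if ($from === $to) {
                continue;
            }
            // Interpret $broken as $from and re-encode it as $to.
            $result = mb_convert_encoding($broken, $to, $from);
            if ($result !== $broken && mb_check_encoding($result, 'UTF-8')) {
                $candidates["$from -> $to"] = $result;
            }
        }
    }
    return $candidates;
}

// Sample input (an assumption): "Fédération" that was UTF-8 encoded twice.
$broken = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration";

foreach (tryConversions($broken, ['UTF-8', 'ISO-8859-1', 'Windows-1252']) as $label => $text) {
    echo $label, ': ', $text, PHP_EOL;
}
```

Several candidates will usually pass the validity check, including ones that make the damage worse, so run it against a handful of rows whose correct text you already know before touching the whole table.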

User · Answer

I found a solution after days of searching. My comment is going to be buried, but anyway...

- I get the corrupted data with PHP (I don't use SET NAMES UTF8).
- I use utf8_decode() on my data.
- I update my database with my new decoded data, still not using SET NAMES UTF8.

And voilà.
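Since not every row is necessarily broken, a guarded version of the decode step might look like the sketch below. It uses mb_convert_encoding() to ISO-8859-1, which performs the same UTF-8 to Latin-1 conversion as utf8_decode() (deprecated as of PHP 8.2). The function name undoDoubleEncoding() and the two safety checks are assumptions added for illustration; the answer itself only describes the plain decode.

```php
<?php
// Hypothetical helper: strip one layer of double encoding, but only when
// doing so is lossless and the result is still valid UTF-8.
// Requires the mbstring extension.
function undoDoubleEncoding(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, so this is a different kind of damage
    }

    // One decode pass: UTF-8 characters become single Latin-1 bytes.
    $decoded = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // Lossless means every character fitted into Latin-1 (no "?" substitutes).
    $lossless = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $text;

    // Correctly encoded text such as "é" decodes to a lone 0xE9 byte, which
    // is not valid UTF-8, so it is left untouched.
    if ($lossless && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }
    return $text;
}

echo undoDoubleEncoding("F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration"), PHP_EOL; // double-encoded input
echo undoDoubleEncoding("F\xC3\xA9d\xC3\xA9ration"), PHP_EOL;                 // already correct input
```

The check works on whole strings, so a value that mixes correct and double-encoded characters is returned unchanged rather than half-repaired.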

User · Answer

I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:

    function fix_double_encoding($string)
    {
        // Accented characters to repair; extend the list to suit your data.
        $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
        $utf8_double_encoded = array();
        foreach ($utf8_chars as $utf8_char) {
            $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
        }
        $string = str_replace($utf8_double_encoded, $utf8_chars, $string);
        return $string;
    }

This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.

User · Answer

If you utf8_encode() a string that is already UTF-8, it looks garbled, and more so each time it is encoded again.

I made a function toUTF8() that converts strings into UTF-8. You don't need to specify what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or a mix of these three. I used this myself on a feed with mixed encodings in the same string.

Usage:

    $utf8_string = Encoding::toUTF8($mixed_string);
    $latin1_string = Encoding::toLatin1($mixed_string);

My other function, fixUTF8(), fixes garbled UTF-8 strings if they were encoded into UTF-8 multiple times.

Usage:

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

Download: https://github.com/neitanod/forceutf8
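To see why repeated encoding garbles text, it can help to look at the raw bytes. The sketch below is not part of the library above; encodeAgain() is a made-up helper that reproduces what utf8_encode() does (treat every byte as a Latin-1 character and write it out as UTF-8) using mb_convert_encoding().

```php
<?php
// Hypothetical helper: apply "Latin-1 to UTF-8" to a string $times times,
// which is what happens when already-encoded UTF-8 is encoded again.
// Requires the mbstring extension.
function encodeAgain(string $utf8, int $times): string
{
    for ($i = 0; $i < $times; $i++) {
        $utf8 = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
    }
    return $utf8;
}

$e = "\xC3\xA9"; // "é" as correct UTF-8 (2 bytes)

for ($n = 0; $n <= 2; $n++) {
    echo $n, ' extra pass(es): ', bin2hex(encodeAgain($e, $n)), PHP_EOL;
}
// 0 extra pass(es): c3a9
// 1 extra pass(es): c383c2a9
// 2 extra pass(es): c383c283c382c2a9
```

Each pass doubles the byte count of every non-ASCII character, which is why the garbled runs get longer with every round of encoding.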

User · Answer

As Dan pointed out, you need to convert them to binary and then convert to the correct encoding.

E.g., for utf8 stored as latin1, the following SQL will fix it:

    UPDATE table
       SET field = CONVERT(CAST(field AS BINARY) USING utf8)
     WHERE $broken_field_condition

User · Answer

If you have double-encoded UTF-8 characters (various smart quotes, dashes, apostrophes, quotation marks, etc.) in MySQL, you can dump the data, then read it back in to fix the broken encoding.

Like this:

    mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names \
        --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

    mysql -h DB_HOST -u DB_USER -p \
        --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

With -p on its own, both commands prompt for the password. To pass it on the command line, attach it directly as -pDB_PASSWORD; with a space in between, the client treats the next word as the database name.

This was a 100% fix for my double-encoded UTF-8.

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

User · Answer

It looks like your utf-8 is being interpreted as iso8859-1 or Win-1250 at some point.

When you say "In my database I have a few instances of bad encodings", how did you check this? Through your app, phpMyAdmin or the command line client? Are all utf-8 encodings showing up like this, or only some? Is it possible you had the encodings wrong and the data was incorrectly converted from iso8859-1 to utf-8 when it was utf-8 already?

User · Answer

I had a problem with an XML file that had a broken encoding: it said it was utf-8, but it had characters that were not utf-8. After several rounds of trial and error with mb_convert_encoding(), I managed to fix it with:

    mb_convert_encoding($text, 'Windows-1252', 'UTF-8')
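Whether Windows-1252 or ISO-8859-1 is the right target matters once the text contains characters such as curly quotes, which exist in Windows-1252 but not in Latin-1. The sketch below illustrates that difference under assumed sample data; reverseMisread() is a hypothetical helper, not an mbstring function.

```php
<?php
// Hypothetical helper: undo "UTF-8 bytes misread as $assumedCharset and
// re-encoded as UTF-8". Returns null when the result is not valid UTF-8,
// i.e. when the assumed charset was probably the wrong guess.
// Requires the mbstring extension.
function reverseMisread(string $text, string $assumedCharset): ?string
{
    $bytes = mb_convert_encoding($text, $assumedCharset, 'UTF-8');
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : null;
}

// Sample input (an assumption): a right single quotation mark (U+2019)
// whose UTF-8 bytes were misread as Windows-1252 and encoded again.
$broken = "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2";

var_dump(reverseMisread($broken, 'Windows-1252') === "\xE2\x80\x99"); // bool(true)
var_dump(reverseMisread($broken, 'ISO-8859-1'));                      // NULL
```

The Latin-1 attempt fails because two of the three garbled characters have no Latin-1 equivalent, so mbstring substitutes "?" and the bytes no longer form valid UTF-8.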

User · Answer

In my case, I found out by using mb_convert_encoding() that the previous encoding was iso-8859-1 (which is latin1). Then I fixed my problem with an SQL query:

    UPDATE myDB.myTable SET myColumn = CAST(CAST(CONVERT(myColumn USING latin1) AS binary) AS CHAR)

However, the MySQL documentation states that the conversion may be lossy if the column contains characters that are not in both character sets.

User · Answer

Another thing to check, which happened to be my solution, is how data is being returned from your server. In my application, I'm using PDO to connect from PHP to MySQL. I needed to add a flag to the connection which said to get the data back in UTF-8 format.

The answer was:

    $dbHandle = new PDO("mysql:host=$dbHost;dbname=$dbName;charset=utf8", $dbUser, $dbPass,
        array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'"));

User · Answer

This script had a nice approach. Converting it to the language of your choice should not be too difficult:

http://plasmasturm.org/log/416/

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Encode qw( decode FB_QUIET );

    binmode STDIN, ':bytes';
    binmode STDOUT, ':encoding(UTF-8)';

    my $out;

    while ( <> ) {
      $out = '';
      while ( length ) {
        # consume input string up to the first UTF-8 decode error
        $out .= decode( "utf-8", $_, FB_QUIET );
        # consume one character; all octets are valid Latin-1
        $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
      }
      print $out;
    }
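A rough PHP rendering of the same idea might look like the following. It is a sketch rather than a line-by-line port: the function name mixedToUtf8() is made up, and instead of decoding up to the first error it checks one character at a time with mb_check_encoding(), falling back to Latin-1 for any byte that does not start a well-formed UTF-8 character.

```php
<?php
// Hypothetical helper: normalise a byte string that mixes UTF-8 and Latin-1
// into clean UTF-8. Requires the mbstring extension.
function mixedToUtf8(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    $pos = 0;

    while ($pos < $len) {
        $matched = false;

        // A UTF-8 character is 1 to 4 bytes long; the shortest chunk that
        // validates at this position is exactly one character.
        for ($n = 1; $n <= 4 && $pos + $n <= $len; $n++) {
            $chunk = substr($bytes, $pos, $n);
            if (mb_check_encoding($chunk, 'UTF-8')) {
                $out .= $chunk;
                $pos += $n;
                $matched = true;
                break;
            }
        }

        if (!$matched) {
            // Not UTF-8 here: consume one byte as Latin-1 (every byte value
            // is a valid Latin-1 character) and re-encode it as UTF-8.
            $out .= mb_convert_encoding($bytes[$pos], 'UTF-8', 'ISO-8859-1');
            $pos++;
        }
    }
    return $out;
}

// "café" with a Latin-1 é followed by "naïve" with a UTF-8 ï.
echo mixedToUtf8("caf\xE9 na\xC3\xAFve"), PHP_EOL;
```

Like the Perl original, this cannot tell a genuine Latin-1 sequence such as "Ã©" from a UTF-8 "é", because both are the same two bytes; it only rescues bytes that cannot be UTF-8.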

User · Answer

I had the same problem a long time ago, and I fixed it using:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">

User · Answer

The way to do it is to convert to binary and then to the correct encoding.
