[php] Fixing broken UTF-8 encoding

I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.

In my database I have a few instances of bad encodings that print like: î

  • The database collation is utf8_general_ci
  • PHP is using a proper UTF-8 header
  • Notepad++ is set to use UTF-8 without BOM
  • database management is handled in phpMyAdmin
  • not all cases of accented characters are broken

I need some sort of function that will help me map the instances of î, í, ü and others like it to their proper accented UTF-8 characters.

This question is related to php mysql unicode utf-8

The answer is


If you have double-encoded UTF8 characters (various smart quotes, dashes, apostrophe ’, quotation mark “, etc), in mysql you can dump the data, then read it back in to fix the broken encoding.

Like this:

mysqldump -h DB_HOST -u DB_USER -p DB_PASSWORD --opt --quote-names \
    --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

mysql -h DB_HOST -u DB_USER -p DB_PASSWORD \
    --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

This was a 100% fix for my double encoded UTF-8.

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/


I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:

function fix_double encoding($string)
{
    $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
    $utf8_double_encoded = array();
    foreach($utf8_chars as $utf8_char)
    {
            $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
    }
    $string = str_replace($utf8_double_encoded, $utf8_chars, $string);
    return $string;
}

This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.


If you utf8_encode() on a string that is already UTF-8 then it looks garbled when it is encoded multiple times.

I made a function toUTF8() that converts strings into UTF-8.

You don't need to specify what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or a mix of these three.

I used this myself on a feed with mixed encodings in the same string.

Usage:

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

My other function fixUTF8() fixes garbled UTF8 strings if they were encoded into UTF8 multiple times.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8


Another thing to check, which happened to be my solution (found here), is how data is being returned from your server. In my application, I'm using PDO to connect from PHP to MySQL. I needed to add a flag to the connection which said get the data back in UTF-8 format

The answer was

$dbHandle = new PDO("mysql:host=$dbHost;dbname=$dbName;charset=utf8", $dbUser, $dbPass, 
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'"));

It looks like your utf-8 is being interpreted as iso8859-1 or Win-1250 at some point.

When you say "In my database I have a few instances of bad encodings" - how did you check this? Through your app, phpmyadmin or the command line client? Are all utf-8 encodings showing up like this or only some? Is it possible you had the encodings wrong and it has been incorrectly converted from iso8859-1 to utf-8 when it was utf-8 already?


The way is to convert to binary and then to correct encoding


I had a problem with an xml file that had a broken encoding, it said it was utf-8 but it had characters that where not utf-8.
After several trials and errors with the mb_convert_encoding() I manage to fix it with

mb_convert_encoding($text, 'Windows-1252', 'UTF-8')

I found a solution after days of search. My comment is going to be buried but anyway...

  1. I get the corrupted data with php.

  2. I don't use set names UTF8

  3. I use utf8_decode() on my data

  4. I update my database with my new decoded data, still not using set names UTF8

and voilà :)


As Dan pointed out: you need to convert them to binary and then convert/correct the encoding.

E.g., for utf8 stored as latin1 the following SQL will fix it:

UPDATE table
   SET field = CONVERT( CAST(field AS BINARY) USING utf8)
 WHERE $broken_field_condition

i had the same problem long time ago, and it fixed it using

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">

In my case, I found out by using "mb_convert_encoding" that the previous encoding was iso-8859-1 (which is latin1) then I fixed my problem by using an sql query :

UPDATE myDB.myTable SET myColumn = CAST(CAST(CONVERT(myColumn USING latin1) AS binary) AS CHAR)

However, it is stated in the mysql documentations that conversion may be lossy if the column contains characters that are not in both character sets.


This script had a nice approach. Converting it to the language of your choice should not be too difficult:

http://plasmasturm.org/log/416/

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

binmode STDIN, ':bytes';
binmode STDOUT, ':encoding(UTF-8)';

my $out;

while ( <> ) {
  $out = '';
  while ( length ) {
    # consume input string up to the first UTF-8 decode error
    $out .= decode( "utf-8", $_, FB_QUIET );
    # consume one character; all octets are valid Latin-1
    $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
  }
  print $out;
}

Examples related to php

I am receiving warning in Facebook Application using PHP SDK Pass PDO prepared statement to variables Parse error: syntax error, unexpected [ Preg_match backtrack error Removing "http://" from a string How do I hide the PHP explode delimiter from submitted form results? Problems with installation of Google App Engine SDK for php in OS X Laravel 4 with Sentry 2 add user to a group on Registration php & mysql query not echoing in html with tags? How do I show a message in the foreach loop?

Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8