There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4
encoding for tables, fields, and connections.
My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:
$encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
throw new RuntimeException
UTF-8
, carry on.Else, if it is ISO-8859-1
or ASCII
a. Attempt conversion to UTF-8 (wait, not finished)
b. Detect the encoding of the converted value
c. If the reported encoding and converted value are both UTF-8
, carry on.
d. Else, throw new RuntimeException
From my abstract class Sanitizer
private function isUTF8($encoding, $value)
{
return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
}
private function utf8tify(&$value)
{
$encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
mb_internal_encoding('UTF-8');
mb_substitute_character(0xfffd); //REPLACEMENT CHARACTER
mb_detect_order($encodings);
$stringEncoding = mb_detect_encoding($value, $encodings, true);
if (!$stringEncoding) {
$value = null;
throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
}
if ($this->isUTF8($stringEncoding, $value)) {
return;
} else {
$value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
$stringEncoding = mb_detect_encoding($value, $encodings, true);
if ($this->isUTF8($stringEncoding, $value)) {
return;
} else {
$value = null;
throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
}
}
return;
}
One could make an argument that I should separate encoding concerns from my abstract Sanitizer
class and simply inject an Encoder
object into a concrete child instance of Sanitizer
. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more. I found this article.
Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL
or mcrypt
)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode()
and utf8_encode()
in Sanitizer::isUTF8
are dubious.
People have pointed out short-comings in the PHP mb_* functions. I never took time to investigate iconv
, but if it works better than mb_*functions, let me know.