[php] PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

There is no completely accurate way to identify the charset of a string. There are only ways to guess it. One of these, and probably the best currently available in PHP, is mb_detect_encoding(). It scans your string and looks for byte sequences that are characteristic of certain charsets. Depending on your string, there may not be any such distinguishing sequences.
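As a rough illustration of such a guess, here is a minimal sketch. It assumes the mbstring extension is loaded; guess_encoding() and its candidate list are hypothetical names chosen for this example, not part of PHP.

```php
<?php
// Hypothetical helper: guess_encoding() is not a PHP built-in.
// Assumes the mbstring extension is available.
function guess_encoding(string $bytes, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string
{
    // Third argument true = strict mode: mb_detect_encoding() returns false
    // when the string is not valid in any of the candidate encodings.
    $encoding = mb_detect_encoding($bytes, $candidates, true);

    return $encoding === false ? null : $encoding;
}

// "café" stored as ISO-8859-1: the lone 0xE9 byte is not valid UTF-8,
// so UTF-8 is ruled out and ISO-8859-1 is what remains.
var_dump(guess_encoding("caf\xE9"));
```

The result is only ever as good as the candidate list you pass in: an encoding that is not listed can never be returned.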

Take ISO-8859-1 vs. ISO-8859-15, for example ( http://en.wikipedia.org/wiki/ISO/IEC_8859-15#Changes_from_ISO-8859-1 ).

Only a handful of characters differ between them, and to make it worse, the differing characters are represented by the same bytes. Given a string whose encoding you don't know, there is no way to detect whether byte 0xA4 is supposed to signify ¤ or €, so there is no way to know its exact charset.
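To make the 0xA4 ambiguity concrete, this small sketch (assuming mbstring is available; the variable names are just for illustration) converts the same single byte to UTF-8 under both assumptions:

```php
<?php
// The same byte, interpreted under two different source charsets.
$byte = "\xA4";

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

echo bin2hex($asLatin1), "\n"; // c2a4   = U+00A4, the currency sign ¤
echo bin2hex($asLatin9), "\n"; // e282ac = U+20AC, the euro sign €
```

Both conversions succeed without any warning, which is exactly why a detector cannot tell the two charsets apart from the bytes alone.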

(Note: you could add a human factor, or an even more advanced scanning technique (e.g. what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far)

There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you cannot, and should never, rely on the result being correct.
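A short sketch of why UTF-8 is easier to tell apart, assuming mbstring is available: UTF-8 has strict rules for multi-byte sequences, so typical ISO-8859-1 text with accented characters fails a UTF-8 validity check, while the reverse check tells you nothing.

```php
<?php
$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8

// 0xE9 announces a multi-byte sequence that never follows: invalid UTF-8.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// The UTF-8 bytes are also acceptable ISO-8859-1 (they would read "cafÃ©"),
// so passing this check proves nothing about the real charset.
var_dump(mb_check_encoding($utf8, 'ISO-8859-1')); // bool(true)
```

In other words, "valid UTF-8" is a fairly strong hint, whereas "valid ISO-8859-1" is not a hint at all.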

Interesting read: http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-determine-the-charset-encoding-of-a-string

There are other ways of ensuring the correct charset though. Concerning forms, try to enforce UTF-8 as much as possible (check out the snowman trick to make sure your submission will be UTF-8 in every browser: http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen ). That being done, you can at least be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the Unix 'file -i' command on them through e.g. exec() (if possible on your server) to aid the detection (it uses, among other things, the document's BOM). Concerning scraped data, you could read the HTTP headers, which usually specify the charset. When parsing XML files, see if the XML declaration contains a charset definition.
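The header, XML and 'file -i' ideas above could be sketched roughly as follows. All three function names are hypothetical, the regular expressions are deliberately simple rather than full parsers, and the last function assumes that exec() is enabled and that a 'file' binary supporting the -b and -i flags is installed on the server.

```php
<?php
// Hypothetical helpers; none of these are PHP built-ins.

// Extract the charset parameter from a Content-Type style value,
// e.g. "text/html; charset=ISO-8859-1". Returns null if none is present.
function charset_from_content_type(string $value): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $value, $m)) {
        return $m[1];
    }
    return null;
}

// Extract the encoding from an XML declaration at the very start of a document.
function charset_from_xml_declaration(string $xml): ?string
{
    if (preg_match('/^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/', $xml, $m)) {
        return $m[1];
    }
    return null;
}

// Ask the Unix "file" command for a MIME description of an uploaded file.
// Assumption: exec() is allowed and "file -bi" prints something like
// "text/plain; charset=utf-8". The answer is itself only a guess.
function charset_from_file_command(string $path): ?string
{
    $output = [];
    $status = 1;
    exec('file -bi ' . escapeshellarg($path), $output, $status);

    if ($status !== 0 || !isset($output[0])) {
        return null;
    }
    return charset_from_content_type($output[0]);
}

var_dump(charset_from_content_type('text/html; charset=ISO-8859-1'));
var_dump(charset_from_xml_declaration('<?xml version="1.0" encoding="ISO-8859-15"?><a/>'));
```

Keep in mind that a declared charset can itself be wrong, so treat it as a better starting point than guessing, not as a guarantee.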

Rather than trying to automagically guess the charset, you should first try to enforce a certain charset yourself where possible, or try to grab a definition from the source you're getting the data from (if applicable), before resorting to detection.
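Put together, that order of preference might look like the sketch below. It assumes mbstring is available; to_utf8_best_effort() is a hypothetical name, and falling back to ISO-8859-1 is an arbitrary choice for this example that you would replace with whatever legacy charset is most likely for your data.

```php
<?php
// Hypothetical helper: convert to UTF-8, trusting a declared charset first
// and guessing only as a last resort.
function to_utf8_best_effort(string $bytes, ?string $declared = null, string $fallback = 'ISO-8859-1'): string
{
    // 1. Use the declared charset if mbstring knows it by that name and the
    //    bytes are valid in it. (Aliases such as "latin1" are not matched
    //    by this simple lookup.)
    $known = array_map('strtoupper', mb_list_encodings());
    if ($declared !== null
        && in_array(strtoupper($declared), $known, true)
        && mb_check_encoding($bytes, $declared)
    ) {
        return mb_convert_encoding($bytes, 'UTF-8', $declared);
    }

    // 2. No usable declaration: if the bytes are already valid UTF-8, keep them.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // 3. Last resort: assume the fallback charset. This is a guess.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8_best_effort("\xA4", 'ISO-8859-15')), "\n"; // e282ac (€)
echo bin2hex(to_utf8_best_effort("caf\xE9")), "\n";             // 636166c3a9 (café)
```

The third step can still produce the wrong characters (for example ¤ where € was meant), which is why the declared charset is tried first.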
