PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

Question

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user has actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.

What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text); but that has problems (if the input is "fiancée" it returns "fianc"). I've tried a lot of things.

For file uploads, I like the idea of asking the end user to specify the encoding they use, and showing them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).

I've read the other SO questions on the subject, but they all seem to have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" or, indeed, "You can't".

But there must be something that at least has a good try!

User · Answer

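    public function convertToUtf8($text) {
        // Fetch the page once and cache it (cURL() is the author's own helper).
        if (!$this->html) {
            $this->html = cURL("http://" . $this->url, array("timeout" => 15));
        }
        $html = $this->html;

        // Assumed pattern: group 1 is an optional quote, group 2 the charset name
        // taken from a <meta ... charset=...> declaration.
        preg_match('/<meta[^>]*charset\s*=\s*(["\']?)([\w-]+)/i', $html, $matches);
        $charset = isset($matches[2]) ? $matches[2] : null;

        if ($charset) {
            return mb_convert_encoding($text, 'UTF-8', $charset);
        } else {
            return $text;
        }
    }

    // cURL default options
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

I tried something like this, and it helped me. If charset info is found in a meta tag, I convert; otherwise I do nothing.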

User · Answer

"The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input."

I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify that the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.

If it's a file, you won't save it UTF-8 encoded into the database but in binary form. When you output the file again, use binary output as well; then this is totally transparent.

Your idea that a user can tell the encoding is nice, but he/she can tell anyway after downloading the file, as it's binary.

So I must admit I don't see a specific issue you raise with your question. But maybe you can add some more details about what your problem is.
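The validation step mentioned above could look roughly like the following. This is a minimal sketch, assuming the mbstring extension is available; the function name isValidUtf8 is made up for illustration.

```php
<?php
/**
 * Returns true when $input is a well-formed UTF-8 byte sequence.
 * Hypothetical helper; it only validates, it does not convert anything.
 */
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}
```

A form handler could reject (or log) any field for which this returns false instead of trying to guess what the bytes were meant to be.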

User · Answer

You could set up a set of metrics to try to guess which encoding is being used. Again, not perfect, but it could catch some of the misses from mb_detect_encoding.

User · Answer

There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.

My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:

1. Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
2. If encoding cannot be detected, throw new RuntimeException
3. If input is UTF-8, carry on.
4. Else, if it is ISO-8859-1 or ASCII
   a. Attempt conversion to UTF-8 (wait, not finished)
   b. Detect the encoding of the converted value
   c. If the reported encoding and converted value are both UTF-8, carry on.
   d. Else, throw new RuntimeException

From my abstract class Sanitizer:

    private function isUTF8($encoding, $value)
    {
        return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
    }

    private function utf8tify(&$value)
    {
        $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];

        mb_internal_encoding('UTF-8');
        mb_substitute_character(0xfffd); // REPLACEMENT CHARACTER
        mb_detect_order($encodings);

        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if (!$stringEncoding) {
            $value = null;
            throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
        }

        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        } else {
            $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
            $stringEncoding = mb_detect_encoding($value, $encodings, true);

            if ($this->isUTF8($stringEncoding, $value)) {
                return;
            } else {
                $value = null;
                throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
            }
        }

        return;
    }

One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more! I found this article: "What every programmer absolutely, positively needs to know about encodings and character sets to work with text".

Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 are dubious.

People have pointed out shortcomings in the PHP mb_* functions. I never took time to investigate iconv, but if it works better than mb_* functions, let me know.
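For comparison, here is a minimal standalone sketch of the same "accept UTF-8, otherwise convert from ISO-8859-1, otherwise fail" flow. It is not the author's class: it swaps the utf8_encode(utf8_decode()) round trip for mb_check_encoding(), and the function name toUtf8OrFail and the ISO-8859-1 fallback are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: returns $value as UTF-8 or throws.
 * Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
 */
function toUtf8OrFail(string $value): string
{
    // Already well-formed UTF-8 (this includes plain ASCII): nothing to do.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    // Fall back to the single legacy encoding this sketch accepts.
    $converted = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');

    // Defensive re-check of the converted value, mirroring steps b to d above.
    if (!mb_check_encoding($converted, 'UTF-8')) {
        throw new RuntimeException('Unable to convert input to UTF-8.');
    }

    return $converted;
}
```

Every byte sequence decodes as ISO-8859-1, so the fallback branch cannot notice a wrong guess; input that was really Windows-1252 or something else is converted without complaint and simply comes out wrong. That is the limitation the answer worries about.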

User · Answer

You've probably tried this too, but why not just use the mb_convert_encoding function? If you pass it "auto" or a list of candidate encodings as the source encoding, it will attempt to detect the character set of the text provided.

Also, I tried to run:

    $text = "fiancée";
    echo mb_convert_encoding($text, "UTF-8");
    echo "<br><br>";
    echo iconv(mb_detect_encoding($text), "UTF-8", $text);

and the results are the same for both. How do you see that your text is truncated to "fianc"? Is it in the DB or in a browser?

User · Answer

What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.

However, you could try doing this:

    iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

Setting it to strict might help you get a better result.
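Both mb_detect_encoding() and iconv() can return false, so the one-liner above is easier to use safely when wrapped. The following is a rough sketch only; the function name convertDetectedToUtf8 and the optional $candidates parameter are assumptions added for illustration.

```php
<?php
/**
 * Hypothetical wrapper around strict detection followed by iconv().
 * Returns the UTF-8 string, or null when detection or conversion fails.
 *
 * @param string[]|null $candidates Encodings to try; defaults to mb_detect_order().
 */
function convertDetectedToUtf8(string $text, ?array $candidates = null): ?string
{
    $from = mb_detect_encoding($text, $candidates ?? mb_detect_order(), true);
    if ($from === false) {
        return null; // none of the candidate encodings fits the bytes
    }

    // @ suppresses the notice iconv() raises on illegal input; we check the result instead.
    $converted = @iconv($from, 'UTF-8', $text);

    return $converted === false ? null : $converted;
}
```

Passing an explicit list such as ['UTF-8', 'ISO-8859-1'] makes the behaviour independent of the server's mbstring configuration, at the cost of accepting only those encodings.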

User · Answer

It seems that your question is pretty much answered, but I have an approach that may simplify your case.

I had a similar issue trying to return string data from MySQL, even after configuring both the database and PHP to return strings formatted as UTF-8. The only way I got the error was when actually returning them from the database.

Finally, sailing through the web, I found a really easy way to deal with it.

Given that you can save all those types of string data in your MySQL in different formats and collations, all you need to do is, right in your PHP connection file, set the connection character set to UTF-8, like this:

    $connection = new mysqli($server, $user, $pass, $db);
    $connection->set_charset("utf8");

Which means that first you save the data in any format or collation, and you convert it only on return to your PHP file.

Hope it was helpful!

User · Answer

If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)

enca also came up here: How to find encoding of a file in Unix via script(s)

User · Answer

If the text is retrieved from a MySQL database, you may try adding this after the DB connection:

    mysqli_set_charset($con, "utf8");

https://www.php.net/manual/en/mysqli.set-charset.php

User · Answer

There is no way to identify the charset of a string that is completely accurate. There are ways to try to guess the charset. One of these ways, and probably currently the best in PHP, is mb_detect_encoding(). This will scan your string and look for occurrences of stuff unique to certain charsets. Depending on your string, there may not be such distinguishable occurrences.

Take the ISO-8859-1 charset vs ISO-8859-15 (http://en.wikipedia.org/wiki/ISO/IEC_8859-15#Changes_from_ISO-8859-1).

There's only a handful of different characters, and to make it worse, they're represented by the same bytes. There is no way to detect, being given a string without knowing its encoding, whether byte 0xA4 is supposed to signify ¤ or € in your string, so there is no way to know its exact charset.

(Note: you could add a human factor, or an even more advanced scanning technique (e.g. what Oroboros102 suggests), to try to figure out from the surrounding context whether the character should be ¤ or €, though this seems like a bridge too far.)

There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you can and should never rely on it being correct.

Interesting read: http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-determine-the-charset-encoding-of-a-string

There are other ways of ensuring the correct charset though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen). That being done, at least you can be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the Unix 'file -i' command on it through e.g. exec() (if possible on your server) to aid the detection (using the document's BOM). Concerning scraping data, you could read the HTTP headers, which usually specify the charset. When parsing XML files, see if the XML meta-data contain a charset definition.

Rather than trying to automagically guess the charset, you should first try to ensure a certain charset yourself where possible, or try to grab a definition from the source you're getting it from (if applicable), before resorting to detection.
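The 0xA4 ambiguity described above is easy to reproduce. A small sketch, assuming the mbstring extension is loaded; the variable names are illustrative only:

```php
<?php
// The same single byte, interpreted under two different charsets.
$byte = "\xA4";

// ISO-8859-1 maps 0xA4 to U+00A4 (currency sign).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// ISO-8859-15 maps 0xA4 to U+20AC (euro sign).
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
```

Both conversions succeed, and nothing in the byte itself says which one was intended; only outside knowledge about the source can decide.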

User · Answer

There are a couple of libraries out there. onnov/detect-encoding looks promising. It claims to do better than mb_detect_encoding.

Example usage for converting a string in an unknown character encoding to UTF-8:

    use Onnov\DetectEncoding\EncodingDetector;

    // Instantiation and the $text argument are assumed here; check the library's README.
    $detector = new EncodingDetector();
    $detector->iconvXtoEncoding($text);

To simply detect the encoding:

    $encoding = $detector->getEncoding($text);

User · Answer

In motherland Russia we have 4 popular encodings, so your question is in great demand here.

You cannot detect an encoding from the char codes of symbols alone, because code pages intersect. Some code pages in different languages even intersect fully. So, we need another approach.

The only way to work with unknown encodings is working with probabilities. So, we do not want to answer the question "what is the encoding of this text?"; we are trying to understand "what is the most likely encoding of this text?".

One guy here in a popular Russian tech blog invented this approach:

Build the probability range of char codes in every encoding you want to support. You can build it using some big texts in your language (e.g. some fiction; use Shakespeare for English and Tolstoy for Russian, lol). You will get something like this:

    encoding_1:
    190 => 0.095249209893009,
    222 => 0.095249209893009,
    ...
    encoding_2:
    239 => 0.095249209893009,
    207 => 0.095249209893009,
    ...
    encoding_N:
    charcode => probability

Next, you take the text in an unknown encoding and, for every encoding in your "probability dictionary", you look up the frequency of every symbol of the unknown-encoded text. Sum the probabilities of the symbols. The encoding with the bigger rating is likely the winner. Results are better for bigger texts.

If you are interested, I can gladly help you with this task. We can greatly increase the accuracy by building a two-charcode probability list.

Btw, mb_detect_encoding certainly does not work. Yes, at all. Please take a look at the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".
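The probability approach described above can be sketched in a few lines. This is only an illustration under several assumptions: the function names (byteFrequencies, buildProfiles, guessEncoding) are made up, the training sample is one short sentence rather than a large corpus, the source file is saved as UTF-8, and only single-byte frequencies are used (not the two-charcode variant).

```php
<?php
/** Relative frequency of each non-ASCII byte value in $bytes (byte => probability). */
function byteFrequencies(string $bytes): array
{
    $counts = [];
    foreach (count_chars($bytes, 1) as $byte => $count) {
        if ($byte >= 0x80) { // ASCII bytes look the same in every candidate, so skip them
            $counts[$byte] = $count;
        }
    }
    $total = array_sum($counts);
    foreach ($counts as $byte => $count) {
        $counts[$byte] = $count / $total;
    }
    return $counts;
}

/** Encode one UTF-8 training sample into each candidate encoding and profile the bytes. */
function buildProfiles(string $utf8Sample, array $encodings): array
{
    $profiles = [];
    foreach ($encodings as $encoding) {
        $encoded = mb_convert_encoding($utf8Sample, $encoding, 'UTF-8');
        $profiles[$encoding] = byteFrequencies($encoded);
    }
    return $profiles;
}

/** Sum the profile probabilities of every byte in $bytes; the highest total wins. */
function guessEncoding(string $bytes, array $profiles): ?string
{
    $best = null;
    $bestScore = 0.0;
    foreach ($profiles as $encoding => $profile) {
        $score = 0.0;
        foreach (count_chars($bytes, 1) as $byte => $count) {
            $score += $count * ($profile[$byte] ?? 0.0);
        }
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best; // null when no byte matched any profile
}

// Usage: train on a lowercase Russian pangram, then classify unknown bytes.
$profiles = buildProfiles(
    'съешь ещё этих мягких французских булок, да выпей чаю',
    ['Windows-1251', 'KOI8-R']
);
$unknown = mb_convert_encoding('привет мир', 'KOI8-R', 'UTF-8');
echo guessEncoding($unknown, $profiles), "\n";
```

With a realistic corpus the profiles overlap much more than in this toy example, which is why the answer recommends large training texts and pairs of char codes.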
