Remove non-utf8 characters from string

Question

I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation). What is the best way to remove them? A regular expression or something else?

User · Answer

Try this:

    $string = iconv("UTF-8", "UTF-8//IGNORE", $string);

According to the iconv manual, the function takes the first parameter as the input charset, the second parameter as the output charset, and the third as the actual input string.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters in the input string that can't be represented by the output charset, thus filtering the input string in effect.
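As a minimal sketch, here is that call applied to the bytes from the question. The helper name strip_invalid_utf8_iconv is hypothetical, and the mbstring fallback is an assumption added here because some iconv builds are reported to return false on illegal input even with //IGNORE.

```php
<?php
// Hypothetical helper; not part of the answer above.
function strip_invalid_utf8_iconv(string $input): string
{
    // "@" hides the notice that some iconv builds raise on illegal input.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $input);
    if ($clean === false) {
        // Assumed fallback: mbstring substitutes invalid bytes
        // (with "?" unless configured otherwise) instead of failing.
        $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
    }
    return $clean;
}

$bytes = "\x97\x61\x6C\x6F";            // the bytes from the question
$clean = strip_invalid_utf8_iconv($bytes);
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Whether the stray 0x97 is dropped or replaced depends on which branch runs, so check the result on your own platform.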

User · Answer

Using a regex approach:

    $regex = <<<'END'
    /
      (
        (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
        |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
        |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
        |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
        ){1,100}                        # ...one or more times
      )
    | .                                 # anything else
    /x
    END;
    $text = preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. The replacement is whatever was captured into group 1, which effectively removes all invalid bytes.

It is possible to repair the string by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

    $regex = <<<'END'
    /
      (
        (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
        |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
        |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
        |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
        ){1,100}                      # ...one or more times
      )
    | ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
    | ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
    /x
    END;

    function utf8replacer($captures) {
      if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
      }
      elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2" . $captures[2];
      }
      else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3" . chr(ord($captures[3]) - 64);
      }
    }

    $text = preg_replace_callback($regex, "utf8replacer", $text);

EDIT:

- !empty(x) will match non-empty values ("0" is considered empty).
- x != "" will match non-empty values, including "0".
- x !== "" will match anything except "".

x != "" seems the best one to use in this case.

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.
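The byte arithmetic behind the repair can be checked in isolation. The sketch below uses a hypothetical helper, stray_byte_to_utf8, and assumes the stray byte is meant as an ISO-8859-1 character, which is what prefixing with 0xC2 or 0xC3 amounts to.

```php
<?php
// Hypothetical helper: re-encode one stray byte as the UTF-8 form of
// the ISO-8859-1 character with the same numeric value.
function stray_byte_to_utf8(string $byte): string
{
    $value = ord($byte);
    if ($value < 0x80) {
        return $byte;                   // ASCII is already valid UTF-8
    }
    if ($value < 0xC0) {
        return "\xC2" . $byte;          // 10xxxxxx -> 11000010 10xxxxxx
    }
    return "\xC3" . chr($value - 64);   // 11xxxxxx -> 11000011 10xxxxxx
}

echo bin2hex(stray_byte_to_utf8("\x97")), "\n"; // c297 (U+0097)
echo bin2hex(stray_byte_to_utf8("\xE9")), "\n"; // c3a9 (U+00E9)
```

If the stray bytes actually come from Windows-1252 or another charset, the repaired characters will not be the intended ones, which is the "strange symbols" risk mentioned above.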

User · Answer

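I have made a function that deletes invalid UTF-8 characters from a string. I'm using it to clean the descriptions of 27000 products before generating the XML export file.

    public function stripInvalidXml($value) {
        $ret = "";
        if ($value === null || $value === "") {
            return $ret;
        }
        $length = strlen($value);
        for ($i = 0; $i < $length; $i++) {
            // ord() returns one byte (0-255), so the ranges above 0xFF
            // below never match; the test works per byte, not per code point.
            $current = ord($value[$i]);
            if (($current == 0x9) || ($current == 0xA) || ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF))) {
                $ret .= chr($current);
            }
            else {
                $ret .= "";
            }
        }
        return $ret;
    }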
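The ranges tested above are code point ranges, so one way to apply them per character rather than per byte is a regex with the u modifier. This is only a sketch: the helper name strip_invalid_xml_chars is hypothetical, and it assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper; assumes $value is already valid UTF-8.
function strip_invalid_xml_chars(string $value): string
{
    // Keep tab, LF, CR and the three ranges from the function above.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    // With /u, preg_replace() returns null if $value is not valid UTF-8.
    return preg_replace($pattern, '', $value) ?? '';
}

echo strip_invalid_xml_chars("a\x01b\tc"), "\n"; // "ab", a tab, then "c"
```

Because the null case is mapped to an empty string, invalid input is discarded entirely here; combine it with one of the byte-level cleaners from the other answers if that is too strict.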

User · Answer

So the rules are that the first UTF-8 octet has the high bit set as a marker, followed by 1 to 3 more set bits that indicate how many additional octets follow; each of the additional octets must have its high two bits set to 10.

The pseudo-python would be:

    newstring = ''
    cont = 0
    for each ch in string:
      if cont:
        if (ch >> 6) != 2:   # high 2 bits are not 10
          # do whatever, e.g. skip it, or skip the whole code point, or?
        else:
          # acceptable continuation of a multi-octet char
          newstring += ch
        cont -= 1
      else:
        if (ch >> 7):   # high bit set?
          c = (ch << 1) & 0xFF   # strip the high-bit marker
          while (c & 0x80):   # while the high bit indicates another octet
            c = (c << 1) & 0xFF
            cont += 1
            if cont > 3:
              # more than 4 octets in total is not allowed; cope with the error
          if not cont:
            # illegal (stray continuation octet); do something sensible
        newstring += ch   # or whatever
    if cont:
      # last UTF-8 sequence was not terminated; cope

The same logic should be translatable to PHP. However, it's not clear what kind of stripping should be done once you hit a malformed character.
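One possible PHP rendering of that state machine is sketched below. The function name is hypothetical, and the error policy is an assumption, since the answer leaves it open: a malformed or unterminated sequence is dropped whole. Like the pseudocode, it checks only the bit patterns and does not reject overlong forms or surrogates.

```php
<?php
// Hypothetical helper: drop malformed UTF-8 sequences byte by byte.
function strip_malformed_utf8(string $input): string
{
    $out = '';
    $pending = '';   // bytes of the sequence currently being assembled
    $cont = 0;       // continuation bytes still expected
    $len = strlen($input);
    for ($i = 0; $i < $len; $i++) {
        $ch = $input[$i];
        $b = ord($ch);
        if ($cont > 0) {
            if (($b >> 6) === 0b10) {        // valid continuation byte
                $pending .= $ch;
                if (--$cont === 0) {
                    $out .= $pending;        // sequence complete
                    $pending = '';
                }
                continue;
            }
            // Sequence broke off: drop it, then treat $ch as a fresh start.
            $pending = '';
            $cont = 0;
        }
        if ($b < 0x80) {
            $out .= $ch;                     // plain ASCII
        } elseif (($b >> 5) === 0b110) {
            $pending = $ch; $cont = 1;       // 110xxxxx: 1 more byte
        } elseif (($b >> 4) === 0b1110) {
            $pending = $ch; $cont = 2;       // 1110xxxx: 2 more bytes
        } elseif (($b >> 3) === 0b11110) {
            $pending = $ch; $cont = 3;       // 11110xxx: 3 more bytes
        }
        // Anything else (stray 10xxxxxx or 11111xxx) is skipped.
    }
    return $out;   // an unterminated trailing sequence is dropped
}

echo strip_malformed_utf8("\x97\x61\x6C\x6F"), "\n"; // alo
```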

User · Answer

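Maybe not the most precise solution, but it gets the job done with a single line of code:

    echo str_replace("?", "", utf8_decode($str));

utf8_decode() converts the characters it cannot handle to a question mark; str_replace() then strips out the question marks. Note that the result is ISO-8859-1 rather than UTF-8, and that genuine question marks in the text are removed as well.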

User · Answer

    $text = iconv("UTF-8", "UTF-8//IGNORE", $text);

This is what I am using. Seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8

User · Answer

substr() can break your multi-byte characters.

In my case, I was using substr($string, 0, 255) to ensure a user-supplied value would fit in the database. On occasion it would split a multi-byte character in half and cause database errors with "Incorrect string value".

You could use mb_substr($string, 0, 255), and it might be OK for MySQL 5, but MySQL 4 counts bytes instead of characters, so the value could still be too long depending on the number of multi-byte characters.

To prevent these issues I implemented the following steps:

1. I increased the size of the field (in this case it was a log of changes, so preventing the longer input was not an option).
2. I still did an mb_substr() in case it was still too long.
3. I used the accepted answer above by @Markus Jarderot, so that if a really long entry has a multi-byte character right at the length limit, the dangling half of that character is stripped from the end.
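When the limit is counted in bytes, mbstring's mb_strcut() is one way to truncate without splitting a character. A minimal sketch, assuming the mbstring extension is available; the helper name truncate_utf8_bytes is hypothetical.

```php
<?php
// Hypothetical helper: cap a UTF-8 string at $maxBytes bytes without
// leaving half of a multi-byte character at the end.
function truncate_utf8_bytes(string $text, int $maxBytes): string
{
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

$text = "a\xC3\xA9";   // "a" followed by e-acute, 3 bytes in total

// substr() cuts between the two bytes of the accented letter.
var_dump(mb_check_encoding(substr($text, 0, 2), 'UTF-8')); // bool(false)

// mb_strcut() backs off to the previous character boundary.
echo truncate_utf8_bytes($text, 2), "\n"; // a
```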

User · Answer

    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

User · Answer

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing those encodings in the same string.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::toUTF8($mixed_string);

    $latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled as a result of having been encoded into UTF8 multiple times.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃƒÂ©dÃ©ration Camerounaise de Football");

will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8

User · Answer

Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.

If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8') you will still end up with non-printable chars in your string.

This method will:

- Remove all invalid UTF-8 multibyte chars with mb_convert_encoding
- Remove all non-printable chars like \r, \x00 (NULL-byte) and other control chars with preg_replace

Method:

    function utf8_filter(string $value): string {
        return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
    }

[:print:] matches all printable chars; together with \n it keeps newlines and strips everything else.

In the ASCII table, the printable chars range from 32 to 126, but newline \n is one of the control chars, which range from 0 to 31, so we have to add newline to the regex: /[^[:print:]\n]/u

You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped. With utf8_filter() defined as above:

    $arr = [
        'Danish chars'        => 'Hello from Denmark with æøå',
        'Non-printable chars' => "\x7FHello with invalid chars\r \x00"
    ];

    foreach ($arr as $k => $v) {
        echo "$k\n---------\n";

        $len = strlen($v);
        echo "$v\n(" . $len . ")\n";

        $strip = utf8_decode(utf8_filter(utf8_encode($v)));
        $strip_len = strlen($strip);
        echo $strip . "\n(" . $strip_len . ")\n\n";

        echo "Chars removed: " . ($len - $strip_len) . "\n\n\n";
    }

https://www.tehplayground.com/q5sJ3FOddhv1atpR

User · Answer

From a recent patch to Drupal's Feeds JSON parser module:

    //remove everything except valid letters (from any language)
    $raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);

If you're concerned: yes, it retains spaces as valid characters.

Did what I needed. It removes the emoji characters that are widespread nowadays, which don't fit into MySQL's 'utf8' character set and which gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

For details see https://www.drupal.org/node/1824506#comment-6881382

User · Answer

    static $preg = <<<'END'
    %(
    [\x09\x0A\x0D\x20-\x7E]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )%xs
    END;
    if (preg_match_all($preg, $string, $match)) {
        $string = implode('', $match[0]);
    } else {
        $string = '';
    }

It works on our service.

User · Answer

This function removes all non-ASCII characters; it's useful, but it does not solve the question. This is my function that always works, regardless of encoding:

    function remove_bs($Str) {
      $StrArr = str_split($Str);
      $NewStr = '';
      foreach ($StrArr as $Char) {
        $CharNo = ord($Char);
        if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £ (byte 163 in ISO-8859-1)
        if ($CharNo > 31 && $CharNo < 127) {
          $NewStr .= $Char;
        }
      }
      return $NewStr;
    }

How it works:

    echo remove_bs('Hello õhowå åare youÆ?'); // Hello how are you?

User · Answer

You can use mbstring:

    $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

...will remove invalid characters.

See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
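What happens to an invalid byte depends on mbstring's substitute character setting, which is normally "?" rather than removal. A minimal sketch, assuming the mbstring extension is available, that switches the setting to 'none' for the duration of the call; the helper name strip_invalid_utf8_mb is hypothetical.

```php
<?php
// Hypothetical helper: drop invalid bytes instead of substituting them.
function strip_invalid_utf8_mb(string $text): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // invalid input is removed
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore for other callers
    return $clean;
}

echo strip_invalid_utf8_mb("a\x97b"), "\n"; // ab
```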

User · Answer

Slightly different to the question, but what I am doing is to use HtmlEncode(string):

    // pseudo code here
    var encoded = HtmlEncode(string);
    encoded = Regex.Replace(encoded, "&#\d+?;", "");
    var result = HtmlDecode(encoded);

Input and output:

    "Headlight\x007E Bracket, &#123; Cafe Racer<> Style, Stainless Steel"
    "Headlight~ Bracket, &#123; Cafe Racer<> Style, Stainless Steel"

I know it's not perfect, but it does the job for me.

User · Answer

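To remove all Unicode characters outside of the Unicode Basic Multilingual Plane (this needs the u modifier and the \x{...} escape form, otherwise the pattern is matched byte by byte):

    $str = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str);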
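A short sketch of stripping everything above U+FFFF, using an emoji as the test character. The helper name strip_non_bmp is hypothetical, and the input is assumed to be valid UTF-8, because preg_replace() returns null otherwise when the u modifier is set.

```php
<?php
// Hypothetical helper: keep only Basic Multilingual Plane characters.
function strip_non_bmp(string $text): string
{
    // \x{...} with /u addresses code points rather than bytes.
    return preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $text) ?? '';
}

echo strip_non_bmp("ok \u{1F600}!"), "\n"; // ok !
```

Accented letters and other characters below U+10000 pass through unchanged; only four-byte sequences are removed.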

User · Answer

How about iconv():

http://php.net/manual/en/function.iconv.php

I haven't used it inside PHP itself, but it has always performed well for me on the command line. You can get it to substitute invalid characters.

User · Answer

UConverter can be used since PHP 5.5. UConverter is the better choice if you use the intl extension and don't use mbstring.

    function replace_invalid_byte_sequence($str)
    {
        return UConverter::transcode($str, 'UTF-8', 'UTF-8');
    }

    function replace_invalid_byte_sequence2($str)
    {
        return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
    }

htmlspecialchars() can be used to remove invalid byte sequences since PHP 5.4. It is better than preg_match() both for handling large inputs and for accuracy; a lot of incorrect regular-expression implementations can be seen.

    function replace_invalid_byte_sequence3($str)
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
    }
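With ENT_SUBSTITUTE, the htmlspecialchars() round trip replaces each invalid sequence with U+FFFD rather than deleting it. The sketch below adds a removal step on top; the helper name is hypothetical, and stripping U+FFFD afterwards is an assumption about what the caller wants, since it also removes any replacement characters that were already in the input.

```php
<?php
// Hypothetical helper built on the htmlspecialchars() round trip.
function strip_via_substitute(string $str): string
{
    // Invalid byte sequences become U+FFFD during encoding.
    $marked = htmlspecialchars_decode(
        htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')
    );
    // Assumed extra step: drop the replacement characters entirely.
    return str_replace("\u{FFFD}", '', $marked);
}

echo strip_via_substitute("a\x97b"), "\n"; // ab
```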

User · Answer

The text may contain non-UTF-8 characters. Try doing this first:

    $nonutf8 = mb_convert_encoding($nonutf8, 'UTF-8', 'UTF-8');

You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php
