PHP: How to remove all non-printable characters in a string?

Question

I imagine I need to remove chars 0-31 and 127. Is there a function or piece of code to do this efficiently?

User · Answer

The answer of @PaulDixon originally removed the printable extended ASCII characters 128-255; it has since been partially corrected. I don't know why he still wants to delete 128-255 from a 7-bit ASCII set, which has only 128 characters and no extended range.

It is important not to delete 128-255. For example, chr(128) (\x80) is the euro sign in Windows-1252 (often loosely called 8-bit ASCII), and in my own tests many fonts on Windows and Android displayed it as a euro sign.

Removing bytes 128-255 from a UTF-8 string will also destroy many characters, because every byte of a multi-byte UTF-8 character lies in that range. So don't do that! They are completely legal characters in all currently used file systems. The only reserved range is 0-31.

Instead, use this to delete the non-printable characters 0-31 and 127:

    $string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);

It works in ASCII and UTF-8 because both share the same control set range.

A slower[1] alternative without using regular expressions:

    $string = str_replace(array(
        // control characters
        chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
        chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
        chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
        chr(31),
        // non-printing characters
        chr(127)
    ), '', $string);

If you want to keep the whitespace characters \t, \n and \r, then remove chr(9), chr(10) and chr(13) from this list. Note: the usual space is chr(32), so it stays in the result. Decide for yourself whether you want to remove the non-breaking space chr(160), as it can cause problems. (That only applies to single-byte encodings such as ISO-8859-1; in UTF-8, byte 160 occurs inside multi-byte characters and must not be stripped on its own.)

[1] Tested by @PaulDixon and verified by myself.
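As a quick check of the claim that the byte-wise pattern is safe for UTF-8, here is a minimal sketch. The function name strip_control_bytes is made up for illustration; only the pattern comes from the answer above.

```php
<?php
// Hypothetical helper name; it only wraps the pattern from the answer above.
function strip_control_bytes(string $s): string
{
    // Remove bytes 0-31 and 127; bytes 128-255 are left untouched.
    return preg_replace('/[\x00-\x1F\x7F]/', '', $s);
}

// "café" encoded as UTF-8, followed by NUL, a space, BEL, "ok", DEL and LF.
$input = "caf\xC3\xA9\x00 \x07ok\x7F\n";
$clean = strip_control_bytes($input);

var_dump($clean === "caf\xC3\xA9 ok");     // bool(true)
var_dump(preg_match('//u', $clean) === 1); // bool(true): the result is still valid UTF-8
```

The empty pattern with the /u modifier is used here only as a cheap UTF-8 validity check, so the example does not depend on the mbstring extension.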

User · Answer

This is simpler:

    $string = preg_replace('/[[:cntrl:]]/', '', $string);
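A small sketch of what the POSIX class does in practice, assuming the default "C" locale and no /u modifier, in which case [:cntrl:] covers bytes 0-31 and 127. The second pattern is an illustrative variation, not part of the answer, for when line breaks should survive.

```php
<?php
$input = "a\x00b\tc\x7Fd\r\n";

// Strip every control byte, including tab, CR and LF.
$clean = preg_replace('/[[:cntrl:]]/', '', $input);
var_dump($clean === "abcd"); // bool(true)

// Variation: a negative lookahead exempts CR and LF from removal.
$keepBreaks = preg_replace('/(?![\r\n])[[:cntrl:]]/', '', $input);
var_dump($keepBreaks === "abcd\r\n"); // bool(true)
```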

User · Answer

    preg_replace('/(?!\n)[\p{Cc}]/', '', $response);

This will remove all the control characters (http://uk.php.net/manual/en/regexp.reference.unicode.php), leaving the \n newline characters. In my experience, the control characters are the ones that most often cause printing issues.
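The pattern above has no /u modifier, so it is applied byte by byte. For UTF-8 input it is probably safer to add /u so that \p{Cc} is tested against whole code points. The sketch below assumes UTF-8 input; the helper name and the exception are illustrative choices, not part of the answer.

```php
<?php
// Hypothetical helper: strip Unicode control characters (category Cc) but keep "\n".
function strip_controls_keep_newline(string $s): string
{
    // (?!\n) rejects a newline before \p{Cc} can match it; /u treats the input as UTF-8.
    $out = preg_replace('/(?!\n)\p{Cc}/u', '', $s);
    if ($out === null) {
        // preg_replace returns null on error, e.g. when $s is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $out;
}

// \u{C0} is "À" (UTF-8 bytes C3 80); its second byte must not be mistaken for a control.
$in = "line one\r\n\u{C0} la carte\t\x1B[0m";
var_dump(strip_controls_keep_newline($in) === "line one\n\u{C0} la carte[0m"); // bool(true)
```

Carriage return, tab and ESC are removed, while the newline and the accented letter are kept.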

User · Answer

Many of the other answers here do not take Unicode characters into account. In this case you can use the following:

    $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $string);

There's a strange class of characters in the range \x80-\x9F (just above the 7-bit ASCII range) that are technically control characters, but over time have been misused for printable characters. If you don't have any problems with these, then you can use:

    $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $string);

If you wish to also strip line feeds, carriage returns, tabs, non-breaking spaces, and soft hyphens, you can use:

    $string = preg_replace('/[\x00-\x1F\x7F-\xA0\xAD]/u', '', $string);

Note that you must use single quotes for the above examples. In double quotes PHP itself would convert escapes such as \x9F into raw bytes, which are not valid UTF-8 inside a /u pattern.

If you wish to strip everything except basic printable ASCII characters (all non-ASCII characters will be stripped as well), you can use:

    $string = preg_replace('/[^[:print:]]/', '', $string);

For reference see http://www.fileformat.info/info/charset/UTF-8/list.htm
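To make the difference between these patterns concrete, here is a small sketch run against one sample string. The sample text is invented; code points are written as \u{...} escapes (PHP 7.0+) so the result does not depend on the file encoding.

```php
<?php
// "naïve", NO-BREAK SPACE, "co", SOFT HYPHEN, "op", NEXT LINE (U+0085), tab, "done", LF.
$input = "na\u{EF}ve\u{A0}co\u{AD}op\u{85}\tdone\n";

// First pattern: drops C0 and C1 controls but keeps tab, LF and CR.
$a = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $input);
var_dump($a === "na\u{EF}ve\u{A0}co\u{AD}op\tdone\n"); // bool(true): only U+0085 removed

// Third pattern: additionally drops tab, LF, CR, NBSP and soft hyphen.
$b = preg_replace('/[\x00-\x1F\x7F-\xA0\xAD]/u', '', $input);
var_dump($b === "na\u{EF}vecoopdone"); // bool(true)
```

With /u in effect, \xA0 in the character class means the code point U+00A0, not the single byte 0xA0, which is why the letter ï (U+00EF) survives both patterns.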

User · Answer

Starting with PHP 5.2, we also have access to filter_var, which I have not seen any mention of, so I thought I'd throw it out there.

To use filter_var to strip non-printable characters (< 32 and > 127), you can do:

Filter ASCII characters below 32:

    $string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

Filter ASCII characters above 127:

    $string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

Strip both:

    $string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);

You can also HTML-encode low characters (newline, tab, etc.) while stripping high:

    $string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW|FILTER_FLAG_STRIP_HIGH);

There are also options for stripping HTML, sanitizing e-mails and URLs, etc. So there are lots of options for sanitization (strip out data) and even validation (return false if not valid rather than silently stripping).

Sanitization: http://php.net/manual/en/filter.filters.sanitize.php
Validation: http://php.net/manual/en/filter.filters.validate.php

However, there is still the problem that FILTER_FLAG_STRIP_LOW will strip out newlines and carriage returns, which for a textarea are completely valid characters, so some of the regex answers are, I guess, still necessary at times. For example, after reviewing this thread, I plan to do this for textareas:

    $string = preg_replace('/[^[:print:]\r\n]/', '', $input);

This seems more readable than a number of the regexes that strip by numeric range.
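A short sketch of how these flags behave on sample input (the sample strings are invented). One detail worth noticing: FILTER_FLAG_STRIP_LOW only covers values below 32, so DEL (127) is left in place.

```php
<?php
// STRIP_LOW removes tab, CR and LF but keeps DEL (127).
$low = filter_var("Tab\there\r\nDEL\x7F", FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
var_dump($low === "TabhereDEL\x7F"); // bool(true)

// Both flags: the two bytes of UTF-8 "é" and the newline are removed.
$both = filter_var(
    "caf\xC3\xA9\n",
    FILTER_UNSAFE_RAW,
    FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH
);
var_dump($both === "caf"); // bool(true)

// The textarea regex from the answer keeps CR and LF but drops tab and DEL.
$textarea = preg_replace('/[^[:print:]\r\n]/', '', "Tab\there\r\nDEL\x7F");
var_dump($textarea === "Tabhere\r\nDEL"); // bool(true)
```

Because FILTER_FLAG_STRIP_HIGH works on bytes, it removes every non-ASCII character from UTF-8 text, so it is probably only appropriate when plain ASCII output is the goal.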

User · Answer

For anyone who is still looking for how to do this without removing the non-printable characters, but rather escaping them, I made this to help out. Feel free to improve it. Characters are escaped to \\x followed by two hex digits ([A-F0-9][A-F0-9]).

Call like so:

    $escaped = EscapeNonASCII($string);
    $unescaped = UnescapeNonASCII($string);

    <?php
    function EscapeNonASCII($string) // Convert string to hex, replace non-printable chars with escaped hex
    {
        $hexbytes = strtoupper(bin2hex($string));
        $i = 0;
        while ($i < strlen($hexbytes))
        {
            $hexpair = substr($hexbytes, $i, 2);
            $decimal = hexdec($hexpair);
            if ($decimal < 32 || $decimal > 126)
            {
                $top = substr($hexbytes, 0, $i);
                $escaped = EscapeHex($hexpair);
                $bottom = substr($hexbytes, $i + 2);
                $hexbytes = $top . $escaped . $bottom;
                $i += 8; // skip the rest of the 10-digit escape sequence just inserted
            }
            $i += 2;
        }
        $string = hex2bin($hexbytes);
        return $string;
    }

    function EscapeHex($string) // Helper function for EscapeNonASCII()
    {
        $x = "5C5C78"; // \\x
        $topnibble = bin2hex($string[0]); // Convert top nibble to hex
        $bottomnibble = bin2hex($string[1]); // Convert bottom nibble to hex
        $escaped = $x . $topnibble . $bottomnibble; // Concatenate escape sequence "\\x" with top and bottom nibble
        return $escaped;
    }

    function UnescapeNonASCII($string) // Convert string to hex, replace escaped hex with actual hex
    {
        $stringtohex = bin2hex($string);
        $stringtohex = preg_replace_callback('/5c5c78([a-fA-F0-9]{4})/', function ($m) {
            return hex2bin($m[1]);
        }, $stringtohex);
        return hex2bin(strtoupper($stringtohex));
    }
    ?>
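For comparison, here is a shorter sketch of the same idea that works on the string directly rather than on its hex dump. This is not the answer's implementation: the function names are hypothetical, and it swaps the hex-string manipulation for preg_replace_callback while producing the same \\xHH format.

```php
<?php
// Hypothetical helpers, not the functions from the answer above.
function escape_non_printable(string $s): string
{
    // Any byte outside printable ASCII (32-126) becomes \\x plus two uppercase hex digits.
    return preg_replace_callback('/[^\x20-\x7E]/', function (array $m): string {
        return sprintf('\\\\x%02X', ord($m[0]));
    }, $s);
}

function unescape_non_printable(string $s): string
{
    // The regex matches two literal backslashes, "x", and two hex digits.
    return preg_replace_callback('/\\\\\\\\x([0-9A-Fa-f]{2})/', function (array $m): string {
        return chr(hexdec($m[1]));
    }, $s);
}

$raw = "tab\there, caf\xC3\xA9\n";
$escaped = escape_non_printable($raw);
echo $escaped, "\n"; // tab\\x09here, caf\\xC3\\xA9\\x0A
var_dump(unescape_non_printable($escaped) === $raw); // bool(true)
```

Like the original, this sketch does not escape backslashes themselves, so input that already contains a literal \\x41 sequence will not round-trip unchanged.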

User · Answer

I solved the problem for UTF-8 using https://github.com/neitanod/forceutf8:

    use ForceUTF8\Encoding;

    $string = Encoding::fixUTF8($string);

User · Answer

You could use a regular expression to remove everything apart from those characters you wish to keep:

    $string = preg_replace('/[^A-Za-z0-9 _\-\+\&]/', '', $string);

This replaces everything that is not (^) a letter A-Z or a-z, a digit 0-9, space, underscore, hyphen, plus or ampersand with nothing (i.e. removes it).

User · Answer

7 bit ASCII?

If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:

    $string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);

It matches anything in range 0-31, 127-255 and removes it.

8 bit extended ASCII?

You fell into a Hot Tub Time Machine, and you're back in the eighties. If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment - just look for 0-31 and 127:

    $string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);

UTF-8?

Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex:

    $string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);

This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted in mgutt's answer). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...

If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0).

In a UTF-8 string, this would be encoded as 0xC2A0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:

    $string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);

Addendum: What about str_replace?

preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of chars you want to remove, and use str_replace as noted in mgutt's answer, e.g.

    // build an array we can re-use across several operations
    $badchar = array(
        // control characters
        chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
        chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
        chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
        chr(31),
        // non-printing characters
        chr(127)
    );

    // replace the unwanted chars
    $str2 = str_replace($badchar, '', $str);

Intuitively, this seems like it would be fast, but it's not always the case; you should definitely benchmark to see if it saves you anything. I did some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:

        2 chars  str_replace     5.3439ms  preg_replace     2.9919ms  preg_replace is 44.01% faster
        4 chars  str_replace     6.0701ms  preg_replace     1.4119ms  preg_replace is 76.74% faster
        8 chars  str_replace     5.8119ms  preg_replace     2.0721ms  preg_replace is 64.35% faster
       16 chars  str_replace     6.0401ms  preg_replace     2.1980ms  preg_replace is 63.61% faster
       32 chars  str_replace     6.0320ms  preg_replace     2.6770ms  preg_replace is 55.62% faster
       64 chars  str_replace     7.4198ms  preg_replace     4.4160ms  preg_replace is 40.48% faster
      128 chars  str_replace    12.7239ms  preg_replace     7.5412ms  preg_replace is 40.73% faster
      256 chars  str_replace    19.8820ms  preg_replace    17.1330ms  preg_replace is 13.83% faster
      512 chars  str_replace    34.3399ms  preg_replace    34.0221ms  preg_replace is  0.93% faster
     1024 chars  str_replace    57.1141ms  preg_replace    67.0300ms  str_replace  is 14.79% faster
     2048 chars  str_replace    94.7111ms  preg_replace   123.3189ms  str_replace  is 23.20% faster
     4096 chars  str_replace   227.7029ms  preg_replace   258.3771ms  str_replace  is 11.87% faster
     8192 chars  str_replace   506.3410ms  preg_replace   555.6269ms  str_replace  is  8.87% faster
    16384 chars  str_replace  1116.8811ms  preg_replace  1098.0589ms  preg_replace is  1.69% faster
    32768 chars  str_replace  2299.3128ms  preg_replace  2222.8632ms  preg_replace is  3.32% faster

The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8kb range, str_replace had a marginal edge.

I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
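To follow the advice about benchmarking against your own data, a minimal timing harness might look like the sketch below. It is not the script that produced the figures above; the helper name time_ms, the chosen lengths and the iteration count are all assumptions, and random_bytes requires PHP 7.0 or later.

```php
<?php
// Characters to remove: 0-31 and 127, as in the answer above.
$badchars = array_map('chr', array_merge(range(0, 31), [127]));

// Hypothetical helper: run $fn $iterations times and return elapsed milliseconds.
function time_ms(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000;
}

$iterations = 1000;
foreach ([16, 512, 8192] as $len) {
    // Random data stands in for real input; substitute a sample of your own strings.
    $str = random_bytes($len);

    $tStr = time_ms(function () use ($str, $badchars) {
        str_replace($badchars, '', $str);
    }, $iterations);

    $tPreg = time_ms(function () use ($str) {
        preg_replace('/[\x00-\x1F\x7F]/', '', $str);
    }, $iterations);

    printf("%6d chars  str_replace %10.4fms  preg_replace %10.4fms\n", $len, $tStr, $tPreg);
}
```

The absolute numbers will vary with PHP version, hardware and input, so only the relative difference on your own data is meaningful.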

User · Answer

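The marked answer is perfect, but it misses character 127 (DEL), which is also a non-printable character. My answer would be:

    $string = preg_replace('/[\x00-\x1F\x7f-\xFF]/', '', $string);

Note that \x7f-\xFF also removes bytes 128-255, so this is only suitable when 7-bit ASCII output is wanted.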

User · Answer

@cedivad solved the issue for me, with Swedish characters preserved in the result:

    $text = preg_replace('/[^\p{L}\s]/u', '', $text);

Thanks!

User · Answer

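How about a whitelist:

    return preg_replace("/[^a-zA-Z0-9 … <> … \-\s … ]+/", "", $data);

where … stands for the punctuation characters you want to allow. This gives me complete control of what I want to include.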

User · Answer

To strip all non-ASCII characters from the input string:

    $result = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);

That code removes any characters in the decimal ranges 0-31 and 128-255, leaving only the characters 32-127 in the resulting string, which I call $result in this example. Note that DEL (127) is kept.

User · Answer

My UTF-8 compliant version:

    preg_replace('/[^\p{L}\s]/u', '', $value);
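One thing to keep in mind with this pattern is that it keeps only letters and whitespace, so digits and punctuation disappear along with the control characters. The sketch below shows that on an invented sample string, plus a wider variant (an assumption, not from the answer) that also allows the Unicode number and punctuation categories.

```php
<?php
// "Smörgåsbord 2024," then BEL, then " tack!"; \u{...} escapes avoid file-encoding issues.
$input = "Sm\u{F6}rg\u{E5}sbord 2024,\x07 tack!";

// Pattern from the answer: letters and whitespace only.
$lettersOnly = preg_replace('/[^\p{L}\s]/u', '', $input);
var_dump($lettersOnly === "Sm\u{F6}rg\u{E5}sbord  tack"); // bool(true): digits and punctuation gone

// Wider variant: also keep numbers (\p{N}) and punctuation (\p{P}).
$wider = preg_replace('/[^\p{L}\p{N}\p{P}\s]/u', '', $input);
var_dump($wider === "Sm\u{F6}rg\u{E5}sbord 2024, tack!"); // bool(true): only BEL removed
```

Symbols such as + or $ belong to the \p{S} category, so they would still be removed by the wider variant unless that category is added too.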

User · Answer

The regex in the selected answer fails for Unicode (0x1d) with PHP 7.4. A solution:

    <?php
    $ct = 'différents'."\r\n test";

    // fails for Unicode: 0x1d
    $ct = preg_replace('/[\x00-\x1F\x7F]$/u', '', $ct);

    // works for Unicode: 0x1d
    $ct = preg_replace('/[^\P{C}]+/u', "", $ct);

    // works for Unicode: 0x1d and allows line breaks
    $ct = preg_replace('/[^\P{C}\n]+/u', "", $ct);

    echo $ct;

From: "UTF 8 String remove all invisible characters except newline".
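The double negative in [^\P{C}\n] can be hard to read: it matches any character that is in the Unicode "Other" category (C) and is not a newline. A small sketch with an invented sample string that actually contains 0x1D and a zero-width space:

```php
<?php
// "différents", GROUP SEPARATOR (0x1D), CR, LF, " test", ZERO WIDTH SPACE (U+200B).
$ct = "diff\u{E9}rents\x1D\r\n test\u{200B}";

// Remove every character of category C (control, format, unassigned, ...) except "\n".
$clean = preg_replace('/[^\P{C}\n]+/u', '', $ct);

var_dump($clean === "diff\u{E9}rents\n test"); // bool(true)
```

Category C also includes private-use and unassigned code points, so very new characters that the installed PCRE library does not know yet may be stripped as well.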

User · Answer

All of the solutions work partially, and even the one below probably does not cover all of the cases. My issue was in trying to insert a string into a utf8 MySQL table. The string (and its bytes) all conformed to UTF-8, but had several bad sequences. I assume that most of them were control or formatting characters.

    function clean_string($string) {
        $s = trim($string);
        $s = iconv("UTF-8", "UTF-8//IGNORE", $s); // drop all non utf-8 characters

        // this is some bad utf-8 byte sequence that makes mysql complain - control and formatting i think
        $s = preg_replace('/(?>[\x00-\x1F]|\xC2[\x80-\x9F]|\xE2[\x80-\x8F]{2}|\xE2\x80[\xA4-\xA8]|\xE2\x81[\x9F-\xAF])/', ' ', $s);

        $s = preg_replace('/\s+/', ' ', $s); // reduce all multiple whitespace to a single space

        return $s;
    }

To further exacerbate the problem, there is the table vs. server vs. connection vs. rendering of the content, as talked about a little here.

User · Answer

You can use character classes:

    /[[:cntrl:]]+/
