PHP: How to remove all non-printable characters in a string?

I imagine I need to remove chars 0-31 and 127.

Is there a function or piece of code to do this efficiently?


The answer is


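$response = preg_replace('/(?!\n)\p{Cc}/u', '', $response);

This removes all control characters (Unicode category Cc, see http://uk.php.net/manual/en/regexp.reference.unicode.php) while leaving the \n newline characters in place. The u modifier makes the pattern treat the string as UTF-8; without it, \p{Cc} is tested byte by byte and also strips the bytes 0x80-0x9F out of multi-byte characters. In my experience, control characters are the ones that most often cause printing issues.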
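A minimal sketch of the approach above, assuming the input is valid UTF-8. The helper name strip_controls_keep_newline is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: remove Unicode control characters (category Cc)
// but keep "\n". Assumes $text is valid UTF-8.
function strip_controls_keep_newline(string $text): string
{
    $clean = preg_replace('/(?!\n)\p{Cc}/u', '', $text);

    // preg_replace() returns null on error (for example invalid UTF-8);
    // fall back to the untouched input in that case.
    return $clean ?? $text;
}

// Tab, BEL and carriage return are removed; "é" and "\n" survive.
var_dump(strip_controls_keep_newline("caf\u{E9}\t\x07 ok\r\n") === "caf\u{E9} ok\n");
```

Falling back to the original string on error is only one possible choice; depending on the application it may be safer to reject the input instead.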


The answer of @PaulDixon was completely wrong, because it removed the printable extended ASCII characters 128-255! It has been partially corrected. I don't know why he still wants to delete 128-255 from the 7-bit ASCII set, which has only 128 characters and therefore no extended ASCII characters at all.

In the end it is important not to delete 128-255, because for example chr(128) (\x80) is the euro sign in Windows-1252 (often loosely called 8-bit ASCII), and in my own tests many fonts on Windows and on Android display it as a euro sign.

Removing the bytes 128-255 from a UTF-8 string also destroys many characters, because every byte of a multi-byte UTF-8 character (lead and continuation bytes alike) lies in that range. So don't do that! These are completely legal characters in all currently used file systems. The only reserved range is 0-31.

Instead use this to delete the non-printable characters 0-31 and 127:

$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);

It works in ASCII and UTF-8 because both share the same control set range.
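As a quick illustration of why the byte-range pattern is safe for UTF-8, here is a small sketch with made-up sample data. Bytes 0-31 and 127 never occur inside a multi-byte UTF-8 sequence, so the accented letters and the euro sign pass through untouched.

```php
<?php
// Sample input (illustrative only): "Grüße", NUL, " €5", U+001D, DEL, "!"
$input  = "Gr\u{FC}\u{DF}e\x00 \u{20AC}5\x1D\x7F!";
$output = preg_replace('/[\x00-\x1F\x7F]/', '', $input);

// Only the three control bytes are gone.
var_dump($output === "Gr\u{FC}\u{DF}e \u{20AC}5!");
```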

A slower¹ alternative without using regular expressions:

$string = str_replace(array(
    // control characters
    chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
    chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
    chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
    chr(31),
    // non-printing characters
    chr(127)
), '', $string);

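If you want to keep the whitespace characters \t, \n and \r, remove chr(9), chr(10) and chr(13) from this list. Note: the usual space is chr(32), so it stays in the result. Decide for yourself whether you want to remove the non-breaking space chr(160), as it can cause problems; only do that for single-byte encodings such as ISO-8859-1, because in UTF-8 the byte 160 is also part of other multi-byte characters.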

¹ Tested by @PaulDixon and verified by myself.
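The same list can be built programmatically instead of typing out every chr() call. This is only a sketch; the helper name strip_controls_keep_whitespace is hypothetical, and it keeps tab, line feed and carriage return as described above.

```php
<?php
// Hypothetical helper: strip ASCII control characters without a regex,
// keeping tab (9), line feed (10) and carriage return (13).
function strip_controls_keep_whitespace(string $text): string
{
    $codes   = array_diff(range(0, 31), [9, 10, 13]);
    $codes[] = 127; // DEL

    // Turn the byte values into one-character strings and remove them all.
    return str_replace(array_map('chr', $codes), '', $text);
}

var_dump(strip_controls_keep_whitespace("a\tb\x00c\r\n\x7Fd") === "a\tbc\r\nd");
```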


The regex in the selected answer fails for Unicode 0x1d (with PHP 7.4).

A solution:

<?php
$ct = 'différents' . "\r\n test";

// Fails for U+001D: the $ anchor restricts the match to the end of the string
$ct = preg_replace('/[\x00-\x1F\x7F]$/u', '', $ct);

// Works for U+001D: removes every character in Unicode category C ("Other")
$ct = preg_replace('/[^\P{C}]+/u', '', $ct);

// Works for U+001D and keeps line feeds (\n)
$ct = preg_replace('/[^\P{C}\n]+/u', '', $ct);

echo $ct;

From: "UTF 8 String remove all invisible characters except newline"
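To make the double negative in [^\P{C}\n] easier to follow, here is a small sketch with made-up input. It matches every character that is in category C (control, format, unassigned and so on) and is not a line feed, so both U+001D and the invisible zero width space U+200B are removed.

```php
<?php
// U+001D is a control character (Cc); U+200B (zero width space) is a format character (Cf).
$text  = "a\x1Db\u{200B}c\nd";
$clean = preg_replace('/[^\P{C}\n]+/u', '', $text);

// Both invisible characters are gone, the line feed is kept.
var_dump($clean === "abc\nd");
```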


This is simpler:

$string = preg_replace('/[[:cntrl:]]/', '', $string);
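A short sketch comparing the plain and the negated form of the POSIX class, using made-up input. Only the non-negated class strips the control characters; the negated one keeps nothing but them.

```php
<?php
$input = "a\x00b\tc\x7F";

// Removes the control characters and keeps the text.
$kept_text = preg_replace('/[[:cntrl:]]/', '', $input);

// Negated class: removes everything EXCEPT the control characters.
$kept_controls = preg_replace('/[^[:cntrl:]]/', '', $input);

var_dump($kept_text === 'abc');
var_dump($kept_controls === "\x00\t\x7F");
```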


Starting with PHP 5.2, we also have access to filter_var, which I have not seen mentioned here, so I thought I'd throw it out there. To use filter_var to strip non-printable characters < 32 and > 127, you can do:

Filter ASCII characters below 32

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

Filter characters above 127 (outside the ASCII range)

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

Strip both:

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);

You can also html-encode low characters (newline, tab, etc.) while stripping high:

$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW|FILTER_FLAG_STRIP_HIGH);
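The effect of the strip flags can be checked with a short sketch like the following. The sample input is made up; note that the high flag works on bytes, so it removes both bytes of the UTF-8 encoded "é".

```php
<?php
// "café", a tab, and two lines separated by a line feed
$input = "caf\u{E9}\tline1\nline2";

$low  = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
$high = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
$both = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);

var_dump($low  === "caf\u{E9}line1line2"); // tab and line feed removed
var_dump($high === "caf\tline1\nline2");   // both bytes of "é" removed
var_dump($both === "cafline1line2");
```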

There are also options for stripping HTML, sanitizing e-mails and URLs, etc. So, lots of options for sanitization (strip out data) and even validation (return false if not valid rather than silently stripping).

Sanitization: http://php.net/manual/en/filter.filters.sanitize.php

Validation: http://php.net/manual/en/filter.filters.validate.php

However, there is still the problem that FILTER_FLAG_STRIP_LOW strips out newlines and carriage returns, which are completely valid characters for a textarea. So some of the regex answers are, I guess, still necessary at times. For example, after reviewing this thread, I plan to do this for textareas:

$string = preg_replace( '/[^[:print:]\r\n]/', '',$input);

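This seems more readable than a number of the regexes that strip by numeric range. Note that without the u modifier [:print:] normally covers only printable ASCII, so this also removes non-ASCII characters such as é.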


"cedivad" solved the issue for me; the Swedish characters ÅÄÖ are preserved in the result.

$text = preg_replace( '/[^\p{L}\s]/u', '', $text );

Thanks!


All of the solutions work partially, and even the one below probably does not cover all cases. My issue was inserting a string into a utf8 MySQL table. The string (and its bytes) conformed to UTF-8 but contained several bad sequences. I assume that most of them were control or formatting characters.

function clean_string($string) {
  $s = trim($string);
  $s = iconv("UTF-8", "UTF-8//IGNORE", $s); // drop all non utf-8 characters

  // this is some bad utf-8 byte sequence that makes mysql complain - control and formatting i think
  $s = preg_replace('/(?>[\x00-\x1F]|\xC2[\x80-\x9F]|\xE2[\x80-\x8F]{2}|\xE2\x80[\xA4-\xA8]|\xE2\x81[\x9F-\xAF])/', ' ', $s);

  $s = preg_replace('/\s+/', ' ', $s); // reduce all multiple whitespace to a single space

  return $s;
}

The problem is further exacerbated by the table vs. server vs. connection vs. rendering settings for the content, as talked about a little here


To strip all non-ASCII characters from the input string

$result = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);

That code removes any bytes in the decimal ranges 0-31 and 128-255 (hex 00-1F and 80-FF), leaving only the characters 32-127 in the resulting string, which I call $result in this example. Note that DEL (127) is therefore kept.
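A small sketch of what this pattern does to a UTF-8 string, using made-up input. Both bytes of "ß" are dropped along with the BEL character, while DEL (127) is left in place because it is outside both ranges.

```php
<?php
// "Straße", BEL, " 10", DEL
$string = "Stra\u{DF}e\x07 10\x7F";
$result = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);

var_dump($result === "Strae 10\x7F");
```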


I solved the problem for UTF-8 using https://github.com/neitanod/forceutf8

use ForceUTF8\Encoding;

$string = Encoding::fixUTF8($string);

For anyone who is still looking for a way to do this without removing the non-printable characters, but rather escaping them, I made this to help out. Feel free to improve it! Each byte is escaped as two backslashes followed by x and two uppercase hex digits (\\x[A-F0-9][A-F0-9]).

Call like so:

$escaped = EscapeNonASCII($string);

$unescaped = UnescapeNonASCII($escaped);

<?php 
  function EscapeNonASCII($string) //Convert string to hex, replace non-printable chars with escaped hex
    {
        $hexbytes = strtoupper(bin2hex($string));
        $i = 0;
        while ($i < strlen($hexbytes))
        {
            $hexpair = substr($hexbytes, $i, 2);
            $decimal = hexdec($hexpair);
            if ($decimal < 32 || $decimal > 126)
            {
                $top = substr($hexbytes, 0, $i);
                $escaped = EscapeHex($hexpair);
                $bottom = substr($hexbytes, $i + 2);
                $hexbytes = $top . $escaped . $bottom;
                $i += 8;
            }
            $i += 2;
        }
        $string = hex2bin($hexbytes);
        return $string;
    }
    function EscapeHex($string) //Helper function for EscapeNonASCII()
    {
        $x = "5C5C78"; //\x
        $topnibble = bin2hex($string[0]); //Convert top nibble to hex
        $bottomnibble = bin2hex($string[1]); //Convert bottom nibble to hex
        $escaped = $x . $topnibble . $bottomnibble; //Concatenate escape sequence "\x" with top and bottom nibble
        return $escaped;
    }

    function UnescapeNonASCII($string) //Convert string to hex, replace escaped hex with actual hex.
    {
        $stringtohex = bin2hex($string);
        $stringtohex = preg_replace_callback('/5c5c78([a-fA-F0-9]{4})/', function ($m) { 
            return hex2bin($m[1]);
        }, $stringtohex);
        return hex2bin(strtoupper($stringtohex));
    }
?>
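For comparison, the same escaping can be sketched more compactly with preg_replace_callback. This is an alternative under the assumption that the escape format stays the same (two backslashes, x, two hex digits); the function names escape_non_printable and unescape_non_printable are hypothetical.

```php
<?php
// Hypothetical alternative: escape every byte outside 0x20-0x7E
// as two backslashes, "x" and two uppercase hex digits.
function escape_non_printable(string $bytes): string
{
    return preg_replace_callback(
        '/[^\x20-\x7E]/',
        static function (array $m): string {
            // '\\\\x' in single quotes is two literal backslashes plus "x".
            return sprintf('\\\\x%02X', ord($m[0]));
        },
        $bytes
    );
}

// Reverse step: turn each escape sequence back into the raw byte.
function unescape_non_printable(string $text): string
{
    return preg_replace_callback(
        '/\\\\\\\\x([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $text
    );
}

$raw     = "caf\xC3\xA9\n";
$escaped = escape_non_printable($raw);
var_dump(unescape_non_printable($escaped) === $raw);
```

As with the original functions, the round trip is not lossless if the input already contains text that looks like an escape sequence, since unescaping converts that text as well.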

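The marked answer is perfect, but it misses character 127 (DEL), which is also a non-printable character.

My answer would be:

$string = preg_replace('/[\x00-\x1F\x7f-\xFF]/', '', $string);

Note that this range also removes all bytes 128-255, so it is only suitable when pure ASCII output is wanted.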

Many of the other answers here do not take into account Unicode characters (e.g. öäüßîû). In this case you can use the following:

$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $string);

There's a strange class of characters in the range \x80-\x9F (just above the 7-bit ASCII range of characters) that are technically control characters, but over time have been misused for printable characters. If you don't have any problems with these, then you can use:

$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $string);

If you wish to also strip line feeds, carriage returns, tabs, non-breaking spaces, and soft-hyphens, you can use:

$string = preg_replace('/[\x00-\x1F\x7F-\xA0\xAD]/u', '', $string);

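Note that you must use single quotes for the above examples, so that the \x escapes reach the regex engine instead of being converted to raw bytes by PHP first.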
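The difference between the two quoting styles can be seen in a short sketch. In double quotes PHP replaces \x80 and \x9F with single raw bytes before PCRE ever sees the pattern, and a lone 0x80 byte is not valid UTF-8 on its own.

```php
<?php
// Single quotes: the backslash escapes are passed through to PCRE unchanged.
$single = '/[\x80-\x9F]/u';
// Double quotes: PHP itself turns \x80 and \x9F into raw bytes first.
$double = "/[\x80-\x9F]/u";

var_dump(strlen($single)); // int(14)
var_dump(strlen($double)); // int(8)

// With the single-quoted pattern, U+0085 (a C1 control character) is removed.
var_dump(preg_replace($single, '', "a\u{85}b") === 'ab');
```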

If you wish to strip everything except basic printable ASCII characters (all the example characters above will be stripped) you can use:

$string = preg_replace( '/[^[:print:]]/', '',$string);

For reference see http://www.fileformat.info/info/charset/UTF-8/list.htm


You could use a regular expression to remove everything apart from those characters you wish to keep:

$string=preg_replace('/[^A-Za-z0-9 _\-\+\&]/','',$string);

Replaces everything that is not (^) the letters A-Z or a-z, the numbers 0-9, space, underscore, hyphen, plus and ampersand with nothing (i.e. removes it).
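A quick sketch of the whitelist in action, with made-up input. Everything outside the allowed set, including the colon, the equals sign, the exclamation mark and the trailing newline, is dropped.

```php
<?php
$string = "Tom & Jerry: 2+2=4!\n";
$string = preg_replace('/[^A-Za-z0-9 _\-\+\&]/', '', $string);

var_dump($string === 'Tom & Jerry 2+24');
```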


You can use character classes:

/[[:cntrl:]]+/

How about:

return preg_replace("/[^a-zA-Z0-9`_.,;@#%~'\"\+\*\?\[\^\]\$\(\)\{\}\=\!\<\>\|\:\-\s\\\\]+/", "", $data);

This gives me complete control over what I want to include.


My UTF-8 compliant version:

$value = preg_replace('/[^\p{L}\s]/u', '', $value);
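Keep in mind that this pattern keeps only letters and whitespace, so digits and punctuation are removed along with the control characters. A small sketch with made-up input:

```php
<?php
// "ÅÄÖ 123, ok!"
$value = "\u{C5}\u{C4}\u{D6} 123, ok!";
$value = preg_replace('/[^\p{L}\s]/u', '', $value);

// The digits, comma and exclamation mark are gone; both spaces remain.
var_dump($value === "\u{C5}\u{C4}\u{D6}  ok");
```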

