preg_replace('/(?!\n)[\p{Cc}]/', '', $response);
This will remove all the control characters (http://uk.php.net/manual/en/regexp.reference.unicode.php) leaving the \n
newline characters. From my experience, the control characters are the ones that most often cause the printing issues.
The answer of @PaulDixon is completely wrong, because it removes the printable extended ASCII characters 128-255! has been partially corrected. I don't know why he still wants to delete 128-255 from a 127 chars 7-bit ASCII set as it does not have the extended ASCII characters.
But finally it was important not to delete 128-255 because for example chr(128)
(\x80
) is the euro sign in 8-bit ASCII and many UTF-8 fonts in Windows display a euro sign and Android regarding my own test.
And it will kill many UTF-8 characters if you remove the ASCII chars 128-255 from an UTF-8 string (probably the starting bytes of a multi-byte UTF-8 character). So don't do that! They are completely legal characters in all currently used file systems. The only reserved range is 0-31.
Instead use this to delete the non-printable characters 0-31 and 127:
$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);
It works in ASCII and UTF-8 because both share the same control set range.
The fastest slower¹ alternative without using regular expressions:
$string = str_replace(array(
// control characters
chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
chr(31),
// non-printing characters
chr(127)
), '', $string);
If you want to keep all whitespace characters \t
, \n
and \r
, then remove chr(9)
, chr(10)
and chr(13)
from this list. Note: The usual whitespace is chr(32)
so it stays in the result. Decide yourself if you want to remove non-breaking space chr(160)
as it can cause problems.
¹ Tested by @PaulDixon and verified by myself.
The regex into selected answer fail for Unicode: 0x1d (with php 7.4)
a solution:
<?php
$ct = 'différents'."\r\n test";
// fail for Unicode: 0x1d
$ct = preg_replace('/[\x00-\x1F\x7F]$/u', '',$ct);
// work for Unicode: 0x1d
$ct = preg_replace( '/[^\P{C}]+/u', "", $ct);
// work for Unicode: 0x1d and allow line break
$ct = preg_replace( '/[^\P{C}\n]+/u', "", $ct);
echo $ct;
from: UTF 8 String remove all invisible characters except newline
this is simpler:
$string = preg_replace( '/[^[:cntrl:]]/', '',$string);
Starting with PHP 5.2, we also have access to filter_var, which I have not seen any mention of so thought I'd throw it out there. To use filter_var to strip non-printable characters < 32 and > 127, you can do:
Filter ASCII characters below 32
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
Filter ASCII characters above 127
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
Strip both:
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);
You can also html-encode low characters (newline, tab, etc.) while stripping high:
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW|FILTER_FLAG_STRIP_HIGH);
There are also options for stripping HTML, sanitizing e-mails and URLs, etc. So, lots of options for sanitization (strip out data) and even validation (return false if not valid rather than silently stripping).
Sanitization: http://php.net/manual/en/filter.filters.sanitize.php
Validation: http://php.net/manual/en/filter.filters.validate.php
However, there is still the problem, that the FILTER_FLAG_STRIP_LOW will strip out newline and carriage returns, which for a textarea are completely valid characters...so some of the Regex answers, I guess, are still necessary at times, e.g. after reviewing this thread, I plan to do this for textareas:
$string = preg_replace( '/[^[:print:]\r\n]/', '',$input);
This seems more readable than a number of the regexes that stripped out by numeric range.
"cedivad" solved the issue for me with persistent result of Swedish chars ÅÄÖ.
$text = preg_replace( '/[^\p{L}\s]/u', '', $text );
Thanks!
All of the solutions work partially, and even below probably does not cover all of the cases. My issue was in trying to insert a string into a utf8 mysql table. The string (and its bytes) all conformed to utf8, but had several bad sequences. I assume that most of them were control or formatting.
function clean_string($string) {
$s = trim($string);
$s = iconv("UTF-8", "UTF-8//IGNORE", $s); // drop all non utf-8 characters
// this is some bad utf-8 byte sequence that makes mysql complain - control and formatting i think
$s = preg_replace('/(?>[\x00-\x1F]|\xC2[\x80-\x9F]|\xE2[\x80-\x8F]{2}|\xE2\x80[\xA4-\xA8]|\xE2\x81[\x9F-\xAF])/', ' ', $s);
$s = preg_replace('/\s+/', ' ', $s); // reduce all multiple whitespace to a single space
return $s;
}
To further exacerbate the problem is the table vs. server vs. connection vs. rendering of the content, as talked about a little here
To strip all non-ASCII characters from the input string
$result = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
That code removes any characters in the hex ranges 0-31 and 128-255, leaving only the hex characters 32-127 in the resulting string, which I call $result in this example.
I solved problem for UTF8 using https://github.com/neitanod/forceutf8
use ForceUTF8\Encoding;
$string = Encoding::fixUTF8($string);
For anyone that is still looking how to do this without removing the non-printable characters, but rather escaping them, I made this to help out. Feel free to improve it! Characters are escaped to \\x[A-F0-9][A-F0-9].
Call like so:
$escaped = EscapeNonASCII($string);
$unescaped = UnescapeNonASCII($string);
<?php
function EscapeNonASCII($string) //Convert string to hex, replace non-printable chars with escaped hex
{
$hexbytes = strtoupper(bin2hex($string));
$i = 0;
while ($i < strlen($hexbytes))
{
$hexpair = substr($hexbytes, $i, 2);
$decimal = hexdec($hexpair);
if ($decimal < 32 || $decimal > 126)
{
$top = substr($hexbytes, 0, $i);
$escaped = EscapeHex($hexpair);
$bottom = substr($hexbytes, $i + 2);
$hexbytes = $top . $escaped . $bottom;
$i += 8;
}
$i += 2;
}
$string = hex2bin($hexbytes);
return $string;
}
function EscapeHex($string) //Helper function for EscapeNonASCII()
{
$x = "5C5C78"; //\x
$topnibble = bin2hex($string[0]); //Convert top nibble to hex
$bottomnibble = bin2hex($string[1]); //Convert bottom nibble to hex
$escaped = $x . $topnibble . $bottomnibble; //Concatenate escape sequence "\x" with top and bottom nibble
return $escaped;
}
function UnescapeNonASCII($string) //Convert string to hex, replace escaped hex with actual hex.
{
$stringtohex = bin2hex($string);
$stringtohex = preg_replace_callback('/5c5c78([a-fA-F0-9]{4})/', function ($m) {
return hex2bin($m[1]);
}, $stringtohex);
return hex2bin(strtoupper($stringtohex));
}
?>
Marked anwser is perfect but it misses character 127(DEL) which is also a non-printable character
my answer would be
$string = preg_replace('/[\x00-\x1F\x7f-\xFF]/', '', $string);
Many of the other answers here do not take into account unicode characters (e.g. öäüß??îû??????? ). In this case you can use the following:
$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $string);
There's a strange class of characters in the range \x80-\x9F
(Just above the 7-bit ASCII range of characters) that are technically control characters, but over time have been misused for printable characters. If you don't have any problems with these, then you can use:
$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $string);
If you wish to also strip line feeds, carriage returns, tabs, non-breaking spaces, and soft-hyphens, you can use:
$string = preg_replace('/[\x00-\x1F\x7F-\xA0\xAD]/u', '', $string);
Note that you must use single quotes for the above examples.
If you wish to strip everything except basic printable ASCII characters (all the example characters above will be stripped) you can use:
$string = preg_replace( '/[^[:print:]]/', '',$string);
For reference see http://www.fileformat.info/info/charset/UTF-8/list.htm
You could use a regular express to remove everything apart from those characters you wish to keep:
$string=preg_replace('/[^A-Za-z0-9 _\-\+\&]/','',$string);
Replaces everything that is not (^) the letters A-Z or a-z, the numbers 0-9, space, underscore, hypen, plus and ampersand - with nothing (i.e. remove it).
you can use character classes
/[[:cntrl:]]+/
how about:
return preg_replace("/[^a-zA-Z0-9`_.,;@#%~'\"\+\*\?\[\^\]\$\(\)\{\}\=\!\<\>\|\:\-\s\\\\]+/", "", $data);
gives me complete control of what I want to include
My UTF-8 compliant version:
preg_replace('/[^\p{L}\s]/u','',$value);
Source: Stackoverflow.com