[javascript] Remove non-ASCII characters from a string

var str="INFO] :??????, ???????2????, ???????, ????? (Higashikurume)";

and I need to remove all non-ASCII characters from the string,

meaning str should only contain "INFO] (Higashikurume)";

This question is related to the tags javascript and non-ascii-characters.

The answer is


You can use the following regex to replace non-ASCII characters

str = str.replace(/[^A-Za-z 0-9 \.,\?'"!@#\$%\^&\*\(\)-_=\+;:<>\/\\\|\}\{\[\]`~]*/g, '')

However, note that spaces, colons and commas are all valid ASCII, so the result will be

> str
"INFO] :, , ,  (Higashikurume)"

To keep extended ASCII as well (accented characters in the 0x80-0xFF range), widen the allowed range to 0xFF:

str = str.replace(/[^\x00-\xFF]/g, "");

ASCII is in the range 0 to 127, so:

str.replace(/[^\x00-\x7F]/g, "");
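
As a minimal sketch (the helper name stripNonAscii is just for illustration, not from the original answers), that range can be wrapped in a small function:

function stripNonAscii(input) {
  // keep only code units in the ASCII range 0x00-0x7F
  return input.replace(/[^\x00-\x7F]/g, '');
}

console.log(stripNonAscii("Héllo wörld (Higashikurume)"));  // "Hllo wrld (Higashikurume)"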

None of these answers properly handle tabs, newlines, and carriage returns, and some don't handle extended ASCII or Unicode. The snippet below will KEEP tabs and newlines, but remove other control characters and anything outside the ASCII set. There is some new JavaScript syntax coming down the pipe, so in the future (2020+?) you may have to write \u{FFFFF}, but not yet.

console.log("line 1\nline2 \n\ttabbed\nF??^?¯?^??????????????l????~¨??????_??????a?????"????????????v?¯?????i????o?????????????????????".replace(/[\x00-\x08\x0E-\x1F\x7F-\uFFFF]/g, ''))


It can also be done with a positive assertion of removal, like this:

textContent = textContent.replace(/[\u{0080}-\u{FFFF}]/gu,"");

This uses Unicode. In JavaScript, when expressing a Unicode code point in a regular expression, the character is specified with the escape sequence \u{xxxx}, but the 'u' flag must also be present; note the regex has the flags 'gu'.
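
For example, a small sketch using a zero-width space (U+200B) to show the 'gu' flags in action:

// the 'u' flag enables the \u{...} code point escape; 'g' replaces every match
var zapped = "abc\u200Bdef".replace(/[\u{0080}-\u{FFFF}]/gu, "");
console.log(zapped);         // "abcdef"
console.log(zapped.length);  // 6 -- the invisible U+200B character is gone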

I called this a "positive assertion of removal" in the sense that a "positive" assertion expresses which characters to remove, while a "negative" assertion expresses which characters not to remove. In many contexts, the negative assertion, as used in the prior answers, might be more suggestive to the reader. The caret "^" says "not" and the range \x00-\x7F says "ASCII," so the two together say "not ASCII."

textContent = textContent.replace(/[^\x00-\x7F]/g,"");

That's a great solution for English-language speakers who only care about English, and it's also a fine answer for the original question. But in a more general context, one cannot always accept the cultural bias of assuming "all non-ASCII is bad." For contexts where non-ASCII text is used but occasionally needs to be stripped out, the positive assertion of Unicode ranges is a better fit.

A good indication that zero-width, non-printing characters are embedded in a string is when the string's "length" property is positive (nonzero) but the string looks like (i.e. prints as) an empty string. For example, I had this showing up in the Chrome debugger, for a variable named "textContent":

> textContent
""
> textContent.length
7

This prompted me to want to see what was in that string.

> encodeURI(textContent)
"%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B"

This sequence of bytes seems to be in the family of Unicode characters that get inserted by word processors into documents, and then find their way into data fields. Most commonly, these symbols occur at the end of a document. The zero-width space "%E2%80%8B" might be inserted by CKEditor.

encodeURI()  UTF-8 bytes  Unicode  HTML     Meaning
-----------  -----------  -------  -------  -------------------
"%E2%80%8B"  E2 80 8B     U+200B   &#8203;  zero-width space
"%E2%80%8E"  E2 80 8E     U+200E   &#8206;  left-to-right mark
"%E2%80%8F"  E2 80 8F     U+200F   &#8207;  right-to-left mark

Some references on those:

http://www.fileformat.info/info/unicode/char/200B/index.htm

https://en.wikipedia.org/wiki/Left-to-right_mark

Note that although the embedded character arrived as UTF-8, the encoding used in the regular expression is not UTF-8. Although the character is stored in the document as three bytes (in my case) of UTF-8, the regular expression must refer to it by its Unicode code point (for example \u200B), not by that byte sequence. UTF-8 encodes a code point in one to four bytes; it is less compact than the raw code point because it uses the high bit (or bits) of each byte to escape beyond the standard ASCII encoding. That's explained here:

https://en.wikipedia.org/wiki/UTF-8
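
To make the distinction concrete, here is a sketch (assuming an environment that provides TextEncoder, such as a browser or Node.js): the zero-width space is a single code unit in a JavaScript string, but three bytes once encoded as UTF-8.

var zwsp = "\u200B";                          // one UTF-16 code unit in the string
console.log(zwsp.length);                     // 1
console.log(new TextEncoder().encode(zwsp));  // Uint8Array [226, 128, 139] = E2 80 8B
console.log(encodeURI(zwsp));                 // "%E2%80%8B"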