[javascript] Remove non-ASCII characters from a string

var str="INFO] :??????, ???????2????, ???????, ????? (Higashikurume)";

and I need to remove all non-ASCII characters from the string,

meaning str should only contain "INFO] (Higashikurume)";

This question is related to the tags javascript and non-ascii-characters.

The answer is


You can use the following regex to replace non-ASCII characters

str = str.replace(/[^A-Za-z 0-9 \.,\?'"!@#\$%\^&\*\(\)-_=\+;:<>\/\\\|\}\{\[\]`~]*/g, '')

However, note that spaces, colons and commas are all valid ASCII, so the result will be

> str
"INFO] :, , ,  (Higashikurume)"

To keep extended ASCII as well (accented characters in the 0x80-0xFF range), widen the allowed range to 0xFF:

str = str.replace(/[^\x00-\xFF]/g, "");

ASCII is in the range 0 to 127, so:

str.replace(/[^\x00-\x7F]/g, "");
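
As a minimal sketch (the helper name stripNonAscii is just for illustration, not from the original answers), that range can be wrapped in a small function:

function stripNonAscii(input) {
  // keep only code units in the ASCII range 0x00-0x7F
  return input.replace(/[^\x00-\x7F]/g, '');
}

console.log(stripNonAscii("Héllo wörld (Higashikurume)"));  // "Hllo wrld (Higashikurume)"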

None of these answers properly handle tabs, newlines, and carriage returns, and some don't handle extended ASCII or Unicode. The snippet below will KEEP tabs and newlines, but remove other control characters and anything outside the ASCII set. There is some new JavaScript syntax coming down the pipe, so in the future (2020+?) you may have to write \u{FFFFF}, but not yet.

console.log("line 1\nline2 \n\ttabbed\nF??^?¯?^??????????????l????~¨??????_??????a?????"????????????v?¯?????i????o?????????????????????".replace(/[\x00-\x08\x0E-\x1F\x7F-\uFFFF]/g, ''))


It can also be done with a positive assertion of removal, like this:

textContent = textContent.replace(/[\u{0080}-\u{FFFF}]/gu,"");

This uses Unicode. In JavaScript, when expressing a Unicode code point in a regular expression, the character is specified with the escape sequence \u{xxxx}, but the 'u' flag must also be present; note the regex has the flags 'gu'.
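
For example, a small sketch using a zero-width space (U+200B) to show the 'gu' flags in action:

// the 'u' flag enables the \u{...} code point escape; 'g' replaces every match
var zapped = "abc\u200Bdef".replace(/[\u{0080}-\u{FFFF}]/gu, "");
console.log(zapped);         // "abcdef"
console.log(zapped.length);  // 6 -- the invisible U+200B character is gone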

I called this a "positive assertion of removal" in the sense that a "positive" assertion expresses which characters to remove, while a "negative" assertion expresses which characters not to remove. In many contexts, the negative assertion, as used in the prior answers, might be more suggestive to the reader. The caret "^" says "not" and the range \x00-\x7F says "ASCII," so the two together say "not ASCII."

textContent = textContent.replace(/[^\x00-\x7F]/g,"");

That's a great solution for English-language speakers who only care about English, and it's also a fine answer for the original question. But in a more general context, one cannot always accept the cultural bias of assuming "all non-ASCII is bad." For contexts where non-ASCII text is used but occasionally needs to be stripped out, the positive assertion of Unicode ranges is a better fit.

A good indication that zero-width, non-printing characters are embedded in a string is when the string's "length" property is positive (nonzero) but the string looks like (i.e. prints as) an empty string. For example, I had this showing up in the Chrome debugger, for a variable named "textContent":

> textContent
""
> textContent.length
7

This prompted me to want to see what was in that string.

> encodeURI(textContent)
"%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B"

This sequence of bytes seems to be in the family of Unicode characters that get inserted by word processors into documents, and then find their way into data fields. Most commonly, these symbols occur at the end of a document. The zero-width space "%E2%80%8B" might be inserted by CKEditor.

encodeURI()  UTF-8 bytes  Unicode  HTML     Meaning
-----------  -----------  -------  -------  -------------------
"%E2%80%8B"  E2 80 8B     U+200B   &#8203;  zero-width space
"%E2%80%8E"  E2 80 8E     U+200E   &#8206;  left-to-right mark
"%E2%80%8F"  E2 80 8F     U+200F   &#8207;  right-to-left mark

Some references on those:

http://www.fileformat.info/info/unicode/char/200B/index.htm

https://en.wikipedia.org/wiki/Left-to-right_mark

Note that although the embedded character arrived as UTF-8, the encoding used in the regular expression is not UTF-8. Although the character is stored in the document as three bytes (in my case) of UTF-8, the regular expression must refer to it by its Unicode code point (for example \u200B), not by that byte sequence. UTF-8 encodes a code point in one to four bytes; it is less compact than the raw code point because it uses the high bit (or bits) of each byte to escape beyond the standard ASCII encoding. That's explained here:

https://en.wikipedia.org/wiki/UTF-8
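
To make the distinction concrete, here is a sketch (assuming an environment that provides TextEncoder, such as a browser or Node.js): the zero-width space is a single code unit in a JavaScript string, but three bytes once encoded as UTF-8.

var zwsp = "\u200B";                          // one UTF-16 code unit in the string
console.log(zwsp.length);                     // 1
console.log(new TextEncoder().encode(zwsp));  // Uint8Array [226, 128, 139] = E2 80 8B
console.log(encodeURI(zwsp));                 // "%E2%80%8B"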