[regex] How do I grep for all non-ASCII characters?

Searching for non-printable chars. TLDR; Executive Summary

  1. search for control chars AND extended unicode
  2. locale setting e.g. LC_ALL=C needed to make grep do what you might expect with extended unicode

SO the preferred non-ascii char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test

. . more . . excruciating detail on this: . . .

I agree with Harvey above buried in the comments, it is often more useful to search for non-printable characters OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests "use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x020-\x07E]" and add \x0D for CR"

Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.

I found adding range 0-8 and 0x0e-0x1f (to the 0x80-0xff range) is a useful pattern. This excludes the TAB, CR and LF and one or two more uncommon printable chars. So IMHO a quite a useful (albeit crude) grep pattern is THIS one:

grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

ACTUALLY, generally you will need to do this:

LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

breakdown:

LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 7 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-1xFF - non-printable chars > 128 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps

Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches

E.g. practical example of use find to grep all files under current directory:

LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} + 

You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.

Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that is useful to exclude sometimes
Dec   Hex Ctrl Char description           Dec Hex Ctrl Char description
0     00  ^@  NULL                        16  10  ^P  DATA LINK ESCAPE (DLE)
1     01  ^A  START OF HEADING (SOH)      17  11  ^Q  DEVICE CONTROL 1 (DC1)
2     02  ^B  START OF TEXT (STX)         18  12  ^R  DEVICE CONTROL 2 (DC2)
3     03  ^C  END OF TEXT (ETX)           19  13  ^S  DEVICE CONTROL 3 (DC3)
4     04  ^D  END OF TRANSMISSION (EOT)   20  14  ^T  DEVICE CONTROL 4 (DC4)
5     05  ^E  END OF QUERY (ENQ)          21  15  ^U  NEGATIVE ACKNOWLEDGEMENT (NAK)
6     06  ^F  ACKNOWLEDGE (ACK)           22  16  ^V  SYNCHRONIZE (SYN)
7     07  ^G  BEEP (BEL)                  23  17  ^W  END OF TRANSMISSION BLOCK (ETB)
8     08  ^H  BACKSPACE (BS)**            24  18  ^X  CANCEL (CAN)
9     09  ^I  HORIZONTAL TAB (HT)**       25  19  ^Y  END OF MEDIUM (EM)
10    0A  ^J  LINE FEED (LF)**            26  1A  ^Z  SUBSTITUTE (SUB)
11    0B  ^K  VERTICAL TAB (VT)**         27  1B  ^[  ESCAPE (ESC)
12    0C  ^L  FF (FORM FEED)**            28  1C  ^\  FILE SEPARATOR (FS) RIGHT ARROW
13    0D  ^M  CR (CARRIAGE RETURN)**      29  1D  ^]  GROUP SEPARATOR (GS) LEFT ARROW
14    0E  ^N  SO (SHIFT OUT)              30  1E  ^^  RECORD SEPARATOR (RS) UP ARROW
15    0F  ^O  SI (SHIFT IN)               31  1F  ^_  UNIT SEPARATOR (US) DOWN ARROW

UPDATE: I had to revisit this recently. And, YYMV depending on terminal settings/solar weather forecast BUT . . I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3 and 4 byte unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set locale for the command to make grep match.

e.g. my locale LC_ALL= empty

$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=

grep with LC_ALL= empty matches 2 byte encoded chars but not 3 and 4 byte encoded:

$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call  underscore c2a0
9:CTRL
31:5 © copyright
32:7 call  underscore

grep with LC_ALL=C does seem to match all extended characters that you would want:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test  
1:???? unicode dashes e28090
3:??? Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:? copyright c2a9
7:call? underscore c2a0
11:LIVE??E! ?????????? ???? ?????????? ???? ?? ?? ???? ????  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ???? unicode dashes
30:3 ??? Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 ? copyright
32:7 call? underscore
33:11 LIVE??E! ?????????? ???? ?????????? ???? ?? ?? ???? ????  YEOW, mix of japanese and chars from other
34:52 LIVE??E! ?????????? ???? ?????????? ???? ?? ?? ???? ????  YEOW, mix of japanese and chars from other
81:LIVE??E! ?????????? ???? ?????????? ???? ?? ?? ???? ????  YEOW, mix of japanese and chars from other

THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test  

1 -- unicode dashes e28090
3  Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call  underscore c2a0
9 CTRL-H CHARS URK URK URK 
11 LIVE-E! ????? ?? ????? ?? ? ? ?? ??  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 -- unicode dashes
30 3  Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call  underscore
33 11 LIVE-E! ????? ?? ????? ?? ? ? ?? ??  YEOW, mix of japanese and chars from other
34 52 LIVE-E! ????? ?? ????? ?? ? ? ?? ??  YEOW, mix of japanese and chars from other
73 LIVE-E! ????? ?? ????? ?? ? ? ?? ??  YEOW, mix of japanese and chars from other

SO the preferred non-ascii char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to unix

Docker CE on RHEL - Requires: container-selinux >= 2.9 What does `set -x` do? How to find files modified in last x minutes (find -mmin does not work as expected) sudo: npm: command not found How to sort a file in-place How to read a .properties file which contains keys that have a period character using Shell script gpg decryption fails with no secret key error Loop through a comma-separated shell variable Best way to find os name and version in Unix/Linux platform Resource u'tokenizers/punkt/english.pickle' not found

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to grep

grep's at sign caught as whitespace cat, grep and cut - translated to python How to suppress binary file matching results in grep Linux find and grep command together Filtering JSON array using jQuery grep() Linux Script to check if process is running and act on the result grep without showing path/file:line How do you grep a file and get the next 5 lines How to grep, excluding some patterns? Fast way of finding lines in one file that are not in another?