How do I grep for all non-ASCII characters?

Question

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:

    grep -e "[\x{00FF}-\x{FFFF}]" file.xml

But this returns every line in the file, regardless of whether the line contains a character in the range specified.

Do I have the syntax wrong or am I doing something else wrong? I've also tried:

    egrep "[\x{00FF}-\x{FFFF}]" file.xml

(with both single and double quotes surrounding the pattern).

User · Answer

Searching for non-printable chars. TLDR; Executive Summary:

1. search for control chars AND extended unicode
2. a locale setting, e.g. LC_ALL=C, is needed to make grep do what you might expect with extended unicode

SO the preferred non-ascii char finders:

    $ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

    $ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:

    $ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test

. . more . . excruciating detail on this: . . .

I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests: use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]" (and add \x0D for CR).

Also, adding -c (show count of matching lines) to grep is useful when searching for non-printable chars, as the strings matched can mess up the terminal.

I found that adding the ranges 0x00-0x08 and 0x0E-0x1F (to the 0x80-0xFF range) gives a useful pattern. This leaves out TAB, CR and LF and two more uncommon printable chars (VT and FF). So IMHO a quite useful (albeit crude) grep pattern is THIS one:

    grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

ACTUALLY, generally you will need to do this:

    LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

breakdown:

    LC_ALL=C   - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded >= 0x80)
    \x00-\x08  - non-printable control chars 0 - 8 decimal
    \x0E-\x1F  - more non-printable control chars 14 - 31 decimal
    \x80-\xFF  - bytes 128 - 255 decimal (everything outside ASCII)
    -c         - print count of matching lines instead of lines
    -P         - perl style regexps

Instead of -c you may prefer to use -n (and optionally -b) or -l:

    -n, --line-number
    -b, --byte-offset
    -l, --files-with-matches

E.g. a practical example using find to grep all files under the current directory:

    LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +

You may wish to adjust the grep at times, e.g. to allow the BS (0x08, backspace) char used in some printable files, or to also flag VT (0x0B, vertical tab). The BEL (0x07) and ESC (0x1B) chars can also be deemed printable in some cases.

Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars (0x09-0x0D, the ones the pattern above leaves out) that it is useful to exclude sometimes

    Dec  Hex  Ctrl  Char description               Dec  Hex  Ctrl  Char description
    0    00   ^@    NULL                           16   10   ^P    DATA LINK ESCAPE (DLE)
    1    01   ^A    START OF HEADING (SOH)         17   11   ^Q    DEVICE CONTROL 1 (DC1)
    2    02   ^B    START OF TEXT (STX)            18   12   ^R    DEVICE CONTROL 2 (DC2)
    3    03   ^C    END OF TEXT (ETX)              19   13   ^S    DEVICE CONTROL 3 (DC3)
    4    04   ^D    END OF TRANSMISSION (EOT)      20   14   ^T    DEVICE CONTROL 4 (DC4)
    5    05   ^E    ENQUIRY (ENQ)                  21   15   ^U    NEGATIVE ACKNOWLEDGEMENT (NAK)
    6    06   ^F    ACKNOWLEDGE (ACK)              22   16   ^V    SYNCHRONIZE (SYN)
    7    07   ^G    BELL (BEL)                     23   17   ^W    END OF TRANSMISSION BLOCK (ETB)
    8    08   ^H    BACKSPACE (BS)                 24   18   ^X    CANCEL (CAN)
    9    09   ^I    HORIZONTAL TAB (HT) **         25   19   ^Y    END OF MEDIUM (EM)
    10   0A   ^J    LINE FEED (LF) **              26   1A   ^Z    SUBSTITUTE (SUB)
    11   0B   ^K    VERTICAL TAB (VT) **           27   1B   ^[    ESCAPE (ESC)
    12   0C   ^L    FORM FEED (FF) **              28   1C   ^\    FILE SEPARATOR (FS)
    13   0D   ^M    CARRIAGE RETURN (CR) **        29   1D   ^]    GROUP SEPARATOR (GS)
    14   0E   ^N    SHIFT OUT (SO)                 30   1E   ^^    RECORD SEPARATOR (RS)
    15   0F   ^O    SHIFT IN (SI)                  31   1F   ^_    UNIT SEPARATOR (US)

UPDATE: I had to revisit this recently. And, YMMV depending on terminal settings/solar weather forecast, BUT . . I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xFF, 3 and 4 byte unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match. In a UTF-8 locale, grep -P reads [\x80-\xFF] as the code points U+0080 to U+00FF rather than as raw bytes, so only characters in that small range match.

e.g. my locale, with LC_ALL= empty:

    $ locale
    LANG=en_IE.UTF-8
    LC_CTYPE="en_IE.UTF-8"
    .
    .
    LC_ALL=

grep with LC_ALL= empty matches some 2 byte encoded chars but not 3 and 4 byte encoded:

    $ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
    5:© copyright c2a9
    7:call underscore c2a0
    9:CTRL
    31:5 © copyright
    32:7 call underscore

grep with LC_ALL=C does seem to match all extended characters that you would want:

    $ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
    1:‐‐ unicode dashes e28090
    3:💘 Heart With Arrow Emoji - Emojipedia UTF8 f09f9298
    5:© copyright c2a9
    7:call underscore c2a0
    11:LIVE-E [...] YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
    29:1 ‐‐ unicode dashes
    30:3 💘 Heart With Arrow Emoji - Emojipedia UTF8 e28090
    31:5 © copyright
    32:7 call underscore
    33:11 LIVE-E [...] YEOW, mix of japanese and chars from other
    34:52 LIVE-E [...] YEOW, mix of japanese and chars from other
    81:LIVE-E [...] YEOW, mix of japanese and chars from other

THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:

    $ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

    $ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
    1 ‐‐ unicode dashes e28090
    3 💘 Heart With Arrow Emoji - Emojipedia UTF8 f09f9298
    5 © copyright c2a9
    7 call underscore c2a0
    9 CTRL-H CHARS URK URK URK
    11 LIVE-E [...] YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
    29 1 ‐‐ unicode dashes
    30 3 💘 Heart With Arrow Emoji - Emojipedia UTF8 e28090
    31 5 © copyright
    32 7 call underscore
    33 11 LIVE-E [...] YEOW, mix of japanese and chars from other
    34 52 LIVE-E [...] YEOW, mix of japanese and chars from other
    73 LIVE-E [...] YEOW, mix of japanese and chars from other
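For systems where grep has no -P option, the same byte ranges can be approximated with a plain bracket expression. This is only a rough sketch, not the method used above: the helper name find_odd_bytes and the sample file are made up for illustration, and NUL (0x00) is left out because a shell string cannot hold it.

```shell
# Hypothetical helper: list "N:line" for lines holding a control byte
# (0x01-0x08, 0x0E-0x1F) or a high byte (0x80-0xFF). NUL (0x00) is not
# covered, because a shell string cannot carry a NUL byte.
find_odd_bytes() {
    # Build the bracket expression from octal escapes; LC_ALL=C makes grep
    # treat the ranges as plain byte values.
    pattern="[$(printf '\001-\010\016-\037\200-\377')]"
    LC_ALL=C grep -n "$pattern" "$@"
}

# Made-up sample: line 2 has BEL, line 3 has U+00E9 (bytes C3 A9), line 4 has ESC.
sample=$(mktemp)
printf 'tab\tis fine\nbell\007\ncaf\303\251\nesc\033[0m\n' > "$sample"
# Show only the line numbers so stray control bytes do not reach the terminal.
find_odd_bytes "$sample" | LC_ALL=C cut -d: -f1
rm -f "$sample"
```

On the sample this should report lines 2, 3 and 4 and skip the line that only contains a TAB.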

User · Answer

Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:

    cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'

For unicode characters (like \u2212 in the example below) use this:

    find . ... -exec perl -CA -e '$ARGV = $ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;

User · Answer

In perl:

    perl -ane '{ if(m/[[:^ascii:]]/) { print } }' fileName > newFile

User · Answer

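The following code works:

    find /tmp | perl -ne 'print if /[^[:ascii:]]/'

Replace /tmp with the name of the directory you want to search through. Note that this checks the path names that find prints, not the contents of the files.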

User · Answer

You can use the command:

    grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number, and will highlight non-ascii chars in red.

In some systems, depending on your settings, the above will not work, so you can grep by the inverse:

    grep --color='auto' -P -n "[^\x00-\x7F]" file.xml

Note also that the important bit is the -P flag, which equates to --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The man page also says that:

    this is highly experimental and grep -P may warn of unimplemented features.

User · Answer

The following works for me:

    grep -P "[\x80-\xFF]" file.xml

Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
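To see the byte-level view in practice, here is a small sketch that counts how many bytes in a file fall in the 0x80 to 0xFF range. The helper name count_high_bytes and the sample file are hypothetical; it uses tr rather than grep, so it is a quick check and not a replacement for the command above.

```shell
# Hypothetical helper: count the bytes in a file whose value is 0x80-0xFF.
count_high_bytes() {
    # -c complements the set and -d deletes, so only bytes \200-\377 survive;
    # LC_ALL=C keeps tr working on single bytes. wc -c then counts them.
    LC_ALL=C tr -cd '\200-\377' < "$1" | wc -c | tr -d ' '
}

# Made-up sample: U+00E9 is two bytes (C3 A9) and U+20AC is three (E2 82 AC).
sample=$(mktemp)
printf 'caf\303\251 costs \342\202\254 5\n' > "$sample"
count_high_bytes "$sample"
rm -f "$sample"
```

A result of 0 means the file is pure ASCII; anything else means at least one line would be matched by the byte-range pattern.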

User · Answer

Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.

So the first solution, for instance, would become:

    grep --color='auto' -P -n '[^\x00-\x7F]' file.xml

(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)

On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:

    pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml

Any pros or cons that anyone can think of?

User · Answer

The easy way is to define a non-ASCII character... as a character that is not an ASCII character.

    LC_ALL=C grep '[^ -~]' file.xml

Add a tab after the ^ if necessary.

Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters; otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
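As a rough illustration of this approach with the tab already added, the sketch below wraps the command in a helper and runs it on a made-up sample file. The name non_ascii_lines and the sample contents are assumptions for the example only.

```shell
# Hypothetical helper: print "N:line" for every line containing a byte that
# is neither a tab nor printable ASCII (space through tilde).
non_ascii_lines() {
    # $(printf '\t') inserts a literal tab right after the ^;
    # LC_ALL=C makes the " -~" range mean the bytes 0x20-0x7E.
    LC_ALL=C grep -n "[^$(printf '\t') -~]" "$@"
}

# Made-up sample: line 2 contains U+00E9 (C3 A9), line 3 contains U+20AC (E2 82 AC).
sample=$(mktemp)
printf 'plain\tascii\ncaf\303\251\nprice: \342\202\254 5\n' > "$sample"
non_ascii_lines "$sample"
rm -f "$sample"
```

Because the class is defined by what counts as acceptable, control characters other than tab are reported too, which is usually what you want when checking a text file.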

User · Answer

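It could be interesting to know how to search for one unicode character. This command can help. You only need to know the character's code point:

    grep -v $'\u200d'

Note that -v prints the lines that do not contain the character; drop it to print the lines that do.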
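If the shell does not understand the $'\u....' form, the same search can be written with the UTF-8 bytes of the character spelled out. A minimal sketch, assuming the target is U+200D (ZERO WIDTH JOINER), which is encoded as the bytes E2 80 8D; the helper name lines_with_zwj is made up for the example.

```shell
# Hypothetical helper: print "N:line" for lines containing U+200D.
lines_with_zwj() {
    # E2 80 8D written as octal escapes; -F matches the bytes as a fixed
    # string and LC_ALL=C keeps grep from reinterpreting them.
    LC_ALL=C grep -n -F "$(printf '\342\200\215')" "$@"
}

# Made-up sample: only line 2 contains the zero width joiner.
sample=$(mktemp)
printf 'no joiner here\na\342\200\215b\nstill none\n' > "$sample"
lines_with_zwj "$sample" | LC_ALL=C cut -d: -f1
rm -f "$sample"
```

For another character, replace the three octal escapes with the UTF-8 bytes of that character.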

User · Answer

Finding all non-ascii characters gives the impression that one is either looking for unicode strings or intends to strip said characters individually.

For the former, try one of these (the variable file is used for automation):

    file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' "$file" | iconv -f $(uchardet "$file") -t utf-8

    file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' "$file" | iconv -f $(uchardet "$file") -t utf-8

    file=file.txt ; pcregrep -iao '[^\x00-\x1F\x21-\x7F]{7,}' "$file" | iconv -f $(uchardet "$file") -t utf-8

Vanilla grep doesn't work correctly without LC_ALL=C, as noted in the previous answers.

The ASCII range is x00-x7F and space is x20; since strings have spaces, the negative range omits it.

The non-ASCII range is x80-xFF; since strings have spaces, the positive range adds it.

A string is presumed to be at least 7 consecutive characters within the range: {7,}.

For shell readable output, uchardet "$file" returns a guess of the file encoding, which is passed to iconv for automatic conversion.

User · Answer

Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone to find additional non-ascii characters:

    grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt

Note: my computer's grep (a Mac) did not have the -P option, so I did brew install grep and started the call above with ggrep instead of grep.
