grep from tar.gz without extracting [faster one]

Question

I'm trying to grep a pattern in a dozen .tar.gz files, but it's very slow. I'm using:

    tar -ztf file.tar.gz | while read FILENAME
    do
        if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
        then
            echo "$FILENAME contains string"
        fi
    done

User · Answer

I know this question is 4 years old, but I have a couple of different options.

Option 1: Using tar --to-command grep

The following line will look in example.tgz for PATTERN. This is similar to Jester's example, but I couldn't get his pattern matching to work.

    tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'

Option 2: Using tar -tzf

The second option is to use tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:

    targrep () {
        # Note: the unquoted $(...) splits on whitespace, so member names
        # containing spaces are not handled.
        for i in $(tar -tzf "$1"); do
            results=$(tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2")
            echo "$results"
        done
    }

Usage:

    targrep example.tar.gz "pattern"
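As a sketch of how Option 2 could cope with member names that contain spaces, the loop below reads the listing line by line instead of word by word. The name `targrep_safe` is made up for this example, and `--label` assumes GNU grep. It still decompresses the archive once per member, so it is no faster than the original function.

```shell
# targrep_safe ARCHIVE PATTERN  (hypothetical helper, not from the answer above)
# Greps every regular member of a gzip-compressed tar archive.
targrep_safe() {
    local archive="$1" pattern="$2" member
    tar -tzf "$archive" | while IFS= read -r member; do
        # Directory entries are listed with a trailing slash; skip them.
        case $member in */) continue ;; esac
        # grep exits 1 when a member has no match; that is not an error here.
        tar -Oxzf "$archive" "$member" |
            grep --label="$member" -H -e "$pattern" || true
    done
}
```

Usage is the same as before: `targrep_safe example.tar.gz "pattern"`.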

User · Answer

If you have zgrep, you can use:

    zgrep -a string file.tar.gz

User · Answer

For starters, you could start more than one process:

    tar -ztf file.tar.gz | while read FILENAME
    do
        (if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
        then
            echo "$FILENAME contains string"
        fi) &
    done

The ( ... ) & creates a new detached (read: the parent shell does not wait for the child) process.

After that, you should optimize the extracting of your archive. The read is no problem, as the OS should have cached the file access already. However, tar needs to unpack the archive every time the loop runs, which can be slow. Unpacking the archive once and iterating over the result may help here:

    tempPath=$(mktemp -d) &&                 # mktemp -d creates the directory itself
    tar -zxf file.tar.gz -C "$tempPath" &&
    find "$tempPath" -type f | {
        while read FILENAME
        do
            (if grep -l "string" "$FILENAME"
            then
                echo "$FILENAME contains string"
            fi) &
        done
        wait                                 # let the background greps finish
    } &&
    rm -r "$tempPath"

find is used here to get a list of files in the target directory of tar, which we're iterating over, searching each file for a string.

Edit: Use grep -l to speed things up, as Jim pointed out. From man grep:

    -l, --files-with-matches
           Suppress normal output; instead print the name of each input file from which output would
           normally have been printed.  The scanning will stop on the first match.  (-l is specified
           by POSIX.)
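The unpack-once idea can also be combined with parallel searching without spawning one background job per file. The following is a minimal sketch: `grep_extracted` is a made-up name, the worker count of 4 and batch size of 16 are arbitrary, and `xargs -0 -r -P` assumes GNU findutils.

```shell
# grep_extracted ARCHIVE PATTERN  (hypothetical helper)
# Unpacks the archive once into a temporary directory, then lists the
# files that contain PATTERN, running several greps in parallel.
grep_extracted() {
    local archive="$1" pattern="$2" dir
    dir=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$dir" || { rm -rf "$dir"; return 1; }
    # The subshell keeps the cd from leaking into the caller.
    # xargs returns non-zero when any grep finds nothing; ignore that.
    (
        cd "$dir" &&
        find . -type f -print0 |
            xargs -0 -r -P 4 -n 16 grep -l -e "$pattern" --
    ) || true
    rm -rf "$dir"
}
```

The paths are printed relative to the archive root (for example `./dir/file.txt`), and with several workers the output order is not guaranteed.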

User · Answer

You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual.

Armed with the above information, you could try something like:

    $ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
    bfe2/.bferc
    bfe2/CHANGELOG
    bfe2/README.bferc
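The escaped quotes in that command are easy to get wrong. One way around them, sketched here under the assumption of GNU tar (which runs the command through /bin/sh and exports TAR_FILENAME for each member), is to hand both the awk program and the search string to the child through the environment. `tar_members_containing`, `NEEDLE` and `AWK_PROG` are names invented for this example, and the match is a fixed-string test with `index()` rather than a regex.

```shell
# tar_members_containing ARCHIVE STRING  (hypothetical helper)
# Single pass over the archive; prints each member that contains STRING.
tar_members_containing() {
    local archive="$1"
    # The flag "found" replaces awk's exit: awk keeps reading to the end of
    # the member, so this sketch does not depend on how tar reacts when the
    # command closes its input early.
    NEEDLE="$2" \
    AWK_PROG='!found && index($0, ENVIRON["NEEDLE"]) { print ENVIRON["TAR_FILENAME"]; found = 1 }' \
        tar -xzf "$archive" --to-command='awk "$AWK_PROG"'
}
```

Example call: `tar_members_containing file.tar.gz bar`.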

User · Answer

If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times (where N is the number of files in the archive) for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.

The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.

If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:

    tar zxf file.tar.gz
    for f in hopefullySomeSubdir/*; do
      grep -l "string" "$f"
    done

Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
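The "uncompress first" suggestion can be tried without touching the original archive. This is a rough sketch: `targrep_plain` is a hypothetical name, and the temporary .tar needs as much disk space as the uncompressed archive. gzip runs only once here, but tar still scans the plain archive once per member, as described above.

```shell
# targrep_plain ARCHIVE PATTERN  (hypothetical helper)
# Decompresses ARCHIVE once to a temporary .tar, then prints the name of
# every member that contains PATTERN.
targrep_plain() {
    local archive="$1" pattern="$2" plain member
    plain=$(mktemp) || return 1
    gzip -dc -- "$archive" > "$plain" || { rm -f "$plain"; return 1; }
    tar -tf "$plain" | while IFS= read -r member; do
        case $member in */) continue ;; esac    # skip directory entries
        # grep -q stops at the first match, like grep -l, but prints nothing.
        if tar -Oxf "$plain" "$member" | grep -q -e "$pattern"; then
            printf '%s\n' "$member"
        fi
    done
    rm -f "$plain"
}
```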

User · Answer

All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory for a pattern that is passed as an argument to a reusable script, and output:

- The name of both the archive file and the extracted file
- The line number where the pattern was found
- The contents of the matching line

It's what I was really hoping that zgrep could do for me, and it just can't.

Here's my solution:

    pattern=$1
    for f in *.tar.gz
    do
        echo "$f:"
        tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
    done

You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:

    tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""

Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question are obvious.

- tar -xzf: x extract, z filter through gzip, f based on the following archive file.
- "$f": The archive file provided by the for loop (such as what you'd get by doing an ls), in double quotes to allow the variable to expand and to ensure that the script is not broken by any file names with spaces, etc.
- --to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.

Let's break that part down by itself, since it's the "secret sauce" here.

    'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"

First, we use a single quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.

- grep: The command to be run on the (not actually) extracted files.
- --label=: The label to prepend to the results. Its value is enclosed in double quotes because we do want the shell that runs grep to resolve the $TAR_FILENAME environment variable passed in by the tar command.
- basename $TAR_FILENAME: Runs as a command (surrounded by backticks), removes the directory path, and outputs only the name of the file.
- -Hin: H display the filename (provided by the label), i case-insensitive search, n display the line number of the match.

Then we "end" the first part of the command string with a single quote and start the next part with a double quote so that $pattern, passed in as the first argument, can be resolved.

Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!).

It's been a couple of weeks since I wrote the above and it's still super useful, but it wasn't quite good enough as files have piled up and searching for things has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.

    if [ -z "$1" ]; then
        echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
        echo "Usage: targrep <string to search for> [start date]"
        exit 1
    fi
    pattern=$1
    startdatein=$2
    startdate=$(date -d "$startdatein" +%s)
    for f in *.tar.gz
    do
        filedate=$(date -r "$f" +%s)    # modification time of the archive
        if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]
        then
            echo "$f:"
            tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
        fi
    done

And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too (GNU tar only treats member names as patterns when --wildcards is given).

Usage:

    targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>

Example:

    targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford

    while getopts "d:f:" opt; do
        case $opt in
            d) startdatein=$OPTARG;;
            f) targetfile=$OPTARG;;
        esac
    done
    shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
    pattern=$1

    echo "Searching for: $pattern"
    if [[ -n $targetfile ]]; then
        echo "in filenames:  $targetfile"
    fi

    startdate=$(date -d "$startdatein" +%s)
    for f in *.tar.gz
    do
        filedate=$(date -r "$f" +%s)
        if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]
        then
            echo "$f:"
            if [[ -z "$targetfile" ]]; then
                tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
            else
                tar -xzf "$f" --wildcards --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
            fi
        fi
    done
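Since the quoting was the hard part, here is one possible way to sidestep it, offered as a sketch rather than a replacement for the script above. The pattern is handed to grep through an environment variable, so the whole command can stay inside single quotes and patterns containing spaces survive intact. `targrep_all` and `PATTERN` are names invented for this example; `${TAR_FILENAME##*/}` is a plain sh expansion that strips the directory part the way basename does. It assumes GNU tar and GNU grep.

```shell
# targrep_all PATTERN [DIR]  (hypothetical helper)
# Searches every *.tar.gz in DIR (default: current directory) and prints
# member name, line number and matching line, case-insensitively.
targrep_all() {
    local pattern="$1" dir="${2:-.}" f
    for f in "$dir"/*.tar.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        printf '%s:\n' "$f"
        # PATTERN is only set for this one tar invocation; the shell that
        # tar starts for each member expands it together with TAR_FILENAME.
        PATTERN="$pattern" tar -xzf "$f" \
            --to-command='grep --label="${TAR_FILENAME##*/}" -Hin -e "$PATTERN"; true'
    done
}
```

For example, `targrep_all "maximum retries" /var/backups` would search for the two-word phrase in every archive under that (illustrative) directory.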

User · Answer

Both of the options below work well.

    $ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
    2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService  : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html

    $ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
    2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService  : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html

User · Answer

> I'm trying to grep a pattern in a dozen .tar.gz files, but it's very slow.

That's actually very easy with the ugrep option -z:

    -z, --decompress
            Decompress files to search, when compressed.  Archives (.cpio,
            .pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
            .tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
            matching pathnames of files in archives are output in braces.  If
            -g, -O, -M, or -t is specified, searches files within archives
            whose name matches globs, matches file name extensions, matches
            file signature magic bytes, or matches file types, respectively.
            Supported compression formats: gzip (.gz), compress (.Z), zip,
            bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
            lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).

So a single command is enough to search file.tar.gz:

    ugrep -z "string" file.tar.gz

This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:

    $ ugrep -z "Hello" archive.tgz
    {Hello.bat}:echo "Hello World!"
    Binary file archive.tgz{Hello.class} matches
    {Hello.java}:public class Hello // prints a Hello World! greeting
    {Hello.java}:  { System.out.println("Hello World!");
    {Hello.pdf}:(Hello)
    {Hello.sh}:echo "Hello World!"
    {Hello.txt}:Hello

If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:

    $ ugrep -z Hello -l --format="%z%~" archive.tgz
    Hello.bat
    Hello.class
    Hello.java
    Hello.pdf
    Hello.sh
    Hello.txt
