[linux] grep from tar.gz without extracting [faster one]

Am trying to grep pattern from dozen files .tar.gz but its very slow

am using

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
        then
                echo "$FILENAME contains string"
        fi
done

This question is related to linux bash grep

The answer is


All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory to find a pattern that is specified as an argument in a reusable script to output:

  • The name of both the archive file and the extracted file
  • The line number where the pattern was found
  • The contents of the matching line

It's what I was really hoping that zgrep could do for me and it just can't.

Here's my solution:

pattern=$1
for f in *.tar.gz; do
     echo "$f:"
     tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done

You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:

tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""

Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question is obvious.

tar -xzf: x extract, z filter through gzip, f based on the following archive file...

"$f": The archive file provided by the for loop (such as what you'd get by doing an ls) in double-quotes to allow the variable to expand and ensure that the script is not broken by any file names with spaces, etc.

--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.

Let's break that part down by itself, since it's the "secret sauce" here.

'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"

First, we use a single-quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.

grep: The command to be run on the (not actually) extracted files

--label=: The label to prepend the results, the value of which is enclosed in double-quotes because we do want to have the grep command resolve the $TAR_FILENAME environment variable passed in by the tar command.

basename $TAR_FILENAME: Runs as a command (surrounded by backticks) and removes directory path and outputs only the name of the file

-Hin: H Display filename (provided by the label), i Case insensitive search, n Display line number of match

Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that the $pattern, passed in as the first argument, can be resolved.

Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)


And it's been a bit a couple of weeks since I wrote the above and it's still super useful... but it wasn't quite good enough as files have piled up and searching for things has gotten more messy. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.

if [ -z "$1" ]; then
    echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
    echo "Usage: targrep <string to search for> [start date]"
fi
pattern=$1
startdatein=$2
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
        echo "$f:"
        tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
    fi
done

And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.

Usage:

targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>

Example:

targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford

while getopts "d:f:" opt; do
    case $opt in
            d) startdatein=$OPTARG;;
            f) targetfile=$OPTARG;;
    esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1

echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
    echo "in filenames:  $targetfile"
fi

startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
            echo "$f:"
            if [[ -z "$targetfile" ]]; then
                    tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
            else
                    tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
            fi
    fi
done

I know this question is 4 years old, but I have a couple different options:

Option 1: Using tar --to-command grep

The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.

tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'

Option 2: Using tar -tzf

The second option is using tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:

targrep () {
    for i in $(tar -tzf "$1"); do
        results=$(tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2")
        echo "$results"
    done
}

Usage:

targrep example.tar.gz "pattern"

For starters, you could start more than one process:

tar -ztf file.tar.gz | while read FILENAME
do
        (if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
        then
                echo "$FILENAME contains string"
        fi) &
done

The ( ... ) & creates a new detached (read: the parent shell does not wait for the child) process.

After that, you should optimize the extracting of your archive. The read is no problem, as the OS should have cached the file access already. However, tar needs to unpack the archive every time the loop runs, which can be slow. Unpacking the archive once and iterating over the result may help here:

local tempPath=`tempfile`
mkdir $tempPath && tar -zxf file.tar.gz -C $tempPath &&
find $tempPath -type f | while read FILENAME
do
        (if grep -l "string" "$FILENAME"
        then
                echo "$FILENAME contains string"
        fi) &
done && rm -r $tempPath

find is used here, to get a list of files in the target directory of tar, which we're iterating over, for each file searching for a string.

Edit: Use grep -l to speed up things, as Jim pointed out. From man grep:

   -l, --files-with-matches
          Suppress normal output; instead print the name of each input file from which output would
          normally have been printed.  The scanning will stop on the first match.  (-l is specified
          by POSIX.)

Both the below options work well.

$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService  : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html

$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService  : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html

You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual. Armed with the above information, you could try something like:

$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc

If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.

The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.

If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:

tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
  grep -l "string" $f
done

Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.


Am trying to grep pattern from dozen files .tar.gz but its very slow

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
        then
                echo "$FILENAME contains string"
        fi
done

That's actually very easy with ugrep option -z:

-z, --decompress
        Decompress files to search, when compressed.  Archives (.cpio,
        .pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
        .tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
        matching pathnames of files in archives are output in braces.  If
        -g, -O, -M, or -t is specified, searches files within archives
        whose name matches globs, matches file name extensions, matches
        file signature magic bytes, or matches file types, respectively.
        Supported compression formats: gzip (.gz), compress (.Z), zip,
        bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
        lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).

Which requires just one command to search file.tar.gz as follows:

ugrep -z "string" file.tar.gz

This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:

$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}:  { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello

If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:

$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt

If you have zgrep you can use

zgrep -a string file.tar.gz

Examples related to linux

grep's at sign caught as whitespace How to prevent Google Colab from disconnecting? "E: Unable to locate package python-pip" on Ubuntu 18.04 How to upgrade Python version to 3.7? Install Qt on Ubuntu Get first line of a shell command's output Cannot connect to the Docker daemon at unix:/var/run/docker.sock. Is the docker daemon running? Run bash command on jenkins pipeline How to uninstall an older PHP version from centOS7 How to update-alternatives to Python 3 without breaking apt?

Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to grep

grep's at sign caught as whitespace cat, grep and cut - translated to python How to suppress binary file matching results in grep Linux find and grep command together Filtering JSON array using jQuery grep() Linux Script to check if process is running and act on the result grep without showing path/file:line How do you grep a file and get the next 5 lines How to grep, excluding some patterns? Fast way of finding lines in one file that are not in another?