How to search contents of multiple pdf files

Question

How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems that grep can't search PDF files.

User · Answer

Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.

Recoll also comes with a viable command-line interface and a web-browser interface.

User · Answer

I had the same problem, so I wrote a script that searches all PDF files in a specified folder for a string and prints the PDF files which matched the query string. Maybe this will be helpful to you. You can download it here.

User · Answer

Try using acroread in a simple script like the one above.

User · Answer

There is another utility called ripgrep-all, which is based on ripgrep. It can handle more than just PDF documents (Office documents and movies, for example), and the author claims it is faster than pdfgrep. The first command below searches the current directory recursively; the second limits the search to PDF files only:

    rga 'pattern'
    rga --type pdf 'pattern'

User · Answer

There is pdfgrep, which does exactly what its name suggests.

    pdfgrep -R 'a pattern to search recursively from path' /some/path

I've used it for simple searches and it worked fine.

There are packages in Debian, Ubuntu and Fedora.

Since version 1.3.0, pdfgrep supports recursive search. This version is available in Ubuntu since Ubuntu 12.10 (Quantal).

User · Answer

I made this small, destructive script. Have fun with it!

    function pdfsearch()
    {
        find . -iname '*.pdf' | while IFS= read -r filename
        do
            echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
            # Convert to a temporary text file next to the PDF, then search it
            pdftotext -q -enc ASCII7 "$filename" "$filename."
            grep -s -H --color=always -i "$1" "$filename."
            # remove it!
            rm -f "$filename."
        done
    }

User · Answer

If you want to see file names with pdftotext, use the following command:

    find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf"

User · Answer

I like @sjr's answer; however, I prefer xargs over -exec. I find xargs more versatile. For example, with -P we can take advantage of multiple CPUs when it makes sense to do so.

    find . -name '*.pdf' -print0 | xargs -0 -P 5 -I{} sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "pattern"' sh {}

User · Answer

There is an open-source "common resource grep" tool, crgrep, which searches within PDF files as well as other resources such as content nested in archives, database tables, image metadata, POM file dependencies and web resources, and combinations of these, including recursive search. The full description under the Files tab pretty much covers what the tool supports. I developed crgrep as an open-source tool.

User · Answer

You need a tool like pdf2text to first convert your PDF to a text file and then search inside the text. You will probably miss some information or symbols.

If you are using a programming language, there are probably PDF libraries written for this purpose, e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl.

User · Answer

First convert all your PDF files to text files:

    for file in *.pdf; do pdftotext "$file"; done

Then use grep as normal. This is especially good when you have multiple queries and a lot of PDF files, because each search is fast once the conversion is done.
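The convert-once idea above can be extended to subdirectories and repeated runs. The following is only a sketch under some assumptions: the function names pdf_index and pdf_search and the separate cache directory are hypothetical, and the shell is assumed to support the -nt file test (bash and dash do). It converts each PDF only when its text copy is missing or older than the PDF, then searches the text copies with plain grep.

```shell
#!/bin/sh
# pdf_index SRC CACHE (hypothetical helper)
# Convert every PDF under SRC into a text file under CACHE, mirroring the
# directory layout. PDFs whose text copy is already up to date are skipped.
pdf_index() {
    src=$1
    cache=$2
    find "$src" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        # Path of the PDF relative to SRC, with .txt appended
        txt="$cache/${pdf#"$src"/}.txt"
        mkdir -p "$(dirname "$txt")"
        # Convert when the text copy is missing or older than the PDF
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext -q "$pdf" "$txt"
        fi
    done
}

# pdf_search CACHE PATTERN (hypothetical helper)
# Case-insensitive recursive search of the converted text files.
pdf_search() {
    grep -r -i -H -- "$2" "$1"
}
```

A possible session would be pdf_index ~/papers ~/.cache/pdftext followed by as many pdf_search ~/.cache/pdftext 'pattern' calls as needed; both paths are placeholders. Matches are reported against the .txt copies, so the original PDF name is the reported name minus the cache prefix and the trailing .txt.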

User · Answer

Your distribution should provide a utility called pdftotext:

    find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The "-" is necessary to have pdftotext output to stdout, not to files. The --with-filename and --label= options will put the file name in the output of grep. The optional --color flag is nice and tells grep to output using colors on the terminal.

In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports the -C option for printing lines of context.
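As an illustration of combining pdftotext with ordinary grep options, here is a minimal sketch that adds line numbers and context lines to each match. The function name pdfgrep_ctx and its argument order are hypothetical, and it assumes a grep that supports --label (GNU grep does).

```shell
#!/bin/sh
# pdfgrep_ctx PATTERN [DIR] [CONTEXT_LINES] (hypothetical helper)
# Search every PDF under DIR (default: current directory) and print each
# match with its line number and CONTEXT_LINES lines around it (default: 1).
pdfgrep_ctx() {
    pattern=$1
    dir=${2:-.}
    context=${3:-1}
    find "$dir" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        # "-" sends the extracted text to stdout; --label names the PDF
        # in grep's output instead of "(standard input)"
        pdftotext -q "$pdf" - |
            grep -i -n -C "$context" --with-filename --label="$pdf" -- "$pattern"
    done
}
```

For example, pdfgrep_ctx 'your pattern' /path 2 would show two lines of context around each hit. The line numbers refer to lines of the extracted text, not to pages of the PDF.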

User · Answer

My current version of pdfgrep (1.3.0) allows the following:

    pdfgrep -HiR 'pattern' /path

According to pdfgrep --help:

    H: Print the file name for each match.
    i: Ignore case distinctions.
    R: Search directories recursively.

It works well on my Ubuntu.
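Since pdfgrep is not installed everywhere, one option is a small wrapper that prefers pdfgrep and falls back to pdftotext plus grep. This is only a sketch: the function name search_pdfs is hypothetical, and the fallback assumes a grep that supports --label.

```shell
#!/bin/sh
# search_pdfs PATTERN [DIR] (hypothetical helper)
# Case-insensitive recursive search that prints the file name with each match.
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    if command -v pdfgrep >/dev/null 2>&1; then
        # Same flags as above: file name, ignore case, recursive
        pdfgrep -H -i -R "$pattern" "$dir"
    else
        # Fallback: extract the text of each PDF and search it with grep
        find "$dir" -type f -iname '*.pdf' | while IFS= read -r pdf; do
            pdftotext -q "$pdf" - |
                grep -i --with-filename --label="$pdf" -- "$pattern"
        done
    fi
}
```

It would be called as search_pdfs 'pattern' /path. The two branches may format their output slightly differently, so the wrapper is better suited to interactive use than to scripts that parse the result.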
