How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems that grep can't search PDF files.
This question is related to: linux, pdf, full-text-search, grep, debian
You need a tool like pdftotext to first convert your PDF to a text file and then search inside the text. (You will probably lose some information or symbols in the conversion.)
If you are using a programming language, there are probably PDF libraries written for this purpose, e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl.
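As a rough sketch of the convert-then-search idea, the function below pipes the text of each PDF into grep and labels every hit with the file it came from. It assumes pdftotext (from poppler-utils) and GNU grep (for --label) are installed; the name pdf_text_grep is made up for this example, and file names containing newlines are not handled.

```shell
# Sketch: search the extracted text of every PDF under a directory.
# Assumes pdftotext (poppler-utils) and GNU grep (for --label).
# pdf_text_grep is a hypothetical helper name, not an existing tool.
pdf_text_grep() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # "-" sends the extracted text to stdout instead of a .txt file
        pdftotext -q "$pdf" - | grep -H --label="$pdf" -- "$pattern"
    done
}
```

It could be called as pdf_text_grep 'pattern' /some/path; the directory defaults to the current one.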
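I like @sjr's answer, but I prefer xargs to -exec. I find xargs more versatile. For example, with -P we can take advantage of multiple CPUs when it makes sense to do so. Note that grep has to run inside the xargs command so that --label receives the actual file name, and that output from parallel jobs may interleave.
find . -name '*.pdf' -print0 | xargs -0 -P 5 -I % sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "pattern"' sh %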
There is another utility called ripgrep-all (rga), which is based on ripgrep.
It can handle more than just PDF documents, such as Office documents and movies, and the author claims it is faster than pdfgrep.
The first command below searches the current directory recursively; the second limits the search to PDF files only:
rga 'pattern' .
rga --type pdf 'pattern' .
First convert all your pdf files to text files:
for file in *.pdf;do pdftotext "$file"; done
Then use grep as normal. This is especially efficient when you have multiple queries and a lot of PDF files, because the conversion happens only once.
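If the PDFs change over time, the conversion step can be repeated cheaply by skipping files whose text copy is already up to date. This is only a sketch under a few assumptions: pdftotext (poppler-utils) is installed, the .txt files live next to the PDFs, and pdf_refresh_text is a hypothetical name chosen for this example.

```shell
# Sketch: keep a .txt copy next to each PDF and refresh it only when stale.
# Assumes pdftotext (poppler-utils); pdf_refresh_text is a hypothetical name.
pdf_refresh_text() {
    dir=${1:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="${pdf%.pdf}.txt"
        # -nt ("newer than") is not strictly POSIX, but bash, dash and ksh accept it
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext -q "$pdf" "$txt"
        fi
    done
}
```

After running pdf_refresh_text /some/path, each query is a plain recursive grep such as grep -rn --include='*.txt' 'pattern' /some/path (the --include option is available in GNU grep).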
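I made this small, destructive script (it leaves a text file named like "file.pdf." next to every PDF unless you enable the rm line). Have fun with it.
function pdfsearch()
{
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # Extract the text to "<name>.pdf." and search it, ignoring case
        pdftotext -q -enc ASCII7 "$filename" "$filename."; grep -s -H --color=always -i -- "$1" "$filename."
        # remove it! rm -f "$filename."
    done
}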
If you want to see file names with pdftotext, use the following command:
find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf"
Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.
Recoll also comes with a viable command-line interface and a web-browser interface.
Try using 'acroread' in a simple script like the one above.
I had the same problem, so I wrote a script that searches all PDF files in the specified folder for a string and prints the PDF files which matched the query string.
Maybe this will be helpful to you.
You can download it here
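For illustration only (this is not the script mentioned above), a function with the same behaviour might look like the sketch below: it prints just the names of the PDFs whose text contains the query. It assumes pdftotext (poppler-utils) is installed, matches case-insensitively as a fixed string, and uses the made-up name pdf_list_matches.

```shell
# Sketch: print the name of each PDF whose text contains the query string.
# Not the script mentioned above; pdf_list_matches is a hypothetical name.
# Assumes pdftotext (poppler-utils) is installed.
pdf_list_matches() {
    query=$1
    dir=${2:-.}
    find "$dir" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        # -F treats the query as a fixed string, -i ignores case, -q prints nothing
        if pdftotext -q "$pdf" - 2>/dev/null | grep -qiF -- "$query"; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

A call such as pdf_list_matches 'some string' /some/folder prints one matching file per line.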
There is an open-source "common resource grep" tool, crgrep, which searches within PDF files but also other resources such as content nested in archives, database tables, image metadata, POM file dependencies and web resources, and combinations of these, including recursive search.
The full description under the Files tab pretty much covers what the tool supports.
I developed crgrep as an open-source tool.
My current version of pdfgrep (1.3.0) allows the following:
pdfgrep -HiR 'pattern' /path
When doing pdfgrep --help:
It works well on my Ubuntu.
There is pdfgrep, which does exactly what its name suggests.
pdfgrep -R 'a pattern to search recursively from path' /some/path
I've used it for simple searches and it worked fine.
(There are packages in Debian, Ubuntu and Fedora.)
Since version 1.3.0, pdfgrep supports recursive search. This version has been available in Ubuntu since 12.10 (Quantal).
Source: Stackoverflow.com