I'm trying to use the command line program convert
to take a PDF into an image (JPEG or PNG). Here is one of the PDFs that I'm trying to convert.
I want the program to trim off the excess white-space and return a high enough quality image that the superscripts can be read with ease.
This is my current best attempt. As you can see, the trimming works fine, I just need to sharpen up the resolution quite a bit. This is the command I'm using:
convert -trim 24.pdf -resize 500% -quality 100 -sharpen 0x1.0 24-11.jpg
I've tried to make the following conscious decisions:
-sharpen
(I've tried a range of values)Any suggestions please on getting the resolution of the image in the final PNG/JPEG higher would be greatly appreciated!
This question is related to
pdf
imagemagick
PNG file you attached looks really blurred. In case if you need to use additional post-processing for each image you generated as PDF preview, you will decrease performance of your solution.
2JPEG can convert PDF file you attached to a nice sharpen JPG and crop empty margins in one call:
2jpeg.exe -src "C:\In\*.*" -dst "C:\Out" -oper Crop method:autocrop
It's actually pretty easy to do with Preview on a mac. All you have to do is open the file in Preview and save-as (or export) a png or jpeg but make sure that you use at least 300 dpi at the bottom of the window to get a high quality image.
Linux user here: I tried the convert
command-line utility (for PDF to PNG) and I was not happy with the results. I found this to be easier, with a better result:
pdftk file.pdf cat 3 output page3.pdf
GIMP
Resolution
from 100
to 300
or 600 pixel/in
GIMP
export as PNG (change file extension to .png)Edit:
Added picture, as requested in the Comments
. Convert command used:
convert -density 300 -trim struct2vec.pdf -quality 100 struct2vec.png
GIMP
: imported at 300 dpi (px/in); exported as PNG compression level 3.
I have not used GIMP on the command line (re: my comment, below).
The following python script will work on any Mac (Snow Leopard and upward). It can be used on the command line with successive PDF files as arguments, or you can put in into a Run Shell Script action in Automator, and make a Service (Quick Action in Mojave).
You can set the resolution of the output image in the script.
The script and a Quick Action can be downloaded from github.
#!/usr/bin/python
# coding: utf-8
import os, sys
import Quartz as Quartz
from LaunchServices import (kUTTypeJPEG, kUTTypeTIFF, kUTTypePNG, kCFAllocatorDefault)
resolution = 300.0 #dpi
scale = resolution/72.0
cs = Quartz.CGColorSpaceCreateWithName(Quartz.kCGColorSpaceSRGB)
whiteColor = Quartz.CGColorCreate(cs, (1, 1, 1, 1))
# Options: kCGImageAlphaNoneSkipLast (no trans), kCGImageAlphaPremultipliedLast
transparency = Quartz.kCGImageAlphaNoneSkipLast
#Save image to file
def writeImage (image, url, type, options):
destination = Quartz.CGImageDestinationCreateWithURL(url, type, 1, None)
Quartz.CGImageDestinationAddImage(destination, image, options)
Quartz.CGImageDestinationFinalize(destination)
return
def getFilename(filepath):
i=0
newName = filepath
while os.path.exists(newName):
i += 1
newName = filepath + " %02d"%i
return newName
if __name__ == '__main__':
for filename in sys.argv[1:]:
pdf = Quartz.CGPDFDocumentCreateWithProvider(Quartz.CGDataProviderCreateWithFilename(filename))
numPages = Quartz.CGPDFDocumentGetNumberOfPages(pdf)
shortName = os.path.splitext(filename)[0]
prefix = os.path.splitext(os.path.basename(filename))[0]
folderName = getFilename(shortName)
try:
os.mkdir(folderName)
except:
print "Can't create directory '%s'"%(folderName)
sys.exit()
# For each page, create a file
for i in range (1, numPages+1):
page = Quartz.CGPDFDocumentGetPage(pdf, i)
if page:
#Get mediabox
mediaBox = Quartz.CGPDFPageGetBoxRect(page, Quartz.kCGPDFMediaBox)
x = Quartz.CGRectGetWidth(mediaBox)
y = Quartz.CGRectGetHeight(mediaBox)
x *= scale
y *= scale
r = Quartz.CGRectMake(0,0,x, y)
# Create a Bitmap Context, draw a white background and add the PDF
writeContext = Quartz.CGBitmapContextCreate(None, int(x), int(y), 8, 0, cs, transparency)
Quartz.CGContextSaveGState (writeContext)
Quartz.CGContextScaleCTM(writeContext, scale,scale)
Quartz.CGContextSetFillColorWithColor(writeContext, whiteColor)
Quartz.CGContextFillRect(writeContext, r)
Quartz.CGContextDrawPDFPage(writeContext, page)
Quartz.CGContextRestoreGState(writeContext)
# Convert to an "Image"
image = Quartz.CGBitmapContextCreateImage(writeContext)
# Create unique filename per page
outFile = folderName +"/" + prefix + " %03d.png"%i
url = Quartz.CFURLCreateFromFileSystemRepresentation(kCFAllocatorDefault, outFile, len(outFile), False)
# kUTTypeJPEG, kUTTypeTIFF, kUTTypePNG
type = kUTTypePNG
# See the full range of image properties on Apple's developer pages.
options = {
Quartz.kCGImagePropertyDPIHeight: resolution,
Quartz.kCGImagePropertyDPIWidth: resolution
}
writeImage (image, url, type, options)
del page
One more suggestion is that you can use GIMP.
Just load the PDF file in GIMP->save as .xcf and then you can do whatever you want to the image.
I have used pdf2image. A simple python library that works like charm.
First install poppler on non linux machine. You can just download the zip. Unzip in Program Files and add bin to Machine Path.
After that you can use pdf2image in python class like this:
from pdf2image import convert_from_path, convert_from_bytes
images_from_path = convert_from_path(
inputfile,
output_folder=outputpath,
grayscale=True, fmt='jpeg')
I am not good with python but was able to make exe of it. Later you may use the exe with file input and output parameter. I have used it in C# and things are working fine.
Image quality is good. OCR works fine.
It also gives you good results:
exec("convert -geometry 1600x1600 -density 200x200 -quality 100 test.pdf test_image.jpg");
Personally I like this.
convert -density 300 -trim test.pdf -quality 100 test.jpg
It's a little over twice the file size, but it looks better to me.
-density 300
sets the dpi that the PDF is rendered at.
-trim
removes any edge pixels that are the same color as the corner pixels.
-quality 100
sets the JPEG compression quality to the highest quality.
Things like -sharpen
don't work well with text because they undo things your font rendering system did to make it more legible.
If you actually want it blown up use resize here and possibly a larger dpi value of something like targetDPI * scalingFactor
That will render the PDF at the resolution/size you intend.
Descriptions of the parameters on imagemagick.org are here
I really haven't had good success with convert
[update May 2020: actually: it pretty much never works for me], but I've had EXCELLENT success with pdftoppm
. Here's a couple examples of producing high-quality images from a PDF:
[Produces ~25 MB-sized files per pg] Output uncompressed .tif file format at 300 DPI into a folder called "images", with files being named pg-1.tif, pg-2.tif, pg-3.tif, etc:
mkdir -p images && pdftoppm -tiff -r 300 mypdf.pdf images/pg
[Produces ~1MB-sized files per pg] Output in .jpg format at 300 DPI:
mkdir -p images && pdftoppm -jpeg -r 300 mypdf.pdf images/pg
[Produces ~2MB-sized files per pg] Output in .jpg format at highest quality (least compression) and still at 300 DPI:
mkdir -p images && pdftoppm -jpeg -jpegopt quality=100 -r 300 mypdf.pdf images/pg
https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844.
pdf2searchablepdf
] https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881Please take note before down voting, this solution is for Gimp using a graphical interface, and not for ImageMagick using a command line, but it worked perfectly fine for me as an alternative, and that is why I found it needful to share here.
Follow these simple steps to extract images in any format from PDF documents
That's all.
I hope this helps
I use icepdf an open source java pdf engine. Check the office demo.
package image2pdf;
import org.icepdf.core.exceptions.PDFException;
import org.icepdf.core.exceptions.PDFSecurityException;
import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.Page;
import org.icepdf.core.util.GraphicsRenderingHints;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
public class pdf2image {
public static void main(String[] args) {
Document document = new Document();
try {
document.setFile("C:\\Users\\Dell\\Desktop\\test.pdf");
} catch (PDFException ex) {
System.out.println("Error parsing PDF document " + ex);
} catch (PDFSecurityException ex) {
System.out.println("Error encryption not supported " + ex);
} catch (FileNotFoundException ex) {
System.out.println("Error file not found " + ex);
} catch (IOException ex) {
System.out.println("Error IOException " + ex);
}
// save page captures to file.
float scale = 1.0f;
float rotation = 0f;
// Paint each pages content to an image and
// write the image to file
for (int i = 0; i < document.getNumberOfPages(); i++) {
try {
BufferedImage image = (BufferedImage) document.getPageImage(
i, GraphicsRenderingHints.PRINT, Page.BOUNDARY_CROPBOX, rotation, scale);
RenderedImage rendImage = image;
try {
System.out.println(" capturing page " + i);
File file = new File("C:\\Users\\Dell\\Desktop\\test_imageCapture1_" + i + ".png");
ImageIO.write(rendImage, "png", file);
} catch (IOException e) {
e.printStackTrace();
}
image.flush();
}catch(Exception e){
e.printStackTrace();
}
}
// clean up resources
document.dispose();
}
}
I've also tried imagemagick and pdftoppm, both pdftoppm and icepdf has a high resolution than imagemagick.
I have found it both faster and more stable when batch-processing large PDFs into PNGs and JPGs to use the underlying gs
(aka Ghostscript) command that convert
uses.
You can see the command in the output of convert -verbose
and there are a few more tweaks possible there (YMMV) that are difficult / impossible to access directly via convert
.
However, it would be harder to do your trimming and sharpening using gs
, so, as I said, YMMV!
In ImageMagick, you can do "supersampling". You specify a large density and then resize down as much as desired for the final output size. For example with your image:
convert -density 600 test.pdf -background white -flatten -resize 25% test.png
Download the image to view at full resolution for comparison..
I do not recommend saving to JPG if you are expecting to do further processing.
If you want the output to be the same size as the input, then resize to the inverse of the ratio of your density to 72. For example, -density 288 and -resize 25%. 288=4*72 and 25%=1/4
The larger the density the better the resulting quality, but it will take longer to process.
You can do it in LibreOffice Draw (which is usually preinstalled in Ubuntu):
normally I extract the embedded image with 'pdfimages' at the native resolution, then use ImageMagick's convert to the needed format:
$ pdfimages -list fileName.pdf
$ pdfimages fileName.pdf fileName # save in .ppm format
$ convert fileName-000.ppm fileName-000.png
this generate the best and smallest result file.
Note: For lossy JPG embedded images, you had to use -j:
$ pdfimages -j fileName.pdf fileName # save in .jpg format
With recent poppler you can use -all that save lossy as jpg and lossless as png
On little provided Win platform you had to download a recent (0.37 2015) 'poppler-util' binary from: http://blog.alivate.com.au/poppler-windows/
Use this commandline:
convert -geometry 3600x3600 -density 300x300 -quality 100 TEAM\ 4.pdf team4.png
This should correctly convert the file as you've asked for.
I use pdftoppm
on the command line to get the initial image, typically with a resolution of 300dpi, so pdftoppm -r 300
, then use convert
to do the trimming and PNG conversion.
Source: Stackoverflow.com