808

Do you have any idea how to extract a part of a PDF document and save it as PDF? On OS X it is absolutely trivial by using Preview. I tried PDF editor and other programs but to no avail.

I would like a program where I select the part that I want and then save it as a PDF file with a simple command like CMD+N on OS X. I want the extracted part to be saved as PDF and not as JPEG, etc.

user72469

20 Answers

835

pdftk is a useful multi-platform tool for the job (pdftk homepage).

pdftk full-pdf.pdf cat 12-15 output outfile_p12-15.pdf

You pass the filename of the main PDF, then tell pdftk to include only certain pages (12-15 in this example) and write them to a new file.
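
A minimal sketch of a wrapper around the command above, assuming pdftk is on the PATH. The function names is_page_number and extract_range are hypothetical and not part of pdftk; the wrapper only rejects obviously bad ranges before handing them to pdftk.

```shell
#!/bin/sh
# Hypothetical helpers (not part of pdftk): validate a page range, then extract it.

# Succeeds only if $1 is a positive integer without leading zeros.
is_page_number() {
    case "$1" in
        ''|0*|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage: extract_range input.pdf FIRST LAST output.pdf
extract_range() {
    is_page_number "$2" && is_page_number "$3" || {
        echo "page numbers must be positive integers" >&2
        return 2
    }
    if [ "$2" -gt "$3" ]; then
        echo "first page must not be greater than last page" >&2
        return 2
    fi
    # Same invocation as above, with the range built from the arguments.
    pdftk "$1" cat "$2-$3" output "$4"
}
```

For example, extract_range full-pdf.pdf 12 15 outfile_p12-15.pdf runs the same pdftk command as shown above.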

Installation instructions:

To install the snap version, which is an unofficial repackaging of an old version of PDFtk (repackaged by Scott Moser), visit this link or run:

sudo snap install pdftk

Alternatively, you can install an open source port of PDFtk to Java by Marc Vinyals, by running:

sudo apt install pdftk-java

Another alternative is PDFtk Server, available from the website: https://www.pdflabs.com/tools/pdftk-server/ . This version is free of charge for personal use, but it is not open source.

Flimm
Martin H
387

Very simple. Use the default PDF reader, select "Print To File" and that's it!

print menu

Then:

setting new PDF

Note that with this method the text may no longer be searchable: depending on the viewer and the print backend, "Print" can convert all text to images.

305

QPDF is great. Use it this way to extract pages 1 to 10 from input.pdf and save them as output.pdf:

qpdf input.pdf --pages . 1-10 -- output.pdf

This preserves all metadata associated with that file.

From the old manual:

If you wanted pages 1 through 5 from infile.pdf but you wanted the rest of the metadata to be dropped, you could instead run

qpdf --empty --pages infile.pdf 1-5 -- outfile.pdf

Here's a link to the current documentation, giving more examples of page selection.
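
A minimal sketch, assuming qpdf is installed, of how the page count can be checked before extracting. The helper names min and extract_clamped are hypothetical; the only qpdf features used are --show-npages (which prints the number of pages) and the --pages syntax shown above.

```shell
#!/bin/sh
# Hypothetical helpers; only `qpdf --show-npages` and `--pages` come from qpdf itself.

# Print the smaller of two integers.
min() {
    if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Usage: extract_clamped input.pdf FIRST LAST output.pdf
# Extracts FIRST..LAST, but never asks for pages beyond the end of the file.
extract_clamped() {
    total=$(qpdf --show-npages "$1") || return 1
    last=$(min "$3" "$total")
    qpdf "$1" --pages . "$2-$last" -- "$4"
}
```

Calling extract_clamped input.pdf 1 10 output.pdf on a seven-page file would then request pages 1-7 instead of failing on a range that is too long.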


You can install it by invoking:

sudo apt-get install qpdf

It is a great tool for PDF manipulation. It's very fast and has very few dependencies. From QPDF's GitHub repo:

QPDF is a command-line tool and C++ library that performs content-preserving transformations on PDF files. It supports linearization, encryption, and numerous other features. It can also be used for splitting and merging files, creating PDF files (but you have to supply all the content yourself), and inspecting files for study or analysis.

Ho1
105

Page range - Nautilus script


Overview

I created a slightly more advanced script based on the tutorial @ThiagoPonte linked to. Its key features are

  • that it's GUI based,
  • compatible with spaces in file names,
  • and based on three different backends that are capable of preserving all attributes of the original file

Screenshot

enter image description here

Code

#!/bin/bash
#
# TITLE:        PDFextract
#
# AUTHOR:       (c) 2013-2015 Glutanimate (https://github.com/Glutanimate)
#
# VERSION:      0.2
#
# LICENSE:      GNU GPL v3 (http://www.gnu.org/licenses/gpl.html)
# 
# OVERVIEW:     PDFextract is a simple PDF extraction script based on Ghostscript/qpdf/cpdf.
#               It provides a simple way to extract a page range from a PDF document and is meant
#               to be used as a file manager script/addon (e.g. Nautilus script).
#
# FEATURES:     - simple GUI based on YAD, an advanced Zenity fork.
#               - preserves _all_ attributes of your original PDF file and does not compress 
#                 embedded images further than they are.      
#               - can choose from three different backends: ghostscript, qpdf, cpdf
#
# DEPENDENCIES: ghostscript/qpdf/cpdf poppler-utils yad libnotify-bin
#                         
#               You need to install at least one of the three backends supported by this script.
#
#               - ghostscript, qpdf, poppler-utils, and libnotify-bin are available via 
#                 the standard Ubuntu repositories
#               - cpdf is a commercial CLI PDF toolkit that is free for personal use.
#                 It can be downloaded here: https://github.com/coherentgraphics/cpdf-binaries
#               - yad can be installed from the webupd8 PPA with the following command:
#                 sudo add-apt-repository ppa:webupd8team/y-ppa-manager && sudo apt-get update && sudo apt-get install yad
#
# NOTES:        Here is a quick comparison of the advantages and disadvantages of each backend:
#
#                               speed     metadata preservation     content preservation        license
#               ghostscript:     --               ++                         ++               open-source
#               cpdf:             -               ++                         ++               proprietary
#               qpdf:            ++                +                         ++               open-source
#
#               Results might vary depending on the document and the version of the tool in question.
#
# INSTALLATION: https://askubuntu.com/a/236415
#
# This script was inspired by Kurt Pfeifle's PDF extraction script 
# (http://www.linuxjournal.com/content/tech-tip-extract-pages-pdf)
#
# Originally posted on askubuntu
# (https://askubuntu.com/a/282453)

# Variables

DOCUMENT="$1"
BACKENDSELECTION="^qpdf!ghostscript!cpdf"

# Functions

check_input(){
  if [[ -z "$1" ]]; then
    notify "Error: No input file selected."
    exit 1
  elif [[ ! "$(file -ib "$1")" == *application/pdf* ]]; then
    notify "Error: Not a valid PDF file."
    exit 1
  fi
}

check_deps () {
  for i in "$@"; do
    type "$i" > /dev/null 2>&1
    if [[ "$?" != "0" ]]; then
      MissingDeps+="$i"
    fi
  done
}

ghostscriptextract(){
  gs -dFirstPage="$STARTPAGE" -dLastPage="$STOPPAGE" -sOutputFile="$OUTFILE" \
     -dSAFER -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -sDEVICE=pdfwrite \
     -dCompressFonts=true -c \
  ".setpdfwrite << /EncodeColorImages true /DownsampleMonoImages false /SubsetFonts true /ASCII85EncodePages false /DefaultRenderingIntent /Default /ColorConversionStrategy
  /LeaveColorUnchanged /MonoImageDownsampleThreshold 1.5 /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.4 /Blend 1 >> /GrayACSImageDict
  << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.4 /Blend 1 >> /PreserveOverprintSettings false /MonoImageResolution 300 /MonoImageFilter /FlateEncode
  /GrayImageResolution 300 /LockDistillerParams false /EncodeGrayImages true /MaxSubsetPCT 100 /GrayImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor
  0.4 /Blend 1 >> /ColorImageFilter /FlateEncode /EmbedAllFonts true /UCRandBGInfo /Remove /AutoRotatePages /PageByPage /ColorImageResolution 300 /ColorImageDict <<
  /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.4 /Blend 1 >> /CompatibilityLevel 1.7 /EncodeMonoImages true /GrayImageDownsampleThreshold 1.5
  /AutoFilterGrayImages false /GrayImageFilter /FlateEncode /DownsampleGrayImages false /AutoFilterColorImages false /DownsampleColorImages false /CompressPages true
  /ColorImageDownsampleThreshold 1.5 /PreserveHalftoneInfo false >> setdistillerparams" \
  -f "$DOCUMENT"
}

cpdfextract(){
  cpdf "$DOCUMENT" "$STARTPAGE-$STOPPAGE" -o "$OUTFILE"
}

qpdfextract(){
  qpdf --linearize "$DOCUMENT" --pages "$DOCUMENT" "$STARTPAGE-$STOPPAGE" -- "$OUTFILE"
  echo "$OUTFILE"
  return 0 # even benign qpdf warnings produce error codes, so we suppress them
}

notify(){
  echo "$1"
  notify-send -i application-pdf "PDFextract" "$1"
}

dialog_warning(){
  echo "$1"
  yad --center --image dialog-warning \
      --title "PDFExtract Warning" \
      --text "$1" \
      --button="Try again:0" \
      --button="Exit:1"

  [[ "$?" != "0" ]] && exit 0
}

dialog_settings(){
  PAGECOUNT=$(pdfinfo "$DOCUMENT" | grep Pages | sed 's/[^0-9]*//') # determine page count

  SETTINGS=($( \
      yad --form --width 300 --center \
          --window-icon application-pdf --image application-pdf \
          --separator=" " --title="PDFextract" \
          --text "Please choose the page range and backend" \
          --field="Start:NUM" "1[!1..$PAGECOUNT[!1]]" --field="End:NUM" "$PAGECOUNT[!1..$PAGECOUNT[!1]]" \
          --field="Backend":CB "$BACKENDSELECTION" \
          --button="gtk-ok:0" --button="gtk-cancel:1" \
      ))

  SETTINGSRET="$?"

  [[ "$SETTINGSRET" != "0" ]] && exit 1

  STARTPAGE=$(printf %.0f ${SETTINGS[0]}) # round numbers and store array in variables
  STOPPAGE=$(printf %.0f ${SETTINGS[1]})
  BACKEND="${SETTINGS[2]}"
  EXTRACTOR="${BACKEND}extract"

  check_deps "$BACKEND"

  if [[ -n "$MissingDeps" ]]; then
    dialog_warning "Error, missing dependency: $MissingDeps"
    unset MissingDeps
    dialog_settings
    return
  fi

  if [[ "$STARTPAGE" -gt "$STOPPAGE" ]]; then
    dialog_warning "<b> Start page higher than stop page. </b>"
    dialog_settings
    return
  fi

  OUTFILE="${DOCUMENT%.pdf} (p${STARTPAGE}-p${STOPPAGE}).pdf"
}

extract_pages(){
  $EXTRACTOR
  EXTRACTORRET="$?"
  if [[ "$EXTRACTORRET" = "0" ]]; then
    notify "Pages $STARTPAGE to $STOPPAGE successfully extracted."
  else
    notify "There has been an error. Please check the CLI output."
  fi
}

# Main

check_input "$1"
dialog_settings
extract_pages

Installation

Please follow the generic installation instructions for Nautilus scripts. Make sure to read the script header carefully as it will help to clarify the installation and usage of the script.


Partial pages - PDF Arranger


Overview

PDF Arranger is a small python-gtk application that helps the user merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface. It is the successor of PDF-Shuffler and a frontend for pikepdf.

Installation

sudo apt-get install pdfarranger

Usage

PDF Arranger can crop and delete single PDF pages. You can use it to extract a page range from a document or even partial pages using the cropping function:

enter image description here


Page elements - Inkscape


Overview

Inkscape is a very powerful open-source vector graphics editor. It supports a wide range of different formats, including PDF files. You can use it to extract, modify and save page elements from a PDF file.

Installation

sudo apt-get install inkscape

Usage

1.) Open the PDF file of your choice with Inkscape. An import dialog will appear. Choose the page you want to extract elements from. Leave the other settings as they are:

enter image description here

2.) In Inkscape click and drag to select the element(s) you want to extract:

enter image description here

3.) Invert the selection with ! and delete the selected object with DELETE:

enter image description here

4.) Crop the document to the remaining objects by accessing the Document Properties dialog with CTRL+SHIFT+D and selecting "fit document to image":

enter image description here

5.) Save the document as a PDF file from the File --> Save as dialog.

6.) If there are bitmap/raster images in your cropped document you can set their DPI in the dialog that appears next:

enter image description here

7.) If you followed all steps you will have produced a true PDF file that only consists of the objects of your choice:

enter image description here

Glutanimate
73

Save this as a shell script, like pdfextractor.sh:

#!/bin/bash
# this function uses 3 arguments:
#     $1 is the first page of the range to extract
#     $2 is the last page of the range to extract
#     $3 is the input file
#     output file will be named "inputfile_pXX-pYY.pdf"
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
   -dFirstPage="${1}" \
   -dLastPage="${2}" \
   -sOutputFile="${3%.pdf}_p${1}-p${2}.pdf" \
   "${3}"

To run type:

./pdfextractor.sh 4 20 myfile.pdf
  1. 4 is the first page of the new PDF.

  2. 20 is the last page of the new PDF.

  3. myfile.pdf is the PDF file you want to extract pages from.

The output is myfile_p4-p20.pdf, in the same directory as the original PDF file.

All this and more information here: Tech Tip

pomsky
ThiagoPonte
57

On any system where a TeX distribution is installed:

pdfjam <input file> <page ranges> -o <output file>

For example:

pdfjam original.pdf 5-10 -o out.pdf

See https://tex.stackexchange.com/a/79626/8666

0 _
51

There is a command line utility called pdfseparate.

From the docs:

pdfseparate sample.pdf sample-%d.pdf

extracts all pages from sample.pdf, if i.e. sample.pdf has 3 pages, it produces

sample-1.pdf, sample-2.pdf, sample-3.pdf

Or, to select a single page (in this case, the first page) from the file sample.pdf:

pdfseparate -f 1 -l 1 sample.pdf sample-1.pdf
jdmcbr
20

pdftk (sudo apt-get install pdftk) is a great command-line tool for PDF manipulation. Here are some examples of what pdftk can do:

   Collate scanned pages
     pdftk A=even.pdf B=odd.pdf shuffle A B output collated.pdf
     or if odd.pdf is in reverse order:
     pdftk A=even.pdf B=odd.pdf shuffle A Bend-1 output collated.pdf

   Join in1.pdf and in2.pdf into a new PDF, out1.pdf
     pdftk in1.pdf in2.pdf cat output out1.pdf
     or (using handles):
     pdftk A=in1.pdf B=in2.pdf cat A B output out1.pdf
     or (using wildcards):
     pdftk *.pdf cat output combined.pdf

   Remove page 13 from in1.pdf to create out1.pdf
     pdftk in1.pdf cat 1-12 14-end output out1.pdf
     or:
     pdftk A=in1.pdf cat A1-12 A14-end output out1.pdf

   Burst a single PDF document into pages and dump its data to
   doc_data.txt
     pdftk in.pdf burst

   Rotate the first PDF page to 90 degrees clockwise
     pdftk in.pdf cat 1east 2-end output out.pdf

   Rotate an entire PDF document to 180 degrees
     pdftk in.pdf cat 1-endsouth output out.pdf

In your case, I would do:

     pdftk A=input.pdf cat A<page_range> output output.pdf
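
The "Remove page 13" example above can be generalized. This is a minimal sketch, assuming pdftk is installed and that the page to drop is not the last page of the document; ranges_without_page and drop_page are hypothetical names, not pdftk features.

```shell
#!/bin/sh
# Hypothetical helper: build the pdftk "cat" ranges that keep everything except page $1.
# Assumption: page $1 is not the last page; "end" is pdftk's keyword for the last page.
ranges_without_page() {
    n=$1
    if [ "$n" -le 1 ]; then
        echo "2-end"
    else
        echo "1-$((n - 1)) $((n + 1))-end"
    fi
}

# Usage: drop_page input.pdf PAGE output.pdf
drop_page() {
    # Word splitting of the unquoted substitution is intended: each range is one argument.
    # shellcheck disable=SC2046
    pdftk "$1" cat $(ranges_without_page "$2") output "$3"
}
```

With these definitions, drop_page in1.pdf 13 out1.pdf runs pdftk in1.pdf cat 1-12 14-end output out1.pdf.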
14

I was trying to do the same. All you have to do is:

  1. install pdftk:

    sudo apt-get install pdftk
    
  2. if you want to extract individual pages:

    pdftk myoldfile.pdf cat 1 2 4 5 output mynewfile.pdf
    
  3. if you want to extract a range:

    pdftk myoldfile.pdf cat 1-2 4-5 output mynewfile.pdf
    

Please check the source for more information.

David Foerster
theCode
8

If you prefer the command-line tools from poppler-utils, which are usually already installed on Ubuntu, then pdfseparate and pdfunite are for you.

pdfseparate sample.pdf sample-%d.pdf
# ls; sample.pdf sample-1.pdf sample-2.pdf sample-3.pdf sample-4.pdf

pdfunite sample-2.pdf sample-3.pdf output.pdf
# now you can use output.pdf
8

Have you tried PDF Mod?

You can, for example, extract pages and save them as PDF.

Description:

PDF Mod is a simple tool for modifying PDF documents. It can rotate, extract, remove
and reorder pages via drag and drop. Multiple documents may be combined via drag
and drop. You may also edit the title, subject, author and keywords of a PDF
document using PDF Mod.

sudo apt install pdfmod

Screenshot

Hope this is useful.

Regards.

Flimm
Roman Raguet
7

mutool, which comes with mupdf, can do a lot of simple PDF processing and has a more elegant syntax than qpdf (and some of the tools in the other answers). Additionally, it seems faster on big PDFs:

# extract page range 20-40
mutool clean in.pdf out.pdf 20-40
# extract from all over the pdf
mutool clean in.pdf out.pdf '1, 3-4, 74-92'
rien333
6

As it turns out, I can do it with imagemagick. If you don't have it, install simply with:

sudo apt-get install imagemagick

Note 1: I've tried this with a one-page pdf (I'm learning to use imagemagick, so I didn't want more trouble than necessary). I don't know if/how it will work with multiple pages, but you can extract one page of interest with pdftk:

pdftk A=myfile.pdf cat A1 output page1.pdf

where you indicate the page number to be split out (in the example above, A1 selects the first page).

Note 2: The resulting image using this procedure will be a raster.


Open the pdf with the command display, which is part of the imagemagick suite:

display file.pdf

Mine looked like this:

imagemagick display of a pdf

Now click on the window and a menu will pop up at the side. There, select Transform | Crop.

imagemagick transform>crop menu

Back in the main window, you can select the area you want to crop by simply dragging the pointer (classic corner-to-corner selection).

selection of area to crop
Notice the hand-shaped pointer around the image while selecting

This selection can be refined before proceeding to the next step.

Once you are done, take notice of the little rectangle that appears on the upper left corner (see the image above). It shows the dimensions of the area selected first (e.g. 281x218) and second the coordinates of the first corner (e.g. +256+215).

Write down the dimensions of the selected area; you'll need them when saving the cropped image.

Now, back at the pop-up menu (which now is the specific "crop" menu), click the button Crop.

imagemagick crop menu

Finally, once you are satisfied with the results of cropping, click on menu File | Save

Navigate to the folder where you want to save the cropped pdf, type a name, click the button Format, on the "Select image format type" window select PDF and click the button Select. Back on the "Browse and select a file" window, click the button Save.

imagemagick save as pdf

Before saving, imagemagick will ask to "select page geometry". Here, you type the dimensions of your cropped image, using a simple letter "x" to separate width and height.

imagemagick select page geometry

Now, you can do all this perfectly well from the command line (the command is convert with the option -crop). It is surely faster, but you have to know beforehand the coordinates of the area you want to extract. Check man convert and an example on their web page.
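
A minimal sketch of the command-line route, assuming ImageMagick's convert is installed and allowed to read PDF files. The helper names crop_geometry and crop_pdf_page are hypothetical; -crop and +repage are regular convert options, and the numbers are the dimensions and corner coordinates noted in the display window.

```shell
#!/bin/sh
# Hypothetical helper: format width, height and offsets as an ImageMagick geometry.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Usage: crop_pdf_page input.pdf WIDTH HEIGHT X Y output.pdf
# The result is a raster image, exactly as with the interactive procedure.
crop_pdf_page() {
    convert "$1" -crop "$(crop_geometry "$2" "$3" "$4" "$5")" +repage "$6"
}
```

For the selection in the example above, crop_pdf_page page1.pdf 281 218 256 215 cropped.pdf passes the geometry 281x218+256+215 to convert.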

carnendil
5

Tested on Ubuntu 20.04 with pdftk --version 3.0.9 from May 11, 2018 (date shown at the bottom of man pdftk).

If using pdftk, here's how to format it for multiple groups of pages:

pdftk in.pdf cat 13 18 33-36 39-41 52 output out.pdf

This will capture those groups of pages, inclusive.
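
As a sanity check, the number of pages such a list selects can be computed up front and compared with what pdfinfo reports for out.pdf. This is a minimal sketch with a hypothetical helper, count_selected_pages, that handles only plain page numbers and ascending N-M ranges.

```shell
#!/bin/sh
# Hypothetical helper: count how many pages a list such as "13 18 33-36 39-41 52" selects.
# Only plain numbers and ascending N-M ranges are handled (no "end", no rotations).
count_selected_pages() {
    total=0
    for group in "$@"; do
        first=${group%-*}   # text before the dash (or the whole word)
        last=${group#*-}    # text after the dash (or the whole word)
        total=$((total + last - first + 1))
    done
    echo "$total"
}
```

For the command above, count_selected_pages 13 18 33-36 39-41 52 prints 10.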

To install and/or update pdftk:

sudo apt update
sudo apt install pdftk

Related:

  1. [my answer] How to rotate PDF pages: https://unix.stackexchange.com/questions/394065/command-line-how-do-you-rotate-a-pdf-file-90-degrees/634882#634882
Gabriel Staples
3

Unfortunately, Ubuntu does not provide a single command to do that directly.

But you can use pdfseparate and pdfunite in conjunction (both come by default with Ubuntu).

So if you want to extract pages 32 to 65 of sourcefile.pdf into a new file called extract.pdf, you can type these commands:

mkdir tmppdfdir
pdfseparate -f 32 -l 65 sourcefile.pdf tmppdfdir/page-%d.pdf
pdfunite tmppdfdir/page*.pdf extract.pdf
rm -rf tmppdfdir/

Warning: make sure that tmppdfdir does not already exist before you run these commands!
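
A minimal sketch of a variant that avoids both the name clash and an ordering pitfall: the shell expands page*.pdf in lexical order, so for a range such as 8-12 the file page-10.pdf would be merged before page-8.pdf. The function names page_files and extract_pages are hypothetical; only pdfseparate, pdfunite and mktemp are real commands.

```shell
#!/bin/sh
# Hypothetical helpers; only pdfseparate, pdfunite and mktemp are real commands.

# Print the per-page file names for pages $2..$3 of directory $1, in numeric order.
page_files() {
    p=$2
    while [ "$p" -le "$3" ]; do
        echo "$1/page-$p.pdf"
        p=$((p + 1))
    done
}

# Usage: extract_pages source.pdf FIRST LAST extract.pdf
extract_pages() {
    tmpdir=$(mktemp -d) || return 1   # fresh directory, so no existing files are touched
    status=0
    pdfseparate -f "$2" -l "$3" "$1" "$tmpdir/page-%d.pdf" || status=$?
    if [ "$status" -eq 0 ]; then
        # Unquoted on purpose so that each file name becomes one argument;
        # this assumes the temporary directory path contains no whitespace.
        # shellcheck disable=SC2046
        pdfunite $(page_files "$tmpdir" "$2" "$3") "$4" || status=$?
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

The equivalent of the commands above would be extract_pages sourcefile.pdf 32 65 extract.pdf.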

doom
3

PDF Split and Merge is quite useful for this and other PDF manipulation operations.

Download from here

To Do
2

ThiagoPonte's Ghostscript answer is great for its portability, but it does not explain how to use a discontinuous page list, such as 2, 6, 7, 8, 9, 11. That is possible with -sPageList:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -sPageList=2,6-9,11 -sOutputFile=out.pdf in.pdf

However, I could not get it to work on older versions of Ghostscript. For that case, inspired by a Stack Overflow question, I created this shell script, which relies only on -dFirstPage and -dLastPage:

#!/bin/sh -f
if [ "$#" != 2 ] && [ "$#" != 3 ]; then
    >&2 echo "Usage: $0 pagelist infile [outfile]"
    exit 11
fi
range=$1
infile=$2
outfile=${3-"${2%pdf}"out.pdf}
set --
IFS=,
for i in $range; do
    set -- "$@" "-dFirstPage=${i%-*}" "-dLastPage=${i#*-}" "$infile"
done
gs -sOutputFile="$outfile" -sDEVICE=pdfwrite -dNOPAUSE -dBATCH "$@"

You can save it in a PATH directory such as /usr/local/bin/, make it executable with chmod +x scriptname and then just call

scriptname 2,6-9,11 in.pdf out.pdf
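
To see what the loop in the script does with the page list, here is a minimal sketch that only prints the first and last page of each entry instead of calling gs. The function name split_pagelist is hypothetical; the parameter expansions are the same ones the script uses.

```shell
#!/bin/sh
# Hypothetical helper: print "FIRST LAST" for every entry of a list such as 2,6-9,11.
split_pagelist() {
    old_ifs=$IFS
    IFS=,
    set -f                      # no globbing while the list is split
    for entry in $1; do
        # ${entry%-*} drops a trailing "-N"; ${entry#*-} drops a leading "N-".
        # A single page has no dash, so both expansions return it unchanged.
        echo "${entry%-*} ${entry#*-}"
    done
    set +f
    IFS=$old_ifs
}
```

For 2,6-9,11 this prints the pairs "2 2", "6 9" and "11 11", one per line, which the script turns into one -dFirstPage/-dLastPage pair per entry.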
Quasímodo
2

LibreOffice Draw

LibreOffice is able to edit PDFs and create new ones, and that allows you to easily and interactively select page ranges with your mouse.

This way you can easily inspect which pages you want visually and then immediately move them around without having to note the page number and go to the command line.

Another advantage is that almost every Linux desktop user will already have the core LibreOffice packages installed, but if for some reason you don't:

sudo apt install libreoffice

Then, just open the PDF, e.g. from the CLI:

libreoffice raymond.pdf

or just open LibreOffice from the GUI and then File > Open.

Once opened, you will see:

enter image description here

Now to split the document up:

  1. Select a page range from the page index on the left. You can use the usual range shortcuts:

    • Shift for range endpoint
    • Ctrl to select or unselect individual items
  2. Ctrl + C or Right click > Copy

  3. Ctrl + N or File > New to create a new document

  4. Ctrl + V on new document to paste from the old document

  5. No to "Do you want to scale the copied objects to fit the new page size?"

  6. Shift + Delete or Right click > Delete Page to remove the pre-existing initial blank page from the new document

  7. File > Export As > Export Directly as PDF.

    If you are doing this a lot, you might want to assign a keyboard shortcut to this option under Tools > Customize > Keyboard.

The PDF shown in the above examples is: http://users.ece.utexas.edu/~perry/education/382v-s08/papers/raymond.pdf which is a rendering of The Cathedral and the Bazaar by Eric S. Raymond.

Tested on Ubuntu 22.04, LibreOffice Draw 7.3.4.

2

If you want to extract pages in a WYSIWYG way, then pdfarranger offers a nice and simple GUI. It can also join PDFs, and you can rotate or crop pages.

PDF Arranger

Worked great on a TurboTax file, which pdfmod couldn't open.

The only (tiny) downside was that it couldn't export some metadata and the table of contents was lost, but I verified that the exported selection included all pages I had selected, so the tool did its job.

1

Apache PDFBox is an open source Java tool for working with PDF documents. It comes with command-line tools that can split pages from a PDF, among many other things (see manual here).

To use it, simply download the pdfbox-app-2.?.?.jar and execute a command like:

java -jar pdfbox-app-2.0.20.jar PDFSplit -startPage 1 -endPage 10 -outputPrefix ch1 book.pdf
pomsky