86

I'm using pdftotext (part of poppler-utils) to convert PDF documents to text. It works, for the most part, but one thing I wish it did was to insert blank lines between separate paragraphs instead of mashing them together.

Is there a way to get pdftotext to do this? And if not, is there another PDF-to-text utility that can?

dan

6 Answers

139

If you are using pdftotext you can use the -layout flag to preserve the layout of the text on the pages in your input pdf file:

pdftotext -layout input.pdf output.txt
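For batch jobs, the same command can be driven from a script. This is only a minimal sketch: the helper names `pdftotext_command` and `convert` are made up for illustration, and it assumes `pdftotext` from poppler-utils is installed and on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to preserve the layout of the text on each page.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path, layout=True):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found; is poppler-utils installed?")
    subprocess.run(pdftotext_command(pdf_path, txt_path, layout), check=True)


if __name__ == "__main__":
    # Show the command that would be run for the example file names.
    print(" ".join(pdftotext_command("input.pdf", "output.txt")))
```

Calling `convert("input.pdf", "output.txt")` would then run the same command as above.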
Noah
28

You could try ebook-convert from Calibre.

If anything, I'd say it errs in the other direction: too many line breaks.
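If the extra line breaks get in the way, they can be squeezed in a post-processing step. The following is a rough sketch, not something ebook-convert itself provides; `collapse_blank_lines` is a hypothetical helper name.

```python
import re


def collapse_blank_lines(text):
    """Reduce any run of blank lines to a single blank line."""
    # Strip trailing spaces/tabs so whitespace-only lines count as blank.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Three or more consecutive newlines mean two or more blank lines; keep one.
    return re.sub(r"\n{3,}", "\n\n", text)


if __name__ == "__main__":
    sample = "First paragraph.\n\n\n\nSecond paragraph.\n \n\nThird paragraph.\n"
    print(collapse_blank_lines(sample), end="")
```

Single blank lines between paragraphs are left alone, so the paragraph structure survives.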

Another thing I'd definitely consider, though, is converting to HTML using pdfreflow and then converting the HTML to TXT.

frabjous
18

As a fan of open source (and automation) I hate to say this, but the best results I just got (on quite a large, complex PDF) were to open it in Adobe Reader, then choose File|Save As Text.

(I am pre-processing for text analysis experiments, not as a reader, but I think my first and second choice would be the same.)

I've been comparing the output side-by-side. My second choice is ebook-convert.

Adobe: left in FF for page breaks, left in page numbers, hasn't converted headings/paragraphs to single lines, but it has fixed hyphens. Junk that was hidden in the PDF did not get output. Correctly got the big capitals at start of sections, e.g. "The", not "T he" or even "T he".

ebook-convert: Left in page numbers, and some hidden junk in header/footer (but no FFs). Converts most paragraphs to be single lines. The ones it missed are double-spaced though! Bullets don't always line up with the text. Correctly got "The" at the start of the chapter.

pdftotext (without -layout): Not bad, bullets line up, but header/footer noise. FFs are in there. Hyphens removed. Worst for start of chapter big letters: "T\n\nhe".

pdftotext (with -layout): Similar, but more indents. "T he" for start of chapter.

pdftohtml >> pdfreflow >> htmltotext: It removed page numbers, but still junk in header/footer. "T he" for start of chapter. Hyphens removed. (It uses multiple lines per paragraph, yet they are not the same line breaks as in the other versions!)
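Whichever tool is used, the leftover form feeds (FF) and page numbers noted above can be cleaned up afterwards. Here is a minimal sketch under the assumption that page numbers sit alone on their own line; `split_pages` and `drop_page_numbers` are illustrative names, and the digit-only test will also drop any legitimate line that is just a number.

```python
def split_pages(text):
    """Split extracted text into pages on the form feed character."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def drop_page_numbers(page):
    """Remove lines that consist only of digits (assumed to be page numbers)."""
    kept = [line for line in page.splitlines() if not line.strip().isdigit()]
    return "\n".join(kept)


if __name__ == "__main__":
    extracted = "Intro text\n1\n\fMore text\n2\n\f"
    for page in split_pages(extracted):
        print(drop_page_numbers(page))
```

Header and footer junk is harder to remove generically, since it depends on the document.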

JinSnow
7

If you have a Google account, you can use Google Drive to upload the PDF and transform it into editable text via 'Open with > Google Docs'.

Dennis
xangua
1

I also tried pypdf and compared it against pdftotext on two documents. It had more linebreaks and split some section names (REFERENCES was R E F E R E N C E S).

pdf2txt output complete garbage.

I often use pdfBox (java) if pdftotext screws up the output. You might give it a try.

Max
0

ebook-convert vs pdftotext concrete minimal example

ebook-convert was previously mentioned by frabjous, and I would like to illustrate it with a minimal example.

The problem with pdftotext from poppler-utils 22.12.0 is that it adds newlines within paragraphs when the paragraph is longer than the PDF page width, e.g. something like:

1:1 In the beginning God created the heaven and
the earth.
1:2 And the earth was without form, and void; and
darkness was upon the face of the deep. And the
Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there
was light.
1:4 And God saw the light, that it was good: and
God divided the light from the darkness.

These extra newlines make the txt files really bad to read on a device like a Kindle.

ebook-convert however overcomes this very well, and produces something like:

1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light from the darkness.

which maintains paragraphs in single lines, regardless of how long the paragraph is, and adds a double newline between paragraphs, and behaves much better on a Kindle.
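To make the difference concrete, here is a rough sketch of turning the wrapped pdftotext-style text into the one-line-per-paragraph form. It is not how ebook-convert works internally; it is a simple heuristic that assumes a paragraph ends on a line finishing with sentence-ending punctuation, so it will mis-split a paragraph whenever a wrapped line happens to end with a full stop. The function name `reflow` and its `terminators` parameter are made up for illustration.

```python
def reflow(text, terminators=(".", "!", "?")):
    """Join wrapped lines into single-line paragraphs separated by blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # An existing blank line always closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        # Heuristic: a line ending in terminal punctuation ends the paragraph.
        if stripped.endswith(terminators):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs) + "\n"


if __name__ == "__main__":
    wrapped = (
        "1:1 In the beginning God created the heaven and\n"
        "the earth.\n"
        "1:2 And the earth was without form, and void; and\n"
        "darkness was upon the face of the deep. And the\n"
        "Spirit of God moved upon the face of the waters.\n"
    )
    print(reflow(wrapped), end="")
```

For real documents a dedicated converter is likely to do better, since it can look at fonts and spacing rather than just punctuation.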

I'm going to test the methods mentioned in other answers with this test PDF, generated from this LibreOffice .odt file:

pdftotext output:

Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1

H1 1 H2 1 1 H2 1 2 First very important paragraph. And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines. Reference to H1 1 on page: 1 https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg

H1 2 H2 2 1 H2 2 2

ebook-convert output:

Title of my file

Table of Contents

H1 1......................................................................................................................................................1

H2 1 1...............................................................................................................................................1

H2 1 2...............................................................................................................................................1

H1 2......................................................................................................................................................1

H2 2 1...............................................................................................................................................1

H2 2 2...............................................................................................................................................1

H1 1

H2 1 1

H2 1 2

First very important paragraph.

And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines.

Reference to H1 1 on page: 1

https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg

H1 2

H2 2 1

H2 2 2

Document Outline

H1 1 H2 1 1

H2 1 2

H1 2 H2 2 1

H2 2 2

The line break aspect was also asked more specifically at: https://unix.stackexchange.com/questions/691579/how-to-convert-pdf-file-to-text-without-breaking-lines

Tested on Ubuntu 23.04, poppler-utils 22.12.0, calibre 6.11.0.