ebook-convert vs pdftotext: concrete minimal example
ebook-convert was previously mentioned by frabjous, and I would like to illustrate it with a minimal example.
The problem with pdftotext from poppler-utils 22.12.0 is that it inserts a newline wherever a paragraph wraps at the PDF page width, producing something like:
1:1 In the beginning God created the heaven and
the earth.
1:2 And the earth was without form, and void; and
darkness was upon the face of the deep. And the
Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there
was light.
1:4 And God saw the light, that it was good: and
God divided the light from the darkness.
These extra newlines make the txt files hard to read on a device like a Kindle.
ebook-convert, however, handles this well and produces something like:
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light from the darkness.
This keeps each paragraph on a single line, however long it is, and separates paragraphs with a double newline, which reads much better on a Kindle.
I'm going to test the methods mentioned in other answers with this test PDF, generated from this LibreOffice .odt file:
pdftotext output:
Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1
H1 1
H2 1 1
H2 1 2
First very important paragraph.
And now a very very very very very very very very very very very very very very very very very
very very very very very very very long paragraph that gets split across two lines.
Reference to H1 1 on page: 1
https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg
H1 2
H2 2 1
H2 2 2
ebook-convert output:
Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1
H1 1
H2 1 1
H2 1 2
First very important paragraph.
And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines.
Reference to H1 1 on page: 1
https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg
H1 2
H2 2 1
H2 2 2
Document Outline
H1 1 H2 1 1
H2 1 2
H1 2 H2 2 1
H2 2 2
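For anyone who wants to reproduce the comparison, here is a minimal sketch that runs both tools on the same PDF. It assumes the plain `pdftotext input.pdf output.txt` and `ebook-convert input.pdf output.txt` invocations with no extra options; the helper names `build_commands` and `convert` and the output file names are made up for illustration, and this is not necessarily the exact way the outputs above were produced.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf: str) -> dict[str, list[str]]:
    """Return one command line per converter, keyed by tool name.

    The output file names are hypothetical; both tools take
    <input> <output> as positional arguments.
    """
    stem = Path(pdf).stem
    return {
        # poppler-utils: pdftotext <input.pdf> <output.txt>
        "pdftotext": ["pdftotext", pdf, f"{stem}.pdftotext.txt"],
        # calibre: the output format is inferred from the .txt extension
        "ebook-convert": ["ebook-convert", pdf, f"{stem}.ebook-convert.txt"],
    }


def convert(pdf: str) -> None:
    """Run every converter that is installed; skip the ones that are not."""
    for tool, cmd in build_commands(pdf).items():
        if shutil.which(tool) is None:
            print(f"{tool} not found on PATH, skipping")
            continue
        # check=True raises CalledProcessError if the tool exits non-zero
        subprocess.run(cmd, check=True)
```

Calling `convert("test.pdf")` (with `test.pdf` standing in for your own file) should leave two .txt files in the current directory that can then be compared side by side.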
The line break aspect was also asked more specifically at: https://unix.stackexchange.com/questions/691579/how-to-convert-pdf-file-to-text-without-breaking-lines
Tested on Ubuntu 23.04, poppler-utils 22.12.0, calibre 6.11.0.