How to convert PDF to text with format kept on Linux?



Much of the formatting in a PDF cannot be represented in plain text, but the relative positions of the text should be preserved where possible. For example, table columns should stay aligned.

The pdftotext tool can convert PDF to text pretty well:

pdftotext – Portable Document Format (PDF) to text converter

with the -layout option:

-layout

Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order.

$ pdftotext -layout file.pdf file.txt

After this, file.txt contains the main text content of the PDF, with the layout kept as closely as possible.
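To apply the same command to several files at once, a small shell function can wrap it. This is only a minimal sketch: the function name pdf_to_text_layout is hypothetical (it is not part of pdftotext), and it assumes each output file should sit next to its PDF with a .txt extension.

```shell
# Hypothetical helper (not part of pdftotext): convert each PDF passed as an
# argument to a .txt file beside it, keeping the physical layout.
pdf_to_text_layout() {
    for pdf in "$@"; do
        # Strip a trailing ".pdf" and append ".txt" (report.pdf -> report.txt).
        out="${pdf%.pdf}.txt"
        # Stop at the first file that fails to convert.
        pdftotext -layout "$pdf" "$out" || return 1
    done
}
```

Called as `pdf_to_text_layout *.pdf`, it would write one .txt file per PDF in the current directory and stop at the first file that pdftotext cannot convert.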

Eric Ma

Eric is a systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

One comment

  1. For Linux users, nothing works better than using Calibre to convert PDF files to DOCX (or any number of other formats). After conversion, clean up the DOCX file in LibreOffice Writer with the Advanced Search and Replace plug-in installed. https://calibre-ebook.com/download_linux
