How to convert PDF to text with format kept on Linux?
Much of the formatting in a PDF cannot be carried over to plain text, but the relative positions of the text should at least be preserved. For example, table columns should stay aligned.
The pdftotext tool can convert PDF to text pretty well:
pdftotext – Portable Document Format (PDF) to text converter
with the -layout option:
-layout
Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation,
etc.) and output the text in reading order.
$ pdftotext -layout file.pdf file.txt
After this, file.txt will contain the text of the PDF's main content, with the layout kept as closely as possible.
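To apply the same command to several files, a small loop can help. This is a minimal sketch, assuming pdftotext is installed and on the PATH; the function name convert_pdfs_layout is hypothetical and not part of pdftotext.

```shell
#!/bin/sh
# convert_pdfs_layout: hypothetical helper (not part of pdftotext).
# Converts every *.pdf in a directory to a .txt file next to it,
# using pdftotext's -layout option to keep the physical layout.
convert_pdfs_layout() {
    dir="${1:-.}"    # directory to scan; defaults to the current one
    for pdf in "$dir"/*.pdf; do
        # If the directory has no PDFs, the glob stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        # report.pdf -> report.txt in the same directory
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}
```

For example, convert_pdfs_layout ~/docs would write one .txt file beside each PDF in ~/docs. The variables are quoted so that file names containing spaces are handled.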
For Linux users, nothing works better than Calibre for converting PDF files to docx (or any number of other formats). After conversion, clean up the docx in LibreOffice Writer with the Advanced Search and Replace plug-in installed. https://calibre-ebook.com/download_linux
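Calibre also ships a command-line converter, ebook-convert, so the conversion does not have to go through the GUI. A minimal sketch, assuming Calibre is installed and ebook-convert is on the PATH; the function name pdf_to_docx is hypothetical.

```shell
#!/bin/sh
# pdf_to_docx: hypothetical wrapper around Calibre's ebook-convert.
# ebook-convert picks the output format from the output file's extension.
pdf_to_docx() {
    pdf="$1"
    # report.pdf -> report.docx in the same directory
    ebook-convert "$pdf" "${pdf%.pdf}.docx"
}
```

For example, pdf_to_docx report.pdf would produce report.docx, which can then be cleaned up in LibreOffice Writer as described.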