5.5 File Conversions

20220305

Converting from one file format to another is a complex yet common task. Here we collect together a variety of conversion tasks and the tools to perform them.

We can often use pandoc for much of the hard work. File types supported include docx, md, org, pdf, rst, and tex, though some, like pdf, are only supported for output.

pandoc mydoc.tex -o mydoc.md        # LaTeX to Markdown
pandoc mydoc.md  -o mydoc.docx      # Markdown to Microsoft Word
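
When a whole directory needs converting, the same pandoc command can be wrapped in a loop. The following is a minimal sketch rather than a definitive script: the function names target_name and convert_all are hypothetical, and it assumes pandoc is on the PATH and that the source files carry a .md extension.

```shell
#!/bin/sh
# Hypothetical helper: derive the output name by swapping the extension,
# so that "notes.md" with format "docx" becomes "notes.docx".
target_name() {
    printf '%s.%s\n' "${1%.*}" "$2"
}

# Hypothetical helper: convert every Markdown file in a directory to
# the requested format (docx by default). Assumes pandoc is installed.
convert_all() {
    dir=${1:-.}
    fmt=${2:-docx}
    for src in "$dir"/*.md; do
        [ -e "$src" ] || continue    # The glob matched no files.
        pandoc "$src" -o "$(target_name "$src" "$fmt")"
    done
}
```

Calling convert_all with a directory and a format, for example convert_all docs docx, would then write one output file beside each Markdown source, relying on pandoc to infer both formats from the file extensions.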

To convert Microsoft and LibreOffice documents to pdf, as also covered in Section 97.5, we can use libreoffice itself, invoked in headless mode from the command line:

libreoffice --headless --convert-to pdf input.docx   # Microsoft Word to PDF
libreoffice --headless --convert-to pdf input.xlsx   # Microsoft Excel to PDF

For multiple file conversions, to avoid restarting libreoffice multiple times, we can install and use unoserver.
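
As a rough sketch of how this might look, the functions below send each file to an already running server using the unoconvert client that ships with unoserver. The function names pdf_name and convert_to_pdf are hypothetical, and the sketch assumes that unoserver has been installed and started separately (for example by running unoserver in another terminal).

```shell
#!/bin/sh
# Sketch only: assumes unoserver is installed and already running.

# Hypothetical helper: map an input document name to its PDF name,
# so that "input.docx" becomes "input.pdf".
pdf_name() {
    printf '%s.pdf\n' "${1%.*}"
}

# Hypothetical helper: pass each named file to the running server
# through the unoconvert client, writing the PDF beside the source.
convert_to_pdf() {
    for src in "$@"; do
        unoconvert "$src" "$(pdf_name "$src")"
    done
}
```

A call such as convert_to_pdf *.docx *.xlsx would then convert each file in turn without starting a new libreoffice process for every document.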

Jupyter notebook conversions are provided by jupyter-nbconvert:

jupyter-nbconvert --to markdown doc.ipynb --stdout > doc.md   # Notebook to Markdown
jupyter-nbconvert --to python   doc.ipynb --stdout > doc.py   # Notebook to Python script
jupyter-nbconvert --to script   doc.ipynb --stdout > doc.R    # Notebook with an R kernel to R script

To extract structured information as Markdown or JSON from document formats such as PDF, DOCX, and images, for RAG/QA applications, see docling.

Another type of conversion arises because MS/Windows uses a different line ending convention for text files (carriage return plus line feed) from GNU/Linux and Unix (line feed only). Originally dos2unix was used for this task, which is now accomplished by flip, which converts the file in place:

$ flip -u doc.txt # Convert to GNU/Linux format.
$ flip -m doc.txt # Convert to MS/Windows format.
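
Where flip is not installed, the same conversions can be approximated with standard tools. The following is a minimal sketch under stated assumptions, not a replacement for flip: the function names to_unix, to_dos, and count_crlf are hypothetical, and the functions act as filters on standard input rather than converting a file in place.

```shell
#!/bin/sh
# Hypothetical filter: delete every carriage return so CRLF becomes LF.
to_unix() {
    tr -d '\r'
}

# Hypothetical filter: append a carriage return to each line so that
# LF becomes CRLF. Assumes the input does not already use CRLF.
to_dos() {
    cr=$(printf '\r')
    sed "s/\$/$cr/"
}

# Hypothetical check: count the input lines ending in a carriage return.
count_crlf() {
    cr=$(printf '\r')
    grep -c "$cr\$"
}
```

For example, to_unix < doc.txt > doc.unix.txt writes a converted copy and leaves the original untouched, while count_crlf < doc.txt gives a quick indication of whether a file uses MS/Windows line endings at all.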


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0