Extract table of contents from PDF (TOC)

For some time now I have been editing a magazine in electronic format; issue 175 has just been published (yes, more than 175 months since 2006, with some remarkable successes along the way).

Initially I did all the work manually, but over time I have been automating some tasks, such as exporting the news from WordPress and importing the XML formatted with the styles to InDesign.

Another task that I have automated is the table of contents (TOC) of each issue: when a new issue is published, the TOC is imported from the PDF, with the titles and page numbers of each section.

This week I moved the DNG magazine server to a new instance, now managed from GridPane to avoid many of the system administration tasks. As part of this process, it was time to reconfigure the TOC extraction automation.

The library used until now was pdfminer, but the project has been officially abandoned since last year: “Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.”

On this new server I have installed pdfminer.six, which requires Python 3.6 or higher, whereas the original pdfminer required Python 2.4 or higher and does not support Python 3.x.

Installing pdfminer.six

The machine running DNG Photo Magazine is an Ubuntu 18.04, which is the current requirement for GridPane, and it has Python 2.7 and Python 3.6 available, run with python and python3 respectively:

# python --version
Python 2.7.17

# python3 --version
Python 3.6.9

If we run pip we will be using the Python 2.7 version, so let’s install pip3 to work with Python 3.6:

# apt install python3-pip

And we already have both versions available:

# pip --version
pip 9.0.1 from /usr/lib/python2.7/dist-packages (python 2.7)

# pip3 --version
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

So now we can install pdfminer.six :

# pip3 install pdfminer.six

We can check that it is installed with pip3 list, and now we are able to run it:

# /usr/local/bin/dumppdf.py --extract-toc archivo.pdf

It dumps the TOC structure as XML, which we can then parse with simplexml_load_string to get the titles and pages of each section.
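The plugin does this parsing in PHP. As a rough illustration of the same idea, here is a minimal Python sketch that uses only the standard library. The sample XML and the names parse_toc and SAMPLE_XML are assumptions made for illustration: the sample mimics the general shape of an outlines dump (outline elements with level and title attributes and a pageno child) and is not real output from the magazine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like an outlines dump; not real magazine data.
SAMPLE_XML = """<outlines>
<outline level="1" title="Editorial">
<pageno>3</pageno>
</outline>
<outline level="1" title="Portfolio">
<pageno>10</pageno>
</outline>
<outline level="2" title="Interview">
</outline>
</outlines>"""


def parse_toc(xml_text):
    """Return a list of dicts with the level, title and page of each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter('outline'):
        pageno = node.findtext('pageno')  # None when the entry has no page
        entries.append({
            'level': int(node.get('level', '1')),
            'title': node.get('title', ''),
            'page': int(pageno) if pageno is not None else None,
        })
    return entries


if __name__ == '__main__':
    for entry in parse_toc(SAMPLE_XML):
        print(entry['level'], entry['title'], entry['page'])
```

Entries without a page number are kept with page set to None, so the caller can decide whether to skip them or show them without a page.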

Here is a screenshot of the part of the WordPress plugin that has extracted the TOC until now; the result was then incorporated into the download page through a slightly more complex process. I have reformatted the code a little for pdfminer.six, as it is a few years old, but it is the one that has been in use since the task was automated:

MuPDF Tools

But while looking into the new pdfminer I unexpectedly came across MuPDF, which can serve the same purpose. Let’s take a look at it.

We install it with:

# apt install mupdf-tools

Now we have the command-line tool mutool show, which has an outline option that prints the table of contents (https://www.mupdf.com/docs/manual-mutool-show.html), so we can run it:

# mutool show archivo.pdf outline

With this simple command, we can now use it in the plugin in charge of extracting the TOC from the PDF of each issue.
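To give an idea of what calling the tool from a script can look like, here is a minimal Python sketch (the plugin itself is written in PHP). It rests on assumptions: the exact text layout printed by the outline option differs between MuPDF versions, so the sketch assumes one entry per line, with an optional marker character, tabs for depth, the title, a tab, and a link whose first number is the page. The names parse_outline, extract_toc and SAMPLE are hypothetical, and the sample text is made up rather than taken from a real issue.

```python
import re
import shutil
import subprocess

# Assumed line shape: optional marker (+, - or |), tabs for depth, the title
# (possibly quoted), a tab, then a link such as "#page=3" or "#3,0,792".
LINE_RE = re.compile(r'^[+\-|]?(?P<indent>\t*)(?P<title>.*)\t(?P<link>\S*)$')
PAGE_RE = re.compile(r'#(?:page=)?(\d+)')


def parse_outline(text):
    """Turn outline text into (level, title, page) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed shape
        title = match.group('title').strip().strip('"')
        page = PAGE_RE.search(match.group('link'))
        entries.append((len(match.group('indent')), title,
                        int(page.group(1)) if page else None))
    return entries


def extract_toc(pdf_path, tool='mutool'):
    """Run `mutool show <pdf> outline`; return None if the tool is missing."""
    if shutil.which(tool) is None:
        return None
    # stdout=PIPE and universal_newlines keep this compatible with Python 3.6.
    result = subprocess.run([tool, 'show', pdf_path, 'outline'],
                            stdout=subprocess.PIPE, universal_newlines=True,
                            check=True)
    return parse_outline(result.stdout)


# Hypothetical sample, not real output from any issue of the magazine.
SAMPLE = '|"Editorial"\t#page=3\n+"Portfolio"\t#page=10\n|\t"Interview"\t#page=12\n'

if __name__ == '__main__':
    for level, title, page in parse_outline(SAMPLE):
        print(level, title, page)
```

Keeping the parsing separate from the call to the tool means the regular expressions can be adjusted to whatever the installed version actually prints without touching the rest of the code.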

I’ve rewritten just a dozen lines to use the new tool more efficiently while keeping the same result in the part that incorporates the extracted TOC into our WordPress application. So this new instance of the DNG magazine now runs MuPDF Tools instead of the previous pdfminer or its natural substitute, pdfminer.six.