Persianov on Security
[Posts] [Projects] [Bugtrack] [Crackmes] [Tutorials]

[ Extract from pdf with textract. HOW TO ]

Good day, everyone! This short tutorial explains how to extract text from pdf files, using Python’s textract module. I am going to show you how to install it correctly. So, feel free to leave a comment below.

Install textract on Ubuntu 16.04 Server.

Let’s begin with the well known and necessary step, like updating repositories:

apt update

After we updated the local information about repositories, let’s install and upgrade pip:

apt install python-pip && pip install --upgrade pip

I will not use Python’s virtual environments to keep this guide as simple as possible. Once we have pip installed and updated, let’s install textract‘s dependencies. Run following command on your Ubuntu server:

apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig

Basically, these are the steps described on textract Official Website. At this stage, if we try and install textract, we’ll face several issues:

pip install textract

Errors installing textract.

This is the main reason, why I’m writing this tutorial. To save you the hassle. First issue you can face is during SpeechRecognition-3.6.3 installation. If pip is unable to install this module, you can do it manually, by running:

pip install https://pypi.python.org/packages/ce/c7/ab6cd0d00ddf8dc3b537cfb922f3f049f8018f38c88d71fd164f3acb8416/SpeechRecognition-3.6.3-py2.py3-none-any.whl

Also, probably you’ll get same issue that I faced, while installing textract. It was pocketsphinx build failure. If you get something like:

fatal error: pulse/pulseaudio.h: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Failed building wheel for pocketsphinx

then you are missing libpulse-dev. What are you waiting for? =))

apt install libpulse-dev

Now, you are good to install textract:

pip install textract

Extract text from PDF files. Python sample.

import textract
text = textract.process("/home/user/textract_test.pdf")

Here is my result, after processing a dummy PDF file:

Textract Python sample script

Congratulations! You successfully installed textract on your Ubuntu Server. I hope you enjoyed this short tutorial.