Extract from PDF with textract. HOW TO

Good day, everyone! This short tutorial explains how to extract text from PDF files using Python’s textract module. I am going to show you how to install it correctly and how to get past the errors I ran into. If you get stuck, feel free to leave a comment below.

Install textract on Ubuntu 16.04 Server.


Let’s begin with the well-known and necessary first step, updating the package lists:

apt update

After refreshing the local package index, let’s install pip and upgrade it:

apt install python-pip && pip install --upgrade pip

To keep this guide as simple as possible, I will not use Python virtual environments. Once pip is installed and updated, let’s install textract’s dependencies. Run the following command on your Ubuntu server:

apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
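Several of these packages provide command-line tools that textract calls at run time. As a rough sanity check, the sketch below looks for a few of them on PATH. It assumes Python 3 (for `shutil.which`); the `EXPECTED_TOOLS` table and the `missing_packages` helper are names invented for this illustration and are not part of textract.

```python
import shutil

# Assumed mapping from a few of the apt packages above to the executable
# each one is expected to put on PATH. Extend it as needed.
EXPECTED_TOOLS = {
    "poppler-utils": "pdftotext",
    "antiword": "antiword",
    "unrtf": "unrtf",
    "tesseract-ocr": "tesseract",
    "sox": "sox",
}


def missing_packages(tools=EXPECTED_TOOLS, which=shutil.which):
    """Return the apt packages whose executable is not found on PATH."""
    return sorted(pkg for pkg, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not found on PATH, try: apt install " + " ".join(missing))
    else:
        print("All checked tools are available.")
```

The `which` parameter exists only so the lookup can be swapped out in a test; in normal use you would call `missing_packages()` with no arguments.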

Basically, these are the steps described on the official textract website. At this stage, if we try to install textract, we may face several issues:

pip install textract

Errors installing textract


This is the main reason why I’m writing this tutorial: to save you the hassle.
The first issue you may face occurs while installing SpeechRecognition-3.6.3. If pip is unable to install this module, you can install it manually by running:

pip install https://pypi.python.org/packages/ce/c7/ab6cd0d00ddf8dc3b537cfb922f3f049f8018f38c88d71fd164f3acb8416/SpeechRecognition-3.6.3-py2.py3-none-any.whl

You will probably also hit the same issue I faced while installing textract: a pocketsphinx build failure. If you get something like:

fatal error: pulse/pulseaudio.h: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Failed building wheel for pocketsphinx

then you are missing libpulse-dev. What are you waiting for? 😀

apt install libpulse-dev
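The same diagnosis can be automated for a saved build log. The sketch below scans the log for missing headers and suggests the apt package that provides them. The `HEADER_TO_PACKAGE` table and the `suggest_packages` function are hypothetical helpers written for this example; the single entry in the table is the case described above.

```python
# Hypothetical lookup table: header named in a compiler error -> apt package
# that provides it. The single entry is the case from this tutorial.
HEADER_TO_PACKAGE = {
    "pulse/pulseaudio.h": "libpulse-dev",
}


def suggest_packages(build_log, table=HEADER_TO_PACKAGE):
    """Return apt packages for every known header reported as missing."""
    suggestions = []
    for line in build_log.splitlines():
        # Only compiler lines that report a missing file are of interest.
        if "No such file or directory" not in line:
            continue
        for header, package in table.items():
            if header in line and package not in suggestions:
                suggestions.append(package)
    return suggestions


log = """fatal error: pulse/pulseaudio.h: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Failed building wheel for pocketsphinx"""

print(suggest_packages(log))  # prints ['libpulse-dev']
```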

Now you are ready to install textract:

pip install textract

Extract text from PDF files. Python sample.


import textract
# process() returns the extracted text as a byte string
text = textract.process("/home/user/textract_test.pdf")
print(text)

Here is my result, after processing a dummy PDF file:

[Screenshot: Extract from PDF with textract]
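Because `textract.process` hands back bytes, you usually want to decode the result before working with it as text. Here is a minimal sketch, assuming the output is UTF-8 encoded. The `extract_text` wrapper and its `processor` parameter are hypothetical names used for illustration, not part of textract.

```python
def extract_text(path, processor=None, encoding="utf-8"):
    """Extract text from `path` and return it as a decoded string.

    `processor` is a hypothetical hook that makes the helper testable;
    when it is omitted, textract.process is used.
    """
    if processor is None:
        import textract  # imported lazily, only when really needed
        processor = textract.process
    raw = processor(path)
    # Decode the byte string; undecodable bytes are replaced, not fatal.
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


# Typical use once textract is installed:
#     print(extract_text("/home/user/textract_test.pdf"))
```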

 

Congratulations! You have successfully installed textract on your Ubuntu server. I hope you enjoyed this short tutorial.

6 Comments

  1. Hi, thanks. I tried your solution, but it did not work: it failed with error code 1 in /tmp/pip-build-qOU1Ei/pocketsphinx/

  2. Great tutorial, thank you very much

  3. Hi,

    I tried installing textract on a Windows 7 machine, but it seems to fail every time. Please refer to the logs below:

    C:\Users\sikbhamb>pip install textract

    Collecting textract

    Downloading https://files.pythonhosted.org/packages/e0/00/a9278b3672a31da06394
    eb588a16e96f8fce9f6ae0ed44cca18103d4aef5/textract-1.6.1.tar.gz
    Collecting argcomplete==1.8.2 (from textract)
    Downloading https://files.pythonhosted.org/packages/f0/0f/f965f1520e6ba24b6332
    0919eecfbe3d03debd32402e0c61a08e8fa02d17/argcomplete-1.8.2-py2.py3-none-any.whl
    Collecting chardet==2.3.0 (from textract)
    Downloading https://files.pythonhosted.org/packages/7e/5c/605ca2daa5cf21c87690
    d8fe6ab05a6f2278c451f4ede6456dd26453f4bd/chardet-2.3.0-py2.py3-none-any.whl (180
    kB)
    100% |████████████████████████████████| 184kB 819kB/s
    Collecting python-pptx==0.6.5 (from textract)
    Downloading https://files.pythonhosted.org/packages/f8/9c/30bc244cedc571307efe
    0780d8195ffed5b08f09c94d23f50d6d5144ebc7/python-pptx-0.6.5.tar.gz (7.1MB)
    100% |████████████████████████████████| 7.1MB 86kB/s
    Collecting docx2txt==0.6 (from textract)
    Downloading https://files.pythonhosted.org/packages/aa/72/f02730ec3b0219d8f783
    a255416339b02ff8b6a300c817abf0505833212a/docx2txt-0.6.tar.gz
    Collecting beautifulsoup4==4.5.3 (from textract)
    Downloading https://files.pythonhosted.org/packages/af/a3/9e803f838b3eeb313d45
    d916d4387cda8572c92e1aafeb53fd43ddb5da2c/beautifulsoup4-4.5.3-py3-none-any.whl (
    85kB)
    100% |████████████████████████████████| 92kB 355kB/s
    Collecting xlrd==1.0.0 (from textract)
    Downloading https://files.pythonhosted.org/packages/0c/b0/8946fe3f9c2690c164aa
    a88dfd43b56347d3cdeac34124b988acd1aaa151/xlrd-1.0.0-py3-none-any.whl (143kB)
    100% |████████████████████████████████| 153kB 395kB/s
    Collecting EbookLib==0.15 (from textract)
    Downloading https://files.pythonhosted.org/packages/04/30/2cbf65fa9587a1ecc66a
    78eea91f9189ead8fdadd5e009115bce34529aa6/EbookLib-0.15.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\sikbhamb\AppData\Local\Temp\pip-build-ds1vtzsd\EbookLib\set
    up.py", line 13, in <module>
    long_description = open('README.md').read(),
    File "c:\python 36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1671:
    character maps to <undefined>

    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in C:\Users\sikbhamb
    \AppData\Local\Temp\pip-build-ds1vtzsd\EbookLib\

    • Hi Sikander,

      Unfortunately, I have never used textract on a Windows machine. Based on the logs you provided, it looks like a character decoding issue: setup.py reads README.md with the default Windows codec (cp1252), and byte 0x8d has no mapping in cp1252.
      What version of Python are you using? It seems there is an open bug reported on GitHub: https://github.com/deanmalmgren/textract/issues/190 . So you’re not the only one experiencing this.

      I would try running: pip install git+git://github.com/deanmalmgren/[email protected]

      Hopefully it helps!
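To illustrate the decoding error in the log above, here is a small Python 3 sketch. It assumes the README.md in question is UTF-8 encoded and that Python fell back to cp1252, as the `cp1252.py` frame in the traceback suggests. The sample text is made up for the demonstration.

```python
import os
import tempfile

# Made-up sample text: U+010D is encoded in UTF-8 as bytes 0xC4 0x8D.
text = "na\u010d"

with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")
    with open(readme, "wb") as f:
        f.write(text.encode("utf-8"))

    # Reading as cp1252 fails, because byte 0x8d is unmapped in that codec.
    try:
        with open(readme, encoding="cp1252") as f:
            f.read()
        failed = False
    except UnicodeDecodeError as exc:
        failed = True
        print(exc)

    # Stating the encoding explicitly reads the file correctly.
    with open(readme, encoding="utf-8") as f:
        recovered = f.read()

print(recovered == text)
```

A setup.py that calls `open('README.md')` without an `encoding` argument depends on the platform default, which is why the same package can install cleanly on Linux and fail on Windows.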
