Saturday, June 23, 2018

Extract Text from a Multi-page PDF with only Images





Sometimes a PDF contains only images. In such cases you cannot select the text, whether to copy and paste it or just for reference.
To extract text from an image or a PDF containing only images, I used the Tesseract OCR engine and Ghostscript. I am running Fedora 19 at the moment, but these steps should also apply to older versions of Fedora or to Ubuntu (using apt-get instead of yum for the install step). I believe this can be done on Windows as well. Both Tesseract and Ghostscript are free software.
First, install both Tesseract and Ghostscript on Fedora:
$ sudo yum install -y ghostscript tesseract
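Before moving on, it can help to confirm that both tools are actually on the PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, and it assumes the Ghostscript binary is available as gs, which is its usual name.

```shell
# check_tools: report any of the named commands that are not on PATH.
# Returns 0 if everything was found, 1 otherwise.
check_tools() {
    local tool
    local missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumption: Ghostscript is installed under its usual binary name "gs".
check_tools gs tesseract || echo "install the missing packages before continuing"
```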
Now go to the folder where your PDF is located (assuming it is named story.pdf):
$ cd ~/Downloads/
Next, extract each page from the PDF as a PNG. For this I used Ghostscript. Note the resolution (-r300, i.e. 300 DPI); the %03d in the output name is replaced by the zero-padded page number:
$ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
$ ls page*.png
page001.png
page002.png
...
Once we have a PNG for each page, we can use the OCR software to extract text:
$ for f in page*.png; do tesseract "$f" "$f.out"; done
$ ls page*.out.txt
page001.png.out.txt
page002.png.out.txt
...
So now all the text from the images is in text files, one per page. Tesseract's output is quite accurate, although it obviously cannot read drawings or misprinted characters well.
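If you would rather have a single text file than one per page, the OCR loop and a merge step can be wrapped as below. This is only a sketch: ocr_pages and merge_pages are hypothetical helper names, and story.txt is an arbitrary output name. The per-page naming matches the loop above (page001.png becomes page001.png.out.txt).

```shell
# ocr_pages: run tesseract on every page*.png in the current directory.
# tesseract appends ".txt" to the output base name it is given.
ocr_pages() {
    local f
    for f in page*.png; do
        [ -e "$f" ] || continue    # the glob matched nothing
        tesseract "$f" "$f.out"
    done
}

# merge_pages: concatenate the per-page text files into the file named by $1.
# The zero-padded page numbers keep the glob in page order.
merge_pages() {
    cat page*.png.out.txt > "$1"
}

# Usage, from the directory holding the PNGs:
#   ocr_pages && merge_pages story.txt
```

The zero padding from %03d is what keeps the pages in order here; with more than 999 pages a wider pattern such as %04d would be needed.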
I hope it is helpful for you.



    Thursday, June 21, 2018

    How-to install Bash 4.1 in Linux

    This guide is for almost every Linux distribution.

    The prerequisite is that you already have the required build tools installed.
    If not, install them first.
    The Debian and Ubuntu way:
    sudo apt-get install build-essential
    The Fedora/Red Hat way:
    sudo yum groupinstall "Development Tools" "Legacy Software Development"
    The first step is getting the source package (bash-4.1.tar.gz).
    The next step is unpacking, compiling and installing it:
    tar xf bash-4.1.tar.gz
    cd bash-4.1*
    ./configure
    make
    sudo make install
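To confirm which version ended up installed, you can ask the new binary directly. This is a small sketch, not part of the original steps: shell_version is a hypothetical helper, and the path /usr/local/bin/bash is an assumption based on configure's default prefix, since no --prefix is passed above.

```shell
# shell_version: print the version string reported by the given bash binary.
shell_version() {
    "$1" -c 'echo "$BASH_VERSION"'
}

# Assumed install location (configure's default prefix is /usr/local).
new_bash=/usr/local/bin/bash
if [ -x "$new_bash" ]; then
    shell_version "$new_bash"
else
    echo "no bash found at $new_bash"
fi
```

Note that the distribution's own bash in /bin is left untouched, so scripts starting with #!/bin/bash will keep using the old version.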

    Install Python on Ubuntu (Anaconda)

    Install Anaconda on Ubuntu
    The video above demonstrates one way to install Anaconda, which is good if you want to install Anaconda manually (just be sure to open a new terminal or type source ~/.bashrc after you finish the install).
    The way below uses bash scripts, which is a faster way to install Anaconda. This should work on Ubuntu 12.04 (precise), 14.04 (trusty), and 16.04 (xenial).
    1. Open a new terminal.
    2. Copy and paste the commands from either gist (Python 2 or 3) below into the terminal.
    Python 2 Anaconda Ubuntu
    Python 3 Anaconda Ubuntu
    The files are from the Anaconda installer archive.

    Optional Steps

    The following are just optional things to get started on now that you have Anaconda installed.
    1. (optional) A good way to test your anaconda installation is to open and use a Jupyter Notebook. Type the command below in your terminal to open a Jupyter (IPython) Notebook.
    jupyter notebook
    If you want a basic tutorial going over how to open Jupyter and use Python, please see the following video.
    Python Basics 1: Hello World + Strings
    A blog version of the video can be found here.
    2. If you want to use both Python 2 and 3, please see the following tutorial on Environment Management with Conda.
    3. I often get asked how to get started with machine learning; here is a step-by-step tutorial on getting started with machine learning.
    Please let me know if you have any questions! You can either leave a comment here or leave me a comment on YouTube. The YouTube video has a ton of questions on it answered already (please subscribe if you can)!
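As a rough illustration of item 2 above, separate conda environments for Python 2 and 3 can be created along these lines. This is a sketch rather than the linked tutorial's exact steps: make_env is a hypothetical helper, and the environment names py27 and py36 are arbitrary.

```shell
# make_env: create a conda environment NAME with the given Python VERSION.
# -y answers the confirmation prompt automatically.
make_env() {
    conda create -y -n "$1" "python=$2"
}

# Usage (environment names are illustrative):
#   make_env py27 2.7
#   make_env py36 3.6
# Then switch between them with "conda activate py27"
# (or "source activate py27" on conda versions older than 4.4).
```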