Saturday, June 23, 2018

Extract Text from a multi-page PDF with only Images

Sometimes there are only images in a PDF. In such cases you cannot select text to copy and paste, or even just for reference.

To extract text from an image or a PDF containing only images, I used the Tesseract OCR engine and Ghostscript. I am running Fedora 19 at the moment; however, these steps should apply to an older version of Fedora or to Ubuntu. (I believe this can be done on Windows as well.) Both Tesseract and Ghostscript are free software.

First, install both Tesseract and Ghostscript on Fedora:

$ sudo yum install -y ghostscript tesseract

Now go to the folder where your PDF is located (assuming that it is named story.pdf):

$ cd ~/Downloads/

Next, extract each page from the PDF as a PNG. For this I used Ghostscript. Note the resolution (-r300):

$ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
$ ls page*.png
page001.png
page002.png
...
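
If 300 DPI turns out to be too slow or too coarse for your scan, the same command can be wrapped in a small shell function so that the resolution is easy to change. This is only a sketch: the function name pdf_to_png and its arguments are my own invention, not part of Ghostscript, and it assumes the ghostscript command used above is on your PATH.

```shell
# Render every page of a PDF to page001.png, page002.png, ... at a chosen DPI.
# Usage: pdf_to_png <file.pdf> [dpi]      (dpi defaults to 300)
pdf_to_png() {
    local pdf dpi
    pdf="$1"
    dpi="${2:-300}"
    # Same options as the command above; only the -r value is variable.
    ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha "-r$dpi" \
        -sOutputFile="page%03d.png" "$pdf"
}
```

For example, pdf_to_png story.pdf 150 would render the pages at 150 DPI, while pdf_to_png story.pdf keeps the 300 DPI used above.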

Once we have a PNG for each page, we can use the OCR software to extract text:

$ for f in page*.png; do tesseract "$f" "$f.out"; done
$ ls page*.out.txt
page001.png.out.txt
page002.png.out.txt
...
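
Tesseract assumes English unless told otherwise. For pages in another language, its -l option selects the language, provided the matching language data is installed on your system. Here is a minimal sketch of the same loop with the language as a parameter; ocr_pages is a made-up helper name, not a Tesseract command.

```shell
# OCR every page image in the current directory with a chosen language.
# Usage: ocr_pages [lang]      (lang defaults to "eng")
ocr_pages() {
    local lang f
    lang="${1:-eng}"
    for f in page*.png; do
        [ -e "$f" ] || continue        # the glob matched nothing
        # Writes the recognized text to "$f.out.txt", as in the loop above.
        tesseract "$f" "$f.out" -l "$lang"
    done
}
```

Running ocr_pages deu, for instance, would process the pages as German, assuming the deu language data is available.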

So now we have all the text from the images in text files. Tesseract works quite well on scanned pages; obviously it can't read drawings or misprinted characters very well, but it is still quite accurate.
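
Since each page ends up in its own file, it can be handy to join them into one document. The sketch below assumes the page001.png.out.txt naming shown above; combine_pages and the separator lines are my own choices. Because the page numbers are zero-padded, the shell's glob order is also the page order.

```shell
# Concatenate per-page OCR output into one text file, in page order.
# Usage: combine_pages <output-file> [directory]   (directory defaults to .)
combine_pages() {
    local out dir f
    out="$1"
    dir="${2:-.}"
    : > "$out"                          # start with an empty output file
    for f in "$dir"/page*.png.out.txt; do
        [ -e "$f" ] || continue         # the glob matched nothing
        # Header line naming the page, then the page text, then a blank line.
        printf '===== %s =====\n' "$(basename "$f" .png.out.txt)" >> "$out"
        cat "$f" >> "$out"
        printf '\n' >> "$out"
    done
}
```

For example, combine_pages story.txt would collect every page from the current folder into story.txt.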

I hope it is helpful for you.


Thursday, June 21, 2018

How-to install Bash 4.1 in Linux

This guide is for almost every Linux distribution.

The prerequisite is that you already have the required build tools installed.
If not, do the following step:

The Debian/Ubuntu way:

sudo apt-get install build-essential

The Fedora/Red Hat way:

sudo yum groupinstall "Development Tools" "Legacy Software Development"

The first step is getting the source package:

wget http://ftp.gnu.org/gnu/bash/bash-4.1.tar.gz

The next step is compiling and installing it:

tar xf bash-4.1.tar.gz
cd bash-4.1*
./configure
make
sudo make install
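
With the default ./configure settings, make install normally puts the new binary under /usr/local/bin rather than replacing the system's /bin/bash. A quick way to confirm which version a given binary reports is sketched below; bash_version is a hypothetical helper name, and the default path is an assumption about your install prefix.

```shell
# Print the version string reported by a given bash binary.
# Usage: bash_version [path-to-bash]   (defaults to /usr/local/bin/bash)
bash_version() {
    # BASH_VERSION is set by bash itself, so this asks the binary directly.
    "${1:-/usr/local/bin/bash}" -c 'printf "%s\n" "$BASH_VERSION"'
}
```

Calling bash_version with no argument checks the freshly built binary, and bash_version /bin/bash shows the system one for comparison.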

Install Python on Ubuntu (Anaconda)

Install Anaconda on Ubuntu

The video above demonstrates one way to install Anaconda, which is good if you want to install it manually (just be sure to open a new terminal or type source ~/.bashrc after you finish the install).

The way below uses bash scripts, which is a faster way to install Anaconda. This should work on Ubuntu 12.04 (precise), 14.04 (trusty), and 16.04 (xenial).

Open a new terminal.
Copy and paste the commands from either gist (Python 2 or 3) below into the terminal.

Python 2 Anaconda Ubuntu

Python 3 Anaconda Ubuntu
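
The gists themselves are not reproduced here. As a rough sketch of the kind of step such a script performs, the function below runs an already downloaded Anaconda installer without prompts. The name install_anaconda, the installer path, and the default prefix are all placeholders of mine; -b (batch mode) and -p (install location) are options of the Anaconda installer script.

```shell
# Run a downloaded Anaconda installer non-interactively.
# Usage: install_anaconda <installer.sh> [prefix]
#        (prefix defaults to $HOME/anaconda3, an assumed location)
install_anaconda() {
    local installer prefix
    installer="$1"
    prefix="${2:-$HOME/anaconda3}"
    # -b: batch mode, no prompts; -p: directory to install into
    bash "$installer" -b -p "$prefix"
}
```

As far as I know, batch mode does not edit your .bashrc, so after a run like this you would still need to add the prefix's bin directory to your PATH yourself before opening a new terminal.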

The files are from the Anaconda installer archive (repo.continuum.io).

Optional Steps

The following are just optional things to get started on now that you have Anaconda installed.

1. (optional) A good way to test your Anaconda installation is to open and use a Jupyter Notebook. Type the command below in your terminal to open a Jupyter (IPython) Notebook.

jupyter notebook
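
If the command is not found, the usual cause is that the new PATH has not been picked up yet. A small check like the one below shows which commands the shell can actually see; check_tools is just an illustrative helper name of mine, not something Anaconda provides.

```shell
# Report whether each named command can be found on PATH.
# Returns non-zero if any of them is missing.
check_tools() {
    local tool missing
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: %s\n' "$tool" "$(command -v "$tool")"
        else
            printf '%s: not found\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools conda python jupyter prints where each command lives, or "not found" for any that is missing.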

If you want a basic tutorial going over how to open Jupyter and use Python, please see the following video.

Python Basics 1: Hello World + Strings

A blog version of the video can be found here.

2. If you want to use both Python 2 and 3, please see the following tutorial on Environment Management with Conda.

3. I often get asked how to get started with machine learning. Here is a step-by-step tutorial on getting started with machine learning.

Please let me know if you have any questions! You can either leave a comment here or leave me a comment on YouTube. The YouTube video already has a ton of answered questions on it (please subscribe if you can)!