In this post, we will talk about natural language processing (NLP) using Python. This NLP tutorial will use the Python NLTK library. NLTK is a popular Python library which is used for NLP.
So what is NLP? And what are the benefits of learning NLP?
Put simply, natural language processing (NLP) is about developing applications and services that are able to understand human languages.
We are talking here about practical examples of natural language processing (NLP) like speech recognition, speech translation, understanding complete sentences, understanding synonyms of matching words, and writing complete grammatically correct sentences and paragraphs.
Benefits of NLP
As you all know, millions of gigabytes of data are generated every day by blogs, social websites, and web pages.
There are many companies gathering all of this data to better understand users and their passions and make appropriate changes.
This data could show that the people of Brazil are happy with product A, while the people of the US are happier with product B. With NLP, this knowledge can be found instantly (i.e. a real-time result). For example, search engines are one application of NLP that gives the appropriate results to the right people at the right time.
But search engines are not the only implementation of natural language processing (NLP). There are a lot of even more awesome implementations out there.
NLP Implementations
These are some successful implementations of natural language processing (NLP):
- Search engines like Google, Yahoo, etc. Google's search engine understands that you are a tech guy, so it shows you results related to that.
- Social website feeds like your Facebook news feed. The news feed algorithm understands your interests using natural language processing and is more likely to show you related ads and posts than other content.
- Speech engines like Apple Siri.
- Spam filters like Google spam filters. It's not just about your usual spam filtering; now, spam filters understand what's inside the email content and determine whether it's spam or not.
NLP Libraries
There are many open-source natural language processing (NLP) libraries. These are some of them:
- Natural language toolkit (NLTK)
- Apache OpenNLP
- Stanford NLP suite
- Gate NLP library
Natural language toolkit (NLTK) is the most popular library for natural language processing (NLP). It was written in Python and has a big community behind it.
NLTK is also very easy to learn; actually, it's the easiest natural language processing (NLP) library that you'll use.
In this NLP tutorial, we will use the Python NLTK library. Before I start installing NLTK, I assume that you know some Python basics to get started.
Install NLTK
If you are using Windows, Linux, or Mac, you can install NLTK using pip:
# pip install nltk
At the time of writing this post, you can use NLTK on Python 2.7, 3.4, and 3.5. Alternatively, you can install it from the source tarball.
To check if NLTK has installed correctly, you can open your Python terminal and type the following:
import nltk

If everything goes fine, that means you've successfully installed the NLTK library.
Once you've installed NLTK, you should install the NLTK packages by running the following code:
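A minimal sketch using the standard downloader call:

import nltk

# Opens the NLTK downloader so you can pick which packages to install
nltk.download()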
This will open the NLTK downloader, where you can choose which packages to install.
You can install all of the packages, since they are all small. Now, let's start the show!
Tokenize Text Using Pure Python
First, we will grab some web page content. Then, we will analyze the text to see what the page is about. We will use the urllib module to crawl the web page:
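Something like the following sketch; http://php.net/ is an assumed example URL here, suggested by the PHP-heavy token counts later on (the code assumes Python 3):

import urllib.request

# Fetch the raw HTML of the page (http://php.net/ is an assumed example)
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)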
As you can see from the printed output, the result contains a lot of HTML tags that need to be cleaned. We can use
BeautifulSoup
to clean the grabbed text like this:
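A sketch using the bs4 package (the built-in html.parser is assumed here; any parser supported by BeautifulSoup will do):

from bs4 import BeautifulSoup

# Parse the HTML and keep only the visible text
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text(strip=True)
print(text)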
Now, we have clean text from the crawled web page. Awesome, right?
Finally, let's convert that text into tokens by splitting the text like this:
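A plain whitespace split will do for now:

# Split the cleaned text on whitespace to get a rough list of tokens
tokens = [t for t in text.split()]
print(tokens)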
Count Word Frequency
The text is much better now. Let's calculate the frequency distribution of those tokens using Python NLTK. There is a function in NLTK called FreqDist() that does the job:
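A minimal sketch, continuing from the tokens list above:

import nltk

# Count how often each token occurs
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))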
If you search the output, you'll find that the most frequent token is PHP.
You can plot a graph for those tokens using the plot() function like this:

freq.plot(20, cumulative=False)
From the graph, you can be sure that this article is talking about PHP. Great! There are some words like "the," "of," "a," "an," and so on. These words are stop words. Generally, stop words should be removed to prevent them from affecting our results.
Remove Stop Words Using NLTK
NLTK is shipped with stop words lists for most languages. To get English stop words, you can use this code:
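For example:

from nltk.corpus import stopwords

# The English stop words list that ships with NLTK's stopwords corpus
print(stopwords.words('english'))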
Now, let's modify our code and clean the tokens before plotting the graph. First, we will make a copy of the list. Then, we will iterate over the tokens and remove the stop words:
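A sketch of that cleanup, continuing from the tokens list above:

from nltk.corpus import stopwords

# Work on a copy so we don't remove items from the list we're iterating over
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)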
You can review the Python list functions to learn how to process lists.
The final code should look like this:
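Putting it all together (http://php.net/ is still an assumed example URL, and freq.plot() needs matplotlib installed):

import urllib.request

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Grab the page and strip the HTML down to visible text
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text(strip=True)

# Tokenize on whitespace, then drop English stop words
tokens = [t for t in text.split()]
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)

# Count and plot the most frequent tokens
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
freq.plot(20, cumulative=False)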
If you check the graph now, it looks better than before, since the stop words are no longer counted.
Tokenize Text Using NLTK
We just saw how to split the text into tokens using the split function. Now, we will see how to tokenize the text using NLTK. Tokenizing text is important since text can't be processed without tokenization. Tokenization means splitting larger parts into smaller parts.
You can tokenize paragraphs to sentences and tokenize sentences to words according to your needs. NLTK is shipped with a sentence tokenizer and a word tokenizer.
Let's assume that we have a sample text like the following:
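The original sample isn't preserved here, so assume something like:

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."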
To tokenize this text to sentences, we will use sentence tokenizer:
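A minimal sketch (sent_tokenize relies on the punkt package from the downloader):

from nltk.tokenize import sent_tokenize

# Split the text into sentences
print(sent_tokenize(mytext))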
The output is the following:
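With the sample text assumed above:

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']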
You may say, "This is an easy job; I don't need to use NLTK tokenization, and I can split sentences using regular expressions, since every sentence ends with punctuation followed by a space."
Well, take a look at the following text:
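Again, assuming a sample along these lines:

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."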
Uh oh! The abbreviation Mr. ends with a period, so a naive split on punctuation would wrongly cut the sentence there. OK, let's try NLTK:
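Using the same assumed sample:

from nltk.tokenize import sent_tokenize

# punkt knows that "Mr." is an abbreviation, not the end of a sentence
print(sent_tokenize(mytext))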
The output looks like this:
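With the assumed sample, the sentences come out intact:

['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']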
Great! It works like a charm. Let's try the word tokenizer to see how it will work:
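Using the same assumed sample:

from nltk.tokenize import word_tokenize

# Split the text into individual words and punctuation tokens
print(word_tokenize(mytext))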
The output is:
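With the assumed sample, something like:

['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']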
The word Mr. is one word, as expected. NLTK uses PunktSentenceTokenizer, which is part of the nltk.tokenize.punkt module. This tokenizer is trained to work well with many languages.

Tokenize Text in Non-English Languages
To tokenize other languages, you can specify the language like this:
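A sketch with an assumed French sample:

from nltk.tokenize import sent_tokenize

# The second argument picks the punkt model for the given language
mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))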
The result will be like this:
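Assuming the French model treats M. as an abbreviation, something like:

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]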
We are doing very well!
Get Synonyms From WordNet
If you remember, we installed the NLTK packages using nltk.download(). One of the packages was WordNet. WordNet is a database built for natural language processing. It includes groups of synonyms and brief definitions.
You can get these definitions and examples for a given word like this:
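A sketch with pain as an assumed example word:

from nltk.corpus import wordnet

# Look up the synsets (groups of synonyms) for a word
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())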
The result is:
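For pain, the first synset's definition and example look like this:

a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']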
WordNet includes a lot of definitions:
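For example (NLP and Python are assumed query words):

from nltk.corpus import wordnet

syn = wordnet.synsets("NLP")
print(syn[0].definition())

syn = wordnet.synsets("Python")
print(syn[0].definition())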
The result is:
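Something like:

the branch of information science that deals with natural language information
large Old World boas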
You can use WordNet to get synonymous words like this:
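A sketch, with Computer as the assumed word:

from nltk.corpus import wordnet

# Collect every lemma name from every synset of the word
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)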
The output is:
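Something like:

['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']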
Cool!
Get Antonyms From WordNet
You can get the antonyms of words the same way. All you have to do is check the lemmas before adding them to the array, and see whether each one is an antonym or not:
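A sketch, with small as the assumed word:

from nltk.corpus import wordnet

# A lemma's antonyms() list is non-empty only when WordNet records an antonym
antonyms = []
for syn in wordnet.synsets("small"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print(antonyms)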
The output is:
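For small, something like:

['large', 'big', 'big']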
This is the power of NLTK in natural language processing.
NLTK Word Stemming
Word stemming means removing affixes from words and returning the root word. (The stem of the word working is work.) Search engines use this technique when indexing pages: many people write different forms of the same word, and all of them are stemmed to the same root word.
There are many algorithms for stemming, but the most used algorithm is the Porter stemming algorithm. NLTK has a class called PorterStemmer that uses this algorithm.
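A minimal sketch:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('working'))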
The result is work.
Clear enough!
There are other stemming algorithms, like the Lancaster stemming algorithm, whose output is a bit different for a few words. You can try both of them and compare the results.
Stemming Non-English Words
SnowballStemmer can stem 13 languages besides English. You can list the supported languages right from the class:
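from nltk.stem import SnowballStemmer

# The tuple also includes 'english' and the original 'porter' algorithm
print(SnowballStemmer.languages)

At the time of writing, this prints ('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish'); newer NLTK releases may add more.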
You can use the stem function of the SnowballStemmer class to stem non-English words like this:
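A sketch with an assumed French word:

from nltk.stem import SnowballStemmer

french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem('manger'))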
The French commenters can tell us about the results!
Lemmatizing Words Using WordNet
Word lemmatizing is similar to stemming, but the difference is that the result of lemmatizing is a real word. With stemming, when you try to stem some words, the result may not be a real word, like this:
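For example:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('increases'))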
The result is increas.
Now, if we try to lemmatize the same word using NLTK WordNet, the result is correct:
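Using the WordNet lemmatizer (it relies on the wordnet package from the downloader):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))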
The result is increase.
The result might end up being a synonym or a different word with the same meaning. Sometimes, if you try to lemmatize a word like playing, it will end up with the same word. This is because the default part of speech is noun. To get verbs, you should specify it like this:
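from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the word as a verb
print(lemmatizer.lemmatize('playing', pos="v"))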
The result is play.
Actually, this is a very good level of text compression. You end up with about 50% to 60% compression. The result could be a verb, noun, adjective, or adverb:
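from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))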
The result is:
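play
playing
playing
playing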
Stemming and Lemmatization Difference
OK, let's try stemming and lemmatization for some words:
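A sketch with a few assumed sample words:

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('purple'))
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('purple'))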
The result is:
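With the assumed words:

stone
speak
purpl
stone
speaking
purple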
Stemming works on words without knowing their context, which is why it has lower accuracy and is faster than lemmatization.
In my opinion, lemmatizing is better than stemming. Word lemmatizing returns a real word even if it's not the same word; it could be a synonym, but at least it's a real word. Sometimes, you don't care about this level of accuracy, and all you need is speed. In this case, stemming is better.
All of the steps we discussed in this NLP tutorial were text preprocessing. In future posts, we will discuss text analysis using Python NLTK.