Natural Language Processing (NLP) is one of the fastest-growing fields in scientific research. Natural language is the language we use
in day-to-day life. Several properties make natural language hard to analyze automatically, such as context dependence and polysemy.
Various techniques are being developed by which computer programs can understand and interpret natural language, and these programs are already used across industry. For example:
- Automatic analysis of customer reviews
- Automatic categorizing of web pages
- Recommendation systems for products, articles, and media
Python is a language that offers simple paradigms for creating powerful programs. It has an excellent library for processing natural language: NLTK.
NLTK offers a simple, consistent, extensible, and modular way to create programs for natural language processing. In addition, it ships with text corpora that learners and researchers can use for their own purposes.
The following software is needed for this tutorial:
- Python: this version of NLTK supports Python 2.4 to 2.7
- NLTK: all information on how to install NLTK can be found here.
- NLTK Data: this includes various text corpora which can be used in various ways; all information about it can be found here.
What is POS tagging?
Every word in any language belongs to a particular class or lexical category, such as noun, verb, or adjective. This class is called its part of speech (POS). The process of identifying a word’s class and labeling it accordingly is called POS tagging.
Some text corpora in NLTK come already tagged; you can use them for testing your program. NLTK also provides several classes that will help you tag words. We will be discussing them here in detail.
Every text corpus has its own set of tags, known as a tagset. In this tutorial we will be using the following tagset:
| Tag | Part of speech | Examples |
| --- | --- | --- |
| N | Noun | house, pen, mouse |
| NP | Proper noun | Anne, London, December |
| V | Verb | is, has, get, put |
| P | Preposition | on, of, with, in, into |
| PRO | Pronoun | he, she, them, they |
| CNJ | Conjunction | while, but, if, and |
| UH | Interjection | oops, bang, whee |
| ADJ | Adjective | good, bad, ugly, careful, reddish |
| ADV | Adverb | truly, falsely, mildly, swiftly, carefully |
| DET | Determiner | a, an, the, every, no |
Automatically Tagging Words
The concept of this tagger is pretty simple: you define a tagger that assigns the same tag to every word. For example, consider the following code:
>>> import nltk
>>> raw_text = 'I am a little teapot, short and stout'
>>> tokens = nltk.word_tokenize(raw_text)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('am', 'NN'), ('a', 'NN'), ('little', 'NN'), ('teapot', 'NN'), (',', 'NN'), ('short', 'NN'), ('and', 'NN'), ('stout', 'NN')]
This doesn’t really give an accurate answer, does it? After all, ‘little’, ‘short’, and ‘stout’ are not nouns. However, as we progress you will see that this tagger has its own uses.
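A tagger that answers the same tag for everything is still a useful baseline, because nouns are the most common class in ordinary English text. The sketch below, in plain Python with no NLTK dependency, measures that baseline against a small hand-tagged sample; the sample tags and the function name `default_tag` are mine, for illustration only.

```python
# A minimal "default tagger" baseline in plain Python (no NLTK needed).
def default_tag(tokens, tag='N'):
    """Assign the same tag to every token, like nltk.DefaultTagger."""
    return [(token, tag) for token in tokens]

# Hypothetical gold-standard tags for the teapot sentence.
gold = [('I', 'PRO'), ('am', 'V'), ('a', 'DET'), ('little', 'ADJ'),
        ('teapot', 'N'), ('short', 'ADJ'), ('and', 'CNJ'), ('stout', 'ADJ')]

guessed = default_tag([word for word, _ in gold])
correct = sum(1 for g, d in zip(gold, guessed) if g[1] == d[1])
print('%d of %d correct' % (correct, len(gold)))  # only 'teapot' matches
```

Even a low score like this matters later, because better taggers can fall back on the default tagger for words they have never seen.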
Tagging using Regular Expressions
You can tag words based on regular expressions. For example, words ending with “ing” are often verbs (V), like running, playing, boxing; words ending with “er” are often comparative adjectives (ADJR). Consider the following:
>>> import nltk
>>> text = 'I am running faster than light, I am lighter than light'
>>> text_tokens = nltk.word_tokenize(text)
>>> patterns = [
...     (r'.*ing$', 'V'),
...     (r'.*er$', 'ADJR'),
...     (r'.*est$', 'ADJS'),
...     (r'.*', 'N')
... ]
>>> reg_tagger = nltk.RegexpTagger(patterns)
>>> reg_tagger.tag(text_tokens)
[('I', 'N'), ('am', 'N'), ('running', 'V'), ('faster', 'ADJR'), ('than', 'N'), ('light', 'N'), (',', 'N'), ('I', 'N'), ('am', 'N'), ('lighter', 'ADJR'), ('than', 'N'), ('light', 'N')]
However, writing regular expressions that cover every word is impractical in most natural languages, so the regular expression tagger is of limited use on its own.
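The first-match-wins behaviour behind this tagger can be reproduced with the standard `re` module alone, which makes it easier to see what NLTK is doing. The helper name `regexp_tag` below is mine, not part of NLTK:

```python
import re

# A toy regular-expression tagger in plain Python, mirroring the
# first-match-wins behaviour of nltk.RegexpTagger.
PATTERNS = [
    (r'.*ing$', 'V'),     # gerunds / present participles
    (r'.*er$', 'ADJR'),   # comparative adjectives
    (r'.*est$', 'ADJS'),  # superlative adjectives
    (r'.*', 'N'),         # catch-all default: noun
]

def regexp_tag(tokens, patterns=PATTERNS):
    tagged = []
    for token in tokens:
        for pattern, tag in patterns:
            if re.match(pattern, token):   # first pattern that matches wins
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(['running', 'faster', 'lightest', 'light']))
# [('running', 'V'), ('faster', 'ADJR'), ('lightest', 'ADJS'), ('light', 'N')]
```

Note that pattern order matters: the catch-all `r'.*'` must come last, or it would swallow every token.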
In most natural languages a word can behave in different ways depending on the sentence; many words can be used as two or more parts of speech. For example, the word ‘free’ behaves as an adjective in Sentence 1 and as a verb in Sentence 2.
Sentence 1: 'After the civil war he was a free man.'
Sentence 2: 'He could finally free the legs of the man buried under the car.'
The unigram tagger works in a very simple manner: it assigns the most likely tag to each word. To find the “most likely” tag, the unigram tagger must first be trained. This is where we will be using the tagged corpora.
For training we use the Brown tagged corpus, which contains text from various categories. We use one category to train our tagger:
>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='lore')
>>> untagged_sents = brown.sents(categories='lore')
>>> unigram_tagger = nltk.UnigramTagger(tagged_sents)
Once the tagger has been trained, we can use it to tag new text.
>>> text = 'After the civil war he was a free man'
>>> tokens = nltk.word_tokenize(text)
>>> unigram_tagger.tag(tokens)
[('After', 'IN'), ('the', 'AT'), ('civil', 'JJ'), ('war', 'NN'), ('he', 'PPS'), ('was', 'BEDZ'), ('a', 'AT'), ('free', 'JJ'), ('man', 'NN')]
>>> text2 = 'He could finally free the legs of man buried under the car'
>>> tokens2 = nltk.word_tokenize(text2)
>>> unigram_tagger.tag(tokens2)
[('He', 'PPS'), ('could', 'MD'), ('finally', 'RB'), ('free', 'JJ'), ('the', 'AT'), ('legs', 'NNS'), ('of', 'IN'), ('man', 'NN'), ('buried', 'VBN'), ('under', 'IN'), ('the', 'AT'), ('car', 'NN')]
As you can see, in the second example ‘free’ was again tagged JJ (adjective) when it is actually a verb. However, the unigram tagger is more accurate than the taggers we have seen before.
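Under the hood, a unigram tagger is essentially a lookup table from each word to its most frequent tag in the training data. A pure-Python sketch of that idea follows; the tiny hand-tagged training set and the function names are hypothetical, invented just to show the mechanism:

```python
from collections import Counter, defaultdict

# A toy unigram tagger in plain Python: for each word, remember the
# tag it received most often in the (made-up) training data.
train = [
    [('the', 'AT'), ('free', 'JJ'), ('man', 'NN')],
    [('free', 'JJ'), ('advice', 'NN')],
    [('they', 'PPS'), ('free', 'VB'), ('him', 'PPO')],
]

def train_unigram(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # keep only the single most frequent tag per word
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(tokens, model, default='NN'):
    return [(token, model.get(token, default)) for token in tokens]

model = train_unigram(train)
print(unigram_tag(['free', 'the', 'unicorn'], model))
# 'free' was seen twice as JJ and once as VB, so it always gets JJ;
# unseen words like 'unicorn' fall back to the default tag.
```

This also makes the weakness concrete: because the model stores one tag per word, ‘free’ will be tagged JJ in every sentence, no matter how it is used.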
The weakness of the unigram tagger is that it doesn’t consider the surrounding words when tagging a particular word. In the second sentence of the above example, if the tagger had seen ‘finally’, an adverb, before ‘free’,
then it could have tagged ‘free’ as a verb. N-gram tagging works on this principle. An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.
Here I have used a bigram tagger:
>>> text = 'After the civil war he was a free man'
>>> tokens = nltk.word_tokenize(text)
>>> bigram_tagger = nltk.BigramTagger(tagged_sents)
>>> bigram_tagger.tag(tokens)
[('After', 'IN'), ('the', 'AT'), ('civil', 'JJ'), ('war', 'NN'), ('he', 'PPS'), ('was', 'BEDZ'), ('a', 'AT'), ('free', 'JJ'), ('man', 'NN')]
More often than not, a combination of these taggers is used to identify the part of speech of a given word. N-gram tagging has given the most accurate results so far, but the time it takes makes it unsuitable for real-time applications.
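In NLTK this combining is done through the `backoff` argument that the tagger classes accept, so a bigram tagger can fall back to a unigram tagger, which falls back to a default tagger. The chaining idea itself can be sketched in plain Python; every name below is mine, not NLTK’s:

```python
# A plain-Python sketch of tagger backoff: try each tagger in turn and
# keep the first answer that isn't None. In NLTK you would instead pass
# backoff= when constructing the taggers, e.g.
# nltk.BigramTagger(train, backoff=nltk.UnigramTagger(train)).
def lookup_tagger(model):
    """Tag from a word->tag dict; returns None for unknown words."""
    return lambda token: model.get(token)

def default_tagger(tag):
    """Always answers with the same tag; the end of the chain."""
    return lambda token: tag

def backoff_tag(tokens, taggers):
    tagged = []
    for token in tokens:
        for tagger in taggers:
            tag = tagger(token)
            if tag is not None:          # first tagger with an answer wins
                tagged.append((token, tag))
                break
    return tagged

chain = [lookup_tagger({'free': 'JJ', 'man': 'NN'}), default_tagger('NN')]
print(backoff_tag(['free', 'unicorn'], chain))
# [('free', 'JJ'), ('unicorn', 'NN')]
```

Ordering the chain from most specific to most general gives each tagger a chance before the catch-all default fires.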