Speech Tagging Using Python


Natural Language Processing (NLP) is one of the most up-and-coming fields in scientific research. Natural language is the language we use in day-to-day life. Several properties, such as context dependence and polysemy (one word having several meanings), make natural language hard to analyze.
Various techniques are being developed that let computer programs understand and interpret natural language, and these programs are already used in industry in various ways. For example:

  1. Automatic analysis of customer reviews
  2. Automatic categorizing of web pages
  3. Recommendation systems for various things

Why Python?

Python is a language that offers simple paradigms for creating powerful programs. It has an excellent library for processing natural language: NLTK.
NLTK offers a simple, consistent, extensible and modular way to create programs for natural language processing. In addition, it comes with text corpora that learners and researchers can use for their own purposes.

Required Software

The following software is needed for this tutorial:

Python. NLTK is supported by Python versions 2.4 to 2.7.
NLTK. All information on how to install NLTK can be found here.
NLTK data. This includes various text corpora which can be used in various ways. All information about it can be found here.

What is POS tagging?

Every word in any language has a particular class or lexical category, like noun, verb, adjective etc. This class is called its part of speech (POS). The process of identifying a word’s class and labeling it accordingly is called POS tagging.

Some text corpora in NLTK come already tagged, so you can use them for testing your program. NLTK also provides classes which help you tag words. We will discuss them here in detail.
Every text corpus has its own set of tags. This set of tags is known as a tagset. In this tutorial we will be using the following tagset:


Tag   Meaning        Examples
N     Noun           house, pen, mouse
NP    Proper noun    Anne, London, December
V     Verb           is, has, get, put
P     Preposition    on, of, with, in, into
PRO   Pronoun        he, she, them, they
CNJ   Conjunction    while, but, if, and
UH    Interjection   oops, bang, whee
ADJ   Adjective      good, bad, ugly, careful, reddish
ADV   Adverb         truly, falsely, mildly, swiftly, carefully
DET   Determiner     a, an, the, every, no

Automatically Tagging Words

Default Tagger

The concept of this tagger is pretty simple: you define a tagger which assigns the same tag to every word. For example, consider the following code:

>>> import nltk
>>> raw_text='I am a little teapot, short and stout'
>>> tokens=nltk.word_tokenize(raw_text)
>>> default_tagger=nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('am', 'NN'), ('a', 'NN'), ('little', 'NN'), ('teapot', 'NN'), (',', 'NN'), ('short', 'NN'), ('and', 'NN'), ('stout', 'NN')]

This doesn’t really give an accurate answer, does it? After all, ‘little’, ‘short’ and ‘stout’ are not nouns. However, as we progress you will see that this tagger has its uses.

Tagging using Regular Expressions

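You can tag words based on a regular expression. For example, words ending with “ing” are usually verbs (V), like running, playing, boxing; words ending with “er” are often comparative adjectives (ADJR); and words ending with “est” are often superlative adjectives (ADJS). Any word that matches none of these patterns is tagged as a noun (N). Consider the following: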

>>> import nltk
>>> text='I am running faster than light, I am lighter than light'
>>> text_tokens=nltk.word_tokenize(text)
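>>> patterns=[
... (r'.*ing$', 'V'),     # words ending in "ing"
... (r'.*er$', 'ADJR'),   # comparative adjectives
... (r'.*est$', 'ADJS'),  # superlative adjectives
... (r'.*', 'N'),         # catch-all: tag everything else as a noun
... ]
>>> reg_tagger=nltk.RegexpTagger(patterns)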
>>> reg_tagger.tag(text_tokens)
[('I', 'N'), ('am', 'N'), ('running', 'V'), ('faster', 'ADJR'), ('than', 'N'), ('light', 'N'), (',', 'N'), ('I', 'N'), ('am', 'N'), ('lighter', 'ADJR'), ('than', 'N'), ('light', 'N')]

However, writing regular expressions that cover each and every word is impractical in most natural languages. The regular expression tagger is therefore of limited use on its own.
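To make the mechanism concrete, here is a minimal pure-Python sketch of how a regular expression tagger can work: each token gets the tag of the first pattern that matches it. This is an illustration, not NLTK's actual implementation; the function name `regexp_tag` and the pattern list are assumptions made for this example.

```python
import re

# Illustrative patterns (an assumption for this sketch); order matters,
# because the first pattern that matches a token decides its tag.
PATTERNS = [
    (r'.*ing$', 'V'),     # words ending in "ing"
    (r'.*er$', 'ADJR'),   # comparative adjectives
    (r'.*est$', 'ADJS'),  # superlative adjectives
    (r'.*', 'N'),         # catch-all: everything else is tagged as a noun
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        tag = next((t for rx, t in compiled if rx.match(token)), None)
        tagged.append((token, tag))
    return tagged

result = regexp_tag(['I', 'am', 'running', 'faster', 'than', 'light'])
print(result)

# Suffix rules misfire easily: "water" is a noun but ends in "er".
misfire = regexp_tag(['water'])
print(misfire)
```

The second call tags “water” as ADJR, which shows why suffix patterns alone are not reliable.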

Unigram Tagging

In most natural languages one word can behave in different ways in a sentence. Often a word can belong to two or more parts of speech. For example, the word ‘free’ behaves as an adjective in sentence 1 and as a verb in sentence 2.

Sentence 1: 'After the civil war he was a free man.'
Sentence 2: 'He could finally free the legs of man buried under the car'

The unigram tagger works in a very simple manner. It assigns to each word the tag that is most likely for that word. To find the “most likely” tag, the unigram tagger must be trained first. This is where we will be using the tagged corpora.
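The “most likely tag” idea can be sketched in a few lines of plain Python. This is a simplified illustration and not NLTK's code: the function names and the tiny hand-made training set are assumptions made for this example, and the tags come from the tagset listed earlier.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(model, tokens):
    """Look up each token; words never seen in training get None."""
    return [(tok, model.get(tok)) for tok in tokens]

# Tiny hand-made training set (hypothetical, not taken from any corpus).
train = [
    [('a', 'DET'), ('free', 'ADJ'), ('man', 'N')],
    [('the', 'DET'), ('free', 'ADJ'), ('press', 'N')],
    [('free', 'V'), ('the', 'DET'), ('man', 'N')],
]
model = train_unigram(train)
tagged = unigram_tag(model, ['free', 'the', 'legs'])
print(tagged)
```

Here ‘free’ appears twice as ADJ and once as V, so it is always tagged ADJ regardless of the sentence it appears in, and the unseen word ‘legs’ gets no tag at all.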

For training we are using the tagged Brown corpus. This corpus has text in various categories. We use one category to train our tagger:

>>> from nltk.corpus import brown
>>> tagged_sents=brown.tagged_sents(categories='lore')
>>> untagged_sents=brown.sents(categories='lore')
>>> unigram_tagger=nltk.UnigramTagger(tagged_sents)

Once the tagger has been trained we can tag different words using it.

>>> text='After the civil war he was a free man'
>>> tokens=nltk.word_tokenize(text)
>>> unigram_tagger.tag(tokens)
[('After', 'IN'), ('the', 'AT'), ('civil', 'JJ'), ('war', 'NN'), ('he', 'PPS'), ('was', 'BEDZ'), ('a', 'AT'), ('free', 'JJ'), ('man', 'NN')]
>>> text2='He could finally free the legs of man buried under the car'
>>> tokens2=nltk.word_tokenize(text2)
>>> unigram_tagger.tag(tokens2)
[('He', 'PPS'), ('could', 'MD'), ('finally', 'RB'), ('free', 'JJ'), ('the', 'AT'), ('legs', 'NNS'), ('of', 'IN'), ('man', 'NN'), ('buried', 'VBN'), ('under', 'IN'), ('the', 'AT'), ('car', 'NN')]

As you can see, in the second example ‘free’ was again tagged as JJ (adjective), although it is actually a verb there. However, the unigram tagger is more accurate than the taggers we have seen before.

N-gram Tagging

The weakness of the unigram tagger is that while tagging a particular word it doesn’t consider the words surrounding it. In the second sentence of the example above, if the tagger had seen that ‘finally’, an adverb, comes before ‘free’, it could have tagged ‘free’ as a verb. N-gram tagging works on this principle. An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.
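For n = 2 the context is the pair (tag of the previous token, current word). The following plain-Python sketch illustrates that idea only; it is not NLTK's implementation, and the function names and the two training sentences are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Count tags for each (previous tag, current word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def bigram_tag(model, tokens):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev_tag = [], None
    for tok in tokens:
        tag = model.get((prev_tag, tok))  # None if the context was never seen
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged

# Tiny hand-made training set (hypothetical, not taken from any corpus).
train = [
    [('a', 'DET'), ('free', 'ADJ'), ('man', 'N')],
    [('finally', 'ADV'), ('free', 'V'), ('a', 'DET'), ('man', 'N')],
]
model = train_bigram(train)
after_det = bigram_tag(model, ['a', 'free', 'man'])
after_adv = bigram_tag(model, ['finally', 'free', 'a', 'man'])
unseen = bigram_tag(model, ['the', 'free', 'man'])
print(after_det)
print(after_adv)
print(unseen)
```

With this toy data ‘free’ is tagged ADJ after a determiner and V after an adverb. The last call shows the price of the extra context: ‘the’ was never seen, so it gets None, and every following context then also fails to match.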

Here I have used a bigram tagger:

>>> text='After the civil war he was a free man'
>>> tokens=nltk.word_tokenize(text)
>>> bigram_tagger=nltk.BigramTagger(tagged_sents)
>>> bigram_tagger.tag(tokens)
[('After', 'IN'), ('the', 'AT'), ('civil', 'JJ'), ('war', 'NN'), ('he', 'PPS'), ('was', 'BEDZ'), ('a', 'AT'), ('free', 'JJ'), ('man', 'NN')]


More often than not, a combination of these taggers is used to identify the part of speech of a given word. N-gram tagging has given the most accurate results so far, but the time it takes makes it unsuitable for real-time applications.
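One common way to combine taggers is a backoff chain: try the most specific tagger first and fall back to a simpler one when it has no answer. NLTK's taggers accept a `backoff` argument for this purpose. The sketch below shows the idea in plain Python under simplifying assumptions; the function names, the default tag 'N' and the training sentences are made up for illustration.

```python
from collections import Counter, defaultdict

def most_common_tags(counts):
    """Reduce each Counter of tags to its single most frequent tag."""
    return {key: tags.most_common(1)[0][0] for key, tags in counts.items()}

def train(tagged_sents):
    """Build unigram and bigram lookup tables from tagged sentences."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return most_common_tags(uni), most_common_tags(bi)

def backoff_tag(tokens, uni, bi, default='N'):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev_tag = [], None
    for tok in tokens:
        tag = bi.get((prev_tag, tok)) or uni.get(tok) or default
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged

# Tiny hand-made training set (hypothetical, not taken from any corpus).
sents = [
    [('a', 'DET'), ('free', 'ADJ'), ('man', 'N')],
    [('the', 'DET'), ('free', 'ADJ'), ('press', 'N')],
    [('finally', 'ADV'), ('free', 'V'), ('a', 'DET'), ('man', 'N')],
]
uni, bi = train(sents)
combined = backoff_tag(['finally', 'free', 'the', 'teapot'], uni, bi)
print(combined)
```

In this run ‘free’ is resolved by the bigram table (V after an adverb), ‘the’ falls back to the unigram table because the context (V, ‘the’) was never seen, and the unknown word ‘teapot’ receives the default tag. This is where a default tagger earns its place.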
