What is Part-of-Speech tagging?
Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Such a descriptor is called a tag, and it may represent a part of speech, semantic information, and so on.
Part-of-Speech (PoS) tagging, generally written POS tagging, is the process of assigning a part of speech to each given word. In simple terms, POS tagging is the task of labeling each word in a sentence with its appropriate part of speech. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories.
POS tagging is often also known as annotation or POS annotation. The annotation can be performed manually or automatically.
Manual annotation involves getting human annotators to perform POS annotation by hand. It is a particularly laborious process, and because of that, manual annotation is rarely performed on large corpora today.
For this process to be carried out well, more than one annotator is required and attention must be paid to annotator agreement. This is usually facilitated by specialized annotation software, which does not assign POS tags but detects inconsistencies between annotators. When the software detects a word (a token) that has been assigned different tags by different annotators, the annotators need to agree on how to annotate the word, or they may even decide to expand the tagset to accommodate the new situation.
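The consistency check described above can be sketched in a few lines. This is only an illustration, assuming each annotator has tagged the same token sequence. The function name `find_disagreements`, the annotator names, and the tag labels are hypothetical and not taken from any particular annotation tool.

```python
def find_disagreements(annotations):
    """Return positions where annotators assigned different tags.

    `annotations` maps an annotator name to a list of (token, tag) pairs.
    All annotators are assumed to have tagged the same token sequence.
    """
    names = list(annotations)
    reference = annotations[names[0]]
    disagreements = []
    for i, (token, _) in enumerate(reference):
        # Collect the tag each annotator gave to the token at position i.
        tags = {name: annotations[name][i][1] for name in names}
        if len(set(tags.values())) > 1:
            disagreements.append((i, token, tags))
    return disagreements


# Hypothetical annotations of the same three tokens by two annotators.
annotator_a = [("I", "PRON"), ("need", "VERB"), ("help", "NOUN")]
annotator_b = [("I", "PRON"), ("need", "VERB"), ("help", "VERB")]

conflicts = find_disagreements({"a": annotator_a, "b": annotator_b})
for position, token, tags in conflicts:
    print(position, token, tags)
```

The sketch only reports conflicts. Resolving them, or extending the tagset, remains a decision for the annotators.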
Today, manual annotation is mostly used to annotate a small corpus that will serve as training data for the development of a new automatic POS tagger. Performing manual annotation on modern multi-billion-word corpora isn’t really feasible, which is why automatic tagging is used instead.
Given the size of modern corpora, automatic annotation is the only practical option. It is carried out by a tool known as a POS tagger (or just a tagger). POS taggers can achieve an accuracy of up to 98%.
Most of the mistakes are due to phenomena of less interest, such as misspelt words, rare usage or interjections. Another source of inaccuracy is ambiguity, where the same word form can belong to more than one part of speech.
In spite of a few inaccuracies, modern POS taggers annotate the vast majority of a corpus correctly, and the mistakes they make very rarely cause problems when using the corpus.
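To make the notion of tagger accuracy concrete, here is a minimal sketch of how token-level accuracy is commonly computed: the share of tokens whose predicted tag matches a manually assigned reference tag. The function name `tagging_accuracy` and the sample tags are assumptions made for illustration.

```python
def tagging_accuracy(predicted_tags, gold_tags):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("Both tag sequences must cover the same tokens.")
    if not gold_tags:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)


# Hypothetical tagger output compared with a manual annotation.
predicted = ["DET", "NOUN", "VERB", "ADJ"]
gold = ["DET", "NOUN", "VERB", "ADV"]
accuracy = tagging_accuracy(predicted, gold)
print(f"{accuracy:.2%}")
```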
While developing a POS tagger, a relatively small sample (at least 1 million words) of manually annotated training data is required. The POS tagger uses this data to learn how the language must be tagged. It also takes the context of each word into account in order to allocate the most appropriate POS tag.
It is important to remember that if the training data has errors or inconsistencies originating from low annotator agreement, the data annotated automatically by the POS tagger will also reflect these issues.
Most POS tagging approaches fall into one of these categories:
- Rule-based POS tagging
- Stochastic POS tagging
- Transformation-based tagging
What are the types of POS tagging?
1. Rule-based POS tagging
One of the oldest tagging techniques is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, the tagger uses hand-written rules to identify the correct one. Disambiguation can also be performed by analyzing the linguistic features of a word along with those of the preceding and following words. For example, a rule might state that if the preceding word is an article, then the word must be a noun.
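The lexicon lookup followed by hand-written disambiguation rules can be sketched as follows. This is a toy illustration, not a real tagger. The `LEXICON` contents, the tag names, the second rule (a pronoun is followed by a verb), and the fallback of tagging unknown words as nouns are all assumptions made for the example.

```python
# Toy lexicon: each word maps to its possible tags (hypothetical entries).
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "i": ["PRON"],
    "drink": ["NOUN", "VERB"],
    "water": ["NOUN", "VERB"],
    "is": ["VERB"],
    "cold": ["ADJ"],
}


def rule_based_tag(tokens):
    """Tag tokens using a lexicon plus hand-written context rules."""
    tagged = []
    for token in tokens:
        # Assumption: words missing from the lexicon default to NOUN.
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        if len(candidates) == 1:
            tag = candidates[0]
        else:
            previous_tag = tagged[-1][1] if tagged else None
            if previous_tag == "DET" and "NOUN" in candidates:
                # Rule from the text: a word after an article is a noun.
                tag = "NOUN"
            elif previous_tag == "PRON" and "VERB" in candidates:
                # Extra illustrative rule: a word after a pronoun is a verb.
                tag = "VERB"
            else:
                # No rule fired: fall back to the first lexicon entry.
                tag = candidates[0]
        tagged.append((token, tag))
    return tagged


print(rule_based_tag(["I", "drink", "the", "water"]))
print(rule_based_tag(["The", "drink", "is", "cold"]))
```

The same word, "drink", comes out as a verb in the first sentence and as a noun in the second, purely because of the tag of the word before it.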
2. Stochastic POS Tagging
Another technique is stochastic POS tagging. Which models count as stochastic? Any model that incorporates frequency or probability (statistics) can be called stochastic, so any number of different approaches to part-of-speech tagging can be referred to as stochastic taggers.
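One of the simplest frequency-based approaches is a most-frequent-tag (unigram) tagger: each word receives the tag it carried most often in the training data. The sketch below is only one possible instance of a stochastic tagger. The function names, the tiny training set, and the `NOUN` default for unseen words are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each word from annotated sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
    }


def unigram_tag(model, tokens, default="NOUN"):
    """Tag each token with its most frequent training tag."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


# Hypothetical manually annotated training data.
training = [
    [("I", "PRON"), ("drink", "VERB"), ("tea", "NOUN")],
    [("they", "PRON"), ("drink", "VERB"), ("water", "NOUN")],
    [("the", "DET"), ("drink", "NOUN"), ("is", "VERB"), ("cold", "ADJ")],
]

model = train_unigram_tagger(training)
print(unigram_tag(model, ["The", "drink"]))
```

Because this model ignores context, it tags "drink" as a verb even after "The". Stochastic taggers that also model the surrounding tags, such as n-gram or hidden Markov model taggers, are designed to reduce this kind of error.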
3. Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for automatically assigning POS tags to a given text. TBL lets us keep linguistic knowledge in a readable form, and it transforms one state into another by applying transformation rules.
It draws inspiration from both of the previously explained taggers: rule-based and stochastic. Like a rule-based tagger, it relies on rules that specify which tags should be assigned to which words. Like a stochastic tagger, it is a machine learning technique in which rules are automatically induced from data.
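The idea of inducing readable rules from data can be sketched with a single learning step. The sketch starts from an initial tagging, proposes rules of the form "change tag A to tag B when the previous tag is C" from the errors, and keeps the rule that removes the most errors. It is a heavily simplified illustration and not Brill's full algorithm, which uses many rule templates and repeats the step until no rule helps. All names and data here are hypothetical.

```python
def apply_rule(tags, rule):
    """Apply one rule: change from_tag to to_tag when preceded by prev_tag.

    The context is read from the input tags, not from the tags being
    rewritten, so one application cannot trigger another in the same pass.
    """
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags


def count_errors(tags, gold_tags):
    return sum(t != g for t, g in zip(tags, gold_tags))


def learn_best_rule(initial_tags, gold_tags):
    """Return the candidate rule with the largest error reduction, or None."""
    # Propose one candidate rule per tagging error (order-preserving dedupe).
    candidates = list(dict.fromkeys(
        (initial_tags[i], gold_tags[i], initial_tags[i - 1])
        for i in range(1, len(initial_tags))
        if initial_tags[i] != gold_tags[i]
    ))
    errors_before = count_errors(initial_tags, gold_tags)
    best_rule, best_gain = None, 0
    for rule in candidates:
        gain = errors_before - count_errors(apply_rule(initial_tags, rule), gold_tags)
        if gain > best_gain:
            best_rule, best_gain = rule, gain
    return best_rule


# Tokens: the / drink / is / cold. The initial tagging mislabels "drink".
initial = ["DET", "VERB", "VERB", "ADJ"]
gold = ["DET", "NOUN", "VERB", "ADJ"]

rule = learn_best_rule(initial, gold)
print(rule)
print(apply_rule(initial, rule))
```

The learned rule stays human-readable, in the manner of a hand-written rule, yet it was selected automatically from annotated data.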
What is part-of-speech tagging used for?
A part-of-speech tag (also known as a POS tag) is a special label assigned to each token (word) in a body of text to denote its part of speech and, quite frequently, other grammatical categories such as tense, number (plural/singular), and case.
Part-of-speech tags are used in corpus searches and even in text analysis tools and algorithms.
Automatic text processing tools make use of POS tagging so that they can take into consideration which part of speech every word is. This helps in making use of linguistic criteria along with statistics.
For languages in which the same word can have various parts of speech, e.g. ‘drink’ in English, POS tags are used to differentiate between the occurrences of the word when used as a noun or verb.
POS tags are also useful for searching for examples of grammatical or lexical patterns without specifying a concrete word, e.g. identifying examples of any plural noun that is not preceded by an article.
The two can also be combined, e.g. identifying the word ‘help’ used as a noun followed by any verb in the past tense.
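Both kinds of search can be sketched over a corpus stored as (word, tag) pairs. The example assumes Penn Treebank-style tags (`NNS` for plural nouns, `NN` for singular nouns, `DT` for determiners, `VBD` for past-tense verbs). The function names and the sample sentence are hypothetical. For simplicity, "not preceded by an article" is checked only against the immediately preceding token, with `DT` standing in for articles.

```python
def plural_nouns_without_article(tagged):
    """Find plural nouns whose immediately preceding token is not a determiner."""
    hits = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "NNS" and (i == 0 or tagged[i - 1][1] != "DT"):
            hits.append(word)
    return hits


def noun_help_before_past_verb(tagged):
    """Find 'help' tagged as a noun and directly followed by a past-tense verb."""
    hits = []
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if word.lower() == "help" and tag == "NN" and next_tag == "VBD":
            hits.append((word, next_word))
    return hits


# Hypothetical tagged corpus: "Dogs chased the cats. Help arrived late."
corpus = [
    ("Dogs", "NNS"), ("chased", "VBD"), ("the", "DT"), ("cats", "NNS"), (".", "."),
    ("Help", "NN"), ("arrived", "VBD"), ("late", "RB"), (".", "."),
]

print(plural_nouns_without_article(corpus))
print(noun_help_before_past_verb(corpus))
```

The first query specifies no concrete word at all, while the second combines a concrete word with tag constraints.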