The general purpose of a part-of-speech tagger is to associate each word in a text with its morphosyntactic category (represented by a tag), as in the following example:
This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
The tagging process consists of three steps:
tokenization: break a text into tokens
lexical lookup: provide all potential tags for each token
disambiguation: assign to each token a single tag
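The three steps above can be sketched as a small pipeline. This is only an illustrative toy, not the actual tagger: the regular-expression tokenizer, the tiny `LEXICON`, the `UNKNOWN` and `VERB` tags, and the "first tag wins" disambiguator are all assumptions made for the example.

```python
import re

# Hypothetical toy lexicon: each word maps to all of its potential tags.
LEXICON = {
    "this": ["PRON", "DET"],
    "is": ["VAUX_3SG"],
    "a": ["DET"],
    "sentence": ["NOUN_SG", "VERB"],
    ".": ["SENT"],
}

def tokenize(text):
    # Step 1 (assumed regex stand-in): split into words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def lookup(token):
    # Step 2: return every potential tag; unknown tokens get a placeholder tag.
    return LEXICON.get(token.lower(), ["UNKNOWN"])

def disambiguate(candidates):
    # Step 3 (placeholder): keep the first listed tag for each token.
    return [tags[0] for tags in candidates]

def tag(text):
    tokens = tokenize(text)
    candidates = [lookup(t) for t in tokens]
    tags = disambiguate(candidates)
    return " ".join(f"{tok}+{t}" for tok, t in zip(tokens, tags))

print(tag("This is a sentence."))
# prints: This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
```

Keeping the steps as separate functions mirrors the description: each stage can be swapped for a language-specific component without touching the others.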
Each step is performed by an application program which uses language-specific data:
The tokenization step uses a finite-state transducer to insert token boundaries around simple words (or multi-word expressions), punctuation marks, numbers, etc.
Lexical lookup requires a morphological analyser to associate each token with one or more readings. Unknown words are handled by a guesser which provides potential part-of-speech categories based on affix patterns.
Disambiguation is statistical: a Hidden Markov Model (HMM) selects the most likely tag for each token in its context.
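As a rough illustration of HMM-based disambiguation, the sketch below runs the Viterbi algorithm over the candidate tags produced by lexical lookup. The tag set, the probability tables, the `FLOOR` smoothing value, and the example sentence are all invented for the example; they are not taken from the Xerox training tools.

```python
import math

FLOOR = 1e-6  # assumed smoothing value for events missing from the toy tables

def log_prob(table, key):
    return math.log(table.get(key, FLOOR))

def viterbi(candidates, trans, emit, start):
    """Return the most probable tag sequence.

    candidates: list of (token, [potential tags]) pairs from lexical lookup.
    start[tag], trans[(prev, cur)], emit[(tag, token)]: toy probabilities.
    """
    token, tags = candidates[0]
    # best[tag] = (log-probability, best tag path ending in this tag)
    best = {t: (log_prob(start, t) + log_prob(emit, (t, token)), [t]) for t in tags}
    for token, tags in candidates[1:]:
        new_best = {}
        for cur in tags:
            # Choose the best predecessor tag for the current tag.
            score, path = max(
                (best[prev][0] + log_prob(trans, (prev, cur)), best[prev][1])
                for prev in best
            )
            new_best[cur] = (score + log_prob(emit, (cur, token)), path + [cur])
        best = new_best
    return max(best.values())[1]

# Hypothetical model: "can" and "rusts" are each ambiguous between NOUN and VERB.
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("DET", "the"): 1.0,
        ("NOUN", "can"): 0.4, ("VERB", "can"): 0.6,
        ("NOUN", "rusts"): 0.2, ("VERB", "rusts"): 0.8}
candidates = [("the", ["DET"]), ("can", ["NOUN", "VERB"]), ("rusts", ["NOUN", "VERB"])]

print(viterbi(candidates, trans, emit, start))
# prints: ['DET', 'NOUN', 'VERB']
```

In this toy model "can" is more likely a verb in isolation, but the preceding determiner makes the noun reading win, which is the kind of contextual choice the disambiguation step is meant to make.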
Using the Xerox HMM training tools, we have developed part-of-speech disambiguators for various languages including Czech, English, French, German, Greek, Hungarian, Italian, Polish and Russian.