S.Veera alagiri
Research scholar
Department of Linguistics
Tamil University
Tanjavur
Email: alagiri.bagath@gmail.com
Mapping of Nominal Suffixes in Tamil to Malayalam for Morphological Generator Machine Translation
Abstract
Nouns are those units which take case markers. In both Tamil and Malayalam, nouns are declinable for gender, number and case. The gender distinction in the Dravidian languages is a more rational one: masculine and feminine distinctions are generally observed only for rational beings. In stylistic usage, however, a gender distinction can be made even in the case of irrational beings; for instance, to refer to a cow, the speaker has the freedom to use 'she' or 'it'. To a large extent, gender in Tamil and Malayalam is a semantic-cum-grammatical category. There are three genders in Tamil and Malayalam: masculine, feminine and neuter. In both languages nouns take case markers and thereby distinguish themselves from other grammatical categories such as verbs, adjectives and adverbs. The first and foremost task in morphology is to split inflected or complex words into smaller meaningful morphemic units and to map them to similar units in the target language. The present paper focuses on the mapping of nominal suffixes in Tamil and Malayalam for a morphological generator for machine translation.
Introduction
The present paper deals with the generation of a Tamil word form for a given Malayalam word form. For example, a nominal form in Tamil must get its equivalent nominal form in Malayalam.
Preparation of morph generator
The preparation of generator involves the following two important steps:
1. The analysis of tokenized word forms of the source language into minimal meaningful units (which include case suffixes, number suffixes and postpositions for nouns, and suffixes and bound auxiliary verbs expressing tense, mood and aspect for verbs), and the assignment of semantic features to each unit; and
2. The generation of the target word forms out of the analyzed meaningful units.
Let us assume that the generation of word forms is the concatenation of morphemes and bound words. The simplest way to build such a model is to assume that the allowability of a particular morpheme or bound word in a given context depends only upon the morpheme that precedes it. Take the word ceytuviTTeen, whose morphological structure is cey+tu+viT+T+een. The combination of cey, which is a verb, and tu, which is an adverbial marker, forms the first chunk of the verbal complex; the combination of viT, which is a secondary verb, T, which is a past tense marker, and een, which is a person-number marker, forms the second chunk of the word. Together the two chunks (V1 and V2) make up a verbal complex.
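The assumption that each morpheme is licensed only by the one before it can be illustrated with a small successor table. The following is a minimal sketch, not the system described here: the category labels (V, ADV, V2, TENSE, PNG) and the table itself are assumptions made for this illustration, and the table covers only the single example ceytuviTTeen.

```python
# Successor table over morpheme categories. The labels are hypothetical
# names chosen for this sketch; the table licenses only the one example.
ALLOWED_NEXT = {
    "START": {"V"},
    "V": {"ADV"},       # cey + tu
    "ADV": {"V2"},      # tu + viT
    "V2": {"TENSE"},    # viT + T
    "TENSE": {"PNG"},   # T + een
    "PNG": {"END"},
}


def is_well_formed(categories):
    """Return True if every category is licensed by the one preceding it."""
    previous = "START"
    for category in list(categories) + ["END"]:
        if category not in ALLOWED_NEXT.get(previous, set()):
            return False
        previous = category
    return True


# Analysis of ceytuviTTeen as given in the text: cey+tu+viT+T+een.
analysis = [("cey", "V"), ("tu", "ADV"), ("viT", "V2"), ("T", "TENSE"), ("een", "PNG")]

well_formed = is_well_formed([category for _, category in analysis])
surface = "".join(morpheme for morpheme, _ in analysis)
print(well_formed, surface)
```

Because the check looks only at adjacent pairs, the table can be extended one pair at a time; a realistic table would need many more categories and transitions than this toy one.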
Finite State Automata
One can easily model such a system with a finite-state machine (finite-state automaton or finite-state transition network) of the kind familiar from formal language theory. More formally, a finite automaton is defined by the following five parameters:
Q: a finite set of N states q0, q1, ..., qN
Σ: a finite input alphabet of symbols
q0: the start state
F: the set of final states, F ⊆ Q
δ(q, i): the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q, i) returns a new state q' ∈ Q. δ is thus a relation from Q × Σ to Q.
A set of rewrite rules, or grammar, can be thought of not only as an intensional definition of the units of a language, but also as the description of an abstract machine which is able to accept (or generate) just those strings which belong to the language. These machines are called (abstract) automata, and the complexity of their potential behaviour is directly determined by the internal complexity of the rules of the grammar to which they correspond. All such machines may be characterized in terms of a set of states, an input device which can access one input symbol at a time, and a control unit which can examine and read input and cause the machine to shift from one state to another. In addition, such machines may have memory, which can be accessed by the control unit for storing and testing symbols. The computational power of the machine is determined essentially by the complexity of its memory and of the operations that can be performed on the memory. The simplest of these automata are called finite state machines or finite state automata (FSA), and the languages they can accept are regular languages.
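The five-parameter definition can be written down almost literally in code. The fragment below is a minimal sketch under stated assumptions: the class name FSA, the state names q0 to q3, the input symbols STEM, PL and CASE, and the stem-plural-case ordering are chosen for this illustration and are not taken from the system described in this paper.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # input alphabet
    start: str           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # transition function: (state, symbol) -> state

    def accepts(self, symbols):
        """Run the automaton; accept if it ends in a final state."""
        state = self.start
        for symbol in symbols:
            if (state, symbol) not in self.delta:
                return False  # no transition defined: reject
            state = self.delta[(state, symbol)]
        return state in self.finals


# Hypothetical noun automaton: a stem, optionally a plural suffix,
# optionally a case suffix, in that order.
noun_fsa = FSA(
    states=frozenset({"q0", "q1", "q2", "q3"}),
    alphabet=frozenset({"STEM", "PL", "CASE"}),
    start="q0",
    finals=frozenset({"q1", "q2", "q3"}),
    delta={
        ("q0", "STEM"): "q1",
        ("q1", "PL"): "q2",
        ("q1", "CASE"): "q3",
        ("q2", "CASE"): "q3",
    },
)

print(noun_fsa.accepts(["STEM", "PL", "CASE"]))  # stem + plural + case
print(noun_fsa.accepts(["STEM", "CASE", "PL"]))  # case before plural
```

Storing the transition function as a dictionary keeps it partial: any (state, symbol) pair that is absent is treated as a rejection, which is the usual convention for recognizers of this kind.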
PC-KIMMO
The two-level morphology of Koskenniemi has been widely implemented, particularly for European languages. PC-KIMMO can perform two functions as a morphological processor: generation and recognition. The generator component of PC-KIMMO takes the lexical form of a word as input. The rules recorded in the rules file are applied to this input, and the corresponding surface form is returned; the generator does not use the lexicon. The recognizer accepts a surface form as input, applies the rules, consults the lexicon, and returns the corresponding lexical form and gloss string. What makes the model easy for the user and economical is the fact that the same set of rules can be used for both recognition and generation: the rules are bi-directional (1990:8). If a phonological (or, for that matter, a morphological or orthographic) rule is written to describe a transformation from the underlying structure to the surface structure, it need not be rewritten for the reverse process, in which processing goes from the surface form to the underlying form. This model can be represented through the following diagram (Simons 1989).
Surface form | Underlying/lexical form
Recognizer (rules and lexicon): cenReen -> cel+ndt+een, V+PAST+FPS
Generator (rules only): cenReen <- cel+ndt+een
(FPS = first person singular)
The two levels used in this model are linguistic descriptions that can handle any two levels of abstraction, be it allophonic alternations, morphophonemic alternations or orthographic variations.
Transfer Component
Normally in language processing, sentences are parsed to identify their syntactic structure. There are more similarities than differences between the two Indian languages; hence the Malayalam-Tamil language pair does not require a full parse. A structural transformation is required only when the source language structure does not have the same structure in the target language. A partial or shallow parse is sufficient to identify the specific constituents in the sentence that have to undergo transformation. Many partial parsing systems use cascades of finite state automata instead of context-free grammars.
The syntactic differences between the languages are found in complex sentence constructions, particularly participles. These would be bridged by means of a transfer grammar. This component will also include the task of transliteration. A module that can perform transliteration among Indian languages, including Urdu, needs to be developed. Transliteration allows a word or words to be rendered in the script of the reader. For example, if a person who knows Hindi reads Bangla text in Devanagari, (s)he can still understand some parts of the meaning.
It can be seen that even when no other linguistic resources for translation are available among Indian languages, transliteration can still allow a reader to try to read and understand. Indian languages share a large number of lexical items, and simply by a change in script the reader can understand quite a few things. When linguistic resources are available, transliteration can still be used to render words that are not handled by the MT system in the user's script.
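A very naive form of script-to-script transliteration can be sketched by exploiting the fact that the Unicode blocks for the major Indian scripts are laid out in parallel, so that corresponding letters mostly sit at the same offset within their blocks. The function name transliterate and the choice of scripts below are assumptions of this sketch. It does not check whether the target code point is actually assigned, and it does not cover Urdu, whose script is not organised this way.

```python
# First code point of each script's Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def transliterate(text, source, target):
    """Shift every character of the source block to the same offset
    in the target block; leave all other characters unchanged."""
    src = BLOCK_START[source]
    tgt = BLOCK_START[target]
    out = []
    for ch in text:
        code = ord(ch)
        if src <= code < src + BLOCK_SIZE:
            out.append(chr(code - src + tgt))
        else:
            out.append(ch)
    return "".join(out)


# Bangla text rendered in Devanagari, as in the example in the text.
bangla = "\u0995\u09ae\u09b2"  # the Bangla letters ka, ma, la
print(transliterate(bangla, "bengali", "devanagari"))
```

A usable module would need per-script exception tables, since not every letter has a counterpart in every script (Tamil, for example, has far fewer consonant letters than Devanagari).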
Classification of Malayalam Nouns
Taking into account the morphophonemic (or sandhi) changes of form during the addition of nominal suffixes, the following paradigm types are identified for nouns and pronouns of Malayalam (adopted from the ILILMT project and "Morphological Generator for Malayalam-Tamil Machine Translation" by S. Veeraalagiri, MPhil thesis, 2008).
1. aana 2. aaR 3. kuuN~ 4. maraM 5. paanp 6. vaay 7. makan~ 8. amma 9. makaL~ 10. paSu 11. suuryan~ 12. akatt 13. aangngaLa 14. peNN 15. aatmaav 16. kalaakaari 17. raajaav 18. kavi 19. manushyar~ 20. kariiM 21. kaLatraM 22. vadhu 23. bhikshu 24. yattiiM 25. onn 26. kabir~ 27. kaaT
Classification of Tamil Nouns
Taking into account the morphophonemic (or sandhi) changes that take place when grammatical forms that are suffixal in nature are concatenated with nominal bases, the following paradigm types are identified for Tamil nouns for computational morphological processing.
1. aaN 2. aaRu 3. eli 4. itazh 5. ii 6. kaN 7. kaal 8. kaaTu 9. manitan 10. maram 11. muL 12. maan 13. ndaay 14. pon 15. poy 16. pul 17. poruL 18. puu 19. vaNTu 20. teer 21. tool 22. pas 23. vaLarvana
Tables for Tamil Nominal suffixes
The table given below shows the flow of suffixes from one level to another and the set of suffixes occurring at each level.
Noun class | Base alternants | Plural suffix | Accusative case suffix | Instrumental case suffix | Dative case suffix | Genitive case suffix-1 | Genitive case suffix-2 | Genitive case suffix-3 | Locative case suffix | Sociative case suffix-1 | Sociative case suffix-2
Class 1 (kaTaa-class) | kaTaa ~ kaTaav | kkaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 2 (eli-class) | eli ~ eliy | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 3 (ii-class) | ii ~ iiy | kkaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 4 (vaNTu-class) | vaNTu ~ vaNT | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 5 (kaaTu-class) | kaaTu ~ kaaTT | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 6 (aaRu-class) | aaRu ~ aaRR | kaL (aaRukaL) | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 7 (kaN-class) | kaN ~ kaNN | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 8 (pon-class) | pon ~ ponn | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 9 (ndaay-class) | ndaay | kaL | ai/inai | aal/inaal | kku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 10 (pul-class) | pul ~ pull ~ puR | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 11 (muL-class) | muL ~ muLL ~ muT | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 12 (maram-class) | maram ~ marang ~ maratt | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
Class 13 (kaal-class) | kaal | kaL | ai/inai | aal/inaal | ukku/inukku | in | atu/inatu | uTaiya/inuTaiya | il/inil | ooTu | uTan
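The class table above lends itself to a table-driven generator. The following is a minimal sketch covering four of the two-alternant classes and the short variant of five case suffixes. It rests on an assumption that the table itself does not state: that the first base alternant is used before the consonant-initial plural suffix and the second before vowel-initial case suffixes. The names NOUN_CLASSES, CASE_SUFFIXES and generate are hypothetical.

```python
# class name -> (base before consonant-initial suffix,
#                base before vowel-initial suffix, plural suffix)
# Which alternant goes where is an assumption of this sketch.
NOUN_CLASSES = {
    "eli": ("eli", "eliy", "kaL"),
    "vaNTu": ("vaNTu", "vaNT", "kaL"),
    "kaaTu": ("kaaTu", "kaaTT", "kaL"),
    "kaN": ("kaN", "kaNN", "kaL"),
}

# Short variants of the case suffixes listed in the table.
CASE_SUFFIXES = {
    "accusative": "ai",
    "instrumental": "aal",
    "dative": "ukku",
    "locative": "il",
    "sociative": "ooTu",
}


def generate(noun_class, plural=False, case=None):
    """Concatenate base, optional plural suffix and optional case suffix."""
    plain, before_vowel, plural_suffix = NOUN_CLASSES[noun_class]
    if plural:
        stem = plain + plural_suffix      # e.g. eli + kaL
    elif case is not None:
        stem = before_vowel               # e.g. eliy before ai
    else:
        stem = plain
    if case is None:
        return stem
    return stem + CASE_SUFFIXES[case]


print(generate("eli", case="accusative"))
print(generate("kaaTu", plural=True, case="locative"))
```

Classes with three base alternants (pul, muL, maram) would need a richer condition than the two-way consonant/vowel split used here.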
Bilingual Mapping
The root words identified by the morphological analyzer are looked up in a bilingual dictionary for the target language equivalent. This bilingual dictionary contains the root word equivalents of the target language as well as their category (say noun, verb etc.), paradigm and other necessary information. This stage also uses dictionaries to identify the target language equivalents of the source language grammatical suffixes. We consider the example taken earlier. Here the Tamil equivalents for the Malayalam words analyzed by the morphological analyzer are given as an example.
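Tamil forms in IPK notation (roman transliteration) ------ Equivalent Malayalam forms
ahid (yaanai) ------ Ana
ahidfs; (yaanaikaL) ------ AnakalYZ
ahidia (yaanaiyai) ------ AnayeV / Anaye
ahidfis (yaanaikaLai) ------ AnakalYeV / AnakalYe
ahidBahL (yaanaiyooTu) ------ Anayot
ahidfBshL (yaanaikaLooTu) ------ AnakalYot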
Noun Suffix Mapping
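Malayalam suffix ------ Tamil suffix in IPK notation (roman transliteration)
kaL ------ fs; (kaL)
ye ------ ia (yai)
yee ------ ia (yai)
kaLe ------ fis (kaLai)
kaLee ------ fis (kaLai)
yooT ------ BahL (yooTu)
kaLooT ------ fBshL (kaLooTu)
ykk ------ f;F (kku)
kk ------ f;F (kku)
kaLkk ------ fSf;F (kaLukku)
yaal ------ ahy; (yaal)
kaLaal ------ fshy; (kaLaal)
yuTe ------ a[ila (yuTaiya)
kaLuTe ------ fSila (kaLuTaiya)
yil ------ apy; (yil)
kaLil ------ fspy; (kaLil)
yooLaM ------ mst[ (aLavu)
kaLooLaM ------ fs;mst[ (kaL aLavu)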
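The suffix correspondences listed above can be held in a simple lookup table. The following is a minimal sketch of the idea, not the system's implementation. The Tamil suffixes are given in roman transliteration, which is this sketch's own reading of the IPK-notation strings; the one-entry root dictionary (aana to yaanai) and the names SUFFIX_MAP, ROOT_MAP and transfer are likewise assumptions made for illustration.

```python
# Malayalam suffix -> Tamil suffix. The romanized Tamil values are this
# sketch's reading of the IPK-notation strings in the mapping list.
SUFFIX_MAP = {
    "kaL": "kaL",
    "ye": "yai", "yee": "yai",
    "kaLe": "kaLai", "kaLee": "kaLai",
    "yooT": "yooTu", "kaLooT": "kaLooTu",
    "ykk": "kku", "kk": "kku", "kaLkk": "kaLukku",
    "yaal": "yaal", "kaLaal": "kaLaal",
    "yuTe": "yuTaiya", "kaLuTe": "kaLuTaiya",
    "yil": "yil", "kaLil": "kaLil",
}

# Hypothetical one-entry bilingual root dictionary (Malayalam -> Tamil).
ROOT_MAP = {"aana": "yaanai"}


def transfer(malayalam_word):
    """Split a Malayalam noun form into known root + suffix and map both
    to Tamil. Returns None when root or suffix is not in the tables."""
    for root, tamil_root in ROOT_MAP.items():
        if not malayalam_word.startswith(root):
            continue
        suffix = malayalam_word[len(root):]
        if suffix == "":
            return tamil_root
        if suffix in SUFFIX_MAP:
            return tamil_root + SUFFIX_MAP[suffix]
    return None


print(transfer("aanakaLooT"))
print(transfer("aanaye"))
```

Mapping whole suffix strings, glide included, keeps the lookup trivial at the cost of listing each combination separately; a fuller system would map plural and case morphemes individually and apply sandhi in the generator.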
Morphological Generation
In order to generate a word form for a specific grammatical category the corresponding suffixes have to be concatenated with the root word. The Morphological generator takes its input from the transfer grammar component. The input would be the root word along with its grammatical features. The generator then inflects the root word according to the morphology of the language and outputs the target language word form. The words thus generated are concatenated to form the complete target language sentence.
It should be admitted here that information on subject agreement is missing in the case of Malayalam; reconstructing such information is a challenging task. It is proposed here to make use of the case markers as well as semantically annotated corpora for this purpose.
Conclusion
In this paper, the generation of Tamil word forms from the output of the Tamil morphological analyzer, as part of a morphological generator for Tamil-Malayalam machine translation, has been explained briefly. For example, a nominal form in Tamil must get its equivalent nominal form in Malayalam.
Thus the aim of the paper is fulfilled. A morphological generator is prepared which generates Malayalam word forms from the Tamil morphological analyzer input. The generator thus prepared is very useful for machine translation of Tamil into Malayalam.
The preparation of a morphological analyzer is a preliminary step to be taken before venturing into any other type of natural language processing. Tamil, being a complex inflectional language, is a challenge for computational linguists who try to prepare a morphological analyzer or generator for it. There have been attempts to prepare morphological generators and analyzers for Tamil. Preparing a morphological generator or analyzer for the purpose of demonstration is quite different from preparing an exhaustive and efficient working system. It is not too difficult to prepare a system with a coverage of 60 to 65%. However, increasing the coverage beyond 65% to reach a realistic level of 95% to 97% is a hard task. Most of the existing descriptions focus on narrow, vertical detail rather than on breadth and exhaustiveness. Generally, the coverage obtained by incorporating the descriptions of the existing work is never beyond 50%. The evaluation indicates that the efficiency of the morphological analyzer prepared for Tamil by the present team is remarkable.
A morphological analyzer or generator has many natural language applications, such as parsing, text generation, machine translation, dictionary tools and lemmatization. It also helps in speech applications such as text-to-speech synthesis and speech recognition, in word processing applications such as spell checking and text input, and in document retrieval. Finding the categorial details of word forms helps in assigning part-of-speech tags to the nominal and verbal complex.
REFERENCES
Agesthialingom, S. 1964. ‘Auxiliary verbs in Tamil’. Tamil Culture 11:3.
Annamalai, E. 1985. Dynamics of verbal extension. Trivandrum: Dravidian Linguistics Association.
Asher, R. E. and Kumari, T. C. 1997. Malayalam. London: Routledge.
Lehmann, T. 1993. A grammar of Modern Tamil. Pondicherry: Pondicherry Institute of Linguistics and Culture.
Rangan, K. 1970. ‘Modals as main verbs in Tamil’. In Proceedings of the First All India Conference of Linguists. Poona: Deccan College.
Rajendran, S. 1999. ‘Spell and grammar checker for Tamil’. Paper read at the 27th All India Conference of Dravidian Linguists held at ISDL, Thiruvananthapuram, 17th-19th July 1999.
Rajendran, S. Morphological Analyzer for Tamil Nominal Complex. Language in India, languageinindia.com.
Rao, Umamaheshwar G. 1996. ‘Compound verb formation in Telugu’. Paper presented at the National Workshop-cum-seminar on Lexical Typology, Telugu University, Hyderabad.
Spencer, A. 1991. Morphological theory: An introduction to word structure in generative grammar. Cambridge: Basil Blackwell.
Veeraalagiri, S. 2008. ‘Morphological Generator for Malayalam-Tamil Machine Translation’. MPhil thesis, Tamil University, Tanjavur.