Monday, 9 December 2013

TEXT TO SPEECH SYNTHESIZER FOR TAMIL LANGUAGE


Dr. S. VEERA ALAGIRI
DEPT OF CEN
AMRITA UNIVERSITY
COIMBATORE.

1 Introduction
Incorporating human faculties such as speech and vision into machines is a basic concern of artificial intelligence research. The capability of a computer to generate speech output is termed speech synthesis, and it requires an in-depth understanding of speech production and perception. There is widespread interest in improving the human interface to the computer: people no longer want to sit and type in the required data, or to read data from a monitor, since this is a painstaking task that strains the eyes. Speech synthesis is therefore becoming one of the most important steps towards improving the human interface to the computer. The main aim of this paper is to describe the development of a text-to-speech synthesis system for Tamil.

The implementation of this TTS system is shown in Figure 1. Our subsystem uses the concatenation method, the approach used in many popular speech engines available today. Such a system will be very useful to visually impaired people and to anyone who wants digital books read aloud. Nowadays many books are being digitized, and many foreign books are translated into Tamil and then digitized. This system will also help people who cannot read Tamil and people who do not want to spend time reading.

The field of Tamil TTS has remained untouched for a long time for various reasons; a few of them are listed below.

1. The complexity of the Tamil language.
2. The problems posed by Tamil grammar.
3. Limited knowledge of pure Tamil.
4. The huge gap between spoken Tamil, which is full of slang, and the pure Tamil that is written.

Even though some attempts have been made to develop a TTS system for Tamil, the required naturalness has not yet been achieved. The system described here aims to reach that level.
Figure 1: Block Diagram of TTS System

2 Text Processing
The Tamil text that is entered cannot be given directly to the unit-selection process; the raw text must first be processed before being passed on. The text-to-speech system needs its input as syllables, but it does not know where to split a given word into a sequence of syllables, so to perform this syllabification we need to tell the machine where to break the word or sentence.
For example, மார்ச் 31 needs to be pronounced மார்ச் முப்பத்தி ஒன்று, not மார்ச் மூன்று ஒன்று. The expansion of ரூ.1 should be ரூபாய் ஒன்று; it should not be pronounced as ரூ ஒன்று. The second step in text normalization is normalizing non-standard words. Non-standard words are tokens like numbers or abbreviations, which need to be expanded into sequences of Tamil words before they can be pronounced.
The TTS system comprises these five fundamental components:
1.      Text Analysis and Detection
2.      Text Normalization and Linearization
3.      Phonetic Analysis
4.      Prosodic Modeling and Intonation
5.      Acoustic processing

The input text is passed through these phases to obtain the speech.

Input Text → Text Analysis & Text Detection → Text Normalization & Text Linearization → Phonetic Analysis → Prosodic Modeling & Intonation → Acoustic Processing → Speech as output

Figure 2: System Overview of TTS

2.1 Text Analysis and Detection

The text analysis part is a preprocessing stage that analyses the input text and organizes it into a manageable list of words. It handles numbers, abbreviations, acronyms and idiomatic expressions, transforming them into full text where needed. An important problem is encountered as early as the character level: punctuation ambiguity (sentence-end detection). It can be solved, to some extent, with elementary regular grammars.
Text detection means localizing the text areas in any kind of printed document. Most previous research has concentrated on extracting text from video. We aim at developing a technique that works for all kinds of documents, such as newspapers, books, etc.

2.2   Text Normalization and Linearization

Text normalization is the transformation of text into a pronounceable form. It is often performed before the text is processed in some other way, such as generating synthesized speech or automated language translation. The main objective of this process is to identify punctuation marks and pauses between words. Usually the normalization process converts all letters to lowercase or uppercase and removes punctuation, accent marks, stopwords (or "too common" words) and other diacritics from letters.

Text normalization is also useful, for example, for comparing two sequences of characters which are represented differently but mean the same.
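As an illustration of normalization for comparison, the following Python sketch (the function name and rule choices are our own, not part of the system described here) lowercases text, strips Latin-script accent marks and punctuation, and drops a caller-supplied stopword list:

```python
import re
import unicodedata

def normalize_text(text, stopwords=frozenset()):
    """Lowercase, strip accents and punctuation, drop stopwords."""
    # Decompose accented letters and remove the combining accent marks
    # (this step targets Latin-script accents).
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = text.lower()
    # Replace punctuation with spaces, keeping letters, digits, whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Drop "too common" words supplied by the caller.
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)
```

After this treatment, two differently written sequences that mean the same compare equal.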

2.3   Phonetic Analysis

Phonetic analysis converts the orthographic symbols into phonological ones using a phonetic alphabet; this is known as "grapheme-to-phoneme" conversion. A phone is the smallest sound unit, a sound with a definite shape as a sound wave. A collection of phones that constitutes a minimal distinctive phonetic unit is called a phoneme. There are two approaches to pronouncing a word based on its spelling:
(a) the dictionary-based approach
(b) the rule-based approach.
In the dictionary-based approach, a dictionary storing all words with their correct pronunciation is kept, and producing the pronunciation is a matter of looking up each word. This approach is quick and accurate, and the pronunciation quality is better, but its major drawbacks are that it needs a large database to store all the words, and the system will stop if a word is not found in the dictionary.
In the rule-based approach, the letter sounds of a word are blended together to form a pronunciation according to rules. The main advantage is that it requires no database and it works on any type of input; on the other hand, the complexity of the rules grows for irregular inputs.
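A hybrid of the two approaches can be sketched in Python as follows; the lexicon entry and the letter-to-sound rules here are purely illustrative, not the actual tables of this system:

```python
# Hypothetical lexicon and letter-to-sound rules, for illustration only.
PRON_DICT = {
    "kamarajar": ["kA", "ma", "rA", "ja", "r"],
}

# Rule-based fallback: spellings are tried longest-first at each position.
LETTER_RULES = [("aa", "A"), ("zh", "zh"), ("a", "a"), ("k", "k"),
                ("m", "m"), ("r", "r"), ("j", "j")]

def to_phonemes(word):
    """Dictionary lookup first; rule-based conversion as fallback."""
    if word in PRON_DICT:
        return PRON_DICT[word]
    phones, i = [], 0
    while i < len(word):
        for spelling, phone in LETTER_RULES:
            if word.startswith(spelling, i):
                phones.append(phone)
                i += len(spelling)
                break
        else:
            i += 1  # no rule covers this character; skip it
    return phones
```

The dictionary path covers known words exactly, while the rule path degrades gracefully for words outside the lexicon, which is the trade-off described above.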

2.4    Prosodic Modeling And Intonation

Prosody is the combination of stress pattern, rhythm and intonation in speech. Prosodic modeling describes the speaker's emotion. Recent investigations suggest that identifying the vocal features which signal emotional content may help to create very natural synthesized speech.

Intonation is the variation of pitch while speaking. All languages use pitch as intonation to convey an instance, to express happiness, to raise a question, etc. Modelling intonation is an important task that affects the intelligibility and naturalness of the speech. To achieve high-quality text-to-speech conversion, a good model of intonation is needed.
Generally intonations are distinguished as follows:

(i) Rising intonation (the pitch of the voice increases)
(ii) Falling intonation (the pitch of the voice decreases)
(iii) Dipping intonation (the pitch of the voice falls and then rises)
(iv) Peaking intonation (the pitch of the voice rises and then falls)

2.5    Acoustic Processing
The speech is produced according to the voice characteristics of a person. There are three types of acoustic synthesis:

(i) Concatenative synthesis
(ii) Formant synthesis
(iii) Articulatory synthesis

Concatenative synthesis is the concatenation of prerecorded human voice; the process needs a database containing all the prerecorded units. Natural-sounding speech is its main advantage, and its main drawback is building and using the large database.

Formant-synthesized speech can be consistently intelligible. It does not use a database of speech samples, so the speech sounds artificial and robotic. The speech organs are called articulators; in articulatory synthesis, techniques for synthesizing speech based on models of the human vocal tract are developed. It produces a completely synthetic output, typically based on mathematical models.

2.6 Preprocessing

Before the syllabification process, some preprocessing should be done, namely tokenization. The first task in text normalization is sentence tokenization. In order to segment a given Tamil paragraph into separate utterances for synthesis, we need to know that the first sentence ends at the full stop (.), not at the period inside தா.நா. It is fairly easy to tokenize when sentences end with a full stop, as most do, but there are cases where a period appears inside an abbreviation, or a sentence ends with a semicolon or some other punctuation.
This problem can be solved by expanding the abbreviation and removing the unwanted punctuation. For example:
Ø  மகா வீட்டுக்குச் சென்றாள்.
There is no problem tokenizing this sentence, because it can be tokenized with reference to the full stop. But for a sentence like
Ø  மகா தா.நா. சென்றாள்.
an abbreviation introduces a full stop in the middle of the sentence. The problem is solved by expanding the abbreviation so that no full stop remains inside the sentence, and it can then be tokenized as in the previous case. Not all Tamil abbreviations can be expanded; instead, the most commonly used abbreviations are stored in a separate database. When an abbreviation appears in the text, the system searches the database for it; if it is present, the system replaces it in the text, and if not, the original text is left as it is. It is difficult to add every abbreviation to the database, so only the most frequently used ones are included. For example:
Ø  தா.நா. → தமிழ் நாடு
Ø  ரூ. → ரூபாய்
Then unwanted punctuation such as (: , ; ' ` $) etc. is removed from the Tamil paragraph, to avoid confusion and to avoid disturbing the naturalness of the speech. Every token in the input must be assigned a sound file for concatenation, so keeping this punctuation and assigning speech files to it would lead to unnaturalness in the final output, since the input text contains many punctuation marks. The algorithm does all of this and assigns an @ symbol to the spaces present in the input text. Where a passage contains extra spaces or extra punctuation, the punctuation is turned into blank space and runs of spaces are collapsed into a single space. Finally every space is replaced by the @ symbol, which is assigned a short stretch of silence in MATLAB. This helps in further processing.
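The preprocessing steps just described — abbreviation expansion from a small database, punctuation removal, space collapsing and the @ silence marker — can be sketched in Python. The abbreviation entries come from the examples above; the exact punctuation set is an assumption:

```python
import re

# Illustrative abbreviation database (the real system stores its most
# frequently used abbreviations the same way).
ABBREVIATIONS = {
    "தா.நா.": "தமிழ் நாடு",
    "ரூ.": "ரூபாய்",
}

def preprocess(text):
    """Expand abbreviations, strip unwanted punctuation, mark spaces."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Turn unwanted punctuation into blank space.
    text = re.sub(r"[:;,'`\"$]", " ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Assign the @ (silence) symbol to every remaining space.
    return text.replace(" ", "@")
```

With the abbreviation expanded first, the only full stop left in மகா தா.நா. சென்றாள். is the sentence-final one.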

Number converter

A number is pronounced differently in different situations. The second step in text normalization is normalizing non-standard words: tokens like numbers or abbreviations, which need to be expanded into sequences of Tamil words before they can be pronounced. What is difficult about these non-standard words is that they are often very ambiguous. For example, the number 1983 can be spoken in at least three different ways, depending on the context:

Ø  பத்தொண்பது என்பத்து மூன்று
Ø  ஒன்று ஒன்பது எட்டு மூன்று
Ø  ஆயிரத்தி தொள்ளாயிரத்தி என்பத்து மூன்று.

This problem can be handled by applying a single methodology to all the cases. A number system for Tamil has already been developed in Python code; it gives one hundred percent accuracy, but incorporating it into the TTS system makes the system somewhat slow. So, for the time being, the number system has been left out of the current TTS system; it will come under future work. The algorithm therefore removes all numbers appearing in the text, leaving normal text without numbers or extra punctuation.
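Of the three readings above, the digit-by-digit one is the simplest to implement. A Python sketch (a full converter for readings like பத்தொண்பது என்பத்து மூன்று needs many more rules and, as noted above, is left as future work):

```python
# Digit-by-digit reading of a number, the simplest of the three
# pronunciations shown above.
TAMIL_DIGITS = {
    "0": "பூஜ்ஜியம்", "1": "ஒன்று", "2": "இரண்டு", "3": "மூன்று",
    "4": "நான்கு", "5": "ஐந்து", "6": "ஆறு", "7": "ஏழு",
    "8": "எட்டு", "9": "ஒன்பது",
}

def read_digits(number):
    """Spell a number digit by digit in Tamil words."""
    return " ".join(TAMIL_DIGITS[d] for d in str(number))
```

For 1983 this produces the second of the three readings listed above.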

Syllabification

Syllabification can be done in many ways. For example, one syllabification algorithm breaks a word so that there is a minimum number of breaks, since a minimum number of joins produces fewer artifacts. The algorithm dynamically looks for polysyllabic units making up the word, cross-checks the database for the availability of the units, and then breaks the word accordingly. If polysyllabic units are not available, the algorithm naturally picks smaller units. This means that if the database is populated with all available phones of the language along with syllable units, the algorithm falls back on phones when bigger units are not available.

In this system, however, we do not follow that methodology; instead we use a mapping file which contains each Tamil letter and its corresponding Romanized form. For the input Tamil letters, the corresponding Romanized letters are taken and formed into syllables. The important point about this mapping file is that the letters must be arranged according to the length of the Tamil letter: if they are not arranged in this manner, there will be a system error and no output.

For example, when an input Tamil word contains a compound letter, the longer letter must appear in the mapping before its shorter prefix. If the shorter letter comes first, the algorithm matches and replaces the prefix, leaving an unrecognized character behind which the system cannot process for unit selection; this is why the arrangement of the letters is compulsory.
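The mapping-file lookup can be sketched in Python; the entries below are only the slice needed for this example, and sorting the keys longest-first plays the role of the length-based arrangement described above:

```python
# A small slice of the mapping file, just enough for this example.
MAPPING = {
    "கா": "kA", "க": "ka", "ம": "ma", "ரா": "rA",
    "ர": "ra", "ஜ": "ja", "ர்": "r",
}

def syllabify(word):
    """Greedy longest-match lookup against the mapping file."""
    # Trying longer entries first ensures compound letters are matched
    # before their shorter prefixes.
    keys = sorted(MAPPING, key=len, reverse=True)
    syllables, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                syllables.append(MAPPING[k])
                i += len(k)
                break
        else:
            raise ValueError(f"unrecognized character: {word[i]!r}")
    return syllables
```

Running this on காமராஜர் yields the syllable sequence kA, ma, rA, ja, r.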
                                               
Input Text (காமராஜர்) → Mapping File → Syllable Output (kA \ ma \ rA \ ja \ r)

Figure 3: Syllabification of the Tamil word kamarajar

3 Grammatical Rules
After text processing we get the processed Tamil text, but it cannot be given directly to the conversion process, which converts the Tamil text into Romanized text similar to English. The computer cannot process the Tamil data as such, so this conversion is compulsory. In Tamil, the same letter is pronounced differently in different places: word ambiguity of the English kind is not possible in Tamil, but letter ambiguity exists.

Building a Tamil system is somewhat easier than an English TTS as far as ambiguity is concerned. Tamil has a small number of letters to cover the pronunciation of all Tamil words, so a single letter can have one or two different pronunciations. This problem can be handled by writing rules: Tamil grammar provides the structure of the language, and from that grammar we can predict, to some extent, which letters have multiple possible pronunciations and in which positions. This work is done in the grammatical block.
In Tamil the same letter in text form has different pronunciations in the different places it occurs. This can be solved in many ways, by writing rules or by machine-learning techniques such as SVM or HMM. In this project the phonetic rules of the Tamil language are used instead of a machine-learning approach, as a lot of previous work was done that way. Tamil is somewhat weak in its number of alphabets for covering all the pronunciations of its letters, whereas other South Indian languages such as Malayalam and Kannada have many alphabets to fill this gap in pronunciation.

Making a TTS system for Malayalam is therefore easier than for Tamil, because for Tamil we need to know the complete grammatical and phonetic structure of the language. English has many ambiguous words; this is not the case in Tamil, where the only problem is that the same letter is pronounced differently in different places. For example, in English:

Ø  Do you live (/l ih v/) near a zoo with live (/l ay v/) animals?
Ø  I prefer bass (/b ae s/) fishing to playing the bass (/b ey s/) guitar.
From these examples we can see that ambiguous words are common in English sentences. This does not happen in Tamil:
Ø  காக்கா நிறம் கருப்பு
Ø  காகம் கூட்டில் இருக்கிறது.

In the first sentence க is pronounced as 'ka', but in the second sentence the same க is pronounced as 'gha'. Such ambiguous letters are common in the Tamil language.
Ø  பாப்பா இங்கே வா.
Ø  பாபா படம் கண்டேன்.

In the first sentence பா is pronounced as 'pA', but in the second sentence the same பா is pronounced as 'bhA'. These types of problems can be handled by acquiring complete information about Tamil phonetic grammar. Some of the rules are shown below.
Ø  அவன் ஆசை காட்டினான்
Ø  அவளுக்கு பச்சை வண்ணம் பிடிக்கும்.
In the first sentence சை is pronounced as 'sai', but in the second sentence the same சை is pronounced as 'chai'. Rules can handle this: letters of the 'sa' series (sai, sa, sA, si, etc.) are normally pronounced as such, but when the same 'sa'-series letter comes after 's' it is pronounced in the 'cha' series (chai, cha, chA, etc.).
Ø  அழகிரி பாடம் படித்தான்.
Ø  மணி படம் பார்த்தான்.

In the first sentence ட is pronounced as 'da', but in the second sentence the same ட is pronounced as 'ta'. Rules can handle this too: letters of the 'da' series (dai, da, dA, etc.) are normally pronounced as such, but when the same 'da'-series letter comes after 't' it is pronounced in the 'ta' series (tai, ta, tA, etc.).

Apart from this, some Tamil letters have long, short and medium pronunciations depending on where they appear. For example, து occurring in different places within a word has this long, short or medium pronunciation:

Ø  பந்து (long pronunciation of து)
Ø  துவங்கினான் (short pronunciation of து )
Ø  பாதுகாப்பு (short pronunciation of து)

Almost 60% of the letters have this property, so to capture this information in the TTS system it is necessary to record three forms of the same letter. Our algorithm is designed to fix the appropriate pronunciation of each letter in the appropriate place. The process of applying the phonetic rules to the input text is shown in Figure 4.
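Two of the context rules above can be sketched on the syllabified Romanized output. The Romanization conventions assumed here (a preceding unit ending in 's' for ச், in 't' for ட்) are illustrative choices, not the system's actual tables:

```python
def apply_phonetic_rules(syllables):
    """Rewrite syllables whose pronunciation depends on the left context."""
    out = []
    for syl in syllables:
        # 'sa'-series syllable after a unit ending in 's' -> 'cha' series.
        if out and out[-1].endswith("s") and syl.startswith("s"):
            syl = "ch" + syl[1:]
        # 'da'-series syllable after a unit ending in 't' -> 'ta' series.
        elif out and out[-1].endswith("t") and syl.startswith("d"):
            syl = "t" + syl[1:]
        out.append(syl)
    return out
```

With such rules, the 'sai' in ஆசை stays 'sai' (no 's' context on its left), while the same letter in பச்சை is rewritten to 'chai'.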

Input Text (அவனுக்கு பாதுகாப்பு வழங்கப்பட்டது) → Text Processing → Romanized Form (avanukku pAthukAppu vazangkappattathu) → Grammatical Rules → avanukku pAthu(M)kAppu vazangkappattathu(L)
Figure 4: Process of Phonetic Rules Applied to the Input Text.

4        Speech Database

All the letters of Tamil should be recorded and kept in a separate database. After text processing, grammatical change and conversion to Romanized text, the system takes the corresponding speech file from the speech database. The selection of the speech file corresponding to each text unit is handled in the unit-selection block. Recording the speech sounds of the Tamil letters imposes some requirements for further signal processing.
The recording should take place in a soundproof room, the pronunciation of the Tamil letters should be good, and a good-quality microphone should be used to avoid noise in the speech files. These requirements contribute greatly to the naturalness of the output speech. In text-to-speech synthesis the accuracy of the system is judged by the naturalness of the output speech, so instead of recording only single Tamil letters, we also record some combinations of letters, i.e. diphones. This improves the naturalness of the system considerably.



Figure 5: Speech Database
The main part of this project is to record all the letters of the Tamil language, that is, the 247 letters plus some of the Sanskrit letters which rarely appear in Tamil; in the end the database contains about 350 speech files. The grammatically changed letters must have their own corresponding speech files, so it is necessary to record those files and store them in the database. Most Tamil letters have three forms of pronunciation, which is very important for the naturalness of the output speech.

Ø  பருப்பு (long pronunciation of பு)
Ø  புலவர்(short pronunciation of பு)
Ø  அன்புடையீர்(medium pronunciation of பு)

Apart from the normal letters of Tamil, the three forms of the same letter must be recorded and stored in the database. Since the same letters are pronounced differently in different places, the grammatically changed letters are recorded separately from the basic alphabet. With this basic speech database alone, however, we cannot get good accuracy, that is, high naturalness.

We could record all the words of the Tamil language to get very high naturalness, but recording every Tamil word is not possible; at best, the most frequently used words could be recorded for good naturalness, and even that takes processing time. So instead we use all possible combinations of vowels and consonants. Adding these to the database does not take much space, and the processing time is also less compared to the previous methodology.

Yet it gives output speech with naturalness similar to recording the complete word. Recording and adding the most frequently used words would involve inflection, which can appear in many forms and would need a morphology process to handle, so it is somewhat time-consuming work. The combination of consonants and vowels gives good results in this TTS system; it contributes 4457 speech files to the database. An excerpt of the combination table is given below.

Tamil Alphabets Chart (excerpt)

அ a | …
ங nga | ஙா ngA | ஙி ngi | ஙீ ngI | ஙு ngu | ஙூ ngU | ஙெ nge | ஙே ngE | ஙை ngai | ஙொ ngo | ஙோ ngO | ஙௌ ngau
ஞ Gna | ஞா GnA | ஞி Gni | ஞீ GnI | ஞு Gnu | ஞூ GnU | ஞெ Gne | ஞே GnE | ஞை Gnai | ஞொ Gno | ஞோ GnO | ஞௌ Gnau
ண Na | ணா NA | ணி Ni | ணீ NI | ணு Nu | ணூ NU | ணெ Ne | ணே NE | ணை Nai | ணொ No | ணோ NO | ணௌ Nau
… | த் th | … | யௌ yau | …
ழ zha | ழா zhA | ழி zhi | ழீ zhI | ழு zhu | ழூ zhU | ழெ zhe | ழே zhE | ழை zhai | ழொ zho | ழோ zhO | ழௌ zhau | ழ் zh

Sanskrit (Grantha) characters - வடமொழி (கிரந்த) எழுத்துக்கள்: க்ஷோ kshO | …
Once the database is ready, the output from the grammatical-rules section is a Romanized sentence which has already undergone the syllabification process. The unit-selection part then picks the correct speech files from the speech database that has already been created: it takes the speech file for each syllable of the syllabified sentence and groups those speech units together.
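Unit selection can be sketched as a lookup from syllable to recording; the one-.wav-file-per-unit naming scheme used here is an assumption for illustration, not the system's actual database layout:

```python
from pathlib import Path

def select_units(syllables, db_dir):
    """Pick the recorded speech file for each Romanized syllable.
    Assumes one .wav file per unit, named after the syllable."""
    paths = []
    for syl in syllables:
        wav = Path(db_dir) / f"{syl}.wav"
        if not wav.exists():
            raise FileNotFoundError(f"no recording for unit: {syl}")
        paths.append(wav)
    return paths
```

The ordered list of file paths this returns is exactly the grouping of speech units that the concatenation stage consumes.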

5        Concatenation

The last and final stage is the concatenation process, in which all the arranged speech units are joined by a concatenation algorithm. The concatenation of the speech files is done in MATLAB, because MATLAB is very useful if further signal processing is needed; in this project, however, no signal processing techniques are used. The main problem in the concatenation process is that there can be glitches at the joins. Previous projects used signal-processing methods to solve this problem, but in this project the glitches are avoided by careful recording, so that no glitches appear. The concatenation process combines all the speech files output by the unit-selection process into a single speech file, which can be played and stopped anywhere needed. The main aim of this project is to achieve good naturalness in the output speech.
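The described system concatenates in MATLAB; an equivalent sketch using Python's standard wave module is shown below, assuming all units were recorded with the same sample rate and sample format:

```python
import wave

def concatenate_wavs(in_paths, out_path):
    """Join the selected unit recordings into one output speech file.
    Assumes every input shares the same sample rate and sample format."""
    with wave.open(str(out_path), "wb") as out:
        params_set = False
        for path in in_paths:
            with wave.open(str(path), "rb") as unit:
                if not params_set:
                    # Copy channel count, sample width and rate from
                    # the first unit.
                    out.setparams(unit.getparams())
                    params_set = True
                out.writeframes(unit.readframes(unit.getnframes()))
```

Simple frame-level joining like this is what produces audible glitches at the boundaries unless, as described above, the units are recorded carefully enough that the joins are clean.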

6 Conclusion

This paper has given a clear and simple step-by-step overview of the working of a text-to-speech (TTS) system. There are many TTS systems available in the market, and much work is going on in the research community to make synthesized speech more effective and natural, with stress and emotion. We expect synthesizers to continue to improve through research in prosodic phrasing, improving the quality of speech, voice, emotion and expressiveness in speech, and to simplify the conversion process so as to avoid complexity in the program.

