ISSN: 2395-0560
International Research Journal of Innovative Engineering www.irjie.com Volume1, Issue 3 of March 2015
The Comparative Study of Speech Recognition Models - Hidden Markov and Dynamic Time Warping
V. Vaidhehi1, Anusha J2, Anand P3
1,2,3
Department of Computer Science, Christ University, Bangalore, 560034, India
Abstract
Speech Recognition is the process of converting a speech signal into a sequence of words by means of algorithms implemented as a computer program. Speech is the most natural form of human communication, and speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. The main goal of the speech recognition area is to develop techniques and systems for speech input to machines. Dynamic Time Warping and Hidden Markov techniques are used for isolated word recognition, and the objective of this paper is to compare their performances.
Keywords
Comparison, Performance Evaluation, Dynamic Time Warping Model, Hidden Markov Model
1. Introduction
Speech recognition is an area with a considerable literature, but there is little discussion of the topic within the computer science algorithms literature. Speech recognition is a multilevel pattern recognition task in which acoustic signals are examined and structured into a hierarchy of sub-word units (e.g. phonemes), words, phrases, and sentences. Each level may provide additional temporal constraints, e.g. known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at lower levels. This hierarchy of constraints is best exploited by combining decisions probabilistically at all lower levels and making discrete decisions only at the highest level [5]. There are two related speech tasks: speech understanding and speech recognition. Speech understanding is getting the meaning of an utterance such that one can respond properly whether or not one has correctly recognized all the words. Speech recognition is simply transcribing the speech without necessarily knowing the meaning of the utterance. The technology is also helpful to handicapped persons who might otherwise require helpers to control their environments. Studies in the speech recognition field, as in other fields, follow two trends: fundamental research, whose goal is to devise and test new methods, algorithms and concepts in a non-commercial manner, and applied research, whose goal is to improve existing methods following specific criteria.
Fundamental research aims at medium- and especially long-term benefits, while applied research aims at quick performance improvements of existing methods, or at extending their use in domains where they have been less used so far. Performance in speech recognition can be assessed against the following criteria:
- Dimension of the recognizable vocabulary.
- Degree of spontaneity of the speech to be recognized.
- Dependence on or independence of the speaker.
- Time to put the system into operation.
- Time for the system to accommodate new speakers.
- Decision and recognition time.
- Recognition rate, expressed by words or by sentences.
_____________________________________________________________________________________________________________ ©2015, IRJIE-All Rights Reserved Page -22
Today's speech recognition systems are based on the general principles of pattern recognition [1][2]. The methods and algorithms that have been used so far can be divided into four large classes:
- Discriminant analysis methods based on Bayesian discrimination.
- Hidden Markov Models.
- Dynamic programming - Dynamic Time Warping (DTW) [4].
- Neural networks.
Algorithm optimization is therefore necessary to remove undesirable operations as far as possible. The Hidden Markov Model is a popular statistical model used to implement speech recognition technologies [3]. The time variances in spoken language are modeled as Markov processes with discrete states. Each state produces speech observations according to the probability distribution characteristic of that state. The speech observations can take on discrete or continuous values; in either case, each observation represents a fixed time duration, that is, a frame. The states themselves are not directly observable, which is why the model is called the hidden Markov model.
The organization of this paper is as follows. The introduction leads the reader from the general subject area to the particular field of research: it establishes the context and significance of the research by summarizing current understanding and background information, states the purpose of the work in the form of the research problem, briefly explains the methodological approach, highlights the potential outcomes, and outlines the remaining structure of the paper.
The literature review describes the overall goal with an integrative summary of other research findings and the questions that remain unanswered or require additional research. The next sections of the paper give a brief insight into speech recognition and the categorization of speech recognition. A comparative analysis is then made based on the study carried out throughout the course of the work. Conclusions are drawn at the end, discussing the instances in which the findings were made.
2. Literature Review
Speech recognition technology has many applications on embedded systems, such as stand-alone devices and single-purpose command and control systems. This paper presents a comparative study of the algorithms mainly used for speech recognition. It emphasizes techniques to prevent overflows of probability scores and to efficiently represent some of the key variability in implementing them. The highly complex algorithms have to be optimized to meet the limitations in computing power and memory resources. The optimization, which typically involves
simplification and approximation, inevitably leads to a loss of precision and a degradation of recognition accuracy. This paper describes a comparison of state-of-the-art algorithms and techniques for speech recognition. By optimizing the speech recognition algorithms, the computation time for both the front-end and pattern recognition has been efficiently reduced. On the other hand, the execution time for the back-end is proportional to the complexity of the model: the more complex the model, the more execution time and memory are required. The aim of this paper is to investigate the algorithms of speech recognition. In the past, the kernel of Automatic Speech Recognition (ASR) was Dynamic Time Warping, which is feature-based template matching and belongs to the category of Dynamic Programming (DP) techniques.
Although Dynamic Time Warping is an early ASR technique, it remains popular in many applications; it now plays an important role in the well-known Kinect-based gesture recognition application, for example. Recent work has proposed an improved Dynamic Time Warping approach for speech recognition in multimedia and other areas. The improved version, called HMM-like Dynamic Time Warping, is essentially a Hidden Markov Model-like method in which the concepts of the typical HMM statistical model are brought into the design of Dynamic Time Warping. The HMM-like DTW method, which transforms feature-based DTW recognition into model-based DTW recognition, is able to behave like the HMM recognition technique, and the HMM-like recognition model therefore has the capability to further perform model adaptation (speaker adaptation).
Speaker verification is one of the widely used biometrics, and usually offers more secure authentication for user access than regular passwords; this is one of the areas in which speech recognition plays an important role for security. Dynamic Time Warping and the Hidden Markov Model are two well-studied non-linear sequence alignment algorithms. The research trend moved from DTW to HMM in approximately 1988-1990, since DTW is deterministic and lacks the power to model stochastic signals. DTW has mostly been applied to speech recognition, since speech obviously tends to have a varying temporal rate, and alignment is very important for good performance. Standard DTW is basically an application of deterministic dynamic programming. However, many real signals are stochastic processes, such as speech and video signals; a new algorithm called stochastic Dynamic Time Warping was introduced to address the drawbacks of basic DTW. In the Hidden Markov Model, the Viterbi algorithm is used to search for the optimal state transition sequence for a given observation sequence; it turns out to be another application of dynamic programming that cuts down the computation. In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens; therefore the sequence of tokens generated by a hidden Markov model gives some information about the sequence of states.
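The Viterbi search just described admits a compact dynamic programming implementation. The following is a minimal sketch for a discrete-observation HMM; the two-state model and its probabilities are hypothetical toy values chosen for illustration, not drawn from any experiment in this paper:

```python
# Minimal Viterbi decoder for a discrete-observation HMM.
# The two-state model below is a purely illustrative toy.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state sequence for obs, its probability)."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Pick the best final state and trace the path backwards.
    prob, last = max((best[-1][s], s) for s in states)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), prob

# Hypothetical toy parameters, for illustration only:
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}

path, prob = viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p)
```

Because each time step keeps only the single best predecessor per state, the search is another instance of dynamic programming, as noted above.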
3. Speech Recognition
The schematic diagram depicts the block diagram of speech recognition. The process is as follows: the input is digitized into a sequence of feature vectors; an acoustic-phonetic recognizer transforms the feature vectors into a time-sequenced lattice of phones; a word recognition module transforms the phone lattice into a word lattice with the help of a lexicon; finally, in the case of continuous or connected word recognition, a grammar is applied to pick the most likely sequence of words from the word lattice. Among the many paradigms of speech recognition systems, the two used most widely are the stochastic approach and the template-based approach.
[Figure: speech input → feature vectors → phonetic recognition (acoustic models) → phone lattice → word recognition (lexicon) → word lattice → task recognition (grammar) → output text.]
Figure 1: Block diagram of speech recognition
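The data flow of this pipeline can be summarized in code. The sketch below is purely schematic: every stage passed in is a hypothetical toy stand-in (real feature extraction, acoustic-phonetic decoding, lexicon search and grammar scoring are each substantial components in their own right); only the chaining of stages mirrors the block diagram.

```python
# Schematic sketch of the recognition pipeline in Figure 1.
# Each stage is supplied as a callable; the toy stand-ins below
# merely thread data through to show the structure of the flow.

def recognize(samples, featurize, phone_decoder, lexicon_search, grammar_pick):
    features = featurize(samples)                 # digitized input -> feature vectors
    phone_lattice = phone_decoder(features)       # acoustic-phonetic recognizer
    word_lattice = lexicon_search(phone_lattice)  # lexicon maps phones to words
    return grammar_pick(word_lattice)             # grammar picks the best sequence

# Hypothetical stand-in stages, for illustration only:
text = recognize(
    samples=[0.1, 0.5, 0.3],
    featurize=lambda s: [[x] for x in s],
    phone_decoder=lambda f: ["p%d" % i for i, _ in enumerate(f)],
    lexicon_search=lambda phones: [" ".join(phones)],
    grammar_pick=lambda words: words[0],
)
```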
The other efficient algorithm used in speech recognition is the Dynamic Time Warping (DTW) algorithm. Dynamic Time Warping calculates an optimal warping path between two time series; it computes both the warping path between the two series and the distance between them. Suppose we have two numerical sequences (a1, a2, ..., an) and (b1, b2, ..., bm); as can be seen, the lengths of the two sequences can differ. The algorithm starts by calculating local distances between the elements of the two sequences, for which different types of distance can be used. The most frequently used is the absolute distance between the values of the two elements (the one-dimensional Euclidean distance). This results in a matrix of distances with n rows and m columns. Starting from the local distance matrix, the minimal-distance matrix between the sequences is then determined using dynamic programming.
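The procedure just described can be sketched directly. The following minimal pure-Python version uses the absolute-difference local distance mentioned above; the example sequences are illustrative toys:

```python
# Minimal DTW between two numerical sequences of possibly different
# lengths, using the absolute difference |a_i - b_j| as the local
# distance, as in the description above.

def dtw_distance(a, b):
    """Return the minimal cumulative warping distance between a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j]: minimal cumulative distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # advance in a only
                                     cost[i][j - 1],      # advance in b only
                                     cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

d = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # the repeated 2 is absorbed
```

Since the second sequence merely repeats an element of the first, the warping path absorbs the repetition and the cumulative distance is zero; a backtrace over the same cost matrix would recover the warping path itself.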
4. Categories of Speech Recognition
The problem of fundamental interest is characterizing such a signal in terms of a signal model. A signal model gives us the following: a theoretical description of a signal processing system which can be used to process the signal so as to provide the desired output, and a way to learn a great deal about the signal source without having the source available. Signal models can be classified as:
1. Deterministic model: exploits some known property of the signal, such as the amplitude of the wave.
2. Statistical model: takes the statistical properties of the signal into account. Examples of this type of model are the Gaussian model, Poisson model, Markov model and Hidden Markov Model.
The categorization of speech recognition can be based on different aspects of speech, described as follows:
1. Single word recognizer. The single word recognizer is very easy to construct, and there are no word boundary conditions to be specified for the speech.
2. Continuous word recognizer. The continuous word recognizer has a complex structure compared to the single word recognizer. As this recognizer is complex in nature, word boundary conditions are followed: the start and end conditions must be specified in detail.
3. Speaker dependent. This category of speech recognition needs less training, and the recognition of speech is based on the specific speaker.
4. Speaker independent. A speaker-independent recognizer does not depend on a specific speaker. This is the more general recognition category, but it requires large training data.
5. Comparative Analysis
Each algorithm has its own advantages and disadvantages, which makes Dynamic Time Warping good for one type of application and the Hidden Markov Model better for another. The Hidden Markov Model works very well if a large amount of training data is available; however, it does not do as well as Dynamic Time Warping when the number of training samples is limited. A core advantage of the Hidden Markov Model algorithm is that it can work well even if there is high within-gesture variance; Dynamic Time Warping, on the other hand, needs several templates if there is high within-gesture variance. The major disadvantage of the Hidden Markov Model algorithm is that it has many magic numbers that need to be selected by the user before training; Dynamic Time Warping has very few magic numbers and can therefore be easier to use.
Table 1. Comparative Analysis
1. Complexity
   HMM: The Hidden Markov Model algorithm is complex in comparison to the Dynamic Time Warping algorithm; the complexity of the Hidden Markov Model is O(|E| log |E|). In the Hidden Markov Model each node is traversed exactly once; no repetition is allowed.
   DTW: The complexity of the Dynamic Time Warping algorithm is O(n + |E| log n). In Dynamic Time Warping there may be a number of paths to traverse the data, and nodes may be traversed repeatedly.
2. Structure
   HMM: The structure of the Hidden Markov Model is complex. It uses unicasting, so the number of comparisons is smaller; but with unicasting the comparisons are complex, and finding the shortest path at minimum cost is very difficult.
   DTW: The structure of Dynamic Time Warping is very easy to understand. Broadcasting is used to traverse the data, so the number of comparisons is larger, and there may be a number of paths traversing the data.
3. Security
   HMM: The Hidden Markov Model is more secure in comparison to Dynamic Time Warping, because it uses unicasting to transfer the data; only the source and destination know about the data, so data transmission is secure.
   DTW: In Dynamic Time Warping the security of data is not possible, because it uses broadcasting: any node can access the data, whether it requires the data or not, so it is not secure.
4. Reliability
   HMM: The Hidden Markov Model is not reliable, because there is only one source and one destination; if the system fails for any reason, there is no chance to recover the data.
   DTW: The Dynamic Time Warping algorithm is more reliable than the Hidden Markov Model, because it uses broadcasting and there are many sources and destinations: if any one node fails, the other systems get the data, and in case of corruption of any system the data can be recovered.
5. Backup
   HMM: In the Hidden Markov Model there need be no backup of the data.
   DTW: In Dynamic Time Warping, a backup of the data should be kept on another system.
6. Data transmission
   HMM: Uses unicasting technology for data transmission.
   DTW: Uses broadcasting technology.
7. Economy
   HMM: The Hidden Markov Model implementation is economical in comparison to Dynamic Time Warping, because the software is installed on only a single system for data transmission, so a large number of installations is not required.
   DTW: In Dynamic Time Warping the nodes are used for transmissions, so the software and components are installed on all the systems, which is much costlier.
8. Approaches
   HMM: Uses the dynamic Bayesian network approach; the core algorithm used in the Hidden Markov Model is the forward-backward algorithm.
   DTW: Uses time series analysis with a stochastic approach, mainly for measuring similarity between two temporal sequences which may vary in time or speed.
9. Applications
   HMM: The application goal is to recover a data sequence that is not immediately observable. Applications include cryptanalysis, speech synthesis, and machine translation.
   DTW: The application goal is to measure an optimal warping path between two time series. Applications include speaker recognition, online signature recognition, and partial shape matching.
10. Constraints
   HMM: Evaluation; uncovered hidden data.
   DTW: Boundary conditions; continuity.
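The forward-backward algorithm named under Approaches can be illustrated by its forward pass, which scores how likely a model is to have produced a given observation sequence. A minimal sketch follows; the two-state model and its probabilities are hypothetical toy values, for illustration only:

```python
# Forward pass of the forward-backward algorithm: the total probability
# that a discrete HMM produced the observation sequence, summed over
# all hidden state paths. The two-state model is an illustrative toy.

def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs | model) by the forward recursion."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[p] * trans_p[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy parameters, for illustration only:
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}

p = forward_probability(["a", "b"], states, start_p, trans_p, emit_p)
```

Unlike the Viterbi search, which keeps only the single best path, the forward pass sums over every path, which is what makes it suitable for comparing how well different models explain the same observations.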
6. Conclusion
In this research we tried to find the better technique for speech recognition. The fact that the performance of the Hidden Markov Model recognizer is somewhat poorer than that of the Dynamic Time Warping based recognizer appears to be primarily due to insufficient training data for the Hidden Markov Model. The performance of the Hidden Markov Model depends on the number of states of the model; the number of states should be such that they can model the word. The time and space complexity of the Hidden Markov Model approach is less than that of the Dynamic Time Warping approach, because the Hidden Markov Model only has to compute the probability of each model producing the observed sequence. In some cases the accuracy of the Hidden Markov Model is better than that of the Dynamic Time Warping algorithm for speech recognition; speech recognition using the Hidden Markov Model gives good results due to the resemblance between the architecture of the Hidden Markov Model and varying speech data.
The neural network is another method, which uses the gradient descent method with the back-propagation algorithm. In the Hidden Markov Model, the recognition ability is good for unknown words. The Hidden Markov Model is a generic concept and is used in many areas of research. The Dynamic Time Warping algorithm is very useful for isolated word recognition with a limited dictionary; for fluent speech recognition, Hidden Markov Model chains are used. The main limitation of Dynamic Time Warping is that it may not be satisfactory for a large dictionary, which would be needed to increase the success rate of the recognition process. The models provide a flexible but rigorous stochastic framework in which to build our systems. However, hidden Markov models do not model certain aspects of speech well, such as suprasegmental (long-span) phenomena.
REFERENCES
[1] F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, vol. 64, no. 4, 1976, pp. 532-556.
[2] S. Young, "A Review of Large-Vocabulary Continuous Speech Recognition," IEEE Signal Processing Magazine, pp. 45-57, Sep. 1996.
[3] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-26, 1978.
[4] D. Raj Reddy, "Speech Recognition by Machine: A Review," Proceedings of the IEEE, vol. 64, no. 4, April 1976, pp. 501-531.
[5] J. Santos and J. Nombela, "Text-to-Speech Conversion in Spanish: A Complete Rule-Based Synthesis System," Proc. IEEE ICASSP '82.
[6] A. Kain and M. W. Macon, "Spectral Voice Conversion for Text-to-Speech Synthesis," Proc. IEEE ICASSP 1998.