Emotion Recognition using Face Images

Fatma Guney
Bogazici University, Computer Engineering
Istanbul, Turkey 34342
Email:
[email protected]
Abstract—In this study, I present a real-time facial expression analysis system developed as the course project for Data Mining for Visual Media. The task is to train a system that can recognize six basic emotion types, namely anger, disgust, fear, happiness, surprise, and sadness, plus the neutral expression. For this task, I employed a local appearance-based representation using the Discrete Cosine Transform (DCT), which has been shown to be effective in real-time processing and robust against lighting changes. Using these features, I trained one-versus-all support vector machine classifiers for each emotion type.
I. INTRODUCTION

A person's face changes according to the emotions or internal states of the person. The face is a natural and powerful communication tool, and analyzing facial expressions through the movements of facial muscles leads to various applications. Facial expression recognition plays a significant role in Human Computer Interaction systems. Humans can understand and interpret each other's facial changes and use this understanding to respond and communicate; a machine capable of meaningful and responsive communication is one of the main goals in robotics. Many other areas also benefit from advances in facial expression analysis, such as psychiatry, psychology, educational software, video games, animation, lie detection, and other practical real-time applications.

The objective of this study is to get an overview of the current methods and, based on the proposed solutions, to develop a real-time facial expression recognition system that recognizes the neutral expression and the six prototypic expressions: happiness, sadness, anger, surprise, disgust, and fear. The study builds on the approach of [1].

In this study, a facial emotion recognition system is developed. First, face and eye detection based on the modified census transform (MCT) are performed on the input image. After detection, alignment based on the eye coordinates of the detected face is applied to scale and translate the face and reduce variance in feature space. The aligned face image is divided into local blocks and the discrete cosine transform is applied to these blocks. Concatenating the features of each block yields an overall feature vector, which is scaled before classification. For classification, one-versus-all Support Vector Machine (SVM) classifiers are trained with cross validation.
II. RELATED WORK

There are many studies in the literature that aim to recognize facial changes and motions from visual data in order to analyze facial expressions. According to literature reviews, the steps of a facial expression system can be summarized as face acquisition, facial data extraction or representation, and facial expression recognition [2]. The first step, face acquisition, is composed of face detection and head pose estimation. The second step, facial data extraction and representation, can be carried out as geometric and/or appearance feature extraction. The last step, facial expression recognition, can be categorized as frame-based or sequence-based classification. There are many dimensions to facial expression analysis, including individual differences among subjects, the intensity of an expression, deliberate versus spontaneous expression, head orientation and scene complexity, image acquisition and resolution, reliability of the ground truth, databases, and the relation to other facial or non-facial behaviors. A multimodal approach, that is, the fusion of, for example, audio and visual expression systems, is one methodology for combining facial and non-facial behaviors [3].

Within the scope of facial expression analysis, there are deeper-level studies based on facial muscles [1]. Facial parameterization to encode the movements of facial muscles is a highly studied subject in both psychology and computer science. The most common coding scheme is the Facial Action Coding System (FACS), which defines action units corresponding to atomic facial muscle actions; similarly, there are the MPEG-4 Facial Animation Parameters (FAPs) [3].

III. METHODS

A. Face Detection and Alignment
Fig. 1. Face detection and alignment examples.
The face and eyes are automatically detected using a modified census transform (MCT) based face and eye detector [4], [5]. Alignment of the detected face is necessary to reduce the variation in feature space caused by pose, angle, and scale changes. Alignment is a transformation of the face in Euclidean space based on the detected eye coordinates: the face is cropped and scaled according to a fixed distance between the two eyes, and the eyes are translated to the same fixed position in all images. Some examples of detection and alignment are shown in Fig. 1.
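To make the alignment step concrete, the following is a minimal sketch, not the OKAPI-based implementation used in the study. It assumes the eye centers have already been returned by a detector, and warps the face with a similarity transform so that the eyes land on fixed target positions; the target geometry uses the values given later in Section IV-B (64x80 crop, eye row 35, inter-eye distance 32 pixels), and all names are illustrative.

```python
import cv2
import numpy as np

def align_face(image, left_eye, right_eye,
               out_w=64, out_h=80, eye_row=35, eye_dist=32):
    """Rotate, scale, and translate the face so that the detected eyes
    land on fixed positions in a 64x80 crop (eye row 35, inter-eye
    distance 32 px). "left" means image-left here."""
    lx, ly = left_eye
    rx, ry = right_eye
    # Rotation angle and uniform scale derived from the eye segment.
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))
    scale = eye_dist / np.hypot(rx - lx, ry - ly)
    # Rotate/scale around the midpoint between the eyes ...
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    M = cv2.getRotationMatrix2D((cx, cy), angle, scale)
    # ... then translate that midpoint to its fixed target position.
    M[0, 2] += out_w / 2.0 - cx
    M[1, 2] += eye_row - cy
    return cv2.warpAffine(image, M, (out_w, out_h))
```

After this warp, the eye line is horizontal, the eyes are exactly 32 pixels apart, and their midpoint sits at row 35 of the output crop, which is what keeps the feature space consistent across images.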
B. Local Appearance-based Face Representation

Fig. 2. Local appearance-based feature extraction scheme using DCT.

For face representation, a local appearance-based approach is used: the face is divided into non-overlapping blocks and feature extraction is performed on these blocks instead of the whole face, as shown in Fig. 2. When the appearance of the face changes due to occlusion, local blocks provide an advantage because only the related block or blocks are affected; if feature extraction were applied to the entire face, the entire representation would be affected by the change [6].

For feature extraction on the local regions, the discrete cosine transform (DCT) is used. The DCT is a signal analysis tool frequently used in facial image analysis because it provides frequency information in a compact representation, and it is often preferred in real-time applications due to its fast computation. The DCT representation has been shown to be robust to lighting changes and scaling variations thanks to its decomposition capability: the components sensitive to such variations can simply be removed. For example, the first coefficient represents the average intensity of the face image, which is directly affected by illumination variations [6].

A detected and aligned face image is divided into blocks of 8x8 pixels, and each block is represented by its DCT coefficients. The 8x8 size is the common choice in face recognition applications because it is small enough to provide stationarity within the block at low transform complexity, yet big enough to provide sufficient compression [6]. After the coefficients are selected, a two-step normalization is applied to the features of each block. First, since blocks with different brightness levels may produce DCT coefficients at different value levels, each local feature vector's magnitude is normalized to unit norm. Second, since the leading coefficients have higher magnitudes, each coefficient is divided by its standard deviation to balance the contributions of the coefficients [6]. The top-left DCT coefficient, the first one, is removed from the representation since it only represents the average intensity value of the block. From the remaining DCT coefficients, the five coefficients that carry the most information are extracted using zig-zag scanning, as shown in Fig. 3. Finally, the DCT coefficients extracted from each block are concatenated to construct the overall feature vector [6].

Fig. 3. Zig-zag scanning order used to select the DCT coefficients.
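The representation above can be sketched in a few lines. The snippet below is a simplified reconstruction from the description, not the exact implementation of [6]: it keeps 10 zig-zag coefficients per block after dropping the DC term (the count used in the experiments of Section IV-B), and, as a simplifying assumption, estimates the second normalization's standard deviations over the blocks of the input image rather than over training data.

```python
import cv2
import numpy as np

def zigzag_indices(n=8):
    """(row, col) pairs in JPEG-style zig-zag order (Fig. 3):
    anti-diagonals traversed in alternating direction."""
    def key(rc):
        s = rc[0] + rc[1]
        return (s, rc[0] if s % 2 else rc[1])
    return sorted([(r, c) for r in range(n) for c in range(n)], key=key)

def dct_features(face, block=8, n_coeffs=10):
    """Local appearance-based DCT features: per 8x8 block, drop the DC
    term, keep the next n_coeffs zig-zag coefficients, and apply the
    two-step normalization described above."""
    h, w = face.shape
    scan = zigzag_indices(block)[1:n_coeffs + 1]  # skip DC coefficient
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            d = cv2.dct(np.float32(face[y:y + block, x:x + block]))
            v = np.array([d[r, c] for r, c in scan])
            v /= np.linalg.norm(v) + 1e-8   # step 1: unit norm per block
            feats.append(v)
    F = np.vstack(feats)                     # one row per block
    # Step 2: divide each coefficient index by its standard deviation
    # (estimated here over this image's blocks; it could equally be
    # estimated over the training set).
    F /= F.std(axis=0, keepdims=True) + 1e-8
    return F.ravel()   # for a 64x80 face: 8x10 blocks x 10 = 800-dim
```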
C. Emotion Recognition using SVM

Emotion recognition is modeled as a one-versus-all problem for each emotion type, and for each SVM classifier a model for probability estimates is trained. All frames of the videos labeled with a given emotion class are used as positive samples, and the samples of the other classes as negative samples. At test time, each frame of a test video is classified by all trained classifiers, and the emotion with the highest probability is taken as the emotion type for that frame. Voting among the emotion types returned for the frames of a video then determines the emotion type for that video.

Scaling before applying the SVM is important to prevent features with large ranges from dominating the others [7]. Mean and standard deviation values are determined for each feature index, and in training each sample is scaled so that every feature is zero-mean and unit-variance over all feature vectors. The normalization parameters are saved during training and applied in testing before classification.

In this study, an SVM with a radial basis function (RBF) kernel is used [1]. The RBF kernel transforms the features into a higher-dimensional space so that they can be linearly separated by a hyperplane. For the optimization of the penalty parameter C and the kernel parameter γ, five-fold cross validation is used: a grid search on C and γ with cross validation is performed, as recommended in [7]. In the grid search, pairs of (C, γ) are tried and the pair with the best cross-validation accuracy is picked.
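The classification scheme can be summarized with the following sketch, which substitutes scikit-learn's libsvm-backed SVC for the LIBSVM tools of [7]; the function names, the illustrative (C, γ) defaults, and the label encoding are assumptions of this sketch, not the study's code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]

def train_one_vs_all(X, y, C=2.0, gamma=2.0**-10):
    """X: frame feature vectors; y: array of emotion labels per frame."""
    scaler = StandardScaler().fit(X)       # zero-mean, unit-variance
    Xs = scaler.transform(X)
    models = {}
    for emo in EMOTIONS:
        # Positives: frames of this emotion; negatives: all the rest.
        clf = SVC(kernel="rbf", C=C, gamma=gamma, probability=True)
        clf.fit(Xs, (y == emo).astype(int))
        models[emo] = clf
    return scaler, models

def classify_video(frames, scaler, models):
    """Per-frame winner = emotion with the highest probability;
    per-video label = majority vote over the frames."""
    Xs = scaler.transform(frames)
    probs = {e: m.predict_proba(Xs)[:, 1] for e, m in models.items()}
    per_frame = [max(probs, key=lambda e: probs[e][i])
                 for i in range(len(frames))]
    return max(set(per_frame), key=per_frame.count)
```

StandardScaler plays the role of the saved normalization parameters: its mean and standard deviation are fitted on training data only and reapplied to test frames before classification.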
IV. EXPERIMENTS

A. Dataset

The FGnet Facial Expression and Emotion Database is used in the experiments. It is an unpublished database of spontaneous and natural facial expressions, containing video sequences for the six basic emotions plus neutral. The videos are of high quality, with good lighting conditions and a constant background. The database contains videos gathered from 19 individuals, each performing all six desired expressions and an additional neutral sequence three times. With 21 sequences recorded per individual, there are 399 sequences in total.
Two recordings of each individual for each emotion and for neutral are used for training and validation; the remaining sequences constitute the test set.

B. Experimental Setup
For emotion recognition, each detected face is scaled to 64x80 pixels and aligned so that the eye row is 35 and the distance between the eyes is 32 pixels [4], [5]. The DCT is performed on blocks of 8x8 pixels. For each block, the first 10 coefficients in the zig-zag scanning order are kept, leading to an 8x10x10 = 800 dimensional feature vector (8x10 blocks, 10 coefficients each). After scaling the overall feature vector, SVM parameter optimization is performed by grid search using five-fold cross validation. The grid search is performed with C = 2^k and γ = 2^l, with k = −3, −1, 1 and l = −16, −14, ..., −7.
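Written as code, the grid of this section looks as follows; GridSearchCV stands in here for the LIBSVM grid-search script, and since the quoted γ range steps by two yet lists −7 as its endpoint, the sketch follows the step-two reading.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# C = 2^k for k in {-3, -1, 1}; gamma = 2^l for l stepping by 2 from -16
# (the text gives the gamma range as "-16, -14, ..., -7").
param_grid = {
    "C": [2.0**k for k in (-3, -1, 1)],
    "gamma": [2.0**l for l in range(-16, -6, 2)],
}

def tune_svm(X, y, folds=5):
    """Five-fold cross-validated grid search for one binary
    one-versus-all problem; returns the best (C, gamma) pair."""
    gs = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=folds)
    gs.fit(X, y)
    return gs.best_params_, gs.best_score_
```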
C. Results

TABLE I
PARAMETERS OF SVM

Emotion   | C   | γ
Anger     | 2^1 | 2^-10
Disgust   | 2^1 | 2^-10
Fear      | 2^1 | 2^-10
Happiness | 2^1 | 2^-12
Sadness   | 2^1 | 2^-10
Surprise  | 2^1 | 2^-10
The results of the SVM parameter optimization using grid search can be seen in Table I. For the training step, the cross-validation error rates for each emotion type are shown in Table II. The error rate is calculated as
$$\mathrm{ErrorRate} = \frac{FP + FN}{TP + TN + FP + FN} \qquad (1)$$
where TP is the number of correctly classified positive samples, FP the number of samples incorrectly classified as positive, TN the number of correctly classified negative samples, and FN the number of samples incorrectly classified as negative.
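Equation (1) translates directly into a small helper (a trivial sketch; the name is illustrative):

```python
def error_rate(tp, tn, fp, fn):
    """Equation (1): misclassified samples over all samples."""
    return (fp + fn) / (tp + tn + fp + fn)
```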
TABLE II
ERROR RATES FOR CROSS VALIDATION

Emotion/Fold | Fold-1 | Fold-2 | Fold-3 | Fold-4 | Fold-5 | avg
Anger        | 0.19   | 0.27   | 0.12   | 0.26   | 0.13   | 0.19
Disgust      | 0.22   | 0.24   | 0.24   | 0.21   | 0.20   | 0.22
Fear         | 0.18   | 0.16   | 0.10   | 0.13   | 0.32   | 0.17
Happiness    | 0.17   | 0.08   | 0.10   | 0.06   | 0.12   | 0.10
Sadness      | 0.45   | 0.35   | 0.24   | 0.28   | 0.25   | 0.31
Surprise     | 0.33   | 0.10   | 0.23   | 0.10   | 0.08   | 0.16
From the test results, a confusion matrix over the emotion types is constructed, shown in Table III.

TABLE III
CONFUSION MATRIX
          | Anger | Disgust | Fear | Happiness | Sadness | Surprise
Anger     | 13    | 1       | 0    | 3         | 2       | 0
Disgust   | 1     | 18      | 0    | 0         | 0       | 0
Fear      | 0     | 1       | 12   | 0         | 1       | 3
Happiness | 0     | 0       | 0    | 17        | 1       | 1
Sadness   | 0     | 1       | 1    | 0         | 16      | 0
Surprise  | 0     | 0       | 2    | 2         | 0       | 15

V. DISCUSSION AND CONCLUSION

A real-time emotion recognition system using face data is proposed and developed. Given an input image or video sequence, faces are first detected and aligned for feature extraction. Local appearance-based feature extraction using the DCT is applied to each aligned face; the features of each block are concatenated to obtain the overall feature vector, and each feature is scaled before applying the SVM. For classification, one-versus-all SVM classifiers are trained and used.

The obtained results show that SVM classification using local appearance-based features gives very promising results for the emotion recognition problem. The validation accuracies are not that high due to the spontaneous expressions: making people feel the desired emotion while they are sitting in front of a screen is not an easy task. In particular, some classes like sadness or fear do not look genuine in the recordings, and consequently their results are lower compared to the other emotion types. The test results are quite good. Fear is generally misclassified as surprise, and vice versa, because similar facial muscles are used for these two expressions.

REFERENCES

[1] T. Gehrig and H. K. Ekenel, "A Common Framework for Real-Time Emotion Recognition and Facial Action Unit Detection," IEEE Workshop on CVPR for Human Communicative Behavior Analysis, June 2011.
[2] Y. L. Tian, T. Kanade, and J. F. Cohn, "Facial Expression Analysis," in Handbook of Face Recognition, S. Z. Li and A. K. Jain, Eds., pp. 247-276, Springer, New York, USA, 2005.
[3] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39-58, 2009.
[4] CVHCI lab, OKAPI (Open Karlsruhe Library for Processing Images) C++ library.
[5] C. Kublbeck and A. Ernst, "Face detection and tracking in video sequences using the modified census transformation," Image and Vision Computing, 24(6):564-572, June 2006.
[6] H. K. Ekenel, "A Robust Face Recognition Algorithm for Real-World Applications," PhD thesis, Universitat Karlsruhe (TH), Karlsruhe, Germany, Feb. 2009.
[7] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
V. D ISCUSSION AND C ONCLUSION A real-time emotion recognition system using face data is proposed and developed. Given an input image or video sequence, first faces are detected and aligned for the feature extraction. Local appearance-based feature extraction using DCT is applied to each aligned face. Features of each block are concatenated to obtain the overall feature vector and each feature is scaled before applying SVM. For classification, oneversus-all SVM classifier is trained and used. Obtained results show that SVM classification using local appearance-based features give very promising results for emotion recognition problem. Validation accuracies are not that high due to the spontaneous expressions. For example, making people feel what is wanted when they are sitting in fornt of a screen is not an easy task. Especially some classes like sadness or fear does not seem that real in records, consequently their results are lower compared to other emotion types. Testing results are quite good. Fear is generally misclassified as surprise or other way due to the fact that similar facial muscles are used for these expressions. R EFERENCES [1] Gehrig, T., Ekenel, H.K., A Common Framework for Real-Time Emotion Recognition and Facial Action Unit Detection IEEE Workshop on CVPR for Human Communicative Behavior Analysis, June 2011. [2] Tian, Y.L., Kanade, T. Cohn, J.F. (2005). Facial Expression Analysis, In: Handbook of Face Recognition, Li, S.Z. Jain, A.K., (Eds.), pp. 247-276, Springer, New York, USA. [3] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):3958, 2009. [4] CVHCI lab. Okapi (open karlsruhe library for processing images) c++ library. [5] C. Kublbeck and A. Ernst. Face detection and tracking in video sequences using the modified census transformation. Image and Vision Computing, 24(6):564572, June 2006. [6] H. K. Ekenel. A Robust Face Recognition Algorithm for Real-World Applications. PhD thesis, Universitat Karlsruhe (TH), Karlsruhe, Germany, Feb. 2009. [7] Chih-Chung Chang, C.J.L.: Libsvm: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/ cjlin/libsvm