Data Mining Project - Final Report
Even Cheng CS277 Data Mining Mar 18, 2010
PROJECT: FINAL REPORT
Problem Definition: Multi-label Classification in the Wikipedia Dataset.
1. Introduction:
In many aspects of daily life we run into the need for multi-label classification. From the categories in libraries to the sections in marts, humans can classify objects with extremely high accuracy using all of our senses, even when facing objects that carry similar labels. But is it another story when a computer does the same thing? Nowadays there are several well-defined and widely used classification methods, such as the K-nearest neighbor algorithm, Gaussian mixture models, neural networks, and the Support Vector Machine. They can achieve a fair prediction result when the differences among labels are distinct, but when the labels are closely related it becomes much harder for a computer to classify them with good accuracy. In this project, we simulate the situation described above using a dataset drawn from part of the political categories in Wikipedia, and we choose the Support Vector Machine as our classification method. For implementation we adopt LIBSVM, a tool that provides efficient model training, prediction, evaluation, and more. In addition, plenty of miscellaneous tools have been developed on top of LIBSVM, such as a parameter finder written in Python [3] and a multi-label classification tool [2], which gave us a good sense of direction for our research. This final report answers the following questions:
• What can we do if the dataset is unbalanced?
• How does the accuracy vary if we use different numbers of labels in the classification process?
• Is it true that we always get better accuracy by using tf-idf when generating feature vectors?
• How does the accuracy change if we increase the size of our training set?
In this report, we start by reviewing papers related to multi-label classification in Section 2 and briefly cover our dataset in Section 3. In Section 4 we introduce how we adopted the LIBSVM tool and interpreted its input and output. Section 5 contains the experiment results and short discussions answering the questions above. In the last two parts, we close the report with suggestions for future work and a conclusion.
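Since one of the questions above concerns tf-idf weighting, here is a minimal sketch of how tf-idf feature weights can be computed from raw term counts; the function name and the toy documents are illustrative, not part of the original pipeline.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf(t, d) = count of term t in document d
    idf(t)   = log(N / df(t)), where df(t) is the number of
               documents containing t and N is the corpus size.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Weight each raw count by the log-scaled inverse document frequency.
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

# Toy usage: a term appearing in every document gets weight 0.
docs = [["tax", "policy", "vote"], ["vote", "war"], ["tax", "vote"]]
for vec in tf_idf(docs):
    print(vec)
```

Note how the ubiquitous term "vote" receives zero weight, which is the reason tf-idf can help when common words carry little class information.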
2. Related Works:
Single-label classification means we assign exactly one label to an object. In real life, however, many things span a spectrum of states, so we need multiple labels to reflect them. Multi-label classification was originally motivated by the ambiguity in text categorization [4]. Freund and Schapire proposed AdaBoost [5], and Schapire later continued with the extension BoosTexter [6]; since then, computer scientists have applied the idea in different ways, for example image recognition of scenes and humans. In image recognition, we usually classify a set of feature vectors into a combination of labels rather than a single returned label [7][8]. There are several popular ways to construct a classifier, for instance: the Naïve Bayes classifier, the K-nearest neighbor algorithm proposed by Cover and Hart in 1967, the decision tree as described by Murthy in 1998, and the Support Vector Machine proposed by Vapnik in 1995 [9]. The Support Vector Machine is a vector-based, supervised machine learning method used for classification and regression. In the scenario of a two-class dataset that is separable by a linear classifier, the aim is to find the maximum margin that separates the two classes. The data points on the margin are called the support vectors [10]. LIBSVM [1] is a library for support vector classification and regression. It implements various formulations and provides a model selection tool that does cross validation via a parallel grid search, a more reliable way to select a model. Furthermore, it supports multi-class classification using the "one-against-one" approach proposed by S. Knerr, which leads us into the discussion of multi-label classification.
Figure 1: The margin between two classes of data is maximized [11]
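As a concrete illustration of the LIBSVM workflow described above, the following is a minimal sketch using LIBSVM's official Python interface (svmutil); the file name train.svm is a placeholder, and the parameter values are illustrative rather than the ones used in our experiments.

```python
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# Load data in LIBSVM's sparse format: "<label> <index>:<value> ..."
y, x = svm_read_problem('train.svm')  # placeholder path

# With -v, svm_train runs 10-fold cross validation and returns the
# CV accuracy instead of a model. -t 0 selects a linear kernel,
# -c sets the cost parameter C.
cv_accuracy = svm_train(y, x, '-t 0 -c 1 -v 10')

# Train a final model and predict; here we predict on the training
# points only for brevity (in practice, on the held-out test set).
model = svm_train(y, x, '-t 0 -c 1')
labels, accuracy, values = svm_predict(y, x, model)
```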
3. Data Analysis:
The Wikipedia dataset is extremely unbalanced. For example, the 9th category, "Development", contains 181 articles, which makes it a large category, while other categories have fewer than 5 articles each. Moreover, since we assign the "+1" label only to articles belonging to the 9th category and "-1" to the rest, most of the labels in our training set are "-1", as the total number of articles is 11,500. This forms a one-versus-all comparison, and intuitively the predictor can easily predict all labels as "-1" and still achieve 99% accuracy, which is not the behavior we want. Leaving aside the unbalanced label proportions, this is a large dataset to run on a PC: it has 11,500 articles and a vocabulary of almost 28,600 terms, so we can foresee that the cross validation and model training tasks will be exhaustive in time.
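One standard LIBSVM answer to this kind of imbalance is to raise the cost of misclassifying the rare class with the -wi option, which scales the parameter C for class i. The sketch below assumes a roughly 1:60 positive-to-negative ratio; the weight value is chosen for illustration only.

```python
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.svm')  # placeholder path

# -w1 scales C for the "+1" class, -w-1 for the "-1" class.
# Penalizing errors on the ~181 positive articles roughly 60x more
# discourages the degenerate "always predict -1" solution.
model = svm_train(y, x, '-t 0 -c 1 -w1 60 -w-1 1')
```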
4. Classification by Support Vector Machine (LIBSVM):
4.1. Before starting the process, we divide our data into training, validation, and testing sets. We leave 1/10 of the data as testing data that is never seen during training. On the remaining 9/10, we run the 10-fold validation described in the following section.
4.2. Pick one category to classify on, and pull the data in the Wikipedia dataset out into an intermediate file, one entry per wiki page.
4.3. Read the intermediate file into the program written in MS Visual Studio, which organizes the data into the structure described above. Each line represents one wiki page, so each run gives us around 9 thousand lines as our training set (a sketch of this conversion step follows this list).
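As a rough illustration of steps 4.2 and 4.3 (not the actual Visual Studio program), here is a hypothetical Python sketch that turns per-page term counts into LIBSVM's sparse input lines, labeling pages in the chosen category as +1 and all others as -1; pages, vocab_index, and target_category are assumed inputs.

```python
def to_libsvm_lines(pages, vocab_index, target_category):
    """Convert wiki pages into LIBSVM sparse format.

    pages           : list of (term_counts_dict, set_of_categories)
    vocab_index     : dict mapping term -> 1-based feature index
    target_category : the category receiving the +1 label
    """
    lines = []
    for term_counts, categories in pages:
        label = '+1' if target_category in categories else '-1'
        # LIBSVM expects feature indices in ascending order.
        feats = sorted(
            (vocab_index[t], c) for t, c in term_counts.items()
            if t in vocab_index
        )
        lines.append(label + ' ' + ' '.join(f'{i}:{c}' for i, c in feats))
    return lines

# Hypothetical usage with two toy pages:
pages = [({'tax': 2, 'vote': 1}, {'Development'}),
         ({'war': 3}, {'Elections'})]
vocab_index = {'tax': 1, 'vote': 2, 'war': 3}
print('\n'.join(to_libsvm_lines(pages, vocab_index, 'Development')))
```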