Data Mining Project - Final Report
Even Cheng CS277 Data Mining Mar 18, 2010
PROJECT: FINAL REPORT
Problem Definition: Multi-label Classification in the Wikipedia Dataset.
1. Introduction:
In many aspects of daily life we run into the need for multi-label classification. From the categories in libraries to the sections in marts, humans can classify objects with extremely high accuracy using all of our senses, even when facing objects that carry similar labels. But is it another story when a computer does the same thing? Nowadays there are several well-defined and widely used classification methods, such as the K-nearest neighbor algorithm, Gaussian mixture models, neural networks, and the Support Vector Machine. They can achieve a fair prediction result when the differences among labels are distinct, but when the labels are closely related it becomes much harder for a computer to classify them with good accuracy. In this project, we simulate the situation described above using a dataset drawn from part of the political categories in Wikipedia, and we choose the Support Vector Machine as our classification method. For implementation we adopt LIBSVM, a tool that provides efficient model training, prediction, evaluation, and more. In addition, plenty of miscellaneous tools have been developed on top of LIBSVM, such as a parameter finder written in Python [3] and a multi-label classification tool [2], which gave us a good sense of direction for our research. This final report answers the following questions:
• What can we do if the dataset is unbalanced?
• How does the accuracy vary if we use different numbers of labels in the classification process?
• Is it true that we always get better accuracy by using tf-idf when generating feature vectors?
• How does the accuracy change if we increase the size of our training set?
In this report, we start by reviewing papers related to multi-label classification in Section 2 and briefly cover our dataset in Section 3. In Section 4 we introduce how we adopted the LIBSVM tool and interpreted its input and output. Section 5 contains the experiment results and short discussions answering the questions above. In the last two parts, we close the report with suggestions for future work and a conclusion.
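Since one of the questions above concerns tf-idf weighting, here is a minimal sketch of how tf-idf feature weights can be computed from raw term counts; the function name and the toy documents are illustrative, not part of the original pipeline.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf(t, d) = count of term t in document d
    idf(t)   = log(N / df(t)), where df(t) is the number of
               documents containing t and N is the corpus size.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Weight each raw count by the log-scaled inverse document frequency.
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

# Toy usage: a term appearing in every document gets weight 0.
docs = [["tax", "policy", "vote"], ["vote", "war"], ["tax", "vote"]]
for vec in tf_idf(docs):
    print(vec)
```

Note how the ubiquitous term "vote" receives zero weight, which is the reason tf-idf can help when common words carry little class information.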
2. Related Works:
Single-label classification means we assign exactly one label to an object. In real life, however, many things span a spectrum of states, so we need multiple labels to reflect them. Multi-label classification was originally motivated by the ambiguity in text categorization [4]. Freund and Schapire proposed AdaBoost [5], and Schapire later continued with the extension BoosTexter [6]; since then, computer scientists have applied the idea in different ways, for example image recognition of scenes and humans. In image recognition, we usually classify a set of feature vectors into a combination of labels rather than a single returned label [7][8]. There are several popular ways to construct a classifier, for instance: the Naïve Bayes classifier, the K-nearest neighbor algorithm proposed by Cover and Hart in 1967, the decision tree as described by Murthy in 1998, and the Support Vector Machine proposed by Vapnik in 1995 [9]. The Support Vector Machine is a vector-based, supervised machine learning method used for classification and regression. In the scenario of a two-class dataset that is separable by a linear classifier, the aim is to find the maximum margin that separates the two classes. The data points on the margin are called the support vectors [10]. LIBSVM [1] is a library for support vector classification and regression. It implements various formulations and provides a model selection tool that does cross validation via a parallel grid search, a more reliable way to select a model. Furthermore, it supports multi-class classification using the "one-against-one" approach proposed by S. Knerr, which leads us into the discussion of multi-label classification.
Figure 1: The margin between two classes of data is maximized [11]
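As a concrete illustration of the LIBSVM workflow described above, the following is a minimal sketch using LIBSVM's official Python interface (svmutil); the file name train.svm is a placeholder, and the parameter values are illustrative rather than the ones used in our experiments.

```python
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# Load data in LIBSVM's sparse format: "<label> <index>:<value> ..."
y, x = svm_read_problem('train.svm')  # placeholder path

# With -v, svm_train runs 10-fold cross validation and returns the
# CV accuracy instead of a model. -t 0 selects a linear kernel,
# -c sets the cost parameter C.
cv_accuracy = svm_train(y, x, '-t 0 -c 1 -v 10')

# Train a final model and predict; here we predict on the training
# points only for brevity (in practice, on the held-out test set).
model = svm_train(y, x, '-t 0 -c 1')
labels, accuracy, values = svm_predict(y, x, model)
```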
3. Data Analysis:
The Wikipedia dataset is extremely unbalanced. For example, the 9th category, "Development", contains 181 articles, which makes it a large category, while other categories have fewer than 5 articles each. Moreover, since we assign the "+1" label only to articles belonging to the 9th category and "-1" to the rest, most of the labels in our training set are "-1", as the total number of articles is 11,500. This forms a one-versus-all comparison, and intuitively the predictor can easily predict all labels as "-1" and still achieve 99% accuracy, which is not the behavior we want. Leaving aside the unbalanced label proportions, this is a large dataset to run on a PC: it has 11,500 articles and a vocabulary of almost 28,600 terms, so we can foresee that the cross validation and model training tasks will be exhaustive in time.
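One standard LIBSVM answer to this kind of imbalance is to raise the cost of misclassifying the rare class with the -wi option, which scales the parameter C for class i. The sketch below assumes a roughly 1:60 positive-to-negative ratio; the weight value is chosen for illustration only.

```python
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.svm')  # placeholder path

# -w1 scales C for the "+1" class, -w-1 for the "-1" class.
# Penalizing errors on the ~181 positive articles roughly 60x more
# discourages the degenerate "always predict -1" solution.
model = svm_train(y, x, '-t 0 -c 1 -w1 60 -w-1 1')
```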
4. Classification by Support Vector Machine (LIBSVM):
4.1. Before starting the process, we divide our data into training, validation, and testing sets. We leave 1/10 of the data as testing data that is never seen during training. On the remaining 9/10, we run the 10-fold validation described in the following section.
4.2. Pick one category to classify on, and pull the data in the Wikipedia dataset out into an intermediate file, one entry per wiki page.
4.3. Read the intermediate file into the program written in MS Visual Studio, which organizes the data into the structure described above. Each line represents one wiki page, so each run gives us around 9 thousand lines as our training set (a sketch of this conversion step follows this list).
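As a rough illustration of steps 4.2 and 4.3 (not the actual Visual Studio program), here is a hypothetical Python sketch that turns per-page term counts into LIBSVM's sparse input lines, labeling pages in the chosen category as +1 and all others as -1; pages, vocab_index, and target_category are assumed inputs.

```python
def to_libsvm_lines(pages, vocab_index, target_category):
    """Convert wiki pages into LIBSVM sparse format.

    pages           : list of (term_counts_dict, set_of_categories)
    vocab_index     : dict mapping term -> 1-based feature index
    target_category : the category receiving the +1 label
    """
    lines = []
    for term_counts, categories in pages:
        label = '+1' if target_category in categories else '-1'
        # LIBSVM expects feature indices in ascending order.
        feats = sorted(
            (vocab_index[t], c) for t, c in term_counts.items()
            if t in vocab_index
        )
        lines.append(label + ' ' + ' '.join(f'{i}:{c}' for i, c in feats))
    return lines

# Hypothetical usage with two toy pages:
pages = [({'tax': 2, 'vote': 1}, {'Development'}),
         ({'war': 3}, {'Elections'})]
vocab_index = {'tax': 1, 'vote': 2, 'war': 3}
print('\n'.join(to_libsvm_lines(pages, vocab_index, 'Development')))
```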