Weka Tutorial
By Tresna Maulana Fahrudin
Department of Information and Computer Engineering, Graduate Program of Engineering Technology
Electronics Engineering Polytechnic Institute of Surabaya, Indonesia
1. Download two files from the following URLs:
a. Weka tool: http://users.aber.ac.uk/rkj/book/wekafull.jar
b. Dataset: http://tunedit.org/repo/UCI/breast-w.arff
2. Open wekafull.jar, and then click the “Explorer” button.
3. Click the “Open file” button and choose the “breast-w.arff” file.
4. If the data opens successfully, Weka shows all of its attributes (features). This dataset consists of 9 features and 2 classes (benign and malignant).
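For readers who want to check the attribute list outside Weka, here is a minimal Python sketch that reads the header of an ARFF file. The function name read_arff_attributes and the small sample text are hypothetical, made up for illustration (the sample is not the content of breast-w.arff). The sketch assumes a simple header with one @attribute declaration per line, no quoted names containing spaces, and the class stored as the last attribute.

```python
def read_arff_attributes(text):
    """Return the attribute names declared in an ARFF header, in order."""
    names = []
    for line in text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break  # the header ends where the data section begins
        if lowered.startswith("@attribute"):
            # "@attribute <name> <type>": the name is the second token
            names.append(stripped.split(None, 2)[1])
    return names


# Hypothetical three-attribute sample, NOT the real breast-w.arff content.
SAMPLE = """@relation demo
@attribute size numeric
@attribute shape numeric
@attribute Class {benign,malignant}
@data
1,2,benign
"""

attributes = read_arff_attributes(SAMPLE)
features = attributes[:-1]  # assumption: the class is the last attribute
print(len(features), "features; class attribute:", attributes[-1])
```

If breast-w.arff follows this simple layout, reading the file's text and passing it to the same function should list the 9 features followed by the class attribute.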
5. To set up the feature selection (attribute selection) scenario, first measure the classification accuracy on the original data, before any feature selection, so that you have a baseline. Click tab “Classify – weka – classifiers – bayes – NaiveBayes”
As you know, there are four popular algorithms in data mining:
a. Naïve Bayes (probability-based),
b. Decision Tree (tree/reasoning-based),
c. Rule Induction (IF-THEN rule-based), and
d. K-Nearest Neighbor (distance-based).
You can apply these algorithms in Weka by following the instructions below:
Naïve Bayes: Click tab “Classify – weka – classifiers – bayes – NaiveBayes”
Decision Tree: Click tab “Classify – weka – classifiers – trees – J48”
Rule Induction: Click tab “Classify – weka – classifiers – rules – JRip”
K-Nearest Neighbor: Click tab “Classify – weka – classifiers – lazy – IBk”
After you choose an algorithm (for example, Naïve Bayes), click the “Start” button to run it.
The figure above shows the accuracy of the Naïve Bayes algorithm on the breast cancer dataset. In my paper, I used four performance measures recommended by several researchers: Accuracy (TP rate), Precision, Recall, and F-measure. You can look up their definitions. In this example, the accuracy on the breast cancer dataset is 96%.
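The four measures named above can be computed from a binary confusion matrix. The following Python sketch uses the standard textbook definitions. The function name binary_metrics and the counts in the example are hypothetical (they are not the Weka output for this dataset), and treating one class as "positive" is an assumption of the sketch.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-measure for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called TP rate
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts chosen only to illustrate the formulas.
m = binary_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Weka reports TP rate, precision, recall, and F-measure per class together with a weighted average. The weighted-average TP rate coincides with the overall accuracy, because each class's recall is weighted by that class's share of the instances.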
6. Now that we know the accuracy on the original breast cancer dataset (96%), we want to measure the effect of feature selection. Follow this instruction: Click tab “Select attributes – weka – attributeSelection – AntSearch”
The figure above shows that we chose Ant Colony (AntSearch) as the feature selection algorithm.
7. Click the “Start” button to run the algorithm and get the result.
The figure above shows that, according to the Ant Colony feature selection algorithm, every feature of the breast cancer dataset is important: the algorithm selected all 9 of the 9 existing features as high-contribution features. On other datasets the result can differ. In one case, the Ant Colony algorithm selected 533 features as important out of 2001 existing features (you can read my paper).
Existing features
After feature selection by GA (Genetic Algorithm)
8. For example, if the Ant Colony algorithm selects 5 of the 9 existing features, you can remove the other 4 features from the data.
The figure above shows how to do this: in the feature column, check each feature that the feature selection result marks as unimportant, and then click the “Remove” button.
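The same removal can be sketched outside the Weka GUI. This Python example keeps only the selected feature columns plus the class column. The function name keep_selected, the zero-based indices, and the sample rows are hypothetical, and the sketch assumes that the class is the last column and that the selected list contains feature indices only.

```python
def keep_selected(rows, selected):
    """Keep the columns listed in `selected` (zero-based) plus the final class column."""
    kept = sorted(set(selected))
    return [[row[i] for i in kept] + [row[-1]] for row in rows]


# Hypothetical data: 4 feature columns followed by the class label.
rows = [
    [5, 1, 1, 2, "benign"],
    [8, 7, 5, 10, "malignant"],
]

# Suppose feature selection kept features 0 and 2, so features 1 and 3 are removed.
reduced = keep_selected(rows, selected=[0, 2])
print(reduced)
```

Listing the features to keep, rather than the ones to drop, mirrors the feature selection output, which reports the selected features.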
9. As the last step, measure the accuracy on the feature-selected data (not the original data) by applying the Naïve Bayes classifier again, and compare it with the accuracy obtained on the original data.
- Hopefully this is useful for you -