tutorialspoint SIMPLYEASYLEARNING
About the Tutorial
Data mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts off with a basic overview and the terminologies involved in data mining and then gradually moves on to cover topics such as knowledge discovery, query language, classification and prediction, decision tree induction, cluster analysis, and how to mine the Web.

Audience
This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Prerequisites
Before proceeding with this tutorial, you should have an understanding of basic database concepts such as schema, ER model, and Structured Query Language, and a basic knowledge of data warehousing concepts.

Copyright & Disclaimer
Copyright by Tutorials Point (I) Pvt. Ltd.
All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or republish any contents or a part of contents of this e-book in any manner without written consent of the publisher. We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com
Table of Contents
About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents
1. OVERVIEW
   What is Data Mining?
   Data Mining Applications
   Market Analysis and Management
   Corporate Analysis and Risk Management
   Fraud Detection
2. TASKS
   Descriptive Function
   Classification and Prediction
   Data Mining Task Primitives
3. ISSUES
   Mining Methodology and User Interaction Issues
   Performance Issues
   Diverse Data Types Issues
4. EVALUATION
   Data Warehouse
   Data Warehousing
   Query-Driven Approach
   Update-Driven Approach
   From Data Warehousing (OLAP) to Data Mining (OLAM)
   Importance of OLAM
5. TERMINOLOGIES
   Data Mining
   Data Mining Engine
   Knowledge Base
   Knowledge Discovery
   User Interface
   Data Integration
   Data Cleaning
   Data Selection
   Clusters
   Data Transformation
6. KNOWLEDGE DISCOVERY
   What is Knowledge Discovery?
7. SYSTEMS
   Data Mining System Classification
   Integrating a Data Mining System with a DB/DW System
8. QUERY LANGUAGE
   Syntax for Task-Relevant Data Specification
   Syntax for Specifying the Kind of Knowledge
   Syntax for Concept Hierarchy Specification
   Full Specification of DMQL
   Data Mining Languages Standardization
9. CLASSIFICATION AND PREDICTION
   What is Classification?
   What is Prediction?
10. DECISION TREE INDUCTION
   Decision Tree Induction Algorithm
   Tree Pruning
   Cost Complexity
11. BAYESIAN CLASSIFICATION
   Bayes' Theorem
   Bayesian Belief Network
   Directed Acyclic Graph
   Directed Acyclic Graph Representation
   Conditional Probability Table
12. RULE-BASED CLASSIFICATION
   IF-THEN Rules
   Rule Extraction
   Rule Induction Using Sequential Covering Algorithm
   Rule Pruning
13. MISCELLANEOUS CLASSIFICATION METHODS
   Genetic Algorithms
   Rough Set Approach
   Fuzzy Set Approach
14. CLUSTER ANALYSIS
   What is Clustering?
   Applications of Cluster Analysis
   Requirements of Clustering in Data Mining
   Clustering Methods
15. MINING TEXT DATA
   Information Retrieval
   Basic Measures for Text Retrieval
16. MINING WORLD WIDE WEB
   Challenges in Web Mining
   Mining Web Page Layout Structure
   Vision-based Page Segmentation (VIPS)
17. APPLICATIONS AND TRENDS
   Data Mining Applications
   Data Mining System Products
   Choosing a Data Mining System
   Trends in Data Mining
18. THEMES
   Theoretical Foundations of Data Mining
   Statistical Data Mining
   Visual Data Mining
   Audio Data Mining
   Data Mining and Collaborative Filtering
1. OVERVIEW
There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also involves other processes such as data cleaning, data integration, data transformation, data mining, pattern evaluation, and data presentation. Once all these processes are over, we are able to use this information in many applications such as fraud detection, market analysis, production control, science exploration, etc.

What is Data Mining?
Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration

Data Mining Applications
Data mining is highly useful in the following domains:
• Market Analysis and Management
• Corporate Analysis & Risk Management
• Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.

Market Analysis and Management
Listed below are the various fields of the market where data mining is used:
• Customer Profiling - Data mining helps determine what kind of people buy what kind of products.
• Identifying Customer Requirements - Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
• Cross Market Analysis - Data mining performs association/correlation analysis between product sales.
• Target Marketing - Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
• Determining Customer Purchasing Pattern - Data mining helps in determining customer purchasing patterns.
• Providing Summary Information - Data mining provides us with various multidimensional summary reports.

Corporate Analysis and Risk Management
Data mining is used in the following fields of the corporate sector:
• Finance Planning and Asset Evaluation - It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
• Resource Planning - It involves summarizing and comparing resources and spending.
• Competition - It involves monitoring competitors and market directions.

Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
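
The idea of flagging patterns that deviate from expected norms can be sketched in a few lines of Python. This is only a minimal illustration, not a method the tutorial prescribes: the z-score rule, the `flag_unusual_calls` function, the threshold of 2.0, and the call durations are all assumptions made for the example.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=2.0):
    """Return the indices of call durations that lie more than `threshold`
    sample standard deviations away from the mean duration."""
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls are identical, nothing deviates
    return [i for i, d in enumerate(durations) if abs(d - mu) / sigma > threshold]

# Hypothetical call durations (minutes) for one subscriber.
call_minutes = [3, 4, 2, 5, 3, 4, 3, 2, 4, 95]
suspicious = flag_unusual_calls(call_minutes)
print(suspicious)
```

A real fraud detector would combine several attributes (destination, duration, time of day or week) rather than a single one, but the principle of measuring distance from the norm stays the same.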

2. TASKS
Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:
• Descriptive
• Classification and Prediction

Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the list of descriptive functions:
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters

Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:
• Data Characterization - This refers to summarizing the data of the class under study. The class under study is called the target class.
• Data Discrimination - This refers to the mapping or classification of a class with some predefined group or class.

Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns:
• Frequent Item Set - It refers to a set of items that frequently appear together, for example, milk and bread.
• Frequent Subsequence - A sequence of patterns that occurs frequently, such as purchasing a camera followed by a memory card.
• Frequent Substructure - Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.

Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together. Association mining is the process of uncovering relationships among data and determining association rules. For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
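
A milk-and-bread rule of this kind is usually quantified with two standard measures, support and confidence. The sketch below assumes a toy set of baskets, chosen so that milk accompanies bread in 7 of the 10 baskets that contain bread and biscuits in 3 of them; the function names and the data are illustrative only.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(transactions, itemset):
    """Fraction of all transactions that contain the item set."""
    return count_containing(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)

# Hypothetical baskets: 10 contain bread, 7 of those also contain milk,
# 3 of those also contain biscuits.
baskets = (
    [{"bread", "milk"}] * 5
    + [{"bread", "milk", "biscuits"}] * 2
    + [{"bread", "biscuits"}]
    + [{"bread"}] * 2
    + [{"milk"}] * 2
)

print(confidence(baskets, {"bread"}, {"milk"}))      # share of bread baskets with milk
print(confidence(baskets, {"bread"}, {"biscuits"}))  # share of bread baskets with biscuits
print(support(baskets, {"bread", "milk"}))           # share of all baskets with both
```

Confidence answers "how often does the rule hold when bread is bought", while support answers "how common is the combination overall"; a rule is normally reported only when both are high enough.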

Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to analyze whether they have a positive, negative, or no effect on each other.
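
One common way to check whether two items have a positive, negative, or no effect on each other is the lift measure: the observed co-occurrence divided by the co-occurrence expected if the items were independent. The tutorial does not name a specific measure, so lift, the tolerance value, and the baskets below are assumptions for illustration.

```python
def lift(transactions, a, b):
    """P(a and b) / (P(a) * P(b)); 1.0 means the items look independent."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    return p_ab / (p_a * p_b)

def effect(value, tolerance=0.05):
    """Label a lift value as a positive, negative, or absent correlation."""
    if value > 1 + tolerance:
        return "positive"
    if value < 1 - tolerance:
        return "negative"
    return "none"

# Hypothetical baskets: ink always goes with a printer, novels never do,
# and gum is bought independently of printers.
baskets = [
    {"printer", "ink", "gum"}, {"printer", "ink"},
    {"printer", "ink", "gum"}, {"printer", "ink"},
    {"novel", "gum"}, {"novel"},
    {"novel", "gum"}, {"novel"},
]

for other in ("ink", "novel", "gum"):
    print(other, effect(lift(baskets, "printer", other)))
```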

Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
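
As a rough illustration of forming groups of similar objects, here is a minimal k-means sketch. K-means is just one well-known clustering method and is not something this section prescribes; the seeding rule (first k points), the iteration count, and the sample points are assumptions for the example.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to
    its nearest centroid and recomputing the centroids."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D objects forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
clusters, centroids = k_means(points, k=2)
print(clusters)
```

Objects within each resulting group are close to one another and far from the objects in the other group, which is the property the definition above describes.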

Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms:
• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
The functions involved in these processes are as follows:
• Classification - It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are known.
• Prediction - It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
• Outlier Analysis - Outliers may be defined as the data objects that do not comply with the general behavior or model of the available data.
• Evolution Analysis - Evolution analysis refers to describing and modelling regularities or trends for objects whose behavior changes over time.
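
To make the Prediction item concrete, the sketch below fits a straight line by ordinary least squares and uses it to estimate a missing numeric value. Simple linear regression is only one form of regression analysis; the function names and the experience/salary figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Estimate the numeric value for x from a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

model = fit_line(years, salary)
print(predict(model, 6))  # estimate for an employee whose salary is missing
```

Unlike classification, the output here is a number on a continuous scale rather than a class label.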

Data Mining Task Primitives
• We can specify a data mining task in the form of a data mining query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task primitives.
Note: These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of data mining task primitives:
• Set of task-relevant data to be mined.
• Kind of knowledge to be mined.
• Background knowledge to be used in the discovery process.
• Interestingness measures and thresholds for pattern evaluation.
• Representation for visualizing the discovered patterns.

Set of Task-Relevant Data to Be Mined
This is the portion of the database in which the user is interested. This portion includes the following:
• Database attributes
• Data warehouse dimensions of interest

Kind of Knowledge to Be Mined
It refers to the kind of functions to be performed. These functions are:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis

Background Knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one kind of background knowledge that allows data to be mined at multiple levels of abstraction.
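
A concept hierarchy can be pictured as a mapping from a lower-level concept to a higher-level one. The sketch below rolls sales figures up from the city level to the country level; the city-to-country mapping and the sales numbers are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

def roll_up(values_by_low_level, hierarchy):
    """Aggregate values from a lower abstraction level to the next higher one."""
    totals = Counter()
    for low_level, value in values_by_low_level.items():
        totals[hierarchy[low_level]] += value
    return dict(totals)

sales_by_city = {"Chicago": 120, "New York": 200, "Toronto": 80, "Vancouver": 50}
sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
print(sales_by_country)
```

The same data can then be mined at either level: patterns among cities, or more general patterns among countries.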

Interestingness Measures and Thresholds for Pattern Evaluation
These are used to evaluate the patterns discovered by the knowledge discovery process. Different kinds of knowledge have different interestingness measures.
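
For association rules, for instance, the usual interestingness measures are support and confidence, and a threshold on each decides which patterns are kept. The rules, their scores, and the threshold values below are invented purely to show the filtering step.

```python
def interesting(rules, min_support, min_confidence):
    """Keep only the rules that meet both user-specified thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

# Hypothetical mined rules with their measured support and confidence.
rules = [
    {"rule": "bread -> milk", "support": 0.58, "confidence": 0.70},
    {"rule": "bread -> biscuits", "support": 0.25, "confidence": 0.30},
    {"rule": "camera -> memory card", "support": 0.05, "confidence": 0.90},
]

kept = interesting(rules, min_support=0.10, min_confidence=0.50)
print([r["rule"] for r in kept])
```

The second rule is dropped for low confidence and the third for low support, even though its confidence is high.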

Representation for Visualizing the Discovered Patterns
This refers to the form in which discovered patterns are to be displayed. These representations may include the following:
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
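
Taken together, the five task primitives can be thought of as the fields of a single query object. The sketch below is not DMQL or any real system's API; the `MiningQuery` class, its field names, and the `problems` check are hypothetical and only mirror the lists given in this chapter.

```python
from dataclasses import dataclass, field

KNOWLEDGE_KINDS = {
    "characterization", "discrimination", "association and correlation analysis",
    "classification", "prediction", "clustering", "outlier analysis",
    "evolution analysis",
}
PRESENTATIONS = {"rules", "tables", "charts", "graphs", "decision trees", "cubes"}

@dataclass
class MiningQuery:
    relevant_data: list                      # attributes or warehouse dimensions
    knowledge_kind: str                      # one of KNOWLEDGE_KINDS
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measure -> threshold
    presentation: str = "rules"              # one of PRESENTATIONS

def problems(query):
    """Return a list of problems; an empty list means the query is well formed."""
    found = []
    if not query.relevant_data:
        found.append("no task-relevant data selected")
    if query.knowledge_kind not in KNOWLEDGE_KINDS:
        found.append(f"unknown kind of knowledge: {query.knowledge_kind}")
    if query.presentation not in PRESENTATIONS:
        found.append(f"unknown presentation: {query.presentation}")
    return found

query = MiningQuery(
    relevant_data=["customer.age", "customer.income", "sales.amount"],
    knowledge_kind="classification",
    interestingness={"confidence": 0.5},
    presentation="decision trees",
)
print(problems(query))
```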

3. ISSUES
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. In this tutorial, we will discuss the major issues regarding:
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
The following diagram describes the major issues.

Mining Methodology and User Interaction Issues
It refers to the following kinds of issues:
• Mining different kinds of knowledge in databases - Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction - The data mining process needs to be interactive, because interaction allows users to focus the search for patterns, issuing and refining data mining requests based on the returned results.
• Incorporation of background knowledge - Background knowledge can be used to guide the discovery process and to express the discovered patterns. It allows discovered patterns to be expressed not only in concise terms but also at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results - Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easy to understand.
• Handling noisy or incomplete data - Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation - The patterns discovered should be interesting. A pattern may be uninteresting because it either represents common knowledge or lacks novelty.

Performance Issues
There are performance-related issues such as the following:
• Efficiency and scalability of data mining algorithms - In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel. The results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
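
The partition, process, merge, and incremental-update steps described above can be sketched with item counting as the mining task. This is a simplified illustration under several assumptions: item counting stands in for a real mining algorithm, threads stand in for parallel or distributed workers, and the function names and transactions are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def mine_in_partitions(transactions, n_partitions=2):
    """Split the data, process the partitions concurrently, merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)  # Counter.update adds counts together
    return merged

def incremental_update(counts, new_transactions):
    """Fold newly arrived transactions into existing counts without
    rescanning the old data."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated

transactions = [["milk", "bread"], ["bread"], ["milk"], ["bread", "eggs"]]
counts = mine_in_partitions(transactions)
counts_after = incremental_update(counts, [["eggs", "milk"]])
print(counts, counts_after)
```

Merging works here because counts from disjoint partitions simply add up; algorithms whose partial results do not combine this easily need a more careful merge step.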

Diverse Data Types Issues
• Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems - The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

4. EVALUATION
Data Warehouse
A data warehouse exhibits the following characteristics to support the management's decision-making process:
• Subject Oriented - A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on the modelling and analysis of data for decision making.
• Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
• Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.
• Non-volatile - Non-volatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Data Warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches:
• Query-Driven Approach
• Update-Driven Approach

Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of the Query-Driven Approach
1. When a query is issued on the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.
2. These queries are then mapped and sent to the local query processors.
3. The results from the heterogeneous sites are integrated into a global answer set.
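
The three steps above can be sketched as a tiny mediator. Everything in this example is hypothetical: the two sites, their field names, the `METADATA` dictionary, and the in-memory lists that stand in for local databases and their query processors.

```python
# Hypothetical metadata dictionary: global field name -> local field name.
METADATA = {
    "site_a": {"customer": "cust_name", "amount": "total"},
    "site_b": {"customer": "client", "amount": "amt"},
}

# In-memory stand-ins for two heterogeneous local databases.
SITES = {
    "site_a": [{"cust_name": "Ann", "total": 40}, {"cust_name": "Bob", "total": 15}],
    "site_b": [{"client": "Cy", "amt": 70}],
}

def translate(global_fields, site):
    """Step 1: rewrite the global query in terms of one site's schema."""
    return [METADATA[site][name] for name in global_fields]

def run_local(site, local_fields):
    """Step 2: the local query processor answers the translated query."""
    return [tuple(row[name] for name in local_fields) for row in SITES[site]]

def mediate(global_fields):
    """Step 3: integrate the per-site results into one global answer set."""
    answer = []
    for site in SITES:
        answer.extend(run_local(site, translate(global_fields, site)))
    return answer

result = mediate(["customer", "amount"])
print(result)
```

Note that the translation and the round trip to every site happen again for each query, which is where the cost of this approach comes from.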

Disadvantages
This approach has the following disadvantages:
• The query-driven approach needs complex integration and filtering processes.
• It is very inefficient and very expensive for frequent queries.
• This approach is expensive for queries that require aggregations.

Update-Driven Approach
Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages
This approach has the following advantages:
• This approach provides high performance.
• The data can be copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.
Query processing does not require an interface with the processing at local sources.
Data Mining
From Data Warehousing (OLAP) to Data Mining (OLAM)
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining and the mining of knowledge in multidimensional databases. The following diagram shows the integration of both OLAP and OLAM:
Importance of OLAM
OLAM is important for the following reasons:
• High quality of data in data warehouses - Data mining tools need to work on integrated, consistent, and cleaned data. These preprocessing steps are very costly. Data warehouses built by such preprocessing are valuable sources of high-quality data for OLAP and for data mining as well.
• Available information processing infrastructure surrounding data warehouses - Information processing infrastructure refers to the accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, and reporting and OLAP analysis tools.
• OLAP-based exploratory data analysis - Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.
• Online selection of data mining functions - Integrating OLAP with multiple data mining functions and online analytical mining gives users the flexibility to select the desired data mining functions and to swap data mining tasks dynamically.
5. TERMINOLOGIES
Data Mining
Data mining is defined as extracting information from a huge set of data. In other words, data mining is mining knowledge from data. This information can be used for any of the following applications:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
Data Mining Engine
The data mining engine is essential to the data mining system. It consists of a set of functional modules that perform the following functions:
• Characterization
• Association and Correlation Analysis
• Classification
• Prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis
Knowledge Base
This is the domain knowledge. It is used to guide the search or to evaluate the interestingness of the resulting patterns.
Knowledge Discovery
Some people treat data mining the same as knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
User Interface
The user interface is the module of a data mining system that enables communication between users and the data mining system. The user interface allows the following functionalities:
• Interact with the system by specifying a data mining query task.
• Provide information to help focus the search.
• Perform mining based on intermediate data mining results.
• Browse database and data warehouse schemas or data structures.
• Evaluate mined patterns.
• Visualize the patterns in different forms.
Data Integration
Data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.
Data Cleaning
Data cleaning is a technique applied to remove noisy data and correct inconsistencies in data. It involves transformations to correct wrong data. Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse.
Data Selection
Data selection is the process where data relevant to the analysis task are retrieved from the database. Sometimes data transformation and consolidation are performed before the data selection process.
Clusters
A cluster is a group of similar kinds of objects. Cluster analysis forms groups of objects that are very similar to each other but highly different from the objects in other clusters.
Data Transformation
In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
6. KNOWLEDGE DISCOVERY
What is Knowledge Discovery?
Some people don't differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:
• Data Cleaning - In this step, noise and inconsistent data are removed.
• Data Integration - In this step, multiple data sources are combined.
• Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
• Data Transformation - In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
• Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
• Pattern Evaluation - In this step, data patterns are evaluated.
• Knowledge Presentation - In this step, knowledge is represented.
The following diagram shows the process of knowledge discovery:
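The steps of the knowledge discovery process can be read as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch under stated assumptions: the records, the field names, and the trivial "mining" step (a spend threshold) are all hypothetical and stand in for real sources and real mining methods.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "city": "Pune", "spend": 120.0},
            {"id": 2, "age": None, "city": "Pune", "spend": 80.0}]
source_b = [{"id": 3, "age": 41, "city": "Delhi", "spend": 300.0},
            {"id": 3, "age": 41, "city": "Delhi", "spend": 300.0}]  # duplicate row


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [row for src in sources for row in src]


def clean(rows):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if None in row.values() or key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out


def select(rows, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: row[f] for f in fields} for row in rows]


def transform(rows):
    """Data transformation: aggregate spend per city."""
    totals = Counter()
    for row in rows:
        totals[row["city"]] += row["spend"]
    return dict(totals)


def mine(totals, threshold):
    """Data mining (trivial stand-in): cities whose total spend exceeds a threshold."""
    return {city for city, total in totals.items() if total > threshold}


cleaned = clean(integrate(source_a, source_b))
patterns = mine(transform(select(cleaned, ["city", "spend"])), threshold=100.0)
print(patterns)
```

Pattern evaluation and knowledge presentation are left out here; in practice they would filter the mined patterns by an interestingness measure and render them for the user.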
7. SYSTEMS
There is a large variety of data mining systems available. Data mining systems may integrate techniques from the following:
• Spatial Data Analysis
• Information Retrieval
• Pattern Recognition
• Image Analysis
• Signal Processing
• Computer Graphics
• Web Technology
• Business
• Bioinformatics
Data Mining System Classification
A data mining system can be classified according to the following criteria:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models and types of data, and the data mining system can be classified accordingly. For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is classified on the basis of functionalities such as:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These applications are as follows:
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
Integrating a Data Mining System with a DB/DW System
If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets. The list of integration schemes is as follows:
• No Coupling - In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.
• Loose Coupling - In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or in a data warehouse.
• Semi-tight Coupling - In this scheme, the data mining system is linked with a database or a data warehouse system and, in addition, efficient implementations of a few data mining primitives can be provided in the database.
• Tight Coupling - In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
8. QUERY LANGUAGE
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. DMQL is based on the Structured Query Language (SQL). Data mining query languages can be designed to support ad hoc and interactive data mining. DMQL provides commands for specifying primitives, can work with databases and data warehouses, and can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.
Syntax for Task-Relevant Data Specification
Here is the syntax of DMQL for specifying task-relevant data:

    use database database_name
    or
    use data warehouse data_warehouse_name
    in relevance to att_or_dim_list
    from relation(s)/cube(s) [where condition]
    order by order_list
    group by grouping_list
Syntax for Specifying the Kind of Knowledge
Here we will discuss the syntax for Characterization, Discrimination, Association, Classification, and Prediction.
Characterization
The syntax for Characterization is:

    mine characteristics [as pattern_name]
    analyze {measure(s)}

The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, a description of customer purchasing habits:

    mine characteristics as customerPurchasing
    analyze count%
Discrimination
The syntax for Discrimination is:

    mine comparison [as {pattern_name}]
    for {target_class} where {target_condition}
    {versus {contrast_class_i} where {contrast_condition_i}}
    analyze {measure(s)}

For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items at less than $100 on average. The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as:

    mine comparison as purchaseGroups
    for bigSpenders where avg(I.price) ≥ $100
    versus budgetSpenders where avg(I.price) < $100
    analyze count
Association
The syntax for Association is:

    mine associations [as {pattern_name}]
    {matching {metapattern}}

For example:

    mine associations as buyingHabits
    matching P(X: customer, W) ^ Q(X, Y) => buys(X, Z)

where X is the key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
Classification
The syntax for Classification is:

    mine classification [as pattern_name]
    analyze classifying_attribute_or_dimension

For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating and the mined classification is named classifyCustomerCreditRating:

    mine classification as classifyCustomerCreditRating
    analyze credit_rating
Prediction
The syntax for Prediction is:

    mine prediction [as pattern_name]
    analyze prediction_attribute_or_dimension
    {set {attribute_or_dimension_i = value_i}}
Syntax for Concept Hierarchy Specification
To specify concept hierarchies, use the following syntax:

    use hierarchy <hierarchy> for <attribute_or_dimension>

We use different syntaxes to define different types of hierarchies, such as:

    -schema hierarchies
    define hierarchy time_hierarchy on date as [date, month, quarter, year]

    -set-grouping hierarchies
    define hierarchy age_hierarchy for age on customer as
    level1: {young, middle_aged, senior} < level0: all
    level2: {20, ..., 39} < level1: young
    level3: {40, ..., 59} < level1: middle_aged
    level4: {60, ..., 89} < level1: senior

    -operation-derived hierarchies
    define hierarchy age_hierarchy for age on customer as
    {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)

    -rule-based hierarchies
    define hierarchy profit_margin_hierarchy on item as
    level_1: low_profit_margin < level_0: all
    if (price - cost) ...
Syntax for Interestingness Measures Specification
Interestingness measures and thresholds can be specified by the user with the statement:

    with <interest_measure_name> threshold = threshold_value

For example:

    with support threshold = 0.05
    with confidence threshold = 0.7
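To make the two thresholds concrete, the sketch below computes support and confidence for an association rule over a small list of transactions and checks them against the thresholds from the example above. The transactions and the helper names (`support`, `confidence`, `is_interesting`) are hypothetical; this is an illustration in Python, not part of DMQL.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so treat confidence as zero
    return support(transactions, set(antecedent) | set(consequent)) / base


def is_interesting(transactions, antecedent, consequent,
                   min_support=0.05, min_confidence=0.7):
    """Apply both thresholds, mirroring the 'with ... threshold' statements."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(is_interesting(transactions, {"butter"}, {"bread"}))  # confidence 1.0
print(is_interesting(transactions, {"bread"}, {"milk"}))    # confidence 2/3
```

With these numbers the rule butter => bread passes both thresholds, while bread => milk fails the confidence threshold of 0.7.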
Syntax for Pattern Presentation and Visualization Specification
We have a syntax that allows users to specify the display of discovered patterns in one or more forms:

    display as <result_form>

For example:

    display as table
Full Specification of DMQL
As a market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was purchased. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid with an American Express credit card. You would like to view the resulting descriptions in the form of a table.

    use database AllElectronics_db
    use hierarchy location_hierarchy for B.address
    mine characteristics as customerPurchasing
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, purchase P, items_sold S, branch B
    where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
          P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
    with noise threshold = 5%
    display as table
Data Mining Languages Standardization
Standardizing the data mining languages will serve the following purposes:
• Helps systematic development of data mining solutions.
• Improves interoperability among multiple data mining systems and functions.
• Promotes education and rapid learning.
• Promotes the use of data mining systems in industry and society.
9. CLASSIFICATION AND PREDICTION
There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:
• Classification
• Prediction
Classification models predict categorical class labels, and prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
What is Classification?
Following are examples of cases where the data analysis task is classification:
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
What is Prediction?
Following is an example of a case where the data analysis task is prediction:
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
Note: Regression analysis is a statistical methodology that is most often used for numeric prediction.
How Does Classification Work?
With the help of the bank loan application discussed above, let us understand the working of classification. The data classification process includes two steps:
• Building the Classifier or Model
• Using the Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set, made up of database tuples and their associated class labels.
• Each tuple in the training set belongs to a predefined category or class. These tuples are also referred to as samples, objects, or data points.
Using the Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
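The two steps can be sketched in a few lines: build a classifier from labeled training tuples, then estimate its accuracy on separate test tuples. This is a deliberately simple sketch, assuming a one-attribute majority-vote learner; the loan tuples, the `income` attribute, and the function names are hypothetical and are not a method prescribed by this tutorial.

```python
from collections import Counter, defaultdict


def build_classifier(training_set, attribute):
    """Learning step: map each value of one attribute to its majority class."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for row, label in training_set:
        by_value[row[attribute]][label] += 1
        overall[label] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    default = overall.most_common(1)[0][0]  # fallback for unseen attribute values
    return lambda row: rules.get(row[attribute], default)


def accuracy(classifier, test_set):
    """Classification step: fraction of test tuples labeled correctly."""
    correct = sum(classifier(row) == label for row, label in test_set)
    return correct / len(test_set)


# Hypothetical loan-application tuples: (attributes, class label).
training_set = [
    ({"income": "low"}, "risky"),
    ({"income": "low"}, "risky"),
    ({"income": "high"}, "safe"),
    ({"income": "high"}, "safe"),
    ({"income": "medium"}, "safe"),
]
test_set = [
    ({"income": "low"}, "risky"),
    ({"income": "high"}, "safe"),
    ({"income": "medium"}, "risky"),
    ({"income": "high"}, "safe"),
]

classifier = build_classifier(training_set, "income")
print(accuracy(classifier, test_set))
```

Keeping the test tuples separate from the training tuples is the point of the second step: accuracy measured on the training set itself would be an optimistic estimate.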
Classification and Prediction Issues
The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:
• Data Cleaning - Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis - A database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and Reduction - The data can be transformed by any of the following methods.
  o Normalization - The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step.
  o Generalization - The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
Note: Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.
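Two of the preparation activities above are easy to show directly: replacing a missing value with the most commonly occurring value of the attribute, and normalizing values into a small range. The sketch below assumes min-max normalization as the scaling method and uses `None` to mark a missing value; both choices are assumptions made for illustration.

```python
from collections import Counter


def fill_missing_with_mode(values):
    """Replace None with the most commonly occurring value of the attribute."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute: nothing to scale
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


print(fill_missing_with_mode(["fair", None, "excellent", "fair"]))
print(min_max_normalize([10, 20, 30]))
```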
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction:
  o Accuracy - The accuracy of a classifier refers to its ability to predict the class label correctly. The accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
  o Speed - This refers to the computational cost of generating and using the classifier or predictor.
  o Robustness - This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
  o Scalability - This refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
  o Interpretability - This refers to the extent to which the classifier or predictor can be understood.
10. DECISION TREE INDUCTION
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows:
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
    Data partition D, which is a set of training tuples and their associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.

Output:
    A decision tree

Method:
    create a node N;
    if tuples in D are all of the same class, C, then
        return N as a leaf node labeled with class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the majority class in D;  // majority voting
    apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and multiway splits allowed then  // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;  // remove splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;  // a partition
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
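The pseudocode above can be turned into a short runnable sketch. The version below makes several assumptions that the pseudocode leaves open: information gain (as in ID3) is used as the attribute selection method, all attributes are discrete with multiway splits, and the tree is represented as nested dictionaries. The training tuples are hypothetical. Because outcomes are enumerated from the values actually present in D, the "Dj is empty" branch of the pseudocode never arises here.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Expected information needed to classify a tuple in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def info_gain(rows, attr, target):
    """Reduction in entropy obtained by splitting rows on attr."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - remainder


def majority_class(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def generate_decision_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                 # all tuples belong to the same class
        return classes.pop()
    if not attributes:                    # attribute list empty: majority voting
        return majority_class(rows, target)
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]  # remove splitting attribute
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):     # one branch per outcome
        partition = [r for r in rows if r[best] == value]
        tree[best][value] = generate_decision_tree(partition, remaining, target)
    return tree


def classify(tree, row):
    """Follow branches from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


# Hypothetical training tuples for the concept buy_computer.
rows = [
    {"age": "youth", "student": "no", "buy_computer": "no"},
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "middle_aged", "student": "no", "buy_computer": "yes"},
    {"age": "middle_aged", "student": "yes", "buy_computer": "yes"},
    {"age": "senior", "student": "no", "buy_computer": "no"},
    {"age": "senior", "student": "yes", "buy_computer": "yes"},
]

tree = generate_decision_tree(rows, ["age", "student"], "buy_computer")
print(tree)
print(classify(tree, {"age": "senior", "student": "no"}))
```

Note that `classify` raises a `KeyError` for an attribute value that never appeared in training; a fuller implementation would fall back to the majority class at that node.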
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree:
• Pre-pruning - The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters:
• Number of leaves in the tree, and
• Error rate of the tree.
11. BAYESIAN CLASSIFICATION
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities:
• Posterior Probability [P(H|X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis. According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
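A small numeric example may help to see how the prior is updated into the posterior. All probabilities below are hypothetical and chosen only for illustration; P(X) is obtained from the law of total probability over H and not-H.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' Theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical values: H = "customer buys a computer", X = "customer is a student".
p_h = 0.3              # prior P(H)
p_x_given_h = 0.6      # likelihood P(X|H)
p_x_given_not_h = 0.2  # likelihood P(X|not H)

# Law of total probability: P(X) = P(X|H) P(H) + P(X|not H) P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 4))
```

Here P(X) = 0.18 + 0.14 = 0.32, so the posterior is 0.18 / 0.32 = 0.5625: observing X raises the probability of H from the prior 0.3.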
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined between subsets of variables.
• It provides a graphical model of causal relationships, on which learning can be performed.
• We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network:
• Directed acyclic graph
• A set of conditional probability tables
Directed Acyclic Graph
• Each node in a directed acyclic graph represents a random variable.
• These variables may be discrete or continuous valued.
• These variables may correspond to the actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC), showing each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S), is as follows:
12. RULE-BASED CLASSIFICATION
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form:
IF condition THEN conclusion
Let us consider a rule R1,

    R1: IF age = youth AND student = yes THEN buy_computer = yes
Points to remember:
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of the class prediction.
Note: We can also write rule R1 as follows:

    R1: (age = youth) ^ (student = yes) => (buy_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember:
To extract a rule from a decision tree:
• One rule is created for each path from the root to a leaf node.
• To form the rule antecedent, each splitting criterion along the path is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
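The three points above translate into a short recursive walk over the tree. The sketch below assumes the tree is stored as nested dictionaries of the form {attribute: {value: subtree_or_label}}; that representation, the example tree, and the function names are hypothetical choices made for illustration.

```python
def extract_rules(tree, conditions=()):
    """Return one (antecedent, class_label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):            # leaf: holds the class prediction
        return [(list(conditions), tree)]
    (attribute, branches), = tree.items()     # each internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(antecedent, label, target="buy_computer"):
    """Render a rule as IF ... AND ... THEN ... text."""
    tests = " AND ".join(f"{attr} = {value}" for attr, value in antecedent)
    return f"IF {tests} THEN {target} = {label}"


# Hypothetical decision tree for the concept buy_computer.
tree = {
    "age": {
        "youth": {"student": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}},
    }
}

rules = extract_rules(tree)
for antecedent, label in rules:
    print(format_rule(antecedent, label))
```

The tree has five leaves, so five rules are produced; the rules are mutually exclusive because each tuple follows exactly one path.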
Rule Induction Using Sequential Covering Algorithm
A sequential covering algorithm can be used to extract IF-THEN rules from the training data, without having to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues with the rest of the tuples.
Note: Decision tree induction can be considered as learning a set of rules simultaneously. This is because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and no tuple from any other class.
Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
    Rule_set = { };  // initial set of rules learned is empty
    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove tuples covered by Rule from D;
            Rule_set = Rule_set + Rule;  // add a new rule to the rule set
        until termination condition;
    end for
    return Rule_set;
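A compact sketch of the algorithm follows. Several details are assumptions rather than part of the algorithm as given: `learn_one_rule` greedily adds the attribute test with the highest accuracy for the target class, the termination condition is "no tuples of the class remain uncovered", and the working copy of D is reset for each class so that every class is learned against all tuples of the other classes. The data set and all names are hypothetical.

```python
def covers(rule, row):
    """A rule (dict of attribute tests) covers a row if every test holds."""
    return all(row[attr] == value for attr, value in rule.items())


def learn_one_rule(data, att_vals, target, c):
    """Greedy general-to-specific search for one rule predicting class c."""
    rule, covered = {}, data
    while any(r[target] != c for r in covered):       # rule still covers negatives
        best, best_score, best_cov = None, -1.0, None
        for attr, values in att_vals.items():
            if attr in rule:
                continue
            for value in values:
                cov = [r for r in covered if r[attr] == value]
                pos = sum(r[target] == c for r in cov)
                if pos == 0:
                    continue
                score = pos / len(cov)                # accuracy of the extended rule
                if score > best_score or (score == best_score and len(cov) > len(best_cov)):
                    best, best_score, best_cov = (attr, value), score, cov
        if best is None:                              # no attribute left to test
            break
        rule[best[0]] = best[1]
        covered = best_cov
    return rule


def sequential_covering(data, att_vals, target):
    rule_set = []                                     # initial rule set is empty
    for c in sorted({r[target] for r in data}):
        remaining = list(data)                        # assumption: reset D per class
        while any(r[target] == c for r in remaining): # termination condition
            rule = learn_one_rule(remaining, att_vals, target, c)
            rule_set.append((rule, c))                # add the new rule to the rule set
            remaining = [r for r in remaining if not covers(rule, r)]
    return rule_set


# Hypothetical class-labeled tuples.
data = [
    {"age": "youth", "student": "yes", "buys": "yes"},
    {"age": "youth", "student": "no", "buys": "no"},
    {"age": "senior", "student": "yes", "buys": "yes"},
    {"age": "senior", "student": "no", "buys": "no"},
]
att_vals = {"age": ["youth", "senior"], "student": ["yes", "no"]}

rule_set = sequential_covering(data, att_vals, "buys")
print(rule_set)
```

Algorithms such as CN2 and RIPPER use more refined rule-quality measures and stopping criteria than the plain accuracy used in this sketch.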
Rule Pruning
A rule is pruned for the following reasons:
• The assessment of quality is made on the original set of training data. The rule may perform well on the training data but less well on subsequent data. That is why rule pruning is required.
• A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note: This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
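The pruning decision is a direct comparison of two FOIL_Prune values. In the sketch below, the tuple counts for the original and pruned versions of a rule are hypothetical numbers measured on an assumed pruning set.

```python
def foil_prune(pos, neg):
    """FOIL_Prune = (pos - neg) / (pos + neg) for the tuples a rule covers."""
    return (pos - neg) / (pos + neg)


# Hypothetical coverage on the pruning set.
original = foil_prune(pos=40, neg=20)   # rule R
pruned = foil_prune(pos=50, neg=10)     # R with one conjunct removed

should_prune = pruned > original
print(original, pruned, should_prune)
```

With these counts the pruned version scores about 0.67 against about 0.33 for the original, so the conjunct would be removed.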
13. MISCELLANEOUS CLASSIFICATION METHODS
Here we will discuss other classification methods such as Genetic Algorithms, the Rough Set Approach, and the Fuzzy Set Approach.
Genetic Algorithms
The idea of the genetic algorithm is derived from natural evolution. In a genetic algorithm, an initial population is created first. This initial population consists of randomly generated rules. We can represent each rule by a string of bits.
For example, suppose the samples in a given training set are described by two Boolean attributes, A1 and A2, and the training set contains two classes, C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into the bit string 100. In this bit representation, the two leftmost bits represent the attributes A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note: If an attribute has K values where K > 2, then we can use K bits to encode the attribute values. The classes are also encoded in the same manner.
Points to remember:
• Based on the notion of the survival of the fittest, a new population is formed that consists of the fittest rules in the current population as well as offspring of these rules.
• The fitness of a rule is assessed by its classification accuracy on a set of training samples.
• Genetic operators such as crossover and mutation are applied to create offspring.
• In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
• In mutation, randomly selected bits in a rule's string are inverted.
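The bit-string encoding makes the genetic operators very short to write. The sketch below uses the three-bit encoding from the example (bits for A1 and A2, then one class bit where 1 means C1 and 0 means C2). The single-point crossover, the per-bit mutation rate, and the fitness definition (accuracy over the samples a rule covers) are assumptions chosen for illustration, and the training samples are hypothetical.

```python
import random


def crossover(rule_a, rule_b, point):
    """Swap the substrings after the crossover point to form a new pair of rules."""
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]


def mutate(rule, rate, rng):
    """Invert each bit independently with probability rate."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in rule
    )


def fitness(rule, samples):
    """Classification accuracy of the rule on the samples it covers."""
    a1, a2 = rule[0] == "1", rule[1] == "1"
    predicted = "C1" if rule[2] == "1" else "C2"
    covered = [s for s in samples if s[0] == a1 and s[1] == a2]
    if not covered:
        return 0.0
    return sum(s[2] == predicted for s in covered) / len(covered)


# Hypothetical training samples: (A1, A2, class).
samples = [(True, False, "C2"), (True, False, "C1"), (False, False, "C1")]

child_a, child_b = crossover("100", "001", point=1)
print(child_a, child_b)
print(mutate("100", rate=1.0, rng=random.Random(0)))
print(fitness("100", samples), fitness("001", samples))
```

A full genetic algorithm would repeat selection by fitness, crossover, and mutation over many generations; only the individual operators are shown here.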
Rough Set Approach
We can use the rough set approach to discover structural relationships within imprecise and noisy data.
Note: This approach can only be applied to discrete-valued attributes. Therefore, continuous-valued attributes must be discretized before use.
Rough set theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible; that is, the samples are identical with respect to the attributes describing the data.
Some classes in real-world data cannot be distinguished in terms of the available attributes. We can use rough sets to roughly define such classes.
For a given class C, the rough set definition is approximated by two sets as follows:
• Lower Approximation of C - The lower approximation of C consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to class C.
• Upper Approximation of C - The upper approximation of C consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to C.
The following diagram shows the Upper and Lower Approximation of class C:
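Both approximations can be computed by grouping tuples into equivalence classes on the describing attributes and then checking the class labels inside each group. The sketch below is a minimal illustration; the tuples, the `temp` attribute, and the `flu` class label are hypothetical.

```python
from collections import defaultdict


def approximations(rows, attributes, target, c):
    """Return (lower, upper) approximations of class c as lists of rows."""
    equivalence_classes = defaultdict(list)
    for row in rows:
        # Tuples with identical attribute values are indiscernible.
        equivalence_classes[tuple(row[a] for a in attributes)].append(row)

    lower, upper = [], []
    for members in equivalence_classes.values():
        labels = {m[target] for m in members}
        if labels == {c}:      # every member is in c: certainly belongs
            lower.extend(members)
        if c in labels:        # some member is in c: cannot be ruled out
            upper.extend(members)
    return lower, upper


# Hypothetical discrete-valued data.
rows = [
    {"id": 1, "temp": "high", "flu": "yes"},
    {"id": 2, "temp": "high", "flu": "yes"},
    {"id": 3, "temp": "normal", "flu": "yes"},
    {"id": 4, "temp": "normal", "flu": "no"},
    {"id": 5, "temp": "low", "flu": "no"},
]

lower, upper = approximations(rows, ["temp"], "flu", "yes")
print([r["id"] for r in lower])
print([r["id"] for r in upper])
```

Tuples 3 and 4 are indiscernible on `temp` but carry different labels, so they fall in the upper approximation only; the difference between the two sets is the boundary region where the class cannot be decided from the available attributes.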
Fuzzy Set Approach
Fuzzy Set Theory is also called Possibility Theory. This theory was proposed by Lotfi Zadeh in 1965 as an alternative to two-value logic and probability theory. It allows us to work at a high level of abstraction and provides a means for dealing with imprecise measurement of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, being a member of a set of high incomes is inexact (e.g. if $50,000 is high, then what about $49,000 and $48,000?). Unlike a traditional crisp set, where an element belongs either to S or to its complement, in fuzzy set theory an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing degrees. Fuzzy set notation for this income value is as follows:

    m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96

where m is the membership function that operates on the fuzzy sets medium_income and high_income, respectively. This notation can be shown diagrammatically as follows:
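Membership functions are often modeled as piecewise-linear ramps. The sketch below is one such model; the breakpoints (10,000, 20,000, 25,000, 32,000, 50,000, and 52,000) are hypothetical values picked so that the functions return degrees of about 0.15 and 0.96 for an income of 49,000, and they are not taken from this tutorial.

```python
def ramp_up(x, a, b):
    """0 below a, 1 above b, linear in between."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def ramp_down(x, c, d):
    """1 below c, 0 above d, linear in between."""
    return 1.0 - ramp_up(x, c, d)


def medium_income(x):
    # Hypothetical trapezoid: rises 10K-20K, flat to 32K, falls to 0 at 52K.
    return min(ramp_up(x, 10_000, 20_000), ramp_down(x, 32_000, 52_000))


def high_income(x):
    # Hypothetical ramp: starts at 25K and reaches full membership at 50K.
    return ramp_up(x, 25_000, 50_000)


income = 49_000
print(round(medium_income(income), 2), round(high_income(income), 2))
```

The same income therefore belongs to both fuzzy sets at once, which a crisp threshold such as "high means at least 50,000" cannot express.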
14. CLUSTER ANALYSIS
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember:
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
• Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent to populations.
• Clustering also helps in the identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining:
• Scalability - We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes - Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical) data, categorical data, and binary data.
• Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small spherical clusters.
• High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional space.
• Ability to deal with noisy data - Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor quality clusters.
• Interpretability - The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-based Method
• Model-based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data. Each partition will represent a cluster, and k ≤ n. It means that it will classify the data into k groups, which satisfy the following requirements:
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to Remember:
• For a given number of partitions (say k), the partitioning method will create an initial partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
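The two points above can be illustrated with a minimal sketch of k-means, one well-known partitioning algorithm. The tutorial does not name a specific algorithm, so the choice of k-means, the function names, the sample points, and the rule of seeding with the first k objects are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Partition points into k groups by iterative relocation (k-means sketch)."""
    # Initial partitioning (assumption): the first k objects act as group centers.
    centers = list(points[:k])
    labels = None
    for _ in range(max_iterations):
        # Relocation step: assign every object to the group with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no object changed group, so we are done
            break
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a group became empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each object receives exactly one label, which matches the requirement that every object belongs to exactly one group. This sketch does not strictly guarantee that every group stays non-empty; a production implementation would usually re-seed an empty group.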
Hierarchical Method
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here:
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another, and it continues doing so until all of the groups are merged into one or until the termination condition holds.
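The bottom-up merging described above can be sketched as follows. This is a minimal illustration on one-dimensional values, assuming single-linkage distance (the distance between the closest members of two groups) and a target number of clusters as the termination condition. Both choices, and all names used, are assumptions rather than part of the tutorial.

```python
def single_linkage(cluster_a, cluster_b):
    """Distance between two groups: the gap between their closest members."""
    return min(abs(a - b) for a in cluster_a for b in cluster_b)


def agglomerate(values, target_clusters=1):
    """Merge the two closest groups until only target_clusters remain."""
    clusters = [[v] for v in values]  # every object starts as its own group
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge group j into group i
        del clusters[j]
    return clusters


clusters = agglomerate([1.0, 1.5, 2.0, 10.0, 10.4, 25.0], target_clusters=3)
print(clusters)  # [[1.0, 1.5, 2.0], [10.0, 10.4], [25.0]]
```

Note that a merge is never revisited once it is made, which is the rigidity that hierarchical methods are known for.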
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical clustering:
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration with other clustering techniques by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
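The density condition above can be sketched in a few lines. The code below is a simplified illustration in the spirit of DBSCAN, not a full implementation. The names `radius` and `min_points`, the sample data, and the convention that a point counts as its own neighbor are assumptions.

```python
import math


def neighbours(points, index, radius):
    """Indices of all points within `radius` of points[index] (itself included)."""
    center = points[index]
    return [i for i, p in enumerate(points) if math.dist(center, p) <= radius]


def grow_cluster(points, seed, radius, min_points):
    """Grow a cluster from `seed` while the neighborhood stays dense enough."""
    cluster, frontier = {seed}, [seed]
    while frontier:
        current = frontier.pop()
        near = neighbours(points, current, radius)
        if len(near) >= min_points:  # density exceeds the threshold: keep growing
            for i in near:
                if i not in cluster:
                    cluster.add(i)
                    frontier.append(i)
    return sorted(cluster)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(grow_cluster(points, seed=0, radius=1.5, min_points=3))  # [0, 1, 2, 3]
print(grow_cluster(points, seed=4, radius=1.5, min_points=3))  # [4]
```

The isolated point at (10, 10) never attracts neighbors, so a density-based method would typically treat it as noise rather than as a cluster.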
Grid-based Method
In this method, the object space is quantized into a finite number of cells that form a grid structure, and the clustering operations are performed on that grid.
Advantages
• The major advantage of this method is its fast processing time.
• The processing time depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
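The quantization step can be sketched as follows. This is a minimal illustration, assuming square cells of a fixed size and a simple "at least two objects" rule for calling a cell dense. The cell size, the threshold, and the function names are hypothetical.

```python
from collections import Counter


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(coordinate // cell_size) for coordinate in point)


def grid_counts(points, cell_size):
    """Count how many objects fall into each occupied grid cell."""
    return Counter(cell_of(p, cell_size) for p in points)


points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 3.7), (3.9, 3.2), (9.5, 0.5)]
counts = grid_counts(points, cell_size=1.0)

# Later steps work on the cells only, not on the individual objects.
dense_cells = sorted(cell for cell, n in counts.items() if n >= 2)
print(dense_cells)  # [(0, 0), (3, 3)]
```

Once the counts are built, the cost of any further analysis depends on the number of occupied cells, which is why grid-based methods are fast on large data sets.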
Model-based Method
In this method, a model is hypothesized for each cluster, and the method finds the best fit of the data to the given model. It locates the clusters by constructing a density function that reflects the spatial distribution of the data points. This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by incorporating user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application requirement.
15. MINING TEXT DATA
Text databases consist of huge collections of documents. They collect this information from several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the amount of information, text databases are growing rapidly. In many text databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author, publishing_date, etc. But along with the structured data, the document also contains unstructured text components, such as the abstract and contents. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users require tools to compare the documents and rank their importance and relevance. Therefore, text mining has become a popular and essential theme in data mining.
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based documents. Because information retrieval systems and database systems handle different kinds of data, some features of database systems are not usually present in information retrieval systems. Examples of information retrieval systems include:
• Online library catalogue systems
• Online document management systems
• Web search systems, etc.
Note: The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query. Such a query consists of some keywords describing an information need.
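A keyword query of this kind can be sketched very simply. The example below is a toy illustration, assuming whitespace tokenization and a score equal to the number of keyword occurrences. Real retrieval systems use far more refined weighting, and the document names and function name here are hypothetical.

```python
def search(documents, query):
    """Return ids of documents that mention any query keyword, best match first."""
    keywords = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(keyword) for keyword in keywords)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken alphabetically by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


documents = {
    "d1": "data mining extracts knowledge from data",
    "d2": "web search systems",
    "d3": "text mining of web pages",
}
print(search(documents, "data mining"))  # ['d1', 'd3']
```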
In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad-hoc information need, i.e., a short-term need. But if the user has a long-term information need, then the retrieval system can also take the initiative to push any newly arrived information item to the user. This kind of access to information is called Information Filtering, and the corresponding systems are known as Filtering Systems or Recommender Systems.
Basic Measures for Text Retrieval
We need to check the accuracy of a system when it retrieves a number of documents on the basis of a user's input. Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram as follows:
There are three fundamental measures for assessing the quality of text retrieval:
• Precision
• Recall
• F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as:
$$\text{Precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as:
$$\text{Recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision as follows:
$$\text{F-score} = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision}) / 2}$$
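The three measures can be computed directly from the two sets. The following is a small sketch; the function name, the document ids, and the convention of returning 0.0 when a denominator would be zero are assumptions made for illustration.

```python
def retrieval_measures(relevant, retrieved):
    """Return (precision, recall, f_score) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # documents that are relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of recall and precision.
    f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


precision, recall, f_score = retrieval_measures(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d2", "d3", "d5"},
)
print(round(precision, 3), round(recall, 3), round(f_score, 3))  # 0.667 0.5 0.571
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2), which gives an F-score of 4/7.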
16. MINING WORLD WIDE WEB
The World Wide Web contains huge amounts of information and thus provides a rich source for data mining.
Challenges in Web Mining
The web poses great challenges for resource and knowledge discovery based on the following observations:
• The web is too huge - The size of the web is very large and rapidly increasing, which makes the web seem too large for data warehousing and data mining.
• Complexity of web pages - Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and they are not arranged according to any particular sorted order.
• Web is a dynamic information source - The information on the web is rapidly updated. Data such as news, stock markets, weather, sports, shopping, etc., are regularly updated.
• Diversity of user communities - The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.
• Relevancy of information - A particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to that user and may swamp the desired results.
Mining Web Page Layout Structure
The basic structure of a web page is based on the Document Object Model (DOM). The DOM structure refers to a tree-like structure where each HTML tag in the page corresponds to a node in the DOM tree. We can segment the web page by using predefined tags in HTML. The HTML syntax is flexible; therefore, many web pages do not follow the W3C specifications. Not following the W3C specifications may cause errors in the DOM tree structure.
The DOM structure was initially introduced for presentation in the browser and not for describing the semantic structure of a web page. The DOM structure cannot correctly identify the semantic relationships between the different parts of a web page.
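To make the tag-to-node correspondence concrete, here is a small sketch that lists each tag of a page together with its depth in the tag tree. It uses Python's standard `html.parser` module, which is an event-based tokenizer rather than a full DOM builder. The class name `OutlineParser`, the short list of void tags, and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record (depth, tag) for every start tag, mimicking a walk of the tag tree."""

    # Tags that never have a closing tag (partial list, enough for this sketch).
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in self.VOID:
            self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1


parser = OutlineParser()
parser.feed("<html><body><div><h1>Title</h1><p>Text<br>more</p></div></body></html>")
outline = parser.outline
for depth, tag in outline:
    print("  " * depth + tag)
```

If the page omitted a closing tag such as `</p>`, this simple depth counter would place later tags at the wrong level, which illustrates how markup that ignores the specifications can distort the tree structure.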
Vision-based Page Segmentation (VIPS)
• The purpose of VIPS is to extract the semantic structure of a web page based on its visual presentation.
• Such a semantic structure corresponds to a tree structure. In this tree, each node corresponds to a block.
• A value is assigned to each node. This value is called the Degree of Coherence. It indicates how coherent the content in the block is, based on visual perception.
• The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that, it finds the separators between these blocks.
• The separators refer to the horizontal or vertical lines in a web page that visually cross with no blocks.
• The semantics of the web page is constructed on the basis of these blocks.
The following figure shows the procedure of the VIPS algorithm:
17. APPLICATIONS AND TRENDS
Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, and yet there are many challenges in this field. In this tutorial, we will discuss the applications and trends of data mining.
Data Mining Applications
Here is the list of areas where data mining is widely used:
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows:
• Design and construction of data warehouses for multidimensional data analysis and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and good customer retention and satisfaction. Here is the list of examples of data mining in the retail industry:
• Design and construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time, and region.
• Analysis of the effectiveness of sales campaigns.
• Customer retention.
• Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is the reason why data mining has become very important for understanding the business.
Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is the list of examples for which data mining improves telecommunication services:
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis:
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which the statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. A large number of data sets are being generated because of fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining in the field of scientific applications:
• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have made intrusion detection a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection:
• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, and aggregation to help select and build discriminating attributes.
• Analysis of stream data.
• Distributed data mining.
• Visualization and query tools.
Data Mining System Products
There are many data mining system products and domain-specific data mining applications. New data mining systems and applications are being added to the previous systems. Also, efforts are being made to standardize data mining languages.
Choosing a Data Mining System
The selection of a data mining system depends on the following features:
• Data Types - The data mining system may handle formatted text, record-based data, and relational data. The data could also be in ASCII text, relational database data, or data warehouse data. Therefore, we should check what exact format the data mining system can handle.
• System Issues - We must consider the compatibility of a data mining system with different operating systems. One data mining system may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow XML data as input.
• Data Sources - Data sources refer to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files, while others work on multiple relational sources. A data mining system should also support ODBC connections or OLE DB for ODBC connections.
• Data Mining functions and methodologies - Some data mining systems provide only one data mining function, such as classification, while others provide multiple data mining functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.
• Coupling data mining with databases or data warehouse systems - Data mining systems need to be coupled with a database or a data warehouse system. The coupled components are integrated into a uniform information processing environment. Here are the types of coupling:
  o No coupling
  o Loose coupling
  o Semi-tight coupling
  o Tight coupling
• Scalability - There are two scalability issues in data mining:
  o Row (Database size) Scalability - A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute a query.
  o Column (Dimension) Scalability - A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
• Visualization Tools - Visualization in data mining can be categorized as follows:
  o Data visualization
  o Mining results visualization
  o Mining process visualization
  o Visual data mining
• Data Mining query language and graphical user interface - An easy-to-use graphical user interface is important to promote user-guided, interactive data mining. Unlike relational database systems, data mining systems do not share an underlying data mining query language.
Trends in Data Mining
Data mining concepts are still evolving, and here are the latest trends that we get to see in this field:
• Application exploration.
• Scalable and interactive data mining methods.
• Integration of data mining with database systems, data warehouse systems, and web database systems.
• Standardization of data mining query language.
• Visual data mining.
• New methods for mining complex types of data.
• Biological data mining.
• Data mining and software engineering.
• Web mining.
• Distributed data mining.
• Real-time data mining.
• Multi-database data mining.
• Privacy protection and information security in data mining.
18. THEMES
Theoretical Foundations of Data Mining
The theoretical foundations of data mining include the following concepts:
• Data Reduction - The basic idea of this theory is to reduce the data representation, trading accuracy for speed in response to the need to obtain quick approximate answers to queries on very large databases. Some of the data reduction techniques are as follows:
  o Singular Value Decomposition
  o Wavelets
  o Regression
  o Log-linear models
  o Histograms
  o Clustering
  o Sampling
  o Construction of Index Trees
• Data Compression - The basic idea of this theory is to compress the given data by encoding it in terms of the following:
  o Bits
  o Association Rules
  o Decision Trees
  o Clusters
• Pattern Discovery - The basic idea of this theory is to discover patterns occurring in a database. Following are the areas that contribute to this theory:
  o Machine Learning
  o Neural Network
  o Association Mining
  o Sequential Pattern Matching
  o Clustering
• Probability Theory - This theory is based on statistical theory. The basic idea behind this theory is to discover joint probability distributions of random variables.
• Microeconomic View - According to this theory, data mining finds the patterns that are interesting only to the extent that they can be used in the decision-making process of some enterprise.
• Inductive Databases - As per this theory, a database schema consists of data and patterns that are stored in a database. Therefore, data mining is the task of performing induction on databases.
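As an illustration of the data reduction idea listed above, the following sketch summarizes a list of values with an equal-width histogram. Only the bin boundaries and counts are kept, which trades accuracy for a much smaller representation. The function name, the number of bins, and the sample values are assumptions.

```python
def equal_width_histogram(values, bins):
    """Summarize values as (lowest value, bin width, list of per-bin counts)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:  # all values identical: everything lands in the first bin
            index = 0
        else:
            # The maximum value would fall just past the last bin, so clamp it.
            index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return low, width, counts


values = [1, 1, 2, 3, 5, 8, 8, 9, 10, 10]
low, width, counts = equal_width_histogram(values, bins=3)
print(low, width, counts)  # 1 3.0 [4, 1, 5]
```

A query such as "roughly how many values lie between 1 and 4?" can now be answered from the three counts alone, without scanning the original data.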
Statistical Data Mining
Apart from the database-oriented techniques, there are statistical techniques available for data analysis. These techniques can be applied to scientific data and to data from the economic and social sciences as well. Some of the statistical data mining techniques are as follows:
• Regression - Regression methods are used to predict the value of a response variable from one or more predictor variables, where the variables are numeric. Listed below are the forms of regression:
  o Linear
  o Multiple
  o Weighted
  o Polynomial
  o Nonparametric
  o Robust
• Generalized Linear Model - Generalized linear models include:
  o Logistic Regression
  o Poisson Regression
  The model's generalization allows a categorical response variable to be related to a set of predictor variables in a manner similar to the modelling of a numeric response variable using linear regression.
• Analysis of Variance - This technique analyzes experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors).
• Mixed-effect Models - These models are used for analyzing grouped data. They describe the relationship between a response variable and some covariates in data grouped according to one or more factors.
• Discriminant Analysis - Discriminant analysis is used to predict a categorical response variable. This method assumes that the independent variables follow a multivariate normal distribution.
• Time Series Analysis - Following are the methods for analyzing time-series data:
  o Auto-regression methods
  o Univariate ARIMA (AutoRegressive Integrated Moving Average) modeling
  o Long-memory time-series modeling
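Two of the techniques in the list above, linear regression and auto-regression, can be sketched with the same least-squares fit. The code below fits a straight line to numeric data and then reuses the fit to regress each value of a series on the previous one, which is a first-order auto-regression. The function name and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Linear regression: one numeric predictor, one numeric response.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0

# First-order auto-regression: predict each value from the one before it.
series = [1.0, 2.0, 4.0, 8.0, 16.0]
ar_slope, ar_intercept = fit_line(series[:-1], series[1:])
print(ar_slope, ar_intercept)  # 2.0 0.0
```

The other forms of regression in the list (multiple, weighted, polynomial, and so on) extend this idea with more predictors or a different fitting criterion.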
Visual Data Mining
Visual data mining uses data and/or knowledge visualization techniques to discover implicit knowledge from large data sets. Visual data mining can be viewed as an integration of the following disciplines:
• Data Visualization
• Data Mining
Visual data mining is closely related to the following:
• Computer Graphics
• Multimedia Systems
• Human Computer Interaction
• Pattern Recognition
• High-performance Computing
Generally, data visualization and data mining can be integrated in the following ways:
• Data Visualization - The data in a database or a data warehouse can be viewed in several visual forms that are listed below:
  o Boxplots
  o Data distribution charts
  o Curves