Effective Pattern Discovery for Text Mining
ABSTRACT
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
Contact: 040-40274843, 8143615322
Email id: [email protected], www.logicsystems.org.in
SYSTEM ARCHITECTURE:
EXISTING SYSTEM:
* The existing system uses a term-based approach to extract information from text.
* Term-based ontology methods provide some text representations.
* For example, hierarchical clustering is used to determine synonymy and hyponymy relations between keywords.
* A pattern evolution technique is used to improve the performance of the term-based approach.
DISADVANTAGES OF EXISTING SYSTEM:
* The term-based approach suffers from the problems of polysemy and synonymy.
* A term with a higher (tf*idf) value could be meaningless in some d-patterns (some important parts in documents).
PROPOSED SYSTEM:
* An effective pattern discovery technique is proposed.
* It evaluates the specificities of patterns and then evaluates term weights according to the distribution of terms in the discovered patterns.
* It solves the misinterpretation problem.
* It considers the influence of patterns from the negative training examples to find ambiguous (noisy) patterns and tries to reduce their influence for the low-frequency problem.
* The process of updating ambiguous patterns is referred to as pattern evolution.
* The proposed approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents.
* In general, there are two phases: training and testing.
* In the training phase, the d-patterns in positive documents (D+) are found based on a minimum support (min_sup), and term supports are evaluated by deploying d-patterns to terms.
* In the testing phase, term supports are revised using noise negative documents in D- based on an experimental coefficient.
* The incoming documents can then be sorted based on these weights.
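The final sorting step can be illustrated with a minimal Python sketch. It assumes that a document's weight is the sum of the supports of the distinct terms it contains; the function names and the `supports` dictionary are hypothetical and are not taken from the project.

```python
def document_weight(terms, supports):
    """Sum the supports of the distinct terms in a document (assumed scoring rule)."""
    return sum(supports.get(term, 0.0) for term in set(terms))


def rank_documents(documents, supports):
    """Return document names sorted from highest to lowest weight."""
    return sorted(
        documents,
        key=lambda name: document_weight(documents[name], supports),
        reverse=True,
    )


# Hypothetical term supports learned in the training phase.
supports = {"text": 0.6, "mining": 0.3}
incoming = {
    "a": ["sports", "news"],
    "b": ["text", "mining"],
    "c": ["text"],
}
ranking = rank_documents(incoming, supports)
```

Terms that never appeared in a discovered pattern contribute nothing to the weight, so documents unrelated to the topic sink to the bottom of the ranking.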
ADVANTAGES OF PROPOSED SYSTEM:
* The proposed approach improves the accuracy of evaluating term weights, because the discovered patterns are more specific than whole documents.
* Using the pattern-based approach avoids the issues of the phrase-based approach.
* Pattern mining techniques can be used to find various text patterns.
LIST OF MODULES:
1. Loading document
2. Text Preprocessing
3. Pattern taxonomy process
4. Pattern deploying
5. Pattern evolving
MODULES DESCRIPTION:
1. Loading document
In this module, the list of all documents is loaded.
The user retrieves one of the documents.
This document is passed to the next process, which is preprocessing.
2. Text Preprocessing
The retrieved document is preprocessed in this module.
Two types of processing are performed:
1) stop word removal 2) text stemming
Stop words are words that are filtered out prior to, or after, the processing of natural language data.
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.
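The two preprocessing steps can be sketched in Python as follows. The stop-word list and the suffix-stripping rule are illustrative assumptions; in particular, `naive_stem` is a toy stemmer and not the Porter algorithm or whatever stemmer the project actually uses.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}

# Suffixes tried in order by the toy stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem the remaining tokens."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


terms = preprocess("The patterns are mined in documents")
```

Here stop words are removed before stemming, so the stop-word list can hold plain surface forms rather than stems.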
3. Pattern taxonomy process
In this module, the documents are split into paragraphs.
Each paragraph is treated as an individual document.
From each such document, the set of terms is extracted.
The terms are extracted from the set of positive documents.
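As a rough Python sketch of this step, the code below splits a document into paragraphs, turns each paragraph into a term set, and keeps the term sets that occur in at least `min_sup` paragraphs. The blank-line paragraph delimiter, the absolute support count, and the brute-force enumeration are all assumptions made for illustration; they are suitable only for very small vocabularies.

```python
from itertools import combinations


def split_paragraphs(document):
    """Treat each blank-line-separated block as one paragraph."""
    return [p for p in document.split("\n\n") if p.strip()]


def term_set(paragraph):
    """Represent a paragraph as the set of its lowercase terms."""
    return frozenset(paragraph.lower().split())


def frequent_patterns(paragraphs, min_sup):
    """Map each term set found in >= min_sup paragraphs to its support."""
    term_sets = [term_set(p) for p in paragraphs]
    vocabulary = sorted(set().union(*term_sets)) if term_sets else []
    result = {}
    for size in range(1, len(vocabulary) + 1):
        found = False
        for candidate in combinations(vocabulary, size):
            support = sum(1 for ts in term_sets if ts.issuperset(candidate))
            if support >= min_sup:
                result[frozenset(candidate)] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return result


doc = "mining text patterns\n\ntext mining\n\npatterns deploy"
patterns = frequent_patterns(split_paragraphs(doc), min_sup=2)
```

With `min_sup=2`, the pair {mining, text} is kept because it appears in two of the three paragraphs, while any set containing "deploy" is discarded.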
4. Pattern deploying
The discovered patterns are summarized.
The d-pattern algorithm is used to discover all patterns in positive documents and compose them.
Term supports are calculated for all terms in the d-patterns.
The term support is the evaluated weight of a term.
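One way to read the deploying step is sketched below in Python. It assumes that a d-pattern records, for each term, how many of a document's discovered patterns contain it, and that a term's support is the sum of its normalized counts over all positive documents. This is a simplified interpretation with hypothetical function names, not a confirmed reproduction of the project's algorithm.

```python
from collections import Counter, defaultdict


def d_pattern(patterns):
    """Compose one document's patterns into term -> number of patterns containing it."""
    counts = Counter()
    for pattern in patterns:
        counts.update(pattern)  # each pattern is a set of terms
    return counts


def term_supports(patterns_per_document):
    """Sum normalized d-pattern counts over all positive documents."""
    supports = defaultdict(float)
    for patterns in patterns_per_document:
        counts = d_pattern(patterns)
        total = sum(counts.values())
        if total == 0:
            continue  # document with no discovered patterns
        for term, count in counts.items():
            supports[term] += count / total
    return dict(supports)


# Hypothetical patterns discovered in two positive documents.
positive = [
    [{"text", "mining"}, {"text"}],
    [{"pattern"}],
]
supports = term_supports(positive)
```

Normalizing within each document keeps a long document with many patterns from dominating the term weights.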
5. Pattern evolving
This module identifies the noisy patterns in documents.
Sometimes the system falsely identifies a negative document as positive, so noise occurs in the positive documents.
A noisy pattern is called an offender.
If the positive documents contain a partial conflict offender, the reshuffle process is applied.
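The reshuffle idea can be sketched in Python under several assumptions that the text does not spell out: a d-pattern is a dictionary of normalized term supports, a pattern whose terms all occur in the noise negative document is dropped (a complete conflict), and a partial conflict shrinks the overlapping terms by a factor `1/mu` while handing the freed weight to the remaining terms in proportion to their current weight. The coefficient `mu` stands in for the experimental coefficient, and the function names are hypothetical.

```python
def reshuffle(pattern, overlap, mu):
    """Shrink overlapping terms by 1/mu and give the freed weight to the rest."""
    freed = sum(pattern[t] * (1 - 1 / mu) for t in overlap)
    rest_total = sum(w for t, w in pattern.items() if t not in overlap)
    revised = {}
    for term, weight in pattern.items():
        if term in overlap:
            revised[term] = weight / mu
        else:
            # Assumes supports are positive, so rest_total > 0 here.
            revised[term] = weight + freed * weight / rest_total
    return revised


def evolve(d_patterns, noise_terms, mu=2.0):
    """Revise normalized d-patterns against one noise negative document."""
    revised = []
    for pattern in d_patterns:
        overlap = {t for t in pattern if t in noise_terms}
        if not overlap:
            revised.append(dict(pattern))  # not an offender
        elif len(overlap) == len(pattern):
            continue  # complete conflict (assumed rule): drop the pattern
        else:
            revised.append(reshuffle(pattern, overlap, mu))  # partial conflict
    return revised


d_patterns = [
    {"text": 0.5, "mining": 0.5},
    {"noise": 1.0},
    {"pattern": 1.0},
]
result = evolve(d_patterns, noise_terms={"mining", "noise"}, mu=2.0)
```

Because the freed weight is redistributed rather than discarded, each surviving d-pattern keeps the same total weight before and after reshuffling.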
SYSTEM CONFIGURATION:
HARDWARE REQUIREMENTS:
Processor     - Pentium III
Speed         - 1.1 GHz
RAM           - 256 MB (min)
Hard Disk     - 20 GB
Floppy Drive  - 1.44 MB
Keyboard      - Standard Windows Keyboard
Mouse         - Two or Three Button Mouse
Monitor       - SVGA
SOFTWARE REQUIREMENTS:
Operating System      : Windows 95/98/2000/XP
Programming Language  : Java
Tool                  : NetBeans IDE