An example machine learning notebook
Notebook by Randal S. Olson
Supported by Jason H. Moore
University of Pennsylvania Institute for Bioinformatics
It is recommended to view this notebook in nbviewer for the best viewing experience.
Table of contents
1. Introduction
2. License
3. Required libraries
4. The problem domain
5. Step 1: Answering the question
6. Step 2: Checking the data
7. Step 3: Tidying the data
   o Bonus: Testing our data
8. Step 4: Exploratory analysis
9. Step 5: Classification
   o Cross-validation
   o Parameter tuning
10. Step 6: Reproducibility
11. Conclusions
12. Further reading
13. Acknowledgements
License
Please see the repository README file for the licenses and usage terms for the instructional material and code in this notebook. In general, I have licensed this material so that it is as widely usable and shareable as possible.
Required libraries
If you don't have Python on your computer, you can use the Anaconda Python distribution to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.
This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:
• NumPy: Provides a fast numerical array structure and helper functions.
• pandas: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
• scikit-learn: The essential Machine Learning package in Python.
• matplotlib: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
• Seaborn: Advanced statistical plotting library.
To make sure you have all of the packages you need, install them with conda:
conda install numpy pandas scikit-learn matplotlib seaborn
conda may ask you to update some of them if you don't have the most recent version. Allow it to do so.
Note: I will not be providing support for people trying to run this notebook outside of the Anaconda Python distribution.
The problem domain
For the purposes of this exercise, let's pretend we're working for a startup that just got funded to create a smartphone app that automatically identifies species of flowers from pictures taken on the smartphone. We're working with a moderately-sized team of data scientists and will be building part of the data analysis pipeline for this app.
We've been tasked by our head of data science to create a demo machine learning model that takes four measurements from the flowers (sepal length, sepal width, petal length, and petal width) and identifies the species based on those measurements alone.
We've been given a data set from our field researchers to develop the demo, which only includes measurements for three types of Iris flowers:
• Iris setosa
• Iris versicolor
• Iris virginica
The four measurements we're using currently come from hand-measurements by the field researchers, but they will be automatically measured by an image processing model in the future.
Note: The data set we're working with is the famous Iris data set (included with this notebook), which I have modified slightly for demonstration purposes.
Step 1: Answering the question
The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task. The data analysis checklist has us answer a handful of questions to accomplish that, so let's work through those questions.

Did you specify the type of data analytic question (e.g., exploration, association, causality) before touching the data?
We're trying to classify the species (i.e., class) of the flower based on four measurements that we're provided: sepal length, sepal width, petal length, and petal width.

Did you define the metric for success before beginning?
Let's do that now. Since we're performing classification, we can use accuracy (the fraction of correctly classified flowers) to quantify how well our model is performing. Our head of data has told us that we should achieve at least 90% accuracy.

Did you understand the context for the question and the scientific or business application?
We're building part of a data analysis pipeline for a smartphone app that will be able to classify the species of flowers from pictures taken on the smartphone. In the future, this pipeline will be connected to another pipeline that automatically measures from pictures the traits we're using to perform this classification.

Did you record the experimental design?
Our head of data has told us that the field researchers are hand-measuring 50 randomly-sampled flowers of each species using a standardized methodology. The field researchers take pictures of each flower they sample from pre-defined angles so the measurements and species can be confirmed by the other field researchers at a later point. At the end of each day, the data is compiled and stored on a private company GitHub repository.

Did you consider whether the question could be answered with the available data?
The data set we currently have is only for three types of Iris flowers. The model built off of this data set will only work for those Iris flowers, so we will need more data to create a general flower classifier.

Notice that we've spent a fair amount of time working on the problem without writing a line of code or even looking at the data.
Thinking about and documenting the problem we're working on is an important step to performing effective data analysis that often goes overlooked. Don't skip it.
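Since accuracy is the success metric, it helps to pin down exactly what will be computed before any model exists. The following is a minimal sketch and not part of the original analysis; the function name accuracy, the constant REQUIRED_ACCURACY, and the toy labels are assumptions made for illustration.

```python
# Minimal sketch of the success metric; all names and labels are illustrative.
REQUIRED_ACCURACY = 0.90  # target quoted by the head of data


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy example: 9 of 10 flowers are classified correctly.
truth = ["setosa"] * 4 + ["versicolor"] * 3 + ["virginica"] * 3
guess = ["setosa"] * 4 + ["versicolor"] * 3 + ["virginica"] * 2 + ["versicolor"]

score = accuracy(truth, guess)
meets_target = score >= REQUIRED_ACCURACY
```

Writing the metric down this early makes the later question "is the model good enough?" a simple comparison against a number agreed on in advance.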
Step 2: Checking the data
The next step is to look at the data we're working with. Even curated data sets from the government can have errors in them, and it's vital that we spot these errors before investing too much time in our analysis.
Generally, we're looking to answer the following questions:
• Is there anything wrong with the data?
• Are there any quirks with the data?
• Do I need to fix or remove any of the data?
Let's start by reading the data into a pandas DataFrame.
In [1]:
import pandas as pd

iris_data = pd.read_csv('iris-data.csv')
iris_data.head()
Out[1]:
   sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
0              5.1             3.5              1.4             0.2  Iris-setosa
1              4.9             3.0              1.4             0.2  Iris-setosa
2              4.7             3.2              1.3             0.2  Iris-setosa
3              4.6             3.1              1.5             0.2  Iris-setosa
4              5.0             3.6              1.4             0.2  Iris-setosa
We're in luck! The data seems to be in a usable format.
The first row in the data file defines the column headers, and the headers are descriptive enough for us to understand what each column represents. The headers even give us the units that the measurements were recorded in, just in case we needed to know at a later point in the project.
Each row following the first row represents an entry for a flower: four measurements and one class, which tells us the species of the flower.
One of the first things we should look for is missing data. Thankfully, the field researchers already told us that they put a 'NA' into the spreadsheet when they were missing a measurement.
We can tell pandas to automatically identify missing values if it knows our missing value marker.
In [2]:
iris_data = pd.read_csv('iris-data.csv', na_values=['NA'])
Voilà! Now pandas knows to treat rows with 'NA' as missing values.
Next, it's always a good idea to look at the distribution of our data, especially the outliers.
Let's start by printing out some summary statistics about the data set.
In [3]:
iris_data.describe()
Out[3]:
       sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm
count       150.000000      150.000000       150.000000      145.000000
mean          5.644627        3.054667         3.758667        1.236552
std           1.312781        0.433123         1.764420        0.755058
min           0.055000        2.000000         1.000000        0.100000
25%           5.100000        2.800000         1.600000        0.400000
50%           5.700000        3.000000         4.350000        1.300000
75%           6.400000        3.300000         5.100000        1.800000
max           7.900000        4.400000         6.900000        2.500000
We can see several useful values from this table. For example, we see that five petal_width_cm entries are missing.
If you ask me, though, tables like this are rarely useful unless we know that our data should fall in a particular range. It's usually better to visualize the data in some way. Visualization makes outliers and errors immediately stand out, whereas they might go unnoticed in a large table of numbers.
Since we know we're going to be plotting in this section, let's set up the notebook so we can plot inside of it.
In [4]:
# This line tells the notebook to show plots inside of the notebook
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sb
Next, let's create a scatterplot matrix. Scatterplot matrices plot the distribution of each column along the diagonal, and a scatterplot for each pair of variables everywhere else. They make for an efficient tool to look for errors in our data.
We can even have the plotting package color each entry by its class to look for trends within the classes.
In [5]:
# We have to temporarily drop the rows with 'NA' values
# because the Seaborn plotting function does not know
# what to do with them
sb.pairplot(iris_data.dropna(), hue='class')
Out[5]: <seaborn.axisgrid.PairGrid at 0x...>
From the scatterplot matrix, we can already see some issues with the data set:
1. There are five classes when there should only be three, meaning there were some coding errors.
2. There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.
3. We had to drop those rows with missing values.
In all of these cases, we need to figure out what to do with the erroneous data. Which takes us to the next step...
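The first issue, too many classes, can also be caught without a plot by tallying the class labels. The following is a small sketch using only the standard library; the label list is made up to mimic the two coding errors, and EXPECTED_CLASSES is an assumed name.

```python
from collections import Counter

# Made-up class labels that mimic the two coding errors seen in the plot.
labels = (["Iris-setosa"] * 3 + ["Iris-setossa"] +
          ["Iris-versicolor"] * 2 + ["versicolor"] * 2 +
          ["Iris-virginica"] * 3)

EXPECTED_CLASSES = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"}

# Tally every distinct label and pick out the ones we did not expect.
label_counts = Counter(labels)
unexpected = sorted(set(label_counts) - EXPECTED_CLASSES)

for label, count in sorted(label_counts.items()):
    flag = "  <-- unexpected" if label not in EXPECTED_CLASSES else ""
    print(f"{label}: {count}{flag}")
```

On a DataFrame, iris_data['class'].value_counts() produces the same kind of tally directly.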
Step 3: Tidying the data
Now that we've identified several errors in the data set, we need to fix them before we proceed with the analysis.
Let's walk through the issues one-by-one.

There are five classes when there should only be three, meaning there were some coding errors.
After talking with the field researchers, it sounds like one of them forgot to add Iris- before their Iris-versicolor entries. The other extraneous class, Iris-setossa, was simply a typo that they forgot to fix.
Let's use the DataFrame to fix these errors.
In [6]:
iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'

iris_data['class'].unique()
Out[6]: array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
Much better! Now we only have three class types. Imagine how embarrassing it would've been to create a model that used the wrong classes.

There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.
Fixing outliers can be tricky business. It's rarely clear whether the outlier was caused by measurement error, recording the data in improper units, or if the outlier is a real anomaly. For that reason, we should be judicious when working with outliers: if we decide to exclude any data, we need to make sure to document what data we excluded and provide solid reasoning for excluding that data. (i.e., "This data didn't fit my hypothesis" will not stand peer review.)
In the case of the one anomalous entry for Iris-setosa, let's say our field researchers know that it's impossible for Iris-setosa to have a sepal width below 2.5 cm. Clearly this entry was made in error, and we're better off just scrapping the entry than spending hours finding out what happened.
In [7]:
# This line drops any 'Iris-setosa' rows with a sepal width less than 2.5 cm
iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x...>
Excellent! Now all of our Iris-setosa rows have a sepal width of at least 2.5 cm.
The next data issue to address is the several near-zero sepal lengths for the Iris-versicolor rows. Let's take a look at those rows.
In [8]:
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
              (iris_data['sepal_length_cm'] < 1.0)]
Out[8]:
    sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm            class
77            0.067             3.0              5.0             1.7  Iris-versicolor
78            0.060             2.9              4.5             1.5  Iris-versicolor
79            0.057             2.6              3.5             1.0  Iris-versicolor
80            0.055             2.4              3.8             1.1  Iris-versicolor
81            0.055             2.4              3.7             1.0  Iris-versicolor
How about that? All of these near-zero sepal_length_cm entries seem to be off by two orders of magnitude, as if they had been recorded in meters instead of centimeters.
After some brief correspondence with the field researchers, we find that one of them forgot to convert those measurements to centimeters. Let's do that for them.
In [9]:
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
              (iris_data['sepal_length_cm'] < 1.0),
              'sepal_length_cm'] *= 100.0

iris_data.loc[iris_data['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist()
Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x...>
Phew! Good thing we fixed those outliers. They could've really thrown our analysis off.

We had to drop those rows with missing values.
Let's take a look at the rows with missing values:
In [10]:
iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
              (iris_data['sepal_width_cm'].isnull()) |
              (iris_data['petal_length_cm'].isnull()) |
              (iris_data['petal_width_cm'].isnull())]
Out[10]:
    sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
7               5.0             3.4              1.5             NaN  Iris-setosa
8               4.4             2.9              1.4             NaN  Iris-setosa
9               4.9             3.1              1.5             NaN  Iris-setosa
10              5.4             3.7              1.5             NaN  Iris-setosa
11              4.8             3.4              1.6             NaN  Iris-setosa
It's not ideal that we had to drop those rows, especially considering they're all Iris-setosa entries. Since it seems like the missing data is systematic (all of the missing values are in the same column for the same Iris type), this error could potentially bias our analysis.
One way to deal with missing data is mean imputation: If we know that the values for a measurement fall in a certain range, we can fill in empty values with the average of that measurement.
Let's see if we can do that here.
In [11]:
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].hist()
Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x...>
Most of the petal widths for Iris-setosa fall within the 0.2-0.3 range, so let's fill in these entries with the average measured petal width.
In [12]:
average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
              (iris_data['petal_width_cm'].isnull()),
              'petal_width_cm'] = average_petal_width

iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
              (iris_data['petal_width_cm'] == average_petal_width)]
Out[12]:
    sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
7               5.0             3.4              1.5            0.25  Iris-setosa
8               4.4             2.9              1.4            0.25  Iris-setosa
9               4.9             3.1              1.5            0.25  Iris-setosa
10              5.4             3.7              1.5            0.25  Iris-setosa
11              4.8             3.4              1.6            0.25  Iris-setosa
In [13]:
iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
              (iris_data['sepal_width_cm'].isnull()) |
              (iris_data['petal_length_cm'].isnull()) |
              (iris_data['petal_width_cm'].isnull())]
Out[13]:
sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm  class
Great! Now we've recovered those rows and no longer have missing data in our data set.
Note: If you don't feel comfortable imputing your data, you can drop all rows with missing data with the dropna() call:
iris_data.dropna(inplace=True)
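When more than one class has gaps, the same mean imputation can be applied per class in a single step. The following is a sketch on a made-up miniature table rather than the notebook's data; the DataFrame toy and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature data set; the values are for illustration only.
toy = pd.DataFrame({
    "class": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
    "petal_width_cm": [0.2, 0.4, np.nan, 2.0, np.nan, 2.4],
})

# Mean petal width of each row's own class (mean() skips NaN values).
class_means = toy.groupby("class")["petal_width_cm"].transform("mean")

# Fill each missing value with the mean of its class.
toy["petal_width_cm"] = toy["petal_width_cm"].fillna(class_means)
```

Imputing within each class avoids pulling a missing Iris-setosa value toward the much larger petal widths of the other species.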
After all this hard work, we don't want to repeat this process every time we work with the data set. Let's save the tidied data as a separate file and work directly with that data file from now on.
In [14]:
iris_data.to_csv('iris-data-clean.csv', index=False)

iris_data_clean = pd.read_csv('iris-data-clean.csv')
Now, let's take a look at the scatterplot matrix now that we've tidied the data.
In [15]:
sb.pairplot(iris_data_clean, hue='class')
Out[15]: <seaborn.axisgrid.PairGrid at 0x...>
Of course, I purposely inserted numerous errors into this data set to demonstrate some of the many possible scenarios you may face while tidying your data.
The general takeaways here should be:
• Make sure your data is encoded properly
• Make sure your data falls within the expected range, and use domain knowledge whenever possible to define that expected range
• Deal with missing data in one way or another: replace it if you can or drop it
• Never tidy your data manually because that is not easily reproducible
• Use code as a record of how you tidied your data
• Plot everything you can about the data at this stage of the analysis so you can visually confirm everything looks correct
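One way to act on the "use code as a record" takeaway is to collect the tidying steps from this section into a single function that can be re-run on fresh data. This is a sketch under assumptions: the function name tidy_iris and the made-up rows are illustrative, and the thresholds are the ones used in the cells above.

```python
import numpy as np
import pandas as pd


def tidy_iris(raw):
    """Apply the tidying steps from this section to a copy of `raw`."""
    data = raw.copy()

    # 1. Fix the mislabeled classes.
    data["class"] = data["class"].replace({
        "versicolor": "Iris-versicolor",
        "Iris-setossa": "Iris-setosa",
    })

    # 2. Drop Iris-setosa rows with an impossible sepal width (< 2.5 cm).
    keep = (data["class"] != "Iris-setosa") | (data["sepal_width_cm"] >= 2.5)
    data = data.loc[keep].copy()

    # 3. Convert Iris-versicolor sepal lengths recorded in meters to centimeters.
    in_meters = (data["class"] == "Iris-versicolor") & (data["sepal_length_cm"] < 1.0)
    data.loc[in_meters, "sepal_length_cm"] *= 100.0

    # 4. Fill missing Iris-setosa petal widths with the class mean.
    setosa = data["class"] == "Iris-setosa"
    mean_width = data.loc[setosa, "petal_width_cm"].mean()
    missing = setosa & data["petal_width_cm"].isnull()
    data.loc[missing, "petal_width_cm"] = mean_width

    return data


# Made-up rows that exercise each step.
raw = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 4.4, 0.055, 6.3],
    "sepal_width_cm": [3.5, 3.0, 2.0, 2.4, 3.3],
    "petal_length_cm": [1.4, 1.4, 1.3, 3.8, 6.0],
    "petal_width_cm": [0.2, np.nan, 0.2, 1.1, 2.5],
    "class": ["Iris-setossa", "Iris-setosa", "Iris-setosa",
              "versicolor", "Iris-virginica"],
})
clean = tidy_iris(raw)
```

Because the function works on a copy, the raw data stays untouched and the whole cleanup can be reviewed, versioned, and repeated.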
Bonus: Testing our data
At SciPy 2015, I was exposed to a great idea: We should test our data. Just as we use unit tests to verify our expectations from code, we can similarly set up unit tests to verify our expectations about a data set.
We can quickly test our data using assert statements: We assert that something must be true, and if it is, then nothing happens and the notebook continues running. However, if our assertion is wrong, then the notebook stops running and brings it to our attention. For example:
In [16]:
assert 1 == 2
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-...> in <module>()
----> 1 assert 1 == 2

AssertionError:
Let's test a few things that we know about our data set now.
In [18]:
# We know that we should only have three classes
assert len(iris_data_clean['class'].unique()) == 3
In [19]:
# We know that sepal lengths for 'Iris-versicolor' should never be below 2.5 cm
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5
In [20]:
# We know that our data set should have no missing measurements
assert len(iris_data_clean.loc[(iris_data_clean['sepal_length_cm'].isnull()) |
                               (iris_data_clean['sepal_width_cm'].isnull()) |
                               (iris_data_clean['petal_length_cm'].isnull()) |
                               (iris_data_clean['petal_width_cm'].isnull())]) == 0
And so on. If any of these expectations are violated, then our analysis immediately stops and we have to return to the tidying stage.
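A bare assert stops at the first failure. As a variation, the same expectations can be gathered into one function that reports every violation at once. This is only a sketch: the function name check_iris, the constant MEASUREMENT_COLUMNS, and the two small tables are assumptions for illustration.

```python
import numpy as np
import pandas as pd

MEASUREMENT_COLUMNS = ["sepal_length_cm", "sepal_width_cm",
                       "petal_length_cm", "petal_width_cm"]


def check_iris(data):
    """Return a list of violated expectations (empty if the data looks fine)."""
    problems = []
    if data["class"].nunique() != 3:
        problems.append("expected exactly three classes")
    versicolor_lengths = data.loc[data["class"] == "Iris-versicolor", "sepal_length_cm"]
    if (versicolor_lengths < 2.5).any():
        problems.append("Iris-versicolor sepal length below 2.5 cm")
    if data[MEASUREMENT_COLUMNS].isnull().any().any():
        problems.append("missing measurements")
    return problems


# A made-up table that satisfies every expectation...
good = pd.DataFrame({
    "sepal_length_cm": [5.1, 5.5, 6.3],
    "sepal_width_cm": [3.5, 2.4, 3.3],
    "petal_length_cm": [1.4, 3.8, 6.0],
    "petal_width_cm": [0.2, 1.1, 2.5],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# ...and a copy with two deliberate errors.
bad = good.copy()
bad.loc[1, "sepal_length_cm"] = 0.055  # still recorded in meters
bad.loc[2, "petal_width_cm"] = np.nan  # missing measurement

good_problems = check_iris(good)
bad_problems = check_iris(bad)
```

A final assert such as `assert not check_iris(iris_data_clean)` would still halt the notebook, while the returned list says what went wrong.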
Step 4: Exploratory analysis
Now after spending entirely too much time tidying our data, we can start analyzing it!
Exploratory analysis is the step where we start delving deeper into the data set beyond the outliers and errors. We'll be looking to answer questions such as:
• How is my data distributed?
• Are there any correlations in my data?
• Are there any confounding factors that explain these correlations?
This is the stage where we plot all the data in as many ways as possible. Create many charts, but don't bother making them pretty; these charts are for internal use.
Let's return to that scatterplot matrix that we used earlier.
In [21]:
sb.pairplot(iris_data_clean)
Out[21]: <seaborn.axisgrid.PairGrid at 0x...>
Our data is normally distributed for the most part, which is great news if we plan on using any modeling methods that assume the data is normally distributed.
There's something strange going on with the petal measurements. Maybe it's something to do with the different Iris types. Let's color code the data by the class again to see if that clears things up.
In [22]:
sb.pairplot(iris_data_clean, hue='class')
Out[22]: <seaborn.axisgrid.PairGrid at 0x...>
Sure enough, the strange distribution of the petal measurements exists because of the different species. This is actually great news for our classification task since it means that the petal measurements will make it easy to distinguish between Iris-setosa and the other Iris types.
Distinguishing Iris-versicolor and Iris-virginica will prove more difficult given how much their measurements overlap.
There are also correlations between petal length and petal width, as well as sepal length and sepal width. The field biologists assure us that this is to be expected: Longer flower petals also tend to be wider, and the same applies for sepals.
We can also make violin plots of the data to compare the measurement distributions of the classes. Violin plots contain the same information as box plots, but also scale the box according to the density of the data.
In [23]:
plt.figure(figsize=(10, 10))

for column_index, column in enumerate(iris_data_clean.columns):
    if column == 'class':
        continue
    plt.subplot(2, 2, column_index + 1)
    sb.violinplot(x='class', y=column, data=iris_data_clean)
Enough flirting with the data. Let's get to modeling.
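The correlations noted during exploration can be quantified as well as eyeballed. The following sketch computes the Pearson correlation coefficient by hand on a few made-up petal measurements; the function name pearson_r and the sample values are assumptions for illustration, not figures from the data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up petal measurements (cm): longer petals are also wider.
petal_length = [1.4, 1.3, 4.5, 4.7, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.5, 1.4, 2.5, 1.8]

r = pearson_r(petal_length, petal_width)
```

A value of r close to 1 indicates a strong positive linear relationship. On the numeric columns of a DataFrame, pandas' DataFrame.corr() computes the same coefficient for every pair of columns.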
Step 5: Classification
Wow, all this work and we still haven't modeled the data!
As tiresome as it can be, tidying and exploring our data is a vital component to any data analysis. If we had jumped straight to the modeling step, we would have created a faulty classification model.
Remember: Bad data leads to bad models. Always check your data first.
Assured that our data is now as clean as we can make it, and armed with some cursory knowledge of the distributions and relationships in our data set, it's time to make the next big step in our analysis: Splitting the data into training and testing sets.
A training set is a random subset of the data that we use to train our models.
A testing set is a random subset of the data (mutually exclusive from the training set) that we use to validate our models on previously unseen data.
Especially in sparse data sets like ours, it's easy for models to overfit the data: The model will learn the training set so well that it won't be able to handle most of the cases it's never seen before. This is why it's important for us to build the model with the training set, but score it with the testing set.
Note that once we split the data into a training and testing set, we should treat the testing set like it no longer exists: We cannot use any information from the testing set to build our model or else we're cheating.
Let's set up our data first.
In [24]:
iris_data_clean = pd.read_csv('iris-data-clean.csv')

# We're using all four measurements as inputs
# Note that scikit-learn expects each entry to be a list of values, e.g.,
# [ [val1, val2, val3],
#   [val1, val2, val3],
#   ... ]
# such that our input data set is represented as a list of lists

# We can extract the data in this format from pandas like this:
all_inputs = iris_data_clean[['sepal_length_cm', 'sepal_width_cm',
                              'petal_length_cm', 'petal_width_cm']].values

# Similarly, we can extract the classes
all_classes = iris_data_clean['class'].values

# Make sure that you don't mix up the order of the entries
# all_inputs[5] inputs should correspond to the class in all_classes[5]

# Here's what a subset of our inputs looks like:
all_inputs[:5]
Out[24]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])
Now our data is ready to be split.
In [25]:
# train_test_split lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import train_test_split

(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75, random_state=1)
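To make the idea of a random, mutually exclusive split concrete, here is a standard-library sketch of what such a split does to the row indices. It is not scikit-learn's implementation; the function name split_indices and its parameters are assumptions for illustration.

```python
import random


def split_indices(n_samples, train_fraction=0.75, seed=1):
    """Shuffle row indices and partition them into disjoint train/test lists."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(20)
```

Every row lands in exactly one of the two lists, which is what allows the testing set to stand in for data the model has never seen.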
With our data split, we can start fitting models to our data. Our head of data is all about decision tree classifiers, so let's start with one of those.
Decision tree classifiers are incredibly simple in theory. In their simplest form, decision tree classifiers ask a series of Yes/No questions about the data, each time getting closer to finding out the class of each entry, until they either classify the data set perfectly or simply can't differentiate a set of entries. Think of it like a game of Twenty Questions, except the computer is much, much better at it.
Here's an example decision tree classifier:
[Figure: example decision tree classifier]
Notice how the classifier asks Yes/No questions about the data (whether a certain feature is <= 1.75, for example) so it can differentiate the records. This is the essence of every decision tree.
The nice part about decision tree classifiers is that they are scale-invariant, i.e., the scale of the features does not affect their performance, unlike many Machine Learning models. In other words, it doesn't matter if our features range from 0 to 1 or 0 to 1,000; decision tree classifiers will work with them just the same.
There are several parameters that we can tune for decision tree classifiers, but for now let's use a basic decision tree classifier.
In [26]:
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training set
decision_tree_classifier.fit(training_inputs, training_classes)

# Validate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(testing_inputs, testing_classes)
Out[26]: 0.97368421052631582
Heck yeah! Our model achieves 97% classification accuracy without much effort.
However, there's a catch: Depending on how our training and testing set was sampled, our model can achieve anywhere from 80% to 100% accuracy:
In [27]:
model_accuracies = []

for repetition in range(1000):
    (training_inputs,
     testing_inputs,
     training_classes,
     testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75)

    decision_tree_classifier = DecisionTreeClassifier()
    decision_tree_classifier.fit(training_inputs, training_classes)
    classifier_accuracy = decision_tree_classifier.score(testing_inputs, testing_classes)
    model_accuracies.append(classifier_accuracy)

sb.distplot(model_accuracies)
Out[27]: <matplotlib.axes._subplots.AxesSubplot at 0x...>
It's obviously a problem that our model performs quite differently depending on the subset of the data it's trained on. This phenomenon is known as overfitting: The model is learning to classify the training set so well that it doesn't generalize and perform well on data it hasn't seen before.
-ross3!alidation¶ [g ob ac kt ot h et o p] Th i spr o bl e mi st h ema i nr e as ont ha tmo std at as c i e nt i s t spe r f o r mkf ol dcr o ssv al i dat i on ont hei rmodel s :Spl i tt heor i gi nal dat as eti nt ok s ubs et s ,us eoneoft hes ub se t sast h et es t i ngs et ,a ndt her e stoft hes ub se t sar eu se da st h et r ai n i ngs et . Th i spr o ce ssi st h enr e p ea t e dk t i me ss u cht h ate ac hs ub s eti sus eda st h et e st i n gs e te x ac t l yon ce . 1 0f ol dc r o ss v a l i dat i oni st hemos tc ommonc hoi c e,s ol e t ' sus et hather e.Per f or mi n g10 f ol dc r os s v a l i d at i o no nourd at a s etl ook ss omet hi ngl i k et hi s : ( ea chs qu ar ei sane nt r yi no urd at as e t )
In [28]:
import numpy as np
from sklearn.cross_validation import StratifiedKFold

def plot_cv(cv, n_samples):
    # Build one boolean row per fold marking which entries are in its testing set
    masks = []
    for train, test in cv:
        mask = np.zeros(n_samples, dtype=bool)
        mask[test] = 1
        masks.append(mask)

    plt.figure(figsize=(15, 15))
    plt.imshow(masks, interpolation='none')
    plt.ylabel('Fold')
    plt.xlabel('Row #')

plot_cv(StratifiedKFold(all_classes, n_folds=10), len(all_classes))
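The splitting step itself is simple enough to write by hand. The sketch below shows plain (unstratified) k-fold index generation with no libraries. `k_fold_indices` is a hypothetical helper for illustration, not a scikit-learn function, and the 25-row data set size is made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if fold < n_samples % k else 0)
                  for fold in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
for fold_number, (train, test) in enumerate(folds):
    print('Fold {}: {} training rows, testing rows {}'.format(fold_number, len(train), test))
```

Every row lands in exactly one testing set, which is the property that makes the k scores a fair summary of the whole data set.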
You'll notice that we used Stratified k-fold cross-validation in the code above. Stratified k-fold keeps the class proportions the same across all of the folds, which is vital for maintaining a representative subset of our data set (e.g., so we don't have 100% Iris setosa entries in one of the folds).

We can perform 10-fold cross-validation on our model with the following code:

In [29]:
from sklearn.cross_validation import cross_val_score

decision_tree_classifier = DecisionTreeClassifier()

# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
sb.distplot(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))

Out[29]: <matplotlib.text.Text at 0x…>
Now we have a much more consistent rating of our classifier's general classification accuracy.
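The stratification mentioned above can be sketched in a few lines: group the entries by class, then deal each class out to the folds round-robin. This is an illustration of the idea only, not scikit-learn's implementation. `stratified_fold_assignment` is a hypothetical helper, and the perfectly balanced 50-per-class labels are an assumption for the example.

```python
from collections import Counter, defaultdict

def stratified_fold_assignment(classes, k):
    """Return a fold number (0..k-1) for every entry, dealing each class out round-robin."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(classes):
        indices_by_class[label].append(index)

    fold_of = [None] * len(classes)
    for label, indices in indices_by_class.items():
        for position, index in enumerate(indices):
            fold_of[index] = position % k
    return fold_of

# Hypothetical, perfectly balanced labels: 50 entries per class
classes = ['Iris-setosa'] * 50 + ['Iris-versicolor'] * 50 + ['Iris-virginica'] * 50
fold_of = stratified_fold_assignment(classes, k=10)

for fold in range(10):
    counts = Counter(label for label, f in zip(classes, fold_of) if f == fold)
    print('Fold {}: {}'.format(fold, dict(counts)))
```

Each fold ends up with the same class mix as the full data set, so no single fold can be dominated by one species.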
Parameter tuning

[go back to the top]

Every Machine Learning model comes with a variety of parameters to tune, and these parameters can be vitally important to the performance of our classifier. For example, if we severely limit the depth of our decision tree classifier:

In [30]:
decision_tree_classifier = DecisionTreeClassifier(max_depth=1)

cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
sb.distplot(cv_scores, kde=False)
plt.title('Average score: {}'.format(np.mean(cv_scores)))

Out[30]: <matplotlib.text.Text at 0x…>
the classification accuracy falls tremendously.

Therefore, we need to find a systematic method to discover the best parameters for our model and data set.

The most common method for model parameter tuning is Grid Search. The idea behind Grid Search is simple: explore a range of parameters and find the best-performing parameter combination. Focus your search on the best range of parameters, then repeat this process several times until the best parameters are discovered.

Let's tune our decision tree classifier. We'll stick to only two parameters for now, but it's possible to simultaneously explore dozens of parameters if we want.

In [31]:
from sklearn.grid_search import GridSearchCV

decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(all_classes, n_folds=10)

grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(all_inputs, all_classes)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Best score: 0.9…
Best parameters: {'max_features': …, 'max_depth': 3}
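Under the hood, a grid search is just a loop over every parameter combination. The sketch below shows that loop with the standard library only. `exhaustive_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a cross-validated accuracy whose peak was chosen arbitrarily. It says nothing about the real results on our data.

```python
from itertools import product

def exhaustive_search(score_function, parameter_grid):
    """Score every combination in parameter_grid and return (best_score, best_parameters)."""
    names = sorted(parameter_grid)
    best_score, best_parameters = None, None
    for values in product(*(parameter_grid[name] for name in names)):
        parameters = dict(zip(names, values))
        score = score_function(**parameters)
        if best_score is None or score > best_score:
            best_score, best_parameters = score, parameters
    return best_score, best_parameters

def toy_score(max_depth, max_features):
    """Hypothetical stand-in for a cross-validated accuracy (peak placed arbitrarily)."""
    return 1.0 - 0.1 * abs(max_depth - 3) - 0.05 * abs(max_features - 2)

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

best_score, best_parameters = exhaustive_search(toy_score, parameter_grid)
print('Best score: {}'.format(best_score))
print('Best parameters: {}'.format(best_parameters))
```

The number of combinations is the product of the list lengths (20 here), which is why broad grids over many parameters get expensive quickly.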
Now let's visualize the grid search to see how the parameters interact.

In [32]:
grid_visualization = []

for grid_pair in grid_search.grid_scores_:
    grid_visualization.append(grid_pair.mean_validation_score)

# One row per max_depth value, one column per max_features value
grid_visualization = np.array(grid_visualization)
grid_visualization.shape = (5, 4)
sb.heatmap(grid_visualization, cmap='Blues')
plt.xticks(np.arange(4) + 0.5, grid_search.param_grid['max_features'])
plt.yticks(np.arange(5) + 0.5, grid_search.param_grid['max_depth'][::-1])
plt.xlabel('max_features')
plt.ylabel('max_depth')

Out[32]: <matplotlib.text.Text at 0x…>
Now we have a better sense of the parameter space: We know that we need a max_depth of at least 2 to allow the decision tree to make more than a one-off decision.

max_features doesn't really seem to make a big difference here as long as we have 2 of them, which makes sense since our data set has only 4 features and is relatively easy to classify. (Remember, one of our data set's classes was easily separable from the rest based on a single feature.)

Let's go ahead and use a broad grid search to find the best settings for a handful of parameters.

In [33]:
decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(all_classes, n_folds=10)

grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(all_inputs, all_classes)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Best score: 0.9…
Best parameters: {'max_features': …, 'max_depth': …, 'splitter': 'best', 'criterion': 'gini'}
Now we can take the best classifier from the Grid Search and use that:

In [34]:
decision_tree_classifier = grid_search.best_estimator_
decision_tree_classifier

Out[34]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=…,
            max_features=…, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')
We can even visualize the decision tree with GraphViz to see how it's making the classifications:

In [ ]:
import sklearn.tree as tree
from sklearn.externals.six import StringIO

# Write the fitted tree to a .dot file that GraphViz can render
with open('iris_dtc.dot', 'w') as out_file:
    out_file = tree.export_graphviz(decision_tree_classifier, out_file=out_file)
(This classifier may look familiar from earlier in the notebook.)

Alright! We finally have our demo classifier. Let's create some visuals of its performance so we have something to show our head of data.

In [35]:
rf_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)

sb.boxplot(rf_scores)
sb.stripplot(rf_scores, jitter=True, color='white')

Out[35]: <matplotlib.axes._subplots.AxesSubplot at 0x…>
Hmmm... that's a little boring by itself though. How about we compare another classifier to see how they perform?

We already know from previous projects that Random Forest classifiers usually work better than individual decision trees. A common problem that decision trees face is that they're prone to overfitting: They grow so complex that they classify the training set near-perfectly, but fail to generalize to data they have not seen before.

Random Forest classifiers work around that limitation by creating a whole bunch of decision trees (hence "forest"), each trained on random subsets of training samples (drawn with replacement) and features (drawn without replacement), and having the decision trees work together to make a more accurate classification.

Let that be a lesson for us: Even in Machine Learning, we get better results when we work together!
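The two kinds of random draws described above, plus the final vote, can be sketched without any libraries. This follows the description in the text (one row sample and one feature subset per tree) and is an illustration only. `draw_tree_inputs` and `majority_vote` are hypothetical helpers, and the sizes and votes are made up. As far as I know, scikit-learn's own implementation redraws the feature subset at every split rather than once per tree.

```python
import random
from collections import Counter

def draw_tree_inputs(n_samples, n_features, n_features_per_tree, rng):
    """Pick the rows and features that one tree in the forest would see."""
    # Rows: drawn WITH replacement, so some rows repeat and others are left out
    row_indices = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Features: drawn WITHOUT replacement, so each feature appears at most once
    feature_indices = sorted(rng.sample(range(n_features), n_features_per_tree))
    return row_indices, feature_indices

def majority_vote(predictions):
    """Combine the trees' predictions by taking the most common class."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
for tree_number in range(3):
    rows, features = draw_tree_inputs(n_samples=10, n_features=4,
                                      n_features_per_tree=2, rng=rng)
    print('Tree {}: rows {}, features {}'.format(tree_number, rows, features))

# Hypothetical predictions from three trees for a single flower
votes = ['Iris-virginica', 'Iris-versicolor', 'Iris-virginica']
print('Forest prediction: {}'.format(majority_vote(votes)))
```

Because every tree sees a slightly different slice of the data, their individual overfitting mistakes tend to differ, and the vote averages them out.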
Let's see if a Random Forest classifier works better here.

The great part about scikit-learn is that the training, testing, parameter tuning, etc. process is the same for all models, so we only need to plug in the new classifier.

In [40]:
from sklearn.ensemble import RandomForestClassifier

random_forest_classifier = RandomForestClassifier()

parameter_grid = {'n_estimators': [5, 10, 25, 50],
                  'criterion': ['gini', 'entropy'],
                  'max_features': [1, 2, 3, 4],
                  'warm_start': [True, False]}

cross_validation = StratifiedKFold(all_classes, n_folds=10)

grid_search = GridSearchCV(random_forest_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(all_inputs, all_classes)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

grid_search.best_estimator_

Best score: 0.9…
Best parameters: {'n_estimators': 5, 'max_features': …, 'warm_start': True, 'criterion': 'gini'}

Out[40]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=…, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=True)
Now we can compare their performance:

In [42]:
random_forest_classifier = grid_search.best_estimator_

rf_df = pd.DataFrame({'accuracy': cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10),
                      'classifier': ['Random Forest'] * 10})
dt_df = pd.DataFrame({'accuracy': cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10),
                      'classifier': ['Decision Tree'] * 10})
both_df = rf_df.append(dt_df)

sb.boxplot(x='classifier', y='accuracy', data=both_df)
sb.stripplot(x='classifier', y='accuracy', data=both_df, jitter=True, color='white')

Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x…>
How about that? They both seem to perform about the same on this data set. This is probably because of the limitations of our data set: We have only 4 features to make the classification, and Random Forest classifiers excel when there are hundreds of possible features to look at. In other words, there wasn't much room for improvement with this data set.
Step 6: Reproducibility

[go back to the top]

Ensuring that our work is reproducible is the last and (arguably) most important step in any analysis. As a rule, we shouldn't place much weight on a discovery that can't be reproduced. As such, if our analysis isn't reproducible, we might as well not have done it.

Notebooks like this one go a long way toward making our work reproducible. Since we documented every step as we moved along, we have a written record of what we did and why we did it, both in text and code.

Beyond recording what we did, we should also document what software and hardware we used to perform our analysis. This typically goes at the top of our notebooks so our readers know what tools to use.

Sebastian Raschka created a handy notebook tool for this:

In [43]:
%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py

Installed watermark.py. To use it, type:
  %load_ext watermark

In [44]:
%load_ext watermark

In [45]:
%watermark -a 'Randal S. Olson' -nmv --packages numpy,pandas,scikit-learn,matplotlib,Seaborn

Randal S. Olson Fri Aug 21 2015

CPython 3.4.3
IPython 3.2.1

numpy 1.9.2
pandas 0.16.2
scikit-learn 0.16.1
matplotlib 1.4.3
Seaborn 0.6.0

compiler   : GCC 4.2.1 (Apple Inc. build 5577)
system     : Darwin
release    : 14.5.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
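If the watermark extension isn't available, a similar record can be collected with the standard library alone. The sketch below is an alternative I'm suggesting, not part of the watermark tool. `describe_environment` is a hypothetical helper, and it assumes Python 3.8 or later for `importlib.metadata`.

```python
import platform
from importlib import metadata

def describe_environment(packages):
    """Collect interpreter, platform, and package versions into a dictionary."""
    report = {
        'python': platform.python_version(),
        'implementation': platform.python_implementation(),
        'system': platform.system(),
        'machine': platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the report is always complete
            report[name] = 'not installed'
    return report

environment = describe_environment(['numpy', 'pandas', 'scikit-learn',
                                    'matplotlib', 'seaborn'])
for key, value in environment.items():
    print('{:<15}: {}'.format(key, value))
```

Printing this at the top of a notebook gives readers the same kind of information as the watermark output above, without requiring an extra install.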
Finally, let's extract the core of our work from Steps 1-5 and turn it into a single pipeline.

In [46]:
%matplotlib inline
import pandas as pd
import seaborn as sb
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# We can jump directly to working with the clean data because we saved our cleaned data set
iris_data_clean = pd.read_csv('iris-data-clean.csv')

# Testing our data: Our analysis will stop here if any of these assertions are wrong

# We know that we should only have three classes
assert len(iris_data_clean['class'].unique()) == 3

# We know that sepal lengths for 'Iris-versicolor' should never be below 2.5 cm
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5

# We know that our data set should have no missing measurements
assert len(iris_data_clean.loc[(iris_data_clean['sepal_length_cm'].isnull()) |
                               (iris_data_clean['sepal_width_cm'].isnull()) |
                               (iris_data_clean['petal_length_cm'].isnull()) |
                               (iris_data_clean['petal_width_cm'].isnull())]) == 0

all_inputs = iris_data_clean[['sepal_length_cm', 'sepal_width_cm',
                              'petal_length_cm', 'petal_width_cm']].values

all_classes = iris_data_clean['class'].values

# This is the classifier that came out of Grid Search
random_forest_classifier = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                                max_depth=None, max_features=3, max_leaf_nodes=None,
                                min_samples_leaf=1, min_samples_split=2,
                                min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
                                oob_score=False, random_state=None, verbose=0, warm_start=True)

# All that's left to do now is plot the cross-validation scores
rf_classifier_scores = cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10)
sb.boxplot(rf_classifier_scores)
sb.stripplot(rf_classifier_scores, jitter=True, color='white')

# ...and show some of the predictions from the classifier
(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75)

random_forest_classifier.fit(training_inputs, training_classes)

for input_features, prediction, actual in zip(testing_inputs[:10],
                                              random_forest_classifier.predict(testing_inputs[:10]),
                                              testing_classes[:10]):
    print('{}\t-->\t{}\t(Actual: {})'.format(input_features, prediction, actual))
[…]	-->	Iris-setosa	(Actual: Iris-setosa)
[…]	-->	Iris-versicolor	(Actual: Iris-versicolor)
[…]	-->	Iris-virginica	(Actual: Iris-virginica)
[…]	-->	Iris-versicolor	(Actual: Iris-versicolor)
[…]	-->	Iris-virginica	(Actual: Iris-virginica)
[…]	-->	Iris-virginica	(Actual: Iris-virginica)
[…]	-->	Iris-setosa	(Actual: Iris-setosa)
[…]	-->	Iris-versicolor	(Actual: Iris-virginica)
[…]	-->	Iris-setosa	(Actual: Iris-setosa)
[…]	-->	Iris-virginica	(Actual: Iris-virginica)
There we have it: We have a complete and reproducible Machine Learning pipeline to demo to our head of data. We've met the success criteria that we set from the beginning (>90% accuracy), and our pipeline is flexible enough to handle new inputs or flowers when that data set is ready. Not bad for our first week on the job!
Conclusions

[go back to the top]

I hope you found this example notebook useful for your own work and learned at least one new trick by reading through it.

If you've spotted any errors or would like to contribute to this notebook, please don't hesitate to get in touch. I can be reached in the following ways:

• Email me
• Tweet at me
• Submit an issue on GitHub
• Fork the notebook repository, make the fix/addition yourself, then send over a pull request
Further reading

[go back to the top]

This notebook covers a broad variety of topics but skips over many of the specifics. If you're looking to dive deeper into a particular topic, here's some recommended reading.

Data Science: William Chen compiled a list of free books for newcomers to Data Science, ranging from the basics of R & Python to Machine Learning to interviews and advice from prominent data scientists.

Machine Learning: /r/MachineLearning has a useful Wiki page containing links to online courses, books, data sets, etc. for Machine Learning. There's also a curated list of Machine Learning frameworks, libraries, and software sorted by language.

Unit testing: Dive Into Python 3 has a great walkthrough of unit testing in Python, how it works, and how it should be used.

pandas has several tutorials covering its myriad features.

scikit-learn has a bunch of tutorials for those looking to learn Machine Learning in Python. Andreas Mueller's scikit-learn workshop materials are top-notch and freely available.

matplotlib has many books, videos, and tutorials to teach plotting in Python.

Seaborn has a basic tutorial covering most of the statistical plotting features.