The book website contains supporting material for instructors and readers.

LEARNING FROM DATA: A SHORT COURSE

Yaser S. Abu-Mostafa
Malik Magdon-Ismail
Hsuan-Tien Lin
Yaser S. Abu-Mostafa, Departments of Electrical Engineering and Computer Science, California Institute of Technology, Pasadena, CA 91125

Malik Magdon-Ismail, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180

Hsuan-Tien Lin, Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
ISBN 10: 1-60049-006-9
ISBN 13: 978-1-60049-006-4

© 2012 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the authors. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the authors, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act.

Limit of Liability / Disclaimer of Warranty: While the authors have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. The authors shall not be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

The use in this publication of tradenames, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

This book was typeset by the authors and was printed and bound in the United States of America.
Preface

This book is designed as a short course on machine learning. It is a short course, not a hurried course. From over a decade of teaching this material, we have distilled what we believe to be the core topics that every student of the subject should know. We chose the title 'learning from data' because it faithfully describes what the subject is about, and we made it a point to cover the topics in a story-like fashion. Our hope is that the reader can learn all the fundamentals of the subject by reading the book cover to cover.

Learning from data has distinct theoretical and practical tracks. If you read two books that each focus on one track, you may feel that you are reading about two different subjects altogether. In this book, we balance the theoretical and the practical, the mathematical and the heuristic. Our criterion for inclusion is relevance. Theory that establishes the conceptual framework for learning is included, and so are heuristics that impact the performance of real learning systems. Strengths and weaknesses of the different parts are spelled out. Our philosophy is to say it like it is: what we know, what we don't know, and what we partially know.

The book can be taught in exactly the order it is presented. The notable exception may be Chapter 2, which is the most theoretical chapter of the book. The theory of generalization that this chapter covers is central to learning from data, and we made an effort to make it accessible to a wide readership. However, instructors who are more interested in the practical side may skim over it, or delay it until after the practical methods of Chapter 3 are taught.

You will notice that we included exercises (in gray boxes) throughout the text. The main purpose of these exercises is to engage the reader and enhance understanding of the particular topic being covered. Our reason for separating the exercises out is that they are not crucial to the logical flow. Nevertheless, they contain useful information, and we strongly encourage you to read them, even if you don't do them to completion. Instructors may find some of the exercises appropriate as 'easy' homework problems, and we also provide additional problems of varying difficulty in the Problems section at the end of each chapter.

To help instructors with preparing their lectures based on the book, we provide supporting material on the book's website. There is also a forum that covers additional topics in learning from data. We will discuss these further in the Epilogue of this book.

Acknowledgment (in alphabetical order within each group): We would like to express our gratitude to the alumni of our Learning Systems Group at Caltech who gave us detailed expert feedback: Zehra Cataltepe, Ling Li, Amrit Pratap, and Joseph Sill. We thank the many students and colleagues who gave us useful feedback during the development of this book, especially Chun-Wei Liu. The Caltech Library staff, especially Kristin Buxton and David McCaslin, have given us excellent advice and help in our self-publishing effort. We also thank Lucinda Acosta for her help throughout the writing of this book.

Last, but not least, we would like to thank our families for their encouragement, their support, and most of all their patience as they endured the time demands that writing a book has imposed on us.
Pasadea Califoria. Malik MagdIsmail oy e ork HsuaTie Li Taipei Taia
Yase S AbuMstaf
arch
viii
NOTATION

A complete table of the notation used in this book is included right before the index of terms. We suggest referring to it as needed.
Chapter 1

The Learning Problem

If you show a picture to a three-year-old and ask if there is a tree in it, you will likely get the correct answer. If you ask a thirty-year-old what the definition of a tree is, you will likely get an inconclusive answer. We didn't learn what a tree is by studying the mathematical definition of trees. We learned it by looking at trees. In other words, we learned from 'data'.

Learning from data is used in situations where we don't have an analytic solution, but we do have data that we can use to construct an empirical solution. This premise covers a lot of territory, and indeed learning from data is one of the most widely used techniques in science, engineering, and economics, among other fields.

In this chapter, we present examples of learning from data and formalize the learning problem. We also discuss the main concepts associated with learning, and the different paradigms of learning that have been developed.
1.1  Problem Setup
What do financial forecasting, medical diagnosis, computer vision, and search engines have in common? They all have successfully utilized learning from data. The repertoire of such applications is quite impressive. Let us open the discussion with a real-life application to see how learning from data works.

Consider the problem of predicting how a movie viewer would rate the various movies out there. This is an important problem if you are a company that rents out movies, since you want to recommend to different viewers the movies they will like. Good recommender systems are so important to business that the movie rental company Netflix offered a prize of one million dollars to anyone who could improve their recommendations by a mere 10%.

The main difficulty in this problem is that the criteria that viewers use to rate movies are quite complex. Trying to model those explicitly is no easy task, so it may not be possible to come up with an analytic solution. However, we
Figure 1.1: A model for how a viewer rates a movie.
know that the historical rating data reveal a lot about how people rate movies, so we may be able to construct a good empirical solution. There is a great deal of data available to movie rental companies, since they often ask their viewers to rate the movies that they have already seen.

Figure 1.1 illustrates a specific approach that was widely used in the million-dollar competition. Here is how it works. You describe a movie as a long array of different factors, e.g., how much comedy is in it, how complicated is the plot, how handsome is the lead actor, etc. Now, you describe each viewer with corresponding factors: how much do they like comedy, do they prefer simple or complicated plots, how important are the looks of the lead actor, and so on. How this viewer will rate that movie is now estimated based on the match/mismatch of these factors. For example, if the movie is pure comedy and the viewer hates comedies, the chances are he won't like it. If you take dozens of these factors describing many facets of a movie's content and a viewer's taste, the conclusion based on matching all the factors will be a good predictor of how the viewer will rate the movie.

The power of learning from data is that this entire process can be automated, without any need for analyzing movie content or viewer taste. To do so, the learning algorithm 'reverse-engineers' these factors based solely on pre-
vious ratings. It starts with random factors, then tunes these factors to make them more and more aligned with how viewers have rated movies before, until they are ultimately able to predict how viewers rate movies in general. The factors we end up with may not be as intuitive as 'comedy content', and in fact can be quite subtle or even incomprehensible. After all, the algorithm is only trying to find the best way to predict how a viewer would rate a movie, not necessarily to explain to us how it is done. This algorithm was part of the winning solution in the million-dollar competition.
1.1.1  Components of Learning
The movie rating application captures the essence of learning from data, and so do many other applications from vastly different fields. In order to abstract the common core of the learning problem, we will pick one application and use it as a metaphor for the different components of the problem. Let us take credit approval as our metaphor.

Suppose that a bank receives thousands of credit card applications every day, and it wants to automate the process of evaluating them. Just as in the case of movie ratings, the bank knows of no magical formula that can pinpoint when credit should be approved, but it has a lot of data. This calls for learning from data, so the bank uses historical records of previous customers to figure out a good formula for credit approval.

Each customer record has personal information related to credit, such as annual salary, years in residence, outstanding loans, etc. The record also keeps track of whether approving credit for that customer was a good idea, i.e., did the bank make money on that customer. This data guides the construction of a successful formula for credit approval that can be used on future applicants.

Let us give names and symbols to the main components of this learning problem. There is the input x (customer information that is used to make a credit decision), the unknown target function f: X -> Y (ideal formula for credit approval), where X is the input space (the set of all possible inputs x) and Y is the output space (the set of all possible outputs, in this case just a yes/no decision). There is a data set D of input-output examples (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) for n = 1, ..., N (inputs corresponding to previous customers and the correct credit decision for them in hindsight). The examples are often referred to as data points. Finally, there is the learning algorithm that uses the data set D to pick a formula g: X -> Y that approximates f. The algorithm chooses g from a set of candidate formulas under consideration, which we call the hypothesis set H. For instance, H could be the set of all linear formulas from which the algorithm would choose the best linear fit to the data, as we will introduce later in this section.

When a new customer applies for credit, the bank will base its decision on g (the hypothesis that the learning algorithm produced), not on f (the ideal target function, which remains unknown). The decision will be good only to the extent that g faithfully replicates f.
Figure 1.2: Basic setup of the learning problem: the unknown target function f: X -> Y (ideal credit approval formula), the training examples (x_1, y_1), ..., (x_N, y_N), the learning algorithm, the hypothesis set H, and the final hypothesis g (learned credit approval formula).
To achieve that, the algorithm chooses g that best matches the training examples of previous customers, with the hope that it will continue to match new customers. Whether or not this hope is justified remains to be seen. Figure 1.2 illustrates the components of the learning problem.

Exercise 1.1
Express each of the following tasks in the framework of learning from data by specifying the input space X, the output space Y, the target function f: X -> Y, and the specifics of the data set that we will learn from.
(a) Medical diagnosis: a patient walks in with a medical history and some symptoms, and you want to identify the problem.
(b) Handwritten digit recognition (for example, postal zip code recognition for mail sorting).
(c) Determining if an email is spam or not.
(d) Predicting how an electric load varies with price, temperature, and day of the week.
(e) A problem of interest to you for which there is no analytic solution, but you have data from which to construct an empirical solution.
We will use the setup in Figure 1.2 as our definition of the learning problem. Later on, we will consider a number of refinements and variations to this basic setup as needed. However, the essence of the problem will remain the same: there is a target to be learned; it is unknown to us; we have a set of examples generated by the target; and the learning algorithm uses these examples to look for a hypothesis that approximates the target.

1.1.2  A Simple Learning Model
Let us consider the different components of Figure 1.2. Given a specific learning problem, the target function and training examples are dictated by the problem. However, the learning algorithm and hypothesis set are not. These are solution tools that we get to choose. The hypothesis set and learning algorithm are referred to informally as the learning model.

Here is a simple model. Let X = R^d be the input space, where R^d is the d-dimensional Euclidean space, and let Y = {+1, -1} be the output space, denoting a binary (yes/no) decision. In our credit example, different coordinates of the input vector x in R^d correspond to salary, years in residence, outstanding debt, and the other data fields in a credit application. The binary output y corresponds to approving or denying credit. We specify the hypothesis set H through a functional form that all the hypotheses h in H share. The functional form h(x) that we choose here gives different weights to the different coordinates of x, reflecting their relative importance in the credit decision. The weighted coordinates are then combined to form a 'credit score', and the result is compared to a threshold value. If the applicant passes the threshold, credit is approved; if not, credit is denied:

    Approve credit if    Σ_{i=1}^{d} w_i x_i > threshold,
    Deny credit if       Σ_{i=1}^{d} w_i x_i < threshold.

This formula can be written more compactly as

    h(x) = sign( ( Σ_{i=1}^{d} w_i x_i ) + b ),        (1.1)

where x_1, ..., x_d are the components of the vector x; h(x) = +1 means 'approve credit' and h(x) = -1 means 'deny credit'; sign(s) = +1 if s > 0 and sign(s) = -1 if s < 0.¹ The weights are w_1, ..., w_d, and the threshold is determined by the bias term b, since in Equation (1.1) credit is approved if Σ_{i=1}^{d} w_i x_i > -b. This model of H is called the perceptron, a name that it got in the context of artificial intelligence.

¹ The value of sign(s) when s = 0 is a simple technicality that we ignore for the moment.
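As a concrete illustration of the functional form in Equation (1.1), here is a minimal sketch in Python (not from the book); the two features and the numerical weights are hypothetical choices, used only to show how a weighted credit score is compared against a threshold.

    import numpy as np

    def perceptron_h(x, w, b):
        """Perceptron hypothesis of Equation (1.1): sign of the weighted sum plus bias.

        x : array of d input features (e.g., salary, outstanding debt)
        w : array of d weights, one per feature
        b : bias term (the negative of the threshold)
        Returns +1 (approve credit) or -1 (deny credit).
        """
        score = np.dot(w, x) + b          # credit score compared against the threshold
        return 1 if score > 0 else -1     # sign(s); the s == 0 case is ignored here

    # Hypothetical example: features are (salary in $1000s, outstanding debt in $1000s)
    w = np.array([0.05, -0.2])            # debt gets a negative weight
    b = -1.0                              # corresponds to a threshold of 1.0 on the score
    print(perceptron_h(np.array([80, 10]), w, b))   # +1: approve
    print(perceptron_h(np.array([30, 15]), w, b))   # -1: deny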
Figure 1.3: Perceptron classification of linearly separable data in a two-dimensional input space. (a) Misclassified data: some training examples will be misclassified (blue points in the red region and vice versa) for certain values of the weight parameters, which define the separating line. (b) Perfectly classified data: a final hypothesis that classifies all training examples correctly (+1 is blue and -1 is red).
The learning algorithm will search H by looking for weights and bias that perform well on the data set. Some of the weights w_1, ..., w_d may end up being negative, corresponding to an adverse effect on credit approval. For instance, the weight of the 'outstanding debt' field should come out negative since more debt is not good for credit. The bias value b may end up being large or small, reflecting how lenient or stringent the bank should be in extending credit. The optimal choices of weights and bias define the final hypothesis g in H that the algorithm produces.
Exercise 1.2
Suppose that we use a perceptron to detect spam messages. Let's say that each email message is represented by the frequency of occurrence of keywords, and the output is +1 if the message is considered spam.
(a) Can you think of some keywords that will end up with a large positive weight in the perceptron?
(b) How about keywords that will get a negative weight?
(c) What parameter in the perceptron directly affects how many borderline messages end up being classified as spam?
Figure 1.3 illustrates what a perceptron does in a two-dimensional case (d = 2). The plane is split by a line into two regions, the +1 decision region and the -1 decision region. Different values for the parameters w_1, w_2, b correspond to different lines w_1 x_1 + w_2 x_2 + b = 0. If the data set is linearly separable, there will be a choice for these parameters that classifies all the training examples correctly.
To simplify the notation of the perceptron formula, we will treat the bias b as a weight w_0 = b and merge it with the other weights into one vector w = [w_0, w_1, ..., w_d]ᵀ, where ᵀ denotes the transpose of a vector, so w is a column vector. We also treat x as a column vector and modify it to become x = [x_0, x_1, ..., x_d]ᵀ, where the added coordinate x_0 is fixed at x_0 = 1. Formally speaking, the input space is now

    X = {1} × R^d = { [x_0, x_1, ..., x_d]ᵀ : x_0 = 1, x_1 ∈ R, ..., x_d ∈ R }.

With this convention, wᵀx = Σ_{i=0}^{d} w_i x_i, and so Equation (1.1) can be rewritten in vector form as

    h(x) = sign(wᵀx).        (1.2)
We now introduce the perceptron learning algorithm (PLA). The algorithm will determine what w should be, based on the data. Let us assume that the data set is linearly separable, which means that there is a vector w that makes (1.2) achieve the correct decision h(x_n) = y_n on all the training examples, as shown in Figure 1.3.

Our learning algorithm will find this w using a simple iterative method. Here is how it works. At iteration t, where t = 0, 1, 2, ..., there is a current value of the weight vector, call it w(t). The algorithm picks an example from (x_1, y_1), ..., (x_N, y_N) that is currently misclassified, call it (x(t), y(t)), and uses it to update w(t). Since the example is misclassified, we have y(t) ≠ sign(wᵀ(t) x(t)). The update rule is

    w(t + 1) = w(t) + y(t) x(t).        (1.3)

This rule moves the boundary in the direction of classifying x(t) correctly, as depicted in the figure above. The algorithm continues with further iterations until there are no longer misclassified examples in the data set.
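The update rule (1.3) translates almost line for line into code. The following is a minimal sketch in Python with NumPy (not the book's code), using the augmented-vector convention x_0 = 1 and assuming the labels are ±1 and the data is linearly separable.

    import numpy as np

    def pla(X, y, max_iters=10000):
        """Perceptron learning algorithm (PLA) sketch.

        X : N x d matrix of inputs (without the x_0 = 1 coordinate)
        y : length-N vector of labels in {+1, -1}, assumed linearly separable
        Returns the weight vector w = [w_0, w_1, ..., w_d] of the final hypothesis.
        """
        N, d = X.shape
        X_aug = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1 to every input
        w = np.zeros(d + 1)                       # start from the zero weight vector
        for _ in range(max_iters):
            predictions = np.sign(X_aug @ w)
            misclassified = np.where(predictions != y)[0]
            if len(misclassified) == 0:           # no misclassified examples: done
                return w
            n = np.random.choice(misclassified)   # pick a misclassified example at random
            w = w + y[n] * X_aug[n]               # update rule (1.3)
        return w                                  # fallback if max_iters is reached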
Exercise 1.3
The weight update rule in (1.3) has the nice interpretation that it moves in the direction of classifying x(t) correctly.
(a) Show that y(t) wᵀ(t) x(t) < 0. [Hint: x(t) is misclassified by w(t).]
(b) Show that y(t) wᵀ(t+1) x(t) > y(t) wᵀ(t) x(t). [Hint: Use (1.3).]
(c) As far as classifying x(t) is concerned, argue that the move from w(t) to w(t+1) is a move 'in the right direction'.
Although the update rule in (1.3) considers only one training example at a time and may 'mess up' the classification of the other examples that are not involved in the current iteration, it turns out that the algorithm is guaranteed to arrive at the right solution in the end. The proof is the subject of Problem 1.3. The result holds regardless of which example we choose from among the misclassified examples in (x_1, y_1), ..., (x_N, y_N) at each iteration, and regardless of how we initialize the weight vector to start the algorithm. For simplicity, we can pick one of the misclassified examples at random (or cycle through the examples and always choose the first misclassified one), and we can initialize w(0) to the zero vector.

Within the infinite space of all weight vectors, the perceptron algorithm manages to find a weight vector that works, using a simple iterative process. This illustrates how a learning algorithm can effectively search an infinite hypothesis set using a finite number of simple steps. This feature is characteristic of many techniques that are used in learning, some of which are far more sophisticated than the perceptron learning algorithm.

Exercise 1.4
Let us create our own target function f and data set D and see how the perceptron learning algorithm works. Take d = 2 so you can visualize the problem, and choose a random line in the plane as your target function, where one side of the line maps to +1 and the other maps to -1. Choose the inputs x_n of the data set as random points in the plane, and evaluate the target function on each x_n to get the corresponding output y_n. Now, generate a data set of size 20. Try the perceptron learning algorithm on your data set and see how long it takes to converge and how well the final hypothesis matches your target. You can find other ways to play with this experiment in Problem 1.4.
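One way to set up the experiment in Exercise 1.4 is sketched below (Python, not the book's code); the choice of the square [-1, 1]² as the region and the particular random-line construction are assumptions made for illustration. It reuses the pla sketch shown earlier.

    import numpy as np

    def random_target_and_data(N=20, rng=np.random.default_rng(0)):
        """Generate a random linear target f on [-1, 1]^2 and a data set of size N."""
        p, q = rng.uniform(-1, 1, (2, 2))              # two random points define the target line
        a, b = q[1] - p[1], p[0] - q[0]                # line: a*x1 + b*x2 + c = 0
        c = -(a * p[0] + b * p[1])
        X = rng.uniform(-1, 1, (N, 2))                 # random input points in the plane
        y = np.sign(a * X[:, 0] + b * X[:, 1] + c)     # +1 on one side of the line, -1 on the other
        return X, y

    X, y = random_target_and_data()
    w = pla(X, y)                                      # run the PLA sketch from above
    print("final hypothesis weights:", w)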
The perceptron learning algorithm succeeds in achieving its goal: finding a hypothesis that classifies all the points in the data set D = {(x_1, y_1), ..., (x_N, y_N)} correctly. Does this mean that this hypothesis will also be successful in classifying new data points that are not in D? This turns out to be the key question in the theory of learning, a question that will be thoroughly examined in this book.
Figure 1.4: The learning approach to coin classification. (a) Training data of pennies, nickels, dimes, and quarters (1, 5, 10, and 25 cents) are represented in a size-mass space where they fall into clusters. (b) A classification rule is learned from the data set by separating the four clusters. A new coin will be classified according to the region in the size-mass plane that it falls into.
1.1.3  Learning versus Design
So far, we have discussed what learning is. Here, we discuss what it is not. The goal is to distinguish between learning and a related approach that is used for similar problems. While learning is based on data, this other approach does not use data. It is a 'design' approach based on specifications, and is often discussed alongside the learning approach in the pattern recognition literature.

Consider the problem of recognizing coins of different denominations, which is relevant to vending machines, for example. We want the machine to recognize quarters, dimes, nickels, and pennies. We will contrast the 'learning from data' approach and the 'design from specifications' approach for this problem. We assume that each coin will be represented by its size and mass, a two-dimensional input.

In the learning approach, we are given a sample of coins from each of the four denominations and we use these coins as our data set. We treat the size and mass as the input vector, and the denomination as the output. Figure 1.4(a) shows what the data set may look like in the input space. There is some variation of size and mass within each class, but by and large, coins of the same denomination cluster together. The learning algorithm searches for a hypothesis that classifies the data set well. If we want to classify a new coin, the machine measures its size and mass, and then classifies it according to the learned hypothesis in Figure 1.4(b).

In the design approach, we call the United States Mint and ask them about the specifications of different coins. We also ask them about the number
Figure 1.5: The design approach to coin classification. (a) A probabilistic model for the size, mass, and denomination of coins is derived from known specifications. The figure shows the high probability region for each denomination (1, 5, 10, and 25 cents) according to the model. (b) A classification rule is derived analytically to minimize the probability of error in classifying a coin based on size and mass. The resulting regions for each denomination are shown.
of coins of each denomination in circulation, in order to get an estimate of the relative frequency of each coin. Finally, we make a physical model of the variations in size and mass due to exposure to the elements and due to errors in measurement. We put all of this information together and compute the full joint probability distribution of size, mass, and coin denomination (Figure 1.5(a)). Once we have that joint distribution, we can construct the optimal decision rule to classify coins based on size and mass (Figure 1.5(b)). The rule chooses the denomination that has the highest probability for a given size and mass, thus achieving the smallest possible probability of error.²

The main difference between the learning approach and the design approach is the role that data plays. In the design approach, the problem is well specified, and one can analytically derive f without the need to see any data. In the learning approach, the problem is much less specified, and one needs data to pin down what f is. Both approaches may be viable in some applications, but only the learning approach is possible in many applications where the target function is unknown. We are not trying to compare the utility or the performance of the two approaches. We are just making the point that the design approach is distinct from learning. This book is about learning.

² This is called Bayes optimal decision theory. Some learning models are based on the same theory by estimating the probability from data.
Exercise 1.5
Which of the following problems are more suited for the learning approach and which are more suited for the design approach?
(a) Determining the age at which a particular medical test should be performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection
1.2  Types of Learning
The basic premise of learning from data is the use of a set of observations to uncover an underlying process. It is a very broad premise, and difficult to fit into a single framework. As a result, different learning paradigms have arisen to deal with different situations and different assumptions. In this section, we introduce some of these paradigms.

The learning paradigm that we have discussed so far is called supervised learning. It is the most studied and most utilized type of learning, but it is not the only one. Some variations of supervised learning are simple enough to be accommodated within the same framework. Other variations are more profound and lead to new concepts and techniques that take on lives of their own. The most important variations have to do with the nature of the data set.
1.2.1  Supervised Learning
When the training data contains explicit examples of what the correct output should be for given inputs, then we are within the supervised learning setting that we have covered so far. Consider the hand-written digit recognition problem (task (b) of Exercise 1.1). A reasonable data set for this problem is a collection of images of hand-written digits, and for each image, the digit it actually represents. We thus have a set of examples of the form (image, digit). The learning is supervised in the sense that some 'supervisor' has taken the trouble to look at each input, in this case an image, and determine the correct output, in this case one of the ten categories {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

While we are on the subject of variations, there is more than one way that a data set can be presented to the learning process. Data sets are typically created and presented to us in their entirety at the outset of the learning process. For instance, historical records of customers in the credit-card application, and previous movie ratings of customers in the movie rating application, are already there for us to use. This protocol of a 'ready' data set is the most
common in practice, and it is what we will focus on in this book. However, it is worth noting that two variations of this protocol have attracted a significant body of work.

One is active learning, where the data set is acquired through queries that we make. Thus, we get to choose a point x in the input space, and the supervisor reports to us the target value for x. As you can see, this opens the possibility for strategic choice of the point x to maximize its information value, similar to asking a strategic question in a game of 20 questions.

Another variation is called online learning, where the data set is given to the algorithm one example at a time. This happens when we have streaming data that the algorithm has to process 'on the run'. For instance, when the movie recommendation system discussed in Section 1.1 is deployed, online learning can process new ratings from current users and movies. Online learning is also useful when we have limitations on computing and storage that preclude us from processing the whole data as a batch. We should note that online learning can be used in different paradigms of learning, not just in supervised learning.
1.2.2  Reinforcement Learning
When the training data does not explicitly contain the correct output for each input, we are no longer in a supervised learning setting. Consider a toddler learning not to touch a hot cup of tea. The experience of such a toddler would typically comprise a set of occasions when the toddler confronted a hot cup of tea and was faced with the decision of touching it or not touching it. Presumably, every time she touched it, the result was a high level of pain, and every time she didn't touch it, a much lower level of pain resulted (that of an unsatisfied curiosity). Eventually, the toddler learns that she is better off not touching the hot cup.

The training examples did not spell out what the toddler should have done, but they instead graded different actions that she has taken. Nevertheless, she uses the examples to reinforce the better actions, eventually learning what she should do in similar situations. This characterizes reinforcement learning, where the training example does not contain the target output, but instead contains some possible output together with a measure of how good that output is. In contrast to supervised learning, where the training examples were of the form (input, correct output), the examples in reinforcement learning are of the form

    (input, some output, grade for this output).

Importantly, the example does not say how good other outputs would have been for this particular input.

Reinforcement learning is especially useful for learning how to play a game. Imagine a situation in backgammon where you have a choice between different actions and you want to identify the best action. It is not a trivial task to ascertain what the best action is at a given stage of the game, so we cannot
Figure 1.6: Unsupervised learning of coin classification. (a) The same data set of coins as in Figure 1.4(a) is again represented in the size-mass space, but without being labeled. They still fall into clusters. (b) An unsupervised classification rule treats the four clusters as different types. The rule may be somewhat ambiguous, as type 1 and type 2 could be viewed as one cluster.
easily create supervised learning examples. If you use reinforcement learning instead, all you need to do is to take some action and report how well things went, and you have a training example. The reinforcement learning algorithm is left with the task of sorting out the information coming from different examples to find the best line of play.
1.2.3  Unsupervised Learning
In the unsupervised setting, the training data does not contain any output information at all. We are just given input examples x_1, ..., x_N. You may wonder how we could possibly learn anything from mere inputs. Consider the coin classification problem that we discussed earlier in Figure 1.4. Suppose that we didn't know the denomination of any of the coins in the data set. This unlabeled data is shown in Figure 1.6(a). We still get similar clusters, but they are now unlabeled, so all points have the same 'color'. The decision regions in unsupervised learning may be identical to those in supervised learning, but without the labels (Figure 1.6(b)). However, the correct clustering is less obvious now, and even the number of clusters may be ambiguous. Nonetheless, this example shows that we can learn something from the inputs by themselves. Unsupervised learning can be viewed as the task of spontaneously finding patterns and structure in input data. For instance, if our task is to categorize a set of books into topics, and we only use general properties of the various books, we can identify books that have similar properties and put them together in one category, without naming that category.
Unsupervised learning can also be viewed as a way to create a higher-level representation of the data. Imagine that you don't speak a word of Spanish, but your company will relocate you to Spain next month. They will arrange for Spanish lessons once you are there, but you would like to prepare yourself a bit before you go. All you have access to is a Spanish radio station. For a month, you continuously bombard yourself with Spanish; this is an unsupervised learning experience, since you don't know the meaning of the words. However, you gradually develop a better representation of the language in your brain by becoming more tuned to its common sounds and structures. When you arrive in Spain, you will be in a better position to start your Spanish lessons. Indeed, unsupervised learning can be a precursor to supervised learning. In other cases, it is a stand-alone technique.
Exercise 1.6
For each of the following tasks, identify which type of learning is involved (supervised, reinforcement, or unsupervised) and the training data to be used. If a task can fit more than one type, explain how and describe the training data for each type.
(a) Recommending a book to a user in an online bookstore
(b) Playing tic-tac-toe
(c) Categorizing movies into different types
(d) Learning to play music
(e) Credit limit: deciding the maximum allowed debt for each bank customer
Our main focus in this book will be supervised learning, which is the most popular form of learning from data.
1.2.4  Other Views of Learning
The study of learning has evolved somewhat independently in a number of fields that started historically at different times and in different domains, and these fields have developed different emphases and even different jargons. As a result, learning from data is a diverse subject with many aliases in the scientific literature. The main field dedicated to the subject is called machine learning, a name that distinguishes it from human learning. We briefly mention two other important fields that approach learning from data in their own ways.

Statistics shares the basic premise of learning from data, namely the use of a set of observations to uncover an underlying process. In this case, the process is a probability distribution and the observations are samples from that distribution. Because statistics is a mathematical field, emphasis is given to situations where most of the questions can be answered with rigorous proofs. As a result, statistics focuses on somewhat idealized models and analyzes them in great detail. This is the main difference between the statistical approach
Figure 1.7: A visual learning problem. The first two rows show the training examples (each input x is a 9-bit vector represented visually as a 3 x 3 black-and-white array). The inputs in the first row have f(x) = -1, and the inputs in the second row have f(x) = +1. Your task is to learn from this data set what f is, then apply f to the test input at the bottom. Do you get -1 or +1?
to learning and how we approach the subject here. We make less restrictive assumptions and deal with more general models than in statistics. Therefore, we end up with weaker results that are nonetheless broadly applicable.

Data mining is a practical field that focuses on finding patterns, correlations, or anomalies in large relational databases. For example, we could be looking at medical records of patients and trying to detect a cause-effect relationship between a particular drug and long-term effects. We could also be looking at credit card spending patterns and trying to detect potential fraud. Technically, data mining is the same as learning from data, with more emphasis on data analysis than on prediction. Because databases are usually huge, computational issues are often critical in data mining. Recommender systems, which were illustrated in Section 1.1 with the movie rating example, are also considered part of data mining.
1.3  Is Learning Feasible?
The target function f is the object of learning. The most important assertion about the target function is that it is unknown. We really mean unknown. This raises a natural question: how could a limited data set reveal enough information to pin down the entire target function? Figure 1.7 illustrates this
difficulty. A simple learning task with 6 training examples of a ±1 target function is shown. Try to learn what the function is, then apply it to the test input given. Do you get -1 or +1? Now, show the problem to your friends and see if they get the same answer.

The chances are the answers were not unanimous, and for good reason. There is simply more than one function that fits the 6 training examples, and some of these functions have a value of -1 on the test point while others have a value of +1. For instance, if the true f is +1 when the pattern is symmetric, the value for the test point would be +1. If the true f is -1 when the top left square of the pattern is white, the value for the test point would be -1. Both functions agree with all the examples in the data set, so there isn't enough information to tell us which would be the correct answer.

This does not bode well for the feasibility of learning. To make matters worse, we will now see that the difficulty we experienced in this simple problem is the rule, not the exception.
1.3.1  Outside the Data Set
When we get the training data D, e.g., the first two rows of Figure 1.7, we know the value of f on all the points in D. This doesn't mean that we have learned f, since it doesn't guarantee that we know anything about f outside of D. We know what we have already seen, but that's not learning. That's memorizing.

Does the data set D tell us anything outside of D that we didn't know before? If the answer is yes, then we have learned something. If the answer is no, we can conclude that learning is not feasible.

Since we maintain that f is an unknown function, we can prove that f remains unknown outside of D. Instead of going through a formal proof for the general case, we will illustrate the idea in a concrete case. Consider a Boolean target function over a three-dimensional input space X = {0, 1}^3. We are given a data set D of five examples represented in the table below. We denote the binary output by ○/● for visual clarity,

    x_n       y_n
    0 0 0      ○
    0 0 1      ●
    0 1 0      ●
    0 1 1      ○
    1 0 0      ●

where y_n = f(x_n) for n = 1, 2, 3, 4, 5. The advantage of this simple Boolean case is that we can enumerate the entire input space (since there are only 2^3 = 8 distinct input vectors), and we can enumerate the set of all possible target functions (since f is a Boolean function on 3 Boolean inputs, and there are only 2^8 = 256 distinct Boolean functions on 3 Boolean inputs).
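The counting argument can also be checked by brute force. The sketch below (Python, not from the book) enumerates all 256 Boolean functions on three bits and keeps only those that agree with the five examples in the table above (with ● encoded as 1 and ○ as 0); exactly 2^3 = 8 survive, one for each way of filling in the three unseen points.

    from itertools import product

    # The five training examples from the table above: input bits -> output (1 = filled dot, 0 = open dot)
    D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

    inputs = list(product([0, 1], repeat=3))          # all 8 points of X = {0, 1}^3

    # Each candidate target function is a tuple of 8 output bits, one per input point
    consistent = [f for f in product([0, 1], repeat=8)
                  if all(f[inputs.index(x)] == y for x, y in D.items())]

    print(len(consistent))                            # 8 functions agree with D
    for f in consistent:                              # they differ only on the 3 unseen points
        print(dict(zip(inputs, f)))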
Let us look at the problem of learning f. Since f is unknown except inside D, any function that agrees with D could conceivably be f. The table below shows all such functions f_1, ..., f_8. It also shows the data set D (in blue in the original figure) and what the final hypothesis g may look like.
    x        y    g     f_1  f_2  f_3  f_4  f_5  f_6  f_7  f_8
    0 0 0    ○    ○      ○    ○    ○    ○    ○    ○    ○    ○
    0 0 1    ●    ●      ●    ●    ●    ●    ●    ●    ●    ●
    0 1 0    ●    ●      ●    ●    ●    ●    ●    ●    ●    ●
    0 1 1    ○    ○      ○    ○    ○    ○    ○    ○    ○    ○
    1 0 0    ●    ●      ●    ●    ●    ●    ●    ●    ●    ●
    1 0 1         ?      ○    ○    ○    ○    ●    ●    ●    ●
    1 1 0         ?      ○    ○    ●    ●    ○    ○    ●    ●
    1 1 1         ?      ○    ●    ○    ●    ○    ●    ○    ●
The final hypothesis g is chosen based on the five examples in D. The table shows the case where g is chosen to match f on these examples.

If we remain true to the notion of unknown target, we cannot exclude any of f_1, ..., f_8 from being the true f. Now, we have a dilemma. The whole purpose of learning f is to be able to predict the value of f on points that we haven't seen before. The quality of the learning will be determined by how close our prediction is to the true value. Regardless of what g predicts on the three points we haven't seen before (those outside of D, denoted by the question marks), it can agree or disagree with the target, depending on which of f_1, ..., f_8 turns out to be the true target. It is easy to verify that any 3 bits that replace the question marks are as good as any other 3 bits.

Exercise 1.7
For each of the following learning scenarios in the above problem, evaluate the performance of g on the three points in X outside D. To measure the performance, compute how many of the 8 possible target functions agree with g on all three points, on two of them, on one of them, and on none of them.
(a) H has only two hypotheses, one that always returns '●' and one that always returns '○'. The learning algorithm picks the hypothesis that matches the data set the most.
(b) The same H, but the learning algorithm now picks the hypothesis that matches the data set the least.
(c) H = {XOR} (only one hypothesis, which is always picked), where XOR is defined by XOR(x) = ● if the number of 1's in x is odd and XOR(x) = ○ if the number is even.
(d) H contains all possible hypotheses (all Boolean functions on three variables), and the learning algorithm picks the hypothesis that agrees with all training examples, but otherwise disagrees the most with the XOR.
It doesn't matter what the algorithm does or what hypothesis set H is used. Whether H has a hypothesis that perfectly agrees with D (as depicted in the table) or not, and whether the learning algorithm picks that hypothesis or picks another one that disagrees with D (different green bits), it makes no difference whatsoever as far as the performance outside of D is concerned. Yet the performance outside of D is all that matters in learning!

This dilemma is not restricted to Boolean functions, but extends to the general learning problem. As long as f is an unknown function, knowing D cannot exclude any pattern of values for f outside of D. Therefore, the predictions of g outside of D are meaningless.

Does this mean that learning from data is doomed? If so, this will be a very short book. Fortunately, learning is alive and well, and we will see why. We won't have to change our basic assumption to do that. The target function will continue to be unknown, and we still mean unknown.

Figure 1.8: A random sample is picked from a bin of red and green marbles. The probability µ of red marbles in the bin is unknown. What does the fraction ν of red marbles in the sample tell us about µ?
1.3.2  Probability to the Rescue
We will show that we can indeed infer something outside D using only D, but in a probabilistic way. What we infer may not be much compared to learning a full target function, but it will establish the principle that we can reach outside D. Once we establish that, we will take it to the general learning problem and pin down what we can and cannot learn.

Let's take the simplest case of picking a sample, and see when we can say something about the objects outside the sample. Consider a bin that contains red and green marbles, possibly infinitely many. The proportion of red and green marbles in the bin is such that if we pick a marble at random, the probability that it will be red is µ and the probability that it will be green is 1 - µ. We assume that the value of µ is unknown to us.
We pick a random sample of N independent marbles (with replacement) from this bin, and observe the fraction ν of red marbles within the sample (Figure 1.8). What does the value of ν tell us about the value of µ?

One answer is that regardless of the colors of the marbles that we picked, we still don't know the color of any marble that we didn't pick. We can get mostly green marbles in the sample while the bin has mostly red marbles. Although this is certainly possible, it is by no means probable.

Exercise 1.8
If µ = 0.9, what is the probability that a sample of 10 marbles will have ν ≤ 0.1? [Hints: 1. Use the binomial distribution. 2. The answer is a very small number.]
The situation is similar to taking a poll. A random sample from a population tends to agree with the views of the population at large. The probability distribution of the random variable ν in terms of the parameter µ is well understood, and when the sample size is big, ν tends to be close to µ.

To quantify the relationship between ν and µ, we use a simple bound called the Hoeffding Inequality. It states that for any sample size N,

    P[ |ν - µ| > ε ] ≤ 2 e^(-2ε²N)    for any ε > 0.        (1.4)

Here, P[·] denotes the probability of an event, in this case with respect to the random sample we pick, and ε is any positive value we choose. Putting Inequality (1.4) in words, it says that as the sample size N grows, it becomes exponentially unlikely that ν will deviate from µ by more than our 'tolerance' ε.

The only quantity that is random in (1.4) is ν, which depends on the random sample. By contrast, µ is not random. It is just a constant, albeit unknown to us. There is a subtle point here. The utility of (1.4) is to infer the value of µ using the value of ν, although it is µ that affects ν, not vice versa. However, since the effect is that ν tends to be close to µ, we infer that µ 'tends' to be close to ν.

Although P[ |ν - µ| > ε ] depends on µ, as µ appears in the argument and also affects the distribution of ν, we are able to bound the probability by 2 e^(-2ε²N), which does not depend on µ. Notice that only the size N of the sample affects
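The bound in (1.4) is easy to check numerically. The sketch below (Python, not from the book; µ, N, ε, and the number of runs are illustrative choices) draws many samples of N marbles from a bin with a chosen µ and compares the observed frequency of |ν - µ| > ε against 2 e^(-2ε²N).

    import numpy as np

    rng = np.random.default_rng(0)
    mu, N, eps, runs = 0.5, 100, 0.1, 100_000

    # Each row is one sample of N marbles; True marks a red marble (probability mu)
    samples = rng.random((runs, N)) < mu
    nu = samples.mean(axis=1)                       # fraction of red marbles in each sample

    empirical = np.mean(np.abs(nu - mu) > eps)      # observed P[|nu - mu| > eps]
    hoeffding = 2 * np.exp(-2 * eps**2 * N)         # Hoeffding bound (1.4)
    print(empirical, "<=", hoeffding)               # roughly 0.035 <= 0.27 for these settings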
the bound, not the size of the bin. The bin can be large or small, finite or infinite, and we still get the same bound when we use the same sample size.

Exercise 1.9
If µ = 0.9, use the Hoeffding Inequality to bound the probability that a sample of 10 marbles will have ν ≤ 0.1, and compare the answer to the previous exercise.
If we choose ε to be very small in order to make ν a good approximation of µ, we need a larger sample size N to make the RHS of Inequality (1.4) small. We can then assert that it is likely that ν will indeed be a good approximation of µ. Although this assertion does not give us the exact value of µ, and doesn't even guarantee that the approximate value holds, knowing that we are within ±ε of µ most of the time is a significant improvement over not knowing anything at all.

The fact that the sample was randomly selected from the bin is the reason we are able to make any kind of statement about µ being close to ν. If the sample was not randomly selected but picked in a particular way, we would lose the benefit of the probabilistic analysis and we would again be in the dark outside of the sample.

How does the bin model relate to the learning problem? It seems that the unknown here was just the value of µ, while the unknown in learning is an entire function f: X -> Y. The two situations can be connected. Take any single hypothesis h in H and compare it to f on each point x in X. If h(x) = f(x), color the point x green. If h(x) ≠ f(x), color the point x red. The color that each point gets is not known to us, since f is unknown. However, if we pick x at random according to some probability distribution P over the input space X, we know that x will be red with some probability, call it µ, and green with probability 1 - µ. Regardless of the value of µ, the space X now behaves like the bin in Figure 1.8.

The training examples play the role of a sample from the bin. If the inputs x_1, ..., x_N in D are picked independently according to P, we will get a random sample of red (h(x_n) ≠ f(x_n)) and green (h(x_n) = f(x_n)) points. Each point will be red with probability µ and green with probability 1 - µ. The color of each point will be known to us, since both h(x_n) and f(x_n) are known for n = 1, ..., N (the function h is our hypothesis so we can evaluate it on any point, and f(x_n) = y_n is given to us for all points in the data set D).

The learning problem is now reduced to a bin problem, under the assumption that the inputs in D are picked independently according to some distribution P on X. Any P will translate to some µ in the equivalent bin. Since µ is allowed to be unknown, P can be unknown to us as well. Figure 1.9 adds this probabilistic component to the basic learning setup depicted in Figure 1.2.

With this equivalence, the Hoeffding Inequality can be applied to the learning problem, allowing us to make a prediction outside of D. Using ν to predict µ tells us something about f, although it doesn't tell us what f is. What µ tells us is the error rate h makes in approximating f. If ν happens to be close to zero, we can predict that h will approximate f well over the entire input space. If not, we are out of luck.

Unfortunately, we have no control over ν in our current situation, since ν is based on a particular hypothesis h. In real learning, we explore an entire hypothesis set H, looking for some h in H that has a small error rate. If we have only one hypothesis to begin with, we are not really learning, but rather 'verifying' whether that particular hypothesis is good or bad. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning.
Figure 1.9: Probability added to the basic learning setup. In addition to the unknown target function, the training examples, the hypothesis set, and the final hypothesis, the inputs are now generated by an unknown probability distribution P on X.
To do that, we start by introducing more descriptive names for the different components that we will use. The error rate within the sample, which corresponds to ν in the bin model, will be called the in-sample error,

    E_in(h) = (fraction of D where f and h disagree)
            = (1/N) Σ_{n=1}^{N} [ h(x_n) ≠ f(x_n) ],

where [statement] = 1 if the statement is true, and = 0 if the statement is false. We have made explicit the dependency of E_in on the particular h that we are considering. In the same way, we define the out-of-sample error

    E_out(h) = P[ h(x) ≠ f(x) ],

which corresponds to µ in the bin model. The probability is based on the distribution P over X which is used to sample the data points x.
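In code, E_in is just the fraction of mismatches on the data set. E_out can only be estimated, and only in a simulation where the target is known (as in Exercise 1.4), by evaluating h and f on fresh points drawn from P. The sketch below (Python, not the book's code) assumes h and f accept arrays of inputs and that P_sample is some user-supplied function that draws inputs from P.

    import numpy as np

    def E_in(h, X, y):
        """In-sample error: fraction of training points where h disagrees with the labels."""
        return np.mean(h(X) != y)

    def estimate_E_out(h, f, P_sample, n_test=100_000):
        """Monte Carlo estimate of the out-of-sample error on fresh points drawn from P.

        Only possible in a simulation where f is known; in real learning f is unknown.
        """
        X_test = P_sample(n_test)                 # draw n_test fresh inputs from P
        return np.mean(h(X_test) != f(X_test))    # fraction of disagreement with the target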
Figure 1.10: Multiple bins depict the learning problem with M hypotheses.

Substituting the new notation E_in for ν and E_out for µ, the Hoeffding Inequality (1.4) can be rewritten as

    P[ |E_in(h) - E_out(h)| > ε ] ≤ 2 e^(-2ε²N)    for any ε > 0,        (1.5)
where N is the number of training examples. The in-sample error E_in, just like ν, is a random variable that depends on the sample. The out-of-sample error E_out, just like µ, is unknown but not random.

Let us consider an entire hypothesis set H instead of just one hypothesis h, and assume for the moment that H has a finite number of hypotheses

    H = {h_1, h_2, ..., h_M}.

We can construct a bin equivalent in this case by having M bins, as shown in Figure 1.10. Each bin still represents the input space X, with the red marbles in the m-th bin corresponding to the points x in X where h_m(x) ≠ f(x). The probability of red marbles in the m-th bin is E_out(h_m), and the fraction of red marbles in the m-th sample is E_in(h_m), for m = 1, ..., M. Although the Hoeffding Inequality (1.5) still applies to each bin individually, the situation becomes more complicated when we consider all the bins simultaneously. Why is that? The inequality stated that

    P[ |E_in(h) - E_out(h)| > ε ] ≤ 2 e^(-2ε²N)    for any ε > 0,

where the hypothesis h is fixed before you generate the data set, and the probability is with respect to random data sets D; we emphasize that the assumption "h is fixed before you generate the data set" is critical to the validity of this bound. If you are allowed to change h after you generate the data set, the assumptions that are needed to prove the Hoeffding Inequality no longer hold. With multiple hypotheses in H, the learning algorithm picks
the final hypothesis g based on D, i.e., after generating the data set. The statement we would like to make is not

    " P[ |E_in(h_m) - E_out(h_m)| > ε ] is small "

(for any particular, fixed h_m), but rather

    " P[ |E_in(g) - E_out(g)| > ε ] is small "    for the final hypothesis g.

The hypothesis g is not fixed ahead of time before generating the data, because which hypothesis is selected to be g depends on the data. So, we cannot just plug in g for h in the Hoeffding inequality. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm.
Exercise 1.10
Here is an experiment that illustrates the difference between a single bin and multiple bins. Run a computer simulation for flipping 1,000 fair coins. Flip each coin independently 10 times. Let's focus on 3 coins as follows: c_1 is the first coin flipped; c_rand is a coin you choose at random; c_min is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie). Let ν_1, ν_rand and ν_min be the fraction of heads you obtain for the respective three coins.
(a) What is µ for the three coins selected?
(b) Repeat this entire experiment a large number of times (e.g., 100,000 runs of the entire experiment) to get several instances of ν_1, ν_rand and ν_min, and plot the histograms of the distributions of ν_1, ν_rand and ν_min. Notice that which coins end up being c_rand and c_min may differ from one run to another.
(c) Using (b), plot estimates for P[ |ν - µ| > ε ] as a function of ε, together with the Hoeffding bound 2 e^(-2ε²N) (on the same graph).
(d) Which coins obey the Hoeffding bound, and which ones do not? Explain why.
(e) Relate part (d) to the multiple bins in Figure 1.10.
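A minimal sketch of the simulation in Exercise 1.10 (Python, not the book's code; the number of runs here is kept small and should be increased for smooth histograms). Running it shows that ν_1 and ν_rand hover around µ = 0.5, while ν_min is biased well below it, which is the whole point of the multiple-bin picture.

    import numpy as np

    rng = np.random.default_rng(0)
    runs, coins, flips = 2000, 1000, 10

    nu_1, nu_rand, nu_min = [], [], []
    for _ in range(runs):
        heads = rng.integers(0, 2, (coins, flips)).mean(axis=1)   # fraction of heads per coin
        nu_1.append(heads[0])                                     # the first coin flipped
        nu_rand.append(heads[rng.integers(coins)])                # a coin chosen at random
        nu_min.append(heads.min())                                # the coin with fewest heads

    print(np.mean(nu_1), np.mean(nu_rand), np.mean(nu_min))       # ~0.5, ~0.5, well below 0.5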
The way to get around this is to try to bound P[ |E_in(g) - E_out(g)| > ε ] in a way that does not depend on which g the learning algorithm picks. There is a simple but crude way of doing that. Since g has to be one of the h_m's regardless of the algorithm and the sample, it is always true that

    " |E_in(g) - E_out(g)| > ε "   implies   " |E_in(h_1) - E_out(h_1)| > ε
                                              or |E_in(h_2) - E_out(h_2)| > ε
                                              ...
                                              or |E_in(h_M) - E_out(h_M)| > ε ",

where "B_1 implies B_2" means that event B_1 implies event B_2. Although the events on the RHS cover a lot more than the LHS, the RHS has the property we want: the hypotheses h_m are fixed. We now apply two basic rules in probability:

    if B_1 implies B_2, then P[B_1] ≤ P[B_2],

and, if B_1, B_2, ..., B_M are any events, then

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].

The second rule is known as the union bound. Putting the two rules together, we get

    P[ |E_in(g) - E_out(g)| > ε ]
        ≤ P[ |E_in(h_1) - E_out(h_1)| > ε  or ... or  |E_in(h_M) - E_out(h_M)| > ε ]
        ≤ Σ_{m=1}^{M} P[ |E_in(h_m) - E_out(h_m)| > ε ].

Applying the Hoeffding Inequality (1.5) to the M terms one at a time, we can bound each term in the sum by 2 e^(-2ε²N). Substituting, we get

    P[ |E_in(g) - E_out(g)| > ε ] ≤ 2 M e^(-2ε²N).        (1.6)

Mathematically, this is a 'uniform' version of (1.5). We are trying to simultaneously approximate all the E_out(h_m)'s by the corresponding E_in(h_m)'s. This allows the learning algorithm to choose any hypothesis based on E_in and expect that the corresponding E_out will uniformly follow suit, regardless of which hypothesis is chosen. The downside of uniform estimates is that the probability bound is a factor of M looser than the bound for a single hypothesis, and will only be meaningful if M is finite. We will improve on that in Chapter 2.
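To see how the factor of M degrades the bound, it helps to plug in some numbers. A quick sketch (Python; the values of N, ε, and M are illustrative only):

    import numpy as np

    def hoeffding_union_bound(M, N, eps):
        """Right-hand side of (1.6): 2 * M * exp(-2 * eps^2 * N)."""
        return 2 * M * np.exp(-2 * eps**2 * N)

    for M in (1, 100, 10_000):
        print(M, hoeffding_union_bound(M, N=1000, eps=0.05))
    # With N = 1000 and eps = 0.05 the bound is about 0.013 for M = 1,
    # about 1.3 for M = 100 (already vacuous), and about 135 for M = 10,000.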
1.3.3  Feasibility of Learning
We have introduced two apparently conflicting arguments about the feasibility of learning. One argument says that we cannot learn anything outside of D, and the other says that we can. We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible:

1. Let us reconcile the two arguments. The question of whether D tells us anything outside of D that we didn't know before has two different answers. If we insist on a deterministic answer, which means that D tells us something certain about f outside of D, then the answer is no. If we accept a probabilistic answer, which means that D tells us something likely about f outside of D, then the answer is yes.
Exercise 1.11
We are given a data set D of 25 training examples from an unknown target function f: X -> Y, where X = R and Y = {-1, +1}. To learn f, we use a simple hypothesis set H = {h_1, h_2}, where h_1 is the constant +1 function and h_2 is the constant -1 function.
We consider two learning algorithms, S (smart) and C (crazy). S chooses the hypothesis that agrees the most with D, and C chooses the other hypothesis deliberately. Let us see how these algorithms perform out of sample from the deterministic and probabilistic points of view. Assume in the probabilistic view that there is a probability distribution on X, and let P[ f(x) = +1 ] = p.
(a) Can S produce a hypothesis that is guaranteed to perform better than random on any point outside D?
(b) Assume for the rest of the exercise that all the examples in D have y_n = +1. Is it possible that the hypothesis that C produces turns out to be better than the hypothesis that S produces?
(c) If p = 0.9, what is the probability that S will produce a better hypothesis than C?
(d) Is there any value of p for which it is more likely than not that C will produce a better hypothesis than S?
By adopting the probabilistic view, we get a positive answer to the feasibility question without paying too much of a price. The only assumption we make in the probabilistic framework is that the examples in D are generated independently. We don't insist on using any particular probability distribution, or even on knowing what distribution is used. However, whatever distribution we use for generating the examples, we must also use it when we evaluate how well g approximates f (Figure 1.9). That's what makes the Hoeffding Inequality applicable. Of course, this ideal situation may not always happen in practice, and some variations of it have been explored in the literature.
2. Let us pin down what we mean by the feasibility of learning. Learning produces a hypothesis g to approximate the unknown target function f. If learning is successful, then g should approximate f well, which means E_out(g) ≈ 0. However, this is not what we get from the probabilistic analysis. What we get instead is E_out(g) ≈ E_in(g). We still have to make E_in(g) ≈ 0 in order to conclude that E_out(g) ≈ 0.

We cannot guarantee that we will find a hypothesis that achieves E_in(g) ≈ 0, but at least we will know if we find it. Remember that E_out(g) is an unknown quantity, since f is unknown, but E_in(g) is a quantity that we can evaluate. We have thus traded the condition E_out(g) ≈ 0, one that we cannot ascertain, for the condition E_in(g) ≈ 0, which we can ascertain. What enabled this is the Hoeffding Inequality (1.6):

    P[ |E_in(g) - E_out(g)| > ε ] ≤ 2 M e^(-2ε²N)
that assures us that E_out(g) ≈ E_in(g), so we can use E_in(g) as a proxy for E_out(g).
Exercise 1.12
A friend comes to you with a learning problem. She says the target function f is completely unknown, but she has 4,000 data points. She is willing to pay you to solve her problem and produce for her a g which approximates f. What is the best that you can promise her among the following:
(a) After learning, you will provide her with a g that you will guarantee approximates f well out of sample.
(b) After learning, you will provide her with a g, and with high probability the g which you produce will approximate f well out of sample.
(c) One of two things will happen: (i) you will produce a hypothesis g; (ii) you will declare that you failed. If you do return a hypothesis g, then with high probability the g which you produce will approximate f well out of sample.
One should note that there are cases where we won't insist that E_in(g) ≈ 0. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error. All we hope for is a forecast that gets it right more often than not. If we get that, our bets will win in the long run. This means that a hypothesis that has E_in(g) somewhat below 0.5 will work, provided of course that E_out(g) is close enough to E_in(g).

The feasibility of learning is thus split into two questions:

1. Can we make sure that E_out(g) is close enough to E_in(g)?
2. Can we make E_in(g) small enough?

The Hoeffding Inequality (1.6) addresses the first question only. The second question is answered after we run the learning algorithm on the actual data and see how small we can get E_in to be.

Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play. One such insight has to do with the complexity of these components.
The complexity of H. If the number of hypotheses M goes up, we run more risk that E_in(g) will be a poor estimator of E_out(g), according to Inequality (1.6). M can be thought of as a measure of the 'complexity' of the hypothesis set H that we use. If we want an affirmative answer to the first question, we need to keep the complexity of H in check. However, if we want an affirmative answer to the second question, we stand a better chance if H is more complex, since g has to come from H. So, a more complex H gives us more flexibility in finding some g that fits the data well, leading to small E_in(g). This tradeoff in the complexity of H is a major theme in learning theory that we will study in detail in Chapter 2.
The complexity of f. Intuitively, a complex target function f should be harder to learn than a simple f. Let us examine if this can be inferred from the two questions above. A close look at Inequality (1.6) reveals that the complexity of f does not affect how well E_in(g) approximates E_out(g). If we fix the hypothesis set and the number of training examples, the inequality provides the same bound whether we are trying to learn a simple f (for instance, a constant function) or a complex f (for instance, a highly nonlinear function). However, this doesn't mean that we can learn complex functions as easily as we learn simple functions. Remember that (1.6) affects the first question only. If the target function is complex, the second question comes into play, since the data from a complex f are harder to fit than the data from a simple f.

This means that we will get a worse value for E_in(g) when f is complex. We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower E_in(g), but then E_out won't be as close to E_in per (1.6). Either way we look at it, a complex f is harder to learn, as we expected. In the extreme case, if f is too complex, we may not be able to learn it at all.

Fortunately, most target functions in real life are not too complex; we can learn them from a reasonable D using a reasonable H. This is obviously a practical observation, not a mathematical statement. Even when we cannot learn a particular f, we will at least be able to tell that we can't. As long as we make sure that the complexity of H gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data.
1.4  Error and Noise
We close this chapter by revisiting two notions in the learning problem in order to bring them closer to the real world. The first notion is what approximation means when we say that our hypothesis approximates the target function well. The second notion is about the nature of the target function. In many situations, there is noise that makes the output of f not uniquely determined by the input. What are the ramifications of having such a 'noisy' target on the learning problem?
1.4.1  Error Measures
Learning is not expected to replicate the target function perfectly. The final hypothesis g is only an approximation of f. To quantify how well g approximates f, we need to define an error measure that quantifies how far we are from the target.

The choice of an error measure affects the outcome of the learning process. Different error measures may lead to different choices of the final hypothesis, even if the target and the data are the same, since the value of a particular error measure may be small while the value of another error measure in the same situation is large. Therefore, which error measure we use has consequences for what we learn. What are the criteria for choosing one error measure over another? We address this question here.

First, let's formalize this notion a bit. An error measure quantifies how well each hypothesis h in the model approximates the target function f,

    Error = E(h, f).

While E(h, f) is based on the entirety of h and f, it is almost universally defined based on the errors on individual input points x. If we define a pointwise error measure e(h(x), f(x)), the overall error will be the average value of this pointwise error. So far, we have been working with the classification error e(h(x), f(x)) = [ h(x) ≠ f(x) ].³

In an ideal world, e(·, ·) should be user-specified. The same learning task in different contexts may warrant the use of different error measures. One may view e(h(x), f(x)) as the 'cost' of using h when you should use f. This cost depends on what h is used for, and cannot be dictated just by our learning techniques. Here is a case in point.
xample 11 (Fingerprint veric ation) Cons ider the probl em of verifying that a ngerprint belongs to a particular person What is the appropriate error measure?
f
{+1-1
y
The target nction takes as input a ngerprint, and returns to the right person, and if it belongs to an intruder
+ if it belongs
k in the literture, nd sometimes the error
3 This meure is lso clled n error is rerred to s or
28
1 . HE EARN NG PROBLEM
1 .4 . RR OR AND OIS E
There are wo ypes of error ha our hypohesis can make here If he correc pers on is rejeced ( bu f i is called alse rejec, and if an incorrec pers on is acceped ( bu f i is called false accep
+ +
+
f
no error alse accep alse rejec no error How should he error measure be dened in his problem? If he righ person is acceped or an inruder is rejeced, he error is clearly zero We need o spec i he erro r values r a false accep a nd fr a false rejec The righ values depend on he applicaion Consider w o poenial clien s of his ngerprin sysem One i s a super marke who will use i a he checkou couner o verify ha you are a member of a discoun program The oher is he CIA who will use i a he enrance o a secure faciliy o verify ha you are auhori zed o ener ha ciliy For he supermarke, a false rejec is cosly because if a cusomer ges wrongly rejeced, she may be discouraged om paronizing he supermarke in he ure All fuure revenue om his annoyed cusomer is los . On he oher hand, he cos of a flse accep is minor You jus gave away a discoun
+
o someone whomus di dn' i, and ha pers on lef heir ngerprin in your sysem hey be deserve bold indeed For he CIA, a false accep is a disaser An unauhorized person will gain access o a highly sensiive aciliy This should be reeced in a much higher cos r he flse accep False rejecs , on he oher hand, can be oleraed since auhorized persons are employees (raher han cusomers as wih he supermarke) The inconvenience of rerying when rejeced is jus p ar of he job, and hey mus deal wih i The coss of he dieren ypes of errors can be abulaed in a marix For our examples, he marices migh look like:
f
+ +
Supermarke
+ f + CIA
These marices should be used o weigh he dieren ypes of errors when we compue he oal error When he learning algorihm minimizes a cos weighed error measure, i auomaically akes ino consideraion he uiliy of he hypohesis ha i will produce In he supermarke and CIA scenario s, his could lead o wo compleely dieren nal hypoheses
D
The moral of his example is ha he choice of he error measure depends on how he sysem is going o be used, raher han on any inheren crierion 29
1 . HE EARNI NG PROBLEM
1.4. RROR AND OISE
UKW UT DTRUT TRG EXMLE
HYTHE ET
Figure
he general (supervsed) learnng problem
that we can independently determine during the learning proc ess However, this ideal choice may not be possible in practice fr two reasons One is that the user may not provide an error specication, which is not uncommon The other is that the weighted cost may be a dicult objective nction r optimizers t o work with There re, we often look r other ways to dene the error measure, sometimes with purely practical or analytic considerations in mindinWe already an example thismeasures with th e insimple error used thishave chapter, and seen we will see otheroferror later binary chapters
1. . 2
Nisy Targets
In many practical applications, the data we learn om are not generated by a deterministic tar get functi on Inste ad, they are generated in a noisy way such that the output is not uniquely determined by the input For instance, in the creditcard example we presented in Section two customers may have identical salaries, outstanding loans, etc, but end up with dierent credit behavior Therere, the credit unction' is not really a deterministic fnction,
1 HE EARNING PROBLEM
1.4 RROR AND OISE
but a noisy one This situation can be readily modeled within the same amework that we have Inste ad of () , we can take the output to be a random variable that is aected by, rather than determined by, the input Formally, we have a target distributio ( ) instead of a target fnction f () A data point (, ) is now generated by the joint distribution ( , ) ()( )
I
I
can think r of aexample, noisy target as atake deterministic target plusofadded noise If One is real-valued one can the exp ected value given to be th e determ inistic () , and consi der () as pure noise that is added to This view suggests that a deterministic target nction can be considered a special case of a noisy target , just with zero noise Indeed, we can rmally express any unction as a distribution ( ) by choosing ( ) to be zero or all except () Therere , there is no loss of generality if we consider t he target to be a distribution rather than a function Figure 1 1 1 modies the previous Figures 1.2 and 1.9 to illustrate the general learning problem, covering both deterministic and noisy targets
I
I
erie
at a es a n er ror ith o ailty in a oxim atin g a etermi nisti tag et n ction ( boh an a bina f nctions) I we use the same o aoimae a nis ersion onsier he in moel r a hoesis
of gi vn b
a hat i s th e oabi lt of eo tat ma es i n a roxim ating (b) t at alue o ill the errmance o e ieenen t
[nt e nos target w oo omete anom
I
There is a dierence between the role of ( ) and the role of () in the learning problem While both distributi ons model probabilistic aspects of and , the target distribution ( ) is what we are trying to learn, while the input distribution () only quanties the relative importance of
I
the Our po intentire in gauging wellfasibility we have learned analysis how of the of learning applies to noisy target nctions as well Intuitively, this is becaus e the Hoeding Inequality ( 1 6) applies to a n arbitrary, unknown target nction Assume we randomly picked all the 's according to the distribut ion ( ) over the ent ire input space X . This realization of ( ) is eectively a target nction Therere, the inequality will be valid no matter which particular random realization the target function' happens to be This does not mean that learning a noisy target is as easy a learning a deterministic one Remember the two questions of learning? With the same learning model Eu may be as close to Ein in the noisy case as it is in the
I
I
31
1 HE EAR NIN G PROBLEM
1 .4 . RROR AND OISE
deterministi ase, but Ein itself will likely be worse in the noisy ase sine it is hard to t the noise In Chapter 2, where we prove a stronger version of ( 1 . 6) , we will as sume the target to be a probability distribution P(y I x) thus overing the general ase.
32
1 . HE EARNI PROBEM
1.5
15 PROBEMS
Prlms
Problem 11
We hae opaque bag s each cont ain ing bas. One bag has bac k ba ls and the other has a bl ack and a whit e bal . You pick a bag at random a nd then pick one o f the bal s in that bag at ran dom . When y ou look at the ba l l it is bl ack. You now pic k the sec ond bal l fro m that same bag. What is the pr oba bi ity that this bal is al so ba ck? {Hi Use Byes eorem
A d B]
A I B] BJ B I A] A]
Problem 1.2 Consider the perceptron in two dimensions h(x) sign(wTx) were w [o , , 2 r and x [1, , 2 r Technicaly x has
th ree coor di nates but we ca l th is pe rceptr on twodimension al because the rs t coordinate is xed at 1
(a) Show tat the regio ns o n the pl ane wher e h(x) + 1 and h(x) - 1 are separated by a l in e. If we express th is in e by the equa tion 2 a + what are the sope a a nd interce pt in term s of o, , 2 ? (b ) Draw a pictu re fr th e cases w [1, , 3 and w -[1, 3.
In more than two dimensions the
+1 and 1 regions are separated by a
y
ere the generaization of a ine.
Problem 1.
roe that the LA eentually conerges to a linear separa tor r separa ble data . The fll owing st eps wil l guide you through the proof. Let w* be an optima set of weights (one which separates the data). Th e essentia idea i n t his proof is to s how that the LA weights w(t) get more aigned" with w* with eery iteration. For si mp li city assum e that w(O) 0
(a) Let p min: N Y(wx) Sho w that > . (b) Sh ow tat wT (t)w* � wT(tl)w*+p, and concude that
[Hi se iducio ]
wT(t) w* � tp
(c) Show tat ll w (t) ll 2 : ll w(t 1) 2 + x(t - 1) 2
{Hi y(t - 1) (wT( t l) x(t 1)) becuse x( t 1) ws miscs sied b w( t - 1) j (d) Show by induction that w(t) 2 tR2 , where R m N x · (cntined n next pge)
33
1 HE EARNIN PROBEM
1 5 PROBEMS
() Using ( b) and (d ) show that WT(t) w * w(t)
t ·
'
and hnc pro that
Hint:
ll w tl/ ll w *ll
1 Wy?J
In practic LA conrgs mor quickly than th bound suggsts p Nrthlss bcaus w do not know in adanc w cant dtrmin th nu mb r of itrat ions to con rgnc which dos pos a problm if th dat a i s nonsparabl
Problem 14
In Exrcis w us a n a rti cia l d ata s t to study th pr cp tro n lar ning al gori thm This problm lads you to xp lor th algor ithm fu rthr with data st s of difrnt sizs a n d di mns ions Gnrat linarly sparabl in Exr aot th xa m pls data { (x,stY)of }siz as wl l as 20 thastarindicatd gt fun ction on ( a ) cis a p la n B sur to m ark th xam pl s from dif rnt class s d ifrnt ly an d add la bls t o th ax s of th p lot b un th prc ptro n la rnin g algori th m on th data s t a bo port th nu mb r of updat s that th a gor ithm tak s bf r con rging lot th xampls { (x, Y) } th targ t fun ction and th n a hypo thsis i n th sam gur Commnt on wh thr is clo s to c pat rything in b with anothr randomly gnratd data st of siz 20 Compar your rsults with b
()
() () () (d ) pat rything in ( b ) with anothr randomly gnratd data st of siz 100 Compar your rsults with ( b ) () pat rything in ( b) with anothr randomly gnratd data st of siz 1 000 Compar your rsults with ( b ) Modify th agorithm such that it taks of J1 instad J (f) dom ly g nra t a li narly sparabl data s t ofEsiz Ean 000 with J and e d th data st to th a lgor ith m How many u pdat s dos th a lgorithm ta k to conr g?
(g) pat th algor
()
ithm on th sam data st as f r 100 xprimnts In t h itratio ns of ach xprim nt pick x(t) randoml y i nstad of dtrmi n istica lly l ot a hi stogram r th n u m br of u pdats that th a lgorith m taks to conrg
( h ) Summariz your conclusions with rspct to accuracy and running tim as a function of
N and
34
1 HE EARNI PROBEM
15 PROBEMS
Th prcpt ron larni ng agori th m works lik this In ach it Problem 15 t pick a ran dom t) yt)) and comput th signal ' t) = wt)t) t) 0 updat w by
ration If yt)
wt + 1)
wt) + yt)
t) ;
On may that this algorithm doslook not tak th closnss' and in to considrati on Lt's at a nothr pr cptbtwn ron arni ng algot) yt) argu rith m I n ach i tration, pi ck a random t) , yt)) and comput t) If yt) t) 1, updat w by
wt + 1)
w t) + '
yt) t)) t) ,
'
whr is a cons ta nt Tha t is, if t) agrs with yt) wl thir product is 1 th a lgo rithm dos nothing On th othr han d, if t) is furthr from yt) th a gorithm ch a ngs wt) mor In this problm, you ar askd to implmnt this algorithm and study its prrmanc
>
a ) Gn rat a train ing data s t of siz 100 si mi a r to that us d in Exrcis 14 Gn rat a tst data st of siz 10,000 from th sam procss To gt run th algorithm abo with = 100 on th training data st, until a maxium of 1 , 000 upd ats has bn r achd lot t h train in g data
g
'
st, th th targt andst th nal hypothsis port rrorfunction on th , tst b ) Us th da ta st in a) a nd rdo ryth in g with c ) Us th data s t in a ) a nd rdo rythi ng w ith d ) Us th data s t in a) an d rdo rythi ng with ) Com par th r su lts tha t you gt from a) to d )
g on th sam gur
' = 1 ' = 001 ' = 00001
T h algor ith a bo is a ariant o f th s o calld Adali n euron) algorithm r prcptron larning
daptive Linear
Considr a sampl of 10 ma rbls drawn i ndpndntly fr om a bi n that holds rd a nd grn marbl s Th probab i ity of a rd ma rbl is µ Forµ= 005 µ = 05 an d µ = 08 comp ut th proba bi lity o f gtting no rd marbls = 0 in th l lowing ca ss
Problem 16
a) W dra w on y on such s am pl Comput th probab il ity that = 0 b) W draw 1, 000 indpndnt samps Comput th pro babi ity tha t at last) on of th sampls has = 0 c) pat b) r 1, 000, 000
indpndnt s am pls
35
1 . PROBEMS
1 . HE EA RNIN PROBEM
Problem 1. 7
A sample of heads and tals s created by tossng a con a n um ber o f tmes ndepend ently Assume w e have a n um ber of cons that gene rate dferent sam ples ndep end ent ly For a gv en con let the probab lty of heads (probablty of error) be µ The probablty of obtanng heads n N tosses of ths co n s gven by the b nom a l dstrb uton
Remember that the tran
ng err or s
(a ) Assue the sam ple s ze ( N ) s 10 I f all the c ons ha ve µ = 5 compute the probablty that at least one con wll have = r the case of 1 con 1 cons 1 co ns Repeat r µ = 8. (b) For the case N = 6 and 2 cons wth µ = .5 r both cons plot the probablty
P[mix I V - µi > E ] or E n the ra nge O 1 ] (the mx s over co ns) On the sa me pl ot show the
bound that would be ob taned usng the Hoe fdng I nequ alt y Remember that r a sngle co n the Hoef d ng bo und s
+
+
[Hint Use P[A or B] = PA] P[B] P[A and BJ = P[A] P[B] P[A]P[B] where the last equality llows by independene to evaluate P[mx ]}
Problem 1.8 The Hoefdng Inequalty s one rm of the law of large numbers. One of the smplest frms of that law s the Chebyshev Inequality whch you wll prove here
t s a non nega tve random va rabl e prove that fr any > [t � ] (t)/ (b) If u s a ny random varabl e wth mean µ and varance prove that r any > [(u µ) ] [Hint Use a (c) If u1 U are d random varables each wt h mean µ and varance and u = l U prove th at fr any > (a) If
[ (u µ) 2 ]
. Na
Notce that the RHS of ths Chebyshev Inequalty goes down lnearly n N, wh le the c ounterpart n H oefdng s I neq ual ty goes down exponental ly In roblem 1.9 we d evelop an exp onental bound usng a s m lar a ppr oach
36
.
5 PROBLEMS
HE EARNIN PROBLEM
Prob lem 1 9
n t his proble we derive a fr of the law of la rge n u bers that has an eponential bound called the Cheo bound. We cus on the siple case o f lipping a f ai r coin a nd use an a ppro ach si il ar to roble 18 (a) Let be a ( nite) rando variable positive paraeter. f
a be a positive constant and
() ()
prove that
s monotonay nreasng n. u, , uNbe iid rando variables and let () ( ) (r any prove that [Hnt
(b) Let
be a
l n
f
[un O] [u n ] � (fai r coin ) Eval uate () as and iniize () with respect to fr ed a, O
hence the boun d is eponential ly decre asing in N (c) Suppose a function of
{x , x , . . , xN, N , . . . , xN M } {, } f (x , ), , (xN, YN ) h f M Eo(h, f) M h(xN ) f(N ) . l (a) Say f (x) r all x and odd and M N h(x) -,, r x k a nd isotherwise What is Eo(h, f)? (b) We say that a targ et function f can generate V in a noiseless setting if Yn f (xn) r all (xn, Yn ) E D For a ed V of size N how a ny possible f X can generate V in a noiseless setting? (c) For a given hypothesis a nd a n i nteger bet ween and M how any of those f i n (b) satis fy Eo (h, f) ?
Problem 110
Assue that X and with an unknown target function X The training data set V is . Dene the oftranngset errorof a hypothesis with respect to by
{
( d) For a given hyp othesis h, if all those that generate V in a noiseless settin g are equ al ly li kely i n proba bi lity what is the epected of trai ni ng set error
[Eo(h, )]?
(continued on next pge)
1 HE EARI G PROBEM
1.5. PROBEMS
(e) A deteri nistic al gorith A is dened as a procedure that takes V as an input and outputs a hypothesis A(V) Argue that fr a ny two deter i nistic algoriths A and
2
h
You have now proved that in a noiseless setting r a ed
V if al l possible
are equ al lyo likely any wo det tersgeneral o f the epected traini ngt set erroeri r Sinistic i la algori r resultsth s canarebeequiva pr ovedlent rinore settings
Problem 1 11
The at ri which ta bu lates the cos t of various er rors r the CA and Superarket applications in Eaple 11 is called a rs or oss
matrx
For the two risk atrices in Eaple 11 epl icitly write do wn th e i n saple error that on e should i ni ize to o btain g This in-saple error should weight t he di ferent type s of error s based on t he risk at ri [Hnt: Consder 1 and - 1 separatey.]
Ei Yn
Yn =
proble tigates w ch easu re Problem 12re sult Th ca n cha nge 1the of isthe l earnini nves g proces s ho You ha an veginNg the dataerror points 1 :
YN and wish to estiate a representative value (a) f your a lgo rith is to nd the hypoth esis su of squared deviations
h tha t ini iz es th e in
saple
Ei(h) (h - Yn ) 2 , nl then sho w that your es tiate will be the in sapl e ean
(b)
hm N1 Yn nl f your a lgorith is to nd the hypothe sis h tha t ini iz es the in su of absolute deviations
Ei (h)
saple
h- , nl Yn
hm
then sho w th at your es tiate will be the in sapl e edian which is any val ue r which ha lf the data points are at o st and half t he data points are at least (c) Suppose is perturbed to where So the single data point beco es a n outl ier What hap pens to y ou r two estiator s and
Y
Y
hm YN
hm?
38
hm
hm
Chapr
aining versus Testing Bere the nal exam, a prossor may hand out some pratie problems and solutions to the lass Although these problems are not the exat ones that will appear on the exam, studying them will help you do better They are the training set in your learning If the profssors goal is to help you do better in the exam, why not give out theis not examtheproblems themselves? nieistry the exam goal in and of itself Well, The goal fr you toDoing learnwell the in ourse material The exam is merely a way to gauge how well you have learned the material If the exam problems are known ahead of time, y our perrmane on them will no longer aurately gauge how well you have learned. The same distintio n between trainin g and testing happens in learnin g om data In this hapter, we will develop a mathematial theory that haraterizes this dist inti on We will also disuss the oneptual and pratial imp liations of the ontrast between training and testing
@
21
Ther y f Gene ralizatin
The outofsample error
Eou measures how well our training on D has geer
to data we spae have not bere ifEou based the perrmane alizedthe over entirethat input X . seen Intuitively, we iswant to on estimate the val ue of Eou using a sample of data points, these points must be esh test points that have not been used r training, similar to the questions on the nal exam that have not been used r pratie The in sample error Ein, by ontrast, is based on data points that have been used r training It expressly measures training perfrmane, similar to your perrmane on the pratie probl ems that you got b ere the nal exam Suh perrmane has the benet of looking at the solutions and adjusting aordingly, and may not reet the ultimate perfrmane in a real tes t We began the analysis of in-sample error in Chapt er 1 , and we will extend this 39
2.1. HEORY OF ENERALIZATION
2. RAININ VERSUS ESTIN
analysis to th e general ase in this hapter We will also make the ontrast between a training set and a test set more preise A word of warni ng: this hapter is the hea viest in this boo k in terms of mathematial abstration To make it easier on the notsomathematially inlined, we will tell you whih part you an safly skip without losing the plot The mathematial results provide ndamental insights into learning om data, and we will interpret these results in pratial terms
eneralization error. We have already disussed how the value of Ein does not always generalize to a similar value of Eout eneralization is a key issue in learning One an dene the geealizatio error as the disrepany between Ei and Eout The Hoeding Inequality provides a way to haraterize the generalization error with a probabilisti bound,
6
r any E > 0 This an be rephrased as fllows Pik a tolerane level example 005 , and assert with probability at least that
2M
2 2 6
8
fr
2
We rer to the type of inequality in as a geealizatio boud beause it bounds Eout in terms of Ein To see that the Hoeding Inequality implies this generalization bound, we rewrite as fllows: with probability at least IEout Ein E whih implies Eout Ein E We may now llows om whih E ln and identi otie that the other side of IEout E in E also holds, that is, Eout Ein E r all This is important r learning, but in a more subtle way ot only do we want to know that the hypothesisg that we hoose (say the one with the best training error) will ontinue to do well out of sample (ie, Eout Ei E but we also want to be sure that we did the best we ould with our (no other hypothesis has Eout ) signiantly better than Eout (g)) The Eout () Ein ) E diretion of the bound assures us that
2 ME E
+
2
2
we do hosen muh better beause every hypothesis with a higher Ein than the ouldn't g we have will have a omparably higher Eout or error bar if you will, depends The error bound ln in on the size of the hypothesis set . If is an innite set, the bound goes to innity and beomes meaningless Unfrtunately, almost all interesting learning moels have innite inluding the simple pereptron whih we disussed in Chapter In order t o study general ization in suh m oels, we need t o derive a oun terpart to that deals with innite We woul like to replae with
2
1 oetes generalzaton error s used
M
anoter na e r ut but not n ts book 40
. RAINING VERSUS ESTING
. 1 . HEOR Y OF ENE RALI ZATIO N
something nite, so that the bound is mea ningfl. To do this, we notie that the way we got the tor in the rst plae was by taking the disjuntion of events
"JEin(h1 ) Eou(h1)J > E" or "JEi n (h2 ) Eou(h 2 ) J > E" or (2.2) whih is guaranteed to inlude the event " JEin (g) Eou (g) J > E" sine g is al ways one o f the hypotheses in . We then over-estimated the probability using the union bound. Let be the (ad ) event that "JE in(h m) Eou( hm)J > E" Then,
If the events , , , are strongly overlapping, the union bound beomes par tiularly loose s illustrated in the gure to the right r an example with 3 hypotheses the areas of dierent events orrespond to their probabilities The union bound says that the total area overed by 1, , or is smaller than the sum of the individual ar eas, whih is true but is a gross overestimate when the areas overlap heavily as in this ex ample The events "JEin(hm) Eou(hm)J > E" m 1 , are often strongly overlap ping. If h1 is very similar to h 2 r instane, Eou(h1)J > E" and "JEin(h 2 ) Eou(h 2 ) J > E" are the two events JEin(h1) likely to oinide r most data sets . In a typial learning model, many hy pothes es are indeed very similar. If you take the perep tron model r instane, as you slowly vary the weight vetor w you get innitely many hypotheses that dier om eah other only innitesimally The matheatial theory of generalization hinges on this observation One we properly aount r the overlaps of the dierent hypotheses, we will be able to replae the number of hypotheses in (21) by an eetive number whih is nite even when is innite, and establish a more usefl ondition under whih Eou is lose to Ein
2. 1. 1
Eeive Nmber o f Hypoheses
We now introdue the groth fuctio the quantity that will rmalize the eetive number of hypothese s The growth fntion is what will repla e 41
2 RAININ VERSU S ESTING
2 1 HEOR OF ENERALI ZATI ON
in he generalizaion bound (2 1 ) I is a combinaorial quaniy ha cap ures how dieren he hypoheses in are, and hence how much overlap he dieren evens in (22) have We will sar by dening he growh ncion and sudying is basic prop eri es ex , we will show how we can bound he value of he growh ncio n Finally, we will show ha we can replace in he generalizaion bound wih he growh ncio n These hree seps will yield he generalizaio n bound ha we need, which applies o innie We will cus on binary arge fncions r he purpose of his analysis, so each maps X o { 1 , 1 } The deniion o f he growh unc ion i s based on he number o f dieren hypoheses ha can implemen, bu only over a nie sample of poins raher han over he enire inpu space X If is applied o a nie sample , , X, we ge an N-uple (), , ( ) of ± 's uch an N-uple is called a dichotom since i splis , , ino w o groups hse poins r which is 1 and hose r which is 1 Each generaes a dichoomy on , , , bu wo dieren 's may generae he same dichomy if hey happen o give he same paern of ±1 's on his paricular sample
Let , these poits are deed b
Dnition 21
, X
The dichotomies geeated b
o
)) I
(, , ) { ((), , ( } (23) One can hin of he dichoom ies ( , , ) a se of hypoheses jus lie is, excep ha he hypoheses are seen hrough he eyes of N poins only A larger ( , , ) means is more diverse' generaing more dichoomies on , , The growh ncion is based on he number of dichoomies
Dnition 22
The groth fuctio is deed for a hpothesis set
b
here deotes the cardialit umber of elemets of a set In words, m (N) is he maximum number of dichoomies ha can be gen eraed by on any N poins To compue (N) , we consider all possible choices of poins , , om X and pic he one ha gives us he mos dichoomies Lie , (N ) is a measure of he number of hypohe ses in , excep ha a hypohesis is now considered on poins insead of he enire X F any , since ( , ,) {1, } (he se of all possible dichoomies on any N poins), he value of (N) is a mos {1,} , hence (N ) 2
If is capable of generaing all possibl e dichoomies on , (, , ) { 1, 1 } and we say ha can shatter , signies ha is as diverse as can be on his paricular sample 42
, , hen , This
2 . 1 HEORY OF ENERAZA TON
2 RANN VERSUS ESTN
•
a
c
Figure llstration of the growth fnction r a two dimensional er cetron h dichotomy of red verss le on the 3 colinear oints in art a cannot e generated y a ercetron, t all 8 dichotomies on the 3 oints in art can y contrast the dichotomy of red verss le on the 4 oints in art c cannot e generated y a ercetron t most 14 ot of the ssile 16 dichotomies on any 4 oints can e generated
Eapl 21 If X is a Euclidean plane and is a wo-dimensional percep ron, wha are () and () Figure 2 a shows a dichoomy on poins ha he perceron canno generae, while Figure 2 b shows anoher poins ha he percepron can shaer, generaing all 2 8 dichoomies
Because deniion based on he 2 maximum number of di of ( N)heis case choomies,he() 8 inofspie in Figure a) In he case of poins , Figure 2 ( c shows a dichoomy ha he perc epron canno generae One can verify ha here are no poins ha he percepron can shaer The mos a percepro n can do on any poins is dichoomies ou of he possible , where he 2 missing dichoomies are as depiced in Figure 2 c wih blue and red corresponding o - , + or o +, Hence,
()
D
Le us now illusrae how o compue H(N) r some simple hypohesis ses These examples will conrm he inuiion ha ( N) grows fser when he hypohesis se becomes more complex This is wha we expec of a quaniy ha is mean o replace in he generalizaion bound 2 )
Eapl 22 Le us nd a frmula r H(N) in each of he llowi ng cas es
Positive s : co nsiss of all hypoheses h { , +} of he frm h ( ) sign ie, he hypoheses are dened in a one-dimensional inpu space, and hey reurn o he lef of some value and + o he righ f
2 1 . HEORY OF EN ERALI ZATIO N
2. RAININ VERSUS ESTIN
N(N)+,
N
N
To compue m1 we noice ha given poins , he line is spli by he poins ino regions The dichoomy we ge on he poins is decided by which region conains he value As we vary we will ge N + dieren dicho omies ince his is he mos we can ge r any poins, he growh funcion is
N
N(N)
N+
oice ha if we piced poins where some of he poins coincided which is allowed , we will ge less han dichoomies This does no aec he value of m1 sinc e i is dened based on he maximum number of dichoomies
Positive itevals: consiss of all hypoheses in one dimension ha reurn + wihin some inerval and - oherwise Each hypohesis is specied by he wo end values of ha inerval
(N), N + N (( N) N+l + N2 +N,N + N ) J2 + (N) N N + + N
To ha given poins, we hege lineis isdecided again m1 inowe noice splicompue regions by he poins The dichoomy by which wo regions conain he end values of he inerval, resuling in dieren dichoomi es If boh end values fll in he same region, he resuling hypohesis is he consan - regardless of which region i is Adding up hese possibiliies, we ge
N
m1 m1 N )
=
oice ha grows as he square of ear m1 of he simpler posiive ray case
3. Cove sets { -,
faser han he lin
co nsiss of all hypohes es in wo dimensions
h:
} ha are posiive inside some convex se and negaive elsewhere
a se is convex if he line segmen connecing any wo poins in he se lies enirely wihin he se To compue m1 in his case, we need o choose he poins care lly Per he nex gure, choos e N poins on he perimeer of a circle ow consider any dichoomy on hese po ins , assigning an arbirary pa ern of ± 's o he poins If you connec he poins wih a polygon, he hypohesis made up of he clos ed inerior of he polygon which has o be convex since is verices are on he perimeer of a circle agrees wih he dichoomy on all poins For he dichoomies ha have less han hree poins, he convex se will be a line segmen, a poin, or an empy se
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
This mean s ha any dichoomy on hese N poins can be realize d using a convex hpohe sis so manages o shaer hese poins and he grow h funcion has he maximum possible value
oice ha if he N poins were chosen a random in he plane raher han on he perimeer of a circle many of he poins would be inernal' and we wouldn' be able o shaer all he poins wih conve x hypoheses as we did r he perimeer poins However his doesn' maer as far as (N) is concerned since i is dened based on he maximum ( in his case)
D
I is no pracical o ry o compue ( N) r every hypohesis se we use Forunaely we don' have o ince (N) is mean o replace in (), we can use an upper bound on (N) insead of he exac value and he inequaliy in () will sill hold eing a good bound on (N) will prove much easier han compuing ( N) iself hans o he noion of a brea
poit
f o data set of size ca be shattered b the is said to be a brea poit for If is a brea poin hen () Example shows ha is a Dn ition 2 . .
brea poin r wo-dimensional perceprons In general i is easier o nd a < brea poin r han o compue he ll growh ncion r ha
Eri 2.1 By inspet in nd a break pint k fr each hypthesis se in Eaple 22 Verify th at mk < using the r u las der ive d in tha E apl e.
k
if ther e is ne .
We now use he brea poin o derive a bound on he growh ncion (N) r all values of N For example he ac ha no poins can be shaered by
RAININ VERSUS ESTIN
. 1 . HEORY OF ENER AIZATION
he wodimensional percepr on pus a signican consrain on he nu mber of dichoomies ha can be realized by he percepron on or more poins We will exploi his idea o ge a signican bound on (N) in general
2. 1. 2
Bodig the Groth Fctio
The mos imporan fc abou growh ncions is ha if he condiion = 2 breas a any poin, we can bound ( N) r all values of N by a simple polynomial based on his brea poin The c ha he bound is polynomial is crucial Absen a brea poin (as is he case in he convex hypohesis example), ( N = 2 r all N If (N replaced in Equaln ion ( 2 1 ) , he bound on he generalizaion error would no go o zero regardless of how many raining examples N we have However, if ( N) can be bounded by a polynomial any polynomial , he generalizaion error will go o zero as N This means ha we will generalize well given a sucien number of examples
(N)
a kip: If you rus our mah, you can he llowing par wihou compromising he sequence A similar green box will ell you when rejoin To prove he polynomial bound, we will inroduce a combinaorial quaniy ha couns he maximum number of dichoomies given ha here is a brea poin, wihou having o assume any paricular rm of This bound will herere apply o any
) is the maimum umber of dichotomies o N poits such that o subset of size of the N poits ca be shattered b these di chotomies The deniion of (N, assumes a brea poin hen ries o nd he Dnition 24 (N,
mos dichoomies on N poins wihou imposing any frher resricions
ince ( N, is dened as a maximum, i will serve as an upper bound r any (N ) )ha has a brea poin ;
(N ) (N, )
if is a bre a poin r
The noaion comes om inomial' and he reason will become clear shorly To evaluae (N, ) we sar wih he wo boundary condiions = 1 and N = 1 (N, 1) (, )
1 2 r
>
1
. 1 . HEORY OF ENER AIZATIO N
RAININ VERSUS ESTIN
B ( N, 1) = 1 r all since if no subse of size 1 can be shaered, hen only one dichoomy can be allowed A second dieren dichoomy mus dier on a leas one poin and hen ha subse of size 1 would be shaered B( , k) = r k > 1 since in his case h ere do no even exis subses of size k; he consrain is vacuously rue and we have possible dichoomies ( + 1 and 1 ) o n he on e poin o develop a recursion Consider N in and k and B (N,now k) assume , ry he We dichoomies deniion where no k poins can be shaered We lis hese dichoomies in he llowing able, # of rows
X1 X2
XN - 1 XN +1 +1
+1 1
+1 1 1 +1 +1 1 1 1
1 1 +1 +1
1 +1
+ 11 11 +1 1 1 1
+1 1 +1 +1
+1 1 1 1
+1 1
+ 1 +1 1 +1
S1
S2
where x1 · · · XN in he able are labels r he N poins of he dichoomy We have chosen a convenien order in which o lis he dicho omies, as llows Consider he dichoomies on x · · X N · ome dichoomies on hese N poins appear only once (wih eiher +1 or 1 in he X N column, bu no
l
boh) dichappear oomieswice, in heonce se S1 on he We rscollec poins wihThe + 1remaining and once dichoomies wih 1 in N 1hese he X N column We collec hese dichoomies in he se S2 which can be divided ino wo equal pars, S and (wih +1 and 1 in he XN column, respecively) Le S1 have rows, and le and have rows each ince he oal number of rows in he able is B(N, k) by consrucion, we have
t
B(N, k)
st
= +
()
The oal number of dieren dichoomies on he rs N 1 poins is given by + ; since and S are idenical on hese N 1 poins, heir di choomies are redundan ince no subse of k of hese rs N 1 poins can
st
.
. .
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
be shaered (since no -subse of all N poins can be shaered) we deduce ha (N ) ()
-
-
of he rs N poins can by deniion of Furher no subse of size be shaered by he dichoomies in If here exised such a subse hen aing he corresponding se of dichoomies in and adding XN o he daa
st
poins yields a subse of size ha is shaered which we now canno exis in his able by deniion of ( N, ). Therere (N
,
)
()
ubsiuing he wo Inequaliies () and () ino (), we ge (N,
) (N
-
)
(N
)
()
We can use () o recursively compue a bound on (N, ) as shown in he llowing able
N
\
where he rs row (N ) and he rs column ( ) are he bound ary condiions ha we already calculaed We can also use he recursion o bound ( N, k) analyically
Lemma 2 (auer's Lemma) (N, )
The saemen is rue whenever or N by inspecion The proof is by inducion on N Assume he saemen is rue f r all N N and all We need o prove he saemen for N N and all ince he saemen is already rue when = (fr all values of N) by he iniial condiion we only need o worry abou By () ,
(N
, ) (N , ) (N - )
.
.
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
Applying he inducion hypohesis o each erm on he RH, we ge ( + )
<
+
+
+
c
+
+
N°t
"
( 0 ) ( i01 )
+ where he combinaorial ideniy has been used . This ideniy can be proved by noicing ha o calculae he number of ways o pic i objecs om N + disinc objecs, eiher he rs objec is included, ways, or he rs objec is no included, in ways We have in hus proved he inducion sep, so he saemen is rue r all and
0
0
7
I urns o ha (N ) in fc equals (see Problem ) bu we only need he inequaliy of Lemma o bond he growh ncion For a given brea poin he bound is polynomial in N as each erm in he sum is polynomial (of degree i ) ince ( N ) is an upper bound on any (N) ha has a brea poin we have proved
7
End a kip: Those who sipped are now rejoining us The nex heorem saes ha any growh fncion ( N) wih a brea poin i s bounded by a polyno mial Theorem 24 If ()
< r some value hen ()
r all N The RH is polynom ial in of degree
H
The implicaion of Theorem is ha if has a brea poin, we have wha we wan o ensure good generalizaion; a polynomial bound on (N)
.
.
RAININ VERSUS ESTING
xr
(a ) erf the b ou n of eoe
HEORY OF ENERAIZATON
n the ree cas es o a l e 2
( ostve ras conssts o all hotheses n one enson f he sgn () oste nterals onss ts o a l l hyoteses n one eson ta a re pos tve w t n soe nterv al a n negatve elsew here.
( ) onve sets onss ts o a l l hotes es n tw o ens ns hat a re ost ve nse soe one set an nega ve elsewhee. ote ou an use e reak onts ou un n Eercse . (b oes he re est a htss s et wh ch (where s he large st nteg e )
Lj
2. 1. 3
/ 2
The VC Dimesio
Theorem 2.4 bounds he enire growh ncion in erms of any br ea poin The smaller he br ea poin he bee r he bound This leads us o he l lowing deni ion of a single parameer ha characerizes he grow h fncion
Dnition 25
Th apnikChrvonnkis dimnsion of a hypothsis st
. If by d N( simply is thd( lar gst=valu 2dnotd ll d N thn o f N for hich N 2or for If d is he C dimension of hen k = d + 1 is a brea poin r since ( N canno equal 2 r any N > d by deniion I is easy o see ha no smaller brea poin exiss since can shaer d poins hence i can also shaer any subse of hese poins
xr
Coute the C ensn o of E ercse . (a .
r e ho ess sets n
ince k d + 1 is a brea poin r Theorem erms of he C dimension
N
dvc
N
2.4
arts ( (
can be rewrien in
()
Therere he VC dimension is he order of he polynomial bound o n ( N I is also he bes we can do using his line of reasoning because no smaller brea poin han k d + 1 exiss The rm of he polynomial bound can be rher simplied o mae he dependency on d more salien We sae a usel rm here which can be proved by inducion Problem 2.5.
50
(2.10
.
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
ow ha he growh ncion ha s been bounded in erms of he C dimen sion we have only one more sep lef in our analysis which is o replace he number of hypoheses in he generalizaion bound () wih he growh ncion (N) If we manage o do ha he C dimension will play a pivoal role in he generalizaion quesion If we were o direcly repla ce by H(N) in (), we would ge a bound of he rm
dvcH
Unless = we now ha H(N) is bounded by a polynomial in N hus ln (N) grows logarihmically in N regardless o f he order of he poly nomial and so i will be crushed by he facor Thereore r any xed olerance 8 he bound on Eou will be arbirarily close o Ein r sucienly large N Only if = will his argumen fail as he growh ncion in his case is exponenial in N For any nie value of he error bar will converge o zero a a spe ed deermined by since is he order of he polynomial The smaller is he aser he convergence o zero I urns ou ha we canno jus replace wih (N) in he generaliza
dvcH
dvc,
dvc
dvc
dvc,
ion bound bu general raher we need o mae oher and adjusmens we play will see () he shorly. However idea above is correc willassill he role ha we discussed here One implicaion of his discussion is ha he re is a division of models ino wo classes. The good models ' have nie and r sucienly large N Ein will be close o Eou r good models he in-sample perfrmance generalizes o ou of sample The bad models ' have innie Wih a bad model no maer how large he daa se is we canno mae generaliaion conclusions om Ein o Eou based on he C analysis Because of is signican role i is worhwhile o ry o gain some insigh abou he C dimension bere we proceed o he frmaliies of deriving he new generaliaion bound One way o gain insigh abou is o ry o compue i fr learning mo dels ha we are fa miliar wih Percepr ons are on e case where we can compue exacly Thi s is don e in wo se ps Firs we show ha is a leas a cerain value hen we show ha i is a mos he same value There is a logical d ierence in arguing ha is a leas a cerain value as opposed o a mos a cerain value This is because
dvc
dvc,
dvc
dvc
dvc
dvc 2 N
2
dvc
dvc
here it
D of size N such ha shaers D
hence we have dieren conclusions in he ollowing cases.
There is a se of N poins ha can be shaered by In his case we can conclude ha N
dvc 2
2 In some cases wit innite suc as te convex sets tat we discussed alternative analysis bed on an average growt function can establis good generalization beavior
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
Any se of poins can b e shaered by In his case , e have more han enough inrmaion o conclude ha
There is a se of poins ha canno be shaered by Based only on his inrmaion, e canno conclude anyhing abou he value of
o se of poins can be shaered by In his case, e can conclude ha < xeri
d
onsie he i nu sae in lu ing he n stnt o oi nate iensin e eretrn wi = ow at e eters cuntin g is eatl sowig tat i is at l east t ost as llow s a sow at c 1 n tat te rer n oit s in can sater. su osu m
wose rows ese e os e se e osgu/ o gue e eero se ese os
b
show tat c sow h no set oints in can e shatee te ereon Reee o e use e s e e
ers o e e o e e eee s mes some eo s e o o e oe eors u oose e ss o ese e veos eu e e ass o e ee e w e e ue ere s ome om co e memee eee
+
The C dimension of a dimensional percepron is indeed This is consisen ih Figure fr he case = which shows a C dimension of The percepron case provides a nice inuiion abou he C dimension, since + is also he number of parameers in his model One can view he C dimension as measuring he eecive number of parameers The more parameers a model has, he more diverse is hypohesis se is, hich is reeced in a larger value of he growh ncion H ( ) In he case of perceprons, he eecive parameers correspond o explici parameers in he model, namely W In oher models, he eecive parameers may be less obvious or implici The C dimension measures hese eecive parameers or degrees of eedom ha enable he model o express a diverse se of hypoheses Diversiy is no necessarily a good hing in he conex of generalizaion For example, he se of all possible hypoheses is as diverse as can be, so H() = r all and = In his case, no generalizaion a all is o be expeced, as he nal version of he generalizaion bound will show
d
3
{} d is consider ed diensional since te rs t coordinate is xed
..
RAININ VERSUS ESTIN
2 1
HEORY OF ENERAIZATION
The VC Geeralizati Bud
If we reaed he growh funcion as an eecive number of hypoheses and replaced in he generalizaion bound ( 2 1) wih (N) he resuling bound would be () 1 (211) Eout(g) Ein(g) ln
N
+
I urns ou ha his i s no exacly he rm ha will hold The quaniies in red need o be e chnically modied o mae (2 1 1 ) rue The correc bound which is called he C generalizaion boun d is given in he llowing heorem; i holds r any binary arge ncion any hypohesis se any learning algorihm and any inpu probabiliy disribuion P
A
Thor 2.5 (C generalizaion bound) For any olerance
Eout(g) wih probabiliy
1
Ein(g)
+
N
ln
> 0,
()
(212)
If you compare he blue iems in (2 12 ) o hei r red counerpars in ( 2 1 1 ) you noice ha all he blue iems move he bound in he weaer direcion How ever as long as he C dimension is nie he error bar sill converges o zer o (albei a a slower rae) since (N) is also polynomial of order in N jus lie (N) This means ha wih enough daa each and every hypoth esis in an innie wih a nie C dimension will generalize well om Ein o Eout The ey is ha he eecive number of hypoheses represened by he nie growh funcion has replaced he acual number of hypoheses in he bound The C generalizaion bound is he mos imporan mahemaical resul in he heory of learning I esablishes he asibiliy of le arning wih innie hypohe sis ses ince he rmal proof is somewha lenghy and echnical we illusrae he main ideas in a sech of he proof and include he rmal pro of as an appendix There are wo pars o he proof; he jusicaio n ha he growh funcion can replace he number of hypoheses in he rs place and he reason why we had o change he red iems in (211) ino he blue iems in (212)
dc
kth o th proo.
The daa se is he source of randomizaion in he srcinal Hoeding Inequaliy Consider he space of all possible daa se s Le us hin of his space as a canvas' (Figure 2 2(a) ) Each is a poin on ha canvas The probabiliy of a poin is deermined by which X 's in X happen o be i n ha paricul ar and is calculaed based on he disribuion P over X . Les hin of probabiliies of dieren evens as areas on ha canvas so he oal area of he canvas is 1
.
RAININ VERSUS ESTIN
2 1 HEORY OF ENERA IZATIO N
space of daa ses
a oeng nequalty b non oun
c oun
Figure 22 llustraton of the proof of the boun, where the canvas represents the space of all ata sets, wth areas corresponng to probabl tes a or a gven hypothess, the colore ponts correspon to ata sets where oes not generalze well to Eout· he oeng nequalty guar antees a small colore area. b or several hypotheses, the unon boun assumes no overlaps, so the total colore area s large. c he boun keeps track of overlaps, so t estmates the total area of ba generalzaton
to be relatvely small
For a given hypohesis he even "IEi(h) Eout(h)I > E" consiss of all poins r which he saemen is rue For a paric ular h, le us pain all hese bad' poins using one color Wha he basic Hoeding Inequaliy ells us is ha he colored area on he canvas will be small Figure 22 a ow, if we ae anoher he even "IEi(h) Eout(h)I > E" may conain dieren poins, since he even depends on Le us pain hese poins wih a dieren color The area covered by all he poins we colored will be a mos he sum of he wo individual areas, which is he case only if he o areas have no poins in common This is he wors case ha he union bound consi ders If we eep hrowing in a new colored area r each and never overlap wih previous colors, he canvas will soon be mosly covered in coor Figure 22 b Even if each h conribued very lile, he sheer number of hypoheses will even ually mae he colored are a cover he whole canvas This was he problem wih using he union bound in he Hoed ing Inequaliy ( 1 6 ) , and no aking he over laps o f he co lored areas ino consider aion The bul of he C proof deals wih how o accoun r he overlaps Here is he idea If you were old ha he h ypoheses in are such ha each poin on he canvas ha is colored will be colored 100 imes because of 100 dieren 's , hen he oal colored area is now 1 /100 of wha i would have been if he colored poins had no overlapped a all This is he essence of he C bound as illusraed in Figure 2 2 c The argumen goes as llows
,
,
,
. RAINING VERSUS ESTING
..
NTERRETING THE OUND
Many hypoheses share he same dichoomy on a given D since here are niely many dichoomies ev en wih an innie number of hypohese s Any saemen based on D alone will be simulaneously rue or simulaneously alse r all he hypoheses ha loo he same on ha paricular D Wha he growh fncion enables us o do is o accoun fr his ind of hypohesis redundancy in a precise way so we can ge a fcor similar o he 100' in he above example When is innie he redundancy acor will also be innie since he hypoheses will be divided among a nie num ber of dichoomies Therere he reducion in he oal colored area when we ae he redundancy ino consideraion will be drama ic If i happens ha he number of dichoom ies is only a polynomial he reducion will be so dramaic as o bring he oal probabiliy down o a very small value This is he essence o f he proof of Theorem 25 The reason ( 2N) appears i n he C bound inse ad of (N) is ha he proof uses a sample of 2N poins insead of N poins Why do we need 2N poins The even "IE in( h) Eout(h )J > " depends no only on D bu also on he enire X because Eout (h) is based on X This breas he main premise of grouping 's based on heir behavior on D since aspecs of each h ouside of D aec he ruh of "JEin(h) Eout(h) J > " To remedy ha we consider he aricial even "ID are om based Ein(h) n(h)Jof >size" Ninsead on wo samples and DEeach This where is whereEin heand 2NEn comes I accouns fr he oal size of he wo samples D and D ow he ruh of he saemen I Ein(h) En (h)J > " depends exclusively on he oal sample of size 2N and he above redundancy argumen will hold Of course we have o jusi why he wosample condiion " JEin ( h) En (h)J > " can replace he srcinal condiion "JEin(h) Eout(h)J > " In doing so we end up having o shrin he 's by a fcor of and also end up wih a fcor of 2 in he esimae o f he overall probabiliy. This accoun s fr he insead of in he C bound and r having insead of 2 as he muliplicaive cor of he growh fncion When you pu all his ogeh er you ge he rmula in ( 2 12 )
2.2
Intting th Gnalizatin Bund
The C generalizaion bound (212) is a universal resul in he sense ha i applies o all hypohesis ses learning algorihms inpu spaces probabiliy disribuion s and binary arge fncions I can be exended o oher ypes of arge fncions as well iven he generaliy of he resul one would suspec ha he bound i provides may no be paricularly igh in any given case sinc e he same boun d has o cover a lo of dieren cases Indeed he bound is quie loose
2 RAININ VERSUS ESTIN
2.2. NTERRETIN THE OUND
.5
Suose e ave a simle learning moel ose grw function is ence se e oun o esti = mate he proai lit that ot will e ithin of given 1 aining examles. nt e estmate w be ros
mN =
n
Why is the C bound so loose? The slack in the bound can be attributed to a number of technical fctors Among them 1 The basic Hoeding Ine quality used in the proof already has a slack The inequality gives the same bound whether Eou is close to 05 or close to zero However the variance of Ein is quite dierent in these two cases Therefre having one bound capture both cases will result in some slack 2 Using H(N) to quantify the number of dichotomies on N points re gardless of which N points are in the data set gives us a worst-case estimate This does allow the bound to be independent of the pro ability distribution P over X. However we would get a more tuned bound if we considered specic XN and used I ( XN ) I or
H
its expected value instead of the upper bound For instance in the case of convex sets in two dimensio ns whichH(N) we examined in Exam ple 22 if you pick N points at random in the plane they will likely have fr fwer dichotomies than 2N while H ( N) = 2 N
Bounding H (N) by a simple poly nomial of order will contribute further slack to the C boun d
dvc, as given in (2 10)
Some eort could be put into tightening the C bound but many highly technical attempts in the literature have resulted in only diminishing returns The reality is that the C line of analysis leads to a very loose bound Why did we bothe r to go through the analysis then? To reasons First the C analysis is what establishes the asibility of learning r innite hypothesis sets the only k ind we use in practice Second although the bound is loose it tends to be equally loose r dierent learning models and hence is usel r comparing the generalization perrmance of these model s This is an observation om practical experien ce not a mathematical statement In real applications learning models with lower tend to generalize better than those with higher Because of this observation the C analysis proves usel in practice and some rules of thumb have emerged in terms of the C dimens ion For instan ce requiring that N be at least 10 x to get decent generalization is a popular rule of thumb Thus the C bound can be used as a guideline fr generalization relatively if not absolutely With this understanding let us look at the dierent ways the bound is used in practice
dvc
dvc
dvc
56
2. RAININ VERSUS ESTIN
2. 2. 1
2.2. NTERRETIN THE OUND
Sample omp lexity
The sample complexity denotes how many training examples are needed to achieve a certain generalization errmance. The perfrmance is specied by two parameters, and The error tolerance determines the allowed generalization error, and the condence parameter determines how often the error tolerance is violated How st grows as and become smaller indicates databound is needed to get good generalization We canhow usemuch the VC to estimate the sample complexity r a given learning model. Fix > and supose we want the generalization error to be at most om quation the generalization error is bounded by ln ln It llows that and so it suces to make
E
E
8.
E
N
8 0, (2.12),
E.
N
E 8
E.
N >- E ln 4m(2N) 8 suces to obtain generalization error at most E (with probability at least 1 8). This gives an imlicit bound r the sample complexity N, since N appears on both sides of the inequality If we replace m (2N) in (2.12) by its polynomial upper bound in (2.10) which is based on the the VC dimension, we get a similar bound N E ln 4((2N) 8dvc + l) ' (2.13) which is again implicit in N. We can obtain a numerical value fr N using simple iterative methods Eapl 26 Suppose that we have a learning model with dc = 3 and would like the generalization error to be at most 0.1 with condence 90% (so E = 0.1 and 8 = 0.1). How big a data set do we need? Using (2.13), we need N -> 0.1 n 0.1 + . ying an initial guess of N = 1, 000 in the RHS , we get N>
ln
x 1000) +
�
21 193.
0.1 N 21, 1930.1 ' N 30, 000. N 40, 000. 5, N 50, 000.4, 10,000, 10.
We then try the new value = in the RHS and continue this iterative process, rapidly converging to an estimate of If dc were a similar calculation will nd that For dc = we get ou can see that th e inequality suggests that the number of examples needed is approximately proportional to the VC dimension, as has been observed in practice The constant of proporti onality it suggests is which is a gss overestimate; a more practical constant of proportionality is closer to D 4 Te ter coplexty coes o a slar etapor n coputatonal coplexty
57
2. RANN VERSUS ESTNG
2. 2. 2
2.2. NTERRETN THE OUND
Pealy r Model Complexiy
Sample coplexiy xes he perfrmance parameers E generalizaion error and 8 condence parameer and esimaes how many examples are neede d In mos pracical siuaions, however, we are given a xed daa se , so is also xed In his case, he relevan quesion is wha perfrmance can we expec given his paricular The bound in answers his quesion:
N
8N (22) out( g ) � in (g ) + N ln
wih probailiy a leas
N
.
If we use he polynomial bound based on d insead of anoher valid bound on he ou-of-sample error,
m ( 2N), we ge (2 4) out (g) � in (g ) + N ln 4 ((2N)dvc8 + ) Eapl 7 Suppose ha N = 00 and we have a 90 condence require men 8 = 0) We could ask wha error bar can we oer wih his condence,
if 1 has d =
Using
we have
(20) � in (g) + 0 4 (2.5) out( g) . in (g ) (2+ 4),OO ln 4 wih condence � 90. This is a prey poor bound on out· ven if in = 0, may sill b e close o If N = , 000, hen we ge aut( g ) � in( g ) + 030, out D a somewha more respecable bound Le us look more closely a he wo pars ha make up he bound on in The rs par is and he second par is a erm ha increases as he C dimension of 1 increases
(2.2).
where
out
in , out( g) in (g) + (N, 8)
(26)
r(N, 8) <
N ln
8
+
One way o hink of 1 8) is ha i is a penaly fr model complexiy I when we use a more complex 1 penalizes us by worsening he bound on larger d If someone manages o a simpler model wih he same raining
r(N,
out
5
2.2. NTERRETIN THE OUND
2. RAININ VERSUS ESTIN
c
C
dimension,
vc
Figure 23: hen we use a more comlex learning model, one that has higher dimension , we are likely to t the training data better re sulting in a lower in samle error, but we ay a higher penalty or model comlexity combination o the two, which estimates the ou t o samle error, thus attains a minimum at some intermediate �•
error, they ill get a more fvorable estimate r Eou The penalty (N gets orse if e insist on higher condence (loer · and it gets better hen , e have more training examples, as e ould expect Although (N , goes up hen has a higher C dimension, Ei is liely to go don ith a higher C dimension as e have more choices ithin to t the data Therere, e have a tradeo: more complex models help Ei and hurt (N , The optimal model is a compromise that minimizes a combination of the to terms, as illustrated inrmal ly in Figure 23 2. 2. 3
The Test Set
As e ave seen, the generalization bound gives us a loose estimate of the out-of-sample error Eou based on Ei While the estimate can be usel as a guideline r the training process, it is next to useless if the goal is to get an accurate recast of Eou f you are developing a system r a customer, you need a more accurate estimate so that your customer nos ho ell the system is expected to perrm. An alternative approach that e alluded to in th e beginning of this chapter is to estimate Eou by using a test set, a data set that as not involve d in the training proc ess. The nal hypothesis g is evaluated on the test set , an d the result is taen an estimate of Eou We ould lie to no tae a closer loo at this approach. Let us call the error e get on the test set Ees When e report Ees as our estimate of Eou e are in ct asserting that Ees generalizes very ell to Eou After all Ees is just a sample estimate lie Ei Ho do e no 59
2 RAININ VERSUS ESTIN
2.2 NTERRETIN THE OUND
tat Ees genrazes e? W can anser ts quston t autorty no tat e av deveoped te teory of gnerazaton n concrte matematca ters Te ectve number of ypotess tat atters n te gnerazaton be avor of Ees s 1 Ter s ony one ypotess as fr as te test set s concerned and tat s te na ypotess g tat te tranng pase produced Ts ypotess ou d not cange f used a derent tst st as t ou d f usd a drent tranng set Terefre t smpe Hoedng nequaty s vad n t case of a tst set Had te coce of g ben aectd by te test set n any sape or rm t oudnt be consdered a test set any ore and te spe Hodng nquaty oud not appy Terere te generazaton bound tat appes to Ees s te spe Hoedng nequaty t one ypote ss Ts s a uc tgter bound tan te C boun For exampe f you av 1 , 000 data ponts n te test s et Ees be tn ± of Eou t probabty � Te bgger te test set you use t more accura te Ees be as an estate of Eou Eris
Exercise 2.6
A data set has 600 examples. To properly test the performance of the final hypothesis, you set aside a randomly selected subset of 200 examples which are never used in the training phase; these form a test set. You use a learning model with 1,000 hypotheses and select the final hypothesis g based on the 400 training examples. We wish to estimate Eout(g). We have access to two estimates: Ein(g), the in-sample error on the 400 training examples; and Etest(g), the test error on the 200 test examples that were set aside.
(a) Using a 5% error tolerance (δ = 0.05), which estimate has the higher 'error bar'?
(b) Is there any reason why you shouldn't reserve even more examples for testing?
Another aspect that distinguishes the test set from the training set is that the test set is not biased. Both sets are finite samples that are bound to have some variance due to sample size, but the test set doesn't have an optimistic or pessimistic bias in its estimate of Eout. The training set has an optimistic bias, since it was used to choose a hypothesis that looked good on it. The VC generalization bound implicitly takes that bias into consideration, and that's why it gives a huge error bar. The test set just has straight finite-sample variance, but no bias. When you report the value of Etest to your customer and they try your system on new data, they are as likely to be pleasantly surprised as unpleasantly surprised, though quite likely not to be surprised at all.

There is a price to be paid for having a test set. The test set does not affect the outcome of our learning process, which only uses the training set. The test set just tells us how well we did. Therefore, if we set aside some
of the data points provided by the customer as a test set, we end up using fewer examples for training. Since the training set is used to select one of the hypotheses in H, training examples are essential to finding a good hypothesis. If we take a big chunk of the data for testing and end up with too few examples for training, we may not get a good hypothesis from the training part even if we can reliably evaluate it in the testing part. We may end up reporting to the customer, with high confidence mind you, that the g we are delivering is terrible ☺. There is thus a tradeoff to setting aside test examples. We will address that tradeoff in more detail, and learn some clever tricks to get around it, in Chapter 4.

In some of the learning literature, Etest is used as a synonym for Eout. When we report experimental results in this book, we will often treat Etest based on a large test set as if it were Eout, because of the closeness of the two quantities.

2.2.4 Other Target Types
Although the VC analysis was based on binary target functions, it can be extended to real-valued functions, as well as to other types of functions. The proofs in those cases are quite technical, and they do not add to the insight that the VC analysis of binary functions provides. Therefore, we will introduce an alternative approach that covers real-valued functions and provides new insights into generalization. The approach is based on bias-variance analysis, and will be discussed in the next section.

In order to deal with real-valued functions, we need to adapt the definitions of Ein and Eout that have so far been based on binary functions. We defined Ein and Eout in terms of binary error; either h(x) = f(x) or else h(x) ≠ f(x). If h(x) and f(x) are real-valued, a more appropriate error measure would gauge how far h(x) and f(x) are from each other, rather than just whether their values are exactly the same. An error measure that is commonly used in this case is the squared error e(h(x), f(x)) = (h(x) − f(x))². We can define in-sample and out-of-sample versions of this error measure. The out-of-sample error is based on the expected value of the error measure over the entire input space X,

Eout(h) = E_x[ (h(x) − f(x))² ],

while the in-sample error is based on averaging the error measure over the data set,

Ein(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − f(x_n))².

These definitions make Ein a sample estimate of Eout, just as it was in the case of binary functions. In fact, the error measure used for binary functions can also be expressed as a squared error.
Exercise 2.7
For binary target functions, show that P[h(x) ≠ f(x)] can be written as an expected value of a mean-squared-error measure in the following cases:
(a) The convention used for the binary function is 0 or 1.
(b) The convention used for the binary function is ±1.
[Hint: The difference between (a) and (b) is just a scale.]
Just as the sample frequency of error converges to the overall probability of error (per Hoeffding's Inequality), the sample average of squared error converges to the expected value of that error (assuming finite variance). This is an instance of what is referred to as the 'law of large numbers', and Hoeffding's Inequality is just one form of that law. The same issues of the data set size and the hypothesis set complexity come into play, just as they did in the VC analysis.
2.3 Approximation-Generalization Tradeoff
The VC analysis showed us that the choice of H needs to strike a balance between approximating f on the training data and generalizing on new data. The ideal H is a singleton hypothesis set containing only the target function. Unfortunately, we are better off buying a lottery ticket than hoping to have this H. Since we do not know the target function, we resort to a larger model, hoping that it will contain a good hypothesis, and hoping that the data will pin down that hypothesis. When you select your hypothesis set, you should balance these two conflicting goals: to have some hypothesis in H that can approximate f, and to enable the data to zoom in on the right hypothesis.

The VC generalization bound is one way to look at this tradeoff. If H is too simple, we may fail to approximate f well, and end up with a large in-sample error term. If H is too complex, we may fail to generalize well because of the large model complexity term. There is another way to look at the approximation-generalization tradeoff, which we will present in this section. It is particularly suited to squared error measures, rather than the binary error used in the VC analysis. The new way provides a different angle; instead of bounding Eout by Ein plus a penalty term Ω, we will decompose Eout into two different error terms.
2.3.1 Bias and Variance

The bias-variance decomposition of out-of-sample error is based on squared error measures. The out-of-sample error is

Eout(g^(D)) = E_x[ (g^(D)(x) − f(x))² ],   (2.17)
where E_x denotes the expected value with respect to x (based on the probability distribution on the input space X). We have made explicit the dependence of the final hypothesis g on the data D, as this will play a key role in the current analysis. We can rid Equation (2.17) of the dependence on a particular data set by taking the expectation with respect to all data sets. We then get the expected out-of-sample error for our learning model, independent of any particular realization of the data set,

E_D[ Eout(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))² ] ]
 = E_x[ E_D[ (g^(D)(x) − f(x))² ] ]
 = E_x[ E_D[ g^(D)(x)² ] − 2 E_D[ g^(D)(x) ] f(x) + f(x)² ].
The term E_D[ g^(D)(x) ] gives an 'average function', which we denote by ḡ(x). One can interpret ḡ(x) in the following operational way. Generate many data sets D_1, ..., D_K and apply the learning algorithm to each data set to produce final hypotheses g_1, ..., g_K. We can then estimate the average function for any x by ḡ(x) ≈ (1/K) Σ_{k=1}^{K} g_k(x). Essentially, we are viewing g(x) as a random variable, with the randomness coming from the randomness in the data set; ḡ(x) is the expected value of this random variable (for a particular x), and ḡ is a function, the average function, composed of these expected values. The average function ḡ is a little counterintuitive; for one thing, ḡ need not be in the model's hypothesis set, even though it is the average of functions that are.
Exercise 2.8
(a) Show that if H is closed under linear combination (any linear combination of hypotheses in H is also a hypothesis in H), then ḡ ∈ H.
(b) Give a model for which the average function ḡ is not in the model's hypothesis set. [Hint: Use a very simple model.]
(c) For binary classification, do you expect ḡ to be a binary function?
We can now rewrite the expected out-of-sample error in terms of ḡ:

E_D[ Eout(g^(D)) ] = E_x[ E_D[ g^(D)(x)² ] − 2 ḡ(x) f(x) + f(x)² ]
 = E_x[ E_D[ g^(D)(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)² ]
 = E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] + (ḡ(x) − f(x))² ],

where the last reduction follows since ḡ(x) is constant with respect to D. The term (ḡ(x) − f(x))² measures how much the average function that we would learn using different data sets D deviates from the target function that generated these data sets. This term is appropriately called the bias:

bias(x) = (ḡ(x) − f(x))²,
63
2 . 3 ROXI MATION ENERAIZATION
2 RAININ VERSUS ESTIN
as it measures how much our learning model is biased away from the target function.⁵ This is because ḡ has the benefit of learning from an unlimited number of data sets, so it is only limited in its ability to approximate f by the limitations in the learning model itself. The term E_D[ (g^(D)(x) − ḡ(x))² ] is the variance of the random variable g^(D)(x),

var(x) = E_D[ (g^(D)(x) − ḡ(x))² ],

which measures the variation in the final hypothesis, depending on the data set. We thus arrive at the bias-variance decomposition of out-of-sample error,

E_D[ Eout(g^(D)) ] = E_x[ bias(x) + var(x) ] = bias + var,

where bias = E_x[bias(x)] and var = E_x[var(x)]. Our derivation assumed that the data was noiseless. A similar derivation with noise in the data would lead to an additional noise term in the out-of-sample error (Problem 2.22). The noise term is unavoidable no matter what we do, so the terms we are interested in are really the bias and the var.

The approximation-generalization tradeoff is captured in the bias-variance decomposition. To illustrate, let us consider two extreme cases: a very small model (with one hypothesis) and a very large one with all hypotheses.
Very small model. Since there is only one hypothesis, both the average function ḡ and the final hypothesis g^(D) will be the same, for any data set. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target f, and unless we are extremely lucky, we expect a large bias.

Very large model. The target function is in H. Different data sets will lead to different hypotheses that agree with f on the data set, and are spread around f in the red region. Thus, bias ≈ 0 because ḡ is likely to be close to f. The var is large (heuristically represented by the size of the red region in the figure).
One can also view the variance as a measure of 'instability' in the learning model. Instability manifests in wild reactions to small variations or idiosyncrasies in the data, resulting in vastly different hypotheses.

⁵ What we call bias is sometimes called bias² in the literature.
Example 2.8. Consider a target function f(x) = sin(πx) and a data set of size N = 2. We sample x uniformly in [−1, 1] to generate a data set (x_1, y_1), (x_2, y_2), and fit the data using one of two models:

H_0: Set of all lines of the form h(x) = b;
H_1: Set of all lines of the form h(x) = ax + b.

For H_0, we choose the constant hypothesis that best fits the data (the horizontal line at the midpoint, b = (y_1 + y_2)/2). For H_1, we choose the line that passes through the two data points (x_1, y_1) and (x_2, y_2). Repeating this process with many data sets, we can estimate the bias and the variance. The figures which follow show the resulting fits on the same (random) data sets for both models.

With H_1, the learned hypothesis is wilder and varies extensively depending on the data set. The bias-var analysis is summarized in the next figures.

[Figure: the average hypothesis ḡ (red), with var(x) indicated by the gray shaded region that is ḡ(x) ± √var(x). For H_0: bias = 0.50, var = 0.25. For H_1: bias = 0.21, var = 1.69.]

For H_1, the average hypothesis ḡ (red line) is a reasonable fit with a fairly small bias of 0.21. However, the large variability leads to a high var of 1.69, resulting in a large expected out-of-sample error of 1.90. With the simpler
model, the fits are much less volatile and we have a significantly lower var of 0.25, as indicated by the shaded region. However, the average fit is now the zero function, resulting in a higher bias of 0.50. The total out-of-sample error has a much smaller expected value of 0.75. The simpler model wins by significantly decreasing the var at the expense of a smaller increase in bias. Notice that we are not comparing how well the red curves (the average hypotheses) fit the sine. These curves are only conceptual, since in real learning we do not have access to the multitude of data sets needed to generate them. We have one data set, and the simpler model results in a better out-of-sample error on average as we fit our model to just this one data set. However, the var term decreases as N increases, so if we get a bigger and bigger data set, the bias term will be the dominant part of Eout, and H_1 will win. □
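The operational recipe of Example 2.8 (fit many independent data sets of size N = 2, then average) is easy to simulate. The sketch below is our own illustration, assuming NumPy and ignoring the zero-probability case of two identical x values; it reproduces numbers close to those quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)

K = 10000                        # number of independent data sets
x_test = np.linspace(-1, 1, 1000)

params_h0 = np.empty(K)          # b for h(x) = b
params_h1 = np.empty((K, 2))     # (a, b) for h(x) = a*x + b
for k in range(K):
    x = rng.uniform(-1, 1, 2)
    y = f(x)
    params_h0[k] = y.mean()                  # best constant fit (midpoint)
    a = (y[1] - y[0]) / (x[1] - x[0])        # line through the two points
    params_h1[k] = (a, y[0] - a * x[0])

def bias_var(predictions):
    # predictions: K x len(x_test) array of g^(D)(x) values
    g_bar = predictions.mean(axis=0)
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean(predictions.var(axis=0))
    return bias, var

pred_h0 = np.tile(params_h0[:, None], (1, x_test.size))
pred_h1 = params_h1[:, :1] * x_test + params_h1[:, 1:]
print("H0: bias, var =", bias_var(pred_h0))   # roughly 0.50, 0.25
print("H1: bias, var =", bias_var(pred_h1))   # roughly 0.21, 1.69
```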
The learning algorithm plays a role in the bias-variance analysis that it did not play in the VC analysis. Two points are worth noting.

1. By design, the VC analysis is based purely on the hypothesis set H, independently of the learning algorithm A. In the bias-variance analysis, both H and the algorithm A matter. With the same H, using a different learning algorithm can produce a different g^(D). Since g^(D) is the building block of the bias-variance analysis, this may result in different bias and var terms.

2. Although the bias-variance analysis is based on squared-error measures, the learning algorithm itself does not have to be based on minimizing the squared error. It can use any criterion to produce g^(D) based on D. However, once the algorithm produces g^(D), we measure its bias and variance using squared error.

Unfortunately, the bias and variance cannot be computed in practice, since they depend on the target function and the input probability distribution (both unknown). Thus, the bias-variance decomposition is a conceptual tool which is helpful when it comes to developing a model. There are two typical goals when we consider bias and variance. The first is to try to lower the variance without significantly increasing the bias, and the second is to lower the bias without significantly increasing the variance. These goals are achieved by different techniques, some principled and some heuristic. Regularization is one of these techniques that we will discuss in Chapter 4. Reducing the bias without increasing the variance requires some prior information regarding the target function to steer the selection of H in the direction of f, and this task is largely application-specific. On the other hand, reducing the variance without compromising the bias can be done through more general techniques.
2.3.2 The Learning Curve

We close this chapter with an important plot that illustrates the tradeoffs that we have seen so far. The learning curves summarize the behavior of the
in-sample and out-of-sample errors as we vary the size of the training set. After learning with a particular data set D of size N, the final hypothesis g^(D) has in-sample error Ein(g^(D)) and out-of-sample error Eout(g^(D)), both of which depend on D. As we saw in the bias-variance analysis, the expectation with respect to all data sets of size N gives the expected errors: E_D[Ein(g^(D))] and E_D[Eout(g^(D))].⁶ These expected errors are functions of N, and are called the learning curves of the model. We illustrate the learning curves for a simple learning model and a complex one, based on actual experiments.
[Figure: learning curves (expected error versus number of data points, N) for a simple model and a complex model.]
Notice that for the simple model, the learning curve converges more quickly, but to worse ultimate performance than for the complex model. This behavior is typical in practice. For both simple and complex models, the out-of-sample learning curve is decreasing in N, while the in-sample learning curve is increasing in N. Let us take a closer look at these curves and interpret them in terms of the different approaches to generalization that we have discussed. In the VC analysis, Eout was expressed as the sum of Ein and a generalization error that was bounded by Ω, the penalty for model complexity. In the bias-variance analysis, Eout was expressed as the sum of a bias and a variance. The following learning curves illustrate these two approaches side by side.
[Figure: the learning curves interpreted two ways (expected error versus number of data points, N); left: VC analysis (Ein plus generalization error); right: bias-variance analysis.]
The VC analysis bounds the generalization error, which is illustrated on the left. The bias-variance analysis is illustrated on the right. The bias-variance illustration is somewhat idealized, since it assumes that, for every N, the average learned hypothesis ḡ has the same performance as the best approximation to f in the learning model. When the number of data points increases, we move to the right on the learning curves, and both the generalization error and the variance term shrink, as expected. The learning curve also illustrates an important point about Ein. As N increases, Ein edges toward the smallest error that the learning model can achieve in approximating f. For small N, the value of Ein is actually smaller than that 'smallest possible' error. This is because the learning model has an easier task for smaller N; it only needs to approximate f on the N points, regardless of what happens outside those points. Therefore, it can achieve a superior fit on those points, albeit at the expense of an inferior fit on the rest of the points, as shown by the corresponding value of Eout.
⁶ For the learning curves, we take the expected values of all quantities with respect to data sets of size N.
2.4 Problems
Problem 2.1 In Equation (2.1), set δ = 0.03 and let ε(M, N, δ) = √( (1/(2N)) ln(2M/δ) ).
(a) For M = 1, how many examples do we need to make ε ≤ 0.05?
(b) For M = 100, how many examples do we need to make ε ≤ 0.05?
(c) For M = 10,000, how many examples do we need to make ε ≤ 0.05?
Problem 2.2 Show that for the learning model of positive rectangles (aligned horizontally or vertically), m_H(4) = 2⁴ and m_H(5) < 2⁵. Hence, give a bound for m_H(N).
Problem 2.3 Compute the maximum number of dichotomies, m_H(N), for these learning models, and consequently compute the VC dimension.
(a) Positive or negative ray: H contains the functions which are +1 on [a, ∞) (for some a), together with those that are +1 on (−∞, a] (for some a).
(b) Positive or negative interval: H contains the functions which are +1 on an interval [a, b] and −1 elsewhere, or −1 on an interval [a, b] and +1 elsewhere.
(c) Two concentric spheres in ℝ^d: H contains the functions which are +1 for a ≤ √(x_1² + ⋯ + x_d²) ≤ b.

Problem 2.4 Show that B(N, k) ≥ Σ_{i=0}^{k−1} C(N, i) by showing the other direction to Lemma 2.3, namely that

B(N, k) ≥ Σ_{i=0}^{k−1} C(N, i).

To do so, construct a specific set of Σ_{i=0}^{k−1} C(N, i) dichotomies that does not shatter any subset of k variables. [Hint: Try limiting the number of −1's in each dichotomy.]
Problem 2.5 Prove by induction that Σ_{i=0}^{D} C(N, i) ≤ N^D + 1, hence

m_H(N) ≤ N^dvc + 1.
Problem 2.6 Prove that for N ≥ d,

Σ_{i=0}^{d} C(N, i) ≤ (eN/d)^d.

We suggest you first show the following intermediate steps:
(a) Σ_{i=0}^{d} C(N, i) ≤ Σ_{i=0}^{d} C(N, i) (N/d)^{d−i} ≤ (N/d)^d Σ_{i=0}^{N} C(N, i) (d/N)^i. [Hints: Binomial theorem; (1 + 1/r)^r ≤ e for r > 0.]
(b) Hence, argue that m_H(N) ≤ (eN/dvc)^{dvc}.
Problem 2.7 Plot the bounds for m_H(N) given in Problems 2.5 and 2.6 for dvc = 2 and dvc = 5. When do you prefer one bound over the other?
Problem 2.8 Which of the following are possible growth functions m_H(N) for some hypothesis set?

1 + N;   1 + N + N(N−1)/2;   2^N;   2^⌊√N⌋;   2^⌊N/2⌋;   1 + N + N(N−1)(N−2)/6.
Problem 2.9 [hard] For the perceptron in d dimensions, show that

m_H(N) = 2 Σ_{i=0}^{d} C(N−1, i).

Use this formula to verify that dvc = d + 1 by evaluating m_H(d+1) and m_H(d+2). Plot m_H(N)/2^N for d = 10 and N ∈ [1, 40]. If you generate a random dichotomy on N points in 10 dimensions, give an upper bound on the probability that the dichotomy will be separable for N = 10, 20, 40.

Problem 2.10 Show that m_H(2N) ≤ m_H(N)², and hence obtain a generalization bound which only involves m_H(N).
Problem 2.11 Suppose m_H(N) = N + 1, so dvc = 1. You have 100 training examples. Use the generalization bound to give a bound for Eout with confidence 90%. Repeat for N = 10,000.
Problem 2.12 For an H with dvc = 10, what sample size do you need (as prescribed by the generalization bound) to have a 95% confidence that your generalization error is at most 0.05?
Problem 2.13
(a) Let H = {h_1, h_2, ..., h_M} with some finite M. Prove that dvc(H) ≤ log₂ M.
(b) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions dvc(H_k), derive and prove the tightest upper and lower bound that you can get on dvc(∩_{k=1}^{K} H_k).
(c) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions dvc(H_k), derive and prove the tightest upper and lower bounds that you can get on dvc(∪_{k=1}^{K} H_k).
Problem 2.14 Let H_1, H_2, ..., H_K be K hypothesis sets with finite VC dimension dvc. Let H = H_1 ∪ H_2 ∪ ⋯ ∪ H_K be the union of these models.
(a) Show that dvc(H) < K(dvc + 1).
(b) Suppose that ℓ satisfies 2^ℓ > 2K ℓ^dvc. Show that dvc(H) ≤ ℓ.
(c) Hence, show that

dvc(H) ≤ min( K(dvc + 1), 7(dvc + K) log₂(dvc K) ).

That is, dvc(H) = O( max(dvc, K) log₂ max(dvc, K) ), which is not too bad.

Problem 2.15 The monotonically increasing hypothesis set is

H = {h | x_1 ≥ x_2 ⟹ h(x_1) ≥ h(x_2)},

where x_1 ≥ x_2 if and only if the inequality is satisfied for every component.
(a) Give an example of a monotonic classifier in two dimensions, clearly showing the +1 and −1 regions.
(b) Compute m_H(N) and hence the VC dimension. [Hint: Consider a set of N points generated by first choosing one point, and then generating the next point by increasing the first component and decreasing the second component until N points are obtained.]
Problem 2.16 In this problem, we will consider X = ℝ. That is, x is a one-dimensional variable. For a hypothesis set

H = { h_c | h_c(x) = sign( Σ_{i=0}^{D} c_i x^i ) },

prove that the VC dimension of H is exactly (D + 1) by showing that
(a) There are (D + 1) points which are shattered by H.
(b) There are no (D + 2) points which are shattered by H.
Problem 2.17 The VC dimension depends on the input space as well as H. For a fixed H, consider two input spaces X_1 ⊆ X_2. Show that the VC dimension of H with respect to input space X_1 is at most the VC dimension of H with respect to input space X_2.

How can the result of this problem be used to answer part (b) in Problem 2.16? [Hint: How is Problem 2.16 related to a perceptron in D dimensions?]
Problem 2.18 The VC dimension of the perceptron hypothesis set corresponds to the number of parameters (w_0, w_1, ..., w_d) of the set, and this observation is 'usually' true for other hypothesis sets. However, we will present a counter-example here. Prove that the following hypothesis set for x ∈ ℝ has an infinite VC dimension:

H = { h_α | h_α(x) = (−1)^⌊αx⌋, where α ∈ ℝ },

where ⌊A⌋ is the biggest integer ≤ A (the floor function). This hypothesis set has only one parameter α, but 'enjoys' an infinite VC dimension. [Hint: Consider x_1, ..., x_N, where x_n = 10^n, and show how to implement an arbitrary dichotomy y_1, ..., y_N.]
Problem 2.19 This problem derives a bound for the VC dimension of a complex hypothesis set that is built from simpler hypothesis sets via composition. Let H_1, ..., H_K be hypothesis sets with VC dimensions d_1, ..., d_K. Fix h_1, ..., h_K, where h_i ∈ H_i. Define a vector z obtained from x to have components h_i(x). Note that x ∈ ℝ^d, but z ∈ {−1, +1}^K. Let H̃ be a hypothesis set of functions that take inputs in ℝ^K. So h̃ ∈ H̃ maps z ∈ ℝ^K to {+1, −1}, and suppose that H̃ has VC dimension d̃.

We can apply a hypothesis in H̃ to the z constructed from (h_1, ..., h_K). This is the composition of the hypothesis set H̃ with (H_1, ..., H_K). More formally, the composed hypothesis set H = H̃ ∘ (H_1, ..., H_K) is defined by h ∈ H if

h(x) = h̃( h_1(x), ..., h_K(x) ),   h̃ ∈ H̃,   h_i ∈ H_i.

(a) Show that

m_H(N) ≤ m_H̃(N) ∏_{k=1}^{K} m_{H_k}(N).   (2.18)

[Hint: Fix N points x_1, ..., x_N and fix h_1, ..., h_K. This generates N transformed points z_1, ..., z_N. These z_1, ..., z_N can be dichotomized in at most m_H̃(N) ways, hence for fixed (h_1, ..., h_K), (x_1, ..., x_N) can be dichotomized in at most m_H̃(N) ways. Through the eyes of x_1, ..., x_N, at most how many hypotheses are there (effectively) in H_k? Use this bound to bound the effective number of K-tuples (h_1, ..., h_K) that need to be considered. Finally, argue that you can bound the number of dichotomies that can be implemented by the product of the number of possible K-tuples (h_1, ..., h_K) and the number of dichotomies per K-tuple.]

(b) Use the bound m(N) ≤ (eN/dvc)^dvc to get a bound for m_H(N) in terms of d̃, d_1, ..., d_K.

(c) Let D = d̃ + Σ_{k=1}^{K} d_k, and assume that D > 2e log₂ D. Show that dvc(H) ≤ 2D log₂ D.

(d) If H̃ and the H_i are all perceptron hypothesis sets, show that dvc(H) = O(dK log(dK)).

In the next chapter, we will further develop the simple linear model. This linear model is the building block of many other models, such as neural networks. The results of this problem show how to bound the VC dimension of the more complex models built in this manner.
Problem 2.20 There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ.
(a) Original VC bound:
ε ≤ √( (8/N) ln( 4 m_H(2N) / δ ) ).
(b) Rademacher penalty bound:
ε ≤ √( (2 ln(2N m_H(N))) / N ) + √( (2/N) ln(1/δ) ) + 1/N.
(c) Parrondo and Van den Broek:
ε ≤ √( (1/N) ( 2ε + ln( 6 m_H(2N) / δ ) ) ).
(d) Devroye:
ε ≤ √( (1/(2N)) ( 4ε(1 + ε) + ln( 4 m_H(N²) / δ ) ) ).
Note that (c) and (d) are implicit bounds in ε. Fix dvc = 50 and δ = 0.05, and plot these bounds as a function of N. Which is best?
Problem 2.21 Assume the following theorem to hold.

Theorem. P[ (Eout(g) − Ein(g)) / √Eout(g) > ε ] ≤ c · m_H(2N) exp( −ε²N/4 ),

where c is a constant that is a little bigger than 6.

This bound is useful because sometimes what we care about is not the absolute generalization error but instead a relative generalization error (one can imagine that a generalization error of 0.01 is more significant when Eout = 0.01 than when Eout = 0.5). Convert this to a generalization bound by showing that, with probability at least 1 − δ,

Eout(g) ≤ Ein(g) + (ξ/2) [ 1 + √( 1 + 4Ein(g)/ξ ) ],

where ξ = (4/N) ln( c · m_H(2N) / δ ).
Problem 2.22 When there is noise in the data, Eout(g^(D)) = E_{x,y}[ (g^(D)(x) − y(x))² ], where y(x) = f(x) + ε. If ε is a zero-mean noise random variable with variance σ², show that the bias-variance decomposition becomes

E_D[ Eout(g^(D)) ] = σ² + bias + var.
Problem 2.23 Consider the learning problem in Example 2.8, where the input space is X = [−1, +1], the target function is f(x) = sin(πx), and the input probability distribution is uniform on X. Assume that the training set D has only two data points (picked independently), and that the learning algorithm picks the hypothesis that minimizes the in-sample mean squared error. In this problem, we will dig deeper into this case.

For each of the following learning models, find (analytically or numerically) (i) the best hypothesis that approximates f in the mean-squared-error sense (assume that f is known for this part), (ii) the expected value (with respect to D) of the hypothesis that the learning algorithm produces, and (iii) the expected out-of-sample error and its bias and var components.
(a) The learning model consists of all hypotheses of the form h(x) = ax + b (if you need to deal with the infinitesimal-probability case of two identical data points, choose the hypothesis tangential to f).
(b) The learning model consists of all hypotheses of the form h(x) = ax. This case was not covered in Example 2.8.
(c) The learning model consists of all hypotheses of the form h(x) = b.

Problem 2.24 Consider a simplified learning scenario. Assume that the input dimension is one. Assume that the input variable x is uniformly distributed in the interval [−1, 1]. The data set consists of 2 points {x_1, x_2}, and assume that the target function is f(x) = x². Thus, the full data set is D = {(x_1, x_1²), (x_2, x_2²)}. The learning algorithm returns the line fitting these two points as g (H consists of functions of the form h(x) = ax + b). We are interested in the test performance (Eout) of our learning system with respect to the squared error measure, the bias and the var.
(a) Give the analytic expression for the average function ḡ(x).
(b) Describe an experiment that you could run to determine (numerically) ḡ(x), Eout, bias, and var.
(c) Run your experiment and report the results. Compare Eout with bias + var. Provide a plot of your ḡ(x) and f(x) (on the same plot).
(d) Compute analytically what Eout, bias and var should be.
Chapter 3

The Linear Model

We often wonder how to draw a line between two categories; right versus wrong, personal versus professional life, useful email versus spam, to name a few. A line is intuitively our first choice for a decision boundary. In learning, as in life, a line is also a good first choice.

In Chapter 1, we (and the machine ☺) learned a procedure to 'draw a line' between two categories based on data (the perceptron learning algorithm). We started by taking the hypothesis set H that included all possible lines (actually hyperplanes). The algorithm then searched for a good line in H by iteratively correcting the errors made by the current candidate line, in an attempt to improve Ein. As we saw in Chapter 2, the linear model (the set of lines) has a small VC dimension and so is able to generalize well from Ein to Eout.

The aim of this chapter is to further develop the basic linear model into a powerful tool for learning from data. We branch into three important problems: the classification problem that we have seen, and two other important problems called regression and probability estimation. The three problems come with different but related algorithms, and cover a lot of territory in learning from data. As a rule of thumb, when faced with learning problems, it is generally a winning strategy to try a linear model first.

3.1 Linear Classification

The linear model for classifying data into two classes uses a hypothesis set of linear classifiers, where each h has the form

h(x) = sign(wᵀx),

for some column vector w ∈ ℝ^{d+1}, where d is the dimensionality of the input space, and the added coordinate x_0 = 1 corresponds to the bias 'weight' w_0 (recall that the input space X = {1} × ℝ^d is considered (d+1)-dimensional since the added coordinate x_0 = 1 is fixed). We will use h and w interchangeably
to refer to the hypothesis when the context is clear.

When we left Chapter 1, we had two basic criteria for learning:
1. Can we make sure that Eout(g) is close to Ein(g)? This ensures that what we have learned in sample will generalize out of sample.
2. Can we make Ein(g) small? This ensures that what we have learned in sample is a good hypothesis.

The first criterion was studied in Chapter 2. Specifically, the VC dimension of the linear model is only d + 1 (Exercise 2.4). Using the VC generalization bound (2.12), and the bound (2.10) on the growth function in terms of the VC dimension, we conclude that with high probability,

Eout(g) = Ein(g) + O( √( (d/N) ln N ) ).   (3.1)

Thus, when N is sufficiently large, Ein and Eout will be close to each other (see the definition of O(·) in the Notation table), and the first criterion for learning is fulfilled.

The second criterion, making sure that Ein is small, requires first and foremost that there is some linear hypothesis that has small Ein. If there isn't such a linear hypothesis, then learning certainly can't find one. So, let's suppose for the moment that there is a linear hypothesis with small Ein. In fact, let's suppose that the data is linearly separable, which means there is some hypothesis w* with Ein(w*) = 0. We will deal with the case when this is not true shortly.

In Chapter 1, we introduced the perceptron learning algorithm (PLA). Start with an arbitrary weight vector w(0). Then, at every time step t ≥ 0, select any misclassified data point (x(t), y(t)), and update w(t) as follows:

w(t + 1) = w(t) + y(t) x(t).
The intuition is that the update is attempting to correct the error in classifying x(t). The remarkable thing is that this incremental approach of learning based on one data point at a time works. As discussed in Problem 1.3, it can be proved that the PLA will eventually stop updating, ending at a solution with Ein(w) = 0. Although this result applies to a restricted setting (linearly separable data), it is a significant step. The PLA is clever; it doesn't naively test every linear hypothesis to see if it (the hypothesis) separates the data, which would take infinitely long. Using an iterative approach, the PLA manages to search an infinite hypothesis set and output a linear separator in (provably) finite time.

As far as PLA is concerned, linear separability is a property of the data, not the target. A linearly separable D could have been generated either from a linearly separable target, or (by chance) from a target that is not linearly separable. The convergence proof of PLA guarantees that the algorithm will
(a) Few noisy data.   (b) Nonlinearly separable.

Figure 3.1: Data sets that are not linearly separable but are (a) linearly separable after discarding a few examples, or (b) separable by a more sophisticated curve.
work in both these cases, and produce a hypothesis with Ein = 0. Further, in both cases, you can be confident that this performance will generalize well out of sample, according to the VC bound.

Exercise 3.1
Will PLA ever stop updating if the data is not linearly separable?

3.1.1 Non-Separable Data
We now address the case where the data is not linearly separable. Figure 3.1 shows two data sets that are not linearly separable. In Figure 3.1(a), the data becomes linearly separable after the removal of just two examples, which could be considered noisy examples or outliers. In Figure 3.1(b), the data can be separated by a circle rather than a line. In both cases, there will always be a misclassified training example if we insist on using a linear hypothesis, and hence PLA will never terminate. In fact, its behavior becomes quite unstable, and can jump from a good perceptron to a very bad one within one update; the quality of the resulting Ein cannot be guaranteed. In Figure 3.1(a), it seems appropriate to stick with a line, but to somehow tolerate noise and output a hypothesis with a small Ein, not necessarily Ein = 0. In Figure 3.1(b), the linear model does not seem to be the correct model in the first place, and we will discuss a technique called nonlinear transformation for this situation in Section 3.4.
The situation in Figure 3.1(a) is actually encountered very often: even though a linear classifier seems appropriate, the data may not be linearly separable because of outliers or noise. To find a hypothesis with the minimum Ein, we need to solve the combinatorial optimization problem:

min_{w ∈ ℝ^{d+1}} (1/N) Σ_{n=1}^{N} [[ sign(wᵀx_n) ≠ y_n ]].   (3.2)
The difficulty in solving this problem arises from the discrete nature of both sign(·) and [[·]]. In fact, minimizing Ein(w) in (3.2) in the general case is known to be NP-hard, which means there is no known efficient algorithm for it, and if you discovered one, you would become really, really famous ☺. Thus, one has to resort to approximately minimizing Ein.

One approach for getting an approximate solution is to extend PLA through a simple modification into what is called the pocket algorithm. Essentially, the pocket algorithm keeps 'in its pocket' the best weight vector encountered up to iteration t in PLA. At the end, the best weight vector will be reported as the final hypothesis. This simple algorithm is shown below.

The pocket algorithm:
1: Set the pocket weight vector ŵ to w(0) of PLA.
2: for t = 0, ..., T − 1 do
3:   Run PLA for one update to obtain w(t + 1).
4:   Evaluate Ein(w(t + 1)).
5:   If w(t + 1) is better than ŵ in terms of Ein, set ŵ to w(t + 1).
6: Return ŵ.

The original PLA only checks some of the examples using w(t) to identify (x(t), y(t)) in each iteration, while the pocket algorithm needs an additional step that evaluates all examples using w(t + 1) to get Ein(w(t + 1)). The additional step makes the pocket algorithm much slower than PLA. In addition, there is no guarantee for how fast the pocket algorithm can converge to a good Ein. Nevertheless, it is a useful algorithm to have on hand because of its simplicity. Other efficient approaches for obtaining good approximate solutions have been developed based on different optimization techniques, as shown later in this chapter.
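For concreteness, here is a minimal NumPy sketch of the pocket algorithm described above (our own illustration; the function name and the random choice among misclassified points are our conventions):

```python
import numpy as np

def pocket(X, y, T=1000, rng=None):
    """Pocket algorithm sketch: run PLA-style updates for T iterations,
    keeping the weight vector with the lowest in-sample error seen so far.
    X is N x (d+1) with a leading column of 1s; y contains +-1 labels."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])             # w(0) of PLA
    def ein(w):
        return np.mean(np.sign(X @ w) != y)
    w_hat, best = w.copy(), ein(w)
    for _ in range(T):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break                        # linearly separated: Ein = 0
        n = rng.choice(misclassified)    # pick any misclassified point
        w = w + y[n] * X[n]              # PLA update
        e = ein(w)                       # extra pass over all examples
        if e < best:
            w_hat, best = w.copy(), e
    return w_hat
```

On a non-separable data set such as the one in Exercise 3.2 below, ŵ retains the best Ein seen so far even though the PLA iterate w(t) keeps moving.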
Exercise 3.2
Take d = 2 and create a data set D of size N = 100 that is not linearly separable. You can do so by first choosing a random line in the plane as your target function and the inputs x_n of the data set as random points in the plane. Then, evaluate the target function on each x_n to get the corresponding output y_n. Finally, flip the labels of N/10 randomly selected y_n's and the data set will likely become non-separable.
Now, try the pocket algorithm on your data set, using T = 1,000 iterations. Repeat the experiment 20 times. Then, plot the average Ein(w(t)) and the average Ein(ŵ) (which is also a function of t) on the same figure and see how they behave when t increases. Similarly, use a test set of size 1,000 and plot a figure to show how Eout(w(t)) and Eout(ŵ) behave.
Example 3.1 (Handwritten digit recognition). We sample some digits from the US Postal Service Zip Code Database. These 16 × 16 pixel images are preprocessed from the scanned handwritten zip codes. The goal is to recognize the digit in each image. We alluded to this task in part (b) of Exercise 1.1. A quick look at the images reveals that this is a nontrivial task (even for a human), and typical human Eout is about 2.5%. Common confusion occurs between the digits {4, 9} and {2, 7}. A machine-learned hypothesis which can achieve such an error rate would be highly desirable.

Let us first decompose the big task of separating ten digits into smaller tasks of separating two of the digits. Such a decomposition approach from multiclass to binary classification is commonly used in many learning algorithms. We will focus on digits {1, 5} for now. A human approach to determining the digit corresponding to an image is to look at the shape (or other properties) of the black pixels. Thus, rather than carrying all the information in the 256 pixels, it makes sense to summarize the information contained in the image into a few features. Let's look at two important features here: intensity and symmetry. Digit 5 usually occupies more black pixels than digit 1, and hence the average pixel intensity of digit 5 is higher. On the other hand, digit 1 is symmetric, while digit 5 is not. Therefore, if we define asymmetry as the average absolute difference between an image and its flipped versions, and symmetry as the negation of asymmetry, digit 1 would result in a higher symmetry value. A scatter plot for these intensity and symmetry features for some of the digits is shown next.
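The two features of Example 3.1 are easy to compute once an image is available as an array. The sketch below is our own illustration; it assumes each digit is a 16 × 16 NumPy array of grayscale values (the exact preprocessing of the zip-code data is not specified here), and it uses only the left-right flip as one reading of 'flipped versions'.

```python
import numpy as np

def digit_features(img):
    # img: a 16 x 16 NumPy array of grayscale pixel values (an assumption).
    intensity = img.mean()                           # average pixel value
    asymmetry = np.abs(img - np.fliplr(img)).mean()  # difference from flip
    symmetry = -asymmetry                            # negation of asymmetry
    return intensity, symmetry
```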
While the digits can be roughly separated by a line in the plane representing these two features, there are poorly written digits (such as the 5 depicted in the top-left corner) that prevent a perfect linear separation. We now run PLA and pocket on the data set and see what happens. Since the data set is not linearly separable, PLA will not stop updating. In fact, as can be seen in Figure 3.2(a), its behavior can be quite unstable. When it is forcibly terminated at iteration 1,000, PLA gives a line that has a poor Ein = 2.24% and Eout = 6.37%. On the other hand, if the pocket algorithm is applied to the same data set, as shown in Figure 3.2(b), we can obtain a line that has a better Ein = 0.45% and a better Eout = 1.89%. □
3.2 Linear Regression

Linear regression is another useful linear model that applies to real-valued target functions.¹ It has a long history in statistics, where it has been studied in great detail, and has various applications in social and behavioral sciences. Here, we discuss linear regression from a learning perspective, where we derive the main results with minimal assumptions.

Let us revisit our application in credit approval, this time considering a regression problem rather than a classification problem. Recall that the bank has customer records that contain information fields related to personal credit, such as annual salary, years in residence, outstanding loans, etc. Such variables can be used to learn a linear classifier to decide on credit approval. Instead of just making a binary decision (approve or not), the bank also wants to set a proper credit limit for each approved customer. Credit limits are traditionally determined by human experts. The bank wants to automate this task, as it did with credit approval.
¹ Regression, a term inherited from earlier work in statistics, means y is real-valued.
(a) PLA   (b) Pocket

Figure 3.2: Comparison of two linear classification algorithms for separating digits 1 and 5. Ein and Eout are plotted versus iteration number, and below that is the learned hypothesis g. (a) A version of the PLA which selects a random training example and updates w if that example is misclassified (hence the flat regions when no update is made). This version avoids searching all the data at every iteration. (b) The pocket algorithm. (Top panels: error versus iteration number; bottom panels: the data in the average-intensity plane with the learned line.)
This is a regression learning problem. The bank uses historical records to construct a data set D of examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where x_n is customer information and y_n is the credit limit set by one of the human experts in the bank. Note that y_n is now a real number (positive in this case) instead of just a binary value ±1. The bank wants to use learning to find a hypothesis g that replicates how human experts determine credit limits.

Since there is more than one human expert, and since each expert may not be perfectly consistent, our target will not be a deterministic function y = f(x). Instead, it will be a noisy target formalized as a distribution of the random variable y that comes from the different views of different experts, as well as the variation within the views of each expert. That is, the label y_n comes from some distribution P(y | x) instead of a deterministic function f(x). Nonetheless, as we discussed in previous chapters, the nature of the problem is not changed. We have an unknown distribution P(x, y) that generates
each (x_n, y_n), and we want to find a hypothesis g that minimizes the error between g(x) and y with respect to that distribution. The choice of a linear model for this problem presumes that there is a linear combination of the customer information fields that would properly approximate the credit limit as determined by human experts. If this assumption does not hold, we cannot achieve a small error with a linear model. We will deal with this situation when we discuss nonlinear transformation later in the chapter.
3.2.1 The Algorithm
The linear regression algorithm is based on minimizing the squared error between h(x) and y,²

Eout(h) = E[ (h(x) − y)² ],

where the expected value is taken with respect to the joint probability distribution P(x, y). The goal is to find a hypothesis that achieves a small Eout(h). Since the distribution P(x, y) is unknown, Eout(h) cannot be computed. Similar to what we did in classification, we resort to the in-sample version instead,

Ein(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)².

In linear regression, h takes the form of a linear combination of the components of x. That is,

h(x) = Σ_{i=0}^{d} w_i x_i = wᵀx,
where x_0 = 1 and x ∈ {1} × ℝ^d as usual, and w ∈ ℝ^{d+1}. For the special case of linear h, it is very useful to have a matrix representation of Ein(h). First, define the data matrix X ∈ ℝ^{N×(d+1)} to be the N × (d+1) matrix whose rows are the inputs x_n (as row vectors), and define the target vector y ∈ ℝ^N to be the column vector whose components are the target values y_n. The in-sample error is a function of w and the data X, y:

Ein(w) = (1/N) ‖Xw − y‖²   (3.3)
 = (1/N) ( wᵀXᵀXw − 2wᵀXᵀy + yᵀy ),   (3.4)

where ‖·‖ is the Euclidean norm of a vector, and (3.3) follows because the nth component of the vector Xw − y is exactly wᵀx_n − y_n.

² The term 'linear regression' has been historically confined to squared error measures.
(a) one dimension (line)   (b) two dimensions (hyperplane)

Figure 3.3: The solution hypothesis (in blue) of the linear regression algorithm in one and two dimensions. The sum of squared errors is minimized.
The linear regression algorithm is derived by minimizing Ein(w) over all possible w ∈ ℝ^{d+1}, as formalized by the following optimization problem:

w_lin = argmin_{w ∈ ℝ^{d+1}} Ein(w).   (3.5)

Figure 3.3 illustrates the solution in one and two dimensions. Since Equation (3.4) implies that Ein(w) is differentiable, we can use standard matrix calculus to find the w that minimizes Ein(w) by requiring that the gradient of Ein with respect to w is the zero vector, i.e., ∇Ein(w) = 0. The gradient is a column vector whose ith component is [∇Ein(w)]_i = ∂Ein(w)/∂w_i. By explicitly computing these derivatives, the reader can verify the following gradient identities,

∇_w( wᵀAw ) = (A + Aᵀ)w,   ∇_w( wᵀb ) = b.

These identities are the matrix analog of ordinary differentiation of quadratic and linear functions. To obtain the gradient of Ein, we take the gradient of each term in (3.4) to obtain

∇Ein(w) = (2/N) ( XᵀXw − Xᵀy ).

Note that both w and ∇Ein(w) are column vectors. Finally, to get ∇Ein(w) to be 0, one should solve for the w that satisfies

XᵀXw = Xᵀy.

If XᵀX is invertible, w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the pseudo-inverse of X. The resulting w is the unique optimal solution to (3.5). If XᵀX is not
invertible, a pseudo-inverse can still be defined, but the solution will not be unique (see Problem 3.15). In practice, XᵀX is invertible in most of the cases, since N is often much bigger than d + 1, so there will likely be d + 1 linearly independent vectors x_n. We have thus derived the following linear regression algorithm.
Linear regression algorithm:
1: Construct the matrix X and the vector y from the data set (x_1, y_1), ..., (x_N, y_N), where each x includes the x_0 = 1 bias coordinate, as follows:
   X = [ x_1ᵀ ; x_2ᵀ ; ⋯ ; x_Nᵀ ]   (input data matrix),   y = [ y_1 ; y_2 ; ⋯ ; y_N ]   (target vector).
2: Compute the pseudo-inverse X† of the matrix X. If XᵀX is invertible,
   X† = (XᵀX)⁻¹Xᵀ.
3: Return w_lin = X†y.

This algorithm is sometimes referred to as ordinary least squares (OLS). It may seem that, compared with the perceptron learning algorithm, linear regression doesn't really look like 'learning', in the sense that the hypothesis w_lin comes from an analytic solution (matrix inversion and multiplications) rather than from iterative learning steps. Well, as long as the hypothesis w_lin has a decent out-of-sample error, then learning has occurred. Linear regression is a rare case where we have an analytic formula for learning that is easy to evaluate. This is one of the reasons why the technique is so widely used. It should be noted that there are methods for computing the pseudo-inverse directly without inverting a matrix, and that these methods are numerically more stable than matrix inversion.
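As a sketch of steps 2 and 3 in practice, NumPy's SVD-based pseudo-inverse gives a one-line implementation (the function name and the made-up data in the usage note are ours):

```python
import numpy as np

def linear_regression(X, y):
    # w_lin = pseudo-inverse of X times y.  X is N x (d+1) with a leading
    # column of 1s; y is the target vector.  np.linalg.pinv uses an SVD,
    # one of the numerically stable routes mentioned above.
    return np.linalg.pinv(X) @ y

# Usage sketch with arbitrary synthetic data:
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = rng.normal(size=100)
w_lin = linear_regression(X, y)
```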
Linear regression has been analyzed in great detail in statistics. We would like to mention one of the analysis tools here, since it relates to in-sample and out-of-sample errors, and that is the hat matrix H. Here is how H is defined. The linear regression weight vector w_lin is an attempt to map the inputs X to the outputs y. However, w_lin does not produce y exactly, but produces an estimate

ŷ = Xw_lin,

which differs from y due to in-sample error. Substituting the expression for w_lin (assuming XᵀX is invertible), we get

ŷ = X(XᵀX)⁻¹Xᵀy.
Therefore, the estimate ŷ is a linear transformation of the actual y through matrix multiplication with H, where

H = X(XᵀX)⁻¹Xᵀ.   (3.6)

Since the matrix H puts a hat on y, it is called the hat matrix. The hat matrix is a very special matrix. For one thing, H² = H, which can be verified using the above expression for H. This and other properties of H will facilitate the analysis of in-sample and out-of-sample errors of linear regression.
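A quick numerical check (not a proof) of these properties on a small randomly generated X might look like this; the data shapes are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix, Equation (3.6)

print(np.allclose(H, H.T))                  # H is symmetric
print(np.allclose(H @ H, H))                # H^2 = H
print(np.isclose(np.trace(H), d + 1))       # trace(H) = d + 1
```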
Exercise 3.3
Consider the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is an N × (d+1) matrix, and XᵀX is invertible.
(a) Show that H is symmetric.
(b) Show that H^K = H for any positive integer K.
(c) If I is the identity matrix of size N, show that (I − H)^K = I − H for any positive integer K.
(d) Show that trace(H) = d + 1, where the trace is the sum of diagonal elements. [Hint: trace(AB) = trace(BA).]
3.2.2 Generalization Issues

Linear regression looks for the optimal weight vector in terms of the in-sample error Ein, which leads to the usual generalization question: does this guarantee decent out-of-sample error Eout? The short answer is yes. There is a regression version of the VC generalization bound (3.1) that similarly bounds Eout. In the case of linear regression in particular, there are also exact formulas for the expected Eout and Ein that can be derived under simplifying assumptions. The general form of the result is

Eout(g) = Ein(g) + O(d/N),

where Eout(g) and Ein(g) are the expected values. This is comparable to the classification bound in (3.1).

Exercise 3.4
Consider a noisy target y = w*ᵀx + ε for generating the data, where ε is a noise term with zero mean and σ² variance, independently generated for every example (x, y). The expected error of the best possible linear fit to this target is thus σ².

For the data D = {(x_1, y_1), ..., (x_N, y_N)}, denote the noise in y_n as ε_n and let ε = [ε_1, ε_2, ..., ε_N]ᵀ; assume that XᵀX is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to D is given by

E_D[ Ein(w_lin) ] = σ² ( 1 − (d+1)/N ).

(a) Show that the in-sample estimate of y is given by ŷ = Xw* + Hε.
(b) Show that the in-sample error vector ŷ − y can be expressed by a matrix times ε. What is the matrix?
(c) Express Ein(w_lin) in terms of ε using (b), and simplify the expression using Exercise 3.3(c).
(d) Prove that E_D[ Ein(w_lin) ] = σ²(1 − (d+1)/N) using (c) and the independence of ε_1, ..., ε_N. [Hint: The sum of the diagonal elements of a matrix (the trace) will play a role; see Exercise 3.3(d).]

For the expected out-of-sample error, we take a special case which is easy to analyze. Consider a test data set D_test = {(x_1, y′_1), ..., (x_N, y′_N)}, which shares the same input vectors x_n with D but with a different realization of the noise terms. Denote the noise in y′_n as ε′_n and let ε′ = [ε′_1, ε′_2, ..., ε′_N]ᵀ. Define Etest(w_lin) to be the average squared error on D_test.

(e) Prove that E_{D,ε′}[ Etest(w_lin) ] = σ²( 1 + (d+1)/N ).

The special test error Etest is a very restricted case of the general out-of-sample error. Some detailed analysis shows that similar results can be obtained for the general case, as shown in the Problems.
Figure 3.4 illustrates the learning curve of linear regression under the assumptions of Exercise 3.4. The best possible linear fit has expected error σ². The expected in-sample error is smaller, equal to σ²(1 − (d+1)/N) for N ≥ d + 1. The learned linear fit has eaten into the in-sample noise as much as it could with the d + 1 degrees of freedom that it has at its disposal. This occurs because the fitting cannot distinguish the noise from the 'signal'. On the other hand, the expected out-of-sample error is σ²(1 + (d+1)/N), which is more than the unavoidable error of σ². The additional error reflects the drift in w_lin due to fitting the in-sample noise.
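A short simulation, consistent with the expressions quoted above (the sample sizes and noise level are arbitrary choices of ours), illustrates how the expected in-sample and test errors straddle σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, trials = 50, 5, 1.0, 20000

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # fixed inputs
w_star = rng.normal(size=d + 1)
pinv = np.linalg.pinv(X)

ein, etest = [], []
for _ in range(trials):
    y = X @ w_star + sigma * rng.normal(size=N)         # training noise
    y_test = X @ w_star + sigma * rng.normal(size=N)    # fresh noise, same inputs
    w = pinv @ y
    ein.append(np.mean((X @ w - y) ** 2))
    etest.append(np.mean((X @ w - y_test) ** 2))

print(np.mean(ein),   sigma**2 * (1 - (d + 1) / N))   # both ~0.88
print(np.mean(etest), sigma**2 * (1 + (d + 1) / N))   # both ~1.12
```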
3.3
Lgisti c Regressin
T co r of t li nar mod l is t signal' = wT tat combins t inut variabls linarly. · av sn two modls basd on tis signal, and w ar now going to introduc a tird In linar rgrssion, t signal itslf is takn as t ou tut, wic is aroriat if you ar trying to rdict a ral rsons tat could b unbound d In linar clas sication , t signal is trsoldd at zro to roduc a outut, aroriat fr binary dcisions A tird ossibility wic as wid alication in ractic, is to outut a pbability,
±
Figure 3.4: The learning curve for linear regression (expected error versus number of data points, N).
a value between 0 and 1. Our new model is called logistic regression. It has similarities to both previous models, as the output is real (like regression) but bounded (like classification).

Example 3.2 (Prediction of heart attacks). Suppose we want to predict the occurrence of heart attacks based on a person's cholesterol level, blood pressure, age, weight, and other factors. Obviously, we cannot predict a heart attack with any certainty, but we may be able to predict how likely it is to occur given these factors. Therefore, an output that varies continuously between 0 and 1 would be a more suitable model than a binary decision. The closer y is to 1, the more likely that the person will have a heart attack. □
3.3.1 Predicting a Probability
Linear classification uses a hard threshold on the signal s = wᵀx,

h(x) = sign(wᵀx),

while linear regression uses no threshold at all,

h(x) = wᵀx.

In our new model, we need something in between these two cases that smoothly restricts the output to the probability range [0, 1]. One choice that accomplishes this goal is the logistic regression model,

h(x) = θ(wᵀx),

where θ is the so-called logistic function θ(s) = e^s / (1 + e^s), whose output is between 0 and 1.
The output can be interpreted as a probability for a binary event (heart attack or no heart attack, digit '1' versus digit '5', etc.). Linear classification also deals with a binary event, but the difference is that the 'classification' in logistic regression is allowed to be uncertain, with intermediate values between 0 and 1 reflecting this uncertainty. The logistic function θ is referred to as a soft threshold, in contrast to the hard threshold in classification. It is also called a sigmoid because its shape looks like a flattened out 's'.
Exercise 3.5
Another popular soft threshold is the hyperbolic tangent

tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s}).

(a) How is tanh related to the logistic function θ? [Hint: shift and scale.]
(b) Show that tanh(s) converges to a hard threshold for large |s|, and converges to no threshold for small |s|. [Hint: Formalize the figure below.]
The specific formula of θ(s) will allow us to define an error measure for learning that has analytical and computational advantages, as we will see shortly. Let us first look at the target that logistic regression is trying to learn. The target is a probability, say of a patient being at risk for a heart attack, that depends on the input x (the characteristics of the patient). Formally, we are trying to learn the target function

f(x) = P[ y = +1 | x ].

The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability, e.g., patients who had heart attacks and patients who didn't. Therefore, the data is in fact generated by a noisy target P(y | x),

P(y | x) = f(x) for y = +1;  1 − f(x) for y = −1.   (3.7)

To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these noisy ±1 examples.
Error measure. The standard error measure e(h(x), y) used in logistic regression is based on the notion of likelihood: how 'likely' is it that we would get this output y from the input x if the target distribution P(y | x) were indeed captured by our hypothesis h(x)? Based on (3.7), that likelihood would be

P(y | x) = h(x) for y = +1;  1 − h(x) for y = −1.

We substitute for h(x) by its value θ(wᵀx), and use the fact that 1 − θ(s) = θ(−s) (easy to verify) to get

P(y | x) = θ(y wᵀx).   (3.8)

One of our reasons for choosing the mathematical form θ(s) = e^s/(1 + e^s) is that it leads to this simple expression for P(y | x).

Since the data points (x_1, y_1), ..., (x_N, y_N) are independently generated, the probability of getting all the y_n's in the data set from the corresponding x_n's would be the product

∏_{n=1}^{N} P(y_n | x_n).

The method of maximum likelihood selects the hypothesis h which maximizes this probability. We can equivalently minimize the more convenient quantity

−(1/N) ln( ∏_{n=1}^{N} P(y_n | x_n) ) = (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) ),

since ln(·) is a monotonically increasing function. Substituting with Equation (3.8), we would be minimizing

(1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n wᵀx_n) )

with respect to the weight vector w. The fact that we are minimizing this quantity allows us to treat it as an error measure. Substituting the functional form for θ(y_n wᵀx_n) produces the in-sample error measure for logistic regression,

Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n wᵀx_n} ).   (3.9)

The implied pointwise error measure is e(h(x_n), y_n) = ln(1 + e^{−y_n wᵀx_n}). Notice that this error measure is small when y_n wᵀx_n is large and positive, which would imply that sign(wᵀx_n) = y_n. Therefore, as our intuition would expect, the error measure encourages w to classify each x_n correctly.

³ Although the method of maximum likelihood is intuitively plausible, its rigorous justification as an inference tool continues to be discussed in the statistics community.
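The identity 1 − θ(s) = θ(−s) and the in-sample error (3.9) are straightforward to code; the sketch below (function names are ours) can be used to verify the identity numerically and to evaluate Ein(w) on a data set:

```python
import numpy as np

def theta(s):
    # Logistic function: e^s / (1 + e^s) = 1 / (1 + e^{-s}).
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_ein(w, X, y):
    # In-sample error (3.9): average of ln(1 + exp(-y_n w^T x_n)).
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

# Numerical check of 1 - theta(s) = theta(-s), used to get P(y|x) = theta(y w^T x):
s = np.linspace(-5, 5, 11)
print(np.allclose(1 - theta(s), theta(-s)))
```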
Exercise 3.6 [Cross-entropy error measure]
(a) More generally, if we are learning from ±1 data to predict a noisy target P(y | x) with candidate hypothesis h, show that the maximum likelihood method reduces to the task of finding h that minimizes

Ein(w) = Σ_{n=1}^{N} [[y_n = +1]] ln( 1/h(x_n) ) + [[y_n = −1]] ln( 1/(1 − h(x_n)) ).

(b) For the case h(x) = θ(wᵀx), argue that minimizing the in-sample error in part (a) is equivalent to minimizing the one in (3.9).

For two probability distributions {p, 1−p} and {q, 1−q} with binary outcomes, the cross-entropy (from information theory) is

p log(1/q) + (1 − p) log( 1/(1 − q) ).

The in-sample error in part (a) corresponds to a cross-entropy error measure on the data point (x_n, y_n), with p = [[y_n = +1]] and q = h(x_n).
For linear classification, we saw that minimizing Ein for the perceptron is a combinatorial optimization problem; to solve it, we introduced a number of algorithms such as the perceptron learning algorithm and the pocket algorithm. For linear regression, we saw that training can be done using the analytic pseudo-inverse algorithm for minimizing Ein by setting ∇Ein(w) = 0. These algorithms were developed based on the specific form of linear classification or linear regression, so none of them would apply to logistic regression.

To train logistic regression, we will take an approach similar to linear regression in that we will try to set ∇Ein(w) = 0. Unfortunately, unlike the case of linear regression, the mathematical form of the gradient of Ein for logistic regression is not easy to manipulate, so an analytic solution is not feasible.

Exercise 3.7
For logistic regression, show that

∇Ein(w) = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n wᵀx_n}) = −(1/N) Σ_{n=1}^{N} y_n x_n θ(−y_n wᵀx_n).

Argue that a 'misclassified' example contributes more to the gradient than a correctly classified one.

Instead of analytically setting the gradient to zero, we will iteratively set it to zero. To do so, we will introduce a new algorithm, gradient descent. Gradient
descent is a very general algorithm that can be used to train many other learning models with smooth error measures. For logistic regression, gradient descent has particularly nice properties.

3.3.2 Gradient Descent

Gradient descent is a general technique for minimizing a twice-differentiable function, such as Ein(w) in logistic regression. A useful physical analogy of gradient descent is a ball rolling down a hilly surface. If the ball is placed on a hill, it will roll down, coming to rest at the bottom of a valley. The same basic idea underlies gradient descent. Ein(w) is a 'surface' in a high-dimensional space. At step 0, we start somewhere on this surface, at w(0), and try to roll down this surface, thereby decreasing Ein.

One thing which you immediately notice from the physical analogy is that the ball will not necessarily come to rest in the lowest valley of the entire surface. Depending on where you start the ball rolling, you will end up at the bottom of one of the valleys, a local minimum. In general, the same applies to gradient descent. Depending on your starting weights, the path of descent will take you to a local minimum in the error surface.

A particular advantage for logistic regression with the cross-entropy error is that the picture looks much nicer. There is only one valley! So, it does not matter where you start your ball rolling, it will always roll down to the same (unique) global minimum. This is a consequence of the fact that Ein(w) is a convex function of w, a mathematical property that implies a single 'valley', as shown to the right (a convex error surface plotted against the weights, w).⁴ This means that gradient descent will not be trapped in local minima when minimizing such convex error measures.

Let's now determine how to 'roll' down the Ein surface. We would like to take a step in the direction of steepest descent, to gain the biggest bang for our buck. Suppose that we take a small step of size η in the direction of a unit vector v̂. The new weights are w(0) + ηv̂. Since η is small, using the Taylor expansion to first order, we compute the change in Ein as

ΔEin = Ein(w(0) + ηv̂) − Ein(w(0))
 = η ∇Ein(w(0))ᵀ v̂ + O(η²)
 ≥ −η ‖∇Ein(w(0))‖,

⁴ In fact, the squared in-sample error in linear regression is also convex, which is why the analytic solution found by the pseudo-inverse is guaranteed to have optimal in-sample error.
where we have ignored the small term O(η²). Since v̂ is a unit vector, equality holds if and only if

v̂ = −∇Ein(w(0)) / ‖∇Ein(w(0))‖.   (3.10)

This direction, specified by v̂, leads to the largest decrease in Ein for a given step size η.
This dirtion, spid b y , lads to th largst dras in Ein r a gin stp siz erise 8 e laim at ols fr small
is t e i ection w ic
ies lg est ece ase i n in onl
There is nothing to prevent us from continuing to take steps of size η, re-evaluating the direction v̂ at each iteration t = 0, 1, 2, .... How large a step should one take at each iteration? This is a good question, and to gain some insight, let's look at the following examples.

[Figure: progress of gradient descent on an error surface (error versus weights) when η is too small, when η is too large, and with a variable η that is 'just right'.]

A fixed step size (if it is too small) is inefficient when you are far from the local minimum. On the other hand, too large a step size when you are close to the minimum leads to bouncing around, possibly even increasing Ein. Ideally, we would like to take large steps when far from the minimum to get in the right ballpark quickly, and then small (more careful) steps when close to the minimum. A simple heuristic can accomplish this: far from the minimum, the norm of the gradient is typically large, and close to the minimum, it is small. Thus, we could set η_t = η‖∇Ein‖ to obtain the desired behavior for the variable step size; choosing the step size proportional to the norm of the gradient will also conveniently cancel the term normalizing the unit vector v̂ in Equation (3.10), leading to the fixed learning rate gradient descent algorithm for minimizing Ein (with redefined η):
Fixed learning rate gradient descent:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:   Compute the gradient g_t = ∇Ein(w(t)).
4:   Set the direction to move, v_t = −g_t.
5:   Update the weights: w(t + 1) = w(t) + η v_t.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights.
In the algorithm, v_t is a direction that is no longer restricted to unit length. The parameter η (the learning rate) has to be specified. A typically good choice for η is around 0.1 (a purely practical observation). To use gradient descent, one must compute the gradient. This can be done explicitly for logistic regression (see Exercise 3.7).

Example 3.3. Gradient descent is a general algorithm for minimizing twice-differentiable functions. We can apply it to the logistic regression in-sample error to return weights that approximately minimize

Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n wᵀx_n} ).
Logistic regression algorithm:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:   Compute the gradient g_t = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w(t)ᵀx_n}).
4:   Set the direction to move, v_t = −g_t.
5:   Update the weights: w(t + 1) = w(t) + η v_t.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights w.  □
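A compact NumPy sketch of this algorithm (our own illustration; the iteration cap and gradient threshold are arbitrary choices) follows the steps above, using the gradient from Exercise 3.7:

```python
import numpy as np

def logistic_regression(X, y, eta=0.1, max_iters=10000, grad_tol=1e-4):
    # Fixed-learning-rate gradient descent for the logistic regression
    # in-sample error.  X is N x (d+1) with a leading column of 1s; y is +-1.
    w = np.zeros(X.shape[1])                  # w(0) = 0 works here
    for _ in range(max_iters):
        # gradient of (3.9): -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n))
        g = -(y[:, None] * X / (1.0 + np.exp(y * (X @ w)))[:, None]).mean(axis=0)
        if np.linalg.norm(g) < grad_tol:      # stop when the gradient is small
            break
        w = w - eta * g                       # move along v_t = -g_t
    return w
```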
Initialization and termination. We have two more loose ends to tie: the first is how to choose w(0), the initial weights, and the second is how to set the criterion for "...until it is time to stop" in step 6 of the gradient descent algorithm. In some cases, such as logistic regression, initializing the weights w(0) as zeros works well. However, in general, it is safer to initialize the weights randomly, so as to avoid getting stuck on a perfectly symmetric hilltop. Choosing each weight independently from a Normal distribution with zero mean and small variance usually works well in practice.
That takes care of initialization, so we now move on to termination. How do we decide when to stop? Termination is a non-trivial topic in optimization. One simple approach, as we encountered in the pocket algorithm, is to set an upper bound on the number of iterations, where the upper bound is typically in the thousands, depending on the amount of training time we have. The problem with this approach is that there is no guarantee on the quality of the final weights.

Another plausible approach is based on the gradient being zero at any minimum. A natural termination criterion would be to stop once ‖g_t‖ drops below a certain threshold. Eventually this must happen, but we do not know when it will happen. For logistic regression, a combination of the two conditions (setting a large upper bound for the number of iterations, and a small lower bound for the size of the gradient) usually works well in practice.

There is a problem with relying solely on the size of the gradient to stop, which is that you might stop prematurely, as illustrated on the right. When the iteration reaches a relatively flat region (which is more common than you might suspect), the algorithm will prematurely stop when we may want to continue. So one solution is to require that termination occurs only if the error change is small and the error itself is small. Ultimately, a combination of termination criteria (a maximum number of iterations, marginal error improvement, coupled with a small value for the error itself) works reasonably well.

Example 3.4. By way of summarizing linear models, we revisit our old friend the credit example. If the goal is to decide whether to approve or deny, then we are in the realm of classification; if you want to assign an amount of credit line, then linear regression is appropriate; if you want to predict the probability that someone will default, use logistic regression.
Credit Analysis
    Approve or deny          ->  Perceptron            (classification error)
    Amount of credit         ->  Linear regression     (squared error)
    Probability of default   ->  Logistic regression   (cross-entropy error)
The three linear models have their respective goals, error measures, and algorithms. Nonetheless, they not only share similar sets of linear hypotheses, but are in fact related in other ways. We would like to point out one important relationship: both logistic regression and linear regression can be used in linear classification. Here is how. Logistic regression produces a final hypothesis g(x) which is our estimate of P[y = +1 | x]. Such an estimate can easily be used for classification by
setting a threshold on g(x); a natural threshold is 1/2, which corresponds to classifying +1 if +1 is more likely. This choice for the threshold corresponds to using the logistic regression weights as weights in the perceptron for classification. Not only can logistic regression weights be used for classification in this way, but they can also be used as a way to train the perceptron model. The perceptron learning problem (3.2) is a very hard combinatorial optimization problem. The convexity of E_in in logistic regression makes the optimization problem much easier to solve. Since the logistic function is a soft version of a hard threshold, the logistic regression weights should be good weights for classification using the perceptron.

A similar relationship exists between classification and linear regression. Linear regression can be used with any real-valued target function, which includes real values that are ±1. If w_linᵀx is fit to ±1 values, sign(w_linᵀx) will likely agree with these values and make good classification predictions. In other words, the linear regression weights w_lin, which are easily computed using the pseudo-inverse, are also an approximate solution for the perceptron model. The weights can be directly used for classification, or used as an initial condition for the pocket algorithm to give it a head start.
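As a small illustration of this point, the sketch below (our own notation, not from the text) turns logistic regression weights into a hard ±1 classifier by thresholding g(x) at 1/2, which is the same as taking sign(wᵀx), and likewise uses linear regression weights directly for classification.

```python
import numpy as np

def classify_with_weights(X, w):
    """Hard +/-1 classification from any linear-model weights w.

    Thresholding the logistic output at 1/2 is equivalent to taking
    sign(w.x), so the same rule serves logistic and linear regression.
    """
    return np.sign(X @ w)

def logistic_probability(X, w):
    """Soft output g(x), the estimate of P[y = +1 | x]."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```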
Exercise 3.9
Consider pointwise error measures e_class(s, y) = [[y ≠ sign(s)]], e_sq(s, y) = (y − s)², and e_log(s, y) = ln(1 + exp(−ys)), where the signal s = wᵀx.
(a) For y = +1, plot e_class, e_sq and (1/ln 2)·e_log versus s, on the same plot.
(b) Show that e_class(s, y) ≤ e_sq(s, y), and hence that the classification error is upper bounded by the squared error.
(c) Show that e_class(s, y) ≤ (1/ln 2)·e_log(s, y), and hence that the classification error is upper bounded (up to a constant) by the logistic regression error.
These bounds indicate that minimizing the squared or logistic regression error should also decrease the classification error, which justifies using the weights returned by linear or logistic regression as approximations for classification.
Stochastic gradient descent. The version of gradient descent we have described so far is known as batch gradient descent: the gradient is computed for the error on the whole data set before a weight update is done. A sequential version of gradient descent known as stochastic gradient descent (SGD) turns out to be very efficient in practice. Instead of considering the full batch gradient on all N training data points, we consider a stochastic version of the gradient. First, pick a training data point (x_n, y_n) uniformly at random (hence the name 'stochastic'), and consider only the error on that data point
(in the case of logistic regression,

    e_n(w) = ln(1 + e^{−y_n wᵀx_n}) ).

The gradient of this single data point's error is used for the weight update in exactly the same way that the gradient of the full in-sample error was used in batch gradient descent. The gradient needed for the weight update of SGD is ∇e_n(w) (see Exercise 3.7), and the weight update is

    w ← w − η ∇e_n(w).

Insight into why SGD works can be gained by looking at the expected value of the change in the weight (the expectation is with respect to the random point that is selected). Since n is picked uniformly at random from {1, . . . , N}, the expected weight change is

    −η · (1/N) Σ_{n=1}^{N} ∇e_n(w).

This is exactly the same as the deterministic weight change from the batch gradient descent weight update. That is, 'on average' the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out. The computational cost is cheaper by a factor of N, though, since we compute the gradient for only one point per iteration, rather than for all N points as we do in batch gradient descent.

Notice that SGD is similar to PLA in that it decreases the error with respect to one data point at a time. Minimizing the error on one data point may interfere with the error on the rest of the data points that are not considered at that iteration. However, also similar to PLA, the interference cancels out on average, as we have just argued.
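Here is a minimal sketch of SGD for the logistic regression error, under the same assumptions and notation as the batch version above (our own function names; one randomly chosen point per update, with a fixed learning rate).

```python
import numpy as np

def logistic_regression_sgd(X, y, eta=0.1, num_epochs=100, rng=None):
    """Stochastic gradient descent on e_n(w) = ln(1 + exp(-y_n w.x_n))."""
    rng = np.random.default_rng() if rng is None else rng
    N, d_plus_1 = X.shape
    w = np.zeros(d_plus_1)
    for _ in range(num_epochs * N):
        n = rng.integers(N)                      # pick one point uniformly at random
        x_n, y_n = X[n], y[n]
        # gradient of the single-point error e_n(w)
        g_n = -y_n * x_n / (1.0 + np.exp(y_n * (w @ x_n)))
        w = w - eta * g_n                        # update using only this point
    return w
```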
Exercise 3.10
(a) Define the error on a single data point (x_n, y_n) to be

    e_n(w) = max(0, −y_n wᵀx_n).

Argue that PLA can be viewed as SGD on e_n with learning rate η = 1.
(b) For logistic regression with a very large w, argue that minimizing E_in using SGD is similar to PLA. This is another indication that the logistic regression weights can be used as a good approximation for classification.
SGD is successful in practice, often beating the batch version and other more sophisticated algorithms. In fact, SGD was an important part of the algorithm that won the million-dollar Netflix competition, discussed in Section 1.1. It scales well to large data sets, and is naturally suited to online learning, where
a stream of data presents itself to the learning algorithm sequentially. The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface. However, it is challenging to choose a suitable termination criterion for SGD. A good stopping criterion should consider the total error on all the data, which can be computationally demanding to evaluate at each iteration.

3.4 Nonlinear Transformation
All formulas for the linear model have used the sum

    wᵀx = Σ_{i=0}^{d} w_i x_i     (3.11)

as the main quantity in computing the hypothesis output. This quantity is linear, not only in the x_i's but also in the w_i's. A closer inspection of the corresponding learning algorithms shows that the linearity in w is the key property for deriving these algorithms; the x_i's are just constants as far as the algorithm is concerned. This observation opens the possibility for allowing nonlinear versions of the x_i's while still remaining in the analytic realm of linear models, because the form of Equation (3.11) remains linear in the w_i parameters.

Consider the credit limit problem, for instance. It makes sense that the 'years in residence' field would affect a person's credit, since it is correlated with stability. However, it is less plausible that the credit limit would grow linearly with the number of years in residence. More plausibly, there is a threshold (say 1 year) below which the credit limit is affected negatively and another threshold (say 5 years) above which the credit limit is affected positively. If x_i is the input variable that measures years in residence, then two nonlinear 'features' derived from it, namely [[x_i < 1]] and [[x_i > 5]], would allow a linear formula to reflect the credit limit better.

We have already seen the use of features in the classification of handwritten digits, where intensity and symmetry features were derived from input pixels. Nonlinear transforms can be further applied to those features, as we will see shortly, creating more elaborate features and improving the performance. The scope of linear methods expands significantly when we represent the input by a set of appropriate features.

3.4.1 The Z Space
Consider the situation in Figure 3.1(b) where a linear classifier can't fit the data. By transforming the inputs x_1, x_2 in a nonlinear fashion, we will be able to separate the data with more complicated boundaries while still using the
simple PLA as a building block. Let's start by looking at the circle in Figure 3.5(a), which is a replica of the non-separable case in Figure 3.1(b). The circle represents the equation

    x_1² + x_2² = 0.6.

That is, the nonlinear hypothesis h(x) = sign(−x_1² − x_2² + 0.6) separates the data set perfectly. We can view the hypothesis as a linear one after applying a nonlinear transformation on x. In particular, consider z_0 = 1, z_1 = x_1², and z_2 = x_2²:

    h(x) = sign( 0.6 · 1 + (−1) · x_1² + (−1) · x_2² )
         = sign( w̃_0 z_0 + w̃_1 z_1 + w̃_2 z_2 )
         = sign( w̃ᵀz ),

where the vector z is obtained from x through a nonlinear transform Φ,

    z = Φ(x).

We can plot the data in terms of z instead of x, as depicted in Figure 3.5(b). For instance, the point x_1 in Figure 3.5(a) is transformed to the point z_1 = Φ(x_1) in Figure 3.5(b), and the point x_2 is transformed to the point z_2 = Φ(x_2). The space Z, which contains the z vectors, is referred to as the feature space, since its coordinates are higher-level features derived from the raw input x. We designate different quantities in Z with a tilde version of their counterparts in X, e.g., the dimensionality of Z is d̃ and the weight vector is w̃.⁵ The transform Φ that takes us from X to Z is called a feature transform, which in this case is

    Φ(x) = (1, x_1², x_2²).     (3.12)

In general, some points in the Z space may not be valid transforms of any x ∈ X, and multiple points in X may be transformed to the same z ∈ Z, depending on the nonlinear transform Φ. The useful feature of the transform above is that the nonlinear hypothesis h (a circle) in the X space can be represented by a linear hypothesis (a line) in the Z space. Indeed, any linear hypothesis h̃ in Z corresponds to a (possibly nonlinear) hypothesis on x given by

    h(x) = h̃(Φ(x)).

⁵ Here Z = {1} × R²; we treat Z as d̃-dimensional since the added coordinate z_0 = 1 is fixed.
Figure 3.5: (a) The original data set that is not linearly separable, but separable by a circle. (b) The transformed data set that is linearly separable in the Z space. In the figure, x_1 maps to z_1 and x_2 maps to z_2; the circular separator in the X space maps to the linear separator in the Z space.

The set of these hypotheses h is denoted by H_Φ. For instance, when using the feature transform Φ in (3.12), each h ∈ H_Φ is a quadratic curve in X that corresponds to some line h̃ in Z.

Exercise 3.11
Consider the feature transform Φ in (3.12). What kind of boundary in X does a hyperplane w̃ in Z correspond to in the following cases of the signs of the weights w̃_0, w̃_1, w̃_2? Draw a picture that illustrates an example of each case.
Because the transformed data set (z_1, y_1), . . . , (z_N, y_N) in Figure 3.5(b) is linearly separable in the feature space Z, we can apply PLA on the transformed data set to obtain the PLA solution w̃, which gives us a final hypothesis g(x) = sign(w̃ᵀz) in the X space, where z = Φ(x). The whole process of applying the feature transform before running PLA for linear classification is depicted in Figure 3.6. The in-sample error in the input space X is the same as in the feature space Z, so E_in(g) = 0. Hyperplanes that achieve E_in(w̃) = 0 in Z correspond to separating curves in the original input space X. For instance,
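The following sketch (our own code, with a hypothetical `pla` helper standing in for the perceptron learning algorithm of Chapter 1) shows the mechanics of classifying with a feature transform: transform the inputs, run the linear algorithm in Z, and classify new points by transforming them first.

```python
import numpy as np

def phi_circle(X):
    """Feature transform (3.12): x = (x1, x2) -> z = (1, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1**2, x2**2])

def fit_in_z_space(X, y, pla):
    """Run a linear classifier (e.g. PLA) on the transformed data.

    `pla` is assumed to take (Z, y) and return a weight vector w_tilde.
    """
    Z = phi_circle(X)
    w_tilde = pla(Z, y)
    # final hypothesis in the original X space: g(x) = sign(w_tilde . phi(x))
    return lambda X_new: np.sign(phi_circle(X_new) @ w_tilde)
```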
[Figure: the nonlinear transform pipeline. 1. Original data in X. 2. Transform the data, z_n = Φ(x_n). 3. Separate the data in Z space with a line g̃(z) = sign(w̃ᵀz). 4. Classify in X space using g(x) = g̃(Φ(x)) = sign(w̃ᵀΦ(x)).]
Figure 3.6: The nonlinear transform for separating non-separable data.

as shown in Figure 3.6, the PLA may select the line w̃ = (0.6, −0.6, −1) that separates the transformed data (z_1, y_1), . . . , (z_N, y_N). The corresponding hypothesis g(x) = sign(0.6 − 0.6·x_1² − x_2²) will separate the original data (x_1, y_1), . . . , (x_N, y_N). In this case, the decision boundary is an ellipse in X.

How does the feature transform affect the VC bound (3.1)? If we honestly decide on the transform Φ before seeing the data, then with probability at least 1 − δ, the bound (3.1) remains true by using d_VC(H_Φ) as the VC dimension. For instance, consider the feature transform Φ in (3.12). We know that Z = {1} × R². Since H_Φ̃ is the perceptron in Z, d_VC(H_Φ) ≤ 3 (the inequality is because some points z ∈ Z may not be valid transforms of any x, so some dichotomies may not be realizable). We can then substitute N, d_VC(H_Φ), and δ into the VC bound. After running PLA on the transformed data set, if we succeed in
getting some g with E_in(g) = 0, we can claim that g will perform well out of sample.

It is very important to understand that the claim above is valid only if you decide on Φ before seeing the data or trying any algorithms. What if we first try using lines to separate the data, fail, and then use the circles? Then we are effectively using a model that contains both lines and circles, and d_VC is no longer 3.
Exercise 3.12
We know that in the Euclidean plane, the perceptron model H cannot implement all 16 dichotomies on 4 points. That is, m_H(4) < 16. Take the feature transform Φ in (3.12).
(a) Show that m_{H_Φ}(3) = 8.
(b) Show that m_{H_Φ}(4) < 16.
(c) Show that m_{H ∪ H_Φ}(4) = 16.
That is, if you use lines, d_VC = 3; if you use ellipses, d_VC = 3; if you use lines and ellipses, d_VC > 3.
Worse yet, if you actually look at the data (e.g., look at the data points) before deciding on a suitable Φ, you forfeit most of what you learned in Chapter 2. You have inadvertently explored a huge hypothesis space in your mind to come up with a specific Φ that would work for this data set. If you invoke a generalization bound now, you will be charged for the VC dimension of the full space that you explored in your mind, not just the space that Φ creates.

This does not mean that Φ should be chosen blindly. In the credit limit problem, for instance, we suggested nonlinear features based on the 'years in residence' field that may be more suitable for linear regression than the raw input. This was based on our understanding of the problem, not on 'snooping' into the training data. Therefore, we pay no price in terms of generalization, and we may well gain a dividend in performance because of a good choice of features.

The feature transform Φ can be general, as long as it is chosen before seeing the data set (as if we cannot emphasize this enough). For instance, you may have noticed that the feature transform in (3.12) only allows us to get very limited types of quadratic curves. Ellipses that do not center at the origin in X cannot correspond to a hyperplane in Z. To get all possible quadratic curves in X, we could consider the more general feature transform z = Φ₂(x),

    Φ₂(x) = (1, x_1, x_2, x_1², x_1 x_2, x_2²),     (3.13)

which gives us the flexibility to represent any quadratic curve in X by a hyperplane in Z (the subscript 2 of Φ is for polynomials of degree 2, i.e. quadratic curves). The price we pay is that Z is now five-dimensional instead of two-dimensional, and hence d_VC is doubled from 3 to 6.
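A minimal sketch of the general second-order transform (3.13), in the same style as the earlier snippet (our own function name):

```python
import numpy as np

def phi_2(X):
    """Second-order feature transform (3.13):
    (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])
```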
Exercise 3.13
Consider the feature transform z = Φ₂(x) in (3.13). How can we use a hyperplane w̃ in Z to represent the following boundaries in X?
(a) The parabola (x_1 − 3)² = x_2
(b) The circle (x_1 − 3)² + (x_2 − 4)² = 1
(c) The ellipse 2(x_1 − 3)² + (x_2 − 4)² = 1
(d) The hyperbola (x_1 − 3)² − (x_2 − 4)² = 1
(e) The ellipse 2(x_1 + x_2 − 3)² + (x_2 − x_1 − 4)² = 1
(f) The line 2x_1 + x_2 = 1
One may further extend Φ₂ to a feature transform Φ₃ for cubic curves in X, or more generally define the feature transform Φ_Q for degree-Q curves in X. The feature transform Φ_Q is called the Qth order polynomial transform.

The power of the feature transform should be used with care. It may not be worth it to insist on linear separability and employ a highly complex surface to achieve that. Consider the case of Figure 3.1(a). If we insist on a feature transform that linearly separates the data, it may lead to a significant increase of the VC dimension. As we see in Figure 3.7, no line can separate the training examples perfectly, and neither can any quadratic nor any third-order polynomial curve. Thus, we need to use a fourth-order polynomial transform:

    Φ₄(x) = (1, x_1, x_2, x_1², x_1 x_2, x_2², x_1³, x_1² x_2, x_1 x_2², x_2³, x_1⁴, x_1³ x_2, x_1² x_2², x_1 x_2³, x_2⁴).
If you look at the fourth-order decision boundary in Figure 3.7(b), you don't need the VC analysis to tell you that this is an overkill that is unlikely to generalize well to new data. A better option would have been to ignore the two misclassified examples in Figure 3.7(a), separate the other examples perfectly with the line, and accept the small but nonzero E_in. Indeed, sometimes our best bet is to go with a simpler hypothesis set while tolerating a small E_in.

While our discussion of feature transforms has focused on classification problems, these transforms can be applied equally to regression problems. Both linear regression and logistic regression can be implemented in the feature space Z instead of the input space X. For instance, linear regression is often coupled with a feature transform to perform nonlinear regression. The N by (d+1) input matrix X in the algorithm is replaced with the N by (d̃+1) matrix Z, while the output vector y remains the same.
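As a sketch of this idea, the code below (our own helper names, reusing the `phi_2` transform from the earlier snippet) performs nonlinear regression by running the pseudo-inverse solution of Chapter 3 on the transformed matrix Z instead of X.

```python
import numpy as np

def nonlinear_regression(X, y, phi):
    """Linear regression in the feature space Z = phi(X)."""
    Z = phi(X)                                    # N x (d~+1) transformed inputs
    w_tilde = np.linalg.pinv(Z) @ y               # pseudo-inverse solution in Z
    return lambda X_new: phi(X_new) @ w_tilde     # hypothesis evaluated via phi
```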
3.4.2 Computation and Generalization

Although using a larger Q gives us more flexibility in terms of the shape of decision boundaries in X, there is a price to be paid. Computation is one issue, and generalization is the other.

Computation is an issue because the feature transform Φ_Q maps a two-dimensional vector x to d̃ = Q(Q+3)/2 dimensions, which increases the memory and computational costs. Things could get worse if x is in a higher dimension to begin with.
Figure 3.7: Illustration of the nonlinear transform using a data set that is not linearly separable: (a) a line separates the data after omitting a few points; (b) a fourth-order polynomial separates all the points.
Exercise 3.14
Consider the Qth order polynomial transform Φ_Q for X = R^d. What is the dimensionality d̃ of the feature space Z (excluding the fixed coordinate z_0 = 1)? Evaluate your result on d ∈ {3, 5} and Q ∈ {3, 5}.
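For readers who want to check their answer numerically, the dimensionality asked for here is the number of monomials of degree 1 through Q in d variables; a small sketch (our own code) counts it with a binomial coefficient:

```python
from math import comb

def polynomial_feature_dim(d, Q):
    """Number of coordinates of Phi_Q on R^d, excluding the fixed z0 = 1.

    Monomials of degree <= Q in d variables number comb(d + Q, Q);
    subtracting 1 removes the constant monomial.
    """
    return comb(d + Q, Q) - 1

print(polynomial_feature_dim(2, 2))   # 5, matching the transform in (3.13)
print(polynomial_feature_dim(2, 4))   # 14, matching Phi_4 above
```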
The other important issue is generalization. If Φ_Q is the feature transform of a two-dimensional input space, there will be d̃ = Q(Q+3)/2 dimensions in Z, and d_VC(H_Φ_Q) can be as high as d̃ + 1. This means that the second term in the VC bound (3.1) can grow significantly. In other words, we would have a weaker guarantee that E_out will be small. For instance, if we use Φ₅₀, the VC dimension of H_Φ₅₀ could be as high as (50² + 3·50)/2 + 1 = 1326, instead of the original d_VC = 3. Applying the rule of thumb that the amount of data needed is proportional to the VC dimension, we would need hundreds of times more data than we would if we didn't use a feature transform, in order to achieve the same level of generalization error.
Exercise 3.15
High-dimensional feature transforms are by no means the only transforms that we can use. We can take the tradeoff in the other direction, and use low-dimensional feature transforms as well (to achieve an even lower generalization error bar).
Consider the following feature transform, which maps a d-dimensional x to a one-dimensional z, keeping only the kth coordinate of x:

    Φ_(k)(x) = (1, x_k).     (3.14)

Let H_k be the set of perceptrons in the feature space.
(a) Prove that d_VC(H_k) = 2.
(b) Prove that d_VC(∪_{k=1}^{d} H_k) ≤ 2(log₂ d + 1).
H_k is called the decision stump model on dimension k.

The problem of generalization when we go to a high-dimensional feature space is sometimes balanced by the advantage we get in approximating the target better. As we have seen in the case of using quadratic curves instead of lines, the transformed data became linearly separable, reducing E_in to 0. In general, when choosing the appropriate dimension for the feature transform, we cannot avoid the approximation-generalization tradeoff:

    higher d̃:  better chance of being linearly separable (E_in ≈ 0)
    lower d̃:   possibly not linearly separable (E_in > 0)
Therefore, choosing a feature transform before seeing the data is a non-trivial task. When we apply learning to a particular problem, some understanding of the problem can help in choosing features that work well. More generally, there are some guidelines for choosing a suitable transform, or a suitable model, which we will discuss in Chapter 4.
Exercise 3.16
Write down the steps of the algorithm that combines Φ₃ with linear regression. How about using Φ₁₀ instead? Where is the main computational bottleneck of the resulting algorithm?
Example 3.5. Let's revisit the handwritten digit recognition example. We can try a different way of decomposing the big task of separating ten digits into smaller tasks. One decomposition is to separate digit 1 from all the other digits. Using intensity and symmetry as our input variables, like we did before, the scatter plot of the training data is shown next. A line can roughly separate digit 1 from the rest, but a more complicated curve might do better.
We use linear regression for classification, first without any feature transform. The results are shown below (left). We get E_in = 2.13% and E_out = 2.38%.

[Figure: classification of the digit data ('1' versus 'not 1') using the linear model (E_in = 2.13%, E_out = 2.38%) and the third-order polynomial model (E_in = 1.75%, E_out = 1.87%); the axes are average intensity and symmetry.]
When we run linear regression with Φ₃, the third-order polynomial transform, we obtain a better fit to the data, with a lower E_in = 1.75%. The result is depicted on the right-hand side of the figure. In this case, the better in-sample fit also resulted in a better out-of-sample performance, with E_out = 1.87%.
Linear models, a final pitch. The linear model for classification or regression is an often overlooked resource in the arena of learning from data. Since efficient learning algorithms exist for linear models, they are low overhead. They are also very robust and have good generalization properties. A sound
policy to follow when learning from data is to first try a linear model. Because of the good generalization properties of linear models, not much can go wrong. If you get a good fit to the data (low E_in), then you are done. If you do not get a good enough fit to the data and decide to go for a more complex model, you will pay a price in terms of the VC dimension, as we have seen in Exercise 3.12, but the price is modest.
3.5 Problems
Problem 3.1
Consider the double semi-circle 'toy' learning task below. There are two semi-circles of width thk with inner radius rad, separated by sep as shown (red is −1 and blue is +1). The center of the top semi-circle is aligned with the middle of the edge of the bottom semi-circle. This task is linearly separable when sep ≥ 0, and not so for sep < 0. Set rad = 10, thk = 5 and sep = 5. Then, generate 2,000 examples uniformly, which means you will have approximately 1,000 examples for each class.
(a) Run the PLA starting from w = 0 until it converges. Plot the data and the final hypothesis.
(b) Repeat part (a) using the linear regression (for classification) to obtain w. Explain your observations.
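For readers implementing this problem, here is one way (our own code, not the book's) to generate the 2,000 examples uniformly over the two semi-circular regions using the rad, thk and sep parameters above; which class sits on top is an arbitrary choice here.

```python
import numpy as np

def generate_double_semicircle(N=2000, rad=10.0, thk=5.0, sep=5.0, rng=None):
    """Sample N points uniformly from the two semi-circle regions."""
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.choice([-1.0, 1.0], size=N)
    # uniform in an annulus: radius = sqrt of uniform on [rad^2, (rad+thk)^2]
    r = np.sqrt(rng.uniform(rad**2, (rad + thk)**2, size=N))
    theta = rng.uniform(0.0, np.pi, size=N)          # top half-annulus angles
    theta = np.where(labels > 0, theta, theta + np.pi)
    # bottom semi-circle shifted right by rad + thk/2 and down by sep
    x1 = r * np.cos(theta) + np.where(labels > 0, 0.0, rad + thk / 2)
    x2 = r * np.sin(theta) - np.where(labels > 0, 0.0, sep)
    return np.column_stack([x1, x2]), labels
```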
Problem 3.2
For the double semi-circle task in Problem 3.1, vary sep in the range {0.2, 0.4, . . . , 5}. Generate 2,000 examples and run the PLA starting with w = 0. Record the number of iterations PLA takes to converge. Plot sep versus the number of iterations taken for PLA to converge. Explain your observations. [Hint: Problem 1.3.]
Problem 3.3
For the double semi-circle task in Problem 3.1, set sep = −5 and generate 2,000 examples.
(a) What will happen if you run PLA on those examples?
(b) Run the pocket algorithm for 100,000 iterations and plot E_in versus the iteration number t.
(c) Plot the data and the final hypothesis in part (b).
(d) Use the linear regression algorithm to obtain the weights w, and compare this result with the pocket algorithm in terms of computation time and quality of the solution.
(e) Repeat (b)-(d) with a 3rd order polynomial feature transform.
Problem 3.4
In Problem 1.5, we introduced the Adaptive Linear Neuron (Adaline) algorithm for classification. Here, we derive Adaline from an optimization perspective.
(a) Consider E_n(w) = (max(0, 1 − y_n wᵀx_n))². Show that E_n(w) is continuous and differentiable. Write down the gradient ∇E_n(w).
(b) Show that E_n(w) is an upper bound for [[sign(wᵀx_n) ≠ y_n]]. Hence, (1/N) Σ_{n=1}^{N} E_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Argue that the Adaline algorithm in Problem 1.5 performs stochastic gradient descent on (1/N) Σ_{n=1}^{N} E_n(w).
Problem 3.5
(a) Consider E_n(w) = max(0, 1 − y_n wᵀx_n). Show that E_n(w) is continuous and differentiable except when y_n = wᵀx_n.
(b) Show that E_n(w) is an upper bound for [[sign(wᵀx_n) ≠ y_n]]. Hence, (1/N) Σ_{n=1}^{N} E_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Apply stochastic gradient descent on (1/N) Σ_{n=1}^{N} E_n(w) (ignoring the singular case of wᵀx_n = y_n) and derive a new perceptron learning algorithm.
Problem 3.6
Derive a linear programming algorithm to fit a linear model for classification using the following steps. A linear program is an optimization problem of the following form:

    min_z cᵀz   subject to   Az ≤ b.

A, b and c are parameters of the linear program and z is the optimization variable. This is such a well studied optimization problem that most mathematics software packages have canned optimization functions which solve linear programs.
(a) For linearly separable data, show that for some w, y_n(wᵀx_n) ≥ 1 for n = 1, . . . , N.
(b) Formulate the task of finding a separating w for separable data as a linear program. You need to specify what the parameters A, b, c are and what the optimization variable z is.
(c) If the data is not separable, the condition in (a) cannot hold for every n. Thus, introduce the violation ξ_n ≥ 0 to capture the amount of violation for example x_n. So, for n = 1, . . . , N,

    y_n(wᵀx_n) ≥ 1 − ξ_n,
    ξ_n ≥ 0.

Naturally, we would like to minimize the amount of violation. One intuitive approach is to minimize Σ_{n=1}^{N} ξ_n, i.e., we want w that solves

    min_{w, ξ} Σ_{n=1}^{N} ξ_n
    subject to   y_n(wᵀx_n) ≥ 1 − ξ_n,  ξ_n ≥ 0,

where the inequalities must hold for n = 1, . . . , N. Formulate this problem as a linear program.
(d) Argue that the linear program you derived in (c) and the optimization problem in Problem 3.5 are equivalent.
Problem 3.7
Use the linear programming algorithm from Problem 3.6 on the learning task in Problem 3.1 for the separable (sep = 5) and the non-separable (sep = −5) cases. Compare your results to the linear regression approach, with and without the 3rd order polynomial feature transform.
Problem 3.8
For linear regression, the out-of-sample error is

    E_out(h) = E[(h(x) − y)²].

Show that among all hypotheses, the one that minimizes E_out is given by

    h*(x) = E[y | x].

The function h* can be treated as a deterministic target function, in which case we can write y = h*(x) + ε(x), where ε(x) is an (input-dependent) noise variable. Show that ε(x) has expected value zero.
Problem 3.9
Assuming that XᵀX is invertible, show by direct comparison with Equation (3.4) that E_in(w) can be written as

    E_in(w) = (w − (XᵀX)⁻¹Xᵀy)ᵀ (XᵀX) (w − (XᵀX)⁻¹Xᵀy) + yᵀ(I − X(XᵀX)⁻¹Xᵀ)y.

Use this expression for E_in to obtain w_lin. What is the in-sample error? [Hint: The matrix XᵀX is positive definite.]
Problem 3.10
Exercise 3.3 studied some properties of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is an N by d+1 matrix and XᵀX is invertible. Show the following additional properties.
(a) Every eigenvalue of H is either 0 or 1. [Hint: Exercise 3.3(b).]
(b) Show that the trace of a symmetric matrix equals the sum of its eigenvalues. [Hint: Use the spectral theorem and the cyclic property of the trace. Note that the same result holds for non-symmetric matrices, but is a little harder to prove.]
(c) How many eigenvalues of H are 1? What is the rank of H? [Hint: Exercise 3.3(d).]
Problem 3.11
Consider the linear regression problem setup in Exercise 3.4, where the data comes from a genuine linear relationship with added noise. The noise for the different data points is assumed to be iid with zero mean and variance σ². Assume that the 2nd moment matrix Σ = E_x[xxᵀ] is non-singular. Follow the steps below to show that, with high probability, the out-of-sample error on average is

    E_out(w_lin) = σ² ( 1 + (d+1)/N + o(1/N) ).

(a) For a test point x, show that the error y − g(x) is

    ε − xᵀ(XᵀX)⁻¹Xᵀ ε,

where ε is the noise realization for the test point and ε is the vector of noise realizations on the data.
(b) Take the expectation with respect to the test point, i.e., x and ε, to obtain an expression for E_out. Show that

    E_out = σ² + trace( Σ (XᵀX)⁻¹ Xᵀ ε εᵀ X (XᵀX)⁻¹ ).

[Hints: a = trace(a) for any scalar a; trace(AB) = trace(BA); expectation and trace commute.]
(c) What is E_ε[ε εᵀ]?
(d) Take the expectation with respect to ε to show that, on average,

    E_out = σ² + (σ²/N) trace( Σ ((1/N) XᵀX)⁻¹ ).

Note that (1/N) XᵀX = (1/N) Σ_{n=1}^{N} x_n x_nᵀ is an N-sample estimate of Σ, so (1/N) XᵀX ≈ Σ. If (1/N) XᵀX = Σ, then what is E_out on average?
(e) Show that (after taking the expectation over the data noise) with high probability,

    E_out = σ² ( 1 + (d+1)/N + o(1/N) ).

[Hint: By the law of large numbers, (1/N) XᵀX converges in probability to Σ, and so by continuity of the inverse at Σ, ((1/N) XᵀX)⁻¹ converges in probability to Σ⁻¹.]
Problem 3.12
In linear regression, the in-sample predictions are given by ŷ = Hy, where H = X(XᵀX)⁻¹Xᵀ. Show that H is a projection matrix, i.e. H² = H. So ŷ is the projection of y onto some space. What is this space?
Problem 3.13
This problem creates a linear regression algorithm from a good algorithm for linear classification. As illustrated, the idea is to take the original data and shift it in one direction to get the +1 data points; then, shift it in the opposite direction to get the −1 data points.

[Figure: the original data for the one-dimensional regression problem; the shifted data viewed as a two-dimensional classification problem.]

More generally, the data (x_n, y_n) can be viewed as data points in R^{d+1} by treating the y value as the (d+1)th coordinate.
Now, construct positive and negative points

    D_+ = (x_1, y_1) + a, . . . , (x_N, y_N) + a,
    D_− = (x_1, y_1) − a, . . . , (x_N, y_N) − a,

where a is a perturbation parameter. You can now use the linear programming algorithm in Problem 3.6 to separate D_+ from D_−. The resulting separating hyperplane can be used as the regression 'fit' to the original data.
(a) How many weights are learned in the classification problem? How many weights are needed for the linear fit in the regression problem?
(b) The linear fit requires weights w, where h(x) = wᵀx. Suppose the weights returned by solving the classification problem are w_class. Derive an expression for w as a function of w_class.
(c) Generate a data set with N = 50, where x_n is uniform on [0, 1] and y_n is a target value plus zero-mean Gaussian noise with σ = 0.1. Plot D_+ and D_−.
(d) Give comparisons of the resulting fits from running the classification approach and the analytic pseudo-inverse algorithm for linear regression.
Problem 3.14
In a regression setting, assume the target function is linear, so f(x) = xᵀw_f, and y = Xw_f + ε, where the entries in ε are zero mean, iid with variance σ². In this problem, derive the bias and variance as follows.
(a) Show that the average function is ḡ(x) = f(x), no matter what the size of the data set, as long as XᵀX is invertible. What is the bias?
(b) What is the variance? [Hint: Problem 3.11.]
Problem 3.15
In the text we derived that the linear regression solution weights must satisfy XᵀXw = Xᵀy. If XᵀX is not invertible, the solution w_lin = (XᵀX)⁻¹Xᵀy won't work. In this event, there will be many solutions for w that minimize E_in. Here, you will derive one such solution. Let ρ be the rank of X. Assume that the singular value decomposition (SVD) of X is X = UΓVᵀ, where U ∈ R^{N×ρ} satisfies UᵀU = I_ρ, V ∈ R^{(d+1)×ρ} satisfies VᵀV = I_ρ, and Γ ∈ R^{ρ×ρ} is a positive diagonal matrix.
(a) Show that ρ < d + 1.
(b) Show that w_lin = VΓ⁻¹Uᵀy satisfies XᵀXw_lin = Xᵀy, and hence is a solution.
(c) Show that for any other solution that satisfies XᵀXw = Xᵀy, ‖w_lin‖ < ‖w‖. That is, the solution we have constructed is the minimum norm set of weights that minimizes E_in.
Problem 3.16
In Example 3.4, it is mentioned that the output of the final hypothesis g(x) learned using logistic regression can be thresholded to get a 'hard' (±1) classification. This problem shows how to use the risk matrix introduced in Example 1.1 to obtain such a threshold.

Consider fingerprint verification, as in Example 1.1. After learning from the data using logistic regression, you produce the final hypothesis

    g(x) = P[y = +1 | x],

which is your estimate of the probability that y = +1. Suppose that the cost matrix is given by

                         True classification
                  +1 (correct person)    −1 (intruder)
    you say  +1          0                   c_a
             −1          c_r                 0

For a new person with fingerprint x, you compute g(x) and you now need to decide whether to accept or reject the person (i.e., you need a hard classification). So, you will accept if g(x) ≥ κ, where κ is the threshold.
(a) Define cost(accept) as your expected cost if you accept the person. Similarly define cost(reject). Show that

    cost(accept) = (1 − g(x)) c_a,
    cost(reject) = g(x) c_r.

(b) Use part (a) to derive a condition on g(x) for accepting the person and hence show that

    κ = c_a / (c_a + c_r).

(c) Use the cost matrices for the Supermarket and CIA applications in Example 1.1 to compute the threshold κ for each of these two cases. Give some intuition for the thresholds you get.
Problem 3.17
Consider a function

    E(u, v) = e^u + e^{2v} + e^{uv} + u² − 3uv + 4v² − 3u − 5v.

(a) Approximate E(u + Δu, v + Δv) by Ê₁(Δu, Δv), where Ê₁ is the first-order Taylor expansion of E around (u, v) = (0, 0). Suppose Ê₁(Δu, Δv) = a_u Δu + a_v Δv + a. What are the values of a_u, a_v, and a?
(b) Minimize Ê₁ over all possible (Δu, Δv) such that ‖(Δu, Δv)‖ = 0.5. In this chapter, we proved that the optimal column vector [Δu*, Δv*]ᵀ is parallel to the column vector −∇E(u, v), which is called the negative gradient direction. Compute the optimal (Δu, Δv) and the resulting E(u + Δu, v + Δv).
(c) Approximate E(u + Δu, v + Δv) by Ê₂(Δu, Δv), where Ê₂ is the second-order Taylor expansion of E around (u, v) = (0, 0). Suppose

    Ê₂(Δu, Δv) = b_{uu}(Δu)² + b_{vv}(Δv)² + b_{uv}(Δu)(Δv) + b_u Δu + b_v Δv + b.

What are the values of b_{uu}, b_{vv}, b_{uv}, b_u, b_v, and b?
(d) Minimize Ê₂ over all possible (Δu, Δv) (regardless of length). Use the fact that ∇²E(u, v) (the Hessian matrix at (0, 0)) is positive definite to prove that the optimal column vector

    [Δu*, Δv*]ᵀ = −(∇²E(u, v))⁻¹ ∇E(u, v),

which is called the Newton direction.
(e) Numerically compute the following values: (i) the vector (Δu, Δv) of length 0.5 along the Newton direction, and the resulting E(u + Δu, v + Δv); (ii) the vector (Δu, Δv) of length 0.5 that minimizes E(u + Δu, v + Δv), and the resulting E(u + Δu, v + Δv). (Hint: Let Δu = 0.5 sin θ.) Compare the values of E(u + Δu, v + Δv) in (b), (e-i), and (e-ii). Briefly state your findings.

The negative gradient direction and the Newton direction are quite fundamental for designing optimization algorithms. It is important to understand these directions and put them in your toolbox for designing learning algorithms.
Problem 3.18
Take the feature transform Φ₂ in Equation (3.13) as Φ.
(a) Show that d_VC(H_Φ) ≤ 6.
(b) Show that d_VC(H_Φ) > 4. [Hint: Exercise 3.12.]
(c) Give an upper bound on d_VC(H_Φ_k) for X = R^d.
(d) Define a transform Φ̃₂ that maps x into a higher-dimensional space than Φ₂ does, while spanning the same set of quadratic curves. Argue that d_VC(H_Φ̃₂) = d_VC(H_Φ₂). In other words, while the dimension of Φ̃₂ exceeds that of Φ₂, the VC dimension does not increase. Thus, the dimension of Φ only gives an upper bound of d_VC(H_Φ), and the exact value of d_VC(H_Φ) can depend on the components of the transform.
Problem 3.19
A Transformer thinks the following procedures would work well in learning from two-dimensional data sets of any size. Please point out if there are any potential problems in the procedures:
(a) Use the feature transform

    Φ(x) = (0, . . . , 0, 1, 0, . . . , 0)  (with the 1 in the nth position) if x = x_n, and (0, . . . , 0) otherwise,

before running PLA.
(b) Use a feature transform Φ whose components are centered at the data points x_n, using some very small scale γ, before running PLA.
(c) Use a feature transform Φ that consists of components centered at all the integer grid points (i, j), with i ∈ {. . . , −1, 0, 1, . . .} and j ∈ {. . . , −1, 0, 1, . . .}, before running PLA, with γ = 1.
Chapter 4

Overfitting

Paraskavedekatriaphobia¹ (fear of Friday the 13th), and superstitions in general, are perhaps the most illustrious cases of the human ability to overfit. Unfortunate events are memorable, and given a few such memorable events, it is natural to try and find an explanation. In the future, will there be more unfortunate events on Friday the 13ths than on any other day?

Overfitting is the phenomenon where fitting the observed facts (data) well no longer indicates that we will get a decent out-of-sample error, and may actually lead to the opposite effect. You have probably seen cases of overfitting when the learning model is more complex than is necessary to represent the target function. The model uses its additional degrees of freedom to fit idiosyncrasies in the data (for example, noise), yielding a final hypothesis that is inferior. Overfitting can occur even when the hypothesis set contains only functions which are far simpler than the target function, and so the plot thickens.

The ability to deal with overfitting is what separates professionals from amateurs in the field of learning from data. We will cover three themes: When does overfitting occur? What are the tools to combat overfitting? How can one estimate the degree of overfitting and 'certify' that a model is good, or better than another? Our emphasis will be on techniques that work well in practice.
4.1 When Does Overfitting Occur?

Overfitting literally means "fitting the data more than is warranted." The main case of overfitting is when you pick the hypothesis with lower E_in, and it results in higher E_out. This means that E_in alone is no longer a good guide for learning. Let us start by identifying the cause of overfitting.

¹ From the Greek paraskevi (Friday), dekatreis (thirteen), phobia (fear).
Consider a simple one-dimensional regression problem with five data points. We do not know the target function, so let's select a general model, maximizing our chance to capture the target function. Since 5 data points can be fit by a 4th order polynomial, we select 4th order polynomials. The result is shown on the right.

The target function is a 2nd order polynomial (the blue curve), with a little added noise in the data points. Though the target is simple, the learning algorithm used the power of the 4th order polynomial to fit the data exactly, but the result does not look anything like the target function. The data has been 'overfit.' The little noise in the data has misled the learning, for if there were no noise, the fitted red curve would exactly match the target. This is a typical overfitting scenario, in which a complex model uses its additional degrees of freedom to 'learn' the noise.

The fit has zero in-sample error but huge out-of-sample error, so this is a case of bad generalization (as discussed in Chapter 2), a key outcome when overfitting is occurring. However, our definition of overfitting goes beyond bad generalization for any given hypothesis. Instead, overfitting applies to a process: in this case, the process of picking a hypothesis with lower and lower E_in resulting in higher and higher E_out.

4.1.1 A Case Study: Overfitting with Polynomials
Let's dig deeper to gain a better understanding of when overfitting occurs. We will illustrate the main concepts using data in one dimension and polynomial regression, a special case of a linear model that uses the feature transform x → (1, x, x², . . .). Consider the two regression problems below:

[Figure: (a) a 10th order target function with noisy data; (b) a 50th order target function with noiseless data.]

In both problems, the target function is a polynomial and the data set contains 15 data points. In (a), the target function is a 10th order polynomial
Figure 4.1: Fits using 2nd and 10th order polynomials to 15 data points. In (a), the data are noisy and the target is a 10th order polynomial. In (b), the data are noiseless and the target is a 50th order polynomial. (Each plot shows the data, the 2nd order fit, and the 10th order fit.)
and the sampled data are noisy (the data do not lie on the target function curve). In (b), the target function is a 50th order polynomial and the data are noiseless. The best 2nd and 10th order fits are shown in Figure 4.1, and the in-sample and out-of-sample errors are given in the following table:
             10th order noisy target        50th order noiseless target
             2nd Order    10th Order        2nd Order    10th Order
    E_in     0.050        0.034             0.029        10^-5
    E_out    0.127        9.00              0.120        7680
What the learning algorithm sees is the data, not the target function. In both cases, the 10th order polynomial heavily overfits the data, and results in a nonsensical final hypothesis which does not resemble the target function. The 2nd order fits do not capture the full nature of the target function either, but they do at least capture its general trend, resulting in significantly lower out-of-sample error. The 10th order fits have lower in-sample error and higher out-of-sample error, so this is indeed a case of overfitting that results in pathologically bad generalization.

Exercise 4.1
Let H₂ and H₁₀ be the 2nd and 10th order hypothesis sets respectively. Specify these sets as parameterized sets of functions. Show that H₂ ⊂ H₁₀.
These two examples reveal some surprising phenomena. Let's consider first the 10th order target function, Figure 4.1(a). Here is the scenario. Two learners, O (for 'overfitted') and R (for 'restricted'), know that the target function is a 10th order polynomial, and that they will receive 15 noisy data points. Learner O
[Figure: learning curves for H₂ (left) and H₁₀ (right), plotting expected error versus number of data points N.]

Figure 4.2: Overfitting is occurring for N in the shaded gray region, because by choosing H₁₀, which has better E_in, you get worse E_out.

uses model H₁₀, which is known to contain the target function, and finds the best fitting hypothesis to the data. Learner R uses model H₂, and similarly finds the best fitting hypothesis to the data.

The surprising thing is that learner R wins (lower out-of-sample error) by using the smaller model, even though she has knowingly given up the ability to implement the true target function. Learner R trades off a worse in-sample error for a huge gain in the generalization error, ultimately resulting in lower out-of-sample error. What is funny here? A folklore belief about learning is that best results are obtained by incorporating as much information about the target function as is available. But as we see here, even if we know the order of the target and naively incorporate this knowledge by choosing the model accordingly (H₁₀), the performance is inferior to that demonstrated by the more 'stable' 2nd order model.

The models H₂ and H₁₀ were in fact the ones used to generate the learning curves in Chapter 2, and we use those same learning curves to illustrate overfitting in Figure 4.2. If you mentally superimpose the two plots, you can see that there is a range of N for which H₁₀ has lower E_in but higher E_out than H₂ does, a case in point of overfitting.

Is learner R always going to prevail? Certainly not. For example, if the data were noiseless, then learner O would indeed recover the target function exactly from 15 data points, while learner R would have no hope. This brings us to the second example, Figure 4.1(b). Here, the data is noiseless, but the target function is very complex (a 50th order polynomial). Again learner R wins, and again because learner O heavily overfits the data. Overfitting is not a disease inflicted only upon complex models with many more degrees of freedom than warranted by the complexity of the target function. In fact, the reverse is true here, and overfitting is just as bad. What matters is how the model complexity matches the quantity and quality of the data we have, not how it matches the target function.
4.1.2 Catalysts for Overfitting
A skeptical reader should ask whether the examples in Figure 4.1 are just pathological constructions created by the authors, or is overfitting a real phenomenon which has to be considered carefully when learning from data? The next exercise guides you through an experimental design for studying overfitting within our current setup. We will use the results from this experiment to serve two purposes: to convince you that overfitting is not the result of some rare pathological construction, and to unravel some of the conditions conducive to overfitting.
Exercise 4.2 [Experimental design for studying overfitting]
This is a reading exercise that sets up an experimental framework to study various aspects of overfitting. The reader interested in implementing the experiment can find the details fleshed out in the Problems at the end of this chapter. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H₂ and H₁₀.
The target is a degree-Q_f polynomial, which we write f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where the L_q(x) are polynomials of increasing complexity (the Legendre polynomials). The data set is D = (x₁, y₁), . . . , (x_N, y_N), where y_n = f(x_n) + σ ε_n and the ε_n are iid standard Normal random variates.
For a single experiment, with specified values of Q_f, N and σ², generate a random target, generate a data set, and let g₂ ∈ H₂ and g₁₀ ∈ H₁₀ be the best-fit hypotheses to the data, with respective out-of-sample errors E_out(g₂) and E_out(g₁₀).
Vary Q_f, N, σ², and for each combination of parameters, run a large number of experiments, each time computing E_out(g₂) and E_out(g₁₀). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ²) using H₂ and H₁₀.
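For readers who want to try this, here is a compact sketch of one run of the experiment (our own code; the Legendre targets come from numpy's polynomial module, the target normalization is omitted for brevity, the fits are least-squares polynomial fits, and E_out is estimated on a large fresh test set rather than computed in closed form).

```python
import numpy as np
from numpy.polynomial import legendre

def one_overfit_experiment(Q_f=20, N=15, sigma=1.0, rng=None):
    """Return (E_out(g2), E_out(g10)) for one randomly generated scenario."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal(Q_f + 1)             # random Legendre coefficients
    f = lambda x: legendre.legval(x, a)          # target f(x) = sum a_q L_q(x)

    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)    # noisy training data

    x_test = rng.uniform(-1, 1, 10000)           # fresh points to estimate E_out
    y_test = f(x_test)                           # noiseless target values
    errors = []
    for degree in (2, 10):
        coeffs = np.polyfit(x, y, degree)        # least-squares fit in H_degree
        errors.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return tuple(errors)                         # (E_out(g2), E_out(g10))
```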
Exercise 4.2 sets up an experiment to study how the noise level σ², the target complexity Q_f, and the number of data points N relate to overfitting. We compare the final hypothesis g₁₀ ∈ H₁₀ (larger model) to the final hypothesis g₂ ∈ H₂ (smaller model). Clearly, E_in(g₁₀) ≤ E_in(g₂), since g₁₀ has more degrees of freedom to fit the data. What is surprising is how often g₁₀ overfits the data, resulting in E_out(g₁₀) > E_out(g₂). Let us define the overfit measure as E_out(g₁₀) − E_out(g₂). The more positive this measure is, the more severe overfitting would be.

Figure 4.3 shows how the extent of overfitting depends on certain parameters of the learning problem (the results are from our implementation of Exercise 4.2). In the figure, the colors map to the level of overfitting, with redder
[Figure: two color plots of the overfit measure E_out(g₁₀) − E_out(g₂): (a) stochastic noise, plotting σ² versus the number of data points N; (b) deterministic noise, plotting Q_f versus the number of data points N.]

Figure 4.3: How overfitting depends on the noise σ², the target function complexity Q_f, and the number of data points N. The colors map to the overfit measure E_out(g₁₀) − E_out(g₂). In (a), we see how overfitting depends on σ² and N, with Q_f = 20; as σ² increases, we are adding stochastic noise to the data. In (b), we see how overfitting depends on Q_f and N, with σ² = 0.1; as Q_f increases, we are adding deterministic noise to the data.
regions showing worse overfitting. These red regions are large; overfitting is real, and here to stay.

Figure 4.3(a) reveals that there is less overfitting when the noise level σ² drops or when the number of data points N increases (the linear pattern in Figure 4.3(a) is typical). Since the 'signal' f is normalized to E[f²] = 1, the noise level σ² is automatically calibrated to the signal level. Noise leads the learning astray, and the larger, more complex model is more susceptible to noise than the simpler one because it has more ways to go astray. Figure 4.3(b) reveals that the target function complexity Q_f affects overfitting in a similar way to noise, albeit nonlinearly. To summarize:

    number of data points up    ->  overfitting goes down
    noise up                    ->  overfitting goes up
    target complexity up        ->  overfitting goes up

Deterministic noise. Why does a higher target complexity lead to more overfitting when comparing the same two models? The intuition is that for a given learning model, there is a best approximation to the target function. The part of the target function 'outside' this best fit acts like noise in the data. We can call this deterministic noise, to differentiate it from the random stochastic noise. Just as stochastic noise cannot be modeled, the deterministic noise is that part of the target function which cannot be modeled. The learning algorithm should not attempt to fit the noise; however, it cannot distinguish noise from signal. On a finite data set, the algorithm inadvertently uses some
Figure 4.4: Deterministic noise. h* is the best fit to f in H₂. The shading illustrates deterministic noise for this learning problem.
of the degrees of freedom to fit the noise, which can result in overfitting and a spurious final hypothesis. Figure 4.4 illustrates deterministic noise for a quadratic model fitting a more complex target function.

While stochastic and deterministic noise have similar effects on overfitting, there are two basic differences between the two types of noise. First, if we generated the same data (x values) again, the deterministic noise would not change but the stochastic noise would. Second, different models capture different 'parts' of the target function, hence the same data set will have different deterministic noise depending on which model we use. In reality, we work with one model at a time and have only one data set on hand. Hence, we have one realization of the noise to work with, and the algorithm cannot differentiate between the two types of noise.
Exercise 4.3
Deterministic noise depends on H, as some models approximate f better than others.
(a) Assume H is fixed and we increase the complexity of f. Will deterministic noise in general go up or down? Is there a higher or lower tendency to overfit?
(b) Assume f is fixed and we decrease the complexity of H. Will deterministic noise in general go up or down? Is there a higher or lower tendency to overfit? [Hint: There is a race between two factors that affect overfitting in opposite ways, but one wins.]
The bias-variance decomposition, which we discussed in Section 2.3.1 (see also Problem 2.22), is a useful tool for understanding how noise affects performance:

    E_D[E_out] = σ² + bias + var.

The first two terms reflect the direct impact of the stochastic and deterministic noise. The variance of the stochastic noise is σ², and the bias is directly
related to the deterministic noise, in that it captures the model's inability to approximate f. The var term is indirectly impacted by both types of noise, capturing a model's susceptibility to being led astray by the noise.
4.2 Regularization
Regularization is our first weapon to combat overfitting. It constrains the learning algorithm to improve out-of-sample error, especially when noise is present. To whet your appetite, look at what a little regularization can do for our first overfitting example in Section 4.1. Though we only used a very small 'amount' of regularization, the fit improves dramatically.
[Figure: the overfitting example of Section 4.1, fit without regularization (a wild 4th order fit) and with regularization (a fit close to the target); each plot shows the data, the target, and the fit.]
Now that we have your attention, we would like to come clean. Regularization is as much an art as it is a science. Most of the methods used successfully in practice are heuristic methods. However, these methods are grounded in a mathematical framework that is developed for special cases. We will discuss both the mathematical and the heuristic, trying to maintain a balance that reflects the reality of the field.

Speaking of heuristics, one view of regularization is through the lens of the VC bound, which bounds E_out using a model complexity penalty Ω(H):
    E_out(h) ≤ E_in(h) + Ω(H)    for all h ∈ H.     (4.1)

So, we are better off if we fit the data using a simple H. Extrapolating one step further, we should be better off by fitting the data using a 'simple' h from H. The essence of regularization is to concoct a measure Ω(h) for the complexity of an individual hypothesis. Instead of minimizing E_in(h) alone, one minimizes a combination of E_in(h) and Ω(h). This avoids overfitting by constraining the learning algorithm to fit the data well using a simple hypothesis.
Example 4.1. One popular regularization technique is weight decay, which measures the complexity of a hypothesis h by the size of the coefficients used to represent h (e.g., in a linear model). This heuristic prefers mild lines with
small offset and slope, to wild lines with bigger offset and slope. We will get to the mechanics of weight decay shortly, but for now let's focus on the outcome. We apply weight decay to fitting the target f(x) = sin(πx) using N = 2 data points (as in Example 2.8). We sample x uniformly in [−1, 1], generate a data set and fit a line to the data (our model is H₁). The figures below show the resulting fits on the same (random) data sets, with and without regularization.

[Figure: the fitted lines on the random data sets, without regularization (left) and with regularization (right).]
Without regularization, the learned function varies extensively, depending on the data set. As we have seen in Example 2.8, a constant model scored E_out = 0.75, handily beating the performance of the (unregularized) linear model, which scored E_out = 1.90. With a little weight decay regularization, the fits to the same data sets are considerably less volatile. This results in a significantly lower E_out = 0.56 that beats both the constant model and the unregularized linear model.

The bias-variance decomposition helps us to understand how the regularized version beat both the unregularized version as well as the constant model.
[Figure: the average hypothesis ḡ (red), with the gray shaded region indicating ḡ(x) ± sqrt(var(x)). Without regularization: bias = 0.21, var = 1.69. With regularization: bias = 0.23, var = 0.33.]

As expected, regularization reduced the var term rather dramatically, from 1.69 down to 0.33. The price paid in terms of the bias (the quality of the average fit) was
modest, only slightly increasing from 0.21 to 0.23. The result was a significant decrease in the expected out-of-sample error, because bias + var decreased. This is the crux of regularization. By constraining the learning algorithm to select 'simpler' hypotheses from H, we sacrifice a little bias for a significant gain in the var.
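The following sketch (our own code) reproduces the flavor of this example: fitting a line to N = 2 samples of sin(πx), with and without a weight decay penalty, using the regularized least-squares solution discussed below; the value of the penalty used here is an arbitrary illustrative choice, not the one used in the text.

```python
import numpy as np

def fit_line(x, y, lam=0.0):
    """Least-squares line fit with an optional weight-decay (ridge) penalty."""
    Z = np.column_stack([np.ones_like(x), x])              # features (1, x)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

def expected_eout(lam, num_trials=20000, rng=None):
    """Estimate the expected E_out of fitting sin(pi x) from 2 random points."""
    rng = np.random.default_rng() if rng is None else rng
    x_test = np.linspace(-1, 1, 1000)
    y_test = np.sin(np.pi * x_test)
    total = 0.0
    for _ in range(num_trials):
        x = rng.uniform(-1, 1, 2)                          # N = 2 data points
        w = fit_line(x, np.sin(np.pi * x), lam)
        total += np.mean((w[0] + w[1] * x_test - y_test) ** 2)
    return total / num_trials

# expected_eout(0.0) is large (wild fits); expected_eout(0.1) is much smaller
```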
This example also illustrates why regularization is needed. The linear model is too sophisticated for the amount of data we have, since a line can perfectly fit any 2 points. This need would persist even if we changed the target function, as long as we have either stochastic or deterministic noise. The need for regularization depends on the quantity and quality of the data. Given our meager data set, our choices were either to take a simpler model, such as the model with constant functions, or to constrain the linear model. It turns out that using the complex model but constraining the algorithm toward simpler hypotheses gives us more flexibility, and ends up giving the best E_out. In practice, this is the rule, not the exception.

Enough heuristics. Let's develop the mathematics of regularization.

4.2.1 A Soft Order Constraint

In this section, we derive a regularization method that applies to a wide variety of learning problems. To simplify the math, we will use the concrete setting of regression using Legendre polynomials, the polynomials of increasing complexity used in Exercise 4.2. So, let's first formally introduce you to the Legendre polynomials.

Consider a learning model where H is the set of polynomials in one variable x ∈ [−1, 1]. Instead of expressing the polynomials in terms of consecutive powers of x, we will express them as a combination of Legendre polynomials in x. Legendre polynomials are a standard set of polynomials with nice analytic properties that result in simpler derivations. The zeroth-order Legendre polynomial is the constant L₀(x) = 1, and the first few Legendre polynomials are illustrated below.
[Figure: the first few Legendre polynomials, e.g. L₂(x) = ½(3x² − 1), L₃(x) = ½(5x³ − 3x), L₄(x) = ⅛(35x⁴ − 30x² + 3), L₅(x) = ⅛(63x⁵ − 70x³ + 15x).]
Polynomial models are a special case of linear models in a space Z, under a nonlinear transformation Φ: X → Z. Here, for the Qth order polynomial model, Φ transforms x into a vector z of Legendre polynomials,

    z = (1, L₁(x), . . . , L_Q(x)).

Our hypothesis set H_Q is a linear combination of these polynomials,

    H_Q = { h | h(x) = wᵀz = Σ_{q=0}^{Q} w_q L_q(x) },

where L₀(x) = 1. As usual, we will sometimes refer to the hypothesis h by its weight vector w. Since each h is linear in w, we can use the machinery of linear regression from Chapter 3 to minimize the squared error

    E_in(w) = (1/N) Σ_{n=1}^{N} (wᵀz_n − y_n)².     (4.2)

The case of polynomial regression with the squared-error measure illustrates the main ideas of regularization well, and facilitates a solid mathematical derivation. Nonetheless, our discussion will generalize in practice to non-linear, multidimensional settings with more general error measures. The baseline algorithm (without regularization) is to minimize E_in over the hypotheses in H_Q to produce the final hypothesis g(x) = w_linᵀz, where w_lin = argmin_w E_in(w).²
Exercise 4.4
Let Z = [z₁ . . . z_N]ᵀ be the data matrix (assume Z has full column rank); let w_lin = (ZᵀZ)⁻¹Zᵀy; and let H = Z(ZᵀZ)⁻¹Zᵀ (the hat matrix of Exercise 3.3). Show that

    E_in(w) = (1/N) [ (w − w_lin)ᵀ ZᵀZ (w − w_lin) + yᵀ(I − H)y ],     (4.3)

where I is the identity matrix.
(a) What value of w minimizes E_in?
(b) What is the minimum in-sample error?
The task of regularization, which results in a final hypothesis w_reg instead of the simple w_lin, is to constrain the learning so as to prevent overfitting the

² We used w̃ and d̃ for the weight vector and dimension in Z. Since we are explicitly dealing with polynomials and Z is the only space around, we use w and Q to simplify the notation.
data. We have already seen an example of constraining the learning: the set H₂ can be thought of as a constrained version of H₁₀, in the sense that some of the H₁₀ weights are required to be zero. That is, H₂ is a subset of H₁₀ defined by H₂ = {w | w ∈ H₁₀; w_q = 0 for q ≥ 3}. Requiring some weights to be 0 is a hard constraint. We have seen that such a hard constraint on the order can help; for example, H₂ is better than H₁₀ when there is a lot of noise and N is small. Instead of requiring some weights to be zero, we can force the weights to be small but not necessarily zero through a softer constraint such as

    Σ_{q=0}^{Q} w_q² ≤ C.

This is a 'soft order' constraint because it only encourages each weight to be small, without changing the order of the polynomial by explicitly setting some weights to zero. The in-sample optimization problem becomes:

    minimize E_in(w)   subject to   wᵀw ≤ C.     (4.4)

The data determines the optimal weight sizes, given the total budget C, which determines the amount of regularization; the larger C is, the weaker the constraint and the smaller the amount of regularization. We can define the soft-order-constrained hypothesis set H(C) by

    H(C) = { h | h(x) = wᵀz, wᵀw ≤ C }.

Equation (4.4) is equivalent to minimizing E_in over H(C). If C₁ < C₂, then H(C₁) ⊂ H(C₂), and so d_VC(H(C₁)) ≤ d_VC(H(C₂)), and we expect better generalization with H(C₁). Let the regularized weights w_reg be the solution to (4.4).

Solving for w_reg. If w_linᵀw_lin ≤ C, then w_reg = w_lin, because w_lin ∈ H(C). If w_lin ∉ H(C), then not only is w_regᵀw_reg ≤ C, but in fact w_regᵀw_reg = C (w_reg uses the entire budget C; see the Problems). We thus need to minimize E_in subject to the equality constraint wᵀw = C. The situation is illustrated to the right. The weights w must lie on the surface of the sphere wᵀw = C; the normal vector to this surface at w is the vector w itself (also in red). A surface of constant E_in is shown in blue; this surface is a quadratic surface (see Exercise 4.4) and the normal to this surface is ∇E_in(w). In this case, w cannot be optimal because ∇E_in(w) is not parallel to the red normal vector. This means that ∇E_in(w) has some non-zero component along the constraint surface, and by moving a small amount in the opposite direction of this component, we can improve E_in, while still
remaining on the surface If Wre is to be optimal, then r some positive parameter A ie, Ein must be parallel to Wre, the normal vector to t he constraint surace (the scaling by 2 is r mathematical convenience and the negative sign is because
Ein and w are in opposite directions quivalently,
because V(wTw)
Wre satises
2w So, fr some A > 0, Wre locally minimizes (45
The parameter λ_C and the vector w_reg (both of which depend on C and the data) must be chosen so as to simultaneously satisfy the gradient equality and the weight norm constraint w_reg^T w_reg = C.³ That λ_C > 0 is intuitive, since we are enforcing smaller weights, and minimizing E_in(w) + λ_C w^T w would not lead to smaller weights if λ_C were negative. Note that if w_lin^T w_lin ≤ C, then w_reg = w_lin and minimizing (4.5) still holds, with λ_C = 0. Therefore, we have an equivalence between solving the constrained problem (4.4) and the unconstrained minimization of (4.5). This equivalence means that minimizing (4.5) is similar to minimizing E_in using a smaller hypothesis set, which in turn means that we can expect better generalization by minimizing (4.5) than by just minimizing E_in.

Other variations of the constraint in (4.4) can be used to emphasize some weights over others. Consider the constraint
$$\sum_{q=0}^{Q} \gamma_q\, w_q^2 \le C.$$
The importance γ_q given to weight w_q determines the type of regularization. For example, γ_q = q encourages a low-order fit, and γ_q = (1+q)^{-1} encourages a high-order fit. In extreme cases, one recovers hard-order constraints by choosing some γ_q = 0 and some γ_q → ∞.
Exercise 4.5
[Tikhonov regularizer] A more general soft constraint is the Tikhonov regularization constraint
$$w^T \Gamma^T \Gamma\, w \le C,$$
which can capture relationships among the w_i (the matrix Γ is the Tikhonov regularizer).
(a) What should Γ be to obtain the constraint Σ_q w_q² ≤ C?
(b) What should Γ be to obtain the constraint (Σ_q w_q)² ≤ C?
³ λ_C is known as a Lagrange multiplier, and an alternate derivation of these same results can be obtained via the theory of Lagrange multipliers for constrained optimization.
4.2.2 Weight Decay and Augmented Error
The soft-order constraint for a given value of C is a constrained minimization of E_in. Equation (4.5) suggests that we may equivalently solve an unconstrained minimization of a different function. Let's define the augmented error
$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \lambda\, w^T w, \qquad (4.6)$$
where λ ≥ 0 is now a free parameter at our disposal. The augmented error has two terms. The first is the in-sample error, which we are used to minimizing, and the second is a penalty term. Notice that this fits the heuristic view of regularization that we discussed earlier, where the penalty for complexity is defined for each individual h instead of for H as a whole. When λ = 0, we have the usual in-sample error. For λ > 0, minimizing the augmented error corresponds to minimizing a penalized in-sample error. The value of λ controls the amount of regularization. The penalty term w^T w enforces a tradeoff between making the in-sample error small and making the weights small, and has become known as weight decay. As discussed in Problem 4.8, if we minimize the augmented error using an iterative method like gradient descent, we will have a reduction of the in-sample error together with a gradual shrinking of the weights, hence the name 'weight decay.' In the statistics community, this type of penalty term is a form of ridge regression.

There is an equivalence between the soft order constraint and augmented error minimization. In the soft-order constraint, the amount of regularization is controlled by the parameter C. From (4.5), there is a particular λ (depending on C and the data) for which minimizing the augmented error E_aug(w) leads to the same final hypothesis w_reg. A larger C allows larger weights and is a weaker soft-order constraint; this corresponds to smaller λ, i.e., less emphasis on the penalty term w^T w in the augmented error. For a particular data set, the optimal value C* leading to minimum out-of-sample error with the soft-order constraint corresponds to an optimal value λ* in the augmented error minimization. If we can find λ*, we can get the minimum E_out.

Have we gained from the augmented error view? Yes, because augmented error minimization is unconstrained, which is generally easier than constrained minimization. For example, we can obtain a closed form solution for linear models, or use a method like stochastic gradient descent to carry out the minimization. However, augmented error minimization is not so easy to interpret. There are no values for the weights which are explicitly forbidden, as there are in the soft-order constraint. For a given C, the soft-order constraint corresponds to selecting a hypothesis from the smaller set H(C), and so from our VC analysis we should expect better generalization when C decreases (λ increases). It is through the relationship between λ and C that one has a theoretical justification of weight decay as a method for regularization.

We focused on the soft-order constraint w^T w ≤ C, with corresponding augmented error E_aug(w) = E_in(w) + λ w^T w. However, our discussion applies more generally.
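To make the weight-decay mechanism concrete, here is a minimal sketch (my own, not from the book) of minimizing the augmented error (4.6) by gradient descent, in the spirit of Problem 4.8. The learning rate, step count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def gradient_descent_weight_decay(Z, y, lam, eta=0.01, steps=1000):
    """Minimize E_aug(w) = E_in(w) + lam * w^T w by gradient descent.

    With squared in-sample error E_in(w) = (1/N)||Zw - y||^2, the update
        w <- (1 - 2*eta*lam) * w - eta * grad E_in(w)
    shrinks ("decays") the weights at every step.
    """
    N, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad_Ein = (2.0 / N) * Z.T @ (Z @ w - y)   # gradient of the in-sample error
        w = (1 - 2 * eta * lam) * w - eta * grad_Ein
    return w

# Illustrative usage on made-up data (assumed, not from the text):
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))
y = Z @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
w_reg = gradient_descent_weight_decay(Z, y, lam=0.1)
```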
There is a duality between the minimization of the in-sample error over a constrained hypothesis set and the unconstrained minimization of an augmented error. We may choose to live in either world, but more often than not, the unconstrained minimization of the augmented error is more convenient.

In our definition of E_aug(w) in Equation (4.6), we only highlighted the dependence on w. There are two other quantities under our control, namely the amount of regularization, λ, and the nature of the regularizer, which we chose to be w^T w. In general, the augmented error for a hypothesis h ∈ H is
$$E_{\text{aug}}(h, \lambda, \Omega) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h). \qquad (4.7)$$
For weight decay, Ω(h) = w^T w, which penalizes large weights. The penalty term has two components: the regularizer Ω(h) (the type of regularization), which penalizes a particular property of h, and the regularization parameter λ (the amount of regularization). The need for regularization goes down as the number of data points goes up, so we factored out 1/N; this allows the optimal choice for λ to be less sensitive to N. This is just a redefinition of the λ that we have been using, in order to make it a more stable parameter that is easier to interpret. Notice how Equation (4.7) resembles the VC bound (4.1), as we anticipated in the heuristic view of regularization. This is why we use the same notation Ω for both the penalty on individual hypotheses, Ω(h), and the penalty for the complexity of the whole set, Ω(H). The correspondence between the complexity of H and the complexity of an individual h will be discussed further in Section 5.1. The regularizer is typically fixed ahead of time, before seeing the data; sometimes the problem itself can dictate an appropriate regularizer.

Exercise 4.6
We have seen both the hard-order constraint and the soft-order constraint. Which do you expect to be more useful for binary classification using the perceptron model? [Hint: sign(w^T x) = sign(α w^T x) for any α > 0.]
The optimal regularization parameter, however, typically depends on the data. The choice of the optimal λ is one of the applications of validation, which we will discuss shortly.

Example 4.2 (Linear models with weight decay). Linear models are important enough that it is worthwhile to spell out the details of augmented error minimization in this case. From Exercise 4.4, the augmented error is
$$E_{\text{aug}}(w) = \frac{1}{N}\Big[(Zw - y)^T(Zw - y) + \lambda\, w^T w\Big],$$
where Z is the transformed data matrix and w_lin = (Z^T Z)^{-1} Z^T y. The reader may verify, after taking the derivative of E_aug and setting ∇E_aug = 0, that
$$w_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T y.$$
As expected, w_reg will go to zero as λ → ∞, due to the λI term. The predictions on the in-sample data are given by
$$\hat{y} = Z w_{\text{reg}} = H(\lambda)\, y, \qquad \text{where } H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T.$$
The matrix H(λ) plays an important role in defining the effective complexity of a model. When λ = 0, H is the hat matrix of Exercises 3.3 and 4.4, which satisfies H² = H and trace(H) = d + 1. The vector of in-sample errors, which are also called residuals, is y − ŷ = (I − H(λ)) y, and the in-sample error is
$$E_{\text{in}}(w_{\text{reg}}) = \frac{1}{N}\, y^T \big(I - H(\lambda)\big)^2\, y.$$
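As a concrete companion to Example 4.2, here is a small sketch (my own, with illustrative data) that computes the weight-decay solution w_reg = (ZᵀZ + λI)⁻¹Zᵀy and the matrix H(λ) for a given transformed data matrix Z.

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Regularized least squares: w_reg = (Z^T Z + lam I)^{-1} Z^T y, and H(lambda)."""
    d = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(d)
    w_reg = np.linalg.solve(A, Z.T @ y)
    H = Z @ np.linalg.solve(A, Z.T)        # H(lambda) = Z (Z^T Z + lam I)^{-1} Z^T
    return w_reg, H

# Illustrative usage (assumed data):
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 3))
y = Z @ np.array([0.5, -1.0, 2.0]) + 0.2 * rng.normal(size=30)
w_reg, H = weight_decay_fit(Z, y, lam=0.5)
y_hat = H @ y                              # in-sample predictions Z w_reg
E_in = np.mean((y_hat - y) ** 2)           # equals (1/N) y^T (I - H)^2 y
```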
We can now apply weight decay regularization to the first overfitting example that opened this chapter. The results for different λ's are shown in Figure 4.5.

[Figure 4.5: Weight decay applied with different values of the regularization parameter λ (for example λ = 0.0001 and λ = 0.01). The fits get flatter as we increase λ.]

As you can see, even very little regularization goes a long way, but too much regularization results in an overly flat curve at the expense of the in-sample fit. Another case we saw earlier is Example 4.1, where we fit a linear model to a sinusoid. The regularization used there was also weight decay, with λ = 0.1.
4.2.3 Choosing a Regularizer: Pill or Poison?
We have presented a number of ways to constrain a model: hard-order constraints, where we simply use a lower-order model; soft-order constraints, where we constrain the parameters of the model; and augmented error, where we add a penalty term to an otherwise unconstrained minimization of error. Augmented error is the most popular form of regularization, for which we need to choose the regularizer Ω(h) and the regularization parameter λ. In practice, the choice of Ω is largely heuristic. Finding a perfect Ω is as difficult as finding a perfect H. It depends on information that, by the very nature of learning, we don't have. However, there are regularizers we can work with that have stood the test of time, such as weight decay. Some forms of regularization work and some do not, depending on the specific application and the data. Figure 4.5 illustrated that even the amount of regularization
[Figure 4.6: Out-of-sample performance of the uniform and low-order regularizers, using model H_15 with σ² = 0.5, Q_f = 15 and N = 30. (a) Uniform regularizer. (b) Low-order regularizer. The x-axis is the regularization parameter λ. Overfitting occurs in the shaded region because lower E_in (lower λ) leads to higher E_out. Underfitting occurs when λ is too large, because the learning algorithm has too little flexibility to fit the data.]
has to be chosen carefully. Too much regularization (too harsh a constraint) leaves the learning too little flexibility to fit the data and leads to underfitting, which can be just as bad as overfitting. If so many choices can go wrong, why do we bother with regularization in the first place? Regularization is a necessary evil, with the operative word being necessary. If our model is too sophisticated for the amount of data we have, we are doomed. By applying regularization, we have a chance. By applying the proper regularization, we are in good shape. Let us experiment with two choices of a regularizer for the model of 15th-order polynomials, using the experimental design in Exercise 4.2:

1. A uniform regularizer: Ω_unif(w) = Σ_{q=0}^{15} w_q².
2. A low-order regularizer: Ω_low(w) = Σ_{q=0}^{15} q w_q².

The first encourages all weights to be small, uniformly; the second pays more attention to the higher order weights, encouraging a lower order fit. Figure 4.6 shows the performance for different values of the regularization parameter λ. As you decrease λ, the learning algorithm pays less attention to the penalty term and more to E_in, and so E_in will decrease (Problem 4.7). In the shaded region, E_out increases as you decrease E_in (decrease λ); the regularization parameter is too small and there is not enough of a constraint on the learning, leading to decreased performance because of overfitting. In the unshaded region, the regularization parameter is too large, over-constraining the learning and not giving it enough flexibility to fit the data, leading to decreased performance because of underfitting. As can be observed from the figure, the price paid for overfitting is generally more severe than for underfitting. It usually pays to be conservative.
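To make the two regularizers concrete, here is a small sketch (my own, under assumed settings) that implements both as Tikhonov matrices Γ, so the regularized solution is w_reg = (ZᵀZ + λΓᵀΓ)⁻¹Zᵀy as in Exercise 4.5 and Problem 4.16. The plain polynomial features and the data are illustrative stand-ins, not the book's Legendre setup.

```python
import numpy as np

def tikhonov_fit(Z, y, lam, Gamma):
    """Minimize (1/N)(||Zw - y||^2 + lam * ||Gamma w||^2)."""
    A = Z.T @ Z + lam * Gamma.T @ Gamma
    return np.linalg.solve(A, Z.T @ y)

Q = 15                                              # polynomial order (assumed)
Gamma_unif = np.eye(Q + 1)                          # uniform: penalty sum_q w_q^2
Gamma_low = np.diag(np.sqrt(np.arange(Q + 1.0)))    # low-order: penalty sum_q q * w_q^2

# Illustrative usage on made-up polynomial features:
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
Z = np.vander(x, Q + 1, increasing=True)            # columns 1, x, ..., x^Q
y = np.sin(np.pi * x) + 0.5 * rng.normal(size=30)
w_unif = tikhonov_fit(Z, y, lam=1.0, Gamma=Gamma_unif)
w_low = tikhonov_fit(Z, y, lam=1.0, Gamma=Gamma_low)
```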
[Figure 4.7: Performance of the uniform regularizer at different levels of noise; the optimal λ is highlighted for each curve. (a) Stochastic noise. (b) Deterministic noise. The x-axis in both panels is the regularization parameter λ.]
The optimal regularization parameter for the two cases is quite different, and the performance can be quite sensitive to the choice of regularization parameter. However, the promising message from the figure is that, though the behaviors are quite different, the performances of the two regularizers are comparable (around 0.76), if we choose the right λ for each.

We can also use this experiment to study how performance with regularization depends on the noise. In Figure 4.7(a), when σ² = 0, no amount of regularization helps (i.e., the optimal regularization parameter is λ = 0), which is not a surprise because there is no stochastic or deterministic noise in the data (both target and model are 15th order polynomials). As we add more stochastic noise, the overall performance degrades, as expected. Note that the optimal value for the regularization parameter increases with noise, which is also expected based on the earlier discussion that the potential to overfit increases as the noise increases; hence, constraining the learning more should help.

Figure 4.7(b) shows what happens when we add deterministic noise, keeping the stochastic noise at zero. This is accomplished by increasing Q_f (the target complexity), thereby adding deterministic noise, but keeping everything else the same. Comparing parts (a) and (b) of Figure 4.7 provides another demonstration of how the effects of deterministic and stochastic noise are similar. When either is present, it is helpful to regularize, and the more noise there is, the larger the amount of regularization you need.

What happens if you pick the wrong regularizer? To illustrate, we picked a regularizer which encourages large weights (weight growth) versus weight decay, which encourages small weights. As you can see, in this case weight growth does not help the cause of overfitting. If we happened to choose weight growth as our regularizer, we would still be OK as long as we have
a good way to pick the regularization parameter; the optimal regularization parameter in this case is λ = 0, and we are no worse off than not regularizing. No regularizer will be ideal for all settings, or even for a specific setting, since we never have perfect information, but they all tend to work with varying success if the amount of regularization λ is set to the correct level. Thus, the entire burden rests on picking the right λ, a task that can be addressed by a technique called validation, which is the topic of the next section.

The lesson learned is that some form of regularization is necessary, as learning is quite sensitive to stochastic and deterministic noise. The best way to constrain the learning is in the 'direction' of the target function, and more of a constraint is needed when there is more noise. Even though we don't know either the target function or the noise, regularization helps by reducing the impact of the noise. Most common models have hypothesis sets which are naturally parameterized so that smaller parameters lead to smoother hypotheses. Thus, a weight decay type of regularizer constrains the learning towards smoother hypotheses. This helps, because stochastic noise is 'high frequency' (non-smooth). Similarly, deterministic noise (the part of the target function which cannot be modeled) also tends to be non-smooth. Thus, constraining the learning towards smoother hypotheses 'hurts' our ability to overfit the noise more than it hurts our ability to fit the useful information. These are empirical observations, not theoretically justifiable statements.

Regularization and the VC dimension. Regularization (for example, soft-order selection by minimizing the augmented error) poses a problem for the VC line of reasoning. As λ goes up, the learning algorithm changes, but the hypothesis set does not, so d_vc will not change. We argued that λ ↑ in the augmented error corresponds to C ↓ in the soft-order constrained model. So, more regularization corresponds to an effectively smaller model, and we expect better generalization for a small increase in E_in, even though the VC dimension of the model we are actually using with augmented error does not change. This suggests a heuristic that works well in practice, which is to use an 'effective VC dimension' instead of the VC dimension. For linear perceptrons, the VC dimension equals the number of free parameters, d + 1, and so an effective number of parameters is a good surrogate for the VC dimension in the VC bound. The effective number of parameters will go down as λ increases, and so the effective VC dimension will reflect better generalization with increased regularization. Problems 4.13, 4.14, and 4.15 explore the notion of an effective number of parameters.
4.3 Validation
So far, we have identified overfitting as a problem, noise (stochastic and deterministic) as a cause, and regularization as a cure. In this section, we introduce another cure, called validation. One can think of both regularization and
validation as attempts at minimizing E_out rather than just E_in. Of course, the true E_out is not available to us, so we need an estimate of E_out based on information available to us in sample. In some sense, this is the Holy Grail of machine learning: to find an in-sample estimate of the out-of-sample error. Regularization attempts to minimize E_out by working through the equation
$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty},$$
and concocting a heuristic term that emulates the overfit penalty. Validation, on the other hand, cuts to the chase and estimates the out-of-sample error E_out(h) in this equation directly.

Estimating the out-of-sample error directly is nothing new to us. In Section 2.2.3, we introduced the idea of a test set, a subset of D that is not involved in the learning process and is used to evaluate the final hypothesis. The test error E_test, unlike the in-sample error E_in, is an unbiased estimate of E_out.
4.3.1 The Validation Set
The idea of a validation set is almost identical to that of a test set. We remove a subset from the data; this subset is not used in training. We then use this held-out subset to estimate the out-of-sample error. The held-out set is effectively out-of-sample because it has not been used during the learning.

However, there is a difference between a validation set and a test set. Although the validation set will not be directly used for training, it will be used in making certain choices in the learning process. The minute a set affects the learning process in any way, it is no longer a test set. However, as we will see, the way the validation set is used in the learning process is so benign that its estimate of E_out remains almost intact.

Let us first look at how the validation set is created. The first step is to partition the data set D into a training set D_train of size (N − K) and a validation set D_val of size K. Any partitioning method which does not depend on the values of the data points will do; for example, we can select N − K points at random for training and the remaining K points for validation.

Now, we run the learning algorithm using the training set D_train to obtain a final hypothesis g⁻ ∈ H, where the 'minus' superscript indicates that some data points were taken out of the training. We then compute the validation error for g⁻ using the validation set D_val:
$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{x_n \in \mathcal{D}_{\text{val}}} e\big(g^-(x_n), y_n\big),$$
where e(g⁻(x), y) is the pointwise error measure which we introduced in Section 1.4.1. For classification, e(g⁻(x), y) = [g⁻(x) ≠ y], and for regression using squared error, e(g⁻(x), y) = (g⁻(x) − y)².

The validation error is an unbiased estimate of E_out because the final hypothesis g⁻ was created independently of the data points in the validation set. Indeed, taking the expectation of E_val with respect to the data points in D_val,
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] \;=\; \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} \mathbb{E}_{\mathcal{D}_{\text{val}}}\big[e(g^-(x_n), y_n)\big] \;=\; \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) \;=\; E_{\text{out}}(g^-). \qquad (4.8)$$
The first step uses the linearity of expectation, and the second step follows because e(g⁻(x_n), y_n) depends only on (x_n, y_n), and so
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[e(g^-(x_n), y_n)\big] = \mathbb{E}_{x_n}\big[e(g^-(x_n), y_n)\big] = E_{\text{out}}(g^-).$$

How reliable is E_val at estimating E_out? In the case of classification, one can use the VC bound to predict how good the validation error is as an estimate for the out-of-sample error. We can view D_val as an 'in-sample' data set on which we computed the error of the single hypothesis g⁻. We can thus apply the VC bound for a finite model with one hypothesis in it (the Hoeffding bound). With high probability,
$$E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.9)$$
While Inequality (4.9) applies to binary target functions, we may use the variance of E_val as a more generally applicable measure of the reliability. The next exercise studies how the variance of E_val depends on K (the size of the validation set), and implies that a similar bound holds for regression. The conclusion is that the error between E_val(g⁻) and E_out(g⁻) drops as σ(g⁻)/√K, where σ(g⁻) is bounded by a constant in the case of classification.
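A minimal sketch (my own, with assumed function names and squared error) of creating a validation set and computing E_val as described above:

```python
import numpy as np

def split_train_val(X, y, K, rng):
    """Randomly set aside K points for validation; the rest are for training."""
    N = len(y)
    idx = rng.permutation(N)
    val, train = idx[:K], idx[K:]
    return X[train], y[train], X[val], y[val]

def validation_error(g_minus, X_val, y_val):
    """E_val(g^-): average pointwise squared error on the held-out set."""
    return np.mean((g_minus(X_val) - y_val) ** 2)

# Illustrative usage with a linear hypothesis learned on the training part:
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)
X_tr, y_tr, X_val, y_val = split_train_val(X, y, K=20, rng=rng)
w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]     # g^- trained on D_train only
E_val = validation_error(lambda Z: Z @ w, X_val, y_val)
```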
Exercise 4.7
Fix g⁻ (learned from D_train) and define σ²_val = Var_{D_val}[E_val(g⁻)]. We consider how σ²_val depends on K. Let σ²(g⁻) = Var_x[e(g⁻(x), y)] be the pointwise variance in the out-of-sample error of g⁻.
(a) Show that σ²_val = (1/K) σ²(g⁻).
(b) In a classification problem, where e(g⁻(x), y) = [g⁻(x) ≠ y], express σ²_val in terms of P[g⁻(x) ≠ y].
(c) Show that, for any g⁻ in a classification problem, σ²_val ≤ 1/(4K).
(d) Is there a uniform upper bound for Var[E_val(g⁻)], similar to (c), in the case of regression with squared error e(g⁻(x), y) = (g⁻(x) − y)²? [Hint: the squared error is unbounded.]
(e) For regression with squared error, if we train using fewer points (smaller N − K) to get g⁻, do you expect σ²(g⁻) to be higher or lower? [Hint: for continuous, non-negative random variables, higher mean often implies higher variance.]
(f) Conclude that increasing the size of the validation set can result in a better or a worse estimate of E_out.
The expected validation error for H_2 is illustrated in Figure 4.8, where we used the experimental design in Exercise 4.2 with Q_f = 10, N = 40 and noise level 0.4. The expected validation error equals E_out(g⁻), per Equation (4.8).

[Figure 4.8: The expected validation error E[E_val(g⁻)] as a function of the size of the validation set, K; the shaded area is E[E_val] ± σ_val.]

The figure clearly shows that there is a price to be paid for setting aside K data points to get this unbiased estimate of E_out: when we set aside more data for validation, there are fewer training data points and so g⁻ becomes worse; E_out(g⁻), and hence the expected validation error, increases (the blue curve). As we expect, the uncertainty in E_val, as measured by σ_val (the size of the shaded region), decreases with K, up to the point where the variance σ²(g⁻) gets really bad. This point comes when the number of training data points becomes critically small, as in Exercise 4.7(e). If K is neither too small nor too large, E_val provides a good estimate of E_out. A rule of thumb in practice is to set K = N/5 (set aside 20% of the data for validation).

We have established two conflicting demands on K. It has to be big enough for E_val to be reliable, and it has to be small enough so that the training set with N − K points is big enough to get a decent g⁻. Inequality (4.9) quantifies the first demand. The second demand is quantified by the learning curve
discussed in Section 2.3.2 (also the blue curve in Figure 4.8, from right to left), which shows how the expected out-of-sample error goes down as the number of training data points goes up. The fact that more training data lead to a better final hypothesis has been extensively verified empirically, although it is challenging to prove theoretically.

Restoring D. Although the learning curve suggests that taking out K data points for validation and using only N − K for training will cost us in terms of E_out, we do not have to pay that price! The purpose of validation is to estimate the out-of-sample performance, and E_val happens to be a good estimate of E_out(g⁻). This does not mean that we have to output g⁻ as our final hypothesis. The primary goal is to get the best possible hypothesis, so we should output g, the hypothesis trained on the entire set D. The secondary goal is to estimate E_out, which is what validation allows us to do.

[Figure 4.9: Using a validation set to estimate E_out.]

Based on our discussion of learning curves, E_out(g) ≤ E_out(g⁻), so
$$E_{\text{out}}(g) \le E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.10)$$
The first inequality is subdued because it was not rigorously proved. If we first train with N − K data points, validate with the remaining K data points, and then retrain using all the data to get g, the validation error we got will likely still be better at estimating E_out(g) than the estimate using the VC bound with E_in(g), especially for large hypothesis sets with big d_vc.

So far, we have treated the validation set as a way to estimate E_out, without involving it in any decisions that affect the learning process. Estimating E_out is a useful role by itself; a customer would typically want to know how good the final hypothesis is (in fact, the inequalities in (4.10) suggest that the validation error is a pessimistic estimate of E_out, so your customer is likely to be pleasantly surprised when he tries your system on new data). However, as we will see next, an important role of a validation set is in fact to guide the learning process. That's what distinguishes a validation set from a test set.
4.3.2 Model Selection
By far, the most important use of validation is model selection. This could mean the choice between a linear model and a nonlinear model, the choice of the order of the polynomial in a model, the choice of the value of a regularization
[Figure 4.10: Optimistic bias of the validation error when using a validation set for the model selected.]
parameter, or any other choice that affects the learning process. In almost every learning situation, there are some choices to be made, and we need a principled way of making these choices.

The leap is to realize that validation can be used to estimate the out-of-sample error for more than one model. Suppose we have M models H_1, ..., H_M. Validation can be used to select one of these models. Use the training set D_train to learn a final hypothesis g_m⁻ for each model H_m. Now evaluate each model on the validation set to obtain the validation errors E_1, ..., E_M, where
$$E_m = E_{\text{val}}(g_m^-), \qquad m = 1, \ldots, M.$$
The validation errors estimate the out-of-sample error E_out(g_m⁻) for each H_m.

Exercise 4.8
Is E_m an unbiased estimate for the out-of-sample error E_out(g_m⁻)?

It is now a simple matter to select the model with lowest validation error. Let m* be the index of the model which achieves the minimum validation error. So, for H_{m*}, E_{m*} ≤ E_m for m = 1, ..., M. The model H_{m*} is the model selected based on the validation errors. Note that E_{m*} is no longer an unbiased estimate of E_out(g_{m*}⁻). Since we selected the model with minimum validation error, E_{m*} will have an optimistic bias. This optimistic bias when selecting between H_2 and H_5 is illustrated in Figure 4.10, using the experimental design described in Exercise 4.2 with noise level σ² = 0.4.
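A compact sketch (my own, with assumed polynomial models and squared error) of the model-selection procedure just described: train each model on D_train, evaluate on D_val, and pick the model with the smallest validation error.

```python
import numpy as np

def poly_features(x, Q):
    """Feature transform 1, x, ..., x^Q (a stand-in for a model H_Q)."""
    return np.vander(x, Q + 1, increasing=True)

def select_model(x_tr, y_tr, x_val, y_val, orders):
    """Return the order Q* with minimum validation error, plus all E_m's."""
    val_errors = {}
    for Q in orders:
        Z_tr = poly_features(x_tr, Q)
        w = np.linalg.lstsq(Z_tr, y_tr, rcond=None)[0]       # g_Q^- from D_train
        pred = poly_features(x_val, Q) @ w
        val_errors[Q] = np.mean((pred - y_val) ** 2)          # E_val(g_Q^-)
    Q_star = min(val_errors, key=val_errors.get)
    return Q_star, val_errors

# Illustrative usage, e.g., choosing between H_2 and H_5:
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 40)
y = x**3 - x + 0.4 * rng.normal(size=40)
Q_star, errs = select_model(x[:30], y[:30], x[30:], y[30:], orders=[2, 5])
```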
Exercise 4.9
Referring to Figure 4.10, why are both curves increasing with K? Why do they converge to each other with increasing K?
How good is the generalization error for this entire process of model selection using validation? Consider a new model H_val consisting of the final hypotheses learned from the training data using each model H_1, ..., H_M:
$$\mathcal{H}_{\text{val}} = \{g_1^-, g_2^-, \ldots, g_M^-\}.$$
Model selection using the validation set chose one of the hypotheses in H_val based on its performance on D_val. Since the model H_val was obtained before ever looking at the data in the validation set, this process is entirely equivalent to learning a hypothesis from H_val using the data in D_val. The validation errors E_val(g_m⁻) are 'in-sample' errors for this learning process, and so we may apply the VC bound for finite hypothesis sets, with |H_val| = M:
$$E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.11)$$
What if we didn't use a validation set to choose the model? One alternative would be to use the in-sample errors from each model as the model selection criterion. Specifically, pick the model which gives a final hypothesis with minimum in-sample error. This is equivalent to picking the hypothesis with minimum in-sample error from the grand model which contains all the hypotheses in each of the original models. If we want a bound on the out-of-sample error for the final hypothesis that results from this selection, we need to apply the VC penalty for this grand hypothesis set, which is the union of the M hypothesis sets (see Problem 2.14). Since this grand hypothesis set can have a huge VC dimension, the bound in (4.11) will generally be tighter.

[Figure 4.11: Using a validation set for model selection.]

The goal of model selection is to select the best model and output the best hypothesis from that model. Specifically, we want to select the model m for which E_out(g_m) will be minimum when we retrain with all the data. Model selection using a validation set relies on the leap of faith that if E_out(g_m⁻) is minimum, then E_out(g_m) is also minimum. The validation errors E_m estimate E_out(g_m⁻), so, modulo our leap of faith, the validation set should pick the right model. However, no matter which model m* is selected, based on the discussion of learning curves in the previous section, we should not output g_{m*}⁻ as the final hypothesis. Rather, once m* is selected using validation, learn using all the data and output g_{m*}, which satisfies
$$E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.12)$$
Again, the first inequality is subdued because we didn't prove it.
[Figure 4.12: Model selection between H_2 and H_5 using a validation set (expected error versus validation set size K). The solid black line uses E_in for model selection, which always selects H_5. The dotted line shows the optimal model selection, if we could select the model based on the true out-of-sample error; this is unachievable, but a useful benchmark. The best performer is clearly the validation set, outputting g_{m*}; for suitable K, even g_{m*}⁻ is better than in-sample selection.]
Continuing our experiment from Figure 4.10, we evaluate the out-of-sample performance when using a validation set to select between the models H_2 and H_5. The results are shown in Figure 4.12. Validation is a clear winner over using E_in for model selection.
Exercise 4.10
(a) From Figure 4.12, E[E_out(g_{m*}⁻)] is initially decreasing. How can this be, if E[E_out(g_m⁻)] is increasing in K for each m?
(b) From Figure 4.12 we see that E[E_out(g_{m*}⁻)] is initially decreasing, and then it starts to increase. What are the possible reasons for this?
(c) When K = 1, E[E_out(g_{m*}⁻)] < E[E_out(g_{m*})]. How can this be, if the learning curves for both models are decreasing?
Example 4.3. We can use a validation set to select the value of the regularization parameter λ in the augmented error of (4.6). Although the most important part of a model is the hypothesis set, every hypothesis set has an associated learning algorithm which selects the final hypothesis g. Two models may differ only in the learning algorithm, while working with the same hypothesis set. Changing the value of λ in the augmented error changes the learning algorithm (the criterion by which g is selected) and effectively changes the model.

Based on this discussion, consider M different models corresponding to the same hypothesis set H, but with different choices for λ in the augmented error. So, we have (H, λ_1), (H, λ_2), ..., (H, λ_M) as our different models. We
may, for example, choose λ_1 = 0.01, λ_2 = 0.02, ..., λ_M = 10. Using a validation set to choose one of these models amounts to determining the value of λ to within a resolution of 0.01.

We have analyzed validation for model selection based on a finite number of models. If validation is used to choose the value of a parameter, for example λ as in the previous example, then the value of M will depend on the resolution to which we determine that parameter. In the limit, the selection is actually among an infinite number of models, since the value of λ can be any real number. What happens to bounds like (4.11) and (4.12), which depend on M? Just as the Hoeffding bound for a finite hypothesis set did not collapse when we moved to infinite hypothesis sets with finite VC dimension, bounds like (4.11) and (4.12) will not completely collapse either. We can derive VC-type bounds here too, because even though there are an infinite number of models, these models are all very similar; they differ only slightly in the value of λ. As a rule of thumb, what matters is the number of parameters we are trying to set. If we have only one or a few parameters, the estimates based on a decent-sized validation set would be reliable. The more choices we make based on the same validation set, the more 'contaminated' the validation set becomes, and the less reliable its estimates will be. The more we use the validation set to fine-tune the model, the more the validation set becomes like a training set used to 'learn the right model'; and we all know how limited a training set is in its ability to estimate E_out.

You will be hard pressed to find a serious learning problem in which validation is not used. Validation is a conceptually simple technique, easy to apply in almost any setting, and requires no specific knowledge about the details of a model. The main drawback is the reduced size of the training set, but that can be significantly mitigated through a modified version of validation which we discuss next.
4.3.3 Cross Validation
Validation relies on the following chain of reasoning,
$$E_{\text{out}}(g) \;\underset{\text{(small }K)}{\approx}\; E_{\text{out}}(g^-) \;\underset{\text{(large }K)}{\approx}\; E_{\text{val}}(g^-),$$
which highlights the dilemma we face in trying to select K. We are going to output g. When K is large, there is a discrepancy between the two out-of-sample errors E_out(g⁻) (which E_val directly estimates) and E_out(g) (which is the final error when we learn using all the data D). We would like to choose K as small as possible in order to minimize the discrepancy between E_out(g⁻) and E_out(g); ideally K = 1. However, if we make this choice, we lose the reliability of the validation estimate, as the bound on the RHS of (4.9) becomes huge. The validation error E_val(g⁻) will still be an unbiased estimate of E_out(g⁻)
(g⁻ is trained on N − 1 points), but it will be so unreliable as to be useless, since it is based on only one data point. This brings us to the cross validation estimate of the out-of-sample error. We will focus on the leave-one-out version, which corresponds to a validation set of size K = 1, and is also the easiest case to illustrate. More popular versions typically use larger K, but the essence of the method is the same.

There are N ways to partition the data into a training set of size N − 1 and a validation set of size 1. Specifically, let D_n be the data set after leaving out data point (x_n, y_n). Denote the final hypothesis learned from D_n by g_n⁻. Let e_n be the error made by g_n⁻ on its validation set, which is just the single data point {(x_n, y_n)}:
$$e_n = E_{\text{val}}(g_n^-) = e\big(g_n^-(x_n), y_n\big).$$
The cross validation estimate is the average value of the e_n's,
$$E_{\text{cv}} = \frac{1}{N}\sum_{n=1}^{N} e_n.$$

[Figure 4.13: Illustration of leave-one-out cross validation for a linear fit using three data points. The average of the three red errors obtained by the linear fits, leaving out one data point at a time, is E_cv.]

Figure 4.13 illustrates cross validation on a simple example. Each e_n is a wild, yet unbiased, estimate of the corresponding E_out(g_n⁻), which follows after setting K = 1 in (4.8). With cross validation, we have N functions g_1⁻, ..., g_N⁻, together with the N error estimates e_1, ..., e_N. The hope is that these N errors together would be almost equivalent to estimating E_out on a reliable validation set of size N, while at the same time we managed to use N − 1 points to obtain each g_n⁻.
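A minimal leave-one-out cross validation sketch (my own, with an assumed squared-error measure and a generic fitting routine passed in as a function):

```python
import numpy as np

def leave_one_out_cv(X, y, fit):
    """E_cv = (1/N) sum_n e(g_n^-(x_n), y_n), with g_n^- trained on all points but n."""
    N = len(y)
    errors = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        g_minus = fit(X[keep], y[keep])                     # hypothesis trained without point n
        errors[n] = (g_minus(X[n:n + 1])[0] - y[n]) ** 2    # e_n on the left-out point
    return errors.mean(), errors

# Illustrative usage with an unregularized linear fit:
def linear_fit(X, y):
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Z: Z @ w

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])
y = 2.0 * X[:, 1] + 0.3 * rng.normal(size=20)
E_cv, e_n = leave_one_out_cv(X, y, linear_fit)
```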
Let's try to understand why E_cv is a good estimator of E_out. First and foremost, E_cv is an unbiased estimator of 'E_out(g⁻)'. We have to be a little careful here because we don't have a single hypothesis g⁻, as we did when using a single validation set. Depending on the (x_n, y_n) that was taken out, each g_n⁻ can be a different hypothesis. To understand the sense in which E_cv estimates E_out, we need to revisit the concept of the learning curve.

Ideally, we would like to know E_out(g). The final hypothesis g is the result of learning on a random data set D of size N. It is almost as useful to know the expected performance of your model when you learn on a data set of size N; the hypothesis g is just one such instance of learning on a data set of size N. This expected performance, averaged over data sets of size N, when viewed as a function of N, is exactly the learning curve shown in Figure 4.2. More formally, for a given model, let
$$\bar{E}_{\text{out}}(N) = \mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g)\big]$$
be the expectation (over data sets D of size N) of the out-of-sample error produced by the model. The expected value of E_cv is exactly Ē_out(N − 1). This is true because it is true for each individual validation error e_n:
$$\mathbb{E}_{\mathcal{D}}[e_n] = \mathbb{E}_{\mathcal{D}_n}\mathbb{E}_{(x_n, y_n)}\big[e(g_n^-(x_n), y_n)\big] = \mathbb{E}_{\mathcal{D}_n}\big[E_{\text{out}}(g_n^-)\big] = \bar{E}_{\text{out}}(N-1).$$
Since this equality holds for each e_n, it also holds for the average. We highlight this result by making it a theorem.

Theorem 4.4. E_cv is an unbiased estimate of Ē_out(N − 1) (the expectation of the model performance, E_D[E_out], over data sets of size N − 1).

Now that we have our cross validation estimate of E_out, there is no need to output any of the g_n⁻ as our final hypothesis. We might as well squeeze every last drop of performance and retrain using the entire data set D, outputting g as the final hypothesis and getting the benefit of going from N − 1 to N on the learning curve. In this case, the cross validation estimate will on average be an upper estimate for the out-of-sample error: E_out(g) ≤ E_cv, so expect to be pleasantly surprised, albeit slightly.

[Figure 4.14: Using cross validation to estimate E_out.]

With just simple validation and a validation set of size K = 1, we know that the validation estimate will not be reliable. How reliable is the cross validation estimate E_cv? We can measure the reliability using the variance of E_cv.
Unfortunately, while we were able to pin down the expectation of E_cv, the variance is not so easy. If the N cross validation errors e_1, ..., e_N were equivalent to N errors on a totally separate validation set of size N, then E_cv would indeed be a reliable estimate, for decent-sized N. The equivalence would hold if the individual e_n's were independent of each other. Of course, this is too optimistic. Consider two validation errors e_n, e_m. The validation error e_n depends on g_n⁻, which was trained on data containing (x_m, y_m). Thus, e_n has a dependency on (x_m, y_m). The validation error e_m is computed using (x_m, y_m) directly, and so it also has a dependency on (x_m, y_m). Consequently, there is a possible correlation between e_n and e_m through the data point (x_m, y_m). That correlation wouldn't be there if we were validating a single hypothesis using N fresh (independent) data points.

How much worse is the cross validation estimate as compared to an estimate based on a truly independent set of N validation errors? A VC-type probabilistic bound, or even a computation of the asymptotic variance of the cross validation estimate (Problem 4.23), is challenging. One way to quantify the reliability of E_cv is to compute how many fresh validation data points would have a comparable reliability to E_cv, and Problem 4.24 discusses one way to do this. There are two extremes for this effective size (the effective number of fresh examples giving a comparable estimate of E_out). On the high end is N, which means that the cross validation errors are essentially independent. On the low end is 1, which means that E_cv is only as good as any single one of the individual cross validation errors e_n, i.e., the cross validation errors are totally dependent. While one cannot prove anything theoretically, in practice the reliability of E_cv is much closer to the higher end.
Cross validation for model selection. In Figure 4.11, the estimates E_m of the out-of-sample error for model H_m were obtained using the validation set. Instead, we may use cross validation estimates to obtain E_m: use cross validation to obtain estimates of the out-of-sample error for each model H_1, ..., H_M, and select the model with the smallest cross validation error. Now, train this model selected by cross validation using all the data to output a final hypothesis, making the usual leap of faith that E_out(g⁻) tracks E_out(g) well.

Example 4.5. In Figure 4.13, we illustrated cross validation for estimating E_out of a linear model (h(x) = ax + b) using a simple experiment with three data points generated from a constant target function with noise. We now consider a second model, the constant model (h(x) = b). We can also use cross validation to estimate E_out for the constant model, illustrated in Figure 4.15.
[Figure 4.15: Leave-one-out cross validation error for a constant fit.]

If we use the in-sample error after fitting all the data (three points), then the linear model wins, because it can use its additional degree of freedom to fit the data better. The same is true with the cross validation data sets of size two: the linear model has perfect in-sample error. But with cross validation, what matters is the error on the outstanding point in each of these fits. Even to the naked eye, the average of the cross validation errors is smaller for the constant model, which obtained a lower E_cv than the linear model. The constant model wins, according to cross validation. The constant model also has lower E_out, and so cross validation selected the correct model in this example.
One important use of validation is to estimate the optimal regularization parameter λ, as described in Example 4.3. We can use cross validation for the same purpose, as summarized in the algorithm below.

Cross validation for selecting λ:
1: Define M models by choosing different values for λ in the augmented error: (H, λ_1), (H, λ_2), ..., (H, λ_M).
2: for each model m = 1, ..., M, do
3:    Use the cross validation module in Figure 4.14 to estimate E_cv(m), the cross validation error for model m.
4: Select the model m* with minimum E_cv(m*).
5: Use model (H, λ_{m*}) and all the data D to obtain the final hypothesis g_{m*}. Effectively, you have estimated the optimal λ.
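A short sketch (my own, assuming weight-decay linear regression and squared error) of the cross-validation-for-λ procedure above, reusing a leave-one-out loop:

```python
import numpy as np

def wreg(Z, y, lam):
    """Weight-decay solution (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def loo_cv_error(Z, y, lam):
    """Leave-one-out E_cv for weight-decay regression at a given lambda."""
    N = len(y)
    errs = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        w = wreg(Z[keep], y[keep], lam)
        errs[n] = (Z[n] @ w - y[n]) ** 2
    return errs.mean()

def select_lambda(Z, y, lambdas):
    """Pick lambda with smallest E_cv, then retrain on all the data."""
    cv = {lam: loo_cv_error(Z, y, lam) for lam in lambdas}
    lam_star = min(cv, key=cv.get)
    return lam_star, wreg(Z, y, lam_star)

# Illustrative usage:
rng = np.random.default_rng(6)
Z = rng.normal(size=(30, 5))
y = Z @ rng.normal(size=5) + 0.3 * rng.normal(size=30)
lam_star, w_final = select_lambda(Z, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```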
We see from Figure 4.14 that estimating E_cv for just a single model requires N rounds of learning on D_n, each of size N − 1. So the cross validation algorithm above requires MN rounds of learning. This is a formidable task. If we could analytically obtain E_cv, that would be a big bonus, but analytic results are often difficult to come by for cross validation. One exception is in the case of linear models, where we are able to derive an exact analytic formula for the cross validation estimate.
Analytic computation of E_cv for linear models. Recall that, for linear regression with weight decay, w_reg = (ZᵀZ + λI)⁻¹Zᵀy, and the in-sample predictions are
$$\hat{y} = H(\lambda)\, y, \qquad H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T.$$
Given H(λ), ŷ, and y, it turns out that we can analytically compute the cross validation estimate as
$$E_{\text{cv}} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - H_{nn}(\lambda)}\right)^2. \qquad (4.13)$$
Notice that the cross validation estimate is very similar to the in-sample error, E_in = (1/N) Σ_n (ŷ_n − y_n)², differing only by a normalization of each term in the sum by a factor 1/(1 − H_nn(λ))². One use of this analytic formula is that it can be directly optimized to obtain the best regularization parameter λ. A proof of this remarkable formula is given in Problem 4.26.
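A small sketch (my own) of Equation (4.13): it computes the leave-one-out estimate for weight-decay linear regression directly from H(λ), without N retrainings. The data are illustrative.

```python
import numpy as np

def analytic_loo_cv(Z, y, lam):
    """E_cv = (1/N) sum_n ((yhat_n - y_n) / (1 - H_nn(lambda)))^2, per (4.13)."""
    d = Z.shape[1]
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)   # H(lambda)
    y_hat = H @ y
    return np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)

# Illustrative usage (can be compared against an explicit leave-one-out loop):
rng = np.random.default_rng(7)
Z = rng.normal(size=(25, 4))
y = Z @ rng.normal(size=4) + 0.2 * rng.normal(size=25)
E_cv = analytic_loo_cv(Z, y, lam=0.5)
```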
Even when we cannot derive such an analytic characterization of cross validation, the technique widely results in good out-of-sample error estimates in practice, and so the computational burden is often worth enduring. Also, as with using a validation set, cross validation applies in almost any setting without requiring specific knowledge about the details of the models.

So far, we have lived in a world of unlimited computation, and all that mattered was out-of-sample error; in reality, computation time can be of consequence, especially with huge data sets. For this reason, leave-one-out cross validation may not be the method of choice.⁴ A popular derivative of leave-one-out cross validation is V-fold cross validation.⁵ In V-fold cross validation, the data are partitioned into V disjoint sets (or folds) D_1, ..., D_V, each of size approximately N/V; each set D_v in this partition serves as a validation set to compute a validation error for a hypothesis learned on a training set which is the complement of the validation set, D \ D_v. So, you always validate a hypothesis on data that was not used for training that particular hypothesis. The V-fold cross validation error is the average of the V validation errors that are obtained, one from each validation set D_v. Leave-one-out cross validation is the same as N-fold cross validation. The gain from choosing V ≪ N is computational. The drawback is that you will be estimating E_out for a hypothesis g⁻ trained on less data (as compared with leave-one-out), and so the discrepancy between E_out(g) and E_out(g⁻) will be larger. A common choice in practice is 10-fold cross validation, and one of the folds is illustrated below.
[Illustration: the data set partitioned into 10 folds; one fold serves as the validation set while the remaining folds together form the training set.]
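A minimal V-fold cross validation sketch (my own, with an assumed generic fit routine and squared error), in the spirit of the 10-fold scheme just described:

```python
import numpy as np

def v_fold_cv(X, y, fit, V=10, seed=0):
    """Average validation error over V disjoint folds."""
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, V)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)
        g_minus = fit(X[train_idx], y[train_idx])       # trained without this fold
        errs.append(np.mean((g_minus(X[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errs))

# Illustrative usage with a linear fit:
def linear_fit(X, y):
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Z: Z @ w

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(60), rng.uniform(-1, 1, 60)])
y = 1.5 * X[:, 1] + 0.2 * rng.normal(size=60)
E_cv10 = v_fold_cv(X, y, linear_fit, V=10)
```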
⁴ Stability problems have also been reported in leave-one-out.
⁵ Some authors call it K-fold cross validation, but we use V so as not to confuse it with the size of the validation set K.
4.3.4 Theory Versus Practice
Both validation and cross validation present challenges for the mathematical theory of learning, similar to the challenges presented by regularization. The theory of generalization, in particular the VC analysis, forms the foundation for learnability. It provides us with guidelines under which it is possible to make a generalization conclusion with high probability. It is not straightforward, and sometimes not possible, to rigorously carry over these conclusions to the analysis of validation, cross validation, or regularization. What is possible, and indeed quite effective, is to use the theory as a guideline. In the case of regularization, constraining the choice of a hypothesis leads to better generalization, as we would intuitively expect, even if the hypothesis set remains technically the same. In the case of validation, making a choice for few parameters does not overly contaminate the validation estimate of E_out, even if the VC guarantee for these estimates is too weak. In the case of cross validation, the benefit of averaging several validation errors is observed, even if the estimates are not independent.

Although these techniques were based on sound theoretical foundation, they are to be considered heuristics, because they do not have a full mathematical justification in the general case. Learning from data is an empirical task with theoretical underpinnings. We prove what we can prove, but we use the theory as a guideline when we don't have a conclusive proof. In a practical application, heuristics may win over a rigorous approach that makes unrealistic assumptions. The only way to be convinced about what works and what doesn't in a given situation is to try out the techniques and see for yourself.

The basic message in this chapter can be summarized as follows.

1. Noise (stochastic or deterministic) affects learning adversely, leading to overfitting.
2. Regularization helps to prevent overfitting by constraining the model, reducing the impact of the noise, while still giving us flexibility to fit the data.
3. Validation and cross validation are useful techniques for estimating E_out. One important use of validation is model selection, in particular to estimate the amount of regularization to use.
Example 4.6. We illustrate validation on the handwritten digit classification task of deciding whether a digit is 1 or not (see also Example 3.1), based on the two features which measure the symmetry and average intensity of the digit. The data is shown in Figure 4.16(a).
[Figure 4.16: (a) The digits classification task (average intensity versus symmetry), of which 500 examples are selected for the training set. (b) The data are transformed via the 5th order polynomial transform to a 20-dimensional feature vector; the error curves are shown as we vary the number of these features used for classification.]

We have randomly selected 500 data points as the training data, and the remaining are used as a test set for evaluation. We considered a nonlinear feature transform to the 5th order polynomial feature space:
$$(1, x_1, x_2) \;\rightarrow\; (1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2,\; x_1^3,\; x_1^2 x_2,\; x_1 x_2^2,\; x_2^3,\; \ldots,\; x_1^5,\; x_1^4 x_2,\; x_1^3 x_2^2,\; x_1^2 x_2^3,\; x_1 x_2^4,\; x_2^5).$$
Figure 4.16(b) shows the in-sample error as you use more of the transformed features, increasing the dimension from 1 to 20. As you add more dimensions (increase the complexity of the model), the in-sample error drops, as expected.
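A small sketch (my own) of the 5th order polynomial feature transform described above, producing the 20 non-constant monomials in (x_1, x_2):

```python
import numpy as np

def poly5_transform(x1, x2):
    """All monomials x1^i * x2^j with 1 <= i + j <= 5 (20 features)."""
    feats = [x1**i * x2**j
             for deg in range(1, 6)
             for i in range(deg, -1, -1)
             for j in [deg - i]]
    return np.stack(feats, axis=-1)

# Illustrative usage on two made-up feature columns (intensity, symmetry):
rng = np.random.default_rng(9)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
Z = poly5_transform(x1, x2)   # shape (100, 20)
```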
The out-of-sample error drops at first, and then starts to increase, as we hit the approximation-generalization tradeoff. The leave-one-out cross validation error tracks the behavior of the out-of-sample error quite well. If we were to pick a model based on the in-sample error, we would use all 20 dimensions. The cross validation error is minimized between 5-7 feature dimensions; we take 6 feature dimensions as the model selected by cross validation. The table below summarizes the resulting performance metrics:

                     No Validation    Cross Validation
    E_in                  0%               0.8%
    E_out                 2.5%             1.5%

Cross validation results in a performance improvement of about 1%, which is a massive relative improvement (a 40% reduction in the error rate).
Exercise 4.11
In this particular experiment, the black curve (E_cv) is sometimes below and sometimes above the red curve (E_out). If we repeated this experiment many times, and plotted the average black and red curves, would you expect the black curve to lie above or below the red curve?
It is illuminating to see the actual classification boundaries learned with and without validation. These resulting classifiers, together with the 500 in-sample data points, are shown in the next figure.

[Left: 20-dimensional classifier (no validation), E_in = 0%, E_out = 2.5%. Right: 6-dimensional classifier (leave-one-out cross validation), E_in = 0.8%, E_out = 1.5%. Both plotted in the average intensity-symmetry plane.]

It is clear that the worse out-of-sample performance of the classifier picked without validation is due to the overfitting of a few noisy points in the training data. While the training data is perfectly separated, the shape of the resulting boundary seems highly contorted, which is a symptom of overfitting. Does this remind you of the first example that opened the chapter? There, albeit in a toy example, we similarly obtained a highly contorted fit. As you can see, overfitting is real, and here to stay!
4.4 Problems
Problem 4.1
Plot the monomials of order i, φ_i(x) = x^i. As you increase the order, does this correspond to the intuitive notion of increasing complexity?

Problem 4.2
Consider the feature transform z = [L_0(x), L_1(x), L_2(x)]ᵀ and the linear model h(x) = wᵀz. For the hypothesis with w = [1, −1, 1]ᵀ, what is h(x) explicitly as a function of x? What is its degree?
Problem 4.3
The Legendre Polynomials are a family of orthogonal polynomials which are useful for regression. The first two Legendre Polynomials are L_0(x) = 1, L_1(x) = x. The higher order Legendre Polynomials are defined by the recursion
$$L_k(x) = \frac{2k-1}{k}\, x\, L_{k-1}(x) - \frac{k-1}{k}\, L_{k-2}(x).$$
(a) What are the first six Legendre Polynomials? Use the recursion to develop an efficient algorithm to compute L_0(x), ..., L_K(x) given x. Your algorithm should run in time linear in K. Plot the first six Legendre polynomials.
(b) Show that L_k(x) is a linear combination of monomials x^k, x^{k−2}, ... (either all odd or all even order, with highest order k). Thus, L_k(−x) = (−1)^k L_k(x).
(c) Show that (x² − 1)/k · dL_k(x)/dx = x L_k(x) − L_{k−1}(x). [Hint: use induction.]
(d) Use part (c) to show that L_k satisfies Legendre's differential equation
$$\frac{d}{dx}\!\left[(x^2 - 1)\,\frac{dL_k(x)}{dx}\right] = k(k+1)\, L_k(x).$$
This means that the Legendre Polynomials are eigenfunctions of a Hermitian linear differential operator and, from Sturm-Liouville theory, they form an orthogonal basis for continuous functions on [−1, 1].
(e) Use the recurrence to show directly the orthogonality property:
$$\int_{-1}^{1} dx\; L_k(x) L_\ell(x) = \begin{cases} 0, & \ell \ne k, \\ \dfrac{2}{2k+1}, & \ell = k. \end{cases}$$
[Hint: use induction on k, with ℓ ≤ k. Use the recurrence for L_k and consider separately the four cases ℓ = k, k−1, k−2 and ℓ < k−2. For the case ℓ = k you will need to compute the integral ∫_{−1}^{1} dx x² L_{k−1}²(x). In order to do this, you could use the differential equation in part (c), multiply by x L_{k−1} and then integrate both sides (the LHS can be integrated by parts). Now solve the resulting equation for ∫_{−1}^{1} dx x² L_{k−1}²(x).]
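As a companion to part (a), here is a short sketch (my own) of the linear-time recursion for computing L_0(x), ..., L_K(x):

```python
import numpy as np

def legendre_up_to(K, x):
    """Return [L_0(x), ..., L_K(x)] using L_k = ((2k-1)/k) x L_{k-1} - ((k-1)/k) L_{k-2}."""
    x = np.asarray(x, dtype=float)
    L = [np.ones_like(x), x.copy()]
    for k in range(2, K + 1):
        L.append(((2 * k - 1) / k) * x * L[k - 1] - ((k - 1) / k) * L[k - 2])
    return L[:K + 1]

# Illustrative usage: evaluate the first six Legendre polynomials on a grid.
xs = np.linspace(-1, 1, 5)
L = legendre_up_to(5, xs)   # L[0] ... L[5]
```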
Problem 4.4
This problem is a detailed version of Exercise 4.2. We set up an experimental framework which the reader may use to study various aspects of overfitting. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H_2 and H_10. The target function is a polynomial of degree Q_f, which we write as f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where L_q(x) are the Legendre polynomials. We use the Legendre polynomials because they are a convenient orthogonal basis for the polynomials on [−1, 1] (see Section 4.2 and Problem 4.3 for some basic information on Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) + σ ε_n and the ε_n are iid standard Normal random variates.

For a single experiment, with specified values for Q_f, N, σ, generate a random degree-Q_f target function by selecting coefficients a_q independently from a standard Normal, rescaling them so that E_{a,x}[f²] = 1. Generate a data set, selecting x_1, ..., x_N independently from P(x) and y_n = f(x_n) + σ ε_n. Let g_2 and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively, with respective out-of-sample errors E_out(g_2) and E_out(g_10).

(a) Why do we normalize f? [Hint: how would you interpret σ?]
(b) How can we obtain g_2, g_10? [Hint: pose the problem as linear regression and use the technology from Chapter 3.]
(c) How can we compute E_out analytically for a given g_10?
(d) Vary Q_f, N, σ and, for each combination of parameters, run a large number of experiments, each time computing E_out(g_2) and E_out(g_10). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ) using H_2 and H_10. Let
    E_out(H_2) = average over experiments (E_out(g_2)),
    E_out(H_10) = average over experiments (E_out(g_10)).
Define the overfit measure E_out(H_10) − E_out(H_2). When is the overfit measure significantly positive (i.e., overfitting is serious) as opposed to significantly negative? Try the choices Q_f ∈ {1, 2, ..., 100}, N ∈ {20, 25, ..., 120}, σ² ∈ {0, 0.05, 0.1, ..., 2}. Explain your observations.
(e) Why do we take the average over many experiments? Use the variance to select an acceptable number of experiments to average over.
(f) Repeat this experiment for classification, where the target function is a noisy perceptron, f(x) = sign(Σ_{q=1}^{Q_f} a_q L_q(x) + ε). Notice that a_0 = 0, and the a_q's should be normalized so that E_{a,x}[(Σ_{q=1}^{Q_f} a_q L_q(x))²] = 1. For classification, the models H_2, H_10 contain the sign of the 2nd and 10th order polynomials respectively. You may use a learning algorithm for non-separable data from Chapter 3.
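A sketch (my own, under the stated setup) of generating one experiment's target and data set: random Legendre coefficients rescaled so that E_x[f²] = 1 for the realized target (one reading of the normalization), then noisy observations. It reuses the legendre_up_to routine sketched after Problem 4.3.

```python
import numpy as np

def legendre_up_to(K, x):
    x = np.asarray(x, dtype=float)
    L = [np.ones_like(x), x.copy()]
    for k in range(2, K + 1):
        L.append(((2 * k - 1) / k) * x * L[k - 1] - ((k - 1) / k) * L[k - 2])
    return np.stack(L[:K + 1])          # shape (K+1, len(x))

def make_experiment(Qf, N, sigma, rng):
    """Random degree-Qf Legendre target with E_x[f^2] = 1, plus a noisy data set."""
    a = rng.normal(size=Qf + 1)
    # For x uniform on [-1, 1], E_x[L_q^2] = 1/(2q+1), so E_x[f^2] = sum_q a_q^2 / (2q+1).
    a /= np.sqrt(np.sum(a**2 / (2 * np.arange(Qf + 1) + 1)))
    f = lambda x: a @ legendre_up_to(Qf, x)
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.normal(size=N)
    return f, x, y

# Illustrative usage:
f, x, y = make_experiment(Qf=20, N=50, sigma=1.0, rng=np.random.default_rng(10))
```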
Problem 4.5
If λ < 0 in the augmented error E_aug(w) = E_in(w) + λ wᵀw, what soft order constraint does this correspond to? [Hint: λ < 0 encourages large weights.]

Problem 4.6
In the augmented error minimization with Γ = I and λ > 0:
(a) Show that ‖w_reg‖ ≤ ‖w_lin‖, justifying the term weight decay. [Hint: start by assuming that ‖w_reg‖ > ‖w_lin‖ and derive a contradiction.] In fact, a stronger statement holds: ‖w_reg‖ is decreasing in λ.
(b) Explicitly verify this for linear models. [Hint: w_reg = (ZᵀZ + λI)⁻¹u, where u = Zᵀy and Z is the transformed data matrix. Show that ZᵀZ + λI has the same eigenvectors, with correspondingly larger eigenvalues, as ZᵀZ. Expand u in the eigenbasis of ZᵀZ. For a matrix A, how are the eigenvectors and eigenvalues of A² related to those of A?]
Problem 4.7
Show that the in-sample error E_in(w_reg) from Example 4.2 is an increasing function of λ, where w_reg = (ZᵀZ + λI)⁻¹Zᵀy and Z is the transformed data matrix. To do so, let the SVD of Z be Z = UΣVᵀ, and let ZᵀZ have eigenvalues σ_1², ..., σ_{d+1}². Define the vector a = Uᵀy. Show that
$$E_{\text{in}}(w_{\text{reg}}) = E_{\text{in}}(w_{\text{lin}}) + \frac{1}{N}\sum_{i} \left(\frac{\lambda}{\sigma_i^2 + \lambda}\right)^2 a_i^2,$$
and proceed from there.
Problem 4.8
In the augmented error minimization with Γ = I and λ > 0, assume that E_in is differentiable and use gradient descent to minimize E_aug:
$$w(t+1) \leftarrow w(t) - \eta\, \nabla E_{\text{aug}}(w(t)).$$
Show that the update rule above is the same as
$$w(t+1) \leftarrow (1 - 2\eta\lambda)\, w(t) - \eta\, \nabla E_{\text{in}}(w(t)).$$
Note: this is the origin of the name 'weight decay': w(t) decays before being updated by the gradient of E_in.
Problem 4.9
In Tikhonov regularization, the regularized weights are given by
$$w_{\text{reg}} = (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y.$$
The Tikhonov regularizer Γ is a k × (d+1) matrix, each row corresponding to a (d+1)-dimensional vector. Each row of Z corresponds to a (d+1)-dimensional vector (the first component is 1). For each row of Γ, construct a virtual example (z_i, 0), for i = 1, ..., k, where z_i is the vector obtained from the i-th row of Γ after scaling it by √λ, and the target value is 0. Add these k virtual examples to the data, to construct an augmented data set, and consider non-regularized regression with this augmented data.
(a) Show that, for the augmented data,
$$Z_{\text{aug}} = \begin{bmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{bmatrix}, \qquad y_{\text{aug}} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$
(b) Show that solving the least squares problem with Z_aug and y_aug results in the same regularized weight w_reg, i.e., w_reg = (Z_augᵀ Z_aug)⁻¹ Z_augᵀ y_aug.
This result may be interpreted as follows: an equivalent way to accomplish weight-decay-type regularization with linear models is to create a bunch of virtual examples all of whose target values are zero.
Problem 4.10
In this problem, you will investigate the relationship between the soft order constraint and the augmented error. The regularized weight w_reg is a solution to
$$\text{minimize } E_{\text{in}}(w) \quad \text{subject to} \quad w^T \Gamma^T \Gamma\, w \le C.$$
(a) If w_linᵀΓᵀΓw_lin ≤ C, then what is w_reg?
(b) If w_linᵀΓᵀΓw_lin > C, the situation is illustrated below:
[Figure: the constraint is satisfied in the shaded region, and the contours of constant E_in are the ellipsoids (why ellipsoids?).]
What is w_regᵀΓᵀΓw_reg?
(c) Show that, with
$$\lambda_C = -\frac{1}{2C}\, w_{\text{reg}}^T \nabla E_{\text{in}}(w_{\text{reg}}),$$
w_reg minimizes E_in(w) + λ_C wᵀΓᵀΓw. [Hint: use the previous part to solve for w_reg as an equality constrained optimization problem, using the method of Lagrange multipliers.]
(d) Show that the following hold for λ_C:
(i) If w_linᵀΓᵀΓw_lin ≤ C, then λ_C = 0 (w_lin itself satisfies the constraint).
(ii) If w_linᵀΓᵀΓw_lin > C, then λ_C > 0 (the penalty term is positive).
(iii) If w_linᵀΓᵀΓw_lin > C, then λ_C is a strictly decreasing function of C. [Hint: show that dλ_C/dC < 0 for C ∈ [0, w_linᵀΓᵀΓw_lin].]
Problem 4.11
For the linear model in Exercise 4.2, the target function is a polynomial of degree Q_f; the model is H_Q, with polynomials up to order Q. Assume Q ≥ Q_f. Then w_lin = (ZᵀZ)⁻¹Zᵀy, where y = Zw_f + ε, w_f specifies the target function, and Z is the matrix containing the transformed data.
(a) Show that w_lin = w_f + (ZᵀZ)⁻¹Zᵀε. What is the average function ḡ? Show that bias = 0 (recall that bias(x) = (ḡ(x) − f(x))²).
(b) Show that
$$\text{var} = \sigma^2\, \text{trace}\!\left(\Sigma_z\, \mathbb{E}_Z\!\big[(Z^T Z)^{-1}\big]\right), \qquad \text{where } \Sigma_z = \mathbb{E}[z z^T].$$
For the well specified linear model, the bias is zero, and the variance is increasing as the model gets larger (Q increases), but decreasing in N.
Problem 4.12
Use the setup in Problem 4.11 with Q ≥ Q_f. Consider regression with weight decay using a linear model H in the transformed space, with input probability distribution such that E[zzᵀ] = I. The regularized weights are given by w_reg = (ZᵀZ + λI)⁻¹Zᵀy, where y = Zw_f + ε.
(a) Show that w_reg = w_f − λ(ZᵀZ + λI)⁻¹w_f + (ZᵀZ + λI)⁻¹Zᵀε.
(b) Argue that, to first order in 1/N,
$$\text{bias} \approx \frac{\lambda^2}{(N+\lambda)^2}\,\|w_f\|^2, \qquad \text{var} \approx \frac{\sigma^2}{N}\,\mathbb{E}\big[\text{trace}(H^2(\lambda))\big],$$
where H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ.
If we plot the bias and var, we get a figure that is very similar to Figure 2.3, where the tradeoff was based on fit and complexity rather than bias and var. Here, the bias is increasing in λ (as expected) and in ‖w_f‖; the variance is decreasing in λ. When λ = 0, trace(H²(λ)) = Q + 1, and so trace(H²(λ)) appears to be playing the role of an effective number of parameters.

[Figure: bias and var as functions of the regularization parameter λ.]
Problem 4.13
Within the linear regression setting, many attempts have been made to quantify the effective number of parameters in a model. Three possibilities are:
(i) d_eff(λ) = 2 trace(H(λ)) − trace(H²(λ)),
(ii) d_eff(λ) = trace(H(λ)),
(iii) d_eff(λ) = trace(H²(λ)),
where H(λ) = Z(ZᵀZ + λΓᵀΓ)⁻¹Zᵀ and Z is the transformed data matrix. To obtain d_eff, one must first compute H(λ) as though you are doing regression. One can then heuristically use d_eff in place of d_vc in the VC bound.
(a) When λ = 0, show that, for all three choices, d_eff = d + 1, where d is the dimension in the Z space.
(b) When λ > 0, show that 0 ≤ d_eff ≤ d + 1 and d_eff is decreasing in λ, for all three choices. [Hint: use the singular value decomposition.]
Problem 4.14
The observed target values y can be separated into the true target values f = Zw_f and the noise ε, y = Zw_f + ε. The components of ε are iid with zero mean and variance σ². For linear regression with weight decay regularization, by taking the expected value of the in-sample error, show that
$$\mathbb{E}_{\varepsilon}\big[E_{\text{in}}(w_{\text{reg}})\big] = \frac{1}{N}\, w_f^T Z^T (I - H(\lambda))^2 Z\, w_f + \frac{\sigma^2}{N}\, \text{trace}\big((I - H(\lambda))^2\big)
= \frac{1}{N}\, w_f^T Z^T (I - H(\lambda))^2 Z\, w_f + \sigma^2\left(1 - \frac{d_{\text{eff}}}{N}\right),$$
where d_eff = 2 trace(H(λ)) − trace(H²(λ)), as defined in Problem 4.13(i), H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ, and Z is the transformed data matrix.
(a) If the noise was not overfit, what should the term involving σ² be, and why?
(b) Hence, argue that the degree to which the noise has been overfit is σ² d_eff/N. Interpret the dependence of this result on the parameters d_eff and N, to justify the use of d_eff as an effective number of parameters.
Problem 4.15
We further investigate d_eff of Problems 4.13 and 4.14. We know that H(λ) = Z(ZᵀZ + λΓᵀΓ)⁻¹Zᵀ. When Γ is square and invertible, as is usually the case (for example, with weight decay, Γ = I), denote Z̃ = ZΓ⁻¹. Let s_1², ..., s_{d+1}² be the eigenvalues of Z̃ᵀZ̃ (s_i² > 0 when Z has full column rank).
(a) For d_eff(λ) = 2 trace(H(λ)) − trace(H²(λ)), show that
$$d_{\text{eff}}(\lambda) = d + 1 - \lambda^2 \sum_{i=1}^{d+1} \frac{1}{(s_i^2 + \lambda)^2}.$$
(b) For d_eff(λ) = trace(H(λ)), show that
$$d_{\text{eff}}(\lambda) = d + 1 - \lambda \sum_{i=1}^{d+1} \frac{1}{s_i^2 + \lambda}.$$
(c) For d_eff(λ) = trace(H²(λ)), show that
$$d_{\text{eff}}(\lambda) = \sum_{i=1}^{d+1} \frac{s_i^4}{(s_i^2 + \lambda)^2}.$$
In all cases, for λ ≥ 0, 0 ≤ d_eff(λ) ≤ d + 1, d_eff(0) = d + 1, and d_eff is decreasing in λ. [Hint: use the singular value decomposition Z̃ = USVᵀ, where U and V are orthogonal and S is diagonal with entries s_i.]
Problem 4.16
For linear models and the general Tikhonov regularizer Γ, with penalty term (λ/N) wᵀΓᵀΓw in the augmented error, show that
$$w_{\text{reg}} = (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y,$$
where Z is the feature matrix.
(a) Show that the in-sample predictions are ŷ = H(λ)y, where H(λ) = Z(ZᵀZ + λΓᵀΓ)⁻¹Zᵀ.
(b) Simplify this in the case Γ = Z and obtain w_reg in terms of w_lin. This is called uniform weight decay.

Problem 4.17
To model uncertainty in the measurement of the inputs, assume that the true inputs x̂_n are the observed inputs x_n perturbed by some noise ε_n: the true inputs are given by x̂_n = x_n + ε_n. Assume that the ε_n are independent of (x_n, y_n), with E[ε_n] = 0 and covariance matrix E[ε_n ε_nᵀ] = σ_x² I. The learning algorithm minimizes the expected in-sample error Ê_in, where the expectation is with respect to the uncertainty in the true x̂_n:
$$\hat{E}_{\text{in}}(w) = \mathbb{E}_{\varepsilon_1, \ldots, \varepsilon_N}\!\left[\frac{1}{N}\sum_{n=1}^{N}\big(w^T \hat{x}_n - y_n\big)^2\right].$$
Show that the weights ŵ_lin which result from minimizing Ê_in are equivalent to the weights which would have been obtained by minimizing E_in = (1/N)Σ_n (wᵀx_n − y_n)² for the observed data, with Tikhonov regularization. What are Γ and λ (see Problem 4.16 for the general Tikhonov regularizer)? One can interpret this result as follows: regularization enforces a robustness to potential measurement errors (noise) in the observed inputs.
Problem 4.18
In a regression setting, assume the target function is linear, so f(x) = w_fᵀx, and y = Zw_f + ε, where the entries of ε are iid with zero mean and variance σ². In this problem derive the optimal value for λ as follows. Assume a regularization term (λ/N) wᵀZᵀZw and that E[xxᵀ] = I.
(a) Show that the average function is ḡ(x) = f(x)/(1+λ). What is the bias?
(b) Show that var is asymptotically σ²(d+1)/(N(1+λ)²). [Hint: Problem 4.12.]
(c) Use the bias and asymptotic variance to obtain an expression for E[E_out]. Optimize this with respect to λ to obtain the optimal regularization parameter. [Answer: λ* = σ²(d+1)/(N‖w_f‖²).]
(d) Explain the dependence of the optimal regularization parameter on the parameters of the learning problem. [Hint: write λ* = ((d+1)/N) · (σ²/‖w_f‖²).]
Problem 4.19
[The lasso algorithm] Rather than a soft order constraint on the squares of the weights, one could use the absolute values of the weights:
$$\text{minimize } E_{\text{in}}(w) \quad \text{subject to} \quad \sum_{i=0}^{d} |w_i| \le C.$$
The model is called the lasso algorithm.
(a) Formulate and implement this as a quadratic program. Use the experimental design in Problem 4.4 to compare the lasso algorithm with the quadratic penalty by giving plots of E_out versus regularization parameter.
(b) What is the augmented error? Is it more convenient to optimize?
(c) With d = 5 and N = 3, compare the weights from the lasso versus the quadratic penalty. [Hint: look at the number of non-zero weights.]
44 PROBLES
4 VERFITTING
Problem 4.20 In thi problem you will explore a conitency condition fr weight de cay Sup poe that we ma ke an iner tib le li near tran rm of the data
n
yn
I ntu itiely li near re greion hou ld n ot b e afected by a l in ear tranrm Th i mean th at the new o ptim al weight ho uld be gi en by a cor repondi ng l inea r tranf rm of the old optim al weight
a ) Suppoe
w min im ize the in am ple e rror or the o rig ina l problem . Show th at for the tranrmed proble m t he op ti ma l weight are
b ) Suppoe the regularization penalty term in the augmented error i wXXw r the srcinal data and
ww r the tranrme d data. On the srcinal data the regularized olution i () Show that for the tranrm ed proble m the am e l inear trano rm of () gie the correpondi ng regula rized weigh t or the tranfr med p rob lem
Problem 4.21
The Tikhono moothne penalty which penalize 2 deriatie of i dx Show that r linear model thi red uce to a penalty of the rm ww What i ?
h (h)
Problem 4.22
You hae a data et with 100 data point You hae 100 model eac h wi th VC dimenion 10 You et aide 5 point f r alidati on . You elect the model which produced minimum alidation error of 0.5 Gie a bound on the ou t of ampl e err or r thi e lected f unction Suppoe you intead trained each model on all the data and elected the func tion wi th mi nim um in ampl e err or The reulting in amp le err or i 0.15 Gie a bound on the out of am ple err or i n thi c ae [Hint Use the boun in
Poblem to boun th e VC imension of the union of a the moels.]
Problem 4.2
Thi probl em ine tigate the coariance of the l eae one out cro al idatio n error C[n e Aume that r wel l behaed model the learning proce i table' and o the change in the learned hypothei hou ld be mall 'O ' if a new data point i added to a data et of ize N. Write g g
162
4 VERFITTIN
4.4 PROBLEMS
() Show tht Var[Bev] =
:=l Var [n] + :#m C[n, em ]
(b) Show C [n, m] = Var [ Bout( g -2]+ h igher order i 8n , Om Argue tht (c) Assu e tht y ters i volvig On, Om re O(
)
Does Var [e ] dec y to zero with Wht bout Var[Bout ( g]? (d) Use th e experietl desig i P roble . to study Var[Bev] d give log lo g plot of Var[Bev]/Var[] versus N Wht is th e dec y rte
Problem 424
F d = 3 geerte r do d t set with N poits s llows For ech poit ech di esio of hs s tdrd Norl distr ib utio Si i l rly g eer te (d+ di esio l trg et weight vector Wf d set Yn = w' Xn + n where n is oi se ( lso fro s tdrd Norl d istr ibut io) d is the oise vrice; set to 05 Use li er regressio with weight decy r egul riztio to es ti te Wf with W Set the regulriztio preter to 005/N () For N {d+15 , d+ 5, , d+ 115} compute the cros s vl idtio errors e e d Bev Repet the experiet (sy) 10 ties itiig the verg e d vri ce ov er th e experimets of e e d Bev (b ) How sh ou ld you r verge o f th e e 's rel te to the verge of the Bev's; how bout to the v erge of the e 's Sup port you r cli usig results fro your exper iet (c) Wht re the cotribu tors to the vri ce of the e 's (d) If the cro ss vlidtio errors were truly i depedet how should t he vri ce of the e 's rel te to the vri ce of the Bev's? (e) O e esure of the ef ectiv e u b er of fres h ex m ples used i cop ut ig Bev is the rtio of the vri ce of the e 's to th t of the Bev's xpli why d plot versus N the efective u b er of fr esh ex p les N s percetge of N You shoul d d tht N is close to N (f) If you icre se the o ut of r egul riztio will N go up or dow xpl i reyou th (ee)se e xperiet w ith Are= 5/N d cop your rreso resu ltsig fro Ruprt t o veri fy your cojectu
Problem 425
Whe u sig vli dtio s et r ode l sele ctio l l odels were lered o the m Dti of size N K d vlidted o the m Dvl of size K We hve the VC boud (see qu tio ( 4 1 ) :
Eout ( g •
Evl (g + •
(J)
ontinud on nxt pag)
163
44. PROLMS
4 VRFI I G
M Suppose that instead you had no control oer the alidation process So learners each with their own models present you with the results of their al idation processes on diferent alidati on sets Here is what y ou know about each learner Each learner reports to you the size of their alidation set K and the alidation error Ev! ( ) he learners may hae used dif frent lear ned t rai niused ng set an d adata idatesets d onexcep a heldt that out they alida it tionhful se tlywhich waons a only r alidation purposes. As the m odel selecto r you h ae to decide whi ch learn er to go with
a Sh ou ld yo u sele ct th e learner with m in im um al idation erro r? If yes why? If no why not? {Hint: thin VC-bound.j b If al models are alid ated on the same al idation se t as des cribed in t he text wh y is it o kay to selec t the lear ner with t he l owest al id ation error? c After selec ti ng learn er * say show that Eva1 (m* ) + ] : Me- 2€ " (E ) , e 2 € K is an "aerage alidation _ 2 ln
J [Eout (m* ) where K()
=
>
(
)
d setShowsize.that with
prob ab il ity at l east 8 whi ch satis es * In 2 *
- Eout Ev! * r any
*
e Show that min K : () : l K . Is this boun d bet ter or worse than the bound when all models use the same alidation set size equa l to the aerage al idation set siz e l K ? Proble 4. 26 I n th is probl em derie the f rmu a fr th e exact expr essi on r the lea eone out cross al idati on error r li near reg ression . Let be the data mat rix whose ro ws correspond to th e transrmed dat a poi nts
Z n = (xn)
a Show that :
N
N
ZTZ = nl nz Z Ty = nl nYn ZTZ rTr and H() = Z() V Hence where = () show that when (z n Yn ) is left out ZTZ ZTZ - n, and ZTy ZTy nYn
b Compte
w� the weight ector learned when the th data point is left out and show that: w
lnTn nl ) (ZTy
164
nYn )
4 VERFI IN G
44 PROBLEMS
use the identity ( A xxT ) - 1 = A 1 +
A - 1 xxT A
() Usin g (a) and ( b) sho that w = w + regression eight vetor using a the data (d) he predition on the vaidation point is given by
1
T A
.]
A - 1 zn , here w is the z�w. Sho that
ZnT n = Yn Hnnn Hnn • {e) Sho that e
( ,
and hene prove Equation ()
Problm 4.2 7
Cross va i dati on gives an aurate es ti mate of (N- but i t an b e quite s ens itiv e eading to pr obems i n mode see tion A ommon heu rist i r regu arizing ross va idation is to use a measure of error Ocv( ) r the ross va idation esti mate i n mode see tion (a) ne hoie r divided by
0v is the stan da rd deviation of the eave- one-out error s
cv �
' en ). Wh divide by ?
(b) or inear modes sho that Vcv = ;1 E;v . () (i) iven the best mod e * the onserv ative one-siga a pproah se ets the sim pes t ode ith in Ocv( * ) of the best (ii) he bound m in im izing appr oa h se ets the mode hih mi ni mize s
Ecv() + Ocv( )
Use the experie nta d esign in P robem 4 4 to ompa re these a pproahes ith the u n regu arized ross va idat ion estimate as f os ix Q = 5 Q = 0 and = Use ea h of th e to m ethods propo sed h ere as e as traditiona ros s va idation to see t the optim a va ue of the regu arization parameter > in the range {0 05 0 0 0 5 5} using eight deay reguarization O(w) = fww. Pot the resuting out-of-sampe error r the m ode see ted usin g eah meth od as a funtion of N ith N in the range { x Q3 x Q 0 x Q } What a re your onusions?
165
166
Chap
5
Three Learning Principles he study of learning om data highlights some general principles that are fascinating concets in their own right. Haing gone through the mathematical analysis and empirical illustrations of the rst w chapters, we hae a good undation om which to articulate some of these principl es and explain them in concrete terms. In this chapter, we will discuss three prin ciples . he rst one is related to the choice of model and i s called Oc cams razor. he other tw o are relate d to data sampling bias establishes an important principle about obtaining the data, and data snooping establishes an important principle about handling the data. A genuine understanding of these principles will protect you om the most common pitfalls in learning om data, and allow you to interpret generalization perfrmance properly.
5.1
Occa ' s Razr
Although it is not an exact quote of Einsteins, it is oen attributed to him that An explanation of the data should be made as simple as possible but no simpler." A similar principle, ccam 's Razor dates om the 14th centur y and is attributed to William of Occam, where the razor is meant to trim down the In explanation to of thelearning, bare minimum that isrconsistent with the data. the context the penalty model complexity which was introduced in Section 2. 2 is a manistation of Occ ams razor. If E 0 then the explanation hypothesis is consistent with the data. In this case, the most plausible explanation, with the lowest estimate of E gien in the VC bound 2.14 , happ ens when the complexity of the explanation measured by d(H)) is as small as possi ble. Here is a statement of the underlying principle.
The simplest model that ts the data is also the most plausible. 167
HREE EARNIN PRINCIES
CAM'S AZOR
Applyng ths pncpl e we should choose as smple a model as we thnk we can get away wth Although the pncple that smple s bette may be ntutve t s ne the pec se no selfevdent When we apply the pncpl e to leanng om data thee ae two basc questons to be asked 1 What does t mean a model to be smple? 2 How do we know that smple s bett e? Let s sta t wth the st ques ton hee a e two dstnc t appoaches to de n ng the noton of complexty one based on a mly of objects and the othe based on an ndvdual object We have aleady seen both appoaches n ou analyss he VC dmenson n Chapte 2 s a measue of complexty and t s based on the hypoth ess set a whole e based on a amly of object s he egulazaton t em of the augmented eo n Chapte 4 s al so a meue of complexty but n ths case t s the complexty of an ndvdual object namely the hypothess h he two appoaches to denng complexty ae not encounteed only n leanng om data they ae a ecung theme wheneve complexty s ds cussed Fo nstance n nmaton theoy entopy s a measue of complexty based on a mly of objects whle minimum description length s a elated measue based on ndvdual objects hee s a eason why ths s a ecung theme he two appoaches to denng complexty ae n ct elated When we say a fmly of objects s complex we mean that the mly s b g hat s t contans a lage vaety of object s heee each nd vdual objec t n the ml y s one of many By contast a smple mly of objects s small t h elatvely w objects and each ndvdual object s one of fe Why s the shee numbe of objects an ndcaton of the level of complext? he eason s that both the numbe of objects n a mly and the complexty of an object ae elated to how many paametes ae needed to spec the obj ect When you ncease the numbe of paametes n a leanng model you smultaneously ncease how dvese s and how complex the ndvdual h s Fo example consde 17th ode polynomals vesus 3d ode polynomals hee s moe vaety n 17th ode polynomals and at the same tme the ndvdual 1 7th ode polynomal s moe complex than a 3d ode polynomal he most co mmon dentons of object complexty ae based o n the numbe of bts needed to desc be an object Unde such dentons an object s smple f t has a shot desc pt on hee e a smple objec t s not only ntnscally smple (as t can be descbed succnctly) but t also has to be one of w snce th ee ae we objects that have shot descptons than thee ae that have long descptons as a matte of smple countng
xrcis 51 Conder hypoth e et H1 a n d 1 th at contan Boole an fn cton on Boolean varab le o X { 1 H1 contan all Boolean fncton
168
HREE EARI PRICILES
wic evaate
n exact ne int int an
00 cntains a Be an ntins wic eva ate
int ints an t
esewee
CCAM'S A OR
esewee
n exa
a w ig n m e teses ae 1 an 1 00? ( w man its ae neee t sei ne e eses i n ( c w man its ae neee sei ne te teses n
? 100
We now addess the second queston When Occ ams azo says that smple s bette t do esnt mean smple s moe elegant It means smple has a bette chance of beng ght Occ ams azo s about pemance not abou t aesthet cs If a complex explanaton of the data pems bet te we wll take t he agument that smple has a bette chance of beng ght goes as l lows We ae tyng to t a hypothess to ou data ' = {(x 1 , Y1 ) , (x N , YN ) } (assume Yn s ae bnay) hee ae we smple hypotheses than thee ae complex one s Wth complex hypoth eses thee would be enough of them to shatte x 1 , XN, so t s cetan that we can t the data set egadless of what the labels Y1 , YN ae even f these ae completely andom. hee e ttng the data does not stll mean much n stead we have a smple model wth fw hypotheses and we und oneIfthat pectly ts the dchotomy ' = {( xi , Y1), , (xN , YN ) } , ths s supsng and theee t means some thng. Occams Razo has been fmally poved unde deent sets of dealzed condtons he above agument captues the essence of these po o f some thng s less lkely to happen then when t does happen t s moe sgncant Let us look at an example Example 51. Suppose that one constucts a physcal theoy about the e sstvty of a metal unde vaous tempeat ues In ths theoy asde om some constants that need to be detemned the esstvty has a lnea de pendence on the tempeatue In ode to ve that the theoy s coect and to obtan the unknown constants 3 scentsts conduct the llowng thee expements and pesent the data to you
]
Q
/
:f
Q
Q
temperature cietist
T
temperature cietist
169
T
temperature cietist
T
51. CCAM'S AZR
5 . HREE EARNIN PRINCILES
It s clea that Scentst 3 has poduced t he most convncng evd ence f th e theoy If the measu ements ae exact the n Scentst 2 h as managed to ls the theoy and we ae back to the dawng boad What about Scentst 1 Whle he has not lsed the theoy has he povded an y evdence t he answe s no we can e vese the qu eston Suppose that the theoy was not coect what could the data have done to pove hm wong othng snce any two ponts can be joned by a lne heee the model s not j ust lkely to t the data n ths case t s cetan to do so hs endes the t tot ally nsgncant when t does happen
hs example llustates a concept elated to Occams Razo whch s the he axom assets that the data should have some chance of alsfyng a hypothess f we ae to conclude that t can povde evdenc e the hypothes s One way to guaantee that evey data set has some chance at flscaton s the VC dmenson of the hypothess set to be less than N, the numbe of data ponts hs s dscussed the n Poblem 51 Hee s anothe example of the same concept
aiom of non-falsiability
Example 52 Fnancal ms ty to pck good t ades ( pedctos of whethe the maket wll go up o not) Suppose th at each tade s tested on the pedcton (up o down) ove the next 5 days and those who pefm well wll be hed One mght thnk that ths pocess should poduce bette and bette tade s on Wall St eet Vewed as a leanng pobl em consde each tade to be a pedcton hypothess Suppose that the hng po ol s complex we ae ntevewng 2 tades who happen to be a dvese set of people such that the pedct ons ove the next 5 days ae all deent ecessaly one of thes e tades gets t all coect and wll be hed Hng the tade though ths pocess may o may not be a god thng snce the pocess wll pck someone even f the tades ae just ppng cons to make the pedct ons A pefct pedcto always exsts n ths goup so ndng one doesn t mean much If we wee ntevewng onl y two tades and one of them made peect ped ctons that would mean somethng
xerise 5 u os e a 5 eeks i n a ro, a leer aie s i n h e ai l ha e is e ouc oe f e ucoi ng Mna nig oall ga e u e enl ah eac Mona an o ou surise, he reicion is core ah ie On he ay a fer he h gae, a leer arives , sain g ha i ou ish o s ee nex ees rei cion, a aen o $5 0 is eu ired Sh ou l you ay? a a n ossib le redicions of inl ose are here f r 5 gaes? b If he sender ans o ake sure ha a leas one erson eeives orre predicion s on a l l 5 gaes fro hi , ho any people shou l he arg e o begi n ih?
170
5 HREE EARNING PRINCILES
5.. AMLING IAS
(c h rs r c ng h o ucom o h frs gam , ho ma n o h orgn a rcn s o s ar h scon r Ho ma n rs ao av n sn a h of 5 ks? ( f h cos of rn n g n m a n g ou ac s $0.5 ho much ou d h snr ak h n f 5 orc rcons sn n h 5.? (f a ou ra h s suaon o h goh fun con an h cr b of ng h aa
Leanng om data takes Occam's Razo to anothe level gong beyond as smple as poss le but no smple Indeed we may opt f a smple t than possble' namely an mpefct t of the data usng a smple model ove a pefct t usng a moe complex one he eason s that the pce we pay a pefct t n tems of the penalty f model complexty n (214) may be too much n compason to the benet of the bette t hs dea was llustated n Fgue 3 7, and s a manfestaton of ovettng he dea s also the atonale behnd the ecommended polcy n Chapte 3: rst ty a lnea model one of the smplest models n the aena of leanng om data
5.2
Sm pli ng Bis
A vvd example of samplng bas happened n the 1948 US pesde ntal electo n betwee n uman and Dewey On electon nght a mao newspape caed out a telephone poll to ask people how they voted he poll ndcated that Dewey won and the pape was so condent about the small eo ba n ts poll that t declaed Dewe y the wnne n ts headlne When the actual vote s wee counte d Dewey lost to the delght of a smlng uman
@ss Pss 171
HR ARNN PRNCLS
2 AMLN AS
hs was not a case of statstcal anomaly whee the newspape was just ncedbly unlucky emembe the n the VC bound t was a case whee the sample was doomed om the getgo egadless of ts sze Een f the expement wee epeate d the esult would be the same n 1 48 , telephones wee expense and those who had them tended to be n an elte goup that aoed Dewey much moe than the aeage ote dd Snce the newspape dd ts poll by telephone t nadetently used an nsample dstbuton that was deent om the outofsample dstbuton hat s what samplng bas s
the data s sampled n a based ay learnng ll pro duce a smlarly based outcome Applyng ths pncple we should make sue that the tanng and testng dstbutons ae the same f not ou esults may be nald o at the ey least eque caeul ntepetaton f you ecall the VC analyss made ey w assumptons but one as sumpton t dd make was that the data set s geneated om the same dstbuton that the nal hypothess g s tested on n pactce we may en counte data sets tha t wee not geneated unde those deal condto ns hee ae some technques n statstcs and n leanng to compensate the ms match between tanng and testng but not n cases whee was geneated wth the excluson cetan pats nputexample space suchhee as thes excluson of households wth noof telephones n of thetheaboe nothng that can be done when ths happens othe than to admt that the esult wll not be elable statstcal bounds lke Hoedng and VC eque a match between the tanng and testng dstbutons hee ae many examples of how samplng bas can be ntoduced n data collec ton n some cases t s nadetently ntoduc ed by an oesght as n the case of Dewey and uman n othe cases t s ntodu ced because cetan type s of data ae not aalable Fo nstance n ou ced t example of Chapte 1 , the bank ceated the tanng set om the databa se of peous cus tomes and how they pefmed f the bank Such a set necessaly excludes those who appled to the bank f cedt cads and wee ejected because the bank does not hae data on how they ould have perfrmed f they wee ac cepted Snce utue applcants wll come om a mxed populaton ncludng some who would hae been ejected n the past the test set comes om a deent dstbuton than the tanng set and we hae a case of samplng bas n ths patcula case f no data on the applcants that wee ejected s aalable nothng much can be done othe than to acknowledge that thee s a bas n the nal pedcto that leanng wll poduce snce a epesentate tanng set s just not aalable
xercse 53 n an epermen o e ermn e he d srbuon of szes of sh n a a ke, a ne mgh be used o cach a rep resenav e sam pe o f sh he sam pe s
172
HREE EARNIN PRINCILES
3 ATA NOOIN
hen nled o nd o he fcions o sh o ifeen sizes e sl e is ig nogh , si sic l conl sions y e dn o he cl is iio n i n h e eni e le n o se ll sling is ?
hee ae othe cases, aguably moe common, whee samlng bas s nto duced by human nteventon It s not that uncommon someone to thow away tanng examles they dont lke! A Wall Steet m who wants to de velo an automated tadng system mght choose data sets when the maket was behavng well to tan the system, wth the semlegtmate justcaton that they don t want the nose to comlcate the tanng ocess hey wll suely acheve that f they get d of the b ad' examles, but they wll ceate a system that can be tusted only n the eods when the maket does behave well! What haens when the maket s not behavng well s anybodys guess In geneal, thowng away tanng examles based on the values, eg, ex amles that look lke outles o dont confm to ou econceved deas, s a fly common samlng bas ta
ther biases. Samlng bas has also b een called selecto n bas n t he stats tcs communty We wll stck wth the moe desctve tem samlng bas f two easons Fst , the bas ases n how the data was sampled; second, t s less ambguous because n the leanng context, thee s anothe noton of selecton bas dftng aound selection of a nal hyothess om the leanng model based on the data he emance of the selected hyothess on the data s otmstcally based, and ths could be denoted as a selecton bas We have ee d to th s tye of bas smly as bad genealzaton hee ae vaous othe bases that have smla avo hee s even a secal tye of bas the eseach communty, called ublcaton bas! hs es to the bas n ublshe d scent c esults because negat ve esults ae often not ublshed n the lteatue , wheeas ostve esults ae he common theme of all of these bases s that they ende the standad statstcal conclusons nvald because the basc emse such conclusons, that the samlng dstbuton s the same as the oveall dstbuton, does not hold any moe In the eld of leanng om data, t s samlng bas n the tanng set that we need to woy about
5.3
Data Snoo ing
Data snoong s the most common ta acttones n leanng om data he ncle nvolved s smle enough,
a data set has aeted any step in the learnin proess its ability to assess the otome has been ompromised. 173
3 AA NOOING
HREE EARNING PRINCILES
Applyng ths pncple, f you want an unbased assessment of you leanng pe mance , you should keep a test set n a vault and neve use t f leanng n any way. hs s bascally what we have been talkng about all along n tanng v esus testng, but t goes beyond that. Even f a data set has not been physcally used tanng, t can stll aect the leanng pocess, sometmes n subtle ways.
Exercise 5.4 Cons ider the oing a ppro ach o earning B ooking at the dat a it app ears th at th e at a is ineary se arabe s o e go ahead an d use a simp e perce ptron and g et a trai ni ng error of ero af tr etermin ing the otim a set o f eigh ts We no ish to make some generaization concusions so e ook up the c r our ear ni ng mode a nd see tha t it is + 1 herere e use this vaue of vc to get a bo un on te est error a What is the pr ob em it h h is bou nd is it correct? b Do e kno the vc r the earni ng mode th at e actua y used? It is th is vc that e need to use in the bound
o avod theleanng ptfll nmodel the above s extemely mpotant thatcan yoube beforeexecse, choose you seeng tany of the dat a. he choce based on geneal nmaton about the leanng poblem, such as the num be of data ponts and po knowledge egadng the nput space and taget funct on, but not on the actual data set Falue to obseve ths ule wll nvaldate the VC bounds, and any genealzaton conclusons wll be up n the a. Even a cael peson can all nto the taps of data snoopng. Consde the llowng example.
Example 5 An nvestment bank wants to develop a system f ecastng cuency exchange ates. It has 8 yeas woth of hstoca l data on the US Dolla ( USD ) vesus the Btsh P ound ( GBP ) , so t tes to use the data to see f thee s any patten that can b e exploted. he bank takes the sees of dal y changes n the USD / GBP ate, nomalzes t to zeo mean and unt vaance, and stats to develop a system f ecastng the decton of the change Fo each day, t tes to pedct that decton based on the uctuatons n the pevous 20 days 5 of the data s used tanng, and the emanng 25 s se t asde testng th e nal hypothess. he test shows eat success. he nal hypothess has a ht ate (pe centage of tme gettng the decton ght ) of 52 . 1 . hs may seem modest, but n the wold of nance you can make a lot of money f you get that ht ate conssten tly. Indeed, ove the 500 test days (2 yeas woth, as each yea has about 250 tadng days ) , the cumulatve pot of the system s a espectable 22. 14
. HREE EARNIN PRINCILE
.3. T NOOIN
Day
4
When the system s used n le tadng, the pemance deteoates sg ncantly In fct, t loses oney. Why ddn't the good test pefmance contnue on the new data? In ths case, thee s a sple explanaon and t has to do wth data snoopng Although the bank was cael to set asde test ponts that wee not used f tanng n ode to popely evaluate the nal hypothess, the test data had n fct aected the tanng pocess n a subtle way When the ognal sees of daly changes was nomalzed to zeo ean and unt aance, all of the data was noled n ths step . heefe, the test data that was extacted had aleady contbuted to the choces ade by the leanng algothm by contbutng to the alues of the mean and the aance that wee used n noalzaton. Although ths sees lke a no eect, t is data snoopng. When you plot he cumulate pot on the test set wth o wthout that snoopng step, you see how snoopng esulted n an oe-optstc expectaton compaed to the ealstc expectaton that aods snoopng. It s not the nomalzaton that was a bad dea. It s the noleent of test data n that nomalzaton, whch contanated ths data and endeed ts estmate of the nal peance naccua te
D
One of the ost common occuences of data snoopng s the euse of the same data set . If you ty leanng usng st one model and then anothe and then anothe on the s ame data set , you wll eentually succe ed' . As the sayng goes, f you totue the data long enough, t wll confss If you ty all possble dchotomes, you wll eentually t any data set ths s tue whethe we ty the dchotomes dectly usng a sngle model o ndectly usng a sequence of odels The eecte VC denson the sees of tals wll not be that of the last model that succ eeded, but of the ente unon of odels that could hae been used dependng on the outcomes of deent als. Sometes t he euse of the same da ta set s caed out by de ent peop le Let s say that thee s a publ c data set that you would lke to wok on. Bee you download the data, you ead abou t how othe people dd wth ths data set
©
175
. HREE EARNN PRNCLES
3 ATA NOON
usng deent technqu es You natually pck the most pomsng te chnques as a baselne then ty to mpoe on them and ntoduce you own deas Although you haent een seen the data set yet you ae aleady gulty of data snoop ng You choce of baselne technques wa s aecte d by the data set though the actons of othes You may nd that you estmates of the pemance wll tun out to be too optmstc snce the technques you ae usng hae aleady poen wellsuted to his paiula data set o quantfy the damage done by data snoopng one has to assess the penalty f model com plexty n ( 2 14) takng the snoopng nto consdeat on In the publc data set case the eecte VC dmenson coesponds to a much bgge hypothess set than the that you leanng algothm uses It coes all hypotheses that wee consdeed and mostly ejected) by eeybody else n the pocess of comng up wth the solutons that they publshed and that you used as you baselne hs s a potentally huge set wth ey hgh VC dmenson hence the genealzaton guaantees n (214) wll be much wose than wthout data snoopn g ot all data sets subjected to data snoopng ae equally contamnated he bounds n (16) n the case of a choce between a nte numbe of hy potheses and n (212) n the case of an nnte numbe pode gudelnes the leel of contamnaton he moe elaboate the choce made based on a data set the moe contamnated the set becomes and the less elable t wll
be n gaugng the pemance of the nal hypothess
xerise 5.5 ssum e we se t asie xam ls om that w il l no t e use in t ai ni ng ut wil l e use o select one o t e n al hothe ses roue h ee ifeent lea rni ng a lgoritms t at ra in n he re st on e ata . ach a lgoi m or s with a i feren o size e woul l ie o c aateize e a ccurac of estimat ing ut9 on he electe nal hotesis if e use he same exam les t o m ake at estimae.
2
a hat is he alu e f at shou l e use in i n his situaion b ow oes the leve l o con am inat ion of ese examles omare to the case wher e he y woul e us e i n train in g rather ha n i n e nal select ion?
In ode to deal wth data snoopng thee ae bascally two appoaches 1 Aod data snoopng: A stct dscplne n handlng the data s equed Data that s gong to be used to ealuate the nal pemance should be locked n a sa and only bought out afte the nal hypothess has been decd ed If ntemedate tests ae neede d sepaate data sets should be used that Once a data set has been used t should be teated as contamnated as f as testng the pemance s concened 2 Account data snoop ng: If you hae to use a data set moe than once keep tack of the leel of contamnaton and teat the elablty of 176
. HREE EARIN PRICIES
.3. ATA NOOI
you peance estates n l ght of ths contanaton he bounds (16) and (212) can povde gudelnes the elatve elablty of df ent data sets that have been used n deent oles wthn the leanng pocess
Data snooping versus samplin bias
Saplng bas was dened based
on howonthehow data obtaned bee any leanng data snoopng wasleanng dened based thewas data aected the leanng n patcula how the odel s selected hese ae obvously deent concepts Howeve thee ae cases whee saplng bas occus as a consequence of sno opng' lookng at data that you ae not supposed to look at Hee s an exaple Consde pedctng the peance of deent stocks based on hstocal data In ode to see f a pedcton ule s any good you take all cuently taded copanes and test the ule on the stock data ove the past 50 yeas Let us say that ou ae testng the buy and hold stategy whee you would have bought the stock 50 yeas ago and kept t untl now If you test ths hypothess' you wll g et excellent peance n te s of pot Well don't get to o exc ted ! You nadvetently based the esults n you fvo by pckng only uenl aded ompanies, whch eans that the copanes that dd not ake t ae not pat of you evaluaton When you put you pedcton ule snce to wok wll bedentfy used onwhch all copanes theybewll o not you tcannot copaneswhethe today wll the suvve cuently tad ed' copanes 5 0 yeas o now hs s a typcal case of saplng bas snce the poble s that the tanng data s not epesentatve of the test data Howeve f we tace the ogn o f the bas we dd snoo p' n ths case by lookng at ftue da ta of copanes to detene whch of these copanes to use n ou ta nng Snc e we ae usng nfat on n tann g that we would not have access to n eal tadng ths s vewed as a of data snoopng
177
5 PROBLES
HREE EARNIN PRINCILES
5.4
Prblems
Problem 5.1
The dea of lsiability tha t a clam can be endee d false by obseved data s an m potant pnc p le n expe me ntal scence If th e outco me of an experment has no chance of falsfyng a partcular proposton then the result A Flly of that experment does not provde evdence one way or another toward the truth of the pposton.
E
Consde the poposton hee s that appoxmates f as old be evdenced b y nd ng sch an th n sam ple e o zeo on ." We say that the poposton s falsed f no hypothess n can t the data pefctly.
a ) S p pose that shatte s Sho that ths poposton s not flsable r any f b ) Sppose that f s andom f ( ) = ±1 th pobablty � ndependently on evey so t ( h ) = � o evey . Sho that
E
[falscaton ]
1
-
.
c) Sppose dvc 10 and N 100 If yo obtan a hypothess
h th zeo
b ?
on yo data, hat can yo conclde' fom the eslt n pat
Problem 5.2 Stc tal Rsk M n mzaton S RM ) s a sefl fameok mod el selec ton that s elated to Occam 's Razo. Dene a structure a nested seq en ce of hypoth ess sets:
The SRM fameok pcks a hypothess fom each by mmzg · That s, agm n ( ) . The n, th e fame ok selects the n al hy
h
pothess by mnmzng and the model complexty penalty That s, agmn ( ( ) + ( )) . N ote that ( ) shold be n on deceas ng n g* becase o f t he nested stcte.
a ) Sh o that the n sample e o ( ) s n on nc eas ng n 178
5 . HREE EARING PRINCIES
54 PRBEMS
E
( b ) Assum h a h framork nds g* i ih proba bi Iiy Pi . Ho dos Pi rla o h comp li y of h arg fun cion ( c ) Argu ha h Pi1S ar unknon bu po : p 1 : p2 1 ( d ) uppos g* = 9i· ho ha
[I En(9i) - Eout (gi ) > I g*
gi] : Pi 4mi (N) e
E /s
Hr, he condiioni ng is o n slc ing gi as h nal hypohsis by RM [Hint: Use the Byes theorem to decompose the probbility nd then pply the VC bound on one of the terms} You m ay in rpr h is rsul as ll os if you us RM a nd n d up ih gi, hn h genralizaion bou nd is a fa cor ors han h bound you ould hav gon had you simply sard ih i
Pob le 5.
In ou r crdi card am pl, h ban k sars ih so me vagu ida of ha consius a good credi risk o, as cusomrs x 1 , x2 , , X
arriv, h baThnkena ,pp lis hos is vagu ap prov cardsiord r som cusomers o nly hoidgoao crdi cards crdi ar mon o s ofifhs hy dfau l or no For simpliciy, suppos ha h rs N cusomrs r givn credi cards No ha h bank knos h bhavior of hs cusomrs, i coms o you o im prov hir a lgor ihm r a ppro ving cr di Th ban k givs you h daa
(x 1 , y1 ) , ' (x , Y )
Ber yo u look a he daa , you d o mahmaical d riva ion s an d com up ih a crdi approval funcion You no s i on h daa and, o your dligh, obain prec prdicion
( a ) Wha is M h siz of your hypohsis s ( b ) Wih such an M ha dos h Hof di ng bou nd say a bou h probabi l iy ha h ru prrmanc is ors h an % rror fr N 10000 giv your o h bank assur hm ha h prfrmanc ill ( c ) You b br han g % rror andand your condnc is givn by your ansr o par ( b ) Th bank is hrilld and uss your g o approv crdi r n clins To hir dismay, mor han half hir crdi cards are bing dau ld on Epl ai n h pos sibl ras on ( s) bhind his oucom
( d ) Is hr a ay in hi ch h ban k could u s your crdi a ppro val funcio n o hav y ou r probab il isic guar an Ho [Hint he nswer is yes!}
179
5..
5. HR ARNIN PRINCILS Pro blem 5 4
PROBLMS
500
The S& 5 is a set of the l argest compa ies cu rretly tradi g Su ppose there are stocks cu rretly tradig, ad th ere have bee stock s which h ave ever tr aded ov er the last years some of these have goe ba krupt ad sto ppe d tradig We wish to eval u ate the pro tab il ity of various buy a d ho ld ' stra tegies usig the se years of d ata roughly tradi g day s
50 000
10 000
50
50
1 500
Sice it is ot easy to get stock data, e will coe our aalysis to todays S& 5 stocks, r which the data is readily available
a A stock is pro ta bl e if it we t u p o more th a 50% of the d ays Of your S & stock s, the most pr otabl e wet up o 5% of t h ed ays En = 0. i Sice we picked the best amog 500 usig the Hoef di g boud, [IEin - Eout l
>
002] : 2
x
500 x
5 e- 2 x 1 2 oo x o . o 2
�
0.045.
There i s a greater tha 95% cha ce th is stock is pro table Wher e did e go wrog? ii Gi ve a better estim ate fr the probab il ity that th is stock is pro tabl e
[Hint: What should the correct M be in the Hoefding bound?
b We ish to evaluate the protability of buy ad hold' r geeral stock
tradi g We ot ice t hat a ll of ou r 5 S& stocks wet up o a t le ast
of the days i We cocl ude that bu yig a d hol di g a sto cks is a go od strat egy r geeral stock tradig Where did we go rog? ii Ca we say anything a bout the per rma ce o f buy a d hold tradig?
51%
You thik that the stock market exhibits reversal, so if Problem 5.5 the price of a stock sharp ly d rops you expe ct it to rise shortly the reafte r If it sha rpl y rises, you expect i t to drop shortly thereaf ter To test thi s hypothesis, you bu il d a t radi g strategy tha t bu ys whe t he sto cks go dow a d sel ls i the op posite case You coll ect hi storica l data o the cu rret S& 5 stocks, ad your hypothesis gave a good aual retur of
1%
a Whe you trade usig this system, do you expect it to perfrm at this level? Why or why ot? b How ca you test you r stra tegy so th at its perrma ce i sam pl e is more relective of what you should expect i reality?
Prob lem 5. 6
Oe of te he ars Extrapolatio is harder tha iterpolatio " Give a possible explaatio fr this pheomeo usig the priciples i this chapter [Hint training distribution ersus testing distribution}
180
Epilogue hs bo ok set th e stage a deepe exploat on nto Lean ng om ata by developng the undatons It is possble to lean om data, and you have all the basc too ls to do so he lnea model coupled wth the ght atues and an appopate nonlnea tansfm, togethe wth the ght amount of egulazaton, petty much puts you nto the thck of the game, and you wll be n good stead as long as you keep n mnd the thee basc pncples smple s bette Occams azo , avod data snoopng and bewae of samplng bas Whee to go om he e? hee ae two man decto ns One s to lea n moe sophstcated leanng technques, and the othe s to exploe deent
leanng paadgms Let us map pevew these twoom dectons bette undestandng of the of leanng data to gve the ead e a he lnea model can be used as a buldng block othe popula tech nques A cascade of lnea models, mostly wth sof thesholds, ceates a neual netwok A obust algothm lnea models , based on quadatc pogammng, ceates suppot vecto m achnes An ecent appoach to non lnea tansmaton n suppot vecto machnes ceates kenel metho ds A combnaton of deent models n a pncpled way ceates boostng and en semble leanng hee ae othe successl models an d technques, and moe to come sue In tems o f othe paadgms, we have bey mentone d unsupevsed lea n ng and encement leanng hee s a wealth of technques these lean ng paadgms, ncludng methods that mx labeled and unlabeled data Actve leanng and onlne leanng, whch we also mentoned bey, have the own technques theoes pobablstc In addton, th ee s a school thought appoach, that teats leanng as and a completely paadgm usng aofBayesan and thee ae usel pobablstc technques such as Gaussan pocesse s Last but not least, thee s a school that teats leanng as a banch of the theoy of computatonal complexty, wth emphass on asymptotc esults Of couse, the ultmate test of any engneeng dscplne s ts mpact n eal l hee s no shotage of success ful applcatons of leanng om data Some of the applcaton domans have specalzed technques that ae woth explong , eg , computatonal nance and ecommende systems Leanng om data s a vey dynamc eld Some of the hot technques and theoes at tmes become just ads, and othes gan tacton and become 181
IOGUE
pat of the eld What we hae emphaszed n ths book ae the necessay ndamentals that ge any student of leanng om data a sold undaton, and enable hm o he to entue out and exploe uthe technques and theoes, o pehaps to contbute the own.
82
Further Reading
Learning om Data book rum at AMLBookcom Y S Abu-Mostafa The Vapnik-Chervonenkis dimension: Inrmation versus complexity in learning eual Compuaion, 1 3 :31 2 317 19 89
X
Y S Abu-Mostafa Song A icholson and M MagdonIsmail The bin model Technical Report CaltechCS TR: 2004 0 02 Calirnia Institute of Technology 2004
kham s Razo A isoial and Philosophial An alsis of k ham 's Piniple of Pasimon. University of Illinois Press 197
R Ariew
R Bell J . Bennet t Y oren and C Volinsky The million dollar program ming prize I Speum, 4 5 :2 9 33 20 09
A Blumer A Ehrenfucht D Haussler and M Warmuth Occam's razor fomaion oessing Lees, 24 :37 7 380 198 7
A Blumer A hrenucht D Haussler and M Warmuth Learnability and the Vapnik-Chervonenkis dimension Jounal of he Assoiaion fo Compuing Mahine, 3 4 :9 29 9 5 19 89
S Boyd and L Vandenberghe Press 2004
Cone pimizaion.
Cambridge University
P Burman A comparative study of ordinary cross-validation -fld cross validation an the repeated learning-testing methods Biomeika, 7 3 : 503 514 1 98 9
T M Cover Geometrical and statis tical properties of systems of linear in equalities with applications in pattern recognition I Tansaions on leoni Compues, 14 3 :32 33 4 19 5
M H DeGroot and M J. Schervish Wesley urt edition 2011 183
Pobabili and Saisis.
Addison
URTHER EADIG
V. Fabian. Stochastic approximation methods.
Jounal, 10(1):123
159, 1960
W. Feer. An Induion third edition 1968.
Czehosloak Mahemaial
o P babi li Theo and Is Appliaions
A. ank and A. Asuncion.
UCI machine earning repository
Wiey
2010
UL
h//chcc/
0 /1 oss and the curse-ofdimensionaity. Daa Mining and Knoledge Disoe , 1 (1 )5 5 77, 19 97
J. H . iedman. On bias v ariance
S. I . Gaant. Perceptron -based earning agorithms.
eual eoks, 1(2):179 191, 1990
I ansaions on
Z Ghahramani. Unsupervised earning. In Adaned Leues in Mahine
Leaning MLSS
, pages 72 112, 2004.
G. H. Goub and C. F. van Loan. versity Press 1996
Mai ompuaions
Johns Hopkins Uni
D. Ameian C . Hoagin an d R. E 32:17 Wesch. Saisiian, 1978.hat matrix in regression and AOVA. 22,The W. Hoeding. Probabiity inequaitie s fr sums of bounde d random variabes.
Jounal of he Ame ian Saisial Assoiai on, 58( 30 1):13 30 , 19 63.
R. C. Hote. Very simpe cassication rues perfrm we on most commony used datasets. Mahine Leaning, 11( 1) :63 91 , 1993 R. A . Horn and C. . Johnson.
1990
Mai Analsis
Cambridge University Press
L. P. aebing M . L . Littman and A. W. Moore . Reinfrcement earning: A survey. Jounal of Aiial Inelligene Reseah, 4: 23 7 285, 1996 A. I. huri. Interscience
Adaned alulus ih appliaions in saisis 2003.
Wiey
R. ohavi. A study of cross-vaidation and bootstrap r accuracy estimation and mode seecti on. In Poeedings of he h Inenaional Join Con feene on Aiial inelligene ICAI voume 2, pages 11 37 11 43 ,
1995
J. Langford. Ttoria on practi ca prediction theory f r cassication.
of Mahine Leaning Reseah, 6:273 306, 2005 184
Jounal
URTHER EAI
i and H- T in Optimizing 0 /1 loss r perceptr ons by random coordinat e International oint Conference on eual descent In Proceedings of the etorks C pages 749 754, 2007
7
7
HT in and i Support vector m achinery r innite ens emble learning ournal of achine Learning Research, 9(2) 285 31 2, 200 8 M Magdon-Ismail and Mertsalov A permutation approach to validation Statistical Analysis and Data ining, 3(6)361380, 2010 M Magdon-Ismail, A icholson, and Y S Abu-Mosta earning in the presence of noise In S Hayki n and B osko, editors , Intelligent Signal Processing IEEE Press, 2001 M Markatou, H Tian, S Biswas, and G Hripcsak Analysis of variance of crossvalidation estimators of the generalization error ournal o f Machine Learning Re search, 61127 116 8, 2005 M Minsky and S Papert Perceptrons An Introduction Geometry. MIT Press, expanded edition, 1988
to Computational
T Poggio and S Smale The mathematics of learning: ealing ith data otices of the American athematical Society, 50(5)537 544, 2003 Popper
The logic of scientic discovery
Routledge, 2002
F Rosenblatt The perceptron: A probabilistic model r inrmation storage and organization in the brain Psychological Revie, 65(6)38 6 408, 195 8
Principles of eurodynamics Perceptrons and the Theory of Brain echanisms. Spartan, 1962
F Rosenblatt
B Settles Active learning literature survey Technical Report 1648 , University of WisconsinMadison, 201 0 J ShaweTaylor, P B artlett, R C William son, and M Anthony A ame work or structural risk minimisation In Leaing Theory th Annual Conference on Learning Theory CLT ', pages 68 76, 19 96 G Valiant A theory of th e learnable (11) 113 4 114 2, 19 84
Communications of the ACM,
27
V Vapnik and A Y Chervonenkis On the unirm c onvergence of relative equencies of events to their probabilities Theory of Pbability and Its Applications, 162 64 280, 19 71 185
V Vapnik E. Levi n and Y L un Measuring the V-dimension of a learning machine eul Compuaion, 6() :8 1 876, 1994. G Yuan H Ho and -J Lin Recent advances of large-scale linear classication Pceeings of I, 2012. T hang Solving large scale linear prediction problems usin g sto chastic gra dient descent algorithms In achine Leaning Pceeings of he h Inenaional Confeence ICL , pages 919 926 2004.
186
Appix Proof of te VC Bod In this Appendix we pre sent the rm l proof of Theorem 2 5 It is firly elborte proof nd you my skip it ltogether nd just tke the theorem r grnted but you won't know wht you re missing !
1971)
Theorem A. 1 (Vpnik Chervonenkis J
[ IEin (h) sup
Eout (h) I
l
> E
:
4mH (2N) e- i E .
This inequlity is clled the VC Inequlity nd it implies the VC bound of Theorem 25 The inequlity is vlid r ny trget function (deterministic or probbil istic) nd ny input distribution The probbility is over dt sets of size N Ech dt set is generted iid (independent nd identiclly distributed) with ech dt point generted independently ccording to the joint distribution P(x, The event sup E() E() > is equiv lent to the union over ll of the events E ( ) E( ) > ; this union contins the event tht involves in Theorem 2 5 The use of the supremum ( technicl version of the mximum) i s necessry since cn hve continuum of hypotheses The min chllenge to proving this theorem is tht E( ) is dicult to
mnipulte depends on the entire E()setbecuse E() spce rthercompred thn just to nite of points The min insight needed to input over come this diculty is the observtion tht we cn get rid of E( ) ltogether becuse the devitions between E nd E cn be essentilly cptured by devitions between two in-smple errors: E (the srcinl in-smple error) nd the in-smple error on scond independent dt set (Lemm A 2) We hve seen this ide mny times bere when we use test or vlidtion set to estimte E. This insight results in two min simplictions:
1.
The supremum of the devitions over innitely mny cn be reduced to considering only the dichotomies implementble by on the
187
A PPENDIX two nepenent ata sets Tat s were te growt fun ton enters te pture (Lemma A3)
H N )
Te devaton between two ndndnt nsample errors s easy' to an alyze ompare to te evaton between E and E (Lemma A 4) Te ombnaton of Lemmas A A3 an A4 pro ves Teorem A
A.1
Relating Generalization Error to In-Sale Deviations
Let' s ntrod ue a seon ata set ' w s nepenent of ' bu t sampled aordng to te same strbuton Px Ts seond data set s alle a ghost ata set beause t oesn't really exst t s a just a tool used n te analyss We ope to boun te term [E E s large by anoter term [E E s large w s easer to anal yze Te ntuton ben te rmal proo f s as llows For any sngle ypot ess beause ' s es sample nepenently om Px, te Hoedng nequalty guarantees tat E ) � E) wt a g probablty Tat s wen E) E) s large wt a g probablty E) E )
s also large s large an be approxmately E) bounde e by [Terere E) E{[E))s larg We are tryng to bound te probabl ty tat E s ar om E Let E{ ) be te nsample error r ypotess on ' . Suppose tat E s ar om E wt some probablty (and smlarly E{ s far om E wt tat same prob ablty sne E an E are entally dstrbute) Wen s large te proba blty s rougly Gaussan around E as llustrate n te gure to te rgt Te red regon represents te ases wen E s ar om E n tose ases E{ s r om E about alf te tme as llustrated by te green regon Tat s [ E E s large] an be approxmately boune by [E E{ s large Ts argument proves some ntuton tat te evatons between E and E an be apture by te evatons between E and E Te argu ment an be arelly extene to multple ypoteses
N
Lemma
were te probablty on te RHS s over
188
' an ' jontly
ENDIX
[
Proof. We ca aume that up J E ) E ) J there othg to prove
[
sup J Ein (h) hE
[[ [
hE up J E )
hE
E{n (h) J
>
�
1
E )J � and hE up JE ) E)J (.1)
I
up J E )
E ) J
up J E )
E )J � up JE) E)J
hE hE
hE
[
hE
I
[ and fr ay two evet
Ieualty (1) llow becaue [] , Now, let conder the lat term: up J E )
otherwe
E )J � up JE ) hE
E ) J
The evet on whch we are condtong a et of data et wth on-zero probablty Fx a data et V n th event Let * b e any hypothe r whch JE*) E*)J One uch hypothe mut ext gven tha t V the event o whch we are condtong The hypothe * doe not deped on V but t doe depend on V
[ I[ I[
I
up JE) E )J � up JE )
hE
I S I
E * )
E * ) J �
E * )
E * ) J
1
N
hE
�
E ) J
E ) E)
E) E)J
1 Ieualty (2 fllow bec aue the evet " JE * ) E{ *)J
)
(3) (4)
E{ )J Ieualty 3 fllow becaue th e event " JE{ *) E*)J ad " JE * ) E*)J " whch gven) mply " JE ) E{ )J " Ieualty 4 fllow becaue * xed wt h repect t o V and o we mple " hE up JE )
H
can apply the Hoedg Ineualty to [JE{ *) E*)J Notce that the Hoedg Ieualty apple to [JE{ *) E *)J � ] r any * , a long a * xed wth repect to V Thereore, t alo apple
189
ENDIX
to any wightd avrag o [En (h) Eout(h) ] ad on h Finally, inc h dpnd on a particular V, w tak th wightd avrag ovr allV in th vnt up Ein(h) Eout(h) E" hE H on which w ar conditi oning, whr th wight com om th proaility o th particular V. Sinc th ound hold or vry V in thi vnt, it hold r th wight d avrag
e-E 2 N ,
ot that w can aum cau othrwi th ound in < ' o th lmma Thorm .1 i trivially tru In thi ca, 1 impli
J
A.2
[
sup
hE H
IEin (h) - Eout (h) I
l
> E
:
2J
e-E 2 N
[
sup
hE H
IEin (h) - E{n (h) I
>
i]
·
Boudig Worst Case Deviatio Usig the Growth Fuctio
ow that w hav rlatd th gnralization rror to th dviation twn in-ampl rror, w can actually work with rtrictd to two data t o iz ach, rathr than th innit Spcicall y, w want to ound
[
sup
hE H
I Ein(h ) - E{n (h) I
>
i] ,
whr th proaility i ovr th joint ditriution o th data t V and V. On quivalnt way o ampling two data t V and V i to rt ampl a data t S o iz thn randomly partition S into V and V. Thi amount to randomly ampling, ithout rplacmnt xampl om S r V, laving th rmaining r V. Givn th joint data t S, lt
th proaility dviation twn th two th proaility i taknoovr th random partition o in-ampl S into Vrror, and Vwhr By th law o total proaility (with dnoting um or intgral a th ca may ) ,
[
sup
hE H
IEin(h) - E{n (h) I
L [S] x
S
�
< p
[�
[
up Ein(h) hE H
En(h) 190
>
i] En (h)
Il
E (h) � S
i Is]
ENDX
S
S
S
Let be the dichotomies that can implement on the points in By denition of the growth unction, cannot have more than H (2N di chotomies Suppose it has M : mH (2N) dichotomies, realized by h1 , hM Thus, sup
hE H
IEin (h) - E[n (h) I
Then,
J
r� [
M
=
/Ein (h) - E[n (h) I .
sup
hE{h,,h}
En ( ) I
Ein(h) sup
hE{h 1 , ,h}
Il
> s
I]
IEin(h) - E[n (h) I > � s
< : J IEin (hm ) - E[n (hm ) / > � I SJ m < M sup J IEin (h) - E[n (h) I > � j SJ , X
hEH
(A5) (A6)
where we use th e union bound in (A 5) , and overestimate each term by the supremum over all possible hypotheses to get ( A6 ) After using M : H (2N) and taking the sup operation over
Lemma A.
J
[
sup
I Ein (h) - E[1 (h) I > �
hE H m H (2N)
<
S we have proved
X
sup sup
S
hEH
1
J IEin(h ) - E[n (h)/ > � I SJ ,
where the pro bability on the LHS is over D and D ointly, and the prob ability on the RHS is over random partitions of into two sets D and D
S
The main achievement of Lemma A3 is that we have pulled the supre mum over h outside the probability, at the expense of the extra actor of H ( 2N
A.3
Bunding he Deviin beween In-Smple Errrs
We now address the purely combinatorial problem of bounding sup sup
S
hE H
J IEin(h)
E{1 (h) I > � j SJ ,
which appears in Lemma A3 We will prove the llowing lemma Then, Theorem A can be proved by combining Lemmas A2 A3 and A4 taking 2e - � (the only case we need to consider)
9
ENDX
Lea A .4 Fr
any h and any
S
whr th prbability is vr randm partitins f
S int tw sts
' an ' .
Poo T prv th rsult w will us a rsult which is als du t Hding r sampling ithout replacement: Lea A. 5 Hding 1963) Lt = {a, , a } b a st f valus with a [O, 1 ] , and lt µ = a b thir man Lt ' = {, , Z } b a sampl f siz N, sampld m unirmly ithout rplacmnt Thn
S
W apply Lmma A5 llws Fr th 2N xampls in lt a 1 if h(x) Y and a = 0 thrwis Th {a} ar th rrrs mad by h n w randmly partitin int ' and ' i sampl N xampls m withut rplacmnt t gt , laving th rmaining N xampls r '. This rsults in a sampl f siz N f th {a} r ' sampld unirmly withut rplacmnt t that
S
E() =
EV
a,
and E{ () =
Sin c w ar sampling withut rplacmnt s
Substituting
EV
a
S = ' ' and ' ' 0 and
1 E() + E{ () 2N 2 E µ > t { E E{ 2t By Lmma A5 µ
It llws that
S S
t=
givs th rsult
192
· {···} |·| · 2 ⌊·⌋ [a, b] · ∇
a
w
b
∇E
E (w)
(·)−1 (·)† (·)
N k
k
N
N! (N −k)!k!
A\B
A
B
0
{1} × Rd
d
ǫ δ
ǫ
η λ λC C Ω θ Φ Φ
θ(s) = e s /(1 + es ) z = Φ(x) Q
φ µ ν σ2 A
Φ zi = φi (x)
a (·)
a
B b w0 B(N, k)
N k
C d d˜ d d (H) D
D
X = Rd Z H D = (x1 , y1 ), · · · , (xN , yN ) (xn , yn ) D
X = { 1} × Rd
D
D E (h, f ) ex (h(x), f (x))
D x E (h, f )
n
h e = 2.71828 · · · (h(x) − f (x))2 n n
E[·] Ex [·]
x
E[y |x]
y
E E E (h) E E E (h) ED E¯ E E f g
x
h h
D
f: X → Y g∈H g: X → Y
g (D ) g¯
D
f
D
g g = ∇E
g
h ∈ H h: X → Y
h ˜ h H HΦ
Z Φ
H(C )
C
H(x1 ,..., xN )
±1
H
x1 , · · · , xN
H I 1 K Lq ln log2 M mH (N )
0
q e 2
H
N
max(·, ·) N o(·)
D
O(·) P (x) P (y | x ) P (x, y) P[·] Q Qf
x
y y
x
f
x
f
R Rd
s
d s = w x =
wx
i i
i
x
(·) supa (.) T t tanh(·) (·) V v
i
d
1
d
x0 = 1
−1
+1
≥
a
tanh(s) = (es −e−s)/(es +e−s ) V
V ×K =N
ˆ v
v
w ˜ w ˆ w w∗ w w w
Z
w0
b
w x ∈ Rd
x ∈ X
x
x ∈
{1} × Rd x x0
x
X X XOR
x0 = 1 x∈X xn
y∈Y
y y
yn y ˆ
Y Z Z
y
y∈Y z = Φ(x) zn = Φ(xn )
H
f B (N, k )
V λ N, d
L1
L2
λ
Ω
E
λ λ
λ
m H (N )
tanh
λ λ
Z
d