The book website contains supporting material for instructors and readers.

LEARNING FROM DATA: A SHORT COURSE

Yaser S. Abu-Mostafa
Malik Magdon-Ismail
Hsuan-Tien Lin
Yaser S. Abu-Mostafa, Departments of Electrical Engineering and Computer Science, California Institute of Technology, Pasadena, CA 91125

Malik Magdon-Ismail, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180

Hsuan-Tien Lin, Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
ISBN 10: 1-60049-006-9
ISBN 13: 978-1-60049-006-4

© 2012 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the authors. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the authors, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act.

Limit of Liability / Disclaimer of Warranty: While the authors have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. The authors shall not be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

The use in this publication of tradenames, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

This book was typeset by the authors and was printed and bound in the United States of America.
Preface

This book is designed as a short course on machine learning. It is a short course, not a hurried course. From over a decade of teaching this material, we have distilled what we believe to be the core topics that every student of the subject should know. We chose the title 'learning from data' because it faithfully describes what the subject is about, and we made it a point to cover the topics in a story-like fashion. Our hope is that the reader can learn all the fundamentals of the subject by reading the book cover to cover.

Learning from data has distinct theoretical and practical tracks. If you read two books that each focus on one track, you may feel that you are reading about two different subjects altogether. In this book, we balance the theoretical and the practical, the mathematical and the heuristic. Our criterion for inclusion is relevance. Theory that establishes the conceptual framework for learning is included, and so are heuristics that impact the performance of real learning systems. Strengths and weaknesses of the different parts are spelled out. Our philosophy is to say it like it is: what we know, what we don't know, and what we partially know.

The book can be taught in exactly the order it is presented. The notable exception may be Chapter 2, which is the most theoretical chapter of the book. The theory of generalization that this chapter covers is central to learning from data, and we made an effort to make it accessible to a wide readership. However, instructors who are more interested in the practical side may skim over it, or delay it until after the practical methods of Chapter 3 are taught.

You will notice that we included exercises (in gray boxes) throughout the text. The main purpose of these exercises is to engage the reader and enhance understanding of the particular topic being covered. Our reason for separating the exercises out is that they are not crucial to the logical flow. Nevertheless, they contain useful information, and we strongly encourage you to read them, even if you don't do them to completion. Instructors may find some of the exercises appropriate as 'easy' homework problems, and we also provide additional problems of varying difficulty in the Problems section at the end of each chapter.

To help instructors with preparing their lectures based on the book, we provide supporting material on the book's website. There is also a forum that covers additional topics in learning from data. We will discuss these further in the Epilogue of this book.

Acknowledgment (in alphabetical order within each group): We would like to express our gratitude to the alumni of our Learning Systems Group at Caltech who gave us detailed expert feedback: Zehra Cataltepe, Ling Li, Amrit Pratap, and Joseph Sill. We thank the many students and colleagues who gave us useful feedback during the development of this book, especially Chun-Wei Liu. The Caltech Library staff, especially Kristin Buxton and David McCaslin, have given us excellent advice and help in our self-publishing effort. We also thank Lucinda Acosta for her help throughout the writing of this book.

Last, but not least, we would like to thank our families for their encouragement, their support, and most of all their patience as they endured the time demands that writing a book has imposed on us.
Pasadea Califoria. Malik MagdIsmail oy e ork HsuaTie Li Taipei Taia
Yase S AbuMstaf
arch
viii
NOTATION

A complete table of the notation used in this book is included right before the index of terms. We suggest referring to it as needed.
Chapter 1

The Learning Problem

If you show a picture to a three-year-old and ask if there is a tree in it, you will likely get the correct answer. If you ask a thirty-year-old what the definition of a tree is, you will likely get an inconclusive answer. We didn't learn what a tree is by studying the mathematical definition of trees. We learned it by looking at trees. In other words, we learned from 'data'.

Learning from data is used in situations where we don't have an analytic solution, but we do have data that we can use to construct an empirical solution. This premise covers a lot of territory, and indeed learning from data is one of the most widely used techniques in science, engineering, and economics, among other fields.

In this chapter, we present examples of learning from data and formalize the learning problem. We also discuss the main concepts associated with learning, and the different paradigms of learning that have been developed.
1.1  Problem Setup
What do financial forecasting, medical diagnosis, computer vision, and search engines have in common? They all have successfully utilized learning from data. The repertoire of such applications is quite impressive. Let us open the discussion with a real-life application to see how learning from data works.

Consider the problem of predicting how a movie viewer would rate the various movies out there. This is an important problem if you are a company that rents out movies, since you want to recommend to different viewers the movies they will like. Good recommender systems are so important to business that the movie rental company Netflix offered a prize of one million dollars to anyone who could improve their recommendations by a mere 10%.

The main difficulty in this problem is that the criteria that viewers use to rate movies are quite complex. Trying to model those explicitly is no easy task, so it may not be possible to come up with an analytic solution. However, we
Figure 1.1: A model for how a viewer rates a movie.
know that the historical rating data reveal a lot about how people rate movies, so we may be able to construct a good empirical solution. There is a great deal of data available to movie rental companies, since they often ask their viewers to rate the movies that they have already seen.

Figure 1.1 illustrates a specific approach that was widely used in the million-dollar competition. Here is how it works. You describe a movie as a long array of different factors, e.g., how much comedy is in it, how complicated is the plot, how handsome is the lead actor, etc. Now, you describe each viewer with corresponding factors: how much do they like comedy, do they prefer simple or complicated plots, how important are the looks of the lead actor, and so on. How this viewer will rate that movie is now estimated based on the match/mismatch of these factors. For example, if the movie is pure comedy and the viewer hates comedies, the chances are he won't like it. If you take dozens of these factors describing many facets of a movie's content and a viewer's taste, the conclusion based on matching all the factors will be a good predictor of how the viewer will rate the movie.

The power of learning from data is that this entire process can be automated, without any need for analyzing movie content or viewer taste. To do so, the learning algorithm 'reverse-engineers' these factors based solely on pre-
vious ratings. It starts with random factors, then tunes these factors to make them more and more aligned with how viewers have rated movies before, until they are ultimately able to predict how viewers rate movies in general. The factors we end up with may not be as intuitive as 'comedy content', and in fact can be quite subtle or even incomprehensible. After all, the algorithm is only trying to find the best way to predict how a viewer would rate a movie, not necessarily to explain to us how it is done. This algorithm was part of the winning solution in the million-dollar competition.
1.1.1  Components of Learning
The movie rating application captures the essence of learning from data, and so do many other applications from vastly different fields. In order to abstract the common core of the learning problem, we will pick one application and use it as a metaphor for the different components of the problem. Let us take credit approval as our metaphor.

Suppose that a bank receives thousands of credit card applications every day, and it wants to automate the process of evaluating them. Just as in the case of movie ratings, the bank knows of no magical formula that can pinpoint when credit should be approved, but it has a lot of data. This calls for learning from data, so the bank uses historical records of previous customers to figure out a good formula for credit approval.

Each customer record has personal information related to credit, such as annual salary, years in residence, outstanding loans, etc. The record also keeps track of whether approving credit for that customer was a good idea, i.e., did the bank make money on that customer. This data guides the construction of a successful formula for credit approval that can be used on future applicants.

Let us give names and symbols to the main components of this learning problem. There is the input x (customer information that is used to make a credit decision), the unknown target function f: X -> Y (ideal formula for credit approval), where X is the input space (the set of all possible inputs x) and Y is the output space (the set of all possible outputs, in this case just a yes/no decision). There is a data set D of input-output examples (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) for n = 1, ..., N (inputs corresponding to previous customers and the correct credit decision for them in hindsight). The examples are often referred to as data points. Finally, there is the learning algorithm that uses the data set D to pick a formula g: X -> Y that approximates f. The algorithm chooses g from a set of candidate formulas under consideration, which we call the hypothesis set H. For instance, H could be the set of all linear formulas from which the algorithm would choose the best linear fit to the data, as we will introduce later in this section.

When a new customer applies for credit, the bank will base its decision on g (the hypothesis that the learning algorithm produced), not on f (the ideal target function, which remains unknown). The decision will be good only to the extent that g faithfully replicates f.
Figure 1.2: Basic setup of the learning problem: the unknown target function f: X -> Y (ideal credit approval formula), the training examples (x_1, y_1), ..., (x_N, y_N), the learning algorithm, the hypothesis set H, and the final hypothesis g (learned credit approval formula).
To achieve that, the algorithm chooses g that best matches the training examples of previous customers, with the hope that it will continue to match new customers. Whether or not this hope is justified remains to be seen. Figure 1.2 illustrates the components of the learning problem.

Exercise 1.1
Express each of the following tasks in the framework of learning from data by specifying the input space X, the output space Y, the target function f: X -> Y, and the specifics of the data set that we will learn from.
(a) Medical diagnosis: a patient walks in with a medical history and some symptoms, and you want to identify the problem.
(b) Handwritten digit recognition (for example, postal zip code recognition for mail sorting).
(c) Determining if an email is spam or not.
(d) Predicting how an electric load varies with price, temperature, and day of the week.
(e) A problem of interest to you for which there is no analytic solution, but you have data from which to construct an empirical solution.
We will use the setup in Figure 1.2 as our definition of the learning problem. Later on, we will consider a number of refinements and variations to this basic setup as needed. However, the essence of the problem will remain the same: there is a target to be learned; it is unknown to us; we have a set of examples generated by the target; and the learning algorithm uses these examples to look for a hypothesis that approximates the target.

1.1.2  A Simple Learning Model
Let us consider the different components of Figure 1.2. Given a specific learning problem, the target function and training examples are dictated by the problem. However, the learning algorithm and hypothesis set are not. These are solution tools that we get to choose. The hypothesis set and learning algorithm are referred to informally as the learning model.

Here is a simple model. Let X = R^d be the input space, where R^d is the d-dimensional Euclidean space, and let Y = {+1, -1} be the output space, denoting a binary (yes/no) decision. In our credit example, different coordinates of the input vector x in R^d correspond to salary, years in residence, outstanding debt, and the other data fields in a credit application. The binary output y corresponds to approving or denying credit. We specify the hypothesis set H through a functional form that all the hypotheses h in H share. The functional form h(x) that we choose here gives different weights to the different coordinates of x, reflecting their relative importance in the credit decision. The weighted coordinates are then combined to form a 'credit score', and the result is compared to a threshold value. If the applicant passes the threshold, credit is approved; if not, credit is denied:

    Approve credit if    Σ_{i=1}^{d} w_i x_i > threshold,
    Deny credit if       Σ_{i=1}^{d} w_i x_i < threshold.

This formula can be written more compactly as

    h(x) = sign( ( Σ_{i=1}^{d} w_i x_i ) + b ),        (1.1)

where x_1, ..., x_d are the components of the vector x; h(x) = +1 means 'approve credit' and h(x) = -1 means 'deny credit'; sign(s) = +1 if s > 0 and sign(s) = -1 if s < 0.¹ The weights are w_1, ..., w_d, and the threshold is determined by the bias term b, since in Equation (1.1) credit is approved if Σ_{i=1}^{d} w_i x_i > -b. This model of H is called the perceptron, a name that it got in the context of artificial intelligence.

¹ The value of sign(s) when s = 0 is a simple technicality that we ignore for the moment.
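As a concrete illustration of the functional form in Equation (1.1), here is a minimal sketch in Python (not from the book); the two features and the numerical weights are hypothetical choices, used only to show how a weighted credit score is compared against a threshold.

    import numpy as np

    def perceptron_h(x, w, b):
        """Perceptron hypothesis of Equation (1.1): sign of the weighted sum plus bias.

        x : array of d input features (e.g., salary, outstanding debt)
        w : array of d weights, one per feature
        b : bias term (the negative of the threshold)
        Returns +1 (approve credit) or -1 (deny credit).
        """
        score = np.dot(w, x) + b          # credit score compared against the threshold
        return 1 if score > 0 else -1     # sign(s); the s == 0 case is ignored here

    # Hypothetical example: features are (salary in $1000s, outstanding debt in $1000s)
    w = np.array([0.05, -0.2])            # debt gets a negative weight
    b = -1.0                              # corresponds to a threshold of 1.0 on the score
    print(perceptron_h(np.array([80, 10]), w, b))   # +1: approve
    print(perceptron_h(np.array([30, 15]), w, b))   # -1: deny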
Figure 1.3: Perceptron classification of linearly separable data in a two-dimensional input space. (a) Misclassified data: some training examples will be misclassified (blue points in the red region and vice versa) for certain values of the weight parameters, which define the separating line. (b) Perfectly classified data: a final hypothesis that classifies all training examples correctly (+1 is blue and -1 is red).
The learning algorithm will search H by looking for weights and bias that perform well on the data set. Some of the weights w_1, ..., w_d may end up being negative, corresponding to an adverse effect on credit approval. For instance, the weight of the 'outstanding debt' field should come out negative since more debt is not good for credit. The bias value b may end up being large or small, reflecting how lenient or stringent the bank should be in extending credit. The optimal choices of weights and bias define the final hypothesis g in H that the algorithm produces.
Exercise 1.2
Suppose that we use a perceptron to detect spam messages. Let's say that each email message is represented by the frequency of occurrence of keywords, and the output is +1 if the message is considered spam.
(a) Can you think of some keywords that will end up with a large positive weight in the perceptron?
(b) How about keywords that will get a negative weight?
(c) What parameter in the perceptron directly affects how many borderline messages end up being classified as spam?
Figure 1.3 illustrates what a perceptron does in a two-dimensional case (d = 2). The plane is split by a line into two regions, the +1 decision region and the -1 decision region. Different values for the parameters w_1, w_2, b correspond to different lines w_1 x_1 + w_2 x_2 + b = 0. If the data set is linearly separable, there will be a choice for these parameters that classifies all the training examples correctly.
To simplify the notation of the perceptron formula, we will treat the bias b as a weight w_0 = b and merge it with the other weights into one vector w = [w_0, w_1, ..., w_d]ᵀ, where ᵀ denotes the transpose of a vector, so w is a column vector. We also treat x as a column vector and modify it to become x = [x_0, x_1, ..., x_d]ᵀ, where the added coordinate x_0 is fixed at x_0 = 1. Formally speaking, the input space is now

    X = {1} × R^d = { [x_0, x_1, ..., x_d]ᵀ : x_0 = 1, x_1 ∈ R, ..., x_d ∈ R }.

With this convention, wᵀx = Σ_{i=0}^{d} w_i x_i, and so Equation (1.1) can be rewritten in vector form as

    h(x) = sign(wᵀx).        (1.2)
We now introduce the perceptron learning algorithm (PLA). The algorithm will determine what w should be, based on the data. Let us assume that the data set is linearly separable, which means that there is a vector w that makes (1.2) achieve the correct decision h(x_n) = y_n on all the training examples, as shown in Figure 1.3.

Our learning algorithm will find this w using a simple iterative method. Here is how it works. At iteration t, where t = 0, 1, 2, ..., there is a current value of the weight vector, call it w(t). The algorithm picks an example from (x_1, y_1), ..., (x_N, y_N) that is currently misclassified, call it (x(t), y(t)), and uses it to update w(t). Since the example is misclassified, we have y(t) ≠ sign(wᵀ(t) x(t)). The update rule is

    w(t + 1) = w(t) + y(t) x(t).        (1.3)

This rule moves the boundary in the direction of classifying x(t) correctly, as depicted in the figure above. The algorithm continues with further iterations until there are no longer misclassified examples in the data set.
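The update rule (1.3) translates almost line for line into code. The following is a minimal sketch in Python with NumPy (not the book's code), using the augmented-vector convention x_0 = 1 and assuming the labels are ±1 and the data is linearly separable.

    import numpy as np

    def pla(X, y, max_iters=10000):
        """Perceptron learning algorithm (PLA) sketch.

        X : N x d matrix of inputs (without the x_0 = 1 coordinate)
        y : length-N vector of labels in {+1, -1}, assumed linearly separable
        Returns the weight vector w = [w_0, w_1, ..., w_d] of the final hypothesis.
        """
        N, d = X.shape
        X_aug = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1 to every input
        w = np.zeros(d + 1)                       # start from the zero weight vector
        for _ in range(max_iters):
            predictions = np.sign(X_aug @ w)
            misclassified = np.where(predictions != y)[0]
            if len(misclassified) == 0:           # no misclassified examples: done
                return w
            n = np.random.choice(misclassified)   # pick a misclassified example at random
            w = w + y[n] * X_aug[n]               # update rule (1.3)
        return w                                  # fallback if max_iters is reached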
Exercise 1.3
The weight update rule in (1.3) has the nice interpretation that it moves in the direction of classifying x(t) correctly.
(a) Show that y(t) wᵀ(t) x(t) < 0. [Hint: x(t) is misclassified by w(t).]
(b) Show that y(t) wᵀ(t+1) x(t) > y(t) wᵀ(t) x(t). [Hint: Use (1.3).]
(c) As far as classifying x(t) is concerned, argue that the move from w(t) to w(t+1) is a move 'in the right direction'.
Although the update rule in (1.3) considers only one training example at a time and may 'mess up' the classification of the other examples that are not involved in the current iteration, it turns out that the algorithm is guaranteed to arrive at the right solution in the end. The proof is the subject of Problem 1.3. The result holds regardless of which example we choose from among the misclassified examples in (x_1, y_1), ..., (x_N, y_N) at each iteration, and regardless of how we initialize the weight vector to start the algorithm. For simplicity, we can pick one of the misclassified examples at random (or cycle through the examples and always choose the first misclassified one), and we can initialize w(0) to the zero vector.

Within the infinite space of all weight vectors, the perceptron algorithm manages to find a weight vector that works, using a simple iterative process. This illustrates how a learning algorithm can effectively search an infinite hypothesis set using a finite number of simple steps. This feature is characteristic of many techniques that are used in learning, some of which are far more sophisticated than the perceptron learning algorithm.

Exercise 1.4
Let us create our own target function f and data set D and see how the perceptron learning algorithm works. Take d = 2 so you can visualize the problem, and choose a random line in the plane as your target function, where one side of the line maps to +1 and the other maps to -1. Choose the inputs x_n of the data set as random points in the plane, and evaluate the target function on each x_n to get the corresponding output y_n. Now, generate a data set of size 20. Try the perceptron learning algorithm on your data set and see how long it takes to converge and how well the final hypothesis matches your target. You can find other ways to play with this experiment in Problem 1.4.
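One way to set up the experiment in Exercise 1.4 is sketched below (Python, not the book's code); the choice of the square [-1, 1]² as the region and the particular random-line construction are assumptions made for illustration. It reuses the pla sketch shown earlier.

    import numpy as np

    def random_target_and_data(N=20, rng=np.random.default_rng(0)):
        """Generate a random linear target f on [-1, 1]^2 and a data set of size N."""
        p, q = rng.uniform(-1, 1, (2, 2))              # two random points define the target line
        a, b = q[1] - p[1], p[0] - q[0]                # line: a*x1 + b*x2 + c = 0
        c = -(a * p[0] + b * p[1])
        X = rng.uniform(-1, 1, (N, 2))                 # random input points in the plane
        y = np.sign(a * X[:, 0] + b * X[:, 1] + c)     # +1 on one side of the line, -1 on the other
        return X, y

    X, y = random_target_and_data()
    w = pla(X, y)                                      # run the PLA sketch from above
    print("final hypothesis weights:", w)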
The perceptron learning algorithm succeeds in achieving its goal: finding a hypothesis that classifies all the points in the data set D = {(x_1, y_1), ..., (x_N, y_N)} correctly. Does this mean that this hypothesis will also be successful in classifying new data points that are not in D? This turns out to be the key question in the theory of learning, a question that will be thoroughly examined in this book.
Figure 1.4: The learning approach to coin classification. (a) Training data of pennies, nickels, dimes, and quarters (1, 5, 10, and 25 cents) are represented in a size-mass space where they fall into clusters. (b) A classification rule is learned from the data set by separating the four clusters. A new coin will be classified according to the region in the size-mass plane that it falls into.
1.1.3  Learning versus Design
So far, we have discussed what learning is. Here, we discuss what it is not. The goal is to distinguish between learning and a related approach that is used for similar problems. While learning is based on data, this other approach does not use data. It is a 'design' approach based on specifications, and is often discussed alongside the learning approach in the pattern recognition literature.

Consider the problem of recognizing coins of different denominations, which is relevant to vending machines, for example. We want the machine to recognize quarters, dimes, nickels, and pennies. We will contrast the 'learning from data' approach and the 'design from specifications' approach for this problem. We assume that each coin will be represented by its size and mass, a two-dimensional input.

In the learning approach, we are given a sample of coins from each of the four denominations and we use these coins as our data set. We treat the size and mass as the input vector, and the denomination as the output. Figure 1.4(a) shows what the data set may look like in the input space. There is some variation of size and mass within each class, but by and large, coins of the same denomination cluster together. The learning algorithm searches for a hypothesis that classifies the data set well. If we want to classify a new coin, the machine measures its size and mass, and then classifies it according to the learned hypothesis in Figure 1.4(b).

In the design approach, we call the United States Mint and ask them about the specifications of different coins. We also ask them about the number
Figure 1.5: The design approach to coin classification. (a) A probabilistic model for the size, mass, and denomination of coins is derived from known specifications. The figure shows the high probability region for each denomination (1, 5, 10, and 25 cents) according to the model. (b) A classification rule is derived analytically to minimize the probability of error in classifying a coin based on size and mass. The resulting regions for each denomination are shown.
of coins of each denomination in circulation, in order to get an estimate of the relative frequency of each coin. Finally, we make a physical model of the variations in size and mass due to exposure to the elements and due to errors in measurement. We put all of this information together and compute the full joint probability distribution of size, mass, and coin denomination (Figure 1.5(a)). Once we have that joint distribution, we can construct the optimal decision rule to classify coins based on size and mass (Figure 1.5(b)). The rule chooses the denomination that has the highest probability for a given size and mass, thus achieving the smallest possible probability of error.²

The main difference between the learning approach and the design approach is the role that data plays. In the design approach, the problem is well specified, and one can analytically derive f without the need to see any data. In the learning approach, the problem is much less specified, and one needs data to pin down what f is. Both approaches may be viable in some applications, but only the learning approach is possible in many applications where the target function is unknown. We are not trying to compare the utility or the performance of the two approaches. We are just making the point that the design approach is distinct from learning. This book is about learning.

² This is called Bayes optimal decision theory. Some learning models are based on the same theory by estimating the probability from data.
Exercise 1.5
Which of the following problems are more suited for the learning approach and which are more suited for the design approach?
(a) Determining the age at which a particular medical test should be performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection
1.2  Types of Learning
The basic premise of learning from data is the use of a set of observations to uncover an underlying process. It is a very broad premise, and difficult to fit into a single framework. As a result, different learning paradigms have arisen to deal with different situations and different assumptions. In this section, we introduce some of these paradigms.

The learning paradigm that we have discussed so far is called supervised learning. It is the most studied and most utilized type of learning, but it is not the only one. Some variations of supervised learning are simple enough to be accommodated within the same framework. Other variations are more profound and lead to new concepts and techniques that take on lives of their own. The most important variations have to do with the nature of the data set.
1.2.1  Supervised Learning
When the training data contains explicit examples of what the correct output should be for given inputs, then we are within the supervised learning setting that we have covered so far. Consider the hand-written digit recognition problem (task (b) of Exercise 1.1). A reasonable data set for this problem is a collection of images of hand-written digits, and for each image, the digit it actually represents. We thus have a set of examples of the form (image, digit). The learning is supervised in the sense that some 'supervisor' has taken the trouble to look at each input, in this case an image, and determine the correct output, in this case one of the ten categories {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

While we are on the subject of variations, there is more than one way that a data set can be presented to the learning process. Data sets are typically created and presented to us in their entirety at the outset of the learning process. For instance, historical records of customers in the credit-card application, and previous movie ratings of customers in the movie rating application, are already there for us to use. This protocol of a 'ready' data set is the most
common in practice, and it is what we will focus on in this book. However, it is worth noting that two variations of this protocol have attracted a significant body of work.

One is active learning, where the data set is acquired through queries that we make. Thus, we get to choose a point x in the input space, and the supervisor reports to us the target value for x. As you can see, this opens the possibility for strategic choice of the point x to maximize its information value, similar to asking a strategic question in a game of 20 questions.

Another variation is called online learning, where the data set is given to the algorithm one example at a time. This happens when we have streaming data that the algorithm has to process 'on the run'. For instance, when the movie recommendation system discussed in Section 1.1 is deployed, online learning can process new ratings from current users and movies. Online learning is also useful when we have limitations on computing and storage that preclude us from processing the whole data as a batch. We should note that online learning can be used in different paradigms of learning, not just in supervised learning.
1.2.2  Reinforcement Learning
When the training data does not explicitly contain the correct output for each input, we are no longer in a supervised learning setting. Consider a toddler learning not to touch a hot cup of tea. The experience of such a toddler would typically comprise a set of occasions when the toddler confronted a hot cup of tea and was faced with the decision of touching it or not touching it. Presumably, every time she touched it, the result was a high level of pain, and every time she didn't touch it, a much lower level of pain resulted (that of an unsatisfied curiosity). Eventually, the toddler learns that she is better off not touching the hot cup.

The training examples did not spell out what the toddler should have done, but they instead graded different actions that she has taken. Nevertheless, she uses the examples to reinforce the better actions, eventually learning what she should do in similar situations. This characterizes reinforcement learning, where the training example does not contain the target output, but instead contains some possible output together with a measure of how good that output is. In contrast to supervised learning, where the training examples were of the form (input, correct output), the examples in reinforcement learning are of the form

    (input, some output, grade for this output).

Importantly, the example does not say how good other outputs would have been for this particular input.

Reinforcement learning is especially useful for learning how to play a game. Imagine a situation in backgammon where you have a choice between different actions and you want to identify the best action. It is not a trivial task to ascertain what the best action is at a given stage of the game, so we cannot
Figure 1.6: Unsupervised learning of coin classification. (a) The same data set of coins as in Figure 1.4(a) is again represented in the size-mass space, but without being labeled. They still fall into clusters. (b) An unsupervised classification rule treats the four clusters as different types. The rule may be somewhat ambiguous, as type 1 and type 2 could be viewed as one cluster.
easily create supervised learning examples. If you use reinforcement learning instead, all you need to do is to take some action and report how well things went, and you have a training example. The reinforcement learning algorithm is left with the task of sorting out the information coming from different examples to find the best line of play.
1.2.3  Unsupervised Learning
In the unsupervised setting, the training data does not contain any output information at all. We are just given input examples x_1, ..., x_N. You may wonder how we could possibly learn anything from mere inputs. Consider the coin classification problem that we discussed earlier in Figure 1.4. Suppose that we didn't know the denomination of any of the coins in the data set. This unlabeled data is shown in Figure 1.6(a). We still get similar clusters, but they are now unlabeled, so all points have the same 'color'. The decision regions in unsupervised learning may be identical to those in supervised learning, but without the labels (Figure 1.6(b)). However, the correct clustering is less obvious now, and even the number of clusters may be ambiguous. Nonetheless, this example shows that we can learn something from the inputs by themselves. Unsupervised learning can be viewed as the task of spontaneously finding patterns and structure in input data. For instance, if our task is to categorize a set of books into topics, and we only use general properties of the various books, we can identify books that have similar properties and put them together in one category, without naming that category.
Unsupervised learning can also be viewed as a way to create a higher-level representation of the data. Imagine that you don't speak a word of Spanish, but your company will relocate you to Spain next month. They will arrange for Spanish lessons once you are there, but you would like to prepare yourself a bit before you go. All you have access to is a Spanish radio station. For a month, you continuously bombard yourself with Spanish; this is an unsupervised learning experience, since you don't know the meaning of the words. However, you gradually develop a better representation of the language in your brain by becoming more tuned to its common sounds and structures. When you arrive in Spain, you will be in a better position to start your Spanish lessons. Indeed, unsupervised learning can be a precursor to supervised learning. In other cases, it is a stand-alone technique.
Exercise 1.6
For each of the following tasks, identify which type of learning is involved (supervised, reinforcement, or unsupervised) and the training data to be used. If a task can fit more than one type, explain how and describe the training data for each type.
(a) Recommending a book to a user in an online bookstore
(b) Playing tic-tac-toe
(c) Categorizing movies into different types
(d) Learning to play music
(e) Credit limit: deciding the maximum allowed debt for each bank customer
Our main focus in this book will be supervised learning, which is the most popular form of learning from data.
1.2.4  Other Views of Learning
The study of learning has evolved somewhat independently in a number of fields that started historically at different times and in different domains, and these fields have developed different emphases and even different jargons. As a result, learning from data is a diverse subject with many aliases in the scientific literature. The main field dedicated to the subject is called machine learning, a name that distinguishes it from human learning. We briefly mention two other important fields that approach learning from data in their own ways.

Statistics shares the basic premise of learning from data, namely the use of a set of observations to uncover an underlying process. In this case, the process is a probability distribution and the observations are samples from that distribution. Because statistics is a mathematical field, emphasis is given to situations where most of the questions can be answered with rigorous proofs. As a result, statistics focuses on somewhat idealized models and analyzes them in great detail. This is the main difference between the statistical approach
Figure 1.7: A visual learning problem. The first two rows show the training examples (each input x is a 9-bit vector represented visually as a 3 x 3 black-and-white array). The inputs in the first row have f(x) = -1, and the inputs in the second row have f(x) = +1. Your task is to learn from this data set what f is, then apply f to the test input at the bottom. Do you get -1 or +1?
to learning and how we approach the subject here. We make less restrictive assumptions and deal with more general models than in statistics. Therefore, we end up with weaker results that are nonetheless broadly applicable.

Data mining is a practical field that focuses on finding patterns, correlations, or anomalies in large relational databases. For example, we could be looking at medical records of patients and trying to detect a cause-effect relationship between a particular drug and long-term effects. We could also be looking at credit card spending patterns and trying to detect potential fraud. Technically, data mining is the same as learning from data, with more emphasis on data analysis than on prediction. Because databases are usually huge, computational issues are often critical in data mining. Recommender systems, which were illustrated in Section 1.1 with the movie rating example, are also considered part of data mining.
1.3  Is Learning Feasible?
The target function f is the object of learning. The most important assertion about the target function is that it is unknown. We really mean unknown. This raises a natural question: how could a limited data set reveal enough information to pin down the entire target function? Figure 1.7 illustrates this
difficulty. A simple learning task with 6 training examples of a ±1 target function is shown. Try to learn what the function is, then apply it to the test input given. Do you get -1 or +1? Now, show the problem to your friends and see if they get the same answer.

The chances are the answers were not unanimous, and for good reason. There is simply more than one function that fits the 6 training examples, and some of these functions have a value of -1 on the test point while others have a value of +1. For instance, if the true f is +1 when the pattern is symmetric, the value for the test point would be +1. If the true f is -1 when the top left square of the pattern is white, the value for the test point would be -1. Both functions agree with all the examples in the data set, so there isn't enough information to tell us which would be the correct answer.

This does not bode well for the feasibility of learning. To make matters worse, we will now see that the difficulty we experienced in this simple problem is the rule, not the exception.
1.3.1  Outside the Data Set
When we get the training data D, e.g., the first two rows of Figure 1.7, we know the value of f on all the points in D. This doesn't mean that we have learned f, since it doesn't guarantee that we know anything about f outside of D. We know what we have already seen, but that's not learning. That's memorizing.

Does the data set D tell us anything outside of D that we didn't know before? If the answer is yes, then we have learned something. If the answer is no, we can conclude that learning is not feasible.

Since we maintain that f is an unknown function, we can prove that f remains unknown outside of D. Instead of going through a formal proof for the general case, we will illustrate the idea in a concrete case. Consider a Boolean target function over a three-dimensional input space X = {0, 1}^3. We are given a data set D of five examples represented in the table below. We denote the binary output by ○/● for visual clarity,

    x_n       y_n
    0 0 0      ○
    0 0 1      ●
    0 1 0      ●
    0 1 1      ○
    1 0 0      ●

where y_n = f(x_n) for n = 1, 2, 3, 4, 5. The advantage of this simple Boolean case is that we can enumerate the entire input space (since there are only 2^3 = 8 distinct input vectors), and we can enumerate the set of all possible target functions (since f is a Boolean function on 3 Boolean inputs, and there are only 2^8 = 256 distinct Boolean functions on 3 Boolean inputs).
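The counting argument can also be checked by brute force. The sketch below (Python, not from the book) enumerates all 256 Boolean functions on three bits and keeps only those that agree with the five examples in the table above (with ● encoded as 1 and ○ as 0); exactly 2^3 = 8 survive, one for each way of filling in the three unseen points.

    from itertools import product

    # The five training examples from the table above: input bits -> output (1 = filled dot, 0 = open dot)
    D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

    inputs = list(product([0, 1], repeat=3))          # all 8 points of X = {0, 1}^3

    # Each candidate target function is a tuple of 8 output bits, one per input point
    consistent = [f for f in product([0, 1], repeat=8)
                  if all(f[inputs.index(x)] == y for x, y in D.items())]

    print(len(consistent))                            # 8 functions agree with D
    for f in consistent:                              # they differ only on the 3 unseen points
        print(dict(zip(inputs, f)))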
Let us look at the problem of learning f. Since f is unknown except inside D, any function that agrees with D could conceivably be f. The table below shows all such functions f_1, ..., f_8. It also shows the data set D (in blue in the original figure) and what the final hypothesis g may look like.
    x        y    g     f_1  f_2  f_3  f_4  f_5  f_6  f_7  f_8
    0 0 0    ○    ○      ○    ○    ○    ○    ○    ○    ○    ○
    0 0 1    ●    ●      ●    ●    ●    ●    ●    ●    ●    ●
    0 1 0    ●    ●      ●    ●    ●    ●    ●    ●    ●    ●
    0 1 1    ○    ○      ○    ○    ○    ○    ○    ○    ○    ○
    1 0 0    ●    ●      ●    ●    ●    ●    ●    ●    ●    ●
    1 0 1         ?      ○    ○    ○    ○    ●    ●    ●    ●
    1 1 0         ?      ○    ○    ●    ●    ○    ○    ●    ●
    1 1 1         ?      ○    ●    ○    ●    ○    ●    ○    ●
The final hypothesis g is chosen based on the five examples in D. The table shows the case where g is chosen to match f on these examples.

If we remain true to the notion of unknown target, we cannot exclude any of f_1, ..., f_8 from being the true f. Now, we have a dilemma. The whole purpose of learning f is to be able to predict the value of f on points that we haven't seen before. The quality of the learning will be determined by how close our prediction is to the true value. Regardless of what g predicts on the three points we haven't seen before (those outside of D, denoted by the question marks), it can agree or disagree with the target, depending on which of f_1, ..., f_8 turns out to be the true target. It is easy to verify that any 3 bits that replace the question marks are as good as any other 3 bits.

Exercise 1.7
For each of the following learning scenarios in the above problem, evaluate the performance of g on the three points in X outside D. To measure the performance, compute how many of the 8 possible target functions agree with g on all three points, on two of them, on one of them, and on none of them.
(a) H has only two hypotheses, one that always returns '●' and one that always returns '○'. The learning algorithm picks the hypothesis that matches the data set the most.
(b) The same H, but the learning algorithm now picks the hypothesis that matches the data set the least.
(c) H = {XOR} (only one hypothesis, which is always picked), where XOR is defined by XOR(x) = ● if the number of 1's in x is odd and XOR(x) = ○ if the number is even.
(d) H contains all possible hypotheses (all Boolean functions on three variables), and the learning algorithm picks the hypothesis that agrees with all training examples, but otherwise disagrees the most with the XOR.
It doesn't matter what the algorithm does or what hypothesis set H is used. Whether H has a hypothesis that perfectly agrees with D (as depicted in the table) or not, and whether the learning algorithm picks that hypothesis or picks another one that disagrees with D (different green bits), it makes no difference whatsoever as far as the performance outside of D is concerned. Yet the performance outside of D is all that matters in learning!

This dilemma is not restricted to Boolean functions, but extends to the general learning problem. As long as f is an unknown function, knowing D cannot exclude any pattern of values for f outside of D. Therefore, the predictions of g outside of D are meaningless.

Does this mean that learning from data is doomed? If so, this will be a very short book. Fortunately, learning is alive and well, and we will see why. We won't have to change our basic assumption to do that. The target function will continue to be unknown, and we still mean unknown.

Figure 1.8: A random sample is picked from a bin of red and green marbles. The probability µ of red marbles in the bin is unknown. What does the fraction ν of red marbles in the sample tell us about µ?
1.3.2  Probability to the Rescue
We will show that we can indeed infer something outside D using only D, but in a probabilistic way. What we infer may not be much compared to learning a full target function, but it will establish the principle that we can reach outside D. Once we establish that, we will take it to the general learning problem and pin down what we can and cannot learn.

Let's take the simplest case of picking a sample, and see when we can say something about the objects outside the sample. Consider a bin that contains red and green marbles, possibly infinitely many. The proportion of red and green marbles in the bin is such that if we pick a marble at random, the probability that it will be red is µ and the probability that it will be green is 1 - µ. We assume that the value of µ is unknown to us.
We pick a random sample of N independent marbles (with replacement) from this bin, and observe the fraction ν of red marbles within the sample (Figure 1.8). What does the value of ν tell us about the value of µ?

One answer is that regardless of the colors of the marbles that we picked, we still don't know the color of any marble that we didn't pick. We can get mostly green marbles in the sample while the bin has mostly red marbles. Although this is certainly possible, it is by no means probable.

Exercise 1.8
If µ = 0.9, what is the probability that a sample of 10 marbles will have ν ≤ 0.1? [Hints: 1. Use the binomial distribution. 2. The answer is a very small number.]
The situation is similar to taking a poll. A random sample from a population tends to agree with the views of the population at large. The probability distribution of the random variable ν in terms of the parameter µ is well understood, and when the sample size is big, ν tends to be close to µ.

To quantify the relationship between ν and µ, we use a simple bound called the Hoeffding Inequality. It states that for any sample size N,

    P[ |ν - µ| > ε ] ≤ 2 e^(-2ε²N)    for any ε > 0.        (1.4)

Here, P[·] denotes the probability of an event, in this case with respect to the random sample we pick, and ε is any positive value we choose. Putting Inequality (1.4) in words, it says that as the sample size N grows, it becomes exponentially unlikely that ν will deviate from µ by more than our 'tolerance' ε.

The only quantity that is random in (1.4) is ν, which depends on the random sample. By contrast, µ is not random. It is just a constant, albeit unknown to us. There is a subtle point here. The utility of (1.4) is to infer the value of µ using the value of ν, although it is µ that affects ν, not vice versa. However, since the effect is that ν tends to be close to µ, we infer that µ 'tends' to be close to ν.

Although P[ |ν - µ| > ε ] depends on µ, as µ appears in the argument and also affects the distribution of ν, we are able to bound the probability by 2 e^(-2ε²N), which does not depend on µ. Notice that only the size N of the sample affects
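The bound in (1.4) is easy to check numerically. The sketch below (Python, not from the book; µ, N, ε, and the number of runs are illustrative choices) draws many samples of N marbles from a bin with a chosen µ and compares the observed frequency of |ν - µ| > ε against 2 e^(-2ε²N).

    import numpy as np

    rng = np.random.default_rng(0)
    mu, N, eps, runs = 0.5, 100, 0.1, 100_000

    # Each row is one sample of N marbles; True marks a red marble (probability mu)
    samples = rng.random((runs, N)) < mu
    nu = samples.mean(axis=1)                       # fraction of red marbles in each sample

    empirical = np.mean(np.abs(nu - mu) > eps)      # observed P[|nu - mu| > eps]
    hoeffding = 2 * np.exp(-2 * eps**2 * N)         # Hoeffding bound (1.4)
    print(empirical, "<=", hoeffding)               # roughly 0.035 <= 0.27 for these settings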
the bound, not the size of the bin. The bin can be large or small, finite or infinite, and we still get the same bound when we use the same sample size.

Exercise 1.9
If µ = 0.9, use the Hoeffding Inequality to bound the probability that a sample of 10 marbles will have ν ≤ 0.1, and compare the answer to the previous exercise.
If we choose ε to be very small in order to make ν a good approximation of µ, we need a larger sample size N to make the RHS of Inequality (1.4) small. We can then assert that it is likely that ν will indeed be a good approximation of µ. Although this assertion does not give us the exact value of µ, and doesn't even guarantee that the approximate value holds, knowing that we are within ±ε of µ most of the time is a significant improvement over not knowing anything at all.

The fact that the sample was randomly selected from the bin is the reason we are able to make any kind of statement about µ being close to ν. If the sample was not randomly selected but picked in a particular way, we would lose the benefit of the probabilistic analysis and we would again be in the dark outside of the sample.

How does the bin model relate to the learning problem? It seems that the unknown here was just the value of µ, while the unknown in learning is an entire function f: X -> Y. The two situations can be connected. Take any single hypothesis h in H and compare it to f on each point x in X. If h(x) = f(x), color the point x green. If h(x) ≠ f(x), color the point x red. The color that each point gets is not known to us, since f is unknown. However, if we pick x at random according to some probability distribution P over the input space X, we know that x will be red with some probability, call it µ, and green with probability 1 - µ. Regardless of the value of µ, the space X now behaves like the bin in Figure 1.8.

The training examples play the role of a sample from the bin. If the inputs x_1, ..., x_N in D are picked independently according to P, we will get a random sample of red (h(x_n) ≠ f(x_n)) and green (h(x_n) = f(x_n)) points. Each point will be red with probability µ and green with probability 1 - µ. The color of each point will be known to us, since both h(x_n) and f(x_n) are known for n = 1, ..., N (the function h is our hypothesis so we can evaluate it on any point, and f(x_n) = y_n is given to us for all points in the data set D).

The learning problem is now reduced to a bin problem, under the assumption that the inputs in D are picked independently according to some distribution P on X. Any P will translate to some µ in the equivalent bin. Since µ is allowed to be unknown, P can be unknown to us as well. Figure 1.9 adds this probabilistic component to the basic learning setup depicted in Figure 1.2.

With this equivalence, the Hoeffding Inequality can be applied to the learning problem, allowing us to make a prediction outside of D. Using ν to predict µ tells us something about f, although it doesn't tell us what f is. What µ tells us is the error rate h makes in approximating f. If ν happens to be close to zero, we can predict that h will approximate f well over the entire input space. If not, we are out of luck.

Unfortunately, we have no control over ν in our current situation, since ν is based on a particular hypothesis h. In real learning, we explore an entire hypothesis set H, looking for some h in H that has a small error rate. If we have only one hypothesis to begin with, we are not really learning, but rather 'verifying' whether that particular hypothesis is good or bad. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning.
Figure 1.9: Probability added to the basic learning setup. In addition to the unknown target function, the training examples, the hypothesis set, and the final hypothesis, the inputs are now generated by an unknown probability distribution P on X.
To do that, we start by introducing more descriptive names for the different components that we will use. The error rate within the sample, which corresponds to ν in the bin model, will be called the in-sample error,

    E_in(h) = (fraction of D where f and h disagree)
            = (1/N) Σ_{n=1}^{N} [ h(x_n) ≠ f(x_n) ],

where [statement] = 1 if the statement is true, and = 0 if the statement is false. We have made explicit the dependency of E_in on the particular h that we are considering. In the same way, we define the out-of-sample error

    E_out(h) = P[ h(x) ≠ f(x) ],

which corresponds to µ in the bin model. The probability is based on the distribution P over X which is used to sample the data points x.
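In code, E_in is just the fraction of mismatches on the data set. E_out can only be estimated, and only in a simulation where the target is known (as in Exercise 1.4), by evaluating h and f on fresh points drawn from P. The sketch below (Python, not the book's code) assumes h and f accept arrays of inputs and that P_sample is some user-supplied function that draws inputs from P.

    import numpy as np

    def E_in(h, X, y):
        """In-sample error: fraction of training points where h disagrees with the labels."""
        return np.mean(h(X) != y)

    def estimate_E_out(h, f, P_sample, n_test=100_000):
        """Monte Carlo estimate of the out-of-sample error on fresh points drawn from P.

        Only possible in a simulation where f is known; in real learning f is unknown.
        """
        X_test = P_sample(n_test)                 # draw n_test fresh inputs from P
        return np.mean(h(X_test) != f(X_test))    # fraction of disagreement with the target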
Figure 1.10: Multiple bins depict the learning problem with M hypotheses.

Substituting the new notation E_in for ν and E_out for µ, the Hoeffding Inequality (1.4) can be rewritten as

    P[ |E_in(h) - E_out(h)| > ε ] ≤ 2 e^(-2ε²N)    for any ε > 0,        (1.5)
where N is the number of training examples. The in-sample error E_in, just like ν, is a random variable that depends on the sample. The out-of-sample error E_out, just like µ, is unknown but not random.

Let us consider an entire hypothesis set H instead of just one hypothesis h, and assume for the moment that H has a finite number of hypotheses

    H = {h_1, h_2, ..., h_M}.

We can construct a bin equivalent in this case by having M bins, as shown in Figure 1.10. Each bin still represents the input space X, with the red marbles in the m-th bin corresponding to the points x in X where h_m(x) ≠ f(x). The probability of red marbles in the m-th bin is E_out(h_m), and the fraction of red marbles in the m-th sample is E_in(h_m), for m = 1, ..., M. Although the Hoeffding Inequality (1.5) still applies to each bin individually, the situation becomes more complicated when we consider all the bins simultaneously. Why is that? The inequality stated that

    P[ |E_in(h) - E_out(h)| > ε ] ≤ 2 e^(-2ε²N)    for any ε > 0,

where the hypothesis h is fixed before you generate the data set, and the probability is with respect to random data sets D; we emphasize that the assumption "h is fixed before you generate the data set" is critical to the validity of this bound. If you are allowed to change h after you generate the data set, the assumptions that are needed to prove the Hoeffding Inequality no longer hold. With multiple hypotheses in H, the learning algorithm picks
the final hypothesis g based on D, i.e., after generating the data set. The statement we would like to make is not

    " P[ |E_in(h_m) - E_out(h_m)| > ε ] is small "

(for any particular, fixed h_m), but rather

    " P[ |E_in(g) - E_out(g)| > ε ] is small "    for the final hypothesis g.

The hypothesis g is not fixed ahead of time before generating the data, because which hypothesis is selected to be g depends on the data. So, we cannot just plug in g for h in the Hoeffding inequality. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm.
Exercise 1.10
Here is an experiment that illustrates the difference between a single bin and multiple bins. Run a computer simulation for flipping 1,000 fair coins. Flip each coin independently 10 times. Let's focus on 3 coins as follows: c_1 is the first coin flipped; c_rand is a coin you choose at random; c_min is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie). Let ν_1, ν_rand and ν_min be the fraction of heads you obtain for the respective three coins.
(a) What is µ for the three coins selected?
(b) Repeat this entire experiment a large number of times (e.g., 100,000 runs of the entire experiment) to get several instances of ν_1, ν_rand and ν_min, and plot the histograms of the distributions of ν_1, ν_rand and ν_min. Notice that which coins end up being c_rand and c_min may differ from one run to another.
(c) Using (b), plot estimates for P[ |ν - µ| > ε ] as a function of ε, together with the Hoeffding bound 2 e^(-2ε²N) (on the same graph).
(d) Which coins obey the Hoeffding bound, and which ones do not? Explain why.
(e) Relate part (d) to the multiple bins in Figure 1.10.
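A minimal sketch of the simulation in Exercise 1.10 (Python, not the book's code; the number of runs here is kept small and should be increased for smooth histograms). Running it shows that ν_1 and ν_rand hover around µ = 0.5, while ν_min is biased well below it, which is the whole point of the multiple-bin picture.

    import numpy as np

    rng = np.random.default_rng(0)
    runs, coins, flips = 2000, 1000, 10

    nu_1, nu_rand, nu_min = [], [], []
    for _ in range(runs):
        heads = rng.integers(0, 2, (coins, flips)).mean(axis=1)   # fraction of heads per coin
        nu_1.append(heads[0])                                     # the first coin flipped
        nu_rand.append(heads[rng.integers(coins)])                # a coin chosen at random
        nu_min.append(heads.min())                                # the coin with fewest heads

    print(np.mean(nu_1), np.mean(nu_rand), np.mean(nu_min))       # ~0.5, ~0.5, well below 0.5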
The way to get around this is to try to bound P[ |E_in(g) - E_out(g)| > ε ] in a way that does not depend on which g the learning algorithm picks. There is a simple but crude way of doing that. Since g has to be one of the h_m's regardless of the algorithm and the sample, it is always true that

    " |E_in(g) - E_out(g)| > ε "   implies   " |E_in(h_1) - E_out(h_1)| > ε
                                              or |E_in(h_2) - E_out(h_2)| > ε
                                              ...
                                              or |E_in(h_M) - E_out(h_M)| > ε ",

where "B_1 implies B_2" means that event B_1 implies event B_2. Although the events on the RHS cover a lot more than the LHS, the RHS has the property we want: the hypotheses h_m are fixed. We now apply two basic rules in probability:

    if B_1 implies B_2, then P[B_1] ≤ P[B_2],

and, if B_1, B_2, ..., B_M are any events, then

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].

The second rule is known as the union bound. Putting the two rules together, we get

    P[ |E_in(g) - E_out(g)| > ε ]
        ≤ P[ |E_in(h_1) - E_out(h_1)| > ε  or ... or  |E_in(h_M) - E_out(h_M)| > ε ]
        ≤ Σ_{m=1}^{M} P[ |E_in(h_m) - E_out(h_m)| > ε ].

Applying the Hoeffding Inequality (1.5) to the M terms one at a time, we can bound each term in the sum by 2 e^(-2ε²N). Substituting, we get

    P[ |E_in(g) - E_out(g)| > ε ] ≤ 2 M e^(-2ε²N).        (1.6)

Mathematically, this is a 'uniform' version of (1.5). We are trying to simultaneously approximate all the E_out(h_m)'s by the corresponding E_in(h_m)'s. This allows the learning algorithm to choose any hypothesis based on E_in and expect that the corresponding E_out will uniformly follow suit, regardless of which hypothesis is chosen. The downside of uniform estimates is that the probability bound is a factor of M looser than the bound for a single hypothesis, and will only be meaningful if M is finite. We will improve on that in Chapter 2.
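To see how the factor of M degrades the bound, it helps to plug in some numbers. A quick sketch (Python; the values of N, ε, and M are illustrative only):

    import numpy as np

    def hoeffding_union_bound(M, N, eps):
        """Right-hand side of (1.6): 2 * M * exp(-2 * eps^2 * N)."""
        return 2 * M * np.exp(-2 * eps**2 * N)

    for M in (1, 100, 10_000):
        print(M, hoeffding_union_bound(M, N=1000, eps=0.05))
    # With N = 1000 and eps = 0.05 the bound is about 0.013 for M = 1,
    # about 1.3 for M = 100 (already vacuous), and about 135 for M = 10,000.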
1.3.3  Feasibility of Learning
We have introduced two apparently conflicting arguments about the feasibility of learning. One argument says that we cannot learn anything outside of D, and the other says that we can. We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible:

1. Let us reconcile the two arguments. The question of whether D tells us anything outside of D that we didn't know before has two different answers. If we insist on a deterministic answer, which means that D tells us something certain about f outside of D, then the answer is no. If we accept a probabilistic answer, which means that D tells us something likely about f outside of D, then the answer is yes.
Exercise 1.11
We are given a data set D of 25 training examples from an unknown target function f: X -> Y, where X = R and Y = {-1, +1}. To learn f, we use a simple hypothesis set H = {h_1, h_2}, where h_1 is the constant +1 function and h_2 is the constant -1 function.
We consider two learning algorithms, S (smart) and C (crazy). S chooses the hypothesis that agrees the most with D, and C chooses the other hypothesis deliberately. Let us see how these algorithms perform out of sample from the deterministic and probabilistic points of view. Assume in the probabilistic view that there is a probability distribution on X, and let P[ f(x) = +1 ] = p.
(a) Can S produce a hypothesis that is guaranteed to perform better than random on any point outside D?
(b) Assume for the rest of the exercise that all the examples in D have y_n = +1. Is it possible that the hypothesis that C produces turns out to be better than the hypothesis that S produces?
(c) If p = 0.9, what is the probability that S will produce a better hypothesis than C?
(d) Is there any value of p for which it is more likely than not that C will produce a better hypothesis than S?
By adopting the probabilistic view, we get a positive answer to the feasibility question without paying too much of a price. The only assumption we make in the probabilistic framework is that the examples in D are generated independently. We don't insist on using any particular probability distribution, or even on knowing what distribution is used. However, whatever distribution we use for generating the examples, we must also use it when we evaluate how well g approximates f (Figure 1.9). That's what makes the Hoeffding Inequality applicable. Of course, this ideal situation may not always happen in practice, and some variations of it have been explored in the literature.
2. Let us pin down what we mean by the feasibility of learning. Learning produces a hypothesis g to approximate the unknown target function f. If learning is successful, then g should approximate f well, which means E_out(g) ≈ 0. However, this is not what we get from the probabilistic analysis. What we get instead is E_out(g) ≈ E_in(g). We still have to make E_in(g) ≈ 0 in order to conclude that E_out(g) ≈ 0.

We cannot guarantee that we will find a hypothesis that achieves E_in(g) ≈ 0, but at least we will know if we find it. Remember that E_out(g) is an unknown quantity, since f is unknown, but E_in(g) is a quantity that we can evaluate. We have thus traded the condition E_out(g) ≈ 0, one that we cannot ascertain, for the condition E_in(g) ≈ 0, which we can ascertain. What enabled this is the Hoeffding Inequality (1.6):

    P[ |E_in(g) - E_out(g)| > ε ] ≤ 2 M e^(-2ε²N)
that assures us that E_out(g) ≈ E_in(g), so we can use E_in(g) as a proxy for E_out(g).
Exercise 1.12
A friend comes to you with a learning problem. She says the target function f is completely unknown, but she has 4,000 data points. She is willing to pay you to solve her problem and produce for her a g which approximates f. What is the best that you can promise her among the following:
(a) After learning, you will provide her with a g that you will guarantee approximates f well out of sample.
(b) After learning, you will provide her with a g, and with high probability the g which you produce will approximate f well out of sample.
(c) One of two things will happen: (i) you will produce a hypothesis g; (ii) you will declare that you failed. If you do return a hypothesis g, then with high probability the g which you produce will approximate f well out of sample.
One should note that there are cases where we won't insist that E_in(g) ≈ 0. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error. All we hope for is a forecast that gets it right more often than not. If we get that, our bets will win in the long run. This means that a hypothesis that has E_in(g) somewhat below 0.5 will work, provided of course that E_out(g) is close enough to E_in(g).

The feasibility of learning is thus split into two questions:

1. Can we make sure that E_out(g) is close enough to E_in(g)?
2. Can we make E_in(g) small enough?

The Hoeffding Inequality (1.6) addresses the first question only. The second question is answered after we run the learning algorithm on the actual data and see how small we can get E_in to be.

Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play. One such insight has to do with the complexity of these components.
The complexity of H. If the number of hypotheses M goes up, we run more risk that E_in(g) will be a poor estimator of E_out(g), according to Inequality (1.6). M can be thought of as a measure of the 'complexity' of the hypothesis set H that we use. If we want an affirmative answer to the first question, we need to keep the complexity of H in check. However, if we want an affirmative answer to the second question, we stand a better chance if H is more complex, since g has to come from H. So, a more complex H gives us more flexibility in finding some g that fits the data well, leading to small E_in(g). This tradeoff in the complexity of H is a major theme in learning theory that we will study in detail in Chapter 2.
The complexity of f. Intuitively, a complex target function f should be harder to learn than a simple f. Let us examine if this can be inferred from the two questions above. A close look at Inequality (1.6) reveals that the complexity of f does not affect how well E_in(g) approximates E_out(g). If we fix the hypothesis set and the number of training examples, the inequality provides the same bound whether we are trying to learn a simple f (for instance, a constant function) or a complex f (for instance, a highly nonlinear function). However, this doesn't mean that we can learn complex functions as easily as we learn simple functions. Remember that (1.6) affects the first question only. If the target function is complex, the second question comes into play, since the data from a complex f are harder to fit than the data from a simple f.

This means that we will get a worse value for E_in(g) when f is complex. We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower E_in(g), but then E_out won't be as close to E_in per (1.6). Either way we look at it, a complex f is harder to learn, as we expected. In the extreme case, if f is too complex, we may not be able to learn it at all.

Fortunately, most target functions in real life are not too complex; we can learn them from a reasonable D using a reasonable H. This is obviously a practical observation, not a mathematical statement. Even when we cannot learn a particular f, we will at least be able to tell that we can't. As long as we make sure that the complexity of H gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data.
1.4  Error and Noise
We close this chapter by revisiting two notions in the learning problem in order to bring them closer to the real world. The first notion is what approximation means when we say that our hypothesis approximates the target function well. The second notion is about the nature of the target function. In many situations, there is noise that makes the output of f not uniquely determined by the input. What are the ramifications of having such a 'noisy' target on the learning problem?
1.4.1  Error Measures
Learning is not expected to replicate the target function perfectly. The final hypothesis g is only an approximation of f. To quantify how well g approximates f, we need to define an error measure that quantifies how far we are from the target.

The choice of an error measure affects the outcome of the learning process. Different error measures may lead to different choices of the final hypothesis, even if the target and the data are the same, since the value of a particular error measure may be small while the value of another error measure in the same situation is large. Therefore, which error measure we use has consequences for what we learn. What are the criteria for choosing one error measure over another? We address this question here.

First, let's formalize this notion a bit. An error measure quantifies how well each hypothesis h in the model approximates the target function f,

    Error = E(h, f).

While E(h, f) is based on the entirety of h and f, it is almost universally defined based on the errors on individual input points x. If we define a pointwise error measure e(h(x), f(x)), the overall error will be the average value of this pointwise error. So far, we have been working with the classification error e(h(x), f(x)) = [ h(x) ≠ f(x) ].³

In an ideal world, e(·, ·) should be user-specified. The same learning task in different contexts may warrant the use of different error measures. One may view e(h(x), f(x)) as the 'cost' of using h when you should use f. This cost depends on what h is used for, and cannot be dictated just by our learning techniques. Here is a case in point.
xample 11 (Fingerprint veric ation) Cons ider the probl em of verifying that a ngerprint belongs to a particular person What is the appropriate error measure?
f
{+1-1
y
The target nction takes as input a ngerprint, and returns to the right person, and if it belongs to an intruder
+ if it belongs
k in the literture, nd sometimes the error
3 This meure is lso clled n error is rerred to s or
28
1 . HE EARN NG PROBLEM
1 .4 . RR OR AND OIS E
There are wo ypes of error ha our hypohesis can make here If he correc pers on is rejeced ( bu f i is called alse rejec, and if an incorrec pers on is acceped ( bu f i is called false accep
+ +
+
f
no error alse accep alse rejec no error How should he error measure be dened in his problem? If he righ person is acceped or an inruder is rejeced, he error is clearly zero We need o spec i he erro r values r a false accep a nd fr a false rejec The righ values depend on he applicaion Consider w o poenial clien s of his ngerprin sysem One i s a super marke who will use i a he checkou couner o verify ha you are a member of a discoun program The oher is he CIA who will use i a he enrance o a secure faciliy o verify ha you are auhori zed o ener ha ciliy For he supermarke, a false rejec is cosly because if a cusomer ges wrongly rejeced, she may be discouraged om paronizing he supermarke in he ure All fuure revenue om his annoyed cusomer is los . On he oher hand, he cos of a flse accep is minor You jus gave away a discoun
+
o someone whomus di dn' i, and ha pers on lef heir ngerprin in your sysem hey be deserve bold indeed For he CIA, a false accep is a disaser An unauhorized person will gain access o a highly sensiive aciliy This should be reeced in a much higher cos r he flse accep False rejecs , on he oher hand, can be oleraed since auhorized persons are employees (raher han cusomers as wih he supermarke) The inconvenience of rerying when rejeced is jus p ar of he job, and hey mus deal wih i The coss of he dieren ypes of errors can be abulaed in a marix For our examples, he marices migh look like:
f
+ +
Supermarke
+ f + CIA
These marices should be used o weigh he dieren ypes of errors when we compue he oal error When he learning algorihm minimizes a cos weighed error measure, i auomaically akes ino consideraion he uiliy of he hypohesis ha i will produce In he supermarke and CIA scenario s, his could lead o wo compleely dieren nal hypoheses
D
The moral of his example is ha he choice of he error measure depends on how he sysem is going o be used, raher han on any inheren crierion 29
1 . HE EARNI NG PROBLEM
1.4. RROR AND OISE
UKW UT DTRUT TRG EXMLE
HYTHE ET
Figure
he general (supervsed) learnng problem
that we can independently determine during the learning proc ess However, this ideal choice may not be possible in practice fr two reasons One is that the user may not provide an error specication, which is not uncommon The other is that the weighted cost may be a dicult objective nction r optimizers t o work with There re, we often look r other ways to dene the error measure, sometimes with purely practical or analytic considerations in mindinWe already an example thismeasures with th e insimple error used thishave chapter, and seen we will see otheroferror later binary chapters
1. . 2
Nisy Targets
In many practical applications, the data we learn om are not generated by a deterministic tar get functi on Inste ad, they are generated in a noisy way such that the output is not uniquely determined by the input For instance, in the creditcard example we presented in Section two customers may have identical salaries, outstanding loans, etc, but end up with dierent credit behavior Therere, the credit unction' is not really a deterministic fnction,
1 HE EARNING PROBLEM
1.4 RROR AND OISE
but a noisy one This situation can be readily modeled within the same amework that we have Inste ad of () , we can take the output to be a random variable that is aected by, rather than determined by, the input Formally, we have a target distributio ( ) instead of a target fnction f () A data point (, ) is now generated by the joint distribution ( , ) ()( )
I
I
can think r of aexample, noisy target as atake deterministic target plusofadded noise If One is real-valued one can the exp ected value given to be th e determ inistic () , and consi der () as pure noise that is added to This view suggests that a deterministic target nction can be considered a special case of a noisy target , just with zero noise Indeed, we can rmally express any unction as a distribution ( ) by choosing ( ) to be zero or all except () Therere , there is no loss of generality if we consider t he target to be a distribution rather than a function Figure 1 1 1 modies the previous Figures 1.2 and 1.9 to illustrate the general learning problem, covering both deterministic and noisy targets
I
I
erie
at a es a n er ror ith o ailty in a oxim atin g a etermi nisti tag et n ction ( boh an a bina f nctions) I we use the same o aoimae a nis ersion onsier he in moel r a hoesis
of gi vn b
a hat i s th e oabi lt of eo tat ma es i n a roxim ating (b) t at alue o ill the errmance o e ieenen t
[nt e nos target w oo omete anom
I
There is a dierence between the role of ( ) and the role of () in the learning problem While both distributi ons model probabilistic aspects of and , the target distribution ( ) is what we are trying to learn, while the input distribution () only quanties the relative importance of
I
the Our po intentire in gauging wellfasibility we have learned analysis how of the of learning applies to noisy target nctions as well Intuitively, this is becaus e the Hoeding Inequality ( 1 6) applies to a n arbitrary, unknown target nction Assume we randomly picked all the 's according to the distribut ion ( ) over the ent ire input space X . This realization of ( ) is eectively a target nction Therere, the inequality will be valid no matter which particular random realization the target function' happens to be This does not mean that learning a noisy target is as easy a learning a deterministic one Remember the two questions of learning? With the same learning model Eu may be as close to Ein in the noisy case as it is in the
I
I
31
1 HE EAR NIN G PROBLEM
1 .4 . RROR AND OISE
deterministi ase, but Ein itself will likely be worse in the noisy ase sine it is hard to t the noise In Chapter 2, where we prove a stronger version of ( 1 . 6) , we will as sume the target to be a probability distribution P(y I x) thus overing the general ase.
32
1 . HE EARNI PROBEM
1.5
15 PROBEMS
Prlms
Problem 11
We hae opaque bag s each cont ain ing bas. One bag has bac k ba ls and the other has a bl ack and a whit e bal . You pick a bag at random a nd then pick one o f the bal s in that bag at ran dom . When y ou look at the ba l l it is bl ack. You now pic k the sec ond bal l fro m that same bag. What is the pr oba bi ity that this bal is al so ba ck? {Hi Use Byes eorem
A d B]
A I B] BJ B I A] A]
Problem 1.2 Consider the perceptron in two dimensions h(x) sign(wTx) were w [o , , 2 r and x [1, , 2 r Technicaly x has
th ree coor di nates but we ca l th is pe rceptr on twodimension al because the rs t coordinate is xed at 1
(a) Show tat the regio ns o n the pl ane wher e h(x) + 1 and h(x) - 1 are separated by a l in e. If we express th is in e by the equa tion 2 a + what are the sope a a nd interce pt in term s of o, , 2 ? (b ) Draw a pictu re fr th e cases w [1, , 3 and w -[1, 3.
In more than two dimensions the
+1 and 1 regions are separated by a
y
ere the generaization of a ine.
Problem 1.
roe that the LA eentually conerges to a linear separa tor r separa ble data . The fll owing st eps wil l guide you through the proof. Let w* be an optima set of weights (one which separates the data). Th e essentia idea i n t his proof is to s how that the LA weights w(t) get more aigned" with w* with eery iteration. For si mp li city assum e that w(O) 0
(a) Let p min: N Y(wx) Sho w that > . (b) Sh ow tat wT (t)w* � wT(tl)w*+p, and concude that
[Hi se iducio ]
wT(t) w* � tp
(c) Show tat ll w (t) ll 2 : ll w(t 1) 2 + x(t - 1) 2
{Hi y(t - 1) (wT( t l) x(t 1)) becuse x( t 1) ws miscs sied b w( t - 1) j (d) Show by induction that w(t) 2 tR2 , where R m N x · (cntined n next pge)
33
1 HE EARNIN PROBEM
1 5 PROBEMS
() Using ( b) and (d ) show that WT(t) w * w(t)
t ·
'
and hnc pro that
Hint:
ll w tl/ ll w *ll
1 Wy?J
In practic LA conrgs mor quickly than th bound suggsts p Nrthlss bcaus w do not know in adanc w cant dtrmin th nu mb r of itrat ions to con rgnc which dos pos a problm if th dat a i s nonsparabl
Problem 14
In Exrcis w us a n a rti cia l d ata s t to study th pr cp tro n lar ning al gori thm This problm lads you to xp lor th algor ithm fu rthr with data st s of difrnt sizs a n d di mns ions Gnrat linarly sparabl in Exr aot th xa m pls data { (x,stY)of }siz as wl l as 20 thastarindicatd gt fun ction on ( a ) cis a p la n B sur to m ark th xam pl s from dif rnt class s d ifrnt ly an d add la bls t o th ax s of th p lot b un th prc ptro n la rnin g algori th m on th data s t a bo port th nu mb r of updat s that th a gor ithm tak s bf r con rging lot th xampls { (x, Y) } th targ t fun ction and th n a hypo thsis i n th sam gur Commnt on wh thr is clo s to c pat rything in b with anothr randomly gnratd data st of siz 20 Compar your rsults with b
()
() () () (d ) pat rything in ( b ) with anothr randomly gnratd data st of siz 100 Compar your rsults with ( b ) () pat rything in ( b) with anothr randomly gnratd data st of siz 1 000 Compar your rsults with ( b ) Modify th agorithm such that it taks of J1 instad J (f) dom ly g nra t a li narly sparabl data s t ofEsiz Ean 000 with J and e d th data st to th a lgor ith m How many u pdat s dos th a lgorithm ta k to conr g?
(g) pat th algor
()
ithm on th sam data st as f r 100 xprimnts In t h itratio ns of ach xprim nt pick x(t) randoml y i nstad of dtrmi n istica lly l ot a hi stogram r th n u m br of u pdats that th a lgorith m taks to conrg
( h ) Summariz your conclusions with rspct to accuracy and running tim as a function of
N and
34
1 HE EARNI PROBEM
15 PROBEMS
Th prcpt ron larni ng agori th m works lik this In ach it Problem 15 t pick a ran dom t) yt)) and comput th signal ' t) = wt)t) t) 0 updat w by
ration If yt)
wt + 1)
wt) + yt)
t) ;
On may that this algorithm doslook not tak th closnss' and in to considrati on Lt's at a nothr pr cptbtwn ron arni ng algot) yt) argu rith m I n ach i tration, pi ck a random t) , yt)) and comput t) If yt) t) 1, updat w by
wt + 1)
w t) + '
yt) t)) t) ,
'
whr is a cons ta nt Tha t is, if t) agrs with yt) wl thir product is 1 th a lgo rithm dos nothing On th othr han d, if t) is furthr from yt) th a gorithm ch a ngs wt) mor In this problm, you ar askd to implmnt this algorithm and study its prrmanc
>
a ) Gn rat a train ing data s t of siz 100 si mi a r to that us d in Exrcis 14 Gn rat a tst data st of siz 10,000 from th sam procss To gt run th algorithm abo with = 100 on th training data st, until a maxium of 1 , 000 upd ats has bn r achd lot t h train in g data
g
'
st, th th targt andst th nal hypothsis port rrorfunction on th , tst b ) Us th da ta st in a) a nd rdo ryth in g with c ) Us th data s t in a ) a nd rdo rythi ng w ith d ) Us th data s t in a) an d rdo rythi ng with ) Com par th r su lts tha t you gt from a) to d )
g on th sam gur
' = 1 ' = 001 ' = 00001
T h algor ith a bo is a ariant o f th s o calld Adali n euron) algorithm r prcptron larning
daptive Linear
Considr a sampl of 10 ma rbls drawn i ndpndntly fr om a bi n that holds rd a nd grn marbl s Th probab i ity of a rd ma rbl is µ Forµ= 005 µ = 05 an d µ = 08 comp ut th proba bi lity o f gtting no rd marbls = 0 in th l lowing ca ss
Problem 16
a) W dra w on y on such s am pl Comput th probab il ity that = 0 b) W draw 1, 000 indpndnt samps Comput th pro babi ity tha t at last) on of th sampls has = 0 c) pat b) r 1, 000, 000
indpndnt s am pls
35
1 . PROBEMS
1 . HE EA RNIN PROBEM
Problem 1. 7
A sample of heads and tals s created by tossng a con a n um ber o f tmes ndepend ently Assume w e have a n um ber of cons that gene rate dferent sam ples ndep end ent ly For a gv en con let the probab lty of heads (probablty of error) be µ The probablty of obtanng heads n N tosses of ths co n s gven by the b nom a l dstrb uton
Remember that the tran
ng err or s
(a ) Assue the sam ple s ze ( N ) s 10 I f all the c ons ha ve µ = 5 compute the probablty that at least one con wll have = r the case of 1 con 1 cons 1 co ns Repeat r µ = 8. (b) For the case N = 6 and 2 cons wth µ = .5 r both cons plot the probablty
P[mix I V - µi > E ] or E n the ra nge O 1 ] (the mx s over co ns) On the sa me pl ot show the
bound that would be ob taned usng the Hoe fdng I nequ alt y Remember that r a sngle co n the Hoef d ng bo und s
+
+
[Hint Use P[A or B] = PA] P[B] P[A and BJ = P[A] P[B] P[A]P[B] where the last equality llows by independene to evaluate P[mx ]}
Problem 1.8 The Hoefdng Inequalty s one rm of the law of large numbers. One of the smplest frms of that law s the Chebyshev Inequality whch you wll prove here
t s a non nega tve random va rabl e prove that fr any > [t � ] (t)/ (b) If u s a ny random varabl e wth mean µ and varance prove that r any > [(u µ) ] [Hint Use a (c) If u1 U are d random varables each wt h mean µ and varance and u = l U prove th at fr any > (a) If
[ (u µ) 2 ]
. Na
Notce that the RHS of ths Chebyshev Inequalty goes down lnearly n N, wh le the c ounterpart n H oefdng s I neq ual ty goes down exponental ly In roblem 1.9 we d evelop an exp onental bound usng a s m lar a ppr oach
36
.
5 PROBLEMS
HE EARNIN PROBLEM
Prob lem 1 9
n t his proble we derive a fr of the law of la rge n u bers that has an eponential bound called the Cheo bound. We cus on the siple case o f lipping a f ai r coin a nd use an a ppro ach si il ar to roble 18 (a) Let be a ( nite) rando variable positive paraeter. f
a be a positive constant and
() ()
prove that
s monotonay nreasng n. u, , uNbe iid rando variables and let () ( ) (r any prove that [Hnt
(b) Let
be a
l n
f
[un O] [u n ] � (fai r coin ) Eval uate () as and iniize () with respect to fr ed a, O
hence the boun d is eponential ly decre asing in N (c) Suppose a function of
{x , x , . . , xN, N , . . . , xN M } {, } f (x , ), , (xN, YN ) h f M Eo(h, f) M h(xN ) f(N ) . l (a) Say f (x) r all x and odd and M N h(x) -,, r x k a nd isotherwise What is Eo(h, f)? (b) We say that a targ et function f can generate V in a noiseless setting if Yn f (xn) r all (xn, Yn ) E D For a ed V of size N how a ny possible f X can generate V in a noiseless setting? (c) For a given hypothesis a nd a n i nteger bet ween and M how any of those f i n (b) satis fy Eo (h, f) ?
Problem 110
Assue that X and with an unknown target function X The training data set V is . Dene the oftranngset errorof a hypothesis with respect to by
{
( d) For a given hyp othesis h, if all those that generate V in a noiseless settin g are equ al ly li kely i n proba bi lity what is the epected of trai ni ng set error
[Eo(h, )]?
(continued on next pge)
1 HE EARI G PROBEM
1.5. PROBEMS
(e) A deteri nistic al gorith A is dened as a procedure that takes V as an input and outputs a hypothesis A(V) Argue that fr a ny two deter i nistic algoriths A and
2
h
You have now proved that in a noiseless setting r a ed
V if al l possible
are equ al lyo likely any wo det tersgeneral o f the epected traini ngt set erroeri r Sinistic i la algori r resultsth s canarebeequiva pr ovedlent rinore settings
Problem 1 11
The at ri which ta bu lates the cos t of various er rors r the CA and Superarket applications in Eaple 11 is called a rs or oss
matrx
For the two risk atrices in Eaple 11 epl icitly write do wn th e i n saple error that on e should i ni ize to o btain g This in-saple error should weight t he di ferent type s of error s based on t he risk at ri [Hnt: Consder 1 and - 1 separatey.]
Ei Yn
Yn =
proble tigates w ch easu re Problem 12re sult Th ca n cha nge 1the of isthe l earnini nves g proces s ho You ha an veginNg the dataerror points 1 :
YN and wish to estiate a representative value (a) f your a lgo rith is to nd the hypoth esis su of squared deviations
h tha t ini iz es th e in
saple
Ei(h) (h - Yn ) 2 , nl then sho w that your es tiate will be the in sapl e ean
(b)
hm N1 Yn nl f your a lgorith is to nd the hypothe sis h tha t ini iz es the in su of absolute deviations
Ei (h)
saple
h- , nl Yn
hm
then sho w th at your es tiate will be the in sapl e edian which is any val ue r which ha lf the data points are at o st and half t he data points are at least (c) Suppose is perturbed to where So the single data point beco es a n outl ier What hap pens to y ou r two estiator s and
Y
Y
hm YN
hm?
38
hm
hm
Chapr
aining versus Testing Bere the nal exam, a prossor may hand out some pratie problems and solutions to the lass Although these problems are not the exat ones that will appear on the exam, studying them will help you do better They are the training set in your learning If the profssors goal is to help you do better in the exam, why not give out theis not examtheproblems themselves? nieistry the exam goal in and of itself Well, The goal fr you toDoing learnwell the in ourse material The exam is merely a way to gauge how well you have learned the material If the exam problems are known ahead of time, y our perrmane on them will no longer aurately gauge how well you have learned. The same distintio n between trainin g and testing happens in learnin g om data In this hapter, we will develop a mathematial theory that haraterizes this dist inti on We will also disuss the oneptual and pratial imp liations of the ontrast between training and testing
@
21
Ther y f Gene ralizatin
The outofsample error
Eou measures how well our training on D has geer
to data we spae have not bere ifEou based the perrmane alizedthe over entirethat input X . seen Intuitively, we iswant to on estimate the val ue of Eou using a sample of data points, these points must be esh test points that have not been used r training, similar to the questions on the nal exam that have not been used r pratie The in sample error Ein, by ontrast, is based on data points that have been used r training It expressly measures training perfrmane, similar to your perrmane on the pratie probl ems that you got b ere the nal exam Suh perrmane has the benet of looking at the solutions and adjusting aordingly, and may not reet the ultimate perfrmane in a real tes t We began the analysis of in-sample error in Chapt er 1 , and we will extend this 39
2.1. HEORY OF ENERALIZATION
2. RAININ VERSUS ESTIN
analysis to th e general ase in this hapter We will also make the ontrast between a training set and a test set more preise A word of warni ng: this hapter is the hea viest in this boo k in terms of mathematial abstration To make it easier on the notsomathematially inlined, we will tell you whih part you an safly skip without losing the plot The mathematial results provide ndamental insights into learning om data, and we will interpret these results in pratial terms
eneralization error. We have already disussed how the value of Ein does not always generalize to a similar value of Eout eneralization is a key issue in learning One an dene the geealizatio error as the disrepany between Ei and Eout The Hoeding Inequality provides a way to haraterize the generalization error with a probabilisti bound,
6
r any E > 0 This an be rephrased as fllows Pik a tolerane level example 005 , and assert with probability at least that
2M
2 2 6
8
fr
2
We rer to the type of inequality in as a geealizatio boud beause it bounds Eout in terms of Ein To see that the Hoeding Inequality implies this generalization bound, we rewrite as fllows: with probability at least IEout Ein E whih implies Eout Ein E We may now llows om whih E ln and identi otie that the other side of IEout E in E also holds, that is, Eout Ein E r all This is important r learning, but in a more subtle way ot only do we want to know that the hypothesisg that we hoose (say the one with the best training error) will ontinue to do well out of sample (ie, Eout Ei E but we also want to be sure that we did the best we ould with our (no other hypothesis has Eout ) signiantly better than Eout (g)) The Eout () Ein ) E diretion of the bound assures us that
2 ME E
+
2
2
we do hosen muh better beause every hypothesis with a higher Ein than the ouldn't g we have will have a omparably higher Eout or error bar if you will, depends The error bound ln in on the size of the hypothesis set . If is an innite set, the bound goes to innity and beomes meaningless Unfrtunately, almost all interesting learning moels have innite inluding the simple pereptron whih we disussed in Chapter In order t o study general ization in suh m oels, we need t o derive a oun terpart to that deals with innite We woul like to replae with
2
1 oetes generalzaton error s used
M
anoter na e r ut but not n ts book 40
. RAINING VERSUS ESTING
. 1 . HEOR Y OF ENE RALI ZATIO N
something nite, so that the bound is mea ningfl. To do this, we notie that the way we got the tor in the rst plae was by taking the disjuntion of events
"JEin(h1 ) Eou(h1)J > E" or "JEi n (h2 ) Eou(h 2 ) J > E" or (2.2) whih is guaranteed to inlude the event " JEin (g) Eou (g) J > E" sine g is al ways one o f the hypotheses in . We then over-estimated the probability using the union bound. Let be the (ad ) event that "JE in(h m) Eou( hm)J > E" Then,
If the events , , , are strongly overlapping, the union bound beomes par tiularly loose s illustrated in the gure to the right r an example with 3 hypotheses the areas of dierent events orrespond to their probabilities The union bound says that the total area overed by 1, , or is smaller than the sum of the individual ar eas, whih is true but is a gross overestimate when the areas overlap heavily as in this ex ample The events "JEin(hm) Eou(hm)J > E" m 1 , are often strongly overlap ping. If h1 is very similar to h 2 r instane, Eou(h1)J > E" and "JEin(h 2 ) Eou(h 2 ) J > E" are the two events JEin(h1) likely to oinide r most data sets . In a typial learning model, many hy pothes es are indeed very similar. If you take the perep tron model r instane, as you slowly vary the weight vetor w you get innitely many hypotheses that dier om eah other only innitesimally The matheatial theory of generalization hinges on this observation One we properly aount r the overlaps of the dierent hypotheses, we will be able to replae the number of hypotheses in (21) by an eetive number whih is nite even when is innite, and establish a more usefl ondition under whih Eou is lose to Ein
2. 1. 1
Eeive Nmber o f Hypoheses
We now introdue the groth fuctio the quantity that will rmalize the eetive number of hypothese s The growth fntion is what will repla e 41
2 RAININ VERSU S ESTING
2 1 HEOR OF ENERALI ZATI ON
in he generalizaion bound (2 1 ) I is a combinaorial quaniy ha cap ures how dieren he hypoheses in are, and hence how much overlap he dieren evens in (22) have We will sar by dening he growh ncion and sudying is basic prop eri es ex , we will show how we can bound he value of he growh ncio n Finally, we will show ha we can replace in he generalizaion bound wih he growh ncio n These hree seps will yield he generalizaio n bound ha we need, which applies o innie We will cus on binary arge fncions r he purpose of his analysis, so each maps X o { 1 , 1 } The deniion o f he growh unc ion i s based on he number o f dieren hypoheses ha can implemen, bu only over a nie sample of poins raher han over he enire inpu space X If is applied o a nie sample , , X, we ge an N-uple (), , ( ) of ± 's uch an N-uple is called a dichotom since i splis , , ino w o groups hse poins r which is 1 and hose r which is 1 Each generaes a dichoomy on , , , bu wo dieren 's may generae he same dichomy if hey happen o give he same paern of ±1 's on his paricular sample
Let , these poits are deed b
Dnition 21
, X
The dichotomies geeated b
o
)) I
(, , ) { ((), , ( } (23) One can hin of he dichoom ies ( , , ) a se of hypoheses jus lie is, excep ha he hypoheses are seen hrough he eyes of N poins only A larger ( , , ) means is more diverse' generaing more dichoomies on , , The growh ncion is based on he number of dichoomies
Dnition 22
The groth fuctio is deed for a hpothesis set
b
here deotes the cardialit umber of elemets of a set In words, m (N) is he maximum number of dichoomies ha can be gen eraed by on any N poins To compue (N) , we consider all possible choices of poins , , om X and pic he one ha gives us he mos dichoomies Lie , (N ) is a measure of he number of hypohe ses in , excep ha a hypohesis is now considered on poins insead of he enire X F any , since ( , ,) {1, } (he se of all possible dichoomies on any N poins), he value of (N) is a mos {1,} , hence (N ) 2
If is capable of generaing all possibl e dichoomies on , (, , ) { 1, 1 } and we say ha can shatter , signies ha is as diverse as can be on his paricular sample 42
, , hen , This
2 . 1 HEORY OF ENERAZA TON
2 RANN VERSUS ESTN
•
a
c
Figure llstration of the growth fnction r a two dimensional er cetron h dichotomy of red verss le on the 3 colinear oints in art a cannot e generated y a ercetron, t all 8 dichotomies on the 3 oints in art can y contrast the dichotomy of red verss le on the 4 oints in art c cannot e generated y a ercetron t most 14 ot of the ssile 16 dichotomies on any 4 oints can e generated
Eapl 21 If X is a Euclidean plane and is a wo-dimensional percep ron, wha are () and () Figure 2 a shows a dichoomy on poins ha he perceron canno generae, while Figure 2 b shows anoher poins ha he percepron can shaer, generaing all 2 8 dichoomies
Because deniion based on he 2 maximum number of di of ( N)heis case choomies,he() 8 inofspie in Figure a) In he case of poins , Figure 2 ( c shows a dichoomy ha he perc epron canno generae One can verify ha here are no poins ha he percepron can shaer The mos a percepro n can do on any poins is dichoomies ou of he possible , where he 2 missing dichoomies are as depiced in Figure 2 c wih blue and red corresponding o - , + or o +, Hence,
()
D
Le us now illusrae how o compue H(N) r some simple hypohesis ses These examples will conrm he inuiion ha ( N) grows fser when he hypohesis se becomes more complex This is wha we expec of a quaniy ha is mean o replace in he generalizaion bound 2 )
Eapl 22 Le us nd a frmula r H(N) in each of he llowi ng cas es
Positive s : co nsiss of all hypoheses h { , +} of he frm h ( ) sign ie, he hypoheses are dened in a one-dimensional inpu space, and hey reurn o he lef of some value and + o he righ f
2 1 . HEORY OF EN ERALI ZATIO N
2. RAININ VERSUS ESTIN
N(N)+,
N
N
To compue m1 we noice ha given poins , he line is spli by he poins ino regions The dichoomy we ge on he poins is decided by which region conains he value As we vary we will ge N + dieren dicho omies ince his is he mos we can ge r any poins, he growh funcion is
N
N(N)
N+
oice ha if we piced poins where some of he poins coincided which is allowed , we will ge less han dichoomies This does no aec he value of m1 sinc e i is dened based on he maximum number of dichoomies
Positive itevals: consiss of all hypoheses in one dimension ha reurn + wihin some inerval and - oherwise Each hypohesis is specied by he wo end values of ha inerval
(N), N + N (( N) N+l + N2 +N,N + N ) J2 + (N) N N + + N
To ha given poins, we hege lineis isdecided again m1 inowe noice splicompue regions by he poins The dichoomy by which wo regions conain he end values of he inerval, resuling in dieren dichoomi es If boh end values fll in he same region, he resuling hypohesis is he consan - regardless of which region i is Adding up hese possibiliies, we ge
N
m1 m1 N )
=
oice ha grows as he square of ear m1 of he simpler posiive ray case
3. Cove sets { -,
faser han he lin
co nsiss of all hypohes es in wo dimensions
h:
} ha are posiive inside some convex se and negaive elsewhere
a se is convex if he line segmen connecing any wo poins in he se lies enirely wihin he se To compue m1 in his case, we need o choose he poins care lly Per he nex gure, choos e N poins on he perimeer of a circle ow consider any dichoomy on hese po ins , assigning an arbirary pa ern of ± 's o he poins If you connec he poins wih a polygon, he hypohesis made up of he clos ed inerior of he polygon which has o be convex since is verices are on he perimeer of a circle agrees wih he dichoomy on all poins For he dichoomies ha have less han hree poins, he convex se will be a line segmen, a poin, or an empy se
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
This mean s ha any dichoomy on hese N poins can be realize d using a convex hpohe sis so manages o shaer hese poins and he grow h funcion has he maximum possible value
oice ha if he N poins were chosen a random in he plane raher han on he perimeer of a circle many of he poins would be inernal' and we wouldn' be able o shaer all he poins wih conve x hypoheses as we did r he perimeer poins However his doesn' maer as far as (N) is concerned since i is dened based on he maximum ( in his case)
D
I is no pracical o ry o compue ( N) r every hypohesis se we use Forunaely we don' have o ince (N) is mean o replace in (), we can use an upper bound on (N) insead of he exac value and he inequaliy in () will sill hold eing a good bound on (N) will prove much easier han compuing ( N) iself hans o he noion of a brea
poit
f o data set of size ca be shattered b the is said to be a brea poit for If is a brea poin hen () Example shows ha is a Dn ition 2 . .
brea poin r wo-dimensional perceprons In general i is easier o nd a < brea poin r han o compue he ll growh ncion r ha
Eri 2.1 By inspet in nd a break pint k fr each hypthesis se in Eaple 22 Verify th at mk < using the r u las der ive d in tha E apl e.
k
if ther e is ne .
We now use he brea poin o derive a bound on he growh ncion (N) r all values of N For example he ac ha no poins can be shaered by
RAININ VERSUS ESTIN
. 1 . HEORY OF ENER AIZATION
he wodimensional percepr on pus a signican consrain on he nu mber of dichoomies ha can be realized by he percepron on or more poins We will exploi his idea o ge a signican bound on (N) in general
2. 1. 2
Bodig the Groth Fctio
The mos imporan fc abou growh ncions is ha if he condiion = 2 breas a any poin, we can bound ( N) r all values of N by a simple polynomial based on his brea poin The c ha he bound is polynomial is crucial Absen a brea poin (as is he case in he convex hypohesis example), ( N = 2 r all N If (N replaced in Equaln ion ( 2 1 ) , he bound on he generalizaion error would no go o zero regardless of how many raining examples N we have However, if ( N) can be bounded by a polynomial any polynomial , he generalizaion error will go o zero as N This means ha we will generalize well given a sucien number of examples
(N)
a kip: If you rus our mah, you can he llowing par wihou compromising he sequence A similar green box will ell you when rejoin To prove he polynomial bound, we will inroduce a combinaorial quaniy ha couns he maximum number of dichoomies given ha here is a brea poin, wihou having o assume any paricular rm of This bound will herere apply o any
) is the maimum umber of dichotomies o N poits such that o subset of size of the N poits ca be shattered b these di chotomies The deniion of (N, assumes a brea poin hen ries o nd he Dnition 24 (N,
mos dichoomies on N poins wihou imposing any frher resricions
ince ( N, is dened as a maximum, i will serve as an upper bound r any (N ) )ha has a brea poin ;
(N ) (N, )
if is a bre a poin r
The noaion comes om inomial' and he reason will become clear shorly To evaluae (N, ) we sar wih he wo boundary condiions = 1 and N = 1 (N, 1) (, )
1 2 r
>
1
. 1 . HEORY OF ENER AIZATIO N
RAININ VERSUS ESTIN
B ( N, 1) = 1 r all since if no subse of size 1 can be shaered, hen only one dichoomy can be allowed A second dieren dichoomy mus dier on a leas one poin and hen ha subse of size 1 would be shaered B( , k) = r k > 1 since in his case h ere do no even exis subses of size k; he consrain is vacuously rue and we have possible dichoomies ( + 1 and 1 ) o n he on e poin o develop a recursion Consider N in and k and B (N,now k) assume , ry he We dichoomies deniion where no k poins can be shaered We lis hese dichoomies in he llowing able, # of rows
X1 X2
XN - 1 XN +1 +1
+1 1
+1 1 1 +1 +1 1 1 1
1 1 +1 +1
1 +1
+ 11 11 +1 1 1 1
+1 1 +1 +1
+1 1 1 1
+1 1
+ 1 +1 1 +1
S1
S2
where x1 · · · XN in he able are labels r he N poins of he dichoomy We have chosen a convenien order in which o lis he dicho omies, as llows Consider he dichoomies on x · · X N · ome dichoomies on hese N poins appear only once (wih eiher +1 or 1 in he X N column, bu no
l
boh) dichappear oomieswice, in heonce se S1 on he We rscollec poins wihThe + 1remaining and once dichoomies wih 1 in N 1hese he X N column We collec hese dichoomies in he se S2 which can be divided ino wo equal pars, S and (wih +1 and 1 in he XN column, respecively) Le S1 have rows, and le and have rows each ince he oal number of rows in he able is B(N, k) by consrucion, we have
t
B(N, k)
st
= +
()
The oal number of dieren dichoomies on he rs N 1 poins is given by + ; since and S are idenical on hese N 1 poins, heir di choomies are redundan ince no subse of k of hese rs N 1 poins can
st
.
. .
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
be shaered (since no -subse of all N poins can be shaered) we deduce ha (N ) ()
-
-
of he rs N poins can by deniion of Furher no subse of size be shaered by he dichoomies in If here exised such a subse hen aing he corresponding se of dichoomies in and adding XN o he daa
st
poins yields a subse of size ha is shaered which we now canno exis in his able by deniion of ( N, ). Therere (N
,
)
()
ubsiuing he wo Inequaliies () and () ino (), we ge (N,
) (N
-
)
(N
)
()
We can use () o recursively compue a bound on (N, ) as shown in he llowing able
N
\
where he rs row (N ) and he rs column ( ) are he bound ary condiions ha we already calculaed We can also use he recursion o bound ( N, k) analyically
Lemma 2 (auer's Lemma) (N, )
The saemen is rue whenever or N by inspecion The proof is by inducion on N Assume he saemen is rue f r all N N and all We need o prove he saemen for N N and all ince he saemen is already rue when = (fr all values of N) by he iniial condiion we only need o worry abou By () ,
(N
, ) (N , ) (N - )
.
.
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
Applying he inducion hypohesis o each erm on he RH, we ge ( + )
<
+
+
+
c
+
+
N°t
"
( 0 ) ( i01 )
+ where he combinaorial ideniy has been used . This ideniy can be proved by noicing ha o calculae he number of ways o pic i objecs om N + disinc objecs, eiher he rs objec is included, ways, or he rs objec is no included, in ways We have in hus proved he inducion sep, so he saemen is rue r all and
0
0
7
I urns o ha (N ) in fc equals (see Problem ) bu we only need he inequaliy of Lemma o bond he growh ncion For a given brea poin he bound is polynomial in N as each erm in he sum is polynomial (of degree i ) ince ( N ) is an upper bound on any (N) ha has a brea poin we have proved
7
End a kip: Those who sipped are now rejoining us The nex heorem saes ha any growh fncion ( N) wih a brea poin i s bounded by a polyno mial Theorem 24 If ()
< r some value hen ()
r all N The RH is polynom ial in of degree
H
The implicaion of Theorem is ha if has a brea poin, we have wha we wan o ensure good generalizaion; a polynomial bound on (N)
.
.
RAININ VERSUS ESTING
xr
(a ) erf the b ou n of eoe
HEORY OF ENERAIZATON
n the ree cas es o a l e 2
( ostve ras conssts o all hotheses n one enson f he sgn () oste nterals onss ts o a l l hyoteses n one eson ta a re pos tve w t n soe nterv al a n negatve elsew here.
( ) onve sets onss ts o a l l hotes es n tw o ens ns hat a re ost ve nse soe one set an nega ve elsewhee. ote ou an use e reak onts ou un n Eercse . (b oes he re est a htss s et wh ch (where s he large st nteg e )
Lj
2. 1. 3
/ 2
The VC Dimesio
Theorem 2.4 bounds he enire growh ncion in erms of any br ea poin The smaller he br ea poin he bee r he bound This leads us o he l lowing deni ion of a single parameer ha characerizes he grow h fncion
Dnition 25
Th apnikChrvonnkis dimnsion of a hypothsis st
. If by d N( simply is thd( lar gst=valu 2dnotd ll d N thn o f N for hich N 2or for If d is he C dimension of hen k = d + 1 is a brea poin r since ( N canno equal 2 r any N > d by deniion I is easy o see ha no smaller brea poin exiss since can shaer d poins hence i can also shaer any subse of hese poins
xr
Coute the C ensn o of E ercse . (a .
r e ho ess sets n
ince k d + 1 is a brea poin r Theorem erms of he C dimension
N
dvc
N
2.4
arts ( (
can be rewrien in
()
Therere he VC dimension is he order of he polynomial bound o n ( N I is also he bes we can do using his line of reasoning because no smaller brea poin han k d + 1 exiss The rm of he polynomial bound can be rher simplied o mae he dependency on d more salien We sae a usel rm here which can be proved by inducion Problem 2.5.
50
(2.10
.
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
ow ha he growh ncion ha s been bounded in erms of he C dimen sion we have only one more sep lef in our analysis which is o replace he number of hypoheses in he generalizaion bound () wih he growh ncion (N) If we manage o do ha he C dimension will play a pivoal role in he generalizaion quesion If we were o direcly repla ce by H(N) in (), we would ge a bound of he rm
dvcH
Unless = we now ha H(N) is bounded by a polynomial in N hus ln (N) grows logarihmically in N regardless o f he order of he poly nomial and so i will be crushed by he facor Thereore r any xed olerance 8 he bound on Eou will be arbirarily close o Ein r sucienly large N Only if = will his argumen fail as he growh ncion in his case is exponenial in N For any nie value of he error bar will converge o zero a a spe ed deermined by since is he order of he polynomial The smaller is he aser he convergence o zero I urns ou ha we canno jus replace wih (N) in he generaliza
dvcH
dvc,
dvc
dvc
dvc,
ion bound bu general raher we need o mae oher and adjusmens we play will see () he shorly. However idea above is correc willassill he role ha we discussed here One implicaion of his discussion is ha he re is a division of models ino wo classes. The good models ' have nie and r sucienly large N Ein will be close o Eou r good models he in-sample perfrmance generalizes o ou of sample The bad models ' have innie Wih a bad model no maer how large he daa se is we canno mae generaliaion conclusions om Ein o Eou based on he C analysis Because of is signican role i is worhwhile o ry o gain some insigh abou he C dimension bere we proceed o he frmaliies of deriving he new generaliaion bound One way o gain insigh abou is o ry o compue i fr learning mo dels ha we are fa miliar wih Percepr ons are on e case where we can compue exacly Thi s is don e in wo se ps Firs we show ha is a leas a cerain value hen we show ha i is a mos he same value There is a logical d ierence in arguing ha is a leas a cerain value as opposed o a mos a cerain value This is because
dvc
dvc,
dvc
dvc
dvc
dvc 2 N
2
dvc
dvc
here it
D of size N such ha shaers D
hence we have dieren conclusions in he ollowing cases.
There is a se of N poins ha can be shaered by In his case we can conclude ha N
dvc 2
2 In some cases wit innite suc as te convex sets tat we discussed alternative analysis bed on an average growt function can establis good generalization beavior
RAININ VERSUS ESTIN
HEORY OF ENERAIZATION
Any se of poins can b e shaered by In his case , e have more han enough inrmaion o conclude ha
There is a se of poins ha canno be shaered by Based only on his inrmaion, e canno conclude anyhing abou he value of
o se of poins can be shaered by In his case, e can conclude ha < xeri
d
onsie he i nu sae in lu ing he n stnt o oi nate iensin e eretrn wi = ow at e eters cuntin g is eatl sowig tat i is at l east t ost as llow s a sow at c 1 n tat te rer n oit s in can sater. su osu m
wose rows ese e os e se e osgu/ o gue e eero se ese os
b
show tat c sow h no set oints in can e shatee te ereon Reee o e use e s e e
ers o e e o e e eee s mes some eo s e o o e oe eors u oose e ss o ese e veos eu e e ass o e ee e w e e ue ere s ome om co e memee eee
+
The C dimension of a dimensional percepron is indeed This is consisen ih Figure fr he case = which shows a C dimension of The percepron case provides a nice inuiion abou he C dimension, since + is also he number of parameers in his model One can view he C dimension as measuring he eecive number of parameers The more parameers a model has, he more diverse is hypohesis se is, hich is reeced in a larger value of he growh ncion H ( ) In he case of perceprons, he eecive parameers correspond o explici parameers in he model, namely W In oher models, he eecive parameers may be less obvious or implici The C dimension measures hese eecive parameers or degrees of eedom ha enable he model o express a diverse se of hypoheses Diversiy is no necessarily a good hing in he conex of generalizaion For example, he se of all possible hypoheses is as diverse as can be, so H() = r all and = In his case, no generalizaion a all is o be expeced, as he nal version of he generalizaion bound will show
d
3
{} d is consider ed diensional since te rs t coordinate is xed
..
RAININ VERSUS ESTIN
2 1
HEORY OF ENERAIZATION
The VC Geeralizati Bud
If we reaed he growh funcion as an eecive number of hypoheses and replaced in he generalizaion bound ( 2 1) wih (N) he resuling bound would be () 1 (211) Eout(g) Ein(g) ln
N
+
I urns ou ha his i s no exacly he rm ha will hold The quaniies in red need o be e chnically modied o mae (2 1 1 ) rue The correc bound which is called he C generalizaion boun d is given in he llowing heorem; i holds r any binary arge ncion any hypohesis se any learning algorihm and any inpu probabiliy disribuion P
A
Thor 2.5 (C generalizaion bound) For any olerance
Eout(g) wih probabiliy
1
Ein(g)
+
N
ln
> 0,
()
(212)
If you compare he blue iems in (2 12 ) o hei r red counerpars in ( 2 1 1 ) you noice ha all he blue iems move he bound in he weaer direcion How ever as long as he C dimension is nie he error bar sill converges o zer o (albei a a slower rae) since (N) is also polynomial of order in N jus lie (N) This means ha wih enough daa each and every hypoth esis in an innie wih a nie C dimension will generalize well om Ein o Eout The ey is ha he eecive number of hypoheses represened by he nie growh funcion has replaced he acual number of hypoheses in he bound The C generalizaion bound is he mos imporan mahemaical resul in he heory of learning I esablishes he asibiliy of le arning wih innie hypohe sis ses ince he rmal proof is somewha lenghy and echnical we illusrae he main ideas in a sech of he proof and include he rmal pro of as an appendix There are wo pars o he proof; he jusicaio n ha he growh funcion can replace he number of hypoheses in he rs place and he reason why we had o change he red iems in (211) ino he blue iems in (212)
dc
kth o th proo.
The daa se is he source of randomizaion in he srcinal Hoeding Inequaliy Consider he space of all possible daa se s Le us hin of his space as a canvas' (Figure 2 2(a) ) Each is a poin on ha canvas The probabiliy of a poin is deermined by which X 's in X happen o be i n ha paricul ar and is calculaed based on he disribuion P over X . Les hin of probabiliies of dieren evens as areas on ha canvas so he oal area of he canvas is 1
.
RAININ VERSUS ESTIN
2 1 HEORY OF ENERA IZATIO N
space of daa ses
a oeng nequalty b non oun
c oun
Figure 22 llustraton of the proof of the boun, where the canvas represents the space of all ata sets, wth areas corresponng to probabl tes a or a gven hypothess, the colore ponts correspon to ata sets where oes not generalze well to Eout· he oeng nequalty guar antees a small colore area. b or several hypotheses, the unon boun assumes no overlaps, so the total colore area s large. c he boun keeps track of overlaps, so t estmates the total area of ba generalzaton
to be relatvely small
For a given hypohesis he even "IEi(h) Eout(h)I > E" consiss of all poins r which he saemen is rue For a paric ular h, le us pain all hese bad' poins using one color Wha he basic Hoeding Inequaliy ells us is ha he colored area on he canvas will be small Figure 22 a ow, if we ae anoher he even "IEi(h) Eout(h)I > E" may conain dieren poins, since he even depends on Le us pain hese poins wih a dieren color The area covered by all he poins we colored will be a mos he sum of he wo individual areas, which is he case only if he o areas have no poins in common This is he wors case ha he union bound consi ders If we eep hrowing in a new colored area r each and never overlap wih previous colors, he canvas will soon be mosly covered in coor Figure 22 b Even if each h conribued very lile, he sheer number of hypoheses will even ually mae he colored are a cover he whole canvas This was he problem wih using he union bound in he Hoed ing Inequaliy ( 1 6 ) , and no aking he over laps o f he co lored areas ino consider aion The bul of he C proof deals wih how o accoun r he overlaps Here is he idea If you were old ha he h ypoheses in are such ha each poin on he canvas ha is colored will be colored 100 imes because of 100 dieren 's , hen he oal colored area is now 1 /100 of wha i would have been if he colored poins had no overlapped a all This is he essence of he C bound as illusraed in Figure 2 2 c The argumen goes as llows
,
,
,
. RAINING VERSUS ESTING
..
NTERRETING THE OUND
Many hypoheses share he same dichoomy on a given D since here are niely many dichoomies ev en wih an innie number of hypohese s Any saemen based on D alone will be simulaneously rue or simulaneously alse r all he hypoheses ha loo he same on ha paricular D Wha he growh fncion enables us o do is o accoun fr his ind of hypohesis redundancy in a precise way so we can ge a fcor similar o he 100' in he above example When is innie he redundancy acor will also be innie since he hypoheses will be divided among a nie num ber of dichoomies Therere he reducion in he oal colored area when we ae he redundancy ino consideraion will be drama ic If i happens ha he number of dichoom ies is only a polynomial he reducion will be so dramaic as o bring he oal probabiliy down o a very small value This is he essence o f he proof of Theorem 25 The reason ( 2N) appears i n he C bound inse ad of (N) is ha he proof uses a sample of 2N poins insead of N poins Why do we need 2N poins The even "IE in( h) Eout(h )J > " depends no only on D bu also on he enire X because Eout (h) is based on X This breas he main premise of grouping 's based on heir behavior on D since aspecs of each h ouside of D aec he ruh of "JEin(h) Eout(h) J > " To remedy ha we consider he aricial even "ID are om based Ein(h) n(h)Jof >size" Ninsead on wo samples and DEeach This where is whereEin heand 2NEn comes I accouns fr he oal size of he wo samples D and D ow he ruh of he saemen I Ein(h) En (h)J > " depends exclusively on he oal sample of size 2N and he above redundancy argumen will hold Of course we have o jusi why he wosample condiion " JEin ( h) En (h)J > " can replace he srcinal condiion "JEin(h) Eout(h)J > " In doing so we end up having o shrin he 's by a fcor of and also end up wih a fcor of 2 in he esimae o f he overall probabiliy. This accoun s fr he insead of in he C bound and r having insead of 2 as he muliplicaive cor of he growh fncion When you pu all his ogeh er you ge he rmula in ( 2 12 )
2.2
Intting th Gnalizatin Bund
The C generalizaion bound (212) is a universal resul in he sense ha i applies o all hypohesis ses learning algorihms inpu spaces probabiliy disribuion s and binary arge fncions I can be exended o oher ypes of arge fncions as well iven he generaliy of he resul one would suspec ha he bound i provides may no be paricularly igh in any given case sinc e he same boun d has o cover a lo of dieren cases Indeed he bound is quie loose
2 RAININ VERSUS ESTIN
2.2. NTERRETIN THE OUND
.5
Suose e ave a simle learning moel ose grw function is ence se e oun o esti = mate he proai lit that ot will e ithin of given 1 aining examles. nt e estmate w be ros
mN =
n
Why is the C bound so loose? The slack in the bound can be attributed to a number of technical fctors Among them 1 The basic Hoeding Ine quality used in the proof already has a slack The inequality gives the same bound whether Eou is close to 05 or close to zero However the variance of Ein is quite dierent in these two cases Therefre having one bound capture both cases will result in some slack 2 Using H(N) to quantify the number of dichotomies on N points re gardless of which N points are in the data set gives us a worst-case estimate This does allow the bound to be independent of the pro ability distribution P over X. However we would get a more tuned bound if we considered specic XN and used I ( XN ) I or
H
its expected value instead of the upper bound For instance in the case of convex sets in two dimensio ns whichH(N) we examined in Exam ple 22 if you pick N points at random in the plane they will likely have fr fwer dichotomies than 2N while H ( N) = 2 N
Bounding H (N) by a simple poly nomial of order will contribute further slack to the C boun d
dvc, as given in (2 10)
Some eort could be put into tightening the C bound but many highly technical attempts in the literature have resulted in only diminishing returns The reality is that the C line of analysis leads to a very loose bound Why did we bothe r to go through the analysis then? To reasons First the C analysis is what establishes the asibility of learning r innite hypothesis sets the only k ind we use in practice Second although the bound is loose it tends to be equally loose r dierent learning models and hence is usel r comparing the generalization perrmance of these model s This is an observation om practical experien ce not a mathematical statement In real applications learning models with lower tend to generalize better than those with higher Because of this observation the C analysis proves usel in practice and some rules of thumb have emerged in terms of the C dimens ion For instan ce requiring that N be at least 10 x to get decent generalization is a popular rule of thumb Thus the C bound can be used as a guideline fr generalization relatively if not absolutely With this understanding let us look at the dierent ways the bound is used in practice
dvc
dvc
dvc
56
2. RAININ VERSUS ESTIN
2. 2. 1
2.2. NTERRETIN THE OUND
Sample omp lexity
The sample complexity denotes how many training examples are needed to achieve a certain generalization errmance. The perfrmance is specied by two parameters, and The error tolerance determines the allowed generalization error, and the condence parameter determines how often the error tolerance is violated How st grows as and become smaller indicates databound is needed to get good generalization We canhow usemuch the VC to estimate the sample complexity r a given learning model. Fix > and supose we want the generalization error to be at most om quation the generalization error is bounded by ln ln It llows that and so it suces to make
E
E
8.
E
N
8 0, (2.12),
E.
N
E 8
E.
N >- E ln 4m(2N) 8 suces to obtain generalization error at most E (with probability at least 1 8). This gives an imlicit bound r the sample complexity N, since N appears on both sides of the inequality If we replace m (2N) in (2.12) by its polynomial upper bound in (2.10) which is based on the the VC dimension, we get a similar bound N E ln 4((2N) 8dvc + l) ' (2.13) which is again implicit in N. We can obtain a numerical value fr N using simple iterative methods Eapl 26 Suppose that we have a learning model with dc = 3 and would like the generalization error to be at most 0.1 with condence 90% (so E = 0.1 and 8 = 0.1). How big a data set do we need? Using (2.13), we need N -> 0.1 n 0.1 + . ying an initial guess of N = 1, 000 in the RHS , we get N>
ln
x 1000) +
�
21 193.
0.1 N 21, 1930.1 ' N 30, 000. N 40, 000. 5, N 50, 000.4, 10,000, 10.
We then try the new value = in the RHS and continue this iterative process, rapidly converging to an estimate of If dc were a similar calculation will nd that For dc = we get ou can see that th e inequality suggests that the number of examples needed is approximately proportional to the VC dimension, as has been observed in practice The constant of proporti onality it suggests is which is a gss overestimate; a more practical constant of proportionality is closer to D 4 Te ter coplexty coes o a slar etapor n coputatonal coplexty
57
2. RANN VERSUS ESTNG
2. 2. 2
2.2. NTERRETN THE OUND
Pealy r Model Complexiy
Sample coplexiy xes he perfrmance parameers E generalizaion error and 8 condence parameer and esimaes how many examples are neede d In mos pracical siuaions, however, we are given a xed daa se , so is also xed In his case, he relevan quesion is wha perfrmance can we expec given his paricular The bound in answers his quesion:
N
8N (22) out( g ) � in (g ) + N ln
wih probailiy a leas
N
.
If we use he polynomial bound based on d insead of anoher valid bound on he ou-of-sample error,
m ( 2N), we ge (2 4) out (g) � in (g ) + N ln 4 ((2N)dvc8 + ) Eapl 7 Suppose ha N = 00 and we have a 90 condence require men 8 = 0) We could ask wha error bar can we oer wih his condence,
if 1 has d =
Using
we have
(20) � in (g) + 0 4 (2.5) out( g) . in (g ) (2+ 4),OO ln 4 wih condence � 90. This is a prey poor bound on out· ven if in = 0, may sill b e close o If N = , 000, hen we ge aut( g ) � in( g ) + 030, out D a somewha more respecable bound Le us look more closely a he wo pars ha make up he bound on in The rs par is and he second par is a erm ha increases as he C dimension of 1 increases
(2.2).
where
out
in , out( g) in (g) + (N, 8)
(26)
r(N, 8) <
N ln
8
+
One way o hink of 1 8) is ha i is a penaly fr model complexiy I when we use a more complex 1 penalizes us by worsening he bound on larger d If someone manages o a simpler model wih he same raining
r(N,
out
5
2.2. NTERRETIN THE OUND
2. RAININ VERSUS ESTIN
c
C
dimension,
vc
Figure 23: hen we use a more comlex learning model, one that has higher dimension , we are likely to t the training data better re sulting in a lower in samle error, but we ay a higher penalty or model comlexity combination o the two, which estimates the ou t o samle error, thus attains a minimum at some intermediate �•
error, they ill get a more fvorable estimate r Eou The penalty (N gets orse if e insist on higher condence (loer · and it gets better hen , e have more training examples, as e ould expect Although (N , goes up hen has a higher C dimension, Ei is liely to go don ith a higher C dimension as e have more choices ithin to t the data Therere, e have a tradeo: more complex models help Ei and hurt (N , The optimal model is a compromise that minimizes a combination of the to terms, as illustrated inrmal ly in Figure 23 2. 2. 3
The Test Set
As e ave seen, the generalization bound gives us a loose estimate of the out-of-sample error Eou based on Ei While the estimate can be usel as a guideline r the training process, it is next to useless if the goal is to get an accurate recast of Eou f you are developing a system r a customer, you need a more accurate estimate so that your customer nos ho ell the system is expected to perrm. An alternative approach that e alluded to in th e beginning of this chapter is to estimate Eou by using a test set, a data set that as not involve d in the training proc ess. The nal hypothesis g is evaluated on the test set , an d the result is taen an estimate of Eou We ould lie to no tae a closer loo at this approach. Let us call the error e get on the test set Ees When e report Ees as our estimate of Eou e are in ct asserting that Ees generalizes very ell to Eou After all Ees is just a sample estimate lie Ei Ho do e no 59
2 RAININ VERSUS ESTIN
2.2 NTERRETIN THE OUND
tat Ees genrazes e? W can anser ts quston t autorty no tat e av deveoped te teory of gnerazaton n concrte matematca ters Te ectve number of ypotess tat atters n te gnerazaton be avor of Ees s 1 Ter s ony one ypotess as fr as te test set s concerned and tat s te na ypotess g tat te tranng pase produced Ts ypotess ou d not cange f used a derent tst st as t ou d f usd a drent tranng set Terefre t smpe Hoedng nequaty s vad n t case of a tst set Had te coce of g ben aectd by te test set n any sape or rm t oudnt be consdered a test set any ore and te spe Hodng nquaty oud not appy Terere te generazaton bound tat appes to Ees s te spe Hoedng nequaty t one ypote ss Ts s a uc tgter bound tan te C boun For exampe f you av 1 , 000 data ponts n te test s et Ees be tn ± of Eou t probabty � Te bgger te test set you use t more accura te Ees be as an estate of Eou Eris
Exercise 2.6
A data set has 600 examples. To properly test the performance of the final hypothesis, you set aside a randomly selected subset of 200 examples which are never used in the training phase; these form a test set. You use a learning model with 1,000 hypotheses and select the final hypothesis g based on the 400 training examples. We wish to estimate Eout(g). We have access to two estimates: Ein(g), the in-sample error on the 400 training examples; and Etest(g), the test error on the 200 test examples that were set aside.
(a) Using a 5% error tolerance (δ = 0.05), which estimate has the higher 'error bar'?
(b) Is there any reason why you shouldn't reserve even more examples for testing?
Another aspect that distinguishes the test set from the training set is that the test set is not biased. Both sets are finite samples that are bound to have some variance due to sample size, but the test set doesn't have an optimistic or pessimistic bias in its estimate of Eout. The training set has an optimistic bias, since it was used to choose a hypothesis that looked good on it. The VC generalization bound implicitly takes that bias into consideration, and that's why it gives a huge error bar. The test set just has straight finite-sample variance, but no bias. When you report the value of Etest to your customer and they try your system on new data, they are as likely to be pleasantly surprised as unpleasantly surprised, though quite likely not to be surprised at all.

There is a price to be paid for having a test set. The test set does not affect the outcome of our learning process, which only uses the training set. The test set just tells us how well we did. Therefore, if we set aside some
of the data points provided by the customer as a test set, we end up using fewer examples for training. Since the training set is used to select one of the hypotheses in H, training examples are essential to finding a good hypothesis. If we take a big chunk of the data for testing and end up with too few examples for training, we may not get a good hypothesis from the training part even if we can reliably evaluate it in the testing part. We may end up reporting to the customer, with high confidence mind you, that the g we are delivering is terrible ☺. There is thus a tradeoff to setting aside test examples. We will address that tradeoff in more detail, and learn some clever tricks to get around it, in Chapter 4.

In some of the learning literature, Etest is used as a synonym for Eout. When we report experimental results in this book, we will often treat Etest based on a large test set as if it were Eout, because of the closeness of the two quantities.

2.2.4 Other Target Types
Although the VC analysis was based on binary target functions, it can be extended to real-valued functions, as well as to other types of functions. The proofs in those cases are quite technical, and they do not add to the insight that the VC analysis of binary functions provides. Therefore, we will introduce an alternative approach that covers real-valued functions and provides new insights into generalization. The approach is based on bias-variance analysis, and will be discussed in the next section.

In order to deal with real-valued functions, we need to adapt the definitions of Ein and Eout that have so far been based on binary functions. We defined Ein and Eout in terms of binary error; either h(x) = f(x) or else h(x) ≠ f(x). If h(x) and f(x) are real-valued, a more appropriate error measure would gauge how far h(x) and f(x) are from each other, rather than just whether their values are exactly the same. An error measure that is commonly used in this case is the squared error e(h(x), f(x)) = (h(x) − f(x))². We can define in-sample and out-of-sample versions of this error measure. The out-of-sample error is based on the expected value of the error measure over the entire input space X,

Eout(h) = E_x[ (h(x) − f(x))² ],

while the in-sample error is based on averaging the error measure over the data set,

Ein(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − f(x_n))².

These definitions make Ein a sample estimate of Eout, just as it was in the case of binary functions. In fact, the error measure used for binary functions can also be expressed as a squared error.
Exercise 2.7
For binary target functions, show that P[h(x) ≠ f(x)] can be written as an expected value of a mean-squared-error measure in the following cases:
(a) The convention used for the binary function is 0 or 1.
(b) The convention used for the binary function is ±1.
[Hint: The difference between (a) and (b) is just a scale.]
Just as the sample frequency of error converges to the overall probability of error (per Hoeffding's Inequality), the sample average of squared error converges to the expected value of that error (assuming finite variance). This is an instance of what is referred to as the 'law of large numbers', and Hoeffding's Inequality is just one form of that law. The same issues of the data set size and the hypothesis set complexity come into play, just as they did in the VC analysis.
2.3 Approximation-Generalization Tradeoff
The VC analysis showed us that the choice of H needs to strike a balance between approximating f on the training data and generalizing on new data. The ideal H is a singleton hypothesis set containing only the target function. Unfortunately, we are better off buying a lottery ticket than hoping to have this H. Since we do not know the target function, we resort to a larger model, hoping that it will contain a good hypothesis, and hoping that the data will pin down that hypothesis. When you select your hypothesis set, you should balance these two conflicting goals: to have some hypothesis in H that can approximate f, and to enable the data to zoom in on the right hypothesis.

The VC generalization bound is one way to look at this tradeoff. If H is too simple, we may fail to approximate f well, and end up with a large in-sample error term. If H is too complex, we may fail to generalize well because of the large model complexity term. There is another way to look at the approximation-generalization tradeoff, which we will present in this section. It is particularly suited to squared error measures, rather than the binary error used in the VC analysis. The new way provides a different angle; instead of bounding Eout by Ein plus a penalty term Ω, we will decompose Eout into two different error terms.
2.3.1 Bias and Variance

The bias-variance decomposition of out-of-sample error is based on squared error measures. The out-of-sample error is

Eout(g^(D)) = E_x[ (g^(D)(x) − f(x))² ],   (2.17)
where E_x denotes the expected value with respect to x (based on the probability distribution on the input space X). We have made explicit the dependence of the final hypothesis g on the data D, as this will play a key role in the current analysis. We can rid Equation (2.17) of the dependence on a particular data set by taking the expectation with respect to all data sets. We then get the expected out-of-sample error for our learning model, independent of any particular realization of the data set,

E_D[ Eout(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))² ] ]
 = E_x[ E_D[ (g^(D)(x) − f(x))² ] ]
 = E_x[ E_D[ g^(D)(x)² ] − 2 E_D[ g^(D)(x) ] f(x) + f(x)² ].
The term E_D[ g^(D)(x) ] gives an 'average function', which we denote by ḡ(x). One can interpret ḡ(x) in the following operational way. Generate many data sets D_1, ..., D_K and apply the learning algorithm to each data set to produce final hypotheses g_1, ..., g_K. We can then estimate the average function for any x by ḡ(x) ≈ (1/K) Σ_{k=1}^{K} g_k(x). Essentially, we are viewing g(x) as a random variable, with the randomness coming from the randomness in the data set; ḡ(x) is the expected value of this random variable (for a particular x), and ḡ is a function, the average function, composed of these expected values. The average function ḡ is a little counterintuitive; for one thing, ḡ need not be in the model's hypothesis set, even though it is the average of functions that are.
Exercise 2.8
(a) Show that if H is closed under linear combination (any linear combination of hypotheses in H is also a hypothesis in H), then ḡ ∈ H.
(b) Give a model for which the average function ḡ is not in the model's hypothesis set. [Hint: Use a very simple model.]
(c) For binary classification, do you expect ḡ to be a binary function?
We can now rewrite the expected out-of-sample error in terms of ḡ:

E_D[ Eout(g^(D)) ] = E_x[ E_D[ g^(D)(x)² ] − 2 ḡ(x) f(x) + f(x)² ]
 = E_x[ E_D[ g^(D)(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)² ]
 = E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] + (ḡ(x) − f(x))² ],

where the last reduction follows since ḡ(x) is constant with respect to D. The term (ḡ(x) − f(x))² measures how much the average function that we would learn using different data sets D deviates from the target function that generated these data sets. This term is appropriately called the bias:

bias(x) = (ḡ(x) − f(x))²,
63
2 . 3 ROXI MATION ENERAIZATION
2 RAININ VERSUS ESTIN
as it measures how much our learning model is biased away from the target function.⁵ This is because ḡ has the benefit of learning from an unlimited number of data sets, so it is only limited in its ability to approximate f by the limitations in the learning model itself. The term E_D[ (g^(D)(x) − ḡ(x))² ] is the variance of the random variable g^(D)(x),

var(x) = E_D[ (g^(D)(x) − ḡ(x))² ],

which measures the variation in the final hypothesis, depending on the data set. We thus arrive at the bias-variance decomposition of out-of-sample error,

E_D[ Eout(g^(D)) ] = E_x[ bias(x) + var(x) ] = bias + var,

where bias = E_x[bias(x)] and var = E_x[var(x)]. Our derivation assumed that the data was noiseless. A similar derivation with noise in the data would lead to an additional noise term in the out-of-sample error (Problem 2.22). The noise term is unavoidable no matter what we do, so the terms we are interested in are really the bias and the var.

The approximation-generalization tradeoff is captured in the bias-variance decomposition. To illustrate, let us consider two extreme cases: a very small model (with one hypothesis) and a very large one with all hypotheses.
Very small model. Since there is only one hypothesis, both the average function ḡ and the final hypothesis g^(D) will be the same, for any data set. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target f, and unless we are extremely lucky, we expect a large bias.

Very large model. The target function is in H. Different data sets will lead to different hypotheses that agree with f on the data set, and are spread around f in the red region. Thus, bias ≈ 0 because ḡ is likely to be close to f. The var is large (heuristically represented by the size of the red region in the figure).
One can also view the variance as a measure of 'instability' in the learning model. Instability manifests in wild reactions to small variations or idiosyncrasies in the data, resulting in vastly different hypotheses.

⁵ What we call bias is sometimes called bias² in the literature.
Example 2.8. Consider a target function f(x) = sin(πx) and a data set of size N = 2. We sample x uniformly in [−1, 1] to generate a data set (x_1, y_1), (x_2, y_2), and fit the data using one of two models:

H_0: Set of all lines of the form h(x) = b;
H_1: Set of all lines of the form h(x) = ax + b.

For H_0, we choose the constant hypothesis that best fits the data (the horizontal line at the midpoint, b = (y_1 + y_2)/2). For H_1, we choose the line that passes through the two data points (x_1, y_1) and (x_2, y_2). Repeating this process with many data sets, we can estimate the bias and the variance. The figures which follow show the resulting fits on the same (random) data sets for both models.

With H_1, the learned hypothesis is wilder and varies extensively depending on the data set. The bias-var analysis is summarized in the next figures.

[Figure: the average hypothesis ḡ (red), with var(x) indicated by the gray shaded region that is ḡ(x) ± √var(x). For H_0: bias = 0.50, var = 0.25. For H_1: bias = 0.21, var = 1.69.]

For H_1, the average hypothesis ḡ (red line) is a reasonable fit with a fairly small bias of 0.21. However, the large variability leads to a high var of 1.69, resulting in a large expected out-of-sample error of 1.90. With the simpler
model, the fits are much less volatile and we have a significantly lower var of 0.25, as indicated by the shaded region. However, the average fit is now the zero function, resulting in a higher bias of 0.50. The total out-of-sample error has a much smaller expected value of 0.75. The simpler model wins by significantly decreasing the var at the expense of a smaller increase in bias. Notice that we are not comparing how well the red curves (the average hypotheses) fit the sine. These curves are only conceptual, since in real learning we do not have access to the multitude of data sets needed to generate them. We have one data set, and the simpler model results in a better out-of-sample error on average as we fit our model to just this one data set. However, the var term decreases as N increases, so if we get a bigger and bigger data set, the bias term will be the dominant part of Eout, and H_1 will win. □
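The operational recipe of Example 2.8 (fit many independent data sets of size N = 2, then average) is easy to simulate. The sketch below is our own illustration, assuming NumPy and ignoring the zero-probability case of two identical x values; it reproduces numbers close to those quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)

K = 10000                        # number of independent data sets
x_test = np.linspace(-1, 1, 1000)

params_h0 = np.empty(K)          # b for h(x) = b
params_h1 = np.empty((K, 2))     # (a, b) for h(x) = a*x + b
for k in range(K):
    x = rng.uniform(-1, 1, 2)
    y = f(x)
    params_h0[k] = y.mean()                  # best constant fit (midpoint)
    a = (y[1] - y[0]) / (x[1] - x[0])        # line through the two points
    params_h1[k] = (a, y[0] - a * x[0])

def bias_var(predictions):
    # predictions: K x len(x_test) array of g^(D)(x) values
    g_bar = predictions.mean(axis=0)
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean(predictions.var(axis=0))
    return bias, var

pred_h0 = np.tile(params_h0[:, None], (1, x_test.size))
pred_h1 = params_h1[:, :1] * x_test + params_h1[:, 1:]
print("H0: bias, var =", bias_var(pred_h0))   # roughly 0.50, 0.25
print("H1: bias, var =", bias_var(pred_h1))   # roughly 0.21, 1.69
```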
The learning algorithm plays a role in the bias-variance analysis that it did not play in the VC analysis. Two points are worth noting.

1. By design, the VC analysis is based purely on the hypothesis set H, independently of the learning algorithm A. In the bias-variance analysis, both H and the algorithm A matter. With the same H, using a different learning algorithm can produce a different g^(D). Since g^(D) is the building block of the bias-variance analysis, this may result in different bias and var terms.

2. Although the bias-variance analysis is based on squared-error measures, the learning algorithm itself does not have to be based on minimizing the squared error. It can use any criterion to produce g^(D) based on D. However, once the algorithm produces g^(D), we measure its bias and variance using squared error.

Unfortunately, the bias and variance cannot be computed in practice, since they depend on the target function and the input probability distribution (both unknown). Thus, the bias-variance decomposition is a conceptual tool which is helpful when it comes to developing a model. There are two typical goals when we consider bias and variance. The first is to try to lower the variance without significantly increasing the bias, and the second is to lower the bias without significantly increasing the variance. These goals are achieved by different techniques, some principled and some heuristic. Regularization is one of these techniques that we will discuss in Chapter 4. Reducing the bias without increasing the variance requires some prior information regarding the target function to steer the selection of H in the direction of f, and this task is largely application-specific. On the other hand, reducing the variance without compromising the bias can be done through more general techniques.
2.3.2 The Learning Curve

We close this chapter with an important plot that illustrates the tradeoffs that we have seen so far. The learning curves summarize the behavior of the
in-sample and out-of-sample errors as we vary the size of the training set. After learning with a particular data set D of size N, the final hypothesis g^(D) has in-sample error Ein(g^(D)) and out-of-sample error Eout(g^(D)), both of which depend on D. As we saw in the bias-variance analysis, the expectation with respect to all data sets of size N gives the expected errors: E_D[Ein(g^(D))] and E_D[Eout(g^(D))].⁶ These expected errors are functions of N, and are called the learning curves of the model. We illustrate the learning curves for a simple learning model and a complex one, based on actual experiments.
[Figure: learning curves (expected error versus number of data points, N) for a simple model and a complex model.]
Notice that for the simple model, the learning curve converges more quickly, but to worse ultimate performance than for the complex model. This behavior is typical in practice. For both simple and complex models, the out-of-sample learning curve is decreasing in N, while the in-sample learning curve is increasing in N. Let us take a closer look at these curves and interpret them in terms of the different approaches to generalization that we have discussed. In the VC analysis, Eout was expressed as the sum of Ein and a generalization error that was bounded by Ω, the penalty for model complexity. In the bias-variance analysis, Eout was expressed as the sum of a bias and a variance. The following learning curves illustrate these two approaches side by side.
[Figure: the learning curves interpreted two ways (expected error versus number of data points, N); left: VC analysis (Ein plus generalization error); right: bias-variance analysis.]
The VC analysis bounds the generalization error, which is illustrated on the left. The bias-variance analysis is illustrated on the right. The bias-variance illustration is somewhat idealized, since it assumes that, for every N, the average learned hypothesis ḡ has the same performance as the best approximation to f in the learning model. When the number of data points increases, we move to the right on the learning curves, and both the generalization error and the variance term shrink, as expected. The learning curve also illustrates an important point about Ein. As N increases, Ein edges toward the smallest error that the learning model can achieve in approximating f. For small N, the value of Ein is actually smaller than that 'smallest possible' error. This is because the learning model has an easier task for smaller N; it only needs to approximate f on the N points, regardless of what happens outside those points. Therefore, it can achieve a superior fit on those points, albeit at the expense of an inferior fit on the rest of the points, as shown by the corresponding value of Eout.
⁶ For the learning curves, we take the expected values of all quantities with respect to data sets of size N.
2.4 Problems
Problem 2.1 In Equation (2.1), set δ = 0.03 and let ε(M, N, δ) = √( (1/(2N)) ln(2M/δ) ).
(a) For M = 1, how many examples do we need to make ε ≤ 0.05?
(b) For M = 100, how many examples do we need to make ε ≤ 0.05?
(c) For M = 10,000, how many examples do we need to make ε ≤ 0.05?
Problem 2.2 Show that for the learning model of positive rectangles (aligned horizontally or vertically), m_H(4) = 2⁴ and m_H(5) < 2⁵. Hence, give a bound for m_H(N).
Problem 2.3 Compute the maximum number of dichotomies, m_H(N), for these learning models, and consequently compute the VC dimension.
(a) Positive or negative ray: H contains the functions which are +1 on [a, ∞) (for some a), together with those that are +1 on (−∞, a] (for some a).
(b) Positive or negative interval: H contains the functions which are +1 on an interval [a, b] and −1 elsewhere, or −1 on an interval [a, b] and +1 elsewhere.
(c) Two concentric spheres in ℝ^d: H contains the functions which are +1 for a ≤ √(x_1² + ⋯ + x_d²) ≤ b.

Problem 2.4 Show that B(N, k) ≥ Σ_{i=0}^{k−1} C(N, i) by showing the other direction to Lemma 2.3, namely that

B(N, k) ≥ Σ_{i=0}^{k−1} C(N, i).

To do so, construct a specific set of Σ_{i=0}^{k−1} C(N, i) dichotomies that does not shatter any subset of k variables. [Hint: Try limiting the number of −1's in each dichotomy.]
Problem 2.5 Prove by induction that Σ_{i=0}^{D} C(N, i) ≤ N^D + 1, hence

m_H(N) ≤ N^dvc + 1.
Problem 2.6 Prove that for N ≥ d,

Σ_{i=0}^{d} C(N, i) ≤ (eN/d)^d.

We suggest you first show the following intermediate steps:
(a) Σ_{i=0}^{d} C(N, i) ≤ Σ_{i=0}^{d} C(N, i) (N/d)^{d−i} ≤ (N/d)^d Σ_{i=0}^{N} C(N, i) (d/N)^i. [Hints: Binomial theorem; (1 + 1/r)^r ≤ e for r > 0.]
(b) Hence, argue that m_H(N) ≤ (eN/dvc)^{dvc}.
Problem 2.7 Plot the bounds for m_H(N) given in Problems 2.5 and 2.6 for dvc = 2 and dvc = 5. When do you prefer one bound over the other?
Problem 2.8 Which of the following are possible growth functions m_H(N) for some hypothesis set?

1 + N;   1 + N + N(N−1)/2;   2^N;   2^⌊√N⌋;   2^⌊N/2⌋;   1 + N + N(N−1)(N−2)/6.
Problem 2.9 [hard] For the perceptron in d dimensions, show that

m_H(N) = 2 Σ_{i=0}^{d} C(N−1, i).

Use this formula to verify that dvc = d + 1 by evaluating m_H(d+1) and m_H(d+2). Plot m_H(N)/2^N for d = 10 and N ∈ [1, 40]. If you generate a random dichotomy on N points in 10 dimensions, give an upper bound on the probability that the dichotomy will be separable for N = 10, 20, 40.

Problem 2.10 Show that m_H(2N) ≤ m_H(N)², and hence obtain a generalization bound which only involves m_H(N).
Problem 2.11 Suppose m_H(N) = N + 1, so dvc = 1. You have 100 training examples. Use the generalization bound to give a bound for Eout with confidence 90%. Repeat for N = 10,000.
Problem 2.12 For an H with dvc = 10, what sample size do you need (as prescribed by the generalization bound) to have a 95% confidence that your generalization error is at most 0.05?
Problem 2.13
(a) Let H = {h_1, h_2, ..., h_M} with some finite M. Prove that dvc(H) ≤ log₂ M.
(b) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions dvc(H_k), derive and prove the tightest upper and lower bound that you can get on dvc(∩_{k=1}^{K} H_k).
(c) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions dvc(H_k), derive and prove the tightest upper and lower bounds that you can get on dvc(∪_{k=1}^{K} H_k).
Problem 2.14 Let H_1, H_2, ..., H_K be K hypothesis sets with finite VC dimension dvc. Let H = H_1 ∪ H_2 ∪ ⋯ ∪ H_K be the union of these models.
(a) Show that dvc(H) < K(dvc + 1).
(b) Suppose that ℓ satisfies 2^ℓ > 2K ℓ^dvc. Show that dvc(H) ≤ ℓ.
(c) Hence, show that

dvc(H) ≤ min( K(dvc + 1), 7(dvc + K) log₂(dvc K) ).

That is, dvc(H) = O( max(dvc, K) log₂ max(dvc, K) ), which is not too bad.

Problem 2.15 The monotonically increasing hypothesis set is

H = {h | x_1 ≥ x_2 ⟹ h(x_1) ≥ h(x_2)},

where x_1 ≥ x_2 if and only if the inequality is satisfied for every component.
(a) Give an example of a monotonic classifier in two dimensions, clearly showing the +1 and −1 regions.
(b) Compute m_H(N) and hence the VC dimension. [Hint: Consider a set of N points generated by first choosing one point, and then generating the next point by increasing the first component and decreasing the second component until N points are obtained.]
Problem 2.16 In this problem, we will consider X = ℝ. That is, x is a one-dimensional variable. For a hypothesis set

H = { h_c | h_c(x) = sign( Σ_{i=0}^{D} c_i x^i ) },

prove that the VC dimension of H is exactly (D + 1) by showing that
(a) There are (D + 1) points which are shattered by H.
(b) There are no (D + 2) points which are shattered by H.
Problem 2.17 The VC dimension depends on the input space as well as H. For a fixed H, consider two input spaces X_1 ⊆ X_2. Show that the VC dimension of H with respect to input space X_1 is at most the VC dimension of H with respect to input space X_2.

How can the result of this problem be used to answer part (b) in Problem 2.16? [Hint: How is Problem 2.16 related to a perceptron in D dimensions?]
Problem 2.18 The VC dimension of the perceptron hypothesis set corresponds to the number of parameters (w_0, w_1, ..., w_d) of the set, and this observation is 'usually' true for other hypothesis sets. However, we will present a counter-example here. Prove that the following hypothesis set for x ∈ ℝ has an infinite VC dimension:

H = { h_α | h_α(x) = (−1)^⌊αx⌋, where α ∈ ℝ },

where ⌊A⌋ is the biggest integer ≤ A (the floor function). This hypothesis set has only one parameter α, but 'enjoys' an infinite VC dimension. [Hint: Consider x_1, ..., x_N, where x_n = 10^n, and show how to implement an arbitrary dichotomy y_1, ..., y_N.]
Problem 2.19 This problem derives a bound for the VC dimension of a complex hypothesis set that is built from simpler hypothesis sets via composition. Let H_1, ..., H_K be hypothesis sets with VC dimensions d_1, ..., d_K. Fix h_1, ..., h_K, where h_i ∈ H_i. Define a vector z obtained from x to have components h_i(x). Note that x ∈ ℝ^d, but z ∈ {−1, +1}^K. Let H̃ be a hypothesis set of functions that take inputs in ℝ^K. So h̃ ∈ H̃ maps z ∈ ℝ^K to {+1, −1}, and suppose that H̃ has VC dimension d̃.

We can apply a hypothesis in H̃ to the z constructed from (h_1, ..., h_K). This is the composition of the hypothesis set H̃ with (H_1, ..., H_K). More formally, the composed hypothesis set H = H̃ ∘ (H_1, ..., H_K) is defined by h ∈ H if

h(x) = h̃( h_1(x), ..., h_K(x) ),   h̃ ∈ H̃,   h_i ∈ H_i.

(a) Show that

m_H(N) ≤ m_H̃(N) ∏_{k=1}^{K} m_{H_k}(N).   (2.18)

[Hint: Fix N points x_1, ..., x_N and fix h_1, ..., h_K. This generates N transformed points z_1, ..., z_N. These z_1, ..., z_N can be dichotomized in at most m_H̃(N) ways, hence for fixed (h_1, ..., h_K), (x_1, ..., x_N) can be dichotomized in at most m_H̃(N) ways. Through the eyes of x_1, ..., x_N, at most how many hypotheses are there (effectively) in H_k? Use this bound to bound the effective number of K-tuples (h_1, ..., h_K) that need to be considered. Finally, argue that you can bound the number of dichotomies that can be implemented by the product of the number of possible K-tuples (h_1, ..., h_K) and the number of dichotomies per K-tuple.]

(b) Use the bound m(N) ≤ (eN/dvc)^dvc to get a bound for m_H(N) in terms of d̃, d_1, ..., d_K.

(c) Let D = d̃ + Σ_{k=1}^{K} d_k, and assume that D > 2e log₂ D. Show that dvc(H) ≤ 2D log₂ D.

(d) If H̃ and the H_i are all perceptron hypothesis sets, show that dvc(H) = O(dK log(dK)).

In the next chapter, we will further develop the simple linear model. This linear model is the building block of many other models, such as neural networks. The results of this problem show how to bound the VC dimension of the more complex models built in this manner.
Problem 2.20 There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ.
(a) Original VC bound:
ε ≤ √( (8/N) ln( 4 m_H(2N) / δ ) ).
(b) Rademacher penalty bound:
ε ≤ √( (2 ln(2N m_H(N))) / N ) + √( (2/N) ln(1/δ) ) + 1/N.
(c) Parrondo and Van den Broek:
ε ≤ √( (1/N) ( 2ε + ln( 6 m_H(2N) / δ ) ) ).
(d) Devroye:
ε ≤ √( (1/(2N)) ( 4ε(1 + ε) + ln( 4 m_H(N²) / δ ) ) ).
Note that (c) and (d) are implicit bounds in ε. Fix dvc = 50 and δ = 0.05, and plot these bounds as a function of N. Which is best?
Problem 2.21 Assume the following theorem to hold.

Theorem. P[ (Eout(g) − Ein(g)) / √Eout(g) > ε ] ≤ c · m_H(2N) exp( −ε²N/4 ),

where c is a constant that is a little bigger than 6.

This bound is useful because sometimes what we care about is not the absolute generalization error but instead a relative generalization error (one can imagine that a generalization error of 0.01 is more significant when Eout = 0.01 than when Eout = 0.5). Convert this to a generalization bound by showing that, with probability at least 1 − δ,

Eout(g) ≤ Ein(g) + (ξ/2) [ 1 + √( 1 + 4Ein(g)/ξ ) ],

where ξ = (4/N) ln( c · m_H(2N) / δ ).
Problem 2.22 When there is noise in the data, Eout(g^(D)) = E_{x,y}[ (g^(D)(x) − y(x))² ], where y(x) = f(x) + ε. If ε is a zero-mean noise random variable with variance σ², show that the bias-variance decomposition becomes

E_D[ Eout(g^(D)) ] = σ² + bias + var.
Problem 2.23 Consider the learning problem in Example 2.8, where the input space is X = [−1, +1], the target function is f(x) = sin(πx), and the input probability distribution is uniform on X. Assume that the training set D has only two data points (picked independently), and that the learning algorithm picks the hypothesis that minimizes the in-sample mean squared error. In this problem, we will dig deeper into this case.

For each of the following learning models, find (analytically or numerically) (i) the best hypothesis that approximates f in the mean-squared-error sense (assume that f is known for this part), (ii) the expected value (with respect to D) of the hypothesis that the learning algorithm produces, and (iii) the expected out-of-sample error and its bias and var components.
(a) The learning model consists of all hypotheses of the form h(x) = ax + b (if you need to deal with the infinitesimal-probability case of two identical data points, choose the hypothesis tangential to f).
(b) The learning model consists of all hypotheses of the form h(x) = ax. This case was not covered in Example 2.8.
(c) The learning model consists of all hypotheses of the form h(x) = b.

Problem 2.24 Consider a simplified learning scenario. Assume that the input dimension is one. Assume that the input variable x is uniformly distributed in the interval [−1, 1]. The data set consists of 2 points {x_1, x_2}, and assume that the target function is f(x) = x². Thus, the full data set is D = {(x_1, x_1²), (x_2, x_2²)}. The learning algorithm returns the line fitting these two points as g (H consists of functions of the form h(x) = ax + b). We are interested in the test performance (Eout) of our learning system with respect to the squared error measure, the bias and the var.
(a) Give the analytic expression for the average function ḡ(x).
(b) Describe an experiment that you could run to determine (numerically) ḡ(x), Eout, bias, and var.
(c) Run your experiment and report the results. Compare Eout with bias + var. Provide a plot of your ḡ(x) and f(x) (on the same plot).
(d) Compute analytically what Eout, bias and var should be.
Chapter 3

The Linear Model

We often wonder how to draw a line between two categories; right versus wrong, personal versus professional life, useful email versus spam, to name a few. A line is intuitively our first choice for a decision boundary. In learning, as in life, a line is also a good first choice.

In Chapter 1, we (and the machine ☺) learned a procedure to 'draw a line' between two categories based on data (the perceptron learning algorithm). We started by taking the hypothesis set H that included all possible lines (actually hyperplanes). The algorithm then searched for a good line in H by iteratively correcting the errors made by the current candidate line, in an attempt to improve Ein. As we saw in Chapter 2, the linear model (the set of lines) has a small VC dimension and so is able to generalize well from Ein to Eout.

The aim of this chapter is to further develop the basic linear model into a powerful tool for learning from data. We branch into three important problems: the classification problem that we have seen, and two other important problems called regression and probability estimation. The three problems come with different but related algorithms, and cover a lot of territory in learning from data. As a rule of thumb, when faced with learning problems, it is generally a winning strategy to try a linear model first.

3.1 Linear Classification

The linear model for classifying data into two classes uses a hypothesis set of linear classifiers, where each h has the form

h(x) = sign(wᵀx),

for some column vector w ∈ ℝ^{d+1}, where d is the dimensionality of the input space, and the added coordinate x_0 = 1 corresponds to the bias 'weight' w_0 (recall that the input space X = {1} × ℝ^d is considered (d+1)-dimensional since the added coordinate x_0 = 1 is fixed). We will use h and w interchangeably
to refer to the hypothesis when the context is clear.

When we left Chapter 1, we had two basic criteria for learning:
1. Can we make sure that Eout(g) is close to Ein(g)? This ensures that what we have learned in sample will generalize out of sample.
2. Can we make Ein(g) small? This ensures that what we have learned in sample is a good hypothesis.

The first criterion was studied in Chapter 2. Specifically, the VC dimension of the linear model is only d + 1 (Exercise 2.4). Using the VC generalization bound (2.12), and the bound (2.10) on the growth function in terms of the VC dimension, we conclude that with high probability,

Eout(g) = Ein(g) + O( √( (d/N) ln N ) ).   (3.1)

Thus, when N is sufficiently large, Ein and Eout will be close to each other (see the definition of O(·) in the Notation table), and the first criterion for learning is fulfilled.

The second criterion, making sure that Ein is small, requires first and foremost that there is some linear hypothesis that has small Ein. If there isn't such a linear hypothesis, then learning certainly can't find one. So, let's suppose for the moment that there is a linear hypothesis with small Ein. In fact, let's suppose that the data is linearly separable, which means there is some hypothesis w* with Ein(w*) = 0. We will deal with the case when this is not true shortly.

In Chapter 1, we introduced the perceptron learning algorithm (PLA). Start with an arbitrary weight vector w(0). Then, at every time step t ≥ 0, select any misclassified data point (x(t), y(t)), and update w(t) as follows:

w(t + 1) = w(t) + y(t) x(t).
The intuition is that the update is attempting to correct the error in classifying x(t). The remarkable thing is that this incremental approach of learning based on one data point at a time works. As discussed in Problem 1.3, it can be proved that the PLA will eventually stop updating, ending at a solution with Ein(w) = 0. Although this result applies to a restricted setting (linearly separable data), it is a significant step. The PLA is clever; it doesn't naively test every linear hypothesis to see if it (the hypothesis) separates the data, which would take infinitely long. Using an iterative approach, the PLA manages to search an infinite hypothesis set and output a linear separator in (provably) finite time.

As far as PLA is concerned, linear separability is a property of the data, not the target. A linearly separable D could have been generated either from a linearly separable target, or (by chance) from a target that is not linearly separable. The convergence proof of PLA guarantees that the algorithm will
(a) Few noisy data.   (b) Nonlinearly separable.

Figure 3.1: Data sets that are not linearly separable but are (a) linearly separable after discarding a few examples, or (b) separable by a more sophisticated curve.
work in both these cases, and produce a hypothesis with Ein = 0. Further, in both cases, you can be confident that this performance will generalize well out of sample, according to the VC bound.

Exercise 3.1
Will PLA ever stop updating if the data is not linearly separable?

3.1.1 Non-Separable Data
We now address the case where the data is not linearly separable. Figure 3.1 shows two data sets that are not linearly separable. In Figure 3.1(a), the data becomes linearly separable after the removal of just two examples, which could be considered noisy examples or outliers. In Figure 3.1(b), the data can be separated by a circle rather than a line. In both cases, there will always be a misclassified training example if we insist on using a linear hypothesis, and hence PLA will never terminate. In fact, its behavior becomes quite unstable, and can jump from a good perceptron to a very bad one within one update; the quality of the resulting Ein cannot be guaranteed. In Figure 3.1(a), it seems appropriate to stick with a line, but to somehow tolerate noise and output a hypothesis with a small Ein, not necessarily Ein = 0. In Figure 3.1(b), the linear model does not seem to be the correct model in the first place, and we will discuss a technique called nonlinear transformation for this situation in Section 3.4.
The situation in Figure 3.1(a) is actually encountered very often: even though a linear classifier seems appropriate, the data may not be linearly separable because of outliers or noise. To find a hypothesis with the minimum Ein, we need to solve the combinatorial optimization problem:

min_{w ∈ ℝ^{d+1}} (1/N) Σ_{n=1}^{N} [[ sign(wᵀx_n) ≠ y_n ]].   (3.2)
The difficulty in solving this problem arises from the discrete nature of both sign(·) and [[·]]. In fact, minimizing Ein(w) in (3.2) in the general case is known to be NP-hard, which means there is no known efficient algorithm for it, and if you discovered one, you would become really, really famous ☺. Thus, one has to resort to approximately minimizing Ein.

One approach for getting an approximate solution is to extend PLA through a simple modification into what is called the pocket algorithm. Essentially, the pocket algorithm keeps 'in its pocket' the best weight vector encountered up to iteration t in PLA. At the end, the best weight vector will be reported as the final hypothesis. This simple algorithm is shown below.

The pocket algorithm:
1: Set the pocket weight vector ŵ to w(0) of PLA.
2: for t = 0, ..., T − 1 do
3:   Run PLA for one update to obtain w(t + 1).
4:   Evaluate Ein(w(t + 1)).
5:   If w(t + 1) is better than ŵ in terms of Ein, set ŵ to w(t + 1).
6: Return ŵ.

The original PLA only checks some of the examples using w(t) to identify (x(t), y(t)) in each iteration, while the pocket algorithm needs an additional step that evaluates all examples using w(t + 1) to get Ein(w(t + 1)). The additional step makes the pocket algorithm much slower than PLA. In addition, there is no guarantee for how fast the pocket algorithm can converge to a good Ein. Nevertheless, it is a useful algorithm to have on hand because of its simplicity. Other efficient approaches for obtaining good approximate solutions have been developed based on different optimization techniques, as shown later in this chapter.
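For concreteness, here is a minimal NumPy sketch of the pocket algorithm described above (our own illustration; the function name and the random choice among misclassified points are our conventions):

```python
import numpy as np

def pocket(X, y, T=1000, rng=None):
    """Pocket algorithm sketch: run PLA-style updates for T iterations,
    keeping the weight vector with the lowest in-sample error seen so far.
    X is N x (d+1) with a leading column of 1s; y contains +-1 labels."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])             # w(0) of PLA
    def ein(w):
        return np.mean(np.sign(X @ w) != y)
    w_hat, best = w.copy(), ein(w)
    for _ in range(T):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break                        # linearly separated: Ein = 0
        n = rng.choice(misclassified)    # pick any misclassified point
        w = w + y[n] * X[n]              # PLA update
        e = ein(w)                       # extra pass over all examples
        if e < best:
            w_hat, best = w.copy(), e
    return w_hat
```

On a non-separable data set such as the one in Exercise 3.2 below, ŵ retains the best Ein seen so far even though the PLA iterate w(t) keeps moving.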
Exercise 3.2
Take d = 2 and create a data set D of size N = 100 that is not linearly separable. You can do so by first choosing a random line in the plane as your target function and the inputs x_n of the data set as random points in the plane. Then, evaluate the target function on each x_n to get the corresponding output y_n. Finally, flip the labels of N/10 randomly selected y_n's and the data set will likely become non-separable.
Now, try the pocket algorithm on your data set, using T = 1,000 iterations. Repeat the experiment 20 times. Then, plot the average Ein(w(t)) and the average Ein(ŵ) (which is also a function of t) on the same figure and see how they behave when t increases. Similarly, use a test set of size 1,000 and plot a figure to show how Eout(w(t)) and Eout(ŵ) behave.
Example 3.1 (Handwritten digit recognition). We sample some digits from the US Postal Service Zip Code Database. These 16 × 16 pixel images are preprocessed from the scanned handwritten zip codes. The goal is to recognize the digit in each image. We alluded to this task in part (b) of Exercise 1.1. A quick look at the images reveals that this is a nontrivial task (even for a human), and typical human Eout is about 2.5%. Common confusion occurs between the digits {4, 9} and {2, 7}. A machine-learned hypothesis which can achieve such an error rate would be highly desirable.

Let us first decompose the big task of separating ten digits into smaller tasks of separating two of the digits. Such a decomposition approach from multiclass to binary classification is commonly used in many learning algorithms. We will focus on digits {1, 5} for now. A human approach to determining the digit corresponding to an image is to look at the shape (or other properties) of the black pixels. Thus, rather than carrying all the information in the 256 pixels, it makes sense to summarize the information contained in the image into a few features. Let's look at two important features here: intensity and symmetry. Digit 5 usually occupies more black pixels than digit 1, and hence the average pixel intensity of digit 5 is higher. On the other hand, digit 1 is symmetric, while digit 5 is not. Therefore, if we define asymmetry as the average absolute difference between an image and its flipped versions, and symmetry as the negation of asymmetry, digit 1 would result in a higher symmetry value. A scatter plot for these intensity and symmetry features for some of the digits is shown next.
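The two features of Example 3.1 are easy to compute once an image is available as an array. The sketch below is our own illustration; it assumes each digit is a 16 × 16 NumPy array of grayscale values (the exact preprocessing of the zip-code data is not specified here), and it uses only the left-right flip as one reading of 'flipped versions'.

```python
import numpy as np

def digit_features(img):
    # img: a 16 x 16 NumPy array of grayscale pixel values (an assumption).
    intensity = img.mean()                           # average pixel value
    asymmetry = np.abs(img - np.fliplr(img)).mean()  # difference from flip
    symmetry = -asymmetry                            # negation of asymmetry
    return intensity, symmetry
```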
While the digits can be roughly separated by a line in the plane representing these two features, there are poorly written digits (such as the 5 depicted in the top-left corner) that prevent a perfect linear separation. We now run PLA and pocket on the data set and see what happens. Since the data set is not linearly separable, PLA will not stop updating. In fact, as can be seen in Figure 3.2(a), its behavior can be quite unstable. When it is forcibly terminated at iteration 1,000, PLA gives a line that has a poor Ein = 2.24% and Eout = 6.37%. On the other hand, if the pocket algorithm is applied to the same data set, as shown in Figure 3.2(b), we can obtain a line that has a better Ein = 0.45% and a better Eout = 1.89%. □
3.2 Linear Regression

Linear regression is another useful linear model that applies to real-valued target functions.¹ It has a long history in statistics, where it has been studied in great detail, and has various applications in social and behavioral sciences. Here, we discuss linear regression from a learning perspective, where we derive the main results with minimal assumptions.

Let us revisit our application in credit approval, this time considering a regression problem rather than a classification problem. Recall that the bank has customer records that contain information fields related to personal credit, such as annual salary, years in residence, outstanding loans, etc. Such variables can be used to learn a linear classifier to decide on credit approval. Instead of just making a binary decision (approve or not), the bank also wants to set a proper credit limit for each approved customer. Credit limits are traditionally determined by human experts. The bank wants to automate this task, as it did with credit approval.
¹ Regression, a term inherited from earlier work in statistics, means y is real-valued.
(a) PLA   (b) Pocket

Figure 3.2: Comparison of two linear classification algorithms for separating digits 1 and 5. Ein and Eout are plotted versus iteration number, and below that is the learned hypothesis g. (a) A version of the PLA which selects a random training example and updates w if that example is misclassified (hence the flat regions when no update is made). This version avoids searching all the data at every iteration. (b) The pocket algorithm. (Top panels: error versus iteration number; bottom panels: the data in the average-intensity plane with the learned line.)
This is a regression learning problem. The bank uses historical records to construct a data set D of examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where x_n is customer information and y_n is the credit limit set by one of the human experts in the bank. Note that y_n is now a real number (positive in this case) instead of just a binary value ±1. The bank wants to use learning to find a hypothesis g that replicates how human experts determine credit limits.

Since there is more than one human expert, and since each expert may not be perfectly consistent, our target will not be a deterministic function y = f(x). Instead, it will be a noisy target formalized as a distribution of the random variable y that comes from the different views of different experts, as well as the variation within the views of each expert. That is, the label y_n comes from some distribution P(y | x) instead of a deterministic function f(x). Nonetheless, as we discussed in previous chapters, the nature of the problem is not changed. We have an unknown distribution P(x, y) that generates
each (x_n, y_n), and we want to find a hypothesis g that minimizes the error between g(x) and y with respect to that distribution. The choice of a linear model for this problem presumes that there is a linear combination of the customer information fields that would properly approximate the credit limit as determined by human experts. If this assumption does not hold, we cannot achieve a small error with a linear model. We will deal with this situation when we discuss nonlinear transformation later in the chapter.
3.2.1 The Algorithm
The linear regression algorithm is based on minimizing the squared error between h(x) and y,²

Eout(h) = E[ (h(x) − y)² ],

where the expected value is taken with respect to the joint probability distribution P(x, y). The goal is to find a hypothesis that achieves a small Eout(h). Since the distribution P(x, y) is unknown, Eout(h) cannot be computed. Similar to what we did in classification, we resort to the in-sample version instead,

Ein(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)².

In linear regression, h takes the form of a linear combination of the components of x. That is,

h(x) = Σ_{i=0}^{d} w_i x_i = wᵀx,
where x_0 = 1 and x ∈ {1} × ℝ^d as usual, and w ∈ ℝ^{d+1}. For the special case of linear h, it is very useful to have a matrix representation of Ein(h). First, define the data matrix X ∈ ℝ^{N×(d+1)} to be the N × (d+1) matrix whose rows are the inputs x_n (as row vectors), and define the target vector y ∈ ℝ^N to be the column vector whose components are the target values y_n. The in-sample error is a function of w and the data X, y:

Ein(w) = (1/N) ‖Xw − y‖²   (3.3)
 = (1/N) ( wᵀXᵀXw − 2wᵀXᵀy + yᵀy ),   (3.4)

where ‖·‖ is the Euclidean norm of a vector, and (3.3) follows because the nth component of the vector Xw − y is exactly wᵀx_n − y_n.

² The term 'linear regression' has been historically confined to squared error measures.
(a) one dimension (line)   (b) two dimensions (hyperplane)

Figure 3.3: The solution hypothesis (in blue) of the linear regression algorithm in one and two dimensions. The sum of squared errors is minimized.
The linear regression algorithm is derived by minimizing Ein(w) over all possible w ∈ ℝ^{d+1}, as formalized by the following optimization problem:

w_lin = argmin_{w ∈ ℝ^{d+1}} Ein(w).   (3.5)

Figure 3.3 illustrates the solution in one and two dimensions. Since Equation (3.4) implies that Ein(w) is differentiable, we can use standard matrix calculus to find the w that minimizes Ein(w) by requiring that the gradient of Ein with respect to w is the zero vector, i.e., ∇Ein(w) = 0. The gradient is a column vector whose ith component is [∇Ein(w)]_i = ∂Ein(w)/∂w_i. By explicitly computing these derivatives, the reader can verify the following gradient identities,

∇_w( wᵀAw ) = (A + Aᵀ)w,   ∇_w( wᵀb ) = b.

These identities are the matrix analog of ordinary differentiation of quadratic and linear functions. To obtain the gradient of Ein, we take the gradient of each term in (3.4) to obtain

∇Ein(w) = (2/N) ( XᵀXw − Xᵀy ).

Note that both w and ∇Ein(w) are column vectors. Finally, to get ∇Ein(w) to be 0, one should solve for the w that satisfies

XᵀXw = Xᵀy.

If XᵀX is invertible, w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the pseudo-inverse of X. The resulting w is the unique optimal solution to (3.5). If XᵀX is not
invertible, a pseudo-inverse can still be defined, but the solution will not be unique (see Problem 3.15). In practice, XᵀX is invertible in most of the cases, since N is often much bigger than d + 1, so there will likely be d + 1 linearly independent vectors x_n. We have thus derived the following linear regression algorithm.
Linear regression algorithm:
1: Construct the matrix X and the vector y from the data set (x_1, y_1), ..., (x_N, y_N), where each x includes the x_0 = 1 bias coordinate, as follows:
   X = [ x_1ᵀ ; x_2ᵀ ; ⋯ ; x_Nᵀ ]   (input data matrix),   y = [ y_1 ; y_2 ; ⋯ ; y_N ]   (target vector).
2: Compute the pseudo-inverse X† of the matrix X. If XᵀX is invertible,
   X† = (XᵀX)⁻¹Xᵀ.
3: Return w_lin = X†y.

This algorithm is sometimes referred to as ordinary least squares (OLS). It may seem that, compared with the perceptron learning algorithm, linear regression doesn't really look like 'learning', in the sense that the hypothesis w_lin comes from an analytic solution (matrix inversion and multiplications) rather than from iterative learning steps. Well, as long as the hypothesis w_lin has a decent out-of-sample error, then learning has occurred. Linear regression is a rare case where we have an analytic formula for learning that is easy to evaluate. This is one of the reasons why the technique is so widely used. It should be noted that there are methods for computing the pseudo-inverse directly without inverting a matrix, and that these methods are numerically more stable than matrix inversion.
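As a sketch of steps 2 and 3 in practice, NumPy's SVD-based pseudo-inverse gives a one-line implementation (the function name and the made-up data in the usage note are ours):

```python
import numpy as np

def linear_regression(X, y):
    # w_lin = pseudo-inverse of X times y.  X is N x (d+1) with a leading
    # column of 1s; y is the target vector.  np.linalg.pinv uses an SVD,
    # one of the numerically stable routes mentioned above.
    return np.linalg.pinv(X) @ y

# Usage sketch with arbitrary synthetic data:
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = rng.normal(size=100)
w_lin = linear_regression(X, y)
```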
Linear regression has been analyzed in great detail in statistics. We would like to mention one of the analysis tools here, since it relates to in-sample and out-of-sample errors, and that is the hat matrix H. Here is how H is defined. The linear regression weight vector w_lin is an attempt to map the inputs X to the outputs y. However, w_lin does not produce y exactly, but produces an estimate

ŷ = Xw_lin,

which differs from y due to in-sample error. Substituting the expression for w_lin (assuming XᵀX is invertible), we get

ŷ = X(XᵀX)⁻¹Xᵀy.
Therefore, the estimate ŷ is a linear transformation of the actual y through matrix multiplication with H, where

H = X(XᵀX)⁻¹Xᵀ.   (3.6)

Since the matrix H puts a hat on y, it is called the hat matrix. The hat matrix is a very special matrix. For one thing, H² = H, which can be verified using the above expression for H. This and other properties of H will facilitate the analysis of in-sample and out-of-sample errors of linear regression.
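A quick numerical check (not a proof) of these properties on a small randomly generated X might look like this; the data shapes are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix, Equation (3.6)

print(np.allclose(H, H.T))                  # H is symmetric
print(np.allclose(H @ H, H))                # H^2 = H
print(np.isclose(np.trace(H), d + 1))       # trace(H) = d + 1
```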
Exercise 3.3
Consider the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is an N × (d+1) matrix, and XᵀX is invertible.
(a) Show that H is symmetric.
(b) Show that H^K = H for any positive integer K.
(c) If I is the identity matrix of size N, show that (I − H)^K = I − H for any positive integer K.
(d) Show that trace(H) = d + 1, where the trace is the sum of diagonal elements. [Hint: trace(AB) = trace(BA).]
3.2.2 Generalization Issues

Linear regression looks for the optimal weight vector in terms of the in-sample error Ein, which leads to the usual generalization question: does this guarantee decent out-of-sample error Eout? The short answer is yes. There is a regression version of the VC generalization bound (3.1) that similarly bounds Eout. In the case of linear regression in particular, there are also exact formulas for the expected Eout and Ein that can be derived under simplifying assumptions. The general form of the result is

Eout(g) = Ein(g) + O(d/N),

where Eout(g) and Ein(g) are the expected values. This is comparable to the classification bound in (3.1).

Exercise 3.4
Consider a noisy target y = w*ᵀx + ε for generating the data, where ε is a noise term with zero mean and σ² variance, independently generated for every example (x, y). The expected error of the best possible linear fit to this target is thus σ².

For the data D = {(x_1, y_1), ..., (x_N, y_N)}, denote the noise in y_n as ε_n and let ε = [ε_1, ε_2, ..., ε_N]ᵀ; assume that XᵀX is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to D is given by

E_D[ Ein(w_lin) ] = σ² ( 1 − (d+1)/N ).

(a) Show that the in-sample estimate of y is given by ŷ = Xw* + Hε.
(b) Show that the in-sample error vector ŷ − y can be expressed by a matrix times ε. What is the matrix?
(c) Express Ein(w_lin) in terms of ε using (b), and simplify the expression using Exercise 3.3(c).
(d) Prove that E_D[ Ein(w_lin) ] = σ²(1 − (d+1)/N) using (c) and the independence of ε_1, ..., ε_N. [Hint: The sum of the diagonal elements of a matrix (the trace) will play a role; see Exercise 3.3(d).]

For the expected out-of-sample error, we take a special case which is easy to analyze. Consider a test data set D_test = {(x_1, y′_1), ..., (x_N, y′_N)}, which shares the same input vectors x_n with D but with a different realization of the noise terms. Denote the noise in y′_n as ε′_n and let ε′ = [ε′_1, ε′_2, ..., ε′_N]ᵀ. Define Etest(w_lin) to be the average squared error on D_test.

(e) Prove that E_{D,ε′}[ Etest(w_lin) ] = σ²( 1 + (d+1)/N ).

The special test error Etest is a very restricted case of the general out-of-sample error. Some detailed analysis shows that similar results can be obtained for the general case, as shown in the Problems.
Figure 3.4 illustrates the learning curve of linear regression under the assumptions of Exercise 3.4. The best possible linear fit has expected error σ². The expected in-sample error is smaller, equal to σ²(1 − (d+1)/N) for N ≥ d + 1. The learned linear fit has eaten into the in-sample noise as much as it could with the d + 1 degrees of freedom that it has at its disposal. This occurs because the fitting cannot distinguish the noise from the 'signal'. On the other hand, the expected out-of-sample error is σ²(1 + (d+1)/N), which is more than the unavoidable error of σ². The additional error reflects the drift in w_lin due to fitting the in-sample noise.
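A short simulation, consistent with the expressions quoted above (the sample sizes and noise level are arbitrary choices of ours), illustrates how the expected in-sample and test errors straddle σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, trials = 50, 5, 1.0, 20000

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # fixed inputs
w_star = rng.normal(size=d + 1)
pinv = np.linalg.pinv(X)

ein, etest = [], []
for _ in range(trials):
    y = X @ w_star + sigma * rng.normal(size=N)         # training noise
    y_test = X @ w_star + sigma * rng.normal(size=N)    # fresh noise, same inputs
    w = pinv @ y
    ein.append(np.mean((X @ w - y) ** 2))
    etest.append(np.mean((X @ w - y_test) ** 2))

print(np.mean(ein),   sigma**2 * (1 - (d + 1) / N))   # both ~0.88
print(np.mean(etest), sigma**2 * (1 + (d + 1) / N))   # both ~1.12
```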
3.3
Lgisti c Regressin
T co r of t li nar mod l is t signal' = wT tat combins t inut variabls linarly. · av sn two modls basd on tis signal, and w ar now going to introduc a tird In linar rgrssion, t signal itslf is takn as t ou tut, wic is aroriat if you ar trying to rdict a ral rsons tat could b unbound d In linar clas sication , t signal is trsoldd at zro to roduc a outut, aroriat fr binary dcisions A tird ossibility wic as wid alication in ractic, is to outut a pbability,
±
Figure 3.4: The learning curve for linear regression (expected error versus number of data points, N).
a value between 0 and 1. Our new model is called logistic regression. It has similarities to both previous models, as the output is real (like regression) but bounded (like classification).

Example 3.2 (Prediction of heart attacks). Suppose we want to predict the occurrence of heart attacks based on a person's cholesterol level, blood pressure, age, weight, and other factors. Obviously, we cannot predict a heart attack with any certainty, but we may be able to predict how likely it is to occur given these factors. Therefore, an output that varies continuously between 0 and 1 would be a more suitable model than a binary decision. The closer y is to 1, the more likely that the person will have a heart attack. □
3.3.1 Predicting a Probability
Linear classification uses a hard threshold on the signal s = wᵀx,

h(x) = sign(wᵀx),

while linear regression uses no threshold at all,

h(x) = wᵀx.

In our new model, we need something in between these two cases that smoothly restricts the output to the probability range [0, 1]. One choice that accomplishes this goal is the logistic regression model,

h(x) = θ(wᵀx),

where θ is the so-called logistic function θ(s) = e^s / (1 + e^s), whose output is between 0 and 1.
The output can be interpreted as a probability for a binary event (heart attack or no heart attack, digit '1' versus digit '5', etc.). Linear classification also deals with a binary event, but the difference is that the 'classification' in logistic regression is allowed to be uncertain, with intermediate values between 0 and 1 reflecting this uncertainty. The logistic function θ is referred to as a soft threshold, in contrast to the hard threshold in classification. It is also called a sigmoid because its shape looks like a flattened out 's'.
Exercise 3.5
Another popular soft threshold is the hyperbolic tangent

tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s}).

(a) How is tanh related to the logistic function θ? [Hint: shift and scale.]
(b) Show that tanh(s) converges to a hard threshold for large |s|, and converges to no threshold for small |s|. [Hint: Formalize the figure below.]
The specific formula of θ(s) will allow us to define an error measure for learning that has analytical and computational advantages, as we will see shortly. Let us first look at the target that logistic regression is trying to learn. The target is a probability, say of a patient being at risk for a heart attack, that depends on the input x (the characteristics of the patient). Formally, we are trying to learn the target function

f(x) = P[ y = +1 | x ].

The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability, e.g., patients who had heart attacks and patients who didn't. Therefore, the data is in fact generated by a noisy target P(y | x),

P(y | x) = f(x) for y = +1;  1 − f(x) for y = −1.   (3.7)

To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these noisy ±1 examples.
Error measure. The standard error measure e(h(x), y) used in logistic regression is based on the notion of likelihood: how 'likely' is it that we would get this output y from the input x if the target distribution P(y | x) were indeed captured by our hypothesis h(x)? Based on (3.7), that likelihood would be

P(y | x) = h(x) for y = +1;  1 − h(x) for y = −1.

We substitute for h(x) by its value θ(wᵀx), and use the fact that 1 − θ(s) = θ(−s) (easy to verify) to get

P(y | x) = θ(y wᵀx).   (3.8)

One of our reasons for choosing the mathematical form θ(s) = e^s/(1 + e^s) is that it leads to this simple expression for P(y | x).

Since the data points (x_1, y_1), ..., (x_N, y_N) are independently generated, the probability of getting all the y_n's in the data set from the corresponding x_n's would be the product

∏_{n=1}^{N} P(y_n | x_n).

The method of maximum likelihood selects the hypothesis h which maximizes this probability. We can equivalently minimize the more convenient quantity

−(1/N) ln( ∏_{n=1}^{N} P(y_n | x_n) ) = (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) ),

since ln(·) is a monotonically increasing function. Substituting with Equation (3.8), we would be minimizing

(1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n wᵀx_n) )

with respect to the weight vector w. The fact that we are minimizing this quantity allows us to treat it as an error measure. Substituting the functional form for θ(y_n wᵀx_n) produces the in-sample error measure for logistic regression,

Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n wᵀx_n} ).   (3.9)

The implied pointwise error measure is e(h(x_n), y_n) = ln(1 + e^{−y_n wᵀx_n}). Notice that this error measure is small when y_n wᵀx_n is large and positive, which would imply that sign(wᵀx_n) = y_n. Therefore, as our intuition would expect, the error measure encourages w to classify each x_n correctly.

³ Although the method of maximum likelihood is intuitively plausible, its rigorous justification as an inference tool continues to be discussed in the statistics community.
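The identity 1 − θ(s) = θ(−s) and the in-sample error (3.9) are straightforward to code; the sketch below (function names are ours) can be used to verify the identity numerically and to evaluate Ein(w) on a data set:

```python
import numpy as np

def theta(s):
    # Logistic function: e^s / (1 + e^s) = 1 / (1 + e^{-s}).
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_ein(w, X, y):
    # In-sample error (3.9): average of ln(1 + exp(-y_n w^T x_n)).
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

# Numerical check of 1 - theta(s) = theta(-s), used to get P(y|x) = theta(y w^T x):
s = np.linspace(-5, 5, 11)
print(np.allclose(1 - theta(s), theta(-s)))
```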
Exercise 3.6 [Cross-entropy error measure]
(a) More generally, if we are learning from ±1 data to predict a noisy target P(y | x) with candidate hypothesis h, show that the maximum likelihood method reduces to the task of finding h that minimizes

Ein(w) = Σ_{n=1}^{N} [[y_n = +1]] ln( 1/h(x_n) ) + [[y_n = −1]] ln( 1/(1 − h(x_n)) ).

(b) For the case h(x) = θ(wᵀx), argue that minimizing the in-sample error in part (a) is equivalent to minimizing the one in (3.9).

For two probability distributions {p, 1−p} and {q, 1−q} with binary outcomes, the cross-entropy (from information theory) is

p log(1/q) + (1 − p) log( 1/(1 − q) ).

The in-sample error in part (a) corresponds to a cross-entropy error measure on the data point (x_n, y_n), with p = [[y_n = +1]] and q = h(x_n).
For linear classification, we saw that minimizing Ein for the perceptron is a combinatorial optimization problem; to solve it, we introduced a number of algorithms such as the perceptron learning algorithm and the pocket algorithm. For linear regression, we saw that training can be done using the analytic pseudo-inverse algorithm for minimizing Ein by setting ∇Ein(w) = 0. These algorithms were developed based on the specific form of linear classification or linear regression, so none of them would apply to logistic regression.

To train logistic regression, we will take an approach similar to linear regression in that we will try to set ∇Ein(w) = 0. Unfortunately, unlike the case of linear regression, the mathematical form of the gradient of Ein for logistic regression is not easy to manipulate, so an analytic solution is not feasible.

Exercise 3.7
For logistic regression, show that

∇Ein(w) = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n wᵀx_n}) = −(1/N) Σ_{n=1}^{N} y_n x_n θ(−y_n wᵀx_n).

Argue that a 'misclassified' example contributes more to the gradient than a correctly classified one.

Instead of analytically setting the gradient to zero, we will iteratively set it to zero. To do so, we will introduce a new algorithm, gradient descent. Gradient
descent is a very general algorithm that can be used to train many other learning models with smooth error measures. For logistic regression, gradient descent has particularly nice properties.

3.3.2 Gradient Descent

Gradient descent is a general technique for minimizing a twice-differentiable function, such as Ein(w) in logistic regression. A useful physical analogy of gradient descent is a ball rolling down a hilly surface. If the ball is placed on a hill, it will roll down, coming to rest at the bottom of a valley. The same basic idea underlies gradient descent. Ein(w) is a 'surface' in a high-dimensional space. At step 0, we start somewhere on this surface, at w(0), and try to roll down this surface, thereby decreasing Ein.

One thing which you immediately notice from the physical analogy is that the ball will not necessarily come to rest in the lowest valley of the entire surface. Depending on where you start the ball rolling, you will end up at the bottom of one of the valleys, a local minimum. In general, the same applies to gradient descent. Depending on your starting weights, the path of descent will take you to a local minimum in the error surface.

A particular advantage for logistic regression with the cross-entropy error is that the picture looks much nicer. There is only one valley! So, it does not matter where you start your ball rolling, it will always roll down to the same (unique) global minimum. This is a consequence of the fact that Ein(w) is a convex function of w, a mathematical property that implies a single 'valley', as shown to the right (a convex error surface plotted against the weights, w).⁴ This means that gradient descent will not be trapped in local minima when minimizing such convex error measures.

Let's now determine how to 'roll' down the Ein surface. We would like to take a step in the direction of steepest descent, to gain the biggest bang for our buck. Suppose that we take a small step of size η in the direction of a unit vector v̂. The new weights are w(0) + ηv̂. Since η is small, using the Taylor expansion to first order, we compute the change in Ein as

ΔEin = Ein(w(0) + ηv̂) − Ein(w(0))
 = η ∇Ein(w(0))ᵀ v̂ + O(η²)
 ≥ −η ‖∇Ein(w(0))‖,

⁴ In fact, the squared in-sample error in linear regression is also convex, which is why the analytic solution found by the pseudo-inverse is guaranteed to have optimal in-sample error.
where we have ignored the small term O(η²). Since v̂ is a unit vector, equality holds if and only if

v̂ = −∇Ein(w(0)) / ‖∇Ein(w(0))‖.   (3.10)

This direction, specified by v̂, leads to the largest decrease in Ein for a given step size η.
This dirtion, spid b y , lads to th largst dras in Ein r a gin stp siz erise 8 e laim at ols fr small
is t e i ection w ic
ies lg est ece ase i n in onl
There is nothing to prevent us from continuing to take steps of size η, re-evaluating the direction v̂ at each iteration t = 0, 1, 2, .... How large a step should one take at each iteration? This is a good question, and to gain some insight, let's look at the following examples.

[Figure: progress of gradient descent on an error surface (error versus weights) when η is too small, when η is too large, and with a variable η that is 'just right'.]

A fixed step size (if it is too small) is inefficient when you are far from the local minimum. On the other hand, too large a step size when you are close to the minimum leads to bouncing around, possibly even increasing Ein. Ideally, we would like to take large steps when far from the minimum to get in the right ballpark quickly, and then small (more careful) steps when close to the minimum. A simple heuristic can accomplish this: far from the minimum, the norm of the gradient is typically large, and close to the minimum, it is small. Thus, we could set η_t = η‖∇Ein‖ to obtain the desired behavior for the variable step size; choosing the step size proportional to the norm of the gradient will also conveniently cancel the term normalizing the unit vector v̂ in Equation (3.10), leading to the fixed learning rate gradient descent algorithm for minimizing Ein (with redefined η):
Fixed learning rate gradient descent:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:   Compute the gradient g_t = ∇Ein(w(t)).
4:   Set the direction to move, v_t = −g_t.
5:   Update the weights: w(t + 1) = w(t) + η v_t.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights.
In the algorithm, v_t is a direction that is no longer restricted to unit length. The parameter η (the learning rate) has to be specified. A typically good choice for η is around 0.1 (a purely practical observation). To use gradient descent, one must compute the gradient. This can be done explicitly for logistic regression (see Exercise 3.7).

Example 3.3. Gradient descent is a general algorithm for minimizing twice-differentiable functions. We can apply it to the logistic regression in-sample error to return weights that approximately minimize

Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n wᵀx_n} ).
Logistic regression algorithm:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:   Compute the gradient g_t = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w(t)ᵀx_n}).
4:   Set the direction to move, v_t = −g_t.
5:   Update the weights: w(t + 1) = w(t) + η v_t.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights w.  □
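A compact NumPy sketch of this algorithm (our own illustration; the iteration cap and gradient threshold are arbitrary choices) follows the steps above, using the gradient from Exercise 3.7:

```python
import numpy as np

def logistic_regression(X, y, eta=0.1, max_iters=10000, grad_tol=1e-4):
    # Fixed-learning-rate gradient descent for the logistic regression
    # in-sample error.  X is N x (d+1) with a leading column of 1s; y is +-1.
    w = np.zeros(X.shape[1])                  # w(0) = 0 works here
    for _ in range(max_iters):
        # gradient of (3.9): -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n))
        g = -(y[:, None] * X / (1.0 + np.exp(y * (X @ w)))[:, None]).mean(axis=0)
        if np.linalg.norm(g) < grad_tol:      # stop when the gradient is small
            break
        w = w - eta * g                       # move along v_t = -g_t
    return w
```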
Initialization and termination. We have two more loose ends to tie: the first is how to choose w(0), the initial weights, and the second is how to set the criterion for "...until it is time to stop" in step 6 of the gradient descent algorithm. In some cases, such as logistic regression, initializing the weights w(0) as zeros works well. However, in general, it is safer to initialize the weights randomly, so as to avoid getting stuck on a perfectly symmetric hilltop. Choosing each weight independently from a Normal distribution with zero mean and small variance usually works well in practice.
That takes care of initialization, so we now move on to termination. How do we decide when to stop? Termination is a non-trivial topic in optimization. One simple approach, as we encountered in the pocket algorithm, is to set an upper bound on the number of iterations, where the upper bound is typically in the thousands, depending on the amount of training time we have. The problem with this approach is that there is no guarantee on the quality of the final weights.

Another plausible approach is based on the gradient being zero at any minimum. A natural termination criterion would be to stop once ‖g_t‖ drops below a certain threshold. Eventually this must happen, but we do not know when it will happen. For logistic regression, a combination of the two conditions (setting a large upper bound for the number of iterations, and a small lower bound for the size of the gradient) usually works well in practice.

There is a problem with relying solely on the size of the gradient to stop, which is that you might stop prematurely, as illustrated on the right. When the iteration reaches a relatively flat region (which is more common than you might suspect), the algorithm will prematurely stop when we may want to continue. So one solution is to require that termination occurs only if the error change is small and the error itself is small. Ultimately, a combination of termination criteria (a maximum number of iterations, marginal error improvement, coupled with a small value for the error itself) works reasonably well.

Example 3.4. By way of summarizing linear models, we revisit our old friend the credit example. If the goal is to decide whether to approve or deny, then we are in the realm of classification; if you want to assign an amount of credit line, then linear regression is appropriate; if you want to predict the probability that someone will default, use logistic regression.
Credit Analysis
    Approve or deny          ->  Perceptron            (classification error)
    Amount of credit         ->  Linear regression     (squared error)
    Probability of default   ->  Logistic regression   (cross-entropy error)
The three linear models have their respective goals, error measures, and algorithms. Nonetheless, they not only share similar sets of linear hypotheses, but are in fact related in other ways. We would like to point out one important relationship: both logistic regression and linear regression can be used in linear classification. Here is how. Logistic regression produces a final hypothesis g(x) which is our estimate of P[y = +1 | x]. Such an estimate can easily be used for classification by
setting a threshold on g(x); a natural threshold is 1/2, which corresponds to classifying +1 if +1 is more likely. This choice for the threshold corresponds to using the logistic regression weights as weights in the perceptron for classification. Not only can logistic regression weights be used for classification in this way, but they can also be used as a way to train the perceptron model. The perceptron learning problem (3.2) is a very hard combinatorial optimization problem. The convexity of E_in in logistic regression makes the optimization problem much easier to solve. Since the logistic function is a soft version of a hard threshold, the logistic regression weights should be good weights for classification using the perceptron.

A similar relationship exists between classification and linear regression. Linear regression can be used with any real-valued target function, which includes real values that are ±1. If w_linᵀx is fit to ±1 values, sign(w_linᵀx) will likely agree with these values and make good classification predictions. In other words, the linear regression weights w_lin, which are easily computed using the pseudo-inverse, are also an approximate solution for the perceptron model. The weights can be directly used for classification, or used as an initial condition for the pocket algorithm to give it a head start.
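As a small illustration of this point, the sketch below (our own notation, not from the text) turns logistic regression weights into a hard ±1 classifier by thresholding g(x) at 1/2, which is the same as taking sign(wᵀx), and likewise uses linear regression weights directly for classification.

```python
import numpy as np

def classify_with_weights(X, w):
    """Hard +/-1 classification from any linear-model weights w.

    Thresholding the logistic output at 1/2 is equivalent to taking
    sign(w.x), so the same rule serves logistic and linear regression.
    """
    return np.sign(X @ w)

def logistic_probability(X, w):
    """Soft output g(x), the estimate of P[y = +1 | x]."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```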
Exercise 3.9
Consider pointwise error measures e_class(s, y) = [[y ≠ sign(s)]], e_sq(s, y) = (y − s)², and e_log(s, y) = ln(1 + exp(−ys)), where the signal s = wᵀx.
(a) For y = +1, plot e_class, e_sq and (1/ln 2)·e_log versus s, on the same plot.
(b) Show that e_class(s, y) ≤ e_sq(s, y), and hence that the classification error is upper bounded by the squared error.
(c) Show that e_class(s, y) ≤ (1/ln 2)·e_log(s, y), and hence that the classification error is upper bounded (up to a constant) by the logistic regression error.
These bounds indicate that minimizing the squared or logistic regression error should also decrease the classification error, which justifies using the weights returned by linear or logistic regression as approximations for classification.
Stochastic gradient descent. The version of gradient descent we have described so far is known as batch gradient descent: the gradient is computed for the error on the whole data set before a weight update is done. A sequential version of gradient descent known as stochastic gradient descent (SGD) turns out to be very efficient in practice. Instead of considering the full batch gradient on all N training data points, we consider a stochastic version of the gradient. First, pick a training data point (x_n, y_n) uniformly at random (hence the name 'stochastic'), and consider only the error on that data point
(in the case of logistic regression,

    e_n(w) = ln(1 + e^{−y_n wᵀx_n}) ).

The gradient of this single data point's error is used for the weight update in exactly the same way that the gradient of the full in-sample error was used in batch gradient descent. The gradient needed for the weight update of SGD is ∇e_n(w) (see Exercise 3.7), and the weight update is

    w ← w − η ∇e_n(w).

Insight into why SGD works can be gained by looking at the expected value of the change in the weight (the expectation is with respect to the random point that is selected). Since n is picked uniformly at random from {1, . . . , N}, the expected weight change is

    −η · (1/N) Σ_{n=1}^{N} ∇e_n(w).

This is exactly the same as the deterministic weight change from the batch gradient descent weight update. That is, 'on average' the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out. The computational cost is cheaper by a factor of N, though, since we compute the gradient for only one point per iteration, rather than for all N points as we do in batch gradient descent.

Notice that SGD is similar to PLA in that it decreases the error with respect to one data point at a time. Minimizing the error on one data point may interfere with the error on the rest of the data points that are not considered at that iteration. However, also similar to PLA, the interference cancels out on average, as we have just argued.
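Here is a minimal sketch of SGD for the logistic regression error, under the same assumptions and notation as the batch version above (our own function names; one randomly chosen point per update, with a fixed learning rate).

```python
import numpy as np

def logistic_regression_sgd(X, y, eta=0.1, num_epochs=100, rng=None):
    """Stochastic gradient descent on e_n(w) = ln(1 + exp(-y_n w.x_n))."""
    rng = np.random.default_rng() if rng is None else rng
    N, d_plus_1 = X.shape
    w = np.zeros(d_plus_1)
    for _ in range(num_epochs * N):
        n = rng.integers(N)                      # pick one point uniformly at random
        x_n, y_n = X[n], y[n]
        # gradient of the single-point error e_n(w)
        g_n = -y_n * x_n / (1.0 + np.exp(y_n * (w @ x_n)))
        w = w - eta * g_n                        # update using only this point
    return w
```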
Exercise 3.10
(a) Define the error on a single data point (x_n, y_n) to be

    e_n(w) = max(0, −y_n wᵀx_n).

Argue that PLA can be viewed as SGD on e_n with learning rate η = 1.
(b) For logistic regression with a very large w, argue that minimizing E_in using SGD is similar to PLA. This is another indication that the logistic regression weights can be used as a good approximation for classification.
SGD is successful in practice, often beating the batch version and other more sophisticated algorithms. In fact, SGD was an important part of the algorithm that won the million-dollar Netflix competition, discussed in Section 1.1. It scales well to large data sets, and is naturally suited to online learning, where
a stream of data presents itself to the learning algorithm sequentially. The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface. However, it is challenging to choose a suitable termination criterion for SGD. A good stopping criterion should consider the total error on all the data, which can be computationally demanding to evaluate at each iteration.

3.4 Nonlinear Transformation
All formulas for the linear model have used the sum

    wᵀx = Σ_{i=0}^{d} w_i x_i     (3.11)

as the main quantity in computing the hypothesis output. This quantity is linear, not only in the x_i's but also in the w_i's. A closer inspection of the corresponding learning algorithms shows that the linearity in w is the key property for deriving these algorithms; the x_i's are just constants as far as the algorithm is concerned. This observation opens the possibility for allowing nonlinear versions of the x_i's while still remaining in the analytic realm of linear models, because the form of Equation (3.11) remains linear in the w_i parameters.

Consider the credit limit problem, for instance. It makes sense that the 'years in residence' field would affect a person's credit, since it is correlated with stability. However, it is less plausible that the credit limit would grow linearly with the number of years in residence. More plausibly, there is a threshold (say 1 year) below which the credit limit is affected negatively and another threshold (say 5 years) above which the credit limit is affected positively. If x_i is the input variable that measures years in residence, then two nonlinear 'features' derived from it, namely [[x_i < 1]] and [[x_i > 5]], would allow a linear formula to reflect the credit limit better.

We have already seen the use of features in the classification of handwritten digits, where intensity and symmetry features were derived from input pixels. Nonlinear transforms can be further applied to those features, as we will see shortly, creating more elaborate features and improving the performance. The scope of linear methods expands significantly when we represent the input by a set of appropriate features.

3.4.1 The Z Space
Consider the situation in Figure 3.1(b) where a linear classifier can't fit the data. By transforming the inputs x_1, x_2 in a nonlinear fashion, we will be able to separate the data with more complicated boundaries while still using the
simple PLA as a building block. Let's start by looking at the circle in Figure 3.5(a), which is a replica of the non-separable case in Figure 3.1(b). The circle represents the equation

    x_1² + x_2² = 0.6.

That is, the nonlinear hypothesis h(x) = sign(−x_1² − x_2² + 0.6) separates the data set perfectly. We can view the hypothesis as a linear one after applying a nonlinear transformation on x. In particular, consider z_0 = 1, z_1 = x_1², and z_2 = x_2²:

    h(x) = sign( 0.6 · 1 + (−1) · x_1² + (−1) · x_2² )
         = sign( w̃_0 z_0 + w̃_1 z_1 + w̃_2 z_2 )
         = sign( w̃ᵀz ),

where the vector z is obtained from x through a nonlinear transform Φ,

    z = Φ(x).

We can plot the data in terms of z instead of x, as depicted in Figure 3.5(b). For instance, the point x_1 in Figure 3.5(a) is transformed to the point z_1 = Φ(x_1) in Figure 3.5(b), and the point x_2 is transformed to the point z_2 = Φ(x_2). The space Z, which contains the z vectors, is referred to as the feature space, since its coordinates are higher-level features derived from the raw input x. We designate different quantities in Z with a tilde version of their counterparts in X, e.g., the dimensionality of Z is d̃ and the weight vector is w̃.⁵ The transform Φ that takes us from X to Z is called a feature transform, which in this case is

    Φ(x) = (1, x_1², x_2²).     (3.12)

In general, some points in the Z space may not be valid transforms of any x ∈ X, and multiple points in X may be transformed to the same z ∈ Z, depending on the nonlinear transform Φ. The useful feature of the transform above is that the nonlinear hypothesis h (a circle) in the X space can be represented by a linear hypothesis (a line) in the Z space. Indeed, any linear hypothesis h̃ in Z corresponds to a (possibly nonlinear) hypothesis on x given by

    h(x) = h̃(Φ(x)).

⁵ Here Z = {1} × R²; we treat Z as d̃-dimensional since the added coordinate z_0 = 1 is fixed.
Figure 3.5: (a) The original data set that is not linearly separable, but separable by a circle. (b) The transformed data set that is linearly separable in the Z space. In the figure, x_1 maps to z_1 and x_2 maps to z_2; the circular separator in the X space maps to the linear separator in the Z space.

The set of these hypotheses h is denoted by H_Φ. For instance, when using the feature transform Φ in (3.12), each h ∈ H_Φ is a quadratic curve in X that corresponds to some line h̃ in Z.

Exercise 3.11
Consider the feature transform Φ in (3.12). What kind of boundary in X does a hyperplane w̃ in Z correspond to in the following cases of the signs of the weights w̃_0, w̃_1, w̃_2? Draw a picture that illustrates an example of each case.
Because the transformed data set (z_1, y_1), . . . , (z_N, y_N) in Figure 3.5(b) is linearly separable in the feature space Z, we can apply PLA on the transformed data set to obtain the PLA solution w̃, which gives us a final hypothesis g(x) = sign(w̃ᵀz) in the X space, where z = Φ(x). The whole process of applying the feature transform before running PLA for linear classification is depicted in Figure 3.6. The in-sample error in the input space X is the same as in the feature space Z, so E_in(g) = 0. Hyperplanes that achieve E_in(w̃) = 0 in Z correspond to separating curves in the original input space X. For instance,
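The following sketch (our own code, with a hypothetical `pla` helper standing in for the perceptron learning algorithm of Chapter 1) shows the mechanics of classifying with a feature transform: transform the inputs, run the linear algorithm in Z, and classify new points by transforming them first.

```python
import numpy as np

def phi_circle(X):
    """Feature transform (3.12): x = (x1, x2) -> z = (1, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1**2, x2**2])

def fit_in_z_space(X, y, pla):
    """Run a linear classifier (e.g. PLA) on the transformed data.

    `pla` is assumed to take (Z, y) and return a weight vector w_tilde.
    """
    Z = phi_circle(X)
    w_tilde = pla(Z, y)
    # final hypothesis in the original X space: g(x) = sign(w_tilde . phi(x))
    return lambda X_new: np.sign(phi_circle(X_new) @ w_tilde)
```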
[Figure: the nonlinear transform pipeline. 1. Original data in X. 2. Transform the data, z_n = Φ(x_n). 3. Separate the data in Z space with a line g̃(z) = sign(w̃ᵀz). 4. Classify in X space using g(x) = g̃(Φ(x)) = sign(w̃ᵀΦ(x)).]
Figure 3.6: The nonlinear transform for separating non-separable data.

as shown in Figure 3.6, the PLA may select the line w̃ = (0.6, −0.6, −1) that separates the transformed data (z_1, y_1), . . . , (z_N, y_N). The corresponding hypothesis g(x) = sign(0.6 − 0.6·x_1² − x_2²) will separate the original data (x_1, y_1), . . . , (x_N, y_N). In this case, the decision boundary is an ellipse in X.

How does the feature transform affect the VC bound (3.1)? If we honestly decide on the transform Φ before seeing the data, then with probability at least 1 − δ, the bound (3.1) remains true by using d_VC(H_Φ) as the VC dimension. For instance, consider the feature transform Φ in (3.12). We know that Z = {1} × R². Since H_Φ̃ is the perceptron in Z, d_VC(H_Φ) ≤ 3 (the inequality is because some points z ∈ Z may not be valid transforms of any x, so some dichotomies may not be realizable). We can then substitute N, d_VC(H_Φ), and δ into the VC bound. After running PLA on the transformed data set, if we succeed in
getting some g with E_in(g) = 0, we can claim that g will perform well out of sample.

It is very important to understand that the claim above is valid only if you decide on Φ before seeing the data or trying any algorithms. What if we first try using lines to separate the data, fail, and then use the circles? Then we are effectively using a model that contains both lines and circles, and d_VC is no longer 3.
Exercise 3.12
We know that in the Euclidean plane, the perceptron model H cannot implement all 16 dichotomies on 4 points. That is, m_H(4) < 16. Take the feature transform Φ in (3.12).
(a) Show that m_{H_Φ}(3) = 8.
(b) Show that m_{H_Φ}(4) < 16.
(c) Show that m_{H ∪ H_Φ}(4) = 16.
That is, if you use lines, d_VC = 3; if you use ellipses, d_VC = 3; if you use lines and ellipses, d_VC > 3.
Worse yet, if you actually look at the data (e.g., look at the data points) before deciding on a suitable Φ, you forfeit most of what you learned in Chapter 2. You have inadvertently explored a huge hypothesis space in your mind to come up with a specific Φ that would work for this data set. If you invoke a generalization bound now, you will be charged for the VC dimension of the full space that you explored in your mind, not just the space that Φ creates.

This does not mean that Φ should be chosen blindly. In the credit limit problem, for instance, we suggested nonlinear features based on the 'years in residence' field that may be more suitable for linear regression than the raw input. This was based on our understanding of the problem, not on 'snooping' into the training data. Therefore, we pay no price in terms of generalization, and we may well gain a dividend in performance because of a good choice of features.

The feature transform Φ can be general, as long as it is chosen before seeing the data set (as if we cannot emphasize this enough). For instance, you may have noticed that the feature transform in (3.12) only allows us to get very limited types of quadratic curves. Ellipses that do not center at the origin in X cannot correspond to a hyperplane in Z. To get all possible quadratic curves in X, we could consider the more general feature transform z = Φ₂(x),

    Φ₂(x) = (1, x_1, x_2, x_1², x_1 x_2, x_2²),     (3.13)

which gives us the flexibility to represent any quadratic curve in X by a hyperplane in Z (the subscript 2 of Φ is for polynomials of degree 2, i.e. quadratic curves). The price we pay is that Z is now five-dimensional instead of two-dimensional, and hence d_VC is doubled from 3 to 6.
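A minimal sketch of the general second-order transform (3.13), in the same style as the earlier snippet (our own function name):

```python
import numpy as np

def phi_2(X):
    """Second-order feature transform (3.13):
    (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])
```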
Exercise 3.13
Consider the feature transform z = Φ₂(x) in (3.13). How can we use a hyperplane w̃ in Z to represent the following boundaries in X?
(a) The parabola (x_1 − 3)² = x_2
(b) The circle (x_1 − 3)² + (x_2 − 4)² = 1
(c) The ellipse 2(x_1 − 3)² + (x_2 − 4)² = 1
(d) The hyperbola (x_1 − 3)² − (x_2 − 4)² = 1
(e) The ellipse 2(x_1 + x_2 − 3)² + (x_2 − x_1 − 4)² = 1
(f) The line 2x_1 + x_2 = 1
One may further extend Φ₂ to a feature transform Φ₃ for cubic curves in X, or more generally define the feature transform Φ_Q for degree-Q curves in X. The feature transform Φ_Q is called the Qth order polynomial transform.

The power of the feature transform should be used with care. It may not be worth it to insist on linear separability and employ a highly complex surface to achieve that. Consider the case of Figure 3.1(a). If we insist on a feature transform that linearly separates the data, it may lead to a significant increase of the VC dimension. As we see in Figure 3.7, no line can separate the training examples perfectly, and neither can any quadratic nor any third-order polynomial curve. Thus, we need to use a fourth-order polynomial transform:

    Φ₄(x) = (1, x_1, x_2, x_1², x_1 x_2, x_2², x_1³, x_1² x_2, x_1 x_2², x_2³, x_1⁴, x_1³ x_2, x_1² x_2², x_1 x_2³, x_2⁴).
If you look at the fourth-order decision boundary in Figure 3.7(b), you don't need the VC analysis to tell you that this is an overkill that is unlikely to generalize well to new data. A better option would have been to ignore the two misclassified examples in Figure 3.7(a), separate the other examples perfectly with the line, and accept the small but nonzero E_in. Indeed, sometimes our best bet is to go with a simpler hypothesis set while tolerating a small E_in.

While our discussion of feature transforms has focused on classification problems, these transforms can be applied equally to regression problems. Both linear regression and logistic regression can be implemented in the feature space Z instead of the input space X. For instance, linear regression is often coupled with a feature transform to perform nonlinear regression. The N by (d+1) input matrix X in the algorithm is replaced with the N by (d̃+1) matrix Z, while the output vector y remains the same.
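As a sketch of this idea, the code below (our own helper names, reusing the `phi_2` transform from the earlier snippet) performs nonlinear regression by running the pseudo-inverse solution of Chapter 3 on the transformed matrix Z instead of X.

```python
import numpy as np

def nonlinear_regression(X, y, phi):
    """Linear regression in the feature space Z = phi(X)."""
    Z = phi(X)                                    # N x (d~+1) transformed inputs
    w_tilde = np.linalg.pinv(Z) @ y               # pseudo-inverse solution in Z
    return lambda X_new: phi(X_new) @ w_tilde     # hypothesis evaluated via phi
```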
3.4.2 Computation and Generalization

Although using a larger Q gives us more flexibility in terms of the shape of decision boundaries in X, there is a price to be paid. Computation is one issue, and generalization is the other.

Computation is an issue because the feature transform Φ_Q maps a two-dimensional vector x to d̃ = Q(Q+3)/2 dimensions, which increases the memory and computational costs. Things could get worse if x is in a higher dimension to begin with.
Figure 3.7: Illustration of the nonlinear transform using a data set that is not linearly separable: (a) a line separates the data after omitting a few points; (b) a fourth-order polynomial separates all the points.
Exercise 3.14
Consider the Qth order polynomial transform Φ_Q for X = R^d. What is the dimensionality d̃ of the feature space Z (excluding the fixed coordinate z_0 = 1)? Evaluate your result on d ∈ {3, 5} and Q ∈ {3, 5}.
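For readers who want to check their answer numerically, the dimensionality asked for here is the number of monomials of degree 1 through Q in d variables; a small sketch (our own code) counts it with a binomial coefficient:

```python
from math import comb

def polynomial_feature_dim(d, Q):
    """Number of coordinates of Phi_Q on R^d, excluding the fixed z0 = 1.

    Monomials of degree <= Q in d variables number comb(d + Q, Q);
    subtracting 1 removes the constant monomial.
    """
    return comb(d + Q, Q) - 1

print(polynomial_feature_dim(2, 2))   # 5, matching the transform in (3.13)
print(polynomial_feature_dim(2, 4))   # 14, matching Phi_4 above
```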
The other important issue is generalization. If Φ_Q is the feature transform of a two-dimensional input space, there will be d̃ = Q(Q+3)/2 dimensions in Z, and d_VC(H_Φ_Q) can be as high as d̃ + 1. This means that the second term in the VC bound (3.1) can grow significantly. In other words, we would have a weaker guarantee that E_out will be small. For instance, if we use Φ₅₀, the VC dimension of H_Φ₅₀ could be as high as (50² + 3·50)/2 + 1 = 1326, instead of the original d_VC = 3. Applying the rule of thumb that the amount of data needed is proportional to the VC dimension, we would need hundreds of times more data than we would if we didn't use a feature transform, in order to achieve the same level of generalization error.
Exercise 3.15
High-dimensional feature transforms are by no means the only transforms that we can use. We can take the tradeoff in the other direction, and use low-dimensional feature transforms as well (to achieve an even lower generalization error bar).
Consider the following feature transform, which maps a d-dimensional x to a one-dimensional z, keeping only the kth coordinate of x:

    Φ_(k)(x) = (1, x_k).     (3.14)

Let H_k be the set of perceptrons in the feature space.
(a) Prove that d_VC(H_k) = 2.
(b) Prove that d_VC(∪_{k=1}^{d} H_k) ≤ 2(log₂ d + 1).
H_k is called the decision stump model on dimension k.

The problem of generalization when we go to a high-dimensional feature space is sometimes balanced by the advantage we get in approximating the target better. As we have seen in the case of using quadratic curves instead of lines, the transformed data became linearly separable, reducing E_in to 0. In general, when choosing the appropriate dimension for the feature transform, we cannot avoid the approximation-generalization tradeoff:

    higher d̃:  better chance of being linearly separable (E_in ≈ 0)
    lower d̃:   possibly not linearly separable (E_in > 0)
Therefore, choosing a feature transform before seeing the data is a non-trivial task. When we apply learning to a particular problem, some understanding of the problem can help in choosing features that work well. More generally, there are some guidelines for choosing a suitable transform, or a suitable model, which we will discuss in Chapter 4.
Exercise 3.16
Write down the steps of the algorithm that combines Φ₃ with linear regression. How about using Φ₁₀ instead? Where is the main computational bottleneck of the resulting algorithm?
Example 3.5. Let's revisit the handwritten digit recognition example. We can try a different way of decomposing the big task of separating ten digits into smaller tasks. One decomposition is to separate digit 1 from all the other digits. Using intensity and symmetry as our input variables, like we did before, the scatter plot of the training data is shown next. A line can roughly separate digit 1 from the rest, but a more complicated curve might do better.
We use linear regression for classification, first without any feature transform. The results are shown below (left). We get E_in = 2.13% and E_out = 2.38%.

[Figure: classification of the digit data ('1' versus 'not 1') using the linear model (E_in = 2.13%, E_out = 2.38%) and the third-order polynomial model (E_in = 1.75%, E_out = 1.87%); the axes are average intensity and symmetry.]
When we run linear regression with Φ₃, the third-order polynomial transform, we obtain a better fit to the data, with a lower E_in = 1.75%. The result is depicted on the right-hand side of the figure. In this case, the better in-sample fit also resulted in a better out-of-sample performance, with E_out = 1.87%.
Linear models, a final pitch. The linear model for classification or regression is an often overlooked resource in the arena of learning from data. Since efficient learning algorithms exist for linear models, they are low overhead. They are also very robust and have good generalization properties. A sound
policy to follow when learning from data is to first try a linear model. Because of the good generalization properties of linear models, not much can go wrong. If you get a good fit to the data (low E_in), then you are done. If you do not get a good enough fit to the data and decide to go for a more complex model, you will pay a price in terms of the VC dimension, as we have seen in Exercise 3.12, but the price is modest.
3.5 Problems
Problem 3.1
Consider the double semi-circle 'toy' learning task below. There are two semi-circles of width thk with inner radius rad, separated by sep as shown (red is −1 and blue is +1). The center of the top semi-circle is aligned with the middle of the edge of the bottom semi-circle. This task is linearly separable when sep ≥ 0, and not so for sep < 0. Set rad = 10, thk = 5 and sep = 5. Then, generate 2,000 examples uniformly, which means you will have approximately 1,000 examples for each class.
(a) Run the PLA starting from w = 0 until it converges. Plot the data and the final hypothesis.
(b) Repeat part (a) using the linear regression (for classification) to obtain w. Explain your observations.
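For readers implementing this problem, here is one way (our own code, not the book's) to generate the 2,000 examples uniformly over the two semi-circular regions using the rad, thk and sep parameters above; which class sits on top is an arbitrary choice here.

```python
import numpy as np

def generate_double_semicircle(N=2000, rad=10.0, thk=5.0, sep=5.0, rng=None):
    """Sample N points uniformly from the two semi-circle regions."""
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.choice([-1.0, 1.0], size=N)
    # uniform in an annulus: radius = sqrt of uniform on [rad^2, (rad+thk)^2]
    r = np.sqrt(rng.uniform(rad**2, (rad + thk)**2, size=N))
    theta = rng.uniform(0.0, np.pi, size=N)          # top half-annulus angles
    theta = np.where(labels > 0, theta, theta + np.pi)
    # bottom semi-circle shifted right by rad + thk/2 and down by sep
    x1 = r * np.cos(theta) + np.where(labels > 0, 0.0, rad + thk / 2)
    x2 = r * np.sin(theta) - np.where(labels > 0, 0.0, sep)
    return np.column_stack([x1, x2]), labels
```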
Problem 3.2
For the double semi-circle task in Problem 3.1, vary sep in the range {0.2, 0.4, . . . , 5}. Generate 2,000 examples and run the PLA starting with w = 0. Record the number of iterations PLA takes to converge. Plot sep versus the number of iterations taken for PLA to converge. Explain your observations. [Hint: Problem 1.3.]
Problem 3.3
For the double semi-circle task in Problem 3.1, set sep = −5 and generate 2,000 examples.
(a) What will happen if you run PLA on those examples?
(b) Run the pocket algorithm for 100,000 iterations and plot E_in versus the iteration number t.
(c) Plot the data and the final hypothesis in part (b).
(d) Use the linear regression algorithm to obtain the weights w, and compare this result with the pocket algorithm in terms of computation time and quality of the solution.
(e) Repeat (b)-(d) with a 3rd order polynomial feature transform.
Problem 3.4
In Problem 1.5, we introduced the Adaptive Linear Neuron (Adaline) algorithm for classification. Here, we derive Adaline from an optimization perspective.
(a) Consider E_n(w) = (max(0, 1 − y_n wᵀx_n))². Show that E_n(w) is continuous and differentiable. Write down the gradient ∇E_n(w).
(b) Show that E_n(w) is an upper bound for [[sign(wᵀx_n) ≠ y_n]]. Hence, (1/N) Σ_{n=1}^{N} E_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Argue that the Adaline algorithm in Problem 1.5 performs stochastic gradient descent on (1/N) Σ_{n=1}^{N} E_n(w).
Problem 3.5
(a) Consider E_n(w) = max(0, 1 − y_n wᵀx_n). Show that E_n(w) is continuous and differentiable except when y_n = wᵀx_n.
(b) Show that E_n(w) is an upper bound for [[sign(wᵀx_n) ≠ y_n]]. Hence, (1/N) Σ_{n=1}^{N} E_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Apply stochastic gradient descent on (1/N) Σ_{n=1}^{N} E_n(w) (ignoring the singular case of wᵀx_n = y_n) and derive a new perceptron learning algorithm.
Problem 3.6
Derive a linear programming algorithm to fit a linear model for classification using the following steps. A linear program is an optimization problem of the following form:

    min_z cᵀz   subject to   Az ≤ b.

A, b and c are parameters of the linear program and z is the optimization variable. This is such a well studied optimization problem that most mathematics software packages have canned optimization functions which solve linear programs.
(a) For linearly separable data, show that for some w, y_n(wᵀx_n) ≥ 1 for n = 1, . . . , N.
(b) Formulate the task of finding a separating w for separable data as a linear program. You need to specify what the parameters A, b, c are and what the optimization variable z is.
(c) If the data is not separable, the condition in (a) cannot hold for every n. Thus, introduce the violation ξ_n ≥ 0 to capture the amount of violation for example x_n. So, for n = 1, . . . , N,

    y_n(wᵀx_n) ≥ 1 − ξ_n,
    ξ_n ≥ 0.

Naturally, we would like to minimize the amount of violation. One intuitive approach is to minimize Σ_{n=1}^{N} ξ_n, i.e., we want w that solves

    min_{w, ξ} Σ_{n=1}^{N} ξ_n
    subject to   y_n(wᵀx_n) ≥ 1 − ξ_n,  ξ_n ≥ 0,

where the inequalities must hold for n = 1, . . . , N. Formulate this problem as a linear program.
(d) Argue that the linear program you derived in (c) and the optimization problem in Problem 3.5 are equivalent.
Problem 3.7
Use the linear programming algorithm from Problem 3.6 on the learning task in Problem 3.1 for the separable (sep = 5) and the non-separable (sep = −5) cases. Compare your results to the linear regression approach, with and without the 3rd order polynomial feature transform.
Problem 3.8
For linear regression, the out-of-sample error is

    E_out(h) = E[(h(x) − y)²].

Show that among all hypotheses, the one that minimizes E_out is given by

    h*(x) = E[y | x].

The function h* can be treated as a deterministic target function, in which case we can write y = h*(x) + ε(x), where ε(x) is an (input-dependent) noise variable. Show that ε(x) has expected value zero.
Problem 3.9
Assuming that XᵀX is invertible, show by direct comparison with Equation (3.4) that E_in(w) can be written as

    E_in(w) = (w − (XᵀX)⁻¹Xᵀy)ᵀ (XᵀX) (w − (XᵀX)⁻¹Xᵀy) + yᵀ(I − X(XᵀX)⁻¹Xᵀ)y.

Use this expression for E_in to obtain w_lin. What is the in-sample error? [Hint: The matrix XᵀX is positive definite.]
Problem 3.10
Exercise 3.3 studied some properties of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is an N by d+1 matrix and XᵀX is invertible. Show the following additional properties.
(a) Every eigenvalue of H is either 0 or 1. [Hint: Exercise 3.3(b).]
(b) Show that the trace of a symmetric matrix equals the sum of its eigenvalues. [Hint: Use the spectral theorem and the cyclic property of the trace. Note that the same result holds for non-symmetric matrices, but is a little harder to prove.]
(c) How many eigenvalues of H are 1? What is the rank of H? [Hint: Exercise 3.3(d).]
Problem 3.11
Consider the linear regression problem setup in Exercise 3.4, where the data comes from a genuine linear relationship with added noise. The noise for the different data points is assumed to be iid with zero mean and variance σ². Assume that the 2nd moment matrix Σ = E_x[xxᵀ] is non-singular. Follow the steps below to show that, with high probability, the out-of-sample error on average is

    E_out(w_lin) = σ² ( 1 + (d+1)/N + o(1/N) ).

(a) For a test point x, show that the error y − g(x) is

    ε − xᵀ(XᵀX)⁻¹Xᵀ ε,

where ε is the noise realization for the test point and ε is the vector of noise realizations on the data.
(b) Take the expectation with respect to the test point, i.e., x and ε, to obtain an expression for E_out. Show that

    E_out = σ² + trace( Σ (XᵀX)⁻¹ Xᵀ ε εᵀ X (XᵀX)⁻¹ ).

[Hints: a = trace(a) for any scalar a; trace(AB) = trace(BA); expectation and trace commute.]
(c) What is E_ε[ε εᵀ]?
(d) Take the expectation with respect to ε to show that, on average,

    E_out = σ² + (σ²/N) trace( Σ ((1/N) XᵀX)⁻¹ ).

Note that (1/N) XᵀX = (1/N) Σ_{n=1}^{N} x_n x_nᵀ is an N-sample estimate of Σ, so (1/N) XᵀX ≈ Σ. If (1/N) XᵀX = Σ, then what is E_out on average?
(e) Show that (after taking the expectation over the data noise) with high probability,

    E_out = σ² ( 1 + (d+1)/N + o(1/N) ).

[Hint: By the law of large numbers, (1/N) XᵀX converges in probability to Σ, and so by continuity of the inverse at Σ, ((1/N) XᵀX)⁻¹ converges in probability to Σ⁻¹.]
Problem 3.12
In linear regression, the in-sample predictions are given by ŷ = Hy, where H = X(XᵀX)⁻¹Xᵀ. Show that H is a projection matrix, i.e. H² = H. So ŷ is the projection of y onto some space. What is this space?
Problem 3.13
This problem creates a linear regression algorithm from a good algorithm for linear classification. As illustrated, the idea is to take the original data and shift it in one direction to get the +1 data points; then, shift it in the opposite direction to get the −1 data points.

[Figure: the original data for the one-dimensional regression problem; the shifted data viewed as a two-dimensional classification problem.]

More generally, the data (x_n, y_n) can be viewed as data points in R^{d+1} by treating the y value as the (d+1)th coordinate.
Now, construct positive and negative points

    D_+ = (x_1, y_1) + a, . . . , (x_N, y_N) + a,
    D_− = (x_1, y_1) − a, . . . , (x_N, y_N) − a,

where a is a perturbation parameter. You can now use the linear programming algorithm in Problem 3.6 to separate D_+ from D_−. The resulting separating hyperplane can be used as the regression 'fit' to the original data.
(a) How many weights are learned in the classification problem? How many weights are needed for the linear fit in the regression problem?
(b) The linear fit requires weights w, where h(x) = wᵀx. Suppose the weights returned by solving the classification problem are w_class. Derive an expression for w as a function of w_class.
(c) Generate a data set with N = 50, where x_n is uniform on [0, 1] and y_n is a target value plus zero-mean Gaussian noise with σ = 0.1. Plot D_+ and D_−.
(d) Give comparisons of the resulting fits from running the classification approach and the analytic pseudo-inverse algorithm for linear regression.
Problem 3.14
In a regression setting, assume the target function is linear, so f(x) = xᵀw_f, and y = Xw_f + ε, where the entries in ε are zero mean, iid with variance σ². In this problem, derive the bias and variance as follows.
(a) Show that the average function is ḡ(x) = f(x), no matter what the size of the data set, as long as XᵀX is invertible. What is the bias?
(b) What is the variance? [Hint: Problem 3.11.]
Problem 3.15
In the text we derived that the linear regression solution weights must satisfy XᵀXw = Xᵀy. If XᵀX is not invertible, the solution w_lin = (XᵀX)⁻¹Xᵀy won't work. In this event, there will be many solutions for w that minimize E_in. Here, you will derive one such solution. Let ρ be the rank of X. Assume that the singular value decomposition (SVD) of X is X = UΓVᵀ, where U ∈ R^{N×ρ} satisfies UᵀU = I_ρ, V ∈ R^{(d+1)×ρ} satisfies VᵀV = I_ρ, and Γ ∈ R^{ρ×ρ} is a positive diagonal matrix.
(a) Show that ρ < d + 1.
(b) Show that w_lin = VΓ⁻¹Uᵀy satisfies XᵀXw_lin = Xᵀy, and hence is a solution.
(c) Show that for any other solution that satisfies XᵀXw = Xᵀy, ‖w_lin‖ < ‖w‖. That is, the solution we have constructed is the minimum norm set of weights that minimizes E_in.
Problem 3.16
In Example 3.4, it is mentioned that the output of the final hypothesis g(x) learned using logistic regression can be thresholded to get a 'hard' (±1) classification. This problem shows how to use the risk matrix introduced in Example 1.1 to obtain such a threshold.

Consider fingerprint verification, as in Example 1.1. After learning from the data using logistic regression, you produce the final hypothesis

    g(x) = P[y = +1 | x],

which is your estimate of the probability that y = +1. Suppose that the cost matrix is given by

                         True classification
                  +1 (correct person)    −1 (intruder)
    you say  +1          0                   c_a
             −1          c_r                 0

For a new person with fingerprint x, you compute g(x) and you now need to decide whether to accept or reject the person (i.e., you need a hard classification). So, you will accept if g(x) ≥ κ, where κ is the threshold.
(a) Define cost(accept) as your expected cost if you accept the person. Similarly define cost(reject). Show that

    cost(accept) = (1 − g(x)) c_a,
    cost(reject) = g(x) c_r.

(b) Use part (a) to derive a condition on g(x) for accepting the person and hence show that

    κ = c_a / (c_a + c_r).

(c) Use the cost matrices for the Supermarket and CIA applications in Example 1.1 to compute the threshold κ for each of these two cases. Give some intuition for the thresholds you get.
Problem 3.17
Consider a function

    E(u, v) = e^u + e^{2v} + e^{uv} + u² − 3uv + 4v² − 3u − 5v.

(a) Approximate E(u + Δu, v + Δv) by Ê₁(Δu, Δv), where Ê₁ is the first-order Taylor expansion of E around (u, v) = (0, 0). Suppose Ê₁(Δu, Δv) = a_u Δu + a_v Δv + a. What are the values of a_u, a_v, and a?
(b) Minimize Ê₁ over all possible (Δu, Δv) such that ‖(Δu, Δv)‖ = 0.5. In this chapter, we proved that the optimal column vector [Δu*, Δv*]ᵀ is parallel to the column vector −∇E(u, v), which is called the negative gradient direction. Compute the optimal (Δu, Δv) and the resulting E(u + Δu, v + Δv).
(c) Approximate E(u + Δu, v + Δv) by Ê₂(Δu, Δv), where Ê₂ is the second-order Taylor expansion of E around (u, v) = (0, 0). Suppose

    Ê₂(Δu, Δv) = b_{uu}(Δu)² + b_{vv}(Δv)² + b_{uv}(Δu)(Δv) + b_u Δu + b_v Δv + b.

What are the values of b_{uu}, b_{vv}, b_{uv}, b_u, b_v, and b?
(d) Minimize Ê₂ over all possible (Δu, Δv) (regardless of length). Use the fact that ∇²E(u, v) (the Hessian matrix at (0, 0)) is positive definite to prove that the optimal column vector

    [Δu*, Δv*]ᵀ = −(∇²E(u, v))⁻¹ ∇E(u, v),

which is called the Newton direction.
(e) Numerically compute the following values: (i) the vector (Δu, Δv) of length 0.5 along the Newton direction, and the resulting E(u + Δu, v + Δv); (ii) the vector (Δu, Δv) of length 0.5 that minimizes E(u + Δu, v + Δv), and the resulting E(u + Δu, v + Δv). (Hint: Let Δu = 0.5 sin θ.) Compare the values of E(u + Δu, v + Δv) in (b), (e-i), and (e-ii). Briefly state your findings.

The negative gradient direction and the Newton direction are quite fundamental for designing optimization algorithms. It is important to understand these directions and put them in your toolbox for designing learning algorithms.
Problem 3.18
Take the feature transform Φ₂ in Equation (3.13) as Φ.
(a) Show that d_VC(H_Φ) ≤ 6.
(b) Show that d_VC(H_Φ) > 4. [Hint: Exercise 3.12.]
(c) Give an upper bound on d_VC(H_Φ_k) for X = R^d.
(d) Define a transform Φ̃₂ that maps x into a higher-dimensional space than Φ₂ does, while spanning the same set of quadratic curves. Argue that d_VC(H_Φ̃₂) = d_VC(H_Φ₂). In other words, while the dimension of Φ̃₂ exceeds that of Φ₂, the VC dimension does not increase. Thus, the dimension of Φ only gives an upper bound of d_VC(H_Φ), and the exact value of d_VC(H_Φ) can depend on the components of the transform.
Problem 3.19
A Transformer thinks the following procedures would work well in learning from two-dimensional data sets of any size. Please point out if there are any potential problems in the procedures:
(a) Use the feature transform

    Φ(x) = (0, . . . , 0, 1, 0, . . . , 0)  (with the 1 in the nth position) if x = x_n, and (0, . . . , 0) otherwise,

before running PLA.
(b) Use a feature transform Φ whose components are centered at the data points x_n, using some very small scale γ, before running PLA.
(c) Use a feature transform Φ that consists of components centered at all the integer grid points (i, j), with i ∈ {. . . , −1, 0, 1, . . .} and j ∈ {. . . , −1, 0, 1, . . .}, before running PLA, with γ = 1.
Chapter 4

Overfitting

Paraskavedekatriaphobia¹ (fear of Friday the 13th), and superstitions in general, are perhaps the most illustrious cases of the human ability to overfit. Unfortunate events are memorable, and given a few such memorable events, it is natural to try and find an explanation. In the future, will there be more unfortunate events on Friday the 13ths than on any other day?

Overfitting is the phenomenon where fitting the observed facts (data) well no longer indicates that we will get a decent out-of-sample error, and may actually lead to the opposite effect. You have probably seen cases of overfitting when the learning model is more complex than is necessary to represent the target function. The model uses its additional degrees of freedom to fit idiosyncrasies in the data (for example, noise), yielding a final hypothesis that is inferior. Overfitting can occur even when the hypothesis set contains only functions which are far simpler than the target function, and so the plot thickens.

The ability to deal with overfitting is what separates professionals from amateurs in the field of learning from data. We will cover three themes: When does overfitting occur? What are the tools to combat overfitting? How can one estimate the degree of overfitting and 'certify' that a model is good, or better than another? Our emphasis will be on techniques that work well in practice.
4.1 When Does Overfitting Occur?

Overfitting literally means "fitting the data more than is warranted." The main case of overfitting is when you pick the hypothesis with lower E_in, and it results in higher E_out. This means that E_in alone is no longer a good guide for learning. Let us start by identifying the cause of overfitting.

¹ From the Greek paraskevi (Friday), dekatreis (thirteen), phobia (fear).
Consider a simple one-dimensional regression problem with five data points. We do not know the target function, so let's select a general model, maximizing our chance to capture the target function. Since 5 data points can be fit by a 4th order polynomial, we select 4th order polynomials. The result is shown on the right.

The target function is a 2nd order polynomial (the blue curve), with a little added noise in the data points. Though the target is simple, the learning algorithm used the power of the 4th order polynomial to fit the data exactly, but the result does not look anything like the target function. The data has been 'overfit.' The little noise in the data has misled the learning, for if there were no noise, the fitted red curve would exactly match the target. This is a typical overfitting scenario, in which a complex model uses its additional degrees of freedom to 'learn' the noise.

The fit has zero in-sample error but huge out-of-sample error, so this is a case of bad generalization (as discussed in Chapter 2), a key outcome when overfitting is occurring. However, our definition of overfitting goes beyond bad generalization for any given hypothesis. Instead, overfitting applies to a process: in this case, the process of picking a hypothesis with lower and lower E_in resulting in higher and higher E_out.

4.1.1 A Case Study: Overfitting with Polynomials
Let's dig deeper to gain a better understanding of when overfitting occurs. We will illustrate the main concepts using data in one dimension and polynomial regression, a special case of a linear model that uses the feature transform x → (1, x, x², . . .). Consider the two regression problems below:

[Figure: (a) a 10th order target function with noisy data; (b) a 50th order target function with noiseless data.]

In both problems, the target function is a polynomial and the data set contains 15 data points. In (a), the target function is a 10th order polynomial
Figure 4.1: Fits using 2nd and 10th order polynomials to 15 data points. In (a), the data are noisy and the target is a 10th order polynomial. In (b), the data are noiseless and the target is a 50th order polynomial. (Each plot shows the data, the 2nd order fit, and the 10th order fit.)
and the sampled data are noisy (the data do not lie on the target function curve). In (b), the target function is a 50th order polynomial and the data are noiseless. The best 2nd and 10th order fits are shown in Figure 4.1, and the in-sample and out-of-sample errors are given in the following table:
             10th order noisy target        50th order noiseless target
             2nd Order    10th Order        2nd Order    10th Order
    E_in     0.050        0.034             0.029        10^-5
    E_out    0.127        9.00              0.120        7680
What the learning algorithm sees is the data, not the target function. In both cases, the 10th order polynomial heavily overfits the data, and results in a nonsensical final hypothesis which does not resemble the target function. The 2nd order fits do not capture the full nature of the target function either, but they do at least capture its general trend, resulting in significantly lower out-of-sample error. The 10th order fits have lower in-sample error and higher out-of-sample error, so this is indeed a case of overfitting that results in pathologically bad generalization.

Exercise 4.1
Let H₂ and H₁₀ be the 2nd and 10th order hypothesis sets respectively. Specify these sets as parameterized sets of functions. Show that H₂ ⊂ H₁₀.
These two examples reveal some surprising phenomena. Let's consider first the 10th order target function, Figure 4.1(a). Here is the scenario. Two learners, O (for 'overfitted') and R (for 'restricted'), know that the target function is a 10th order polynomial, and that they will receive 15 noisy data points. Learner O
[Figure: learning curves for H₂ (left) and H₁₀ (right), plotting expected error versus number of data points N.]

Figure 4.2: Overfitting is occurring for N in the shaded gray region, because by choosing H₁₀, which has better E_in, you get worse E_out.

uses model H₁₀, which is known to contain the target function, and finds the best fitting hypothesis to the data. Learner R uses model H₂, and similarly finds the best fitting hypothesis to the data.

The surprising thing is that learner R wins (lower out-of-sample error) by using the smaller model, even though she has knowingly given up the ability to implement the true target function. Learner R trades off a worse in-sample error for a huge gain in the generalization error, ultimately resulting in lower out-of-sample error. What is funny here? A folklore belief about learning is that best results are obtained by incorporating as much information about the target function as is available. But as we see here, even if we know the order of the target and naively incorporate this knowledge by choosing the model accordingly (H₁₀), the performance is inferior to that demonstrated by the more 'stable' 2nd order model.

The models H₂ and H₁₀ were in fact the ones used to generate the learning curves in Chapter 2, and we use those same learning curves to illustrate overfitting in Figure 4.2. If you mentally superimpose the two plots, you can see that there is a range of N for which H₁₀ has lower E_in but higher E_out than H₂ does, a case in point of overfitting.

Is learner R always going to prevail? Certainly not. For example, if the data were noiseless, then learner O would indeed recover the target function exactly from 15 data points, while learner R would have no hope. This brings us to the second example, Figure 4.1(b). Here, the data is noiseless, but the target function is very complex (a 50th order polynomial). Again learner R wins, and again because learner O heavily overfits the data. Overfitting is not a disease inflicted only upon complex models with many more degrees of freedom than warranted by the complexity of the target function. In fact, the reverse is true here, and overfitting is just as bad. What matters is how the model complexity matches the quantity and quality of the data we have, not how it matches the target function.
4.1.2 Catalysts for Overfitting
A skeptical reader should ask whether the examples in Figure 4.1 are just pathological constructions created by the authors, or is overfitting a real phenomenon which has to be considered carefully when learning from data? The next exercise guides you through an experimental design for studying overfitting within our current setup. We will use the results from this experiment to serve two purposes: to convince you that overfitting is not the result of some rare pathological construction, and to unravel some of the conditions conducive to overfitting.
Exercise 4.2 [Experimental design for studying overfitting]
This is a reading exercise that sets up an experimental framework to study various aspects of overfitting. The reader interested in implementing the experiment can find the details fleshed out in the Problems at the end of this chapter. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H₂ and H₁₀.
The target is a degree-Q_f polynomial, which we write f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where the L_q(x) are polynomials of increasing complexity (the Legendre polynomials). The data set is D = (x₁, y₁), . . . , (x_N, y_N), where y_n = f(x_n) + σ ε_n and the ε_n are iid standard Normal random variates.
For a single experiment, with specified values of Q_f, N and σ², generate a random target, generate a data set, and let g₂ ∈ H₂ and g₁₀ ∈ H₁₀ be the best-fit hypotheses to the data, with respective out-of-sample errors E_out(g₂) and E_out(g₁₀).
Vary Q_f, N, σ², and for each combination of parameters, run a large number of experiments, each time computing E_out(g₂) and E_out(g₁₀). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ²) using H₂ and H₁₀.
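For readers who want to try this, here is a compact sketch of one run of the experiment (our own code; the Legendre targets come from numpy's polynomial module, the target normalization is omitted for brevity, the fits are least-squares polynomial fits, and E_out is estimated on a large fresh test set rather than computed in closed form).

```python
import numpy as np
from numpy.polynomial import legendre

def one_overfit_experiment(Q_f=20, N=15, sigma=1.0, rng=None):
    """Return (E_out(g2), E_out(g10)) for one randomly generated scenario."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal(Q_f + 1)             # random Legendre coefficients
    f = lambda x: legendre.legval(x, a)          # target f(x) = sum a_q L_q(x)

    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)    # noisy training data

    x_test = rng.uniform(-1, 1, 10000)           # fresh points to estimate E_out
    y_test = f(x_test)                           # noiseless target values
    errors = []
    for degree in (2, 10):
        coeffs = np.polyfit(x, y, degree)        # least-squares fit in H_degree
        errors.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return tuple(errors)                         # (E_out(g2), E_out(g10))
```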
Exercise 4.2 sets up an experiment to study how the noise level σ², the target complexity Q_f, and the number of data points N relate to overfitting. We compare the final hypothesis g₁₀ ∈ H₁₀ (larger model) to the final hypothesis g₂ ∈ H₂ (smaller model). Clearly, E_in(g₁₀) ≤ E_in(g₂), since g₁₀ has more degrees of freedom to fit the data. What is surprising is how often g₁₀ overfits the data, resulting in E_out(g₁₀) > E_out(g₂). Let us define the overfit measure as E_out(g₁₀) − E_out(g₂). The more positive this measure is, the more severe overfitting would be.

Figure 4.3 shows how the extent of overfitting depends on certain parameters of the learning problem (the results are from our implementation of Exercise 4.2). In the figure, the colors map to the level of overfitting, with redder
[Figure: two color plots of the overfit measure E_out(g₁₀) − E_out(g₂): (a) stochastic noise, plotting σ² versus the number of data points N; (b) deterministic noise, plotting Q_f versus the number of data points N.]

Figure 4.3: How overfitting depends on the noise σ², the target function complexity Q_f, and the number of data points N. The colors map to the overfit measure E_out(g₁₀) − E_out(g₂). In (a), we see how overfitting depends on σ² and N, with Q_f = 20; as σ² increases, we are adding stochastic noise to the data. In (b), we see how overfitting depends on Q_f and N, with σ² = 0.1; as Q_f increases, we are adding deterministic noise to the data.
regions showing worse overfitting. These red regions are large; overfitting is real, and here to stay.

Figure 4.3(a) reveals that there is less overfitting when the noise level σ² drops or when the number of data points N increases (the linear pattern in Figure 4.3(a) is typical). Since the 'signal' f is normalized to E[f²] = 1, the noise level σ² is automatically calibrated to the signal level. Noise leads the learning astray, and the larger, more complex model is more susceptible to noise than the simpler one because it has more ways to go astray. Figure 4.3(b) reveals that the target function complexity Q_f affects overfitting in a similar way to noise, albeit nonlinearly. To summarize:

    number of data points up    ->  overfitting goes down
    noise up                    ->  overfitting goes up
    target complexity up        ->  overfitting goes up

Deterministic noise. Why does a higher target complexity lead to more overfitting when comparing the same two models? The intuition is that for a given learning model, there is a best approximation to the target function. The part of the target function 'outside' this best fit acts like noise in the data. We can call this deterministic noise, to differentiate it from the random stochastic noise. Just as stochastic noise cannot be modeled, the deterministic noise is that part of the target function which cannot be modeled. The learning algorithm should not attempt to fit the noise; however, it cannot distinguish noise from signal. On a finite data set, the algorithm inadvertently uses some
Figure 4.4: Deterministic noise. h* is the best fit to f in H₂. The shading illustrates deterministic noise for this learning problem.
of the degrees of freedom to fit the noise, which can result in overfitting and a spurious final hypothesis. Figure 4.4 illustrates deterministic noise for a quadratic model fitting a more complex target function.

While stochastic and deterministic noise have similar effects on overfitting, there are two basic differences between the two types of noise. First, if we generated the same data (x values) again, the deterministic noise would not change but the stochastic noise would. Second, different models capture different 'parts' of the target function, hence the same data set will have different deterministic noise depending on which model we use. In reality, we work with one model at a time and have only one data set on hand. Hence, we have one realization of the noise to work with, and the algorithm cannot differentiate between the two types of noise.
Exercise 4.3
Deterministic noise depends on H, as some models approximate f better than others.
(a) Assume H is fixed and we increase the complexity of f. Will deterministic noise in general go up or down? Is there a higher or lower tendency to overfit?
(b) Assume f is fixed and we decrease the complexity of H. Will deterministic noise in general go up or down? Is there a higher or lower tendency to overfit? [Hint: There is a race between two factors that affect overfitting in opposite ways, but one wins.]
The bias-variance decomposition, which we discussed in Section 2.3.1 (see also Problem 2.22), is a useful tool for understanding how noise affects performance:

    E_D[E_out] = σ² + bias + var.

The first two terms reflect the direct impact of the stochastic and deterministic noise. The variance of the stochastic noise is σ², and the bias is directly
related to the deterministic noise, in that it captures the model's inability to approximate f. The var term is indirectly impacted by both types of noise, capturing a model's susceptibility to being led astray by the noise.
4.2 Regularization
Regularization is our first weapon to combat overfitting. It constrains the learning algorithm to improve out-of-sample error, especially when noise is present. To whet your appetite, look at what a little regularization can do for our first overfitting example in Section 4.1. Though we only used a very small 'amount' of regularization, the fit improves dramatically.
[Figure: the overfitting example of Section 4.1, fit without regularization (a wild 4th order fit) and with regularization (a fit close to the target); each plot shows the data, the target, and the fit.]
Now that we have your attention, we would like to come clean. Regularization is as much an art as it is a science. Most of the methods used successfully in practice are heuristic methods. However, these methods are grounded in a mathematical framework that is developed for special cases. We will discuss both the mathematical and the heuristic, trying to maintain a balance that reflects the reality of the field.

Speaking of heuristics, one view of regularization is through the lens of the VC bound, which bounds E_out using a model complexity penalty Ω(H):
    E_out(h) ≤ E_in(h) + Ω(H)    for all h ∈ H.     (4.1)

So, we are better off if we fit the data using a simple H. Extrapolating one step further, we should be better off by fitting the data using a 'simple' h from H. The essence of regularization is to concoct a measure Ω(h) for the complexity of an individual hypothesis. Instead of minimizing E_in(h) alone, one minimizes a combination of E_in(h) and Ω(h). This avoids overfitting by constraining the learning algorithm to fit the data well using a simple hypothesis.
Example 4.1. One popular regularization technique is weight decay, which measures the complexity of a hypothesis h by the size of the coefficients used to represent h (e.g., in a linear model). This heuristic prefers mild lines with
small offset and slope, to wild lines with bigger offset and slope. We will get to the mechanics of weight decay shortly, but for now let's focus on the outcome. We apply weight decay to fitting the target f(x) = sin(πx) using N = 2 data points (as in Example 2.8). We sample x uniformly in [−1, 1], generate a data set and fit a line to the data (our model is H₁). The figures below show the resulting fits on the same (random) data sets, with and without regularization.

[Figure: the fitted lines on the random data sets, without regularization (left) and with regularization (right).]
Without regularization, the learned function varies extensively, depending on the data set. As we have seen in Example 2.8, a constant model scored E_out = 0.75, handily beating the performance of the (unregularized) linear model, which scored E_out = 1.90. With a little weight decay regularization, the fits to the same data sets are considerably less volatile. This results in a significantly lower E_out = 0.56 that beats both the constant model and the unregularized linear model.

The bias-variance decomposition helps us to understand how the regularized version beat both the unregularized version as well as the constant model.
[Figure: the average hypothesis ḡ (red), with the gray shaded region indicating ḡ(x) ± sqrt(var(x)). Without regularization: bias = 0.21, var = 1.69. With regularization: bias = 0.23, var = 0.33.]

As expected, regularization reduced the var term rather dramatically, from 1.69 down to 0.33. The price paid in terms of the bias (the quality of the average fit) was
modest, only slightly increasing from 0.21 to 0.23. The result was a significant decrease in the expected out-of-sample error, because bias + var decreased. This is the crux of regularization. By constraining the learning algorithm to select 'simpler' hypotheses from H, we sacrifice a little bias for a significant gain in the var.
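The following sketch (our own code) reproduces the flavor of this example: fitting a line to N = 2 samples of sin(πx), with and without a weight decay penalty, using the regularized least-squares solution discussed below; the value of the penalty used here is an arbitrary illustrative choice, not the one used in the text.

```python
import numpy as np

def fit_line(x, y, lam=0.0):
    """Least-squares line fit with an optional weight-decay (ridge) penalty."""
    Z = np.column_stack([np.ones_like(x), x])              # features (1, x)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

def expected_eout(lam, num_trials=20000, rng=None):
    """Estimate the expected E_out of fitting sin(pi x) from 2 random points."""
    rng = np.random.default_rng() if rng is None else rng
    x_test = np.linspace(-1, 1, 1000)
    y_test = np.sin(np.pi * x_test)
    total = 0.0
    for _ in range(num_trials):
        x = rng.uniform(-1, 1, 2)                          # N = 2 data points
        w = fit_line(x, np.sin(np.pi * x), lam)
        total += np.mean((w[0] + w[1] * x_test - y_test) ** 2)
    return total / num_trials

# expected_eout(0.0) is large (wild fits); expected_eout(0.1) is much smaller
```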
This example also illustrates why regularization is needed. The linear model is too sophisticated for the amount of data we have, since a line can perfectly fit any 2 points. This need would persist even if we changed the target function, as long as we have either stochastic or deterministic noise. The need for regularization depends on the quantity and quality of the data. Given our meager data set, our choices were either to take a simpler model, such as the model with constant functions, or to constrain the linear model. It turns out that using the complex model but constraining the algorithm toward simpler hypotheses gives us more flexibility, and ends up giving the best E_out. In practice, this is the rule, not the exception.

Enough heuristics. Let's develop the mathematics of regularization.

4.2.1 A Soft Order Constraint

In this section, we derive a regularization method that applies to a wide variety of learning problems. To simplify the math, we will use the concrete setting of regression using Legendre polynomials, the polynomials of increasing complexity used in Exercise 4.2. So, let's first formally introduce you to the Legendre polynomials.

Consider a learning model where H is the set of polynomials in one variable x ∈ [−1, 1]. Instead of expressing the polynomials in terms of consecutive powers of x, we will express them as a combination of Legendre polynomials in x. Legendre polynomials are a standard set of polynomials with nice analytic properties that result in simpler derivations. The zeroth-order Legendre polynomial is the constant L₀(x) = 1, and the first few Legendre polynomials are illustrated below.
[Figure: the first few Legendre polynomials, e.g. L₂(x) = ½(3x² − 1), L₃(x) = ½(5x³ − 3x), L₄(x) = ⅛(35x⁴ − 30x² + 3), L₅(x) = ⅛(63x⁵ − 70x³ + 15x).]
Polynomial models are a special case of linear models in a space Z, under a nonlinear transformation Φ: X → Z. Here, for the Qth order polynomial model, Φ transforms x into a vector z of Legendre polynomials,

    z = (1, L₁(x), . . . , L_Q(x)).

Our hypothesis set H_Q is a linear combination of these polynomials,

    H_Q = { h | h(x) = wᵀz = Σ_{q=0}^{Q} w_q L_q(x) },

where L₀(x) = 1. As usual, we will sometimes refer to the hypothesis h by its weight vector w. Since each h is linear in w, we can use the machinery of linear regression from Chapter 3 to minimize the squared error

    E_in(w) = (1/N) Σ_{n=1}^{N} (wᵀz_n − y_n)².     (4.2)

The case of polynomial regression with the squared-error measure illustrates the main ideas of regularization well, and facilitates a solid mathematical derivation. Nonetheless, our discussion will generalize in practice to non-linear, multidimensional settings with more general error measures. The baseline algorithm (without regularization) is to minimize E_in over the hypotheses in H_Q to produce the final hypothesis g(x) = w_linᵀz, where w_lin = argmin_w E_in(w).²
Exercise 4.4
Let Z = [z₁ . . . z_N]ᵀ be the data matrix (assume Z has full column rank); let w_lin = (ZᵀZ)⁻¹Zᵀy; and let H = Z(ZᵀZ)⁻¹Zᵀ (the hat matrix of Exercise 3.3). Show that

    E_in(w) = (1/N) [ (w − w_lin)ᵀ ZᵀZ (w − w_lin) + yᵀ(I − H)y ],     (4.3)

where I is the identity matrix.
(a) What value of w minimizes E_in?
(b) What is the minimum in-sample error?
The task of regularization, which results in a final hypothesis w_reg instead of the simple w_lin, is to constrain the learning so as to prevent overfitting the

² We used w̃ and d̃ for the weight vector and dimension in Z. Since we are explicitly dealing with polynomials and Z is the only space around, we use w and Q to simplify the notation.
data. We have already seen an example of constraining the learning: the set H₂ can be thought of as a constrained version of H₁₀, in the sense that some of the H₁₀ weights are required to be zero. That is, H₂ is a subset of H₁₀ defined by H₂ = {w | w ∈ H₁₀; w_q = 0 for q ≥ 3}. Requiring some weights to be 0 is a hard constraint. We have seen that such a hard constraint on the order can help; for example, H₂ is better than H₁₀ when there is a lot of noise and N is small. Instead of requiring some weights to be zero, we can force the weights to be small but not necessarily zero through a softer constraint such as

    Σ_{q=0}^{Q} w_q² ≤ C.

This is a 'soft order' constraint because it only encourages each weight to be small, without changing the order of the polynomial by explicitly setting some weights to zero. The in-sample optimization problem becomes:

    minimize E_in(w)   subject to   wᵀw ≤ C.     (4.4)

The data determines the optimal weight sizes, given the total budget C, which determines the amount of regularization; the larger C is, the weaker the constraint and the smaller the amount of regularization. We can define the soft-order-constrained hypothesis set H(C) by

    H(C) = { h | h(x) = wᵀz, wᵀw ≤ C }.

Equation (4.4) is equivalent to minimizing E_in over H(C). If C₁ < C₂, then H(C₁) ⊂ H(C₂), and so d_VC(H(C₁)) ≤ d_VC(H(C₂)), and we expect better generalization with H(C₁). Let the regularized weights w_reg be the solution to (4.4).

Solving for w_reg. If w_linᵀw_lin ≤ C, then w_reg = w_lin, because w_lin ∈ H(C). If w_lin ∉ H(C), then not only is w_regᵀw_reg ≤ C, but in fact w_regᵀw_reg = C (w_reg uses the entire budget C; see the Problems). We thus need to minimize E_in subject to the equality constraint wᵀw = C. The situation is illustrated to the right. The weights w must lie on the surface of the sphere wᵀw = C; the normal vector to this surface at w is the vector w itself (also in red). A surface of constant E_in is shown in blue; this surface is a quadratic surface (see Exercise 4.4) and the normal to this surface is ∇E_in(w). In this case, w cannot be optimal because ∇E_in(w) is not parallel to the red normal vector. This means that ∇E_in(w) has some non-zero component along the constraint surface, and by moving a small amount in the opposite direction of this component, we can improve E_in, while still
remaining on the surface If Wre is to be optimal, then r some positive parameter A ie, Ein must be parallel to Wre, the normal vector to t he constraint surace (the scaling by 2 is r mathematical convenience and the negative sign is because
Ein and w are in opposite directions quivalently,
because V(wTw)
Wre satises
2w So, fr some A > 0, Wre locally minimizes (45
The parameter λ_C and the vector w_reg (both of which depend on C and the data) must be chosen so as to simultaneously satisfy the gradient equality and the weight norm constraint w_reg^T w_reg = C.³ That λ_C > 0 is intuitive, since we are enforcing smaller weights, and minimizing E_in(w) + λ_C w^T w would not lead to smaller weights if λ_C were negative. Note that if w_lin^T w_lin ≤ C, then w_reg = w_lin and minimizing (4.5) still holds, with λ_C = 0. Therefore, we have an equivalence between solving the constrained problem (4.4) and the unconstrained minimization of (4.5). This equivalence means that minimizing (4.5) is similar to minimizing E_in using a smaller hypothesis set, which in turn means that we can expect better generalization by minimizing (4.5) than by just minimizing E_in.

Other variations of the constraint in (4.4) can be used to emphasize some weights over others. Consider the constraint
$$\sum_{q=0}^{Q} \gamma_q\, w_q^2 \le C.$$
The importance γ_q given to weight w_q determines the type of regularization. For example, γ_q = q encourages a low-order fit, and γ_q = (1+q)^{-1} encourages a high-order fit. In extreme cases, one recovers hard-order constraints by choosing some γ_q = 0 and some γ_q → ∞.
Exercise 4.5
[Tikhonov regularizer] A more general soft constraint is the Tikhonov regularization constraint
$$w^T \Gamma^T \Gamma\, w \le C,$$
which can capture relationships among the w_i (the matrix Γ is the Tikhonov regularizer).
(a) What should Γ be to obtain the constraint Σ_q w_q² ≤ C?
(b) What should Γ be to obtain the constraint (Σ_q w_q)² ≤ C?
³ λ_C is known as a Lagrange multiplier, and an alternate derivation of these same results can be obtained via the theory of Lagrange multipliers for constrained optimization.
4.2.2 Weight Decay and Augmented Error
The soft-order constraint for a given value of C is a constrained minimization of E_in. Equation (4.5) suggests that we may equivalently solve an unconstrained minimization of a different function. Let's define the augmented error
$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \lambda\, w^T w, \qquad (4.6)$$
where λ ≥ 0 is now a free parameter at our disposal. The augmented error has two terms. The first is the in-sample error, which we are used to minimizing, and the second is a penalty term. Notice that this fits the heuristic view of regularization that we discussed earlier, where the penalty for complexity is defined for each individual h instead of for H as a whole. When λ = 0, we have the usual in-sample error. For λ > 0, minimizing the augmented error corresponds to minimizing a penalized in-sample error. The value of λ controls the amount of regularization. The penalty term w^T w enforces a tradeoff between making the in-sample error small and making the weights small, and has become known as weight decay. As discussed in Problem 4.8, if we minimize the augmented error using an iterative method like gradient descent, we will have a reduction of the in-sample error together with a gradual shrinking of the weights, hence the name 'weight decay.' In the statistics community, this type of penalty term is a form of ridge regression.

There is an equivalence between the soft order constraint and augmented error minimization. In the soft-order constraint, the amount of regularization is controlled by the parameter C. From (4.5), there is a particular λ (depending on C and the data) for which minimizing the augmented error E_aug(w) leads to the same final hypothesis w_reg. A larger C allows larger weights and is a weaker soft-order constraint; this corresponds to smaller λ, i.e., less emphasis on the penalty term w^T w in the augmented error. For a particular data set, the optimal value C* leading to minimum out-of-sample error with the soft-order constraint corresponds to an optimal value λ* in the augmented error minimization. If we can find λ*, we can get the minimum E_out.

Have we gained from the augmented error view? Yes, because augmented error minimization is unconstrained, which is generally easier than constrained minimization. For example, we can obtain a closed form solution for linear models, or use a method like stochastic gradient descent to carry out the minimization. However, augmented error minimization is not so easy to interpret. There are no values for the weights which are explicitly forbidden, as there are in the soft-order constraint. For a given C, the soft-order constraint corresponds to selecting a hypothesis from the smaller set H(C), and so from our VC analysis we should expect better generalization when C decreases (λ increases). It is through the relationship between λ and C that one has a theoretical justification of weight decay as a method for regularization.

We focused on the soft-order constraint w^T w ≤ C, with corresponding augmented error E_aug(w) = E_in(w) + λ w^T w. However, our discussion applies more generally.
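To make the weight-decay mechanism concrete, here is a minimal sketch (my own, not from the book) of minimizing the augmented error (4.6) by gradient descent, in the spirit of Problem 4.8. The learning rate, step count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def gradient_descent_weight_decay(Z, y, lam, eta=0.01, steps=1000):
    """Minimize E_aug(w) = E_in(w) + lam * w^T w by gradient descent.

    With squared in-sample error E_in(w) = (1/N)||Zw - y||^2, the update
        w <- (1 - 2*eta*lam) * w - eta * grad E_in(w)
    shrinks ("decays") the weights at every step.
    """
    N, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad_Ein = (2.0 / N) * Z.T @ (Z @ w - y)   # gradient of the in-sample error
        w = (1 - 2 * eta * lam) * w - eta * grad_Ein
    return w

# Illustrative usage on made-up data (assumed, not from the text):
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))
y = Z @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
w_reg = gradient_descent_weight_decay(Z, y, lam=0.1)
```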
There is a duality between the minimization of the in-sample error over a constrained hypothesis set and the unconstrained minimization of an augmented error. We may choose to live in either world, but more often than not, the unconstrained minimization of the augmented error is more convenient.

In our definition of E_aug(w) in Equation (4.6), we only highlighted the dependence on w. There are two other quantities under our control, namely the amount of regularization, λ, and the nature of the regularizer, which we chose to be w^T w. In general, the augmented error for a hypothesis h ∈ H is
$$E_{\text{aug}}(h, \lambda, \Omega) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h). \qquad (4.7)$$
For weight decay, Ω(h) = w^T w, which penalizes large weights. The penalty term has two components: the regularizer Ω(h) (the type of regularization), which penalizes a particular property of h, and the regularization parameter λ (the amount of regularization). The need for regularization goes down as the number of data points goes up, so we factored out 1/N; this allows the optimal choice for λ to be less sensitive to N. This is just a redefinition of the λ that we have been using, in order to make it a more stable parameter that is easier to interpret. Notice how Equation (4.7) resembles the VC bound (4.1), as we anticipated in the heuristic view of regularization. This is why we use the same notation Ω for both the penalty on individual hypotheses, Ω(h), and the penalty for the complexity of the whole set, Ω(H). The correspondence between the complexity of H and the complexity of an individual h will be discussed further in Section 5.1. The regularizer is typically fixed ahead of time, before seeing the data; sometimes the problem itself can dictate an appropriate regularizer.

Exercise 4.6
We have seen both the hard-order constraint and the soft-order constraint. Which do you expect to be more useful for binary classification using the perceptron model? [Hint: sign(w^T x) = sign(α w^T x) for any α > 0.]
The optimal regularization parameter, however, typically depends on the data. The choice of the optimal λ is one of the applications of validation, which we will discuss shortly.

Example 4.2 (Linear models with weight decay). Linear models are important enough that it is worthwhile to spell out the details of augmented error minimization in this case. From Exercise 4.4, the augmented error is
$$E_{\text{aug}}(w) = \frac{1}{N}\Big[(Zw - y)^T(Zw - y) + \lambda\, w^T w\Big],$$
where Z is the transformed data matrix and w_lin = (Z^T Z)^{-1} Z^T y. The reader may verify, after taking the derivative of E_aug and setting ∇E_aug = 0, that
$$w_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T y.$$
As expected, w_reg will go to zero as λ → ∞, due to the λI term. The predictions on the in-sample data are given by
$$\hat{y} = Z w_{\text{reg}} = H(\lambda)\, y, \qquad \text{where } H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T.$$
The matrix H(λ) plays an important role in defining the effective complexity of a model. When λ = 0, H is the hat matrix of Exercises 3.3 and 4.4, which satisfies H² = H and trace(H) = d + 1. The vector of in-sample errors, which are also called residuals, is y − ŷ = (I − H(λ)) y, and the in-sample error is
$$E_{\text{in}}(w_{\text{reg}}) = \frac{1}{N}\, y^T \big(I - H(\lambda)\big)^2\, y.$$
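As a concrete companion to Example 4.2, here is a small sketch (my own, with illustrative data) that computes the weight-decay solution w_reg = (ZᵀZ + λI)⁻¹Zᵀy and the matrix H(λ) for a given transformed data matrix Z.

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Regularized least squares: w_reg = (Z^T Z + lam I)^{-1} Z^T y, and H(lambda)."""
    d = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(d)
    w_reg = np.linalg.solve(A, Z.T @ y)
    H = Z @ np.linalg.solve(A, Z.T)        # H(lambda) = Z (Z^T Z + lam I)^{-1} Z^T
    return w_reg, H

# Illustrative usage (assumed data):
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 3))
y = Z @ np.array([0.5, -1.0, 2.0]) + 0.2 * rng.normal(size=30)
w_reg, H = weight_decay_fit(Z, y, lam=0.5)
y_hat = H @ y                              # in-sample predictions Z w_reg
E_in = np.mean((y_hat - y) ** 2)           # equals (1/N) y^T (I - H)^2 y
```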
We can now apply weight decay regularization to the first overfitting example that opened this chapter. The results for different λ's are shown in Figure 4.5.

[Figure 4.5: Weight decay applied with different values of the regularization parameter λ (for example λ = 0.0001 and λ = 0.01). The fits get flatter as we increase λ.]

As you can see, even very little regularization goes a long way, but too much regularization results in an overly flat curve at the expense of the in-sample fit. Another case we saw earlier is Example 4.1, where we fit a linear model to a sinusoid. The regularization used there was also weight decay, with λ = 0.1.
4.2.3 Choosing a Regularizer: Pill or Poison?
We have presented a number of ways to constrain a model: hard-order constraints, where we simply use a lower-order model; soft-order constraints, where we constrain the parameters of the model; and augmented error, where we add a penalty term to an otherwise unconstrained minimization of error. Augmented error is the most popular form of regularization, for which we need to choose the regularizer Ω(h) and the regularization parameter λ. In practice, the choice of Ω is largely heuristic. Finding a perfect Ω is as difficult as finding a perfect H. It depends on information that, by the very nature of learning, we don't have. However, there are regularizers we can work with that have stood the test of time, such as weight decay. Some forms of regularization work and some do not, depending on the specific application and the data. Figure 4.5 illustrated that even the amount of regularization
[Figure 4.6: Out-of-sample performance of the uniform and low-order regularizers, using model H_15 with σ² = 0.5, Q_f = 15 and N = 30. (a) Uniform regularizer. (b) Low-order regularizer. The x-axis is the regularization parameter λ. Overfitting occurs in the shaded region because lower E_in (lower λ) leads to higher E_out. Underfitting occurs when λ is too large, because the learning algorithm has too little flexibility to fit the data.]
has to be chosen carefully. Too much regularization (too harsh a constraint) leaves the learning too little flexibility to fit the data and leads to underfitting, which can be just as bad as overfitting. If so many choices can go wrong, why do we bother with regularization in the first place? Regularization is a necessary evil, with the operative word being necessary. If our model is too sophisticated for the amount of data we have, we are doomed. By applying regularization, we have a chance. By applying the proper regularization, we are in good shape. Let us experiment with two choices of a regularizer for the model of 15th-order polynomials, using the experimental design in Exercise 4.2:

1. A uniform regularizer: Ω_unif(w) = Σ_{q=0}^{15} w_q².
2. A low-order regularizer: Ω_low(w) = Σ_{q=0}^{15} q w_q².

The first encourages all weights to be small, uniformly; the second pays more attention to the higher order weights, encouraging a lower order fit. Figure 4.6 shows the performance for different values of the regularization parameter λ. As you decrease λ, the learning algorithm pays less attention to the penalty term and more to E_in, and so E_in will decrease (Problem 4.7). In the shaded region, E_out increases as you decrease E_in (decrease λ); the regularization parameter is too small and there is not enough of a constraint on the learning, leading to decreased performance because of overfitting. In the unshaded region, the regularization parameter is too large, over-constraining the learning and not giving it enough flexibility to fit the data, leading to decreased performance because of underfitting. As can be observed from the figure, the price paid for overfitting is generally more severe than for underfitting. It usually pays to be conservative.
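To make the two regularizers concrete, here is a small sketch (my own, under assumed settings) that implements both as Tikhonov matrices Γ, so the regularized solution is w_reg = (ZᵀZ + λΓᵀΓ)⁻¹Zᵀy as in Exercise 4.5 and Problem 4.16. The plain polynomial features and the data are illustrative stand-ins, not the book's Legendre setup.

```python
import numpy as np

def tikhonov_fit(Z, y, lam, Gamma):
    """Minimize (1/N)(||Zw - y||^2 + lam * ||Gamma w||^2)."""
    A = Z.T @ Z + lam * Gamma.T @ Gamma
    return np.linalg.solve(A, Z.T @ y)

Q = 15                                              # polynomial order (assumed)
Gamma_unif = np.eye(Q + 1)                          # uniform: penalty sum_q w_q^2
Gamma_low = np.diag(np.sqrt(np.arange(Q + 1.0)))    # low-order: penalty sum_q q * w_q^2

# Illustrative usage on made-up polynomial features:
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
Z = np.vander(x, Q + 1, increasing=True)            # columns 1, x, ..., x^Q
y = np.sin(np.pi * x) + 0.5 * rng.normal(size=30)
w_unif = tikhonov_fit(Z, y, lam=1.0, Gamma=Gamma_unif)
w_low = tikhonov_fit(Z, y, lam=1.0, Gamma=Gamma_low)
```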
[Figure 4.7: Performance of the uniform regularizer at different levels of noise; the optimal λ is highlighted for each curve. (a) Stochastic noise. (b) Deterministic noise. The x-axis in both panels is the regularization parameter λ.]
The optimal regularization parameter for the two cases is quite different, and the performance can be quite sensitive to the choice of regularization parameter. However, the promising message from the figure is that, though the behaviors are quite different, the performances of the two regularizers are comparable (around 0.76), if we choose the right λ for each.

We can also use this experiment to study how performance with regularization depends on the noise. In Figure 4.7(a), when σ² = 0, no amount of regularization helps (i.e., the optimal regularization parameter is λ = 0), which is not a surprise because there is no stochastic or deterministic noise in the data (both target and model are 15th order polynomials). As we add more stochastic noise, the overall performance degrades, as expected. Note that the optimal value for the regularization parameter increases with noise, which is also expected based on the earlier discussion that the potential to overfit increases as the noise increases; hence, constraining the learning more should help.

Figure 4.7(b) shows what happens when we add deterministic noise, keeping the stochastic noise at zero. This is accomplished by increasing Q_f (the target complexity), thereby adding deterministic noise, but keeping everything else the same. Comparing parts (a) and (b) of Figure 4.7 provides another demonstration of how the effects of deterministic and stochastic noise are similar. When either is present, it is helpful to regularize, and the more noise there is, the larger the amount of regularization you need.

What happens if you pick the wrong regularizer? To illustrate, we picked a regularizer which encourages large weights (weight growth) versus weight decay, which encourages small weights. As you can see, in this case weight growth does not help the cause of overfitting. If we happened to choose weight growth as our regularizer, we would still be OK as long as we have
a good way to pick the regularization parameter; the optimal regularization parameter in this case is λ = 0, and we are no worse off than not regularizing. No regularizer will be ideal for all settings, or even for a specific setting, since we never have perfect information, but they all tend to work with varying success if the amount of regularization λ is set to the correct level. Thus, the entire burden rests on picking the right λ, a task that can be addressed by a technique called validation, which is the topic of the next section.

The lesson learned is that some form of regularization is necessary, as learning is quite sensitive to stochastic and deterministic noise. The best way to constrain the learning is in the 'direction' of the target function, and more of a constraint is needed when there is more noise. Even though we don't know either the target function or the noise, regularization helps by reducing the impact of the noise. Most common models have hypothesis sets which are naturally parameterized so that smaller parameters lead to smoother hypotheses. Thus, a weight decay type of regularizer constrains the learning towards smoother hypotheses. This helps, because stochastic noise is 'high frequency' (non-smooth). Similarly, deterministic noise (the part of the target function which cannot be modeled) also tends to be non-smooth. Thus, constraining the learning towards smoother hypotheses 'hurts' our ability to overfit the noise more than it hurts our ability to fit the useful information. These are empirical observations, not theoretically justifiable statements.

Regularization and the VC dimension. Regularization (for example, soft-order selection by minimizing the augmented error) poses a problem for the VC line of reasoning. As λ goes up, the learning algorithm changes, but the hypothesis set does not, so d_vc will not change. We argued that λ ↑ in the augmented error corresponds to C ↓ in the soft-order constrained model. So, more regularization corresponds to an effectively smaller model, and we expect better generalization for a small increase in E_in, even though the VC dimension of the model we are actually using with augmented error does not change. This suggests a heuristic that works well in practice, which is to use an 'effective VC dimension' instead of the VC dimension. For linear perceptrons, the VC dimension equals the number of free parameters, d + 1, and so an effective number of parameters is a good surrogate for the VC dimension in the VC bound. The effective number of parameters will go down as λ increases, and so the effective VC dimension will reflect better generalization with increased regularization. Problems 4.13, 4.14, and 4.15 explore the notion of an effective number of parameters.
4.3 Validation
So far, we have identified overfitting as a problem, noise (stochastic and deterministic) as a cause, and regularization as a cure. In this section, we introduce another cure, called validation. One can think of both regularization and
validation as attempts at minimizing E_out rather than just E_in. Of course, the true E_out is not available to us, so we need an estimate of E_out based on information available to us in sample. In some sense, this is the Holy Grail of machine learning: to find an in-sample estimate of the out-of-sample error. Regularization attempts to minimize E_out by working through the equation
$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty},$$
and concocting a heuristic term that emulates the overfit penalty. Validation, on the other hand, cuts to the chase and estimates the out-of-sample error E_out(h) in this equation directly.

Estimating the out-of-sample error directly is nothing new to us. In Section 2.2.3, we introduced the idea of a test set, a subset of D that is not involved in the learning process and is used to evaluate the final hypothesis. The test error E_test, unlike the in-sample error E_in, is an unbiased estimate of E_out.
4.3.1 The Validation Set
The idea of a validation set is almost identical to that of a test set. We remove a subset from the data; this subset is not used in training. We then use this held-out subset to estimate the out-of-sample error. The held-out set is effectively out-of-sample because it has not been used during the learning.

However, there is a difference between a validation set and a test set. Although the validation set will not be directly used for training, it will be used in making certain choices in the learning process. The minute a set affects the learning process in any way, it is no longer a test set. However, as we will see, the way the validation set is used in the learning process is so benign that its estimate of E_out remains almost intact.

Let us first look at how the validation set is created. The first step is to partition the data set D into a training set D_train of size (N − K) and a validation set D_val of size K. Any partitioning method which does not depend on the values of the data points will do; for example, we can select N − K points at random for training and the remaining K points for validation.

Now, we run the learning algorithm using the training set D_train to obtain a final hypothesis g⁻ ∈ H, where the 'minus' superscript indicates that some data points were taken out of the training. We then compute the validation error for g⁻ using the validation set D_val:
$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{x_n \in \mathcal{D}_{\text{val}}} e\big(g^-(x_n), y_n\big),$$
where e(g⁻(x), y) is the pointwise error measure which we introduced in Section 1.4.1. For classification, e(g⁻(x), y) = [g⁻(x) ≠ y], and for regression using squared error, e(g⁻(x), y) = (g⁻(x) − y)².

The validation error is an unbiased estimate of E_out because the final hypothesis g⁻ was created independently of the data points in the validation set. Indeed, taking the expectation of E_val with respect to the data points in D_val,
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] \;=\; \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} \mathbb{E}_{\mathcal{D}_{\text{val}}}\big[e(g^-(x_n), y_n)\big] \;=\; \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) \;=\; E_{\text{out}}(g^-). \qquad (4.8)$$
The first step uses the linearity of expectation, and the second step follows because e(g⁻(x_n), y_n) depends only on (x_n, y_n), and so
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[e(g^-(x_n), y_n)\big] = \mathbb{E}_{x_n}\big[e(g^-(x_n), y_n)\big] = E_{\text{out}}(g^-).$$

How reliable is E_val at estimating E_out? In the case of classification, one can use the VC bound to predict how good the validation error is as an estimate for the out-of-sample error. We can view D_val as an 'in-sample' data set on which we computed the error of the single hypothesis g⁻. We can thus apply the VC bound for a finite model with one hypothesis in it (the Hoeffding bound). With high probability,
$$E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.9)$$
While Inequality (4.9) applies to binary target functions, we may use the variance of E_val as a more generally applicable measure of the reliability. The next exercise studies how the variance of E_val depends on K (the size of the validation set), and implies that a similar bound holds for regression. The conclusion is that the error between E_val(g⁻) and E_out(g⁻) drops as σ(g⁻)/√K, where σ(g⁻) is bounded by a constant in the case of classification.
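A minimal sketch (my own, with assumed function names and squared error) of creating a validation set and computing E_val as described above:

```python
import numpy as np

def split_train_val(X, y, K, rng):
    """Randomly set aside K points for validation; the rest are for training."""
    N = len(y)
    idx = rng.permutation(N)
    val, train = idx[:K], idx[K:]
    return X[train], y[train], X[val], y[val]

def validation_error(g_minus, X_val, y_val):
    """E_val(g^-): average pointwise squared error on the held-out set."""
    return np.mean((g_minus(X_val) - y_val) ** 2)

# Illustrative usage with a linear hypothesis learned on the training part:
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)
X_tr, y_tr, X_val, y_val = split_train_val(X, y, K=20, rng=rng)
w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]     # g^- trained on D_train only
E_val = validation_error(lambda Z: Z @ w, X_val, y_val)
```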
Exercise 4.7
Fix g⁻ (learned from D_train) and define σ²_val = Var_{D_val}[E_val(g⁻)]. We consider how σ²_val depends on K. Let σ²(g⁻) = Var_x[e(g⁻(x), y)] be the pointwise variance in the out-of-sample error of g⁻.
(a) Show that σ²_val = (1/K) σ²(g⁻).
(b) In a classification problem, where e(g⁻(x), y) = [g⁻(x) ≠ y], express σ²_val in terms of P[g⁻(x) ≠ y].
(c) Show that, for any g⁻ in a classification problem, σ²_val ≤ 1/(4K).
(d) Is there a uniform upper bound for Var[E_val(g⁻)], similar to (c), in the case of regression with squared error e(g⁻(x), y) = (g⁻(x) − y)²? [Hint: the squared error is unbounded.]
(e) For regression with squared error, if we train using fewer points (smaller N − K) to get g⁻, do you expect σ²(g⁻) to be higher or lower? [Hint: for continuous, non-negative random variables, higher mean often implies higher variance.]
(f) Conclude that increasing the size of the validation set can result in a better or a worse estimate of E_out.
The expected validation error for H_2 is illustrated in Figure 4.8, where we used the experimental design in Exercise 4.2 with Q_f = 10, N = 40 and noise level 0.4. The expected validation error equals E_out(g⁻), per Equation (4.8).

[Figure 4.8: The expected validation error E[E_val(g⁻)] as a function of the size of the validation set, K; the shaded area is E[E_val] ± σ_val.]

The figure clearly shows that there is a price to be paid for setting aside K data points to get this unbiased estimate of E_out: when we set aside more data for validation, there are fewer training data points and so g⁻ becomes worse; E_out(g⁻), and hence the expected validation error, increases (the blue curve). As we expect, the uncertainty in E_val, as measured by σ_val (the size of the shaded region), decreases with K, up to the point where the variance σ²(g⁻) gets really bad. This point comes when the number of training data points becomes critically small, as in Exercise 4.7(e). If K is neither too small nor too large, E_val provides a good estimate of E_out. A rule of thumb in practice is to set K = N/5 (set aside 20% of the data for validation).

We have established two conflicting demands on K. It has to be big enough for E_val to be reliable, and it has to be small enough so that the training set with N − K points is big enough to get a decent g⁻. Inequality (4.9) quantifies the first demand. The second demand is quantified by the learning curve
discussed in Section 2.3.2 (also the blue curve in Figure 4.8, from right to left), which shows how the expected out-of-sample error goes down as the number of training data points goes up. The fact that more training data lead to a better final hypothesis has been extensively verified empirically, although it is challenging to prove theoretically.

Restoring D. Although the learning curve suggests that taking out K data points for validation and using only N − K for training will cost us in terms of E_out, we do not have to pay that price! The purpose of validation is to estimate the out-of-sample performance, and E_val happens to be a good estimate of E_out(g⁻). This does not mean that we have to output g⁻ as our final hypothesis. The primary goal is to get the best possible hypothesis, so we should output g, the hypothesis trained on the entire set D. The secondary goal is to estimate E_out, which is what validation allows us to do.

[Figure 4.9: Using a validation set to estimate E_out.]

Based on our discussion of learning curves, E_out(g) ≤ E_out(g⁻), so
$$E_{\text{out}}(g) \le E_{\text{out}}(g^-) \le E_{\text{val}}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.10)$$
The first inequality is subdued because it was not rigorously proved. If we first train with N − K data points, validate with the remaining K data points, and then retrain using all the data to get g, the validation error we got will likely still be better at estimating E_out(g) than the estimate using the VC bound with E_in(g), especially for large hypothesis sets with big d_vc.

So far, we have treated the validation set as a way to estimate E_out, without involving it in any decisions that affect the learning process. Estimating E_out is a useful role by itself; a customer would typically want to know how good the final hypothesis is (in fact, the inequalities in (4.10) suggest that the validation error is a pessimistic estimate of E_out, so your customer is likely to be pleasantly surprised when he tries your system on new data). However, as we will see next, an important role of a validation set is in fact to guide the learning process. That's what distinguishes a validation set from a test set.
4.3.2 Model Selection
By far, the most important use of validation is model selection. This could mean the choice between a linear model and a nonlinear model, the choice of the order of the polynomial in a model, the choice of the value of a regularization
[Figure 4.10: Optimistic bias of the validation error when using a validation set for the model selected.]
parameter, or any other choice that affects the learning process. In almost every learning situation, there are some choices to be made, and we need a principled way of making these choices.

The leap is to realize that validation can be used to estimate the out-of-sample error for more than one model. Suppose we have M models H_1, ..., H_M. Validation can be used to select one of these models. Use the training set D_train to learn a final hypothesis g_m⁻ for each model H_m. Now evaluate each model on the validation set to obtain the validation errors E_1, ..., E_M, where
$$E_m = E_{\text{val}}(g_m^-), \qquad m = 1, \ldots, M.$$
The validation errors estimate the out-of-sample error E_out(g_m⁻) for each H_m.

Exercise 4.8
Is E_m an unbiased estimate for the out-of-sample error E_out(g_m⁻)?

It is now a simple matter to select the model with lowest validation error. Let m* be the index of the model which achieves the minimum validation error. So, for H_{m*}, E_{m*} ≤ E_m for m = 1, ..., M. The model H_{m*} is the model selected based on the validation errors. Note that E_{m*} is no longer an unbiased estimate of E_out(g_{m*}⁻). Since we selected the model with minimum validation error, E_{m*} will have an optimistic bias. This optimistic bias when selecting between H_2 and H_5 is illustrated in Figure 4.10, using the experimental design described in Exercise 4.2 with noise level σ² = 0.4.
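A compact sketch (my own, with assumed polynomial models and squared error) of the model-selection procedure just described: train each model on D_train, evaluate on D_val, and pick the model with the smallest validation error.

```python
import numpy as np

def poly_features(x, Q):
    """Feature transform 1, x, ..., x^Q (a stand-in for a model H_Q)."""
    return np.vander(x, Q + 1, increasing=True)

def select_model(x_tr, y_tr, x_val, y_val, orders):
    """Return the order Q* with minimum validation error, plus all E_m's."""
    val_errors = {}
    for Q in orders:
        Z_tr = poly_features(x_tr, Q)
        w = np.linalg.lstsq(Z_tr, y_tr, rcond=None)[0]       # g_Q^- from D_train
        pred = poly_features(x_val, Q) @ w
        val_errors[Q] = np.mean((pred - y_val) ** 2)          # E_val(g_Q^-)
    Q_star = min(val_errors, key=val_errors.get)
    return Q_star, val_errors

# Illustrative usage, e.g., choosing between H_2 and H_5:
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 40)
y = x**3 - x + 0.4 * rng.normal(size=40)
Q_star, errs = select_model(x[:30], y[:30], x[30:], y[30:], orders=[2, 5])
```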
Exercise 4.9
Referring to Figure 4.10, why are both curves increasing with K? Why do they converge to each other with increasing K?
How good is the generalization error for this entire process of model selection using validation? Consider a new model H_val consisting of the final hypotheses learned from the training data using each model H_1, ..., H_M:
$$\mathcal{H}_{\text{val}} = \{g_1^-, g_2^-, \ldots, g_M^-\}.$$
Model selection using the validation set chose one of the hypotheses in H_val based on its performance on D_val. Since the model H_val was obtained before ever looking at the data in the validation set, this process is entirely equivalent to learning a hypothesis from H_val using the data in D_val. The validation errors E_val(g_m⁻) are 'in-sample' errors for this learning process, and so we may apply the VC bound for finite hypothesis sets, with |H_val| = M:
$$E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.11)$$
What if we didn't use a validation set to choose the model? One alternative would be to use the in-sample errors from each model as the model selection criterion. Specifically, pick the model which gives a final hypothesis with minimum in-sample error. This is equivalent to picking the hypothesis with minimum in-sample error from the grand model which contains all the hypotheses in each of the original models. If we want a bound on the out-of-sample error for the final hypothesis that results from this selection, we need to apply the VC penalty for this grand hypothesis set, which is the union of the M hypothesis sets (see Problem 2.14). Since this grand hypothesis set can have a huge VC dimension, the bound in (4.11) will generally be tighter.

[Figure 4.11: Using a validation set for model selection.]

The goal of model selection is to select the best model and output the best hypothesis from that model. Specifically, we want to select the model m for which E_out(g_m) will be minimum when we retrain with all the data. Model selection using a validation set relies on the leap of faith that if E_out(g_m⁻) is minimum, then E_out(g_m) is also minimum. The validation errors E_m estimate E_out(g_m⁻), so, modulo our leap of faith, the validation set should pick the right model. However, no matter which model m* is selected, based on the discussion of learning curves in the previous section, we should not output g_{m*}⁻ as the final hypothesis. Rather, once m* is selected using validation, learn using all the data and output g_{m*}, which satisfies
$$E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.12)$$
Again, the first inequality is subdued because we didn't prove it.
[Figure 4.12: Model selection between H_2 and H_5 using a validation set (expected error versus validation set size K). The solid black line uses E_in for model selection, which always selects H_5. The dotted line shows the optimal model selection, if we could select the model based on the true out-of-sample error; this is unachievable, but a useful benchmark. The best performer is clearly the validation set, outputting g_{m*}; for suitable K, even g_{m*}⁻ is better than in-sample selection.]
Continuing our experiment from Figure 4.10, we evaluate the out-of-sample performance when using a validation set to select between the models H_2 and H_5. The results are shown in Figure 4.12. Validation is a clear winner over using E_in for model selection.
Exercise 4.10
(a) From Figure 4.12, E[E_out(g_{m*}⁻)] is initially decreasing. How can this be, if E[E_out(g_m⁻)] is increasing in K for each m?
(b) From Figure 4.12 we see that E[E_out(g_{m*}⁻)] is initially decreasing, and then it starts to increase. What are the possible reasons for this?
(c) When K = 1, E[E_out(g_{m*}⁻)] < E[E_out(g_{m*})]. How can this be, if the learning curves for both models are decreasing?
Example 4.3. We can use a validation set to select the value of the regularization parameter λ in the augmented error of (4.6). Although the most important part of a model is the hypothesis set, every hypothesis set has an associated learning algorithm which selects the final hypothesis g. Two models may differ only in the learning algorithm, while working with the same hypothesis set. Changing the value of λ in the augmented error changes the learning algorithm (the criterion by which g is selected) and effectively changes the model.

Based on this discussion, consider M different models corresponding to the same hypothesis set H, but with different choices for λ in the augmented error. So, we have (H, λ_1), (H, λ_2), ..., (H, λ_M) as our different models. We
may, for example, choose λ_1 = 0.01, λ_2 = 0.02, ..., λ_M = 10. Using a validation set to choose one of these models amounts to determining the value of λ to within a resolution of 0.01.

We have analyzed validation for model selection based on a finite number of models. If validation is used to choose the value of a parameter, for example λ as in the previous example, then the value of M will depend on the resolution to which we determine that parameter. In the limit, the selection is actually among an infinite number of models, since the value of λ can be any real number. What happens to bounds like (4.11) and (4.12), which depend on M? Just as the Hoeffding bound for a finite hypothesis set did not collapse when we moved to infinite hypothesis sets with finite VC dimension, bounds like (4.11) and (4.12) will not completely collapse either. We can derive VC-type bounds here too, because even though there are an infinite number of models, these models are all very similar; they differ only slightly in the value of λ. As a rule of thumb, what matters is the number of parameters we are trying to set. If we have only one or a few parameters, the estimates based on a decent-sized validation set would be reliable. The more choices we make based on the same validation set, the more 'contaminated' the validation set becomes, and the less reliable its estimates will be. The more we use the validation set to fine-tune the model, the more the validation set becomes like a training set used to 'learn the right model'; and we all know how limited a training set is in its ability to estimate E_out.

You will be hard pressed to find a serious learning problem in which validation is not used. Validation is a conceptually simple technique, easy to apply in almost any setting, and requires no specific knowledge about the details of a model. The main drawback is the reduced size of the training set, but that can be significantly mitigated through a modified version of validation which we discuss next.
4.3.3 Cross Validation
Validation relies on the following chain of reasoning,
$$E_{\text{out}}(g) \;\underset{\text{(small }K)}{\approx}\; E_{\text{out}}(g^-) \;\underset{\text{(large }K)}{\approx}\; E_{\text{val}}(g^-),$$
which highlights the dilemma we face in trying to select K. We are going to output g. When K is large, there is a discrepancy between the two out-of-sample errors E_out(g⁻) (which E_val directly estimates) and E_out(g) (which is the final error when we learn using all the data D). We would like to choose K as small as possible in order to minimize the discrepancy between E_out(g⁻) and E_out(g); ideally K = 1. However, if we make this choice, we lose the reliability of the validation estimate, as the bound on the RHS of (4.9) becomes huge. The validation error E_val(g⁻) will still be an unbiased estimate of E_out(g⁻)
(g⁻ is trained on N − 1 points), but it will be so unreliable as to be useless, since it is based on only one data point. This brings us to the cross validation estimate of the out-of-sample error. We will focus on the leave-one-out version, which corresponds to a validation set of size K = 1, and is also the easiest case to illustrate. More popular versions typically use larger K, but the essence of the method is the same.

There are N ways to partition the data into a training set of size N − 1 and a validation set of size 1. Specifically, let D_n be the data set after leaving out data point (x_n, y_n). Denote the final hypothesis learned from D_n by g_n⁻. Let e_n be the error made by g_n⁻ on its validation set, which is just the single data point {(x_n, y_n)}:
$$e_n = E_{\text{val}}(g_n^-) = e\big(g_n^-(x_n), y_n\big).$$
The cross validation estimate is the average value of the e_n's,
$$E_{\text{cv}} = \frac{1}{N}\sum_{n=1}^{N} e_n.$$

[Figure 4.13: Illustration of leave-one-out cross validation for a linear fit using three data points. The average of the three red errors obtained by the linear fits, leaving out one data point at a time, is E_cv.]

Figure 4.13 illustrates cross validation on a simple example. Each e_n is a wild, yet unbiased, estimate of the corresponding E_out(g_n⁻), which follows after setting K = 1 in (4.8). With cross validation, we have N functions g_1⁻, ..., g_N⁻, together with the N error estimates e_1, ..., e_N. The hope is that these N errors together would be almost equivalent to estimating E_out on a reliable validation set of size N, while at the same time we managed to use N − 1 points to obtain each g_n⁻.
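A minimal leave-one-out cross validation sketch (my own, with an assumed squared-error measure and a generic fitting routine passed in as a function):

```python
import numpy as np

def leave_one_out_cv(X, y, fit):
    """E_cv = (1/N) sum_n e(g_n^-(x_n), y_n), with g_n^- trained on all points but n."""
    N = len(y)
    errors = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        g_minus = fit(X[keep], y[keep])                     # hypothesis trained without point n
        errors[n] = (g_minus(X[n:n + 1])[0] - y[n]) ** 2    # e_n on the left-out point
    return errors.mean(), errors

# Illustrative usage with an unregularized linear fit:
def linear_fit(X, y):
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Z: Z @ w

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])
y = 2.0 * X[:, 1] + 0.3 * rng.normal(size=20)
E_cv, e_n = leave_one_out_cv(X, y, linear_fit)
```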
Let's try to understand why E_cv is a good estimator of E_out. First and foremost, E_cv is an unbiased estimator of 'E_out(g⁻)'. We have to be a little careful here because we don't have a single hypothesis g⁻, as we did when using a single validation set. Depending on the (x_n, y_n) that was taken out, each g_n⁻ can be a different hypothesis. To understand the sense in which E_cv estimates E_out, we need to revisit the concept of the learning curve.

Ideally, we would like to know E_out(g). The final hypothesis g is the result of learning on a random data set D of size N. It is almost as useful to know the expected performance of your model when you learn on a data set of size N; the hypothesis g is just one such instance of learning on a data set of size N. This expected performance, averaged over data sets of size N, when viewed as a function of N, is exactly the learning curve shown in Figure 4.2. More formally, for a given model, let
$$\bar{E}_{\text{out}}(N) = \mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g)\big]$$
be the expectation (over data sets D of size N) of the out-of-sample error produced by the model. The expected value of E_cv is exactly Ē_out(N − 1). This is true because it is true for each individual validation error e_n:
$$\mathbb{E}_{\mathcal{D}}[e_n] = \mathbb{E}_{\mathcal{D}_n}\mathbb{E}_{(x_n, y_n)}\big[e(g_n^-(x_n), y_n)\big] = \mathbb{E}_{\mathcal{D}_n}\big[E_{\text{out}}(g_n^-)\big] = \bar{E}_{\text{out}}(N-1).$$
Since this equality holds for each e_n, it also holds for the average. We highlight this result by making it a theorem.

Theorem 4.4. E_cv is an unbiased estimate of Ē_out(N − 1) (the expectation of the model performance, E_D[E_out], over data sets of size N − 1).

Now that we have our cross validation estimate of E_out, there is no need to output any of the g_n⁻ as our final hypothesis. We might as well squeeze every last drop of performance and retrain using the entire data set D, outputting g as the final hypothesis and getting the benefit of going from N − 1 to N on the learning curve. In this case, the cross validation estimate will on average be an upper estimate for the out-of-sample error: E_out(g) ≤ E_cv, so expect to be pleasantly surprised, albeit slightly.

[Figure 4.14: Using cross validation to estimate E_out.]

With just simple validation and a validation set of size K = 1, we know that the validation estimate will not be reliable. How reliable is the cross validation estimate E_cv? We can measure the reliability using the variance of E_cv.
Unfortunately, while we were able to pin down the expectation of E_cv, the variance is not so easy. If the N cross validation errors e_1, ..., e_N were equivalent to N errors on a totally separate validation set of size N, then E_cv would indeed be a reliable estimate, for decent-sized N. The equivalence would hold if the individual e_n's were independent of each other. Of course, this is too optimistic. Consider two validation errors e_n, e_m. The validation error e_n depends on g_n⁻, which was trained on data containing (x_m, y_m). Thus, e_n has a dependency on (x_m, y_m). The validation error e_m is computed using (x_m, y_m) directly, and so it also has a dependency on (x_m, y_m). Consequently, there is a possible correlation between e_n and e_m through the data point (x_m, y_m). That correlation wouldn't be there if we were validating a single hypothesis using N fresh (independent) data points.

How much worse is the cross validation estimate as compared to an estimate based on a truly independent set of N validation errors? A VC-type probabilistic bound, or even a computation of the asymptotic variance of the cross validation estimate (Problem 4.23), is challenging. One way to quantify the reliability of E_cv is to compute how many fresh validation data points would have a comparable reliability to E_cv, and Problem 4.24 discusses one way to do this. There are two extremes for this effective size (the effective number of fresh examples giving a comparable estimate of E_out). On the high end is N, which means that the cross validation errors are essentially independent. On the low end is 1, which means that E_cv is only as good as any single one of the individual cross validation errors e_n, i.e., the cross validation errors are totally dependent. While one cannot prove anything theoretically, in practice the reliability of E_cv is much closer to the higher end.
Cross validation for model selection. In Figure 4.11, the estimates E_m of the out-of-sample error for model H_m were obtained using the validation set. Instead, we may use cross validation estimates to obtain E_m: use cross validation to obtain estimates of the out-of-sample error for each model H_1, ..., H_M, and select the model with the smallest cross validation error. Now, train this model selected by cross validation using all the data to output a final hypothesis, making the usual leap of faith that E_out(g⁻) tracks E_out(g) well.

Example 4.5. In Figure 4.13, we illustrated cross validation for estimating E_out of a linear model (h(x) = ax + b) using a simple experiment with three data points generated from a constant target function with noise. We now consider a second model, the constant model (h(x) = b). We can also use cross validation to estimate E_out for the constant model, illustrated in Figure 4.15.
[Figure 4.15: Leave-one-out cross validation error for a constant fit.]

If we use the in-sample error after fitting all the data (three points), then the linear model wins, because it can use its additional degree of freedom to fit the data better. The same is true with the cross validation data sets of size two: the linear model has perfect in-sample error. But with cross validation, what matters is the error on the outstanding point in each of these fits. Even to the naked eye, the average of the cross validation errors is smaller for the constant model, which obtained a lower E_cv than the linear model. The constant model wins, according to cross validation. The constant model also has lower E_out, and so cross validation selected the correct model in this example.
One important use of validation is to estimate the optimal regularization parameter λ, as described in Example 4.3. We can use cross validation for the same purpose, as summarized in the algorithm below.

Cross validation for selecting λ:
1: Define M models by choosing different values for λ in the augmented error: (H, λ_1), (H, λ_2), ..., (H, λ_M).
2: for each model m = 1, ..., M, do
3:    Use the cross validation module in Figure 4.14 to estimate E_cv(m), the cross validation error for model m.
4: Select the model m* with minimum E_cv(m*).
5: Use model (H, λ_{m*}) and all the data D to obtain the final hypothesis g_{m*}. Effectively, you have estimated the optimal λ.
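A short sketch (my own, assuming weight-decay linear regression and squared error) of the cross-validation-for-λ procedure above, reusing a leave-one-out loop:

```python
import numpy as np

def wreg(Z, y, lam):
    """Weight-decay solution (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def loo_cv_error(Z, y, lam):
    """Leave-one-out E_cv for weight-decay regression at a given lambda."""
    N = len(y)
    errs = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        w = wreg(Z[keep], y[keep], lam)
        errs[n] = (Z[n] @ w - y[n]) ** 2
    return errs.mean()

def select_lambda(Z, y, lambdas):
    """Pick lambda with smallest E_cv, then retrain on all the data."""
    cv = {lam: loo_cv_error(Z, y, lam) for lam in lambdas}
    lam_star = min(cv, key=cv.get)
    return lam_star, wreg(Z, y, lam_star)

# Illustrative usage:
rng = np.random.default_rng(6)
Z = rng.normal(size=(30, 5))
y = Z @ rng.normal(size=5) + 0.3 * rng.normal(size=30)
lam_star, w_final = select_lambda(Z, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```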
We see from Figure 4.14 that estimating E_cv for just a single model requires N rounds of learning on D_n, each of size N − 1. So the cross validation algorithm above requires MN rounds of learning. This is a formidable task. If we could analytically obtain E_cv, that would be a big bonus, but analytic results are often difficult to come by for cross validation. One exception is in the case of linear models, where we are able to derive an exact analytic formula for the cross validation estimate.
Analytic computation of E_cv for linear models. Recall that, for linear regression with weight decay, w_reg = (ZᵀZ + λI)⁻¹Zᵀy, and the in-sample predictions are
$$\hat{y} = H(\lambda)\, y, \qquad H(\lambda) = Z(Z^T Z + \lambda I)^{-1} Z^T.$$
Given H(λ), ŷ, and y, it turns out that we can analytically compute the cross validation estimate as
$$E_{\text{cv}} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - H_{nn}(\lambda)}\right)^2. \qquad (4.13)$$
Notice that the cross validation estimate is very similar to the in-sample error, E_in = (1/N) Σ_n (ŷ_n − y_n)², differing only by a normalization of each term in the sum by a factor 1/(1 − H_nn(λ))². One use of this analytic formula is that it can be directly optimized to obtain the best regularization parameter λ. A proof of this remarkable formula is given in Problem 4.26.
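A small sketch (my own) of Equation (4.13): it computes the leave-one-out estimate for weight-decay linear regression directly from H(λ), without N retrainings. The data are illustrative.

```python
import numpy as np

def analytic_loo_cv(Z, y, lam):
    """E_cv = (1/N) sum_n ((yhat_n - y_n) / (1 - H_nn(lambda)))^2, per (4.13)."""
    d = Z.shape[1]
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)   # H(lambda)
    y_hat = H @ y
    return np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)

# Illustrative usage (can be compared against an explicit leave-one-out loop):
rng = np.random.default_rng(7)
Z = rng.normal(size=(25, 4))
y = Z @ rng.normal(size=4) + 0.2 * rng.normal(size=25)
E_cv = analytic_loo_cv(Z, y, lam=0.5)
```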
Even when we cannot derive such an analytic characterization of cross validation, the technique widely results in good out-of-sample error estimates in practice, and so the computational burden is often worth enduring. Also, as with using a validation set, cross validation applies in almost any setting without requiring specific knowledge about the details of the models.

So far, we have lived in a world of unlimited computation, and all that mattered was out-of-sample error; in reality, computation time can be of consequence, especially with huge data sets. For this reason, leave-one-out cross validation may not be the method of choice.⁴ A popular derivative of leave-one-out cross validation is V-fold cross validation.⁵ In V-fold cross validation, the data are partitioned into V disjoint sets (or folds) D_1, ..., D_V, each of size approximately N/V; each set D_v in this partition serves as a validation set to compute a validation error for a hypothesis learned on a training set which is the complement of the validation set, D \ D_v. So, you always validate a hypothesis on data that was not used for training that particular hypothesis. The V-fold cross validation error is the average of the V validation errors that are obtained, one from each validation set D_v. Leave-one-out cross validation is the same as N-fold cross validation. The gain from choosing V ≪ N is computational. The drawback is that you will be estimating E_out for a hypothesis g⁻ trained on less data (as compared with leave-one-out), and so the discrepancy between E_out(g) and E_out(g⁻) will be larger. A common choice in practice is 10-fold cross validation, and one of the folds is illustrated below.
[Illustration: the data set partitioned into 10 folds; one fold serves as the validation set while the remaining folds together form the training set.]
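A minimal V-fold cross validation sketch (my own, with an assumed generic fit routine and squared error), in the spirit of the 10-fold scheme just described:

```python
import numpy as np

def v_fold_cv(X, y, fit, V=10, seed=0):
    """Average validation error over V disjoint folds."""
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, V)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)
        g_minus = fit(X[train_idx], y[train_idx])       # trained without this fold
        errs.append(np.mean((g_minus(X[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errs))

# Illustrative usage with a linear fit:
def linear_fit(X, y):
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Z: Z @ w

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(60), rng.uniform(-1, 1, 60)])
y = 1.5 * X[:, 1] + 0.2 * rng.normal(size=60)
E_cv10 = v_fold_cv(X, y, linear_fit, V=10)
```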
⁴ Stability problems have also been reported in leave-one-out.
⁵ Some authors call it K-fold cross validation, but we use V so as not to confuse it with the size of the validation set K.
4.3.4 Theory Versus Practice
Both validation and cross validation present challenges for the mathematical theory of learning, similar to the challenges presented by regularization. The theory of generalization, in particular the VC analysis, forms the foundation for learnability. It provides us with guidelines under which it is possible to make a generalization conclusion with high probability. It is not straightforward, and sometimes not possible, to rigorously carry over these conclusions to the analysis of validation, cross validation, or regularization. What is possible, and indeed quite effective, is to use the theory as a guideline. In the case of regularization, constraining the choice of a hypothesis leads to better generalization, as we would intuitively expect, even if the hypothesis set remains technically the same. In the case of validation, making a choice for few parameters does not overly contaminate the validation estimate of E_out, even if the VC guarantee for these estimates is too weak. In the case of cross validation, the benefit of averaging several validation errors is observed, even if the estimates are not independent.

Although these techniques were based on sound theoretical foundation, they are to be considered heuristics, because they do not have a full mathematical justification in the general case. Learning from data is an empirical task with theoretical underpinnings. We prove what we can prove, but we use the theory as a guideline when we don't have a conclusive proof. In a practical application, heuristics may win over a rigorous approach that makes unrealistic assumptions. The only way to be convinced about what works and what doesn't in a given situation is to try out the techniques and see for yourself.

The basic message in this chapter can be summarized as follows.

1. Noise (stochastic or deterministic) affects learning adversely, leading to overfitting.
2. Regularization helps to prevent overfitting by constraining the model, reducing the impact of the noise, while still giving us flexibility to fit the data.
3. Validation and cross validation are useful techniques for estimating E_out. One important use of validation is model selection, in particular to estimate the amount of regularization to use.
Example 4.6. We illustrate validation on the handwritten digit classification task of deciding whether a digit is 1 or not (see also Example 3.1), based on the two features which measure the symmetry and average intensity of the digit. The data is shown in Figure 4.16(a).
[Figure 4.16: (a) The digits classification task (average intensity versus symmetry), of which 500 examples are selected for the training set. (b) The data are transformed via the 5th order polynomial transform to a 20-dimensional feature vector; the error curves are shown as we vary the number of these features used for classification.]

We have randomly selected 500 data points as the training data, and the remaining are used as a test set for evaluation. We considered a nonlinear feature transform to the 5th order polynomial feature space:
$$(1, x_1, x_2) \;\rightarrow\; (1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2,\; x_1^3,\; x_1^2 x_2,\; x_1 x_2^2,\; x_2^3,\; \ldots,\; x_1^5,\; x_1^4 x_2,\; x_1^3 x_2^2,\; x_1^2 x_2^3,\; x_1 x_2^4,\; x_2^5).$$
Figure 4.16(b) shows the in-sample error as you use more of the transformed features, increasing the dimension from 1 to 20. As you add more dimensions (increase the complexity of the model), the in-sample error drops, as expected.
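A small sketch (my own) of the 5th order polynomial feature transform described above, producing the 20 non-constant monomials in (x_1, x_2):

```python
import numpy as np

def poly5_transform(x1, x2):
    """All monomials x1^i * x2^j with 1 <= i + j <= 5 (20 features)."""
    feats = [x1**i * x2**j
             for deg in range(1, 6)
             for i in range(deg, -1, -1)
             for j in [deg - i]]
    return np.stack(feats, axis=-1)

# Illustrative usage on two made-up feature columns (intensity, symmetry):
rng = np.random.default_rng(9)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
Z = poly5_transform(x1, x2)   # shape (100, 20)
```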
The out-of-sample error drops at first, and then starts to increase, as we hit the approximation-generalization tradeoff. The leave-one-out cross validation error tracks the behavior of the out-of-sample error quite well. If we were to pick a model based on the in-sample error, we would use all 20 dimensions. The cross validation error is minimized between 5-7 feature dimensions; we take 6 feature dimensions as the model selected by cross validation. The table below summarizes the resulting performance metrics:

                     No Validation    Cross Validation
    E_in                  0%               0.8%
    E_out                 2.5%             1.5%

Cross validation results in a performance improvement of about 1%, which is a massive relative improvement (a 40% reduction in the error rate).
Exercise 4.11
In this particular experiment, the black curve (E_cv) is sometimes below and sometimes above the red curve (E_out). If we repeated this experiment many times, and plotted the average black and red curves, would you expect the black curve to lie above or below the red curve?
It is illuminating to see the actual classification boundaries learned with and without validation. These resulting classifiers, together with the 500 in-sample data points, are shown in the next figure.

[Left: 20-dimensional classifier (no validation), E_in = 0%, E_out = 2.5%. Right: 6-dimensional classifier (leave-one-out cross validation), E_in = 0.8%, E_out = 1.5%. Both plotted in the average intensity-symmetry plane.]

It is clear that the worse out-of-sample performance of the classifier picked without validation is due to the overfitting of a few noisy points in the training data. While the training data is perfectly separated, the shape of the resulting boundary seems highly contorted, which is a symptom of overfitting. Does this remind you of the first example that opened the chapter? There, albeit in a toy example, we similarly obtained a highly contorted fit. As you can see, overfitting is real, and here to stay!
4.4 Problems
Problem 4.1
Plot the monomials of order i, φ_i(x) = x^i. As you increase the order, does this correspond to the intuitive notion of increasing complexity?

Problem 4.2
Consider the feature transform z = [L_0(x), L_1(x), L_2(x)]ᵀ and the linear model h(x) = wᵀz. For the hypothesis with w = [1, −1, 1]ᵀ, what is h(x) explicitly as a function of x? What is its degree?
Problem 4.3
The Legendre Polynomials are a family of orthogonal polynomials which are useful for regression. The first two Legendre Polynomials are L_0(x) = 1, L_1(x) = x. The higher order Legendre Polynomials are defined by the recursion
$$L_k(x) = \frac{2k-1}{k}\, x\, L_{k-1}(x) - \frac{k-1}{k}\, L_{k-2}(x).$$
(a) What are the first six Legendre Polynomials? Use the recursion to develop an efficient algorithm to compute L_0(x), ..., L_K(x) given x. Your algorithm should run in time linear in K. Plot the first six Legendre polynomials.
(b) Show that L_k(x) is a linear combination of monomials x^k, x^{k−2}, ... (either all odd or all even order, with highest order k). Thus, L_k(−x) = (−1)^k L_k(x).
(c) Show that (x² − 1)/k · dL_k(x)/dx = x L_k(x) − L_{k−1}(x). [Hint: use induction.]
(d) Use part (c) to show that L_k satisfies Legendre's differential equation
$$\frac{d}{dx}\!\left[(x^2 - 1)\,\frac{dL_k(x)}{dx}\right] = k(k+1)\, L_k(x).$$
This means that the Legendre Polynomials are eigenfunctions of a Hermitian linear differential operator and, from Sturm-Liouville theory, they form an orthogonal basis for continuous functions on [−1, 1].
(e) Use the recurrence to show directly the orthogonality property:
$$\int_{-1}^{1} dx\; L_k(x) L_\ell(x) = \begin{cases} 0, & \ell \ne k, \\ \dfrac{2}{2k+1}, & \ell = k. \end{cases}$$
[Hint: use induction on k, with ℓ ≤ k. Use the recurrence for L_k and consider separately the four cases ℓ = k, k−1, k−2 and ℓ < k−2. For the case ℓ = k you will need to compute the integral ∫_{−1}^{1} dx x² L_{k−1}²(x). In order to do this, you could use the differential equation in part (c), multiply by x L_{k−1} and then integrate both sides (the LHS can be integrated by parts). Now solve the resulting equation for ∫_{−1}^{1} dx x² L_{k−1}²(x).]
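As a companion to part (a), here is a short sketch (my own) of the linear-time recursion for computing L_0(x), ..., L_K(x):

```python
import numpy as np

def legendre_up_to(K, x):
    """Return [L_0(x), ..., L_K(x)] using L_k = ((2k-1)/k) x L_{k-1} - ((k-1)/k) L_{k-2}."""
    x = np.asarray(x, dtype=float)
    L = [np.ones_like(x), x.copy()]
    for k in range(2, K + 1):
        L.append(((2 * k - 1) / k) * x * L[k - 1] - ((k - 1) / k) * L[k - 2])
    return L[:K + 1]

# Illustrative usage: evaluate the first six Legendre polynomials on a grid.
xs = np.linspace(-1, 1, 5)
L = legendre_up_to(5, xs)   # L[0] ... L[5]
```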
Problem 4.4
This problem is a detailed version of Exercise 4.2. We set up an experimental framework which the reader may use to study various aspects of overfitting. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H_2 and H_10. The target function is a polynomial of degree Q_f, which we write as f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where L_q(x) are the Legendre polynomials. We use the Legendre polynomials because they are a convenient orthogonal basis for the polynomials on [−1, 1] (see Section 4.2 and Problem 4.3 for some basic information on Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) + σ ε_n and the ε_n are iid standard Normal random variates.

For a single experiment, with specified values for Q_f, N, σ, generate a random degree-Q_f target function by selecting coefficients a_q independently from a standard Normal, rescaling them so that E_{a,x}[f²] = 1. Generate a data set, selecting x_1, ..., x_N independently from P(x) and y_n = f(x_n) + σ ε_n. Let g_2 and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively, with respective out-of-sample errors E_out(g_2) and E_out(g_10).

(a) Why do we normalize f? [Hint: how would you interpret σ?]
(b) How can we obtain g_2, g_10? [Hint: pose the problem as linear regression and use the technology from Chapter 3.]
(c) How can we compute E_out analytically for a given g_10?
(d) Vary Q_f, N, σ and, for each combination of parameters, run a large number of experiments, each time computing E_out(g_2) and E_out(g_10). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ) using H_2 and H_10. Let
    E_out(H_2) = average over experiments (E_out(g_2)),
    E_out(H_10) = average over experiments (E_out(g_10)).
Define the overfit measure E_out(H_10) − E_out(H_2). When is the overfit measure significantly positive (i.e., overfitting is serious) as opposed to significantly negative? Try the choices Q_f ∈ {1, 2, ..., 100}, N ∈ {20, 25, ..., 120}, σ² ∈ {0, 0.05, 0.1, ..., 2}. Explain your observations.
(e) Why do we take the average over many experiments? Use the variance to select an acceptable number of experiments to average over.
(f) Repeat this experiment for classification, where the target function is a noisy perceptron, f(x) = sign(Σ_{q=1}^{Q_f} a_q L_q(x) + ε). Notice that a_0 = 0, and the a_q's should be normalized so that E_{a,x}[(Σ_{q=1}^{Q_f} a_q L_q(x))²] = 1. For classification, the models H_2, H_10 contain the sign of the 2nd and 10th order polynomials respectively. You may use a learning algorithm for non-separable data from Chapter 3.
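A sketch (my own, under the stated setup) of generating one experiment's target and data set: random Legendre coefficients rescaled so that E_x[f²] = 1 for the realized target (one reading of the normalization), then noisy observations. It reuses the legendre_up_to routine sketched after Problem 4.3.

```python
import numpy as np

def legendre_up_to(K, x):
    x = np.asarray(x, dtype=float)
    L = [np.ones_like(x), x.copy()]
    for k in range(2, K + 1):
        L.append(((2 * k - 1) / k) * x * L[k - 1] - ((k - 1) / k) * L[k - 2])
    return np.stack(L[:K + 1])          # shape (K+1, len(x))

def make_experiment(Qf, N, sigma, rng):
    """Random degree-Qf Legendre target with E_x[f^2] = 1, plus a noisy data set."""
    a = rng.normal(size=Qf + 1)
    # For x uniform on [-1, 1], E_x[L_q^2] = 1/(2q+1), so E_x[f^2] = sum_q a_q^2 / (2q+1).
    a /= np.sqrt(np.sum(a**2 / (2 * np.arange(Qf + 1) + 1)))
    f = lambda x: a @ legendre_up_to(Qf, x)
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.normal(size=N)
    return f, x, y

# Illustrative usage:
f, x, y = make_experiment(Qf=20, N=50, sigma=1.0, rng=np.random.default_rng(10))
```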
Problem 4.5
If λ < 0 in the augmented error E_aug(w) = E_in(w) + λ wᵀw, what soft order constraint does this correspond to? [Hint: λ < 0 encourages large weights.]

Problem 4.6
In the augmented error minimization with Γ = I and λ > 0:
(a) Show that ‖w_reg‖ ≤ ‖w_lin‖, justifying the term weight decay. [Hint: start by assuming that ‖w_reg‖ > ‖w_lin‖ and derive a contradiction.] In fact, a stronger statement holds: ‖w_reg‖ is decreasing in λ.
(b) Explicitly verify this for linear models. [Hint: w_reg = (ZᵀZ + λI)⁻¹u, where u = Zᵀy and Z is the transformed data matrix. Show that ZᵀZ + λI has the same eigenvectors, with correspondingly larger eigenvalues, as ZᵀZ. Expand u in the eigenbasis of ZᵀZ. For a matrix A, how are the eigenvectors and eigenvalues of A² related to those of A?]
Problem 4.7
Show that the in-sample error E_in(w_reg) from Example 4.2 is an increasing function of λ, where w_reg = (ZᵀZ + λI)⁻¹Zᵀy and Z is the transformed data matrix. To do so, let the SVD of Z be Z = UΣVᵀ, and let ZᵀZ have eigenvalues σ_1², ..., σ_{d+1}². Define the vector a = Uᵀy. Show that
$$E_{\text{in}}(w_{\text{reg}}) = E_{\text{in}}(w_{\text{lin}}) + \frac{1}{N}\sum_{i} \left(\frac{\lambda}{\sigma_i^2 + \lambda}\right)^2 a_i^2,$$
and proceed from there.
Problem 4.8
In the augmented error minimization with Γ = I and λ > 0, assume that E_in is differentiable and use gradient descent to minimize E_aug:
$$w(t+1) \leftarrow w(t) - \eta\, \nabla E_{\text{aug}}(w(t)).$$
Show that the update rule above is the same as
$$w(t+1) \leftarrow (1 - 2\eta\lambda)\, w(t) - \eta\, \nabla E_{\text{in}}(w(t)).$$
Note: this is the origin of the name 'weight decay': w(t) decays before being updated by the gradient of E_in.
Problem 4.9
In Tikhonov regularization, the regularized weights are given by
$$w_{\text{reg}} = (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y.$$
The Tikhonov regularizer Γ is a k × (d+1) matrix, each row corresponding to a (d+1)-dimensional vector. Each row of Z corresponds to a (d+1)-dimensional vector (the first component is 1). For each row of Γ, construct a virtual example (z_i, 0), for i = 1, ..., k, where z_i is the vector obtained from the i-th row of Γ after scaling it by √λ, and the target value is 0. Add these k virtual examples to the data, to construct an augmented data set, and consider non-regularized regression with this augmented data.
(a) Show that, for the augmented data,
$$Z_{\text{aug}} = \begin{bmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{bmatrix}, \qquad y_{\text{aug}} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$
(b) Show that solving the least squares problem with Z_aug and y_aug results in the same regularized weight w_reg, i.e., w_reg = (Z_augᵀ Z_aug)⁻¹ Z_augᵀ y_aug.
This result may be interpreted as follows: an equivalent way to accomplish weight-decay-type regularization with linear models is to create a bunch of virtual examples all of whose target values are zero.
Problem 4.10
In this problem, you will investigate the relationship between the soft order constraint and the augmented error. The regularized weight w_reg is a solution to
$$\text{minimize } E_{\text{in}}(w) \quad \text{subject to} \quad w^T \Gamma^T \Gamma\, w \le C.$$
(a) If w_linᵀΓᵀΓw_lin ≤ C, then what is w_reg?
(b) If w_linᵀΓᵀΓw_lin > C, the situation is illustrated below:
[Figure: the constraint is satisfied in the shaded region, and the contours of constant E_in are the ellipsoids (why ellipsoids?).]
What is w_regᵀΓᵀΓw_reg?
(c) Show that, with
$$\lambda_C = -\frac{1}{2C}\, w_{\text{reg}}^T \nabla E_{\text{in}}(w_{\text{reg}}),$$
w_reg minimizes E_in(w) + λ_C wᵀΓᵀΓw. [Hint: use the previous part to solve for w_reg as an equality constrained optimization problem, using the method of Lagrange multipliers.]
(d) Show that the following hold for λ_C:
(i) If w_linᵀΓᵀΓw_lin ≤ C, then λ_C = 0 (w_lin itself satisfies the constraint).
(ii) If w_linᵀΓᵀΓw_lin > C, then λ_C > 0 (the penalty term is positive).
(iii) If w_linᵀΓᵀΓw_lin > C, then λ_C is a strictly decreasing function of C. [Hint: show that dλ_C/dC < 0 for C ∈ [0, w_linᵀΓᵀΓw_lin].]
Problem 4.11
For the linear model in Exercise 4.2, the target function is a polynomial of degree Q_f; the model is H_Q, with polynomials up to order Q. Assume Q ≥ Q_f. Then w_lin = (ZᵀZ)⁻¹Zᵀy, where y = Zw_f + ε, w_f specifies the target function, and Z is the matrix containing the transformed data.
(a) Show that w_lin = w_f + (ZᵀZ)⁻¹Zᵀε. What is the average function ḡ? Show that bias = 0 (recall that bias(x) = (ḡ(x) − f(x))²).
(b) Show that
$$\text{var} = \sigma^2\, \text{trace}\!\left(\Sigma_z\, \mathbb{E}_Z\!\big[(Z^T Z)^{-1}\big]\right), \qquad \text{where } \Sigma_z = \mathbb{E}[z z^T].$$
For the well specified linear model, the bias is zero, and the variance is increasing as the model gets larger (Q increases), but decreasing in N.
Problem 4.12
Use the setup in Problem 4.11 with Q ≥ Q_f. Consider regression with weight decay using a linear model H in the transformed space, with input probability distribution such that E[zzᵀ] = I. The regularized weights are given by w_reg = (ZᵀZ + λI)⁻¹Zᵀy, where y = Zw_f + ε.
(a) Show that w_reg = w_f − λ(ZᵀZ + λI)⁻¹w_f + (ZᵀZ + λI)⁻¹Zᵀε.
(b) Argue that, to first order in 1/N,
$$\text{bias} \approx \frac{\lambda^2}{(N+\lambda)^2}\,\|w_f\|^2, \qquad \text{var} \approx \frac{\sigma^2}{N}\,\mathbb{E}\big[\text{trace}(H^2(\lambda))\big],$$
where H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ.
If we plot the bias and var, we get a figure that is very similar to Figure 2.3, where the tradeoff was based on fit and complexity rather than bias and var. Here, the bias is increasing in λ (as expected) and in ‖w_f‖; the variance is decreasing in λ. When λ = 0, trace(H²(λ)) = Q + 1, and so trace(H²(λ)) appears to be playing the role of an effective number of parameters.

[Figure: bias and var as functions of the regularization parameter λ.]
Problem 4.13
Within the linear regression setting, many attempts have been made to quantify the effective number of parameters in a model. Three possibilities are:
(i) d_eff(λ) = 2 trace(H(λ)) − trace(H²(λ)),
(ii) d_eff(λ) = trace(H(λ)),
(iii) d_eff(λ) = trace(H²(λ)),
where H(λ) = Z(ZᵀZ + λΓᵀΓ)⁻¹Zᵀ and Z is the transformed data matrix. To obtain d_eff, one must first compute H(λ) as though you are doing regression. One can then heuristically use d_eff in place of d_vc in the VC bound.
(a) When λ = 0, show that, for all three choices, d_eff = d + 1, where d is the dimension in the Z space.
(b) When λ > 0, show that 0 ≤ d_eff ≤ d + 1 and d_eff is decreasing in λ, for all three choices. [Hint: use the singular value decomposition.]
Problem 4.14
The observed target values y can be separated into the true target values f = Zw_f and the noise ε, y = Zw_f + ε. The components of ε are iid with zero mean and variance σ². For linear regression with weight decay regularization, by taking the expected value of the in-sample error, show that
$$\mathbb{E}_{\varepsilon}\big[E_{\text{in}}(w_{\text{reg}})\big] = \frac{1}{N}\, w_f^T Z^T (I - H(\lambda))^2 Z\, w_f + \frac{\sigma^2}{N}\, \text{trace}\big((I - H(\lambda))^2\big)
= \frac{1}{N}\, w_f^T Z^T (I - H(\lambda))^2 Z\, w_f + \sigma^2\left(1 - \frac{d_{\text{eff}}}{N}\right),$$
where d_eff = 2 trace(H(λ)) − trace(H²(λ)), as defined in Problem 4.13(i), H(λ) = Z(ZᵀZ + λI)⁻¹Zᵀ, and Z is the transformed data matrix.
(a) If the noise was not overfit, what should the term involving σ² be, and why?
(b) Hence, argue that the degree to which the noise has been overfit is σ² d_eff/N. Interpret the dependence of this result on the parameters d_eff and N, to justify the use of d_eff as an effective number of parameters.
Problem 4.15
We further investigate d_eff of Problems 4.13 and 4.14. We know that H(λ) = Z(ZᵀZ + λΓᵀΓ)⁻¹Zᵀ. When Γ is square and invertible, as is usually the case (for example, with weight decay, Γ = I), denote Z̃ = ZΓ⁻¹. Let s_1², ..., s_{d+1}² be the eigenvalues of Z̃ᵀZ̃ (s_i² > 0 when Z has full column rank).
(a) For d_eff(λ) = 2 trace(H(λ)) − trace(H²(λ)), show that
$$d_{\text{eff}}(\lambda) = d + 1 - \lambda^2 \sum_{i=1}^{d+1} \frac{1}{(s_i^2 + \lambda)^2}.$$
(b) For d_eff(λ) = trace(H(λ)), show that
$$d_{\text{eff}}(\lambda) = d + 1 - \lambda \sum_{i=1}^{d+1} \frac{1}{s_i^2 + \lambda}.$$
(c) For d_eff(λ) = trace(H²(λ)), show that
$$d_{\text{eff}}(\lambda) = \sum_{i=1}^{d+1} \frac{s_i^4}{(s_i^2 + \lambda)^2}.$$
In all cases, for λ ≥ 0, 0 ≤ d_eff(λ) ≤ d + 1, d_eff(0) = d + 1, and d_eff is decreasing in λ. [Hint: use the singular value decomposition Z̃ = USVᵀ, where U and V are orthogonal and S is diagonal with entries s_i.]
Problem 4.16
For linear models and the general Tikhonov regularizer Γ, with penalty term (λ/N) wᵀΓᵀΓw in the augmented error, show that
$$w_{\text{reg}} = (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y,$$
where Z is the feature matrix.
(a) Show that the in-sample predictions are ŷ = H(λ)y, where H(λ) = Z(ZᵀZ + λΓᵀΓ)⁻¹Zᵀ.
(b) Simplify this in the case Γ = Z and obtain w_reg in terms of w_lin. This is called uniform weight decay.

Problem 4.17
To model uncertainty in the measurement of the inputs, assume that the true inputs x̂_n are the observed inputs x_n perturbed by some noise ε_n: the true inputs are given by x̂_n = x_n + ε_n. Assume that the ε_n are independent of (x_n, y_n), with E[ε_n] = 0 and covariance matrix E[ε_n ε_nᵀ] = σ_x² I. The learning algorithm minimizes the expected in-sample error Ê_in, where the expectation is with respect to the uncertainty in the true x̂_n:
$$\hat{E}_{\text{in}}(w) = \mathbb{E}_{\varepsilon_1, \ldots, \varepsilon_N}\!\left[\frac{1}{N}\sum_{n=1}^{N}\big(w^T \hat{x}_n - y_n\big)^2\right].$$
Show that the weights ŵ_lin which result from minimizing Ê_in are equivalent to the weights which would have been obtained by minimizing E_in = (1/N)Σ_n (wᵀx_n − y_n)² for the observed data, with Tikhonov regularization. What are Γ and λ (see Problem 4.16 for the general Tikhonov regularizer)? One can interpret this result as follows: regularization enforces a robustness to potential measurement errors (noise) in the observed inputs.
Problem 4.18
In a regression setting, assume the target function is linear, so f(x) = w_fᵀx, and y = Zw_f + ε, where the entries of ε are iid with zero mean and variance σ². In this problem derive the optimal value for λ as follows. Assume a regularization term (λ/N) wᵀZᵀZw and that E[xxᵀ] = I.
(a) Show that the average function is ḡ(x) = f(x)/(1+λ). What is the bias?
(b) Show that var is asymptotically σ²(d+1)/(N(1+λ)²). [Hint: Problem 4.12.]
(c) Use the bias and asymptotic variance to obtain an expression for E[E_out]. Optimize this with respect to λ to obtain the optimal regularization parameter. [Answer: λ* = σ²(d+1)/(N‖w_f‖²).]
(d) Explain the dependence of the optimal regularization parameter on the parameters of the learning problem. [Hint: write λ* = ((d+1)/N) · (σ²/‖w_f‖²).]
Problem 4.19
[The lasso algorithm] Rather than a soft order constraint on the squares of the weights, one could use the absolute values of the weights:
$$\text{minimize } E_{\text{in}}(w) \quad \text{subject to} \quad \sum_{i=0}^{d} |w_i| \le C.$$
The model is called the lasso algorithm.
(a) Formulate and implement this as a quadratic program. Use the experimental design in Problem 4.4 to compare the lasso algorithm with the quadratic penalty by giving plots of E_out versus regularization parameter.
(b) What is the augmented error? Is it more convenient to optimize?
(c) With d = 5 and N = 3, compare the weights from the lasso versus the quadratic penalty. [Hint: look at the number of non-zero weights.]
44 PROBLES
4 VERFITTING
Problem 4.20 In thi problem you will explore a conitency condition fr weight de cay Sup poe that we ma ke an iner tib le li near tran rm of the data
n
yn
I ntu itiely li near re greion hou ld n ot b e afected by a l in ear tranrm Th i mean th at the new o ptim al weight ho uld be gi en by a cor repondi ng l inea r tranf rm of the old optim al weight
a ) Suppoe
w min im ize the in am ple e rror or the o rig ina l problem . Show th at for the tranrmed proble m t he op ti ma l weight are
b ) Suppoe the regularization penalty term in the augmented error i wXXw r the srcinal data and
ww r the tranrme d data. On the srcinal data the regularized olution i () Show that for the tranrm ed proble m the am e l inear trano rm of () gie the correpondi ng regula rized weigh t or the tranfr med p rob lem
Problem 4.21
The Tikhono moothne penalty which penalize 2 deriatie of i dx Show that r linear model thi red uce to a penalty of the rm ww What i ?
h (h)
Problem 4.22
You hae a data et with 100 data point You hae 100 model eac h wi th VC dimenion 10 You et aide 5 point f r alidati on . You elect the model which produced minimum alidation error of 0.5 Gie a bound on the ou t of ampl e err or r thi e lected f unction Suppoe you intead trained each model on all the data and elected the func tion wi th mi nim um in ampl e err or The reulting in amp le err or i 0.15 Gie a bound on the out of am ple err or i n thi c ae [Hint Use the boun in
Poblem to boun th e VC imension of the union of a the moels.]
Problem 4.2
Thi probl em ine tigate the coariance of the l eae one out cro al idatio n error C[n e Aume that r wel l behaed model the learning proce i table' and o the change in the learned hypothei hou ld be mall 'O ' if a new data point i added to a data et of ize N. Write g g
162
4 VERFITTIN
4.4 PROBLEMS
() Show tht Var[Bev] =
:=l Var [n] + :#m C[n, em ]
(b) Show C [n, m] = Var [ Bout( g -2]+ h igher order i 8n , Om Argue tht (c) Assu e tht y ters i volvig On, Om re O(
)
Does Var [e ] dec y to zero with Wht bout Var[Bout ( g]? (d) Use th e experietl desig i P roble . to study Var[Bev] d give log lo g plot of Var[Bev]/Var[] versus N Wht is th e dec y rte
Problem 424
F d = 3 geerte r do d t set with N poits s llows For ech poit ech di esio of hs s tdrd Norl distr ib utio Si i l rly g eer te (d+ di esio l trg et weight vector Wf d set Yn = w' Xn + n where n is oi se ( lso fro s tdrd Norl d istr ibut io) d is the oise vrice; set to 05 Use li er regressio with weight decy r egul riztio to es ti te Wf with W Set the regulriztio preter to 005/N () For N {d+15 , d+ 5, , d+ 115} compute the cros s vl idtio errors e e d Bev Repet the experiet (sy) 10 ties itiig the verg e d vri ce ov er th e experimets of e e d Bev (b ) How sh ou ld you r verge o f th e e 's rel te to the verge of the Bev's; how bout to the v erge of the e 's Sup port you r cli usig results fro your exper iet (c) Wht re the cotribu tors to the vri ce of the e 's (d) If the cro ss vlidtio errors were truly i depedet how should t he vri ce of the e 's rel te to the vri ce of the Bev's? (e) O e esure of the ef ectiv e u b er of fres h ex m ples used i cop ut ig Bev is the rtio of the vri ce of the e 's to th t of the Bev's xpli why d plot versus N the efective u b er of fr esh ex p les N s percetge of N You shoul d d tht N is close to N (f) If you icre se the o ut of r egul riztio will N go up or dow xpl i reyou th (ee)se e xperiet w ith Are= 5/N d cop your rreso resu ltsig fro Ruprt t o veri fy your cojectu
Problem 425
Whe u sig vli dtio s et r ode l sele ctio l l odels were lered o the m Dti of size N K d vlidted o the m Dvl of size K We hve the VC boud (see qu tio ( 4 1 ) :
Eout ( g •
Evl (g + •
(J)
ontinud on nxt pag)
163
44. PROLMS
4 VRFI I G
M Suppose that instead you had no control oer the alidation process So learners each with their own models present you with the results of their al idation processes on diferent alidati on sets Here is what y ou know about each learner Each learner reports to you the size of their alidation set K and the alidation error Ev! ( ) he learners may hae used dif frent lear ned t rai niused ng set an d adata idatesets d onexcep a heldt that out they alida it tionhful se tlywhich waons a only r alidation purposes. As the m odel selecto r you h ae to decide whi ch learn er to go with
a Sh ou ld yo u sele ct th e learner with m in im um al idation erro r? If yes why? If no why not? {Hint: thin VC-bound.j b If al models are alid ated on the same al idation se t as des cribed in t he text wh y is it o kay to selec t the lear ner with t he l owest al id ation error? c After selec ti ng learn er * say show that Eva1 (m* ) + ] : Me- 2€ " (E ) , e 2 € K is an "aerage alidation _ 2 ln
J [Eout (m* ) where K()
=
>
(
)
d setShowsize.that with
prob ab il ity at l east 8 whi ch satis es * In 2 *
- Eout Ev! * r any
*
e Show that min K : () : l K . Is this boun d bet ter or worse than the bound when all models use the same alidation set size equa l to the aerage al idation set siz e l K ? Proble 4. 26 I n th is probl em derie the f rmu a fr th e exact expr essi on r the lea eone out cross al idati on error r li near reg ression . Let be the data mat rix whose ro ws correspond to th e transrmed dat a poi nts
Z n = (xn)
a Show that :
N
N
ZTZ = nl nz Z Ty = nl nYn ZTZ rTr and H() = Z() V Hence where = () show that when (z n Yn ) is left out ZTZ ZTZ - n, and ZTy ZTy nYn
b Compte
w� the weight ector learned when the th data point is left out and show that: w
lnTn nl ) (ZTy
164
nYn )
4 VERFI IN G
44 PROBLEMS
use the identity ( A xxT ) - 1 = A 1 +
A - 1 xxT A
() Usin g (a) and ( b) sho that w = w + regression eight vetor using a the data (d) he predition on the vaidation point is given by
1
T A
.]
A - 1 zn , here w is the z�w. Sho that
ZnT n = Yn Hnnn Hnn • {e) Sho that e
( ,
and hene prove Equation ()
Problm 4.2 7
Cross va i dati on gives an aurate es ti mate of (N- but i t an b e quite s ens itiv e eading to pr obems i n mode see tion A ommon heu rist i r regu arizing ross va idation is to use a measure of error Ocv( ) r the ross va idation esti mate i n mode see tion (a) ne hoie r divided by
0v is the stan da rd deviation of the eave- one-out error s
cv �
' en ). Wh divide by ?
(b) or inear modes sho that Vcv = ;1 E;v . () (i) iven the best mod e * the onserv ative one-siga a pproah se ets the sim pes t ode ith in Ocv( * ) of the best (ii) he bound m in im izing appr oa h se ets the mode hih mi ni mize s
Ecv() + Ocv( )
Use the experie nta d esign in P robem 4 4 to ompa re these a pproahes ith the u n regu arized ross va idat ion estimate as f os ix Q = 5 Q = 0 and = Use ea h of th e to m ethods propo sed h ere as e as traditiona ros s va idation to see t the optim a va ue of the regu arization parameter > in the range {0 05 0 0 0 5 5} using eight deay reguarization O(w) = fww. Pot the resuting out-of-sampe error r the m ode see ted usin g eah meth od as a funtion of N ith N in the range { x Q3 x Q 0 x Q } What a re your onusions?
165
166
Chap
5
Three Learning Principles he study of learning om data highlights some general principles that are fascinating concets in their own right. Haing gone through the mathematical analysis and empirical illustrations of the rst w chapters, we hae a good undation om which to articulate some of these principl es and explain them in concrete terms. In this chapter, we will discuss three prin ciples . he rst one is related to the choice of model and i s called Oc cams razor. he other tw o are relate d to data sampling bias establishes an important principle about obtaining the data, and data snooping establishes an important principle about handling the data. A genuine understanding of these principles will protect you om the most common pitfalls in learning om data, and allow you to interpret generalization perfrmance properly.
5.1
Occa ' s Razr
Although it is not an exact quote of Einsteins, it is oen attributed to him that An explanation of the data should be made as simple as possible but no simpler." A similar principle, ccam 's Razor dates om the 14th centur y and is attributed to William of Occam, where the razor is meant to trim down the In explanation to of thelearning, bare minimum that isrconsistent with the data. the context the penalty model complexity which was introduced in Section 2. 2 is a manistation of Occ ams razor. If E 0 then the explanation hypothesis is consistent with the data. In this case, the most plausible explanation, with the lowest estimate of E gien in the VC bound 2.14 , happ ens when the complexity of the explanation measured by d(H)) is as small as possi ble. Here is a statement of the underlying principle.
The simplest model that ts the data is also the most plausible. 167
HREE EARNIN PRINCIES
CAM'S AZOR
Applyng ths pncpl e we should choose as smple a model as we thnk we can get away wth Although the pncple that smple s bette may be ntutve t s ne the pec se no selfevdent When we apply the pncpl e to leanng om data thee ae two basc questons to be asked 1 What does t mean a model to be smple? 2 How do we know that smple s bett e? Let s sta t wth the st ques ton hee a e two dstnc t appoaches to de n ng the noton of complexty one based on a mly of objects and the othe based on an ndvdual object We have aleady seen both appoaches n ou analyss he VC dmenson n Chapte 2 s a measue of complexty and t s based on the hypoth ess set a whole e based on a amly of object s he egulazaton t em of the augmented eo n Chapte 4 s al so a meue of complexty but n ths case t s the complexty of an ndvdual object namely the hypothess h he two appoaches to denng complexty ae not encounteed only n leanng om data they ae a ecung theme wheneve complexty s ds cussed Fo nstance n nmaton theoy entopy s a measue of complexty based on a mly of objects whle minimum description length s a elated measue based on ndvdual objects hee s a eason why ths s a ecung theme he two appoaches to denng complexty ae n ct elated When we say a fmly of objects s complex we mean that the mly s b g hat s t contans a lage vaety of object s heee each nd vdual objec t n the ml y s one of many By contast a smple mly of objects s small t h elatvely w objects and each ndvdual object s one of fe Why s the shee numbe of objects an ndcaton of the level of complext? he eason s that both the numbe of objects n a mly and the complexty of an object ae elated to how many paametes ae needed to spec the obj ect When you ncease the numbe of paametes n a leanng model you smultaneously ncease how dvese s and how complex the ndvdual h s Fo example consde 17th ode polynomals vesus 3d ode polynomals hee s moe vaety n 17th ode polynomals and at the same tme the ndvdual 1 7th ode polynomal s moe complex than a 3d ode polynomal he most co mmon dentons of object complexty ae based o n the numbe of bts needed to desc be an object Unde such dentons an object s smple f t has a shot desc pt on hee e a smple objec t s not only ntnscally smple (as t can be descbed succnctly) but t also has to be one of w snce th ee ae we objects that have shot descptons than thee ae that have long descptons as a matte of smple countng
xrcis 51 Conder hypoth e et H1 a n d 1 th at contan Boole an fn cton on Boolean varab le o X { 1 H1 contan all Boolean fncton
168
HREE EARI PRICILES
wic evaate
n exact ne int int an
00 cntains a Be an ntins wic eva ate
int ints an t
esewee
CCAM'S A OR
esewee
n exa
a w ig n m e teses ae 1 an 1 00? ( w man its ae neee t sei ne e eses i n ( c w man its ae neee sei ne te teses n
? 100
We now addess the second queston When Occ ams azo says that smple s bette t do esnt mean smple s moe elegant It means smple has a bette chance of beng ght Occ ams azo s about pemance not abou t aesthet cs If a complex explanaton of the data pems bet te we wll take t he agument that smple has a bette chance of beng ght goes as l lows We ae tyng to t a hypothess to ou data ' = {(x 1 , Y1 ) , (x N , YN ) } (assume Yn s ae bnay) hee ae we smple hypotheses than thee ae complex one s Wth complex hypoth eses thee would be enough of them to shatte x 1 , XN, so t s cetan that we can t the data set egadless of what the labels Y1 , YN ae even f these ae completely andom. hee e ttng the data does not stll mean much n stead we have a smple model wth fw hypotheses and we und oneIfthat pectly ts the dchotomy ' = {( xi , Y1), , (xN , YN ) } , ths s supsng and theee t means some thng. Occams Razo has been fmally poved unde deent sets of dealzed condtons he above agument captues the essence of these po o f some thng s less lkely to happen then when t does happen t s moe sgncant Let us look at an example Example 51. Suppose that one constucts a physcal theoy about the e sstvty of a metal unde vaous tempeat ues In ths theoy asde om some constants that need to be detemned the esstvty has a lnea de pendence on the tempeatue In ode to ve that the theoy s coect and to obtan the unknown constants 3 scentsts conduct the llowng thee expements and pesent the data to you
]
Q
/
:f
Q
Q
temperature cietist
T
temperature cietist
169
T
temperature cietist
T
51. CCAM'S AZR
5 . HREE EARNIN PRINCILES
It s clea that Scentst 3 has poduced t he most convncng evd ence f th e theoy If the measu ements ae exact the n Scentst 2 h as managed to ls the theoy and we ae back to the dawng boad What about Scentst 1 Whle he has not lsed the theoy has he povded an y evdence t he answe s no we can e vese the qu eston Suppose that the theoy was not coect what could the data have done to pove hm wong othng snce any two ponts can be joned by a lne heee the model s not j ust lkely to t the data n ths case t s cetan to do so hs endes the t tot ally nsgncant when t does happen
hs example llustates a concept elated to Occams Razo whch s the he axom assets that the data should have some chance of alsfyng a hypothess f we ae to conclude that t can povde evdenc e the hypothes s One way to guaantee that evey data set has some chance at flscaton s the VC dmenson of the hypothess set to be less than N, the numbe of data ponts hs s dscussed the n Poblem 51 Hee s anothe example of the same concept
aiom of non-falsiability
Example 52 Fnancal ms ty to pck good t ades ( pedctos of whethe the maket wll go up o not) Suppose th at each tade s tested on the pedcton (up o down) ove the next 5 days and those who pefm well wll be hed One mght thnk that ths pocess should poduce bette and bette tade s on Wall St eet Vewed as a leanng pobl em consde each tade to be a pedcton hypothess Suppose that the hng po ol s complex we ae ntevewng 2 tades who happen to be a dvese set of people such that the pedct ons ove the next 5 days ae all deent ecessaly one of thes e tades gets t all coect and wll be hed Hng the tade though ths pocess may o may not be a god thng snce the pocess wll pck someone even f the tades ae just ppng cons to make the pedct ons A pefct pedcto always exsts n ths goup so ndng one doesn t mean much If we wee ntevewng onl y two tades and one of them made peect ped ctons that would mean somethng
xerise 5 u os e a 5 eeks i n a ro, a leer aie s i n h e ai l ha e is e ouc oe f e ucoi ng Mna nig oall ga e u e enl ah eac Mona an o ou surise, he reicion is core ah ie On he ay a fer he h gae, a leer arives , sain g ha i ou ish o s ee nex ees rei cion, a aen o $5 0 is eu ired Sh ou l you ay? a a n ossib le redicions of inl ose are here f r 5 gaes? b If he sender ans o ake sure ha a leas one erson eeives orre predicion s on a l l 5 gaes fro hi , ho any people shou l he arg e o begi n ih?
170
5 HREE EARNING PRINCILES
5.. AMLING IAS
(c h rs r c ng h o ucom o h frs gam , ho ma n o h orgn a rcn s o s ar h scon r Ho ma n rs ao av n sn a h of 5 ks? ( f h cos of rn n g n m a n g ou ac s $0.5 ho much ou d h snr ak h n f 5 orc rcons sn n h 5.? (f a ou ra h s suaon o h goh fun con an h cr b of ng h aa
Leanng om data takes Occam's Razo to anothe level gong beyond as smple as poss le but no smple Indeed we may opt f a smple t than possble' namely an mpefct t of the data usng a smple model ove a pefct t usng a moe complex one he eason s that the pce we pay a pefct t n tems of the penalty f model complexty n (214) may be too much n compason to the benet of the bette t hs dea was llustated n Fgue 3 7, and s a manfestaton of ovettng he dea s also the atonale behnd the ecommended polcy n Chapte 3: rst ty a lnea model one of the smplest models n the aena of leanng om data
5.2
Sm pli ng Bis
A vvd example of samplng bas happened n the 1948 US pesde ntal electo n betwee n uman and Dewey On electon nght a mao newspape caed out a telephone poll to ask people how they voted he poll ndcated that Dewey won and the pape was so condent about the small eo ba n ts poll that t declaed Dewe y the wnne n ts headlne When the actual vote s wee counte d Dewey lost to the delght of a smlng uman
@ss Pss 171
HR ARNN PRNCLS
2 AMLN AS
hs was not a case of statstcal anomaly whee the newspape was just ncedbly unlucky emembe the n the VC bound t was a case whee the sample was doomed om the getgo egadless of ts sze Een f the expement wee epeate d the esult would be the same n 1 48 , telephones wee expense and those who had them tended to be n an elte goup that aoed Dewey much moe than the aeage ote dd Snce the newspape dd ts poll by telephone t nadetently used an nsample dstbuton that was deent om the outofsample dstbuton hat s what samplng bas s
the data s sampled n a based ay learnng ll pro duce a smlarly based outcome Applyng ths pncple we should make sue that the tanng and testng dstbutons ae the same f not ou esults may be nald o at the ey least eque caeul ntepetaton f you ecall the VC analyss made ey w assumptons but one as sumpton t dd make was that the data set s geneated om the same dstbuton that the nal hypothess g s tested on n pactce we may en counte data sets tha t wee not geneated unde those deal condto ns hee ae some technques n statstcs and n leanng to compensate the ms match between tanng and testng but not n cases whee was geneated wth the excluson cetan pats nputexample space suchhee as thes excluson of households wth noof telephones n of thetheaboe nothng that can be done when ths happens othe than to admt that the esult wll not be elable statstcal bounds lke Hoedng and VC eque a match between the tanng and testng dstbutons hee ae many examples of how samplng bas can be ntoduced n data collec ton n some cases t s nadetently ntoduc ed by an oesght as n the case of Dewey and uman n othe cases t s ntodu ced because cetan type s of data ae not aalable Fo nstance n ou ced t example of Chapte 1 , the bank ceated the tanng set om the databa se of peous cus tomes and how they pefmed f the bank Such a set necessaly excludes those who appled to the bank f cedt cads and wee ejected because the bank does not hae data on how they ould have perfrmed f they wee ac cepted Snce utue applcants wll come om a mxed populaton ncludng some who would hae been ejected n the past the test set comes om a deent dstbuton than the tanng set and we hae a case of samplng bas n ths patcula case f no data on the applcants that wee ejected s aalable nothng much can be done othe than to acknowledge that thee s a bas n the nal pedcto that leanng wll poduce snce a epesentate tanng set s just not aalable
xercse 53 n an epermen o e ermn e he d srbuon of szes of sh n a a ke, a ne mgh be used o cach a rep resenav e sam pe o f sh he sam pe s
172
HREE EARNIN PRINCILES
3 ATA NOOIN
hen nled o nd o he fcions o sh o ifeen sizes e sl e is ig nogh , si sic l conl sions y e dn o he cl is iio n i n h e eni e le n o se ll sling is ?
hee ae othe cases, aguably moe common, whee samlng bas s nto duced by human nteventon It s not that uncommon someone to thow away tanng examles they dont lke! A Wall Steet m who wants to de velo an automated tadng system mght choose data sets when the maket was behavng well to tan the system, wth the semlegtmate justcaton that they don t want the nose to comlcate the tanng ocess hey wll suely acheve that f they get d of the b ad' examles, but they wll ceate a system that can be tusted only n the eods when the maket does behave well! What haens when the maket s not behavng well s anybodys guess In geneal, thowng away tanng examles based on the values, eg, ex amles that look lke outles o dont confm to ou econceved deas, s a fly common samlng bas ta
ther biases. Samlng bas has also b een called selecto n bas n t he stats tcs communty We wll stck wth the moe desctve tem samlng bas f two easons Fst , the bas ases n how the data was sampled; second, t s less ambguous because n the leanng context, thee s anothe noton of selecton bas dftng aound selection of a nal hyothess om the leanng model based on the data he emance of the selected hyothess on the data s otmstcally based, and ths could be denoted as a selecton bas We have ee d to th s tye of bas smly as bad genealzaton hee ae vaous othe bases that have smla avo hee s even a secal tye of bas the eseach communty, called ublcaton bas! hs es to the bas n ublshe d scent c esults because negat ve esults ae often not ublshed n the lteatue , wheeas ostve esults ae he common theme of all of these bases s that they ende the standad statstcal conclusons nvald because the basc emse such conclusons, that the samlng dstbuton s the same as the oveall dstbuton, does not hold any moe In the eld of leanng om data, t s samlng bas n the tanng set that we need to woy about
5.3
Data Snoo ing
Data snoong s the most common ta acttones n leanng om data he ncle nvolved s smle enough,
a data set has aeted any step in the learnin proess its ability to assess the otome has been ompromised. 173
3 AA NOOING
HREE EARNING PRINCILES
Applyng ths pncple, f you want an unbased assessment of you leanng pe mance , you should keep a test set n a vault and neve use t f leanng n any way. hs s bascally what we have been talkng about all along n tanng v esus testng, but t goes beyond that. Even f a data set has not been physcally used tanng, t can stll aect the leanng pocess, sometmes n subtle ways.
Exercise 5.4 Cons ider the oing a ppro ach o earning B ooking at the dat a it app ears th at th e at a is ineary se arabe s o e go ahead an d use a simp e perce ptron and g et a trai ni ng error of ero af tr etermin ing the otim a set o f eigh ts We no ish to make some generaization concusions so e ook up the c r our ear ni ng mode a nd see tha t it is + 1 herere e use this vaue of vc to get a bo un on te est error a What is the pr ob em it h h is bou nd is it correct? b Do e kno the vc r the earni ng mode th at e actua y used? It is th is vc that e need to use in the bound
o avod theleanng ptfll nmodel the above s extemely mpotant thatcan yoube beforeexecse, choose you seeng tany of the dat a. he choce based on geneal nmaton about the leanng poblem, such as the num be of data ponts and po knowledge egadng the nput space and taget funct on, but not on the actual data set Falue to obseve ths ule wll nvaldate the VC bounds, and any genealzaton conclusons wll be up n the a. Even a cael peson can all nto the taps of data snoopng. Consde the llowng example.
Example 5 An nvestment bank wants to develop a system f ecastng cuency exchange ates. It has 8 yeas woth of hstoca l data on the US Dolla ( USD ) vesus the Btsh P ound ( GBP ) , so t tes to use the data to see f thee s any patten that can b e exploted. he bank takes the sees of dal y changes n the USD / GBP ate, nomalzes t to zeo mean and unt vaance, and stats to develop a system f ecastng the decton of the change Fo each day, t tes to pedct that decton based on the uctuatons n the pevous 20 days 5 of the data s used tanng, and the emanng 25 s se t asde testng th e nal hypothess. he test shows eat success. he nal hypothess has a ht ate (pe centage of tme gettng the decton ght ) of 52 . 1 . hs may seem modest, but n the wold of nance you can make a lot of money f you get that ht ate conssten tly. Indeed, ove the 500 test days (2 yeas woth, as each yea has about 250 tadng days ) , the cumulatve pot of the system s a espectable 22. 14
. HREE EARNIN PRINCILE
.3. T NOOIN
Day
4
When the system s used n le tadng, the pemance deteoates sg ncantly In fct, t loses oney. Why ddn't the good test pefmance contnue on the new data? In ths case, thee s a sple explanaon and t has to do wth data snoopng Although the bank was cael to set asde test ponts that wee not used f tanng n ode to popely evaluate the nal hypothess, the test data had n fct aected the tanng pocess n a subtle way When the ognal sees of daly changes was nomalzed to zeo ean and unt aance, all of the data was noled n ths step . heefe, the test data that was extacted had aleady contbuted to the choces ade by the leanng algothm by contbutng to the alues of the mean and the aance that wee used n noalzaton. Although ths sees lke a no eect, t is data snoopng. When you plot he cumulate pot on the test set wth o wthout that snoopng step, you see how snoopng esulted n an oe-optstc expectaton compaed to the ealstc expectaton that aods snoopng. It s not the nomalzaton that was a bad dea. It s the noleent of test data n that nomalzaton, whch contanated ths data and endeed ts estmate of the nal peance naccua te
D
One of the ost common occuences of data snoopng s the euse of the same data set . If you ty leanng usng st one model and then anothe and then anothe on the s ame data set , you wll eentually succe ed' . As the sayng goes, f you totue the data long enough, t wll confss If you ty all possble dchotomes, you wll eentually t any data set ths s tue whethe we ty the dchotomes dectly usng a sngle model o ndectly usng a sequence of odels The eecte VC denson the sees of tals wll not be that of the last model that succ eeded, but of the ente unon of odels that could hae been used dependng on the outcomes of deent als. Sometes t he euse of the same da ta set s caed out by de ent peop le Let s say that thee s a publ c data set that you would lke to wok on. Bee you download the data, you ead abou t how othe people dd wth ths data set
©
175
. HREE EARNN PRNCLES
3 ATA NOON
usng deent technqu es You natually pck the most pomsng te chnques as a baselne then ty to mpoe on them and ntoduce you own deas Although you haent een seen the data set yet you ae aleady gulty of data snoop ng You choce of baselne technques wa s aecte d by the data set though the actons of othes You may nd that you estmates of the pemance wll tun out to be too optmstc snce the technques you ae usng hae aleady poen wellsuted to his paiula data set o quantfy the damage done by data snoopng one has to assess the penalty f model com plexty n ( 2 14) takng the snoopng nto consdeat on In the publc data set case the eecte VC dmenson coesponds to a much bgge hypothess set than the that you leanng algothm uses It coes all hypotheses that wee consdeed and mostly ejected) by eeybody else n the pocess of comng up wth the solutons that they publshed and that you used as you baselne hs s a potentally huge set wth ey hgh VC dmenson hence the genealzaton guaantees n (214) wll be much wose than wthout data snoopn g ot all data sets subjected to data snoopng ae equally contamnated he bounds n (16) n the case of a choce between a nte numbe of hy potheses and n (212) n the case of an nnte numbe pode gudelnes the leel of contamnaton he moe elaboate the choce made based on a data set the moe contamnated the set becomes and the less elable t wll
be n gaugng the pemance of the nal hypothess
xerise 5.5 ssum e we se t asie xam ls om that w il l no t e use in t ai ni ng ut wil l e use o select one o t e n al hothe ses roue h ee ifeent lea rni ng a lgoritms t at ra in n he re st on e ata . ach a lgoi m or s with a i feren o size e woul l ie o c aateize e a ccurac of estimat ing ut9 on he electe nal hotesis if e use he same exam les t o m ake at estimae.
2
a hat is he alu e f at shou l e use in i n his situaion b ow oes the leve l o con am inat ion of ese examles omare to the case wher e he y woul e us e i n train in g rather ha n i n e nal select ion?
In ode to deal wth data snoopng thee ae bascally two appoaches 1 Aod data snoopng: A stct dscplne n handlng the data s equed Data that s gong to be used to ealuate the nal pemance should be locked n a sa and only bought out afte the nal hypothess has been decd ed If ntemedate tests ae neede d sepaate data sets should be used that Once a data set has been used t should be teated as contamnated as f as testng the pemance s concened 2 Account data snoop ng: If you hae to use a data set moe than once keep tack of the leel of contamnaton and teat the elablty of 176
. HREE EARIN PRICIES
.3. ATA NOOI
you peance estates n l ght of ths contanaton he bounds (16) and (212) can povde gudelnes the elatve elablty of df ent data sets that have been used n deent oles wthn the leanng pocess
Data snooping versus samplin bias
Saplng bas was dened based
on howonthehow data obtaned bee any leanng data snoopng wasleanng dened based thewas data aected the leanng n patcula how the odel s selected hese ae obvously deent concepts Howeve thee ae cases whee saplng bas occus as a consequence of sno opng' lookng at data that you ae not supposed to look at Hee s an exaple Consde pedctng the peance of deent stocks based on hstocal data In ode to see f a pedcton ule s any good you take all cuently taded copanes and test the ule on the stock data ove the past 50 yeas Let us say that ou ae testng the buy and hold stategy whee you would have bought the stock 50 yeas ago and kept t untl now If you test ths hypothess' you wll g et excellent peance n te s of pot Well don't get to o exc ted ! You nadvetently based the esults n you fvo by pckng only uenl aded ompanies, whch eans that the copanes that dd not ake t ae not pat of you evaluaton When you put you pedcton ule snce to wok wll bedentfy used onwhch all copanes theybewll o not you tcannot copaneswhethe today wll the suvve cuently tad ed' copanes 5 0 yeas o now hs s a typcal case of saplng bas snce the poble s that the tanng data s not epesentatve of the test data Howeve f we tace the ogn o f the bas we dd snoo p' n ths case by lookng at ftue da ta of copanes to detene whch of these copanes to use n ou ta nng Snc e we ae usng nfat on n tann g that we would not have access to n eal tadng ths s vewed as a of data snoopng
177
5 PROBLES
HREE EARNIN PRINCILES
5.4
Prblems
Problem 5.1
The dea of lsiability tha t a clam can be endee d false by obseved data s an m potant pnc p le n expe me ntal scence If th e outco me of an experment has no chance of falsfyng a partcular proposton then the result A Flly of that experment does not provde evdence one way or another toward the truth of the pposton.
E
Consde the poposton hee s that appoxmates f as old be evdenced b y nd ng sch an th n sam ple e o zeo on ." We say that the poposton s falsed f no hypothess n can t the data pefctly.
a ) S p pose that shatte s Sho that ths poposton s not flsable r any f b ) Sppose that f s andom f ( ) = ±1 th pobablty � ndependently on evey so t ( h ) = � o evey . Sho that
E
[falscaton ]
1
-
.
c) Sppose dvc 10 and N 100 If yo obtan a hypothess
h th zeo
b ?
on yo data, hat can yo conclde' fom the eslt n pat
Problem 5.2 Stc tal Rsk M n mzaton S RM ) s a sefl fameok mod el selec ton that s elated to Occam 's Razo. Dene a structure a nested seq en ce of hypoth ess sets:
The SRM fameok pcks a hypothess fom each by mmzg · That s, agm n ( ) . The n, th e fame ok selects the n al hy
h
pothess by mnmzng and the model complexty penalty That s, agmn ( ( ) + ( )) . N ote that ( ) shold be n on deceas ng n g* becase o f t he nested stcte.
a ) Sh o that the n sample e o ( ) s n on nc eas ng n 178
5 . HREE EARING PRINCIES
54 PRBEMS
E
( b ) Assum h a h framork nds g* i ih proba bi Iiy Pi . Ho dos Pi rla o h comp li y of h arg fun cion ( c ) Argu ha h Pi1S ar unknon bu po : p 1 : p2 1 ( d ) uppos g* = 9i· ho ha
[I En(9i) - Eout (gi ) > I g*
gi] : Pi 4mi (N) e
E /s
Hr, he condiioni ng is o n slc ing gi as h nal hypohsis by RM [Hint: Use the Byes theorem to decompose the probbility nd then pply the VC bound on one of the terms} You m ay in rpr h is rsul as ll os if you us RM a nd n d up ih gi, hn h genralizaion bou nd is a fa cor ors han h bound you ould hav gon had you simply sard ih i
Pob le 5.
In ou r crdi card am pl, h ban k sars ih so me vagu ida of ha consius a good credi risk o, as cusomrs x 1 , x2 , , X
arriv, h baThnkena ,pp lis hos is vagu ap prov cardsiord r som cusomers o nly hoidgoao crdi cards crdi ar mon o s ofifhs hy dfau l or no For simpliciy, suppos ha h rs N cusomrs r givn credi cards No ha h bank knos h bhavior of hs cusomrs, i coms o you o im prov hir a lgor ihm r a ppro ving cr di Th ban k givs you h daa
(x 1 , y1 ) , ' (x , Y )
Ber yo u look a he daa , you d o mahmaical d riva ion s an d com up ih a crdi approval funcion You no s i on h daa and, o your dligh, obain prec prdicion
( a ) Wha is M h siz of your hypohsis s ( b ) Wih such an M ha dos h Hof di ng bou nd say a bou h probabi l iy ha h ru prrmanc is ors h an % rror fr N 10000 giv your o h bank assur hm ha h prfrmanc ill ( c ) You b br han g % rror andand your condnc is givn by your ansr o par ( b ) Th bank is hrilld and uss your g o approv crdi r n clins To hir dismay, mor han half hir crdi cards are bing dau ld on Epl ai n h pos sibl ras on ( s) bhind his oucom
( d ) Is hr a ay in hi ch h ban k could u s your crdi a ppro val funcio n o hav y ou r probab il isic guar an Ho [Hint he nswer is yes!}
179
5..
5. HR ARNIN PRINCILS Pro blem 5 4
PROBLMS
500
The S& 5 is a set of the l argest compa ies cu rretly tradi g Su ppose there are stocks cu rretly tradig, ad th ere have bee stock s which h ave ever tr aded ov er the last years some of these have goe ba krupt ad sto ppe d tradig We wish to eval u ate the pro tab il ity of various buy a d ho ld ' stra tegies usig the se years of d ata roughly tradi g day s
50 000
10 000
50
50
1 500
Sice it is ot easy to get stock data, e will coe our aalysis to todays S& 5 stocks, r which the data is readily available
a A stock is pro ta bl e if it we t u p o more th a 50% of the d ays Of your S & stock s, the most pr otabl e wet up o 5% of t h ed ays En = 0. i Sice we picked the best amog 500 usig the Hoef di g boud, [IEin - Eout l
>
002] : 2
x
500 x
5 e- 2 x 1 2 oo x o . o 2
�
0.045.
There i s a greater tha 95% cha ce th is stock is pro table Wher e did e go wrog? ii Gi ve a better estim ate fr the probab il ity that th is stock is pro tabl e
[Hint: What should the correct M be in the Hoefding bound?
b We ish to evaluate the protability of buy ad hold' r geeral stock
tradi g We ot ice t hat a ll of ou r 5 S& stocks wet up o a t le ast
of the days i We cocl ude that bu yig a d hol di g a sto cks is a go od strat egy r geeral stock tradig Where did we go rog? ii Ca we say anything a bout the per rma ce o f buy a d hold tradig?
51%
You thik that the stock market exhibits reversal, so if Problem 5.5 the price of a stock sharp ly d rops you expe ct it to rise shortly the reafte r If it sha rpl y rises, you expect i t to drop shortly thereaf ter To test thi s hypothesis, you bu il d a t radi g strategy tha t bu ys whe t he sto cks go dow a d sel ls i the op posite case You coll ect hi storica l data o the cu rret S& 5 stocks, ad your hypothesis gave a good aual retur of
1%
a Whe you trade usig this system, do you expect it to perfrm at this level? Why or why ot? b How ca you test you r stra tegy so th at its perrma ce i sam pl e is more relective of what you should expect i reality?
Prob lem 5. 6
Oe of te he ars Extrapolatio is harder tha iterpolatio " Give a possible explaatio fr this pheomeo usig the priciples i this chapter [Hint training distribution ersus testing distribution}
180
Epilogue hs bo ok set th e stage a deepe exploat on nto Lean ng om ata by developng the undatons It is possble to lean om data, and you have all the basc too ls to do so he lnea model coupled wth the ght atues and an appopate nonlnea tansfm, togethe wth the ght amount of egulazaton, petty much puts you nto the thck of the game, and you wll be n good stead as long as you keep n mnd the thee basc pncples smple s bette Occams azo , avod data snoopng and bewae of samplng bas Whee to go om he e? hee ae two man decto ns One s to lea n moe sophstcated leanng technques, and the othe s to exploe deent
leanng paadgms Let us map pevew these twoom dectons bette undestandng of the of leanng data to gve the ead e a he lnea model can be used as a buldng block othe popula tech nques A cascade of lnea models, mostly wth sof thesholds, ceates a neual netwok A obust algothm lnea models , based on quadatc pogammng, ceates suppot vecto m achnes An ecent appoach to non lnea tansmaton n suppot vecto machnes ceates kenel metho ds A combnaton of deent models n a pncpled way ceates boostng and en semble leanng hee ae othe successl models an d technques, and moe to come sue In tems o f othe paadgms, we have bey mentone d unsupevsed lea n ng and encement leanng hee s a wealth of technques these lean ng paadgms, ncludng methods that mx labeled and unlabeled data Actve leanng and onlne leanng, whch we also mentoned bey, have the own technques theoes pobablstc In addton, th ee s a school thought appoach, that teats leanng as and a completely paadgm usng aofBayesan and thee ae usel pobablstc technques such as Gaussan pocesse s Last but not least, thee s a school that teats leanng as a banch of the theoy of computatonal complexty, wth emphass on asymptotc esults Of couse, the ultmate test of any engneeng dscplne s ts mpact n eal l hee s no shotage of success ful applcatons of leanng om data Some of the applcaton domans have specalzed technques that ae woth explong , eg , computatonal nance and ecommende systems Leanng om data s a vey dynamc eld Some of the hot technques and theoes at tmes become just ads, and othes gan tacton and become 181
IOGUE
pat of the eld What we hae emphaszed n ths book ae the necessay ndamentals that ge any student of leanng om data a sold undaton, and enable hm o he to entue out and exploe uthe technques and theoes, o pehaps to contbute the own.
82
Further Reading
Learning om Data book rum at AMLBookcom Y S Abu-Mostafa The Vapnik-Chervonenkis dimension: Inrmation versus complexity in learning eual Compuaion, 1 3 :31 2 317 19 89
X
Y S Abu-Mostafa Song A icholson and M MagdonIsmail The bin model Technical Report CaltechCS TR: 2004 0 02 Calirnia Institute of Technology 2004
kham s Razo A isoial and Philosophial An alsis of k ham 's Piniple of Pasimon. University of Illinois Press 197
R Ariew
R Bell J . Bennet t Y oren and C Volinsky The million dollar program ming prize I Speum, 4 5 :2 9 33 20 09
A Blumer A Ehrenfucht D Haussler and M Warmuth Occam's razor fomaion oessing Lees, 24 :37 7 380 198 7
A Blumer A hrenucht D Haussler and M Warmuth Learnability and the Vapnik-Chervonenkis dimension Jounal of he Assoiaion fo Compuing Mahine, 3 4 :9 29 9 5 19 89
S Boyd and L Vandenberghe Press 2004
Cone pimizaion.
Cambridge University
P Burman A comparative study of ordinary cross-validation -fld cross validation an the repeated learning-testing methods Biomeika, 7 3 : 503 514 1 98 9
T M Cover Geometrical and statis tical properties of systems of linear in equalities with applications in pattern recognition I Tansaions on leoni Compues, 14 3 :32 33 4 19 5
M H DeGroot and M J. Schervish Wesley urt edition 2011 183
Pobabili and Saisis.
Addison
URTHER EADIG
V. Fabian. Stochastic approximation methods.
Jounal, 10(1):123
159, 1960
W. Feer. An Induion third edition 1968.
Czehosloak Mahemaial
o P babi li Theo and Is Appliaions
A. ank and A. Asuncion.
UCI machine earning repository
Wiey
2010
UL
h//chcc/
0 /1 oss and the curse-ofdimensionaity. Daa Mining and Knoledge Disoe , 1 (1 )5 5 77, 19 97
J. H . iedman. On bias v ariance
S. I . Gaant. Perceptron -based earning agorithms.
eual eoks, 1(2):179 191, 1990
I ansaions on
Z Ghahramani. Unsupervised earning. In Adaned Leues in Mahine
Leaning MLSS
, pages 72 112, 2004.
G. H. Goub and C. F. van Loan. versity Press 1996
Mai ompuaions
Johns Hopkins Uni
D. Ameian C . Hoagin an d R. E 32:17 Wesch. Saisiian, 1978.hat matrix in regression and AOVA. 22,The W. Hoeding. Probabiity inequaitie s fr sums of bounde d random variabes.
Jounal of he Ame ian Saisial Assoiai on, 58( 30 1):13 30 , 19 63.
R. C. Hote. Very simpe cassication rues perfrm we on most commony used datasets. Mahine Leaning, 11( 1) :63 91 , 1993 R. A . Horn and C. . Johnson.
1990
Mai Analsis
Cambridge University Press
L. P. aebing M . L . Littman and A. W. Moore . Reinfrcement earning: A survey. Jounal of Aiial Inelligene Reseah, 4: 23 7 285, 1996 A. I. huri. Interscience
Adaned alulus ih appliaions in saisis 2003.
Wiey
R. ohavi. A study of cross-vaidation and bootstrap r accuracy estimation and mode seecti on. In Poeedings of he h Inenaional Join Con feene on Aiial inelligene ICAI voume 2, pages 11 37 11 43 ,
1995
J. Langford. Ttoria on practi ca prediction theory f r cassication.
of Mahine Leaning Reseah, 6:273 306, 2005 184
Jounal
URTHER EAI
i and H- T in Optimizing 0 /1 loss r perceptr ons by random coordinat e International oint Conference on eual descent In Proceedings of the etorks C pages 749 754, 2007
7
7
HT in and i Support vector m achinery r innite ens emble learning ournal of achine Learning Research, 9(2) 285 31 2, 200 8 M Magdon-Ismail and Mertsalov A permutation approach to validation Statistical Analysis and Data ining, 3(6)361380, 2010 M Magdon-Ismail, A icholson, and Y S Abu-Mosta earning in the presence of noise In S Hayki n and B osko, editors , Intelligent Signal Processing IEEE Press, 2001 M Markatou, H Tian, S Biswas, and G Hripcsak Analysis of variance of crossvalidation estimators of the generalization error ournal o f Machine Learning Re search, 61127 116 8, 2005 M Minsky and S Papert Perceptrons An Introduction Geometry. MIT Press, expanded edition, 1988
to Computational
T Poggio and S Smale The mathematics of learning: ealing ith data otices of the American athematical Society, 50(5)537 544, 2003 Popper
The logic of scientic discovery
Routledge, 2002
F Rosenblatt The perceptron: A probabilistic model r inrmation storage and organization in the brain Psychological Revie, 65(6)38 6 408, 195 8
Principles of eurodynamics Perceptrons and the Theory of Brain echanisms. Spartan, 1962
F Rosenblatt
B Settles Active learning literature survey Technical Report 1648 , University of WisconsinMadison, 201 0 J ShaweTaylor, P B artlett, R C William son, and M Anthony A ame work or structural risk minimisation In Leaing Theory th Annual Conference on Learning Theory CLT ', pages 68 76, 19 96 G Valiant A theory of th e learnable (11) 113 4 114 2, 19 84
Communications of the ACM,
27
V Vapnik and A Y Chervonenkis On the unirm c onvergence of relative equencies of events to their probabilities Theory of Pbability and Its Applications, 162 64 280, 19 71 185
V Vapnik E. Levi n and Y L un Measuring the V-dimension of a learning machine eul Compuaion, 6() :8 1 876, 1994. G Yuan H Ho and -J Lin Recent advances of large-scale linear classication Pceeings of I, 2012. T hang Solving large scale linear prediction problems usin g sto chastic gra dient descent algorithms In achine Leaning Pceeings of he h Inenaional Confeence ICL , pages 919 926 2004.
186
Appix Proof of te VC Bod In this Appendix we pre sent the rm l proof of Theorem 2 5 It is firly elborte proof nd you my skip it ltogether nd just tke the theorem r grnted but you won't know wht you re missing !
1971)
Theorem A. 1 (Vpnik Chervonenkis J
[ IEin (h) sup
Eout (h) I
l
> E
:
4mH (2N) e- i E .
This inequlity is clled the VC Inequlity nd it implies the VC bound of Theorem 25 The inequlity is vlid r ny trget function (deterministic or probbil istic) nd ny input distribution The probbility is over dt sets of size N Ech dt set is generted iid (independent nd identiclly distributed) with ech dt point generted independently ccording to the joint distribution P(x, The event sup E() E() > is equiv lent to the union over ll of the events E ( ) E( ) > ; this union contins the event tht involves in Theorem 2 5 The use of the supremum ( technicl version of the mximum) i s necessry since cn hve continuum of hypotheses The min chllenge to proving this theorem is tht E( ) is dicult to
mnipulte depends on the entire E()setbecuse E() spce rthercompred thn just to nite of points The min insight needed to input over come this diculty is the observtion tht we cn get rid of E( ) ltogether becuse the devitions between E nd E cn be essentilly cptured by devitions between two in-smple errors: E (the srcinl in-smple error) nd the in-smple error on scond independent dt set (Lemm A 2) We hve seen this ide mny times bere when we use test or vlidtion set to estimte E. This insight results in two min simplictions:
1.
The supremum of the devitions over innitely mny cn be reduced to considering only the dichotomies implementble by on the
187
A PPENDIX two nepenent ata sets Tat s were te growt fun ton enters te pture (Lemma A3)
H N )
Te devaton between two ndndnt nsample errors s easy' to an alyze ompare to te evaton between E and E (Lemma A 4) Te ombnaton of Lemmas A A3 an A4 pro ves Teorem A
A.1
Relating Generalization Error to In-Sale Deviations
Let' s ntrod ue a seon ata set ' w s nepenent of ' bu t sampled aordng to te same strbuton Px Ts seond data set s alle a ghost ata set beause t oesn't really exst t s a just a tool used n te analyss We ope to boun te term [E E s large by anoter term [E E s large w s easer to anal yze Te ntuton ben te rmal proo f s as llows For any sngle ypot ess beause ' s es sample nepenently om Px, te Hoedng nequalty guarantees tat E ) � E) wt a g probablty Tat s wen E) E) s large wt a g probablty E) E )
s also large s large an be approxmately E) bounde e by [Terere E) E{[E))s larg We are tryng to bound te probabl ty tat E s ar om E Let E{ ) be te nsample error r ypotess on ' . Suppose tat E s ar om E wt some probablty (and smlarly E{ s far om E wt tat same prob ablty sne E an E are entally dstrbute) Wen s large te proba blty s rougly Gaussan around E as llustrate n te gure to te rgt Te red regon represents te ases wen E s ar om E n tose ases E{ s r om E about alf te tme as llustrated by te green regon Tat s [ E E s large] an be approxmately boune by [E E{ s large Ts argument proves some ntuton tat te evatons between E and E an be apture by te evatons between E and E Te argu ment an be arelly extene to multple ypoteses
N
Lemma
were te probablty on te RHS s over
188
' an ' jontly
ENDIX
[
Proof. We ca aume that up J E ) E ) J there othg to prove
[
sup J Ein (h) hE
[[ [
hE up J E )
hE
E{n (h) J
>
�
1
E )J � and hE up JE ) E)J (.1)
I
up J E )
E ) J
up J E )
E )J � up JE) E)J
hE hE
hE
[
hE
I
[ and fr ay two evet
Ieualty (1) llow becaue [] , Now, let conder the lat term: up J E )
otherwe
E )J � up JE ) hE
E ) J
The evet on whch we are condtong a et of data et wth on-zero probablty Fx a data et V n th event Let * b e any hypothe r whch JE*) E*)J One uch hypothe mut ext gven tha t V the event o whch we are condtong The hypothe * doe not deped on V but t doe depend on V
[ I[ I[
I
up JE) E )J � up JE )
hE
I S I
E * )
E * ) J �
E * )
E * ) J
1
N
hE
�
E ) J
E ) E)
E) E)J
1 Ieualty (2 fllow bec aue the evet " JE * ) E{ *)J
)
(3) (4)
E{ )J Ieualty 3 fllow becaue th e event " JE{ *) E*)J ad " JE * ) E*)J " whch gven) mply " JE ) E{ )J " Ieualty 4 fllow becaue * xed wt h repect t o V and o we mple " hE up JE )
H
can apply the Hoedg Ineualty to [JE{ *) E*)J Notce that the Hoedg Ieualty apple to [JE{ *) E *)J � ] r any * , a long a * xed wth repect to V Thereore, t alo apple
189
ENDIX
to any wightd avrag o [En (h) Eout(h) ] ad on h Finally, inc h dpnd on a particular V, w tak th wightd avrag ovr allV in th vnt up Ein(h) Eout(h) E" hE H on which w ar conditi oning, whr th wight com om th proaility o th particular V. Sinc th ound hold or vry V in thi vnt, it hold r th wight d avrag
e-E 2 N ,
ot that w can aum cau othrwi th ound in < ' o th lmma Thorm .1 i trivially tru In thi ca, 1 impli
J
A.2
[
sup
hE H
IEin (h) - Eout (h) I
l
> E
:
2J
e-E 2 N
[
sup
hE H
IEin (h) - E{n (h) I
>
i]
·
Boudig Worst Case Deviatio Usig the Growth Fuctio
ow that w hav rlatd th gnralization rror to th dviation twn in-ampl rror, w can actually work with rtrictd to two data t o iz ach, rathr than th innit Spcicall y, w want to ound
[
sup
hE H
I Ein(h ) - E{n (h) I
>
i] ,
whr th proaility i ovr th joint ditriution o th data t V and V. On quivalnt way o ampling two data t V and V i to rt ampl a data t S o iz thn randomly partition S into V and V. Thi amount to randomly ampling, ithout rplacmnt xampl om S r V, laving th rmaining r V. Givn th joint data t S, lt
th proaility dviation twn th two th proaility i taknoovr th random partition o in-ampl S into Vrror, and Vwhr By th law o total proaility (with dnoting um or intgral a th ca may ) ,
[
sup
hE H
IEin(h) - E{n (h) I
L [S] x
S
�
< p
[�
[
up Ein(h) hE H
En(h) 190
>
i] En (h)
Il
E (h) � S
i Is]
ENDX
S
S
S
Let be the dichotomies that can implement on the points in By denition of the growth unction, cannot have more than H (2N di chotomies Suppose it has M : mH (2N) dichotomies, realized by h1 , hM Thus, sup
hE H
IEin (h) - E[n (h) I
Then,
J
r� [
M
=
/Ein (h) - E[n (h) I .
sup
hE{h,,h}
En ( ) I
Ein(h) sup
hE{h 1 , ,h}
Il
> s
I]
IEin(h) - E[n (h) I > � s
< : J IEin (hm ) - E[n (hm ) / > � I SJ m < M sup J IEin (h) - E[n (h) I > � j SJ , X
hEH
(A5) (A6)
where we use th e union bound in (A 5) , and overestimate each term by the supremum over all possible hypotheses to get ( A6 ) After using M : H (2N) and taking the sup operation over
Lemma A.
J
[
sup
I Ein (h) - E[1 (h) I > �
hE H m H (2N)
<
S we have proved
X
sup sup
S
hEH
1
J IEin(h ) - E[n (h)/ > � I SJ ,
where the pro bability on the LHS is over D and D ointly, and the prob ability on the RHS is over random partitions of into two sets D and D
S
The main achievement of Lemma A3 is that we have pulled the supre mum over h outside the probability, at the expense of the extra actor of H ( 2N
A.3
Bunding he Deviin beween In-Smple Errrs
We now address the purely combinatorial problem of bounding sup sup
S
hE H
J IEin(h)
E{1 (h) I > � j SJ ,
which appears in Lemma A3 We will prove the llowing lemma Then, Theorem A can be proved by combining Lemmas A2 A3 and A4 taking 2e - � (the only case we need to consider)
9
ENDX
Lea A .4 Fr
any h and any
S
whr th prbability is vr randm partitins f
S int tw sts
' an ' .
Poo T prv th rsult w will us a rsult which is als du t Hding r sampling ithout replacement: Lea A. 5 Hding 1963) Lt = {a, , a } b a st f valus with a [O, 1 ] , and lt µ = a b thir man Lt ' = {, , Z } b a sampl f siz N, sampld m unirmly ithout rplacmnt Thn
S
W apply Lmma A5 llws Fr th 2N xampls in lt a 1 if h(x) Y and a = 0 thrwis Th {a} ar th rrrs mad by h n w randmly partitin int ' and ' i sampl N xampls m withut rplacmnt t gt , laving th rmaining N xampls r '. This rsults in a sampl f siz N f th {a} r ' sampld unirmly withut rplacmnt t that
S
E() =
EV
a,
and E{ () =
Sin c w ar sampling withut rplacmnt s
Substituting
EV
a
S = ' ' and ' ' 0 and
1 E() + E{ () 2N 2 E µ > t { E E{ 2t By Lmma A5 µ
It llws that
S S
t=
givs th rsult
192
· {···} |·| · 2 ⌊·⌋ [a, b] · ∇
a
w
b
∇E
E (w)
(·)−1 (·)† (·)
N k
k
N
N! (N −k)!k!
A\B
A
B
0
{1} × Rd
d
ǫ δ
ǫ
η λ λC C Ω θ Φ Φ
θ(s) = e s /(1 + es ) z = Φ(x) Q
φ µ ν σ2 A
Φ zi = φi (x)
a (·)
a
B b w0 B(N, k)
N k
C d d˜ d d (H) D
D
X = Rd Z H D = (x1 , y1 ), · · · , (xN , yN ) (xn , yn ) D
X = { 1} × Rd
D
D E (h, f ) ex (h(x), f (x))
D x E (h, f )
n
h e = 2.71828 · · · (h(x) − f (x))2 n n
E[·] Ex [·]
x
E[y |x]
y
E E E (h) E E E (h) ED E¯ E E f g
x
h h
D
f: X → Y g∈H g: X → Y
g (D ) g¯
D
f
D
g g = ∇E
g
h ∈ H h: X → Y
h ˜ h H HΦ
Z Φ
H(C )
C
H(x1 ,..., xN )
±1
H
x1 , · · · , xN
H I 1 K Lq ln log2 M mH (N )
0
q e 2
H
N
max(·, ·) N o(·)
D
O(·) P (x) P (y | x ) P (x, y) P[·] Q Qf
x
y y
x
f
x
f
R Rd
s
d s = w x =
wx
i i
i
x
(·) supa (.) T t tanh(·) (·) V v
i
d
1
d
x0 = 1
−1
+1
≥
a
tanh(s) = (es −e−s)/(es +e−s ) V
V ×K =N
ˆ v
v
w ˜ w ˆ w w∗ w w w
Z
w0
b
w x ∈ Rd
x ∈ X
x
x ∈
{1} × Rd x x0
x
X X XOR
x0 = 1 x∈X xn
y∈Y
y y
yn y ˆ
Y Z Z
y
y∈Y z = Φ(x) zn = Φ(xn )
H
f B (N, k )
V λ N, d
L1
L2
λ
Ω
E
λ λ
λ
m H (N )
tanh
λ λ
Z
d