First Binary Classification Model
Data_Final Project.xlsx

You work for a bank as a business data analyst in the credit card risk-modeling department. Your bank conducted a bold experiment three years ago: for a single day it quietly issued credit cards to everyone who applied, regardless of their credit risk, until the bank had issued 600 cards without screening applicants. After three years, 150, or 25%, of those card recipients defaulted: they failed to pay back at least some of the money they owed. However, the bank collected very valuable proprietary data that it can now use to optimize its future card-issuing process. The bank initially collected six pieces of data about each person:
• Age
• Years at current employer
• Years at current address
• Income over the past year
• Current credit card debt, and
• Current automobile debt
In addition, the bank now has a binary outcome: default = 1, and no default = 0. Your first assignment is to analyze the data and create a binary classification model to forecast future defaults. You will combine data from the above six inputs to output a single "score." Use the Soldier Performance spreadsheet for a simple example of combining multiple inputs.
Forecasting Soldier Performance.xlsx

The relative rank-ordering of scores will determine the model's effectiveness. For convenience - in particular, so that you can use the AUC Calculator Spreadsheet - you are asked to use a scale for your score that has a maximum below 3.5 and a minimum above -3.5. At first you are not told your bank's own best estimate of its cost per False Negative (accepted applicant who becomes a defaulting customer) and per False Positive (rejected customer who would not have defaulted) classification. Therefore, the best you can do is to design your model to maximize the Area Under the ROC Curve … also provided for your convenience.

You may combine the six inputs by adding them to, or subtracting them from, each other, taking simple ratios, etc. Exclude inputs that are not helpful and then experiment with how to combine the most informative inputs. Note that you will need some of your quiz answers again later, so please write them down and keep track of them as you go along.

Question: What is your model? Give it as a function of two or more of the six inputs. For example: (Age … Years at Current Address) … Income (not a great model!). Your model should have at least two inputs.
1 r
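As an illustration of combining inputs into a single score, here is a minimal Python sketch. Everything in it is an assumption made for the example: the applicant numbers are invented, the debt-to-income ratio is just one possible combination, and the helper names `standardize` and `rescale` are hypothetical. Standardizing and then shrinking by a positive factor keeps the rank order of the scores, which is what the model's effectiveness depends on, while keeping every score inside the requested range.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to z-scores (mean 0, standard deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def rescale(scores, limit=3.4):
    """Shrink scores by a positive factor so every value lies within [-limit, +limit]."""
    largest = max(abs(s) for s in scores)
    if largest <= limit:
        return list(scores)
    return [s * limit / largest for s in scores]

# Hypothetical applicants (made-up numbers, not the bank's data).
income = [30.0, 45.0, 60.0, 120.0, 25.0]   # thousands of dollars
card_debt = [6.0, 2.0, 1.5, 4.8, 7.5]      # thousands of dollars

# Illustrative model only: credit card debt divided by income.
ratio = [d / i for d, i in zip(card_debt, income)]
scores = rescale(standardize(ratio))
```

Because both steps are increasing linear transformations, any AUC computed on `scores` is the same as the AUC computed on the raw `ratio` values.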
What is your model's AUC on the Training Set? Use two digits to the right of the decimal place.
12 x 6 x .7 r
(Less than .5 is not correct - you need to make the highest value the lowest by dividing by -1. .5 has no predictive value. .9 or higher is too good to be true!)
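For readers who want to check an AUC outside the spreadsheet, the following Python sketch uses the standard rank-based definition: the share of (defaulter, non-defaulter) pairs in which the defaulter has the higher score, with ties counted as one half. This is not necessarily how the AUC Calculator Spreadsheet computes it, and the scores and outcomes are made up. It also shows why multiplying or dividing the score by -1 turns an AUC below .5 into one above .5.

```python
def auc(scores, outcomes):
    """Fraction of (defaulter, non-defaulter) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores and outcomes (1 = default, 0 = no default).
scores = [2.1, 0.4, -1.3, 1.7, -0.2, 0.9]
outcomes = [1, 0, 0, 0, 1, 1]

auc_original = auc(scores, outcomes)
# Reversing the rank order (score * -1) gives 1 minus the original AUC when there are no ties.
auc_flipped = auc([-s for s in scores], outcomes)
```

With these numbers, 6 of the 9 pairs are ranked correctly, so `auc_original` is about 0.67 and `auc_flipped` is about 0.33.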
Initial Assessment for Over-fitting (testing your model on new data)
Next, test your model, without changing any parameters, on the Test Set of 200 additional applicants. See the Test Set spreadsheet. It is part of the Data_For_Final_Project (below), which has both the training and test sets.
Data_Final Project.xlsx

Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can compare Test Set and Training Set results easily.
AUC_Calculator and Review of AUC Curve.xlsx

What is your model's new AUC on the Test Set? Give two digits to the right of the decimal place.
12 x 36 x .3 x .8 r
(Less than .5 is not valid - multiply by -1. .5 means no predictive value. Greater than .90 is too good to be true!)
Finding the Cost-Minimizing Threshold for your Model
Now that you have, hopefully, developed your model to the point where it is relatively "robust" across the training set and test set, your boss at the bank finally gives you its current rough estimate of the bank's average costs for each type of classification error. Note that all bank models here include only profits and losses within three years of when a card is issued, so the impact of out-years (years beyond 3) can be ignored.
Cost Per False Negative: $5000
Cost Per False Positive: $2500
For the 600 individuals that were automatically given cards without being classified, the total cost of the experiment turned out to be 25%*($5000)*600 or $750,000. This is $1,250 per event.

Question: What is the threshold score on the Training Set data for your model that minimizes Cost per Event? You will need this number to answer later questions.
Hint: Using the AUC Calculator Spreadsheet, identify which Column displays the same cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell 2. The threshold is shown in row 10 of that Column. What the threshold means is that at and above this number everything is classified as a "default."
20 x 1000 x 3.5 r
(Thresholds greater than 2.5 - may not be utilizing the full range for analysis. Thresholds less than -2.5 - may not be utilizing the full range for analysis.)

Finding the Minimum Cost Per Event
Question: Again referring only to the Training Set data, what is the overall minimum cost-per-event?
Hint: You will need this number to answer later questions. If you used the AUC Calculator, the overall minimum cost per event will be displayed in Cell 2.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer - no decimals or dollar sign. For Example - enter $800.00 as "800"
600 r
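The threshold search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the spreadsheet's implementation: the eight applicants are made up (two defaults, to match the 25% default rate), the candidate thresholds are an arbitrary grid in steps of 0.5, and the function names are hypothetical. Only the two cost figures and the "at and above the threshold means default" rule come from the text.

```python
COST_FN = 5000  # cost per False Negative, from the text
COST_FP = 2500  # cost per False Positive, from the text

def cost_per_event(scores, outcomes, threshold):
    """Average cost when every score at or above the threshold is classified as 'default'."""
    total = 0
    for s, y in zip(scores, outcomes):
        predicted_default = s >= threshold
        if y == 1 and not predicted_default:
            total += COST_FN   # card issued to someone who defaults
        elif y == 0 and predicted_default:
            total += COST_FP   # card refused to someone who would have paid
    return total / len(scores)

def best_threshold(scores, outcomes, candidates):
    """Return (threshold, cost) with the lowest cost per event among the candidates."""
    pairs = [(t, cost_per_event(scores, outcomes, t)) for t in candidates]
    return min(pairs, key=lambda pair: pair[1])

# Hypothetical training data: 8 applicants, 2 defaults (25%).
scores = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]
outcomes = [1, 0, 1, 0, 0, 0, 0, 0]
candidates = [x / 2 for x in range(-6, 7)]   # -3.0, -2.5, ..., 3.0

no_model_cost = cost_per_event(scores, outcomes, 3.0)   # nobody is classified as a default
threshold, min_cost = best_threshold(scores, outcomes, candidates)
```

With a threshold above every score, all applicants receive cards and the cost is 2 x $5000 / 8 = $1,250 per event, matching the no-model figure in the text. On this toy data the minimum is $312.50 per event at a threshold of 1.0.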
Comparing the New Minimum Cost Per Event on Test Set Data
When you compared AUC for the Training and Test Sets, all that was necessary was to look up the two different values in Cell G8. But to get an accurate measure of the cost-savings using the original model on new data, you cannot automatically use the new threshold that results in the overall lowest cost-per-event on the Test Set.
Remember that your model is being tested for its ability to forecast - but the new optimal threshold will be known only after the outcomes for the entire Test Set are known. All you can use is the model you developed on the Training Set data and the threshold from the Training Set that you should have recorded when answering Question 4.

Question: At that same threshold score (NOT the threshold score that would minimize costs for the new Test Set, but the "old" threshold score that minimized costs on the Training Set), what is the cost per event on the test set?
Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column on the Training Set data that has the lowest cost-per-event. That same column and threshold in the Test Set copy of the AUC Calculator will have a new cost-per-event, displayed in row 17. This is almost always higher than the minimum cost-per-event on the Training Set, and also higher than what the minimal cost-per-event would be on the Test Set, if one could know the new optimal threshold in advance. This number is the actual cost per event when applying the model-and-threshold developed with the Training Set to the new Test Set data.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer - no decimals or dollar sign. For Example - enter $800.00 as "800"
200 x 1 x 150 x 700.00 r
(If you find that your costs per event on the test set are much higher than your costs per event on the training set, consider making your model simpler - probably using fewer input variables - as it is probably still over-fitting the training set data. Problems with over-fitting that were not obvious at the R…
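The difference between the "old" Training Set threshold and a threshold chosen with hindsight can be illustrated with a short Python sketch. All data here are invented, the candidate grid is an arbitrary assumption, and `cost_per_event` is a hypothetical helper. Only the $5000 and $2500 costs come from the text.

```python
def cost_per_event(scores, outcomes, threshold, cost_fn=5000, cost_fp=2500):
    """Average cost when scores at or above the threshold are classified as 'default'."""
    total = 0
    for s, y in zip(scores, outcomes):
        flagged = s >= threshold
        if y == 1 and not flagged:
            total += cost_fn
        elif y == 0 and flagged:
            total += cost_fp
    return total / len(scores)

# Hypothetical data (made-up numbers): 1 = default, 0 = no default.
train_scores = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]
train_outcomes = [1, 0, 1, 0, 0, 0, 0, 0]
test_scores = [1.8, 1.2, 0.8, 0.6, 0.1, -0.4, -1.1, -1.9]
test_outcomes = [1, 0, 1, 0, 0, 0, 0, 0]
candidates = [x / 2 for x in range(-6, 7)]

# Threshold chosen on the Training Set only.
train_threshold = min(candidates, key=lambda t: cost_per_event(train_scores, train_outcomes, t))

# Honest figure: Training Set threshold applied unchanged to the Test Set.
honest_cost = cost_per_event(test_scores, test_outcomes, train_threshold)

# Hindsight figure: best Test Set threshold, knowable only after outcomes are in.
hindsight_cost = min(cost_per_event(test_scores, test_outcomes, t) for t in candidates)
```

On this toy data the Training Set threshold is 1.0. Applied to the Test Set it costs $937.50 per event, compared with $625 for the hindsight threshold and $312.50 for the Training Set minimum, which is the pattern the hint describes.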
Putting a Dollar Value on Your Model Plus the Data
Assume your Test Set cost-per-event results from Question 6 are sustainable long term.

Question: How much money does the bank save, per event, using your model and its data-inputs, instead of issuing credit cards to everyone who asks?
Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been determined to be 25%*$5000 = $1,250 per event. The dollar value of the model-plus-data is the difference between $1,250 and your number.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer - no decimals or dollar sign. For Example - enter $800.00 as "800"
100 x 200 r
(Savings of $150 or less is a weak model.
Savings of more than $150, up to $250, is an ok model.
Savings of more than $250, up to $450, is a very good model.
Savings of more than $450 is an excellent model.)

Payback Period for Your Model
Question: Given that it apparently cost the bank $750,000 to conduct the three-year experiment, if the bank processes 1000 credit card applicants per day on average, how many days will it take to ensure future savings will pay back the bank's initial investment? Give the number rounded to the nearest day (integer value).
Hint: multiply your answer to Question 7 - the cost savings per applicant - by 1000 to get the savings per day.
700000 x 3 r
(More than a week - poor
4-7 days - very good
2-3 days - excellent
1 day - too good to be true!)
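The savings and payback arithmetic can be written out as a short Python sketch. The constants come from the text. The example cost per event of $1,000 is a made-up input, and the function names are hypothetical.

```python
import math

COST_PER_FALSE_NEGATIVE = 5000   # from the text
DEFAULT_RATE = 0.25              # from the text
EXPERIMENT_COST = 750_000        # from the text
APPLICANTS_PER_DAY = 1000        # from the text

# Cost of issuing cards to everyone: 25% * $5000 = $1,250 per event.
BASELINE_COST_PER_EVENT = DEFAULT_RATE * COST_PER_FALSE_NEGATIVE

def savings_per_event(model_cost_per_event):
    """Dollar value of the model-plus-data, per applicant."""
    return BASELINE_COST_PER_EVENT - model_cost_per_event

def payback_days(model_cost_per_event):
    """Days of savings needed to recover the experiment cost, rounded to the nearest day."""
    savings = savings_per_event(model_cost_per_event)
    if savings <= 0:
        raise ValueError("the model must save money for a payback period to exist")
    exact_days = EXPERIMENT_COST / (savings * APPLICANTS_PER_DAY)
    return math.floor(exact_days + 0.5)   # round half up

# Hypothetical Test Set cost per event, for illustration only.
example_cost = 1000
example_savings = savings_per_event(example_cost)
example_payback = payback_days(example_cost)
```

With the assumed $1,000 cost per event, the savings are $250 per applicant, or $250,000 per day, so the $750,000 experiment is paid back in 3 days.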
Any model that is reducing uncertainty will have a True Positive Rate...
...Equal to the Test Incidence (% of outcomes classified as "default") x
...Less than the Test Incidence (% of outcomes classified as "default") x
...Greater than the Test Incidence (% of outcomes classified as "default")

Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Positive Predictive Value (PPV)...
...Equal to .25 x
...Less than .25 x
...Greater than .25

Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Negative Predictive Value (NPV)...
...Equal to .75 x
...Less than .75 x
...Greater than .75
Confusion Matrix Metrics.
To determine all performance metrics for a binary classification, it is sufficient to have three values:
• The Condition Incidence (here the default rate of 25%)
• The probability of True Positives (the True Positive rate multiplied by the Condition Incidence)
• The "Test Incidence" (also called "classification incidence" - the sum of the probability of True Positives and False Positives)
These three values can all be obtained from the AUC Calculator Spreadsheet and then used as inputs to the Information Gain Calculator Spreadsheet to determine all other performance metrics.
AUC_Calculator and Review of AUC Curve.xlsx
Information Gain Calculator.xlsx

Question: What is your model's True Positive Rate? Save this answer as it will be needed again for Part 3 (Quiz 3)
1 x 30 x .30 r
(.25 or less is incorrect)

Question: What is your model's "test incidence"? Save this answer as it will be needed again for Part 3 (Quiz 3)
0 x 1 x 1000 x 200.00 x
Test Incidences cannot be so small that they force a high false negative rate, nor so large that they force a high false positive rate. A perfect test will of course have a Test Incidence equal to the Condition Incidence - but most classification systems are focused on avoiding false negatives and have a higher Test Incidence than Condition Incidence.
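To show why the three values are sufficient, the sketch below derives the rest of the confusion matrix from them in Python. It follows from the definitions given above and is not a description of the Information Gain Calculator Spreadsheet. The function name `confusion_metrics` and the example inputs (other than the 25% default rate) are assumptions made for illustration.

```python
def confusion_metrics(condition_incidence, p_true_positive, test_incidence):
    """Derive the remaining confusion-matrix quantities from the three given values."""
    p_fp = test_incidence - p_true_positive        # classified "default" but did not default
    p_fn = condition_incidence - p_true_positive   # defaulted but not classified "default"
    p_tn = 1.0 - p_true_positive - p_fp - p_fn     # the four cells must sum to 1
    return {
        "true_positive_rate": p_true_positive / condition_incidence,
        "false_positive_rate": p_fp / (1.0 - condition_incidence),
        "ppv": p_true_positive / test_incidence,
        "npv": p_tn / (1.0 - test_incidence),
    }

# Hypothetical model: 25% default rate (from the text), P(True Positive) = 0.15,
# Test Incidence = 0.30 (both made up for the example).
metrics = confusion_metrics(0.25, 0.15, 0.30)
```

In this made-up example the True Positive Rate is 0.60 (above the Test Incidence of 0.30), the PPV is 0.50 (above the base rate of .25), and the NPV is about 0.86 (above .75).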