Predictive Modeling with IBM SPSS Modeler
Student Guide
Course Code: 0A032 ERC 1.0
Predictive Modeling with IBM SPSS Modeler
Licensed Materials - Property of IBM © Copyright IBM Corp. 2010
0A032 Published October 2010
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. SPSS and PASW are trademarks of SPSS Inc., an IBM Company, registered in many jurisdictions worldwide. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Other product and service names might be trademarks of IBM or other companies.

This guide contains proprietary information which is protected by copyright. No part of this document may be photocopied, reproduced, or translated into another language without a legal license agreement from IBM Corporation.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.
Table of Contents

LESSON 1: PREPARING DATA FOR MODELING
1.1 INTRODUCTION
1.2 CLEANING DATA
1.3 BALANCING DATA
1.4 NUMERIC DATA TRANSFORMATIONS
1.5 BINNING DATA VALUES
1.6 DATA PARTITIONING
1.7 ANOMALY DETECTION
1.8 FEATURE SELECTION FOR MODELS
SUMMARY EXERCISES

LESSON 2: DATA REDUCTION: PRINCIPAL COMPONENTS
2.1 INTRODUCTION
2.2 USE OF PRINCIPAL COMPONENTS FOR PREDICTION MODELING AND CLUSTER ANALYSES
2.3 WHAT TO LOOK FOR WHEN RUNNING PRINCIPAL COMPONENTS OR FACTOR ANALYSIS
2.4 PRINCIPLES
2.5 FACTOR ANALYSIS VERSUS PRINCIPAL COMPONENTS ANALYSIS
2.6 NUMBER OF COMPONENTS
2.7 ROTATIONS
2.8 COMPONENT SCORES
2.9 SAMPLE SIZE
2.10 METHODS
2.11 OVERALL RECOMMENDATIONS
2.12 EXAMPLE: REGRESSION WITH PRINCIPAL COMPONENTS
SUMMARY EXERCISES

LESSON 3: DECISION TREES/RULE INDUCTION
3.1 INTRODUCTION
3.2 COMPARISON OF DECISION TREE MODELS
3.3 USING THE C5.0 NODE
3.4 VIEWING THE MODEL
3.5 GENERATING AND BROWSING A RULE SET
3.6 UNDERSTANDING THE RULE AND DETERMINING ACCURACY
3.7 UNDERSTANDING THE MOST IMPORTANT FACTORS IN PREDICTION
3.8 FURTHER TOPICS ON C5.0 MODELING
3.9 MODELING CATEGORICAL OUTPUTS WITH OTHER DECISION TREE ALGORITHMS
3.10 MODELING CATEGORICAL OUTPUTS WITH CHAID
3.11 MODELING CATEGORICAL OUTPUTS WITH C&R TREE
3.12 MODELING CATEGORICAL OUTPUTS WITH QUEST
3.13 PREDICTING CONTINUOUS FIELDS
SUMMARY EXERCISES

LESSON 4: NEURAL NETWORKS
4.1 INTRODUCTION TO NEURAL NETWORKS
4.2 TRAINING METHODS
4.3 THE MULTI-LAYER PERCEPTRON
4.4 THE RADIAL BASIS FUNCTION
4.5 WHICH METHOD?
4.6 THE NEURAL NETWORK NODE
4.7 MODELS PALETTE
4.8 THE NEURAL NET MODEL
4.9 VALIDATING THE LIST OF PREDICTORS
4.10 UNDERSTANDING THE NEURAL NETWORK
4.11 UNDERSTANDING THE REASONING BEHIND THE PREDICTIONS
4.12 MODEL SUMMARY
4.13 BOOSTING AND BAGGING MODELS
4.14 MODEL BOOSTING WITH NEURAL NET
4.15 MODEL BAGGING WITH NEURAL NET
SUMMARY EXERCISES

LESSON 5: SUPPORT VECTOR MACHINES
5.1 INTRODUCTION
5.2 THE STRUCTURE OF SVM MODELS
5.3 SVM MODEL TO PREDICT CHURN
5.4 EXPLORING THE MODEL
5.5 A MODEL WITH A DIFFERENT KERNEL FUNCTION
5.6 TUNING THE RBF MODEL
SUMMARY EXERCISES

LESSON 6: LINEAR REGRESSION
6.1 INTRODUCTION
6.2 BASIC CONCEPTS OF REGRESSION
6.3 AN EXAMPLE: ERROR OR FRAUD DETECTION IN CLAIMS
6.4 USING LINEAR MODELS NODE TO PERFORM REGRESSION
SUMMARY EXERCISES

LESSON 7: COX REGRESSION FOR SURVIVAL DATA
7.1 INTRODUCTION
7.2 WHAT IS SURVIVAL ANALYSIS?
7.3 COX REGRESSION
7.4 COX REGRESSION TO PREDICT CHURN
7.5 CHECKING THE PROPORTIONAL HAZARDS ASSUMPTION
7.6 PREDICTIONS FROM A COX MODEL
SUMMARY EXERCISES

LESSON 8: TIME SERIES ANALYSIS
8.1 INTRODUCTION
8.2 WHAT IS A TIME SERIES?
8.3 A TIME SERIES DATA FILE
8.4 TREND, SEASONAL AND CYCLIC COMPONENTS
8.5 WHAT IS A TIME SERIES MODEL?
8.6 INTERVENTIONS
8.7 EXPONENTIAL SMOOTHING
8.8 ARIMA
8.9 DATA REQUIREMENTS
8.10 AUTOMATIC FORECASTING IN A PRODUCTION SETTING
8.11 FORECASTING BROADBAND USAGE IN SEVERAL MARKETS
8.12 APPLYING MODELS TO SEVERAL SERIES
SUMMARY EXERCISES

LESSON 9: LOGISTIC REGRESSION
9.1 INTRODUCTION TO LOGISTIC REGRESSION
9.2 A MULTINOMIAL LOGISTIC ANALYSIS: PREDICTING CREDIT RISK
9.3 INTERPRETING COEFFICIENTS
SUMMARY EXERCISES

LESSON 10: DISCRIMINANT ANALYSIS
10.1 INTRODUCTION
10.2 HOW DOES DISCRIMINANT ANALYSIS WORK?
10.3 THE DISCRIMINANT MODEL
10.4 HOW CASES ARE CLASSIFIED
10.5 ASSUMPTIONS OF DISCRIMINANT ANALYSIS
10.6 ANALYSIS TIPS
10.7 COMPARISON OF DISCRIMINANT AND LOGISTIC REGRESSION
10.8 AN EXAMPLE: DISCRIMINANT
SUMMARY EXERCISES

LESSON 11: BAYESIAN NETWORKS
11.1 INTRODUCTION
11.2 THE BASICS OF BAYESIAN NETWORKS
11.3 TYPE OF BAYESIAN NETWORKS IN PASW MODELER
11.4 CREATING A BAYES NETWORK MODEL
11.5 MODIFYING BAYES NETWORK MODEL SETTINGS
SUMMARY EXERCISES

LESSON 12: FINDING THE BEST MODEL FOR CATEGORICAL TARGETS
12.1 INTRODUCTION
SUMMARY EXERCISES

LESSON 13: FINDING THE BEST MODEL FOR CONTINUOUS TARGETS
13.1 INTRODUCTION
SUMMARY EXERCISES

LESSON 14: GETTING THE MOST FROM MODELS
14.1 INTRODUCTION
14.2 COMBINING MODELS WITH THE ENSEMBLE NODE
14.3 USING PROPENSITY SCORES
14.4 META-LEVEL MODELING
14.5 ERROR MODELING
SUMMARY EXERCISES

APPENDIX A: DECISION LIST
INTRODUCTION
A DECISION LIST MODEL
COMPARISON OF RULE INDUCTION MODELS
RULE INDUCTION USING DECISION LIST
UNDERSTANDING THE RULES AND DETERMINING ACCURACY
UNDERSTANDING THE MOST IMPORTANT FACTORS IN PREDICTION
EXPERT OPTIONS FOR DECISION LIST
INTERACTIVE DECISION LIST
SUMMARY EXERCISES
Lesson 1: Preparing Data for Modeling

Overview
• Preparing and cleaning data for modeling
• Balancing data using the Distribution and Balance nodes
• Transforming the data with the Derive node
• Grouping data with the Binning node
• Partitioning the data into training and testing samples with the Partition node
• Detecting unusual cases with the Anomaly node
• Selecting predictors with the Feature Selection node

Data
In this lesson we use data from a telecommunications company, churn.txt, for several examples. The file contains records for 1477 of the company's customers who have at one time purchased a mobile phone. It includes such information as the length of time spent on local, long distance, and international calls, the type of billing scheme, and a variety of basic demographics, such as age and gender. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. We want to use data mining to understand what factors influence whether an individual remains a customer or leaves for an alternative company. The data are typical of what is often referred to as a churn example (hence the file name). We also use a similar data file named rawdata.txt to illustrate several steps in data preparation, and a third file, customer_dbase.sav, also from a telecommunications firm, to demonstrate how to detect anomalous records and select fields for modeling.
Note about Type Nodes in this Course
Streams presented in this course contain Type nodes, although in most instances the Types tab in the Source node would serve the same purpose.
PASW® Modeler and PASW® Modeler Server
By default, PASW Modeler will run in local mode on your desktop machine. If PASW Modeler Server has been installed, then PASW Modeler can be run in local mode or in distributed (client/server) mode. In the latter mode, PASW Modeler streams are built on the client machine, but run by PASW Modeler Server. Since the data files used in this training course are relatively small, we recommend you run in local mode. However, if you choose to run in distributed mode, make sure the training data are either placed on the machine running PASW Modeler Server or that the drive containing the data can be mapped from the server. To determine in which mode PASW Modeler is running on your machine, click Tools…Server Login (from within PASW Modeler) and see whether the Connection option is set to Local or Network. This dialog is shown below.
Figure 1.1 Server Login Dialog in PASW Modeler
Note Concerning Data for this Course
Data for this course are assumed to be stored in the folder c:\Train\ModelerPredModel. At SPSS® training centers, the data will be located in a folder of that name. Note that if you are running PASW Modeler in distributed (Server) mode (see note above), then the data should be copied to the server machine or the directory containing the data should be mapped from the server machine.
1.1 Introduction
Preparing data for modeling can be a lengthy but essential and extremely worthwhile task. If data are not cleaned and modified/transformed as necessary, it is doubtful that the models you build will be successful. In this lesson we will introduce a number of techniques that enable such data preparation. We will begin with a brief discussion concerning the handling of blanks and cleaning of data, although this is covered in greater detail in the Introduction to PASW Modeler and Data Mining course. Following this, we will introduce the concept of data balancing and how it is achieved within PASW Modeler. A number of data transformations will also be introduced as possible solutions to skewed data. We will discuss how to create training and validation samples of the data automatically with the use of data partitioning.
1.2 Cleaning Data
In most cases, datasets contain problems or errors such as missing information, outliers, and/or spurious values. Before modeling begins, these problems should be corrected or at least minimized. The higher the quality of data used in data mining, the more likely it is that predictions or results are accurate. PASW Modeler provides a number of ways to handle blank or missing information and several techniques to detect data irregularities. In this section we will briefly discuss an approach to data cleaning.

Note: If there is interest the trainer may refer to the stream Dataprep.str located in the c:\Train\ModelerPredModel directory. This stream contains examples of the techniques detailed in the following paragraphs.

After the data have been read into PASW Modeler, and if necessary all relevant data sources have been combined, the first step in data cleaning is to assess the overall quality of the data. This often involves:
• Using the Types tab of a source node or the Type node to fully instantiate data, usually achieved by clicking the Read Values button within the source or Type node, or by passing the data from a Type node into a Table node and allowing PASW Modeler to auto-type.
• Flagging missing values (white space, null and value blanks) as blank definitions within a source node or the Type node.
• Using the Data Audit node to examine the distribution and summary statistics (minimum, maximum, mean, standard deviation, number of valid records) for data fields.
Once the condition of the data has been assessed, the next step is to attempt to improve the overall quality. This can be achieved in a variety of ways:
• Using the Generate menu from the Data Audit node's report, a Select node that removes records with blank fields can be automatically created (particularly relevant for a model's output field).
• Fields with a high proportion of blank records can be filtered out using the Generate menu from the Data Audit node's report to create a Filter node.
• Blanks can be replaced with appropriate values using the Filler node. Possible appropriate values within a continuous field can range from the average, mode, or median, to a value predicted using one of the available modeling techniques. In addition, missing values can be imputed by using the Data Audit node.
• The Type node and Types tab in source nodes provide an automatic checking process that examines values within a field to determine whether they comply with the current measurement level and bounds settings. If they do not, fields with out-of-bound values can either be modified, or those records removed from passing downstream.
After these actions are completed, the data will have been cleaned of blanks and out-of-bounds values. It may also be necessary to use the Distinct node to remove any duplicate records. Once the data file has been cleaned, you can then begin to modify it further so that it is suitable for the modeling technique(s) you plan to use.
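Although Modeler performs these steps through nodes, the underlying operations are simple. The following Python sketch with pandas (an illustration only, with a hypothetical miniature dataset; it is not how Modeler implements these nodes) mirrors the same sequence: flag white-space blanks as missing, drop records with a blank target, impute a continuous field, and remove duplicates.

import numpy as np
import pandas as pd

# Hypothetical miniature of a raw file such as rawdata.txt.
df = pd.DataFrame({
    "LOCAL":   [12.0, np.nan, 3.5, 3.5, 0.0],
    "SEX":     ["F", " ", "M", "M", "F"],
    "CHURNED": ["Current", "Vol", "Invol", "Invol", None],
})

# Flag white-space values as missing (Modeler's blank definitions).
df["SEX"] = df["SEX"].replace(r"^\s*$", np.nan, regex=True)

# Drop records with a blank target (the Select node generated from the
# Data Audit report does this for a model's output field).
df = df.dropna(subset=["CHURNED"])

# Replace blanks in a continuous field with its median (one Filler-node strategy).
df["LOCAL"] = df["LOCAL"].fillna(df["LOCAL"].median())

# Remove exact duplicate records (the Distinct node).
df = df.drop_duplicates()
print(df)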
1.3 Balancing Data
Once the data have been cleaned you should examine the distribution of the key fields you will be using in modeling, including the output field (if you are creating a predictive model). This is achieved most easily using the Data Audit node, but either the Distribution node (for categorical data), the Histogram node (for continuous data), or the Graphboard node (for either type) will produce charts for single fields.

If the distribution of a categorical target field is heavily skewed in favor of one of the categories, you may encounter problems when generating predictive models. For example, if only 3% of a mailing database have responded to a campaign, a neural network trained on this data might try to classify every individual as a non-responder to achieve 97% accuracy (great, but not very useful!). One solution to overcome this problem is to balance the data, which will overweight the less frequent categories. This can be accomplished with the Balance node, which works by either reducing the number of records in the more frequent categories, or boosting the records in the less frequent categories. It can be automatically generated from the distribution and histogram displays.

When balancing data we recommend using the reduce option in preference to the boosting option. The latter duplicates records and thus magnifies problems and irregularities, as only a relatively few cases can be heavily weighted. However, when working with small datasets, data reducing is often not feasible and data boosting is the only sensible solution to imbalances within data.

Note
A better solution than balancing data at this stage is to sample from the original dataset(s) to create a training file with a roughly equal number of cases in each category of the output field. The test datasets should, however, match the unbalanced population proportions for this field to provide a realistic test of the generated models. The Partition node makes it easy to create training and validation data partitions from a single data file, but that node doesn't solve the problem of a skewed distribution for a field, as it can overweight one or more categories.

We will illustrate data balancing by examining the distribution of the field CHURNED within the file churn.txt. This field records whether the customer is current, a voluntary leaver, or an involuntary leaver (we attempt to predict this field in the lessons that follow).

Open the stream Cpm1.str (located in c:\Train\ModelerPredModel)
Run the Table node and familiarize yourself with the data
Close the Table window
Connect a Distribution node to the Type node
Edit the Distribution node and set the Field: to CHURNED
Click the Run button
Figure 1.2 Distribution of the CHURNED Field
The proportions of the three groups are rather unequal and data balancing may be useful when trying to predict this field. This output can be used directly to create a Balance node, but first we must decide whether we wish to reduce or boost the current data. Reducing the data will drop over 73% of the records, but boosting the data will involve duplicating the involuntary leavers from 132 records to over 830. Neither of these methods is ideal, but in this case we choose to reduce the data to eliminate the magnification of errors.

Click Generate…Balance Node (reduce)
Close the Distribution plot window

A generated Balance node will appear in the Stream Canvas.

Drag the Balance node to the right of the Type node and connect it between the Type and Distribution nodes
Run the stream from the Distribution node
Figure 1.3 Distribution of the CHURNED Field after Balancing the Data
When balancing data it is advisable to enable a data cache on the Balance node to freeze the selected sample. Because the Balance node randomly reduces or boosts the data, a different sample will be selected each time the data are passed through the node.
At this point the data are balanced and can be passed into a modeling node, such as the Neural Net node. Once the model has been built, it is important that the testing and assessment of the model be done on the unbalanced data.

Close the Distribution plot window
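For reference, here is what the reduce option amounts to computationally. This is a minimal sketch in Python with pandas (an illustration of the idea, not the Modeler implementation): every category of the target is randomly downsampled to the size of the least frequent category, and fixing the random seed freezes the sample in the same way that caching the Balance node does.

import pandas as pd

# Hypothetical unbalanced target, mimicking the three CHURNED groups.
df = pd.DataFrame({"CHURNED": ["Current"] * 800 + ["Vol"] * 450 + ["Invol"] * 132})

# Reduce: downsample every category to the size of the smallest one.
n_min = df["CHURNED"].value_counts().min()
balanced = df.groupby("CHURNED").sample(n=n_min, random_state=999)

print(balanced["CHURNED"].value_counts())  # 132 records in each category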
1.4 Numeric Data Transformations
When working with numeric data, the act of data balancing, as detailed above, is a rather drastic solution to the problem of skewed data and usually isn't appropriate. There are a variety of numerical transformations that provide a more sensible approach to this problem and that result in a flat or flatter distribution. The Derive node can be used to produce such transformed fields within PASW Modeler. To determine which transformation is appropriate, we need to view the data using a histogram. We'll use the field LOCAL in this example, which measures the number of minutes of local calls per month.

Add a Histogram node to the stream
Connect the Histogram node to the Type node
Edit the Histogram node and select LOCAL in the Field list (not shown)
Run the node
Figure 1.4 Histogram of the LOCAL Field
This distribution has a strong positive skewness. This condition may lead to poor performance of a neural network predicting LOCAL since there is less information (fewer records) on those individuals with higher local usage. What we need is a transformation that inverts the original skewness, that is, skews it to the left. If we get the transformation correct, the data will become relatively balanced. When you transform data you normally try to create a normal distribution or a uniform (flat) distribution.
For our problem, the distribution of LOCAL closely follows that of a negative exponential, e^(-x), so the inverse is a logarithmic function. We will therefore try a transformation of the form ln(x + a), where a is a constant and x is the field to be transformed. We need to add a small constant because some of the records have values of 0 for LOCAL, and the log of 0 is undefined. Typically the value of a would be the smallest actual positive value in the data.

Close the Histogram window
Add a Derive node from the Field Ops palette and connect the Type node to it
Edit the Derive node and set the Derive Field name to LOGLOCAL
Select Formula in the Derive As list
Enter log(LOCAL + 3) in the Formula text box (or use the Expression Builder)
Click on OK
Figure 1.5 Derive Node to Create LOGLOCAL
Connect the Derive node to the existing Histogram node
Edit the Histogram node and set the Field to LOGLOCAL
Click the Run button
Figure 1.6 Histogram of the Transformed LOCAL Field Using a Logarithmic Function
Although this distribution is not perfectly normal, it is a great improvement on the distribution of the original field.

Close the Histogram window
The above is a simple example of a transformation that can be used. Table 1.1 gives a number of other possible transformations you may wish to try when transforming data, together with their CLEM expression.

Table 1.1 Possible Numerical Transformations

Transformation          CLEM Expression
e^x                     exp(x), where x is the name of the field to be transformed
ln(x + a)               log(x + a), where a is a numerical constant
ln((x - a) / (b - x))   log((x - a) / (b - x)), where a and b are numerical constants
log10(x + a)            log10(x + a)
sqrt(x)                 sqrt(x)
1 / e^(mean(x) - x)     1 / exp(@GLOBAL_AVE(x) - x), where @GLOBAL_AVE is the average of the field x, set using the Set Globals node in the Output palette
Note
Because the original field LOCAL has been transformed, predictions from a model will be made in the log of that field. To transform back to the original scale, you need to raise the log base to the power of the predicted value: 10 for the standard log or e for the natural log (e.g., 10^predicted value).
So, for example, if the model predicts a value of 1.4 for LOGLOCAL, that is actually 10^1.4, or 25.12, for LOCAL (or, more precisely, for LOCAL + constant).
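The transformation and its back-transformation are easy to verify outside Modeler. This is a small Python sketch (illustrative only; the constant 3 mirrors the CLEM formula above, and numpy stands in for CLEM's log functions):

import numpy as np

local = np.array([0.0, 5.2, 48.7, 310.0])  # minutes of local calls
a = 3.0                                    # small constant, since log(0) is undefined

# Base-10 version of the transform and its back-transformation.
loglocal = np.log10(local + a)
print(np.allclose(10 ** loglocal - a, local))  # True

# Natural-log version: the back-transformation uses e instead of 10.
loglocal_e = np.log(local + a)
print(np.allclose(np.exp(loglocal_e) - a, local))  # True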
1.5 Binning Data Values
Another method of transforming a continuous field involves modifying it to create a new categorical field (flag, nominal, ordinal) based on the original field's values. For example, you might wish to group age into a new field based on fixed width categories of 5 or 10 years. Or, you might wish to transform income into a new field based on the percentiles (based on either the count or sum) of income (e.g., quartiles, deciles). This operation is labeled binning in PASW Modeler, since it takes a range of data values and collapses them into one bin where they are all given the same data value.

It is certainly true that binning data loses some information compared to the original distribution. On the other hand, you often gain in clarity, and binning can overcome some data distribution problems, including skewness. Moreover, oftentimes there is interest in looking at the effect of a predictor at natural cutpoints (e.g., one standard deviation above the mean). In addition, when performing data understanding, it might be easier to view the relationship between two or more continuous fields if at least one is binned. Binning can be performed with bins based on fixed widths, percentiles, the mean and standard deviation, or ranks.

We can use the original field LOCAL to show an example of binning. We know this field is highly positively skewed, and it has many distinct values. Let's group the values into five bins by requesting binning by quintiles, and then examine the relationship of the binned field to CHURNED. The Binning node is located in the Field Ops palette.

Add a Binning node to the stream near the Type node
Connect the Type node to the Binning node
Edit the Binning node and set the Bin fields to LOCAL
Click the Binning method dropdown and select the Tiles (equal count) method
Click the Quintile (5) check box
By default, a new field will be created from the original field name with the suffix _TILEN, where N stands for the number of bins to be created (here five). Percentiles can be based on the record count (in ascending order of the value of the bin field, which is the standard definition of percentiles), or on the sum of the field.
Figure 1.7 Completed Binning Node to Group LOCAL by Quintiles
The Bin Values tab allows you to view the bins that have been created and their upper and lower limits. Understandably, however, information on generated bins is not available until the node has been run, since the thresholds must first be determined from the data.

Click OK
To study the relationship between binned LOCAL (LOCAL_TILE5) and CHURNED, we could use a Matrix node, since both fields are categorical, but we can also use a Distribution node, which will be our choice here.

Add a Distribution node to the stream and attach it to the Binning node
Edit the Distribution node and select LOCAL_TILE5 as the Field
Select CHURNED as the Overlay field
Click the Normalize by color checkbox (not shown)
Click Run
Figure 1.8 Distribution of CHURNED by Binned LOCAL
There is an interesting pattern apparent. Essentially all the involuntary churners are in the first quintile of LOCAL_TILE5 (notice how the number of cases in each category is almost exactly the same). Perhaps we got lucky when specifying quintiles as the binning technique, but we have found a clear pattern that might not have been evident if LOCAL had not been binned. We would next wish to know what the bounds are on the first quintile, and to see that we need to edit the Binning node.

Close the Distribution plot window
Edit the Binning node for LOCAL
Click the Bin Values tab
Select 5 from the Tile: menu
Figure 1.9 Bin Thresholds for LOCAL
We observe that the upper bound for Bin 1 is 10.38 minutes. That means that the involuntary churners essentially all made less than 10.38 minutes of local calls, since they all fall into this bin (quintile). Given this finding, we might decide to use the binned version of LOCAL in modeling, or try two models, one with the original field and then one with the binned version.
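Quintile binning of this kind is simple to reproduce outside Modeler. The sketch below uses Python and pandas (an illustration with synthetic skewed data, not the Binning node's exact algorithm); qcut derives equal-count thresholds just as the Tiles method does:

import numpy as np
import pandas as pd

rng = np.random.default_rng(999)
local = pd.Series(rng.exponential(scale=30, size=1477))  # skewed, like LOCAL

# Five equal-count bins (quintiles), labeled 1 to 5 like LOCAL_TILE5.
local_tile5, thresholds = pd.qcut(local, q=5, labels=[1, 2, 3, 4, 5], retbins=True)

print(local_tile5.value_counts().sort_index())  # roughly 295 records per bin
print(thresholds)  # bin boundaries, including the upper bound of bin 1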
1.6 Data Partitioning
Models that you build (train) must be assessed with separate testing data that were not used to create the model. The training and testing data should be created randomly from the original data file. They can be created with either a Derive or Sample node, but the Partition node allows greater flexibility. With the Partition node, PASW Modeler has the capability to directly create a field that can split records between training, testing (and validation) data files.

Partition nodes generate a partition field that splits the data into separate subsets or samples for the training and testing stages of model building. When using all three subsets, the model is built with the training data, refined with the testing data, and then tested with the validation data. The Partition node creates a categorical field with the role automatically set to Partition. The field will have either two values (corresponding to the training and testing files) or three values (training, testing, and validation).

PASW Modeler model nodes have an option to enable partitioning, and they will recognize a field with role "partition" automatically (as will the Evaluation node). When a generated model is created, predictions will be made for records in the testing (and validation) samples, in addition to the training records. Because of this capability, the use of the Partition node makes model assessment more efficient.
To illustrate the use of data partitioning, we will create a partition field for the churn data with two values, for training and testing. Although the Partition node assists in selecting records for training and testing, its output is a new field, and so it can be found in the Field Ops palette.

Add a Partition node to the stream and connect the Type node to it
Edit the Partition node
The name of the partition field is specified in the Partition field text box. The Partitions choice allows you to create a new field with either 2 or 3 values, depending on whether you wish to create 2 or 3 data samples. The size of the files is specified in the partition size text boxes. Size is relative and given as a percentage (the percentages do not have to add to 100%). If the sum of the partition sizes is less than 100%, the records not (randomly) included in a partition will be discarded. The Generate menu allows you to create Select nodes that will select records in the training, testing, and validation samples. We'll change the size of the training and testing partitions, and input a random seed so our results are comparable.

Figure 1.10 Partition Node Settings
Change the Training partition size: to 70
Change the Testing partition size: to 30
Change the Seed value to 999 (not shown)
Click OK
Attach a Distribution node to the Partition node
Edit the Distribution node and select Partition in the Field list
Run the Distribution node
Figure 1.11 Distribution of the Partition Field
The new field Partition has close to a 70/30 distribution. It can now be used directly in modeling as described above, or separate files can be created with the Select node. We will use the partition field in a later lesson, so we'll save the stream now.

Close the Distribution window
Click on File…Save Stream As
Save the stream with the name Lesson1_Partition
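Under the hood, a partition field is just a random label attached to each record. Here is a minimal sketch in Python (illustrative, not the Modeler implementation; the partition labels are hypothetical stand-ins). The fixed seed makes the split reproducible, exactly the role the Seed value of 999 plays in the Partition node:

import numpy as np
import pandas as pd

n_records = 1477
rng = np.random.default_rng(999)

# Assign each record to a partition with probability 0.7 / 0.3.
partition = pd.Series(np.where(rng.random(n_records) < 0.7,
                               "1_Training", "2_Testing"),
                      name="Partition")

print(partition.value_counts(normalize=True))  # close to 70/30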
1.7 Anomaly Detection
Data mining usually involves very large data files, sometimes with millions of records. In such situations, we may not be concerned about whether some records are odd or unusual based on how they compare to the bulk of records in the file. Odd cases, unless they are relatively frequent (and then they can hardly be labeled "unusual"), will not cause problems for most algorithms when we try to predict some outcome. For analysts with smaller data files, though, anomalous records can be a concern, as they can distort the outcomes of a modeling process. The most salient example of this comes from classical statistics, where regression, and other methods that fall under the rubric of the General Linear Model, can be strongly affected by outliers and deviant points.

PASW Modeler includes an Anomaly node that searches for unusual cases in an automatic manner. Anomaly detection is an exploratory method designed for the quick detection of unusual cases or records that should be candidates for further analysis. These should be regarded as suspected anomalies, which, on closer examination, may or may not turn out to be real concerns. You may find that a record is perfectly valid but choose to screen it from the data for purposes of model building. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error in the data collection process.
The procedure is based on clustering the data using a set of user-specified fields. A case that is deviant compared to the norms (distributions) of all the cases in its cluster is deemed anomalous. The procedure helps you quickly detect unusual cases during data exploration before you begin modeling. It is important to note that the definition of an anomalous case is statistical and not particular to any specific industry or application, such as fraud in the finance or insurance industry (although it is possible that the technique might find such cases). Clustering is done using the TwoStep cluster routine (also available in the TwoStep node). In addition to clustering, the Anomaly node scores each case to identify its cluster group, creates an anomaly index to measure how unusual the case is, and identifies which fields contribute most to its anomalous nature.

We'll use a new data file to demonstrate the Anomaly node's operation. The file, customer_dbase.sav, is a richer data file that is also from a telecommunications company. It has an outcome field churn which measures whether a customer switched providers in the last month. There is no target field for anomaly detection, but in most instances you will want to use the same set of fields in the Anomaly node that you plan to use for modeling. There is an existing stream file we can use for this example. The Anomaly node is found in the Modeling palette since it uses the TwoStep clustering routine.

Click File…Open Stream
Double-click on Anomaly_FeatureSelect.str in the c:\Train\ModelerPredModel directory
Run the Table node and view the data
Close the Table window
Place an Anomaly node in the stream and connect it to the Type node
Edit the Anomaly node, and then click the Fields tab
Figure 1.12 Anomaly Node Fields Tab
You will typically specify exactly which fields should be used to search for anomalous cases. In these data, there are several fields that measure various aspects of the customer’s account, and we want to
use all of these here (there are also demographic fields, but in the interests of keeping this example relatively simple, we will restrict somewhat the number and type of fields used).

Click the Use custom settings button
Click the Field chooser button, and select all the fields from longmon to ebill (they are contiguous)
Click OK
Click the Model tab
Figure 1.13 Anomaly Node Model Settings
By default, the procedure will use a cutoff value that flags 1% of the records in the data. The cutoff is included as a parameter in the model being built, so this option determines how the cutoff value is set for modeling but not the actual percentage of records to be flagged during scoring. Actual scoring results may vary depending on the data. The Number of anomaly fields to report specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned. We'll use the defaults for this example.

Click Run
Right-click on the Anomaly model in the Models Manager, and select Browse
Click the Expand All button
Figure 1.14 Browsing Anomaly Generated Model Results
We see that three clusters (labeled "Peer Groups") were created automatically (although we didn't view the Expert options, the default number of clusters to be created is set between 1 and 15). In the first cluster there are 1267 records, and 18 have been flagged as anomalies (about 1.4%, close to the 1% cutoff value). The Model browser window doesn't tell us which cases are anomalous in this cluster, but it does provide a list of fields that contributed to defining one or more cases as anomalous. Of the 18 records identified by the procedure, 16 are anomalous on the field lnwireten (the log of wireless usage over tenure in months [time as a customer]). This was a derived field created earlier in the data exploration process. The average contribution to the anomaly index from lnwireten is .275. This value should be used in a relative sense in comparison to the other fields.

To see information for specific records we use the generated Anomaly model on the stream canvas. We will sort the records by the $O-AnomalyIndex field, which contains the index values.

Add a Sort node from the Record Ops palette to the stream and connect the Anomaly generated model node to the Sort node
Edit the Sort node and select the field $O-AnomalyIndex as the sort field
Change the Sort Order to Descending
Figure 1.15 Sorting Records by Anomaly Index
Click OK
Connect a Table node to the Sort node
Run the Table node
Figure 1.16 Records Sorted by Anomaly Index with Fields Generated by Anomaly Model
For each record, the model creates 9 new fields. The field $O-PeerGroup contains the cluster membership. The next six fields contain the top three fields that contributed to this record being an anomaly and the contribution of each of those fields to the anomaly index (we can request fewer or more fields on which to report in the Anomaly node Model tab). Thus we see that the three most anomalous cases, with an anomaly index of 5.0, are all in cluster 2. The first two of these are most deviant on longmon and longten. Knowing which fields made the greatest contribution to the anomaly index allows you to more easily review the data values for these cases. You don't need to look at all the fields, but instead can
concentrate on specific fields detected by the model for that case. In the interests of time, we won't take this next step here, but you might want to try this in the exercises. What we can briefly show are the options available when an Anomaly generated model is added to the stream.

Close the Table window
Edit the Anomaly generated model node in the stream
Click on the Settings tab
Figure 1.17 Settings Tab Options for Anomaly Generated Models
Note in particular that in large files, there is an option available to discard non-anomalous records, which will make investigating the anomalous records much easier. Also, you can change the number of fields on which to report here.

Close the Anomaly model Browser window
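The node's logic (cluster the records, then measure each record's deviation from the norms of its own cluster) can be sketched outside Modeler. The following Python example with scikit-learn is a simplified stand-in, not the TwoStep algorithm: k-means substitutes for TwoStep, and the index is a plain mean of squared within-cluster z-scores, so the numbers will not match Modeler's.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Three synthetic peer groups of 200 records each, four account fields.
X = np.vstack([rng.normal(loc=center, scale=1.0, size=(200, 4))
               for center in ([0, 0, 0, 0], [6, 0, 0, 0], [0, 6, 0, 0])])
X[0, 3] += 5.0  # make record 0 deviant on field 3 within its own group

Xs = StandardScaler().fit_transform(X)
peer_group = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(Xs)

# Anomaly index: mean squared standardized deviation of a record from its
# own peer group; each field's term is its contribution to the index.
index = np.zeros(len(Xs))
for k in np.unique(peer_group):
    members = peer_group == k
    z = (Xs[members] - Xs[members].mean(axis=0)) / (Xs[members].std(axis=0) + 1e-9)
    index[members] = (z ** 2).mean(axis=1)

print(index.argmax())  # record 0 stands out from its peer group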
1.8 Feature Selection for Models
Just as data files can have many records in data-mining problems, there are often hundreds, or thousands, of potential fields that can be used as predictors. Although some models can naturally use many fields (decision trees, for example), others cannot, or are inefficient at best with too many fields. As a result, you may have to spend an inordinate amount of time examining the fields to decide which ones should be included in a modeling effort.

To shortcut this process and narrow the list of candidate predictors, the Feature Selection node can identify the fields that are most important (most highly related) to a particular target/outcome
field. Reducing the number of fields required for modeling will allow you to develop models more quickly, but also permit you to explore the data more efficiently.

Feature selection has three steps:
1) Screening: In this first step, fields are removed that have too much missing data, too little variation, or too many categories, among other criteria. Also, records with excessive missing data are removed.
2) Ranking: In the second step, each predictor is paired with the target and an appropriate test of the bivariate relationship between the two is performed. This can be a chi-square test (computed from a crosstabulation) for categorical fields or a Pearson correlation coefficient if both fields are continuous. The probability values from these bivariate analyses are turned into an importance measure by subtracting the p value of the test from 1 (thus a low p value leads to an importance near 1). The predictors are then ranked on importance.
3) Selecting: In the final step, a subset of predictors is identified to use in modeling. The number of predictors can be identified automatically by the model, or you can request a specific number.

Feature selection is also located in the Modeling palette and creates a generated model node. This node, though, does not add predictions or other derived fields to the stream. Instead, it acts as a filter node, removing unnecessary fields downstream (with parameters under user control). We'll try feature selection on the customer database file. Note that although we are using feature selection after demonstrating anomaly detection, you may want to use these two in combination. For example, you can first use feature selection to identify important fields. Then you can use anomaly detection to find unusual cases on only those fields.

Add a Feature Selection node to the stream and connect it to the Type node
Edit the Feature Selection node and click the Fields tab
Click the Use custom settings button
Select churn as the Target field (not shown)
Select all the fields from region to news (near the bottom) as Inputs (be careful not to select churn again)
Click the Model tab
Figure 1.18 Model Tab for Feature Selection to Predict Churn
By default fields will initially be screened based on the various criteria listed in the Model tab. A field can have no more than 70% missing data (which is rather generous, and you may wish to modify this value). There can be no more than 90% of the records with the same value, and the minimum coefficient of variation (standard deviation/mean) is 0.1. All of these are fairly liberal standards.

Click the Options tab
Figure 1.19 Options for Feature Selection
After the fields have been ranked, they will be selected based on importance; by default, only those deemed Important will be retained in the model. This can be changed to select the top N fields by ranking of importance, or
by selecting all fields that meet a minimum level of importance. Four options are available for determining the importance of categorical predictors, with the default being the Pearson chi-square value. We will use all default settings for these data.

Click Run
Right-click on the churn Feature Selection generated model and select Browse
Figure 1.20 Feature Selection Browser Window
We selected 127 potential predictors. Seven were rejected in the screening stage because of too much missing data or too little variation. Of the remaining 120 fields, the model selected 63 as being important, so it has reduced our tasks of data review and model building considerably. The model
ranked the fields by importance (importance is rounded off to a maximum value of 1.000). If you scroll down the list of fields in the upper pane, you will eventually see fields with low values of importance that are unrelated to churn. All fields with their box checked will be passed downstream if this node is added to a data stream. The set of important fields includes a mix, with some demographic (age, employ), account-related (tenure, ebill), and financial status (cardtenure) types. From here, the generated Feature Selection model in the stream will filter out the unimportant fields.

Note
When using the Feature Selection node, it is important to understand its limitations. First, importance of a relationship is not the same thing as the strength of a relationship. In data mining, the large data files used allow very weak relationships to be statistically significant. So just because a field has an importance value near 1 does not guarantee that it will be a good predictor of some target field. Second, nonlinear relationships will not necessarily be detected by the tests used in the Feature Selection node, so a field could be rejected yet have the potential of being a good predictor (this is especially true for continuous predictors).
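The ranking step is easy to reproduce in outline. Below is a Python sketch with scipy (illustrative, using synthetic data and hypothetical field names; the tests stand in for Modeler's internals, and a two-sample t-test substitutes for the correlation test because the target here is a flag). Importance is computed as 1 minus the p value of the bivariate test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 2000

churn = rng.integers(0, 2, size=n)  # synthetic flag target

# A categorical predictor related to churn, and an unrelated continuous one.
ebill = (rng.random(n) < 0.3 + 0.2 * churn).astype(int)
income = rng.normal(50, 10, size=n)

# Categorical predictor: chi-square test on the 2x2 crosstabulation.
table = np.array([[np.sum((ebill == i) & (churn == j)) for j in (0, 1)]
                  for i in (0, 1)])
p_cat = stats.chi2_contingency(table)[1]

# Continuous predictor vs. a flag target: two-sample t-test.
p_cont = stats.ttest_ind(income[churn == 1], income[churn == 0]).pvalue

for name, p in [("ebill", p_cat), ("income", p_cont)]:
    print(f"{name}: importance = {1 - p:.3f}")  # ebill near 1.000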
Summary Exercises

A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ModelerPredModel directory. The exercises in this lesson use the data file churn.txt. The following table provides details about the file.

churn.txt contains information from a telecommunications company. The data are comprised of customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization or leave for another company. The file contains the following fields:

ID                     Customer reference number
LONGDIST               Time spent on long distance calls per month
International          Time spent on international calls per month
LOCAL                  Time spent on local calls per month
DROPPED                Number of dropped calls
PAY_MTHD               Payment method of the monthly telephone bill
LocalBillType          Tariff for locally based calls
LongDistanceBillType   Tariff for long distance calls
AGE                    Age
SEX                    Gender
STATUS                 Marital status
CHILDREN               Number of children
Est_Income             Estimated income
Car_Owner              Car owner
CHURNED                Customer status (3 categories): Current (still with the company), Vol (leavers the company wants to keep), Invol (leavers the company doesn't want)
In these exercises we will perform some exploratory analysis on the Churn.txt data file and prepare these data so that they are ready for modeling.

1. Read the file c:\Train\ModelerPredModel\Churn.txt (this file is comma delimited and includes field names) using a Var. File node. Browse the data and familiarize yourself with the data structure within each field.
2. Check to see if there are blanks (missing values) within the data; if you find any problems, decide how you wish to deal with these and take appropriate steps.
3. Look at the distribution of the CHURNED field. This field probably requires balancing. Try "boosting" the data to balance the field, since we used reducing in the lesson.
4. If you think that both of these methods are too harsh (either in terms of duplicating data too much or reducing data so there are too few cases), edit the Balance node and see if you can find a way of reducing the impact of balancing.
5. If you are going to use these data for modeling, do you wish to cache this node?
6. Use the Data Audit node to look at the distribution of some of the fields that will be used as inputs. Does the distribution of these fields appear appropriate? If not, try to find a transformation that may help the modeling process. (Note: The instructor may have already spoken about the field LOCAL; you may want to transform this field, as discussed in the lesson.)
7. Look at the field International. Do you think this field will need transforming or binning? Can you find a transformation that helps with this field? If not, why do you think this is?
8. Think about whether there are potentially any other fields that could be derived from existing data that may help with the modeling process. If so, create those fields.
9. Try using the Anomaly node on these data to detect unusual records. Don't use the field CHURNED. Do you find any commonalities among most of the anomalous records?
10. If you have made any data transformations, balanced the data, or derived any fields, you may want to create a SuperNode that reduces the size of your current stream.
11. Save your stream as Exer1.str.
12. For those with extra time: Use the Anomaly node to detect anomalous cases in the customer_dbase.sav file, as we did in the lesson. Then add the generated Anomaly node to the stream and investigate these unusual cases in more detail. Would you retain them for modeling, or not? Why?
13. For those with more extra time: Use the Data Audit node or other methods to search for outlier data values on continuous fields. If you find some, what might be done to reduce their impact on modeling?
Lesson 2: Data Reduction: Principal Components

Objectives
• Review principal components analysis, a technique used to perform data reduction prior to modeling
• Run a principal components analysis on a dataset of waste production
Data
We use a file containing information about the amount of solid waste in thousands of tons (WASTE) in various locations, along with information about land use, including the number of acres used for industrial work (INDUST), fabricated metals (METALS), trucking and wholesale trade (TRUCKS), retail trade (RETAIL), and restaurants and hotels (RESTRNTS). The data set appears in Chatterjee and Hadi (1988, Sensitivity Analysis in Linear Regression. New York: Wiley).
2.1 Introduction
Although it is used as an analysis technique in its own right, in this lesson we discuss principal components primarily as a data reduction technique in support of statistical predictive modeling (for example, regression or logistic regression) and clustering. We first review the role of principal components and factor analysis in segmentation and prediction studies, and then discuss what to look for when running these techniques. Some background principles will be covered along with comments about popular factor methods. We provide some overall recommendations. We will perform a principal components analysis on a set of fields recording different types of land usage, all of which are to be used to predict the amount of waste produced from that land.
2.2 Use of Principal Components for Prediction Modeling and Cluster Analyses

In the areas of segmentation and prediction, principal components and factor analysis typically serve in the ancillary role of reducing the many available fields to a core set of composite fields (components or factors) that are then used by cluster, regression, or logistic regression methods. These techniques can also be used with classical data mining methods, including neural networks and Bayesian networks.

Statistical prediction models such as regression, logistic regression, and discriminant analysis, when run with highly correlated input fields, can produce unstable coefficient estimates (the problem of near multicollinearity). In these models, if any input field can be almost or perfectly predicted from a linear combination of the other inputs (near or pure multicollinearity), the estimation will either fail or be badly in error. Prior data reduction using factor or principal components analysis is one approach to reducing this risk. Although we have described this problem in the context of statistical prediction models, neural network coefficients can also become unstable under these circumstances. However, since neural network coefficients are rarely interpreted, this issue is less prominent there.
However, neural network training time increases with more inputs, so reducing the number of inputs while retaining important variable information is normally good practice. Bayesian networks, which model the conditional probabilities among a set of fields, function best, and are much easier to interpret, with a relatively small number of inputs, so data reduction can be a useful step before modeling here as well.

Rule induction methods will run when predictors are highly related. However, if two continuous predictors are highly correlated and have about the same relationship to the target, then the predictor with the slightly stronger relationship to the target will enter the model. The other predictor is unlikely to enter the model, since it contributes little beyond the first predictor. While this may be adequate from the perspective of accurate prediction, the fact that the first field entered the model, while the second didn't, could be taken to mean that the first was important and the second was not. However, if the first were removed, the second predictor would have performed nearly as well. Such relationships among inputs should be revealed as part of the data understanding and data preparation steps of a data mining project. If this were not done, or if it were done inadequately, then the data reduction performed by principal components or factor analysis might be necessary (for statistical methods) and helpful (for both statistical and machine learning methods).

In some surveys done for segmentation purposes, dozens of customer attitude measures or product attribute ratings may be collected. Although cluster analysis can be run using a large number of cluster fields, two complications can develop. First, if several fields measure the same or very similar characteristics and are included in a cluster analysis, then what they measure is weighted more heavily in the analysis. For example, suppose a set of rating questions about technical support for a product is used in a cluster analysis with other unrelated questions. Since the distance calculations used in the PASW Modeler clustering algorithms are based on the differences between observations on each field, then, other things being equal, the set of related items would carry more weight in the analysis. To exaggerate to make a point, if two fields were identical copies of each other and both were used in a cluster analysis, the effect would be to double the influence of what they measure. In practice you rarely ask the same number of rating questions about each attribute (or psychographic) area. So principal components and factor analysis are used either to explicitly combine the original input fields into independent composite fields, to guide the analyst in constructing subscales, or to aid in selecting representative sets of fields (some analysts select three fields strongly related to each factor or component to be used in cluster analysis). Clustering is then performed on these fields.

A second reason factor or principal components might be run prior to clustering is for conceptual clarity and simplification. If a cluster analysis were based on forty fields, it would be difficult to look at so large a table of means or a line chart and make much sense of them. As an alternative, you can perform rule induction to identify the more influential fields and summarize those.
If factor or principal components analysis is run first, then the clustering is based on the themes or concepts measured by the factors or components. Or, as mentioned above, clustering can be done on equal-sized sets of fields, where each set is based on a factor. If the factors (components) have a ready interpretation, it can be much easier to understand a solution based on five or six factors than one based on forty fields. As you might expect, factor and principal components analyses are more often performed on "soft" measures (attitudes, beliefs, and attribute ratings) and less often on behavioral measures like usage and purchasing patterns. Keep in mind that factor and principal components analysis are considered exploratory data techniques (although there are confirmatory factor methods; for example, Amos™ can be used to test specific factor models). So, as with cluster analysis, do not expect a definitive, unassailable answer.
When deciding on the number and interpretation of factors or components, domain knowledge of the data, common sense, and a dose of hard thinking are very valuable.
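As a concrete illustration of the near multicollinearity problem described earlier in this section, the short Python sketch below fits a regression with two nearly collinear inputs and shows the inflated coefficient standard errors and variance inflation factor. This is illustrative only; it uses simulated data, not the course files, and Modeler itself performs these analyses through its nodes.

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two nearly collinear inputs: x2 is x1 plus a little noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares coefficients and their standard errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
print("coefficients:", beta.round(2))
print("standard errors:", se.round(2))   # large relative to the coefficients

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2.
r = np.corrcoef(x1, x2)[0, 1]
print("VIF:", round(1 / (1 - r**2), 1))  # far above the common cutoff of 10

With a correlation near .99 between the inputs, the standard errors are a large fraction of the coefficients themselves, which is exactly the instability that prior data reduction is meant to avoid.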
2.3 What to Look for When Running Principal Components or Factor Analysis

Two main questions arise when running principal components and factor analysis: how many (if any) components are there, and what do they represent? Most of our effort will be directed toward answering them. These questions are related because, in practice, you rarely retain factors or components that you cannot identify and name. Although the naming of components has rarely stumped a creative researcher for long (which has led to some very odd-sounding "components"), it is accurate enough to say that interpretability is one of the criteria when deciding to keep or drop a component. When choosing the number of components, there are some technical aids (eigenvalues, percentage of variance accounted for) that we will discuss, but they are guides and not absolute criteria. To interpret the components, a set of coefficients, called loadings or lambda coefficients, relating the components (or factors) to the original fields, is very important. These coefficients show which components are highly related to which fields and thus give insight into what the components represent.
2.4 Principles
Factor analysis operates (and principal components usually operates) on the correlation matrix relating the continuous fields to be analyzed. The basic argument is that the fields are correlated because they share one or more common components; if they didn't correlate, there would be no need to perform factor or component analysis. Mathematically, a one-factor (or one-component) model for three fields can be represented as follows, where the Vs are fields (or variables), F1 is a factor (or component), and the Es represent error variation unique to each field (uncorrelated with F1 and with the E components of the other fields):

V1 = L1*F1 + E1
V2 = L2*F1 + E2
V3 = L3*F1 + E3
Each field is composed of the common factor (F1) multiplied by a loading coefficient (L1, L2, L3, the lambdas) plus a unique or random component. If the factor were measurable directly (which it isn't), this would be a simple regression equation. Since these equations can't be solved as given (the Ls, Fs and Es are unknown), factor and principal components analysis take an indirect approach. If the equations above hold, then consider why fields V1 and V2 correlate. Each contains a random or unique component that cannot contribute to their correlation (the Es are assumed to have 0 correlation). However, they share the factor F1, and so if they correlate, the correlation should be related to L1 and L2 (the factor loadings). When this logic is applied to all the pairwise correlations, the loading coefficients can be estimated from the correlation data. One factor may account for the correlations between the fields, and if not, the equations can easily be generalized to accommodate additional factors. There are a number of approaches to fitting factors to a correlation matrix (least squares, generalized least squares, maximum likelihood), which has given rise to a number of factor methods.

What is a factor? In market research, factors are usually taken to be underlying traits, attitudes or beliefs that are reflected in specific rating questions. You need not believe that factors or components
actually exist in order to perform a factor analysis, but in practice the factors are usually interpreted, given names, and generally spoken of as real things.
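To make the one-factor logic concrete, the brief simulation below (an illustrative sketch; the loadings are invented) generates three standardized fields from a single common factor and confirms that the correlation between any two fields is approximately the product of their loadings.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
loadings = [0.8, 0.7, 0.6]                               # L1, L2, L3

F = rng.normal(size=n)                                   # the common factor
V = np.column_stack([
    L * F + rng.normal(scale=np.sqrt(1 - L**2), size=n)  # error variance 1 - L^2
    for L in loadings                                    # keeps each field standardized
])

print(np.corrcoef(V, rowvar=False).round(2))
print("L1*L2 =", loadings[0] * loadings[1])              # matches corr(V1, V2), about .56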
2.5 Factor Analysis versus Principal Components Analysis

Within the general area of data reduction there are two highly related techniques: factor analysis and principal components analysis. They can both be applied to correlation matrices with data reduction as a goal. They differ in a technical way having to do with how they attempt to fit the correlation matrix. We will pursue the distinction since it is relevant to which method you choose. The diagram below is a correlation matrix composed of five continuous fields.

Figure 2.1 Correlation Matrix of Five Continuous Fields
Principal components analysis attempts to account for the maximum amount of variation in the set of fields. Since the diagonal of a correlation matrix (the ones) represents standardized variances, each principal component can be thought of as accounting for as much as possible of the variation remaining in the diagonal. Factor analysis, on the other hand, attempts to account for the correlations between the fields, and therefore its focus is more on the off-diagonal elements (the correlations). So while both methods attempt to fit a correlation matrix with fewer components or factors than fields, they differ in what they focus on when fitting.

Of course, if a principal component accounts for most of the variance in fields V1 and V2, it must also account for much of the correlation between them. And if a factor accounts for the correlation between V1 and V2, it must account for at least some of their (common) variance. Thus, there is definite overlap between the methods, and they usually yield similar results. Often factor analysis is used when there is interest in studying relations among the fields, while principal components is used when there is greater emphasis on data reduction and less on interpretation.

However, principal components is very popular because it can run even when the data are multicollinear (one field can be perfectly predicted from the others), while most factor methods cannot. In data mining, since data files often contain many fields likely to be multicollinear or nearly so, principal components is used more often. This is especially the case if statistical modeling methods, which will not run with multicollinear predictors, are used. Both methods are available in the PCA/Factor node; by default, the principal components method is used.
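The claim that principal components runs on multicollinear data is easy to verify directly. The sketch below (illustrative, using scikit-learn on simulated data rather than Modeler) adds a field that is an exact linear combination of the others; the extraction still runs, with the redundancy showing up as a zero eigenvalue. PCA on standardized fields approximates the correlation-matrix approach described above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X = np.column_stack([X, X.sum(axis=1)])  # fourth field: exact linear combination

Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
print(pca.explained_variance_.round(3))  # runs fine; the last eigenvalue is ~0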
2.6 Number of Components
When factor or principal components analysis is run, several technical measures can guide you in choosing a tentative number of factors or components. The first indicator is the eigenvalues. Eigenvalues are fairly technical measures, but in principal components analysis, and in some factor methods (under orthogonal rotations), their values represent the amount of variance in the input fields that is accounted for by the components (or factors). If we turn back to the correlation
matrix in Figure 2.1, there are five fields and therefore 5 units of standardized variance to be accounted for. Each eigenvalue measures the amount of this variance accounted for by a factor. This leads to a rule of thumb and a useful measure for evaluating a given number of factors.

The rule of thumb is to select as many factors as there are eigenvalues greater than 1. Why? If the eigenvalue represents the amount of standardized variance in the fields accounted for by the factor, then if it is above 1, it must represent variance contained in more than one field, because the maximum amount of standardized variance contained in a single field is 1. Thus, if in our five-field analysis the first eigenvalue were 3, it must account for variation in several fields. Now an eigenvalue can be less than 1 and still account for variation shared among several fields (for example, 30% of the variation in each of three fields for an eigenvalue of .9), so the eigenvalue of 1 rule is only applied as a rule of thumb.

Another aspect of eigenvalues (for principal components and some factor methods) is that their sum equals the number of fields, which is the total standardized variance in the fields. Thus you can convert each eigenvalue into a percentage of explained variance, which is helpful when evaluating a solution.

Finally, it is important to mention that in applications in which you need to be able to interpret the results, the components must make sense. For this reason, factors with eigenvalues over 1 that cannot be interpreted may be dropped, and those with eigenvalues less than 1 may be retained.
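The eigenvalue rule is easy to see numerically. The sketch below (illustrative; the correlation matrix is invented, not the one in Figure 2.1) extracts the eigenvalues of a five-field correlation matrix, applies the eigenvalue-greater-than-1 rule of thumb, and converts each eigenvalue to a percentage of explained variance.

import numpy as np

# An invented correlation matrix for five fields.
R = np.array([
    [1.00, 0.75, 0.70, 0.10, 0.15],
    [0.75, 1.00, 0.65, 0.12, 0.10],
    [0.70, 0.65, 1.00, 0.08, 0.09],
    [0.10, 0.12, 0.08, 1.00, 0.60],
    [0.15, 0.10, 0.09, 0.60, 1.00],
])

eigvals = np.linalg.eigvalsh(R)[::-1]            # largest first
print("eigenvalues:", eigvals.round(2))
print("retained by the >1 rule:", int((eigvals > 1).sum()))

# The eigenvalues sum to the number of fields, so each converts directly
# to a percentage of the total standardized variance.
print("percent of variance:", (100 * eigvals / R.shape[0]).round(1))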
2.7 Rotations
When factor analysis succeeds, you obtain a relatively small number of interpretable factors that account for much of the variation in the original set of fields. Suppose you have eight fields and factor analysis returns a two-factor solution. Formally, the factor solution represents a two-dimensional space, which can be represented with a pair of axes, as shown below. While each pair of axes defines the same two-dimensional space, the coordinates of a point vary depending on which pair of axes is applied. This creates a problem for factor methods, since the values of the loadings or lambda coefficients vary with the orientation of the axes and there is no unique orientation defined by the factor analysis itself. Principal components does not suffer from this problem, since its method produces a unique orientation.

This difficulty for factor analysis is a fundamental mathematical problem. The solutions to it are designed to simplify the task of interpretation for the analyst. Most involve, in some fashion, finding a rotation of the axes that maximizes the variance of the loading coefficients, so that some are large and some small. This makes it easier for the analyst to interpret the factors. This is the best that can currently be done, but the fact that factor loadings are not uniquely determined by the method is a valid criticism leveled against it by some statisticians. We will discuss the various rotational schemes in the Methods section below.
Figure 2.2 Two Dimensional Space
2.8 Component Scores
If you are satisfied with a factor analysis or principal components solution, you can request that a new set of fields be created to represent the scores of each data record on the factors. These are calculated by summing the products of each original field and a weight coefficient (derived from the lambda coefficients). The factor score fields can then be used as the inputs for prediction and segmentation analyses. They are usually normalized to have a mean of zero and a standard deviation of one.

An alternative some analysts prefer is to use the lambda coefficients to judge which fields are highly related to a factor, and then compute a new field that is the sum or mean of that set of fields. This method, while not optimal in a technical sense, keeps the new scores (if means are used) on the same scale as the original fields (assuming the fields themselves share a common scale), which can make the interpretation and presentation straightforward. Essentially, subscale scores are created based on the factor results, and these scores are used in further analyses.
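In Modeler the score fields are created by the generated model node; outside of Modeler the same two ideas look roughly like the sketch below (illustrative, using scikit-learn on simulated data; the assignment of fields to a component is an assumption made for the example).

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))       # stand-in for five continuous input fields

# Component scores: project the standardized fields onto the components,
# then normalize the scores to mean 0 and standard deviation 1.
Z = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Z)
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Informal alternative: a subscale score as the mean of the fields judged
# (from their loadings) to belong to a component; here fields 0-2 are
# assumed, purely for illustration, to load on the first component.
subscale = X[:, 0:3].mean(axis=1)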
2.9 Sample Size
Since principal components analysis is a multivariate statistical method, the rule of thumb for sample size (commonly violated) is that there should be from 10 to 25 times as many records as there are continuous fields used in the factor or principal components analysis. This is because principal components and factor analysis are based on correlations, and for p fields there are p*(p-1)/2 correlations to estimate (for example, 20 fields yield 190 correlations). Think of this as a desirable goal and not a formal requirement (technically, if there are p fields there must be at least p+1 observations for factor analysis to run, but don't expect reasonable results with so few). If your sample size is very small relative to the number of input fields, you should turn to principal components.
2.10 Methods

There are several popular methods within the domain of factor and principal components analyses. The common factor methods differ in how they go about fitting the correlation matrix. A traditional method that has been around for many years (for some it simply means factor analysis) is the principal axis factor method (often abbreviated as PAF). A more modern method that carries some technical advantages is maximum likelihood factor analysis. If the data are ill behaved (say, near
multicollinear), maximum likelihood, though the more refined method, is more prone to give wild solutions. In most cases results from the two methods will be very close, so either is fine under general circumstances. If you suspect there are problems with your data, then principal axis may be a safer bet. The other factor methods are considerably less popular. One factor method, called Q factor analysis, involves transposing the data matrix and then performing a factor analysis on the records instead of the fields. Essentially, correlations are calculated for each pair of records based on the values of the input fields. This technique is related to cluster analysis, but is used infrequently today. Besides the factor methods, principal components can be run and, as mentioned earlier, must be run when the inputs are multicollinear.

Similarly, there are several choices of rotations. The most popular by far is the varimax rotation, which attempts to simplify the interpretation of the factors by maximizing the variances of the input fields' loadings on each factor. In other words, it attempts to find a rotation in which some fields have high and some low loadings on each factor, which makes it easier to understand and name the factors. The quartimax rotation attempts to simplify the interpretation of each field in terms of the factors by finding a rotation yielding high and low loadings across factors for each field. The equimax rotation is a compromise between the varimax and quartimax rotation methods. These three rotations are orthogonal, which means the axes are perpendicular to each other and the factors will be uncorrelated. This is considered a desirable feature, since statements can then be made about independent factors or aspects of the data. Nonorthogonal rotations (in which the axes are not perpendicular) are also available; popular ones are oblimin and promax (which runs faster than oblimin). Such rotations are rarely used in data mining, since the point of data reduction is to obtain relatively independent composite measures, and it is easier to speak of independent effects when the factors are uncorrelated.

Finally, principal components does not require a rotation, since there is a unique solution associated with it. However, in practice a varimax rotation is sometimes done to facilitate the interpretation of the components.
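Rotation methods are built into the PCA/Factor node, but the varimax idea itself is compact enough to sketch. The function below is a minimal numpy implementation of the standard varimax algorithm, not Modeler's own code; it iteratively rotates a loading matrix to maximize the variance of the squared loadings.

import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a (fields x factors) loading matrix by varimax."""
    p, k = loadings.shape
    R = np.eye(k)                      # accumulated rotation
    d_old = 0.0
    for _ in range(max_iter):
        Lam = loadings @ R
        # Singular value decomposition of the varimax criterion gradient.
        u, s, vt = np.linalg.svd(
            loadings.T @ (Lam**3 - Lam @ np.diag((Lam**2).sum(axis=0)) / p)
        )
        R = u @ vt
        d = s.sum()
        if d_old != 0 and d < d_old * (1 + tol):
            break
        d_old = d
    return loadings @ R

# Usage: rotated = varimax(unrotated_loadings)

After rotation, each field tends to load strongly on one factor and weakly on the others, which is exactly what makes the factors easier to name.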
2.11 Overall Recommendations

For data mining applications, principal components is more commonly performed than factor analysis because of the expected high correlations among the many continuous inputs that are often analyzed, and because there isn't always strong interest in interpreting the results. Varimax rotation is usually done (although it is not necessary for principal components) to simplify the interpretation. If there are not many highly correlated fields (or other sources of ill-behaved data, for example, much missing data), then either principal axis or maximum likelihood factor analysis can be performed. Maximum likelihood has technical advantages, but can produce an ugly solution if the data are not well conditioned (a statistical criterion).
2.12 Example: Regression with Principal Components

To demonstrate principal components, we will run a linear regression analysis predicting a target (amount of waste produced) as a function of several related inputs (amount of acreage put to different uses). After examining the regression results, we will run a principal components analysis and use the first few component score fields as inputs to the regression.

Note
A complete example of linear regression is provided in Lesson 6. Our intent here is not to teach linear regression, but instead to use this technique to illustrate how principal components can be used in conjunction with that modeling technique.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on PrincipalComponents.str
When the stream first opens, the following warning dialog is displayed. In version 14.0 of Modeler, the Linear Models node was added; it is an enhanced version of the Regression node, and the Regression node will be replaced in a future release.

Figure 2.3 Regression Node Expiration Warning
Click OK
Right-click on the Table node connected to the Type node, then click Run
Examine the data, and then close the Table window
Double-click on the Type node
Figure 2.4 Type Node for Linear Regression Analysis
The INDUST, METALS, TRUCKS, RETAIL, and RESTRNTS fields (which measure the number of acres of a specific type of land usage) will be used as inputs to predict the amount of solid waste (WASTE).

Close the Type node window
Double-click the Regression node named WASTE at the top of the Stream canvas
Click the Expert tab, and then click the Expert option button
Click the Output button, and then make sure that the Descriptives check box is checked
Figure 2.5 Requesting Descriptive Statistics in a Linear Regression Node
To check for correlation among the inputs, we request descriptive statistics (Descriptives). This will display correlations for all the fields in the analysis, among other statistics. (Note that we could have obtained these correlations from the Statistics node.) We can obtain more technical information about correlated predictors by checking the Collinearity Diagnostics check box.

Click OK, and then click the Run button
Right-click the Regression generated model node named Waste in the Models Manager window, then click Browse
Click the Summary tab
Expand the Analysis topic in the Summary tab (if necessary)
Figure 2.6 Linear Regression Browser Window (Summary Tab)
The estimated regression equation appears in the Summary tab under Analysis; notice that two of the inputs have negative coefficients.

Click the Advanced tab
Scroll to the Pearson Correlation section of the Correlations table in the Advanced tab of the browser window
Figure 2.7 Correlations for Input Fields and Target Field
All correlations are positive, and there are high correlations between the METALS and TRUCKS fields (.893) and between the RESTRNTS and RETAIL fields (.920). Since some of the inputs are highly correlated, this might create stability problems (large standard errors) for the estimated regression coefficients due to near multicollinearity.

Scroll to the Model Summary table
Figure 2.8 Regression Model Summary
The regression model with five predictors accounted for about 83% (adjusted R square) of the variation in the target field (WASTE).

Scroll to the Coefficients table
Figure 2.9 Linear Regression Coefficients
Two of the significant coefficients (INDUST and RETAIL) are negative regression coefficients, although these fields correlate positively with the target field (see Figure 2.7). Although there might be a valid reason for this to occur, this, coupled with the fact that RETAIL is highly correlated with another predictor, is suspicious. Also, those familiar with regression should note that the estimated beta coefficient for RESTRNTS is above 1, which is another sign of near multicollinearity. This situation might have been avoided if a stepwise method had been used (this is left as an exercise). However, we will take the position that the current set of inputs is exhibiting signs of near multicollinearity, and we will run principal components as an attempt to improve the situation.

Close the Regression browser window
Double-click the PCA/Factor model node (named Factor) in the stream canvas
Figure 2.10 PCA/Factor Dialog
In Simple mode (see the Expert tab), the only options involve selection of the factor extraction method (some of these were discussed in the Methods section). Notice that Principal Components is the default method.

Click the Run button
Right-click the PCA/Factor generated model node named Factor in the Models Manager window, then click Browse
Figure 2.11 PCA/Factor Browser Window (Five-Component Solution)
Five principal components were found. Since there were originally five input fields, reducing them to five principal components does not constitute data reduction (though it does solve the problem of multicollinearity). If the solution were successful, we would expect that the variation within the five input fields would be concentrated in the first few components, and we could check this by examining the Advanced tab of the browser window. Instead, however, we will use the Expert options to have the PCA/Factor node select an optimal number of principal components.

Close the PCA/Factor browser window
Double-click on the PCA/Factor modeling node named Factor
Click the Expert tab, and then click the Expert Mode option button
Figure 2.12 Expert Options
The Extract factors option indicates that, while in Expert mode, PCA/Factor will select as many factors as there are eigenvalues over 1 (we discussed this rule of thumb earlier in the lesson). You can change this rule or specify a number of factors; this might be done if you prefer more or fewer factors than the eigenvalue rule provides. By default, the analysis will be performed on the correlation matrix; principal components can also be applied to covariance matrices, in which case fields with greater variation will have more weight in the analysis. This is really all we need to proceed, but let's examine the other Expert options.

Notice that the Only use complete records check box becomes active when Expert Mode is selected. By default, PCA/Factor will only use records with complete information on the input fields. If this option is not checked, then a pairwise technique is used: for a record with missing values on one or more fields used in the analysis, the fields with valid values will still be used. However, the created factor score fields will be set to $null$ for these records. Also, when Only use complete records is not selected, substantial amounts of missing data can lead to numeric instabilities in the algorithm.

The Sort values check box in the Component/Factor format section will have PCA/Factor list the fields in descending order by their loading coefficients on the factor/component for which they load highest. This makes it very easy to see which fields relate to which factors and is especially useful when many input fields are involved. To further aid this effort, by suppressing loading coefficients less than .3 in absolute value (the Hide values below option) you will only see the larger loadings (small values are replaced with blanks) and not be distracted by small loadings. Although not required, these options make the interpretive task much easier when many fields are involved.

Make sure the Sort values check box is checked
Make sure the Hide values below check box is checked
Set the Hide values below value to 0.3
Click the Rotation button
Figure 2.13 Expert Options (Factor/Component Rotation)
By default, no rotation is performed, which is often the case when principal components is run. The Delta and Kappa text boxes control aspects of the Oblimin and Promax rotation methods, respectively.

Click Cancel
Click the Run button
Right-click the PCA/Factor generated model node named Factor in the Models Manager window, then click Browse
Click the Model tab
Figure 2.14 PCA/Factor Browser Window (Two-Component Solution)
The PCA/Factor browser window contains the equations used to create component (in this case) or factor score fields from the inputs. Two components were selected based on the eigenvalue-greater-than-1 rule (recall that five were selected in the original analysis under Simple mode). The coefficients are so small because the components are normalized to have means of 0 and standard deviations of 1, while most inputs have values that extend into the thousands. To interpret the components, we turn to the advanced output.

Click the Advanced tab
Scroll to the Communalities table in the Expert Output browser window
Figure 2.15 Communalities Summary
The communalities represent the proportion of variance in an input field explained by the factors (here, principal components). Since initially as many components are fit as there are inputs, the communalities in the first column (Initial) are trivially 1. They are of interest when a solution is reached (the Extraction column). Here the communalities are below 1 and measure the proportion of variance in each input field that is accounted for by the selected number of components (two). Any field having a very small communality (say .2 or below) has little in common with the other inputs, and is neither explained by the components (or factors) nor contributes to their definition. Of the five inputs, all but INDUST have a large proportion of their variance accounted for by the two components, and INDUST itself has a communality of .44 (44%).

Scroll to the Total Variance Explained table in the Advanced tab of the browser window
Figure 2.16 Total Variance Explained (by Components) Table
The Initial Eigenvalues area contains all five eigenvalues, along with the percentage of variance (of the fields) explained by each and a cumulative percentage of variance. We see in the Extraction Sums of Squared Loadings section that there are two eigenvalues over 1, the first being about twice the size of the second. Two components were selected, and they collectively account for about 82 percent of the variance of the five inputs. The third eigenvalue is .73, which might be explored as a third component if more input fields were involved (reducing from five fields to three components is not much of a reduction). The remaining two eigenvalues (fourth and fifth) are quite small. While not pursued here, in practice we might try out solutions with a different number of components.

Scroll to the Component Matrix table in the Advanced tab of the browser window
Figure 2.17 Component Matrix (Component or Factor Loadings)
PCA/Factor next presents the Component (or Factor) Matrix, which contains the unrotated loadings. If a rotation were requested, this table would appear in addition to a table containing the rotated loadings. The input fields form the rows, and the components (or factors, if a factor method were run) form the columns. The values in the table are the loadings. If any loading were below .30 in absolute value, a blank would appear in its position due to our option choice. While it makes no difference here, the option helps focus attention on the larger loadings (those closer to 1 in absolute value).

The first component seems to be a general component, having positive loadings on all the input fields (recall that they all correlated positively; see Figure 2.7). In some sense, it could represent the total (weighted) amount of land used in these activities. The second component has both positive and negative coefficients, and seems to represent the difference between land usage for trucking and wholesale trade, fabricated metals, and industrial work, versus retail trade, restaurants and hotels. This might be considered a contrast between manufacturing/industrial and service-oriented uses of land. This pattern, all fields with positive loadings on the first component (factor) and contrasting signs on coefficients of the second and later components (factors), is fairly common in unrotated solutions. If we requested a rotation, the fields would group into the two rotated components according to their signs on the second component. We should note that when interpreting components or factors, the loading magnitude is important; that is, fields with greater loadings (in absolute value) are more closely associated with the components and are more influential when interpreting them.

We know that the two components account for 82 percent of the variation of the original input fields (a substantial amount), and that we can interpret the components. Now we will rerun the linear regression with the components as inputs.

Close the PCA/Factor browser window
Double-click on the Type node located to the right of the PCA/Factor generated model node named Factor
Figure 2.18 Type Node Set Up for Principal Components Regression
The two component score fields ($F-Factor-1, $F-Factor-2) are the only fields that will be used as inputs; the original land usage fields have their role set to None. If both the land usage fields and the component score fields were inputs to the linear regression, we would have only exacerbated the near multicollinearity problem (as an exercise, explain why).

Close the Type node window
Run the Regression modeling node, named Waste, located in the lower right section of the Stream canvas
Right-click the Regression generated model node named Waste in the Models Manager, then click Browse
Click the Summary tab
Expand the Analysis topic
Figure 2.19 Linear Regression (Using Components as Inputs) Browser Window
The prediction equation for waste is now in terms of the two principal component fields. Notice that the coefficient for the second component has a negative sign, which we will consider when examining the expert output.

Click the Advanced tab
Scroll to the Model Summary table
Figure 2.20 Model Summary (Principal Components Regression)
The regression model with two principal component fields as inputs accounts for about 73% of the variance (adjusted R square) in the WASTE field. This compares with 83% in the original analysis (Figure 2.8). Essentially, we are giving up about 10 percentage points of explained variance to gain more stable coefficients and possibly a simpler interpretation. The requirements of the analysis would determine whether this tradeoff is acceptable.

Scroll to the Coefficients table
Figure 2.21 Coefficients Table (Principal Components Regression)
Both components are statistically significant. The positive coefficient for $F-Factor-1 indicates, not surprisingly, that as overall land usage increases, so does the amount of waste. The negative coefficient for the second component (which represented a contrast of manufacturing/industrial versus service-oriented land use) indicates that, controlling for total land usage, as manufacturing/industrial land use increases relative to service-oriented usage, waste production goes down. Or, to put it another way, as service-oriented land use increases relative to manufacturing/industrial, waste production increases. As mentioned before, the interpretation of the components, and thus of the regression results, might be made easier by rotating the components (say, with a varimax rotation); you might ask your instructor to demonstrate this approach. Notice that the components, unlike the original fields (see Figure 2.9), have no beta coefficients above 1, indicating that the potential problem with near multicollinearity has been resolved.

It is important to note that while we have shifted from a regression with five inputs to a regression with two components, the five inputs are still required to produce predictions, because they are needed to create the component score fields.
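For reference, the entire principal-components-regression workflow from this example can be sketched in a few lines of Python. The field names match the lesson data, but the file name and path are hypothetical assumptions (the course itself works from the PrincipalComponents.str stream, not a CSV file), so treat this as an illustration of the idea rather than a reproduction of the lesson.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical: assumes the lesson's waste data have been exported to CSV
# with the field names used in this lesson.
df = pd.read_csv("waste.csv")
inputs = ["INDUST", "METALS", "TRUCKS", "RETAIL", "RESTRNTS"]

# Standardize the land-usage fields and keep two components, mirroring
# the two-component solution found above.
Z = StandardScaler().fit_transform(df[inputs])
scores = PCA(n_components=2).fit_transform(Z)

# Regress waste on the component scores instead of the raw fields.
model = LinearRegression().fit(scores, df["WASTE"])
print("R^2:", round(model.score(scores, df["WASTE"]), 3))
print("coefficients:", model.coef_.round(3))

# Scoring new records still requires all five inputs, because the
# component scores are computed from them.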
Additional Readings
Those interested in learning more about factor and principal components analysis might consider the book by Kline (1994), Jae-On Kim's introductory text (1978) and his book with Charles W. Mueller (1979), and Harry Harman's revised text (1979).
Summary Exercises

The exercises in this lesson use the file waste.dat. The following table provides details about the file. Waste.dat contains information from a waste management study in which the amount of solid waste produced within an area was related to the type of land usage. Interest is in relating land usage to the amount of waste produced, for planning purposes. The inputs were found to be highly correlated, and the dataset is used to demonstrate principal components regression. The file contains 40 records and the following fields:

INDUST: Acreage (US) used for industrial work
METALS: Acreage used for fabricated metal
TRUCKS: Acreage used for trucking and wholesale trade
RETAIL: Acreage used for retail trade
RESTRNTS: Acreage used for restaurants and hotels
WASTE: Amount of solid waste produced
1. Working with the current stream from the lesson, request a varimax rotation of the principal components analysis. Interpret the component coefficients. Use the component score fields from this generated model node as inputs to the Regression node predicting waste. Does the R square change? Explain this. Do the regression coefficients change? How would you interpret them?
2. With the same data, use the Extraction Method drop-down list in the PCA/Factor node to run a factor analysis instead (using principal axis factoring or maximum likelihood) with no rotation. Compare the results to those obtained by the principal components in the lesson. Are they similar? In what way do they differ? Now rerun the factor analysis, requesting a varimax rotation. How do these results compare to those obtained in the first exercise? Do you find anything that leads you to prefer one to the other?
Lesson 3: Decision Trees/Rule Induction

Overview
• Introduce the features of the C5.0, CHAID, C&R Tree and QUEST nodes
• Create models for categorical targets
• Understand how CHAID and C&R Tree model a continuous output
Data
We will use the dataset churn.txt that we used in Lesson 1. This data file contains information on 1477 of a telecommunications company's customers who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. In this lesson, we use decision tree models to understand which factors influence group membership. Following recommended practice, we will use a Partition node to divide the cases into two partitions (subsamples), one to build or train the model and the other to test it (often called a holdout sample). With a holdout sample, you are able to check the resulting model's performance on data not used to fit the model. The holdout sample also has known values for the target field and therefore can be used to check model performance.

A second dataset, Insclaim.dat, used with the C&R Tree node, contains 293 records based on patient admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a predictive model for the insurance claim amount and use this model to identify outliers (patients with claim values far from what the model predicts), which might be instances of error or fraud in the claims. Such analyses can be performed for error or fraud detection when audited data (for which the outcome error/no error or fraud/no fraud is known) are not available.
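The partitioning practice just described is the standard train/holdout split used throughout predictive modeling. A minimal Python sketch of the same idea (illustrative only; in this lesson the split is done by the Partition node, and reading churn.txt this way assumes its comma-delimited layout from Lesson 1):

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumes churn.txt is comma delimited with field names, as in Lesson 1.
churn = pd.read_csv("c:/Train/ModelerPredModel/churn.txt")

# A 50/50 split, matching the Partition node's default in this lesson.
train, test = train_test_split(churn, test_size=0.5, random_state=1)
# Build the model on `train`; assess it on `test`, whose CHURNED values
# are known but were not used in training.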
3.1 Introduction

PASW Modeler contains four different algorithms for constructing a decision tree (more generally referred to as rule induction): C5.0, CHAID, QUEST, and C&R Tree (classification and regression trees). They are similar in that they can all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the target. However, they differ in several important ways. (PASW Modeler also includes the Decision List node, which develops models to identify subgroups or segments that show a higher or lower likelihood of a binary (yes or no) target relative to the overall sample. These models are tree-like, but they are different enough that Decision List is reviewed separately in Appendix A.)

We begin by reviewing a table that highlights some distinguishing features of the algorithms. Next, we will examine the various options for the algorithms in the context of predicting a categorical output. Within each section we discuss when it is advisable to use the expert options within these nodes.
3.2 Comparison of Decision Tree Models

The table below lists some of the important differences between the decision tree/rule induction algorithms available within PASW Modeler.
Table 3.1 Some Key Differences between the Four Decision Tree Models

Criterion                                    C5.0                            CHAID (1)                           QUEST                     C&R Tree
Type of split for categorical predictors     Multiple                        Multiple                            Binary                    Binary
Continuous target                            No                              Yes                                 No                        Yes
Continuous predictors                        Yes                             No (2)                              Yes                       Yes
Criterion for predictor selection            Information measure             Chi-square (F test for continuous)  Statistical               Impurity (dispersion) measure
Can cases missing predictor values be used?  Yes, uses fractionalization     Yes, missing becomes a category     Yes, uses surrogates      Yes, uses surrogates
Priors                                       No                              No                                  Yes                       Yes
Pruning criterion                            Upper limit on predicted error  Stops rather than overfit           Cost-complexity pruning   Cost-complexity pruning
Build trees interactively                    No                              Yes                                 Yes                       Yes
Supports bagging/boosting                    Yes                             Yes                                 Yes                       Yes

(1) Modeler has extended the logic of the CHAID approach to accommodate ordinal and continuous target fields.
(2) Continuous predictors are binned into ordinal fields containing, by default, approximately equal-sized categories.

Note: C&R Tree and QUEST produce binary splits (two-branch splits) when growing the tree, while C5.0 and CHAID can produce more than two subgroups when splitting occurs. However, if we had a nominal or ordinal predictor with four categories, each of which was distinct in relation to the target field, C&R Tree and QUEST could perform successive binary splits on this field. This would produce a result equivalent to a multiple split at a single node, but it requires additional tree levels.

All methods can handle predictors and targets that are categorical (flag, nominal, and ordinal). CHAID and C&R Tree can use a continuous target field, and all but CHAID can use a continuous predictor or input (although see footnote 2). The trees that each method grows will not necessarily be identical, because the methods use very different criteria for selecting a predictor. CHAID and QUEST use more standard statistical methods, while C5.0 and C&R Tree use non-statistical measures, as explained below.

Missing (blank) values are handled in three different ways. C&R Tree and QUEST use the substitute (surrogate) predictor field whose split is most strongly associated with that of the original predictor to direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in proportion to the distribution of the predictor field and passes a weighted portion of the case down each tree branch. CHAID uses the missing values as an additional category in model building.
Three of the four methods prune trees after growing them quite large, while CHAID instead stops before a tree gets too large. For all these reasons, you should not expect the four algorithms to produce identical trees for the same data. You should expect, though, that important predictors would be included in trees built by any of the algorithms.

Those interested in more detail concerning the algorithms can see the PASW Modeler 14.0 Algorithms Guide. Also, you might consider C4.5: Programs for Machine Learning (Morgan Kaufmann, 1993) by Ross Quinlan, which details the predecessor to C5.0; Classification and Regression Trees (Wadsworth, 1984) by Breiman, Friedman, Olshen and Stone, who developed CART (Classification and Regression Tree) analysis; the article by Loh and Shih (1997, "Split Selection Methods for Classification Trees," Statistica Sinica, 7: 815-840), which details the QUEST method; and, for a description of CHAID, "The CHAID Approach to Segmentation Modeling: Chi-squared Automatic Interaction Detection," Lesson 4 in Richard Bagozzi, Advanced Methods of Marketing Research (Blackwell, 1994).
3.3 Using the C5.0 Node

We will use the C5.0 node to create a rule induction model. The generated node contains the rule induction model in either decision tree or rule set format. By default, the C5.0 node is labeled with the name of the output field. The C5.0 model can be browsed, and predictions can be made by passing new data through it in the Stream Canvas.

Before a data stream can be used by the C5.0 node, or essentially any node in the Modeling palette, the measurement levels of all fields used in the model must be instantiated (either in the source node or a Type node). That is because all modeling nodes use this information to set up the models. As a reminder, the table below shows the key roles for a field.

Table 3.2 Role Settings

Input: The field acts as an input or predictor within the modeling.
Target: The field is the output or target for the modeling.
Both: Allows the field to act as both an input and a target in modeling. This role is suitable for the association rule and sequence detection algorithms only; all other modeling techniques will ignore the field.
None: The field will not be used in machine learning or statistical modeling. This is the default if the field is defined as Typeless.
Partition: Indicates a field used to partition the data into separate samples for training, testing, and (optional) validation purposes.
Split: Indicates that a model should be built for each value of the field. For flag, nominal and ordinal fields only.
Frequency: Used as a frequency weighting factor. Supported by CHAID, QUEST, C&R Tree, and the Linear node.

A role can be set by clicking in the Role column for a field within the Type node or the Type tab of a source node and selecting the role from the drop-down menu. Alternatively, this can be done from the Fields tab of a modeling node.
If the Stream Canvas is not empty, click File…New Stream
Place a Var. File node from the Sources palette
Double-click the Var. File node
Move to the c:\Train\ModelerPredModel directory and double-click on the Churn.txt file
As the delimiter, check the Comma option if necessary
Set the Strip lead and trail spaces: option to Both
Click OK to return to the Stream Canvas
Place a Partition node from the Field Ops palette to the right of the Var. File node named Churn.txt
Connect the Var. File node named Churn.txt to the Partition node
Place a Type node from the Field Ops palette to the right of the Partition node
Connect the Partition node to the Type node
Next we will add a Table node to the stream. This not only will force PASW Modeler to instantiate the data but also will act as a check to ensure that the data file is being read correctly.

Place a Table node from the Output palette above the Type node in the Stream Canvas
Connect the Type node to the Table node
Right-click the Table node
Run the Table node
The values in the data table should look reasonable (not shown).

Click File…Close to close the Table window
Double-click the Type node
Click in the cell located in the Measurement column for ID (current value is Continuous), and select Typeless from the list
Click in the cell located in the Role column for CHURNED (current value is Input), and select Target from the list
Figure 3.1 Type Node Ready for Modeling
Notice that ID will be excluded from any modeling, as the role is automatically set to None for a Typeless field. The CHURNED field will be the target field for any predictive model, and all fields but ID and Partition will be used as predictors.

Click OK
Place a C5.0 node from the Modeling palette to the right of the Type node
Connect the Type node to the C5.0 node
The name of the C5.0 node should immediately change to CHURNED.

Figure 3.2 C5.0 Modeling Node Added to Stream
Double-click the C5.0 node
Figure 3.3 C5.0 Node Model Tab
The Model name option allows you to set the name for both the C5.0 node and the resulting C5.0 rule node. The form of the resulting model (decision tree or rule set; both will be discussed) is selected using the Output type: option.

The Use partitioned data option is checked so that the C5.0 node will make use of the Partition field created by the Partition node earlier in the stream. Whenever this option is checked, only the cases the Partition node assigned to the Training sample will be used to build the model; the rest of the cases will be held out for testing and/or validation purposes. If unchecked, the field will be ignored and the model will be trained on all the data. Here, we use the default setting for the Partition node, so 50% of the cases will be used for training and 50% for testing.

The Build model for each split option enables you to use a single stream to build separate models for each value of a field specified as a split field in the Fields tab or an upstream Type node. With split modeling, you can easily build the best-fitting model for each field value in a single execution of the stream.

The Cross-validate option provides a way of validating the accuracy of C5.0 models when there are too few records in the data to permit a separate holdout sample. It does this by partitioning the data into N equal-sized subgroups and fitting N models. Each model uses (N-1) of the subgroups for training, then applies the resulting model to the remaining subgroup and records the accuracy. Accuracy figures are pooled over the N holdout subgroups, and this summary statistic estimates model accuracy applied to new data. Since N models are fit, N-fold validation is more resource intensive; it reports the accuracy statistic, but does not present the N decision trees or rule sets. By default N, the number of folds, is set to 10.

For a predictor field that has been defined as categorical, C5.0 will normally form one branch per value in the set. However, by checking the Group symbolics check box, the algorithm can be set so that it finds sensible groupings of the values within the field, thus reducing the number of rules. This is often desirable. For example, instead of having one rule per region of the country, grouping symbolic values may produce a rule such as:
Region [South, Midwest] …
Region [Northeast, West] …

Once trained, C5.0 builds one decision tree or rule set that can be used for predictions. However, it can also be instructed to build a number of alternative models for the same data by selecting the Boosting option. Under this option, when the model makes a prediction it consults each of the alternative models before making a decision. This can often provide more accurate prediction, but takes longer to train. Also, the resulting model is a set of decision trees whose predictions are combined by voting, which is not simple to interpret.

The algorithm can be set to favor either Accuracy on the training data (the default) or Generality to other data. In our example, we favor a model that is expected to generalize better to other data, and so we select Generality.

Click the Generality option button
C5.0 will automatically handle errors (noise) within the data and, if known, you can inform PASW Modeler of the expected proportion of noisy or erroneous data. This option is rarely used.
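Although C5.0 itself is proprietary and not available outside Modeler, the overall workflow just described (train on the training partition, assess on the holdout, optionally cross-validate) can be illustrated with a CART-style tree in scikit-learn. This is an analogy, not the C5.0 algorithm, and the simple dummy-coding of the categorical fields is an assumption made for the sketch.

import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

churn = pd.read_csv("c:/Train/ModelerPredModel/churn.txt")
X = pd.get_dummies(churn.drop(columns=["ID", "CHURNED"]))  # encode categoricals
y = churn["CHURNED"]

# 50/50 training/testing partition, as in the stream.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

# min_samples_leaf plays a role loosely similar to C5.0's minimum
# records per child branch.
tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=1)
tree.fit(X_train, y_train)
print("holdout accuracy:", round(tree.score(X_test, y_test), 3))

# Ten-fold cross-validation, analogous to the Cross-validate option.
print("CV accuracy:", round(cross_val_score(tree, X, y, cv=10).mean(), 3))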
As with all of the modeling nodes, after selecting the Expert option or tab, more advanced settings are available. In this course, we will discuss the Expert options briefly. The reader is referred to the Modeler 14 Modeling Nodes documentation for more information on these settings.

Click the Expert option button
Figure 3.4 C5.0 Node Model Tab Expert Options
By default, C5.0 will produce splits only if at least two of the resulting branches have at least two data records each. For large datasets you may want to increase this value to reduce the likelihood of rules that apply to very few records. To do so, increase the value in the Minimum records per child branch box.

Click the Simple Mode option button, and then click Run
A C5.0 Rule model, labeled with the predicted field (CHURNED), will appear in the Models palette of the Manager. The C5.0 Rule model is also added automatically to the stream, connected to the Type node. A dotted line connects it to the C5.0 Modeling node, indicating the source of the model (not shown). Each time the model is rerun, the model in the stream will be replaced.
3.4 Viewing the Model

Once the C5.0 Rule node is in the stream, it can be edited.
Right-click the C5.0 generated model node named CHURNED in the stream palette, then click Edit
The Model Viewer window has two panes. The left pane shows the root node of the tree and the first split; the right pane displays a graph of predictor importance measures.

According to what we see of the tree so far, LOCAL is the first split in the tree. Further, we see that if LOCAL <= 4.976, the Mode value for CHURNED is InVol. The Mode is the modal (most frequent) output value for the branch, and it will be the predicted value unless there are other fields that need to be taken into account within that branch to make a prediction. When LOCAL <= 4.976, the branch terminates, which is visually apparent because of the arrow. This means the prediction for all customers within this range of values on LOCAL is that they will be involuntary churners. In the second half of the first split, where LOCAL > 4.976, the Mode value is Current. In this instance, no predictions of CHURNED are visible, and to view the predictions we need to further unfold the tree.

Predictor importance is enabled by default on the Analyze tab of the C5.0 modeling node (or any modeling node for which it can be calculated). Predictor importance takes into account the whole tree and is calculated on the test partition, if one is available (as is true here). Predictor importance values sum to 1.0, so the relative importance of each predictor can be directly compared. Importantly, predictor importance does not relate to model accuracy; instead, it is a measure of how much influence a field has on model prediction, i.e., changes in a field lead to changes in model predictions.

Figure 3.5 Browsing the C5.0 Rule Node
The bar chart shows that the field LOCAL, used on the first split, is by far the most important in predicting CHURNED. However, we haven’t seen the whole tree, and critically, we aren’t yet ready to use the test partition data, so we won’t examine predictor importance any further at the moment. To unfold the branch LOCAL > 4.976, just click the expand button.
Click the expand button to unfold the branch LOCAL > 4.976
Figure 3.6 Unfolding a Branch
SEX is the next split field; that is, SEX is the best predictor for persons who spend more than 4.976 minutes on local calls. The Mode value for Males is Current and for Females is Vol. However, at this point we still cannot make any predictions because there is a symbol to the left of each value of SEX, which means that other fields need to be taken into account before we can make a prediction. Once again we could unfold each separate branch to see the rest of the tree, but we will take a shortcut:
Click the All button in the Toolbar
Figure 3.7 Fully Unfolded Tree
We can see several nodes, usually referred to as terminal nodes, that cannot be refined any further. In these instances, the mode is the prediction. For example, if we are interested in the Current Customer group, one group we would predict to remain customers are persons where LOCAL > 4.976, SEX = M, International <= 0.905, and AGE > 29. To get an idea about the number and percentage of records within such branches we ask for more details.
Click the Show or hide instance and confidence figures button in the toolbar
Figure 3.8 Instance and Confidence Figures Displayed
Branch predicting Current
The instances figure tells us that there are 218 persons who met those criteria. The confidence figure for this set of individuals is 1.0, which represents the proportion of records within this set correctly classified (predicted to be Current and actually being Current). That means the model is 100% accurate on this group! If we were to score another dataset with this model, how would persons with the same characteristics be classified? Because PASW Modeler assigns the group the modal category of the branch, everyone in the new dataset who met the criteria defined by this rule would be predicted to remain a Current Customer. If you would like to present the results to others, an alternative format is available that helps visualize the decision tree. The Viewer tab provides this alternative format.
Click the Viewer tab
Click the Decrease Zoom tool (to view more of the tree). (You may also need to expand the size of the window.)
Figure 3.9 Decision Tree in the Viewer Tab
The root of the tree shows the overall percentages and counts for the three categories of CHURNED. The modal category is shaded in each node. We see that there are 719 customers in the training partition. The first split is on LOCAL, as we have seen already in the text display of the tree. Similar to the text display, we can decide to expand or collapse branches. In the lower right corner of some nodes a – or + is displayed, referring to an expanded or collapsed branch, respectively. For example, to collapse the tree at node 2:
Click the – in the lower right corner of node 2 (shown in Figure 3.10)
Figure 3.10 Collapsing a Branch
In the Viewer tab, toolbar buttons are available for zooming in or out; showing frequency information as graphs and/or tables; changing the orientation of the tree; and displaying an overall map of the tree in a smaller window (the tree map window) that aids navigation in the Viewer tab. When it is not possible to view the whole tree at once, such as now, one of the more useful buttons in the toolbar is the Tree map button, because it shows you the size of the tree. A red rectangle indicates the portion of the tree that is being displayed. You can then navigate to any portion of the tree by clicking the node you want in the Tree map window.
Click the + in the lower right corner of node 2 (to expand the branch again)
Click the Tree map button in the toolbar
Enlarge the Tree map window until you see the node numbers (shown in Figure 3.11)
Figure 3.11 Decision Tree in the Viewer Tab with a Tree Map
3.5 Generating and Browsing a Rule Set

When building a C5.0 model, the C5.0 node can be instructed to generate the model as a rule set, as opposed to a decision tree. A rule set is a number of IF … THEN rules that are grouped together by prediction. A rule set can also be produced from the Generate menu when browsing a C5.0 decision tree model.
In the C5.0 Rule Model Viewer window, click Generate…Rule Set
Figure 3.12 Generate Ruleset Dialog
Note that the default Rule set name appends the letters “RS” to the output field name. You may specify whether you want the C5.0 Ruleset node to appear in the Stream Canvas (Canvas), the generated Models palette (GM palette), or both. You may also change the name of the rule set and set lower limits on the support and confidence of the produced rules. Support is the percentage of records having the particular values on the input fields (i.e., the records to which a rule applies); confidence is the accuracy of a rule: the percentage of records having the particular value for the output field, given the values for the input fields.
Set Create node on: to GM Palette
Click OK
Figure 3.13 Generated C5.0 Rule Set Node
Generated Rule Set for CHURNED
Click File…Close to close the C5.0 Rule browser window
Right-click the C5.0 Rule Set node named CHURNEDRS in the generated Models palette in the Manager, then click Browse
Figure 3.14 Browsing the C5.0 Generated Rule Set
Click the All button to unfold the rules
Click the Show or hide instance and confidence figures button in the toolbar
The numbered rules now expand as shown below.
Figure 3.15 Fully Expanded C5.0 Generated Rule Set
For example, Rule #1 (Current) has this logic: if a person makes more than 4.976 minutes of local calls a month, is male, makes less than or equal to 0.905 minutes of international calls, is less than or equal to 29 years old, and has an estimated income greater than 38,950.50, then we would predict Current. This form of the rules allows you to focus on a particular conclusion rather than having to view the entire tree. If the Rule Set is added to the stream, a Settings tab becomes available that allows you to export the rule set in SQL format, which permits the rules to be applied directly to a database.
Click File…Close to close the Rule set browser window
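To make the logic of such a rule concrete, here is a minimal sketch of Rule #1 expressed as an ordinary function. Python is used purely for illustration; Modeler applies the rules internally, and the exported SQL would express the same logic. The function name and the dictionary-based record format are assumptions, not anything Modeler produces:

# A sketch of Rule #1 (Current) from the rule set above, written as a
# plain Python predicate. The thresholds come from the browsed rule set;
# everything else (names, record format) is illustrative only.
def rule_1_current(record):
    return (record["LOCAL"] > 4.976
            and record["SEX"] == "M"
            and record["International"] <= 0.905
            and record["AGE"] <= 29
            and record["Est_Income"] > 38950.50)

# Example: a record meeting all five conditions is predicted Current.
customer = {"LOCAL": 6.2, "SEX": "M", "International": 0.5,
            "AGE": 27, "Est_Income": 41000.0}
print("Predict Current" if rule_1_current(customer) else "Rule does not fire")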
3.6 Understanding the Rule and Determining Accuracy

The predictive accuracy of the rule induction model is not given directly within the C5.0 model node. To get that information you can use an Analysis node. However, at this stage we will use Matrix nodes and Evaluation Charts to determine how good the model is.
Creating a Data Table Containing Predicted Values

We use the Table node to examine the predictions from the C5.0 model.
Place a Table node from the Output palette below the generated C5.0 Rule model named CHURNED
Connect the generated C5.0 Rule model named CHURNED to the Table node
Right-click the Table node, then click Run and scroll to the right in the table
Figure 3.16 Two New Fields Generated by the C5.0 Rule Node
Two new columns appear in the data table, $C-CHURNED and $CC-CHURNED. The first represents the predicted value for each record and the second the confidence value for the prediction.
Click File…Close to close the Table output window
Comparing Predicted to Actual Values

We will view a matrix (crosstab table) to see where the predictions were more, and less, correct, and then we will evaluate the model graphically with a gains chart.
Place two Select nodes from the Records palette, one to the lower right of the generated C5.0 node named CHURNED, and one to the lower left
Connect the generated C5.0 node named CHURNED to each Select node
First we will edit the Select node on the left that we will use to select the Training sample cases:
Double-click on the Select node on the left to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 1_Training
Click OK, and then click OK again to close the dialog
Figure 3.17 Completed Selection for the Training Partition
Now we will edit the Select node on the right to select the Testing sample cases:
Double-click on the Select node on the right to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 2_Testing
Click OK, and then click OK again to close the dialog
Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:
Place a Matrix node from the Output palette below the Select node
Connect the Matrix node to the Select node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $C-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click the Output tab and custom name the Matrix node for the Training sample as Training and the one for the Testing sample as Testing (this will make it easier to keep track of which output we are looking at)
Click OK
For each actual churned category, the Percentage of row choice will display the percentage of records predicted into each of the target categories.
Run each Matrix node
Figure 3.18 Matrix Output for the Training and Testing Samples
Looking at the Training sample results, the model correctly predicts about 78.7% of the Current category, 100% of the Involuntary Leavers, and 97.0% of the Voluntary Leavers. The results for the testing sample compare favorably, which suggests that the model will perform well with new data.
Click File…Close to close the Matrix windows
Evaluation Chart Node

The Evaluation Chart node offers an easy way to evaluate and compare predictive models in order to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of a criterion for each quantile, from highest to lowest. To produce a gains chart for the Current group:
Place an Evaluation chart node from the Graphs palette to the right of the generated C5.0 Rule node named CHURNED
Connect the generated C5.0 Rule node named CHURNED to the Evaluation chart node
Outcomes are handled by defining a specific value or range of values as a hit. Hits usually indicate success of some sort (such as a sale to a customer) or an event of interest (such as someone given credit being a good credit risk). Flag output fields are straightforward; by default, hits correspond to true values. For Set output fields, by default the first value in the set defines a hit. For the churn data, the first value for the CHURNED field is Current. To specify a different value as the hit value, use the Options tab of the Evaluation node to specify the target value in the User defined hit group. There are five types of evaluation charts, each of which emphasizes a different evaluation criterion. Here we discuss Gains and Lift charts. For information about the others, which include Profit and ROI charts, see the Modeler 14 Modeling Nodes documentation.
Figure 3.19 Evaluation Chart Dialog
Gains are defined as the proportion of total hits that occurs in each quantile. We will examine the gains when the data are ordered from those most likely to those least likely to be in the current category (based on the confidence of the model prediction). The Chart type option supports five chart types with Gains chart being the default. If Profit or ROI chart type is selected, then the appropriate options (cost, revenue and record weight values) become active so information can be entered. The charts are cumulative by default (see Cumulative plot check box), which is helpful in evaluating such business questions as “how will we do if we make the offer to the top X% of the prospects?” The granularity of the chart (number of points plotted) is controlled by the Plot drop-down list and the Percentiles choice will calculate 100 values (one for each percentile from 1 to 100). For small data files or business situations in which you can only contact customers in large blocks (say some number of groups, each representing 5% of customers, will be contacted through direct mail), the plot granularity might be decreased (to deciles (10 equal-sized groups) or vingtiles (20 equal-sized groups)). A baseline is quite useful since it indicates what the business outcome value (here gains) would be if the model predicted at the chance level. The Include best line option will add a line corresponding to
a perfect prediction model, representing the theoretically best possible result applied to the data, where hits = 100% of the cases. The Separate by partition option in the node provides an easy and convenient way to validate the model: it displays not only the results of the model on the training data but also, in a separate chart, how well it performed on the testing or holdout data. Of course, this assumes that you made use of the Partition node when developing the model.
Click the Include best line checkbox (not shown)
Click Run
Figure 3.20 Gains Chart of the Current Customer Group
The vertical axis of the gains chart is the cumulative percentage of the hits, while the horizontal axis represents the ordered (by model prediction and confidence) percentile groups. The diagonal line represents the base rate, that is, what we expect if the model is predicting the outcome at the chance level. The upper line (labeled $BEST-CHURNED) represents results if a perfect model were applied to the data, and the middle line (labeled $C-CHURNED) displays the model results. The three lines connect at the extreme [(0, 0) and (100, 100)] points. This is because if either no records or all records are considered, the percentages of hits for the base rate, best model, and actual model are identical. The advantage of the model is reflected in the degree to which the model-based line exceeds the base-rate line for intermediate values in the plot, and the room for model improvement is the discrepancy between the model line and the Best (perfect model) line. If the model line is steep for early percentiles, relative to the base rate, then the hits tend to concentrate in those percentile groups of data. At the practical level, this would mean for our data that many of the current customers could be found within a small portion of the ordered sample. The gains line ($C-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating the hits for the Current outcome are concentrated in the percentiles predicted most likely to contain current customers according to the model. Just over 75% of the hits are contained within the first 40 percentiles. The gains line in the chart using Testing data is very similar, which suggests that this model can reliably be used to predict current customers with new data.
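To make the mechanics of a cumulative gains chart concrete, the sketch below computes gains by hand for a scored dataset. It assumes, hypothetically, two lists: a 0/1 hit indicator and a model confidence for each record; the toy data and the quantile granularity are illustrative, not anything produced by Modeler itself:

import numpy as np

def cumulative_gains(hits, confidences, n_groups=10):
    """Return cumulative % of hits captured in each model-ordered quantile."""
    hits = np.asarray(hits, dtype=float)
    # Sort records from most to least likely to be a hit, per the model.
    order = np.argsort(-np.asarray(confidences))
    sorted_hits = hits[order]
    # Split into equal-sized groups and accumulate the hit counts.
    groups = np.array_split(sorted_hits, n_groups)
    cum_hits = np.cumsum([g.sum() for g in groups])
    return 100.0 * cum_hits / hits.sum()

# Toy example: 10 records, 4 of which are hits.
hits = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
conf = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30]
print(cumulative_gains(hits, conf, n_groups=5))  # [ 25.  50.  75. 100. 100.]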
You can hover over a line and a popup will display the value at that point, as shown in Figure 3.21.

Figure 3.21 Gains Chart for the Current Customer Group
Click File…Close to close the Evaluation chart window
Changing Target Category for Evaluation Charts

By default, an Evaluation chart will use the first target outcome category to define a hit. To change the target category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node. To create a gains chart in which a hit is based on the Voluntary Leaver category:
Double-click the Evaluation node
Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click the Insert button
Figure 3.22 Specifying the Hit Condition within the Expression Builder
The condition (Vol as the target value) defining a hit was created using the Expression Builder.
Click OK
Figure 3.23 Defining the Hit Condition for CHURNED
In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.
Click Run
Figure 3.24 Gains Chart for the Voluntary Leaver Category (Interaction Enabled)
The gains chart for the Voluntary Leavers category is better (steeper in the early percentiles) than that for the Current category. For example, the top 40 model-ordered percentiles in the Training data chart contain over 85% of the Voluntary Leavers, as opposed to 75.3% when we looked at Current Customers.
Click File…Close to close the Evaluation chart window
To save this stream for later work:
Click File…Save Stream As
Move to the c:\Train\ModelerPredModel directory
Type C5 in the File name: text box
Click Save
3.7 Understanding the Most Important Factors in Prediction

An advantage of rule induction models over neural networks is that the decision tree form makes it clear which fields have an impact on the predicted field. There is no great need to use alternative methods, such as web plots and histograms, to understand how the rule is working. Of course, you may still use the techniques we will demonstrate in Lesson 4 for neural networks to help understand the model, but they are often not needed. As we noted, predictor importance is calculated on the testing data partition. In addition to using that information, viewing the tree also provides information about importance, as the most important fields in the predictions can be thought of as those that divide the tree in its earliest stages. Thus in
this example the most important field in predicting churn is LOCAL. Once the model divided the data into two groups, those who do more local calling and those who do less, it focuses separately on each group to determine which predictors determine whether a customer will remain loyal to the company, voluntarily leave, or even be dropped as a customer. The process continues until the nodes either cannot be refined any further or a stopping rule is reached, which causes tree growth to stop. In Figure 3.25 we show the C5.0 model node with an expanded tree along with the predictor importance chart. The order in which splits occur in the tree parallels the relative importance of the fields.

Figure 3.25 Expanded Tree and Predictor Importance Chart
3.8 Further Topics on C5.0 Modeling

Now that we have introduced you to the basics of C5.0 modeling, we will discuss the Expert options, which allow you to refine your model even further. This time, we will use an existing stream rather than building one from scratch.
Click File…Open Stream
Double-click on DecisionTrees.str
The simple options within the C5.0 node allow you to use Boosting, specify the Expected noise (%) and whether the resulting tree favors Accuracy or Generality. Noisy (inconsistent) data contain records in which the same, or very similar, predictor values lead to different target values. While C5.0 will handle noise automatically, if you have an estimate of it, the method can take this into account (see the section on Minimum Records and Pruning for more information on the effect of specifying a noise value).
The expert mode allows you to fine-tune the rule induction process.
Double-click on the C5.0 node named CHURNED
Click the Model tab
Click the Expert Mode option button
Figure 3.26 Expert Options Available within the C5.0 Dialog (Model Tab)
When constructing a decision tree, the aim is to split the data into subsets that are, or seem to be heading toward, single-class collections of records on the target field. That is, ideally the terminal nodes contain only one category of the output field. At each point of the tree, the algorithm could potentially partition the data based on any one of the input fields. To decide which is the “best” way to partition the data—to find a compact decision tree that is consistent with the data—the algorithms construct some form of test that usually works on the basis of maximizing a local measure of progress.
Gain Ratio Selection Criterion

Within C5.0, the Gain Ratio criterion, based on information theory, is used when deciding how to partition the data. In the following sections we describe, in general terms, how this criterion measures progress. However, the reader is referred to C4.5: Programs for Machine Learning by J. Ross Quinlan (Morgan Kaufmann, San Mateo CA, 1993) for a more detailed explanation of the original algorithm.
The criterion used in the predecessors to C5.0 selected the partition that maximizes the information gain. The information gained by partitioning the data based on the categories of field X (an input or predictor field) is measured by:

GAIN(X) = INFO(DATA) − INFO_X(DATA)
where INFO(DATA) represents the average information needed to identify the class (outcome category) of a record within the total data, and INFO_X(DATA) represents the expected information requirement once the data has been partitioned according to the outcomes of the field currently being tested. The information theory that underpins the gain criterion can be summarized by the statement: “The information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm to the base 2 of that probability. So, if for example there are 8 equally probable messages, the information conveyed by any one of them is −log₂(1/8), or 3 bits.” For details on how to calculate these values the reader is referred to Lesson 2 in C4.5: Programs for Machine Learning. Although the gain criterion gives good results, it has a flaw in that it favors partitions that have a large number of outcomes; thus a categorical predictor with many values has an advantage over one with few categories. The gain ratio criterion, used in C5.0, rectifies this problem: the bias in the gain criterion is corrected by a kind of normalization in which the gain attributable to tests with many outcomes is adjusted downward. The gain ratio represents the proportion of the information generated by dividing the data in the parent node into the categories of field X that is useful, i.e., that appears helpful for classification.

GAIN RATIO(X) = GAIN(X) / SPLIT INFO_X(DATA)
where SPLIT INFO_X(DATA) represents the potential information generated by partitioning the data into n outcomes, whereas the information gain measures only the information relevant to classification. The C5.0 algorithm chooses to partition the data on the outcomes of the field that maximizes the information gain ratio. This maximization is subject to the constraint that the information gain must be large, or at least as great as the average gain over all tests examined. This constraint avoids the instability of the gain ratio criterion when the split is near trivial and the split information is thus small. Two other parameters the expert options allow you to control are the severity of pruning and the minimum number of records per child branch. In the following sections we introduce each of these in turn and give advice on their settings.
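As a concrete illustration of these quantities, the sketch below computes entropy, information gain, and gain ratio for one candidate categorical split. It is a minimal sketch of the textbook formulas, not C5.0's actual implementation (which adds the average-gain constraint and other refinements discussed above); the data and field names are invented:

import math
from collections import Counter

def entropy(labels):
    """INFO(DATA): average bits needed to identify a record's class."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(split_values, labels):
    """Information gain and gain ratio for splitting on one categorical field."""
    n = len(labels)
    groups = {}
    for v, y in zip(split_values, labels):
        groups.setdefault(v, []).append(y)
    # INFO_X(DATA): expected information after partitioning on X.
    info_x = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - info_x
    # SPLIT INFO_X(DATA): potential information of the partition itself.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Toy data: does payment method help separate churn outcomes?
pay = ["CC", "CC", "CH", "CH", "CC", "CH"]
churn = ["Current", "Current", "Vol", "Vol", "Current", "Current"]
print(gain_ratio(pay, churn))   # (gain ~0.459, gain ratio ~0.459 here)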
Pruning and Attribute Winnowing Within C5.0

Within C5.0, once the tree has been built, it can be pruned back to create a more general (and less bushy) tree. Within the expert mode, the Pruning severity option allows you to control the extent of the pruning: the higher this number, the more severe the pruning and the more general the resulting tree. The algorithm used to decide whether a branch should be pruned back toward the parent node is based on comparing the predicted errors for the “sub-tree” (i.e., the unpruned branches) with those for the “leaf” (or pruned node). Error estimates for leaves and sub-trees are calculated based on a set of
unseen cases the same size as the training set. The formula used to calculate the predicted error rate for a leaf involves the number of cases within the leaf, the number of these cases that have been incorrectly classified within the leaf, and confidence limits based on the binomial distribution. The reader is referred to Lesson 4 in C4.5: Programs for Machine Learning for a more detailed description of error estimation and pruning in general. A second phase of pruning (global pruning) is then applied by default. It prunes further based on the performance of the tree as a whole, rather than at the sub-tree level considered in the first stage of pruning. This option (Use global pruning) can be turned off, which generally results in a larger tree. After initially analyzing the data, the Winnow attributes option will discard some of the inputs to the model before building the decision tree. This can produce a model that uses fewer input fields yet maintains nearly the same accuracy, which can be an advantage in model deployment. This option can be especially effective when there are many inputs and the inputs are statistically related.
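The following sketch shows one standard way to produce the kind of pessimistic, confidence-limit-based error estimate described above: the upper limit of a binomial confidence interval on the leaf's observed error rate (here a Clopper–Pearson bound via scipy). This is an approximation for illustration, not the exact formula C5.0 uses; the 25% confidence level mirrors C4.5's default CF parameter:

from scipy.stats import beta

def pessimistic_error(n_cases, n_errors, cf=0.25):
    """Upper confidence limit on a leaf's error rate (Clopper-Pearson bound).

    n_cases: records covered by the leaf; n_errors: misclassified records;
    cf: confidence factor (smaller cf -> more pessimistic -> heavier pruning).
    """
    if n_errors >= n_cases:
        return 1.0
    # Upper bound of the one-sided (1 - cf) binomial interval.
    return beta.ppf(1 - cf, n_errors + 1, n_cases - n_errors)

# A leaf covering 12 cases with 1 error looks only ~8% wrong on the
# training data, but the pessimistic estimate is considerably higher:
print(pessimistic_error(12, 1))   # roughly 0.23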
Minimum Records per Child Branch

One other consideration when building a general decision tree is that the terminal nodes within the tree should not be too small. Within the C5.0 dialog, you control the Minimum records per child branch, which specifies that at any split point in the tree, at least two of the sub-branches must each cover at least this number of cases. The default is two cases, but increasing this number can be useful for noisy datasets and tends to produce less bushy trees.
How to Use Pruning and Minimum Records per Branch

As previously mentioned, within the C5.0 dialog the Simple mode allows you to specify both the Expected noise (%) and whether the resulting tree favors Accuracy or Generality.
• If the algorithm is set to favor Accuracy, the Pruning Severity is set to 75 and the Minimum records per branch is 2; hence, although the tree is accurate, there is a degree of generality because nodes are not allowed to contain only one record.
• If the algorithm is set to favor Generality, the Pruning Severity is set to 85 and the Minimum records per branch is 5.
• If the Expected noise (%) is used, the Minimum records per branch is set to half of this value.
Once a tree has been built using the simple options, the expert options may be used to refine the tree in these two common ways:
• If the resulting tree is large and has too many branches, increase the Pruning Severity.
• If there is an estimate of the expected proportion of noise (relatively rare in practice), set the Minimum records per branch to half of this value.
Boosting

C5.0 has a special method for improving its accuracy rate, called boosting. It works by building multiple models in a sequence. The first model is built in the usual way. Then a second model is built in such a way that it focuses especially on the records that were misclassified by the first model. Then a third model is built to focus on the second model's errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction. Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option allows you to control how many models are used for the boosted model.
While boosting might appear to offer something for nothing, there is a price. When model building is complete, more than one tree is used to make predictions. Therefore, there is no simple description of the resulting model, nor of how a single predictor affects the target field. This can be a serious deficiency, so boosting is normally used when the chief goal of an analysis is predictive accuracy, not understanding.
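To illustrate the build-reweight-vote cycle, here is a minimal AdaBoost-style sketch using scikit-learn decision stumps. C5.0's boosting is related in spirit but not identical to AdaBoost, so treat this purely as a conceptual sketch; the dataset is synthetic:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

weights = np.full(len(y), 1.0 / len(y))
models, alphas = [], []
for trial in range(10):                     # "Number of trials"
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)  # focus on currently hard records
    pred = stump.predict(X)
    err = weights[pred != y].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # model's voting weight
    weights *= np.exp(-alpha * y * pred)    # upweight misclassified records
    weights /= weights.sum()
    models.append(stump)
    alphas.append(alpha)

# A weighted vote across all models produces the final prediction.
vote = sum(a * m.predict(X) for a, m in zip(alphas, models))
print("training accuracy:", np.mean(np.sign(vote) == y))

Note how the final prediction requires consulting every model in the sequence; this is exactly why a boosted model has no single-tree description.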
Misclassification Costs

The Costs tab allows you to set misclassification costs. When using a tree to predict a categorical output, you may wish to assign costs to misclassifications (where the tree predicts incorrectly) to bias the model away from “expensive” mistakes. The Misclassifying controls allow you to specify the cost attached to each possible misclassification. The default costs are set at 1.0 to represent that each misclassification is equally costly. When unequal misclassification costs are specified, the resulting trees tend to make fewer expensive misclassifications, usually at the cost of an increased number of the relatively inexpensive misclassifications.
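The sketch below shows how a cost matrix changes the preferred prediction: instead of choosing the most probable class, a cost-sensitive classifier chooses the class with the lowest expected cost. The cost values and class probabilities are invented for illustration:

import numpy as np

classes = ["Current", "Vol", "InVol"]
# cost[i][j] = cost of predicting classes[j] when the truth is classes[i];
# here, missing a voluntary leaver (row Vol) is made 5x as expensive.
cost = np.array([[0, 1, 1],
                 [5, 0, 5],
                 [1, 1, 0]], dtype=float)

# Suppose a node's class probabilities slightly favor Current:
p = np.array([0.45, 0.40, 0.15])

expected_cost = p @ cost          # expected cost of each possible prediction
print(dict(zip(classes, expected_cost)))
print("prediction:", classes[int(np.argmin(expected_cost))])
# The most probable class is Current, but the cheapest choice is Vol,
# because misclassifying actual Vol leavers costs 5 units.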
Propensity Scores

Importance measures and propensity scores are available from the Analyze tab.
Click the Analyze tab
By default, importance scores will be calculated for model evaluation, as we have seen. There are two check boxes to request raw and adjusted propensity scores. Propensity scores are used for flag fields only, and they indicate the likelihood of the True value defined for the field.

Raw propensity scores are derived from the model based on the training data only. If the model predicts the true value (will respond), then the propensity is the same as P, where P is the probability of the prediction (often the confidence). If the model predicts the false value, then the propensity is calculated as (1 – P). Propensity scores differ from confidence scores, which apply to the model prediction, whether true or false. In cases where the prediction is false, for example, a high confidence actually means a high likelihood not to respond. Propensity scores overcome this limitation to allow easier comparison across all records. For example, a no prediction with a confidence of 0.65 translates to a raw propensity of 0.35 (or 1 – 0.65).

Raw propensities are based purely on estimates given by the model, which may be overfitted, leading to over-optimistic estimates of propensity. Adjusted propensities attempt to compensate by looking at how the model performs on the test or validation partitions and adjusting the propensities to give a better estimate accordingly. A partition is required to calculate adjusted propensities.
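The raw-propensity rule reduces to a one-line transformation, sketched here for clarity (the function name and argument conventions are illustrative):

def raw_propensity(predicted_true: bool, confidence: float) -> float:
    """Likelihood of the True outcome, per the rule described above."""
    return confidence if predicted_true else 1.0 - confidence

# A "no" prediction with confidence 0.65 -> raw propensity of 0.35.
print(raw_propensity(False, 0.65))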
Figure 3.27 Analyze Tab Options
Close the C5.0 modeling node
3.9 Modeling Categorical Outputs with Other Decision Tree Algorithms

As we saw in Table 3.1, C5.0 can only be used to model categorical targets. QUEST has the same limitation. The other two algorithms, CHAID and C&R Tree, can be used to model both categorical and continuous targets. Before we discuss how to create models with continuous targets, let's take a look at the various options for modeling categorical targets in CHAID, C&R Tree, and QUEST. You can certainly try one of these techniques on the churn.txt data to compare to C5.0; for the most part, the output format is very similar.
3.10 Modeling Categorical Outputs with CHAID

First, we'll look at the CHAID node and the options available there.
Double-click the CHAID node named CHURNED
Click the Fields tab (if necessary)
Figure 3.28 CHAID Node Dialog Fields Tab
There are four tabs to control various aspects of the modeling process. The Fields tab, as with many modeling nodes, allows you to select the target field and the predictors, or inputs. Here, the fields are already set because of the roles assigned in the Type node.
Click the Build Options tab
Figure 3.29 Objective Settings in Build Options Tab
The Build Options tab enables you to control six different areas: the overall objectives; basics of the algorithm, such as which CHAID variant to use and the tree depth; stopping rules; the costs of making an error; how to combine ensembles of models; and advanced statistical specifications, such as the significance level used for splitting a node. The default objective is to build a single tree. You can set one of two modes: Generate model builds the model; Launch interactive session launches the Interactive Tree feature, which we will discuss in a later section. Multiple trees can be built by using bagging or boosting (see Lesson 4 for a discussion of these options). If you have a server connection, CHAID can create models for very large datasets by dividing the data into smaller data blocks and building a model on each block. The most accurate models are then automatically selected and combined into a single final model.
Click the Basics settings
Figure 3.30 Basics Settings in Build Options Tab
The options on the Basics panel enable you to choose the algorithm. For a single CHAID tree model, there are two methods: standard or exhaustive CHAID. The latter is a modification of CHAID designed to address some of its weaknesses. Exhaustive CHAID examines more possible splits for a predictor, thus improving the chances of finding the best predictor (at the cost of additional processing time). The Maximum tree depth specifies the maximum number of levels below the root node; the default depth is 5. Since CHAID doesn't prune back a bushy tree, the user can specify the depth with the Custom setting. This setting should depend on the size of the data file, the number of predictors, and the complexity of the desired tree.
Click the Stopping Rules settings
Figure 3.31 Stopping Rules Settings in Build Options Tab
The options on the Stopping Rules panel enable you to specify the rules that cease splitting nodes in the tree. You set the minimum branch sizes to prevent splits that would create very small subgroups. These can be specified either as an absolute number of records or as a percentage of the total number of records. By default, a parent branch to be split must contain at least 2% of the records; a child branch must contain at least 1%. It is often more convenient to work with an absolute number of records rather than a percent, but in either case you will very likely modify these values to get a smaller, or larger, tree.
Click the Ensembles settings
Figure 3.32 Ensembles Settings in Build Options Tab
These settings determine the behavior of the ensembling that occurs when boosting, bagging, or very large datasets are requested in Objectives. Options that do not apply to the selected objective are ignored. For bagging and very large datasets, rules must be used to combine the predictions from two or more models. The rules differ depending upon whether the target is categorical or continuous: voting is used for the former and the mean of the predictions for the latter. Other options are available. Boosting always uses a weighted majority vote to score categorical targets and a weighted median to score continuous targets. When using boosting or bagging, by default 10 models or bootstrap samples, respectively, are created. This number is usually sufficient but can be changed.
Click the Advanced settings
Figure 3.33 Advanced Settings in Build Options Tab
To select the predictor for a split, CHAID uses a chi-square test on the table defined at each node by a predictor and the target field. CHAID chooses the predictor that is the most significant (smallest p value). If that predictor has more than 2 categories, CHAID compares them and collapses together those categories that show no differences in the target. This category merging process stops when all remaining categories differ at the specified testing level (Significance level for splitting:). It is possible for CHAID to split merged categories, controlled by the Allow resplitting of merged categories check box. (Note that a categorical predictor with more than 127 discrete categories will be ignored by CHAID.) There is a comparable significance level for merging. For continuous predictors, the values are binned into a maximum of 10 groups, and then the same tabular procedure is followed as for flag and categorical types.

Because many chi-square tests are performed, CHAID automatically adjusts its significance values when testing the predictors. These are called Bonferroni adjustments and are based on the number of tests. You should normally leave this option turned on; in small samples or with only a few predictors, you could turn it off to increase the power of your analysis.

The Overfit prevention set (%) setting controls the percent of records that are internally separated into an overfit prevention set. This is an independent set of data used to track errors during training in order to prevent the tree from modeling chance variation in the data. The default is 30%. This setting is unrelated to the separation of data before modeling into training and testing partitions. The
modeling is done only with the training data, and thus the separation into an overfit prevention set is done only within the training data.

Note: The overfit prevention data split is not used by CHAID, but instead by C&R Tree and QUEST. For those decision tree methods, the overfit set is used during tree pruning.

Unlike other models, CHAID uses missing, or blank, values when growing a tree. All blank values are placed in a missing category that is treated like any other category for nominal predictors. For ordinal and continuous predictors, the process of handling blanks is a bit different, but the effect is the same (see the PASW Modeler 14.0 Algorithms Guide for detailed information). If you don't want to include blank data in a model, it should be removed beforehand.
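To make the selection step concrete, this sketch picks a split field the way CHAID's first stage does: cross-tabulate each categorical predictor against the target, compute a chi-square p value, apply a simple Bonferroni-style correction, and take the smallest. It is a bare-bones approximation (real CHAID also merges categories, bins continuous fields, and uses more refined adjustments); the data are invented:

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "SEX":      ["M", "F", "M", "F", "M", "F", "M", "F"],
    "PAY_MTHD": ["CC", "CC", "CH", "CH", "CC", "CH", "CC", "CH"],
    "CHURNED":  ["Current", "Vol", "Current", "Vol",
                 "Current", "Vol", "Vol", "Current"],
})

predictors = ["SEX", "PAY_MTHD"]
results = {}
for field in predictors:
    table = pd.crosstab(df[field], df["CHURNED"])
    _, p, _, _ = chi2_contingency(table)
    # Crude Bonferroni adjustment: multiply p by the number of tests.
    results[field] = min(1.0, p * len(predictors))

best = min(results, key=results.get)
print(results, "-> split on", best)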
3.11 Modeling Categorical Outputs with C&R Tree

We move next to the C&R Tree node to predict a categorical output field.
Click Cancel to close the CHAID dialog
Double-click on the C&R Tree node named CHURNED
Click the Build Options tab
The same settings options are available for C&R Tree as for CHAID in the Objective settings. It is also possible, as with CHAID, to grow a tree interactively.
Figure 3.34 Classification and Regression Trees (C&R Tree) Build Options Tab
Click the Basics settings
Figure 3.35 Basics Settings for Build Options Tab
The default tree depth, as with CHAID, is five levels below the root node.
Pruning within C&RT

The Prune tree to avoid overfitting check box invokes pruning. The Maximum difference in risk (in Standard Errors) setting allows C&R Tree to select the simplest tree whose risk estimate (the proportion of errors the tree model makes when equal misclassification costs and empirical priors are used) is close to that of the subtree with the smallest risk. The value in the text box indicates how many standard errors of difference are allowed in the risk estimate between the final tree and the tree with the smallest risk. As this value is increased, the pruning becomes more severe.
Surrogates

Surrogates are used to deal with missing values on the predictors. For each split in the tree, C&R Tree identifies the input fields (the surrogates) that are most similar statistically to the selected split field. When a record to be classified has a missing value for a split field, its value on a surrogate field can be used to make the split. The Maximum surrogates option controls how many surrogate predictor fields will be stored at each node. Retaining more surrogates slows processing, and the default (5) is usually adequate.
The Stopping Rules settings are identical to those for CHAID, so we won't review them here. We do note that, unlike CHAID, pruning is an important component of C&R Tree; so although the default values may seem small, pruning can trim back some of the small branches.
Click the Costs & Priors settings
Figure 3.36 Costs & Priors Settings for Build Options Tab
The misclassification costs are identical to those for the other decision tree models we have discussed.
Priors in C&RT

Historically, priors have been used to incorporate knowledge about the base population rates (here, of the output field categories) into the analysis. Breiman et al. (1984) point out that if one target category has twice the prior probability of another, this effectively doubles the cost of misclassifying a case from the first category, since it is counted twice. Thus by specifying a larger prior probability for a response category, you can effectively increase the cost of its misclassification. Since priors are only given at the level of the base rate for the output field categories (with J categories there are J prior probabilities), use of them implies that the misclassification of a record
actually in output category j has the same cost regardless of the category into which it is misclassified (that is, C(j) = C(k|j) for all k not equal to j). By default, the prior probabilities are set to match the probabilities found in the training data. The Equal for all classes option allows you to set all priors equal (this might be used if you know your sample does not represent the population and you don't know the population distribution on the target), and you can enter prior probabilities yourself (Custom option). The prior probabilities should sum to 1; if you enter custom priors that reflect the desired proportions but do not sum to 1, the Normalize button will adjust them. Finally, priors can be adjusted based on the misclassification costs (see Breiman's comment above) entered in the Costs tab. The Ensembles settings are identical to those for CHAID, so we don't review them.
Click the Advanced settings
Figure 3.37 Advanced Settings in Build Options Tab
Impurity Criterion

The criterion that guides tree growth in C&R Tree with a categorical output field is called impurity. It captures the degree to which responses within a node are concentrated in a single output category. A pure node is one in which all cases fall into a single output category, while a node with the maximum impurity value would have the same number of cases in each output category. Impurity can be defined in a number of ways, and two alternatives are available within the C&R Tree procedure. The default, and more popular, measure is the Gini measure of dispersion. If P(t)_i is the proportion of cases in node t that are in output category i, then the Gini measure is:

Gini = 1 − Σ_i P(t)_i²

Alternatively:

Gini = Σ_{i≠j} P(t)_i P(t)_j
If two nodes have different distributions across three response categories (for example (1, 0, 0) and (1/3, 1/3, 1/3)), the one with the greater concentration of responses in a single category (the first one) will have the lower impurity value: for (1, 0, 0) the impurity is 1 − (1² + 0² + 0²), or 0; for (1/3, 1/3, 1/3) the impurity is 1 − ((1/3)² + (1/3)² + (1/3)²), or .667. The Gini measure ranges between 0 and 1, although the maximum value is a function of the number of output categories.

Thus far we have defined impurity for a single node. It can be defined for a tree as the weighted average of the impurity values from the terminal nodes. When a node is split into two child nodes, the impurity for that branch is simply the weighted average of their impurities. Thus if the two child nodes resulting from a split have the same number of cases and their individual impurities are .4 and .6, their combined impurity is .5 × .4 + .5 × .6, or .5. When growing the tree, C&R Tree splits a node on the predictor that produces the greatest reduction in impurity (comparing the impurity of the parent node to the impurity of the child nodes). This change in impurity from a parent node to its child nodes is called the improvement, and in the Advanced settings you can specify the minimum change in impurity required for tree growth to continue. The default value is .0001, and if you are considering modifying this value, you might calculate the impurity at the root node (from the overall output proportions) to establish a point of reference.

The problems with using impurity as a criterion for tree growth are that you can almost always reduce impurity by enlarging the tree, and any tree will have 0 impurity if it is grown large enough (if every node has a single case, impurity is 0). To address these difficulties, the developers of the classification and regression tree methodology (see Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Wadsworth, 1984) developed a pruning method based on a cross-validated cost complexity measure (as discussed above).

By default, the Gini measure of dispersion is used. Breiman and colleagues proposed Twoing as an alternative impurity measure. If the target has more than two output categories, Twoing will create binary splits of the response categories in order to calculate impurity. Each possible combination of output categories split into two groups will be separately evaluated for impurity with each predictor, and the best split across predictors and target category combinations is chosen. Ordered Twoing (inactive because the target field is nominal, not ordinal) applies Twoing as described above, except that the output category combinations are limited to those consistent with the rank order of the
categories. For example, if there are five output categories numbered 1, 2, 3, 4, and 5, Ordered Twoing would examine the (1,2) (3,4,5) split, but the (1,4) (2,3,5) split would not be considered, since only contiguous categories can be grouped together. Of the two methods, the Gini measure is the most commonly used. The option to prevent model overfitting is available here, as it was with CHAID.
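The sketch below computes the Gini impurity and the improvement for a candidate binary split, following the formulas and worked example above; the proportions and node sizes are illustrative:

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions in a node."""
    return 1.0 - sum(p * p for p in proportions)

def improvement(parent, left, right, n_left, n_right):
    """Reduction in impurity from splitting the parent into two children."""
    n = n_left + n_right
    child_impurity = (n_left / n) * gini(left) + (n_right / n) * gini(right)
    return gini(parent) - child_impurity

# A pure node has impurity 0; three equal categories give ~.667.
print(gini([1, 0, 0]))               # 0.0
print(gini([1/3, 1/3, 1/3]))         # ~0.667
# Splitting a parent with impurity .62 into equal-sized children with
# impurities .34 and .62 gives a combined impurity of .48, so the
# improvement is .14.
print(improvement(parent=[0.5, 0.3, 0.2],
                  left=[0.8, 0.1, 0.1], right=[0.2, 0.5, 0.3],
                  n_left=100, n_right=100))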
3.12 Modeling Categorical Outputs with QUEST

We turn next to the QUEST node for predicting a categorical field.
Click Cancel to close the C&R Tree dialog
Double-click on the QUEST node named CHURNED
Click the Build Options tab
QUEST (Quick, Unbiased, Efficient Statistical Tree) is a binary classification method that was developed, in part, to reduce the processing time required for large C&R Tree analyses with many fields and/or records. It also tries to reduce the tendency in decision tree methods to favor predictors that allow more splits (see Loh and Shih, 1997).
Figure 3.38 QUEST Build Options Tab
The settings areas are the same as for C&R Tree, and in the Objectives settings the same selections are available as for the other methods.
Click the Advanced settings
Figure 3.39 QUEST Advanced Settings in Build Options Tab
QUEST separates the tasks of predictor selection and splitting at a node. Like CHAID, it uses statistical tests to pick a predictor at a node. For each continuous or ordinal predictor, QUEST performs an analysis of variance and uses the significance of the F test as the criterion. For nominal predictors (measurement levels flag and nominal), chi-square tests are performed. The predictor with the smallest significance value from either the F or chi-square test is selected. Although not evident from the dialog box options, Bonferroni adjustments are made, as with CHAID (these are not under user control). QUEST is more efficient than C&R Tree because not all splits are examined, and category combinations are not tested when evaluating a predictor for selection. After selecting a predictor, QUEST determines how the field should be split (into two groups) by doing a quadratic discriminant analysis, using the selected predictor on groups formed by the target categories. The details are rather complex and can be found in Loh and Shih (1997). The measurement level of the predictor determines how it is treated in this method. While quadratic discriminant analysis allows for unequal variances in the groups, and so makes one fewer assumption than linear discriminant analysis, it does assume that the distribution of the data is multivariate normal, which is unlikely for predictors that are flags and sets.
QUEST uses an alpha (significance) value of .05 for splitting in the discriminant analysis, and you can modify this setting. For large files you may wish to reduce alpha to .01, for example.
Pruning, Stopping, Surrogates

QUEST follows the same pruning rule as C&R Tree, using a cost-complexity measure that takes into account the increase in error if a branch is pruned, with a standard error rule. The Stopping choices are the same as for CHAID and C&R Tree. QUEST also uses surrogates to allow predictions for missing values, employing the same methodology as C&R Tree.
3.13 Predicting Continuous Fields

Two of the decision tree models, CHAID and C&R Tree, can predict a continuous field. We will briefly review the options available for this type of target and then run an example.
Continuous Outputs with C&R Tree

When a continuous field is used as the target field in C&R Tree, the algorithm runs in the way described earlier in this lesson. For a continuous output field (the regression trees portion of the algorithm), the impurity criterion is still used but is based on a measure appropriate for a continuous field: the within-node variance. It captures the degree to which records within a node are concentrated around a single value. A pure node is one in which all cases have the same output value, while a node with a large impurity value (in principle, the theoretical maximum would be infinity) would contain cases with very diverse values on the output field. For a single node, the variance (the standard deviation squared) of the output field is calculated from the records within the node. When generating a prediction, the algorithm uses the average value of the target field within the terminal node.
Click File…Open Stream
Double-click on CRTree.str
Figure 3.40 PASW Modeler Stream with C&R Tree Model Node (Continuous Output Field)
The data file consists of patient admissions to a hospital. The goal is to build a model predicting the insurance claim amount based on hospital length of stay, severity of illness group, and patient age.
Right-click on the Table node and select Run
Review the data and then close the Table window
Double-click on the C&R Tree node (labeled CLAIM)
Click the Build Options tab
The Build Options tab for the C&R Tree dialog with a continuous output is the same as for a categorical output. However, if we explore the various settings we find that those which are not relevant for a continuous output field (priors and misclassification costs) are inactive. Otherwise, setting up the model-building parameters, and executing the tree, is identical to the process for a categorical target. The generated model will display the predicted mean of the insurance claim amount in each terminal node.

Figure 3.41 C&R Tree Build Options Tab for a Continuous Output Field
We will also illustrate the use of interactive tree-building in this example.
Click the Launch interactive session option button
Click Run
Figure 3.42 Interactive Tree Builder to Predict Claim with C&R Tree
Because the target field is continuous, different statistics appear in the nodes. The nodes display the mean, the number of cases, and the percentage of the sample. Thus, the mean insurance claim for persons in this data file is slightly over $4680, which we would predict for each person if we didn't know how long they stayed in the hospital, their age, or how severely ill they were. Once the tree is grown, we should get some insight into the characteristics of patients that separate high insurance claims from low ones. As before, we are using about 70% of the data to fit the model and reserving the remainder to test for overfitting. For C&R Tree, the overfit data is used to prune the tree. We could grow the tree one level at a time. However, if we do so, the tree can't then be easily pruned, and pruning is an important feature of creating a tree that will perform as well on new data.
Right-click on Node 0 and select Grow Tree and Prune
Figure 3.43 Interactive Tree Builder Fully Grown and Pruned Tree
The results indicate that LOS (Length of Stay) is the best predictor. The average insurance claim for persons who stay more than 2 days in the hospital is $5637.07, while the mean claim for persons who spent 1 or 2 days in the hospital is $4369.51. This certainly makes sense. The predictions are further refined by the split on ASG (severity of illness) for those with a length of stay of 1 or 2 days. This field is coded 0, 1, and 2. Those with the lowest severity (0), in Node 3, have a mean insurance claim of $4194.15; those who have more severe illnesses have a mean claim of $4659.76. You can make different splits in the tree based on business information, or just to try alternative trees. To see this option:
Right-click on Node 0 and select Grow Branch with Custom Split
Figure 3.44 Choosing a Custom Split with C&R Tree
The current best split, on LOS, is listed, along with the values that will be used to split the tree. You can change the split values by selecting the Custom option button. The Predictors dialog shows the top predictors and the Improvement value for the optimal split on each.
Click the Predictors button
Figure 3.45 Predictors and Improvement Values for that Split
You can choose another predictor here to split the node. In this instance, the Improvement value for LOS is clearly the largest, so it would be better to retain it as the first split.
Click Cancel, and then Cancel again
For a continuous output, we can investigate model performance in several ways. No model is automatically created when using interactive mode, so we need to generate a model first.
Click Generate…Generate Model
Click OK in the Generate New Model dialog
Figure 3.46 Generate New Model Dialog
Close the Interactive Tree window
Move the generated model CLAIM1 near the Type node
Connect the Type node to the generated model CLAIM1
Add an Analysis node to the stream
Connect the Analysis node to the generated model CLAIM1
Run the Analysis node
Figure 3.47 Analysis Output for C&R Tree Model to Predict CLAIM
The Mean Absolute Error is perhaps the most useful statistic here; it has a value of 824.11. This is the amount of model prediction error, on average. The mean value of CLAIM is about 4,631, so the average error is somewhat under 20% of the mean. The analyst has to decide whether this is sufficiently accurate, given the goals of the data-mining project. The correlation between the predicted and actual values of CLAIM is .507, which is good, but not outstanding. On the other hand, there are only three terminal nodes in the tree, and so only three predicted values of CLAIM. Given that, this correlation isn't bad at all. You can explore the model further by using other nodes to see the relationship between the predictors and the value of $R-CLAIM. However, the decision tree is very clear as to how the predictors are used to make predictions. You may, though, want to investigate how AGE relates to $R-CLAIM, since AGE isn't used in the tree (you would find there is essentially no relationship, hence the reason it wasn't used).
Close the Analysis window
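The two statistics discussed above are simple to reproduce outside Modeler; the sketch below computes them for a small set of hypothetical actual claim values, with predictions drawn from the three terminal-node means shown in the tree (note that with only three distinct predicted values the correlation is naturally limited):

import numpy as np

actual    = np.array([4200.0, 4950.0, 5800.0, 4100.0, 5450.0, 4700.0])
predicted = np.array([4194.15, 4659.76, 5637.07, 4194.15, 5637.07, 4659.76])

mae = np.mean(np.abs(actual - predicted))    # Mean Absolute Error
r = np.corrcoef(actual, predicted)[0, 1]     # linear correlation
print(f"MAE = {mae:.2f}, r = {r:.3f}")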
Continuous Outputs with CHAID

When CHAID is used with a continuous target, the overall approach is identical to what we have discussed above, but the specific tests used to select predictors and merge categories differ.
An analysis of variance test is used for predictor selection and for merging categories, with the target as the dependent variable. Nominal and ordinal predictors are used in their untransformed form. Continuous predictors are binned, as described above for CHAID, into at most 10 categories; then an analysis of variance test is used on the transformed field. The field with the lowest p value for the ANOVA F test is selected as the best predictor at a node, and the splitting and merging of categories proceed based on additional F tests. If you are interested, you can try CHAID with this same data and target.
Summary Exercises
The exercises in this lesson use the data file churn.txt that was also used for examples in this lesson. The following table provides details about the file. churn.txt contains information from a telecommunications company. The data are comprised of customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization or leave for another company. The file contains the following fields:

ID: Customer reference number
LONGDIST: Time spent on long distance calls per month
International: Time spent on international calls per month
LOCAL: Time spent on local calls per month
DROPPED: Number of dropped calls
PAY_MTHD: Payment method of the monthly telephone bill
LocalBillType: Tariff for locally based calls
LongDistanceBillType: Tariff for long distance calls
AGE: Age
SEX: Gender
STATUS: Marital status
CHILDREN: Number of Children
Est_Income: Estimated income
Car_Owner: Car owner (3 categories)
CHURNED: Current (still with company), Vol (leavers who the company wants to keep), Invol (leavers who the company doesn’t want)
In these exercises we will explore the various training methods and options for the rule induction techniques within PASW Modeler. In all the exercises that follow, you can use a Partition node to split the data into Training and Testing partitions to make these modeling exercises more realistic. You may or may not want to split the data 50/50 into two partitions, depending on the number of records in the file.
1. Begin a new stream with a Var. File node connected to the file churn.txt.
2. Use C5.0 and at least one other decision tree method to predict CHURNED and compare the accuracy of both. What do you learn from this? Which rule method performs “best”?
3. Now browse the rules that have been generated by the methods. Which model appears to be the most manageable and/or practical? Do you think there is a trade-off between accuracy and practicality?
4. Try switching from Accuracy to Generality in C5.0. Does this have much effect on the size and accuracy of the tree?
5. Experiment with the model options within the methods you selected to see how they affect tree growth. Can you increase the accuracy without making the model overly complicated?
Experiment with the minimum values for parent and child nodes and see how this influences the size of the tree.
6. In C5.0, use the Winnow attributes expert option and see if it reduces the number of inputs used in the model (hint: for an easier comparison, generate a Filter node from the generated C5.0 node with Winnow attributes checked and with it unchecked).
7. Of all the models you have run, which do you think is the “best”? Why?
You may wish to save the stream (use the name Exer3.str) that you have just created.
Lesson 4: Neural Networks

Overview
• Describe the structure and types of neural networks
• Build a neural network
• Browse and interpret the results
• Evaluate the model
• Illustrate the use of bagging and boosting
Data
In this lesson we will use the dataset churn.txt. We continue to use a Partition node to divide the cases into two partitions (subsamples), one to build or train the model and the other to test the model (often called a holdout sample).
4.1 Introduction to Neural Networks
Historically, neural networks attempted to solve problems using methods modeled on how the brain was envisioned to operate. Today they are simply viewed as powerful modeling techniques. A typical neural network consists of several neurons arranged in layers to create a network. Each neuron can be thought of as a processing element that is given a simple part of a task. The connections between the neurons provide the network with the ability to learn patterns and interrelationships in data. The figure below gives a simple representation of a common neural network (a Multi-Layer Perceptron).
Figure 4.1 Simple Representation of a Common Neural Network
When using neural networks to perform predictive modeling, the input layer contains all of the fields used to predict the target. The output layer contains an output field: the target of the prediction. The input and output fields can be continuous or categorical (in PASW Modeler, categorical fields are transformed into a numeric form (dummy or binary set encoding) before processing by the network). The hidden layer contains a number of neurons at which outputs from the previous layer combine.
A network can have any number of hidden layers, although these are usually kept to a minimum. All neurons in one layer of the network are connected to all neurons within the next layer.
While the neural network is learning the relationships between the data and results, it is said to be training. Once fully trained, the network can be given new, unseen data and can make a decision or prediction based upon its experience. When trying to understand how a neural network learns, think of how a parent teaches a child how to read. Patterns of letters are presented to the child and the child makes an attempt at the word. If the child is correct she is rewarded, and the next time she sees the same combination of letters she is likely to remember the correct response. However, if she is incorrect, then she is told the correct response and tries to adjust her response based on this feedback. Neural networks work in the same way.
4.2 Training Methods
One of the advantages of PASW Modeler is the ease with which you are able to build a neural network without in fact knowing too much about how the algorithms work. Nevertheless, it helps to understand a bit about these methods, so we will begin by briefly describing the two different types of training methods: a Multi-Layer Perceptron (MLP) and a Radial Basis Function Network (RBFN) model.
As was noted above, a neural network consists of a number of processing elements, often referred to as “neurons,” that are arranged in layers. Each neuron is linked to every neuron in the previous layer by connections that have strengths or weights attached to them. The learning algorithm controls the adaptation of these weights to the data; this gives the system the capability to learn by example and generalize for new situations.
The main consideration when building a network is to locate the best, or global, solution within a domain; however, the domain may contain a number of sub-optimal solutions. The global solution can be thought of as the model that produces the least possible error when records are passed through it. To understand the concept of global error, imagine a graph created by plotting the hidden weights within the neural network against the error produced. Figure 4.2 gives a simple representation of such a graph. With any complex problem there may be a large number of feasible solutions, thus the graph contains a number of sub-optimal solutions or local minima (the “valleys” in the plot). The trick to training a successful network is to locate the overall minimum or global solution (the lowest point), and not to get “stuck” in one of the local minima or sub-optimal solutions.
Figure 4.2 Representation of the Error Domain Showing Local and Global Minima
There are many different types of supervised neural networks (that is, neural networks that require both inputs and an output field). However, within the world of data mining, two are most frequently used: the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). In the following paragraphs we will describe the main differences between these types of networks and describe their advantages and disadvantages.
4.3 The Multi-Layer Perceptron
The MLP network consists of layers of neurons, with each neuron linked to all neurons in the previous layer by connections of varying weights. All MLP networks consist of an input layer, an output layer and at least one hidden layer. The hidden layer is required to perform non-linear mappings. The number of neurons within the system is directly related to the complexity of the problem, and although a multi-layered topology is feasible, in practice there is rarely a need to have more than one hidden layer.
Within a Multi-Layer Perceptron, each hidden layer neuron receives an input based on a weighted combination of the outputs of the neurons in the previous layer. The neurons within the final hidden layer are, in turn, combined to produce an output. This predicted value is then compared to the correct output and the difference between the two values (the error) is fed back into the network, which in turn is updated. This feeding of the error back through the network is referred to as back-propagation.
To illustrate this process we will take the simple example of a child learning the difference between an apple and a pear. The child may decide in making a decision that the most useful factors are the shape, the color and the size of the fruit—these are the inputs. When shown the first example of a fruit she may look at the fruit and decide that it is round, red in color and of a particular size. Not knowing what an apple or a pear actually looks like, the child may decide to place equal importance on each of these factors—the importance is what a network refers to as weights. At this stage the child is most likely to randomly choose either an apple or a pear for her prediction. On being told the correct response, the child will increase or decrease the relative importance of each of the factors to improve her decision (reduce the error). In a similar fashion a MLP network begins with random weights placed on each of the inputs. On being told the correct response, the network
adjusts these internal weights. In time, the child and the network will hopefully make correct predictions.
To visualize how a MLP works, imagine a problem where you wish to predict a target field consisting of two groups, using only two input fields. Figure 4.3 shows a graph of the two input fields plotted against one another, overlaid with the target. Using a non-linear combination of the inputs, the MLP fits an open curve between the two classes.
Figure 4.3 Decision Surface Created Using the Multi-Layer Perceptron
The advantages of using a MLP are:
• It is effective on a wide range of problems
• It is capable of generalizing well
• If the data are not clustered in terms of their input fields, it will classify examples in the extreme regions
• It is currently the most commonly used type of network and there is much literature discussing its applications
The disadvantages of using a MLP are:
• It can take a great deal of time to train
• It does not guarantee finding the best global solution
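Modeler hides the arithmetic of back-propagation, but a minimal sketch may make the weight adjustment concrete. The Python below trains a tiny one-hidden-layer network on synthetic data; the layer sizes, learning rate, and squared-error criterion are illustrative choices, not Modeler's internals:

```python
import numpy as np

rng = np.random.default_rng(42)              # random initial weights, as described above
X = rng.normal(size=(100, 2))                # two input fields
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # a non-linearly separable target

W1 = rng.normal(scale=0.5, size=(2, 8))      # input -> hidden weights (8 hidden neurons)
W2 = rng.normal(scale=0.5, size=(8, 1))      # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    hidden = sigmoid(X @ W1)                 # forward pass through the hidden layer
    pred = sigmoid(hidden @ W2)              # network prediction
    delta_out = (pred - y) * pred * (1 - pred)         # error fed back from the output
    delta_hid = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 -= 0.5 * hidden.T @ delta_out / len(X)          # adjust weights to reduce error
    W1 -= 0.5 * X.T @ delta_hid / len(X)
```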
4.4 The Radial Basis Function
The Radial Basis Function (RBF) is a more recent type of network and is responsive to local regions within the space defined by the input fields. Figure 4.4 shows a graphical representation of how a RBF fits a number of basis functions to the problem described in the previous section. The RBF can be thought of as performing a type of clustering within the input space, encircling individual clusters of data by a number of basis functions. If a data point falls within the region of activation of a particular basis function, then the neuron corresponding to that basis function responds most strongly. The concept of the RBF is extremely simple; however, the selection of the centers of each basis function is where difficulties arise.
Figure 4.4 Operation of a Radial Basis Function
The advantages of using a RBF network are:
• It is quicker to train than a MLP
• It can model data that are clustered within the input space
The disadvantages of using a RBF network are:
• It is difficult to determine the optimal position of the function centers
• The resulting network often has a poor ability to represent the global properties of the data
Within PASW Modeler the RBF algorithm uses the K-means clustering algorithm to determine the number and location of the centers in the input space.
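To make the idea concrete, here is a minimal Python sketch of an RBF network built the same way: K-means locates the centers, basis activations are computed, and a simple readout is fit on them. The scikit-learn calls and the basis width are illustrative assumptions; Modeler's implementation differs in detail:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def rbf_features(X, centers, width=1.0):
    # One basis function per center; the response decays with distance
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

X = np.random.default_rng(0).normal(size=(200, 2))
y = (X ** 2).sum(axis=1) > 1.5               # a target with clustered structure

centers = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X).cluster_centers_
model = LogisticRegression().fit(rbf_features(X, centers), y)
```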
4.5 Which Method?
Due to the random nature of neural networks, the models built using each of the algorithms will tend to perform with varying degrees of accuracy depending on the initial weights and starting positions. When building neural networks it is sensible to try both algorithms and either choose the one with the best overall performance, or use both models to gain a majority prediction.
4.6 The Neural Network Node
The Neural Net node is used to create a neural network and can be found in the Modeling palette. Once trained, a Generated Neural Net node labeled with the name of the predicted field will appear in the Generated Models palette and on the stream. This node represents the trained neural network. Its properties can be browsed and new data can be passed through this node to generate predictions. We can use the stream saved in Lesson 3.
Click File…Open Stream
Move to the c:\Train\ModelerPredModel directory
Double-click on C5.str
Delete the C5.0 modeling node and the C5.0 generated model node, leaving the other nodes
Run the Table node
Click File…Close to close the Table window
Double-click the Type node
Figure 4.5 Type Node Ready for Modeling
Notice that ID will be excluded from any modeling as the role is automatically set to None for a Typeless field. The CHURNED field will be the target field for any predictive model and all fields but ID and Partition will be used as predictors.
Click OK
Place a Neural Net node from the Modeling palette to the right of the Type node
Connect the Type node to the Neural Net node
Figure 4.6 Neural Net Node (CHURNED) Added to Data Stream
Notice that once the Neural Net node is added to the data stream, its name becomes CHURNED, the field we wish to predict.
Double-click the Neural Net node
Click the Build Options tab (if necessary)
Figure 4.7 Neural Net Dialog: Build Options Tab
The Build Options tab enables you to control five different areas, including overall objectives, the specific neural net algorithm, stopping rules, how to combine ensembles of models, and advanced statistical specifications, including how to handle missing data and the size of an overfit prevention set.
The default objective is to build a new single neural network. You can instead use boosting or bagging (explained in sections below) to enhance model accuracy or stability. If you have a server connection, Neural Net can create models for very large datasets by dividing the data into smaller data blocks and building a model on each block. The most accurate models are then automatically selected and combined into a single final model.
Click the Basics settings
Figure 4.8 Basics Settings in Build Options Tab
The options on the Basics panel enable you to choose one of two algorithms. The multilayer perceptron allows for more complex relationships at the possible cost of increasing the training and scoring time. The radial basis function may have lower training and scoring times, at the possible cost of reduced predictive power compared to the MLP.
The hidden layer(s) of a neural network contain unobservable neurons (units). The value of each hidden unit is some function of the predictors; the exact form of the function depends in part upon the network type. A multilayer perceptron can have one or two hidden layers; a radial basis function network can only have one hidden layer. By default the model will choose the best number of hidden units in each hidden layer, although you can specify this yourself. Normally, it is best to allow the algorithm to make this choice.
Click the Stopping Rules settings
Figure 4.9 Stopping Rules Settings in Build Options Tab
This area allows you to control the rules that determine when to stop training multilayer perceptron networks; these settings are ignored when the radial basis function algorithm is used. Training proceeds through at least one cycle (data pass), and can then be stopped based on three criteria. By default, PASW Modeler stops when it appears to have reached its optimally trained state; that is, when accuracy on the (internal) test dataset seems to no longer improve. Alternatively, you can set a required accuracy value, a limit to the number of cycles through the data, or a time limit in minutes. We use the defaults in these examples.
Click the Ensembles settings
Figure 4.10 Ensembles Settings in Build Options Tab
The settings in this section are used when boosting, bagging, or very large datasets are modeled, as requested in the Objectives section. In these cases, two or more models need to be combined to make a prediction.
Ensemble predicted values for categorical targets can be combined using voting, highest probability, or highest mean probability. Voting selects the category that has the highest probability most often across the base models. Highest probability selects the category that achieves the single highest probability across all base models. Highest mean probability selects the category with the highest value when the category probabilities are averaged across the individual models. Ensemble predicted values for continuous targets can be combined using the mean or median of the predicted values from the individual models.
You can also specify the number of base models to build for boosting and bagging; for bagging, this is the number of bootstrap samples.
Click the Advanced settings
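The three combining rules for categorical targets can be illustrated with a small numeric example. The class probabilities below are hypothetical, chosen so that the rules can disagree:

```python
import numpy as np

# One row of class probabilities per base model (hypothetical values
# for a three-category target such as CHURNED)
probs = np.array([[0.5, 0.3, 0.2],
                  [0.4, 0.4, 0.2],
                  [0.2, 0.7, 0.1]])

voting = np.bincount(probs.argmax(axis=1), minlength=3).argmax()   # most frequent winner
highest_prob = np.unravel_index(probs.argmax(), probs.shape)[1]    # single largest probability
highest_mean_prob = probs.mean(axis=0).argmax()                    # largest averaged probability
print(voting, highest_prob, highest_mean_prob)                     # 0, 1, 1: the rules differ
```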
Figure 4.11 Advanced Settings in Build Options Tab
Over-training is one of the problems that can occur within neural networks. As the data pass repeatedly through the network, it is possible for the network to learn patterns that exist in the sample only and thus over-train. That is, it will become too specific to the training sample data and lose its ability to generalize.
By selecting the Overfit prevention set option (checked by default), only a randomly selected proportion of the training data is used to train the network (this is separate from a holdout sample created in the Partition node). By default, 70% of the data is selected for training the model, and 30% for testing it. Once the training proportion of data has made a complete pass through the network, the rest is reserved as a test set to evaluate the performance of the current network. By default, this information determines when to stop training and provides feedback information. We advise you to leave this option turned on. Note that with a Partition node in use, and with Overfit prevention set turned on, the Neural Net model will be trained on 70 percent of the training sample selected by the Partition node, and not on 70% of the entire dataset.
Since the neural network initializes itself with random weights, the behavior of the network can be reproduced by setting the same random seed by checking the Replicate Results checkbox (selected by default) and specifying a seed. It is advisable to run several trials on a neural network to ensure that
you obtain similar results using different random seed starting points. The Generate button will create a pseudo-random integer between 1 and 2147483647, inclusive.
The Neural Net node requires valid values for all input fields. There are two options to handle missing values for the predictors (records with missing values on the target are always ignored). By default, listwise deletion will be used, which deletes any record with a missing (blank) value on one or more of the input fields. As an alternative, Modeler can impute the missing data. For categorical fields, the most frequently occurring category (the mode) is substituted; for continuous fields, the average of the minimum and maximum observed values is substituted.
It is important to check the data in advance of running a Neural Net, using the Data Audit node, to decide which records and fields should be passed to the neural network for modeling. Otherwise you run the risk of a model being built using data values supplied by these imputation rules even if that is not your preference. Also, you can take control of the missing value imputation by using, for example, the Data Audit node to change missing values to valid values with several optional methods before using the Neural Net node.
Click the Model Options tab
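The two imputation rules are simple enough to state as code. This sketch applies them to a hypothetical pandas DataFrame; it mimics the rules described above, not Modeler's actual code:

```python
import pandas as pd

def impute_like_neural_net(df):
    """Mode for categorical fields; midpoint of min and max for continuous fields."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = (out[col].min() + out[col].max()) / 2.0  # midpoint, not the mean
        else:
            fill = out[col].mode().iloc[0]                  # most frequent category
        out[col] = out[col].fillna(fill)
    return out
```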
Figure 4.12 Model Options Tab
The Make Available for Scoring area contains options for controlling how the model is scored. The predicted value (for all targets) and confidence (for categorical targets) are always computed when the model is scored. The computed confidence can be based on the probability of the predicted value (the highest predicted probability) or the difference between the highest predicted probability and the second highest predicted probability. Propensity scores—the likelihood of the true outcome—can be created for flag targets, as we also saw for decision tree models. The model produces raw propensity scores; adjusted propensity scores are not available.
We are ready to run the Neural Net node. Although we have discussed the various options, we will use all defaults for this first model.
Click Run
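For reference, the two confidence definitions amount to the following small computation (hypothetical class probabilities):

```python
import numpy as np

probs = np.array([0.55, 0.30, 0.15])       # hypothetical predicted class probabilities
top_two = np.sort(probs)[::-1][:2]

conf_highest = top_two[0]                  # confidence as the highest probability: 0.55
conf_margin = top_two[0] - top_two[1]      # confidence as the top-two margin: 0.25
```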
4.7 Models Palette
The Neural Net model is placed in both the stream, attached to the Type node, and in the Models palette. The Models tab in the Manager holds and manages the results of the machine learning and statistical modeling operations. There are two context menus available within the Models palette. The first menu applies to the entire model palette.
Right-click in the background (empty area) in the Models palette
Figure 4.13 Context Menu in the Models Palette
This menu allows you to open a model in the palette, save the models palette and its contents, open a previously saved models palette, clear the contents of the palette, or add the generated models to the Modeling section of the CRISP-DM project window. If you use PASW® Collaboration and Deployment Services to manage and run your data mining projects, you can store the palette in, or retrieve a palette or model from, the repository. The second menu is specific to the generated model nodes.
Right-click the generated Neural Net node named CHURNED in the Models palette
Figure 4.14 Context Menu for Nodes in the Models Palette
This menu allows you to rename, annotate, and browse the generated model node. A generated model node can be deleted, exported as PMML (Predictive Model Markup Language) and stored in the PASW Collaboration and Deployment Services Repository, or saved in a file for future use.
4.8 The Neural Net Model
We can now explore the neural net model created to predict CHURNED.
Double-click the Neural Net model in the stream
The Model Tab contains five views or summaries of the model. The first is the Model Summary, which is a higher level view of model performance (see next figure). The table identifies the target, the type of neural network trained, the stopping rule that stopped training (shown if a multilayer perceptron network was trained), and the number of neurons in each hidden layer of the network. Here, eight hidden neurons were used in one hidden layer. The bar chart displays the accuracy of the final model, which is 81.1% (compare to that for the C5.0 model in the previous lesson). For a categorical target, this is simply the percentage of records for which the predicted value matches the observed value. Since we are using a Partition node, this is the percentage correct on the Training partition. The full Training partition is used for this calculation, including the overfit prevention set that was used internally during model building.
Figure 4.15 Model Summary Information for Neural Net Model
Note: Accuracy for a Continuous Target
For a categorical target the accuracy is simply the percentage correct. If the target is continuous, then accuracy within PASW Modeler is defined as the average, across all n records, of one minus the absolute prediction error scaled by the range of the target:

$$\text{Accuracy} = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{\left|\text{Target Value}_i - \text{Predicted Target Value}_i\right|}{\text{Range of Target Values}}\right)$$
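Computing this by hand for a few hypothetical records shows how the formula behaves:

```python
import numpy as np

actual = np.array([10.0, 20.0, 30.0, 40.0])      # hypothetical target values
predicted = np.array([12.0, 18.0, 33.0, 39.0])

value_range = actual.max() - actual.min()
accuracy = np.mean(1.0 - np.abs(actual - predicted) / value_range)
print(accuracy)                                   # 0.9333...
```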
Predictor Importance
Next we can look at predictor importance.
Click the Predictor Importance chart panel
For models such as neural nets, predictor importance takes on, well, added importance because there is no single equation or other representation of the model available in the generated model nugget (but we can view the coefficients, as demonstrated below). The same is true, for example, with SVM models. Predictor importance is based on sensitivity analysis, which is a method to determine how variation in the model inputs leads to variation in the predicted values. The more important a predictor, the more changes in its values change the outcome values. In PASW Modeler, importance
is calculated by sampling repeatedly from combinations of values in the distribution of the predictors and then assessing the effect on the target. Then everything is normalized to 1.0 so that the importances can be compared.
The most important predictor of CHURNED is LONGDIST, followed by LOCAL and then International. These fields all measure usage of the phone service. The first customer demographic variable in importance is SEX. You may wish to compare these fields to the ones selected by the C5.0 model in Lesson 3.
Figure 4.16 Predictor Importance in Model
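A rough way to reproduce this idea outside Modeler is a permutation-style sensitivity check: vary one input at a time and see how much the predictions change. This is a simplified stand-in for Modeler's sampling procedure, not its actual algorithm; `model` here is any fitted classifier with a predict method:

```python
import numpy as np

def importance(model, X, rng=None):
    """Permute one predictor at a time and measure how much predictions change."""
    rng = rng or np.random.default_rng(0)
    base = model.predict(X)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])       # vary this input's values
        scores.append(np.mean(model.predict(Xp) != base))
    scores = np.asarray(scores, dtype=float)
    return scores / max(scores.sum(), 1e-12)       # normalize so importances compare
```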
Predictor importance is not a substitute for exploring the model and seeing how it actually functions, as we do later, but it is a first step at reviewing a model. Next we can view how well the model performs at predicting each category of CHURNED.
Click the Classification panel
For categorical targets, this section displays the cross-classification of observed versus predicted values in a heat map, plus the percent in each cell. We look, as usual, on the diagonal to see the correct predictions. The neural net does best at predicting those in the InVol category, with lowest accuracy for Current customers. As a reminder, this table uses the Training partition. The depth of shading of each cell is based on the percentage, with darker shading corresponding to higher percentages. There are three other table styles available, selected from the Styles dropdown.
Figure 4.17 Classification of Observed Versus Predicted Values
The Neural Network Structure
You can see the neural network itself in the next section of output.
Click on the Neural Network panel
The network can be displayed in several views. Three icons display the network in, respectively:
• Standard style, with inputs on the left and outputs on the right
• Inputs on the top and outputs on the bottom
• Inputs on the bottom and outputs on the top
Also available is a slider to limit the display of inputs based on predictor importance.
Figure 4.18 Neural Network Structure
It is difficult to see the full network in the standard view, so we’ll switch to the view with inputs on the top.
Click on the middle network icon
Maximize the window’s width
There are two different display styles, which are selected from the Style dropdown list.
• Effects. This displays each predictor and target as one node in the diagram, irrespective of whether the measurement scale is continuous or categorical. That is the current view.
• Coefficients. This displays multiple indicator nodes for categorical predictors and targets. The connecting lines in the coefficients-style diagram are colored based on the estimated value of the synaptic weight.
Move both sliders to their endpoints so all fields are displayed
Figure 4.19 Neural Network With Inputs at Top
In this network there is one hidden layer, containing eight neurons, and the output layer still contains only one neuron, corresponding to the target field CHURNED.
Click the Style dropdown and select Coefficients
In the Coefficients view all the neurons are visible. The input layer is made up of one neuron per continuous field. Categorical fields will have one neuron per value. In this example, there are seven continuous, five flag, and one nominal field with three values, totaling twenty neurons. There are also Bias neurons to set the scale of the input. The field CHURNED is represented by three neurons for its three categories.
Figure 4.20 Neural Network In Coefficients View
The connecting lines in the diagram are colored based on the estimated value of the synaptic weight, with darker blues corresponding to a greater weight. If you hover the cursor over a link between neurons, the weight will be displayed in a popup (weights vary from –1.0 to +1.0).
It is visually evident that neural network models are very complicated to summarize easily. This is because of the very large number of connections (each of the input neurons is connected to each hidden neuron, and then each hidden neuron is connected to each output neuron). In effect, there are many, many equations in the network, and so the influence or effect of any one input field would have to be summed across these many equations.
We conclude our review of the Model Viewer output by looking at the Settings tab.
Click the Settings tab
Figure 4.21 Settings Tab for Neural Network
The options here are equivalent to those on the Model Options tab in the Neural Network modeling node. The type of confidence can be requested, along with the number of probability fields for categorical targets. One new option is to generate SQL for the model, allowing pushback of the model scoring stage to the database. This is only available for the multilayer perceptron.
Click OK to close the Neural Network Model Viewer
4.9 Validating the List of Predictors
Because the neural network results depend on the initial random starting point, it is important to rerun the model with a different seed to be sure that the results are consistent. It is entirely possible that, because of the seed we chose, one or more of the fields the Neural Network found to be important in influencing CHURNED might not be selected again with a different seed. Therefore, it is crucial to run the Neural Network model enough times until you are convinced about which predictors are the most important in influencing your target. We will rerun the model just once and compare it with the one we just ran. Normally, you would need to rerun it several times.
Double-click the Neural Network modeling node
Click the Build Options tab
Click Advanced
Change the random seed in the Random seed: text box to 444
Click Run
Edit the generated Neural Net model in the stream
We see in the Model Summary pane that the overall accuracy has decreased by about 3.6%. Also, intriguingly, the number of neurons in the hidden layer is now 3, not 8 (the number of hidden neurons is also determined in part by the random seed).
Figure 4.22 Model Summary Information for Neural Net Model After Changing Seed
Click the Predictor Importance panel
The Predictor Importance is not identical to that in our first model, but it is very similar. The three usage fields are the most important. In fact, the top six fields are the same, in the same order, as in the first model. After that the order changes, but importance is not something that should be viewed as a fixed and definite value (such as a regression coefficient). Instead, importance is a rough measure of a predictor’s influence on the overall network output.
Figure 4.23 Predictor Importance after Changing the Seed
So although the accuracy dropped a bit, generally these results are encouraging about the stability of the model. If we look at the Classification table, we will learn that accuracy dropped for the Current and Vol categories more than for InVol. Normally we would rerun the model a few more times with different seeds to further convince ourselves that the top predictors of CHURNED remain the same and that accuracy remains fairly constant, but we will stop here and attempt to further understand the model.
Click OK to close the Neural Network Model Viewer
4.10 Understanding the Neural Network
A common criticism of neural networks is that they are opaque; that is, once built, the reasoning behind their predictions is not clear. For instance, does making a lot of international calls mean that you are likely to remain a customer, or instead leave voluntarily? In the following sections we will use some techniques available in PASW Modeler to help you evaluate the network and discover its structure.
Creating a Data Table Containing Predicted Values
The first step is passing the data to a Table node to look at the output fields.
Connect the generated Neural Net model named CHURNED to the nearby Table node
Run the Table node
Figure 4.24 Table Showing the Two Fields Created by the Generated Net Node
The generated Neural Net node calculates two new fields, $N-CHURNED and $NC-CHURNED, for every record in the data file with valid data for the model. The first represents the predicted CHURNED value and the second a confidence value for the prediction. The latter is only appropriate for categorical targets and will be in the range of 0.0 to 1.0, with the more confident predictions having values closer to 1.0. We can observe that the first record, which is contained in the Training partition, was correctly predicted to be a voluntary churner.
Close the Table window
Comparing Predicted to Actual Values
In data-mining projects it is advisable to see not only how well the model performed with the data we used to train the model, but also with the data we held out for testing purposes. The Neural Net model only displays results for the Training partition, so we need to use a Matrix node to create the equivalent table for the Testing partition. Because the Matrix node does not have an option to automatically split the results by partition, we must manually divide the Training and Testing samples with Select nodes. This will allow us to create a separate matrix table for each sample. We already have the Select and Matrix nodes in the stream from the C5.0 stream.
Connect the generated Neural Net model named CHURNED to each Select node
We then need to specify the correct field in the Matrix nodes.
Double-click on each Matrix node to edit it
Put $N-CHURNED in the Columns: field
Run each Matrix node
Figure 4.25 Matrix of Actual and Predicted Churned for Training and Testing Samples
For the training data, the model is predicting 75.8% of the current customers, 95.8% of the involuntary leavers, and 74.9% of the voluntary leavers. For the testing data, the model is doing slightly better on the current customers (76.3%), but not quite as well on the other two categories (91.8% and 70.7%, respectively). When you decide whether to accept a model, and you report on its accuracy, you should use the results from the Testing (or Validation) sample. The model’s performance on the Training data may be too optimized for that particular sample, so its performance on the Testing sample will be the best indication of its performance in the future.
Close the Matrix windows
Overall Accuracy with an Analysis Node
An Analysis node will allow us to assess the overall accuracy of the model on each data partition. It is often true when predicting a categorical target with more than two categories that overall accuracy is less important than accuracy at predicting specific outcomes, but usually analysts prefer to know overall accuracy, too. And decision-makers regularly ask about it. The Analysis node in the stream is ready for our use.
Connect the generated Neural Net model node to the Analysis node
Click Run
Overall percent correct for the Training partition is 77.47%; overall percent correct for the Testing partition is 75.73%. This small reduction in accuracy from the Training to Testing data is typical, and it falls well within acceptable limits. You can see that the Testing data partition is slightly larger than that for the Training data, as they were created randomly.
Figure 4.26 Analysis Node Output
Close the Analysis Output browser window
Evaluation Charts
The Evaluation node is included in the stream, and if you wish, you can run Evaluation charts for the Neural Network model to further study and compare the performance on the training and testing data partitions.
4.11 Understanding the Reasoning behind the Predictions
One method of trying to understand how a neural network is making its predictions is to apply an alternative machine learning technique, such as rule induction, to model the neural network predictions. Here, though, we will use more straightforward methods to understand the relationships between the predicted values and the fields used as inputs.
Categorical Input with Categorical Target
Based on the predictor importance chart, a categorical input of moderate importance is SEX. Since it and the target field are categorical, we can use a distribution plot with a symbolic overlay to understand how gender relates to the CHURNED predictions.
Place a Distribution node from the Graphs palette near the Select node for the Training partition
Connect the Select node to the Distribution node
Double-click the Distribution node
Select SEX from the Fields: list
Select $N-CHURNED as the Color Overlay field
Click the Normalize by color check box (not shown)
Click Run
The Normalize by color option creates a bar chart with each bar the same length. This helps to compare the proportions in each overlay category.
Figure 4.27 Distribution Plot Relating Sex and Predicted Churned ($N-CHURNED)
The chart illustrates that the model is predicting that the majority of females are voluntary leavers, while the bulk of males were predicted to remain current customers. The large difference in the proportion of each category of CHURNED for males compared to females is an illustration of why SEX is an important predictor. And this plot would help you describe the model in any summary reports you write.
Close the Distribution plot window
We next look at a histogram plot with an overlay.
Continuous Input with Categorical Target
The most important continuous input for this model is LONGDIST. Since the target field is categorical, we will use a histogram of LONGDIST with the predicted value as an overlay to try to understand how the network is associating long distance minutes used with CHURNED.
Place a Histogram node from the Graphs palette near the Select node for the Training sample
Connect the Select node to the Histogram node
Double-click the Histogram node
Click LONGDIST in the Field: list
Select $N-CHURNED in the Overlay Color field list
Click on the Options tab
Click on Normalize by Color (not shown)
Click Run
Figure 4.28 Histogram with Overlay of Predicted Churned by Long Distance Minutes
Here the only clear pattern we see is that Involuntary Leavers tend to be people who do little or no long distance calling. In contrast, it appears that the amount of long distance calling was not as much an issue when it came to predicting whether or not a person would remain a current customer or voluntarily choose to leave. You may wish to try these same graphs with the Testing partition.
Note: Use of Data Audit Node
We explored the relationship between just two input fields (LONGDIST and SEX) and the prediction from the neural net ($N-CHURNED), and used Distribution and Histogram nodes to create the plots. If more inputs were to be viewed in this way, a more efficient approach would be to use the Data Audit node, because overlay plots could easily be produced for multiple input fields (the overlay plots can’t be normalized, though).
To save this stream for later work:
Click File…Save Stream As
Move to the c:\Train\ModelerPredModel directory (if necessary)
Type NeuralNet in the File Name: text box
Click Save
4.12 Model Summary
In summary, we appear to have built a neural network that is reasonably good at predicting the three different CHURNED groups. The overall accuracy was about 77% with the Training data, and 76% with the Testing data. Focusing on the Testing or unseen data, the model is most accurate at predicting the Involuntary Leaver group, but somewhat less successful predicting the Current Customer and Voluntary Leaver groups. Considering that the model was correct almost three-quarters of the time even in the case of these latter two groups, it is certainly within the realm of possibility that the model may be considered a success. Of course, this would depend on whether these accuracy rates met or exceeded the minimum requirements defined at the beginning of the data-mining project.
In terms of how predictors relate to the model, the most important factors in making its predictions are LONGDIST, International, LOCAL, and SEX. The network appears to associate females with the Voluntary Leaver group and predicts that males will remain Current Customers. The model also tends to predict that the people who are most likely to be dropped by the company (Involuntary Leavers) are those who do little or no long distance calling. (There are many other relationships in the data between the predictors and CHURNED that we didn’t investigate, of course.)
4.13 Boosting and Bagging Models
There are two additional techniques available in the Neural Net node, and in other modeling nodes in PASW Modeler, to create reliable and accurate predictive models. These techniques build a collection, or ensemble, of models, and then combine the results of the models to make a prediction. However, the techniques don’t simply create a number of models on exactly the same data, which wouldn’t provide any particular advantage. Instead, they resample or reweight the data for each additional model, which leads to a different model each time. This turns out to be a winning strategy for creating effective ensembles of models. These two techniques are called boosting and bagging. We will provide a brief description of how each operates, then simple examples of both (see the sketch after these descriptions).
Boosting. The key concept behind model boosting is that successive models are built to predict the cases misclassified by earlier models. Thus, as the number of models increases, the number of misclassified cases should decrease. The method works by applying a model to the data in the normal fashion, with each record assigned an equal weight. After the first model is constructed, predictions are made, and weights are created that are inversely proportional to the accuracy of classification. Then a second model is created, using the weighted data. Then predictions are made from the second model, and weights are again created that are inversely proportional to the accuracy of classification from the second model. This process continues for a (usually) small number of iterations. When done, the model predictions can be combined by voting, by highest probability, etc.
Bagging. The term “bagging” is derived from the phrase “bootstrap aggregating.” In this method, new training datasets are generated that are the same size as the original training dataset. This is done by using simple random sampling with replacement from the original data. By doing so, some records will be repeated in each new dataset. This type of sample is called a bootstrap sample. Then a model is constructed for each bootstrap sample, and the results are combined with the usual methods (voting, averaging for continuous targets). Cases are weighted normally with this method in each bootstrap sample.
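The two resampling ideas can be sketched in a few lines (simplified; the weights and error rate below are hypothetical, and Modeler's algorithms differ in detail):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
weights = np.full(n, 1.0 / n)                # boosting: all records start equal

# After a boosted model is built, upweight its misclassified records
misclassified = rng.random(n) < 0.2          # stand-in for a model with 20% error
weights[misclassified] *= 4.0                # errors get more weight next round
weights /= weights.sum()

# Bagging: each component model trains on a bootstrap sample of size n
bootstrap_rows = rng.integers(0, n, size=n)  # sampling with replacement
```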
Boosting can be used on datasets of almost any size and characteristics. It is designed to increase model accuracy, first and foremost. Bagging should be avoided on very small datasets, especially those with lots of outliers, where the outliers can affect the samples that are constructed. Bagging can increase accuracy, but can also reduce model variability.
When these methods are used, the Model Viewer will provide different views from those shown when a single model is constructed. Included will be the results from each model and details on each, plus some indication of the variability of the model results (for bagging). It is absolutely necessary to have a test or validation dataset on which the boosted or bagged models can be assessed. These models, even more so than a regular model, are highly tuned to the training data, and so their performance must be evaluated on data not used for model-building. Bagging and boosting are not guaranteed to improve model performance on new data, but the idea is that creating several models is worth the tradeoff of reusing the training data several times. Typically, only a small number of boosted or bagged models need to be constructed; the default number is 10 in PASW Modeler.
The outcome of boosting or bagging is still only one model nugget, and it can be used the same as any other standard model nugget. The downside to boosting or bagging is that no one equation, or decision tree, or the equivalent, can represent the model, so it can be hard to describe and characterize how the model makes predictions. Thus, you should investigate the relationship between the predictors and the target field, as we did above, to gain model understanding.
4.14 Model Boosting with Neural Net
We will use boosting to predict CHURNED, using the default settings but changing the random seed once more.
Edit the Neural Net modeling node named CHURNED
Click the Build Options tab
Click Objectives
Click Enhance model accuracy (boosting)
Figure 4.29 Requesting Model Boosting
Click Ensembles settings
We viewed the Ensembles settings earlier in the lesson. The number of component models to create for boosting or bagging is 10, and that can be changed here. The default choice for combining models for categorical targets is voting, and two other choices using probability are available. We’ll use the default choices.
Figure 4.30 Ensemble Model Scoring
Click Advanced settings
Change the Random seed value to 5555 (not shown)
Click Run
You will notice that execution does take much longer than when running a single neural net model. Once the model has finished:
Edit the Neural Net model CHURNED
Figure 4.31 Boosted Model Accuracy
The Model Summary view has three measures of accuracy. The bar chart displays the accuracy of the final model, compared to a reference model and a naive model. The reference model is the first model built on the original unweighted data. The naive model represents the accuracy if no model were built, and assigns all records to the modal category (Current). The naive model is not computed for continuous targets.
The ensemble of 10 models is perfectly accurate—100%! That is encouraging, but we’ll have to see how it performs on the Testing data partition.
Click the Predictor Importance panel
Figure 4.32 Predictor Importance for Boosted Model
The same fields are important as in the original Neural Net model, although now SEX is the second most important, and it and LONGDIST have higher importance values than any other field.
Click the Predictor Frequency panel
Figure 4.33 Predictor Frequency for Boosted Model
In some modeling methods, such as decision trees, the predictor set can vary across component models. The Predictor Frequency plot is a dot plot that shows the distribution of predictors across component models in the ensemble. Each dot represents one or more component models containing the predictor. Predictors are plotted on the y-axis, and are sorted in descending order of frequency; thus the topmost predictor is the one used in the greatest number of component models and the bottommost one is the one used in the fewest. The top 10 predictors are shown. However, all predictors are used in each Neural Net component model in the ensemble, so this plot is not useful here.
Click the Ensemble Accuracy panel
Figure 4.34 Ensemble Accuracy for Boosted Model
The Ensemble Accuracy line plot shows the overall accuracy of the model ensemble as each model is added. Generally, accuracy will increase as models are added, and we see that ensemble model accuracy reached 100% after only five models (you can hover the cursor over the line and view a popup of accuracy at that point). The line plot can be used to decide how fast the ensemble accuracy is increasing, and whether it is worthwhile to increase (or decrease) the number of models in another modeling run.
Click on the Component Model Details panel
Figure 4.35 Component Model Details for Boosted Model
Information about each of the models, in order of their creation, is supplied in the Component Model Details panel. Included are model accuracy, the number of predictors, and the number of synapses (weights), which is directly related to the number of neurons in the network. Not surprisingly, as the models attempted to model cases that were still mispredicted by earlier models, model accuracy decreased, although not dramatically. You can sort the rows in ascending or descending order by the values of any column by clicking on the column header. This information can be used to decide whether the model should be rerun, with additional component models, or perhaps different modeling settings.
Boosted Model Performance
Because this model has been added to the stream, we can immediately check its performance on the Testing data partition.
Click OK
Run the Analysis node
Figure 4.36 Analysis Node Output for Boosted Model
As we saw when browsing the model, the boosted model is 100.0% accurate on the Training data. On the Testing data, the accuracy is 76.12%. This large drop-off is typical for boosted models and is illustrative of why the Testing data performance is the true guide to model performance on new data. The level of accuracy is decent, but the accuracy with one model was 75.73%, almost the same. Of course, every small increase in accuracy may be important. Compared to the single neural network, the ensemble is performing better with current customers, but not as well with those who left involuntarily. All of these factors must be taken into account when deciding which model is preferred. We will not investigate the boosted neural net model further, but you can do so using the existing nodes in the stream, or others you add. If you do you will discover that the boosted model has similar relations between the key predictors and the target field CHURNED.
4.15 Model Bagging with Neural Net
We next try bagging with a neural net model. We will use the same settings in the Neural Net modeling node, just changing to bagging.
Close the Analysis node
Edit the Neural Net modeling node named CHURNED
Click Objectives
Click Enhance model stability (bagging)
Figure 4.37 Requesting Model Bagging
Click Run
You will notice that execution does take much longer than when running a single neural net model. Once the model has finished:
Edit the Neural Net model CHURNED
Figure 4.38 Bagged Model Accuracy
The Model Summary view has the same three measures of accuracy, plus a measure of model diversity (variability). The reference and naive model accuracies are the same as for the boosted model, because the reference model is again a model built on all the training data. The ensemble of 10 models is very accurate at 96.94%, although not perfectly accurate as the boosted model was. For bagged models there is a dropdown to display accuracy for the different model combining rules. All of these can be shown on one chart by selecting the Show All Combining Rules check box.
Click the Show All Combining Rules check box
Figure 4.39 Bagged Model Accuracy for all Voting Methods
For this bagged model, the rule with the highest accuracy is to use the highest mean probability. You can try all three types of combining rules on the Testing data to pick the best performing model.
Below the Quality bar chart is a bar chart labeled Diversity. This chart displays the “diversity of opinion” among the component models used to build the bagged ensemble, presented in a “greater is more diverse” format, normalized to vary from 0 to 1. It is a measure of how much predictions vary across the base models. Although the label indicates that larger is better, this isn’t necessarily so. The true test is how well the bagged models perform on the Testing data.
Figure 4.40 Bagged Model Diversity
We skip the Predictor Importance information, which is very similar to that for the boosted ensemble model. For the same reason, we skip the Predictor Frequency panel, which is identical to that for the boosted model, since all predictors are used in each component model.
Click the Ensemble Accuracy panel
Figure 4.41 Component Model Accuracy for Bagged Model
The Component Accuracy chart is a dot plot of predictive accuracy for the component models. Each dot represents one or more component models, with the level of accuracy plotted on the y-axis. Hover over any dot to obtain the ID of the corresponding individual component model. The chart also displays color-coded lines for the accuracy of the ensemble as well as the reference and naive models. A checkmark appears next to the line corresponding to the model that will be used for scoring. What we can see from the dot plot is that the level of accuracy of the 10 bagged models is very comparable. Unlike with boosted models, we cannot see the overall accuracy as each model is added. This is because the models have no logical sequence.
Click on the Component Model Details panel
Figure 4.42 Component Model Details for Bagged Model
Information about each of the models, in order of their creation, is supplied in the Component Model Details panel. This is the same type of information as supplied for boosted models. Here there is no trend in accuracy as new bootstrap samples are taken, and there really shouldn’t be. Also, overall accuracy is very similar for each of the models. And since the diversity measure was low (.08), we know that the model predictions were also very similar. The fact that the models were so comparable in performance may be a good thing, but we won’t know until we view the bagged model with the Testing data.
Bagged Model Performance
Because this model has been added to the stream, we can immediately check its performance on the Testing data partition.
Click OK
Run the Analysis node
Figure 4.43 Analysis Node Output for Bagged Model
As we saw when browsing the model, the bagged model is correct on 96.94% of the records on the Training data. On the Testing data, the accuracy is 76.25%. This large drop-off is typical for bagged models and is illustrative, as with boosted models, of why the Testing data performance is the true guide to model performance on new data. The model performs better than the single neural network (75.73%), but not by much. It does, though, do better on those who left involuntarily, but not as well on those who left voluntarily, compared to the boosted model.
Summary Exercises
The exercises in this lesson use the file charity.sav. The following table provides details about the file. charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender and mosaic (demographic) group. The file contains the following fields:

response: Response to campaign
orispend: Pre-campaign expenditure
orivisit: Pre-campaign visits
spendb: Pre-campaign spend category
visitb: Pre-campaign visits category
promspd: Post-campaign expenditure
promvis: Post-campaign visits
promspdb: Post-campaign spend category
promvisb: Post-campaign visit category
totvisit: Total number of visits
totspend: Total spend
forpcode: Post Code
mos: 52 Mosaic Groups
mosgroup: Mosaic Bands
title: Title
sex: Gender
yob: Year of Birth
age: Age
ageband: Age Category
In this set of exercises you will use a neural network to predict the field Response to campaign.
1. Begin with a blank Stream canvas. Place a Statistics source node on the Stream canvas and connect it to the file charity.sav. Tell PASW Modeler to use variable and value labels.
2. Attach a Type and Table node in a stream to the source node. Run the stream and allow PASW Modeler to automatically define the types of the fields.
3. Edit the Type node. Set all of the fields to role NONE.
4. We will attempt to predict response to campaign (Response to campaign) using the fields listed below. Set the role of all five of these fields to Input and the Response to campaign field to Target.
   Pre-campaign expenditure
   Pre-campaign visits
   Gender
   Age
   Mosaic Bands (which should be changed to nominal measurement level)
5. Attach a Neural Net node to the Type node. Run the Neural Net node with the default settings.
6. Once the model has finished training, browse the generated Net node in the stream. What is the predicted accuracy of the neural network? What were the most important fields within the network?
7. Connect the generated Net node to a Matrix node and create a data matrix of actual response against predicted response. Which group is the model predicting well?
8. Use some of the methods introduced in the lesson, as well as others, such as web plots and histograms (or use the Data Audit node with an overlay field), to try to understand the reasoning behind the network's predictions.
9. Change the random seed, rerun the neural net, and recheck its performance.
10. Try a radial basis function neural network to see if you can improve on the model performance.
11. Try a boosted or bagged model to see if you can improve on model performance compared to the models you created above.
12. For those with extra time: Use C5.0 or other decision tree methods to predict Response to campaign from the charity.sav data. How do the rule induction models compare with the neural network models built here? Which are the most accurate? Which are the easiest to understand?
13. Save a copy of the stream as Exer4.str.
Lesson 5: Support Vector Machines

Objectives
• Review the foundations of the Support Vector Machines model
• Use an SVM model to predict customer churn
• Try several different kernel functions and model parameters to improve the model
• Discuss how missing data is handled in SVM models
Data
In this lesson we use the data file customer_dbase.sav, which, like churn.txt, is also from a telecommunications firm. We will use an SVM model to predict customer churn. The file contains fields measuring both customer demographics and customer use of telecommunications services to use as predictors. SVM models can use many input fields, so this file will allow us to demonstrate that feature. We continue to use a Partition node to split the data file.
5.1 Introduction
Support vector machine (SVM) is a robust classification technique that can be used to predict either a categorical or a continuous outcome field. SVM is particularly well suited to analyzing data with a large number of predictor fields. Broadly, an SVM works by mapping the data into a new space where the data points can be categorized or predicted accurately, even if there is no easy way to separate the points in the original space. This mapping is done with a kernel function. An SVM, like a neural net model, does not provide output in the form of an equation with coefficients on the predictor fields, although predictor importance is available with the model. Thus, as with a neural net, to understand the model, rather than simply use it as a black box that makes predictions, you will need to do additional analysis. We will use an SVM in this lesson to predict customer churn. First we provide some background and theory about how an SVM model is calculated and what features of the model are under user control. Developing an acceptable SVM model usually requires trying various model settings rather than accepting the default node settings.
5.2 The Structure of SVM Models
SVM models were developed to handle difficult classification/prediction problems where "simple" linear models were unable to accurately separate the categories of an outcome field. A typical complicated problem, in two dimensions, is shown in Figure 5.1. Assume that the X and Y axes represent two predictors, while the circles and squares represent the two categories of a target field we wish to predict.
Figure 5.1 Predicting a Binary Outcome Field
There is no simple straight line that can separate the categories, but the curve drawn around the squares shows that there is a complex curve that will completely separate the two categories. The central task of SVM is to transform the data from this space into another space where the curve that separates the data points will be much simpler. Typically this means transforming the data so that a hyperplane (in a higher-dimensional space) can be used to separate the points. The mathematical function used for the transformation is known as a kernel function. After transformation, the data points might be represented as shown in Figure 5.2.

Figure 5.2 Kernel Transformation of Original Data
The squares and circles can now be separated by a straight line in this two-dimensional space. The filled-in circles and squares are the cases (called vectors in the SVM literature) that are on the boundary between the two classes. They are the same points in both Figure 5.1 and Figure 5.2. The filled-in circles and squares are all the data that is needed to separate the two categories, and these key points are called support vectors because they support the solution and the boundary definition. Because
SVM models were developed in the machine learning tradition, this technique was called a support vector machine, hence the model name. Even though it appears that we have a solution, there is more than one straight line (hyperplane) that could be used to separate the two categories, as illustrated in Figure 5.3.

Figure 5.3 Multiple Possible Separating Lines
SVM models try to find the best hyperplane: one that maximizes the margin (separation) between the categories while balancing the tradeoff of potentially overfitting the data. The narrower the margin between the support vectors, the more accurate the model will be on the current data. Thus, a separating line as shown in Figure 5.4 maximizes the margin between the support vectors.

Figure 5.4 Creating Maximum Separation Between Support Vectors
However, although this separator is 100% accurate, it may be too narrow to perform well on new data, as illustrated in Figure 5.5. Here, in a new dataset, there are circles and squares that fall on the wrong side of the support vectors and so will be classified in error.
Figure 5.5 Misclassified Cases in Training Data
To allow for this, SVM models include a regularization or weight factor C that is added as a penalty term in the function used by the SVM model. The algorithm attempts to maximize the margin between the support vectors while minimizing error. As described below, you can try various values of C to find an optimal model. Although the description of an SVM has been in the context of a categorical target, an SVM can be applied to predict a continuous field. See the Modeler 14.0 Algorithms Guide for more information.
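For reference, the soft-margin formulation found in the SVM literature expresses this tradeoff directly (the Modeler 14.0 Algorithms Guide documents the exact variant the node uses). The hyperplane weights w and offset b are chosen to solve

    \min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_{i}
    \quad\text{subject to}\quad y_{i}\,(w \cdot x_{i} + b) \ge 1 - \xi_{i},\;\; \xi_{i} \ge 0

where the first term rewards a wide margin and the slack variables \xi_{i}, weighted by C, penalize training errors. A larger C therefore tolerates fewer misclassified training points, at the risk of overfitting.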
Kernel Functions
Four different types of kernel function are available in the SVM node:
• Linear: A simple function that works well when nonlinear relationships in the data are minimal
• Polynomial: A more complex function that allows for higher-order terms
• RBF (Radial Basis Function): Equivalent to the neural network of this type; can fit highly nonlinear data
• Sigmoid: Equivalent to a two-layer neural network; can also fit highly nonlinear data
Some of these functions have other parameters that you can modify to find an optimal model, such as the degree of the polynomial, or the gamma factor that controls the influence of the kernel function. As with the factor C, there is a tradeoff with gamma values between accuracy and overfitting.

You can anticipate that you will not find the best model using the default settings in the SVM node. Just as with decision trees, where you usually need to change the depth of a tree, a pruning parameter, or the minimum number of cases in terminal nodes, SVM models must be tuned to perform better. One method to fit many models efficiently is to use the Auto Classifier node (if the target field is a flag) or the Auto Numeric node (if the target is continuous), which allow you to run several versions of a model at one time (see Lessons 12 and 13, respectively). For example, you could run 10 different SVMs with 10 different values of C.

SVM models make predictions by use of the separating hyperplane, and the equation of that hyperplane, and the support vectors themselves, are possible outputs from the model. However, neither of these will provide much insight into the model unless the dimensionality of the space is
very low, which is rarely the case. As a consequence, the SVM node doesn't provide this output, although you can request predictor importance (which is not directly related to either the support vectors or the hyperplane definition). So to understand a model, you will need to explore how the predictors are related to the predicted values. With this background in the basics of an SVM model, we can now apply an SVM to the churn data.
5.3 SVM Model to Predict Churn
The customer database we use in this lesson contains a field (churn) that measures whether or not a customer of the telecommunications firm has renewed their service. We will attempt to predict this flag field with several inputs.
Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on SVM.str
Run the Table node
Close the Table window
Edit the Type node
Figure 5.6 Type Node for SVM Model to Predict churn
There are 19 fields with the role Input that will be used as predictors. SVM models can easily handle hundreds of predictors, but we limit the number of predictors here for a practical reason: requesting predictor importance for an SVM model can greatly increase execution time (by over a factor of 10). Therefore, so that we don't wait excessively for models to run, we limit the inputs.
Close the Type window
Add an SVM node to the stream
Connect the Type node to the SVM node
Edit the SVM node
Figure 5.7 SVM Node Model Tab
All the model settings are available in the Expert tab.
Click the Expert tab
Click the Expert option button
Figure 5.8 Expert Options for SVM Models
As mentioned in the previous section, there are four types of kernels that can be selected, effectively creating different types of models. The default is RBF (Radial Basis Function), and we use that initially.

The Regularization parameter (C) controls the trade-off between maximizing the margin of the support vectors and minimizing the training error. Its value should normally be between 1 and 10 inclusive, with the default being 10. Increasing the value improves the classification accuracy (or reduces the regression error when predicting a continuous outcome) for the training data, but this can also lead to overfitting. In general, it is usually better to reduce C.

The Regression precision (epsilon) is used when the measurement level of the target field is continuous. Errors in the model predictions are accepted if they are under this value. Increasing epsilon may result in faster modeling, but at the expense of accuracy.

The RBF gamma value is enabled only if the kernel type is set to RBF. Gamma should normally be between 3/k and 6/k, where k is the number of input fields. For example, if there are 12 input fields, values between 0.25 and 0.5 would be worth trying. Increasing the value improves the classification accuracy (or reduces the regression error) for the training data, but this can also lead to overfitting, in a similar manner to the Regularization parameter. For our problem, with 19 predictors, gamma should be between .16 and .32, so we will need to change the default value of .10.

The Gamma value is enabled only if the kernel type is set to Polynomial or Sigmoid. As with RBF gamma, increasing the value improves the classification accuracy for the training data, but this can also lead to overfitting.

The Bias value is enabled only if the kernel type is set to Polynomial or Sigmoid. Bias sets the coef0 value in the kernel function. The default value of 0 is suitable in most cases.

You can use the Degree value when the Kernel type is Polynomial to control the complexity (dimension) of the mapping space. The default is 3 (equivalent to a term such as X^3).
Change the Regularization parameter to 3
Change the RBF Gamma value to 0.2
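The 3/k-to-6/k rule of thumb is simple arithmetic; a short check in plain Python confirms the range quoted above for 19 predictors.

    k = 19                 # number of input fields in this example
    lo, hi = 3 / k, 6 / k  # the 3/k to 6/k rule of thumb
    print(f"Try RBF gamma between {lo:.2f} and {hi:.2f}")  # 0.16 and 0.32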
Figure 5.9 Expert Settings for SVM Model
Click Analyze tab
In the Analyze tab, unlike for most models, the calculation of predictor importance is not checked by default. This calculation can be lengthy, and we will not request it in our initial model run.

Figure 5.10 Options in Analyze Tab
Click Run
After the model has finished execution:
Right-click the generated SVM model and select Browse
Figure 5.11 SVM Generated Model Summary Tab
Close the SVM model browser window
Add an Analysis node to the stream
Attach the SVM generated model to the Analysis node
Edit the Analysis node
Click the Coincidence matrices (for symbolic targets) check box
Figure 5.12 Analysis Node Settings
Click Run
Figure 5.13 Analysis Node Output
The SVM model is accurate on 81.3% of the cases in the Training partition. Looking at the classification table, the model is extremely accurate for those customers who don't churn (value of 0); here the accuracy is almost 94%. But on those customers who did churn (value of 1), the model accuracy is only about 44%, which is probably not satisfactory. We haven't developed a final model yet, so we won't bother with performance on the Testing data.
Close the Analysis output browser
Requesting Predictor Importance
To understand the model better, we can now request predictor importance. Even though this first model is not satisfactory, this will show which fields most affect the predictions of churn.
Edit the SVM modeling node
Click the Analyze tab
Click the Calculate predictor importance check box
Figure 5.14 Requesting Predictor Importance
Click Run
The model will take much longer to run. When it is done:
Right-click the generated SVM model and select Browse
Figure 5.15 Predictor Importance in Predicting churn
The most important field, by far, is equip (the importance chart displays the variable label, if one is available, rather than the variable name), which records whether or not a customer rents equipment from the telecommunications firm. The second most important field, ebill, records whether or not a customer pays bills electronically. The most important demographic field is gender. One strategy based on this chart is to drop some of the fields of low importance. We will instead change model parameters.
Close the SVM model browser
5.4 Exploring the Model
Before trying a different model, we briefly illustrate examining how the SVM model makes predictions. Since we are working with the Training data, we need to use a Select node to work only with the data in that partition.
Add a Select node from the Record Ops palette to the stream near the Type node
Attach the Type node to the Select node
Edit the Select node
Use the Expression Builder to create the condition Partition = “1_Training”
Figure 5.16 Select Node to Select Training Data
Click OK
Connect the Select node to the SVM model node in the stream
Click Replace to replace the connection
Add a Distribution node to the stream near the SVM model
Connect the SVM model node to the Distribution node
Edit the Distribution node
Specify equip as the Field
Specify churn as the Overlay field
Click the Normalize by color check box
These selections will show us the relationship between the most important predictor and the original values of churn.
Figure 5.17 Requesting a Distribution Graph with equip and churn
Click Run
We see in Figure 5.18 that most customers who don't rent equipment (value of 0; you can click the Label tool to display value labels) did not churn, but a sizeable fraction of customers who did rent equipment churned.

Figure 5.18 Distribution Graph of equip and churn
Now we look at the model predictions.
Close the Distribution graph window
Edit the Distribution node
Change the Overlay field to $S-churn (not shown)
Click Run
Figure 5.19 Distribution Graph of equip and Predicted churn
The model predictions are similar to the previous graph, only more extreme. The model predicts that almost no customers who don't rent equipment will churn, while for customers who do rent equipment it predicts about the same percentage of churn.
Close the Distribution graph window
We could continue this process with other predictors, using appropriate nodes to see the relationship between them and the original and predicted values of churn. We next try to improve our first model.
5.5 A Model with a Different Kernel Function
We have three other kernel functions to try, along with changing model parameters. The linear model may be too simple, so let's start with the Polynomial.
Edit the SVM modeling node
Click the Analyze tab
Click Calculate predictor importance to deselect it
Click the Expert tab
Click the Kernel type: dropdown and select Polynomial
Change the Degree value to 2
This uses a polynomial one degree simpler than the default.
Figure 5.20 Requesting a Model with Polynomial Kernel
Click Run
When the model is done executing:
Attach the Type node to the new generated SVM model
Edit the SVM model
Click the Annotations tab
Click Custom and name the model Polynomial – 2
Click OK
Attach the generated SVM model to the Analysis node by replacing the connection
Run the Analysis node
Figure 5.21 Analysis Node Output
This model performs about the same as the first one that used an RBF kernel, but it does a bit better at predicting customers who churned (about 45%). So all things being equal, this model might be preferred. To save time, we ran a model with a linear kernel function. The Analysis node results are displayed in Figure 5.22. That model doesn’t perform any better, either overall or at predicting customers who churn.
Figure 5.22 Analysis Node for Linear Kernel Model
Close the Analysis Output browser
We won’t show the results of a model using a Sigmoid kernel, as it performs much more poorly (you can try it if you wish).
5.6 Tuning the RBF Model
To illustrate tuning a model further, we can rerun the SVM model node with an RBF kernel. We'll change the value of C; even though, in theory, increasing it may make the model generalize less well, we'll give it a try to see the effect.
Edit the SVM modeling node
Click the Expert tab
Change the Kernel type to RBF (not shown)
Change the Regularization parameter to 5
Click Run

After the model has run:
Connect the SVM model to the Analysis node
Run the Analysis node
Figure 5.23 Analysis Node with RBF Kernel Model
The model did a bit better overall than the original RBF-based model (see Figure 5.13). And it did better at predicting the customers who churned (47%). So this model appears to be better even with a higher value of C.
SVM Models and Missing Data
In the Analysis node output in Figure 5.23, note that there are three records for which a prediction could not be made (the column with the label $null$). In SVM models, records with missing values for any of the input or output fields are excluded from model building. In these data, there are 3 records with missing values on longten, and two of these customers also have a missing value for cardten. (You can use a Select node and Table node to view the data and see these records.) By chance, all three customers were in the Testing partition.

If the amount of missing data in a file is a small fraction of the total records, you may be willing to tolerate the loss of some records from model building and scoring. But if the amount of missing data is significant, you will need to take some action before using the SVM node. You can use the Type node Check option to change missing values to a valid value. You could also do this yourself with a Filler node. Or you could use a Data Audit node to impute missing values in a more sophisticated fashion. Of course, if you do this when creating a model, you will also need to use the same methods to handle missing data before scoring new data with the model.
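As a rough analogue of what the Type node Check option or a Filler node would do, the sketch below uses pandas to replace missing values before modeling. The miniature data frame and field names (longten, cardten) only mimic the fields mentioned above; reading the actual .sav file would require extra steps not shown here.

    import pandas as pd

    # Hypothetical miniature of the customer data with missing values
    df = pd.DataFrame({
        "longten": [100.0, None, 250.0, 80.0],
        "cardten": [50.0, None, 120.0, None],
        "churn":   [0, 1, 0, 1],
    })

    # Replace missing numeric inputs with the field median; otherwise these
    # records would be dropped from SVM model building and scoring
    for col in ("longten", "cardten"):
        df[col] = df[col].fillna(df[col].median())
    print(df)

Whatever replacement rule is used here must also be applied to new data before scoring, as noted above.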
Model Comparison
We have results from several models, and picking a final model is based, in part, on the accuracy on the Testing partition. For ease of reference, we have created the table below, which summarizes the results from Figure 5.13, Figure 5.21, Figure 5.22, and Figure 5.23.
Table 5.1 Model Performance on Testing Partition

Model Type                   Overall Accuracy   Accuracy at Predicting Non-Churners   Accuracy at Predicting Churners
RBF (C=3, RBF gamma=0.2)     77.81%             92.1%                                 36.3%
Polynomial (Degree=2)        79.14%             92.9%                                 39.2%
Linear                       79.73%             93.7%                                 39.4%
RBF (C=5, RBF gamma=0.2)     77.65%             91.2%                                 38.3%
The best model on all three measures used a Linear kernel, while the next best model used a Polynomial kernel. One never knows which model will do best until trying several types of models with various settings. None of these models does very well at predicting those customers who churned, so we might continue model building by adding or deleting fields, modifying fields, or continuing to change model parameters. It should be clear by now that to be well organized, you will want to develop a strategy of changing model parameters in a systematic way, and to keep close track of the model parameters and results. As noted above, the Auto Classifier or Auto Numeric nodes can be very helpful. We don't need to save the stream in this lesson.
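Outside Modeler, the same systematic search can be sketched in a few lines. The loop below uses scikit-learn's SVC on synthetic data purely as an analogue of the SVM node (an assumption for illustration; it is not Modeler's implementation), trying each kernel at several values of C and logging the results in the spirit of Table 5.1.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic stand-in with 19 inputs, like the churn example in this lesson
    X, y = make_classification(n_samples=500, n_features=19, random_state=1)

    # Vary the kernel and C systematically, keeping a record of each run
    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        for C in (1, 3, 5, 10):
            acc = cross_val_score(SVC(kernel=kernel, C=C), X, y, cv=5).mean()
            print(f"kernel={kernel:8s} C={C:2d} cross-validated accuracy={acc:.3f}")

A printed log of this kind makes the bookkeeping recommended above almost automatic.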
Summary Exercises
The exercises in this lesson use the file charity.sav. The following table provides details about the file. charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender, and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category
In this set of exercises you will attempt to predict the field Response to campaign using an SVM model.
1. Begin with a clear Stream canvas. Place a Statistics source node on the Stream canvas and connect it to the file charity.sav. Tell PASW Modeler to Read Labels as Names.
2. Attach a Type and Table node in a stream to the source node. Run the stream and allow PASW Modeler to automatically define the types of the fields.
3. Edit the Type node. Set all of the fields to role NONE.
4. We will attempt to predict response to campaign (Response to campaign) using the fields listed below. Set the role of all five of these fields to Input and the Response to campaign field to Target.
   Pre-campaign expenditure
   Pre-campaign visits
   Gender
   Age
   Mosaic Bands (which should be changed to nominal measurement level)
5. Attach an SVM node to the Type node. Create a model with the default settings using an RBF kernel type. Request predictor importance.
6. Once the model has finished training, browse the model. Which inputs are most important? Add the generated model to the stream and use an Analysis node to check its accuracy. On which category does the model do better, responders or non-responders?
7. Now try other SVM models, by changing the kernel type and the parameters associated with that kernel, or by changing the regularization parameter C. Which model does best at predicting response to campaign? Is the most important field the same in all the models?
8. After you've found a satisfactory model, explore how the input fields relate to the model predictions.
Lesson 6: Linear Regression

Objectives
• Review the concepts of linear regression
• Use the Regression node to model medical insurance claims data
• Demonstrate the Linear node to perform regression modeling
Data
We use the data file InsClaim.dat, which contains 293 records based on patient admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a predictive model for the insurance claim amount and use this model to identify outliers (patients with claim values far from what the model predicts), which might be instances of errors or fraud in the claims.
6.1 Introduction
Linear regression is a method familiar to just about everyone these days. It is the classic general linear model (GLM) technique, and it is used to predict a target that is interval or ratio in scale (measurement level continuous) with predictors that are also interval or ratio. In addition, categorical input fields can be included by creating dummy variables (fields). The Regression node performs linear regression in PASW Modeler.

Linear regression assumes that the data can be modeled with a linear relationship. To illustrate, the figure below contains a scatterplot depicting the relationship between the length of stay for hospital patients and the dollar amount claimed for insurance. Superimposed on the plot is the best-fit regression line. The plot may look a bit unusual in that there are only a few values for length of stay, which is recorded in whole days, and few patients stayed more than three days.
Figure 6.1 Scatterplot of Hospital Length of Stay and Insurance Claim Amount
Although there is a lot of spread around the regression line and a few outliers, it is clear that there is a positive trend in the data such that longer stays are associated with greater insurance claims. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form, but it is still useful to look at bivariate scatterplots.
6.2 Basic Concepts of Regression
In the plot above, to the eye (as well as to one's economic sense) there seems to be a positive relation between length of stay and the amount of a health insurance claim. However, it would be more useful in practice to have some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation and could be used to predict values of one field given knowledge of the other. A straight line is a very simple function and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest otherwise. Also, since the goal of much research involves prediction, a prediction equation is valuable. However, the value of the equation is linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.
The Regression Equation and Fit Measure
In the plot above, insurance claim amount is placed on the Y (vertical) axis and the length of stay appears along the X (horizontal) axis. If we are interested in insurance claim as a function of the length of stay, we consider insurance claim to be the output field and length of stay to be the input or predictor field. A straight line is superimposed on the scatterplot along with the general form of the equation:

Yi = A + B * Xi + ei
Here, B is the slope (the change in Y per one-unit change in X), A is the intercept (the value of Y when X is zero), and ei is the model residual or error for the ith observation. Given this, how would one go about finding a best-fitting straight line? In principle, there are various criteria that might be used: minimizing the mean deviation, mean absolute deviation, or median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is the one that minimizes the sum of the squared deviations of each point about the line.

Returning to the plot of insurance claim amount and length of stay, we might wish to quantify the extent to which the straight line fits the data. The fit measure most often used, the r-square measure, has the dual advantages of being measured on a standardized scale and having a practical interpretation. The r-square measure (which is the correlation squared, or r^2, when there is a single input field, and thus its name) is on a scale from 0 (no linear association) to 1 (perfect prediction). Also, the r-square value can be interpreted as the proportion of variation in one field that can be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the variation in one field if we know values of the other. You can think of this value as a measure of the improvement in your ability to predict one field from the other (or others, if there is more than one input field).

Multiple regression represents a direct extension of simple regression. Instead of a single input field (Yi = A + B * Xi + ei), multiple regression allows for more than one input field in the prediction equation:

Yi = A + B1 * X1i + B2 * X2i + B3 * X3i + . . . + ei

While we are limited in the number of dimensions we can view in a single plot, the regression equation allows for many input fields. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent fields in predicting the output field.
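To make the estimation concrete, the sketch below fits a simple regression in Python (scikit-learn on synthetic data; an illustration only, not the Regression node's implementation) and reports the least-squares intercept, slope, and r-square.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    los = rng.integers(1, 5, size=293).astype(float)      # length of stay in whole days
    claim = 3000 + 1100 * los + rng.normal(0, 1500, 293)  # synthetic claims with noise

    X = los.reshape(-1, 1)
    model = LinearRegression().fit(X, claim)              # minimizes squared deviations
    print("intercept A:", round(model.intercept_, 2))
    print("slope B:", round(model.coef_[0], 2))
    print("r-square:", round(model.score(X, claim), 3))   # proportion of variance explained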
Residuals and Outliers
Viewing the plot, we see that many points fall near the line, but some are more distant from it. For each point, the difference between the value of the dependent field and the value predicted by the equation (the value on the line) is called the residual (ei). Points above the line have positive residuals (they were under-predicted), those below the line have negative residuals (they were over-predicted), and a point falling on the line has a residual of zero (perfect prediction). Points having relatively large residuals are of interest because they represent instances where the prediction line did poorly. As we will see shortly in our detailed example, large residuals (gross deviations from the model) have been used to identify data errors or possible instances of fraud (in application areas such as insurance claims, invoice submission, or telephone and credit card usage).
Assumptions
Regression is usually performed on data for which the input and output fields are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). PASW Modeler can provide summaries and plots useful in evaluating these latter issues. One special case of the assumptions involves the interval scale nature of the predictor field(s). A field coded as a dichotomy (say 0 and 1)
can technically be considered an interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a field's only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus dichotomous or flag fields (for example, gender) can be used as predictors in regression. This also permits the use of categorical predictor fields if they are converted into a series of flag fields, each coded 0 or 1; this technique is called dummy coding. Note that the Regression node in PASW Modeler will only accept continuous inputs or ordinal fields that contain numeric values (they will then be treated as continuous). Thus if you have categorical inputs, you must convert them to numeric dummy fields (using the SetToFlag node to create 0,1 dummy-coded fields, followed by a Type node to set the measurement level of these fields to continuous) before they can be used by the Regression node.
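The dummy coding described above is mechanical; a pandas sketch (with a hypothetical region field, purely for illustration) shows the kind of 0/1 fields the SetToFlag node would produce, with one category left out to avoid redundancy.

    import pandas as pd

    df = pd.DataFrame({"region": ["northwest", "southwest", "northeast", "southwest"]})

    # 0/1 dummy fields; drop_first leaves one category out to avoid redundancy
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True).astype(int)
    print(dummies)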
6.3 An Example: Error or Fraud Detection in Claims
To illustrate linear regression we turn to a dataset containing insurance claims (CLAIM) for a single medical treatment performed in a hospital (in the US, a single DRG or diagnostic related group). In addition to the claim amount, the data file also contains patient age (AGE), length of hospital stay (LOS), and a severity of illness category (ASG). This last field is based on several health measures, and higher scores indicate greater severity of the illness.

The plan is to build a regression model that predicts the total claim amount for a patient on the basis of length of stay, severity of illness, and patient age. Assuming the model fits adequately, we are then interested in those patients that the model predicts poorly. Such cases can simply be instances of poor model fit, or the result of predictors not included in the model, but they also might be due to errors on the claims form or fraudulent entries. Thus we are approaching the problem of fraud detection by identifying exceptions to the prediction model. Such exceptions are not necessarily instances of fraud, but since they are inconsistent with the model, they may be more likely to be fraudulent or contain errors.

Some organizations perform random audits on claims applications and then classify them as fraudulent or not. Under these circumstances, predictive models can be constructed that attempt to correctly classify new claims; logistic regression, discriminant analysis, rule induction, and neural network models have been used for this purpose. However, when such a target field is not available, fraud detection involves searching for and identifying exceptional instances. Here, an exceptional instance is one that the model predicts poorly. We are using regression to build the model; if there were reasons to believe the model were more complex (for example, contained nonlinear relations or complex interactions), then a neural network or rule induction model could be applied, or higher-order and interaction terms could be added to the regression equation.

We begin by opening an existing stream.
Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on LinearRegress.str
Before the stream opens, the warning message in Figure 6.2 appears. The Regression node will be replaced in later versions of Modeler by the Linear Models node, which is newly added to Modeler 14.0. We will continue to use the Regression node in this lesson, but also demonstrate the Linear Models node at the end of the lesson.
Figure 6.2 Warning Message About Regression Node Expiration
Click OK
Double-click on the Type node
Figure 6.3 Type Node for Claims Data
Note that the severity of illness field (ASG) is continuous, although it has only the three integer values 0, 1, and 2. We will leave it as continuous since these values fall on an ordered scale (higher values indicate greater severity).

The predictors for linear regression must be numeric. If you have predictors that are truly categorical (nominal or ordinal), such as region of the U.S. (e.g., northwest, southwest, etc.), they must be represented by dummy fields (coded either 0 or 1). However, the Regression node will not automatically create dummy fields for these categories. You will need to create the dummy fields yourself using the SetToFlag node, and then enter the dummy fields in the model, leaving one out so as not to create redundancy (ask your instructor for more detail if you are interested).
Close the Type node dialog
Double-click on the Regression node (named CLAIM)
Figure 6.4 Linear Regression Model Dialog
Simple options include whether a constant (intercept) will be used in the equation and the Method of input field selection. By default (Enter), all inputs will be included in the linear regression equation. With such a small number of predictor fields, we will simply add them all into the model together. However, in the common situation of many input fields (most insurance claim forms would contain far more information), a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of the business expert (here perhaps a health administrator). In addition, an option may be chosen to select, from a larger set of independent fields, those that in some statistical sense are the best predictors (the Stepwise method).

In the stepwise method, the best input field (according to a statistical criterion) is entered into the prediction equation. Then the next best input field is entered, and so on, until a point is reached when no further input fields meet the criterion. The stepwise method includes a check to ensure that the fields entered into the equation before the current step still meet the statistical criterion when the additional inputs are added. Variations on the stepwise method are available as well: Forward, in which inputs are added one by one, as described above, but are never removed; and Backward, in which all inputs are entered, then the least significant input is removed, and this process is repeated until only statistically significant inputs remain.
Click the Fields tab
Figure 6.5 Regression Fields Tab
The weighted least squares option (the Use weight field check box) supports a form of regression in which the variability of the output field differs for different values of the input fields; an adjustment can be made for this if an input field is related to this degree of variation. In practice, this option has rarely been used in data mining. We also see here the option to specify a partition field when there is such a field but it doesn't have the default name Partition. In addition, you can specify one or more input fields as split fields; a model is built for each possible combination of the values of the selected split fields.
Click the Expert tab
Click the Expert Mode option button
Figure 6.6 Expert Options (Missing Values and Tolerance)
By default, the Regression node will only use records with valid values on the input and output fields (this is often called listwise deletion). This option can be unchecked, in which case PASW Modeler will attempt to use as much information as possible to estimate the Regression model, including records where some of the fields have missing values. It does this through a method called pairwise deletion of missing values. However, we recommend not using this option unless you are a very experienced user of regression; using incomplete records in this manner can lead to computational problems in estimating the regression equation. Instead, if there is a large amount of missing data, you may wish to substitute valid values for the missing data before using the Regression node.

The Singularity tolerance will not allow an input field in the model unless at least .0001 (.01%) of its variation is independent of the other predictors. This prevents the linear regression model estimation from failing due to multicollinearity (linear redundancy in the predictors). Most analysts would recommend increasing the default tolerance value to at least .05, though, and also checking explicitly for multicollinearity.
Click the Model tab, and then click Stepwise on the Method drop-down list
Click the Expert tab, and then click the Stepping button
Figure 6.7 Stepping Criteria and Tolerance Expert Options
You control the criteria used for input field entry into, and removal from, the model. By default, an input field must be statistically significant at the .05 level for entry and will be dropped from the model if its significance value increases above .1.
Click Cancel
Click the Output button
Figure 6.8 Advanced Output Options
These options control how much supplementary information concerning the regression analysis is displayed. The results will appear in the Advanced tab of the generated model node in HTML format. Confidence bands (95%) for the estimated regression coefficients can be requested (Confidence interval). Summaries concerning relationships among the inputs can be obtained by requesting their Covariance matrix or Collinearity diagnostics. The latter are especially useful when you need to identify the source and assess the level of redundancy in the predictors; the more predictors you have, the more likely it is that some may be highly correlated (you can ask your instructor for more information on these diagnostics). Part and partial correlations measure the relationship between an input and the output field, controlling for the other inputs. Descriptive statistics (Descriptives) include means, standard deviations, and correlations; these summaries can also be obtained from the Statistics or Data Audit node. The Durbin-Watson statistic can be used when running regression on time series data; it evaluates the degree to which adjacent residuals are correlated (regression assumes residuals are uncorrelated).
Click Cancel
Click the Simple option button
Click the Model tab, and then click Enter on the Method drop-down list (not shown)
Click the Run button
After the model has run:
Edit the Regression generated model node in the stream
Click the Summary tab, and then expand the Analysis summary
Figure 6.9 Linear Regression Browser Window (Analysis Summary)
This Analysis summary contains only the equation relating the predictor fields to the output. We could interpret the coefficients here, but since we don't know whether they are statistically significant, we will postpone this until we examine additional information in the Advanced tab. To reach the more detailed results:
Click the Advanced tab
Increase the size of the window to see more of the output
The advanced output is formatted in HTML. After listing the dependent (output) and independent (input) fields, Regression provides several measures of how well the model fits the data. First is the multiple R, which is a generalization of the correlation coefficient. If there are several input fields (our situation) then the multiple R represents the unsigned (positive) correlation between the output and the optimal linear combination of the input fields. Thus the closer the multiple R is to 1, the better the fit. As mentioned earlier, the R-square measure can be interpreted as the proportion of variance of the output that can be predicted from the input field(s). Here it is about 32% (.318), which is far from perfect prediction, but still substantial. The adjusted R-square represents a technical improvement over the R-square in that it explicitly adjusts for the number of input fields and sample size, and as such is preferred by many analysts. Generally the two R-square values are very close in value; in fact, if they differ dramatically in multiple regression, it is a sign that you have used too many inputs relative to your sample size, and the adjusted R-square value should be more trusted. In our results, they are very close.
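The adjustment referred to here is the standard one; with n records and p input fields,

    \text{adjusted } R^{2} = 1 - \left(1 - R^{2}\right)\frac{n-1}{n-p-1}

so the penalty grows as the number of inputs approaches the sample size, which is why a large gap between the two values signals too many inputs relative to n.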
Figure 6.10 Model Summary and Overall Significance Tests
While the fit measures indicate how well we can expect to predict the output or how well the line fits the data, they do not tell whether there is a statistically significant relationship between the output and input fields. The analysis of variance table presents technical summaries (sums of squares and mean square statistics), but here we refer to variation accounted for by the prediction equation. We are interested in determining whether there is a statistically significant (non-zero) linear relation between the output and the input field(s) in the population. Since our analysis contains three input fields, we test whether any linear relation differs from zero in the population from which the sample was taken. The significance value accompanying the F test gives us the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as what we obtained, if there were no linear relations in the population. The result is highly significant (significance probability less than .0005—the table value is rounded to .000—or 5 chances in 10,000). Now that we have established that there is a significant relationship between the claim amount and one or more input fields and obtained fit measures, we turn to interpret the regression coefficients. Here we are interested in verifying that several expected relationships hold: (1) claims will increase with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will increase with age. Strictly speaking, this step is not necessary in order to identify cases that are exceptional. However, in order to be confident in the model, it should make sense to a domain expert in hospital claims. Since interpretation of regression models is made directly from the regression coefficients, we turn to those next.
Figure 6.11 Regression Coefficients
The second column contains a list of the input fields plus the intercept (Constant). The estimated coefficients in the B column are those we saw when we originally browsed the Linear Regression generated model node; they are now accompanied by supporting statistical summaries. Although the B coefficient estimates are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each row to determine which input fields are significantly related to the output field. Since three inputs are in the equation, we are testing whether there is a linear relationship between each input field and the output field after adjusting for the effects of the two other inputs. Looking at the significance values (Sig.), we see that all three predictors are highly significant (significance values are .004 or less). If any of the fields were found to be not significant, you would typically rerun the regression after removing these input field(s).

The column labeled B contains the estimated regression coefficients we would use to deploy the model via a prediction equation. The coefficient for length of stay indicates that, on average, each additional day spent in the hospital was associated with a claim increase of about $1,106. The coefficient for admission severity group tells us that each one-unit increase in the severity code is associated with a claim increase of $417. Finally, the age coefficient of about –$33 suggests that claims decrease, on average, by $33 as patient age increases one year. This is counterintuitive and should be examined by a domain expert (here a physician). Perhaps the youngest patients are at greater risk, or perhaps the type of insurance policy, which is linked somehow to age, influences the claim amount. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results). Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice.

The constant or intercept of $3,027 is the predicted claim amount for someone with 0 days in the hospital, in the least severe illness category (0), and with age 0. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but it is still needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed, since the assumption is that the same pattern continues. Here it clearly cannot!

The Std. Error (of B) column contains standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as an Expert Output option). In our example, the regression coefficient for length of stay is $1,106 and the standard error
is about $104. Thus we would not be surprised if in the population the true regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000. If we wish to predict claims based on length of stay, severity code, and age, the formula would use the estimated B coefficients:

Predicted Claims = $3,027 + $1,106 * (length of stay) + $417 * (severity code) – $33 * (age)
Betas are standardized regression coefficients and are used to judge the relative importance of each of several input fields. They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the input fields and their scale, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claim amount, followed by severity group and age. Betas typically range from –1 to 1, and the further from 0, the more influential the predictor field. Predictor importance measures are also provided by PASW Modeler, but for statistical models such as linear regression or discriminant analysis, they add little to your understanding (click the Model tab to see the Predictor Importance graph). For the claims data, the predictor importance values have roughly the same relative magnitude as do the Betas, but the Betas are better suited for comparing predictor importance in a regression equation.
Points Poorly Fit by the Model
The motivation for this analysis is to detect errors or possible fraud by identifying cases that deviate substantially from the model. As mentioned earlier, these need not be the result of errors or fraud, but they are inconsistent with the majority of cases and thus merit scrutiny. We first create a field that stores the residuals, or errors in prediction, which we will then sort and display in a table.
Close the Regression generated model node
Place a Derive node from the Field Ops palette to the right of the Regression generated model node
Connect the Regression generated model node to the Derive node
Edit the Derive node
Enter the new field name DIFF into the Derive field: text box
Enter the formula CLAIM – ‘$E-CLAIM’ into the Formula text box
Figure 6.12 Computing an Error (Residual) Field
The DIFF field measures the difference between the actual claim value (CLAIM) and the claim value predicted by the model ($E-CLAIM). Since we are most interested in the large positive errors, we will sort the data by DIFF before displaying it in a table.
Click OK to complete the Derive node
Place a Sort node to the right of the Derive node
Connect the Derive node to the Sort node
Edit the Sort node
Select DIFF as the Sort by field
Select Descending in the Order column (not shown)
Click OK to process the Sort request
Place a Table node to the right of the Sort node
Connect the Sort node to the Table node
Run the Table node
Figure 6.13 Errors Sorted in Descending Order
There are two records for which the claim values are much higher than the regression prediction. Both are about $6,000 more than expected from the model. These would be the first claims to examine more carefully. We could also examine the last few records for large over-predictions, which might be errors as well.
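The Derive-Sort-Table sequence above amounts to only a few lines of code. Below is a pandas sketch, under the assumption that the scored data holds the model predictions in a column here called E_CLAIM (the actual Modeler field is $E-CLAIM); the tiny data frame is hypothetical.

    import pandas as pd

    # Hypothetical scored records: actual claims and model predictions
    df = pd.DataFrame({
        "CLAIM":   [9500, 4200, 12800, 5100],
        "E_CLAIM": [5400, 4300,  6900, 5000],
    })

    df["DIFF"] = df["CLAIM"] - df["E_CLAIM"]      # residual: actual minus predicted
    suspicious = df.sort_values("DIFF", ascending=False)
    print(suspicious.head())                      # largest under-predictions first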
6.4 Using the Linear Models Node to Perform Regression
The Linear Models node was added in Modeler 14.0 to create linear models that predict a continuous target with one or more predictors. Models are created with an equation that assumes a simple linear relationship between the target and the predictors. The Linear Models node has more features than the Regression node, including the ability to create the best subset model, several criteria for model selection, the option to limit the number of predictors, and the use of bagging and boosting, as discussed earlier in this course. In addition, there is a feature to automatically prepare the data for modeling by transforming the target and predictors in order to maximize the predictive power of the model. This includes outlier handling, adjusting the measurement level of predictors, and merging similar categories. The Linear Models node automatically creates dummy variables from categorical fields (those with nominal or ordinal measurement level), which is another definite advantage.

The Linear Models node also uses the new Model Viewer to display a wealth of information about the model that helps to evaluate it and understand the effect of the predictors. For this example, we will reproduce the model from the Regression node and concentrate on the new features and output, rather than attempt to find a more accurate model.
Close the Table node
Add a Linear node to the stream near the Type node
Connect the Linear node to the Type node
Edit the Linear node
Click the Build Options tab
Figure 6.14 Build Options Tab for Linear Models Node
There are five areas within the Build Options tab that correspond to the standard ones included with most classification nodes. We will create a standard model and not use bagging or boosting.
Click the Basics settings
Figure 6.15 Basics Settings for Build Options Tab for Linear Models Node
The Linear node can use a variety of automatic methods to prepare the data for modeling. These include changing the measurement levels of predictors (e.g., continuous predictors with fewer than five values are changed to ordinal), outlier handling to reduce extreme values, special missing value handling (e.g., missing values for nominal predictors are replaced with the mode), and supervised merging of predictor categories to reduce the number of fields to be processed (helpful when there are large numbers of predictors). To keep things comparable, we will turn off this option.
Click the Automatically prepare data check box to deselect this option
Click the Model Selection setting
Figure 6.16 Model Selection Settings for Build Options Tab for Linear Models Node
There are three model selection methods available:
• None, which is equivalent to the Enter method for Regression, with all fields entered in one step
• Forward stepwise, which is standard stepwise model selection, although by default it uses an information criterion measure for entry/removal rather than the F statistic
• Best subsets, which tries "all possible" models, or at least a much larger subset of the possible models than forward stepwise, to choose the best model based on various criteria
Click the Model selection method: dropdown and select None
The default criterion for entry/removal is an information theoretic measure (akin to AIC or BIC), but R square, the F statistic, and an overfit prevention criterion are also available. The significance values for entry and removal can be modified. You can customize the maximum number of effects (predictors, including those coded as dummy variables) in the final model, and you can customize the maximum number of model-building steps when using the forward stepwise or best subsets methods.
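To make the forward stepwise logic concrete outside of Modeler, here is a minimal sketch in Python. It assumes a pandas DataFrame named claims holding the fields used in this lesson, and it uses AIC as a stand-in for the node's information criterion (the node's exact measure may differ):

    import numpy as np
    import statsmodels.api as sm

    def forward_stepwise(df, target, candidates):
        # Start from the intercept-only model; at each step add the predictor
        # that lowers AIC the most, and stop when no addition improves it.
        selected = []
        best_aic = sm.OLS(df[target], np.ones(len(df))).fit().aic
        while candidates:
            aics = {p: sm.OLS(df[target],
                              sm.add_constant(df[selected + [p]])).fit().aic
                    for p in candidates}
            best = min(aics, key=aics.get)
            if aics[best] >= best_aic:
                break
            selected.append(best)
            candidates.remove(best)
            best_aic = aics[best]
        return selected

    # e.g., forward_stepwise(claims, 'CLAIM', ['LOS', 'ASG', 'AGE'])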
The Ensembles and Advanced settings are similar to those for other modeling nodes, so we won’t review them here.
Click Model Options tab
Figure 6.17 Model Options Tab for Linear Models Node
Neither probability nor propensity is available for the Linear Models node because it is predicting a continuous target.
Click Run
After the model has executed:
Edit the CLAIM Linear Models generated model
Figure 6.18 Model Summary View of Linear Model
The Model Summary includes information on how the model was constructed. The Accuracy chart displays adjusted R square (.311 for the model). The second panel displays information on Automatic data preparation, but none was requested for this example.
Click the Predictor Importance panel
Figure 6.19 Predictor Importance for Linear Model
Predictor importance is about equal for ASG and AGE, with LOS being the most important. Importance for a Linear model is calculated differently than for a Regression model. For Linear Models, a leave-one-predictor-out method is used, in which model fit is compared with and without each predictor (see the PASW® Modeler 14 Algorithms Guide for details).
Click on the Effects panel
Figure 6.20 Model Effects Diagram for Linear Model
The Model Effects display shows each predictor in the model, sorted from top to bottom by predictor importance. Connecting lines in the diagram are weighted based on effect significance, with greater line width corresponding to more significant effects (smaller p-values). In addition to the standard predictor importance slider, there is a significance slider that controls which effects are shown in the view, beyond those shown based on predictor importance. Effects with significance values greater than the slider value are hidden. This does not change the model, but simply allows you to focus on the most important effects. The Style dropdown has a Table choice that will display the ANOVA table for the model.
Click Coefficients panel
Figure 6.21 Coefficients Diagram for Linear Model
The Coefficients panel displays the value of each coefficient in the model. If there are categorical predictors, each dummy predictor created to represent them will be displayed separately. In the default Diagram style, the chart displays the intercept first, and then sorts effects from top to bottom by decreasing predictor importance. Connecting lines in the diagram are colored and weighted based on coefficient significance, with greater line width corresponding to more significant coefficients (smaller p-values). Blue corresponds to a positive effect, orange-yellow to a negative effect. You can also hover over a line to display the coefficient in a popup.
Click the Style: dropdown and select Table
Figure 6.22 Coefficients in Table Style for Linear Model
The Table view shows the coefficients, significance tests, and importance for the individual model coefficients. After the intercept, the effects are sorted from top to bottom by decreasing predictor importance. Within effects containing factors, coefficients are sorted by ascending order of data values. These are the same regression coefficients and significance values as we obtained from the Regression model. There are three other panels that help assess whether the data and model meet statistical assumptions for linear modeling. These include:
1. Predicted by Observed scatterplot, which should display a linear relationship between these two fields. In addition, outlying cases can be identified that may influence model coefficients.
2. Residuals histogram, which allows you to check the assumption that the errors are normally distributed. However, with the typically large data files encountered in data mining, this assumption becomes less critical.
3. Outliers table using Cook’s Distance, a measure of how much the residuals of all records would change if a particular record were excluded from the calculation of the model coefficients. A large Cook's distance indicates that excluding a record changes the coefficients substantially, so the record should be considered influential. Some analysts use a cutoff of around .10 and above to identify influential records.
Click on the Outliers panel
Figure 6.23 Outliers Table for Linear Model
For these data, the outliers have large values of CLAIM and so are harder to predict with the model. Finally, there is an Estimated Means panel, which displays the model value of the target on the vertical axis for each value of a predictor on the horizontal axis, holding all other predictors constant at their means. It provides a useful visualization of the total effect of each predictor on the target across the full range of that predictor. As we have reviewed, the Linear Models node provides more modeling options, and more visual output, than the Regression node.
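For readers who want to reproduce the Cook's distance screening outside of Modeler, here is a minimal sketch using Python and statsmodels; the file delimiter is an assumption, and the .10 cutoff is the informal one mentioned above:

    import pandas as pd
    import statsmodels.api as sm

    claims = pd.read_csv('InsClaim.dat', sep='\t')   # delimiter assumed

    X = sm.add_constant(claims[['LOS', 'ASG', 'AGE']])
    fit = sm.OLS(claims['CLAIM'], X).fit()

    # cooks_distance returns (distances, p-values); keep the distances and
    # flag records above the informal .10 cutoff as potentially influential
    cooks_d = fit.get_influence().cooks_distance[0]
    print(claims[cooks_d > 0.10])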
Summary Exercises
The exercises use the data file InsClaim.dat that was used in this lesson. The following table provides details about the file. InsClaim.dat contains insurance claim information from patients in a hospital. All patients were in the same diagnosis related group (DRG). Interest is in building a prediction model of total charges based on patient information and then identifying exceptions to the model (error or fraud detection). The file contains about 300 records and the following fields:

ASG     Severity of illness code (higher values mean more seriously ill)
AGE     Age
LOS     Length of hospital stay (in days)
CLAIM   Total charges in US dollars (total amount claimed on form)
1. Using the insurance claims data, use the Stepwise method and compare the equation to the one obtained using the Enter method. Are you surprised by the result? Why or why not? Try the Forward and Backward methods. Do you find any differences?
2. Instead of examining errors in the original scale, analysts may prefer to express the residual as a percent deviation from the prediction. Such a measure may be easier to communicate to a wider audience. Add a Derive node that calculates a percent error. Name this field PERERROR and use the following formula: 100* (CLAIM – '$E-CLAIM')/'$E-CLAIM'. Compare this measure of error to the original DIFF. Do the same records stand out? What conditions is percent error most sensitive to? Use the Histogram node to produce histograms for either of the error fields, generate a Select node to select records with large errors, and then display them in a table.
3. Use the Neural Net modeling node to predict CLAIM using a neural network. How does its performance compare to linear regression? What does this suggest about the model?
4. Fit a C&R Tree model and make the same comparison. Examine the errors from the better of the neural net and C&R Tree models (as you judge them). Do the same records consistently display large errors?
Lesson 7: Cox Regression for Survival Data

Overview
• What is Survival Analysis?
• What to Look for in Survival Analysis
• Cox Regression
• Checking the Proportional Hazards Assumption
• Predictions from a Cox Model
Data In this lesson we use the data file customer_dbase.sav that contains 5,000 records from customers of a telecommunications firm. The firm has collected a wide variety of consumer information on its customers, and we are interested in studying the length of time customers retain their primary credit card—in other words, we wish to model the time for these customers to churn—not renew—their primary credit card. We will use several predictors to model churn to learn their effect on time to churn.
7.1 Introduction Survival analysis studies the length of time to an event of interest. The analysis can involve no predictors, or it can investigate survival time as a function of one or more predictor fields. The technique was originally used in medical research to study the amount of time patients survive following onset of a disease (hence the name survival analysis). In data mining, it has been applied to model such diverse outcomes as length of time a person subscribes to a newspaper or magazine, the time employees spend with a company, the time to failure for electrical or mechanical components, the time to make a second purchase from an online retailer, or the length of tenure for renters of commercial properties. The Cox node in PASW Modeler can perform both univariate (no predictors) and multivariate (Cox regression) survival analysis. The former type of analysis is called Kaplan-Meier, often used to compare survival time for treatment and control groups in medical studies. In data mining, there are many possible predictors, so Cox regression, a semi-parametric technique, is used. In this lesson we will review the concepts and theory behind survival analysis and Cox regression, and then perform a Cox regression predicting churn.
7.2 What is Survival Analysis?
Survival analysis examines the length of time to a critical event, possibly as a function of predictor fields. Since time has interval scale properties (actually ratio scale), such methods as regression or ANOVA might first come to mind as possible analytical methods to use with time-ordered data. However, survival data often contains censored values: observations for which the final outcome is not known, yet for which some information is available. For example, in a data set of subscribers to a newspaper, some will have cancelled their subscription, but many others will still be subscribers at the time of data collection or analysis. The former group is comprised of those who did not “survive,” but the latter group did survive until the end of the study. Their outcome is censored, which simply means that we don’t know when they will end their subscription. Ordinary regression, or other techniques, has no easy way of handling a data value of 15+ years, meaning we know the subscriber has received the newspaper for at least 15 years so far, with an unknown future survival. Survival analysis can explicitly incorporate and handle such information (see next section). The main outcome measure in survival analysis is the length of time to a critical event. Such summaries as the mean and median survival time, with their accompanying standard errors, are useful. The mean survival time is not simply the sample mean value, but is estimated using the cumulative survival plot (discussed below) that adjusts for censored data. An important summary is the survival function over time. The cumulative survival plot (or table) displays the estimated probability of surviving longer than the time displayed in the chart (or table). An example of a cumulative survival plot for two treatment groups in a medical study appears below.
Figure 7.1 Survival Plot for Two Treatment Groups
We see the probability of surviving longer than a given time starts out at 1.0 (when time = 0) and decreases over time. The probability is adjusted at the time of each critical event (here a death). There were censored observations (patients who could not be contacted beyond a certain point, died of other causes, or outlived the study) that appear as small plus signs (+) along the plot. They are used in the
denominator when calculating survival probability up until the time of their censoring (since they are known to be alive up until that point), and discarded thereafter. Note that the plot is an empirical curve, adjusted for each critical event. This approach does not make distribution assumptions about the shape of the curve and is called nonparametric. This is the approach taken by the Kaplan-Meier method; there are parametric models of survival analysis that would fit smooth curves to the data viewed above. For this reason, when comparing survival functions of different groups using the Kaplan-Meier method, nonparametric tests are used. A related function is called the hazard function. It measures the rate of occurrence per unit time of the critical event, at a given instant. As such it represents the risk of the event occurring at a particular time. A medical researcher would be interested in looking at time points with elevated hazard rates, as would a renter of retail space, since these would be times of greater risk of the event (death, tenant leaving). PASW Modeler can produce cumulative hazard plots. For more detailed discussion of these issues see Lee (1992), Kleinbaum (1996), or Klein and Moeschberger (1997).
Censoring
A distinguishing feature of survival analysis is its use of censored data. Users of other advanced statistical methods are familiar with missing data, which are observations for which no value was recorded. Censored values contain some information, but the final value is hidden or not yet known. For example, in a time to churn analysis, at the end of the data collection phase for the study, a customer may still be a subscriber, effectively outliving the study. The analyst would thus know that the customer survived for at least n months, but does not know the final date when the subscription will end. Censored data is included in the calculation of the survival function. A censored case is used in the denominator to calculate survival probabilities for events occurring at shorter time values than the censored case. As an example, if we know a customer survives 60 months and is then censored, use is made of the fact that the customer did not churn during the first 60 months. After the time of censoring, the censored value is dropped from any survival calculations.
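The bookkeeping described above is easy to see in code. Here is a minimal sketch, assuming the Python lifelines package and a small made-up set of subscription durations; a 0 in the event flag marks a right-censored customer:

    from lifelines import KaplanMeierFitter

    durations = [5, 12, 24, 36, 60, 60]   # months observed (illustrative only)
    observed  = [1,  1,  0,  1,  1,  0]   # 1 = churned, 0 = still subscribed

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)

    # Censored customers count in the denominator up to their censoring time
    print(kmf.survival_function_)       # estimated P(T > t) at each event time
    print(kmf.median_survival_time_)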
Data Sampling and Censoring
Technically, the type of censoring we have been discussing is called “right-censoring.” If we think of time flowing from left to right on a horizontal axis, with the time values increasing, then a case that is right-censored extends beyond the end of data collection, as exemplified in the chart in Figure 7.2. The mirror image of this is left-censoring, of which there are two types. “Left-censoring” occurs when the case is never a part of the study because it occurred before the time the study began. “Left-truncated” cases are those which have an unmeasured beginning time, then are measured, so that when the event of interest occurs, we can only say that they have been a customer for at least as long as measured (but it could be much longer).
Figure 7.2 Types of Censoring in Survival Data
Both left-censoring and left truncation can cause problems for models because they lead to a biased sample. In data-mining applications, it is common to have customer history data taken from a cross-sectional extraction from existing databases, using all customers active as of some fixed date when the study begins. This approach will systematically undersample customers with short survival times (those who are left-censored) and thus overestimate survival. Left truncation is less of an issue with business databases because the time a customer begins is normally well known (although when companies merge and combine customers, incompatible data systems can lead to uncertainties in customer history). To solve the left-censored problem, it is better to sample from a history of customers over time, sampling not cross-sectionally but over some defined time period. This means that you don’t need to choose all those customers who began, say, on a fixed date. Instead, customers can enter, and leave the study (because they churned), over some long time interval. This doesn’t imply that a survival study must actually go on for many months or years in real time; instead, it means that the data sampling must be done over an extended time interval. Too much right-censored data can also be a problem, simply because the event of interest—here churning—will have occurred too infrequently.
Why Not Regression or Logistic Regression?
Those who first encounter survival data sometimes wonder why linear regression can’t be used to predict survival time, or why logistic regression or other methods for dichotomous outcomes can’t be used to predict whether an event has occurred or not. Besides the censoring issue, there are several reasons, but the key one related to linear regression is that the residuals from the regression model are unlikely to be normally distributed, even for large sample sizes. This is because the time to event distribution is likely to be non-normal, even bimodal, in many real-world applications (there are certain intervals when more customers are likely to churn, such as at the end of their contract period). Logistic regression doesn’t assume a particular distribution of the residuals, but it also doesn’t handle censored data appropriately. It is possible to follow a sample of, say, 1000 customers, from the time they obtain a credit card until the last one has dropped that card (many years later). This type of dataset has no right-censored data because the status of every customer is known along with when the event of interest occurred. However, collecting this type of data is often impractical; moreover, since conditions change rapidly in many businesses, the effects of predictors on credit card churn may
change over time, so collecting data over a long time interval may, perversely, lead to a less accurate model. This again argues for using Cox Regression rather than another technique.
Sampling on the Outcome Field
Survival analysis is based on the idea that there is one type of event that causes “death” in a study. In the context of business data, this means that models should be developed to predict only one type of churn. As an illustration, if we knew that some customers dropped their DSL internet service to sign up with a cable internet provider, while other customers dropped their DSL service because they will be using broadband service available through satellite, it would be better to develop a different model for each group. Often, though, what happens to a customer after they cancel a service or subscription is unknown; in that case, a model may be mixing customers who have different influences that lead them to churn, which in turn will lead to less accurate models.
7.3 Cox Regression
Cox regression is a survival model that represents hazard as a function of time and predictor fields that can be continuous or categorical. Because it allows for multiple predictors, it is more general than the Kaplan-Meier method. It is considered a semi-parametric model because it does not require a particular functional form for the baseline hazard or survival curves. This allows Cox regression to be applied to datasets exhibiting very different survival patterns. As we will see shortly, the model does assume that the ratio of the hazard rate between any two individuals or groups remains constant over time (for this reason it is also called Cox’s proportional hazards model). If this assumption is not met, the Cox model has been extended to incorporate time-varying predictors, in other words, interaction terms between predictors and time (although PASW Modeler doesn’t provide an automatic way to create and incorporate such effects). In the Cox Regression model, the hazard function at time t, as a function of predictors X1 to Xp, can be expressed as:
h(t | X1, X2, ..., Xp) = h0(t) × exp(B1·X1 + B2·X2 + ... + Bp·Xp)

The hazard is a measure of the potential for the event to occur at a particular time t, given that the event did not yet occur. Larger values of the hazard function indicate greater potential for the event to occur. The hazard function is expressed as the product of two components: a base hazard function, h0(t), that changes over time and is independent of the predictors; and a factor or covariate effect term, exp(B1·X1 + B2·X2 + ... + Bp·Xp), that is independent of time and adjusts the base hazard function. The shape of the base hazard function is unspecified and is empirically determined by the data (the nonparametric aspect of the model). Since the model effects relate to the hazard through the exponential function, the exponentiated value of a model coefficient (say exp(B1)) can be interpreted as the change in the hazard function associated with a one-unit change in the predictor X1, controlling for other effects in the model. The separation of the model into two components, one a function of time alone and the other a function of the predictors alone, implies that the ratio of the hazard functions for any two individuals or groups will be a constant over time (the base hazard function cancels out, leaving the ratio constant over time and based on the differences in predictor values). If this model assumption (the effects of
the predictors are constant over time) is not met, then the Cox Regression model will not provide the best fit to the data. Since the hazard function is related to the cumulative survival function, this model can alternatively be expressed in terms of cumulative survival. The value of the survival function is the probability that the given event has not occurred by time t. Survival plots, based on the model, are also helpful in studying the model and are more intuitive for most people.
Note: The Cox node can perform a Kaplan-Meier analysis by specifying a model with no input fields.
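Outside of Modeler, the same kind of model can be fit with a few lines of code. A minimal sketch, assuming the Python lifelines package and a pandas DataFrame df containing the churn example fields used in the next section (cardtenure, churn, and a few predictors):

    from lifelines import CoxPHFitter

    # duration_col is the survival time; event_col is 1 if the event (churn)
    # occurred and 0 if the record is censored
    cph = CoxPHFitter()
    cph.fit(df[['cardtenure', 'churn', 'age', 'income', 'ed']],
            duration_col='cardtenure', event_col='churn')

    # coef is B; exp(coef) is the hazard ratio, Exp(B), interpreted below
    cph.print_summary()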
Missing Data
Cox regression uses only records with valid data for all input and target fields. Thus, if a value is missing for a field, that case will not be used in the analysis. If you have a significant amount of missing data, you may want to either not use fields with many missing values, or estimate/impute missing values to increase the number of valid cases for the analysis.
7.4 Cox Regression to Predict Churn
The customer database we use in this lesson contains a field (cardtenure) that measures the length of time a customer has held, or did hold, their primary credit card. This is the survival time field, which must be of continuous measurement level. The target field, which must be of flag measurement level, is churn, which records whether or not a customer switched cards. We will use several predictors to model churn, mainly customer demographics.
Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on Cox Regression.str
Run the Table node
Close the Table window
Edit the Type node
Figure 7.3 Type Node for Customer Database
There are over 100 fields in the file, and we only want to use a few for the Cox Regression to keep this example manageable. The best approach is to set the Role for all fields to None, and then selectively change specific fields.
Right-click on any field and select Select All from the context menu
Right-click on any field and select Set Role…None
Change the Role for gender, age, ed, income, marital, cardfee, and cardtenure to Input
Change the Role of churn to Target
Click OK
Cox regression will create appropriate dummy fields to represent categorical fields without user intervention. It can be instructive to look at the distribution of the key fields cardtenure and churn before proceeding further.
Add a Distribution node to the stream
Connect the Type node to the Distribution node
Specify churn as the Field
Click Run
Out of the 5000 records, about three-quarters still have their primary credit card, while one-quarter changed their card. These are reasonable proportions of the two events for a Cox regression.
Figure 7.4 Distribution of churn
Next let’s look at the distribution of cardtenure.
Close the Distribution window
Add a Histogram node to the stream
Connect the Type node to the Histogram node
Specify cardtenure as the Field
Click Run
The cardtenure values are in years, and the distribution ranges from 0 (people who have recently obtained a new primary credit card) to 40 (people who have had their primary credit card for a very long time). The distribution drops over time, as expected, and it is choppy, rather than smooth, which is often typical of this type of data.
Figure 7.5 Histogram of cardtenure
As a final step in data exploration, let’s look at the relationship between these two key fields.
Close the Histogram window
Edit the Histogram node
Specify churn as the Color overlay field
Click Options
Click Normalize by color (not shown)
Click Run
The smoothly declining percentage of churn=1 values over cardtenure may at first come as a surprise. But actually, as with many products, most consumers who switch do so early, so churn rates are initially around 50% for the first few years. Then, over time, switching the primary credit card still occurs, but less frequently as a percentage of those customers who have survived that long. This trend continues right up to the end of data collection at 40 years. This will help us understand predictions from a Cox regression model in a later section.
Figure 7.6 Overlay Histogram of cardtenure by churn
We are now ready to add the Cox regression node to the stream and review its settings.
Add a Cox modeling node to the stream
Connect the Type node to the Cox node
Edit the Cox node
We selected the input and target fields in the Type node, and the Cox node correctly lists churn as the target field. However, we also need to specify which field contains survival times.
Figure 7.7 Fields Tab in Cox Node
Click Survival time field chooser and select cardtenure
Click Model tab
As with other regression-based models, Cox regression can estimate a model either by using all the predictors, or by performing forwards or backwards stepwise model-building. In this example, we use the default choice of Enter and use all the predictors. If you have many predictors and want to use one of the stepwise methods, you definitely should use a testing or validation sample or partition. Complex models can be built by specifying a Model type of Custom and then selecting specific terms. This allows you to incorporate interaction terms into a model, for example. If you would prefer to see separate analyses for discrete groups of customers, a Groups: field can be selected. This field should be categorical, and a separate model will be developed for each category in this field. Alternatively, such a field can be included in the model as a predictor, but if you believe that coefficients and survival times are quite different for the various groups, or you want to assess whether this is the case, estimating separate models can be a useful approach.
Figure 7.8 Model Tab in Cox Node
Click the Expert tab
Click the Expert options button
Click the Output button
By default, only the most basic output is supplied by the Cox node, and neither the survival nor the hazard plots are included. The check box for Display baseline function will display the baseline hazard function and cumulative survival at the mean of the covariates.
Figure 7.9 Expert Output Options
Click Display baseline function check box
Click Survival and Hazard check boxes under Plots area
When the plots selections are made, the bottom of the dialog box becomes active, and the fields included in the model are listed, with their values set to the mean. This is necessary because the survival and hazard functions depend on the values of the predictors, and you must use constant values for the predictors to plot the functions versus time. The default is to use the mean, but you can enter your own values for the plot using the grid. This would allow you to get survival plots for a particular type of customer. For categorical inputs, indicator coding is used, so there is a regression coefficient for each category (except the last). For a categorical input, the mean value for each dummy field is equal to the proportion of cases in the category corresponding to that indicator contrast. You can also request a separate line for each value of a categorical field on any plot. We’ll do so for marital (the field used here does not have to be an input to the model). This is not equivalent to adding a Groups field to the model.
Click the Plot a separate line for each value dropdown and select marital
Figure 7.10 Completed Advanced Output Dialog Selections
We will now briefly look at the Settings tab.
Click OK
Click Settings tab
The Settings tab has several options to specify how a Cox model should be applied to make predictions. The model can be scored at regular time intervals, over one or more time periods, with the unit of time defined by the field used in the model. Alternatively, another time field can be listed. In many cases, customers or the equivalent will already have a survival time which must be taken into account (not everyone is beginning as a new customer who has just acquired a product or subscription), so the setting Past survival time allows you to select a field which contains this information (this is often the same field as used for survival time itself, such as cardtenure in the current example). These options become relevant when a model has already been developed, so we won’t say more about them at this point in the lesson.
Figure 7.11 Settings Tab in a Cox Node
Click Run
After the model runs:
Right-click the Cox model in the stream named churn and select Edit
The Categorical Variable Coding table shows the dummy variable coding for the categorical variables in the model. Unlike in other PASW Modeler nodes, Cox regression does indicator coding, with the last category as the reference category. This means that for flag variables such as gender, which is coded 0 for males and 1 for females, the coding within Cox regression reverses this ordering. This is very important to note for interpretation of the model. If you prefer the original ordering, you can change the values of a field with a Reclassify node.
Figure 7.12 Categorical Variable Coding
The next set of tables includes tests of the model as a whole. Since all predictors are entered at once, the values reported in the Change From Previous Step and Change From Previous Block sections are identical. Here we are testing whether the effect of one or more of the predictor fields is significantly different from zero in the population. This is analogous to the overall F test used in regression analysis. The results indicate that at least one predictor is significantly related to the hazard because the significance values are well below .05 or .01. (An omnibus test is also done using the score statistic, which is used in stepwise predictor selection.)
Figure 7.13 Omnibus Tests of Model Coefficients
Figure 7.14 Variables in the Equation Table
The next table—Variables in the Equation—contains information on the individual effects of each predictor. To interpret these coefficients, recall that the model predicts the hazard directly, not survival time, and that in the scale of the predictors, the natural log of the hazard is being predicted. Therefore, the B coefficient relates the change in the natural log of the hazard to a one-unit change in a predictor, controlling for other predictors. As such, B coefficients are difficult to interpret directly (although positive values are associated with increasing hazard and lower survival time, while negative values are associated with decreasing hazard and increasing survival time). For this reason, the Exp(B) column is usually used when interpreting the results. The significance of each predictor is tested using the Wald statistic, and the associated probability values are reported in the Sig. column. Here, four of the predictors are significant, but gender and cardfee are not. The positive B values for ed and marital indicate, respectively, that increasing education and being unmarried (check the coding) are associated with increasing hazard for churn. The negative B values for age and income indicate that increasing age and income lead to reduced hazard for churn. The Exp(B) column presents the estimated change in risk (hazard) associated with a one-unit change in a predictor, controlling for the other predictors. When the predictor is categorical and indicator coding is used, Exp(B) represents the change in hazard when comparing the reference category to another category and is referred to as relative risk. Exp(B) is also called the hazard ratio, since it represents the ratio of the hazards for two individuals who differ by one unit in the predictor of interest. The Exp(B) value for marital is 1.469; this means that, other things being equal, the hazard for customers who are unmarried is 1.469 times greater than the hazard for married customers. This does not mean that an unmarried customer will churn 1.469 times faster, or that 1.469 times as many unmarried customers as married customers will churn in a given time interval. It simply means that, at any given time, an unmarried customer’s risk (hazard) of churning is 1.469 times that of a married customer. For a continuous predictor such as age, the hazard ratio refers to the effect on the hazard of each one-unit change. So, a one-year increase in age multiplies the hazard by .923. A ten-year difference in age (comparing, say, a 40-year-old to a 30-year-old customer) must be calculated by multiplying the hazard ratios, and so corresponds to .923^10 ≈ .449, which is a substantial reduction in the likelihood of churning for the older customer (who tends to hold on to their credit card).
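Because Exp(B) combines multiplicatively across units, the ten-year figure can be checked with two lines of arithmetic (a sketch using the values above):

    hr_per_year = 0.923           # Exp(B) for age from the table above
    print(hr_per_year ** 10)      # ~0.449: hazard ratio for a ten-year age gap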
The next, rather lengthy, table is the Survival table, which contains the baseline cumulative hazard, the survival estimates, and the cumulative hazard. The first portion of the table is displayed in Figure 7.15. Although these summaries are technical, it is worth noting that a hazard can take on values above 1 (it is in theory unbounded on the positive side). The Baseline Cum(ulative) Hazard is the hazard rate for the model when all predictors are set to 0. There is a row for each time value in the data. Associated with this are both the Survival values (with a standard error estimate) and a Cum(ulative) Hazard function, both calculated at the mean of all the predictors (or covariates, as labeled in the table). Survival begins at or near 1.0 and drops off toward 0; cumulative hazard begins near 0 and increases.
Figure 7.15 Survival Table
The survival time gives the estimated survival rate at a specific time for customers who have mean values on the predictors (a “hypothetical” customer). So for our model, survival—retaining the primary credit card—has dropped to .933, or 93.3%, by the sixth year. These values are plotted in the graphs that follow.
Figure 7.16 Cumulative Survival Function
The cumulative survival values are plotted in the Survival Function chart. Time held primary credit card is on the horizontal axis, and cumulative survival is on the vertical axis. Until the last few years, cumulative survival for a typical customer declines steadily at about a constant pace, then more steeply over the last two or three years. The curve is not smooth but jagged, because survival for these data is measured by the year (rather than in months or even weeks).
Figure 7.17 Cumulative Hazard Function
The cumulative hazard function plot is somewhat the mirror image of the survival plot; again, note that the hazard can take on values above 1. As cumulative survival decreases, cumulative hazard—the chance of not retaining the primary credit card—increases. There are survival and hazard graphs with separate lines for each category of marital. Looking at the cumulative survival function for what is labeled “patterns 1-2” shows that the survival curve for unmarried customers is always below the line for married customers. This means that the cumulative survival for unmarried customers is always lower than for married customers, which is consistent with the regression coefficient for marital, which had an Exp(B) value of 1.469 and indicated an increased hazard for unmarried customers. Differences in survival gradually increase over time between these two groups until about 35 years, where estimates grow less precise. This type of plot is a useful adjunct to interpretation of model coefficients and helps to gain additional model understanding.
Figure 7.18 Cumulative Survival Function for Married and Unmarried Customers
Close the Cox model Browser window
That completes a review of the most important types of output for a Cox model.
7.5 Checking the Proportional Hazards Assumption
Cox regression is based on a proportional hazards model, but we don’t know whether that assumption is valid for these data (that the hazard functions of any two individuals or groups remain in constant proportion over time). There are several approaches to test this for predictors. For our purposes, the two chief methods are to:
• Examine the survival or hazard plots with the categorical predictor as the factor
• Examine the survival or log-minus-log plot in Cox Regression with the categorical predictor specified as a Groups variable
We will illustrate by specifying marital as the Groups field within Cox Regression and examining the survival and log-minus-log plots.
Edit the Cox modeling node
Click the Expert tab
Click the Output button
Click Log minus log check box
Remove marital from the Plot a separate line for each value box (not shown)
Click OK
Click Model tab
Select marital as the Groups field
Figure 7.19 Marital Selected As Groups Field
A separate base hazard function will be fit to each category of the Groups field. If marital does not conform to the proportional hazards assumption, this should be revealed in the survival and log minus log plots, which will present a line for each category, i.e., married and unmarried customers.
Click Run
Right-click on the generated model and select Browse
Since the focus of this analysis is on the proportional hazards assumption, we will not examine the model results but move directly to the diagnostic plots.
Figure 7.20 Survival Plot for Marital Categories
The survival plots for the married and unmarried customers remain roughly parallel over time, suggesting that the hazard ratio for the two groups is reasonably constant over time. We can ignore the last few years where there are few customers and so estimates are less precise. Because marital is not used as a predictor in the model (it is a groups field), the expected survival values allow us to assess whether marital would meet the proportional hazards assumption in this context.
Figure 7.21 Log Minus Log Plot for Marital Categories
Another way of examining the proportional hazards assumption is through the ln(-ln) plot. Again, we simply have to judge whether the lines for the different categories are parallel. Here the two survival lines do initially diverge after a few years, and then slowly converge over time until about year 30. However, compared to the full range of the data, the divergence between the two curves is small. Although this requires some judgment, in this instance we will conclude that the proportional hazards assumption is met for marital. If the assumption is not met, a time-varying covariate model can be fit to the data. Or you can drop the predictor from the model, which is a viable option when you have dozens of predictors. Checking the proportional hazards assumption should be done, in theory, for all significant categorical predictors in the model.
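The same diagnostic can be reconstructed in code. A minimal sketch, assuming lifelines and matplotlib and a DataFrame df with the cardtenure, churn, and marital fields; roughly parallel step curves support the proportional hazards assumption:

    import numpy as np
    import matplotlib.pyplot as plt
    from lifelines import KaplanMeierFitter

    for level, grp in df.groupby('marital'):
        kmf = KaplanMeierFitter().fit(grp['cardtenure'],
                                      event_observed=grp['churn'],
                                      label=str(level))
        s = kmf.survival_function_[str(level)]
        s = s[(s > 0) & (s < 1)]       # ln(-ln) is undefined at S = 0 or 1
        plt.step(s.index, np.log(-np.log(s)), where='post', label=str(level))

    plt.xlabel('cardtenure')
    plt.ylabel('ln(-ln S(t))')
    plt.legend()
    plt.show()

We can now look at the predictions made by the model.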
7.6 Predictions from a Cox Model
The Cox regression model node can be added to a stream to make predictions. Given that the data are collected over time, the idea of a “prediction” from the model is complicated, and in fact the model can make several predictions for the same customer, over different time periods, i.e., survival over the next time period, over the next five time periods, and so forth. We want to rerun the basic model without marital specified as a Groups field.
Close the Cox generated model browser
Edit the Cox modeling node
Click the Model tab
Remove marital as the Groups field
Click Run
After the model has run:
Add a Table node to the stream and connect it to the generated Cox model node
Run the Table node
Scroll to the right to see the new fields
Figure 7.22 Output Fields from a Cox Regression Model with Default Settings
Four fields are created by default from a Cox model. $C-churn-1 is the prediction of the Cox model for whether or not this customer will churn in the time interval that the user has requested (we will review the default interval next). The field $CP-churn-1 contains the probability associated with this prediction (whether it is associated with the churn or no churn condition). As can be seen from the table, the probabilities are very high for almost all the predictions. The last two columns contain the probabilities for the churn=0 and churn=1 conditions (the field $CP-0-1 stands for the predicted probability associated with churn=0 for the first predicted interval). All the predicted probabilities seem to be taken from the $CP-0-1 field. It is very important to understand that the predictions we see in the Table do not take current survival time for a customer into account. This may seem odd, given that the model was developed with survival time data, and model nuggets usually make predictions based on values of the predictors, but because predicting into the future is so central to the use of Cox models, no survival time value is used by default in the model. To see this we need to view the Settings tab on the Cox generated model.
Close the Table window
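Before turning to the Settings tab, note that this default behavior corresponds to evaluating each customer's survival function at t = 1, counted from time 0. A hedged sketch with lifelines, reusing the cph model from the earlier sketch:

    # Survival probability at the end of year 1, scored from time 0, for
    # every customer; one minus this is the churn probability
    surv = cph.predict_survival_function(df, times=[1])
    p_churn_year1 = 1 - surv.loc[1]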
Edit the Cox model node
Click the Settings tab
Figure 7.23 Settings Tab Options on Cox Generated Model
The options available within this tab are the same as those available in the modeling node, allowing a model to be developed that simultaneously makes predictions. Note that there is no Time field listed, nor any Past survival time field. By default, survival will be predicted for a time interval of 1.0, which is defined in units of the time field used to create the model (here cardtenure). So the prediction will be for one year, for one time period (Number of time periods to score). But—and here is the key to understanding how predictions are made with Cox regression—since no past survival time field is specified, the predictions are equivalent to predicting whether each customer will keep their primary credit card for 1 year after receiving it. These predictions are not whether a customer who has survived this long (as measured by cardtenure) will keep his or her card for another year. Since the odds of a customer churning in the first year are not terribly large, very few customers will be predicted to churn. To see what the predictions of the model are for each customer at their actual survival time (which varies by customer), we need to set the time field to cardtenure. This will request that the Cox model predict into the “future” the number of time intervals represented by cardtenure for each customer. Recall from the previous example that, when no past survival time is listed, the model effectively begins at time 0.
Click Time field option button
Specify the field as cardtenure
Click OK
Run the Table node
Figure 7.24 Predictions of Cox Model at Current Survival Time
In the first 20 records visible in Figure 7.24 there are still no predictions that a customer will churn. However, the probabilities of churning ($CP-1-1) are much higher than previously (see Figure 7.22). These are the predictions of the model for the actual survival times of each customer. We can now use an Analysis node to review the model predictions for churn.
Close the Table window
Add an Analysis node to the stream
Connect the Cox generated model to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box (not shown)
Click Run
Figure 7.25 Predictions for Cox Model
Overall the model is correct for 72.7% of the customers. However, looking at the Coincidence Matrix, we see that the model performs very well for customers who didn’t churn, but not nearly as well for those who did. But we are using only a few predictors to keep the example simple; in a real-life data-mining project using Cox regression, performance is very likely to be much better than this. Given what we have learned in the two instances of prediction we have reviewed, let’s return to the Cox generated model to discuss prediction in more depth.
Close the Analysis node
Edit the Cox generated model
Figure 7.26 Cox Model Settings Tab
The section of the Settings tab labeled Predict survival at future times specified as: allows the user to specify future time based either on time intervals or on an actual time field. This section is separate from specifying past (really current) survival time. As we have seen, if no past survival time is listed, then the model predicts as if time = 0. Let’s set the past survival time to cardtenure and predict one time interval (one year) into the future.
Click Regular intervals option button
Specify cardtenure as the Past survival time field
Figure 7.27 Cox Model to Predict One Year Beyond Current Survival Time
Click OK
Run the Table node
Figure 7.28 Cox Predictions One Year Beyond Current Survival
There are two points to notice about these data. First, for case 16, there are values of $null$ for all four output fields. If we scroll to the left and check the value of cardtenure for this customer, we will learn that it is 40. Predictions cannot be made outside the range of survival values in the data, so for any customer already at the upper end of the survival range, no predictions can be made. Second, if you compare Figure 7.28 to Figure 7.24, you may see differences that are at first puzzling. The customer in the first row has a cardtenure value of 2. Thus, the model we just ran is predicting one year ahead for this customer, or to year 3, and the probability of churning in year 3 is 0.135. But if we look back at Figure 7.24 we see a probability of churning of 0.251, which is greater. You might wonder how the probability of churning can go down; the probability of the terminal event occurring should either stay constant or increase over time (unless we have time-varying covariates). The answer is that predictions that take into account past survival are actually conditional predictions, and the probability listed is the conditional probability. This means that, in Figure 7.28, the probability of 0.135 is the probability of this customer churning by the end of year 3, given that they survived (did not churn) through year 2. Thus, if we used the Analysis node to examine predictions from this model, we would see very few customers predicted to churn. This is because the probability of churning in the next year is always rather small for these data, so there are few predictions that a customer will churn.
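In other words, the conditional probability reported here is the ratio of two points on the customer's survival curve: P(T > t + k | T > t) = S(t + k) / S(t). A minimal sketch of this calculation with lifelines, reusing the cph model from the earlier sketches, for the first customer (who has held the card for 2 years):

    # Survival curve evaluated at the current tenure (2) and one year out (3)
    one = df.iloc[[0]]                 # this customer's predictor values
    s = cph.predict_survival_function(one, times=[2, 3])

    p_keep = s.loc[3].iloc[0] / s.loc[2].iloc[0]   # P(T > 3 | T > 2)
    p_churn_year3 = 1 - p_keep                     # conditional churn probability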
Once you are satisfied with a model, you will want to score customers to find those most likely to churn. A common situation is to predict some time into the future, given past survival time. We’ll predict five years into the future. There is no need to predict for those customers who have already churned; we would only be interested in predicting churn for customers who are still current and may churn at some point in the future. So we need to select those customers with churn=0. Then we can sort the data stream by the likelihood of churning in a given future time interval.
Close the Table window
Place a Select node from the Record Ops palette in the stream to the right of the Cox model
Connect the Cox generated model to the Select node
Edit the Select node
Enter the text churn = 0 in the Condition: box
Figure 7.29 Selecting Customers who have not Churned
Click OK
Add a Sort node to the stream to the right of the Select node
Connect the Select node to the Sort node
Edit the Sort node
Select $C-churn-1 as the first sort field, in descending order
Select $CP-1-1 as the second sort field, also in descending order
Figure 7.30 Sorting by Predicted Value and Probability to Churn
Click OK
Connect the Sort node to the Table node, replacing the connection
Edit the Cox generated model

To see survival five years into the future, taking into account current survival for each customer, we need to make only one change.
Change the Time interval value to 5.0
Figure 7.31 Settings to Predict Five Years into the Future taking into Account Current Survival
Since the number of time periods to score is still 1, this will score each record at five years into the future, dependent on past survival time up to the last data point recorded for that customer.
Click OK
Run the Table node attached to the Sort node
Figure 7.32 Cox Predictions Five Years Beyond Current Survival
There are now predictions that customers will churn ($C-churn-1=1). All these customers have a value of $CP-1-1 above .50. If we scroll down to row 71, we see that there are 71 customers predicted to churn. However, this number depends on the default cutoff value of .500, which you are free to adjust up or down as necessary. Decreasing the cutoff will cast a wider net to find more customers who might churn in the specified five-year time period. Customers are ordered by model-predicted probability of churning, so we can spend marketing dollars or other resources on contacting them in order of likelihood to churn. If you scroll down to record 3266 you will begin to see values of $null$ again for all the output fields. This occurs when a prediction is made outside the survival range of the model. As we all know from basic regression analysis, predicting outside the bounds of the predictor fields is to be avoided. In Cox regression, if we attempt to predict outside the bounds of the time field, the model will not pass a predicted value downstream. In every instance, these null values come about when current survival time is 36 years or greater. Given the few customers at large survival times, these customers can either be ignored or handled separately (see PASW Modeler Application Examples on Cox Regression for hints). To understand more about the model predictions into the future, we’ll examine the relationship between time and the model prediction with an overlay histogram.
Close the Table window
Add a Histogram node to the stream near the Sort node
Connect the Sort node to the Histogram node
Edit the Histogram node
Specify cardtenure as the Field
Specify $C-churn-1 as the Overlay Color field (not shown)
Click Options
Click Normalize by color
Click Run
Figure 7.33 Histogram of Survival by Model Prediction
When reviewing this chart, keep in mind that, based on current survival, we are predicting survival five years into the future. This explains why there is a sharp cutoff after 35 years, since 36 + 5 extends beyond the data range of cardtenure of 40 years. Most of the predictions that people will switch their primary credit card occur for customers with relatively low current survival times—under 10 years—which is consistent with past experience and common sense.
Close the Histogram window
If you wanted to see survival in a graph at cardtenure + 5 years, which is what the model is now predicting, you could use a Derive node to create a field with that equation and then use it in the Histogram in place of cardtenure. To recap how to make predictions with a Cox generated model, follow this advice:
1) If you use the default setting, you are predicting from time 0 one time period into the future. You usually don’t want this for existing customers, but this could certainly be appropriate for a data file with new customers.
2) If you want to predict survival at the current survival time for existing customers, specify the survival time field as the Time field on the Settings tab.
3) If you want to predict future survival, given past survival, specify the survival time field as the Past survival time field, and specify either time intervals and periods to score (if you score more than one period, you will get more than one prediction), or a Time field.
Summary Exercises
The exercises in this lesson use the data file customer_dbase.sav that we have been using throughout the lesson examples.
1. Begin with a current stream, or alternatively just begin a new stream to access the customer_dbase.sav data. Choose a completely different set of predictors than the demographic fields used in the lesson to predict churn, although you will continue to use cardtenure as survival time.
2. Estimate a Cox regression model with this new set of predictors. What are the significant predictors?
3. Choose a categorical predictor or two that are significant and plot survival curves for each value of the predictors. What did you learn?
4. Test the assumption of proportional hazards for your model with one or more categorical predictors. Is the assumption met or not?
5. Using your model, predict survival 3 and 6 years into the future. Select only those customers who have not churned. How many customers are predicted to churn 3 years into the future? At 6 years?
Lesson 8: Time Series Analysis

Objectives
• To explain what is meant by time series analysis
• To outline how time series models work
• To demonstrate the main principles behind a time-series forecasting model
• To forecast several series at one time
• To produce forecasts with a time series model on new observations
8.1 Introduction
It is often essential for organizations to plan ahead, and to do this it is important to forecast events to ensure a smooth transition into the future. To minimize errors when planning for the future, it is necessary to collect information, on a regular basis over time, on any factors which may influence plans. Once a catalogue of past and current information has been collected, patterns can be identified, and these patterns help make forecasts into the future. Even though many organizations may collect historic information relevant to the planning process, forecasts are often made on an ad-hoc basis. This often leads to large forecasting errors and costly mistakes in the planning process. Statistical techniques provide a more scientific basis upon which to base forecasts. By using these techniques, a more structured approach can be used to ensure careful planning, which will reduce the chance of making costly errors. Statisticians have developed a whole area of statistical techniques, known as time series analysis, which is devoted to forecasting.
Examples of Time Series Analysis
In order to understand how time series analysis works, it is useful to give an example. Suppose that a company wishes to forecast the growth of its sales into the future. The benefit of making the forecast is that if the company has an idea of future sales it can plan the production process for its product. In doing so, it can minimize the chances of underproducing and having product shortages or, alternatively, overproducing and having excess stock which will need to be stored at additional cost. Prior to being able to make the forecast, the company will need to collect information on its sales over time in order to gain a full picture of how sales have changed in the past. Once this information has been collected it is possible to plot how sales change over time. An example of this is shown in Figure 8.1. Here information on the sales of a product has been collected each month from January 1982 until December 1995.
Figure 8.1 Plot of Sales Over Time
This is a simple example that demonstrates the idea of time series. Time series analysis looks at changes over time. Any information collected over time is known as a time series; usually it is numerical information collected on a regular basis. One of the most common uses of time series analysis is to forecast future values of a series, and a number of statistical time series techniques are available for this purpose. In the above example the forecast would be the future values of sales. Some time series methods can also be used to identify which factors have been important in affecting the series you wish to forecast, for example, to determine whether an advertising campaign has had a significant and beneficial effect on sales. It is also possible to use time series analysis to quantify the likely impact of a change in advertising expenditure on future sales. Other examples of time series analysis and forecasting include:
• Governments using time series analysis to predict the effects of government policies on inflation, unemployment and economic growth.
• Traffic authorities analyzing the effect on traffic flows following the introduction of parking restrictions in city centers.
• The analysis of how stock market prices change over time. By being able to predict when stock market prices rise or fall, decisions can be made about the right times to buy and sell shares.
• Companies predicting the effects of pricing policies or increased advertising expenditure on the sales of their product.
• A company wishing to predict the number of telephone calls at different times during the day, so it can arrange the appropriate level of staffing.
Time series analysis is used in many areas of business, commerce, government and academia. A number of time series techniques can be found within the Time Series node in PASW Modeler. This node provides analysts with a flexible and powerful way to analyze time series data.
8.2 What is a Time Series?
A time series is a field whose values represent equally spaced observations of a phenomenon over time. Examples of time series include quarterly interest rates, monthly unemployment rates, weekly beer sales, annual sales of cigarettes, and so on. In terms of a data file, time periods constitute the rows (cases) in your file.
Time series analysis is usually based on aggregated data. If we take the monthly sales shown in Figure 8.1, each sale is recorded on a transactional basis with an attached date and/or time stamp. There is usually no business need for sales forecasts on a minute-by-minute basis, while there is often great interest in predicting sales on a weekly, monthly, or quarterly basis. For this reason, individual transactions and events are typically aggregated at equally spaced time points (days, weeks, months, etc.), and forecasting is based on these summaries. Also, most software programs that perform time series analysis, including PASW Modeler, expect each row (case) of data to represent a time period, while the columns contain the series to be forecast.
Classic time series analysis involves forecasting future values of a time series based on patterns and trends found in the history of that series (exponential smoothing and simple ARIMA) or on predictor fields measured over time (multivariate ARIMA, or transfer functions).
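The aggregation step described above is easy to see in a small, self-contained example. The sketch below uses Python with pandas (outside Modeler, purely for illustration); the names transactions, timestamp, and amount are hypothetical.

```python
# A minimal sketch of rolling transactions up to a monthly series.
import pandas as pd

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["1982-01-03", "1982-01-17", "1982-02-05"]),
    "amount": [120.0, 80.0, 95.0],
})

# One row per month; months with no transactions appear as missing values,
# keeping the series equally spaced.
monthly_sales = (
    transactions.set_index("timestamp")["amount"]
    .resample("MS")          # month-start frequency
    .sum(min_count=1)        # NaN (not 0) for empty months
)
print(monthly_sales)
```

Using min_count=1 keeps empty months as missing values rather than zeros, matching the requirement that periods with no data still appear as rows with missing values.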
Time Series Models versus Econometric Models Time Series models are models constructed without drawing on any theories concerning possible relationships between the fields. In univariate models, the movements of a field are explained solely in terms of its own past and its position with respect to time. ARIMA models are the premier time series models for single series. By way of contrast, econometric models are constructed by drawing on theory to suggest possible relationships between fields. Given that you can specify the form of the relationship, econometrics provides methods for estimating the parameters, testing hypotheses, and producing predictions. Your model might consist of a single equation, which can be estimated by some variant of regression, or a system of simultaneous equations, which can be estimated by two-stage least squares or some other technique.
The Classical Regression Model The classical linear regression model is the conventional starting point for time series and econometric methods. Peter Kennedy, in A Guide to Econometrics (5th edition, 2003, MIT Press), provides a convenient statement of the model in terms of five assumptions:
• The dependent variable can be expressed as a linear function of a specific set of independent variables plus a disturbance term (error).
• The expected value of the disturbance term is zero.
• The disturbances have a constant variance and are uncorrelated.
• The observations on the independent variable(s) can be considered fixed in repeated samples.
• The number of observations exceeds the number of independent variables, and there are no exact linear relationships between the independent variables.
While regression can serve as a point of departure for both time series and econometric models, it is incumbent on you (the researcher) to generate the plots and statistics that will give some indication of whether the assumptions are being met in a particular context.
Assumption 1 is concerned with the form of the specification of the model. Violations of this assumption include omission of important regressors (predictors), inclusion of irrelevant regressors, models nonlinear in the parameters, and varying coefficient models.
When assumption 2 is violated, the intercept is biased.
Assumption 3 assumes constant variance (homoscedasticity) and no autocorrelation. (Autocorrelation is the correlation of a variable with itself at a fixed time lag.) Violations of the assumption are the reverse: non-constant variance (heteroscedasticity) and autocorrelation.
Assumption 4 is often called the assumption of fixed or nonstochastic independent variables. Violations of this assumption include errors of measurement in the variables, use of lagged values of the dependent variable as regressors (common in time series analysis), and simultaneous equation models.
Assumption 5 has two parts. If the number of observations does not exceed the number of independent variables, then your problem has a necessary singularity and your coefficients are not estimable. If there are exact linear relationships between independent variables, software might protect you from the consequences. If there are near-exact linear relationships between your independent variables, you face the problem of multicollinearity.
In regression, parameters can be estimated by least squares. Least squares methods do not make any assumptions about the distribution of the disturbances. When you make the assumptions of the classical linear regression model and add to them the assumption that the disturbances are normally distributed, the regression estimators are maximum likelihood (ML) estimators. It can also be shown that the least-squares methods produce Best Linear Unbiased (BLU) estimates. The BLU and ML properties allow estimation of the standard errors of the regression coefficients and the standard error of the estimate, and therefore enable the researcher to do hypothesis testing and calculate confidence intervals.
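As a concrete illustration of the kind of checking recommended above, the sketch below (Python with statsmodels, on simulated data; not part of the Modeler workflow) fits an ordinary least squares regression and then applies the Durbin-Watson statistic to the residuals as a quick check on the no-autocorrelation part of assumption 3.

```python
# A sketch of regression residual diagnostics on simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
x = np.arange(100, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(size=100)   # linear trend plus noise

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Durbin-Watson near 2 suggests no lag-1 autocorrelation in the residuals;
# values toward 0 or 4 suggest positive or negative autocorrelation.
print(fit.summary())
print("Durbin-Watson:", durbin_watson(fit.resid))
```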
8.3 A Time Series Data File
To show you what a time series data file looks like, we open a PASW Modeler stream.
Click File…Open Stream and move to the c:\Train\ModelerPredModel folder
Double-click on Time Series Intro.str
Run the Table node
Figure 8.2 A Time Series Data File
Each column in the data editor corresponds to a given field. The important point concerning the organization of time series data is that each row in the Table window corresponds to a particular period of time, and each row must therefore represent a sequential time period. The above example shows a data file containing monthly data for sales starting in January 1982. To use standard time series methods it is important to collect, or at least be able to summarize, the information over equal time periods. Within a time series data file it is essential that the rows represent equally spaced time periods. Even time periods for which no data were collected must be included as rows in the data file (with missing values for the fields).
The Time Plot Chart
The data file contains the recorded sales of a product over a fourteen-year period. The simplest way of identifying patterns in your data is to plot your information over the relevant time period, and this is essential for time series analysis. In PASW Modeler a sequence chart called a Time Plot chart is used to show how time series change over time. The Time Plot chart plots the value of the field of interest on the vertical axis, with time represented on the horizontal axis. A Time Plot chart can show several fields (series) on the same chart. Points are joined up to display a line graph which shows any patterns in your data.
Close the Table window
Double-click on the Time Plot node to the right of the Time Intervals node to open it
Use the Field selector tool to select Sales
Figure 8.3 Time Plot Dialog
Click Run
There is an option to Display series in separate panels, which can be used to generate a separate chart for each series if you want to plot several of them at once. If you do not check this option, all fields are plotted on one chart. Figure 8.4 shows how sales have changed over the fourteen years.
Figure 8.4 Sequence Plot of Sales
The sequence chart is the most powerful exploratory tool in time series analysis, and it can be used to identify trend, seasonal and cyclical patterns in a time series. There is a clear regularity (repeating pattern) to the time series, and the volume of sales generally increases over time. These are the key features we will need to model.
8.4 Trend, Seasonal and Cyclic Components After identifying important patterns that have occurred in the past, time series analysis uses this information to forecast into the future. In Figure 8.4 there are clear patterns in past sales. These patterns can be divided into three main categories: trend, seasonal components and cycles.
Trend Patterns
Trend refers to the smooth upward or downward movement characterizing a time series over a long period of time. This type of movement is particularly reflective of the underlying continuity of fundamental demographic and economic phenomena. Trend is sometimes referred to as secular trend, where the word secular is derived from the Latin word saeculum, meaning a generation or age. Hence, trend movements are thought of as long-term movements, usually requiring 15 or 20 years to describe (or the equivalent for series with more frequent time intervals). Trend movements might be attributable to factors such as population change, technological progress, and large-scale shifts in consumer tastes.
For example, if we could examine a time series on the number of pairs of shoes produced in the United States extending annually, say, from the 1700s until the present, we would find an underlying trend of growth throughout the entire period, despite fluctuations around this general upward movement. If we compared recent figures against those near the beginning of the series, we would find the recent numbers are much larger. This is because of the increase in population, because of the technical advances in shoe-producing equipment enabling vastly increased
levels of production, and because of shifts in consumer tastes and levels of affluence which have meant a larger per capita requirement of shoes than in the earlier time. In Figure 8.4 there is a clear upward trend in the data, as sales have continued to increase from 1982 until 1995, albeit less pronounced from the beginning of 1991.
Cyclical Patterns
Cyclical patterns (or fluctuations), or business cycle movements, are recurrent up and down movements around the trend levels which have a duration of anywhere from about 2 to 15 years. The duration of these cycles can be measured in terms of their turning points, or in other words, from trough to trough or peak to peak. These cycles are recurrent rather than strictly periodic. The height and length (amplitude and duration) of cyclical fluctuations in industrial series differ from those of agricultural series, and there are differences within these categories and within individual series. Hence, cycles in durable goods activity generally display greater relative fluctuations than consumer goods activity, and a particular time series of, say, consumer goods activity may possess business cycles which have considerable variations in both duration and amplitude.
Economists have produced a large number of explanations of business cycle fluctuations, including external theories which seek the causes outside the economic system, and internal theories in terms of factors within the economic system that lead to self-generating cycles. Since it is clear from the foregoing discussion that there is no single simple explanation of business cycle activity and that there are different types of cycles of varying length and size, it is not surprising that no highly accurate method of forecasting this type of activity has been devised. Indeed, no generally satisfactory mathematical model has been constructed for either describing or forecasting these cycles, and perhaps never will be.
Therefore, it is not surprising to find that classical time series analysis adopts a relatively rough approach to the statistical measurement of the business cycle. The approach is a residual one; that is, after trend and seasonal variations have been eliminated from a time series, by definition, the remainder or residual is treated as being attributable to cyclical and irregular factors. Since the irregular movements are by their very nature erratic and not particularly tractable to statistical analysis, no explicit attempt is usually made to separate them from cyclical movements, or vice versa. However, the cyclical fluctuations are generally large relative to these irregular movements, and ordinarily no particular difficulty in description or analysis arises from this source. Therefore, unless you have data available over a long period of time, cyclic patterns are not usually fit by forecasting models.
Seasonal Patterns
Seasonal variations are periodic patterns of movement in a time series. Such variations are considered to be a type of cycle that completes itself within the period of a calendar year, and then continues in a repetition of this basic pattern. The major factors in this seasonal pattern are weather and customs, where the latter term is broadly interpreted to include patterns in social behavior as well as observance of various holidays such as Christmas and Easter.
Series of monthly or quarterly data are ordinarily used to examine these seasonal patterns. Hence, regardless of trend or cyclical levels, one can observe in the United States that each year more ice cream is sold during the summer months than during the winter, whereas more fuel oil for home heating purposes is consumed in the winter than during the summer months. Both of these cases illustrate the effect of climatic factors in determining seasonal patterns. Also, department store sales generally reveal a minor peak during the months in which Easter occurs and a larger peak in December, when Christmas occurs, reflecting the shopping customs of consumers associated with these dates.
Seasonal patterns need not be linked to a calendar year. For example, if we studied the daily volume of packages delivered by a private delivery service, the periodic pattern might well repeat weekly (heavier deliveries mid-week, lighter deliveries on the weekend). Here the period for the seasonal pattern could be seven days. Of course, if daily data were collected over several years, then there may well be a yearly pattern as well, and just which time period constitutes a season is no longer clear. The number of time periods that occur during the completion of a seasonal pattern is referred to as the series periodicity. How often the time series data are collected usually depends on the type of seasonality that the analyst expects to find.
• For hourly data, where data are collected once an hour, there is usually one seasonal pattern every twenty-four hours. The periodicity is most likely to be 24.
• For monthly data, where each month a new time period of data is collected, there is usually one seasonal pattern every twelve months. The periodicity is thus likely to be 12.
• For daily data, where data are collected once every day, there is usually one seasonal pattern per week. The periodicity is therefore 7 if the data refer to a seven-day week or 5 if no data are collected on Saturdays and Sundays.
• For quarterly data, where data are collected once every three months, there is usually one seasonal pattern per year. The periodicity is therefore 4.
• For annual data, where data are collected once a year, there is no seasonal pattern. The periodicity is therefore none (undefined).
Of course, changes can occur in seasonal patterns because of changing institutional and other factors. Hence, a change in the date of the annual automobile show can change the seasonal pattern of automobile sales. Similarly, the advent of refrigeration techniques with the corresponding widespread use of home refrigerators has brought about a change of seasonal pattern of ice cream sales. The techniques of measurement of seasonal variation which we will discuss are particularly well suited to the measurement of relatively stable patterns of seasonal variation, but can be adapted to cases of changing seasonal movements as well. In Figure 8.4, there appears to be a rise in sales during the early part of the year while sales tend to fall to a low around November. Finally, there is some recovery in sales leading up to the Christmas period of each year.
Irregular Movements Irregular movements are fluctuations in time series that are erratic in nature, and follow no regularly recurrent or other discernible pattern. These movements are sometimes referred to as residual variations, since, by definition, they represent what is left over in an economic time series after trend, cyclical, and seasonal elements have been accounted for. These irregular fluctuations result from sporadic, unsystematic occurrences such as wars, earthquakes, accidents, strikes, and the like. In the classical time series model, the elements of trend, cyclical, and seasonal variations are viewed as resulting from systematic influences leading to gradual growth, decline, or recurrent movements. Irregular movements, however, are considered to be so erratic that it would be fruitless to attempt to describe them in terms of a formal model. Irregular movements can result from a large number of causes of widely differing impact.
8.5 What is a Time Series Model? A time series model is a tool used to predict future values of a series by analyzing the relationship between the values observed in the series and the time of their occurrence. Time series models can be developed using a variety of time series statistical techniques. If there has been any trend and/or
seasonal variation present in the data in the past, then time series models can detect this variation, use this information to fit the historical data as closely as possible, and in doing so improve the precision of future forecasts.
Time series techniques in PASW Modeler can be categorized in the following ways:
• Pure time series models: Exponential Smoothing
• Causal time series models: Linear Time Series Regression, Intervention Analysis
• Both pure and causal: ARIMA
Pure Versus Causal Time Series Models
A distinction can be made between pure and causal time series models.
Pure time series models
Pure time series models utilize information solely from the series itself. In other words, pure time series forecasting makes no attempt to discover the factors affecting the behavior of a series. For example, if the aim were to forecast future sales for a product, then a pure time series model would use just the data collected on sales. Information on other explanatory forces, such as advertising expenditure and economic conditions, would not be used when developing a pure time series model. In such models it is assumed that some pattern or combination of patterns in the series to be forecast recurs over time. By identifying and extrapolating that pattern, forecasts can be developed for subsequent time periods.
The main advantage of pure time series modeling is that it is a quick and simple way of developing a forecast model. Also, such models rely upon little statistical theory. One obvious disadvantage of pure time series models, such as exponential smoothing, is that they cannot identify important factors influencing the series. Another drawback is that it is not possible to accurately predict the impact of any decisions taken by an organization on the future values of the series.
Causal time series models
Causal time series models, such as regression and ARIMA, incorporate data on influential factors to help predict future values of a series. In such models, a relationship is modeled between a target field (the time series being predicted), time, and a set of predictor fields (other associated factors also measured over time). The first task of forecasting is to find the cause-and-effect relationship. In our sales example, a causal time series technique such as regression would indicate whether advertising expenditure or the price of the product has been an important influence on sales and, if it has, whether each factor has had a positive or negative influence on sales. The real advantage of an explanatory model is that a range of forecasts corresponding to a range of values for the different fields can be developed. For example, causal time series models can assess the effect that a $100,000 increase in advertising expenditure, or alternatively a $150,000 increase, will have on future sales.
The main drawbacks of causal time series models are that they require information on several fields in addition to the target that is being forecast and usually take longer to develop. Furthermore, the
model may require estimation of the future values of the independent factors before the target can be forecast.
8.6 Interventions
Time series may experience sudden shifts in level, upward or downward, as a result of external events. For example, sales volume may briefly increase as the result of a direct marketing campaign or a discount offering. If sales were limited by a company’s capacity to manufacture a product, then bringing a new plant online would shift the sales level upward from that date onward. Similarly, changes in tax laws or pricing may shift the level of a series. The idea here is that some outside intervention resulted in a shift in the level of the series.
In this context, a distinction is made between a pulse, that is, a sudden, temporary shift in the series level, and a step, a sudden, permanent shift in the series level. A bad storm, or a one-time, 30-day rebate offer, might result in a pulse, while a change in legislation or a large competitor’s entry into a market could result in a step change to the series.
Time series models are designed to account for gradual, not sudden, change. As a result, they do not natively fit pulse and step effects very well. However, if you can identify events (by date) that you believe are associated with pulse or step effects, they can be incorporated into time series models (they are called intervention effects) and forecasts.
Below we see an example of a pulse intervention. In April 1975 a one-time tax rebate occurred in an attempt to stimulate the US economy, then in recession. Note that the savings rate reached its maximum (9.7%) during this quarter. The intervention can be modeled and used in scenarios to assess the effect of a tax rebate on savings rates in the future.
Figure 8.5 U.S. Savings Rate (Seasonally Adjusted): Tax Rebate in April 1975
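If you code interventions yourself, a pulse and a step are simply indicator (dummy) fields. Here is a minimal Python/pandas sketch, using the April 1975 rebate date from the example above (the quarterly date range is illustrative):

```python
# A pulse is 1 in a single period; a step is 1 from a given period onward.
import pandas as pd

periods = pd.period_range("1974Q1", "1976Q4", freq="Q")

pulse = (periods == pd.Period("1975Q2", freq="Q")).astype(int)  # one-time rebate
step = (periods >= pd.Period("1975Q2", freq="Q")).astype(int)   # permanent shift

interventions = pd.DataFrame({"pulse": pulse, "step": step}, index=periods)
print(interventions)
```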
8.7 Exponential Smoothing
The Expert Modeler in PASW® Forecasting considers two classes of time series models when searching for the best forecasting model for your data: exponential smoothing and ARIMA. In this section we provide a brief introduction to simple exponential smoothing.
Exponential smoothing is a time series technique that can be a relatively quick way of developing forecasts. It is a pure time series method, which means that the technique is suitable when data have only been collected for the series that you wish to forecast. In comparison, ARIMA models can accommodate predictor fields and intervention effects.
Exponential smoothing takes the approach that recent observations should have relatively more weight in forecasting than distant observations. “Smoothing” implies predicting an observation by a weighted combination of the previous values. “Exponential” smoothing implies that the weights decrease exponentially as the observations get older. “Simple” (as in simple exponential smoothing) implies that a slowly changing level is all that is being modeled. Exponential smoothing can be extended to model different combinations of trend and seasonality, and the technique implements many models in this fashion.
An analyst using custom exponential smoothing typically examines the series to make some broad characterizations (is there trend, and if so what type? Is there seasonality [a repeating pattern], and if so what type?) and fits one or more models. The best-fitting model is then extrapolated into the future to make forecasts. One of the main advantages of exponential smoothing is that models can be easily
constructed. The type of exponential smoothing model developed will depend upon the seasonal and trend patterns inherent in the series you wish to forecast. An analyst building a model might simply observe the patterns in a sequence chart to decide which type of exponential smoothing model is the most promising one to generate forecasts. In PASW Forecasting, when the Expert Modeler examines the series, it considers all appropriate exponential smoothing models when searching for the most promising time series model. Simple exponential smoothing (no trend, no seasonality) can be described in two algebraically equivalent ways. One common formula, known as the recurrence form, is as follows:
S(t) = α · y(t) + (1 − α) · S(t−1)

The forecast is then

y(m) = S(t)

where y(t) is the observed value of the time series in period t, S(t−1) is the smoothed level of the series at time t−1, α (alpha) is the smoothing parameter for the level of the series, S(t) is the smoothed level of the series at time t, computed after y(t) is observed, and y(m) is the model-estimated m-step-ahead forecast at time t.
Intuitively, the formula states that the current smoothed value is obtained by combining information from two sources: the current point and the history embodied in the series. Alpha (α) is a weight ranging between 0 and 1. The closer alpha is to 1, the more exponential smoothing weights the most recent observation and the less it weights the historical pattern of the series. The smoothed value for the current case becomes the forecast value. This is the simplest form of an exponential smoothing model. As mentioned above, extensions of the exponential smoothing model can accommodate several types of trend and seasonality, yielding a general model capable of fitting single-series data.
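The recurrence form translates almost line for line into code. The following is a minimal Python sketch of simple exponential smoothing (illustrative only; in Modeler the estimation of alpha is handled for you, whereas here alpha is supplied by hand and the level is initialized at the first observation):

```python
# A direct transcription of the recurrence form: each smoothed value
# combines the current observation with the previous smoothed level.
def simple_exponential_smoothing(series, alpha):
    """Return smoothed levels; the last level is the forecast."""
    s = series[0]                      # initialize level at the first value
    levels = [s]
    for y in series[1:]:
        s = alpha * y + (1 - alpha) * s
        levels.append(s)
    return levels

sales = [112, 118, 132, 129, 121, 135]   # invented monthly values
levels = simple_exponential_smoothing(sales, alpha=0.3)
forecast = levels[-1]                  # y(m) = S(t) for all future periods m
print(forecast)
```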
8.8 ARIMA
Many of the ideas that have been incorporated into ARIMA models were developed in the 1970s by George Box and Gwilym Jenkins, and for this reason ARIMA modeling is sometimes called Box-Jenkins modeling. ARIMA stands for AutoRegressive Integrated Moving Average, and the assumption of these models is that the variation accounted for in the series field can be divided into three components:
• Autoregressive (AR)
• Integrated (I) or Difference
• Moving Average (MA)
An ARIMA model can have any component, or combination of components, at both the nonseasonal and seasonal levels. There are many different types of ARIMA models, and the general form of an ARIMA model is ARIMA(p,d,q)(P,D,Q), where:
• p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA model (and P the order of the seasonal autoregressive process)
• d refers to the order of nonseasonal integration or differencing (and D the order of the seasonal integration or differencing)
• q refers to the order of the nonseasonal moving average process incorporated in the model (and Q the order of the seasonal moving average process).
So, for example, an ARIMA(2,1,1) would be a nonseasonal ARIMA model where the order of the autoregressive component is 2, the order of integration or differencing is 1, and the order of the moving average component is also 1. ARIMA models need not have all three components. For example, an ARIMA(1,0,0) has an autoregressive component of order 1 but no difference or moving average component. Similarly, an ARIMA(0,0,2) has only a moving average component of order 2.
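Outside Modeler, the same (p,d,q) notation is used directly by other tools. As an illustration, the sketch below fits an ARIMA(2,1,1) to a simulated trending series using the statsmodels library in Python:

```python
# Fit an ARIMA(2,1,1) to a simulated series; purely illustrative.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=120))  # trending series

model = ARIMA(y, order=(2, 1, 1))   # (p, d, q) as in the text
result = model.fit()
print(result.summary())
print(result.forecast(steps=3))     # three periods beyond the data
```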
Autoregressive
In a similar way to regression, ARIMA models use input fields to predict a target field (the series field). The name autoregressive implies that series values from the past are used to predict the current series value. In other words, the autoregressive component of an ARIMA model uses the lagged values of the series target, that is, values from previous time points, as predictors of the current value of the series. For example, it might be the case that a good predictor of current monthly sales is the sales value from the previous month.
The order of autoregression refers to the time difference between the series target and the lagged series target used as a predictor. If the series target is influenced by the series target two time periods back, then this is an autoregressive model of order two and is sometimes called an AR(2) process. An AR(1) component of the ARIMA model is saying that the value of the series target in the previous period (t−1) is a good indicator and predictor of what the series will be now (at time period t). This pattern continues for higher-order processes. The equation representation of a simple autoregressive model (AR(1)) is:
y(t) = Φ1 · y(t−1) + e(t) + a

Thus the series value at the current time point, y(t), is equal to the sum of: (1) the previous series value, y(t−1), multiplied by a weight coefficient, Φ1; (2) a constant a (representing the series mean); and (3) an error component at the current time point, e(t).
Moving Average The autoregressive component of an ARIMA model uses lagged values of the series values as predictors. In contrast to this, the moving average component of the model uses lagged values of the model error as predictors. Some analysts interpret moving average components as outside events or shocks to the system. That is, an unpredicted change in the environment occurs, which influences the current value in the series as well as future values. Thus the error component for the current time period relates to the series’ values in the future. The order of the moving average component refers to the lag length between the error and the series target. For example, if the series target is influenced by the model’s error lagged one period, then this is a moving average process of order one and is sometimes called an MA(1) process. An MA(1) model would be expressed as:
y(t) = Φ1 · e(t−1) + e(t) + a

Thus the series value at the current time point, y(t), is equal to the sum of several components: (1) the previous time point’s model error, e(t−1), multiplied by a weight coefficient (here Φ1); (2) a constant (representing the series mean); and (3) an error component at the current time point, e(t).
Integration
The Integration (or Differencing) component of an ARIMA model provides a means of accounting for trend within a time series model. Creating a differenced series involves subtracting adjacent series values in order to evaluate the remaining components of the model. The trend removed by differencing is later built back into the forecasts by integration (reversing the differencing operation). Differencing can be applied at the nonseasonal or seasonal level, and successive differencing, although relatively rare, can be applied. The form of a differenced series (nonseasonal) would be:
x(t) = y(t) − y(t−1)

Thus the differenced series value, x(t), is equal to the current series value, y(t), minus the previous series value, y(t−1).
Multivariate ARIMA
ARIMA also permits a series to be predicted from values in other data series. The relations may be at the same time point (for example, a company’s spending on advertising this month influences the company’s sales this month) or in a leading or lagging fashion (for example, the company’s spending on advertising two months ago influences the company’s sales this month). Multiple predictor series can be included at different time lags. A very simple example of a multivariate ARIMA model appears below:
y(t) = b1 · x(t−1) + e(t) + a

Here the series value at the current time point, y(t), is equal to the sum of several components: (1) the value of the predictor series at the previous time point, x(t−1), multiplied by a weight coefficient, b1; (2) a constant; and (3) an error component at the current time point, e(t). In a practical context, this model could represent monthly sales of a new product as a function of direct marketing spending the previous month. Complex ARIMA models that include other predictor series, autoregressive, moving average, and integration components can be built in the Time Series node.
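Differencing and integration are exact inverses, which a few lines of Python/numpy make concrete (an illustration, not part of the Modeler stream):

```python
# np.diff removes a trend; a cumulative sum plus the first value reverses it.
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

x = np.diff(y)                                             # x(t) = y(t) - y(t-1)
y_rebuilt = np.concatenate(([y[0]], y[0] + np.cumsum(x)))  # integration

print(x)          # [2. 3. 4. 5.]
print(y_rebuilt)  # matches y exactly
```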
8.9 Data Requirements
In time series analysis, each time period at which the data are collected yields one sample point to the time series. The idea is that the more sample points you have, the clearer the picture of how the series behaves. It is not reasonable to collect just two months’ worth of data on the sales of a product and, on the basis of this, expect to be able to forecast two years into the future. This is because your sample size is only two (one sixth of the seasonal span) and you wish to forecast 24 data points, or months, ahead (two full seasonal spans). Therefore the way to view the collection of time series information is that the more data points you have, the greater your understanding of the past will be, and the more information you have to use to predict future values in the series.
The first important question to be answered is how many data points are required before it is possible to develop time series forecasts. Unfortunately, there is no clear-cut answer to this, but the following factors influence the minimum amount of data required:
• Periodicity
• How often the data are collected
• Complexity of the time series model
It is important to note that some time series techniques incorporating seasonal effects require several seasonal spans of time series data before it is possible to use them. Usually four or more seasons of data is a good rule of thumb when attempting seasonal modeling. For example, four years (seasonal spans) of quarterly or monthly data would be sufficient, as there are four replications of the time period. At the same time, four years of annual data is not enough historic data, as the sample size is only four. The four-year rule is not, however, a rigid rule, as time series models can be developed and used for forecasting with less historic data.
Two final thoughts: first, the more complex the time series model, the larger the time series sample size should be. Second, time series models assume that the same patterns appear throughout the series. If you are fitting a long series in which a dramatic change occurred that might influence the fundamental relations that exist over time (for example, deregulation in the airline and telecom industries), you may obtain more accurate predictions using only the recent (after the change) data to develop the forecasts.
8.10 Automatic Forecasting in a Production Setting
In common data mining applications, analysts need to create forecasts for dozens of series on a regular basis. Typical examples are inventory control for many different products/parts, or demand forecasting within segments of customers (geographical, customer type, etc.). In principle, this task is no more complex than what we have already reviewed. But in practice, it can be demanding simply because of the large number of series which could require data exploration, checking of residuals, etc.
Fortunately, the Expert Modeler in the Time Series node will automatically find a best-fitting model for any number of series that are added to the target list, with little work on your part (you can also use one or more predictor fields that would apply to all the target series). Although you could, if you had the time, do some preliminary work to determine the characteristics of the series, if you need to make regular forecasts on a weekly or monthly basis, it is likely that you won’t have the time to devote to this effort.
After models are fit to several series (each series will have its own unique model), you can then easily apply those models in the future, without having to re-estimate or rebuild the models. This is very time efficient. Of course, when enough time passes, you will most likely want to re-estimate the models, in case any fundamental processes have changed in the drivers of specific series.
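For readers who want a feel for what such a production loop involves, here is a minimal Python sketch of the same idea: fit one model per series, store the fitted models, and produce three-month forecasts. It is only an analogy to the Expert Modeler; for brevity every series gets a Holt's linear trend model here, whereas the Expert Modeler chooses a (possibly different) model for each series, and the data and market names below are simulated.

```python
# Batch forecasting sketch: one fitted model per series, kept for re-use.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(1)
idx = pd.date_range("1999-01-01", periods=60, freq="MS")
data = pd.DataFrame(
    {f"Market_{i}": np.cumsum(rng.normal(50, 5, size=60)) for i in range(1, 6)},
    index=idx,
)

models, forecasts = {}, {}
for name, series in data.items():
    fit = ExponentialSmoothing(series, trend="add").fit()  # Holt's linear trend
    models[name] = fit                 # store for re-use on the next cycle
    forecasts[name] = fit.forecast(3)  # three months ahead

print(pd.DataFrame(forecasts))
```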
8.11 Forecasting Broadband Usage in Several Markets
Our example of production forecasting involves a national broadband provider who wants to produce forecasts of user subscriptions in order to predict bandwidth usage. To keep the example relatively manageable, we will use only five time series in the example, although there are 85 series altogether (but we also forecast the total for all series). The file broadband_1.sav contains the monthly number of subscribers for each series from January 1999 to December 2003. After fitting models to the series, we want to produce forecasts for the next three months, which will be adequate to prepare for changes in demand/usage. We’ll open the data file and do some data exploration.
Close the Time Plot graph window
Click on File…Open Stream
Double-click on the file broadband1
Run the Table node
Figure 8.6 Broadband Time Series Data
The file contains information on 85 markets. Rather than looking at all of them at once, we will focus only on Markets 1 through 5. The Filter node to the right of the source node will filter out the markets we don’t need.
Close the Table window
Double-click the Filter node
Figure 8.7 Filter Node Dialog
The next step is to examine sequence charts of each series, but before doing so we will need to define the periodicity of each series. The Time Series modeling node, and the Time Plot node, require that the periodicity be defined. This is accomplished in the Time Intervals node, which is found in the Field Ops palette.
Place a Type node to the right of the Filter node
Connect the Filter node to the Type node
Place a Time Intervals node to the right of the Type node
Connect the Type node to the Time Intervals node
Double-click on the Time Intervals node
Figure 8.8 Time Intervals Dialog
The Time Intervals node allows you to specify intervals and generate labels for time series data to be used in a Time Series model or a Time Plot node. A full range of time intervals is supported, from seconds to years. You can also specify the periodicity (for example, five days per week or eight hours per day). In this node you can specify the range of records to be used for estimating a model, and you can choose whether to exclude the earliest records in the series and whether to specify holdouts. Doing so enables you to test the model by holding out the most recent records in the time series data in order to compare their known values with the estimated values for those periods. You can also specify how many time periods into the future you want to forecast, and you can specify future values for use in forecasting by downstream Time Series modeling nodes.
The Time Interval dropdown is used to define the periodicity of the series. By default it is set to None. While it is not absolutely required that you specify a periodicity, unless you do so the Expert Modeler will not consider models that adjust for seasonal patterns. In this case, because we have collected data on a monthly basis, we will define our time interval as months.
Click on the Time Interval dropdown and select Months
Figure 8.9 Time Intervals Dialog with Periodicity Defined
The next step is to label the intervals. You can either start labeling from the first record, which in the case of this data file is January 1999, or build the labels from a field that identifies the time or date of each measurement.
In order to use the Start labeling from first record method, you must specify the starting date and/or time to label the records. This method assumes that the records are already equally spaced, with a uniform interval between each measurement. Any missing measurements would be indicated by empty rows in the data.
You can use the Build from data method for series that are not equally spaced. PASW Modeler will automatically impute values for the missing time points so that the series will have equally spaced intervals. However, this method requires a date, time, or timestamp field in the appropriate format to use as input. For example, if you have a string field with values like Jan 2000, Feb 2000, etc., you can convert it to a date field using a Filler node. This is the method that we are going to use.
Click OK
Insert a Filler node between the Filter node and the Type node
Figure 8.10 Stream After Adding the Filler Node
Double-click on the Filler node
Select DATE_ in the Fill in fields box
Select Always from the Replace: dropdown
Type, or use the expression builder to insert, to_date(DATE_) in the Replace with: box
Figure 8.11 Completed Filler Node
Click OK
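For comparison only, the same two preparation steps (parsing the string dates and enforcing equally spaced monthly rows) look like this in Python/pandas; the sample values are invented, and DATE_ mirrors the field name in the stream:

```python
# Parse string dates, then re-index so every month is present.
import pandas as pd

df = pd.DataFrame({"DATE_": ["Jan 1999", "Feb 1999", "Apr 1999"],
                   "Market_1": [3750, 3846, 4012]})

df["DATE_"] = pd.to_datetime(df["DATE_"], format="%b %Y")  # like to_date(DATE_)
df = df.set_index("DATE_").asfreq("MS")  # inserts a missing-value row for Mar

print(df)
```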
Next, let’s set up the Type node so that the role for all the target series we want to forecast is set to Target, and the role for the newly converted DATE_ field is set to None. We will also need to instantiate the data.
Double-click on the Type node
Set the role on all the fields from Market_1 to Total to Target
Set the role on the DATE_ field to None
Click the Read Values button to instantiate the data
Figure 8.12 Completed Type Node
Click OK
Now we can complete the Time Intervals settings.
Double-click on the Time Intervals node
Click on Build from data
Use the menu on the Field: option to select DATE_
Figure 8.13 Time Intervals Dialog with Date Field added
The New field name extension is used to apply either a prefix or suffix to the new fields generated by the node. By default it is the prefix $TI_.
Click on the Build tab
Figure 8.14 Build Tab Dialog
The Build tab allows you to specify options for aggregating or padding fields to match the specified interval. These settings apply only when the Build from data option is selected on the Intervals tab. For example, if you have a mix of weekly and monthly data, you could aggregate or “roll up” the weekly values to achieve a uniform monthly interval. Alternatively, you could set the interval to weekly and pad the series by inserting blank values for any weeks that are missing, or by extrapolating missing values using a specified padding function.
When you pad or aggregate data, any existing date or timestamp fields are effectively superseded by the generated TimeLabel and TimeIndex fields and are dropped from the output. Typeless fields are also dropped. Fields that measure time as a duration are preserved (such as a field that measures the length of a service call rather than the time the call started) as long as they are stored internally as time fields rather than timestamps.
Click the Estimation tab
Figure 8.15 Estimation Tab Dialog
The Estimation tab of the Time Intervals node allows you to specify the range of records used in model estimation, as well as any holdouts. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually.
The Begin Estimation option is used to specify when you want the estimation period to begin. You can either begin the estimation period at the beginning of the data or exclude older values that may be of limited use in forecasting. Depending on the data, you may find that shortening the estimation period speeds up performance (and reduces the amount of time spent on data preparation), with no significant loss in forecasting accuracy.
The End Estimation option allows you to either estimate the model using all records up to the end of the data or “hold out” the most recent records in order to evaluate the model. For example, if you hold out the last three records and then specify 3 for the number of records to forecast, you are effectively “forecasting” values that are already known, allowing you to compare observed and predicted values to gauge the model’s effectiveness at forecasting into the future. We will use the default settings.
Click the Forecast tab
Figure 8.16 Forecast Tab Dialog
The Forecast tab of the Time Intervals node allows you to specify the number of records you want to forecast and to specify future values for use in forecasting by downstream Time Series modeling nodes. These settings may be overridden in downstream modeling nodes as needed, but again, specifying them here may be more convenient than specifying them for each node individually.
The Extend records into the future option lets you specify the number of time points you wish to forecast beyond the estimation period. Note that these time points may or may not be in the future, depending on whether or not you held out some historic data for validation purposes. For example, if you hold out 6 records and extend 7 records into the future, you are forecasting 6 holdout values and only 1 future value. The Future indicator field is used to label the generated field that indicates whether a record contains forecast data. The default value for the label is $TI_Future.
The Future Values to Use in Forecasting option allows you to specify future values for any predictor fields you use. Future values are required for each record that you want to forecast, excluding holdouts. For example, if you are forecasting next month's revenues for a hotel based on the number of reservations, you need to specify the number of reservations you actually expect. Note that fields selected here may or may not be used in modeling; to actually use a field as a predictor, it must be selected in a downstream modeling node. This dialog box simply gives you a convenient place to specify future values so they can be shared by multiple downstream modeling nodes without specifying them separately in each node. Also note that the list of available fields may be constrained by selections on the Build tab. For example, if Specify fields and functions is selected on the Build tab, any fields not aggregated or padded are dropped from the stream and cannot be used in modeling. The Future value functions option lets you choose from a list of functions, or specify a value of your own; for example, you could set the value to the most recent value. The available functions depend on the measurement level of the field.
Click on the Extend records into the future check box
Specify that you would like to forecast 3 records beyond the estimation period
Figure 8.17 Completed Forecast Tab Dialog
Click OK
The next step is to examine each series with a sequence chart. We will display all the fields on the same chart.
Place a Time Plot node from the Graphs palette below the Time Intervals node
Attach the Time Intervals node to the Time Plot node
Double-click on the Time Plot node
Select all the series from Market_1 to Total
Uncheck the Display series in separate panels box
Figure 8.18 Completed Time Plot Dialog
Click Run
Figure 8.19 Sequence Chart Output for Each Series
From this graph it is clear that broadband usage increased rapidly in the US over this period, so we see a steady, very smooth increase for all fields. The numbers for Market_3 do begin to dip in the last couple of months, but perhaps this is temporary.
There is clearly no seasonality in these data, which makes sense: the number of broadband subscriptions does not rise and fall throughout the year. Using this fact, we can reduce the time needed for the Expert Modeler to fit models to these series, since requesting that seasonality be considered increases processing time. Additionally, because the series we’ve viewed here are so smooth, with no obvious outliers, we will not request outlier detection. This will also save on processing time. Note, though, that if you are in doubt about this, it is safer to use outlier detection during modeling.
Close the Time Plot graph
Place a Time Series node from the Modeling palette near the Time Intervals node
Connect the Time Intervals node to the Time Series node
Here is the stream so far:
Figure 8.20 Stream with Time Series Node Attached
Double-click on the Time Series node
Figure 8.21 Time Series Node
The default Method of modeling is the Expert Modeler, which automatically selects the best exponential smoothing or ARIMA model for one or more series (there can be a different model for each series). As an alternative, you can use the menu to specify a custom Exponential Smoothing or ARIMA model. In addition, there is a Continue estimation using
existing model(s) option, which allows you to apply an existing model to new data without re-estimating the model from the beginning. In this way you can save time by re-estimating and producing a new forecast based on the same model settings as before, but using more recent data. Thus, if the original model for a particular time series was Holt's linear trend, the same type of model is used for re-estimating and forecasting that data; the modeler does not reattempt to find the best model type for the new data. We will use the Expert Modeler in this example.
In addition, you can specify the confidence intervals you want for the model predictions and residual autocorrelations. By default, a 95% confidence interval is used. You can set the maximum number of lags shown in tables and in plots of autocorrelations and partial autocorrelations.
You must include a Time Intervals node upstream from the Time Series node. Otherwise, the dialog will indicate that no time interval has been defined and the stream will not run. In this example, the settings indicate that the model will be estimated from all the records and that forecasts will be made 3 time periods beyond the estimation period.
Click the Criteria button
Figure 8.22 Criteria Dialog
The All models option should be checked if you want the Expert Modeler to consider both ARIMA and exponential smoothing models. The other two options can be used if you want the Expert Modeler to consider only exponential smoothing or only ARIMA models.
The Expert Modeler will consider seasonal models only if a periodicity has been defined for the active dataset. When the seasonal models option is selected, the Expert Modeler considers both seasonal and nonseasonal models; if it is not selected, only nonseasonal models are considered. We will uncheck this option
because the sequence charts clearly show that there are no seasonal patterns in broadband subscriptions.
The Events and Interventions option enables you to designate certain fields as event or intervention fields. Doing so identifies a field as identifying periods when the time series might have been affected by events (predictable recurring situations, e.g., sales promotions) or interventions (one-time incidents, e.g., a power outage or employee strike). To be included in this list, a field must be flag, nominal, or ordinal, and must be numeric (e.g., 1/0, not T/F, for a flag field).
Click the Outliers tab
Figure 8.23 Outliers Dialog
The Detect Outliers automatically option is used to locate and adjust for outliers. Outliers can lead to forecasting bias either up or down, erroneous predictions if the outlier is near the end of the series, and increased standard errors. Because there were no obvious outliers in the sequence chart, we will leave this option unchecked.
Click Cancel
Click Run
After the model runs:
Right-click on the generated model named 6 fields in the Models palette
Click Browse
Figure 8.24 Time Series Model Output (Simple View)
The Time Series model displays details of the models the Expert Modeler selected for each series. In this case, it chose the Holt's linear trend exponential smoothing model for the first four series and the last one (Total), and the Winters' additive exponential smoothing model for the fifth series. Given the likely similar patterns in the series, it is not surprising that the same model was chosen for most of the series.
The default output shows, for each series, the model type, the number of predictors specified, and the goodness-of-fit measure (stationary R-squared is the default). This measure is usually preferable to an ordinary R-squared when there is a trend or seasonal pattern. If you have specified outlier methods, there is a column showing the number of outliers detected. The default output also includes a Ljung-Box Q statistic, which tests for autocorrelation of the errors. Here we see that the result was significant for the Model_2, Model_4, and Total series. Below we will examine some residual plots to see why the results were significant.
The default view (Simple) displays the basic set of output columns. For additional goodness-of-fit measures, you can use the View menu to select the Advanced option. The check boxes to the left of each model can be used to choose which models you want to use in scoring. All the boxes are checked by default. The Check all and Un-check all buttons in the upper left act on all the boxes in a
single operation. The Sort by option can be used to sort the rows in ascending or descending order of a specified column. As an alternative, you can also click on the column heading itself to change the order.
Click on the View: menu and select Advanced
Figure 8.25 Time Series Model Output (View = Advanced)
The Root Mean Square Error (RMSE) is the square root of the mean squared error. The Mean Absolute Percentage Error (MAPE) is obtained by taking the absolute error for each time period, dividing it by the actual series value, averaging these ratios across all time points, and multiplying by 100. The Mean Absolute Error (MAE) is the average of the absolute values of the errors. The Maximum Absolute Percentage Error (MaxAPE) is the largest absolute forecast error expressed as a percentage. The Maximum Absolute Error (MaxAE) is the largest forecast error, positive or negative. Finally, the Normalized Bayesian Information Criterion (Norm BIC) is a general measure of the overall fit of a model that attempts to account for model complexity.
From this table, you can easily scan the statistics to look for better, or poorer, fitting models. We can see here that Model_5 has the highest Stationary R-squared value (0.544) and Total has a very low one (0.049). However, the Total series has a lower MAPE than any of the other series. The summary statistics at the bottom of the output provide the mean, minimum, maximum, and percentile values for the standard fit measures. Here we see that the value for Stationary R-
8-34
TIME SERIES ANALYSIS
squared at the highest percentile (Percentile 95) is 0.544. This means that Model_5 should be ranked in the highest percentile based on this statistic, and the Total series should be ranked in the lowest.
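As an aside, these error measures are easy to compute from a series of actual and forecast values. The following is a minimal Python sketch, not Modeler output; the function name and example values are our own:

import numpy as np

def fit_measures(actual, forecast):
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    e = a - f                                  # forecast errors
    ape = np.abs(e) / np.abs(a) * 100          # absolute percentage errors
    return {
        "RMSE":   np.sqrt(np.mean(e ** 2)),    # root mean square error
        "MAE":    np.mean(np.abs(e)),          # mean absolute error
        "MAPE":   np.mean(ape),                # mean absolute percentage error
        "MaxAE":  np.max(np.abs(e)),           # largest absolute error
        "MaxAPE": np.max(ape),                 # largest absolute percentage error
    }

print(fit_measures([100, 110, 120], [98, 113, 118]))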
Model Parameters
Time series models are represented by specific equations, and each model therefore has coefficients or parameters associated with its various terms. These parameters can provide additional insight into a model and the series that it predicts. Click Parameters tab
A Holt's linear trend model has two parameters, one for level and one for trend. The level of the series is modeled with Alpha, which varies from 0 (older observations count just as much as current observations) to 1 (the current observation is used exclusively). The alpha estimate is 1.0, and it is significant at the .05 level, so this smoothly varying series can be modeled for level by ignoring previous observations when predicting an observation. Gamma models the trend in the series, and it varies from 0 (the trend is based on all observations) to 1 (the trend is based on only the most recent observations). The gamma value of 0.3 (also significant at the .05 level) indicates that the series trend (which is clearly increasing over time) draws on some of the past observations as well as the more recent ones. Note that the value of gamma does not describe the trend itself.
Figure 8.26 Parameters of Holt's Linear Trend Model for Market_1
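For reference, the level and trend recursions behind a Holt's linear trend model can be sketched as follows. This is a textbook formulation for illustration, not Modeler's internal code, and the starting values are simple conventions:

def holt_linear_fits(y, alpha, gamma):
    """One-step-ahead fits from Holt's linear trend smoothing."""
    level, trend = y[0], y[1] - y[0]     # conventional starting values
    fits = []
    for obs in y[1:]:
        fits.append(level + trend)       # forecast made before seeing obs
        new_level = alpha * obs + (1 - alpha) * (level + trend)
        trend = gamma * (new_level - level) + (1 - gamma) * trend
        level = new_level
    return fits

# With alpha = 1 the level is just the current observation (as in this model);
# gamma near 0 would base the trend on the whole history of the series.
print(holt_linear_fits([10, 12, 15, 19, 24], alpha=1.0, gamma=0.3))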
An experienced analyst can use the parameters for a model to compare one model to another, to compare changes over time in a model (and thus a series), and to assess whether a model makes sense. You may want to review the parameters for the other series’ models. Now let’s look at the Residual plots. Click on the Residuals tab
Figure 8.27 Residuals Output for the Market_1 Series
The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals (the differences between expected and actual values) for each target field. The ACF values are the correlations between the residual at the current time point and the residuals at previous time points (lags). By default, 24 autocorrelations are displayed. The PACF values measure the same correlations after controlling for the series values at the intervening time points. If all of the bars fall within the 95% confidence limits (the blue highlighted area), then there are no significant autocorrelations in the series. That appears to be the case with the Market_1 series. However, as we saw in Figure 8.24, the Market_2 series seemed to have significant autocorrelation based on the Ljung-Box Q statistic. Let's take a look at the residuals plot for the Market_2 series to see why that statistic was significant for that series. Use the Display plot for model: option to select the Market_2 series
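As an aside, the same residual diagnostics can be reproduced outside Modeler with, for example, the statsmodels package. The residual series below is randomly generated only to make the sketch self-contained, and the choice of lag 18 for the Ljung-Box test is our assumption about the lag used in the node's output:

import numpy as np
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.stats.diagnostic import acorr_ljungbox

residuals = np.random.default_rng(0).normal(size=120)   # stand-in residual series

r = acf(residuals, nlags=24)                  # autocorrelations, lags 0..24
p = pacf(residuals, nlags=24)                 # partial autocorrelations
lb = acorr_ljungbox(residuals, lags=[18])     # Ljung-Box Q and its p-value
print(lb)                                     # a significant p-value flags a
                                              # non-random pattern in the errors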
Figure 8.28 Residuals Output for the Market_2 Series
Here we see that there is significant autocorrelation at lag 6 in both the ACF and PACF plots. Thus, the Ljung-Box Q statistic and these two plots are consistent: there is a non-random pattern in the errors. This does not mean that the current model cannot be used for forecasting; it may perform adequately for the broadband provider. It does suggest, however, that the model can be improved. The Expert Modeler is an automatic modeling technique, and it normally finds a fairly acceptable model, but that doesn't mean that some tweaking on your part isn't appropriate. Click OK Place a Table node near the generated Time Series model Connect the generated model to the Table node Run the Table node
Figure 8.29 Table Output Showing Fields Created by Time Series Model
The table now contains a forecast value for each time point for each series along with an upper and lower confidence limit. In addition, there is a field called $T1_Future that indicates that there are records that contain forecast data. For records that extend into the future, the value for this field will be “1”. Scroll to the bottom of the table and then slightly to the right
Figure 8.30 Table Output with Future Values Displayed
Notice that the original series all have null values for these last three records because they are projected into the future. On the right-hand side of Figure 8.30 we can see the forecast values for the future months (January 2004 to March 2004) for the Market_1 series. Finally, let's create a chart showing the forecast for one of the series. Close the Table window Place a Time Plot node near the generated model on the stream canvas Connect the Time Plot node to the generated model Select the following fields to be plotted: Market_5, $TS-Market_5, $TSLCI-Market_5, $TSUCI-Market_5 Uncheck the Display Series in separate panels option Click Run
Figure 8.31 Sequence Chart for Market_5 along with Forecast and Upper & Lower Confidence Limits
From this chart, it appears that the model fits this series very well. Close the Time plot graph window
Click on File…Save Stream As and name the file Broadband.str
8.12 Applying Models to Several Series
We just produced models for six series, along with forecasts for the next three months. Suppose that three months have passed and we now have actual data for January to March 2004 (for which we made forecasts initially). Now it is April 2004 and we want to make forecasts for the next three months (April to June 2004) using the same models we developed before, without re-estimating them from scratch now that the file has been updated with three months of new records. We do this with the Reuse stored settings method in the Time Series node to apply the model we just created to the updated data file. (We leave aside whether the correct forecast period is three months, more, or less.) Click File…Open Stream…Broadband2.str (in the C:\Train\ModelerPredModel folder) Copy the generated Time Series node from Broadband.str (or add it from the Models manager to the stream) Paste the generated model into Broadband2.str
Figure 8.32 Broadband2.str with the Generated Model from Broadband.str
This node contains the settings from the time series models we just created. Normally, at this point, with any other type of PASW Modeler generated model, we would make predictions on new data by attaching this node to the Type node and running the generated model. This would automatically make predictions for new cases. Time series data, though, is different. Unlike other types of data files, where there is usually no special order to the cases (in terms of modeling), order makes a difference in a time series. To reuse our settings, but also use the new data (from January to March) to make estimates, we must create a new Time Series node directly from the generated Time Series model. Right-click on the generated model and select Edit Click on Generate…Generate Modeling Node
This places a time series modeling node onto the palette. Close the time series modeling output and delete the copied generated model from the stream canvas Connect the Time Intervals node to the new Time Series node
Figure 8.33 Broadband2.str with the Time Series Node Generated from the Previous Model
We don't have to specify any target fields because the models, with all specifications, are already stored in the generated time series modeling node. We simply insert the model node and decide whether the model should be re-estimated or not. Assuming that you have recently estimated the models, you might be willing to act as if the model type for each series still holds. You can avoid redoing the models and apply the existing models to the new data by using the Continue estimation using existing model(s) option. This choice means that PASW Modeler will reuse the stored model form (type of exponential smoothing or ARIMA model). Thus, for example, if the original model for a particular time series was Holt's linear trend, the same type of model is used for re-estimating and forecasting that series; the system does not attempt to find the best model type for the new data. The Expert Modeler will estimate new model parameters based on the additional data. If instead you wish to re-estimate the model type, you can uncheck the Continue estimation check box. Although it will clearly take more computing time to redo the models, you may prefer this choice unless you have many time series which are very long. However, if you are making forecasts every month (week, etc.) based on just one additional month (week, etc.) of data, it may not be worth the effort to redo the complete models every month. In that case, you may wish to re-estimate the parameters every month, but re-estimate the models themselves only every few months. Double-Click on the Time Series node
By default the Time series node will use the existing models for estimation
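A rough analogue of "continue estimation" in Python, using statsmodels; this is our own illustration of the idea, not what Modeler does internally. The model form (Holt's linear trend) is kept fixed while its parameters are re-estimated on the lengthened series:

import numpy as np
from statsmodels.tsa.holtwinters import Holt

rng = np.random.default_rng(1)
y_old = np.arange(36) + rng.normal(size=36)            # original 36 months
y_new = np.concatenate([y_old, 36 + np.arange(3.0)])   # plus three new months

fit_old = Holt(y_old).fit()     # parameters estimated on the original data
fit_new = Holt(y_new).fit()     # same model form, parameters re-estimated
print(fit_new.forecast(3))      # forecasts for the next three periods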
Figure 8.34 Time Series Model Node with Continue Estimating Using Existing Models
Click Run to place a new model in the Models Manager Browse the new model
Figure 8.35 Time Series Model Output
As we can see, the models used for each series are the same as before (see Figure 8.24), although the statistics are not (examine stationary R-squared, for example). Let's review the parameter estimates. Click Parameters tab
Figure 8.36 Parameters for Market_1 Holt's Linear Trend Model Estimated with New Data
The alpha parameter is still 1.0, but the gamma parameter is now almost zero (0.001) and nonsignificant, so it is effectively equal to zero. This means that, compared to the original model for the Market_1 series, the trend is now modeled using all the data rather than emphasizing the more recent observations. Now let's take a look at the new forecasts for April, May, and June. Close the Time Series model browser Attach a Table node to the new Time Series model Run the Table node
Figure 8.37 Table Node Output with New Forecasts
There are now null values for the original series for April, May, and June 2004, but there are predictions for these months. In addition, you can compare the predictions for the first three months of 2004 made with these data to the original predictions we made above. They will not be the same because we now have three months of additional data with which to improve the model. In summary, in this lesson we demonstrated a typical application of time series analysis in data mining by showing how to make forecasts for several series at once. We then reused these models, re-estimating them on new data, to make new forecasts at a future date for those same series. The process of applying the models to new data can be repeated as often as necessary.
Summary Exercises
The exercises in this lesson are written for the data file broadband_1.sav.
1. Using the same dataset that was used in the lesson (broadband_1.sav), rerun the Time Series node, using different series from the ones used in the lesson to fit a model and then produce forecasts.
2. Try rerunning the models requesting outlier detection. Does this make any difference in the generated models?
3. For those with extra time: Try specifying your own exponential smoothing model(s) or an ARIMA model, if you are knowledgeable about these methods, to see whether you can obtain a better model than that found by the Expert Modeler for one or more of the series.
Lesson 9: Logistic Regression

Objectives
• Review the concepts of logistic regression
• Use the technique to model credit risk
Data
A risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed or other problems, but profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a number of demographics, including age, income, number of children, number of credit cards, number of store credit cards, having a mortgage, and marital status, were available for about 2,500 records.
9.1 Introduction to Logistic Regression
Logistic regression, unlike linear regression, develops a prediction equation for a categorical target field that contains two or more unordered categories (the categories could be ordered, but logistic regression does not take the ordering into account). Thus it can be applied to such situations as:
• Predicting which brand (of the major brands) of personal computer an individual will purchase
• Predicting whether or not a customer will close her account, accept an offering, or switch providers
Logistic regression technically predicts the probability of an event (of a record being classified into a specific category of the target field). The logistic function is shown in Figure 9.1. Suppose that we wish to predict whether someone buys a product. The function displays the predicted probability of purchase based on an incentive.
Figure 9.1 Logistic Model for Probability of Purchase
We see the probability of making the purchase increases as the incentive increases. Note that the function is not linear but rather S-shaped. The implication of this is that a slight change in the incentive could be effective or not depending on the location of the starting point. A linear model would imply that a fixed change in incentive always has the same effect on the probability of purchase, so that the transition from low to high probability of purchase is quite gradual. With a logistic model, however, the transition can occur much more rapidly (steeper slope) near the .5 probability value. To understand how the model functions, we need to review some equations. The logistic model makes predictions based on the probability of an outcome. Binary (two target category) logistic regression can be formulated as:
$$\mathrm{Prob(event)} = \frac{e^{\alpha + B_1 X_1 + B_2 X_2 + \cdots + B_k X_k}}{1 + e^{\alpha + B_1 X_1 + B_2 X_2 + \cdots + B_k X_k}}$$

where $X_1, X_2, \ldots, X_k$ are the input fields. This can also be expressed in terms of the odds of the event occurring:

$$\mathrm{Odds(event)} = \frac{\mathrm{Prob(event)}}{1 - \mathrm{Prob(event)}} = \frac{\mathrm{Prob(event)}}{\mathrm{Prob(no\ event)}} = e^{\alpha + B_1 X_1 + B_2 X_2 + \cdots + B_k X_k}$$
where the outcome is one of two categories (event, no event). If we take the natural log of the odds, we have a linear model, akin to a standard regression equation:
$$\ln(\mathrm{Odds(event)}) = \alpha + B_1 X_1 + B_2 X_2 + \cdots + B_k X_k$$

With two categories, a single odds ratio summarizes the outcome.
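A quick numeric illustration of this relationship, with invented coefficient values (not estimates from any model in this lesson):

import math

def prob_event(alpha, betas, xs):
    """Logistic probability from an intercept, coefficients, and input values."""
    logit = alpha + sum(b * x for b, x in zip(betas, xs))
    return math.exp(logit) / (1 + math.exp(logit))

# one input, with alpha = -2.0, B1 = 0.8, X1 = 3.0:
print(prob_event(-2.0, [0.8], [3.0]))   # about .599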
However, when there are more than two target categories, ratios of the category probabilities can still describe the outcome, but additional ratios are required. For example, in the credit risk data used in this lesson there are three target categories: good risk, bad risk–profit, and bad risk–loss. Suppose we take the Good Risk category as the reference or baseline category and assign integer codes to the target categories for identification: (1) Bad Risk–Profit, (2) Bad Risk–Loss, (3) Good Risk. For the three categories we can create two probability ratios:

$$g(1) = \frac{\pi(1)}{\pi(3)} = \frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Good Risk)}}$$

and

$$g(2) = \frac{\pi(2)}{\pi(3)} = \frac{\text{Prob(Bad Risk-Loss)}}{\text{Prob(Good Risk)}}$$
where $\pi(j)$ is the probability of being in target category j. Each ratio is the probability of a target category divided by the probability of the reference or baseline target category. The remaining probability ratio (Bad Risk–Profit / Bad Risk–Loss) can be obtained by taking the ratio of the two ratios shown above. Thus the information in J target categories can be summarized in (J – 1) probability ratios.
In addition, these target-category probability ratios can be related to input fields in a fashion similar to what we saw in the binary logistic model. Again using the Good Risk category as the reference or baseline, we have the following model:
$$\ln\!\left(\frac{\pi(1)}{\pi(3)}\right) = \ln\!\left(\frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Good Risk)}}\right) = \alpha_1 + B_{11} X_1 + B_{12} X_2 + \cdots + B_{1k} X_k$$

and

$$\ln\!\left(\frac{\pi(2)}{\pi(3)}\right) = \ln\!\left(\frac{\text{Prob(Bad Risk-Loss)}}{\text{Prob(Good Risk)}}\right) = \alpha_2 + B_{21} X_1 + B_{22} X_2 + \cdots + B_{2k} X_k$$
Notice that there are two sets of coefficients for the three-category case, each describing the ratio of a target category to the reference or baseline category. If we complete this logic and create a ratio containing the baseline category in the numerator, we would have:
$$\ln\!\left(\frac{\pi(3)}{\pi(3)}\right) = \ln\!\left(\frac{\text{Prob(Good Risk)}}{\text{Prob(Good Risk)}}\right) = \ln(1) = 0 = \alpha_3 + B_{31} X_1 + B_{32} X_2 + \cdots + B_{3k} X_k$$

This implies that the coefficients associated with $\ln(\pi(3)/\pi(3))$ are all 0 and so are not of interest. Also,
the ratio relating any two target categories, excluding the baseline, can be easily obtained by subtracting their respective natural log expressions. Thus:
$$\ln\!\left(\frac{\pi(1)}{\pi(2)}\right) = \ln\!\left(\frac{\pi(1)}{\pi(3)}\right) - \ln\!\left(\frac{\pi(2)}{\pi(3)}\right)$$

or

$$\ln\!\left(\frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Bad Risk-Loss)}}\right) = \ln\!\left(\frac{\text{Prob(Bad Risk-Profit)}}{\text{Prob(Good Risk)}}\right) - \ln\!\left(\frac{\text{Prob(Bad Risk-Loss)}}{\text{Prob(Good Risk)}}\right)$$
We are interested in predicting the probability of each target category for specific values of the predictor fields. This can be derived from the expressions above. The probability of being in target category j is:
$$\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)}, \quad \text{where } J \text{ is the number of target categories.}$$
In our example with the three risk categories, for category (1):
$$\pi(1) = \frac{g(1)}{g(1) + g(2) + g(3)} = \frac{\pi(1)/\pi(3)}{\pi(1)/\pi(3) + \pi(2)/\pi(3) + \pi(3)/\pi(3)} = \frac{\pi(1)}{\pi(1) + \pi(2) + \pi(3)}$$

And substituting for the g(j)'s, we have an equation relating the predictor fields to the target category probabilities:

$$\pi(1) = \frac{e^{\alpha_1 + B_{11} X_1 + \cdots + B_{1k} X_k}}{e^{\alpha_1 + B_{11} X_1 + \cdots + B_{1k} X_k} + e^{\alpha_2 + B_{21} X_1 + \cdots + B_{2k} X_k} + e^{\alpha_3 + B_{31} X_1 + \cdots + B_{3k} X_k}} = \frac{e^{\alpha_1 + B_{11} X_1 + \cdots + B_{1k} X_k}}{e^{\alpha_1 + B_{11} X_1 + \cdots + B_{1k} X_k} + e^{\alpha_2 + B_{21} X_1 + \cdots + B_{2k} X_k} + 1}$$

(The last step uses the fact that the baseline coefficients are all 0, so the third exponential equals 1.)
In this way, the logic of binary logistic regression can be naturally extended to permit analysis of categorical target fields with more than two categories.
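Numerically, this amounts to exponentiating the (J – 1) non-baseline linear predictors, appending 1 for the baseline, and normalizing. A minimal sketch (the function name is ours; the example values anticipate the worked prediction later in this lesson):

import math

def multinomial_probs(linear_predictors):
    """Category probabilities from the non-baseline linear predictors."""
    g = [math.exp(lp) for lp in linear_predictors] + [1.0]   # baseline g = 1
    total = sum(g)
    return [gi / total for gi in g]

# two non-baseline categories followed by the baseline:
print(multinomial_probs([-1.522, -0.265]))   # about [.110, .386, .504]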
9.2 A Multinomial Logistic Analysis: Predicting Credit Risk
We will perform a multinomial logistic analysis that attempts to predict credit risk (three categories) for individuals based on several financial and demographic input fields. The data file has been split into two, and we use risktrain.txt to train the model. We are interested in fitting a model, interpreting and assessing it, and obtaining a prediction equation. Possible input fields are shown below.
Field name    Field description
AGE           age in years
INCOME        income (in British pounds)
GENDER        f=female, m=male
MARITAL       marital status: single, married, divsepwid (divorced, separated or widowed)
NUMKIDS       number of dependent children
NUMCARDS      number of credit cards
HOWPAID       how often paid: weekly, monthly
MORTGAGE      have a mortgage: y=yes, n=no
STORECAR      number of store credit cards
LOANS         number of other loans
INCOME1K      income divided by 1,000 (in thousands of British pounds)

The target field is:

Field name    Field description
RISK          credit risk: 1=bad risk-loss, 2=bad risk-profit, 3=good risk
To access the data: Click File…Open Stream and move to the c:\Train\ModelerPredModel directory Double-click on Logistic.str Run the Table node, examine the data, and then close the Table window Double-click on the Type node
The target field is credit risk (RISK). Notice that only four input fields are used. This is done to simplify the results for this presentation. As an exercise, the other fields will be used as predictors.
Figure 9.2 Type Node for Logistic Analysis
Close the Type node dialog Double-click on the Logistic Regression model node named RISK Click on the Model tab
Figure 9.3 Logistic Regression Dialog
In the Model tab, you can choose whether a constant (intercept) is included in the equation. The Procedure option is used to specify whether a binomial or multinomial model is created. The options available in the dialog box differ according to which modeling procedure you select. Binomial is used when the target field has two discrete values, such as good risk/bad risk, or churn/not churn. Whenever you use this option, you will in addition be asked to declare which of your flag or categorical fields should be treated as categorical, the type of contrast you want performed, and the reference category for each predictor. The default contrast is Indicator, which indicates the presence or absence of category membership. However, for fields with some implicit order, you may want to use another contrast such as Repeated, which compares each category with the one that precedes it. The default reference or base category is the First category. If you prefer, you can change this to the Last category. Multinomial should be used when the target field is a categorical field with more than two values. This is the correct choice in our example because the RISK field has three values: bad loss, bad profit, and good risk. Whenever you use this option, the Model type option will become available for you to specify whether you want a main effects model, a full factorial model, or a custom model. By default, a model including the main effects (no interactions) of factors (categorical inputs) and covariates (continuous inputs) will be run. This is similar to what the Regression model node will do (unless interaction terms are formally added). The Full factorial option would fit a model including all factor
interactions (in our example, with two categorical predictors, the two-way interaction of MARITAL and MORTGAGE would be added). Notice that there are Method options (as there were for linear regression), so stepwise methods can be used when the Main Effects model type is selected. When a number of input fields are available, the stepwise methods provide input field selection based on statistical criteria. The Base Category for target option is used to specify the reference category. The default is the First category in the list, which in this case is bad loss. Note: This field is unavailable if the contrast setting is Difference, Helmert, Repeated, or Polynomial. Select the Multinomial Procedure option (if necessary) Click on the Specify button to the right of Base category for target. This will open the Insert Value dialog box Click on good risk
Figure 9.4 Insert Value Dialog
Click the Insert button
This will change the base target category. The result is shown in Figure 9.5.
Figure 9.5 Logistic Regression Dialog with Good Risk as the Base Target Category
Click on the Expert tab Click the Expert Mode option button
Figure 9.6 Logistic Expert Mode Options
The Scale option allows adjustment of the estimated parameter variance-covariance matrix for over-dispersion (variation in the outcome greater than expected by theory, which might be due to clustering in the data). The details of such adjustment are beyond the scope of this course, but you can find some discussion in McCullagh and Nelder (1989). If the Append all probabilities check box is selected, predicted probabilities for every category of the target field are added to each record passed through the generated model node. If it is not selected, a predicted probability field is added only for the predicted category. Click the Output button Make sure the Likelihood ratio tests check box is selected Make sure the Classification table check box is selected
By default, summary statistics and (partial) likelihood ratio tests for each effect in the model appear in the output. Also, 95% confidence bands will be calculated for the parameter estimates. We have requested a classification table so we can assess how well the model predicts the three risk categories.
Figure 9.7 Logistic Regression Advanced Output Options
In addition, a table of observed and expected cell probabilities can be requested (Goodness of fit chi-square statistics). Note that, by default, cells are defined by each unique combination of covariate (continuous input) and factor (categorical input) values, and a response category. Since a continuous predictor (INCOME1K) is used in our analysis, the number of cell patterns is very large and each might contain but a single observation. These small counts could yield unstable results, and so we will forgo goodness-of-fit statistics. The asymptotic correlation of parameter estimates can provide a warning of multicollinearity problems (when high correlations are found among parameter estimates). Iteration history information is requested to help debug problems if the algorithm fails to converge, and the number of iteration steps to display can be specified. Monotonicity measures report the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the percentage of the total number of pairs that each represents. Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and the Concordance Index C are also displayed in this table. Information criteria provides Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC). Click OK Click Convergence button
Figure 9.8 Logistic Regression Convergence Criteria
The Logistic Regression Convergence Criteria options control technical convergence criteria. Analysts familiar with logistic regression algorithms might use these if the initial analysis fails to converge to a solution.
Click Cancel Click Run Browse the Logistic Regression generated model node named RISK in the Models Manager window Click the Advanced tab, and then expand the browsing window
The advanced output is displayed in HTML format. Figure 9.9 Record Processing Summary
The marginal frequencies of the categorical inputs and the target are reported, along with a summary of the number of valid and missing records. A record must have valid values on all inputs and the target in order to be included in the analysis. We have nearly 2,500 records for the analysis.
Figure 9.10 Model Fit and Pseudo R-Square Summaries
The Final model chi-square statistic tests the null hypothesis that all model coefficients are zero in the population, equivalent to the overall F test in regression. It has ten degrees of freedom corresponding to the parameters in the model (seen below), is based on the change in –2LL (–2 log likelihood) from the initial model (with just the intercept) to the final model, and is highly significant. Thus at least some effect in the model is significant. The AIC and BIC fit measures are also displayed; smaller values indicate better fit. Because both decreased, we can conclude that the model fit improved with the addition of the predictors. Pseudo R-square measures attempt to quantify the amount of variation (as functions of the chi-square lack of fit) accounted for by the model. The model explains only a modest amount of the variation (the maximum is 1, and some measures cannot reach this value).
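For concreteness, the model chi-square is just the drop in –2LL, and McFadden's pseudo R-square is one simple function of the same quantities. A sketch with placeholder numbers (not the values from this model):

from scipy.stats import chi2

neg2ll_initial, neg2ll_final, df = 4500.0, 4100.0, 10    # placeholder -2LL values
model_chi_square = neg2ll_initial - neg2ll_final         # change in -2LL
p_value = chi2.sf(model_chi_square, df)                  # upper-tail probability
mcfadden_r2 = 1 - neg2ll_final / neg2ll_initial          # one pseudo R-square
print(model_chi_square, p_value, round(mcfadden_r2, 3))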
Figure 9.11 Likelihood Ratio Tests
The Model Fitting Criteria table provided an omnibus test of the effects in the model. Here we have a test of significance for each effect (in this case the main effect of an input field) after adjusting for the other effects in the model. The caption explains how it is calculated. All effects are highly significant. Notice that the intercepts are not tested in this way, but tests of the individual intercepts can be found in the Parameter Estimates table. In addition, we can use this table to rank the importance of the predictors. For instance, if INCOME1K were removed as a predictor, the –2LL value would increase by 302.422. Clearly, the removal of this predictor would have far more impact on the overall fit than removing any of the other predictors. The further –2LL gets from zero, the worse the fit. Thus, we can conclude that INCOME1K is the most important predictor, followed by MARITAL, NUMKIDS, and MORTGAGE. For those familiar with binary (two-category) logistic regression, note that the values in the df (degrees of freedom) column are double what you would expect for a binary logistic regression model. For example, the covariate income (INCOME1K), which is continuous, has two degrees of freedom. This is because with three target categories, there are two probability ratios to be fit, doubling the number of parameters. Income has by far the largest chi-square value compared to the other predictors with two (or even four) degrees of freedom.
9.3 Interpreting Coefficients
The most striking feature of the Parameter Estimates table is that there are two sets of parameters. One set is for the probability ratio of "bad risk–loss" to "good risk," which is labeled "bad loss." The other set is for the probability ratio of "bad risk–profit" to "good risk," labeled "bad profit." You can view the estimates in equation form in the Model tab, but the Advanced tab contains more supplementary information.
Figure 9.12 Parameter Estimates
For each of the two target probability ratios, each predictor is listed, plus an intercept, with the B coefficients and their standard errors, a test of significance based on the Wald statistic, and the Exp(B) column, which is the exponentiated value of the B coefficient, along with its 95% confidence interval. As with ordinary linear regression, these coefficients are interpreted as estimates for the effect of a particular input, controlling for the other inputs in the equation.
Recall that the original (linear) model is in terms of the natural log of a probability ratio. The intercept represents the log of the expected probability ratio of two target categories when all continuous inputs are zero and all categorical fields are set to their reference category (last group) values. For covariates, the B coefficient is the effect of a one-unit change in the input on the log of the probability ratio. Examining income (INCOME1K) in the "bad loss" section, an increase of 1 unit (equivalent to 1,000 British pounds) changes the log of the probability ratio between "bad loss" and "good risk" by –.056. But what does this mean in terms of probabilities? Moving to the Exp(B) column, we see the value is .945 for INCOME1K (in the "bad loss" section of the table). Thus increasing income by 1 unit (or 1,000 British pounds) multiplies the expected ratio of the probability of being a bad loss to the probability of being a good risk by .945. In other words, increasing income reduces the expected probability of being a "bad loss" relative to being a "good risk," by a factor of .945 per 1,000 British pounds. This finding makes common sense. If we examine the income coefficient in the "bad profit" section of the table, we see that in a similar way (Exp(B) = .878) the expected probability of being a "bad profit" relative to being a good risk decreases as income increases. Thus increasing income, after controlling for the other fields in the equation, is associated with a decreasing probability of having a "bad loss" or "bad profit" outcome relative to being a "good risk." This relationship is quantified by the values in the Exp(B) column, and the Sig column indicates that both coefficients are statistically significant. Turning to the number of children (NUMKIDS), we see that its coefficient is significant for the "bad loss" ratio, but not the "bad profit" ratio. Examining the Exp(B) column for NUMKIDS in the "bad loss" section, the coefficient estimate is 2.267. For each additional child (a one-unit increase in NUMKIDS), the expected ratio of the probability of being a "bad loss" to being a "good risk" more than doubles, controlling for the other predictors. However, controlling for the other predictors, the number of children has no significant effect on the probability ratio of being a "bad profit" relative to a "good risk." The Logistic node uses a General Linear Model coding scheme. Thus for each categorical input (here MARITAL and MORTGAGE), the last category value is made the reference category and the other coefficients for that input are interpreted as offsets from the reference category. In examining the table we see that the last categories for MARITAL (single) and MORTGAGE (y) have B coefficients fixed at 0. Because of this, the coefficient of any other category can be interpreted as the change associated with shifting from the reference category to the category of interest, controlling for the other input fields. Since the reference category coefficients are fixed at 0, they have no associated statistical tests or confidence bands. Looking at the MARITAL input field, its two coefficients (for the divsepwid and married categories) are significant for both the "bad loss" and "bad profit" summaries.
In the “bad loss” section, we see the estimated Exp(B) coefficient for the “MARITAL=divsepwid” category is .284, while that for “MARITAL=married” is 2.891. Thus we could say that, after controlling for other inputs, compared to those who are single, those who are divorced, separated or widowed have a large reduction (.284) in the expected ratio of the probability of being a “bad loss” relative to a “good risk.” Put another way, the divorced, separated or widowed group is expected to have fewer “bad losses” relative to “good risks” than is the single group. On the other hand, the married group is expected to have a much higher (by a factor of almost 3) proportion of “bad losses” relative to “good risks” than the single group. The explanation of why being married versus single should be associated with an increase of “bad losses” relative to “good risks” should be worked out by the analyst, perhaps in conjunction with someone familiar with the credit industry (domain expert). If we examine the MARITAL Exp(B) coefficients for the “bad profit” ratios, we find a very similar result.
Finally, MORTGAGE is significant for both the "bad loss" and "bad profit" ratios. Since having a mortgage (coded y) is the reference category, examining the Exp(B) coefficients shows that, compared to the group with a mortgage, those without a mortgage have a greater expected probability of being "bad losses" (1.828) or "bad profits" (2.526) relative to "good risks." In short, those without mortgages are less likely to be good risks, controlling for the other predictors. In this way, the statistical significance of inputs can be determined and the coefficients interpreted. Note that if a predictor is not significant in the Likelihood Ratio Tests table, the model should be rerun after dropping the field. Although NUMKIDS is significant for only one of the two sets of category ratios, the joint test (Likelihood Ratio Test) indicates it is significant overall, and so we retain it.
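The link between a B coefficient and its Exp(B) is simply exponentiation, as this one-line check for the income coefficient shows. Small rounding differences arise because the table's Exp(B) is computed from the unrounded B:

import math

b_income = -0.056                    # B for INCOME1K, "bad loss" equation
print(round(math.exp(b_income), 3))  # about .946: each additional 1,000 pounds
                                     # multiplies the bad-loss/good-risk ratio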
Classification Table
The classification table, sometimes called a misclassification table or confusion matrix, provides a measure of how well the model performs. With three target categories we are interested in the overall accuracy of model classification, the accuracy for each of the individual target categories, and patterns in the errors.
Figure 9.13 Classification Table
The rows of the table represent the actual target categories while the columns are the predicted target categories. We see that overall, the predictive accuracy of the model is 62.4%. Although marginal counts do not appear in the table, by adding the counts within each row we find that the most common target category is bad profit (1,475). This constitutes 60.1% of all cases (2,455). Thus the overall predictive accuracy of our model is not much of an improvement over the simple rule of always predicting "bad risk–profit." However, we should recall that this simple rule would never make a prediction of "bad risk–loss" or "good risk." In examining the individual categories, the "bad risk–profit" group is predicted most accurately (87.3%), while the other categories, "bad risk–loss" (15.9%) and "good risk" (36.8%), are predicted with much less accuracy. Not surprisingly, most errors in prediction for these latter two target categories are predictions of "bad risk–profit." The classification table allows us to evaluate a model from the perspective of predictive accuracy. Whether this model is adequate depends in part on the value of correct predictions and the cost of errors. Given the modest improvement of this model over simply classifying all cases as "bad risk–profit," in practice an analyst would see whether the model could be improved by adding additional predictors and perhaps some interaction terms. Finally, it is important to note that the predictions were evaluated on the same data used to fit the model and for this reason may be optimistic. A better procedure is to keep a separate validation sample on which to evaluate the predictive accuracy of the model.
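The per-category and overall accuracies are simple row-wise computations on the classification table. A sketch with made-up cell counts (the row totals match the text, but the exact cells behind Figure 9.13 are not reproduced here, so the overall figure differs slightly):

import numpy as np

# rows = actual category, columns = predicted category
cm = np.array([[ 70,  300,  70],     # bad risk-loss
               [ 60, 1288, 127],     # bad risk-profit
               [ 50,  291, 199]])    # good risk
per_category = np.diag(cm) / cm.sum(axis=1)   # percent correct within each row
overall = np.diag(cm).sum() / cm.sum()        # overall percent correct
print(per_category, overall)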
Making Predictions
We now have the estimated model coefficients. How does the Logistic generated model node make predictions from the model? First, let's see the actual predictions by adding the generated model to the stream. Close the Model browsing window Add a Table node to the stream and connect the Logistic generated model to the Table node Run the Table node
Figure 9.14 Predicted Value and Probability from Logistic Model
The field $L-RISK contains the most likely prediction from the model (here "good risk"). The probabilities for all three target categories must sum to 1; the model prediction is the category with the highest probability. That probability is contained in the field $LP-RISK. So for the first case, the prediction is "good risk" and the predicted probability of this category for this combination of input values is .692. You prefer that the probability be as close to 1 as possible (the lowest possible value for the predicted category is .333; why?). To illustrate how the actual calculation is done, let's take an individual who is single, has a mortgage, has no children, and has an income of 35,000 British pounds (INCOME1K = 35.00). What is the predicted probability of her (although gender was not included in the model) being in each of the three risk categories? Into which risk category would the model place her?
Earlier in this lesson we showed the following (where π (j) is the probability of being in target category j):
$$\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)}, \quad \text{where } J \text{ is the number of target categories.}$$
If we substitute the parameter estimates in order to obtain the estimated probability ratios, we have:
$$\hat{g}(1) = e^{0.438 - 0.056\,\text{Income1k} + 0.818\,\text{Numkids} - 1.260\,\text{Marital}_1 + 1.062\,\text{Marital}_2 + 0.603\,\text{Mortgage}_1}$$

$$\hat{g}(2) = e^{4.285 - 0.130\,\text{Income1k} + 0.153\,\text{Numkids} - 1.220\,\text{Marital}_1 + 1.021\,\text{Marital}_2 + 0.927\,\text{Mortgage}_1}$$

and

$$\hat{g}(3) = 1$$

where, because of the coding scheme for the categorical inputs (factors):

Marital_1 = 1 if MARITAL = divsepwid; 0 otherwise
Marital_2 = 1 if MARITAL = married; 0 otherwise
Mortgage_1 = 1 if MORTGAGE = n; 0 otherwise

Thus for our hypothetical individual, the estimated probability ratios are:

$$\hat{g}(1) = e^{0.438 - 0.056(35.0) + 0.818(0) - 1.260(0) + 1.062(0) + 0.603(0)} = e^{-1.522} = .218$$

$$\hat{g}(2) = e^{4.285 - 0.130(35.0) + 0.153(0) - 1.220(0) + 1.021(0) + 0.927(0)} = e^{-0.265} = .767$$

$$\hat{g}(3) = 1$$

And the estimated probabilities are:

$$\hat{\pi}(1) = \frac{.218}{.218 + .767 + 1} = .110 \qquad \hat{\pi}(2) = \frac{.767}{.218 + .767 + 1} = .386 \qquad \hat{\pi}(3) = \frac{1}{.218 + .767 + 1} = .504$$
Since the third group (good risk) has the greatest expected probability (.504), the model predicts that the individual belongs to that group. The next most likely group to which the individual would be assigned would be group 2 (bad risk–profit) because its expected probability is the next largest (.386).
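The same hand calculation can be scripted directly; the coefficients below are the estimates quoted above, and the input values describe the hypothetical individual:

import math

income1k, numkids = 35.0, 0
marital_divsepwid = marital_married = 0    # single: both dummies are 0
mortgage_n = 0                             # has a mortgage: dummy is 0

g1 = math.exp(0.438 - 0.056 * income1k + 0.818 * numkids
              - 1.260 * marital_divsepwid + 1.062 * marital_married
              + 0.603 * mortgage_n)
g2 = math.exp(4.285 - 0.130 * income1k + 0.153 * numkids
              - 1.220 * marital_divsepwid + 1.021 * marital_married
              + 0.927 * mortgage_n)
g3 = 1.0                                   # baseline category: good risk
total = g1 + g2 + g3
print(g1 / total, g2 / total, g3 / total)  # about .110, .386, .504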
Additional Readings
Those interested in learning more about logistic regression might consider David W. Hosmer and Stanley Lemeshow's Applied Logistic Regression, 2nd edition, New York: Wiley, 2000.
Summary Exercises
The exercises in this lesson use the data file risktrain.txt, detailed in the following text box. RiskTrain.txt contains information from a risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed or other problems, but profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a number of demographics are available for about 2,500 cases. We want to predict credit risk from the demographic fields. The file contains the following fields:

ID           ID number
AGE          Age
INCOME       Income in British pounds
GENDER       Gender
MARITAL      Marital status
NUMKIDS      Number of dependent children
NUMCARDS     Number of credit cards
HOWPAID      How often is customer paid by employer (weekly, monthly)
MORTGAGE     Does customer have a mortgage?
STORECAR     Number of store credit cards
LOANS        Number of outstanding loans
RISK         Credit risk category
INCOME1K     Income in thousands of British pounds (field derived within PASW Modeler)
1. Continuing with the stream from the lesson, add the other available inputs, excluding INCOME (which is linearly related to INCOME1K) and ID, to a logistic regression model and evaluate the results. Do the additional fields substantially improve the predictive accuracy of the model? Examine the estimated coefficients for the significant inputs. Do these relationships make sense?
2. Rerun the Logistic node, dropping those inputs that were not significant in the last analysis. Does the accuracy of the model change much? Does the interpretation of any of the coefficients change substantially?
3. Rerun the Logistic node, this time using the Stepwise method. Do the input fields selected match those retained in Exercise 2?
4. Run a rule induction model (using C5.0 or CHAID) on these data, using all fields but ID and INCOME as inputs. How does the accuracy of this model compare to that found by logistic regression? What does this suggest about the relations in the data? Do the inputs used by the model correspond to the inputs that were found to be significant in the logistic regression analysis?
5. Run a neural net model on these data, again excluding ID and INCOME as inputs. Make sure you request predictor importance. Does the neural network outperform the other models? Are the important predictors in the neural network model the same as the significant input fields in the logistic regression?
Lesson 10: Discriminant Analysis

Objectives
• How Does Discriminant Analysis Work?
• The Elements of a Discriminant Analysis
• The Discriminant Model
• How Cases Are Classified
• Assumptions of Discriminant Analysis
• Analysis Tips
• A Two-Group Discriminant Example
Data
To demonstrate discriminant analysis we use data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). We wish to identify those groups most likely to adopt the service. Several demographic fields are available, including education, gender, age, income (in categories), number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The target measure was whether they would accept the offering or not.
10.1 Introduction
Discriminant analysis is a technique designed to characterize the relationship between a set of fields, often called the response or predictor fields, and a grouping field with a relatively small number of categories. By modeling the relationship, discriminant can make predictions for the categories of the grouping field (target). To do so, discriminant creates a linear combination of the predictors that best characterizes the differences among the groups. The technique is related to both regression and multivariate analysis of variance, and as such it is another general linear model technique. Another way to think of discriminant analysis is as a method to study differences between two or more groups on several fields simultaneously. Common uses of discriminant include:
1. Deciding whether a bank should offer a loan to a new customer.
2. Determining which customers are likely to buy a company's products.
3. Classifying prospective students into groups based on their likelihood of success at a school.
4. Identifying patients who may be at high risk for problems after surgery.
10.2 How Does Discriminant Analysis Work?
Discriminant analysis assumes that the population of interest is composed of separate and distinct populations, as represented by the grouping field. The discriminant analysis grouping field can have two or more categories. Furthermore, we assume that each population is measured on a set of fields—the predictors—that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of the predictors that best separate the populations. If we assume two input fields, X and Y, and two groups for simplicity, this situation can be represented as in Figure 10.1.
Figure 10.1 Two Normal Populations and Two Predictor Fields, with Discriminant Axis
The two populations or groups clearly differ in their mean values on both the X and Y axes. However, the linear function—in this instance, a straight line—that best separates the two groups is a combination of the X and Y values, as represented by the line running from lower left to upper right in the scatterplot. This line is a graphic depiction of the discriminant function, or linear combination of X and Y, that is the best predictor of group membership. In this case with two groups and one function, discriminant will find the midpoint between the two groups that is the optimum cutoff for separating the two groups (represented here by the short line segment). The discriminant function and cutoff can then be used to classify new observations. If there are more than two predictors, then the groups will (hopefully) be well separated in a multidimensional space, but the principle is exactly the same. If there are more than two groups, more than one classification function can be calculated, although not all the functions may be needed to classify the cases. Since the number of predictors is almost always more than two, scatterplots such as Figure 10.1 are not always that helpful. Instead, plots are often created using the new discriminant functions, since it is on these that the groups should be well separated. The effect of each predictor on each discriminant function can be determined, and the predictors can be identified that are more important or more central to each function. Nevertheless, unlike in regression, the exact effects of the predictors are not typically seen as of ultimate importance in discriminant analysis. Given the primary goal of correct prediction, the specifics of how this is accomplished are not as critical as the prediction itself (such as offering loans to customers who will pay them back). Second, as will be demonstrated below, the predictors do not directly predict the grouping field, but instead a value on the discriminant function, which, in turn, is used to generate a group classification.
10.3 The Discriminant Model
The discriminant model has the following mathematical form for each function:

$$F_K = D_0 + D_1 X_1 + D_2 X_2 + \cdots + D_p X_p$$

where $F_K$ is the score on function K, the $D_i$'s are the discriminant coefficients, and the $X_i$'s are the predictor or response fields (there are p predictors). The maximum number of functions that can be derived is equal to the minimum of the number of predictors (p) and the quantity (number of groups – 1). In most applications, there will be more predictors than categories of the grouping field, so the latter will limit the number of functions. For example, if we are trying to predict which customers will choose one of three offers, (3 – 1) = 2 classification functions can be derived. When more than one function is derived, each subsequent function is chosen to be uncorrelated, or orthogonal, to the previous functions (just as in principal components analysis, where each component is uncorrelated with all others; see Lesson 2). This allows for straightforward partitioning of variance. Discriminant creates a linear combination of the predictor fields to calculate a discriminant score for each function. This score is used, in turn, to classify cases into one of the categories of the grouping field.
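Scoring a case on a discriminant function is just evaluating this linear form; a minimal sketch with invented coefficients:

def discriminant_score(d0, coefs, xs):
    """F = D0 + D1*X1 + ... + Dp*Xp for one discriminant function."""
    return d0 + sum(d * x for d, x in zip(coefs, xs))

# two predictors on a single function (coefficients are illustrative only):
print(discriminant_score(-1.5, [0.8, -0.3], [2.0, 4.0]))   # -1.1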
10.4 How Cases Are Classified
There are three general types of methods to classify cases into groups.
1. Maximum likelihood or probability methods: These techniques assign a case to group k if its probability of membership is greater for group k than for any other group. These probabilities are posterior probabilities, as defined below. This method relies upon assumptions of multivariate normality to calculate probability values.
2. Linear classification functions: These techniques assign a case to group k if its score on the function for that group is greater than its score on the function for any other group. This method was first suggested by Fisher, so these functions are often called Fisher linear discriminant functions (which is how PASW Modeler refers to them).
3. Distance functions: These techniques assign a case to group k if its distance to that group's centroid is smaller than its distance to any other group's centroid. Typically, the Mahalanobis distance is the measure of distance used in classification.
When the assumption of equal covariance matrices is met, all three methods give equivalent results. PASW Modeler uses the first technique, a probability method based on Bayesian statistics, to derive a rule to classify cases. The rule uses two probability estimates. The prior probability is an estimate of the probability that a case belongs to a particular group when no information from the predictors is available. Prior probabilities are typically either determined by the number of cases in each category of the grouping field, or by assuming that the prior probabilities are all equal (so that if there are three groups, the prior probability of each group would be 1/3). We have more to say about prior probabilities below. Second, the conditional probability is the probability of obtaining a specific discriminant score (or one further from the group mean) given that a case belongs to a specific group. By assuming that the discriminant scores are normally distributed, it is possible to calculate this probability. With this information and by applying Bayes' rule, the posterior probability is calculated, which is defined as the probability of group membership given a specific discriminant score. It is this probability value that is used to classify a case into a group: a case is assigned to the group with the highest posterior probability. Although PASW Modeler uses a probability method of classification, you will most likely use a method based on a linear function to classify new data. This is mainly for ease of calculation, because computing probabilities for new data is computationally intensive compared to using a classification function. This will be illustrated below.
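Bayes' rule combines the two estimates into posterior probabilities. A sketch with invented numbers, where the conditional values stand in for the normal-density calculation described above:

def posteriors(priors, conditionals):
    """P(group k | score) from prior P(k) and conditional P(score | k)."""
    joint = [p * c for p, c in zip(priors, conditionals)]
    total = sum(joint)
    return [j / total for j in joint]

# three groups with equal priors and made-up conditional probabilities:
print(posteriors([1/3, 1/3, 1/3], [0.02, 0.10, 0.05]))
# about [.118, .588, .294]; the case is assigned to the second group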
10.5 Assumptions of Discriminant Analysis
As with other general linear model techniques, discriminant makes some fairly rigorous assumptions about the population. And as with these other techniques, it tends to be fairly robust to violations of these assumptions. Discriminant assumes that the predictor fields are measured on an interval or ratio scale (continuous). However, as with regression, discriminant is often used successfully with fields that are ordinal, such as questionnaire responses on a five- or seven-point Likert scale. Nominal fields can be used as predictors if they are given dummy coding. The grouping field can be measured on any scale and can have any number of categories, though in practice most analyses are run with five or fewer categories. Discriminant assumes that each group is drawn from a multivariate normal population. This assumption is violated often, especially as sample size increases, and moderate departures from normality are usually not a problem. If this assumption is violated, the tests of significance and the probabilities of group membership will be inexact. If the groups are widely separated in the space of the predictors, this will not be as critical as when there is a fair amount of overlap between the groups. When the number of categorical predictor fields is large (as opposed to interval-ratio predictors), multivariate normality cannot hold by definition. In that case, greater caution must be used, and many analysts would choose to use logistic regression instead. Most evidence indicates that discriminant often performs reasonably well with such predictors, though. Another important assumption is that the covariance matrices of the various groups are equal. This is equivalent to the standard assumption in analysis of variance of equal variances across factor levels. When this is violated, distortions can occur in the discriminant functions and the classification equations. For example, the discriminant functions may not provide maximum separation between groups when the covariances are unequal. If the covariance matrices are unequal but the input fields' distribution is multivariate normal, the optimum classification rule is the quadratic discriminant function. But if the matrices are not too dissimilar, the linear discriminant function performs quite well, especially when the sample sizes are small. This assumption can be tested with the Explore procedure or with Box's M statistic, displayed by Discriminant.
For a more detailed discussion of problems with assumption violation, see P.A. Lachenbruch (Discriminant Analysis. 1975. New York: Hafner) or Carl Huberty (Applied Discriminant Analysis. 1994. New York: Wiley).
10.6 Analysis Tips
In addition to the assumptions of discriminant, some additional guidelines are helpful. Many analysts recommend having at least 10 to 20 times as many cases as predictor fields to ensure that a model doesn't capitalize on chance variation in a particular sample. For accurate classification, another common rule is that the number of cases in the smallest group should be at least five times the number of predictors. In the interests of parsimony, Huberty recommends a goal of only 8 to 10 response fields in the final model. Although in applied work this may be too stringent, keep in mind that more is not always better. Outlying cases can affect the results by biasing the values of the discriminant function coefficients. Looking at the Mahalanobis distance for a case or examining the probabilities is normally an effective check for outliers. If a case has a relatively high probability of being in more than one group, it is difficult to classify. Analyses can be run with and without outliers to see how results are affected. Multicollinearity is less of a problem in discriminant analysis because the exact effect of a predictor field is typically not the focus of the analysis. When two fields are highly correlated, it is difficult to partition the variance between them, and the coefficient estimates are often incorrect. Still, the accuracy of prediction may be little affected. Multicollinearity can be more of a problem when stepwise methods of field selection are used, since fields can be removed from a model for reasons unrelated to their ability to separate the groups.
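One way to screen for such outliers outside Modeler is to compute each case's Mahalanobis distance from the centroid of the predictors. A numpy sketch with simulated data (the function name is ours):

import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the vector of column means."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # quadratic form diff' * cov_inv * diff, computed row by row
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

X = np.random.default_rng(2).normal(size=(200, 3))   # 200 cases, 3 predictors
d = mahalanobis_distances(X)
print(d.max())   # unusually large values flag potential outliers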
10.7 Comparison of Discriminant and Logistic Regression Discriminant and logistic regression have the same broad purpose: to build a model predicting which category (or group) individuals belong to based on a set of interval scale predictors. Discriminant formally makes stronger assumptions about the predictors, specifically that for each group they follow a multivariate normal distribution with identical population covariance matrices. Based on this you would expect discriminant to be rarely used since this assumption is seldom met in practice. However, Monte Carlo simulation studies indicate that multivariate normality is not critical for discriminant to be effective. Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 10.1. The two populations are best separated along an axis (discriminant function) that is a linear combination of x and y. The midpoint between the two populations is the cutpoint. This function and cut-point would be used to classify future cases. Logistic regression, as we have seen in Lesson 9, is derived from a view of the world in which individuals fall more along a continuum. This difference in formulation led discriminant to be employed in credit analysis (there are those who repay loans and those who don’t), while logistic regression was used to make risk adjustments in medicine (depending on demographics, health characteristics and treatment you are more or less likely to survive a disease). Despite these different origins, discriminant and logistic give very similar results in practice. Monte Carlo simulation work has not found one to be superior to the other over
very general circumstances. There is, of course, the obvious point that if the data are samples from multivariate normal populations, then discriminant outperforms logistic regression.
One consideration when choosing between the two methods is how many dichotomous predictor fields (or dummy-coded nominal or ordinal fields) are used in the analysis. Because discriminant makes stronger assumptions about the predictors, the more categorical fields you have, the more you would lean toward logistic regression. Within the domain of response-based segmentation, discriminant analysis is seen more often when the problem is formulated from the business side, while logistic models are more common when the problem is formulated from a marketing perspective as a choice model.
Note that neither discriminant nor logistic regression will directly produce a list of groups more or less associated with various target categories. Rather, they indicate which predictor fields (some may represent demographic characteristics) are relevant to the category. From the prediction equation or other summary measures you can determine the combinations of characteristics that most likely lead to the desired target category.
Recommendations
Logistic regression and discriminant analysis give very similar results in practice. Because discriminant makes stronger assumptions about the nature of your predictors (formally, multivariate normality and equal covariance matrices), the more of your predictor fields that are dichotomous or categorical (and thus need to be dummy coded), the more you should move in the direction of logistic regression. Certain research areas have a tradition of using only one of the methods, which may also influence your choice.
10.8 An Example: Discriminant
To demonstrate discriminant analysis we use data from a study in which respondents indicated whether they would accept an interactive news subscription service (via cable). Most of the predictor fields are continuous in scale, the exceptions being GENDER (a dichotomy) and INC (an ordered categorical field). We would expect few if any of these to follow a normal distribution, but we will proceed with discriminant.
Note that the predictor fields for discriminant must be numeric in storage, even those that are categorical in meaning. Most importantly, if you have predictors that are truly categorical, such as region of the U.S. (e.g., northwest, southwest, etc.), Discriminant will not create dummy variables/fields for these categories, even when they are numerically coded. You will need to create the dummy fields yourself (use the SetToFlag node) and then enter them into the model, leaving one out so as not to create redundancy. In the current example we don't face this issue.
As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first.
Click File…Open Stream and then move to the c:\Train\ModelerPredModel folder
Double-click on Discriminant.str
Right-click on the Table node and select Run to view the data
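As an aside, the dummy coding that the SetToFlag node performs can be sketched outside Modeler with pandas; the field name below is hypothetical:

import pandas as pd

# Hypothetical truly categorical predictor
df = pd.DataFrame({'region': ['northwest', 'southwest', 'northeast', 'southwest']})

# One 0/1 flag per category; drop_first=True leaves one category out
# so that no redundancy is introduced into the model
dummies = pd.get_dummies(df['region'], prefix='region', drop_first=True).astype(int)
print(dummies)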
Figure 10.2 The Interactive News Study Data
Place a Discriminant node from the Modeling palette to the right of the Type node Connect the Type node to the Discriminant node
The name of the Discriminant node will immediately change to NEWSCHAN, the target field.
Figure 10.3 Discriminant Node Added to the Stream
Double-click on the Discriminant node Click on the Model tab
Figure 10.4 Discriminant Dialog
The Use partitioned data option can be used to split the data into separate samples for training and testing, which provides an indication of how well the model will work with new data. We will not use this option in this example; instead we will take advantage of a different option for validating the model (Leave-one-out classification) that is built into the Discriminant procedure.
The Build model for each split option enables you to use a single stream to build a separate model for each value of a field whose role is set to Split in the Type node. All the resulting models are accessible from a single model nugget.
The Method option allows you to specify how you want the predictors entered into the model. By default, all of the terms are entered into the equation. If you do not have a particular model in mind, you can invoke the Stepwise option, which enters predictors into the equation based on a statistical criterion. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. Some analysts prefer to enter all the predictor fields into the equation and then evaluate which are important. However, if there are many correlated predictor fields, you run the risk of multicollinearity, in which case a Stepwise method may be preferred. A drawback is that the Stepwise method has a strong tendency to overfit the training data. When using this method, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data (which is common practice in data mining).
Click on the Method button and select Stepwise
Figure 10.5 Discriminant Analysis with Method Stepwise
Click on the Expert tab Click on Expert mode
Figure 10.6 Discriminant Expert Options
You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the target in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each group. If you know that the sample proportions reflect the distribution of the target in the population, then you can use the Compute from group sizes option to instruct Discriminant to make use of this information. For example, if a target
category is very rare, Discriminant can make use of this fact in its prediction equation. We don't know what the population proportions would be, so we retain the default.
The Use covariance matrix option is useful whenever the homogeneity of variance assumption is not met. In general, if the groups are well separated in the discriminant space, heterogeneity of variance will not be terribly important. However, when you do violate the equal variance assumption, it may be useful to choose Separate-groups covariance matrices to see whether your predictions change by very much. If they do, that would suggest that the violation of the equal variance assumption was serious. Note that using separate-groups covariance matrices does not affect the results prior to classification; classification is then based on the discriminant scores rather than the original values, so the Fisher classification functions are not equivalent to classification by PASW Modeler with separate covariance matrices.
Click the Output button
Figure 10.7 Discriminant Advanced Output Dialog
Checking Univariate ANOVAs has PASW Modeler display significance tests of between-group (target category) differences on each of the predictors. These provide some hint as to which fields will prove useful in the discriminant function, although determining this is precisely what the discriminant analysis itself will resolve. The Box's M statistic is a direct test of the equality of covariance matrices. The covariance matrices themselves are ancillary output and very rarely viewed in practice; however, you might view the within-groups correlations among the predictors to identify highly correlated predictors.
Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to make predictions for future observations (customers). Both sets of coefficients produce the same predictions
when equal covariance matrices are assumed. If there are only two target categories (as is our situation), either set of coefficients is easy to use. If you want to try "what if" scenarios using a spreadsheet, the unstandardized coefficients, which involve a single equation in the two-category case, are more convenient. If you run discriminant with more than two target categories, then Fisher's coefficients are easier to apply as prediction rules.
Casewise results can be used to display the codes for the actual group, predicted group, posterior probabilities, and discriminant scores for each case. The Summary table (also known as the Classification table, Misclassification table, or Confusion table) displays the number of cases correctly and incorrectly assigned to each of the groups based on the discriminant analysis. The Leave-one-out classification option classifies each case based on discriminant coefficients calculated while that case is excluded from the analysis. This is a jackknife method and provides a classification table that should generalize at least slightly better to other samples. You can also produce a Territorial map, which is a plot of the boundaries used to classify cases into groups, but the map will not be displayed if there is only one discriminant function (the maximum number of functions equals the number of categories in the target field minus 1). The Stepwise options allow you to display a Summary of statistics for all fields after each step.
Click the Means, Univariate ANOVAs, and Box's M check boxes in the Descriptives area
Click the Fisher's and Unstandardized check boxes in the Function Coefficients area
Click the Summary table and Leave-one-out classification check boxes in the Classification area
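For readers who want to see the leave-one-out idea in code, here is a minimal sketch using scikit-learn rather than Modeler, with made-up data; each case is classified by a discriminant model fit on all the other cases:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical data: X holds numeric predictors, y the group labels
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = rng.integers(0, 2, size=60)

# Jackknife: fit on n-1 cases, classify the held-out case, repeat n times
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(scores.mean())   # the cross-validated hit rate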
Figure 10.8 Discriminant Advanced Output dialog after Option Selection
Click OK Click the Stepping button
Figure 10.9 Stepping Dialog
Wilks' lambda is the default and probably the most common method. The differences between the methods are somewhat technical and beyond the scope of this course. You can change the statistical criterion for field entry; for example, you might want to make the criterion more stringent when working with a large sample.
Click Cancel
Click the Run button
Browse the Discriminant generated model in the Models Manager window
Click the Advanced tab, and then expand the browsing window
Scroll to the Classification Results
Figure 10.10 Classification Results Table
Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the target. The actual (known) groups constitute the rows and the predicted groups make up the columns. Of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them; thus
accuracy for this group is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Overall, the discriminant model was accurate in 67.8% of the cases.
Is this good? Will this model work well with new data? The answer to the first question largely depends on what level of predictive accuracy you required before you began the project. One way to assess the success of the model is to compare these results with the predictions we would have made by simply guessing the larger group. If we did that, we would be correct in 227 of 441 (227 + 214) instances, or about 51.5% of the time. The 67.8% correct figure, while certainly far from perfect accuracy, does far better than guessing.
The Cross-validated portion of the table gives us an idea of how accurate this model will be with new data. The percentage of correctly classified cases decreases only slightly, from 67.8% to 67.3%, for the cross-validation. Because these results are virtually identical, it appears the model is valid. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.
Scroll back to the Group Statistics pivot table
Figure 10.11 Group Statistics
Viewing the means by themselves is of limited use, but notice the group that would accept the service is about 7 years older than the group that would not accept, whereas the daily hours of TV viewing are almost identical for the two groups. The standard deviations are very similar across groups, which is promising for the equal covariance matrices assumption. Scroll to the Tests of Equality of Group Means pivot table
Figure 10.12 Univariate F Tests
The significance tests of between-group differences on each of the predictor fields provide hints as to which will be useful in the discriminant function (recall we are using the Wilks' lambda criterion for the stepwise method). Notice that Age in Years has the largest F (is most significant) and will be selected first in the stepwise solution. This table looks at each field while ignoring the others, whereas discriminant adjusts for the presence of the other fields in the equation (as regression would).
Scroll to the Box's M test results
Figure 10.13 Box's M Test Results
Because the significance value is well above 0.05, we retain the null hypothesis that the covariance matrices are equal. Note, however, that the Box's M test is quite powerful and tends to reject equal covariances when the ratio N/p is large, where N is the number of cases and p is the number of fields. The test is also sensitive to a lack of multivariate normality, which applies to these data. If the covariance matrices were unequal, the effect on the analysis would be to create errors in the assignment of cases to groups.
Scroll to the Eigenvalues and Wilks' Lambda portion of the output
Figure 10.14 Summaries of Discriminant Function (Eigenvalues and Wilks’ Lambda)
These two tables are overall summaries of the discriminant function. The canonical correlation measures the correlation between a variable (or variables, when there are more than two groups) contrasting the groups and the linear combination of the predictors that maximizes that correlation. In short, it measures the strength of the relationship between the predictor fields and the groups. Here there is a modest canonical correlation (.363). Wilks' lambda provides a multivariate test of group differences on the predictors. If this test were not significant (it is highly significant), we would have no basis on which to proceed with the discriminant analysis. Now we view the individual coefficients.
Scroll down until you see the Standardized Coefficients and Structure Matrix
Figure 10.15 Standardized Coefficients and Structure Matrix
Standardized discriminant coefficients can be used as you would use standardized regression coefficients in that they attempt to quantify the relative importance of each predictor in the discriminant function. The only three predictors that were selected by the stepwise analysis were
Education, Gender and Age. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function. Notice that the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) is a one-unit change which, when multiplied by the negative coefficient, lowers the discriminant score and moves the individual toward the group with the negative mean (those who don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.
The Structure Matrix displays the correlations between each field considered in the analysis and the discriminant function(s). Note that income category correlates more highly with the function than gender or education do, yet it was not selected in the stepwise analysis; this is probably because income is correlated with predictors that entered earlier. The standardized coefficients and the structure matrix provide two ways of evaluating the predictor fields and the function(s) separating the groups.
Scroll down until the Canonical Discriminant Function Coefficients and Functions at Group Centroids are visible
Figure 10.16 Unstandardized Coefficients and Group Means (Centroids)
In Figure 10.1 we saw a scatterplot of two separate groups and the axis along which they could best be separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficients for prediction purposes, you simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. You then compare this value to the cut-point (by default the midpoint) between the two group means (centroids) along the discriminant function (the cut-point appears in Figure 10.1). If the prospective customer's value is greater than the cut-point, you predict
the customer will accept; if the score is below the cut-point, you predict the customer will not accept. This prediction rule is easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios: for example, if we have a male with 16 years of education, at what age would such an individual become a good prospect? To answer this we determine the age value that moves the discriminant score above the cut-point.
Scroll down until you see the Classification Function Coefficients
Figure 10.17 Fisher Classification Coefficients
Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female = 1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 – 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the target group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes.
We did not test the normality assumptions of discriminant analysis in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the group sample sizes are very different. Here the sample sizes were about the same. As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction, and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic fields and the choice.
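The Fisher scoring arithmetic is simple enough to script. A sketch in Python: the No-group coefficients are the ones quoted above, while the Yes-group values are placeholders, so substitute the actual coefficients from Figure 10.17 before relying on this:

# Classify a prospective customer with the Fisher classification functions:
# compute one linear score per group and pick the group with the higher score.
education, gender, age = 16, 1, 30   # the customer described above (Female = 1)

score_no = 2.07 * education + 1.98 * gender + 0.32 * age - 20.85
# Placeholder coefficients -- replace with the Yes-group values from Figure 10.17
score_yes = 2.10 * education + 1.50 * gender + 0.45 * age - 25.00

prediction = 'Yes' if score_yes > score_no else 'No'
print(score_no, score_yes, prediction)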
Summary Exercises
The exercises in this lesson use the data file credit.sav. Credit.sav has the same fields as risktrain.txt except that they are all numeric, so that we can use them all in a discriminant analysis. The file contains the following fields:
ID - ID number
AGE - Age
INCOME - Income
GENDER - Gender
MARITAL - Marital status
NUMKIDS - # of dependent children
NUMCARDS - # of credit cards
HOWPAID - Paid monthly/weekly
MORTGAGE - Mortgage
STORECAR - # of store cards held
LOANS - # of other loans
RISK - Credit risk category
INCOME1K - Income in thousands of British pounds (field derived within PASW Modeler)
1. Begin with a clear Stream canvas. Place a Statistics File source node on the canvas and connect it to Credit.sav.
2. Attach a Type node to the Source node, and a Table node to the Type node. Run the Table and allow PASW Modeler to automatically type the fields.
3. Attach a SetToFlag node to the Type node and create separate dummy fields for each category of the marital field. Make sure that you code the True value as 1 and the False value as 0. This is important because Discriminant expects numeric data for the inputs.
4. Attach a Type node to the SetToFlag node.
5. Edit the second Type node and change the role for risk to Target, and to None for id, marital, income1k, and marital_3 (or a reference field of your choice). Leave the role as Input for all the rest of the fields.
6. Use a Distribution node to examine the distribution of risk.
7. Attach a Discriminant node to the second Type node and run the analysis. How many classification functions are significant? What fields are important predictors?
8. How accurate is the model as a whole? On which category is it more accurate?
Lesson 11: Bayesian Networks
Objectives
• The Basics of Bayesian Networks
• Types of Bayesian Networks in PASW Modeler
• Creating models with the Bayes Net node
• Modifying Bayes Network Model Settings
Data
We will use the dataset churn.txt that we have employed in several previous lessons. This data file contains information on 1477 customers of a telecommunications company who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. In this lesson we use a Bayes Net to predict group membership. A Partition node will be used to split the data.
11.1 Introduction
Bayesian analysis has been introduced to data mining through Bayesian networks, which are graphical representations of the probabilistic relationships among a set of fields. These networks are very general: they can be used to represent causal relationships, they can have multiple target fields, and they often allow an analyst to specify the existence (or non-existence) of certain relationships using domain knowledge and experience. The Bayes Net node provides the ability to use two different types of Bayesian networks to predict a categorical target.
Bayes Net can use predictors on any scale, but continuous (Range) fields will be automatically binned into five groups. In theory a Bayes Net can use many predictors, but since every field will be categorical, cells with low or zero counts become more likely, especially if some categorical predictors have many categories. This is less of an issue with very large data files.
Bayes Nets are an alternative to other methods of predicting categorical targets, including decision trees, neural nets, logistic/multinomial regression, and SVM models. Unlike many other PASW Modeler models, a graphical depiction of the model in the form of a Bayesian network is available in the generated model to further model understanding, although there is no predictive equation with coefficients for individual predictors as in some other models. The Bayes Net node is included in the Classification module.
11.2 The Basics of Bayesian Networks
Bayesian analysis is an area of statistics based on a different approach to probability than frequentist statistics, which is, for example, the standard approach used to calculate the probability values for a t-test. The frequentist approach defines probability as the limit of an outcome's relative frequency in a large number of trials, and it assumes that a priori knowledge plays no role in determining probability. In contrast, Bayesian statistics incorporates prior knowledge or belief about an event or outcome, so that one has both prior and posterior probabilities. Bayesian analysis, and Bayes' theorem on which it is based, is named after the Reverend Thomas Bayes, who studied how to compute a distribution for the parameter of a binomial distribution. There
are several ways to state Bayes' theorem. If we wish to test a hypothesis H conditional on evidence from some Data, then one general statement of Bayes' theorem is:
P(H|Data) = P(Data|H) * P(H) / P(Data),
where P(H|Data) means the probability of H given the Data. The issue of prior probabilities enters because P(H) is the prior probability of H given no other information, i.e., before seeing the data collected for our study. This probability can be subjective, or it can be based on more objective prior knowledge, such as the proportion of persons who buy a new refrigerator in a year (for a model where we are trying to predict who will buy a new refrigerator).
We won't work through the theorem in depth here (though a tiny numeric illustration appears below); you can find many good worked examples on websites and in elementary texts on Bayesian statistics, and you don't really need to understand Bayes' theorem to use the Bayes Net node or its output. A portion of the output is a joint probability table, but that is really nothing other than a bivariate or multi-way crosstabulation of the fields that are found to be dependent (because values of one field depend on or are related to values of another, although this dependence does not imply causality: correlation does not equal causation, as we have all been taught). Otherwise, the output fields from a Bayes Net model are similar to those from other models and include a prediction and the probability of that prediction.
A Bayesian network is a graphical model based on a directed acyclic graph (DAG). First, an ordinary directed graph is shown in Figure 11.1 for comparison. Directed graphs are composed of vertices or nodes (the circles) that represent fields in a model, and arrows between the nodes that are variously called arcs, arrows, or directed edges.
Figure 11.1 Simple Example of a Directed Graph
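Here is the promised numeric illustration of the theorem, with entirely made-up probabilities for the refrigerator example:

# P(H): prior probability that a person buys a refrigerator this year
# P(Data|H): probability of the observed evidence among buyers
# P(Data): overall probability of the evidence (all values hypothetical)
p_h = 0.10
p_data_given_h = 0.60
p_data = 0.25

p_h_given_data = p_data_given_h * p_h / p_data
print(p_h_given_data)   # posterior probability: 0.24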
In comparison, a directed acyclic graph is shown in Figure 11.2. Here, for any node n, there is no path, following the arrows, that begins and ends on n. You can try that for any of the nodes in the graph.
Figure 11.2 Directed Acyclic Graph to Predict RESPONSE
A Bayesian network is a model that represents a set of data with a directed acyclic graph and uses that information to make predictions. Nodes that are connected have probabilistic dependencies. Nodes that are not connected are (broadly speaking) conditionally independent, which means that those nodes add no more information, given the nodes that are interconnected (more about this below). So in the graph in Figure 11.2, the field VISITB is conditionally independent of ORIVISIT.
A Bayesian network can display causal relationships between nodes with its arcs and arrows. However, the networks constructed by the Bayes Net node are not designed to represent causal relationships, for several important reasons. First, in data mining more emphasis is placed on a model's ability to make accurate predictions than on representing causal influences (e.g., whether the effect of field A on outcome C is direct or operates indirectly through field B); the networks constructed by the Bayes Net node are optimized for prediction. Second, software by itself, despite any claims otherwise, cannot successfully find causal relationships without user input. That is why in structural equation modeling the user must set up the structure of the model and then test whether the data support that structure. Finally, data mining problems often incorporate many potential predictors, making specification of causal links more and more complex. The end result is that it is possible to glean information from a network in PASW Modeler, but you need to be cautious when doing so and not overinterpret the model.
Bayesian networks in general are often resistant to problems caused by missing data, and they can make predictions for cases with missing data. However, the Bayes Net node by default uses listwise deletion, in which any missing data causes a case to be deleted from the analysis. Why this is so, and how it affects model building, is explained with an example below.
Bayesian networks as implemented in PASW Modeler are designed to use only categorical data, for which probability statements can be readily constructed. This means that only categorical targets can be used. If a continuous predictor is used, it will be binned into five roughly equally-spaced bins (a sketch of this kind of binning follows). This may not always be appropriate for skewed or otherwise non-symmetrical distributions. If you have predictors like that, you may wish to bin these fields manually using a Binning node before the Bayesian Network node. For example, you could use Optimal Binning where the Supervisor field is the same as the Bayesian Network node Target field.
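The automatic binning is simple equal-width binning over the observed range; a sketch of the equivalent operation in pandas, using a hypothetical income field:

import pandas as pd

# Hypothetical continuous predictor
income = pd.Series([8000, 25000, 41000, 60000, 99000, 15000])

# Five equal-width bins across the observed range, as the Bayes Net node does
binned = pd.cut(income, bins=5)
print(binned.value_counts().sort_index())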
11.3 Types of Bayesian Networks in PASW Modeler
The Bayes Net node provides two types of Bayesian networks. To understand them, it helps to first discuss a Naïve Bayes network; this type of network is displayed in Figure 11.3. There is a target field A and a set of predictors B, C, and D. A is a parent node of the other nodes, and nodes B, C, and D are therefore child nodes of A. This is reminiscent of the graphical view of a decision tree, but you should not try to equate the two. Although we are attempting to predict A, the arcs point toward the predictors. This is a consequence of Bayes' theorem, in which the probability of the data given the outcome appears in the numerator of the equation. That probability is represented by the arrows flowing away from A, the target.
Figure 11.3 Naïve Bayes Network
Of course, we include fields that are meaningful predictors of the target, so these arrows shouldn't be confusing. For example, if we want to predict customers who will make a second purchase from an online retailer, we can include such things as income, gender, and prior purchase behavior. All of those may influence a second purchase, but not the reverse. The other key characteristic of a Naïve Bayes network is that there are no links or dependencies between the predictors; this is the simplest possible network. With this as background, we can now consider the two networks available in the Bayes Net node.
Tree-Augmented Naïve Bayes (TAN). This type of network extends Naïve Bayes by allowing each predictor to depend on one other predictor in addition to the target field. Again, this dependence is not necessarily causal dependence but simply probabilistic dependence given the data at hand. Figure 11.2 shows a TAN network, where you can see that no predictor has more than two arrows pointing toward it, that all the arrows point away from the target RESPONSE, and that one predictor (ORISPEND) has no dependency on other predictors. The conditional probability tables produced by the Bayes Net node will reflect this structure, so the table for VISITB will include RESPONSE and SPENDB.
Markov Blanket. This type of network selects the set of nodes in the dataset that contains the target's parents, its children, and its children's parents. This is illustrated in Figure 11.4, where once again the target field is RESPONSE. There were many more potential predictors available than are displayed in the network, but once a Markov Blanket has been defined, the target node is conditionally
independent of all other nodes (predictors), and so those predictors are not used in the network (model). Essentially, a Markov Blanket identifies all the fields that are needed to predict the target. Notice that arrows can go both to and from the target field in a Markov Blanket. This type of network should, all things being equal, be more accurate than a TAN, especially with a large number of fields. However, with large datasets the processing time will be significantly greater. To reduce the amount of processing, you can use the Feature Selection options on the Expert tab to have PASW Modeler use only the fields that have a significant bivariate relationship to the target. As before, arrows from the target to another field don't indicate causal influence in that direction.
Figure 11.4 Example of a Markov Blanket Network
You now understand the basics of a Bayesian network and the types of networks produced with the Bayes Net node. We can begin using Bayesian networks to predict customer churn.
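To make the mechanics concrete before we build one, here is a miniature sketch of how a Naïve Bayes structure combines its conditional probability tables to score a case; all numbers are hypothetical:

# Hypothetical tables for a two-category target A and predictors B and C
p_a = {'yes': 0.4, 'no': 0.6}                          # prior P(A)
p_b_given_a = {('b1', 'yes'): 0.7, ('b1', 'no'): 0.2}  # P(B = b1 | A)
p_c_given_a = {('c1', 'yes'): 0.5, ('c1', 'no'): 0.6}  # P(C = c1 | A)

# Score a case with B = b1 and C = c1: the posterior is proportional
# to the product of the prior and the conditional probabilities
scores = {a: p_a[a] * p_b_given_a[('b1', a)] * p_c_given_a[('c1', a)]
          for a in p_a}
total = sum(scores.values())
posterior = {a: s / total for a, s in scores.items()}
print(posterior)   # predict the category with the higher posterior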
11.4 Creating a Bayes Network Model
We will use the churn data file that we have used in several other lessons. This will allow comparison to those other techniques.
Click File…Open Stream and move to the c:\Train\ModelerPredModel folder
Double-click on Bayes Net.str
Run the Table node
Close the Table window
Edit the Type node
Figure 11.5 Type Settings for Churn Data
All available input fields will be used (with the exception of ID). The field CHURNED has three categories.
Close the Type window
Edit the Bayes Net node named CHURNED
There are two types, or structures, of networks available, as explained above. If you have many fields, you may wish to include a first step of feature selection that will reduce the number of inputs. This option can be turned on with the Include feature selection preprocessing step check box.
Figure 11.6 Model Tab in Bayes Net Node
Recall that the probability being modeled in a Bayesian network is built from a series of tables, so there can be a significant fraction of cells with small or even zero counts. This can pose a computational difficulty, and in addition there is a danger of overfitting the model. The Bayes adjustment for small cell counts check box reduces these problems by applying smoothing to lessen the effect of any zero counts (a sketch of this kind of smoothing appears below).
If a model has previously been trained, selecting the Continue training existing model check box causes the results shown on the model nugget Model tab to be regenerated and updated each time the model is run. You would do this when you have added new or updated data to an existing stream with a model.
Click Expert tab
Click Expert options button
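Modeler's exact adjustment isn't documented here, but it behaves like add-alpha (Laplace) smoothing; a sketch with hypothetical cell counts:

# Hypothetical counts for one row of a conditional probability table;
# the zero cell would otherwise force a zero into every prediction it touches
counts = [12, 0, 3]
alpha = 0.5   # hypothetical smoothing constant

# Add alpha to every cell before normalizing so no probability is exactly zero
total = sum(counts) + alpha * len(counts)
probs = [(c + alpha) / total for c in counts]
print(probs)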
Missing Values in Bayes Net Models
The default option for a Bayes Net is to use only complete records (the Use only complete records check box). This is equivalent to standard listwise deletion, so if a record has a missing value for any field, that record won't be used in creating a model (or in scoring from an existing model). If this option is unchecked, the Bayes Net will do the equivalent of pairwise deletion, using as much information as possible.
However, as with any algorithm that uses pairwise deletion, at least two issues become salient. First, the number of cases used for the analysis becomes ill-defined. This may not be critical for most data-mining projects, but you should be aware of it. Perhaps more important, the estimates of the model can be unstable and affected by small changes in the data, which can make model validation more difficult. If there is a significant amount of missing data, you may wish to estimate or impute some of the missing values, although this raises its own complications. Computationally, the best solution is to use listwise deletion, but that is ideal only when missing data are a small percentage of the file.
Other Bayes Net Expert Options
The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to carry out independence testing and remove unnecessary links from the network (to find the parents and children of the target field). This can be especially useful when processing data with strong dependencies among many fields. The default setting for Maximal conditioning set size is 5. Because tests involving a high number of conditioning fields require more time and memory for processing, you can limit the number of fields to be included. If you reduce the maximum conditioning set size, though, the resulting network may contain some superfluous links. You can also use this setting with a TAN network by requesting feature selection on the Model tab.
The Feature Selection area is available for Markov Blanket models, or for TAN models when feature selection is turned on. You can use this option to restrict the maximum number of inputs used when processing the model in order to speed up model building. If feature selection is turned on, the default maximum number of fields to be used in the network is 10. If there are important fields that should always be used in a network, you can specify them in the Inputs always selected box.
The Bayes Net node conducts tests of independence on two-way and larger tables to construct the network; a sketch of both tests follows. A likelihood ratio test is used by default, but you can request that a standard Pearson chi-square be used instead. The significance level of the test can be set, but only if feature selection or a Markov Blanket network is requested.
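Both independence tests can be illustrated with SciPy, which computes the Pearson chi-square and, via the lambda_ argument, the likelihood-ratio (G) statistic on a contingency table; the counts below are hypothetical:

from scipy.stats import chi2_contingency

# Hypothetical two-way table of a predictor against the target
table = [[30, 10],
         [20, 40]]

chi2, p, dof, expected = chi2_contingency(table)                  # Pearson test
g, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")  # likelihood ratio
print(p, p_g)   # small p-values indicate the fields are not independent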
Figure 11.7 Expert Options for Bayes Net
At this point we won’t change any defaults. Click Run
After the model runs: Right-click and Browse the generated Bayes Net model
Figure 11.8 Bayes Net Model Browser for TAN Model
As with most predictive models, predictor importance is included in the right half of the model browser window. The most important predictors are clearly SEX, LONGDIST, and International (you might want to compare this to other models that we developed to predict CHURNED with these data, such as the decision trees in Lesson 3). The actual Bayesian TAN model is displayed in the left half of the model browser. The network graph of nodes displays the relationship between the target and its predictors, as well as the relationship between the predictors. The importance of each predictor is shown by the density of its color; a darker blue color shows an important predictor. The target CHURNED has a red node. You can use the mouse to drag nodes around the graph to more easily view relationships. Click on the node for CHURNED and drag it more into the center of the network (see Figure 11.9)
Figure 11.9 Bayesian Network Graph
There are several things to notice about the network diagram. First, there is a path from CHURNED to every input field, and the arrows all point away from CHURNED even though it is the target field. This makes CHURNED a parent of all the input nodes. These facts are simply a consequence of how a TAN is defined and don't mean that churn status is somehow affecting the input fields. The arrows do indicate which fields will be included in conditional probability tables, as we will see shortly.
Second, a TAN network allows paths between a predictor and one other predictor (plus the connection with the target). You can see this if you examine the network closely; no predictor has more than two arrows pointing toward it.
Third, the links and arrows do have some meaning, but not causal influence. For example, there is an arrow from LOCAL to Est_Income. Since the average number of minutes of local phone service isn't going to affect one's income, the direction of the arrow doesn't indicate causality but rather a conditional dependency or interrelatedness. Or consider the paths going from LONGDIST to International, Car_Owner, and LOCAL. The arrows between LONGDIST and the other two measures of phone service usage probably do indicate something meaningful, but not cause and effect. Instead, the arrows are a sign that there are probabilistic dependencies among these fields. From our understanding of the data, we might conclude that these dependencies exist because of different groups of customers who have similar phone use patterns. For example, one group could be customers who make local calls but not many long distance calls; another could be those who make lots of long distance and international calls.
As mentioned earlier, the Bayes Net node bins continuous predictors into five categories, splitting the range into five equally-spaced groups. You can view the bin values by hovering with the mouse over the node for a continuous predictor.
Hover the mouse over the node for Est_Income
The first bin runs from 0 to 20,054.807. The last bin contains all customers with incomes above 79,888.377. Figure 11.10 Distribution of Est_Income
We are now in the Basic view of the network (see the View dropdown in the lower left corner). We can switch to the Distribution view. Click View dropdown and select Distribution
Figure 11.11 Distribution View of TAN Network
The Distribution view displays the conditional probabilities for each node in the network as a minigraph. Bayesian networks work only with categorical data, so the graphs are all bar charts. The
simplest one is for the target field, which shows its distribution unrelated to any other field (because the arrows point away from it). You can hover the mouse pointer over a graph to display its values in a popup ToolTip. Hover the mouse over the bottom bar in the graph for CHURNED
In Figure 11.12 we have isolated just this portion of the network. Figure 11.12 Percentage of Customers who are Current from ToolTip
The probabilities for the input nodes are more complicated because most are conditional on both the target and another field.
Hover the mouse over the graph for Car_Owner and move it from top to bottom
As you move the mouse over the mini-graph for Car_Owner, you see probabilities listed along with values of CHURNED and LONGDIST. This is because there are arrows from those two fields pointing toward Car_Owner. We can learn more by viewing the conditional probability table for an input node. When you select a node in either Basic or Distribution view, the associated conditional probability table is displayed in the right half of the model browser. This table contains the conditional probability value for each node category and each combination of values in its parent nodes.
Click on the mini-graph for Car_Owner to select it
Figure 11.13 Conditional Probability Table for Car_Owner
These conditional probabilities are based on the actual data. Thus, if we created a table with CHURNED, Car_Owner, and (binned) LONGDIST, we would find, for example, that of those customers who have the lowest value of LONGDIST (<5.996) and who are current customers (first row in the table), 20% (.20) own a car and 80% (.80) do not (for reference, about 30% of all customers in the Training partition own a car). However, since we are interested in predicting CHURNED, we can look at this table a bit differently. If we hold LONGDIST constant (looking at the first 3 rows in the table, where LONGDIST < 5.996), we can see how car ownership varies by churn status. Customers who are voluntary churners (Vol) are more likely to own a car; customers who are current are the least likely.
It is the use of these probabilities that allows the TAN model to make predictions. Of course, this conditional probability table includes only two inputs. Since a customer will have a value on all inputs (given the default of listwise deletion), there will be many conditional probability distributions that must be taken into account when making a prediction. That is what the model does, with the help of Bayes' theorem to combine the probabilities (see the PASW Modeler 14 Algorithms Guide for details; the scoring sketch in Section 11.3 shows the idea in miniature).
All the cells shown have at least one customer, because there are probabilities listed in each one. Other conditional probability tables have zeros because no customer fit that particular pattern of values. This condition will be important in our discussion of the model predictions next.
Close the Bayes Net model browser
Add an Analysis node to the stream and connect it to the Bayes Net model
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box
Click Run
Figure 11.14 Analysis Node Output for TAN Model
On the Training partition the model is accurate on 78.16% of the cases. It is most accurate on current customers and least accurate on involuntary churners. The model doesn't do very well at all on the Testing partition, where it is accurate on only 69.79% of the cases overall.
There are no missing data in this data file, yet something odd appears in the Coincidence Matrix for the Testing data: there is a fourth predicted value of $null$, i.e., missing. Why would the Bayes network predict a missing value or, more accurately, be unable to make a prediction for 13 cases? (The accuracy statistics don't drop the missing cases; they count them as incorrect predictions.) The fact that missing predictions appear in the Testing partition but not the Training partition is the tip-off to the cause. As mentioned above, the conditional probabilities are used to make predictions, e.g., the probability of .20 for a customer with low long distance usage who owns a car and is a current customer. But if a combination of values (a cell in a table) occurs in the Testing data without appearing in the Training data, the network cannot make a prediction for a customer with those characteristics.
Fortunately only 13 customers have a missing predicted value, which is less than 2% of the file. This is probably acceptable, but it does illustrate the importance of having a training dataset large and varied enough that all possible combinations have one or more records. We'll next try the other type of network structure, a Markov Blanket, using the default settings otherwise.
Close the Analysis output browser
Edit the Bayes Net modeling node
Click Model tab
Click Markov Blanket
Click Run
After the model has run: Right-click and Browse the generated Bayes Net model
Figure 11.15 Bayes Net Model Browser for Markov Blanket Model
This model looks very different from the TAN network. First, not all the predictors are used. Second, the arrows go from the inputs to the target field, which is the direction we expect for a causal predictive model; but, as with the TAN model, the arrows should not be taken to indicate causal influence, and arrows in a Markov Blanket can point away from the target field. Third, there are no connections between the inputs. This isn't always the case in a Markov Blanket, but it is more likely than with a TAN network. In fact, this network is equivalent to a Naïve Bayes classifier.
The top three fields on the predictor importance chart are identical to those for the TAN network, although the order is different. Let's view the conditional probability table for CHURNED (the tables for the inputs are uninteresting).
Click on the node for CHURNED
Expand the right half of the model browser to view the probability table
Figure 11.16 Conditional Probability Table for CHURNED
The table is very large, so we can't display all of it in the figure above. Because all four inputs have arrows pointing toward CHURNED, its conditional probability table contains all four of these fields. The first thing we can see is that there are many cells with a probability value of 0, which indicates that there were no customers with that combination of values. Second, this type of table is easier to think about and use in the context of predicting CHURNED, because we can choose various combinations of values of the inputs and see what the distribution of CHURNED is. So, for example, if we select males who make very few calls in any category (the first row in the table), we see that they are very unlikely to be voluntary churners (.051 probability). Let's see how well the Markov Blanket model does at predicting CHURNED.
Close the Bayes Net model browser
Run the Analysis node
Figure 11.17 Analysis Node Output for Markov Blanket Model
This model, with fewer predictors, is less accurate, which is perhaps not unexpected. Overall accuracy on the Training data is 73.16%. The performance on the Testing data is also lower than that of the TAN model, though not by as much. There are missing predictions for 39 customers, many more than with the TAN model. This is because the conditional probability table for CHURNED has many zero cells, all of which lead to a missing prediction (because of multiplication by 0 in the probability equations). If the number of missing predictions is small, we can rerun these models and request the Bayes adjustment for small cell counts, which effectively adds a small amount to any cell with a zero count. We'll return to the TAN network, which was more accurate.
Close the Analysis output browser
Edit the Bayes Net modeling node
Click Model tab (if necessary)
Click TAN option button
Click Bayes adjustment for small cell counts check box
Click Run
After the model runs: Right-click and Browse the generated Bayes Net model
Figure 11.18 TAN Network Model with Bayes Adjustment
The model is essentially identical to what we saw before. To see how the Bayes adjustment has affected the network, we need to view a conditional probability table. Click on the node for LOCAL Expand the right half of the model browser to view the table
Figure 11.19 Conditional Probability Table for LOCAL
What we can see are many cells with a gray background. All of these cells have a zero count and so have been given a Bayes adjustment, which means that these patterns (cells) in the data can now contribute to predictions. We can see this by viewing model predictions with the Analysis node.
Click Annotations tab
Click Custom
Type the text Bayes adjustment – TAN
Click OK
Add this model to the stream by the Type node
Connect the Type node to the Bayes adjustment model
Add an Analysis node to the stream and connect it to the Bayes adjustment model
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box
Click Run
Figure 11.20 Analysis Node Output for TAN Model with Bayes Adjustment
The accuracy on the Training partition remains at 78.3% because it is not affected by the Bayes adjustment. But there are now no predictions of $null$ for the Testing partition, so predictions can be made for all of these cases, and this increased the accuracy on the Testing data to 70.71%.
Using a Bayes adjustment doesn't guarantee that there will be no missing predictions. In fact, if you rerun the Markov Blanket model with a Bayes adjustment, there will still be 39 cases with a missing prediction. This is because an adjustment can be made for an existing pattern by adding a small amount to a cell with a zero count, but if a pattern is completely missing from the Training data, it still won't be possible to make a prediction for it in the Testing data. As mentioned in an earlier section, using the Bayes adjustment is fine when the amount of missing data is a small portion of the data file, but when there is a large amount of missing data, another solution should be employed. At this point, we can continue to use the TAN network but change the Maximal conditioning set size parameter.
11.5 Modifying Bayes Network Model Settings
As with SVM models and many other types of models (neural networks, decision trees), finding the best model requires some experimentation with settings. Two key options for a Bayes Net are feature selection and the Maximal conditioning set size.
Close the Analysis output browser
Edit the Bayes Net modeling node
Click Include feature selection preprocessing step check box
Figure 11.21 Requesting Feature Selection for TAN Model
Although we have only about a dozen input fields, to change the Maximal conditioning set size for a TAN model we need to request feature selection (this is a consequence of how the model is calculated, without taking parent and child nodes into account).
Click Expert tab, then click Expert option button
Figure 11.22 Feature Selection and Maximal Conditioning Set Size Options
We don't need to specify any inputs that must always be in the network, and we'll leave the Maximum number of inputs at 10, which means that the TAN network can include only 10 of the 12 possible inputs.
The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to carry out independence testing and remove unnecessary links from the network. A TAN network can also use a conditioning set to do feature selection. The higher the value for Maximal conditioning set size, the more time and memory processing requires, but a higher value can be especially useful when the data have strong dependencies among many fields. We don't expect strong relationships among the predictors, so we'll reduce the value to 3 and then run the model.
Change the Maximal conditioning set size to 3
Click Run
When the model has been generated: Right-click the generated model and select Browse
Figure 11.23 TAN Network with Maximal Conditioning Set Size=3
The resulting TAN network is much simpler than the original and includes only 4 fields. These are the same four fields that were included in the Markov Blanket network, and they were four of the top five in predictor importance in the original TAN network. As before, arrows point away from CHURNED to the inputs. No arrows point to LONGDIST from any other input. Click on the node for LONGDIST
Figure 11.24 Conditional Probability Table for LONGDIST
The conditional probability table for LONGDIST only includes CHURNED because that is its only parent. We see that involuntary churners are very likely to have little long distance call usage (probability .992 in the lowest category), while the probabilities are spread more evenly for the other two types of customers. This network is very simple, but how will it do in predicting CHURNED? We’ll use an Analysis node to get the answer.
Close the Bayes Net model browser
Add the generated Bayes Net model to the stream
Connect the new model to one of the Analysis nodes, replacing the connection
Run the Analysis node
Figure 11.25 Analysis Node Output for Modified TAN Network
The overall accuracy on the Training data has declined to 72.6%, a substantial drop. However, notice that the accuracy on the Testing data is 71.24%, an increase of almost 1%. And, in the final analysis, how the network does on the Testing data is the key criterion. It would appear, although this is only an educated guess, that the more complicated TAN models somewhat overfit the data, and that the Markov Blanket wasn't quite complicated enough. As is standard in data-mining work, we would continue developing variants of a Bayes Net model to try to find a handful of candidates for further testing. However, there are fewer parameters to modify than in, say, an SVM model, so that process shouldn't be too burdensome.
Close the Analysis output browser
We'll conclude the discussion of Bayes Net models by seeing how the predicted values are related to the inputs. In this latest model, the field LONGDIST has only CHURNED in its conditional probability table. LONGDIST has been binned into five categories (visible in Figure 11.24), but we won't take the time to replicate this binning with a Reclassify node; we'll just use a Histogram with an overlay to look at the general relationship. There are Select nodes at the bottom of the stream that select the Training or Testing partitions. We'll use the one for the Training data.
Add a Select node to the stream
Connect the TAN model named Bayes adjustment – TAN to the Select node
Add a Histogram node to the stream below the Select node, and connect these two nodes
Edit the Histogram node
Select LONGDIST as the Field and CHURNED as the Color Overlay field
Click Options tab
Click Normalize by color
Click Run
Figure 11.26 Histogram of LONGDIST with CHURNED as Overlay on Training Data
As the conditional probability table suggests, essentially all the involuntary churners have low values on LONGDIST. The proportion of customers who are current or voluntary churners is about equal across values of LONGDIST, and the pattern in the histogram echoes this. Now we'll look at the predicted values of CHURNED.
Close the Histogram window
Edit the Histogram node
Change the Color Overlay field to $B-CHURNED
Click Run
Figure 11.27 Histogram of LONGDIST with Predicted CHURNED as Overlay on Training Data
The patterns are extremely similar, although there is more range in the values of LONGDIST for predicted involuntary churners (all such values are still in the first bin for that field). Although there are other inputs in the network, these two histograms are similar because the only direct parent of LONGDIST is CHURNED itself. You can work from the conditional probability tables if you are adept at reading that type of information, but it is likely that you will want to conduct this type of analysis relating the inputs to the original and predicted values to understand how a Bayes Net model makes its predictions. You may wish to continue this analysis with the TAN model, try these same histograms on the Testing partition, or use another input, such as SEX (Hint: use a Distribution node).
Summary Exercises

The exercises in this lesson use the file charity.sav. The following table provides details about the file. charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity and basic demographics such as age, gender and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category
In this set of exercises you will attempt to predict the field Response to campaign using a Bayes Net model.
1. If you have previously saved a stream that accesses the file charity.sav, you can use that stream. Otherwise, use a Statistics source node to read this file. Tell PASW Modeler to Read Labels as Names.
2. Attach a Type and Table node in a stream to the source node. Run the stream and allow PASW Modeler to automatically define the types of the fields.
3. Edit the Type node. Set all of the fields to role NONE.
4. We will attempt to predict response to campaign (Response to campaign) using the fields listed below. Set the role of all five of these fields to Input and the Response to campaign field to Target.
Pre-campaign spend category
Pre-campaign visits category
Gender
Age
Mosaic Bands (which should be changed to measurement level nominal)
5. Attach a Bayes Net node to the Type node. First create a TAN network with the default settings.
6. Once the model has finished training, browse the generated Bayes Net model. What are the most important fields? Are all fields used? Can you look at the conditional probability tables and learn anything about the network? How does predictor importance compare to the Neural Net results in Lesson 4 or the SVM results in Lesson 5?
7. Place the generated Bayes Net node on the Stream canvas and connect the Type node to it. Connect the generated Bayes Net node to an Analysis node and create a matrix of actual response against predicted response. How well does this model do in predicting response to the campaign? How does its performance compare to other models?
8. Now create a Markov Blanket network and answer the same questions as in #6 and #7. Additionally, compare and contrast the two models. What are the differences? Which model does better at predicting response to campaign?
9. Use various methods to explore how the two most important predictors are related to predictions of the model.
10. For those with extra time: Try using a dataset with more fields, such as customer_dbase.sav, to predict an outcome with a more complex network. If you do so, you can use some of the Expert settings in the Bayes Net node.
Lesson 12: Finding the Best Model for Categorical Targets

Objectives
• Introduce the Auto Classifier Node
• Use the Auto Classifier Node to predict customers who will churn
Data In this lesson we will use the dataset churn.txt that we have used in several previous lessons. We will build models to predict whether a customer is loyal or not, and continue to use a Partition Node to divide the cases into two segments (subsamples), one to build or train the models and the other to test the models.
12.1 Introduction

When you are creating a model, it isn’t possible to know in advance which modeling technique will produce the most accurate result. Often several different models may be appropriate for a given data file and target, and normally it is best to try more than one. For example, suppose you are trying to predict a binary target (buy/not buy). Potentially, you could model the data with a Neural Net, any of the Decision Tree algorithms, an SVM model, a Bayes Net, Logistic Regression, Nearest Neighbor, Decision List, or Discriminant Analysis. Unfortunately, this process can be quite time-consuming.

The Auto Classifier node allows you to create models for categorical targets using a number of methods all at the same time, and then compare the results. You can select the modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants for each model. For instance, rather than choose between the Multilayer Perceptron or Radial Basis Function methods for a neural net model, you can try them both. The Auto Classifier node generates a set of models based on the specified options and ranks the candidates based on the criteria you specify. The supported algorithms include Neural Net, all decision trees (C5.0, C&R Tree, QUEST, and CHAID), Logistic Regression, Decision List, Bayes Net, Discriminant, Nearest Neighbor and SVM.

To use this node, a single target field with categorical measurement level (flag, nominal or ordinal) and at least one predictor field are required. Predictor fields can be continuous or categorical, with the limitation that some predictors may not be appropriate for some model types. For example, ordinal fields used as predictors in C&R Tree, CHAID, and QUEST models must have numeric storage (not string), and will be ignored by these models if specified otherwise. Similarly, continuous predictor fields can be binned in some cases (as with CHAID). The requirements are the same as when using the individual modeling nodes.

When an automated modeling node is executed, the node estimates candidate models for every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best models in a composite automated model nugget.

We continue to use the Churn.txt file which we used in many earlier lessons. However, we will combine the Voluntary and Involuntary Leavers into a single category in order to use the Auto Classifier.
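As an aside, the core idea the Auto Classifier automates is easy to express outside of PASW Modeler. The following minimal Python sketch (using the scikit-learn library, which is not part of this course’s software; the partitioned data arrays X_train and y_train are hypothetical) fits several candidate classifiers and ranks them by overall accuracy on the training partition:

    # Fit several candidate classifiers on the same training partition and
    # rank them on a chosen criterion (overall accuracy here), keeping the
    # top 3, mirroring the Auto Classifier's default number of models.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    candidates = {
        "Decision Tree": DecisionTreeClassifier(max_depth=5),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Neural Net (MLP)": MLPClassifier(max_iter=500, random_state=1),
    }

    results = []
    for name, model in candidates.items():
        model.fit(X_train, y_train)          # X_train, y_train: hypothetical data
        acc = model.score(X_train, y_train)  # accuracy used as ranking criterion
        results.append((acc, name, model))

    results.sort(key=lambda r: r[0], reverse=True)
    best_three = results[:3]                 # analogous to the composite nugget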
Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder Double-click on FindBestModel.str Place an Auto Classifier node from the Modeling palette to the right of the Type node Connect the Type Node to the Auto Classifier node Edit the Derive node named LOYAL
Figure 12.1 Creation of Flag Field Identifying Loyal Customers
In the Derive node we use the field CHURNED to create a new target with the name LOYAL. This target will be a flag, with a value of Leave when CHURNED is not equal to Current; this means that customers who are voluntary or involuntary leavers will have values of Leave. Current customers will have a value of Stay. Close the Derive node Edit the Auto Classifier node
Figure 12.2 Auto Classifier Node
The Auto Classifier will use partitioned data if available. It will also create separate models for each value of a split field. By default, the Auto Classifier keeps and displays the top-ranking 3 models according to the specified ranking criterion; you can increase or decrease this number. The Rank models by option allows you to specify the criteria used to rank the models. Note that the True value defined for the target field is assumed to represent a hit when calculating profits, lift, and other statistics (discussed below). We have defined Leave as the True category in the Derive node because we are more interested in locating persons who will leave as customers than those who will stay. Models can be ranked on either the Training or Testing data, if a Partition node is used. It is usually better to initially rank the models by the Training partition since the Testing data should only be used after you have some acceptable models. Predictor importance can also be calculated; this option is turned on by default, but it can significantly increase execution time. Click on the Rank models by dropdown to see the different ranking options
Figure 12.3 Ranking Options for Models
Overall accuracy refers to the percentage of records that are correctly predicted by the model relative to the total number of records.

Area under the curve (ROC curve) provides an index for the performance of a model based on a Gains curve from an Evaluation chart. The further the curve is above the baseline, the more accurate the model, and hence the greater the area.

Profit (Cumulative) is the sum of profits across cumulative percentiles (sorted in terms of confidence for the prediction), based on the specified cost, revenue, and weight criteria. Typically, the profit starts near 0 for the top percentile, increases steadily, and then decreases. For a good model, profits will show a well-defined peak, which is reported along with the percentile where it occurs. For a model that provides no information, the profit curve will be relatively straight and may be increasing, decreasing, or level, depending on the cost/revenue structure that applies.

Lift (Cumulative) refers to the ratio of hits in cumulative quantiles relative to the overall sample (where quantiles are sorted in terms of confidence for the prediction). For example, a lift value of 3 for the top quantile indicates a hit rate three times as high as for the sample overall. For a good model, lift should start well above 1.0 for the top quantiles and then drop off sharply toward 1.0 for the lower quantiles. For a model that provides no information, the lift will hover around 1.0.

Number of fields ranks models based on the number of fields used.

The Profit Criteria section is used to define the costs, revenue, and weight values for each record, for flag targets only. Profit equals the revenue minus the cost for each record. Profits for a quantile are simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but costs apply to all records. Use the Costs option to specify the cost associated with each record; you can specify either a Fixed or Variable cost. Use the fixed costs option if the costs are the same for each record. If the costs are variable, select the field which holds the cost associated with each record. The Revenue option is used to specify the amount of revenue associated with each record; again, this value can be either Fixed or Variable. The Weight option should be used if your data represent more than one unit. This option allows you to use frequency weights to adjust the results. For fixed weights, you will need to specify the weight value (the number of units per record). For variable weights, use the Field Selector button to select a field as the weight field.

Note that model profit will have nothing to do with monetary profit unless you specify actual cost and revenue values. Nevertheless, the defaults will still give you some sense of how good the model is compared to other models. For example, if it costs you 5 dollars to send out a promotion, and you get 10 dollars in revenue for each positive response, the model with the highest cumulative profit would be the one with the most hits.

Lift Criteria is used to specify the percentile to use for lift calculations. The default is 30.
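To make the lift and profit calculations concrete, here is a small hedged sketch in Python (pandas assumed; the confidences and actual_hits arrays stand in for hypothetical scored-record fields, and the cost and revenue values echo the 5-dollar/10-dollar example above):

    import pandas as pd

    # One row per scored record: the confidence of the model's prediction and
    # a flag for whether the record is actually a hit (e.g., a leaver).
    scored = pd.DataFrame({"confidence": confidences, "hit": actual_hits})
    scored = scored.sort_values("confidence", ascending=False).reset_index(drop=True)

    overall_rate = scored["hit"].mean()

    # Cumulative lift for the top 30% (the node's default lift percentile):
    top = scored.head(int(len(scored) * 0.30))
    lift_30 = top["hit"].mean() / overall_rate   # e.g., 3.0 means 3x the base rate

    # Cumulative profit with a fixed cost per record and fixed revenue per hit;
    # costs apply to every record, revenue only to hits.
    cost, revenue = 5.0, 10.0
    scored["profit"] = scored["hit"] * revenue - cost
    cum_profit = scored["profit"].cumsum()       # typically rises to a peak, then falls
    peak_percentile = (cum_profit.idxmax() + 1) / len(scored)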
Select Overall accuracy from the Rank models by: dropdown
Click Training partition button under Rank models using: Click the Expert tab
Figure 12.4 Auto Classifier Expert Tab
The Expert tab allows you to select from the available model types and to specify stopping rules and misclassification costs. By default, all models are selected except KNN and SVM. However, it is important to note that the more models you select, the longer the processing time will be. You can uncheck a box if you don’t want to consider a particular algorithm. The Model parameters option can be used to change the default settings for each algorithm, or to request different versions of the same model. In this example, we will request both Neural Net model algorithms and accept the default values for all the other models. Click on the Model Parameters cell for Neural Net and select Specify
Figure 12.5 Algorithms Simple Tab for Neural Net Models
Click in the Neural network model row in the Options cell and select Both from the dropdown list
Figure 12.6 Selecting Neural Network Model Setting
The Auto Classifier node will now try both types of neural network models. The Expert tab within the Algorithm settings dialog allows detailed changes to specific models. Click Expert tab
Figure 12.7 Algorithm Settings Expert Tab for Neural Net
The settings in this dialog are those that would be available in the Neural Net node. Note that the Set random seed parameter is set to a numeric value. This means each time the Auto Classifier node is run, the same neural net model will be found for these data and target (if all other settings are the same). If you find a neural net model that performs well (or any other model dependent on a random seed), then you may wish to change the random seed and rerun the Auto Classifier to check for model stability. Click the Simple tab, and then click OK Click Stopping rules… button
Figure 12.8 Stopping Rules Dialog
Stopping rules can be set to restrict the overall execution time to a specific number of hours. All models generated to that point will be included in the results, but no additional models will be produced. In addition, you can request that execution be stopped once a model has been built that meets all the criteria specified in the Discard tab. Click Cancel Click the Discard Tab
Figure 12.9 Auto Classifier Discard Tab
The Discard tab allows you to automatically discard models that do not meet certain criteria. These models will not be listed in the summary report. You can specify a minimum threshold for overall accuracy, lift, profit, and area under the curve, and a maximum threshold for the number of fields used in the model. Optionally, you can use this dialog in conjunction with Stopping rules to stop execution the first time a model is generated that meets all the specified criteria. Click the Settings tab
Figure 12.10 Auto Classifier Settings Tab
The Settings tab of the Auto Classifier node allows you to pre-configure the score-time options that are available on the nugget. For flag targets you can select from the following Ensemble methods: Voting, Confidence-weighted voting, Raw propensity-weighted voting (flag targets only), Highest confidence wins, and Average raw propensity (flag targets only). For voting methods, you can specify how ties are resolved: you can choose one of the tied values randomly, choose the tied value that was predicted with the highest confidence, or choose the tied value with the largest absolute raw propensity.
Misclassification Costs

In some contexts, certain kinds of errors are more costly than others. For example, it may be more costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the relative importance of different kinds of prediction errors. Misclassification costs are basically weights applied to specific outcomes. These weights are factored into the model and may actually change the prediction (as a way of avoiding more costly mistakes). Misclassification costs are not taken into account when ranking or comparing models using the Auto Classifier node. A model that includes costs may produce more errors than one that doesn't and may
not rank any higher in terms of overall accuracy, but it is likely to perform better in practical terms because it has a built-in bias in favor of less expensive errors. Click the Expert tab Click the Misclassification costs button
The cost matrix shows the cost for each possible combination of predicted category and actual category. By default, misclassification costs are set to 0.0 for the cells with correct predictions, and 1.0 for cells that represent errors of prediction (misclassification). To enter custom cost values, select the Use misclassification costs checkbox and enter custom values into the cost matrix. Figure 12.11 Misclassification Costs Dialog
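As an aside, to illustrate how such a cost matrix can change a prediction, consider this brief Python sketch (NumPy assumed; the categories and cost values are illustrative, not taken from the lesson data). With costs, the decision rule becomes "choose the category with the lowest expected cost" rather than "choose the most probable category":

    import numpy as np

    # Rows = actual category, columns = predicted category, for a flag target
    # with the (illustrative) categories [Stay, Leave]. Correct predictions
    # cost 0; calling an actual leaver "Stay" is 5x as costly as the reverse.
    cost = np.array([[0.0, 1.0],    # actual Stay:  cost of predicting Stay, Leave
                     [5.0, 0.0]])   # actual Leave: cost of predicting Stay, Leave

    # The model's probabilities for one record: P(Stay)=.6, P(Leave)=.4
    probs = np.array([0.6, 0.4])

    # Expected cost of each possible prediction, summed over actual categories.
    expected_cost = probs @ cost          # -> [2.0, 0.6]
    prediction = expected_cost.argmin()   # 1 = Leave, though Stay is more probable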
We don’t need to specify costs for this example. Click OK Click Run
Since the Auto Classifier can take some time to compute all the models, especially if variants of models are requested, a feedback dialog is presented while the node is running.
Figure 12.12 Execution Feedback Dialog
Once the Auto Classifier is finished: Edit the LOYAL model nugget in the stream
Figure 12.13 Auto Classifier Results for Testing Set
Although we requested that the models be ranked on the Training partition, the default View is of the Testing set (Partition). So let’s switch to the Training data. Click the View: dropdown and select Training set
Figure 12.14 Auto Classifier Results for Training Set
Here we see the top-ranking three models contained in the nugget, including C5.0, CHAID, and Decision List (see the Appendix for information on that type of model). The number to the right of the model type indicates whether this is the first, second, etc. model of that type created by the Auto Classifier. The best model is the C5.0 model, which is over 90% accurate overall at predicting LOYAL. The order of the models on accuracy is the same on the Training data as the Testing data. Ranking models by Lift would place Decision List first, followed by C5.0. As with any modeling exercise, you need to choose the model criterion that is most appropriate for your specific data-mining problem. But unlike when using one algorithm at a time, the Auto Classifier makes it easy to compare models on several factors, as all the various ranking measures are displayed in separate columns. You can use the Sort by: option or click on a column header to change the column used to sort the table. In addition, you can use the Show/hide columns menu tool to show or hide specific columns.

Delete Unused Models will permanently delete all models that are unchecked in the Use? column. The bar chart thumbnails show how the model predictions are related to the actual value of the target field LOYAL. The full-sized plot includes up to 1,000 records and will be based on a sample if the dataset contains more records. Let’s see how the C5.0 model does. Double-click on the Graph thumbnail for the C5.0 model
Figure 12.15 Distribution Graph of LOYAL by Predicted LOYAL for C5.0 Model
The predicted value ($C-LOYAL) is overlaid on the actual value of the target field. Ideally, if the model were 100% accurate, each bar would be of only one color because the overlap would be perfect. We can see from the graph that the model does fairly well at predicting customers who will stay and extremely well for those who will leave: most of the bar for Leave is blue (expand the window vertically to see this) and most of the bar for Stay is red, meaning there is close agreement between the actual and predicted values. Close the Graphboard window
It is also possible to see how well the ensemble of three models does in predicting LOYAL. Click the Graph tab
Figure 12.16 Accuracy and Predictor Importance of Ensemble of Models
As with the C5.0 model, the predicted value is overlaid on the actual value of LOYAL. The graph includes both the training and testing partitions. Note that the accuracy for Leave is somewhat less than for the C5.0 model alone; the models are combined by confidence-weighted voting, the default. Also displayed is the predictor importance for the combined models. No one predictor is especially important. In order to further compare the performance of these three models, we can generate an evaluation chart directly from the Auto Classifier nugget. Click Generate…Evaluation Chart(s)
Figure 12.17 Evaluation Chart Selection Dialog
Since we have used accuracy to rank and select the models, we’ll use Lift to further evaluate the models on another criterion. Click Lift Click OK
Figure 12.18 Lift Charts for Models
The best possible model is represented by the dark green line labeled $BEST. Initially the three models have about the same lift value, but eventually the C5.0 model surpasses the other two models. This is more evidence that the C5.0 model is the best performer. Close the Evaluation chart window
You can also view each model in its standard Model Viewer. As illustration: Click on the Model tab Double-click on the C5.0 Model cell in the Model column Click All button on toolbar
Figure 12.19 C5.0 Model
The Model Viewer has the same detail and options as for a C5.0 model created from that modeling node. In this way you can explore specific characteristics of a model. Click OK Close the Auto Classifier Model
We would normally continue to explore models here to see which ones are most satisfactory. One of the tricky things about using the Auto Classifier (and the Auto Numeric node in the next lesson) is that you may have many models from which to choose. When you are comparing only two or three models, it is easy to simply look at the results on the Training partition, and then on the Testing
partition, to decide which model to select. But with many models, what is the appropriate methodology to follow for model selection? Ideally, you select the candidate models before looking at the Testing data, although some analysts would argue that the only thing that matters is performance on the Testing data. However, our advice is to pick a set of possible models to assess: not all models generated, but more than just one or two. This will require looking at the evaluation chart, considering other ways of ranking the models, examining how the models make their predictions (what are the important fields; what are the decision tree rules, etc.), seeing which categories the models predict more accurately, and perhaps setting minimum levels of lift, accuracy, or other measures. For this example, let’s next use an Analysis node to evaluate the performance of the ensembled three best models. Place an Analysis node to the right of the Auto Classifier nugget named LOYAL Connect the LOYAL model node to the Analysis node Edit the Analysis node Click Coincidence matrices (for symbolic targets) (not shown) Click Run
Figure 12.20 Analysis Node Output for Ensemble of Models
We see that the ensembled model is reasonably accurate on both data partitions, with accuracy in the training data of 87.40%, and in the testing data of 82.19%. From the Coincidence Matrix, the model correctly identifies about 93% (316 of 338) of Leavers in the Training partition and 91% (280 of 307) of Leavers in the Testing partition. For the current customers who will stay, the ensembled model didn’t yield the same degree of accuracy. Close the Analysis window Double-click the Auto Classifier nugget named LOYAL
Since the most accurate model on the Testing data was C5.0, let’s examine its accuracy further with an Analysis node. We can select the C5.0 model in the model column and then create a generated model. Double-click the C5.0 model nugget in the Model column In the Model Viewer, click Generate…Model to Palette Click OK, and then OK again
Figure 12.21 C5.0 Model Added to the Models Palette
Move the generated C5.0 to the Stream Canvas Connect the C5.0 model to the Type node Place an Analysis node to the right of the C5.0 model Connect the C5.0 model to the Analysis node
Figure 12.22 Revised Stream with the Addition of the C5.0 Model and an Analysis Node
Edit the Analysis node Click Coincidence matrices (for symbolic targets) (not shown) Click Run
Figure 12.23 Analysis Node Output for C5.0 Model
We observe that the C5.0 model is very accurate on both data partitions, though accuracy fell a bit, as expected, on the Testing partition. It is more accurate than the ensemble of three models. The C5.0 model correctly identifies almost all of the Leavers in the Testing partition (302 out of 307, or about 98.4%!). And while it didn’t predict current customers who will stay with the same degree of accuracy, it still did very well with this group. You could investigate the other candidate models in a similar way to see which ones do better on which category of customers. When all this work is done, you will have a winning model, either one of the models, or a combination of the models.
Summary Exercises

The exercises in this lesson use the data file charity.sav. The following table provides information on this file. charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity and basic demographics such as age, gender and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category
1. Begin with a blank Stream canvas. Place a Statistics File source node on the canvas and connect it to charity.sav.
2. Try to predict Response to campaign using all the available model choices in the Auto Classifier. Use the defaults first. Which model is best, and which is worst? You can choose the criterion for ranking models, or use more than one. Which models use fewer inputs?
3. Now change some of the model settings on one or more models and rerun the Auto Classifier. Request more than 3 models. Does the order of models change?
4. Pick two or more models and generate a model for each. Add them to the stream and use an Analysis node or other nodes to further compare their predictions. Which model would you use, and why?
5. Then use an Analysis node with the Auto Classifier model to compare the predictions of the ensemble of models. Does the ensemble do better than any individual model?
Lesson 13: Finding the Best Model for Continuous Targets

Objectives
• Introduce the Auto Numeric Node
• Use the Auto Numeric Node to predict birth weight of babies
Data In this lesson we use the dataset birthweight.sav. This file contains information on the births of about 380 babies and characteristics of their mothers, such as age and various health measures (smoking, history of hypertension, etc.). Researchers are interested in accurately predicting birth weight months in advance so that interventions can be done for potential low birth weight babies to increase their chances of survival. This dataset is relatively small, which is typical of many medical studies, but as good practice we will still use a Partition node with the data.
13.1 Introduction

In the previous lesson we learned how to automate the production of models to predict categorical targets with the Auto Classifier node. In this lesson we discuss the Auto Numeric node, which in an analogous manner can automate the production of models for targets that are numeric with a continuous level of measurement.

The Auto Numeric node allows you to create models for continuous targets using a number of methods all at the same time, and then compare the results. You can select the modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants for each model. The supported algorithms include Neural Net, C&R Tree, CHAID, Regression, Linear, Generalized Linear Models, KNN and SVM.

To use this node, a single target field of measurement level continuous and at least one predictor field are required. Predictor fields can be categorical or continuous, with the limitation that some predictors may not be appropriate for some model types. For example, C&R Tree models can use categorical string fields as predictors, while linear regression models cannot use these fields and will ignore them if specified. The requirements are the same as when using the individual modeling nodes.

The format of this lesson will match that of the previous lesson on the Auto Classifier. We begin by opening an existing stream file and reviewing the data.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder Double-click on NumericPredictor.str Run the Table node
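As an aside, the idea parallels the classifier sketch from the previous lesson. A minimal Python version for a continuous target (scikit-learn assumed; the partitioned data names are hypothetical) would rank candidate regression models by the correlation criterion discussed later in this lesson:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neural_network import MLPRegressor

    # Candidate models for a continuous target such as birth weight.
    candidates = {
        "Regression": LinearRegression(),
        "Tree": DecisionTreeRegressor(max_depth=4),
        "Neural Net": MLPRegressor(max_iter=1000, random_state=1),
    }

    rankings = []
    for name, model in candidates.items():
        pred = model.fit(X_train, y_train).predict(X_train)
        r = np.corrcoef(y_train, pred)[0, 1]   # correlation of observed and predicted
        rankings.append((r, name))
    rankings.sort(reverse=True)                # best model first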
Figure 13.1 Birthweight Data File
In the Statistics source node, we have checked the option to Read labels as names so the variable labels become the field names in PASW Modeler. The field we want to predict is in the last column (Birth Weight in Grams) which measures actual birth weight. There is also a separate field (Low Birth Weight) which indicates whether the birth weight was below a certain threshold. We won’t use that field in this example. All other fields can be used as predictors except for id, of course. The Partition node splits the data into equal parts for training and testing. We need to set the role for the fields in the model. Close the Table window Edit the Birthweight.sav source node Click on the Types tab
We need to fully instantiate the data so that PASW Modeler has values for all fields. We also need to change the role of id and Low Birth Weight to None, and then the role of Birth Weight in Grams to Target. Click Read Values button Click OK
Figure 13.2 Types Tab for Birthweight Data
Change the role of id and Low Birth Weight to None Change the role of Birth Weight in Grams to Target (not shown) Click OK
Before attempting to model birth weight, let’s look at its distribution with a Histogram. Add a Histogram node to the stream Attach the Source node to the Histogram node Edit the Histogram node Specify the Field as Birth Weight in Grams (not shown) Click Run
Figure 13.3 Histogram of Birth Weight
The distribution of birth weight is approximately normal, peaking around 3,000 grams, or about 6.6 pounds. Many physical and biological quantities have a normal distribution, which makes modeling less challenging. When a continuous field is distributed normally, just about any technique can be used to predict it. Also, there aren’t too many outliers on either the low or high end. This is because babies born alive can only be so small, or large. This also makes creating models less problematic. We can now add an Auto Numeric node to the stream. Add an Auto Numeric node to the right of the Partition node Connect the Partition node to the Auto Numeric node Edit the Auto Numeric node Click the Model tab if necessary
Figure 13.4 Auto Numeric Node
The Auto Numeric node will use partitioned data and build a model for each split, if available. Models can be ranked on either the Training or Testing data, if a Partition node is used. It is usually better to initially rank the models by the Training partition since the Testing data should only be used after you have some acceptable models. As with the Auto Classifier, the number of models to use is 3 by default. Predictor importance is turned on by default but it may lengthen the execution time. Specify the Number of models to use: as 8 Click on the Rank models by menu to see the different ranking options
Figure 13.5 Ranking Options for Models
The Rank models by option allows you to specify the criteria used to rank the models. Because we are predicting a continuous target, the choices to rank models are suited for this type of target. They include:
• Absolute value of correlation between observed and predicted values
• Number of predictors used
• Relative error, which is defined as the ratio of the error variance for the model to the variance of the target field
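The first and third criteria are simple to compute by hand. A short hedged sketch (NumPy assumed; pred and actual are hypothetical arrays of predicted and observed target values on one partition):

    import numpy as np

    # Ranking by the absolute correlation between observed and predicted values:
    r = abs(np.corrcoef(actual, pred)[0, 1])

    # Relative error: the error variance of the model divided by the variance
    # of the target around its mean (i.e., versus a null model that always
    # predicts the mean). Values below 1.0 mean the model beats the null model.
    relative_error = np.var(actual - pred) / np.var(actual)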
If the relationship between predicted and observed values is not linear, the correlation is not a good measure of fit or ranking. We’ll view scatterplots to make that determination. The options to discard models are listed here in the Model tab dialog. You can specify criteria that correspond to the ranking options to discard candidate models. The more models you generate, the more likely you are to use these options, but we don’t need to do so for this example. When using a continuous target, profit can’t be defined based on a predicted category, nor can misclassification costs be defined (a model can certainly directly predict revenue or profit, but that isn’t the same as defining profit based on a categorical target). Click Training partition button under Rank models using: Click the Expert tab
Figure 13.6 Auto Numeric Expert Tab
The Expert tab allows you to select from the available model types and to specify stopping rules. By default, all model types except KNN and SVM (six in total) are checked and will be used. The Model parameters option can be used to change the default settings for each algorithm or to request different versions of the same model. In this example, we will request both Neural Net models and also a stepwise Regression model. Click on the Model Parameters cell for Neural Net and select Specify Click in the Neural network model row in the Options cell and select Both from the dropdown list
Figure 13.7 Changing Neural Network Model Setting
The Expert tab within the Algorithm settings dialog allows detailed changes to specific models. As with the Auto Classifier example, the random seed is set to a fixed value for the Neural Net models, so each time it is run the same model will be found for these data and target (if all other settings are identical). Now we’ll request a stepwise regression model as well. Click OK Click on the Model Parameters cell for Regression and select Specify Click in the Method row in the Options cell and select Specify
Figure 13.8 Regression Parameter Editor
The Enter method is used by default, but we’ll add Stepwise. Click Stepwise check box Click OK Click OK
Figure 13.9 Auto Numeric Dialog Completed
We have requested that a total of 8 models be constructed.
We won’t examine the Stopping rules dialog, which is identical to that for the Auto Classifier in terms of options and operation. Click the Settings tab
Figure 13.10 Auto Numeric Settings Tab
The Settings tab of the Auto Numeric node allows you to pre-configure the score-time options that are available on the nugget. For a continuous target, the ensemble scores will be generated by averaging the predicted value of each model used. Click Run
Since the Auto Numeric node can take some time to compute all the models, especially if variants of models are requested, a feedback dialog is presented while the node is running.
Figure 13.11 Execution Feedback Dialog
Once the Auto Numeric model is finished: Edit the Birth Weight in Grams model nugget in the stream
Although we requested that the models be ranked on the Training partition, the default View is of the Testing set (Partition). So let’s switch to the Training data. Click the View: dropdown and select Training set
Figure 13.12 Auto Numeric Results for Training Set
There are wide differences in model performance. The best model, the first Neural Net with a Multilayer Perceptron, has a correlation between the predicted and actual values of 0.543; the worst model, the second Regression (Stepwise method), has a correlation of only 0.255. The Regression 2 model, which used stepwise selection, includes only one predictor (Presence of Uterine Irritability); you can find this information by double-clicking on the model icon. The relative error is the ratio of the variance of the observed values from those predicted by the model to the variance of the observed values from the mean. In practical terms, it compares how well the model performs relative to a null or intercept model that simply returns the mean value of the target field as the prediction. For a good model, this value should be less than 1, indicating that the model is more accurate than the null model. The same differences are evident in the relative error, where smaller numbers closer to zero are better. The Neural Net is again the best model, followed by a Generalized Linear model. This is an instance of automatic modeling where the best model clearly stands out from the others. Scatterplot thumbnails are provided for each model to show how the model predictions are related to the actual value of the target field. As noted above, if the relationship between the actual and predicted fields isn’t linear, then the correlation is not an appropriate measure to use for ranking models. Let’s see how the Neural Net model does. Double-click on the graph thumbnail for the Neural Net 1 model
Figure 13.13 Distribution Graph of Birth Weight by Predicted Birth Weight for Neural Net
If the model predictions were perfect, all the points would fall on a straight line running from the lower left to upper right. Although the neural net model is far from perfect, that is the general tendency of the points. To see what a poor model’s predictions look like in a scatterplot, let’s open the graph for the C&R Tree model. Close the Graphboard window Double-click on the graph thumbnail for the C&R Tree model
Figure 13.14 Distribution Graph of Birth Weight by Predicted Birth Weight for C&R Tree
The difference between the plots is immediately evident. The decision tree model is clearly a poor performer. In fact, it seems to predict only two values for birth weight because the number of distinct predictions depends on the number of terminal nodes in the tree. Close the Graphboard window
Because we are predicting a continuous field, evaluation charts are not available to further assess the models. We can look at how well the ensemble of models does in predicting birth weight. Click the Graph tab Move the slider in the Predictor Importance pane to the fourth position from the left
Figure 13.15 Scatterplot and Predictor Importance of Ensemble of Models
Predictor importance is calculated from the test partition; while no one predictor is very important, the more important predictors include the mother smoking, number of physician visits, and history of hypertension. The scatterplot is similar to those we viewed above, although the position of the variables on the axes is flipped so that the predicted value of birth weight is on the X axis and the observed value is on the Y axis. We expect to see a reasonably strong linear correlation between these two values, which is apparent here. The next step is to look at the performance of the models on the Testing partition. Click Model tab Click View dropdown and select Testing set
Figure 13.16 Auto Numeric Results for Testing Set
The results are somewhat different, perhaps because of the small sample sizes used in the two data partitions. The best model is now the generalized linear model, with the neural net second. CHAID has fallen to number 5, while the first regression model is now third. There is also less difference between the first two models, and in fact, the neural net model has lower relative error than the generalized linear model, so it may still be the best performer. Despite the changes in model performance, we know that we should focus on the results on the Testing partition when selecting final models. You may want to engage in a class discussion about how to use the Training and Testing data to select models. For this example, we will select the top three models: Generalized Linear 1, Neural Net 1, and Regression 1. We will check whether the ensembled model makes more accurate predictions than the Neural Net 1 model alone. Double-click the Neural Net 1 icon in the Model column Click Generate…Model to Palette Click OK Deselect the Neural Net 2, CHAID 1, Linear 1, C&R Tree 1 and Regression 2 models in the Use? column Click OK
We can use an Analysis node to evaluate the models further, as we did with the Auto Classifier. Drag the Neural Net 1 model nugget to the right of the Birth Weight in Grams nugget Connect the two nuggets
Figure 13.17 Revised Stream with the Addition of the Neural Net Model
Place an Analysis node to the right of the Neural Net 1 model Connect the Neural Net 1 model to the Analysis node Right-click the Analysis node and select Run
Figure 13.18 Analysis Node Output
The Analysis node provides various summary measures for the ensembled model (Generalized Linear, Neural Net and Regression) and the Neural Net model on its own. These include the model minimum and maximum error, the mean error, the mean absolute error (the better measure of the two), the standard deviation, and the correlation. By mean absolute error, the ensembled model performed better. You could investigate the other ensembled combinations in a similar way to see which does better at predicting birth weight. You can also explore, using standard methods, how the ensembled model makes predictions, i.e., what is the relationship of the input fields to the predicted value of birth weight.
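For reference, the summary measures reported here are all straightforward to reproduce. A minimal sketch (NumPy assumed; actual and pred are hypothetical arrays, as before):

    import numpy as np

    errors = actual - pred
    min_err, max_err = errors.min(), errors.max()
    mean_err = errors.mean()                 # can be near 0 even for a poor model
    mae = np.abs(errors).mean()              # mean absolute error, the better gauge
    corr = np.corrcoef(actual, pred)[0, 1]   # correlation of actual and predicted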
Summary Exercises

The exercises in this lesson are written for the data file charity.sav. charity.sav is from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity and basic demographics such as age, gender and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category
1. Begin with a blank Stream canvas. Place a Statistics File source node on the canvas and connect it to charity.sav.
2. Try to predict Post-campaign expenditure using all the available model choices in the Auto Numeric node. Use the defaults first. Which model is best, and which is worst? You can choose the criterion for ranking models, or use more than one. Which models use fewer inputs?
3. Now change some of the model settings on one or more models and rerun the Auto Numeric node. Does the order of models change?
4. Pick two or more models and generate a model for each. Add them to the stream and use an Analysis node and other nodes to further compare their predictions. Which model would you use, and why? How do they compare to the ensemble of models?
Lesson 14: Getting the Most from Models

Objectives
• Discuss common approaches to improving the performance of a model in data mining projects
• Use an Ensemble node to combine model predictions
• Use propensity scores to score records
• Do meta-modeling to improve model performance
• Model errors in prediction
Data In this lesson we will use the dataset churntrain.txt that is a variant of the churn.txt file that we have used in several previous lessons. The data have been split into a separate training file, and we will build or use models constructed with it (there is also a churnvalidate.txt file that can be used for model testing).
14.1 Introduction

Throughout this course we have looked at several different modeling techniques, including neural networks, decision trees and rule induction, regression and logistic regression, Bayes Nets, SVM models, and discriminant analysis. After building a model we have usually performed some type of diagnostic analysis that helps with the interpretation of the model, and we have also done additional analyses to help determine where a model is more and less accurate.

In this lesson we develop and extend the model building skills learned so far. The key concept in these examples is that models built with an algorithm in PASW Modeler should usually (unless accuracy is very high and satisfactory) be viewed not as the endpoint of an analysis, but as a way station on the path to a robust solution. There are various methods to improve models, only some of which we discuss here, and you are likely to come up with your own as you become experienced using PASW Modeler and read references on data mining.

We provide methods in this lesson for how to improve a model, but there is no one simple answer as to how this should be done. That is because the appropriate method is highly dependent upon characteristics of the existing model that has been built. Potential things to consider when improving the performance of a model are:
• The modeling technique used
• The measurement level of the target field (categorical or continuous)
• Which parts of the model are under-performing, i.e., are less accurate
• The distribution of confidence values for the existing model
We begin the lesson with the Ensemble node, which is an automated method of combining the predictions from two or more models. We then discuss propensity scores and show how they can be used to score a model. Following from this, we consider other methods of combining models, including modeling the error from a model.
14.2 Combining Models with the Ensemble Node

Many authors of data-mining books and articles recommend developing more than one predictive model for a given project. This is usually good advice because there are so many model types available, and a priori, it isn’t normally possible to forecast which model will do better. Moreover, PASW Modeler makes it easy to try several models on the same data, including two nodes that automate the building of many models simultaneously.

If you do develop several models, though, you then have the question of how to use them to make predictions. The simplest approach is to use the best model, but what is the “best” model? Is it the most accurate overall, or the one that is most accurate at predicting the most critical category? And if the models are predicting a continuous target, there are several possible definitions of best model.

Since we have two or more models, another approach is to combine their predictions in some suitable manner, on the theory that two heads (models) are better than one. And in prediction, that is often true. There are a variety of methods to combine models. You could:
• Let the models vote, with the category predicted most frequently the “winner”
• Pick the model prediction with the highest confidence
• Let the models vote, but weight the voting by model confidence
• Average the model predictions if predicting a continuous field
And there are several other possibilities, including using the propensity scores now available for most models in PASW Modeler 14.0 (but only for flag fields). All the methods in the bullet list are available in the Ensemble node, which is designed to make combining models a simple process. Each Ensemble node generates a field containing the combined prediction, with the name based on the target field and prefixed with $XF_, $XS_, or $XR_, depending on the output field type (flag, set, or range, respectively). We’ll use a preexisting stream file with four generated models to demonstrate the Ensemble node with the churn data. The Ensemble node is located in the Field Ops palette because it creates new fields. Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder Double-click on Ensemble.str Run the Table node
Figure 14.1 Churntrain Data File
The training data has just over 1,100 records. We will predict the field CHURNED, which has a nominal measurement level with three categories (you can run the Distribution node to review its distribution). The Type node already has all the appropriate settings. Close the Table window
Looking at the stream (in Figure 14.2), we used four modeling nodes—CHAID, Neural Net, Bayes Net, and SVM—to create four models that have already been placed in the stream to save time. The models were created with all available predictors.
Figure 14.2 Ensemble Stream
Before using an Ensemble node, let’s see how well these four models predict CHURNED. Add an Analysis node to the stream near the last generated model and attach this model to the Analysis node Edit the Analysis node Click Coincidence matrices (for symbolic targets) (not shown) Click Run
There is a lot of output, so we show this in two figures. Figure 14.3 shows the results for all four models.
Figure 14.3 Analysis Node Results for Four Models
The most accurate model overall is the SVM, at 82.85%. The least accurate model is the Bayes Net at 79.15%. Interestingly, although there are only 100 customers in the InVol category, which should make this group more difficult to predict, 3 of the 4 models do very well with this group, and the CHAID model is 100% accurate, although it was not the best model overall. This illustrates the potential advantage of combining models.
Figure 14.4 Analysis Node Results When the Models Agree
The last two tables in the Analysis browser window show the accuracy when the model predictions are combined in a simplistic fashion. All four models make the same prediction for 69.13% of the cases. For this segment of the file, those predictions have an accuracy of 91.91%. Clearly, combining models can improve performance.
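The agreement figures in these last two tables are easy to reproduce by hand. A short hedged sketch (NumPy assumed; preds and actual are hypothetical arrays holding the four models’ category predictions and the true categories):

    import numpy as np

    # preds has shape (4, n_records): one row of category predictions per model.
    all_agree = np.all(preds == preds[0], axis=0)
    frac_agree = all_agree.mean()        # ~0.69 here: models agree on 69.13% of cases
    acc_when_agree = (preds[0][all_agree] == actual[all_agree]).mean()   # ~0.92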
The Ensemble Node

However, the models don’t make the same prediction for 30.87% of the customers, a sizeable fraction of the file. What to do with these records? What prediction should be made for them? This is where the Ensemble node can provide much assistance. Close the Analysis output browser window Place an Ensemble node from the Field Ops palette near the last generated model Attach the last generated model to the Ensemble node Edit the Ensemble node
Figure 14.5 Ensemble Node Settings Tab
The Ensemble node will use the settings in the last Type node or Types tab from a Source node, so it recognizes that CHURNED is the target field. Ensemble nodes can use flag, nominal, or continuous target fields from the upstream model nodes. Because the results from many models are being placed in one stream, and each one of those models generates at least two fields, by default the Ensemble node filters out all those generated fields (Filter out fields generated by ensembled models check box). If you want to continue to compare individual model predictions downstream, or use their predictions (see note below), then you will want to deselect this option. The key setting is the Ensemble method, which determines how the model predictions will be combined. Click Ensemble method dropdown
Figure 14.6 Ensemble Method Options
The choices available for Ensemble method will vary based on the measurement level of the target field. For a categorical target, there are three choices:

1) Voting: The node counts the number of times each value is predicted, and selects the value with the highest total.
2) Confidence-weighted voting: The node counts not the simple fact that a prediction was made, but instead uses the confidence of that prediction. So if a model predicts value A with a confidence of .80, then the node counts .80 as the “vote.” These weights are summed, and the value with the highest total is selected.
3) Highest confidence wins: In this method, the best model, as measured by its confidence, is used for each prediction.

If the target field is a flag, there are four other options available, all based on the propensity score (discussed in more detail in the next section). The Ensemble prediction can be based on propensity-weighted voting, or on average propensity. This can be done for either raw propensity or adjusted propensity (which is based on a validation or testing data partition, and so is only available in those situations). If the target field is numeric (continuous), the only available method is to average the model predictions.

We’ll use the default method of confidence-weighted voting. If you use one of the voting methods and there is a tie, the Ensemble node can break the tie in two ways: a random selection can be made, or the model with the highest confidence can be selected. This latter choice seems like a better one, so we’ll use that.
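A hedged Python sketch of these three combining rules for a categorical target (the per-model predictions and confidences below are illustrative, not from the lesson data):

    from collections import Counter, defaultdict

    # Each model's (category, confidence) prediction for one record (illustrative).
    predictions = [("Vol", 0.80), ("Current", 0.55), ("Vol", 0.60), ("InVol", 0.90)]

    # 1) Voting: the most frequently predicted category wins.
    votes = Counter(cat for cat, _ in predictions)
    by_vote = votes.most_common(1)[0][0]                 # "Vol" (2 votes)

    # 2) Confidence-weighted voting: sum the confidences per category.
    weighted = defaultdict(float)
    for cat, conf in predictions:
        weighted[cat] += conf                            # Vol=1.40, InVol=0.90, ...
    by_weight = max(weighted, key=weighted.get)          # "Vol"

    # 3) Highest confidence wins: take the single most confident prediction.
    by_confidence = max(predictions, key=lambda p: p[1])[0]   # "InVol"

Note that the rules can disagree, as the first two and the third do for this record.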
Click Highest confidence option button Click OK
There is no Run option for an Ensemble node because the node creates new fields but is not a Model or Output node. We can first view the results of combining the four models with a Table node. Add a Table node to the stream near the Ensemble node Attach the Ensemble node to the Table node Run the Table node
Figure 14.7 Table with Output Fields from Ensemble Node
The Ensemble node created two new fields, the prediction ($XS-CHURNED) and its confidence ($XSC-CHURNED). In this instance, the confidence is the sum of the confidence for the winning models divided by the total number of models. We can use another Analysis node to check the performance of the combined prediction.
Close the Table window Add an Analysis node to the stream near the Ensemble node Connect the Ensemble node to the Analysis node Edit the Analysis node Click Coincidence matrices (for symbolic targets) Click Run
Figure 14.8 Analysis Node Output for Ensemble Node Prediction
Overall “model” accuracy is 84.12%, which is better than any individual model by about 1.3%. When accuracy is important, this improvement would very likely be crucial. The model continues to predict the InVol category almost perfectly, and does very well for the other two categories (we could use a Matrix node to get exact percentages correct in each category). This stream, with the Ensemble node, can now be used to score new data.

Looking back at Figure 14.4, we recall that when all four models agreed on the prediction, the prediction was accurate for 91.91% of the customers. That is much better than 84.12%. So another approach when using the Ensemble node is to follow this methodology:
1) If all models agree on a prediction, use that prediction
2) When they don’t agree, use the prediction from the Ensemble node
This method can’t be used for continuous fields.
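A minimal sketch of that two-step rule (all names hypothetical; in the stream you would deselect the Ensemble node’s filter option so the individual model predictions remain available downstream):

    def final_prediction(model_preds, ensemble_pred):
        # model_preds: one category prediction per model for a record;
        # ensemble_pred: the Ensemble node's combined prediction.
        if len(set(model_preds)) == 1:
            return model_preds[0]   # unanimous: 91.91% accurate in this example
        return ensemble_pred        # otherwise fall back to the ensemble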
Is there any downside to combining models? The usual objection is that you cannot now explain why a specific prediction was made. If someone asks “What characteristics of this particular customer caused him/her to be predicted to be a voluntary churner?” that information will not be available, since the models are combined. Still, if you are using a single Neural Net or SVM model, the same holds true, so this is not a fatal objection. You can, as with any model, examine the predictions from the Ensemble node and see how they relate to the input fields, and that will be helpful. But full model understanding is sacrificed here for accuracy. Close the Analysis output browser.
In our next example, we will learn how to use a model to score records to rank them by the propensity of a model prediction.
14.3 Using Propensity Scores
Confidence values obtained from a model in PASW Modeler reflect the level of confidence that the model has in the prediction of a given output, and they are only available for categorical targets. Confidence values make no distinction between categories of a target field; for a flag with values of “yes” and “no,” confidence values can vary from 0 to 1 for predictions in either category. Consequently, a high confidence does not by itself tell us whether a customer will continue or cancel their service (it indicates only how sure the model is of its prediction, whatever that prediction is). Sometimes it would be helpful to have a score such that, for a specific category of interest, such as customers who churned, a high score means a prediction of churn and a low score indicates the customer is current. This type of score can be used in choosing cases for future actions: intervention, marketing efforts, and so forth.
To create such a score, most PASW Modeler models can calculate a propensity score for a flag target (propensity scores are not available for nominal, ordinal or continuous fields). A propensity score is based on the probability of a prediction. The raw propensity score is based only on the training data (if using a Partition node) or otherwise on the whole file. When the model predicts the true value defined for the target field, the propensity equals P, the probability of the prediction. If the model predicts the false value, the propensity is calculated as (1 – P). The adjusted propensity score is only available if a Partition field is being used; it is calculated from model performance on the Testing data and attempts to compensate for overfitting on the Training data. We aren’t using a Partition node in this example, so we can only use raw propensity scores.
For many models in PASW Modeler, such as all decision trees, the confidence is equal to the probability of the model prediction (why would that be?). For other models, such as a neural net, the confidence and the probability are not equivalent, although they are usually close in value. This simple transformation of the probability lets you easily score a data file with the propensity (probability) of an outcome occurring.
For this example, we continue using the dataset churntrain.txt. A Derive node has been added to the beginning of the stream to create a modified version of the CHURNED field. It converts CHURNED into the field LOYAL, which measures whether or not a customer continued with the company.
LOYAL groups together both voluntary and involuntary leavers into one group, so comparisons can be made with customers who remain loyal. We begin by opening the corresponding stream.
Click File…Open Stream and move to the c:\Train\ModelerPredModel directory (if necessary)
Double-click on Propensity.str
A Derive node calculates the new field LOYAL. Then both a neural net and CHAID model were trained to predict the field LOYAL, using the ChurnTrain.txt data. Their generated models were then added to the stream connected to the Type node. Figure 14.9 Stream with Two Models and new Field LOYAL
Let’s look at the Derive node, which calculates the LOYAL field. Edit the Derive node LOYAL
The Derive node creates a flag field. The True value will be Leave, and this value will be assigned to a record when CHURNED is not equal to Current. Then LOYAL will be False for customers who are still current. The new field is defined in this manner because we are interested in finding customers who might churn.
Figure 14.10 Derive Node Combining Churning Customers in One Category
Close the Derive node dialog
Propensity scores are not calculated by default for a model but must be requested (unlike confidence values). For CHAID models, raw propensities can be calculated either at the time of model creation or later from the model nugget. For Neural Nets, propensities must be calculated when the model is created, not afterwards; we did this in the Model Options tab of the Neural Net node before generating the model in the stream. To request propensity scores for a CHAID model after the model is created:
Edit the CHAID generated model
Click the Settings tab
There is a check box to request raw propensity scores. Because we aren’t using a Partition field, the option for adjusted propensity scores is grayed out.
Figure 14.11 Settings Tab in CHAID Model
Click Calculate raw propensity scores
Click OK
To illustrate the distribution of propensity scores and how they differ from confidence values, we’ll look at both fields for the CHAID model.
Add a Table node to the stream near the CHAID model
Connect the CHAID model to the Table node
Run the Table node
Figure 14.12 Table with CHAID Model Confidence and Propensity
Recall that when the model predicts that a customer will Leave, the propensity is equal to the probability. For the first record, with a prediction of Leave, the confidence ($RC-LOYAL) is equal to the propensity ($RRP-LOYAL), where “RP” stands for raw propensity; this confirms that for a CHAID model the probability equals the confidence. For record 4, where the prediction is Current, the propensity is 1 – confidence. To make all this clear, we will view histograms of confidence and propensity with an overlay of the model prediction.
Close the Table window
Add a Histogram node to the Stream canvas, and connect the CHAID model to it
Edit the Histogram node
Select $RC-LOYAL as the Field to display and $R-LOYAL as the Color Overlay Field (not shown)
Run the Histogram node
Figure 14.13 Histogram of Confidence Value by Predicted Loyalty
We can see that the confidence values range from .50 to 1.0, but a high confidence doesn’t by itself indicate whether we expect a customer to leave or stay, since there are customers in both categories at high confidence values (we would find the same pattern if we used the values of LOYAL, the actual status of customers). In fact, the highest confidence is associated with customers who are current. Now we can create the histogram with the propensity scores.
Close the Histogram window
Edit the Histogram node
Select $RRP-LOYAL as the Field to display
Run the Histogram node
The distribution of the propensity is bimodal. Those customers predicted to leave have scores ranging from .50 to 1, and those predicted to remain have scores below .50. Propensity scores have a similar distribution for the neural net model.
Figure 14.14 Histogram of Propensity Scores Overlaid by Predicted Loyalty
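The conversion behind these plots is simple. Here is a minimal sketch, assuming a flag target whose True value is Leave and a model whose confidence equals the probability of its prediction, as with CHAID (illustrative code, not Modeler’s):

```python
# Raw propensity for a flag target: equal to the probability P when the
# True value ("Leave") is predicted, and 1 - P otherwise, so that a high
# score always means "likely to leave."
def raw_propensity(prediction, confidence, true_value="Leave"):
    return confidence if prediction == true_value else 1.0 - confidence

print(raw_propensity("Leave", 0.78))    # 0.78 -> high score, predicted to leave
print(raw_propensity("Current", 0.78))  # 0.22 -> low score, predicted to stay
```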
The propensity score can now be used to score a database, as is commonly done in many data-mining applications, so that customers can, for example, be selected for a marketing campaign based on their propensity to leave: sort the file by propensity and choose the customers with the highest propensities first (a short sketch of this step appears at the end of this section).
In addition to scoring new records, there is another use for propensity scores. A score field can be used in a new model to improve the prediction of LOYAL. The score fields do not perfectly predict the value of LOYAL (remember that we have been using the predicted value of LOYAL, not the actual values, in our histograms; try running the histograms with LOYAL itself to see the difference), but they clearly have a high degree of potential predictive power. That power derives purely from the way CHAID or the neural network has differentiated between customers who will leave or stay, but if the model has a high degree of accuracy (which it does in this case), the propensity score may act as a very good predictor for another modeling technique. If a more complex model were built that takes output from one model as inputs to another (often called a meta-model), the score values from the CHAID model could be used as an input to a neural network. We shall look at this form of meta-modeling in the next section.
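As promised, a sketch of the propensity-based selection step. The field name follows the stream; the data and the cutoff are invented for illustration. Requires pandas.

```python
import pandas as pd

# Hypothetical scored records: a customer ID plus the raw propensity field.
scored = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "$RRP-LOYAL": [0.22, 0.91, 0.64, 0.78],
})

ranked = scored.sort_values("$RRP-LOYAL", ascending=False)  # highest propensity first
campaign = ranked.head(2)   # e.g., take the top N the campaign budget allows
print(campaign)             # customers 102 and 104
```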
14.4 Meta-Level Modeling The idea of meta-level modeling is to build a model based upon the predictions, or results, of another model. In the previous section, we used a stream which contained both a trained neural network and a CHAID rule induction model. We then created propensity scores for each to use separately. We can use the propensity score, though, from the CHAID model as one of the inputs to a modified neural network model. We know that the CHAID algorithm can predict loyalty with higher accuracy; thus it is hoped that by inputting the propensity scores into a neural network analysis, the neural
network may be able to correctly predict some of the remaining cases that the CHAID model incorrectly classified.
Click File…Close Stream and click No if asked to save changes
Click File…Open Stream
Double-click on Metamodel.str
The figure below shows the completed stream loaded into PASW Modeler. It is fairly complex but not difficult. Figure 14.15 Meta-Model Stream
A Type node has been inserted after the two generated model nodes. If we are to build a model based upon results obtained from previous models, each of the newly created fields will need to be instantiated and have its role set. We will be using both the new propensity score field and the predicted value from the CHAID model. If we were to run the two Analysis nodes attached to the two generated models, we would find that the neural net and CHAID models have accuracies of 80.42% and 83.75%, respectively, when predicting the field LOYAL.
When doing this type of meta-modeling, you must decide which fields should be inputs to the new model. You can use all the original fields, or reduce their number, since the CHAID propensity score and predicted category will contain much of the predictive power of the original fields. If the number of inputs isn’t large, including them along with the two new fields will not appreciably slow training of the new neural network, and that is the approach we take here. You may, however, wish to drop at least some of the fields that had little influence on the model, since including all fields can lead to over-fitting. We’ll begin by examining the downstream Type node.
Run the Table node attached to the Type node downstream of the generated models
Close the Table window
Edit the Type node attached to the CHAID generated model
Figure 14.16 Type Node Settings
In this example, we will use all the original input fields as predictors, plus the predicted value of LOYAL from the CHAID model ($R-LOYAL) and the propensity score ($RRP-LOYAL). The target field remains LOYAL. A Neural Net node has been attached to the Type node (and renamed MetaModel_LOYAL). We’ve set the random seed to 1000 so that everyone will obtain the same solution, and we use the Quick training method. Let’s run the model.
Close the Type node
Run the neural network MetaModel_LOYAL
Edit the generated model
Click on the Predictor Importance graph
Figure 14.17 Predictor Importance from Meta-Model for LOYAL
We can see that, not surprisingly, the field $RRP-LOYAL is by far the dominant input within the model. The actual predicted value from CHAID, $R-LOYAL, is only the fifth most important input. We can check the accuracy of the meta-model with an Analysis node.
Close the Neural Net model browser
Add an Analysis node to the stream and connect the generated meta-model to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run
Figure 14.18 Analysis Node Output for Meta-Model for LOYAL
In the portion of the output comparing $N1-LOYAL with LOYAL, we see that the overall accuracy of the meta-model is 84.93%. This is much better than the original neural net model (and even about 1% better than the CHAID model). This is a realistic example of how using the results of one model to improve another can work in practice. It is sometimes said that doing so is misguided, and it is true that classical statistics does not combine models in this manner, but in data mining it is an accepted methodology. You must, however, always validate the final meta-model on a testing or validation sample. As an exercise, you can use the file ChurnValidate.txt to validate this meta-model.
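Outside Modeler, the same idea can be sketched with scikit-learn: train a first-stage model, append its predicted probability as an extra input column, then train a second-stage model on the augmented inputs. All data and parameter values below are invented for illustration; only the random seed of 1000 echoes the example above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))              # stand-in predictor matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in flag target

# Stage 1: a tree model whose predicted probability plays the role of $RRP-LOYAL.
stage1 = DecisionTreeClassifier(min_samples_leaf=15).fit(X, y)
propensity = stage1.predict_proba(X)[:, 1]

# Stage 2: a neural net trained on the original inputs plus the score field.
X_meta = np.column_stack([X, propensity])
stage2 = MLPClassifier(max_iter=1000, random_state=1000).fit(X_meta, y)
print(stage2.score(X_meta, y))   # training accuracy only; validate on a holdout
```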
14.5 Error Modeling
Error modeling is another form of meta-modeling that can be used to build a better model than the original, and it is often recommended in texts on data mining. In essence, the method is straightforward: cases with errors of prediction are isolated and modeled separately. Almost invariably, some of these cases can then be accurately predicted with a different modeling technique.
There is a catch, however. In both the training and test data files we have a target field with which to check the accuracy of a model; in the churn data, we know whether a customer remained or left. But in real life, that is exactly what we are trying to predict. How can we create a model that uses the fact that an error of prediction has occurred when, at scoring time, we won’t know whether the model is in error until it is too late, i.e., until the event we are trying to predict has occurred? The answer, of course, is that we can’t, so we need a viable substitute strategy. The most common approach is to find groups of cases with similar characteristics for which we make a greater proportion of errors. We then create separate models for these cases, assuming that the same pattern will hold in the future. It is, as always, crucial to validate the models with a holdout sample when using this technique.
In this section we build an error model on the churn data to investigate where the initial neural network is under-performing, and then improve it by modeling the cases more prone to prediction errors with a C5.0 model. In this stream, the True value for LOYAL is reversed from our previous examples and is defined as a customer staying with their service; the False value is then a customer who will leave. Since we would like to model errors for predicting both categories, it matters less here which category is associated with True.
Close the Analysis output browser
Close the current stream (you don’t need to save it), and clear the Models Manager
Click File…Open Stream
Double-click on Errors.str
Switch to small icons (right-click the Stream canvas, click Icon Size…Small)
The figure below displays the error-model stream in the PASW Modeler Stream canvas. The upper stream includes the generated model from the neural network and attaches a Derive node to it. The Derive node compares the original target field (LOYAL) with the network’s prediction ($N-LOYAL), calculating a flag field (CORRECT) with a value of “True” if the prediction of the neural network is correct and “False” if it is not. You can open and review it if you wish.
The first goal of the error model is to use a rule induction technique to isolate where the neural network model is under-performing. This is done by using the C5.0 algorithm to predict the field CORRECT. We chose a C5.0 model because its transparent output provides the best understanding of where the neural network is under-performing. To ensure that C5.0 returns a relatively simple model, the expert options have been set so that the minimum number of records per branch is 15. This value is a judgment call based on the number of records in the training data and the number of rules with which you wish to work (another approach would be to winnow attributes, an expert option).
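The core of this setup can be sketched as follows, with a generic decision tree standing in for C5.0 and a minimum leaf size echoing the minimum-records-per-branch setting of 15 (function and variable names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_error_model(X, y, first_stage_predictions):
    """Train a model that predicts WHERE the first-stage model errs."""
    correct = (first_stage_predictions == y)    # the CORRECT flag field
    error_model = DecisionTreeClassifier(min_samples_leaf=15)
    return error_model.fit(X, correct)          # trained on the original inputs only
```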
A Type node is used to set the new field CORRECT to role Target, and the original inputs to the neural network to Input. It would need to be fully instantiated before training the C5.0 model. Figure 14.19 Error Modeling from a Neural Network Model
In this example, the model has already been trained and added to the stream, labeled C5 Error Model. Let’s browse this model.
Edit the C5.0 generated model node labeled C5 Error Model
We generated a ruleset from the C5.0 model because it makes it easier to view the rules for the False values of CORRECT. Again, we are trying to predict the values of CORRECT, which means we are trying to predict whether the neural network was accurate or not. There are two rules for a False value.
Click the Show all levels button
Click the Show or Hide Instances and Confidence button (so instances and confidence values are visible)
The two rules both have reasonable confidence values, ranging from .59 to .68 (although you might prefer them to be a bit higher). Rule 1 tells us that for male customers who make less than about a tenth of a minute of long distance calls per month, we predict the value of CORRECT to be False, i.e., a wrong prediction. Rule 2 is more complicated. It would be better to have more than two rules predicting False, but this suffices for purposes of illustration.
Figure 14.20 Decision Tree Ruleset Flagging Where Errors Occur Within the Neural Network
The next step is to split the training data into two groups based on the ruleset, one for predictions of True and the other for False. We can do this by generating a Rule Tracing Supernode from the Rule browser window and applying a Reclassify or Derive node to truncate the values of the new field to just True and False. We will use the Reclassify node to modify the Rule field so that it has only two categories, which we rename Correct and Incorrect. Let’s check the distribution of this field.
Close the C5.0 Model browser window
Run the Distribution node named Split
Figure 14.21 Distribution of Split Field
The neural network accuracy was about 80.5%. The distribution of Split doesn’t match this because we limited the records per branch to no fewer than 15, and because the C5.0 model can’t perfectly predict when the neural network was accurate. There are clearly enough cases with a value of Correct (1032) to support a new model, but only 78 cases with a value of Incorrect, which is a bit low for accurate modeling. The best solution is to create a larger initial sample so that the cases predicted to be incorrect by the C5.0 model are represented by a larger number of cases. If that isn’t possible, you can use a Balance node to boost the number of cases in the Incorrect category (although this is not an ideal solution, either). Since this is an example of the general method, we won’t do either, and we’ll see how much we can improve the model with no special sampling.
Looking back at the stream, we next added a Type node to set the roles of FALSE_TRUE and Split to None so that they are not used in the modeling process; we wish to use only the original predictors. The stream then branches into two after the Type node. The upper branch uses a Select node to select only those records with predictions expected to be correct, while the lower branch selects those records with predictions expected to be incorrect.
We reemphasize that the split of the training data is not based on the target field. Instead, only demographic and customer status fields were used to create the field Split used for record selection. It is for this reason that this model can, if successful, be used in a production environment to make predictions on new data where the outcome is unknown.
After the data are split, the customers for whom we generally made correct predictions are modeled again with a neural network. We do so because these cases were modeled well with a neural network before, so the same should be true now; with the problematic cases removed, we expect the network to perform even better.
For the customer group for which predictions were generally wrong, we use a C5.0 model to try a new technique, since the neural network tended to mispredict for this group. We could certainly try another neural network, however, or any other modeling technique. After the models are created, they are added to the stream, and Analysis nodes are attached to assess how well each performed. Let’s see how well we did.
Close the Distribution plot window
Run both Analysis nodes in the lower stream
The neural network model for the group of customers whose original predictions were generally correct is accurate 85.56% of the time, a substantial improvement over the base figure of 80.51%. The C5.0 model is even more accurate, correctly predicting who will leave or stay for 86.84% of the cases that were originally difficult to predict. Clearly, using the errors in the original neural network to create new models has led to a substantial improvement with little additional effort. If you take this approach, you would, as usual, explore each model to see which fields are the better predictors and how this differs between the two models.
Figure 14.22 Model Accuracy for Two Groups
So far so good, but we’d still like to automate the solution so that the data flow in one stream rather than two, letting us make a combined prediction for LOYAL on new data. This is easy to do. To demonstrate, we open a stream with a modified version of the current one.
Close the current stream and don’t save it if asked
Click File…Open Stream
Double-click on Combined_predictions.str
Switch to small icons (right-click the Stream canvas, click Icon Size…Small)
We have combined the two generated models in sequence in this modified stream. You might think that we could simply combine the output from each model, since each was trained on a different group of cases and thus will make predictions only for those cases, but this isn’t the case. Although
each model was trained on only a portion of the data, each will make predictions for all the cases. (Why? To verify this, run the Table node.) Figure 14.23 Combined Predictions Stream
But the solution is simple. We know that the value of the field Split tells us which model’s output to use, and we do so in the Derive node named Prediction.
Edit the Derive node named Prediction
This node creates a new field called Prediction. When Split is equal to Correct, the value of Prediction is set to the output of the neural network. Otherwise, the value of Prediction is set to the output of the C5.0 model. Thus, we have a new field that contains the combined prediction from the best model for each group of customers.
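The Derive node’s logic amounts to a simple conditional. A sketch with illustrative names:

```python
# Pick each record's prediction from the model trained on that record's group.
def combined_prediction(split, neural_prediction, c5_prediction):
    if split == "Correct":       # group the original neural net handled well
        return neural_prediction
    return c5_prediction         # group re-modeled with C5.0
```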
Figure 14.24 Derive Node to Create Prediction Field
We know that the baseline neural network had an accuracy of 80.51% and made 216 errors. We will do much better with these two models. To see how much better, we can run the Matrix node that crosstabulates Prediction and LOYAL.
Close the Derive node
Run the Matrix node named Prediction x LOYAL
The combined models make only 159 errors, quite an improvement. This translates to an accuracy of 85.64%, an increase of about 5.1 percentage points over the original neural network model.
Figure 14.25 Comparison of Prediction and LOYAL
The process of modeling errors need not stop here. Although there will clearly be diminishing returns as the number of errors decreases, it is certainly possible to attempt to separately model the remaining errors from the combined model. At the very least, you would still want to investigate those customers whose behavior remains difficult to model. Eventually you would validate the models with the ChurnValidate.txt dataset. We won’t do that here because the branch with the C5.0 model predicting errors in the original neural network has only 33 records, not enough for a reasonable validation; the validation dataset should be of sufficient size, just as with the training file.
We should also note that this same technique can be used for continuous targets, either integers or real numbers. In that case, the errors are relative, not absolute, but numeric bounds can be specified to differentiate cases deemed to be in error from those with sufficiently accurate predictions. The former group of cases can then be handled in the same manner as above.
Summary Exercises In these exercises we will use the streams created in this lesson. 1. Use the stream Metamodel.str. Rerun the MetaModel_LOYAL neural network model, removing all the original inputs from the model and thus using only the modified confidence score and the predicted value from the CHAID model. How does this affect model performance? Add this generated model to the stream and validate it with the ChurnValidate.txt data file. Was the model validated, in your judgment? 2. Use the stream Errors.str. Instead of using a C5.0 model to predict cases with proportionally more errors, try another type of model (your choice). How well does this perform compared to the C5.0 model? How does it compare to the accuracy of the original neural network? What do you recommend we use for these cases that were predicted in error?
Appendix A: Decision List
Overview
• Introduce the Decision List model
• Compare rule induction by Decision List with the decision tree nodes
• Outline the main differences between a decision tree and a decision rule
• Understand how Decision List models a categorical target
• Review the Interactive Decision List modeling feature
• Use partitioned data to test a model (optional, already covered in a former lesson)
Data
In this appendix we use the data file churn.txt, which contains information on 1477 customers of a telecommunications firm who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. Unlike the models developed in Lesson 3, here we want to understand which factors influence the voluntary leaving of a customer, rather than trying to predict all three categories.
Introduction
PASW Modeler contains five different algorithms for performing rule induction: C5.0, CHAID, QUEST, C&R Tree (classification and regression trees) and Decision List. The first four are similar in that they all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the target. However, they differ in several ways that are important to the user (see Lesson 3). Decision List predicts a categorical target, but it does not construct a decision tree; instead, it repeatedly applies a decision-rules approach. To give you some sense of a Decision List model, we begin by browsing such a model and viewing its characteristics. After that we review a table that highlights some distinguishing features of the rule induction algorithms. Finally, we outline the difference between decision trees and decision rules and the various options for the Decision List algorithm in the context of predicting categorical fields.
A Decision List Model
Before diving into the details of the Decision List node, we review a decision list model.
Click File…Open Stream, and then move to the c:\Train\ModelerPredModel directory
Double-click DecisionList.str
Figure A.1 Decision List Stream
Right-click the Decision List node CHURNED[Vol]
Select Run
Once the Decision List generated model is in the Models palette, the model can be browsed.
Right-click the Decision List node named CHURNED[Vol] in the Models palette
Click Browse
The results are presented as a list of decision rules, hence Decision List. If you are familiar with the C5.0 model output you will see a distinct likeness to the Rule Set presentation of a C5.0 model. Figure A.2 Browsing the Decision List Model
The first row gives information about the training sample. The sample has 719 records (Cover (n)), of which 267 meet the target value Vol (Frequency). Consequently, the percentage of records meeting the target value is 37.13% (Probability).
A numbered row represents a model rule and consists of an id, a Segment, a target value or Score (Vol), and a number of measures (here: Cover (n), Frequency and Probability). A segment is described by one or more conditions, and each condition is based on a predictive field, e.g. SEX = F and INTERNATIONAL > 0 in the second segment. All predictions are for the Vol category, as this is what is defined in the Decision List modeling node. The accuracy of predicting this category is listed for each segment in the Probability column, and accuracy is reasonably high for most segments.
As a whole, our model has 5 segments and a Remainder. The maximum number of predictive fields in a segment is 2, and no segment is too small (see the Cover (n) measure); the smallest has 52 records. This is not chance: the maximum number of segments in the model, the maximum number of predictive fields in a segment, and the minimum number of records in a segment are all set in the Decision List node, as we will see later. We now review the Decision List model in some detail.
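The Probability measure can be verified directly: it is simply Frequency divided by Cover (n).

```python
cover_n, frequency = 719, 267
print(round(100 * frequency / cover_n, 2))   # 37.13 (%), as reported above
```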
The Target
A characteristic of Decision List is that it models a particular value of a categorical target. In the Decision List model at hand we have modeled the voluntary leaving of a customer as represented by target value CHURNED = Vol.
The Remainder Segment
The Remainder segment is another defining characteristic of the Decision List model. Unlike with decision trees, there will be a group of customers for which no prediction is made (the Remainder). The Decision List algorithm is particularly suitable for business situations where you are interested in a relatively small but extremely good (in terms of response) subset of the customer base. Think of customer selection for a marketing campaign with a limited budget: the marketer will only be interested in the top N customers she can afford to approach, and the rest (the Remainder) will be excluded from the campaign.
Overlapping Segments
In our model the 5 segments and the Remainder form a non-overlapping segmentation of the training sample, meaning that a customer (or a record) belongs to exactly one segment or to the Remainder. So the total of Cover (n) over all segments, including the Remainder, should match the Cover (n) of the training sample. This requirement affects the way a particular segment should be interpreted when reading the model. The Nth segment should be read as: the record satisfies segment N and not (segment N-1) and not (segment N-2) and … and not (segment 1).
Example
Given our model, a female customer with INTERNATIONAL > 0 and AGE from 43 to 58 satisfies both segment 1 and segment 2. However, she will be regarded as a member of segment 1: the rules are applied in the order in which the segments are listed, so this customer is assigned to segment 1.
A customer belongs to segment 2 if:
not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and International > 0
And a customer belongs to segment 3 if:
not (SEX = F and International > 0) [the segment 2 conditions]
and not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and 73 < AGE <= 89
This mechanism prevents multiple counting of customers in overlapping segments. Be aware that the order of the segments in the model affects which segment a customer belongs to, and therefore also the measures Cover (n), Frequency and Probability for each model segment. This is a consequence of the iterative method by which Decision List generates rules; in a later section we cover this rule induction mechanism in detail. For now it is sufficient to realize that Decision List constructs its lists of decision rules with a very different mechanism from the splitting used in the decision tree algorithms, which is why Decision List is a rule algorithm rather than a tree algorithm.
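The assignment logic is first match wins, which is why each segment implicitly excludes all of the segments before it. A sketch using the example segments above (the dictionary-based conditions are illustrative, not Modeler code):

```python
# Segments are tested in listed order; a record lands in the first one it satisfies.
def assign_segment(record):
    segments = [
        lambda r: r["SEX"] == "F" and 42 < r["AGE"] <= 58,      # segment 1
        lambda r: r["SEX"] == "F" and r["INTERNATIONAL"] > 0,   # segment 2
        lambda r: r["SEX"] == "F" and 73 < r["AGE"] <= 89,      # segment 3
    ]
    for number, matches in enumerate(segments, start=1):
        if matches(record):
            return number
    return "Remainder"

# A female customer aged 50 with international calls satisfies segments 1 and 2,
# but first match assigns her to segment 1.
print(assign_segment({"SEX": "F", "AGE": 50, "INTERNATIONAL": 2}))
```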
Comparison of Rule Induction Models
The table below lists some of the important differences between the rule induction algorithms available within PASW Modeler. The first four columns are repeated from Lesson 3 for ease of comparison.
Table A.1 Some Key Differences Between the Five Rule Induction Models

| Criterion | C5.0 | CHAID¹ | QUEST | C&R Tree | Decision List |
| Split Type for Categorical Predictors | Multiple | Multiple | Binary | Binary | Multiple |
| Continuous Target | No | Yes | No | Yes | No |
| Continuous Predictors | Yes | No² | Yes | Yes | No² |
| Criterion for Predictor Selection | Information measure | Chi-square (F test for continuous) | Statistical | Impurity (dispersion) measure | Statistical |
| Can Cases Missing Predictor Values be Used? | Yes, uses fractionalization | Yes, missing becomes a category | Yes, uses surrogates | Yes, uses surrogates | Yes, missing becomes a category |
| Priors | No | No | Yes | Yes | No |
| Pruning Criterion | Upper limit on predicted error | Stops rather than overfits | Cost-complexity pruning | Cost-complexity pruning | Stops rather than overfits |
| Build Models Interactively | No | Yes | Yes | Yes | Yes |
| Supports Boosting | Yes | Yes | Yes | Yes | Yes |
¹ SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target fields.
² Continuous predictors are binned into ordinal fields containing, by default, approximately equal-sized categories.
Unlike the decision tree algorithms, Decision List does not create subgroups by splitting but by either adding a new predictor or narrowing the domain of the existing predictor(s) in the group (the decision rule approach); in consequence, tree-splitting issues are not applicable here. Decision List can handle targets of measurement level flag, nominal, and ordinal. It is designed to model a specific category of a categorical target, so effectively it predicts a binary outcome (target or not target). The algorithm treats continuous predictors by binning them into ordinal fields with approximately equal numbers of records in each category. In generating rules, Decision List, like CHAID and QUEST, uses standard statistical methods, as explained below. The way missing values are handled is set with the Expert options: either missing values in a predictor are ignored when that predictor is used to form a subgroup or, as in CHAID, all missing values are used as an additional category in model building. The process of rule generation halts based on settings such as the maximum number of predictors in a rule, explicit group-size settings, and the statistical confidence required.
Rule Induction Using Decision List
The Decision List modeling node must appear in a stream containing fully instantiated fields (either in a Type node or the Types tab of a source node). Within the Type node or Types tab, the field to be predicted (or explained) must have role target, or it must be specified in the Fields tab of the modeling node. All fields to be used as predictors must have their role set to input (in the Types tab or Type node) or be specified in the Fields tab. Any field not to be used in modeling must have its role set to none; any field with role both will be ignored by Decision List. The Decision List node is labeled with the name of the target field and target category. Like most other models, a Decision List model may be browsed, and predictions can be made by passing new data through it in the Stream canvas.
The target field must have categorical values, and Decision List will model a particular value of the target field. That target value is set in the Decision List node; the other values of the target field will be regarded as a second category, appearing as the value $null$ in predictions. In this example we will attempt to predict which customers voluntarily cancel their mobile phone contract. Rather than rebuild the source and Type nodes, we use the existing stream opened previously. We’ll delete the Decision List node so we can review the default settings.
Close the Decision List Browser window
Delete the CHURNED[Vol] node and the generated CHURNED[Vol] model node
Place a Decision List node from the Modeling palette to the upper right of the Type node in the Stream canvas
Connect the Type node to the Decision List node (see Figure A.3)
The name of the Decision List node should immediately change to No Target Value.
Figure A.3 Decision List Modeling Node Added to Stream
The reason for the name “No Target Value” is that the target field CHURNED has three values, but Decision List predicts only one specific target value.
Double-click the Decision List node to edit it
Note the message stating that a target value must be specified.
Figure A.4 Decision List Dialog - Initial
The Model name option allows you to set the name for both the Decision List node and the resulting generated model node. The Use partitioned data option is checked so that the Decision List node will make use of the Partition field created by the Partition node earlier in the stream. By default the model is built automatically, as the Mode is set to Generate model; by selecting Launch interactive session it is possible to create the model interactively. The Target value has to be set explicitly to Vol.
Click the button to the right of the Target value
Click Vol, then click Insert
With Decision List you are able to generate rules better than the average or worse than the average depending on your goal (where the average is the overall probability of the target value). This is set
by the Search Direction value of Up or Down. An upward search looks for segments with a high frequency of the target value; a downward search creates segments with a low frequency.
A decision list model contains a number of segments; the maximum is set in Maximum number of segments. Each segment is described by one or more predictors, also known as attributes in the Decision List node. The maximum number of predictive fields in a segment is set in Maximum number of attributes; you may compare this setting with the Levels below root setting in CHAID and QUEST, which prescribes the maximum tree depth. The Maximum number of attributes setting thus implies a stopping criterion for the algorithm.
Like the stopping criteria of CHAID, Decision List also has settings related to segment size: As percentage of previous segment (%) and As absolute value (N). The percentage setting states that a segment can only be created if it contains at least a certain percentage of the records of its parent; compare this with a branch point in a tree algorithm. The absolute value setting is straightforward: a segment only qualifies for the model if it is not too small, thus serving the generality requirement of a predictive model. The larger of these two settings takes precedence (see the sketch below). Note that whereas in CHAID’s stopping criteria you must choose either a percentage or an absolute value approach, Decision List combines the two, applying the percentage requirement to the parent and the absolute value requirement to the child.
The model’s accuracy is controlled by Confidence interval for new conditions (%). This is a statistical setting, and the most commonly used value is 95, the default. Depending on the business case and how costly an erroneous prediction is, you may increase or decrease this confidence value.
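A sketch of how the two segment-size settings combine; the parameter values are invented for illustration:

```python
# The minimum size for a candidate segment is the larger of the
# percentage-of-parent requirement and the absolute minimum.
def min_segment_size(parent_size, pct_of_parent, absolute_min):
    return max(parent_size * pct_of_parent / 100.0, absolute_min)

print(min_segment_size(719, 5, 50))    # 50.0 -> the absolute minimum governs
print(min_segment_size(719, 10, 50))   # 71.9 -> the percentage rule governs
```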
Understanding the Rules and Determining Accuracy
The predictive accuracy of the rule induction model is not given directly within the Decision List node. Obtaining it with an Analysis node can be confusing, even misleading, because Decision List reports explicitly only on the particular target value that was modeled; the other value(s) are regarded as $null$. To avoid this we will use Matrix nodes and Evaluation charts to determine how good the model is. First we use a Table node to examine the predictions from the Decision List model.
Click Run to run the model
Place a Table node from the Output palette below the generated Decision List node
Connect the generated Decision List node to the Table node
Right-click the Table node, then click Run, and scroll to the right in the table
Figure A.5 Three New Fields Generated by the Decision List Node
Three new columns appear in the data table: $D-CHURNED, $DP-CHURNED and $DI-CHURNED. The first is the predicted target value for each record, the second the probability of that prediction, and the third the ID of the model segment the record belongs to; the sixth segment is the Remainder. Note that the predicted value is either Vol or $null$, demonstrating that the Decision List algorithm predicts a particular value of the target field to the exclusion of the others.
Click File…Close to close the Table output window
Comparing Predicted to Actual Values
We will use a matrix to see where the predictions were correct, and then we evaluate the model graphically with a gains chart.
Place two Select nodes from the Records palette, one to the upper right of the generated Decision List node and one to the lower right
Connect the generated Decision List node to each Select node
First we will edit the Select node on the upper right, which we will use to select the Training sample cases:
Double-click the Select node on the upper right to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the equal sign button
Click the Select from existing field values button and insert the value 1_Training (not shown)
Click OK
Click the Annotations tab
Select Custom and enter the value Training
Click OK
Figure A.6 Completed Selection for the Training Partition
Now do the same for the Select node on the lower right to select the Testing sample cases: insert the Partition value “2_Testing” and annotate the node as “Testing.” Then attach a separate Matrix node to each of the Select nodes. For each Select node:
Place a Matrix node from the Output palette near the Select node
Connect the Select node to the Matrix node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $D-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click the Output tab and give the Matrix node for the Training sample the custom name Training and the one for the Testing sample the name Testing (this makes it easier to keep track of which output we are looking at)
Click OK
For each actual CHURNED category, the Percentage of row choice will display the percentage of records predicted in each of the target categories.
Run each Matrix node
Figure A.7 Matrix Output for the Training and Testing Samples
Looking at the Training sample results, the model predicts about 82.0% of the Vol (Voluntary Leavers) category correctly. The results for the Testing sample compare favorably (80.5% accurate), which suggests that the model will perform well with new data. Note that technically no prediction for the other two categories is correct, since the model doesn’t predict Current or InVol but just $null$. But we can combine these results by hand to obtain the accuracy for the not-Vol records: (313 + 48) / ((313 + 68) + (48 + 23)) * 100 = 79.9%. We could have made this calculation easier by creating a two-valued target field based on CHURNED, producing a 2 by 2 matrix; Decision List would create the same rules for such a field.
To produce a gains chart for the Voluntary group:
Close both Matrix windows
Place an Evaluation chart node from the Graphs palette to the right of the generated Decision List node named CHURNED[Vol]
Connect the generated Decision List node to the Evaluation chart node
Double-click the Evaluation chart node, and click the Include best line checkbox
By default, an Evaluation chart will use the first target category to define a hit. To change the target category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node.
Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click the Insert button
Figure A.8 Specifying the Hit Condition within the Expression Builder
Click OK
Figure A.9 Defining the Hit Condition for CHURNED
In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.
Click Run
Figure A.10 Gains Chart for the Voluntary Leaving Group
The gains line ($D-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating that hits for the Voluntary Leaving category are concentrated in the percentiles predicted most likely to contain this type of customer, according to the model.
Hold the cursor over the model line in the Training partition at the 40th percentile
Approximately 77% of the hits are contained within the first 40 percentiles.
Figure A.11 Gains Chart for the Voluntary Leaving Group (Interaction Enabled)
The gains line in the chart using the Testing data is very similar, which suggests that this model can be reliably used to predict voluntary leavers with new data.
Close the Evaluation chart window
To save this stream for later work:
Click File…Save Stream As
Move to the c:\Train\ModelerPredModel directory (if necessary)
Type DecisionList Model in the File name: text box
Click Save
Understanding the Most Important Factors in Prediction
An advantage of rule induction models, as with decision trees, is that the rule form makes clear which fields have an impact on the predicted field. There is little need for alternative methods such as web plots and histograms to understand how the rule is working; you may still use the techniques described in previous lessons to help understand the model, but they often are not needed.
In the Decision List algorithm, the most important fields in the predictions can be thought of as those that define the best subgroups in the sample used for training the model at a certain stage in the process. Thus, in this example the most important fields when using the whole training sample are SEX and AGE. Because the sample used for training gradually shrinks during the stepwise rule discovery process, other predictive fields come to the surface as most important, which makes intuitive sense. So in step 2, when finding the best second segment and
using the whole training sample except the first segment, the most important fields turn out to be SEX and International. Similarly, when finding segment 3, using the whole training sample except for the first two segments, SEX and AGE are again the most important predictors. The process continues until the algorithm can no longer construct segments satisfying the requirements, or until the stopping criteria are reached.
Expert Options for Decision List
Now that we have introduced the basics of Decision List modeling, we will discuss the Expert options, which allow you to refine your model even further.
Double-click the Decision List node named CHURNED[Vol] to edit it
Expert mode options allow you to fine-tune the rule induction process.
Click the Expert tab
Click the Expert Mode option button
Figure A.12 Decision List Expert Options
Binning
Binning is a method of transforming a numeric field (with measurement level continuous) into a number of categories/intervals. The Number of bins input sets the maximum number of bins to be
constructed; whether this maximum is actually reached depends on other settings as well. There are two main binning methods, Equal Count and Equal Width. Equal Width transforms a numeric field into a number of fixed-width intervals; Equal Count is a more balanced method that creates intervals containing approximately equal numbers of records. The three settings below this control details of the modeling process, described below. If Allow missing values in conditions is checked, the Decision List algorithm will regard being empty or undefined as a category that can be used as a condition in a segment; this may result in a segment such as “SEX = F and AGE IS MISSING”.
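The two binning methods are easy to illustrate outside Modeler with pandas, where cut() produces equal-width intervals and qcut() produces approximately equal-count intervals (the ages are invented):

```python
import pandas as pd

age = pd.Series([18, 22, 25, 31, 40, 44, 52, 60, 75, 89])
print(pd.cut(age, bins=4))   # Equal Width: four intervals of equal length
print(pd.qcut(age, q=4))     # Equal Count: roughly the same number of records each
```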
The Decision List Algorithm
The Decision List algorithm constructs lists of rules based on the predictions of a tree. However, the tree is generated quite differently from the way it is done in the decision tree algorithms, so the word “tree” should be regarded as a way to visualize the solution area and the rule generation process of the Decision List algorithm.
Process Hierarchy
To understand the Decision List rule generation process, first note the hierarchy involved: a decision list contains segments, each segment contains one or more conditions, and each condition is based on one predictive field. This hierarchy is directly reflected in the rule generation process: a main cycle generates the list’s segments, and a sub-cycle constructs each segment’s conditions from the predictive fields. The main cycle is called the List cycle and the sub-cycle the Rule cycle. In constructing conditions at the lowest process level, the algorithm also has a Split cycle, in which binning is performed for continuous predictive fields.
Qualification
A key question is: what makes one list better than another, and what makes one segment better than another? For a list, accuracy is defined by:
List% = 100 * SUM(Frequency) / SUM(Cover(n)), the Remainder excluded
A segment is better than another segment if:
(1) the probability of the target value in the segment is higher, and
(2) there is no overlap between the confidence intervals of the two segments. Each interval is directly related to the Confidence interval for new conditions (%) setting in the Simple mode of the Decision List dialog and is defined as Probability ± Error, where Error is the statistical error in the prediction of the Probability.
List Generation – Simple
To simplify the argument we describe the process for the setting Model search width = 1, meaning that multiple lists are not created simultaneously to choose from at the end. So we assume one List cycle here.
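Before turning to rule generation, here is a minimal sketch of the two qualification measures defined above. The error terms are passed in directly, whereas Modeler derives them from the requested confidence level; all numbers are invented for illustration.

```python
def list_accuracy(segments):
    """segments: list of (cover_n, frequency) pairs, Remainder excluded."""
    return 100.0 * sum(f for _, f in segments) / sum(c for c, _ in segments)

def segment_is_better(p1, err1, p2, err2):
    # (1) a higher target-value probability and (2) non-overlapping
    # confidence intervals (Probability +/- Error)
    return p1 > p2 and (p1 - err1) > (p2 + err2)

print(list_accuracy([(96, 76), (52, 38)]))        # ~77.0 (%)
print(segment_is_better(0.79, 0.05, 0.62, 0.06))  # True: intervals do not overlap
```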
Rule Generation
Given the above, the rule generation process starts with the full Training sample and searches for segments. The solution area is generated as follows. On the first rule level, segments are constructed based on 1 predictive field. The best 5 (Rule search width) are selected as starting points for the second rule level, resulting in a set of segments each described by 2 predictive fields. Again the best 5 are selected for the third rule level, and so on up to the last rule level, here 5 (Maximum number of attributes); in principle, fifth-level segments are described by five predictive fields.
It is not always possible to refine a segment in the next step by adding a new predictive field; one reason is the group size set in As absolute value (N). The algorithm may therefore produce segments described by fewer than five predictive fields. On the other hand, a segment can also be refined not by adding a new predictive field but by reconsidering an existing one (e.g., “AGE between (20, 60)” at one level could be refined to “AGE between (25, 55)” at the next); this is controlled by Allow attribute re-use. This is why a segment at rule level N may have fewer than N predictive fields.
A segment that is not refined any further is called a final result, comparable to a terminal node in a decision tree. If Model search width = 1, out of all these final results the algorithm returns the best 5 (Maximum number of segments) based on the target value’s probability; our previous model did create all five. The decision rule process may not be able to use all the “freedom” allowed by the Rule search width (5) and Maximum number of attributes (5); the main reasons are typically the group size requirements and/or the statistical confidence requested.
List Generation – Boosting
Just like C5.0, Decision List has a “boosting” mechanism, reflected in the Model search width setting. In describing the algorithm we assumed a Model search width of 1; by setting a higher value (say 2) you direct the Decision List algorithm to consider 2 alternatives for each segment, so the algorithm delivers the best 2 segments after each Rule cycle.
In our model this means that we instructed the algorithm to build a list of 5 segments. The List cycle will have 5 iterations of the Rule cycle (= Maximum number of segments), and each Rule cycle will have 5 iterations (= Maximum number of attributes). For the first segment on the list, the Rule cycle returns the top 2 segments (= Model search width); thus 2 lists are created, each with 1 segment and a Remainder. On each of the 2 lists the Rule cycle is performed on the Remainder, resulting in 4 lists, each with 2 segments and a Remainder. Out of these 4 lists, the top 2 based on List% are selected to find a third segment, and so forth.
When working in Interactive mode, the Maximum number of alternatives setting is active; when the model is generated automatically, its value is set to 1. Be aware that the Model search width and the Rule search width have a direct impact on processing time.
Interactive Decision List
Decision lists can be generated automatically, allowing the algorithm to find the best model, as we did with the first example. As an alternative, you can use the Interactive List Builder to take control of the model building process. You can grow the model segment by segment, you can select specific predictors at an intermediate point, and you can ensure that the list of segments is not so complex that it becomes impractical for the business problem. To use the List Builder, we simply specify a decision list model as usual, with the one addition of selecting Launch interactive session in the Decision List node’s Model tab. We’ll use a Decision List interactive session to predict the voluntary leavers.
Close the Decision List modeling node
Click File…Open Stream
Double-click on DecisionList Interactive.str
On the Stream canvas, double-click the Decision List node named CHURNED[Vol]
Click the Model tab
Click the Launch interactive session option button
Figure A.13 Decision List Model Tab with Interactive session enabled
Note that we have modified some of the default settings, such as the maximum number of attributes, the maximum number of segments, and the absolute value of the minimum segment size; click the Expert tab to review those settings as well. When the model runs, a generated Decision List model is not added to the Models Manager area. Instead, the Decision List Viewer opens, as shown in Figure A.14.
Click Run to open the Decision List Viewer
Figure A.14 The Decision List Viewer
The easy-to-use, task-based Decision List Viewer graphical interface takes the complexity out of the model building process, freeing you from the low-level details of data mining techniques and letting you devote your full attention to the parts of the analysis that require user intervention: setting objectives, choosing target groups, analyzing the results, selecting the optimal model, and evaluating models.
The workspace consists of one pane and two pop-up tabs: the Working model pane, the Alternatives tab and the Snapshots tab. The Working model pane (Figure A.14) displays the current model, including the mining tasks and other actions that apply to the working model. The Alternatives tab and Snapshots tab are generated when you click Find Segments: the Alternatives tab lists all alternative mining results for the model or segment selected in the Working model pane, and the Snapshots tab displays current model snapshots (a snapshot is a representation of the model at a specific point in time). Note: the generated, read-only model displays only the Working model pane and cannot be modified.
In the Working model pane you can see two rows. The first gives information about the training sample: it has 719 records (Cover (n)), of which 267 meet the target value (Frequency), so the percentage of records meeting the target value is 37.13% (Probability). The second, called Remainder, is at this point the only segment in our model and contains the whole training sample; it is the starting point for building our Decision List model.
Right-click the Remainder segment
From the drop-down list select Find Segments
Figure A.15 Model Albums Dialog (Alternatives tab)
The pop-up window states that the mining task was performed on the Remainder segment and has completed. It shows the two alternatives that were generated by this data mining task; recall that for this task the Model search width was set to 2. The first alternative (Alternative 1) contains 7 segments, and the model represented by this list has an average probability of 59.36%. The second alternative has 8 segments, and the corresponding model has an average probability of 56.13%. Let's view each of the two alternatives.
Click on Alternative 1
The result will be displayed in the Alternative Preview pane.
Figure A.16 Preview of an Alternative
Click on Alternative 2 (not shown)
You will see that these two alternatives differ in their 7th segment: the second has a 7th segment based on AGE, but the first has no such segment. Another interesting point of comparison is the Remainder. The first alternative has a Remainder of 281 records and misses 7 voluntarily leaving customers, whereas the second alternative has a Remainder of 254 records and misses 6 of these customers. Assume that we prefer the first alternative but want to capture some more of the voluntary leavers in the model. First we must promote the first alternative list to our working model; from there we will continue the model-building process.
Right-click on Alternative 1
Select Load (or click the Load button at the bottom)
Click OK
The result will be displayed in the Working model pane.
Figure A.17 Loading an Alternative to the Working Model
We can now create a Gains chart for the working model.
Click the Gains tab
Figure A.18 Gains Chart of Working Model
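For intuition, a cumulative gains curve like the one in Figure A.18 can be computed directly from scored records. The Python sketch below is a generic illustration (the record layout is hypothetical), not Modeler's implementation:

    # Sort records by predicted probability, then plot the cumulative
    # share of actual target hits against the cumulative share of
    # records scored. Assumes at least one target hit in the data.

    def gains_curve(scored):
        """scored: list of (predicted_probability, actual_target) pairs,
        where actual_target is 1 for a voluntary leaver and 0 otherwise."""
        ranked = sorted(scored, key=lambda r: r[0], reverse=True)
        total_hits = sum(target for _, target in ranked)
        points, hits = [(0.0, 0.0)], 0
        for i, (_, target) in enumerate(ranked, start=1):
            hits += target
            points.append((i / len(ranked), hits / total_hits))
        return points   # (fraction of records, fraction of gains) pairs

Each point gives the fraction of voluntary leavers captured when scoring a given fraction of the highest-scoring records.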
The results look encouraging on both the training data and the testing data. The segments included in the model are represented by the solid line; the excluded portion (the Remainder) is represented by the dashed line. Let's put both the working model (Alternative 1) and Alternative 2 on display in the Gains chart, so that we can choose the better model.
Click the Viewer tab
Click the Take Snapshot button
Click the toolbar button to view the alternatives
Right-click on Alternative 2
Select Load (or click the Load button at the bottom)
Click OK (Alternative 2 is now the working model)
Click the Gains tab
Click Chart Options
Figure A.19 Chart Options
Select the checkbox for Snapshot 1 (Snapshot 1 is actually Alternative 1; we can click the toolbar button to view the snapshot)
Click OK
Figure A.20 Gains Chart of Alternatives
Although model performance is similar, Alternative 2 (the working model) performs a bit more poorly than Alternative 1 (Snapshot 1).
Click the Viewer tab
In the Working model pane, right-click on the segment SEX = F
Figure A.21 Options to Modify a Segment in the Model
Choices in the context menu allow you to modify the segments created by the data mining task. For example, you may decide to delete a segment or to exclude it from scoring. You can even edit a segment: for example, you could add an extra condition to the segment SEX = F, or you could modify the lower and upper boundary values of EST_INCOME in segment 6 (Edit Segment Rule).
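Conceptually, a segment rule is a conjunction of conditions, and editing a rule adds or changes one of them. The Python sketch below uses a hypothetical representation (the 40000 income bound is invented for illustration); it is not Modeler's internal rule format:

    # A segment rule as a list of (field, operator, value) conditions;
    # adding a condition narrows the segment, much as Edit Segment Rule
    # does. Purely illustrative; the 40000 bound is a made-up example.

    OPS = {"==": lambda a, b: a == b,
           ">=": lambda a, b: a >= b,
           "<=": lambda a, b: a <= b}

    def matches(record, conditions):
        return all(OPS[op](record[field], value)
                   for field, op, value in conditions)

    segment = [("SEX", "==", "F")]                      # segment 'SEX = F'
    narrowed = segment + [("EST_INCOME", ">=", 40000)]  # add a condition

    print(matches({"SEX": "F", "EST_INCOME": 52000}, narrowed))  # True
    print(matches({"SEX": "F", "EST_INCOME": 31000}, narrowed))  # False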
Model Assessment
We have used the Gains chart above to get an overall view of the model. You can also assess the model at the segment level by using the model measures. There are five types of measures available.
From the menu, click Tools…Organize Model Measures
Figure A.22 Organize Model Measures Dialog
When building a Decision List model, you have five types of measures at your disposal (Display): a Pie Chart and four numerical measures. Each measure has a Type, a Data Selection it operates on (here Training Data), and a setting that controls whether it is displayed in the model (Show). The Pie Chart displays the share of the Training sample that is covered by a segment; the other coverage measure, Cover (n), shows the number of Training-sample records in that segment. The Frequency measure displays the number of records in the segment with the target value, Probability is the ratio of Frequency to Cover (n), and Error returns the statistical error. You can add new measures to your model by clicking the Add new model measure button. We'll create a measure (call it %Test) showing the probability of each segment on the Testing partition. Furthermore, we will rename Probability to %Train.
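The numerical measures reduce to simple per-segment formulas evaluated on a given data selection, which is why %Test is just Probability computed on the Testing partition instead of the Training partition. The Python sketch below illustrates this; note that the Error formula shown (the standard error of the segment's proportion) is our assumption, since the guide does not spell it out:

    import math

    # Sketch of the four numerical measures for one segment on one data
    # selection. in_segment and is_target are caller-supplied predicates.
    # The Error formula is an assumption, not documented above.

    def segment_measures(records, in_segment, is_target):
        seg = [r for r in records if in_segment(r)]
        cover_n = len(seg)
        frequency = sum(1 for r in seg if is_target(r))
        probability = frequency / cover_n if cover_n else 0.0
        error = (math.sqrt(probability * (1 - probability) / cover_n)
                 if cover_n else 0.0)
        return {"Cover (n)": cover_n, "Frequency": frequency,
                "Probability": probability, "Error": error}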
Click the Add new model measure button
This will create a new row named Measure 6.
Double-click in the Name cell for Measure 6 and change the name to %Test
Click the dropdown list for Type and change it to Probability
Figure A.23 Creating a New Measure
Click the dropdown list for Data Selection and change it to Testing Data
Click the Show checkbox for %Test
Double-click in the Name cell for Probability, change its name to %Train, then press Enter
Figure A.24 Completed %Test Measure and Renamed Probability Measure
Click OK
Figure A.25 New Measures Added to the Working Model
Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value calculations and profit formulas directly within the model-building process to simulate cost/benefit scenarios. The link with Excel allows you to export data to Excel, where it can be used to create presentation charts and to calculate custom measures, such as complex profit and ROI measures, which you can then view in Decision List Viewer while building the model. The following steps are valid only when MS Excel is installed; if Excel is not installed, the options for synchronizing models with Excel are not displayed. Suppose that we have created a template in Excel where, based on the Probability and the Coverage of a segment, we calculate the amount of loss we will suffer should the customers in a segment actually leave voluntarily.
Click Tools and select Organize Model Measures
Click Yes for Calculate custom measures in Excel (TM)
Click the Connect to Excel (TM)… button
Browse to C:\Train\ModelerPredModel\ and select Template_churn_loss.xlt
Click Open
Figure A.26 The Excel Workbook for the Churn Case
Switch to PASW Modeler using the ALT-Tab keys on your keyboard (if necessary)
Figure A.27 Excel Input Fields
The Choose Inputs for Custom Measures window reveals that Excel expects two fields as input: Probability and Cover. In return, four fields are available to add to your model:
Loss = Probability * Cover * Loss (the loss per churned customer, from the template) – Cover * Variable Cost
%Loss = 100 * Loss / Sum(Loss), the fraction of the total loss for which a segment accounts
Cumulative = cumulative Loss
%Cumulative = cumulative %Loss
By default all are selected. Clicking on an empty Model Measure cell in the dialog opens a dropdown list with all the measures available in your model.
Click in the Model Measure cell for Probability and select %Train
Click in the Model Measure cell for Cover and select Cover (n)
Figure A.28 Mapping Excel Input File to the Decision List Model Measure
Click OK
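To make the loss arithmetic concrete, the Python sketch below reproduces the four Excel measures. The cost figures loss_per_churner and variable_cost are hypothetical stand-ins for the constants defined in the Template_churn_loss.xlt workbook:

    # Sketch of the four custom measures; cost figures are assumptions.

    def loss_measures(segments, loss_per_churner=500.0, variable_cost=10.0):
        """segments: list of (probability, cover_n) pairs, in list order."""
        losses = [p * n * loss_per_churner - n * variable_cost
                  for p, n in segments]
        total = sum(losses)
        pct_loss = [100.0 * loss / total for loss in losses]
        cumulative, running = [], 0.0
        for loss in losses:
            running += loss
            cumulative.append(running)
        pct_cumulative = [100.0 * c / total for c in cumulative]
        return losses, pct_loss, cumulative, pct_cumulative

Called with the (probability, cover) pairs of the segments in list order, it returns the Loss, %Loss, Cumulative, and %Cumulative columns shown in Figure A.29.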
In the Organize Model Measures window you will see which measures are available as input to your model. By default all are selected.
Deselect the measure %Test (not shown)
Click OK
Figure A.29 The Decision List Model with External Measures
As you can see, segment 4 is responsible for more than 20% of the total expected loss (reflected by its %Loss measure), and the first four segments account for more than 50% (reflected by the %Cumulative measure for the fourth segment). So if the business objective were to select a set of customers for a retention campaign that reduces the expected loss by at least 50%, the list manager would probably choose the first 4 segments to be scored. If you wish to exclude a segment from the model, you can do so from the context menu.
Right-click on Segment 5
Figure A.30 Manually Excluding Segments from Scoring Based on External Measures
An interactive Decision List is not a model but a form of output, like a table or graph. When you are satisfied with the list you have built, you can generate a model to be used in the stream to make predictions.
Click Generate…Generate Model
Click OK in the resulting dialog box (not shown)
Close the Interactive Decision List Viewer window
A generated Decision List model appears in the upper left corner of the Stream Canvas. It can be edited, attached to other nodes, and used like any other generated model. The only difference is in how it was created.
Summary Exercises
The exercises in this appendix are written for the data file Newschan.sav.
1. Begin with a clear Stream canvas. Place a Statistics File source node on the canvas and connect it to Newschan.sav.
2. Try to predict with Decision List whether or not someone responds to a cable news service offer (NEWSCHAN). Start by using the default settings and use all available predictors. How many segments were created? Which fields were used? Does this model seem adequate?
3. Try different models by changing various settings, including the minimum segment size, allowing attribute reuse, the confidence interval (change it to 90%), or some of the expert settings. Can you find a better model?