The Analytics Edge (edX MIT 15.071x, June 2015)
Recommendations Worth a Million: An Introduction to Clustering (1)
VIDEO 1: INTRODUCTION TO NETFLIX

Netflix
Online DVD rental and streaming video service
More than 40 million subscribers worldwide
$3.6 billion in revenue
Key aspect is being able to offer customers accurate movie recommendations based on a customer's own preferences and viewing history
The Netflix Prize
From 2006 to 2009, Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies
A training data set of ~100,000,000 ratings and a test data set of ~3,000,000 ratings were provided
Offered a grand prize of $1,000,000 USD to the team that could beat Netflix's own algorithm, Cinematch, by more than 10%, measured in RMSE
Contest Rules
If the grand prize was not yet reached, progress prizes of $50,000 USD per year would be awarded for the best result so far, as long as it had >1% improvement over the previous year
Teams had to submit code and a description of the algorithm to be awarded any prizes
If any team met the 10% improvement goal, a last call would be issued and 30 days would remain for all teams to submit their best algorithm
Last Call Announced
On June 26, 2009, the team BellKor's Pragmatic Chaos submitted a 10.05% improvement over Cinematch
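Since the contest was scored in root mean squared error (RMSE), a minimal R sketch of the metric may help; the predicted/actual vectors below are made-up illustrations, not contest data:

#RMSE: square root of the average squared difference between predictions and actual ratings
predicted = c(3.8, 2.1, 4.6, 3.0)   #hypothetical predicted ratings
actual    = c(4, 2, 5, 3)           #hypothetical actual ratings
sqrt(mean((predicted - actual)^2))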
QUICK QUESTION
About how many years did it take for a team to submit a 10% improvement over Cinematch?
Ans: 2.5
EXPLANATION: The contest started in October 2006 and ended in July 2009, so it took about 2.5 years for a team to submit a 10% improvement solution.
VIDEO 2: RECOMMENDATION SYSTEMS
Predicting the Best User Ratings
Netflix was willing to pay over $1M for the best user rating algorithm, which shows how critical the recommendation system was to their business
What data could be used to predict user ratings?
Every movie in Netflix's database has the ranking from all users who have ranked that movie
We also know facts about the movie itself: actors, director, genre classifications, year released, etc.
Recommender system: Collaborative Filtering
Using Other Users’ Rankings
Consider suggesting to Carl that he watch "Men in Black", since Amy rated it highly and Carl and Amy seem to have similar preferences
This technique is called Collaborative Filtering
Recommender system: Content Filtering
Using Movie Information
Strengths and Weaknesses
Collaborative Filtering Systems
Can accurately suggest complex items without understanding the nature of the items
Requires a lot of data about the user to make accurate recommendations
Millions of items - need lots of computing power
Content Filtering
Requires very little data to get started
Can be limited in scope
Hybrid Recommendation Systems
Netflix uses both collaborative and content filtering
For example, consider a collaborative filtering approach where we determine that Amy and Carl have similar preferences. We could then do content filtering, where we would find that "Terminator", which both Amy and Carl liked, is classified in almost the same set of genres as "Starship Troopers"
Recommend "Starship Troopers" to both Amy and Carl, even though neither of them has seen it before
QUICK QUESTION
Let's consider a recommendation system on Amazon.com, an online retail site.
If Amazon.com constructs a recommendation system for books, and would like to use the same exact algorithm for shoes, what type would it have to be?
Ans: Collaborative Filtering
If Amazon.com would like to suggest books to users based on the previous books they have purchased, what type of recommendation system would it be?
Ans: Content Filtering
EXPLANATION: In the first case, the recommendation system would have to be collaborative filtering, since it can't use information about the items. In the second case, the recommendation system would be content filtering since other users are not involved.
VIDEO 3: MOVIE DATA AND CLUSTERING

MovieLens Item Dataset
Movies in the dataset are categorized as belonging to different genres
Each movie may belong to many genres
Can we systematically find groups of movies with similar sets of genres?
Why Clustering? "Unsupervised" learning
Goal is to segment the data into similar groups instead of prediction
Can also cluster data into "similar" groups and then build a predictive model for each group
Be careful not to overfit your model! This works best with large datasets
Types of Clustering Methods
There are many different algorithms for clustering; they differ in what makes a cluster and how to find them
We will cover Hierarchical clustering, and K-means in the next lecture
QUICK QUESTION
In the previous video, we discussed how clustering is used to split the data into similar groups. Which of the following tasks do you think are appropriate for clustering? Select all that apply.
Ans: Dividing search results on Google into categories based on the topic & Grouping players into different "types" of basketball players that make it to the NBA
EXPLANATION: The first two options are appropriate tasks for clustering. Clustering probably wouldn't help us predict the winner of the World Series.
VIDEO 4: COMPUTING DISTANCES

How does Clustering work?
Distance Between Points
Need to define distance between two data points
Most popular is "Euclidean distance": the distance between points i and j is d(i,j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ik - x_jk)^2 ), where k is the number of independent variables
Distance Example
The movie "Toy Story" is categorized as Animation, Comedy, and Children's
Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
The movie "Batman Forever" is categorized as Action, Adventure, Comedy, and Crime
Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
Distance Between Points
Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
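As a quick check of the Euclidean distance formula on these two vectors, a minimal R sketch (the object names toyStory and batmanForever are ours, not from the course script):

toyStory      = c(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
batmanForever = c(0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
#The two movies differ in 5 genres, so the Euclidean distance is sqrt(5)
sqrt(sum((toyStory - batmanForever)^2))   #2.236068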
Other popular distance metrics:
Manhattan Distance
Sum of absolute values instead of squares
Maximum Coordinate Distance
Only consider the measurement for which the data points deviate the most
Distance Between Clusters
Minimum Distance
Distance between clusters is the distance between points that are the closest
Maximum Distance
Distance between clusters is the distance between points that are the farthest
Centroid Distance
Distance between centroids of clusters
Centroid is the point that has the average of all data points in each component
Normalize Data
Distance is highly influenced by the scale of variables, so it is customary to normalize first
In our movie dataset, all genre variables are on the same scale, so normalization is not necessary
However, if we included a variable such as "Box Office Revenue," we would need to normalize.
QUICK QUESTION
The movie "The Godfather" is in the genres action, crime, and drama, and is defined by the vector: (0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)
The movie "Titanic" is in the genres action, drama, and romance, and is defined by the vector: (0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)
What is the distance between "The Godfather" and "Titanic", using euclidean distance?
Ans: 1.414214
EXPLANATION: The distance between these two movies is the square root of 2. They have a difference of 1 in two genres - crime and romance.
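The quick-question answer can be checked the same way; a small R sketch (the vector names are illustrative), also showing that dist() supports the other metrics mentioned above:

godfather = c(0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)
titanic   = c(0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)
#The movies differ only in Crime and Romance, so the Euclidean distance is sqrt(2)
sqrt(sum((godfather - titanic)^2))                       #1.414214
dist(rbind(godfather, titanic), method = "euclidean")    #1.414214
dist(rbind(godfather, titanic), method = "manhattan")    #2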
VIDEO 5: HIERARCHICAL CLUSTERING

Hierarchical
Start with each data point in its own cluster
Combine the two nearest clusters (Euclidean, Centroid) and repeat; slides step1 through step5 show the first merges.
Keep combining the two nearest clusters (Euclidean, Centroid) until all points have been merged; slides step6 through step8 show the final merges.
Display Cluster Process: Dendrogram
Select Clusters
Meaningful Clusters?
Look at statistics (mean, min, max, ...) for each cluster and each variable
See if the clusters have a feature in common that was not used in the clustering (like an outcome)
QUICK QUESTION
Suppose you are running the Hierarchical clustering algorithm with 212 observations.
How many clusters will there be at the start of the algorithm? Ans: 212
How many clusters will there be at the end of the algorithm? Ans: 1
EXPLANATION: The Hierarchical clustering algorithm always starts with each data point in its own cluster, and ends with all data points in the same cluster. So there will be 212 clusters at the beginning of the algorithm, and 1 cluster at the end of the algorithm.
VIDEO 6: GETTING THE DATA (R script reproduced here)

In this video, we'll be downloading our dataset from the MovieLens website. Please open the following link in a new window or tab of your browser to access the data: movieLens.txt (http://files.grouplens.org/datasets/movielens/ml-100k/u.item)

#Unit 6 - Introduction to Clustering
#Video 6
#After following the steps in the video, load the data into R
movies = read.table("movieLens.txt", header=FALSE, sep="|", quote="\"")
str(movies)
The_Analytics_Edge_edX_MIT15.071x_June2015_5 ## 'data.frame': 1682 obs. of 24 variables: ## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ... ## $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ... ## $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ... ## $ V4 : logi NA NA NA NA NA NA ... ## $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ... ## $ V6 : int 0 0 0 0 0 0 0 0 0 0 ... ## $ V7 : int 0 1 0 1 0 0 0 0 0 0 ... ## $ V8 : int 0 1 0 0 0 0 0 0 0 0 ... ## $ V9 : int 1 0 0 0 0 0 0 0 0 0 ... ## $ V10: int 1 0 0 0 0 0 0 1 0 0 ... ## $ V11: int 1 0 0 1 0 0 0 1 0 0 ... ## $ V12: int 0 0 0 0 1 0 0 0 0 0 ... ## $ V13: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V14: int 0 0 0 1 1 1 1 1 1 1 ... ## $ V15: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V16: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V17: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V18: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V19: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V20: int 0 0 0 0 0 0 0 0 0 0 ... ## $ V21: int 0 0 0 0 0 0 1 0 0 0 ... ## $ V22: int 0 1 1 0 1 0 0 0 0 0 ... ## $ V23: int 0 0 0 0 0 0 0 0 0 1 ... ## $ V24: int 0 0 0 0 0 0 0 0 0 0 ...
#Add column names
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
str(movies)
## 'data.frame': 1682 obs. of 24 variables: ## $ ID : int 1 2 3 4 5 6 7 8 9 10 ... ## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ... ## $ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ... ## $ VideoReleaseDate: logi NA NA NA NA NA NA ... ## $ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 143 1 565 505 543 310 1661 1453 103 357 1183 ... ## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Action : int 0 1 0 1 0 0 0 0 0 0 ... ## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ... ## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ... ## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ... ## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ... ## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ... ## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ... ## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ... ## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ... ## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ... ## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ... ## $ War : int 0 0 0 0 0 0 0 0 0 1 ... ## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
# Remove unnecessary variables
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL

#Remove duplicates
movies = unique(movies)

#Take a look at our data again:
str(movies)
## 'data.frame':    1664 obs. of  20 variables:
##  $ Title      : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ Unknown    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action     : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens  : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy     : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...
#QUICK QUESTION
#Using the table function in R, please answer the following questions about the dataset "movies".
#How many movies are classified as comedies?
table(movies$Comedy)
## 
##    0    1 
## 1162  502
#Ans:502

#How many movies are classified as westerns?
table(movies$Western)
## 
##    0    1 
## 1637   27
#Ans:27

#How many movies are classified as romance AND drama?
table(movies$Drama,movies$Romance)
##    
##       0   1
##   0 801 147
##   1 619  97
#Ans:97
VIDEO 7: HIERARCHICAL CLUSTERING IN R (R script reproduced here)

#Video 7
#There are 2 steps to Hierarchical Clustering
#1. Compute the distances between all data points
#2. Then we need to cluster the points

# Calculate distances between genre features:
distances = dist(movies[2:20], method = "euclidean")

# Hierarchical clustering
clusterMovies = hclust(distances, method = "ward.D") #the ward method cares about the distance between clusters using centroid distance and also the variance in each cluster

# Plot the dendrogram
plot(clusterMovies)
#From the dendrogram, drawing the horizontal line as per the requirement of our problem, we select 10 clusters (i.e. the horizontal line cuts across 10 vertical lines)

#Assign points to clusters (label each movie in a cluster, with 10 clusters)
clusterGroups = cutree(clusterMovies, k = 10)

#Now let's figure out what the clusters are like.
# Let's use the tapply function to compute the percentage of movies in each genre and cluster
#Calculate average "Action" genre value for each cluster
tapply(movies$Action, clusterGroups, mean)
##         1         2         3         4         5         6         7 
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000 
##         8         9        10 
## 0.0000000 0.0000000 0.0000000
#Calculate average "Romance" genre value for each cluster
tapply(movies$Romance, clusterGroups, mean)
##          1          2          3          4          5          6 
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000 
##          7          8          9         10 
## 1.00000000 0.00000000 0.00000000 0.00000000
#We can repeat this for each genre. If you do, you get the results in ClusterMeans.ods
#Find which cluster Men in Black is in.
subset(movies, Title=="Men in Black (1997)")
##                   Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997)       0      1         1         0         0
##     Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257      1     0           0     0       0        0      0       0       0
##     Romance SciFi Thriller War Western
## 257       0     1        0   0       0
mibGrp <- clusterGroups[257] # to get which cluster Men in Black (1997) goes into
mibGrp
## 257 
##   2
# Create a new data set with just the movies from cluster 2
cluster2 = subset(movies, clusterGroups==mibGrp)

#Look at the first 10 titles in this cluster (find other movies in the same cluster as "Men in Black"):
cluster2$Title[1:10]
##  [1] GoldenEye (1995)
##  [2] Bad Boys (1995)
##  [3] Apollo 13 (1995)
##  [4] Net, The (1995)
##  [5] Natural Born Killers (1994)
##  [6] Outbreak (1995)
##  [7] Stargate (1994)
##  [8] Fugitive, The (1993)
##  [9] Jurassic Park (1993)
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 'Til There Was You (1997) ... Zeus and Roxanne (1997)
#AN ADVANCED APPROACH TO FINDING CLUSTER CENTROIDS
#Other ways to calculate average genre value:
#In this video, we explain how you can find the cluster centroids by using the function "tapply" for each variable in the dataset. While this approach works and is familiar to us, it can be a little tedious when there are a lot of variables. An alternative approach is to use the colMeans function. With this approach, you only have one command for each cluster instead of one command for each variable.
#If you run the following command in your R console, you can get all of the column (variable) means for cluster 1:
colMeans(subset(movies[2:20], clusterGroups == 1))
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091
#You can repeat this for each cluster by changing the clusterGroups number. However, if you also have a lot of clusters, this approach is not that much more efficient than just using the tapply function.
#A more advanced approach uses the "split" and "lapply" functions. The following command will split the data into subsets based on the clusters:
spl = split(movies[2:20], clusterGroups)

#Then you can use spl to access the different clusters, because
head(spl[[1]])
##    Unknown Action Adventure Animation Childrens Comedy Crime Documentary
## 1        0      0         0         1         1      1     0           0
## 4        0      1         0         0         0      1     0           0
## 7        0      0         0         0         0      0     0           0
## 8        0      0         0         0         1      1     0           0
## 10       0      0         0         0         0      0     0           0
## 21       0      1         1         0         0      1     0           0
##    Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller
## 1      0       0        0      0       0       0       0     0        0
## 4      1       0        0      0       0       0       0     0        0
## 7      1       0        0      0       0       0       0     1        0
## 8      1       0        0      0       0       0       0     0        0
## 10     1       0        0      0       0       0       0     0        0
## 21     0       0        0      0       1       0       0     0        1
##    War Western
## 1    0       0
## 4    0       0
## 7    0       0
## 8    0       0
## 10   1       0
## 21   0       0
#is the same as
head(subset(movies[2:20], clusterGroups == 1))
##    Unknown Action Adventure Animation Childrens Comedy Crime Documentary
## 1        0      0         0         1         1      1     0           0
## 4        0      1         0         0         0      1     0           0
## 7        0      0         0         0         0      0     0           0
## 8        0      0         0         0         1      1     0           0
## 10       0      0         0         0         0      0     0           0
## 21       0      1         1         0         0      1     0           0
##    Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller
## 1      0       0        0      0       0       0       0     0        0
## 4      1       0        0      0       0       0       0     0        0
## 7      1       0        0      0       0       0       0     1        0
## 8      1       0        0      0       0       0       0     0        0
## 10     1       0        0      0       0       0       0     0        0
## 21     0       0        0      0       1       0       0     0        1
##    War Western
## 1    0       0
## 4    0       0
## 7    0       0
## 8    0       0
## 10   1       0
## 21   0       0
colMeans(spl[[1]])
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091
#So colMeans(spl[[1]]) will output the centroid of cluster 1. But an even easier approach uses the lapply function. The following command will output the cluster centroids for all clusters:
lapply(spl, colMeans)
## $`1`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091 
## 
## $`2`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.000000000 0.783919598 0.351758794 0.010050251 0.005025126 0.065326633 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.005025126 0.000000000 0.110552764 0.000000000 0.000000000 0.080402010 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.000000000 0.000000000 0.045226131 0.346733668 0.376884422 0.015075377 
##     Western 
## 0.000000000 
## 
## $`3`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.000000000 0.123853211 0.036697248 0.000000000 0.009174312 0.064220183 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.412844037 0.000000000 0.380733945 0.004587156 0.105504587 0.018348624 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.000000000 0.275229358 0.036697248 0.041284404 0.610091743 0.000000000 
##     Western 
## 0.000000000 
## 
## $`4`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           0 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`5`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           0           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`6`
##   Unknown    Action Adventure Animation Childrens    Comedy 
## 0.0000000 0.1015625 0.0000000 0.0000000 0.0000000 0.1093750 
##     Crime Documentary     Drama   Fantasy  FilmNoir    Horror 
## 0.0468750   0.0000000 0.6640625 0.0000000 0.0078125 0.0156250 
##   Musical   Mystery   Romance     SciFi  Thriller       War 
## 0.0000000 0.0000000 1.0000000 0.0000000 0.1406250 0.0000000 
##   Western 
## 0.0000000 
## 
## $`7`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           0           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           1           0           0           0 
##     Western 
##           0 
## 
## $`8`
##   Unknown    Action Adventure Animation Childrens    Comedy 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0212766 
##     Crime Documentary     Drama   Fantasy  FilmNoir    Horror 
## 0.0000000   1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
##   Musical   Mystery   Romance     SciFi  Thriller       War 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0212766 
##   Western 
## 0.0000000 
## 
## $`9`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`10`
##   Unknown    Action Adventure Animation Childrens    Comedy 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1587302 
##     Crime Documentary     Drama   Fantasy  FilmNoir    Horror 
## 0.0000000   0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 
##   Musical   Mystery   Romance     SciFi  Thriller       War 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.1587302 0.0000000 
##   Western 
## 0.0000000
#The lapply function runs the second argument (colMeans) on each element of the first argument (each cluster subset in spl). So instead of using 19 tapply commands, or 10 colMeans commands, we can output our centroids with just two commands: one to define spl, and then the lapply command.

#Even a better way
sapply(spl, colMeans)
##                       1           2           3 4 5         6 7         8
## Unknown     0.006734007 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## Action      0.178451178 0.783919598 0.123853211 0 0 0.1015625 0 0.0000000
## Adventure   0.185185185 0.351758794 0.036697248 0 0 0.0000000 0 0.0000000
## Animation   0.134680135 0.010050251 0.000000000 0 0 0.0000000 0 0.0000000
## Childrens   0.393939394 0.005025126 0.009174312 0 0 0.0000000 0 0.0000000
## Comedy      0.363636364 0.065326633 0.064220183 0 1 0.1093750 1 0.0212766
## Crime       0.033670034 0.005025126 0.412844037 0 0 0.0468750 0 0.0000000
## Documentary 0.010101010 0.000000000 0.000000000 0 0 0.0000000 0 1.0000000
## Drama       0.306397306 0.110552764 0.380733945 1 0 0.6640625 0 0.0000000
## Fantasy     0.070707071 0.000000000 0.004587156 0 0 0.0000000 0 0.0000000
## FilmNoir    0.000000000 0.000000000 0.105504587 0 0 0.0078125 0 0.0000000
## Horror      0.016835017 0.080402010 0.018348624 0 0 0.0156250 0 0.0000000
## Musical     0.188552189 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## Mystery     0.000000000 0.000000000 0.275229358 0 0 0.0000000 0 0.0000000
## Romance     0.104377104 0.045226131 0.036697248 0 0 1.0000000 1 0.0000000
## SciFi       0.074074074 0.346733668 0.041284404 0 0 0.0000000 0 0.0000000
## Thriller    0.040404040 0.376884422 0.610091743 0 0 0.1406250 0 0.0000000
## War         0.225589226 0.015075377 0.000000000 0 0 0.0000000 0 0.0212766
## Western     0.090909091 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
##             9        10
## Unknown     0 0.0000000
## Action      0 0.0000000
## Adventure   0 0.0000000
## Animation   0 0.0000000
## Childrens   0 0.0000000
## Comedy      1 0.1587302
## Crime       0 0.0000000
## Documentary 0 0.0000000
## Drama       1 0.0000000
## Fantasy     0 0.0000000
## FilmNoir    0 0.0000000
## Horror      0 1.0000000
## Musical     0 0.0000000
## Mystery     0 0.0000000
## Romance     0 0.0000000
## SciFi       0 0.0000000
## Thriller    0 0.1587302
## War         0 0.0000000
## Western     0 0.0000000
#QUICK QUESTION
#Run the cutree function again to create the cluster groups, but this time pick k = 2 clusters. It turns out that the algorithm groups all of the movies that only belong to one specific genre in one cluster (cluster 2), and puts all of the other movies in the other cluster (cluster 1). What is the genre that all of the movies in cluster 2 belong to?
clusterGroups <- cutree(clusterMovies, k = 2)
spl <- split(movies[2:20], clusterGroups)
sapply(spl, colMeans)
##                       1 2
## Unknown     0.001545595 0
## Action      0.192426584 0
## Adventure   0.102782071 0
## Animation   0.032457496 0
## Childrens   0.092735703 0
## Comedy      0.387944359 0
## Crime       0.082689335 0
## Documentary 0.038639876 0
## Drama       0.267387944 1
## Fantasy     0.017001546 0
## FilmNoir    0.018547141 0
## Horror      0.069551777 0
## Musical     0.043276662 0
## Mystery     0.046367852 0
## Romance     0.188562597 0
## SciFi       0.077279753 0
## Thriller    0.191653787 0
## War         0.054868624 0
## Western     0.020865533 0
#or
lapply(spl, colMeans)
## $`1`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.001545595 0.192426584 0.102782071 0.032457496 0.092735703 0.387944359 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.082689335 0.038639876 0.267387944 0.017001546 0.018547141 0.069551777 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.043276662 0.046367852 0.188562597 0.077279753 0.191653787 0.054868624 
##     Western 
## 0.020865533 
## 
## $`2`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           0 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0
#Ans:Drama
VIDEO 8: THE ANALYTICS EDGE OF RECOMMENDATION SYSTEMS

Beyond Movies: Mass Personalization
Cornerstone of these Top Businesses
Recommendation Method Used
Winners are Declared!
The Edge of Recommendation Systems
In today's digital age, businesses often have hundreds of thousands of items to offer their customers
Excellent recommendation systems can make or break these businesses
Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems
Predictive Diagnosis: Discovering Patterns for Disease Detection (2)

VIDEO 1: HEART ATTACKS

Heart Attacks
Heart attack is a common complication of coronary heart disease resulting from the interruption of blood supply to part of the heart
A 2012 report from the American Heart Association estimates about 715,000 Americans have a heart attack every year
Every 20 seconds, a person has a heart attack in the US
Nearly half occur without prior warning signs
250,000 Americans die of Sudden Cardiac Death yearly
Analytics Helps Monitoring
Understanding the clinical characteristics of patients in whom heart attack was missed is key
Need for an increased understanding of the patterns in a patient's diagnostic history that link to a heart attack
Predicting whether a patient is at risk of a heart attack helps monitoring and calls for action
Analytics helps understand patterns of heart attacks and provides good predictions
QUICK QUESTION
In this class, we've learned many different methods for predicting outcomes. Which of the following methods is designed to be used to predict an outcome like whether or not someone will experience a heart attack? Select all that apply.
Ans: Logistic Regression, CART, Random Forest
EXPLANATION: Logistic Regression, CART, and Random Forest are all designed to be used to predict whether or not someone has a heart attack, since this is a classification problem. Linear regression would be appropriate for a problem with a continuous outcome, such as the amount of time until someone has a heart attack. In this lecture, we'll use random forest, but the other methods could be used too.
VIDEO 2: THE DATA

Claims Data
Claims data offers an expansive view of a patient's health history
Demographics, medical history and medications
Offers insights regarding a patient's risk
May reveal indicative signals and patterns
We will use health insurance claims filed for about 7,000 members from January 2000 - November 2007
Concentrated on members with the following attributes:
At least 5 claims with coronary artery disease diagnosis
At least 5 claims with hypertension diagnostic codes
At least 100 total medical claims
At least 5 pharmacy claims
Data from at least 5 years
Yields patients with a high risk of heart attack and a reasonably rich history and continuous coverage
Data Aggregation
The resulting dataset includes about 20 million health insurance entries including individual medical and pharmaceutical records
Diagnosis, procedure and drug codes in the dataset comprise tens of thousands of attributes
Codes were aggregated into groups: 218 diagnosis groups, 180 procedure groups, 538 drug groups
46 diagnosis groups were considered by clinicians as possible risk factors for heart attacks
Diagnostic History
We then compress medical records to obtain a chronological representation of a patient's diagnostic profile
Cost and number of medical claims and hospital visits by diagnosis
Observations split into 21 periods, each 90 days in length
Examined 9 months of diagnostic history leading up to heart attack/no heart attack event
Align data to make observations date-independent while preserving the order of events (see the sketch after this section):
3 months ~ 0-3 months before heart attack
6 months ~ 3-6 months before heart attack
9 months ~ 6-9 months before heart attack
Target Variable
Target prediction is the first occurrence of a heart attack
Diagnosis on medical claim
Visit to emergency room followed by hospitalization
Binary Yes/No
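To make the 90-day alignment concrete, here is a minimal sketch in R, not from the original study; the data frame claims and its columns daysBeforeEvent and cost are hypothetical names used only for illustration:

#Hypothetical claims for one patient, described by how many days before the event they occurred
claims = data.frame(daysBeforeEvent = c(12, 75, 130, 200, 260),
                    cost            = c(100, 40, 250, 80, 60))
#Assign each claim to a 90-day period counted backwards from the event date
claims$period = ceiling(claims$daysBeforeEvent / 90)   #1 = 0-3 months, 2 = 3-6 months, 3 = 6-9 months
#Aggregate cost per period, mirroring the date-independent representation described above
tapply(claims$cost, claims$period, sum)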
Dataset Compilation
Cost Bucket Partitioning
Cost is a good summary of a person's overall health
Divide population into similar smaller groups: low risk, average risk, high risk
Build models for each group (see the sketch after the quick question below)
QUICK QUESTION
In the previous video, we discussed how we split the data into three groups, or buckets, according to cost.
Which bucket has the most data, in terms of number of patients? Ans: Cost Bucket 1
Which bucket probably has the densest data, in terms of number of claims per person? Ans: Cost Bucket 3
EXPLANATION: Cost Bucket 1 contains the most patients (see slide 7 of the previous video), and Cost Bucket 3 probably has the densest data, since these are the patients with the highest cost in terms of claims.
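A minimal sketch of the bucket-then-model idea, assuming an entirely hypothetical data frame patients with columns cost, age and heartAttack (the real study used claims data and random forests; logistic regression is used here only to keep the sketch self-contained in base R, and the cut points are illustrative):

set.seed(1)
patients = data.frame(cost = rexp(300, rate = 1/5000),
                      age  = sample(40:80, 300, replace = TRUE))
patients$heartAttack = factor(ifelse(runif(300) < 0.2, "Yes", "No"))
#Split the population into three cost buckets
patients$bucket = cut(patients$cost, breaks = c(0, 2000, 10000, Inf), labels = c("1", "2", "3"))
#Build one model per bucket
models = lapply(split(patients, patients$bucket),
                function(d) glm(heartAttack ~ age, data = d, family = binomial))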
VIDEO 3: PREDICTING HEART ATTACKS USING CLUSTERING

Predicting Heart Attacks (Random Forest)
Predicting whether a patient has a heart attack for each of the cost buckets using the random forest algorithm
Incorporating Clustering
Patients in each bucket may have different characteristics
Clustering Cost Buckets
Two clustering algorithms were used for the analysis as an alternative to hierarchical clustering:
Spectral Clustering
k-means clustering
k-Means Clustering
(Slides algo1 through algo7: step-by-step illustration of the k-means clustering algorithm.)
Practical Considerations
The number of clusters k can be selected from previous knowledge or by experimenting
Can strategically select the initial partition of points into clusters if you have some knowledge of the data
Can run the algorithm several times with different random starting points (see the sketch below)
In recitation, we will learn how to run the k-means clustering algorithm in R
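As a concrete illustration of the "several random starting points" idea, base R's kmeans() supports this directly via its nstart argument; a minimal sketch on made-up data (object names are ours):

set.seed(1)
x = matrix(rnorm(200), ncol = 2)   #hypothetical data: 100 points, 2 variables
#Run k-means with k = 3, trying 20 different random starts and keeping the
#solution with the lowest total within-cluster sum of squares
km = kmeans(x, centers = 3, nstart = 20)
km$size   #number of points assigned to each cluster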
Random Forest with Clustering
QUICK QUESTION
K-means clustering differs from Hierarchical clustering in a couple of important ways. Which of the following statements is true?
Ans: In k-means clustering, you have to pick the number of clusters you want before you run the algorithm
EXPLANATION: In k-means clustering, you have to pick the number of clusters before you run the algorithm, but the computational effort needed is much less than that for hierarchical clustering (we'll see this in more detail during the recitation).
VIDEO 4: UNDERSTANDING CLUSTER PATTERNS
Understanding Cluster Patterns
Clusters are interpretable and reveal unique patterns of diagnostic history among the population
QUICK QUESTION
As we saw in the previous video, the clusters can be used to find interesting patterns of health in addition to being used to improve predictive models. By changing the number of clusters, you can find more general or more specific patterns. If you wanted to find more unusual patterns shared by a small number of people, would you increase or decrease the number of clusters?
Ans: Increase
EXPLANATION: If you wanted to find more unusual patterns, you would increase the number of clusters since the clusters would become smaller and more patterns would probably emerge.
VIDEO 5: THE ANALYTICS EDGE

Impact of Clustering
Clustering members within each cost bucket yielded better predictions of heart attacks within clusters
Grouping patients in clusters exhibits temporal diagnostic patterns within 9 months of a heart attack
These patterns can be incorporated in the diagnostic rules for heart attacks
Great research interest in using analytics for early heart failure detection through pattern recognition
Seeing the Big Picture: Segmenting Images to Create Data (Recitation) (3)

VIDEO 1: IMAGE SEGMENTATION

Segmenting Images to Create Data
Image Segmentation
Divide up digital images into salient regions/clusters corresponding to individual surfaces, objects, or natural parts of objects
Clusters should be uniform and homogeneous with respect to certain characteristics (color, intensity, texture)
Goal: useful and analyzable image representation
Wide Applications
Medical Imaging
Locate tissue classes, organs, pathologies and tumors
Measure tissue/tumor volume
Object Detection
Detect facial features in photos
Detect pedestrians in footage from surveillance videos
Recognition tasks
Fingerprint/iris recognition
Various Methods
Clustering methods
Partition the image into clusters based on differences in pixel colors, intensity or texture
Edge detection
Based on the detection of discontinuity, such as an abrupt change in the gray level in gray-scale images
Region-growing methods
Divide the image into regions, then sequentially merge sufficiently similar regions
In this Recitation:
Review hierarchical and k-means clustering in R
Restrict ourselves to gray-scale images
Simple example of a flower image (flower.csv)
Medical imaging application with examples of transverse MRI images of the brain (healthy.csv and tumor.csv)
Compare the use, pros and cons of all analytics methods we have seen so far
VIDEO 2: CLUSTERING PIXELS (R script reproduced here)

Grayscale Images
Image is represented as a matrix of pixel intensity values ranging from 0 (black) to 1 (white)
For 8 bits/pixel (bpp), 256 color levels
Grayscale Image Segmentation
Cluster pixels according to their intensity values
#Unit 6 - Recitation
#Video 2
flower = read.csv("flower.csv", header=FALSE) #since we do not have headers in the data and we don't want R to convert the first row to a header by default, we use the header=FALSE argument
str(flower)
The_Analytics_Edge_edX_MIT15.071x_June2015_5 ## 'data.frame': 50 obs. of 50 variables: ## $ V1 : num 0.0991 0.0991 0.1034 0.1034 0.1034 ... ## $ V2 : num 0.112 0.108 0.112 0.116 0.108 ... ## $ V3 : num 0.134 0.116 0.121 0.116 0.112 ... ## $ V4 : num 0.138 0.138 0.121 0.121 0.112 ... ## $ V5 : num 0.138 0.134 0.125 0.116 0.112 ... ## $ V6 : num 0.138 0.129 0.121 0.108 0.112 ... ## $ V7 : num 0.129 0.116 0.103 0.108 0.112 ... ## $ V8 : num 0.116 0.103 0.103 0.103 0.116 ... ## $ V9 : num 0.1121 0.0991 0.1078 0.1121 0.1164 ... ## $ V10: num 0.121 0.108 0.112 0.116 0.125 ... ## $ V11: num 0.134 0.125 0.129 0.134 0.129 ... ## $ V12: num 0.147 0.134 0.138 0.129 0.138 ... ## $ V13: num 0.000862 0.146552 0.142241 0.142241 0.133621 ... ## $ V14: num 0.000862 0.000862 0.142241 0.133621 0.12931 ... ## $ V15: num 0.142 0.142 0.134 0.121 0.116 ... ## $ V16: num 0.125 0.125 0.116 0.108 0.108 ... ## $ V17: num 0.1121 0.1164 0.1078 0.0991 0.0991 ... ## $ V18: num 0.108 0.112 0.108 0.108 0.108 ... ## $ V19: num 0.121 0.129 0.125 0.116 0.116 ... ## $ V20: num 0.138 0.129 0.125 0.116 0.116 ... ## $ V21: num 0.138 0.134 0.121 0.125 0.125 ... ## $ V22: num 0.134 0.129 0.125 0.121 0.103 ... ## $ V23: num 0.125 0.1207 0.1164 0.1164 0.0819 ... ## $ V24: num 0.1034 0.1034 0.0991 0.0991 0.1034 ... ## $ V25: num 0.0948 0.0905 0.0905 0.1034 0.125 ... ## $ V26: num 0.0862 0.0862 0.0991 0.125 0.1422 ... ## $ V27: num 0.086207 0.086207 0.103448 0.12931 0.000862 ... ## $ V28: num 0.0991 0.1078 0.1164 0.1293 0.1466 ... ## $ V29: num 0.116 0.134 0.134 0.121 0.142 ... ## $ V30: num 0.121 0.138 0.142 0.129 0.138 ... ## $ V31: num 0.121 0.134 0.142 0.134 0.129 ... ## $ V32: num 0.116 0.134 0.129 0.116 0.112 ... ## $ V33: num 0.108 0.112 0.116 0.108 0.108 ... ## $ V34: num 0.1078 0.1078 0.1034 0.0991 0.1034 ... ## $ V35: num 0.1078 0.1034 0.0991 0.0991 0.0991 ... ## $ V36: num 0.1078 0.1034 0.1034 0.0905 0.0862 ... ## $ V37: num 0.1078 0.1078 0.1034 0.0819 0.0733 ... ## $ V38: num 0.0948 0.0991 0.0776 0.069 0.0733 ... ## $ V39: num 0.0733 0.056 0.0474 0.0474 0.056 ... ## $ V40: num 0.0474 0.0388 0.0431 0.0474 0.0603 ... ## $ V41: num 0.0345 0.0345 0.0388 0.0474 0.0647 ... ## $ V42: num 0.0259 0.0259 0.0345 0.0431 0.056 ... ## $ V43: num 0.0259 0.0259 0.0388 0.0517 0.0603 ... ## $ V44: num 0.0302 0.0302 0.0345 0.0517 0.0603 ... ## $ V45: num 0.0259 0.0259 0.0259 0.0388 0.0474 ... ## $ V46: num 0.0259 0.0172 0.0172 0.0259 0.0345 ... ## $ V47: num 0.01724 0.01724 0.00862 0.02155 0.02586 ... ## $ V48: num 0.0216 0.0129 0.0129 0.0172 0.0302 ... ## $ V49: num 0.0216 0.0216 0.0216 0.0345 0.0603 ... ## $ V50: num 0.0302 0.0345 0.0388 0.0603 0.0776 ...
# Change the data type to matrix
#Since the data is given as a 50 by 50 data frame of pixel intensities, we need to convert/manipulate the data so that clustering can be applied. This requires converting the data frame first to a matrix and then converting this matrix into a vector using the as. functions
flowerMatrix = as.matrix(flower)
str(flowerMatrix)
##  num [1:50, 1:50] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:50] "V1" "V2" "V3" "V4" ...
# Turn matrix into a vector
flowerVector = as.vector(flowerMatrix)
str(flowerVector)
##  num [1:2500] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
#Had we converted the data frame directly into vector as below, we see that it still remains a data frame.
flowerVector2 = as.vector(flower)
str(flowerVector2)
The_Analytics_Edge_edX_MIT15.071x_June2015_5 ## 'data.frame': 50 obs. of 50 variables: ## $ V1 : num 0.0991 0.0991 0.1034 0.1034 0.1034 ... ## $ V2 : num 0.112 0.108 0.112 0.116 0.108 ... ## $ V3 : num 0.134 0.116 0.121 0.116 0.112 ... ## $ V4 : num 0.138 0.138 0.121 0.121 0.112 ... ## $ V5 : num 0.138 0.134 0.125 0.116 0.112 ... ## $ V6 : num 0.138 0.129 0.121 0.108 0.112 ... ## $ V7 : num 0.129 0.116 0.103 0.108 0.112 ... ## $ V8 : num 0.116 0.103 0.103 0.103 0.116 ... ## $ V9 : num 0.1121 0.0991 0.1078 0.1121 0.1164 ... ## $ V10: num 0.121 0.108 0.112 0.116 0.125 ... ## $ V11: num 0.134 0.125 0.129 0.134 0.129 ... ## $ V12: num 0.147 0.134 0.138 0.129 0.138 ... ## $ V13: num 0.000862 0.146552 0.142241 0.142241 0.133621 ... ## $ V14: num 0.000862 0.000862 0.142241 0.133621 0.12931 ... ## $ V15: num 0.142 0.142 0.134 0.121 0.116 ... ## $ V16: num 0.125 0.125 0.116 0.108 0.108 ... ## $ V17: num 0.1121 0.1164 0.1078 0.0991 0.0991 ... ## $ V18: num 0.108 0.112 0.108 0.108 0.108 ... ## $ V19: num 0.121 0.129 0.125 0.116 0.116 ... ## $ V20: num 0.138 0.129 0.125 0.116 0.116 ... ## $ V21: num 0.138 0.134 0.121 0.125 0.125 ... ## $ V22: num 0.134 0.129 0.125 0.121 0.103 ... ## $ V23: num 0.125 0.1207 0.1164 0.1164 0.0819 ... ## $ V24: num 0.1034 0.1034 0.0991 0.0991 0.1034 ... ## $ V25: num 0.0948 0.0905 0.0905 0.1034 0.125 ... ## $ V26: num 0.0862 0.0862 0.0991 0.125 0.1422 ... ## $ V27: num 0.086207 0.086207 0.103448 0.12931 0.000862 ... ## $ V28: num 0.0991 0.1078 0.1164 0.1293 0.1466 ... ## $ V29: num 0.116 0.134 0.134 0.121 0.142 ... ## $ V30: num 0.121 0.138 0.142 0.129 0.138 ... ## $ V31: num 0.121 0.134 0.142 0.134 0.129 ... ## $ V32: num 0.116 0.134 0.129 0.116 0.112 ... ## $ V33: num 0.108 0.112 0.116 0.108 0.108 ... ## $ V34: num 0.1078 0.1078 0.1034 0.0991 0.1034 ... ## $ V35: num 0.1078 0.1034 0.0991 0.0991 0.0991 ... ## $ V36: num 0.1078 0.1034 0.1034 0.0905 0.0862 ... ## $ V37: num 0.1078 0.1078 0.1034 0.0819 0.0733 ... ## $ V38: num 0.0948 0.0991 0.0776 0.069 0.0733 ... ## $ V39: num 0.0733 0.056 0.0474 0.0474 0.056 ... ## $ V40: num 0.0474 0.0388 0.0431 0.0474 0.0603 ... ## $ V41: num 0.0345 0.0345 0.0388 0.0474 0.0647 ... ## $ V42: num 0.0259 0.0259 0.0345 0.0431 0.056 ... ## $ V43: num 0.0259 0.0259 0.0388 0.0517 0.0603 ... ## $ V44: num 0.0302 0.0302 0.0345 0.0517 0.0603 ... ## $ V45: num 0.0259 0.0259 0.0259 0.0388 0.0474 ... ## $ V46: num 0.0259 0.0172 0.0172 0.0259 0.0345 ... ## $ V47: num 0.01724 0.01724 0.00862 0.02155 0.02586 ... ## $ V48: num 0.0216 0.0129 0.0129 0.0172 0.0302 ... ## $ V49: num 0.0216 0.0216 0.0216 0.0345 0.0603 ... ## $ V50: num 0.0302 0.0345 0.0388 0.0603 0.0776 ...
#Hence it's crucial that we first convert the data frame into a matrix and then into a vector
#Let's proceed with the familiar hierarchical clustering by computing the pair-wise distances between each intensity value in the flower vector as below:

#Compute "euclidean" distances
distance = dist(flowerVector, method = "euclidean")
VIDEO 3: HIERARCHICAL CLUSTERING (R script reproduced here)

Dendrogram Example
The lowest level denotes all the observations or individual data points. All higher nodes denote clusters.
Vertical lines denote the distance between the clusters. The taller these lines, the more dissimilar the clusters are.
Choose the number of clusters by drawing a horizontal line cutting through the vertical distance lines in the dendrogram
What number of clusters to choose, then? A smaller number of clusters means the clustering will be coarser; a higher number would result in over-segmentation. Hence we should think about the trade-off.
Criteria to choose the number of clusters:
* If the horizontal line passing through the vertical distances has enough head room to move up and down, that can be one reason to choose the cut. In the above diagram, a 3-cluster cut seems appropriate, as the horizontal cut line has enough head room up and down.

#Video 3
# Hierarchical clustering of the intensity values, which gives the dendrogram tree
clusterIntensity = hclust(distance, method="ward.D2")

#Plot the dendrogram
plot(clusterIntensity)

# Select 3 clusters
#Visualising the cuts by plotting a rectangle around the clusters
rect.hclust(clusterIntensity, k = 3, border = "red")
#Let's split the data into these 3 clusters
flowerClusters = cutree(clusterIntensity, k = 3)
flowerClusters
## (output: a vector of 2500 values, one per pixel, each equal to 1, 2 or 3, giving the cluster each pixel intensity was assigned to)
#This assigns values 1 to 3 denoting which of the 3 clusters the intensity (data point) was assigned to

# Find the mean intensity values of each of these 3 clusters
tapply(flowerVector, flowerClusters, mean)
##          1          2          3 
## 0.08574315 0.50826255 0.93147713
# Plot the image and the clusters
#Before plotting the image we want to convert the flowerClusters vector into a matrix of the form n x m using dim()
dim(flowerClusters) = c(50,50)

#Plotting the image
image(flowerClusters, axes = FALSE)
#Let's see how the original flower looks so that we can compare it with our image output

# Original image with gray colour scale
image(flowerMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))
#This produces a low resolution image. Next we try a high resolution brain image
VIDEO 4: MRI IMAGE (R script reproduced here)

k-Means Clustering
K-means clustering aims at partitioning the data into k clusters in which each data point belongs to the cluster whose mean is nearest
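To make the idea concrete, here is a minimal sketch of the k-means iteration written out by hand (purely illustrative, on made-up one-dimensional data; in practice we simply call R's built-in kmeans(), as done in Video 5 below):

set.seed(1)
x = rnorm(100)                                    #hypothetical 1-dimensional data (e.g. pixel intensities)
k = 3
assign = sample(1:k, length(x), replace = TRUE)   #randomly assign each point to a cluster
for (iter in 1:100) {
  centroids = tapply(x, assign, mean)                             #compute the mean of each cluster
  newAssign = apply(abs(outer(x, centroids, "-")), 1, which.min)  #re-assign each point to the nearest centroid
  if (all(newAssign == assign)) break                             #stop when assignments no longer change
  assign = newAssign
}
table(assign)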
k-Means Clustering steps:
(Slides kmeans1 through kmeans8: step-by-step illustration of the k-means algorithm.)
# Video 4
# Let's try this with an MRI image of the brain
#healthy.csv consists of a matrix of intensity values
healthy = read.csv("healthy.csv", header=FALSE)

#Creating the healthy matrix
healthyMatrix = as.matrix(healthy)
str(healthyMatrix)
##  num [1:566, 1:646] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:646] "V1" "V2" "V3" "V4" ...
#We can see that the image is 566 by 646, considerably larger than the flower

# Plot the image in gray scale
image(healthyMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))
# Hierarchical clustering by computing the distance vector
healthyVector = as.vector(healthyMatrix)

#distance = dist(healthyVector, method = "euclidean") #throws an error that the vector size is huge, 480GB+
# We have an error - why?
#Let's see the size of the vector
str(healthyVector)
##  num [1:365636] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
#Let's store the value in n
n = 365636

#Therefore the number of pairwise distances between the above n data points is:
n*(n-1)/2
## [1] 66844659430
#Wow... 66 billion+ pairwise distances
#So the bad news is that we cannot use hierarchical clustering because of the high resolution of the image. Hence we go for another clustering method called k-means
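As a rough sanity check on that allocation error (this arithmetic is ours, not part of the original script), dist() stores each pairwise distance as an 8-byte double, so the memory needed would be about:

n * (n - 1) / 2 * 8 / 2^30   #roughly 498 GiB, in line with the "480GB+" size mentioned above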
VIDEO 5: K-MEANS CLUSTERING (R script reproduced here)
#Video 5
# Specify the number of clusters (depends upon what you want to extract from the image)
k = 5

# Run k-means
set.seed(1) #since k-means clustering starts by randomly assigning data points to clusters, we set the seed to obtain similar results
KMC = kmeans(healthyVector, centers = k, iter.max = 1000)
str(KMC)
## List of 9
##  $ cluster     : int [1:365636] 3 3 3 3 3 3 3 3 3 3 ...
##  $ centers     : num [1:5, 1] 0.4818 0.1062 0.0196 0.3094 0.1842
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : NULL
##  $ totss       : num 5775
##  $ withinss    : num [1:5] 96.6 47.2 39.2 57.5 62.3
##  $ tot.withinss: num 303
##  $ betweenss   : num 5472
##  $ size        : int [1:5] 20556 101085 133162 31555 79278
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
# Extract clusters variable for plotting it
healthyClusters = KMC$cluster #the mean intensity value of each cluster is already available under the variable name centers in the KMC object (NO NEED to calculate it as was done for hierarchical clustering)
KMC$centers[2] #mean intensity value of 2nd cluster
## [1] 0.1061945
# Plot the image with the clusters #first convert into a matrix using dim()
dim(healthyClusters) = c(nrow(healthyMatrix), ncol(healthyMatrix)) image(healthyClusters, axes = FALSE, col=rainbow(k))
#SCREE PLOTS
#While dendrograms can be used to select the final number of clusters for Hierarchical Clustering, we can't use dendrograms for k-means clustering. However, there are several other ways that the number of clusters can be selected. One common way to select the number of clusters is by using a scree plot, which works for any clustering algorithm.
#A standard scree plot has the number of clusters on the x-axis, and the sum of the within-cluster sum of squares on the y-axis. The within-cluster sum of squares for a cluster is the sum, across all points in the cluster, of the squared distance between each point and the centroid of the cluster. We ideally want a very small within-cluster sum of squares, since this means that the points are all very close to their centroid.
#To create the scree plot, the clustering algorithm is run with a range of values for the number of clusters. For each number of clusters, the within-cluster sum of squares can easily be extracted when using k-means clustering. For example, suppose that we want to cluster the MRI image from this video into two clusters. We can first run the k-means algorithm with two clusters:
KMC2 = kmeans(healthyVector, centers = 2, iter.max = 1000) #Then, the within-cluster sum of squares is just an element of KMC2:
KMC2$withinss
## [1]  803.2878 1214.2523
#This gives a vector of the within-cluster sum of squares for each cluster (in this case, there should be two numbers). #Now suppose we want to determine the best number of clusters for this dataset. We would first repeat the kmeans function call above with centers = 3, centers = 4, etc. to create KMC3, KMC4, and so on. Then, we could generate the following plot:
KMC3 = kmeans(healthyVector, centers = 3, iter.max = 1000)
KMC4 = kmeans(healthyVector, centers = 4, iter.max = 1000)
KMC5 = kmeans(healthyVector, centers = 5, iter.max = 1000)
KMC6 = kmeans(healthyVector, centers = 6, iter.max = 1000)
KMC7 = kmeans(healthyVector, centers = 7, iter.max = 1000)
KMC8 = kmeans(healthyVector, centers = 8, iter.max = 1000)
KMC9 = kmeans(healthyVector, centers = 9, iter.max = 1000)
KMC10 = kmeans(healthyVector, centers = 10, iter.max = 1000)
NumClusters = seq(2,10,1)
SumWithinss = c(sum(KMC2$withinss), sum(KMC3$withinss), sum(KMC4$withinss), sum(KMC5$withinss), sum(KMC6$withinss), sum(KMC7$withinss), sum(KMC8$withinss), sum(KMC9$withinss), sum(KMC10$withinss))
#SCREE plot
plot(NumClusters, SumWithinss, type="b")
#The plot looks like this (the type="b" argument just told the plot command to give us points and lines): #To determine the best number of clusters using this plot, we want to look for a bend, or elbow, in the plot. This means that we want to find the number of clusters for which increasing the number of clusters further does not significantly help to reduce the within-cluster sum of squares. For this particular dataset, it looks like 4 or 5 clusters is a good choice. Beyond 5, increasing the number of clusters does not really reduce the within-cluster sum of squares too much.
#Note: You may have noticed it took a lot of typing to generate SumWithinss; this is because we limited ourselves to R functions we've learned so far in the course. In fact, R has powerful functions for repeating tasks with a different input (in this case running kmeans with different cluster sizes). For instance, we could generate SumWithinss with: #All the above steps in one go using sapply()
SumWithinss = sapply(2:10, function(x) sum(kmeans(healthyVector, centers=x, iter.max=1000)$withinss)) #SCREE plot
plot(NumClusters, SumWithinss, type="b")
VIDEO 6: DETECTING TUMORS (R script reproduced here)
First Taste of a Fascinating Field
MRI image segmentation is a subject of ongoing research
k-means is a good starting point, but not enough
Advanced clustering techniques such as the modified fuzzy k-means (MFCM) clustering technique
Packages in R specialized for medical image analysis: Medical Imaging (http://cran.r-project.org/web/views/MedicalImaging.html)
#Video 6
#Apply to a test image
tumor = read.csv("tumor.csv", header=FALSE)
tumorMatrix = as.matrix(tumor)
tumorVector = as.vector(tumorMatrix)
#Apply clusters from before to new image, using the flexclust package
#install.packages("flexclust")
library(flexclust)
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
#converting KMC into a kcca object
KMC.kcca = as.kcca(KMC, healthyVector) #healthyVector as a training set
tumorClusters = predict(KMC.kcca, newdata = tumorVector) #tumorVector as a testing set
#Visualize the clusters
#converting tumorClusters into a matrix
dim(tumorClusters) = c(nrow(tumorMatrix), ncol(tumorMatrix)) #visualising the tumorClusters
image(tumorClusters, axes = FALSE, col=rainbow(k))
Segmented MRI Images
mri1
Left side image is of a healthy brain
T2 Weighted MRI Images
mri2
VIDEO 7: COMPARING METHODS USED IN THE CLASS
Comparison of Methods
method1
method2
method3
Assignment 6
DOCUMENT CLUSTERING WITH DAILY KOS
Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term “jaguar” into Google, around 200 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings. If we search for “jaguar”, we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team. Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official Web portal for the U.S. government. The two most common algorithms used for document clustering are Hierarchical and k-means.
In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.
The file dailykos.csv contains data on 3,430 news articles or blog posts that have been posted on Daily Kos. These articles were posted in 2004, leading up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.
Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.
#PROBLEM 1.1 - HIERARCHICAL CLUSTERING #Let's start by building a hierarchical clustering model. First, read the data set into R.
dailykos<- read.csv("dailykos.csv") dim(dailykos)
## [1] 3430 1545
#compute the distances (using method="euclidean")
kosDist<- dist(dailykos,method="euclidean") #use hclust to build the model (using method="ward.D"),clustering on all of the variables
kosHierClust<- hclust(kosDist, method="ward.D") #Running the dist function will probably take you a while. Why? Select all that apply. #Ans:We have a lot of observations, so it takes a long time to compute the distance between each pair of observations & We have a lot of variables, so the distance computation is long. #Explanation:The distance computation can take a long time if you have a lot of observations and/or if there are a lot of variables. As we saw in recitation, it might not even work if you have too many of either! ########################################### #PROBLEM 1.2 - HIERARCHICAL CLUSTERING #Plot the dendrogram of your hierarchical clustering model. Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply. #plot the dendrogram
plot(kosHierClust)#where "kosHierClust" is the name of the clustering model
#Ans:2 & 3 #Explanation:The choices 2 and 3 are good cluster choices according to the dendrogram, because there is a lot of space between the horizontal lines in the dendrogram in those cut off spots (draw a horizontal line across the dendrogram where it crosses 2 or 3 vertical lines). The choices of 5 and 6 do not seem good according to the dendrogram because there is very little space. ####################################### #PROBLEM 1.3 - HIERARCHICAL CLUSTERING #In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply. #Ans:7 & 8 #Explanation:Thinking about the application, it is probably better to show the reader more categories than 2 or 3. These categories would probably be too broad to be useful. Seven or eight categories seems more reasonable. ###################################### #PROBLEM 1.4 - HIERARCHICAL CLUSTERING #Let's pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters. #You can split your data into clusters by first using the cutree function to compute the cluster numbers #cluster assignment of hierarchical clustering
hierGroups<-cutree(kosHierClust, k=7) #Then, you can use the subset function 7 times to split the data into the 7 clusters(we don't really want to run tapply on every single variable when we have over 1,000 different variables):
HierCluster1 = subset(dailykos, hierGroups == 1) nrow(HierCluster1)
## [1] 1266
HierCluster2 = subset(dailykos, hierGroups == 2) nrow(HierCluster2)
## [1] 321
HierCluster3 = subset(dailykos, hierGroups == 3) nrow(HierCluster3)
## [1] 374
HierCluster4 = subset(dailykos, hierGroups == 4) nrow(HierCluster4)
## [1] 139
HierCluster5 = subset(dailykos, hierGroups == 5) nrow(HierCluster5)
## [1] 407
HierCluster6 = subset(dailykos, hierGroups == 6) nrow(HierCluster6)
## [1] 714
HierCluster7 = subset(dailykos, hierGroups == 7) nrow(HierCluster7)
## [1] 209
#or
table(hierGroups)
## hierGroups
##    1    2    3    4    5    6    7 
## 1266  321  374  139  407  714  209 
#More Advanced Approach: #There is a very useful function in R called the "split" function. Given a vector assigning groups like hierGroups, you could split dailykos into the clusters by typing:
HierCluster = split(dailykos, hierGroups) #Then cluster 1 can be accessed by typing HierCluster[[1]], cluster 2 can be accessed by typing HierCluster[[2]], etc. If you have a variable in your current R session called "split", you will need to remove it with rm(split) before using the split function. #How many observations are in cluster 3? #Ans:374 #Which cluster has the most observations? #Ans:1 #Which cluster has the fewest observations? #Ans:4 ######################################### #PROBLEM 1.5 - HIERARCHICAL CLUSTERING #Instead of looking at the average value in each variable individually, we'll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where "HierCluster1" should be replaced with the name of your first cluster subset):
tail(sort(colMeans(HierCluster[[1]])))
##      state republican       poll   democrat      kerry       bush 
##  0.7575039  0.7590837  0.9036335  0.9194313  1.0624013  1.7053712 
#This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means. #What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output: #Ans:bush #Explanation:After running the R command given above, we can see that the most frequent word on average is "bush". This corresponds to President George W. Bush. ################################################## #PROBLEM 1.6 - HIERARCHICAL CLUSTERING #Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions. #Get top 6 words of all clusters #using lapply() rather than doing it clusterwise
topWords <- lapply(HierCluster, function(c) tail(sort(colMeans(c)))) topWords
## $`1`
##      state republican       poll   democrat      kerry       bush 
##  0.7575039  0.7590837  0.9036335  0.9194313  1.0624013  1.7053712 
## 
## $`2`
##      bush  democrat challenge      vote      poll  november 
##  2.847352  2.850467  4.096573  4.398754  4.847352 10.339564 
## 
## $`3`
##      elect    parties      state republican   democrat       bush 
##   1.647059   1.665775   2.320856   2.524064   3.823529   4.406417 
## 
## $`4`
## campaign    voter presided     poll     bush    kerry 
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849 
## 
## $`5`
##       american       presided administration            war           iraq           bush 
##       1.090909       1.120393       1.230958       1.776413       2.427518       3.941032 
## 
## $`6`
##      race      bush     kerry     elect  democrat      poll 
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325 
## 
## $`7`
## democrat    clark   edward     poll    kerry     dean 
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828 
#or doing each cluster wise
tail(sort(colMeans(HierCluster2)))
##      bush  democrat challenge      vote      poll  november 
##  2.847352  2.850467  4.096573  4.398754  4.847352 10.339564 
tail(sort(colMeans(HierCluster3)))
##      elect    parties      state republican   democrat       bush 
##   1.647059   1.665775   2.320856   2.524064   3.823529   4.406417 
tail(sort(colMeans(HierCluster4)))
## campaign    voter presided     poll     bush    kerry 
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849 
tail(sort(colMeans(HierCluster5)))
##       american       presided administration            war           iraq           bush 
##       1.090909       1.120393       1.230958       1.776413       2.427518       3.941032 
tail(sort(colMeans(HierCluster6)))
##      race      bush     kerry     elect  democrat      poll 
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325 
tail(sort(colMeans(HierCluster7)))
## democrat    clark   edward     poll    kerry     dean 
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828 
#Which words best describe cluster 2? #Ans:november, poll, vote, challenge #Which cluster could best be described as the cluster related to the Iraq war? #Ans:Cluster 5 #In 2004, one of the candidates for the Democratic nomination for the President of the United States was Howard Dean, John Kerry was the candidate who won the Democratic nomination, and John Edwards was the running mate of John Kerry (the Vice President nominee). Given this information, which cluster best corresponds to the Democratic party? #Ans:Cluster 7 #Explanation:You can see that the words that best describe Cluster 2 are november, poll, vote, and challenge. The most common words in Cluster 5 are bush, iraq, war, and administration, so it is the cluster that can best be described as corresponding to the Iraq war. And the most common words in Cluster 7 are dean, kerry, poll, and edward, so it looks like the democratic cluster. ############################################## #PROBLEM 2.1 - K-MEANS CLUSTERING #Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don't need to add the iters.max argument.
set.seed(1000) #cluster assignment of k-means clustering
KmeansCluster<-kmeans(dailykos, centers=7) str(KmeansCluster)
## List of 9
##  $ cluster     : int [1:3430] 4 4 6 4 1 4 7 4 4 4 ...
##  $ centers     : num [1:7, 1:1545] 0.0342 0.0556 0.0253 0.0136 0.0491 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:1545] "abandon" "abc" "ability" "abortion" ...
##  $ totss       : num 896461
##  $ withinss    : num [1:7] 76583 52693 99504 258927 88632 ...
##  $ tot.withinss: num 730632
##  $ betweenss   : num 165829
##  $ size        : int [1:7] 146 144 277 2063 163 329 308
##  $ iter        : int 7
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
#Then, you can subset your data into the 7 clusters by using the following commands:
KmeansCluster1 = subset(dailykos, KmeansCluster$cluster == 1) nrow(KmeansCluster1)
## [1] 146
KmeansCluster2 = subset(dailykos, KmeansCluster$cluster == 2) nrow(KmeansCluster2)
## [1] 144
KmeansCluster3 = subset(dailykos, KmeansCluster$cluster == 3) nrow(KmeansCluster3)
## [1] 277
KmeansCluster4 = subset(dailykos, KmeansCluster$cluster == 4) nrow(KmeansCluster4)
## [1] 2063
KmeansCluster5 = subset(dailykos, KmeansCluster$cluster == 5) nrow(KmeansCluster5)
## [1] 163
KmeansCluster6 = subset(dailykos, KmeansCluster$cluster == 6) nrow(KmeansCluster6)
## [1] 329
KmeansCluster7 = subset(dailykos, KmeansCluster$cluster == 7) nrow(KmeansCluster7)
## [1] 308
#Alternatively, you could answer these questions by looking at
table(KmeansCluster$cluster)
## 
##    1    2    3    4    5    6    7 
##  146  144  277 2063  163  329  308 
#More Advanced Approach: #rm(split)
KmeansClusterspl<-split(dailykos,KmeansCluster$cluster) #How many observations are in Cluster 3?
nrow(KmeansClusterspl[[3]])
## [1] 277
#Ans: #using sapply() to get the no of obs in all 7 clusters
sapply(KmeansClusterspl, nrow)
##    1    2    3    4    5    6    7 
##  146  144  277 2063  163  329  308 
#Which cluster has the most observations? #Ans:Cluster 4 #Which cluster has the fewest number of observations? #Ans:Cluster 2
################################################### #PROBLEM 2.2 - K-MEANS CLUSTERING #Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the kmeans clusters. #most frequent words from each kmeans cluster:
kmTopWords <- lapply(KmeansClusterspl, function(c) tail(sort(colMeans(c)))) kmTopWords
## $`1`
##          state           iraq          kerry administration       presided           bush 
##       1.609589       1.616438       1.636986       2.664384       2.767123      11.431507 
## 
## $`2`
## primaries  democrat    edward     clark     kerry      dean 
##  2.319444  2.694444  2.798611  3.090278  4.979167  8.277778 
## 
## $`3`
## administration          iraqi       american           bush            war           iraq 
##       1.389892       1.610108       1.685921       2.610108       3.025271       4.093863 
## 
## $`4`
##      elect republican      kerry       poll   democrat       bush 
##  0.6010664  0.6175473  0.6495395  0.7474552  0.7891420  1.1473582 
## 
## $`5`
##       race     senate      state    parties republican   democrat 
##   2.484663   2.650307   3.521472   3.619632   4.638037   6.993865 
## 
## $`6`
##  democrat      bush challenge      vote      poll  november 
##  2.899696  2.960486  4.121581  4.446809  4.872340 10.370821 
## 
## $`7`
## presided    voter campaign     poll     bush    kerry 
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519 
#or #we can output the most frequent words in each of the k-means clusters by using the following commands:
tail(sort(colMeans(KmeansCluster1)))
##          state           iraq          kerry administration       presided           bush 
##       1.609589       1.616438       1.636986       2.664384       2.767123      11.431507 
tail(sort(colMeans(KmeansCluster2)))
## primaries  democrat    edward     clark     kerry      dean 
##  2.319444  2.694444  2.798611  3.090278  4.979167  8.277778 
tail(sort(colMeans(KmeansCluster3)))
## administration          iraqi       american           bush            war           iraq 
##       1.389892       1.610108       1.685921       2.610108       3.025271       4.093863 
tail(sort(colMeans(KmeansCluster4)))
##      elect republican      kerry       poll   democrat       bush 
##  0.6010664  0.6175473  0.6495395  0.7474552  0.7891420  1.1473582 
tail(sort(colMeans(KmeansCluster5)))
##       race     senate      state    parties republican   democrat 
##   2.484663   2.650307   3.521472   3.619632   4.638037   6.993865 
tail(sort(colMeans(KmeansCluster6)))
##  democrat      bush challenge      vote      poll  november 
##  2.899696  2.960486  4.121581  4.446809  4.872340 10.370821 
tail(sort(colMeans(KmeansCluster7)))
## presided    voter campaign     poll     bush    kerry 
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519 
#Which k-means cluster best corresponds to the Iraq War? #Ans:Cluster 3 #Which k-means cluster best corresponds to the democratic party? (Remember that we are looking for the names of the key democratic party leaders.) #Ans:Cluster 2 #Explanation:By looking at the output, you can see that the cluster best corresponding to the Iraq War is cluster 3 (top words are iraq, war, and bush) and the cluster best corresponding to the democratic party is cluster 2 (top words dean, kerry, clark, and edward). ################################################## #PROBLEM 2.3 - K-MEANS CLUSTERING #For the rest of this problem, we'll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering. #topWords #kmTopWords
table(hierGroups,KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Which Hierarchical Cluster best corresponds to K-Means Cluster 2? #Ans:Hierarchical Cluster 7 #Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 116 (80.6%) of the observations in K-Means Cluster 2 also fall in Hierarchical Cluster 7. #################################################### #PROBLEM 2.4 - K-MEANS CLUSTERING #Which Hierarchical Cluster best corresponds to K-Means Cluster 3?
table(hierGroups, KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Ans:Hierarchical Cluster 5 #Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 171 (61.7%) of the observations in K-Means Cluster 3 also fall in Hierarchical Cluster 5. ########################################### #PROBLEM 2.5 - K-MEANS CLUSTERING #Which Hierarchical Cluster best corresponds to K-Means Cluster 7?
table(hierGroups, KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Ans:No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7. #Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that no more than 123 (39.9%) of the observations in K-Means Cluster 7 fall in any hierarchical cluster. ########################################## #PROBLEM 2.6 - K-MEANS CLUSTERING #Which Hierarchical Cluster best corresponds to K-Means Cluster 6?
table(hierGroups, KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Ans:Hierarchical Cluster 2 #Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 320 (97.3%) of observations in K-Means Cluster 6 fall in Hierarchical Cluster 2.
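The "best corresponds to" answers above can also be read off programmatically. A minimal sketch (not part of the original solution): for each k-means cluster, find the hierarchical cluster holding the largest share of its observations.
confTab <- table(hierGroups, KmeansCluster$cluster)        # rows: hierarchical, columns: k-means
bestMatch <- apply(confTab, 2, which.max)                   # best-matching hierarchical cluster per k-means cluster
bestShare <- apply(prop.table(confTab, margin = 2), 2, max) # share of that k-means cluster it captures
round(rbind(bestMatch, bestShare), 3)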
MARKET SEGMENTATION FOR AIRLINES
Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.
The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.
There are seven different variables in the dataset, described below:
Balance = number of miles eligible for award travel
QualMiles = number of miles qualifying for TopFlight status
BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
BonusTrans = number of non-flight bonus transactions in the past 12 months
FlightMiles = number of flight miles in the past 12 months
FlightTrans = number of flight transactions in the past 12 months
DaysSinceEnroll = number of days since enrolled in the frequent flyer program
#Read in the data
airlines<- read.csv("AirlinesCluster.csv") summary(airlines)
##     Balance          QualMiles         BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23801   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  
##   FlightMiles        FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2   
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
##  Median :    0.0   Median : 0.000   Median :4096   
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119   
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
##  Max.   :30817.0   Max.   :53.000   Max.   :8296   
#or better way
colMeans(airlines)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    73601.327582      144.114529    17144.846212       11.601900 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      460.055764        1.373593     4118.559390 
#PROBLEM 1.1 - NORMALIZING THE DATA #Looking at the summary of airlines, which TWO variables have (on average) the smallest values? #Ans:BonusTrans & FlightTrans #Which TWO variables have (on average) the largest values? #Ans:Balance & BonusMiles #EXPLANATION: For the smallest values, BonusTrans and FlightTrans are on the scale of tens, whereas all other variables have values in the thousands. For the largest values, Balance and BonusMiles have average values in the tens of thousands. ################################# #PROBLEM 1.2 - NORMALIZING THE DATA #In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering? #Ans:If we don't normalize the data, the clustering will be dominated by the variables that are on a larger scale. #EXPLANATION:If we don't normalize the data, the variables that are on a larger scale will contribute much more to the distance calculation, and thus will dominate the clustering. ################################### #PROBLEM 1.3 - NORMALIZING THE DATA #Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package.
library(caret)
## ## Attaching package: 'caret'
## The following object is masked from 'package:mosaic': ## ## dotPlot
#Now, create a normalized data frame called "airlinesNorm" by running the following commands:
preproc = preProcess(airlines) airlinesNorm = predict(preproc, airlines) #Normalising the data ##The first command pre-processes the data, and the second command performs the normalization.
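As a quick cross-check (assuming preProcess() was called with its default centering and scaling, which is what the explanation above describes), the same normalization can be reproduced with base R's scale():
airlinesNormManual <- as.data.frame(scale(airlines))
# largest absolute difference between the two versions; should be essentially 0
max(abs(as.matrix(airlinesNormManual) - as.matrix(airlinesNorm)))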
summary(airlinesNorm)
##     Balance          QualMiles         BonusMiles        BonusTrans      
##  Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
##  1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
##  Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
##  Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  
##   FlightMiles       FlightTrans        DaysSinceEnroll   
##  Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
##  1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
##  Median :-0.3286   Median :-0.36212   Median :-0.01092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
##  Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284  
#Mean of each variable
round(colMeans(airlinesNorm),6) #or
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               0               0               0               0 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               0               0               0 
round(sapply(airlinesNorm,mean),6)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               0               0               0               0 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               0               0               0 
#std.deviation of each var
sapply(airlinesNorm,sd)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               1               1               1               1 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               1               1               1 
#If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function. #In the normalized data, which variable has the largest maximum value?
apply(airlinesNorm, 2, max)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##       16.186811       14.223084       10.208293        7.746727 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##       21.680292       13.610351        2.022842 
#Ans:FlightMiles #In the normalized data, which variable has the smallest minimum value?
apply(airlinesNorm, 2, min)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##      -0.7303482      -0.1862754      -0.7099031      -1.2080518 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      -0.3285622      -0.3621226      -1.9933614 
#Ans:DaysSinceEnroll #EXPLANATION:You can see from the output that FlightMiles now has the largest maximum value, and DaysSinceEnroll now has the smallest minimum value. Note that these were not the variables with the largest and smallest values in the original dataset airlines. ##################################### #PROBLEM 2.1 - HIERARCHICAL CLUSTERING #Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.
distances <- dist(airlinesNorm,method="euclidean") airlineClust <- hclust(distances,method="ward.D") #Then, plot the dendrogram of the hierarchical clustering process:
plot(airlineClust)
#Suppose the airline is looking for somewhere between 2 and 10 clusters. According to the dendrogram, which of the following is NOT a good choice for the number of clusters? #Ans:6 #EXPLANATION:If you run a horizontal line down the dendrogram, you can see that there is a long stretch where the line crosses 2 clusters, 3 clusters, or 7 clusters. However, it is hard to see the horizontal line cross 6 clusters. This means that 6 clusters is probably not a good choice. ########################################## #PROBLEM 2.2 - HIERARCHICAL CLUSTERING #Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function. #How many data points are in Cluster 1? #Divide into 5 clusters
clusterGroups<- cutree(airlineClust, k=5) airlineClusters <-split(airlinesNorm, clusterGroups) nrow(airlineClusters[[1]])
## [1] 776
#or
table(clusterGroups)
## clusterGroups
##    1    2    3    4    5 
##  776  519  494  868 1342 
#Ans:776 #EXPLANATION:you can see that there are 776 data points in the first cluster. ##################################### #PROBLEM 2.3 - HIERARCHICAL CLUSTERING #Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable "Balance" with the following command: #The centroids of the clusters (un-normalised data) #average values in the 'Balance' variable for the 5 clusters
tapply(airlines$Balance, clusterGroups, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91 
#then similarly compute for all other variables
tapply(airlines$QualMiles, clusterGroups, mean)
##            1            2            3            4            5 
##    0.6443299 1065.9826590   30.3461538    4.8479263    2.5111773 
tapply(airlines$BonusMiles, clusterGroups, mean)
##         1         2         3         4         5 
## 10360.124 22881.763 55795.860 20788.766  2264.788 
tapply(airlines$BonusTrans, clusterGroups, mean)
##         1         2         3         4         5 
## 10.823454 18.229287 19.663968 17.087558  2.973174 
tapply(airlines$FlightMiles, clusterGroups, mean)
##          1          2          3          4          5 
##   83.18428 2613.41811  327.67611  111.57373  119.32191 
tapply(airlines$FlightTrans, clusterGroups, mean)
##         1         2         3         4         5 
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972 
tapply(airlines$DaysSinceEnroll, clusterGroups, mean)
##        1        2        3        4        5 
## 6235.365 4402.414 5615.709 2840.823 3060.081 
#or #Advanced Explanation: #Instead of using tapply, you could have alternatively used colMeans and subset, as follows:
colMeans(subset(airlines, clusterGroups == 1))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    5.786690e+04    6.443299e-01    1.036012e+04    1.082345e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    8.318428e+01    3.028351e-01    6.235365e+03 
colMeans(subset(airlines, clusterGroups == 2))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.106693e+05    1.065983e+03    2.288176e+04    1.822929e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    2.613418e+03    7.402697e+00    4.402414e+03 
colMeans(subset(airlines, clusterGroups == 3))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.981916e+05    3.034615e+01    5.579586e+04    1.966397e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    3.276761e+02    1.068826e+00    5.615709e+03 
colMeans(subset(airlines, clusterGroups == 4))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    52335.913594        4.847926    20788.766129       17.087558 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      111.573733        0.344470     2840.822581 
colMeans(subset(airlines, clusterGroups == 5))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    3.625591e+04    2.511177e+00    2.264788e+03    2.973174e+00 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    1.193219e+02    4.388972e-01    3.060081e+03 
#But an even more compact way of finding the centroids would be to use the function "split" to first split the data into clusters, and then to use the function "lapply" to apply the function "colMeans" to each of the clusters:
lapply(split(airlines, clusterGroups), colMeans)
## $`1`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    5.786690e+04    6.443299e-01    1.036012e+04    1.082345e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    8.318428e+01    3.028351e-01    6.235365e+03 
## 
## $`2`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.106693e+05    1.065983e+03    2.288176e+04    1.822929e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    2.613418e+03    7.402697e+00    4.402414e+03 
## 
## $`3`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.981916e+05    3.034615e+01    5.579586e+04    1.966397e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    3.276761e+02    1.068826e+00    5.615709e+03 
## 
## $`4`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    52335.913594        4.847926    20788.766129       17.087558 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      111.573733        0.344470     2840.822581 
## 
## $`5`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    3.625591e+04    2.511177e+00    2.264788e+03    2.973174e+00 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    1.193219e+02    4.388972e-01    3.060081e+03 
#or better & faster ways is using sapply()
airlinesUnnormClusters<-split(airlines,clusterGroups) round(sapply(airlinesUnnormClusters,colMeans),4)
##                         1           2           3          4          5
## Balance         57866.9046 110669.2659 198191.5749 52335.9136 36255.9098
## QualMiles           0.6443   1065.9827     30.3462     4.8479     2.5112
## BonusMiles      10360.1237  22881.7630  55795.8603 20788.7661  2264.7876
## BonusTrans         10.8235     18.2293     19.6640    17.0876     2.9732
## FlightMiles        83.1843   2613.4181    327.6761   111.5737   119.3219
## FlightTrans         0.3028      7.4027      1.0688     0.3445     0.4389
## DaysSinceEnroll  6235.3647   4402.4143   5615.7085  2840.8226  3060.0812
#Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply. #Ans:DaysSinceEnroll #EXPLANATION:The only variable for which Cluster 1 has large values is DaysSinceEnroll. #How would you describe the customers in Cluster 1? #Ans: Infrequent but loyal customers. #EXPLANATION:Cluster 1 mostly contains customers with few miles, but who have been with the airline the longest.
#################################### #PROBLEM 2.4 - HIERARCHICAL CLUSTERING #Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply. #Ans:QualMiles,FlightMiles,FlightTrans #EXPLANATION:Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance. #How would you describe the customers in Cluster 2? #Ans: Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions. #EXPLANATION:Cluster 2 contains customers with a large amount of miles, mostly accumulated through flight transactions.
################################## #PROBLEM 2.5 - HIERARCHICAL CLUSTERING #Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply. #Ans:Balance,BonusMiles,BonusTrans #EXPLANATION:Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values. #How would you describe the customers in Cluster 3? #Ans:Customers who have accumulated a large amount of miles, mostly through non-flight transactions. #EXPLANATION:Cluster 3 mostly contains customers with a lot of miles, and who have earned the miles mostly through bonus transactions.
######################################## #PROBLEM 2.6 - HIERARCHICAL CLUSTERING #Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply. #Ans:None #EXPLANATION:Cluster 4 does not have the largest values in any of the variables. #How would you describe the customers in Cluster 4? #Ans:Relatively new customers who seem to be accumulating miles, mostly through non-flight transactions. #EXPLANATION:Cluster 4 customers have the smallest value in DaysSinceEnroll, but they are already accumulating a reasonable number of miles.
################################ #PROBLEM 2.7 - HIERARCHICAL CLUSTERING #Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply. #Ans:None #EXPLANATION:Cluster 5 does not have the largest values in any of the variables. #How would you describe the customers in Cluster 5? #Ans:Relatively new customers who don't use the airline very often. #EXPLANATION:Cluster 5 customers have lower than average values in all variables.
#################################### #PROBLEM 3.1 - K-MEANS CLUSTERING #Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000. #Kmeans clustering
set.seed(88) kmeansClust<- kmeans(airlinesNorm, centers=5, iter.max=1000) #you can look at the number of observations in each cluster with the following command:
table(kmeansClust$cluster)
## 
##    1    2    3    4    5 
##  408  141  993 1182 1275 
#or
sum(kmeansClust$size > 1000)
## [1] 2
#How many clusters have more than 1,000 observations? #Ans:2 #Explanation:There are two clusters with more than 1000 observations. ################################### #PROBLEM 3.2 - K-MEANS CLUSTERING #Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.) #the centroids of the clusters (normalised data) #Hierarchical clusters centroid
airlinesUnnormClusters<-split(airlinesNorm,clusterGroups) round(sapply(airlinesUnnormClusters,colMeans),4)
##                       1       2       3       4       5
## Balance         -0.1561  0.3678  1.2363 -0.2110 -0.3706
## QualMiles       -0.1854  1.1916 -0.1471 -0.1800 -0.1830
## BonusMiles      -0.2809  0.2375  1.6004  0.1509 -0.6161
## BonusTrans      -0.0811  0.6901  0.8395  0.5712 -0.8985
## FlightMiles     -0.2692  1.5379 -0.0945 -0.2489 -0.2433
## FlightTrans     -0.2823  1.5895 -0.0803 -0.2713 -0.2464
## DaysSinceEnroll  1.0250  0.1375  0.7250 -0.6187 -0.5125
#Kmeans centroid
kmeansClust$centers
##       Balance   QualMiles BonusMiles BonusTrans FlightMiles FlightTrans DaysSinceEnroll
## 1  1.44439706  0.51115730  1.8769284  1.0331951   0.1169945   0.1444636       0.7198040
## 2  1.00054098  0.68382234  0.6144780  1.7214887   3.8559798   4.1196141       0.2742394
## 3 -0.05580605 -0.14104391  0.3041358  0.7108744  -0.1218278  -0.1287569      -0.3398209
## 4 -0.13331742 -0.11491607 -0.3492669 -0.3373455  -0.1833989  -0.1961819       0.9640923
## 5 -0.40579897 -0.02281076 -0.5816482 -0.7619054  -0.1989602  -0.2196582      -0.8897747
#Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output? #Ans:No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering. #EXPLANATION:The clusters are not displayed in a meaningful order, so while there may be a cluster produced by the k-means algorithm that is similar to Cluster 1 produced by the Hierarchical method, it will not necessarily be shown first.
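A quick check of how the (arbitrarily numbered) clusters actually line up, in the same spirit as the Daily Kos comparison earlier (this cross-tabulation is not part of the original solution):
table(clusterGroups, kmeansClust$cluster)  # hierarchical assignments vs. k-means assignments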
PREDICTING STOCK RETURNS WITH CLUSTER-THEN-PREDICT
In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we’ll use cluster-then-predict to predict future stock prices using historical stock data.
When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.
For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.
Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.
This dataset contains the following variables:
ReturnJan = the return for the company’s stock during January (in the year of the observation).
ReturnFeb = the return for the company’s stock during February (in the year of the observation).
ReturnMar = the return for the company’s stock during March (in the year of the observation).
ReturnApr = the return for the company’s stock during April (in the year of the observation).
ReturnMay = the return for the company’s stock during May (in the year of the observation).
ReturnJune = the return for the company’s stock during June (in the year of the observation).
ReturnJuly = the return for the company’s stock during July (in the year of the observation).
ReturnAug = the return for the company’s stock during August (in the year of the observation).
ReturnSep = the return for the company’s stock during September (in the year of the observation).
ReturnOct = the return for the company’s stock during October (in the year of the observation).
ReturnNov = the return for the company’s stock during November (in the year of the observation).
PositiveDec = whether or not the company’s stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.
For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
#PROBLEM 1.1 - EXPLORING THE DATASET #Load StocksCluster.csv into a data frame called "stocks". #How many observations are in the dataset?
stocks <- read.csv("StocksCluster.csv") str(stocks)
## 'data.frame':    11580 obs. of  12 variables:
##  $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
##  $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
##  $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
##  $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
##  $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
##  $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
##  $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
##  $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
##  $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
##  $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
##  $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
##  $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...
#or
nrow(stocks)
## [1] 11580
#Ans:11580 ################################### #PROBLEM 1.2 - EXPLORING THE DATASET #What proportion of the observations have positive returns in December? #Proportion of stocks with positive return
prop.table(table(stocks$PositiveDec)) #6324/11580 = 0.546
## 
##        0        1 
## 0.453886 0.546114 
#or
sum(stocks$PositiveDec) / nrow(stocks)
## [1] 0.546114
#or
mean(stocks$PositiveDec)
## [1] 0.546114
#Ans:0.546114 ###################################### #PROBLEM 1.3 - EXPLORING THE DATASET #What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov. #Max correlation between months January to November
cm <- cor(stocks[1:11]) diag(cm) <- -1 # making the diagonals as -1 max(cm)
## [1] 0.1916728
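An equivalent way to get the same maximum (a small sketch, not from the original solution) is to look only at the strictly upper triangle of the correlation matrix, which avoids overwriting the diagonal:
cm2 <- cor(stocks[1:11])
max(cm2[upper.tri(cm2)])   # should again give about 0.1917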
#Ans:0.1916728 #################################### #PROBLEM 1.4 - EXPLORING THE DATASET #Which month (from January through November) has the largest mean return across all observations in the dataset?
summary(stocks)
##    ReturnJan           ReturnFeb           ReturnMar        
##  Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
##  1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
##  Median : 0.0009965   Median :-0.010626   Median : 0.009878  
##  Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
##  3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
##  Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
##    ReturnApr           ReturnMay          ReturnJune       
##  Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
##  1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
##  Median : 0.009059   Median : 0.01293   Median :-0.000880  
##  Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
##  3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
##  Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
##   ReturnJuly            ReturnAug           ReturnSep        
##  Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
##  1st Qu.:-0.0731917   1st Qu.:-0.046272   1st Qu.:-0.074648  
##  Median :-0.0008047   Median : 0.007205   Median :-0.007616  
##  Mean   : 0.0030509   Mean   : 0.016198   Mean   :-0.014721  
##  3rd Qu.: 0.0718205   3rd Qu.: 0.070783   3rd Qu.: 0.049476  
##  Max.   : 2.5500000   Max.   : 3.626609   Max.   : 5.863980  
##    ReturnOct           ReturnNov          PositiveDec    
##  Min.   :-0.685504   Min.   :-0.747171   Min.   :0.0000  
##  1st Qu.:-0.070915   1st Qu.:-0.054890   1st Qu.:0.0000  
##  Median : 0.002115   Median : 0.008522   Median :1.0000  
##  Mean   : 0.005651   Mean   : 0.011387   Mean   :0.5461  
##  3rd Qu.: 0.074542   3rd Qu.: 0.076576   3rd Qu.:1.0000  
##  Max.   : 5.665138   Max.   : 3.271676   Max.   :1.0000  
#If you look at the mean value for each variable, you can see that April has the largest mean value (0.026308), and September has the smallest mean value (-0.014721). #or better
cm <- colMeans(stocks[1:11]) which.max(cm)
## ReturnApr ## 4
#Ans:ReturnApr #Which month (from January through November) has the smallest mean return across all observations in the dataset?
which.min(cm)
## ReturnSep 
##         9 
#Ans:ReturnSep ########################################### #PROBLEM 2.1 - INITIAL LOGISTIC REGRESSION MODEL #Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set: #Split the data into a training set and testing set library(caTools)
## ## Attaching package: 'caTools'
## The following objects are masked from 'package:base64enc': ## ## base64decode, base64encode
set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)
#Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.
#Train Logistic regression model as:
StocksModel = glm(PositiveDec ~ ., data=stocksTrain, family=binomial) #we can compute our predictions on the training set with:
PredictTrain = predict(StocksModel, type="response") #construct a confusion/classification matrix with the table function using a threshold of 0.5:
cmat_LR <-table(stocksTrain$PositiveDec, PredictTrain > 0.5) cmat_LR
##    
##     FALSE TRUE
##   0   990 2689
##   1   787 3640
#lets now compute the overall accuracy of the training set
accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR) accu_LR
## [1] 0.5711818
#or
sum(diag(cmat_LR))/nrow(stocksTrain)
## [1] 0.5711818
#What is the overall accuracy on the training set using a threshold of 0.5? #Ans: 0.5711818
#(990 + 3640)/(990 + 2689 + 787 + 3640) = 0.571
######################################### #PROBLEM 2.2 - INITIAL LOGISTIC REGRESSION MODEL #Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test set, again using a threshold of 0.5? #Out-of-Sample predictions of the Logistic Regression model
PredictTest = predict(StocksModel, newdata=stocksTest, type="response") #Then we compute the confusion matrix for the testing set using a threshold of 0.5
cmat_LR<-table(stocksTest$PositiveDec, PredictTest > 0.5) cmat_LR
##    
##     FALSE TRUE
##   0   417 1160
##   1   344 1553
#lets now compute the overall accuracy of the test set
accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR) accu_LR
## [1] 0.5670697
#or
sum(diag(cmat_LR))/nrow(stocksTest)
## [1] 0.5670697
#Ans:0.5670697
#(417 + 1553)/(417 + 1160 + 344 + 1553) = 0.567
######################################## #PROBLEM 2.3 - INITIAL LOGISTIC REGRESSION MODEL #What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)? #baseline model computed by making a table of the outcome variable in the test set:
baseline<-table(stocksTest$PositiveDec) baseline
## 
##    0    1 
## 1577 1897 
#Baseline accuracy
accu_baseline <- max(baseline)/sum(baseline) accu_baseline #1897/(1577 + 1897) = 0.5460564
## [1] 0.5460564
#or
baseline[2] / sum(baseline)
## 1 ## 0.5460564
#Ans:0.5460564 #EXPLANATION:The baseline model would get all of the PositiveDec = 1 cases correct, and all of the PositiveDec = 0 cases wrong, for an accuracy of 1897/(1577 + 1897) = 0.5460564. ################################## #PROBLEM 3.1 - CLUSTERING STOCKS #Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands: #Remove the dependent variable from the data because we are trying to use clustering to "discover" the dependent variable, hence the dependent variable is removed
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL
#Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology? #Ans:Needing to know the dependent variable value to assign an observation to a cluster defeats the purpose of the methodology #EXPLANATION:In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value. #This is an important point that is sometimes mistakenly overlooked. If you use the outcome value to cluster, you might conclude your method strongly outperforms a non-clustering alternative. However, this is because it is using the outcome to determine the clusters, which is not valid. ################################################ #PROBLEM 3.2 - CLUSTERING STOCKS #In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting by the mean and dividing by the standard deviation. #In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
library(caret)
preproc = preProcess(limitedTrain) normTrain = predict(preproc, limitedTrain) normTest = predict(preproc, limitedTest) #What is the mean of the ReturnJan variable in normTrain?
mean(normTrain$ReturnJan)
## [1] 2.100586e-17
#Ans:2.100586e-17 #What is the mean of the ReturnJan variable in normTest?
mean(normTest$ReturnJan)
## [1] -0.0004185886
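For reference in Problem 3.3 below, the raw (unnormalized) January means of the two sets can be compared directly; this quick check is not part of the original code:
mean(stocksTrain$ReturnJan)   # slightly higher than the test-set mean, per the explanation below
mean(stocksTest$ReturnJan)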
#Ans:-0.0004185886 ############################### #PROBLEM 3.3 - CLUSTERING STOCKS #Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest? #Ans:The distribution of the ReturnJan variable is different in the training and testing set #EXPLANATION:From mean(stocksTrain$ReturnJan) and mean(stocksTest$ReturnJan), we see that the average return in January is slightly higher in the training set than in the testing set. Since normTest was constructed by subtracting the mean ReturnJan value from the training set, this explains why the mean value of ReturnJan is slightly negative in normTest. ##################################### #PROBLEM 3.4 - CLUSTERING STOCKS #Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.
#####################################
#PROBLEM 3.4 - CLUSTERING STOCKS

#Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.

set.seed(144)
km <- kmeans(normTrain, centers=3, iter.max=1000)

#We can see the number of observations in each cluster by:
km$size
## [1] 3157 4696  253
#or
table(km$cluster)
## 
##    1    2    3 
## 3157 4696  253
#Which cluster has the largest number of observations?
#Ans:Cluster 2

##################################
#PROBLEM 3.5 - CLUSTERING STOCKS

#Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

#Obtain training set and testing set cluster assignments:
library(flexclust)
km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)

#How many test-set observations were assigned to Cluster 2?
#We can obtain the breakdown of the testing-set clusters with:
table(clusterTest)
## clusterTest
##    1    2    3 
## 1298 2080   96
#or, getting it directly:
sum(clusterTest == 2) #test-set observations assigned to Cluster 2
## [1] 2080
#Ans:2080

#################################
#PROBLEM 4.1 - CLUSTER-SPECIFIC PREDICTIONS

#Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

#We can obtain the necessary subsets with:
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)
#Which training set data frame has the highest average value of the dependent variable?
mean(stocksTrain1$PositiveDec)
## [1] 0.6024707
mean(stocksTrain2$PositiveDec)
## [1] 0.5140545
mean(stocksTrain3$PositiveDec)
## [1] 0.4387352
#or a better way
#Building stocksTrain[1,2,3] from the cluster assignments in clusterTrain, and stocksTest[1,2,3] from clusterTest
stocksTrain11 <- split(stocksTrain, clusterTrain)
stocksTest11 <- split(stocksTest, clusterTest)
sapply(stocksTrain11, function(s){ mean(s$PositiveDec) })
##         1         2         3 
## 0.6024707 0.5140545 0.4387352
#Ans:stocksTrain1
#EXPLANATION:We see that stocksTrain1 has the observations with the highest average value of the dependent variable.

####################################
#PROBLEM 4.2 - CLUSTER-SPECIFIC PREDICTIONS

#Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

#We can build the logistic models with:
StocksModel1 = glm(PositiveDec ~ ., data=stocksTrain1, family=binomial)
StocksModel2 = glm(PositiveDec ~ ., data=stocksTrain2, family=binomial)
StocksModel3 = glm(PositiveDec ~ ., data=stocksTrain3, family=binomial)
summary(StocksModel1)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7307  -1.2910   0.8878   1.0280   1.5023  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.17224    0.06302   2.733  0.00628 ** 
## ReturnJan    0.02498    0.29306   0.085  0.93206    
## ReturnFeb   -0.37207    0.29123  -1.278  0.20139    
## ReturnMar    0.59555    0.23325   2.553  0.01067 *  
## ReturnApr    1.19048    0.22439   5.305 1.12e-07 ***
## ReturnMay    0.30421    0.22845   1.332  0.18298    
## ReturnJune  -0.01165    0.29993  -0.039  0.96901    
## ReturnJuly   0.19769    0.27790   0.711  0.47685    
## ReturnAug    0.51273    0.30858   1.662  0.09660 .  
## ReturnSep    0.58833    0.28133   2.091  0.03651 *  
## ReturnOct   -1.02254    0.26007  -3.932 8.43e-05 ***
## ReturnNov   -0.74847    0.28280  -2.647  0.00813 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4243.0  on 3156  degrees of freedom
## Residual deviance: 4172.9  on 3145  degrees of freedom
## AIC: 4196.9
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel2)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2012  -1.1941   0.8583   1.1334   1.9424  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.10293    0.03785   2.719 0.006540 ** 
## ReturnJan    0.88451    0.20276   4.362 1.29e-05 ***
## ReturnFeb    0.31762    0.26624   1.193 0.232878    
## ReturnMar   -0.37978    0.24045  -1.579 0.114231    
## ReturnApr    0.49291    0.22460   2.195 0.028189 *  
## ReturnMay    0.89655    0.25492   3.517 0.000436 ***
## ReturnJune   1.50088    0.26014   5.770 7.95e-09 ***
## ReturnJuly   0.78315    0.26864   2.915 0.003554 ** 
## ReturnAug   -0.24486    0.27080  -0.904 0.365876    
## ReturnSep    0.73685    0.24820   2.969 0.002989 ** 
## ReturnOct   -0.27756    0.18400  -1.509 0.131419    
## ReturnNov   -0.78747    0.22458  -3.506 0.000454 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6506.3  on 4695  degrees of freedom
## Residual deviance: 6362.2  on 4684  degrees of freedom
## AIC: 6386.2
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel3)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9146  -1.0393  -0.7689   1.1921   1.6939  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.181896   0.325182  -0.559   0.5759  
## ReturnJan   -0.009789   0.448943  -0.022   0.9826  
## ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
## ReturnMar    0.674179   0.564790   1.194   0.2326  
## ReturnApr    1.281466   0.602672   2.126   0.0335 *
## ReturnMay    0.762512   0.647783   1.177   0.2392  
## ReturnJune   0.329434   0.408038   0.807   0.4195  
## ReturnJuly   0.774164   0.729360   1.061   0.2885  
## ReturnAug    0.982605   0.533158   1.843   0.0653 .
## ReturnSep    0.363807   0.627774   0.580   0.5622  
## ReturnOct    0.782242   0.733123   1.067   0.2860  
## ReturnNov   -0.873752   0.738480  -1.183   0.2367  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 346.92  on 252  degrees of freedom
## Residual deviance: 328.29  on 241  degrees of freedom
## AIC: 352.29
## 
## Number of Fisher Scoring iterations: 4
#or a better way, using a custom function
#Create logistic regression models 1, 2 and 3 from stocksTrain11[[1]], stocksTrain11[[2]] and stocksTrain11[[3]]
stocksModels <- lapply(stocksTrain11, function(s){ glm(PositiveDec ~ ., family=binomial, data=s) })
sapply(stocksModels, function(m){ m$coefficients })
##                       1          2            3
## (Intercept)  0.17223985  0.1029318 -0.181895809
## ReturnJan    0.02498357  0.8845148 -0.009789345
## ReturnFeb   -0.37207369  0.3176221 -0.046883260
## ReturnMar    0.59554957 -0.3797811  0.674179495
## ReturnApr    1.19047752  0.4929105  1.281466189
## ReturnMay    0.30420906  0.8965492  0.762511555
## ReturnJune  -0.01165375  1.5008787  0.329433917
## ReturnJuly   0.19769226  0.7831487  0.774164370
## ReturnAug    0.51272941 -0.2448602  0.982605385
## ReturnSep    0.58832685  0.7368522  0.363806823
## ReturnOct   -1.02253506 -0.2775631  0.782242086
## ReturnNov   -0.74847186 -0.7874737 -0.873752144
#Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3, and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.
#Ans:ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct differ in sign between the models.
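#A quick way to verify this programmatically (not part of the original write-up) is to flag the rows of the coefficient matrix above whose values change sign across the three models:

coef.mat <- sapply(stocksModels, coef)                            #12 x 3 matrix of coefficients
flips <- apply(coef.mat, 1, function(x) any(x > 0) & any(x < 0))  #TRUE where the sign differs across models
names(flips)[flips]                                               #note: this also flags (Intercept), which is not among the answer choices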
###########################################
#PROBLEM 4.3 - CLUSTER-SPECIFIC PREDICTIONS

#Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.

#The out-of-sample predictions can be obtained with:
PredictTest1 = predict(StocksModel1, newdata = stocksTest1, type="response")
PredictTest2 = predict(StocksModel2, newdata = stocksTest2, type="response")
PredictTest3 = predict(StocksModel3, newdata = stocksTest3, type="response")

#The out-of-sample confusion/classification matrices, using a threshold of 0.5, are:
cmat1 <- table(stocksTest1$PositiveDec, PredictTest1 > 0.5)
cmat1
##    
##     FALSE TRUE
##   0    30  471
##   1    23  774
cmat2 <- table(stocksTest2$PositiveDec, PredictTest2 > 0.5)
cmat2
##    
##     FALSE TRUE
##   0   388  626
##   1   309  757
cmat3 <- table(stocksTest3$PositiveDec, PredictTest3 > 0.5)
cmat3
##    
##     FALSE TRUE
##   0    49   13
##   1    21   13
#Let's now compute the out-of-sample overall accuracies:
accu_1 <- (cmat1[1,1] + cmat1[2,2])/sum(cmat1)
accu_1
## [1] 0.6194145
accu_2 <- (cmat2[1,1] + cmat2[2,2])/sum(cmat2)
accu_2
## [1] 0.5504808
accu_3 <- (cmat3[1,1] + cmat3[2,2])/sum(cmat3)
accu_3
## [1] 0.6458333
#or a better way, using a custom function
#Make predictions using a threshold of 0.5 and compute accuracies, applying stocksModels[[i]] to stocksTest11[[i]]
predictions <- sapply(1:3, function(i) {
  p <- predict(stocksModels[[i]], newdata=stocksTest11[[i]], type="response")
  conf.mat <- table(stocksTest11[[i]]$PositiveDec, p > 0.5)
  accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
  list(predict=p, accuracy=accuracy)
})
predictions
##          [,1]         [,2]         [,3]      
## predict  Numeric,1298 Numeric,2080 Numeric,96
## accuracy 0.6194145    0.5504808    0.6458333 
#What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?
#Ans:0.6194145  #(30 + 774)/(30 + 471 + 23 + 774) = 0.6194145

#What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?
#Ans:0.5504808  #(388 + 757)/(388 + 626 + 309 + 757) = 0.5504808

#What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?
#Ans:0.6458333  #(49 + 13)/(49 + 13 + 21 + 13) = 0.6458333
#############################################
#PROBLEM 4.4 - CLUSTER-SPECIFIC PREDICTIONS

#To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
AllPredictions = c(PredictTest1, PredictTest2, PredictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

#What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?
#Computing the overall confusion matrix of the cluster-then-predict approach
cmatoverall <- table(AllOutcomes, AllPredictions > 0.5)
cmatoverall
##            
## AllOutcomes FALSE TRUE
##           0   467 1110
##           1   353 1544
#Let's now compute the overall out-of-sample accuracy:
accu_overall <- (cmatoverall[1,1] + cmatoverall[2,2])/sum(cmatoverall)
accu_overall  #(467 + 1544)/(467 + 1110 + 353 + 1544) = 0.5788716
## [1] 0.5788716
#Ans:0.5788716
#EXPLANATION:This tells us that the overall accuracy is (467 + 1544)/(467 + 1110 + 353 + 1544) = 0.5788716.
#We see a modest improvement over the original logistic regression model. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.
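#For context (this comparison is not one of the graded questions), the naive baseline on the combined test set (always predicting PositiveDec = 1) matches the 0.5460564 baseline accuracy computed earlier, so the 0.5788716 above is indeed an improvement:

baseline_test <- table(AllOutcomes)
max(baseline_test)/sum(baseline_test)  #accuracy of always predicting the most frequent outcome (PositiveDec = 1)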
sessionInfo()