Table of Contents

1.1 Introduction
    1.1.1 Why?
    1.1.2 About Google Analytics
    1.1.3 About R
    1.1.4 Author
1.2 Prepare environment
    1.2.1 Data sources
    1.2.2 Creating Google Analytics account
    1.2.3 Getting credentials for Google Analytics API
    1.2.4 Installing Google Analytics on website
    1.2.5 Installing R Studio
    1.2.6 Summary
1.3 First steps
    1.3.1 Introduction to R
    1.3.2 Connection with Google Analytics
    1.3.3 googleAnalyticsR package
    1.3.4 Import and export data to CSV
    1.3.5 Code repository
    1.3.6 Summary
1.4 Exploratory data analysis
    1.4.1 Exploratory data analysis
1.5 Data visualization
    1.5.1 Data visualization in R
    1.5.2 Traffic heatmap
    1.5.3 Device comparison
1.6 Machine Learning
    1.6.1 Clustering (k-means)
1.7 Generating reports
    1.7.1 Introduction to R Markdown
    1.7.2 Create report
1.8 Additional analysis
    1.8.1 Anomaly detection
    1.8.2 Forecasting
1.9 Resources
    1.9.1 Blogs
    1.9.2 Documentation
    1.9.3 Online trainings
    1.9.4 Books
Introduction
Using Google Analytics with R

This book is a practical guide to analyzing data from Google Analytics in R. In this book you will learn:
- What Google Analytics is and how to collect web traffic data with this tool.
- What R is and how to analyze data from Google Analytics in R Studio.
- How to discover the knowledge hidden in data about traffic on your website.

Feel free to share this book and read it online or offline. Thanks to Gitbook.io you can download it in different formats: printable .pdf and formats for e-book readers like .mobi and .epub.

This is still a development version. If you want to help develop this book, feel free to contact the author via: about.me michalbrys.com
Why?
I decided to write this book to show how much value is hidden in data. If you have a website, you are probably collecting data about web traffic. But do you use this data to make business decisions? Nowadays we are swimming in a data lake. Only if you know how to use this data will you stay on the surface :). The first step is to regularly check the standard reports in your web analytics tool (e.g. Google Analytics). But to stay competitive you need something more. Everybody talks about data collection, but only a few tell you what to do with the data after collecting it. I try to describe this process and give you some ideas on how to deal with data from Google Analytics using R. In this book I will share my experience in this field. I hope that it will be useful, interesting, sometimes funny, and that it will save you time :)

Target audience
I wrote this book for marketers who have worked with Google Analytics and know its basic metrics and web interface. I hope this material will help you learn how to extend the features of Google Analytics in your daily work and how to use R. If you are an analyst who knows R perfectly, I hope you will also find some inspiration in this book, especially in learning how to connect Google Analytics as an additional data source in R and what kinds of analysis you can perform on this data.
About Google Analytics
Google Analytics is a free web analytics platform provided by Google. It is also the most popular free web analytics tool on the Internet according to a BuiltWith report (Feb 2016). It is a complete analytics platform offering solutions for collecting, analyzing and reporting data. Google Analytics also offers free APIs to export data to external systems.

Terms of service
A common question is: is this great tool really free? To be precise, according to the Google Analytics Terms of Service:
Service is provided without charge to You for up to 10 million Hits per month per account.
If you exceed this quota, you should think about Google Analytics 360, formerly the Google Analytics Premium service. This paid premium version offers a data collection quota that is many times bigger.

What is a hit?
As you read above, your Google Analytics account has a limit of 10 000 000 hits per month. So what is a hit? According to Google Analytics help:
Hit - An interaction that results in data being sent to Analytics. Common hit types include page tracking hits, event tracking hits, and ecommerce hits. Each time the tracking code is triggered by a user's behavior (for example, user loads a page on a website or a screen in a mobile app), Analytics records that activity. Each interaction is packaged into a hit and sent to Google's servers.
Examples of hit types include:
- page tracking hits
- event tracking hits
- ecommerce tracking hits
- social interaction hits
About R
What is R?
R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years. (Source: Wikipedia)

Pros and cons
The R language is now the fastest growing statistical analysis language. The major advantages of R:
- Free.
- Offers a lot of libraries for different statistical computations (see the current list of packages).
- A lot of educational materials (tutorials, MOOCs, blogs) available free on the Internet.
- Has big community support.
- Ready to run on different platforms (Windows, Mac, Unix). A version for server installation is also available.
- Fast because of in-memory computations.

Disadvantages?
- R is not an out-of-the-box solution with a GUI for all analytical problems. You need to write a chunk of code to get the result. This can sometimes be a barrier for non-technical people to start with. But I hope that if you are reading this book it is not a problem for you :)
- The advantage of in-memory computations is sometimes a trap. In a standard installation you can only process a data set which fits into the RAM of your machine. If you have really big data to process, think about other solutions like Hadoop (MapReduce) or Apache Spark. If you feel comfortable with R you can run your scripts on those platforms (reading from HDFS or using SparkR). That is a more advanced topic for another book ;)
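To get a feel for the in-memory limit, you can check how much RAM an object occupies. The following is a minimal sketch using base R's object.size(); the data frame and its name (traffic) are made up for illustration and do not come from Google Analytics.

```r
# Build an illustrative data frame (made-up data, not from Google Analytics)
n <- 100000
traffic <- data.frame(
  day = seq_len(n),
  sessions = sample(0:200, n, replace = TRUE)
)

# object.size() reports the approximate memory used by an R object
size_bytes <- as.numeric(object.size(traffic))
size_mb <- size_bytes / 1024^2

# Human-readable size in megabytes
print(format(object.size(traffic), units = "Mb"))
```

If a data set of this shape grew a few thousand times larger, it would no longer fit comfortably in the RAM of a typical laptop, which is where the alternatives mentioned above come in.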
Author
Michał Bryś
Data scientist

Michał has worked in the internet industry since 2009. He is an expert in web analytics in the e-commerce context, especially using Google Analytics & Google Tag Manager. He loves mining big data sets and transforming information into actionable knowledge. He loves creating stories from numbers. He graduated from AGH University of Science and Technology and the University of Economics in Cracow. Michał is a member of Google Developers Group Cracow.

Feel free to contact the author: about.me michalbrys.com
Prepare environment
To analyze data you will need to set up:
- A Google Analytics account
- R Studio
- Credentials to connect to the Google Analytics API, created in Google Developers Console (free)

I will describe these steps in detail in this chapter.
Data sources
Below you can find the most common scenarios.

I have a website WITHOUT Google Analytics tracking
Please read this whole chapter. I will lead you through the Google Analytics installation process, and then you can start collecting data from your website.

I have a website WITH Google Analytics tracking
You can go directly to the Getting credentials for Google Analytics API chapter.

I don't have a website or access to a Google Analytics account
If you are an analyst who knows R and wants to learn about analyzing data from Google Analytics, I recommend one of these options.

NGOs, university, friends, family
Contact local NGOs who might want your help. They usually have websites with some traffic. By installing Google Analytics and analyzing the traffic you can help these organizations and do something good for your community. You can also join The Analytics Exchange here: www.webanalyticsdemystified.com. It is a community connecting web analysts with NGOs looking for help with digital analytics. You can also contact your university, family and friends and offer help with digital analytics.

Google Analytics demo account
The Google Analytics team has prepared a demo account with data from the Google Merchandise Store. You can access this account here: support.google.com/analytics/answer/6367342

About this data:
The data in the Google Analytics demo account is from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google-branded merchandise. The data in the account is typical of what you would see for an ecommerce website. It includes the following kinds of information:
- Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.
- Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.
- Transactional data: information about the transactions that occur on the Google Merchandise Store website.
Creating Google Analytics account
Get your unique tracking ID
To set up your Google Analytics account, go to google.com/analytics/ and register. If you don't have any Analytics accounts connected with your Google Account you will see this screen:

Account details
To create a Google Analytics account, fill in the form with:
- Account Name. (Note: one Account may have several tracking IDs, so you can use one Account per organization/company with many websites.)

In the next steps create your unique tracking ID:
- Insert Website Name, Website URL and Reporting Time Zone. (Note: the correct time zone is critically important - your data will be divided into dates in reports using this value.)

Data Sharing Settings
You can change the data sharing settings. (Note: disabling them has no impact on data collection.)

Complete registration process
To complete the registration process, click Get Tracking ID and accept the Google Analytics Terms of Service. After this you will see instructions on how to install the Google Analytics Tracking Code on your website:

Install Google Analytics Tracking Code (GATC) on your website
To start collecting data from your website you need to insert this code on every page. Personally I recommend installing it via Google Tag Manager. Further details: Installing Google Analytics on your website
Getting credentials for Google Analytics API
Note: you can use the default credentials included in the googleAnalyticsR package, but that API quota is shared by all googleAnalyticsR users. To guarantee that the API quota is only for you, please create your own credentials. I describe this process below.

- Navigate to Google Developers Console and create a new project.
- Enable the Google Analytics API: search for Analytics and select Enable.
- Create credentials.
- Get credentials.
- Save the Client ID and Client Secret. You need them to configure the library that gets data from Google Analytics into R.
Installing Google Analytics on website
Go to Google Analytics > Admin > Tracking Info > Tracking Code and get the tracking code (GATC). Example code:

<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-11111111-1', 'auto');
ga('send', 'pageview');
</script>

Install the tracking code on your website, between the <head> and </head> tags, on every page you want to track. To do this you need access to your website's source code, or you can contact your webmaster. Alternatively you can install Google Analytics via Google Tag Manager. I personally recommend that way because it will save you a lot of time in the future :)
Installing R Studio
Go to the R Studio website and download R Studio Desktop - a graphical interface tool for the R language.

R Studio is available for multiple platforms: Windows, Mac, Linux. You can use R Studio for free under AGPL 3. A paid version with more functionality is also available.

Installing extra packages
One of the biggest advantages of R is the thousands of libraries which extend its functionality. You can browse these packages in the CRAN repository. To install a new package in R, type:
install.packages("package name")
For example, to install the ggplot2 plotting library, type:
install.packages("ggplot2")
After installing a package and before using it, you should load it into the current session:
library("ggplot2")
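Scripts that you run repeatedly do not need to reinstall packages every time. The sketch below shows one possible way to install a package only when it is missing; the helper name ensure_package is hypothetical (it is not part of R or R Studio), while requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: install a package only when it is not available yet
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers an installation
has_stats <- ensure_package("stats")
```

In your own scripts you could call ensure_package("ggplot2") before library("ggplot2").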
Summary
In this chapter you learned:
- How to create an account, configure and install Google Analytics on your website.
- How to download and set up R Studio.
- How to get credentials to download data from Google Analytics into R.
First steps
In this chapter you will take your first steps in R Studio and make a connection via the API between Google Analytics and R, using the environment and the Google Analytics account you have set up.
Introduction to R
Try typing some basic instructions in the console (the bottom-left window in R Studio). Run each instruction by pressing the [Enter] key.

Arithmetic operations
> 1+1
[1] 2
> 2*4
[1] 8

Using variables
You can assign a value to a variable using the <- (more popular) or = operator. You can find some basic examples below.

Numeric variables
> x <- 1+1
> x
[1] 2

Text variables
> z <- "Hello world"
> z
[1] "Hello world"

Vectors
> v <- c(1,2,3,4,5)
> v
[1] 1 2 3 4 5
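Most operations in R work on whole vectors at once, which is why vectors are so convenient for columns of metrics. A short sketch of a few common operations on the vector defined above (the variable names other than v are only illustrative):

```r
v <- c(1, 2, 3, 4, 5)

doubled     <- v * 2      # arithmetic is applied to every element
total       <- sum(v)     # sum of all elements
average     <- mean(v)    # arithmetic mean
first_two   <- v[1:2]     # select elements by position (R counts from 1)
above_three <- v[v > 3]   # select elements that meet a condition
```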
Data frames
More widely used than the one-dimensional vector is a tabular data structure called data.frame. We'll also save the data returned from a Google Analytics API query as a data.frame.

Creating data frame
Let's create a simple data frame (e.g. the number of sessions by city on 2016-01-01):
df <- data.frame(
  date = c("20160101","20160101","20160101",
           "20160101","20160101","20160101","20160101"),
  city = c("London","Warsaw","Krakow",
           "New York","Paris","Zurich","Sydney"),
  sessions = c(101,80,70,50,30,60,20)
)

To display the whole data frame, type its name:
> df
      date     city sessions
1 20160101   London      101
2 20160101   Warsaw       80
3 20160101   Krakow       70
4 20160101 New York       50
5 20160101    Paris       30
6 20160101   Zurich       60
7 20160101   Sydney       20
Basic operations on data frame
To preview a data frame (by default the first 6 rows, useful for bigger data sets):
> head(df)
      date     city sessions
1 20160101   London      101
2 20160101   Warsaw       80
3 20160101   Krakow       70
4 20160101 New York       50
5 20160101    Paris       30
6 20160101   Zurich       60

To display the column names of a data frame:
> colnames(df)
[1] "date"     "city"     "sessions"

You can refer to a column with the dataframe$colname operator:
> df$city
[1] London   Warsaw   Krakow   New York Paris    Zurich   Sydney
Levels: Krakow London New York Paris Sydney Warsaw Zurich

And select only the unique values of a column (we have sessions for only one date: 2016-01-01):
> unique(df$date)
[1] 20160101
Levels: 20160101
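The "Levels:" lines in the output above appear because R versions before 4.0 stored text columns of a data frame as factors by default; newer versions keep them as plain character vectors. The sketch below makes that choice explicit with stringsAsFactors = FALSE and also converts the YYYYMMDD text into a real Date. The data frame name cities is only illustrative.

```r
cities <- data.frame(
  date = c("20160101", "20160101", "20160101"),
  city = c("London", "Warsaw", "Krakow"),
  sessions = c(101, 80, 70),
  stringsAsFactors = FALSE  # keep text columns as plain character vectors
)

# Convert the YYYYMMDD string into a Date object
cities$date <- as.Date(cities$date, format = "%Y%m%d")

class(cities$city)    # "character"
class(cities$date)    # "Date"
unique(cities$date)   # a single date: 2016-01-01
```

Working with Date objects instead of text makes later steps such as plotting over time or computing differences between days more reliable.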
Alternatively, you can select columns and rows by number:
df[rownumber,colnumber]

Select column 2:
> df[,2]
[1] London   Warsaw   Krakow   New York Paris    Zurich   Sydney
Levels: Krakow London New York Paris Sydney Warsaw Zurich

Select row 1:
> df[1,]
      date   city sessions
1 20160101 London      101

Select only one element:
> df[1,1]
[1] 20160101
Levels: 20160101

These basic operations are enough to start your journey with the R language :)
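The same square-bracket syntax also accepts conditions, which is how you filter and sort data frames in base R. A minimal sketch using the cities and session counts from the example above (the names busy, ranked and share are illustrative):

```r
df <- data.frame(
  city = c("London", "Warsaw", "Krakow", "New York", "Paris", "Zurich", "Sydney"),
  sessions = c(101, 80, 70, 50, 30, 60, 20)
)

# Keep only rows where the condition is TRUE (all columns)
busy <- df[df$sessions > 60, ]

# Sort rows by sessions, highest first
ranked <- df[order(df$sessions, decreasing = TRUE), ]

# Add a derived column: share of total sessions in percent
df$share <- round(100 * df$sessions / sum(df$sessions), 1)
```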
Connection with Google Analytics
To easily download data directly from the Google Analytics servers into R Studio via the API, you have to extend R with an external package. This package lets you easily build a query to the Google Analytics servers, authorize the connection and fetch the data to your computer. External packages are one of the biggest advantages of R. So let's try!

Install package googleAnalyticsR
As a first step, install the libraries in R Studio:
install.packages("googleAuthR")
install.packages("googleAnalyticsR")

When that's done, load the libraries into the current R session:
library("googleAuthR")
library("googleAnalyticsR")

Configure connection between R and Google Analytics API
Configure the package with the credentials from Google Developers Console (how to get them? See Getting credentials for Google Analytics API):
# optional - add your own Google Developers Console key
options(googleAuthR.client_id = "uxxxxxxx2fd4kesu6.apps.googleusercontent.com")
options(googleAuthR.client_secret = "3JhLa_GxxxxxCQYLe31c64")
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/analytics")
# authorize connection with Google Analytics servers
ga_auth()

You will be asked to authorize R to download data from Google Analytics, and your browser will open the authorization page. Click Agree.

All done. You can now start sending queries via the Google Analytics API.
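If you share your scripts (for example on GitHub), you may prefer not to paste the Client ID and Client Secret directly into the code. One possible approach, sketched below, is to read them from environment variables. The variable names GA_CLIENT_ID and GA_CLIENT_SECRET are assumptions made for this example; only the option names come from the configuration shown above.

```r
# Hypothetical variable names; set them once, e.g. in your ~/.Renviron file
client_id <- Sys.getenv("GA_CLIENT_ID")
client_secret <- Sys.getenv("GA_CLIENT_SECRET")

if (nzchar(client_id) && nzchar(client_secret)) {
  # Same options as in the configuration above, without hard-coded secrets
  options(googleAuthR.client_id = client_id)
  options(googleAuthR.client_secret = client_secret)
} else {
  message("GA_CLIENT_ID / GA_CLIENT_SECRET not set; using the package defaults")
}
```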
First query - "Hello world"
Make your first query to Google Analytics via R:
## get your accounts
account_list <- google_analytics_account_list()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using the tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000
# Get the Sessions by Date in 2016
gadata <- google_analytics(id = ga_id,
                           start="2016-01-01", end="2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

How to get your table.id?
The first time it may be a little tricky. The ga_id parameter identifies your website data (specifically, a unique view) on the Google Analytics servers. Where can you find this id?

Tool using Google Analytics Management API
You can use my tool to get the table.id. Navigate to michalbrys.github.io/ga-tools/ and follow the instructions.

Copy from your Google Analytics web interface link
Navigate to the Admin section of your Google Analytics account. Select the account, property and view which you want to query. You will see this screen:

Your Google Analytics table.id parameter is the last number in the URL. So if your current URL is:
https://analytics.google.com/analytics/web/?authuser=0#management/Settings/a11111111w22222222p33333333/
then in the query parameters in your R script you need to type:
...
ga_id <- 33333333
...
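If you prefer not to copy the number by hand, the rule "last number in the URL" can be expressed as a regular expression. This is only a sketch: the helper name extract_view_id is hypothetical, and it assumes the URL has the a...w...p... shape shown above.

```r
# Hypothetical helper: return the digits that follow the last "p" in the URL
extract_view_id <- function(url) {
  pattern <- ".*p([0-9]+)/?$"
  if (!grepl(pattern, url)) {
    stop("Could not find a view ID in this URL")
  }
  sub(pattern, "\\1", url)
}

url <- "https://analytics.google.com/analytics/web/?authuser=0#management/Settings/a11111111w22222222p33333333/"
ga_id <- extract_view_id(url)
ga_id  # "33333333"
```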
Display results
After you have successfully run your first query, you can check the results fetched from Google Analytics. Display the first 6 rows of the result:
head(gadata)
      date sessions
1 20140101       39
2 20140102       46
3 20140103       47
4 20140104       53
5 20140105       49
6 20140106       15

Congrats! You've downloaded your first data set from your Google Analytics account!

Source code
Complete code for this example is in the GitHub repository: https://github.com/michalbrys/R-Google-Analytics/blob/master/1_hello_world.R
googleAnalyticsR package
To pull data from Google Analytics into R we'll use the googleAnalyticsR package by Mark Edmondson. With this add-on you can use all the features of Google Analytics, including the latest ones like Google Analytics 360 or the integration with BigQuery.

Short list of googleAnalyticsR features:
- First Google Analytics Reporting v4 API library for R
- v4 features include: dynamic calculated metrics, pivots, histograms, date comparisons, batching
- v4 API explorer
- API metadata of possible metrics and dimensions
- Multi-user login in Shiny App
- Integration with BigQuery Google Analytics Premium/360 exports
- Single authentication flow can be used with other googleAuthR apps like searchConsoleR
- Automatic batching, sampling avoidance with daily walk, multi-account fetching, multi-channel funnel
- Support for googleAuthR batch. For big data calls this could be 10x quicker than normal GA fetching
- Meta data in attributes of returned dataframe including date ranges, totals, min and max

You can read the docs on Mark's website: http://code.markedmondson.me/googleAnalyticsR/ or check the docs in the CRAN repository: https://cran.r-project.org/web/packages/googleAnalyticsR/vignettes/googleAnalyticsR.html
Import and export data to CSV
Using R you can import data from external data sources. One of the most popular scenarios is analyzing data from a .csv file. I will describe this use case below.

Import data from file
To import data from a .csv file into R, use the read.csv function. For example, to import a file named file_to_import.csv from your working directory and save the data in the data frame df:
df <- read.csv('file_to_import.csv')

Sometimes you need extra options such as header or separator. If the first line of your file holds data rather than column names, use the header = FALSE option. If your column separator is something other than a comma (,), declare it with the sep option, e.g. sep=';'. Example code with options:
df <- read.csv('file_to_import.csv', header = FALSE, sep=';')
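To see what the header and sep options actually do, you can try them on a tiny file. The sketch below writes a two-row, semicolon-separated file to a temporary location (the file content is made up) and reads it back in both ways.

```r
# Create a small semicolon-separated file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date;sessions", "20160101;101", "20160102;80"), csv_path)

# header = TRUE: the first line holds column names; sep = ";" sets the separator
imported <- read.csv(csv_path, header = TRUE, sep = ";")

# header = FALSE: the first line is treated as data, columns are named V1, V2
raw <- read.csv(csv_path, header = FALSE, sep = ";")

nrow(imported)  # 2
nrow(raw)       # 3
```

Note that header = FALSE does not skip the first line; it only stops R from using it as column names.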
Where is my working directory?
To check your working directory, type in the R console:
getwd()
[1] "/Users/michal"
Export data
After conducting your analysis you may want to save the results in a file to use them in other tools. To do this you need the write.csv function.

If you have your data in a data frame called gadata, you can use this code:
write.csv(gadata, file = "exported_data.csv")

As a result R will export the data to a .csv file. You can open it in any text editor or spreadsheet (e.g. Microsoft Excel). Another use case is uploading data as a custom dimension or campaign cost data to Google Analytics.

Where can I find the saved file?
Your .csv file is in your working directory.
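By default write.csv also writes the row numbers as an extra first column, which other tools often do not expect. A minimal sketch of an export without that column, followed by a read-back check; the data and the temporary output path are made up for illustration.

```r
gadata <- data.frame(date = c("20160101", "20160102"), sessions = c(101, 80))

# Write to a temporary directory so the example leaves your working directory untouched
out_path <- file.path(tempdir(), "exported_data.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(gadata, file = out_path, row.names = FALSE)

# Read the file back to confirm that the content survived the round trip
check <- read.csv(out_path)
```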
Code repository
You can find the R source code for all examples described in this book in my GitHub repository: github.com/michalbrys/R-Google-Analytics

Feel free to contribute if you find an issue in the code or if you want to share your own examples.
Summary
In this chapter you learned:
- How to perform basic arithmetic operations in R.
- How to deal with basic data structures in R.
- How to load external packages into R.
- How to connect Google Analytics and R.
- How to import data from a file into R and export it again.
Exploratory data analysis
When analyzing data you can use this three-step framework:
1. Load your data
   i. Download it from the Google Analytics API.
2. Know your data
   i. Do some exploratory data analysis to better understand your data.
3. Do the main data analysis
   i. Apply e.g. machine learning algorithms.

In this part I describe some basic exploratory data analysis operations.
Exploratory data analysis
Download your data and save it in the data frame gadata:
# Get the Sessions by Date in 2014
gadata <- google_analytics(id = ga_id,
                           start="2014-01-01", end="2014-12-31",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

Let's do some basic operations.

Min
What was the minimum number of sessions in 2014?
min(gadata$sessions)
[1] 0

Number of days with 0 sessions recorded
This looks like a tracking error with no data for some days. When did it happen? Display the days with 0 sessions:
subset(gadata, sessions == 0)
        date sessions
7   20140107        0
8   20140108        0
129 20140509        0
130 20140510        0
131 20140511        0
132 20140512        0
133 20140513        0
134 20140514        0
135 20140515        0

How many days had 0 sessions? Use the nrow() function to count the rows that meet this condition:
nrow(subset(gadata, sessions == 0))
[1] 9
So there were 9 days with 0 sessions.

Max
When was the traffic on your website the highest? Use the max() function:
> max(gadata$sessions)
[1] 204
So the highest traffic was 204 sessions in one day. When was it?
subset(gadata, sessions == 204)
       date sessions
59 20140228      204

You can get the same result in one call by replacing the value with max(). It is shorter but harder to read:
subset(gadata, sessions == max(sessions))
       date sessions
59 20140228      204
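Two base R shortcuts give the same answers with less typing: sum() on a condition counts the matching rows, and which.max() returns the position of the largest value. The sketch below uses a small made-up data frame instead of real Google Analytics data.

```r
# Made-up daily sessions for illustration
gadata <- data.frame(
  date = c("20140101", "20140102", "20140103", "20140104", "20140105"),
  sessions = c(39, 0, 204, 0, 49)
)

# A comparison gives TRUE/FALSE values; sum() counts the TRUE ones
zero_days <- sum(gadata$sessions == 0)

# which.max() returns the row index of the largest value
busiest <- gadata[which.max(gadata$sessions), ]
```

In this made-up data zero_days is 2 and busiest holds the row for 20140103.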
Mean
What is the mean number of sessions per day? To calculate it, use the mean() function:
mean(gadata$sessions)
[1] 27.6
So the average number of sessions per day is 27.6.

Standard deviation
You can check how much the number of sessions varies from day to day. Use the sd() function:
sd(gadata$sessions)
[1] 22.12984
So the average number of sessions is 27.6 +/- 22.12984. This dataset varies a lot, so in this case it is better not to trust the average value alone.

Median
If a dataset has a high standard deviation, it is better to also calculate the median (the middle value of the sorted dataset):
median(gadata$sessions)
[1] 21
The median is 21 sessions per day: on half of the days the website had 21 sessions or fewer.
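A short made-up example shows why the median is the safer summary when a few days are unusually busy: one peak day pulls the mean up a lot, while the median barely moves.

```r
# Made-up daily sessions with one unusually busy day (204)
sessions <- c(10, 12, 15, 18, 21, 22, 25, 30, 204)

mean(sessions)    # pulled up by the single peak day
median(sessions)  # 21, the middle value of the sorted data

# The same data without the peak day
without_peak <- sessions[sessions < 100]
mean(without_peak)
median(without_peak)
```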
Summary
If you want, you can get all of these statistics with one function: summary.
summary(gadata)
     date              sessions
 Length:365         Min.   :  0.0
 Class :character   1st Qu.: 12.0
 Mode  :character   Median : 21.0
                    Mean   : 27.6
                    3rd Qu.: 40.0
                    Max.   :204.0

As a result you get basic statistics for numeric variables and a description for character variables.

Source code
Complete code for this example is in the GitHub repository: github.com/michalbrys/R-Google-Analytics/blob/master/2_eda.R
Data visualization
Data visualization in R
We'll do some exploratory data analysis by visualizing data from Google Analytics in R. R has a wide range of visualization packages. My favourite is ggplot2.

Package ggplot2
According to the ggplot2 project site:
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multilayered graphics.
Full documentation: docs.ggplot2.org

This is my favourite visualization package in R because of:
- Nice chart design.
- Flexibility.
- Wide range of chart types.
- Extension packages, e.g. ggthemes.

You can also check alternatives like Plotly or R base graphics. The examples in this book are made with ggplot2.

Using ggplot2
Download data to visualize in chart
As a first step, install (if necessary) and load the package in the current session:
install.packages("ggplot2")
library("ggplot2")

Next, build a query to fetch the date and the number of sessions:
gadata <- google_analytics(id = ga_id,
                           start="2016-01-01", end="2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

Display the first 6 rows of the result:
head(gadata)
        date sessions
1 2016-01-01      199
2 2016-01-02      212
3 2016-01-03      155
4 2016-01-04      210
5 2016-01-05      192
6 2016-01-06      180

Scatter plot
Plot the data over time (scatter plot):
ggplot(gadata, aes(x=date, y=sessions)) +
  geom_point()

As a result you will get a basic scatter plot with sessions over time. Each point represents the number of sessions on a particular day.
47
Data visualization in R
As you see this plot isn't very nice because of a-axis labels. You can fix this using 90-degree pivot. Add line: theme(axis.text.x = element_text(angle = 90, hjust = 1))
So complete example with pivoted x-axis labels: ggplot(gadata, aes(x=date, y=sessions)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
And the result:
You can also change point size depending on number of sessions by adding: size = sessions
ggplot(gadata, aes(x=date, y=sessions, size = sessions)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
And the result:
48
Data visualization in R
You can also change color of points adding: color = sessions
(the lighter color the highest number of sessions). Complete code: ggplot(gadata, aes(x=date, y=sessions, size = sessions, color = sessions)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
And the result:
49
Data visualization in R
This type of scatter plot is called bubble chart.
Line chart
Plot the data over time (line chart) with some styles:
ggplot(gadata, aes(x=date, y=sessions, group=1)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotate x-axis labels

As a result you will get a line chart with sessions over time.

Scatter plot with trend line
Sometimes you want to look past the daily values and see what the trend is. Download the data:
gadata <- google_analytics(id = ga_id,
                           start="2016-01-01", end="2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

And now we can plot the data points with an added trend line:
ggplot(data = gadata, aes(x = date, y = sessions)) +
  geom_point() +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

In this plot you can see that the trend is growing :)
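If you want a number to go with the picture, you can estimate the trend with a simple linear regression. Note that this is a different, simpler technique than the smoothed curve drawn by geom_smooth() above, and that the data below is simulated for illustration rather than fetched from Google Analytics.

```r
set.seed(1)  # make the simulated noise repeatable

# Simulated sessions: a baseline of 100 growing by 0.5 per day, plus noise
gadata <- data.frame(date = seq(as.Date("2016-01-01"), by = "day", length.out = 60))
gadata$sessions <- 100 + 0.5 * seq_len(60) + rnorm(60, sd = 5)

# Express the date as "days since the first day" and fit a straight line
gadata$day <- as.numeric(gadata$date - min(gadata$date))
fit <- lm(sessions ~ day, data = gadata)

slope <- unname(coef(fit)["day"])  # estimated change in sessions per day
slope > 0                          # TRUE means the trend is growing
```

A slope close to 0.5 here simply recovers the growth that was built into the simulated data; on real data the slope is the average daily change in sessions over the period.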
Box plot
As part of exploratory data analysis, you can visualize your traffic on different days of the week. Is your website traffic seasonal? Which days are the busiest? Let's check by creating a box plot which illustrates the distribution of the number of sessions on each day of the week.

Build a query to download the data:
gadata <- google_analytics(id = ga_id,
                           start="2016-01-01", end="2016-06-30",
                           metrics = "sessions",
                           dimensions = c("dayOfWeek","date"),
                           max = 5000)

And visualize it on a box plot:
ggplot(data = gadata, aes(x = dayOfWeek, y = sessions)) +
  geom_boxplot()

In Google Analytics, days of the week are numbered using this convention:
0 - Sunday
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday

So in this case, the highest traffic was on Thursday. Fridays are also not bad :)
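The numbering convention above can be turned into readable labels, and the reading of the box plot can be double-checked with a small table of medians. A sketch on made-up data in the same shape as the query result (the column dayName and the other new names are illustrative):

```r
# Made-up data in the shape returned by the query above
gadata <- data.frame(
  dayOfWeek = c("0", "1", "4", "4", "5", "0", "1", "5"),
  sessions  = c(120, 180, 260, 240, 230, 100, 200, 210)
)

# Map Google Analytics day numbers (0 = Sunday) to ordered, readable labels
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
gadata$dayName <- factor(day_names[as.integer(as.character(gadata$dayOfWeek)) + 1],
                         levels = day_names)

# Median sessions per day of week (the thick line inside each box)
medians <- tapply(gadata$sessions, gadata$dayName, median)
busiest_day <- names(which.max(medians))
```

Days without any rows in the data show up as NA in medians, and which.max() ignores them.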
Source code
Complete code for this example is in the GitHub repository: github.com/michalbrys/R-Google-Analytics/blob/master/3_data_visualization.R
Traffic heatmap
We will now build a more advanced data visualization: a user engagement heatmap. The darker the color, the higher the user engagement (avgSessionDuration) at that time of day. Inspired by Todd Moy.

# traffic heatmap
# based on https://github.com/toddmoy/Google-Analytics-Heatmap/blob/master/traffic_heatmap.R
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("RColorBrewer")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("RColorBrewer")
# authorize connection with Google Analytics servers
ga_auth()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using the tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000
gadata <- google_analytics(id = ga_id,
                           start="2012-01-01", end="2016-06-30",
                           metrics = c("avgSessionDuration"),
                           dimensions = c("dayOfWeekName", "hour"),
                           max = 5000)
# order data by day of week (keep the sorted result)
gadata$dayOfWeekName <- factor(gadata$dayOfWeekName,
                               levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                          "Thursday", "Friday", "Saturday"))
gadata <- gadata[order(gadata$dayOfWeekName),]
# convert data frame to xtab
heatmap_data <- xtabs(avgSessionDuration ~ dayOfWeekName + hour, data=gadata)

When the data is ready, we'll prepare the plot:
# plot heatmap
heatmap(heatmap_data,
        col=colorRampPalette(brewer.pal(9,"Blues"))(100),
        revC=TRUE,
        scale="none",
        Rowv=NA, Colv=NA,
        main="avgSessionDuration by Day and Hour",
        xlab="Hour")

And the result is:

In this case, Wednesday morning is the most engaging time of day for users :)
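The key preparation step above is xtabs(), which turns the long table returned by the API (one row per day and hour) into a matrix with days as rows and hours as columns. A tiny made-up example makes the reshaping easier to see:

```r
# Made-up data: one row per day/hour combination
gadata <- data.frame(
  dayOfWeekName = c("Monday", "Monday", "Tuesday", "Tuesday"),
  hour = c("08", "09", "08", "09"),
  avgSessionDuration = c(120, 90, 60, 150)
)

# Rows become days, columns become hours, cells hold the metric value
heatmap_data <- xtabs(avgSessionDuration ~ dayOfWeekName + hour, data = gadata)

dim(heatmap_data)              # 2 rows (days) and 2 columns (hours)
heatmap_data["Tuesday", "09"]  # 150
```

Keep in mind that xtabs() sums values that share the same day and hour, so this works as intended only when the data has a single row per combination, as in the query above.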
Source code
Complete code for this example is in the GitHub repository: github.com/michalbrys/R-Google-Analytics/blob/master/6_heatmap.R
Device comparsion
Device comparison
Let's check how engaged users are on different types of device. To do this, we'll plot 2 charts: one showing how many sessions were made from each device type, and one showing the avgSessionDuration (in seconds) for each device type.

    # device comparison

    # install libraries
    # install.packages("googleAuthR")
    # install.packages("googleAnalyticsR")
    # install.packages("ggplot2")

    # load libraries
    library("googleAuthR")
    library("googleAnalyticsR")
    library("ggplot2")

    # authorize connection with Google Analytics servers
    ga_auth()

    ## pick a profile with data to query
    #ga_id <- account_list[275,'viewId']

    # or give it explicitly using tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
    ga_id <- 00000000

    gadata <- google_analytics(id = ga_id,
                               start = "2015-01-01", end = "2016-06-30",
                               metrics = c("sessions", "avgSessionDuration"),
                               dimensions = c("date", "deviceCategory"),
                               max = 5000)

    # plot sessions with deviceCategory
    ggplot(gadata, aes(deviceCategory, sessions)) +
      geom_bar(aes(fill = deviceCategory), stat = "identity")

    # plot avgSessionDuration with deviceCategory
    ggplot(gadata, aes(deviceCategory, avgSessionDuration)) +
      geom_bar(aes(fill = deviceCategory), stat = "identity")
In this case, the longest sessions were made from mobile devices.
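One thing to keep in mind: because the query includes the date dimension, gadata holds one row per day and device, and geom_bar(stat = "identity") stacks those rows. For sessions the stacked height is the total, but for avgSessionDuration it is the sum of the daily averages. A possible alternative, sketched here on invented numbers, is to aggregate first and weight each day's average by its number of sessions.

```r
# Hypothetical daily rows, shaped like the query result in this section.
demo <- data.frame(
  date = rep(c("20160101", "20160102"), each = 3),
  deviceCategory = rep(c("desktop", "mobile", "tablet"), times = 2),
  sessions = c(100, 50, 10, 300, 150, 30),
  avgSessionDuration = c(60, 120, 40, 100, 100, 80)
)

# Total time spent per row = sessions * average duration of that row.
demo$totalDuration <- demo$sessions * demo$avgSessionDuration

# Sum sessions and total duration per device, then divide to get
# a session-weighted average duration.
by_device <- aggregate(cbind(sessions, totalDuration) ~ deviceCategory,
                       data = demo, FUN = sum)
by_device$avgSessionDuration <- by_device$totalDuration / by_device$sessions

print(by_device[, c("deviceCategory", "sessions", "avgSessionDuration")])
```

The aggregated by_device data frame has one row per device type, so it can be passed to the same ggplot() calls in place of gadata.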
Source code
The complete code for this example is in the GitHub repository: github.com/michalbrys/R-Google-Analytics/blob/master/7_device_comparsion.R
Machine Learning
Clustering (k-means)
The power of R lies in its wide range of packages with advanced, ready-to-use algorithms. In this example we'll use k-means for custom user segmentation.

Unsupervised learning: k-Means
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (Source: Wikipedia)

Because this example needs a custom installation of Google Analytics tracking (content grouping, fingerprint), I've prepared a special dataset for this purpose. You can find the complete code below.
    # K-Means Cluster Analysis

    # load data into R
    # you can download data from Google Analytics API or download sample dataset
    # source('ga-connection.R')

    # download and preview sample dataset
    download.file(url = "https://raw.githubusercontent.com/michalbrys/R/master/users-segmentation/sample-users.csv",
                  "sample-users.csv",
                  method = "curl")
    gadata <- read.csv(file = "sample-users.csv", header = TRUE, row.names = 1)
    head(gadata)

    # clustering users in 3 groups
    fit <- kmeans(gadata, 3)

    # get cluster means
    aggregate(gadata, by = list(fit$cluster), FUN = mean)

    # append and preview cluster assignment
    clustered_users <- data.frame(gadata, fit$cluster)
    head(clustered_users)

    # visualize results in 3D chart
    #install.packages("plotly")
    library(plotly)
    plot_ly(clustered_users,
            x = clustered_users$beginner_pv,
            y = clustered_users$intermediate_pv,
            z = clustered_users$advanced_pv,
            type = "scatter3d",
            mode = "markers",
            color = factor(clustered_users$fit.cluster))

    # write results to file
    write.csv(clustered_users, "clustered-users.csv", row.names = TRUE)
Results
The result, visualized with the plotly package:
Results - clustered users
In addition to the chart you get a .csv file with the userId (fingerprint) and the assigned label (segment number). You can use the results by uploading them to your marketing systems.

Example results:

    > clustered_users
           Beginner Intermediate Advanced fit.cluster
    266876        9           45        4           1
    965265        9           51        7           1
    981924       19           10        8           2
    732529       19           16        1           2
    ...
    377795        2            7       38           3
    918083        2            8       28           3
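Because kmeans() starts from randomly chosen centres, repeated runs can number the clusters differently or, occasionally, split the users differently. The following is a minimal sketch on synthetic data (the three planted groups and their page-view means are invented) showing two common ways to make a run more stable: fixing the random seed and using the nstart argument.

```r
# Synthetic page-view counts per user (hypothetical data, three planted groups).
set.seed(42)
make_group <- function(n, means) {
  data.frame(beginner_pv     = rpois(n, means[1]),
             intermediate_pv = rpois(n, means[2]),
             advanced_pv     = rpois(n, means[3]))
}
demo_users <- rbind(make_group(50, c(20, 10, 3)),
                    make_group(50, c(8, 45, 5)),
                    make_group(50, c(2, 8, 35)))

# nstart = 25 tries 25 random initialisations and keeps the best one,
# so the result depends much less on a single random start.
fit_demo <- kmeans(demo_users, centers = 3, nstart = 25)

# Cluster sizes and cluster centres.
print(table(fit_demo$cluster))
print(round(fit_demo$centers, 1))
```

The cluster numbers themselves are arbitrary labels, so it is the cluster centres, not the numbers, that tell you which segment is which.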
Source code
The complete code for this example is in the GitHub repository: github.com/michalbrys/R-Google-Analytics/blob/master/5_users_segmentation.R
Generating reports
For every analyst, periodic reports can be time-consuming work. We can automate this process in R by preparing reporting templates. After that you can re-run these reports for a different time range and save them to, e.g., a .pdf file. Sounds interesting?
Introduction to R Markdown
You can use R Markdown as follows:

R Markdown options

    ---
    title: "Monthly report"
    output: pdf_document
    ---

Chunks of code

    ```{r}
    # R Code
    ```

If you don't want to display the code of a chunk in the output file, use the echo = FALSE option.

    ```{r, echo=FALSE}
    # R Code
    ```
Basic formatting

Headers

    # Header 1
    ## Header 2
    ### Header 3

will produce three headers of decreasing size:

Header 1
Header 2
Header 3

Lists

    * element 1
    * element 2
    * element 3

will produce a bulleted list:

• element 1
• element 2
• element 3

    1. element 1
    2. element 2
    3. element 3

will produce a numbered list:

1. element 1
2. element 2
3. element 3

Formatting

    *italic* **bold** ***bold+italic***

will produce the words "italic", "bold" and "bold+italic" set in italic, bold and bold italic type respectively.
More resources
Full documentation: www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
Cheat sheet: www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
Create report
To generate a basic report template, use the code in this section. The report will contain a title and the sessions-over-time scatter plot from chapter 2 (Data visualization in R).

Create new RMarkdown report
In R Studio, navigate to File > New file > R Markdown.
You will see a window with some basic configuration options. Change these values now, or do it later directly in the code.
You can select the output format of your report: HTML, PDF or Word.
Click OK and delete the sample code.
Prepare custom report with Google Analytics data
Copy this code to R Studio and click the Knit HTML icon. This code will generate an HTML report with data downloaded from Google Analytics.

    ---
    title: "Google Analytics Traffic Report"
    author: "Michal Brys"
    output: html_document
    ---

    ```{r, echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
    ga_id <- 67980704
    date_start <- "2016-01-01"
    date_end <- "2016-06-30"

    #install.packages("googleAnalyticsR")
    #install.packages("ggplot2")
    library("googleAnalyticsR")
    library("ggplot2")

    #Run once from the console, then generate knitr document
    ga_auth()
    ```

    ### Sessions from `r date_start` to `r date_end`

    This chart contains a scatter plot of the number of sessions in the date range.

    ```{r, echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
    gadata <- google_analytics(id = ga_id,
                               start = date_start, end = date_end,
                               metrics = c("sessions"),
                               dimensions = c("date"),
                               max = 5000)

    # scatter plot with trend line
    ggplot(data = gadata, aes(x = date, y = sessions)) +
      geom_point() +
      geom_smooth() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
    ```

    ### Users engagement by device type

    This chart contains a bar chart with avgSessionDuration divided by device type.

    ```{r, echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
    gadata2 <- google_analytics(id = ga_id,
                                start = date_start, end = date_end,
                                metrics = c("sessions", "avgSessionDuration"),
                                dimensions = c("date", "deviceCategory"),
                                max = 5000)

    #plot sessions with deviceCategory
    ggplot(gadata2, aes(deviceCategory, sessions)) +
      geom_bar(aes(fill = deviceCategory), stat = "identity")

    #plot avgSessionDuration with deviceCategory
    ggplot(gadata2, aes(deviceCategory, avgSessionDuration)) +
      geom_bar(aes(fill = deviceCategory), stat = "identity")
    ```
Result
As a result you'll get a complete HTML file with the report. You can also generate a PDF file. For recurring reporting you only need to change the dates :)
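Changing the dates can itself be scripted. R Markdown supports a params field in the document header, and rmarkdown::render() accepts a params argument that overrides those defaults. The following is a minimal sketch under assumptions: the file name report-template.Rmd, the parameter names date_start and date_end, and the two date ranges are all invented for illustration, and the template only prints the dates instead of querying Google Analytics.

```r
# Build a tiny parameterized R Markdown file (hypothetical file name and params).
fence <- strrep("`", 3)  # chunk delimiter, built here to keep the example readable
rmd_lines <- c(
  "---",
  "title: \"Traffic report\"",
  "output: html_document",
  "params:",
  "  date_start: \"2016-01-01\"",
  "  date_end: \"2016-06-30\"",
  "---",
  "",
  "Sessions from `r params$date_start` to `r params$date_end`.",
  "",
  paste0(fence, "{r, echo=FALSE}"),
  "# params$date_start and params$date_end are available in every chunk",
  "print(params$date_start)",
  fence
)
rmd_file <- file.path(tempdir(), "report-template.Rmd")
writeLines(rmd_lines, rmd_file)

# Render one report per date range.
# This step is skipped when rmarkdown or pandoc is not installed.
ranges <- list(c("2016-01-01", "2016-03-31"),
               c("2016-04-01", "2016-06-30"))
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  for (r in ranges) {
    rmarkdown::render(rmd_file,
                      output_file = paste0("report-", r[1], "_", r[2], ".html"),
                      output_dir = tempdir(),
                      params = list(date_start = r[1], date_end = r[2]),
                      quiet = TRUE)
  }
}
```

In a real report, the chunks would pass params$date_start and params$date_end to the Google Analytics query instead of hard-coded date_start and date_end variables.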
Source code
The complete code for this example is in the GitHub repository: github.com/michalbrys/R-Google-Analytics/blob/master/8_rmarkdown_report.Rmd
Additional analysis
Anomaly detection
Use: https://github.com/twitter/AnomalyDetection
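To get a feeling for the idea before installing that package, here is a simplified stand-in written in base R. It is not the algorithm the AnomalyDetection package implements; it is only a sketch that removes the weekly pattern with stl() and flags days whose remainder is unusually far from the median. The data, the planted spike on day 40 and the threshold of 3 are all assumptions made for illustration.

```r
# Synthetic daily sessions with a weekly pattern and one planted spike.
set.seed(3)
n_days <- 70
weekly <- c(120, 150, 155, 160, 150, 90, 70)
sessions <- 300 + rep(weekly, length.out = n_days) + rnorm(n_days, sd = 5)
sessions[40] <- sessions[40] + 200   # the anomaly we expect to find

# Split the series into seasonal, trend and remainder parts.
parts <- stl(ts(sessions, frequency = 7), s.window = "periodic", robust = TRUE)
remainder <- as.numeric(parts$time.series[, "remainder"])

# Flag days whose remainder is more than 3 robust standard deviations
# (median absolute deviation, MAD) away from the median remainder.
threshold <- 3 * mad(remainder)
anomalies <- which(abs(remainder - median(remainder)) > threshold)
print(anomalies)
```

With real traffic data, gadata$sessions from a daily query would take the place of the synthetic sessions vector.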
Forecasting
Forecast of future web traffic using the Holt-Winters method. Inspired by Richard Fergie.

    # forecasting using Holt-Winters algorithm
    # based on http://www.eanalytica.com/r-for-web-analysts/

    # install libraries
    # install.packages("googleAuthR")
    # install.packages("googleAnalyticsR")
    # install.packages("ggplot2")
    # install.packages("forecast")
    # install.packages("reshape2")

    # load libraries
    library("googleAuthR")
    library("googleAnalyticsR")
    library("ggplot2")
    library("forecast")
    library("reshape2")

    # authorize connection with Google Analytics servers
    ga_auth()

    ## pick a profile with data to query
    #ga_id <- account_list[275,'viewId']

    # or give it explicitly using tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
    ga_id <- 00000000

    gadata <- google_analytics(id = ga_id,
                               start = "2016-05-01", end = "2016-06-30",
                               metrics = "sessions",
                               dimensions = "date",
                               max = 5000)

    # weekly seasonality: 7 observations per period
    timeseries <- ts(gadata$sessions, frequency = 7)
    components <- decompose(timeseries)
    plot(components)

    # note the way we add a column to a data.frame
    gadata$adjusted <- gadata$sessions - components$seasonal

    forecastmodel <- HoltWinters(timeseries)
    plot(forecastmodel)

    # forecast() dispatches to the HoltWinters method of the forecast package
    fc <- forecast(forecastmodel, h = 26) # 26 days in future
    plot(fc, xlim = c(0, 13))

    forecastdf <- as.data.frame(fc)
    totalrows <- nrow(gadata) + nrow(forecastdf)

    # the forecast columns start at the last observed value so the lines connect
    forecastdata <- data.frame(
      day = c(1:totalrows),
      actual = c(gadata$sessions, rep(NA, nrow(forecastdf))),
      forecast = c(rep(NA, nrow(gadata) - 1), tail(gadata$sessions, 1), forecastdf[["Point Forecast"]]),
      forecastupper = c(rep(NA, nrow(gadata) - 1), tail(gadata$sessions, 1), forecastdf[["Hi 80"]]),
      forecastlower = c(rep(NA, nrow(gadata) - 1), tail(gadata$sessions, 1), forecastdf[["Lo 80"]])
    )

    ggplot(forecastdata, aes(x = day)) +
      geom_line(aes(y = actual), color = "black") +
      geom_line(aes(y = forecast), color = "blue") +
      geom_ribbon(aes(ymin = forecastlower, ymax = forecastupper), alpha = 0.4, fill = "green") +
      xlim(c(0, 90)) +
      xlab("Day") +
      ylab("Sessions")
Result
As a result you'll get a chart with predictions of your web traffic.
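The modelling step can also be tried without the forecast package or a Google Analytics connection. The following is a minimal sketch that uses only base R: HoltWinters() from the stats package and its predict() method. The daily sessions are synthetic (a made-up trend, weekly pattern and noise), and the 14-day horizon and 80% interval are arbitrary choices for illustration.

```r
# Synthetic daily sessions: trend + weekly pattern + noise (hypothetical data).
set.seed(7)
n_days <- 63
weekly <- c(120, 150, 155, 160, 150, 90, 70)
sessions <- 200 + 0.5 * seq_len(n_days) +
  rep(weekly, length.out = n_days) +
  rnorm(n_days, sd = 5)

# frequency = 7 tells R that the seasonal pattern repeats every 7 days.
demo_ts <- ts(sessions, frequency = 7)
demo_hw <- HoltWinters(demo_ts)

# predict() returns point forecasts ("fit") and, when requested,
# the upper ("upr") and lower ("lwr") bounds of a prediction interval.
demo_fc <- predict(demo_hw, n.ahead = 14,
                   prediction.interval = TRUE, level = 0.8)
print(head(demo_fc))
```

The fit, upr and lwr columns play the same role as the "Point Forecast", "Hi 80" and "Lo 80" columns used in the code above.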
Source code
The complete code for this example is in the GitHub repository: https://github.com/michalbrys/R-Google-Analytics/blob/master/4_forecasting.R
Resources
Blogs
R Bloggers
Mark Edmondson
Richard Fergie
Michal Brys - blog
...
Documentation
R project - official website
ggplot2 - official website
googleAnalyticsR - R package
Google Analytics - for developers
Online trainings
To learn more about R, I recommend the Coursera MOOC: R Programming by Johns Hopkins University