Big Data: Prelude — I-Jen Chiang (6/26/2014)
What’s Big Data? No single definition; from Wikipedia:
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN's Large Hadron Collider (LHC) generates 15 PB a year
"640K ought to be enough for anybody."
A Single View of the Customer
(Figure: the Customer at the center, linked to Banking/Finance, Social Media, Our Known History, Gaming, Entertainment, and Purchase data.)
Collect → Accumulate → Store
Life Cycle of Big Data
(Figure: Big Data at the center of Cloud Computing, the Internet of Things, and Mobile Computing, supported by Query, MapReduce, and Distributed Storage.)
Data, data, and more data
• According to IBM, 90% of the data in the world today was created in the past two years (2011~2013). (IBM quote, microscope.co.uk)
• According to International Data Corporation, the total amount of global data is expected to grow to 2.7 zettabytes during 2012. (International Data Corporation 2012 prediction, IDC website)
• The data is growing exponentially (43% growth rate) and is estimated to be 7.9 zettabytes by 2015. (CenturyLink 2015 prediction, ReadWriteWeb website)
Tendency of Data
• Transactional systems: structured and tabular or object/relational; fixed schema; 10s-100s GB; up to 100,000s of transactions per hour; optimized for transaction processing.
• Dimensional/BI systems: specialized DBs for BI; TB to low PBs; batch loads; optimized for reporting and analysis.
• Big data: poorly structured, lightly structured, or unstructured; simple structure and extreme data rate; hierarchical and/or key-value oriented; sparsely attributed; PB and up; stream capture and batch; optimized for distributed, cloud-based processing.
Data Growth
Volume (Scale)
• Data volume: 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB
• Data volume is increasing exponentially
• Exponential increase in collected/generated data
How many TBs of data every day?
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009, 200M by 2014
• 2+ billion people on the Web by end 2011
Real-time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data,
• but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
The Model Has Changed…
• The model of generating/consuming data has changed.
Old model: a few companies generate data; all others consume it.
New model: all of us generate data, and all of us consume data.
What's driving Big Data
Then: ad-hoc querying and reporting; data mining techniques; structured data, typical sources; small to mid-size datasets.
Now: optimizations and predictive analytics; complex statistical analysis; all types of data, from many sources; very large datasets; more of a real-time orientation.
What is Big Data?
• Datasets which are too large, grow too rapidly, or are too varied to handle using traditional techniques
• Characteristics:
– Volume: 100s of TBs, petabytes, and beyond
– Velocity: e.g., machine-generated data, medical data
– Variety: unstructured data, many formats, varying semantics
• Not every data problem is a "Big Data" problem!!
Internet of Things
Characteristics of Big Data
(Figure: the defining characteristics, led by Volume.)
IBM Definition
A New Era of Computing
• Volume: 12 terabytes of Tweets created daily
• Velocity: 5 million trade events per second
• Variety: 100s of video feeds from surveillance cameras
• Veracity: only 1 in 3 decision makers trust their information
http://watalon.com/?p=722
"We have for the first time an economy based on a key resource [Information] that is not only renewable, but self-generating. Running out of it is not a problem, but drowning in it is." – John Naisbitt
Big Data Explained
Achieve breakthrough outcomes about your customers:
• Run zero-latency operations
• Innovate new products at speed and scale
• Gain instant awareness of fraud and risk
• Exploit instrumented assets
…by analyzing any big data type: transactional/application data, machine data, social media data, content.
Big Data Stack
Value of Big Data
• Unlocking significant value by making information transparent and usable at much higher frequency.
• Using data collection and analysis to conduct controlled experiments to make better management decisions.
• Sophisticated analytics that substantially improve decision‐making.
• Improved development of the next generation of products and services.
Why does big data matter? • Big data is not just about storing large datasets • Rather, it is about leveraging datasets – Mining datasets to find new meaning – Combining datasets that have never been combined before
– Making more informed decisions – Offering new products and services
• Data is a vital asset, and analytics are the key to unlocking its potential “We don’t have better algorithms than anyone else, we just have more data.” ‐ Peter Norvig, Director of Research, Google, spoken in 2010
Big Data: Processing — I-Jen Chiang
Knowledge Pyramid (semantic levels, bottom to top):
• Data — signals; resources occupied
• Information — Data + context
• Knowledge — Information + rules
• Wisdom — Knowledge + experience
Data (text) mining works across these levels, answering questions such as: "How many units were sold of each product line?" "What was the lowest selling product?" "What made it that unsuccessful?"
Value Chain Emerges (increasing value, bottom to top):
• Big Data: containers and feeds of heterogeneous data
• Processing: data prepared for analysis; indexed, organized, and optimized data access to structured and unstructured data
• Reporting: an evaluation of what happened in the past
• Analytics: identification of patterns and relationships
• Predictive Analytics: sets of potential future scenarios
• Prescriptive Analytics: automatically prescribe and take action
Michael Porter, Competitive Advantage: Creating and Sustaining Superior Performance
Big Data Processing
Sources: transactional data; operational & partner data; social data; machine-to-machine; cloud services; event streams.
Interconnect: high-speed, low-latency InfiniBand/Ethernet.
Storage tiers: working local flash storage layer as an extension of DRAM; distributed shared flash storage layer; low-cost distributed archive & backup disk storage layer.
Consumers: operational systems; business analytics; indexing & metadata (databases, indexes, metadata, cubes, …); big data analytics; government systems databases.
Management: active data management (shared databases, active indexes, shared metadata); archive data & metadata; archive/backup data management.
The Traditional Approach
Query-driven (lazy, on-demand): clients issue queries to an integration system (with metadata), which forwards them to the underlying sources.
Disadvantages of the Query-Driven Approach
• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn't caught on in industry
The Warehousing Approach
Information integrated in advance and stored in the warehouse for direct querying and analysis.
Clients query the data warehouse; an integration system (with metadata) loads it via extractors/monitors attached to each source.
Advantages of the Warehousing Approach
• High query performance
• Doesn't interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
Business Intelligence
Information sources (operational DBs, semistructured sources) → extract, transform, load, refresh → Data Warehouse server (Tier 1), with data marts → OLAP servers (Tier 2: MOLAP or ROLAP), query/reporting and data mining servers → clients (Tier 3).
Not an Either-Or Decision
• The query-driven approach is still better for:
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large numbers of sources
– Clients with unpredictable needs
What is a Data Warehouse? A Practitioner's Viewpoint
"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context." – Barry Devlin, IBM Consultant
An Alternative Viewpoint
"A DW is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile
collection of data that is used primarily in organizational decision making." – W.H. Inmon, Building the Data Warehouse, 1992
A Data Warehouse is...
• A stored collection of diverse data
– A single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented databases
• User interface aimed at executives
Characteristics of a Data Mart
KROENKE and AUER ‐ DATABASE CONCEPTS (6th Edition) Copyright © 2013 Pearson Education, Inc. Publishing as Prentice Hall
Gaining market intelligence from news feeds
Sreekumar Sukumaran and Ashish Sureka
Integrated BI Systems
ETL feeds a complete data warehouse from structured data (DBMS, file system, XML, RDBMS) and, via a text tagger & annotator, from unstructured data (EA, legacy systems, CMS, scanned documents, email).
Sreekumar Sukumaran and Ashish Sureka
Data Warehouse Components
Source: Ralph Kimball
Data Warehouse Components – Detailed
Source: Ralph Kimball
Linux Adoption
Distributing processing between gateways and cloud
Big Data Processing Techniques
• Distributed data stream processing technology for on-the-fly real-time analytics of data/events generated at extremely high rates.
• Technologies for reliable distributed data store, high-speed data structure transform to create analytics DBs, and quick data placement management.
• Scalable data extraction technology to speed up rich querying functionality, such as multi-dimensional range search.
• Scalable distributed parallel processing of a huge amount of stored data for advanced analysis such as machine learning.
Batch and Real-time Platform
http://www.nec.com/en/global/rd/research/cl/bdpt.html
http://info.aiim.org/digital-landfill/newaiimo/2012/03/15/big-data-and-big-content-just-hype-or-a-real-opportunity
Value Chain of Big Data
Fritz Venter and Andrew Stein, Images & videos: really big data, 2012
Big Data Moving Forward
Source: Dion Hinchcliffe
Evolution of Data
• Data Collection (1960s) — Business question: "What was my total revenue in the last five years?" Enabling technologies: computers, tapes, disks. Product providers: IBM, CDC. Characteristics: retrospective, static data delivery.
• Data Access (1980s) — Business question: "What were unit sales in New England last March?" Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC. Product providers: Oracle, Sybase, Informix, IBM, Microsoft. Characteristics: retrospective, dynamic data delivery at record level.
• Data Warehousing & Decision Support (1990s) — Business question: "What were unit sales in New England last March? Drill down to Boston." Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses. Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy. Characteristics: retrospective, dynamic data delivery at multiple levels.
• Data Mining (emerging today) — Business question: "What's likely to happen to Boston unit sales next month? Why?" Enabling technologies: advanced algorithms, multiprocessor computers, massive databases. Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry). Characteristics: prospective, proactive information delivery.
Big Data Processing: Real-Time
• Collect and store: in-memory data grid
• Speed up processing through co-location of business logic with data: reduce network hops
• Integrate with the big data store to meet volume and cost demands
Real-Time Big Data Processing
Big data streams → data streaming system → real-time analytic system (with metadata), alongside data warehouses and systems of record; ETL and Hadoop feed the big data store; archive & backup systems (objects, geographically distributed).
KDD Process
Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation
Data (often from a data warehouse) → Target Data → Preprocessed Data → Transformed Data → Patterns → Knowledge
Process for Generating Evidence-based, Computer-Interpretable Guidelines
N. Stolba and A. M. Tjoa, "The Relevance of Data Warehousing and Data Mining in the Field of Evidence-based Medicine to Support Healthcare Decision Making," Proceedings of World Academy of Science, Engineering and Technology, Vol. 11, February 2006, ISSN 1307-6884.
What is Cloud Computing?
• Cloud computing is a style of computing in which dynamically scalable and virtualized resources are provided as a service over the Internet.
• Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.
Definitions
• "A style of computing where massively scalable IT-related capabilities are provided 'as a service' using Internet technologies to multiple external customers" – Gartner
• One common definition focuses on remote access to services and computing resources provided over the Internet "cloud"
– e.g., CRM and payroll services, as well as vendors that offer access to storage and processing power over the Web (such as Amazon's EC2 service)
• The other focuses on the use of technologies such as virtualization and automation that enable the creation and delivery of service-based computing capabilities
– an extension of traditional data center approaches that can be applied to entirely internal enterprise systems with no use of external off-premises capabilities provided by a third party
The NIST Cloud Definition Framework
Deployment models: Private Cloud, Community Cloud, Public Cloud, Hybrid Clouds
Service models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS)
Essential characteristics: On-Demand Self-Service, Broad Network Access, Rapid Elasticity, Resource Pooling, Measured Service
Common characteristics: Massive Scale, Resilient Computing, Homogeneity, Geographic Distribution, Virtualization, Service Orientation, Low-Cost Software, Advanced Security
Based upon an original chart created by Alex Dowbor – http://ornot.wordpress.com
Cycle of SOA, Cloud Computing, and Web 2.0
D. Delen, H. Demirkan / Decision Support Systems (2013)
A lifecycle of Big Data
1. Collection/Identification; Repository/Registry
2. Integration
3. Analytics/Prediction
4. Visualization
Cycle: Data → Insight → Decision → Action
Roles: Data Curation, Data Scientist, Data Engineer
Cross-cutting concerns: Workflow, Data Quality
A Reference Model for Big Data
• Service Layer: Analysis & Prediction (interfacing with Big Data Management)
• Service Support Layer: Workflow Management, Data Quality Management, Data Visualization, Data Curation
• Platform Layer: Data Integration, Data Semantic Intellectualization, Security
• Data Layer: Data Identification (data mining & metadata extraction), Data Collection, Data Registry, Data Repository
Data feeds
Search engines and spiders draw on library catalogs, locally held documents, public repositories, commercial data sources, agency data sources, and the public Internet; filtered content flows through a meta-search tool, organized by a taxonomy, into a Web portal.
System Design Preview: Node.js Based — I-Jen Chiang
Scenario: Sensor Network
Distributed Processing
http://phys.org/news/2012-03-technology-efficiently-desired-big-streams.html
35
6/26/2014
71
Node.js vs Cloud Computing
A Node.js master process (web server & WebSockets interface plus a dispatcher, fully asynchronous) receives incoming WebSocket and static-content requests. Static content is served directly; WebSocket requests are dispatched to a child process pool, where each Node.js child process runs an application module (Node.js application + communication layer).
Event Loop
Event Loop Example
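The slides illustrate the event loop only graphically; as a stand-in, here is a minimal runnable sketch (mine, not the slides') of how the Node.js event loop defers callbacks:

// Synchronous code runs to completion before any queued callback fires.
console.log('start');
setTimeout(function () {
  console.log('timer callback');   // queued, runs in a later loop iteration
}, 0);
process.nextTick(function () {
  console.log('next tick');        // runs after current code, before timers
});
console.log('end');
// Prints: start, end, next tick, timer callback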
Learning JavaScript
• Layer 1: single object
• Layer 2: prototype chain (objects linked via __proto__)
• Layer 3: constructor (a constructor function, its prototype, and instances created with new)
• Layer 4: constructor inheritance (a SubConstr extending a SuperConstr)
Create a single object
• Objects: atomic building blocks of JavaScript OOP
– Objects: maps from strings to values
– Properties: entries in the map
– Methods: properties whose values are functions
• this refers to the receiver of the method call

// Object literal
var jane = {
  // Property
  name: 'Jane',
  // Method
  describe: function () {
    return 'Person named ' + this.name;
  }
};

Advantage: create objects directly, introduce abstractions later.
var jane = {
  name: 'Jane',
  describe: function () {
    return 'Person named ' + this.name;
  }
};

# jane.name
'Jane'
# jane.describe
[Function]
# jane.describe()
'Person named Jane'
# jane.name = 'John'
# jane.describe()
'Person named John'
# jane.unknownProperty
undefined
Objects versus maps
• Similar:
– Very dynamic: freely delete and add properties
• Different:
– Inheritance (via prototype chains)
– Fast access to properties (via constructors)
Sharing properties: the problem
var jane = {
  name: 'Jane',
  describe: function () {
    return 'Person named ' + this.name;
  }
};
var tarzan = {
  name: 'Tarzan',
  describe: function () {
    return 'Person named ' + this.name;
  }
};
Sharing properties: the solution
PersonProto holds describe: function() {…}; jane (__proto__ → PersonProto, name: 'Jane') and tarzan (__proto__ → PersonProto, name: 'Tarzan') share it.
• Both prototype chains work like a single object.
Sharing properties: the code
var PersonProto = {
  describe: function () {
    return 'Person named ' + this.name;
  }
};
var jane = {
  __proto__: PersonProto,
  name: 'Jane',
};
var tarzan = {
  __proto__: PersonProto,
  name: 'Tarzan',
};
Getting and setting the prototype
• ECMAScript 6: __proto__
• ECMAScript 5:
– Object.create()
– Object.getPrototypeOf()
Getting and setting the prototype
// Object.create(proto)
var PersonProto = {
  describe: function () {
    return 'Person named ' + this.name;
  }
};
var jane = Object.create(PersonProto);
jane.name = 'Jane';

# Object.getPrototypeOf(jane) === PersonProto
true
Sharing methods
// Instance-specific properties
function Person(name) {
  this.name = name;
}
// Shared properties
Person.prototype.describe = function () {
  return 'Person named ' + this.name;
};
Instances created by the constructor
Person.prototype holds describe: function() {…}; the constructor function Person(name) { this.name = name; } creates jane (__proto__ → Person.prototype, name: 'Jane') and tarzan (__proto__ → Person.prototype, name: 'Tarzan').
instanceof
• Is value an instance of Constr?
• How does instanceof work? Check: is Constr.prototype in the prototype chain of value?
// Equivalent
value instanceof Constr
Constr.prototype.isPrototypeOf(value)
Goal: derive Employee from Person
function Person(name) {
  this.name = name;
}
Person.prototype.sayHelloTo = function (otherName) {
  console.log(this.name + ' says hello to ' + otherName);
};
Person.prototype.describe = function () {
  return 'Person named ' + this.name;
};
• Additional instance property: title
• describe() returns 'Person named <name> (<title>)'
Things we need to do
• Employee must
– Inherit Person's instance properties
– Create the instance property title
– Inherit Person's prototype properties
– Override the method Person.prototype.describe (and call the overridden method)
Employee: the code
function Employee(name, title) {
  Person.call(this, name);                              // (1)
  this.title = title;                                   // (2)
}
Employee.prototype = Object.create(Person.prototype);   // (3)
Employee.prototype.describe = function () {             // (4)
  return Person.prototype.describe.call(this)           // (5)
    + ' (' + this.title + ')';
};
(1) Inherit instance properties
(2) Create the instance property title
(3) Inherit prototype properties
(4) Override method Person.prototype.describe
(5) Call overridden method (a super-call)
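A short usage check for the code above (the expected results follow from steps (1)–(5)):

var jane = new Employee('Jane', 'CTO');
jane.describe();            // 'Person named Jane (CTO)'
jane instanceof Employee;   // true
jane instanceof Person;     // true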
Instances created by the constructor
Person.prototype (__proto__ → Object.prototype) holds sayHelloTo and describe; Employee.prototype (__proto__ → Person.prototype) holds the overriding describe, which calls Person.prototype.describe; jane (__proto__ → Employee.prototype) has name: 'Jane' and title: 'CTO'.
Built-in constructor hierarchy
Object.prototype (__proto__ → null) holds toString, …; Array.prototype (__proto__ → Object.prototype) holds toString, sort, …; the array ['foo', 'bar'] (__proto__ → Array.prototype) has 0: 'foo', 1: 'bar', length: 2.
Hello World
var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World.\n');
}).listen(1337, '127.0.0.1');
console.log('Server running at http://127.0.0.1:1337/');
Express – Web Application Framework
var express = require('express'),
    app = express.createServer();
app.get('/', function(req, res) {
  res.send('Hello World.');
});
app.listen(1337);
Express – Create http services
app.get('/', function(req, res) {
  res.send('hello world');
});
app.get('/test', function(req, res) {
  res.send('test render');
});
app.get('/user/', function(req, res) {
  res.send('user page');
});
Router Identifiers
// Will match /abcd
app.get('/abcd', function(req, res) {
  res.send('abcd');
});
// Will match /acd and /abcd
app.get('/ab?cd', function(req, res) {
  res.send('ab?cd');
});
// Will match anything between ab and cd, e.g. /abxyzcd
app.get('/ab*cd', function(req, res) {
  res.send('ab*cd');
});
// Will match /abe and /abcde
app.get('/ab(cd)?e', function(req, res) {
  res.send('ab(cd)?e');
});
// Will match /abcd, /abbcd, /abbbcd, and so on
app.get('/ab+cd', function(req, res) {
  res.send('ab+cd');
});
Express – Get Parameters
// ... Create http server
app.get('/user/:id', function(req, res) {
  res.send('user: ' + req.params.id);
});
app.get('/:number', function(req, res) {
  res.send('number: ' + req.params.number);
});
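Illustrative requests against these routes (example URLs are hypothetical):

// GET /user/42  -> 'user: 42'
// GET /7        -> 'number: 7'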
Connect – Middleware
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(function(request, response) {
  response.writeHead(200, { "Content-Type": "text/plain" });
  response.end("Hello world!\n");
});
http.createServer(app).listen(1337);
Connect: request, response, next
var connect = require("connect");
var http = require("http");
var app = connect();
// log
app.use(function(request, response, next) {
  console.log("In comes a " + request.method + " to " + request.url);
  next();
});
// return "Hello World"
app.use(function(request, response, next) {
  response.writeHead(200, { "Content-Type": "text/plain" });
  response.end("Hello World!\n");
});
http.createServer(app).listen(1337);
Connect: logger
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(connect.logger());
app.use(function(request, response) {
  response.writeHead(200, { "Content-Type": "text/plain" });
  response.end("Hello world!\n");
});
http.createServer(app).listen(1337);
Connect Logging
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(connect.logger());
// Homepage
app.use(function(request, response, next) {
  if (request.url == "/") {
    response.writeHead(200, { "Content-Type": "text/plain" });
    response.end("Welcome to the homepage!\n");
    // The middleware stops here.
  } else {
    next();
  }
});
// About page
app.use(function(request, response, next) {
  if (request.url == "/about") {
    response.writeHead(200, { "Content-Type": "text/plain" });
    response.end("Welcome to the about page!\n");
    // The middleware stops here.
  } else {
    next();
  }
});
// 404'd!
app.use(function(request, response) {
  response.writeHead(404, { "Content-Type": "text/plain" });
  response.end("404 error!\n");
});
http.createServer(app).listen(1337);
Big Data: Analysis — I-Jen Chiang
Big data issues
The CRISP-DM reference model
Harper, Gavin; Stephen D. Pickett (August 2006)
The Complete Big Data Value Chain
Collection → Ingestion → Discovery & Cleansing → Integration → Analysis → Delivery
• Collection – structured, unstructured, and semi-structured data from multiple sources
• Ingestion – loading vast amounts of data onto a single data store
• Discovery & Cleansing – understanding format and content; clean-up and formatting
• Integration – linking, entity extraction, entity resolution, indexing, and data fusion
• Analysis – …
• Delivery – querying, visualization, real-time delivery on enterprise-class availability
Phases and Tasks (CRISP-DM generic tasks and outputs)
• Business Understanding
– Determine Business Objectives: Background; Business Objectives; Business Success Criteria
– Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
– Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
– Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
– Collect Initial Data: Initial Data Collection Report
– Describe Data: Data Description Report
– Explore Data: Data Exploration Report
– Verify Data Quality: Data Quality Report
• Data Preparation (outputs: Data Set; Data Set Description)
– Select Data: Rationale for Inclusion/Exclusion
– Clean Data: Data Cleaning Report
– Construct Data: Derived Attributes; Generated Records
– Integrate Data: Merged Data
– Format Data: Reformatted Data
• Modeling
– Select Modeling Technique: Modeling Technique; Modeling Assumptions
– Generate Test Design: Test Design
– Build Model: Parameter Settings; Models; Model Description
– Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
– Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
– Review Process: Review of Process
– Determine Next Steps: List of Possible Actions; Decision
• Deployment
– Plan Deployment: Deployment Plan
– Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
– Produce Final Report: Final Report; Final Presentation
– Review Project: Experience Documentation
Data Mining Context
• Dimension: Application Domain — examples: Churn Prediction, …
• Dimension: Data Mining Problem Type — examples: Description and Summarization, Segmentation, Concept Description, Classification, Prediction, Dependency Analysis
• Dimension: Technical Aspect — examples: Missing Values, Outliers, …
• Dimension: Tool and Technique — examples: MineSet, …
What's in Big Data?
• How do you extract value from big data?
– You surely can't glance over every record;
– and it may not even have records…
• What if you wanted to learn from it?
– Understand trends
– Classify into categories
– Detect similarities
– Predict the future based on the past… (No, not like Nostradamus!)
• Machine learning is quickly establishing itself as an emerging discipline.
– Thousands of features
– Billions of records
– The largest machine that you can get may not be large enough…
Get the picture?
Data Accumulation
Service-Oriented DSS
D. Delen, H. Demirkan / Decision Support Systems (2013)
Ten common big data problems
• Modeling true risk
• Customer churn analysis
• Recommendation engine
• Ad targeting
• PoS transaction analysis
• Analyzing network data
• Threat analysis
• Trade surveillance
• Search quality
• Data "sandbox"
Business Applications
• Modeling risk and failure prediction
• Analyzing customer churn
• Web recommendations (à la Amazon)
• Web ad targeting
• Point-of-sale transaction analysis
• Threat analysis
• Compliance and search effectiveness
Dynamics of Data Ecosystems
http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf
The big data opportunity
Industries are embracing big data
Business Analytics
D. Delen, H. Demirkan / Decision Support Systems (2013)
Multidimensional Concept Analysis
Formal context: documents D1–D6 (rows) × attribute variables a–e (columns), with True marking which attributes each document has. From the lattice below: D1 has a, b, d; D2 has a, b; D3 and D6 also have d.
Concept lattice: Top — ((D1, D2), (a, b)) — ((D1, D3, D6), (d)) — ((D1), (a, b, d)) — Bottom.
Analysis of accesses to state: elements (documents) + properties (accesses to attribute variables).
Analysis of behavior: elements (documents) + properties (invocations to other documents).
Three Approaches
Mining Schemes
• A fully distributed and extensible set of machine learning techniques for big data
• State-of-the-art algorithms in each of the machine learning domains, including supervised and unsupervised learning:
– Correlation
– Classifiers
– Clustering
– Statistics
– Document manipulation
– N-gram extraction
– Histogram computation
– Natural Language Processing
• Distributed and parallel underlying linear algebra library
Statistical Approach
Random Sample and Statistics
• Population: used to refer to the set or universe of all entities under study.
• However, looking at the entire population may not be feasible, or may be too expensive.
• Instead, we draw a random sample from the population and compute appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.
Statistic
• Let $S_i$ denote the random variable corresponding to the $i$-th data point. A statistic $\hat{\theta}$ is a function $\hat{\theta}: (S_1, S_2, \cdots, S_n) \to \mathbb{R}$.
• If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
Empirical Cumulative Distribution Function
$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$
where the indicator $I(x_i \le x) = 1$ if $x_i \le x$, and $0$ otherwise.
Inverse Cumulative Distribution Function (quantile function)
$\hat{F}^{-1}(q) = \min\{ x : \hat{F}(x) \ge q \}$, for $q \in [0, 1]$.
Example
Measures of Central Tendency (Mean)
Population mean: $\mu = E[X]$
Sample mean (unbiased, not robust): $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Measures of Central Tendency (Median)
Population median $m$: $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$, or equivalently $m = F^{-1}(0.5)$.
Sample median: the middle value of the sorted sample, $\hat{m} = \hat{F}^{-1}(0.5)$.
Example
Measures of Dispersion (Range)
Range: $r = \max(X) - \min(X)$
Sample range: $\hat{r} = \max_i x_i - \min_i x_i$
• Not robust, sensitive to extreme values
Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR): $IQR = F^{-1}(0.75) - F^{-1}(0.25)$
Sample IQR: $\widehat{IQR} = \hat{q}_{0.75} - \hat{q}_{0.25}$
• More robust
Measures of Dispersion (Variance and Standard Deviation)
Variance: $\sigma^2 = E[(X - \mu)^2]$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample variance and standard deviation: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$
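A minimal sketch in JavaScript (the deck's implementation language; example data arbitrary, not from the slides) computing the sample statistics defined above:

function mean(xs) {
  return xs.reduce(function (s, x) { return s + x; }, 0) / xs.length;
}
function variance(xs) {                 // sample variance around the sample mean
  var m = mean(xs);
  return xs.reduce(function (s, x) { return s + (x - m) * (x - m); }, 0) / xs.length;
}
function quantile(xs, q) {              // sample inverse CDF: smallest x with F(x) >= q
  var sorted = xs.slice().sort(function (a, b) { return a - b; });
  var idx = Math.max(0, Math.ceil(q * sorted.length) - 1);
  return sorted[Math.min(sorted.length - 1, idx)];
}
var data = [12, 11, 50, 8, 44, 4];
console.log(mean(data));                                  // sample mean
console.log(Math.sqrt(variance(data)));                   // sample standard deviation
console.log(quantile(data, 0.75) - quantile(data, 0.25)); // sample IQR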
Univariate Normal Distribution
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
Multivariate Normal Distribution
$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$
OLAP and Data Mining
Warehouse Architecture
Clients → Query & Analysis → Warehouse (with Metadata) → Integration → Sources
Star Schemas
• The star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often "insert-only."
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Terms
• Fact table
• Dimension tables
• Measures
sale(orderId, date, custId, prodId, storeId, qty, amt)
product(prodId, name, price)
customer(custId, name, address, city)
store(storeId, city)
Star
product: (p1, bolt, 10), (p2, nut, 5)
store: (c1, nyc), (c3, la)
customer: (53, joe, 10 main, sfo), (81, fred, 12 main, sfo), (111, sally, 80 willow, la)
sale: (o100, 1/7/97, 53, p1, c1, 1, 12), (o102, 2/7/97, 53, p2, c1, 2, 11), (105, 3/8/97, 111, p1, c3, 5, 50)
Cube
Fact table view: sale(prodId, storeId, amt) = (p1, c1, 12), (p2, c1, 11), (p1, c3, 50), (p2, c2, 8)
Multi-dimensional cube (dimensions = 2):
      c1   c2   c3
p1    12        50
p2    11    8
3-D Cube
Fact table view: sale(prodId, storeId, date, amt) = (p1, c1, 1, 12), (p2, c1, 1, 11), (p1, c3, 1, 50), (p2, c2, 1, 8), (p1, c1, 2, 44), (p1, c2, 2, 4)
Multi-dimensional cube (dimensions = 3):
day 1:      c1   c2   c3
      p1    12        50
      p2    11    8
day 2:      c1   c2   c3
      p1    44    4
ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM sale WHERE date = 1
sale: (p1, c1, 1, 12), (p2, c1, 1, 11), (p1, c3, 1, 50), (p2, c2, 1, 8), (p1, c1, 2, 44), (p1, c2, 2, 4)
Answer: 81
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM sale GROUP BY date
sale: (p1, c1, 1, 12), (p2, c1, 1, 11), (p1, c3, 1, 50), (p2, c2, 1, 8), (p1, c1, 2, 44), (p1, c2, 2, 4)
Answer: (date 1, sum 81), (date 2, sum 48)
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, prodId, sum(amt) FROM sale GROUP BY date, prodId
Result: (p1, 1, 62), (p2, 1, 19), (p1, 2, 48)
Moving between these levels of aggregation is called rollup (coarser) and drill-down (finer).
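The same rollup can be sketched in JavaScript (the deck's implementation language; a reduce over the sale records from the cube slides, not the slides' own code):

var sale = [
  { prodId: 'p1', storeId: 'c1', date: 1, amt: 12 },
  { prodId: 'p2', storeId: 'c1', date: 1, amt: 11 },
  { prodId: 'p1', storeId: 'c3', date: 1, amt: 50 },
  { prodId: 'p2', storeId: 'c2', date: 1, amt: 8 },
  { prodId: 'p1', storeId: 'c1', date: 2, amt: 44 },
  { prodId: 'p1', storeId: 'c2', date: 2, amt: 4 }
];
// GROUP BY date, prodId; SUM(amt)
var sums = sale.reduce(function (acc, r) {
  var key = r.date + '|' + r.prodId;
  acc[key] = (acc[key] || 0) + r.amt;
  return acc;
}, {});
console.log(sums);   // { '1|p1': 62, '1|p2': 19, '2|p1': 48 }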
Aggregates
• Operators: sum, count, max, min, median, average
• "Having" clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
What is Data Mining?
• Discovery of useful, possibly unexpected, patterns in data
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
• Collaborative Filtering [Predictive]
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign
Training set:
custId | car    | age | city | newCar
c1     | taurus | 27  | sf   | yes
c3     | van    | 40  | sf   | yes
c4     | taurus | 22  | sf   | yes
c5     | merc   | 50  | la   | no
c6     | taurus | 25  | la   | no
Clustering
(Figure: customer data points grouped into clusters in the age–income plane.)
K-Means Clustering
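The slide shows k-means only as a picture; here is a minimal sketch of Lloyd's algorithm (my code, 2-D points as on the clustering slide, naive initialization):

function kmeans(points, k, iters) {
  var centroids = points.slice(0, k).map(function (p) { return p.slice(); });
  var assign = [];
  for (var it = 0; it < iters; it++) {
    // Assignment step: each point goes to the nearest centroid (squared distance).
    assign = points.map(function (p) {
      var best = 0, bestD = Infinity;
      centroids.forEach(function (c, j) {
        var d = Math.pow(p[0] - c[0], 2) + Math.pow(p[1] - c[1], 2);
        if (d < bestD) { bestD = d; best = j; }
      });
      return best;
    });
    // Update step: each centroid becomes the mean of its assigned points.
    for (var j = 0; j < k; j++) {
      var members = points.filter(function (p, i) { return assign[i] === j; });
      if (members.length > 0) {
        centroids[j] = [
          members.reduce(function (s, p) { return s + p[0]; }, 0) / members.length,
          members.reduce(function (s, p) { return s + p[1]; }, 0) / members.length
        ];
      }
    }
  }
  return { centroids: centroids, assignments: assign };
}
// e.g., cluster (age, income) pairs into 2 groups:
// kmeans([[25, 30], [27, 32], [50, 90], [52, 88]], 2, 10)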
Association Rule Mining
Sales records (market-basket data):
tran1 | cust33 | p2, p5, p8
tran2 | cust45 | p5, p8, p11
tran3 | cust12 | p1, p9
tran4 | cust40 | p5, p8, p11
tran5 | cust12 | p2, p9
tran6 | cust12 | p9
• Trend: products p5 and p8 are often bought together
• Trend: customer 12 likes product p9
Association Rule Discovery
• Marketing and sales promotion:
– Let the rule discovered be {Bagels, …} --> {Potato Chips}
– Potato Chips as consequent => can be used to determine what should be done to boost their sales.
– Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
• Supermarket shelf management
• Inventory management
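Support and confidence for a candidate rule can be checked directly against the market-basket data two slides back; a minimal illustrative helper (mine, not the slides' algorithm):

var baskets = [
  ['p2', 'p5', 'p8'], ['p5', 'p8', 'p11'], ['p1', 'p9'],
  ['p5', 'p8', 'p11'], ['p2', 'p9'], ['p9']
];
function freq(items) {   // fraction of baskets containing all the given items
  return baskets.filter(function (b) {
    return items.every(function (i) { return b.indexOf(i) >= 0; });
  }).length / baskets.length;
}
// Rule {p5} -> {p8}: support = freq(both), confidence = freq(both) / freq(antecedent)
console.log(freq(['p5', 'p8']));                  // support: 3/6 = 0.5
console.log(freq(['p5', 'p8']) / freq(['p5']));   // confidence: 1.0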
Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in, on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…
• One approach is based on repeated clustering:
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of people
– Again cluster people based on their preferences for (the newly created clusters of) movies
– Repeat the above until equilibrium
• The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
Other Types of Mining
• Text mining: application of data mining to textual documents
– Cluster Web pages to find related pages
– Cluster pages a user has visited to organize their visit history
– Classify Web pages automatically into a Web directory
• Graph mining:
– Deal with graph data
Data Streams
• What are data streams?
– Continuous streams
– Huge, fast, and changing
• Why data streams?
– The arrival speed of streams and the huge amount of data are beyond our capability to store them.
– "Real-time" processing
• Window models
– Landmark window (entire data stream)
– Sliding window
– Damped window
• Mining data streams
A Simple Problem
• Finding frequent items
– Given a sequence x_1 … x_N, where x_i ∈ {1, …, m}, and a real number θ between zero and one.
– Looking for the x_i whose frequency is > θ
– Naïve algorithm (m counters)
• The number of frequent items is ≤ 1/θ: if P items each occur more than Nθ times, then P × (Nθ) ≤ N, so P ≤ 1/θ.
• Problem: N >> m >> 1/θ
KRP algorithm – Karp, et al. (TODS '03)
• Keep at most ⌈1/θ⌉ candidate counters (e.g., θ = 0.35 gives ⌈1/θ⌉ = 3).
• On each new element: increment its counter if present; otherwise add it, and if that would exceed ⌈1/θ⌉ counters, decrement every counter and evict those reaching zero.
• There are at most N / (1/θ) ≤ Nθ decrement rounds, and any element with frequency > θN is guaranteed to survive as a candidate.
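A sketch of this counter scheme in JavaScript (the rendering is mine, under the assumptions stated on the slide):

function frequentCandidates(stream, theta) {
  var k = Math.ceil(1 / theta);        // at most k counters
  var counters = {};
  stream.forEach(function (x) {
    if (counters[x] !== undefined) {
      counters[x] += 1;
    } else if (Object.keys(counters).length < k) {
      counters[x] = 1;
    } else {
      // Overflow: decrement all counters, evict those that hit zero.
      Object.keys(counters).forEach(function (y) {
        counters[y] -= 1;
        if (counters[y] === 0) delete counters[y];
      });
    }
  });
  return Object.keys(counters);        // superset of items with freq > theta * N
}
// frequentCandidates(['a','b','a','c','a','a','d'], 0.35)  ->  ['a']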
Streaming Sample Problem
• Scan the dataset once
• Sample K records
– Each record has an equal probability of being sampled
– With N total records, each is kept with probability K/N
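The classical answer to this problem is reservoir sampling; a minimal sketch (the algorithm is standard, the code is mine):

function reservoirSample(stream, K) {
  var sample = [];
  stream.forEach(function (record, i) {   // i is the 0-based position; i + 1 records seen
    if (i < K) {
      sample.push(record);                // fill the reservoir
    } else if (Math.random() < K / (i + 1)) {
      sample[Math.floor(Math.random() * K)] = record;   // replace a random slot
    }
  });
  return sample;   // each of the N records ends up in the sample with probability K/N
}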
Introduction to NoSQL — I-Jen Chiang
Big Users
Big Data
Data Storage
Features of the Data Warehouse
• A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. – W.H. Inmon
Data Warehouse Architecture
External sources and operational DBs → extract, transform, load, refresh → reconciled data in the data warehouse (with metadata repository and monitoring & administration) → data marts → OLAP servers → analysis, query/reporting, and data mining tools.
Business Intelligence Loop
Sources (CRM, Accounting, Finance, HR) → extraction, transformation, & cleansing → data storage (data warehouse) → OLAP, data mining, reports, decision support → business strategist.
Traditional vs. Big Data
Traditional data warehouse: complete records from transactional systems; all data centralized.
Big data environment: data from many sources inside and outside the organization, including the traditional DW; data often physically distributed; need to iterate the solution to test/improve models; large-memory analytics also part of the iteration; every iteration usually requires a complete reload of information.
NoSQL Database
• Wide Column Store / Column Families – Hadoop/HBase, Cassandra, Cloudata, Cloudera, Amazon SimpleDB
• Document Store – MongoDB, CouchDB, Citrusleaf
• Key Value / Tuple Store – Azure Table Storage, MEMBASE, GenieDB, Tokyo Cabinet/Tyrant, MemcacheDB
• Eventually Consistent Key Value Store – Amazon Dynamo, Voldemort
• Graph Databases – Neo4J, Infinite Graph, Bigdata
• XML Databases – Mark Logic Server, EMC Documentum, eXist
Why NoSQL
• Too much data: the database became too large to fit into a single database table on a single machine
• Data volume was growing – FAST
• Data wasn't all consistent with a specific, well-defined schema
• Time was critical
CAP Theorem
• Three properties of a system: consistency, availability, and partitions
• You can have at most two of these three properties for any shared-data system
• To scale out, you have to partition. That leaves either consistency or availability to choose from
– In almost all cases, you would choose availability over consistency
Theory of NoSQL: CAP
• Many nodes; nodes contain replicas of partitions of data
• Consistency (C) – all replicas contain the same version of data
• Availability (A) – system remains operational on failing nodes
• Partition tolerance (P) – multiple entry points; system remains operational on system split
CAP Theorem: satisfying all three at the same time is impossible
ACID – BASE
• ACID: Atomicity, Consistency, Isolation, Durability (CP)
• BASE: Basically Available, Soft-state, Eventually consistent (AP)
Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)
NoSQL
Key/Value — Column — Graph — Document
Key-Value Store
Pros:
– very fast
– very scalable
– simple model
– able to distribute horizontally
Cons:
– structures (objects) can't be easily modeled as key-value pairs
Column Stores
Row-oriented vs. column-oriented layout of the same table:
id | username | email         | department
1  | John     | john@foo.com  | Sales
2  | Mary     | mary@foo.com  | Marketing
3  | Yoda     | yoda@foo.com  | IT
Row-oriented storage keeps each row contiguous; column-oriented storage keeps each column contiguous.
Graph Database
An introduction to MongoDB — I-Jen Chiang
Why NoSQL
• Too much data: the database became too large to fit into a single database table on a single machine
• Data volume was growing – FAST
• Data wasn't all consistent with a specific, well-defined schema
• Time was critical
Document Stores
• The store is a container for documents
– Fields may or may not have type definitions
• e.g., XSDs for XML stores, vs. schema-less JSON stores
• Can create "secondary indexes"
– These provide the ability to query on any document field
• Operations:
– Insert and delete documents
– Update fields within documents
MongoDB
MongoDB is a scalable, high-performance, open-source NoSQL database.
• Document-oriented storage
• Full index support
• Replication & high availability
• Auto-sharding
• Querying
• Fast in-place updates
• Map/Reduce
• GridFS
Theory of NoSQL: CAP
• Many nodes; nodes contain replicas of partitions of data
• Consistency (C) – all replicas contain the same version of data
• Availability (A) – system remains operational on failing nodes
• Partition tolerance (P) – multiple entry points; system remains operational on system split
CAP Theorem: satisfying all three at the same time is impossible
Schema-Less
Pros:
– data modeled as flexible key-value pairs
– eventual consistency
– many are distributed
– still provide excellent performance and scalability
Cons:
– typically no ACID transactions or joins
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
What is NoSQL giving up?
• joins
• group by
• order by
• ACID transactions
• SQL as a sometimes frustrating but still powerful query language
• easy integration with other applications that support SQL
Cassandra
• Originally developed at Facebook
• Uses the Dynamo eventual-consistency model
• Written in Java
• Open-sourced and exists within the Apache family
• Uses Apache Thrift as its API
Cassandra
• Tunable consistency.
• Decentralized.
• Writes are faster than reads.
• No single point of failure.
• Incremental scalability.
• Uses consistent hashing (logical partitioning) when clustered.
• Hinted handoffs.
• Peer-to-peer routing (ring).
• Thrift API.
• Multi-data-center support.
CouchDB
• Availability and partition tolerance.
• Views are used to query; Map/Reduce.
• Multiple concurrent versions; no locks.
– A little overhead with this approach due to garbage collection.
– Conflict resolution.
• Very simple, REST-based.
• Schema-free.
• Shared-nothing, seamless peer-based bi-directional replication.
• Auto compaction (manual with MongoDB).
• Uses B-trees.
• Documents and indexes are kept in memory and flushed to disc periodically.
• Documents have states; in case of a failure, recovery can continue from the state the documents were left in.
• No built-in auto-sharding; there are open-source projects.
• You can't define your indexes.
MongoDB
• Data types: bool, int, double, string, object (BSON), oid, array, null, date.
• Databases and collections are created automatically.
• Lots of language drivers.
• Capped collections are fixed-size collections: buffers, very fast, FIFO, good for logs. No indexes.
• Object ids are generated by the client: 12 bytes of packed data — 4-byte time, 3-byte machine, 2-byte pid, 3-byte counter.
• Possible to refer to other documents in different collections, but more efficient to embed documents.
• Replication is very easy to set up. You can read from slaves.
Document store
RDBMS → MongoDB mapping:
• Database → Database
• Table, View → Collection
• Row → Document (JSON, BSON)
• Column → Field
• Index → Index
• Join → Embedded Document
• Foreign Key → Reference
• Partition → Shard

> db.user.findOne({age: 39})
{
  "_id" : ObjectId("5114e0bd42…"),
  "first" : "John",
  "last" : "Doe",
  "age" : 39,
  "interests" : [ "Reading", "Mountain Biking" ],
  "favorites" : { "color" : "Blue", "sport" : "Soccer" }
}
MongoDB
• Connection pooling is done for you.
• Supports aggregation.
– Map/Reduce with JavaScript.
• You have indexes, B-trees. Ids are always indexed.
• Updates are atomic. Low-contention locks.
• Querying Mongo is done with a document:
– Lazy, returns a cursor.
– Reducible to SQL: select, insert, update, limit, sort, etc.
– There is more: upsert (either inserts or updates)
– Several operators: $ne, $and, $or, $lt, $gt, $inc, and so on.
• The Repository Pattern makes development very easy.
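The aggregation bullet above mentions map/reduce with JavaScript; a small illustrative example (collection and field names are hypothetical), summing order amounts per customer in the mongo shell:

db.orders.mapReduce(
  function () { emit(this.custId, this.amt); },           // map: key = customer
  function (key, values) { return Array.sum(values); },   // reduce: sum amounts
  { out: "order_totals" }
);
// db.order_totals.find()  ->  { _id: <custId>, value: <total amt> }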
Terminology
RDBMS → MongoDB
• Table, View → Collection
• Row → Document (JSON, BSON)
• Column → Field
• Index → Index
• Join → Embedded Document
• Foreign Key → Reference
• Partition → Shard
Features (agile)
• Document-oriented storage
• Replication & high availability
• Auto-sharding
• Querying
• Fast in-place updates
• Map/Reduce
MongoDB – Sharding
Replica set: C1 (mongod), C2 (mongod), C3 (mongod)
Config servers: keep the mapping
Mongos: routing servers
Mongod: master-slave replicas
MongoDB Data Analysis
CRUD
Create:
db.collection.insert( <document> )
db.collection.save( <document> )
db.collection.update( <query>, <update>, { upsert: true } )
Read:
db.collection.find( <query>, <projection> )
db.collection.findOne( <query>, <projection> )
Update:
db.collection.update( <query>, <update>, <options> )
Delete:
db.collection.remove( <query>, <justOne> )
SQL to MongoDB
SQL Statement → Mongo Statement
• CREATE TABLE mycoll (a Number, b Number) → db.createCollection("mycoll")
• SELECT a, b FROM users → db.users.find({}, {a:1, b:1})
• SELECT * FROM users WHERE name LIKE "%Joe%" → db.users.find({name: /Joe/})
• SELECT * FROM users WHERE a = 1 AND b = 'q' → db.users.find({a: 1, b: 'q'})
• SELECT COUNT(*) FROM users → db.users.count()
• CREATE INDEX myindexname ON users(name) → db.users.ensureIndex({name: 1})
• UPDATE users SET a = 1 WHERE b = 'q' → db.users.update({b: 'q'}, {$set: {a: 1}}, false, true)
• DELETE FROM users WHERE z = "abc" → db.users.remove({z: 'abc'})
BSON
• JSON has a powerful, but limited, set of datatypes
– e.g., strings, numbers, booleans, arrays, objects
• BSON is a binary representation of JSON
– Adds extra datatypes such as Date, Int types, Id, …
– Optimized for performance and navigational abilities
– And compression
• MongoDB sends and stores data in BSON
– bsonspec.org
(screenshot slides: Mongo Document, Collection, Query, Modification, Select, Query Stage)
CRUD example (Create, Read, Update, Delete)
> db.user.insert({ first: "John", last: "Doe", icd: [ 250, 151 ], age: 39 })
> db.user.find()
{ "_id" : ObjectId("51…"), "first" : "John", "last" : "Doe", "age" : 39 }
> db.user.update({ "_id" : ObjectId("51…") }, { $set: { age: 40, salary: 7000 } })
> db.user.remove({ "first": /^J/ })
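The same four operations can be issued from a driver. A minimal sketch using the legacy 2.x-era MongoDB Java driver (MongoClient/DBCollection API); the host, port and database name here are assumptions, not part of the original slide:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class UserCrud {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017); // assumed host/port
        DB db = client.getDB("test");                             // assumed database name
        DBCollection users = db.getCollection("user");

        // Create: insert a document, mirroring the shell example above
        users.insert(new BasicDBObject("first", "John")
                .append("last", "Doe")
                .append("age", 39));

        // Read: find one document by a field value
        DBObject doc = users.findOne(new BasicDBObject("age", 39));
        System.out.println(doc);

        // Update: $set new field values on the matched document
        users.update(new BasicDBObject("first", "John"),
                new BasicDBObject("$set", new BasicDBObject("age", 40)));

        // Delete: remove matching documents
        users.remove(new BasicDBObject("first", "John"));
        client.close();
    }
}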
<br />
Import Excel into Mongodb
• mongoimport --db dbname --type csv --headerline --file /directory/file.csv
• Ex. mongoimport -d mydb -c things --type csv --file locations.csv --headerline
Export Mongodb to Excel
• mongoexport --host localhost --port 27017 --username abc --password 12345 --collection collName --csv --fields id,sex,birthday,icd --out all_patients.csv --db my_db --query "{\"_id\": {\"\$oid\": \"5058ca07b7628c0999000006\"}}"
Blog Post Document
> p = { author: "Chris",
        date: new ISODate(),
        text: "About MongoDB...",
        tags: ["tech", "databases"] }
> db.posts.save(p)

Querying
> db.posts.find()
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Chris",
  date : ISODate("2012-02-02T11:52:27.442Z"),
  text : "About MongoDB...",
  tags : [ "tech", "databases" ] }

Note: _id is unique, but can be anything you'd like.
Insertion
db.unicorns.insert({name: 'Horny', dob: new Date(1992,2,13,7,47), loves: ['carrot','papaya'], weight: 600, gender: 'm', vampires: 63});
db.unicorns.insert({name: 'Aurora', dob: new Date(1991,0,24,13,0), loves: ['carrot','grape'], weight: 450, gender: 'f', vampires: 43});
db.unicorns.insert({name: 'Unicrom', dob: new Date(1973,1,9,22,10), loves: ['energon','redbull'], weight: 984, gender: 'm', vampires: 182});
db.unicorns.insert({name: 'Roooooodles', dob: new Date(1979,7,18,18,44), loves: ['apple'], weight: 575, gender: 'm', vampires: 99});
db.unicorns.insert({name: 'Solnara', dob: new Date(1985,6,4,2,1), loves: ['apple','carrot','chocolate'], weight: 550, gender: 'f', vampires: 80});
db.unicorns.insert({name: 'Ayna', dob: new Date(1998,2,7,8,30), loves: ['strawberry','lemon'], weight: 733, gender: 'f', vampires: 40});
db.unicorns.insert({name: 'Kenny', dob: new Date(1997,6,1,10,42), loves: ['grape','lemon'], weight: 690, gender: 'm', vampires: 39});
db.unicorns.insert({name: 'Raleigh', dob: new Date(2005,4,3,0,57), loves: ['apple','sugar'], weight: 421, gender: 'm', vampires: 2});
db.unicorns.insert({name: 'Leia', dob: new Date(2001,9,8,14,53), loves: ['apple','watermelon'], weight: 601, gender: 'f', vampires: 33});
db.unicorns.insert({name: 'Pilot', dob: new Date(1997,2,1,5,3), loves: ['apple','watermelon'], weight: 650, gender: 'm', vampires: 54});
db.unicorns.insert({name: 'Nimue', dob: new Date(1999,11,20,16,15), loves: ['grape','carrot'], weight: 540, gender: 'f'});
db.unicorns.insert({name: 'Dunx', dob: new Date(1976,6,18,18,18), loves: ['grape','watermelon'], weight: 704, gender: 'm', vampires: 165});
Master Selector
Selector: {field: value}
db.unicorns.find({gender: 'm', weight: {$gt: 700}})
• or (not quite the same thing, but for demonstration purposes)
db.unicorns.find({gender: {$ne: 'f'}, weight: {$gte: 701}})

$lt, $lte, $gt, $gte and $ne are used for less than, less than or equal, greater than, greater than or equal and not equal operations.
$exists and $or
• db.unicorns.find({vampires: {$exists: false}})
• db.unicorns.find({gender: 'f', $or: [{loves: 'apple'}, {loves: 'orange'}, {weight: {$lt: 500}}]})
Indexing
// 1 means ascending, -1 means descending
> db.unicorns.ensureIndex({name: 1})
> db.unicorns.findOne({name: 'Kenny'})
{name: 'Kenny', dob: new Date(1997,6,1,10,42), loves: ['grape','lemon'], weight: 690, gender: 'm', vampires: 39}

Indexing on Multiple Fields
• // 1 means ascending, -1 means descending
  db.posts.ensureIndex({author: 1, ts: -1})
• Query:
  db.posts.find({author: 'Chris'}).sort({ts: -1})
• Return:
  [ { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author: "Chris", ...},
    { _id : ObjectId("4f61d325c496820ceba84124"), author: "Chris", ...} ]
GIS
location1 = {
  name: "10gen East Coast",
  city: "New York",
  zip: "10011",
  tags: ["business", "mongodb"],
  latlong: [40.0, 72.0]
}
db.locations.ensureIndex({latlong: "2d"})
db.locations.find({latlong: {$near: [40, 70]}})

GIS - Place
location1 = {
  name: "10gen HQ",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  latlong: [40.0, 72.0],
  tags: ["business", "cool place"],
  tips: [
    {user: "nosh", time: 6/26/2010, tip: "stop by for office hours on Wednesdays from 4-6pm"},
    {.....}
  ]
}
Querying your Places
Creating your indexes:
  db.locations.ensureIndex({tags: 1})
  db.locations.ensureIndex({name: 1})
  db.locations.ensureIndex({latlong: "2d"})
Finding places:
  db.locations.find({latlong: {$near: [40, 70]}})
With regular expressions:
  db.locations.find({name: /^typeaheadstring/})
By tag:
  db.locations.find({tags: "business"})

Inserting and updating locations
Initial data load:
  db.locations.insert(place)
Using update to add tips:
  db.locations.update({name: "10gen HQ"},
    {$push: {tips: {user: "nosh", time: 6/26/2010, tip: "stop by for office hours on Wednesdays from 4-6"}}})
Requirements
• Locations
  - Need to store locations (offices, restaurants, etc.)
    • Want to be able to store name, address and tags
    • Maybe user-generated content, i.e. tips / small notes?
  - Want to be able to find other locations nearby
• Users
  - User should be able to 'check in' to a location
  - Want to be able to generate statistics
Users
user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  . . .
  checkins: [{ location: "10gen HQ", ts: 9/20/2010 10:12:00, … }, … ]
}

Simple Stats
db.users.find({'checkins.location': "10gen HQ"})

db.checkins.find({'checkins.location': "10gen HQ"})
  .sort({ts: -1}).limit(10)

db.checkins.find({'checkins.location': "10gen HQ",
  ts: {$gt: midnight}}).count()

Alternative
user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  . . .
  checkins: [4b97e62bf1d8c7152c9ccb74, 5a20e62bf1d8c736ab]
}
checkins[] holds ObjectId references to the locations collection

User Check-in
Check-in = 2 ops:
  read the location to obtain its location id
  update ($push) the location id onto the user object
Queries - find all locations where a user checked in:
  checkin_array = db.users.find({..}, {checkins: true}).checkins
  db.location.find({_id: {$in: checkin_array}})
Query Operators
Conditional operators:
  $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type
  $lt, $lte, $gt, $gte

• find posts with any tags
  db.posts.find({tags: {$exists: true}})
• find posts matching a regular expression
  db.posts.find({author: /…/})
• count posts by author
  db.posts.find({author: 'Chris'}).count()
Examine the query plan
Query:
  db.posts.find({"author": 'Chris'}).explain()
Result:
  { "cursor" : "BtreeCursor author_1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "indexBounds" : { "author" : [ [ "Chris", "Chris" ] ] } }
Atomic Operators
$set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit

• Create a comment
  new_comment = { author: "Fred",
                  date: new Date(),
                  text: "Best Post Ever!" }
• Add to post
  db.posts.update({ _id: "..." },
                  { "$push": {comments: new_comment},
                    "$inc":  {comments_count: 1} });

Nested Documents
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Chris",
  date : "Thu Feb 02 2012 11:50:01",
  text : "About MongoDB...",
  tags : [ "tech", "databases" ],
  comments : [{ author : "Fred",
                date : "Fri Feb 03 2012 13:23:11",
                text : "Best Post Ever!" }],
  comment_count : 1 }
Secondary Indexing
• // Index nested documents
  > db.posts.ensureIndex({"comments.author": 1})
  > db.posts.find({"comments.author": "Fred"})
• // Index on tags (multi-key index)
  > db.posts.ensureIndex({tags: 1})
  > db.posts.find({ tags: "tech" })
GEO
• Geo-spatial queries
• Require a geo index
• Find points near a given point
• Find points within a polygon/sphere
// geospatial index
> db.posts.ensureIndex({"author.location": "2d"})
> db.posts.find({"author.location": {$near: [22, 42]}})

Memory Mapped Files
• A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource.[1]
• mmap()

[1] http://en.wikipedia.org/wiki/Memory-mapped_file
Replica Sets
• Redundancy and failover
• Zero-downtime upgrades and maintenance
• Master-slave replication
  - Strong consistency
  - Delayed consistency
• Geospatial features
(diagram: a Client connected to replica1 on Host1:10000, Host2:10001, Host3:10002)
Sharding
• Partition your data
• Scale write throughput
• Increase capacity
• Auto-balancing
(diagram: a Client connected to shard 1 on Host1:10000 and shard 2 on Host2:10010, with configdb on Host3:20000 and Host4:30000)

Sharding - horizontal scaling
Unsharded Deployment
(diagram: a Primary replicating to Secondaries)
• Configure as a replica set for automated failover
• Async replication between nodes

High Throughput
• Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally. For example, to insert data, the application only needs to access the shard responsible for that record.
• Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows. For example, if a database holds 1 terabyte of data and there are 4 shards, then each shard might hold only 256 GB of data. If there are 40 shards, then each shard might hold only 25 GB of data.
Sharded Deployment
(diagram: MongoS routing, via config servers, to Primary/Secondary replica sets)
• Autosharding distributes data among two or more replica sets
• Mongo Config Server(s) handle distribution & balancing
• Transparent to applications

Sharded Cluster
Main components
• Shard
  - Each shard can be a single mongod or a replica set
• Config Server (metadata storage)
  - Stores cluster chunk ranges and locations
  - Can be only 1 or 3 (production must have 3)
  - Not a replica set
• Mongos (routing process)
  - Acts as a router / balancer
  - No local data (persists to the config database)
  - Can be 1 or many
Relational Database Clustering
(figure: a hash index over a data file of pets - parrot Elmer, cat Mittens, cat Natasha, parakeet Tweety, dog Buck, dog Lassie, goat Gertrude - bucketed into slots 0-6)
This is a hash index:
  h(k) = (1st letter of k) mod 7, with a = 0, b = 1, …, z = 25
  e.g. c = 2, d = 3, g = 6, p = 15, so h(cat) = 2, h(dog) = 3, h(goat) = 6, h(parrot) = 15 mod 7 = 1
Deploy a sharded cluster
• Start the config server database instances
  - mkdir /data/configdb
  - mongod --configsvr --dbpath <path> --port <port>
    • mongod --configsvr --dbpath /data/configdb --port 27019
• Start the mongos instances (27017)
  - mongos --configdb <config server host:port list>
    • mongos --configdb cfg0.example.net:27019,cfg1.example.net:27019,cfg2.example.net:27019

Deploy a sharded cluster (contd)
• Add shards to the cluster
  - mongo --host <hostname of machine running mongos> --port <port mongos listens on>
    • mongo --host mongos0.example.net --port 27017
  - sh.addShard()
    • sh.addShard( "rs1/mongodb0.example.net:27017" )
Enable Sharding for a Database
• mongo --host <hostname of machine running mongos> --port <port mongos listens on>
• sh.enableSharding("<database>") ~ db.runCommand( { enableSharding: <database> } )

Before you can shard a collection, you must enable sharding for the collection's database.
Chunk Partitioning
A chunk is a section of the entire shard-key range.

Chunk splitting
• A chunk is split once it exceeds the maximum size
• There is no split point if all documents have the same shard key
• A chunk split is a logical operation (no data is moved)
Range based sharding

Hash based Sharding
Linear Hashing: Example
(figure: a linear hash table, buckets 0-4, holding the keys 4, 5, 6, 7, 8, 9, 11, 13, 17; the keys 15 and 3 are now inserted)

Linear Hashing: Example (contd)
(figure: the table after inserting 15 and 3 - the overflow triggers a split, and the keys of the split bucket are redistributed between the original bucket and a new image bucket using h1)
Linear Hashing: Search
h0(x) = x mod N      (for the un-split buckets)
h1(x) = x mod 2N     (for the split ones)
(figure: the bucket layout from the previous slide, showing which hash function applies to each bucket)
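A minimal sketch of this lookup rule in Java. The slide gives only the two hash functions, so the split pointer `next` (marking how many buckets have already been split) and the simplified one-round split routine are assumptions:

import java.util.ArrayList;
import java.util.List;

// Linear hashing: h0(x) = x mod N for un-split buckets,
// h1(x) = x mod 2N for buckets that have already been split.
public class LinearHashTable {
    private final int n;                   // N: initial number of buckets
    private final int capacity;            // max keys per bucket before a split
    private int next = 0;                  // split pointer: buckets [0, next) are split
    private final List<List<Integer>> buckets = new ArrayList<>();

    public LinearHashTable(int n, int capacity) {
        this.n = n;
        this.capacity = capacity;
        for (int i = 0; i < n; i++) buckets.add(new ArrayList<>());
    }

    // The search rule from the slide: try h0, fall back to h1 for split buckets.
    private int addressOf(int key) {
        int b = key % n;                   // h0
        if (b < next) b = key % (2 * n);   // bucket already split: use h1
        return b;
    }

    public boolean contains(int key) {
        return buckets.get(addressOf(key)).contains(key);
    }

    // Insert a key; on overflow, split bucket `next` (one round only, for brevity).
    public void insert(int key) {
        List<Integer> target = buckets.get(addressOf(key));
        target.add(key);
        if (target.size() > capacity && next < n) {
            buckets.add(new ArrayList<>());         // image bucket at index next + N
            List<Integer> keep = new ArrayList<>();
            for (int k : buckets.get(next)) {
                if (k % (2 * n) == next) keep.add(k);
                else buckets.get(next + n).add(k);  // rehash with h1
            }
            buckets.set(next, keep);
            next++;                                 // advance the split pointer
        }
    }
}

For instance, new LinearHashTable(4, 4) loaded with the keys above and then given 15 and 3 would overflow a bucket and trigger one split, mirroring the figures.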
<br />
Enable Sharding for a Collection
• Determine what you will use for the shard key. Your selection of the shard key affects the efficiency of sharding.
• If the collection already contains data, you must create an index on the shard key using ensureIndex(). If the collection is empty, then MongoDB will create the index as part of the sh.shardCollection() step.
• Enable sharding for a collection by issuing the sh.shardCollection() method in the mongo shell. The method uses:
  - sh.shardCollection("<database>.<collection>", shard-key-pattern)
    sh.shardCollection("records.people", { "zipcode": 1, "name": 1 } )
    sh.shardCollection("people.addresses", { "state": 1, "_id": 1 } )
    sh.shardCollection("assets.chairs", { "type": 1, "_id": 1 } )
    sh.shardCollection("events.alerts", { "_id": "hashed" } )
Mixed
(diagram: a mixed cluster - shard1 is the replica set replica1 on Host1:10000, Host2:10001 and Host3:10002, …, shardn is on Host4:10010; further hosts Host5:20000, Host6:30000 and Host7:30000; a Client connects through the routing layer)
Map/Reduce
db.collection.mapReduce(
  <mapfunction>,
  <reducefunction>,
  { out: <collection>,
    query: <>,
    sort: <>,
    limit: <number>,
    finalize: <function>,
    scope: <>,
    jsMode: <boolean>,
    verbose: <boolean> } )

var mapFunction1 = function() { emit(this.cust_id, this.price); };
var reduceFunction1 = function(keyCustId, valuesPrices) {
  return sum(valuesPrices);
};

Map Reduce
• The caller provides map and reduce functions:
// Emit each tag
> map = "this['tags'].forEach( function(item) {emit(item, 1);} );"
// Calculate totals
> reduce = "function(key, values) { var total = 0; var valuesSize = values.length; for (var i = 0; i < valuesSize; i++) { total += parseInt(values[i], 10); } return total; };"

// run the map reduce
> db.posts.mapReduce(map, reduce, {"out": {inline: 1}});

Answer:
{ "results" : [ {"_id" : "databases", "value" : 1},
                {"_id" : "tech", "value" : 1} ],
  "timeMillis" : 1,
  "counts" : { "input" : 1, "emit" : 2, "reduce" : 0, "output" : 2 },
  "ok" : 1 }
References
• http://docs.mongodb.org/manual/core/inter-process-authentication
• http://api.mongodb.org/python/2.6.2/examples/authentication.html
• https://securosis.com/assets/library/reports/SecuringBigData_FINAL.pdf
• http://docs.mongodb.org/manual/reference/user-privileges
• http://www.slideshare.net/DefconRussia/firstov-attacking-mongo-db
Case: CRM I-Jen Chiang

Relationship Marketing
• Relationship marketing is a process of
  - communicating with your customers
  - listening to their responses
• Companies take actions
  - marketing campaigns
  - new products
  - new channels
  - new packaging
Relationship Marketing -- continued
• Customers and prospects respond
  - the most common response is no response
• This results in a cycle
  - data is generated
  - opportunities emerge to learn from the data and improve the process

An Illustration
• A few years ago, UPS went on strike
• During the strike, shippers moved their business to FedEx
• After the strike, FedEx's volume fell
• FedEx identified those customers whose FedEx volumes had increased and then decreased
• These customers were using UPS again
• FedEx made special offers to these customers to get all of their business
The Corporate Memory
• Several years ago, Land's End could not recognize regular Christmas shoppers
  - some people generally don't shop from catalogs
  - but spend hundreds of dollars every Christmas
  - if you only store 6 months of history, you will miss them
• Victoria's Secret builds customer loyalty with a no-hassle returns policy
  - some "loyal customers" return several expensive outfits each month
  - they are really "loyal renters"

CRM Requires Learning and More
• Form a learning relationship with your customers
  - Notice their needs
    • On-line Transaction Processing Systems
  - Remember their preferences
    • Decision Support Data Warehouse
  - Learn how to serve them better
    • Data Mining
  - Act to make customers more profitable
Customer Relationship Management (CRM)

Traditional Marketing                            CRM
Goal: increase market share by mass marketing    Goal: establish a long-term, one-to-one relationship with customers; understanding their needs, preferences, expectations
Product-oriented view                            Customer-oriented view
Mass marketing / mass production                 Mass customization, one-to-one marketing
Standardization of customer needs                Customer-supplier relationship
Transactional relationship                       Relational approach
What is CRM?
• "The approach of identifying, establishing, maintaining and enhancing lasting relationships with customers."
• "The formation of bonds between a company and its customers."

Strategies in CRM for Mass Customization
• Acquisition
• Loyalty
• Cross-selling / Up-selling
• Win back or Save
Business Processes Organize Around the Customer Lifecycle
(figure: Prospect → New Customer → Established Customer, branching into High Value, Low Value and Potential; Voluntary Churn and Forced Churn lead to Former Customer, and Winback targets former customers)
Strategic Customer Relationship
(figure: the entire customer universe under behavioral and demographic clustering/segmentation - satisfy customers before they defect; spend less marketing $ on some segments and charge higher rates; retain loyal customers; find cross-sell and up-sell opportunities; get defected customers back; market distinct portfolios in sequence to distinct demographic groups; use customer basket analysis and product associations to market the missing products)

Marketing Strategy
(figure: clustering purchasing behaviors - high-valued customers: keep through recognition; mid-valued customers: cross-sell and up-sell; low-valued customers: get the best benefit with low costs; win back before attrition; demography clustering feeds product design and refinement, product classifications and differential segmentation - find customers, find products)
The Profit/Loss Matrix
Someone who scores in the top 30% is predicted to respond.
Those predicted to respond cost $1; those who actually respond yield a gain of $45; those who don't respond yield nothing.

                     ACTUAL
                     YES      NO
PREDICTED   YES      $44      -$1
            NO       $0       $0

Those not predicted to respond cost $0 and yield no gain.
Marketing Plan
• Correctly identify the customer requirements
• Define promotions for customers
• Adjust the flight schedule
Recommendation EC
(figure: a multiple and effective merchandising platform with cross-product (different types) recommendation - online retailers and a marketplace exchange aggregated recommendations and recommendation-based purchases through an optional anonymizing filter; "My Library & Preferences" holds the user's library of all media types, the user profile ↔ recommendations mapping, and the user's preferences)

Information/data integration
"Find houses with 2 bedrooms priced under 200K"
(figure: a new faculty member queries sources on the Web which provide house listings - realestate.com, homeseekers.com, homes.com)
Architecture of Data Integration System
Simply pose the query against the mediated schema:
"Find houses with 2 bedrooms priced under 200K"
(figure: the mediated schema maps to source schemas 1, 2 and 3 for realestate.com, homeseekers.com and homes.com)

Semantic Matches between Schemas
The schema-matching problem is to find semantic mappings between the elements of the two schemas.
(figure: mediated-schema fields price, agent-name and address against source-schema fields listed-price (320K, 240K), contact-name (Jane Brown, Mike Smith), city and state (Seattle WA, Miami FL); price ↔ listed-price is a 1-1 match, while address ↔ (city, state) is a complex match)
Big Data Platforms and Paradigms Sourangshu Bhattacharya (CSE)

Outline
• Big Data
  - What is Big Data?
  - Challenges with Big Data Processing
  - Hadoop - HDFS
  - Map Reduce
  - PIG
• Analytics
  - Basic Statistics
  - Text Analytics
  - SQL Queries
What is Big Data?
• 6 billion web queries per day: ~6 TB per day, ~2.5 PB per year
• 10 billion display ads per day: ~15 TB per day, ~5.5 PB per year
• 30 billion text ads per day: ~30 TB per day, ~11 PB per year
• Credit card transactions per day: ~150 GB per day, ~5.5 TB per year
• 100 billion emails per day: ~1 PB per day, ~360 PB per year
What is Big Data? (contd)
• CERN - Large Hadron Collider: ~10 PB/year at start, ~1000 PB in ~10 years; 2500 physicists collaborating
• Large Synoptic Survey Telescope (NSF, DOE, and private donors): ~5-10 PB/year at start in 2012, ~100 PB by 2025
• Pan-STARRS (Haleakala, Hawaii), US Air Force: now 800 TB/year, soon 4 PB/year
Courtesy: King et al., IEEE Big Data 2013.
Big Data Challenges
• 3 V's: Volume, Variety, Velocity
• Scalability, cost-effectiveness, flexibility, fault tolerance
• Fault tolerance:

    Computers                      1      10     100
    Chance of failure in an hour   0.01   0.09   0.63

  (with a 0.01 per-machine failure rate, 1 - 0.99^10 ≈ 0.09 and 1 - 0.99^100 ≈ 0.63)
• Data locality: computation goes to the data
What is Hadoop?
• A scalable, fault-tolerant distributed system for data storage and processing.
• Core Hadoop:
  - Hadoop Distributed File System (HDFS)
  - Hadoop YARN: job scheduling and cluster resource management
  - Hadoop Map Reduce: framework for distributed data processing
• Related processing frameworks:
  - Hadoop Map Reduce: batch processing
  - Spark: distributed in-memory processing
  - Storm: distributed online processing
  - Graphlab: distributed processing on graphs
• Open source system with large community support: https://hadoop.apache.org/

Hadoop Architecture
Courtesy: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html
HDFS
• Assumptions:
  - Streaming data access: write once, read many times.
  - High throughput, not low latency.
  - Large datasets.
• Characteristics:
  - Performs best with a modest number of large files
  - Optimized for streaming reads
  - Layered on top of the native file system.

HDFS (contd)
• Data is organized into files and directories.
• Block placement is known at the time of read, so computation is moved to the same node.
• Replication is used for:
  - Speed
  - Fault tolerance
  - Self-healing
What is Map Reduce?
• A method for distributing a task across multiple servers.
• Proposed by Dean and Ghemawat, 2004.
• Consists of two developer-created phases:
  - Map
  - Reduce
• In between Map and Reduce is the Shuffle and Sort phase.
• Motivating example: what was the max/min temperature for the last century?
Map Phase
• The user writes the mapper method.
• Input is a single record, e.g. a row of an RDBMS table, a line of a text file, etc.
• Output is a set of records of the form <key, value>
  - Both key and value can be anything, e.g. text, numbers, etc.
  - E.g. for a row of an RDBMS table: <column id, value>
  - For a line of a text file: <word, count>

Shuffle/Sort phase
• The shuffle phase ensures that all mapper output records with the same key value go to the same reducer.
• Sort ensures that, among the records received at each reducer, records with the same key arrive together.

Reduce phase
• The reducer is a user-defined function which processes the mapper output records for some of the keys output by the mapper.
• Input is of the form <key, value>
  - All records having the same key arrive together.
• Output is a set of records of the form <key, value>
  - The key is not important here.
Parallel picture
(figure: many mappers running in parallel feed, through shuffle/sort, many reducers running in parallel)

Example
• Word Count: count the total number of occurrences of each word
(figure: sample text flowing through the Map and Reduce stages)
Hadoop Map Reduce
• Provides:
  - Fault tolerance
  - Methods for interfacing with HDFS for colocation of computation and storage of output
  - Status and monitoring tools
  - API in Java
  - Arbitrary executables through Hadoop streaming
Word Count

Word Count: Mapper
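The mapper slide is a code screenshot in the original deck. A minimal sketch of such a mapper against the standard org.apache.hadoop.mapreduce API (the class and variable names here are illustrative assumptions, not the slide's exact code):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every token of every input line.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // one output record per word occurrence
        }
    }
}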
<br />
Word Count: Reducer
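The reducer slide is likewise a screenshot; a minimal matching sketch (same assumptions as above):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> after shuffle/sort and emits <word, total>.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();  // total occurrences of this word
        result.set(sum);
        context.write(key, result);
    }
}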
<br />
Word Count: Main
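And a minimal driver wiring the two together (again a sketch standing in for the screenshot; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits the word-count job.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}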
<br />
Hadoop Streaming
• Allows the user to specify mappers and reducers as executables.
• The Mapper launches the mapper executable and pipes the input data to the stdin of the executable.
• The mapper executable processes the data and writes its output to stdout, which is collected by the Mapper.
• Mapper key and value are separated by a tab.
• The Reducer operates in the same way.
Word count in perl
• Main command:
  hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.6.0.jar -input /tmp/small_svm_data -output /user/sourangshu/output -mapper "/usr/bin/perl wcmap.pl" -reducer "/usr/bin/perl wcreduce.pl" -file wcmap.pl -file wcreduce.pl
• Mapper: wcmap.pl (shown as a screenshot in the original)

Word count in perl (contd)
• Reducer: wcreduce.pl (shown as a screenshot in the original)
Pig
• Pig is a system for fast development of big data processing algorithms.
• Pig has two components:
  - The Pig Latin language
  - An interpreter for Pig Latin
• A program expressed in Pig Latin is translated into a set of map reduce jobs.
• Users interact with the Pig Latin language.
Pig Latin
• A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
  - Exception: LOAD and STORE statements, which read from and write to files.
• A general Pig program looks like:
  - A LOAD statement reads data from the file system.
  - A series of "transformation" statements process the data.
  - A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.
Pig Latin Datatypes
• A relation is a bag (more specifically, an outer bag).
• A bag is a collection of tuples, e.g. {(19,2), (18,1)}
• A tuple is an ordered set of fields, e.g. (19,2)
• A map is a set of key value pairs, e.g. [open#apache]
• A field is a piece of data
• A field can be of type:
  - Scalar: int, long, float, double
  - Array: chararray, bytearray
  - Complex: bag, tuple, map

Example
• A = LOAD 'student' USING PigStorage() AS (name: chararray, age: int, gpa: float);
• DUMP A;
• Output:
  (John,18,4.0F)
  (Mary,19,3.8F)
  (Bill,20,3.9F)
  (Joe,18,3.8F)
Pig Latin
• FOREACH .. GENERATE .. applies an operation to each element of a bag.
• Example:
  A = LOAD 'data' AS (f1,f2,f3);
  B = FOREACH A GENERATE f1 + 5;
  C = FOREACH A GENERATE f1 + f2;

Pig Latin
• Filtering:
  X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
• Grouping:
  X = GROUP A BY f1;
  DUMP X;
  (1,{(1,2,3)})
  (4,{(4,2,1),(4,3,3)})
  (8,{(8,3,4)})
• Grouping generates an inner bag.
Pig Latin
• Cogrouping: like GROUP, but groups the tuples of two or more relations by a common field (the example is a screenshot in the original).

Pig Latin Operators on Bags
• AVG: computes the average of the numeric values in a single-column bag.
Pig Latin Operators
• COUNT: computes the number of elements in a bag.
• MAX: computes the maximum.
• MIN: computes the minimum.
• SUM: computes the sum of the numeric values.
• SIZE: computes the number of elements based on any Pig data type.
Pig Latin
• FLATTEN converts inner bags to multiple tuples:
  X = FOREACH C GENERATE group, FLATTEN(A);
  DUMP X;
  (1,1,2,3)
  (4,4,2,1)
  (4,4,3,3)
  (8,8,3,4)
  (8,8,4,3)

Example: Projection
• X = FOREACH A GENERATE a1, a2;
• DUMP X;
  (1,2)
  (4,2)
  (8,3)
  (7,2)
  (8,4)
Business Analytics for Decision Making Ram Babu Roy

Decision Making
• Definition
  - Selecting the best solution from two or more alternatives
• Levels of managerial decision making, by structure and information characteristics:
  - Unstructured: Strategic Management (executives and directors); information is ad hoc, unscheduled, summarized, infrequent, forward-looking, external, wide in scope
  - Semistructured: Tactical Management (business unit managers and self-directed teams)
  - Structured: Operational Management (operating managers and self-directed teams); information is prespecified, scheduled, detailed, frequent, historical, internal, narrow in focus

Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory (http://posmit.postech.ac.kr)
Analysis, Analytics, and BI
• Analysis: the process of careful study of something to learn about its components, what they do, and why
• Analytics: the application of scientific methods and techniques for analysis
• Business Intelligence and Analytics[1]: techniques, technologies, systems, practices, and methodologies to help an enterprise better understand its business and market and make timely business decisions

[1] Source: Business Intelligence and Analytics: From Big Data to Big Impact, H. Chen et al., MIS Quarterly, Dec 2012

BI&A Overview
(figure; source: Business Intelligence and Analytics: From Big Data to Big Impact, H. Chen et al., MIS Quarterly, Dec 2012)

Big Data and Traditional Analytics
(figure; source: Big Data at Work, Davenport, HBS)
Which type of analytics is appropriate?
• Once you gather data, you build a model of how these data work.
• Descriptive: to condense big data into smaller, more useful nuggets of information
• Inquisitive: why is something happening; validate/reject hypotheses
• Predictive: to forecast what might happen in the future; uses statistical modeling, data mining, and machine learning techniques (e.g. whether sentiment is positive or negative)
• Prescriptive: recommending one or more courses of action and showing the likely outcome of each decision; requires a predictive model with two additional components: actionable data and a feedback system

Source: http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279
Evolution of DSS into Business Intelligence
• Change in the use of DSS
  - Specialist → Managers → Whomever, whenever, wherever
• Emergence of "Business Intelligence"
  (figure: OLAP, Data Warehousing, Data Mining and Intelligent Systems feeding into Business Intelligence)

Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory (http://posmit.postech.ac.kr)

Business Intelligence (BI)
• Definition
  - An umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies
• Objective
  - To provide managers with the ability to conduct analysis
  (figure: Data → Information → Knowledge → Decisions → ACTION)

Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory (http://posmit.postech.ac.kr)
Foundational Technologies in Analytics
(figure; source: Business Intelligence and Analytics: From Big Data to Big Impact, H. Chen et al., MIS Quarterly, Dec 2012)

Taxonomy for Data Mining Tasks
(figure)

Classification Techniques
• Decision tree analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers (naïve, Belief Network)
• Rough sets
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' Theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes' Theorem: Basics
• Let X be a data sample ("evidence"): the class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), the posteriori probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
  - E.g., X will buy computer, regardless of age, income, …
• P(X): the probability that the sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds
  - E.g., given that X will buy computer, the probability that X is 31..40 with medium income
Bayes' Theorem
• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

• Informally: posteriori = likelihood × prior / evidence
• Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all P(Ck|X) for the k classes
• Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards the Naïve Bayes Classifier
• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized

Derivation of the Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

    P(X|Ci) = ∏ (k = 1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• This greatly reduces the computation cost: only the class distribution is counted
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
Naïve Bayes Classifier: Training Dataset
Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayes Classifier: An Example
• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = "<=30" | yes) = 2/9 = 0.222            P(age = "<=30" | no) = 3/5 = 0.6
  P(income = "medium" | yes) = 4/9 = 0.444       P(income = "medium" | no) = 2/5 = 0.4
  P(student = "yes" | yes) = 6/9 = 0.667         P(student = "yes" | no) = 1/5 = 0.2
  P(credit_rating = "fair" | yes) = 6/9 = 0.667  P(credit_rating = "fair" | no) = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | no)  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X | yes) P(yes) = 0.028
  P(X | no) P(no)   = 0.007
• Therefore, X belongs to class "buys_computer = yes"
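The same arithmetic, mechanized. A minimal sketch that simply hard-codes the probabilities computed above (illustrative only, no training loop):

// Naive Bayes scoring for X = (age<=30, income=medium, student=yes, credit=fair),
// using the class priors and conditional probabilities from the slide.
public class NaiveBayesExample {
    public static void main(String[] args) {
        double pYes = 9.0 / 14, pNo = 5.0 / 14;                 // class priors P(Ci)
        double likeYes = (2.0/9) * (4.0/9) * (6.0/9) * (6.0/9); // P(X|yes) = 0.044
        double likeNo  = (3.0/5) * (2.0/5) * (1.0/5) * (2.0/5); // P(X|no)  = 0.019
        double scoreYes = likeYes * pYes;                       // 0.028
        double scoreNo  = likeNo  * pNo;                        // 0.007
        System.out.printf("P(X|yes)P(yes)=%.3f  P(X|no)P(no)=%.3f%n", scoreYes, scoreNo);
        System.out.println(scoreYes > scoreNo ? "buys_computer = yes" : "buys_computer = no");
    }
}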
<br />
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero:

    P(X|Ci) = ∏ (k = 1..n) P(xk|Ci)

• Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), income = high (10)
• Use the Laplacian correction (or Laplacian estimator): add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  The "corrected" probability estimates are close to their "uncorrected" counterparts

Naïve Bayes Classifier: Comments
• Advantages
  - Easy to implement
  - Good results obtained in most of the cases
• Disadvantages
  - Assumption of class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
    • E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    • Dependencies among these cannot be modeled by the Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Support Vector Machines (SVM)
• A relatively new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary")
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

Support Vector Machines (contd)
• Aim: to find the best classification function for the training data
  - a separating hyperplane f(x) that passes through the middle of the two classes
  - the best such function is found by maximizing the margin between the two classes
  - xn belongs to the positive class if f(xn) > 0
• One of the most robust and accurate methods
• Requires only a dozen examples for training
• Insensitive to the number of dimensions
SVM - When Data Is Linearly Separable
• Let the data D be (X1, y1), …, (X|D|, y|D|), where the Xi are the training tuples and the yi their associated class labels.
• There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

SVM - Linearly Separable
• A separating hyperplane can be written as

    W · X + b = 0

  where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as

    w0 + w1 x1 + w2 x2 = 0

• The hyperplanes defining the sides of the margin:
    H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1, and
    H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
SVM - Linearly Inseparable
• Transform the original input data into a higher dimensional space
• Search for a linear separating hyperplane in the new space
(figure: points plotted against attributes A1 and A2)

SVM: Different Kernel Functions
• Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
• Typical kernel functions: polynomial, Gaussian radial basis function, sigmoid
• SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
Parallel Implementation of SVM
• Start with the current w and b, and in parallel do several iterations based on each training example
• Average the values from each of the examples to create a new w and b
• If we distribute w and b to each mapper, then the Map tasks can do as many iterations as we wish in one round
• We need to use the Reduce tasks only to average the results
• One iteration of MapReduce is needed for each round
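A minimal single-machine sketch of one such round: per-partition hinge-loss subgradient steps stand in for the Map tasks, and an average stands in for the Reduce task. The toy data, learning rate and regularization constant are assumptions, not part of the slide:

import java.util.Arrays;

// One round of "parallel SGD + average" for a linear SVM, mimicking the
// Map (per-partition iterations) and Reduce (averaging) steps described above.
public class ParallelSvmRound {
    // Runs one pass of hinge-loss subgradient descent over one data partition.
    static double[] sgdOnPartition(double[][] x, int[] y, double[] w0,
                                   double eta, double lambda) {
        double[] w = Arrays.copyOf(w0, w0.length);
        for (int i = 0; i < x.length; i++) {
            double margin = y[i] * dot(w, x[i]);
            for (int j = 0; j < w.length; j++) {
                double grad = lambda * w[j] - (margin < 1 ? y[i] * x[i][j] : 0);
                w[j] -= eta * grad;
            }
        }
        return w;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }

    public static void main(String[] args) {
        // Two toy partitions of linearly separable data (last feature = bias term).
        double[][] p1 = {{2, 1, 1}, {-2, -1, 1}};
        double[][] p2 = {{1.5, 2, 1}, {-1, -2, 1}};
        int[] y1 = {1, -1}, y2 = {1, -1};
        double[] w = {0, 0, 0};
        for (int round = 0; round < 50; round++) {
            double[] w1 = sgdOnPartition(p1, y1, w, 0.1, 0.01); // map task 1
            double[] w2 = sgdOnPartition(p2, y2, w, 0.1, 0.01); // map task 2
            for (int j = 0; j < w.length; j++) w[j] = (w1[j] + w2[j]) / 2; // reduce: average
        }
        System.out.println(Arrays.toString(w));
    }
}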
<br />
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
• The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The plot also shows a diagonal line; a model with perfect accuracy will have an area of 1.0
Issues Affecting Model Selection
• Accuracy
  - classifier accuracy: predicting class labels
• Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
  - understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

What is Cluster Analysis?
• Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
  - Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
• Unsupervised learning (i.e., learning by observations vs. learning by examples: supervised)
• Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms
Cluster Analysis: Applications
• Identify natural groupings of customers
• Identify rules for assigning new cases to classes for targeting/diagnostic purposes
• Provide characterization, definition, labelling of populations
• Decrease the size and complexity of problems for other data mining methods
• Identify outliers in a specific domain (e.g., rare-event detection)

Cluster Analysis: Applications (contd)
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults
• Climate: understanding earth climate, finding patterns in atmospheric and ocean data
• Economic science: market research
Requirements and Challenges
• Scalability
  - Clustering all the data instead of only samples
• Ability to deal with different types of attributes
  - Numerical, binary, categorical, ordinal, linked, and mixtures of these
• Constraint-based clustering
  - The user may give inputs on constraints
  - Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality
Association Rule Mining
• Finds interesting relationships (affinities) between variables, items or events
• Also known as market basket analysis
• Representative applications of association rule mining include:
  - In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales promotion configuration
  - In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)…

Association Rule Mining (contd)
• Are all association rules interesting and useful?
• A generic rule: X ⇒ Y [S%, C%]
  - X, Y: products and/or services
  - X: left-hand side (LHS); Y: right-hand side (RHS)
  - S (support): how often X and Y go together
  - C (confidence): how often Y goes together with X
• Example: {Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]
Privacy Preserving Data Mining
• There is a growing concern among people and organizations about protecting their privacy
• Many business and government organizations have strong requirements for privacy-preserving data mining
• The primary task in data mining is the development of models about aggregated data, while
  - meeting privacy requirements
  - providing valid data mining results

Source: Privacy Preserving Association Rule Mining in Vertically Partitioned Data, Jaideep Vaidya and Chris Clifton, SIGKDD '02, Edmonton, Alberta, Canada
What is Privacy?
• The ability of an individual or group to reveal themselves selectively, or to remain unnoticed or unidentified
• Privacy laws prohibit unsanctioned invasion of privacy by governments, corporations or individuals
• Privacy may be voluntarily sacrificed, normally in exchange for perceived benefits, and very often with specific dangers and losses
• What is privacy-preserving data mining?
  - The study of achieving some data mining goals without sacrificing the privacy of individuals

Challenges
• Privacy considerations seem to conflict with data mining: how do we mine data when we can't even look at it?
• Can we have data mining and privacy together?
• Can we develop accurate models without access to precise information in individual data records?
• Leakage of information is inevitable; the question is how to minimize it
• PPDM offers a compromise!
Example of Privacy
• Alice and Bob are both teaching the same class, and each of them suspects that one specific student is cheating. Neither of them is completely sure about the identity of the cheater, and neither wants to name a suspect who may be innocent.
• For the students' privacy:
  - if they both have the same suspect, then they should learn his or her name
  - but if they have different suspects, then they should learn nothing beyond that fact.
• They therefore have inputs x and y, and wish to compute f(x, y), where f(x, y) = 1 if x = y and 0 otherwise.
• If f(x, y) = 0, then each party does learn some information, namely that the other party's suspect is different from his/hers, but this is inevitable.
Total count: "How many pneumonia deaths under age 65?"
(Figure: a secure-sum ring over hospitals hosp1 … hosp9. The initiating site injects a random offset r0; each hospital adds its private count to the running value it receives, e.g. r1 = r0 + 12, r2 = r1 + 1, r3 = r2 + 8, …; the initiator subtracts r0 from the value returned by the last hospital to detect the total: Total = 61, without any single hospital's count being revealed.)
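A minimal sketch of the secure-sum idea in the figure; the modulus and the per-hospital counts are illustrative, and a real deployment would also need secure channels between sites.

```python
import random

# Secure-sum ring: the initiator masks the running total with a random
# offset r0, so no site learns another site's private count.
def secure_sum(private_counts, modulus=10**9):
    r0 = random.randrange(modulus)        # initiator's secret offset
    running = r0
    for count in private_counts:          # each site adds its own count
        running = (running + count) % modulus
    return (running - r0) % modulus       # initiator removes the offset

counts = [12, 1, 8, 0, 11, 7, 14, 3, 5]   # hypothetical per-hospital counts
print(secure_sum(counts))                  # 61, matching the figure's total
```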
<br />
<br />
Distributed Computing Scenario
• Two or more parties owning confidential databases wish to run a data mining algorithm on the union of their databases without revealing any unnecessary information
• Although the parties realize that combining their data has some mutual benefit, none of them is willing to reveal its database to any other party
• Partial leak of information is inevitable
<br />
Association Rule Mining in Vertically Partitioned Data
• Privacy concerns can prevent a central-database approach to data mining
• The transactions may be distributed across sources
• Collaborate to mine globally valid results without revealing individual transaction data
• Prevent disclosure of individual relationships
– Join key revealed
– Universe of attribute values revealed
<br />
<br />
Real-life Example
• Ford Explorers with Firestone tires from a specific factory had tread separation problems in certain situations, resulting in 800 injuries
• Since the tires did not have problems on other vehicles, and other tires on Ford Explorers did not pose a problem, neither side felt responsible
• Delay in identifying the real problem led to a public relations nightmare and the eventual replacement of 14.1 million tires
• Both manufacturers had their own data – early generation of association rules based on all of the data may have enabled Ford and Firestone to resolve the safety problem before it became a public relations nightmare
<br />
Problem Definition
• To mine association rules across two databases, where the columns in the table are at different sites, splitting each row
• One database is designated the primary and is the initiator of the protocol; the other database is the responder
• There is a join key present in both databases; the remaining attributes are present in one database or the other, but not both
• The goal is to find association rules involving attributes other than the join key, while observing the privacy constraints
<br />
<br />
Problem Definition
• Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I
• An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
• The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in D if s% of the transactions in D contain X ∪ Y
• The value of an attribute is represented as a 0 or 1; transactions are strings of 0s and 1s
• To find out if a particular itemset is frequent, we count the number of records where the values for all the attributes in the itemset are 1
<br />
Mathematical Problem Formulation
• Let the total number of attributes be l + m, where A has l attributes A1 through Al, and B has the remaining m attributes B1 through Bm; transactions/records are a sequence of l + m 1s or 0s
• Let k be the support threshold required, and n the total number of transactions/records
• Let X and Y represent columns in the database, i.e., x_i = 1 if row i has value 1 for attribute X. The scalar (dot) product of two cardinality-n vectors X and Y is
  X · Y = Σ_{i=1..n} x_i y_i
• Determining if the two-itemset is frequent thus reduces to testing whether X · Y ≥ k
<br />
<br />
Example
• Find out if itemset {A1, B1} is frequent (i.e., if support of {A1, B1} ≥ k)

  Database A          Database B
  Key | A1            Key | B1
  k1  | 1             k1  | 0
  k2  | 0             k2  | 1
  k3  | 0             k3  | 0
  k4  | 1             k4  | 1
  k5  | 1             k5  | 1

• Support of the itemset is defined as the number of transactions in which all attributes of the itemset are present:
  Support = Σ_{i=1..n} A_i · B_i
<br />
Basic Idea
• Support = Σ_{i=1..n} A_i · B_i – this is the scalar (dot) product of two vectors
• To find out if an arbitrary (shared) itemset is frequent, create a vector on each side consisting of the component-wise multiplication of all attribute vectors on that side (contained in the itemset); a sketch follows below
• E.g., for the itemset {A1, A3, A5, B2, B3}:
– A forms the vector X = ∏ A1 A3 A5
– B forms the vector Y = ∏ B2 B3
– Securely compute the dot product of X and Y
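The following sketch shows the vector construction in the clear (NumPy, toy data); the point of the cited paper is precisely how to compute the final dot product securely, which is not shown here.

```python
import numpy as np

# Each party collapses its attribute columns (for the itemset) into one
# binary vector by component-wise multiplication; the itemset's support is
# then the dot product of the two vectors. Toy data, 4 transactions.
A = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1]])  # columns A1, A3, A5
B = np.array([[1, 1], [0, 1], [1, 1], [1, 0]])              # columns B2, B3

X = A.prod(axis=1)      # 1 only where A1 = A3 = A5 = 1
Y = B.prod(axis=1)      # 1 only where B2 = B3 = 1
support = int(X @ Y)    # rows where every attribute of the itemset is 1
k = 2                   # hypothetical support threshold
print(support, support >= k)
```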
<br />
<br />
Conclusion
• Privacy preserving association rule mining algorithm using an efficient protocol for computing the scalar product while preserving privacy of the individual values
• Communication cost is comparable to that required to build a centralized data warehouse
• Although secure solutions exist, achieving efficient secure solutions for privacy preserving distributed data mining is still an open problem
• Handling the multi-party case and avoiding collusion is challenging; non-categorical attributes and quantitative association rule mining are significantly more complex problems
<br />
References
• Vaidya, J. and C. Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, SIGKDD '02, Edmonton, Alberta, Canada.
• Agrawal, R. and R. Srikant, Privacy-preserving Data Mining, ACM SIGMOD Conference on Management of Data, 2000.
• Pinkas, B., Cryptographic techniques for privacy-preserving data mining, SIGKDD Explorations, Vol. 4, Issue 2.
• Green, Annie, Knowledge Valuation: Building blocks to a knowledge valuation system (KVS), The journal of information and knowledge management systems, Vol. 36, No. 2, 2006, pp. 146–154.
<br />
<br />
Business Applications
Ram Babu Roy

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." – Charles Darwin

From Big Data to Big Impact
Source: Business Intelligence and Analytics: From Big Data to Big Impact, H. Chen et al., MIS Quarterly, Dec 2012
<br />
<br />
Big Impacts by Big Data
• Radical transparency with data widely available – change the way we compete
• Impact of real-time customization on business
• Augmenting management and strategy – better risk management
• Information-driven business model innovations
– Leveraging valuable exhaust data from business transactions
– Data aggregator as an entrepreneurial opportunity
Source: Are you ready for the era of 'big data'?, Brad Brown et al., McKinsey Quarterly, Oct 2011
Source: http://www.forbes.com/sites/danmunro/2013/04/28/big-problem-with-little-data/
<br />
<br />
Magic Quadrant for Advanced Analytics Platforms
(Figure: Gartner Magic Quadrant with four quadrants – Leaders, Challengers, Visionaries, and Niche Players.)
Source: Gartner (February 2014)
<br />
Sources of Competitive Advantage
• Big Data – a new type of corporate asset
• Effective use of data at scale
• Data-driven decision making
• Radical customization – gaining market share
• Constant experimentation
– Adjust prices in real time
– Bundling, synthesizing and making information available across the organization
• Novel business models
<br />
<br />
Growing Business Interest in Big Data
• Hundreds of articles published in technology industry journals and the general business press (Forbes, Fortune, Bloomberg BusinessWeek, The Wall Street Journal, The Economist, etc.)
• 3 V's – use of disparate data sets, including social media
• Growing data storage and processing requirements
• The advancement of the fields of machine learning and visualization
<br />
Role of Data Scientist
• We often do not know what question to ask – requires domain expertise to identify the important problems to solve in a given area
• Which aspect of big data makes more sense
• How to apply it to the business
• How one can achieve success by implementing a big data project
• New challenges
– Lack of structure
– What technology one must use to manage it
– Challenging to convert it into insights, innovation and business value
– But new opportunities
<br />
<br />
Variation of Potential Value across Sectors<br />
<br />
Source: Are you ready for the era of ‘big data’?, Brad Brown et.al, McKinsey Quarterly, Oct 2011<br />
<br />
Sources of Big Data • Various business units – Govt./Private • Partners • Customers • Internet of Things • Social Media • Transaction data • Web pages<br />
<br />
<br />
Major Industries that Benefit
• Financial services
• Banking and Telecommunication
• Healthcare
• Manufacturing
• Real Estate
• Travel
• Media and Entertainment
• Retailing
<br />
Why is Big Data so Important?
• Potential to radically transform businesses and industries
– Re-inventing business processes
– Operational efficiencies
– Customer service innovations
• You can't manage what you don't measure
– More knowledge about the business
– Improved decision making and performance – driven by data rather than intuition
– More effective interventions
• But need to change the decision-making culture
Source: Big Data: The Management Revolution, Andrew McAfee, Erik Brynjolfsson, HBR, Oct 2012
<br />
<br />
How are Managers using Big Data?
• To create new businesses
– Airlines had a gap of at least 5 to 10 minutes between the ETA (given by pilots) and the ATA
– Improved ETA by combining weather data, flight schedules, information about other planes in the sky, and other factors
• To drive more sales
– To tailor promotions and other offerings to customers
– To personalize the offers to take advantage of local conditions
<br />
Integrating Big Data into Business
• How to utilise unstructured data within your organization
• The latest technical changes related to using Hadoop in the enterprise
• Why big data solutions can enhance your ROI and deliver value
• The future of big data and the Internet of Things
• How cloud computing is changing the enterprise's use of data
<br />
<br />
Applications of Big Data Analytics
• Disaster management
• Healthcare
• Spatio-temporal data analytics
• Time-series based long-term analysis (e.g. climate change)
• Short-term real-time analysis (e.g. traffic management, disaster management)
• Benchmarking across the world
• Preserving heritage – creation of e-repositories
<br />
Big Data in Manufacturing
• Manufacturing generates about a third of all data today
– Digitally-oriented businesses see higher returns
– Detecting product and design flaws with Big Data
• Forecasting, cost and price modeling, supply chain connectivity
• Warranty data analytics, text mining for product development
• Visualization and dashboards, fault detection, failure prediction, in-process verification tools
• Machine system/sensor health, process monitoring and correction, product data mining
• Big Data generation and management, and the Internet of Things
<br />
<br />
Big Data in Manufacturing
• Integrating data from multiple systems
• Collaboration among functional units
• Information from external suppliers and customers to co-create products
• Collaboration during the design phase to reduce costs
• Implement changes in product to improve quality and prevent future problems
• Identify anomalies in production systems
– Schedule preemptive repairs before failures
– Dispatch service representatives for repair
<br />
Retailing and Logistics • Optimize inventory levels at different locations • Improve the store layout and sales promotions • Optimize logistics by predicting seasonal effects<br />
<br />
• Minimize losses due to limited shelf life<br />
<br />
<br />
Financial Applications
• Banking and Other Financial Services
– Automate the loan application process
– Detecting fraudulent transactions
– Optimizing cash reserves with forecasting
• Brokerage and Securities Trading
– Predict changes in certain bond prices
– Forecast the direction of stock fluctuations
– Assess the effect of events on market movements
– Identify and prevent fraudulent activities in trading
• Insurance
– Forecast claim costs for better business planning
– Determine optimal rate plans
– Optimize marketing to specific customers
– Identify and prevent fraudulent claim activities
<br />
Understanding Customer Online Behaviour
• Drawing insight from the online customers
• Sentiment analysis – to gauge responses to new marketing campaigns and adjust strategies
• Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling) – new revenue streams
– Identify and treat most valued customers
<br />
<br />
Case Study: Quantcast
• How do advertisers reach their target audience?
• Consumers spend more time in personalized media environments
• It's harder to reach a large number of relevant consumers
• Advertisers need to use the web to choose their audiences more selectively
• Decision – which ad to show to whom
Source: Too Big to Ignore: The Business Case for Big Data, Phil Simon, Wiley, 2013
<br />
Quantcast: A Small Big Data Company • Web measurement and targeting company – Founded in 2006 – focus on online audience • Models marketers’ prospects and finds lookalike audiences across the world • Analyses more than 300 billion observations of media consumption per month • Detailed demographic, geographic and lifestyle information • Created a massive data processing infrastructure – Quantcast File System (QFS) • Incorporates data generated from mobile devices<br />
<br />
<br />
Quantcast: A Small Big Data Company
• Big data allows organizations to drill down to reach specific audiences
• Different businesses have different data requirements, challenges and goals
• Quantcast provides integration between its measurement and targeting products
• An organization doesn't need to be big to benefit from Big Data
<br />
Promise of Big Data in Healthcare • Predictive and prescriptive analytics • Public health • Disease management • Drug discovery • Personalized medicine • Continuously scan and intervene in the healthcare practices<br />
<br />
<br />
What is healthcare? • Broader than just practicing medicine • Role of physician, pharmaceutical companies, hospitals, diagnostics services<br />
<br />
• Co‐creation of health value • Role of patients and their family • Objective: to deliver high‐quality and cost‐ effective care to patients<br />
<br />
Uniqueness of Healthcare
• Every patient is unique and needs personalized care
• All the medical professionals are unique
• Can we match the core competency and unique style of medical professionals to specific patients? Can we apply such principles to improve the overall efficiency?
• Need for engaging medical professionals in the tasks they are best at doing
<br />
<br />
Healthcare Business Innovation
• Needs exchange of knowledge between business and medicine
• Entrepreneurship in healthcare
• Availability of financial support to engage in value creation
• Development and analysis of healthcare business models
<br />
Healthcare Analytics
• Data collected through mobile devices, health workers, individuals, and other data sources
• Crucial for understanding population health trends or stopping outbreaks
• Individual electronic health records – not only improve continuity of care for the individual, but also create massive datasets
– Trends in service line utilization
– Improvement opportunities in the revenue cycle
• Treatments and outcomes can be compared in an efficient and cost-effective manner
<br />
<br />
Sources of Information • Government officials • Industry representatives • Information technology experts • Healthcare professionals<br />
<br />
Enabler
• Increase of processing power and storage capacities
– Advances in technology
– Reduced cost of storage and processing of big data
• Availability of data
– Digitization of data
– Increase in adoption of digital gadgets by users
– Popularity of social media
– Mobile and internet population and penetration
• Awareness about the benefits of having knowledge
• Requirement of data-driven insights by various stakeholders
<br />
<br />
Explorys: The Human Case for Big Data
• January 2011: US health care spending approaching $3 trillion
• Behavioural, operational and clinical wastes
• Long-term economic implications
• Opportunity to improve healthcare delivery and reduce expenses
– Integrating clinical, financial and operational data
– Volume, velocity and variety of health care data
• Big data can have a big impact
<br />
Better Healthcare using Big Data
• Real-time exploration, performance and cost analytics
• Vast user base – major integrated delivery networks in the US
• Users can view key performance metrics across providers, groups, care venues, and locations
• Identify ways to improve outcomes and reduce unnecessary costs
<br />
<br />
Better Healthcare using Big Data
• Why do people go to the emergency room rather than to care available nearby?
• Analytics to decide whether the care is available in the neighbourhoods
• Service providers can reach out to patients to guide them through the treatment processes
• Patients can receive the right care at the right venue at the right time
• Privacy concerns
<br />
Mobilising Data to Deal with an Epidemic
• Case of Haiti's devastating 2010 earthquake
• The use of mobile data patterns could help understand the movement of refugees and the consequent health risks
• Accurately analysed the destination of over 600,000 people displaced from Port-au-Prince
• They made this information available to government and humanitarian organisations dealing with the crisis
<br />
<br />
Cholera Outbreak in Haiti
• Mobile data to track the movement of people from affected zones
• Aid organisations used this data to prepare for new outbreaks
• Demonstrates how mobile data analysis could improve epidemic responses
<br />
Cost-effective Technology
• DataGrid platform on Cloudera's enterprise-ready Hadoop solution
• Platform needs to support
– The complex healthcare data
– Evolving transformation of the healthcare delivery system
• Need for a solution that can evolve with time
• Flexibility, scalability and speed necessary to answer complex questions on the fly
• Explorys could focus on the core competencies of its business
• Imperative to learn, adapt and be creative
<br />
<br />
Benefits of Information Technology • Improved access to patient data ‐ can help clinicians as they diagnose and treat patients • Patients might have more control over their health • Monitoring of public health patterns and trends • Enhanced ability to conduct clinical trials of new diagnostic methods and treatments • Creation of new high‐technology markets and jobs • Support a range of healthcare ‐related economic reforms<br />
<br />
Potential Benefits
• Provide a solid scientific and economic basis for investment recommendations
• Establish a foundation for healthcare policy decisions
• Improved patient outcomes
• Cost-savings
• Faster development of treatments and medical breakthroughs
<br />
<br />
Action Plan
• Government should facilitate the nationwide adoption of a universal exchange language for healthcare information
• Creation of a digital infrastructure for locating patient records while strictly ensuring patient privacy
• Facilitate a transition from traditional electronic health records to the use of healthcare data tagged with privacy and security specifications
<br />
NASA: Open Innovation
• Effects of crowdsourcing on the economy – collaboration and innovation – Wikipedia
• What is the relationship between Big Data, collaboration and open innovation?
• Democratize big data to benefit from the wisdom of crowds
• Groups can often use information to make better decisions than individuals
• TopCoder – brings together a diverse community of software developers, data scientists, and statisticians
– Risk-prevention contest: predict the riskiest locations for major emergencies and crimes
<br />
<br />
NASA's Real-world Challenges
• Recorded more than 100 TB of space images and more
• Encourage exploration and analysis of Planetary Data System (PDS) databases
• Image processing solutions to categorize data from missions to the moon – types and numbers of craters
• Software to handle complex operations of satellites and data analysis
– Vehicle recognition
– Crater detection
• Real-time information to contestants and collecting their comments
<br />
Lessons Learnt
• To effectively harness big data, organizations need not offer big rewards
• State the problem in a way that allows a diverse set of potential solvers to apply their knowledge
• Create a compelling Big Data and open innovation strategy
• An organization's size doesn't matter to reap rewards of Big Data
<br />
<br />
Crowd Management during a Cricket Match
• Directional flow density during a cricket match
• Availability of public transport for commuters from different destinations
• Mobile data can be used to predict the direction of movement of spectators after the match
A study by Uttam K Sarkar et al., IIM Calcutta
<br />
Potential Applications • Managing resources to pursue most promising<br />
<br />
customers and markets • Technology platform for managing the risk and value of network of different stakeholders • Meeting regulatory compliance • Development of patient referral system to reduce patient churn • Life science industries such as diagnostics, pharmaceuticals, and medical devices • Development of mobile apps for referral • Genetic data mining for prediction of disease and effective care<br />
<br />
<br />
Simple Techniques for Big Data Transition
• Which problem to tackle? – Need for domain expertise
• Habit of asking key questions
– What do the data say?
– Where did the data come from?
– What kinds of analyses were conducted?
– How confident are we in the results?
• Computers can only give answers
<br />
Need for Change Management
• Leadership – set clear goals, define success, spot great opportunities, and understand how a market is evolving
• Talent management – organizing big data, visualization tools and techniques, design of experiments
• Technology – to integrate all internal and external sources of data, adopt new technologies
• Decision making – maximize cross-functional cooperation
• Company culture – shift from asking "What do we think?" to "What do we know?"
<br />
<br />
Challenges in Getting Returns
• Collecting, processing, analyzing and using Big Data in the business
• Handling the volume, variety and velocity of the data
• Finding data scientists – by 2018, the demand for analytics and Big Data people in the U.S. alone will exceed supply by as much as 190,000
• Sharing data across organizational silos
• Driving data-driven business decisions rather than those based on intuition
<br />
Confronting Complications
• Shortage of adequately skilled work force
• Shortage of managers and analysts with a sharp understanding of big data applications
• Tension between privacy and convenience
– Consumers capture a larger part of the economic surplus
• Data/IP security concerns – cloud computing and open access to information
<br />
<br />
What Big Data Can't Answer
• How do we know what's important in a complex world?
• E.g. to predict and explain market crashes, food prices, the Arab Spring, ethnic violence, and other complex biological and social systems
• Identifying the important information (and ignoring the rest) is the key to solving the world's increasingly complex challenges
Yaneer Bar-Yam and Maya Bialik, Beyond Big Data: Identifying important information for real world challenges (December 17, 2013), arXiv, in press
<br />
Complexity Profile
• The amount of information that is required to describe a system, as a function of the scale of description
• The most important information about a system for informing action on that system is its behavior at the largest scale
<br />
<br />
References
• Holdren, John P., Eric Lander, and Harold Varmus, "Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward," President's Council of Advisors on Science and Technology, Executive Office of the President, Dec. 2010.
• The Emerging Big Returns on Big Data, A TCS 2013 Global Trend Study, 2013.
• www.ikanow.com
<br />
Mining Social‐Network Graphs Prof. Ram Babu Roy<br />
<br />
<br />
Types of Networks (Kolaczyk, 2009)
• Technological – man-made, consciously created
• Social – interactions among social entities
• Biological – interaction among biological elements
• Informational – elements of information
<br />
What is a Social Network?
• Collection of entities that participate in the network
• There is at least one relationship between entities of the network
• There is an assumption of nonrandomness or locality
• Examples
– "Friends" networks – Facebook, Twitter, Google+
– Telephone networks
– Email networks
– Collaboration networks
Ref: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, 2014
<br />
<br />
Social Networks as Graphs
• Not every graph is a suitable representation of a social network
• Locality – the property of social networks that says nodes and edges of the graph tend to cluster in communities
• Important questions
– Identifying communities with unusually strong connections
– Communities usually overlap – you may belong to several communities
<br />
Example of a Small Social Network
• Nine edges out of the 7C2 = 21 pairs of nodes
• Suppose X, Y, and Z are nodes with edges between X and Y and also between X and Z. What is the probability that the edge (Y, Z) exists?
• If the graph were large, that probability would be very close to the fraction of the pairs of nodes that have edges between them, i.e., 9/21 = .429
• If the graph is small: there are already edges (X, Y) and (X, Z), so only seven edges remain among the other 19 pairs. Thus, the probability of an edge (Y, Z) is 7/19 = .368
• By actually counting the pairs of nodes, the fraction of times the third edge exists is 9/16 = .563
<br />
<br />
Social Network Analysis
• An actor exists in a fabric of relations (Wasserman & Faust, 1994)
• Centrality measures (Freeman, 1979; Friedkin, 1991)
– Closeness – ability to reach other actors with ease
– Betweenness – relative importance of a node in linking two nodes
– Eigenvector (Bonacich, 1987) – entire pattern of connections
• Network centralization (Freeman, 1979):

  C_A = Σ_{x ∈ U} [C_A(x*) − C_A(x)] / max Σ_{x ∈ U} [C_A(x*) − C_A(x)]

  where C_A(x*) is the largest observed centrality value and the denominator is the maximum that the sum can take over any network of the same size
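For concreteness, a minimal sketch of this centralization measure, specialized to degree centrality on an illustrative graph; for degree, the theoretical maximum of the sum in the denominator is attained by the star graph.

```python
import numpy as np

# Freeman's degree centralization of a small undirected star graph,
# which should give the maximum value 1.0. Adjacency matrix is illustrative.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])

deg = A.sum(axis=1)                   # degree centrality C(x) of each node
observed = (deg.max() - deg).sum()    # sum of (C(x*) - C(x)) over all nodes
n = len(deg)
max_possible = (n - 1) * (n - 2)      # maximum of that sum (star graph, degree)
print(observed / max_possible)        # 1.0 for this star graph
```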
<br />
Link Analysis
• Link Analysis Algorithms
– PageRank
– HITS (Hypertext-Induced Topic Selection)
• A page is more important if it has more links
– In-coming links? Out-going links?
• Think of in-links as votes
• Are all in-links equal?
– Links from important pages count more
– Recursive question!
<br />
<br />
PageRank Scores
(Figure: example web graph annotated with PageRank scores.)
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets

Simple Recursive Formulation
• Each link's vote is proportional to the importance of its source page
• If page j with importance r_j has n out-links, each link gets r_j / n votes
• Page j's own importance is the sum of the votes on its in-links, e.g. r_j = r_i/3 + r_k/4
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets
<br />
<br />
The Flow Model
• A "vote" from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a "rank" r_j for page j:

  r_j = Σ_{i → j} r_i / d_i   (d_i = out-degree of page i)

Source: Jure Leskovec, Stanford C246: Mining Massive Datasets
<br />
Matrix Formulation
• Stochastic adjacency matrix M
– Let page i have d_i out-links
– If i → j, then M_ji = 1/d_i, else M_ji = 0
– M is a column-stochastic matrix: columns sum to 1
• Rank vector r: a vector with one entry per page, where r_i is the importance score of page i
– Σ_i r_i = 1
• The flow equations can be written r = M · r
• The rank vector r is an eigenvector of the stochastic web matrix M
• We can now solve for r using the method called power iteration
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets
<br />
<br />
Power Iteration Method
• Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
• Initialize: r(0) = [1/N, …, 1/N]^T
• Iterate: r(t+1) = M · r(t)
• Stop when |r(t+1) − r(t)|_1 < ε
– |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
– Can use any other vector norm, e.g., Euclidean
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets
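A minimal sketch of power iteration on a toy three-page web graph; the matrix is illustrative and assumes every page has at least one out-link (dead ends and spider traps are not handled).

```python
import numpy as np

# Column-stochastic M: M[j, i] = 1/d_i if page i links to page j.
M = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

N = M.shape[0]
r = np.full(N, 1.0 / N)                 # r(0) = [1/N, ..., 1/N]
eps = 1e-8
while True:
    r_next = M @ r                      # r(t+1) = M . r(t)
    if np.abs(r_next - r).sum() < eps:  # stop when L1 change < epsilon
        break
    r = r_next
print(r_next)  # approximates the principal eigenvector of M
```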
<br />
How to Solve?<br />
<br />
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets<br />
<br />
<br />
HITS (Hypertext-Induced Topic Selection)
• A measure of importance of pages or documents, similar to PageRank
– Proposed at around the same time as PageRank ('98)
• Goal: say we want to find good newspapers
– Don't just find newspapers; find "experts" – people who link in a coordinated way to good newspapers
• Idea: links as votes
– A page is more important if it has more links
– In-coming links? Out-going links?
<br />
Hubs and Authorities
• Each page has two scores:
• Quality as an expert (hub)
– Total sum of votes of the authorities pointed to
– Hubs are pages that link to authorities, e.g. a list of newspapers
• Quality as content (authority)
– Total sum of votes coming from experts
– Authorities are pages containing useful information, e.g. newspaper home pages
<br />
<br />
Hubs and Authorities
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets

Example
• Under reasonable assumptions about the adjacency matrix A, HITS converges to vectors h* and a*:
– h* is the principal eigenvector of the matrix A·Aᵀ
– a* is the principal eigenvector of the matrix Aᵀ·A
Source: Jure Leskovec, Stanford C246: Mining Massive Datasets
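A minimal sketch of the HITS iteration on a toy directed graph (the adjacency matrix is illustrative); repeated normalization drives h and a toward the principal eigenvectors of AAᵀ and AᵀA.

```python
import numpy as np

# A[i, j] = 1 if page i links to page j.
A = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)

h = np.ones(A.shape[0])      # hub scores
for _ in range(100):
    a = A.T @ h              # authority: sum of hub scores of in-linking pages
    a /= np.linalg.norm(a)
    h = A @ a                # hub: sum of authority scores of linked-to pages
    h /= np.linalg.norm(h)
print("hubs:", h)
print("authorities:", a)
```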
<br />
<br />
Financial Market as CAS
• Complex systems – interconnectedness, hierarchy of subsystems, decentralized decisions, non-linearity, self-organisation, evolution, uncertainty, self-organized criticality
• Complex Adaptive Systems (CAS) – adaptation
• Many socio-economic systems behave as CAS (Mauboussin 2002; Markose 2005)
– Complex – networks of interconnections
– Adaptive – optimizing agents, capable of learning
– Complexity and homogeneity – both robust and fragile
• Structural vulnerabilities build up over time (Haldane, 2009)
<br />
Network-based Modeling
• Modeling complex systems – SD, ABM, VSM, Econophysics
• Scale-free networks (Barabási, 1999)
• Visual representation of the interdependence (Newman 2008) – knowledge discovery
• Dynamics of networks may reveal the underlying mechanism
• Recent works using the network approach (Boginski et al. 2006; Tse, C.K., 2010)
<br />
<br />
Market Graph
• Logarithm of return of instrument i over the one-day period from (t − 1) to t:
  R_i(t) = ln [P_i(t) / P_i(t − 1)]
• Correlation coefficient C_ij between instruments i and j
• An edge connecting stocks i and j is added to the graph if C_ij ≥ threshold θ
– prices of these two stocks behave similarly over time
– the degree of similarity is defined by the chosen value of the threshold θ
• Studying the pattern of connections in the market graph provides helpful information about the internal structure of the stock market
Source: Mining market data: A network approach, Vladimir Boginski, Sergiy Butenko, Panos M. Pardalos, Computers & Operations Research 33 (2006) 3171–3184
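A minimal sketch of this market-graph construction with synthetic prices and an illustrative threshold; real studies would use actual closing prices and a calibrated θ.

```python
import numpy as np

# prices[t, i] = closing price of instrument i on day t (synthetic data).
rng = np.random.default_rng(0)
prices = np.cumprod(1 + 0.01 * rng.standard_normal((250, 5)), axis=0)

returns = np.log(prices[1:] / prices[:-1])  # R_i(t) = ln(P_i(t) / P_i(t-1))
C = np.corrcoef(returns, rowvar=False)      # correlation matrix C_ij

theta = 0.5                                  # illustrative threshold
edges = [(i, j) for i in range(C.shape[0])
         for j in range(i + 1, C.shape[0]) if C[i, j] >= theta]
print(edges)                                 # instrument pairs joined by an edge
```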
<br />
State-of-the-Art
• Regional behaviour of stock markets, with relatively small sample sizes: Korean (Jung et al. 2006; Kim et al. 2007), Indian (Pan & Sinha 2007), and Brazilian (Tabak et al. 2009)
• Evolution of interdependence and characterizing the dynamics (Garas, 2007; Huang et al. 2009)
• Not much work on identifying dominant stock indices (Eryigit, 2009)
• Little work on the impact of the recent financial crisis during 2008
• Limited business application of SNA (Bonchi et al., 2011)
<br />
<br />
Research Gap • Characterising the global market dynamics is a complex problem<br />
<br />
• Lack of system‐level analysis of global stock market • The network based methodologies inadequately explored • Adaptation of SNA methodologies to other domains<br />
<br />
Research Objective • Understanding underlying network structure of markets • Methods to capture interdependence structure of complex systems<br />
<br />
• Methods to characterize evolutionary behavior • Methods for change detection ‐ the impact of events on the topology<br />
<br />
<br />
Research Questions
• Is there any regional influence on the evolutionary behavior?
• How does the network during a crisis differ from its normal phase?
• How to capture the macroscopic interdependence structure among stock markets and economic sectors?
• What is the response of the network to an extreme event?
<br />
Application: Change Detection in the Stock Markets<br />
<br />
Source: A Social Network Approach to Change Detection in the Interdependence Structure of Global Stock Markets by Ram Babu Roy, Uttam Kumar Sarkar Social Network Analysis and Mining, Springer, Vol. 3, Number 3, (2013)<br />
<br />
<br />
Methodology
• Secondary data on major stock markets from across the globe obtained from Bloomberg
• Characterization and pattern mining to investigate the structural and statistical properties and behavior
• Statistical control charts to detect anomalies in evolution
• Graph-theoretic methods and algorithms, network visualization tools (Pajek, Matlab and MS Excel)
• Non-parametric methods for analysis and change detection
<br />
Data Description
• The daily closing prices for 85 stock indices from 36 countries from across the world, from January 2006 to December 2010, obtained from Bloomberg
• In addition to these stock indices from various countries, 8 other indices, namely SX5E, SX5P, SXXE, SXXP, E100, E300, SPEURO, SPEU from the European region, were included to investigate whether the regional indices have any influence on the network structure
• We studied the market network before and after the collapse of Lehman Brothers in the USA
• Restricted our samples to only those indices existing for a longer period and having data available on Bloomberg (say from 1990), giving us 93 such indices
<br />
<br />
Computations
• Logarithmic return R_i(t) of instrument i over the one-day period from (t − 1) to t:
  R_i(t) = ln { P_i(t) / P_i(t − 1) }
• Correlation coefficient between the returns of instruments i and j:
  C_ij = ( ⟨R_i R_j⟩ − ⟨R_i⟩⟨R_j⟩ ) / √( (⟨R_i²⟩ − ⟨R_i⟩²) (⟨R_j²⟩ − ⟨R_j⟩²) )
• An edge connecting stock indices i and j is added to the graph if C_ij ≥ θ
– returns of these two stock indices behave similarly over time
– the degree of similarity is defined by the chosen value of the threshold θ
• MST is used for obtaining a simplified connected network, with distances
  d_ij = √( 2 (1 − ρ_ij) ),  0 ≤ d_ij ≤ 2
Illustration of MST creation
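A minimal sketch of that step: transform an illustrative correlation matrix into the distance d_ij and extract the minimum spanning tree with SciPy.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rho = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.4],
    [0.3, 0.4, 1.0],
])

d = np.sqrt(2 * (1 - rho))      # 0 <= d_ij <= 2; highly correlated pairs are close
np.fill_diagonal(d, 0)
mst = minimum_spanning_tree(d)  # keeps the n - 1 shortest connecting edges
rows, cols = mst.nonzero()
print(list(zip(rows, cols)))    # edges of the simplified connected network
```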
<br />
<br />
Empirical Findings
Analysis periods (examples):

  Period No | Start Date | End Date
  1         | 1/11/2006  | 4/23/2008
  6         | 5/31/2006  | 9/10/2008
  7         | 6/28/2006  | 10/8/2008
  36        | 9/17/2008  | 12/29/2010
  37        | 1/4/2006   | 12/29/2010

(Figures: market network snapshots before (Pre-LB) and after (Post-LB) the Lehman Brothers collapse.)
Post-LB:
<br />
• Indices cluster with the regional hubs
• Relatively more decentralized network
• European stock indices emerge more central
<br />
Application: Identifying Dominant Economic Sectors and Stock Markets
Source: Identifying dominant economic sectors and stock markets: A social network approach, Gold Coast, Australia, Springer, LNCS 7867, pp. 59–70 (2013)
<br />
<br />
Data Description

  Stock Index | No. of Stocks    Stock Index | No. of Stocks    GICS Economic Sector       | No. of Stocks
  AS52        | 136              NMX         | 192              Consumer Discretionary      | 486
  CNX500      | 303              NZSE50FG    | 26               Consumer Staples            | 229
  FSSTI       | 19               SBF250      | 140              Energy                      | 128
  HDAX        | 38               SET         | 285              Financials                  | 408
  HSI         | 25               SHCOMP      | 382              Health Care                 | 146
  IBOV        | 27               SPTSX       | 147              Industrials                 | 512
  KRX100      | 65               SPX         | 384              Information Technology      | 234
  MEXBOL      | 24               TWSE        | 313              Materials                   | 425
  NKY         | 192                                            Telecommunication Services  | 36
                                                               Utilities                   | 94
  Total       | 829              Total       | 1869             Grand Total                 | 2698

Data for 13 years from January 1998 to January 2011
GICS – Global Industry Classification Standard
<br />
Identification of Dominant Economic Sectors and Stock Markets
• The normalized intra-sectoral edge density (in percent) is the ratio of the number of edges between the stocks of a particular sector and the maximum number of possible edges between the stocks of that sector (i.e. n − 1, where n is the number of stocks of that sector)
• The normalized inter-sectoral edge density is the ratio of the number of edges between the stocks of two different sectors and the maximum number of possible edges between the stocks of those two sectors (i.e. min(n1, n2), where n1 and n2 are the numbers of stocks belonging to the two sectors)
• A similar procedure is followed to identify dominant stock markets after computing the normalized inter-index and intra-index edge densities; a sketch follows below
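A minimal sketch of both densities, using made-up sector labels and network edges; in the paper these are computed on the market network of actual stocks.

```python
# Illustrative sector assignment and network edges.
sector = {"s1": "FIN", "s2": "FIN", "s3": "FIN", "s4": "IT", "s5": "IT"}
edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s5")]

def intra_density(sec):
    n = sum(1 for s in sector.values() if s == sec)
    hits = sum(sector[a] == sector[b] == sec for a, b in edges)
    return 100 * hits / (n - 1)           # denominator: n - 1, as defined above

def inter_density(sec_a, sec_b):
    n_a = sum(1 for s in sector.values() if s == sec_a)
    n_b = sum(1 for s in sector.values() if s == sec_b)
    hits = sum({sector[a], sector[b]} == {sec_a, sec_b} for a, b in edges)
    return 100 * hits / min(n_a, n_b)     # denominator: min(n1, n2), as above

print(intra_density("FIN"), inter_density("FIN", "IT"))  # 100.0 50.0
```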
<br />
<br />
(Figure legend: in the index-level network figures, node colors identify stock indices – e.g. NMX green, CNX500 green, HDAX magenta, SBF250 brown, HSI purple, SET brown, IBOV yellow, FSSTI cyan, KRX100 magenta, TWSE orange, MEXBOL blue, SHCOMP red, NKY black, AS52 pink, SPTSX black – and node shapes identify continents: Australia and Zealandia diamond, Asia triangle, Europe box, North America circle, South America ellipse. In the sector-level figures, colors identify GICS sectors, e.g. Materials brown, Financials magenta, Industrials cyan, Health Care orange, Energy red, Utilities purple.)
<br />
Interdependence Structure of Economic Sectors (weighted edges)

  Rank | Economic Sector            | Eigenvector Centrality
  1    | Financials                 | 0.4785
  2    | Industrials                | 0.4042
  3    | Materials                  | 0.3795
  4    | Consumer Discretionary     | 0.3449
  5    | Information Technology     | 0.2962
  6    | Consumer Staples           | 0.2702
  7    | Telecommunication Services | 0.2694
  8    | Utilities                  | 0.2002
  9    | Energy                     | 0.1957
  10   | Health Care                | 0.1817
<br />
Network of Stock Indices (weighted edges)

  Rank | Index    | Eigenvector Centrality
  1    | SBF250   | 0.6381
  2    | HDAX     | 0.5938
  3    | NMX      | 0.393
  4    | SPX      | 0.1523
  5    | AS52     | 0.1517
  6    | NZSE50FG | 0.1248
  7    | SPTSX    | 0.1112
  8    | HS Index | 0.071
  9    | SET      | 0.0465
  10   | MEXBOL   | 0.0423
  11   | CNX500   | 0.0353
  12   | NKY      | 0.0289
  13   | FSSTI    | 0.0208
  14   | SHCOMP   | 0.0048
  15   | IBOV     | 0.0043
  16   | TWSE     | 0.0004
  17   | KRX100   | 0
<br />
<br />
Inter-sectoral Interdependence Linking Cross-country Stock Markets

  Rank | Economic Sector            | EV Centrality
  1    | Industrials                | 0.4738
  2    | Financials                 | 0.4575
  3    | Materials                  | 0.3105
  4    | Utilities                  | 0.2921
  5    | Health Care                | 0.2886
  6    | Consumer Staples           | 0.2747
  7    | Consumer Discretionary     | 0.2685
  8    | Information Technology     | 0.2306
  9    | Telecommunication Services | 0.225
  10   | Energy                     | 0.2232
<br />
Findings and Conclusions
• Presence of distinct regional and sectoral clusters
• Regional influence dominates economic sectors
• Stock indices from Europe and the US emerge as dominant
• Financials, Industrials, Materials, and Consumer Discretionary sectors dominate
• Social position of stocks – portfolio management
• System-level understanding of the structure and behavior
<br />
<br />
Scope for Future Research
• Potential application for classification of stocks in portfolio management
• Detecting the epicenter of turbulence in near real-time – development of an EWS
• Modeling shock propagation through the network
• Whether the self-organizing network provides in-built resilience
<br />
References
• Bessler, D.A., Yang, J., The structure of interdependence in international stock markets. Journal of International Money and Finance 22, 261–287, 2003.
• Eryiğit, M. and R. Eryiğit, Network structure of cross-correlations among the world market indices. Physica A: Statistical Mechanics and its Applications 388(17), 3551–3562, 2009.
• Adams, J., K. Faust, et al., Capturing context: Integrating spatial and social network analyses. Social Networks 34(1), 1–5, 2012.
• Tse, C.K., Liu, J., Lau, F.C.M., A network perspective of the stock market. Journal of Empirical Finance, doi:10.1016/j.jempfin.2010.04.008, 2010.
• Boginski, V., Butenko, S., Pardalos, P.M., Mining market data: A network approach. Computers & Operations Research 33, 3171–3184, 2006.
• Wasserman, S. and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 461–502, 1994.
• Freeman, L.C., Centrality in social networks: Conceptual clarification. Social Networks 1, 215–239, 1979.
• Bonacich, P., Power and Centrality: A Family of Measures. The American Journal of Sociology 92, 1987.
• Roy, R. B. and U. K. Sarkar, A social network approach to examine the role of influential stocks in shaping interdependence structure in global stock markets. International Conference on Advances in Social Network Analysis and Mining (ASONAM), Kaohsiung, Taiwan, DOI: 10.1109/ASONAM.2011.87, 2011.
• Roy, R. B. and U. K. Sarkar, A social network approach to change detection in the interdependence structure of global stock markets. Social Network Analysis and Mining, DOI: 10.1007/s13278-012-0063-y, 2012.
<br />
<br />
Classification using Neural Networks Uttam K Sarkar Indian Institute of Management Calcutta<br />
<br />
Session Plan • The need for neural networks – Signature recognition problem • Whether the signature in the cheque is ‘same’ as the signature the bank has against the account in its database<br />
<br />
• Concept of a Neural Network • Demonstration of how it works using Excel • Potential business applications • Issues in using neural networks<br />
<br />
<br />
Artificial Neural Network (ANN)
An ANN is a computational paradigm inspired by the structure of biological neural networks and their way of encoding and solving problems
<br />
Biological Inspirations
• Some numbers…
– The human brain contains about 10 billion nerve cells (neurons)
– Each neuron is connected to the others through about 10,000 synapses
• Properties of the brain
– It can learn, reorganize itself from experience
– It adapts to the environment
– It is robust and fault tolerant
<br />
<br />
Biological Neural Networks
• Human brain contains several billion nerve cells
• A neuron receives electrical signals through its dendrites
• The accumulated effect of several signals received simultaneously is linearly additive
• Output is a non-linear (all-or-none) type of output signal
• Connectivity (number of neurons connected to a neuron) varies from 1 to 10^5; for the cerebral cortex it is about 10^3
<br />
Biological Neuron
(Figure: a neuron with dendrites, cell body, nucleus, axon, and synapse.)
• Dendrites sense input
• Axons transmit output
• Information flows from dendrites to axons via the cell body
• Axon connects to other dendrites via synapses
– Interactions of neurons
– Synapses can be excitatory or inhibitory
– Synapses vary in strength
• How can the above biological characteristics be modeled in an artificial system?
<br />
<br />
Artificial Implementation using a Computer?
• Input
– Accepting external input is simple and commonplace
• Axons transmit output
– Output mechanisms too are well known
• Information flows from dendrites to axons via cell body
– Information flow is doable
• Axon connects to other dendrites via synapses
– Interactions of neurons – how? (What kind of graph or network?)
– Synapses can be excitatory or inhibitory (1/0? Continuous?)
– Synapses vary in strength (weighted average?)
<br />
Typical Excitement or Activation Function at a Neuron (Sigmoid or Logistic Curve)

  y = f(x) = 1 / (1 + e^(−x))
<br />
<br />
Interconnections? Feed-Forward Neural Network
• Information is fed at the input layer
• Computations are done at the hidden layers
• Deviations of computed results from desired goals retune the computations
• The network thus 'learns'
• Computation is terminated once the learning is deemed acceptable or the resources earmarked for computation get exhausted
(Figure: layered network – input layer x1, x2, …, xn feeding the 1st and 2nd hidden layers, then the output layer.)
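A minimal sketch of one forward pass through such a network with the logistic activation; the weights are random stand-ins for values that a training procedure (e.g. backpropagation) would learn.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # y = 1 / (1 + e^(-x))

rng = np.random.default_rng(42)
W1 = rng.standard_normal((4, 3))  # input (3 features) -> 1st hidden layer (4 units)
W2 = rng.standard_normal((2, 4))  # 1st hidden -> 2nd hidden layer (2 units)
W3 = rng.standard_normal((1, 2))  # 2nd hidden -> output layer (1 unit)

x = np.array([0.5, -1.2, 0.3])    # one input pattern
h1 = sigmoid(W1 @ x)              # each layer: weighted sum, then activation
h2 = sigmoid(W2 @ h1)
y = sigmoid(W3 @ h2)
print(y)                          # network output in (0, 1)
```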
<br />
Supervised Learning
• Inputs and outputs are both known
• The network tunes its weights to transform the inputs to the outputs without trying to discover the mapping in an explicit form
• One may provide examples and teach the network without knowing exactly how!
<br />
<br />
Characteristics of ANN
• Supervised networks are good approximators
– Bounded functions can be approximated by an ANN to any precision
• Does self-learning by adapting weights to environmental needs
• Can work with incomplete data
• The information is distributed across the network; if one part gets damaged, the overall performance may not degrade drastically
<br />
What Do We Need to Use a NN?
• Determination of pertinent inputs of the neural network
• Finding the optimum number of hidden nodes
• Estimating the parameters (learning)
• Evaluating the performance of the network
• Iterating over the preceding points as needed
<br />
<br />
Applications of ANN • Dares to address difficult problems where cause‐ effect relationship of input‐output is very hard to quantify – Stock market predictions – Face recognition – Time series prediction – Process control – Optical character recognition – Optimization<br />
<br />
Concluding Remarks on Neural Networks
• Neural networks are utilized as statistical tools
– Adjust non-linear functions to fulfill a task
– Need multiple and representative examples, but fewer than in other methods
• Neural networks enable the modeling of complex phenomena
• NN are good classifiers BUT
– Good representations of the data have to be formulated
– Training vectors must be statistically representative of the whole problem
• Effective use of NN needs a good comprehension of the problem and a good grip on the underlying mathematics
<br />
<br />
Business Data Mining Promises and Reality<br />
<br />
Uttam K Sarkar Indian Institute of Management Calcutta<br />
<br />
Background
• "War is too important to be left to the Generals" – Georges Benjamin Clemenceau
– Decision making now requires navigating beyond transactional data
• Need for exploring an ocean of data
– How to filter?
– How about outliers?
– How to summarize?
– How to analyze?
– How to apply?
<br />
<br />
Challenges (Opportunities?)
• Real world does not have a consistent behaviour
– What model is to be extracted by analytics then? "Future, by definition, is uncertain"
• Captured data are error-prone and involve uncertainty
– "You are dead by the time you know the ultimate truth"
• Often there is no concrete clarity on the goal as you optimize
– What are the strengths and weaknesses of the underlying assumptions?
• Data are voluminous, have high variety, and high velocity
– How to capture, store, transmit, share, sample, analyze?
– "Sufficiently tortured and abused, statistics would confess to almost anything"
<br />
Analytics Promises, Myths, and Reality
• Too many terms
– Business intelligence, Data mining, Data warehousing, Predictive analytics, Prescriptive analytics, Big data, …
• Too many questions
– Where do they emerge from? Why are they getting so popular? Is analytics the panacea? Is it yet another overhyped buzzword?
<br />
<br />
Analytics in Action
• Instead of struggling for academic definitions, let us look into what these are intended to achieve, and look at some examples from the business world where these are being used
<br />
Netflix
• Disruptive innovation riding on the analytics dimension
– Reed Hastings, Computer Science Master's from Stanford, pays a $40 late fee for a VHS cassette of the movie Apollo 13
– The recommender system
– The transportation logistics fine-tuning
– The silent battle of the nondescript entrant Netflix against the incumbent billion-dollar Blockbuster
– Blockbuster initially ignored Netflix
– Recognition (and virtual surrender) by Blockbuster – it was too late!
<br />
<br />
Harrah's
• Knowing customers better using analytics
– The magic of CRM
Wal-Mart
• Supply chain and inventory management, and the success of Radio Frequency Identification (RFID) based analytics
– Reduced stockouts!
<br />
Examples Galore …
• HP
– Which employee is likely to quit?
• Target
– Which customers are expectant mothers?
• Telenor
– Which customers can be persuaded to stay back?
• Latest US presidential election
– Which voter will be positively persuaded by political campaign contact such as a call, door knock, flyer, or ad?
• IBM (Watson)
– Automated Jeopardy! Analytics software challenged human opponents!
<br />
<br />
Analytics Defined?
• Analytics is (Ref: Davenport)
– Extensive use of data; statistical, quantitative, and computer-based analysis; explanatory and predictive models; and a fact-based rather than gut-feeling-based approach to arrive at business decisions and to drive actions
• Analytics can help arrive at intelligent decisions
• What is intelligence in this context?
• How can a machine help take intelligent decisions?
<br />
Changing Nature of Competition in the Business World: How to Get Competitive Advantage?
• Operational business processes aren't much different from anybody else's
– Thanks to R&D on "best practices"
<br />
<br />
…Competitive Advantage
• Operational business processes aren't much different from anybody else's
– Thanks to R&D on "best practices"
– Thanks to comparable technology accessible to all
• Unique geographical advantage doesn't matter much
– Thanks to improved transportation/communication logistics
• Protective regulations are largely gone
– Thanks to globalization
• Proprietary technology gets rapidly copied
– Thanks to technological innovations
• What is left is to improve efficiency and effectiveness by taking the smartest business decisions possible out of data which may as well be available to competitors
<br />
Stock Market Analogy
• It's too well known that the stock market would cease to exist if one could find a predictive theory of price movements!
– The market exists because no such ideal model is possible. Players try disparate models – some gain while others lose
• Data without interpretation is meaningless
– The smarter guy discovers more meaningful patterns for his business by using analytics!
<br />
<br />
Emergence of Analytics: The Contributors
• Old wine in new bottle? – Yes and No
• Pillars of business analytics
– Data capture and storage in bulk
– Statistical principles revisited
– Developments in machine learning principles
– Availability of easy-to-use software tools
– Managerial involvement with a changed mindset
<br />
Panacea ? • Too many ready to use tools and techniques are available! – “A fool with a tool is still a fool”<br />
<br />
Road to Analytics Uttam K Sarkar Indian Institute of of Management Management Calcutta<br />
<br />
240<br />
<br />
6/26/2014<br />
<br />
Attributes of organizations thriving on analytics
• Identification of a distinctive strategic capability for analytics to make a difference
  – Highly organization specific
  – May be customer loyalty for Harrah's, revenue management for Marriott, supply chain performance for Wal-Mart, customers' movie preferences for Netflix, …
• Enterprise-level approach to analytics
  – Gary Loveman of Harrah's broke the fiefdom of marketing and customer service islands into a cross-department chorus on shared data and thoughts
• Senior management capability and commitment
  – Reed Hastings of Netflix (Computer Science graduate from Stanford)
  – Gary Loveman of Harrah's (PhD in Economics from MIT)
  – Bezos of Amazon.com (Computer Science graduate from Princeton)

Stages of analytics preparedness of an organization
• Analytically impaired
  – Groping for data to improve operations
• Localized analytics
  – Using limited analytics to improve a functional activity
• Analytics experimenter
  – Exploring analytics to improve a distinctive capability
• Analytics mindset
• Competing on analytics
  – Staying ahead on the strength of analytics
Investment in analytics without due diligence may not yield results
• United Airlines had been the world's largest airline in terms of fleet size and number of destinations, operating over 1,200 aircraft with service to some 400 destinations
• United Airlines had invested heavily in analytics
• The company filed for Chapter 11 bankruptcy protection in 2002
• Despite the business downturn, most other airlines were not as adversely affected
• What went wrong with its analytics?
United Airlines analytics … postmortem
• UA pioneered yield management, but that alone wasn't enough (a textbook yield-management sketch follows this slide)
  – Other, smaller airlines were cutting costs and offering seats at lower fares
• UA was developing complex route-planning optimization analytics with multiple plane types
  – Competitors like Southwest used only one kind of plane and had a far simpler and cheaper system of running
• UA pioneered a loyalty programme based on analytics
  – Their customer service was so pathetic that frequent flyers hardly had any loyalty to it
• UA spent a fortune developing analytics
  – Other airlines could buy the Sabre system, at a far cheaper price, for pretty similar analysis
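The postmortem keeps returning to yield management. As a minimal sketch of what such analytics computes (the textbook two-fare Littlewood's rule, with hypothetical fares and demand, not UA's actual system):

```python
from statistics import NormalDist

# Two-fare Littlewood's rule -- a textbook sketch, not United's system.
# Hypothetical numbers: full fare $400, discount fare $150, and full-fare
# demand distributed Normal(mean=60, sd=15).
full_fare, discount_fare = 400.0, 150.0
demand = NormalDist(mu=60, sigma=15)

# Protect seats for full-fare passengers while the expected marginal
# revenue of one more protected seat, P(full-fare demand > y) * full_fare,
# still exceeds the discount fare: y* = F^{-1}(1 - discount/full).
protection_level = demand.inv_cdf(1 - discount_fare / full_fare)
print(f"protect ~ {protection_level:.0f} seats for full-fare demand")
```

In words: stop selling discount seats once only the protection level remains, because the chance of later selling those seats at the full fare is worth more than the discount fare in hand.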
Questions to ask when evaluating an analytics initiative
• How does this initiative improve enterprise-wide capabilities?
• What complementary changes need to be made to take advantage of the capabilities?
  – More IT? More training? Redesigned jobs? Hiring for new skills?
• Do we have access to the right data? (a minimal data-audit sketch follows this slide)
  – Are the data timely, accurate, complete, consistent?
• Do we have access to the right technology?
  – Is the technology workable, scalable, reliable, and cost-effective?
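The "right data" question breaks into four testable properties. A minimal sketch of turning them into automated checks (the table, column names, and thresholds are invented for illustration):

```python
import pandas as pd

# Hypothetical transactions table; columns and rules are invented to
# illustrate checking the four data-quality questions programmatically.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, None, 25.0, -5.0],
    "ts": pd.to_datetime(["2014-06-01", "2014-06-02", "2014-06-02", "2013-01-01"]),
})

checks = {
    # Timely: nothing older than the reporting window we care about.
    "timely": (df["ts"] >= pd.Timestamp("2014-01-01")).all(),
    # Accurate: a domain rule -- amounts must be non-negative.
    "accurate": (df["amount"].dropna() >= 0).all(),
    # Complete: no missing values in required columns.
    "complete": df[["order_id", "amount"]].notna().all().all(),
    # Consistent: one row per order_id (no conflicting duplicates).
    "consistent": not df["order_id"].duplicated().any(),
}
print(checks)  # each False flags a data-quality problem to chase down
```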
Missteps to avoid when getting into an analytics initiative
• Focusing excessively on one dimension of the capability (say, only on technology)
• Attempting to do everything at once
  – Any complex system is best handled incrementally
• Investing too much or too little in analytics without matching the impact on and demand of the business
• Choosing the wrong problem
  – Wrong formulation, wrong assumptions, wrong data, …
• Making the wrong interpretation (a baseline-comparison sketch follows this slide)
  – Tool + data + a few mouse clicks = model
  – Model + input = output
  – Who assesses whether the output is garbage?
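The last misstep is exactly what held-out evaluation against a naive baseline guards against: if a model cannot beat the majority class on data it never saw, its output is garbage however impressive the tool. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data for illustration: two informative features plus noise.
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=500)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = LogisticRegression().fit(X_tr, y_tr)
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Held-out accuracy vs. a majority-class baseline: the gap, not the
# model's existence, is what justifies trusting the output.
print("model   :", round(model.score(X_te, y_te), 2))
print("baseline:", round(baseline.score(X_te, y_te), 2))
```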