Short Notes By: Salman Fazal
Contents
- Normalisation
- Big Data
- Map/Reduce
- Hadoop
- NoSQL
- Graph DB
- Mongo DB
- Cassandra DB
- Data Mining
- Extras - Big Data & Hadoop Clusters & Consistency
220CT Notes Salman Fazal
Normalisation
The process by which we efficiently organize data to achieve the following goals:
1. Eliminate redundancy (duplicated data)
2. Organize data efficiently
3. Reduce data anomalies

Anomalies – inconsistencies in the data stored. These can arise when inserting, updating or deleting. E.g. when a particular record exists in many locations, each copy must be updated individually.
Normal Forms (3 levels) *in order to achieve one level of normal form, each previous level must be met.
Item    | Colors      | Price | Tax
T-shirt | Red, Blue   | 12    | 0.60
Polo    | Red, Yellow | 12    | 0.60
T-shirt | Red, Blue   | 12    | 0.60
Shirt   | Blue, Black | 25    | 1.25

<- TASK: CONVERT THIS TABLE INTO THIRD NORMAL FORM
First Normal Form
- Each record is unique (no repeating data)
- Each cell is atomic (contains only a single value)
- No repeating groups (multiple columns do not store similar info, e.g. Child1, Child2)

Item    | Colour | Price | Tax
T-shirt | Red    | 12    | 0.60
T-shirt | Blue   | 12    | 0.60
Polo    | Red    | 12    | 0.60
Polo    | Yellow | 12    | 0.60
Shirt   | Blue   | 25    | 1.25
Shirt   | Black  | 25    | 1.25

Duplicate records removed; the Colour column now contains a single value per cell.
Second Normal Form
- All attributes (non-key columns) must depend on the whole key
- Attributes that depend on only part of the key must be moved to a separate table and related by a foreign key

Item    | Colour
T-shirt | Red
T-shirt | Blue
Polo    | Red
Polo    | Yellow
Shirt   | Blue
Shirt   | Black

Item    | Price | Tax
T-shirt | 12    | 0.60
Polo    | 12    | 0.60
Shirt   | 25    | 1.25

Price and tax depend on the item but not on the colour, so they move to a separate table.
Third Normal Form
- Eliminate fields that do not depend on the primary key
- If a column depends on another non-key column rather than on the primary key, it must be moved to another table

Item    | Price
T-shirt | 12
Polo    | 12
Shirt   | 25

Price | Tax
12    | 0.60
25    | 1.25

In the above tables, tax depends on price and not on item, so a new table is created.
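The decomposition above can be sketched in plain Python (an illustrative stand-in for the SQL tables; the dict names are made up):

```python
# Sketch: decomposing the denormalized shirt table into the 3NF tables above.

rows = [  # the table after First Normal Form
    {"item": "T-shirt", "colour": "Red",    "price": 12, "tax": 0.60},
    {"item": "T-shirt", "colour": "Blue",   "price": 12, "tax": 0.60},
    {"item": "Polo",    "colour": "Red",    "price": 12, "tax": 0.60},
    {"item": "Polo",    "colour": "Yellow", "price": 12, "tax": 0.60},
    {"item": "Shirt",   "colour": "Blue",   "price": 25, "tax": 1.25},
    {"item": "Shirt",   "colour": "Black",  "price": 25, "tax": 1.25},
]

# 2NF: price depends on the item alone, so it moves to its own table
item_price = {r["item"]: r["price"] for r in rows}

# 3NF: tax depends on the price, not on the item, so it gets its own table
price_tax = {r["price"]: r["tax"] for r in rows}

# The original table keeps only the key columns
item_colour = [(r["item"], r["colour"]) for r in rows]

print(item_price)  # {'T-shirt': 12, 'Polo': 12, 'Shirt': 25}
print(price_tax)   # {12: 0.6, 25: 1.25}
```

Note how the six denormalized rows collapse into three item prices and just two price/tax pairs.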
Extra

Mission No | Equipment                | Qty | Item Weight | Total Weight
ISS2237    | Portable Water Dispenser | 2   | 100KG       | 200KG
ISS3664    | Flexible Airduct         | 6   | 0.5KG       | 3KG
ISS2356    | Small Storage/Rack       | 4   | 2KG         | 8KG
           | Biofilter                | 6   | 0.2KG       | 1.2KG
           | Small Storage/Rack       | 3   | 2KG         | 6KG
In this table, the last column is 'Total Weight', which is calculated from the previous columns. When normalizing, we need to eliminate the 'Total Weight' column. Although total weight depends on the weight and quantity, the column is computed and can easily be constructed outside of the database. Therefore, the column does not belong in the database and must be discarded.
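Computing the derived column outside the database, as the note recommends, might look like this (a minimal sketch using the mission-table values):

```python
# Sketch: the 'Total Weight' column is derived, so it is computed in the
# application instead of being stored in the database.

equipment = [
    # (equipment, qty, item_weight_kg) -- values from the mission table above
    ("Portable Water Dispenser", 2, 100.0),
    ("Flexible Airduct",         6, 0.5),
    ("Small Storage/Rack",       4, 2.0),
    ("Biofilter",                6, 0.2),
    ("Small Storage/Rack",       3, 2.0),
]

def total_weight(qty, item_weight):
    """Derived value: never stored, always recomputed."""
    return round(qty * item_weight, 2)

totals = [total_weight(q, w) for _, q, w in equipment]
print(totals)  # [200.0, 3.0, 8.0, 1.2, 6.0]
```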
Big Data
Definition:
1. Big data refers to data sets that grow so large that it is difficult to capture, manage, store and analyse them with typical database software tools.
2. A huge volume of data that cannot be stored and processed using the traditional approach within the given time-frame.
Types of Data:
Structured Data – often referred to as data in a structured relational database, mostly organized in tabular format. E.g. SQL database.
Unstructured Data – everything else. E.g. emails, video, audio, social networking.
90% of the data is unstructured!

Characteristics of Big Data (4 V's):
1. Volume (size): the amount of data generated by organizations or individuals (the need to process terabytes of data).
2. Velocity (speed): the rate at which the data is generated (the need to be analysed quickly).
3. Variety (types): different types of data coming in from different sources; structured + unstructured data.
4. Value: locating information within the data. Having access to big data is of no use unless we can turn it into value.
Meaningful data analysis:
i. Identify what you're looking for
ii. Prepare the data
iii. Explore the data
iv. Apply algorithms
v. Analyse results
vi. Repeat
This leads to:
- Discovering a relationship between pieces of data that would not have been possible to identify otherwise
- Discovering how and when behaviours occur (e.g. can help a business create a new sales model)
Big Data QoS
Analysing big data allows for better, faster and more profitable decisions from a business point of view. This is done using data that was not accessible, not available and not usable before.
- Availability
  o System remains operational on failing nodes (clients can read & write)
  o Ensures business continuity
- Reliability
  o High accuracy; low accuracy puts the organisation at risk
- Flexibility
  o System is able to meet changing business demand
- Scalability
  o System's ability to meet growth requirements
- Performance
  o How quickly and efficiently a system runs
- Security
  o Protection against unauthorized access
*Scalability, tolerance, flexibility and efficiency benefit the user too.
Map/Reduce
Simply a way to take a big task and divide it into discrete tasks that can be done in parallel.
Cost effective and easy to use
Functionalities:
1. Split data into smaller chunks
2. Map data according to mapping key
3. Reduce and merge all related data
Split -> Map -> Shuffle/Sort -> Reduce
Pros           | Cons
Simplicity     | Restricted
Fault-tolerant | Does not provide a solution for graphs
Scalability    |
Hadoop
Big Data (recap) – High-volume, high-velocity and high-variety data that demands cost-effective information processing for enhanced insight and decision-making.
Hadoop – Framework for parallel processing of large datasets distributed across clusters of nodes (computers). An open-source software implementation of MapReduce.
Cluster – multiple machines linked together by a high-speed LAN.
It focuses on the following:
1. Performance – supports processing of huge data sets in parallel within clusters
2. Economics – lowers costs by using commodity computing hardware (high-performance, low-cost machines)
3. Linearly scalable – more nodes can do more work within the same time
4. Fault-tolerance – node failure does not cause computational failure as data is replicated
*Hadoop QoS -> Scalable, Tolerant, Flexible & Efficient (see big data section)
- Hadoop consists of 2 components: HDFS (storing data) and MapReduce (processing data)
- Word-counting example (see image in MapReduce section): counting the number of times each word is used in every book in Coventry University Library. We would do the following:
  1. Partition the texts (pages) and put each on a separate computer or computing element/instance (think cloud).
  2. Each computing element takes care of its portion.
  3. The word counts are then combined.
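The split -> map -> shuffle -> reduce pipeline for word counting can be sketched in plain Python (a single-process stand-in for what Hadoop distributes across nodes):

```python
# Minimal sketch of MapReduce word counting, run locally.
from collections import defaultdict

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in this chunk
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(pairs):
    # Group values by key so each reducer sees all counts for one word
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

# Split: each "page" would live on a different node in a real cluster
pages = ["the cat sat", "the dog sat on the mat"]
pairs = [pair for page in pages for pair in map_phase(page)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["the"])  # 3
```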
Hadoop – The design
- Data is distributed around the network
  o Every node can host data
  o Data is replicated to support fault-tolerance
- Computation is sent to data, not vice-versa
  o Code to be run is sent to nodes
  o Results of computations are combined
- Basic architecture is master/worker
  o Master (JobNode) launches the application
  o Workers (WorkerNodes) perform the computation
The Architecture (components)
1. Name Node (master):
   i. Keeps track of where the data is within the node
   ii. Executes operations (like opening, closing, renaming a file)
   iii. One per cluster
2. Data/Slave Node (worker):
   i. Stores the data and communicates with other nodes
   ii. One per node
3. Job Tracker:
   i. Central manager; schedules the map/reduce tasks to run
4. Task Tracker:
   i. Accepts & runs map, reduce and shuffle tasks
How Hadoop works (HDFS and MapReduce):
1. The MapReduce library splits files into pieces (64-256MB); the master assigns the tasks.
   o Blocks are distributed across nodes
   o Each input file is processed by one mapper (local)
   o Splitting depends on the file format
2. Mapping tasks
   o Read contents from input, then parse into key-value pairs
   o Apply the map operation to each pair
   o File locations are forwarded to the master, which then forwards them to the reduce workers
3. Reduce
   o Fetch the input locations sent by the master
   o Sort the input by key
   o For each key, apply the reduce operation to the values associated with that key
   o Write the result to an output file, then return the file location to the master
Summary: During the map process, the master node instructs worker nodes to process their local input data. Hadoop then performs a shuffle, where each worker node passes its results to the appropriate reducer node. The master node collects the results from all reducers and compiles the answer to the overall query.

HDFS basics
- Files are split into fixed-size blocks and stored on nodes.
- Data blocks are replicated for fault-tolerance (default is 3).
- The client talks to the namenode for metadata (info about the filesystem, i.e. which datanodes manage which blocks), and talks to the datanodes directly for reads and writes.
Hadoop and fault tolerance
The bigger the cluster, the higher the chance of hardware failure (i.e. disk crashes, overheating). What happens if:
- A worker fails:
  o The worker is marked failed if the master gets no response from it when pinged
  o Tasks assigned to the failed worker are added back to the task list for re-assignment; HDFS ensures the data is replicated
- The master fails:
  o The master writes checkpoints showing progress
  o If the master fails, a new master can start from the previous checkpoint, and the job is restarted

Replication
- 3 copies (default) are created (objectives: load-balancing, fast access & fault tolerance).
- The first is written to the same node, the second to a different node within the same rack, and the third to a node in another rack.
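The same-node / same-rack / other-rack placement rule can be sketched as follows (node and rack names are made up for illustration):

```python
# Sketch of HDFS-style replica placement: first copy on the writer's node,
# second on a different node in the same rack, third in another rack.

cluster = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}

def place_replicas(writer_node):
    # Find the rack the writer lives in
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    # Candidate nodes in the same rack (excluding the writer itself)
    same_rack = [n for n in cluster[writer_rack] if n != writer_node]
    # Candidate nodes in every other rack
    other_racks = [n for r, nodes in cluster.items() if r != writer_rack for n in nodes]
    return [writer_node, same_rack[0], other_racks[0]]

print(place_replicas("node1"))  # ['node1', 'node2', 'node3']
```

Losing any single rack still leaves at least one copy of the block reachable.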
NoSQL
Not Only SQL. NoSQL databases are geared toward managing large sets of data which come in huge variety and velocity, often in distributed systems or the cloud.
CAP Theorem
A distributed system can provide at most two of the following three guarantees at the same time: Consistency, Availability and Partition tolerance.
NoSQL Family
1. Graph-family – elements structured as nodes and edges. E.g. Neo4j graph DB.
2. Document-family – elements stored in document-like structures. Each document has its own data and its own unique key, which is used to retrieve it. E.g. MongoDB.
3. Column-family – stores data tables as columns rather than rows (therefore can have a large number of columns). E.g. CassandraDB.
RDBMS vs NoSQL

RDBMS                                                        | NoSQL
Can store only structured data                               | Works with all kinds of data
Structured query language (SQL)                              | No predefined schema
Performance decreases with large volumes of data (joins)     | Can support huge volumes of data without affecting performance
Expensive hardware required for scaling                      | Horizontally scalable; uses cheap commodity hardware
Offers powerful queries such as joins and GROUP BY           | Has no functionality for joins as data is denormalized
ACID – Atomic, Consistent, Isolated, Durable                 | CAP – Consistent, Available & Partition-Tolerant
Graph DB
A database that uses graph structures with nodes, edges (relationships) & properties to store and represent information.
- A graph is a collection of nodes (things) and edges (relationships). Both of these have properties (in key-value pairs).
ER Model | Graph Model
Tables   | Nodes + Edges
Rows     | Nodes
Columns  | Key-value pairs (Properties)
Joins    | Edges
Nodes – instances of objects (entities). E.g. Billy is an instance of a user, Toyota of a car.
Relationships – connections between nodes. Must have a name and direction. This adds structure to the graph.
Features:
1. Flexible – can easily adapt to changes/additions, i.e. relationships and properties can be expanded and nodes can be tailored without affecting existing queries.
2. Speed – as the volume increases, traversal time stays constant, unlike RDBMS where speed depends on the total amount of data stored (as several joins may be required).
3. Agility – can effectively and rapidly respond to changes.
4. Schemaless – unstructured (not a tabular-type format).
Traversal
Navigating a graph (from a specific node to other nodes) along relationship edges. Traversal is bidirectional – can follow incoming or outgoing edges. E.g. find my friends of friends => start with my node, navigate to a friend, find their friends.
Traversal can be of two types:
- Depth-first: follow the first path to its end, then return and take the second path, and so on.
- Breadth-first: follow all first steps/depths, then move to the second depth, and so on.
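The two traversal orders can be sketched on a small friends graph (adjacency list; the names are made up):

```python
# Depth-first vs breadth-first traversal of a tiny graph.

graph = {
    "me":   ["ann", "bob"],
    "ann":  ["carl"],
    "bob":  ["dina"],
    "carl": [],
    "dina": [],
}

def depth_first(node, visited=None):
    # Follow the first path to its end before backtracking
    if visited is None:
        visited = []
    visited.append(node)
    for neighbour in graph[node]:
        if neighbour not in visited:
            depth_first(neighbour, visited)
    return visited

def breadth_first(start):
    # Visit all nodes at depth 1, then depth 2, and so on
    visited, queue = [start], [start]
    while queue:
        node = queue.pop(0)
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
    return visited

print(depth_first("me"))    # ['me', 'ann', 'carl', 'bob', 'dina']
print(breadth_first("me"))  # ['me', 'ann', 'bob', 'carl', 'dina']
```

Note how depth-first reaches "carl" before "bob", while breadth-first visits all direct friends first — exactly the "friends of friends" pattern described above.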
Cypher – query language for graph databases. A declarative language (specify what you want rather than how to achieve it). Commands are built from clauses that represent matches to patterns and relationships. E.g.
CREATE (Kev:Person {Name: 'Kevin', Age: 45})
CREATE (Beer:Drink {Name: 'Beer', Alcoholic: 'Yes'})
MongoDB
An open-source, non-relational, document-family database that provides high performance, high availability and horizontal scalability.
- MongoDB hosts a number of databases. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs.
- MongoDB stores data on many nodes, which contain replicas of the data. Therefore:
  o Consistency – all replicas contain the same data; the client always has the same view of the data no matter which node it reads from.
  o Availability – the system remains operational on failing nodes (clients can read & write).
  o Partition Tolerance – the system functions even if there is a communication breakdown between nodes.
MongoDB Architecture

RDBMS    | MongoDB
Database | Database
Table    | Collection
Row      | Documents
Column   | Fields
MongoDB Features
- Document-based – documents are stored in JSON format
- Querying – supports dynamic querying that's nearly as powerful as SQL
- Replication and availability – provides redundancy and increases data availability with multiple copies of data on different database servers
- Horizontal scalability – easy to scale out on commodity hardware
- Supports map/reduce functionality – i.e. in a situation where you would use GROUP BY in SQL, map/reduce is the right tool in MongoDB
- Schemaless – non-relational; does not follow a specific structure like relational databases. Can store any number and variety of key-value pairs in a document
- Scalable – replication and sharding
  o Replication – duplicates data across multiple nodes
  o Sharding – splits data across multiple machines/shards
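The schemaless idea can be sketched with plain dicts standing in for MongoDB documents (this is an illustration, not the pymongo API; names and fields are made up):

```python
# Two documents in the same "collection" with different fields.

users = [  # a collection holds a set of documents
    {"_id": 1, "name": "Billy", "age": 21},
    {"_id": 2, "name": "Sara", "email": "sara@example.com", "tags": ["admin"]},
]

def find(collection, **criteria):
    """Tiny stand-in for a document query: match documents on key-value pairs."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(users, name="Billy"))  # [{'_id': 1, 'name': 'Billy', 'age': 21}]
```

No schema forces both documents to share the same fields — the second document simply carries extra keys.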
When to use MongoDB
- When you need scalability and high availability
- Real time: can analyse data within the database, giving results straight away
- If your data size will increase a lot and you will need to scale (by sharding)
- If you don't need too many joins on tables
- Particularly useful for storing unstructured data
- When your application is expected to handle high insert loads
Sharding
Sharding is the process of storing data records across multiple machines; it is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide acceptable read and write throughput. Sharding solves the problem with horizontal scaling: you add more machines to support data growth and the demands of read and write operations. Sharding reduces the number of operations each node handles; each node processes fewer operations as the cluster grows, so the cluster can increase capacity and throughput horizontally. I.e. to insert data, the application only needs to access the machine/shard responsible for that record.
Benefits
- Splits workload – work is distributed amongst machines. This increases performance as each machine has a smaller working set.
- Scaling – vertical scaling is too costly; sharding lets you add more machines to your cluster. This makes it possible to increase capacity without any downtime.
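A minimal sketch of hash-based sharding (one common strategy; shard count and keys are invented for illustration):

```python
# Each record's key decides which machine/shard stores it,
# so an insert only touches one shard.
import hashlib

SHARDS = 4

def shard_for(rowkey):
    # Stable hash (md5) so the same key always maps to the same shard
    digest = hashlib.md5(rowkey.encode()).hexdigest()
    return int(digest, 16) % SHARDS

shards = {i: [] for i in range(SHARDS)}
for key in ["user:1", "user:2", "user:3", "user:4"]:
    shards[shard_for(key)].append(key)

# Every key landed on exactly one shard
assert sum(len(v) for v in shards.values()) == 4
```

Adding a machine means re-dividing the key space rather than buying a bigger server, which is the horizontal-scaling benefit described above.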
Replication
The process of duplicating data across multiple nodes. Provides redundancy and increases data availability.
Why replication?
- To keep the data safe
- High availability (24/7)
- Disaster recovery
- No downtime for maintenance
Cassandra DB
A distributed, highly-scalable, fault-tolerant columnar database.
Column-family Database
- A column family is very similar to a table in an RDBMS: it consists of rows and columns. Rows are uniquely indexed by an ID (rowkey) and each row can have different columns, each of which has a name, a value, plus a timestamp (of when the data was last added/updated).
- Data is organised into families of related columns, while relational databases are organised into tables.
- Empty columns: if a row does not contain a value for a certain column, instead of giving it a null value (as in an RDBMS), the column is simply missing from that row.
- Denormalized: joins in a relational model are flexible, storage-efficient and elegant, but can also be very slow at run time, and they perform very poorly in a distributed data model. Cassandra has no joins, therefore denormalisation can be the answer.
ID | Host Name | Discovery Method | Orbital Period | Timestamp
1  | 11 Com    | Radial Velocity  | 326.03±0.32    | 2016-02-12 11:32:00
2  | 2MASS     | Imaging          |                | 2016-11-12 18:05:09
Cassandra Architecture:
The Cassandra cluster is pictured as a 'ring' in which nodes communicate and exchange information with other nodes.
How do writes and reads operate? Cassandra is a masterless architecture: at any point the client can connect to any node; the node the client is connected to takes charge and forwards and replicates the data to the other appropriate nodes. When reading data, the client supplies a rowkey, and the node the client is connected to determines the latest-version replica using the rowkey.
Peer-to-peer replication: no master, no slaves. No single point of failure!
Key Cassandra Features and Benefits:
1. Flexible schema – with Cassandra, it isn't necessary to decide what fields your records will need beforehand. You can add/remove fields extemporaneously. For massive databases, this is an incredible efficiency boost.
2. Scalability – you can add more hardware (nodes) as the amount of data increases. This also increases performance, as more nodes can do more work within the same time.
3. Fault-tolerant – in NoSQL databases (specifically Cassandra), data is replicated to multiple nodes, therefore a node failure will not cause any downtime or computational failure.
   Replication – 3 copies of the same data are created on different nodes. If a node fails, the data is replicated again to another node. Other objectives of replication are load-balancing and fast access.
4. Flexible data storage – Cassandra can store all data types; these could be structured, semi-structured or unstructured.
5. Fast reads and writes – with linear scalability, Cassandra can perform extremely fast writes without affecting its read efficiency.
6. Query Language – an SQL-like language that makes moving from a relational database very easy.

Extra (How Cassandra retrieves data): (part of the NASA Exoplanet dataset)
In a traditional row-oriented database, data is retrieved row by row, reading from left to right. Although we may only want data from two columns, we end up reading each entire row and then keeping just the required columns. If the dataset had thousands of rows, getting all the data could take a while! This is where a columnar database can be very effective: instead of reading every row, we read just the required column from top to bottom (along with the rowkey column). This method is much more effective and performs much better when running large numbers of queries.
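The row-versus-column contrast can be sketched in a few lines (an illustrative in-memory layout, not Cassandra's actual storage engine):

```python
# Row store vs column store for the exoplanet example above.

# Row-oriented layout: one tuple per row
row_store = [
    (1, "11 Com", "Radial Velocity", "326.03"),
    (2, "2MASS", "Imaging", None),
]

# Column-oriented layout: one list per column, aligned by position (rowkey)
column_store = {
    "id":               [1, 2],
    "host_name":        ["11 Com", "2MASS"],
    "discovery_method": ["Radial Velocity", "Imaging"],
    "orbital_period":   ["326.03", None],
}

# Row-oriented read: scan whole rows, then keep just one field
names_from_rows = [row[1] for row in row_store]

# Column-oriented read: grab just the one column, top to bottom
names_from_columns = column_store["host_name"]

print(names_from_rows == names_from_columns)  # True
```

Both reads return the same data, but the columnar read never touches the other columns — which is where the performance win comes from on wide tables.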
Data Mining
- Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information.
- In simple terms: data mining refers to extracting knowledge from large amounts of data. The information can be used for any application purpose, such as to increase revenue, cut costs, make forecasts, etc.

Why do we need it?
- Too much data and too little information. There is a need to extract useful information from the data and to interpret it.
- Data mining helps discover relationships between two or more variables in your data. This can help create new opportunities (i.e. for businesses) by:
  o Predicting trends and behaviours
  o Discovering previously unknown or hidden patterns
- The tasks of data mining are twofold:
  o Predictive – using features to predict unknown or future values of the same or other features
  o Descriptive – finding interesting, human-interpretable patterns that describe the data
Data Warehousing is the process of combining data from multiple sources into one common repository (dataset). Data Mining is the process of finding patterns in a given dataset.
Problems with data mining:
- Individual privacy – analyses routine behaviour and gathers a significant amount of information
- Data integrity – inaccurate, conflicting or out-of-date data from different sources
- Cost
- Efficiency & scalability – data mining algorithms must be able to work with masses of data
Data Mining Process Steps:
1. Understand the problem and what we are trying to achieve.
2. Set up a data source. Here we collect the historical data, then put it into a structured form (dataset) so it can be used in the next step.
3. Build the model from the dataset and turn it into a predictive model. The results are then tested and evaluated to get the best and most accurate results.
4. Apply the model and combine the feedback and findings on new incoming examples.
Data Mining Tasks/Methods
- Classification [predictive] – categorizing; the process by which ideas and objects are recognized, differentiated and understood.
- Clustering [descriptive] – grouping the data into more than one group based on similarity.
  o For example, news can be clustered into different groups: entertainment, politics, national, and world news.
- Association [descriptive] – identifies relationships between events that occur at one time.
- Sequencing [descriptive] – identifies relationships that exist over a period of time.
- Forecasting – the process of making predictions of the future based on past and present data and analysis of trends.
- Regression [predictive] – a statistical process for estimating the relationships among variables.
- Time series analysis – examines a value as it varies over time.
Data mining can help in: fraud detection, marketing campaigns, detecting diseases, scientific experiments, weather prediction, and studying consumers.
Build Model – Decision Tree (Classification)
A decision tree can be used as a model for sequential decision problems under uncertainty.
Pros:
- easy to interpret
- easy to construct
- can handle a large number of features
- very fast at testing time
Cons:
- low predictive accuracy
- not possible to predict beyond the min and max limits
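A decision tree is just a sequence of if/else splits; a hand-built sketch makes the "easy to interpret, fast at testing time" pros concrete (the features and thresholds below are invented for illustration, not learned from data):

```python
# A two-level decision tree classifier, written out by hand.

def classify(temperature, windy):
    """Decide whether to play outside."""
    if temperature < 10:   # root split on temperature
        return "stay in"
    if windy:              # second-level split on wind
        return "stay in"
    return "play"

print(classify(temperature=20, windy=False))  # play
print(classify(temperature=5,  windy=False))  # stay in
```

Classifying a new example is just walking from the root to a leaf, which is why testing is so fast.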
Build Model – SOM (Clustering)
Self-Organising Map. Train the map using examples from the data set. Used for clustering data without knowing the classes in the input data.
Pros:
- No need to specify classes
- Can visualise data
- Can identify new relationships
Cons:
- Difficult to understand the decision
- Training produces a different map each time
Tools for performing data mining
1. SAS Enterprise Miner – organises the data mining process to create highly accurate predictive and descriptive models. Benefits:
   o Supports the entire data mining process
   o Builds more models faster
   o Enhances accuracy of prediction
2. WEKA – a collection of data mining tools (pre-processing data, classification, clustering, association).
Data pre-processing
- High-quality data mining needs data that is useful; to achieve this we need to perform some pre-processing on the data. This combines data cleaning, data integration and data transformation.
- Data quality issues can be expensive and time-consuming to overcome.
Why data quality?
- Cost saving, increased efficiency, reduction of risk/fraud, enables more informed decisions.
Measures for data quality:
- Accuracy: is the data correct?
- Completeness: is it complete, or is some of it unavailable?
- Consistency: has some data been modified while other data has not?
- Timeliness: is it updated in a timely manner?
- Reliability: is it trustworthy?
- Interpretability: can it easily be understood?
- Data cleaning – fill in missing values, smooth noisy data, correct incorrect values
- Data integration – combination of multiple data sources
- Data transformation – techniques to transform data (i.e. normalization)
- Data reduction – techniques applied to obtain a reduced representation of the data that is much smaller in volume, yet very similar to the original data
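Two of these steps can be sketched in a few lines (a toy column of ages, invented for illustration): filling missing values with the column mean (cleaning) and min-max normalisation (transformation).

```python
# Sketch of data cleaning + transformation on one column.

ages = [20, None, 40, None, 60]  # raw column with missing values

# Data cleaning: fill missing values with the mean of the known ones
known = [a for a in ages if a is not None]
mean = sum(known) / len(known)
cleaned = [a if a is not None else mean for a in ages]

# Data transformation: min-max normalisation into [0, 1]
lo, hi = min(cleaned), max(cleaned)
normalised = [(a - lo) / (hi - lo) for a in cleaned]

print(cleaned)     # [20, 40.0, 40, 40.0, 60]
print(normalised)  # [0.0, 0.5, 0.5, 0.5, 1.0]
```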
EXTRAS

HADOOP AND BIG DATA
Why big data?
- data growth is HUGE
- all that data is valuable
- disk is cheap
BUT:
- it won't fit on a single computer
- so, it needs to be distributed across thousands of nodes
- the good side is, distributed data = faster computation when run in parallel, i.e. 1 HDD = 100MB/sec, 100 HDDs = 10GB/sec
Hadoop has 2 components:
- HDFS – allows storing huge amounts of data in a distributed manner
- MapReduce – allows processing the huge data in a parallel manner

HDFS
HDFS architecture:
- files stored in blocks (64-256MB)
- provides reliability through replication
HDFS file storage:
- NameNode (master) = stores all metadata (filenames, location of blocks in DataNodes)
- DataNode (slave) = stores file contents as blocks. Blocks are replicated. Periodically sends reports of existing blocks to the NameNode.
- Clients read the NameNode for metadata, then talk directly with DataNodes for reads and writes
Failures:
- DataNode – marked failed if no report/heartbeat is sent to the NameNode. The NameNode replicates the lost blocks to other nodes.
- NameNode – a new or backup master takes over. The NameNode keeps checkpoints, so the new master starts from the previous checkpoint.
Replication: 3 copies are created;
- first on the same node
- second on a different node within the same rack
- third on a node in another rack
MAPREDUCE
2 stages:
1. Map stage – split data into smaller chunks and map them into key/value pairs
2. Reduce stage – sorts/shuffles by key, then outputs the combined results
MapReduce task management:
- JobTracker = schedules the tasks to run (on the slaves)
- TaskTracker = executes the tasks (from the master)
*task = map/reduce
Steps: Input data, Split, Map, Shuffle, Reduce, Output results.
How Hadoop works:
- Input Split
  o Input is split into blocks and distributed across the nodes (HDFS)
- Mapper
  o The JobTracker retrieves the input splits from HDFS
  o The JobTracker initiates the mapper phase on available TaskTrackers
  o Once the assigned TaskTrackers are done with mapping, they send their status to the JobTracker
- Reduce
  o The JobTracker initiates the sort/shuffle phase on the mapper outputs
  o Once completed, the JobTracker initiates the reduce operation from the results on the TaskTrackers
  o The TaskTrackers send output back to the JobTracker once reduce is complete
  o The JobTracker then sends the output report back to the client
CLUSTER DATABASES
Why run databases on clusters? The traditional model runs on one big machine; there is a single point of failure if the machine, storage or network goes down. It is also difficult to scale up, as you would need to buy a whole new machine (server); this is too costly and not flexible.
To resolve this, we use a cluster. A cluster combines several racks, each of which contains several machines/nodes. Flexibility is achieved as data is replicated, meaning we won't need backups as the data is always available. There is also no single point of failure, as nodes are replicated at least 2 times. If scaling out is required, just add more nodes to the cluster. Cheaper and more flexible.
Types of replication
- Synchronous – all replicas are updated on every write. All nodes are always up to date.
- Asynchronous – writes the data as soon as possible, but reads could be out of date. Eventual consistency.
Consistency
In relational databases, ACID consistency maintains data integrity. In NoSQL, consistency refers to whether or not reads reflect previous writes.
- Strict consistency – a read is guaranteed to return up-to-date data.
- Eventual consistency (MongoDB uses this) – read data may be stale, but writes are very quick. This provides high performance.
Inconsistencies occur if two database versions are updated at the same time, or a read is made from one machine while it is still not updated.
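The stale-read scenario under eventual consistency can be sketched as follows (an in-memory toy with two replicas; node names are made up):

```python
# Sketch of eventual consistency: a write lands on one replica first,
# so a read from the other replica can briefly return stale data.

replicas = {"node_a": {}, "node_b": {}}

def write(key, value):
    # Fast path: acknowledge after updating just one replica
    replicas["node_a"][key] = value

def sync():
    # Background replication catches node_b up later
    replicas["node_b"].update(replicas["node_a"])

write("x", 1)
stale = replicas["node_b"].get("x")  # read before replication: None (stale)
sync()
fresh = replicas["node_b"].get("x")  # read after replication: 1
print(stale, fresh)  # None 1
```

Under strict consistency, the write would not be acknowledged until both replicas were updated, so the stale read could never happen — at the cost of slower writes.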