Data Warehouse Book................................................................. ............................................................................................................................. ............................................................ 2 The Key Acronyms And Definitions ......................................................................................................... ......................................................................................................... 2 What's the need for a Data Warehouse? .......................................................... ................................................................................................ ...................................... 3 Difference Between DWH and OLTP ..................................................................................................... ....................................................................................................... .. 4 Data Acquisition ...................................................................................................................................... ...................................................................................................................................... 4 Data Loading ..................................................................... .......................................................................................................................................... ....................................................................... .. 4 Good Link ......................................................................... ............................................................................................................................................... ........................................................................ .. 5 Difference between Data mart and DWH ............................................................................................... ............................................................................................... 5 Top-down and Bottom-up Approach ...................................................................................................... ...................................................................................................... 6 ETL Application Development Example ............................................................. ................................................................................................... ...................................... 6 Staging ..................................................................................................................................................... ..................................................................................................................................................... 7 Informatica (9.5) ..................................................................................... ...................................................................................................................................... ................................................. 7 Informatica Corporation Products .......................................................................................................... .......................................................................................................... 8 PowerCenter Integration Service (PCIS) .................................................................................................. .................................................................................................. 8 Transformations ...................................................................................................................................... ...................................................................................................................................... 9 Mapplets............................................................................................................ ................................................................................................................................................ .................................... 10 Tasks and types of Tasks.................................................. Tasks........................................................................................................................ ...................................................................... 10 Worklet and types of Worklets ............................................................................................................. 10 Dimensions ............................................................................................................................................ ............................................................................................................................................ 11 Installation and Practice .................................................................................... ........................................................................................................................ .................................... 12
What is the purpose of this article ?
A quick briefing to ETL, Business Intelligence and Informatica
With this article it is possible to understand what needs to be known to have a solid understanding with ETL, BI and Informatica
Data Warehouse Book The Data Warehouse Toolkit - The Complete Complete Guide to Dimensional Modeling Modeling 2nd Edition by Ralph Kimball and Margy Ross
The Key Acronyms And Definitions OLTP stands for Online Transaction Processing ODS stands for Operational Data Source OLAP (Online Analytical Processing) : This system is an application that collects, manages, processes and presents multidimensional data for Analysis and management purposes.
Data Mining : Data Mining is the process of analyzing data from different perspectives and summarizing into useful information.
BI (Business Intelligence) : BI is the leveraging of Data Warehouse to help and make business decisions and business recommendations. Information and data rules engines are leveraged here to help make these decisions along with statistical analysis tools and data mining tools. From an information systems standpoint, BI provides users with online analytical processing or data analysis capabilities to predict trends, evaluate business questions, and so on. From an business analyst viewpoint, it is the process of gathering high-quality, meaningful information about a subject, which enables the analyst to draw conclusions.
ISP in relation to Informatica stands for I nformatica Service Platform OLCP (Online Complex Processing) : OLCP is typified by a moderate number of rows processed against multiple databases with a response in minutes, but not seconds. Examples: 1. Filling out an application for instant credit in a store is an OLCP Transaction. 2. Getting an instant approval on an interest rate for a loan.
DSS (Decision Support Systems) : Decision Support Systems include batch Reports, which Roll-up numbers to give business the big picture, and overtime, have evolved. DSS is an area where where Teradata has traditionally excelled with its its parallel architecture.
Homogenous Joins : Homogenous means same. A Join that is made on similar kind of database is known as Homogenous Joins.
Heterogeneous Joins : Heterogeneous means different. A Join that is made made on different kinds of database is known known as Homogenous Joins.
Flat Files : A File which is saved with an extension , some examples, of " .txt, .csv, .dat etc ". There are two kinds of Flat Files (1) Delimited Flat Files (2) Fixed with Flat Files. Fixed with Flat Files can improve the performance over Delimited Flat Files when extracting the data from Flat Files.
File List : A File list is the l ist of all data files that can be grouped together. The files should contain similar meta data definitions.
XML Source Qualifier Transformation : It reads the data from XML files, which are saved with an extension of .xml
What's the need for a Data Warehouse? 1. All the past attempts by IT to provide strategic information have been been failures. This was mainly mainly because IT has been trying to provided strategic information from operational systems. (Operational System here refers to OLTP) Operational systems are not designed for strategic information. 2. Informational systems are different f rom the traditional operational systems. 3. We need a new type of computing environment environment to provide strategic strategic information. The Data Warehouse promises to be this new computing environment. For example we can provide business the below : 1. Show the top-selling products 2. Tell me why this is going on (drill down)? 3. Let me see other data (drill across) 4. Show the highest margins 5. Alert when a region sells below the target "A Data Warehouse is a subject oriented, integrated, nonvolatile and time variant collection of data in support of management's decisions." In the Data Warehouse, data is not stored by operational applications, but by business subjects.
Difference Between DWH and OLTP Below is the difference between DWH (Data Warehouse) and OLTP (Online Transaction Processing) They are both databases and hold data. But, they have been designed for different scopes. Running the business (OLTP) and Managing the business (DWH)
OLTP (Online Transaction Processing)
DWH (Data Warehouse)
It is designed to support Business Transaction Processing
It is designed to support Analysis and decision making
Maintaining current data
Maintaining historical data
Normalized data (No duplicate rows)
De-normalized data (duplicate rows)
Query response time is low
Query response time is high
Application oriented data
Subject oriented data
Detailed data
Summarized data
Limited period of Transaction history (current, 1 3 Months)
Transaction history per years
Data Acquisition
It means an ETL It is the process of extracting data from the multiple OLTP source systems Transforming the data into a desired Business format Loading the data into a destination system There are two types of ETL to perform Data Acquisition, that are :
(1) Code Based ETL and (2) GUI Based ETL
Some GUI based ETL are : Informatica, Data Stage, BODS (Business Object Data Services) and SSIS (SQL Server Integration Service) etc. I personally would like to focus on Informatica and SSIS. Later on, we will be introduced introduced to Staging. That means we are not going to do ETL directly. We perform Extract and do Staging. Staging is a temporary data parking.
Data Loading It is the process of inserting the data into a destination system. There are two types of data loading. That are: 1. Initial Load or Full Load and 2. Incremental Load or Delta Load 1. Initial Load: It is the process of inserting the source data rows into an empty target tables. In the Initial Load all the source data gets loaded into target.
2. Incremental Load: It is the process of inserting only the new records after Initial Load. This topic places a question in our mind "How do we do it? " and introduces us "Top-Down Approach and Bottom-Up Approach".
Good Link http://www.nagesh.com/publications/technology/173-inmon-vs-kimball-an-analysis.html The DWH Gurus INMON and KIMBALL explained us about Top-Down Approach and Bottom-Up Approach. 1. Top-down Approach (INMON) According to the Inmon's approach we develop an Enterprise Data Warehouse first and then derive department specific subject oriented Database called Data marts. 2. Bottom-up Approach (KIMBALL) According to the Kimball's approach we develop department specific subject oriented Database called Data marts first and then consolidate or conglomerate the Data marts into an Enterprise Data Warehouse.
Difference between Data mart and DWH What is the difference between Data mart and DWH? That's the question pops up and we heard about it. So, it is worth worth full to see the difference in a tabular form.
DWH
Data mart
Designed to store Enterprise Information
Designed to information
Designed for Top Management who take the decisions at Enterprise Enterprise Level (CEO etc)
Designed for Middle Level Management who take the decisions for department level
Recommended to build high range Database. For example Teradata Database
Recommended to build mid range Database. For example Oracle Database
store
department
specific
Top-down and Bottom-up Approach Note: Data mart development is an independent of Enterprise Data Warehouse development.
Top-down Approach: Data mart HR Enterprise DWH
Data mart Finance
Bottom-up Approach:
Data mart Enterprise
Finance
DWH
Data mart HR
ETL Application Development Example Let's say the source system system (OLTP Database) has a table called "customer" and we load that table into the target system system (DWH Database)with a table called "dim_cust". The below shapes explains it further.
OLTP Database
DWH Database
customer +cust_id +cust_first_name + gender
dim_cust +cust_id +cust_name + gender
metadata
metadata
OLTP Database Customer +cust_id +cust_first_name + gender
Metadata of source[E] customer
DWH Database
ODBC
Business Rule [T] concat () decode() or any etc. ETL Application
Metadata of Target [L] dim_cust
ODBC
dim_cust +cust_id +cust_name + gender
Staging
A Staging is a "Temporary Database" and an intermediate layer between OLTP System and DWH. A Staging is a replica of OLTP System. A Staging can be called as a data parking area where the data parks temporarily. It can be a collection of data from one more OLTP Databases.
Informatica (9.5)
This took me to the main topics from Informatica. An Informatica PowerCenter is a Data Integration platform which allows to access the data from multiple OLTP source systems. Transforming the data and deliver the data throughout the Enterprise at any speed. Founded by Mr. Gaurav Dhillon. Current CEO is Mr. Sohaib Abbasi (It is very interesting and inspiring to read his professional achievements. Do not miss it ! )
Informatica Corporation Products 1. Informatica PowerCenter 2. Informatica PowerMart 3. Informatica PowerExchange 4. Informatica Data Quality (IDQ) 5. Informatica PowerAnalyzer 6. Master Data Manager (MDM) 7. Informatica B2B Integration 8. Informatica Cloud
1. Informatica PowerCenter The following PowerCenter components get installed with Informatica PowerCenter : 1. Informatica PowerCenter Clients 2. Informatica PowerCenter Repository 3. PowerCenter Repository Service (PCRS) 4. PowerCenter Integration Service (PCIS) 5. PowerCenter Domain 6. Informatica Administrator (Webclient)
1. Informatica PowerCenter Clients : Using PowerCenter Clients we can develop an ETL P rocess. The steps involved are : 1 - Creation of Source definition 2 - Creation of Target definition 3 - Design a Mapping with / without Business rule 4 - Create Session for each Mapping 5 - Create Workflow (a Sequential or Concurrent) 6 - Start Workflow (Running a Workflow) 7 - View Status and Monitor Monitor
PowerCenter Integration Service (PCIS) It is an ETL Engine or an ETL Server that performs Extraction, Transformation and Loading. An Integration Service has following components: 1. Reader 2. DTM (Data Transformation Manager) 3. Writer 1. Reader : It extracts the data from source systems. 2. DTM (Data Transformation Manager) : It process the data according to the business rule that you configure in the Mapping. 3. Writer : It loads the data into the target.
Transformations A Transformation is a PowerCenter object which allows to create business rules for processing data. There are two types of Transformations, they are: 1) Active Transformation (New feature from Informatica PowerCenter 9.1) 2) Passive Transformation
1) Active Transformation : A Transformation which can affect the rows that is change the number of rows while processing the data is known as an Active Transformation. Below is the list of Active Transformations used for processing the data : Note : t/r means Transformation
1. Source Qualifier t/r 2. Filter t/r 3. Router t/r 4. Sorter t/r 5. Aggregator t/r 6. Rank t/r
7. Joiner t/r 8. Union t/r 9. Update Strategy t/r 10. Transaction Control t/r 11. Normalizer t/r
2) Passive Transformation : A Transformation which does not not affect the rows or does not the change the number of rows while processing the data is known as a Passive Transformation. Below is the list of Passive Transformations used for processing the data : Note : t/r means Transformation
1. Expression t/r 2. Lookup t/r (Very important Topic) 3. Sequence Generator t/r 4. Stored Procedure t/r
5. XML Source Qualifier t/r 6. Input t/r (Mapplet) 7. Output t/r (Mapplet) 8. External Procedure t/r
Port and types of Ports : In the language or terminology of Transformation, Transformation, a Port is nothing but a Column of a Table or simply a file. There are two types of Ports, they are : (1) Input Port and (2) Output Port (1) Input Port : A Port that can receive the data is called an Input Port. (2) Output Port : A Port that can provide the data is called an Output Port.
Mapplets
A Mapplet is a reusable object created with the business rules using a set of transformations A maplet is created using Mapplet designer tool in the designer client cli ent components. There are types of Mapplets :
1. Active Mapplet : A Mapplet which is created with at least one Active Transformation. Such Mapplet is known as Active Mapplet. 2. Passive Mapplet : A Mapplet which is c reated with only Passive Transformation is known as Passive Mapplet.
Tasks and types of Tasks A Task is a set of executable actions, commands and functions. There are two types of Tasks, that are : 1. Reusable Task 2. Non-Reusable Task
1. Reusable Task :
A task that can integrated to multiple Workflows is knows as Reusable task. Reusable tasks are created using "Task Developer" tool and "Worklet Designer" tool
2. Non- Reusable Task :
A Task which is created specific to a Workflow is knows as Non-Reusable Task. A Non-Reusable Task is created using "Workflow Designer" tool.
Worklet and types of Worklets A Worklet is defined as group of related tasks. There are two types of Worklets, that are : 1. Reusable Worklet 2. Non-Reusable Worklet
Reusable Worklet :
Created using "Worklet Designer tool" Can be assigned to multple Workflows Wo rkflows A Worflow which contains the Worklet is known as Parent Workflow A Worklet expands and executes the inner Workflow, if any child Workflows are there.
Non-Reusable Worklet : A Worklet which is created specific to the Workflow is known as Non-Reusable Worklet Created using "Workflow Desinger" tool and can be converted to Reusable Worklet.
Dimensions Star Schema : A Star Schema is a Database Design which contains a centrally located Fact Table which is radially surrounded by multiple Dimension Tables. Since the database design looks like a Star hence it is known as Star Schema. There two variants of Star Schema, that are : 1. Simple Star Schema 2. Complex Star Schema
Simple Star Schema : It contains only one Fact Table. Complex Star Schema : It contains more than one Fact Table Fact Table :
A Fact Table contains the Facts Facts are numeric but not every numeric is fact Facts are Business measures which are used to evaluate the performance of enterprise.
Dimension table :
A Dimension table is a descriptive data and text which describes the Key Performance indicators, knows as Facts. The related dimensions are organized in dimension tables A dimension table can organize the data in hierarchical format and dimension tables are in de-normalized A dimension can proved or answer the following business questions, "who, what , when, where"
Snowflake Schema :
A very large de-normalized dimensional tables are split into normalized dimensional tables
A dimensional table may have parents
Advantage : query performance increases
Disadvantage : more joins joins
Star Schema and Snowflake Schema Differences : Star Schema
Snowflake Schema
1. No dimension table have parents 2. De-normalized dimensions 3. Less Joins
1. The dimension table may have parents 2. Normalized dimensions 3. More Joins
Galaxy Schema : It is also known as Hybrid Schema (or) Complex Star Schema (or) Multi Star Schema (or) Integrated Schema (or) Constellation Schema
Fact Constellation : It is the process of joining two Fact Tables using constraints (Primary and Foreign Keys)
Conformed Dimensions : A dimension table which is shared or used by multiple Fact Tables is known as Conformed Dimensions.
Slowly Changing Dimensions (SCD) : A dimension that can capture the changes which takes place over the period of time is known as Slowly Changing Dimensions. There are three types of Slowly Changing Dimensions, that are : 1 .TYPE 1 Dimension : It captures only current data. It does not maintain history. 2. TYPE 2 Dimension : It captures complete historical data. For each update in OLTP it will insert a new record in the target. 3. TYPE 3 Dimension Dimension : It captures partial historical information in the target that is current and previous.
Installation and Practice PowerCenter Express - ETL Tool If your aim is to practice or learn ETL then PowerCenter Express is initially sufficient. Link to Download : https://community.informatica.com/solutions/pcexpress Registration is required to download it and Personal Edition is the free one. Click on "Resources" and open PowerCenter Express Installation and Upgrade Guide for the installation instructions. Tip : If your installation is on Windows then focus the instructions related only to Windows.
Informatica PowerCenter If your aim is to install PowerCenter Server, Clients etc to get a real time experience then a lot of installations have to made and much memory and disk space is needed -- BI Software, BI Applications, DAC Server, Oracle Clients etc. Well, It is not a bad idea to check the Sof tware once !
Link to Download : Registration of a normal Oracle Account is sufficient to download the software https://edelivery.oracle.com/ Below is about 64 bit Installation. Navigate to the below and click "Go" and then "Continue"
Scroll down or use Ctl+F to see the below and to start downloading
Tip :
In my case I extracted everything to my external hard disk first Create a directory, for example INFA, and the location or destination of all the extracting files should be only this Folder, that is INFA Extract one after the other Use better WinRar to extract the files
Some useful Links :
http://www.disoln.org/2012/09/Informatica-Powe http://www.disoln.org/2012/09/In formatica-PowerCenter-9-Installa rCenter-9-Installation-and-Con tion-and-ConfigurationfigurationComplete-Guide.html http://bloggingaboutoracle.com/installation-in http://bloggingaboutoracle.com /installation-informatica-powerc formatica-powercenter-9-0-1-on enter-9-0-1-on-oracle-oracleenterprise-linux-5-6/ https://community.oracle.com/thread/24 https://community.o racle.com/thread/2474504?tstart=0 74504?tstart=0