Data Warehousing Interview Questions and Answers

1. What is Data Warehousing?
A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. Data warehousing is the collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. A data warehouse is a repository of integrated information, available for queries and analysis.

2. What are the fundamental stages of Data Warehousing?
Offline Operational Databases – Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact the operational system's performance.
Offline Data Warehouse – Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.
Real Time Data Warehouse – Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery or a booking).
Integrated Data Warehouse – Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

3. What is Dimensional Modeling?
The dimensional data model concept involves two types of tables and is different from the third normal form. It uses a Fact table, which contains the measurements of the business, and Dimension tables, which contain the context (the dimensions of calculation) of the measurements.

4. What is a Fact table?
A fact table contains the measurements of the business process, and it also contains the foreign keys to the dimension tables. For example, if your business process is "paper production", then "average production of paper by one machine" or "weekly production of paper" would be considered measurements of the business process.

5. What is a Dimension table?
A dimension table contains the textual attributes of the measurements stored in the fact tables. A dimension table is a collection of hierarchies, categories and logic that the user can use to traverse hierarchy nodes.

6. What are the different methods of loading Dimension tables?
There are two different ways to load data into dimension tables.
Conventional (Slow): All the constraints and keys are validated against the data before it is loaded; this way data integrity is maintained.
Direct (Fast): All the constraints and keys are disabled before the data is loaded. Once the data is loaded, it is validated against all the constraints and keys.
If data is found invalid or dirty, it is not included in the index and all further processing is skipped for that data.
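For illustration, here is a minimal star-schema sketch in SQL. It is not part of the original answers: the table and column names (date_dim, machine_dim, paper_production_fact, tons_produced) are hypothetical and simply echo the paper-production example from question 4; exact data types vary by database.

    -- Dimension tables hold the textual/contextual attributes.
    CREATE TABLE date_dim (
        date_key     INTEGER PRIMARY KEY,   -- e.g. 20240131
        full_date    DATE,
        month_num    INTEGER,
        year_num     INTEGER
    );

    CREATE TABLE machine_dim (
        machine_key  INTEGER PRIMARY KEY,
        machine_name VARCHAR(50),
        plant_name   VARCHAR(50)
    );

    -- The fact table holds the measurements plus foreign keys to the dimensions.
    CREATE TABLE paper_production_fact (
        date_key      INTEGER REFERENCES date_dim(date_key),
        machine_key   INTEGER REFERENCES machine_dim(machine_key),
        tons_produced DECIMAL(10,2),        -- additive measure of the business process
        PRIMARY KEY (date_key, machine_key)
    );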
7. What is OLTP?
OLTP is an abbreviation of On-Line Transaction Processing. An OLTP system is an application that modifies data the instant it receives it and has a large number of concurrent users.
8. What is OLAP?
OLAP is an abbreviation of On-Line Analytical Processing. An OLAP system is an application that collects, manages, processes and presents multidimensional data for analysis and management purposes.
9. What is the difference between OLTP and OLAP?
Data Source
OLTP: Operational data; OLTP systems are the original source of the data.
OLAP: Consolidated data; OLAP data comes from various OLTP sources.
Process Goal
OLTP: A snapshot of the business processes that perform the fundamental business tasks.
OLAP: Multi-dimensional views of business activities for planning and decision making.
Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.
OLAP: Complex, long-running queries run by the system to update the aggregated data.
Database Design
OLTP: Normalized, small database. Speed is not an issue because of the smaller database size, and normalization does not degrade performance. OLTP adopts the entity-relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, large database. Speed is an issue because of the larger database size, and de-normalizing improves performance as there are fewer tables to scan while performing tasks. OLAP adopts a star, snowflake or fact constellation model and a subject-oriented database design.

10. Describe the foreign key columns in the fact table and dimension tables.
Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables are primary keys of dimension tables.

11. What is Data Mining?
Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information.

12. What is the difference between a view and a materialized view?
A view takes the output of a query and makes it appear like a virtual table, and it can be used in place of tables. A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.

13. What is an ER Diagram?
Entity Relationship Diagrams are a major data modeling tool and help organize the data in your project into entities and define the relationships between the entities. This process has proved to enable the analyst to produce a good database structure so that the data can be stored and retrieved in the most efficient manner. An entity-relationship (ER) diagram is a specialized graphic that illustrates the interrelationships between entities in a database. It is a type of diagram used in data modeling for relational databases. These diagrams show the structure of each table and the links between tables.

14. What is ODS?
ODS is an abbreviation of Operational Data Store. It is a database structure that is a repository for near real-time operational data rather than long-term trend data. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.

15. What is ETL?
ETL is an abbreviation of extract, transform, and load. ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it does not matter that the data is in different forms or formats; the data can come from any source, and ETL is powerful enough to handle such data disparities. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data, using rules or lookup tables, or creating combinations with other data, to convert it to the desired state. Finally, the load function is used to write the resulting data to a target database.
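To make the view/materialized-view distinction in question 12 concrete, here is a minimal Oracle-style sketch; the sales table and its columns are hypothetical, and the exact syntax differs in other databases.

    -- A view: the query is re-executed every time the view is referenced.
    CREATE VIEW monthly_sales_v AS
    SELECT region, TRUNC(sale_date, 'MM') AS sale_month, SUM(amount) AS total_amount
    FROM   sales
    GROUP  BY region, TRUNC(sale_date, 'MM');

    -- A materialized view: the query results are stored in a separate schema object
    -- and refreshed on demand, so repeated queries do not rescan the base table.
    CREATE MATERIALIZED VIEW monthly_sales_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT region, TRUNC(sale_date, 'MM') AS sale_month, SUM(amount) AS total_amount
    FROM   sales
    GROUP  BY region, TRUNC(sale_date, 'MM');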
16. What is VLDB?
VLDB is an abbreviation of Very Large Data Base. A one-terabyte database would normally be considered a VLDB. Typically, these are decision support systems or transaction processing applications serving large numbers of users.

17. Is the OLTP database design optimal for a Data Warehouse?
No. OLTP database tables are normalized, which adds time to queries before they return results. Additionally, an OLTP database is smaller and does not contain data for a long period (many years), which needs to be analyzed. An OLTP system is basically an ER model, not a dimensional model. If a complex query is executed on an OLTP system, it may cause heavy overhead on the OLTP server and affect the normal business processes.

18. If de-normalization improves data warehouse performance, why is the fact table in normal form?
The foreign keys of the fact table are the primary keys of the dimension tables. Because the fact table consists of columns that are keys to other tables plus the measures, it is by itself in normal form.

19. What are lookup tables?
A lookup table is a table placed on the target table based upon the primary key of the target; it updates the target by allowing only modified (new or updated) records through, based on the lookup condition.

20. What are Aggregate tables?
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of the dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table, which may have millions of records. An aggregate table reduces the load on the database server, increases the performance of the query and retrieves the result quickly.

21. What is real-time data warehousing?
Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.

22. What are conformed dimensions?
Conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. They are common to the cubes.

23. What is a conformed fact?
A conformed fact is a fact (measure) that has the same definition and units across multiple fact tables or data marts, so that it can be compared and combined across them.

24. How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. One hundred years may be represented in a time dimension, with one row per day.

25. What is the level of granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse, i.e. how much detail you are willing to store for each transactional fact.

26. What are non-additive facts?
Non-additive facts are facts that cannot be summed up over any of the dimensions present in the fact table. However, they are not useless; if there are changes in the dimensions, the same facts can be useful.

27. What is a factless fact table?
A fact table that does not contain numeric fact columns is called a factless fact table.
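Question 24 above describes loading the time dimension with a program that loops through all possible dates. As a minimal sketch (Oracle-style recursive WITH, reusing the hypothetical date_dim table from the earlier example), the same loop can be expressed in SQL:

    INSERT INTO date_dim (date_key, full_date, month_num, year_num)
    WITH calendar (d) AS (
        SELECT DATE '2000-01-01' FROM dual                        -- first date to generate
        UNION ALL
        SELECT d + 1 FROM calendar WHERE d < DATE '2000-12-31'    -- one row per day
    )
    SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),                     -- e.g. 20000101
           d,
           EXTRACT(MONTH FROM d),
           EXTRACT(YEAR FROM d)
    FROM   calendar;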
28. What are slowly changing dimensions (SCD)?
SCD is an abbreviation of Slowly Changing Dimensions. SCD applies to cases where the attributes of a record vary over time. There are three different types of SCD.
SCD1: The new record replaces the original record. Only one record exists in the database, holding the current data.
SCD2: A new record is added to the dimension table. Two records exist in the database, the current data and the previous historical data.
SCD3: The original record is modified to include the new data. Only one record exists in the database; the new information is stored alongside the old information in the same row.

29. What is a hybrid slowly changing dimension?
Hybrid SCDs are a combination of both SCD 1 and SCD 2. It may happen that in a table some columns are important and we need to track changes to them, i.e. capture their historical data, whereas for other columns, even if the data changes, we do not care.

30. What is a BUS Schema?
A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

31. What is a Star Schema?
A star schema is a way of organizing the tables such that we can retrieve results from the database quickly in the warehouse environment.

32. What is a Snowflake Schema?
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

33. What are the differences between star and snowflake schemas?
Star schema – A single fact table with N dimensions; all dimensions are linked directly to the fact table. This schema is de-normalized and results in simple joins, less complex queries and faster results.
Snowflake schema – Any dimension with extended (outrigger) dimensions is known as a snowflake schema; dimensions may be interlinked or have one-to-many relationships with other tables. This schema is normalized and results in complex joins, more complex queries and slower results.

34. What is the difference between ER Modeling and Dimensional Modeling?
ER modeling is used for normalizing the OLTP database design. Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

35. What is a degenerate dimension?
An attribute stored in the fact table that is neither a measure nor a foreign key to a dimension table (for example, an invoice or transaction number) is called a degenerate dimension.

36. Why is Data Modeling Important?
Data modeling is probably the most labor-intensive and time-consuming part of the development process. The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users. In computer science, data modeling is the process of creating a data model by applying a data model theory to create a data model instance. A data model theory is a formal description of a data model. When data modeling, we are structuring and organizing data. These data structures are then typically implemented in a database management system. In addition to defining and organizing the data, data modeling will impose (implicitly or explicitly) constraints or limitations on the data placed within the structure. Managing large quantities of structured and unstructured data is a primary function of information systems. Data models describe structured data for storage in data management systems such as relational databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio and video.

37. What is a surrogate key?
A surrogate key is a substitute for the natural primary key. It is simply a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. It is useful because the natural primary key can change, which makes updates more difficult. Surrogate keys are always integer or numeric.
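As a minimal sketch of the SCD Type 2 behaviour from question 28, combined with the surrogate key idea from question 37; the customer_dim table, its columns and the key values are hypothetical, and the new surrogate key would normally come from a sequence:

    -- Step 1: close off the current version of the changed customer row.
    UPDATE customer_dim
       SET end_date     = CURRENT_DATE,
           current_flag = 'N'
     WHERE customer_nk  = 1001          -- natural (business) key from the source
       AND current_flag = 'Y';

    -- Step 2: insert the new version with a fresh surrogate key.
    INSERT INTO customer_dim
        (customer_sk, customer_nk, customer_name, city, start_date, end_date, current_flag)
    VALUES
        (987654, 1001, 'ACME Ltd', 'Chicago', CURRENT_DATE, DATE '9999-12-31', 'Y');

After the insert, two rows exist for the same natural key, the historical row and the current row, exactly as the SCD2 definition describes.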
38. What is a Data Mart?
A data mart (DM) is a specialized version of a data warehouse (DW). Like data warehouses, data marts contain a snapshot of operational data that helps business people strategize based on analyses of past trends and experiences. The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data. A data mart configuration emphasizes easy access to relevant information. Data marts are designed to help managers make strategic decisions about their business.

39. What is the difference between OLAP and a data warehouse?
A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data: managing aggregations and partitioning information into cubes for in-depth visualization.

40. What is a Cube and a Linked Cube with reference to a data warehouse?
Cubes are logical representations of multidimensional data. The edges of the cube contain the dimension members and the body of the cube contains the data values. Linking cubes ensures that the data in the cubes remains consistent.

41. What is a junk dimension?
A number of very small dimensions may be lumped together to form a single dimension, a junk dimension; the attributes are not closely related. Grouping random flags and text attributes in a dimension and moving them into a separate sub-dimension is known as a junk dimension.

42. What is a snapshot with reference to a data warehouse?
You can disconnect a report from the catalog to which it is attached by saving the report with a snapshot of the data.

43. What is active data warehousing?
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively.

44. What is the difference between data warehousing and business intelligence?
Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of its business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term "business intelligence" is used to encompass OLAP, data visualization, data mining and query/reporting tools.

45. How can you do an incremental load in DataStage?
An incremental load is typically the daily load. Whenever you select data from the source, select only the records that were inserted or updated between the timestamp of the last successful load and the start date and time of today's load. To do this, pass those two dates as job parameters: store the last run date and time in a file, read it through a job parameter, and pass the current date and time as the second argument. (A SQL sketch of this selection appears after question 46.)

46. How do we use NLS in DataStage? What are the advantages of NLS, and where can we use it? Explain briefly.
By using NLS (National Language Support) we can do the following:
• Process data in a wide range of languages
• Use local formats for dates, times and money
• Sort data according to the local rules
If NLS is installed, various extra features appear in the product. For Server jobs, NLS is implemented in the DataStage Server engine; for Parallel jobs, NLS is implemented using the ICU library.
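The incremental selection described in question 45 can be sketched in SQL as follows; the orders table, its last_update_ts column and the two bind variables (the last successful load timestamp and the current load start timestamp, passed in as job parameters) are hypothetical:

    SELECT order_id, customer_id, amount, last_update_ts
    FROM   orders
    WHERE  last_update_ts >  TO_DATE(:last_load_ts,  'YYYY-MM-DD HH24:MI:SS')
      AND  last_update_ts <= TO_DATE(:load_start_ts, 'YYYY-MM-DD HH24:MI:SS');

Only the rows inserted or updated since the last successful load are picked up; everything older has already been loaded.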
47. What is APT_CONFIG in DataStage?
APT_CONFIG points to the configuration file (APT_CONFIG_FILE) through which DataStage understands the architecture of the system. For example, this file contains information about node names, disk and scratch-disk storage, etc.

48. How do we do the automation of dsjobs?
We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt. Put the call in a shell script and then invoke that shell script from any of the schedulers available on the market. The second option is to schedule these jobs using the DataStage Director.
49. What is troubleshooting in server jobs? What are the different kinds of errors encountered while running a job?
Troubleshooting DataStage server jobs involves monitoring the job log for fatal errors and taking appropriate action to resolve them. Various errors can be encountered while running DS jobs, for example:
a) ORA-01400 error (cannot insert NULL into a NOT NULL column)
b) Invalid user id or password; logon denied (from the OCI stage)
c) Dataset does not exist (parallel jobs)
d) A job may fail on a lookup failure, saying "lookup failed on a key column" (if the lookup stage is set to fail on lookup failures), etc.

50. Explain other performance tunings to increase the performance of slowly running jobs.
• Stage the data coming from ODBC/OCI/DB2 UDB stages or any database on the server using Hash/Sequential files, for optimum performance and also for data recovery in case the job aborts.
• Tune the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
• Tune the 'Project Tunables' in the Administrator for better performance.
• Use sorted data for the Aggregator.
• Sort the data as much as possible in the database and reduce the use of DS Sort, for better job performance.
• Remove unused data from the source as early as possible in the job.
• Work with the DB admin to create appropriate indexes on tables for better performance of DS queries.
• Convert some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
• If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.
• Before writing a routine or a transform, make sure the required functionality is not already available in one of the standard routines supplied in the SDK or DS utilities categories.
• Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
• Try to place constraints in the 'Selection' criteria of the jobs themselves. This eliminates unnecessary records before joins are made.
• Tuning should occur on a job-by-job basis.
• Use the power of the DBMS.
• Try not to use a Sort stage when you can use an ORDER BY clause in the database.
• Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
• Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
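As a small sketch of the last few tuning points (filter with SELECT … WHERE and sort with ORDER BY in the database instead of using constraints and a Sort stage in the job), with a hypothetical orders table and columns:

    SELECT customer_id, order_date, amount
    FROM   orders
    WHERE  order_date >= DATE '2024-01-01'    -- filtering pushed into the database
      AND  status = 'SHIPPED'
    ORDER  BY customer_id, order_date;        -- sorting pushed into the database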
51. What happens if RCP is disabled?
Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for the stages whose output connects to the shared container input, then metadata is propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, OSH has to perform an import and export every time the job runs, and the processing time of the job increases.
52. What are Routines, where/how are they written, and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines:
1) Transform functions
2) Before/after job subroutines
3) Job control routines

53. What is Version Control?
Version Control:
• stores different versions of DS jobs
• runs different versions of the same job
• reverts to a previous version of a job
• views version histories
In addition, DataStage offers version control that saves the history of all data integration development. It preserves application components such as table definitions, transformation rules and source/target column mappings within a two-part numbering scheme. Developers can review older rules and optionally restore entire releases, which can then be moved to remote locations.

54. What are the Repository Tables in DataStage and what are they?
A data warehouse is a repository (centralized as well as distributed) of data able to answer ad hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries and navigation services. Metadata includes things like the name, length, valid values and description of a data element. Metadata is stored in a data dictionary and repository; it insulates the data warehouse from changes in the schema of the operational systems. In DataStage, on the Interface tab (Input, Output and Transfer pages) you will have four tabs, and under the last one, Build, you can find the TABLE NAME. The DataStage client components are: Administrator, which administers DataStage projects and conducts housekeeping on the server; Designer, which creates DataStage jobs that are compiled into executable programs; Director, which is used to run and monitor the DataStage jobs; and Manager, which allows you to view and edit the contents of the repository.
55. How do you remove duplicate records from a file and capture the rejected records in a separate file?
Sort the records using the key field. In the Sort stage set "key change column = true"; zero will then be assigned to the duplicate records. Then add a condition: if the key-change value is zero, send the record to the reject link. Other than the Remove Duplicates stage, we can also use the Aggregator stage to count the number of records that exist for the key columns. If more than one record exists for the key column, they are considered duplicate records; using a Transformer we can set a stage variable COUNT and check whether COUNT > 1. If so, using a constraint, reject those duplicate records into the reject file.
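The duplicate-detection logic in question 55 can also be expressed in SQL with an analytic function; the staging_orders table and its columns are hypothetical:

    SELECT order_id, customer_id, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts) AS rn
    FROM   staging_orders;
    -- rn = 1  -> first occurrence of the key, keep the record
    -- rn > 1  -> duplicate of an earlier record, route it to the reject file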
56. I want to process 3 files sequentially, one by one. How can I do that, so that the files are fetched automatically while processing?
• Parameterize the file name.
• Build the job using that parameter.
• Build a job sequencer which will call this job and will accept the parameter for the file name.
• Write a UNIX shell script which will call the job sequencer three times, passing a different file each time.
• Alternatively, if the metadata is the same, append all the source files into a single file in Unix and use that file as the input for the DS job.
57. Where does a UNIX script of DataStage execute, on the client machine or on the server?
DataStage jobs are executed on the server machines only. There is nothing that is stored on the client machine.

58. What are the Job parameters?
There are two types of parameters we can create:
1. Job-level parameters: these parameters are only used in that particular job. You can create them in the Designer.
2. Project-level parameters: these parameters can be used in multiple jobs within that particular project. We have to create project-level parameters using the Administrator.
Job parameters are used for passing dynamic values at run time (while the job is running), such as a file name or path. At run time you can pass the file name, path or whatever you want, so your job becomes dynamic, taking those parameters and performing the work.

59. How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter for connecting between the two?
You need to configure ODBC connectivity for the database (DB2 or AS400) in DataStage.
60. What is the difference between server jobs and parallel jobs?
In server jobs we do not have an option to process the data on multiple nodes as we do in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas there is no such concept in server jobs. There are many differences in using the same stages in server and parallel jobs. For example, in a parallel job a Sequential File (or any other file stage) can have either an input link or an output link, but in a server job it can have both (and more than one of each).

61. What is the difference between DataStage and DataStage TX?
DataStage is an ETL tool; DataStage TX is an EAI tool. DataStage is used in data warehousing, while TX is used in EDI (Electronic Data Interchange). The applications of the two tools differ.

62. How can we improve the performance of DataStage jobs?
Performance and tuning of DS jobs:
1. Establish baselines.
2. Avoid using only one flow for tuning/performance testing.
3. Work in increments.
4. Evaluate data skew.
5. Isolate and solve.
6. Distribute file systems to eliminate bottlenecks.
7. Do not involve the RDBMS in initial testing.
8. Understand and evaluate the tuning knobs available.
63. What are the default nodes for the DataStage parallel edition?
The number of nodes actually depends on the number of processors in your system. If your system has two processors, we get two nodes by default.

64. Explain Orchestrate vs DataStage Parallel Extender.
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate parallel processing capabilities. Orchestrate was then acquired and integrated with DataStage XE, and a new version, DataStage 6.0, i.e. Parallel Extender, was released.
65. How can we join one Oracle source and a Sequential file?
The Join and Lookup stages can be used to join an Oracle source and a sequential file.

66. What are Modulus and Splitting in a Dynamic Hashed File?
In a hashed file, the size of the file keeps changing randomly. If the size of the file increases it is called "Modulus"; if the size of the file decreases it is called "Splitting".

67. What is a Batch Program and how can we generate one?
A batch program is used to run a batch of jobs by writing server routine code in the job control section. To generate a batch program, do the following:
a) Open the DataStage Director.
b) Go to Tools->Batch->New.
c) A new window will open with the "Job Control" tab selected.
d) Write the routine code and save it. You may run multiple jobs in a batch by making use of this.
68. What is the difference between DataStage Developers and DataStage Designers? What are the skills required for each?
A DataStage developer is the one who codes the jobs. A DataStage designer is the one who designs the jobs; he deals with the blueprints and designs the jobs, i.e. the stages that are required in developing the code.

69. What are the types of Hashed File?
Hashed files are classified broadly into 2 types:
a) Static - subdivided into 17 types based on the primary key pattern.
b) Dynamic - subdivided into 2 types: i) Generic ii) Specific.
The default hashed file is Dynamic - Type 30.

70. How do you eliminate duplicate rows?
In server jobs we can make use of the Hashed File stage to eliminate duplicate rows. In parallel jobs we can use the Remove Duplicates stage to eliminate duplicates. We can also use a Sort stage to eliminate duplicate records.

71. What is the DS Designer used for - did you use it?
You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.
72. How would you call an external Java function which is not supported by DataStage?
Starting from DataStage 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.
73. Why do we have to load the dimension tables first, then the fact tables?
In a dimensional model the fact tables are dependent on the dimension tables: the fact table contains foreign keys to the dimension tables. This is the reason the dimension tables are loaded first and then the fact table.

74. How do we create batches in DataStage from the command prompt?
From the command prompt, batches can be created in the following way:
a) Create a batch file, say RunbatchJobs.bat.
b) Open this file in Notepad.
c) Write the command "dsjob" with the proper syntax for each job you want to run.
d) If there are four jobs to be run in the batch, use the dsjob command four times, with a different job name on each line.
e) Save the file and close it.
f) Next time, whenever you want to run the jobs, just run the batch file RunbatchJobs.bat; all the jobs will be run one by one by the batch file.
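Finally, a small sketch tying back to question 73: the fact load joins the staging data to the already-loaded dimensions to pick up their surrogate keys, which is why the dimensions must be loaded first. The staging_production table is hypothetical; date_dim, machine_dim and paper_production_fact reuse the earlier sketch.

    INSERT INTO paper_production_fact (date_key, machine_key, tons_produced)
    SELECT d.date_key,                      -- surrogate keys looked up from the dimensions
           m.machine_key,
           s.tons_produced
    FROM   staging_production s
    JOIN   date_dim    d ON d.full_date    = s.production_date
    JOIN   machine_dim m ON m.machine_name = s.machine_name;

If the dimension rows did not exist yet, the joins would drop (or fail to resolve) the corresponding facts.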