10.5.5 Datasets Produced from Import Processing ................................ 21
11. METADATA MANAGEMENT ........................................................ 21
11.1 SOURCE AND TARGET METADATA ................................................ 22
11.2 INTERNAL METADATA ......................................................... 22
12. STANDARD COMMON COMPONENTS ................................................. 22
12.1 JOB TEMPLATES ............................................................. 22
12.1.1 Import Jobs ............................................................. 23
12.1.2 Transform Jobs .......................................................... 23
12.1.3 Unload Jobs ............................................................. 24
12.2 CONTAINERS ................................................................ 24
13. DEBUGGING A JOB ............................................................ 25
14. COMMON ISSUES AND TIPS ..................................................... 25
14.1 1-WAY / N-WAY ............................................................. 25
14.2 DUPLICATE KEYS ............................................................ 26
14.3 RESOURCE USAGE VS PERFORMANCE ............................................. 27
14.4 GENERAL TIPS .............................................................. 28
15. REPOSITORY STRUCTURE ....................................................... 29
1.1 Objective

This document will serve as a source of standards for the use of the DataStage software as employed by the Dummy Transformation project. The standards described below will be followed by all developers. It is understood that this document, while setting the standards, may not cover every development scenario. In such cases the developer must contact the appropriate authority to seek clarification and ensure that such missing items are subsequently added to this document. It will therefore be an evolving document, updated continually to reflect the changing needs and thoughts of the development team, and hence will continue to represent best practice as the project progresses. The initial review and sign-off process will therefore be followed within this context.

1.2 Document Usage

This document describes the DataStage best practices to be applied to the Dummy Transformation project. It is intended to channel the general knowledge of DataStage developers towards the specific things they need to know about the Dummy project and the specific way jobs will be developed. It will be referenced by developers initially for familiarisation and as required during the course of the project. Use of the document will therefore reduce over time as developers become familiar with the practices described. The Offshore Build Manager will maintain the document (in collaboration with the development team, through weekly developer meetings) and will be responsible for distributing the document to developers (and explaining its content) initially and after updates have been applied, ensuring that the standards it describes are communicated and understood. Such communication will highlight the areas of change. The best practices will also form the basis for QA and peer testing within the development environment.

2. DATASTAGE OVERVIEW

DataStage is a powerful Extraction, Transformation, and Loading tool. DataStage has the following features to aid design and processing:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number or type of database.
• Handles all the metadata definitions required to define your data warehouse or migration. You can view and modify the table definitions at any point during the design of your application.
• Aggregates data. You can modify SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
3. DATASTAGE DEVELOPMENT WORKFLOW
3.1 Building and Testing Jobs

This section provides an overview of the DataStage job development process for the Dummy Transformation project. As detailed in the diagram below, there will be three environments: development, test and production. Within DataStage, a project is the entity in which all material related to a development is stored and organised. Development will use three projects through which code will move: Dummy_Dev, Version and Dummy_Promo. Developers will develop code in the Dummy_Dev project and, after unit testing it, promote it to the Version project, where version control will be managed. After base-lining the code, the DataStage administrator will collate all code in the Dummy_Promo project, from where the DMCoE will move it for unit and end-to-end testing on the Test server. Finally, the code will be moved by the DMCoE to production. Please refer to the Dummy Transform Code Migration Strategy document for further details.

[Diagram: code migration flow – Developer and Administrator roles working on the Development Server (including the Version DS project), with onshore DMCoE activity moving code to the Production Server (Ranch_Prod DS project).]
Each DataStage project is defined below.

3.1.1 Dummy_Dev Project
The Dummy_Dev project will be used by developers for building DataStage jobs and for unit testing. It will be mapped to a working directory on the UNIX DataStage server. Changes and defect fixes identified during unit testing will be documented and applied before the job is promoted to "Dummy_Promo" for integration testing.

3.2 Other DataStage Projects
Several further DataStage projects will be employed across the Development, Test and Production environments. Please refer to the Dummy Transform Code Migration Strategy document for further details.
4. DATASTAGE JOB DESIGN CONSIDERATIONS
4.1 Job Types

As per the diagram below, there will be three types of job within Transform: Import, Transform and Unload jobs. Source data with a complex file layout will be processed by these jobs in sequence to produce a target file in the format required by the Load team.
[Diagram: Hogan extract flowing through the Import, Transform and Unload jobs.]

4.1.1 Import Jobs

Import jobs will be the starting point for transformation. Sanity checks on the file and validation of external properties (e.g. size) will be done here. The source file will be read as per the source record layout. If there are any unwanted or bad records, the job will fail and the file must be corrected before restarting the job. The source data will then be filtered to process records, and unprocessed data will be maintained in a dataset for future reference. Finally, one or more datasets will be created which will be input to the actual transform process. See section 9 for further details of the action to be taken on failure or reject.
[Diagram: Import Job – reads Hogan and non-Hogan extract data; checks for zero-byte files and validates header and trailer details; reads the file in the specified format; creates output datasets of records to be processed; on error, writes error details to the Stats file and stops processing.]
4.1.2 Transform Jobs

Datasets created by import jobs will be processed by transform jobs. A transform job will join two or more datasets and look up data as per the functional design specification. Finally, the records will be split as per the destination file design and a destination dataset will be created. All data errors will be captured in an exception log for future reference.
A transform job may have a number of output datasets.

4.1.3 Unload Jobs

Unload jobs will take transform datasets as a source and create the final files required by the Load team in the given format.

[Diagram: Unload Job – reads data held in temporary datasets, unloads the data into the output file as per the required layout and provides the target data as flat files; data held for future jobs remains in datasets.]
5. USE OF STAGES

5.1 Combining Data
5.1.1 Join, Lookup and Merge Stages

The Join, Lookup and Merge stages combine two or more input links according to the values of key columns. They differ mainly in memory usage, treatment of rows with unmatched key values, and input requirements (i.e. sorted and de-duplicated). A brief description of when to use these stages is provided in the following table:
                                  Join              Lookup                           Merge
Type                              SQL-like          In-RAM Lookup Table              Master / Update
Memory                            Light             Heavy                            Light
Number of Inputs                  1 Left, 1 Right   1 Source, n Lookup Tables        1 Master, n Updates
Sort on Input                     All               None                             All
Duplicates on Primary Input       OK                OK                               Warning
Duplicates on Secondary Input(s)  OK                Warning                          OK (when n = 1)
Options on Unmatched Primary      None              Fail, Continue, Drop or Reject   Keep or Drop
Options on Unmatched Secondary    None              None                             Capture as Reject
Number of Output Links            1                 1 Out, 1 Reject                  1 Out, n Rejects
Captured on Reject                Nothing           Unmatched Primary Rows           Unmatched Secondary Rows
The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. If the datasets are larger than available resources, the Join or Merge stage should be used.

5.1.2 Aggregate Stage

The purpose of the Aggregator stage is to perform data aggregations. In order to do this, it is necessary to understand the key columns that define the aggregation groups, the columns to be aggregated and the kind of aggregation. Common aggregation functions include:

• Count
• Sum
• Mean
• Min / Max
Several others are available to process business logic; however, it is most likely that aggregations will be used as part of a calculation to determine the number of rows in an output table, for inclusion in header and footer records for unload files.

5.1.3 The Funnel Stage

The Funnel stage requires all input links to have identical schemas (column names, types and attributes, including nullability). The single output link matches the input schema.

5.2 Sorting
There are two options for sorting data within a job: on the input properties page of many stages (a simple sort), or using the explicit Sort stage. The explicit Sort stage has additional properties, such as the ability to generate a key-change column and to specify the memory usage of the stage.
5.3 Data Manipulation

5.3.1 Transformer Usage Guidelines
5.3.1.1 Choosing Appropriate Stages

The parallel Transformer stage always generates "C" code which is then compiled to a parallel component. For this reason, it is important to minimize the number of Transformers, and to use other stages (Copy, Filter, Switch, Modify etc.) when derivations are not needed. Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.

5.3.1.2 Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link, indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose "Convert to Reject". The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:

If ISNULL(link.col) Then… Else…

5.3.1.3 Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer. For example, TrimLeadingTrailing(string) works only if string is a VarChar field; thus the incoming column must be type VarChar before it is evaluated in the Transformer.

5.3.1.4 Optimizing Transformer Expressions and Stage Variables

In order to write efficient Transformer stage derivations, it is useful to understand what items are evaluated and when. The evaluation sequence is as follows:

• Evaluate each stage variable initial value
• For each input row to process:
  o Evaluate each stage variable derivation value, unless the derivation is empty
  o For each output link:
    1. Evaluate each column derivation value
    2. Write the output record
  o Next output link
• Next input row
The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are displayed. From this sequence, it can be seen that there are certain constructs that will be inefficient to include in output column derivations, as they will be evaluated once for every output column that uses them. Such constructs are:
Where the same part of an expression is used in multiple column derivations

For example, suppose multiple columns in output links want to use the same substring of an input column; then the following test may appear in a number of output column derivations:

IF (DSLINK1.col[1,3] = "001") THEN ...

In this case, the substring DSLINK1.col[1,3] is evaluated for each column that uses it. This can be made more efficient by moving the substring calculation into a stage variable. By doing this, the substring is evaluated just once for every input row. In this case, the stage variable derivation will be:

DSLINK1.col[1,3]

and each column derivation will start with:

IF (StageVar1 = "001") THEN ...

This example could be improved further by also moving the string comparison into the stage variable. The stage variable will be:

IF (DSLink1.col[1,3] = "001") THEN 1 ELSE 0

and each column derivation will start with:

IF (StageVar1) THEN ...

This reduces both the number of substring functions evaluated and the number of string comparisons made in the Transformer.

Where an expression includes calculated constant values

For example, a column definition may include a function call that returns a constant value, such as:

Str(" ",20)

This returns a string of 20 spaces. In this case, the function will be evaluated every time the column derivation is evaluated. It will be more efficient to calculate the constant value just once for the whole Transformer. This can be achieved using stage variables. This function could be moved into a stage variable derivation; however, in this case the function will still be evaluated once for every input row. The solution here is to move the function evaluation into the initial value of a stage variable. A stage variable can be assigned an initial value from the Stage Properties dialog / Variables tab in the Transformer stage editor. In this case, the variable will have its initial value set to:

Str(" ",20)
You will then leave the derivation of the stage variable on the main Transformer page empty. Any expression that previously used this function will be changed to use the stage variable instead. The initial value of the stage variable is evaluated just once, before any input rows are processed. Then, because the derivation expression of the stage variable is empty, it is not re-evaluated for each input row; therefore its value for the whole of the Transformer processing is unchanged from the initial value.

In addition to a function returning a constant value, another example would be part of an expression such as:

"abc" : "def"

As with the function call example, this concatenation is evaluated every time the column derivation is evaluated. Since this subpart of the expression is constant, it could again be moved into a stage variable, using the initial value setting to perform the concatenation just once.

Where an expression requiring a type conversion is used as a constant, or is used in multiple places

For example, an expression may include something like this:

DSLink1.col1+"1"

In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it must be converted from a string to an integer each time the expression is evaluated. The solution in this case is simply to change the constant from a string to an integer:

DSLink1.col1+1

In this example, if DSLink1.col1 were a string field, then a conversion would be required every time the expression is evaluated. If this appeared just once in one output column expression, this would be fine. However, if an input column is used in more than one expression where it requires the same type conversion, it is more efficient to use a stage variable to perform the conversion once. In this case, you would create, for example, an integer stage variable, specify its derivation to be DSLink1.col1, and then use the stage variable in place of DSLink1.col1 wherever that conversion would have been required. It should be noted that when using stage variables to evaluate parts of expressions, the data type of the stage variable should be set correctly for that context; otherwise needless conversions are required wherever that variable is used.
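The following sketch pulls these guidelines together (the link, column and stage variable names are hypothetical, not taken from the Dummy design documents). A null test and a repeated substring comparison are moved into one stage variable, and a constant string is held in the initial value of a second:

Stage variable svIsType001 – derivation:
    If ISNULL(DSLink1.col) Then 0 Else If DSLink1.col[1,3] = "001" Then 1 Else 0

Stage variable svSpaces20 – initial value Str(" ",20), derivation left empty.

Output column derivations then simply reference the stage variables, so the substring, the null test and the constant are each evaluated at most once per input row, for example:
    If svIsType001 Then DSLink1.desc Else svSpaces20

Actual derivations should of course follow the functional design specification for the job concerned.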
5.3.2 Modify Stage
The Modify stage is the most efficient stage available. Transformations that touch a single field, such as keep/drop, type conversions, some string manipulations, and null handling, are the primary operations which should be implemented using Modify instead of using Transform.
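As a minimal sketch (the column names are hypothetical, and the exact specification and conversion function names should be verified against the DataStage Parallel Job Developer Guide), typical Modify stage specification entries take forms such as:

    DROP audit_col
    account_id = handle_null(account_id, 0)
    account_name:string[30] = account_name

Here an unwanted column is dropped, a null is replaced with a default value and a column is converted to a fixed-length string – the kinds of single-field operations that would otherwise force an extra Transformer into the job.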
5.4 Transitioning Data
5.4.1 External Data

The External Source stage is a file stage used to read data that is output from one or more source programs. The stage calls the program and passes it appropriate arguments. The stage can have a single output link and a single reject link, and it can be configured to execute in parallel or sequential mode. This stage will typically be used in the 'Import' jobs to import the external data into parallel datasets to be processed by further 'Transformation' jobs.
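For illustration only (the script name, path and argument are hypothetical, not part of the Dummy design), the source program configured in the stage might be a command whose standard output supplies the records to be read, for example:

    /dummy/scripts/extract_source.sh <extract date>

The stage calls the command, and the job then parses its output according to the defined source record layout.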
5.4.2 Parallel Dataset
The Data Set stage is used to read data from or write data to a data set. The stage can have a single input link or a single output link, and it can be configured to execute in parallel or sequential mode. DataStage parallel extender jobs use data sets to manage data within a job. The Data Set stage can store the data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. These parallel datasets will be created from the external data by the 'Import' job, and will be created wherever intermediate datasets are needed for further single or multiple jobs to process. Due to the parallel nature of processing, the danger of bottlenecks during dataset creation is eliminated.

5.5 Unit Test
5.5.1 Copy Stage

The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or columns can be dropped or changed (to copy with modification – for example, changing column data types). This stage is more commonly used for debugging and testing purposes, where a copy of the data flowing from a particular stage can be isolated from the flow and analysed.

5.5.2 Peek Stage

The Peek stage can print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. This stage is used when only a specific column's data is to be analysed during unit testing, to validate whether the preceding transformation logic is working as desired.

5.5.3 Row Generator

The Row Generator stage is a Development/Debug stage that has no input links and a single output link. The Row Generator stage produces a set of mock data fitting the specified metadata. This is useful where you want to test your job but have no real data available, which may be a source file or a dataset produced by some other job whose development is also underway. More detail can also be specified about each data type if required, to shape the data being generated. For example: Type 'Cycle', specifying the 'Increment' value required; or Type 'Random', specifying what percentage of invalid/zero data is required.

5.5.4 Column Generator

The Column Generator stage is a Development/Debug stage that can have a single input link and a single output link. The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then output. This is used where not all of the columns' real data is available for testing; those columns need to be inserted with mock data fitting the specified metadata.

5.5.5 Manual XLS Generation
In addition to the ‘Row Generator’ and ‘Column Generator’ methods DataStage provides, mock data can also be created manually in an XLS file and then saved as a CSV file to be given as input to the DataStage job where this test data is required. These methods of data generation will be used extensively during Unit testing.
6. GUI STANDARDS
Job Description Fields updates – the description annotation is mandatory for each job. Note that the description annotation is the job short description. The full description should include the job version number, developer name, date and a brief reference to the design document (including the version number the job has been coded up to), plus the main job annotation and any modifications to the job. Where the job has not yet entered Version Control, the initial version should be referred to as 0.1. When using DataStage Version Control, the Full Description field in job properties is also used by DS Version Control to append revision history. This is packaged and maintained with the job and will be visible when the jobs are deployed to test, promo and production. This does not stop developers from using the Full Description as a method of maintaining the relevant documentation, but information maintained by the developer will be appended to by the Version Control tool.

Naming conventions must be enforced on links, transforms and source and target files.

Annotations are also used to further describe the functionality of jobs and stages. Two types of annotation are used: a blue job description (description annotation) and a yellow operator-specific description (standard annotation). The detailed description is also updated automatically by the DataStage Version Control process following the first initialization into Version Control. Entries put in the detailed description by Version Control must not be modified manually. Standard description annotations should be used on every non-trivial stage.
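As a purely illustrative sketch (the names, references and values shown are placeholders, not project specifics), a developer-maintained Full Description entry might take the following form:

    Version: 0.1
    Developer: <developer name>
    Date: <date>
    Design reference: <functional design document and version the job is coded to>
    Description: <brief summary of what the job does>
    Modifications: <none / summary of changes>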
7. NAMING CONVENTIONS

                                    Syntax
Import / Transform / Unload Job     jb_fdXX__
                                    Where XX is 01, 02 … 13, indicating the FD name;
                                    … indicates an Import Job
                                    … indicates a Transform Job
                                    … indicates an Unload Job
Job Sequence                        js___
                                    Where XX is 01, 02 … 13, indicating the FD name;
                                    … indicates an Import Job Sequence
                                    … indicates a Transform Job Sequence
                                    … indicates an Unload Job Sequence
Source Definition Category          source
Target Definition Category          target
Link*                               lnk__
                                    lnk_… is the name of the stage from which the link is coming out;
                                    … indicates the type of link (rej = reject, njn = non join, jn = join).
                                    If not applicable then this element will be dropped.