Performance Tuning

Database Performance
- Try not to use a Sort stage when you can use an ORDER BY clause in the database.
- Using a constraint to filter a record set is much slower than performing a SELECT with a WHERE clause.
- Make every attempt to use the bulk loader for your particular database; bulk loaders are generally faster than ODBC or OLE.
- Tune the OCI stage's Array Size and Rows per Transaction values for faster inserts, updates, and selects.
- Sort the data as much as possible in the database and reduce the use of the DataStage Sort stage for better job performance.
- Work with the database administrator to create appropriate indexes on tables for better performance of DataStage queries.
- Use the native database stages.
File Performance
- If the source file is fixed width, the Readers Per Node option can be used to read a single input file in parallel at evenly spaced offsets.
- The environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance.
- Use a Data Set for any intermediate persistent-storage needs instead of the Sequential File stage: Data Sets read and write data in a native format and are easily administered through DataStage itself. Data Sets should not be edited or deleted directly from UNIX.
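As a sketch, the two buffer-size variables can be exported in the shell that launches the job. The values below are illustrative assumptions, not recommendations; both variables are expressed in kilobytes.

```shell
# Illustrative I/O buffer sizes (assumption: 1 MB suits this workload).
export APT_IMPORT_BUFFER_SIZE=1024   # per-file read buffer for import
export APT_EXPORT_BUFFER_SIZE=1024   # per-file write buffer for export
```

Larger buffers reduce the number of I/O calls at the cost of memory per player process, so any increase should be measured against the job's actual throughput.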
Configuration File

The degree of parallelism of a DataStage job is determined by the number of nodes defined in the configuration file. A configuration file with more nodes generates more processes, which adds processing overhead compared with a configuration file with fewer nodes. Therefore, when choosing a configuration file, weigh the benefits of increased parallelism against the losses in processing efficiency (increased overhead and slower start-up time). Ideally, use a configuration file with fewer nodes when the amount of data to be processed is small, and one with more nodes when the data volume is large.
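For illustration, a minimal two-node configuration file might look like the sketch below. The host name "etlhost" and the disk and scratch-disk paths are placeholders (assumptions), not recommended locations.

```shell
# Write a minimal two-node configuration file and point jobs at it.
# "etlhost" and the /data, /scratch paths are placeholders (assumptions).
cat > two_node.apt <<'EOF'
{
  node "node1" {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/ds/node1" {pools ""}
  }
  node "node2" {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/ds/node2" {pools ""}
  }
}
EOF
export APT_CONFIG_FILE="$PWD/two_node.apt"
```

Keeping several such files (one, two, four nodes, and so on) makes it easy to match the degree of parallelism to the data volume by switching APT_CONFIG_FILE per job run.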
Partitioning
- Proper partitioning of data is another aspect of DataStage job design that significantly improves overall job performance. Set partitioning so that the data flow is balanced: partitions should receive nearly equal amounts of data and skew should be minimized.
- Avoid unnecessary repartitioning of data.
Transformer

Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
- Always convert nullable columns to in-band values before using them in an expression or derivation.
- Always place a reject link on a parallel Transformer to capture and audit possible rejects.
Lookup
- Lookup is most appropriate when the reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage.
- Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller than the number of reference rows (for example, 1:100 or more), or to exception processing.
Join

Be particularly careful to observe the nullability properties of input links to any form of outer join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.
Aggregator
- Use the Hash method only when the number of distinct key-column values is small.
- Use the Sort method when the number of distinct key values is large or unknown.
Modify
- The Modify stage is the most efficient stage available; any transformation that can be implemented in Modify is more efficient there than in a Transformer.
- Transformations that touch a single field, such as keep/drop, type conversions, some string manipulations, and null handling, are the primary operations that should be implemented using Modify instead of Transformer derivations.
Sort
- If data has already been partitioned and sorted on a set of key columns, specify the "Don't sort, previously sorted" option for those key columns in the Sort stage.
- The stable-sort option is much more expensive than a non-stable sort and should be used only when row order must be maintained beyond what the sort itself requires.
- The performance of individual sorts can be improved by increasing the memory usage per partition with the Restrict Memory Usage (MB) option of the Sort stage; the default is 20 MB per partition. Note that sort memory usage can be specified only for standalone Sort stages; it cannot be changed for inline (on-link) sorts.
Sequence Job
- Always use the "Reset if required, then run" option in Job Activity stages.
- Enable checkpointing in the job properties.
Environment Variables

DataStage EE provides a number of environment variables to control how jobs operate on a UNIX system. In addition to providing required information, environment variables can be used to enable or disable various DataStage features, and to tune performance settings.
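As a sketch of how such variables are typically set before a run, the values and the configuration-file path below are illustrative assumptions, not recommended settings:

```shell
# Illustrative environment-variable settings (values are assumptions).
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt
export APT_BUFFER_MAXIMUM_MEMORY=3145728   # bytes of memory per buffer operator
export APT_DISABLE_COMBINATION=false       # leave operator combination enabled
```

In practice these are usually defined per project or per job through the Administrator and Designer clients rather than in a login shell, so the same job can run with different settings in each environment.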
Job Design
- Minimize the use of the Transformer stage; use Copy, Modify, Filter, or Row Generator instead where possible.
- Reduce the number of lookups in a job design.
- Minimize warnings.
- Use RCP (Runtime Column Propagation) whenever possible.
- Remove unneeded columns as early as possible within the job flow.
Architecture

DataStage supports grid architecture from version 8.1. Because a grid makes proper use of the hardware and speeds up processing, it is better to implement a grid architecture where possible.
Server Settings
- Ensure there is enough space on the scratch disk; otherwise, jobs using the Sort stage with large volumes of data might fail.
- Use the proper NLS settings.
- Take care with environment variables when moving from one environment to another: migrating the entire DSParams file can move development-environment settings into production, while migrating just the user-defined section can result in a corrupt DSParams file.

Performance Enhancements (8.5)
- Prefer the Connector stages (for example, Oracle Connector and DB2 Connector) to the Enterprise stages; they are the upgraded stages with better error handling.
- Because a Connector stage supports multiple input links, it can load multiple tables at a time.
- Use the Pivot Enterprise stage for vertical pivoting instead of stage variables in a Transformer.
- Use the Transformer for looping and aggregation.
- Use the new XML input and XML transformer stages: they can handle large and complex XML, and processing time is much faster.
Performance Monitoring

Job Monitor

The IBM InfoSphere DataStage job monitor can be accessed through the IBM InfoSphere DataStage Director.
Score Dump

In order to resolve performance issues, it is essential to understand the data flow within the jobs. To help understand a job flow, take a score dump by setting the APT_DUMP_SCORE environment variable to "true" before running the job.
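A minimal sketch of enabling the score dump from the shell; the project and job names in the commented run command are placeholders (assumptions):

```shell
# Enable the score dump; the score then appears in the job's log entries.
export APT_DUMP_SCORE=true

# Run the job as usual, for example via the dsjob CLI
# (project and job names below are placeholders):
# dsjob -run -jobstatus MyProject MyJob
```

The dumped score shows the operators, the processes they map to, and where the engine inserted partitioners, sorts, and buffers, which is exactly the information needed to spot unexpected repartitioning.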
Resource Estimation

With new sophisticated analytical information and a deep understanding of the parallel framework, IBM added Resource Estimation to DataStage (and QualityStage) 8.x. It can be used to determine the needed system requirements, or to analyze whether the current infrastructure can support the jobs that have been created.
Performance Analysis

IBM InfoSphere DataStage 8.x adds a new capability called Performance Analysis. It is enabled through a job property on the Execution tab, which collects data at job execution time.