1. Main differences between DataStage 7.5.2 and 8.0.1
A. In 7.5.2 we have the Manager client. In 8.0.1 there is no separate Manager client; it is embedded into the Designer client.
B. In 7.5.2 QualityStage has a separate designer. In 8.0.1 QualityStage is integrated into the Designer.
C. In 7.5.2 code and metadata are stored in a file-based system. In 8.0.1 code is file based whereas metadata is stored in a database.
D. In 7.5.2 we require only operating system authentication. In 8.0.1 we require operating system authentication and DataStage authentication.
E. In 7.5.2 we don't have range lookup. In 8.0.1 we have range lookup.
F. In 7.5.2 a single join stage can't support multiple references. In 8.0.1 a single join stage can support multiple references.
G. In 7.5.2, when a developer opens a particular job and another developer wants to open the same job, that job can't be opened. In 8.0.1 the second developer can open the same job as a read-only job.
H. In 8.0.1 a compare utility is available to compare 2 jobs, one in development and the other in production. In 7.5.2 this is not possible.
I. In 8.0.1 quick find and advanced find features are available; in 7.5.2 they are not.
J. In 7.5.2, the first time a job is run, surrogate keys are generated from the initial value to n; when the same job is compiled and run again, the surrogate keys are generated from the initial value to n again, so automatic increment of the surrogate key is not available. In 8.0.1 the surrogate key is incremented automatically; a state file is used to store the maximum value of the surrogate key.

1. Data Modeling
1) E-R Modeling (Entity-Relationship Modeling) - OLTP side
a) Logical Modeling: Logical modeling deals with gathering the business requirements and converting them into a model.
b) Physical Modeling: Physical modeling involves the actual design of a database according to the requirements that were established during logical modeling.
2) Dimensional Modeling - divided into 2 types:
a) Star Schema - simple, denormalized form; much faster compared to snowflake.
b) Snowflake Schema - more complex, with more granularity; more normalized form; slower.
2. Importance of Surrogate Key in Data warehousing?
Surrogate Key is the Primary Key in a Dimension table. It is independent of the underlying database, i.e. the Surrogate Key is not affected by the changes going on in the source database.

3. Differentiate between Database data and Data warehouse data?
Data in a Database is for OLTP:
a) Detailed or transactional
b) Both readable and writable
c) Current
d) Volatile
Data in a DWH is for OLAP:
a) For analysis & BI
b) We can only read from the DWH
c) Historical data
d) Non-volatile

4. What is the flow of loading data into fact & dimension tables?
First Data should be loaded into Dimension tables where the surrogate keys are generated and then to Fact tables. The surrogate keys are referenced as foreign keys in Fact tables.
5.
Orchestrate Vs DataStage Parallel Extender?
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate, integrated it with DataStage XE, and released a new version, DataStage 6.0, i.e. Parallel Extender.
6.
What are Stage Variables, Derivations and Constraints?
Stage Variable - An intermediate processing variable that retains value during read and doesn’t pass the value into target column (in case of comparison, if-else). We can also use this for computing the result which is useful in multiple columns of the target table(s). Constraints - Constraint is like a filter condition which limits the number of records coming from Input according to business rule. Derivation - Expression that specifies the value to be passed to the target column.
7.
DataStage Lookup types?
Normal Lookup: Whenever DataStage wants to look up, it places the reference table data into a buffer. It is used when the reference table or file contains less data.
Sparse Lookup: To look up, it fires a SQL query to the database for each row instead of placing the data into a buffer. To use a sparse lookup, the reference table must be much larger than the source data, and the reference must be a database table.
Range Lookup: This helps you search records based on a particular range. It searches only that particular range of records and gives good performance instead of searching the entire record set.

8. Explain about Error Handling in DataStage and best practices.
In a DataStage sequence there is an "Exception Handler" activity. When you are calling your jobs from a DataStage sequence you should do the following:
Step 1: Go to the job properties of the master sequence and check the checkboxes "Add checkpoints so sequence is restartable on failure" and "Automatically handle activities that fail".
Step 2: In your sequence use an Exception Handler activity. After the Exception Handler activity you may include an email Notification activity. If a job fails, control passes to the Exception Handler activity and an email is sent notifying the user that the sequence has failed.

9. What are the different types of links in DataStage?
There are 3 different links in DataStage:
1. Stream link, which means a straight link.
2. Reference link, which acts like a lookup.
3. Reject link, used in parallel jobs.

10. How to use an Excel file as input in DataStage?
You can use an Excel file as input by importing the .xls file.
Step 1: Go to Administrative Tools -> Data Sources (ODBC) -> System DSN. Click on the Add button and configure the corresponding .xls file in your System DSN. Make sure that the workbook contains the name of your Excel sheet.
Step 2: Import the Excel file into DataStage as an ODBC table definition.
Step 3: Use the ODBC stage as the input stage.
You should then be able to use the Excel file very effectively.
11.
What is the default cache size? How do you change the cache size if needed?
The default cache size is 128 MB. We can increase it by going into DataStage Administrator, selecting the "Tunables" tab and specifying the cache size there.

12. Differentiate Primary Key and Partition Key?
Primary key is the key that we define on a table column or set of columns (composite PK) to make sure that all the rows in that column or columns are unique.
Partition key is the key that we use while partitioning a table (in the database) for processing the source records in ETL. We should define the partition based on the stages (in DataStage) or transformations (in Informatica) we use in a job (DataStage) or mapping (Informatica). We partition the data to improve the target load process.
13. How to remove the locked jobs using DataStage?
Go to Director -> Tools -> "Clear the Job Resources" option; there you find the PID number. Select that PID and click Logout. Your job gets released.
14.
How do you execute DataStage job from UNIX command line prompt?
/opt/Ascential/DataStage/DSEngine/bin/dsjob -server $ServerName \
-user $User \
-password $ServerPassword \
-run -wait -mode NORMAL \
-param FmtDirectory=$FMT_PATH \
-param ReFmtDirectory=$REFMT_PATH \
-param Twin=$TWIN \
-param Group=$GROUP \
-param PrcsDte=$PROCDTE \
-param CmeInd=$CME_IND \
-param MndInd=$MND_IND \
IDL $DSSEQUENCE.${APPLCDE}
15.
What are types of Hashed File?
Hashed File is classified into 2 types.
a) Static - sub-divided into 17 types based on the primary key pattern.
b) Dynamic - the default hash file is dynamic, Type 30.
These are the three files that are created when we create a hash file: ".data", ".type" and ".over".
16. How to call a Stored Procedure which is in SQL Server database in DataStage job?
In ODBC stage properties, click on OUTPUTS -> General -> select "Stored Procedure" and browse to get the stored procedure. We can also use a Stored Procedure stage while designing parallel jobs.
17. Explain what SUMMARY TABLES are and their use?
Summary tables contain the summarized or "aggregated" data to tune up query performance.
Example: Suppose that we have a table which contains the following columns: a) Medicine_Name b) Sell c) Time.
Now the business requirement is to get the sales of medicines on a monthly basis. If a query is fired against this table, it will have to aggregate the medicine sales each time to get the monthly figures. Instead, if a summary table is created which contains the monthly sales records, the query cost will decrease, as the query will directly get the aggregated data from the summary table. In this scenario the summary table will contain the following columns: a) Medicine_Name b) Sell c) Month.
Hence, for the sales of all days of a month only one aggregated record will come into the summary table, i.e. for each month there will be one row containing aggregated data. This will increase the performance of the query.
18. Containers: Usage and Types? Container is a collection of stages used for the purpose of Reusability. There are 2 types of Containers. a) Local Container: Job Specific b) Shared Container: for any job within a project. Again 2 types. 1. Server shared container: Used in server jobs (can also be used in parallel jobs). 2. Parallel shared container: Used only in parallel jobs.
19. Where are the DataStage hash files stored when they are created?
There are two ways of specifying where the hash files will be created:
a) Account: This is generally the project name. If this is selected, the hash file will be generated in the hash file path specific to the project/account.
b) Folder location: If this option is selected, the hash file will be generated in the directory specified by the user.
20. How do you create a sequencer in DataStage Parallel Extender?
Sequencer is an activity in a DataStage sequence which will run the output job connected to it based on the status of the input jobs (ALL/ANY). Steps to create a sequence:
a) Go to the menu and select File -> New.
b) In the window that appears select "Job Sequence".
The other way of running jobs in a sequence is using the "Job Control" option in DataStage. For this:
a) Open a parallel job.
b) Go to Job Properties -> Job Control.
Here you can select the jobs which should run before this parallel job.
21. How and where do we use or integrate Java with DataStage?
Java can be integrated into DataStage using the Java Pack. There are two stages which come in the Java Pack:
a) Java Client: used for sources or targets.
b) Java Transformer: used for processing row by row.
Both stages allow you to use Java classes in your DataStage jobs.
22. Where does Director create its log files?
The log files are stored on the server; at the client side (Windows) the log files are not stored. The logs are maintained in RT_LOGn files on the DataStage server.
Deleting logs: You can set auto-purge of job logs from DataStage Administrator to delete the log files automatically.
23. Compare and contrast ODBC and Plug-In stages?
ODBC:
a) Poor performance.
b) Can be used for a variety of databases.
c) Can handle stored procedures.
Plug-In: a) Good Performance. b) Database specific. c) Cannot handle Stored Procedures.
24. Use of Terminator Activity?
A Terminator Activity stage can be used in a job sequence to ensure that the sequence is stopped cleanly if certain situations arise. You can have multiple terminator activities and can place them anywhere in the sequence. They are connected to other stages by triggers which specify when a terminator will be invoked.
25. Types of Parallel Processing?
DataStage offers 2 types of parallelism to take advantage of the underlying hardware:
1. Pipeline parallelism
2. Partition parallelism
Hardware-wise there are 3 types of parallel processing systems available:
1. SMP (Symmetric Multi-Processing): multiple CPUs, shared memory, single OS.
2. MPP (Massively Parallel Processing): multiple CPUs, each having its own set of resources (memory, OS, etc.), but physically housed in the same machine.
3. Clusters: same as MPP, but physically dispersed (not in the same box, and connected via high-speed networks).
26. How can you see and delete a dataset in UNIX and Windows?
You can use orchadmin dump <file>.ds to see the data in a dataset in UNIX, and orchadmin delete to remove it; note that removing only the .ds descriptor file with the UNIX rm -rf command leaves the underlying data files on the nodes, whereas orchadmin delete removes them as well.
In DataStage, delete a dataset using the "Dataset Management" option in Designer or Director.
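A minimal command sketch, assuming the orchadmin utility is on the PATH and APT_CONFIG_FILE points at your configuration file (the dataset name sample.ds and the paths are hypothetical):

export APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/default.apt
orchadmin describe sample.ds   # show the schema and partition layout
orchadmin dump sample.ds       # print the records to stdout
orchadmin delete sample.ds     # remove the descriptor and the data files on all nodes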
27. What is a normal view and a materialized view?
A view is a virtual table representing the result of a database query. Whenever an ordinary view's table is queried or updated, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach: the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive. A materialized view stores both the definition of the view and the rows resulting from its execution. The main disadvantage of a materialized view is that its contents get outdated when the underlying base tables are modified. You can refresh a materialized view using DBMS_MVIEW.REFRESH.
28. When do you use a separate Sort stage and the inbuilt sort utility in a stage? What is the difference?
29. How to check the number of nodes while running the job in UNIX?
30. A transformer stage is running in parallel. How can we run it in sequential mode?
Transformer properties -> Advanced -> Execution mode: there you can select the default/sequential/parallel mode of execution.
31. A sequential file has 4 duplicate records. How to get the 3rd one?
32. How can a server job be used in a PX job?
Through server shared containers. We can select the stages and define the functionality using them.
33. How can you run a job in sequential mode?
34. If a job compiles with "job completed successfully" but the Director log page shows warning messages, what will we do?
Generally we can ignore warning messages shown in Director. If you want to abort your job based on the number of warning messages, you can specify that in DataStage Director -> Tools -> Options -> Limits tab -> Rows/Warnings.
35. Explain the difference between Data Sets and File Sets? Where is each one used and how? What is the size limit?
The Data Set and File Set are both file stages.
A Data Set can have a single input link or a single output link, whereas a File Set can have a single input link or a single output link plus a single reject link. A File Set can execute only in parallel mode, whereas a Data Set can be configured to execute either in parallel or in sequential mode. The default size limit for both is 2 GB.
36. How to implement the scenario: add 10 days to a particular date in DataStage PX?
Extract the day part of the date using string functions, add the number of days you want, then assemble the parts back into a single string and convert the string into a date object.
37. What is DataStage DSTX?
DataStage Transformation Extender was developed, owned and marketed by Mercator (Mercator Software); e.g. version 6.5.2 is Mercator. Later Ascential Software purchased it and named it DataStage Transformation Extender (DSTX); e.g. version 7.5.1.2 is DSTX. Later still, IBM purchased Ascential Software and renamed the product IBM WTX (WebSphere Transformation Extender); e.g. version 8.1 is WTX. DSTX/WTX is a middleware used for complex transformations and is one of the major players in applications like EDI and SWIFT. It processes live data and is an EAI tool. Its major rival software is Gentran (GIS). Like DSPX, it has a designer where "maps" (the equivalent of jobs) are designed. Its metadata is defined in type trees using the Type Tree Maker.
38. Explain the difference between a server transformer and a parallel transformer, i.e. between the BASIC Transformer and the Normal Transformer? When do we go for the BASIC or the Normal Transformer?
The main difference is that the server Transformer supports BASIC transforms only, whereas the parallel edition supports both BASIC and parallel transforms. The server transformer has BASIC language compatibility; the parallel transformer has C++ language compatibility.
Transformer stage in server jobs vs Transformer stage in parallel jobs:
1. In server jobs it supports a single input stream link and multiple input reference links, whereas in parallel it supports only a single input stream link and no reference links.
2. In server jobs the lookup has to be done in the transformer stage, whereas in parallel there is a separate Lookup stage for this.
3. In parallel it has the capability to perform surrogate key generation as built-in functionality, whereas no such functionality is available in the server job transformer stage.
Difference:
1. A BASIC Transformer compiles in "BASIC" whereas a Normal Transformer compiles in "C++".
2. A BASIC Transformer does not run on multiple nodes whereas a Normal Transformer can run on multiple nodes, giving better performance.
3. A BASIC Transformer takes less time to compile than the Normal Transformer.
Usage:
1. The BASIC Transformer should be used in server jobs.
2. Normal Transformer should be used in Parallel jobs as it will run on multiple nodes here giving better performance.
39. Where do you specify nodes for DataStage jobs?
In DataStage Manager -> Tools -> Configurations there is a default configuration file named "default.apt"; there we can add any number of nodes, or we can add a new configuration file.
40. How to retrieve hash file data from the Administrator command line?
Open the DataStage Administrator and select your project. On the right hand side you will see a "Command" button; click on it and you will get a command interface window. Here you can write SELECT queries to retrieve data from a hash file. For example, if you have created a hash file called "TestHashFile", then to retrieve its contents you fire the following query:
select * from TestHashFile;
Note: In the Hashed File stage you have to select the option "Use Account Name" when creating the hash file.
41. What is the difference between the Filter and External Filter stages?
42. How to enable runtime column propagation in DataStage? Where does this option exist?
There is an option in DataStage Administrator -> Projects tab -> Properties -> General tab -> "Enable runtime column propagation for parallel jobs". If you enable this, you can select runtime propagation to specify that columns encountered by a stage in a parallel job can be used even if they are not defined explicitly in the metadata. You can see the runtime propagation option in most of the active stages, in the Output tab (if it exists), in the Columns sub-tab.
43. How can we filter in the sequential stage? How do you remove duplicates in a flat file?
44. How to remove duplicate records from a file and capture the rejected records in a separate file?
Sort the records on the key field. In the Sort stage set the key change column option to true; zero will then be assigned to the duplicate records. Then put a condition on the zero records and send them to the reject link.
45. What is the difference between a Filter and a Switch stage?
A Switch stage can have a maximum of 128 output links. A Filter stage can have any number of output links.
46. How can I handle the before/after job subroutine in the transformer stage?
In the Transformer stage click on the properties tab in the toolbar (left corner tab). It displays a stage properties page like the other job properties pages. There you can specify before/after stage subroutines.
47. How to load the top 3 marks (column name) from a sequential file to the target?
First sort the data on the Marks column using a Sort stage, then use a Transformer stage. In the constraints, use the system variable @INROWNUM (@INROWNUM <= 3). In this way you can get the top 3 marks. (OR) This can be achieved by Sort and External Filter stages: first sort the data in descending order, then use an External Filter stage and give the UNIX command head -3 in the box.
48. When do you use the Data Set stage / Lookup File Set stage / File Set stage in a parallel job? What is the significance of these files and what are their differences?
49. What is the difference between "job control" and a sequencer?
Job control: It just inserts the compilation code. Whenever you add a job it just places the executable code there; there is no diagrammatic representation.
Sequencer: A sequencer is a diagrammatic representation of activities where we can connect different types of objects through activities. There is no interaction with executable code here. A sequencer can apply control to jobs, i.e. we can run a job depending on a condition. We can run, schedule and monitor it like a normal job.
50. How to run a shell script within the scope of a DataStage job?
By using the "ExecSH" command in the before/after job properties: select the Edit menu -> Job Properties and, in the before-job/after-job subroutine fields, select ExecSH and supply the command.
In a sequence, we have the Execute Command activity, through which we can run shell scripts by providing the path and parameters for the script.
51. How to handle date conversions in DataStage? Convert an mm/dd/yyyy format to yyyy-dd-mm?
We use:
a) the "Iconv" function - internal conversion.
b) the "Oconv" function - external conversion.
The function to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(Fieldname, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
52. What does a config file in Parallel Extender consist of? (APT_CONFIG_FILE)
The config file consists of the following:
a) Number of processors or nodes.
b) Actual disk storage locations.
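As an illustration, a minimal two-node configuration file could be created from the shell as sketched below; the node names, fastname and paths are hypothetical and must match your own server and directories:

cat > /tmp/two_node.apt <<'EOF'
{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets/node1" {pools ""}
    resource scratchdisk "/data/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets/node2" {pools ""}
    resource scratchdisk "/data/scratch/node2" {pools ""}
  }
}
EOF
export APT_CONFIG_FILE=/tmp/two_node.apt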
53. Functionality of Link Partitioner and Link Collector?
Server jobs mainly execute in sequential fashion; the IPC stage, as well as the Link Partitioner and Link Collector, simulate a parallel mode of execution over server jobs on a single CPU.
Link Partitioner: It receives data on a single input link and diverts the data to a maximum of 64 output links; the data is processed by the same stage with the same metadata.
Link Collector: It collects the data from up to 64 input links, merges it into a single data flow and loads it into the target.
Both are active stages, and the design and mode of execution of server jobs have to be decided by the designer.
54. What are Modulus, Splitting and Merging in a Dynamic Hashed File?
In hash files, the size of the file keeps on changing randomly (according to the incoming data). Groups are numbered sequentially from 1; the number of groups is called the modulus. The operation of increasing the modulus is called splitting and decreasing it is called merging.
55. Types of views in DataStage Director?
There are 3 types of views in DataStage Director:
a) Status view - list of jobs, status, dates and time taken to run.
b) Schedule view - scheduling details of jobs.
c) Log view - log of the last few runs of each job, with warning messages, error messages, informational messages, event messages & program-generated messages.
56. What are the difficulties faced in using DataStage?
1. The DataStage Director job log error messages: it doesn't give you user-friendly error messages and informative messages.
2. We don't have many of the date functions that are available in Informatica or in traditional relational databases.
3. DataStage is unique in terms of function names; e.g. most databases or ETL tools use "UPPER" for converting lower case data to upper case, whereas DataStage uses "UCASE".
57. How do you eliminate duplicate rows?
1. DataStage provides us with a "Remove Duplicates" stage in Enterprise Edition; using that we can eliminate the duplicates based on a key column. We can also use the Sort stage (option: Allow Duplicates True/False).
2. The duplicates can be eliminated by loading the corresponding data into a hash file and specifying the column(s) on which you want to eliminate duplicates as the key(s) of the hash file.
3. SQL: delete from tablename a where rowid > (select min(rowid) from tablename b where a.key_values = b.key_values).
58. What are XML files, how do you read data from XML files, and what stage is to be used?
We can use the XML Meta Data Importer to import the XML source definition and the XML Input stage to read the XML document. For each and every element of the XML, we should give the XPATH expression in the XML Input stage.
59. Give the OS of the server and the OS of the client of your most recent project?
The server is on UNIX and the client machine, i.e. the machine where you design a job, is on Windows XP.
60. What is the Hashed File stage and what is it used for?
It is used for lookups; it is like a reference table. It can also be used in place of ODBC/OCI tables for better performance. We can also use the Hashed File stage to remove duplicate rows by specifying the hash key on a particular field or column.
61. Why do you use SQL*Loader or the OCI stage?
The OCI stage and SQL*Loader are used in server jobs; in PE the Oracle Enterprise stage is used. When the source data is enormous, or for bulk data extract/load, we use OCI or SQL*Loader depending on the source. If we have millions of records to load, we use Orabulk.
62. How do you pass the filename as a parameter for a job?
1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined. Here you can see a grid where you can enter the parameter name and the corresponding path of the file.
2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name which you defined above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.
63. How do you pass the parameter to a job sequence if the job is running at night?
Two ways:
1. Set the default values of the parameters in the job sequencer and map these parameters to the job.
2. Run the job/sequence using the dsjob utility, where we can specify the values for each parameter.
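For example, a nightly wrapper script could supply the values on the command line, as in the sketch below; the server name, credentials, project name, sequence name and parameter names are hypothetical:

/opt/Ascential/DataStage/DSEngine/bin/dsjob -server $ServerName \
-user $User -password $Password \
-run -wait \
-param LoadDate=`date +%Y-%m-%d` \
-param SourceDir=/data/incoming \
MyProject NightlySequence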
64. How do you catch bad rows from an OCI stage?
We place conditions like a WHERE clause inside the OCI stage, and the rejected rows can then be obtained.
65. What happens if the job fails at night?
The job sequence will abort, and an exception-handling job will send a mail to the corresponding member. You can define a job sequence to send an email using the Notification (SMTP) activity if the job fails. OR: log the failure to a log file using DSLogFatal/DSLogEvent from the job control tab in the job properties.
66. What is the order of execution done internally in the transformer with the stage editor having input links on the left hand side and output links on the right hand side?
1. First the left side tables are joined.
2. Then it executes the stage variables.
3. Then the constraints in the transformer.
4. Then the derivations on the columns.
67. How do you track performance statistics and enhance them?
You can right click on the server job and select the "View performance statistics" option. This will show the output as the number of rows per second when the job runs. From Director we can also view this with the "View monitor" option.
68. How will you call an external function or subroutine from DataStage?
You can call external functions/subroutines by using the before/after stage/job subroutines:
ExecSH if it is a UNIX machine,
ExecDOS if it is a Windows machine.
69. How do you merge two files in DS?
Use the Funnel stage to merge the two files in PE. Also, if you want to add a file to an existing data set / sequential file, you can write to the same data set / sequential file with the target set to APPEND mode.
70. What are Static Hash files and Dynamic Hash files?
Dynamic hash files can automatically adjust their size - modulus (number of groups) and separation (group size) - based on the incoming data. Type 30 is the default dynamic hash file being used. Static hash files do not adjust their modulus automatically and are best when the data is pre-known/static. Overflow groups are used when the data size is equal to or greater than the specified Large Record size in dynamic hash files. Since static hash files do not create hashing groups automatically, when a group cannot accommodate a row it goes to overflow. Overflow should be minimized as much as possible to optimize performance.
71. How do you handle reject data?
Typically a reject link is defined and the rejected data is loaded back into the data warehouse or into a sequential file. A reject link has to be defined for every output link for which you wish to collect the rejected data. Rejected data is typically bad data like duplicates of primary keys or null rows where data is expected.
72. Do you know about METASTAGE?
MetaStage is a repository in which you can store the metadata (DDLs etc.) and perform analysis on data dependencies, change impact etc.
73. Did you parameterize the job or hard-code the values in the jobs?
We always parameterized the jobs. Either the values come from Job Properties, through a shell script, or from a 'Parameter Manager' - a third-party tool. The often-parameterized variables in a job are: DB DSN name, username, password, and dates with respect to the data to be looked against.
74. Have you ever been involved in updating DS versions (like DS 5.X)? If so, tell us some of the steps you have taken in doing so?
Yes. The following are some of the steps I have taken:
1) Take a backup of the whole project(s) by exporting the project as a .dsx file.
2) See that you are using the same parent folder for the new version also, so that your old jobs using hard-coded file paths continue to work.
3) After installing the new version, import the old project(s); you will have to compile them again. You can use the 'Compile All' tool for this.
4) Make sure that all your DB DSNs are created with the same names as the old ones. This step is for moving DS from one machine to another.
5) In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.
6) Do not stop the 6.0 server before upgrading to version 7.0; the install process collects the project information during the upgrade. There is no rework (recompilation of existing jobs/routines) needed after the upgrade.
75. What other ETL's have you worked with?
ETL tools are: 1) Informatica by Informatica Corporation 2) DataStage by IBM 3) Ab Initio by Ab Initio Software Corporation 4) Data Junction by Pervasive Software 5) Oracle Warehouse Builder by Oracle Corporation 6) MS SQL Server Integration Services by Microsoft 7) Decision Stream / Data Manager by Cognos.
76. What is DS Designer used for - did you use it?
1. Design the server jobs & parallel jobs.
2. Create the local & shared containers.
3. Design the sequences.
4. Compile (single & multiple), validate and run the jobs.
77. What is DS Director used for - did you use it?
DataStage Director is used to validate, run, monitor and debug jobs (viewing the warnings & errors in the log file) and to schedule jobs.
78. What is DS Administrator used for - did you use it?
The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales. It is primarily used to create/delete DataStage projects, assign user roles to projects and set parameters for jobs at project level.
79. What is DS Manager used for - did you use it?
To manage the metadata, create/update routines and functions, and export/import jobs (.dsx or .xml) from Dev to Test and then to the production environment.
80. Explain the differences between Oracle 8i and 9i?
1. Oracle 8i does not support the SYSTIMESTAMP pseudo column, but 9i does.
2. In Oracle 8i we can create 256 columns in a table, but in 9i we can create up to 1000 columns.
81. What are routines, where/how are they written, and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them using the Routine dialog box. The following program components are classified as routines:
1) Transform functions: These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions, which are located in the Routines ➤ Examples ➤ Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box.
2) Before/After subroutines: When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, which are located in the Routines ➤ Built-in ➤ Before/After branch of the Repository. You can also define your own before/after subroutines using the Routine dialog box.
3) Job Control routines: used to control the running of other jobs from within a job.
4) Custom UniVerse functions: These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository; you specify the category when you create the routine.
82. What other performance tunings have you done in your last project to increase the performance of slowly running jobs?
Extraction side:
1. Staged the data coming from ODBC/OCI stages or any database on the server using hash/sequential files for optimum performance - the performance of hash/sequential files is good because it avoids many trips to the database server.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster selects, inserts and updates. Used row buffering (in-process and inter-process, for server jobs) to improve performance.
3. Tuned the 'Tunables' in Administrator for better performance for each project. The default cache size for hash files is 128 MB; we can modify it to improve performance.
4. Sorted the data as much as possible in the DB and reduced the use of DataStage sorts for better performance of jobs - use an ORDER BY clause.
5. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
6. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
7. Use the power of the RDBMS.
8. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
9. Using a constraint to filter records is much slower than performing a SELECT ... WHERE.
Transformations Side:
10. Used sorted data for the Aggregator stage.
11. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
12. Before writing a routine or a transform, make sure that the required functionality is not already in one of the standard routines supplied in the SDK or DS utility categories. Constraints are generally CPU-intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
13. Tuning should occur on a job-by-job basis.
Loading Side:
14. Stored data in sequential/hash files before loading to the database, for data recovery in case the job aborts.
15. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
83. Do you know about the INTEGRITY/QUALITY stage?
Integrity/QualityStage is a data integration tool from Ascential which is used to standardize/integrate data from different sources and cleanse it. QualityStage can be integrated with DataStage; in QualityStage we have many stages like Investigate, Match, Survivorship, etc., so that we can do the quality-related work and integrate it with DataStage. We need the QualityStage plug-in to achieve this task.
84. Why do we have to load the dimension tables first, then the fact tables?
As we load the dimension tables first, the surrogate keys (primary keys) are generated, which become foreign keys in the fact tables.
85. How did you handle an 'Aborted' sequencer?
By selecting the compilation options you can run the aborted sequence from the point where it was aborted. For example, you have 10 jobs (job1, job2, job3, etc.) in a sequence and job 5 aborted; then by checking "Add checkpoints so sequence is restartable on failure" and "Automatically handle activities that fail", you can restart this sequence from job 5 only. It will not re-run jobs 1, 2, 3 and 4. Please check these options in your sequence.
86. How can we implement Slowly Changing Dimensions in DataStage?
By using the Lookup stage and Change Capture stage we can implement SCDs. We have 3 types of SCD:
Type 1: maintains the current values only.
Type 2: maintains both current and historical values.
Type 3: maintains the current and partial historical values.
87. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
1. Under Windows: Use the 'Wait For File' activity in the sequencer and then run the job. You can schedule the sequencer around the time the file is expected to arrive.
2. Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.
88. Read the string functions in DS.
Functions like [] -> substring function and ':' -> concatenation operator.
Syntax: string[start, length], string[delimiter, instance, repeats]
89. How did you connect to DB2 in your last project?
Using DB2 ODBC drivers, the DB2 plug-in stage, or the Dynamic Relational stage.
90. What are Sequencers?
Sequencers are the job control programs that execute other jobs based on the present job parameters. A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes:
ALL mode: In this mode all the inputs to the sequencer must be TRUE for any of the sequencer output triggers to fire.
ANY mode: In this mode, output triggers can be fired if any of the sequencer inputs are TRUE.
91. How would you call an external Java function which is not supported by DataStage?
Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to files, and use those files as a source in a DataStage job.
92. What is the utility you use to schedule the jobs on a UNIX server, other than using Ascential Director?
1. Use the crontab utility along with dsjob (or the DSExecute routine), with the proper parameters passed.
2. AUTOSYS: through Autosys you can automate the job by invoking the shell script written to schedule the DataStage jobs.
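A hedged sketch of the crontab approach using dsjob; the wrapper script path, project name and sequence name are hypothetical, and dsenv/dsjob live under the DSEngine directory of the server install:

#!/bin/ksh
# /home/dsadm/scripts/run_nightly.sh - hypothetical wrapper script
. /opt/Ascential/DataStage/DSEngine/dsenv            # source the DataStage environment
$DSHOME/bin/dsjob -run -wait -jobstatus MyProject NightlySequence
exit $?

# crontab entry: run the wrapper every night at 01:30
30 1 * * * /home/dsadm/scripts/run_nightly.sh >> /home/dsadm/logs/nightly.log 2>&1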
93. What is the difference between a routine, a transform and a function?
1. A routine describes the business logic, and a transform specifies the transformation of the data from one location to another by using transformation rules.
2. By using routines we can return values, but from transforms we cannot return values.
94. Tell me one situation in your last project where you faced a problem and how you solved it?
1. The jobs in which data is read directly from ODBC/OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.
2. A job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which the job had aborted. To make sure the load was proper, we opted for the former.
95. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB or a DELETE statement?
There is no TRUNCATE on ODBC stages; it is 'Clear table', and that is a "delete from" statement. In an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (TRUNCATE requires you to have ALTER TABLE permission whereas DELETE doesn't). It simply clears the data in the table on each run.
96. What are the command-line functions that import and export the DS jobs?
A. dsimport.exe - imports the DataStage components.
B. dsexport.exe - exports the DataStage components.
Parameters: username, password, hostname, project name, current directory (e.g. C:/Ascential/DataStage7.5.1/dsexport.exe), filename (job name). An example invocation is sketched after question 97.
97. How do you rename all of the jobs to support your new file-naming conventions?
Create an Excel spreadsheet with new and old names. Export the whole project as a .dsx. Write a Perl/shell program which can do a simple rename of the strings by looking them up in the Excel file. Then import the new .dsx file, probably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs may also have to be changed in your job control jobs or sequencer jobs, so you have to make the necessary changes to those sequencers.
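A hedged sketch of such an export/import cycle from the Windows command line; the host, user, project and file names are hypothetical, and the switch names (/H, /U, /P, /NUA) are recalled from the 7.5 client tools and should be verified against your version's documentation:

cd C:\Ascential\DataStage7.5.1
dsexport.exe /H=etlhost /U=dsadm /P=secret MyProject C:\backup\MyProject.dsx
dsimport.exe /H=etlhost /U=dsadm /P=secret /NUA MyProject_Test C:\backup\MyProject_renamed.dsx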
98. Difference between Hash file and Sequential file?
1. A hash file stores the data based on a hash algorithm and a key value; a sequential file is just a file with no key column.
2. A hash file can be used as a reference for a lookup; a sequential file cannot.
3. A hash file can be stored in DS memory (buffer) but a sequential file cannot.
4. Duplicates can be removed in a hash file by specifying a key on a column; this is not true for a sequential file.
5. There is a limit of 2 GB for both hash files and sequential files.
99. How can we create containers?
There are two types of containers:
1. Local container: available to that particular job only.
2. Shared container: can be used anywhere in the project.
Steps to create a local/shared container:
Step 1: Select the stages required.
Step 2: Edit -> Construct Container -> Local/Shared.
Shared containers are stored in the Shared Containers branch of the tree structure. They are of 2 types:
1. Server shared containers (can also be used in parallel jobs).
2. Parallel shared containers (cannot be used in server jobs).
100. What are the job parameters?
A job parameter is a runtime parameter which allows you to enter a value to restrict the data at job run time. A job parameter is created with a name, prompt, data type and default value, and it is referenced as #parameter_name#.
101. How do you fix the error "OCI has fetched truncated data" in DataStage?
This error occurs when the Oracle stage tries to fetch a column value like 34.55676776... when its data type is actually decimal(10,2). The solution here is to either truncate or round the data to 2 decimal positions.
102. How can we join one Oracle source and a sequential file?
The Join and Lookup stages can be used to join an Oracle source and a sequential file.
103.
What about System variables?
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE: The internal date when the program started. See the Date function.
@DAY: The day of the month extracted from the value in @DATE.
@FALSE: The compiler replaces the value with 0.
@FM: A field mark, Char(254).
@IM: An item mark, Char(255).
@INROWNUM: Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM: Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME: The user login name.
@MONTH: The current month extracted from the value in @DATE.
@NULL: The null value.
@NULL.STR: The internal representation of the null value, Char(128).
@PATH: The pathname of the current DataStage project.
@SCHEMA: The schema name of the current DataStage project.
@SM: A subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE: Status codes returned by system processes or commands.
@TIME: The internal time when the program started. See the Time function.
@TM: A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE: The compiler replaces the value with 1.
@USERNO: The user number.
@VM: A value mark (a delimiter used in UniVerse files), Char(253).
@WHO: The name of the current DataStage project directory.
@YEAR: The current year extracted from @DATE.
104.
What's the difference between an operational data store (ODS) & a data warehouse?
An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitate operations, analysis and reporting. Because the data originates from multiple sources, the integration often involves cleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low-level or atomic (indivisible) data, such as transactions and prices, as opposed to aggregated or summarized data such as net contributions. Aggregated data is usually stored in the data warehouse.
The data in an ODS is volatile, whereas DWH data is non-volatile, historical and time-variant.
105.
What’s the difference between DataStage Developers and DataStage Designers? What are the skills required for this.
A DataStage developer is one who codes the jobs; a DataStage designer is one who deals with the blueprints and designs the jobs and the stages that are required in developing the code.
106.
How to create batches in DataStage from command prompt?
107.
How many places can you call routines?
You can call routines in four places:
(i) Transform of a routine: (A) date transformation (B) upstring transformation
(ii) Transform of the before & after subroutines
(iii) XML transformation
(iv) Web-based transformation
108.
Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3 and job 4). Job 1 has 10,000 rows, but after the run only 5,000 rows have been loaded into the target table, the remaining rows are not loaded and the job aborts. How can you sort out the problem?
Suppose the job sequencer synchronizes or controls 4 jobs but job 1 has a problem. In this condition you should go to Director and check what type of problem it is showing: a data type problem, a warning message, job failed or job aborted. If the job failed it may be a data type problem or a missing column action. You can go to the Run window -> Tracing -> Performance tracing, or in your target stage -> General -> Action, where there are two options: (i) On Fail: Commit / Continue and (ii) On Skip: Commit / Continue. First check how much data has already been loaded, then select the On Skip option and continue; for the remaining data that was not loaded select On Fail / Continue. Run the job again and it should finish successfully.
109.
What happens if RCP is disabled?
Runtime column propagation (RCP): If RCP is enabled for a job, and specifically for those stages whose output connects to the shared container input, then metadata will be propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, OSH has to perform an import and export every time the job runs, and the processing time of the job also increases.
110. What are the third-party tools used with DataStage?
Autosys, Control-M, TNG and Event Coordinator are some of them.
111.
Where does a UNIX script of DataStage execute, on the client machine or on the server?
DataStage jobs are executed on the server machines only.
112.
How can we pass parameters to a job by using a file?
You can do this by reading the parameter values from a UNIX file and then calling the execution of the DataStage job; the DS job has the parameters defined, and their values are passed in by the UNIX script.
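A hedged shell sketch of this approach; the parameter file format, file locations, project and job names are hypothetical, and the job is assumed to define parameters named LoadDate and SourceDir:

# params.txt contains one name=value pair per line, e.g.
#   LoadDate=2009-06-30
#   SourceDir=/data/incoming
PARAMS=""
while read line; do
    PARAMS="$PARAMS -param $line"        # build the -param name=value arguments
done < /home/dsadm/params.txt

. /opt/Ascential/DataStage/DSEngine/dsenv
$DSHOME/bin/dsjob -run -wait $PARAMS MyProject LoadJob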
113.
Default nodes for DataStage Parallel Edition?
The number of nodes depends on the number of processors in your system, usually 2**n.
114.
I want to process 3 files sequentially, one by one. How can I do that so that while processing, the files are fetched automatically?
If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with the different file names, OR you can create a sequencer that reuses the job, as sketched below.
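A minimal shell sketch of the looping approach using dsjob; the file names, project and job names are hypothetical, and the job is assumed to define an InputFile parameter:

. /opt/Ascential/DataStage/DSEngine/dsenv
for f in cust_01.txt cust_02.txt cust_03.txt; do
    $DSHOME/bin/dsjob -run -wait -jobstatus \
        -param InputFile=/data/incoming/$f \
        MyProject LoadCustomerFile || break    # stop if one run fails
done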
115.
What is 'insert for update' in DataStage?
There is a 'lock for update' option in the Hashed File stage, which locks the hashed file for updating when the search key of the lookup is not found.
116.
What are the Repository Tables in DataStage and what are they?
A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad-hoc, analytical, historical or complex query. Metadata is data about data: examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries and navigation services. Metadata includes things like the name, length, valid values and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. In DataStage I/O and Transfer, under the interface tab (input, output & transfer pages) you will have 4 tabs, and the last one is Build; under that you can find the TABLE NAME.
117. What is version control?
Version control is used to
i) store/view different versions of DS jobs
ii) run different versions of the same job
iii) revert to a previous version of a job
118.
What happens if the output of a hash file is connected to a transformer? What error does it throw?
If the hash file's output is connected to a Transformer stage, the hash file will be considered as a lookup file. If there is no primary link to the same Transformer stage, then this will be treated as a primary link itself.
You can implement SCD in server jobs by using this lookup functionality. This will not return any error code.
119. What is merging and how can it be done?
The Merge stage is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Consider two tables, Emp and Dept: if we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending/descending order, and join the two tables.
120.
How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter for just connecting between the two?
You need to configure the ODBC connectivity for any database in DataStage.
121. What is OCI, and how is it used by the ETL tools?
OCI is the "Oracle Call Interface" of Oracle, used to load the data. It is a kind of lowest-level interface to Oracle being used for loading the data.
122.
If data is partitioned in your job on key 1 and you aggregate on key 2, what issues could arise?
Rows with the same value of key 2 may end up in different partitions, so the aggregation can produce incorrect (partial) groups; the data has to be re-partitioned on key 2, which takes some more time to execute.
123.
What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate the unnecessary records from even getting in before joins are made"?
This means: try to improve performance by avoiding the use of constraints wherever possible, and instead filter while selecting the data itself using a WHERE clause. This improves performance.
124.
How can you do incremental load in DataStage?
Incremental load means the daily load. Whenever you are selecting data from the source, select the records which were loaded or updated between the timestamp of the last successful load and the start date and time of today's load. For this you have to pass parameters for those two dates: store the last run's date and time in a file, read it through a job parameter, and pass the current date and time as the second argument.
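A hedged shell sketch of the bookkeeping around a dsjob call; the file locations, project/job names and the FromDate/ToDate parameter names are hypothetical:

LAST_RUN_FILE=/home/dsadm/ctrl/last_run.txt
FROM_DATE=`cat $LAST_RUN_FILE`                 # timestamp of the last successful load
TO_DATE=`date '+%Y-%m-%d %H:%M:%S'`            # start of today's load

. /opt/Ascential/DataStage/DSEngine/dsenv
$DSHOME/bin/dsjob -run -wait -jobstatus \
    -param FromDate="$FROM_DATE" \
    -param ToDate="$TO_DATE" \
    MyProject IncrementalLoad \
&& echo "$TO_DATE" > $LAST_RUN_FILE            # advance the bookmark only on success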
125.
How can we ETL an Excel file to a data mart?
Open the ODBC Data Source Administrator found in Control Panel / Administrative Tools. Under the System DSN tab, add a DSN using the Microsoft Excel driver. Then you will be able to access the XLS file from DataStage.
126.
What is NLS in DataStage? How do we use NLS in DataStage and what are its advantages? At the time of installation I did not choose the NLS option; now I want to use it. What can I do: reinstall DataStage, or first uninstall and then install again?
NLS (National Language Support) is basically a local language setting (character set). Once you install DS with NLS you will get it. Just log into Admin and you can set the NLS of your project based on the requirement (you need to map the NLS with your project). Suppose you
know you have a file with some Greek characters: if you set the NLS for Greek, while running the job DataStage will recognize those special characters. By using the NLS function we can:
- process the data in a wide range of languages
- use local formats for dates, times and money
- sort the data according to the local rules
If NLS is installed, various extra features appear in the product. For server jobs, NLS is implemented in the DataStage server engine. For parallel jobs, NLS is implemented using the ICU library.
127.
What is APT_CONFIG in DataStage
DataStage understands the architecture of the system through the configuration file (APT_CONFIG_FILE). This file contains information about node names, disk storage locations, etc. APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.
128. What is the difference between server jobs & parallel jobs?
Server jobs: These are available if you have installed DataStage Server Edition. They run on the DataStage server, connecting to other data sources as necessary.
Parallel jobs: These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP, MPP or cluster systems. They can also run on a separate z/OS (USS) machine if required.
Server jobs can be run on SMP and MPP machines, but performance (speed) is lower; parallel jobs can exploit SMP, MPP and cluster machines, where performance (speed) is high. Server jobs generate DataStage BASIC; parallel jobs generate Orchestrate shell script (OSH) and C++; mainframe jobs generate COBOL and JCL. In server and mainframe jobs you tend to do most of the work in/using the Transformer stage; in parallel jobs you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many facilities, like the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.
129.
How is DataStage 4.0 functionally different from the Enterprise Edition now? What are the exact changes? There are lots of changes in DS EE: the CDC stage, the Stored Procedure stage, etc.
130.
What is the meaning of the following? 1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis. 3) Use the power of the DBMS.
If you have SMP machines you can use IPC, Link Collector and Link Partitioner stages for performance tuning. If you have cluster or MPP machines you can use parallel jobs. The third point is about tuning the performance of the job by using the power of the database: analyzing tables, creating indexes and creating partitions can improve the performance of the SQL used in the jobs.
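On Oracle, for example, that kind of database-side tuning can be as simple as refreshing optimizer statistics and indexing the columns used in the job's joins and filters; the schema, table and column names below are purely illustrative:

    -- refresh optimizer statistics so the database chooses a good execution plan
    EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT', tabname => 'SALES');

    -- index the column used in the job's WHERE clause / join condition
    CREATE INDEX sales_cust_idx ON sales (cust_id);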
131.
What are DataStage multi-byte and single-byte file conversions in mainframe jobs? How do we use those conversions in DataStage? What is UTF-8? What is its use?
132.
How to implement routines in DataStage?
Three kinds of routines are there in DataStage: 1. Server routines, used in server jobs, are written in the BASIC language. 2. Parallel routines, used in parallel jobs, are written in C/C++. 3. Mainframe routines, used in mainframe jobs, are written in COBOL/JCL.
133.
What are the validations you perform after creating a job in designer? What are the different types of errors you faced during loading and how did you solve them?
Check for the parameters, check whether the input files/tables exist or not, and also check the usernames, data source names, passwords, etc.
134.
How can you implement Complex Jobs in DataStage?
Complex design means having many joins and lookups. We can implement any complex design in DataStage easily by following a few simple tips that also help performance. There is no hard limit on the number of stages in a job, but for better performance use at most about 20 stages per job; if you need more, split the logic into another job. Likewise, don't use more than 7 lookups in a single Transformer stage; otherwise include one more Transformer stage.
135.
Can we use shared container as lookup in DataStage server jobs?
Yes, we can use a shared container as a lookup in server jobs. Wherever we use the same lookup in multiple places, we develop the lookup in a shared container and then reuse that shared container as the lookup.
136. Is it possible to run parallel jobs in server jobs?
No, you cannot run parallel jobs in the server engine. But you can run parallel jobs as if they were server jobs (on a single node) by changing the APT_CONFIG file and setting node = 1, or by setting a node map constraint to node1.
137. What is troubleshooting in server jobs? What are the different kinds of errors encountered while running any job?
138.
How can you implement slowly changing dimensions in DataStage? Can you join a flat file and a database in DataStage? How?
Yes, we can implement SCD in DataStage. Let me explain SCD Type 2 using a timestamp. Step 1: the timestamp is created via a shared container; it returns the system time and one key. To satisfy the lookup condition we create a key column using the Column Generator. Step 2: our source is a data set and the lookup table is an Oracle OCI stage. Using the Change Capture stage we find out the differences; the Change Capture stage returns a value in change_code. Based on the return value we know whether the row is an insert, edit or update. If it is an insert, we write it with the current timestamp, and the row with the old timestamp is kept as history.
Yes, we can join a flat file and a database in an indirect way. First create a job which populates the data from the database into a sequential file, say Seq_First. Take the flat file you have and use a Merge stage to join these two files. You have various join types in the Merge stage, like pure inner join, left outer join and right outer join; you can use whichever suits your requirements.
139.
Does Enterprise Edition only add the parallel processing for better performance? Are any stages/transformations available in the enterprise edition only?
• DataStage Standard Edition was previously called DataStage Server Edition.
• DataStage Enterprise Edition was originally called Orchestrate, and was renamed Parallel Extender when purchased by Ascential.
• DataStage Enterprise: server jobs, sequence jobs, parallel jobs. The Enterprise Edition offers parallel processing features for scalable high-volume solutions. Designed mainly for Unix, it now supports Windows, Linux and Unix System Services on mainframes.
• DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs. MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code and are transferred to a mainframe to be compiled and run. Jobs are developed on a Unix or Windows server and then transferred to the mainframe to be compiled and run.
The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation) but they tend to have different fields and options within that stage. Row Merger and Row Splitter are only present in parallel jobs.
140. What are OraBulk and BCP stages? ORABULK is used to load bulk data into a single table of a target Oracle database. BCP is used to load bulk data into a single table for Microsoft SQL Server and Sybase.
141.
If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?
It depends on the number of active stages and how they are linked. Only active stages create processes. For example, if there are 6 active stages (like Transformers) linked by some passive stages, the total number of processes will be 6 x 4 = 24.
142.
How can I specify a filter command for processing data while defining sequential file output data?
When you open a Sequential File stage to edit it, the General tab has the check box option 'Stage uses filter commands'. Select this, then go to the required page (Inputs/Outputs). On that page's General tab the Filter Command field becomes enabled; enter the required filter command there, such as gzip (compress) or gunzip (uncompress) on Unix. In short: Sequential File - edit | General tab - select 'Stage uses filter commands' | Inputs/Outputs page - Filter Command - enter/browse the required command.
143.
What is the User Variables activity? When is it used? How is it used? Where is it used?
By using the User Variables activity we can create variables in a job sequence. These variables are then available to all the activities in that sequence. This activity is usually placed at the start of the job sequence.
144.
Can you convert a snowflake schema into a star schema? Yes, we can convert it by attaching one hierarchy to the lowest level of another hierarchy.
145.
How to extract data from more than one heterogeneous source (for example a sequential file, Sybase and Oracle) in a single job?
You can convert all the heterogeneous sources into sequential files and join them using a Merge stage, or you can write a user-defined query in the source itself to join them.
146.
What is data set? And what is file set?
Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. You can use orchadmin dump <dataset>.ds to view a dataset in UNIX.
Fileset: a fileset is a mechanism DataStage uses to manage a collection of sequential files. The file collection metadata is in a fileset file with an .fs extension.
147.
What is the exact difference between Join, Merge and Lookup stages?
The three stages differ mainly in the memory they use. DataStage doesn't know how large your data is, so it cannot make an informed choice whether to combine data using a Join stage, a Merge stage or a Lookup stage. Here's how to decide what to use: if the reference datasets are big enough to cause trouble, use a Join. A Join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and never involves paging or other I/O. Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update links. Data selection is very easy in Join and more difficult in Merge. Join doesn't support reject links, whereas Merge supports reject links. Join takes its sorted inputs and produces a single output, whereas Merge takes a master and multiple update inputs and can have multiple outputs and reject links. So, performance-wise, Join is generally better than Merge. Lookup is used just for comparison purposes when the reference data is small.
148.
There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of using each type?
These are the three different types of stages: i) Custom ii) Build iii) Wrapped.
149.
DataStage from Staging to MDW (Master Data Warehouse) is only running at 1 row per second! What do we do to remedy it?
In general, if you have too many stages (especially Transformers and hashed-file lookups), there will be a lot of overhead and performance will degrade drastically. I would suggest you write a query instead of doing several lookups. If many lookups are being done, ensure that you have appropriate indexes for the queries. If you don't want to write the query and instead use intermediate stages, ensure that you eliminate unnecessary data between stages so that the data volume doesn't cause overhead; a re-ordering of stages may be needed for good performance. Other things in general that could be looked at: 1. For massive transactions, set the hashing size and buffer size to appropriate values so that as much as possible is done in memory and there is no I/O overhead to disk. 2. Enable row buffering and set an appropriate size for it. 3. It is important to use appropriate objects between stages for performance.
150.
How does the hashed file do a lookup in server jobs? How does it compare the key values?
A hashed file is used for two purposes: 1. to remove duplicate records, and 2. for reference lookups. The hashed file consists of three parts: each record has a hash key, a key header and a data portion. By using the hashing algorithm and the key value, the lookup is faster.
151.
What is the difference between DataStage and Informatica?
Informatica has more developer-friendly features, but when it comes to scalability and performance it is much inferior to DataStage. Here are a few areas where Informatica is inferior: 1. Partitioning - DataStage PX provides many more robust partitioning options than Informatica. You can also re-partition the data whichever way you want. 2. Parallelism - Informatica does not support full pipeline parallelism. 3. File lookup - Informatica supports flat file lookup, but the caching is poor. DataStage supports hash files, lookup file sets and datasets for much more efficient lookups. 4. Merge/Funnel - DataStage has very rich functionality for merging or funneling streams. In Informatica the only way is to do a Union, which by the way is always a union-all.
152.
What is the difference between build-ops and subroutines? A build-op (Build stage) generates C++ code (object-oriented), whereas a subroutine is a normal program that you can call anywhere in your project.
153.
If I add a new environment variable in Windows, how can I access it in DataStage? You can view all the environment variables in the Designer and check them in the job properties. You can add and access the environment variables from the job properties.
154.
Is it possible to call one job from another job in server jobs?
Yes, we can call one job from another job, although 'calling' is not really the right word: you attach/add the other job through the job properties. In fact, you can attach zero or more jobs. The steps are: Edit --> Job Properties --> Job Control, then click Add Job and select the desired job.
155.
What is the hashing algorithm? Explain briefly how it works.
Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and anchor point within the block. How well the algorithms work is closely related to statistical probability. These algorithms are usually quite simple and use division and remainder techniques. It is interesting to note that these approaches are called "Monte Carlo techniques", because the behavior of the hashing or randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the records.
156.
Is it possible to move the data from an Oracle warehouse to a SAP warehouse using DataStage? We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.
157. Is it possible to access the same job by two users at a time in DataStage? No, it is not possible to access the same job by two users at the same time. DS will produce the following error: "Job is accessed by other user". 158.
What are the differences between the data stage 7.0 and 7.5 in server jobs?
There are lots of new stages available in DS 7.5, for example the CDC stage, the Stored Procedure stage, etc. In server jobs we have the Stored Procedure stage, the Command stage, and a generate-report option in the File tab. In job sequences, stages like Start Loop activity, End Loop activity, Terminate Loop activity and User Variables activity were introduced. In parallel jobs the Surrogate Key stage and Stored Procedure stage were introduced.
159.
How much would be the size of the database in DataStage? What is the difference between Inprocess and Interprocess?
Regarding the database size, it varies and depends upon the project. In-process is the mode where the server transfers only one row at a time to the target, and inter-process means that the server sends a group of rows to the target table. Both of these settings are available on the Tunables tab of the Administrator client component.
160.
What are the Steps involved in development of a job in DataStage?
The steps required are:
1. Select the data source stage depending on the sources, e.g. flat file, database, XML.
2. Select the required stages for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator, Merge, etc.
3. Select the final target stage where you want to load the data, whether into a data warehouse, data mart, ODS or staging area.
161.
What is the difference between DRS and ODBC stage?
The DRS stage should be faster than the ODBC stage as it uses native database connectivity. You need to install and configure the required database clients on your DataStage server for it to work. The Dynamic Relational Stage was leveraged for PeopleSoft to allow a job to run on any of the supported databases; it supports ODBC connections too. ODBC uses the ODBC driver for a particular database, while DRS is a stage that tries to make it seamless to switch from one database to another. It uses the native connectivity for the chosen target. We use the DRS stage in parallel jobs.
162.
What are Profile Stage, Quality Stage and Audit Stage in DataStage?
The Profile stage is used to analyze the data and its relationships. The Quality stage is used to check the quality of the data. Audit stage: -------------
163.
What are the Environmental variables in DataStage?
An environment variable is a predefined variable that can be used while creating a DS job. We create/declare these variables in the Administrator and set their values there; while designing the job we then set the properties for these variables. For example, for a database username/password we declare the variables in the Administrator, set the values (e.g. scott and tiger), and reference them on the job's Parameters tab. Example: $APT_CONFIG_FILE. Like the above, we have many environment variables. Go to Job Properties and click "Add Environment Variable" to see more of them.
164.
What is the DataStage engine? What is its purpose? The DataStage server contains the DataStage engine. The DS server interacts with the client components and the repository. Jobs can be developed only when the engine is running.
165.
What is the purpose of exception activity in DataStage 7.5? The stages followed by exception exception activity will be executed executed whenever there is an unknown unknown error occurs while running the job sequencer.
166. 166.
What What is a proj projec ectt? Spe Speci cify fy its its var vario iou us com comp pone onents? nts?
You always enter into the DataStage through a DataStage project. project. When you start a DataStage Client you are prompted to connect to a project. Each project contains: jobs. 1. DataStage jobs. 2. Built-in Built-in compone components. nts. These These are predefine predefined d components components used used in a job. 3. User-defined components. These are customized components created using the DataStage Manager or DataStage Designer. Designer .
167.
What is the meaning of file extender in DataStage server jobs? Can we run one DataStage job from another job? File extender means adding columns or records to an existing file. Yes, we can run one DataStage job from another job in DataStage.
168.
How do you do Usage analysis in DataStage?
1. If you want to know whether some job is part of a sequence, then in the Manager right-click on the job and select Usage Analysis. It will show all the job's dependents. 2. To find how many jobs are using a particular table. 3. To find how many jobs are using a particular routine. Like this, you can find all the dependents of a particular object. It is nested: we can move forward and backward and see all the dependents.
169.
Purpose of using the key, and difference between surrogate key and natural key?
We use keys to provide relationships between the entities (tables). By using primary and foreign key relationships, we can maintain integrity of the data. The natural key is the one coming from the OLTP system. The surrogate key is the artificial key which we create in the target DW. We can use these surrogate keys instead of the natural key. In SCD2 scenarios surrogate keys play a major role.
170. Where are the flat files actually stored? What is the path? Flat files store the data, and the path can be given in the General tab of the Sequential File stage. Normally flat files are stored on FTP servers or in local folders. .CSV, .XLS and .TXT are common file formats for flat files.
171.
What is the difference between validated ok and compiled in DataStage?
When you compile a job, it ensures that basic things like all the important stage parameters have been set and the mappings are correct, etc., and then it creates an executable job. You validate a compiled job to make sure that all the connections are valid, all the job parameters are set, and a valid output can be expected after running the job. It is like a dry run where you don't actually play with live data but you are confident that things will work. When we say "validating a job", we are talking about running the job in "check only" mode. The following checks are made: connections are made to the data sources or data warehouse; SQL SELECT statements are prepared; files are opened; intermediate files in Hashed File, UniVerse or ODBC stages that use the local data source are created, if they do not already exist.
172.
What is the difference between Merge stage and Join stage?
Other than the full outer join (where more than two input links are not supported), all other join types support joining more than two inputs. 1) Number of reject links: Join does not support reject links; Merge has as many reject links as there are update input links (if there are n input links then one will be the master link and n-1 will be update links). 2) Data selection: in Join there are various ways in which data is selected, e.g. different join types like inner, outer (left, right, full), cross join, etc., so you have different selection criteria for dropping/selecting a row. In Merge, data in the master record and update records are merged only when both have the same value for the merge key columns.
173.
How to implement a Type 2 slowly changing dimension in DataStage? Give an example.
Slowly changing dimensions are a common problem in data warehousing. For example: there is a customer called Lisa in a company ABC who lives in New York. Later she moves to Florida, and the company must modify her address. In general there are three ways to solve this problem. Type 1: the new record replaces the original record; no trace of the old record remains. Type 2: a new record is added to the customer dimension table; the customer is therefore treated essentially as two different people. Type 3: the original record is modified to reflect the change. In Type 1 the new value overwrites the existing one, which means no history is maintained; the history of where she stayed before is lost, but it is simple to use. In Type 2 a new record is added, so both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, so storage and performance can become a concern. Type 2 should only be used if it is necessary for the data warehouse to track historical changes. In Type 3 there are two columns, one to indicate the original value and the other to indicate the current value; for example, a new column is added which shows the original address as New York and the current address as Florida. This keeps some part of the history and the table size is not increased, but one problem is that when the customer moves from Florida to Texas the New York information is lost. So Type 3 should only be used if the changes occur only a finite number of times.
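If the dimension is maintained with plain SQL rather than a dedicated stage, a Type 2 change for the example above might look roughly like the following sketch; the CUSTOMER_DIM table, its columns and the CUSTOMER_SKEY_SEQ sequence are illustrative names only:

    -- close off the current New York row for Lisa
    UPDATE customer_dim
    SET    end_date     = SYSDATE,
           current_flag = 'N'
    WHERE  customer_id  = 'LISA01'
    AND    current_flag = 'Y';

    -- insert a new row, with its own surrogate key, for the Florida address
    INSERT INTO customer_dim
           (customer_skey, customer_id, customer_name, city, eff_date, end_date, current_flag)
    VALUES (customer_skey_seq.NEXTVAL, 'LISA01', 'Lisa', 'Florida', SYSDATE, NULL, 'Y');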
174.
How can we call a routine in a DataStage job? Explain with steps.
Routines are used for implementing business logic. There are two types: 1) before subroutines and 2) after subroutines. Steps: double-click on the Transformer stage, right-click on any one of the mapping fields, select the [ds routines] option, give the business logic within the edit window, and select either of the options (before / after subroutines).
175.
What is job control? How is it developed? Explain with steps.
Job control means controlling DataStage jobs from some other DataStage job. Example: consider two jobs, XXX and YYY. Job YYY can be executed from job XXX by using DataStage macros in routines. To execute one job from another job, the following steps are followed in the routine: 1. Attach the job using the DSAttachJob function. 2. Run the job using the DSRunJob function. 3. Stop the job using the DSStopJob function. JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar, select 'Job Properties' and enter the parameters (parameter, prompt, type), for example STEP_ID, SRC, DSN, Username (unm) and Password (pwd), all of type string; after editing the above, set the JCL button, select the jobs from the list box and run the job.
176.
How can we create read only jobs in DataStage?
A job can be made read-only by the following process: export the job in the .dsx format and change the attribute which stores the read-only information from 0 (0 refers to an editable job) to 1 (1 refers to a read-only job). Then import the job again and either overwrite the existing job or rename it so that you have both forms.
177.
How to drop the index before loading data into the target and how to rebuild it in DataStage? This can be achieved with the "Direct Load" option of the SQL*Loader utility.
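Alternatively, assuming the target is Oracle, the same effect can be obtained with before/after SQL statements around the load; the index name below is just an example:

    -- before the load: make the index unusable so inserts skip index maintenance
    ALTER INDEX sales_fact_idx UNUSABLE;
    ALTER SESSION SET skip_unusable_indexes = TRUE;

    -- after the load: rebuild the index
    ALTER INDEX sales_fact_idx REBUILD;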
178.
What is the meaning of instance in DataStage? Explain with examples?
179. How to kill a job in DataStage? You should use kill -14 jobid. Sometimes using kill -9 jobid leaves things in a bad state.
180.
What is the difference between sequential file and a dataset? When to use the copy stage?
Sequential file: it acts as a source and as permanent storage for the target. Its extension is .txt. Dataset (.ds): it acts as a temporary storage stage; mainly it is used before the target stage. While using this stage the input data is partitioned and converted into the internal dataset format, so it is easy to load the data into the target stage. A dataset supports parallelism but a sequential file doesn't. Copy: it acts as a placeholder. It has a single input and many outputs. If you want to add a new stage to your job later it is very easy; otherwise you would have to modify the whole job.
181.
How to parameterize a field in a sequential file? I am using DataStage as the ETL tool and a sequential file as source. We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name in the Sequential File stage, e.g. #FILENAME#.
182.
What does the separation option in a static hash file mean?
The different hashing algorithms are designed to distribute records evenly among the groups of the file based on characters and their position in the record ids. When a hash file is created, Separation and Modulo respectively specify the group buffer size and the number of buffers allocated for the file. When a static hash file is created, DataStage creates a file that contains the number of groups specified by the modulo. Size of hash file = modulus (no. of groups) * separation (buffer size).
183.
What do you mean by active stages and passive stages in the DataStage environment? Active stage: it is the "T" of ETL. Passive stage: it is the "E & L" of ETL.
184. What is Runtime Column Propagation and how is it used? If your job has more columns than are defined in the metadata, and runtime column propagation is enabled, those extra columns are propagated through the rest of the job.
185.
How do you check the consistency and integrity of the model and repository?
186.
Give one real-time situation where the Link Partitioner stage is used.
If we want to send more data from the source to the targets quickly, we use the Link Partitioner stage in server jobs. We can make a maximum of 64 partitions, and this is an active stage. We can't normally connect two active stages, but it is accepted for this stage to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, at a maximum of 64 partitions; this is also an active stage. So, to avoid connecting an active stage (the Transformer) directly to the Link Collector, we use the Inter-Process Communication (IPC) stage; as this is a passive stage, the data can then be collected by the Link Collector. But we can use inter-process communication only when the target is a passive stage.
187. How do you clean the DataStage repository? Remove the log files periodically.
188.
Can both the source systems and the target data warehouse (maybe Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be in a UNIX/Linux environment?
In the server edition you can have both on Windows, but in PX the target should be on UNIX.
189. What is the difference between static hash files and dynamic hash files? A static hash file cannot change its number of groups except through manual resizing, whereas a dynamic hash file automatically changes its number of groups (modulus) in response to the amount of data stored in the file.
190. How can I schedule the cleaning of the file &PH& by dsjob? Create a job with a dummy Transformer and Sequential File stage. In the before-job subroutine, use ExecTCL to execute the following command: CLEAR.FILE &PH&.
191.
What is the difference between symmetric multiprocessing and massively parallel processing?
Symmetric Multi-Processing (SMP) - some hardware resources may be shared by processors. Processors communicate via shared memory and have a single operating system. Cluster or Massively Parallel Processing (MPP) - known as shared-nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed; the processors have their own operating systems and communicate via a high-speed network.
Symmetric Multi-Processing (SMP) is the processing of programs by multiple processors that share a common operating system and memory. SMP is also called "tightly coupled multiprocessing". A single copy of the operating system is in charge of all the processors running in an SMP system. This SMP methodology doesn't usually exceed 16 processors. SMP is better than MPP when online transaction processing is done, in which many users access the same database with a relatively simple set of common transactions. One main advantage of SMP is its ability to dynamically balance the workload among processors (and as a result serve more users at a faster rate).
Massively Parallel Processing (MPP) is the processing of programs by multiple processors that work on different parts of the program and have separate operating systems and memories. These different processors communicate with each other through message interfaces. There are cases in which up to 200 processors run for a single application. An interconnect arrangement of data paths allows messages to be sent between the processors running a single application or product. The setup for MPP is more complicated than for SMP: an experienced thought process should be applied when you set up MPP, and one should have good in-depth knowledge of how to partition the database among the processors and how to assign the work to them. An MPP system can also be called a "loosely coupled system". An MPP is considered better than an SMP for applications that allow a number of databases to be searched in parallel.
192.
How do I create a DataStage engine stop/start script?
A rough outline (run it as the dsadm user) is:

#!/bin/bash
# DataStage engine stop/start script - run as dsadm
DSHOMEBIN=/Ascential/DataStage/DSEngine/bin
# if there are still client connections, kill them
for PID in $(ps -ef | grep DataStage | grep -v grep | awk '{print $2}'); do
    kill -9 "$PID"
done
"$DSHOMEBIN/uv" -admin -stop  > /dev/null
"$DSHOMEBIN/uv" -admin -start > /dev/null
echo "Started properly"

193. How to find the process id? Explain with steps. You can find it in UNIX by using the ps -ef command. It displays all the processes currently running on the system along with their process ids.
194. How to find the number of rows in a sequential file? Use the row count system variable.
195.
How to improve the performance of hash file?
You can improve the performance of a hashed file by: 1. Pre-loading the hash file into memory; this can be done by enabling the pre-load options in the Hashed File output stage. 2. Write-caching options: data is written into a cache before being flushed to disk. You can enable this to ensure that hash files are written onto disk in order from the cache, instead of in the order in which individual rows are written. 3. Pre-allocating: estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.
196.
What is difference between ETL and ELT?
ETL usually scrubs (transforms) the data and then loads it into the data mart or data warehouse, whereas ELT loads the data first and then uses the RDBMS to scrub and reload it into the data mart or DWH. ETL = Extract >>> Transform >>> Load. ELT = Extract >>> Load >>> Transform.
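As a small illustration of the ELT style, the raw data is first loaded into a staging table and the transformation is then pushed down to the database; STG_SALES and DW_SALES are hypothetical tables used only for this sketch:

    -- the 'T' of ELT happens inside the RDBMS, after the raw load into staging
    INSERT INTO dw_sales (sale_id, sale_date, amount_usd)
    SELECT sale_id,
           TO_DATE(sale_date_txt, 'YYYYMMDD'),
           amount_cents / 100
    FROM   stg_sales
    WHERE  amount_cents IS NOT NULL;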
197.
What is the transaction size and array size in OCI stage?
Transaction size - the transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab of the Input page. Rows per transaction - the number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written and then committed in a single transaction. Array size - the number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.
198.
What is the difference between OCI stage and ODBC stage?
Oracle OCI: · We can write the source query in this stage but we can't write a lookup query in it; instead we use a Hashed File stage for the lookup. · We have the facility to write multiple statements before (Oracle OCI/Output/SQL/Before) or after (Oracle OCI/Output/SQL/After) executing the actual query. · We don't have a multi-row lookup facility in this stage.
ODBC: · We can write both the source query and the lookup query in this stage itself. · We do not have the facility to write multiple before/after statements in this stage. · We have the multi-row lookup facility in this stage.
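As an example of the "before" SQL slot mentioned for the OCI stage above, it is often used to prepare the target before the main statement runs; the table name here is purely illustrative:

    -- before-SQL: clear today's partial load so the job can be rerun safely
    DELETE FROM stg_orders
    WHERE  load_date = TRUNC(SYSDATE);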
199.
Can you tell me for what purpose .dsx files are used in the DataStage?
.dsx is the standard file extension for exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in the .dsx format. A standard usage is that we develop the job in our test environment and, after testing, we export the file and save it as x.dsx. This can be done using the DataStage Manager.
200.
How can we generate a surrogate key in server/parallel jobs?
In parallel jobs we can use surrogate key generator stage. In server jobs we can use an inbuilt routine called “KeyMgtGetNextValue”. You can also generate the surrogate key in
the database using the sequence generator.
201.
If we have two sources with the same metadata, how do we check whether the data in the two sources is the same or not? If the data is not the same, I want to abort the job. How can we do this?
Use a Change Capture stage and output it into a Transformer. Write a routine to abort the job, invoked from a function when @INROWNUM = 1. So if the data does not match, a row is passed into the Transformer and the job is aborted.
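When both sources happen to be database tables, an equivalent check can also be done with a set-difference query before running the load; this is only an alternative to the Change Capture approach above, and SOURCE_A / SOURCE_B are illustrative table names:

    -- rows present in one source but not the other; zero rows returned means the data matches
    (SELECT * FROM source_a MINUS SELECT * FROM source_b)
    UNION ALL
    (SELECT * FROM source_b MINUS SELECT * FROM source_a);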
202.
How do you call procedures in DataStage? Use the Stored Procedure Stage.
203.
What is the difference between a reference link and a straight link?
A straight link is one where data is passed directly to the next stage, and a reference link is one that indicates a reference (lookup key) to the main table.
204. Disadvantages of the staging area? The disadvantage of a staging area is disk space, as we have to dump data into a local area.
205.
What is the difference between DataStage and DataStage Scripting?
206.
What are the new features of DataStage 7.1 from DataStage 6.1?
207. How to write and execute routines for PX jobs in C++? You define and store the routines in the DataStage repository (e.g. in the Routines folder), and these routines are compiled and executed by a C++ compiler.
208.
1. What are the various processes which start when the DataStage engine starts? 2. What changes need to be done on the database side if I have to use the DB2 stage? 3. Is the DataStage engine responsible for compilation or execution or both?
There are three processes that start when the DataStage engine starts: 1. DSRPC 2. DataStage Engine Resources 3. DataStage Telnet Services.
209.
How do we use the DataStage Director and its run-time engine to schedule running the solution, test and debug its components, and monitor the resulting executable versions on an ad hoc or scheduled basis?
210. Where can you output data using the Peek stage? The output of the Peek stage can be viewed in the DataStage Director log and can also be saved as a text file.
211. Why is a hash file faster than a sequential file and the ODBC stage? Hash files are stored on the local server and can be cached in memory.
212.
What is ad-hoc access? What is the difference between managed query and ad-hoc access?
213.
1. Difference between hash file and sequential file? What is modulus?
2. What are Iconv and Oconv functions?
3. How can we join one Oracle source and a sequential file?
4. How can we implement slowly changing dimensions in DataStage?
5. How can we implement a lookup in DataStage server jobs?
6. What are the third-party tools used in DataStage?
7. What is the difference between a routine, a transform and a function?
8. What are job parameters?
9. What is a plug-in?
10. How can we improve the performance of DataStage jobs?
11. How can we create containers?
12. What about system variables?
13. What is the use of usage analysis?
14. Different ways a project can be moved to production? Ans: export-import and version control.
15. What database does the DataStage repository use? Answer: Universe database.
16. How is the scheduling done in the project?
17. Which version of DataStage is used in the project?
18. What is the performance tuning required while dealing with large data?
19. What does the reject option in the transformer do?
20. What is the architecture of DataStage?
21. How do you define and use the job parameters?
22. What are stage variables, system variables and environment variables?
23. How to use routines in DataStage?
24. What is the difference between a shared container and a local container?
25. How do you connect to Oracle?
26. Please explain any ETL process that you have developed?
27. What is a hash file? Types of hash files.
28. If you make any changes in a shared container, will they reflect in all the jobs wherever you used this shared container?
29. Have you written any custom routines in your project? If so, explain.
30. How do you get log info into a file?
31. What is a before-job subroutine / after-job subroutine? When do you use them?
32. How do you back up and restore the project?
33. What is Clear Status File and when do you use it?
34. What is Cleanup Resources and when do you use it?
35. Can I join a flat file and Oracle and load into Oracle? Is this possible?
36. While loading some data into the target, suddenly there is a problem and the loading process stops; how can you start loading from the records that were left?
37. What are the general problems that you face in DataStage?
38. What are the various reports that could be generated using DataStage?
39. How to remove blank spaces from data?
40. What are active and passive stages?
41. What are all the stages you have used in your project?
42. Could DataStage generate test cases?
43. What is the difference between a hash file and a sequential file?
44. What is the difference between a transform and a routine?
45. What is a sequencer?
These are some DataStage PX job questions which can be asked in interviews:
1) Types of parallel processing
2) What are SMP (Symmetric Multi Processing) and MPP (Massively Parallel Processing)?
3) What are CPU-limited, memory-limited and disk I/O-limited jobs?
4) Can one combine pipeline & partition parallelism?
5) Advantages of PX over server jobs
6) Is it possible to create a user-defined stage in PX?
7) Can I use a hash file in PX?
8) What is the surrogate key stage?
9) What is the use of APT_DUMP_SCORE? Ans: to get messages in the logs such as the number of processes and number of nodes used.
10) What are the four types of joins possible in the Joiner stage? Ans: inner, left outer, right outer, full outer.
11) What are the components of APT_CONFIG_FILE? Ans: nodes, fast node, pools, resources.
12) What are the points that need to be considered while creating the config file? Ans: available nodes, CPU time, available memory, what other processes will be executed on the same nodes, whether there are any configuration restrictions (e.g. the DB only runs on certain nodes and ETL cannot run on them), a breakdown of the resource usage, and whether the hardware configuration is SMP, cluster or MPP.
13) When are wrappers created? Ans: only for executable commands for UNIX or DOS.
14) When are buildops created? Ans: when more functionality or complex logic is needed.
15) When are custom stages created? Ans: when new operators are needed which are not in EE.
16) What are the different job sequencer stages?
17) What are Iconv and Oconv functions?
18) Can we implement slowly changing dimensions in DataStage? What are job parameters?
19) What does the reject option in the transformer do?
20) How is parallelism executed?
21) What is RCP?
22) Difference between Join, Merge and Lookup stages?
23) What is a dataset?
24) Differences between dataset, fileset and lookup file set?
Questions on data warehousing concepts:
1. What is a data warehouse?
2. What is the difference between a data warehouse and a data mart?
3. What is a star schema?
4. What is a snowflake schema?
5. What are facts and dimensions?
6. What is a surrogate key?
7. What is normalization? Explain 3rd normal form.
8. What is the difference between OLTP and OLAP?
9. Are you involved in data modeling? If yes, which tool/technique are you using?
10. Which schema modeling techniques have you used?
11. What do you mean by a summary table?
12. What are degenerate dimensions?
13. What is a factless fact? Fact tables that contain no measures or facts are called "factless fact tables" or "junction tables".
Oracle questions based on the data warehouse:
1. What is parallel execution?
2. What are bitmap and B-tree indexes? Explain local vs. global indexes.
3. What is a materialized view?
4. What is page size/array size in Oracle?
5. What are integrity constraints?
6. How can one tune SQL in Oracle?
214.
How can we Test jobs in DataStage?
Testing of jobs can be performed at many different levels: unit testing, SIT (System Integration Testing) and UAT (User Acceptance Testing) phases. Testing basically involves functionality and performance tests. First, data for the job needs to be created to test the functionality. By changing the data we see whether the requirements are met by the existing code. Each iteration of code change should be accompanied by a testing iteration. Performance tests basically involve load tests to see how well the existing code performs in a finite period of time. Performance tuning can be performed on the SQL, the job design, or the BASIC/OSH (Orchestrate Shell) code for faster processing times. In addition, all job designs should include error correction and failover support so that the code is robust.
215.
What is the Pivot stage? Why is it used? For what purpose is that stage used?
The Pivot stage is used to turn horizontal columns into vertical rows.
Source table:
ProdID  Q1_Sales  Q2_Sales
1010    123450    234550
Target table:
ProdID  Quarter_Sales  Quarter
1010    123450         Q1
1010    234550         Q2
216.
What is an environment variable?
An environment variable is a predefined variable that we can use while creating a DS job. We can set it either at the project level or at the job level. Once we set a specific variable, it is available in the project/job. We can also define new environment variables; for that, go to the DS Administrator.
217.
What is a routine?
Routines are stored in the Routines branch of the DataStage repository, where you can create, view or edit them. The following are the different types of routines: 1) transform functions, 2) before/after job subroutines, 3) job control routines.
218.
How can we create environment variables in DataStage? This mostly comes under the Administrator part. As a designer, we can add one directly via Designer -- View -- Job Properties -- Parameters -- Add Environment Variable -- under User Defined -- then add.
219.
What is the use of Hash file? Why can’t we use sequential file?
The primary use of the hash file is to do a lookup. You can use a sequential file for a lookup, but you need to write your own routine to match the columns, so coding time and execution time will be more expensive. When you generate a hash file, the hash file indexes the key with an inbuilt hashing algorithm, so when a lookup is made it is much faster. It also eliminates duplicate rows.
220.
How to read the data from EXCEL FILES? Explain with steps?
To read data from an Excel file: 1) Save the file in .csv (comma-separated values) format. 2) Use a flat file (Sequential File) stage in the DataStage job canvas. 3) Double-click on the flat file stage and assign the input file to the .csv file you saved. 4) Import metadata for the file (once you have imported or typed the metadata, click View Data to check the data values). Then do the rest of the transformation as needed.
221.
What is the meaning of performance tuning technique? Example?
Performance tuning means taking some action to increase the performance of a slowly running job, for example by: 1) using Link Partitioner and Link Collector to speed up processing, 2) using sorted data for aggregation, 3) using a sorter at the source side and aggregation at the target side, 4) tuning the OCI stage 'Array Size' and 'Rows per Transaction' values for faster selects, inserts and updates, 5) not using an IPC stage at the target side, etc.
222.
What are integration & unit testing in DataStage?
Unit testing: in a DataStage scenario, unit testing is the technique of testing individual DataStage jobs for their functionality. Integration testing: when two or more jobs are tested collectively for their functionality, that is called integration testing.
223. How can we improve performance in the Aggregator stage? We need to sort the data before sending it to the stage.
224.
Differentiate between pipeline and partition parallelism?
Pipeline parallelism: consider three CPUs connected in series. When data is fed into the first one, it starts processing and is simultaneously transferred to the second CPU, and so on. You can compare this with three sections of a pipe: as water enters the pipe it starts moving through all the sections.
Partition parallelism: consider three CPUs connected in parallel and fed with data at the same time, which reduces the load and improves efficiency. You can compare this to a single big pipe having three inbuilt pipes: as water is fed into them, a large quantity is consumed in less time.
225.
How do we create index in DataStage?
CREATE INDEX emp_ename ON emp(ename) TABLESPACE users STORAGE (INITIAL 20K NEXT 20k PCTINCREASE 75);
226.
What are the main differences between server jobs and parallel jobs in DataStage?
There are lots of differences in using the same stages in server and parallel jobs. For example, in parallel jobs a sequential file (or any other file) can have either an input link or an output link, but in server jobs it can have both (and more than one of each). Server jobs compile and run within the DataStage server, but parallel jobs compile and run on the DataStage UNIX server. Server jobs extract all the rows from the source to the next stage, and only then does that stage become active and pass the rows on towards the target or DWH, which takes time. Parallel jobs have two kinds of parallelism: 1. pipeline parallelism and 2. partition parallelism. 1. With pipeline parallelism, some rows are extracted from the source to another stage and, at the same time, that stage is already active and passing rows on towards the target or DWH; only one node is maintained between source and target. 2. Partition parallelism maintains more than one node between source and target.
227. What is the use of job control? Job control is used for scripting. With the help of scripting, we can set parameters for a job, execute it, do error handling and other tasks.
228. What are the different types of file formats? Some are comma-delimited CSV files, tab-delimited text files, etc.
229.
What is the Complex Flat File stage? In which situations do we use it?
A complex flat file can be used to read the data at the initial level. By using CFF, we can read ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) data. We can select the required columns and omit the remaining ones. We can collect the rejects (badly formatted records) by setting the rejects property to "save" (other options: continue, fail). We can also flatten arrays (COBOL files).
230. Why is a job sequence used? What are batches? What is the difference between a job sequence and batches?
A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also have control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage repository and in the DataStage Director client as a job. A batch is a collection of jobs grouped together to perform a specific task, i.e. it is a special type of job created using the DataStage Director which can be scheduled to run at a specific time. Difference between sequences and batches: unlike in sequences, in batches we cannot provide control information.
231.
For what purpose is the stage variable mainly used?
A stage variable is a temporary in-memory variable in the Transformer stage. If we are doing a calculation repeatedly, we can store the result in a stage variable. It can be used in situations where you want to store a previous record's value in a variable, compare it with the current record's value, and use it in if-then-else conditional statements.
232. What is a sequential file that has a single input link? A sequential file always has a single link because it cannot accept multiple links or threads; data in a sequential file always runs sequentially.
233.
What are the different types of star schema? A multi-star schema, or galaxy schema, is one of the types of star schema.
234.
What is the sequencer stage?
Let's say there are two jobs (J1 & J2) as the input links and one job (J3) as the output link of a Sequencer stage in a DS job sequence. The sequencer can be set to "ALL" or "ANY". If it is set to "ALL", the sequencer triggers the third job (J3) only after both input-linked jobs (J1 & J2) complete; if it is set to "ANY", it just waits for any one of the jobs (J1 or J2) to complete and then triggers the third one.
235. What is the use of tunables? Tunables is a tab in the project properties in the DataStage Administrator, where we can change the value of the cache size between 000 and 999 MB. The default cache size is 128 MB.
236.
What is the user variable activity in DataStage?
The User Variables activity stage defines some variables; those are then used later in the sequence.
237.
What is the alternative way in which we can do job control?
Job control can be done using: DataStage job sequences, DataStage custom routines, scripting, or scheduling tools like Autosys.
238.
Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using DataStage? Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC.
239.
Suppose you have table "sample" & three columns in that table sample:
Cola  Colb  Colc
1     10    100
2     20    200
3     30    300
Assume cola is the primary key. How will you fetch the record with the maximum cola value into the target system using the DataStage tool?
You can use an OCI stage to read the source table. In the OCI stage, write a user-defined SQL query (see the example below) that fetches the row with the maximum cola value from the table, then load the data to the target table.
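Note that SELECT MAX(cola) on its own returns only the maximum key value; to fetch the whole record, the user-defined SQL can use a sub-query, for example:

    SELECT cola, colb, colc
    FROM   sample
    WHERE  cola = (SELECT MAX(cola) FROM sample);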
240.
Which partitioning should we use for the Aggregator stage in parallel jobs?
It is always preferable and appropriate to use a Sort stage before the Aggregator stage. So, based on the aggregation logic, we should sort the incoming data using hash partitioning on the grouping keys, and then use the same partitioning on the Aggregator stage. This is most commonly used.
241. Is it possible to query a hash file? Justify your answer. Yes, it is possible. A hash file is like a database table for all practical purposes, so you can log in to the uvsh/TCL prompt and query it like any other database table.
242. What is the difference between RELEASE THE JOB and KILL THE JOB? Releasing the job means releasing the job from any dependencies and running it. Killing the job means killing the job that is currently running or scheduled to run.
243.
What is the process to remove duplicate records using the Copy stage? In the Copy stage there is an option to sort the input data, where you can select the key on the basis of which to perform the sort. There is an option named 'Unique'; select this and it will return the unique records after the sort operation.
244.
What is the difference between IBM Web Sphere DataStage 7.5 (Enterprise Edition) & Standard Ascential DataStage 7.5 Version?
IBM acquired DataStage from Ascential. DataStage 7.5 was released under the name Ascential DataStage, and only 8.0 was released as IBM WebSphere DataStage. IBM Information Server is also known as DS 8.0. It has more features such as QualityStage and MetaStage, it maintains its repository in DB2 (unlike files in 7.5), and it has a stage specifically for implementing SCD 1 & 2 (the SCD stage). The advanced edition of DataStage is WebSphere DataStage and QualityStage, released by IBM itself with version 8.0.1. It has only 3 client tools (Administrator, Designer, Director); the Manager has been removed and its import/export functions are included in the Designer. Extra stages such as the SCD stage have been added, with which we can implement SCD 1 and SCD 2 directly, along with some other advanced stages. QualityStage, which is used for data validation (very important for a DWH), is also included; there are many features available in QualityStage.
245.
What is the difference between the Merge stage and the Lookup stage?
Lookup stage:
1. Used to perform lookups.
2. Multiple reference links, single input link, single output link, single reject link.
3. Large memory usage, because paging is required.
4. Data on the input or reference links need NOT be sorted.
Merge stage:
1. Combines the sorted master dataset with the update datasets.
2. Several reject links and output links can exist.
3. Less memory usage.
4. Data needs to be sorted.
246. What is the repository? The repository resides in a specified database. It holds all the metadata, raw data, and the respective mapping information.
247. How do you extract job parameters from a file? Through a user variable activity in a sequence job, or by calling a routine.
248. Aggregators: what does the warning "Hash table has grown to 'xyz' …" mean? The Aggregator cannot land data to disk the way the Sort stage does; the data going to the Aggregator occupies system memory. If your system memory fills up, you get this kind of message. One solution is to split the data into multiple chunks and use multiple Aggregators.
249.
What are the different Debug stages available in DataStage PX? Head, Tail, Row Generator, Column Generator, Peek, Sample, Write Range Map.
250.
Where can we use the Link Partitioner, Link Collector & IPC stages: in Server jobs or in Parallel jobs? And is SMP parallel or server? We use the Link Partitioner and Link Collector stages in Server jobs to speed up processing. Suppose you have a source, a target, and a transformer in between that does some processing, applies functions, etc. You can speed it up by using a Link Partitioner to split the data from the source into different links, apply the business logic, and then collect the data back using a Link Collector and pump it into the output. The IPC stage is also intended to speed up processing.
251.
What is the difference between the Dynamic & Static RDBMS Stages?
252.
What is TX and what is its use in DataStage? As far as I know, TX stands for Transformation Extender, but I don't know how it works or where it is used.
WebSphere® Transformation Extender (WTX) is the universal transformation engine for WebSphere that addresses complex data integration challenges. Through its ability to read and process any data type in its native format, it tackles the hard integration problems across systems and information in the enterprise using a codeless, graphical approach to development.
253.
When you can handle nulls and data type changes in ODBC stages, why do you need the Modify stage?
The Modify stage is used to change data types; if the source contains a varchar and the target contains an integer, we can use the Modify stage to convert it according to the requirement. We can also modify the length.
254.
Explain a specific scenario where we would use range partitioning. If the data is large and you cannot process the full data in one pass, you will generally use range partitioning. It partitions column-wise.
255.
Have you ever been in contact with the DataStage administration people? In what situations, and what problems did you face? Creating projects, maintaining user rights, and killing hung jobs.
256.
How to enable the DataStage engine?
257.
In which situation do we use the RUN TIME COLUMN PROPAGATION option?
Runtime column propagation is used in the case of partial schema usage. When we only know about the columns to be processed and we want all other columns to be propagated to the target as they are, we check the enable RCP option in the Administrator, on the output page Columns tab, or on the stage page General tab, and we only need to specify the schema of the tables we are concerned with. According to the documentation: "Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property (see "Schema File" on page 5-8 and on page 5-31) to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are: Sequential File, File Set, External Source, External Target, Column Import, Column Export."
258.
What is the use of environment variables?
Environment variables are the ones that set the environment. Once you set these variables in DataStage you can use them in any job as a parameter. For example, to connect to a database you need a user ID, password and schema. These are constant throughout the project, so they are created as environment variables and used wherever you want with #Var#. With this approach, if there is any change in the password or schema, there is no need to touch all the jobs; change it at the environment variable level and that takes care of all the jobs.
259.
What is the difference between Job Control and a Job Sequence?
Job control is specifically used to control jobs; through it we can pass parameters, conditions, log file information, dashboard information, load recovery, etc. A job sequence is used to run a group of jobs based on some conditions. For final/incremental processing we keep all the jobs in one sequence and run them together using triggers.
260.
How will you define global variables in DS? The User Variables stage allows you to define global variables within a sequence. These variables can be used anywhere in the sequence; the values of the user variables are set by expressions in the stage's properties.
261.
How to find duplicate records using the Transformer stage in Server edition?
1. Write a SQL query depending on the fields (an example is sketched below).
2. Use a hash file, which by nature doesn't allow duplicates; attach a reject link to see the duplicates for verification.
3. Use a Transformer stage to identify duplicates and send the unique rows to one output, directing the duplicate rows to another output (the "rejects"). This approach requires sorted input.
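For option 1, a sketch of the kind of SQL that flags duplicate keys is shown below; the connect string, table and column names are placeholders.

#!/bin/sh
# Identify duplicate key values in the source table (all names are placeholders).
sqlplus -s scott/tiger@orcl <<'EOF'
-- Keys that occur more than once are the duplicates.
SELECT cust_id, COUNT(*) AS occurrences
FROM   customer_src
GROUP  BY cust_id
HAVING COUNT(*) > 1;
EXIT
EOF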
262.
Does the type of partitioning change for SMP and MPP systems?
263.
What is phantom error in DataStage?
For every job in DataStage, a phantom is generated for the job as well as for every active stage that contributes to the job. These phantoms write logs regarding the stage/job. If any abnormality occurs, an error message is written; these errors are called phantom errors. The logs are stored in the &PH& folder. Phantoms can be killed through DataStage Administrator or at the server level.
264.
Is a Hashed file an active or passive stage? When will it be useful?
A hash file is a passive stage. In the DS Server edition, hash files are used in jobs where lookups are required. Hash files are generally used in intermediate jobs in DS where there is a need to store data in an intermediate table format. All files in DataStage are passive stages, like the hash file, sequential file, etc. A stage that does some processing is called an active stage, e.g. Transformer, Sort, Aggregator.
265.
What are Orchestrate options in generic stage, what are the option names? Value? Name of an Orchestrate operator to call? What are the orchestrate operators available in DataStage for AIX environment?
Orchestrate is the old name for PX and EE. All the internal stages are generally called internal operators in PX. You can write custom operators if there is a need to enhance functionality that is not available in Orchestrate. It is a very complex process and only a few people actually do this kind of coding. You can write this code in C++ and add the compiled code to DataStage. Please refer to the Custom Operators documentation in the EE or PX versions.
How do you configure api_dump?
267. How to attach an MTR file (Map Trace) via email, given that the Map Trace is used to record all the execute map errors?
268. What are the important considerations while using the Join stage instead of Lookups?
1. If the volume of data is high, then we should use the Join stage instead of Lookup.
2. If you need to capture mismatches between the two sources, Lookup provides an easy option.
269.
What is Phantom error in the DataStage? How to overcome this error?
A phantom process is an orphaned process. Sometimes processes are still running on the server even though you kill the actual process; threads that keep running without any source process are called phantom processes. The directory called &PH& captures the log of phantom processes. In the Transformer stage, the order of the output links should be such that the dependent links come after the independent links; otherwise the job will abort with a phantom error message. How do you find out individual counts for a particular table?
270. While implementing SCD type 2, because history is maintained, the primary key constraint would be violated. To have a unique primary key for the table we use a surrogate key.
271. How to work with repeating fields in DataStage? We should create stage variables, define the values for the repeating fields there, and then map them to the respective columns.
272.
How do you load partial data after a job fails? The source has 10,000 records and the job failed after 5,000 records were loaded. The status of the job is aborted. Instead of removing the 5,000 records from the target, how can I resume the load?
There are lots of ways of doing this, but we keep the Extract, Transform and Load processes separate. Generally the load job never fails unless there is a data issue, and all data issues are cleared earlier, in the transform step. Some DB tools do this automatically. If you want to do it manually, keep track of the number of records loaded in a hash file or text file and update that file as you insert each record. If the job fails in the middle, read the number from the file and process only the records after that point, ignoring the earlier record numbers; try the @INROWNUM function for better results (see the shell sketch below, after question 273).
273. How can we increment the surrogate key value for every insert into the target database? We have the Surrogate Key Generator stage in DataStage Enterprise Edition, which generates keys incrementally. In the Server edition, we have to use stage variables or predefined variables like @INROWNUM and @OUTROWNUM.
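This is the shell sketch referred to in the answer to question 272 above: a minimal restart idea, assuming the source is a flat file and the number of already-loaded rows is kept in a small tracking file. All file, project and job names are hypothetical.

#!/bin/sh
# Restart sketch for question 272: skip rows that were already loaded.
# All names are hypothetical; the loader itself is represented by a dsjob call.
SRC=/data/in/customers.dat
TRACK=/data/ctl/customers.loaded_count

loaded=0
[ -f "$TRACK" ] && loaded=$(cat "$TRACK")

# Produce a restart file containing only the not-yet-loaded rows.
tail -n +"$((loaded + 1))" "$SRC" > /data/in/customers.restart.dat

# Re-run the load job against the restart file (parameter name is a placeholder).
dsjob -run -param SourceFile=/data/in/customers.restart.dat -jobstatus MyProject LoadCustomers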
274.
What is the baseline for implementing partitioning or the parallel execution method in a DataStage job? E.g., is it advised only for more than 2 million records?
275. A batch is running and it is scheduled to run in 5 minutes, but after 10 days the time changes to 10 minutes. What type of error is this and how do you fix it?
276.
How do you delete header and footer on the source sequential file and how do you create header and footer on target sequential file using DataStage?
In the Designer palette, under Development/Debug, we can find the Head and Tail stages, which can be used for this. Alternatively, the header and footer can be removed with the UNIX sed command, e.g. sed '1d' to drop the header (first line) and sed '$d' to drop the footer (last line); see the sketch below.
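A small shell sketch of both directions, assuming plain text files with one record per line; the file names and header/trailer layout are placeholders.

#!/bin/sh
# Remove the first (header) and last (footer) line from the source extract.
sed '1d;$d' source_with_hdr_ftr.dat > source_body.dat

# Add a header and a trailer (with record count) around the target file.
records=$(wc -l < target_body.dat)
{
  echo "HDR|CUSTOMER_LOAD|$(date +%Y%m%d)"
  cat target_body.dat
  echo "TRL|$records"
} > target_with_hdr_ftr.dat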
277.
What is Invocation ID?
An 'invocation id' is what makes a 'multi-instance' job unique at runtime. With normal jobs, you can only have one instance of it running at any given time. Multi-instance jobs extend that and allow you to have multiple instances of that job running (hence the name). So, you can run multiple 'copies' of the same job as long as the currently running invocation ids are unique.
278.
What is the difference between Sequential Stage & Dataset Stage? When do you use them?
A sequential file can be used as a target file, but a dataset can't be used as a final target; it can only be used as an intermediate target. A dataset can be used like a hash file (similar to the server hash file) and it improves performance. A dataset has no size limit whereas a sequential file has a size limit. A sequential file does not handle null values (it never accepts a null, so we need to pass some value), but that is not the case with a dataset. Performance is much better with a dataset compared to a sequential file because of parallelism.
279.
What are the errors you experienced with DataStage?
If there is a fatal error, the job is aborted; if there are only warnings, the job does not abort, but we still have to handle those warnings. The log should also be clear of warnings. Many different errors come up in different jobs, such as "Parameter not found in job load recover", "Child job failed because of ...", "Control job failed because of ...", etc.
280. How to connect two stages which do not have any common columns between them? If two stages don't have the same column name, then use a Transformer stage in between and map the required columns. If those are sources, then you had better use a Copy stage after one of them and change the column name according to your requirement, then join the two sources. If it is an Oracle database, select user-defined SQL, give an alias name there, and change the column name in the Columns tab.
281.
How to develop SCD using the LOOKUP stage?
We can implement SCD using the LOOKUP stage, but only SCD 1, not SCD 2. We take the source (file or DB) and a dataset as the reference link (for the lookup), and in the LOOKUP stage we compare the source with the dataset and give the condition. After that, in a Transformer we give the condition, and then we take two targets for insert and update, where we manually write the SQL insert and update statements.
282.
How can we improve the performance of the job while handling huge amount of data?
1. Decrease the usage of the Transformer stage.
2. Use sorted input for the Aggregator.
3. Filter the data in the source itself (if it is an Oracle database, write a user-defined SQL query).
4. Use the Tunables tab in the Administrator for cache memory and buffer size (only if you know what you are doing, otherwise you may run into problems such as deadlocks).
5. According to the requirement use Join, Lookup or Merge (depending on the data in the input dataset and the reference link data); use a dataset for the reference link (for Lookup).
6. Use appropriate partitioning methods for the different stages.
283.
I am running a huge amount of data through a Sort stage and it is very slow. Is there any option which can increase the performance within the Sort stage?
If it is an Oracle DB, you can write a user-defined query to sort and remove duplicates in the source itself, and by maintaining suitable key partitioning techniques you can improve the performance. If that is not possible, use key partitioning in the Sort stage, keeping the same partitioning that was used in the previous stage. Do not allow duplicates: remove them and give a unique partition key.
284.
Hi friends, I know that while using the Lookup stage, if the lookup table has a huge amount of data we go for a sparse lookup, otherwise we use a normal lookup. Would anybody explain how the memory is allocated for both types of lookups?
The lookup type is not decided by the data on the reference link but by the amount of data on the input link. If and only if your input link has few records can you go for a sparse lookup. There is no memory allocation for this type of lookup; a SQL WHERE clause is fired directly on the reference link database to fetch the lookup key value.
In the case of a normal lookup, the entire reference link data is read into memory and the lookup is done in memory. So memory comes into the picture only for normal lookups, not for sparse lookups.
285.
If a record is duplicated 3 times then how to get middle duplicated record? Is it advisable to use basic Transformer in Parallel jobs?
I think we have to use the Remove Duplicates stage here. First take a dataset, then the Remove Duplicates stage and the target; or use the source as a dataset, a Transformer (with a constraint) and the target.
286.
What is the purpose of the Debugging stages? In real time, where will we use them? The main uses of the Debugging stages (Row Generator, Peek, Tail, Head, etc.) are to monitor jobs and to generate mock data when we don't have real data to test with.
287.
A DataStage job finished successfully but is showing errors in the Director window. How do we find the cause of the errors and fix them?
In the Designer you may see the green colour, but in Director -> log view you find some fatal errors and the job is aborted. It means the job completed but the data was not loaded into the target, possibly because of mismatched data, a data type length problem, etc.
288.
What is the use of clearing the status file in DS? How will we use it? Explain with examples.
When you clear a job's status file you reset the status records associated with all stages in that job. You should therefore use this option with great care, and only when you believe a job has hung or aborted. In particular, before you clear a status file you should: 1) try to reset the job; 2) ensure that all the job's processes have ended.
289.
If you have a huge volume of data to be referenced, which stage will you use? Join or Lookup stage? Why?
For a huge volume of data we must use Join. The Lookup stage requires all the reference data to be loaded into memory first, and only then can it start the matching process with the input. Hence it is advisable to use the Join stage rather than the Lookup stage. If you want to capture reject records, you can use a Transformer (or Filter) after the Join stage and capture the null values, or use a Merge stage instead of the Join stage to capture the reject records.
290.
How can I load a flat file into the target as fast as possible, assuming the source bottleneck is not there, i.e. there are no performance issues on the source side?
291.
At the time of incremental loading I need to take the reusable file's path. In this situation, how do I assign the path of the file to PROJDEF using job parameters?
292.
Write a UNIX shell script to run the 2 jobs. How do you schedule the 2 jobs using UNIX crontab to run at a particular time?
Use crontab -e to edit your crontab file, or create one if it doesn't already exist. Now enter the following line into the file:
30 08 10 06 * /home/nikhilanshuman/runjob.sh
The schedule is read as follows:
30 – 30th minute
08 – 08 AM
10 – 10th day
06 – 6th month (June)
* – every day of the week
/home/nikhilanshuman/runjob.sh is the path where the script was saved.
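A minimal sketch of what /home/nikhilanshuman/runjob.sh might contain, using dsjob to run the two jobs one after the other; the dsenv path, project and job names are placeholders.

#!/bin/sh
# runjob.sh - run two DataStage jobs back to back from cron (names are placeholders).
. /opt/IBM/InformationServer/Server/DSEngine/dsenv   # path varies by installation

PROJECT=MyProject

for JOB in Job1 Job2; do
    dsjob -run -jobstatus "$PROJECT" "$JOB"
    rc=$?
    # Commonly 1 = finished OK, 2 = finished with warnings; anything else treat as failure.
    if [ $rc -ne 1 ] && [ $rc -ne 2 ]; then
        echo "$JOB failed with status $rc" >&2
        exit 1
    fi
done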
293.
1. Is the hash file active or passive? What if we take it as a source?
2. Can you take a sequential file as a lookup?
3. In hash file dynamic 30 there are two types: 1) generic 2) specific. What is the meaning?
4. How to connect the MERGE stage when the source is two tables?
5. What is the purpose of MERGE?
6. How can a DS job be scheduled in UNIX?
7. How do you know how many rows were rejected?
8. Use of the Universe stage?
9. What is a SEQ file buffer?
10. Difference between a DS routine, a DS transform, and a DS function in the Transformer stage?
11. If one table contains 100 rows and another one 1000, which one do we take as the lookup?
12. If a file has 100 records and the job aborts after 70 records, how do you load the remaining?
13. Can shared containers turn server jobs into parallel jobs?
14. Is it possible to extract an EXCEL sheet by using a SEQ file?
15. How do you decide when to go for join or lookup?
16. Difference between hash, range & entire partition keys?
17. In the Lookup stage in parallel, which partition key do you select?
18. What is a routine?
19. Where is the hash file stored if I take hash as the target?
20. Definition of system variables? Where do we use them? What are the system variables in your project?
21. What are plug-ins?
22. What is the purpose of a factless fact table?
23. If the source is OCI & flat files, then why do you need ODBC in your project rather than OCI?
24. Can we take a SEQ file as a source to a hash file? Does it work or not? If not, what error will it give?
25. Can we use the AGG & Transform stages for sorting? How?
26. How do you do job automation?
27. How can we join one oracle table & flat files?
28. If there are one million records, do we use OCI or some other stage?
29. What is the purpose of the filter option in the SEQ file in server jobs?
30. How to filter data by using SEQ?
31. Difference between clear log file & clear status file?
32. I want to process 3 files sequentially at a time; how can we do that in a job sequence?
33. What is version control in DataStage?
34. How to release a job?
35. Difference between parallel jobs and parallel shared containers?
36. How to give the scheduling time?
37. In a job sequence I have 5 jobs. I want to run 1 & 4 only and the remaining later. How can you do that?
38. Difference between lookup file set & file set?
39. Data migration?
40. I have 1000 records. How to load the last 100 records?
41. How do you cleanse your DataStage job?
42. Where do we write the routines?
43. How to execute a job sequence?
44. I am using an Oracle database but I am not using the OCI stage. Without OCI or ODBC, how can I extract the data?
45. I have 4 jobs; can they run at a time? Is it possible? What is the status?
46. Is it aborted, why? Is it validating, why?
47. What is an environment variable? Where is it stored?
48. How to connect a DB source, i.e. the path?
49. Difference between the server sequence stage & the parallel sequence stage?
50. Difference between dataset & SEQ file in parallel?
51. Difference between change apply & change capture stages?
52. Difference between copy, difference, and transform stages?
53. Difference between funnel, filter, and external filter in parallel?
54. Difference between file set, dataset, lookup file set, SEQ file?
55. Difference between hash & SEQ file in server?
56. Difference between static & dynamic types in a hash file?
57. Difference between server SEQ & parallel SEQ files?
58. What are buildup and audit stages? What is their purpose?
59. How to find multiple jobs' process ids without using cleanup resources in DS Director?
60. Where do we find the process ids from the last 5 job executions?
61. How many format types are there to export data by using DS Manager?
62. Can you convert server jobs into parallel jobs?
63. How to set the configuration file nodes if I have a million records?
64. I have two types of metadata in a SEQ file. Is it possible to load the target?
65. Difference between hash & dataset?
66. I have two files. One file contains 1,00,000 records and another contains 1000 records. Then how will we set the nodes?
67. In Oracle Enterprise edition, when do we select Upsert mode?
68. Difference between server seq file & parallel seq file?
69. Difference between the server transform stage & the parallel transform stage?
70. In the OCI stage, when do we go for insert/update or update/insert?
294.
What is the Disadvantage of surrogate key stage?
If the data is not processed evenly across the nodes, the surrogate key generator creates holes (it simply skips values in the ordering). For example: aa-1, bb-2, ba-?, bg-?, cc-3, dd-4, ddd-5. Here ba and bg have no sequence key because of the holes.
295.
What is the difference between change capture and change apply stages?
The Change Capture stage is used to get the difference between two sources, i.e. the after dataset and the before dataset. The source which is used as the reference to capture the changes is called the after dataset; the source in which we are looking for the change is called the before dataset. Change Capture adds one field called "change code" to its output. From this change code one can recognize which kind of change it is: delete, insert or update. The Change Apply stage is used along with Change Capture; it takes the change code from Change Capture and applies all the changes to the before dataset based on that code. In short, Change Capture captures the changes between the two sources and Change Apply applies those changes in the output file.
296.
I am eager to know about DataStage Engine commands. Tell me something about that?
DataStage Engine commands are otherwise called Universe commands. They are used to retrieve the details that are stored in the DataStage repository in the form of files and tables. We can also access hash files and the VOC contents using these commands, e.g. EVERY and LISTL. The following commands are DS Engine commands, used to start and stop the DS Engine:
$DSHOME/bin/uv -admin -start
$DSHOME/bin/uv -admin -stop
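A small wrapper sketch around the two commands above; the DSHOME path is an assumption and is normally set by the dsenv file on your installation.

#!/bin/sh
# Start or stop the DataStage engine using the uv -admin commands above.
DSHOME=/opt/IBM/InformationServer/Server/DSEngine   # adjust to your installation
cd "$DSHOME"
. ./dsenv

case "$1" in
    start) bin/uv -admin -start ;;
    stop)  bin/uv -admin -stop  ;;
    *)     echo "Usage: $0 {start|stop}" >&2; exit 1 ;;
esac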
297.
Using server job, how to transform data in XML file into sequential file? I have used XML input, XML transformer and a sequential file?
Import the XSD for the XML file via Manager -> metadata importer and store it. If no transformation functions are required on the data in the XML file, there is no need for the Transformer. To load data directly from an XML file to a sequential file, use these stages in the design: FOLDER -> XMLINPUT -> SEQUENTIAL FILE (if transformation is required, you can use a Transformer stage between the XMLINPUT stage and the SEQ file stage). In the FOLDER stage properties, assign the file path, which can be given as a wildcard. The whole data in the file is taken as a single row and streamed through the Folder stage; in its output, the metadata will be 'File' (filename, key, varchar) and 'Data' (a single row of data, long varchar 9999). In the XMLINPUT stage, for Source select the 'File' column with content as file path (or select 'Data' and XML document accordingly). On the output link, import the column metadata from the stored table definition and map it directly to the seq file (or to Transformer -> seq file).
298.
Suppose we have 3 jobs in a sequence; while running, if job 1 fails we still have to run job 2 and job 3. How can we do that?
To run a job even if its previous job in the sequence has failed, you need to go to the TRIGGER tab of that particular job activity in the sequence itself. There you will find three fields:
Name: the name of the next link (the link going to the next job; e.g. for job activity 1 the link name will be the link going to job activity 2).
Expression Type: this allows you to trigger the next job activity based on the status you want. For example, if job 1 fails and you want to run job 2 and job 3, go to the trigger properties of job 1 and select the expression type "Failed - (Conditional)". This way you can run job 2 even if job 1 is aborted. There are many other options available.
Expression: this is editable for some options; for the expression type "Failed" you cannot change this field.
299.
How to fetch the last row from a particular column in a sequential file?
300.
How do you remove duplicates using transformer stage in DataStage?
Double-click on the Transformer stage, go to Stage Properties (the first icon on the toolbar), go to Inputs, then Partitioning, and select a partitioning technique (other than Auto). Now enable Perform Sort, then enable Unique, and select the required column name. The output will then have only unique values, so the duplicates are removed.
301.
What is the use of combinability mode?
302.
Explain Wrapped, Build and Custom stages?
Wrapped stage: Enables you to run an existing sequential program in parallel Build stage: Enables you to write a C expression that is automatically generated into a Parallel custom stage. Custom stage: Provides a complete C++ API for developing complex and extensible Stages.
303.
Is it the case that if we define the values of variables in the DSParams file, then there is no need to give the values at job level or project level? And how do we configure this file at job level, so that we need not hardcode the values?
304.
What are the DataStage standards to be considered?
305.
How will a high-level design be prepared in DataStage?
306.
How will a low-level design be prepared in DataStage?
307.
How will the technical design be prepared in DataStage?
308.
What is Orchestrate Schema? Distinguish internal data type (Orchestrate schema) vs. external data type?
309.
What are these terms used for in DataStage - Downstream and Upstream?
Things to Learn:
1. Loading data into dimension tables & fact tables.
2. SCD Type 2.
3. All stages (Server + Parallel).
4. Performance tuning.
Manager: Routines:
1. ExecSH
2. DSWaitForFile
3. DSSendMail
SDK: Date routines
1. Convert date/TIMESTAMP to OraOCI format
2. ODBC format
3. DateCurrentGMTTime
User defined:
1. Find job
2. Any date to YYYY-MM-DD format (used the date functions iconv, oconv)
3. GetNextKeyValue: this routine is used in Server jobs for reading the maximum surrogate key value of a dimension from the hash file (the max key value is loaded into the hash file by a job at the start of a sequence). This is for inserting a new record into the dimension table.
Transforms:
1. Dates
2. Strings
3. Numeric
Functions:
1. Date functions: current timestamp, null handling, strings (concat, TRIM, left, right), numeric
Constraints & derivations in a transformer.
Manager: metadata creation
1. Table definitions & metadata
2. Multiple job compilation.
311. How do you do an Oracle 4-way inner join if there are 4 Oracle input files?
312.
What is SQL tuning? How do you do it?
313.
What is a batch program and how can we generate one?