What is EME?
EME is an object-oriented data storage system that version-controls and manages the various kinds of information associated with Ab Initio applications, ranging from design information to operational data. In simple terms, it is a repository that contains data about data, that is, metadata.
Revisiting the Sandbox Concept
What is a Sandbox?
Projects held in the EME datastore cannot be manipulated directly. To work on a Project, it must be checked out to a working area on the file system where code can be developed and modified. This working area is known as a sandbox. It has exactly the same directory structure as the Project in the datastore. Each object that needs to be worked on is checked out to a sandbox, where modifications or enhancements are carried out. After the changes are complete, the code is checked in from the sandbox to the EME datastore. This action creates a new version of the code in the EME datastore.
Sandbox Projects vs. EME Datastore Project
Sandboxes are work areas used to develop, test, or run the code associated with a given Project. Only one version of the code can be held in a sandbox at any time, whereas the EME datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one Project, whereas a Project can be checked out to any number of sandboxes.
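The relationship between a sandbox and the datastore can be illustrated with a small Python model. This is only a sketch of the concept, not how the EME is implemented: the class names `Datastore` and `Sandbox`, their methods, and the object path are hypothetical.

```python
class Datastore:
    """Toy stand-in for an EME datastore: keeps every checked-in version."""

    def __init__(self):
        self.history = {}  # object path -> list of contents, oldest first

    def check_in(self, path, content):
        self.history.setdefault(path, []).append(content)

    def latest(self, path):
        return self.history[path][-1]


class Sandbox:
    """Toy stand-in for a sandbox: holds one working copy per object."""

    def __init__(self, datastore):
        self.datastore = datastore
        self.files = {}  # object path -> the single copy held here

    def check_out(self, path):
        self.files[path] = self.datastore.latest(path)

    def check_in(self, path):
        self.datastore.check_in(path, self.files[path])


store = Datastore()
store.check_in("mp/load.mp", "first draft")

box = Sandbox(store)
box.check_out("mp/load.mp")
box.files["mp/load.mp"] = "second draft"  # edit the working copy
box.check_in("mp/load.mp")                # creates a new version

print(store.history["mp/load.mp"])  # the datastore keeps both versions
print(box.files["mp/load.mp"])      # the sandbox holds only one
```

Several `Sandbox` objects could share the same `Datastore`, which mirrors the point that one Project can be checked out to many sandboxes.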
EME Datastore/Repository connection settings
An EME datastore is a specific instance of the EME in the environment. It is a repository where different versions of the code and its related data, such as record formats and transformations, are maintained. At any point in time a user can connect to only one such EME repository instance. To access an EME datastore, go to Project>EME Datastore Settings in the GDE menu and fill in the details in the following dialog box:
The following details must be filled in on the EME Datastore Settings dialog:
  Method: Remote Execution (Rexec) or Telnet
  Host: The host where the EME datastore resides
  Login and Password: Unix login credentials for the host
  Co>Operating System Location: Path to where the Ab Initio Co>Operating System is installed
  EME Datastore Location: Path to where the EME datastore is located
  Mode: Source Code Control
After filling in the details, press the Connect button to test the connection. If the details are filled in correctly, a message box confirms the connection.
Project
A Project is a collection of related graphs and their associated elements, such as DML and XFR files, in the EME datastore.
Project structure
Typically a Project should contain a maximum of 5 to 10 graphs. This helps to organize the code efficiently within the EME. As the number of graphs in a Project increases, so does the time taken to perform dependency analysis on the graphs and related data.
Before adding a Project to an existing application, which already has a number of Projects in place, the impact it might have on other Projects and on the Application as a whole must be considered.
Structure of a Project in EME
SQL: Subdirectory for SQL queries
Different Types of Projects
Private vs. Public Project
There is often information common to multiple Projects. For instance, several Projects may share some record format files or transform files. Elements that are used across Projects can be made widely reusable by making them part of one Project and including that Project in the other Projects that need the common elements. A Project that is included by other Projects is termed a public Project, and the Projects that include public Projects are known as private Projects.
A public Project is public in the sense that its data and metadata are expected to be shared with other Projects, and a private Project is private in the sense that its data and metadata are not expected to be shared with other Projects.
The Environment Project (stdenv)
There is a special Project associated with every instance of the Ab Initio environment, known as the Environment Project or stdenv (Standard Environment). Its structure is no different from that of a regular Project. It contains machine-specific and application-specific settings, such as the data directory mount points and max-core settings, as well as application-wide parameters, such as the current date, that are used across all Projects. When any Project is created, stdenv is included in it by default. A single stdenv is required for an entire set of applications that run on a single machine and share a single EME datastore.
Version Control and Tagging
Each object under EME source control, which may be a file, a directory, or a Project, exists as a series of versions, each of which is a representation of what was checked in by some user. A version can optionally have a textual description attached to it, called a tag, and a comment. Each version is separately numbered and can be accessed either by its version number or by the tag attached to it. Version numbers, which are integers, and tags are global to the whole EME datastore. Tags are the basic unit for migrating code across EME datastore instances.
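The idea of datastore-wide version numbers and tags can be sketched in Python as follows. This is an illustrative model under simplifying assumptions: the class `VersionedStore`, its methods, the paths, and the tag name `REL_1` are all hypothetical, and the text above does not describe how the EME actually allocates numbers beyond their being global integers.

```python
import itertools


class VersionedStore:
    """Toy model: one version counter and one tag namespace per datastore."""

    def __init__(self):
        self._next = itertools.count(1)  # global counter, not per object
        self.by_number = {}              # version number -> (path, content)
        self.tags = {}                   # tag name -> set of version numbers

    def check_in(self, path, content, tag=None):
        number = next(self._next)
        self.by_number[number] = (path, content)
        if tag is not None:
            self.tags.setdefault(tag, set()).add(number)
        return number

    def by_tag(self, tag):
        """Return {path: content} for every version carrying the tag."""
        numbers = sorted(self.tags.get(tag, ()))
        return {self.by_number[n][0]: self.by_number[n][1] for n in numbers}


store = VersionedStore()
n1 = store.check_in("dml/a.dml", "record v1")
n2 = store.check_in("xfr/b.xfr", "transform v1", tag="REL_1")
n3 = store.check_in("dml/a.dml", "record v2", tag="REL_1")

print(n1, n2, n3)              # numbers increase across different objects
print(store.by_tag("REL_1"))   # the set of versions a migration would pick up
```

Looking up everything under one tag is what makes a tag a convenient unit for migration in this model.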
Check out of files using GDE
The Check Out wizard is invoked by navigating to Project>CheckOut and looks as follows:
Select the Project, directory, or file you want to check out by browsing to it. In the sandbox host drop-down list, select the host on which the sandbox resides. In the directory field, enter the path to an existing sandbox (which must be associated with the Project being checked out) or specify a new one, which will be created during check out. The advanced options dialog can be opened by clicking the Advanced button.
The first two options specify whether to check out the required files from the parent Project and whether to check out the required files from the common Projects. The default is to check out the required files from the parent Project. A file is required if it is directly referenced in a graph or if it is referenced in an include in a DML or XFR file. When a whole Project is checked out, these two options are disabled, as shown above. "Run host setup script" runs the host profile's setup script before check out, and "Mark files read only on check out" does exactly what it says. Both options are on by default. A particular tagged version of the object can be selected from the tag drop-down list. By default, the latest version is checked out.
On clicking Next, if the sandbox does not exist, you are asked to confirm whether to create it. Clicking Yes creates the sandbox and checks out the specified object to it. You will be prompted to enter the sandbox locations of stdenv and any common Projects associated with the Project, unless the sandbox already has these values specified or is a pre-existing one.
Clicking on Do Check out performs the checkout operation and on its completion a window shows the operations performed.
Locking
After a successful check out, a lock must be acquired on the object to be modified in the sandbox. To modify a graph that has been checked out, first open the graph in the GDE and then click the lock symbol on the menu. This checks whether the version in the sandbox is the latest version of the object in the datastore; if it is, the lock symbol turns green, showing that the graph is now locked and editable. If the graph has already been locked in some other sandbox, the lock is shown in red when the graph is opened in the GDE, denoting that there is already a lock on it. A lock can be acquired on an object only if the sandbox version and the current version of the object in the EME are the same. Once a lock is acquired and the changes are complete, the object must be checked in to create a new version in the datastore. For non-Ab Initio objects, which cannot be locked from the GDE, a lock can be obtained from the Unix command line using the air commands.
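The two conditions for acquiring a lock can be expressed as a short Python check. This is a sketch of the rule described above, not of the GDE's actual logic; the function name `can_lock`, its parameters, and the wording of the messages are hypothetical.

```python
def can_lock(sandbox_version, latest_version, locked_by, requester):
    """Decide whether `requester` may lock an object.

    sandbox_version: version number held in the requester's sandbox
    latest_version:  current version number in the datastore
    locked_by:       name of the sandbox holding the lock, or None
    """
    if locked_by is not None and locked_by != requester:
        return False, "red: already locked in another sandbox"
    if sandbox_version != latest_version:
        return False, "sandbox copy is not the latest version"
    return True, "green: locked and editable"


print(can_lock(3, 3, None, "alice_sbx"))
print(can_lock(2, 3, None, "alice_sbx"))
print(can_lock(3, 3, "bob_sbx", "alice_sbx"))
```

The first call succeeds, the second fails because the sandbox copy is stale, and the third fails because another sandbox already holds the lock.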
Check in of files using GDE
Once the Project files have been edited and updated, they need to be checked in to create a new version in the EME datastore, which will then be available to other users. The check in wizard is invoked by navigating to Project>Checkin. Before the wizard starts, the GDE checks for any unsaved files in the sandbox and prompts whether to save them. The check in wizard looks as follows:
Choose the sandbox host from the drop-down list. In the Directory or file field, browse to the file in the sandbox that you want to check in. You may select a file under the sandbox, or you may select the whole sandbox, in which case the whole Project is checked into the EME datastore. In the Project Directory field, browse to the parent Project, which points to the Project directory in the EME datastore where the object will be checked in. To see the advanced check in options, click the Advanced button. The Checkin tab indicates how the check in is to be performed. By default, "Force overwrite" is unchecked. When it is checked, the object is checked in even if there are conflicts and becomes the latest version in the datastore. "Run Host Setup script" runs the host profile's setup script before each check in. It is advised not to change any settings here.
The analysis tab specifies how much dependency analysis is done and on which objects during check in.
A tag, which is a descriptive piece of text, and a comment can be attached to the version being checked in. These are entered in the Tag tab of the advanced options dialog box. The tagging standards are described in another document. After filling in the tag information, click Next in the check in wizard to display the check in ready dialog.
Clicking on “Do Checkin” performs the actual check in and displays a window similar to the “check out finished” window with the results of check in and dependency analysis (if specified in the advanced option).
Working with previous versions of graphs/objects in the EME
1. Check out the required previous tagged version of the graph to your sandbox (V1 in the figure below).
2. Check it back in with "Force Overwrite" selected in the advanced options of the check in wizard. This makes it the current version in the datastore (V4 in the figure below).
3. Lock the graph to make the changes.
4. Check the graph back in to the EME datastore. This updated version becomes the latest version in the EME datastore (V5 in the figure below).
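The V1 to V4 to V5 sequence can be traced with a small Python model. It is a sketch only: the helper `check_in`, its `base_version` argument, and the conflict rule (a check in is refused when the sandbox copy is not based on the latest version, unless force overwrite is set) are assumptions made for illustration.

```python
def check_in(history, content, base_version, force_overwrite=False):
    """Append `content` as a new version and return its 1-based number.

    `base_version` is the version the sandbox copy was checked out from.
    """
    latest = len(history)
    if base_version != latest and not force_overwrite:
        raise RuntimeError("conflict: sandbox copy is not based on the latest version")
    history.append(content)
    return len(history)


history = ["v1 logic", "v2 logic", "v3 logic"]  # V1..V3 already in the datastore

# Step 1: check out the old version V1 into the sandbox.
working_copy = history[0]

# Step 2: force it back in so that it becomes the current version (V4).
v4 = check_in(history, working_copy, base_version=1, force_overwrite=True)

# Steps 3 and 4: lock, edit, and check in normally to create V5.
working_copy = working_copy + " + fix"
v5 = check_in(history, working_copy, base_version=v4)

print(v4, v5, history[-1])
```

Without `force_overwrite=True` in step 2, this model raises a conflict, which is why the forced check in comes before the edit.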
Parameters
A parameter is a name-value pair with some additional attributes that determine when and how its value is interpreted or resolved. Parameters give logical names to physical locations and should always be used instead of hardcoded paths in graphs. There are two types of parameters: graph parameters and Project parameters.
Graph parameters
Graph parameters, as the name suggests, are specific to individual graphs and are private to them. They affect the execution of the graph for which they have been defined. Graph parameters are defined by navigating to Edit>Parameters in the GDE, which opens the graph parameters editor.
Project parameters
Project parameters are inherited by all the graphs in the Project and are accessed from the GDE through the sandbox parameter editor, under Project>Edit Sandbox>Parameters. This shows a dialog box prompting for the sandbox path. Choose the correct host and sandbox path and press OK to open the sandbox parameter editor, which looks exactly like the graph parameter editor shown above.
Major Parameter Attributes
  Scope: The scope of a parameter can be formal or local. A local parameter is internal to the sandbox, and most parameters are local; its value is taken from the Value column in the parameter editor. A formal parameter is one whose value can be set from outside, that is, from the environment where the graph is run; its value is supplied on the command line. Formal parameters can be identified by a green diamond with an arrow mark.
  Kind: If the scope is local, the kind is left unspecified; if it is formal, the kind is automatically set to keyword.
  Type: This determines the nature of the parameter. Project parameters have four types: string, common Project, switch, and dependent. Graph parameters have a different set of types.
  Export: When this check box is checked, the parameter value is exported as an environment variable; otherwise it is generated as a local shell variable.
  Private Value: If a parameter is specified as a private value, any subsequent changes to it remain private to the local sandbox and are not checked into the EME. This is useful when different users want different values for the same parameter.
  Value: This column specifies the value of the parameter.
  Interpretation: This determines how the parameter value is evaluated.
    Constant: The value is taken literally.
    $ Substitution: Variables with a $ prefix are replaced with their values.
    ${} Substitution: Variables enclosed in {} and prefixed with $ are replaced with their values; other occurrences of $ are ignored.
    Shell: Korn shell syntax is used to evaluate the value of the parameter.
  Required: This attribute can take two values, required (the default) or optional. If the parameter is required, the Value column cannot be left blank; if it is optional, it can be.
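The difference between the Constant, $ Substitution, and ${} Substitution interpretations can be sketched in Python. This is an approximation for illustration, not Ab Initio's resolver: the function `resolve` and the mode names `"constant"`, `"dollar"`, and `"dollar_brace"` are hypothetical, the assumption that $ substitution accepts both `$NAME` and `${NAME}` is not stated in the text, and the Shell interpretation is left out.

```python
import re


def resolve(value, interpretation, env):
    """Resolve a parameter value against `env` (name -> value)."""
    if interpretation == "constant":
        return value  # taken literally
    if interpretation == "dollar":
        # Assumption: both ${NAME} and $NAME are substituted.
        pattern = r"\$\{(\w+)\}|\$(\w+)"
    elif interpretation == "dollar_brace":
        # Only ${NAME} is substituted; any other $ is left untouched.
        pattern = r"\$\{(\w+)\}"
    else:
        raise ValueError("unsupported interpretation: " + interpretation)

    def lookup(match):
        name = next(group for group in match.groups() if group)
        return env[name]

    return re.sub(pattern, lookup, value)


env = {"AI_DML": "/proj/dml"}
print(resolve("$AI_DML/cust.dml", "constant", env))
print(resolve("$AI_DML/cust.dml", "dollar", env))
print(resolve("${AI_DML}/price_$5.dml", "dollar_brace", env))
```

In the last call the `$5` is preserved because only the braced form is treated as a variable reference.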
Session I (Day 1)
Introduction to Ab Initio
  What is Ab Initio?
  Applications of Ab Initio
  Architecture
  Co>Operating System
  Types of Development
  GDE
  Co>Op System Configuration
  Sandbox Environment
  Graph
  Component Properties
  Attribute Editor
  Graph Properties
  View Data Panel
  Expression Editor
Session II (Day 2 & 3)
DML
  Type Reference
  Key Specifier Reference
  Expression Reference
  Transform Reference
  Package Reference
  Function Reference
  DML Utilities
  DML Examples
Components
  Run SQL
  Intermediate File
  Lookup File
  Concatenate
  Gather
  Interleave
  Merge
  Gather Logs
  Redefine Format
  Replicate
  Filter by Expression
  Join
  Reformat
  Rollup
Session III (Day 4 & 5)
Parallelism
  Multi file system
  Component Parallelism
  Data Parallelism
  Pipeline Parallelism
  Partition and De-Partition Components
Metadata Management
  Concepts
  Commands
Introduction to Job Management
Ab Initio is Latin for "From the Beginning"
From the beginning, the software was designed to support a complete range of business applications, from the simplest to the most complex. The graphical development environment and a powerful set of components allow customers to get valuable results from the beginning.
Moving Data
  Move small and large volumes of data in an efficient manner.
  Deal with the complexity associated with business data.
  High performance
  Scalable solutions
  Better productivity
Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
  Data warehousing
  Batch processing
  Click-stream analysis
  Data movement
  Data transformation
Computers come in many shapes and sizes:
  Single-CPU, multi-CPU
  Network of single-CPU computers
  Network of multi-CPU computers
Multi-CPU machines are often called SMPs (for Symmetric Multi Processors). Specifically built networks of machines are often called MPPs (for Massively Parallel Processors).
Distribution: a platform for applications to execute across a collection of processors within the confines of a single machine or across multiple machines.
Reduced Run Time Complexity: the ability for applications to run in parallel, from a single point of control, on any combination of computers where the Ab Initio Co>Operating System is installed.
Ab Initio software consists of two main programs:
  Co>Operating System, which your system administrator installs on a host UNIX or Windows NT server, as well as on processing nodes. (The host is also referred to as the control node.)
  Graphical Development Environment (GDE), which you install on your PC (client node) and configure to communicate with the host (control node).
Ab Initio Architecture
Co>Operating System
The Co>Operating System is a powerful engine for every kind of data processing. It delivers crucial facilities including distributed and parallel execution, platform-independent data transport, and process monitoring.
The Co>Operating System delivers:
  Unlimited scalability: double the number of CPUs and execution time is halved.
  Flexibility: an open component model for extending and customizing Ab Initio's functionality.
  Portability: the Co>Operating System runs heterogeneously across a huge variety of operating systems and hardware platforms, from OS/390 on mainframes, to 10 different implementations of Unix, to Windows NT and Windows 2000.
  Parallel and distributed application execution
  Control
  Data transport
  Transactional semantics at the application level
  Checkpointing
  Monitoring and debugging
  Parallel file management
  Metadata-driven components
Co>Op system Configuration
Testing Co>Op systems
The Ab Initio Co>Operating System runs on:
  Compaq Tru64 UNIX
  Digital UNIX
  HP-UX
  IBM AIX
  NCR MP-RAS
  Red Hat Linux
  IBM/Sequent DYNIX/ptx
  Siemens Pyramid Reliant UNIX
  Silicon Graphics IRIX
  Sun Solaris
  Windows NT and Windows 2000
Types of Development Environment
  GDE (Graphical Development Environment)
  SDE (Shell Development Environment)
GDE Layout
A Sandbox Environment
A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration.
  Setting up a standard working environment helps a development team work together.
  The sandbox capability allows an application to be designed to be trivially portable.
  The sandbox contents are a project administrative function.
Sandbox Parameters
1. Start the Ab Initio GDE.
2. Go to Repository>Edit Sandbox.
Environment –Quick Overview
  $AI_RUN: run directory
  $AI_DML: record format files
  $AI_XFR: transform files
  $AI_MP: graphs
  $AI_DB: database config files
  $AI_SERIAL: serial source data and other serial data
  $AI_MFS: Ab Initio multifile directory; in training it will also contain partition directories (more about this later)
  $AI_LOG: a location for log files and similar output
The goal is to have a development environment that enables the migration of a graph or set of graphs to any other environment with absolutely no changes.
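The portability goal can be illustrated in Python: a file reference written in terms of a sandbox parameter stays the same, and only the parameter values differ between environments. This is a conceptual sketch that uses Python's `string.Template` in place of Ab Initio's own parameter resolution; the directory paths and the file name are hypothetical.

```python
from string import Template

# A reference as it might appear in a graph: no hardcoded path.
record_format = Template("$AI_DML/customer.dml")

# Hypothetical parameter values for two environments.
development = {"AI_DML": "/dev/sandboxes/sales/dml"}
production = {"AI_DML": "/prod/apps/sales/dml"}

print(record_format.substitute(development))
print(record_format.substitute(production))
```

The graph-side reference is identical in both cases, so moving the graph to another environment means changing parameter values rather than the graph.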
Sample Graph
1. Components
2. Datasets
3. Flows
Components
Components may run on any computer running the Co>Operating System. The Ab Initio component library contains a diverse set of built-in components. The particular work a component accomplishes depends on its parameter settings. Some components require a data transformation parameter, that is, a set of business rules to be applied to one or more inputs to produce the required output.
Datasets
A dataset is a source or destination of data. It can be a simple file, a database table, a SAS dataset, and so on. Datasets may reside on any machine running the Co>Operating System.
Datasets may also reside on other machines if they are connected by FTP or database middleware. Data within a dataset must always be exactly described using Ab Initio's Data Manipulation Language (DML) to form record format metadata.
Viewing Component Properties
Viewing Port Properties
Dataset: Records and Fields
A dataset is made up of records; a record consists of fields. The analogous database terms are rows and columns.
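The record and field terminology can be made concrete with a small Python sketch that splits comma-delimited lines into named fields. It stands in for the idea of a record format only; it is not DML, and the field names, types, and sample data are hypothetical.

```python
# A toy "record format": field names and the Python type of each field.
record_format = [("id", int), ("name", str), ("amount", float)]


def parse_record(line, record_format, delimiter=","):
    """Split one line (a record) into named, typed fields."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(record_format):
        raise ValueError("record does not match the record format")
    return {name: cast(value) for (name, cast), value in zip(record_format, values)}


dataset = ["1,Ann,10.5\n", "2,Bob,7.25\n"]  # two records (rows)
records = [parse_record(line, record_format) for line in dataset]

for record in records:
    print(record)  # each key is a field (column)
```

A line whose field count differs from the format is rejected, which reflects the requirement that the data be described exactly.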