What are Flat Files? Flat files are extensively used for exchanging data between enterprises or between organizations within an enterprise. Flat files come in two forms - delimited files such as CSV (comma separated) files or fixed width files.
What is Flat File Testing? Flat file testing is the process of validating the quality of data in the flat file as well as ensuring that the data in the flat file has been consumed appropriately by the application or ETL process.
Challenges in Flat File Testing: Testing of inbound flat files presents unique challenges because the producer of the flat file is usually a different organization within the enterprise or an external vendor. Consequently, there might be differences in the format and content of the files since there is no easy way to enforce data type and data quality constraints on the data in flat files. Issues in flat file data can cause failures in the consuming process. While file processing requirements differ from project to project, the focus of this use case is to list some of the common checks that need to be performed when validating flat files.
Flat File Testing Categories
FLAT FILE INGESTION TESTING
When data is moved using flat files between enterprises or organizations within an enterprise, it is important to perform a set of file ingestion validations on the inbound flat files before consuming the data in those files. File name validation: Files are FTPed or copied over to a specific folder for processing. These files usually have a specific naming convention so that the process consuming the file is able to understand the contents and date. From a testing standpoint, the file name pattern needs to be validated to verify that it meets the requirement.
Example: A government agency gets files from multiple vendors on a periodic basis. The arriving files should follow a naming convention of 'CompanyCodeContentTypeDateTimestamp.csv'. However, the files coming in from a specific vendor do not have the correct company name.
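A file name check of this kind can be sketched with a small regular expression test. The pattern below (a 3-letter company code, a content type, and a 14-digit timestamp) is an assumed illustration of the convention, not the agency's actual specification:

```python
import re

# Assumed convention: 3-letter company code + content type + 14-digit
# timestamp, e.g. "ACMOrders20240101120000.csv" -- adjust to the real spec.
FILE_NAME_PATTERN = re.compile(
    r"^(?P<company>[A-Z]{3})(?P<content>[A-Za-z]+?)(?P<timestamp>\d{14})\.csv$"
)

def validate_file_name(name: str) -> bool:
    """Return True when the inbound file name matches the agreed convention."""
    return FILE_NAME_PATTERN.match(name) is not None
```

The named groups also make it easy to go one step further and compare the `company` segment against an allowlist of known vendor codes, which would catch the incorrect company name described in the example.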
Size and Format of the flat files: Although flat files are generally delimited or fixed width, it is common to have a header and footer in these files. Sometimes these headers have a row count that can be used to verify that the file contains the entire data as expected. Some of the relevant checks are: Verify that the size of the file is within the expected range where applicable. Verify that the header, footer and column heading rows have the expected format and appear in the expected location within the flat file. Perform any row count checks to cross-check the data in the header with the values in the delimited data.
Example: A financial reporting company generates files with a header that contains the summary amount, with the line items having the detailed split. The sum of the amounts in the line items should match the summary amount in the header.
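A reconciliation of this kind can be scripted directly against the file. The two-column header/detail layout below is a simplified assumption for illustration; Decimal is used so currency amounts compare exactly:

```python
import csv
import io
from decimal import Decimal

def header_matches_details(file_text: str) -> bool:
    """Assumed layout: first row 'H,<summary_amount>', remaining rows
    'D,<line_amount>'. Verify the detail amounts add up to the header."""
    rows = list(csv.reader(io.StringIO(file_text)))
    header_total = Decimal(rows[0][1])
    detail_total = sum(Decimal(r[1]) for r in rows[1:])
    return header_total == detail_total
```

A file where a line item was dropped in transit would fail this check even though the file still parses cleanly.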
File arrival, processing and deletion times: Files arrive periodically into a specific network folder or an FTP location before getting consumed by a process. Usually, there are specific requirements that need to be met regarding the file arrival time, the order of arrival, and how long files are retained.
Example: A pharma company gets a set of files from a vendor on a daily basis. The process consuming these files expects the complete set of files to be available before processing. 1. A file that was supposed to come yesterday was delayed. It came in sometime after today's file arrived, causing issues due to the difference in the order of processing the files. 2. After the files get processed, they are supposed to be moved to a specific directory where they are retained for a specified period of time and then deleted. However, the file did not get copied over.
Automate file ingestion testing using ETL Validator
ETL Validator comes with Component Test Case and File Watcher which can be used to test flat files. Flat File Component: The flat file component is part of the Component Test Case. It can be used to define data type and data quality rules on the incoming flat file. The data in the flat file can also be compared with data from the database. File Watcher: Using File Watcher, test plans can be triggered automatically when a new file arrives in a directory so that the test cases on the file can be executed automatically before the files are used further by the consuming process. SFTP Connection: Makes it easy to compare and validate flat files located in a remote SFTP location.
FLAT FILE DATA TYPE TESTING
The purpose of Data Type testing is to verify that the type and length of the data in the flat file is as expected. Data Type Check: Verify that the type and format of the data in the inbound flat file matches the expected data type for the file. For date, timestamp and time data types, the values are expected to be in a specific format so that they can be parsed by the consuming process. Example: An ID column of the flat file is expected to have only numbers. However, a few rows in the flat file have characters.
Data Length Check: The length of string and number data values in the flat file should match the maximum allowed length for those columns.
Example: Data for the comments column has more than 5000 characters in the inbound flat file while the limit for the corresponding column in the database is only 2000 characters.
Not Null Check: Verify that any required data elements in the flat file have data for all the rows.
Example: Date of Birth is a required data element but some of the records are missing values in the inbound flat file.
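The three checks above (data type, data length, not null) can be combined into a single pass over a delimited file. The column names and the 2000-character limit below are assumptions taken from the examples, not a fixed schema:

```python
import csv
import io
import re

def validate_rows(file_text: str):
    """Validate a hypothetical customer file with columns id, comments,
    date_of_birth; return (row_number, error) tuples for bad records."""
    errors = []
    for n, row in enumerate(csv.DictReader(io.StringIO(file_text)), start=2):
        if not re.fullmatch(r"\d+", row["id"] or ""):
            errors.append((n, "id must be numeric"))           # data type check
        if len(row["comments"] or "") > 2000:
            errors.append((n, "comments exceeds 2000 chars"))  # data length check
        if not (row["date_of_birth"] or "").strip():
            errors.append((n, "date_of_birth is required"))    # not null check
    return errors
```

Reporting the physical row number alongside each error makes it easy to hand the rejected records back to the file producer.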
Automate flat file data type testing with ETL Validator
ETL Validator provides the capability to specify data type checks on the flat file in the flat file component. Based on the data types specified, ETL Validator automatically checks all the records in the incoming flat file to find any invalid records.
FLAT FILE DATA QUALITY TESTING
The purpose of Data Quality tests is to verify the accuracy of the data in the inbound flat files. Duplicate Data Checks: Check for duplicate rows in the inbound flat file with the same unique key column or a unique combination of columns as per business requirement. Example: The business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique for the Customer list flat file. Sample query to identify duplicates (assuming that the flat file data can be imported into a database table):
SELECT fst_name, lst_name, mid_name, date_of_birth, count(1) FROM Customer GROUP BY fst_name, lst_name, mid_name, date_of_birth HAVING count(1) > 1
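If loading the file into a database is not convenient, the same duplicate check can be done in memory. The business-key columns below follow the example; the dictionary field names are illustrative:

```python
from collections import Counter

def duplicate_keys(rows):
    """rows: iterable of dicts. Return business keys (first, last, middle
    name plus date of birth) that appear more than once."""
    counts = Counter(
        (r["fst_name"], r["lst_name"], r["mid_name"], r["date_of_birth"])
        for r in rows
    )
    return [key for key, c in counts.items() if c > 1]
```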
Reference Data Checks: Flat file standards may dictate that the values in certain columns should adhere to values in a domain. Verify that the values in the inbound flat file conform to reference data standards.
Example: Values in the country_code column should have a valid country code from a Country Code domain.
SELECT DISTINCT country_code FROM address MINUS SELECT country_code FROM country
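The MINUS query maps directly to a set difference, so the same reference check can be sketched without a database:

```python
def invalid_domain_values(file_values, reference_values):
    """Return values that appear in the flat file column but are missing
    from the reference domain (the set equivalent of a MINUS query)."""
    return sorted(set(file_values) - set(reference_values))
```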
Data Validation Rules: Many data fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values violating the validation rules defined represent a quality gap that can impact inbound flat file processing. Example: Date of birth (DOB). This is defined as the DATE datatype and can assume any valid date. However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also, the date of birth of a child should not be earlier than that of their parents.
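As a sketch, the date-of-birth rules described above might be coded like this. The 100-year cutoff and the parent comparison come from the example; a fixed `today` parameter is an added convenience that keeps the check deterministic in tests:

```python
from datetime import date
from typing import List, Optional

def dob_rule_violations(dob: date, parent_dob: Optional[date] = None,
                        today: Optional[date] = None) -> List[str]:
    """Flag date-of-birth values that violate the validation rules."""
    today = today or date.today()
    violations = []
    if dob > today:
        violations.append("DOB in the future")
    if dob.year < today.year - 100:
        violations.append("DOB more than 100 years in the past")
    if parent_dob is not None and dob < parent_dob:
        violations.append("child DOB earlier than parent DOB")
    return violations
```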
Data Integrity Checks: This check addresses "keyed" relationships of entities within a domain. The goal is to identify orphan records in the child entity with a foreign key to the parent entity. 1. Count of records with null foreign key values in the flat file. 2. Count of invalid foreign key values in the flat file that do not have a corresponding primary key in the parent flat file or database table.
Example: Consider a file import process for a CRM application which imports contact lists for existing Accounts. The contact lists are CSV files with a column having the corresponding account_id. Let's assume that the contact list can be loaded into a database table for the purpose of validation. 1. Count of null or unspecified account keys in the contact list:
SELECT count(*) FROM contacts WHERE account_id IS NULL
2. Count of invalid foreign key values in the contact list:
SELECT account_id FROM contacts MINUS SELECT s.account_id FROM accounts s, contacts c WHERE s.account_id = c.account_id
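The two counts above can also be reproduced in memory once the contact list is parsed; the `account_id` field name follows the example, and empty strings are treated as unspecified keys:

```python
def foreign_key_issues(contacts, account_ids):
    """contacts: list of dicts with an 'account_id' field; account_ids: set
    of valid parent keys. Return (null_count, orphan_ids)."""
    null_count = sum(1 for c in contacts if not c.get("account_id"))
    orphans = sorted({c["account_id"] for c in contacts
                      if c.get("account_id") and c["account_id"] not in account_ids})
    return null_count, orphans
```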
Automate flat file data quality testing using ETL Validator
ETL Validator supports defining data quality rules in the Flat File Component for automating data quality testing without writing any database queries. Custom rules can be defined and added to the Data Model template.
FLAT FILE DATA COMPLETENESS TESTING
Data in the inbound flat files is generally processed and loaded into a database. In some cases the output may also be another flat file. The purpose of Data Completeness tests is to verify that all the expected data is loaded in the target from the inbound flat file. Some of the tests that can be run are: Compare and validate counts, aggregates (min, max, sum, avg) and actual data between the flat file and target. Record Count Validation: Compare the count of records in the flat file and the database table. Check for any rejected records. Example: A simple count of records comparison between the source and target tables.
Source Query (assuming the flat file data is loaded into a "customer" table for validation): SELECT count(1) src_count FROM customer
Target Query: SELECT count(1) tgt_count FROM customer_dim
Column Data Profile Validation: Column or attribute level data profiling is an effective tool to compare source and target data without actually comparing the entire data. It is similar to comparing the checksum of your source and target data. These tests are essential when testing large amounts of data. Some of the common data profile comparisons that can be done between the flat file and target are: Compare unique values in a column between the flat file and target. Compare max, min, avg, max length and min length values for columns depending on the data type. Compare null values in a column between the flat file and target. For important columns, compare the data distribution (frequency) in a column between the flat file and target.
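A minimal per-column profile of the kind listed above can be computed as follows; comparing the source profile with the target profile avoids moving the full data set:

```python
def column_profile(values):
    """Profile one column: count of non-null values, nulls, distinct values,
    and min/max. Equal profiles are a cheap proxy for equal data."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(non_null),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
```

A mismatch in any field of the profile pinpoints which column, and which statistic, to investigate further.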
Example 1: Compare column counts with values (non-null values) between source and target for each column based on the mapping.
Source Query (assuming the flat file data is loaded into a "customer" table for validation): SELECT count(row_id), count(fst_name), count(lst_name), avg(revenue) FROM customer
Target Query: SELECT count(row_id), count(first_name), count(last_name), avg(revenue) FROM customer_dim
Example 2: Compare the number of customers by country between the source and target.
Source Query (assuming the flat file data is loaded into a "customer" table for validation): SELECT country, count(*) FROM customer GROUP BY country
Target Query: SELECT country_cd, count(*) FROM customer_dim GROUP BY country_cd
Compare entire flat file and target data: Compare data (values) between the flat file and target data, effectively validating 100% of the data. In regulated industries such as finance and pharma, 100% data validation might be a compliance requirement. It is also a key requirement for data migration projects. However, performing 100% data validation is a challenge when large volumes of data are involved. This is where ETL testing tools such as ETL Validator can be used because they have an inbuilt ELV engine (Extract, Load, Validate) capable of comparing large volumes of data. Example: Write a source query on the flat file that matches the data in the target table after transformation.
Source Query (assuming the flat file data is loaded into a "customer" table for validation): SELECT cust_id, fst_name, lst_name, fst_name||' '||lst_name, DOB FROM Customer
Target Query: SELECT integration_id, first_name, last_name, full_name, date_of_birth FROM Customer_dim
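Once both result sets are fetched, the comparison can be done row by row. The full-name concatenation and the column names below mirror the example queries; keying a dictionary on the integration id is an assumed join strategy for the sketch:

```python
def transform_source(row):
    """Apply the expected transformation to one source row."""
    return {
        "integration_id": row["cust_id"],
        "first_name": row["fst_name"],
        "last_name": row["lst_name"],
        "full_name": row["fst_name"] + " " + row["lst_name"],
        "date_of_birth": row["dob"],
    }

def full_data_compare(source_rows, target_rows):
    """Return integration ids whose transformed source row differs from
    the target row (or is missing from the target)."""
    target_by_id = {t["integration_id"]: t for t in target_rows}
    mismatches = []
    for s in source_rows:
        expected = transform_source(s)
        if target_by_id.get(expected["integration_id"]) != expected:
            mismatches.append(expected["integration_id"])
    return mismatches
```

This brute-force approach is exactly what becomes impractical at scale, which is why the tooling mentioned above stages the data and compares it in bulk instead.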
Automate flat file data completeness testing using ETL Validator
ETL Validator comes with Flat File Component and Data Profile Component as part of the Component Test Case for automating the comparison of flat file and target data. It takes care of loading the flat file data into a table for running validations. Data Profile Component: Automatically computes the profile of the flat file data and target query results - count, count distinct, nulls, avg, max, min, max length and min length. Component Test Case: Provides a visual test case builder that can be used to compare multiple flat files and targets.
FLAT FILE DATA TRANSFORMATION TESTING
Data in the inbound flat file is transformed by the consuming process and loaded into the target (table or file). It is important to test the transformed data. There are two approaches for testing transformations - white box testing and black box testing. Transformation testing using the White Box approach: White box testing is a testing technique that examines the program structure and derives test data from the program logic/code. For transformation testing, this involves reviewing the transformation logic from the flat file data ingestion design document and corresponding code to come up with test cases. The steps to be followed are listed below: Review the transformation design document. Apply transformations on the flat file data using SQL or a procedural language such as PL/SQL to reflect the ETL transformation logic. Compare the results of the transformed data with the data in the target table or target flat file. The advantage of this approach is that the tests can be rerun easily on a larger data set. The disadvantage of this approach is that the tester has to reimplement the transformation logic.
Example: In a financial company, the interest earned on the savings account is dependent on the daily balance in the account for the month. The daily balance for the month is part of an inbound CSV file for the process that computes the interest. 1. Review the requirement and design for calculating the interest. 2. Implement the logic using your favorite programming language. 3. Compare your output with data in the target table.
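A white-box reimplementation of step 2 might look like the sketch below. The average-daily-balance formula and the monthly division of the annual rate are assumptions standing in for whatever the actual design document specifies:

```python
def monthly_interest(daily_balances, annual_rate):
    """Interest for the month based on the average daily balance (an
    assumed formula); compare the result against the target table value."""
    average_daily_balance = sum(daily_balances) / len(daily_balances)
    return round(average_daily_balance * annual_rate / 12, 2)
```

Because this is an independent reimplementation, any disagreement with the target table flags either a defect in the ETL logic or a misreading of the design document, and both findings are useful.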
Transformation testing using the Black Box approach: Black-box testing is a method of software testing that examines the functionality of an application without peering into its internal structures or workings. For transformation testing, this involves reviewing the transformation logic from the mapping design document and setting up the test data appropriately. The steps to be followed are listed below:
Review the requirements document to understand the transformation requirements. Prepare test data in the flat file to reflect different transformation scenarios. Come up with the transformed data values or the expected values for the test data from the previous step. Compare the results of the transformed test data in the target table with the expected values. The advantage of this approach is that the transformation logic does not need to be reimplemented during the testing. The disadvantage of this approach is that the tester needs to set up test data for each transformation scenario and come up with the expected values for the transformed data manually.
Example: In a financial company, the interest earned on the savings account is dependent on the daily balance in the account for the month. 1. Review the requirement for calculating the interest. 2. Set up test data in the flat file for various scenarios of daily account balance. 3. Compare the transformed data in the target table with the expected values for the test data.
Automate data transformation testing using ETL Validator
ETL Validator comes with Component Test Case which can be used to test transformations using the White Box approach or the Black Box approach. Visual Test Case Builder: The component test case has a visual test case builder that makes it easy to rebuild the transformation logic for testing purposes. Workschema: ETL Validator's workschema stores the test data from source and target queries. This makes it easy for the tester to implement transformations and compare using a Script Component. Benchmark Capability: Makes it easy to baseline the target table (expected data) and compare the latest data with the baselined data.
FLAT FILE INGESTION PERFORMANCE TESTING
The goal of performance testing is to validate that the process consuming the inbound flat files is able to handle flat files with the expected data volumes and inbound arrival frequency.
Example 1: The process ingesting the flat file might perform well when there are only a few records in the file but perform badly when there is a large number of rows.
Example 2: The flat file ingestion process may also perform badly as the data volumes increase in the target table.
End-to-End Data Testing of Flat File Ingestion: Integration testing of the inbound flat file ingestion process and the related applications involves the following steps: Estimate the expected data volumes in each of the source flat files for the consuming process for the next 1-3 years. Set up test data for performance testing either by generating sample flat files or by getting sample flat files. Execute the flat file ingestion process to load the test data into the target. Execute the flat file ingestion process again with large data in the target tables to identify bottlenecks.
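The test-data setup step above can be scripted when representative sample files are not available. The columns below are illustrative placeholders, and the fixed seed keeps runs reproducible so ingestion timings can be compared across runs:

```python
import csv
import random
import string

def generate_rows(n, seed=42):
    """Yield n synthetic (id, name, amount) rows for load-testing the
    ingestion process; columns are illustrative placeholders."""
    rng = random.Random(seed)
    for i in range(n):
        name = "".join(rng.choices(string.ascii_uppercase, k=8))
        yield [i, name, round(rng.uniform(1.0, 10000.0), 2)]

def write_sample_file(path, n):
    """Write a header row plus n generated rows to a CSV flat file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "amount"])
        writer.writerows(generate_rows(n))
```

Generating files at the volumes estimated for the next few years, rather than today's volumes, is what makes the resulting timings a meaningful capacity test.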