Data Structures and CAATTs for Data Extraction.docx

Data Structures and CAATTs for Data Extraction

Data Structures (TRICIA)

Two fundamental components: 

Organization: the way records are physically arranged on the secondary storage device



Access method: the technique used to locate records and to navigate through the database or file

File Processing Operations

1. Retrieve a record by key
2. Insert a record
3. Update a record
4. Read a file
5. Find next record
6. Scan a file
7. Delete a record

Data Structures

Flat file structures 

Sequential structure [Figure 8-1]









All records in contiguous storage spaces in specified sequence (key field)



Sequential files are simple & easy to process



Application reads from beginning in sequence



If only small portion of file being processed, inefficient method



Does not permit accessing a record directly



Efficient: 4, 5 – sometimes 3



Inefficient: 1, 2, 6, 7 – usually 3
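
A minimal Python sketch of why retrieving a record by key is slow in a sequential structure: the only way to reach a record is to read from the beginning. The record layout and the `find_by_key` name are illustrative assumptions, not part of the source.

```python
# Records held in key order, standing in for a sequential file on disk.
records = [
    {"key": 101, "name": "Bolt"},
    {"key": 105, "name": "Nut"},
    {"key": 110, "name": "Washer"},
]

def find_by_key(file_records, key):
    """Read from the first record onward; return (record, reads_needed)."""
    reads = 0
    for record in file_records:
        reads += 1
        if record["key"] == key:
            return record, reads
        if record["key"] > key:  # sorted by key, so the record cannot follow
            break
    return None, reads

record, reads = find_by_key(records, 110)
```

Finding the last record costs as many reads as there are records, which is why the structure suits reading the whole file rather than picking out single records.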

Indexed structure 

In addition to data file, separate index file



Contains physical address in data file of each indexed record

Indexed random file [Figure 8-2] 

Records are created without regard to physical proximity to other related records



Physical organization of index file itself may be sequential or random



Random indexes are easier to maintain, sequential more difficult



Advantage over sequential: rapid searches



Other advantages: processing individual records, efficient usage of disk storage



Efficient: 1, 2, 3, 7



Inefficient: 4
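
The idea of a separate index can be sketched in Python as follows. The addresses, the dictionary-based "files", and the function names are hypothetical stand-ins for disk structures.

```python
# Data "file": records sit wherever space was free, in no key order.
data_file = {
    700: {"key": 110, "name": "Washer"},
    212: {"key": 101, "name": "Bolt"},
    455: {"key": 105, "name": "Nut"},
}

# Separate index "file": key value -> address of the record in the data file.
index = {rec["key"]: address for address, rec in data_file.items()}

def retrieve(key):
    """Look up the address in the index, then go straight to the record."""
    address = index.get(key)
    return None if address is None else data_file[address]

def read_file_in_key_order():
    """Reading the whole file means one index lookup per record."""
    return [data_file[index[k]] for k in sorted(index)]
```

Single-record retrieval is one index lookup plus one direct access, while reading the entire file in key order needs a lookup for every record, which matches the efficiency ratings above.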

Virtual Storage Access Method (VSAM) [Figure 8-3] 

Large files, routine batch processing



Moderate degree of individual record processing



Used for files across cylinders



Uses number of indexes, with summarized content



Access time for single record is slower than Indexed Sequential or Indexed Random



Disadvantage: does not perform record insertions efficiently – requires physical relocation of all records beyond that point – SOS



Has 3 physical components: indexes, prime data storage area, overflow area [Figure 8-4]



Might have to search index, prime data area, and overflow area – slowing down access time



ISAM files are reorganized by integrating overflow records into the prime data area and then reconstructing the indexes



Very Efficient: 4, 5, 6



Moderately Efficient: 1, 3



Inefficient: 2, 7



Hashing Structure

Employs algorithm to convert primary key into physical record storage address [Figure 8-5]

No separate index necessary

Advantage: access speed

Disadvantages:

Inefficient use of storage

Different keys may create same address

Efficient: 1, 2, 3, 6

Inefficient: 4, 5, 7
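
A short Python sketch of the hashing idea. The source does not name a specific algorithm, so division-remainder hashing is used here as one common choice; the slot count and key values are made up for illustration.

```python
NUM_SLOTS = 7  # hypothetical number of storage locations

def hash_address(primary_key):
    """Division-remainder hashing: the key itself yields the address."""
    return primary_key % NUM_SLOTS

storage = [None] * NUM_SLOTS
collisions = []

for key in (15, 22, 30):
    address = hash_address(key)
    if storage[address] is None:
        storage[address] = key
    else:
        collisions.append((key, address))  # a real system needs a collision policy
```

Keys 15 and 22 hash to the same address, showing the collision problem, and most slots stay empty, showing the inefficient use of storage.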

  

Pointer Structure



Stores the address (pointer) of related record in a field with each data record [Figure 8-6] 

Records stored randomly



Pointers provide connections b/w records



Pointers may also provide links of records b/w files [Figure 8-7]



Types of pointers [Figure 8-8]:



Physical address – actual disk storage location
• Advantage: access speed
• Disadvantage: if the related record moves, the pointer must be changed; without a logical reference, a pointer could be lost, causing the referenced record to be lost

Relative address – relative position in the file (e.g., the 135th record)
• Must be manipulated to convert to physical address

Logical address – primary key of related record
• Key value is converted by hashing to physical address



Efficient: 1, 2, 3, 6



Inefficient: 4, 5, 7
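
The pointer structure and the relative-address conversion can be sketched in Python as below. The record length, file start address, and record contents are assumptions chosen only to make the arithmetic visible.

```python
RECORD_LENGTH = 100   # assumed fixed record size in bytes
FILE_START = 4000     # assumed physical address of the first record

def relative_to_physical(relative_position):
    """Convert a 1-based relative position into a physical byte address."""
    return FILE_START + (relative_position - 1) * RECORD_LENGTH

# Each record carries a pointer (relative position) to the next related record.
records = {
    1: {"key": 124, "next": 3},
    2: {"key": 126, "next": None},
    3: {"key": 125, "next": 2},
}

def follow_chain(start):
    """Walk the linked list by following pointers from record to record."""
    keys, position = [], start
    while position is not None:
        keys.append(records[position]["key"])
        position = records[position]["next"]
    return keys
```

Under these assumptions the 135th record sits at byte 17400, and following the chain from record 1 returns the keys in logical order even though the records are stored out of order.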

Database Conceptual Models (JOSIA) 

Refers to the particular method used to organize records in a database.

◦ a.k.a. “logical data structures”



Objective: develop the database efficiently so that data can be accessed quickly and easily.



There are three main models:



◦ hierarchical (tree structure)

◦ network

◦ relational

Most existing databases are relational. Some legacy systems use hierarchical or network databases.

The Relational Model 

The relational model portrays data in the form of two dimensional ‘tables’.



Its strength is the ease with which tables may be linked to one another.

◦ linking is a major weakness of hierarchical and network databases

Relational model is based on the relational algebra functions of restrict, project, and join.

The Relational Algebra Functions Restrict, Project, and Join
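
A minimal Python sketch of the three functions, with tables modelled as lists of dictionaries. The table contents and column names are hypothetical.

```python
def restrict(table, predicate):
    """Keep only the rows that satisfy the condition."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in table]

def join(left, right, key):
    """Combine rows from two tables that share the same value of `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

customers = [
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 2, "name": "Zenith"},
]
invoices = [
    {"inv_no": 900, "cust_id": 1, "amount": 250},
    {"inv_no": 901, "cust_id": 2, "amount": 75},
    {"inv_no": 902, "cust_id": 1, "amount": 40},
]

large = restrict(invoices, lambda row: row["amount"] > 50)
names = project(join(large, customers, "cust_id"), ["inv_no", "name"])
```

Restrict selects rows, project selects columns, and join builds a new table from two tables that share a common attribute.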

Associations and Cardinality 



Association

◦ Represented by a line connecting two entities

◦ Described by a verb, such as ships, requests, or receives

Cardinality – the degree of association between two entities

◦ The number of possible occurrences in one table that are associated with a single occurrence in a related table

◦ Used to determine primary keys and foreign keys

Examples of Entity Associations

Properly Designed Relational Tables 

Each row in the table must be unique in at least one attribute, which is the primary key.

◦ Tables are linked by embedding the primary key into the related table as a foreign key.



The attribute values in any column must all be of the same class or data type.



Each column in a given table must be uniquely named.



Tables must conform to the rules of normalization, i.e., free from structural dependencies or anomalies.

Three Types of Anomalies (RIZ GEL) 

Insertion Anomaly: A new item cannot be added to the table until at least one entity uses a particular attribute item.



Deletion Anomaly: If an attribute item used by only one entity is deleted, all information about that attribute item is lost.



Update Anomaly: A modification on an attribute must be made in each of the rows in which the attribute appears.



Anomalies can be corrected by creating additional relational tables.
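
The anomalies can be seen in a small Python sketch of an unnormalized inventory table. The part numbers, supplier numbers, and phone values are invented for illustration.

```python
# Unnormalized inventory table: supplier details repeat on every part row.
inventory = [
    {"part": 1, "supplier_no": 22, "supplier_phone": "555-0100"},
    {"part": 2, "supplier_no": 22, "supplier_phone": "555-0100"},
    {"part": 3, "supplier_no": 27, "supplier_phone": "555-0199"},
]

# Update anomaly: changing the phone on one row leaves a conflicting copy.
inventory[0]["supplier_phone"] = "555-0111"
phones_for_22 = {r["supplier_phone"] for r in inventory if r["supplier_no"] == 22}

# Deletion anomaly: dropping part 3 also erases all knowledge of supplier 27.
remaining = [r for r in inventory if r["part"] != 3]
suppliers_left = {r["supplier_no"] for r in remaining}

# Correction: hold supplier data once in its own table, linked by supplier_no.
suppliers = {22: {"phone": "555-0111"}, 27: {"phone": "555-0199"}}
parts = [{"part": 1, "supplier_no": 22}, {"part": 2, "supplier_no": 22}]
```

With a separate supplier table, a phone number is changed in one place and supplier 27 survives even when no part currently refers to it.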

Advantages of Relational Tables 

Removes all three types of anomalies.



Various items of interest (customers, inventory, sales) are stored in separate tables.



Space is used efficiently.



Very flexible – users can form ad hoc relationships.

The Normalization Process 



A process which systematically splits unnormalized complex tables into smaller tables that meet two conditions:

◦ all nonkey (secondary) attributes in the table are dependent on the primary key

◦ all nonkey attributes are independent of the other nonkey attributes

When unnormalized tables are split and reduced to third normal form, they must then be linked together by foreign keys.
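
As a rough illustration, the split and the foreign-key linkage might look like this in Python. The sales data and attribute names are hypothetical, and the sketch shows only the end result rather than the step-by-step passage through each normal form.

```python
# Unnormalized rows: one row per invoice line; customer and item data repeat.
rows = [
    {"inv_no": 900, "cust_id": 1, "cust_name": "Acme", "item": "A1", "descr": "Bolt", "qty": 10},
    {"inv_no": 900, "cust_id": 1, "cust_name": "Acme", "item": "B2", "descr": "Nut", "qty": 5},
    {"inv_no": 901, "cust_id": 2, "cust_name": "Zenith", "item": "A1", "descr": "Bolt", "qty": 3},
]

# Each attribute is placed in the table whose primary key determines it.
customer = {r["cust_id"]: {"cust_name": r["cust_name"]} for r in rows}
item = {r["item"]: {"descr": r["descr"]} for r in rows}
invoice = {r["inv_no"]: {"cust_id": r["cust_id"]} for r in rows}  # cust_id is a foreign key
line_item = {(r["inv_no"], r["item"]): {"qty": r["qty"]} for r in rows}  # composite key

def rebuild(inv_no, item_code):
    """Follow the foreign keys to reassemble one original row."""
    cust_id = invoice[inv_no]["cust_id"]
    return {
        "inv_no": inv_no, "cust_id": cust_id, **customer[cust_id],
        "item": item_code, **item[item_code], **line_item[(inv_no, item_code)],
    }
```

Every original row can be rebuilt from the four smaller tables, so no information is lost while the repeated customer and item data are stored only once.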

Steps in the Normalization Process

Accountants and Data Normalization 

Update anomalies can generate conflicting and obsolete database values.



Insertion anomalies can result in unrecorded transactions and incomplete audit trails.



Deletion anomalies can cause the loss of accounting records and the destruction of audit trails.



Accountants should understand the data normalization process and be able to determine whether a database is properly normalized.

Six Phases in Designing Relational Databases (SHELA)

1. Identify entities
• identify the primary entities of the organization
• construct a data model of their relationships

2. Construct a data model showing entity associations
• determine the associations between entities
• model associations into an ER diagram

3. Add primary keys and attributes
• assign primary keys to all entities in the model to uniquely identify records
• every attribute should appear in one or more user views

4. Normalize and add foreign keys
• remove repeating groups, partial and transitive dependencies
• assign foreign keys to be able to link tables

5. Construct the physical database
• create physical tables
• populate tables with data

6. Prepare the user views
• normalized tables should support all required views of system users
• user views restrict users from having access to unauthorized data
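
Phases 5 and 6 can be sketched with Python's built-in `sqlite3` module. The table names, columns, data, and the `billing_view` name are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        cust_id      INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        credit_limit REAL
    );
    CREATE TABLE invoice (
        inv_no   INTEGER PRIMARY KEY,
        cust_id  INTEGER NOT NULL REFERENCES customer(cust_id),
        amount   REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Acme', 5000), (2, 'Zenith', 1200);
    INSERT INTO invoice VALUES (900, 1, 250), (901, 2, 75);

    -- User view for billing staff: joins the tables, omits credit_limit.
    CREATE VIEW billing_view AS
        SELECT i.inv_no, c.name, i.amount
        FROM invoice AS i JOIN customer AS c ON c.cust_id = i.cust_id;
""")

view_rows = conn.execute("SELECT * FROM billing_view ORDER BY inv_no").fetchall()
view_columns = [d[0] for d in conn.execute("SELECT * FROM billing_view").description]
```

The view is assembled from the normalized tables through the foreign key, and a user limited to the view never sees the `credit_limit` column.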

Auditors and Data Normalization 

Database normalization is a technical matter that is usually the responsibility of systems professionals.

The subject has implications for internal control that make it the concern of auditors also.

Although most auditors will never be responsible for normalizing an organization’s databases, they should have an understanding of the process and be able to determine whether a table is properly normalized.

In order to extract data from tables to perform audit procedures, the auditor first needs to know how the data are structured.

Embedded Audit Module (EAM) 

Identify important transactions live while they are being processed and extract them [Figure 8-26]

Examples 

Errors



Fraud



Compliance

• SAS 109, SAS 94, SAS 99 / S-OX

Disadvantages: 

Operational efficiency – can decrease performance, especially if testing is extensive



Verifying EAM integrity – difficult in environments with a high level of program maintenance



Status: increasing need, demand, and usage of COA/EAM/CA
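
The EAM concept can be sketched in Python as an audit hook called from inside normal transaction processing. The threshold, the transaction fields, and the function names are hypothetical.

```python
MATERIALITY_THRESHOLD = 10_000  # hypothetical selection criterion set by the auditor

audit_file = []  # stands in for the separate audit review file

def embedded_audit_module(transaction):
    """Copy a transaction to the audit file if it meets the auditor's criteria."""
    if transaction["amount"] >= MATERIALITY_THRESHOLD:
        audit_file.append(dict(transaction))

def process_sale(transaction, ledger):
    """Normal production processing, with the audit hook called in-line."""
    ledger.append(transaction)
    embedded_audit_module(transaction)  # extra work on every transaction

ledger = []
for txn in [{"id": 1, "amount": 500}, {"id": 2, "amount": 12_500}]:
    process_sale(txn, ledger)
```

Because the hook runs on every transaction, heavier tests add processing overhead, and any change to `process_sale` during program maintenance is a chance for the hook to be altered or dropped.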

Generalized Audit Software (GAS) 

Brief history

• 1950-1967 – nascent field, few tools or techniques (e.g., K. Davis in Viet Nam)

• October 1967 – Haskins & Sells, Ken Stringer, AUDITAPE

• 1967-1970 – AICPA efforts for one GAS; Big 8 each developed their own

• c. 1970 – first commercial GAS – CARS

• 2000 – commercial GAS is commonplace

• Importance of GAS in history of IS auditing



Popular because:

o GAS software is easy to use and requires little computer background

o Many products are platform independent, working on mainframes and PCs

o Auditors can perform tests independently of IT staff

o GAS can be used to audit the data currently being stored in most file structures and formats



Simple structures [Figure 8-27]



Complex structures [Figures 8-28, 8-29]



Auditing issues: 

Auditor must sometimes rely on IT personnel to produce files/data



Risk that data integrity is compromised by extraction procedures



Auditors skilled in programming better prepared to avoid these pitfalls

ACL 

ACL is a proprietary version of GAS



Leader in the industry



Designed as an auditor-friendly meta-language (i.e., contains commonly used auditor tests)



Access to data generally easy with ODBC interface
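
To give a feel for the kind of commonly used auditor tests mentioned above, here is a Python sketch of a duplicates test and a gaps test on invoice numbers. This is not ACL syntax; the function names and the sample numbers are illustrative assumptions.

```python
from collections import Counter

invoice_numbers = [1001, 1002, 1002, 1004, 1006]

def find_duplicates(numbers):
    """Return values that appear more than once, in ascending order."""
    counts = Counter(numbers)
    return sorted(n for n, c in counts.items() if c > 1)

def find_gaps(numbers):
    """Return numbers missing from the range between the lowest and highest value."""
    present = set(numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]
```

A duplicate invoice number may point to a double billing, and a gap may point to an unrecorded or removed document, so both results would be followed up with further audit work.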
