Chapter 3
Data Warehouse and OLAP Technology: An Overview

3.7 Exercises
1. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach.

Answer:

For decision-making queries and frequently asked queries, the update-driven approach is preferable, because expensive data integration and aggregate computation are done before query processing time. Before the data collected in multiple heterogeneous databases can be used in decision-making processes, the data must be integrated and summarized, with the semantic heterogeneity problems among the databases analyzed and solved. Under the query-driven approach, each query is translated into multiple (often complex) queries against the individual databases. The translated queries compete for resources with the activities at the local sites, degrading their performance. In addition, these queries generate a complex answer set, which requires further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The update-driven approach employed in data warehousing is faster and more efficient, since most of the required integration and aggregation work can be done off-line.

For queries that are used rarely, reference the most current data, and/or do not require aggregations, the query-driven approach is preferable. In this case, it may not be justifiable for an organization to pay the heavy expenses of building and maintaining a data warehouse if only a small number of databases (and/or databases of relatively small size) are used, or if the queries rely on the most current data, which a data warehouse does not contain.
2. Briefly compare the following concepts. You may use an example to explain your point(s).

(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Discovery-driven cube, multifeature cube, virtual warehouse

Answer:

(a) Snowflake schema, fact constellation, starnet query model
The snowflake schema and fact constellation are both variants of the star schema model, which consists of a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension tables, whereas the fact constellation contains a set of fact tables that share some common dimension tables. A starnet query model is a query model (not a schema model) consisting of a set of radial lines emanating from a central point. Each radial line represents one dimension, each point along the line (called a "footprint") represents a level of that dimension, and each step going out from the center represents stepping down a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations.

(b) Data cleaning, data transformation, refresh

Data cleaning is the process of detecting errors in the data and rectifying them when possible. Data transformation is the process of converting the data from heterogeneous sources to a unified data warehouse format or semantics. Refresh is the function that propagates updates from the data sources to the warehouse.

(c) Discovery-driven cube, multifeature cube, virtual warehouse

A discovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all levels of aggregation, guiding the user in the data analysis process. A multifeature cube computes complex queries involving multiple dependent aggregates at multiple granularities (e.g., to find the total sales for every item having a maximum price, we apply the aggregate function SUM to the tuple set output by the aggregate function MAX). A virtual warehouse is a set of views (containing data warehouse schema, dimension, and aggregate measure definitions) over operational databases.
3. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.

(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).

Answer:

(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellation schema.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
A star schema is shown in Figure 3.1.
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
The operations to be performed are:
• Roll-up on time from day to year.
• Slice for time = 2004.
• Roll-up on patient from individual patient to all.
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).
[Figure 3.1 shows a star schema with a central fact table fee(time_key, doctor_id, patient_id, charge, count) and three dimension tables: time(time_key, day, day_of_week, month, quarter, year), doctor(doctor_id, doctor_name, phone#, address, sex), and patient(patient_id, patient_name, phone#, sex, description, address).]
Figure 3.1: A star schema for the data warehouse of Exercise 3.

select doctor, SUM(charge)
from fee
where year = 2004
group by doctor
4. Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.

(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?
(c) If each dimension has five levels (including all), such as "student < major < status < university < all", how many cuboids will this cube contain (including the base and apex cuboids)?

Answer:

(a) Draw a snowflake schema diagram for the data warehouse.
A snowflake schema is shown in Figure 3.2.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations should one perform in order to list the average grade of CS courses for each Big-University student?
The specific OLAP operations to be performed are:
• Roll-up on course from course_id to department.
• Roll-up on student from student_id to university.
• Dice on course, student with department = "CS" and university = "Big-University".
• Drill-down on student from university to student_name.
[Figure 3.2 shows a snowflake schema with a central fact table univ(student_id, course_id, semester_id, instructor_id, count, avg_grade) and dimension tables course(course_id, course_name, department), semester(semester_id, semester, year), instructor(instructor_id, dept, rank), and student(student_id, student_name, area_id, major, status, university), where student references a further normalized dimension table area(area_id, city, province, country).]
Figure 3.2: A snowflake schema for the data warehouse of Exercise 4.
(c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?
This cube will contain 5^4 = 625 cuboids, since each of the four dimensions can independently sit at any of its five levels.
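As a quick sanity check, the count is just the product of the number of levels per dimension. A minimal Python sketch (the dimension names are labels for illustration only):

levels_per_dim = {"student": 5, "course": 5, "semester": 5, "instructor": 5}

n_cuboids = 1
for levels in levels_per_dim.values():
    n_cuboids *= levels  # each dimension independently sits at one of its 5 levels

print(n_cuboids)  # 625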
5. Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.

Answer:

(a) Draw a star schema diagram for the data warehouse.
A star schema is shown in Figure 3.3.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
The specific OLAP operations to be performed are:
• Roll-up on date from date_id to year.
• Roll-up on game from game_id to all.
• Roll-up on location from location_id to location_name.
• Roll-up on spectator from spectator_id to status.
• Dice with status = "students", location_name = "GM Place", and year = 2004.
[Figure 3.3 shows a star schema with a central fact table sales(date_id, spectator_id, game_id, location_id, count, charge) and four dimension tables: date(date_id, day, month, quarter, year), game(game_id, game_name, description, producer), spectator(spectator_id, spectator_name, status, phone, address), and location(location_id, location_name, phone#, street, city, province, country).]
Figure 3.3: A star schema for the data warehouse of Exercise 5.

(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.
Bitmap indexing is advantageous for low-cardinality domains. For example, if the dimension location in this cube is bitmap indexed, comparison, join, and aggregation operations over location are reduced to bit arithmetic, which substantially reduces the processing time. Furthermore, long location-name strings can be represented by a single bit per record, which leads to a significant reduction in space and I/O. For high-cardinality dimensions, such as date in this example, the bitmap index vectors become very long. For example, a 10-year collection of data could result in 3,650 date records, meaning that every tuple in the fact table would require 3,650 bits, or approximately 456 bytes, of bitmap index.
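To make the "reduced to bit arithmetic" point concrete, here is a minimal Python sketch of a bitmap index over this cube's low-cardinality columns. The row values (including the non-GM-Place venue) are invented for illustration; each distinct value gets one bit vector, stored here as a Python integer, with one bit per fact-table row.

from typing import Dict, List

def build_bitmap_index(column: List[str]) -> Dict[str, int]:
    """Build one bit vector (stored as an int) per distinct column value."""
    index: Dict[str, int] = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

status = ["student", "adult", "student", "senior", "student"]
location = ["GM Place", "GM Place", "Rogers Arena", "GM Place", "GM Place"]

status_idx = build_bitmap_index(status)
location_idx = build_bitmap_index(location)

# The selection "status = student AND location = GM Place" is a bitwise AND:
match = status_idx["student"] & location_idx["GM Place"]
row_ids = [i for i in range(len(status)) if (match >> i) & 1]
print(row_ids)  # [0, 4]

For the high-cardinality date dimension, by contrast, the index would hold thousands of such vectors, which is exactly the space problem noted above.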
6. [Tao Cheng] A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.

Answer:

The two schemas are similar in that both have a fact table together with some dimension tables. The major difference is that some dimension tables in the snowflake schema are normalized, further splitting the data into additional tables. The advantage of the star schema is its simplicity, which enables efficiency, but it requires more space. The snowflake schema reduces some redundancy by sharing common tables: the tables are easy to maintain and save some space. However, it is less efficient, and the space saved is negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, the star schema is the better choice, simply because efficiency nowadays typically has higher priority over space, as long as the space used is not excessive. Indeed, in industry, to speed up processing, people sometimes "denormalize data from a snowflake schema into a star schema" [1]. Another option is that "some practitioners use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star" [2].

Some references for the answer to this question:
[1] Oracle Tip: Understand the difference between star and snowflake schemas in OLAP. http://builder.com.com/5100-6388-5221690.html
[2] Star vs. Snowflake Schemas. http://blogs.msdn.com/bi systems/articles/164525.aspx
7. Design a data warehouse for a regional weather bureau. The weather bureau has about 1,000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and on-line analytical processing, and derive general weather patterns in multidimensional space.

Answer:

Since the weather bureau has about 1,000 probes scattered throughout various land and ocean locations, we need to construct a spatial data warehouse so that a user can view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, and can dynamically drill down or roll up along any dimension to explore desired patterns. The star schema of this weather spatial data warehouse can be constructed as shown in Figure 3.4.
[Figure 3.4 shows a star schema with a central fact table weather(time, region, temperature, precipitation, region_map, area) and four dimension tables: date(time, day, month, quarter, year), temperature(temperature, temp_range, temp_description), precipitation(precipitation, prec_range, prec_description), and region(location_id, probe_location, district, city, county, region_name, province).]
Figure 3.4: A star schema for the weather spatial data warehouse of Exercise 7.

To construct this spatial data warehouse, we may need to integrate spatial data from heterogeneous sources and systems. Fast and flexible on-line analytical processing in the spatial data warehouse is an important factor. There are three types of dimensions in a spatial data cube: nonspatial dimensions, spatial-to-nonspatial dimensions, and spatial-to-spatial dimensions. We distinguish two types of measures in a spatial data cube: numerical measures and spatial measures. A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contains spatial dimensions but no spatial measures, its OLAP operations (such as drilling or pivoting) can be implemented in a manner similar to that of nonspatial data cubes. If a user needs to use spatial measures in a spatial data cube, we can selectively precompute some of those measures. Which portion of the cube should be selected for materialization depends on the utility (such as access frequency or access priority), the sharability of merged regions, and the balanced overall cost of space and on-line computation.
8. A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse, multidimensional matrix.

(a) Present an example illustrating such a huge and sparse data cube.
(b) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures.
(c) Modify your design in (b) to handle incremental data updates. Give the reasoning behind your new design.

Answer:

(a) Present an example illustrating such a huge and sparse data cube.
An example of a huge and sparse data cube is one generated from a telephone company billing database that keeps records on each customer's billing information, such as contact information, payment methods, date of payment, and detailed call records. For the telephone company, it would be very expensive to keep detailed call records for every customer for longer than three months. Therefore, it would be beneficial to remove that information from the database, keeping only the total number of calls made, the total minutes billed, and the amount billed, for example. The resulting computed data cube for the billing database would have large amounts of missing or removed data, resulting in a huge and sparse data cube.

(b) Design an implementation method that can elegantly overcome this sparse matrix problem.
A way to overcome the sparse matrix problem is to use multiway array aggregation. (Note: this answer is based on the paper Y. Zhao, P. M. Deshpande, and J. F. Naughton, "An array-based algorithm for simultaneous multidimensional aggregates", in Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 159-170, Tucson, Arizona, May 1997.) The first step consists of partitioning the array-based cube into chunks, or subcubes, that are small enough to fit into the memory available for cube computation. Each of these chunks is first compressed to remove cells that do not contain any valid data, and is then stored as an object on disk. For storage and retrieval purposes, the "chunkID + offset" can be used as the cell address. The second step involves computing the aggregates by visiting cube cells in an order that minimizes the number of times each cell must be revisited, thereby reducing memory access and storage costs. By first sorting and computing the planes of the data cube in ascending order of their size, a smaller plane can be kept in main memory while fetching and computing only one chunk at a time for a larger plane.

(c) Modify your design in (b) to handle incremental data updates.
To handle incremental data updates, the data cube is first computed as described in (b). Then only the chunk that contains the cells with the new data is recomputed, without recomputing the entire cube, since an incremental update affects only one chunk at a time. The recomputed values are then propagated to the corresponding higher-level cuboids. Thus, incremental data updates can be performed efficiently.
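A minimal Python sketch of the "chunkID + offset" addressing described in (b). The dimension and chunk sizes below are purely illustrative (each dimension split into 10 chunks); only nonempty cells are stored, so sparse chunks cost almost nothing.

dim_sizes = (40, 400, 4000)   # |A|, |B|, |C| (illustrative)
chunk_sizes = (4, 40, 400)    # each dimension split into 10 chunks

def cell_address(coords):
    """Map global (a, b, c) coordinates to (chunk_id, offset)."""
    chunk_coords = [x // s for x, s in zip(coords, chunk_sizes)]
    local_coords = [x % s for x, s in zip(coords, chunk_sizes)]
    chunk_id = offset = 0
    for cc, lc, s, d in zip(chunk_coords, local_coords, chunk_sizes, dim_sizes):
        chunk_id = chunk_id * (d // s) + cc   # mixed-radix over chunks per dim
        offset = offset * s + lc              # mixed-radix within the chunk
    return chunk_id, offset

# Sparse storage: each chunk is a dict from offset to measure value.
chunks = {}
def store(coords, value):
    cid, off = cell_address(coords)
    chunks.setdefault(cid, {})[off] = value

store((3, 150, 2999), 42)
print(cell_address((3, 150, 2999)))  # (37, 60199)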
9. Regarding the computation of measures in a data cube:

(a) Enumerate three categories of measures, based on the kind of aggregate functions used in computing a data cube.
(b) For a data cube with the three dimensions time, location, and item, which category does the function variance belong to? Describe how to compute it if the cube is partitioned into many chunks.
Hint: The formula for computing variance is (1/N) Σ_{i=1}^{N} (x_i − x̄)², where x̄ is the average of the x_i's.
(c) Suppose the function is "top 10 sales." Discuss how to efficiently compute this measure in a data cube.

Answer:
(a) Enumerate three categories of measures, based on the kind of aggregate functions used in computing a data cube.
The three categories of measures are distributive, algebraic, and holistic.
(b) For a data cube with the three dimensions time, location, and item, which category does the function variance belong to? Describe how to compute it if the cube is partitioned into many chunks.
Hint: An equivalent formula for computing variance is (1/N) Σ_{i=1}^{N} (x_i)² − x̄², where x̄ is the average of the x_i's.
The function variance is algebraic. If the cube is partitioned into many chunks, the variance can be computed as follows: read the chunks one by one, keeping a running (1) count of tuples, (2) sum of (x_i)², and (3) sum of x_i. After reading all the chunks, compute the average x̄ as the sum of the x_i divided by the total count, then apply the formula in the hint to obtain the variance.
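A minimal one-pass sketch of this chunked computation; chunks here are just Python lists of measure values:

def chunked_variance(chunks):
    n = s = sq = 0
    for chunk in chunks:        # read the chunks one at a time
        for x in chunk:
            n += 1              # (1) count of tuples
            s += x              # (3) sum of x_i
            sq += x * x         # (2) sum of x_i squared
    mean = s / n
    return sq / n - mean * mean # variance = (1/N) Σ x_i² − x̄²

print(chunked_variance([[1, 2, 3], [4, 5]]))  # 2.0

Only the three running accumulators ever need to live in memory, which is what makes the measure algebraic rather than holistic.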
(c) Suppose the function is "top 10 sales." Discuss how to efficiently compute this measure in a data cube.
For each cuboid, use 10 units to register the top 10 sales found so far, and read the data in each cuboid once. If the sales amount in a tuple is greater than an existing one in the top-10 list, insert the new sales amount into the list and discard the smallest one in the list. The computation of a higher-level cuboid can be performed similarly, by propagating the top-10 cells of its corresponding lower-level cuboids.
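A sketch of that top-10 register using a min-heap, so that the smallest of the retained values is always the one compared against and discarded (k = 3 in the demo just to keep it short):

import heapq

def update_topk(register, amount, k=10):
    """Keep the k largest sales amounts seen so far in a min-heap."""
    if len(register) < k:
        heapq.heappush(register, amount)
    elif amount > register[0]:          # beats the current smallest kept value
        heapq.heapreplace(register, amount)

cuboid_register = []
for sale in [120, 340, 95, 560, 80, 410]:
    update_topk(cuboid_register, sale, k=3)
print(sorted(cuboid_register, reverse=True))  # [560, 410, 340]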
10. Suppose that we need to record three measures in a data cube: min, average, and median. Design an efficient computation and storage method for each measure, given that the cube allows data to be deleted incrementally (i.e., in small portions at a time).

Answer:

For min, keep a ⟨min_val, count⟩ pair for each cuboid cell to register the smallest value and the number of tuples attaining it. For each deleted tuple, if its value is greater than min_val, do nothing. Otherwise, decrement the count of the corresponding node; if the count goes down to zero, rescan to recalculate the pair.
For average, keep a ⟨sum, count⟩ pair for each cuboid cell. For each deleted tuple, decrement count, subtract the tuple's value from sum, and compute average = sum/count.
For median, keep a small number p of centered values (e.g., p = 10), plus two counts: up_count and down_count. Each removal may change a count or remove a centered value. If the median no longer falls among the centered values, recalculate the set; otherwise, the median can be easily computed from the centered values and the two counts.
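A sketch of the deletable registers for min and average as described above. The median register is omitted, since the text only outlines it; the recalc callback for min stands in for the rescan of the cell's tuples, whose details the text leaves open.

class MinRegister:
    """⟨min_val, count⟩ register for one cuboid cell."""
    def __init__(self, min_val, count):
        self.min_val, self.count = min_val, count
    def delete(self, value, recalc):
        if value > self.min_val:
            return                                # deletion cannot affect min
        self.count -= 1                           # value equals min_val
        if self.count == 0:
            self.min_val, self.count = recalc()   # rescan the cell's tuples

class AvgRegister:
    """⟨sum, count⟩ register for one cuboid cell."""
    def __init__(self, total, count):
        self.sum, self.count = total, count
    def delete(self, value):
        self.sum -= value
        self.count -= 1
    def average(self):
        return self.sum / self.count

avg = AvgRegister(total=100.0, count=4)
avg.delete(20.0)
print(avg.average())  # 80/3 ≈ 26.67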
11. In data warehouse technology, a multiple dimensional view can be implemented by a relational database technique (ROLAP), by a multidimensional database technique (MOLAP), or by a hybrid database technique (HOLAP).

(a) Briefly describe each implementation technique.
(b) For each technique, explain how each of the following functions may be implemented:
i. The generation of a data warehouse (including aggregation)
ii. Roll-up
iii. Drill-down
iv. Incremental updating
(c) Which implementation techniques do you prefer, and why?

Answer:
(a) Briefly describe each implementation technique.
A ROLAP implementation consists of intermediate servers that stand between a relational back-end server and client front-end tools, using a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. A MOLAP implementation consists of servers that support multidimensional views of data through array-based multidimensional storage engines, mapping multidimensional views directly to data cube array structures. A HOLAP implementation combines ROLAP and MOLAP technology: large volumes of detailed data and some very low-level aggregations are stored in a relational database, while some high-level aggregations are kept in a separate MOLAP store.

(b) For each technique, explain how each of the following functions may be implemented:

i. The generation of a data warehouse (including aggregation)
ROLAP: Using a ROLAP server, the generation of a data warehouse can be implemented by a relational or extended-relational DBMS using summary fact tables. The fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube.
MOLAP: In generating a data warehouse, the MOLAP technique uses multidimensional array structures to store data and multiway array aggregation to compute the data cubes.
HOLAP: The HOLAP technique typically uses a relational database to store the data and some low-level aggregations, and then uses MOLAP to store higher-level aggregations.

ii. Roll-up
ROLAP: To roll up on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to roll up the date dimension from day to month, select the record for which the day field contains the special value all. The value of the measure field (dollars_sold, for example) in this record will contain the subtotal for the desired roll-up.
MOLAP: To perform a roll-up in a data cube, simply climb up the concept hierarchy for the desired dimension. For example, one could roll up on the location dimension from city to country, which is more general.
HOLAP: The roll-up using the HOLAP technique will be similar to either ROLAP or MOLAP, depending on the techniques used in the implementation of the corresponding dimensions.

iii. Drill-down
ROLAP: To drill down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to drill down on the location dimension from country to province_or_state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field (dollars_sold, for example) in this record will contain the subtotal for the desired drill-down.
MOLAP: To perform a drill-down in a data cube, simply step down the concept hierarchy for the desired dimension. For example, one could drill down on the date dimension from month to day in order to group the data by day rather than by month.
HOLAP: The drill-down using the HOLAP technique is similar to either ROLAP or MOLAP, depending on the techniques used in the implementation of the corresponding dimensions.

iv. Incremental updating
ROLAP: To perform incremental updating, check whether the corresponding tuple is in the summary fact table. If not, insert it into the summary table and propagate the result up. Otherwise, update the value and propagate the result up.
MOLAP: To perform incremental updating, check whether the corresponding cell is in the MOLAP cuboid. If not, insert it into the cuboid and propagate the result up. Otherwise, update the value and propagate the result up.
HOLAP: Incremental updating is similar to either ROLAP or MOLAP, depending on the techniques used in the implementation of the corresponding dimensions.
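A tiny sketch of the summary-fact-table convention used in the ROLAP cases above, with hypothetical rows and a hypothetical dollars_sold measure: rows whose day field holds the special value all carry the month-level subtotals, so roll-up amounts to a simple selection.

# Each row: (day, month, year, dollars_sold); "all" marks an aggregated level.
summary_fact = [
    ("2004-06-01", "2004-06", "2004", 500.0),
    ("2004-06-02", "2004-06", "2004", 700.0),
    ("all",        "2004-06", "2004", 1200.0),  # day rolled up to month
    ("all",        "all",     "2004", 9800.0),  # month rolled up to year
]

month_totals = [r for r in summary_fact if r[0] == "all" and r[1] != "all"]
print(month_totals)  # [('all', '2004-06', '2004', 1200.0)]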
(c) Which implementation techniques do you prefer, and why?
HOLAP is often preferred, since it integrates the strengths of both ROLAP and MOLAP and avoids their shortcomings; if the cube is quite dense, however, MOLAP is often preferred. Also, if the data are sparse and the dimensionality is high, there will be too many cells (due to exponential growth), and in this case it is often desirable to compute iceberg cubes instead of materializing the complete cubes.
12. Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity.

(a) Users are mainly interested in four particular dimensions, each having three frequently accessed levels for rolling up and drilling down. How would you design a data cube structure to support this preference efficiently?
(b) At times, a user may want to drill through the cube, down to the raw data for one or two particular dimensions. How would you support this feature?

Answer:

(a) Users are mainly interested in four particular dimensions, each having three frequently accessed levels for rolling up and drilling down. How would you design a data cube structure to support this preference efficiently?
An efficient data cube structure to support this preference would use partial materialization, i.e., selective computation of cuboids. By computing only a proper subset of the whole set of possible cuboids, the total amount of storage space required is minimized while maintaining fast response times and avoiding redundant computation.
(b) At times, a user may want to drill through the cube, down to the raw data for one or two particular dimensions. How would you support this feature?
Since the user may want to drill through the cube for only one or two dimensions, this feature could be supported by computing the required cuboids on the fly. Since the user may need this feature only infrequently, the time required for computing aggregates on those one or two dimensions on the fly should be acceptable.
13. A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid. Assume that there are no concept hierarchies associated with the dimensions.

(a) What is the maximum number of cells possible in the base cuboid?
(b) What is the minimum number of cells possible in the base cuboid?
(c) What is the maximum number of cells possible (including both base cells and aggregate cells) in the data cube, C?
(d) What is the minimum number of cells possible in the data cube, C?

Answer:

(a) What is the maximum number of cells possible in the base cuboid?
p^n. This is the maximum number of distinct tuples that can be formed with p distinct values per dimension.
(b) What is the minimum number of cells possible in the base cuboid?
p. At least p tuples are needed to contain p distinct values per dimension. In this case, no tuple shares any value with another on any dimension.
(c) What is the maximum number of cells possible (including both base cells and aggregate cells) in the data cube, C?
(p + 1)^n. The argument is similar to that of part (a), but now there are p + 1 choices per dimension, because in addition to the p distinct values of each dimension we can also choose ∗.
(d) What is the minimum number of cells possible in the data cube, C?
(2^n − 1) × p + 1. The minimum occurs when each cuboid contains exactly p cells, except for the apex cuboid, which contains a single cell.
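The four bounds above, written as straightforward functions (n dimensions, p distinct values per dimension in the base cuboid):

def base_max(n, p): return p ** n             # (a) all value combinations
def base_min(n, p): return p                   # (b) p fully disjoint tuples
def cube_max(n, p): return (p + 1) ** n        # (c) p values plus * per dimension
def cube_min(n, p): return (2 ** n - 1) * p + 1  # (d) p cells per cuboid, 1 at apex

print(base_max(3, 2), base_min(3, 2), cube_max(3, 2), cube_min(3, 2))
# 8 2 27 15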
14. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).

Answer:

Information processing uses queries to find and report useful information, presented in crosstabs, tables, charts, or graphs. Analytical processing uses basic OLAP operations such as slice-and-dice, drill-down, roll-up, and pivoting on historical data to provide multidimensional analysis of data warehouse data. Data mining uses knowledge discovery to find hidden patterns and associations, construct analytical models, perform classification and prediction, and present the mining results using visualization tools.

The motivations behind OLAP mining are the following. The high quality of data (i.e., integrated, consistent, and cleaned data) in data warehouses serves as a valuable source for OLAP as well as for data mining. The information processing infrastructure surrounding data warehouses means that comprehensive information processing and data analysis infrastructures need not be constructed from scratch. OLAP-based exploratory data analysis can be realized by coupling on-line analytical mining with data/knowledge visualization tools, allowing users to traverse through a database, select portions of relevant data, analyze them at different granularities, and present knowledge/results in different forms. On-line selection of data mining functions gives users who may not know what kinds of knowledge they would like to mine the flexibility to select desired data mining functions and dynamically swap data mining tasks.
Chapter 4
Data Cube Computation and Data Generalization

4.5 Exercises
1. Assume a base cuboid of 10 dimensions contains only three base cells: (1) (a1, d2, d3, d4, ..., d9, d10), (2) (d1, b2, d3, d4, ..., d9, d10), and (3) (d1, d2, c3, d4, ..., d9, d10), where a1 ≠ d1, b2 ≠ d2, and c3 ≠ d3. The measure of the cube is count.
(a) How many nonempty cuboids will a full data cube contain?
(b) How many nonempty aggregate (i.e., non-base) cells will a full cube contain?
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is "count ≥ 2"?
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is obtained by replacing a ∗ in c by a non-∗ value) and d has the same measure value as c. A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube?
Answer:

(a) How many nonempty cuboids will a full data cube contain?
2^10 = 1,024.
(b) How many nonempty aggregate (i.e., non-base) cells will a full cube contain?
(1) Each base cell generates 2^10 − 1 nonempty aggregate cells, so in total there would be 3 × 2^10 − 3 cells if none overlapped. (2) However, 3 × 2^7 of these cells are each generated twice (and thus have count 2), and 1 × 2^7 cells, namely those of the form (∗, ∗, ∗, d4, ..., d10), are each generated three times (and thus have count 3). We must therefore remove 3 × 2^7 + 2 × 2^7 = 5 × 2^7 duplicate generations. (3) Hence the number of distinct nonempty aggregate cells is 3 × 8 × 2^7 − 5 × 2^7 − 3 = 19 × 2^7 − 3 = 2,429.
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is "count ≥ 2"?
Analysis: (1) (∗, ∗, d3, d4, ..., d9, d10) has count 2, since it is generated by both cell 1 and cell 2; similarly, (2) (∗, d2, ∗, d4, ..., d9, d10) has count 2 (cells 1 and 3); (3) (d1, ∗, ∗, d4, ..., d9, d10) has count 2 (cells 2 and 3); and (4) (∗, ∗, ∗, d4, ..., d9, d10) has count 3. Each of these four patterns keeps its count no matter which of the positions d4, ..., d10 are further generalized to ∗. Therefore, the iceberg cube contains 4 × 2^7 = 2^9 = 512 nonempty aggregate cells.
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is obtained by replacing a ∗ in c by a non-∗ value) and d has the same measure value as c. A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube?
There are 7 closed cells, as follows: (1) (a1, d2, d3, d4, ..., d9, d10) : 1; (2) (d1, b2, d3, d4, ..., d9, d10) : 1; (3) (d1, d2, c3, d4, ..., d9, d10) : 1; (4) (∗, ∗, d3, d4, ..., d9, d10) : 2; (5) (∗, d2, ∗, d4, ..., d9, d10) : 2; (6) (d1, ∗, ∗, d4, ..., d9, d10) : 2; and (7) (∗, ∗, ∗, d4, ..., d9, d10) : 3.
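These counts are small enough to verify by brute force. The Python sketch below enumerates every aggregate cell reachable from the three base cells, tallies the counts, and checks closedness via one-step specializations (sufficient here because count is monotone under specialization):

from itertools import product
from collections import Counter

# base cells: a1 ≠ d1, b2 ≠ d2, c3 ≠ d3
base = [("a1",) + tuple(f"d{i}" for i in range(2, 11)),
        ("d1", "b2") + tuple(f"d{i}" for i in range(3, 11)),
        ("d1", "d2", "c3") + tuple(f"d{i}" for i in range(4, 11))]

counts = Counter()
for cell in base:
    for mask in product([False, True], repeat=10):
        agg = tuple("*" if m else v for m, v in zip(mask, cell))
        counts[agg] += 1

aggregates = {c: n for c, n in counts.items() if c not in base}
print(len(aggregates))                                # 2429 = 19 * 2^7 - 3
print(sum(1 for n in aggregates.values() if n >= 2))  # 512 = 2^9

values = [set(cell[i] for cell in base) for i in range(10)]
def is_closed(c):
    # closed iff no one-step specialization has the same count
    for i, x in enumerate(c):
        if x == "*":
            for v in values[i]:
                d = c[:i] + (v,) + c[i + 1:]
                if counts.get(d, 0) == counts[c]:
                    return False
    return True

print(sum(1 for c in counts if is_closed(c)))         # 7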
2. There are several typical cube computation methods, such as multiway array computation [ZDN97], BUC (bottom-up computation) [BR99], and Star-Cubing [XHLW03]. Briefly describe these three methods (i.e., use one or two lines to outline the key points), and compare their feasibility and performance under the following conditions:

(a) computing a dense full cube of low dimensionality (e.g., less than 8 dimensions),
(b) computing an iceberg cube of around 10 dimensions with a highly skewed data distribution, and
(c) computing a sparse iceberg cube of high dimensionality (e.g., over 100 dimensions).

Answer:

Multiway: top-down, simultaneous array aggregation, sharing precomputed results, and minimizing memory requirements.
BUC: bottom-up, recursive partition-and-conquer, with shared sorting.
Star-Cubing: an integration of top-down and bottom-up computation using a star-tree structure, which enables simultaneous computation while allowing Apriori-style pruning.

(a) Computing a dense full cube of low dimensionality: both Multiway and Star-Cubing work fine and perform better than BUC.
(b) Computing an iceberg cube of around 10 dimensions with a highly skewed data distribution: Multiway cannot compute iceberg cubes, and Star-Cubing performs better than BUC on highly skewed data sets.
(c) Computing a sparse iceberg cube of high dimensionality: Multiway cannot compute iceberg cubes, and neither of the other two methods works efficiently on high-dimensional data; one should explore closed-cube and shell-fragment ideas instead.
3. [Contributed by Chen Chen] Suppose a data cube has D dimensions, and the base cuboid contains k distinct tuples.

(a) Present a formula to calculate the minimum number of cells that the cube, C, may contain.
(b) Present a formula to calculate the maximum number of cells that C may contain.
(c) Answer parts (a) and (b) above as if the count in each cube cell must be no less than a threshold, v.
(d) Answer parts (a) and (b) above as if only closed cells are considered (with the minimum count threshold, v).

Answer:
(a) Present a formula to calculate the minimum number of cells that the cube, C, may contain.
To achieve the minimum, we need to "merge" the k distinct tuples as quickly as possible, so that the higher-level cuboids contain as few cells as possible (the base cuboid always contains k cells). Consider two extreme cases:
i. If we drop one specific dimension, say A, the k tuples are immediately "merged" into one. This means the k tuples are organized as t1 = (a1, t′), t2 = (a2, t′), ..., tk = (ak, t′), where t′ is a (D − 1)-dimensional tuple. However, this scheme behaves poorly when we keep dimension A and drop another dimension B instead, because there will still be k tuples afterwards.
ii. Case i is an extreme in which the "quantity reduction" happens only when the particular dimension A is rolled up to all. Alternatively, the reduction can occur in a distributed, "averaged" manner. A concrete example: take an 8(A) × 8(B) × 8(C) three-dimensional cube and k = 8. Following case i, there are 8 base cells; 1 (roll-up to all on A) + 8 + 8 = 17 2-D cells; 1 + 1 + 8 (roll-up to all on the two dimensions other than A) = 10 1-D cells; and 1 0-D cell. Following case ii, that is, building a 2 × 2 × 2 subcube on one "corner" of the full cube and filling it with the 8 tuples, there are 8 base cells; 4 + 4 + 4 = 12 2-D cells (rolling up any one dimension yields 4 cells); 2 + 2 + 2 = 6 1-D cells; and 1 0-D cell. Since 8 + 17 + 10 + 1 = 36 > 27 = 8 + 12 + 6 + 1, case ii is better than case i. Case ii appears to be better in general. Say k = 4: for case i we get 4 base cells, 1 + 4 + 4 = 9 2-D cells, 1 + 1 + 4 = 6 1-D cells, and 1 0-D cell, i.e., 20 in all; for case ii, since 2 × 2 = 4, we can only build a 2(B) × 2(C) two-dimensional subcube, giving 4 base cells, 2 + 2 + 4 (roll-up to all on A) = 8 2-D cells, 1 (roll-up to all on B and C) + 2 + 2 = 5 1-D cells, and 1 0-D cell, i.e., 18 in all.
A heuristic way to think about this: we want to put the k distinct tuples in an a × b × c subcube. Case i does so in a 1 × 1 × k fashion, while case ii uses k^(1/3) × k^(1/3) × k^(1/3). Note that a + b + c is the number of 1-D cells, and subject to abc = k, the sum a + b + c is minimized when a, b, and c are as equal as possible.
To summarize, the minimum is achieved by placing the k tuples in an x1 × x2 × ··· × xD subcube with x1 x2 ··· xD = k (if equality is not attainable, relax it to ≥ and keep the subcube as small as possible), choosing the vector (x1, x2, ..., xD) so that the xi's are as "close" to one another as possible, which also minimizes x1 + x2 + ··· + xD, as argued above. The total number of cells is then 1 + Σ_{d=1}^{D} (Σ_{1 ≤ j1 < ··· < jd ≤ D} x_{j1} x_{j2} ··· x_{jd}), which equals Π_{j=1}^{D} (x_j + 1).
(b) Present a formula to calculate the maximum number of cells that C may contain.
The maximum occurs when the k tuples are placed in a completely "irrelevant" way, where no two tuples (t11, t12, ..., t1D) and (t21, t22, ..., t2D) can be "merged" into one unless all D dimensions are rolled up to all; that is, t1i ≠ t2i for i = 1, 2, ..., D. This generates the largest possible number of cells: no matter which dimensions are dropped, there are still k distinct tuples, unless all D dimensions are dropped. The result is k × (C(D,1) + C(D,2) + ··· + C(D,D)) + 1 = k(2^D − 1) + 1 cells. Here we assume that we can always place the tuples as proposed, disregarding the fact that the dimensionality D and the cardinality ci of each dimension i may impose constraints (the same assumption is kept throughout this answer). For example, in an 8 × 8 two-dimensional cube we can place at most 8 tuples in the "irrelevant" manner described above; if k = 20, the situation becomes much more complex and is out of scope here. Also, since the question does not fix the cardinalities of the dimensions, we take the view that the cardinalities can always be increased so that the k tuples can be placed in an "irrelevant" style.

(c) Answer parts (a) and (b) above as if the count in each cube cell must be no less than a threshold, v.
The core observation is this: if all base cells contain at least v tuples, then all aggregate cells automatically satisfy the condition as well.
i. Minimum case: the distinctness condition no longer holds, since v tuples now have to share one base cell. Thus we can put all k tuples in a single base cell, which results in 2^D cells in all (one cell per cuboid).
ii. Maximum case: replace k by ⌊k/v⌋ and follow the procedure of part (b), since ⌊k/v⌋ is the largest possible number of base cells.
(d) Answer parts (a) and (b) above as if only closed cells are considered (with the minimum count threshold, v).
Following the analysis in part (c), we need not consider the threshold v separately, as long as k is replaced by the new value ⌊k/v⌋.
i. Minimum case: the distinctness condition again does not hold. Counting closed cells, 1 is the minimum, achieved by putting all k tuples together in one base cell.
ii. Maximum case: for a fixed set of tuples, say p tuples {t1, t2, ..., tp}, there can be at most one closed cell aggregating exactly those tuples, which gives an upper bound of C(k,1) + C(k,2) + ··· + C(k,k) = 2^k − 1 closed cells. This bound can be reached by the following construction. Make a 2 × 2 × ··· × 2 k-dimensional cube and distribute the k tuples according to the coordinates t1 = (1, 0, ..., 0), t2 = (0, 1, ..., 0), ..., tk = (0, 0, ..., 1). For instance, for {t1, t2, ..., tp}, the cell (∗, ∗, ..., ∗, 0, ..., 0), with ∗ in the first p positions, makes them one closed cell, because changing any ∗ to 0 or 1 results in a smaller count, p − 1 or 1 respectively. This completes the construction, because all 2^k − 1 closed cells can be formed likewise. Note that the construction requires k ≤ D, which we assume here, since in this setting we cannot increase the cardinalities, as was done in part (b), to satisfy the condition.
4. Suppose that a base cuboid has three dimensions, A, B, C, with the following number of cells: |A| = 1,000,000, |B| = 100, and |C| = 1,000. Suppose that each dimension is evenly partitioned into 10 portions for chunking.
(a) Assuming each dimension has only one level, draw the complete lattice of the cube.
(b) If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is dense?
(c) State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.

Answer:

(a) Assuming each dimension has only one level, draw the complete lattice of the cube.
The complete lattice is shown in Figure 4.1.
Figure 4.1: A complete lattice for the cube of Exercise 4.

(b) If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is dense?
The total size of the computed cube is as follows:
• all: 1
• A: 1,000,000; B: 100; C: 1,000; subtotal: 1,001,100
• AB: 100,000,000; BC: 100,000; AC: 1,000,000,000; subtotal: 1,100,100,000
• ABC: 100,000,000,000
• Total: 101,101,101,101 cells × 4 bytes = 404,404,404,404 bytes
(c) State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.
The order of computation that requires the least amount of space is B-C-A, as shown in Figure 4.2.
Figure 4.2: The order of computation in Exercise 4 that requires the least amount of space.

The total amount of main memory space required for computing the 2-D planes is:
Total space = (100 × 1,000) [the whole BC plane] + (1,000,000 × 10) [one row of chunks of the AB plane] + (100,000 × 100) [one chunk of the AC plane] = 20,100,000 cells = 80,400,000 bytes.
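The decomposition of those three terms follows the multiway array aggregation rule: keep the smallest 2-D plane whole in memory, one chunk row of the middle plane, and one chunk of the largest plane. A small Python sketch of the arithmetic:

A, B, C = 1_000_000, 100, 1_000
a, b, c = A // 10, B // 10, C // 10   # chunk edge lengths: 100,000, 10, 100

bc_plane = B * C      # whole BC plane (the smallest), kept in memory
ab_row   = A * b      # one row of chunks of the AB plane
ac_chunk = a * c      # one chunk of the AC plane (the largest)

total_cells = bc_plane + ab_row + ac_chunk
print(total_cells, total_cells * 4)   # 20100000 cells, 80400000 bytes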
5. When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: there exists a huge number of subsets of combinations of dimensions.

(a) Suppose that there are only two base cells, {(a1, a2, a3, ..., a100), (a1, a2, b3, ..., b100)}, in a 100-dimensional base cuboid. Compute the number of nonempty aggregate cells. Comment on the storage space and time required to compute these cells.
(b) Suppose we are to compute an iceberg cube from the above. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show the cells.
(c) Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data cube. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells (i.e., with small counts). Suppose that a database has 20 tuples that map to (or cover) the two following base cells in a 100-dimensional base cuboid, each with a cell count of 10: {(a1, a2, a3, ..., a100) : 10, (a1, a2, b3, ..., b100) : 10}.
i. Let the minimum support be 10. How many distinct aggregate cells will there be like the following: {(a1, a2, a3, a4, ..., a99, ∗) : 10, ..., (a1, a2, ∗, a4, ..., a99, a100) : 10, ..., (a1, a2, a3, ∗, ..., ∗, ∗) : 10}?
ii. If we ignore all the aggregate cells that can be obtained by replacing some constants by ∗'s while keeping the same measure value, how many distinct cells are left? What are the cells?
Answer:

(a) Suppose that there are only two base cells, {(a1, a2, a3, ..., a100), (a1, a2, b3, ..., b100)}, in a 100-dimensional base cuboid. Compute the number of nonempty aggregate cells. Comment on the storage space and time required to compute these cells.
Each base cell generates 2^100 − 1 aggregate cells. (We subtract 1 because, for example, (a1, a2, a3, ..., a100) itself is not an aggregate cell.) So the two base cells generate 2 × (2^100 − 1) = 2^101 − 2 aggregate cells; however, four of these cells are counted twice. These four cells are: (a1, a2, ∗, ..., ∗), (a1, ∗, ∗, ..., ∗), (∗, a2, ∗, ..., ∗), and (∗, ∗, ..., ∗). Therefore, the total number of aggregate cells generated is 2^101 − 6. Materializing all of them, even though only two base cells are involved, would require a prohibitive amount of storage space and computation time.
(b) Suppose we are to compute an iceberg cube from the above. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show the cells.
There are 4: {(a1, a2, ∗, ..., ∗) : 2, (a1, ∗, ∗, ..., ∗) : 2, (∗, a2, ∗, ..., ∗) : 2, (∗, ∗, ∗, ..., ∗) : 2}.
(c) Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data cube. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells (i.e., with small counts). Suppose that a database has 20 tuples that map to (or cover) the two following base cells in a 100-dimensional base cuboid, each with a cell count of 10: {(a1, a2, a3, ..., a100) : 10, (a1, a2, b3, ..., b100) : 10}.
i. Let the minimum support be 10. How many distinct aggregate cells will there be like the following: {(a1, a2, a3, a4, ..., a99, ∗) : 10, ..., (a1, a2, ∗, a4, ..., a99, a100) : 10, ..., (a1, a2, a3, ∗, ..., ∗, ∗) : 10}?
There will be 2^101 − 6, as shown above.
ii. If we ignore all the aggregate cells that can be obtained by replacing some constants by ∗'s while keeping the same measure value, how many distinct cells are left? What are the cells?
There are only three distinct cells left: {(a1, a2, a3, ..., a100) : 10, (a1, a2, b3, ..., b100) : 10, (a1, a2, ∗, ..., ∗) : 20}.
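The 2^101 − 6 count generalizes to 2^(d+1) − 6 for any dimensionality d ≥ 3 with two base cells of this shape (sharing the first two dimensions, differing on all the rest), which can be checked by brute force at small scale:

from itertools import product

def count_aggregates(d):
    cell1 = tuple(f"a{i}" for i in range(1, d + 1))
    cell2 = ("a1", "a2") + tuple(f"b{i}" for i in range(3, d + 1))
    cells = set()
    for base in (cell1, cell2):
        for mask in product([False, True], repeat=d):
            agg = tuple("*" if m else v for m, v in zip(mask, base))
            if agg not in (cell1, cell2):   # exclude the base cells themselves
                cells.add(agg)
    return len(cells)

for d in (4, 6, 8):
    print(d, count_aggregates(d), 2 ** (d + 1) - 6)  # the two counts agree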
6. Propose an algorithm that computes closed iceberg cubes efficiently.

Answer:

We base our answer on the algorithm presented in the paper: Laks V. S. Lakshmanan, Jian Pei, and Jiawei Han, "Quotient Cube: How to Summarize the Semantics of a Data Cube", VLDB 2002. Define the cover of a cell to be the set of base tuples that are aggregated in the cell. For example, if we have three base tuples (a1, b1, c1), (a1, b2, c1), (a1, b1, c3), then the cover of (a1, b1, ∗) is {(a1, b1, c1), (a1, b1, c3)}. Cells with the same cover can be grouped into the same class, as they share the same measure. Each class has an upper bound, consisting of the most specific cells in the class, and a lower bound, consisting of the most general cells in the class. The set of closed cells corresponds to the upper bounds of all the distinct classes that compose the cube. We can compute the classes by following a depth-first search strategy: first look for the upper bounds of the cell (∗, ∗, ..., ∗), call them u1, u2, ..., uk; then specialize each ui (assign a value to a ∗ dimension) and recursively find its upper bounds. How the upper bounds are found depends on the measure; for count, we can find the upper bound by instantiating a ∗ to a value v whenever all base cells in the cover share the value v in that dimension. In the example above, the upper bound of (∗, b1, ∗) is (a1, b1, ∗).

Incorporating iceberg conditions is not difficult. For example, for an anti-monotonic condition such as count(∗) > k, we can stop the recursion as soon as we reach an upper bound that does not satisfy the condition.
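To illustrate the cover/class/upper-bound vocabulary (not the paper's DFS algorithm), here is a brute-force Python sketch over the three example base tuples: cells are grouped by cover, and the most specific cell of each class (the one with the fewest ∗'s) is its upper bound, i.e., the closed cell.

from itertools import product

base = [("a1", "b1", "c1"), ("a1", "b2", "c1"), ("a1", "b1", "c3")]

def cover(cell):
    """Base tuples aggregated by `cell` (a * matches anything)."""
    return frozenset(t for t in base
                     if all(x == "*" or x == y for x, y in zip(cell, t)))

classes = {}
for t in base:
    for mask in product([False, True], repeat=3):
        cell = tuple("*" if m else v for m, v in zip(mask, t))
        classes.setdefault(cover(cell), set()).add(cell)

for cov, cells in sorted(classes.items(), key=lambda kv: len(kv[0])):
    fewest = min(c.count("*") for c in cells)
    uppers = [c for c in cells if c.count("*") == fewest]
    print(len(cov), uppers)   # class measure (count) and its closed cell(s)

For instance, the class with cover {(a1, b1, c1), (a1, b1, c3)} contains both (a1, b1, ∗) and (∗, b1, ∗), and its upper bound (the closed cell) is (a1, b1, ∗), matching the example in the text.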
7. Suppose that we would like to compute an iceberg cube for the dimensions A, B, C, D, where we wish to materialize all cells that satisfy a minimum support count of at least v, and where cardinality(A) < cardinality(B) < cardinality(C) < cardinality(D). Show the BUC processing tree (which shows the order in which the BUC algorithm explores the lattice of a data cube, starting from all) for the construction of this iceberg cube.

Answer:

The BUC processing tree is shown in Figure 4.3. BUC processes the dimensions in order of decreasing cardinality (D first, then C, B, A), so that partitions become small quickly and pruning by the minimum support condition takes effect as early as possible.
8. Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes where the iceberg condition tests for an avg that is no bigger than some value, v.
Answer:
Figure 4.3: BUC processing order for Problem 4.7. The numbers give the order in which BUC processes the cuboids: 1 all, 2 D, 3 DC, 4 DCB, 5 DCBA, 6 DCA, 7 DB, 8 DBA, 9 DA, 10 C, 11 CB, 12 CBA, 13 CA, 14 B, 15 BA, 16 A.
Instead of using the average, we can use the bottom-k average of each cell, which is antimonotonic. If the bottom-k average of a cell is larger than v, then no descendant cell (a cell with fewer ∗ dimensions) can have a bottom-k average smaller than v either; since the average of any cell containing at least k tuples is no smaller than its bottom-k average, no such descendant can satisfy avg ≤ v, and the cell can be pruned. To reduce the amount of space required to check the bottom-k average condition, we can store a few statistics, such as count and sum, for the base tuples whose values fall within certain ranges of v (e.g., less than v, [1.0, 1.2) times v, [1.2, 1.4) times v, etc.) and use those few values for pruning. This is analogous to the optimization presented in Section 4.1.6.
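A minimal sketch of the bottom-k average test follows (our code; the binned count/sum statistics described above are a further space optimization, not shown here):

    import heapq

    def bottom_k_avg(values, k):
        # Average of the k smallest values (all values if fewer than k).
        smallest = heapq.nsmallest(k, values)
        return sum(smallest) / len(smallest)

    def may_satisfy_avg_at_most(values, v, k):
        # Antimonotonic test: if the bottom-k average already exceeds v,
        # no descendant cell with at least k tuples can have avg <= v,
        # so the current cell and all its descendants can be pruned.
        return bottom_k_avg(values, k) <= v

    fares = [80, 90, 400, 700, 900]
    print(may_satisfy_avg_at_most(fares, v=100, k=2))  # True: bottom-2 avg is 85
    print(may_satisfy_avg_at_most(fares, v=50, k=2))   # False: prune this cell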
9. A flight data warehouse for a travel agent consists of six dimensions: traveler, departure (city), departure time, arrival, arrival time, and flight; and two measures: count and avg_fare, where avg_fare stores the concrete fare at the lowest level but the average fare at other levels.
(a) Suppose the cube is fully materialized. Starting with the base cuboid [traveler, departure, departure time, arrival, arrival time, flight], what specific OLAP operations (e.g., roll-up flight to AA) should one perform in order to list the average fare per month for each business traveler who flies American Airlines (AA) from L.A. in the year 2004?
(b) Suppose one wants to compute a data cube where the condition is that the minimum number of records is 10 and the average fare is over $500. Outline an efficient cube computation method (based on common sense about flight data distribution).
Answer:
(a) The following OLAP operations should be performed:
i. roll-up on traveler to the level of category and dice on "business";
ii. roll-up on departure to the level of city and dice on "L.A.";
iii. roll-up on departure time to the level of ANY;
iv. roll-up on arrival to the level of ANY;
v. roll-up on arrival time to the level of ANY;
vi. roll-up on flight to the level of company and dice on "AA";
vii. drill-down on traveler to the level of individual;
viii. drill-down on departure time to the level of month.
(b) There are two constraints: min_sup = 10 and avg_fare > 500. Use an iceberg cubing algorithm, such as BUC. Since avg_fare > 500 is not antimonotonic, it should be converted into a top-k average condition, i.e., avg10(fare) > 500 (the average of the 10 highest fares), which can be used for pruning. Then use binning plus min_sup to prune the computation of the cube.
10. (An implementation project) There are four typical data cube computation methods: multiway array aggregation [ZDN97], BUC [BR99], H-cubing [HPDW01], and Star-Cubing [XHLW03].
(a) Implement one of these cube computation algorithms and describe your implementation, experimentation, and performance. Find another student who has implemented a different algorithm on the same platform (e.g., C++ on Linux) and compare your algorithm's performance with his/hers.
Input:
i. an n-dimensional base cuboid table (for n < 20), which is essentially a relational table with n attributes;
ii. an iceberg condition: count(C) ≥ k, where k is a positive integer given as a parameter.
Output:
i. the set of computed cuboids that satisfy the iceberg condition, in the order of your output generation;
ii. a summary of the set of cuboids in the form of "cuboid ID : the number of nonempty cells", sorted in alphabetical order of cuboids, e.g., A : 155, AB : 120, ABC : 22, ABCD : 4, ABCE : 6, ABD : 36, where the number after ":" represents the number of nonempty cells. This summary is used to quickly check the correctness of your results.
(b) Based on your implementation, discuss the following:
i. What challenging computation problems are encountered as the number of dimensions grows large?
ii. How can iceberg cubing solve the problems of part i for some data sets (and characterize such data sets)?
iii. Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.
(c) Instead of computing a data cube of high dimensionality, we may choose to materialize the cuboids having only a small number of dimension combinations. For example, for a 30-dimensional data cube, we may only compute all cuboids with five dimensions, for every possible 5-dimensional combination. The resulting cuboids form a shell cube. Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation.
Answer:
(b) i. What challenging computation problems are encountered as the number of dimensions grows large?
The number of cuboids in a cube grows exponentially with the number of dimensions. If the number of dimensions grows large, it takes a huge amount of memory and time to compute all of the cuboids.
(b) ii. How can iceberg cubing solve the problems of part i for some data sets (and characterize such data sets)?
Iceberg cubing, by eliminating statistically insignificant aggregate cells, can substantially reduce the number of aggregate cells and therefore greatly reduce the computation. The benefit is maximized when the data sets are sparse but not skewed, because in such data sets cells have a relatively low chance of collapsing into the same aggregate cell, except in the cuboids consisting of a small number of dimensions. Thus, many cells may have counts below the threshold and can be pruned effectively.
(b) iii. Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.
Choose an example in which cells aggregated over only a small number of dimensions still satisfy the threshold. Consider, for example, an OLAP database consisting of 100 dimensions. Let ai,j be the j-th value of dimension i. Assume that there are 10 cells in the base cuboid, that they all aggregate to the cell (a1,1, a2,1, . . . , a99,1, ∗), and that the threshold is 10. Then all 2^99 ancestor cells of this cell (obtained by replacing some of its constants with ∗'s) satisfy the threshold. In this case, iceberg cubing cannot benefit from the pruning effect.
(c) Instead of computing a data cube of high dimensionality, we may choose to materialize the cuboids having only a small number of dimension combinations. For example, for a 30-dimensional data cube, we may only compute all cuboids with five dimensions, for every possible 5-dimensional combination. The resulting cuboids form a shell cube. Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation.
It is easy to modify the algorithms that adopt a bottom-up approach. Consider BUC as an example: we can modify it to generate the shell cube by growing dimension combinations from the bottom as usual and stopping the recursion once a group-by reaches the maximum number of dimensions (five in the example above). H-cubing and Star-Cubing can be modified in similar ways. However, multiway array aggregation is difficult to modify, because it computes every cuboid at the same time.
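For illustration, a naive sketch of shell-cube materialization follows: a plain group-by per dimension combination, which shows what gets materialized, though not the bottom-up pruning a modified BUC would add.

    from itertools import combinations
    from collections import Counter

    def shell_cube(base_tuples, dim_names, shell_size):
        # Compute only the cuboids with exactly `shell_size` dimensions,
        # one per possible dimension combination.
        cuboids = {}
        for dims in combinations(range(len(dim_names)), shell_size):
            counts = Counter(tuple(t[i] for i in dims) for t in base_tuples)
            cuboids[tuple(dim_names[i] for i in dims)] = counts
        return cuboids

    base = [('a1', 'b1', 'c1', 'd1'), ('a1', 'b2', 'c1', 'd2'),
            ('a1', 'b1', 'c2', 'd1')]
    shell = shell_cube(base, ('A', 'B', 'C', 'D'), shell_size=2)
    print(shell[('A', 'C')])   # Counter({('a1', 'c1'): 2, ('a1', 'c2'): 1})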
11. Consider the following multifeature cube query: Grouping by all subsets of {item, region, month}, find the minimum shelf life in 2004 for each group, and the fraction of the total sales due to tuples whose price is less than $100 and whose shelf life is between 1.25 and 1.5 of the minimum shelf life.
(a) Draw the multifeature cube graph for the query.
(b) Express the query in extended SQL.
(c) Is this a distributive multifeature cube? Why or why not?
Answer:
(a) The multifeature cube graph for the query:
R0 → R1 (≥ 1.25 ∗ min(shelf) and ≤ 1.5 ∗ min(shelf))
(b) Express the query in extended SQL.

    select     item, region, month, MIN(shelf), SUM(R1)
    from       Purchases
    where      year = 2004
    cube by    item, region, month: R1
    such that  R1.shelf >= 1.25 * MIN(shelf) and R1.shelf <= 1.5 * MIN(shelf) and R1.price < 100
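To make the query's semantics concrete, the following hypothetical Python/pandas sketch evaluates the body of the query for the single group-by (item, region, month); the cube by clause of the extended SQL would repeat this for every subset of the three attributes, and the where year = 2004 filter is assumed to have been applied already. Column names are illustrative.

    import pandas as pd

    def multifeature_group(df):
        # For one group: the minimum shelf life, and the fraction of total
        # sales due to tuples with price < 100 and shelf life within
        # [1.25, 1.5] times the minimum shelf life (the grouping variable R1).
        min_shelf = df['shelf'].min()
        r1 = df[(df['price'] < 100)
                & (df['shelf'] >= 1.25 * min_shelf)
                & (df['shelf'] <= 1.50 * min_shelf)]
        return pd.Series({'min_shelf': min_shelf,
                          'frac_sales': r1['sales'].sum() / df['sales'].sum()})

    purchases = pd.DataFrame({
        'item':   ['milk', 'milk', 'milk', 'bread'],
        'region': ['west', 'west', 'west', 'west'],
        'month':  [1, 1, 1, 1],
        'shelf':  [10, 13, 14, 20],     # shelf life in days
        'price':  [50, 60, 300, 40],
        'sales':  [100, 200, 400, 50],
    })
    print(purchases.groupby(['item', 'region', 'month']).apply(multifeature_group))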
(c) Is this a distributive multifeature cube? Why or why not?
No, this is not a distributive multifeature cube, because of the ≤ conditions in the such that clause.
12. For class characterization, what are the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction? Discuss which method is most efficient and under what conditions this is so.
Answer:
For class characterization, the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction include the following:
14. Discuss why relevance analysis is beneficial and how it can be performed and integrated into the characterization process. Compare the results of two induction methods: (1) with relevance analysis and (2) without relevance analysis.
Answer:
In characterization, specifying the set of relevant attributes may be difficult for the user. If the user is not familiar with the underlying distribution of values for an attribute, or with the possible inter-attribute dependencies, he/she may specify attributes for which there is no good generalization operator or that are irrelevant to the characterization process. Relevance analysis can be performed by computing a relevance measure (such as information gain) for each attribute and filtering out, before induction proceeds, the attributes whose relevance falls below a threshold. The benefit of performing characterization with relevance analysis, versus without it, is that the former can uncover patterns that may otherwise be hidden by the irrelevant attributes present in the latter.
15. Given a generalized relation, R, derived from a database, DB, suppose that a set, ∆DB, of tuples needs to be deleted from DB. Outline an incremental updating procedure for applying the necessary deletions to R.
Answer:
An incremental updating procedure for applying the necessary deletions to R, reflecting the changes in ∆DB, is as follows (a small code sketch after the list illustrates the idea):
• Generalize the tuples to be deleted (∆DB) to the same level of abstraction as those in the generalized relation R.
• Include with the tuples to be deleted the attribute required to indicate the class of each tuple, as it relates to the generalized relation R.
• Calculate the count and other required aggregate values for the tuples in ∆DB, as reflected by the values held in R.
• Update the affected tuples in R by subtracting the counts of the corresponding deleted tuples (∆DB) from their current counts, and recalculate the other aggregate values using the difference in measures between R and ∆DB.
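A minimal sketch of this procedure, assuming the generalized relation is stored as a dictionary mapping each generalized tuple to its count, and that a generalize function (hypothetical here) maps a base tuple to R's level of abstraction; other aggregate values would be maintained analogously:

    from collections import Counter

    def incremental_delete(R, delta_db, generalize):
        # Apply the deletions in delta_db (a list of base tuples) to the
        # generalized relation R without re-generalizing the whole database.
        deleted = Counter(generalize(t) for t in delta_db)   # steps 1-3
        for g, cnt in deleted.items():                       # step 4
            R[g] -= cnt
            if R[g] <= 0:        # drop tuples whose count reaches zero
                del R[g]
        return R

    # Hypothetical generalization: city -> country, age -> age range.
    def generalize(t):
        city, age = t
        country = 'Canada' if city in ('Vancouver', 'Toronto') else 'other'
        return (country, 'young' if age < 30 else 'old')

    R = {('Canada', 'young'): 5, ('Canada', 'old'): 2}
    incremental_delete(R, [('Vancouver', 25), ('Toronto', 40)], generalize)
    print(R)   # {('Canada', 'young'): 4, ('Canada', 'old'): 1}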
16. Outline a data cube-based incremental algorithm for mining class comparisons.
Answer:
See the following algorithm: a data-cube-based algorithm for incremental analytical class comparison.
Data-cube-based incremental algorithm for mining analytical class comparisons.
Input: (i) DB, a relational database (or another data source, such as a flat file); (ii) a data mining query; (iii) a list of attributes; (iv) a set of concept hierarchies or generalization operators on the attributes; (v) attribute generalization thresholds.
Output: P, a prime generalized relation used to build the data cube.
Method: The method is outlined as follows.
To build the initial data cube for mining:
(a) Data collection: The incremental part of the data is identified to produce a target class and contrasting class(es) from the set of task-relevant data, generating the initial working relations.
(b) Incremental dimension relevance analysis: This is performed on the initial working relation for the target class in order to determine which attributes should be retained (attribute relevance). Also, an attribute must be added to indicate the class of each data entry.
(c) Incremental synchronous generalization: The desired level of generalization is determined to form the prime target class and prime contrasting class cuboid(s). This generalization is synchronous between all of the classes, as the contrasting class relation(s) will be generalized to the same level.
To process revisions to the relevant data set, and thus make the algorithm incremental (note: we do not want to completely rebuild the data cube for all relevant data, as this would be time-consuming and inefficient; instead, only the changes to the relevant data are processed and merged into the prime relation as held in the data cube):
(a) Generalize the changed data to the same level of abstraction, and with the same attributes, as held in the data cube, and add the attribute required to indicate the class of each tuple.
(b) Calculate the count and other required aggregate values as done for the initial relations, and then merge the statistical information from this changed data with the results currently held in the data cube.
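The incremental step, (a) and (b) above, can be sketched in the same style as the deletion procedure of Exercise 15, this time adding the generalized, class-labeled counts of the changed data into the prime relation (all names hypothetical):

    from collections import Counter

    def merge_increment(prime, delta, generalize, class_of):
        # Merge only the changed tuples into the prime generalized relation,
        # which maps (class_label, generalized_tuple) -> count; the data
        # cube is not rebuilt.
        inc = Counter((class_of(t), generalize(t)) for t in delta)
        for key, cnt in inc.items():
            prime[key] = prime.get(key, 0) + cnt
        return prime

    prime = {('target', ('Canada', 'young')): 5}
    merge_increment(prime, [('Vancouver', 25)],
                    generalize=lambda t: ('Canada', 'young' if t[1] < 30 else 'old'),
                    class_of=lambda t: 'target')
    print(prime)   # {('target', ('Canada', 'young')): 6}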
Bibliography
[BR99] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 359-370, Philadelphia, PA, June 1999.
[HPDW01] J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In Proc. 2001 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'01), pages 1-12, Santa Barbara, CA, May 2001.
[XHLW03] D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing iceberg cubes by top-down and bottom-up integration. In Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03), Berlin, Germany, Sept. 2003.
[ZDN97] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 159-170, Tucson, Arizona, May 1997.