Decision support systems use On-Line Analytical processing (OLAP) and data mining to find interesting information from large databases. Multidimensional databases are suitable for OLAP and data mining since these applications require dimension oriented operations on data. Traditional multidimensional databases store data in multidimensional arrays on which analytical operations are performed. Multidimensional arrays are good to store dense data, but most datasets are sparse in practice for which other efficient storage schemes are required. For large databases, reduced storage requirements can reduce the I/O time needed for data accesses during operations when appropriate data structures are used. Each cell in a multidimensional array can be accessed by an index in each of its dimension. Hence, finding an element in it is a trivial operation. Sparse data structures can provide efficient storage mechanisms, but dimension oriented operations can be more expensive when compared to multidimensional arrays, because of the additional processing required to de-reference indices. It is important to weigh the trade-offs involved in reducing the storage space versus the increase in access time for each sparse data structure, in comparison to multidimensional arrays. These trade-offs are dependent on many parameters some of which are (1) number of dimensions, (2) sizes of dimensions and (3) degree of sparsity of the data. Complex operations such as those required for data-driven data mining can be very expensive in terms of data access time if efficient data structures are not used. We compare the storage and operational efficiency in OLAP and data mining of various sparse data storage schemes in [GC97b]. A novel data structure using bit encodings for dimension indices called Bit-Encoded Sparse Structure (BESS) is used to store data in chunks, which supports fast OLAP query operations on sparse data using bit operations without the need for exploding the sparse data into a multidimensional array. This allows for high dimensionality and large dimension sizes. Our techniques for multidimensional analysis can be applied to scientific and statistical databases [LS98] which rely heavily on dimensional analysis.
In this paper we present a parallel multidimensional framework for large data sets in OLAP and SSDB which has been integrated with data mining of association rules. Parallel data cube construction for large data sets and a large number of dimensions using sparse storage structures is presented. Sparsity is handled by using compressed chunks using BESS. Three schedules for constructing complete and partial data cubes from the cube lattice are developed, using cost estimates between parent and child aggregates: (1) Pipelined method for cubes (2) Pipelined method for chunks and (3) Level by level method. These incorporate various optimizations to reduce computation, communication and I/O overheads. Chunk management strategies are described to store the various intermediate cubes and handle large data sets.
The rest of the paper is organized as follows. Section 2 describes chunk storage using BESS and the multidimensional infrastructure. Section 3 describes mining of association rules on the data cube. Section 4 describes strategies and optimizations for parallel aggregate computation. Section 5 presents results of our implementation on the IBM SP2. Section 6 concludes the paper.