Data mining techniques are used to discover patterns and relationships
in data, which can enhance our understanding of the underlying
domain. Discovery of quantitative rules relies on quantitative
information from the database. The data cube represents
quantitative summary information for a subset of the attributes. Data
mining techniques like Associations, Classification, Clustering
and Trend analysis [FPSSU94] can be used together with OLAP to
discover knowledge from data. Attribute-oriented approaches
[Bha95] [HCC93] [HF95] to data mining are
data-driven and can use this summary information to discover
association rules. Transaction data can be used to mine association
rules by associating support and confidence with each rule. The
support of a pattern $A$ in a set $S$ is the ratio of the number of
transactions in $S$ containing $A$ to the total number of transactions
in $S$. The confidence of a rule is the probability that pattern $B$
occurs in $S$ when pattern $A$ occurs in $S$, and can be defined as
the ratio of the support of $AB$ to the support of $A$,
i.e. $P(B|A)$. The rule is then written as $A \Rightarrow B$
[support, confidence], and a strong association rule has a support
greater than a pre-determined minimum support and a confidence greater
than a pre-determined minimum confidence. This can also be taken as
the measure of ``interestingness'' of the rule. Calculation of support
and confidence for the rule involves the aggregates from the sub-cubes
$AB$ and $A$ and the overall count ALL. Additionally, dimension
hierarchies can be utilized to provide multiple-level data mining by
progressive generalization (roll-up) and deepening (drill-down). This
supports data mining at multiple concept levels, where potentially
interesting information can be obtained at each level.
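As a concrete illustration of these definitions, the following is a minimal Python sketch of the support and confidence computation; the transaction set and the minimum-support/minimum-confidence thresholds are invented for illustration and do not come from the text.

```python
def support(transactions, pattern):
    """Ratio of the number of transactions containing the pattern
    to the total number of transactions in the set."""
    pattern = set(pattern)
    hits = sum(1 for t in transactions if pattern <= set(t))
    return hits / len(transactions)

def confidence(transactions, a, b):
    """P(B|A): support of the combined pattern AB divided by support of A."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

# Toy transaction set S (hypothetical data)
S = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(S, {"bread", "milk"})       # 2 of 4 transactions -> 0.5
c = confidence(S, {"bread"}, {"milk"})  # 2 of 3 bread transactions -> 0.666...
# Strong rule under the (illustrative) thresholds min_sup=0.4, min_conf=0.6
strong = s >= 0.4 and c >= 0.6
```

A rule bread $\Rightarrow$ milk would thus be reported as [0.5, 0.67] and qualifies as strong under these thresholds.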
An event $E$ is a string $ab$, in which $a$ is a possible value for
some attribute and $b$ is a value for a different attribute of the
underlying data. Whether or not $E$ is interesting depends on whether
$b$'s occurrence depends on $a$'s occurrence. The ``interestingness''
measure is the size of the difference between (a) the probability of
$E$ among all such events in the data set and (b) the probability that
$a$ and $b$ occurred independently. The condition of interestingness
can then be defined as $|P(ab) - P(a)P(b)| > \epsilon$, where
$\epsilon$ is some fixed threshold.
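The interestingness condition itself reduces to a one-line comparison. A sketch in Python, where the probability values and the threshold are made up for illustration:

```python
def interesting(p_ab, p_a, p_b, eps):
    """True when the joint probability P(ab) deviates from the
    independence estimate P(a)P(b) by more than the threshold eps."""
    return abs(p_ab - p_a * p_b) > eps

# ab co-occurs more often than independence predicts: |0.30 - 0.20| = 0.10 > 0.05
flagged = interesting(p_ab=0.30, p_a=0.40, p_b=0.50, eps=0.05)
# Here the joint probability matches the independent product: |0.20 - 0.20| = 0
not_flagged = interesting(p_ab=0.20, p_a=0.40, p_b=0.50, eps=0.05)
```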
Consider a 3-attribute data cube with attributes $A$, $B$ and $C$,
defining the sub-cubes $ABC$, $AB$, $AC$, $BC$, $A$, $B$, $C$ and
ALL. To show 2-way associations, the interestingness function is
calculated between $A$ and $B$, between $A$ and $C$, and finally
between $B$ and $C$. When calculating associations between $A$ and
$B$, the probability of $E$, denoted by $P(AB)$, is the ratio of the
aggregation values in the sub-cube $AB$ to ALL. Similarly, the
independent probability of $A$, $P(A)$, is obtained from the values in
the sub-cube $A$ by dividing them by ALL, and $P(B)$ is calculated
likewise from $B$. The calculation $|P(AB) - P(A)P(B)| > \epsilon$,
for some threshold $\epsilon$, is performed in parallel. Since the
sub-cubes $AB$ and $A$ are distributed along the $A$ dimension, no
replication of $A$ is needed. However, since $B$ is distributed in
sub-cube $B$ but local on each processor in $AB$, the values of $B$
need to be replicated on all processors. Similarly, an appropriate
placement of values is required for a two-dimensional data
distribution.
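The cube-based probability calculation and the need to replicate the $B$ aggregates can be simulated on a single machine. In the sketch below, two simulated processors each own one row block of the $AB$ sub-cube along the $A$ dimension, and the replicated $B$ aggregate vector makes each interestingness test purely local; the counts and the threshold are invented for illustration.

```python
# Counts for a 2-D contingency sub-cube AB: A has 2 values, B has 3 values.
AB = [
    [20, 5, 5],    # row for a0, owned by "processor" 0
    [10, 30, 10],  # row for a1, owned by "processor" 1
]
A = [sum(row) for row in AB]        # sub-cube A: aggregate over B (rows)
B = [sum(col) for col in zip(*AB)]  # sub-cube B: aggregate over A (columns)
ALL = sum(A)                        # grand total

EPS = 0.05  # illustrative threshold epsilon

def interesting_pairs(local_rows, a_agg, b_agg, total, eps=EPS):
    """Each processor owns some rows of AB plus the matching A aggregates;
    the full B aggregate vector is replicated so the test needs no
    communication. Returns the (i, j) value pairs flagged as interesting."""
    out = []
    for i, row in local_rows:
        for j, count in enumerate(row):
            p_ab = count / total
            p_a = a_agg[i] / total
            p_b = b_agg[j] / total  # uses the replicated B aggregates
            if abs(p_ab - p_a * p_b) > eps:
                out.append((i, j))
    return out

# Simulate the two processors and merge their local results.
proc0 = interesting_pairs([(0, AB[0])], A, B, ALL)
proc1 = interesting_pairs([(1, AB[1])], A, B, ALL)
pairs = proc0 + proc1
```

With these counts (ALL = 80), the pairs involving b0 and b1 deviate from independence by about 0.10 and are flagged, while the b2 pairs deviate by less than 0.01 and are not.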