
Attribute-Oriented Mining On Data Cubes

Data mining techniques are used to discover patterns and relationships in data that can enhance our understanding of the underlying domain. The discovery of quantitative rules relies on quantitative information from the database, and the data cube represents quantitative summary information for a subset of the attributes. Data mining techniques such as associations, classification, clustering and trend analysis [FPSSU94] can be used together with OLAP to discover knowledge from data. Attribute-oriented approaches [Bha95] [HCC93] [HF95] to data mining are data-driven and can use this summary information to discover association rules.

Transaction data can be used to mine association rules by associating a support and a confidence with each rule. The support of a pattern A in a set S is the ratio of the number of transactions containing A to the total number of transactions in S. The confidence of a rule $A \rightarrow B$ is the probability that pattern B occurs in S when pattern A occurs in S, and can be defined as the ratio of the support of AB to the support of A, that is, P(B|A). The rule is then described as $A \rightarrow B$ [support, confidence], and a strong association rule has a support greater than a pre-determined minimum support and a confidence greater than a pre-determined minimum confidence. These two measures can also be taken as the measure of ``interestingness'' of the rule. Calculating the support and confidence of the rule $A \rightarrow B$ involves the aggregates from the cubes AB, A and ALL.

Additionally, dimension hierarchies can be utilized to provide multiple-level data mining by progressive generalization (roll-up) and deepening (drill-down). This supports mining at multiple concept levels, and interesting information can potentially be obtained at different levels.
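The support and confidence calculation from cube aggregates can be illustrated with a small sequential sketch. It assumes that patterns A and B are single values of two attributes, that each sub-cube is held as a dictionary of counts, and that the records, function names and thresholds are all hypothetical; it is not the implementation described in this paper.

```python
from collections import Counter

# Hypothetical toy records: each tuple is (value of attribute A, value of attribute B).
records = [
    ("x", "p"), ("x", "p"), ("x", "q"), ("y", "p"),
    ("y", "q"), ("y", "q"), ("x", "p"), ("y", "q"),
]

def build_cubes(rows):
    """Aggregate counts for the sub-cubes AB and A, plus the ALL total."""
    cube_ab = Counter(rows)                 # count per (a, b) cell
    cube_a = Counter(a for a, _ in rows)    # count per a value
    total = len(rows)                       # the single ALL aggregate
    return cube_ab, cube_a, total

def rule_stats(cube_ab, cube_a, total, a, b):
    """Return (support, confidence) of the rule a -> b from the aggregates."""
    support_ab = cube_ab[(a, b)] / total
    support_a = cube_a[a] / total
    confidence = support_ab / support_a if support_a else 0.0
    return support_ab, confidence

def is_strong(support, confidence, min_support, min_confidence):
    """A strong rule exceeds both pre-determined minimums."""
    return support > min_support and confidence > min_confidence

cube_ab, cube_a, total = build_cubes(records)
support, confidence = rule_stats(cube_ab, cube_a, total, "x", "p")
strong = is_strong(support, confidence, min_support=0.3, min_confidence=0.7)
```

Only the three aggregates AB, A and ALL are read after the cubes are built, so no further pass over the transaction data is needed to evaluate additional rules between the same two attributes.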

An event E is a string $E_n = x_1, x_2, \ldots, x_n$ in which, for some $j, k$, $x_j$ is a possible value for some attribute and $x_k$ is a value for a different attribute of the underlying data. E is interesting to the extent that the occurrence of $x_j$ depends on the occurrence of $x_k$. The ``interestingness'' measure is the size $I_j(E)$ of the difference between:
(a) the probability of E among all such events in the data set, and
(b) the probability that $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n$ and $x_j$ occurred independently.
The condition of interestingness can then be defined as $I_j(E) > \delta$, where $\delta$ is some fixed threshold.

Consider a 3-attribute data cube with attributes A, B and C, defining $E_3 = ABC$. To show 2-way associations, the interestingness function is calculated between A and B, between A and C, and finally between B and C. When calculating associations between A and B, the probability of E, denoted by P(AB), is the ratio of the aggregation values in the sub-cube AB to ALL. Similarly, the independent probability of A, P(A), is obtained by dividing the values in the sub-cube A by ALL, and P(B) is calculated in the same way from the sub-cube B. The calculation $|P(AB) - P(A)P(B)| > \delta$, for some threshold $\delta$, is performed in parallel. Since the cubes AB and A are both distributed along the A dimension, no replication of A is needed. However, sub-cube B is distributed across the processors, whereas the B dimension of AB is local to each processor, so B needs to be replicated on all processors. Similarly, an appropriate placement of values is required for a two-dimensional data distribution.
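The 2-way calculation can be sketched sequentially as follows. This is a minimal illustration, assuming the sub-cubes are dictionaries of counts built from hypothetical toy records; the function name, the data and the threshold value `delta` are assumptions, and the sketch does not model the parallel distribution or the replication of B across processors.

```python
from collections import Counter
from itertools import combinations

def interesting_pairs(records, attrs, delta):
    """Return 2-way cells whose |P(XY) - P(X)P(Y)| exceeds delta.

    Only cells that actually occur in the 2-D sub-cube are examined;
    empty cells are skipped in this sketch.
    """
    total = len(records)  # the ALL aggregate
    found = []
    for i, j in combinations(range(len(attrs)), 2):
        cube_ij = Counter((r[i], r[j]) for r in records)  # 2-D sub-cube
        cube_i = Counter(r[i] for r in records)           # 1-D sub-cube
        cube_j = Counter(r[j] for r in records)           # 1-D sub-cube
        for (vi, vj), count in cube_ij.items():
            p_joint = count / total
            p_independent = (cube_i[vi] / total) * (cube_j[vj] / total)
            score = abs(p_joint - p_independent)
            if score > delta:
                found.append(((attrs[i], vi), (attrs[j], vj), score))
    return found

# Hypothetical records over attributes A, B, C: A and B always co-occur,
# while C is independent of both.
records = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a2", "b2", "c1"),
    ("a2", "b2", "c2"),
]
found = interesting_pairs(records, ("A", "B", "C"), delta=0.1)
```

With this data only the cells of sub-cube AB pass the threshold, since the joint probability of each A and C pair (and each B and C pair) equals the product of the individual probabilities.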



Sanjay Goil
Fri Aug 7 14:58:04 CDT 1998