Data mining techniques are used to discover patterns and relationships
in data, which can enhance our understanding of the underlying
domain. Discovery of quantitative rules relies on quantitative
information from the database. The data cube represents
quantitative summary information for a subset of the attributes. Data
mining techniques like Associations, Classification, Clustering
and Trend analysis [FPSSU94] can be used together with OLAP to
discover knowledge from data. Attribute-oriented approaches
[Bha95] [HCC93] [HF95] to data mining are
data-driven and can use this summary information to discover
association rules. Transaction data can be used to mine association
rules by associating support and confidence with each rule. The
support of a pattern $A$ in a set $S$ is the ratio of the number of
transactions in $S$ containing $A$ to the total number of transactions
in $S$. The confidence of a rule is the probability that pattern $B$
occurs in $S$ when pattern $A$ occurs in $S$, and can be defined as
the ratio of the support of $AB$ to the support of $A$,
i.e. $P(B|A)$. The rule is then written as $A \Rightarrow B$
[support, confidence], and a strong association rule has a support
greater than a pre-determined minimum support and a confidence greater
than a pre-determined minimum confidence. This can also be taken as
the measure of ``interestingness'' of the rule. Calculation of support
and confidence for the rule involves the aggregates from the sub-cubes
$AB$ and $A$ and the overall count ALL. Additionally, dimension
hierarchies can be utilized to provide multiple-level data mining by
progressive generalization (roll-up) and deepening (drill-down). This
supports data mining at multiple concept levels, where potentially
interesting information can be obtained at each level.
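As a concrete illustration of these definitions, the following is a minimal Python sketch of the support and confidence computation; the transaction set and the minimum-support/minimum-confidence thresholds are invented for illustration and do not come from the text.

```python
def support(transactions, pattern):
    """Ratio of the number of transactions containing the pattern
    to the total number of transactions in the set."""
    pattern = set(pattern)
    hits = sum(1 for t in transactions if pattern <= set(t))
    return hits / len(transactions)

def confidence(transactions, a, b):
    """P(B|A): support of the combined pattern AB divided by support of A."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

# Toy transaction set S (hypothetical data)
S = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(S, {"bread", "milk"})       # 2 of 4 transactions -> 0.5
c = confidence(S, {"bread"}, {"milk"})  # 2 of 3 bread transactions -> 0.666...
# Strong rule under the (illustrative) thresholds min_sup=0.4, min_conf=0.6
strong = s >= 0.4 and c >= 0.6
```

A rule bread $\Rightarrow$ milk would thus be reported as [0.5, 0.67] and qualifies as strong under these thresholds.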
An event $E$ is a string $ab$, in which $a$ is a possible value for
some attribute and $b$ is a value for a different attribute of the
underlying data. Whether or not $E$ is interesting depends on whether
$b$'s occurrence depends on $a$'s occurrence. The ``interestingness''
measure is the size of the difference between (a) the probability of
$E$ among all such events in the data set and (b) the probability that
$a$ and $b$ occurred independently. The condition of interestingness
can then be defined as $|P(ab) - P(a)P(b)| > \epsilon$, where
$\epsilon$ is some fixed threshold.
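The interestingness condition itself reduces to a one-line comparison. A sketch in Python, where the probability values and the threshold are made up for illustration:

```python
def interesting(p_ab, p_a, p_b, eps):
    """True when the joint probability P(ab) deviates from the
    independence estimate P(a)P(b) by more than the threshold eps."""
    return abs(p_ab - p_a * p_b) > eps

# ab co-occurs more often than independence predicts: |0.30 - 0.20| = 0.10 > 0.05
flagged = interesting(p_ab=0.30, p_a=0.40, p_b=0.50, eps=0.05)
# Here the joint probability matches the independent product: |0.20 - 0.20| = 0
not_flagged = interesting(p_ab=0.20, p_a=0.40, p_b=0.50, eps=0.05)
```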
Consider a 3-attribute data cube with attributes $A$, $B$ and $C$,
defining the sub-cubes $ABC$, $AB$, $AC$, $BC$, $A$, $B$, $C$ and
ALL. To show 2-way associations, the interestingness function is
calculated between $A$ and $B$, between $A$ and $C$, and finally
between $B$ and $C$. When calculating associations between $A$ and
$B$, the probability of $E$, denoted by $P(AB)$, is the ratio of the
aggregation values in the sub-cube $AB$ to ALL. Similarly, the
independent probability of $A$, $P(A)$, is obtained from the values in
the sub-cube $A$ by dividing them by ALL, and $P(B)$ is calculated
likewise from $B$. The calculation $|P(AB) - P(A)P(B)| > \epsilon$,
for some threshold $\epsilon$, is performed in parallel. Since the
sub-cubes $AB$ and $A$ are distributed along the $A$ dimension, no
replication of $A$ is needed. However, since $B$ is distributed in
sub-cube $B$ but local on each processor in $AB$, the values of $B$
need to be replicated on all processors. Similarly, an appropriate
placement of values is required for a two-dimensional data
distribution.
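The cube-based probability calculation and the need to replicate the $B$ aggregates can be simulated on a single machine. In the sketch below, two simulated processors each own one row block of the $AB$ sub-cube along the $A$ dimension, and the replicated $B$ aggregate vector makes each interestingness test purely local; the counts and the threshold are invented for illustration.

```python
# Counts for a 2-D contingency sub-cube AB: A has 2 values, B has 3 values.
AB = [
    [20, 5, 5],    # row for a0, owned by "processor" 0
    [10, 30, 10],  # row for a1, owned by "processor" 1
]
A = [sum(row) for row in AB]        # sub-cube A: aggregate over B (rows)
B = [sum(col) for col in zip(*AB)]  # sub-cube B: aggregate over A (columns)
ALL = sum(A)                        # grand total

EPS = 0.05  # illustrative threshold epsilon

def interesting_pairs(local_rows, a_agg, b_agg, total, eps=EPS):
    """Each processor owns some rows of AB plus the matching A aggregates;
    the full B aggregate vector is replicated so the test needs no
    communication. Returns the (i, j) value pairs flagged as interesting."""
    out = []
    for i, row in local_rows:
        for j, count in enumerate(row):
            p_ab = count / total
            p_a = a_agg[i] / total
            p_b = b_agg[j] / total  # uses the replicated B aggregates
            if abs(p_ab - p_a * p_b) > eps:
                out.append((i, j))
    return out

# Simulate the two processors and merge their local results.
proc0 = interesting_pairs([(0, AB[0])], A, B, ALL)
proc1 = interesting_pairs([(1, AB[1])], A, B, ALL)
pairs = proc0 + proc1
```

With these counts (ALL = 80), the pairs involving b0 and b1 deviate from independence by about 0.10 and are flagged, while the b2 pairs deviate by less than 0.01 and are not.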