Description: Given a set of data objects, correlation computing refers to the problem of efficiently finding groups of strongly related data objects in very large databases. Many
important applications in business and science depend on efficient and effective correlation computing techniques to discover relationships within large collections of data. In spite of much attention to the development of traditional statistical correlation
computing techniques, researchers and practitioners face increasing challenges in discovering association patterns in data produced by emerging data-intensive applications. Indeed, the sizes of real-world data sets are growing at an extraordinary rate.
Furthermore, these data can be multi-scale, multi-level, multi-source, and dynamic in nature. These characteristics may not be a critical issue for data analysis when the data set is small. However, as data become very large, applying traditional statistical correlation computing techniques directly becomes a real challenge.
In this dissertation, we first introduce an incremental solution for dynamic correlation computing. Along this line, we develop checkpoint-based algorithms, which can efficiently incorporate new transactions for correlation computing as they become available. The key idea is to exploit a checkpoint to establish a computation buffer, which can help us determine an upper bound for the correlation. This checkpoint bound can be used to identify a short list of candidate pairs, whose correlations are maintained and computed as new transactions are added to the database. When the total number of new transactions exceeds the buffer size, a new upper bound is computed for the next checkpoint, and a new list of candidate pairs is established. Experimental results on real-world data sets show that the checkpoint-based algorithms can significantly reduce correlation computing costs in dynamic data environments and have the advantage of compact memory usage. Furthermore, extending the pair-wise relationship, we examine confounding effects of additional items on the correlation of an item pair. Instead of searching for correlation patterns at the global level, we propose to efficiently find
confounding effects attributable to local associations. Finally, we examine applications of correlation computing to two real-world problems. One application is the correlation range query for a given item in recommender systems. The other application
is in financial risk computing, which can serve as an example of time series data that are prone to noise and outliers.
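
The checkpoint idea described above can be illustrated with a minimal sketch. The abstract does not name the correlation measure or give the form of the bound, so the sketch assumes the phi coefficient on binary items and uses a brute-force bound over all ways of adding at most `buffer_size` transactions. That bound is a stand-in for illustration, not the bound derived in the dissertation. All names (`phi`, `phi_upper_bound`, `CheckpointMiner`) are hypothetical, and the item universe is assumed to be known in advance.

```python
from itertools import combinations
from math import sqrt


def phi(n, na, nb, nab):
    """Phi correlation of two binary items from counts (0.0 if undefined)."""
    denom = na * (n - na) * nb * (n - nb)
    if denom <= 0:
        return 0.0
    return (n * nab - na * nb) / sqrt(denom)


def phi_upper_bound(n, na, nb, nab, buffer_size):
    """Largest phi reachable after adding at most buffer_size transactions.

    Illustrative brute force: enumerate how many new transactions contain
    both items, only a, only b, or neither.
    """
    best = phi(n, na, nb, nab)
    for both in range(buffer_size + 1):
        for only_a in range(buffer_size + 1 - both):
            for only_b in range(buffer_size + 1 - both - only_a):
                used = both + only_a + only_b
                for neither in range(buffer_size + 1 - used):
                    best = max(best, phi(n + used + neither,
                                         na + both + only_a,
                                         nb + both + only_b,
                                         nab + both))
    return best


class CheckpointMiner:
    """Keeps pair counts only for candidate pairs between checkpoints."""

    def __init__(self, items, threshold, buffer_size):
        self.items = sorted(items)       # assumed fixed item universe
        self.threshold = threshold
        self.buffer_size = buffer_size
        self.transactions = []           # stands in for the database
        self.item_count = {i: 0 for i in self.items}
        self.checkpoints = 0
        self._checkpoint()

    def _checkpoint(self):
        # Full scan: count co-occurrences of every pair in the database.
        pair_count = {}
        for t in self.transactions:
            for pair in combinations(sorted(t), 2):
                pair_count[pair] = pair_count.get(pair, 0) + 1
        n = len(self.transactions)
        # Keep only pairs whose bound can reach the threshold in the buffer.
        self.candidates = {}
        for a, b in combinations(self.items, 2):
            nab = pair_count.get((a, b), 0)
            bound = phi_upper_bound(n, self.item_count[a], self.item_count[b],
                                    nab, self.buffer_size)
            if bound >= self.threshold:
                self.candidates[(a, b)] = nab
        self.since_checkpoint = 0
        self.checkpoints += 1

    def add(self, transaction):
        t = set(transaction) & set(self.items)
        self.transactions.append(t)
        for item in t:
            self.item_count[item] += 1
        self.since_checkpoint += 1
        if self.since_checkpoint > self.buffer_size:
            self._checkpoint()           # the rescan already includes t
        else:
            for a, b in self.candidates:
                if a in t and b in t:
                    self.candidates[(a, b)] += 1

    def correlated_pairs(self):
        """Candidate pairs whose current phi meets the threshold."""
        n = len(self.transactions)
        result = {}
        for (a, b), nab in self.candidates.items():
            value = phi(n, self.item_count[a], self.item_count[b], nab)
            if value >= self.threshold:
                result[(a, b)] = value
        return result


if __name__ == "__main__":
    miner = CheckpointMiner(items="abcd", threshold=0.5, buffer_size=3)
    for t in ["ab", "ab", "cd", "abc", "d", "ab"]:
        miner.add(t)
    print(miner.correlated_pairs())
```

In this sketch the only per-transaction work between checkpoints is updating item counts and the counts of candidate pairs. The full pair scan happens only when the buffer is exhausted, which is where any saving would come from in a setting with many items and few candidates.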