Date of Award

Fall 2017

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Analysis and Modeling

First Advisor

Sumeet Dua

Abstract

Recent advances in data collection technologies have made it possible to collect heterogeneous data at complex levels of abstraction and at an alarming pace and volume. Data mining and, more recently, data science seek to discover hidden patterns and insights in these data by employing a variety of knowledge discovery techniques. At the core of these techniques is the selection and use of features: the variables or properties on which the data were acquired, which facilitate effective data modeling. Selecting relevant features is critical to ensuring overall model accuracy and optimal performance in predicting future effects. The problem of relevant feature selection is compounded when the relevance of previously selected features cannot be guaranteed because the underlying dataset changes. This dissertation proposes an algorithm, based on the statistical Plaid Model, for discovering high-quality biclusters from which sets of features and their corresponding relevance scores are tracked in datasets that change with time.

Initially, the algorithm employs an enhanced Plaid Model that integrates multiple results from the traditional Plaid Model to generate a list of statistically significant biclusters. This is achieved through the recursive use of combined set operations and inferential statistical tests, which guide the generation of a persistent set of biclusters with high goodness scores. Next, the sets of features that define these biclusters are selected and marked for tracking based on the discriminatory power they exert on their host biclusters at different time instances. As the dataset changes with time, the originally discovered biclusters change as well, together with the previously established discriminatory tendencies of each bicluster's feature sets. These changes in discriminatory power among the feature sets that define the host biclusters are then modeled and tracked as the underlying dataset changes with time.
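
The idea of combining several Plaid Model runs through set operations can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the Jaccard-based matching, the `threshold` parameter, and the function names are assumptions introduced for illustration, and the inferential statistical tests mentioned above are omitted. Each bicluster is represented as a pair of row and column index sets, and a bicluster is kept only if a sufficiently similar one appears in every run.

```python
from typing import FrozenSet, List, Tuple

# A bicluster is a pair (row indices, column indices); hypothetical representation.
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity of two index sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x: Bicluster, y: Bicluster) -> float:
    """Product of row-wise and column-wise Jaccard similarity."""
    return jaccard(x[0], y[0]) * jaccard(x[1], y[1])


def persistent_biclusters(
    runs: List[List[Bicluster]], threshold: float = 0.5
) -> List[Bicluster]:
    """Keep biclusters from the first run that recur in every other run.

    For each seed bicluster, the best match in each later run is found;
    if its similarity reaches `threshold`, the seed is narrowed to the
    intersection of rows and columns. Seeds without a match are dropped.
    """
    if not runs:
        return []
    consensus: List[Bicluster] = []
    for rows, cols in runs[0]:
        for other in runs[1:]:
            best = max(other, key=lambda b: similarity((rows, cols), b), default=None)
            if best is None or similarity((rows, cols), best) < threshold:
                break  # this seed does not persist across runs
            rows, cols = rows & best[0], cols & best[1]
        else:
            consensus.append((rows, cols))
    return consensus


# Three hypothetical runs; only the first bicluster of run 0 recurs in all of them.
runs = [
    [(frozenset({0, 1, 2}), frozenset({0, 1})), (frozenset({5, 6}), frozenset({3}))],
    [(frozenset({0, 1, 2, 3}), frozenset({0, 1}))],
    [(frozenset({0, 1, 2}), frozenset({0, 1, 2}))],
]
result = persistent_biclusters(runs)
```

In this sketch the column set of each surviving bicluster plays the role of the feature set that would subsequently be tracked over time.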

The proposed technique was tested on simulated spatiotemporal phenomena in a real microarray gene expression dataset. The results indicate that the algorithm successfully generated and tracked subsets of features through their relevance-based discriminatory characteristics over a span of time instances as the underlying dataset underwent changes.
