A data exploration strategy must be tailored
to the problem it is trying to solve. While no two problems will
have the same solution, we can describe general approaches that
are likely to be involved in any new problem. The intelligent application
of these approaches constitutes AnVil's process for developing a
data exploration strategy for complex, high-dimensional data sets.
We begin with a survey and analysis of the metadata,
that is, data about the data. Using traditional statistical methods,
we assess the quality of the data. Is the data uniformly spread
or clustered? Widely spaced or concentrated? Does it have holes
or voids? Are there significant outliers?
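
The questions above can be illustrated with a small sketch for a single numeric attribute. This is not AnVil's procedure; it assumes a plain list of values with None marking missing entries, and the function name survey_column and the 1.5 x IQR outlier rule are illustrative choices only.

```python
import math
import statistics


def survey_column(values, iqr_factor=1.5):
    """Summarize one numeric attribute: missing entries, spread, gaps, outliers.

    Hypothetical helper for illustration; the 1.5 * IQR fence is a common
    convention, not a rule taken from the text.
    """
    # Separate usable values from missing ones (None or NaN).
    present = sorted(
        v for v in values if v is not None and not math.isnan(v)
    )
    missing = len(values) - len(present)

    # Quartiles describe whether values are widely spaced or concentrated.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1

    # Values far outside the interquartile range are flagged as outliers.
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    outliers = [v for v in present if v < low or v > high]

    # The largest gap between neighbouring values hints at holes or voids.
    largest_gap = max(b - a for a, b in zip(present, present[1:]))

    return {
        "count": len(present),
        "missing": missing,
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "median": median,
        "iqr": iqr,
        "largest_gap": largest_gap,
        "outliers": outliers,
    }


if __name__ == "__main__":
    print(survey_column([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, None]))
```

Running the same summary over every attribute gives a quick first picture of data quality before any visualization is attempted.
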
Next, we employ visualization techniques that display
all the data in a compact form to provide an exploratory overview.
RadViz (patent pending) graphs can display hundreds of data
points, each having thousands of attributes or descriptors. A Smart
Clustering algorithm then positions the data in a statistically
rational manner. This leads to Predictor Development using analytical
and visual methods. Multiple analytical methods are used to score
the efficiency of different classifiers. The Principal Uncorrelated
Record Selection (PURS) method produced the best classification
of both training-set and test-set samples.
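
To make the idea of a compact overview more concrete, the sketch below shows the radial layout as it is generally described in the visualization literature: each attribute gets an anchor on the unit circle, and each record is placed at the weighted average of the anchors, with its normalized attribute values as weights. It is an assumption that this matches AnVil's RadViz implementation; the Smart Clustering and PURS steps are not described in enough detail here to sketch, and the function name radviz_positions is hypothetical.

```python
import math


def radviz_positions(rows):
    """Map each record (a list of numeric attributes) to a 2-D point.

    Generic radial-layout sketch, not AnVil's implementation.
    """
    n_attrs = len(rows[0])

    # Min-max normalize each attribute so all weights fall in [0, 1].
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]

    # One anchor per attribute, evenly spaced around the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / n_attrs), math.sin(2 * math.pi * j / n_attrs))
        for j in range(n_attrs)
    ]

    points = []
    for row in rows:
        weights = [
            (value - low) / span if span else 0.0
            for value, low, span in zip(row, lows, spans)
        ]
        total = sum(weights)
        if total == 0:
            # A record with all-minimum attributes sits at the center.
            points.append((0.0, 0.0))
            continue
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((x, y))
    return points


if __name__ == "__main__":
    sample = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
    for point in radviz_positions(sample):
        print(point)
```

A record dominated by one attribute is pulled toward that attribute's anchor, while a record with balanced attributes lands near the center, which is why the order of attributes around the circle strongly affects what the plot reveals.
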
Traditional analysis and visualization methods are
unable to work with entire complex data sets due to limitations
of existing software tools. Reducing dimensions simplifies the data
but can obscure data relationships or leave them undetected, leading
to missing or incorrect conclusions. AnVil Informatics
uses an approach that preserves all the information in the full
data set by creating global data and metastatistical overviews that
reveal major data patterns and identify aberrant samples that may
cause bias.