A data exploration strategy must be tailored to the problem it is trying to solve. While no two problems will have the same solution, we can describe general approaches that are likely to be involved in any new problem. The intelligent application of these approaches constitutes AnVil's process for developing a data exploration strategy for complex, high-dimensional data sets.

We begin with a survey and analysis of the metadata - that is, data about the data. Using traditional statistical methods, we assess the quality of the data. Is the data uniformly spread or clustered? Widely spaced or concentrated? Does it have holes or voids? Are there significant outliers?
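The quality questions above can be illustrated with a small sketch. It assumes one numeric attribute held in a plain Python list, with `None` marking a missing value, and it flags outliers with Tukey's 1.5 × IQR fences. The function name `summarize_column` and the choice of fence rule are illustrative assumptions, not a description of AnVil's own tooling.

```python
import statistics


def summarize_column(values):
    """Return basic quality indicators for one numeric attribute.

    `values` may contain None for missing entries (an assumption of this
    sketch); at least two non-missing values are required.
    """
    present = [v for v in values if v is not None]

    # Quartiles describe how widely spaced or concentrated the data is.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1

    # Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in present if v < low_fence or v > high_fence]

    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "median": median,
        "iqr": iqr,
        "outliers": outliers,
    }


if __name__ == "__main__":
    report = summarize_column([1, 2, 3, 4, 5, 6, 7, 8, 100, None])
    for name, value in report.items():
        print(f"{name}: {value}")
```

Running the same summary over every attribute gives a first picture of spread, gaps, and outliers before any visualization is attempted.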

Next, we apply visualization techniques that display all the data in a compact form to provide an exploratory overview. RadViz™ (patent pending) graphs can display hundreds of data points, each having thousands of attributes or descriptors. A Smart Clustering algorithm then positions the data in a statistically rational manner. This leads to Predictor Development using analytical and visual methods. Multiple analytical methods are used to score the efficiency of different classifiers. The Principal Uncorrelated Record Selection (PURS)™ method produced the best classification of both training set and test set samples.
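To make the idea of a radial overview concrete, here is a minimal sketch of the generic radial-coordinate layout commonly described as RadViz: each attribute gets an anchor on the unit circle, each attribute is min-max normalized to [0, 1], and a record is drawn at the weighted average of the anchors. This is an assumption-laden illustration of the general technique only. The function name `radviz_project` is hypothetical, the anchor order is simply the input column order, and nothing here represents AnVil's patent-pending implementation or its Smart Clustering step.

```python
import math


def radviz_project(records):
    """Project equal-length numeric records onto the unit disc.

    Generic radial-coordinate layout; not AnVil's implementation.
    """
    n_dims = len(records[0])

    # One anchor per attribute, evenly spaced around the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / n_dims), math.sin(2 * math.pi * j / n_dims))
        for j in range(n_dims)
    ]

    # Column-wise minima and maxima for min-max normalization.
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    points = []
    for record in records:
        # Constant columns carry no information, so they get weight 0.
        weights = [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(record, lows, highs)
        ]
        total = sum(weights)
        if total == 0:
            # A record at the minimum of every attribute sits at the center.
            points.append((0.0, 0.0))
            continue
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((x, y))
    return points


if __name__ == "__main__":
    sample = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
    for record, (x, y) in zip(sample, radviz_project(sample)):
        print(record, "->", (round(x, 3), round(y, 3)))
```

A record dominated by one attribute is pulled toward that attribute's anchor, while a record with balanced values lands near the center. The layout therefore depends on anchor order, which is why a principled arrangement of the attributes matters.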

Traditional analysis and visualization methods are unable to work with entire complex data sets due to limitations of existing software tools. Reducing dimensions simplifies the data but can obscure data relationships or fail to detect them, leading to missing or incorrect conclusions. AnVil Informatics uses an approach that preserves all the information in the full data set by creating global data and metastatistical overviews that reveal major data patterns and identify aberrant samples that may cause bias.

Copyright © 2001 AnVil Informatics, Inc. All rights reserved.