AnVil Informatics, Inc. is an in
silico drug discovery company employing proprietary
data exploration technology. Our highly experienced
team of scientists, informaticists, data visualization
specialists, and data miners is expert at navigating
large, high-dimensional data sets. Here we present
one such exploration, in which AnVil uncovered
knowledge that had not previously been identified.
Golub & Slonim
New Breakthroughs From The Re-examination of Previous
Results
Background
Golub & Slonim et
al. (1999) produced gene expression profiles from
72 Acute Leukemia patient samples using Affymetrix
GeneChip™ Hu6800 microarrays. Samples from four sites,
collected over a 20-year period, were assayed and
evaluated using Affymetrix software. While the lack
of replicate testing and sample quality control may
have increased the noise level in the resulting
data set, the data were sufficient for the authors
to successfully classify cancer types and discover
cancer sub-types.
Read
the original paper (PDF format)
by Golub & Slonim et al. for more details.
The Golub & Slonim et al. data set
continues to be actively studied, with dozens of published
analyses in print and on the Internet. Of the groups
that have published these analyses, including the
original authors, only one other than AnVil has presented
a gene predictor set diagnostic of clinical treatment
outcome.
What sets AnVil Informatics' approach
apart from others is not only the visual means by
which we conduct our analyses, but also the "big
picture" view we create of the problem and
the rigor with which we verify our answers. Traditional
analysis and visualization methods cannot examine
the data set as a whole because of limitations
in existing software tools. Assumptions made when reducing
data dimensions may also remove or obscure data
relationships and candidate genes, resulting in weaker
classifiers and predictors and the loss of valuable targets.
AnVil Informatics identifies clustering potential
and outlier samples early in order to avoid the costly
mistakes that can slow or misdirect the drug discovery
process.
Using AnVil's Approach to Analyze
the Data Set
AnVil Informatics' approach preserves
the value and meaning inherent in the full data set
by creating global data and meta-statistical overviews
that reveal major data patterns and identify
aberrant samples that may bias results. This approach
departs from the now-traditional analysis methods
exemplified by Golub & Slonim et al.'s analysis
of microarray data. We first take a high-level overview
of the data set, examining statistical metadata and
working in high-dimensional space to assess the full
data set. Visualizations are used not only to
portray the results of analyses, but also as
interactive tools for exploring, manipulating,
and analyzing the data and for generating subsequent
results.
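As a rough illustration of what a meta-statistical overview could look like,
the sketch below summarizes each sample by its mean, median, and standard
deviation across all genes, then flags samples whose summaries sit far from
the rest. This is not AnVil's proprietary method. The function names, the
median/MAD scoring rule, the 3.5 cutoff, and the toy values are all
assumptions chosen for illustration only.

```python
from statistics import mean, median, pstdev


def sample_overview(expression):
    """Summarize each sample as (mean, median, standard deviation)."""
    return {
        sample: (mean(values), median(values), pstdev(values))
        for sample, values in expression.items()
    }


def robust_z(values):
    """Score values by distance from the median, scaled by the MAD."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread to scale by, so nothing can be called aberrant.
        return [0.0 for _ in values]
    return [0.6745 * (v - center) / mad for v in values]


def flag_aberrant_samples(expression, cutoff=3.5):
    """Return samples whose summary statistics are outliers (assumed rule)."""
    overview = sample_overview(expression)
    samples = list(overview)
    flagged = set()
    for column in range(3):  # mean, median, standard deviation
        scores = robust_z([overview[s][column] for s in samples])
        flagged.update(s for s, z in zip(samples, scores) if abs(z) > cutoff)
    return sorted(flagged)


# Hypothetical toy data: five samples, four genes each (not from the study).
toy_expression = {
    "s1": [100, 200, 300, 400],
    "s2": [120, 210, 310, 400],
    "s3": [90, 190, 290, 390],
    "s4": [110, 205, 305, 400],
    "s5": [1000, 2000, 3000, 4000],
}

print(flag_aberrant_samples(toy_expression))  # ['s5']
```

Scoring against the median and MAD rather than the mean and standard
deviation keeps one extreme sample from masking itself, which matters when
the goal is to find such samples before any filtering is applied.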
This presentation demonstrates how
high-dimensional visualization of massive microarray
data sets can reveal valuable clustering relationships,
even before filtering, thresholding, and other pre-processing.
Results
Using the data of Golub & Slonim et al., AnVil
Informatics' analysis identified:
Several suspect patient samples that were subsequently
shown to be misclassified in a 3-gene predictor set
that otherwise classified B- and T-cell ALL and AML based on the influence
of a B-cell-associated gene. This biologically important result may
have been discarded by others whose methods do not evaluate data quality
alongside classifier prediction.
Analysis of limited chemotherapy treatment
outcome data also yielded a 76-gene predictor for remission success
and failure, a finding until recently unique among the dozens of presented
analyses of the Golub & Slonim et al. data set. While a survival
prognosis predictor has been generated for lymphoma [Alizadeh et
al. (2000)], a gene predictor set for the success or
failure of chemotherapy treatment is a significant contribution.