
Case Studies


Golub & Slonim's Original Analysis

Golub & Slonim collected 38 tissue samples from repositories of clinical samples taken from patients with Acute Lymphoblastic Leukemia ("ALL") and Acute Myeloid Leukemia ("AML") and analyzed their gene expression profiles by DNA microarray using the Affymetrix Hu6800 GeneChip. A Class Predictor was constructed from this 38-sample Training Set and used to predict cancer classification on an independent Test Set of samples. Using correlation statistics and nearest-neighbor analysis of the known classes and the Training Set microarray data, Golub & Slonim et al. (G&S) arrived at a set of 50 genes to predict the classification of patients in the Test Set.

While one might have wished for genes that are always on in one cancer type and always off in the other, there appears to be no black-and-white answer in the real world of gene expression profiling. Hence the upper left of Figure 1 shows 25 genes selected as strongly expressed in ALL and weakly expressed in AML, and the lower left shows the opposite circumstance. The discrimination favors strongly expressed genes (RED), as more weakly expressed genes fall below the signal-to-noise ratio.

Figure 1: Class Predictor Set created by Golub & Slonim

When the predictor is applied to the Test Set, shown in the right quadrants, the RED vs BLUE segregation is less clean but still sufficient to accurately predict patient assignment to ALL or AML.

Golub & Slonim's Methodology

Figure 2: Golub & Slonim methodology

The authors presented their methodology for building this Cancer Class Predictor from known cases for subsequent use in classifying 'unknown' samples as ALL or AML. They further presented a method for Cancer Class Discovery a priori, that is, without training on known samples or classes. While perhaps unique at the time, this has become the generic prescription for class prediction and discovery from microarray data sets.

AnVil's Methodology & Analysis

AnVil departs from the now-traditional analysis of microarray data exemplified by Golub & Slonim et al. by first taking a high-level overview of the data set: examining statistical metadata and working in high-dimensional space to assess the full data set. Visualizations are used not only to portray the results of analyses, but also as interactive tools for exploring, manipulating and analyzing the data and for generating subsequent results.

Figure 3 outlines each step of AnVil's discovery process and how it sheds new light on this data set.

Figure 3: AnVil methodology applied to the Golub & Slonim data set

Statistical Metadata Visualization

In addition to reviewing traditional statistical information about a data set, AnVil creates an overview of the statistics about the data in a Statistical Metadata visualization. In this example, the statistics about gene expression levels for each of the 38 samples in the Training Set are displayed in relative comparison. Standard statistical values (minimum, maximum, mean, mode, standard deviation, variance, skewness, kurtosis, etc.) are given in this illustration, along with measures specifically relevant to microarray analysis (thresholding, filtering, etc.). The value of each statistic for each patient is colored from relatively low to high on a BLUE to YELLOW scale.

Several patient samples stand out because of unusually high and/or low values. Samples 9, 17 and 20 are pointed out in this display as samples that should be tracked for interesting or deviant contributions to results as the analysis progresses. If these samples significantly influence the results, the analysis might be repeated with them omitted.

Figure 4: Visualization of metadata
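As a rough illustration of the per-sample statistics described above, the following minimal sketch computes a few of the listed values for each sample and flags samples whose value for a chosen statistic lies far from the other samples. The function names, the z-score cutoff and the toy data are assumptions made for this example; they are not AnVil's implementation and not the Golub & Slonim data.

```python
from statistics import mean, pstdev

def sample_summary(values):
    """Summary statistics for one sample's expression values."""
    n = len(values)
    m = mean(values)
    sd = pstdev(values)  # population standard deviation
    if sd == 0:
        skew = kurt = 0.0  # constant sample: shape statistics undefined, report 0
    else:
        skew = sum(((v - m) / sd) ** 3 for v in values) / n
        kurt = sum(((v - m) / sd) ** 4 for v in values) / n - 3.0  # excess kurtosis
    return {"min": min(values), "max": max(values), "mean": m,
            "sd": sd, "variance": sd ** 2, "skewness": skew, "kurtosis": kurt}

def flag_unusual(summaries, stat, z_cut=2.0):
    """Zero-based indices of samples whose `stat` is more than z_cut
    standard deviations from the mean of that statistic over all samples."""
    vals = [s[stat] for s in summaries]
    m, sd = mean(vals), pstdev(vals)
    if sd == 0:
        return []
    return [i for i, v in enumerate(vals) if abs(v - m) / sd > z_cut]

# Toy data (hypothetical): nine similar samples and one with an extreme value.
samples = [[1, 2, 3, 4, 5]] * 9 + [[1, 2, 3, 4, 500]]
summaries = [sample_summary(s) for s in samples]
flagged = flag_unusual(summaries, "max")
```

Flagged samples would be candidates for tracking through later steps, in the same spirit as samples 9, 17 and 20 in the text.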

Exploratory Overview

Figure 5: RadViz overview

This RadViz™ (patent pending) graph displays all 6817 genes as records and the 38 patients as columns (or dimensions). The global gene grouping has an axis corresponding to the ALL - AML class axis. To determine whether this has meaning, the gene glyphs are overlaid with a color scheme based on a disease correlation measure, the "GS" value (Golub & Slonim, 1999). Negative and positive GS values correspond to preferential expression in AML and ALL, respectively. With the gene distribution colored in this way, a view of all available data can carry implicit meaning and, as will be seen later, can be used to pick predictor set genes.
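The text does not spell out how the GS value is calculated. The sketch below assumes it is the signal-to-noise statistic from Golub et al. (1999), the difference of the class means divided by the sum of the class standard deviations, with ALL taken as the positive class so that the sign convention matches the paragraph above. The use of the sample standard deviation and the expression values are assumptions for illustration only.

```python
from statistics import mean, stdev

def gs_score(all_values, aml_values):
    """Signal-to-noise score: (mean_ALL - mean_AML) / (sd_ALL + sd_AML).

    Positive scores indicate preferential expression in ALL,
    negative scores preferential expression in AML.
    """
    spread = stdev(all_values) + stdev(aml_values)
    if spread == 0:
        return 0.0  # no variation in either class: treat as uninformative
    return (mean(all_values) - mean(aml_values)) / spread

# Hypothetical expression levels for two genes (not from the real data set).
gene_a = gs_score(all_values=[900, 1100, 1000], aml_values=[100, 300, 200])
gene_b = gs_score(all_values=[100, 300, 200], aml_values=[900, 1100, 1000])
print(gene_a, gene_b)  # gene_a is positive (ALL), gene_b is negative (AML)
```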

Transposing the data table so that records are now patients and columns are now genes (6817 dimensions), RadViz™ algorithmically positions each patient glyph according to the gene expression pattern of that patient sample. A Smart Clustering algorithm is applied to order the genes in a statistically rational manner. The Training Set shows very good discrimination between the ALL and AML classes, with only one potential misclassification. When the same layout is applied to the Test Set, there again appears to be a polar discrimination between ALL and AML, although the separation is less distinct.

Figure 6: Patients x Genes
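For readers unfamiliar with radial layouts of this kind, here is a minimal sketch of the generic RadViz projection: each dimension gets an anchor on the unit circle, and each record is placed at the average of the anchors weighted by its normalized values. This is the textbook form of the technique, not AnVil's RadViz™ product, and it leaves out the Smart Clustering step that orders the dimensions. The function name and the toy records are assumptions.

```python
import math

def radviz(records):
    """Project equal-length numeric rows to 2-D points inside the unit circle."""
    n_dims = len(records[0])
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    # One anchor per dimension, evenly spaced around the unit circle.
    anchors = [(math.cos(2 * math.pi * j / n_dims),
                math.sin(2 * math.pi * j / n_dims)) for j in range(n_dims)]
    points = []
    for row in records:
        # Min-max normalize each value to [0, 1] within its dimension.
        weights = [(v - lo) / sp if sp > 0 else 0.0
                   for v, lo, sp in zip(row, lows, spans)]
        total = sum(weights)
        if total == 0:
            points.append((0.0, 0.0))  # no pull from any anchor: plot at center
            continue
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((x, y))
    return points

# Toy records with four dimensions (hypothetical values).
points = radviz([[10, 0, 0, 0], [0, 10, 0, 0], [5, 5, 5, 5]])
```

A record dominated by one dimension is pulled to that dimension's anchor, which is why the order of the anchors around the circle affects how well classes separate.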

Predictor Development

Figure 7: PatchGrid view of 25 user selected genes

In this display, the 38 patients are arranged by rows and the 6817 genes by columns, with the data presented as the Affymetrix Absent/Marginal/Present call discretization of gene expression level. Genes that are always absent or always present in all patients of either leukemia class are of little interest in creating a predictor, but those positioned at the Absent/Present and ALL/AML boundaries are good candidates for predictor set selection.
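One way to make the selection idea above concrete is to compare how often a gene is called Present in each class. In the source the selection is done visually in PatchGrid; the sketch below is only an assumed programmatic analogue, and the `min_gap` threshold, the function names and the toy calls are hypothetical.

```python
def present_fraction(calls):
    """Fraction of calls that are 'P' (Present) among 'A', 'M' and 'P' calls."""
    return sum(c == "P" for c in calls) / len(calls)

def boundary_genes(calls_by_gene, labels, min_gap=0.6):
    """Pick genes whose Present rate differs strongly between ALL and AML.

    calls_by_gene maps a gene name to one call per patient;
    labels gives 'ALL' or 'AML' for each patient in the same order.
    """
    picked = []
    for gene, calls in calls_by_gene.items():
        all_calls = [c for c, lab in zip(calls, labels) if lab == "ALL"]
        aml_calls = [c for c, lab in zip(calls, labels) if lab == "AML"]
        gap = present_fraction(all_calls) - present_fraction(aml_calls)
        if abs(gap) >= min_gap:
            picked.append((gene, "ALL" if gap > 0 else "AML"))
    return picked

# Toy calls for six patients (hypothetical, not from the real data set).
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
calls_by_gene = {
    "g1": list("PPPAAA"),  # present in ALL only: candidate
    "g2": list("PPPPPP"),  # always present: uninformative
    "g3": list("AAMPPP"),  # present in AML only: candidate
    "g4": list("AAAAAA"),  # always absent: uninformative
}
picked = boundary_genes(calls_by_gene, labels)
```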

These 25 selected genes are shown here in the familiar Parallel Coordinates display. As expected, the genes selected from the ALL zone in RadViz™ are highly expressed in ALL in Parallel Coordinates, while the genes selected from the AML zone in RadViz™ are highly expressed in AML. It would have been impossible to select a predictor set from a Parallel Coordinates display of all 6817 genes.

Figure 8: Parallel coordinates view alongside RadViz view of the same 25 user selected genes

Figure 9: Visual classifier created using PatchGrid and RadViz™

The 35 gene predictor set was then used to create a visual classifier. In this RadViz™ analytical visualization, the white pie wedge demarcates the space occupied by the AML Training Set samples. Gene expression data "loaded" into this visual classifier are classified as AML or ALL by inclusion in or exclusion from the pie wedge space, respectively. When applied to the Test Set data, this visual predictor performs better than the analytical predictor constructed by Golub & Slonim et al.
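The wedge rule can be read as a simple angular test on a sample's plotted position. The following sketch assumes the wedge is centered on the plot origin and defined by two angles; the angles and points used here are hypothetical, since the source does not give the wedge boundaries.

```python
import math

def in_wedge(point, start_deg, end_deg):
    """True if the point's angle about the origin lies within the wedge,
    measured counterclockwise from start_deg to end_deg."""
    angle = math.degrees(math.atan2(point[1], point[0])) % 360
    start, end = start_deg % 360, end_deg % 360
    if start <= end:
        return start <= angle <= end
    return angle >= start or angle <= end  # wedge crosses the 0 degree line

def classify(point, start_deg, end_deg):
    """AML if the projected sample falls inside the wedge, otherwise ALL."""
    return "AML" if in_wedge(point, start_deg, end_deg) else "ALL"

# Hypothetical wedge from 30 to 60 degrees and two projected samples.
inside = classify((0.5, 0.5), 30, 60)    # angle of 45 degrees
outside = classify((-0.5, 0.5), 30, 60)  # angle of 135 degrees
```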

Analytical Techniques

These genes were obtained by using a Genetic Algorithm to reduce a 76-gene predictor set constructed by a purely analytical method known as Principal Uncorrelated Record Selection (PURS™). PURS™ reduces the dimensions of this data set by eliminating genes that correlate with others, so that the final gene set is composed of genes uncorrelated to pre-determined parameters. The three genes clearly distinguish the ALL and AML clusters, as shown by the white border. The ALL samples are, however, further subclustered in meaningful ways: these subclusters do, in fact, correspond to the B-Cell ALL and T-Cell ALL cancer subtypes. Each of these genes needs to be explored for biological significance, especially in light of the possible random selection occurring (see Conclusion below).

Figure 10: Graphing PURS™ results
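The details of PURS™ are not given in the text. As a stand-in, the sketch below shows a generic greedy filter that keeps a gene only if its expression profile is not strongly correlated with any gene already kept. It illustrates the general idea of removing correlated genes and should not be taken as the PURS™ algorithm; the threshold, the function names and the profiles are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(vx * vy)

def drop_correlated(profiles, max_abs_r=0.9):
    """Greedily keep genes whose profile has |r| <= max_abs_r with every kept gene."""
    kept = []
    for gene, values in profiles.items():
        if all(abs(pearson(values, profiles[k])) <= max_abs_r for k in kept):
            kept.append(gene)
    return kept

# Toy expression profiles across four samples (hypothetical).
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],  # perfectly correlated with g1, so it is dropped
    "g3": [4, 1, 3, 2],  # weakly correlated with g1, so it is kept
}
kept = drop_correlated(profiles)
```

The result of a greedy filter depends on the order in which genes are visited, which is one reason a real method would need a more principled selection rule.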

Other analytical classifiers used to obtain the results below included:

  • Neural Net
  • Support Vector Machines
  • Naïve Bayes
  • Logistic Regression
  • IB3 (K-nearest neighbor)

Predictor Testing and Comparison

Figure 11: Comparison of clustering techniques

To score the efficacy of many different classifiers, multiple analytical methods (NN - Neural Net, SVM - Support Vector Machines, NB - Naive Bayes, LR - Logistic Regression, K - K-Means Clustering) were employed. This chart depicts results for 8 different classifiers, derived from both analytic and visual methods. Principal Uncorrelated Record Selection (PURS) classifiers are found to classify both Training and Test samples perfectly for 3 of the 5 methods.

The 76 gene PURS analytic classifier can be depicted in RadViz™ to demonstrate a delineation of ALL and AML samples in 76-dimensional space. This same boundary also correctly classifies the 34 samples in the Test Set.
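Scoring a classifier on a held-out Test Set, as in the comparison above, can be sketched with a small k-nearest-neighbor example. This is a generic illustration of train/test scoring using one of the listed method families; it is not the IB3 implementation or any of the classifiers AnVil actually ran, and the two-dimensional data are made up.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test samples whose predicted label matches the known label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)

# Toy two-gene expression data (hypothetical).
train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
test_x = [(1.5, 1.5), (8.5, 8.5)]
test_y = ["ALL", "AML"]
score = accuracy(train_x, train_y, test_x, test_y)
```

Running the same `accuracy` function over several classifiers and gene sets would give a comparison table of the kind shown in Figure 11.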

Conclusion

Chemotherapy treatment outcomes (success or failure of remission) are known for 15/72 patients:

  • 15 AML from a single site
  • 7 Successes
  • 8 Failures

From the remaining data set, AnVil's Discovery process predicts that chemotherapy will succeed for 8 patients and fail for 2 patients.

Figure 12: 76 gene set predictor for the success or failure of chemotherapy treatment outcome for Acute Myeloid Leukemia

Of the many groups that have published analyses of the Golub & Slonim et al. data set, none have presented a gene predictor set diagnostic of clinical treatment outcome. From the 15 samples where outcome is known, AnVil has applied a 76-gene PURS set that predicts the success or failure of chemotherapy treatment for AML. Known cases of successful treatment (GREEN) are distinguished from known cases where treatment failed and patients did not go into remission (RED). A further set of 10 AML patients, for whom treatment outcome is not available, is plotted to show a prediction of outcome.

The 18-19 Dec 2000 Critical Assessment of Techniques for Microarray Data Analysis (CAMDA00) meeting presented the Golub & Slonim et al. data set as a common starting point for comparison of microarray analytical methods. More than a dozen groups evaluated the data set, yet few have presented the genes composing their derived Class Predictor sets.

We have used a visualization to depict, in classical Parallel Coordinates style, the relationship between gene predictor sets amongst the published groups and AnVil. AnVil's set is the union of 6 separate sets derived from different methods.

As groups variably applied pre-processing (thresholding, filtering, etc.) and different gene selection algorithms in construction of predictor sets, common as well as unique genes have been identified in each predictor set. By categorizing and ranking the various methods AnVil uses to generate predictor sets, AnVil can advise on appropriate genes for drug targets and diagnostic applications.
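The comparison of predictor sets across groups comes down to set operations: the union of all sets, the genes common to every set, and the genes unique to one set. A minimal sketch follows, with hypothetical group names and gene identifiers standing in for the published sets.

```python
from collections import Counter

def gene_overlap(predictor_sets):
    """Return (union, genes shared by all sets, genes unique to each set)."""
    counts = Counter(g for genes in predictor_sets.values() for g in set(genes))
    union = set(counts)
    shared = {g for g, c in counts.items() if c == len(predictor_sets)}
    unique = {name: {g for g in genes if counts[g] == 1}
              for name, genes in predictor_sets.items()}
    return union, shared, unique

# Hypothetical predictor sets (placeholder gene identifiers).
predictor_sets = {
    "group_a": ["g1", "g2", "g3"],
    "group_b": ["g2", "g3", "g4"],
    "anvil": ["g3", "g5"],
}
union, shared, unique = gene_overlap(predictor_sets)
```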

To find out more about how AnVil can help you forge the optimal path to commercial discovery, contact us at: info@anvilinfo.com.

AnVil partners with companies in the pharmaceutical, biotechnology, and other life sciences to extract knowledge from their data and enable them to turn it into commercially valuable results.

Copyright © 2002 AnVil, Inc. All rights reserved.
