4th Floor, 600
Suffolk Street
Lowell, MA 01854
info@anvilinformatics.com
http://www.anvilinformatics.com
The insights that result from analyses of large
microarray datasets represent an important new focus in the drug discovery
process. In this poster we demonstrate
the application of two machine learning techniques, supervised and unsupervised
learning, to microarray data.
Additionally, we present new techniques that facilitate clustering
comparisons using visual and analytical approaches.
The microarray data sets we used are publicly
available and result from various yeast gene experiments. Our purpose was to
demonstrate the value of applying high dimensional analytic and visual data
mining techniques to discover trends and patterns in the data.
In our analyses, we compare many classification and
clustering techniques on both yeast diauxic shift data and yeast cell cycle
data. Application of novel visualization techniques (Parallel Coordinates,
Circle Segments, Radviz, etc.) to both datasets helps us gain insights into the
gene expression data.
Supervised Clustering of Microarray Data
Microarray experiments
typically lead to the analysis of thousands of gene expression profiles. Genes of similar function often have similar
expression profiles. This attribute can
be exploited by creating classifiers that are trained on the expression
profiles of genes with known function, and applied to unknown genes in order to
classify them based on expression profile.
We performed several
experiments that built classifiers from 35 genes with 5 distinct expression
profiles. The information came from publicly available yeast gene expression
data that was generated from microarray experiments. Some of the results are shown in this poster.
Most classifiers, such
as a Decision Tree, Neural Network and Naive Bayes, can classify
the 35 “training” genes perfectly. Once trained, these tools can be used to
automatically classify the 6000 remaining unclassified genes based on the
characteristics of their expression profiles.
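The train-then-classify workflow described above can be sketched as follows. This is a minimal illustration, not the poster's implementation: it uses a small Gaussian Naive Bayes written from scratch, the function names (`train_gaussian_nb`, `classify`) and the variance floor are assumptions made for this example, and the four profiles are synthetic toy values rather than the yeast data.

```python
import math
from collections import defaultdict

def train_gaussian_nb(profiles, labels, var_floor=1e-3):
    """Fit a per-class log prior plus mean and variance at each time point."""
    by_class = defaultdict(list)
    for profile, label in zip(profiles, labels):
        by_class[label].append(profile)
    total = len(profiles)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor the variance so a tiny training set cannot produce zero variance.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / total), means, variances)
    return model

def classify(model, profile):
    """Return the class with the highest log posterior for one profile."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(profile, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))

# Synthetic toy profiles over 7 time points (NOT the poster's data).
train_profiles = [
    [0.1, 0.3, 0.8, 1.5, 2.0, 2.4, 2.9],  # rising
    [0.0, 0.4, 0.9, 1.4, 2.1, 2.5, 3.0],  # rising
    [2.8, 2.5, 2.0, 1.4, 0.9, 0.3, 0.1],  # falling
    [3.0, 2.4, 2.1, 1.5, 0.8, 0.4, 0.0],  # falling
]
train_labels = ["rising", "rising", "falling", "falling"]

model = train_gaussian_nb(train_profiles, train_labels)
training_predictions = [classify(model, p) for p in train_profiles]
unknown_prediction = classify(model, [0.2, 0.2, 1.0, 1.3, 1.9, 2.6, 2.7])
```

Classifying the training profiles perfectly, as here, says little about how the model behaves on unseen genes; a held-out set would be needed for that.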
A Kohonen
self-organizing map can be used to cluster the 35 “training” genes, and
the resulting map can then be used to classify the unknown
genes. The following four pictures show
the expression profiles of the “training” genes, the Kohonen map built from the
genes, and the expression profiles of the unknown genes after being classified
with the Kohonen Map.

A Parallel Coordinates visualization
displaying gene expression levels for 35 genes with distinct expression
profiles. The genes were classified
based on their expression profile, which is shown plotted over the 7 measured
time intervals.

A Kohonen self-organizing map clusters the 35
genes with known class by computing a new pair of axes and locating the genes
according to the similarity of their expression profiles.
The Kohonen map can then be used as a classifier if the operator
designates which clusters correspond to which gene function.
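A rough sketch of how a self-organizing map can be trained on labeled profiles and then reused as a classifier is given here. Several things are assumptions for illustration only: the 3x3 grid, the decay schedules, the function names, and the automatic node labeling in `label_nodes`, which stands in for the operator's manual designation of clusters. The data are synthetic.

```python
import math
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(weights, profile):
    """Grid position of the node whose weight vector is closest to the profile."""
    return min(weights, key=lambda node: squared_distance(weights[node], profile))

def train_som(data, rows=3, cols=3, epochs=200, lr0=0.5, radius0=1.5, seed=0):
    rng = random.Random(seed)
    # Start every node at a randomly chosen training profile.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays toward 0
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks
        for profile in rng.sample(data, len(data)):  # one shuffled pass
            bmu = best_matching_unit(weights, profile)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                pull = lr * math.exp(-grid_d2 / (2.0 * radius ** 2))
                for i, value in enumerate(profile):
                    w[i] += pull * (value - w[i])
    return weights

def label_nodes(weights, data, labels):
    """Stand-in for the operator: tag each node with the label of the
    training profile nearest to its weight vector."""
    node_labels = {}
    for node, w in weights.items():
        nearest = min(range(len(data)), key=lambda i: squared_distance(w, data[i]))
        node_labels[node] = labels[nearest]
    return node_labels

def classify(weights, node_labels, profile):
    """Classify a profile by the label of its best matching node."""
    return node_labels[best_matching_unit(weights, profile)]

# Synthetic toy profiles over 7 time points (NOT the poster's data).
data = [
    [0.1, 0.3, 0.8, 1.5, 2.0, 2.4, 2.9],
    [0.0, 0.4, 0.9, 1.4, 2.1, 2.5, 3.0],
    [2.8, 2.5, 2.0, 1.4, 0.9, 0.3, 0.1],
    [3.0, 2.4, 2.1, 1.5, 0.8, 0.4, 0.0],
]
labels = ["rising", "rising", "falling", "falling"]

som = train_som(data)
node_labels = label_nodes(som, data, labels)
rising_like = classify(som, node_labels, [0.2, 0.2, 1.0, 1.3, 1.9, 2.6, 2.7])
falling_like = classify(som, node_labels, [2.9, 2.6, 1.9, 1.5, 1.0, 0.2, 0.2])
```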

After classification based on the genes of known
expression profile, the Kohonen self-organizing map shows the distribution
of over 6000 microarray records (genes).

A Parallel Coordinates visualization shows
the expression profiles of the 6000 genes after classification by the Kohonen
self-organizing map.
Unsupervised Clustering of Microarray Data
There are many clustering techniques that can be
applied to microarray data, such as Hierarchical, K-Means and Self-Organizing
Maps. We applied several
clustering techniques to publicly available microarray yeast gene expression
data. The expression levels were
measured over two cell cycles and 800 genes were identified algorithmically as
being cell cycle regulated. These genes
were classified into 5 groups based on the cell cycle phase of their
expression. We analyzed and visualized
the expression levels of the 800 genes using several unsupervised clustering
techniques; a few excerpts of these analyses are shown.
In the following pictures we show several traditional and novel techniques for visualizing data once it has been clustered or classified, and then present the results of two unsupervised clustering techniques.
If the data has already been clustered, graphs such as this average expression profile plot can be used to present summary information about the characteristics of each cluster. Here we see the average expression profile, with standard deviation bars added, plotted for the 5 Peak clusters in the cell cycle data. The clusters clearly demonstrate the cyclic nature of the data set.
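The summary behind an average expression profile plot is a per-cluster mean and standard deviation at each time point. A minimal sketch, assuming each gene is a list of expression values and each gene has a cluster id; the function name, the cluster ids "A" and "B", and the numbers are illustrative only:

```python
from collections import defaultdict
from statistics import mean, pstdev

def average_profiles(profiles, cluster_ids):
    """Mean and population standard deviation per cluster and time point."""
    grouped = defaultdict(list)
    for profile, cid in zip(profiles, cluster_ids):
        grouped[cid].append(profile)
    summary = {}
    for cid, rows in grouped.items():
        columns = list(zip(*rows))  # one tuple of values per time point
        summary[cid] = {
            "mean": [mean(col) for col in columns],
            "std": [pstdev(col) for col in columns],
        }
    return summary

# Synthetic toy values (NOT the poster's data).
profiles = [
    [1.0, 0.0, -1.0, 0.0],
    [3.0, 0.0, -3.0, 0.0],
    [0.0, 2.0, 0.0, -2.0],
]
cluster_ids = ["A", "A", "B"]
summary = average_profiles(profiles, cluster_ids)
```

The `std` values are what the standard deviation bars would be drawn from; `statistics.stdev` could be swapped in if the sample rather than population deviation is wanted.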

A novel extension to the average expression profile
plot is this Histogram Matrix visualization. The 5 Peak phase clusters are displayed as a sequence of sixteen
histograms for each cell cycle. Rather than
providing standard deviation bars, this visualization presents all of the
distribution information using a histogram at each time point.
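The data behind one row of such a matrix can be sketched as a histogram of expression values at each time point for the genes in a single cluster. The bin count, value range, and function name below are assumptions for illustration, and the values are synthetic:

```python
def histogram_matrix(profiles, n_bins, lo, hi):
    """Return one list of bin counts per time point for one cluster's genes.

    Values outside [lo, hi] are clamped into the first or last bin.
    """
    width = (hi - lo) / n_bins
    n_times = len(profiles[0])
    matrix = [[0] * n_bins for _ in range(n_times)]
    for profile in profiles:
        for t, value in enumerate(profile):
            b = int((value - lo) / width)
            b = min(max(b, 0), n_bins - 1)
            matrix[t][b] += 1
    return matrix

# Three synthetic genes measured at two time points (NOT the poster's data).
cluster_profiles = [
    [-1.0, 0.5],
    [-0.5, 0.5],
    [1.0, -1.0],
]
counts = histogram_matrix(cluster_profiles, n_bins=4, lo=-1.0, hi=1.0)
```

Repeating this for each cluster and placing the per-time-point histograms side by side gives the matrix layout described above.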

Another powerful way of examining classified or
clustered data is with an interactive Parallel Coordinates visualization. This parallel coordinates visualization is
being used to examine all of the gene expression values corresponding to two of
the cell cycle clusters. The phase difference
between the expression times of the two clusters can be clearly seen.

If the data has not been clustered, a common approach is to apply a Hierarchical Agglomerative Clustering method and to visualize the results with the familiar Dendrogram visualization. A colored patch grid corresponding to positive (green) and negative (red) data values enhances the visual analysis and comprehension of the clustering.
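For readers unfamiliar with the method, agglomerative clustering repeatedly merges the two closest clusters until the desired number remains; the recorded merge order and distances are what a dendrogram draws. The sketch below assumes Euclidean distance and average linkage, neither of which is stated in the poster, and uses synthetic points:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles, left, right):
    """Mean pairwise distance between two clusters of row indices."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in left for j in right)
    return total / (len(left) * len(right))

def agglomerate(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns the final clusters (lists of row indices) and the merge history.
    """
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(profiles, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Synthetic toy points (NOT the poster's data).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
clusters, merges = agglomerate(points, n_clusters=2)
```

This brute-force search is only practical for small inputs; it is meant to show the idea, not to scale to thousands of genes.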

Another way
to cluster data is to use Polyviz, a proprietary high-dimensional
clustering technique based on a spring force paradigm. This Polyviz visualization clustered the
microarray data using the expression values and is colored by the Peak phase
classification.

The Kohonen self-organizing map is another
powerful clustering technique that can be applied to unclustered data. This clustering of the microarray data shows
the relationship between gene expression levels (based on cluster location) and
the Peak classification column (used to color the points).
Cluster Comparison Techniques
Scientists have many clustering techniques at their disposal, each with its own set of advantages and disadvantages. How can scientists determine which clustering technique is best for their data? How can the results of two different clustering algorithms be meaningfully compared? The answers to these questions are ongoing research issues, but we present here three visual approaches and one analytic approach towards answering them.
This custom visualization allows one to visually
compare the results of a K-means clustering technique that generates 30
clusters (on the left) with a technique that produced 5 clusters (on the
right). Poly-lines are used to identify
an individual record’s location within each of the results. The visual comparison allows one to gain a
meaningful understanding of how the cluster results differ.

This visualization of two clustering techniques uses
a jittered scatter plot to enable the comparison of the clustering
results. Five clusters from one technique
(along the Y-axis) are compared with 12 clusters from another technique (along
the X-axis). If each X-axis cluster
were a pure subset of a single Y-axis cluster, there would be only one clump
per vertical line. In this case only
the 12th cluster on the X-axis is pure, while the 1st is
nearly so.
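The purity check that the scatter plot supports visually can also be done by counting. The sketch below builds a contingency table between two clusterings and lists the X clusters whose members all fall into a single Y cluster; the function names and the toy labels are assumptions for illustration.

```python
from collections import Counter, defaultdict

def contingency(labels_x, labels_y):
    """For each X cluster, count how many of its records land in each Y cluster."""
    table = defaultdict(Counter)
    for x, y in zip(labels_x, labels_y):
        table[x][y] += 1
    return table

def pure_clusters(labels_x, labels_y):
    """X clusters whose records all belong to one Y cluster."""
    table = contingency(labels_x, labels_y)
    return sorted(x for x, counts in table.items() if len(counts) == 1)

# Synthetic cluster assignments for six records (NOT the poster's data).
labels_x = [1, 1, 2, 2, 3, 3]
labels_y = ["a", "a", "a", "b", "b", "b"]
pure = pure_clusters(labels_x, labels_y)
```

Here X clusters 1 and 3 are pure, while cluster 2 is split across both Y clusters and would show two clumps on its vertical line.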
The Color Correlated Column visualization is
another custom visualization for comparing clustering results. This visualization
allows one to simultaneously compare the results of over 20 different
clusterings of the data. The records
are sorted vertically by the Peak class, which is represented by the colored
bar on the right. The predicted class
is represented with a grayscale. If the
change in grayscale value corresponds to the change in color, then there is a
strong correlation between the true and predicted class.
Comparing Clustering Techniques

| Rank | Clustering Technique | Data | Number of Clusters | % correct (method 1) | % correct (method 2) | % correct (method 3) | % correct (method 4) | % correct (maximum) |
|---|---|---|---|---|---|---|---|---|
| 1 | Kohonen 3 | Norm | 30 | 72.6 | 69.1 | 65.7 | 67.8 | 72.6 |
| 2 | Kohonen 1 | Norm | 30 | 72.3 | 69.5 | 65.2 | 67.7 | 72.3 |
| 3 | Kohonen 2 | Norm | 30 | 71.8 | 66.4 | 62.3 | 65.2 | 71.8 |
| 4 | C K-means 1 | Norm | 30 | 71.1 | 66.4 | 59.7 | 65.1 | 71.1 |
| 5 | SOM 4 | Original | 25 | 70.1 | 61.9 | 59.9 | 63.2 | 70.1 |
| 6 | SOM 12 | Original | 27 | 69.3 | 64.0 | 60.1 | 63.0 | 69.3 |
| 7 | Kohonen 2 | Original | 19 | 68.5 | 64.3 | 58.6 | 62.7 | 68.5 |
| 8 | C K-means 1 | Original | 30 | 67.2 | 63.6 | 55.0 | 61.9 | 67.2 |
| 9 | Kohonen 1 | Original | 19 | 67.1 | 59.8 | 53.6 | 58.8 | 67.1 |
| 10 | Kohonen 3 | Original | 18 | 66.8 | 65.5 | 56.4 | 63.9 | 66.8 |
| 11 | C K-means 2 | Norm | 5 | 66.8 | 61.1 | 56.4 | 58.6 | 66.8 |
| 12 | SOM 7 | Norm | 12 | 62.5 | 57.8 | 49.6 | 52.8 | 62.5 |
| 13 | M K-means 1 | Original | 5 | 59.7 | 51.8 | 48.4 | 54.7 | 59.7 |
| 14 | Dendrogram 2 | Original | 6 | 58.8 | 54.5 | 46.8 | 47.5 | 58.8 |
| 15 | K-means 2 | Original | 5 | 55.8 | 50.0 | 47.8 | 54.5 | 55.8 |
| 16 | SOM 7 | Original | 5 | 54.8 | 51.8 | 42.8 | 55.1 | 55.1 |
| 17 | Dendrogram 1 | Original | 6 | 45.6 | 43.1 | 32.7 | 33.4 | 45.6 |
| 18 | SOM 12 | Norm | 30 | 44.2 | 38.5 | 31.0 | 36.0 | 44.2 |
| 19 | M K-means 2 | Original | 30 | 43.7 | 36.6 | 29.3 | 35.9 | 43.7 |
| 20 | M K-means 3 | Original | 17 | 39.5 | 30.8 | 23.5 | 30.2 | 39.5 |
| 21 | random | Original | 6 | 37.5 | 16.3 | 20.0 | 22.9 | 37.5 |

The results of several clustering techniques are compared analytically with the
Peak class in this example. For a given technique, each generated cluster was treated
as a subset of one of the true classes. The class assigned to each cluster
was the majority “truth” class among the genes in that cluster.
After each cluster was categorized, the resulting accuracies were calculated.
The total percent correct and the average accuracy for each class were
calculated and are presented in the method columns.
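The majority-vote scoring just described can be sketched as follows. The poster does not define the four methods individually, so this example computes only the two quantities named in the text, total percent correct and the mean of the per-class accuracies. The function name and the toy labels are assumptions for illustration.

```python
from collections import Counter, defaultdict

def majority_vote_accuracy(cluster_ids, true_classes):
    """Assign each cluster its majority true class, then score the assignment.

    Returns (total percent correct, mean per-class percent correct,
    mapping of cluster id to assigned class).
    """
    members = defaultdict(list)
    for cid, cls in zip(cluster_ids, true_classes):
        members[cid].append(cls)
    assigned = {cid: Counter(classes).most_common(1)[0][0]
                for cid, classes in members.items()}
    predicted = [assigned[cid] for cid in cluster_ids]

    correct = sum(p == t for p, t in zip(predicted, true_classes))
    total_pct = 100.0 * correct / len(true_classes)

    per_class = {}
    for cls in set(true_classes):
        idx = [i for i, t in enumerate(true_classes) if t == cls]
        per_class[cls] = 100.0 * sum(predicted[i] == cls for i in idx) / len(idx)
    avg_class_pct = sum(per_class.values()) / len(per_class)
    return total_pct, avg_class_pct, assigned

# Synthetic assignments for eight records (NOT the poster's data).
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2]
true_classes = ["A", "A", "B", "B", "B", "B", "A", "A"]
total_pct, avg_class_pct, assigned = majority_vote_accuracy(cluster_ids, true_classes)
```

Because every cluster is scored against its own majority class, a technique that produces more clusters tends to score higher under this measure, which is worth keeping in mind when comparing rows with very different cluster counts.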