Visualization Directed Data Mining of NCI Anti-Cancer Drug Activities on Cancer Cell Lines

 

Patrick Hoffman, Dave Pinkney

AnVil Informatics, Inc.

                                                600 Suffolk St.

                                                Lowell, MA 01854

 

 

Kenneth A. Marx, Georges G. Grinstein

 

                                                The IVPR and The CIB

                                                Departments of Chemistry and Computer Science

                                                University of Massachusetts Lowell

                                                Lowell, MA  01854

 

 

Abstract

 

Visualization and machine learning are playing increasingly important roles in analysis of human genome project data and in the process of drug discovery.  In this work we demonstrate the symbiotic relationship of these two techniques applied to the problem of understanding and predicting drug mechanism of action against cancer cell lines.   A number of visualizations (Radviz, Polyviz, Dendrograms, and Parallel Coordinates) helped guide machine learning algorithms (Neural net classifiers, Decision trees, Kohonen SOM, and Hierarchical clustering) to predict biological properties from measured ones.  We also discuss the problem of predicting drug biological properties from chemical structure information. 

 

Introduction

 

        Organic chemists have been synthesizing vast numbers of different organic molecules in research laboratories for many decades (well over 15 million have been made to date). Over the same time frame, natural products chemists and medicinal chemists have been isolating large numbers of natural products compounds from diverse sources. These sources include: soil microbes, single celled organisms and invertebrates found in the oceans, and more recently plants used by indigenous peoples in traditional healing practices, to name a few. Huge reservoirs of potential sources remain unexplored, such as in the spectrum of rain forest plants and animals.  And a diversity of new locations have been identified where novel xenobiotic organisms, possessing previously uncatalogued compounds, have begun to be found.

Traditionally, pharmaceutical companies have created and maintained large libraries of such chemical compounds (typically near or greater than one million compounds in size) to carry out routine testing for their activities against diseases such as cancer. Both the increasing size and complexity of traditional libraries and the recent development of large synthetic libraries created rapidly by combinatorial chemistry methods have placed a premium on computational methods for analyzing the large databases that result from tests carried out on these libraries. In the public sector, the National Cancer Institute has a large screening program in place to test tens of thousands of compounds for their biological activities against cancer (http://dtp.nci.nih.gov/docs/cancer/cancer_data.html).

        Predicting biological activity directly from the chemical structure of any compound is one of the most sought after goals in the pharmaceutical industry and is a focus of academic research as well. Success in this predictive capability would enable the design of drugs without time consuming, costly exploratory testing and research. Individual drugs are estimated to cost the pharmaceutical industry anywhere from  $300-500 million to develop. Therefore, any advance in direct prediction of activity or the selection of superior drug candidates for further testing represents a potentially valuable capability that could result in considerable cost savings to this industry and to society through reduced cost of medications.

 

 The Problem and Datasets

        We have been engaged in several data exploration research projects. One was to use the NCI chemical database to help understand the structure-activity relationship of drugs in relation to their anti-cancer mechanisms.  An easier goal is to build classifiers to predict unknown biological properties from measured biological properties.  Here we predicted the biochemical Mechanism of Action (MOA) of individual drugs in the NCI chemical database from their GI50 results. The GI50 is defined as the concentration of a drug required to inhibit the growth of a given cancer cell line by 50%. NCI chose the 60 cell lines based on a number of factors. One was their ability to effectively screen different MOA classes of chemicals. Another was that, as a group, they effectively represent the biological spectrum of human cancer types that NCI wished to test activity against.

            Another dataset we investigated was a chemical fingerprint dataset (provided by CambridgeSoft’s ChemOffice database) in which the compounds found in the NCI GI50 dataset have calculated values used as chemical and structural descriptors.

 

MIVAC (Multidimensional Information Visualization and Analysis Center)

In this report, we use MIVAC, an integrated multi-dimensional visualization and data mining platform (AnVil Informatics, Inc.), which was developed to facilitate the exploration of datasets within a single flexible environment. We analyze the NCI GI50 and chemical fingerprint datasets with MIVAC to guide analytic approaches to relate drug MOA to chemical structure and to biological activity. The dataset used is from the NCI Anti-Cancer Agent Mechanism Database (http://dtp.nci.nih.gov/docs/cancer/searches/standard_agent.html).

 

 

Visualizing the NCI GI50 Dataset

 

Radviz Visualization

        A valuable first step in data exploration is to effectively visualize the data.  This helps one to get a feel for the possible patterns, limits, or correlations embedded in the data.  One such useful visualization is Radviz  [Hof97], which realizes a high-dimensional clustering of the data using spring forces.  Each column in the dataset corresponds to one dimension represented by an imaginary spring fixed to the perimeter of a circle and radially-oriented to the center of the circle.  Each record is attached to each of the springs, the net sum of the resulting individual spring forces (based on each record value) from all of the data dimensions is computed and thus controls the eventual placement of the record in the plane.  Radviz can handle many spring-represented data dimensions and for this reason we initially used this visualization to look at the compounds and the 60 GI50 dimension values.
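The spring model described above can be sketched in a few lines. This is a minimal illustration, not the MIVAC implementation: it assumes each column is min-max normalized to [0, 1], that the anchors are spaced evenly on the unit circle, and that the spring stiffness equals the normalized value. Under those assumptions the equilibrium position of a record is the weighted mean of the anchor positions. The function name `radviz_project` is hypothetical.

```python
import math


def radviz_project(records):
    """Project rows of numbers onto the plane using the Radviz spring model.

    Assumptions (not taken from the paper): columns are min-max normalized
    to [0, 1] and anchors are spaced evenly on the unit circle.
    """
    n_dims = len(records[0])
    # Anchor j sits at angle 2*pi*j/n_dims on the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / n_dims), math.sin(2 * math.pi * j / n_dims))
        for j in range(n_dims)
    ]

    # Min-max normalize every column so that all springs are comparable.
    cols = list(zip(*records))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]

    points = []
    for row in records:
        weights = [(v - lo) / span for v, lo, span in zip(row, lows, spans)]
        total = sum(weights)
        if total == 0:
            points.append((0.0, 0.0))  # no spring pulls: stay at the center
            continue
        # Spring equilibrium: weighted mean of the anchor positions.
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((x, y))
    return points


# Toy data with 4 dimensions: a record dominated by one dimension lands on
# that dimension's anchor; a record with equal values lands at the center.
toy_points = radviz_project([[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]])
```

One consequence of this model is that records whose values are proportional to each other map to the same point, which is why Radviz is usually read as a clustering aid rather than as an exact projection.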

In Figure 1 we display 123 compounds, each tested on the 60 cell lines. Each compound is colored by its known MOA type. For all of these 123 compounds the MOA has been determined experimentally. The 6 MOA types are listed below, each with a brief description of its molecular target inside the cell.

 

The 6 known Mechanisms of Action are:

 

    1-Alkylating Agents

    2-Topoisomerase I inhibitors

    3-Topoisomerase II inhibitors

    4-RNA/DNA antimetabolites

    5-DNA antimetabolites

    6-Antimitotic Agents.

 

 

Figure 1 RADVIZ Visualization of 60 data dimensions corresponding to GI50 tests of each compound on the 60 cell lines, colored by the compound MOA type.

 

 

 

It is clear from Figure 1 that compounds from the 6 MOA classes cluster. The MOA 2 and MOA 6 clusters are the most clearly separated in this 60 dimensional Radviz representation. On the other hand, MOA 1, MOA 3 and MOA 5 clusters demonstrate a significant degree of overlap.

 

Visualization of Dendrogram and Kohonen Map Clustering

The clustering by MOA can also be seen in both the hierarchical clustering (Dendrogram) and the Kohonen Self-Organizing Map (SOM) [Koh82], as shown in Figures 2 and 3 below, respectively.

 

Figure 2 Dendrogram clustering of the 123 compounds and 60 cell lines.   The green and red Patch Grid visualization shows the values of the log(GI50) data for each cell line.  Green are positive values and red are negative values.

 

 

Figure 3 Kohonen SOM clustering of the data, which also shows that MOA should be predictable from the GI50 data.

 

Hierarchical clustering or Dendrogram and Kohonen SOM are two commonly used methods of unsupervised machine learning.  It is not always clear which method is best for what types of data.  The Dendrogram has the advantage of presenting visually the clustering whereas the SOM can handle more complex data.
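To make the hierarchical clustering step concrete, the following is a minimal sketch of agglomerative clustering with average linkage and Euclidean distance. Both choices are assumptions made for illustration; the paper does not state which linkage or distance MIVAC uses. The returned merge history is the information a dendrogram draws.

```python
import math


def average_linkage(points):
    """Return the merge history of average-linkage agglomerative clustering.

    Each entry is (members_a, members_b, distance). Reading the list in
    order gives the dendrogram from the leaves up to the root.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy example: two tight pairs that are far from each other.
toy_merges = average_linkage([(0, 0), (0, 1), (10, 10), (10, 11)])
```

This brute-force version recomputes all pairwise distances at every step, so it is suitable only for small datasets such as the 123 compounds discussed here.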

   A classifier for predicting MOA from NCI GI50 data

From the three visualizations available within MIVAC, it seems clear that a useful predictor or classifier for MOA could be built from these data.   We thus built classifiers using a variety of techniques including Decision Tree, Option Tree, Naive Bayes, C5.0 [Qui96], and a standard back propagation neural net. Predicted accuracies were calculated for the different classifiers and ranged from 66% to 95%. Among the most accurate classifiers were the neural nets. An example from one neural net classifier is shown below. In this case, 60 GI50 cell line test input nodes were used. There were 14 hidden nodes and 6 MOA output nodes. This neural net was trained with 2/3 of the data, while 1/3 of the data was held out for validation and to prevent overtraining.  An accuracy of 87.76% was predicted from the validation dataset. In testing on the entire dataset, an accuracy of 94.31% was achieved. These results are preliminary but illustrative. This approach produced accuracy comparable to previous results where cross validation was applied to the neural net training and testing [Wei92]. The relative importance of each of the 60 cell line GI50 values as inputs, in contributing to the classification accuracy of the net, ranged from 0.04 to 0.21. Specific accuracy features are shown below in the Confusion Matrix.

 

Comparing Predicted-MOA with Known MOA

    Correct    116    94.31%
    Wrong        7     5.69%
    Total      123

 

 

Confusion Matrix

                  Predicted MOA
    Known MOA     1     2     3     4     5     6
        1        33     0     1     0     1     0
        2         0    23     0     0     1     0
        3         1     0    14     0     1     0
        4         0     0     0    19     0     0
        5         0     0     1     0    15     0
        6         0     0     1     0     0    12
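As a check on the arithmetic, the summary figures can be recomputed from the confusion matrix above. The sketch below assumes that rows are the known MOA classes and columns the predicted ones; the per-class recall values are derived here and are not reported in the original results.

```python
# GI50 neural net confusion matrix: rows = known MOA 1-6 (assumed),
# columns = predicted MOA 1-6.
confusion = [
    [33, 0, 1, 0, 1, 0],
    [0, 23, 0, 0, 1, 0],
    [1, 0, 14, 0, 1, 0],
    [0, 0, 0, 19, 0, 0],
    [0, 0, 1, 0, 15, 0],
    [0, 0, 1, 0, 0, 12],
]


def summarize(matrix):
    """Return (correct, total, per-class recall) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]
    return correct, total, recall


correct, total, recall = summarize(confusion)
print(f"{correct} of {total} correct ({100 * correct / total:.2f}%)")
for moa, r in enumerate(recall, start=1):
    print(f"MOA {moa}: recall {r:.2f}")
```

The diagonal sums to 116 of 123 compounds, which reproduces the 94.31% figure quoted in the text.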

 

 

Problems with a Classifier

We encountered several problems in building a classifier from this dataset.  In the original published work [Wei92], several compounds were used more than once, tested at different concentration levels.  In addition, over 600 values were missing from the original training set, and these missing values were replaced with the mean value for that specific cell line across the whole training dataset.  It is not clear that this is the best value to use for the missing data.  Another possibility is to use the average value across all cell lines for a given compound. We have generated neural nets using both methods. Another problem with the dataset was its size: the number of data records (123) is too small for building a robust classifier for this complicated function (MOA).  Compounds predicted from this classifier may have a very low probability of being correctly classified.  Also, in the original published work the data values used in training the neural net were modified to be difference values. That is, they were subtracted from the mean value of all the cell lines. This accentuates the differences across the cell lines, rather than placing emphasis on the varying MOAs/potencies of the drug.  We have built neural nets using both the difference values and the original GI50 values. 

Building a classifier for finding MOAs from a large dataset would resolve many of these problems.  In order to alleviate some of these problems we are building many different classifiers, varying the missing value replacements, the data normalization, and multiple data records.  We were unable to effectively evaluate the neural nets generated using the different strategies described above with the small 123 compound dataset available. A larger 32,918 compound dataset is available, but these compounds are of unknown MOA and therefore cannot be used to evaluate the neural nets described above.
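The preprocessing variants discussed in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce the results: missing values are represented as `None`, the table is a list of rows (one per compound, one column per cell line), and the difference value is taken as the compound's own mean across cell lines minus each value. The function names are hypothetical.

```python
def _mean(values):
    """Mean of the non-missing entries (fails if every entry is missing)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def impute_by_cell_line(table):
    """Replace each missing value with the mean of its cell line (column)."""
    col_means = [_mean(col) for col in zip(*table)]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in table
    ]


def impute_by_compound(table):
    """Replace each missing value with the mean of its compound (row)."""
    return [[_mean(row) if v is None else v for v in row] for row in table]


def difference_values(table):
    """Subtract each value from its compound's mean across all cell lines.

    Assumes the table has already been imputed (no None entries).
    """
    return [[_mean(row) - v for v in row] for row in table]


# Toy table: 2 compounds x 3 cell lines, with one missing measurement.
toy = [[1.0, None, 3.0], [2.0, 4.0, 6.0]]
by_line = impute_by_cell_line(toy)      # missing value becomes 4.0
by_compound = impute_by_compound(toy)   # missing value becomes 2.0
```

The two imputation strategies can give quite different values for the same missing entry, as the toy table shows, which is one reason the choice can affect the trained classifier.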

 

Preliminary Neural Net Classifier Results

We trained several neural net classifiers to predict MOA, using as training inputs the 60 cell line GI50 data from the 123 compounds of known MOA.  We then fed 60 cell line GI50 test data for 32,918 compounds of unknown MOA into the trained neural net, in order to predict their MOA values.  Both datasets are available on the NCI web site (http://dtpsearch.ncifcrf.gov). With the various neural nets we found a subset of compounds with very high confidence levels for their predicted MOAs.   Several of these were from the training set, as would be expected.  Many of the others were also found by running the COMPARE algorithm at the NCI website with some of the resulting compounds.   The COMPARE algorithm does a Pearson rank correlation after subtracting out the mean value from a given seed compound.  It returns the compounds with a high correlation to the seed compound.  While the COMPARE program is useful, it does not have the flexibility of a general classifier, and one can only test one compound at a time.  For example, records for four compounds with a predicted MOA 6, of the DNA antimetabolites class, are shown in Figure 4.  This Parallel Coordinate [Ins85] visualization shows for each compound the GI50 values for 22 of the 60 cell lines.  It can be seen that the profiles are very similar, but exceptions occur as pointed out with arrows.  The neural net classified the compounds similarly. Using COMPARE to get the same results would require more work.
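The idea behind a seed-based similarity search can be illustrated with a short sketch. This is not the NCI COMPARE implementation; it simply ranks candidate profiles by their plain Pearson correlation with a seed profile, which includes the mean subtraction mentioned above. The compound names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_by_similarity(seed, profiles):
    """Return (name, correlation) pairs sorted from most to least similar."""
    scored = [(name, pearson(seed, p)) for name, p in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical GI50 profiles over 4 cell lines.
seed_profile = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "A": [2.0, 4.0, 6.0, 8.0],   # same shape as the seed
    "B": [4.0, 3.0, 2.0, 1.0],   # opposite shape
    "C": [1.0, 3.0, 2.0, 4.0],   # partly similar
}
ranking = rank_by_similarity(seed_profile, candidates)
```

A search of this kind has to be repeated for every seed compound, whereas a trained classifier assigns a class to every compound in a single pass.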

 

 

 

 

 

Figure 4  Parallel Coordinate visualization of 4 compounds predicted by the neural net to have an MOA 6, corresponding to the DNA antimetabolites mechanism class.  The arrows point out cell lines where a compound profile does not match well.

 

 

High Confidence Subset Prediction of MOAs

From the 32,918 compounds of unknown MOA tested, various levels of confidence can be generated. Many compounds were found to possess high confidence levels.  For example, 18 compounds were found to have confidence levels above 0.990, while 48 compounds had confidence levels above 0.985.   If you would like to know the compound identities, their predicted MOAs, and their neural net confidence levels, please contact us at AnVil Informatics, Inc. or send an e-mail to info@anvilinformatics.com .  These results are being continually updated and include additional compounds and data from CambridgeSoft’s ChemOffice database.

            Some efforts toward classifier construction have been carried out previously [Wei92], [van94], and [Kou94].  While the utility of neural nets has been demonstrated previously, our work applies several trained neural nets to the prediction of MOAs for 32,918 compounds of unknown MOA.
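The selection of a high-confidence subset described in this section can be sketched as a simple threshold filter. The paper does not define how the confidence level is computed, so the sketch assumes it is the largest of the 6 output node activations; the compound identifiers and activation values are hypothetical.

```python
def high_confidence(predictions, threshold):
    """Select compounds whose top output activation exceeds the threshold.

    predictions maps a compound id to its 6 output activations, one per
    MOA class. Confidence is assumed to be the largest activation.
    """
    selected = []
    for compound, outputs in predictions.items():
        confidence = max(outputs)
        if confidence > threshold:
            moa = outputs.index(confidence) + 1  # MOA classes are numbered 1-6
            selected.append((compound, moa, confidence))
    return sorted(selected, key=lambda item: item[2], reverse=True)


# Hypothetical network outputs for three compounds.
outputs_by_compound = {
    "cmpd-a": [0.001, 0.002, 0.001, 0.001, 0.003, 0.992],
    "cmpd-b": [0.400, 0.100, 0.300, 0.100, 0.050, 0.050],
    "cmpd-c": [0.004, 0.987, 0.003, 0.002, 0.002, 0.002],
}
above_990 = high_confidence(outputs_by_compound, 0.990)
above_985 = high_confidence(outputs_by_compound, 0.985)
```

Lowering the threshold from 0.990 to 0.985 can only enlarge the selected subset, which matches the growth from 18 to 48 compounds reported above.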

 

Visualization of Chemical Property Data

 

        As we discussed earlier, a central concern of large pharmaceutical and smaller niche biotechnology companies involved in small molecule drug design is how the known or readily computed properties of any small molecule drug candidate can be utilized to predict its behavior inside a cell. Once there, it will interact specifically with the particular molecular subsystem where it exerts its MOA. Chemists have developed generalized descriptors, some simple and others more complex, that describe each chemical's physical and chemical properties. These descriptors can be thought of as a chemical fingerprint of the compound. Medicinal chemists have found these properties to be useful in developing rules that correlate chemical and physical properties with biological activity. The term Quantitative Structure Activity Relationships, or QSAR, is often used as an acronym for this. Below we have taken some representative NCI GI50 MOA data and combined it with QSAR data for the individual chemicals, calculated again using the CambridgeSoft ChemOffice Software.

 

Chemical Finger Print Data

            The chemical finger print data shown here represents generalized functional group characteristics of each compound as well as size, physical properties and some interaction parameters of that compound. These characteristics are very important in determining the behavior of compounds in tests that mimic drug uptake, metabolic behavior inside the body and general ability to interact with different classes of molecules and cellular structures.

CMR: Computed Molecular Refractivity

CLOGP: ChemLogP is an implementation of the method of Suzuki and Kudo. This method breaks a molecule down into basic and extended fragments and then sums the LogP contributions for each group to provide a total value of LogP for the whole molecule.   LogP, the octanol-water partition coefficient, is the most commonly adopted measure of lipophilicity. The lipophilicity of the whole molecule affects the transport properties through cell membranes, and interactions between the drug and receptor site are affected by the profile of the lipophilicity of the molecular surface.

MlogP: Another calculated LogP

DON: H-bond donor

ACC: H-bond acceptor

LIP: Hydrophobic center

TOT: Number of heavy atoms (not H)

MWT: Molecular Weight

BIND_E: Binding energy

BIND_K: Log(binding constant)

XSCMR: Another calculated MR

ROT_BONDS: The number of single bonds whose rotation changes the conformation

H-BND: Hydrogen bonds

NO: Number of Nitrogen and Oxygen atoms

 

Polyviz Visualization of Chemical Finger Print Data

In Figure 5, a variation of Radviz called Polyviz [Hof00] is shown colored by MOA type. In this visualization each of the Radviz spring anchor points is arrayed along the sides of a spread polygon. Additionally, parts of the spring lines are shown, giving a visual representation of the distribution of the data for each dimension.  For example, the finger print variable DON is seen to have only one value for all of the data points. Here the importance of visualizing the data is clear, since we can eliminate DON from any classifier we build without losing information or classification accuracy.  This visualization also clearly depicts the similarity of the MWT and TOT variables, expressed in how the MOA colors are similarly distributed for both dimensions. One might consider examining the effect of using only MWT or only TOT in a classifier.  For this dataset the spring paradigm does not do as well in separating the six MOA classes into distinct clusters as it did for the GI50 60 cell line data shown in Figure 1.  This suggests that building an accurate classifier may be more difficult. 
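The two observations made from the Polyviz display, a variable with a single value and a pair of variables that behave alike, can also be checked numerically. The sketch below is one way to do so, using a constant-column test and a Pearson correlation threshold. The threshold of 0.95 and the toy values are assumptions for illustration, not data from the fingerprint dataset.

```python
import math


def _pearson(x, y):
    """Pearson correlation of two non-constant columns."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def redundant_features(table, names, threshold=0.95):
    """Return (constant columns, pairs of highly correlated columns)."""
    cols = dict(zip(names, zip(*table)))
    # A column with a single distinct value carries no information.
    constant = [n for n in names if len(set(cols[n])) == 1]
    varying = [n for n in names if n not in constant]
    pairs = []
    for i, a in enumerate(varying):
        for b in varying[i + 1:]:
            if abs(_pearson(cols[a], cols[b])) >= threshold:
                pairs.append((a, b))
    return constant, pairs


# Hypothetical descriptor values for 4 compounds.
feature_names = ["DON", "TOT", "MWT", "CLOGP"]
toy_table = [
    [1, 10, 140.0, 0.5],
    [1, 20, 281.0, 2.0],
    [1, 30, 419.0, -1.0],
    [1, 40, 560.0, 1.2],
]
constant_cols, correlated_pairs = redundant_features(toy_table, feature_names)
```

On the toy table this flags DON as constant and TOT and MWT as a correlated pair, mirroring what the visualization suggests for the real data.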

Figure 5 Polyviz showing the colored MOA classes for 123 compounds with 13 chemical finger print dimensions.

 

Chemical Property NN Classifier Predicting MOAs

 

A number of neural net and decision tree (C5.0) [Qui96] classifiers were trained and tested using the 13 chemical finger print properties described above for the 123 compounds with known MOAs.  The predicted accuracies ranged from 56% to 80%, significantly lower than those of the classifiers trained on the GI50 cell line data.

The number of hidden layers and the number of compounds held out for validation and to prevent overtraining were varied. In the example classifier shown below, thirty percent of the data was randomly left out for validation and to prevent overtraining.

 

The configuration of the net was:  Input Layer: 13 neurons, Hidden Layer: 25 neurons, Output Layer: 6 neurons (for the 6 MOA classes).  The predicted accuracy was 77.42% (from the validation test data set).

 

The accuracy on all data fed back to the net was:

Comparing Predicted-MOA with Known MOA

    Correct     99    80.49%
    Wrong       24    19.51%
    Total      123

 

 

Confusion Matrix

                  Predicted MOA
    Known MOA     1     2     3     4     5     6
        1        29     2     1     1     1     1
        2         0    24     0     0     0     0
        3         2     2    11     0     0     1
        4         3     0     0    15     1     0
        5         2     1     0     1    12     0
        6         0     5     0     0     0     8

 

The Relative Importance of Inputs was: H-BND: 0.43270, MWT: 0.35316, TOT: 0.25811, ROT_BONDS: 0.25573, XSCMR: 0.25082, ACC: 0.21565, CLOGP: 0.20593, CMR: 0.18721, LIP: 0.16368, MLogP: 0.15458, BIND_E: 0.15020, BIND_K: 0.14061, NO: 0.08400

 

The rules from a C5.0 decision tree classifier with 10% cross validation are shown below for comparison with the neural net classifier.  In both cases TOT, MWT, and H-BND were important in predicting MOA.

 

 

TOT =< 37
    H-BND =< 2
        TOT =< 16 -> 4
        TOT  > 16
    H-BND  > 2
        MWT =< 278
        MWT  > 278 -> 1
TOT  > 37
    H-BND =< 5
        CMR =< 8.67 -> 1
        CMR  > 8.67
    H-BND  > 5
        H-BND =< 8 -> 3
        H-BND  > 8 -> 4
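For readers who prefer code to a rule listing, the tree above can be transcribed as a small function. This is a direct transcription, not C5.0 output: the thresholds are copied from the listing, and the branches for which the listing shows no class label return `None` rather than a guessed class. The function and argument names are hypothetical.

```python
def classify_moa(tot, h_bnd, mwt, cmr):
    """Return the MOA class (1-6) given by the printed rules.

    Returns None for the branches whose class is not shown in the listing.
    """
    if tot <= 37:
        if h_bnd <= 2:
            return 4 if tot <= 16 else None   # class for TOT > 16 not shown
        return 1 if mwt > 278 else None       # class for MWT <= 278 not shown
    if h_bnd <= 5:
        return 1 if cmr <= 8.67 else None     # class for CMR > 8.67 not shown
    return 3 if h_bnd <= 8 else 4


# Hypothetical descriptor values for two compounds.
small_compound = classify_moa(tot=12, h_bnd=1, mwt=180.0, cmr=4.0)
large_compound = classify_moa(tot=45, h_bnd=7, mwt=620.0, cmr=15.0)
```

Only TOT, H-BND, MWT and CMR appear in the rules, so the remaining fingerprint variables are not needed to apply this particular tree.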

 

As we expected from the weaker clustering in the Polyviz visualization, the predicted accuracy of this neural net and the corresponding confidence levels predicted for compounds are lower than those of the previous neural net, which was developed to predict MOA classes for unknown compounds from the 60 cell line GI50 data in the earlier section of this report.

Conclusions

            We have demonstrated that high dimensional visualization techniques are useful in preliminary evaluations of datasets. We used the MIVAC environment on the NCI 123 compounds of known MOA and chemical finger print data. We took clues from the initial examination of the data that guided us in building classifiers. Using the MIVAC neural net classifiers, we predicted small subsets of a large 32,918 compound database that were classified into MOAs with very high confidence values.  Chemical structure information can be used for predicting biological activity. However, more work in this area is necessary to build more accurate classifiers.  Machine learning and visualization will continue to play a prominent role in analyzing data from the human genome project and facilitating the drug discovery process.

 

Future Work

AnVil Informatics continues to work in the cheminformatics area by developing more interactive visual and analytic data exploration tools. One further application example includes predicting specific cancer cell line activity from chemical structure data.

 

Acknowledgements

          The authors acknowledge support from AnVil Informatics, and funding from the IVPR and CIB Centers and the CFCI at the University of Massachusetts Lowell.

 

References

 

[Hof97] Hoffman P. E., Grinstein G. G., Marx K., Grosse I., Stanley E.: DNA Visual and Analytic Data Mining, IEEE Visualization '97, Phoenix, AZ, pages 437-441, 1997.

 

[Hof00] Hoffman P., Grinstein G.: Dimensional Anchors: A Graphic Primitive for Multidimensional Multivariate Information, Workshop on New Paradigms in Information Visualization and Manipulation, In Conjunction with the ACM Conference on Information and Knowledge Management (CIKM'99).  To be published in 2000.

 

[Ins85]  Inselberg A.:  The Plane with Parallel Coordinates, Special Issue on Computational Geometry, The Visual Computer  1:  69-91. 1985.

 

[Koh82] Kohonen, T.: Self-organized formation of topologically correct feature maps, Biological Cybernetics 43: 59-69, 1982. Reprinted in Anderson and Rosenfeld, 1988.

 

[Kou94] Koutsoukos, A.D., Rubinstein, L.V., Faraggi, D., Simon, R.M., Kalyandrug, S., Weinstein, J.N., Paull, K.D., Kohn, K.W.: Discrimination techniques applied to the NCI in vitro antitumor drug screen: predicting biochemical mechanism of action, Stat. in Med. 13: 719-730, 1994.

 

[Qui96] Quinlan, R.: Improved Use of Continuous Attributes in C4.5, Journal of Artificial Intelligence Research 4: 77-90, March 1996.

 

[van94] van Osdol, W.W., Myers, T.G., Paull, K.D., Kohn, K.W., and Weinstein, J.N.: Use of the Kohonen self-organizing map to study the mechanisms of action of chemotherapeutic agents, J Natl Cancer Inst 86: 1853-1859, 1994.

 

[Wei92] Weinstein, J.N., Kohn, K.W., Grever, M.R., Viswanadhan, V.N., Rubinstein, L.V., Monks, A., Scudiero, D.A., Welch L., Koutsoukos, A., Paull, K.D.: Neural computing in cancer drug development: Predicting mechanism of action, Science 258: 447-451, 1992.

 

[Wei97] Weinstein, J.N., Myers, T.G., O'Connor, P.M., Friend, S.H., Fornace, A.J., Kohn, K.W., Fojo, T., Bates, S.E., Rubinstein, L.V., Anderson, N.L., Buolamwini, J.K., van Osdol, W.W., Monks, A.P., Scudiero, D.A., Sausville, E.A., Zaharevitz, D.W., Bunow, B., Viswanadhan, V.N., Johnson, G.S., Wittes, R.E., and Paull, K.D. An information-intensive approach to the molecular pharmacology of cancer. Science 1997; 275:343-349.

 

 

 
