Bioinformatics and Medical Tools
Developing machine learning and other tools for biological discovery is a long-standing activity in the lab.

Learning Protein Folding Energy Functions
Wei Guan, Arkadas Ozakin, Alexander G. Gray, Jose Borreguero, Shashi Pandit, and Jeffrey Skolnick
Georgia Institute of Technology Technical Report, 2010


We show a way to sculpt energy functions automatically using machine learning. [pdf]

Abstract: A critical open problem in ab initio protein folding is protein energy function design, which pertains to defining the energy of protein conformations in a way that makes folding most efficient and reliable. In this paper, we address this issue as a weight optimization problem and demonstrate a machine learning approach, learning to rank, to solve this problem. We investigate the ranking-via-classification approach, especially the RankingSVM method, and compare it with the state-of-the-art approach to the problem using the MINUIT optimization package. To maintain the physicality of the results, we impose non-negativity constraints on the weights. For this we develop two efficient non-negative support vector machine (NNSVM) methods, derived from L2-norm and L1-norm SVMs, respectively. We demonstrate an energy function which maintains the correct ordering with respect to structure dissimilarity to the native state more often, is more efficient and reliable for learning on large protein sets, and is qualitatively superior to the current state-of-the-art energy function.
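
The ranking-via-classification idea with non-negative weights can be illustrated with a minimal sketch. This is not the report's NNSVM solvers: it assumes the energy is a weighted sum of energy-term values, turns decoy pairs into margin constraints, and uses plain projected subgradient descent as a stand-in optimizer. The function names and the toy decoy values are hypothetical.

```python
# Illustrative sketch only: pairwise ranking with non-negative weights,
# trained by projected subgradient descent (a stand-in, not the NNSVM methods).


def energy(weights, features):
    """Energy modeled as a weighted sum of energy-term values (an assumption)."""
    return sum(w * f for w, f in zip(weights, features))


def make_pairs(decoys):
    """Turn (features, dissimilarity) decoys into (worse, better) feature pairs."""
    pairs = []
    for feats_a, dist_a in decoys:
        for feats_b, dist_b in decoys:
            if dist_a > dist_b:  # a is farther from the native state than b
                pairs.append((feats_a, feats_b))
    return pairs


def train_nonneg_ranker(pairs, n_features, reg=0.01, lr=0.05, epochs=200):
    """Minimize a pairwise hinge loss with L2 regularization, subject to w >= 0."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [reg * wi for wi in w]
        for worse, better in pairs:
            diff = [a - b for a, b in zip(worse, better)]
            # We want E(worse) - E(better) >= 1; only violated pairs contribute.
            if energy(w, diff) < 1.0:
                for i, d in enumerate(diff):
                    grad[i] -= d / len(pairs)
        # Gradient step, then projection onto the non-negative orthant.
        w = [max(0.0, wi - lr * gi) for wi, gi in zip(w, grad)]
    return w


# Toy decoys: (energy-term values, dissimilarity to native). Made-up numbers.
decoys = [
    ([0.1, 0.9], 0.5),
    ([0.4, 0.2], 2.0),
    ([0.8, 0.7], 4.0),
    ([1.2, 0.1], 6.0),
]
weights = train_nonneg_ranker(make_pairs(decoys), n_features=2)
energies = [energy(weights, feats) for feats, _ in decoys]
```

On this toy data the learned energies increase with dissimilarity to the native state, and the projection step keeps every weight non-negative, which is the physicality constraint the abstract describes.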

@techreport{guan2010lef,
  author = "Wei Guan and Arkadas Ozakin and Alexander G. Gray and Jose Borreguero and Shashi Pandit and Jeffrey Skolnick",
  title = "{Learning Protein Folding Energy Functions}",
  institution = "{Georgia Institute of Technology}",
  series = "{College of Computing Technical Report}",
  year = "2010"
}
In preparation


Automatic Mass Spectrometry Alignment
We are working on automatic alignment of mass spectrograms using machine learning.


Efficient Molecular Dynamics with the Axilrod-Teller Potential
A fast algorithm for force computation using the 3-body Axilrod-Teller potential function, with a demonstration of its use in molecular dynamics.
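
For reference, the Axilrod-Teller (triple-dipole) energy of three particles is nu * (1 + 3 cos(g1) cos(g2) cos(g3)) / (r12 r13 r23)^3, where the g's are the interior angles of the triangle the particles form. The sketch below is a brute-force baseline that evaluates this energy over all triples. It is not the fast algorithm described here, and the function names and the default strength `nu=1.0` are assumptions for illustration.

```python
import math
from itertools import combinations


def axilrod_teller_energy(p1, p2, p3, nu=1.0):
    """Triple-dipole energy of three points given as coordinate tuples."""
    r12 = math.dist(p1, p2)
    r13 = math.dist(p1, p3)
    r23 = math.dist(p2, p3)
    # Cosines of the interior angles, from the law of cosines.
    cos1 = (r12**2 + r13**2 - r23**2) / (2 * r12 * r13)
    cos2 = (r12**2 + r23**2 - r13**2) / (2 * r12 * r23)
    cos3 = (r13**2 + r23**2 - r12**2) / (2 * r13 * r23)
    return nu * (1 + 3 * cos1 * cos2 * cos3) / (r12 * r13 * r23) ** 3


def total_energy(points, nu=1.0):
    """Naive sum over all unordered triples; cost grows as O(N^3)."""
    return sum(
        axilrod_teller_energy(a, b, c, nu) for a, b, c in combinations(points, 3)
    )


# Unit equilateral triangle: each cosine is 1/2, so the energy is about 1.375 * nu.
equilateral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, math.sqrt(3) / 2, 0.0)]
# Evenly spaced collinear points: cosines are 1, -1, 1, so the energy is -0.25 * nu.
collinear = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

e_equilateral = total_energy(equilateral)
e_collinear = total_energy(collinear)
```

The sign change between the two configurations shows why the three-body term cannot be folded into a pairwise potential, and the cubic cost of the triple loop is what makes a faster method worthwhile in molecular dynamics.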


Fast Monte Carlo and Multi-Body Potential Evaluation
An ultra-fast approach to evaluating potentials combining Monte Carlo and multipole techniques, which generalizes to multi-body (more than pairwise) potentials.


Fast Hartree-Fock Calculation
A new approach to computing Hartree-Fock exchange and Coulomb matrices, based on a four-tree algorithm.
Ovarian Cancer Detection from Metabolomic Liquid Chromatography/Mass Spectrometry Data by Support Vector Machines
Wei Guan, Manshui Zhou, Christina Y. Hampton, Benedict Benigno, DeEtte Walker, Alexander Gray, John F. McDonald and Facundo M. Fernandez
BMC Bioinformatics, 2009 (accepted)


A method for diagnosing ovarian cancer using LC/TOF mass spectrometry, showing promising accuracy. [pdf]

Abstract: Background: The majority of ovarian cancer biomarker discovery efforts focus on the identification of proteins that can improve the predictive power of presently available diagnostic tests. Here we show that metabolomics, the study of metabolic changes in biological systems, can also provide characteristic small molecule fingerprints related to this disease. Results: In this work, new approaches to automatic classification of metabolomic data produced from sera of ovarian cancer patients and benign controls are investigated. The performance of support vector machines (SVM) for the classification of liquid chromatography/time-of-flight mass spectrometry (LC/TOF MS) metabolomic data, focusing on recognizing combinations or "panels" of potential metabolic diagnostic biomarkers, was evaluated. Utilizing LC/TOF MS, sera from 37 ovarian cancer patients and 35 benign controls were studied. Optimum panels of spectral features observed in positive and/or negative ion mode electrospray (ESI) MS with the ability to distinguish between control and ovarian cancer samples were selected using state-of-the-art feature selection methods such as recursive feature elimination and L1-norm SVM. Conclusions: Three evaluation processes (leave-one-out-cross-validation, 12-fold-cross-validation, 52-20-split-validation) were used to examine the SVM models based on the selected panels in terms of their ability to differentiate control vs. disease serum samples. The statistical significance of these feature selection results was comprehensively investigated. Classification of the serum sample test set was over 90% accurate, indicating promise that the above approach may lead to the development of an accurate and reliable metabolomic-based approach for detecting ovarian cancer.
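
Of the three evaluation processes named above, leave-one-out cross-validation is the simplest to sketch: each sample is held out once, a model is fit on the rest, and the held-out prediction is scored. The example below is only an illustration of that loop. It swaps in a nearest-centroid classifier as a stand-in for the SVM, and the feature values, labels, and function names are made up.

```python
def nearest_centroid_fit(samples, labels):
    """Return a dict mapping each label to the mean feature vector of its class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = nearest_centroid_fit(train_x, train_y)
        if nearest_centroid_predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical two-feature "spectral panel" values for six serum samples.
samples = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [3.0, 2.1], [3.2, 1.9], [2.8, 2.0]]
labels = ["control", "control", "control", "cancer", "cancer", "cancer"]
accuracy = leave_one_out_accuracy(samples, labels)
```

In a real study, any feature selection would also need to be repeated inside each fold, since selecting features on the full data set before cross-validation inflates the accuracy estimate.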

@article{guan2009cancersvm1,
  author = "Wei Guan and Manshui Zhou and Christina Y. Hampton and Benedict B. Benigno and L. DeEtte Walker and Alexander G. Gray and John F. McDonald and Facundo M. Fernandez",
  title = "{Ovarian Cancer Detection from Metabolomic Liquid Chromatography/Mass Spectrometry Data by Support Vector Machines}",
  journal = "BMC Bioinformatics",
  note = "{\em Accepted}",
  year = "2009"
}
Automatic Joint Classification and Segmentation of Whole Cell 3D Images
Rajesh Narasimha, Hua Ouyang, Alexander G. Gray, Steven W. McLaughlin, Sriram Subramaniam
Pattern Recognition, 2009


A method for segmenting cell parts from ion-abrasion scanning electron microscopy, an emerging high-end imaging technology, which requires little human tuning, unlike most segmentation methods. [pdf]

Abstract: We present a machine learning tool for automatic texton-based joint classification and segmentation of mitochondria in MNT-1 cells imaged using ion-abrasion scanning electron microscopy (IA-SEM). For diagnosing signatures that may be unique to cellular states such as cancer, automatic tools with minimal user intervention need to be developed for analysis and mining of high-throughput data from these large volume data sets (typically ~2 GB/cell). Challenges for such a tool in 3D electron microscopy arise due to low contrast and signal-to-noise ratios (SNR) inherent to biological imaging. Our approach is based on block-wise classification of images into a trained list of regions. Given manually labeled images, our goal is to learn models that can localize novel instances of the regions in test datasets. Since datasets obtained using electron microscopes are intrinsically noisy, we improve the SNR of the data for automatic segmentation by implementing a 2D texture-preserving filter on each slice of the 3D dataset. We investigate texton-based region features in this work. Classification is performed by a k-nearest neighbor (k-NN) classifier, support vector machines (SVMs), adaptive boosting (AdaBoost) and histogram matching using a NN classifier. In addition, we study the computational complexity vs. segmentation accuracy tradeoff of these classifiers. Segmentation results demonstrate that our approach using minimal training data performs close to semi-automatic methods using the variational level-set method and manual segmentation carried out by an experienced user. Using our method, which we show to have minimal user intervention and high classification accuracy, we investigate quantitative parameters such as volume of the cytoplasm occupied by mitochondria, differences between the surface area of inner and outer membranes, and mean mitochondrial width, which are quantities potentially relevant to distinguishing cancer cells from normal cells. To test the accuracy of our approach, these quantities are compared against manually computed counterparts. We also demonstrate extension of these methods to segment 3D images obtained using electron tomography.

@article{narasimha2009tomo,
  title = "{Automatic Joint Classification and Segmentation of Whole Cell 3D Images}",
  author = "Rajesh Narasimha and Hua Ouyang and Alexander G. Gray and Steven W. McLaughlin and Sriram Subramaniam",
  journal = "Pattern Recognition",
  year = "2009",
  volume = "42",
  number = "6",
  pages = "1067--1079"
}
Discovering Ovarian Cancer Biomarkers using Gene Ontology Based Microarray Analysis
Wei Guan, Alexander Gray, Sham Navathe, Nathan Bowen, John McDonald, Lilya Matyunina
Data Mining in Bioinformatics (BIOKDD) 2007


A way of combining microarray data and gene ontologies for discovering biomarkers of ovarian cancer. [pdf]

Abstract: The advent of microarray data has opened new doorways for biological discovery. However, over the years, not all of the hoped-for possibilities have been realized, due to fundamental limitations of microarray data. In this paper, we present a method for augmenting microarray analysis with gene ontology data to provide insight into possible biomarkers (critical genes) for ovarian cancer pathogenesis which is not possible using microarray expression data alone. Using expression data for 12558 genes in 43 patients with both benign and malignant epithelial ovarian tumors, we apply representative state-of-the-art methods for microarray biomarker analysis including support vector machines, five data normalization methods, five feature selection methods, and two dimensionality reduction methods. Our findings showed that for this data: 1) Guanine Cytosine Robust Multi-array Average (GCRMA) appears to outperform other normalization methods, 2) the classification problem alone is not constraining enough to yield unique biomarkers with high confidence. Our new method combining statistical microarray analysis with ontological information is capable of finding putative biomarkers whose expression values are not significantly different between patient groups, but instead may be mutated or regulated at the post-translational level. For example, our method was capable of recovering the known importance of the TUMOR PROTEIN 53 (TP53) in the etiology of epithelial ovarian cancer (EOC) from expression data in which TP53 was not found to be differentially expressed.

@inproceedings{guan2007biomarkers,
  title = "{Discovering Ovarian Cancer Biomarkers using Gene Ontology Based Microarray Analysis}",
  author = "Wei Guan and Alexander G. Gray and Sham Navathe and Nathan Bowen and John McDonald and Lilya Matyunina",
  booktitle = "Proceedings of the Seventh International Workshop on Data Mining in Bioinformatics (BIOKDD)",
  year = "2007"
}
High-Dimensional Probabilistic Classification for Drug Discovery
Alexander G. Gray, Paul Komarek, Ting Liu, Andrew Moore
International Symposium on Computational Statistics (COMPSTAT) 2004


A method for ranking molecules in terms of their promise for drug development, based on nonparametric Bayes classifiers. [pdf]

Abstract: Automated high-throughput drug screening constitutes a critical emerging approach in modern pharmaceutical research. The statistical task of interest is that of discriminating active versus inactive molecules given a target molecule, in order to rank potential drug candidates for further testing. Because the core problem is one of ranking, our approach concentrates on accurate estimation of unknown class probabilities, in contrast to popular non-probabilistic methods which simply estimate decision boundaries. While this motivates nonparametric density estimation, we are faced with the fact that the molecular descriptors used in practice typically contain thousands of binary features. In this paper we attempt to improve the extent to which kernel density estimation can work well in high-dimensional classification settings. We present a synthesis of techniques (SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic Kernels) which yields favorable performance in comparison to previous published approaches to drug screening, as tested on a large proprietary pharmaceutical dataset.
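
The core idea of ranking by estimated class probability, rather than by a hard decision boundary, can be sketched with a plain kernel density classifier. This is not SLAMDUNK: it omits sphering, metric learning, and nonisotropic kernels, uses a single isotropic Gaussian bandwidth, and runs on made-up two-dimensional descriptors instead of thousands of binary features. All names and values are hypothetical.

```python
import math


def gaussian_kde(x, points, bandwidth):
    """Average of isotropic Gaussian kernels centered on the training points."""
    dim = len(x)
    norm = (2 * math.pi * bandwidth**2) ** (dim / 2)
    total = 0.0
    for p in points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq_dist / (2 * bandwidth**2))
    return total / (len(points) * norm)


def class_probabilities(x, data_by_class, bandwidth=0.5):
    """Posterior P(class | x) from Bayes' rule with KDE class-conditional densities."""
    n_total = sum(len(points) for points in data_by_class.values())
    scores = {
        label: (len(points) / n_total) * gaussian_kde(x, points, bandwidth)
        for label, points in data_by_class.items()
    }
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


# Made-up training descriptors for active and inactive molecules.
training = {
    "active": [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2]],
    "inactive": [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.1]],
}
# Hypothetical candidates to rank by estimated probability of being active.
candidates = {"a": [1.0, 1.1], "b": [0.0, 0.1], "c": [0.6, 0.6]}
p_active = {
    name: class_probabilities(x, training)["active"] for name, x in candidates.items()
}
ranking = sorted(candidates, key=lambda name: p_active[name], reverse=True)
```

Because the output is a probability rather than a class label, borderline candidates such as `c` land between the clear actives and clear inactives in the ranking, which is what a screening pipeline needs when choosing molecules for further testing.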

@inproceedings{gray2004slamdunk,
  title = "{Probabilistic Classification in High Dimensions, With Application to Drug Discovery}",
  author = "Alexander G. Gray and Paul Komarek and Ting Liu and Andrew Moore",
  booktitle = "International Symposium on Computational Statistics (COMPSTAT)",
  year = "2004"
}