
Fundamental Algorithmic and Statistical Tools Laboratory
School of Computational Science and Engineering,
Georgia Institute of Technology
The FASTlab performs research on practical yet theoretically rigorous tools for fundamental algorithmic and statistical problems arising in challenging applications. Our main focus is state-of-the-art data analysis (machine learning/data mining/statistics) on massive datasets; we are the largest academic lab devoted to this topic.
Modern datasets are increasingly "massive" either in the number of objects (rows of a data table) or the number of dimensions/measurements describing each object (columns of a data table), and we are concerned with both. The research questions we pursue are motivated by collaborations with domain experts on high-impact science and engineering problems. Our goal is to develop practical high-accuracy methods and fast algorithms for them, then if possible generalize them with a theory allowing many future practical methods and algorithms to be developed. We strive to ensure practicality by developing real software and co-publishing in the domain area; thus our work cycle takes us from mathematics to code to scientific results to mathematics. To achieve our end goals we are multidisciplinary. Work on the computational foundations of machine learning is ultimately intertwined with the fundamental statistical issues of data analysis; a long-term goal is to unify them theoretically. Our statistical work (as well as our algorithmic work) incorporates concepts from geometry/topology, which we believe represents a long-term bridge between statistics and computation. Within our computational work, we utilize the tools of both discrete computer science (yielding powerful strategies for efficiency and analysis) and continuous numerical analysis (yielding, for example, rigorous bounding of approximation errors), and rare confluences of both can be found in our work. Our work bleeds into general contributions to basic applied mathematics and scientific computing problems.
All Papers and Code
Our Team

Algorithmics
Fast Algorithms for Large Datasets
Our goal: To make all textbook state-of-the-art machine learning methods computationally tractable on big datasets.
Of particular concern here are nonparametric methods, which are the most powerful but tend to result in the most expensive computations. Most existing approaches to analysis of massive datasets consist of either using simple methods, using a small subsample of the data, or using MapReduce/Hadoop, all of which result in significant limitations. Our strategy is to first design (or adopt) the fastest possible algorithms for the key computational bottlenecks, then consider their extension to out-of-core and parallel settings. Decomposing our goal, five main types of computations can be identified: generalized N-body problems, matrix computations, optimizations, graphical model inferences, and integrations. (This landscape will be outlined in a forthcoming survey paper on large-scale machine learning.) Of these, graphical model inferences and integrations have received much attention in the machine learning literature and therefore have not been a major focus of ours so far. We have so far developed the overall-fastest-while-accurate practical algorithms for the following methods:
Our software for scalable machine learning also includes the best algorithms from the research literature for other methods.
Nearest-Neighbor and Computational Geometry
Many of these fastest algorithms are achieved using multidimensional trees, which are also key to basic problems beyond data analysis. These include kd-trees, ball/metric trees, and cover trees. We have: done extensive experimentation with virtually all known data structures of this kind; introduced new types of trees, including spill trees, subspace trees, and cosine trees; shown how to use trees (our cosine trees) for linear algebra computations; shown how to formally analyze algorithms for pairwise problems based on cover trees; and shown how to create out-of-core versions of trees in relational databases. We are currently working on more insightful data-dependent analyses of tree-based algorithms, as well as parallel trees and tree-based algorithms. Nearest-neighbor search is important in its own right, but also as a prototype problem for others that are well-treated with trees. We have: shown that all-nearest-neighbors can be done more efficiently than by using repeated single-query algorithms, using a dual-tree traversal; shown that nearest-neighbor classification can be done more efficiently than by actually finding the nearest neighbors; developed the fastest/most accurate method for (1+epsilon) distance-approximate nearest-neighbor search using our spill trees; and introduced a more powerful and natural paradigm of approximation which is more effective for high-dimensional nearest-neighbor search than distance approximation.
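The branch-and-bound pruning that makes such tree-based search fast can be sketched in a few lines. This is an illustrative minimal kd-tree with single-query nearest-neighbor search, not the lab's own implementations (which use dual-tree traversals and other tree types); all names here are made up for the example.

```python
import numpy as np

class KDNode:
    """Minimal kd-tree: split on cycling dimensions at the median."""
    def __init__(self, points, depth=0, leaf_size=10):
        self.points = points
        self.left = self.right = None
        if len(points) > leaf_size:
            d = depth % points.shape[1]          # cycle through split dimensions
            order = np.argsort(points[:, d])
            mid = len(points) // 2
            self.split_dim, self.split_val = d, points[order[mid], d]
            self.left = KDNode(points[order[:mid]], depth + 1, leaf_size)
            self.right = KDNode(points[order[mid:]], depth + 1, leaf_size)

def nn_search(node, q, best=(np.inf, None)):
    """Return (distance, point) of q's nearest neighbor in the tree."""
    if node.left is None:                        # leaf: scan its points
        dists = np.linalg.norm(node.points - q, axis=1)
        i = np.argmin(dists)
        if dists[i] < best[0]:
            best = (dists[i], node.points[i])
        return best
    # descend into the child containing the query first...
    near, far = ((node.left, node.right) if q[node.split_dim] < node.split_val
                 else (node.right, node.left))
    best = nn_search(near, q, best)
    # ...and prune the far child unless it could hold something closer
    if abs(q[node.split_dim] - node.split_val) < best[0]:
        best = nn_search(far, q, best)
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
tree = KDNode(X)
q = np.zeros(3)
dist, point = nn_search(tree, q)
```

The pruning test is the whole point: a subtree is visited only if the splitting hyperplane is closer to the query than the best distance found so far, which is what turns a linear scan into a typically logarithmic-depth search.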
(More)
Kernel Summations and Multipole Methods
Large summations over kernel functions such as Gaussians occur at the heart of kernel estimators, RKHS kernel machines, and manifold learning methods. While similar to some computational physics problems, statistical problems may be in arbitrary dimension and require a wider array of possible kernels. We have: shown how to develop computational geometry (tree-based) methods for kernel summations which yield user-specifiable error guarantees; shown how to use multipole-like hierarchical series expansions with general multidimensional trees, achieving the first true (hierarchical) multipole-like method for statistical problems; and shown how to achieve efficiency in high-dimensional kernel summation problems by using our subspace trees and incorporating Monte Carlo techniques.
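The core idea behind tree-based kernel summation with error guarantees can be illustrated as follows: bound the kernel value over an entire node via its bounding box, and approximate the node's whole contribution whenever the bound is tight enough. This is a simplified single-tree sketch under an absolute per-point error tolerance, not the lab's series-expansion methods; all names are illustrative.

```python
import numpy as np

class Node:
    """Bounding-box tree splitting the widest dimension at the median."""
    def __init__(self, pts, leaf_size=32):
        self.pts, self.lo, self.hi = pts, pts.min(0), pts.max(0)
        self.left = self.right = None
        if len(pts) > leaf_size:
            d = np.argmax(self.hi - self.lo)
            order = np.argsort(pts[:, d])
            mid = len(pts) // 2
            self.left = Node(pts[order[:mid]], leaf_size)
            self.right = Node(pts[order[mid:]], leaf_size)

def kernel_sum(node, q, h, eps):
    """Approximate sum_i exp(-||q - x_i||^2 / (2 h^2)) with per-point error <= eps."""
    # closest and farthest possible distances from q to the node's bounding box
    dmin = np.linalg.norm(np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0))
    dmax = np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))
    kmax, kmin = np.exp(-dmin**2 / (2*h*h)), np.exp(-dmax**2 / (2*h*h))
    if kmax - kmin <= 2 * eps:                   # prune: midpoint approximation
        return len(node.pts) * (kmax + kmin) / 2
    if node.left is None:                        # leaf: exact sum
        return np.exp(-np.sum((node.pts - q)**2, axis=1) / (2*h*h)).sum()
    return kernel_sum(node.left, q, h, eps) + kernel_sum(node.right, q, h, eps)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
total = kernel_sum(Node(X), np.zeros(2), 0.5, 1e-3)
```

Because the midpoint of [kmin, kmax] is within eps of every point's true kernel value when the node is pruned, the total error is provably at most N*eps, which is the flavor of user-specifiable guarantee described above.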
(More)
Generalized N-Body Problems
Generalized N-body problems (GNPs), roughly speaking, are computations involving comparisons between points (e.g. distances, dot products, and kernels), including all nearest-neighbor-like and kernel-summation-like problems; not surprisingly, many of the bottleneck computations in statistics are of this form. A common form of such problems involves comparisons between all pairs of points, a naively quadratic-time computation. We have: coined and formally (algebraically) defined the class, which subsumes the MapReduce class, as well as shown that a single abstract algebraic algorithm, a broad generalization of the famous Fast Multipole Method, solves all such pairwise problems for the first time; shown for the first time that such pairwise problems, including the original N-body problems of computational physics, are solvable in provably linear time; shown for the first time that the Euclidean minimum spanning tree (EMST) problem of computational geometry can be solved in O(N log N) time; shown that higher-order (beyond pairs, to n-tuples) problems such as n-point correlations and more-than-pairwise kernels can also be computed efficiently; and created a programming model called THOR (Tree-based Higher-Order Reduce) for GNPs.
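The shared form of these problems can be made concrete with a naive sketch: each result is a reduction of a kernel over all reference points, result_i = Reduce_j K(q_i, x_j). The quadratic loop below (illustrative only; the kernel/reduction names are made up) is exactly the pattern the tree algorithms above accelerate, and two instances, all-nearest-neighbors and kernel summation, fall out by swapping the kernel and the reduction.

```python
import numpy as np

def gnp(queries, refs, kernel, reduce_op, init):
    """Naive O(N*M) generalized N-body computation: result_i = Reduce_j K(q_i, x_j)."""
    out = np.full(len(queries), init, dtype=float)
    for i, q in enumerate(queries):
        for x in refs:
            out[i] = reduce_op(out[i], kernel(q, x))
    return out

dist = lambda q, x: np.linalg.norm(q - x)               # comparison: distance
gauss = lambda q, x: np.exp(-np.sum((q - x)**2) / 2.0)  # comparison: Gaussian kernel

rng = np.random.default_rng(0)
Q, X = rng.standard_normal((5, 2)), rng.standard_normal((50, 2))
nn_dists = gnp(Q, X, dist, min, np.inf)                 # all-nearest-neighbor distances
kde_sums = gnp(Q, X, gauss, lambda a, b: a + b, 0.0)    # kernel summations
```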
(More)
Matrix Computations and Optimization
Matrix computations are the main bottleneck in several methods, both basic and state-of-the-art. The idea of approximate linear algebra is arguably more sensible in machine learning than in many other areas of scientific computing. A matrix structure which is somewhat unique to machine learning (though not completely) is the kernel matrix. We have: developed a fast algorithm for SVD (full SVD, not just low-rank factorization) using Monte Carlo and a new kind of tree, cosine trees, which automatically determines the rank needed to achieve a specified approximation error; and shown how to use tree-based N-body solvers in kernel matrix problems. We are working on fast, stable algorithms for Gaussian process regression and kernel PCA.
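To illustrate the flavor of Monte Carlo SVD with automatic rank determination, here is a standard randomized range-finding sketch that grows the rank until a specified approximation error is met. This is in the spirit of, but not identical to, the cosine-tree algorithm described above; the function name and parameters are illustrative.

```python
import numpy as np

def randomized_svd(A, tol, rank_step=5, seed=0):
    """Approximate SVD: grow a sampled basis until ||A - QQ^T A||_F <= tol."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Q = np.zeros((m, 0))
    while Q.shape[1] < min(m, n):
        G = rng.standard_normal((n, rank_step))     # random probe vectors
        Y = A @ G - Q @ (Q.T @ (A @ G))             # project out the current range
        Q = np.linalg.qr(np.hstack([Q, Y]))[0]      # re-orthonormalize the basis
        if np.linalg.norm(A - Q @ (Q.T @ A)) <= tol:
            break                                   # rank found automatically
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U, s, Vt
```

The stopping test is what makes the rank adaptive: the basis stops growing as soon as the residual falls below the user's error tolerance, rather than requiring the rank as an input.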
Every statistical method results in an optimization, whether implicit or explicit. Recent methods have required increasingly difficult optimizations, and while generic off-the-shelf optimizers exist, custom algorithm designs are needed for scalability. We have: developed a fast solver for the kinds of semidefinite programs that occur in manifold learning methods; developed the fastest online algorithm for nonlinear SVMs, based on the classic Frank-Wolfe approach; and developed more efficient optimization formulations for L_0 SVMs using perspective-cut mixed integer nonlinear programming and L_q SVMs for 0 < q < 1 using DC (difference of convex functions) programming. We are working on online algorithms for all machine learning methods.
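The Frank-Wolfe approach mentioned above has a very simple core, sketched here for a quadratic (Gram-matrix-style) objective over the probability simplex: repeatedly move toward the vertex that minimizes the linearized objective. This is an illustrative sketch with exact line search, not the lab's actual online SVM solver.

```python
import numpy as np

def frank_wolfe_simplex(K, x0, n_iters=100):
    """Minimize 0.5 * x^T K x over the probability simplex by Frank-Wolfe."""
    x = x0.copy()
    for _ in range(n_iters):
        g = K @ x                                    # gradient of the quadratic
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0                        # linear subproblem: best vertex
        d = s - x
        curv = d @ K @ d
        if curv <= 1e-12:                            # flat direction: converged
            break
        gamma = np.clip(-(g @ d) / curv, 0.0, 1.0)   # exact line search on [0, 1]
        x += gamma * d                               # stays a convex combination
    return x

# illustrative objective built from a random positive-definite "Gram" matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
K = B @ B.T + np.eye(6)
x = frank_wolfe_simplex(K, np.ones(6) / 6)
```

Because each step mixes the iterate with a simplex vertex, feasibility never needs to be enforced by projection, which is part of what makes Frank-Wolfe attractive for large kernel-machine problems.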
(More)
Truly Massive Data
Truly massive datasets may fit only on disk, rather than in RAM, and may even fit only on multiple computers' disks. At this point it is difficult to separate the issues of efficient machine learning from those of data management. We have shown for the first time how to use out-of-core versions of multidimensional trees within existing relational databases, with a demonstration using Microsoft SQL Server; we are working on new ideas for large-scale data management, aka data warehousing, and also on the problem of efficient data transfer to allow large-scale machine learning via cloud computing. Our THOR (Tree-based Higher-Order Reduce) programming model and system for generalized N-body problems can achieve certain kinds of parallelism for tree-based algorithms, but in general we are working on efficient parallelizations of the fastest algorithms for all machine learning methods, for all types of modern hardware setups (clusters, multicore, GPUs).
(More)

Statistics
Methods for High-Dimensional/Complex Data
Our goal: To efficiently estimate, compute, and derive insight in the presence of many variables with possibly complex interdependencies.
Of particular concern here are nonparametric methods, which are the most accurate but tend to have the most difficulties in high dimensionalities. In general, in pragmatically choosing between statistical methods, the tradeoff space is mostly defined by three axes: accuracy, scalability, and interpretability. We believe the accuracy/scalability tradeoff can be characterized mathematically, a long-term theoretical goal of ours, by utilizing tools from geometry/topology. One approach to achieving scalability is to consider new methods which result in easier computations yet hope to preserve most of the modeling power of more expensive methods, making the line between statistical method and computational method blurry; our examples of this include density estimation trees, Markov matrix factorization, and local kernel machines. Interpretability, which can be captured by defining more interpretable objective functions (as in density-preserving maps) or by using more interpretable model classes (as in density estimation trees and nonnegative SVMs), is more difficult to cast mathematically, though the number of variables in the final model might be a useful proxy. We have so far developed the following new statistical methods:
- density estimation methods: submanifold kernel density estimation, convex adaptive density estimation, hyperkernel density estimation, density estimation trees
- classification/regression methods: isometric separation maps, local kernel machines, nonnegative kernel machines, generative mean map kernel
- dimension reduction/clustering methods: rankmap, isometric nonnegative matrix factorization, convex nonnegative matrix factorization, affine nonnegative matrix factorization, density-preserving maps, functional independent component analysis


Our machine learning software includes these and other state-of-the-art machine learning methods.
Curses of Dimensionality and Geometry
There are various curses of dimensionality, which we believe can ultimately be related to each other. There are statistical curses of dimensionality, such as the exponential dependence of the estimation error of kernel estimators on the dimension of the data, and computational curses of dimensionality, such as the exponential dependence of the runtime of nearest-neighbor algorithms on the dimension of the data. We have: shown that kernel estimators are sensitive to the dimension of the data's submanifold, rather than the full dimension of its ambient space, thus showing that kernel estimation in high dimensionality is possible; and developed more refined variants of the expansion constant, which can be regarded as a worst-case form of intrinsic dimension in the context of some computational geometry computations. We are working on rigorous nonparametric estimators of the intrinsic dimension of datasets, and on distribution-dependent analyses of algorithms, which apply the mathematical tools of statistics to algorithm analysis.
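The notion of intrinsic dimension can be made concrete with one simple nearest-neighbor-based estimator from the literature, the Levina-Bickel maximum-likelihood estimator; this is an illustration of the concept, not the lab's own estimator, and the brute-force pairwise distances here are exactly the kind of computation the algorithms above accelerate.

```python
import numpy as np

def intrinsic_dimension(X, k=10):
    """Levina-Bickel-style MLE of intrinsic dimension from k-NN distances."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # all pairwise distances
    D.sort(axis=1)
    T = D[:, 1:k+1]                                    # k nearest neighbors (skip self)
    # per-point inverse dimension: mean of log(T_k / T_j) over j = 1..k-1
    invd = np.log(T[:, -1:] / T[:, :-1]).sum(axis=1) / (k - 1)
    return 1.0 / invd.mean()

# points on a 2-D subspace embedded in 10 ambient dimensions:
# the estimate should be near 2, not 10
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 10))
dim = intrinsic_dimension(X)
```

The estimate tracks the dimension of the data's submanifold rather than the ambient dimension, which is precisely the distinction driving the kernel-estimation result described above.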
(More)
Dimension Reduction and Feature Selection
The identification of the fundamental axes of variation in data, in principle, has implications for all statistical tasks, and has undergone a revival with the introduction of geometry-inspired ideas in recent years. We are interested in bringing more powerful geometry/topology ideas to this area, in blending them with basic statistical principles of estimation, and in creating new methods using formulations made possible by modern optimization capabilities. We have: shown for the first time how to perform nonnegative matrix factorization in a convex fashion; shown how to perform nonnegative matrix factorization with isometric (manifold-learning-like) constraints, achieving a more compact spectrum; shown a stable way to perform manifold learning without distances, using only an ordering on the distances between points; and demonstrated a new paradigm for dimension reduction which preserves densities rather than distances, which is shown to always be possible while distance preservation is well-known not to be always possible, and which is well-motivated by the most common use of high-dimensional visualization: identification of clusters and outliers. See also our work on isometric separation maps, which bridge manifold learning and SVMs.
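For context, here is a minimal sketch of classical multiplicative-update nonnegative matrix factorization (Lee and Seung's algorithm), the baseline that the convex, isometric, and affine variants above extend; this is illustrative, not any of the lab's algorithms.

```python
import numpy as np

def nmf(V, rank, n_iters=300, eps=1e-9, seed=0):
    """Factor V ~= W @ H with W, H >= 0 by minimizing squared error."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W, H = rng.random((m, rank)), rng.random((rank, n))
    for _ in range(n_iters):
        # multiplicative updates: ratios of nonnegative terms keep factors nonnegative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 15))   # exactly rank-3 nonnegative data
W, H = nmf(V, rank=3)
```

The nonnegativity of the factors is what gives NMF its parts-based interpretability, and it is exactly the constraint that makes the convex and isometric reformulations above nontrivial.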
The identification of subsets of the original features (rather than linear or nonlinear combinations of all of the features) which are relevant to a particular learning task has both interpretive and statistical value, and much recent attention has surrounded the issues of sparsity and regularization. We have shown how to perform fast but provably nonparametric density estimation using density estimation trees, the analog of classification and regression trees, yielding the interpretable rules and automatic feature selection capabilities of decision trees. See also our work showing how to achieve the expected aggressive feature selection capabilities of the relatively little-studied regularization norms below L_1, in particular L_0 SVMs using certain relaxations of the true combinatorial objective function and L_q SVMs for 0 < q < 1. We are working on new ways to perform feature selection in multiclass learning problems, and in the future will move into causality inference.
Dimension reduction and feature selection can often be seen as special cases of learning the metric, or similarity or kernel function, at the root of a statistical task, rather than, for example, taking the Euclidean distance (and functions of it) as given. See our work on this here.
(More)
Kernel Machines and Estimators
Toward general methodology for complex data, we are working on extending the capabilities of the best nonparametric methods, which revolve around notions of similarities, or kernels, between data points. Support vector machines and their relatives are more or less the most accurate classifiers available, but require improvements in scalability and extensions to make them applicable to certain kinds of problems. We have: shown the utility of local kernel machines, which retain some of the advantages of SVMs while enjoying the greater scalability of other nonparametric methods; shown how to create nonnegative SVMs, which only find models with nonnegative coefficients; and proven the empirical and theoretical advantages of generative mean map kernels, a new kind of kernel for probability distributions, allowing complex generative models to be used within SVMs with increased accuracy. Kernel estimators (or smoothers) constitute another powerful class of nonparametric methods. We are working on ways to perform kernel estimation on data with measurement errors. See also our work on high-dimensional kernel estimation. We are working on the fundamental relationships between kernel estimators and kernel machines. We have also explored various questions regarding learning the kernel or similarity measure between points; we have: shown how to learn the kernel in kernel density estimation via semidefinite programming, yielding behavior similar to variable-bandwidth methods but bypassing the idea of bandwidth selection; shown how to learn the metric for a max-margin-based nearest-neighbor classifier; and developed a way to learn the kernel in support vector machines such that linear separability in the kernel space is guaranteed rather than hoped for.
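The kernel estimators discussed above have a very compact basic form; here is a minimal Nadaraya-Watson kernel smoother with a fixed Gaussian bandwidth h (the bandwidth and kernel learning described in the text replace exactly this fixed choice). The function name and parameters are illustrative.

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, h):
    """Nadaraya-Watson regression: kernel-weighted average of training targets."""
    w = np.exp(-(x_query[:, None] - x_train[None, :])**2 / (2 * h * h))
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

# recover a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
y_hat = kernel_regress(x, y, x, h=0.3)
```

Note that each query requires a sum over all training points, which is exactly the kernel-summation bottleneck addressed by the fast algorithms in the Algorithmics section.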
(More)
Structured and Complex Data
Structured data are those for which there are known types of dependencies between the variables; common types of structure include temporal and spatial dependence. We have: developed a form of independent component analysis (ICA) for functional data (such as time series) which accounts for the shape similarity between time series rather than requiring that they be aligned (as with the common use of iid ICA for time series); demonstrated a way to effectively learn fast approximations of hidden Markov models by casting the problem as a certain matrix factorization, which we call Markov matrix factorization; and developed a way to simultaneously learn transformations of each data point while estimating models, in the context of an affine nonnegative matrix factorization for images such as faces. See also our work on new kernels for parametric models, which allow structured (non-iid) data objects such as time series or images to be used within kernel machines. We are also working on ways to ensure generalization in dependent data.
(More)

Tools
HighImpact Applications and Software
Our goal: To make significant impact in today's outstanding, large-scale problems, while using their underlying issues to drive our long-term research questions.
When we can find the right collaborators, we try to focus on the problems which have the most long-term impact. This has often led us to longstanding fundamental science problems, but we have also begun to consider difficult problems occurring in industry. We believe that starting with problems in real-world domains, as opposed to academic ones, forces us to consider the truly hardest and most fundamental mathematical/computational problems, rather than simplify and abstract them. We embrace, rather than seek to avoid, the "non-ideal" aspects of real data analysis problems and regard them as opportunities for research. With each applied problem we tackle, we strive to develop not only an improved solution for it but also a general method (computational or statistical) which can be used for a much larger class of problems. First-of-a-kind application results our tools have enabled include:
We are working on customized wrappers of our general software which are tailored to each of the user communities we work with.
Astroinformatics Tools
Though in fact arguably very old, astroinformatics is just beginning to congeal as a community, in the way that bioinformatics has. Among the sciences, astronomy/astrophysics is the poster child of massive datasets, with the coming generation of sky surveys generating billions of data points per month, and a successful history among the sciences of being able to manage such influxes of data. We have worked in this area since 1993, beginning with contributions to the SKICAT star-galaxy separation system for the DPOSS survey. More recently we have been working mainly in the area of observational cosmology, or large-scale structure, with the SDSS survey. We have enabled the largest and most accurate quasar catalog to date, consisting of nearly 1 million quasars with the largest redshift and luminosity range yet, using our large-scale algorithm for kernel discriminant analysis; this has led to the first large-scale verification of cosmic magnification (article in Nature). Our algorithm for n-point correlation functions, the first efficient exact algorithm for this critical family of statistics, helped to enable the first large-scale evidence for dark energy using the WMAP data (article in Science), the most accurate 2-point function of AGNs, and the largest 3-point function ever computed. We are still actively working on extensions to this algorithm. Our fast algorithm for kernel density estimation enabled an influential study of the ecology of galaxies. We are working on what we believe will be the most accurate redshift estimator to date using our fast algorithms for kernel regression and kernel conditional density estimation, coupled with our work on kernel estimation accounting for measurement errors. We are also working on the problems of decomposing astrophysical spectra using our advanced methods of nonnegative matrix factorization, cross-matching catalogs, deblending overlapped objects, and friends-of-friends clustering in astrophysical simulations.
(More)
Bioinformatics and Medical Tools
Bioinformatics remains the model success story among the sciences of the multidisciplinary union of a science domain with algorithmic and statistical methods. Our work on various bioinformatics problems goes back many years, and includes some of the early work on microarray analysis attempting to learn genetic circuits from microarray data, as well as a more recent foray into integrating microarray data and gene ontologies to identify cancer biomarkers. We have also worked on segmentation of cell images toward cancer detection. More recently we have focused on mass spectrometry for metabolomics, and have developed functional support vector machine (fSVM) models yielding the most accurate early detector of ovarian cancer to date (article in Horizons). We are working on using our work on feature-selecting SVMs to identify a smaller set of biomarkers, and on the difficult calibration and noise removal issues with mass spectrometry data. We are also now working on identification of salient temporal patterns in EEG data for brain-computer interfaces for the disabled and for clinical diagnosis of sleep disorders.
Molecular modeling encompasses fundamental capabilities for the design and discovery of new drugs and materials, including tertiary structure prediction by protein folding or molecular dynamics simulations. The energy function used in such simulations lies at the center of these activities and is a source of constant experimentation and tuning, within the large space of possible approximations to the true (quantum mechanical) energy function. We have formalized the problem of learning energy functions via nonnegative ranking SVMs, and shown the ability to automatically learn superior energy functions. It has been found that traditional pairwise interaction potentials are not able to capture all of the salient physics in molecules, which has opened up the area of empirical and multibody (more than pairwise) potentials. Using our framework of generalized N-body methods, we have developed the first fast algorithm for multibody (general-order) potentials, enabling the largest-scale Axilrod-Teller molecular simulation to date. The next higher level of simulation fidelity lies at the quantum level, but longstanding approaches such as Hartree-Fock simulation are prohibitive, as they result in quartic runtime in the number of basis functions used; however, our generalized N-body framework points to a more efficient approach than has existed so far, and we are exploring whether quantum-level simulation can be done on much larger molecules than previously considered possible.
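To make the higher-order bottleneck concrete, here is the naive O(N^3) Axilrod-Teller triple-dipole sum that the fast multibody algorithm accelerates; nu is a material-dependent constant, set to 1 here purely for illustration.

```python
import numpy as np
from itertools import combinations

def axilrod_teller(X, nu=1.0):
    """Naive sum of the three-body triple-dipole potential over all triples.

    E = nu * sum_{i<j<k} (1 + 3 cos g_i cos g_j cos g_k) / (r_ij r_jk r_ik)^3,
    where g_i, g_j, g_k are the interior angles of each triangle of points.
    """
    E = 0.0
    for i, j, k in combinations(range(len(X)), 3):
        rij, rjk, rik = X[j] - X[i], X[k] - X[j], X[k] - X[i]
        dij, djk, dik = map(np.linalg.norm, (rij, rjk, rik))
        # cosines of the triangle's interior angles at points i, j, k
        ci = (rij @ rik) / (dij * dik)
        cj = (-rij @ rjk) / (dij * djk)
        ck = (rik @ rjk) / (dik * djk)
        E += nu * (1 + 3 * ci * cj * ck) / (dij * djk * dik)**3
    return E
```

The triple loop is cubic in the number of atoms, which is why a fast general-order algorithm is the difference between toy systems and the largest-scale simulations mentioned above.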
(More)
Industrial and Other Scientific Tools
Man-made data, such as internet data, has also become massive and lies at the heart of important societal problems. We have developed an initial system to automatically flag emails as spam based on the sending patterns of their source rather than their content, using a source of 300 million emails per day, and are working on a more advanced system using our fast online SVM and fast HMM algorithms. We are also working on a large system for classification and Netflix-like recommendation of documents. In high-energy physics, we have developed fast methods for particle event triggering in the Large Hadron Collider.
(More)
Software and Systems
We believe the ultimate incarnation of our research is robust software. There is a void in the space of today's machine learning software, which we are working to fill: it is either comprehensive (covering a large set of statistical methods) but non-scalable, such as Weka, R, or Matlab, or algorithmically sophisticated and tightly coded but limited in scope, such as SVMlight. Our model is the world of linear algebra, in which LAPACK represents both comprehensiveness and state-of-the-art algorithms within robust and widely usable software. Our attempt to create the analog for machine learning is called MLPACK, a collection of machine learning methods based on the Fastlib C++ library; a pre-release is available for the curious, though a much-improved version 1.0 release is scheduled for June 2010. Longer-term, we are also working on a comprehensive system for data visualization and analysis, to make machine learning usable by non-experts.
(More)
Algorithm Derivation and Code Generation
Ultimately, we would like to be able to develop customized models for new domains/clients orders of magnitude more quickly than is currently possible. Perhaps our most ambitious project is to, over time, encode all the statistical and computational principles needed to develop the algorithms and code that an expert could develop if given enough time, so that a modeler can declaratively specify models, have state-of-the-art algorithms derived for them and highly efficient code generated for those algorithms, and then have the models automatically tested and compared to each other. We have, based on our earlier involvement with the AutoBayes system, developed steps toward a type-theoretic formalization of optimization and a formalization of probability and statistics, which will form the base for a new system we call Algorithmica.
(More)

