
Matrix Computations and Optimization
Matrix computations are some of the core sources of computational expense in machine learning.



Fast Stochastic Frank-Wolfe Algorithms for Nonlinear SVMs
Hua Ouyang and Alexander G. Gray
SIAM International Conference on Data Mining (SDM) 2010
While fast online algorithms exist for linear support vector machines, they do not perform as well for nonlinear SVMs. We present the fastest online algorithm for nonlinear SVMs (using squared loss), based on the classic Frank-Wolfe method.
[pdf]
Abstract:
The high computational cost of nonlinear support vector machines has limited their usability for large-scale problems. We propose two novel stochastic algorithms to tackle this problem. These algorithms are based on a simple and classic optimization method: the Frank-Wolfe method, which is known to be fast for problems with a large number of linear constraints. Formulating the nonlinear SVM problem to take advantage of this method, we achieve a provable time complexity of O(dQ^2 / epsilon^2). The proposed algorithms achieve accuracies that are comparable to or even better than the state-of-the-art methods, and are significantly faster.
@inproceedings{ouyang2010sfw,
Author = "Hua Ouyang and Alexander G. Gray",
Title = "{Fast Stochastic Frank-Wolfe Algorithms for Nonlinear SVMs}",
booktitle = "SIAM International Conference on Data Mining (SDM)",
Year = "2010"
}
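The classic Frank-Wolfe iteration underlying this work is simple enough to sketch. Below is a minimal, illustrative version minimizing a generic quadratic over the probability simplex (the constraint set that arises in dual SVM formulations); the objective, function names, and step-size schedule are assumptions for illustration, not the paper's stochastic variants.

```python
import numpy as np

def frank_wolfe(Q, c, n_iters=200):
    """Minimize f(x) = 0.5 x^T Q x - c^T x over the probability simplex
    using the classic Frank-Wolfe method (illustrative sketch)."""
    n = len(c)
    x = np.ones(n) / n                      # start at the simplex center
    for t in range(n_iters):
        grad = Q @ x - c                    # gradient of the quadratic
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0            # linear minimization oracle: best vertex
        gamma = 2.0 / (t + 2.0)             # classic diminishing step size
        x = (1 - gamma) * x + gamma * s     # convex combination stays feasible
    return x
```

Each step solves a linear program over the feasible set (here trivially, by picking the vertex with the most negative gradient coordinate), which is why the method is attractive when the constraint set has cheap linear minimization.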

See also
Methods for Kernel Matrix-Vector Multiplications
Kernel matrix-vector multiplications can be done quickly using efficient kernel summation methods.
[see webpage here]
Type-theoretic Formalization of Optimization
We have developed a type system for mathematical programs, toward automatic reformulation of optimization problems.
[see full entry here]
Scalable Semidefinite Programming
We showed an approach to scalable learning via convex relaxations of semidefinite programs, in the context of manifold learning. [see full entry here]
In preparation
Efficient Optimization for Fractional-Norm SVMs
We have developed faster optimization methods for L_q SVMs, where 0 < q < 1, based on recent ideas in optimization theory.
Kernel PCA and Gaussian Process Regression
While our fast PCA/SVD method can be used for kernel PCA and Gaussian process regression,
we are developing a methodology which does not have to explicitly form the
N x N kernel matrix, circumventing significant computational and storage costs.


Mixed Integer Support Vector Machines
Wei Guan and Alexander Gray
Neural Information Processing Systems (NIPS) 2009 Workshop on Optimization in Machine Learning
An approach to L0 SVMs exploiting the recent perspective-cuts method of mixed integer programming, toward more aggressive and accurate feature selection. [pdf]
Abstract:
In this paper, we propose a formulation of a feature-selecting support vector machine based on the L0-norm. We explore two convex relaxations of the optimization problem and solve them using mixed-integer nonlinear programming (MINLP) techniques. Given a training set of labeled data instances, we construct a max-margin classifier that minimizes the hinge loss as well as the cardinality of the weight vector of the separating hyperplane, ||w||_0, effectively performing feature selection and classification simultaneously in one optimization. We compare these two relaxations with the standard L2-norm SVM, the L1-norm SVM, and the recursive feature elimination (RFE) method, and show promising results on real-world datasets in terms of accuracy and sparsity.
@inproceedings{guan2010misvm,
Author = "Wei Guan and Alexander G. Gray and Sven Leyffer",
Title = "{Mixed Integer Support Vector Machines}",
booktitle = "Neural Information Processing Systems (NIPS) Workshop on Optimization in Machine Learning",
Year = "2009"
}
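The objective described in the abstract, hinge loss plus the cardinality ||w||_0, is easy to state concretely. The sketch below only evaluates that objective for a given weight vector; it is not the MINLP solution method, and the function name and penalty weight lam are illustrative assumptions.

```python
import numpy as np

def l0_svm_objective(w, b, X, y, lam):
    """Hinge loss plus a cardinality penalty lam * ||w||_0
    (evaluation only; illustrative, not the paper's solver)."""
    # hinge loss over the training set, with labels y in {-1, +1}
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
    # ||w||_0: the number of selected (nonzero) features
    return hinge + lam * np.count_nonzero(w)
```

The nonconvex ||w||_0 term is what forces the move from standard convex QP solvers to mixed-integer techniques.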


QUIC-SVD: Fast SVD Using Cosine Trees
Michael Holmes, Alexander Gray, and Charles Isbell
Neural Information Processing Systems (NIPS) 2008, appeared 2009
Fast singular value decomposition (SVD) using multiresolution Monte Carlo based on cosine trees, a new data structure.
Unlike previous Monte Carlo linear algebra approaches, it is not limited to a fixed-rank approximation and is thus automatically
as accurate as the user requires, without the need to try many different ranks or other parameters. It is also more efficient for the same number of samples than previous sampling schemes.
[pdf]
Abstract:
The Singular Value Decomposition is a key operation in many machine learning
methods. Its computational cost, however, makes it unscalable and impractical for
the massive-sized datasets becoming common in applications. We present a new
method, QUIC-SVD, for fast approximation of the full SVD with automatic
sample size minimization and empirical relative error control. Previous Monte Carlo
approaches have not addressed the full SVD nor benefited from the efficiency of
automatic, empirically-driven sample sizing. Our empirical tests show speedups
of several orders of magnitude over exact SVD. Such scalability should enable
QUIC-SVD to meet the needs of a wide array of methods and applications.
@Inproceedings{holmes2009quicsvd,
author = "Michael Holmes and Alexander Gray and Charles Isbell",
Title = "{QUIC-SVD: Fast SVD Using Cosine Trees}",
booktitle = "Advances in Neural Information Processing Systems 21 (Dec 2008)",
Year = "2009",
publisher = {MIT Press}
}
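For contrast, the earlier fixed-rank Monte Carlo approaches mentioned above can be sketched via length-squared row sampling. This is a simplified illustration of that prior style of method, not the paper's cosine-tree algorithm; all names and the sampling scheme here are illustrative assumptions.

```python
import numpy as np

def sampled_low_rank(A, k, n_samples, rng):
    """Fixed-rank Monte Carlo low-rank approximation via length-squared
    row sampling (the style of earlier approach contrasted with QUIC-SVD)."""
    p = (A * A).sum(axis=1)
    p = p / p.sum()                                  # row-sampling probabilities
    idx = rng.choice(A.shape[0], size=n_samples, p=p)
    S = A[idx] / np.sqrt(n_samples * p[idx, None])   # rescaled sample matrix
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    V = Vt[:k].T                                     # approximate top-k right basis
    return (A @ V) @ V.T                             # rank-k approximation of A
```

Note that the target rank k and sample count must be chosen up front here, which is exactly the limitation that an automatic, error-driven sampling scheme removes.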


Fast Kernel Matrix-Vector Multiplication with Application to Gaussian Process Learning
Alexander G. Gray
Carnegie Mellon University Technical Report, 2004
An example of using our dual-tree, automatically error-controlled kernel summation methods for fast matrix-vector multiplication with kernel matrices, as arises in Gaussian process learning. The idea of using kernel summation methods inside linear algebraic methods had appeared in some computational physics problems, but was introduced here for the first time for statistical problems.
[pdf]
Abstract:
A number of core computational problems in machine learning, both old and new, can be cast as a matrix-vector multiplication between a kernel matrix or class-probability matrix and a vector of weights. This arises prominently, for example, in the kernel estimation methods of nonparametric statistics, many common probabilistic graphical models, and the more recent kernel machines. After highlighting the existence of this computational problem in several well-known machine learning methods, we focus for clarity on a solution for one specific example: Gaussian process (GP) prediction, one whose applicability has been particularly hindered by this computational barrier. We demonstrate the application of a recent N-body approach developed specifically for statistical problems, employing adaptive computational geometry and finite-difference approximation. This core algorithm reduces the O(N^2) matrix-vector multiplications within GP learning to O(N), making the resulting overall learning algorithm O(N). GP learning for N = 1 million points is demonstrated.
@techreport{gray2004kermatvec,
title = "{Fast Kernel Matrix-Vector Multiplication with Application to Gaussian Process Regression}",
author = "Alexander G. Gray",
institution = "{Carnegie Mellon University}",
series = "{Computer Science Department Technical Report, CMU-CS-04-110}",
year = "2004"
}
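The O(N^2) kernel matrix-vector product that this work accelerates is, in its naive form, the following (a Gaussian kernel and the names below are assumed for illustration; the dual-tree algorithm computes the same quantity to within a controlled error without forming K explicitly):

```python
import numpy as np

def kernel_matvec(X, w, h):
    """Naive O(N^2) Gaussian kernel matrix-vector product:
    (K w)_i = sum_j exp(-||x_i - x_j||^2 / (2 h^2)) w_j."""
    # all pairwise squared distances, shape (N, N)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * h * h))          # the N x N kernel matrix
    return K @ w                             # one O(N^2) product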

