UC Merced Electronic Theses and Dissertations

Large-Scale Methods for Nonlinear Manifold Learning

Abstract

High-dimensional data representation is an important problem in many areas of science, and it is becoming crucial to interpret data of varying dimensionality correctly. Dimensionality reduction methods process the data in order to visualize it, reduce its complexity, or find a latent representation of the original problem. Nonlinear dimensionality reduction algorithms (also known as manifold learning) decrease the dimensionality of the problem while preserving the overall structure of the data. Both spectral methods (such as Laplacian Eigenmaps or ISOMAP) and nonlinear embedding algorithms (NLE, such as t-SNE or the Elastic Embedding) have been shown to provide very good nonlinear embeddings of high-dimensional data sets. However, these methods are notorious for very slow optimization, which in practice prevents them from being used on data sets larger than a few thousand points.
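
As an illustration of the two families of methods mentioned above, the following sketch runs off-the-shelf scikit-learn implementations of ISOMAP, Laplacian Eigenmaps (SpectralEmbedding) and t-SNE on a small synthetic manifold. These are standard baseline implementations of the kind whose scaling the thesis aims to improve, not the accelerated algorithms proposed here; the data set and parameter choices are illustrative.

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, SpectralEmbedding, TSNE

# A small synthetic manifold: 3-D points lying on a 2-D "swiss roll".
X, _ = make_swiss_roll(n_samples=2000, random_state=0)

# Spectral methods: ISOMAP and Laplacian Eigenmaps.
Y_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
Y_le = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)

# A nonlinear embedding (NLE) method: t-SNE.
Y_tsne = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)

Even on a few thousand points the NLE optimization above takes noticeably longer than the spectral methods, which motivates the acceleration techniques described next.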

In this thesis we investigate several techniques to improve different stages of nonlinear dimensionality reduction algorithms. First, we analyze entropic affinities as a better way to build a similarity matrix; we explore their properties and propose a nearly optimal algorithm to construct them. Second, we present a novel, faster method to optimize NLE by using second-order information during the optimization. Third, for spectral methods, we investigate landmark-based optimization, which substitutes the original large-scale problem with a much smaller, easy-to-solve subproblem. Finally, we apply a Fast Multipole Method approximation that allows fast computation of the gradient and the objective function of NLE, reducing their computational complexity from O(N^2) to O(N).
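
To make the first contribution concrete, the sketch below computes entropic affinities: for each point, the precision of its Gaussian neighborhood is tuned so that the entropy of the conditional distribution matches log(perplexity). This version uses plain bisection on the per-point precision; the function name and the bisection strategy are illustrative assumptions, not the nearly optimal root-finding algorithm developed in the thesis.

import numpy as np

def entropic_affinities(D2, perplexity=30.0, tol=1e-5, max_iter=50):
    """Per-point Gaussian affinities whose conditional entropy matches
    log(perplexity). D2 is an (N, N) matrix of squared distances."""
    N = D2.shape[0]
    target = np.log(perplexity)
    P = np.zeros((N, N))
    for n in range(N):
        d = np.delete(D2[n], n)          # exclude the self-affinity
        beta, lo, hi = 1.0, 0.0, np.inf  # beta_n = 1 / (2 * sigma_n^2)
        for _ in range(max_iter):
            w = np.exp(-beta * d)
            Z = w.sum()
            p = w / Z
            # Shannon entropy of p: H = log Z + beta * E_p[d]
            H = np.log(Z) + beta * np.dot(p, d)
            if abs(H - target) < tol:
                break
            if H > target:               # distribution too flat: increase beta
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (lo + hi) / 2
            else:                        # distribution too peaked: decrease beta
                hi = beta
                beta = (lo + hi) / 2
        P[n] = np.insert(p, n, 0.0)      # put the zero self-affinity back
    return P

For example, P = entropic_affinities(cdist(X, X)**2, perplexity=30.0) with scipy.spatial.distance.cdist gives a row-stochastic affinity matrix that can serve as the similarity input to an NLE method such as t-SNE or the Elastic Embedding.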

Each of the proposed methods accelerates the optimization dramatically, by one or two orders of magnitude compared to existing techniques, effectively allowing the corresponding methods to run on data sets with millions of points.
