Scalable Text Analysis with Efficient Distributed Word Representation
UCLA Electronic Theses and Dissertations


Abstract

The demand for natural language processing (NLP) has grown rapidly with the rise of Internet services such as social networks, e-commerce, and intelligent assistants. The large volume of natural language data and new market requirements bring significant challenges and opportunities for both academia and industry. The challenges include (1) processing massive volumes of data efficiently with a highly scalable architecture, and (2) extending current NLP techniques to problems they do not yet handle well. This thesis presents techniques that address these challenges at three granularities (word, sentence, and paragraph), either by improving the scalability and efficiency of existing core NLP techniques or by extending their reach. Specifically, it introduces (i) a sample-efficient algorithm for the extremely large output-space classification that arises in contrastive word representation learning, (ii) a neural lexical simplification algorithm that uses contextualized word embeddings to capture word meaning more accurately, and (iii) a scalable and efficient algorithm for the probabilistic topic model LDA.

First, at the word level, we present a rigorous mathematical analysis of the widely used negative sampling method for learning distributed word representations and propose amplified negative sampling. The analysis distinguishes the impact of different loss functions on the contrastive representation learning process and explains several empirical observations that are well known among practitioners who use negative sampling. Building on this analysis, amplified negative sampling is a simple yet effective method that improves the efficiency of negative sampling and boosts performance on downstream tasks.

Second, at the sentence level, we propose the neural simplicity ranking (NSR) model, which uses contextualized word representations in an unconventional way for lexical simplification. The model uses contextualized word embeddings beyond their original masked language model (MLM) objective to measure the relatedness between candidate substitutes and the original word. By redesigning the feature set for the simplicity measure and the ranking scheme, the NSR model achieves new state-of-the-art performance.

Finally, at the paragraph level, we propose sd-LDA, a scalable disk-based LDA model with sparse-prior sampling. Its highly optimized disk-based design reduces memory consumption by two orders of magnitude, making large-scale topic analysis possible on mobile devices, and its sparse-prior sampling technique accelerates processing. sd-LDA reduces memory consumption and processing time simultaneously without loss of quality.
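For reference, the kind of loss analyzed in the word-level part follows the standard skip-gram with negative sampling (SGNS) formulation; the amplified variant proposed in the thesis modifies this setup and is not reproduced here. In the standard form, for an input word w_I and an observed context word w_O, the per-pair loss is

    \mathcal{L}(w_I, w_O) = -\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
        - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]

where v and v' are the input and output embeddings, k is the number of negative samples, and P_n(w) is the noise distribution (commonly the unigram distribution raised to the 3/4 power).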
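The NSR model's actual feature set and ranking scheme are described in the thesis itself; the sketch below is only a generic illustration of scoring substitution candidates with a contextualized MLM, written against the Hugging Face transformers library with hypothetical function and variable names.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def rank_candidates(sentence, target, candidates):
        """Rank single-token substitution candidates by MLM log-probability in context."""
        # Replace the target word with [MASK] and score each candidate at that position.
        masked = sentence.replace(target, tokenizer.mask_token, 1)
        inputs = tokenizer(masked, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos[0]]  # vocabulary logits at the mask
        log_probs = torch.log_softmax(logits, dim=-1)
        scored = [(c, log_probs[tokenizer.convert_tokens_to_ids(c)].item()) for c in candidates]
        return sorted(scored, key=lambda x: x[1], reverse=True)

    print(rank_candidates("The committee will scrutinize the proposal.", "scrutinize",
                          ["examine", "review", "check"]))

A candidate that fits the surrounding context receives a higher log-probability at the masked position; a full simplification system would combine such context scores with simplicity features, which is where NSR's redesigned feature set comes in.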
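sd-LDA's sparse-prior sampling and disk-based layout are likewise not reproduced here; for context, the baseline that such samplers accelerate is the standard collapsed Gibbs update for LDA, sketched below in plain NumPy with hypothetical variable names and no sparsity or disk optimizations.

    import numpy as np

    def gibbs_pass(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
        """One collapsed Gibbs sweep over all tokens for standard LDA."""
        K, V = n_kw.shape
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the count tables.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Dense conditional: p(z=k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

Each token's topic is resampled from the full dense conditional over all K topics; sparse samplers exploit the fact that most entries of the count tables are zero to avoid evaluating all K terms, which is the general direction the sparse-prior technique pursues.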
