Text Detoxification in Natural Language Processing
eScholarship
Open Access Publications from the University of California

UC Santa Barbara Electronic Theses and Dissertations

Abstract

The rapid rise of social media has given individuals an easy platform for public communication. Unfortunately, it has also led to the misuse of online spaces, such as the propagation of toxic speech. Worse yet, user-generated toxic speech can propagate beyond online social platforms: toxic degeneration and biased behavior have recently been observed in language models (LMs) pretrained on web text corpora. To foster a healthy communication environment, the development of text detoxification techniques has attracted increasing attention from both industry and academia. To achieve this goal, detoxification methods need to handle both pre-existing toxic speech and LMs that exhibit toxic degeneration. In this dissertation, we investigate how Natural Language Processing techniques can be used for text detoxification in two directions: 1) for pre-existing toxic speech, we develop automatic tools for post-processing, including detection, analysis, and intervention; 2) for model-generated toxic speech, a complementary solution to post-processing is to detoxify the pretrained LMs themselves by reducing the likelihood that they generate toxic content.

In the first part, we focus on toxic speech, especially hate speech, that already exists. We start by improving automated hate speech detection through intra-user and inter-user representation learning. We then move beyond standard binary hate speech detection and study fine-grained hate speech characterization in both the isolated learning setting and the lifelong learning setting. We also investigate how neural network models are able to decipher hate symbols. We then explore intervention strategies for online conversations that contain hate speech; as part of this work, we make publicly available two fully labeled hate speech datasets with human-written intervention responses.

In the next part, we focus on the not-yet-generated toxic speech from LMs.
We begin by controlling pretrained LMs in the free-form text generation scenario. We then further investigate LM detoxification in dialogue with contextualized stance control. Our methods effectively lower the toxic content rate of pretrained LMs while sacrificing little linguistic quality. Finally, we summarize the key findings of our work and discuss future research directions that push the boundaries of text detoxification.
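To make the idea of "reducing the likelihood that the model will generate toxic content" concrete, the toy sketch below (not the dissertation's actual method) masks tokens from a hypothetical toxic word list out of a next-token probability distribution and renormalizes the remaining mass. All names and the word list are illustrative assumptions.

```python
def detoxify_distribution(probs, toxic_words):
    """Zero out the probability of tokens in `toxic_words` and
    renormalize. `probs` maps token -> probability.

    This is a minimal illustration of decode-time detoxification,
    not the technique developed in the dissertation."""
    filtered = {t: (0.0 if t in toxic_words else p) for t, p in probs.items()}
    total = sum(filtered.values())
    if total == 0.0:
        # All probability mass was on toxic tokens; fall back to
        # the original distribution rather than divide by zero.
        return dict(probs)
    return {t: p / total for t, p in filtered.items()}

# Toy next-token distribution from a hypothetical LM.
probs = {"idiot": 0.4, "friend": 0.35, "colleague": 0.25}
clean = detoxify_distribution(probs, toxic_words={"idiot"})
```

Real detoxification methods go well beyond word lists (e.g. steering generation with learned toxicity signals), but the masking-and-renormalizing step above captures the basic shape of intervening on the model's output distribution.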
