The Alignment Problem Under Partial Observability

Abstract

We adopt the game-theoretic framework of assistance games to study the human-AI alignment problem. Past work on assistance games studied the case where both the human and the AI assistant fully observe the physical state of the environment. Generalizing to the case where the human and the assistant may only partially observe the environment, we present the partially observable assistance game (POAG). Using the framework of POAGs, we prove a variety of theoretical results about AI assistants. We first consider the question of observation interference, showing three distinct factors that can cause an optimal AI assistant to interfere with a human’s observations. We then revisit past guarantees about the so-called off-switch problem, showing that partial observability poses a new challenge for designing AI assistants that allow themselves to be switched off. Finally, we characterize how partial observability can cause reinforcement learning from human feedback---a widely-used algorithm for training AI assistants---to fall into deceptive failure modes. We conclude by discussing possible paths for translating these theoretical insights into improved techniques for creating beneficial AI assistants.
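For readers unfamiliar with the formalism, the following is a minimal sketch of how a partially observable assistance game might be defined, modeled on the standard assistance-game tuple extended with POMDP-style observation functions. Every symbol below is an assumption for illustration; the exact tuple and notation in the dissertation may differ.

% Hypothetical POMDP-style formalization of a POAG (illustrative only).
% S: states; A^H, A^R: human and assistant action sets; T: transition kernel;
% Theta: reward-parameter space, observed by the human but not the assistant;
% O^H, O^R: observation sets; sigma^H, sigma^R: observation functions.
\[
\mathcal{M} \;=\; \bigl\langle S,\; \{A^{\mathrm{H}}, A^{\mathrm{R}}\},\; T,\;
\Theta,\; r,\; \{O^{\mathrm{H}}, O^{\mathrm{R}}\},\;
\{\sigma^{\mathrm{H}}, \sigma^{\mathrm{R}}\},\; \gamma \bigr\rangle
\]
where \(T(s' \mid s, a^{\mathrm{H}}, a^{\mathrm{R}})\) gives transition probabilities, \(r(s, a^{\mathrm{H}}, a^{\mathrm{R}}; \theta)\) is a shared reward parameterized by \(\theta \in \Theta\), and \(\sigma^{i}(o^{i} \mid s)\) gives player \(i\)'s observation distribution. Partial observability enters through \(\sigma^{\mathrm{H}}\) and \(\sigma^{\mathrm{R}}\): when these differ from the identity, neither player need see the full state, which is the setting the abstract's results on observation interference, off-switch guarantees, and RLHF failure modes concern.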
