
My background is in political economy and game theory. I am interested in the discussion on AI risk and alignment, but I have so far failed to find work on this that seriously engages with classic axiomatic rational choice theory (RCT).

Some claims are often made that, to me, contradict the basic tenets of RCT. For example, it is often suggested that a sufficiently advanced AGI might come up with goals of its own that put humanity at risk. However, according to RCT, agents can only choose actions, whereas their preferences (goals) are primitives of the model. It is nonsensical to think of an agent as choosing its goals, except in some well-defined dynamic model (say, Becker's model of rational addiction) where future preferences result from actions that can be derived, by equilibrium analysis, from primitive preferences.
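To fix ideas, here is the textbook formulation I have in mind (the notation is mine). In the static case the agent chooses an action, never its preferences:

$$ a^{*} \in \arg\max_{a \in A} u(a), $$

where the utility function $u$ is a primitive. In a Becker-style dynamic model the agent still maximizes a fixed lifetime objective,

$$ \max_{\{c_t\}} \sum_{t=0}^{\infty} \beta^{t}\, u(c_t, S_t), \qquad S_{t+1} = (1-\delta)\, S_t + c_t, $$

so any change in future "tastes" (through the stock $S_t$) is itself an equilibrium consequence of actions taken under primitive preferences.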

Another claim that perplexes me is that the actions of a sufficiently intelligent AI may be impossible to predict. To me, however, the more intelligent an agent is, the more closely its actions should approximate those of an expected utility maximizer.
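That is, under the von Neumann–Morgenstern axioms I would expect its behaviour to be increasingly well described by

$$ a^{*} \in \arg\max_{a \in A} \sum_{s \in S} p(s \mid a)\, u(s), $$

which, once the utility function and beliefs are known, is about as predictable as behaviour gets.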

I wonder whether serious people are working on this from a different axiomatic system. What are some good sources (at any level of technicality) that I should read? Books are preferred.

2 Answers


@raving-bandit, the common argument on your first question is that an AI system may take its goal to the extreme and divert all the world's resources to it, causing human extinction (see e.g. the discussion of the paperclip problem here: https://cepr.org/voxeu/columns/ai-and-paperclip-problem). RCT may not be the best framework because we typically assume everyone knows the game being played, whereas an AI system is dangerous precisely because it could think of moves no one would expect. This also relates to your second question. It appears that epistemic game theory (https://plato.stanford.edu/entries/epistemic-game/) or limited-rationality models such as cognitive hierarchies (https://en.wikipedia.org/wiki/Cognitive_hierarchy_theory) could be a better way to approach the problem than RCT.
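To make the cognitive-hierarchy idea concrete, here is a minimal level-k sketch for the classic p-beauty contest (the game and the numbers are my own illustration, not taken from the linked article): a level-0 player anchors on the midpoint of the interval, and each higher level best-responds to the level below it.

```python
# Minimal level-k sketch for the p-beauty contest: everyone names a number
# in [0, 100], and the winner is whoever is closest to p times the average.
# Level-0 anchors on 50; level-k best-responds to a population of level-(k-1).
# (Illustrative only -- not from the linked Wikipedia article.)

P = 2 / 3  # the "p" in the p-beauty contest

def level_k_guess(k: int) -> float:
    """Guess of a level-k player who believes everyone else is level k-1."""
    guess = 50.0                # level-0 anchor: the midpoint of [0, 100]
    for _ in range(k):
        guess *= P              # best response: multiply by p once per level
    return guess

for k in range(6):
    print(f"level {k}: guess {level_k_guess(k):6.2f}")
# As k grows the guess approaches 0, the unique Nash equilibrium;
# real subjects typically look like levels 1-3, not like the equilibrium.
```

Real subjects typically behave like levels 1-3; the worry with a highly capable AI is precisely that it may be reasoning at a depth no human opponent anticipates.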


Firstly, contrary to your claim, artificial agents are sometimes required to adopt multiple different goals for the sake of an intended goal, as in goal-conditioned RL (a minimal sketch follows the quote below).

Another active area of research is in learning goal-conditioned policies, also called contextual or universal policies that take in an additional goal as input to communicate a desired aim to the agent.
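Here is a minimal sketch of what "goal-conditioned" means operationally, assuming a toy 1-D chain environment and tabular Q-learning (the environment, rewards, and hyperparameters are illustrative, not from any cited paper): the policy and the value estimates take the commanded goal as an extra input, so the same agent can be aimed at different goals without its learning rule changing.

```python
import random
from collections import defaultdict

# Minimal sketch of a goal-conditioned tabular Q-learner on a 1-D chain.
# The policy takes (state, goal) as input, so the same agent can be
# "aimed" at any goal position without changing its learning rule.
# (Environment, rewards, and hyperparameters are illustrative only.)

N_STATES = 10          # positions 0..9 on the chain
ACTIONS = (-1, +1)     # move left or right
ALPHA, GAMMA, EPS = 0.5, 0.95, 0.1

Q = defaultdict(float)  # Q[(state, goal, action)] -> estimated return

def policy(state, goal):
    """Epsilon-greedy action choice, conditioned on the commanded goal."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, goal, a)])

def step(state, action):
    """Deterministic chain dynamics, clipped to the state space."""
    return min(max(state + action, 0), N_STATES - 1)

for episode in range(2000):
    goal = random.randrange(N_STATES)          # a new commanded goal each episode
    state = random.randrange(N_STATES)
    for _ in range(30):
        action = policy(state, goal)
        nxt = step(state, action)
        reward = 1.0 if nxt == goal else 0.0   # reward is defined relative to the goal
        best_next = 0.0 if nxt == goal else max(Q[(nxt, goal, a)] for a in ACTIONS)
        Q[(state, goal, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, goal, action)])
        state = nxt
        if state == goal:
            break

# After training, the same Q-table answers "how do I reach goal g from state s?"
# for every (s, g) pair, e.g.:
print(policy(0, goal=7), policy(9, goal=2))   # typically +1 and -1
```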

Bostrom further explored the idea of instrumental convergence, i.e. that intelligent agents might pursue certain instrumental goals regardless of their final objectives, in his paper The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents.

As for RCT, which is not game theory but classic evidential decision theory, extant RL in AI is consistent with RCT and may be viewed as an advanced, sequential, multi-shot version of it (see the sketch after the quote below). S. Russell discussed the relationship between rationality and intelligence in his Rationality and Intelligence: A Brief Update, offering insights that may align with your interest in applying RCT to AI.

The concept of rational agency has long been considered a leading candidate to fulfill this role. This paper, which updates a much earlier version (Russell, 1997), reviews the sequence of conceptual shifts leading to a different candidate, bounded optimality, that is closer to our informal conception of intelligence and reduces the gap between theory and practice.
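On the claim that RL is a sequential version of RCT, the standard RL objective is itself an expected-utility maximization over policies (this is the generic textbook formulation, not a quotation from Russell's paper):

$$ \pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], $$

where the reward function $r$, like the utility function in RCT, is a primitive of the problem rather than something the agent chooses.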

Hadfield-Menell et al. present a formal model, consistent with game theory and RCT, of an AI system with an off-switch controlled by a human, and explore how to design AI systems that can be safely interrupted, in their paper The Off-Switch Game.

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation... Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H’s actions as important observations about that utility... We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
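A toy numerical illustration of that key insight (the distribution over the action's utility is my own choice, not from the paper): if the robot is uncertain about the utility $U$ of its proposed action and the human only allows actions with $U > 0$, then deferring is worth $\mathbb{E}[\max(U, 0)]$, which weakly exceeds both acting immediately ($\mathbb{E}[U]$) and switching itself off ($0$).

```python
import random

# Toy sketch of the off-switch incentive (the distribution over the
# action's utility U is illustrative, not from Hadfield-Menell et al.).
# The robot R can: act now (payoff U), switch itself off (payoff 0),
# or defer to the human H, who allows the action iff U > 0.
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # R's belief about U

act_now    = sum(samples) / len(samples)                        # E[U]
switch_off = 0.0
defer_to_h = sum(max(u, 0.0) for u in samples) / len(samples)   # E[max(U, 0)]

print(f"E[U]        (act now)    = {act_now:+.3f}")
print(f"0           (switch off) = {switch_off:+.3f}")
print(f"E[max(U,0)] (defer)      = {defer_to_h:+.3f}")
# Deferring weakly dominates both alternatives: uncertainty about U is
# exactly what gives R an incentive to leave the off-switch usable.
```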
