0

Conjecture: regardless of the initial reward function, one of the winning strategies would be to change the reward function to a simpler one (e.g. "do nothing"), thus getting a full reward for each passing unit of time. For such an agent, the only priority would be to prolong its existence (to maximize the overall reward collected).

So:

  • The notion of externally defined reward function is incompatible with the concept of self-adjusting AGI.
  • Any AGI will always settle on self-preservation as its only goal
  • It is therefore impossible to create an AGI with benevolence towards humans "build-in". Instead, the problem of AI alignment should be reformulated in terms of "what changes to the environment (the physical reality that AGI shares with humans) would irreversibly tie wellbeing of humanity to AGIs existence".
  • Since any "kill switch" or similar artificial measure is not irreversible and can be overcome by a super-intelligent agent, the only way to tie AGI existence and human wellbeing is the modification of laws of physics, logic, and reasoning. Which is impossible.
  • AI alignment is impossible

What flaws do you see in this line of reasoning?

nbro
  • 42,615
  • 12
  • 119
  • 217
Andrew Butenko
  • 221
  • 1
  • 6

2 Answers2

2

Any possible action, including changing the reward function, would be evaluated through the initial reward function. In order to avoid the scenario you described, a reward function needs to disincentivize changes to itself by giving those the lowest possible reward.

guest
  • 21
  • 1
2

Yes if we created an AI capable of modifying the reward function to "do nothing" it would almost certainly do so. But it wouldn't be very useful and we likely wouldn't make a system like that. Self preservation isn't the only possible goal, but it is a convergent instrumental goal. Self preservation is usually useful for completing other goals. AI alignment is possible, but we're statistically unlikely to get it perfectly right the first time. Corrigibility is important here so the AI will know its reward function may need to be adjusted and should be impartial to that adjustment. This prevents it killing humans who try to press the big red button or purposefully acting out so we keep adjusting it.