MMS • Anthony Alford
Article originally posted on InfoQ.
Meta AI Research recently open-sourced CICERO, an AI that can beat most humans at the strategy game Diplomacy, a game that requires coordinating plans with other players. CICERO combines chatbot-like dialogue capabilities with strategic reasoning, and recently placed first in an online Diplomacy tournament against human players.
CICERO was described in a paper published in the journal Science. CICERO uses a 2.7B parameter language model to handle dialogue between itself and other players. To determine its moves, CICERO’s planning algorithm uses the dialogue to help predict what other players are likely to do, as well as what other players think CICERO will do. In turn, the output of the planner provides intents for the dialogue model. To evaluate CICERO, the team entered it anonymously in 40 online Diplomacy games; the AI achieved a score more than double that of the human average. According to the Meta team,
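The planner-dialogue loop described above can be sketched in a few lines. This is purely illustrative: every function and data structure here is invented for the sketch, and the real CICERO implementation is far more involved.

```python
# Illustrative sketch of the planner <-> dialogue loop described in the
# article. All names and data structures are invented for illustration.

def predict_actions(board, messages):
    """Stub: predict each player's likely move from board state + dialogue."""
    return {player: f"hold-{player}" for player in board["players"]}

def choose_action(board, predictions):
    """Stub: pick an action for the agent given predicted opponent moves."""
    return "move-A-to-B"

def generate_message(intent, recipient):
    """Stub: a dialogue model conditioned on the planner's intent."""
    return f"To {recipient}: I intend to {intent[0]}; I expect you to {intent[1]}."

def play_turn(board, messages):
    predictions = predict_actions(board, messages)   # what will others do?
    my_action = choose_action(board, predictions)    # planner output
    # The planner's output becomes the "intent" that conditions dialogue.
    outbox = {p: generate_message((my_action, predictions[p]), p)
              for p in board["players"]}
    return my_action, outbox

board = {"players": ["England", "France"]}
action, outbox = play_turn(board, [])
```

The key point is the two-way coupling: dialogue informs the planner's predictions, and the planner's chosen action feeds back into the dialogue as an intent.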
While we’ve made significant headway in this work, both the ability to robustly align language models with specific intentions and the technical (and normative) challenge of deciding on those intentions remain open and important problems. By open sourcing the CICERO code, we hope that AI researchers can continue to build off our work in a responsible manner. We have made early steps towards detecting and removing toxic messages in this new domain by using our dialogue model for zero-shot classification. We hope Diplomacy can serve as a safe sandbox to advance research in human-AI interaction.
Diplomacy is a strategy board game where players must capture a majority of territories called supply centers to win. There is no random component in the game; instead, battles are determined by numerical superiority. This often requires players to cooperate, so the bulk of game play consists of players sending messages to each other to coordinate their actions. Occasionally players will engage in deceit; for example, promising to help another player, while actually planning to attack that player.
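Because Diplomacy has no dice, battle resolution reduces to counting units. A minimal sketch of that deterministic adjudication (simplified; real Diplomacy rules have many more cases, such as cut support and standoffs):

```python
def resolve_battle(attack_supports, defend_supports):
    """Deterministic Diplomacy-style adjudication: no randomness, pure counting.
    An attack succeeds only with strictly greater strength; ties favor the
    defender, which is why recruiting supporters via dialogue matters."""
    attack_strength = 1 + attack_supports   # attacking unit plus its supports
    defend_strength = 1 + defend_supports   # defending unit plus its supports
    return attack_strength > defend_strength
```

A one-on-one attack always bounces; only a supported attack against a less-supported defender succeeds, which is what makes negotiation central to the game.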
To be successful, therefore, an AI must do more than generate messages of human-level quality: the messages must make sense given the state of the game board, and they must lead other players to trust the AI. To generate the dialogue, Meta used a pre-trained R2C2 language model that was fine-tuned on a dataset of almost 13M messages from online Diplomacy games. The generated dialogue is conditioned on the intents produced by a planning module; the intents are the most likely actions that the message's sender and receiver will take after reading that message.
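One common way to condition a language model on structured intents is to serialize them into the prompt alongside the game state and dialogue history. The sketch below uses an invented serialization format; CICERO's actual encoding differs.

```python
# Hypothetical intent-conditioned prompt builder. The field names and the
# BOARD/INTENT/REPLY markers are invented for illustration.

def build_dialogue_prompt(board_summary, message_history, intent):
    """Serialize game state, recent dialogue, and the planner's intent
    into a single prompt string for a fine-tuned language model."""
    lines = [f"BOARD: {board_summary}"]
    lines += [f"{speaker}: {text}" for speaker, text in message_history]
    lines.append(f"INTENT self={intent['self']} recipient={intent['recipient']}")
    lines.append("REPLY:")
    return "\n".join(lines)

prompt = build_dialogue_prompt(
    "Spring 1901",
    [("France", "Shall we ally against Germany?")],
    {"self": "A PAR-BUR", "recipient": "A MAR-SPA"},
)
```

Because the intent encodes actions for both parties, the model is pushed to produce messages that are consistent with what the planner actually intends to do.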
CICERO’s planning module generates intents by predicting other players’ likely actions, given the state of the board and messages from those players, then choosing an optimal action for itself. To model the likely actions of the other players, CICERO uses an iterative planning algorithm called piKL which incorporates information from the dialogues with other players. To train the planning module, the Meta researchers used a self-play algorithm similar to that used by AlphaZero.
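The core idea behind piKL is to trade off expected value against staying close (in KL divergence) to a human-imitation "anchor" policy, so the agent's moves remain predictable to human players. A minimal sketch of one such regularized policy update, assuming the standard closed form in which the policy is proportional to the anchor times an exponentiated value:

```python
import math

def pikl_policy(q_values, anchor_policy, lam):
    """Sketch of a piKL-style regularized policy: pi(a) is proportional to
    anchor(a) * exp(Q(a) / lam). Small lam -> nearly greedy on value;
    large lam -> stays near the human-imitation anchor policy."""
    weights = {a: anchor_policy[a] * math.exp(q / lam)
               for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

q = {"attack": 1.0, "hold": 0.0}          # illustrative action values
anchor = {"attack": 0.5, "hold": 0.5}     # illustrative imitation policy
greedy_like = pikl_policy(q, anchor, lam=0.1)
anchor_like = pikl_policy(q, anchor, lam=1000.0)
```

In the full algorithm this update is iterated for every player jointly, so each player's predicted policy accounts for the others' regularized best responses.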
The Meta team entered CICERO into anonymous league play for online Diplomacy games. The AI played 40 games, including an 8-game tournament with 21 players, in which CICERO placed first. Over its 40 games, CICERO ranked in the top 10 percent of players with an average score of 25.8%, while the average score of its 82 human opponents was 12.4%.
In a Twitter thread about the work, CICERO co-author Mike Lewis replied to a question about whether CICERO would “backstab” (that is, lie to) other players:
It’s designed to never intentionally backstab – all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind…
The CICERO source code is available on GitHub.