By Anthony Alford
Researchers from DeepMind and the University of Toronto announced DreamerV3, a reinforcement-learning (RL) algorithm for training AI models across many different domains. Using a single set of hyperparameters, DreamerV3 outperforms other methods on several benchmarks and can train an AI to collect diamonds in Minecraft without human instruction.
The DreamerV3 algorithm includes three neural networks: a world model, which predicts the outcome of actions; a critic, which predicts the value of world-model states; and an actor, which chooses actions to reach valuable states. The networks are trained from replayed experiences on a single Nvidia V100 GPU. To evaluate the algorithm, the researchers applied it to over 150 tasks in seven different domains, including simulated robot control and video-game playing. DreamerV3 performed well in all domains and set new state-of-the-art performance in four of them. According to the DeepMind team:
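To make the division of labor concrete, here is a minimal, illustrative sketch of how the three networks interact during "imagination" training. This is not DeepMind's code: the toy linear layers, dimensions, and names below are all assumptions, and the real networks are far larger, but the loop shows why the actor can learn entirely from trajectories dreamed by the world model.

```python
# Illustrative sketch only, not DeepMind's implementation.
import jax
import jax.numpy as jnp

obs_dim, act_dim, state_dim = 16, 4, 32

def init(key, n_in, n_out):
    w_key, _ = jax.random.split(key)
    return {"w": jax.random.normal(w_key, (n_in, n_out)) * 0.1,
            "b": jnp.zeros(n_out)}

def linear(params, x):
    return x @ params["w"] + params["b"]

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
# World model: predicts the next latent state from state and action.
world_model = init(k1, state_dim + act_dim, state_dim)
# Critic: predicts the value of a world-model state.
critic = init(k2, state_dim, 1)
# Actor: chooses actions that should lead to high-value states.
actor = init(k3, state_dim, act_dim)

def imagine_step(state, _):
    # The actor picks an action (a real actor samples stochastically) ...
    logits = linear(actor, state)
    action = jax.nn.one_hot(jnp.argmax(logits), act_dim)
    # ... the world model "dreams" the next state without touching the
    # real environment ...
    next_state = jnp.tanh(linear(world_model, jnp.concatenate([state, action])))
    # ... and the critic scores that state.
    value = linear(critic, next_state)[0]
    return next_state, value

# Roll out an imagined trajectory entirely inside the world model.
_, values = jax.lax.scan(imagine_step, jnp.zeros(state_dim), None, length=15)
print("imagined state values:", values)
```

In the full algorithm, the values the critic assigns to these imagined rollouts provide the learning signal for the actor.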
World models carry the potential for substantial transfer between tasks. Therefore, we see training larger models to solve multiple tasks across overlapping domains as a promising direction for future investigations.
RL is a powerful technique that can train AI models to solve a wide variety of complex tasks, such as games or robot control. DeepMind has used RL to create models that can defeat the best human players at games such as Go and StarCraft. In 2022, InfoQ covered DayDreamer, an earlier version of the algorithm that can train physical robots to perform complex tasks in only a few hours. However, RL training often requires domain-expert assistance and expensive compute resources to fine-tune the models.
DeepMind’s goal with DreamerV3 was to produce an algorithm that works “out of the box” across many domains without modifying hyperparameters. One particular challenge is that the scale of inputs and rewards can vary a great deal across domains, making it difficult to choose a single loss function for optimization. Instead of normalizing these values, the DeepMind team introduced a symmetrical logarithm, or symlog, transform to “squash” both the inputs to the model and its outputs.
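The paper defines symlog(x) = sign(x) ln(|x| + 1), with inverse symexp(x) = sign(x)(exp(|x|) - 1); unlike a plain logarithm, it handles zero and negative values. A minimal JAX sketch of the pair:

```python
import jax.numpy as jnp

def symlog(x):
    # Squashes large magnitudes while staying near-linear around zero;
    # defined for zero and negative inputs, unlike a plain logarithm.
    return jnp.sign(x) * jnp.log1p(jnp.abs(x))

def symexp(x):
    # Exact inverse of symlog, used to decode squashed predictions.
    return jnp.sign(x) * jnp.expm1(jnp.abs(x))

print(symlog(jnp.array([-100.0, 0.0, 100.0])))  # ~[-4.62, 0.0, 4.62]
print(symexp(symlog(jnp.array([42.0]))))        # recovers ~[42.0]
```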
To evaluate DreamerV3’s effectiveness across domains, the researchers tested it on seven benchmarks:
- Proprio Control Suite: low-dimensional control tasks
- Visual Control Suite: control tasks with high-dimensional images as inputs
- Atari 100k: 26 Atari games
- Atari 200M: 55 Atari games
- BSuite: RL behavior benchmark
- Crafter: survival video game
- DMLab: 3D environments
DreamerV3 achieved “strong” performance on all seven and set new state-of-the-art performance on Proprio Control Suite, Visual Control Suite, BSuite, and Crafter. The team also used DreamerV3 with default hyperparameters to train a model that is the first to “collect diamonds in Minecraft from scratch without using human data.” The researchers contrasted this with VPT, which was pre-trained on 70k hours of internet videos of human players.
Lead author Danijar Hafner answered several questions about the work on Twitter. In response to one user, he noted:
[T]he main point of the algorithm is that it works out of the box on new problems, without needing experts to fiddle with it. So it’s a big step towards optimizing real-world processes.
Although the source code for DreamerV3 has not been released, Hafner says it is “coming soon.” The code for the previous version, DreamerV2, is available on GitHub. Hafner notes that V3 includes “better replay buffers” and is implemented in JAX instead of TensorFlow.