Author: Anthony Alford

Mistral AI recently released Mixtral 8x7B, a sparse mixture of experts (SMoE) large language model (LLM). The model contains 46.7B total parameters, but performs inference at the same speed and cost as models one-third that size. On several LLM benchmarks, it outperformed both Llama 2 70B and GPT-3.5, the model powering ChatGPT.
Mixtral 8x7B has a context length of 32k tokens and handles Spanish, French, Italian, German, and English. Besides the base Mixtral 8x7B model, Mistral AI also released Mixtral 8x7B Instruct, which is fine-tuned for instruction-following using direct preference optimization (DPO). Both models’ weights are released under the Apache 2.0 license. Mistral AI also added support for the model to the vLLM open-source project. According to Mistral AI:
Mistral AI continues its mission to deliver the best open models to the developer community. Moving forward in AI requires taking new technological turns beyond reusing well-known architectures and training paradigms. Most importantly, it requires making the community benefit from original models to foster new inventions and usages.
Mixture of Experts (MoE) models are often used in LLMs as a way to increase model size while keeping training and inference time low. The idea dates back to 1991, and Google applied it to Transformer-based LLMs in 2021. In 2022, InfoQ covered Google’s image-text MoE model LIMoE, which outperformed CLIP. Later that year, InfoQ also covered Meta’s NLLB-200 MoE translation model, which can translate between any of over 200 languages.
The key idea of MoE models is to replace the feed-forward layers of the Transformer block with a combination of a router plus a set of expert layers. During inference, the router in a Transformer block selects a subset of the experts to activate. In the Mixtral model, the router picks the top two experts for each token, and the block’s output is the weighted sum of those experts’ outputs, with weights given by a softmax over the router’s scores for the selected experts.
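To make the routing concrete, below is a minimal PyTorch sketch of a top-two MoE feed-forward block; the class name, layer sizes, and simple looping implementation are illustrative assumptions rather than Mixtral’s actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sparse MoE feed-forward block with top-2 routing, in the spirit of
# Mixtral; sizes and structure are assumptions, not the real implementation.
class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (n_tokens, d_model)
        logits = self.router(x)                # one routing score per expert
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # softmax over the selected experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

With eight experts and top-two routing, each token passes through only two expert feed-forward networks, which is why inference cost tracks the active parameter count rather than the full 46.7B total.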
The fine-tuned version of the model, Mixtral 8x7B Instruct, was trained using DPO, instead of the RLHF technique used to train ChatGPT. This method was developed by researchers at Stanford University and “matches or improves response quality” compared to RLHF, while being much simpler to implement. DPO uses the same dataset as RLHF, a set of paired responses with one ranked higher than the other, but doesn’t require creating a separate reward function for RL.
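As a rough illustration of what this looks like in practice, the sketch below computes a DPO-style loss directly from the log-probabilities of the preferred and rejected responses; the function name, argument names, and beta value are assumptions for illustration, not Mistral AI’s training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style objective: inputs are sequence log-probabilities of the preferred
    ("chosen") and dispreferred ("rejected") responses under the policy being trained
    and under a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi_theta / pi_ref for the winner
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi_theta / pi_ref for the loser
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Because the preference signal enters the loss directly, no separate reward model or reinforcement-learning loop is required, which is the simplification described in the DPO paper.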
Mistral AI evaluated their models on benchmarks for several tasks, including code generation, reading comprehension, mathematics, reasoning, and knowledge. Mixtral 8x7B outperformed Llama 2 70B on nine of twelve benchmarks. It also outperformed GPT-3.5 on five benchmarks. According to Mistral AI, Mixtral 8x7B Instruct’s score on the MT-Bench chatbot benchmark makes it “the best open-weights model as of December 2023.” The LMSYS leaderboard currently ranks the model 7th, above GPT-3.5, Claude 2.1, and Gemini Pro.
In a discussion on Hacker News, several users pointed out that while all of the model’s 46.7B parameters need to be loaded into RAM, inference speed would be comparable to a 13B parameter model. One user said:
This can fit into a Macbook Pro with integrated memory. With all the recent development in the world of local LLMs I regret I settled for only 24Gb RAM on my laptop – but the 13B models work great.
The Mixtral 8x7B and Mixtral 8x7B Instruct models are available on Hugging Face. Mistral AI also offers a hosted version of the model behind their mistral-small API endpoint.
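For readers who want to try the model locally, the following is a minimal sketch of loading the instruct variant with the Hugging Face transformers library, assuming a recent transformers release with Mixtral support and sufficient GPU memory or offloading configured:

```python
# Minimal sketch of running Mixtral 8x7B Instruct via transformers; requires the
# accelerate package for device_map="auto" and substantial memory for the full model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```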

Google Research recently published their work on VideoPoet, a large language model (LLM) that can generate video. VideoPoet was trained on 2 trillion tokens of text, audio, image, and video data, and in evaluations by human judges its output was preferred over that of other models.
Unlike many image and video generation AI systems that use diffusion models, VideoPoet uses a Transformer architecture trained across multiple modalities, handling different input and output types through modality-specific tokenizers. After training, VideoPoet can perform a variety of zero-shot generative tasks, including text-to-video, image-to-video, video inpainting, and video style transfer. When evaluated on a variety of benchmarks, VideoPoet achieves “competitive” performance compared to state-of-the-art baselines. According to Google,
Through VideoPoet, we have demonstrated LLMs’ highly-competitive video generation quality across a wide variety of tasks, especially in producing interesting and high quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support “any-to-any” generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.
Although OpenAI’s ground-breaking DALL-E model was an early example of using Transformers or LLMs to generate images from text prompts, diffusion models such as Imagen and Stable Diffusion soon became the standard architecture for generating images. More recently, researchers have trained diffusion models to generate short videos; for example, Meta’s Emu and Stability AI’s Stable Video Diffusion, which InfoQ covered in 2023.
With VideoPoet, Google returns to the Transformer architecture, citing the advantage of re-using infrastructure and optimizations developed for LLMs. The architecture also supports multiple modalities and tasks, in contrast to diffusion models, which according to Google require “architectural changes and adapter modules” to perform different tasks.
The key to VideoPoet’s support for multiple modalities is a set of tokenizers. The Google team used a video tokenizer called MAGVIT-v2 and an audio tokenizer called SoundStream; for text they used T5’s pre-trained text embeddings. From there, the model uses a decoder-only autoregressive Transformer to generate a sequence of tokens, which the tokenizers can then convert back into audio and video streams.
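To make the token-based pipeline concrete, here is a toy, self-contained PyTorch sketch of the generation loop described above; the vocabulary size, dimensions, and tiny backbone are illustrative assumptions and are unrelated to Google’s actual implementation, which pairs MAGVIT-v2 and SoundStream tokenizers with a much larger decoder.

```python
import torch
import torch.nn as nn

# Toy sketch: text tokens condition a decoder-only autoregressive Transformer, which
# emits discrete "video" tokens that a tokenizer decoder would turn back into frames.
VOCAB, D, N_TEXT, N_VIDEO = 1024, 256, 16, 64

embed = nn.Embedding(VOCAB, D)
layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)   # used decoder-only via a causal mask
to_logits = nn.Linear(D, VOCAB)

tokens = torch.randint(0, VOCAB, (1, N_TEXT))           # stand-in for tokenized text prompt
for _ in range(N_VIDEO):                                # autoregressive generation loop
    causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
    h = backbone(embed(tokens), mask=causal_mask)
    next_token = to_logits(h[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)

video_tokens = tokens[:, N_TEXT:]   # these would be handed to a MAGVIT-v2-style decoder
```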
VideoPoet was trained to perform eight different tasks: unconditioned video generation, text-to-video, video prediction, image-to-video, video inpainting, video stylization, audio-to-video, and video-to-audio. The model was trained on 2 trillion tokens, from a mix of 1 billion image-text pairs and 270 million videos.
The research team also discovered that the model exhibits several emergent capabilities when operations are chained together: for example, VideoPoet can use image-to-video to animate a single image and then apply stylization to add visual effects. It can also generate long-form video, maintain consistent 3D structure, and apply camera motion from text prompts.
In a Hacker News discussion about VideoPoet, one user wrote:
The results look very impressive. The prompting however, is a bit weird – there’s suspiciously many samples with an “8k”-suffix, presumably to get more photorealistic results? I really don’t like that kind of stuff, when prompting becomes more like reciting sacred incantations instead of actual descriptions of what you want.
The VideoPoet demo site contains several examples of the model’s output, including a one-minute video short-story.

OpenAI recently published a beta version of their Preparedness Framework for mitigating AI risks. The framework lists four risk categories with definitions of risk levels for each, and defines OpenAI’s safety governance procedures.
The Preparedness Framework is part of OpenAI’s overall safety effort, and is particularly concerned with frontier risks from cutting-edge models. The core technical work in evaluating the models is handled by a dedicated Preparedness team, which assesses a model’s risk level in four categories: persuasion, cybersecurity, CBRN (chemical, biological, radiological, nuclear), and model autonomy. The framework defines risk thresholds for deciding if a model is safe for further development or deployment. The framework also defines an operational structure and process for preparedness, which includes a Safety Advisory Group (SAG) that is responsible for evaluating the evidence of potential risk and recommending risk mitigations. According to OpenAI:
We are investing in the design and execution of rigorous capability evaluations and forecasting to better detect emerging risks. In particular, we want to move the discussions of risks beyond hypothetical scenarios to concrete measurements and data-driven predictions. We also want to look beyond what’s happening today to anticipate what’s ahead…We learn from real-world deployment and use the lessons to mitigate emerging risks. For safety work to keep pace with the innovation ahead, we cannot simply do less, we need to continue learning through iterative deployment.
The framework document provides detailed definitions for the four risk levels (low, medium, high, and critical) in the four tracked categories. For example, a model with medium risk level for cybersecurity could “[increase] the productivity of operators…on key cyber operation tasks, such as developing a known exploit into an attack.” OpenAI plans to create a suite of evaluations to automatically assess a model’s risk level, both before and after any mitigations are applied. While the details of these have not been published, the framework contains illustrative examples, such as “participants in a hacking challenge…obtain a higher score from using ChatGPT.”
The governance procedures defined in the framework include safety baselines based on a model’s pre- and post-mitigation risk levels. Models with a pre-mitigation risk of high or critical will trigger OpenAI to “harden” their security; for example, by deploying the model only into a restricted environment. Models with a post-mitigation risk of high or critical will not be deployed, and models with a post-mitigation score of critical will not be developed further. The governance procedures also state that while OpenAI leadership is by default the decision maker with regard to safety, the Board of Directors has the right to reverse its decisions.
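The gating rules described above can be summarized in a short sketch; the enum and function below are one illustration of the published logic, not OpenAI code.

```python
# Illustrative encoding of the Preparedness Framework's deployment/development gates.
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def preparedness_decision(pre_mitigation: Risk, post_mitigation: Risk) -> dict:
    return {
        # High or critical pre-mitigation risk triggers hardened security measures.
        "harden_security": pre_mitigation >= Risk.HIGH,
        # Only models at medium post-mitigation risk or below may be deployed.
        "may_deploy": post_mitigation <= Risk.MEDIUM,
        # Models at critical post-mitigation risk are not developed further.
        "may_continue_development": post_mitigation < Risk.CRITICAL,
    }

print(preparedness_decision(Risk.HIGH, Risk.MEDIUM))
# {'harden_security': True, 'may_deploy': True, 'may_continue_development': True}
```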
In a Hacker News discussion about the framework, one user commented:
I feel like the real danger of AI is that models will be used by humans to make decisions about other humans without human accountability. This will enable new kinds of systematic abuse without people in the loop, and mostly underprivileged groups will be victims because they will lack the resources to respond effectively. I didn’t see this risk addressed anywhere in their safety model.
Other AI companies have also published procedures for evaluating and mitigating AI risk. Earlier this year, Anthropic published their Responsible Scaling Policy (RSP), which includes a framework of AI Safety Levels (ASL) modeled after the Centers for Disease Control and Prevention’s biosafety level (BSL) protocols. In this framework, most LLMs, including Anthropic’s Claude, “appear to be ASL-2.” Google DeepMind recently published a framework for classifying AGI models, which includes a list of six autonomy levels and possible associated risks.

Alphabet’s autonomous taxi company Waymo recently published a report showing its autonomous driver software outperforms human drivers on several benchmarks. The analysis covers over seven million miles of driving with no human behind the wheel, over which Waymo cars had an 85% reduction in crashes involving an injury compared to human drivers.
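As a quick sanity check on how these figures relate, an 85% reduction in a per-mile crash rate is the same comparison as the rate being roughly 6.8 times lower, the phrasing used in the next paragraph:

```python
# An 85% reduction and a "6.8 times lower" rate describe the same comparison.
ratio = 6.8
print(f"{1 - 1 / ratio:.0%} reduction")  # prints "85% reduction"
```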
Waymo compared their crash data, which the National Highway Traffic Safety Administration (NHTSA) mandates that all automated driving systems (ADS) operators report, to a variety of human-driver benchmarks, including police reports and insurance claims data. These benchmarks were grouped into two main categories: police-reported crashes and any-injury-reported crashes. Because accident rates vary by location, Waymo restricted their comparison to human benchmarks for their two major operating areas: San Francisco, CA, and Phoenix, AZ. Overall, Waymo’s driverless cars had a 6.8 times lower any-injury-reported crash rate and a 2.3 times lower police-reported crash rate, per million miles traveled. According to Waymo,
Our approach goes beyond safety metrics alone. Good driving behavior matters, too — driving respectfully around other road users in reliable, predictable ways, not causing unnecessary traffic or confusion. We’re working hard to continuously improve our driving behavior across the board. Through these studies, our goal is to provide the latest results on our safety performance to the general public, enhance transparency in the AV industry, and enable the community of researchers, regulators, and academics studying AV safety to advance the field.
Waymo published a separate research paper detailing their work on creating benchmarks for comparing ADS to human drivers, including their attempts to correct biases in the data, such as under-reporting; for example, drivers involved in a “fender-bender” mutually agreeing to forgo reporting the incident. They also recently released the results of a study done by the reinsurance company Swiss Re Group, which compared Waymo’s ADS to a human baseline and found that Waymo “reduced the frequency of property damage claims by 76%” per million miles driven compared to human drivers.
Waymo currently operates their autonomous vehicles in three locations: Phoenix, San Francisco, and Los Angeles. Because Waymo only recently began operating in Los Angeles, they have accumulated only 46k miles there, and while they have had no crashes, the low mileage means the benchmark comparisons lack statistical significance. Waymo’s San Francisco data showed the largest improvement relative to humans: Waymo’s absolute incident rate per million miles was lower there than its overall rate, while human drivers in San Francisco were worse than the national average.
AI journalist Timothy B. Lee posted his remarks about the study on X:
I’m worried that the cowboy behavior of certain other companies will give people the impression that AV technology in general is unsafe. When the reality is that the leading AV company, Waymo, has been steadily building a sterling safety record.
In a discussion about the report on Hacker News, one user noted that in many incidents, the Waymo vehicle was hit from behind, meaning that the crash was the other driver’s fault. Another user replied:
Yes, but…there is something else to be said here. One of the things we have evolved to do, without necessarily appreciating it, is to intuit the behavior of other humans through the theory-of-mind. If [autonomous vehicles consistently] act “unexpectedly”, this injects a lot more uncertainty into the system, especially when interacting with other humans.
The raw data for all autonomous vehicle crashes in the United States is available on the NHTSA website.