Alibaba Releases Two Open-Weight Language Models for Math and Voice Chat

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Alibaba released two open-weight language model families: Qwen2-Math, a series of LLMs tuned for solving mathematical problems; and Qwen2-Audio, a family of multi-modal LLMs that can accept voice or text input. Both families are based on Alibaba’s Qwen2 LLM series, and all but the largest version of Qwen2-Math are available under the Apache 2.0 license.

Qwen2-Math is available in a base version and an instruction-tuned version, each with a choice of 1.5B, 7B, or 72B parameters. Because most benchmark datasets are available on the internet, Alibaba conducted decontamination on their training datasets to remove mathematical problem-solving benchmark examples. After pre-training, the instruction-tuned models were trained with both supervised fine-tuning and reinforcement learning. On the popular MATH benchmark, the largest model, Qwen2-Math-72B-Instruct, outperformed state-of-the-art commercial models including GPT-4o and Claude-3.5-Sonnet. According to Alibaba,

Given the current limitation of English-only support, we plan to release bilingual models that support both English and Chinese shortly, with the development of multilingual models also in the pipeline. Moreover, we will continue to enhance our models’ ability to solve complex and challenging mathematical problems.

Besides MATH, Alibaba evaluated Qwen2-Math on benchmarks and mathematics exams, such as GSM8K and AIME 2024. They found that Qwen2-Math-Instruct had better performance than other baseline models of comparable size, “particularly in the 1.5B and 7B models.” The 72B parameter version achieved a score of 86.4 on the Chinese-language math exam benchmark CMATH, which Alibaba claims is a new high score. They also claim that it outperformed Claude, GPT-4, and Gemini on the AIME 2024 exam.

Alibaba published a technical report with more details on Qwen2-Audio. The model accepts both text and audio input, but can only output text. Depending on the type of audio input provided, the model can operate in two modes, Voice Chat or Audio Analysis. In Voice Chat mode, the input is a user’s speech audio, and the model acts as a chatbot. In Audio Analysis mode, the model can answer questions about the content of audio input. For example, given a clip of music, the model can identify the tempo and key of the song.

Andrew Ng’s newsletter The Batch covered Alibaba’s release, saying:

Qwen2 delivered extraordinary performance with open weights, putting Alibaba on the map of [LLMs]. These specialized additions to the family push forward math performance and audio integration in AI while delivering state-of-the-art models into the hands of more developers. It’s thrilling to see models with open weights that outperform proprietary models. The white-hot competition between open and closed technology is good for everyone!

Users on Reddit discussed both model series. One user described Qwen2-Math-7B as “punching really high and hard for its size.” Another user said of Qwen2-Audio:

It would be very interesting to try to synthesize audio output using this model. The audio encoder is almost identical to WhisperSpeech one. Although Qwen2 is using Whisper-large-v3 which would probably require retraining of the WhisperSpeech acoustic model. If successful, that would be basically equivalent to GPT4o advanced voice mode running locally.

The model files for Qwen2-Math and Qwen2-Audio can be downloaded from Huggingface.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Apple Unveils Apple Foundation Models Powering Apple Intelligence

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Apple published the details of their new Apple Foundation Models (AFM), a family of large language models (LLM) that power several features in their Apple Intelligence suite. AFM comes in two sizes: a 3B parameter on-device version and a larger cloud-based version.

The smaller model, AFM-on-device, was created by pruning a 6.4B parameter model; the larger model, known as AFM-server, was trained “from scratch,” but Apple did not disclose its size. Apple did release details of both models’ development: both are based on the Transformer decoder-only architecture, pre-trained on 6.3T tokens of data. The models use pluggable task-specific LoRA adapters that are chosen at runtime to tailor model performance for specific tasks, such as proofreading or replying to email. Apple evaluated both models on several benchmarks, including instruction-following and mathematical reasoning, and found that they “compared favorably,” and in some cases outperformed, similar-sized models such Llama 3 or GPT-4. According to Apple:

Our models have been created with the purpose of helping users do everyday activities across their Apple products, and developed responsibly at every stage and guided by Apple’s core values. We look forward to sharing more information soon on our broader family of generative models, including language, diffusion, and coding models.

InfoQ recently covered Apple’s announcement of Apple Intelligence at their WWDC 2024 event. InfoQ also covered Swift Assist, a code generation model integrated with XCode, which Apple describes as being part of the same family of generative AI models as AFM.

The adapter architecture allows AFM to be modified “on-the-fly” for specific tasks. The adapters are “small neural network modules” that plug into the self-attention and feed-forward layers of the base model. They are created by fine-tuning the base model with task-specific datasets. The adapter parameters are quantized to low bit-rates to save memory; the on-device adapters consume on the order of 10 MB, making them suitable for small embedded devices.

Apple took several steps to ensure AFM produced safe output. In addition to ensuring that no user data was included in their pre-training set, Apple applied filtering to remove harmful content, spam, and PII. In the fine-tuning stage, Apple treated “safety alignment as one of the many core post-training tasks” and more than 10% of the fine-tuning data was safety-related. They also performed manual and automated “red-teaming” to identify and test model vulnerabilities.

Apple evaluated AFM’s performance on a variety of benchmarks and compared the results to several baseline models, including GPT-4, Llama 3, and Phi-3. In tests where human judges ranked the outputs of two models side-by-side, AFM-on-device outperformed larger models Gemma-7B and Mistral-7B. AFM-server achieved “competitive” results, with a win-rate of 52% against GPT-3.5.

Ruoming Pang, the lead author of Apple’s technical report on AFM, posted on X that

While these LMs are not chatbots, we trained them to have general purpose capabilities so that they can power a wide range of features including summarization, writing assistance, tool-use, and coding.

Several other users posted their thoughts about AFM on X. Huggingface engineer Vaibhav Srivastav summarized the report, calling it “quite feature packed” and saying he “quite enjoyed skimming through it.” LiquidAI Staff ML Scientist Maxime Labonne estimated that AFM-server might have ~70B parameters, but lamented that the paper had “almost no details” on this model’s size.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Google’s JEST Algorithm Automates AI Training Dataset Curation and Reduces Training Compute

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Google DeepMind recently published a new algorithm for curating AI training datasets: multimodal contrastive learning with joint example selection (JEST), which uses a pre-trained model to score the learnability of batches of data. Google’s experiments show that image-text models trained with JEST-curated data require 10x less computation than baseline methods.

JEST tries to solve the problem of curating training datasets; that is, filtering the dataset to choose the specific examples that will be most effective in training a model. However, because manually curating datasets is time-consuming, JEST automates the process by using a pre-trained reference model to select the best batches of samples based on their learnability score, which combines the loss from both the reference model and the learner model being trained. The goal is to find batches that have a high loss for the learner but a low one for the reference, which means that the data is both “unlearned and learnable.” According to Google,

[W]e find that central to the performance of our framework is the ability to steer the curation process towards the distribution of smaller, well-curated datasets…Crucially, we find this process [enables] strong data quality bootstrapping: a reference model trained on a small curated dataset can effectively guide the curation of a much larger dataset, allowing the training of a model which strongly surpasses the quality of the reference model on many downstream tasks.

JEST is applied during the training process. Given a large super-batch of training data, JEST selects chunks or sub-batches based iteratively by calculating their joint learnability conditioned on the sub-batches previously sampled. The research team found that this improves the quality of the batches, similar to the concept of hard negatives.

Because the learnability score is computed online during training, it imposes some additional compute cost. To address this, JEST uses model approximation for efficient scoring; for example, the vision component of the reference model can drop layers or image patches. The researchers also improved efficiency by training the learner at different image resolutions.

The DeepMind team ran several experiments to evaluate JEST. They first trained an image-text reference model on a curated dataset based on the Web Language Image (WebLI) dataset. They trained learner models using both JEST and compared to models trained using a baseline uniform batch selection. Models trained using JEST achieved the same benchmark performance as baseline models, while requiring 10x fewer training FLOPS.

In a discussion on Hacker News, several users praised DeepMind’s work. One wrote:

So the paper itself is pretty significant, I think, from looking at it. The general methodology seems to be: train small model as a discriminatory scoring model on very high quality data…This turns out to be significant FLOPs and quality win, even counting for the initial model training and scoring part of it…As always, appreciate the publishing from DeepMind – this looks like great work.

Another user pointed out that JEST was similar to another method called Cappy, which also uses a “pretrained small scorer.” Other related techniques include RHO-LOSS, which inspired JEST and is open-source. Google has not open-sourced JEST.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


OpenAI’s CriticGPT Catches Errors in Code Generated by ChatGPT

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

OpenAI recently published a paper about CriticGPT, a version of GPT-4 fine-tuned to critique code generated by ChatGPT. When compared with human evaluators, CriticGPT catches more bugs and produces better critiques. OpenAI plans to use CriticGPT to improve future versions of their models.

When originally developing ChatGPT, OpenAI used human “AI trainers” to rate the outputs of the model, creating a dataset that was used to fine-tune it using reinforcement learning from human feedback (RLHF). However, as AI models improve, and can now perform some tasks at the same level as human experts, it can be difficult for human judges to evaluate their output. CriticGPT is part of OpenAI’s effort on scalable oversight, which is intended to help solve this problem. OpenAI decided first to focus on helping ChatGPT improve its code-generating abilities. The researchers used CriticGPT to generate critiques of code; they also paid qualified human coders to do the same. In evaluations, AI trainers preferred CriticGPT’s critiques 80% of the time, showing that CriticGPT could be a good source for RLHF training data. According to OpenAI:

The need for scalable oversight, broadly construed as methods that can help humans to correctly evaluate model output, is stronger than ever. Whether or not RLHF maintains its dominant status as the primary means by which LLMs are post-trained into useful assistants, we will still need to answer the question of whether particular model outputs are trustworthy. Here we take a very direct approach: training models that help humans to evaluate models….It is…essential to find scalable methods that ensure that we reward the right behaviors in our AI systems even as they become much smarter than us. We find LLM critics to be a promising start.

Interestly, CriticGPT is also a version of GPT-4 that is fine-tuned with RLHF. In this case, the RLHF training data consisted of buggy code as the input, and a human-generated critique or explanation of the bug as the desired output. The buggy code was produced by having ChatGPT write code, then having a human contractor insert a bug and write the critique.

To evaluate CriticGPT, OpenAI used human judges to rank several critiques side-by-side; judges were shown outputs from CriticGPT and from baseline ChatGPT, as well as critiques generated by humans alone or by humans with CriticGPT assistance (“Human+CriticGPT”). The judges preferred CriticGPT’s output over that of ChatGPT and human critics. OpenAI also found that the Human+CriticGPT teams’ output was “substantially more comprehensive” than that of humans alone. However, it tended to have more “nitpicks.”

In a discussion about the work on Hacker News, one user wrote:

For those new to the field of AGI safety: this is an implementation of Paul Christiano’s alignment procedure proposal called Iterated Amplification from 6 years ago…It’s wonderful to see his idea coming to fruition! I’m honestly a bit skeptical of the idea myself (it’s like proposing to stabilize the stack of “turtles all the way down” by adding more turtles)…but every innovative idea is worth a try, in a field as time-critical and urgent as AGI safety.

Christiano formerly ran OpenAI’s language model alignment team. Other companies besides OpenAI are also working on scalable oversight. In particular, Anthropic has published research papers on the problem, such as their work on using a debate between LLMs to improve model truthfulness.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


OpenAI Publishes GPT Model Specification for Fine-Tuning Behavior

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

OpenAI recently published their Model Spec, a document that describes rules and objectives for the behavior of their GPT models. The spec is intended for use by data labelers and AI researchers when creating data for fine-tuning the models.

The Model Spec is based on existing internal documentation used by OpenAI in their reinforcement learning from human feedback (RLHF) training used to fine-tune recent generations of their GPT models. The Spec contains three types of principles: objectives, rules, and defaults. Objectives define broad descriptions of desirable model behavior: “benefit humanity.” Rules are more concrete, and address “high-stakes” situations that should never be overridden by users: “never do X.” Finally, the Spec includes default behaviors that, while they can be overridden, provide basic style guidance for responses and templates for handling conflicts. According to OpenAI

As a continuation of our work on collective alignment and model safety, we intend to use the Model Spec as guidelines for researchers and AI trainers who work on reinforcement learning from human feedback. We will also explore to what degree our models can learn directly from the Model Spec. We see this work as part of an ongoing public conversation about how models should behave, how desired model behavior is determined, and how best to engage the general public in these discussions.

In 2022, OpenAI introduced a fine-tuned version of GPT-3 called InstructGPT. The model was fine-tuned using RLHF on a dataset of ranked model outputs. The idea was to make the model more “aligned” with user intent and reduce false or toxic output. Since then, many research teams have done similar instruction-tuning on their LLMs. For example, Google’s Gemini model is also fine-tuned with RLHF. Meta’s Llama 3 is also instruction-tuned, but via a different fine-tuning method, direct preference optimization (DPO).

The key to instruction-tuning, however, is the dataset of prompt inputs with multiple outputs ranked by human labelers. Part of the purpose of the Model Spec is to guide the labelers in ranking outputs. OpenAI also claims to be working on methods for automating the instruction-tuning process directly from the Model Spec. Because of this, much of the content of the Model Spec are examples of user prompts along with “good” and “bad” responses.

Many of the rules and defaults in the Spec are intended to address common abuses of LLMs. For example, the rule to follow the chain of command is designed to help prevent the simple “jailbreak” of prompting the model to ignore previous instructions. Other specifications are intended to shape the responses of the model, especially when refusing to perform a task; according to the Spec, “refusals should be kept to a sentence and never be preachy.”

Wharton Professor and AI researcher Ethan Mollick posted about the Model Spec on X:

As people have pointed out in the comments, Anthropic has its Constitution. I find it to be much less weighty as a statement & less clarifying, since it outlines generally good stuff & tells the AI to be good, making it hard to understand the difficult choices between principles.

Anthropic introduced the idea of Constitutional AI in 2022. This process uses an AI model to rank outputs for instruction-tuning. Although Anthropic’s code is not open-source, the AI community HuggingFace published a reference implementation of Constitutional AI based on Anthropic’s work.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


OpenAI Announces New Flagship Model GPT-4o

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

OpenAI recently announced the latest version of their GPT AI foundation model, GPT-4o. GPT-4o is faster than the previous version of GPT-4 and has improved capabilities in handling speech, vision, and multilingual tasks, outperforming all models except Google’s Gemini on several benchmarks.

The “o” in GPT-4o stands for “omni,” reflecting the model’s multi-modal capabilities. While previous versions of ChatGPT supported voice input and output, this used a pipeline of models: a distinct speech-to-text to provide input to GPT-4, followed by a text-to-speech model to convert GPT-4’s text output to voice. The new model was trained end-to-end to handle audio, vision, and text, which reduces latency and gives GPT-4o access to more information from the input as well as control over the output. OpenAI evaluated the model on a range of benchmarks, including common LLM benchmarks as well as their own AI safety standards. The company has also performed “extensive external red teaming” on the model to discover potential risks in its new modalities. According to OpenAI: 

We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies. We will share further details addressing the full range of GPT-4o’s modalities in the forthcoming system card.

OpenAI gave a demo of the GPT-4o and its capabilites in their recent Spring Update livestream hosted by CTO Mira Murati. Murati announced that the new model will be rolled out to free users, along with access to features such as custom GPTs and the GPT Store which were formerly available only to paid users. She also announced that GPT-4o would be available via the OpenAI API and claimed the model was 2x faster than GPT-4 Turbo, with 5x higher rate limits.

In a Hacker News discussion about the release, one user noted: 

The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I’m not sure how much of that was that they had tested this over and over, but it is really hard to get that right so if they didn’t fake it in some way I’d say that is revolutionary.

OpenAI CEO Sam Altman echoed this sentiment in a blog post:

The new voice (and video) mode is the best computer interface I’ve ever used. It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change. The original ChatGPT showed a hint of what was possible with language interfaces; this new thing feels viscerally different. It is fast, smart, fun, natural, and helpful.

Along with GPT-4o, OpenAI released a new MacOs desktop app for ChatGPT. This app supports voice mode for conversing with the model, with the ability to screenshots to the discussion. OpenAI has also launched a simplified “look and feel” for the ChatGPT web interface.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


OpenAI Releases New Fine-Tuning API Features

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

OpenAI announced the release of new features in their fine-tuning API. The features will give model developers more control over the fine-tuning process and better insight into their model performance.

The updates include the ability to create a model checkpoint after every training epoch during fine-tuning, compute metrics over the entire validation dataset, and integrate with 3rd-parties such as Weights and Biases. Besides changes to the API, OpenAI also updated the fine-tuning dashboard, giving developers more control of training hyperparameters and jobs as well as better insight into metrics. The model playground now has a side-by-side model comparison feature that allows users to enter a single prompt and compare the output of different standard and fine-tuned models. Finally, OpenAI announced an update to their Custom Model program: assisted fine-tuning, where OpenAI’s team works with an organization to help fine-tune a model. According to OpenAI:

We believe that in the future, the vast majority of organizations will develop customized models that are personalized to their industry, business, or use case. With a variety of techniques available to build a custom model, organizations of all sizes can develop personalized models to realize more meaningful, specific impact from their AI implementations. The key is to clearly scope the use case, design and implement evaluation systems, choose the right techniques, and be prepared to iterate over time for the model to reach optimal performance.

Although foundation models such as GPT-3.5 and GPT-4 can perform well on a variety of tasks “out of the box,” a fine-tuned model can provide better performance on specific tasks, or can be made to “exhibit specific ingrained behavior patterns.” Further, since these models often require less verbose prompts, they can operate with lower cost and latency. InfoQ covered the initial launch of OpenAI’s fine-tuning API in 2023. Since then, OpenAI claims that it has been used to train “hundreds of thousands of models.”

OpenAI announced their Custom Model program at their 2023 Dev Day. In this program, “selected” organizations can work with OpenAI’s researchers to modify any step of the training process to produce a bespoke model for the organization “from scratch.” OpenAI claims that one customer in this program built a custom model that showed an “83% increase in factual responses.” The new service announced for the program doesn’t build a completely new model. Instead, it offers customers fine-tuning features not available in the API, including “bespoke parameters and methods to maximize model performance.”

In a Hacker News discussion about the release, one user pointed out:

Btw, if you’ve tried fine-tuning OpenAI models before January and came away unimpressed with the quality of the finished model, it’s worth trying again. They made some unannounced changes in the last few months that make the fine-tuned models much stronger. That said, we’ve found that Mixtral fine-tunes still typically outperform GPT-3.5 fine tunes, and are far cheaper to serve.

OpenAI’s YouTube channel includes a talk from the 2023 Dev Day that compares different performance-improving techniques, including fine-tuning and prompt engineering, given by the engineering lead of their Fine-Tuning Product. The OpenAI docs also offer suggestions on alternatives to fine-tuning, including prompt engineering and function calling.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Meta Unveils 24k GPU AI Infrastructure Design

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Meta recently announced the design of two new AI computing clusters, each containing 24,576 GPUs. The clusters are based on Meta’s Grand Teton hardware platform, and one cluster is currently used by Meta for training their next-generation Llama 3 model.

Meta designed the clusters to support their generative AI efforts. The two cluster variants differ in their networking fabric. The Llama 3 cluster uses remote direct memory access (RDMA) over converged Ethernet (RoCE) while the other uses NVIDIA’s Quantum2 InfiniBand. The storage layer is based on Meta’s custom-built Tectonic filesystem, which supports the synchronized I/O needed to handle checkpoints from thousands of GPUs. According to Meta,

These two AI training cluster designs are a part of our larger roadmap for the future of AI. By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100s as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.

Meta has a history of open-sourcing their hardware platform and rack designs. In 2021, InfoQ covered Meta’s ZionEX cluster. InfoQ covered the development of the Grand Teton platform and Meta’s open rack design in 2022. As part of that effort, Meta contributed their work to the Open Compute Project, which Meta founded in 2011. In late 2023, Meta and IBM launched the AI Alliance “to support open innovation and open science in AI.”

One of the big challenges Meta faced with the new clusters was the difficulty of debugging at that scale. Meta worked with Hammerspace to build interactive debugging tools for their storage system. Meta also worked on a “distributed collective flight recorder” for troubleshooting distributed training.

While developing the new clusters, Meta ran several simulations to predict their intern-node communication performance. However, “out of the box” the clusters did not perform as well as smaller, optimized clusters; bandwidth utilization during benchmarking was extremely variable. After tuning job schedulers and optimizing network routing in the cluster, this metric was consistently greater than 90%.

Meta also worked on their PyTorch framework implementation to better utilize the cluster hardware. For example, the H100 GPUs support 8-bit floating point operations which can be used to accelerate training. Meta also worked on parallelization algorithms and initialization bottlenecks, reducing init time from “sometimes hours down to minutes.”

In a Hacker News discussion about the Meta clusters, several users lamented that hardware costs can make it difficult to compete in the AI space with “hyper-scale” companies like Meta. AI developer Daniel Han-Chen remarked:

Another way to compete with the big tech incumbents is instead of hardware, try maths and software hacks to level the playing field! Training models is still black magic, so making it faster on the software side can solve the capital cost issue somewhat!

Besides Meta, other AI players have also released details of their large compute clusters. Google recently announced their AI Hypercomputer, based on their new Cloud TPU v5p accelerator hardware. Microsoft Azure’s Eagle supercomputer, which contains 14,400 NVIDIA H100 GPUs, recently placed third on the HPC Top500.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


RWKV Project Open-Sources LLM Eagle 7B

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

The RWKV Project recently open-sourced Eagle 7B, a 7.52B parameter large language model (LLM). Eagle 7B is trained on 1.1 trillion tokens of text in over 100 languages and outperforms other similarly-sized models on multilingual benchmarks.

Eagle 7B is based on the Receptance Weighted Key Value (RWKV) architecture, described as an attention-free Transformer that combines the benefits of both Transformers and recurrent neural networks (RNNs) while reducing their drawbacks; in particular, the model has no maximum input context length. The architecture has also been benchmarked as the most energy-efficient, measured by joules per token. Eagle 7B outperforms other 7B-parameter LLMs, including Mistral, Falcon, and Llama 2, on several multi-lingual benchmarks. The RWKV Project is supported by the Linux Foundation, and Eagle 7B carries the Apache 2.0 license, making it available for both personal and commercial use. According to the Project team:

RWKV opens a new route for scalable and efficient architectures to model complex relationships in sequential data. While many alternatives to Transformers have been proposed with similar claims, ours is the first to back up those claims with pretrained models with tens of billions of parameters.

Before Google published their work on Transformers, RNN-based models were the state-of-the-art solution for many AI applications, particularly in multilingual NLP domains such as translation. The Transformer was an attractive alternative, since training RNNs presents challenges, and their inherent serial nature makes them slower than Transformers. However, Transformers have their own drawbacks. In particular, their self-attention mechanism has quadratic complexity in both compute and storage, which limits their input context length.

To solve these problems, RWKV uses a variant of the Attention-Free Transformer (AFT), with a modification that allows the model to be formulated as an RNN. This formulation makes the model efficient during inference, when it is used for autoregressive generation. However, during training, many of the model’s matrix operations can be parallelized, as with a standard Transformer.

The RWKV architecture does have known limitations. While it does not have a maximum input context length, it may not perform as well as attention-based models on tasks that require “looking back” in a very long context. For the same reason, it also requires “carefully designed prompts,” as prompt information may be lost during inference.

In a discussion about Eagle 7B on Hacker News, one user touted its advantages:

These models don’t have a fixed context size and are progressively fine-tuned for longer and longer contexts. The context length also doesn’t impact inference cost. Another aspect of performance is not just how well does the trained model perform, but is it data efficient (performance per token trained)?

Lead RWKV developer Peng Bo posted about the model on X, showing its performance on what he called an “uncheatable” benchmark: calculating the model’s perplexity on new papers posted to arXiv:

Arxiv is the beginning. We can use latest news, github repos, arxiv paper, blog posts, new wiki entries, and more. The point is to benchmark LLMs on new data – although they can be polluted by ChatGPT too, it is still better than using very old (and actually noisy) evals.

The Eagle 7B code is available on GitHub, and the model weights on Huggingface.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Mistral AI’s Open-Source Mixtral 8x7B Outperforms GPT-3.5

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Mistral AI recently released Mixtral 8x7B, a sparse mixture of experts (SMoE) large language model (LLM). The model contains 46.7B total parameters, but performs inference at the same speed and cost as models one-third that size. On several LLM benchmarks, it outperformed both Llama 2 70B and GPT-3.5, the model powering ChatGPT.

Mistral 8x7B has a context length of 32k tokens and can accept the Spanish, French, Italian, German, and English language. Besides the base Mixtral 8x7B model, Mistral AI also released a model called Mixtral 8x7B Instruct, which is fine-tuned for instruction-following using direct preference optimisation (DPO). Both models’ weights are released under the Apache 2.0 license. Mistral AI also added support for the model to the vLLM open-source project. According to Mistral AI:

Mistral AI continues its mission to deliver the best open models to the developer community. Moving forward in AI requires taking new technological turns beyond reusing well-known architectures and training paradigms. Most importantly, it requires making the community benefit from original models to foster new inventions and usages.

Mixture of Experts (MoE) models are often used in LLMs as a way to increase model size while keeping training and inference time low. The idea dates back to 1991, and Google applied it to Transformer-based LLMs in 2021. In 2022, InfoQ covered Google’s Image-Text MoE model LIMoE, which outperformed CLIP. Later that year, InfoQ also covered Meta’s NLB-200 MoE translation model, which can translate between any of over 200 languages.

The key idea of MoE models is to replace the feed-forward layers of the Transformer block with a combination of a router plus a set of expert layers. During inference, the router in a Transformer block selects a subset of the experts to activate. In the Mixtral model, the output for that block is computed by applying the softmax function to the top two experts.

The fine-tuned version of the model, Mistral 8x7B Instruct, was trained using DPO, instead of the RLHF technique used to train ChatGPT. This method was developed by researchers at Stanford University and “matches or improves response quality” compared to RLHF, while being much simpler to implement. DPO uses the same dataset as RLHF, a set of paired responses with one ranked higher than the other, but doesn’t require creating a separate reward function for RL.

Mistral AI evaluated their models on benchmarks for several tasks, including code generation, reading comprehension, mathematics, reasoning, and knowledge. Mistral 8x7B outperformed Llama 2 70B on nine of twelve benchmarks. It also outperformed GPT-3.5 on five benchmarks. According to Mistral AI, Mistral 8x7B Instruct’s score on the MT-Bench chatbot benchmark makes it “the best open-weights model as of December 2023.” The LMSYS leaderboard currently ranks the model 7th, above GPT-3.5, Claude 2.1, and Gemini Pro.

In a discussion on Hacker News, several users pointed out that while all of the model’s 46.7B parameters need to be loaded into RAM, inference speed would be comparable to a 13B parameter model. One user said:

This can fit into a Macbook Pro with integrated memory. With all the recent development in the world of local LLMs I regret I settled for only 24Gb RAM on my laptop – but the 13B models work great.

The Mixtral 8x7B and Mixtral 8x7B Instruct models are available on HuggingFace. Mistral AI also offers a hosted version of the model behind their mistral-small API endpoint.

About the Author

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.