AWS Adds New Code Generation Models to Amazon SageMaker JumpStart

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

AWS recently announced the availability of two new foundation models in Amazon SageMaker JumpStart: Code Llama and Mistral 7B. These models can be deployed with one click to provide AWS users with private inference endpoints for code generation tasks.

Code Llama is a fine-tuned version of Meta’s Llama 2 foundation model and carries the same license. It is available in three variants (base, Python, and Instruct), each in three model sizes (7B, 13B, and 34B parameters), for a total of nine options. Besides code generation, it can also perform code infilling, and the Instruct models can follow natural language instructions in a chat format. Mistral 7B is a seven-billion-parameter large language model (LLM) available under the Apache 2.0 license. There are two variants of Mistral 7B: base and Instruct. In addition to code generation, with performance that “approaches” that of Code Llama 7B, Mistral 7B is also a general-purpose text generation model that outperforms the larger Llama 2 13B foundation model on all NLP benchmarks. According to AWS:

Today, we are excited to announce Code Llama foundation models, developed by Meta, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. Code Llama is a state-of-the-art large language model (LLM) capable of generating code and natural language about code from both code and natural language prompts. Code Llama is free for research and commercial use. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.
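As an illustration of that workflow, below is a minimal sketch of deploying one of these models with the SageMaker Python SDK; the model ID string, the payload schema, and the EULA custom attribute are assumptions that should be checked against the current JumpStart catalog and documentation. The same pattern applies to the Mistral 7B models with the corresponding model ID.

# Rough sketch, not the official AWS sample.
from sagemaker.jumpstart.model import JumpStartModel

# Hypothetical JumpStart model ID for the 7B Code Llama base model.
model = JumpStartModel(model_id="meta-textgeneration-llama-codellama-7b")
predictor = model.deploy()  # provisions a private real-time inference endpoint

payload = {
    "inputs": "import socket\n\ndef ping_exponential_backoff(host: str):",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
}
# Meta models on JumpStart require accepting an end-user license agreement;
# the exact mechanism may vary by SDK version (assumption).
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)

# Clean up when finished to stop incurring endpoint charges.
predictor.delete_model()
predictor.delete_endpoint()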

The Mistral 7B models support a context length of up to 8k tokens. This long context can be used for few-shot in-context learning in tasks such as question answering, or for maintaining a chat history. The Instruct variants support a special format for multi-turn prompting:

[INST] {user_prompt_0} [/INST] {assistant_response_0} [INST] {user_prompt_1} [/INST]
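The template above can be assembled programmatically. Below is a minimal helper sketch; it ignores the begin- and end-of-sequence tokens that the tokenizer normally adds, so treat it as illustrative rather than the canonical Mistral format.

def build_mistral_prompt(turns):
    # turns: list of (user_prompt, assistant_response) pairs; pass None as the
    # response of the final pair to request the next completion.
    parts = []
    for user, assistant in turns:
        parts.append(f"[INST] {user} [/INST]")
        if assistant is not None:
            parts.append(f" {assistant} ")
    return "".join(parts)

print(build_mistral_prompt([
    ("Write a Python function that reverses a string.",
     "def reverse(s):\n    return s[::-1]"),
    ("Now add a docstring.", None),
]))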

The three Code Llama model sizes support different context lengths: 10k, 32k, and 48k tokens, respectively; however, the 7B models only support 10k tokens when deployed on ml.g5.2xlarge instance types. All models can perform code generation, but only the 7B and 13B models can perform code infilling. This task prompts the model with a code prefix and a code suffix, and the model generates code to put between them. There are special input tokens, <PRE>, <SUF>, and <MID>, to mark their locations in the prompt. The model can accept these pieces in one of two different orderings: suffix-prefix-middle (SPM) and prefix-suffix-middle (PSM). Meta’s paper on Code Llama recommends using PSM when “the prefix does not end in whitespace or a token boundary.” The PSM format is:

<PRE> {prefix_code} <SUF> {suffix_code} <MID>
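A small helper can build this infilling prompt from the two code fragments; the sentinel token spellings below follow the template above and may differ slightly from the exact strings expected by a given tokenizer, so treat this as a sketch.

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def psm_infill_prompt(prefix_code, suffix_code):
    # The model's completion is the code that belongs between prefix and suffix.
    return f"{PRE} {prefix_code} {SUF} {suffix_code} {MID}"

prompt = psm_infill_prompt(
    prefix_code='def remove_non_ascii(s: str) -> str:\n    """',
    suffix_code="\n    return result\n",
)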

The Instruct version of Code Llama is designed for chat-like interaction and, according to Meta, “significantly improves performance” on several NLP benchmarks at a “moderate cost” to code generation performance. An example application of this model is to generate and explain code-based solutions to problems posed in natural language; for example, how to use Bash commands for certain tasks. Code Llama Instruct uses a special prompt format similar to that of Mistral 7B, with the option of a “system” prompt:

[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt_0} [/INST] {assistant_response_0} [INST] {user_prompt_1} [/INST]
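A small helper for this template might look like the following sketch; as in the Mistral example above, begin- and end-of-sequence tokens are left to the tokenizer.

def codellama_instruct_prompt(user_prompt, system_prompt=None):
    sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    return f"[INST] {sys_block}{user_prompt} [/INST]"

print(codellama_instruct_prompt(
    "In Bash, how do I list all text files modified in the last month?",
    system_prompt="Answer with a single command followed by a short explanation.",
))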

The Code Llama announcement says that the models are available in the US East (N. Virginia), US West (Oregon) and Europe (Ireland) regions. AWS has not announced the regions where Mistral is available.


Google Open-Sources AI Fine-Tuning Method Distilling Step-by-Step

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

A team from the University of Washington and Google Research recently open-sourced Distilling Step-by-Step, a technique for fine-tuning smaller language models. Distilling Step-by-Step requires less training data than standard fine-tuning and results in smaller models that can outperform few-shot prompted large language models (LLMs) that have 700x the parameters.

Although LLMs can often perform well on a wide range of tasks with few-shot prompting, hosting the models is challenging due to their memory and compute requirements. Smaller models can also perform well when fine-tuned, but that requires a manually created task-specific dataset. The key idea of Distilling Step-by-Step is to use an LLM to automatically generate a small fine-tuning dataset that contains both an input with an output label and a “rationale” for why that label was chosen. The fine-tuning process trains the small model to both predict the output label and generate the rationale. When evaluated on NLP benchmarks, the small fine-tuned models outperformed the 540B-parameter PaLM model while requiring only 80% of the benchmark’s fine-tuning data. According to Google:

We show that distilling step-by-step reduces both the training dataset required to curate task-specific smaller models and the model size required to achieve, and even surpass, a few-shot prompted LLM’s performance. Overall, distilling step-by-step presents a resource-efficient paradigm that tackles the trade-off between model size and training data required.

Research has shown that increasing the number of parameters in an LLM can improve its performance, with current state-of-the-art models such as PaLM having hundreds of billions of parameters. However, these large models are expensive and difficult to use at inference time, as they require multiple parallel GPUs simply to hold the parameters in memory. Recent efforts have produced smaller models, such as Meta’s Llama 2, that can perform nearly as well with an order of magnitude fewer parameters; however, these models are still quite large and compute-intensive.

One way to get a smaller model that performs well on a certain task is to fine-tune a smaller language model with a task-specific dataset. While this dataset might be relatively small, on the order of thousands of examples, it may still be costly and time-consuming to collect. Another option is knowledge distillation, where a large model is used as a teacher for a smaller model. InfoQ recently covered such a technique developed by Google that uses a PaLM LLM to create training datasets, producing fine-tuned models that performed comparably to LLMs that were 10x larger.

Distilling Step-by-Step does require a fine-tuning dataset, but it reduces the amount of data needed to create a high-performing model. The source dataset is fed to a PaLM LLM via a chain-of-thought prompt that asks the model to give the rationale for its answer. The result is a modified fine-tuning dataset that contains the original input and answer as well as the rationale. The smaller target model is fine-tuned to perform two tasks: answer the original question and generate a rationale.
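As a rough sketch of that data flow, the helpers below build the teacher prompt and the two student training targets; the prompt wording and the task prefixes are illustrative assumptions, not the paper’s exact strings.

def rationale_prompt(question, answer_options=None):
    # Chain-of-thought prompt sent to the teacher LLM (e.g. PaLM) to elicit a rationale.
    prompt = f"Q: {question}\n"
    if answer_options:
        prompt += "Options: " + ", ".join(answer_options) + "\n"
    prompt += "A: Let's think step by step."
    return prompt

def to_student_examples(question, label, rationale):
    # One teacher-annotated record becomes two targets for the small model:
    # predict the label, and separately generate the rationale.
    return [
        {"input": f"[label] {question}", "target": label},
        {"input": f"[rationale] {question}", "target": rationale},
    ]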

Google evaluated their technique using four NLP benchmarks, each of which contains a fine-tuning dataset. They used Distilling Step-by-Step to modify these datasets and fine-tune T5 models with fewer than 1B parameters. They found that their models could outperform baseline fine-tuned models while using only a fraction of the dataset; as little as 12.5% in some cases. They also found that their 770M parameter model outperformed the 700x larger 540B parameter PaLM on the ANLI benchmark, while needing only 80% of the fine-tuning dataset.

In a discussion about the work on X (formerly Twitter), AI entrepreneur Otto von Zastrow wrote:

These results are very strong. I would call it synthetic data generation, not distillation, and I am really curious to see what happens if you train the original LLM on this synthetic rationale per sample question.

The Distilling Step-by-Step source code and training dataset are available on GitHub. Google Cloud’s Vertex AI platform also offers a private preview of the algorithm.


Google DeepMind Announces LLM-Based Robot Controller RT-2

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Google DeepMind recently announced Robotics Transformer 2 (RT-2), a vision-language-action (VLA) AI model for controlling robots. RT-2 uses a fine-tuned LLM to output motion control commands. It can perform tasks not explicitly included in its training data and improves on baseline models by up to 3x on emergent skill evaluations.

DeepMind trained two variants of RT-2, using two different underlying visual-LLM foundation models: a 12B parameter version based on PaLM-E and a 55B parameter one based on PaLI-X. The LLM is co-fine-tuned on a mix of general vision-language datasets and robot-specific data. The model learns to output a vector of robot motion commands, which is treated as simply a string of integers: in effect, it is a new language that the model learns. The final model is able to accept an image of the robot’s workspace and a user command such as “pick up the bag about to fall off the table,” and from that generate motion commands to perform the task. According to DeepMind,

Not only does RT-2 show how advances in AI are cascading rapidly into robotics, it shows enormous promise for more general-purpose robots. While there is still a tremendous amount of work to be done to enable helpful robots in human-centered environments, RT-2 shows us an exciting future for robotics just within grasp.
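The “string of integers” action representation can be illustrated with a small discretization sketch; the bin count, action dimensions, and value ranges below are assumptions for illustration, not RT-2’s actual encoding.

import numpy as np

def tokenize_action(action, low, high, bins=256):
    # Map a continuous action vector onto integer bins, then render it as a
    # space-separated string the language model can emit as ordinary text.
    action = np.clip(action, low, high)
    ids = np.round((action - low) / (high - low) * (bins - 1)).astype(int)
    return " ".join(str(i) for i in ids)

# Hypothetical action layout: [dx, dy, dz, droll, dpitch, dyaw, gripper]
low = np.array([-0.1] * 6 + [0.0])
high = np.array([0.1] * 6 + [1.0])
print(tokenize_action(np.array([0.02, -0.05, 0.0, 0.0, 0.1, 0.0, 1.0]), low, high))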

Google Robotics and DeepMind have published several systems that use LLMs for robot control. In 2022, InfoQ covered Google’s SayCan, which uses an LLM to generate a high-level action plan for a robot, and Code-as-Policies, which uses an LLM to generate Python code for executing robot control. Both of these use a text-only LLM to process user input, with the vision component handled by separate robot modules. Earlier this year, InfoQ covered Google’s PaLM-E which handles multimodal input data from robotic sensors and outputs a series of high-level action steps.

RT-2 builds on a previous implementation, RT-1. The key idea of the RT series is to train a model to directly output robot commands, in contrast to previous efforts which output higher-level abstractions of motion. Both RT-2 and RT-1 accept as input an image and a text description of a task. However, while RT-1 used a pipeline of distinct vision modules to generate visual tokens to input to an LLM, RT-2 uses a single vision-language model such as PaLM-E.

DeepMind evaluated RT-2 on over 6,000 trials. In particular, the researchers were interested in its emergent capabilities: that is, its ability to perform tasks not present in the robot-specific training data but that emerge from its vision-language pre-training. The team tested RT-2 on three task categories: symbol understanding, reasoning, and human recognition. Compared to the best baseline, RT-2 achieved “more than 3x average success rate.” However, the model did not acquire any physical skills that were not included in the robot training data.

In a Hacker News discussion about the work, one user commented:

It does seem like this work (and a lot of robot learning works) are still stuck on position/velocity control and not impedance control. Which is essentially output where to go, either closed-loop with a controller or open-loop with a motion planner. This seems to dramatically lower the data requirement but it feels like a fundamental limit to what task we can accomplish. The reason robot manipulation is hard is because we need to take into account not just what’s happening in the world but also how our interaction alters it and how we need to react to that.

Although RT-2 has not been open-sourced, the code and data for RT-1 have been.


Stability AI Releases Generative Audio Model Stable Audio

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Harmonai, the audio research lab of Stability AI, has released Stable Audio, a diffusion model for text-controlled audio generation. Stable Audio is trained on 19,500 hours of audio data and can generate 44.1 kHz audio in real time using a single NVIDIA A100 GPU.

Similar to Stability AI’s image generation model Stable Diffusion, Stable Audio takes as input a user’s text prompt describing the desired output. As with Stable Diffusion, a U-Net-based diffusion model forms the core of the system. Besides the text prompt, users can also specify the desired output length in seconds. The model can generate the sound of single instruments, a full ensemble, or more ambient sounds such as crowd noise. According to Stability AI:

Stable Audio represents the cutting-edge audio generation research by Stability AI’s generative audio research lab, Harmonai. We continue to improve our model architectures, datasets, and training procedures to improve output quality, controllability, inference speed, and output length.

Recent advancements in generative AI for text and images have also spurred the development of music-generation models. OpenAI’s MuseNet, which is based on GPT-2, generates a sequence of MIDI notes that can be converted to sound using MIDI synthesizer software. This year, InfoQ covered Google’s MusicLM and Meta’s MusicGen models, which operate similarly to autoregressive language models but output “audio tokens” instead of text tokens. In 2022, several diffusion-based music generation models appeared, including Dance Diffusion, an earlier effort by Harmonai; and Riffusion, a project that uses a fine-tuned version of Stable Diffusion to generate spectrogram images which are converted to sound using classic digital signal processing techniques.

Stable Audio uses a pre-trained model called CLAP to map the user’s text prompt into an embedding space that is shared with musical features, similar to the way OpenAI’s CLIP is used in Stable Diffusion. These feature vectors, as well as embeddings for the desired output length and a noise vector, are fed into the 970M parameter denoising U-Net model, which is based on a system called Moûsai. This outputs a latent-space representation of the generated sound, which is then converted to audio via a variational autoencoder (VAE) called Descript Audio Codec.
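That flow can be sketched as a generic conditional latent-diffusion sampler. Every module below is a small stand-in for the pretrained CLAP encoder, U-Net, and Descript Audio Codec decoder, and the DDIM-style update is illustrative rather than Stable Audio’s actual scheduler.

import torch

# Stand-in modules with illustrative shapes; the real components are large pretrained networks.
clap_encoder = torch.nn.Linear(128, 512)     # text prompt features -> conditioning embedding
timing_embed = torch.nn.Linear(1, 512)       # requested length in seconds -> embedding
unet = torch.nn.Linear(512 + 1024, 512)      # predicts noise from latent plus conditioning
vae_decode = torch.nn.Linear(512, 4410)      # latent -> a short chunk of 44.1 kHz samples

def sample(prompt_features, seconds, steps=50):
    cond = torch.cat([clap_encoder(prompt_features),
                      timing_embed(torch.tensor([[seconds]]))], dim=-1)
    alphas = torch.linspace(0.999, 0.01, steps)   # illustrative noise schedule
    x = torch.randn(1, 512)                       # start from pure noise in latent space
    for i in range(steps - 1):
        eps = unet(torch.cat([x, cond], dim=-1))  # predicted noise
        a_t, a_prev = alphas[i], alphas[i + 1]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # estimate of the clean latent
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic DDIM-style step
    return vae_decode(x)                          # decode the latent to waveform samples

audio = sample(torch.randn(1, 128), seconds=30.0)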

Several users on X (formerly Twitter) commented on the release of Stable Audio. Ezra Sandzer-Bell, founder of AudioCipher, linked to a:

Detailed guide on how to use Stable Audio, including text-to-music prompting tips…. We’ve identified and summarized some of the most important Terms of Service, to help you stay out of trouble.

Stability AI CEO Emad Mostaque wrote:

This is the first commercially licensed music model and platform, amazing work by team. This is still in the experimental phase but expect it to advance rapidly so you can create any audio you can imagine, plus integrate your own data and more.

Although Stable Audio is not currently open source, Harmonai says they will release “open-source models based on Stable Audio” as well as code for training custom models. The Harmonai GitHub account contains a fork of the Moûsai repository. The Stable Audio website allows users to sign up for a free tier, which gives them up to 20 generations per month with a non-commercial use restriction.


OpenAI Announces ChatGPT Voice and Image Features

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

OpenAI recently announced new voice and image features for ChatGPT. A new backend model, GPT-4V, will handle image inputs, and an updated DALL-E model will be integrated to generate images. In addition, users of the mobile ChatGPT app will be able to hold voice conversations with the chatbot.

OpenAI announced that the newest version of their image generation AI, DALL-E 3, was in “research preview” and would be available for users of ChatGPT Plus and Enterprise in the coming month. Its integration with ChatGPT means that users can more easily create prompts with help from the chatbot. The ability to understand image input is supported by a multimodal version of the underlying GPT model called GPT-4 Vision (GPT-4V). The voice feature uses OpenAI’s Whisper automatic speech recognition (ASR) model to handle user voice input, and a new text-to-speech (TTS) model will convert ChatGPT’s text output into the user’s choice of five available voices. OpenAI is deploying the new features gradually, citing safety concerns, and has conducted beta testing and “red teaming” to explore and mitigate risks. According to OpenAI:

Large multimodal models introduce different limitations and expand the risk surface compared to text-based language models. GPT-4V possesses the limitations and capabilities of each modality (text and vision), while at the same time presenting novel capabilities emerging from the intersection of said modalities and from the intelligence and reasoning afforded by large scale models.

OpenAI published a paper describing their testing efforts with GPT-4V. They used the model in a tool called Be My AI, which aids vision-impaired people by describing the contents of images. OpenAI ran a pilot program with 200 beta testers from March until August 2023, then in September 2023 expanded it to 16,000 users. They also ran a developer alpha program, where more than 1,000 devs had access to the model over three months; the goal was to “gain additional feedback and insight into the real ways people interact with GPT-4V.”

The paper summarizes OpenAI’s evaluation of the model’s behavior in several areas, such as refusing to generate harmful content, refusing to identify people in images, ability to break CAPTCHAs, and refusal of image-based “jailbreaks.” OpenAI also engaged “red teams” to test the model’s abilities in scientific domains, such as understanding images in publications; and its ability to provide medical advice given medical images such as CT scans. The paper specifically notes that “we do not consider the current version of GPT-4V to be fit for performing any medical function.”

Several users discussed the new features in a thread on Hacker News. One user pointed out some limitations of the voice feature:

Voice has the potential to be awesome. This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant. It doesn’t have to be this way! [Determining] when the user is done talking is tough. What’s needed is a speech conversation turn-taking dataset and model; that’s missing from off the shelf speech recognition systems.

Several of OpenAI’s partners have been releasing products that use the new features. Spotify recently announced Voice Translation for some of their podcasts, which uses “OpenAI’s newly released voice generation technology” to generate a translation that mimics the original speaker. Microsoft’s CEO of Advertising and Web Services, Mikhail Parakhin, announced on X (formerly Twitter) that DALL-E 3 was being rolled out to Bing’s image generation tool. OpenAI also announced on X that it would be making ChatGPT’s “Browse with Bing” feature generally available soon. This feature gives the bot access to information that was published on the web after the model was trained.


Meta Open-Sources Multilingual Translation Foundation Model SeamlessM4T

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Meta recently open-sourced Massively Multilingual & Multimodal Machine Translation (SeamlessM4T), a multilingual translation AI that can translate both speech audio and text data across nearly 100 languages. SeamlessM4T is trained on 1 million hours of audio data and outperforms the current state-of-the-art speech-to-text translation model.

SeamlessM4T is a multimodal model that can handle both text and audio data as input and output, allowing it to perform automated speech recognition (ASR), text-to-text translation (T2TT), speech-to-text translation (S2TT), text-to-speech translation (T2ST), and speech-to-speech translation (S2ST). The model is released under the non-commercial CC BY-NC 4.0 license. Meta is also releasing their training dataset, SeamlessAlign, which contains 270,000 hours of audio data with corresponding text transcription, as well as their code for mining the data from the internet. According to Meta,

We believe the work we’re announcing today is a significant step forward….Our single model provides on-demand translations that enable people who speak different languages to communicate more effectively. We significantly improve performance for the low and mid-resource languages we support. These are languages that have smaller digital linguistic footprints….This is only the latest step in our ongoing effort to build AI-powered technology that helps connect people across languages. In the future, we want to explore how this foundational model can enable new communication capabilities—ultimately bringing us closer to a world where everyone can be understood.

Meta’s motivation for their research is to build a universal translation system like the Babel Fish from The Hitchhiker’s Guide to the Galaxy sci-fi stories. InfoQ has covered several of their previous efforts, including their T2TT model No Language Left Behind (NLLB), which can translate text between 200 languages, and their Massively Multilingual Speech (MMS) model, which supports ASR and text-to-speech synthesis (TTS) in over 1,100 languages. InfoQ also covered other work in the area, such as OpenAI’s Whisper, which can transcribe and translate speech audio from 97 different languages, Google’s Universal Speech Model (USM), which supports ASR in over 100 languages, and Google’s AudioPaLM, which was the previous state-of-the-art model for S2ST.

SeamlessM4T is based on the UnitY neural network architecture, which consists of a pipeline of three components. First is an encoder that can handle both speech audio and text data input and recognizes the input’s meaning; the audio sub-component is based on w2v-BERT and the text on NLLB. Next is a decoder, also based on NLLB, which converts that meaning into a text output in a target language. Finally, there is a text-to-acoustic unit decoder to convert the target text into speech.
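In pseudocode terms, that pipeline reduces to three calls. The function signatures below are hypothetical placeholders for illustration, not the actual seamless_communication API.

def unity_speech_to_speech(audio, tgt_lang, speech_encoder, text_decoder, unit_decoder):
    # Hypothetical UnitY-style pipeline mirroring the three components described above.
    hidden = speech_encoder(audio)               # w2v-BERT-based encoder captures the input's meaning
    tgt_text = text_decoder(hidden, tgt_lang)    # NLLB-based decoder produces target-language text
    tgt_speech = unit_decoder(tgt_text, hidden)  # text-to-acoustic-unit decoder yields spoken output
    return tgt_text, tgt_speech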

Meta compared their model’s performance to both cascaded approaches, which consist of a pipeline of discrete ASR, T2TT, and TTS models, and to single-model systems. The systems were evaluated on the FLEURS and CVSS benchmarks. On FLEURS, SeamlessM4T “sets a new standard for translations into multiple target languages,” outperforming AudioPaLM by 20%. SeamlessM4T also outperformed cascaded models; on CVSS it was “stronger by 58%.”

Several users discussed SeamlessM4T on Hacker News. One user shared tips on how to get the model to run locally, and pointed out that it had a context limit of 4096 tokens. Another user asked:

Will there be a whispercpp equivalent? Half the reason I love whisper is how dead simple it is to get running. I will take somewhat lower accuracy for easier operation.

The SeamlessM4T code and models are available on GitHub. There is an interactive translation demo available on Huggingface.


Abu Dhabi Releases Largest Openly-Available Language Model Falcon 180B

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

The Abu Dhabi government’s Technology Innovation Institute (TII) released Falcon 180B, currently the largest openly-available large language model (LLM). Falcon 180B contains 180 billion parameters and outperforms GPT-3.5 on the MMLU benchmark.

Falcon 180B was trained on 3.5 trillion tokens of text, 4x the amount of data used to train Llama 2. Besides the base model, TII also released a chat-specific model that is fine-tuned on instruction datasets. The models are available for commercial use, but the license includes several restrictions and requires additional permission for use in a hosted service. Although TII says that Falcon’s performance is “difficult to rank definitively,” it is “on par” with PaLM 2 Large and “somewhere between GPT 3.5 and GPT4,” depending on the benchmark used. According to TII:

As a key technology enabler, we firmly believe that innovation should be allowed to flourish. That is why we decided to open source or open access all our Falcon models. We are launching our latest Falcon 180B LLM as an open access model for research and commercial use. Together, we can mobilize and fast-track breakthrough solutions worldwide – catalyzing innovation for outsized impact.

Falcon 180B is based on TII’s smaller model, Falcon 40B, which was released earlier this year. One innovation in the Falcon architecture was the use of multiquery attention, which reduces the model’s memory bandwidth requirements when running inference. Both models were trained on TII’s RefinedWeb dataset; for the new 180B model, the amount of data was increased from 1.5 trillion tokens to 3.5 trillion. Training Falcon 180B took approximately 7 million GPU-hours on Amazon SageMaker, using 4,096 GPUs concurrently.
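Multiquery attention keeps per-head query projections but shares a single key and value projection across all heads, which shrinks the key/value cache that dominates memory traffic during autoregressive inference. Below is a minimal, unmasked sketch; the dimensions are illustrative and not Falcon’s actual configuration.

import numpy as np

def multiquery_attention(x, Wq, Wk, Wv, n_heads):
    # x: (seq, d_model). Queries get n_heads projections; keys and values share
    # a single head, so the cached K/V tensors are n_heads times smaller.
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(seq, n_heads, d_head)          # (seq, heads, d_head)
    k, v = x @ Wk, x @ Wv                               # (seq, d_head), shared by all heads
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)           # softmax over key positions
    out = np.einsum("hqk,kd->qhd", weights, v)
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, seq = 64, 8, 10
x = rng.standard_normal((seq, d_model))
y = multiquery_attention(x,
                         rng.standard_normal((d_model, d_model)),
                         rng.standard_normal((d_model, d_model // n_heads)),
                         rng.standard_normal((d_model, d_model // n_heads)),
                         n_heads)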

On X (formerly Twitter), several users posted about the Falcon 180B. One user speculated that:

GDPR in the EU may make the Falcon 180B model the only viable option for those who prioritize data localization and privacy.

Although the model’s size makes it difficult for most users to run locally, Hugging Face scientist Clémentine Fourrier pointed out that there is no difference in inference quality “between the 4-bit Falcon-180B and the bfloat-16 one,” meaning that users could reduce memory needs by 75%. Georgi Gerganov, developer of llama.cpp, a package that helps users run LLMs on their personal hardware, claimed to be running the model on an Apple M2 Ultra.

Commenting on the model’s generative capabilities, HyperWrite CEO Matt Shumer noted TII’s claim that the model’s performance was between GPT-3.5 and GPT-4 and predicted “We’re now less than two months away from GPT-4-level open-source models.” NVIDIA’s senior AI scientist Dr. Jim Fan took issue with the model’s lack of training on source code data:

Though it’s beyond me why code is only 5% in the training mix. It is by far the most useful data to boost reasoning, master tool use, and power AI agents. In fact, GPT-3.5 is finetuned from a Codex base….I don’t see any coding benchmark numbers. From the limited code pretraining, I’d assume it isn’t good at it. One cannot claim “better than GPT-3.5” or “approach GPT-4” without coding. It should’ve been an integral part in the pretraining recipe, not a finetuning after-thought.

The Falcon 180B models, both base and chat, are available on the Hugging Face Hub. An interactive chat demo and the RefinedWeb dataset are also available on Huggingface.


Meta Open-Sources Code Generation LLM Code Llama

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Meta recently open-sourced Code Llama, a code generation LLM which is based on the Llama 2 foundation model and carries the same community license. Code Llama was fine-tuned on 500B tokens of code and is available in three model sizes ranging up to 34B parameters. In evaluations on code-generation benchmarks, the model outperformed all other open-source models and is comparable to ChatGPT.

Meta used three sizes of the Llama 2 foundation model—7B, 13B, and 34B parameters—as starting points for Code Llama. These were fine-tuned on a “near-deduplicated” dataset of code as well as natural language related to code, such as questions and discussions. Meta also trained two variants of each model size, besides the base version: Code Llama – Python, which is further fine-tuned on Python code; and Code Llama – Instruct, which is fine-tuned on natural-language instructions. All nine model versions are licensed for commercial use. According to Meta, 

Code Llama is designed to support software engineers in all sectors – including research, industry, open source projects, NGOs, and businesses. But there are still many more use cases to support than what our base and instruct models can serve…We hope that Code Llama will inspire others to leverage Llama 2 to create new innovative tools for research and commercial products.

InfoQ previously covered other code-generation AI models, including OpenAI’s Codex, which is based on GPT-3 and powers Github’s Copilot. Like the other models in the GPT series, Codex is only available via OpenAI’s web service API. This has prompted the development of open models, such as BigCode’s StarCoder. StarCoder also has the advantage of being trained on “permissively-licensed” code, so that the use of its output is unlikely to result in license violations. While Llama 2 and its derived models, including Code Llama, are licensed for commercial use, the Code Llama license notes that its output “may be subject to third party licenses.”

In addition to fine-tuning the models on code, Meta also performed long context fine-tuning (LCFT), which increases the length of input the model can handle. While Llama 2 was trained on sequences up to 4k tokens, the LCFT for Code Llama includes sequences up to 16k. Meta’s goal for this was “unlocking repository-level reasoning for completion or synthesis,” giving the model access to an entire project’s code instead of only a single function or source file. Meta’s experiments show that the model exhibits “stable behavior” for sequences up to 100k tokens.
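The paper attributes this long-context capability to increasing the base period of the rotary position embeddings (reportedly from 10,000 to 1,000,000), which slows how quickly attention between distant positions decays. A minimal sketch of rotary embeddings with an adjustable base, with illustrative dimensions:

import numpy as np

def rotary_embed(x, positions, base=1_000_000.0):
    # x: (seq, d) query or key vectors; a larger base stretches the rotation
    # wavelengths, the knob that long context fine-tuning turns (Llama 2 uses 10_000).
    seq, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)        # (d/2,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # rotate each 2D pair by its angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.randn(16_000, 64)                      # head vectors for a 16k-token sequence
rotated = rotary_embed(x, np.arange(16_000))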

In a Twitter/X thread about the model, Furkan Gözükara, an assistant professor at Toros University, noted that GPT-4 still outperformed Code Llama on the HumanEval benchmark. Another user replied that GPT-4 is “not 34B,” meaning that GPT-4 is a far bigger model. The makers of Phind, an AI assistant for programmers, released a fine-tuned version of the 34B-parameter Code Llama – Python model that they claim achieves a 69.5% pass@1 score on HumanEval, outperforming GPT-4’s published score of 67%. One of the developers joined a Hacker News discussion about their release and said:

This model is only the beginning — it’s an early experiment and we’ll have improvements next week.

The Code Llama source code is available on GitHub. The model files can be downloaded after applying for approval from Meta.


Stability AI Launches Open Source Chatbot Stable Chat

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Stability AI, makers of the image generation AI Stable Diffusion, recently launched Stable Chat, a web-based chat interface for their open-access language model Stable Beluga. At the time of its release, Stable Beluga was the best-performing open large language model (LLM) on the HuggingFace leaderboard.

Stable Beluga is based on the LLaMA foundation model released by Meta. The model is fine-tuned using a synthetic dataset generated by GPT-4. The largest Stable Beluga model contains 70B parameters and outperforms ChatGPT on several benchmarks, including AGIEval, which is based on several common examinations such as the LSAT and SAT. To help evaluate Stable Beluga, Stability AI created the Stable Chat web interface to let users interact with the model and give feedback on its output. According to Stability AI,

As part of our efforts at Stability AI to build the world’s most trusted language models, we’ve set up a research-purpose-only website to test and improve our technology. We will continue to update new models as our research progresses rapidly. We ask that you please avoid using this site for real-world applications or commercial uses.

The Stable Beluga models were inspired by a paper published by Microsoft on Orca, a fine-tuned version of LLaMA. In the paper, Microsoft described a technique called explanation tuning. Like instruction tuning, which has been used to train many recent LLMs, including ChatGPT and Vicuna, explanation tuning uses a dataset of example inputs and desired model outputs generated by a teacher. In the case of ChatGPT, the teachers are actual human users of the model. In contrast, for Orca and Stable Beluga, the explanation tuning dataset is generated by prompting GPT-4 to explain why it generated the output it did (“explain like I’m five”).
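A rough sketch of how one such record might be assembled is shown below; the system-prompt wording and field names are illustrative assumptions, since neither Microsoft’s nor Stability AI’s exact prompts are given here.

def teacher_request(instruction):
    # System message nudging the GPT-4 teacher to show its reasoning,
    # "explain like I'm five" style (wording is hypothetical).
    system = ("You are a helpful assistant. Think step by step and explain "
              "your answer as if to a five-year-old.")
    return {"system": system, "user": instruction}

def explanation_tuning_example(instruction, teacher_response):
    # Student fine-tuning pair: the instruction as input, the explained answer as target.
    return {"prompt": instruction, "completion": teacher_response}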

Stability AI created their own explanation tuning dataset of 600,000 examples, one-tenth the size of the Microsoft dataset. They then trained two versions of Stable Beluga: Stable Beluga 1, based on the 65B-parameter original LLaMA model, and Stable Beluga 2, based on the 70B Llama 2 model. Both are released under a non-commercial license. Although the models achieved fourth and first place, respectively, on the leaderboard when they were released, the proliferation of LLaMA-based fine-tuned models has since pushed Stable Beluga 2 out of the top ten, and Stable Beluga 1 even lower.

The models were released under a non-commercial license to encourage researchers to help iterate and improve on the technology, according to Stability AI. However, the company noted that running models of this size requires resources that are “beyond the reach of everyday researchers,” so it decided to create the Stable Chat website. Users can create a free login or use a Google account to access the chat. Responses from the model can be up-voted, down-voted, or flagged; this user feedback will be used to help improve the model in the future.

Stability AI founder Emad Mostaque posted about the release on Twitter/X. One user replied that the model was “too cautious in giving factual information.” Mostaque urged the user to give that feedback via the web interface.

Stability AI also recently announced that their LLMs will be used at an AI red-teaming event at DEF CON 31. This event is sponsored by the White House and features models from “Anthropic, Google, Hugging Face, Microsoft, NVIDIA, OpenAI, and Stability AI.” The goal is to help identify risks and vulnerabilities in the models.


LMSYS Org Releases Chatbot Arena and LLM Evaluation Datasets

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Large Model Systems Organization (LMSYS Org) recently released Chatbot Arena, a comparison platform for large language models (LLMs), where users can pick the better response from a pair of chatbots. LMSYS also released a dataset containing conversations from the Arena as well as a dataset of human annotations of results from evaluating LLMs on the MT-Bench benchmark.

LMSYS Org created Chatbot Arena earlier this year to “crowdsource” an evaluation of several different open- and closed-source LLMs, including GPT-4 and LLaMA. The Arena produced a leaderboard of models, ranking them according to their Elo rating. Because this method was time-consuming, the LMSYS team developed an additional benchmark, MT-Bench, which consists of 80 multi-turn questions to ask a chatbot, with the chatbot’s responses graded by GPT-4. According to LMSYS Org:

[We] have shown that MT-Bench effectively differentiates between chatbots of varying capabilities. It’s scalable, offers valuable insights with category breakdowns, and provides explainability for human judges to verify. However, LLM judges should be used carefully. It can still make errors, especially when grading math/reasoning questions.

The rise of LLMs has led to a need for new benchmarks to measure their abilities, as the models have achieved superhuman performance on traditional ones like GLUE. The Massive Multitask Language Understanding (MMLU) benchmark can measure an LLM’s knowledge capabilities, but it does not measure how well the LLM can produce output that is aligned with human preference, which is the feature that new models such as ChatGPT are pursuing.

Earlier this year, LMSYS Org released their Vicuna LLM, a fine-tuned version of Meta’s LLaMA model. To evaluate Vicuna, the researchers used GPT-4 as a judge of its output and claimed that Vicuna achieved “more than 90% quality” of ChatGPT and Bard. Within a few months, LMSYS Org announced the Chatbot Arena as an attempt to crowdsource the evaluation of models. Users would interact with two different models at once and choose which one they preferred; the results are aggregated into an Elo rating for each model. In this latest move, LMSYS Org is releasing a dataset of 33K Arena chatbot conversations with humans.
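Each head-to-head vote can update the two models’ ratings with a standard Elo step, as in the sketch below; LMSYS Org’s exact rating computation may differ.

def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a is 1.0 if model A's response won, 0.0 if it lost, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1000, 1000, 1.0))  # the winner gains what the loser gives up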

After running the Arena for several months, the researchers identified 8 categories of user prompts, including math, reasoning, and STEM knowledge. They created 10 multi-turn questions for each category, producing MT-Bench, a “quality-controlled complement” to the Arena. They again used GPT-4 to grade a chatbot’s responses to the benchmark prompts, and found that the GPT-4 judge agreed with human judges more than 80% of the time, which was similar to how often two different human judges agreed. GPT-4’s explanations for its choice could even persuade human judges to change their picks 34% of the time. LMSYS Org has now released a dataset of 3.3k “expert-level pairwise human preferences” for responses generated by six different models.

ML researcher Nathan Lambert discussed the work on Twitter, pointing out that the MT-Bench score “seems like the clearest benchmark to optimize” for researchers trying to produce models that match leaders like GPT-4. MT-Bench co-author Wei-Lin Chiang also answered several user questions on Twitter. In response to a question about correctly using models when evaluating them, Chiang replied:

That’s a great point. We try our best to find the official template if it exists…But lack of standard and LLM’s sensitivity to the template is definitely an issue.

The Chatbot Arena and MT-Bench evaluation code are available on GitHub. The Arena conversation dataset and MT-Bench response dataset are available on Huggingface, as is the current LLM Leaderboard.
