Author: Anthony Alford

Articles originally published on InfoQ.

Google DeepMind recently announced PaLM 2, a large language model (LLM) powering Bard and over 25 other product features. PaLM 2 significantly outperforms the previous version of PaLM on a wide range of benchmarks, while being smaller and cheaper to run.
Google CEO Sundar Pichai announced the model at Google I/O ’23. PaLM 2 performs well on a variety of tasks, including code generation, reasoning, and multilingual processing, and it is available in four different model sizes, including a lightweight version called Gecko that is intended for use on mobile devices. When evaluated on NLP benchmarks, PaLM 2 showed performance improvements over PaLM, and achieved new state-of-the-art levels in many tasks, especially on the BIG-bench benchmark. Besides powering Bard, the new model is also a foundation for many other products, including Med-PaLM 2, an LLM fine-tuned for the medical domain, and Sec-PaLM, a model for cybersecurity. According to Google,
PaLM 2 shows us the impact of highly capable models of various sizes and speeds—and that versatile AI models reap real benefits for everyone. Yet just as we’re committed to releasing the most helpful and responsible AI tools today, we’re also working to create the best foundation models yet for Google.
In 2022, InfoQ covered the original release of Pathways Language Model (PaLM), a 540-billion-parameter large language model (LLM). PaLM achieved state-of-the-art performance on several reasoning benchmarks and also exhibited capabilities on two novel reasoning tasks: logical inference and explaining a joke.
For PaLM 2, Google implemented several changes to improve model performance. First, they studied model scaling laws to determine the optimal combination of training compute, model size, and data size. They found that, for a given compute budget, data and model size should be scaled “roughly 1:1,” whereas previous researchers had scaled model size 3x the data size.
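As a rough illustration of what 1:1 scaling implies, the sketch below uses the common approximation that training compute C ≈ 6 * N * D for N parameters and D tokens; both the approximation and the numbers are illustrative assumptions, not figures from the PaLM 2 report.

```python
# Rough illustration of compute-optimal scaling when parameters (N) and tokens (D)
# grow together. Uses the common approximation C ~= 6 * N * D for training FLOPs;
# the exact coefficients in the PaLM 2 report differ.

def optimal_n_and_d(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget so that parameters and tokens scale 1:1."""
    # For illustration we set N = D, so N = D = sqrt(C / 6).
    n = (compute_flops / 6) ** 0.5
    return n, n

for budget in (1e21, 1e22, 1e23):
    n, d = optimal_n_and_d(budget)
    # A 10x compute increase grows both N and D by ~3.16x under 1:1 scaling.
    print(f"C={budget:.0e} FLOPs -> ~{n:.2e} params, ~{d:.2e} tokens")
```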
The team improved PaLM 2’s multilingual capabilities by including more languages in the training dataset and updating the model training objective. The original dataset was “dominated” by English; the new dataset pulls from a more diverse set of languages and domains. Instead of using only a language modeling objective, PaLM 2 was trained using a “tuned mixture” of several objectives.
Google evaluated PaLM 2 on six broad classes of NLP benchmarks: reasoning, coding, translation, question answering, classification, and natural language generation. The focus of the evaluation was to compare its performance to the original PaLM. On BIG-bench, PaLM 2 showed “large improvements,” and on classification and question answering even the smallest PaLM 2 model achieved performance “competitive” with the much larger PaLM model. On reasoning tasks, PaLM 2 was also “competitive” with GPT-4; it outperformed GPT-4 on the GSM8K mathematical reasoning benchmark.
In a Reddit discussion about the model, several users commented that although its output wasn’t as good as that from GPT-4, PaLM 2 was noticeably faster. One user said:
They probably want it to be scalable so they can implement it for free/low cost with their products. Also so it can accompany search results without taking forever (I use GPT 4 all the time and love it, but it is pretty slow.)…I just used the new Bard (which is based on PaLM 2) and it’s a good amount faster than even GPT 3.5 turbo.
The PaLM 2 tech report page on Papers with Code lists the model’s performance on several NLP benchmarks.


Meta AI Research open-sourced DINOv2, a foundation model for computer vision (CV) tasks. DINOv2 is pretrained on a curated dataset of 142M images and can be used as a backbone for several tasks, including image classification, video action recognition, semantic segmentation, and depth estimation.
Meta based the model on the Vision Transformer (ViT) architecture, with modifications for self-supervised learning objectives. To train the model, the team built an automated pipeline to create a curated dataset of images scraped from the web. A major contribution of the work was an improved training process, which is twice as fast and uses one-third the memory of previous approaches. When evaluated on CV benchmarks, DINOv2 outperformed other self-supervised learning (SSL) models and showed performance comparable to or better than that of weakly-supervised models. According to Meta,
Going forward, the team plans to integrate this model, which can function as a building block, in a larger, more complex AI system that could interact with large language models. A visual backbone providing rich information on images will allow complex AI systems to reason on images in a deeper way than describing them with a single text sentence. Models trained with text supervision are ultimately limited by the image captions. With DINOv2, there is no such built-in limitation.
Deep learning models for CV tasks have typically relied on large datasets of images with human annotations, such as ImageNet. In 2021, OpenAI released CLIP, a foundation model for CV that was trained using a form of weak supervision, where the annotations were automatically derived from HTML tags and other web-based metadata associated with the source images. That same year, Google published the ViT model, which uses SSL for training, and Meta published the original version of DINO, which combined a ViT model with knowledge distillation to produce smaller models with comparable performance.
For DINOv2, Meta focused on gathering more training data and scaling up the training process. For the training data, Meta collected 1.2B unique images from the internet, then clustered them according to their similarity to the images in the ImageNet dataset for a final set of 142M images. To scale up training, Meta implemented a custom version of FlashAttention and used Fully-Sharded Data Parallel (FSDP) training with PyTorch. Overall, the project consumed about 200k GPU-days of compute.
To evaluate DINOv2’s performance as a foundation model, the team tested it on a variety of CV tasks and compared it to several baseline SSL models as well as weakly-supervised models such as CLIP. On the ImageNet-1k classification task, DINOv2 showed a “very significant improvement” compared to other SSL models and also outperformed the weakly-supervised ones. It also set a new SSL state-of-the-art record on three video action recognition benchmarks and outperformed baselines on instance-level recognition benchmarks and on three monocular depth estimation benchmarks.
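The typical way to use a frozen backbone like this is a linear probe on its features. The sketch below assumes the torch.hub entrypoint name and the 384-dimensional ViT-S/14 embedding size listed in the DINOv2 repository; check the GitHub page for the current names.

```python
import torch
import torch.nn as nn

# Sketch of the "frozen backbone + linear head" evaluation style described above.
# The hub entrypoint name "dinov2_vits14" is an assumption based on the public repo.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()  # keep the self-supervised features frozen

# ViT-S/14 produces 384-dimensional image embeddings (an assumption for this sketch).
linear_head = nn.Linear(384, 1000)  # e.g. ImageNet-1k classes

images = torch.randn(8, 3, 224, 224)  # dummy batch of 224x224 RGB images
with torch.no_grad():
    features = backbone(images)       # frozen DINOv2 features, shape (8, 384)
logits = linear_head(features)        # only this layer would be trained
```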
In a Hacker News discussion about the work, several users praised Meta’s recent work in computer vision as well as past contributions such as PyTorch. One did note a shift in Meta’s communications around their work:
As a grad student in this field, Meta has always had great contributions to the open source machine learning effort, through no small effort of Yann LeCun’s internal advocacy. What has changed recently is their PR strategy: [OpenAI] has basically shown everybody that it doesn’t matter if you have the best models if your publicity sucks.
The DINOv2 code and models are available on GitHub. The project site hosts an interactive demo of several computer vision tasks using DINOv2.


OpenAI recently announced plugin support for ChatGPT, allowing the language model to access external tools and databases. The company also open-sourced the code for a knowledge retrieval plugin, which organizations can use to provide ChatGPT-based access to their own documents and data.
Although large language models (LLMs) like ChatGPT can answer many questions correctly, their knowledge can become out of date, since it is not updated after the LLM is trained. Further, the model can only output text, which means it cannot directly act on behalf of a user.
To help solve this problem, researchers have explored ways to allow LLMs to execute APIs or access knowledge databases. ChatGPT’s plugin system will allow the model to integrate with external systems such as knowledge bases and third-party APIs. The retrieval plugin allows the model to perform semantic search against a vector database. Because the plugin is self-hosted, organizations can securely store their own internal documents in the database, which lets their users interact with the data via ChatGPT’s natural language interface.
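Conceptually, the retrieval flow embeds documents, stores the vectors, and answers a rephrased query by similarity search. The sketch below is illustrative only: it uses the pre-1.0 OpenAI embeddings client and an in-memory array in place of a real vector database.

```python
import numpy as np
import openai  # pre-1.0 openai client, current when the plugin shipped

# Conceptual sketch of the retrieval plugin's flow: embed documents, store the
# vectors, then answer a query by cosine similarity. An in-memory array stands
# in for a real vector database such as Pinecone, Milvus, or Weaviate.

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

documents = [
    "Q3 revenue grew 12% year over year.",
    "The on-call rotation changes every Monday.",
]
doc_vectors = embed(documents)                                     # "upsert" step

query_vector = embed(["How did revenue change last quarter?"])[0]  # "query" step
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(documents[int(np.argmax(scores))])  # top match is returned to ChatGPT as context
```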
The plugin supports several commercial and open-source vector databases, including one developed by Pinecone, who contributed to the open-source plugin code. InfoQ spoke with Roy Miara, an Engineering Manager at Pinecone, about their contribution to the plugin.
InfoQ: What is a ChatGPT plugin, and in particular, what is the Retrieval Plugin used for?
Miara: ChatGPT plugins serve as supplementary tools that facilitate access to current information, execute computations, or integrate third-party services for ChatGPT. The Retrieval Plugin, in particular, empowers ChatGPT to obtain external knowledge via semantic search techniques. There are two prevalent paradigms for employing the Retrieval Plugin: 1) utilizing the plugin to access personal or organizational data, and 2) implementing the plugin as a memory component within ChatGPT. Both are using semantic search as a way for the model to rephrase user prompts as queries to a vector database such as Pinecone, Milvus, or Weaviate.
InfoQ: What are some advantages a ChatGPT plugin has over other LLM integrations such as LangChain?
Miara: Although LangChain enables an “agent” experience with tools and chains, ChatGPT plugins are more suitable for AI app development. The advantages of ChatGPT plugins include: 1) a more sophisticated implementation that leverages OpenAI’s internal plugin capabilities, as opposed to LangChain’s approach of concatenating plugin information as a prompt to the model, and 2) the support for security and authentication methods, which are essential for AI application development, particularly when accessing personal data or performing actions on behalf of a user. These features are not present in Langchain’s current offerings.
InfoQ: Can you describe your contributions to the retrieval plugin open source project?
Miara: Pinecone datastore implementation was contributed to the project, alongside some other internal improvements of testing and documentation. The overall base implementation follows Pinecone’s upsert/query/delete paradigm, and we are currently working on hybrid queries and other advanced query techniques.
InfoQ: Can you provide some technical details on how a typical ChatGPT plugin works?
Miara: A ChatGPT Plugin is a web server that exposes an “instruction” manifest to ChatGPT, where it describes the operation of the Plugin as a prompt and the API reference as an OpenAPI yaml specification. With those, ChatGPT understands the different API calls that are possible, and the instructions it should follow.
So in order to build Plugin, one should build the application logic, implement a web server that follows OpenAPI specification, and deploy the server such that ChatGPT is able to access it. Although there is no limit to the application logic that one can implement, it is not recommended to construct a complex API server, since this might result in undesired behavior, confusion etc.
We have found that the “description_for_model” part of the manifest, which is essentially a prompt that is injected before the context retrieved, is a key for a successful plugin. OpenAI are providing some guidelines, but in the end of the day, it’s in the developer hands to find the right prompt for the task.
InfoQ: OpenAI mentions that plugins are “designed specifically for language models with safety as a core principle.” What are some of the safety challenges you encountered in developing your plugin?
Miara: Firstly, enabling ChatGPT to access personal or organizational data necessitated the implementation of both security and data integrity features. The plugin is designed to handle API authentication, ensuring secure data access for both reading and writing purposes.
Secondly, generative language models often grapple with hallucinations and alignment issues. We observed that earlier versions of plugins occasionally provided incorrect responses to queries, but subsequent iterations demonstrated improved accuracy while also admitting when certain questions were beyond their scope. Moreover, by running plugins in an alpha stage for an extended period, OpenAI can better align the results before releasing them to a broader audience.
Additionally, it’s important to note that the plugins feature is designed with complete transparency for users. First, users explicitly select the plugins they wish to enable for use with ChatGPT. Secondly, whenever ChatGPT utilizes a plugin, it clearly indicates this to the user, while also making it simple to view the specific results that the plugin service has provided to ChatGPT’s context.


Stability AI released two sets of pre-trained model weights for StableLM, a suite of large language models (LLMs). The models are trained on 1.5 trillion text tokens and are licensed for commercial use under CC BY-SA 4.0.
The released models contain 3B and 7B parameters respectively, with larger models soon to come. The new suite of models is a result of Stability’s previous collaboration efforts with EleutherAI; the training dataset is an updated version of EleutherAI’s The Pile dataset, with three times the data used to train EleutherAI’s models. The release also includes versions of the StableLM models that have been fine-tuned on instruction-following and chat datasets, including Stanford’s Alpaca dataset. The fine-tuned models are licensed for non-commercial use only, due to Alpaca’s licensing requirements. According to Stability AI,
With the launch of the StableLM suite of models, Stability AI is continuing to make foundational AI technology accessible to all. Our StableLM models can generate text and code and will power a range of downstream applications. They demonstrate how small and efficient models can deliver high performance with appropriate training…Language models will form the backbone of our digital economy, and we want everyone to have a voice in their design. Models like StableLM demonstrate our commitment to AI technology that is transparent, accessible, and supportive.
The success of generative LLMs such as OpenAI’s GPT-3 spurred the development of smaller open-source models with similar capabilities. In 2022, InfoQ covered the release of EleutherAI’s GPT-NeoX-20B, an open-source 20B parameter LLM; more recently, InfoQ covered Meta’s 7B parameter LLaMA LLM. OpenAI’s release of ChatGPT showed that LLM performance could be improved by fine-tuning them on “instruction-following” datasets, which led to the release of similar models such as Stanford’s Alpaca, a fine-tuned version of LLaMA.
Although only the 3B and 7B parameter StableLM models have been released, Stability AI says that models with 15B, 30B, and 65B parameters are in progress, and a 175B parameter model is planned. The company also says they will be crowd-sourcing an open-source dataset for fine-tuning chatbot assistants, to further the efforts of projects such as OpenAssistant. While Stability AI did not announce any benchmark performance data for the models, they claim “surprisingly high performance in conversational and coding tasks.”
In a Hacker News discussion about the release, one user said:
Selling access to LLMs via remote APIs is the “stage plays on the radio” stage of technological development. It makes no actual sense; it’s just what the business people are accustomed to. It’s not going to last very long. So much more value will be unlocked by running them on device. People are going to look back at this stage and laugh, like paying $5/month to a cell phone carrier for Snake on a feature phone.
Stability’s CEO Emad Mostaque replied to questions about StableLM in an “ask me anything” thread on Twitter. When asked about the hardware used to train the models, he said that they were using “3,000 A100s and 512 TPU v4s.”
Stanislav Fort, LLM lead at Stability, posted a helpful tip on Twitter:
For the early StableLM models, try adding “User: ” to the prompt. Because of the way these models were trained, prepending your evals with “User: ” should make things *much* better.
The code for the StableLM models is available on GitHub. The model weights and a demo chat interface are available on HuggingFace.
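A minimal sketch of loading one of the base models with HuggingFace Transformers and applying Fort’s prompt tip; the repository id shown is an assumption, so check Stability AI’s HuggingFace organization for the exact names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of loading a StableLM base model and applying the "User: " prompt tip
# quoted above. The repository id is an assumption based on the release.
model_id = "stabilityai/stablelm-base-alpha-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "User: Explain what a large language model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```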


OpenAI recently announced GPT-4, the next generation of their GPT family of large language models (LLM). GPT-4 can accept both text and image inputs and outperforms state-of-the-art systems on several natural language processing (NLP) benchmarks. The model also scored in the 90th percentile on a simulated bar exam.
OpenAI’s president and co-founder, Greg Brockman, demonstrated the model’s capabilities in a recent livestream. The model was trained using the same infrastructure as the previous generation model, GPT-3.5, and like ChatGPT it has been fine-tuned using reinforcement learning from human feedback (RLHF). However, GPT-4 features several improvements over the previous generation. Besides the ability to handle image input, the default context length has doubled, from 4,096 tokens to 8,192. There is also a limited-access version that supports 32,768 tokens, which is approximately 50 pages of text. The model’s response behavior is more steerable via a system prompt. The model also has fewer hallucinations than GPT-3.5, when measured on benchmarks like TruthfulQA. According to OpenAI:
We look forward to GPT-4 becoming a valuable tool in improving people’s lives by powering many applications. There’s still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model.
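A minimal sketch of the system-prompt steering mentioned above, using the chat completions client available at the time; the prompt contents are illustrative.

```python
import openai  # pre-1.0 openai client, current when GPT-4 launched

# Sketch of steering response behavior with a system message, as described above.
# The system prompt text is illustrative, not from OpenAI's documentation.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a terse assistant. Answer in one sentence."},
        {"role": "user", "content": "Summarize what reinforcement learning from human feedback does."},
    ],
)
print(response["choices"][0]["message"]["content"])
```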
Although OpenAI has not released details of the model architecture or training dataset, they did publish a technical report showing its results on several benchmarks, as well as a high-level overview of their efforts to identify and mitigate the model’s risk of producing harmful output. Because fully training the model requires so much computational power and time, they also developed techniques to predict the model’s final performance, given performance data for smaller models. According to OpenAI, this will “improve decisions around alignment, safety, and deployment.”
To help evaluate their models, OpenAI has open-sourced Evals, a framework for benchmarking LLMs. The benchmark examples, or evals, typically consist of prompt inputs to the LLM along with expected responses. The repo already contains several eval suites, including implementations of existing benchmarks such as MMLU, as well as other suites where GPT-4 does not perform well, such as logic puzzles. OpenAI says they will use the Evals framework to track performance when new model versions are released; they also intend to use the framework to help guide future development of model capabilities.
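An eval sample pairs a prompt with the expected answer. The sketch below writes one such sample as JSONL; the "input"/"ideal" keys follow examples in the Evals repository, but treat the exact schema as an assumption.

```python
import json

# Sketch of an eval sample: a prompt for the model plus the expected ("ideal")
# answer, stored one JSON object per line.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "What is 17 * 24?"},
        ],
        "ideal": "408",
    }
]

with open("arithmetic_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```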
Several users discussed GPT-4 in a thread on Hacker News. One commenter said:
After watching the demos I’m convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patient’s medical history in the prompt, a lawyer an entire case history, etc….What [percentage] of people can hold 25,000 words worth of information in their heads, while effectively reasoning with and manipulating it?
However, several other users pointed out that medical and legal applications would require better data privacy guarantees from OpenAI. Some suggested that a homomorphic encryption scheme, where the GPT model operates on encrypted input, might be one solution.
Developers interested in using the model can join OpenAI’s waitlist to request access.


The PyTorch Foundation recently released PyTorch version 2.0, a 100% backward compatible update. The main API contribution of the release is a compile function for deep learning models, which speeds up training. Internal benchmarks on 163 open-source AI projects showed that the models ran on average 43% faster during training.
Plans for the 2.0 release were announced at the PyTorch Conference in December 2022. Besides the new compile function, the release also includes performance improvements for Transformer-based models, such as large language models and diffusion models, via a new implementation of scaled dot product attention (SDPA). Training on Apple silicon is accelerated via an improved Metal Performance Shaders (MPS) backend, which now implements 300 operations. Besides the core release, the domain libraries, including TorchAudio, TorchVision, and TorchText, were updated with new beta features. Overall, the 2.0 release includes over 4,500 commits from 428 developers since the 1.13.1 release. According to the PyTorch Foundation blog,
We are excited to announce the release of PyTorch® 2.0 which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood with faster performance and support for Dynamic Shapes and Distributed.
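A small illustration of the new SDPA entry point mentioned above; the tensor shapes are arbitrary, and the fused kernel actually used depends on the hardware and inputs.

```python
import torch
import torch.nn.functional as F

# Scaled dot product attention in PyTorch 2.0, which dispatches to fused
# kernels (e.g. FlashAttention-style) when available.
batch, heads, seq_len, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal self-attention
print(out.shape)  # torch.Size([2, 8, 128, 64])
```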
In his keynote speech at the PyTorch Conference 2022, PyTorch co-creator Soumith Chintala pointed out that thanks to increases in GPU compute capacity, many existing PyTorch workloads are constrained by memory bandwidth or by PyTorch framework overhead. Previously the PyTorch team had addressed performance problems by writing some of their core components in C++; Chintala described PyTorch as “basically a C++ codebase,” and said that he “hates” contributing to the C++ components.
The new compile feature is based on four underlying components written in Python:
- TorchDynamo – performs graph acquisition by rewriting Python code representing deep learning models into blocks of computational graphs
- AOTAutograd – performs “ahead of time” automatic differentiation for the backprop step
- PrimTorch – canonicalizes the over 2k PyTorch operators down to a fixed set of around 250 primitive operators
- TorchInductor – generates fast hardware-specific backend code for accelerators
To demonstrate the performance improvements and ease of use of the compile function, the PyTorch team identified 163 open-source deep learning projects to benchmark. These included implementations of a wide variety of tasks including computer vision, natural language processing, and reinforcement learning. The team made no changes to the code besides the one-line call to the compile function. This single change worked in 93% of the projects, and the compiled models ran 43% faster when trained on NVIDIA A100 GPUs.
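The one-line change looks roughly like the sketch below, which uses a torchvision ResNet as a stand-in for the benchmarked projects.

```python
import torch
import torchvision.models as models

# The "one-line change" described above: wrap an existing model with torch.compile
# and train as usual. resnet50 is just a stand-in for the benchmarked projects.
model = models.resnet50()
compiled_model = torch.compile(model)  # new in PyTorch 2.0

optimizer = torch.optim.SGD(compiled_model.parameters(), lr=0.01)
x = torch.randn(16, 3, 224, 224)
y = torch.randint(0, 1000, (16,))

loss = torch.nn.functional.cross_entropy(compiled_model(x), y)
loss.backward()
optimizer.step()
```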
In a Hacker News discussion about the release, one user noted:
A big lesson I learned from PyTorch vs other frameworks is that productivity trumps incremental performance improvement. Both Caffe and MXNet marketed themselves for being fast, yet apparently being faster here and here by some percentage simply didn’t matter that much. On the other hand, once we make a system work and make it popular, the community will close the performance gap sooner than competitors expect. Another lesson is probably old but worth repeating: investment and professional polishing [matter] to open source projects.
The PyTorch code and version 2.0 release notes are available on GitHub.


Researchers from Stanford University have developed a brain-computer interface (BCI) for synthesizing speech from signals captured in a patient’s brain and processed by a recurrent neural network (RNN). The prototype system can decode speech at 62 words per minute, 3.4x faster than previous BCI methods.
The system was described in a paper published on bioRxiv. Working with a patient who lost speech ability due to amyotrophic lateral sclerosis (ALS), the team used microelectrodes implanted in the patient’s brain to capture the neural activity signals generated when the patient attempted to speak. These signals were passed to an RNN, specifically a gated recurrent unit (GRU) model, which was trained to decode the neural signals into phonemes for speech synthesis. When trained on a limited vocabulary of 50 words, the system achieved a 9.1% error rate, and a 23.8% error rate on a 125k-word vocabulary. According to the researchers:
[We] demonstrated a speech BCI that can decode unconstrained sentences from a large vocabulary at a speed of 62 words per minute, the first time that a BCI has far exceeded the communication rates that alternative technologies can provide for people with paralysis…Our demonstration is a proof of concept that decoding attempted speaking movements from intracortical recordings is a promising approach, but it is not yet a complete, clinically viable system.
The use of deep learning models to interpret human brain activity is an active research area, and InfoQ has covered several BCI projects involving assistive devices. Many of these use sensors that are implanted in a patient’s brain, because these provide the best signal quality; in 2019 InfoQ covered a system developed by Meta which uses such signals to allow users to “type” by imagining themselves speaking. InfoQ has also covered systems that use external or “wearable” sensors, such as the one developed by Georgia Tech in 2021, which allows users to control a video game by imagining activity.
The Stanford system uses four microelectrode arrays implanted in the patient’s ventral premotor cortex and Broca’s area. To collect data for training the RNN, the patient was given a few hundred sentences each day, which she “mouthed,” or pantomimed speaking, generating neural signals that were captured by the microelectrodes. Overall, the team collected 10,850 sentences. Using “custom machine learning methods” from the speech recognition domain, the researchers trained the RNN to output a sequence of phonemes.
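As a rough sketch (not the authors’ code), a GRU decoder of this kind maps windows of neural activity to per-timestep phoneme probabilities; the feature sizes and the CTC-style objective below are assumptions borrowed from speech recognition, while the paper describes its own custom training methods.

```python
import torch
import torch.nn as nn

# Minimal sketch of a GRU that maps neural-activity feature vectors to
# per-timestep phoneme probabilities. Feature size, phoneme inventory, and the
# CTC-style loss are illustrative assumptions.
n_channels, n_phonemes = 256, 41

class PhonemeDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(n_channels, 512, num_layers=2, batch_first=True)
        self.head = nn.Linear(512, n_phonemes + 1)  # +1 for the CTC blank token

    def forward(self, x):            # x: (batch, time, channels)
        h, _ = self.gru(x)
        return self.head(h).log_softmax(dim=-1)

model = PhonemeDecoder()
signals = torch.randn(4, 200, n_channels)            # 4 mouthed sentences, 200 time bins
log_probs = model(signals).transpose(0, 1)            # CTC expects (time, batch, classes)
targets = torch.randint(1, n_phonemes + 1, (4, 30))   # dummy phoneme sequences
loss = nn.CTCLoss()(log_probs, targets,
                    input_lengths=torch.full((4,), 200),
                    target_lengths=torch.full((4,), 30))
```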
To evaluate the system, the team had the patient mouth sentences that were never used in training; the test sentences included some using only the 50-word vocabulary as well as the 125k-word one. The researchers also experimented with adding a language model to the decoder, which improved the error rate from 23.8% to 17.4%, and with reducing the time between training and testing the RNN, to account for day-to-day changes in neural activity. Their conclusion was that the system could see “substantial gains in performance” with further work on language modeling and more robust decoding techniques.
Lead researcher Frank Willett posted about the work on Twitter and answered several questions. In response to a question about whether the RNN predicted the next word that would be spoken, Willett replied:
No next word prediction – the language model simply outputs the best explanation of all RNN outputs produced so far.
Willett also said that the team would publish their code and data after the work is “published in a peer-reviewed journal.”


Researchers from DeepMind and the University of Toronto announced DreamerV3, a reinforcement-learning (RL) algorithm for training AI models for many different domains. Using a single set of hyperparameters, DreamerV3 outperforms other methods on several benchmarks and can train an AI to collect diamonds in Minecraft without human instruction.
The DreamerV3 algorithm includes three neural networks: a world model which predicts the result of actions, a critic which predicts the value of world model states, and an actor which chooses actions to reach valuable states. The networks are trained from replayed experiences on a single Nvidia V100 GPU. To evaluate the algorithm, the researchers used it on over 150 tasks in seven different domains, including simulated robot control and video game playing. DreamerV3 performed well on all domains and set new state-of-the-art performance on four of them. According to the DeepMind team:
World models carry the potential for substantial transfer between tasks. Therefore, we see training larger models to solve multiple tasks across overlapping domains as a promising direction for future investigations.
RL is a powerful technique that can train AI models to solve a wide variety of complex tasks, such as games or robot control. DeepMind has used RL to create models that can defeat the best human players at games such as Go or StarCraft. In 2022, InfoQ covered DayDreamer, an earlier version of the algorithm that can train physical robots to perform complex tasks within only a few hours. However, RL training often requires domain-expert assistance and expensive compute resources to fine-tune the models.
DeepMind’s goal with DreamerV3 was to produce an algorithm that works “out of the box” across many domains without modifying hyperparameters. One particular challenge is that the scale of inputs and rewards can vary a great deal across domains, making it tricky to choose a good loss function for optimization. Instead of normalizing these values, the DeepMind team introduced a symmetrical logarithm or symlog transform to “squash” the inputs to the model as well as its outputs.
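The symlog transform and its inverse are simple enough to state directly; the snippet below shows how they compress large magnitudes while remaining symmetric around zero and roughly linear near zero, which is what lets one set of hyperparameters handle very different reward and input scales.

```python
import numpy as np

# The symlog transform and its inverse (symexp) as described above.
def symlog(x):
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(x):
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)

values = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
squashed = symlog(values)
print(squashed)                   # roughly [-6.9, -0.69, 0.0, 0.69, 6.9]
print(symexp(squashed) - values)  # ~0: symexp inverts symlog
```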
To evaluate DreamerV3’s effectiveness across domains, the researchers evaluated it on seven benchmarks:
- Proprio Control Suite: low-dimensional control tasks
- Visual Control Suite: control tasks with high-dimensional images as inputs
- Atari 100k: 26 Atari games
- Atari 200M: 55 Atari games
- BSuite: RL behavior benchmark
- Crafter: survival video game
- DMLab: 3D environments
DreamerV3 achieved “strong” performance on all of them, and set new state-of-the-art performance on Proprio Control Suite, Visual Control Suite, BSuite, and Crafter. The team also used DreamerV3 with default hyperparameters to train a model that is the first to “collect diamonds in Minecraft from scratch without using human data.” The researchers contrasted this with VPT, which was pre-trained on 70k hours of internet videos of human players.
Lead author Danijar Hafner answered several questions about the work on Twitter. In response to one user, he noted:
[T]he main point of the algorithm is that it works out of the box on new problems, without needing experts to fiddle with it. So it’s a big step towards optimizing real-world processes.
Although the source code for DreamerV3 has not been released, Hafner says it is “coming soon.” The code for the previous version, DreamerV2, is available on GitHub. Hafner notes that V3 includes “better replay buffers” and is implemented in JAX instead of TensorFlow.


AI research groups LAION and CarperAI have released OpenAssistant and trlX, open-source implementations of reinforcement learning from human feedback (RLHF), the algorithm used to train ChatGPT. Independent AI developer Phil Wang has also open-sourced his own implementation of the algorithm.
LAION, the Large-scale Artificial Intelligence Open Network, is a non-profit machine learning research organization dedicated to making AI models, datasets, and code available to the public. In 2022, InfoQ covered LAION’s release of LAION-5B, an AI training dataset containing over five billion image-text pairs. LAION’s latest project is OpenAssistant, which is intended to “give everyone access to a great chat based large language model.” The planned MVP implementation of OpenAssistant will be based on OpenAI’s InstructGPT paper: a dataset of human-generated instructions, a dataset of machine-generated responses and their human rankings, and an implementation of RLHF. According to LAION:
We are not going to stop at replicating ChatGPT. We want to build the assistant of the future, able to not only write email and cover letters, but do meaningful work, use APIs, dynamically research information, and much more, with the ability to be personalized and extended by anyone. And we want to do this in a way that is open and accessible, which means we must not only build a great assistant, but also make it small and efficient enough to run on consumer hardware.
CarperAI is a new lab within the EleutherAI research group, tasked with “improving the performance and safety of large language models (LLMs) with reinforcement learning.” InfoQ previously covered EleutherAI’s development of open-source language model GPT-NeoX. In October 2022, the lab announced a project to train and publicly release “instruction-tuned” models using RLHF. The project is a cooperative effort of several organizations, including HuggingFace, Scale, and Humanloop. As part of this project, CarperAI open-sourced Transformer Reinforcement Learning X (trlX), a framework for fine-tuning HuggingFace language models using RLHF.
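A hedged sketch of what fine-tuning with trlX looks like; the trlx.train entrypoint and keyword names follow early examples in the repository and may have changed, so treat them as assumptions, and note that a real RLHF setup would use a learned reward model rather than the toy scorer shown.

```python
import trlx

# Hedged sketch of fine-tuning a HuggingFace causal LM with trlX's RLHF-style
# loop. The entrypoint and keyword names are assumptions based on early examples
# in the trlX repository.
def reward_fn(samples, **kwargs):
    # Toy reward: prefer shorter completions. A learned preference model goes here.
    return [-float(len(sample)) for sample in samples]

trainer = trlx.train(
    "gpt2",                      # any causal HuggingFace model checkpoint
    reward_fn=reward_fn,
    prompts=["Explain RLHF in one sentence.", "Write a polite email subject line."],
)
```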
Phil Wang, an AI developer known for open-source implementations of deep learning research models such as Imagen and Make-A-Video, shared his work-in-progress implementation of RLHF for the PaLM language model, called PaLM + RLHF. Wang notes that there is no pre-trained model, only a framework for users to train their own. He also recommends that users interested in replicating ChatGPT join the LAION Discord channel.
Although these open-source projects include implementations of ChatGPT’s training methods, they do not have any trained models currently available. Wang’s project FAQ suggests that training might require “millions of dollars of compute + data” to complete. LAION’s roadmap document for OpenAssistant does list efforts to collect data and train models, but isn’t clear on when trained models might be released. CarperAI’s Twitter account noted:
We haven’t released any RLHF models yet officially, just a few small replication efforts of hh-RLHF, learning to summarize, etc in our discord. We can match performance reported in respective papers on these.
Several prominent members of the AI community have discussed these efforts on social media. On Twitter, HuggingFace CTO Julien Chaumond predicted that in six months there will be “10 open reproductions of ChatGPT.” AI researcher Sebastian Raschka replied:
Agreed, there will be many open source implementations of ChatGPT. But there won’t be many high-quality models. I think we underestimate how much people hate labeling (or worse: writing) training data by hand.
Stability AI founder Emad Mostaque tweeted that his company is “working on open chatGPT.” He also said:
Toughest part of open chatGPT creation (aside from millions of bucks for RL bit) is the governance aspect…The nice thing is once all the blood sweat and tears go into creating the models and frameworks they can proliferate like crazy as a new type of dev primitive.
The source code for OpenAssistant, trlX, and PaLM + RLHF are all available on GitHub.


The BigCode Project recently released The Stack, a 6.4TB dataset of de-duplicated source code from permissively licensed GitHub repositories, intended for training code generation AI models. BigCode also released SantaCoder, a 1.1B parameter code generation model trained on The Stack. SantaCoder outperforms similar open-source code generation models.
BigCode is a collaborative organization sponsored by HuggingFace and ServiceNow Research, with the mission of developing responsible and open-source language models. In response to recent criticism of some code generation AI models for using copyrighted code in their training data, BigCode began investigating the performance of models trained only on source code with permissive licenses, such as Apache or MIT. BigCode also created web-based tools for developers to determine if their code is contained in The Stack and to request it be excluded. To test the performance of models trained on The Stack, BigCode trained SantaCoder, which outperforms previous open-source code generation models on the MultiPL-E benchmark. According to BigCode:
We release all permissively licensed files for 30 common programming languages, along with a near-deduplicated version. In future work, we would like to further improve the released dataset. We are open to releasing data of other programming languages, plan to work on methods for removing PII and malicious code, and start experimenting with giving developers the possibility to have their data removed from the dataset. We hope The Stack will be a useful resource for open and responsible research on Code LLMs.
AI models for generating code are currently an active research area. In 2021, InfoQ covered OpenAI’s Codex and GitHub’s Copilot, which are based on GPT-3 language models fine-tuned on code stored in public GitHub repositories. Although these models perform quite well at generating code, they have been criticized for copyright violations. In late 2022, InfoQ covered a lawsuit against Microsoft and OpenAI alleging copyright violations, including lack of attribution required by the licenses of the included source code.
One goal of The Stack is to avoid these violations by only including source code with permissive licenses; that is, “those with minimal restrictions on how the software can be copied, modified, and redistributed.” This includes MIT and Apache 2.0, but excludes “copyleft” licenses such as GPL, in part because copyleft advocates point out that models trained on GPL code could be considered “derivative works” which must themselves adopt the copyleft license.
Because excluding these repositories reduces the amount of training data, the BigCode team investigated whether this would reduce the performance of models trained on the dataset. They found that near-deduplicating the dataset, removing both exact duplicates and files that are very similar, kept model quality competitive with Codex. When training the 1.1B parameter SantaCoder model, however, the team found that filtering the dataset to include only code from 5-star repositories reduced model quality “significantly.”
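Near-deduplication is commonly done with MinHash signatures and locality-sensitive hashing; the sketch below is a generic illustration of that idea, not BigCode’s actual pipeline.

```python
from datasketch import MinHash, MinHashLSH

# Generic illustration of near-deduplication: files whose token sets are very
# similar hash into the same LSH buckets and get flagged as near-duplicates.
def minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for tok in tokens:
        m.update(tok.encode("utf-8"))
    return m

files = {
    "a.py": "def add(a, b): return a + b",
    "b.py": "def add(a, b): return a + b  # addition",  # near-duplicate of a.py
    "c.py": "class Stack: pass",
}
lsh = MinHashLSH(threshold=0.5, num_perm=128)
signatures = {name: minhash(code.split()) for name, code in files.items()}
for name, sig in signatures.items():
    lsh.insert(name, sig)

# a.py and its near-duplicate should land in the same bucket; c.py should not.
print(lsh.query(signatures["a.py"]))
```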
Thomas Wolf, co-founder of HuggingFace, joined a Twitter discussion about SantaCoder. In response to a user’s complaint about the quality of code generated by the model, Wolf replied:
It’s a completion model, not (yet) an instruct fine tuned model so you should formulate your task as a completion task. For instance by writing your prompt as a code comment or docstring.
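Following that tip, a completion-style prompt can be written as a function signature plus docstring. The sketch below assumes the "bigcode/santacoder" checkpoint id and the trust_remote_code flag from the model card at release time.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of Wolf's tip: phrase the task as a completion, e.g. a docstring to be
# finished with code. The checkpoint id and trust_remote_code flag are assumptions.
checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

prompt = '''def fibonacci(n):
    """Return the n-th Fibonacci number."""
'''
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```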
Both The Stack dataset and the SantaCoder model are available on HuggingFace.