Author: Anthony Alford

The BigCode Project recently released The Stack, a 6.4TB dataset of de-duplicated source code from permissively licensed GitHub repositories, intended for training code generation AI models. BigCode also released SantaCoder, a 1.1B parameter code generation model trained on The Stack that outperforms similar open-source code generation models.
BigCode is a collaborative organization sponsored by HuggingFace and ServiceNow Research, with the mission of developing responsible and open-source language models. In response to recent criticism of some code generation AI models for using copyrighted code in their training data, BigCode began investigating the performance of models trained only on source code with permissive licenses, such as Apache or MIT. BigCode also created web-based tools that let developers determine whether their code is contained in The Stack and request that it be excluded. To test the performance of models trained on The Stack, BigCode trained SantaCoder, which outperforms previous open-source code generation models on the MultiPL-E benchmark. According to BigCode:
We release all permissively licensed files for 30 common programming languages, along with a near-deduplicated version. In future work, we would like to further improve the released dataset. We are open to releasing data of other programming languages, plan to work on methods for removing PII and malicious code, and start experimenting with giving developers the possibility to have their data removed from the dataset. We hope The Stack will be a useful resource for open and responsible research on Code LLMs.
AI models for generating code are currently an active research area. In 2021, InfoQ covered OpenAI’s Codex and GitHub’s Copilot, which are based on GPT-3 language models fine-tuned on code stored in public GitHub repositories. Although these models perform quite well at generating code, they have been criticized for copyright violations. In late 2022, InfoQ covered a lawsuit against Microsoft and OpenAI alleging copyright violations, including failure to provide the attribution required by the licenses of the included source code.
One goal of The Stack is to avoid these violations by including only source code with permissive licenses; that is, “those with minimal restrictions on how the software can be copied, modified, and redistributed.” This includes MIT and Apache 2.0 but excludes “copyleft” licenses such as GPL, in part because copyleft advocates point out that models trained on GPL code could be considered “derivative works,” which must themselves adopt the copyleft license.
Because excluding these repositories reduces the amount of training data, the BigCode team investigated whether doing so would hurt the performance of models trained on the dataset. They found that near-deduplicating the dataset, that is, removing both exact duplicates and files that are very similar, kept model quality competitive with Codex. When training the 1.1B parameter SantaCoder model, however, the team discovered that filtering the dataset to include only code from repositories with five or more GitHub stars reduced model quality “significantly.”
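Near-duplicate detection of this kind is commonly implemented with MinHash and locality-sensitive hashing (LSH). The minimal Python sketch below illustrates the idea using the datasketch library; the whitespace tokenization, toy files, and 0.85 similarity threshold are illustrative assumptions, not BigCode's actual pipeline.

# Illustrative near-deduplication with MinHash LSH (datasketch library).
# Tokenization, threshold, and data are toy assumptions, not BigCode's pipeline.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():  # crude whitespace tokens, for illustration only
        m.update(token.encode("utf8"))
    return m

files = {
    "a.py": "def add(x, y): return x + y",
    "b.py": "def add(x,  y):  return x +  y",  # whitespace-only variant of a.py
    "c.py": "class Stack: pass",
}

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for name, text in files.items():
    m = minhash(text)
    if not lsh.query(m):  # keep the file only if nothing similar was kept already
        lsh.insert(name, m)
        kept.append(name)

print(kept)  # ['a.py', 'c.py']: the near-duplicate b.py is dropped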
Thomas Wolf, co-founder of HuggingFace, joined a Twitter discussion about SantaCoder. In response to a user’s complaint about the quality of code generated by the model, Wolf replied:
It’s a completion model, not (yet) an instruct fine tuned model so you should formulate your task as a completion task. For instance by writing your prompt as a code comment or docstring.
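As a concrete illustration of that completion-style prompting, the following minimal sketch loads the model with the HuggingFace transformers library; the "bigcode/santacoder" checkpoint name, the trust_remote_code flag for its custom architecture, and the generation settings are assumptions drawn from the model's HuggingFace listing rather than official sample code.

# Minimal sketch of completion-style prompting for SantaCoder.
# Checkpoint name and settings are assumptions, not official sample code.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Phrase the task as a completion: a function signature plus a docstring.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))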
Both The Stack dataset and the SantaCoder model are available on HuggingFace.

Geoffrey Hinton, professor at the University of Toronto and engineering fellow at Google Brain, recently published a paper on the Forward-Forward algorithm (FF), a technique for training neural networks that uses two forward passes of data through the network, instead of backpropagation, to update the model weights.
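As a rough illustration of the idea (not the paper's exact recipe), a single FF layer can be trained with a purely local objective: raise its "goodness", the sum of squared activations, on real "positive" inputs and lower it on corrupted "negative" inputs. A minimal PyTorch sketch, with an assumed threshold hyperparameter, might look like this:

# Sketch of one Forward-Forward layer update. Each layer is trained with its
# own local loss; in a full network, a layer's (normalized) output is detached
# before being passed on, so no gradients flow between layers.
import torch
import torch.nn.functional as F

layer = torch.nn.Linear(784, 500)
opt = torch.optim.SGD(layer.parameters(), lr=0.01)
theta = 2.0  # goodness threshold (assumed hyperparameter)

def goodness(x):
    # "Goodness" is the sum of squared activations of the layer.
    return layer(x).relu().pow(2).sum(dim=1)

x_pos = torch.randn(32, 784)  # stand-in for real data
x_neg = torch.randn(32, 784)  # stand-in for corrupted/negative data

# Push goodness above theta on positive data and below theta on negative data.
loss = F.softplus(torch.cat([theta - goodness(x_pos),
                             goodness(x_neg) - theta])).mean()
opt.zero_grad()
loss.backward()  # local gradient for this layer only
opt.step()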

Microsoft announced the release of ML.NET 2.0, the open-source machine learning framework for .NET. The release contains several updated natural language processing (NLP) APIs, including Tokenizers, Text Classification, and Sentence Similarity, as well as improved automated ML (AutoML) features.
Program manager Luis Quintanilla announced the release at the recent .NET Conf 2022. The updated NLP APIs are powered by TorchSharp, a .NET wrapper for the popular PyTorch deep learning framework. The release includes the EnglishRoberta tokenization model and a TorchSharp implementation of NAS-BERT, which is used by the Text Classification and Sentence Similarity APIs. Updates to AutoML include an API for automated data pre-processing and a set of APIs for running experiments to find the best models and hyperparameters. Quintanilla also announced a new release of the Model Builder tool for Visual Studio, which includes the new text classification scenario and advanced training options.
The Text Classification API, which was previewed earlier this year, is based on the NAS-BERT model published by Microsoft Research in 2021. This model was developed using neural architecture search (NAS), resulting in smaller models than standard BERT while maintaining accuracy. Users can fine-tune the pre-trained NAS-BERT model with their own data to fit custom use cases. The Sentence Similarity API uses the same pre-trained model, but instead of classifying a single input string, it takes two strings as input and outputs a score indicating how similar in meaning the two inputs are.
The AutoML APIs are based on Microsoft’s Fast Library for Automated Machine Learning & Tuning (FLAML). While the Featurizer API is designed for pre-processing, the rest of the APIs work together to search for the best set of hyperparameters. The Experiment API coordinates the optimization of a Sweepable pipeline over a Search Space using a Tuner. Developers can use the Sweepable API to define the training pipeline for hyperparameter optimization of their models; the Search Space API to configure the range of the hyperparameter search space for that pipeline; and the Tuner API to choose a search algorithm for that space. The release includes several tuner algorithms, including basic grid and random searches as well as Bayesian and Frugal optimizers.
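Since these APIs build on FLAML, FLAML's own Python tuning interface gives a feel for how an experiment sweeps a search space with a tuner; the objective function and search space below are toy assumptions, not ML.NET sample code.

# Illustrative hyperparameter sweep with FLAML's tune API, on which ML.NET's
# AutoML is based. The objective and search space are toy assumptions.
from flaml import tune

def evaluate(config):
    # Stand-in for training a model and reporting its validation score.
    score = -(config["lr"] - 0.1) ** 2 - 0.001 * config["num_leaves"]
    return {"score": score}

analysis = tune.run(
    evaluate,
    config={  # the search space to sweep over
        "lr": tune.loguniform(1e-4, 1.0),
        "num_leaves": tune.randint(4, 128),
    },
    metric="score",
    mode="max",        # maximize the reported metric
    num_samples=20,    # number of trials
)
print(analysis.best_config)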
Quintanilla also gave viewers a preview of the ML.NET roadmap. Future plans for deep learning features include new scenarios and APIs for question answering, named-entity recognition, and object detection. There are also plans for TorchSharp integrations for custom scenarios and improvements to the ONNX integration. Other plans include upgrades to the LightGBM implementation and to the implementation of the IDataView interface, as well as improvements to the AutoML API.
At the end of his presentation, Quintanilla answered questions from the audience. One viewer asked about support for different vendors’ GPUs and accelerator libraries, and Quintanilla noted that currently only NVIDIA’s CUDA accelerator is supported. When another viewer asked whether ML.NET’s object detection algorithms would run fast enough to support a live video stream, Quintanilla replied:
We want to focus on performance. We’re introducing new deep learning scenarios and we realized that performance is key there, so performance is a focus for us going forward.
The ML.NET source code is available on GitHub.

Meta AI Research recently open-sourced CICERO, an AI agent that can beat most humans at Diplomacy, a strategy game that requires coordinating plans with other players. CICERO combines chatbot-like dialogue capabilities with strategic reasoning, and recently placed first in an online Diplomacy tournament against human players.
CICERO was described in a paper published in the journal Science. CICERO uses a 2.7B parameter language model to handle dialogue between itself and other players. To determine its moves, CICERO’s planning algorithm uses the dialogue to help predict what other players are likely to do, as well as what other players think CICERO will do. In turn, the output of the planner provides intents for the dialogue model. To evaluate CICERO, the team entered it anonymously in 40 online Diplomacy games; the AI achieved a score more than double that of the human average. According to the Meta team,
While we’ve made significant headway in this work, both the ability to robustly align language models with specific intentions and the technical (and normative) challenge of deciding on those intentions remain open and important problems. By open sourcing the CICERO code, we hope that AI researchers can continue to build off our work in a responsible manner. We have made early steps towards detecting and removing toxic messages in this new domain by using our dialogue model for zero-shot classification. We hope Diplomacy can serve as a safe sandbox to advance research in human-AI interaction.
Diplomacy is a strategy board game in which players must capture a majority of territories called supply centers to win. There is no random component in the game; instead, battles are determined by numerical superiority. This often requires players to cooperate, so the bulk of gameplay consists of players sending messages to each other to coordinate their actions. Occasionally players engage in deceit: for example, promising to help another player while actually planning to attack that player.
To be successful, therefore, an AI must do more than generate messages of human-level quality: the messages must make sense given the state of the game board, and they must lead other players to trust the AI. To generate the dialogue, Meta used a pre-trained R2C2 language model fine-tuned on a dataset of almost 13M messages from online Diplomacy games. The generated dialogue is conditioned on intents produced by a planning module; the intents are the most likely actions that the message’s sender and receiver will take after reading the message.
CICERO’s planning module generates intents by predicting other players’ likely actions, given the state of the board and messages from those players, then choosing an optimal action for itself. To model the likely actions of the other players, CICERO uses an iterative planning algorithm called piKL which incorporates information from the dialogues with other players. To train the planning module, the Meta researchers used a self-play algorithm similar to that used by AlphaZero.
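At its core, piKL regularizes each player's predicted policy to stay close to a human-like "anchor" policy learned from imitation, while still responding to expected payoffs. A toy numpy sketch of that KL-regularized update for a two-player matrix game might look like this; the payoffs, anchor policies, zero-sum assumption, and lambda value are invented for illustration, and CICERO's actual planner is far more elaborate.

# Toy sketch of a piKL-style update: each policy is a KL-regularized best
# response that stays close to an imitation-learned "anchor" policy.
# Payoffs, anchors, zero-sum assumption, and lambda are invented here.
import numpy as np

payoff = np.array([[3.0, 0.0],     # row player's payoffs
                   [5.0, 1.0]])
anchor_row = np.array([0.8, 0.2])  # human-like anchor policies
anchor_col = np.array([0.5, 0.5])
lam = 1.0  # higher lambda keeps policies closer to the anchors

pi_row, pi_col = anchor_row.copy(), anchor_col.copy()
for _ in range(50):
    q_row = payoff @ pi_col        # expected payoff of each row action
    q_col = -(payoff.T @ pi_row)   # column payoffs (zero-sum assumed)
    # KL-regularized best response: pi(a) proportional to anchor(a) * exp(Q(a) / lambda)
    pi_row = anchor_row * np.exp(q_row / lam)
    pi_row /= pi_row.sum()
    pi_col = anchor_col * np.exp(q_col / lam)
    pi_col /= pi_col.sum()

print(pi_row, pi_col)  # policies balancing payoff against the human anchors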
The Meta team entered CICERO into anonymous league play for online Diplomacy games. The AI played 40 games, including an 8-game tournament with 21 players, in which CICERO placed first. Across all 40 games, CICERO ranked in the top 10 percent of players, with an average score of 25.8%, while the average score of its 82 human opponents was 12.4%.
In a Twitter thread about the work, CICERO co-author Mike Lewis replied to a question about whether CICERO would “backstab” (that is, lie to) other players:
It’s designed to never intentionally backstab – all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind…
The CICERO source code is available on GitHub.