MMS • Anthony Alford
The Abu Dhabi government’s Technology Innovation Institute (TII) released Falcon 180B, currently the largest openly available large language model (LLM). Falcon 180B contains 180 billion parameters and outperforms GPT-3.5 on the MMLU benchmark.
Falcon 180B was trained on 3.5 trillion tokens of text, 4x the amount of data used to train Llama 2. Besides the base model, TII also released a chat-specific model that is fine-tuned on instruction datasets. The models are available for commercial use, but the license includes several restrictions and requires additional permission for use in a hosted service. Although TII says that Falcon’s performance is “difficult to rank definitively,” it is “on par” with PaLM 2 Large and “somewhere between GPT 3.5 and GPT4,” depending on the benchmark used. According to TII:
As a key technology enabler, we firmly believe that innovation should be allowed to flourish. That is why we decided to open source or open access all our Falcon models. We are launching our latest Falcon 180B LLM as an open access model for research and commercial use. Together, we can mobilize and fast-track breakthrough solutions worldwide – catalyzing innovation for outsized impact.
Falcon 180B is based on TII’s smaller model, Falcon 40B, which was released earlier this year. One innovation in the Falcon architecture was the use of multiquery attention, in which the attention heads share a single set of keys and values; this reduces the model’s memory bandwidth requirements when running inference. Both models were trained on TII’s RefinedWeb dataset; for the new 180B model, the amount of data was increased from 1.5 trillion tokens to 3.5 trillion. Training Falcon 180B took approximately 7 million GPU-hours on Amazon SageMaker, using 4,096 GPUs concurrently.
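To give a rough sense of why multiquery attention lowers memory traffic at inference time, the sketch below compares the size of the key/value cache when every query head keeps its own keys and values against the case where all heads share one set. The layer count, head count, head dimension, and sequence length are hypothetical values chosen for illustration; they are not Falcon's actual configuration, and the helper `kv_cache_bytes` is an assumed name, not part of any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate size of the key/value cache for one decoding session.

    Each layer stores two tensors (keys and values), each shaped
    [batch_size, n_kv_heads, seq_len, head_dim]. bytes_per_value=2
    corresponds to a 16-bit format such as bfloat16.
    """
    return (2 * n_layers * batch_size * n_kv_heads
            * seq_len * head_dim * bytes_per_value)


# Hypothetical model dimensions, for illustration only (not Falcon's).
N_LAYERS = 32
N_QUERY_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 2048

# Standard multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# Multiquery attention: all query heads share a single key/value head.
mqa_bytes = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head cache: {mha_bytes / 2**20:.0f} MiB")
print(f"multiquery cache: {mqa_bytes / 2**20:.0f} MiB")
print(f"reduction factor: {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads. Because the cache is re-read for every generated token, a smaller cache means less data moved per decoding step.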
On X (formerly Twitter), several users posted about Falcon 180B. One user speculated that:
GDPR in the EU may make the Falcon 180B model the only viable option for those who prioritize data localization and privacy.
Although the model’s size makes it difficult for most users to run locally, Hugging Face scientist Clémentine Fourrier pointed out that there is no difference in inference quality “between the 4-bit Falcon-180B and the bfloat-16 one,” meaning that users could reduce memory needs by 75% by storing each weight in 4 bits instead of 16. Georgi Gerganov, developer of llama.cpp, a package that helps users run LLMs on their personal hardware, claimed to be running the model on an Apple M2 Ultra.
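The 75% figure follows from the bit widths alone. As a back-of-the-envelope sketch, the snippet below estimates the memory needed just to hold 180 billion weights at 16 bits and at 4 bits. It assumes every parameter is stored at the stated width and ignores quantization metadata (such as per-block scales), activations, and the key/value cache, so real-world footprints will be somewhat higher. The helper `weight_memory_gb` is a hypothetical name used for illustration.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 180_000_000_000  # 180 billion parameters

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # bfloat16: 16 bits per weight
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantization

reduction = 1 - int4_gb / bf16_gb

print(f"bfloat16 weights: {bf16_gb:.0f} GB")
print(f"4-bit weights:    {int4_gb:.0f} GB")
print(f"reduction:        {reduction:.0%}")
```

By this estimate the weights drop from about 360 GB to about 90 GB.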
Commenting on the model’s generative capabilities, HyperWrite CEO Matt Shumer noted TII’s claim that the model’s performance was between GPT-3.5 and GPT-4 and predicted, “We’re now less than two months away from GPT-4-level open-source models.” NVIDIA’s senior AI scientist Dr. Jim Fan took issue with the small share of source code in the model’s training data:
Though it’s beyond me why code is only 5% in the training mix. It is by far the most useful data to boost reasoning, master tool use, and power AI agents. In fact, GPT-3.5 is finetuned from a Codex base….I don’t see any coding benchmark numbers. From the limited code pretraining, I’d assume it isn’t good at it. One cannot claim “better than GPT-3.5” or “approach GPT-4” without coding. It should’ve been an integral part in the pretraining recipe, not a finetuning after-thought.