MMS • Anthony Alford
Article originally posted on InfoQ.
OpenAI recently announced GPT-4, the next generation of their GPT family of large language models (LLMs). GPT-4 can accept both text and image inputs and outperforms state-of-the-art systems on several natural language processing (NLP) benchmarks. The model also scored in the 90th percentile on a simulated bar exam.
OpenAI’s president and co-founder, Greg Brockman, demonstrated the model’s capabilities in a recent livestream. The model was trained using the same infrastructure as the previous generation model, GPT-3.5, and like ChatGPT it has been fine-tuned using reinforcement learning from human feedback (RLHF). However, GPT-4 features several improvements over the previous generation. Besides the ability to handle image input, the default context length has doubled, from 4,096 tokens to 8,192. There is also a limited-access version that supports 32,768 tokens, which is approximately 50 pages of text. The model’s response behavior is more steerable via a system prompt. The model also produces fewer hallucinations than GPT-3.5 when measured on benchmarks such as TruthfulQA. According to OpenAI:
We look forward to GPT-4 becoming a valuable tool in improving people’s lives by powering many applications. There’s still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model.
Although OpenAI has not released details of the model architecture or training dataset, they did publish a technical report showing its results on several benchmarks, as well as a high-level overview of their efforts to identify and mitigate the model’s risk of producing harmful output. Because fully training the model requires so much computational power and time, they also developed techniques to predict the model’s final performance, given performance data for smaller models. According to OpenAI, this will “improve decisions around alignment, safety, and deployment.”
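The idea of predicting a large model's performance from smaller runs can be illustrated with a simple extrapolation. The following is a minimal sketch, not OpenAI's actual method: it assumes the loss follows a plain power law in training compute, loss = a * compute^b, and fits it by least squares in log-log space. The function names and all numbers are hypothetical, and the "measurements" are synthetic values generated from a known law so that the fit can be checked.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b, done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic data (an assumption for illustration): losses of four small runs,
# generated from a made-up law with a = 20.0 and b = -0.05.
small_run_compute = [1e18, 1e19, 1e20, 1e21]
small_run_loss = [20.0 * c ** -0.05 for c in small_run_compute]

a, b = fit_power_law(small_run_compute, small_run_loss)

# Extrapolate to a budget 1,000x larger than the biggest small run.
predicted_large_run_loss = predict_loss(a, b, 1e24)
print(f"a={a:.3f}, b={b:.3f}, predicted loss={predicted_large_run_loss:.3f}")
```

Real training curves are noisy and may flatten toward an irreducible loss, so a practical fit would likely need an extra constant term and some measure of uncertainty; the sketch only shows the basic extrapolation step.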
To help evaluate their models, OpenAI has open-sourced Evals, a framework for benchmarking LLMs. The benchmark examples, or evals, typically consist of prompt inputs to the LLM along with expected responses. The repo already contains several eval suites, including some implementations of existing benchmarks such as MMLU, as well as other suites where GPT-4 does not perform well, such as logic puzzles. OpenAI says they will use the Evals framework to track performance when new model versions are released; they also intend to use the framework to help guide their future development of model capabilities.
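The prompt-plus-expected-response structure described above can be sketched in a few lines. This is an illustrative sketch only and does not use the Evals framework's actual API: `EvalSample`, `exact_match_accuracy`, and `toy_model` are hypothetical names, and the toy model is a canned lookup standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalSample:
    """One benchmark example: a prompt and the response we expect."""
    prompt: str
    expected: str


def exact_match_accuracy(
    model: Callable[[str], str], samples: Sequence[EvalSample]
) -> float:
    """Fraction of samples where the model's answer equals the expected one."""
    if not samples:
        raise ValueError("at least one sample is required")
    correct = sum(
        1 for s in samples if model(s.prompt).strip() == s.expected.strip()
    )
    return correct / len(samples)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: answers from a fixed lookup table."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know")


samples = [
    EvalSample("What is 2 + 2?", "4"),
    EvalSample("What is the capital of France?", "Paris"),
    EvalSample("Name the largest planet in the Solar System.", "Jupiter"),
    EvalSample("What is the chemical symbol for gold?", "Au"),
]

accuracy = exact_match_accuracy(toy_model, samples)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Exact string matching is the simplest possible grader; free-form answers would generally need a more tolerant comparison.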
Several users discussed GPT-4 in a thread on Hacker News. One commenter said:
After watching the demos I’m convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patient’s medical history in the prompt, a lawyer an entire case history, etc….What [percentage] of people can hold 25,000 words worth of information in their heads, while effectively reasoning with and manipulating it?
However, several other users pointed out that medical and legal applications would require better data privacy guarantees from OpenAI. Some suggested that a homomorphic encryption scheme, where the GPT model operates on encrypted input, might be one solution.
Developers interested in using the model can join OpenAI’s waitlist for access.