Author: Anthony Alford


Apple Machine Learning Research published a paper titled The Illusion of Thinking, which investigates the abilities of Large Reasoning Models (LRMs) on a set of puzzles. As the complexity of the puzzles increases, the researchers found that LRMs encounter a “collapse” threshold where the models reduce their reasoning effort, indicating a limit to the models’ scalability.
For their experiments, Apple researchers chose four puzzle problems, including Tower of Hanoi, and a variety of LRMs and standard LLMs, including o3-mini and DeepSeek-R1. Each puzzle’s complexity could be varied; for example, the Tower of Hanoi puzzle can have a variable number of disks. They found that as complexity increased, model behavior went through three regimes: in the first, with simple problems, both reasoning and non-reasoning models performed similarly well. In the second, medium complexity regime, the reasoning models with their Chain-of-Thought (CoT) inference performed better than LLMs. But in the high complexity regime, both groups’ performance “collapsed to zero.” According to Apple,
In this study, we probe the reasoning mechanisms of frontier LRMs through the lens of problem complexity….Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds….These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.
LRMs such as o3 and DeepSeek-R1 are LLMs that have been fine-tuned to generate step-by-step instructions for themselves before producing a response to users; in essence, the models “think out loud” to produce better answers. This allows them to outperform their “standard” LLM counterparts on many tasks, especially coding, mathematics, and science benchmarks.
As part of their experiments, the Apple team analyzed the reasoning traces generated by the models. They noted that for simpler problems, the models would often “overthink”: the correct solution would appear early in the trace, but the models would continue to explore incorrect ideas. In medium complexity problems, however, the models would explore incorrect solutions before finding the correct one.
Apple’s paper sparked a wide debate in the AI community. Gary Marcus, a cognitive scientist and critic of the current state of AI, wrote about the research, saying:
What the Apple paper shows, most fundamentally, regardless of how you define [Artificial General Intelligence (AGI)], is that LLMs are no substitute for good well-specified conventional algorithms. (They also can’t play chess as well as conventional algorithms, can’t fold proteins like special-purpose neurosymbolic hybrids, can’t run databases as well as conventional databases, etc.)
Open source developer and AI commentator Simon Willison pointed out:
I’m not interested in whether or not LLMs are the “road to AGI”. I continue to care only about whether they have useful applications today, once you’ve understood their limitations. Reasoning LLMs are a relatively new and interesting twist on the genre. They are demonstrably able to solve a whole bunch of problems that previous LLMs were unable to handle, hence why we’ve seen a rush of new models from OpenAI and Anthropic and Gemini and DeepSeek and Qwen and Mistral….They’re already useful to me today, whether or not they can reliably solve the Tower of Hanoi….
Apple acknowledges several limitations of their research, noting in particular that their experiments mostly relied on “black box” API calls, leaving them unable to examine the inner state of the models. They also agree that the use of puzzles means that their conclusions may not generalize to all reasoning domains.


Google released the Gemma 3 QAT family, quantized versions of their open-weight Gemma 3 language models. The models use Quantization-Aware Training (QAT) to maintain high accuracy when the weights are quantized from 16 to 4 bits.
All four Gemma 3 model sizes are now available in QAT versions: 1B, 4B, 12B, and 27B parameters. The quantized versions require as little as 25% of the VRAM needed by the 16-bit models. Google claims that the 27B model can run on a desktop NVIDIA RTX 3090 GPU with 24GB VRAM, while the 12B model can run on a laptop NVIDIA RTX 4060 GPU with 8GB VRAM. The smaller models can run on mobile phones or other edge devices. By using Quantization-Aware Training, Google reduced the accuracy loss from quantization by as much as 54%. According to Google,
While top performance on high-end hardware is great for cloud deployments and research, we heard you loud and clear: you want the power of Gemma 3 on the hardware you already own. We’re committed to making powerful AI accessible, and that means enabling efficient performance on the consumer-grade GPUs found in desktops, laptops, and even phones…Bringing state-of-the-art AI performance to accessible hardware is a key step in democratizing AI development…We can’t wait to see what you build with Gemma 3 running locally!
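The VRAM savings follow directly from per-parameter storage. A rough back-of-the-envelope sketch (weights only, ignoring activations and KV cache, so the real footprint is somewhat higher):

```python
# Rough estimate of weight storage for the 27B model.
params = 27e9

gb_16bit = params * 2 / 1e9    # bf16: 2 bytes per parameter  -> ~54 GB
gb_4bit = params * 0.5 / 1e9   # int4: 0.5 bytes per parameter -> ~13.5 GB

print(f"16-bit weights: ~{gb_16bit:.1f} GB")
print(f"4-bit weights:  ~{gb_4bit:.1f} GB")
```

The ~13.5 GB figure for 4-bit weights lines up with the “just 13 GB of weights” observation quoted below.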
InfoQ covered Google’s initial launch of the Gemma series in 2024, which was quickly followed by Gemma 2. The open-weight models achieved performance competitive with models 2x larger by incorporating design elements from Google’s flagship Gemini LLMs. The latest iteration, Gemma 3, has performance improvements that make it the “top open compact model,” according to Google. Gemma 3 also added vision capabilities, except in the 1B size.
While the unquantized Gemma 3 models exhibit impressive performance for their size, they still require substantial GPU resources. For example, the unquantized 12B model requires an RTX 5090 with 32GB of VRAM. To allow the quantization of model weights without sacrificing performance, Google used QAT. This technique simulates inference-time quantization during training, instead of simply quantizing the model after it’s trained.
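As a minimal sketch of the idea (not Google’s actual training recipe), quantization-aware training inserts a “fake quantization” step into the forward pass so the model learns weights that remain accurate after real 4-bit rounding, while gradients still flow to the full-precision copy via a straight-through estimator:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate low-bit rounding in the forward pass; backward sees identity."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, gradients flow to w unchanged.
    return w + (w_q - w).detach()

# During QAT, a layer's forward pass would use fake_quantize(self.weight)
# instead of self.weight, so the trained weights survive true 4-bit storage.
```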
Google dev Omar Sanseviero wrote about using the QAT models in a thread on X and suggested there was still room for improvement:
We still recommend playing with the models (e.g. we didn’t quantize the embeddings, some people even did 3-bit quantization and it was working better than naive 4 bits)
Users praised the QAT models’ performance in a discussion on Hacker News:
I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
Django Web Framework co-creator Simon Willison wrote about his experiments with the models and said:
Having spent a while putting it through its paces via Open WebUI and Tailscale to access my laptop from my phone I think this may be my new favorite general-purpose local model. Ollama appears to use 22GB of RAM while the model is running, which leaves plenty on my 64GB machine for other applications.
The Gemma 3 QAT model weights are available on Hugging Face and in several popular LLM frameworks, including Ollama, LM Studio, Gemma.cpp, and llama.cpp.


Google released the Agent2Agent (A2A) Protocol, an open-source specification for building AI agents that can connect with other agents that support the protocol. Google has enlisted over 50 technology partners to contribute to A2A’s development.
Google announced the release at the recent Google Cloud Next conference. A2A is billed as a “complement” to Anthropic’s Model Context Protocol (MCP) and defines a client-server relationship between AI agents. Google developed the protocol with help from partners like Salesforce, Atlassian, and LangChain, with the goal of creating an interoperability standard for any agent, regardless of vendor or framework. According to Google,
A2A has the potential to unlock a new era of agent interoperability, fostering innovation and creating more powerful and versatile agentic systems. We believe that this protocol will pave the way for a future where agents can seamlessly collaborate to solve complex problems and enhance our lives. We’re committed to building the protocol in collaboration with our partners and the community in the open. We’re releasing the protocol as open source and setting up clear pathways for contribution.
InfoQ covered Anthropic’s MCP release last year. Intended to solve the “MxN” problem—the combinatorial difficulty of integrating M different LLMs with N different tools—MCP defines a client-server architecture and a standard protocol that LLM vendors and tool builders can follow.
Google’s documentation points out that A2A solves a different problem than MCP does: it “allows agents to communicate as agents (or as users) instead of as tools.” The difference between a tool and an agent is that tools have structured I/O and behavior, while agents are autonomous and can solve new tasks using reasoning. In Google’s vision, an agentic application requires both tools and agents. However, A2A docs do recommend that “applications model A2A agents as MCP resources.”
A2A defines three types of actor: remote agents, which are “blackbox” agents on an A2A server; clients that request action from remote servers; and users (human users or services) that want to accomplish tasks using an agentic system. Like MCP, A2A uses JSON-RPC over HTTP for communication between clients and remote agents. The core abstraction used in the communication spec between agents is the task, which is created by a client and fulfilled by a remote agent.
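As an illustrative sketch only, a client’s task request to a remote agent might look roughly like the JSON-RPC call below; the endpoint URL, method name, and field names here are assumptions for illustration, so consult the A2A specification for the actual schema:

```python
import requests
import uuid

# Hypothetical A2A client call: the client creates a task and asks a remote
# agent (the "blackbox" A2A server) to fulfill it.
payload = {
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),
    "method": "tasks/send",            # assumed method name
    "params": {
        "id": str(uuid.uuid4()),       # task id, created by the client
        "message": {
            "role": "user",
            "parts": [{"type": "text", "text": "Summarize open support tickets."}],
        },
    },
}

response = requests.post("https://agent.example.com/a2a", json=payload, timeout=30)
print(response.json())                 # remote agent returns the task with status/artifacts
```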
In a Hacker News discussion, several users compared A2A to MCP; some were not sure what value A2A provided over MCP, while others saw it as a “superset” of MCP and praised its “clear documentation and explanation” compared to MCP. User TS_Posts claimed to be working on A2A and wrote:
[T]he current specification and samples are early. We are working on many more advanced examples and official SDKs and client/servers. We’re working with partners, other Google teams, and framework providers to turn this into a stable standard. We’re doing it in the open – so there are things that are missing because (a) it’s early and (b) we want partners and the community to bring features to the table. tldr – this is NOT done. We want your feedback and sincerely appreciate it!
The A2A source code is available on GitHub. Google also released a demo video showing collaboration between agents from different frameworks.


Google DeepMind’s AlphaGeometry2 (AG2) AI model solved 84% of the geometry problems from the last 25 years of International Math Olympiads (IMO), outperforming the average human gold-medalist performance.
AlphaGeometry2 is a new iteration of DeepMind’s earlier geometry AI, AlphaGeometry (AG1), which could only solve 54% of the IMO problems. Both models operate by using a domain-specific formal language to describe the problems and a symbolic deductive engine to generate proofs. The new model’s improvements include a more powerful LLM based on Gemini, which translates the natural language form of the problem into formal language. AG2 solved 42 of the 50 IMO geometry problems from the years 2000 to 2024, while the average gold medalist solves about 41. Flagship commercial reasoning LLMs, such as OpenAI’s o1 and Gemini Thinking, cannot solve any of the problems. According to DeepMind,
Despite achieving an impressive 84% solve rate on all 2000-2024 IMO geometry problems, there is still room for improvement…AG2 has not solved all IMO and IMO [short list] problems. We hypothesize that breaking problems into subproblems and applying reinforcement learning approaches could close this gap. Finally, in this paper we reported progress on building a fully automated geometry problem solving system, which takes input in natural language and outputs a solution reliably without any hallucinations. Despite good initial results, we think the auto-formalization can be further improved with more formalization examples and supervised fine-tuning.
AG2, like AG1, solves geometry problems by stating them in a formal language which consists of predicates: for example, acompute a b c d means “Find the angle between AB and CD.” AG2’s predicates can cover 88% of the IMO problems; the model will not attempt to solve the other problems.
But first, the problems written in natural language must be expressed in this formal language. To do this, DeepMind uses a Gemini LLM with few-shot prompting: the prompts contain “several dozens” of examples of problem translation. This approach is “very consistent and makes almost no mistakes” on the easier problems.
Once the problems are specified as formal predicates, they are solved using a symbolic engine called Deductive Database Arithmetic Reasoning (DDAR). If the engine fails to find a proof, AG2 uses a language model and tree search algorithm to generate auxiliary constructions, then it re-runs the DDAR engine; this loop is repeated until a proof is found.
Writing on X, Berkeley CS PhD student Yuxi Liu said,
AlphaGeometry2 is pretty cool, but clearly not bitter-lessoned. It has a very 1950s auto theorem proving feel, with handcrafted representation language, logical inference engine, etc…They are just doing autoformalization (succeeding 30/39) and proposing auxiliary constructions during tree search. Many of them require just a single auxiliary construction! Though there are cursed examples that required 12.
Oxford University ML researcher Simon Frieder also wrote on X:
AlphaGeometry2 was published, 2.5 months since we released Newclid without much fanfare (in true scientist style! :D) and two months after TongGeometry. It seems no code was provided for AG2. So now we have two closed systems, AlphaGeometry2 and TongGeometry that we cannot compare. Newclid…is fully open-source, fixed many AlphaGeometry bugs and slightly improved it in terms of performance – and we also have GeoGebra support for better input.
Although the AG2 code has not been released, the code for AG1 is available on GitHub.


Google DeepMind released PaliGemma 2, a family of vision-language models (VLM). PaliGemma 2 is available in three different sizes and three input image resolutions and achieves state-of-the-art performance on several vision-language benchmarks.
PaliGemma 2 is an update of the PaliGemma family, which was released in 2024. It uses the same SigLIP-So400m vision encoder as the original PaliGemma, but upgrades to the Gemma 2 LLM. The PaliGemma 2 family contains nine different models, combining LLM sizes of 2B, 9B, and 27B parameters with input image resolutions of 224, 448, and 896 pixels squared. The research team evaluated PaliGemma 2 on a variety of benchmarks, where it set new state-of-the-art records on several tasks, including optical character recognition (OCR), molecular structure recognition, and radiography report generation. According to Google:
We’re incredibly excited to see what you create with PaliGemma 2. Join the vibrant Gemma community, share your projects to the Gemmaverse, and let’s continue to explore the boundless potential of AI together. Your feedback and contributions are invaluable in shaping the future of these models and driving innovation in the field.
PaliGemma 2 is a combination of a pre-trained SigLIP-So400m image encoder and a Gemma 2 LLM. This combination is then further pre-trained on a 1B example multimodal dataset. Besides the pre-trained base models, Google also released variants that were fine-tuned on the Descriptions of Connected and Contrasting Images (DOCCI) dataset, a collection of images and corresponding detailed descriptions. The fine-tuned variants can generate long, detailed captions of images, which contain “more factually aligned sentences” than those produced by other VLMs.
Google created other fine-tuned versions for benchmarking purposes. The benchmark tasks included OCR, table structure recognition, molecular structure recognition, optical music score recognition, radiography report generation, and spatial reasoning. The fine-tuned PaliGemma 2 outperformed previous state-of-the-art models on most of these tasks.
The team also evaluated performance and inference speed for quantized versions of the model running on a CPU instead of a GPU. Reducing the model weights from full 32-bit to mixed-precision quantization showed “no practical quality difference.”
In a Hacker News discussion about the model, one user wrote:
Paligemma proves easy to train and useful in fine-tuning. Its main drawback was not being able to handle multiple images without being partly retrained. This new version does not seem to support multiple images as input at once. Qwen2vl does. This is useful for vision RAG typically.
Gemma team member Glenn Cameron wrote about PaliGemma 2 on X. In response to a question about using it to control a robot surgeon, Cameron said:
I think it could be taught to generate robot commands. But I wouldn’t trust it with such high-stakes tasks…Notice the name of the model is PaLM (Pathways Language Model). The “Pa” in PaliGemma stands for “Pathways”. It is named that because it continues the line of PaLI (Pathways Language and Image) models in a combination with the Gemma family of language models.
InfoQ previously covered Google’s work on using VLMs for robot control, including Robotics Transformer 2 (RT-2) and PaLM-E, a combination of their PaLM and Vision Transformer (ViT) models.
The PaliGemma 2 base models, as well as fine-tuned versions and a script for fine-tuning the base model, are available on Hugging Face, which also hosts a web-based visual question answering demo of a fine-tuned PaliGemma 2 model.
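A minimal sketch of loading one of the base models with the Hugging Face transformers library and requesting a caption; the repository id and prompt format below are assumptions to verify against the model card for the size and resolution you want:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"   # assumed repo id: 3B LLM, 224px inputs
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs accelerate
)

image = Image.open("example.jpg")
prompt = "<image>caption en"   # prompt format may differ by transformers version; see the model card
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```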


Researchers from InstaDeep and NVIDIA have open-sourced Nucleotide Transformers (NT), a set of foundation models for genomics data. The largest NT model has 2.5 billion parameters and was trained on genetic sequence data from 850 species. It outperforms other state-of-the-art genomics foundation models on several genomics benchmarks.
InstaDeep published a technical description of the models in Nature. NT uses an encoder-only Transformer architecture and is pre-trained using the same masked language model objective as BERT. The pre-trained NT models can be used in two ways: to produce embeddings for use as features in smaller models, or to be fine-tuned with a task-specific head replacing the language model head. InstaDeep evaluated NT on 18 downstream tasks, such as epigenetic marks prediction and promoter sequence prediction, and compared it to three baseline models. NT achieved the “highest overall performance across tasks” and outperformed all other models on promoter and splicing tasks. According to InstaDeep:
The Nucleotide Transformer opens doors to novel applications in genomics. Intriguingly, even probing of intermediate layers reveals rich contextual embeddings that capture key genomic features, such as promoters and enhancers, despite no supervision during training. [We] show that the zero-shot learning capabilities of NT enable [predicting] the impact of genetic mutations, offering potentially new tools for understanding disease mechanisms.
The best-performing NT model, Multispecies 2.5B, contains 2.5 billion parameters and was trained on data from 850 species of “diverse phyla,” including bacteria, fungi, and invertebrates as well as mammals such as mice and humans. Because this model outperformed a 2.5B parameter NT model trained only on human data, InstaDeep says that the multi-species data is “key to improving our understanding of the human genome.”
InstaDeep compared Multispecies 2.5B’s performance to three other genomics foundational models: Enformer, HyenaDNA, and DNABERT-2. All models were fine-tuned for each of the 18 downstream tasks. While Enformer had the best performance on enhancer prediction and “some” chromatin tasks, NT was the best overall. It outperformed HyenaDNA on all tasks, even though HyenaDNA was trained on the “human reference genome.”
Besides its use on downstream tasks, InstaDeep also investigated the model’s ability to predict the severity of genetic mutations. This was done using “zero-shot scores” of sequences, calculated using cosine distances in embedding space. They noted that this score produced a “moderate” correlation with severity.
An InstaDeep employee, posting as BioGeek, joined a Hacker News discussion about the work, pointing out example use cases in a Hugging Face notebook. BioGeek also mentioned a previous InstaDeep model called ChatNT:
[Y]ou can ask natural language questions like “Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5.” and the ChatNT will answer with “The degradation rate for this sequence is 1.83.”
Another user said:
I’ve been trialing a bunch of these models at work. They basically learn where the DNA has important functions, and what those functions are. It’s very approximate, but up to now that’s been very hard to do from just the sequence and no other data.
The Nucleotide Transformers code is available on GitHub. The model files can be downloaded from Hugging Face.
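A minimal sketch of the embeddings-as-features usage described above, using the transformers library; the repository id is an assumption to check against InstaDeep’s Hugging Face page, and some checkpoints may require trust_remote_code=True:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "InstaDeepAI/nucleotide-transformer-2.5b-multi-species"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

sequence = "ATTCCGATTCCGATTCCG"                      # toy DNA sequence
tokens = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**tokens, output_hidden_states=True)

# Mean-pool the last hidden layer to get one embedding vector per sequence,
# which can then feed a small downstream classifier.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)
```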


The PyTorch Foundation recently released PyTorch version 2.5, which contains support for Intel GPUs. The release also includes several performance enhancements, such as the FlexAttention API, TorchInductor CPU backend optimizations, and a regional compilation feature which reduces compilation time. Overall, the release contains 4095 commits since PyTorch 2.4.
The Intel GPU support was previewed at the recent PyTorch Conference. Intel engineers Eikan Wang and Min Jean Cho described the PyTorch changes made to support the hardware. This included generalizing the PyTorch runtime and device layers, which makes it easier to integrate new hardware backends. Intel-specific backends were also implemented for torch.compile and torch.distributed. According to Kismat Singh, Intel’s VP of engineering for AI frameworks:
We have added support for Intel client GPUs in PyTorch 2.5 and that basically means that you’ll be able to run PyTorch on the Intel laptops and desktops that are built using the latest Intel processors. We think it’s going to unlock 40 million laptops and desktops for PyTorch users this year and we expect the number to go to around 100 million by the end of next year.
The release includes a new FlexAttention API which makes it easier for PyTorch users to experiment with different attention mechanisms in their models. Typically, researchers who want to try a new attention variant need to hand-code it directly from PyTorch operators. However, this could result in “slow runtime and CUDA OOMs.” The new API supports writing these instead with “a few lines of idiomatic PyTorch code.” The compiler then converts these to an optimized kernel “that doesn’t materialize any extra memory and has performance competitive with handwritten ones.”
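A minimal sketch of the FlexAttention prototype API, where the attention variant is expressed as a small score-modification function instead of a hand-written kernel; the distance-based bias here is just an illustrative example, and torch.compile is typically needed to get the optimized kernel:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def relative_position_bias(score, b, h, q_idx, kv_idx):
    # Example modification: penalize attention between distant positions.
    return score - 0.1 * torch.abs(q_idx - kv_idx)

B, H, S, D = 2, 8, 128, 64
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

out = flex_attention(q, k, v, score_mod=relative_position_bias)
print(out.shape)   # torch.Size([2, 8, 128, 64])
```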
Several performance improvements have been released in beta status. A new fused Flash Attention backend provides “up to 75% speed-up over FlashAttentionV2” for NVIDIA H100 GPUs. A regional compilation feature for torch.compile reduces the need for full-model compilation; instead, repeated nn.Modules, such as Transformer layers, are compiled once and reused. This can reduce compilation latency while incurring only a few percent performance degradation. There are also several optimizations to the TorchInductor CPU backend.
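A sketch of regional compilation with a generic transformer-style block; the model here is a made-up illustration, but the pattern of compiling the repeated layer rather than the whole model is the feature’s intended use:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A generic transformer-style block that repeats many times in a model."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return x + self.mlp(x + attn_out)

class Model(nn.Module):
    def __init__(self, n_layers: int = 12, dim: int = 256):
        super().__init__()
        self.layers = nn.ModuleList([Block(dim) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = Model()
# Compile each repeated block; identical blocks reuse the compiled code,
# which cuts cold-start compile time compared with torch.compile(model).
for i, layer in enumerate(model.layers):
    model.layers[i] = torch.compile(layer)

out = model(torch.randn(2, 16, 256))
print(out.shape)
```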
Flight Recorder, a new debugging tool for stuck jobs, was also included in the release. Stuck jobs can occur during distributed training, and could have many root causes, including data starvation, network issues, or software bugs. Flight Recorder uses an in-memory circular buffer to capture diagnostic info. When it detects a stuck job, it dumps the diagnostics to a file; the data can then be analyzed using a script of heuristics to identify the root cause.
In discussions about the release on Reddit, many users were glad to see support for Intel GPUs, calling it a “game changer.” Another user wrote:
Excited to see the improvements in torch.compile, especially the ability to reuse repeated modules to speed up compilation. That could be a game-changer for large models with lots of similar components. The FlexAttention API also looks really promising – being able to implement various attention mechanisms with just a few lines of code and get near-handwritten performance is huge. Kudos to the PyTorch team and contributors for another solid release!
The PyTorch 2.5 code and release notes are available on GitHub.


Researchers at Apple and the Swiss Federal Institute of Technology Lausanne (EPFL) have open-sourced 4M-21, a single any-to-any AI model that can handle 21 input and output modalities. 4M-21 performs well “out of the box” on several vision benchmarks and is available under the Apache 2.0 license.
4M-21 is a 3B-parameter Transformer-based encoder-decoder model. All 21 input modalities are mapped to discrete tokens using modality-specific tokenizers, and the model can generate any output modality given any input modality. The model was trained on around 500 million samples of multimodal data, including COYO and C4. Out of the box, 4M-21 can perform a wide range of tasks, including steerable image generation and image retrieval. On vision benchmarks including semantic segmentation and depth estimation, it outperformed comparable baseline models. According to Apple:
The resulting model demonstrates the possibility of training a single model on a large number of diverse modalities/tasks without any degradation in performance and significantly expands the out-of-the-box capabilities compared to existing models. Adding all these modalities enables new potential for multimodal interaction, such as retrieval from and across multiple modalities, or highly steerable generation of any of the training modalities, all by a single model.
4M-21 builds on Apple’s earlier model, Massively Multimodal Masked Modeling (4M), which handled only seven modalities. The new model triples the modalities, which include text and pixel data, as well as “multiple types of image, semantic and geometric metadata.” Each modality has a dedicated tokenizer; text modalities use a WordPiece tokenizer, while image modalities use variational auto-encoders (VAE). The model is trained using a single objective: “a per-token classification problem using the cross-entropy loss.”
By allowing inputs with multiple modalities and chaining operations, 4M-21 supports fine-grained image editing and generation. For example, providing a text caption input will prompt the model to generate the described image. Users can control details about the generated image by including geometric input such as bounding boxes, segmentation maps, or human poses along with the caption. The model can also perform image retrieval based on different inputs; for example, by finding images given a caption or a semantic segmentation map.
Research team member Amir Zamir posted about the work in a thread on X. One user asked Zamir why the model does not support audio modalities. Zamir replied that “It’s a matter of data,” and suggested their method should work with audio. He also wrote:
IMO, the multitask learning aspect of multimodal models has really taken a step forward. We can train a single model on many diverse tasks with ~SOTA accuracy. But a long way to go in terms of transfer/emergence.
Andrew Ng’s AI newsletter The Batch also covered 4M-21, saying:
The limits of this capability aren’t clear, but it opens the door to fine control over the model’s output. The authors explain how they extracted the various modalities; presumably users can do the same to prompt the model for the output they desire. For instance, a user could request an image by entering not only a prompt but also a color palette, edges, depth map extracted from another image, and receive output that integrates those elements.
The code and model weights for 4M-21 are available on GitHub.


Transcript
Introduction [00:42]
Thomas Betts: Hi, everyone. Here at InfoQ, we try to provide our audience with information about the latest software innovations and trends. And I personally recognize that sometimes there’s a lot of new information out there and we tend to focus on the subjects that are most relevant to what we’re currently working on and what we’re interested in. Then sometimes you realize that what used to be one of those subjects off to the side is now right in front of you and you can’t ignore it anymore. And I’ll admit that that was my approach for a lot of the news over the past decade or so about big data, machine learning, artificial intelligence. I found it interesting, but because it wasn’t what I was working with, I had this very thin, high-level understanding of most of those topics. And that’s fine. That’s how software architects usually approach a problem.
We tend to be T-shaped in our knowledge. We have a broad range of subjects we need to know about, and we only go deep in our understanding of a few of them until we have to go deeper in our understanding for something else. That’s where I think we’ve gotten with ML and AI. It’s no longer something off to the side. Architects have to deal with these every day. They’re front and center because product owners, CTOs, CEOs, maybe even our customers are asking, “Can you put some AI in that?” to just about everything, it seems.
That gets me to today’s episode. I’ve invited Anthony Alford on to help explain some of these ML and AI concepts that are now, I think, required knowledge to be an effective software architect. Anthony’s voice will probably sound familiar because he’s another InfoQ editor. He co-hosts the Generally AI podcast with Roland Meertens, and I believe that just started its second season. Anthony, thanks for joining me on my episode of the InfoQ Podcast.
Anthony Alford: Thank you for having me.
Thomas Betts: I think a useful way to go through this today in our discussion is to do this big AI glossary. There’s a lot of terms that get thrown around, and that’s where, I think, architects need to understand what is that term and then figure out how much do I need to know about it so they can have intelligent conversations with their coworkers. I want to provide today just enough information so that those architects can go and have those conversations and realize when something comes up and they have to start implementing that for a project or thinking about a design, they have a little bit more context and that will help them be more successful as they do more research. Sound like a plan?
Anthony Alford: Sounds great.
AI usually means deep learning or neural networks [03:00]
Thomas Betts: All right. First give me your definition. What is AI?
Anthony Alford: AI is artificial intelligence.
Thomas Betts: And we’re done.
Anthony Alford: Yay. And, in fact, when I talk to people about this, I say, “AI really tells you more about the difficulty of the problem you’re trying to solve”. It’s not an actual solution. The good news is when most people are talking about AI, they’re actually talking about some type of machine learning. And machine learning is definitely a technology. It’s a well-studied, well-defined branch of science. And, in fact, the part of machine learning that most people mean now is something called deep learning, which is also known as neural networks. This has been around since the 1950s, so it’s pretty widely studied.
ML models are just functions that take input and provide output [03:48]
Thomas Betts: Yes, I think that’s the idea that AI is not a product you can go buy. You can go buy a machine learning model. You can build a machine learning model. You can add it to your system, but you can’t just say, “I want an AI”. But that’s the way people are talking about it. Let’s start talking about the things that exist, the tangible elements. Give me some examples of what people are thinking when they say, “I want AI in my system”. What are the machine learning elements they’re talking about?
Anthony Alford: Of course, most people are talking about something like a large language model or a generative AI. What I like to tell people as software developers, the way you can think about these things is it’s a function. We write code that calls functions in external libraries all the time. At one level you can think about it. It is just a function that you can call. The inputs and outputs are quite complex, right? The input might be an entire image or a podcast audio, and the output might also be something big like the transcript of the podcast or a summary.
Thomas Betts: And that’s where we get into the… Most people are thinking of generative AI, gen AI. Give me some text, give me an image, give me some sound. That’s the input. Machine learning model, it all comes down to ones and zeros, right? It’s breaking that up into some sort of data it can understand and doing math on it, right?
Anthony Alford: Yes, that’s right. Again, when I talk to software developers, I say, “When you think about the input and output of these functions, the input and output is just an array of floats”. Actually, it’s possibly a multidimensional array. The abstract term for that is a tensor. And if you look at some of the common machine learning libraries, they’re going to use the word tensor. It just means a multidimensional array, but you have to be able to express all your inputs and outputs as these tensors.
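For example, in PyTorch a batch of images is just such a multidimensional array of floats; the sizes below are illustrative:

```python
import torch

# A batch of 2 RGB images, each 224x224 pixels: a 4-dimensional tensor of floats.
images = torch.rand(2, 3, 224, 224)
print(images.shape)   # torch.Size([2, 3, 224, 224])
print(images.dtype)   # torch.float32
```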
Building an ML model is like writing a lot of unit tests and refining the function [05:42]
Thomas Betts: Yes, these are the things I learned in math back in university years ago, but because I’m not a data scientist, I don’t use those words every day and forget, “Oh yes, multidimensional array, I understand what that is”. But exactly. That’s like several extra syllables I don’t need to say. I’ve got these tensors I’m putting in. What do I do with it? How do I build one of these models?
Anthony Alford: Okay, if you want to build your own model, which you actually might want to consider not doing that, we can talk about that later. But in general, the way these models are built is a process called training and supervised learning. What you really need, again, from our perspective as software developers, we need a suite of unit tests. A really big suite of unit tests, which means just what we expect, some inputs to the function and expected outputs from the function. The training process essentially is randomly writing a function. It starts with a random function and then just keeps fixing bugs in that function until the unit tests more or less pass. They don’t actually have to pass exactly. You also tell it, “Here’s a way to compute how badly the tests are failing, and just make that number smaller every time”.
Thomas Betts: That’s where you get to the probability that this all comes down to math. Again, I’m used to writing unit tests and I say, “My inputs are A and B, and I expect C to come out”. You’re saying, “Here’s A and B and I expect C”. But here’s how you can tell how close you are to C?
Anthony Alford: Exactly, yes. It depends on the data type. I mentioned they all turn into tensors, but the easiest one is, let’s say, you’re building a model that outputs an actual number. Maybe you’re building a model where the inputs are things like the square feet of a house and the number of rooms, et cetera, and the output is the expected house price. If you give it unit tests, you can get a measure of how off the unit test is just by subtracting the number that you get out from the number that you expect. You can do sum of squared errors. Then the machine learning will just keep changing the function to make that sum of squared errors lower. With something like text or an image, it may be a little trickier to come up with a measurement of how off the unit tests are.
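A toy version of that house-price example in PyTorch, with made-up numbers: training starts from a randomly initialized function and repeatedly nudges the weights to shrink the sum of squared errors, the “how badly are the tests failing” number.

```python
import torch

# Inputs: (square feet in thousands, number of rooms). Output: price in $100,000s.
X = torch.tensor([[1.4, 3.0], [2.0, 4.0], [0.85, 2.0], [2.6, 5.0]])
y = torch.tensor([[2.3], [3.2], [1.5], [4.1]])

model = torch.nn.Linear(2, 1)                     # the "random function" we start with
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(2000):
    loss = ((model(X) - y) ** 2).sum()            # sum of squared errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # nudge the weights to make loss smaller

print(f"final loss: {loss.item():.4f}")
```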
Language models are trained using sentences to predict the probability of the next word in the sentence [07:59]
Thomas Betts: We’re getting into all the ideas of gen AI. Let’s just take the text example for now and we’ll leave off audio and images and everything else, because it’s the same principles. Most people are familiar with interacting with ChatGPT. I type in something and it gives me a bunch of text. How did those come about and how did we create these LLMs that people said, “When I type in this sentence, I expect this sentence in response”.
Anthony Alford: Okay, so how long of a story do you want here? We can go back to 2017 or even earlier.
Thomas Betts: Let’s give the high level details. If it’s an important milestone, I think it’s useful to sometimes have the origin story.
Anthony Alford: You’re right. The short answer is these things like ChatGPT, or what are called language models: the input of the function is a sequence of words or, more abstractly, tokens. The output is all possible tokens along with their probability of being the next one. Let me give you an example. If I give you the input sequence “once upon a…” What’s the next word?
Thomas Betts: I’m going to guess time.
Anthony Alford: Right. What the LLM will give you is every possible word with its probability, and time will have a very high probability of being the next one. Then something like pancake would have a lower probability. That’s a probability distribution. We actually know the answer. In training, we know that in the unit test, the word time has the probability of 100%. Every other word has a probability of zero. That’s one probability distribution. The probability distribution it gives us is another one. And there’s a measure of how different those are. That’s called cross-entropy loss.
That’s how you can train it to improve that. It’ll shift its output distribution to have time be closer to 100% and everything else zero. That’s a language model, and the method that I described is really how they’re trained. You take a whole bunch of text and you take sequences of that text and you chop out a word or multiple words and you have it fill in those words. The way it fills it in is it gives you a probability distribution for every possible word. Ideally, the one you chopped out has the highest probability.
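A tiny illustration of that “once upon a…” example with a four-word toy vocabulary: the model emits a probability for every word, and cross-entropy measures how far that distribution is from the training target, where “time” has probability 1 and everything else 0.

```python
import torch
import torch.nn.functional as F

vocab = ["time", "pancake", "dream", "mountain"]
logits = torch.tensor([[3.0, 0.5, 1.0, 0.2]])        # model's raw scores for the next token
probs = F.softmax(logits, dim=-1)
print(dict(zip(vocab, probs[0].tolist())))           # "time" gets the highest probability

target = torch.tensor([0])                           # index of "time", the word that was chopped out
loss = F.cross_entropy(logits, target)               # lower loss = distribution closer to the target
print(loss.item())
```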
Thomas Betts: Got you. It’s like the image recognitions that we’ve seen for years.
Anthony Alford: Exactly.
Thomas Betts: We’ve had image recognition models. It’s like, “How do I identify this is a dog? This is a cat?” and we trained it. It’s like, “This is a cat. This is a dog”. And it started putting that into its model somehow. It’s like when I see this array of pixels, the answer is such a probability that it is a dog in that picture.
Anthony Alford: Yes. And in general, if we want to talk about data types, this is an enumeration. With enumeration data types, this thing is what you might call a classifier. You were talking about a dog or a cat. It’ll give you an answer for every possible output class. Every possible enumeration value has a probability associated with it. You want the real one to be close to 100%, and you want the rest of them to be close to zero. It’s the same for text. The entire vocabulary is given in a probability distribution.
Neural networks are doing matrix multiplication, with extremely large matrices [11:14]
Thomas Betts: That’s when you hear about how big these models are, it’s how much they’ve been trained on. The assumption is that ChatGPT and GPT-4 was basically trained on everything that you could possibly get off the internet. I don’t know how true that is, but that’s the way people talk about.
Anthony Alford: It’s close enough to be true. That’s the data set. There’s also the number of parameters that make up the model. When we’re talking about these deep learning models, those are neural networks. And neural networks are, at heart, matrix multiplication. I mentioned those input tensors. You could think of them as like matrices. You can multiply that times the model’s matrix. We talk about those matrix entries are sometimes called weights because ultimately what you’re doing is a weighted sum of the input values. When we talk about how big is the model, we’re talking about how many matrix parameters are in that thing. For GPT-4, we don’t know. We were not told. If you go all the way back to GPT-2, there was like one and a half billion parameters in the matrices inside it.
Thomas Betts: Yes, I think we’re now seeing…
Anthony Alford: Hundreds of billions.
Thomas Betts: Hundreds of billions, yes. Where does the “large” in large language model come in? Is it in the billions?
Anthony Alford: Yes. Well, it’s not a hard number. But what we’re seeing now is if something is tens or hundreds of billions, that’s probably large. We have smaller ones now where you’ll see Llama or something like… What is it, Gemma from Google? And Phi from Microsoft. Those are still billions, but they’re only… From 1 to 10 billion is considered a small model now. That’s small enough to run on your laptop actually.
Thomas Betts: Okay, you just threw out several other names and these are the things that I’m talking about that architects were like, “Oh, I think I’ve heard of Llama. Gemma sounds familiar”. And was it Psi?
Anthony Alford: Phi, P-H-I, right. The Greek letter. Here in America, Phi, but other places it’s Phee.
Hugging Face is like GitHub for language models [13:28]
Thomas Betts: I know you can go out and find details of some of these. There’s a site called Hugging Face that I don’t understand, but you can go and find the models and you can test the models. What is that?
Anthony Alford: Hugging Face, you can think of as the GitHub for language models. In fact, I mentioned a library. They have an SDK, a Python library you can install on your laptop that will, behind the scenes, download and run these smaller language models that you can actually run on your machine. What they do is they have files that contain those matrix entries that I mentioned.
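For example, with the Hugging Face transformers library a small model can be downloaded and run locally in a few lines; gpt2 is used here only because it is tiny and ungated, as a stand-in for the small models mentioned above like Llama, Gemma, or Phi:

```python
from transformers import pipeline

# Downloads the weight files on first use, then runs the model locally.
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a", max_new_tokens=20)
print(result[0]["generated_text"])
```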
Thomas Betts: That’s the composed model, if you will, right? I always think the training is I’m going to run my program and the output is the model. The training process might take hours or days, but once it’s done, it’s done and it’s baked. Now I have the model, and now that model, for large language models or small language models, you’re saying it’s something that I can put on my laptop. Some of those, if they were smaller machine learning models, we’ve been able to move those around for a while, right?
Two phases of the machine learning life cycle [14:35]
Anthony Alford: Oh, yes. We can think of two phases in the life cycle of machine learning model, the training that you mentioned. We could think of that as developing a function, and then once it’s developed, once we’ve written the function, we might build it and deploy it as a jar, for example, or some kind of library that you can use. The trained model is like that, and when you load it up and you put an input into it and get an output out, that’s called inference. The model infers some output from your input. Those are the two big chunks of the model lifecycle.
Auto-regressive models take the output and feed it back in as the next input, adding to the context [15:12]
Thomas Betts: Back to the large language models where you’re talking about predict the next word and then predict the next word. This is where it’s feeding it back in. The way I’ve understood is it’s just auto-complete on steroids, one letter, one word. It’s like, “I’ll just do all of it”. It keeps feeding that sentence that it’s building back into the context, and so that’s the next thing.
Anthony Alford: That’s right. And you’ll hear these models referred to as autoregressive, and that’s exactly what they’re doing. You start with initial input, which sometimes we call that the prompt. We also call the input to the model, the context. The prompt is the initial context and then it outputs one more token that’s stuck on the end and then it feeds back as the new context and the process just repeats. These things also are able to output a token that basically says, “Stop”. And that’s how they know to stop. Whereas I’ve tried that auto-complete with my phone where I just keep auto-completing over and over. It eventually produces gibberish, but it is the exact same idea.
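A hand-rolled version of that loop with a small model, to make the feedback explicit; real libraries wrap this in a generate() call, and gpt2 is again just a small stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = tokenizer.encode("Once upon a", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):
        logits = model(context).logits                 # scores for every token in the vocabulary
        next_token = logits[0, -1].argmax()            # greedy: take the most probable next token
        if next_token.item() == tokenizer.eos_token_id:
            break                                      # the model emitted its "stop" token
        # The new token is appended and the longer context is fed back in.
        context = torch.cat([context, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(context[0]))
```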
Tokens are the words or parts of words that the model can respond with [16:18]
Thomas Betts: You’ve now said token a few times, and I keep saying word. And I know the layman is usually interchanging those, and it’s not exactly the same thing. That a token is not a word all the time. What is a token in terms of these language models?
Anthony Alford: When people first started, it was words. We’re probably familiar with the idea with search engines of doing things like stemming or things like that where the word itself doesn’t actually become the token. The reason you want to do something that’s not exactly the word is I mentioned you can only get an output that is one of the tokens that it knows about. You’ve seen things like, “Well, let’s just use the bytes as tokens”. I think now it’s byte pairs. Basically, it’s no longer at the word level. A token is smaller than a word. You might see a token be a couple of letters or characters or bytes.
Thomas Betts: And what’s the advantage of shrinking those down? Instead of predicting the next word is once upon a time, it would predict T and then I and then M then E.
Anthony Alford: Or something like that, or TI. The reason is so that you can output words that are not real words, that wouldn’t be in the regular vocabulary.
Thomas Betts: Now is it smart enough to say that time is one possible token and TI might be a different one? Does it break it down both ways?
Anthony Alford: The tokenization is, that’s almost become a commodity in itself. Most people are not really looking at what the specific token data set is. I think typically you want something a little bigger than one character, but you want something smaller than a word. This is something that researchers have experimented with.
Thomas Betts: And my interaction with knowing the number of tokens counts is… When I’ve played around with these things, used ChatGPT or the OpenAI API, it’s measuring how many tokens are being used. And you’re being sometimes billed by the number of tokens.
Anthony Alford: Yes, that’s right. Because essentially the output is a token, and the input, we mentioned, is called the context, the models have a maximum size of the context or input in the number of tokens. It’s on the order of thousands or maybe even hundreds of thousands now with a lot of these models. But eventually, it will have to stop because effectively you can’t take a larger input.
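Token counts can be checked locally, for example with the tiktoken library, which implements the byte-pair encodings used by recent OpenAI models; the encoding name below is one such encoding:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Once upon a time, in a land far away")
print(len(tokens))          # number of tokens you would be billed for
print(tokens)               # the token ids themselves
```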
Thomas Betts: Yes, and I remember people found those limits when ChatGPT came out is you’d have this conversation that would go on and on and on, and pretty soon you watched the first part of your conversation just fall off the stack, if you will.
Anthony Alford: Yes, the maximum context length is built into the model. And there’s a problem: the algorithmic complexity is the square of that context size. As the context gets bigger, the model gets bigger as the square of that, the runtime increases as the square of that, et cetera.
Efficiency and power consumption [19:15]
Thomas Betts: That’s where you’re getting into the efficiency of these models. There’s been some discussion of how much power is being consumed in data centers all around the world to build these models, run these models, and that’s one of those things that you can get your head around. If you have this thing it takes…
Anthony Alford: It’s an awful lot.
Thomas Betts: It’s a lot. It’s an awful lot. Say it takes 30,000, 32,000 tokens and you’re saying the square of that, that suddenly gets very, very large.
Anthony Alford: Oh, yes. Not only does it grow as a square of that, but it’s like there’s a big multiplier as well. Training these models consumes so much power, only the people who do it know how much. But really they’re just looking at their cloud bill. Nobody knows what the cloud bill was for training GPT-3 or 4, but it’s a lot.
Thomas Betts: Yes, that’s why people are looking not to build their own model. Most people are not in the business of needing to create their own LLM. These things are done, but people are using them to replace Google searches. One of the problems is you don’t have the context because the model wasn’t trained on current events. It’s not searching Google and giving you results. It’s just predicting words.
Anthony Alford: Exactly. Now they are trying to build that in. If you use Bing, Bing is actually using GPT-4, and it will include search results in its answer. I don’t want to spoil it, but we can talk about that when we get to RAG.
Transformers – GPT means Generative, Pretrained Transformer [20:43]
Thomas Betts: Well, let’s leave RAG off to the side a little bit. Let’s dig a little bit into transformer without rewriting the entire history. I think you and Roland have talked about that a little bit on your podcast.
Anthony Alford: Right, we’ve mentioned LLMs in general and the GPT family in particular. Well, the T in GPT stands for transformer, and this was something that a Google research team came up with in 2017. They wrote a paper called Attention Is All You Need. They were working on translation, and before that the translation models were using recurrence, which is different from what we were talking about with autoregression. Anyway, they came up with this model that really just uses a feature called attention or a mechanism called attention. They called it the transformer.
Now really all the language models are based on this. That’s what the T in GPT stands for. GPT stands for generative pre-trained transformer, and they all use this attention mechanism. You could think of attention as a way for the model to pick out what’s important in that input sequence. The word is, I think, sometimes used in… It’s similar to information retrieval, so it uses a lot of concepts like queries and keys and values. But at a high level, it’s a way for the model to… Given that input sequence, identify the important parts of it and use that to generate the next token.
Attention is weighting the input [22:13]
Thomas Betts: It might throw out some of your input or recategorize and say, “These are the important words in that context”.
Anthony Alford: The mathematics is, it finds keys that match the query and then returns the values that are associated with those. A lot of times it does focus on certain parts of the input versus other pieces.
Thomas Betts: That’s where weighting comes into play, right?
Anthony Alford: Exactly. That’s how it is.
Thomas Betts: You mentioned that these matrices have weights on them. It’s going to figure out which words or parts of that input, and that one word doesn’t always have the same weight. It’s in the context of that input, it might have more weight.
Anthony Alford: Yes, you did a better job explaining that than I did.
Thomas Betts: It’s not my first time trying to explain this. I get a little bit better every time. Again, one of the points of why I wanted to do this episode.
Adding an LLM to your product [23:03]
Thomas Betts: We’ve got transformers, it’s just a term, and the attention, that’s how we’re figuring out what goes in. And the thing producing the output, in this case, is GPT, but that’s a branded term. LLM is the generic term, right?
Anthony Alford: Right.
Thomas Betts: It’s like Kleenex versus tissue. Let’s say I want to use one of these LLMs in my application. This is the thing that my product owner, my CEO is like, “Put some AI on it”. I want to look like we’re being innovative. We’ve got to have something that is this predictive thing like, “Look at how it looked at our model and comes up with something”. How do we go about doing that?
Anthony Alford: Can I plug an InfoQ piece already? Just earlier this year I edited the eMag, the Practical Applications of Generative AI e-magazine. And we had several experts on LLMs in particular to talk about this. Definitely recommend everybody read that, but what they recommended is… You have publicly available commercial LLMs like the GPT models behind ChatGPT. There’s also Claude. There’s also Google’s Gemini. AWS has some as well. Anyway, if you find one of these that seems to work, try it out. So you can quickly adopt LLM functionality by using one of these commercial ones. It’s just an API. It’s a web-based API. You call it using an SDK, so it looks like any kind of web service.
That’s number one. Number two, for long-term cost maybe, right? Because it’s a web service and API, like we said, we’re paying per token. It’s actually probably pretty cheap. But longer term there’s cost concerns, and there may be privacy concerns because these commercial LLMs have gotten better at their promises about, “We’re not gonna keep your data. We’re going to keep your data safe”. But there’s also the data that it gives you back in the case of, say, like code generation.
I think there was a lawsuit just recently. I think people whose code was used to train this, they’re saying that this thing is outputting my code, right? There’s concerns about copyright violation. Anyway, longer term, if you want to bring that LLM capability in house, you can use one of these open source models. You can run it in your own cloud, or you can run it in a public cloud but on your own machine. Then you have more control over that.
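Calling one of these commercial LLMs really does look like any other web service behind an SDK. A minimal sketch with the OpenAI Python client, where the model name is just an example and an API key is assumed to be set in the environment; the Claude, Gemini, and Bedrock SDKs follow a similar pattern:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",   # example model name
    messages=[{"role": "user", "content": "Summarize our return policy in two sentences."}],
)
print(response.choices[0].message.content)
```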
Thomas Betts: Yes, it’s kind of the build versus buy model. Right?
Anthony Alford: Exactly.
Thomas Betts: And I like the idea of, “Let’s see if this is going to work”. Do the experiments. Run those tests on the public one and maybe put some very tight guardrails. Make sure you aren’t sending private data. I think it was to plug another InfoQ thing. Recently the AI, ML trends report came out. I listened to that podcast. That was one where it mentioned that because they were setting up so many screens to filter and clean out the data before sending it to OpenAI or whichever API they were using, that scrubbed out some of the important context and the results coming back weren’t as good. Once you brought the model in house and you could say, “Oh, we own the data. It never leaves our network. We’ll send it everything”. All of a sudden your quality goes up too.
Anthony Alford: It’s definitely very easy to experiment with. And if you find that the experiment works, it may make sense to bring it in house. There’s the short answer.
Hosting an open-source LLM yourself [26:36]
Thomas Betts: Like you said, “If you want to pay per use and it’s easy to get started”. That’s one way to go. When you’re talking about bringing it in house, you mentioned you can have it on your own cloud. Like we’re on Azure or AWS. Is that basically I spin up an EC2 instance and install my own?
Anthony Alford: That’s one way. Of course, the service providers like AWS are going to give you a value add version where they spin it up for you and it’s very much like the regular model where you pay per use. But yes, you could do that. You could do it right on EC2.
Thomas Betts: Yes. Whether you’re doing product as a service, platform as a service, or infrastructure as a service, you can do whatever you want on it. Your results may vary, but that might be another way to do that next phase of your experiment as you’re trying to figure out what this is. How easy is it for me to spin up something, put out a model there and say, “Okay, here’s our results using this public API, and here’s if we bring it in house with our private API”. Maybe you look at the cost. Maybe look at the quality of the results.
Anthony Alford: Yep, for sure.
Comparing LLMs [27:37]
Thomas Betts: How are people comparing those things? What is the apples to apples comparison of, “I’m going to use OpenAI versus one of the things I pull off of Hugging Face?”
Anthony Alford: This is actually a problem. As these things get better, it’s tricky to judge. In the olden days where we had things like linear regression and we had that supervised learning where we know the answer, we can get a metric that’s based on something like accuracy. What is the total sum of squared error? But nowadays, how good is the output of ChatGPT? Well, if you’re having it do your homework, if you get an A, then it was pretty good. And, in fact, believe it or not, this is very much a common thing that they’re doing now with these models is they’re saying, “We train this model, it can take the AP Chemistry exam and make a passing grade”.
Another thing I see a lot in the literature is, if they’re comparing their model to a baseline model, they’ll have both models produce output from the same input and have human judges compare them. It’s like Coke versus Pepsi, where four out of five people chose Pepsi. And even more fun is to do that, but with ChatGPT as the judge. Believe it or not, a lot of people are doing that as well. I guess the answer is: it’s not easy.
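A minimal sketch of the “ChatGPT as the judge” idea, assuming the OpenAI Python client and an API key; the model name and prompt wording are illustrative rather than a standard evaluation protocol.

```python
# Minimal sketch of LLM-as-judge: ask one model which of two candidate answers is better.
# Assumes the openai package (v1+) and an OPENAI_API_KEY in the environment; model name is illustrative.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B' and one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What is RAG?", "Retrieval augmented generation adds documents to the prompt.", "A type of database."))
```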
Thomas Betts: Yes, that’s where I tend to say these things are non-deterministic. You talked about the probability; you don’t know what answer is going to come out. Your test isn’t just “I asked this question, I got this answer”, because you don’t necessarily know what types of questions are going to be going in, so you don’t know what outputs are going to come out.
Anthony Alford: Yes, exactly. That’s actually one of the scariest things: you don’t know what’s going to come out. Something very unpleasant or embarrassing might come out, and that’s really got people concerned about using these things in production environments.
Thomas Betts: Yes.
Before you adopt LLMs in your application, define your success criteria [29:38]
Anthony Alford: But I will say one thing… Again, going back to the e-magazine, one of my experts said, “Before you adopt LLMs in your application, you should have good success criteria laid out”. That may be the harder part. How will I know if it’s successful? It’s going to depend on your application, but it’s something you should think hard about.
Thomas Betts: Well, I like that, because it puts the question back on the product owners. The CTOs are saying, “I need some AI in it”. Well, what do you want to have happen? Because there are a lot of places where you shouldn’t put AI. I work on an accounting system. You should not have it just guess your books.
Retrieval-Augmented Generation (RAG) should be an early step for improving your LLM adoption [30:19]
Thomas Betts: When we’re talking about using these for ourselves, whether we’re hosting them or bringing them in house, how do we get those better quality results? Do we just use them out of the box? I had a podcast a while ago and learned about retrieval augmented generation. I hear RAG talked about a lot. Give me the high level overview of what that is and why that should be a first step to make your LLM adoption better.
Anthony Alford: Again, on my expert panel, they said, “The first thing is to try better prompts”. We’ve probably all heard of prompt engineering. We know that the way you phrase something to ChatGPT makes a big difference in how it responds, so definitely try doing stuff with prompts. The next step is retrieval augmented generation, or RAG. I think we mentioned that LLMs are trained at a point in time and don’t know anything that happened after that training. If we ask, “Who won the football game last night?”, it doesn’t know. Or it might not say it doesn’t know; it might actually make up something completely untrue. This is also a problem for a business where you want it to know about your internal knowledge base, right? You want it to know things that are on your wiki or in your documentation, things like that. What RAG does is: you take your documents, you break them up into chunks, and you run each chunk of text through a model that generates a single vector for that chunk.
This is called an embedding, and that vector in some way encodes the meaning of that text. You do this with all your documents, and then you have a database where each chunk has a vector associated with it that tells you something about its meaning. Then when you go and ask the LLM a question, you do the same thing: you take your question and turn it into a vector, and the vector database lets you quickly and efficiently find vectors that are close to it, and therefore close to your question in meaning. It takes the content from those chunks and shoves it into the LLM context along with your question, and now the model knows all that stuff along with your question. We know that these LLMs are very good at this: if you give one a chunk of text and say, “Explain this”, or, “Here’s a question about this chunk of text”, it is quite good. That’s what the attention mechanism does: it lets the model find the parts of that chunk of text that answer the question or solve the problem you’re asking about.
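To make the mechanics concrete, here is a minimal RAG sketch, assuming the sentence-transformers package for embeddings and a brute-force similarity search in place of a real vector database; the model name and documents are placeholders.

```python
# Minimal RAG sketch: embed document chunks, retrieve the ones closest to a question,
# and build a prompt that stuffs them into the LLM context.
# Assumes the sentence-transformers and numpy packages; the model name is just an example.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are generated on the first business day of each month.",
    "The API rate limit is 100 requests per minute per key.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

question = "How often are invoices created?"
question_vector = embedder.encode([question], normalize_embeddings=True)[0]

# On normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vectors @ question_vector
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Answer the question using only these documents:\n"
prompt += "\n".join(f"- {c}" for c in top_chunks)
prompt += f"\n\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to the LLM
```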
Thomas Betts: The way I’ve heard that explained is: let’s say I do my search, and instead of writing a really elaborate prompt, because I’m only willing to sit there and type for 30 seconds and that’s all the words I’m going to come up with, I would say, “Answer the question based on these documents”. I can give all those documents in the context and now it knows, “Okay, that’s what I’m going to use”. I’m not going to use just my base-level LLM to predict the next word; I’m going to predict the next word based on this context.
Anthony Alford: Right, and the retrieval part is finding those documents automatically and including them in the context for you. That’s the key component. If you actually know the documents, say somebody gave you the user manual and said, “Answer questions about it”, which is a pretty cool use case for someone in, say, customer service, and the manual is small enough to fit into the context, which it probably is now that contexts run to hundreds of thousands of tokens, then that’s great. But maybe you don’t have that. Maybe you have a bunch of knowledge base articles. This will go and find the right knowledge base article and then answer the question based on that.
Thomas Betts: Right, because our knowledge base has tens of thousands of articles, as opposed to a couple of hundred pages.
Anthony Alford: Exactly.
Thomas Betts: And you’re still using the LLM, which has all of its knowledge of, “Here’s how I complete a sentence”.
Anthony Alford: Yep.
Fine-tuning is one option to make an LLM better suited for your needs [34:07]
Thomas Betts: You are not building a new model based off of your knowledge base or your training documents.
Anthony Alford: Exactly. But let’s say you did want to do that, and that might be a better solution in some cases; this process is called fine-tuning. I mentioned the T in GPT was transformer. The P is pre-trained. This is a whole subfield of machine learning called transfer learning, where you pre-train a model so it’s general purpose, and then you can fine-tune it for a specific case. In the case of GPT-2, 3 and higher, they found out you often don’t need to; it’s pretty good on its own. But what fine-tuning does is additional training on that model. Instead of using the model as is, you restart the training process with your own fine-tuning data, where you know the inputs and you know the outputs. The advantage is that the fine-tuning dataset can be much smaller than what is needed to train the full GPT.
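A minimal sketch of that supervised fine-tuning step, assuming the Hugging Face transformers and datasets libraries; the small GPT-2 checkpoint and toy examples stand in for whichever open model and dataset you would actually use.

```python
# Minimal sketch of supervised fine-tuning: start from a pre-trained model and continue
# training on your own (input, output) pairs. Assumes the transformers and datasets packages;
# the small GPT-2 checkpoint is just a stand-in for whichever open model you fine-tune.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your fine-tuning data: far smaller than the pre-training corpus.
examples = [
    {"text": "Q: How do I reset my password? A: Open Settings and choose Security."},
    {"text": "Q: When are invoices generated? A: On the first business day of each month."},
]
dataset = Dataset.from_list(examples).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```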
Thomas Betts: And that’s because you’re starting from what already exists.
Anthony Alford: Exactly. Right.
Thomas Betts: You’re not starting from baseline or nothing. It’s just saying, tweak your model. That goes back to something I understood, again at a superficial level, about machine learning training: you can overtrain the model. If you give it too many answers in one area, it’s like, “Look, we got to 99.9%”. But then something comes in that it doesn’t know about and it has no answer; it’s way off base. In this case, if I’m trying to get the model to be very specific to my company’s applications and my data, that might be the desired outcome. I don’t want someone using my customer service chatbot to ask about when the next Taylor Swift show is.
Anthony Alford: Yes, exactly. In fact, the original ChatGPT and newer models, they do fine tune them to give more helpful answers and follow instructions. This is something with GPT-3.5, again, that model is pre-trained on basically the whole internet, and it could give you answers that were pretty good, but they found that sometimes it would just give you answers that were… It’s that whole joke about this is technically true but not at all useful. So they fine-tuned it to give you answers that are more helpful to follow instructions. They call it alignment, and the way they do that is they have a small data set of, “This was the input. Here’s the output you gave, but this output here is better”. They fine-tune it to work towards the more appropriate output.
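The alignment data Anthony describes is typically a set of preference pairs. A purely illustrative sketch of what one record might look like (the field names are an assumption, loosely following common preference-tuning formats):

```python
# One illustrative preference record: during alignment fine-tuning, the model is nudged
# toward the "chosen" output and away from the "rejected" one.
preference_example = {
    "prompt": "How do I open a file in Python?",
    "chosen": "Use the built-in open() function, e.g. `with open('data.txt') as f: ...`, "
              "which also closes the file for you.",
    "rejected": "Files can be opened.",  # technically true, but not at all useful
}
```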
Vector databases provide nearest-neighbor searches [36:45]
Thomas Betts: I need to back up just a little bit. When you mentioned we’re going to create these vectors, I’m going to have a vector database, I’m going to do a vector search. Another one of those terms that gets thrown around and people are like, “Well, do I have a vector database?” I think Azure just announced that they’re going to… I think it’s in beta right now. Basically turn your Cosmos database into a vector database, like flip a checkbox in the portal and all of a sudden you have vectors. What does that do for me? Why is that an advantage?
Anthony Alford: Okay, I have an upcoming podcast on this very problem. We mentioned that for a chunk of text you can create a vector that encodes its meaning. The vector is very high dimensional: hundreds, maybe thousands of dimensions. You’re trying to solve the problem of: given one vector, how do you find the vectors in the database that are close to it? You could just run through all of them, do a table scan basically, and sort the output. That’s actually fine at small scale, but the complexity is high enough that it’s not going to perform well as the data grows. What you need is something more like a B-tree lookup, which is log N. With a vector database, the vectors are probably not the important part; it’s the nearest neighbor search. That’s the problem we’re solving: given an input vector, what are its nearest neighbors in your database? And that’s the problem you want to solve in an efficient, scalable way.
Thomas Betts: Got you. It’s going through and looking at my data and saying, “Here are the vectors for all the parameters”. And based on that, these are related words…?
Anthony Alford: Well, no, it literally doesn’t look at the content. It’s just given two vectors: how close are…
Thomas Betts: How close are the vectors? It doesn’t know what it came from?
Anthony Alford: Exactly, right. Now, once it finds the ones that are closest, those are in the same database row, or there’s a pointer to the content they came from, which is what you actually care about.
Thomas Betts: Got you.
Anthony Alford: But the database, its purpose is to do the nearest neighbor search where you give it a vector and it finds the top K in its database that are closest to it.
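A rough sketch of the two approaches being contrasted here: the brute-force “table scan” in plain NumPy versus an index lookup, using the FAISS library as one example of a nearest-neighbor index (the choice of library is an assumption).

```python
# Nearest-neighbor search two ways: a brute-force scan over every vector, and an index lookup.
# Assumes the numpy and faiss-cpu packages; the vectors are random stand-ins for real embeddings.
import numpy as np
import faiss

dim, n, k = 384, 10_000, 5
database = np.random.rand(n, dim).astype("float32")   # one row per stored chunk
query = np.random.rand(1, dim).astype("float32")      # the embedded question

# Brute force ("table scan"): measure the distance to every stored vector and sort.
# Fine at small scale, but the work grows linearly with the size of the database.
distances = np.linalg.norm(database - query, axis=1)
brute_force_ids = np.argsort(distances)[:k]

# Index lookup: FAISS wraps the same search behind an index API. IndexFlatL2 is still exact;
# swapping it for an IVF or HNSW index gives approximate but much faster lookups at scale.
index = faiss.IndexFlatL2(dim)
index.add(database)
_, index_ids = index.search(query, k)

print(brute_force_ids)   # ids of the k closest vectors
print(index_ids[0])      # should match the brute-force result
```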
Thomas Betts: Yes. This is where, I think, we’re going back to the beginning: AI as a product isn’t something that exists. We’ve had fuzzy search techniques for a while. This has been something people have wanted, and everyone’s gotten used to Google: I can type in whatever I want, and it figures it out. Like you said, you take the stems of the words… This is another one of those cases where I didn’t give you exactly the thing I’m asking for. So it’s not “find this row in the database”, but “find records that are close to what I intended”, and that’s what they’re doing.
Anthony Alford: Yes, I think you might find this referred to as semantic search or maybe neural search. The neural meaning that’s how the vectors are generated from a neural network.
Thomas Betts: But it’s all about that: I don’t have a specific thing in mind, and it has to find the intent.
An LLM is a tool to solve natural language processing (NLP) problems [39:45]
Thomas Betts: I guess LLMs really fall under in my head the category of natural language processing, right?
Anthony Alford: Yes, exactly.
Thomas Betts: Because that used to be a thing. I had data scientists on my team who were working in the field of natural language processing. Is that still a thing? Is that just a subset or has it just gotten overwhelmed in the news by LLMs?
Anthony Alford: I think you could think of an LLM as a tool to solve natural language processing problems. For example, we used to look at things like named-entity recognition, parts of speech recognition, that kind of thing. That’s still something you have to do, but an LLM can do it.
Thomas Betts: Right.
Anthony Alford: And it can do it pretty well, and it works out of the box. Again, we were talking about Google and Attention Is All You Need. They came up with a model based on that called BERT, and it would do things like named-entity recognition and part-of-speech tagging.
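As an illustration of the “works out of the box” point, a minimal sketch of named-entity recognition using the transformers pipeline; the specific BERT checkpoint name is an assumption, and any NER-tuned model would work the same way.

```python
# Minimal sketch: named-entity recognition with a BERT-style model via the transformers pipeline.
# The checkpoint name is just an example of a publicly available BERT model fine-tuned for NER.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Anthony Alford spoke with Thomas Betts on the InfoQ Podcast about Alibaba's Qwen2 models."
for entity in ner(text):
    print(entity["word"], entity["entity_group"], round(entity["score"], 2))
```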
LLMs are useful because they are very general, but that does not make them general AI [40:44]
Thomas Betts: Got you. And that’s one of those things: LLMs are these generalists, and you find ways to make them more specific. If you have a specific use case in mind, you can go down a fine-tuning route, or you can find a different model that’s closer to your need, and that’s going to have those benefits: it’s going to cost less to run, it’s probably going to give better quality answers, and it’s probably going to return faster, I’m assuming, if it’s less computationally intensive.
Anthony Alford: Yes, this is one of the reasons people are excited about LLMs: they are very general. That’s one of the things where people started saying, “Is this general AI?” That’s been the holy grail of AI research forever. Yes, we can make a program that plays chess very well, but it can’t drive a car. The holy grail is to build one model that can solve just about any problem. If we flatter ourselves as human beings, we can do lots of different tasks: we can do podcasts, we can build model race cars, we can read books. The holy grail of AI is one model to rule them all, and LLMs could do so much without additional training. One of the early GPT papers was essentially, “Look, we built this thing, and out of the box it can do summarization, question answering, translation, code generation, all these tasks”. That’s one of the things that got people really excited about it: it looks like it could do everything.
Thomas Betts: Yes, I think it’s the now how do we use it? Because that seems so powerful. But going back to your point, you need to have a specific output in mind. What is your goal? Why would you add this? Because it sounds exciting. Everyone wants to use it, but you have an intent of how does that fit into your product? How does that fit into your solution?
Anthony Alford: Yes. It’s always about what business problem am I trying to solve? And how do I know if I succeeded?
AI copilots versus AI agents [42:38]
Thomas Betts: We’re over on time. I’m going to have one last bonus round question and we’ll wrap it up.
Anthony Alford: Yes.
Thomas Betts: A lot of people talk about having AI Copilots. I can’t remember how many Microsoft Copilots and GitHub Copilots. Everything’s a copilot. Distinguish that from an AI agent because that’s another term that’s being thrown around. They both sound like the embodiment of this thing as a person. There’s a whole different discussion about that. But these are two different things. What’s a co-pilot versus an agent?
Anthony Alford: I think we did talk about this on the trends podcast. An agent has some degree of autonomy. With a copilot, you’ve got to push the button to make it go eventually. Again, I don’t want to turn this into AI fear, but the fear people have of AI is, in my opinion, autonomous AI. If we can mitigate that fear by keeping things as copilots, then maybe that’s the way to go. But I think the key is autonomy: you have to agree to the copilot’s answer and make it go.
Thomas Betts: The agents can do stuff on their own, but maybe we have supervisor agents. Like you said, “I don’t know how to evaluate the output, so I’m going to ask ChatGPT, ‘Did I train my model correctly?’”, and you feed it back into yet another AI. The AI agent story is that you have supervisor agents who watch the other ones, and then it’s: who’s watching the watchers?
Anthony Alford: Watches the watchers? Yes, indeed.
Thomas Betts: Well, I really appreciate all your time. I learned a lot. I hope this was useful for the audience.
Anthony Alford: Me too.
Thomas Betts: It’s always good to go through and do this little refresher of “here’s what I think I understand”, but bounce it off someone who really knows. I’ll be sure to provide links to the things we mentioned. The eMag is great.
Anthony Alford: Yes.
Thomas Betts: Then the trends report and podcast and some other stuff. Anthony, thanks again for joining me on the InfoQ Podcast.
Anthony Alford: It was a pleasure. Thanks for having me.
Thomas Betts: And we hope you’ll join us again next time.
Alibaba released two open-weight language model families: Qwen2-Math, a series of LLMs tuned for solving mathematical problems; and Qwen2-Audio, a family of multi-modal LLMs that can accept voice or text input. Both families are based on Alibaba’s Qwen2 LLM series, and all but the largest version of Qwen2-Math are available under the Apache 2.0 license.
Qwen2-Math is available in a base version and an instruction-tuned version, each with a choice of 1.5B, 7B, or 72B parameters. Because most benchmark datasets are available on the internet, Alibaba conducted decontamination on their training datasets to remove mathematical problem-solving benchmark examples. After pre-training, the instruction-tuned models were trained with both supervised fine-tuning and reinforcement learning. On the popular MATH benchmark, the largest model, Qwen2-Math-72B-Instruct, outperformed state-of-the-art commercial models including GPT-4o and Claude-3.5-Sonnet. According to Alibaba,
Given the current limitation of English-only support, we plan to release bilingual models that support both English and Chinese shortly, with the development of multilingual models also in the pipeline. Moreover, we will continue to enhance our models’ ability to solve complex and challenging mathematical problems.
Besides MATH, Alibaba evaluated Qwen2-Math on benchmarks and mathematics exams, such as GSM8K and AIME 2024. They found that Qwen2-Math-Instruct had better performance than other baseline models of comparable size, “particularly in the 1.5B and 7B models.” The 72B parameter version achieved a score of 86.4 on the Chinese-language math exam benchmark CMATH, which Alibaba claims is a new high score. They also claim that it outperformed Claude, GPT-4, and Gemini on the AIME 2024 exam.
Alibaba published a technical report with more details on Qwen2-Audio. The model accepts both text and audio input, but can only output text. Depending on the type of audio input provided, the model can operate in two modes, Voice Chat or Audio Analysis. In Voice Chat mode, the input is a user’s speech audio, and the model acts as a chatbot. In Audio Analysis mode, the model can answer questions about the content of audio input. For example, given a clip of music, the model can identify the tempo and key of the song.
Andrew Ng’s newsletter The Batch covered Alibaba’s release, saying:
Qwen2 delivered extraordinary performance with open weights, putting Alibaba on the map of [LLMs]. These specialized additions to the family push forward math performance and audio integration in AI while delivering state-of-the-art models into the hands of more developers. It’s thrilling to see models with open weights that outperform proprietary models. The white-hot competition between open and closed technology is good for everyone!
Users on Reddit discussed both model series. One user described Qwen2-Math-7B as “punching really high and hard for its size.” Another user said of Qwen2-Audio:
It would be very interesting to try to synthesize audio output using this model. The audio encoder is almost identical to WhisperSpeech one. Although Qwen2 is using Whisper-large-v3 which would probably require retraining of the WhisperSpeech acoustic model. If successful, that would be basically equivalent to GPT4o advanced voice mode running locally.
The model files for Qwen2-Math and Qwen2-Audio can be downloaded from Hugging Face.