Gemma 3n Available for On-Device Inference Alongside RAG and Function Calling Libraries

By Sergio De Simone

Article originally posted on InfoQ.

Google has announced that Gemma 3n is now available in preview on the new LiteRT Hugging Face community, alongside many previously released models. Gemma 3n is a multimodal small language model that supports text, image, video, and audio inputs. It also supports fine-tuning, as well as customization through retrieval-augmented generation (RAG) and function calling using new AI Edge SDKs.

Gemma 3n is available in two parameter variants, Gemma 3n 2B and Gemma 3n 4B, both supporting text and image inputs, with audio support coming soon, according to Google. This marks a notable increase in size over the non-multimodal Gemma 3 1B, which launched earlier this year and required just 529MB to process up to 2,585 tokens per second on a mobile GPU.

Google pitches Gemma 3n at enterprise use cases where developers have the full resources of the device available to them, allowing for larger models on mobile. Field technicians with no service could snap a photo of a part and ask a question; workers in a warehouse or a kitchen could update inventory via voice while their hands are full.

According to Google, Gemma 3n uses selective parameter activation, a technique for efficient parameter management. As a consequence, the two models contain more parameters in total than the 2B or 4B that are active during inference.

Google highlights the ability for developers to fine-tune the base model and then convert and quantize it using new quantization tools available through Google AI Edge:

"With the latest release of our quantization tools, we have new quantization schemes that allow for much higher quality int4 post-training quantization. Compared to bf16, the default data type for many models, int4 quantization can reduce the size of language models by a factor of 2.5-4X while significantly decreasing latency and peak memory consumption."
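To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It is a simplification: it ignores quantization scales, zero-points, and layers kept at higher precision, which is why real-world savings land in the quoted 2.5-4X range rather than a clean 4X.

```kotlin
// Rough estimate of model weight size under bf16 vs. int4.
fun main() {
    val params = 4_000_000_000L   // a 4B-parameter model

    val bf16Bytes = params * 2    // bf16 = 16 bits = 2 bytes per weight
    val int4Bytes = params / 2    // int4 = 4 bits = 0.5 bytes per weight

    val gib = 1024.0 * 1024.0 * 1024.0
    println("bf16: %.1f GiB".format(bf16Bytes / gib))   // ~7.5 GiB
    println("int4: %.1f GiB".format(int4Bytes / gib))   // ~1.9 GiB
    println("ideal ratio: %.1fx".format(bf16Bytes.toDouble() / int4Bytes))
}
```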

As an alternative to fine-tuning, the models can be used for on-device retrieval-augmented generation (RAG), which enhances a language model with application-specific data. This capability is powered by the AI Edge RAG library, currently available only for Android, with support for other platforms coming in the future.

The RAG library implements a simple pipeline with several steps: data import, chunking and indexing, embedding generation, information retrieval, and response generation using an LLM. It allows full customization of the pipeline, including support for custom databases, chunking strategies, and retrieval functions, as the sketch below illustrates.
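The following Kotlin sketch shows the shape of such a pipeline. The interfaces are hypothetical stand-ins, not the AI Edge RAG library's actual API; they are only meant to show where the customization points (database, chunking strategy, retrieval) fit.

```kotlin
// Hypothetical stand-ins for the pipeline's pluggable stages.
interface Embedder { fun embed(text: String): FloatArray }

interface VectorStore {                       // the swappable "custom database"
    fun add(chunk: String, vector: FloatArray)
    fun nearest(query: FloatArray, k: Int): List<String>
}

interface Llm { fun generate(prompt: String): String }

class RagPipeline(
    private val embedder: Embedder,
    private val store: VectorStore,
    private val llm: Llm,
    private val chunker: (String) -> List<String>,   // custom chunking strategy
) {
    // Data import -> chunking -> embedding -> indexing.
    fun index(document: String) {
        for (chunk in chunker(document)) {
            store.add(chunk, embedder.embed(chunk))
        }
    }

    // Retrieval -> response generation with the retrieved context.
    fun answer(question: String, k: Int = 3): String {
        val context = store.nearest(embedder.embed(question), k)
        val prompt = buildString {
            appendLine("Answer using only this context:")
            context.forEach { appendLine("- $it") }
            append("Question: $question")
        }
        return llm.generate(prompt)
    }
}
```

Modeling each stage as an injected dependency mirrors the library's promise of full customization: swapping the vector store or the chunker does not touch the rest of the pipeline.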

Alongside Gemma 3n, Google announced the AI Edge On-device Function Calling SDK, likewise currently limited to Android, which enables models to call specific functions to execute real-world actions.

Rather than just generating text, an LLM using the Function Calling SDK can generate a structured call to a function that executes an action, such as searching for up-to-date information, setting an alarm, or making a reservation.

To integrate an LLM with an external function, you describe the function by specifying its name, a description that guides the LLM on when to use it, and the parameters it requires. This metadata goes into a Tool object that is passed to the large language model via the GenerativeModel constructor. The SDK then takes care of receiving function calls from the LLM, based on the descriptions you provided, and of sending execution results back to the model, as the sketch below outlines.
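The Kotlin sketch below traces that flow end to end. The type names echo those mentioned above (Tool, GenerativeModel), but the declarations are simplified stand-ins rather than the FC SDK's real signatures, and `set_alarm` and `scheduleAlarm` are hypothetical examples.

```kotlin
// Simplified stand-ins for the concepts described in the text.
data class Param(val name: String, val type: String, val description: String)
data class FunctionDecl(val name: String, val description: String, val params: List<Param>)
data class Tool(val functions: List<FunctionDecl>)
data class FunctionCall(val name: String, val args: Map<String, String>)

class GenerativeModel(private val tools: List<Tool>) {
    // Stub: a real model decides, from each function's description,
    // whether to answer in text or emit a structured call.
    fun send(prompt: String): FunctionCall? =
        FunctionCall("set_alarm", mapOf("time" to "07:30"))

    fun sendToolResult(result: String): String = "OK: $result"
}

fun main() {
    // 1. Describe the function: name, usage guidance, parameters.
    val setAlarm = FunctionDecl(
        name = "set_alarm",
        description = "Set an alarm for a given time. Use for reminders and wake-ups.",
        params = listOf(Param("time", "string", "24-hour time, e.g. 07:30")),
    )

    // 2. Wrap it in a Tool and hand it to the model at construction time.
    val model = GenerativeModel(tools = listOf(Tool(listOf(setAlarm))))

    // 3. Receive a structured call, execute it, send the result back.
    val call = model.send("Wake me up at half past seven")
    if (call != null && call.name == "set_alarm") {
        val time = call.args["time"] ?: return
        val ok = scheduleAlarm(time)                 // app-side action
        println(model.sendToolResult(if (ok) "alarm set for $time" else "failed"))
    }
}

fun scheduleAlarm(time: String): Boolean = true      // placeholder for a real alarm API
```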

If you want to take a closer look at these new tools, the best place to start is the Google AI Edge Gallery, an experimental app that showcases various models and supports text, image, and audio processing.
