Google Enhances LiteRT for Faster On-Device Inference

By Sergio De Simone

Article originally posted on InfoQ.

The latest release of LiteRT, formerly known as TensorFlow Lite, introduces a new API that simplifies on-device ML inference, along with enhanced GPU acceleration, support for Qualcomm NPU (Neural Processing Unit) accelerators, and advanced inference features.

One of the goals of the latest LiteRT release is to make it easier for developers to harness GPU and NPU acceleration, which previously required working with specific APIs or vendor-specific SDKs:

By accelerating your AI models on mobile GPUs and NPUs, you can speed up your models by up to 25x compared to CPU while also reducing power consumption by up to 5x.

For GPUs, LiteRT introduces MLDrift, a new GPU acceleration implementation that offers several improvements over TFLite’s GPU delegate. These include more efficient tensor-based data organization, context- and resource-aware smart computation, and optimized data transfer and conversion.

This results in significantly faster performance than CPUs, than previous versions of our TFLite GPU delegate, and even other GPU-enabled frameworks, particularly for CNN and Transformer models.

LiteRT also targets neural processing units (NPUs), which are AI-specific accelerators designed to speed up inference. According to Google’s internal benchmarks, NPUs can deliver up to 25× faster performance than CPUs while using just one-fifth of the power. However, there is no standard way to integrate these accelerators, and doing so often requires custom SDKs and vendor-specific dependencies.

Thus, to provide a uniform way to develop and deploy models on NPUs, Google has partnered with Qualcomm and MediaTek to add support for their NPUs in LiteRT, enabling acceleration for vision, audio, and NLP models. This includes automatic SDK downloads alongside LiteRT, as well as opt-in model and runtime distribution via Google Play.

Moreover, to make handling GPU and NPU acceleration even easier, Google has streamlined LiteRT’s API to let developers specify the target backend to use when creating a compiled model. This is done with the CompiledModel::Create method, which supports CPU, XNNPack, GPU, NNAPI (for NPUs), and EdgeTPU backends, significantly simplifying the process compared to previous versions, which required different methods for each backend.
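As a rough illustration, the following C++ sketch shows what selecting a backend at model compilation time could look like. The header paths, the model-loading helper, and the accelerator constant are assumptions based on the article's description, not verified LiteRT signatures; only the CompiledModel::Create entry point is named in the article itself.

// Minimal sketch of backend selection with the LiteRT CompiledModel API.
// Header paths, the model-loading helper, and the accelerator constant are
// illustrative assumptions; consult the LiteRT documentation for exact usage.
#include "litert/cc/litert_compiled_model.h"  // assumed header path
#include "litert/cc/litert_model.h"           // assumed header path

using litert::CompiledModel;
using litert::Model;

int main() {
  // Load the .tflite model from disk (helper name assumed).
  auto model = Model::CreateFromFile("model.tflite");

  // Compile the model for the GPU backend; switching to another backend
  // (CPU, NNAPI for NPUs, etc.) would only change the accelerator argument.
  auto compiled = CompiledModel::Create(*model, kLiteRtHwAcceleratorGpu);

  // `compiled` can now allocate input/output buffers and run inference.
  return 0;
}

The point of the single entry point is that the accelerator choice becomes a parameter rather than a separate code path per backend.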

The LiteRT API also introduces features aimed at optimizing inference performance, especially in memory- or processor-constrained environments. These include buffer interoperability via the new TensorBuffer API, which eliminates data copies between GPU memory and CPU memory; and support for asynchronous, concurrent execution of different parts of a model across CPU, GPU, and NPUs, which, according to Google, can reduce latency by up to 2x.
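The sketch below suggests how zero-copy buffers and asynchronous execution could fit together. The type names, factory functions, and header paths are again assumptions modeled on the article's description rather than the documented API, and error handling is omitted.

// Sketch of zero-copy input handling plus asynchronous execution.
// All names and signatures below are assumptions for illustration only.
#include <android/hardware_buffer.h>           // Android hardware buffers
#include "litert/cc/litert_compiled_model.h"   // assumed header path
#include "litert/cc/litert_tensor_buffer.h"    // assumed header path

void RunZeroCopy(litert::CompiledModel& compiled, AHardwareBuffer* camera_frame) {
  // Wrap an existing hardware buffer as a model input instead of copying it
  // between GPU and CPU memory (factory name assumed, error handling omitted).
  auto input = litert::TensorBuffer::CreateFromAhwb(camera_frame);

  // Let the runtime allocate output buffers matching the model's signature.
  auto outputs = compiled.CreateOutputBuffers();

  // Launch inference without blocking the calling thread, so other work
  // (e.g. preprocessing the next frame on the CPU) can proceed concurrently.
  compiled.RunAsync({*input}, *outputs);
}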

LiteRT can be downloaded from GitHub and includes several sample apps demonstrating how to use it.
