PyTorch 1.12 Release Includes Accelerated Training on Macs and New Library TorchArrow

Anthony Alford

Article originally posted on InfoQ.

The PyTorch team has announced the release of version 1.12 of the open-source deep-learning framework. The release includes support for GPU-accelerated training on Apple silicon Macs and a new data preprocessing library, TorchArrow, as well as updates to other libraries and APIs.

The PyTorch team highlighted the major features of the release in a recent blog post. Support for training on Apple silicon GPUs using Apple’s Metal Performance Shaders (MPS) is released with “prototype” status, offering up to 20x speedup over CPU-based training. In addition, the release includes official support for M1 builds of the Core and Domain PyTorch libraries. The TorchData library’s DataPipes are now backward compatible with the older DataLoader class; the release also includes an AWS S3 integration for TorchData. The TorchArrow library features a Pandas-style API and an in-memory data format based on Apache Arrow and can easily plug into other PyTorch data libraries, including DataLoader and DataPipe. Overall, the new release contains more than 3,100 commits from 433 contributors since the 1.11 release.
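As a brief illustration of the DataPipe/DataLoader interoperability, the sketch below builds a small DataPipe graph and consumes it with the familiar DataLoader API; it assumes the separate torchdata package is installed, and the data and transforms are purely illustrative.

```python
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Build a small DataPipe graph: wrap an in-memory source, then chain transforms.
pipe = IterableWrapper(range(10)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# DataPipes are backward compatible with DataLoader, so the graph
# can be consumed with the familiar batching and collation machinery.
loader = DataLoader(pipe, batch_size=2)
for batch in loader:
    print(batch)  # tensors such as tensor([0, 6]), tensor([12, 18])
```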

Before the 1.12 release, PyTorch only supported CPU-based training on M1 Macs. With help from Apple’s Metal team, PyTorch now includes a backend based on MPS, with processor-specific kernels and a mapping of the PyTorch model computation graph onto the MPS Graph Framework. The Mac’s unified memory architecture gives the GPU direct access to memory, improving overall performance and allowing training with larger batch sizes and larger models.
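In practice, opting into the new backend amounts to selecting the mps device. A minimal sketch (the model and tensor shapes are illustrative only):

```python
import torch

# Use the MPS backend when it is available on this machine; fall back to CPU otherwise.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)
y = model(x)  # the forward pass runs on the Apple silicon GPU via MPS
```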

Besides support for Apple silicon, PyTorch 1.12 includes several other performance enhancements. TorchScript, PyTorch’s intermediate representation of models for runtime portability, now has a new layer fusion backend called NVFuser, which is faster and supports more operations than the previous fuser, NNC. For computer vision (CV) models, the release implements the Channels Last data format for use on CPUs, increasing inference performance by up to 1.8x over Channels First. The release also includes enhancements to the bfloat16 reduced-precision data type, which can provide up to 2.2x performance improvement on Intel® Xeon® processors.
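The following sketch shows how a CV model and its input might be switched to the Channels Last memory format for CPU inference, optionally combined with bfloat16 autocast; the ResNet-50 from torchvision is used purely as an illustrative model and torchvision is assumed to be installed.

```python
import torch
import torchvision.models as models

model = models.resnet50().eval()

# Convert model weights and the input tensor to the channels-last (NHWC) memory format.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)

# Optionally run inference in bfloat16 on CPUs that support it.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
```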

The release includes several new features and APIs. For applications requiring complex numbers, PyTorch 1.12 adds support for complex convolutions and the complex32 data type for reduced-precision computation. The release “significantly improves” support for forward-mode automatic differentiation, which computes directional derivatives eagerly in the forward pass. There is also a prototype implementation of a new class, DataLoader2, a lightweight data loader for executing a DataPipe graph.
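For readers unfamiliar with forward-mode automatic differentiation, the sketch below computes a directional derivative (Jacobian-vector product) in a single forward pass using PyTorch’s forward-AD API; the function and tensors are illustrative only.

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.randn(3)  # direction along which to differentiate

with fwAD.dual_level():
    # Attach the tangent to the input, run the computation, and read the
    # resulting tangent: the directional derivative of sin at `primal`.
    dual_input = fwAD.make_dual(primal, tangent)
    dual_output = torch.sin(dual_input)
    jvp = fwAD.unpack_dual(dual_output).tangent
```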

In the new release, the Fully Sharded Data Parallel (FSDP) API moves from prototype to Beta. FSDP supports training large models by distributing model weights and gradients across a cluster of workers. New features for FSDP in this release include faster model initialization, fine-grained control of mixed precision, enhanced training of Transformer models, and an API that supports changing sharding strategy with a single line of code.
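A minimal sketch of wrapping a model with FSDP, selecting a sharding strategy and a mixed-precision policy, might look like the following; the model choice and dtypes are illustrative, and the snippet assumes a distributed process group has already been initialized (for example via torchrun).

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# Assumes torch.distributed.init_process_group() has already been called.
model = nn.Transformer(d_model=512, nhead=8)

fsdp_model = FSDP(
    model,
    # Changing the sharding strategy is a single-argument change.
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    # Fine-grained control of mixed precision for parameters, gradient
    # reductions, and buffers.
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)
```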

AI researcher Sebastian Raschka posted several tweets highlighting his favorite features of the release. One user replied that the release:

Seems to have immediately broken some backwards compatibility. E.g. OpenAIs Clip models on huggingface now produce CUDA errors.

Hugging Face developer Nima Boscarino followed up to say that a fix would be available soon.

The PyTorch 1.12 code and release notes are available on GitHub.
