MMS • Sabri Bolkar
Article originally posted on InfoQ. Visit InfoQ
Zero-copy and in-memory data manager Vineyard (v6d) has recently released version 0.13.2 which brought improved features for Python/C++ development, and Kubernetes deployment. It is maintained as a CNCF sandbox project and provides distributed operators that can be utilized to share immutable data within or across cluster nodes. V6d is of particular interest for deep network training (e.g. large language and graph models) on big (sharded) datasets. Its development is currently led by an Alibaba engineering team.
Zero-copy and in-memory data distribution is a central problem for many real-time applications. From image processing pipelines to deep learning models such as LLM and graph mining algorithms, many data-crunching applications require to ingest large data from many independent processes. In machine learning engineering, this bottleneck has become more evident as deep networks are getting larger and distribution of model parameters mandate access to shared state and data. As an early-stage project, V6d aims to bring a high-level API for such use cases.
Architectures of real-time applications generally exploit in-memory key-value storages/caches (e.g. etcd, Memcached, Redis) for storing and interchanging frequently reached data. According to service type, engineering teams have to consider related trade-offs that come with these tools. V6d consists of two main components: Apache Arrow Plasma-derived shared-memory data manager (within a node) and a metadata server backed by etcd (between different nodes). While the Plasma-derived service allows zero-copy data transfer, etcd service handles the global distribution of (possibly partitioned) data’s properties.
V6d places itself within the Python community. In a way, it can be considered to scale Python’s native multiprocess shared_memory to multiple machines for immutable blobs. V6d offers two different Python client interfaces IPCClient and RPCClient for manipulating local and remote objects respectively. Both client APIs permit uniform data insertion and retrieval patterns that are based on object IDs. However, v6d does not automatically move data between cluster nodes unless instructed to do so due to the high network cost of such operations.
We could present a simple example that can be run on a local machine, let’s start with creating a local v6d instance:
python -m vineyard --socket /tmp/vineyard.sock --size 16733650944
As the first step, let’s show how we can utilize Python’s native API. For this purpose we will create a dummy 10k resolution RGB image using NumPy and share it quickly using the shared_memory() interface:
import numpy as np
from multiprocessing import shared_memory
shape_, dtype_ = (3, 10000, 10000), np.uint8
array_to_share = np.random.randint(0, high=255, size=shape_, dtype=dtype_)
# Create shared memory
shm = shared_memory.SharedMemory(create=True, size=array_to_share.nbytes)
array_shm = np.ndarray(shape_, dtype=array_to_share.dtype, buffer=shm.buf)
array_shm[:] = array_to_share[:] # Here we need to copy as we use existing array
# Use the shared memory name, size and type info to retrieve data in another process
existing_shm = shared_memory.SharedMemory(name=shm.name)
array_retrieved = np.ndarray(shape=shape_, dtype=dtype_, buffer=existing_shm.buf)
Here, we could carry out the same operation using v6d:
import vineyard
client = vineyard.connect('/tmp/vineyard.sock')
array_id = client.put(array_to_share)
# Retrieve the previous array_to_share in another process
array_retrieved = client.get(array_id)
As shown above, the API is quite easy to use and propagates the dtype and array shape to the retrieved object. Because of the common array protocol (aka buffer protocol), the NumPy interface also accepts zero-copy operations on PyTorch, TensorFlow, and MxNet tensors. In addition to that, v6d enables same operations on Pandas/Arrow dataframes. Further information on library integrations can be reached in the related documentation page. An example machine learning training tutorial can also be found in the webpage.
For multi-node settings, V6d allows deployment of vineyard operators on Kubernetes clusters via the Python API and Helm charts. A more detailed overview of the architecture is also provided in the official documentation.