Aug 24, 2025

Samagra Sharma
Founder
By default, starting a container on Kubernetes requires containerd to pull the complete container image from a registry. This process is a significant bottleneck for AI workloads,
whose images often exceed 20 GB due to large model weights and dependencies like CUDA and PyTorch.
The typical startup sequence consists of three time-consuming, sequential steps:
- Download: Transferring all image layers from the remote registry to the node. This is network-bound.
- Decompress: Unpacking each gzipped layer. This is CPU-bound and often single-threaded.
- Write & Mount: Writing the decompressed files to the node’s local disk and constructing a union filesystem using a snapshotter like overlayfs. This is I/O-bound.
Only after all three steps complete can the container’s ENTRYPOINT execute. For a 20 GB image, this sequence can
take over 10 minutes. However, typically only a small fraction of the image data is required
for the application to initialize. This inefficiency leads to long cold start times, forcing teams to overprovision expensive GPU resources
to keep “warm” instances available.
Tensorfuse Architecture: On-Demand File Access
Tensorfuse solves this problem by implementing a containerd remote snapshotter. It replaces the default download-and-unpack model with an
on-demand, lazy-loading mechanism. This is achieved through two core components: a build-time image
indexer and a runtime FUSE-based daemon.
1. Build-Time: Creating a Seekable Image Index
The primary obstacle to lazy-loading is the OCI image format, which uses gzipped tarballs (tar.gz) for its layers. This format is a compressed stream, making random access to individual files impossible without decompressing the entire stream up to the desired file.
Tensorfuse addresses this with a build tool that converts standard OCI images into a highly optimized and seekable format
based on the Registry Accelerated File System design, while remaining compatible with OCI registries. This conversion process
fundamentally restructures the image by separating filesystem metadata from file data. The metadata is stored in a compact “bootstrap”
file, which acts as a comprehensive Table of Contents (TOC).
The file data itself is broken down into content-addressable chunks, or “blobs”. This architecture makes the entire filesystem
instantly seekable, enabling the runtime to fetch only the required data chunks for a specific file. This bypasses the need to
download or decompress the entire multi-gigabyte layer just to start the container.
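To make the metadata/data separation concrete, here is a minimal Go sketch of the kind of records such a bootstrap index might hold. The type and field names are illustrative assumptions, not Tensorfuse’s or RAFS’s actual on-disk format.

```go
// Illustrative sketch of a seekable image index ("bootstrap"/TOC).
// These types are assumptions for explanation, not a real format.
package rafsindex

// ChunkRef points at one slice of a compressed blob in the registry.
type ChunkRef struct {
	BlobDigest       string // content-addressable blob holding the chunk
	CompressedOffset int64  // byte offset of the chunk within that blob
	CompressedSize   int64  // length of the compressed chunk
	UncompressedSize int64  // length of the chunk after decompression
}

// FileEntry maps a single file in the image to the chunks that back it,
// so a read() on any file can be resolved to a few byte ranges.
type FileEntry struct {
	Path   string
	Mode   uint32
	Size   int64
	Chunks []ChunkRef
}

// Bootstrap is the "Table of Contents": the complete filesystem metadata,
// stored separately from file data and small enough to fetch at mount time.
type Bootstrap struct {
	Files map[string]FileEntry
}
```

Because every file resolves to explicit (blob, offset, length) triples, the runtime never has to scan a compressed stream to find a file’s data.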
2. Runtime: FUSE and Lazy-Loading
The Tensorfuse snapshotter runs as a daemon on each Kubernetes node. When containerd is instructed to create a container, the following occurs:
- Instead of pulling layers, the Tensorfuse daemon instantly mounts a FUSE (Filesystem in Userspace) filesystem. To the container, this virtual filesystem appears as if the entire image is present on local disk.
- When a process inside the container attempts to read a file (e.g., Python’s import torch), the Linux kernel intercepts the read() syscall and forwards it to the Tensorfuse daemon.
- The daemon consults the pre-generated Table of Contents (the RAFS bootstrap) to locate the file’s data within the compressed layer in the remote registry.
- It performs an HTTP Range Request to the registry, fetching only the small chunk of compressed data containing the file and its preceding decompression checkpoint (a sketch of this step follows the list).
- The daemon uses the checkpoint to initialize the decompressor and unpacks the small data segment in memory.
- The file’s contents are returned to the kernel, which satisfies the application’s read() call.
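As a rough illustration of the range-request step, the Go sketch below fetches a single compressed chunk from a registry blob and decompresses it in memory. For simplicity it assumes each chunk is independently gzip-compressed, which sidesteps the checkpoint bookkeeping a real implementation needs; the function and its parameters are hypothetical.

```go
// Hypothetical sketch: fetch one chunk of a registry blob via an HTTP
// Range request and decompress it in memory, assuming the chunk is an
// independently gzip-compressed segment.
package chunkfetch

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
)

// fetchChunk reads bytes [offset, offset+size) of the blob at blobURL and
// returns the decompressed contents, without downloading the rest of the layer.
func fetchChunk(blobURL string, offset, size int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return nil, err
	}
	// Ask the registry for only the bytes backing this chunk.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+size-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// 206 Partial Content confirms the registry honored the Range header.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("range request not honored: %s", resp.Status)
	}

	// Decompress just this small segment in memory.
	zr, err := gzip.NewReader(resp.Body)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```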
Integration with containerd
Tensorfuse integrates non-intrusively using containerd’s stable remote snapshotter gRPC API. The key interaction occurs during the image pull process.
- For each image layer, containerd calls the Prepare method on the Tensorfuse gRPC service.
- The Tensorfuse daemon, which only needs to mount the FUSE filesystem, immediately returns an ErrAlreadyExists error.
- This specific error code signals to containerd that the snapshotter can provide the layer’s contents without needing containerd to download and unpack it. containerd trusts this signal and skips the download for that layer.
This approach requires no modifications to containerd’s core code, preserving the stability and security of the standard container runtime.
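The Go sketch below shows roughly what this handshake looks like against containerd’s snapshotter interface, loosely modeled on existing remote snapshotters such as stargz-snapshotter. The label key and the prepareRemoteSnapshot helper are illustrative assumptions, not Tensorfuse’s actual code.

```go
// Sketch of the Prepare handshake for a lazy-loading remote snapshotter.
// Loosely modeled on public remote snapshotters; details are illustrative.
package snapshotter

import (
	"context"
	"fmt"

	"github.com/containerd/containerd/errdefs"
	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/snapshots"
)

// remoteLabel marks layers that can be served lazily (illustrative key).
const remoteLabel = "containerd.io/snapshot/remote"

type lazySnapshotter struct {
	snapshots.Snapshotter // embedded local snapshotter used as a fallback
}

func (s *lazySnapshotter) Prepare(ctx context.Context, key, parent string,
	opts ...snapshots.Opt) ([]mount.Mount, error) {

	// Collect the labels containerd attached to this snapshot request.
	var info snapshots.Info
	for _, o := range opts {
		if err := o(&info); err != nil {
			return nil, err
		}
	}

	// If the layer is available remotely, mount the FUSE view and signal
	// ErrAlreadyExists so containerd skips downloading and unpacking it.
	if _, ok := info.Labels[remoteLabel]; ok {
		if err := s.prepareRemoteSnapshot(ctx, key, info.Labels); err == nil {
			return nil, fmt.Errorf(
				"target snapshot %q: %w", key, errdefs.ErrAlreadyExists)
		}
	}

	// Otherwise fall back to the normal local snapshot path.
	return s.Snapshotter.Prepare(ctx, key, parent, opts...)
}

// prepareRemoteSnapshot mounts the FUSE filesystem backed by the bootstrap
// index for this layer (implementation omitted in this sketch).
func (s *lazySnapshotter) prepareRemoteSnapshot(
	ctx context.Context, key string, labels map[string]string) error {
	return nil
}
```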
Performance and Impact on vLLM
The architectural changes result in a dramatic reduction in startup time. The multi-minute download and decompression phases are eliminated entirely.

| Stage | Standard overlayfs | Tensorfuse Snapshotter | Improvement |
| --- | --- | --- | --- |
| Image Data & Unpack | ~12 minutes | Eliminated (On-Demand) | - |
| Time to ENTRYPOINT | ~12 min, 5 sec | ~2 seconds | > 360x |
| vLLM Server Ready | ~12 min, 30 sec | ~20 seconds | > 37x |