Running Helix on ZFS
Helix runs on top of a ZFS storage layer so the inference servers, schedulers and data tooling can assume fast snapshots, copy-on-write semantics and consistent recovery.
Layout
Each production cluster is split across mirrored vdevs. Application containers write into datasets mounted with sync=always for transactional metadata, and log-heavy workloads use sync=standard to let ZFS absorb bursts through the SLOG.
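As a sketch, the per-workload sync settings might be applied like this (the pool and dataset names are illustrative, not the real Helix hierarchy):

```shell
# Illustrative dataset names; a real cluster uses its own hierarchy.
# Transactional metadata: force every write through the intent log.
zfs create -o sync=always tank/helix/metadata

# Log-heavy workloads: default semantics, letting the SLOG absorb bursts.
zfs create -o sync=standard tank/helix/logs

# Confirm the effective property values.
zfs get sync tank/helix/metadata tank/helix/logs
```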
We isolate datasets per tenant so support teams can snapshot, roll back, or promote staging datasets without impacting other tenants.
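Because each tenant lives in its own dataset, the per-tenant operations reduce to ordinary ZFS commands; a sketch with hypothetical tenant names:

```shell
# Hypothetical tenant dataset names. Each tenant has its own dataset,
# so these operations never touch a neighbour's data.
zfs snapshot tank/helix/tenants/acme@pre-support-fix
zfs rollback tank/helix/tenants/acme@pre-support-fix

# Promote a staging clone so it becomes independent of its origin snapshot.
zfs promote tank/helix/tenants/acme-staging
```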
Snapshots and clones
Helix creates guard-rail snapshots on every deployment and before any automated migration. The control plane tags each snapshot with the git SHA and Helm revision, enabling point-in-time rollbacks.
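A minimal sketch of such a guard-rail snapshot; the naming scheme and the release name `helix` are assumptions, and the real control plane may encode the tags differently:

```shell
# Assumed naming scheme; the control plane may tag snapshots differently.
GIT_SHA=$(git rev-parse --short HEAD)
HELM_REV=$(helm history helix --max 1 -o json | jq -r '.[0].revision')

zfs snapshot "tank/helix/app@deploy-${GIT_SHA}-helm${HELM_REV}"

# A point-in-time rollback later reduces to:
#   zfs rollback tank/helix/app@deploy-<sha>-helm<rev>
```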
For regression testing we clone read-only datasets into ephemeral namespaces so load tests can re-use realistic volumes without copying terabytes of parquet and columnar artifacts.
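Since a clone shares blocks with its origin snapshot, standing up a realistic test volume is cheap; a sketch with illustrative dataset names:

```shell
# Clone a production snapshot into an ephemeral namespace for load tests.
# The clone shares blocks with its origin, so no terabytes are copied.
SNAP="tank/helix/warehouse@loadtest-$(date +%Y%m%d)"
zfs snapshot "$SNAP"
zfs clone -o readonly=on "$SNAP" tank/helix/ephemeral/loadtest

# Tear down with the namespace:
#   zfs destroy tank/helix/ephemeral/loadtest
```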
Self-healing
Scrubs run on a rolling basis and surface telemetry into our Prometheus stack. The alerting rules trip before checksum errors reach the query tier, and faulty disks are automatically offlined so the orchestrator can evacuate affected pods.
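The underlying commands are simple; a sketch (pool and device names are placeholders):

```shell
# Start a scrub; its progress and the per-device READ/WRITE/CKSUM counters
# from status output are what the exporters scrape into Prometheus.
zpool scrub tank
zpool status -v tank

# Offline a faulted disk so the orchestrator can evacuate affected pods.
zpool offline tank sdX   # placeholder device name
```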
We track dataset health with lightweight probes that verify expected table counts and checksums from the metadata API. When discrepancies appear the remediation workers replay write-ahead logs stored in object storage.
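A minimal, self-contained sketch of such a probe. The manifest contents are a stand-in, and the expected digest is computed locally here; in production it would come from the metadata API:

```shell
# Probe sketch: compare a dataset manifest's checksum against the value
# the metadata API recorded. Paths and manifest contents are placeholders.
MANIFEST=$(mktemp)
printf 'table_a 1200\ntable_b 4800\n' > "$MANIFEST"   # stand-in manifest

# In production the expected digest comes from the metadata API; here we
# compute it up front to keep the sketch self-contained.
expected=$(sha256sum "$MANIFEST" | cut -d' ' -f1)

actual=$(sha256sum "$MANIFEST" | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
    echo "probe ok"
else
    echo "probe mismatch: scheduling WAL replay" >&2
fi
rm -f "$MANIFEST"
```

On a mismatch the real workers would enqueue a WAL replay from object storage rather than just logging.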
Security and compliance
ZFS datasets are encrypted with per-tenant keys loaded via Vault. Replication streams use zfs send | zfs recv over WireGuard tunnels, and the receiving side verifies Merkle digests before promoting the snapshot.
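A sketch of the per-tenant encryption and replication path; the key path, host name, and the ssh hop are assumptions (the doc's WireGuard tunnel carries the stream either way):

```shell
# Per-tenant encrypted dataset; key material is delivered by Vault to a
# path like this (assumed location).
zfs create -o encryption=on -o keyformat=raw \
    -o keylocation=file:///run/vault/tenant-acme.key \
    tank/helix/tenants/acme

# Raw send (-w) ships the still-encrypted stream; the transport here is an
# illustrative ssh hop over the WireGuard tunnel. -u leaves it unmounted
# on the receiver until verification promotes the snapshot.
zfs send -w tank/helix/tenants/acme@nightly | \
    ssh replica.internal zfs recv -u tank/helix/tenants/acme
```

Sending raw means the replica never holds the tenant key, which is what makes the Merkle-digest check before promotion the trust boundary.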
Access to raw datasets is mediated through our job queue; automation cannot mount production datasets directly, protecting customer data from accidental leakage.
Operational tips
- Continuous integration exercises the same zpool layout using loopback devices so the code paths for snapshots and clones stay tested.
- Dataset quotas are enforced and surfaced back to the product so chart generation can block early instead of failing mid-flight.
- We ship trimmed zpool status dashboards so engineers can diagnose performance anomalies without shell access.
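The CI loopback layout and quota enforcement above can be sketched together; sizes, paths, and pool names are illustrative:

```shell
# CI sketch: build a throwaway mirrored pool on loopback files so the
# snapshot/clone code paths run without real disks.
truncate -s 512M /tmp/ci-disk0 /tmp/ci-disk1
zpool create citank mirror /tmp/ci-disk0 /tmp/ci-disk1

# Enforce a tenant quota; the product reads this back to block jobs early.
zfs create citank/tenant-a
zfs set quota=100M citank/tenant-a
zfs get -H -o value quota citank/tenant-a

# Tear the pool down at the end of the CI run.
zpool destroy citank
```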
Looking ahead
Upcoming work includes dataset-level anomaly detection, automated promotion of read replicas, and richer reporting from zpool events to shorten recovery times.