Experiment Recovery and Cleanup#
ommx.experiment records optimization work as one OMMX Artifact. For the Experiment data model, a runnable logging example, sharing, inspection, and forked Experiments, see Record and Share Experiments.
This guide focuses on the failure-time behavior: what OMMX writes before an Experiment is committed, how checkpoints are restored, and how Local Registry cleanup decides which blobs can be removed.
Storage Boundaries#
Experiment data is written in three layers.
Layer |
Stored as |
Role |
|---|---|---|
Blob |
Content-addressed files in the Local Registry |
Payload bytes for attachments, Instances, Solutions, run parameters, configs, and manifests |
Manifest |
An OCI Image Manifest blob |
The list of blobs that make one immutable OMMX Artifact |
Ref |
SQLite rows in the Local Registry index |
The name or checkpoint pointer that makes a manifest reachable |
In this page, publish means updating a Local Registry ref so it points to an already-written manifest. This is a local SQLite operation. It does not mean pushing an Artifact to a remote container registry.
Logging methods such as log_json() and log_solve() write payload bytes to the Local Registry immediately. OMMX does not wait until the final commit to write all bytes. If the same content is already present, the existing CAS blob is reused and its modification time is touched so recent active writes remain protected by GC grace periods.
A successful commit() writes the Experiment config and root manifest, then publishes the requested image reference in SQLite. Publishing a ref does not rewrite payload blobs. This ordering means a process can leave behind blob files that are not reachable from any manifest or ref; Local Registry GC handles that case.
Run Contexts and Experiment Commit#
Use Run objects as context managers. A Run is one trial, and closing it is
the recovery boundary that adds the closed Run to the parent Experiment’s
uncommitted state. After the Run is closed, OMMX writes a draft checkpoint for
that parent Experiment and publishes the checkpoint ref.
An Experiment does not have to be a context manager. In notebooks, a typical workflow keeps one Experiment open across multiple cells: run one trial, inspect plots and tables, decide the next condition, run another trial, and commit explicitly when the human workflow is finished.
from ommx.experiment import Experiment
image_name = "ghcr.io/example/team/experiment:baseline"
experiment = Experiment(image_name)
experiment.log_json("dataset", {"name": "demo"})
with experiment.run() as run:
run.log_parameter("capacity", 47)
# Inspect results, make plots, and decide the next condition.
with experiment.run() as run:
run.log_parameter("capacity", 64)
artifact = experiment.commit()
For batch scripts where all Runs are known in advance, with Experiment(...)
is a convenience: normal exit calls commit(), and exceptional exit publishes
a failed or interrupted checkpoint instead of advancing the successful image
reference.
Operation or event |
Stored state |
|---|---|
|
The closed Run is added to the parent Experiment with status |
|
The closed Run is added to the parent Experiment with status |
|
The final Experiment is committed, the requested image reference is published, and any local checkpoint for that Experiment is removed. |
|
Equivalent to calling |
|
The requested successful image reference is not advanced. A checkpoint Experiment is published with status |
A notebook kernel or process dies after a Run has closed but before |
Recovery starts from the latest Experiment draft checkpoint produced after a Run close. |
A notebook kernel or process dies before an open |
Payload blobs written by that open Run may exist, but they are not part of recoverable Run state. Recovery starts from the latest checkpoint before that Run. |
KeyboardInterrupt is recorded as "interrupted" for both Run and Experiment status. Other exceptions are recorded as "failed".
If you do not use Experiment as a context manager, exceptions outside a Run do
not automatically publish a failed Experiment checkpoint. The usual interactive
workflow relies on Experiment draft checkpoints produced after Run closes and an
explicit commit() when the Experiment is
ready to publish.
Restoring a Checkpoint#
Restore with the original Experiment image name.
from ommx.experiment import Experiment
image_name = "ghcr.io/example/team/experiment:baseline"
experiment = Experiment.restore_from_checkpoint(image_name)
with experiment.run() as run:
run.log_parameter("capacity", 64)
artifact = experiment.commit()
Checkpoint refs are internal Local Registry refs derived from the original image name. They are intentionally not exposed as normal Artifact handles, so keep the original image name if you want to resume.
Restoration returns an uncommitted Experiment, so it can be kept open across
notebook cells just like a newly created Experiment. Calling commit() publishes
the original requested image reference and removes the checkpoint. If the
restored Experiment is used as a context manager and fails again, OMMX publishes
a new failed or interrupted checkpoint instead of advancing the successful image
reference.
Reachability After Failure#
Local Registry cleanup is based on reachability from SQLite refs.
Data |
Reachable? |
Cleanup behavior |
|---|---|---|
A committed Experiment image ref |
Yes |
|
An Experiment checkpoint ref |
Yes |
|
A forked Experiment’s parent manifest through OCI |
Yes, if the child ref is kept |
|
Anonymous artifact refs |
Yes while the ref exists |
|
Blobs written by a process that died before manifest/ref publication |
No |
|
Blobs written by a currently active process |
Usually no until a checkpoint or commit exists |
|
OMMX does not store an orphan table in SQLite. Orphans are computed during each GC report by walking refs and manifests, then comparing that reachable set with the CAS files in the Local Registry.
Cleanup Workflow#
Run cleanup commands in report mode first.
ommx prune-anonymous
ommx gc
Both commands are dry-run by default and mutate the registry only with --delete.
ommx prune-anonymous --delete
ommx gc --delete
The same operations are available from the Python SDK. Python returns structured reports instead of formatted CLI output.
from ommx.artifact import gc, prune_anonymous
prune_report = prune_anonymous()
gc_report = gc()
prune_deleted = prune_anonymous(delete=True)
gc_deleted = gc(delete=True)
Use root=... to inspect a non-default Local Registry and
grace_period="2h" to override the GC grace period.
Use ommx prune-anonymous first when you have anonymous Artifact refs from temporary Artifact builds or unnamed archive imports. This command only removes matching SQLite refs; it does not unlink blobs. Those blobs become reclaimable by ommx gc if no other ref reaches them.
ommx gc performs a mark-sweep pass:
Roots are all SQLite refs, including Experiment checkpoint refs.
For each reachable manifest, GC marks the manifest blob, config blob, layer blobs, and OCI
subjectmanifest chain.Blob files outside the marked set are unreachable.
Unreachable blobs older than
--grace-periodare reported as orphan candidates.Unreachable blobs newer than
--grace-periodare reported as deferred.With
--delete, only orphan candidates are unlinked, and each candidate is checked again immediately before deletion.
The default grace period is 24h. The option accepts s, m, h, and d suffixes.
ommx gc --grace-period 2h
ommx gc --grace-period 0s
Use 0s only when you know no OMMX process is writing to that registry. For a shared or default Local Registry, keep a nonzero grace period so open Runs and interrupted imports are not deleted while they are still being written.
Normal reports show counts and byte sizes rather than raw digests. Add --show-digests when investigating a specific missing, invalid, orphan, or deferred blob.
ommx gc --show-digests
ommx gc --delete --show-digests
Use --root <path> to inspect or clean a non-default Local Registry.