Safe rollout checklist
- Keep changes small (one logical change set per apply).
- Avoid changing many resource IDs at once.
- Push a version and check the diff before applying.
- Apply during a window where you can monitor results.
- Confirm run completion before stacking more changes.
High-risk changes to treat carefully
- deleting multiple resources,
- switching core resource shape that may require replacement,
- empty desired state (spec.resources: []).
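As a rough illustration, the high-risk cases above could be flagged before an apply by comparing the current and desired resource lists. This is a minimal sketch: the `id` and `type` fields, the helper name `high_risk_flags`, and treating a type change as a possible replacement are assumptions made for illustration, not part of the platform's API.

```python
def high_risk_flags(current, desired):
    """Return a list of warnings for a planned change.

    `current` and `desired` are lists of dicts with hypothetical
    `id` and `type` keys (assumed shape, for illustration only).
    """
    flags = []
    current_by_id = {r["id"]: r for r in current}
    desired_by_id = {r["id"]: r for r in desired}

    # Empty desired state while resources exist (spec.resources: []).
    if current and not desired:
        flags.append("empty desired state")

    # More than one resource would be deleted.
    deleted = sorted(set(current_by_id) - set(desired_by_id))
    if len(deleted) > 1:
        flags.append("deleting multiple resources: " + ", ".join(deleted))

    # Same ID but a different type: assumed here to imply replacement.
    for rid, res in desired_by_id.items():
        old = current_by_id.get(rid)
        if old is not None and old.get("type") != res.get("type"):
            flags.append("possible replacement: " + rid)

    return flags
```

An empty result does not mean the change is safe; it only means none of the three listed patterns were detected.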
Common failure categories
Policy or billing gate failures
These happen before queueing, such as:
- subscription not active,
- plan limits reached,
- spending limit exceeded.
Validation/configuration issues
The YAML is accepted structurally but contains a change that cannot be applied as requested.
Provider/runtime failures
The provider accepted part of the plan but one or more resource operations failed.
Fast recovery workflow
- Open run details and identify first failing resource.
- Fix the root cause (config, capacity, billing, or provider dependency).
- Re-apply (after pushing a new version when YAML changed) or resume the failed/canceled run.
- Confirm all expected resources reached final desired state.
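The first and last steps of this workflow can be sketched in code. The shape of the run details used here (an `operations` list in execution order, with `resourceId` and `status` fields and the values `failed` and `succeeded`) is a hypothetical structure chosen for illustration; the actual run details may look different.

```python
def first_failing_resource(run):
    """Return the first resource operation that failed, or None.

    `run` is a dict with a hypothetical `operations` list, in execution
    order, where each entry has `resourceId` and `status` (assumed shape).
    """
    for op in run.get("operations", []):
        if op.get("status") == "failed":
            return op
    return None


def all_reached_desired_state(run, expected_ids):
    """Check that every expected resource has a succeeded operation."""
    succeeded = {
        op["resourceId"]
        for op in run.get("operations", [])
        if op.get("status") == "succeeded"
    }
    return set(expected_ids) <= succeeded
```

Starting from the first failure matters because later failures are often side effects of the earlier one.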
When to use resume vs re-apply
Use resume when:
- same intended change,
- root cause fixed,
- the run is already failed or canceled,
- you want to continue the same run context (POST /deployments/runs/{runId}/resume).
Use re-apply when:
- you intentionally changed desired state,
- you pushed a new version and want a fresh run for that version (POST /deployments/apply),
- there is no resumable run you want to continue.
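The choice between the two can be expressed as a small decision function. This is a sketch only: the two paths are the ones named above, while the function name, its parameters, and the assumption that the root cause has already been fixed before calling it are illustrative.

```python
def next_action(run_status, desired_state_changed, run_id=None):
    """Pick the request to send after a failure.

    Returns an (HTTP method, path) tuple. The two paths come from the
    text above; the function and its parameters are illustrative.
    Assumes the root cause has already been fixed.
    """
    resumable = run_id is not None and run_status in ("failed", "canceled")
    if resumable and not desired_state_changed:
        # Same intended change: continue the same run context.
        return ("POST", f"/deployments/runs/{run_id}/resume")
    # Desired state changed, or nothing resumable: start a fresh run.
    return ("POST", "/deployments/apply")
```

For example, `next_action("failed", False, "r1")` selects the resume path for run `r1`, while any call with `desired_state_changed=True` selects a fresh apply.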
Rollback pattern
If a rollout fails functionally (not operationally), revert YAML, push a new version, diff it against the prior version, then apply.
Operational habits that improve reliability
- Maintain a predictable naming/ID strategy.
- Keep an internal change log for major deployment updates.
- Pair infrastructure changes with billing awareness for large scale-ups.