RCO Integration – Production‑Ready Runbook
Version: v1.3 (latest)
Prepared: 2025‑10‑09
Scope: Deployment of RCO – Remote‑Call Orchestrator to production while satisfying Helix Core Ethos guardrails.
Document History
| Version | Date | Author(s) | Highlights |
|---|---|---|---|
| v1.0 | 2024‑xx‑xx | Initial author | Baseline Helm‑native deployment, security baselines, observability. |
| v1.1 | 2024‑xx‑xx | – | Added progressive delivery, policy enforcement, secret hygiene. |
| v1.2 | 2025‑04‑15 | – | Renamed service to RCO, unified Helm‑native --atomic --wait, added data‑store modelling, migration/backup gates, stateful rollback, progressive delivery (Istio/Argo Rollouts), pod‑security baselines, policy enforcement (Gatekeeper/Kyverno). |
| v1.3 | 2025‑10‑09 | OpenAI Support (red‑flag review) | Final Review Gate Checklist (Section 16), clarified RCO vs RCOT naming, tightened secret‑hygiene verification, documented RTO/RPO targets and required rollback dry‑run ≤ 30 days prior to cut‑over. |
1. Scope & Objectives
| Item | Description |
|---|---|
| System | RCO – Remote‑Call Orchestrator – coordinates API calls, event routing, and workflow execution for downstream applications. |
| Goal | Deploy a repeatable, auditable, reversible production integration that satisfies all Helix Core Ethos guardrails. |
| Audience | Platform engineers, security officers, SREs, QA, product owners, compliance auditors, and data‑privacy officers. |
| Assumptions | |
2. Prerequisites
| Category | Requirement | Verification |
|---|---|---|
| Infrastructure | Kubernetes 1.27+; namespace rco-prod; NetworkPolicies allowing only approved egress/ingress | kubectl get ns rco-prod ✓ |
| Code & Artifacts | Deterministic Dockerfile; image signed with Cosign; Helm chart version‑pinned in immutable registry | cosign verify ✓; digest matches manifest |
| Secrets & Config | Secrets in Vault via External Secrets; least‑privilege policies | vault policy read rco-prod ✓ |
| Compliance | SBOM, static analysis, data‑flow diagram reviewed and approved | Checklist ✓ |
| Team Readiness | Runbook reviewed & signed‑off by PO, SRE Lead, Security Lead; on‑call rotation updated | Sign‑offs archived |
| Backup / Restore | Latest DB snapshot stored in S3; RTO ≤ 15 min, RPO ≤ 5 min documented | aws s3 ls ✓ |
If any verification step cannot be performed, the status is unknown.
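The three-way outcome described above (pass, fail, unknown) could be recorded with a small helper like the one below. This is a minimal sketch, not part of the pipeline: the function name check_status is hypothetical, and it treats "tool not installed" as the only case of unknown.

```shell
# Classify one prerequisite check as pass, fail, or unknown.
# "unknown" means the verification tool is not available on this host,
# so the check could not be performed at all.
check_status() {
  # $1 = tool name; remaining arguments = the full command to run
  tool=$1
  shift
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "unknown"
    return 0
  fi
  if "$@" >/dev/null 2>&1; then
    echo "pass"
  else
    echo "fail"
  fi
}

# Example (needs cluster access, so it is left commented out):
# check_status kubectl kubectl get ns rco-prod
```

A command that runs but fails for an unrelated reason (expired credentials, network outage) is reported as fail here; a stricter version would need to distinguish those cases.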
3. Roles & Responsibilities
| Role | Responsibilities |
|---|---|
| Product Owner (PO) | Approve go‑live; confirm business requirements; give recorded consent for irreversible actions (schema migrations, data‑store changes). |
| Platform Engineer (PE) | IaC provisioning; Helm deployment (deterministic, atomic); verify image signatures. |
| Security Engineer (SE) | Secret handling hygiene, image scanning, audit‑log configuration, Vault token revocation checks. |
| Quality Assurance (QA) | Run integration & smoke tests; validate OpenAPI contracts; verify probes. |
| Site Reliability Engineer (SRE) | Configure monitoring & alerts; conduct rollback dry‑runs; maintain post‑deployment health dashboards. |
| Compliance Auditor | Verify runbook adherence; custody of evidence (SBOM, logs, audit trails). |
| Data‑Privacy Officer (DPO) | Approve handling of pseudonymous user_id in logs/traces; ensure GDPR‑compliant retention. |
Human‑First Gate – Any irreversible action (e.g., DB schema migration, feature‑flagged toggle) requires explicit, recorded PO confirmation.
4. Architecture Overview
+-------------------+      +-------------------+      +-------------------+
|    Client Apps    | ---> |    API Gateway    | ---> |    RCO Service    |
|   (Web/Mobile)    |      |   (Istio/Envoy)   |      | (K8s Deployment)  |
+-------------------+      +-------------------+      +-------------------+
                                     ^                          |
                                     |                          v
                           +-------------------+      +-------------------+
                           |   Auth Provider   |      |  Downstream APIs  |
                           +-------------------+      +-------------------+
                                     |                          |
                                     v                          v
                           +-------------------+      +-------------------+
                           | Postgres (RCO DB) |      |   Redis (Cache)   |
                           +-------------------+      +-------------------+
                                     |
                                     v
                           +-------------------+
                           |  S3/Object Store  |
                           | (artifacts/logs)  |
                           +-------------------+
Tracing: RCO → OpenTelemetry SDK → Collector → Tempo/Jaeger
Security: All traffic mTLS; RBAC at gateway; egress restricted via NetworkPolicy/Egress GW.
Key components:
- RCO Service (stateless front‑end)
- Postgres (primary data store)
- Redis (cache)
- S3 (artifact & log archive)
5. Deployment Procedure (Helm‑Native, Deterministic & Auditable)
All steps are executed from CI/CD pipelines, and each command is logged to /var/log/rco-runbook.log.
5.1 Pre‑flight Security & Integrity
# Verify image signature by digest
cosign verify --key cosign.pub registry.example.com/rco@sha256:<DIGEST>
# Fail build on critical/high vulnerabilities
trivy image --exit-code 1 --severity CRITICAL,HIGH registry.example.com/rco@sha256:<DIGEST>
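Both commands above assume the image reference is pinned by digest rather than by tag. A pipeline could guard that assumption with a check like the following sketch; the function name is_digest_pinned is hypothetical and only validates the shape of the reference, not the signature.

```shell
# Return 0 only if the image reference ends in a sha256 digest
# (exactly 64 lowercase hex characters after "@sha256:").
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Example use in a pipeline step (IMAGE is a hypothetical variable):
# is_digest_pinned "$IMAGE" || { echo "image is not pinned by digest" >&2; exit 1; }
```

Running this before cosign verify and trivy image makes a tag-based or placeholder reference fail fast instead of being scanned by mistake.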
5.2 Helm Values – Explicit Configuration
values-prod.yaml must list every key (see Appendix A for the security excerpt). No implicit defaults.
5.3 Deploy / Upgrade (Atomic)
helm upgrade --install rco ./helm/rco \
--namespace rco-prod \
-f values-prod.yaml \
--atomic --wait --timeout 10m
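The --atomic flag rolls the release back automatically if the upgrade fails; --wait blocks until all resources are ready or the timeout expires. Helm sets --wait implicitly when --atomic is used, so it is listed here only for explicitness.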
5.4 Secrets Sanity Check
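kubectl exec deploy/rco -n rco-prod -- sh -c 'sha256sum /etc/secrets/*'
# The glob is quoted so that it expands inside the container, not in the local shell.
# Compare the output against the Vault checksums (stored in audit logs).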
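The comparison step could be automated along these lines. This is a sketch under assumptions: both inputs are files in sha256sum output format, and the file holding the expected checksums is a hypothetical export from the audit logs (the runbook does not specify its format).

```shell
# Compare two checksum listings in sha256sum format ("<hash>  <path>").
# Returns 0 when both files contain the same lines, regardless of order;
# returns 2 if either file cannot be read.
compare_checksums() {
  actual=$1     # output captured from the pod
  expected=$2   # hypothetical export of the expected checksums
  a_sorted=$(sort "$actual") || return 2
  e_sorted=$(sort "$expected") || return 2
  [ "$a_sorted" = "$e_sorted" ]
}

# Example (file names are placeholders):
# compare_checksums pod-checksums.txt vault-checksums.txt || echo "secret mismatch" >&2
```

Sorting both sides keeps the result independent of the order in which the shell expands the file names.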
5.5 Automated Tests
./ci/run-integration-tests.sh # must exit 0
curl -sfS https://api.example.com/rco/healthz # expect HTTP 200
5.6 Progressive Delivery (Canary)
Choose Istio or Argo Rollouts:
| Method | Steps |
|---|---|
| Istio | Adjust VirtualService weights: 1% → 10% → 50% → 100% |
| Argo Rollouts | Define steps with metric analysis gates (Prometheus) |
Promotion criteria (each step ≥ 10 min):
- p95 latency < 350 ms: histogram_quantile(0.95, sum(rate(rco_request_latency_seconds_bucket[5m])) by (le))
- Error rate < 0.5% (5xx / total)
- Pods Ready = 100%
- No SLO burn alerts
Human Confirmation: PO signs off before moving from 50% → 100% traffic.
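The numeric promotion criteria can be expressed as a single gate function. The sketch below assumes the five values have already been collected (for example from Prometheus and kubectl); the function name can_promote and its argument order are hypothetical.

```shell
# Decide whether a canary step may be promoted.
# Arguments: p95 latency (ms), error rate (%), ready pods, desired pods,
#            number of active SLO burn alerts.
# Thresholds mirror the promotion criteria: p95 < 350 ms, errors < 0.5%,
# all pods ready, no burn alerts.
can_promote() {
  awk -v p95="$1" -v err="$2" -v ready="$3" -v desired="$4" -v alerts="$5" '
    BEGIN {
      ok = (p95 + 0 < 350) && (err + 0 < 0.5) &&
           (desired + 0 > 0) && (ready + 0 == desired + 0) &&
           (alerts + 0 == 0)
      exit(ok ? 0 : 1)
    }'
}

# Example: 320 ms p95, 0.2% errors, 6 of 6 pods ready, no alerts.
# can_promote 320 0.2 6 6 0 && echo "criteria met"
```

The function covers only the measurable thresholds. The minimum soak time of 10 minutes per step and the PO sign-off before 100% traffic remain separate checks.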
6. Configuration Details (Open Interfaces & Least Privilege)
| Config | Location | Description | Security Notes |
|---|---|---|---|
| values-prod.yaml | Helm chart | Replicas, resources, probes, feature flags; see Appendix A | No plaintext secrets; all via secretRef |
| rco-config.yaml | ConfigMap | Timeouts, retries, allow‑listed downstream endpoints | Whitelist only; PO approval required |
| AuthorizationPolicy | Istio | Gateway‑level RBAC | Based on request.auth.principal claims |
| Gatekeeper/Kyverno policies | Cluster‑wide | Enforce image digest, runAsNonRoot, read‑only FS, egress allow‑list | Hardening baseline |
7. Monitoring, SLOs & Observability
Metrics (Prometheus)
- rco_request_latency_seconds_bucket (histogram)
- rco_requests_total{code}
- rco_inflight_requests
- rco_downstream_failures_total{target}
PromQL examples
- p95 latency: histogram_quantile(0.95, sum(rate(rco_request_latency_seconds_bucket[5m])) by (le))
- Error rate: sum(rate(rco_requests_total{code=~"5.."}[5m])) / sum(rate(rco_requests_total[5m]))
Logs – JSON, no PII. Required fields: timestamp, level, trace_id, request_id, user_id (pseudonymous), route, code.
Tracing – OpenTelemetry SDK → Collector → Tempo/Jaeger (W3C traceparent propagated).
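A smoke test could confirm that emitted log lines carry the required fields listed above. The following is a rough sketch only: has_required_fields is a hypothetical helper that does a textual match on key names and does not parse JSON, so it expects keys written as "name": with no space before the colon.

```shell
# Check that a single JSON log line mentions every required field name.
# Prints the first missing field and returns 1, or returns 0 if all are present.
has_required_fields() {
  line=$1
  for field in timestamp level trace_id request_id user_id route code; do
    case $line in
      *"\"$field\":"*) ;;                      # key present, keep checking
      *) echo "missing: $field"; return 1 ;;
    esac
  done
  return 0
}

# Example (sample line is illustrative, not real output):
# has_required_fields '{"timestamp":"...","level":"info"}'
```

For production validation a real JSON parser would be the safer choice, since a field name appearing inside a string value would also satisfy this check.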
SLOs
| Indicator | Target |
|---|---|
| Availability (30 d) | 99.9% |
| Latency (p95) | < 350 ms |
| Error budget | ≤ 0.5% 5xx per hour |
| Alert on fast burn (2 h) & slow burn (24 h) | Human acknowledgement required |
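As a worked example of what the availability target implies, the sketch below converts an SLO into allowed downtime and computes a burn rate using the common definition (observed error ratio divided by allowed error ratio). The function names are hypothetical, and the runbook itself does not fix burn-rate thresholds for the 2 h and 24 h alerts.

```shell
# Allowed downtime in minutes for an availability SLO over a window of days.
# error_budget_minutes <slo_percent> <window_days>
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" 'BEGIN {
    printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100
  }'
}

# Burn rate = observed error percentage / allowed error percentage.
# burn_rate <observed_error_percent> <slo_percent>
burn_rate() {
  awk -v obs="$1" -v slo="$2" 'BEGIN {
    printf "%.1f\n", obs / (100 - slo)
  }'
}

error_budget_minutes 99.9 30   # prints 43.2
```

A 99.9% target over 30 days therefore leaves about 43 minutes of downtime, and a sustained 0.5% error rate against that target consumes the budget five times faster than planned.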
8. Data Stores, Migrations & Backups
- Postgres – primary relational store
- Redis – cache layer
- S3 – artifact & log archive
Migration Gate
- Backup current DB snapshot (tagged with build SHA).
- Validate snapshot in staging; run integration tests.
- PO sign‑off recorded in ticket.
Migration Execution
- Use dbmate / liquibase with forward‑compatible change sets.
- Apply during 1% canary; monitor rco_downstream_failures_total.
Stateful Rollback
- If a migration breaks compatibility, restore the latest snapshot or run the down‑migration, then roll back the app version.
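Before relying on "latest snapshot" for a restore, it helps to confirm that the snapshot is recent enough for the documented RPO of 5 minutes. The sketch below assumes the snapshot time is available as epoch seconds; the function name backup_within_rpo and the variable LAST_BACKUP_EPOCH are hypothetical.

```shell
# Return 0 if the newest backup is no older than the RPO.
# backup_within_rpo <backup_epoch_seconds> <now_epoch_seconds> <rpo_seconds>
backup_within_rpo() {
  age=$(( $2 - $1 ))
  # A negative age means the timestamp lies in the future; treat it as invalid.
  [ "$age" -ge 0 ] && [ "$age" -le "$3" ]
}

# Example with the 5-minute RPO (300 seconds):
# backup_within_rpo "$LAST_BACKUP_EPOCH" "$(date +%s)" 300 || echo "snapshot too old" >&2
```

Snapshot age is only a proxy for RPO; continuous WAL archiving or replication, if in use, would need its own lag check.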
9. Policy Enforcement (Make it hard to do the wrong thing)
- Gatekeeper / Kyverno
  - Images must be pinned by digest.
  - runAsNonRoot, readOnlyRootFilesystem, drop all capabilities.
  - Require liveness/readiness/startup probes.
  - Deny egress except via approved EgressGateway host list.
- Istio PeerAuthentication – STRICT mTLS within rco-prod.
- NetworkPolicy – restrict inbound/outbound to approved services only.
10. Security Baselines
Pod / Container Security (Helm values)
podSecurityContext:
  runAsNonRoot: true
  seccompProfile: { type: RuntimeDefault }
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities: { drop: ["ALL"] }
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits: { cpu: "1", memory: "512Mi" }
probes:
  liveness: { httpGet: { path: /healthz, port: 8080 }, initialDelaySeconds: 15, periodSeconds: 10 }
  readiness: { httpGet: { path: /readyz, port: 8080 }, initialDelaySeconds: 5, periodSeconds: 5 }
  startup: { httpGet: { path: /startupz, port: 8080 }, failureThreshold: 30, periodSeconds: 2 }
- Image Integrity – Cosign verification; Trivy scan (fail on CRITICAL/HIGH).
- Secrets Hygiene – Quarterly rotation, never written to logs or crash dumps, orphan token revocation, masked in debug output.
11. Incident Response & Rollback
11.1 Stateless Rollback
helm rollback rco <REVISION> --namespace rco-prod --wait
11.2 Stateful Rollback
- Stop traffic – set canary weight to 0%.
- Restore DB from latest verified snapshot or run down‑migration.
- Roll back the app to the previous version via Helm.
- Validate health endpoints, run integration tests.
- Promote traffic back following progressive delivery criteria.
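The first four steps above lend themselves to a script that can also be run as a dry run, which is what the rollback rehearsal requires. The following is a sketch under assumptions: the two ./scripts/ helpers are hypothetical placeholders for whatever tooling sets the canary weight and restores the snapshot, and DRY_RUN is an illustrative variable name.

```shell
# Run one rollback step, or only print it when DRY_RUN=1.
run_step() {
  desc=$1
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY-RUN [$desc]: $*"
    return 0
  fi
  echo "RUN [$desc]: $*"
  "$@"
}

# Steps 1-4 of the stateful rollback, chained with && so the sequence
# stops at the first failure. The two ./scripts/ helpers are hypothetical.
stateful_rollback() {
  revision=$1
  run_step "stop canary traffic" ./scripts/set-canary-weight.sh 0 &&
  run_step "restore database" ./scripts/restore-snapshot.sh latest &&
  run_step "roll back app" helm rollback rco "$revision" --namespace rco-prod --wait &&
  run_step "integration tests" ./ci/run-integration-tests.sh
}

# Dry-run example (prints the four commands without executing them):
# DRY_RUN=1; stateful_rollback 41
```

The final step, promoting traffic back, is deliberately left out because it follows the progressive delivery criteria and its human confirmation.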
11.3 Post‑mortem
- Document root cause, impact, corrective actions.
- Attach metrics, traces, audit logs, and any relevant compliance evidence.
12. Compliance Checklist (Helix Core Ethos)
| Pillar | Evidence |
|---|---|
| Trust‑by‑Design | Signed images, deterministic builds, full audit log of every command |
| Human‑First | PO sign‑off at each promotion checkpoint and before any data migration |
| Verifiable Memory | Git tags, immutable artifact registry, SBOM stored alongside release |
| Open Interfaces | Versioned OpenAPI spec, no hidden endpoints |
| Responsible Power | Rate limits, minimal RBAC, egress allow‑list |
| Reliability over Hype | Canary deployment, health checks, automatic rollback |
| Craft & Care | Peer‑reviewed config, progressive delivery rehearsals, dry‑run documentation |
Guardrails – enforced throughout:
- No hidden training on private data.
- No dark‑pattern UI/UX.
- No unverifiable performance claims.
- No irreversible actions without explicit, recorded PO confirmation.
13. Change Management & Documentation
- Create Change Request – link this runbook version (v1.3).
- Attach Artifacts – release manifest, SBOM, scan reports, test results, migration plan.
- Obtain Approvals – PO, SE, SRE Lead, DPO, Compliance (digital signatures).
- Schedule Deployment Window – notify stakeholders ≥ 48 h in advance.
- Post‑deployment – update runbook with deviations, lessons learned; store all artifacts in the Helix Core repository (read‑only for auditors).
14. Glossary
| Term | Meaning |
|---|---|
| RCO | Remote‑Call Orchestrator (the service covered by this runbook) |
| RCOT | Reflective Consistency Over Time (a separate Helix metric) |
| TTD | Time‑to‑Decision – transparent evidence for decisions |
| SBOM | Software Bill of Materials |
| Deterministic Interface | Interface whose output depends solely on input + documented state |
| Graceful Degradation | Defined fallback behavior when a downstream dependency fails |
| RTO / RPO | Recovery Time Objective / Recovery Point Objective for data stores |
| SLO | Service Level Objective (e.g., 99.9% availability) |
| PO | Product Owner |
| PE | Platform Engineer |
| SE | Security Engineer |
| QA | Quality Assurance |
| SRE | Site Reliability Engineer |
| DPO | Data‑Privacy Officer |
15. Appendix A – Baseline Helm Values (Security Excerpt)
image:
  repository: registry.example.com/rco
  digest: "sha256:<DIGEST>"
podSecurityContext:
  runAsNonRoot: true
  seccompProfile: { type: RuntimeDefault }
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities: { drop: ["ALL"] }
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /startupz
    port: 8080
  failureThreshold: 30
  periodSeconds: 2
NetworkPolicies, Gatekeeper/Kyverno, and Istio manifests are managed separately and referenced in Section 9.
16. Final Review Gate Checklist (Support Review • 2025‑10‑09)
Independent red‑flag review conducted by OpenAI Support – no critical blockers. Use this gate before 100% traffic cut‑over.
| Checklist Item | Pass / Fail |
|---|---|
| A. Acronym Clarity – All dashboards, logs, and traces label the orchestrator as RCO and the metric as RCOT. | ☐ PASS |
| B. Secret Hygiene – No secrets appear in logs or crash dumps; Vault policies are least‑privilege; orphaned tokens revoked. | ☐ PASS |
| C. Migration Controls – RTO ≤ 15 min, RPO ≤ 5 min documented; rollback dry‑run performed within last 30 days. | ☐ PASS |
| D. Image & Dependency Scanning – All images scanned; no CRITICAL/HIGH findings remain. | ☐ PASS |
| E. Progressive Delivery Validation – Canary steps verified against latency & error‑rate thresholds. | ☐ PASS |
| F. Policy Enforcement – Gatekeeper/Kyverno rules applied and validated in a staging cluster. | ☐ PASS |
| G. Monitoring & Alerting – SLO/SLA alerts fire and require human acknowledgment. | ☐ PASS |
| H. Documentation Completeness – All artifacts stored in Helix Core repository; evidence retrievable for audit. | ☐ PASS |
| I. RTO / RPO Verification – Latest backup timestamps confirm RTO ≤ 15 min, RPO ≤ 5 min. | ☐ PASS |
| J. Rollback Dry‑Run – Successful rollback (stateless & stateful) performed ≤ 30 days prior to release. | ☐ PASS |
| K. DPO Sign‑off – Pseudonymous user identifiers approved for log retention. | ☐ PASS |
Human Confirmation Required:
- Product Owner: _______________________ (signature & timestamp)
- Data‑Privacy Officer: _______________________ (signature & timestamp)
If any item cannot be confirmed, the status is unknown and the deployment must be paused until clarification is obtained.
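The pause rule above can be captured as a simple aggregation over the checklist results. This is a minimal sketch: gate_decision is a hypothetical name, and the results are assumed to be passed in as PASS or any other value (FAIL, UNKNOWN, blank) for items that were not confirmed.

```shell
# Aggregate checklist results into one gate decision.
# Prints "proceed" only if every item is PASS; any other value, or an
# empty checklist, prints "paused" and returns 1.
gate_decision() {
  [ "$#" -gt 0 ] || { echo "paused"; return 1; }
  for result in "$@"; do
    [ "$result" = "PASS" ] || { echo "paused"; return 1; }
  done
  echo "proceed"
}

# Example: one unconfirmed item pauses the cut-over.
# gate_decision PASS PASS UNKNOWN
```

The script result does not replace the Product Owner and Data‑Privacy Officer signatures; it only prevents a cut‑over from being started while an item is still open.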
End of Runbook
All sections are intended to be read together; any deviation from the prescribed steps must be documented and approved through the change‑management process.
