RCO Integration – Production‑Ready Runbook

From Helix Project Wiki

Version: v1.3 (latest)
Prepared: 2025‑10‑09
Scope: Deployment of RCO – Remote‑Call Orchestrator to production while satisfying Helix Core Ethos guardrails.

Document History

Version Date Author(s) Highlights
v1.0 2024‑xx‑xx Initial author Baseline Helm‑native deployment, security baselines, observability.
v1.1 2024‑xx‑xx Added progressive delivery, policy enforcement, secret hygiene.
v1.2 2025‑04‑15 Renamed service to RCO, unified Helm‑native --atomic --wait, added data‑store modelling, migration/backup gates, stateful rollback, progressive delivery (Istio/Argo Rollouts), pod‑security baselines, policy enforcement (Gatekeeper/Kyverno).
v1.3 2025‑10‑09 OpenAI Support (red‑flag review) Added Final Review Gate Checklist (Section 16), clarified RCO vs RCOT naming, tightened secret‑hygiene verification, documented RTO/RPO targets and required rollback dry‑run ≤ 30 days prior to cut‑over.

1. Scope & Objectives

Item Description
System RCO – Remote‑Call Orchestrator – coordinates API calls, event routing, and workflow execution for downstream applications.
Goal Deploy a repeatable, auditable, reversible production integration that satisfies all Helix Core Ethos guardrails.
Audience Platform engineers, security officers, SREs, QA, product owners, compliance auditors, and data‑privacy officers.
Assumptions
  • Infrastructure as code (Terraform/CloudFormation)
  • Git repository with signed commits
  • Secrets stored in Vault/ASM
  • Monitoring stack (Prometheus, Grafana, Loki, Alertmanager, Tempo/Jaeger) operational.

2. Prerequisites

Category | Requirement | Verification
Infrastructure | Kubernetes 1.27+; namespace rco-prod; NetworkPolicies allowing only approved egress/ingress | kubectl get ns rco-prod
Code & Artifacts | Deterministic Dockerfile; image signed with Cosign; Helm chart version‑pinned in immutable registry | cosign verify ✓; digest matches manifest
Secrets & Config | Secrets in Vault via External Secrets; least‑privilege policies | vault policy read rco-prod
Compliance | SBOM, static analysis, data‑flow diagram reviewed and approved | Checklist ✓
Team Readiness | Runbook reviewed & signed‑off by PO, SRE Lead, Security Lead; on‑call rotation updated | Sign‑offs archived
Backup / Restore | Latest DB snapshot stored in S3; RTO ≤ 15 min, RPO ≤ 5 min documented | aws s3 ls

If any verification step cannot be performed, the status is unknown.
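
The "unknown" rule above can be made mechanical. The following is a minimal sketch, not part of the official pipeline: verify_step is a hypothetical helper name, and it treats only "command not found / not executable" as unknown, which is a simplifying assumption (an unreachable cluster, for example, would show up as failed).

```shell
# Hypothetical helper: run one verification command and report its status.
# Exit codes 126/127 mean the command could not be run at all, so the
# check was not performed and its status is reported as unknown.
verify_step() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    rc=0
  else
    rc=$?
  fi
  case $rc in
    0)       echo "$name: verified" ;;
    126|127) echo "$name: unknown" ;;
    *)       echo "$name: failed" ;;
  esac
  return "$rc"
}
```

A pipeline could call it once per table row, for example verify_step namespace kubectl get ns rco-prod, and pause the deployment on any result other than verified.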

3. Roles & Responsibilities

Role Responsibilities
Product Owner (PO) Approve go‑live; confirm business requirements; give recorded consent for irreversible actions (schema migrations, data‑store changes).
Platform Engineer (PE) IaC provisioning; Helm deployment (deterministic, atomic); verify image signatures.
Security Engineer (SE) Secret handling hygiene, image scanning, audit‑log configuration, Vault token revocation checks.
Quality Assurance (QA) Run integration & smoke tests; validate OpenAPI contracts; verify probes.
Site Reliability Engineer (SRE) Configure monitoring & alerts; conduct rollback dry‑runs; maintain post‑deployment health dashboards.
Compliance Auditor Verify runbook adherence; custody of evidence (SBOM, logs, audit trails).
Data‑Privacy Officer (DPO) Approve handling of pseudonymous user_id in logs/traces; ensure GDPR‑compliant retention.

Human‑First Gate – Any irreversible action (e.g., DB schema migration, feature‑flagged toggle) requires explicit, recorded PO confirmation.

4. Architecture Overview

+-------------------+      +-------------------+      +-------------------+
|  Client Apps      | ---> |  API Gateway      | ---> |  RCO Service      |
| (Web/Mobile)      |      | (Istio/Envoy)     |      | (K8s Deployment)  |
+-------------------+      +-------------------+      +-------------------+
                                 ^                         |
                                 |                         v
                        +-------------------+      +-------------------+
                        |  Auth Provider    |      |  Downstream APIs  |
                        +-------------------+      +-------------------+
                                 |                         |
                                 v                         v
                        +-------------------+      +-------------------+
                        |  Postgres (RCO DB)|      |  Redis (Cache)    |
                        +-------------------+      +-------------------+
                                 |
                                 v
                        +-------------------+
                        |  S3/Object Store  |
                        | (artifacts/logs)  |
                        +-------------------+

Tracing: RCO → OpenTelemetry SDK → Collector → Tempo/Jaeger
Security: All traffic mTLS; RBAC at gateway; egress restricted via NetworkPolicy/Egress GW.

Key components:

  • RCO Service (stateless front‑end)
  • Postgres (primary data store)
  • Redis (cache)
  • S3 (artifact & log archive)

5. Deployment Procedure (Helm‑Native, Deterministic & Auditable)

All steps executed from CI/CD pipelines; each command logged to /var/log/rco-runbook.log.

5.1 Pre‑flight Security & Integrity

# Verify image signature by digest
cosign verify --key cosign.pub registry.example.com/rco@sha256:<DIGEST>

# Fail build on critical/high vulnerabilities
trivy image --exit-code 1 --severity CRITICAL,HIGH registry.example.com/rco@sha256:<DIGEST>

5.2 Helm Values – Explicit Configuration

values-prod.yaml must list every key (see Appendix A for the security excerpt). No implicit defaults.

5.3 Deploy / Upgrade (Atomic)

helm upgrade --install rco ./helm/rco \
  --namespace rco-prod \
  -f values-prod.yaml \
  --atomic --wait --timeout 10m

The --atomic flag rolls the release back automatically if the upgrade fails (and implies --wait); --wait blocks until all resources are ready or the 10‑minute --timeout expires.

5.4 Secrets Sanity Check

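# Run via sh -c so the glob expands inside the container, not in the local shell
# (requires a shell in the image)
kubectl exec deploy/rco -n rco-prod -- sh -c 'sha256sum /etc/secrets/*'
# compare against Vault checksums (stored in audit logs)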
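
The comparison step is only described in a comment above. A minimal sketch of it follows, assuming both sides are available as files in sha256sum output format; compare_checksums and both file arguments are hypothetical, and how the expected list is exported from the Vault audit trail is not specified in this runbook.

```shell
# Compare two checksum listings in `sha256sum` format ("<hash>  <path>").
#   $1 checksums captured from the running pod
#   $2 expected checksums (hypothetical export from the audit trail)
compare_checksums() {
  # Sort by path so ordering differences do not cause false mismatches.
  actual_sorted=$(sort -k2 "$1") || return 2
  expected_sorted=$(sort -k2 "$2") || return 2
  if [ "$actual_sorted" = "$expected_sorted" ]; then
    echo "secrets match"
  else
    # Only the verdict is printed; hashes stay out of the pipeline log.
    echo "secrets MISMATCH"
    return 1
  fi
}
```

Printing only the verdict keeps checksum values out of /var/log/rco-runbook.log; a non‑zero exit should stop the pipeline before the automated tests run.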

5.5 Automated Tests

./ci/run-integration-tests.sh   # must exit 0
curl -sfS https://api.example.com/rco/healthz   # expect HTTP 200

5.6 Progressive Delivery (Canary)

Choose Istio or Argo Rollouts:

Method Steps
Istio Adjust VirtualService weights: 1% → 10% → 50% → 100%
Argo Rollouts Define steps with metric analysis gates (Prometheus)

Promotion criteria (each step ≥ 10 min):

  • p95 latency < 350 ms:
    histogram_quantile(0.95, sum(rate(rco_request_latency_seconds_bucket[5m])) by (le))
  • Error rate < 0.5% (5xx / total)
  • Pods Ready = 100%
  • No SLO burn alerts

Human Confirmation: PO signs off before moving from 50% → 100% traffic.
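
The automated part of the promotion criteria can be expressed as a single gate. This is a sketch under stated assumptions: canary_gate is a hypothetical function, its four inputs are assumed to have been measured already (for example by the pipeline querying Prometheus), and the ≥ 10 min dwell time and the PO sign‑off are not modelled.

```shell
# Decide whether a canary step may be promoted, given measured values.
#   $1 p95 latency in milliseconds
#   $2 error rate in percent (5xx / total * 100)
#   $3 percentage of pods in Ready state
#   $4 number of firing SLO burn alerts
canary_gate() {
  awk -v p95="$1" -v err="$2" -v ready="$3" -v burn="$4" 'BEGIN {
    # All four thresholds from the promotion criteria must hold.
    ok = (p95 < 350) && (err < 0.5) && (ready == 100) && (burn == 0)
    print (ok ? "promote" : "hold")
    exit (ok ? 0 : 1)
  }'
}
```

For example, canary_gate 300 0.2 100 0 prints promote, while a p95 of exactly 350 ms prints hold because the criterion is a strict inequality.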

6. Configuration Details (Open Interfaces & Least Privilege)

Config | Location | Description | Security Notes
values-prod.yaml | Helm chart | Replicas, resources, probes, feature flags; see Appendix A | No plaintext secrets; all via secretRef
rco-config.yaml | ConfigMap | Timeouts, retries, allow‑listed downstream endpoints | Whitelist only; PO approval required
AuthorizationPolicy | Istio | Gateway‑level RBAC | Based on request.auth.principal claims
Gatekeeper/Kyverno policies | Cluster‑wide | Enforce image digest, runAsNonRoot, read‑only FS, egress allow‑list | Hardening baseline

7. Monitoring, SLOs & Observability

Metrics (Prometheus)

  • rco_request_latency_seconds_bucket (histogram)
  • rco_requests_total{code}
  • rco_inflight_requests
  • rco_downstream_failures_total{target}

PromQL examples

  • p95 latency: histogram_quantile(0.95, sum(rate(rco_request_latency_seconds_bucket[5m])) by (le))
  • Error rate: sum(rate(rco_requests_total{code=~"5.."}[5m])) / sum(rate(rco_requests_total[5m]))

Logs – JSON, no PII. Required fields: timestamp, level, trace_id, request_id, user_id (pseudonymous), route, code.
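
To make the required‑field list checkable, the sketch below reports which fields are absent from one log line. It is deliberately crude: missing_log_fields is a hypothetical helper that matches "field": textually and is not a JSON parser, and the sample values in the usage note are illustrative only.

```shell
# Print the required fields that are absent from one JSON log line.
# Textual check only: looks for "field" followed by a colon.
missing_log_fields() {
  line=$1
  missing=""
  for field in timestamp level trace_id request_id user_id route code; do
    if ! printf '%s\n' "$line" | grep -Eq "\"$field\"[[:space:]]*:"; then
      missing="$missing $field"
    fi
  done
  # Trim the leading space; prints an empty line when nothing is missing.
  echo "${missing# }"
}
```

A line such as {"timestamp":"2025-10-09T12:00:00Z","level":"info","trace_id":"t1","request_id":"r1","user_id":"u-7f3a","route":"/rco/healthz","code":200} yields empty output; dropping trace_id would print trace_id.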

Tracing – OpenTelemetry SDK → Collector → Tempo/Jaeger (W3C traceparent propagated).

SLOs

Indicator Target
Availability (30 d) 99.9%
Latency (p95) < 350 ms
Error budget ≤ 0.5% 5xx per hour
Alert on fast burn (2 h) & slow burn (24 h) Human acknowledgement required
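
As a worked example of what these targets imply, the sketch below converts an availability SLO into minutes of allowed unavailability and computes a burn rate. Both function names are hypothetical, and the burn‑rate reading assumes availability is measured as the request success ratio, which this runbook does not state explicitly.

```shell
# Minutes of allowed unavailability for an availability SLO over a window.
#   $1 SLO target in percent (e.g. 99.9)
#   $2 window length in days (e.g. 30)
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" 'BEGIN {
    printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100
  }'
}

# Burn rate: observed error percentage divided by the budgeted percentage.
# A value of 1.0 spends the budget exactly over the SLO window.
#   $1 observed error rate in percent
#   $2 SLO target in percent (must be below 100)
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", err / (100 - slo) }'
}
```

error_budget_minutes 99.9 30 prints 43.2, and burn_rate 0.5 99.9 prints 5.0, meaning a sustained 0.5% error rate would consume a 99.9% budget five times faster than the window allows.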

8. Data Stores, Migrations & Backups

  • Postgres – primary relational store
  • Redis – cache layer
  • S3 – artifact & log archive

Migration Gate

  1. Backup current DB snapshot (tagged with build SHA).
  2. Validate snapshot in staging; run integration tests.
  3. PO sign‑off recorded in ticket.

Migration Execution

  • Use dbmate / liquibase with forward‑compatible change sets.
  • Apply during 1% canary; monitor rco_downstream_failures_total.

Stateful Rollback

  • If a migration breaks compatibility, restore the latest snapshot or run the down‑migration, then roll back the app version.

9. Policy Enforcement (Make it hard to do the wrong thing)

  • Gatekeeper / Kyverno
    • Images must be pinned by digest.
    • runAsNonRoot, readOnlyRootFilesystem, drop all capabilities.
    • Require liveness/readiness/startup probes.
    • Deny egress except via approved EgressGateway host list.
  • Istio PeerAuthentication – STRICT mTLS within rco-prod.
  • NetworkPolicy – restrict inbound/outbound to approved services only.
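
The digest‑pinning rule can also be checked before a manifest ever reaches the cluster. The sketch below is a hypothetical CI‑side helper (is_digest_pinned is not an existing tool) and complements, rather than replaces, the Gatekeeper/Kyverno admission policies.

```shell
# Return 0 if an image reference ends in a full sha256 digest, 1 otherwise.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}
```

A tag‑only reference such as registry.example.com/rco:latest is rejected, as is a truncated digest; only a reference ending in @sha256: followed by 64 hex characters passes.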

10. Security Baselines

Pod / Container Security (Helm values)

podSecurityContext:
  runAsNonRoot: true
  seccompProfile: { type: RuntimeDefault }
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities: { drop: ["ALL"] }
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits:   { cpu: "1",    memory: "512Mi" }
probes:
  liveness:  { httpGet: { path: /healthz,  port: 8080 }, initialDelaySeconds: 15, periodSeconds: 10 }
  readiness: { httpGet: { path: /readyz,   port: 8080 }, initialDelaySeconds: 5,  periodSeconds: 5 }
  startup:   { httpGet: { path: /startupz, port: 8080 }, failureThreshold: 30, periodSeconds: 2 }
  • Image Integrity – Cosign verification; Trivy scan (fail on CRITICAL/HIGH).
  • Secrets Hygiene – Quarterly rotation, never written to logs or crash dumps, orphan token revocation, masked in debug output.

11. Incident Response & Rollback

11.1 Stateless Rollback

helm rollback rco <REVISION> --namespace rco-prod --wait

11.2 Stateful Rollback

  1. Stop traffic – set canary weight to 0%.
  2. Restore DB from latest verified snapshot or run down‑migration.
  3. Roll back the app to the previous version via Helm.
  4. Validate health endpoints, run integration tests.
  5. Promote traffic back following progressive delivery criteria.
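
The five steps above can be rehearsed as a script. This is a sketch, not the approved procedure: it assumes Argo Rollouts with a rollout named rco, dbmate with a reversible migration, and the health URL and test script from Section 5.5; run and stateful_rollback are hypothetical names. With DRY_RUN=1 (the default here) it only prints the commands, which suits the required rollback dry‑run.

```shell
# DRY_RUN=1 prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# $1 Helm revision to roll back to
stateful_rollback() {
  revision=$1
  # 1. Stop canary traffic (assumption: Argo Rollouts, rollout "rco").
  run kubectl argo rollouts abort rco -n rco-prod || return 1
  # 2. Revert the schema (assumption: dbmate; restoring the snapshot
  #    instead is site-specific and omitted here).
  run dbmate rollback || return 1
  # 3. Roll the application back to the given Helm revision.
  run helm rollback rco "$revision" --namespace rco-prod --wait || return 1
  # 4. Validate health and re-run the integration tests.
  run curl -sfS https://api.example.com/rco/healthz || return 1
  run ./ci/run-integration-tests.sh || return 1
  # 5. Re-promotion follows the canary criteria and is not automated here.
}
```

Each step stops the sequence on failure, so a broken restore never proceeds to the Helm rollback. Setting DRY_RUN=0 executes the commands for real and should only happen after the PO confirmation described in Section 3.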

11.3 Post‑mortem

  • Document root cause, impact, corrective actions.
  • Attach metrics, traces, audit logs, and any relevant compliance evidence.

12. Compliance Checklist (Helix Core Ethos)

Pillar Evidence
Trust‑by‑Design Signed images, deterministic builds, full audit log of every command
Human‑First PO sign‑off at each promotion checkpoint and before any data migration
Verifiable Memory Git tags, immutable artifact registry, SBOM stored alongside release
Open Interfaces Versioned OpenAPI spec, no hidden endpoints
Responsible Power Rate limits, minimal RBAC, egress allow‑list
Reliability over Hype Canary deployment, health checks, automatic rollback
Craft & Care Peer‑reviewed config, progressive delivery rehearsals, dry‑run documentation

Guardrails – enforced throughout:

  • No hidden training on private data.
  • No dark‑pattern UI/UX.
  • No unverifiable performance claims.
  • No irreversible actions without explicit, recorded PO confirmation.

13. Change Management & Documentation

  1. Create Change Request – link this runbook version (v1.3).
  2. Attach Artifacts – release manifest, SBOM, scan reports, test results, migration plan.
  3. Obtain Approvals – PO, SE, SRE Lead, DPO, Compliance (digital signatures).
  4. Schedule Deployment Window – notify stakeholders ≥ 48 h in advance.
  5. Post‑deployment – update runbook with deviations, lessons learned; store all artifacts in the Helix Core repository (read‑only for auditors).

14. Glossary

Term Meaning
RCO Remote‑Call Orchestrator (the service covered by this runbook)
RCOT Reflective Consistency Over Time (a separate Helix metric)
TTD Time‑to‑Decision – transparent evidence for decisions
SBOM Software Bill of Materials
Deterministic Interface Interface whose output depends solely on input + documented state
Graceful Degradation Defined fallback behavior when a downstream dependency fails
RTO / RPO Recovery Time Objective / Recovery Point Objective for data stores
SLO Service Level Objective (e.g., 99.9% availability)
PO Product Owner
PE Platform Engineer
SE Security Engineer
QA Quality Assurance
SRE Site Reliability Engineer
DPO Data‑Privacy Officer

15. Appendix A – Baseline Helm Values (Security Excerpt)

image:
  repository: registry.example.com/rco
  digest: "sha256:<DIGEST>"

podSecurityContext:
  runAsNonRoot: true
  seccompProfile: { type: RuntimeDefault }

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities: { drop: ["ALL"] }

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

startupProbe:
  httpGet:
    path: /startupz
    port: 8080
  failureThreshold: 30
  periodSeconds: 2

NetworkPolicies, Gatekeeper/Kyverno, and Istio manifests are managed separately and referenced in Section 9.

16. Final Review Gate Checklist (Support Review • 2025‑10‑09)

Independent red‑flag review conducted by OpenAI Support; no critical blockers were found. Use this gate before 100% traffic cut‑over.

Checklist Item Pass / Fail
A. Acronym Clarity – All dashboards, logs, and traces label the orchestrator as RCO and the metric as RCOT. ☐ PASS
B. Secret Hygiene – No secrets appear in logs or crash dumps; Vault policies are least‑privilege; orphaned tokens revoked. ☐ PASS
C. Migration Controls – RTO ≤ 15 min, RPO ≤ 5 min documented; rollback dry‑run performed within last 30 days. ☐ PASS
D. Image & Dependency Scanning – All images scanned; no CRITICAL/HIGH findings remain. ☐ PASS
E. Progressive Delivery Validation – Canary steps verified against latency & error‑rate thresholds. ☐ PASS
F. Policy Enforcement – Gatekeeper/Kyverno rules applied and validated in a staging cluster. ☐ PASS
G. Monitoring & Alerting – SLO/SLA alerts fire and require human acknowledgment. ☐ PASS
H. Documentation Completeness – All artifacts stored in Helix Core repository; evidence retrievable for audit. ☐ PASS
I. RTO / RPO Verification – Latest backup timestamps confirm RTO ≤ 15 min, RPO ≤ 5 min. ☐ PASS
J. Rollback Dry‑Run – Successful rollback (stateless & stateful) performed ≤ 30 days prior to release. ☐ PASS
K. DPO Sign‑off – Pseudonymous user identifiers approved for log retention. ☐ PASS

Human Confirmation Required:

  • Product Owner: _______________________ (signature & timestamp)
  • Data‑Privacy Officer: _______________________ (signature & timestamp)

If any item cannot be confirmed, the status is unknown and the deployment must be paused until clarification is obtained.
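
The time‑bound checklist items (rollback dry‑run within 30 days, backup age within the 5‑minute RPO) reduce to the same age comparison. A minimal sketch follows; within_age is a hypothetical helper and takes Unix epoch seconds so that it does not depend on any particular date parser.

```shell
# Return 0 if an event is no older than the allowed age, 1 otherwise.
#   $1 event time as Unix epoch seconds
#   $2 current time as Unix epoch seconds (e.g. "$(date +%s)")
#   $3 maximum allowed age in seconds
within_age() {
  [ $(( $2 - $1 )) -le "$3" ]
}
```

For the dry‑run item the limit is 2592000 seconds (30 days); for the RPO item it is 300 seconds. If the event timestamp cannot be obtained at all, the item stays unknown rather than failed, in line with the rule above.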

End of Runbook

All sections are intended to be read together; any deviation from the prescribed steps must be documented and approved through the change‑management process.