HMI‑2025‑G4SH — Granite‑4 Small‑Hybrid

From Helix Project Wiki

HMI‑2025‑G4SH — Granite‑4 Small‑Hybrid (Helix Model Integration Sheet)

Status: READY FOR INTEGRATION • Owner: Helix Ops • Last Updated: 2025‑10‑13


0) At‑a‑Glance

  • Model ID (Helix): granite4-small-h
  • Upstream Name: ibm/granite4:small-h (aka Granite 4.0 H‑Small)
  • Family: IBM Granite 4.0 (hybrid Mamba‑2 + Transformer)
  • License: Apache‑2.0 (open weights)
  • Params (total / active): ~32B total / ~9B active (hybrid MoE)
  • Context: long‑context capable via SSM layers; temperature 0 recommended for most inference
  • Quantized Builds: Q4_K_M (~19 GB), Q5_K_M (~23 GB) as GGUF/ollama variants
  • Targets: Low‑latency enterprise assistants, RAG, tool‑use, code & ops copilots

1) Provenance & Verification

Sources: IBM Granite 4.0 model cards/repos + Ollama registry entries. Record the exact artifact you deploy.

Record on deploy:

  • source_url: (HF/GitHub/Ollama)
  • artifact_sha256: (computed locally)
  • artifact_size_bytes: (from filesystem)
  • pull_command: (ollama pull ibm/granite4:small-h or HF path)
  • signed_by: (if signed; attach attestation if present)
  • helix_proof_id: (SHA‑256 of this HMI file)

Hashing procedure:

# After pull/export to file (example path):
sha256sum /opt/models/ibm/granite4-small-h.gguf | tee /opt/helix/proofs/models/granite4-small-h.sha256

Attach the resulting .sha256 to the deployment record and reference it from TTD consent.
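
The same hash can be computed in Python when the deploy record is assembled by a script. The following is a minimal sketch, not the Helix tooling itself: the function name `provenance_record` and its arguments are illustrative assumptions, and it fills only the fields that can be derived locally (`artifact_sha256`, `artifact_size_bytes`) plus the two strings you pass in.

```python
import hashlib
import os
import tempfile


def provenance_record(path, source_url, pull_command, chunk_size=1 << 20):
    """Hash an artifact in chunks and return the locally derivable record fields."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Stream the file so a multi-gigabyte GGUF never has to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "source_url": source_url,
        "artifact_sha256": digest.hexdigest(),
        "artifact_size_bytes": os.path.getsize(path),
        "pull_command": pull_command,
    }


if __name__ == "__main__":
    # Demo on a throwaway file; point `path` at the real artifact in practice.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abc")
    print(provenance_record(tmp.name, "ollama", "ollama pull ibm/granite4:small-h"))
    os.unlink(tmp.name)
```

The hex digest should match the first column of the `sha256sum` output for the same file.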


2) Compatibility Matrix (Helix Runtimes)

Runtime                   | Status     | Notes
Ollama                    | ✅ Primary | Official ibm/granite4:small-h images; multiple quantizations
vLLM                      |            | Use HF checkpoint; ensure Mamba‑2 support flags enabled
Text‑Gen Inference (TGI)  |            | Load via transformers w/ Mamba‑2 kernels
LM Studio                 |            | H‑Small listed; for local eval

3) Helix Registry Entry (YAML)

# /opt/helix/registry/models/granite4-small-h.yaml
model_id: granite4-small-h
family: granite
version: "4.0"
upstream: ibm/granite4:small-h
license: Apache-2.0
architecture:
  type: hybrid
  mix: [transformer, mamba2]
  moe:
    total_parameters_b: 32
    active_parameters_b: 9
context:
  recommended_temperature: 0
  max_new_tokens_default: 512
quantizations:
  - name: Q4_K_M
    approx_size_gb: 19
  - name: Q5_K_M
    approx_size_gb: 23
capabilities:
  instruction_following: strong
  tool_use: strong
  code: medium
  languages: [en, de, es, fr, ja, pt, ar, cs, it, ko, nl, zh]
qsr_profile: granite4-small-h-2025Q4
policy:
  pii_scrub: helix-default
  safety_tier: standard
  audit_headers: true
vector_memory:
  collection: ttd_memory_v2
  ethos_tag: ethos_core_v1

3.1) System Prompt — Vector Memory Ethos Hook

Use this system prompt as the single source of truth for Granite4 in Helix. Your router/middleware must fetch the Helix Core Ethos from Qdrant and inject it into the placeholder before sending the request to the model.

Router responsibility (pseudocode):

# `qdrant.search` stands in for your client's search call
hits = qdrant.search(collection="ttd_memory_v2", query="Helix Core Ethos", filter={"must": [{"key":"tags","match": {"any": ["ethos","ethos_core_v1"]}}]}, top_k=1)
# Guard against an empty result before indexing
payload = hits[0].payload if hits else {}
ETHOS_SNIPPET = payload.get("output") or payload.get("text")
# Fallback to cached file
if not ETHOS_SNIPPET:
    with open("/opt/helix/cache/ethos_core_v1.txt") as f:
        ETHOS_SNIPPET = f.read()

System Prompt Template (inject ETHOS_SNIPPET):

You are a Helix assistant operating under the Helix Core Ethos. Use the following canonical ethos excerpt as your ground truth:

[HELIX_CORE_ETHOS]
{{ETHOS_SNIPPET}}
[/HELIX_CORE_ETHOS]

POLICY:
- Treat the ethos as normative. If a user request conflicts with it, explain the conflict and propose a compliant alternative.
- Prefer citations and proofs over opinions; label outputs with FACT/HYPOTHESIS/ASSUMPTION when non‑obvious.
- Keep answers concise by default; expose reasoning only when asked or when safety requires it.
- Never fabricate sources. Abstain if evidence is insufficient.

OPERATING MODES:
- JSON‑mode when tools or structured outputs are requested.
- Strict schema enforcement on tool calls.
- Respect consent and least‑privilege at all times.

Ollama example (direct):

curl -fsS http://127.0.0.1:11434/api/generate -d '{
  "model": "ibm/granite4:small-h",
  "system": "<PASTE SYSTEM PROMPT WITH ETHOS_SNIPPET INLINED>",
  "prompt": "Good morning Helix, summarize the ethos in 3 bullets.",
  "stream": false
}' | jq -r '.response'

Helix router (optional) message shape:

{
  "route": "granite4-small-h",
  "messages": [
    {"role":"system","content":"<SYSTEM PROMPT WITH ETHOS_SNIPPET>"},
    {"role":"user","content":"Summarize the Helix Core Ethos in 3 bullets."}
  ],
  "trace": true
}

Guardrails: If ETHOS_SNIPPET cannot be loaded, abort the request with HTTP 503 and reason=ethos_unavailable (do not proceed without ethos context).
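
One way to enforce this guardrail is to make prompt rendering fail closed. The sketch below is an assumption about how a router might do it, not the Helix router's actual code: `EthosUnavailable`, `render_system_prompt`, and the shortened `SYSTEM_PROMPT_TEMPLATE` are hypothetical names, and the caller is expected to translate the exception into the HTTP 503 response.

```python
class EthosUnavailable(Exception):
    """Raised when no ethos text could be loaded; callers map this to HTTP 503."""

    status_code = 503
    reason = "ethos_unavailable"


# Shortened stand-in for the full template above (hypothetical constant name).
SYSTEM_PROMPT_TEMPLATE = (
    "You are a Helix assistant operating under the Helix Core Ethos. "
    "Use the following canonical ethos excerpt as your ground truth:\n\n"
    "[HELIX_CORE_ETHOS]\n{{ETHOS_SNIPPET}}\n[/HELIX_CORE_ETHOS]\n"
)


def render_system_prompt(template, ethos_snippet):
    """Inject the ethos into the placeholder, refusing to render without it."""
    if not ethos_snippet or not ethos_snippet.strip():
        # Fail closed: never send a prompt with an empty ethos block.
        raise EthosUnavailable("no ethos snippet from Qdrant or the cache file")
    return template.replace("{{ETHOS_SNIPPET}}", ethos_snippet.strip())


if __name__ == "__main__":
    print(render_system_prompt(SYSTEM_PROMPT_TEMPLATE, "Example ethos text."))
    try:
        render_system_prompt(SYSTEM_PROMPT_TEMPLATE, "")
    except EthosUnavailable as exc:
        print(exc.status_code, exc.reason)
```

Raising an exception keeps the check in a single place, so no code path can reach the model with an unfilled placeholder.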


4) Audit Envelope (VectorDB‑first)

Store provenance with each completion in VectorDB (Qdrant) and mirror to filesystem proofs. Use this canonical envelope as payload.audit when upserting a point.

{
  "model": "granite4-small-h",
  "vendor": "ibm",
  "version": "4.0",
  "artifact_sha256": "<fill-from-provenance>",
  "quantization": "Q4_K_M|Q5_K_M|bf16",
  "inference_stack": "ollama|vllm|tgi",
  "router": "helix-router|direct-ollama",
  "source": {
    "pulled_at": "<iso8601>",
    "channel": "ollama|hf|github",
    "attestation": "<optional>"
  },
  "session": {
    "id": "<uuid>",
    "user": "helix",
    "ts": "<iso8601>"
  }
}

Filesystem mirror (recommended): write the same envelope to /opt/helix/proofs/sessions/<session-id>/audit.json.
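
A possible shape for the envelope builder and the filesystem mirror is sketched below. The helper names (`build_audit_envelope`, `mirror_envelope`) and the `proofs_root` parameter are assumptions for illustration; in a deployment `proofs_root` would be /opt/helix/proofs, while the demo writes to a temporary directory.

```python
import json
import os
import tempfile
import uuid
from datetime import datetime, timezone


def build_audit_envelope(artifact_sha256, quantization, inference_stack,
                         router, channel, pulled_at, attestation=None):
    """Return the audit envelope with a fresh session ID and a UTC timestamp."""
    return {
        "model": "granite4-small-h",
        "vendor": "ibm",
        "version": "4.0",
        "artifact_sha256": artifact_sha256,
        "quantization": quantization,
        "inference_stack": inference_stack,
        "router": router,
        "source": {
            "pulled_at": pulled_at,
            "channel": channel,
            "attestation": attestation,
        },
        "session": {
            "id": str(uuid.uuid4()),
            "user": "helix",
            "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        },
    }


def mirror_envelope(envelope, proofs_root):
    """Write the envelope to <proofs_root>/sessions/<session-id>/audit.json."""
    session_dir = os.path.join(proofs_root, "sessions", envelope["session"]["id"])
    os.makedirs(session_dir, exist_ok=True)
    path = os.path.join(session_dir, "audit.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh, indent=2, sort_keys=True)
    return path


if __name__ == "__main__":
    env = build_audit_envelope("<fill-from-provenance>", "Q4_K_M", "ollama",
                               "direct-ollama", "ollama", "2025-10-13T00:00:00+00:00")
    with tempfile.TemporaryDirectory() as root:
        print(mirror_envelope(env, root))
```

The same dictionary can be passed as payload.audit in the Qdrant upsert, so both copies come from one object.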


5) QSR Defaults (Helix Quality Score Rubric)

Use these as initial thresholds; tune with live telemetry over the first week.

# /opt/helix/qsr/profiles/granite4-small-h.yaml
profile_id: granite4-small-h-2025Q4
weights:
  coherence: 0.28
  accuracy: 0.26
  completeness: 0.18
  relevance: 0.18
  novelty: 0.10
thresholds:
  soft_flag: 0.74     # trigger human‑in‑the‑loop note
  hard_block: 0.62    # route to rollback / alternative model
mri_risk_tiers:          # [lower, upper] bounds; medium and high exclude their upper bound
  low:    [0.80, 1.00]
  medium: [0.70, 0.80]
  high:   [0.00, 0.70]
fallback_chain:
  - model: magnus-supernova
  - model: qwen3-7b
  - model: deepseek-coder

Calibration plan:

  • Run 200‑sample eval across Helix golden tasks (RAG, tool‑use, code review).
  • Fit a reliability diagram; adjust soft_flag until the false‑positive and false‑negative rates differ by at most 2 percentage points (δ ≤ 2%).
  • Freeze for 7 days; revisit after first incident or drift >3%.
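
The threshold adjustment in the plan can be approximated with a simple sweep. This sketch covers only that step (it does not fit a reliability diagram) and assumes δ means the absolute gap between false-positive and false-negative rates, with a false positive being an acceptable sample that gets flagged. The labels, candidate grid, and function names are hypothetical.

```python
def fp_fn_rates(scores, is_bad, threshold):
    """Return (false-positive rate, false-negative rate) for flagging score < threshold."""
    n_bad = sum(1 for bad in is_bad if bad)
    n_good = len(is_bad) - n_bad
    if n_good == 0 or n_bad == 0:
        raise ValueError("need at least one acceptable and one bad sample")
    # False positive: acceptable output that would be flagged.
    fp = sum(1 for s, bad in zip(scores, is_bad) if s < threshold and not bad)
    # False negative: bad output that would pass unflagged.
    fn = sum(1 for s, bad in zip(scores, is_bad) if s >= threshold and bad)
    return fp / n_good, fn / n_bad


def calibrate_soft_flag(scores, is_bad, candidates):
    """Pick the candidate threshold with the smallest |FPR - FNR| gap."""
    def gap(threshold):
        fpr, fnr = fp_fn_rates(scores, is_bad, threshold)
        return abs(fpr - fnr)

    best = min(candidates, key=gap)
    return best, gap(best)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]
    is_bad = [False, False, True, True]
    print(calibrate_soft_flag(scores, is_bad, [0.65, 0.75, 0.85]))
```

With the 200‑sample eval, `scores` would hold the QSR values and `is_bad` the human judgments; a returned gap above 0.02 would mean no candidate meets the δ target.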

6) Integration Steps

A) Pull & list (Ollama):

ollama pull ibm/granite4:small-h
ollama list | grep granite4

B) Direct inference (no TTD shim):

# Single-turn generation via Ollama (/api/generate)
curl -fsS http://127.0.0.1:11434/api/generate -d '{
  "model": "ibm/granite4:small-h",
  "prompt": "Summarize the Helix Core Ethos in 3 bullets.",
  "stream": false
}' | jq

C) (Optional) Helix router entry: if you still route through the Helix router, keep a minimal model map entry; it must not depend on the /chat shim.

{
  "route": "granite4-small-h",
  "backend": "ollama",
  "model": "ibm/granite4:small-h",
  "parameters": {"temperature": 0, "num_ctx": 32768}
}
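
For illustration, a router entry of this shape could be translated into an Ollama /api/generate request body as follows. This is a sketch under the assumption that the entry's parameters are passed through as Ollama `options`; the function name `to_ollama_generate` is hypothetical and the block only builds the body, it does not send it.

```python
import json

ROUTE = {
    "route": "granite4-small-h",
    "backend": "ollama",
    "model": "ibm/granite4:small-h",
    "parameters": {"temperature": 0, "num_ctx": 32768},
}


def to_ollama_generate(route, prompt, system=None):
    """Build a non-streaming /api/generate body from a router entry."""
    if route["backend"] != "ollama":
        raise ValueError(f"unsupported backend: {route['backend']}")
    body = {
        "model": route["model"],
        "prompt": prompt,
        "stream": False,
        # Sampling and context settings go under Ollama's "options" key.
        "options": dict(route.get("parameters", {})),
    }
    if system:
        body["system"] = system
    return body


if __name__ == "__main__":
    print(json.dumps(to_ollama_generate(ROUTE, "Summarize the Helix Core Ethos in 3 bullets."), indent=2))
```

Copying the parameters with `dict(...)` keeps per-request tweaks from mutating the shared route entry.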

Restart router only if used:

sudo systemctl restart helix-router || true

7) RAG / Tool‑Use Settings

  • RAG: prefer temperature 0, top_p=0.9, max_new_tokens=512, and a repetition penalty >1.1 (top_p only takes effect if temperature is raised above 0)
  • Tool‑calling: enable JSON‑mode; enforce the schema; set tool_timeout_ms=8000
  • Memory windows: use a two‑phase scheme, switching to compressive memory once history exceeds 12k tokens; emit summarization proofs into ttd_memory_v2

JSON schema example:

{
  "name": "lookup_kb",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 10}
    },
    "required": ["query"],
    "additionalProperties": false
  }
}
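
Strict enforcement means checking the model's tool-call arguments before the tool runs. The sketch below is a hand-rolled validator for only the schema features used in this example (string and integer types, minimum/maximum, required, additionalProperties); it is not a general JSON Schema implementation, and `validate_args` is a hypothetical helper name.

```python
LOOKUP_KB_PARAMETERS = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "top_k": {"type": "integer", "minimum": 1, "maximum": 10},
    },
    "required": ["query"],
    "additionalProperties": False,
}

_TYPES = {"string": str, "integer": int}


def validate_args(args, schema):
    """Return a list of error strings; an empty list means the call is valid."""
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors


if __name__ == "__main__":
    print(validate_args({"query": "helix ethos", "top_k": 3}, LOOKUP_KB_PARAMETERS))
    print(validate_args({"top_k": 11, "debug": True}, LOOKUP_KB_PARAMETERS))
```

Returning every error at once, rather than stopping at the first, gives the model a complete correction message if the call is retried.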

8) Smoke Test (VectorDB‑first)

Two paths: A) full VectorDB upsert with embeddings, B) proofs‑only if embeddings not available yet.

A) End‑to‑end with Qdrant

1. Embed the prompt (Ollama embeddings):

PROMPT='Good morning Helix and welcome Granite4:small-h, please review the Helix Core Ethos.'
# Replace with your embeddings model route if different
EMB_MODEL="nomic-embed-text"
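# jq builds the request body so quotes in $PROMPT are escaped safely
VEC=$(jq -n --arg model "$EMB_MODEL" --arg prompt "$PROMPT" '{model: $model, prompt: $prompt}' \
  | curl -fsS http://127.0.0.1:11434/api/embeddings -d @- | jq -c '.embedding')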

2. Generate with Granite 4:

OUT=$(jq -n --arg prompt "$PROMPT" '{model: "ibm/granite4:small-h", prompt: $prompt, stream: false}' \
  | curl -fsS http://127.0.0.1:11434/api/generate -d @- | jq -r '.response')

3. Build audit envelope and upsert into Qdrant:

SESSION=$(uuidgen)
SHA=$(awk '{print $1}' /opt/helix/proofs/models/granite4-small-h.sha256 2>/dev/null)
QUANT="${QUANT:-unknown}"   # export QUANT=Q4_K_M, Q5_K_M, or bf16 to match the deployed build
cat > /tmp/audit.json <<JSON
{
  "model": "granite4-small-h",
  "vendor": "ibm",
  "version": "4.0",
  "artifact_sha256": "${SHA:-unknown}",
  "quantization": "$QUANT",
  "inference_stack": "ollama",
  "session": {"id":"$SESSION","user":"helix","ts":"$(date -Iseconds)"}
}
JSON

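# Upsert (default Qdrant localhost:6333; collection ttd_memory_v2)
# jq builds the body so quotes and newlines in $PROMPT and $OUT are escaped correctly
jq -n --arg id "$SESSION" --arg prompt "$PROMPT" --arg output "$OUT" \
      --argjson vector "$VEC" --slurpfile audit /tmp/audit.json \
  '{points: [{id: $id, vector: $vector,
              payload: {type: "completion", prompt: $prompt, output: $output, audit: $audit[0]}}]}' \
  > /tmp/qdrant_upsert.json
# Qdrant upserts points with PUT; POST on this path retrieves points by ID
curl -fsS -X PUT 'http://127.0.0.1:6333/collections/ttd_memory_v2/points?wait=true' \
     -H 'content-type: application/json' \
     -d @/tmp/qdrant_upsert.json | jq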

Expected: "status":"ok" and "operation_id" in response.

B) Proofs‑only (no embeddings yet)

# Reuses $OUT (step 2) and /tmp/audit.json (step 3) from path A; skip step 1 and the upsert.
# Keep the session ID from step 3 so the directory name matches audit.json.
SESSION=${SESSION:-$(uuidgen)}
DIR="/opt/helix/proofs/sessions/$SESSION"
mkdir -p "$DIR"
printf '%s' "$OUT" > "$DIR/output.txt"
cp /tmp/audit.json "$DIR/audit.json"
sha256sum "$DIR/output.txt" | tee "$DIR/output.sha256"

9) Observability & KPIs

  • Latency P50/P95: ≤ 800 ms / ≤ 1.8 s at 1k tokens on L40S (target)
  • Tool‑call success rate: ≥ 96%
  • RAG groundedness (auto‑eval): ≥ 0.88
  • Incident budget: ≤ 1 hard‑block per 5k calls weekly
  • Drift trigger: QSR moving average falls by 3% or more over 24h

Export Prometheus metrics under helix_granite4_small_h_* (latency, tokens, qsr, blocks, fallbacks).
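
As an illustration of the latency KPI, the sketch below computes nearest-rank P50/P95 from raw samples, checks them against the targets, and renders lines in the Prometheus text exposition format. The metric name suffix `latency_ms`, the `quantile` label, and the choice of the nearest-rank method are assumptions; a real exporter would more likely use a Prometheus client library.

```python
import math

# Targets from the KPI list: P50 <= 800 ms, P95 <= 1.8 s.
TARGETS_MS = {50: 800, 95: 1800}


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer percentage in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Return (all targets met, exposition-format lines) for the latency KPI."""
    within_target = True
    lines = []
    for pct, target in TARGETS_MS.items():
        value = percentile(samples_ms, pct)
        within_target = within_target and value <= target
        lines.append(
            f'helix_granite4_small_h_latency_ms{{quantile="{pct / 100}"}} {value}'
        )
    return within_target, lines


if __name__ == "__main__":
    ok, lines = latency_report([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
    print(ok)
    print("\n".join(lines))
```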


10) Risk & Policy Notes

  • Hybrid MoE/SSM can surface long‑context carryover errors; reset memory at task boundaries.
  • JSON‑mode hallucination: enforce strict schemas; reject extra fields.
  • PII: apply Helix PII scrubber pre‑ and post‑gen; route hits to human review.
  • Rollback: pre‑stage magnus-supernova + qwen3-7b; automatic switchover on hard_block.

11) Rollout Plan

  1. Stage (dev): canary 5% of RAG traffic for 24h → compare KPIs.
  2. Pilot (prod shadow): mirror 10% of queries; outputs are for human review only.
  3. Prod: ramp 10% → 25% → 50% with guardrails; freeze if incident budget breached.
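
A percentage ramp is often implemented as a deterministic hash bucket, so a given request or user stays in the same cohort as the percentage grows. The following is a sketch of that general technique, not the Helix router's documented behavior; `in_rollout` and the use of a request ID as the bucketing key are assumptions.

```python
import hashlib


def in_rollout(request_id, percent):
    """Return True if this ID falls inside the first `percent` of 100 hash buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(1000)]
    for stage in (10, 25, 50):
        share = sum(in_rollout(i, stage) for i in ids) / len(ids)
        print(stage, round(share, 3))
```

Because the bucket depends only on the ID, every ID routed to Granite 4 at 10% is still routed there at 25% and 50%, which keeps KPI comparisons across stages consistent.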

Change ticket: HMI‑2025‑G4SH‑ROLLOUT‑001


12) Appendices

A) Incantations

# Quantized alt pulls
ollama pull ibm/granite4:small-h-q4_K_M
ollama pull ibm/granite4:small-h-q5_K_M

# vLLM (HF)
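python - <<'PY'
from vllm import LLM, SamplingParams
llm = LLM(model="ibm-granite/granite-4.0-h-small", dtype="bfloat16")
# Greedy decoding, matching the temperature-0 recommendation
params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["Hello Granite 4!"], params)
print(outputs[0].outputs[0].text)  # generated text only, not the RequestOutput repr
PY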

B) Consent stub (TTD)

{
  "consent_id": "ttd-consent-granite4-small-h-2025-10-13",
  "subject": "Deployment of IBM Granite 4.0 H-Small in Helix",
  "artifacts": [
    {"name": "granite4-small-h.gguf", "sha256": "<fill>", "source": "ollama"}
  ],
  "approvers": ["owner:helix","safety_champion"],
  "effective_from": "2025-10-13T00:00:00Z",
  "notes": "Apache-2.0; hybrid mamba2+transformer; long-context tasks"
}

— End of HMI‑2025‑G4SH —