HMI‑2025‑G4SH — Granite‑4 Small‑Hybrid (Helix Model Integration Sheet)
Status: READY FOR INTEGRATION • Owner: Helix Ops • Last Updated: 2025‑10‑13
0) At‑a‑Glance
- Model ID (Helix): granite4-small-h
- Upstream Name: ibm/granite4:small-h (aka Granite 4.0 H‑Small)
- Family: IBM Granite 4.0 (hybrid Mamba‑2 + Transformer)
- License: Apache‑2.0 (open weights)
- Params (total / active): ~32B total / ~9B active (hybrid MoE)
- Context: long‑context capable via SSM layers; temperature 0 recommended for most inference
- Quantized Builds: Q4_K_M (~19 GB), Q5_K_M (~23 GB) as GGUF/ollama variants
- Targets: Low‑latency enterprise assistants, RAG, tool‑use, code & ops copilots
1) Provenance & Verification
Sources: IBM Granite 4.0 model cards/repos + Ollama registry entries. Record the exact artifact you deploy.
Record on deploy:
- source_url: (HF/GitHub/Ollama)
- artifact_sha256: (computed locally)
- artifact_size_bytes: (from filesystem)
- pull_command: (ollama pull ibm/granite4:small-h or HF path)
- signed_by: (if signed; attach attestation if present)
- helix_proof_id: (SHA‑256 of this HMI file)
Hashing procedure:
# After pull/export to file (example path):
sha256sum /opt/models/ibm/granite4-small-h.gguf | tee /opt/helix/proofs/models/granite4-small-h.sha256
Attach the resulting .sha256 file to the deployment record and reference it from the TTD consent stub (Appendix B).
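The hash and size fields from the "Record on deploy" list can also be collected in one step. The following is a minimal sketch, not part of the Helix tooling: the function name `build_provenance_record` is hypothetical, and only the fields that can be derived locally are filled in.

```python
import hashlib
from pathlib import Path


def build_provenance_record(artifact_path, source_url, pull_command):
    """Return the locally derivable provenance fields for a model artifact.

    Hypothetical helper: field names follow the "Record on deploy" list,
    but the function itself is an illustration, not a Helix API.
    """
    path = Path(artifact_path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Stream in 1 MiB chunks so a multi-GB GGUF file never sits in memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "source_url": source_url,
        "artifact_sha256": digest.hexdigest(),
        "artifact_size_bytes": path.stat().st_size,
        "pull_command": pull_command,
    }
```

The hex digest should match the first column of the `sha256sum` output for the same file, so either method can populate `artifact_sha256`.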
2) Compatibility Matrix (Helix Runtimes)
| Runtime | Status | Notes |
|---|---|---|
| Ollama | ✅ Primary | Official ibm/granite4:small-h images; multiple quantizations |
| vLLM | ✅ | Use HF checkpoint; ensure Mamba‑2 support flags enabled |
| Text‑Gen Inference (TGI) | ✅ | Load via transformers w/ Mamba‑2 kernels |
| LM Studio | ✅ | H‑Small listed; for local eval |
3) Helix Registry Entry (YAML)
# /opt/helix/registry/models/granite4-small-h.yaml
model_id: granite4-small-h
family: granite
version: "4.0"
upstream: ibm/granite4:small-h
license: Apache-2.0
architecture:
  type: hybrid
  mix: [transformer, mamba2]
  moe:
    total_parameters_b: 32
    active_parameters_b: 9
context:
  recommended_temperature: 0
  max_new_tokens_default: 512
quantizations:
  - name: Q4_K_M
    approx_size_gb: 19
  - name: Q5_K_M
    approx_size_gb: 23
capabilities:
  instruction_following: strong
  tool_use: strong
  code: medium
  languages: [en, de, es, fr, ja, pt, ar, cs, it, ko, nl, zh]
qsr_profile: granite4-small-h-2025Q4
policy:
  pii_scrub: helix-default
  safety_tier: standard
  audit_headers: true
4) TTD/Helix Audit Headers
Add these headers (or JSON fields) to each generation record emitted by the router.
{
  "model": "granite4-small-h",
  "vendor": "ibm",
  "version": "4.0",
  "artifact_sha256": "<fill-from-provenance>",
  "quantization": "Q4_K_M|Q5_K_M|bf16",
  "inference_stack": "ollama|vllm|tgi",
  "router": "helix-ttd-shim/>=1.3",
  "x_granite_proof": {
    "source": "ollama|hf|github",
    "pulled_at": "<iso8601>",
    "attestation": "<optional-blob-or-url>"
  }
}
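The template above uses `a|b|c` to list the allowed values for several fields. As a rough sketch of how a router shim might fill it in and reject values outside those sets, assuming a hypothetical helper `build_audit_record` and a concrete router version string supplied by the caller:

```python
import re
from datetime import datetime, timezone

# Allowed values copied from the audit header template.
ALLOWED_QUANTIZATIONS = {"Q4_K_M", "Q5_K_M", "bf16"}
ALLOWED_STACKS = {"ollama", "vllm", "tgi"}
ALLOWED_SOURCES = {"ollama", "hf", "github"}


def build_audit_record(artifact_sha256, quantization, inference_stack, source,
                       router="helix-ttd-shim/1.3", pulled_at=None,
                       attestation=None):
    """Build one audit record; hypothetical helper, not a Helix API."""
    if not re.fullmatch(r"[0-9a-f]{64}", artifact_sha256):
        raise ValueError("artifact_sha256 must be 64 lowercase hex characters")
    if quantization not in ALLOWED_QUANTIZATIONS:
        raise ValueError(f"unknown quantization: {quantization}")
    if inference_stack not in ALLOWED_STACKS:
        raise ValueError(f"unknown inference stack: {inference_stack}")
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source}")
    record = {
        "model": "granite4-small-h",
        "vendor": "ibm",
        "version": "4.0",
        "artifact_sha256": artifact_sha256,
        "quantization": quantization,
        "inference_stack": inference_stack,
        "router": router,  # assumed: a concrete version satisfying >=1.3
        "x_granite_proof": {
            "source": source,
            "pulled_at": pulled_at or datetime.now(timezone.utc).isoformat(),
        },
    }
    if attestation is not None:
        # The attestation field is optional in the template.
        record["x_granite_proof"]["attestation"] = attestation
    return record
```

Failing fast on an unknown quantization or stack keeps malformed records out of the audit trail instead of discovering them during review.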
5) QSR Defaults (Helix Quality Score Rubric)
Use these as initial thresholds; tune with live telemetry over the first week.
# /opt/helix/qsr/profiles/granite4-small-h.yaml
profile_id: granite4-small-h-2025Q4
weights:
  coherence: 0.28
  accuracy: 0.26
  completeness: 0.18
  relevance: 0.18
  novelty: 0.10
thresholds:
  soft_flag: 0.74   # trigger human‑in‑the‑loop note
  hard_block: 0.62  # route to rollback / alternative model
mri_risk_tiers:     # interval notation, quoted so the file parses as YAML
  low: "[0.80, 1.00]"
  medium: "[0.70, 0.80)"
  high: "[0.00, 0.70)"
fallback_chain:
  - model: magnus-supernova
  - model: qwen3-7b
  - model: deepseek-coder
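To make the profile concrete, here is a small sketch of how the weights, thresholds, and MRI tiers could combine. It assumes each component is scored in [0, 1], that the QSR is the weighted sum, and that a score below a threshold triggers the corresponding action; none of these function names come from Helix.

```python
# Values copied from the QSR profile above.
WEIGHTS = {
    "coherence": 0.28,
    "accuracy": 0.26,
    "completeness": 0.18,
    "relevance": 0.18,
    "novelty": 0.10,
}
SOFT_FLAG = 0.74
HARD_BLOCK = 0.62


def qsr_score(components):
    """Weighted sum of per-dimension scores (assumed to lie in [0, 1])."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def mri_tier(score):
    """Map a score onto the low / medium / high risk intervals."""
    if score >= 0.80:
        return "low"
    if score >= 0.70:
        return "medium"
    return "high"


def route_action(score):
    """Assumed semantics: scores below a threshold trigger that action."""
    if score < HARD_BLOCK:
        return "hard_block"
    if score < SOFT_FLAG:
        return "soft_flag"
    return "pass"
```

Under these assumptions a score of 0.72 lands in the medium tier and raises a soft flag, while anything under 0.62 would be handed to the fallback chain.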
Calibration plan:
- Run 200‑sample eval across Helix golden tasks (RAG, tool‑use, code review).
- Fit reliability diagram; adjust soft_flag to equalize FP/FN at δ≤2%.
- Freeze for 7 days; revisit after first incident or drift >3%.
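One way to read the "equalize FP/FN" step is as a threshold sweep over the labeled eval samples. The sketch below assumes that reading: a sample is flagged when its score falls below the candidate threshold, FP and FN are compared as rates, and δ is the absolute gap between those rates. The function name and the labeling scheme are illustrative, and the reliability diagram itself is not covered here.

```python
def balance_soft_flag(scores, is_bad, candidates, max_gap=0.02):
    """Pick the candidate threshold with the smallest |FP rate - FN rate|.

    scores: QSR score per eval sample.
    is_bad: True where a reviewer judged the sample a genuine failure.
    Returns (threshold, gap, within_tolerance).
    """
    n_bad = sum(1 for bad in is_bad if bad)
    n_good = len(is_bad) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need both good and bad samples to compare rates")
    best = None
    for threshold in candidates:
        # False positive: a good sample that would be flagged.
        fp = sum(1 for s, bad in zip(scores, is_bad) if s < threshold and not bad)
        # False negative: a bad sample that would pass unflagged.
        fn = sum(1 for s, bad in zip(scores, is_bad) if s >= threshold and bad)
        gap = abs(fp / n_good - fn / n_bad)
        if best is None or gap < best[1]:
            best = (threshold, gap)
    threshold, gap = best
    return threshold, gap, gap <= max_gap
```

If no candidate brings the gap within tolerance, the third return value is False, which is a signal to collect more samples or widen the candidate grid before freezing the threshold.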
6) Integration Steps
A) Pull & list (Ollama):
ollama pull ibm/granite4:small-h
ollama list | grep granite4
B) Enable in Helix router:
sudo tee /opt/helix/router/models.d/granite4-small-h.json >/dev/null <<'JSON'
{
  "route": "granite4-small-h",
  "backend": "ollama",
  "model": "ibm/granite4:small-h",
  "parameters": {"temperature": 0, "num_ctx": 32768}
}
JSON
sudo systemctl restart helix-router
C) Register proofs:
helix-stats record --model granite4-small-h --source ollama --hash-file /opt/helix/proofs/models/granite4-small-h.sha256
7) RAG / Tool‑Use Settings
- RAG: prefer temperature 0, top_p=0.9, max_new_tokens=512; penalize repetition >1.1
- Tool‑calling: enable JSON‑mode; enforce schema; set tool_timeout_ms=8000
- Memory windows: use 2‑phase: compressive memory for history >12k tokens; emit summarization proofs into ttd_memory_v2
JSON schema example:
{
  "name": "lookup_kb",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 10}
    },
    "required": ["query"],
    "additionalProperties": false
  }
}
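Enforcing this schema on the router side means rejecting tool calls before they reach the backend. The following is a hand-rolled check for this one schema using only the standard library; the function name is hypothetical, and a general deployment would more likely use a full JSON Schema validator.

```python
def validate_lookup_kb_args(args):
    """Return a list of violations of the lookup_kb parameter schema."""
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    errors = []
    # additionalProperties: false
    extra = set(args) - {"query", "top_k"}
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")
    # required: ["query"], type string
    if "query" not in args:
        errors.append("missing required field: query")
    elif not isinstance(args["query"], str):
        errors.append("query must be a string")
    # top_k is optional: integer in [1, 10]
    if "top_k" in args:
        top_k = args["top_k"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(top_k, bool) or not isinstance(top_k, int):
            errors.append("top_k must be an integer")
        elif not 1 <= top_k <= 10:
            errors.append("top_k must be between 1 and 10")
    return errors
```

An empty list means the arguments conform; any other result can be returned to the model as a retry hint or logged as a tool-call failure.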
8) Smoke Test (Helix)
curl -fsS http://127.0.0.1:9010/chat -H 'content-type: application/json' -d '{
  "route": "granite4-small-h",
  "messages": [
    {"role":"system","content":"You are a precise Helix assistant. Answer in JSON."},
    {"role":"user","content":"Summarize the Helix Core Ethos in 3 bullets."}
  ],
  "tools": [],
  "trace": true
}' | jq '. | {text: .choices[0].message.content, audit: .audit}'
Expected: JSON reply ≤ 120 tokens; audit block present with model hash and quantization.
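The expectation above can be checked automatically once the raw response is saved. This sketch assumes the response shape implied by the jq filter (`choices[0].message.content` plus a top-level `audit` object) and that the audit block uses the field names from section 4; the 120-token limit is not checked because that needs the model tokenizer.

```python
import json


def check_smoke_response(response):
    """Return a list of problems found in a parsed smoke-test response."""
    try:
        content = response["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return ["missing choices[0].message.content"]
    problems = []
    try:
        json.loads(content)  # the system prompt asks for a JSON reply
    except (TypeError, ValueError):
        problems.append("reply is not valid JSON")
    audit = response.get("audit") or {}
    # Assumed field names, taken from the audit header template.
    for field in ("artifact_sha256", "quantization"):
        if not audit.get(field):
            problems.append(f"audit.{field} missing")
    return problems
```

A response passes the smoke test when the returned list is empty.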
9) Observability & KPIs
- Latency P50/P95: ≤ 800 ms / ≤ 1.8 s at 1k tokens on L40S (target)
- Tool‑call success rate: ≥ 96%
- RAG groundedness (auto‑eval): ≥ 0.88
- Incident budget: ≤ 1 hard‑block per 5k calls weekly
- Drift trigger: QSR moving average down 3% or more over 24h
Export Prometheus metrics under helix_granite4_small_h_* (latency, tokens, qsr, blocks, fallbacks).
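A rough sketch of how some of these targets could be evaluated offline from collected samples. The function names are illustrative, the drift check assumes the 3% is a relative drop against a baseline average, and percentiles use the standard library's inclusive method.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return (P50, P95) in milliseconds from raw latency samples."""
    # n=100 yields 99 cut points; index 49 is P50 and index 94 is P95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def kpi_violations(latencies_ms, tool_calls_ok, tool_calls_total):
    """Compare measurements against the latency and tool-call targets."""
    p50, p95 = latency_percentiles(latencies_ms)
    violations = []
    if p50 > 800:
        violations.append("P50 latency above 800 ms")
    if p95 > 1800:
        violations.append("P95 latency above 1.8 s")
    if tool_calls_ok / tool_calls_total < 0.96:
        violations.append("tool-call success rate below 96%")
    return violations


def drift_triggered(baseline_avg, current_avg, max_drop=0.03):
    """Assumed reading: relative drop of the QSR moving average."""
    return (baseline_avg - current_avg) / baseline_avg >= max_drop
```

In production these checks would more naturally live in alerting rules over the exported metrics; the sketch is only meant to pin down what each target compares.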
10) Risk & Policy Notes
- Hybrid MoE/SSM can surface long‑context carryover errors; reset memory at task boundaries.
- JSON‑mode hallucination: enforce strict schemas; reject extra fields.
- PII: apply Helix PII scrubber pre‑ and post‑gen; route hits to human review.
- Rollback: pre‑stage magnus-supernova + qwen3-7b; automatic switchover on hard_block.
11) Rollout Plan
- Stage (dev): canary 5% of RAG traffic for 24h → compare KPIs.
- Pilot (prod shadow): mirror 10% queries; human‑only consumption.
- Prod: ramp 10% → 25% → 50% with guardrails; freeze if incident budget breached.
Change ticket: HMI‑2025‑G4SH‑ROLLOUT‑001
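The prod ramp and its freeze condition can be expressed as a small gate. This is a sketch under two assumptions: the incident budget from section 9 is treated as a rate (at most 1 hard-block per 5,000 calls in the observation window), and a breach holds traffic at the current step rather than rolling back. The names are hypothetical.

```python
RAMP_STEPS = [10, 25, 50]  # percent of traffic, from the Prod stage


def incident_budget_breached(hard_blocks, calls):
    """True when hard-blocks exceed 1 per 5,000 calls (rate-based reading)."""
    return hard_blocks * 5000 > calls


def next_traffic_percent(current, hard_blocks, calls):
    """Advance to the next ramp step, or freeze if the budget is breached."""
    if incident_budget_breached(hard_blocks, calls):
        return current  # freeze at the current level
    later_steps = [step for step in RAMP_STEPS if step > current]
    return later_steps[0] if later_steps else current
```

For example, 1 hard-block in 5,000 calls at 10% allows the move to 25%, while 2 hard-blocks in the same window would keep traffic where it is.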
12) Appendices
A) Incantations
# Quantized alt pulls
ollama pull ibm/granite4:small-h-q4_K_M
ollama pull ibm/granite4:small-h-q5_K_M
# vLLM (HF)
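python - <<'PY'
from vllm import LLM, SamplingParams

# Greedy decoding, matching the temperature 0 recommendation.
params = SamplingParams(temperature=0, max_tokens=64)
llm = LLM(model="ibm-granite/granite-4.0-h-small", dtype="bfloat16")
for out in llm.generate(["Hello Granite 4!"], params):
    print(out.outputs[0].text)  # generated text only, not the full RequestOutput
PY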
B) Consent stub (TTD)
{
  "consent_id": "ttd-consent-granite4-small-h-2025-10-13",
  "subject": "Deployment of IBM Granite 4.0 H-Small in Helix",
  "artifacts": [
    {"name": "granite4-small-h.gguf", "sha256": "<fill>", "source": "ollama"}
  ],
  "approvers": ["owner:helix","safety_champion"],
  "effective_from": "2025-10-13T00:00:00Z",
  "notes": "Apache-2.0; hybrid mamba2+transformer; long-context tasks"
}
— End of HMI‑2025‑G4SH —
