HMI‑2025‑G4SH — Granite‑4 Small‑Hybrid

From Helix Project Wiki
Revision as of 13:24, 13 October 2025 by Steve Helix

HMI‑2025‑G4SH — Granite‑4 Small‑Hybrid (Helix Model Integration Sheet)

Status: READY FOR INTEGRATION • Owner: Helix Ops • Last Updated: 2025‑10‑13


0) At‑a‑Glance

  • Model ID (Helix): granite4-small-h
  • Upstream Name: ibm/granite4:small-h (aka Granite 4.0 H‑Small)
  • Family: IBM Granite 4.0 (hybrid Mamba‑2 + Transformer)
  • License: Apache‑2.0 (open weights)
  • Params (total / active): ~32B total / ~9B active (hybrid MoE)
  • Context: long‑context capable via the SSM (Mamba‑2) layers
  • Sampling: temperature 0 recommended for most inference
  • Quantized Builds: Q4_K_M (~19 GB), Q5_K_M (~23 GB) as GGUF/ollama variants
  • Targets: Low‑latency enterprise assistants, RAG, tool‑use, code & ops copilots

1) Provenance & Verification

Sources: IBM Granite 4.0 model cards/repos + Ollama registry entries. Record the exact artifact you deploy.

Record on deploy:

  • source_url: (HF/GitHub/Ollama)
  • artifact_sha256: (computed locally)
  • artifact_size_bytes: (from filesystem)
  • pull_command: (ollama pull ibm/granite4:small-h or HF path)
  • signed_by: (if signed; attach attestation if present)
  • helix_proof_id: (SHA‑256 of this HMI file)

Hashing procedure:

# After pull/export to file (example path):
sha256sum /opt/models/ibm/granite4-small-h.gguf | tee /opt/helix/proofs/models/granite4-small-h.sha256

Attach the resulting .sha256 to the deployment record and reference it from TTD consent.
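
The same record can be produced programmatically. The following is a minimal sketch, assuming the artifact has already been exported to a local file; the function names are illustrative and not part of any Helix tooling. It streams the file through SHA‑256 and fills in the locally computable "Record on deploy" fields.

```python
import hashlib
import os


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB artifacts never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256sum_line(path: str) -> str:
    """Format the digest the way `sha256sum` prints it: '<hex>  <path>'."""
    return f"{sha256_file(path)}  {path}\n"


def provenance_record(artifact_path: str, source_url: str, pull_command: str) -> dict:
    """Collect the deploy-record fields that can be computed locally.

    signed_by and helix_proof_id are left to the operator; this sketch
    does not guess them.
    """
    return {
        "source_url": source_url,
        "artifact_sha256": sha256_file(artifact_path),
        "artifact_size_bytes": os.path.getsize(artifact_path),
        "pull_command": pull_command,
    }
```

Streaming in chunks matters here because the quantized builds are roughly 19 to 23 GB.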


2) Compatibility Matrix (Helix Runtimes)

Runtime                  | Status     | Notes
Ollama                   | ✅ Primary | Official ibm/granite4:small-h images; multiple quantizations
vLLM                     |            | Use HF checkpoint; ensure Mamba‑2 support flags enabled
Text‑Gen Inference (TGI) |            | Load via transformers w/ Mamba‑2 kernels
LM Studio                |            | H‑Small listed; for local eval

3) Helix Registry Entry (YAML)

# /opt/helix/registry/models/granite4-small-h.yaml
model_id: granite4-small-h
family: granite
version: "4.0"
upstream: ibm/granite4:small-h
license: Apache-2.0
architecture:
  type: hybrid
  mix: [transformer, mamba2]
  moe:
    total_parameters_b: 32
    active_parameters_b: 9
context:
  recommended_temperature: 0
  max_new_tokens_default: 512
quantizations:
  - name: Q4_K_M
    approx_size_gb: 19
  - name: Q5_K_M
    approx_size_gb: 23
capabilities:
  instruction_following: strong
  tool_use: strong
  code: medium
  languages: [en, de, es, fr, ja, pt, ar, cs, it, ko, nl, zh]
qsr_profile: granite4-small-h-2025Q4
policy:
  pii_scrub: helix-default
  safety_tier: standard
  audit_headers: true

4) TTD/Helix Audit Headers

Add these headers (or JSON fields) to each generation record emitted by the router.

{
  "model": "granite4-small-h",
  "vendor": "ibm",
  "version": "4.0",
  "artifact_sha256": "<fill-from-provenance>",
  "quantization": "Q4_K_M|Q5_K_M|bf16",
  "inference_stack": "ollama|vllm|tgi",
  "router": "helix-ttd-shim/>=1.3",
  "x_granite_proof": {
    "source": "ollama|hf|github",
    "pulled_at": "<iso8601>",
    "attestation": "<optional-blob-or-url>"
  }
}
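
A minimal sketch of how a router shim might assemble and validate this record before emitting it. The function name and the `router_version` argument are assumptions (the template above only states the `>=1.3` constraint, so a concrete version string is assumed at emit time); the allowed values are taken from the placeholders in the template.

```python
import re
from datetime import datetime, timezone

ALLOWED_QUANTIZATIONS = {"Q4_K_M", "Q5_K_M", "bf16"}
ALLOWED_STACKS = {"ollama", "vllm", "tgi"}
ALLOWED_SOURCES = {"ollama", "hf", "github"}
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def build_audit_record(artifact_sha256, quantization, inference_stack, source,
                       router_version, pulled_at=None, attestation=None):
    """Return one audit record; raise ValueError on any value outside the template."""
    if not SHA256_HEX.fullmatch(artifact_sha256):
        raise ValueError("artifact_sha256 must be 64 lowercase hex characters")
    if quantization not in ALLOWED_QUANTIZATIONS:
        raise ValueError(f"unknown quantization: {quantization}")
    if inference_stack not in ALLOWED_STACKS:
        raise ValueError(f"unknown inference stack: {inference_stack}")
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source}")
    record = {
        "model": "granite4-small-h",
        "vendor": "ibm",
        "version": "4.0",
        "artifact_sha256": artifact_sha256,
        "quantization": quantization,
        "inference_stack": inference_stack,
        "router": f"helix-ttd-shim/{router_version}",  # assumed concrete version
        "x_granite_proof": {
            "source": source,
            # Default to "now" in UTC when the pull time was not captured.
            "pulled_at": pulled_at or datetime.now(timezone.utc).isoformat(),
        },
    }
    if attestation is not None:  # attestation is optional in the template
        record["x_granite_proof"]["attestation"] = attestation
    return record
```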

5) QSR Defaults (Helix Quality Score Rubric)

Use these as initial thresholds; tune with live telemetry over the first week.

# /opt/helix/qsr/profiles/granite4-small-h.yaml
profile_id: granite4-small-h-2025Q4
weights:
  coherence: 0.28
  accuracy: 0.26
  completeness: 0.18
  relevance: 0.18
  novelty: 0.10
thresholds:
  soft_flag: 0.74     # trigger human‑in‑the‑loop note
  hard_block: 0.62    # route to rollback / alternative model
mri_risk_tiers:       # ranges are [lower, upper); only the low tier includes its upper bound
  low:    [0.80, 1.00]
  medium: [0.70, 0.80]
  high:   [0.00, 0.70]
fallback_chain:
  - model: magnus-supernova
  - model: qwen3-7b
  - model: deepseek-coder
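
As a worked illustration of the profile above, the sketch below combines per-dimension scores into one QSR value and maps it to an action and a risk tier. It assumes each component is scored in [0, 1], that the QSR is the plain weighted sum, and that a score below a threshold triggers it; none of these conventions is stated in the profile, and the function names are hypothetical.

```python
WEIGHTS = {
    "coherence": 0.28,
    "accuracy": 0.26,
    "completeness": 0.18,
    "relevance": 0.18,
    "novelty": 0.10,
}
SOFT_FLAG = 0.74
HARD_BLOCK = 0.62


def qsr_score(components: dict) -> float:
    """Weighted sum of the five rubric dimensions (weights sum to 1.0)."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def qsr_action(score: float) -> str:
    """Assumed convention: scores below a threshold trigger that threshold."""
    if score < HARD_BLOCK:
        return "hard_block"
    if score < SOFT_FLAG:
        return "soft_flag"
    return "pass"


def mri_risk_tier(score: float) -> str:
    """Half-open tiers: high [0.00, 0.70), medium [0.70, 0.80), low [0.80, 1.00]."""
    if score >= 0.80:
        return "low"
    if score >= 0.70:
        return "medium"
    return "high"
```

Note that with these numbers a score between 0.70 and 0.74 is medium risk and still soft-flagged, while anything under 0.62 is both high risk and hard-blocked.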

Calibration plan:

  • Run a 200‑sample eval across the Helix golden tasks (RAG, tool‑use, code review).
  • Fit a reliability diagram; adjust soft_flag until the false‑positive and false‑negative rates differ by no more than 2% (δ ≤ 2%).
  • Freeze for 7 days; revisit after the first incident or when drift exceeds 3%.
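
The threshold adjustment in the second step could be sketched as a simple sweep. This assumes each eval sample carries a QSR score plus a reviewer label (`is_bad`), that a sample is flagged when its score is below the threshold, and that δ means the absolute gap between the false‑positive and false‑negative rates; the plan does not spell these out, so treat the sketch as one possible reading. It does not cover the reliability diagram itself.

```python
def calibrate_soft_flag(samples, candidates=None, max_gap=0.02):
    """Pick the soft_flag threshold whose FP and FN rates are closest.

    samples: list of (qsr_score, is_bad) pairs, where is_bad=True means a
    reviewer judged the output faulty (it should have been flagged).
    """
    if candidates is None:
        # Sweep 0.50 .. 0.95 in steps of 0.01.
        candidates = [round(0.50 + 0.01 * i, 2) for i in range(46)]
    n_bad = sum(1 for _, bad in samples if bad)
    n_good = len(samples) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need both good and bad samples to calibrate")
    best = None
    for threshold in candidates:
        # False positive: a good output that would be flagged.
        fp = sum(1 for score, bad in samples if not bad and score < threshold)
        # False negative: a faulty output that would pass unflagged.
        fn = sum(1 for score, bad in samples if bad and score >= threshold)
        fp_rate, fn_rate = fp / n_good, fn / n_bad
        gap = abs(fp_rate - fn_rate)
        if best is None or gap < best["gap"]:
            best = {"soft_flag": threshold, "fp_rate": fp_rate,
                    "fn_rate": fn_rate, "gap": gap}
    best["within_tolerance"] = best["gap"] <= max_gap
    return best
```

If `within_tolerance` comes back false, no candidate meets the δ target on this sample and the eval set probably needs to grow before freezing.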

6) Integration Steps

A) Pull & list (Ollama):

ollama pull ibm/granite4:small-h
ollama list | grep granite4

B) Enable in Helix router:

sudo tee /opt/helix/router/models.d/granite4-small-h.json >/dev/null <<'JSON'
{
  "route": "granite4-small-h",
  "backend": "ollama",
  "model": "ibm/granite4:small-h",
  "parameters": {"temperature": 0, "num_ctx": 32768}
}
JSON
sudo systemctl restart helix-router

C) Register proofs:

helix-stats record --model granite4-small-h --source ollama --hash-file /opt/helix/proofs/models/granite4-small-h.sha256

7) RAG / Tool‑Use Settings

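  • RAG: temperature 0, top_p=0.9, max_new_tokens=512, repetition penalty ≥ 1.1. At temperature 0 decoding is effectively greedy, so top_p only takes effect if the temperature is raised.
  • Tool‑calling: enable JSON mode, enforce the tool schema, and set tool_timeout_ms=8000
  • Memory windows: use a 2‑phase scheme in which history beyond 12k tokens goes to compressive memory; emit summarization proofs into ttd_memory_v2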
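
For the Ollama backend, these settings map onto the `options` object of the chat API. The sketch below shows one possible mapping: `num_predict` and `repeat_penalty` are Ollama's names for the token cap and repetition penalty, 1.1 is used as a starting value, and `tool_timeout_ms` is treated as a router-side setting because Ollama has no such option. The helper names are hypothetical.

```python
COMPRESSIVE_MEMORY_THRESHOLD_TOKENS = 12_000


def rag_options(num_ctx: int = 32768) -> dict:
    """Ollama generation options matching the RAG settings above."""
    return {
        "temperature": 0,
        "top_p": 0.9,           # inert while temperature is 0
        "num_predict": 512,     # Ollama's name for max new tokens
        "repeat_penalty": 1.1,  # starting value; raise if loops appear
        "num_ctx": num_ctx,
    }


def chat_payload(messages: list, model: str = "ibm/granite4:small-h",
                 json_mode: bool = False) -> dict:
    """Request body for Ollama's /api/chat endpoint (non-streaming)."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": rag_options(),
    }
    if json_mode:
        payload["format"] = "json"  # ask Ollama to constrain output to JSON
    return payload


def needs_compressive_memory(history_tokens: int) -> bool:
    """True once the history exceeds the 12k-token window from the list above."""
    return history_tokens > COMPRESSIVE_MEMORY_THRESHOLD_TOKENS
```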

JSON schema example:

{
  "name": "lookup_kb",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 10}
    },
    "required": ["query"],
    "additionalProperties": false
  }
}
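
To show what "enforce schema" means for this tool, here is a hand-rolled check of model-produced arguments against the `lookup_kb` schema, using only the standard library. It is a sketch for this one schema, not a general JSON Schema validator; a production router would more likely use a dedicated validation library.

```python
LOOKUP_KB_SCHEMA = {
    "name": "lookup_kb",
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}


def validate_lookup_kb(args: dict) -> list:
    """Return a list of problems; an empty list means the call is acceptable."""
    params = LOOKUP_KB_SCHEMA["parameters"]
    props = params["properties"]
    errors = []
    for name in params["required"]:
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            # additionalProperties is false, so extra fields are rejected.
            errors.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        if spec["type"] == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
        elif spec["type"] == "integer":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                errors.append(f"{name} must be an integer")
            elif not spec["minimum"] <= value <= spec["maximum"]:
                errors.append(
                    f"{name} must be between {spec['minimum']} and {spec['maximum']}")
    return errors
```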

8) Smoke Test (Helix)

curl -fsS http://127.0.0.1:9010/chat -H 'content-type: application/json' -d '{
  "route": "granite4-small-h",
  "messages": [
    {"role":"system","content":"You are a precise Helix assistant. Answer in JSON."},
    {"role":"user","content":"Summarize the Helix Core Ethos in 3 bullets."}
  ],
  "tools": [],
  "trace": true
}' | jq '{text: .choices[0].message.content, audit: .audit}'

Expected: JSON reply ≤ 120 tokens; audit block present with model hash and quantization.
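
The expectation can be checked mechanically once the response body is parsed. The sketch below assumes the response shape implied by the jq filter (`choices[0].message.content` and `audit`) and the audit field names from section 4. The `usage.completion_tokens` field is an assumption; if the router does not report it, the token limit is simply not checked.

```python
import json

REQUIRED_AUDIT_FIELDS = ("artifact_sha256", "quantization")


def check_smoke_response(resp: dict, max_tokens: int = 120) -> list:
    """Return a list of problems found in a parsed smoke-test response."""
    try:
        text = resp["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return ["missing choices[0].message.content"]
    problems = []
    try:
        json.loads(text)  # the system prompt asks for a JSON answer
    except (TypeError, ValueError):
        problems.append("reply is not valid JSON")
    # Hypothetical field: only enforced when the router reports token usage.
    used = (resp.get("usage") or {}).get("completion_tokens")
    if used is not None and used > max_tokens:
        problems.append(f"reply used {used} tokens, limit is {max_tokens}")
    audit = resp.get("audit") or {}
    for field in REQUIRED_AUDIT_FIELDS:
        if not audit.get(field):
            problems.append(f"audit.{field} missing")
    return problems
```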


9) Observability & KPIs

  • Latency P50/P95: ≤ 800 ms / ≤ 1.8 s at 1k tokens on an L40S (target)
  • Tool‑call success rate: ≥ 96%
  • RAG groundedness (auto‑eval): ≥ 0.88
  • Incident budget: ≤ 1 hard‑block per 5k calls, evaluated weekly
  • Drift trigger: QSR moving average drops by 3% or more over 24h

Export Prometheus metrics under helix_granite4_small_h_* (latency, tokens, qsr, blocks, fallbacks).
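
A small sketch of how the latency and drift KPIs might be evaluated from raw samples. It uses the nearest-rank percentile definition and reads the drift trigger as a relative drop against a baseline QSR; both choices are assumptions, since the targets above do not define them.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_within_target(latencies_ms: list,
                          p50_target_ms: float = 800,
                          p95_target_ms: float = 1800) -> bool:
    """Check the P50/P95 latency targets from the KPI list."""
    return (percentile(latencies_ms, 50) <= p50_target_ms
            and percentile(latencies_ms, 95) <= p95_target_ms)


def drift_triggered(baseline_qsr: float, moving_avg_qsr: float,
                    max_drop: float = 0.03) -> bool:
    """True when the 24h moving average fell by max_drop or more, relative to baseline."""
    return (baseline_qsr - moving_avg_qsr) / baseline_qsr >= max_drop
```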


10) Risk & Policy Notes

  • Hybrid MoE/SSM can surface long‑context carryover errors; reset memory at task boundaries.
  • JSON‑mode hallucination: enforce strict schemas; reject extra fields.
  • PII: apply Helix PII scrubber pre‑ and post‑gen; route hits to human review.
  • Rollback: pre‑stage magnus-supernova + qwen3-7b; automatic switchover on hard_block.
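
The automatic switchover could look roughly like the sketch below. The `generate` and `score` callables are hypothetical stand-ins for the router's backend call and QSR scorer, and the chain order follows the fallback_chain in the QSR profile.

```python
FALLBACK_CHAIN = ["magnus-supernova", "qwen3-7b", "deepseek-coder"]


def generate_with_fallback(generate, score, prompt,
                           primary="granite4-small-h", hard_block=0.62):
    """Try the primary model, then each fallback, until one clears hard_block.

    generate(model, prompt) -> reply and score(reply) -> float are supplied
    by the caller; both are placeholders in this sketch.
    """
    for model in [primary, *FALLBACK_CHAIN]:
        reply = generate(model, prompt)
        if score(reply) >= hard_block:
            return model, reply
    raise RuntimeError("all models in the fallback chain were hard-blocked")
```

Returning the model name alongside the reply lets the caller write the correct model into the audit record when a fallback answered.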

11) Rollout Plan

  1. Stage (dev): canary 5% of RAG traffic for 24h → compare KPIs.
  2. Pilot (prod shadow): mirror 10% of queries; outputs are for human review only.
  3. Prod: ramp 10% → 25% → 50% with guardrails; freeze if incident budget breached.
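
The prod ramp with its freeze rule can be expressed as a small decision function. This sketch treats the incident budget as a rate (hard-blocks per call over the evaluation window), which is an assumption about how the budget is measured; the function names are illustrative.

```python
RAMP_STEPS = [10, 25, 50]  # percent of prod traffic, in rollout order


def incident_budget_ok(hard_blocks: int, calls: int) -> bool:
    """Budget: at most 1 hard-block per 5,000 calls in the window."""
    return hard_blocks * 5000 <= calls


def next_traffic_share(current_pct: int, hard_blocks: int, calls: int) -> int:
    """Advance to the next ramp step, or hold the current share if the budget is breached."""
    if not incident_budget_ok(hard_blocks, calls):
        return current_pct  # freeze
    for step in RAMP_STEPS:
        if step > current_pct:
            return step
    return current_pct  # already at the final step
```

For example, at 10% with 1 hard-block in 5,000 calls the ramp advances to 25%, while 2 hard-blocks in the same window hold it at the current share.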

Change ticket: HMI‑2025‑G4SH‑ROLLOUT‑001


12) Appendices

A) Incantations

# Quantized alt pulls (confirm the exact tag names in the Ollama registry before use)
ollama pull ibm/granite4:small-h-q4_K_M
ollama pull ibm/granite4:small-h-q5_K_M

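# vLLM (HF)
python - <<'PY'
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-h-small", dtype="bfloat16")
# Greedy decoding (temperature 0) matches the recommended setting.
params = SamplingParams(temperature=0, max_tokens=512)
for out in llm.generate(["Hello Granite 4!"], params):
    print(out.outputs[0].text)  # generated text only, not the RequestOutput repr
PY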

B) Consent stub (TTD)

{
  "consent_id": "ttd-consent-granite4-small-h-2025-10-13",
  "subject": "Deployment of IBM Granite 4.0 H-Small in Helix",
  "artifacts": [
    {"name": "granite4-small-h.gguf", "sha256": "<fill>", "source": "ollama"}
  ],
  "approvers": ["owner:helix","safety_champion"],
  "effective_from": "2025-10-13T00:00:00Z",
  "notes": "Apache-2.0; hybrid mamba2+transformer; long-context tasks"
}

— End of HMI‑2025‑G4SH —