Sovereign LLM Cluster
On-prem Qwen 3.5 122B + embeddings on dual-GPU R730. Zero per-token spend on internal AI workloads.
Removes per-token AI spend from internal triage, scoring, and summarization workloads. Owns the inference layer rather than renting it. Frees the AI Interaction Analyzer and OCR Pipeline to run at any volume without a cost cap.
A locally-hosted LLM cluster running on the R730 (88 cores / 768 GB RAM, dual NVIDIA GPUs, GPU-aware fan control). The active reasoning model is Qwen 3.5 122B at 81 GB on disk; nomic-embed-text handles embedding workloads for the RAG service in security-stack. Both are served by Ollama and reachable across the Tailscale mesh.
The cluster is the substrate for any non-real-time AI workload: nightly summary jobs, batch transcription scoring, embedding generation for the SOC RAG, OCR post-processing on tank inspections, and reasoning over scraped permit corpora. Anthropic Claude is reserved for latency-sensitive customer-facing flows; everything internal flows here.
- > Qwen 3.5 122B (81 GB) live since 2026 Q2
- > Dual-GPU R730 with custom fan control
- > nomic-embed-text for vector workloads
- > Tailscale-mesh accessible from every node in the cluster
- → Add a third smaller model for low-latency triage
- → Move PeakDipVibe pipeline fully on-prem
- → Publish internal usage dashboard