Pattern Context
We've deployed LLM infrastructure in seventeen energy utilities over the past two years. Every single one required air-gapped operation. Not "we'd prefer air-gapped" or "maybe eventually air-gapped"—absolute, non-negotiable isolation from public networks. NERC CIP-005 doesn't give you wiggle room when you're working with critical cyber assets. Your LLM infrastructure either runs completely isolated or it doesn't run at all.
The pattern we're describing applies specifically to operational technology environments where you need LLM capabilities for incident analysis, maintenance documentation, SCADA alarm correlation, or procedure generation—but you cannot send data outside your control perimeter. This isn't about paranoia. It's about regulatory compliance and operational reality.
The Problem
Energy sector operations generate massive amounts of unstructured data: maintenance logs, incident reports, equipment manuals, operator procedures, alarm histories, weather correlations, and decades of tribal knowledge that exists only in email threads and handwritten notes. LLMs are genuinely useful for making sense of this chaos. We've seen 40-60% reductions in mean time to diagnosis when operators can query historical incident data in natural language instead of grepping through log files.
But here's the constraint set:
- No cloud APIs: OpenAI, Anthropic, and Google are completely off the table. Even "private cloud" deployments trigger compliance issues because data leaves your administrative control.
- Hardware limitations: OT environments run on modest hardware. You're not getting an H100 cluster. You might get a single server with 2x RTX 4090s if you're lucky. More likely you're working with consumer-grade GPUs or even CPU-only inference.
- Model updates are painful: In air-gapped environments, updating models means physically carrying drives through security checkpoints. You can't just docker pull the latest weights.
- Inference latency matters: Operators won't use a tool that takes 30 seconds to answer a question. If your inference time exceeds 5 seconds, adoption collapses.
- Multi-user concurrency: You need to support 20-50 simultaneous users during incident response, not just single-user experimentation.
The traditional approach—spinning up a cloud API integration—solves none of this. You need a complete local stack that handles model serving, document ingestion, vector search, and user interfaces without ever touching external networks.
Solution Architecture
After multiple failed attempts with custom solutions, we've converged on a three-layer architecture that actually works in production:
Layer 1: Model Serving with Ollama
Ollama is the foundation. It handles model lifecycle, inference, and GPU management with zero ceremony. We run it on dedicated inference servers—typically Dell PowerEdge R750 with 2x RTX 4090 or A6000 cards.
Key configuration decisions:
- Model selection: We standardize on Llama 3.1 8B for most deployments. The 70B models provide better reasoning but require multi-GPU setups and increase latency to 8-12 seconds for complex queries. In our testing across twelve utilities, the marginal quality improvement from 70B didn't justify the 3x latency increase for operational queries.
- Quantization: GGUF Q4_K_M quantization reduces memory footprint by 75% with minimal quality loss. We tested Q2 through Q8 variants on incident analysis tasks and found Q4_K_M hits the sweet spot—faster inference than Q8, better accuracy than Q3.
- Concurrent requests: Ollama's parallel request handling is critical. With default settings, you'll queue requests sequentially. We configure OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2 to serve multiple users simultaneously. This requires 48GB+ VRAM but enables sub-3-second response times even under load.
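As a rough sanity check on those memory numbers, weight footprint scales linearly with bits per weight. The sketch below assumes about 4.5 effective bits per weight for Q4_K_M and 16 bits for FP16; these are assumptions for illustration, and real usage adds KV cache and runtime overhead on top (more so with multiple loaded models and parallel slots, which is why the 48GB+ VRAM figure above is comfortable rather than generous):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Model weights only; KV cache and runtime overhead come on top."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16  = weight_footprint_gb(8, 16.0)  # 16.0 GB for an 8B model at FP16
q4_km = weight_footprint_gb(8, 4.5)   # 4.5 GB at ~4.5 effective bits/weight
saving = 1 - q4_km / fp16             # ~72% reduction, close to the 75% above
```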
Ollama's API is OpenAI-compatible, which matters because it makes the rest of the stack portable. If you later get budget for bigger hardware or decide to swap models, your application layer doesn't change.
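Because the API is OpenAI-compatible, the application layer only ever builds standard chat-completion payloads. A minimal sketch, assuming Ollama's default port and an illustrative model tag; the POST itself is left out since it needs a live server:

```python
import json

# Assumed local endpoint: Ollama exposes an OpenAI-compatible API
# under /v1 on port 11434 by default.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, question: str, system: str = "") -> dict:
    """Build a standard OpenAI-style chat payload for the local endpoint."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    return {"model": model, "messages": messages, "stream": False}

payload = build_chat_request(
    "llama3.1:8b",
    "Summarize last night's breaker trip events at Substation 12.",
)
body = json.dumps(payload)
# POST `body` to OLLAMA_URL in production. Swapping hardware or models
# later means changing the "model" field, not the application layer.
```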
Layer 2: Document Intelligence with AnythingLLM
AnythingLLM provides the RAG layer. It handles document ingestion, chunking, embedding, vector storage, and retrieval—all the infrastructure you need to let LLMs answer questions about your specific operational data.
We run AnythingLLM on the same physical server as Ollama but in a separate container. Critical implementation details:
- Vector database: AnythingLLM supports multiple backends. We use Qdrant for vector storage because it handles 50M+ document chunks without degradation and provides efficient filtered search. The built-in LanceDB option works for smaller deployments under 10M chunks but chokes on utility-scale document sets.
- Embedding model: We use nomic-embed-text running locally via Ollama. It produces 768-dimensional embeddings and runs fast enough (500+ embeddings/second on RTX 4090) to ingest 100,000 documents overnight. Some teams prefer mxbai-embed-large for slightly better retrieval accuracy, but the speed difference matters during bulk ingestion.
- Chunking strategy: Default 1000-token chunks with 200-token overlap work well for technical documentation. For SCADA alarm logs, we chunk by timestamp boundaries instead—each chunk is a complete alarm sequence with 2-minute context windows.
- Multi-workspace isolation: AnythingLLM's workspace concept maps perfectly to operational boundaries. We create separate workspaces for generation data, transmission operations, distribution maintenance, and corporate functions. Access control happens at workspace level, which satisfies auditors.
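The timestamp-boundary chunking for alarm logs can be sketched as a simple gap-based grouper: a new chunk starts whenever the gap to the previous event exceeds the context window. The event format and names here are hypothetical; the 2-minute window matches the figure above:

```python
from datetime import datetime, timedelta

def chunk_alarms(events, window=timedelta(minutes=2)):
    """Group (timestamp, text) alarm events into complete alarm sequences."""
    chunks, current = [], []
    last_ts = None
    for ts, text in sorted(events):
        if last_ts is not None and ts - last_ts > window:
            chunks.append(current)  # gap exceeded: close the sequence
            current = []
        current.append((ts, text))
        last_ts = ts
    if current:
        chunks.append(current)
    return chunks

t0 = datetime(2024, 5, 1, 3, 0, 0)
events = [
    (t0, "XFMR-2 temp high"),
    (t0 + timedelta(seconds=30), "XFMR-2 fan start"),
    (t0 + timedelta(minutes=10), "LINE-7 breaker trip"),  # new sequence
]
chunks = chunk_alarms(events)  # two chunks: the XFMR-2 pair, then LINE-7
```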
The web interface is decent—not beautiful, but functional enough that operators actually use it. We've tried building custom React frontends and consistently found that the maintenance burden outweighs any UX improvements.
Layer 3: User Interface Options
Beyond AnythingLLM's built-in UI, we provide additional interfaces depending on user sophistication:
- Open WebUI: We deploy this for technical users who want more control over system prompts, temperature, and retrieval parameters. It connects to the same Ollama backend but provides a cleaner chat interface and better conversation management. The model switching interface is particularly useful when we're running multiple specialized models for different tasks.
- LibreChat: For teams that need multi-agent workflows or want to integrate with existing chat platforms, LibreChat provides more sophisticated orchestration. We've used it in three deployments where operators needed to coordinate between multiple LLM agents—one for procedure lookup, another for equipment specs, a third for regulatory compliance checks.
- Msty: On operator workstations where we can install desktop software, Msty provides the best user experience. The Shadow Personas feature is genuinely useful for switching between different operational contexts—generation planning mode vs. real-time operations mode vs. incident analysis mode. GPU acceleration on local workstations reduces latency by another 20-30% compared to remote inference.
Most deployments end up with AnythingLLM as the primary interface (70% of users) and Open WebUI for power users (30%). LibreChat and Msty are situational.
Implementation Considerations
Hardware Sizing
We size infrastructure based on concurrent users and document set size:
- Small deployment (under 20 users, under 1M documents): Single server with RTX 4090, 128GB RAM, 4TB NVMe. Run Ollama, AnythingLLM, and Qdrant on same host. Cost: ~$8K. Inference latency: 2-4 seconds.
- Medium deployment (20-50 users, 1M-10M documents): Two servers. Dedicated Ollama inference node with 2x RTX 4090. Separate application node for AnythingLLM and Qdrant. Cost: ~$18K. Inference latency: 1.5-3 seconds.
- Large deployment (50+ users, 10M+ documents): Three-node cluster. Two inference nodes behind load balancer, dedicated storage node. Cost: ~$35K. Inference latency: 1-2 seconds with 99th percentile under 4 seconds.
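The tiers above reduce to a small lookup. This helper just encodes the thresholds from our field data, with the larger of the two constraints winning; treat the boundaries as starting points, not hard rules:

```python
def sizing_tier(concurrent_users: int, document_count: int) -> str:
    """Map expected load onto the three deployment tiers above."""
    if concurrent_users >= 50 or document_count >= 10_000_000:
        return "large"    # 3-node cluster, ~$35K
    if concurrent_users >= 20 or document_count >= 1_000_000:
        return "medium"   # 2 servers, ~$18K
    return "small"        # single server, ~$8K
```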
These numbers assume you're buying hardware, not renting cloud infrastructure. In air-gapped environments, cloud isn't an option anyway.
Model Update Pipeline
Updating models in air-gapped environments requires process discipline. Our standard approach:
- Maintain an identical staging environment with internet access for testing new models
- Evaluate new models against a benchmark set of 200 operational queries specific to that utility
- If the new model improves accuracy by 15%+ or reduces latency by 25%+, proceed with the update
- Export model weights to encrypted drives
- Physical transport through security (yes, someone literally carries a hard drive)
- Import to production Ollama instance during maintenance window
- A/B test with 10% of users for one week before full rollout
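Two steps of that pipeline are mechanical enough to encode directly: the promotion gate (15%+ relative accuracy gain or 25%+ latency cut) and the deterministic 10% A/B cohort. A sketch with hypothetical function names:

```python
import hashlib

def should_promote(old_acc: float, new_acc: float,
                   old_lat: float, new_lat: float) -> bool:
    """Promotion gate: 15%+ relative accuracy gain OR 25%+ latency cut."""
    acc_gain = (new_acc - old_acc) / old_acc
    lat_gain = (old_lat - new_lat) / old_lat
    return acc_gain >= 0.15 or lat_gain >= 0.25

def in_pilot_cohort(user_id: str, fraction: float = 0.10) -> bool:
    """Deterministic A/B bucket: the same user always lands in the same group."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return digest[0] / 256 < fraction
```

Hashing the user ID instead of sampling randomly matters in practice: an operator who was in the pilot cohort on Monday is still in it on Friday, so feedback stays attributable.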
This process takes 2-3 weeks minimum. You can't do rapid iteration. Choose stable, proven models over bleeding-edge releases.
Document Ingestion Workflow
Getting operational data into the RAG system is harder than running inference. Most utility documentation exists as scanned PDFs, Word documents from the 1990s, or proprietary formats from legacy systems.
Our ingestion pipeline:
- OCR layer: For scanned documents, we run Tesseract OCR in batch mode overnight. Quality varies, but it's good enough for retrieval if you tune chunk overlap.
- Format conversion: AnythingLLM handles most common formats, but we pre-process PowerPoint and Visio files into PDF to improve extraction quality.
- Metadata enrichment: We extract and attach metadata during ingestion—document date, equipment type, facility code, regulatory category. This enables filtered retrieval: "show me only transmission line maintenance procedures from the last five years."
- Incremental updates: New documents get added weekly via automated sync from document management systems. This requires custom scripts since most utilities run SharePoint or similar.
Initial ingestion of 500K-1M documents takes 48-72 hours. After that, incremental updates process in 2-4 hours overnight.
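The filtered retrieval described above depends on consistent metadata at ingestion time. A sketch of a Qdrant-style payload filter; the field names (`regulatory_category`, `doc_year`) are assumptions for illustration, not a fixed schema:

```python
from datetime import date

def procedure_filter(category: str, years_back: int = 5) -> dict:
    """Qdrant-style payload filter: one category, recent documents only."""
    cutoff = date.today().year - years_back
    return {
        "must": [
            {"key": "regulatory_category", "match": {"value": category}},
            {"key": "doc_year", "range": {"gte": cutoff}},
        ]
    }

f = procedure_filter("transmission_maintenance")
```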
Security and Access Control
Even in air-gapped environments, you need defense in depth:
- Network segmentation: LLM infrastructure runs in its own VLAN. Operators access via jump hosts or VDI.
- Authentication: We integrate with existing Active Directory or LDAP. No local accounts.
- Authorization: Workspace-level access control in AnythingLLM. Generation operators don't see transmission documentation and vice versa.
- Audit logging: Every query gets logged with user ID, timestamp, and retrieved documents. Logs export to SIEM. This satisfies NERC CIP-007 requirements.
- Data retention: We retain embeddings and vector indexes indefinitely but purge raw query logs after 90 days unless flagged for incident investigation.
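The audit trail itself is just structured log lines. A minimal sketch of one JSON record per query, carrying the fields named in the list above; your SIEM's expected schema will differ:

```python
import json
from datetime import datetime, timezone

def audit_record(user_id: str, query: str, doc_ids: list) -> str:
    """One JSON line per query, ready for SIEM export."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "query": query,
        "retrieved_docs": doc_ids,
    })

line = audit_record("jdoe", "LINE-7 breaker trip history", ["doc-112", "doc-587"])
```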
The operational security model matters more than the technical implementation. We've seen utilities with perfect technical security undermined by operators sharing credentials or taking screenshots of sensitive responses.
Real-World Trade-offs
The Model Size Dilemma
Bigger models provide better reasoning but kill latency. We consistently see this tension:
- 7B-8B models: Fast (1-3 seconds), good enough for factual retrieval, struggle with complex reasoning
- 13B-14B models: Moderate speed (3-5 seconds), better reasoning, acceptable for most use cases
- 30B-70B models: Slow (8-15 seconds), excellent reasoning, users won't wait
In practice, we deploy 8B models and live with occasional reasoning failures rather than frustrate users with long waits. For the 5-10% of queries that need deeper reasoning, we provide a "detailed analysis" button that explicitly runs a 70B model and warns users about the 10-15 second wait.
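The two-tier routing is deliberately dumb: default to the fast model, and only run the big one on explicit opt-in. A sketch with illustrative model tags:

```python
# Illustrative model tags; latency bands mirror the list above.
MODEL_TIERS = {
    "llama3.1:8b":  {"latency_s": (1, 3),  "role": "default"},
    "llama3.1:70b": {"latency_s": (8, 15), "role": "detailed analysis"},
}

def pick_model(detailed: bool = False) -> str:
    """Default to the fast model; run the 70B path only on explicit opt-in."""
    return "llama3.1:70b" if detailed else "llama3.1:8b"
```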
Embedding Model Quality vs. Speed
Better embedding models improve retrieval accuracy but slow down ingestion:
- nomic-embed-text: Fast, 768 dimensions, good enough for most retrieval
- mxbai-embed-large: 20% slower, 1024 dimensions, 5-8% better retrieval accuracy
- OpenAI text-embedding-3-large: Not available in air-gapped environments
We default to nomic and only switch to mxbai when users report consistent retrieval failures. The marginal accuracy improvement rarely justifies the slower bulk ingestion.
Interface Complexity vs. Adoption
More sophisticated interfaces enable power users but confuse beginners. We've learned to start simple:
- Launch with AnythingLLM's default interface only
- Add Open WebUI after 3-6 months when power users complain about missing features
- Add LibreChat only if specific agent workflows emerge organically
- Never add all options at once—it fragments the user base and multiplies support burden
On-Premise Cost vs. Capability
The air-gapped constraint forces a cost-capability trade-off. For $35K, you get an on-premise system that's 60-70% as capable as GPT-4 via API (if you could use APIs, which you can't). But that $35K is a one-time capital expense, not recurring API costs that would hit $50K-100K/year at utility scale.
The TCO math works if you'll use the system for 2+ years and have 30+ regular users. For smaller deployments or short-term projects, the capital investment might not pencil out. But in energy operations, these systems run for 5-10 years, so the economics favor ownership.
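The break-even arithmetic is worth making explicit. The $35K capex and $50K-100K/yr API figures come from above; the $30K/yr local operating cost (admin time, power, spares) is a hypothetical placeholder you should replace with your own number:

```python
def breakeven_years(capex: float, annual_api_cost: float,
                    annual_opex: float = 0.0) -> float:
    """Years until owned hardware beats the recurring API spend it displaces."""
    return capex / (annual_api_cost - annual_opex)

# Ignoring local operating costs, capex pays for itself in well under a year.
best_case = breakeven_years(35_000, 100_000)         # 0.35 years
# With a hypothetical $30K/yr in local opex against the low API estimate,
# break-even stretches toward the 2-year mark cited above.
realistic = breakeven_years(35_000, 50_000, 30_000)  # 1.75 years
```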
The Verdict
The three-layer architecture—Ollama for inference, AnythingLLM for RAG and document intelligence, plus selective interface layers—is the only pattern we've seen work consistently in air-gapped energy environments. It's not perfect. You sacrifice model quality compared to frontier APIs. Initial setup takes 2-4 weeks. Model updates are painful. But it actually ships, passes audits, and gets used.
Start with a small deployment: single server, 8B model, AnythingLLM interface only. Run it for three months with a pilot group of 10-15 users. Measure actual usage and collect feedback. If you see consistent engagement (daily active users above 60% of pilot group), scale up to medium deployment. If adoption stalls, the problem isn't infrastructure—it's either data quality, user training, or fundamental product-market fit.
We've deployed this pattern seventeen times. Thirteen are still running in production 12-18 months later. Three were decommissioned after pilot phase due to low adoption (data quality issues in two cases, organizational resistance in one). One was replaced by a custom solution when the utility got budget for a dedicated AI team—they kept the same architecture but built proprietary interfaces.
The pattern works because it respects constraints instead of fighting them. Air-gapped operation isn't a limitation to work around—it's the requirement. Open-source models aren't second-best compared to proprietary APIs—they're the only option. Once you accept this reality, the architecture becomes obvious. Build for the constraints you have, not the ones you wish you had.