
LLM Infrastructure for Energy Ops: What Actually Works in 2025

By EthosPower Editorial · March 10, 2026 · 8 min read · Verified Mar 10, 2026

The Problem: Cloud AI Doesn't Work for Energy Operations

We spent eighteen months trying to make cloud LLM APIs work for utility operations. The economics never closed. A single operations team running document queries against maintenance procedures, incident reports, and equipment manuals hit $4,200/month on GPT-4 API calls. Scale that across engineering, compliance, and field ops and you're looking at $60K+ annually for a single site.

Then there's the data sovereignty issue. NERC CIP-011 requires documented information protection for BES Cyber System Information. Sending substation diagrams, protection relay configurations, or operational procedures to OpenAI's API violates that requirement. Full stop. We've sat through three separate compliance audits where cloud AI usage became a finding.

The air-gap problem is worse. We operate generation facilities where the control network has zero external connectivity. No internet, no cloud APIs, no exceptions. If you want AI assistance for operators reviewing alarm floods or engineers troubleshooting protection schemes, it runs locally or it doesn't run.

So we built LLM infrastructure that actually works for energy operations: completely self-hosted, NERC CIP compliant, and performant enough for real-time operational use.

Architecture: Model Serving, Context Management, User Interface

Our production LLM stack has three distinct layers, each solving a specific problem.

Model Serving with Ollama

Ollama handles model deployment and inference. It's essentially Docker for LLMs — you pull models by name, Ollama manages quantization and runtime optimization, and it exposes a simple HTTP API compatible with OpenAI's format.
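As a concrete illustration, here is a minimal client against Ollama's native `/api/generate` endpoint; it assumes Ollama on its default port 11434, and the model tag and prompt are placeholders:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port

def build_chat_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Build the JSON body Ollama expects at /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to a local Ollama node."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(generate("llama3.1:70b", "Summarize the inverse-time overcurrent curve families."))
```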

We run Ollama 0.5.x on dedicated inference nodes: Dell PowerEdge R750 servers with dual NVIDIA L40S GPUs (48GB VRAM each). This configuration handles Llama 3.1 70B at Q4_K_M quantization with 8-12 tokens/sec throughput — fast enough for conversational response times with 15-20 concurrent users.

Model selection matters enormously. We tested seventeen different models over six months. Llama 3.1 70B provides the best balance of reasoning capability and hardware efficiency for our workloads. Mistral Large handles code generation better but requires more VRAM. Qwen 2.5 72B excels at structured data extraction from equipment manuals.

The key insight: you need multiple models. We maintain four active models in production — a large reasoning model (Llama 3.1 70B), a fast response model (Mistral 7B), a code specialist (DeepSeek Coder 33B), and a document extraction model (Qwen 2.5 14B). AnythingLLM routes queries to the appropriate model based on workspace configuration.
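The routing itself lives in AnythingLLM's workspace settings, but the logic amounts to a lookup table. A sketch, where the workspace names are illustrative and the model tags follow Ollama's naming convention:

```python
# Hypothetical workspace-to-model mapping mirroring the four-model split above.
WORKSPACE_MODELS = {
    "protection-engineering": "llama3.1:70b",   # large reasoning model
    "operator-chat": "mistral:7b",              # fast response model
    "scripting": "deepseek-coder:33b",          # code specialist
    "manual-extraction": "qwen2.5:14b",         # document extraction
}

def model_for(workspace: str) -> str:
    """Resolve the Ollama model tag for a workspace, defaulting to the
    large reasoning model for anything unmapped."""
    return WORKSPACE_MODELS.get(workspace, "llama3.1:70b")
```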

Context Management with AnythingLLM

AnythingLLM provides the RAG pipeline and multi-user interface. This is where your LLM infrastructure becomes operationally useful rather than just a tech demo.

We deploy AnythingLLM as a Docker container on the same inference nodes, connecting to Ollama via localhost. The architecture is straightforward: users upload documents to workspaces, AnythingLLM chunks and embeds them using Ollama's embedding models (nomic-embed-text works well), stores vectors in its internal LanceDB, and handles retrieval-augmented generation.
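AnythingLLM handles all of this internally, but the retrieval core is simple enough to sketch: embed the query through Ollama's `/api/embeddings` endpoint, then rank stored chunk vectors by cosine similarity. The endpoint and embedding model are real; everything else here is illustrative:

```python
import json
import math
import urllib.request

def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Fetch an embedding vector from a local Ollama node."""
    req = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def cosine(a: list, b: list) -> float:
    """Cosine similarity: the ranking function behind vector retrieval."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list, doc_vecs: list, k: int = 3) -> list:
    """Return indices of the k stored chunk vectors closest to the query."""
    scored = sorted(enumerate(doc_vecs), key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```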

Our production deployment runs nineteen workspaces; the five largest:

  • Relay protection settings and coordination studies (4,200 documents)
  • Substation maintenance procedures and inspection reports (8,100 documents)
  • NERC compliance standards and audit documentation (1,800 documents)
  • Equipment technical manuals and vendor documentation (12,400 documents)
  • Incident investigation reports and root cause analyses (3,600 documents)

Each workspace uses workspace-specific prompts and agent configurations. The protection engineering workspace has custom instructions about IEEE standards and relay manufacturers. The compliance workspace understands NERC terminology and CIP requirements.

Vector search quality determines everything. We tested multiple embedding models — nomic-embed-text (137M parameters) provides the best retrieval accuracy for technical documents at acceptable inference speed. Chunk size matters: 800 tokens with 200 token overlap works better than the default 1000/100 for equipment manuals with complex technical specifications.
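The 800/200 sliding window is easy to verify offline. AnythingLLM configures this internally; the function below is just an illustration of the windowing arithmetic:

```python
def chunk_tokens(tokens: list, size: int = 800, overlap: int = 200) -> list:
    """Split a token sequence into overlapping windows (800/200 per our tuning).

    Each window shares `overlap` tokens with its successor so specifications
    split across a chunk boundary still appear intact in at least one chunk.
    """
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```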

User Access Layer

AnythingLLM's built-in web interface handles 90% of our use cases. It's React-based, responsive, and supports multi-user authentication with workspace-level permissions. Operations staff access it through their standard workstations.

For power users who want more control, we also deploy Open WebUI as an alternative frontend. It connects to the same Ollama backend but provides more granular control over generation parameters, model selection, and prompt templates. Our protection engineers prefer it for technical work requiring specific model behavior.

LibreChat is deployed in our engineering development environment. It provides agent capabilities and MCP (Model Context Protocol) integration that we're evaluating for automated report generation and data pipeline integration. Not production yet, but promising for workflow automation.

Operational Reality: What Actually Happens in Production

Deployment Pattern

We standardize on three-node clusters for each major facility:

  • Node 1: Primary inference (Ollama + AnythingLLM)
  • Node 2: Secondary inference (hot standby, Ollama only)
  • Node 3: Document processing and vector database (AnythingLLM storage backend)

This architecture survived a primary node failure during a storm event. Operators continued accessing the secondary Ollama instance through Open WebUI while we rebuilt the primary node. Zero operational impact.

Model Management

We update models quarterly, not continuously. Llama 3.1 70B hasn't been replaced since July 2024 because it works. We tested Llama 3.2 90B but the VRAM requirements (96GB minimum for Q4 quantization) don't justify the marginal quality improvement.

Model updates follow our standard change management process: test in development for two weeks, deploy to pilot users for validation, full production rollout. We maintain previous model versions for rollback.

Resource Utilization

Actual GPU utilization averages 35-40% during business hours, spiking to 60-70% during incident response when multiple operators are querying procedures simultaneously. Memory bandwidth is the primary constraint, not compute — the L40S's 864 GB/sec bandwidth handles our workload comfortably.

We sized infrastructure for peak load (30 concurrent users) but typical load is 8-12 users. Better to overprovision inference hardware than have operators waiting 30 seconds for responses during emergency situations.

Cost Structure

Our total infrastructure cost for a three-node cluster supporting 200 users:

  • Hardware: $85K (three servers, GPUs, networking)
  • Power: ~$1,200/month (6.4kW average draw)
  • Maintenance: $12K/year (Dell ProSupport)
  • Engineering time: ~40 hours/quarter for model updates and optimization

Compare to cloud API costs: against $60K/year for 200 users, and with ongoing costs of roughly $26K/year (power plus support), the $85K hardware investment pays for itself in about two and a half years. After that, you're spending ~$26K/year instead of $60K/year.

Security and Compliance

NERC CIP compliance requires:

  • Electronic access logging (AnythingLLM's built-in logging covers this)
  • User authentication (integrated with Active Directory via LDAP)
  • Information protection (data never leaves the facility network)
  • Malicious code prevention (standard server hardening)

The entire stack runs on RHEL 8.9 with SELinux enforcing. Docker containers are scanned weekly for CVEs. We maintain offline installation packages for air-gapped deployments.

What Breaks

Vector database corruption during unclean shutdowns. We now run automated vector DB backups every six hours and maintain checksums. We learned this lesson the hard way during an unexpected power event.
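The backup job is nothing exotic: snapshot the vector store directory, then write a checksum manifest so corruption is detectable before a restore. A sketch, with illustrative paths (schedule it however you like, e.g. cron or a systemd timer):

```python
import hashlib
import shutil
import time
from pathlib import Path

def backup_vector_db(db_dir: str, backup_root: str) -> str:
    """Snapshot a vector DB directory and record a SHA-256 manifest.

    Paths and layout are illustrative; point db_dir at the store you
    actually run (LanceDB in our stack) and verify the manifest on restore.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"vectordb-{stamp}"
    shutil.copytree(db_dir, dest)  # point-in-time copy of the store
    manifest = []
    for f in sorted(dest.rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            manifest.append(f"{digest}  {f.relative_to(dest)}")
    # Written last, so the manifest never lists itself.
    (dest / "SHA256SUMS").write_text("\n".join(manifest) + "\n")
    return str(dest)
```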

Ollama occasionally deadlocks under high concurrent load with large context windows. We implemented connection pooling and request queuing in our nginx reverse proxy to prevent this.
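The relevant nginx directives, as a sketch; port 11434 is Ollama's default, and the zone sizes and limits shown are illustrative rather than our production values:

```nginx
# Rate-limit zone keyed by client address; bursts queue instead of failing.
limit_req_zone $binary_remote_addr zone=ollama_req:10m rate=10r/s;

upstream ollama {
    server 127.0.0.1:11434 max_conns=8;   # cap concurrent inference requests
    keepalive 4;                          # reuse backend connections
}

server {
    listen 8443 ssl;

    location /api/ {
        limit_req zone=ollama_req burst=20;   # queue short bursts
        proxy_pass http://ollama;
        proxy_http_version 1.1;
        proxy_set_header Connection "";       # required for upstream keepalive
        proxy_read_timeout 300s;              # long generations under load
    }
}
```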

Model context window limits cause failures on extremely long documents. We pre-chunk documents over 100K tokens during ingestion rather than letting the RAG pipeline handle it dynamically.

Alternative Approaches We Tested

We evaluated vLLM for model serving. It provides better throughput (15-18 tok/sec vs Ollama's 8-12) but operational complexity is significantly higher. Installation requires building from source with specific CUDA versions, configuration is entirely Python-based, and debugging inference issues requires deep familiarity with the codebase. Ollama's simplicity won for our operational environment.

We tested running models directly through llama.cpp and building our own API layer. This works if you have dedicated ML engineering resources. We don't, and maintaining custom inference infrastructure isn't our core competency.

We looked at Hugging Face Text Generation Inference (TGI). Excellent for large-scale deployments with hundreds of concurrent users, but our workload doesn't justify the operational overhead. TGI requires Kubernetes, sophisticated load balancing, and dedicated DevOps attention.

The Verdict

Ollama plus AnythingLLM is the correct stack for energy sector LLM infrastructure in 2025. Not because it's cutting-edge or technically impressive, but because it works reliably with minimal operational overhead.

Ollama solves model serving completely. Download models by name, they run efficiently on available hardware, the API is OpenAI-compatible so integration is straightforward. We've run Ollama in production for fourteen months with zero critical issues.

AnythingLLM provides everything else: document ingestion, vector search, multi-user workspaces, agent capabilities, and a functional web interface. It's not the most sophisticated RAG framework available, but it handles our operational requirements without requiring a dedicated ML platform team.

For organizations needing more control, Open WebUI or LibreChat provide alternative frontends to the same Ollama backend. We use both in specific contexts — Open WebUI for power users, LibreChat for experimental agent workflows.

The key architectural principle: separate model serving (Ollama) from application logic (AnythingLLM/Open WebUI/LibreChat). This allows you to swap frontends, experiment with different RAG approaches, and maintain consistent inference infrastructure. When Ollama releases better quantization methods or supports new model architectures, every application benefits immediately.
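That separation is visible at the wire level: any OpenAI-style client can talk to Ollama's `/v1/chat/completions` endpoint, which is what makes the frontends interchangeable. A minimal sketch, with placeholder model tag and message:

```python
import json
import urllib.request

def build_payload(model: str, user_msg: str) -> dict:
    """OpenAI-style chat payload; AnythingLLM, Open WebUI, and LibreChat
    all speak this shape to the same backend."""
    return {"model": model, "messages": [{"role": "user", "content": user_msg}]}

def chat(base_url: str, model: str, user_msg: str) -> str:
    """POST to any OpenAI-compatible server (Ollama serves this under /v1)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Same function works against any compatible backend:
# chat("http://localhost:11434", "llama3.1:70b", "Explain zone 3 relay reach.")
```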

If you're deploying LLM infrastructure for energy operations, start here. Run Ollama on dedicated GPU hardware, deploy AnythingLLM for document workspaces, provision for peak concurrent load, and implement proper backup procedures for your vector databases. Everything else is optimization.

Decision Matrix

| Dimension | Ollama + AnythingLLM | vLLM + Custom Frontend | Hugging Face TGI + Kubernetes |
|---|---|---|---|
| Deployment Complexity | Docker deployment, 2hr setup ★★★★★ | Build from source, complex config ★★☆☆☆ | K8s cluster, helm charts, complex ★☆☆☆☆ |
| Inference Throughput | 8-12 tok/sec (Llama 70B) ★★★★☆ | 15-18 tok/sec (Llama 70B) ★★★★★ | 20+ tok/sec with batching ★★★★★ |
| Multi-User Support | Built-in workspaces, LDAP auth ★★★★★ | Requires custom implementation ★★☆☆☆ | Horizontal scaling, load balancing ★★★★★ |
| RAG Pipeline Integration | Native vector DB, auto-chunking ★★★★★ | Manual integration required ★★★☆☆ | Requires separate RAG framework ★★☆☆☆ |
| Operational Maturity | 14 months production, stable ★★★★★ | Excellent but needs ML team ★★★☆☆ | Enterprise-grade, heavy ops ★★★★☆ |
| Best For | Energy utilities needing turnkey LLM infrastructure with NERC compliance | Organizations with dedicated ML engineering teams optimizing for throughput | Large-scale deployments with hundreds of concurrent users and DevOps resources |
| Verdict | The stack that actually works for operational environments with minimal ML expertise required | Superior performance, but operational complexity doesn't justify the throughput gain for our workload | Overkill for energy sector deployments under 50 concurrent users; operational overhead exceeds benefit |

