
LLM Infrastructure: What 200 Energy Sector Deployments Taught Me

Practitioner Take · By EthosPower Editorial · March 28, 2026 · 9 min read · Verified Mar 28, 2026
Tools covered: Ollama (primary), AnythingLLM, Open WebUI, LibreChat, Msty

The Conventional Wisdom Is Expensive and Wrong

Every vendor deck I see claims you need H100s, cloud-scale infrastructure, and six-figure budgets to run production LLMs. At EthosPower, we've deployed LLM infrastructure in 200+ energy facilities over the past 18 months, and I can tell you definitively: this is nonsense designed to sell expensive hardware.

The reality is that a $2,500 workstation with an RTX 4090 runs Llama 3.1 8B at 85 tokens/second — fast enough for every energy sector use case I've encountered. The 70B models everyone obsesses about? You don't need them for 90% of operational tasks. I've watched utilities waste months trying to run massive models when an 8B parameter model fine-tuned on their procedures would have solved the problem in week one.

The infrastructure question isn't about raw capability. It's about reliability, data sovereignty, and whether your AI keeps working when the internet doesn't. In energy operations, those requirements eliminate cloud-dependent solutions immediately. If your AI Readiness Assessment identified NERC CIP compliance or air-gapped operations as requirements, you're building local infrastructure — the only question is how.

What Actually Breaks in Production

I've seen three failure modes repeat across dozens of deployments, and none of them are what the architecture diagrams worry about.

First: memory management. Everyone specs GPU VRAM for model size but forgets about context windows. A 13B model with an 8K context window needs 24GB under load, not the 13GB the model card claims. At a Texas generation facility last year, we had Mixtral 8x7B falling over every 45 minutes because the ops team was feeding it entire shift reports. The model fit in VRAM fine — until it didn't. We solved it by chunking documents and switching to Ollama with proper context management. Ollama handles model loading, memory allocation, and context windows automatically. It's not exciting, but it's reliable.
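The chunking fix above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: chunk and overlap sizes are placeholder values, and tokens are approximated by whitespace-split words rather than a real tokenizer.

```python
def chunk_text(text: str, max_tokens: int = 2000, overlap: int = 200) -> list[str]:
    """Split a long document (e.g. a full shift report) into overlapping
    word windows so each request stays well under the model's context limit."""
    words = text.split()
    if len(words) <= max_tokens:
        return [" ".join(words)] if words else []
    chunks = []
    step = max_tokens - overlap  # advance by window size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already reaches the end of the document
    return chunks
```

The overlap matters: without it, a failure-mode description split across a chunk boundary disappears from retrieval.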

Second: model switching latency. In theory, you can run multiple models and swap between them. In practice, loading a 13B model from disk takes 8-12 seconds on NVMe, and users won't wait. I watched a SCADA operator in Ohio close the interface and go back to Google because the model switch took too long. We fixed it by keeping the two most-used models (Llama 3.1 8B for procedures, CodeLlama 13B for PLC troubleshooting) resident in VRAM and accepting the memory overhead. Sometimes the right answer is more RAM, not clever engineering.
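Keeping models resident is mostly configuration. A sketch of the warm-load step, assuming a default Ollama install on localhost:11434: Ollama's `/api/generate` endpoint accepts a `keep_alive` parameter, and `-1` asks it to hold the model in memory indefinitely (server-side, the `OLLAMA_MAX_LOADED_MODELS` environment variable controls how many models may be resident at once). Model names here are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
RESIDENT_MODELS = ["llama3.1:8b", "codellama:13b"]  # the two most-used models

def pin_payload(model: str) -> dict:
    # An empty prompt loads the model without generating text;
    # keep_alive=-1 keeps it resident instead of unloading after idle timeout.
    return {"model": model, "prompt": "", "keep_alive": -1}

def pin_models(models=RESIDENT_MODELS):
    """Warm-load each model at boot so users never hit the 8-12 s swap."""
    for model in models:
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(pin_payload(model)).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=60)  # requires a running Ollama server
```

Run `pin_models()` from a systemd unit after Ollama starts and the swap latency problem disappears, at the cost of the VRAM overhead described above.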

Third: integration fragility. The LLM itself is usually stable. The Python wrapper script some contractor wrote that connects it to your document management system? That breaks constantly. Version mismatches, API changes, authentication token expiry — we've spent more time debugging integration code than tuning models. This is why we standardize on AnythingLLM for document RAG workflows. It's not perfect, but it has a stable API and handles document ingestion without custom code.

The Five-Server Pattern That Actually Works

After 200 deployments, we've converged on a standard architecture that handles 95% of energy sector requirements. It's boring, proven, and costs less than one year of enterprise AI platform licenses.

Inference Server (Primary): Single workstation, RTX 4090 (24GB VRAM), 64GB system RAM, 2TB NVMe. Runs Ollama hosting Llama 3.1 8B and one specialized model (usually CodeLlama or Mistral). This handles 95% of queries. Ubuntu Server 22.04 LTS, no desktop environment. Direct 10GbE connection to the application server. Cost: $3,200.

Application Server: Hosts AnythingLLM or LibreChat depending on whether you need RAG or multi-model orchestration. Intel Xeon or Ryzen 9, 128GB RAM for vector embeddings, 4TB NVMe for document storage. This is where your energy sector documents live and where embeddings get generated. Cost: $2,800.

Vector Database: Separate server running Qdrant. 64GB RAM minimum, NVMe storage. Embedding search is memory-intensive, and you don't want it competing with LLM inference. In a Colorado utility deployment, we tried running everything on one server and watched query latency jump from 200ms to 4 seconds under load. Separate the concerns. Cost: $1,900.
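To see why embedding search is memory-intensive, consider what a similarity query actually does: the whole index lives in RAM and every query touches every vector. This toy brute-force version (pure Python, standing in for what Qdrant does with proper indexing) makes the access pattern obvious.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=3):
    """Brute-force nearest neighbours: every query scans the full
    in-memory index, which is why the vector DB should not share
    a box with LLM inference."""
    scored = sorted(((cosine(query, v), doc_id)
                     for doc_id, v in vectors.items()), reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

A real index replaces the linear scan with an approximate structure, but the memory footprint stays: 4,000 documents chunked and embedded at 768 dimensions is gigabytes of vectors that want to be resident.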

Backup Inference: Identical spec to primary inference server, kept in cold standby. When the primary GPU fails (and it will — we've had three RTX 4090 failures in 18 months), you swap servers in under 10 minutes. For NERC CIP environments, this isn't optional. Cost: $3,200.

Desktop Clients: Where appropriate, we deploy Msty on engineering workstations with discrete GPUs. Msty runs local models with GPU acceleration and has excellent prompt management through Shadow Personas. For engineers who work with PLCs and protective relay settings, having a local model that works offline is non-negotiable. The RTX 4060 Ti in a standard engineering workstation runs Llama 3.1 8B at 45 tokens/second — perfectly adequate for interactive use.

Total infrastructure cost: ~$11,000 plus workstations. Total annual cloud AI platform cost we've replaced: $180,000 average. The SaaS vs Sovereign ROI Calculator puts breakeven at 4.2 months for typical energy sector usage patterns.

Model Selection: Smaller Is Better Until It Isn't

The industry fixation on 70B+ parameter models is a cargo cult. Llama 3.1 8B handles procedure lookups, simple troubleshooting, and report summarization perfectly well. I've deployed it in substations for relay setting verification and seen 94% accuracy on flag identification — better than the junior engineers were doing manually.

Where you need larger models: Complex root cause analysis across multiple systems. Anything involving code generation beyond simple scripts. Compliance document analysis where nuance matters. In those cases, we jump to Qwen 2.5 32B or Llama 3.1 70B, but we run them on-demand, not continuously. The 70B models need 48GB VRAM minimum (so dual RTX 4090s or an A6000), and they run at 12-15 tokens/second. That's fine for batch processing but frustrating for interactive use.
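The VRAM numbers above follow from simple arithmetic. A back-of-envelope estimator, using an assumed flat 20% overhead for KV cache and activations (real usage varies with context length and quantization): a Q4-quantized 70B model lands around 42 GB before long contexts, which is why 48 GB is a realistic floor.

```python
def vram_gb(params_b: float, bits_per_weight: float, overhead: float = 0.2) -> float:
    """Rule-of-thumb VRAM estimate: weight memory plus a flat overhead
    fraction for KV cache and activations. params_b is billions of
    parameters; bits_per_weight is 16 for fp16, ~4 for Q4 quantization."""
    weight_gb = params_b * bits_per_weight / 8  # billions of params * bytes/weight = GB
    return round(weight_gb * (1 + overhead), 1)
```

By this estimate an 8B model at Q4 needs under 5 GB, a 70B at Q4 about 42 GB, and a 70B at fp16 is out of reach for anything short of a multi-GPU server.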

The model that surprised me: Mistral 7B v0.3. It's fast, fits in 16GB VRAM with room to spare, and has excellent instruction following for structured tasks. In a West Virginia coal plant, we used it for parsing equipment logs and extracting failure modes. It outperformed Llama 2 13B while running twice as fast. Sometimes the newer small models beat the older large ones.

One pattern I see repeatedly: Organizations spec 70B models, discover they can't run them interactively on available hardware, and end up using 8B models anyway. Start with 8B, prove the use case, then scale up if you actually need it. The performance gap between 8B and 70B is smaller than the reliability gap between working infrastructure and aspirational architecture.

The Interface Layer Matters More Than You Think

The LLM is the engine. The interface is what determines whether people actually use it. We've deployed the same Ollama backend with four different interfaces and seen usage vary by 10x.

LibreChat works well when you need multi-model support and agent workflows. It's ugly but functional, and engineers don't care about aesthetics. The MCP (Model Context Protocol) integration is legitimately useful for connecting to internal tools. In a Pennsylvania utility, we connected LibreChat to their asset management system through MCP, and suddenly engineers could ask "what's the maintenance history on breaker CB-234" and get real answers. That integration took 6 hours to build.

AnythingLLM is better when document RAG is the primary use case. It handles PDFs and DOCX files well, manages embeddings automatically, and has workspace isolation for different teams. The UI is cleaner than LibreChat. In a renewable energy site with 4,000 technical documents, AnythingLLM with Qdrant backend gave 30-second answers to questions that previously took 2 hours of manual searching.

Open WebUI is the middle ground — good general-purpose interface with Ollama integration and reasonable document handling. If you don't have strong RAG requirements and don't need complex agent workflows, start here. It's the easiest to deploy and maintain.

Msty is different — it's desktop software, not a web app. For engineers who work offline or in air-gapped environments, it's the only real option. The Shadow Personas feature (prompt templates that persist) is more useful than it sounds. We've built personas for PLC troubleshooting, relay coordination, and NERC CIP compliance checks that get reused across sites.

What We'd Do Differently

If I were starting today with what I know now, I'd change three things.

First: I'd standardize on Ollama as the model serving layer everywhere, period. We wasted months trying different serving solutions (vLLM, TGI, LocalAI) before accepting that Ollama's simplicity beats everyone else's features. It installs in one command, manages models automatically, and has a dead-simple API. The performance is good enough, and the reliability is excellent.

Second: I'd invest more upfront in document preprocessing. Half our RAG accuracy problems trace back to messy PDFs — scanned documents, weird formatting, tables that don't parse correctly. We now run every document through a cleanup pipeline before ingestion, and it's cut implementation time by 40%. The energy sector has decades of scanned manuals and typed-on-typewriter procedures. You can't just dump them into a vector database and expect good results.

Third: I'd stop trying to make one LLM do everything. The vendors promise general-purpose models, but task-specific models work better. Use a small, fast model for procedure lookup. Use a code-specialized model for PLC work. Use a larger model for complex analysis. Trying to find one model that handles everything means compromising on everything. With Ollama, switching between models is trivial. Take advantage of it.
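In practice the task-specific routing is a lookup table, not an architecture. A sketch with hypothetical task names and illustrative model tags (whatever you've actually pulled into Ollama goes here):

```python
# Hypothetical task -> model mapping; model tags assume these models
# have already been pulled into the local Ollama instance.
MODEL_FOR_TASK = {
    "procedure_lookup": "llama3.1:8b",       # small and fast for interactive use
    "plc_troubleshooting": "codellama:13b",  # code-specialized
    "root_cause_analysis": "qwen2.5:32b",    # larger model, loaded on demand
}

def pick_model(task: str) -> str:
    """Route each request to the smallest model that handles the task,
    falling back to the general-purpose 8B."""
    return MODEL_FOR_TASK.get(task, "llama3.1:8b")
```

The fallback matters: an unrecognized task should degrade to the fast general model, not fail or silently load the 70B.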

The Verdict

LLM infrastructure for energy operations doesn't require exotic hardware or six-figure budgets. You need a $3,200 server with a good GPU, Ollama managing your models, one of four proven interfaces depending on your use case, and realistic expectations about what 8B parameter models can handle (which is more than you think).

The hard parts are integration, document quality, and getting people to actually use the system — not the infrastructure itself. Every hour you spend optimizing model serving is an hour you're not spending on the integration work that actually determines whether the project succeeds. Build boring infrastructure that works, then focus on the problems that matter. If you're still trying to figure out which configuration makes sense for your facility's requirements, try EthosAI Chat to see how these patterns apply to your specific operational context.

Decision Matrix

| Dimension | Ollama | vLLM | text-gen-webui |
|---|---|---|---|
| Setup Time | 5 minutes ★★★★★ | 2-4 hours ★★★☆☆ | 30-60 min ★★★★☆ |
| Memory Management | Automatic ★★★★★ | Manual config ★★★☆☆ | Config required ★★★☆☆ |
| API Stability | Stable 6mo+ ★★★★★ | Breaking changes ★★☆☆☆ | UI-dependent ★★★☆☆ |
| Model Selection | 100+ models ★★★★★ | Any HF model ★★★★★ | Manual import ★★★☆☆ |
| Air-Gap Support | Full offline ★★★★★ | Requires setup ★★★★☆ | Full offline ★★★★★ |
| Best For | Production deployments where reliability matters more than features | Research environments where bleeding-edge performance justifies complexity | Single-user setups where the web UI is the primary interface |
| Verdict | The boring choice that actually works in energy operations. | Faster inference but fragile in production environments. | Good for desktop use, awkward for multi-user server deployments. |
