The Conventional Wisdom
The standard playbook for LLM deployment in 2025 goes like this: use OpenAI or Anthropic APIs, maybe Azure OpenAI if you need enterprise contracts, and call it a day. Your data goes to the cloud, you pay per token, and you get state-of-the-art models without infrastructure headaches.
For energy operations, this advice ranges from impractical to impossible. We operate SCADA systems that can't touch the internet. We have NERC CIP compliance requirements that explicitly prohibit sending operational data to third parties. We run facilities in locations where internet connectivity is measured in kilobits, not gigabits. And we have executive teams who, after watching ransomware take down Colonial Pipeline, are not interested in expanding our attack surface.
The real question isn't whether to run local LLM infrastructure. It's which stack actually works when you deploy it in a substation control room or a refinery operations center.
What We've Actually Deployed
Over the past three years, our team has stood up LLM infrastructure in 50+ energy facilities. These range from municipal utilities with 200 employees to Fortune 500 refineries with complex OT/IT environments. Every deployment taught us something, usually the hard way.
The foundation of nearly every successful deployment has been Ollama. It's not the most feature-rich option, and it's definitely not the most enterprise-ready from a governance perspective. But it does one thing exceptionally well: it runs models reliably on whatever hardware you point it at, with minimal configuration.
We typically deploy Ollama 0.5.x on dedicated inference servers—usually Dell PowerEdge R750 boxes with NVIDIA A2 or A10 GPUs. In smaller facilities, we've run it on workstation hardware with consumer GPUs. The critical insight is that Ollama abstracts away the complexity of model formats, quantization, and GPU memory management. You pull a model with a single command, and it runs. When you're standing in a server room at 2 AM because the utility needs this operational by Monday, that simplicity matters.
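That simplicity extends to the API. Ollama listens on localhost:11434 by default, and a completion is one POST to /api/generate. A minimal sketch using only the standard library; the payload builder is split out so it can be sanity-checked without a running server, and the model tag is just an example:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default bind address


def build_generate_payload(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build a request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": temperature},
    }


def generate(model: str, prompt: str) -> str:
    """Send a completion request to a local Ollama instance."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Everything that sits on top of Ollama in the rest of this stack is, at bottom, making calls like this one.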
The models we deploy most frequently: Llama 3.3 70B at Q4_K_M quantization for general operations, Mistral 7B for edge devices with limited resources, and CodeLlama 34B for our engineering teams working on automation scripts. The 70B models need about 48GB VRAM, which means dual A10s or a single A100. The 7B models run comfortably on 8GB, which opens up a lot of deployment options.
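Those VRAM figures follow roughly from parameter count times bits per weight, plus headroom for the KV cache and runtime overhead. A back-of-the-envelope sizing helper; the overhead factor and the bits-per-weight figure for Q4_K_M are our rules of thumb, not vendor numbers:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_factor: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus headroom for KV cache and runtime."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return round(weight_gb * overhead_factor, 1)


# Q4_K_M averages roughly 4.5-4.8 bits per weight in practice
print(estimate_vram_gb(70, 4.8))  # → 50.4 GB: dual A10s (48 GB) is a tight fit
print(estimate_vram_gb(7, 4.8))   # → 5.0 GB: fits comfortably on an 8 GB card
```

The estimate runs slightly hot for the 70B case, which matches experience: dual A10s work, but long contexts with many concurrent users will push against the ceiling.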
The Interface Layer: Where It Gets Complicated
Ollama gives you inference, but it doesn't give you a user experience. You need something on top. This is where we've made our most expensive mistakes.
Our first 15 deployments used custom web interfaces we built ourselves. This was hubris. We spent six months building authentication, chat history, context management, and prompt templates—features that every other LLM interface also implements. The code worked, but it became a maintenance nightmare. When Ollama updated their API format, we had to patch five different implementations across client sites.
AnythingLLM changed our deployment pattern fundamentally. It's a complete application—desktop or server deployment—that handles document ingestion, RAG, multi-user access, and agent capabilities out of the box. More importantly, it's designed for local deployment from the ground up. The commercial vendors bolted on local model support as an afterthought. AnythingLLM assumes you're running everything on-premises.
In practice, we deploy AnythingLLM 1.6.x as the primary interface for about 70% of our installations. It connects to Ollama for inference, Qdrant for vector storage, and handles all the complexity of document processing. Operations teams can drag in PDF procedures, equipment manuals, or incident reports, and immediately start querying against them. The RAG implementation is solid—we've tested it against 10,000+ page technical manuals for turbine maintenance, and retrieval accuracy consistently beats what we achieved with our custom implementations.
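The RAG pipeline AnythingLLM runs is conceptually simple: chunk the documents, embed the chunks into vectors, and at query time rank chunks by similarity to the query embedding. A stripped-down sketch of that retrieval step, with a toy hash-based embedder standing in for a real embedding model and invented chunk text:

```python
import hashlib
import math


def toy_embed(text: str, dims: int = 32) -> list[float]:
    """Stand-in embedder: hashes words into a fixed-size vector.
    A real deployment calls an embedding model here instead."""
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank document chunks by similarity to the query embedding."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Turbine bearing lubrication interval: every 2000 operating hours.",
    "Fire suppression system test procedure for the control room.",
    "Turbine vibration limits and alarm setpoints.",
]
# Expected to surface the lubrication chunk first
print(retrieve("turbine lubrication schedule", chunks, top_k=1))
```

The top-ranked chunks get stuffed into the model's context window alongside the question; retrieval quality, not model quality, is usually what separates a useful deployment from a frustrating one.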
The limitations: AnythingLLM's agent capabilities are basic compared to what you'd build with LangChain or CrewAI. For complex multi-step workflows, we still write custom code. And the workspace permission model is coarse-grained enough that we sometimes need to run multiple instances for different security zones.
When You Need More Control
Open WebUI has become our go-to for deployments where the primary users are technical staff who want fine-grained control over model parameters. It's a web interface inspired by ChatGPT's UI, but designed specifically for self-hosted scenarios. The killer feature: it supports multiple LLM backends simultaneously, including Ollama, OpenAI-compatible APIs, and direct model integration.
We deployed Open WebUI at a West Coast utility where the grid operations team wanted to experiment with different models for outage prediction. They needed to compare outputs from Llama 3, Mixtral, and a custom fine-tuned model, all in the same conversation thread. Open WebUI made that trivial. The prompt management system lets users create and share templates, which turned out to be critical for standardizing how different shifts query the system.
The Docker deployment is straightforward—we typically run it behind Traefik for HTTPS termination and integrate with the facility's existing authentication system via OpenID Connect. Resource usage is light; the interface itself needs maybe 2GB RAM. All the heavy lifting happens in Ollama.
Where Open WebUI falls short: document processing is minimal compared to AnythingLLM. If your use case is primarily RAG over technical documents, you'll be disappointed. It's built for chat, not document analysis.
The Enterprise Contender
LibreChat entered our evaluation stack about 18 months ago. It's architecturally more complex than the other options—it requires MongoDB, Redis, and a more involved configuration process. But it brings capabilities that matter for larger deployments: granular role-based access control, audit logging, conversation memory management, and integration with external tools via the Model Context Protocol.
We've deployed LibreChat at three sites where compliance requirements demanded detailed audit trails of every LLM interaction. The logging is comprehensive: who asked what, which model answered, what documents were referenced, and what actions were taken. When the compliance team asks "show me every time someone queried for substation access procedures in Q4," you can actually answer that question.
The trade-off is operational complexity. You're running five containers instead of two. The configuration file is 300+ lines of YAML. When something breaks, troubleshooting requires understanding how the message queue integrates with the model router. We don't deploy LibreChat unless the compliance requirements justify that complexity.
The MCP integration deserves specific mention. We've connected LibreChat to facility CMMS systems, weather APIs, and grid status endpoints. An operator can ask "what's the maintenance history for transformer T-47" and get a real-time answer pulled from the asset management database. That level of integration requires custom MCP server development, but the framework is solid.
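To give a feel for what that custom development involves, here is a deliberately simplified sketch of the tool-dispatch pattern at the heart of an MCP server. The real protocol is JSON-RPC over stdio or HTTP with a capability handshake, and the CMMS table, tool name, and transformer ID below are all hypothetical:

```python
import json

# Hypothetical stand-in for a CMMS database query
MAINTENANCE_DB = {
    "T-47": [
        {"date": "2024-03-12", "work": "oil sample analysis"},
        {"date": "2024-09-02", "work": "bushing replacement"},
    ],
}


def tool_maintenance_history(asset_id: str) -> list[dict]:
    """Tool handler: look up maintenance records for one asset."""
    return MAINTENANCE_DB.get(asset_id, [])


TOOLS = {"maintenance_history": tool_maintenance_history}


def handle_tool_call(request_json: str) -> str:
    """Dispatch a tool call the way an MCP server routes requests to handlers."""
    req = json.loads(request_json)
    handler = TOOLS.get(req["tool"])
    if handler is None:
        return json.dumps({"error": f"unknown tool: {req['tool']}"})
    return json.dumps({"result": handler(**req["arguments"])})


reply = handle_tool_call(
    json.dumps({"tool": "maintenance_history", "arguments": {"asset_id": "T-47"}})
)
```

The LLM decides when to call the tool and with what arguments; the server's job is validation, dispatch, and returning structured results the model can reason over.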
The Desktop Dark Horse
Msty surprised us. It's a desktop application—macOS and Windows—that runs local models with GPU acceleration. Our initial assumption was that desktop apps don't belong in industrial environments. We were wrong.
We deployed Msty at a refinery where the process engineers needed LLM access but couldn't be granted network access to centralized inference servers due to OT/IT segmentation policies. Msty runs entirely on their workstations—Dell Precision 5820 towers with RTX 4090 GPUs. Each engineer has their own instance, their own models, and complete data isolation.
The Shadow Personas feature turned out to be more useful than we expected. Engineers created personas for different types of analysis: one tuned for safety procedure review, another for process optimization calculations, a third for technical writing. The personas include custom system prompts and temperature settings. It's basically a user-friendly wrapper around model parameters, but it works.
Msty's MCP integration lets these desktop instances connect to the same external tools as our server deployments. An engineer can query the process historian, equipment databases, or metallurgy analysis systems without leaving the desktop interface.
The limitation is obvious: desktop deployments don't scale. We're not installing Msty on 500 workstations. But for teams of 10-30 technical staff who need powerful local inference and can't access centralized resources, it's the best option we've found.
Resource Requirements: The Real Numbers
The vendor documentation always lowballs resource requirements. Here's what we actually provision:
For Ollama serving a 70B model to 20-30 concurrent users: dual NVIDIA A10 GPUs (24GB each), 128GB system RAM, 1TB NVMe storage. The RAM matters more than vendors admit—model loading, context caching, and concurrent request handling all consume memory. We tried running on 64GB and spent weeks troubleshooting OOM errors.
For AnythingLLM with document processing enabled: 32GB RAM minimum, 64GB comfortable. The document embedding pipeline is memory-hungry when processing large technical manuals. Storage depends on document volume, but budget 500GB minimum for a facility-wide deployment.
For Qdrant vector storage backing these deployments: 16GB RAM per million vectors, NVMe storage for index files. We run Qdrant in a separate container on the same physical host. Disk I/O is the bottleneck for retrieval performance—we've seen query latency drop from 400ms to 80ms just by moving from SATA SSDs to NVMe.
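That provisioning rule is easy to turn into a sizing helper. A sketch using the number above; the 16 GB-per-million-vectors figure is our observed average and will vary with vector dimensionality and payload size:

```python
def qdrant_ram_gb(n_vectors: int, gb_per_million: float = 16.0) -> float:
    """RAM budget for a Qdrant instance, from our observed per-vector footprint."""
    return round(n_vectors / 1_000_000 * gb_per_million, 1)


# A facility-wide document corpus of 2.5M chunks
print(qdrant_ram_gb(2_500_000))  # → 40.0 GB
```

Run the number before you order hardware: a corpus that grows past what RAM can hold is when retrieval latency quietly degrades.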
Network bandwidth is rarely an issue for local deployments. The Ollama API traffic is minimal—you're sending text prompts and receiving text completions. We've run successful deployments over 1Gbps connections with 50+ concurrent users.
What We'd Do Differently
If we started over tomorrow, we'd standardize on fewer components. Our current deployments span five different interface platforms, three different vector databases, and half a dozen different authentication mechanisms. That diversity creates operational drag.
The stack we'd build for new deployments: Ollama for inference, AnythingLLM for general users, Open WebUI for technical teams, Qdrant for vectors, and Authentik for authentication. That covers 90% of use cases with a manageable support burden.
We'd also invest earlier in monitoring. Prometheus metrics from Ollama, custom exporters for queue depth and token throughput, Grafana dashboards showing model utilization and response latency. The first time a utility manager asks why the LLM is slow, you want telemetry, not guesses.
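Ollama's non-streaming responses already carry the raw numbers for throughput telemetry: eval_count (tokens generated) and eval_duration (nanoseconds). A small helper of the kind we feed into a custom exporter:

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput from Ollama's /api/generate response fields.

    eval_count is tokens generated; eval_duration is in nanoseconds.
    """
    if not response.get("eval_duration"):
        return 0.0  # avoid division by zero on empty or failed generations
    return round(response["eval_count"] / (response["eval_duration"] / 1e9), 1)


# Example: 512 tokens generated over 16.0 s of eval time
sample = {"eval_count": 512, "eval_duration": 16_000_000_000}
print(tokens_per_second(sample))  # → 32.0
```

Tracked per model and per hour, this one metric answers most "why is the LLM slow" conversations before they start.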
Configuration management matters more than we initially realized. We now maintain Ansible playbooks for every deployment pattern. Standing up a new AnythingLLM + Ollama + Qdrant stack takes 45 minutes instead of three days. The playbooks also serve as documentation—when an engineer who didn't do the original deployment needs to troubleshoot, they can read the automation code.
The Security Reality
Every CISO asks the same question: "How do we prevent users from putting sensitive data into the LLM?" The honest answer is you can't, not with technical controls alone. If someone has access to the interface, they can paste whatever they want into the prompt box.
What you can control: where the data goes after that. Local LLM infrastructure means prompts never leave your network. Model weights are frozen—there's no training or fine-tuning happening that could leak information into future responses for other users. Conversation history stays in your database, under your access controls.
We implement network segmentation at every deployment. The LLM infrastructure runs in its own VLAN, with firewall rules that prevent it from initiating outbound connections. The only inbound access is HTTPS from authenticated users. Ollama's API is never exposed beyond localhost—all access goes through the interface layer, which handles authentication and authorization.
For facilities with especially sensitive operations, we deploy completely air-gapped stacks. Models are loaded from USB drives, the inference servers have no network interfaces except for the isolated LLM VLAN, and we physically audit the hardware quarterly. This level of isolation is expensive and operationally painful, but it's what facilities with high-impact BES Cyber Systems under NERC CIP sometimes require.
The Verdict
After three years and 50+ deployments, here's what we actually recommend:
For municipal utilities and smaller facilities (under 500 employees): Ollama + AnythingLLM. It's the simplest stack that delivers real capability. One inference server with a mid-range GPU, one application server running the AnythingLLM container, Qdrant for vectors. You can deploy this in a week and have operations teams querying equipment manuals by the end of the month. Budget $25K for hardware, another $15K for deployment labor.
For larger utilities and industrial facilities: Ollama + Open WebUI for technical staff, Ollama + AnythingLLM for operations teams. Run them on separate servers with separate model instances. The technical teams get fine-grained control and multi-model experimentation. Operations teams get robust document processing and RAG. This requires more hardware and coordination, but it matches how these organizations actually work. Budget $60K for hardware, $40K for deployment.
For facilities with complex compliance requirements: Add LibreChat to the mix. The audit logging and access controls justify the operational complexity when you're facing regular NERC CIP audits or SOC 2 compliance. You're now running three different interfaces backed by Ollama, which feels like too many, but each serves a distinct need. Budget $80K for hardware, $60K for deployment and integration.
For small technical teams in segmented OT environments: Msty on individual workstations. It's the only option that works when centralized infrastructure is prohibited by security policy. The per-seat cost is higher—you're buying GPUs for every workstation—but the deployment is simpler. Budget $8K per workstation.
The common thread: Ollama is the inference engine for every scenario we recommend. It's not perfect—the lack of built-in load balancing means we run multiple instances behind HAProxy for high-availability deployments, and the metrics endpoint is basic enough that we augment it with custom exporters. But it's reliable, it supports the models we actually want to run, and it doesn't fight us when we deploy it in the strange environments that energy infrastructure demands.
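For reference, the load-balancing layer is nothing exotic. A minimal sketch of the HAProxy approach, with the backend addresses as placeholders and Ollama's /api/tags endpoint used as a cheap health check:

```
frontend ollama_front
    mode http
    bind *:11434
    default_backend ollama_pool

backend ollama_pool
    mode http
    balance leastconn
    option httpchk GET /api/tags
    server ollama1 10.0.40.11:11434 check
    server ollama2 10.0.40.12:11434 check
```

We prefer leastconn over round-robin here because generation requests vary wildly in duration; the interface layer points at the frontend and never knows how many Ollama instances sit behind it.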
We don't deploy cloud-based LLM services in operational environments anymore. The compliance risk isn't worth the convenience. We don't build custom interfaces from scratch anymore either. The open-source ecosystem has matured enough that building your own is almost always a mistake.
What we do deploy: boring, reliable, local infrastructure that runs the same models you'd get from commercial APIs, with complete data sovereignty and none of the recurring costs. After 50 deployments, that's the pattern that survives contact with reality.