
Ollama vs AnythingLLM vs Open WebUI: Which LLM Stack for Energy Operations

By EthosPower Editorial · March 16, 2026 · 9 min read · Verified Mar 16, 2026

The Infrastructure Problem Nobody Talks About

We've spent the last eighteen months deploying local LLM infrastructure across six utilities and two midstream operators. The question isn't whether you need private LLM capability anymore—it's which stack actually works when you're behind an air gap with 40,000 legacy SCADA documents and compliance auditors watching your every API call.

The energy sector has a unique problem: we can't just pipe operational data to OpenAI or Claude. NERC CIP-002 through CIP-011 make that a non-starter for anything touching bulk electric system operations. We need inference running on-premises, document embedding that never leaves our network perimeter, and audit trails that satisfy both IT security and OT operations teams.

Three platforms dominate our deployment conversations: Ollama for model serving, AnythingLLM for document RAG, and Open WebUI for user interfaces. We've deployed each in production. Here's what we've learned.

Ollama: The Model Serving Layer Everyone Needs

Ollama solved the model distribution problem. Before Ollama, getting Llama 2 or Mistral running meant downloading sketchy GGUF files, figuring out llama.cpp compilation flags, and writing wrapper scripts. Ollama packages everything into a single binary with a sensible API.

We run Ollama 0.5.1 on Ubuntu 22.04 LTS servers with NVIDIA A4000 GPUs in our OT networks. Installation is genuinely simple: one curl command, and you've got a model server. The ollama pull command downloads models from their library—100+ options including Llama 3.3 70B, Mistral Large, CodeLlama, and embedding models like nomic-embed-text.

The API is OpenAI-compatible, which matters more than it sounds. Every tool that knows how to talk to OpenAI can talk to Ollama with a URL change. That compatibility eliminated weeks of integration work when we connected it to our n8n workflows and ERPNext customizations.
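Because the API mirrors OpenAI's, pointing an existing client at Ollama is mostly a base-URL change. A minimal sketch, assuming Ollama's default port 11434 and an illustrative model tag (the request is only constructed here, not sent):

```python
def ollama_chat_request(model, messages, base_url="http://localhost:11434/v1"):
    """Build an OpenAI-style chat completion request for an Ollama server.

    Ollama listens on port 11434 by default; the /v1 path exposes its
    OpenAI-compatible API, so any OpenAI client works with a URL change.
    """
    return {
        "url": f"{base_url}/chat/completions",
        "json": {
            "model": model,
            "messages": messages,
            "temperature": 0.2,  # low temperature suits technical Q&A
        },
    }

req = ollama_chat_request(
    "llama3.1:8b",  # illustrative model tag
    [{"role": "user", "content": "Classify this work order: replace relay R-204."}],
)
# req["url"] → "http://localhost:11434/v1/chat/completions"
```

Swapping in a different serving backend later means changing only `base_url`, which is exactly why the compatibility saves integration time.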

Performance is excellent for our workloads. Llama 3.1 8B gives us 35 tokens/second on a single A4000, which is fast enough for technical documentation queries and work order classification. The 70B models drop to 8-12 tokens/second but provide notably better reasoning for complex maintenance procedures and root cause analysis.

The weakness: Ollama is just model serving. It doesn't do document chunking, embedding management, or conversational memory. It serves one purpose—inference—and does it well. You need to build everything else yourself or add another layer.

AnythingLLM: Document RAG Without the PhD

AnythingLLM is what we deploy when utilities need "ChatGPT for our documents" but can't send anything to OpenAI. It's a full-stack RAG platform: document ingestion, chunking, embedding, vector storage, conversational UI, and multi-user workspaces in one Docker compose stack.

We typically run AnythingLLM 1.6.x connected to Ollama for inference and its built-in LanceDB for vector storage. The architecture is clean: documents go in, get chunked into 1000-token segments with 200-token overlap, embedded with nomic-embed-text, stored in LanceDB, and retrieved based on cosine similarity during queries.
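The chunk-and-retrieve scheme described above is simple to sketch. This is an illustration of 1000-token windows with 200-token overlap and the cosine similarity used at query time, not AnythingLLM's actual code:

```python
import math

def chunk_tokens(tokens, size=1000, overlap=200):
    """Split a token list into fixed-size windows with overlap,
    mirroring the 1000-token / 200-token-overlap scheme."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def cosine(a, b):
    """Cosine similarity, the retrieval metric over stored embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

chunks = chunk_tokens(list(range(2400)))
# → 3 chunks; chunk 2 starts at token 800, sharing 200 tokens with chunk 1
```

The overlap is what keeps a procedure step that straddles a chunk boundary retrievable from either side.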

The document ingestion is genuinely good. We've fed it PDF technical manuals, Word procedures, Excel asset lists, and plaintext engineering notes. It handles them all without the preprocessing nightmares we've hit with custom scripts. The OCR pass for scanned PDFs works well enough—not perfect, but 90% accuracy beats manually transcribing 30-year-old relay manuals.

Workspaces are the killer feature for energy operations. We create separate workspaces for each substation, each plant, each compliance domain. Protection engineers get a workspace with relay manuals and coordination studies. Compliance teams get NERC CIP standards and internal procedures. No data bleeds between workspaces, which satisfies our information barriers.

The embedded mode works offline. We've deployed it on laptops for field engineers who need LLM access at remote substations with no connectivity. The entire stack—documents, embeddings, model inference—runs locally. That's rare in the RAG world.

Performance: query latency is 2-4 seconds for most questions, which includes embedding the query, vector search, context assembly, and LLM inference. Acceptable for technical Q&A, too slow for real-time operational use. Document ingestion takes about 30 seconds per 100-page PDF on our hardware.

The limitation: AnythingLLM is opinionated. You can't easily swap vector databases or customize chunking strategies without forking the code. The admin interface is functional but not elegant. Multi-tenancy exists but isn't enterprise-grade—no SSO, no granular RBAC, no audit logging that satisfies compliance teams.

Open WebUI: The Interface Layer

Open WebUI (formerly Ollama WebUI) is a web frontend for Ollama that looks and feels like ChatGPT. It's what we deploy when users want a familiar chat interface for local models without the RAG complexity of AnythingLLM.

The installation is straightforward—Docker container, point it at your Ollama endpoint, add users. The interface is polished: chat history, model switching, system prompts, temperature controls. Users who've spent time in ChatGPT feel immediately comfortable.

Open WebUI added document RAG in recent versions, but it's basic compared to AnythingLLM. You can upload files per conversation, and it will chunk and embed them, but there's no persistent document library or workspace isolation. It's "chat with this PDF right now," not "build a knowledge base."

The strength is multi-model management. We run eight different models on our Ollama instances—different sizes, different specializations. Open WebUI lets users switch between them mid-conversation. That flexibility matters when you're comparing Llama 3.3 70B analysis against Mistral Large or testing whether CodeLlama better understands IEC 61850 configurations.

We've deployed Open WebUI for engineering teams who need ad-hoc LLM access without the structure of workspaces. They ask one-off questions, paste log files for analysis, generate Python scripts for data processing. It's the command-line interface of LLM tools—fast, flexible, minimal ceremony.

Performance matches Ollama since it's just a frontend; the web layer adds under 300ms of latency. The limitation is the lack of persistence and structure: every conversation is isolated, no organizational knowledge accumulates, and there's no compliance-friendly audit trail.

The Architecture Question: Stack or Standalone?

We deploy these tools in two patterns, and the choice depends on your operational maturity and use cases.

Pattern 1: Ollama + AnythingLLM is our standard for utilities with defined RAG requirements. AnythingLLM handles everything: ingestion, storage, retrieval, UI. Ollama sits behind it providing inference. This is the "turnkey private ChatGPT" stack. Setup time is 4-6 hours. Users get workspaces, document chat, and reasonable performance. We deploy this when compliance teams need document Q&A, when engineers need procedure lookups, when trainers need interactive manuals.

Pattern 2: Ollama + Open WebUI is for technical teams who want model access without RAG structure. Engineers use it for code generation, log analysis, quick questions. Setup time is 90 minutes. This is lighter, faster, more flexible. We deploy this when the primary need is LLM inference, not document retrieval. It's particularly good for R&D groups and advanced users who understand prompt engineering.

We rarely deploy AnythingLLM and Open WebUI together—they solve overlapping problems and confuse users. Pick one based on whether you need structured RAG or flexible chat.

Performance and Resource Reality

Here's what these stacks actually consume in our deployments:

Ollama with Llama 3.1 8B: 8GB GPU VRAM, 16GB system RAM, 6GB disk per model. Throughput of 35 tokens/second on NVIDIA A4000, 22 tokens/second on RTX 4070, 8 tokens/second on CPU (32-core Xeon). CPU inference is viable for batch processing but too slow for interactive use.

Ollama with Llama 3.3 70B: 48GB GPU VRAM (requires A6000 or multi-GPU setup), 64GB system RAM, 40GB disk. Throughput of 8-12 tokens/second. We only deploy this for critical analysis workloads where accuracy justifies the hardware cost.

AnythingLLM stack: 4GB RAM for the application, plus Ollama requirements, plus 2-3GB per million tokens of embedded documents. A workspace with 5000 documents (roughly 50 million tokens) needs about 16GB just for LanceDB. Budget 32GB RAM minimum for production.

Open WebUI: 2GB RAM, negligible CPU. It's just a web interface.

The hidden cost is storage. Each model is 4-40GB. We typically maintain 6-8 models per site to give users options. That's 100-200GB just for model weights. Add another 50-100GB for AnythingLLM document storage and embeddings.
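A quick sizing helper makes the disk math concrete. The model sizes below are illustrative placeholders spanning the 4-40GB range, not measured values from any particular site:

```python
def storage_budget_gb(model_sizes_gb, doc_store_gb):
    """Sum per-site disk needs: model weights plus RAG document storage."""
    return sum(model_sizes_gb) + doc_store_gb

# Eight models (sizes in GB, illustrative) plus 75 GB of documents
# and embeddings -- a mid-range estimate for one site.
site = storage_budget_gb([4.7, 4.7, 6, 6, 13, 26, 40, 40], 75)
# site ≈ 215 GB, squarely inside the 150-300 GB total suggested above
```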

NERC CIP and Compliance Considerations

We've taken these stacks through CIP-005, CIP-007, and CIP-010 audits. Here's what compliance teams care about:

Network isolation: All three tools run entirely on-premises. No outbound calls, no telemetry, no model updates without explicit action. That satisfies ESP perimeter requirements. We deploy them inside the same network zones as SCADA historians and EMS systems.

Access control: AnythingLLM has built-in user authentication but no integration with Active Directory or LDAP. We front it with an nginx reverse proxy doing SAML authentication. Open WebUI has similar limitations. Both need external auth layers for CIP-007 compliance.

Audit logging: Neither tool provides compliant audit logs out of the box. We capture all HTTP traffic at the reverse proxy level and ship logs to our SIEM. That gives us the "who accessed what when" trail auditors demand.
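Turning those proxy logs into audit records is a few lines of parsing. A sketch assuming nginx's standard combined log format with an authenticated remote user; the field positions will differ if your proxy uses a custom log_format:

```python
import re

# Matches nginx "combined" format: ip, ident, user, timestamp,
# request line, status. Adjust if your log_format differs.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+)'
)

def audit_record(line):
    """Extract the who/what/when fields auditors ask for from one log line."""
    m = LOG_RE.match(line)
    if not m:
        return None
    return {k: m.group(k) for k in ("user", "ts", "method", "path", "status")}

rec = audit_record(
    '10.2.0.7 - jsmith [16/Mar/2026:09:14:02 +0000] "POST /api/chat HTTP/1.1" 200 512'
)
```

Ship the resulting records to the SIEM as structured events rather than raw lines and the "who accessed what when" queries become trivial.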

Patch management: Ollama updates frequently—new model support, performance improvements. AnythingLLM and Open WebUI update monthly. We test updates in dev environments and deploy quarterly in production. The CIP-010 change management overhead is real.

Data retention: AnythingLLM stores all documents and conversations in its database. We've had to build retention policies and cleanup scripts because compliance requires we delete documents after their retention period expires. That's not built into the tool.
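A retention sweep can run outside the tool entirely. Here is a minimal sketch that flags files past their retention window by modification time; the directory layout and the example window are assumptions for illustration, not AnythingLLM internals:

```python
import os
import time

def expired_files(root, retention_days, now=None):
    """Return files under root whose mtime is past the retention window.

    AnythingLLM doesn't enforce retention itself, so we sweep its
    document store externally; the path layout here is illustrative.
    """
    now = now if now is not None else time.time()
    cutoff = now - retention_days * 86400
    stale = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale
```

In practice we log every path returned before deleting, so the cleanup itself leaves an audit trail.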

The biggest compliance gap: none of these tools have configuration management databases or asset tracking integration. We maintain separate documentation proving what's deployed where, which satisfies CIP-002 and CIP-010 but adds manual overhead.

The Verdict

Deploy Ollama everywhere. It's the foundation layer. Every LLM infrastructure stack we build starts with Ollama. The model library is comprehensive, the API is standard, the performance is solid. Install it on every GPU-equipped server in your OT environment.

Choose AnythingLLM when document RAG is the primary use case. If you need to give 50 engineers ChatGPT-style access to 20,000 technical documents, AnythingLLM is the fastest path to production. The workspace isolation works, the ingestion pipeline handles real-world document chaos, and users grasp it immediately. Budget 40 hours for deployment including hardware procurement, installation, document ingestion, and user training.

Choose Open WebUI when you need flexible model access for technical users. If your audience is engineers and developers who will craft their own prompts and want to experiment with different models, Open WebUI provides that with minimal friction. It's also better for multi-model comparison and ad-hoc analysis tasks.

Don't try to build a custom stack. We've watched three utilities waste 6-12 months building bespoke RAG pipelines with LangChain and custom UIs. They ended up with fragile systems that broke on every model update and required Python expertise to maintain. Ollama and AnythingLLM are production-ready now. Use them.

The energy sector's LLM future is local inference behind the network perimeter. These three tools make that practical today. We've deployed them in environments where bringing in a cloud API would trigger a compliance incident. They work. Start here.

Decision Matrix

Dimension        | Ollama + AnythingLLM | Ollama + Open WebUI | Ollama Standalone
Deployment Time  | 4-6 hours ★★★★☆ | 90 minutes ★★★★★ | 30 minutes ★★★★★
Document RAG     | Full RAG stack ★★★★★ | Per-chat only ★★☆☆☆ | API only ★☆☆☆☆
Multi-User       | Workspaces ★★★★☆ | Basic auth ★★★☆☆ | None ★☆☆☆☆
Compliance Ready | Needs proxy ★★★☆☆ | Needs proxy ★★★☆☆ | API logs ★★★★☆
Resource Usage   | 32GB+ RAM ★★★☆☆ | 16GB RAM ★★★★☆ | 16GB RAM ★★★★★
Best For         | Structured document Q&A for 10-200 users | Flexible LLM access for technical power users | Programmatic integration with existing applications
Verdict          | Best turnkey solution when RAG over technical documentation is the primary requirement. | Fastest to production for teams who need model experimentation and ad-hoc inference. | Foundation layer for custom tools, n8n workflows, and ERPNext extensions.

