The Architecture Diagrams Everyone Shows You
Every AI vendor walks in with the same slide deck. There's a neat layered architecture: data ingestion at the bottom, vector embeddings in the middle, LLM inference at the top, maybe some microservices scattered around for good measure. Everything talks to everything else through REST APIs. The cloud provider logos sit prominently in the corner. It all looks clean, scalable, and enterprise-ready.
Then you mention your SCADA network is air-gapped, your historians run on Windows Server 2012, and your compliance team just handed you a 47-page NERC CIP-013 assessment questionnaire. The architecture diagram suddenly needs some revisions.
We've deployed AI infrastructure across eleven power utilities, four refineries, and two independent grid operators over the past three years. The conventional architectures work fine for SaaS companies processing customer support tickets. They fall apart spectacularly when you're parsing real-time telemetry from 40-year-old RTUs while ensuring every data access gets logged for regulatory audit.
Here's what we've learned actually works in energy operations.
The Data Reality Nobody Talks About
The textbook says ingest everything into a data lake, run ETL pipelines, feed clean normalized data to your AI models. In practice, your data sources look like this:
- Historian databases (OSIsoft PI, Wonderware, GE Proficy) with proprietary protocols and licensing restrictions that make bulk export expensive or contractually problematic
- SCADA systems that communicate over Modbus, DNP3, or IEC 61850 with millisecond timing requirements you cannot disrupt
- Maintenance management systems (Maximo, SAP PM) that export to CSV files dropped on network shares at 2 AM
- Engineering drawings in AutoCAD, P&IDs in PDF, equipment manuals scanned from paper decades ago
- Tribal knowledge in email threads, Word documents, and the heads of engineers three years from retirement
You cannot just point an ingestion pipeline at this and expect useful results. We tried. The first deployment attempt at an 800 MW combined cycle plant involved writing twelve custom connectors, three of which required vendor cooperation we never got.
What actually works: Build your architecture around two separate data domains with an explicit boundary between them.
The operational domain handles real-time data. This stays close to the source systems, respects OT network segmentation, and prioritizes availability over everything else. You're not running heavy inference workloads here. You're doing anomaly detection, pattern matching, and time-series forecasting with models small enough to run on edge compute.
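A model "small enough to run on edge compute" can be as simple as rolling statistics over a telemetry tag. A minimal sketch in pure Python (the window size, threshold, and sample values are illustrative, not settings from any of our deployments):

```python
from collections import deque
from math import sqrt

class RollingZScoreDetector:
    """Flag samples that deviate sharply from a rolling baseline."""

    def __init__(self, window: int = 30, threshold: float = 4.0):
        self.buf = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.buf) >= 2:
            mean = sum(self.buf) / len(self.buf)
            var = sum((x - mean) ** 2 for x in self.buf) / len(self.buf)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.buf.append(value)
        return anomalous

detector = RollingZScoreDetector(window=30, threshold=4.0)
# Steady readings with normal jitter, then a spike; only the spike is flagged.
readings = [50.0, 50.1, 49.9, 50.05, 49.95] * 6 + [120.0]
flags = [detector.update(v) for v in readings]
```

Nothing here needs a GPU, a model registry, or even NumPy, which is exactly the point for a box mounted in a SCADA cabinet.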
The analytical domain handles everything else. This is where you deploy your vector databases, knowledge graphs, and large language models. Data flows one direction across the boundary: from operational to analytical. Never the reverse without explicit manual approval. This boundary is where your NERC CIP compliance controls live.
Vector Databases and the Semantic Search Promise
Every AI architecture in 2025 includes a vector database. We run Qdrant in production across all our deployments. It's Rust-based, genuinely fast, and the memory footprint stays predictable under load. We've tested Milvus, Weaviate, and Pinecone. Qdrant won on operational simplicity and performance with energy sector document volumes.
But here's what the tutorials don't tell you: vector similarity search is only useful if your embedding model understands your domain terminology. Generic models trained on internet text don't know that "CT ratio" means current transformer ratio, that "UFLS" is underfrequency load shedding, or that "firm capacity" has specific regulatory meaning.
We spent four months in 2023 trying to make OpenAI's ada-002 embeddings work for procedure retrieval at a nuclear plant. The semantic search kept returning plausible-looking but contextually wrong results. An operator searching for "reactor scram procedure" would get documents about scramming procedures for other equipment types, or historical incident reports that mentioned scrams, but not the actual operating procedure they needed.
The fix required domain-specific fine-tuning. We used a base model (all-MiniLM-L6-v2 from sentence-transformers) and fine-tuned on 40,000 document pairs from utility technical libraries. Accuracy for procedure retrieval went from 62% to 94% on our test set. But this took three weeks of a senior engineer's time plus compute resources.
If you're deploying vector search in energy operations, budget for embedding model customization. The out-of-box experience will disappoint you.
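The fine-tuning data itself is unglamorous: pairs of real operator queries and the documents they should have retrieved, mined from search logs and library metadata. A sketch of the pair-construction step in pure Python (the record fields and document IDs are hypothetical; the actual training ran through the sentence-transformers fine-tuning API):

```python
def build_training_pairs(query_log, library):
    """Turn (query, doc_id) log entries into (anchor, positive) text pairs.

    query_log: iterable of dicts like {"query": ..., "doc_id": ...}
    library:   dict mapping doc_id -> document title/summary text
    Skips queries whose target document is missing from the library.
    """
    pairs = []
    for entry in query_log:
        doc_text = library.get(entry["doc_id"])
        if doc_text is None:
            continue  # stale log entry; document was retired
        pairs.append((entry["query"].strip(), doc_text))
    return pairs

log = [
    {"query": "reactor scram procedure", "doc_id": "OP-1204"},
    {"query": "UFLS relay settings", "doc_id": "PRC-006-A"},
    {"query": "CT ratio verification", "doc_id": "MISSING-01"},
]
library = {
    "OP-1204": "Operating Procedure 1204: Manual Reactor Trip (Scram)",
    "PRC-006-A": "Underfrequency Load Shedding Relay Setting Sheet",
}
pairs = build_training_pairs(log, library)  # two usable pairs; one skipped
```

Most of the three weeks went into curating pairs like these, not into the training runs.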
Knowledge Graphs and Why Graph Databases Matter
This is where our architecture diverged most sharply from typical AI deployments. Most companies treat their data as documents to be embedded and searched. We model it as a connected graph of entities and relationships.
Why? Because energy infrastructure is inherently relational. A transformer failure doesn't exist in isolation. It connects to:
- The upstream substation feeding it
- The protection relays that should have opened
- The maintenance history over 15 years
- The vendor who manufactured it
- The engineering standard that specified its rating
- The load forecast that justified its installation
- The customers affected by its outage
Documents and vector embeddings lose these connections. You can retrieve similar text, but you can't traverse relationships or reason about causality.
We run Neo4j as our knowledge graph platform. Version 5.15 added vector indexing natively, which means we can combine graph traversal with semantic search in a single query. This is transformative for root cause analysis.
Example from a wind farm deployment: An operator asks "Why did turbine 47 go offline last Tuesday?" The system:
- Uses vector search to identify "turbine 47" and "offline" as key concepts
- Traverses the graph to find all events, alarms, and maintenance records connected to that turbine in the relevant time window
- Follows relationships to upstream equipment (collector substation, SCADA controller)
- Retrieves similar historical incidents through vector similarity
- Returns a causal chain with evidence, not just keyword-matched documents
This required modeling our domain explicitly. We built an ontology: Turbines connect to Substations through PowerFlows. Maintenance records attach to Equipment through MaintenanceEvents. Alarms link to ControllerStates. It took six weeks to model initially, and we refine it quarterly.
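In production this is a single Cypher query against Neo4j with its vector index, but the traversal half of the turbine-47 example can be sketched in plain Python. Node names and relationship types below are illustrative, loosely following the ontology above:

```python
from collections import deque

# Toy graph: node -> list of (relationship, neighbor) edges.
GRAPH = {
    "Turbine-47": [("RAISED", "Alarm-9912"), ("FED_BY", "CollectorSub-3")],
    "Alarm-9912": [("REPORTED_BY", "Controller-47A")],
    "CollectorSub-3": [("RAISED", "Alarm-9901")],
    "Controller-47A": [],
    "Alarm-9901": [],
}

def neighborhood(start, max_hops=2):
    """Collect every relationship path reachable within max_hops of `start`."""
    paths = []
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_hops:
            continue
        for rel, nxt in GRAPH.get(node, []):
            new_path = path + [(node, rel, nxt)]
            paths.append(new_path)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, new_path))
    return paths

evidence = neighborhood("Turbine-47", max_hops=2)
# Each path is a chain of (node, relationship, node) hops the system can cite.
```

The point of the sketch: the output is a set of evidence chains, not a bag of similar documents, which is what makes the causal answer possible.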
The payoff is substantial. Our graph-based RAG systems outperform pure vector RAG by 30-40% on complex technical questions in energy domains. The implementation complexity is higher, but for the problems we're solving, it's worth it.
The RAG Layer and Model Selection
Retrieval-Augmented Generation is table stakes now. Nobody runs pure LLMs for technical question answering anymore. The architecture question is how you structure your RAG pipeline.
We use AnythingLLM as our RAG orchestration platform. It's open source, runs entirely on-premises, and handles the workflow of chunking documents, generating embeddings, storing vectors, and orchestrating retrieval plus generation. The key advantage for energy sector: complete data sovereignty. Nothing leaves your network.
Our typical deployment architecture:
- AnythingLLM running in Docker on dedicated hardware (64GB RAM, no GPU required for embedding, GPU optional for inference)
- Qdrant as the vector store backend
- Ollama for local LLM inference
- Neo4j for graph-augmented retrieval when needed
Model selection matters more than most architecture decisions. We run different models for different tasks:
For document chat and procedure lookup: Llama 3.1 8B. Fast enough for interactive use, accurate enough for technical content, small enough to run multiple instances for concurrency.
For incident analysis and root cause reasoning: Mixtral 8x7B. The mixture-of-experts architecture handles multi-domain reasoning better, which matters when an incident spans electrical, mechanical, and control systems.
For code generation and configuration scripts: DeepSeek Coder 33B when we have GPU resources, CodeLlama 13B when we don't.
We avoid using the largest models (70B+) in production. The inference latency kills interactive use cases, and the accuracy gains don't justify the infrastructure cost for our applications. A well-prompted 8B model with good RAG retrieval beats a poorly implemented 70B model every time.
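"Well-prompted with good retrieval" mostly comes down to disciplined context assembly. A sketch of the prompt-building step we mean, in pure Python (the template wording and character budget are illustrative):

```python
def build_rag_prompt(question, chunks, max_chars=6000):
    """Assemble a grounded prompt: numbered source chunks, then the question.

    chunks: list of (source_id, text) tuples, already ranked by the retriever.
    Chunks are added in rank order until the character budget is spent.
    """
    context_parts = []
    used = 0
    for i, (source_id, text) in enumerate(chunks, start=1):
        block = f"[{i}] ({source_id}) {text.strip()}\n"
        if used + len(block) > max_chars:
            break  # respect the small model's context window
        context_parts.append(block)
        used += len(block)
    context = "".join(context_parts)
    return (
        "Answer using ONLY the numbered sources below. "
        "Cite sources as [n]. If the sources are insufficient, say so.\n\n"
        f"{context}\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the trip setpoint?",
    [("OP-1204", "Trip setpoint is defined in section 4.2."),
     ("DWG-88", "See one-line diagram for relay locations.")],
)
```

Forcing citations and an explicit "insufficient sources" escape hatch does more for answer quality on an 8B model than any amount of parameter count.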
The Agent Layer and Workflow Orchestration
This is where architecture discussions get hand-wavy. Everyone talks about "AI agents" without defining what that means operationally.
For us, an agent is a semi-autonomous process that:
- Receives a goal or query
- Plans a sequence of actions
- Executes those actions using available tools
- Evaluates results and adjusts
- Returns a final answer or outcome
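That definition is concrete enough to sketch. A toy agent loop with a tool registry (the tool names and the fixed plan are illustrative; in our real agents the plan comes from an LLM planning step, not a hard-coded list):

```python
def run_agent(goal, tools, plan, max_steps=10):
    """Execute a planned sequence of tool calls, logging every step.

    tools: dict of name -> callable(state) -> dict of state updates
    plan:  ordered list of tool names to execute
    Returns (final_state, trace) so the caller can show the work.
    """
    state = {"goal": goal}
    trace = []
    for step, name in enumerate(plan[:max_steps]):
        result = tools[name](state)
        state.update(result)
        trace.append({"step": step, "tool": name, "result": result})
    return state, trace

tools = {
    "lookup_equipment": lambda s: {"equipment": "Turbine-47"},
    "fetch_alarms": lambda s: {"alarms": ["overtemp", "vibration"]},
    "summarize": lambda s: {
        "answer": f"{s['equipment']}: {len(s['alarms'])} active alarms"
    },
}
state, trace = run_agent(
    "diagnose turbine 47", tools,
    plan=["lookup_equipment", "fetch_alarms", "summarize"],
)
```

The trace is not an afterthought; it is the product. Every step lands in it, which matters for the explainability requirements discussed below.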
We've built agents for equipment diagnostics, procedure compliance checking, and maintenance schedule optimization. The architecture challenge is orchestration: how do you reliably chain together retrieval, reasoning, external tool calls, and response generation?
We've used two platforms in production.
SmythOS provides visual workflow building for agent logic. The drag-and-drop interface makes it accessible to domain experts who aren't software engineers. Our maintenance planners can build diagnostic agents that query equipment databases, retrieve historical failure patterns, and recommend inspection priorities. The downside is the abstraction layer sometimes fights you when you need precise control over prompts or retrieval logic.
For more complex workflows, we use n8n with custom nodes for AI operations. This gives us full control over every step in the agent execution. We've built nodes that interface with our SCADA historians, equipment databases, and document management systems. The learning curve is steeper, but the flexibility is worth it for production systems.
One pattern we've settled on: agents should always show their work. Every retrieval, every reasoning step, every tool call gets logged and made visible to the user. In energy operations, explainability isn't optional. When an agent recommends taking a generator offline, the operator needs to see exactly what data and logic led to that recommendation.
We implement this with structured output templates that include source citations, confidence scores, and reasoning chains. AnythingLLM's chat interface supports this natively, which is why we standardized on it.
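"Show their work" translates into a fixed response schema. A minimal sketch of the structure we have in mind (the field names are ours, not an AnythingLLM schema, and the sample values are invented):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Citation:
    source_id: str   # e.g. a work order, procedure, or historian tag reference
    excerpt: str

@dataclass
class AgentResponse:
    answer: str
    confidence: float                                # 0.0-1.0, from the evaluator step
    reasoning: list = field(default_factory=list)    # ordered reasoning steps
    citations: list = field(default_factory=list)

    def to_dict(self):
        return asdict(self)

resp = AgentResponse(
    answer="Recommend offline inspection of Turbine-47 gearbox.",
    confidence=0.82,
    reasoning=["Vibration trend exceeds baseline", "Similar failure in 2021"],
    citations=[Citation("WO-2021-118", "gearbox bearing spall found")],
)
payload = resp.to_dict()  # serializable for the chat UI and the audit log
```

Making the schema mandatory means an agent physically cannot return a recommendation without attaching the evidence behind it.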
The Deployment Reality
Here's the architecture component nobody puts in diagrams: operational burden.
We maintain AI infrastructure across thirteen customer sites. The deployment environments range from:
- Modern Kubernetes clusters in co-located data centers (rare)
- VMware ESXi hosts in facility server rooms (common)
- Bare metal servers in electrical rooms with questionable cooling (more common than we'd like)
- Edge compute boxes mounted in SCADA cabinets (for real-time applications)
Every site has different constraints. Some have no internet access, period. Some have intermittent connectivity for updates. Some have strict change control windows (quarterly for generation facilities under NERC CIP-003).
Our architecture had to accommodate this reality:
- Everything runs in containers (Docker or Podman) for consistency
- All dependencies bundled in container images, no internet-required pulls
- Configuration through environment files, not databases
- State stored on mounted volumes that can be backed up with standard tools
- Health checks and monitoring through Prometheus with local Grafana dashboards
- Updates deployed as container image files on USB drives when necessary
This isn't glamorous, but it's what works. The fanciest Kubernetes-native architecture is useless if you can't deploy it in an air-gapped facility or update it during a 4-hour maintenance window twice a year.
What We'd Do Differently
Three years in, here's what we'd change if we started over:
Invest in graph modeling earlier. We spent the first year treating everything as documents and vectors. Retrofitting graph structure later required rework. If we'd modeled domain entities and relationships from day one, we'd be six months ahead.
Build evaluation frameworks first. We deployed our first RAG system before we had systematic ways to measure accuracy. We flew blind for three months, relying on user feedback and anecdotal reports. Now we maintain test sets of 500+ question-answer pairs per customer, with automated evaluation runs before every release. Should have done this from week one.
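The evaluation harness doesn't need to be sophisticated to be useful. A sketch of the kind of check we run before every release (the scoring rule here, required-phrase matching, is the simplest useful one; the questions and expected phrases are invented):

```python
def evaluate(test_set, answer_fn):
    """Score a QA system against a test set of question/required-phrase cases.

    test_set:  list of {"question": ..., "must_contain": [...]} dicts
    answer_fn: callable(question) -> answer string
    A case passes only if every required phrase appears in the answer.
    """
    failures = []
    for case in test_set:
        answer = answer_fn(case["question"]).lower()
        missing = [p for p in case["must_contain"] if p.lower() not in answer]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    accuracy = 1 - len(failures) / len(test_set)
    return accuracy, failures

tests = [
    {"question": "scram procedure number?", "must_contain": ["OP-1204"]},
    {"question": "UFLS stage 1 setpoint?", "must_contain": ["59.3 Hz"]},
]
# A stub answer_fn that only knows one answer: half the cases fail.
accuracy, failures = evaluate(tests, lambda q: "See OP-1204 for the procedure.")
```

The failures list, not the accuracy number, is what engineers actually read; it tells you which retrieval gap to fix next.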
Standardize on fewer components. Our architecture includes six different databases (Postgres, TimescaleDB, Neo4j, Qdrant, InfluxDB, MongoDB) because different components had different preferences. The operational burden is real. If we could redo it, we'd enforce strict constraints: Neo4j for graphs, Qdrant for vectors, Postgres for everything else. Some inefficiency is worth the operational simplicity.
Plan for model evolution. We hard-coded model names and parameters in application code. When Llama 3.1 released with better instruction following, upgrading required code changes across eight repositories. We should have abstracted model selection into configuration from the start.
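What that abstraction looks like in practice: model choices live in one config file keyed by task, and application code asks for a task, never a model. A sketch (task names and settings are illustrative; the model tags follow Ollama's naming convention):

```python
import json

DEFAULT_CONFIG = {
    "document_chat": {"model": "llama3.1:8b", "temperature": 0.2},
    "incident_analysis": {"model": "mixtral:8x7b", "temperature": 0.1},
    "code_generation": {"model": "deepseek-coder:33b", "temperature": 0.0},
}

class ModelRegistry:
    """Resolve a task name to model settings; upgrades touch only the config."""

    def __init__(self, config=None):
        self.config = config or DEFAULT_CONFIG

    @classmethod
    def from_file(cls, path):
        with open(path) as f:
            return cls(json.load(f))

    def for_task(self, task):
        if task not in self.config:
            raise KeyError(f"No model configured for task: {task}")
        return self.config[task]

registry = ModelRegistry()
settings = registry.for_task("incident_analysis")
```

With this in place, swapping Llama 3.1 in for its predecessor is a one-line config change per site instead of code changes across eight repositories.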
Take NERC CIP seriously from day one. We treated compliance as a post-deployment concern on our first two projects. Wrong. CIP-013 supply chain security requirements and CIP-011 data protection controls affect fundamental architecture decisions. Retrofitting compliance is painful and expensive. Design for it upfront.
The Verdict
AI architecture for energy operations isn't about using the newest models or the most sophisticated techniques. It's about building systems that work reliably in constrained environments, respect operational boundaries, and maintain explainability under regulatory scrutiny.
Our production architecture isn't elegant by Silicon Valley standards. It's a pragmatic combination of vector databases for semantic search, knowledge graphs for relational reasoning, local LLMs for inference, and careful orchestration that respects OT/IT boundaries.
What makes it work:
- Complete data sovereignty with on-premises deployment
- Domain-specific embedding models and ontologies
- Graph-based knowledge representation for connected data
- Structured RAG pipelines with source attribution
- Agent workflows that show their reasoning
- Deployment patterns that work in air-gapped facilities
If you're building AI systems for energy operations, ignore the vendor slide decks. Start with your data reality, your compliance requirements, and your operational constraints. Build the simplest architecture that solves your actual problems. Add complexity only when the benefits clearly justify the operational burden.
We run Qdrant, Neo4j, AnythingLLM, and Ollama in production because they're reliable, deployable in our environments, and solve real problems for our users. Your architecture might need different components. The principles remain the same: data sovereignty, domain specificity, relational reasoning, explainability, and operational simplicity.
That's what actually works when the architecture diagrams meet reality in a 500 kV substation at 3 AM during an unplanned outage.