The Problem: Process Fragmentation in Energy Operations
We've watched utilities run the same manual process for fifteen years: an alarm triggers in SCADA, an operator logs it in a spreadsheet, emails maintenance, who creates a work order in a different system, which triggers a procurement request in yet another platform, while compliance manually copies data into their audit database. Every handoff loses context. Every system speaks a different language. The average power utility we work with has 47 separate software systems that don't talk to each other.
The energy sector's workflow problem isn't about automating simple tasks. It's about stitching together operational technology that predates the internet with modern business systems, all while maintaining NERC CIP compliance and keeping critical infrastructure air-gapped. When Colonial Pipeline was hit by ransomware in 2021, operators shut the pipeline down in part because the compromised billing system meant they couldn't invoice for delivered fuel.
We've deployed workflow automation for asset management, outage response, regulatory reporting, and procurement across six utilities and two midstream operators. The stack that consistently works: n8n for orchestration, ERPNext for business logic and data persistence, with carefully designed boundaries between IT and OT networks. This is the architecture walkthrough nobody gives you.
Architecture Overview: Orchestration Plus Business Logic
The fundamental mistake most teams make is treating workflow automation as a single-layer problem. They pick a tool like n8n and try to cram all their business logic, data models, and integration code into workflow definitions. Six months later, they have 200 unmaintainable workflows and no clear data lineage.
Our reference architecture separates concerns into three distinct layers. The orchestration layer handles process flow, timing, and integration choreography — that's n8n. The business logic layer manages data models, validation rules, and domain operations — that's ERPNext. The integration layer bridges OT systems with read-only data diodes and protocol converters — that's custom Python services running on dedicated hardware.
In a typical transmission utility deployment, n8n runs in the corporate IT network on three VMs behind the firewall. ERPNext runs on the same network segment, providing the system of record for assets, work orders, and maintenance history. The OT integration layer sits on a separate VLAN with one-way data flow from SCADA historians to the IT network. No workflow ever writes directly to operational systems.
Component Deep-Dive: n8n as Orchestration Engine
We chose n8n after evaluating Apache Airflow, Prefect, and Temporal. Airflow is phenomenal for data pipelines but terrible for event-driven workflows spanning days or weeks. Temporal has the best reliability model but requires serious engineering investment. n8n hits the sweet spot: visual workflow designer for operators, code-based nodes for engineers, built-in credential management, and native webhook support.
The key technical capability is n8n's execution model. Workflows can be triggered by webhooks, schedules, or manual execution. Each execution maintains complete state, so a workflow that starts when a transformer alarm fires can wait three days for maintenance completion, then resume to close the loop. State persists in PostgreSQL, which we replicate to a standby for HA.
In production, we run n8n 1.19+ with these critical configurations: queue mode enabled with Redis, allowing horizontal scaling of worker nodes; execution timeout capped at 24 hours, with longer workflows chunked into multiple stages; binary data stored in S3-compatible object storage, not PostgreSQL; credential encryption backed by hardware security modules for NERC CIP compliance.
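A minimal sketch of that configuration as environment variables. Variable names follow n8n's documented 1.x configuration reference, but they shift between releases, so verify against the version you deploy; hostnames here are placeholders.

```shell
# Queue mode: the main instance schedules, Redis-backed workers execute
export EXECUTIONS_MODE=queue
export QUEUE_BULL_REDIS_HOST=redis.internal
export QUEUE_BULL_REDIS_PORT=6379

# Hard cap on execution time: 24 hours, expressed in seconds
export EXECUTIONS_TIMEOUT=86400

# Keep binary payloads in object storage, not the PostgreSQL execution table
export N8N_DEFAULT_BINARY_DATA_MODE=s3
export N8N_EXTERNAL_STORAGE_S3_BUCKET_NAME=n8n-binary-data
```

The same settings can equally live in a `.env` file or the container spec, whichever your deployment tooling prefers.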
The 400+ pre-built integrations matter less than you'd think. We use maybe fifteen: HTTP request, PostgreSQL, email, Slack, and a handful of SaaS APIs. The real power is the function node, where you write JavaScript to handle complex logic. We've built custom nodes for DNP3 protocol parsing, NERC CIP audit log formatting, and EIA form generation. Those live in a private npm registry and get version-controlled like any other code.
Component Deep-Dive: ERPNext as System of Record
ERPNext handles what n8n shouldn't: data models, business rules, and user interfaces. When a workflow creates a work order, that work order lives in ERPNext with full audit history, approval chains, and cost tracking. When maintenance needs to view their backlog, they use ERPNext's web interface. When finance needs to analyze maintenance spending, they query ERPNext's data warehouse.
The technical architecture: ERPNext 15.x running on Frappe Framework with MariaDB backend and Redis for caching. We deploy using Frappe's bench CLI on Ubuntu 22.04 LTS. Three-tier setup: web servers behind nginx, application servers running Gunicorn, background workers handling async jobs. Full deployment documented in Ansible playbooks.
The modules we actually use: Asset Management for equipment hierarchy and maintenance history, Project Management for outage coordination and capital projects, Accounting for cost tracking, HR for crew scheduling. We don't use ERPNext's CRM or e-commerce modules — wrong fit for utilities. The beauty is you only deploy what you need.
Integration between n8n and ERPNext happens through ERPNext's REST API. Every DocType in ERPNext automatically gets full CRUD endpoints with bearer token authentication. We create API keys with role-based permissions — the automation service account can create work orders but can't approve purchases. Rate limiting at 1000 requests per hour prevents runaway workflows from DOSing the ERP.
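The call pattern is simple enough to sketch. ERPNext exposes each DocType at `/api/resource/<DocType>` and accepts `Authorization: token <api_key>:<api_secret>` headers; the helper below builds such a request using only the standard library (the payload fields are illustrative, not a complete Work Order schema).

```python
import json
import urllib.parse
import urllib.request

def erpnext_request(base_url: str, doctype: str, payload: dict,
                    api_key: str, api_secret: str) -> urllib.request.Request:
    """Build an authenticated POST to ERPNext's /api/resource/<DocType> endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/api/resource/{urllib.parse.quote(doctype)}",
        data=json.dumps(payload).encode(),
        headers={
            # ERPNext token auth: "token <api_key>:<api_secret>"
            "Authorization": f"token {api_key}:{api_secret}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

In the real workflow this request is sent with `urllib.request.urlopen(req, timeout=30)` (or the HTTP client of your choice) under the automation service account, whose role permissions decide what the call is actually allowed to create.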
The data model customization capability is why ERPNext works for energy. Out of the box, their Asset Management module tracks assets and maintenance. We extended it with custom fields for NERC compliance data: CIP classification, cyber security controls, change management approval chains. We added a custom DocType for SCADA points, linking each sensor to its physical asset. All done through ERPNext's web-based customization tools, no source code modification required.
Integration Layer: The OT/IT Boundary
This is where deployments fail. Everyone wants real-time bidirectional integration between SCADA and business systems. NERC CIP says absolutely not. IEC 62443 says air-gap your OT. Insurance underwriters require network segmentation. And the physics of power systems means protective relay logic cannot tolerate latency introduced by workflow automation.
Our integration architecture uses one-way data replication from OT historians to an IT-side data lake. Typical implementation: OSIsoft PI historian or GE Proficy in the control center writes to its own SQL database. Every five minutes, a Python service on a dedicated integration box queries for new data and writes to TimescaleDB in the IT network. The integration box has two network interfaces on separate VLANs with host-based firewall rules permitting only the specific queries needed.
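The core of that polling service is a watermark: remember the timestamp of the last row replicated, and pull only newer rows each cycle. A sketch of the windowing logic, with the table and column names assumed (the real service binds this query through pyodbc or psycopg2 against the historian's SQL database):

```python
from datetime import datetime, timedelta

# Parameterized incremental pull; schema names are placeholders for
# whatever your historian actually exposes.
POLL_QUERY = """
    SELECT tag, ts, value, quality
    FROM historian_samples
    WHERE ts > %(since)s AND ts <= %(until)s
    ORDER BY ts
"""

def next_poll_window(last_ts: datetime, now: datetime,
                     max_span: timedelta = timedelta(minutes=30)) -> dict:
    """Bounded window: if the poller fell behind (outage, maintenance),
    catch up in chunks rather than one giant query against the OT-side DB."""
    until = min(now, last_ts + max_span)
    return {"since": last_ts, "until": until}
```

Advancing the watermark only after the TimescaleDB write commits means a crashed cycle is simply re-run, at the cost of needing idempotent inserts on the IT side.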
Workflows read from TimescaleDB, never from the OT network. This introduces 5-10 minute latency, which is acceptable for workflows but not for control logic. When a transformer overheats, the protective relay trips it in milliseconds. Five minutes later, our workflow sees the trip in the historian, creates a work order in ERPNext, dispatches a crew via n8n, and notifies the reliability coordinator.
For commands that must flow from IT to OT — rare but necessary — we use a manual approval step. A workflow can prepare a setpoint change and stage it for operator review. The operator logs into the SCADA HMI, verifies the change makes sense, and executes it. The workflow polls for confirmation that the change occurred. Zero automation directly commanding field devices.
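The confirmation side of that loop is worth showing. Since the workflow can only observe the historian, "did the operator execute the change?" reduces to checking replicated readings against the staged value. A hedged sketch, with the function name, tolerance, and deadline all ours rather than anything standard:

```python
from datetime import datetime, timedelta

def confirm_setpoint(staged_value: float,
                     readings: list,  # (timestamp, observed value) pairs
                     staged_at: datetime,
                     deadline: timedelta = timedelta(hours=4),
                     tolerance: float = 0.5) -> str:
    """Poll historian readings for evidence the staged change was executed.
    Returns 'confirmed', 'pending', or 'expired'."""
    for ts, value in readings:
        if ts > staged_at and abs(value - staged_value) <= tolerance:
            return "confirmed"
    last_seen = max((ts for ts, _ in readings), default=staged_at)
    if last_seen - staged_at > deadline:
        return "expired"
    return "pending"
```

An "expired" result routes to a human, never to a retry: the workflow's job ends at staging and verifying, and nothing in it ever commands a field device.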
Protocol translation happens at the integration boundary. SCADA systems speak DNP3, Modbus, IEC 61850 — none of which belong in an IT network. The integration service translates to JSON over HTTPS before data crosses the VLAN. We've written custom parsers in Python using libraries like pydnp3. The code is well-tested because a protocol parsing bug can cause trips.
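Once the parser has decoded a DNP3 measurement, the translation itself is a flattening step. A sketch of the shape we ship across the VLAN; the input fields here are assumed (real field names depend on the parser's measurement callbacks), and the JSON schema is ours, not a standard:

```python
import json
from datetime import datetime, timezone

def to_json_event(point_index: int, value: float, online: bool,
                  ts: datetime, station: str) -> str:
    """Flatten one OT analog reading into the vendor-neutral JSON
    posted over HTTPS across the VLAN boundary."""
    return json.dumps({
        "station": station,
        "point": point_index,
        "value": value,
        # Collapse DNP3 quality flags to the two states workflows care about
        "quality": "online" if online else "suspect",
        "ts": ts.astimezone(timezone.utc).isoformat(),
    }, sort_keys=True)
```

Keeping the output schema deliberately small is the point: workflows on the IT side never see DNP3 semantics, only tagged, timestamped values.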
Operational Reality: Deployment and Day-Two Operations
The first production deployment takes three months from kickoff to go-live. We've done this eight times; it doesn't get much faster. Month one is requirements gathering and data model design. Month two is integration layer buildout and testing. Month three is workflow development and user acceptance testing.
The infrastructure footprint: 6 VMs for n8n and ERPNext (3 app servers, 2 database servers, 1 Redis), 2 dedicated integration boxes at the OT boundary, 1 jump host for administrative access. Total compute: 48 vCPUs, 192GB RAM, 2TB SSD storage. This supports 500 active workflows executing 50,000 times per month.
Monitoring and observability are non-negotiable. We run Prometheus for metrics collection, Grafana for dashboards, and Loki for log aggregation. Critical metrics: workflow execution time (p50, p95, p99), failure rate by workflow, API response times for ERPNext, queue depth in Redis, and database connection pool utilization. We alert when execution time exceeds 2x baseline or the failure rate rises above 2%.
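Those two alert conditions are simple enough to state precisely. In production they live as Prometheus alerting rules, but the logic reduces to this (thresholds from our runbook, function name ours):

```python
def should_alert(exec_time_p95: float, baseline_p95: float,
                 failures: int, executions: int) -> bool:
    """Fire when p95 execution time exceeds 2x its baseline,
    or the failure rate goes above 2%."""
    if executions == 0:
        return False  # no executions in the window, nothing to judge
    too_slow = exec_time_p95 > 2 * baseline_p95
    too_flaky = failures / executions > 0.02
    return too_slow or too_flaky
```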
The most common failure mode: external API timeouts. A workflow calls a vendor API that takes 60 seconds to respond instead of the usual 2 seconds. The workflow times out, retries, and creates duplicate records. Fix: implement idempotency keys for all external API calls and set realistic timeouts with exponential backoff.
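The fix is mechanical once you see it: generate the idempotency key once, before the first attempt, so every retry carries the same key and the vendor can deduplicate. A sketch (header name and helper signature are ours; real vendor APIs vary in what they call the key):

```python
import time
import uuid

def call_with_retries(send, payload: dict, max_attempts: int = 4,
                      base_delay: float = 2.0, timeout: float = 10.0):
    """Retry an external API call with exponential backoff. The idempotency
    key is fixed across retries, so a timed-out-but-actually-processed
    request doesn't become a duplicate record."""
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(payload,
                        headers={"Idempotency-Key": idempotency_key},
                        timeout=timeout)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```

The same pattern wraps every outbound HTTP node in our workflows; the mistake to avoid is generating the key inside the retry loop, which defeats the deduplication entirely.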
The second most common failure: schema changes breaking workflows. ERPNext gets upgraded and a field name changes. Twenty workflows that reference that field break simultaneously. Fix: abstract ERPNext interactions into reusable subworkflows that act as an API facade. When the schema changes, update one subworkflow instead of twenty main workflows.
Backup and disaster recovery: PostgreSQL and MariaDB replicated to standby servers with 5-minute RPO. Daily full backups to S3-compatible storage with 90-day retention. Workflow definitions version-controlled in GitLab with automatic deployment via CI/CD. Recovery time objective: 4 hours to restore full service from cold backup. We test this quarterly.
Security operations: all API credentials stored in n8n's encrypted credential store, rotated every 90 days. ERPNext access controlled through role-based permissions, with separate service accounts for automation vs. human users. Integration layer runs on hardened Ubuntu with CIS benchmarks applied. Vulnerability scanning weekly with Nessus. Penetration testing annually.
Scaling Patterns: From Pilot to Enterprise
The pilot starts with five workflows: substation alarm response, routine work order generation, regulatory report compilation, spare parts inventory alerts, and crew dispatch optimization. These prove the technology works and build organizational confidence. Pilot runs for 90 days with daily review meetings.
Expansion phase adds twenty more workflows, including the complex ones: outage coordination across transmission and distribution, capital project management linking engineering and construction, vegetation management integrating GIS and work management, compliance reporting that pulls from eight source systems.
At enterprise scale — 200+ workflows — the architecture needs evolution. We add a second n8n instance for development and testing, completely isolated from production. We implement workflow governance: every workflow has an owner, documentation, and quarterly review. We build a workflow catalog in Confluence so people can discover existing automation instead of building duplicates.
Horizontal scaling happens at the n8n worker layer. The main n8n process handles UI and scheduling, while worker processes execute workflows. Workers scale linearly: add more worker VMs, point them at the same PostgreSQL and Redis, and throughput increases. We've scaled to 10 workers handling 200,000 executions per month.
ERPNext scaling is more nuanced. The application layer scales horizontally behind a load balancer. The database layer requires vertical scaling or read replicas. At 50,000 documents per month, a single beefy database server handles the load. Beyond that, we implement read replicas for reporting and configure application servers to route read queries to replicas.
The integration layer doesn't scale horizontally — it's bounded by the number of OT systems and data diodes. Instead, we optimize the polling frequency and query efficiency. Five-minute polling is standard; one-minute polling for critical systems. We use TimescaleDB's continuous aggregates to pre-compute metrics instead of scanning raw data.
The Verdict
Workflow automation in energy operations is not a point solution. It's an architecture that bridges incompatible systems while respecting the reality that operational technology cannot tolerate the failure modes of IT systems. n8n provides the orchestration layer because it's flexible enough to handle event-driven workflows that span days, reliable enough to run in production without constant babysitting, and open-source enough to deploy in air-gapped environments.
ERPNext provides the business logic layer because utilities need an ERP system anyway, and this one doesn't charge $300 per user per month while actually allowing the customization required for energy sector operations. The integration between n8n and ERPNext through REST APIs is straightforward and maintainable.
The real complexity lives in the integration layer at the OT/IT boundary. This requires domain expertise in power system protocols, security architecture, and regulatory compliance that no software vendor provides out of the box. Budget 40% of your project effort here.
After deploying this stack eight times, we'd make the same technology choices again. The alternatives — proprietary workflow tools, custom development, or continuing with manual processes — cost more and deliver less. The open-source stack works if you architect it correctly, staff it properly, and accept that workflow automation is infrastructure that requires ongoing engineering effort, not a product you buy and forget.