The Pattern: AI-Native Web
We're no longer building systems that passively consume APIs. Our energy sector clients need agents that navigate regulatory portals, monitor competitor pricing dashboards, extract intelligence from equipment vendor sites, and synthesize data from hundreds of utility commission filings. The AI-Native Web pattern treats the entire internet as a queryable database where LLMs act as autonomous browsers.
This isn't screen scraping from 2005. Modern browser agents combine headless automation, LLM-powered decision trees, semantic content extraction, and vector storage into a cohesive system. We've deployed this pattern for utilities tracking FERC Order 2222 compliance updates, renewable developers monitoring interconnection queue changes, and trading desks extracting real-time generation data from ISO websites.
The Problem: Web Content Wasn't Built for Machines
Energy sector intelligence is trapped in human-readable web interfaces. Your team needs to know when ERCOT updates its CDR data, when PJM posts new capacity auction results, or when a solar panel manufacturer changes its warranty terms. But these systems offer no APIs, deliver data in inconsistent formats, and hide content behind JavaScript-heavy interfaces that defeat simple HTTP requests.
Traditional approaches fail:
- Static scrapers break when site layouts change, which happens constantly
- Manual monitoring doesn't scale beyond a handful of sources
- Third-party aggregators introduce vendor lock-in and don't cover specialized energy sector sources
- RSS feeds and email alerts require each source to provide them, and most don't
The real problem: web interfaces are optimized for human visual processing, not structured data extraction. A utility regulatory filing might bury critical transmission cost data in the third paragraph of a PDF linked from a dynamically-loaded table that requires JavaScript to render.
Solution Architecture: Four-Layer Stack
We build AI-Native Web systems as four integrated layers. Each layer solves a specific problem and can be deployed independently, but they work best together.
Layer 1: Autonomous Navigation (Playwright)
Playwright handles the browser interaction layer. Microsoft built it specifically for reliable automation, and it works where older tools like Selenium fail. We run Playwright in headless mode on Ubuntu servers, typically containerized.
Key capabilities we use:
- Cross-browser consistency: Chromium, Firefox, WebKit support means we can route around browser-specific blocks
- Reliable element selection: Auto-waiting and retry logic eliminate the flaky tests that plague traditional scrapers
- Network interception: We can modify requests, block unnecessary resources, and capture API calls that might be easier to parse than rendered HTML
- Authentication handling: Persist sessions across runs, handle OAuth flows, manage cookies
Configuration example: We run Playwright with --disable-dev-shm-usage and explicit --user-data-dir paths in containerized environments to avoid memory issues and maintain session state.
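A minimal sketch of that launch configuration using Playwright's Python client. The profile path is illustrative, and the import is deferred so the module can be read without Playwright installed:

```python
# Sketch: headless Playwright in a container, with the flags described above.
# The user-data-dir path is illustrative; mount it on a persisted volume.

# Avoids /dev/shm exhaustion inside containers.
LAUNCH_ARGS = ["--disable-dev-shm-usage"]
USER_DATA_DIR = "/data/playwright/profile"  # hypothetical persisted path

def fetch_rendered_html(url: str) -> str:
    """Navigate with a persistent context so cookies and sessions survive runs."""
    from playwright.sync_api import sync_playwright  # deferred import

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(
            USER_DATA_DIR,
            headless=True,
            args=LAUNCH_ARGS,
        )
        page = context.new_page()
        # Auto-waiting: resolves once network activity settles.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        context.close()
        return html

if __name__ == "__main__":
    print(fetch_rendered_html("https://example.com")[:200])
```

Using `launch_persistent_context` rather than a fresh `launch` per run is what keeps authenticated sessions alive between scheduled extractions.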
Layer 2: Content Extraction (Firecrawl)
Once Playwright navigates to content, Firecrawl converts it into LLM-friendly markdown. This is where the magic happens for complex pages.
Firecrawl renders JavaScript, extracts semantic structure, and produces clean markdown that preserves document hierarchy. For energy sector use cases, this means:
- Regulatory PDFs: Extract tables and structured data without manual parsing
- Dynamic dashboards: Capture rendered content after JavaScript execution completes
- Multi-page documents: Follow pagination automatically and stitch content together
- Semantic chunking: Split long documents at logical boundaries for embedding
We self-host Firecrawl behind our firewall. The API is straightforward—POST a URL, receive markdown. Response times average 3-8 seconds for typical utility web pages, longer for PDF-heavy content. We run three instances load-balanced for redundancy.
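The call itself is a few lines. A sketch against a self-hosted instance, using only the standard library; the internal hostname is hypothetical, and the payload and response shape follow Firecrawl's v1 scrape endpoint, so verify them against your deployed version:

```python
import json
import urllib.request

# Hypothetical internal endpoint for a self-hosted Firecrawl instance.
FIRECRAWL_URL = "http://firecrawl.internal:3002/v1/scrape"

def build_scrape_payload(url: str) -> dict:
    """Request markdown output only, per Firecrawl's v1 scrape API."""
    return {"url": url, "formats": ["markdown"]}

def scrape_to_markdown(url: str, timeout: int = 30) -> str:
    req = urllib.request.Request(
        FIRECRAWL_URL,
        data=json.dumps(build_scrape_payload(url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Response shape may differ on older Firecrawl versions.
    return body["data"]["markdown"]

if __name__ == "__main__":
    print(scrape_to_markdown("https://www.ferc.gov/")[:500])
```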
Layer 3: Intelligence Storage (ChromaDB)
Raw markdown is useless without retrieval. ChromaDB stores embeddings of extracted content with metadata for filtering and temporal queries.
Our typical schema:
- Document chunks: 500-1000 token segments with overlap
- Metadata fields: source_url, extraction_timestamp, document_type, regulatory_jurisdiction, relevance_score
- Collections per use case: One collection for FERC filings, another for ISO market data, another for equipment specs
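The chunking step above can be sketched as follows. This version counts words as a cheap proxy for tokens; production code would use the embedding model's actual tokenizer to hit the 500-1000 token targets:

```python
def chunk_words(text: str, chunk_size: int = 750, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words.

    Sketch only: word count approximates token count. The overlap keeps
    sentences that straddle a boundary retrievable from either chunk.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```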
ChromaDB runs in persistent mode on NVMe SSDs. We embed using Ollama-hosted models (typically nomic-embed-text for 768-dimension vectors). Query performance stays under 100ms for collections up to 5 million embeddings.
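A sketch of the storage and query side of this schema, assuming the chromadb package and an illustrative persistence path. Note that in production we embed via Ollama and pass `query_embeddings` explicitly rather than relying on Chroma's default embedding function:

```python
import time

def filing_metadata(source_url: str, document_type: str, jurisdiction: str) -> dict:
    """Metadata matching the schema above; the timestamp enables temporal filters."""
    return {
        "source_url": source_url,
        "extraction_timestamp": int(time.time()),
        "document_type": document_type,
        "regulatory_jurisdiction": jurisdiction,
    }

def query_ferc_collection(query_text: str, jurisdiction: str, n_results: int = 5):
    """Filtered similarity query against the FERC filings collection."""
    import chromadb  # deferred import

    client = chromadb.PersistentClient(path="/data/chroma")  # illustrative path
    collection = client.get_or_create_collection(name="ferc_filings")
    return collection.query(
        query_texts=[query_text],
        n_results=n_results,
        where={"regulatory_jurisdiction": jurisdiction},
    )
```

The `where` filter is what makes per-jurisdiction retrieval cheap: the vector search only scores chunks whose metadata matches.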
Layer 4: Agent Orchestration (Ollama + Custom Logic)
The orchestration layer decides what to scrape, when to re-check sources, and how to interpret extracted content. We use Ollama-hosted LLMs (typically Llama 3.1 70B or Mixtral 8x7B depending on required reasoning depth).
Agent workflows:
- Source prioritization: LLM reviews the monitoring schedule and current events to decide which sources to check
- Navigation strategy: Generate Playwright scripts dynamically based on site structure
- Content validation: Verify extracted data completeness and flag anomalies
- Change detection: Compare new extractions against historical embeddings to identify material updates
- Alert generation: Synthesize findings into natural language summaries for human review
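The change-detection step above reduces to comparing the new extraction's embedding against the last stored one. A minimal sketch, with an illustrative similarity threshold that we tune per source:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_material_update(new_emb: list[float], old_emb: list[float],
                       threshold: float = 0.98) -> bool:
    """Flag a source when the new extraction drifts from the stored embedding.

    The 0.98 threshold is illustrative: boilerplate date changes score near
    1.0, while substantive edits drop the similarity noticeably.
    """
    return cosine_similarity(new_emb, old_emb) < threshold
```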
We don't trust LLMs for deterministic tasks like date parsing or numerical extraction—those use traditional Python logic. LLMs handle ambiguity: "Is this filing material for our interconnection project?" or "Has the vendor's warranty policy become more restrictive?"
Implementation Considerations
Deployment Topology
We run this stack on three Ubuntu 22.04 servers behind the utility's firewall:
- Server 1: Playwright instances (8 cores, 32GB RAM, handles 12 concurrent browser sessions)
- Server 2: Firecrawl and ChromaDB (16 cores, 64GB RAM, 2TB NVMe for embeddings)
- Server 3: Ollama with LLMs (dual GPUs, 128GB RAM, runs 70B models at 20 tokens/sec)
All three servers communicate over an internal network with no internet-exposed endpoints. Playwright accesses external sites through a proxy server that logs all traffic for security audit.
Rate Limiting and Ethics
We implement strict rate limiting:
- Maximum 1 request per 10 seconds per domain
- Respect robots.txt (though many energy sector sites either lack one or disallow all crawlers outright, which forces a case-by-case judgment)
- Set descriptive User-Agent strings identifying our organization
- Cache aggressively—we store Firecrawl responses for 24 hours before re-extracting
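The per-domain limit above can be sketched as a small tracker. This version is for a single process; coordinating multiple Playwright workers would need a shared store such as Redis:

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce the 1-request-per-10-seconds-per-domain policy.

    Single-process sketch; the injectable clock exists for testability.
    """

    def __init__(self, min_interval: float = 10.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self._last_hit: dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Seconds the caller must still wait before hitting this domain."""
        domain = urlparse(url).netloc
        last = self._last_hit.get(domain)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def record(self, url: str) -> None:
        """Mark a request to this domain as just sent."""
        self._last_hit[urlparse(url).netloc] = self.clock()
```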
Public utility commissions and ISOs provide data for transparency. We're not circumventing paywalls or terms of service. But we do scrape sites that would prefer we didn't—that's a business decision your legal team needs to approve.
Error Handling
Web scraping fails constantly. Our architecture assumes failure:
- Playwright retries: Three attempts with exponential backoff before marking a source as unreachable
- Content validation: Every extraction runs through a schema validator. If critical fields are missing, we flag for human review rather than storing garbage
- Dead source detection: If a source fails consistently for 72 hours, we alert the operations team
- Fallback extraction: If Firecrawl times out (happens with massive PDFs), we fall back to simpler HTML-to-markdown conversion
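The retry policy above is a few lines of Python. A sketch with an injectable sleep function (the 2-second base delay is illustrative; production values are tuned per source):

```python
import time

def retry_with_backoff(fn, attempts: int = 3, base_delay: float = 2.0,
                       sleep=time.sleep):
    """Run fn up to `attempts` times with exponential backoff (2s, 4s, ...).

    On the final failure the exception propagates so the caller can mark
    the source as unreachable rather than silently swallowing the error.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```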
Our production system maintains a 94% extraction success rate across 200+ monitored sources. The remaining 6% of failures are usually temporary site outages or maintenance windows.
NERC CIP Compliance
For utilities subject to NERC CIP, this architecture requires careful hardening:
- CIP-005: All servers in Protected Cyber Assets, network traffic through monitored access points
- CIP-007: Automated patching schedule, security event logging to SIEM
- CIP-010: Configuration baselines, change control for Playwright scripts and LLM prompts
The biggest challenge: Playwright needs internet access to scrape external sites, but CIP mandates strict egress filtering. We solve this with a dedicated proxy server in a DMZ that allows only HTTPS to whitelisted domains. The whitelist is automatically updated based on sources configured in the monitoring system.
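The whitelist check itself is simple. A sketch of the rule the proxy enforces, assuming exact-domain and subdomain matching against the configured source list:

```python
from urllib.parse import urlparse

def allowed_by_whitelist(url: str, whitelist: set[str]) -> bool:
    """HTTPS only, and only to whitelisted domains or their subdomains."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = parsed.hostname or ""
    return any(host == d or host.endswith("." + d) for d in whitelist)
```

The suffix check matters: matching on `host.endswith(d)` alone would let `notcaiso.com` slip past a `caiso.com` entry.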
Trade-offs and Limitations
Brittleness vs. Flexibility
LLM-powered navigation is more resilient to site changes than hardcoded selectors, but it's not magic. When CAISO redesigned their OASIS interface last year, our agents needed three days of prompt tuning to reliably extract curtailment data again. The failure mode is better—agents report extraction uncertainty rather than silently returning wrong data—but it's not zero-maintenance.
Cost at Scale
Running this stack for 200 sources costs us approximately:
- Compute: $800/month (3 servers, 24/7 operation)
- Storage: $150/month (growing 50GB/month for embeddings and raw extractions)
- Human oversight: 10 hours/week reviewing alerts and fixing broken extractors
That's cheaper than licensing equivalent data from commercial aggregators, which would run $5K-15K/month for comparable coverage. But the upfront engineering investment was substantial—figure 400 hours to build the initial system and integrate it with existing workflows.
Legal Gray Areas
Some vendors explicitly prohibit automated access in their terms of service. We scrape them anyway when the data is factual (equipment specifications, public pricing) and not creative content. Our legal team's position: terms of service don't override fair use for factual information. But this is untested in court for our specific use cases.
If you're risk-averse, limit scraping to government sites and sources that explicitly allow automated access. That still covers 70% of high-value energy sector intelligence.
The Verdict
The AI-Native Web pattern works in production and delivers measurable value. Our utility clients catch regulatory changes 2-3 days faster than competitors who rely on manual monitoring. Trading desks see price movements on ISO sites before they hit official APIs. Renewable developers get interconnection queue updates the moment they're posted.
Start with Playwright and Firecrawl as your foundation—they're mature, well-documented, and handle the hard problems of modern web interaction. Add ChromaDB when you're monitoring more than 20 sources and need semantic search. Layer in Ollama-hosted LLMs only after you have clean data pipelines and understand which decisions actually benefit from LLM reasoning.
The biggest mistake we see: trying to build "general purpose web agents" that can scrape anything. That's a research problem, not a production system. Instead, identify your 10 highest-value sources, build reliable extractors for those specific sites, and expand gradually. You want a fleet of specialized agents, not one brittle generalist.
This architecture isn't appropriate for every use case. If your sources provide APIs, use those instead. If you need real-time data, scraping with 10-second delays won't work. And if your organization can't handle the legal ambiguity or operational overhead, commercial data providers might be worth the premium.
But for energy sector teams that need intelligence from hundreds of disparate web sources, need data sovereignty for NERC CIP compliance, and have the engineering capacity to maintain the system, AI-Native Web delivers capabilities that weren't possible three years ago. We're running this pattern in production at five utilities and two renewable developers. It works.