Reliability Detection #1: Connectivity Failures

  • Writer: Andrea Curry
  • Nov 12
  • 3 min read

Why connectivity is the silent killer of charge success



After nearly three decades in software, I’ve learned that most reliability problems don’t start with hardware; they start with communication. That’s true in distributed systems and cloud microservices, and it’s true in EV charging networks.


Connectivity is the nervous system of a charge network. And when it falters - even slightly - everything else starts to degrade.


Yet, many charge point operators (CPOs) still treat connectivity as binary: it’s either online or offline. That’s not how reliability works at scale. 


The Hidden Layers of “Offline”

Clockwork’s platform continuously monitors OCPP connections, WebSocket closures, signal strength, SIM data traffic, and more. We’ve learned that “offline” is rarely one thing; it’s usually a pattern building up over hours or days.


Some examples we see every day:

  • OCPP timeouts or WebSocket closures that exceed configurable thresholds — signaling a loss of connectivity between EVSE and CSMS.

  • Poor signal strength from modems or Wi-Fi access points leading to intermittent connectivity, not total failure.

  • SIM data contract issues or throttled traffic causing subtle packet loss and failed authorizations.

  • Firmware bugs that lock modems after idle periods — common across certain hardware generations.

  • ISP or utility outages that ripple through multiple sites simultaneously.


In traditional software infrastructure, this is the equivalent of intermittent API latency, DNS errors, or flaky database connections - small instabilities that compound into big reliability failures.
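

To make “offline as a pattern” concrete, here is a minimal sketch of that idea in code: a rolling window of small connectivity events (WebSocket closures, heartbeat timeouts, low-signal readings) that only flags a station as degraded once they accumulate. The event kinds, the 24-hour window, and the ConnectivityMonitor class are illustrative assumptions, not Clockwork’s implementation.

```python
# A minimal sketch, assuming a simple per-EVSE event feed. Event kinds,
# thresholds, and class names are hypothetical illustrations of
# "offline as a pattern," not Clockwork's production code.
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ConnEvent:
    timestamp: datetime
    kind: str          # e.g. "ws_close", "heartbeat_timeout", "low_rssi"


class ConnectivityMonitor:
    """Flags degradation when small failures accumulate inside a window."""

    def __init__(self, window: timedelta = timedelta(hours=24), max_events: int = 5):
        self.window = window
        self.max_events = max_events      # degradation threshold, not "offline"
        self.events: deque = deque()

    def record(self, event: ConnEvent) -> None:
        self.events.append(event)
        # Drop events that have aged out of the rolling window.
        while self.events and event.timestamp - self.events[0].timestamp > self.window:
            self.events.popleft()

    def is_degraded(self) -> bool:
        # Many small failures in the window signal trouble long before the
        # station ever reports as fully offline.
        return len(self.events) >= self.max_events
```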

 

How Detection Works

Clockwork’s detection engine pulls in telemetry from EVSEs, SIM cards, and CSMS platforms - not just OCPP.


We apply pattern-based detection to identify anomalies in frequency, duration, and severity of communication failures.
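

As a hedged illustration of that step, the sketch below scores a site’s communication failures along those three dimensions against a per-site baseline, which in practice would be built from the EVSE, SIM, and CSMS telemetry described above. The FailureWindow fields, the weights, and the 0.5 trigger threshold are assumptions made for the example.

```python
# Illustrative anomaly scoring along frequency, duration, and severity.
# Field names, weights, and thresholds are assumptions, not production logic.
from dataclasses import dataclass


@dataclass
class FailureWindow:
    count: int              # how many times connectivity dropped in the window
    total_minutes: float    # cumulative minutes spent disconnected
    worst_severity: float   # 0.0 (brief blip) .. 1.0 (hard outage)


def anomaly_score(window: FailureWindow,
                  baseline_count: float,
                  baseline_minutes: float) -> float:
    """Return a score in [0, 1]; higher means further from the site's baseline."""
    freq = min(window.count / max(baseline_count, 1.0), 2.0) / 2.0
    dur = min(window.total_minutes / max(baseline_minutes, 1.0), 2.0) / 2.0
    # Weighted blend: repeated, long, and severe failures all push the score up.
    return min(1.0, 0.4 * freq + 0.4 * dur + 0.2 * window.worst_severity)


# Example: a site that normally drops twice a week drops eight times and
# spends 90 minutes offline against a 20-minute baseline.
score = anomaly_score(FailureWindow(8, 90.0, 0.6), baseline_count=2, baseline_minutes=20)
assert score > 0.5  # crosses the (assumed) threshold -> run probable cause analysis
```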


Once a threshold is hit, the platform runs probable cause analysis - correlating site design, firmware versions, signal data, and event history to determine what’s most likely at fault. 
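

A simplified version of that correlation step could look like the rule set below: known-bad firmware first, then multi-site patterns that point at an ISP or utility, then signal and backhaul checks. The firmware identifiers, thresholds, and cause labels are hypothetical; real probable cause analysis weighs far more context than this.

```python
# A hedged sketch of probable cause analysis as ordered correlation rules.
# Firmware IDs, thresholds, and cause labels are invented for illustration.
from dataclasses import dataclass


@dataclass
class SiteContext:
    firmware: str                  # modem/EVSE firmware version in the field
    rssi_dbm: float                # latest modem or Wi-Fi signal reading
    sites_down_on_same_isp: int    # simultaneous outages seen on nearby sites
    ws_closes_last_24h: int        # WebSocket closures in the last day


KNOWN_BAD_FIRMWARE = {"modem-fw-2.1.4", "modem-fw-2.1.5"}   # hypothetical IDs


def probable_cause(ctx: SiteContext) -> str:
    # Ordered from most specific to least; first match wins in this sketch.
    if ctx.firmware in KNOWN_BAD_FIRMWARE:
        return "firmware_modem_lockup"
    if ctx.sites_down_on_same_isp >= 3:
        return "isp_or_utility_outage"
    if ctx.rssi_dbm < -100:
        return "poor_signal_strength"
    if ctx.ws_closes_last_24h > 10:
        return "sim_or_backhaul_instability"
    return "unknown_needs_technician"
```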


What Happens Next

Reliability isn’t about detection alone — it’s about closed-loop response.


Based on the probable cause, Clockwork automatically takes one or more of the following actions:

  • Networking equipment remote reboot and firmware update (for known modem or EVSE firmware issues)

  • Reconfiguration of network settings or modem resets via secure commands

  • Dispatch automation — creating a technician “backgrounder” with full site history, root cause, and next best action such as power cycling the impacted stations

  • Driver communication when a utility or ISP outage is detected — proactive messaging rather than passive downtime
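

In code, the dispatch step can be as simple as a lookup from probable cause to an action plan, as in the sketch below. The cause labels and action names are placeholders rather than Clockwork’s actual catalog; the point is that each diagnosis maps to a concrete, automatable response.

```python
# Sketch of closed-loop response: map each probable cause to the action
# categories listed above. Cause and action names are hypothetical.
ACTIONS_BY_CAUSE = {
    "firmware_modem_lockup": ["remote_reboot", "firmware_update"],
    "poor_signal_strength": ["modem_reset", "network_reconfig"],
    "sim_or_backhaul_instability": ["modem_reset", "open_carrier_ticket"],
    "isp_or_utility_outage": ["notify_drivers", "suppress_false_downtime_alerts"],
}


def respond(cause: str) -> list[str]:
    # Fall back to dispatching a technician with a full "backgrounder"
    # when no automated fix applies.
    return ACTIONS_BY_CAUSE.get(cause, ["create_technician_backgrounder"])


print(respond("firmware_modem_lockup"))   # ['remote_reboot', 'firmware_update']
```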


Every action, outcome, and performance result is fed back into the model — improving future detection accuracy.


That’s the “closed loop” we talk about at Clockwork: detect → diagnose → act → learn → repeat.
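

Sketched as code, with each step reduced to a placeholder function, the loop looks something like this:

```python
# The loop in miniature: detect, diagnose, act, then feed outcomes back so
# the next pass detects earlier. The four callables are placeholders for the
# telemetry, scoring, and action steps sketched above.
def reliability_loop(windows, detect, diagnose, act, learn):
    for window in windows:                     # detect anomalies per window
        if not detect(window):
            continue
        cause = diagnose(window)               # diagnose the probable cause
        outcome = act(cause)                   # act: remote fix or dispatch
        learn(window, cause, outcome)          # learn, then repeat
```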

 

Why This Matters

Connectivity failures are the leading cause of false downtime reports, failed transactions, and degraded driver experience across public charging networks.


When you fix the root communication issues, you don’t just improve uptime - you eliminate the cascade of downstream problems: authorization errors, power misreads, and failed sessions that follow lost connectivity.


In other words: fix the pipe, and everything else flows.


The Bigger Picture

Connectivity is just one domain in a larger reliability system - but it’s the foundation for all others.


In the next post, we’ll move to Detection Domain #2: Faults - exploring how operators can use multi-source diagnostics to isolate electrical, interoperability, and hardware issues automatically.


Because reliability isn’t achieved in layers - it’s achieved in loops.


And the best networks are already learning to fix themselves.


 
 
 
