Reliability Detection #2: Faults

Jennifer Buchanan
Dec 4, 2025

The Hardest Part of Effective Fault Detection Is Learning What to Ignore

Everything Else Is a Distraction…

When you’re designing fault detection, it’s tempting to surface every little thing you see. Comprehensive detection is table stakes now. We have the tools, the logs, the telemetry, the vendor error codes. You don’t get reliability without full visibility.

But the operators on the ground - the people responding to tickets, planning dispatch, and managing uptime - consistently value actionable prioritization over raw completeness.

They’ll tell you when the system is missing something.

But they’ll quietly suffer through noise.

And that’s the real challenge: building a system that sees everything, but only interrupts people for what matters.

Comprehensive Detection Is Not the Same as Comprehensive Alerting

This is where many detection strategies drift off course.

A good system ingests every signal:

Every OCPP message
Every vendor code
Every modem drop
Every power fluctuation
Every pattern deviation

But a great system doesn’t treat all of those signals as equal. It digests everything, then decides - intelligently - what rises to the level of action.

The other day, I was reviewing data with a large operator. We hadn’t reported a small issue at a site, and I was trying to talk him into lowering his alert thresholds so we’d catch it. He stopped me mid-sentence and said, essentially:

“You’re the only system in our stack that I rely on not to get lost in the noise. I’ll catch that minor fault the next time - when it becomes more serious or starts to manifest in other ways.”

And he was right. He didn’t want more sensitivity; he wanted smarter prioritization.

That’s the nuance: You want full detection coverage, but you also want attention discipline.

Three Fundamentals for Turning “Everything” Into Something Useful

1. Use persistence and patterning to separate noise from real faults

Every EVSE throws momentary errors.

Every network hiccups.

Every firmware does something odd.

Comprehensive detection means you log it all.

Actionable prioritization means you only alert when:

a state persists long enough to disrupt charging, or
a fault pattern repeats enough to signal a true failure.

An intelligent system learns to distinguish what truly affects operations and customer experience from what doesn’t, elevating only the signals that matter.

It’s not about ignoring data - it’s about not interrupting someone over it.
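
To make the persistence and patterning rules concrete, here is a minimal Python sketch. It is only an illustration, not how any particular product works: the FaultEvent type, the should_alert function, and all three thresholds are hypothetical, and real values would need tuning per site and charger model.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
PERSIST_SECONDS = 300   # a fault state this long is assumed to disrupt charging
REPEAT_COUNT = 3        # this many occurrences...
REPEAT_WINDOW = 3600    # ...within this many seconds is treated as a pattern


@dataclass
class FaultEvent:
    charger_id: str
    code: str      # e.g. a vendor error code
    start: float   # epoch seconds when the fault state began
    end: float     # epoch seconds when it cleared


def should_alert(events: list[FaultEvent]) -> bool:
    """Decide whether the logged events for one charger and code warrant an alert."""
    # Persistence: any single occurrence that lasted long enough.
    if any(e.end - e.start >= PERSIST_SECONDS for e in events):
        return True
    # Patterning: REPEAT_COUNT occurrences starting within one REPEAT_WINDOW.
    starts = sorted(e.start for e in events)
    for i in range(len(starts) - REPEAT_COUNT + 1):
        if starts[i + REPEAT_COUNT - 1] - starts[i] <= REPEAT_WINDOW:
            return True
    return False
```

Every event still gets logged; should_alert only decides whether a person is interrupted. A single five-second blip returns False, while one ten-minute fault, or three short faults inside an hour, returns True.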

2. Group related signals into meaningful “problems”

One real issue often sets off a chain reaction:

status changes
vendor error codes
communication failures
transaction impacts

If you trigger on each independently, operators drown in tickets.

Grouping signals isn’t just about reducing noise - it enables teams to identify system-wide issues and apply system-wide fixes, instead of burning time on scattered one-off interventions.
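
One simple way to picture grouping is to bundle signals from the same charger that arrive close together in time. The Python sketch below assumes signals arrive as (charger_id, timestamp, kind) tuples and uses a hypothetical 120-second gap; the function name and data shapes are illustrative assumptions, and a real system would likely correlate on more than time alone.

```python
from collections import defaultdict

GROUP_WINDOW = 120  # seconds; hypothetical maximum gap between related signals


def group_into_problems(signals):
    """Bundle (charger_id, timestamp, kind) tuples into per-charger problems."""
    by_charger = defaultdict(list)
    for charger_id, ts, kind in signals:
        by_charger[charger_id].append((ts, kind))

    problems = []
    for charger_id, items in by_charger.items():
        items.sort()  # order by timestamp
        current = None
        for ts, kind in items:
            if current is not None and ts - current["last_seen"] <= GROUP_WINDOW:
                # Close enough to the previous signal: same problem.
                current["kinds"].append(kind)
                current["last_seen"] = ts
            else:
                # Gap too large (or first signal): open a new problem.
                current = {
                    "charger_id": charger_id,
                    "first_seen": ts,
                    "last_seen": ts,
                    "kinds": [kind],
                }
                problems.append(current)
    return problems
```

A status change, a vendor error code, and a communication failure within a couple of minutes on one charger become one problem rather than three tickets. Keying on a site identifier instead of a charger identifier is one way the same idea could expose system-wide issues.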

3. Prioritize likely causes, not possible causes

Comprehensive detection collects thousands of data points. But comprehensive alerting does not mean presenting every hypothetical cause.

The most helpful thing you can do is rank:

what’s most likely
what’s worth checking first
what historically correlates with failures
what actually affects charging

Operators repeat this theme constantly:

“Just point me toward what deserves my attention first.”

The real goal is prioritizing effort - helping operators find the true root cause as quickly and confidently as possible.
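
Ranking can be sketched as a scoring function over candidate causes. In the Python example below, the Candidate fields, the 0.6 and 0.4 weights, and the penalty for causes that do not affect charging are all assumptions made up for illustration; the point is only that the operator sees an ordered list instead of every hypothesis at once.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    cause: str
    likelihood: float        # 0..1, hypothetical estimate given current signals
    historical_rate: float   # 0..1, share of past failures linked to this cause
    affects_charging: bool   # whether this cause actually blocks sessions


def score(c: Candidate) -> float:
    # Hypothetical weighting: current evidence first, history second.
    base = 0.6 * c.likelihood + 0.4 * c.historical_rate
    # Causes that do not affect charging are pushed down, not hidden.
    return base * (1.0 if c.affects_charging else 0.5)


def rank_candidates(candidates: list) -> list:
    """Return candidates ordered from most to least worth checking first."""
    return sorted(candidates, key=score, reverse=True)
```

With this weighting, a moderately likely cause that blocks charging outranks a more likely one that does not, which matches the operators' request to be pointed at what deserves attention first.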

The Real Work: Respecting Human Time

No one has a shortage of data anymore. What they have a shortage of is capacity.

A fault detection system should reduce cognitive load, not add to it. It should capture everything, surface only what matters, and make the next step obvious.

The goal is not fewer signals - it’s fewer distractions. Not less data - but smarter triage.

Comprehensive detection + actionable prioritization = usable reliability.

A Light Touch on How We Apply This

At Clockwork, this philosophy has become a guiding principle. We collect everything - from OCPP traffic to network performance to vendor-specific behaviors - but we’ve learned that operators benefit most when that comprehensive data sits behind a layer of thoughtful filtering, grouping, and prioritization.

The result is a system that supports deep investigation when someone goes looking, but only elevates the issues that genuinely need attention.

Because ultimately, reliability isn’t just about what you detect. It’s about what you help people focus on.