Consumer-Reported Dependency Health

Posted on December 8, 2025

SRE Monitoring Prometheus Devops Thibault

An in-depth exploration of the CRDH practice, where consumers become distributed probes that report the real health of their dependencies. A modern and reliable approach for monitoring, observability, and incident detection.

Consumer-Reported Dependency Health — Photo by Thibault Deheurles

I. Rethinking How We Evaluate the Health of Distributed Systems

In modern distributed architectures, a system’s health depends as much — if not more — on the state of its dependencies as on its own internal state. Yet most monitoring strategies still rely on synthetic or dedicated healthchecks: /health endpoints, liveness/readiness probes, external scripts, and similar mechanisms.

These techniques work, but they miss the essential point:
the real experience of the consumers.

There is a simpler, more robust, naturally distributed alternative:
Consumer-Reported Dependency Health (CRDH).

II. What Is CRDH?

Consumer-Reported Dependency Health (CRDH) means that consumers themselves indicate whether a dependency is functioning properly, based directly on what they observe during their real calls.

The principle:

If a real call to a dependency succeeds → the consumer reports a success.
If a real call fails → the consumer reports a failure.
Metrics are exported automatically (Prometheus / OpenTelemetry).
The global view emerges through aggregation.

In practice, this creates a distributed health matrix that instantly reveals whether a problem is:

global (many consumers impacted),
local (one consumer affected),
configuration-related,
load-related,
tied to a specific call pattern.

III. Why Traditional Healthchecks Are Not Enough

1. They do not reflect real traffic

A /health endpoint often checks “SELECT 1”, “PING Redis”, or “GET /status”.
But real business traffic is far more complex (permissions, payloads, batching, etc.).

A service may appear “healthy” in a healthcheck but be unusable in reality.

2. They do not detect local issues

A consumer may fail because of:

local DNS issues,
a firewall rule,
an expired secret,
an internal routing problem.

The dependency’s healthcheck will still show “OK”.

3. They require constant maintenance

Each new service must:

write its probes,
maintain them,
ensure they test the real functionality.

CRDH removes most of this burden, because the instrumentation rides on calls the consumer already makes.

IV. The “Health Matrix”

Thanks to CRDH metrics, a global view emerges:

Dependency	Consumer A	Consumer B	Consumer C	Global Status
Redis	FAIL	FAIL	FAIL	❌ Global outage
Payment API	OK	FAIL	OK	⚠️ Local issue (B)
S3	OK	OK	OK	✓ Healthy

This view helps with:

diagnosing faster,
avoiding false positives,
understanding the real scope of an incident,
prioritizing remediation efforts.
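As an illustration, the "Global Status" column can be derived mechanically from one row of the matrix. The sketch below is a minimal example, not a prescribed algorithm: the `Scope` type, the `ClassifyScope` function, and the rule "all consumers failing means a global outage" are assumptions made for this example.

```go
package main

import "fmt"

// Scope describes how widely a dependency problem is observed.
// The names and values are hypothetical, chosen for this sketch.
type Scope string

const (
	Unknown Scope = "unknown"
	Healthy Scope = "healthy"
	Local   Scope = "local issue"
	Global  Scope = "global outage"
)

// ClassifyScope derives a status from one row of the health matrix.
// Keys are consumer names; a value of true means that consumer
// currently observes the dependency as OK.
func ClassifyScope(row map[string]bool) Scope {
	if len(row) == 0 {
		return Unknown // no consumer reported anything
	}
	failing := 0
	for _, ok := range row {
		if !ok {
			failing++
		}
	}
	switch {
	case failing == 0:
		return Healthy
	case failing == len(row):
		// Assumption: every consumer failing is treated as a global outage.
		// With a single consumer, local and global cannot be told apart.
		return Global
	default:
		return Local
	}
}

func main() {
	matrix := map[string]map[string]bool{
		"Redis":       {"A": false, "B": false, "C": false},
		"Payment API": {"A": true, "B": false, "C": true},
		"S3":          {"A": true, "B": true, "C": true},
	}
	for _, dep := range []string{"Redis", "Payment API", "S3"} {
		fmt.Printf("%-12s %s\n", dep, ClassifyScope(matrix[dep]))
	}
}
```

A production rule would probably use thresholds (for example, a share of failing consumers) rather than the all-or-nothing test used here.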

V. CRDH Technical Specification

1. Recommended Prometheus Metrics

Success

crdh_dependency_success_total{
  consumer="order-service",
  dependency="payment-api",
  method="POST",
  status="200"
}

Failure

crdh_dependency_error_total{
  consumer="order-service",
  dependency="payment-api",
  error="timeout",
  status="504"
}
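The `error` label only stays useful if it takes a small, fixed set of values. One possible way to get there in Go is to map arbitrary errors onto a few categories before emitting the metric. The function name `ErrorLabel`, the category names, and the example host `payment-api.internal` are assumptions for this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// ErrorLabel maps an error to a bounded set of label values so that the
// `error` label cannot explode in cardinality. The categories are a
// hypothetical choice; adapt them to the failures you care about.
func ErrorLabel(err error) string {
	var dnsErr *net.DNSError
	var netErr net.Error
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, context.Canceled):
		return "canceled"
	case errors.As(err, &dnsErr):
		return "dns"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	default:
		return "other"
	}
}

// Example errors, wrapped the way a real call site might wrap them.
var (
	errExampleTimeout = fmt.Errorf("call payment-api: %w", context.DeadlineExceeded)
	errExampleDNS     = fmt.Errorf("call payment-api: %w", &net.DNSError{
		Err:        "no such host",
		Name:       "payment-api.internal", // hypothetical host name
		IsNotFound: true,
	})
	errExampleOther = errors.New("unexpected response body")
)

func main() {
	for _, err := range []error{errExampleTimeout, errExampleDNS, errExampleOther} {
		fmt.Printf("%-8s <- %v\n", ErrorLabel(err), err)
	}
}
```

Raw error messages should never be used as label values, since each distinct message would create a new time series.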

2. Dependency Latency

crdh_dependency_latency_ms_bucket{
  consumer="web",
  dependency="db",
  le="100"
}
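The `le` label marks a cumulative bucket: the series with `le="100"` counts every call that took 100 ms or less. The following sketch shows that bookkeeping with the standard library only. The `Histogram` type and the bucket bounds (50, 100, 250 ms) are assumptions for illustration; a real Prometheus client library handles this internally.

```go
package main

import "fmt"

// Histogram is a minimal cumulative histogram. Counts[i] holds the number
// of observations less than or equal to Bounds[i], mirroring how the
// `le` buckets appear when scraped.
type Histogram struct {
	Bounds []float64 // upper bounds in milliseconds, ascending
	Counts []uint64  // cumulative count per bound
	Total  uint64    // plays the role of the le="+Inf" bucket
}

// NewHistogram creates a histogram with the given ascending bounds.
func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{Bounds: bounds, Counts: make([]uint64, len(bounds))}
}

// Observe records one latency sample in every bucket whose bound covers it.
func (h *Histogram) Observe(ms float64) {
	for i, bound := range h.Bounds {
		if ms <= bound {
			h.Counts[i]++
		}
	}
	h.Total++
}

func main() {
	h := NewHistogram(50, 100, 250) // hypothetical bucket layout
	for _, ms := range []float64{12, 80, 80, 300} {
		h.Observe(ms)
	}
	for i, b := range h.Bounds {
		fmt.Printf("le=%v count=%d\n", b, h.Counts[i])
	}
	fmt.Printf("le=+Inf count=%d\n", h.Total)
	// Prints:
	// le=50 count=1
	// le=100 count=3
	// le=250 count=3
	// le=+Inf count=4
}
```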

3. Health Rate

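(computed at query time with PromQL, never stored by the consumer)

100 * sum by (dependency) (rate(crdh_dependency_success_total[5m]))
/
(
  sum by (dependency) (rate(crdh_dependency_success_total[5m]))
  + sum by (dependency) (rate(crdh_dependency_error_total[5m]))
)

Each counter is aggregated before the addition because the two metrics carry different labels, so adding them series by series would match nothing. If a dependency has never produced an error, its error series does not exist and the expression returns no data; initializing the error counter to 0 at startup avoids that.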
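The arithmetic behind the health rate is simple enough to sketch in Go, which also makes the zero-traffic case explicit. The function name `HealthRate` and its signature are assumptions for this example; in practice the value is computed by the query above, not in application code.

```go
package main

import "fmt"

// HealthRate returns the percentage of successful calls in a window,
// given the number of successes and failures observed in that window.
// ok is false when there was no traffic, because a rate over zero calls
// is undefined rather than 0% or 100%.
func HealthRate(successes, failures float64) (rate float64, ok bool) {
	total := successes + failures
	if total <= 0 {
		return 0, false
	}
	return 100 * successes / total, true
}

func main() {
	rate, ok := HealthRate(970, 30)
	fmt.Println(rate, ok) // 97 true

	_, ok = HealthRate(0, 0)
	fmt.Println(ok) // false
}
```

Treating "no traffic" as a separate state rather than as healthy or unhealthy keeps idle dependencies from triggering or masking alerts.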

VI. Example: Go Implementation

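import (
    "context"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// CRDHSuccess and CRDHErrors (*prometheus.CounterVec) and CRDHLatency
// (*prometheus.HistogramVec) are assumed to be declared and registered
// elsewhere with exactly the labels "consumer" and "dependency".
func CallPaymentAPI(ctx context.Context) error {
    start := time.Now()
    err := doRealPaymentCall(ctx) // the real business call, not a probe
    duration := time.Since(start)

    labels := prometheus.Labels{
        "consumer":   "order-service",
        "dependency": "payment-api",
    }

    if err != nil {
        CRDHErrors.With(labels).Inc()
    } else {
        CRDHSuccess.With(labels).Inc()
        // The metric is named crdh_dependency_latency_ms, so observe milliseconds.
        CRDHLatency.With(labels).Observe(float64(duration) / float64(time.Millisecond))
    }

    return err
}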

And that’s it. No dedicated healthcheck is required.

VII. Comparison: CRDH vs Common SRE Practices

Practice	Advantages	Limitations
Dedicated healthchecks	Simple to implement	Do not reflect real traffic
Synthetic checks	Great for external monitoring	Limited business context
Distributed tracing	Excellent granularity	Complex, requires heavy instrumentation
CRDH	Realistic, scalable, simple, self-sustained	Traffic-based (needs a minimum call volume to be meaningful)

CRDH is not a replacement but a natural complement:
it adds the business context missing from traditional probes.

VIII. Alerting: A Smarter Model

1. Local Alerts (only one consumer impacted)

sum(rate(crdh_dependency_error_total{consumer="serviceA", dependency="redis"}[5m])) > 5

2. Global Alerts (multiple consumers impacted)

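count by (dependency) (
  sum by (consumer, dependency) (rate(crdh_dependency_error_total[5m])) > 5
) > 2

Since rate() is expressed per second, this fires when more than two consumers each report over 5 errors per second against the same dependency.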

3. Degradation Alerts (rising latency)

histogram_quantile(0.95, sum by (le, consumer, dependency) (rate(crdh_dependency_latency_ms_bucket[5m]))) > 200

IX. When to Adopt CRDH

CRDH is particularly effective when:

multiple services consume the same dependencies,
you want to reduce alerting false positives,
you need to quickly understand incident scope,
you want to avoid maintaining custom healthchecks,
you need a robust multi-service monitoring model.

X. Limitations (and Solutions)

Limitation	Solution
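Low traffic → statistically weak metrics	Add lightweight synthetic probes to guarantee a minimum call volume
Unused dependencies remain invisible	Add a lightweight probe for each dependency that has no regular traffic
High cardinality due to too many labels	Standardize `consumer` and `dependency` values and keep other labels to a small, fixed set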

Code Knowledge Requirements

CRDH assumes that developers understand the real behavior of their dependency calls: retries, timeouts, fallbacks, logical errors, and what constitutes true business success.
If not, CRDH metrics may become incorrect or misleading.

To avoid this, use a shared middleware or SDK that standardizes metric emission, and document clearly what counts as “success” and “failure” for each dependency.
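A shared wrapper of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Tracker`, `Recorder`, and `Outcome` names, and the in-memory recorder that stands in for real Prometheus or OpenTelemetry instruments so that the example runs with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Outcome is what the shared wrapper records for each dependency call.
type Outcome struct {
	Consumer   string
	Dependency string
	Success    bool
	Latency    time.Duration
}

// Recorder is a hypothetical sink. A real implementation would increment
// the success and error counters and observe the latency histogram.
type Recorder interface {
	Record(Outcome)
}

// MemoryRecorder stores outcomes in memory; it exists only for this sketch.
type MemoryRecorder struct {
	mu       sync.Mutex
	Outcomes []Outcome
}

func (m *MemoryRecorder) Record(o Outcome) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.Outcomes = append(m.Outcomes, o)
}

// Tracker fixes the consumer name once, so call sites cannot misspell it.
type Tracker struct {
	Consumer string
	Rec      Recorder
	// IsSuccess documents in code what "success" means for this consumer.
	// When nil, any call returning a nil error counts as a success.
	IsSuccess func(error) bool
}

// Call runs fn against the named dependency and records exactly one outcome.
func (t *Tracker) Call(dependency string, fn func() error) error {
	start := time.Now()
	err := fn()
	success := err == nil
	if t.IsSuccess != nil {
		success = t.IsSuccess(err)
	}
	t.Rec.Record(Outcome{
		Consumer:   t.Consumer,
		Dependency: dependency,
		Success:    success,
		Latency:    time.Since(start),
	})
	return err
}

// errExample simulates a failing dependency call.
var errExample = errors.New("simulated timeout")

func main() {
	rec := &MemoryRecorder{}
	tr := &Tracker{Consumer: "order-service", Rec: rec}

	_ = tr.Call("payment-api", func() error { return nil })
	_ = tr.Call("payment-api", func() error { return errExample })

	for _, o := range rec.Outcomes {
		fmt.Printf("%s -> %s success=%t\n", o.Consumer, o.Dependency, o.Success)
	}
}
```

Centralizing the success rule in one place (`IsSuccess` here) is what keeps metrics comparable across teams: a fallback that hides a failure, or a retry that eventually succeeds, is classified the same way everywhere.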

XI. Why This Approach Works So Well

Because it relies on a simple principle:

The best measure of a system’s health is the real experience of the services that use it.

CRDH turns every consumer into a probe that is distributed, realistic, and largely self-maintaining, at the cost of a few counters per call.

It is the backend counterpart of user-reported health on the frontend, applied to microservices.

XII. Conclusion

CRDH represents a powerful shift in perspective:
it is no longer the responsibility of services to “prove” that they are alive —
it is their consumers who report what they actually observe.

It is simple.
It is robust.
It reflects reality.
And it significantly improves how we detect, diagnose, and resolve incidents in distributed architectures.
