Consumer-Reported Dependency Health

Posted on December 8, 2025 • 5 min read • 856 words
SRE   Monitoring   Prometheus   Devops   Thibault  

An in-depth exploration of the CRDH practice, where consumers become distributed probes that report the real health of their dependencies. A modern and reliable approach for monitoring, observability, and incident detection.

Photo by Thibault Deheurles

I. Rethinking How We Evaluate the Health of Distributed Systems  

In modern distributed architectures, a system’s health depends as much — if not more — on the state of its dependencies as on its own internal state. Yet most monitoring strategies still rely on synthetic or dedicated healthchecks: /health endpoints, liveness/readiness probes, external scripts, and similar mechanisms.

These techniques work, but they miss the essential point:
the real experience of the consumers.

There is a simpler, more robust, naturally distributed alternative:
Consumer-Reported Dependency Health (CRDH).


II. What Is CRDH?  

Consumer-Reported Dependency Health (CRDH) means that consumers themselves indicate whether a dependency is functioning properly, based directly on what they observe during their real calls.

The principle:

  • If a real call to a dependency succeeds → the consumer reports a success.
  • If a real call fails → the consumer reports a failure.
  • Metrics are exported automatically (Prometheus / OpenTelemetry).
  • The global view emerges through aggregation.

In practice, this creates a distributed health matrix that instantly reveals whether a problem is:

  • global (many consumers impacted),
  • local (one consumer affected),
  • configuration-related,
  • load-related,
  • tied to a specific call pattern.

III. Why Traditional Healthchecks Are Not Enough  

1. They do not reflect real traffic  

A /health endpoint often checks “SELECT 1”, “PING Redis”, or “GET /status”.
But real business traffic is far more complex (permissions, payloads, batching, etc.).

A service may appear “healthy” in a healthcheck but be unusable in reality.

2. They do not detect local issues  

A consumer may fail because of:

  • local DNS issues,
  • a firewall rule,
  • an expired secret,
  • an internal routing problem.

The dependency’s healthcheck will still show “OK”.

3. They require constant maintenance  

Each new service must:

  • write its probes,
  • maintain them,
  • ensure they test the real functionality.

CRDH removes most of this burden, because the instrumentation rides on calls the service already makes.


IV. The “Health Matrix”  

Thanks to CRDH metrics, a global view emerges:

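| Dependency  | Consumer A | Consumer B | Consumer C | Global Status        |
|-------------|------------|------------|------------|----------------------|
| Redis       | FAIL       | FAIL       | FAIL       | ❌ Global outage     |
| Payment API | OK         | FAIL       | OK         | ⚠️ Local issue (B)   |
| S3          | OK         | OK         | OK         | ✓ Healthy            |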

This is an extremely powerful tool for:

  • diagnosing faster,
  • avoiding false positives,
  • understanding the real scope of an incident,
  • prioritizing remediation efforts.
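
The "Global Status" column can be derived mechanically from one row of the matrix. The following Go sketch is only an illustration: the `Classify` function, the `Scope` values, and the rule that a row is "global" when every reporting consumer fails are assumptions, not part of a CRDH specification.

```go
package main

import "sort"

// Scope describes how widely a dependency problem is observed.
type Scope string

const (
	Healthy Scope = "healthy"
	Local   Scope = "local issue"
	Global  Scope = "global outage"
)

// Classify takes one row of the health matrix (consumer name -> true when
// that consumer currently sees the dependency as OK) and returns the scope
// together with the sorted names of the failing consumers.
// Assumption: "global" means every reporting consumer fails. With a single
// consumer, local and global problems cannot be told apart.
func Classify(row map[string]bool) (Scope, []string) {
	var failing []string
	for consumer, ok := range row {
		if !ok {
			failing = append(failing, consumer)
		}
	}
	sort.Strings(failing)

	switch {
	case len(failing) == 0:
		return Healthy, failing
	case len(failing) == len(row):
		return Global, failing
	default:
		return Local, failing
	}
}
```

For the Payment API row above, `Classify(map[string]bool{"A": true, "B": false, "C": true})` yields the local scope with `B` as the only failing consumer. A production rule would probably use a ratio or a minimum consumer count rather than "all or nothing".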

V. CRDH Technical Specification  

1. Recommended Prometheus Metrics  

Success  

crdh_dependency_success_total{
  consumer="order-service",
  dependency="payment-api",
  method="POST",
  status="200"
}

Failure  

crdh_dependency_error_total{
  consumer="order-service",
  dependency="payment-api",
  error="timeout",
  status="504"
}

2. Dependency Latency  

crdh_dependency_latency_ms_bucket{
  consumer="web",
  dependency="db",
  le="100"
}

3. Health Rate  

(always calculated with PromQL)

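100 * sum(rate(crdh_dependency_success_total[5m]))
/
(
  sum(rate(crdh_dependency_success_total[5m]))
  + sum(rate(crdh_dependency_error_total[5m]))
)

Each counter is summed before the addition because the success and error series carry different labels (method vs. error), so adding the raw rate() vectors would match no series and return an empty result.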
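
As a worked example of the same ratio outside PromQL, here is a small Go sketch. The function name `HealthRate` and the choice to report "no traffic" separately are assumptions made for illustration.

```go
package main

// HealthRate returns the percentage of successful calls over a window, given
// how much the success and error counters increased during that window.
// The second return value is false when there was no traffic at all, in which
// case the rate is undefined rather than 0% or 100%.
func HealthRate(successes, errors float64) (float64, bool) {
	total := successes + errors
	if total == 0 {
		return 0, false
	}
	return 100 * successes / total, true
}
```

With 980 successes and 20 errors in the window, `HealthRate(980, 20)` gives a health rate of 98%.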

VI. Example: Go Implementation  

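// Assumes CRDHSuccess and CRDHErrors are *prometheus.CounterVec and CRDHLatency
// is a *prometheus.HistogramVec, all declared with exactly the labels
// "consumer" and "dependency" (With panics if the label sets differ).
// Imports: "context", "time", "github.com/prometheus/client_golang/prometheus".
func CallPaymentAPI(ctx context.Context) error {
    start := time.Now()
    err := doRealPaymentCall(ctx) // the real business call, not a probe
    duration := time.Since(start)

    labels := prometheus.Labels{
        "consumer":   "order-service",
        "dependency": "payment-api",
    }

    if err != nil {
        CRDHErrors.With(labels).Inc()
    } else {
        CRDHSuccess.With(labels).Inc()
        // The histogram is named crdh_dependency_latency_ms, so observe
        // milliseconds rather than seconds.
        CRDHLatency.With(labels).Observe(float64(duration) / float64(time.Millisecond))
    }

    return err
}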

And that’s it. No dedicated healthcheck is required.


VII. Comparison: CRDH vs Common SRE Practices  

| Practice               | Advantages                                  | Limitations                              |
|------------------------|---------------------------------------------|------------------------------------------|
| Dedicated healthchecks | Simple to implement                         | Do not reflect real traffic              |
| Synthetic checks       | Great for external monitoring               | Limited business context                 |
| Distributed tracing    | Excellent granularity                       | Complex, requires heavy instrumentation  |
| CRDH                   | Realistic, scalable, simple, self-sustained | Traffic-based (requires minimal volume)  |

CRDH is not a replacement but a natural complement:
it adds the business context missing from traditional probes.


VIII. Alerting: A Smarter Model  

1. Local Alerts (only one consumer impacted)  

sum(rate(crdh_dependency_error_total{consumer="serviceA", dependency="redis"}[5m])) > 5

2. Global Alerts (multiple consumers impacted)  

count by (dependency) (
sum(rate(crdh_dependency_error_total[5m])) by (consumer, dependency) > 5
) > 2

3. Degradation Alerts (rising latency)  

histogram_quantile(0.95, sum by (le, consumer, dependency) (rate(crdh_dependency_latency_ms_bucket[5m]))) > 200
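
To make the degradation alert less opaque, the sketch below approximates how a quantile is estimated from cumulative `le` buckets by linear interpolation, which is the approach the Prometheus documentation describes for `histogram_quantile`. It is a simplified illustration, not the Prometheus implementation; the `Bucket` type, the `EstimateQuantile` name, and the sample bucket counts are assumptions.

```go
package main

import (
	"math"
	"sort"
)

// Bucket is one cumulative histogram bucket: Count observations were
// less than or equal to UpperBound (in milliseconds here).
type Bucket struct {
	UpperBound float64 // value of the "le" label; math.Inf(1) for +Inf
	Count      float64 // cumulative number of observations
}

// EstimateQuantile approximates the q-quantile (0 < q < 1) of a cumulative
// histogram. It assumes observations are spread evenly inside each bucket
// and that the lowest bucket starts at 0.
func EstimateQuantile(q float64, buckets []Bucket) float64 {
	if len(buckets) == 0 {
		return math.NaN()
	}
	sorted := append([]Bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].UpperBound < sorted[j].UpperBound
	})

	total := sorted[len(sorted)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total // how many observations lie at or below the quantile

	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range sorted {
		if b.Count >= rank {
			if math.IsInf(b.UpperBound, 1) {
				// The quantile falls in the open-ended bucket: the best
				// available answer is the highest finite bound.
				return lowerBound
			}
			// Interpolate linearly inside the bucket that contains the rank.
			return lowerBound + (b.UpperBound-lowerBound)*(rank-lowerCount)/(b.Count-lowerCount)
		}
		lowerBound, lowerCount = b.UpperBound, b.Count
	}
	return lowerBound
}
```

With hypothetical cumulative counts of 50 at `le="50"`, 90 at `le="100"`, and 100 at `le="250"`, the 95th observation sits halfway through the 100 to 250 ms bucket, so the estimate is about 175 ms and the `> 200` alert would not fire. The precision of the alert therefore depends on where the bucket boundaries are placed.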

IX. When to Adopt CRDH  

CRDH is particularly effective when:

  • multiple services consume the same dependencies,
  • you want to reduce alerting false positives,
  • you need to quickly understand incident scope,
  • you want to avoid maintaining custom healthchecks,
  • you need a robust multi-service monitoring model.

X. Limitations (and Solutions)  

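| Limitation                               | Solution                                                              |
|------------------------------------------|-----------------------------------------------------------------------|
| Low traffic → low metric quality         | Add light synthetic probing                                           |
| Unused dependencies remain invisible     | Add lightweight probes for rarely called dependencies                 |
| High cardinality due to too many labels  | Standardize the consumer and dependency label values, limit other labels |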
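
One way to keep label values standardized is to derive them in a single place. The Go sketch below is a hypothetical helper: the name `DependencyLabel`, the "unknown" fallback, and the decision to use only the host name are assumptions chosen for illustration.

```go
package main

import (
	"net/url"
	"strings"
)

// DependencyLabel derives a low-cardinality "dependency" label from a
// request URL.
// Assumption: the host name alone identifies the dependency. Port, path, and
// query string are dropped because each distinct value would create a new
// time series.
func DependencyLabel(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return "unknown"
	}
	return strings.ToLower(u.Hostname())
}
```

For instance, `DependencyLabel("https://Payment-API.internal:8443/v1/charges/42?retry=1")` returns `payment-api.internal`, regardless of which charge ID or query parameter was used.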

Code Knowledge Requirements  

CRDH assumes that developers understand the real behavior of their dependency calls: retries, timeouts, fallbacks, logical errors, and what constitutes true business success.
If not, CRDH metrics may become incorrect or misleading.

To avoid this, use a shared middleware or SDK that standardizes metric emission, and document clearly what counts as “success” and “failure” for each dependency.


XI. Why This Approach Works So Well  

Because it relies on a simple principle:

The best measure of a system’s health is the real experience of the services that use it.

CRDH turns every consumer into a distributed probe that is realistic, self-maintaining, and nearly free, since it reuses calls the service already makes.

It is the backend counterpart of user-reported health on the frontend, applied to microservices.


XII. Conclusion  

CRDH represents a powerful shift in perspective:
it is no longer the responsibility of services to “prove” that they are alive —
it is their consumers who report what they actually observe.

It is simple.
It is robust.
It reflects reality.
And it significantly improves how we detect, diagnose, and resolve incidents in distributed architectures.


🔗 Useful Links  

  • Prometheus — Best Practices
  • OpenTelemetry — Metrics Specification
  • OpenTelemetry Collector