This repository was archived by the owner on Nov 7, 2022. It is now read-only.

Commit b5a7ae3

fivesheep authored and songy23 committed
Fix starttime and summary has no quantiles for prometheus receiver (#597)
* To address start time for counter/histogram/summary, and take deltas as values for these types. It also fixes the issue when summary has no quantiles, allow it to produce a summary with nil snapshots
* rephase README as per comment
* fix concurrent issue on adjuster and add more integration tests
* test again
* Restructure end to end test with comments and more
* remove empty tests
1 parent 7b57179 commit b5a7ae3

File tree

10 files changed

+2506
-1550
lines changed

receiver/prometheusreceiver/README.md

Lines changed: 96 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -176,7 +176,7 @@ object of OpenCensus. The target object is not accessible from the Appender inte
176176
ocagent appender, we need to have a way to inject the binding target into the appender instance.
177177

178178

179-
3. Group metrics from the same family together
179+
2. Group metrics from the same family together
180180

181181
In OpenCensus, metric points of the same name are usually grouped together as one timeseries but different data points.
182182
It's important for the appender to keep track of the metric family changes, and group metrics of the same family together.
@@ -186,17 +186,24 @@ and `summary`, not all the data points have the same name, there are some specia
186186
we need to handle this properly and not treat it as a metric family change.
187187

188188

189-
4. Group complex metrics such as histogram together in proper order
189+
3. Group complex metrics such as histogram together in proper order
190190

191191
In Prometheus, a single aggregated metric type such as `histogram` or `summary` is represented by multiple metric data points, such as
192192
buckets and quantiles as well as the additional `_sum` and `_count` data. ScrapeLoop will feed them into the appender individually. The ocagent
193193
appender needs a way to bundle them together and transform them into a single Metric Datapoint Proto object.
194194

195-
5. Tags need to handle carefully
195+
4. Tags need to handle carefully
196196

197197
ScrapeLoop strips out any tag with an empty value. However, in OpenCensus the tag keys are stored separately, so we need to be able to get all the possible tag keys
198198
of the same metric family before committing the metric family to the sink.
199199

200+
5. StartTimestamp and values of metrics of cumulative types
201+
202+
In OpenCensus, every metric of a cumulative type is required to have a StartTimestamp, which records when the metric was first recorded. However, Prometheus
203+
does not provide such data. One solution is to cache the first observed value of each such metric along with
204+
its timestamp; then, for any subsequent data point of the same metric, use the cached timestamp as the StartTimestamp and the delta from the first value as the value.
205+
However, metrics can come and go, and the remote server can restart at any time, so the receiver also needs to handle cases such as a new value being
206+
smaller than the previously seen value, by treating it as a metric with a new StartTimestamp.
200207

201208
## Prometheus Metric to OpenCensus Metric Proto Mapping
202209

@@ -232,16 +239,27 @@ Counter as described in the [Prometheus Metric Types Document](https://prometheu
232239
> is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be
233240
> reset to zero on restart
234241
235-
It is one of the most simple metric types we can be found in both systems. Examples of Prometheus Counters is as shown
236-
below:
242+
It is one of the simplest metric types found in both systems. However, it is a cumulative type of metric.
243+
Consider two consecutive scrapes from a target, with the first one as shown below:
237244
```
238245
# HELP http_requests_total The total number of HTTP requests.
239246
# TYPE http_requests_total counter
240247
http_requests_total{method="post",code="200"} 1027
241248
http_requests_total{method="post",code="400"} 3
242249
```
243250

244-
The corresponding Ocagent Metric will be:
251+
and the 2nd one:
252+
```
253+
# HELP http_requests_total The total number of HTTP requests.
254+
# TYPE http_requests_total counter
255+
http_requests_total{method="post",code="200"} 1028
256+
http_requests_total{method="post",code="400"} 5
257+
```
258+
259+
The Prometheus Receiver produces a Metric only from the 2nd and any subsequent scrapes; the 1st scrape
260+
is stored as metadata from which deltas are taken.
261+
262+
The output of the 2nd scrape is as shown below:
245263
```go
246264
metrics := []*metricspb.Metric{
247265
{
@@ -251,26 +269,25 @@ metrics := []*metricspb.Metric{
251269
LabelKeys: []*metricspb.LabelKey{{Key: "method"}, {Key: "code"}}},
252270
Timeseries: []*metricspb.TimeSeries{
253271
{
254-
StartTimestamp: tsOc,
272+
StartTimestamp: startTimestamp,
255273
LabelValues: []*metricspb.LabelValue{{Value: "post", HasValue: true}, {Value: "200", HasValue: true}},
256274
Points: []*metricspb.Point{
257-
{Timestamp: tsOc, Value: &metricspb.Point_DoubleValue{DoubleValue: 1027.0}},
275+
{Timestamp: currentTimestamp, Value: &metricspb.Point_DoubleValue{DoubleValue: 1.0}},
258276
},
259277
},
260278
{
261-
StartTimestamp: tsOc,
279+
StartTimestamp: startTimestamp,
262280
LabelValues: []*metricspb.LabelValue{{Value: "post", HasValue: true}, {Value: "400", HasValue: true}},
263281
Points: []*metricspb.Point{
264-
{Timestamp: tsOc, Value: &metricspb.Point_DoubleValue{DoubleValue: 3.0}},
282+
{Timestamp: currentTimestamp, Value: &metricspb.Point_DoubleValue{DoubleValue: 2.0}},
265283
},
266284
},
267285
},
268286
},
269287
}
270288
```
271289

272-
*Note: `tsOc` is a timestamp object, which is based on the timestamp provided by a scrapLoop. In most cases, it is
273-
the timestamp when a target is scrapped, however, it can also be the timestamp recorded with a metric*
290+
*Note: `startTimestamp` is the timestamp cached from the first scrape; `currentTimestamp` is the timestamp of the current scrape.*
274291

275292

276293
### Gauge
@@ -299,17 +316,17 @@ metrics := []*metricspb.Metric{
299316
LabelKeys: []*metricspb.LabelKey{{Key: "id"}, {Key: "foo"}}},
300317
Timeseries: []*metricspb.TimeSeries{
301318
{
302-
StartTimestamp: tsOc,
319+
StartTimestamp: nil,
303320
LabelValues: []*metricspb.LabelValue{{Value: "1", HasValue: true}, {Value: "bar", HasValue: true}},
304321
Points: []*metricspb.Point{
305-
{Timestamp: tsOc, Value: &metricspb.Point_DoubleValue{DoubleValue: 1.0}},
322+
{Timestamp: currentTimestamp, Value: &metricspb.Point_DoubleValue{DoubleValue: 1.0}},
306323
},
307324
},
308325
{
309-
StartTimestamp: tsOc,
326+
StartTimestamp: nil,
310327
LabelValues: []*metricspb.LabelValue{{Value: "2", HasValue: true}, {Value: "", HasValue: false}},
311328
Points: []*metricspb.Point{
312-
{Timestamp: tsOc, Value: &metricspb.Point_DoubleValue{DoubleValue: 2.0}},
329+
{Timestamp: currentTimestamp, Value: &metricspb.Point_DoubleValue{DoubleValue: 2.0}},
313330
},
314331
},
315332
},
@@ -322,7 +339,10 @@ metrics := []*metricspb.Metric{
322339
Histogram is a complex data type, in Prometheus, it uses multiple data points to represent a single histogram. Its
323340
description can be found from: [Prometheus Histogram](https://prometheus.io/docs/concepts/metric_types/#histogram).
324341

325-
An example of histogram is as shown below:
342+
Like counter, histogram is a cumulative metric type, so only the 2nd and subsequent scrapes produce a metric for OpenCensus,
343+
with the first scrape stored as metadata.
344+
345+
An example of histogram with first scrape response:
326346
```
327347
# HELP hist_test This is my histogram vec
328348
# TYPE hist_test histogram
@@ -339,6 +359,24 @@ hist_test_count{t1="2"} 100.0
339359
340360
```
341361

362+
And a subsequent 2nd scrape response:
363+
```
364+
# HELP hist_test This is my histogram vec
365+
# TYPE hist_test histogram
366+
hist_test_bucket{t1="1",le="10.0"} 2.0
367+
hist_test_bucket{t1="1",le="20.0"} 6.0
368+
hist_test_bucket{t1="1",le="+inf"} 13.0
369+
hist_test_sum{t1="1"} 150.0
370+
hist_test_count{t1="1"} 13.0
371+
hist_test_bucket{t1="2",le="10.0"} 10.0
372+
hist_test_bucket{t1="2",le="20.0"} 30.0
373+
hist_test_bucket{t1="2",le="+inf"} 100.0
374+
hist_test_sum{t1="2"} 10000.0
375+
hist_test_count{t1="2"} 100.0
376+
377+
```
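
To make the delta arithmetic for the `t1="1"` series concrete, here is a small sketch. It assumes first-scrape cumulative bucket counts of 1, 3 and 10, inferred from the per-bucket counts 1, 2, 7 in the previous expected output; the helper names `perBucket` and `bucketDeltas` are hypothetical and not part of the receiver. Prometheus `le` buckets are cumulative, so they are first converted to per-bucket counts before the first scrape is subtracted.

```go
package main

import "fmt"

// perBucket converts Prometheus cumulative `le` bucket counts into the
// per-bucket counts used by an OpenCensus distribution.
func perBucket(cumulative []int64) []int64 {
	out := make([]int64, len(cumulative))
	var prev int64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

// bucketDeltas subtracts the first scrape's per-bucket counts from the
// current scrape's. It returns nil if the bucket layouts differ in length.
func bucketDeltas(first, current []int64) []int64 {
	if len(first) != len(current) {
		return nil
	}
	f, c := perBucket(first), perBucket(current)
	out := make([]int64, len(c))
	for i := range c {
		out[i] = c[i] - f[i]
	}
	return out
}

func main() {
	first := []int64{1, 3, 10}  // assumed first scrape: le=10, le=20, le=+inf
	second := []int64{2, 6, 13} // second scrape for t1="1"
	fmt.Println(bucketDeltas(first, second)) // [1 2 0]
}
```

The count and sum deltas follow the same pattern: 13 - 10 = 3 and 150 - 100 = 50 for `t1="1"`, while the `t1="2"` series is unchanged between scrapes and therefore yields zeros.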
378+
379+
342380
Its corresponding Ocagent metrics will be:
343381
```go
344382
metrics := []*metricspb.Metric{
@@ -349,10 +387,10 @@ metrics := []*metricspb.Metric{
349387
LabelKeys: []*metricspb.LabelKey{{Key: "t1"}}},
350388
Timeseries: []*metricspb.TimeSeries{
351389
{
352-
StartTimestamp: tsOc,
390+
StartTimestamp: startTimestamp,
353391
LabelValues: []*metricspb.LabelValue{{Value: "1", HasValue: true}},
354392
Points: []*metricspb.Point{
355-
{Timestamp: tsOc, Value: &metricspb.Point_DistributionValue{
393+
{Timestamp: currentTimestamp, Value: &metricspb.Point_DistributionValue{
356394
DistributionValue: &metricspb.DistributionValue{
357395
BucketOptions: &metricspb.DistributionValue_BucketOptions{
358396
Type: &metricspb.DistributionValue_BucketOptions_Explicit_{
@@ -361,17 +399,17 @@ metrics := []*metricspb.Metric{
361399
},
362400
},
363401
},
364-
Count: 10,
365-
Sum: 100.0,
366-
Buckets: []*metricspb.DistributionValue_Bucket{{Count: 1}, {Count: 2}, {Count: 7}},
402+
Count: 3,
403+
Sum: 50.0,
404+
Buckets: []*metricspb.DistributionValue_Bucket{{Count: 1}, {Count: 2}, {Count: 0}},
367405
}}},
368406
},
369407
},
370408
{
371-
StartTimestamp: tsOc,
409+
StartTimestamp: startTimestamp,
372410
LabelValues: []*metricspb.LabelValue{{Value: "2", HasValue: true}},
373411
Points: []*metricspb.Point{
374-
{Timestamp: tsOc, Value: &metricspb.Point_DistributionValue{
412+
{Timestamp: currentTimestamp, Value: &metricspb.Point_DistributionValue{
375413
DistributionValue: &metricspb.DistributionValue{
376414
BucketOptions: &metricspb.DistributionValue_BucketOptions{
377415
Type: &metricspb.DistributionValue_BucketOptions_Explicit_{
@@ -380,9 +418,9 @@ metrics := []*metricspb.Metric{
380418
},
381419
},
382420
},
383-
Count: 100,
384-
Sum: 10000.0,
385-
Buckets: []*metricspb.DistributionValue_Bucket{{Count: 10}, {Count: 20}, {Count: 70}},
421+
Count: 0,
422+
Sum: 0.0,
423+
Buckets: []*metricspb.DistributionValue_Bucket{{Count: 0}, {Count: 0}, {Count: 0}},
386424
}}},
387425
},
388426
},
@@ -403,12 +441,19 @@ Prometheus. We have to set this value to `0` instead.
403441

404442
### Gaugehistogram
405443

406-
This is an undocumented data type, it shall be same as regular [Histogram](#histogram)
444+
This is an undocumented data type, and it's not supported currently
407445

408446
### Summary
409447

410448
Like histogram, summary is a complex metric type that is represented by multiple data points. A detailed
411449
description can be found from [Prometheus Summary](https://prometheus.io/docs/concepts/metric_types/#summary)
450+
451+
The sum and count of a Summary are also cumulative; the quantiles, however, are not. The receiver still treats the first scrape
452+
as metadata and does not produce an output for OpenCensus. For any subsequent scrape, the count and sum are deltas from the first scrape,
453+
while the quantiles are left as they are.
454+
455+
For the following two scrapes, with the first one:
456+
412457
```
413458
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
414459
# TYPE go_gc_duration_seconds summary
@@ -421,6 +466,19 @@ go_gc_duration_seconds_sum 17.391350544
421466
go_gc_duration_seconds_count 52489
422467
```
423468

469+
And the 2nd one:
470+
```
471+
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
472+
# TYPE go_gc_duration_seconds summary
473+
go_gc_duration_seconds{quantile="0"} 0.0001271
474+
go_gc_duration_seconds{quantile="0.25"} 0.0002455
475+
go_gc_duration_seconds{quantile="0.5"} 0.0002904
476+
go_gc_duration_seconds{quantile="0.75"} 0.0003426
477+
go_gc_duration_seconds{quantile="1"} 0.0023639
478+
go_gc_duration_seconds_sum 17.491350544
479+
go_gc_duration_seconds_count 52490
480+
```
481+
424482
The corresponding Ocagent metrics is as shown below:
425483

426484
```go
@@ -432,20 +490,20 @@ metrics := []*metricspb.Metric{
432490
LabelKeys: []*metricspb.LabelKey{}},
433491
Timeseries: []*metricspb.TimeSeries{
434492
{
435-
StartTimestamp: tsOc,
493+
StartTimestamp: startTimestamp,
436494
LabelValues: []*metricspb.LabelValue{},
437495
Points: []*metricspb.Point{
438-
{Timestamp: tsOc, Value: &metricspb.Point_SummaryValue{
496+
{Timestamp: currentTimestamp, Value: &metricspb.Point_SummaryValue{
439497
SummaryValue: &metricspb.SummaryValue{
440-
Sum: &wrappers.DoubleValue{Value: 17.391350544},
441-
Count: &wrappers.Int64Value{Value: 52489},
498+
Sum: &wrappers.DoubleValue{Value: 0.1},
499+
Count: &wrappers.Int64Value{Value: 1},
442500
Snapshot: &metricspb.SummaryValue_Snapshot{
443501
PercentileValues: []*metricspb.SummaryValue_Snapshot_ValueAtPercentile{
444502
{Percentile: 0.0, Value: 0.0001271},
445503
{Percentile: 25.0, Value: 0.0002455},
446504
{Percentile: 50.0, Value: 0.0002904},
447505
{Percentile: 75.0, Value: 0.0003426},
448-
{Percentile: 100.0, Value: 0.0023638},
506+
{Percentile: 100.0, Value: 0.0023639},
449507
},
450508
}}}},
451509
},
@@ -456,10 +514,13 @@ metrics := []*metricspb.Metric{
456514

457515
```
458516

459-
The major difference between the two systems is that in Prometheus it uses `quantile`, while in OpenCensus `percentile`
460-
is used. Other than that, OpenCensus has optional values for `Sum` and `Count` of a snapshot, however, they are not
517+
There are also some differences between the two systems. One is that Prometheus uses `quantile`, while OpenCensus uses `percentile`.
518+
In addition, OpenCensus has optional values for the `Sum` and `Count` of a snapshot; Prometheus does not provide them,
461519
so `nil` is used for these values.
462520

521+
Finally, some Prometheus implementations, such as the Python version, allow a Summary to have no quantiles. In that
522+
case, the receiver produces an OpenCensus Summary with Snapshot set to `nil`.
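
A rough sketch of this mapping is shown below. The types here are simplified, hypothetical stand-ins for the `metricspb` summary types (they are not the real proto structs), and `buildSummary` is an illustrative name. It shows quantiles scaled to percentiles, and the snapshot left `nil` when a scrape carries no quantiles.

```go
package main

import "fmt"

// Simplified stand-ins for the OpenCensus summary proto types (assumption).
type valueAtPercentile struct {
	Percentile float64
	Value      float64
}

type snapshot struct {
	PercentileValues []valueAtPercentile
}

type summaryValue struct {
	Sum      float64
	Count    int64
	Snapshot *snapshot
}

// quantileSample is one `quantile="..."` data point from a Prometheus scrape.
type quantileSample struct {
	Quantile float64
	Value    float64
}

// buildSummary takes the sum and count deltas plus the raw quantiles of the
// current scrape. With no quantiles, Snapshot is left nil.
func buildSummary(sumDelta float64, countDelta int64, qs []quantileSample) summaryValue {
	sv := summaryValue{Sum: sumDelta, Count: countDelta}
	if len(qs) == 0 {
		return sv
	}
	snap := &snapshot{}
	for _, q := range qs {
		// Prometheus quantiles are in [0, 1]; OpenCensus percentiles are in [0, 100].
		snap.PercentileValues = append(snap.PercentileValues,
			valueAtPercentile{Percentile: q.Quantile * 100, Value: q.Value})
	}
	sv.Snapshot = snap
	return sv
}

func main() {
	withQuantiles := buildSummary(0.1, 1, []quantileSample{
		{Quantile: 0.25, Value: 0.0002455},
		{Quantile: 0.5, Value: 0.0002904},
	})
	fmt.Println(len(withQuantiles.Snapshot.PercentileValues))

	noQuantiles := buildSummary(0.1, 1, nil)
	fmt.Println(noQuantiles.Snapshot == nil)
}
```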
523+
463524
### Others
464525

465526
Any other Prometheus metric types will be mapped to the [Gauge](#gauge) type of Ocagent

receiver/prometheusreceiver/internal/internal_test.go

Lines changed: 7 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,6 @@ import (
2121
"github.com/prometheus/prometheus/pkg/labels"
2222
"github.com/prometheus/prometheus/scrape"
2323
"go.uber.org/zap"
24-
"sync"
2524
)
2625

2726
// test helpers
@@ -39,6 +38,10 @@ type mockMetadataCache struct {
3938
data map[string]scrape.MetricMetadata
4039
}
4140

41+
func newMockMetadataCache(data map[string]scrape.MetricMetadata) *mockMetadataCache {
42+
return &mockMetadataCache{data: data}
43+
}
44+
4245
func (m *mockMetadataCache) Metadata(metricName string) (scrape.MetricMetadata, bool) {
4346
mm, ok := m.data[metricName]
4447
return mm, ok
@@ -49,20 +52,15 @@ func (m *mockMetadataCache) SharedLabels() labels.Labels {
4952
}
5053

5154
func newMockConsumer() *mockConsumer {
52-
return &mockConsumer{
53-
Metrics: make(chan *data.MetricsData, 1),
54-
}
55+
return &mockConsumer{}
5556
}
5657

5758
type mockConsumer struct {
58-
Metrics chan *data.MetricsData
59-
consumOnce sync.Once
59+
md *data.MetricsData
6060
}
6161

6262
func (m *mockConsumer) ConsumeMetricsData(ctx context.Context, md data.MetricsData) error {
63-
m.consumOnce.Do(func() {
64-
m.Metrics <- &md
65-
})
63+
m.md = &md
6664
return nil
6765
}
6866
