
Issue 4896 traceql metrics fixes #4962

Merged

Conversation

@ruslan-mikhailov (Contributor) commented on Apr 7, 2025

What this PR does

The start and end of the trace in the generator are converted to seconds, dropping the fractional part.

// Convert from unix nanos to unix seconds
startSeconds := uint32(start / uint64(time.Second))
endSeconds := uint32(end / uint64(time.Second))

err := p.headBlock.AppendTrace(id, tr, startSeconds, endSeconds, p.Cfg.AdjustTimeRangeForSlack)
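
For illustration (a standalone sketch, not code from this PR), the integer division truncates towards zero, so any sub-second part of the timestamp is simply dropped:

start := uint64(1_700_000_000_700_000_000)          // unix nanos, 0.7s into the second
startSeconds := uint32(start / uint64(time.Second)) // 1_700_000_000, the 0.7s is lost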

Additionally, blocks' start and end times are in seconds.


Bug 1

The search over blocks was exclusive, but it should include both edges. See

func blockMetasForSearch(allBlocks []*backend.BlockMeta, start, end time.Time, filterFn func(m *backend.BlockMeta) bool) []*backend.BlockMeta {
as a reference
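
A minimal sketch of the inclusive comparison this refers to (a hypothetical helper, assuming BlockMeta's StartTime and EndTime fields; not the Tempo function above):

// A block should be searched when its [StartTime, EndTime] range touches the
// requested window on either edge (inclusive on both sides).
func blockOverlaps(m *backend.BlockMeta, start, end time.Time) bool {
	return !m.StartTime.After(end) && !m.EndTime.Before(start)
}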

Bug 2. Losing the last trace in the block.


It filters here:

			if e.checkTime {
				st := s.StartTimeUnixNanos()
				if st < e.start || st >= e.end {
					continue
				}
			}

To fix it, the right border needs to be included, but only when the request's time range fully covers the right edge of the block.
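
A minimal sketch of that idea (a hypothetical helper, not the exact code in this PR), assuming reqEnd and blockEndNanos are unix-nanosecond timestamps:

// When the request fully covers the block's right edge, widen the exclusive
// upper bound by one second so spans in the block's last (second-truncated)
// instant are not dropped by the `st >= end` check above.
func effectiveEnd(reqEnd, blockEndNanos uint64) uint64 {
	if reqEnd >= blockEndNanos {
		return blockEndNanos + uint64(time.Second)
	}
	return reqEnd
}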

Bug 3. Lonely trace in a block.

It is possible to have a block with start == end; this can be reproduced when only one trace was pushed and flushed.

This results in the block being skipped here:

func (e *MetricsEvaluator) Do(ctx context.Context, f SpansetFetcher, fetcherStart, fetcherEnd uint64) error {
	// Make a copy of the request so we can modify it.
	storageReq := *e.storageReq

	if fetcherStart > 0 && fetcherEnd > 0 &&
		fetcherStart != fetcherEnd {
			overlap := timeRangeOverlap(e.start, e.end, fetcherStart, fetcherEnd)
			if overlap == 0.0 { // <------ overlap will be zero
				// This shouldn't happen but might as well check.
				// No overlap == nothing to do
				return nil
			}
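
A minimal sketch of one way to handle this case (hypothetical, not necessarily the exact fix in this PR): treat a start == end block as spanning at least one second before computing the overlap, so the equality guard above no longer discards it:

// start and end are the block's unix-nanosecond bounds at second precision.
// A block holding a single trace can have start == end; widening it to a full
// second keeps timeRangeOverlap from returning zero.
func normalizeBlockRange(start, end uint64) (uint64, uint64) {
	if start == end {
		end = start + uint64(time.Second)
	}
	return start, end
}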

Bug 4. Exemplars are not filtered.

		for _, s := range ss.Spans {
			if e.checkTime {
				st := s.StartTimeUnixNanos()
				if st < e.start || st >= e.end {
					continue
				}
			}

			e.spansTotal++

			e.metricsPipeline.observe(s)
		}

		if len(ss.Spans) > 0 && e.sampleExemplar(ss.TraceID) {
			e.metricsPipeline.observeExemplar(ss.Spans[rand.Intn(len(ss.Spans))])
		}

This results in exemplars being included from outside of the requested time range.


Which issue(s) this PR fixes

Fixes #4896

How it has been tested

Vulture

It was checked by #4886

  1. one binary docker compose (https://github.com/grafana/tempo/tree/main/example/docker-compose/local)
  2. distributed docker compose (https://github.com/grafana/tempo/tree/main/example/docker-compose/distributed)
  3. rhythm docker compose (https://github.com/grafana/tempo/tree/main/example/docker-compose/ingest-storage)
  4. long run in k8s with rhythm enabled

Autotest

  1. e2e test for a single trace
  2. unit test for 2 traces to test border conditions

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ruslan-mikhailov force-pushed the issue-4896_traceql-metrics-fix branch 2 times, most recently from f7bc27e to f294067 on April 7, 2025 15:56
@ruslan-mikhailov force-pushed the issue-4896_traceql-metrics-fix branch from f294067 to ca9f6b9 on April 7, 2025 19:01
@ruslan-mikhailov force-pushed the issue-4896_traceql-metrics-fix branch from ca9f6b9 to 7c3c219 on April 7, 2025 19:46
@ruslan-mikhailov changed the title from "WIP: Issue 4896 traceql metrics fix" to "Issue 4896 traceql metrics fix" on Apr 7, 2025
@@ -1047,7 +1054,8 @@ func (e *MetricsEvaluator) Do(ctx context.Context, f SpansetFetcher, fetcherStar

e.mtx.Lock()

for _, s := range ss.Spans {
filteredSpansIndex := make([]int, 0, len(ss.Spans)) // optimistic preallocation

Contributor commented:

Need to watch the performance here of maintaining another slice for every span. A few thoughts: it is worth losing some exemplar "fairness" here in order to keep performance high if needed, and also check e.sampleExemplar earlier to skip over extra work when we already have enough exemplars. I have a feeling that it would be faster to iterate ss.Spans twice, with a second pass looking for exemplars, than to populate another slice.

@ruslan-mikhailov (Contributor Author) commented on Apr 8, 2025:

You're right, it gives a ~10-20% degradation in CPU time. I'll change it to a more optimal solution.

Benchmarks.

Benchmark function
func BenchmarkExperimentA(b *testing.B) {

  spans := make([]Span, 0, 1000)
  for i := 0; i < 1000; i++ {
  	spans = append(spans, newMockSpan(nil).WithStartTime(uint64(i)).WithSpanString("foo", "bar").WithSpanFloat("kafka.lag", randFloat(1e10, 1e20)).WithDuration(100))
  }

  req := &tempopb.QueryRangeRequest{
  	Start: uint64(1 * time.Second),
  	End:   uint64(3 * time.Second),
  	Step:  uint64(1 * time.Second),
  	Query: "{ } | count_over_time()",
  }
  engine := NewEngine()
  e, _ := engine.CompileMetricsQueryRange(req, 0, 0, false)
  e.maxExemplars = 100000000000

  for b.Loop() {
  	targetA(e, spans)
  }
}

BenchmarkExperimentOriginal-12 51366 22503 ns/op 0 B/op 0 allocs/op

Original
func targetOriginal(e *MetricsEvaluator, spans []Span) {
  for _, s := range spans {
  	if e.checkTime {
  		st := s.StartTimeUnixNanos()
  		if st < e.start || st >= e.end {
  			continue
  		}
  	}

  	e.spansTotal++

  	e.metricsPipeline.observe(s)
  }

  if len(spans) > 0 && e.sampleExemplar(spans[0].ID()) {
  	e.metricsPipeline.observeExemplar(spans[randv1.Intn(len(spans))])
  }
}

BenchmarkExperimentA-12 46334 24571 ns/op 8193 B/op 1 allocs/op

Current version
func targetA(e *MetricsEvaluator, spans []Span) {
  filteredSpansIndex := make([]int, 0, len(spans)) // optimistic preallocation
  for i, s := range spans {
  	if e.checkTime {
  		st := s.StartTimeUnixNanos()
  		if st < e.start || st >= e.end {
  			continue
  		}
  	}

  	e.spansTotal++
  	filteredSpansIndex = append(filteredSpansIndex, i)

  	e.metricsPipeline.observe(s)
  }

  if len(filteredSpansIndex) > 0 && e.sampleExemplar(spans[0].ID()) {
  	i := filteredSpansIndex[randv1.Intn(len(filteredSpansIndex))]
  	e.metricsPipeline.observeExemplar(spans[i])
  }
}

BenchmarkExperimentB-12 51210 22709 ns/op 0 B/op 0 allocs/op

PoC of optimised solution. Note that this solution does not guarantee a fair probability distribution.
func targetB(e *MetricsEvaluator, spans []Span) {

  var validSpansCount int
  var randomSpanIndex int

  needExemplar := e.maxExemplars > 0 && e.sampleExemplar(spans[0].ID())

  for _, s := range spans {
  	if e.checkTime {
  		st := s.StartTimeUnixNanos()
  		if st < e.start || st >= e.end {
  			continue
  		}
  	}

  	e.spansTotal++
  	validSpansCount++
  	e.metricsPipeline.observe(s)
  }

  if needExemplar {
  	for i, s := range spans {
  		if e.checkTime {
  			st := s.StartTimeUnixNanos()
  			if st < e.start || st >= e.end {
  				continue
  			}
  			if needExemplar && (validSpansCount == 1 || randv1.Intn(validSpansCount) == 0) {
  				randomSpanIndex = i
  			}
  		}
  	}
  	// If we found valid spans and need to sample an exemplar, use the randomly selected span
  	if validSpansCount > 0 && needExemplar {
  		e.metricsPipeline.observeExemplar(spans[randomSpanIndex])
  	}
  }
}

@ruslan-mikhailov (Contributor Author) commented on Apr 8, 2025:

Applied the reservoir sampling algorithm. Benchmark results:

// BenchmarkExperimentOriginal-12 538930 2220 ns/op 0 B/op 0 allocs/op
// BenchmarkExperimentA-12 480423 2505 ns/op 896 B/op 1 allocs/op
// BenchmarkExperimentB-12 468480 2255 ns/op 0 B/op 0 allocs/op
// BenchmarkExperimentC-12 526160 2271 ns/op 0 B/op 0 allocs/op

Let me know if I need to take the optimisation any further.

Source code of the benchmark
func targetC(e *MetricsEvaluator, spans []Span) {
  var validSpansCount int
  var randomSpanIndex int

  needExemplar := e.maxExemplars > 0 && e.sampleExemplar(spans[0].ID())

  for i, s := range spans {
  	if e.checkTime {
  		st := s.StartTimeUnixNanos()
  		if st < e.start || st >= e.end {
  			continue
  		}
  	}

  	validSpansCount++
  	e.metricsPipeline.observe(s)

  	if !needExemplar {
  		continue
  	}

  	if needExemplar && (validSpansCount == 1 || randv1.Intn(validSpansCount) == 0) {
  		randomSpanIndex = i
  	}
  }
  e.spansTotal += uint64(validSpansCount)

  if needExemplar && validSpansCount > 0 {
  	e.metricsPipeline.observeExemplar(spans[randomSpanIndex])
  }
}

@ruslan-mikhailov (Contributor Author) commented on Apr 8, 2025:

Experimentation shows that the probability distribution is flat:

>>> import random
>>> from collections import Counter
>>> def pickme(limit):
...     res = 0
...     for ii in range(limit):
...         if random.randint(0, ii) == 0:
...             res = ii
...     return res
...
>>> c = Counter()

>>> for i in range(1000000):
...     c[pickme(100)] += 1
...     
>>> c
Counter({27: 10378, 18: 10311, 79: 10307, 69: 10288, 67: 10278, 82: 10275, 92: 10259, 6: 10256, 73: 10243, 34: 10224, 83: 10219, 31: 10210, 43: 10210, 53: 10207, 21: 10199, 36: 10193, 14: 10187, 17: 10187, 46: 10183, 99: 10182, 98: 10170, 7: 10165, 10: 10160, 71: 10159, 4: 10156, 64: 10153, 76: 10151, 48: 10149, 35: 10145, 41: 10145, 59: 10141, 63: 10138, 51: 10137, 72: 10137, 29: 10137, 84: 10136, 26: 10135, 42: 10133, 13: 10131, 45: 10131, 39: 10124, 55: 10119, 38: 10119, 40: 10119, 85: 10115, 1: 10112, 81: 10104, 47: 10102, 5: 10101, 15: 10099, 89: 10095, 61: 10087, 23: 10086, 22: 10086, 86: 10083, 93: 10082, 95: 10082, 28: 10082, 16: 10080, 94: 10079, 96: 10077, 32: 10076, 97: 10074, 77: 10073, 37: 10071, 65: 10069, 60: 10066, 9: 10064, 30: 10061, 75: 10058, 74: 10054, 3: 10053, 57: 10050, 68: 10050, 25: 10049, 33: 10045, 62: 10042, 56: 10040, 49: 10038, 78: 10036, 8: 10036, 88: 10033, 19: 10025, 0: 10020, 2: 10011, 90: 10008, 54: 9998, 70: 9997, 50: 9991, 91: 9987, 44: 9977, 24: 9977, 12: 9965, 52: 9928, 11: 9921, 87: 9921, 80: 9914, 58: 9899, 66: 9861, 20: 9724})
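
For reference, the flat distribution also follows analytically: position i (0-based) is picked at its own step with probability 1/(i+1) and survives each later step j with probability j/(j+1), so the probability that it is the final pick is 1/(i+1) · ∏_{j=i+1}^{n-1} j/(j+1) = 1/n for every position.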

@ruslan-mikhailov (Contributor Author) commented:

Answer to #4962 (review).

For a ~500 MB block the results are much the same; the only significant difference is that the unoptimised solution requires more allocations. The baseline is the code from the main branch.

benchstat
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/tempodb/encoding/vparquet4
cpu: Apple M4 Pro
                                                                                                          │  main.txt   │            optimised.txt            │           unoptimised.txt           │
                                                                                                          │   sec/op    │   sec/op     vs base                │   sec/op     vs base                │
BackendBlockQueryRange/{}_|_rate()/5-12                                                                     12.61 ± ∞ ¹   12.51 ± ∞ ¹       ~ (p=1.000 n=1) ²   12.63 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-12                                                           20.22 ± ∞ ¹   19.46 ± ∞ ¹       ~ (p=1.000 n=1) ²   19.99 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-12                                          17.12 ± ∞ ¹   16.50 ± ∞ ¹       ~ (p=1.000 n=1) ²   16.85 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-12                                                  27.09 ± ∞ ¹   26.27 ± ∞ ¹       ~ (p=1.000 n=1) ²   27.10 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-12                                2.115 ± ∞ ¹   2.020 ± ∞ ¹       ~ (p=1.000 n=1) ²   2.155 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-12   7.618 ± ∞ ¹   8.031 ± ∞ ¹       ~ (p=1.000 n=1) ²   7.776 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{status=error}_|_rate()/5-12                                                         2.276 ± ∞ ¹   2.261 ± ∞ ¹       ~ (p=1.000 n=1) ²   2.291 ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                                                                                     8.875         8.732        -1.61%                   8.901        +0.29%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                                                                                          │  main.txt   │            optimised.txt            │           unoptimised.txt           │
                                                                                                          │  MB_IO/op   │  MB_IO/op    vs base                │  MB_IO/op    vs base                │
BackendBlockQueryRange/{}_|_rate()/5-12                                                                     22.58 ± ∞ ¹   22.58 ± ∞ ¹       ~ (p=1.000 n=1) ²   22.58 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-12                                                           27.13 ± ∞ ¹   27.13 ± ∞ ¹       ~ (p=1.000 n=1) ²   27.13 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-12                                          23.16 ± ∞ ¹   23.16 ± ∞ ¹       ~ (p=1.000 n=1) ²   23.16 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-12                                                  24.83 ± ∞ ¹   24.83 ± ∞ ¹       ~ (p=1.000 n=1) ²   24.83 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-12                                23.16 ± ∞ ¹   23.16 ± ∞ ¹       ~ (p=1.000 n=1) ²   23.16 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-12   70.74 ± ∞ ¹   70.74 ± ∞ ¹       ~ (p=1.000 n=1) ²   70.74 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{status=error}_|_rate()/5-12                                                         24.35 ± ∞ ¹   24.35 ± ∞ ¹       ~ (p=1.000 n=1) ²   24.35 ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                                                                                     28.16         28.16        +0.00%                   28.16        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                                                                                          │   main.txt   │            optimised.txt             │           unoptimised.txt            │
                                                                                                          │   spans/op   │   spans/op    vs base                │   spans/op    vs base                │
BackendBlockQueryRange/{}_|_rate()/5-12                                                                     4.736M ± ∞ ¹   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-12                                                           4.736M ± ∞ ¹   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-12                                          4.736M ± ∞ ¹   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-12                                                  4.736M ± ∞ ¹   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²   4.736M ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-12                                340.6k ± ∞ ¹   340.6k ± ∞ ¹       ~ (p=1.000 n=1) ²   340.6k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-12   3.262k ± ∞ ¹   3.262k ± ∞ ¹       ~ (p=1.000 n=1) ²   3.262k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{status=error}_|_rate()/5-12                                                         124.2k ± ∞ ¹   124.2k ± ∞ ¹       ~ (p=1.000 n=1) ²   124.2k ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                                                                                     683.1k         683.1k        +0.00%                   683.1k        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                                                                                          │   main.txt   │            optimised.txt             │           unoptimised.txt            │
                                                                                                          │   spans/s    │   spans/s     vs base                │   spans/s     vs base                │
BackendBlockQueryRange/{}_|_rate()/5-12                                                                     375.7k ± ∞ ¹   378.6k ± ∞ ¹       ~ (p=1.000 n=1) ²   375.1k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-12                                                           234.2k ± ∞ ¹   243.4k ± ∞ ¹       ~ (p=1.000 n=1) ²   236.9k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-12                                          276.7k ± ∞ ¹   287.0k ± ∞ ¹       ~ (p=1.000 n=1) ²   281.1k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-12                                                  174.8k ± ∞ ¹   180.3k ± ∞ ¹       ~ (p=1.000 n=1) ²   174.8k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-12                                161.0k ± ∞ ¹   168.6k ± ∞ ¹       ~ (p=1.000 n=1) ²   158.1k ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-12    428.2 ± ∞ ¹    406.2 ± ∞ ¹       ~ (p=1.000 n=1) ²    419.5 ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{status=error}_|_rate()/5-12                                                         54.55k ± ∞ ¹   54.91k ± ∞ ¹       ~ (p=1.000 n=1) ²   54.19k ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                                                                                     76.97k         78.23k        +1.63%                   76.75k        -0.29%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                                                                                          │   main.txt    │             optimised.txt             │            unoptimised.txt            │
                                                                                                          │     B/op      │     B/op       vs base                │     B/op       vs base                │
BackendBlockQueryRange/{}_|_rate()/5-12                                                                     485.4Mi ± ∞ ¹   485.1Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   536.8Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-12                                                           662.9Mi ± ∞ ¹   659.0Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   707.2Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-12                                          644.4Mi ± ∞ ¹   640.2Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   688.3Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-12                                                  873.2Mi ± ∞ ¹   869.8Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   915.7Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-12                                170.1Mi ± ∞ ¹   177.4Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   179.8Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-12   184.7Mi ± ∞ ¹   184.7Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   176.1Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{status=error}_|_rate()/5-12                                                         79.23Mi ± ∞ ¹   84.01Mi ± ∞ ¹       ~ (p=1.000 n=1) ²   79.79Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                                                                                     332.6Mi         336.6Mi        +1.20%                   346.9Mi        +4.29%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                                                                                          │   main.txt   │            optimised.txt             │            unoptimised.txt            │
                                                                                                          │  allocs/op   │  allocs/op    vs base                │  allocs/op    vs base                 │
BackendBlockQueryRange/{}_|_rate()/5-12                                                                     1.754M ± ∞ ¹   1.753M ± ∞ ¹       ~ (p=1.000 n=1) ²   2.347M ± ∞ ¹        ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-12                                                           3.145M ± ∞ ¹   3.144M ± ∞ ¹       ~ (p=1.000 n=1) ²   3.744M ± ∞ ¹        ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-12                                          3.121M ± ∞ ¹   3.118M ± ∞ ¹       ~ (p=1.000 n=1) ²   3.713M ± ∞ ¹        ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-12                                                  5.822M ± ∞ ¹   5.825M ± ∞ ¹       ~ (p=1.000 n=1) ²   6.417M ± ∞ ¹        ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-12                                621.7k ± ∞ ¹   621.5k ± ∞ ¹       ~ (p=1.000 n=1) ²   624.1k ± ∞ ¹        ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-12   151.6k ± ∞ ¹   151.2k ± ∞ ¹       ~ (p=1.000 n=1) ²   154.7k ± ∞ ¹        ~ (p=1.000 n=1) ²
BackendBlockQueryRange/{status=error}_|_rate()/5-12                                                         127.9k ± ∞ ¹   127.6k ± ∞ ¹       ~ (p=1.000 n=1) ²   188.1k ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                                                                                     1.027M         1.026M        -0.09%                   1.210M        +17.79%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

@ruslan-mikhailov marked this pull request as ready for review on April 8, 2025 08:37
@ruslan-mikhailov changed the title from "Issue 4896 traceql metrics fix" to "Issue 4896 traceql metrics fixes" on Apr 8, 2025
@ruslan-mikhailov force-pushed the issue-4896_traceql-metrics-fix branch from 73f313c to bc32619 on April 8, 2025 10:21

@mdisibio (Contributor) left a comment:

Good changes, and I like the sampling algorithm. It's hard to tell the overall impact on queries from the direct benchmark numbers you have provided. Can you try doing a benchstat comparison using BenchmarkBackendBlockQueryRange?

wasInstant := end1-start1 == step

start2 := uint64(blockStart.UnixNano())
blockEnd = blockEnd.Add(time.Second) // Add 1 second to include the last second of the block

Contributor commented:

I think this is worth commenting a little more: blockEnd is rounded down to the nearest second and considered inclusive. The end result of this function is that end2 covers the same time range but in an exclusive nanosecond version.
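
In other words (a hypothetical sketch of the intent, assuming blockStart and blockEnd are second-precision time.Time values):

// blockEnd is stored at second precision and treated as inclusive; adding one
// second before converting to nanoseconds yields an exclusive upper bound that
// covers exactly the same time range.
start2 := uint64(blockStart.UnixNano())
end2 := uint64(blockEnd.Add(time.Second).UnixNano()) // exclusive nanosecond bound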

@ruslan-mikhailov (Contributor Author) replied:

Added a few more comments. Let me know if I need to explain in more detail.

@ruslan-mikhailov (Contributor Author) commented on Apr 8, 2025:

Automation closed the PR, reopening

@ruslan-mikhailov force-pushed the issue-4896_traceql-metrics-fix branch from bc32619 to 818a1ea on April 8, 2025 15:48

@ruslan-mikhailov (Contributor Author) commented:

+ review fix and minor variable name changes

@mdisibio merged commit f290734 into grafana:main on Apr 10, 2025
15 checks passed
Development

Successfully merging this pull request may close these issues.

TraceQL Metrics lose traces
2 participants