Skip to content

Conversation

scrye
Copy link
Contributor

@scrye scrye commented Oct 11, 2019

Metrics:

# HELP lib_reliable_dials_total Total number of Dial calls.
# TYPE lib_reliable_dials_total counter
lib_reliable_dials_total{result="err_not_classified"} 1
lib_reliable_dials_total{result="ok_success"} 3
# HELP lib_reliable_registers_total Total number of Register calls.
# TYPE lib_reliable_registers_total counter
lib_reliable_registers_total{result="ok_success",svc="CS"} 1
lib_reliable_registers_total{result="ok_success",svc="UNKNOWN"} 3
# HELP lib_reliable_reads_total Total number of Read calls
# TYPE lib_reliable_reads_total histogram
lib_reliable_reads_total_bucket{result="err_not_classified",le="1"} 2
lib_reliable_reads_total_bucket{result="err_not_classified",le="+Inf"} 2
lib_reliable_reads_total_sum{result="err_not_classified"} 0
lib_reliable_reads_total_count{result="err_not_classified"} 2
lib_reliable_reads_total_bucket{result="ok_success",le="1"} 0
lib_reliable_reads_total_bucket{result="ok_success",le="+Inf"} 422
lib_reliable_reads_total_sum{result="ok_success"} 148709
lib_reliable_reads_total_count{result="ok_success"} 422
# HELP lib_reliable_writes_total Total number of Write calls
# TYPE lib_reliable_writes_total histogram
lib_reliable_writes_total_bucket{result="ok_success",le="1"} 0
lib_reliable_writes_total_bucket{result="ok_success",le="+Inf"} 403
lib_reliable_writes_total_sum{result="ok_success"} 85732
lib_reliable_writes_total_count{result="ok_success"} 403
# HELP lib_reconnect_retries_total Total number of reconnection attempt retries
# TYPE lib_reconnect_retries_total counter
lib_reconnect_retries_total 2
# HELP lib_reconnect_timeouts_total Total number of reconnection attempt timeouts
# TYPE lib_reconnect_timeouts_total counter
lib_reconnect_timeouts_total 0

Fixes #3183.


This change is Reviewable

@scrye scrye added SNET feature New feature or request c/observability Metrics, logging, tracing labels Oct 11, 2019
@scrye scrye requested a review from kormat October 11, 2019 12:07
@scrye scrye self-assigned this Oct 11, 2019
@scrye scrye requested review from lukedirtwalker and removed request for kormat October 16, 2019 06:21
Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 15 files at r1.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @scrye)


go/lib/prom/prom.go, line 75 at r1 (raw file):

		1.28, 2.56, 5.12, 10.24}
	// DefaultSizeBuckets 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
	DefaultSizeBuckets = []float64{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,

I think on the lower bound we could start at 32 or something.


go/lib/sock/reliable/reliable.go, line 402 at r1 (raw file):

		return prom.Success
	}
	if opError, ok := err.(*net.OpError); ok && opError.Timeout() {

would serrors.IsTimeout(err) not work for this case?


go/lib/sock/reliable/internal/metrics/metrics.go, line 1 at r1 (raw file):

// Copyright 2019 ETH Zurich

Please add tests for metrics labels.
promtest.CheckLabelsStruct(t, DialLabels{}) etc.


go/lib/sock/reliable/reconnect/errors.go, line 21 at r1 (raw file):

var (
	ErrDispatcherDead            = serrors.New("dispatcher dead")
	ErrReconnecterTimeoutExpired = serrors.New("timeout expired")

I wonder if that should be a timeout error so that serrors.IsTimeout would return true.


go/lib/sock/reliable/reconnect/internal/metrics/metrics.go, line 62 at r1 (raw file):

// Timeouts returns a counter for timeout errors.
func (m metrics) Timeouts(_ Labels) prometheus.Counter {

I think we could just skip the Labels for this. just Timeouts().Inc() on the call site.

scrye added 2 commits October 16, 2019 10:16
Metrics:
```
# HELP lib_reliable_dials_total Total number of Dial calls.
# TYPE lib_reliable_dials_total counter
lib_reliable_dials_total{result="err_not_classified"} 1
lib_reliable_dials_total{result="ok_success"} 3
# HELP lib_reliable_registers_total Total number of Register calls.
# TYPE lib_reliable_registers_total counter
lib_reliable_registers_total{result="ok_success",svc="CS"} 1
lib_reliable_registers_total{result="ok_success",svc="UNKNOWN"} 3
# HELP lib_reliable_reads_total Total number of Read calls
# TYPE lib_reliable_reads_total histogram
lib_reliable_reads_total_bucket{result="err_not_classified",le="1"} 2
lib_reliable_reads_total_bucket{result="err_not_classified",le="+Inf"} 2
lib_reliable_reads_total_sum{result="err_not_classified"} 0
lib_reliable_reads_total_count{result="err_not_classified"} 2
lib_reliable_reads_total_bucket{result="ok_success",le="1"} 0
lib_reliable_reads_total_bucket{result="ok_success",le="+Inf"} 422
lib_reliable_reads_total_sum{result="ok_success"} 148709
lib_reliable_reads_total_count{result="ok_success"} 422
# HELP lib_reliable_writes_total Total number of Write calls
# TYPE lib_reliable_writes_total histogram
lib_reliable_writes_total_bucket{result="ok_success",le="1"} 0
lib_reliable_writes_total_bucket{result="ok_success",le="+Inf"} 403
lib_reliable_writes_total_sum{result="ok_success"} 85732
lib_reliable_writes_total_count{result="ok_success"} 403
# HELP lib_reconnect_retries_total Total number of reconnection attempt
retries
# TYPE lib_reconnect_retries_total counter
lib_reconnect_retries_total 2
# HELP lib_reconnect_timeouts_total Total number of reconnection attempt
timeouts
# TYPE lib_reconnect_timeouts_total counter
lib_reconnect_timeouts_total 0
```

Fixes #3183.
Copy link
Contributor Author

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 15 files reviewed, 5 unresolved discussions (waiting on @lukedirtwalker)


go/lib/prom/prom.go, line 75 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I think on the lower bound we could start at 32 or something.

Done.


go/lib/sock/reliable/reliable.go, line 402 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

would serrors.IsTimeout(err) not work for this case?

Yes, it should work. Done.


go/lib/sock/reliable/internal/metrics/metrics.go, line 1 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

Please add tests for metrics labels.
promtest.CheckLabelsStruct(t, DialLabels{}) etc.

Done.


go/lib/sock/reliable/reconnect/errors.go, line 21 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I wonder if that should be a timeout error so that serrors.IsTimeout would return true.

It probably should, but it's not something I would change in this PR because it can break stuff, since it interferes with execution paths.

Added a FIXME instead.


go/lib/sock/reliable/reconnect/internal/metrics/metrics.go, line 62 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I think we could just skip the Labels for this. just Timeouts().Inc() on the call site.

Done.

Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @scrye)


go/lib/sock/reliable/internal/metrics/metrics.go, line 1 at r1 (raw file):

Previously, scrye (Sergiu Costea) wrote…

Done.

The file is missing.

Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@scrye scrye merged commit 95c4cd5 into scionproto:master Oct 16, 2019
@scrye scrye deleted the pubpr-fix-3183 branch October 16, 2019 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/observability Metrics, logging, tracing feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lib/sock/reliable: expose metrics
2 participants