-
Notifications
You must be signed in to change notification settings - Fork 0
Proposed changes to Append CT PR. #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Changes: * Changed textparse Parser interface for consistency and robustness. * Changed CT interface to be more explicit and handle validation. * Simplified test, change scrapeManager to allow testability. * Added TODOs. Signed-off-by: bwplotka <bwplotka@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this and challenging many decisions I've made!
I liked how you changed many places, e.g. feature-flag and method names, to make a point that we're ingesting zeros.
One thing that I couldn't understand was the metadata test, what are we trying to accomplish over there? 🤔
// CreatedTimestamp returns nil as it's not implemented yet. | ||
// TODO(bwplotka): https://github.com/prometheus/prometheus/issues/12980 | ||
func (p *PromParser) CreatedTimestamp() *int64 { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the impression we didn't want to implement CT for Prometheus text. The issue mentioned here also doesn't mention prometheus text parsing 🤔
func (p *ProtobufParser) CreatedTimestamp(ct *types.Timestamp) bool { | ||
var foundCT *types.Timestamp | ||
// CreatedTimestamp returns CT or nil if CT is not present or | ||
// invalid (as timestamp e.g. negative value) on counters, summaries or histograms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// invalid (as timestamp e.g. negative value) on counters, summaries or histograms. | |
// invalid (e.g. a timestamp with negative value) on counters, summaries or histograms. |
Not sure if that's what you meant here, but it reads weird 🤔
if err != nil { | ||
// Errors means ct == nil or invalid timestamp, which we silently ignore. | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, ideally we would want to at least log client-side errors. Would it make sense to return an error as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also change the proto definition to return int64 like callum suggested and not worry with conversion errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return where? So by making int64 we don't magically remove error surface. We simply not check things. So why not doing the same here and simply not check things? (:
Also how do you propose changing field to int64 without braking changes? Do you propose adding another field (as we should in proto)? What would you do with other timestamp fields that have google Timestamp type? (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that adding a new message is the right way of doing backward/forward compatibility, but since the proto change was released only in Prometheus 2.48 and seeing that proto changes weren't announced in the changelog, I don't expect that replacing the field will break anyone (even though this is a breaking change)
tsdb/agent/db.go
Outdated
func (a *appender) AppendCreatedTimestamp(ref storage.SeriesRef, l labels.Labels, t int64) (storage.SeriesRef, error) { | ||
func (a *appender) AppendCTZeroSample(ref storage.SeriesRef, l labels.Labels, t int64, ct int64) (storage.SeriesRef, error) { | ||
// TODO(bwplotka): Implement | ||
panic("to implement") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
panic? 😱
Don't Prometheus server and agent share the same scrape codebase? Won't this cause agents to panic on every scrape? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will, but only for this PR. I planned to do extra work in this PR, but now I think we should postpone, so fixed in next commit to this PR (pushing in a 10m)
scrape/manager_test.go
Outdated
// Disable regular scrapes. | ||
ScrapeInterval: model.Duration(9999 * time.Minute), | ||
ScrapeTimeout: model.Duration(5 * time.Second), | ||
// Ensure proto is chosen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point adding a comment here! Maybe we could extend by saying that we're choosing proto because at the time it is the only protocol that parses CT? We can remove this option once we add more protocols
ScrapeProtocols: []config.ScrapeProtocol{config.PrometheusProto}, | ||
}, | ||
ScrapeConfigs: []*config.ScrapeConfig{{JobName: "test"}}, | ||
})) | ||
|
||
// Start fake HTTP target to scrape returning a single metric. | ||
once := sync.Once{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about once
scrape/scrape.go
Outdated
if ctMs := p.CreatedTimestamp(); sl.enableCTZeroIngestion && ctMs != nil { | ||
ref, err = app.AppendCTZeroSample(ref, lset, t, *ctMs) | ||
if err != nil && !errors.Is(err, storage.ErrOutOfOrderCT) { // OOO is a common case, ignoring completely for now. | ||
// CT is an experimental feature. For now, we don't need to fail the | ||
// scrape on errors updating the created timestamp, log debug. | ||
level.Debug(sl.l).Log("msg", "Error when updating metadata in scrape loop", "series", string(met), "ct", *ctMs, "t", t, "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks so much simpler now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the log message looks like a copy-paste mistake? We're not updating metadata yet, are we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whooops, great catch!
Signed-off-by: bwplotka <bwplotka@gmail.com>
Changes: