Message delay metrics #4565
lib/stats/stats-prometheus.c
```c
if (key->formatting.frame_of_reference == SCFOR_RELATIVE_TO_TIME_OF_QUERY)
  {
    unix_time_set_now(&now);
    converted_double = (now.ut_sec + now.ut_usec / 1e6) - converted_int;
```
This still does not fix the Y2038 problem pointed out by @MrAnno.
I am not sure we need the resolution that floating point gives us here, especially as the timestamp is only updated once per second at most.
You are right, I have removed support for millisecond and nanosecond resolution.
I believe I have fixed the Y2038 problem, too. I hope I did not miss anything.
…e file

The file is difficult to navigate, so I am trying to move related code closer to each other.

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
…ted in the main thread

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
…on functions

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>

…egator_add_data_point()

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
This is a metric that stores a time_t value and is formatted as the number of seconds between that value and the time it is queried (i.e. its age in seconds). This can be used to represent absolute time values regardless of whether our clock is right.

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
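A minimal C sketch of the idea, assuming a hypothetical `AbsoluteTimeMetric` type rather than syslog-ng's actual stats API: the metric keeps an absolute timestamp (widened to 64 bits here, which is an assumption), and the age is only computed when the metric is queried, in whole seconds and with integer arithmetic instead of floating point.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical metric holding an absolute time (seconds since the epoch). */
typedef struct
{
  int64_t timestamp;
} AbsoluteTimeMetric;

void
absolute_time_metric_set(AbsoluteTimeMetric *m, time_t value)
{
  m->timestamp = (int64_t) value;
}

/* Age in whole seconds relative to `now`, computed in 64-bit integers. */
int64_t
absolute_time_metric_age_at(const AbsoluteTimeMetric *m, time_t now)
{
  return (int64_t) now - m->timestamp;
}

/* What a scrape would report: the age at the moment of the query. */
void
absolute_time_metric_format(const AbsoluteTimeMetric *m, char *buf, size_t len)
{
  assert(buf != NULL && len > 0);
  snprintf(buf, len, "%" PRId64, absolute_time_metric_age_at(m, time(NULL)));
}
```

Because only the difference is exported, a scraper does not need to trust the absolute clock of the syslog-ng host.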
This patch implements the syslogng_output_message_delay_sample_seconds metric, which shows the delay incurred between a message being received and then sent out by syslog-ng. Please note that this is the delay associated with one specific message that was recently sent out, i.e. it is a sample rather than the average over a series of messages. The delay is believed not to be volatile: messages sitting close to each other in the queue have very similar delay values. Thus, we only sample this delay once every second.

There's also an associated metric, called output_message_delay_sample_age_seconds, which shows when the delay metric was last sampled. It contains the age of the sample relative to the current time, so that the clock accuracy of the syslog-ng host does not matter.

To come up with an alertable metric for message delays, the following algorithm is recommended. We need the values of these metrics, scraped at the same time:

* output_message_delay_sample_seconds
* output_message_delay_sample_age_seconds
* syslogng_output_events_total{result="queued"}

After every scraping iteration:

1) Check if the delay sample is recent (e.g. it was taken between the last two scrapes). If it was, the delay can be considered current.
2) If the sample is older than that, look at the length of the associated queue:
 - 2a) if the queue is empty, the delay can be considered zero, as any new message would be sent immediately.
 - 2b) if the queue is not empty, we had data to send but did not send it (had we done so, the sample age would be recent), so the delay is building up. The expected delay is the sum of the last delay sample and its age.

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
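The recommended algorithm can be sketched in C as follows. `DelayScrape`, `estimate_output_delay()` and the `scrape_interval_seconds` parameter are hypothetical names. Treating "recent" as "no older than one scrape interval" is an assumption based on the "between the last two scrapes" wording, and the queued value is used as the queue length, as the commit message does.

```c
#include <assert.h>
#include <stdint.h>

/* Values from a single scrape; all three must be taken at the same time. */
typedef struct
{
  int64_t delay_sample_seconds;     /* output_message_delay_sample_seconds */
  int64_t delay_sample_age_seconds; /* output_message_delay_sample_age_seconds */
  int64_t queued;                   /* syslogng_output_events_total{result="queued"} */
} DelayScrape;

/* Returns the estimated current output delay in seconds. */
int64_t
estimate_output_delay(const DelayScrape *s, int64_t scrape_interval_seconds)
{
  assert(scrape_interval_seconds > 0);

  /* 1) Sample taken since the previous scrape: it is current. */
  if (s->delay_sample_age_seconds <= scrape_interval_seconds)
    return s->delay_sample_seconds;

  /* 2a) Stale sample, empty queue: new messages would go out immediately. */
  if (s->queued == 0)
    return 0;

  /* 2b) Stale sample, non-empty queue: nothing was sent since the sample,
   *     so the delay has grown by the age of the sample. */
  return s->delay_sample_seconds + s->delay_sample_age_seconds;
}
```

An alert rule could then fire when the returned value stays above a chosen threshold for a few consecutive scrapes.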
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
This is intended to fix the Y2038 problem on 32-bit machines in cases where we can guarantee that only positive values are used.

Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
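A sketch of the idea, assuming a hypothetical 32-bit storage slot (`StoredTime`). A signed 32-bit time_t overflows after 2147483647 (2038-01-19 03:14:07 UTC). An unsigned 32-bit value can hold non-negative timestamps up to 4294967295 (2106-02-07 06:28:15 UTC). Values are widened to 64 bits before any arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit slot, e.g. a stats counter on a 32-bit build. */
typedef uint32_t StoredTime;

/* Only valid for non-negative timestamps (1970 onwards). */
StoredTime
stored_time_from_seconds(int64_t seconds)
{
  assert(seconds >= 0 && seconds <= (int64_t) UINT32_MAX);
  return (StoredTime) seconds;
}

/* Widen before subtracting so differences cannot wrap around. */
int64_t
stored_time_to_seconds(StoredTime stored)
{
  return (int64_t) stored;
}
```

The trade-off is that negative (pre-1970) values cannot be represented, which is why the commit restricts this to cases where only positive values occur.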
Approving on behalf of @bazsi.
This adds a new metric to measure the outgoing message delay, which is a good indication of the health of our upstream connection (e.g. whether the queue is building up or is normally emptied).
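The once-per-second delay sampling described in the commit message can be sketched as follows. `DelaySampler` and `delay_sampler_on_sent()` are hypothetical names. Timestamps are assumed to be whole seconds, and the sampler is assumed to start zero-initialized. The exported age would then be the query time minus `last_sample_time`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
  int64_t last_sample_time;     /* when the last sample was taken (seconds) */
  int64_t delay_sample_seconds; /* delay of the sampled message */
} DelaySampler;

/* Called for every message sent out; records at most one sample per second.
 * Returns true if this message became the new sample. */
bool
delay_sampler_on_sent(DelaySampler *s, int64_t received_at, int64_t now)
{
  assert(now >= received_at);
  if (now == s->last_sample_time)
    return false; /* already sampled during this second */

  s->last_sample_time = now;
  s->delay_sample_seconds = now - received_at;
  return true;
}
```

Sampling instead of averaging keeps the per-message cost to one comparison. This relies on the commit's assumption that neighbouring messages in the queue have similar delays.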
This is still work in progress:
This also contains some refactor steps for our stats code.
Depends on #4588