Skip to content

Conversation

bazsi
Copy link
Collaborator

@bazsi bazsi commented Jul 26, 2023

This adds a new metric to measure the outgoing message delay, which is a good indication of
the health of our upstream connection. (e.g. are we building up the queue or it is normally emptied).

This is still work in progress:

  • no news file
  • the log threaded driver piece is still missing

This also contains some refactor steps for our stats code.

Depends on #4588

@kira-syslogng
Copy link
Contributor

Build FAILURE

@bazsi bazsi force-pushed the message-delay-metrics branch from 09284a9 to f7a1bff Compare July 31, 2023 13:56
@kira-syslogng
Copy link
Contributor

Build FAILURE

@alltilla alltilla changed the title WIP: Message delay metrics Message delay metrics Aug 3, 2023
@kira-syslogng
Copy link
Contributor

Build FAILURE

@alltilla alltilla force-pushed the message-delay-metrics branch 2 times, most recently from d523129 to 000b549 Compare August 3, 2023 15:26
@kira-syslogng
Copy link
Contributor

Build FAILURE

@alltilla alltilla force-pushed the message-delay-metrics branch 2 times, most recently from f6088fb to a525146 Compare August 3, 2023 16:17
@kira-syslogng
Copy link
Contributor

Build FAILURE

@alltilla alltilla force-pushed the message-delay-metrics branch from a525146 to 262e1a0 Compare August 4, 2023 07:25
@alltilla alltilla marked this pull request as draft August 4, 2023 07:25
@kira-syslogng
Copy link
Contributor

Build FAILURE

@alltilla alltilla force-pushed the message-delay-metrics branch 2 times, most recently from 0b83790 to a6a3423 Compare August 4, 2023 08:07
@kira-syslogng
Copy link
Contributor

Build FAILURE

1 similar comment
@kira-syslogng
Copy link
Contributor

Build FAILURE

@alltilla alltilla force-pushed the message-delay-metrics branch from 48dedf4 to e2765e4 Compare August 4, 2023 08:57
@alltilla alltilla marked this pull request as ready for review August 4, 2023 09:22
if (key->formatting.frame_of_reference == SCFOR_RELATIVE_TO_TIME_OF_QUERY)
{
unix_time_set_now(&now);
converted_double = (now.ut_sec + now.ut_usec / 1e6) - converted_int;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still does not fix the y2038 problem pointed out by @MrAnno

I am not sure we need the resolution that floating point gives us here, especially as the timestamp is only updated once per second at most.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I have removed the support for milli and nanosec.

I believe I have fixed the Y2038 problem, too. I hope I did not miss anything.

@alltilla alltilla force-pushed the message-delay-metrics branch 3 times, most recently from 44383df to 5034ce9 Compare August 7, 2023 11:57
bazsi and others added 27 commits August 18, 2023 10:07
…e file

The file is difficult to navigate, so I am trying to move related
code closer to each other.

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
…ted in the main thread

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
…on functions

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
…egator_add_data_point()

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
This is metric that stores a time_t value and is formatted as the number of
seconds before the time it is queried (e.g. age in seconds).

This can be used to represent absolute time values irrespective wheather our
clock is right.

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
This patch implements the syslogng_output_message_delay_sample_seconds, which
shows the delay incurred between the message being received and
then sent out by syslog-ng.

Please note that this is the delay associated with one specific message that
was recently sent out, e.g. this is a sample, rather than the average
of a series of messages. It is believed that the delay incurred is not
a volatile value, rather messages sitting in the queue in close proximity
have very similar delay values.

Thus, we only sample this delay once every second.

There's also another associated metric, called output_message_delay_sample_age_seconds,
which shows when the delay metric was last sampled. It contains the
age of the sample relative to the current time. This contains this
relative timestamp so that the actual clock accuracy of the syslog-ng
host does not matter.

To come up with an alertable metric for message delays, the following
algorithm is recommended:

We need the values for these metrics which are scraped at the same time:
  * output_message_delay_sample_seconds
  * output_message_delay_sample_age_seconds
  * syslogng_output_events_total{result="queued"}

After every scraping iteration:
1) check if the delay_sample_age is recent (e.g. happened between
the last two scrape periods), if it was then the delay can be considered
current.
2) if the age is older than that, look at the length of the associated queue

  - 2a) if the queue is empty, then our delay can be considered zero, as any new messages would
        be sent immediately.
  - 2b) if the queue is not empty, it means that even though we would have data to send,
        we did not (as if we were our age would be recent), our delay building up
        The expected delay value would be the sum of the last delay sample plus
        its age value.

Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
This intends to fix the y2038 problem on 32bit machines in cases
where we can guarantee to use only positive values.

Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
@alltilla alltilla force-pushed the message-delay-metrics branch from cf88134 to 58b1372 Compare August 18, 2023 08:07
Copy link
Collaborator

@alltilla alltilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on behalf of @bazsi.

@alltilla alltilla merged commit 1c73a6e into syslog-ng:master Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants