
Disk based buffer causes metric collection/publishing issues #16500

@bmagistro

Description


Relevant telegraf.conf

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply surround
# them with ${}. For strings the variable must be within quotes (ie, "${STR_VAR}"),
# for numbers and booleans they should be plain (ie, ${INT_VAR}, ${BOOL_VAR})


# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"
  tenant = "redacted"
  region = "redactd"
  network = "redacted"


# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 25000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 20000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "3s"

  ## Collection offset is used to shift the collection by the given amount.
  ## This can be used to avoid many plugins querying constrained devices
  ## at the same time by manually scheduling them in time.
  # collection_offset = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"

  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "5s"

  ## Collected metrics are rounded to the precision specified. Precision is
  ## specified as an interval with an integer + unit (e.g. 0s, 10ms, 2us, 4s).
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  ##
  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s:
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ##
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  precision = "0s"

  ## Log at debug level.
  debug = true
  ## Log only error level messages.
  quiet = false

  ## Log format controls the way messages are logged and can be one of "text",
  ## "structured" or, on Windows, "eventlog".
  # logformat = "text"

  ## Message key for structured logs, to override the default of "msg".
  ## Ignored if `logformat` is not "structured".
  # structured_log_message_key = "message"

  ## Name of the file to be logged to or stderr if unset or empty. This
  ## setting is ignored for the "eventlog" format.
  logfile = "/var/log/telegraf/telegraf.log"

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to; if there is no log activity, rotation may be delayed.
  logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  logfile_rotation_max_size = "10MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

  ## Method of translating SNMP objects. Can be "netsnmp" (deprecated) which
  ## translates by calling external programs snmptranslate and snmptable,
  ## or "gosmi" which translates using the built-in gosmi library.
  snmp_translator = "gosmi"

  ## Name of the file to load the state of plugins from and store the state to.
  ## If uncommented and not empty, this file will be used to save the state of
  ## stateful plugins on termination of Telegraf. If the file exists on start,
  ## the state in the file will be restored for the plugins.
  # statefile = ""

  ## Flag to skip running processors after aggregators
  ## By default, processors are run a second time after aggregators. Changing
  ## this setting to true will skip the second run of processors.
  skip_processors_after_aggregators = true

  buffer_strategy = "disk"
  buffer_directory = "/var/lib/telegraf"

# Collect statistics about itself
[[inputs.internal]]
  ## If true, collect telegraf memory stats.
  collect_memstats = true

  ## If true, collect metrics from Go's runtime.metrics. For a full list see:
  ##   https://pkg.go.dev/runtime/metrics
  # collect_gostats = false

  tags = {influx_bucket_set = "general", influx_bucket = "tig_internal"}

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  ## NOTE: The resulting 'time_active' field INCLUDES 'iowait'!
  report_active = false
  ## If true and the info is available then add core_id and physical_id tags
  core_tags = false

  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}




# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "efivarfs"]

  ## Ignore mount points by mount options.
  ## The 'mount' command reports options of all mounts in parentheses.
  ## Bind mounts can be ignored with the special 'bind' option.
  ignore_mount_opts = ['bind']
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## Devices to collect stats for
  ## Wildcards are supported except for disk synonyms like '/dev/disk/by-id'.
  ## ex. devices = ["sda", "sdb", "vd*", "/dev/disk/by-id/nvme-eui.00123deadc0de123"]
  # devices = ["*"]

  ## Skip gathering of the disk's serial numbers.
  skip_serial_number = false

  ## Device metadata tags to add on systems supporting it (Linux only)
  ## Use 'udevadm info -q property -n <device>' to get a list of properties.
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]

  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  ## Additional gather options
  ## Possible options include:
  ## * ksm - kernel same-page merging
  ## * psi - pressure stall information
  # collect = []
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Get the number of processes and group them by status
[[inputs.processes]]
  ## Use sudo to run ps command on *BSD systems. Linux systems will read
  ## /proc, so this does not apply there.
  # use_sudo = false
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Read metrics about network interface usage
[[inputs.net]]
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status. When specifying an interface, glob-style
  ## patterns are also supported.
  # interfaces = ["eth*", "enp0s[0-1]", "lo"]

  ## On linux systems telegraf also collects protocol stats.
  ## Setting ignore_protocol_stats to true will skip reporting of protocol metrics.
  ##
  ## DEPRECATION NOTICE: A value of 'false' is deprecated and discouraged!
  ##                     Please set this to `true` and use the 'inputs.nstat'
  ##                     plugin instead.
  ignore_protocol_stats = true
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Read TCP metrics such as established, time wait and sockets counts.
[[inputs.netstat]]
  # no configuration
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Collect kernel snmp counters and network interface statistics
[[inputs.nstat]]
  ## file paths for proc files. If empty default paths will be used:
  ##    /proc/net/netstat, /proc/net/snmp, /proc/net/snmp6
  ## These can also be overridden with env variables, see README.
  proc_net_netstat = "/proc/net/netstat"
  proc_net_snmp = "/proc/net/snmp"
  proc_net_snmp6 = "/proc/net/snmp6"
  ## dump metrics with 0 values too
  dump_zeros       = true
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}


# Get standard chrony metrics, requires chronyc executable.
[[inputs.chrony]]
  ## Server address of chronyd with address scheme
  ## If empty or not set, the plugin will mimic the behavior of chronyc and
  ## check "unixgram:///run/chrony/chronyd.sock", "udp://127.0.0.1:323"
  ## and "udp://[::1]:323".
  # server = ""

  ## Timeout for establishing the connection
  # timeout = "5s"

  ## Try to resolve received addresses to host-names via DNS lookups
  ## Disabled by default to avoid DNS queries especially for slow DNS servers.
  # dns_lookup = false

  ## Metrics to query named according to chronyc commands
  ## Available settings are:
  ##   activity    -- number of peers online or offline
  ##   tracking    -- information about system's clock performance
  ##   serverstats -- chronyd server statistics
  ##   sources     -- extended information about peers
  ##   sourcestats -- statistics on peers
  # metrics = ["tracking"]

  ## Socket group & permissions
  ## If the user requests collecting metrics via unix socket, then it is created
  ## with the following group and permissions.
  # socket_group = "chrony"
  # socket_perms = "0660"
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}

[[inputs.x509_cert]]
  sources = ["file:////etc/puppetlabs/puppet/ssl/certs/alma95-0003.redacted.pem"]

# This file applies influx_bucket_set and influx_bucket tags on metrics as they flow through Telegraf. It is
# defaulted on all Telegraf instances, so it applies these tags to metrics everywhere it possibly can.
#
# Most metrics should arrive at Telegraf already tagged. Some metrics may pass through Telegraf twice: once
# at the first hop (e.g. to get into Kafka) and a second time going from Kafka to InfluxDB. Because of
# these two chances for metrics to be pre-tagged, we use tagdrop to skip metrics that are already tagged. This
# means this processor does not replace existing tags, even if what it would add is different; it leaves
# the original tag in place.
#
# The order set here is important; otherwise the generic catch-alls apply first and none of the more
# specific tags take effect because the generic tagging was already applied.


[[processors.override]]
  order = 1
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "redacted", influx_bucket = "redacted"}

[[processors.override]]
  order = 2
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "redacted", influx_bucket = "redacted"}

[[processors.override]]
  order = 3
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "redacted", influx_bucket = "redacted"}
  
[[processors.override]]
  order = 4
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "redacted", influx_bucket = "redacted"}


[[processors.override]]
  order = 5
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "redacted", influx_bucket = "redacted"}

[[processors.override]]
  order = 6
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    "cert_expire_date",
    "x509_cert"
  ]
  tags = {influx_bucket_set = "general", influx_bucket = "cert_expiration"}

[[processors.override]]
  order = 7
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    "chrony",
    "cpu",
    "disk",
    "diskio",
    "kernel",
    "mem",
    "net",
    "netstat",
    "nstat",
    "processes",
    "swap",
    "system",
    "win_*"
  ]
  tags = {influx_bucket_set = "localhost", influx_bucket = "localhost"}

[[processors.override]]
  order = 8
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    "ping",
  ]
  tags = {influx_bucket_set = "general", influx_bucket = "ping"}

[[processors.override]]
  order = 9
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "general", influx_bucket = "redacted"}

[[processors.override]]
  order = 10
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    redacted
  ]
  tags = {influx_bucket_set = "general", influx_bucket = "redacted"}

[[processors.override]]
  order = 11
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  namepass = [
    "f5telemetry",
    "f5telemetry_*",
  ]
  tags = {influx_bucket_set = "redacted", influx_bucket = "f5_ephemeral"}

[[processors.override]]
  order = 12
  tagdrop = { "influx_bucket" = ["*"], "influx_bucket_set" = ["*"] }
  tags = {influx_bucket_set = "general", influx_bucket = "default"}
# Configuration for the Kafka server to send metrics to
[[outputs.kafka]]
  alias = "redacted_test-metrics"
  ## URLs of kafka brokers
  ## The brokers listed here are used to connect to collect metadata about a
  ## cluster. However, once the initial metadata collect is completed, telegraf
  ## will communicate solely with the kafka leader and not all defined brokers.
  brokers = ["kafka-metrics-0001.redacted:9092","kafka-metrics-0002.redacted:9092","kafka-metrics-0003.redacted:9092"]

  ## Kafka topic for producer messages
  topic = "test-metrics"

  ## The value of this tag will be used as the topic.  If not set the 'topic'
  ## option is used.
  # topic_tag = ""

  ## If true, the 'topic_tag' will be removed from the metric.
  # exclude_topic_tag = false

  ## Optional Client id
  # client_id = "Telegraf"

  ## Set the minimal supported Kafka version.  Setting this enables the use of new
  ## Kafka features and APIs.  Of particular interest, lz4 compression
  ## requires at least version 0.10.0.0.
  ##   ex: version = "1.1.0"
  # version = ""

  ## The routing tag specifies a tagkey on the metric whose value is used as
  ## the message key.  The message key is used to determine which partition to
  ## send the message to.  This tag is preferred over the routing_key option.
  # routing_tag = "host"

  ## The routing key is set as the message key and used to determine which
  ## partition to send the message to.  This value is only used when no
  ## routing_tag is set or as a fallback when the tag specified in routing tag
  ## is not found.
  ##
  ## If set to "random", a random value will be generated for each message.
  ##
  ## When unset, no message key is added and each message is routed to a random
  ## partition.
  ##
  ##   ex: routing_key = "random"
  ##       routing_key = "telegraf"
  # routing_key = ""

  ## Compression codec represents the various compression codecs recognized by
  ## Kafka in messages.
  ##  0 : None
  ##  1 : Gzip
  ##  2 : Snappy
  ##  3 : LZ4
  ##  4 : ZSTD
  compression_codec = 0

  ## Idempotent Writes
  ## If enabled, exactly one copy of each message is written.
  # idempotent_writes = false

  ##  RequiredAcks is used in Produce Requests to tell the broker how many
  ##  replica acknowledgements it must see before responding
  ##   0 : the producer never waits for an acknowledgement from the broker.
  ##       This option provides the lowest latency but the weakest durability
  ##       guarantees (some data will be lost when a server fails).
  ##   1 : the producer gets an acknowledgement after the leader replica has
  ##       received the data. This option provides better durability as the
  ##       client waits until the server acknowledges the request as successful
  ##       (only messages that were written to the now-dead leader but not yet
  ##       replicated will be lost).
  ##   -1: the producer gets an acknowledgement after all in-sync replicas have
  ##       received the data. This option provides the best durability, we
  ##       guarantee that no messages will be lost as long as at least one in
  ##       sync replica remains.
  required_acks = 1

  ## The maximum number of times to retry sending a metric before failing
  ## until the next flush.
  max_retry = 3

  ## The maximum permitted size of a message. Should be set equal to or
  ## smaller than the broker's 'message.max.bytes'.
  # max_message_bytes = 1000000

  ## Producer timestamp
  ## This option sets the timestamp of the kafka producer message, choose from:
  ##   * metric: Uses the metric's timestamp
  ##   * now: Uses the time of write
  # producer_timestamp = metric

  ## Add metric name as specified kafka header if not empty
  # metric_name_header = ""

  ## Optional TLS Config
 
  tls_ca = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Period between keep alive probes.
  ## Defaults to the OS configuration if not specified or zero.
  # keep_alive_period = "15s"

  ## Optional SOCKS5 proxy to use when connecting to brokers
  # socks5_enabled = true
  # socks5_address = "127.0.0.1:1080"
  # socks5_username = "alice"
  # socks5_password = "pass123"

  ## Optional SASL Config
  # sasl_username = "kafka"
  # sasl_password = "secret"

  ## Optional SASL:
  ## one of: OAUTHBEARER, PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, GSSAPI
  ## (defaults to PLAIN)
  # sasl_mechanism = ""

  ## used if sasl_mechanism is GSSAPI
  # sasl_gssapi_service_name = ""
  # ## One of: KRB5_USER_AUTH and KRB5_KEYTAB_AUTH
  # sasl_gssapi_auth_type = "KRB5_USER_AUTH"
  # sasl_gssapi_kerberos_config_path = "/"
  # sasl_gssapi_realm = "realm"
  # sasl_gssapi_key_tab_path = ""
  # sasl_gssapi_disable_pafxfast = false

  ## Access token used if sasl_mechanism is OAUTHBEARER
  # sasl_access_token = ""

  ## Arbitrary key value string pairs to pass as a TOML table. For example:
  # {logicalCluster = "cluster-042", poolId = "pool-027"}
  # sasl_extensions = {}

  ## SASL protocol version.  When connecting to Azure EventHub set to 0.
  # sasl_version = 1

  # Disable Kafka metadata full fetch
  # metadata_full = false

  ## Maximum number of retries for metadata operations including
  ## connecting. Sets Sarama library's Metadata.Retry.Max config value. If 0 or
  ## unset, use the Sarama default of 3,
  # metadata_retry_max = 0

  ## Type of retry backoff. Valid options: "constant", "exponential"
  # metadata_retry_type = "constant"

  ## Amount of time to wait before retrying. When metadata_retry_type is
  ## "constant", each retry is delayed this amount. When "exponential", the
  ## first retry is delayed this amount, and subsequent delays are doubled. If 0
  ## or unset, use the Sarama default of 250 ms
  # metadata_retry_backoff = 0

  ## Maximum amount of time to wait before retrying when metadata_retry_type is
  ## "exponential". Ignored for other retry types. If 0, there is no backoff
  ## limit.
  # metadata_retry_max_duration = 0

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"

  # These are in order of how they are processed
  namepass = []
  namedrop = []
  tagpass = {"tenant"=["dev","test"]}
  tagdrop = {"tc_no_kafka_republish"=["true"]}
  metricpass = ''
  taginclude = []
  tagexclude = []

  ## NOTE: Due to the way TOML is parsed, tables must be at the END of the
  ## plugin definition, otherwise additional config options are read as part of
  ## the table

  ## Optional topic suffix configuration.
  ## If the section is omitted, no suffix is used.
  ## Following topic suffix methods are supported:
  ##   measurement - suffix equals to separator + measurement's name
  ##   tags        - suffix equals to separator + specified tags' values
  ##                 interleaved with separator

  ## Suffix equals to "_" + measurement name
  # [outputs.kafka.topic_suffix]
  #   method = "measurement"
  #   separator = "_"

  ## Suffix equals to "__" + measurement's "foo" tag value.
  ## If there is no such tag, the suffix is an empty string
  # [outputs.kafka.topic_suffix]
  #   method = "tags"
  #   keys = ["foo"]
  #   separator = "__"

  ## Suffix equals to "_" + measurement's "foo" and "bar"
  ## tag values, separated by "_". If there are no such tags,
  ## their values are treated as empty strings.
  # [outputs.kafka.topic_suffix]
  #   method = "tags"
  #   keys = ["foo", "bar"]
  #   separator = "_"

[[outputs.kafka]]
  alias = "redacted_test-metrics"
  ## URLs of kafka brokers
  ## The brokers listed here are used to connect to collect metadata about a
  ## cluster. However, once the initial metadata collect is completed, telegraf
  ## will communicate solely with the kafka leader and not all defined brokers.
  brokers = ["kafka-metrics-0001.redacted:9092","kafka-metrics-0002.redacted:9092","kafka-metrics-0003.redacted:9092"]

  ## Kafka topic for producer messages
  topic = "test-metrics"

  ## The value of this tag will be used as the topic.  If not set the 'topic'
  ## option is used.
  # topic_tag = ""

  ## If true, the 'topic_tag' will be removed from the metric.
  # exclude_topic_tag = false

  ## Optional Client id
  # client_id = "Telegraf"

  ## Set the minimal supported Kafka version.  Setting this enables the use of new
  ## Kafka features and APIs.  Of particular interest, lz4 compression
  ## requires at least version 0.10.0.0.
  ##   ex: version = "1.1.0"
  # version = ""

  ## The routing tag specifies a tagkey on the metric whose value is used as
  ## the message key.  The message key is used to determine which partition to
  ## send the message to.  This tag is preferred over the routing_key option.
  # routing_tag = "host"

  ## The routing key is set as the message key and used to determine which
  ## partition to send the message to.  This value is only used when no
  ## routing_tag is set or as a fallback when the tag specified in routing tag
  ## is not found.
  ##
  ## If set to "random", a random value will be generated for each message.
  ##
  ## When unset, no message key is added and each message is routed to a random
  ## partition.
  ##
  ##   ex: routing_key = "random"
  ##       routing_key = "telegraf"
  # routing_key = ""

  ## Compression codec represents the various compression codecs recognized by
  ## Kafka in messages.
  ##  0 : None
  ##  1 : Gzip
  ##  2 : Snappy
  ##  3 : LZ4
  ##  4 : ZSTD
  compression_codec = 0

  ## Idempotent Writes
  ## If enabled, exactly one copy of each message is written.
  # idempotent_writes = false

  ##  RequiredAcks is used in Produce Requests to tell the broker how many
  ##  replica acknowledgements it must see before responding
  ##   0 : the producer never waits for an acknowledgement from the broker.
  ##       This option provides the lowest latency but the weakest durability
  ##       guarantees (some data will be lost when a server fails).
  ##   1 : the producer gets an acknowledgement after the leader replica has
  ##       received the data. This option provides better durability as the
  ##       client waits until the server acknowledges the request as successful
  ##       (only messages that were written to the now-dead leader but not yet
  ##       replicated will be lost).
  ##   -1: the producer gets an acknowledgement after all in-sync replicas have
  ##       received the data. This option provides the best durability, we
  ##       guarantee that no messages will be lost as long as at least one in
  ##       sync replica remains.
  required_acks = 1

  ## The maximum number of times to retry sending a metric before failing
  ## until the next flush.
  max_retry = 3

  ## The maximum permitted size of a message. Should be set equal to or
  ## smaller than the broker's 'message.max.bytes'.
  # max_message_bytes = 1000000

  ## Producer timestamp
  ## This option sets the timestamp of the kafka producer message, choose from:
  ##   * metric: Uses the metric's timestamp
  ##   * now: Uses the time of write
  # producer_timestamp = metric

  ## Add metric name as specified kafka header if not empty
  # metric_name_header = ""

  ## Optional TLS Config
 
  tls_ca = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Period between keep alive probes.
  ## Defaults to the OS configuration if not specified or zero.
  # keep_alive_period = "15s"

  ## Optional SOCKS5 proxy to use when connecting to brokers
  # socks5_enabled = true
  # socks5_address = "127.0.0.1:1080"
  # socks5_username = "alice"
  # socks5_password = "pass123"

  ## Optional SASL Config
  # sasl_username = "kafka"
  # sasl_password = "secret"

  ## Optional SASL:
  ## one of: OAUTHBEARER, PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, GSSAPI
  ## (defaults to PLAIN)
  # sasl_mechanism = ""

  ## used if sasl_mechanism is GSSAPI
  # sasl_gssapi_service_name = ""
  # ## One of: KRB5_USER_AUTH and KRB5_KEYTAB_AUTH
  # sasl_gssapi_auth_type = "KRB5_USER_AUTH"
  # sasl_gssapi_kerberos_config_path = "/"
  # sasl_gssapi_realm = "realm"
  # sasl_gssapi_key_tab_path = ""
  # sasl_gssapi_disable_pafxfast = false

  ## Access token used if sasl_mechanism is OAUTHBEARER
  # sasl_access_token = ""

  ## Arbitrary key value string pairs to pass as a TOML table. For example:
  # {logicalCluster = "cluster-042", poolId = "pool-027"}
  # sasl_extensions = {}

  ## SASL protocol version.  When connecting to Azure EventHub set to 0.
  # sasl_version = 1

  # Disable Kafka metadata full fetch
  # metadata_full = false

  ## Maximum number of retries for metadata operations including
  ## connecting. Sets Sarama library's Metadata.Retry.Max config value. If 0 or
  ## unset, use the Sarama default of 3,
  # metadata_retry_max = 0

  ## Type of retry backoff. Valid options: "constant", "exponential"
  # metadata_retry_type = "constant"

  ## Amount of time to wait before retrying. When metadata_retry_type is
  ## "constant", each retry is delayed this amount. When "exponential", the
  ## first retry is delayed this amount, and subsequent delays are doubled. If 0
  ## or unset, use the Sarama default of 250 ms
  # metadata_retry_backoff = 0

  ## Maximum amount of time to wait before retrying when metadata_retry_type is
  ## "exponential". Ignored for other retry types. If 0, there is no backoff
  ## limit.
  # metadata_retry_max_duration = 0

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"

  # These are in order of how they are processed
  namepass = []
  namedrop = []
  tagpass = {"tenant"=["dev","test"]}
  tagdrop = {"tc_no_kafka_republish"=["true"]}
  metricpass = ''
  taginclude = []
  tagexclude = []

  ## NOTE: Due to the way TOML is parsed, tables must be at the END of the
  ## plugin definition, otherwise additional config options are read as part of
  ## the table

  ## Optional topic suffix configuration.
  ## If the section is omitted, no suffix is used.
  ## Following topic suffix methods are supported:
  ##   measurement - suffix equals to separator + measurement's name
  ##   tags        - suffix equals to separator + specified tags' values
  ##                 interleaved with separator

  ## Suffix equals to "_" + measurement name
  # [outputs.kafka.topic_suffix]
  #   method = "measurement"
  #   separator = "_"

  ## Suffix equals to "__" + measurement's "foo" tag value.
  ## If there is no such tag, the suffix is an empty string
  # [outputs.kafka.topic_suffix]
  #   method = "tags"
  #   keys = ["foo"]
  #   separator = "__"

  ## Suffix equals to "_" + measurement's "foo" and "bar"
  ## tag values, separated by "_". If there are no such tags,
  ## their values are treated as empty strings.
  # [outputs.kafka.topic_suffix]
  #   method = "tags"
  #   keys = ["foo", "bar"]
  #   separator = "_"

[[outputs.kafka]]
  alias = "redacted_live-metrics"
  ## URLs of kafka brokers
  ## The brokers listed here are used to connect to collect metadata about a
  ## cluster. However, once the initial metadata collect is completed, telegraf
  ## will communicate solely with the kafka leader and not all defined brokers.
  brokers = ["kafka-metrics-0001.redacted:9092","kafka-metrics-0002.redacted:9092","kafka-metrics-0003.redacted:9092"]

  ## Kafka topic for producer messages
  topic = "live-metrics"

  ## The value of this tag will be used as the topic.  If not set the 'topic'
  ## option is used.
  # topic_tag = ""

  ## If true, the 'topic_tag' will be removed from the metric.
  # exclude_topic_tag = false

  ## Optional Client id
  # client_id = "Telegraf"

  ## Set the minimal supported Kafka version.  Setting this enables the use of new
  ## Kafka features and APIs.  Of particular interest, lz4 compression
  ## requires at least version 0.10.0.0.
  ##   ex: version = "1.1.0"
  # version = ""

  ## The routing tag specifies a tagkey on the metric whose value is used as
  ## the message key.  The message key is used to determine which partition to
  ## send the message to.  This tag is preferred over the routing_key option.
  # routing_tag = "host"

  ## The routing key is set as the message key and used to determine which
  ## partition to send the message to.  This value is only used when no
  ## routing_tag is set or as a fallback when the tag specified in routing tag
  ## is not found.
  ##
  ## If set to "random", a random value will be generated for each message.
  ##
  ## When unset, no message key is added and each message is routed to a random
  ## partition.
  ##
  ##   ex: routing_key = "random"
  ##       routing_key = "telegraf"
  # routing_key = ""

  ## Compression codec represents the various compression codecs recognized by
  ## Kafka in messages.
  ##  0 : None
  ##  1 : Gzip
  ##  2 : Snappy
  ##  3 : LZ4
  ##  4 : ZSTD
  compression_codec = 0

  ## Idempotent Writes
  ## If enabled, exactly one copy of each message is written.
  # idempotent_writes = false

  ##  RequiredAcks is used in Produce Requests to tell the broker how many
  ##  replica acknowledgements it must see before responding
  ##   0 : the producer never waits for an acknowledgement from the broker.
  ##       This option provides the lowest latency but the weakest durability
  ##       guarantees (some data will be lost when a server fails).
  ##   1 : the producer gets an acknowledgement after the leader replica has
  ##       received the data. This option provides better durability as the
  ##       client waits until the server acknowledges the request as successful
  ##       (only messages that were written to the now-dead leader but not yet
  ##       replicated will be lost).
  ##   -1: the producer gets an acknowledgement after all in-sync replicas have
  ##       received the data. This option provides the best durability, we
  ##       guarantee that no messages will be lost as long as at least one in
  ##       sync replica remains.
  required_acks = 1

  ## The maximum number of times to retry sending a metric before failing
  ## until the next flush.
  max_retry = 3

  ## The maximum permitted size of a message. Should be set equal to or
  ## smaller than the broker's 'message.max.bytes'.
  # max_message_bytes = 1000000

  ## Producer timestamp
  ## This option sets the timestamp of the kafka producer message, choose from:
  ##   * metric: Uses the metric's timestamp
  ##   * now: Uses the time of write
  # producer_timestamp = metric

  ## Add metric name as specified kafka header if not empty
  # metric_name_header = ""

  ## Optional TLS Config
 
  tls_ca = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Period between keep alive probes.
  ## Defaults to the OS configuration if not specified or zero.
  # keep_alive_period = "15s"

  ## Optional SOCKS5 proxy to use when connecting to brokers
  # socks5_enabled = true
  # socks5_address = "127.0.0.1:1080"
  # socks5_username = "alice"
  # socks5_password = "pass123"

  ## Optional SASL Config
  # sasl_username = "kafka"
  # sasl_password = "secret"

  ## Optional SASL:
  ## one of: OAUTHBEARER, PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, GSSAPI
  ## (defaults to PLAIN)
  # sasl_mechanism = ""

  ## used if sasl_mechanism is GSSAPI
  # sasl_gssapi_service_name = ""
  # ## One of: KRB5_USER_AUTH and KRB5_KEYTAB_AUTH
  # sasl_gssapi_auth_type = "KRB5_USER_AUTH"
  # sasl_gssapi_kerberos_config_path = "/"
  # sasl_gssapi_realm = "realm"
  # sasl_gssapi_key_tab_path = ""
  # sasl_gssapi_disable_pafxfast = false

  ## Access token used if sasl_mechanism is OAUTHBEARER
  # sasl_access_token = ""

  ## Arbitrary key value string pairs to pass as a TOML table. For example:
  # {logicalCluster = "cluster-042", poolId = "pool-027"}
  # sasl_extensions = {}

  ## SASL protocol version.  When connecting to Azure EventHub set to 0.
  # sasl_version = 1

  # Disable Kafka metadata full fetch
  # metadata_full = false

  ## Maximum number of retries for metadata operations including
  ## connecting. Sets Sarama library's Metadata.Retry.Max config value. If 0 or
  ## unset, use the Sarama default of 3,
  # metadata_retry_max = 0

  ## Type of retry backoff. Valid options: "constant", "exponential"
  # metadata_retry_type = "constant"

  ## Amount of time to wait before retrying. When metadata_retry_type is
  ## "constant", each retry is delayed this amount. When "exponential", the
  ## first retry is delayed this amount, and subsequent delays are doubled. If 0
  ## or unset, use the Sarama default of 250 ms
  # metadata_retry_backoff = 0

  ## Maximum amount of time to wait before retrying when metadata_retry_type is
  ## "exponential". Ignored for other retry types. If 0, there is no backoff
  ## limit.
  # metadata_retry_max_duration = 0

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"

  # These are in order of how they are processed
  namepass = []
  namedrop = []
  tagpass = {}
  tagdrop = {"tc_no_kafka_republish"=["true"]}
  metricpass = '!(tags.tenant == "dev" || tags.tenant == "test") || tags.influx_bucket == "localhost" || tags.influx_bucket == "cert_expiration"'
  taginclude = []
  tagexclude = []

  ## NOTE: Due to the way TOML is parsed, tables must be at the END of the
  ## plugin definition, otherwise additional config options are read as part of
  ## the table

  ## Optional topic suffix configuration.
  ## If the section is omitted, no suffix is used.
  ## Following topic suffix methods are supported:
  ##   measurement - suffix equals to separator + measurement's name
  ##   tags        - suffix equals to separator + specified tags' values
  ##                 interleaved with separator

  ## Suffix equals to "_" + measurement name
  # [outputs.kafka.topic_suffix]
  #   method = "measurement"
  #   separator = "_"

  ## Suffix equals to "__" + measurement's "foo" tag value.
  ## If there is no such tag, the suffix is an empty string
  # [outputs.kafka.topic_suffix]
  #   method = "tags"
  #   keys = ["foo"]
  #   separator = "__"

  ## Suffix equals to "_" + measurement's "foo" and "bar"
  ## tag values, separated by "_". If there are no such tags,
  ## their values are treated as empty strings.
  # [outputs.kafka.topic_suffix]
  #   method = "tags"
  #   keys = ["foo", "bar"]
  #   separator = "_"

Logs from Telegraf

2025-02-10T17:25:01Z W! [agent] ["outputs.kafka::live.iad_test-metrics"] did not complete within its flush interval
2025-02-10T17:25:01Z W! [inputs.system] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:01Z W! [inputs.swap] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:01Z W! [inputs.netstat] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:01Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:02Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:02Z W! [inputs.internal] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:02Z W! [inputs.mem] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.x509_cert] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.diskio] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.nstat] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.processes] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.disk] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.kernel] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:10Z W! [inputs.net] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:11Z W! [inputs.system] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:11Z W! [inputs.swap] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:11Z W! [inputs.netstat] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:11Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:12Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:12Z W! [inputs.internal] Collection took longer than expected; not complete after interval of 10s
2025-02-10T17:25:12Z W! [inputs.mem] Collection took longer than expected; not complete after interval of 10s

----
Restarted Telegraf to enable debug logs
----

2025-02-10T18:36:41Z E! [inputs.chrony] Removing temporary socket "/run/chrony/chrony-telegraf-ab3bf21e-1cc1-4e3b-a0db-5ae5b1d9edf3.sock" failed: remove /run/chrony/chrony-telegraf-ab3bf21e-1cc1-4e3b-a0db-5ae5b1d9edf3.sock: permission denied
2025-02-10T18:36:41Z I! [agent] Hang on, flushing any cached metrics before shutdown
2025-02-10T18:36:41Z I! [agent] Stopping running outputs
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.conf
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.d/base.conf
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.d/input-internal.conf
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.d/input-localhost.conf
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.d/input-puppetcert.conf
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.d/output-kafka.conf
2025-02-10T18:36:41Z W! Using disk buffer strategy for plugin outputs.kafka, this is an experimental feature
2025-02-10T18:36:41Z W! Using disk buffer strategy for plugin outputs.kafka, this is an experimental feature
2025-02-10T18:36:41Z W! Using disk buffer strategy for plugin outputs.kafka, this is an experimental feature
2025-02-10T18:36:41Z I! Loading config: /etc/telegraf/telegraf.d/processor-tag-bucket.conf
2025-02-10T18:36:41Z I! Starting Telegraf 1.33.1 brought to you by InfluxData the makers of InfluxDB
2025-02-10T18:36:41Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-02-10T18:36:41Z I! Loaded inputs: chrony cpu disk diskio internal kernel mem net netstat nstat processes swap system x509_cert
2025-02-10T18:36:41Z I! Loaded aggregators:
2025-02-10T18:36:41Z I! Loaded processors: override (12x)
2025-02-10T18:36:41Z I! Loaded secretstores:
2025-02-10T18:36:41Z I! Loaded outputs: kafka (3x)
2025-02-10T18:36:41Z I! Tags enabled: host=alma95-0003.redacted network=redacted region=redacted tenant=redacted
2025-02-10T18:36:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"alma95-0003.redacted", Flush Interval:10s
2025-02-10T18:36:41Z D! [agent] Initializing plugins
2025-02-10T18:36:41Z D! [agent] Connecting outputs
2025-02-10T18:36:41Z D! [agent] Attempting connection to [outputs.kafka::redacted_test-metrics]
2025-02-10T18:36:41Z D! [agent] Successfully connected to outputs.kafka::redacted_test-metrics
2025-02-10T18:36:41Z D! [agent] Attempting connection to [outputs.kafka::redacted_test-metrics]
2025-02-10T18:36:41Z D! [agent] Successfully connected to outputs.kafka::redacted_test-metrics
2025-02-10T18:36:41Z D! [agent] Attempting connection to [outputs.kafka::redacted_live-metrics]
2025-02-10T18:36:41Z D! [agent] Successfully connected to outputs.kafka::redacted_live-metrics
2025-02-10T18:36:41Z D! [agent] Starting service inputs
2025-02-10T18:36:41Z D! [inputs.chrony] Connected to "udp://127.0.0.1:323"...
2025-02-10T18:36:51Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:36:51Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:36:51Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:36:51Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:36:51Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:36:51Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:36:51Z D! [inputs.x509_cert]   verify options:    { 0xc0018d6e40 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:36:51Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:36:51Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:36:51Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:36:51Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:36:51Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:36:51Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:36:51Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:36:54Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 126.143812ms
2025-02-10T18:36:56Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 149.050453ms
2025-02-10T18:36:56Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 198.327251ms
2025-02-10T18:36:56Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 477.581988ms
2025-02-10T18:36:57Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 230.172985ms
2025-02-10T18:36:57Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 131.284946ms
2025-02-10T18:36:58Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 213.511181ms
2025-02-10T18:36:59Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 215.768965ms
2025-02-10T18:37:00Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 283.884916ms
2025-02-10T18:37:00Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 180.424512ms
2025-02-10T18:37:01Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:37:01Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:37:01Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:37:01Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:37:01Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 218.750123ms
2025-02-10T18:37:02Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:37:02Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:37:02Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:37:02Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:37:02Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:37:02Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:37:02Z D! [inputs.x509_cert]   verify options:    { 0xc00ced9110 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:37:02Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:37:02Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:37:02Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:37:02Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 247.615815ms
2025-02-10T18:37:03Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 170.277466ms
2025-02-10T18:37:05Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 290.286934ms
2025-02-10T18:37:05Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:05Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:06Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 311.44424ms
2025-02-10T18:37:07Z W! [agent] ["outputs.kafka::redacted_live-metrics"] did not complete within its flush interval
2025-02-10T18:37:07Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 748980 metrics
2025-02-10T18:37:07Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 149.666139ms
2025-02-10T18:37:09Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 748979 metrics
2025-02-10T18:37:09Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 194.804707ms
2025-02-10T18:37:10Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:37:10Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:37:10Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:37:10Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:37:10Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:37:10Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:37:10Z D! [inputs.x509_cert]   verify options:    { 0xc00e125bf0 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:37:10Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:37:10Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:37:10Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:37:10Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 287813 metrics
2025-02-10T18:37:10Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 147.262621ms
2025-02-10T18:37:11Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:37:11Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:37:11Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:37:11Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:37:13Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 141.633571ms
2025-02-10T18:37:14Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 173.655371ms
2025-02-10T18:37:16Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 348.550099ms
2025-02-10T18:37:18Z W! [agent] ["outputs.kafka::redacted_live-metrics"] did not complete within its flush interval
2025-02-10T18:37:19Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:20Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 123.045227ms
2025-02-10T18:37:20Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:37:20Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:37:20Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:37:20Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:37:20Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:37:20Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:37:20Z D! [inputs.x509_cert]   verify options:    { 0xc00c6dda70 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:37:20Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:37:20Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:37:20Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:37:20Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:37:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:37:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:37:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:37:21Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 698981 metrics
2025-02-10T18:37:21Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 244.499572ms
2025-02-10T18:37:23Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 237813 metrics
2025-02-10T18:37:23Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 246.784683ms
2025-02-10T18:37:28Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 673982 metrics
2025-02-10T18:37:28Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 137.000615ms
2025-02-10T18:37:30Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:30Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 673983 metrics
2025-02-10T18:37:30Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 159.584965ms
2025-02-10T18:37:30Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:37:30Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:37:30Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:37:30Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:37:30Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:37:30Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:37:30Z D! [inputs.x509_cert]   verify options:    { 0xc00ae689c0 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:37:30Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:37:30Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:37:30Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:37:31Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:37:31Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:37:31Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:37:31Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:37:31Z W! [agent] ["outputs.kafka::redacted_live-metrics"] did not complete within its flush interval
2025-02-10T18:37:32Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 212813 metrics
2025-02-10T18:37:32Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 242.283889ms
2025-02-10T18:37:35Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:38Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 648984 metrics
2025-02-10T18:37:39Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 146.661748ms
2025-02-10T18:37:41Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 180.469939ms
2025-02-10T18:37:41Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:37:41Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:37:41Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:37:41Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:37:41Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:37:41Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:37:41Z D! [inputs.x509_cert]   verify options:    { 0xc00c6dc1e0 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:37:41Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:37:41Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:37:41Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:37:42Z W! [agent] ["outputs.kafka::redacted_live-metrics"] did not complete within its flush interval
2025-02-10T18:37:42Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:37:42Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:37:42Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:37:42Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:37:43Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 187813 metrics
2025-02-10T18:37:43Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 341.952999ms
2025-02-10T18:37:44Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:48Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysctl.service"): permission denied
2025-02-10T18:37:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-sysusers.service"): permission denied
2025-02-10T18:37:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup-dev.service"): permission denied
2025-02-10T18:37:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/credentials/systemd-tmpfiles-setup.service"): permission denied
2025-02-10T18:37:51Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 623985 metrics
2025-02-10T18:37:51Z D! [inputs.x509_cert] Invalid certificate 1f9
2025-02-10T18:37:51Z D! [inputs.x509_cert]   cert DNS names:    [alma95-0003.redacted]
2025-02-10T18:37:51Z D! [inputs.x509_cert]   cert IP addresses: []
2025-02-10T18:37:51Z D! [inputs.x509_cert]   cert subject:      CN=alma95-0003.redacted
2025-02-10T18:37:51Z D! [inputs.x509_cert]   cert issuer:       CN=Puppet CA: provisioning-0001.redacted
2025-02-10T18:37:51Z D! [inputs.x509_cert]   opts.DNSName:      
2025-02-10T18:37:51Z D! [inputs.x509_cert]   verify options:    { 0xc013bd3e90 <nil> 0001-01-01 00:00:00 +0000 UTC [0] 0}
2025-02-10T18:37:51Z D! [inputs.x509_cert]   verify error:      x509: certificate signed by unknown authority
2025-02-10T18:37:51Z D! [inputs.x509_cert]   tlsCfg.ServerName: 
2025-02-10T18:37:51Z D! [inputs.x509_cert]   ServerName:        
2025-02-10T18:37:51Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 97.210539ms
2025-02-10T18:37:53Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 623984 metrics
2025-02-10T18:37:53Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 252.622682ms
2025-02-10T18:37:54Z W! [agent] ["outputs.kafka::redacted_live-metrics"] did not complete within its flush interval
2025-02-10T18:37:55Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:37:55Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 162813 metrics
2025-02-10T18:37:56Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 25000 metrics in 481.428268ms
...
2025-02-10T18:40:03Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 473990 metrics
2025-02-10T18:40:03Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
2025-02-10T18:40:03Z D! [outputs.kafka::redacted_test-metrics] Buffer fullness: 473990 metrics
2025-02-10T18:40:04Z D! [outputs.kafka::redacted_test-metrics] Wrote batch of 25000 metrics in 171.158583ms
2025-02-10T18:40:05Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 12813 metrics
2025-02-10T18:40:05Z W! [agent] ["outputs.kafka::redacted_live-metrics"] did not complete within its flush interval
2025-02-10T18:40:05Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 12813 metrics
2025-02-10T18:40:05Z D! [outputs.kafka::redacted_live-metrics] Wrote batch of 12813 metrics in 73.422729ms
2025-02-10T18:40:05Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 0 metrics
2025-02-10T18:40:06Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 0 metrics
2025-02-10T18:40:07Z W! [agent] ["outputs.kafka::redacted_test-metrics"] did not complete within its flush interval
...
2025-02-10T18:40:30Z D! [inputs.mem] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:30Z D! [inputs.net] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:30Z D! [inputs.netstat] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:30Z D! [outputs.kafka::redacted_live-metrics] Buffer fullness: 0 metrics
2025-02-10T18:40:30Z W! [inputs.nstat] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:30Z W! [inputs.mem] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:30Z W! [inputs.processes] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:30Z D! [inputs.x509_cert] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:30Z D! [inputs.system] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:31Z W! [inputs.disk] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:31Z W! [inputs.diskio] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:31Z D! [inputs.internal] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:31Z D! [inputs.processes] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:31Z D! [inputs.chrony] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:31Z D! [inputs.disk] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:31Z D! [inputs.nstat] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:31Z W! [inputs.net] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:31Z W! [inputs.system] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:31Z D! [inputs.swap] Previous collection has not completed; scheduled collection skipped
2025-02-10T18:40:32Z W! [inputs.x509_cert] Collection took longer than expected; not complete after interval of 10s
2025-02-10T18:40:32Z W! [inputs.chrony] Collection took longer than expected; not complete after interval of 10s


System info

Telegraf 1.33.1, AlmaLinux 9.5

Docker

No response

Steps to reproduce

Not 100% clear, but enabling the disk buffer strategy and then turning it back off appeared to change behavior, as shown in the screenshot of one metric's collected values below.
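
For reference, a minimal sketch of the agent settings involved, based on Telegraf's documented buffer options; the directory path below is a hypothetical example and is not taken from the original report:

[agent]
  ## Selects how each output's buffer is kept; "memory" is the default,
  ## "disk" persists unwritten metrics to files on disk.
  buffer_strategy = "disk"
  ## Directory used for the on-disk buffer when buffer_strategy = "disk"
  ## (hypothetical path, adjust to the actual deployment).
  buffer_directory = "/var/lib/telegraf/buffer"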

Expected behavior

No metric loss (provided disk space is available)
When viewing collected data, no gaps
When viewing Telegraf's logs, no "collection did not complete" messages

Actual behavior

When running with the disk-based buffer strategy, after a period of time the logs indicate that collection took longer than expected and did not complete. This did not show up in the logs immediately after a reboot with debug logging turned on; I will continue to monitor that aspect. The result is inconsistent metrics, making it hard, if not impossible, to monitor a host.

[Screenshot: collected values for one metric, showing gaps]

Additional info

No response
