Memory usage spikes during WAL replay to more than normal usage #6934

@erkexzcx

What did you do?
Tried to start Prometheus.

What did you expect to see?
Prometheus up & running, web interface showing up.

What did you see instead? Under which circumstances?
Prometheus runs out of RAM during startup, while the WAL is being replayed (the "WAL segment loaded" phase).

Environment
Debian 9

  • System information:
    Linux 4.9.0-11-amd64 x86_64

  • Prometheus version:

prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
  build user:       root@7ea0ae865f12
  build date:       20200213-23:50:02
  go version:       go1.13.8
  • Prometheus configuration file:
global:
  evaluation_interval: 60s
  scrape_interval: 60s
...
...
...
  • Logs:

This is what happens about 10 minutes after startup:

... prometheus[39101]: level=info ts=2020-03-05T14:02:26.811Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41869 maxSegment=41871
... prometheus[39101]: level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41870 maxSegment=41871
... prometheus[39101]: level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41871 maxSegment=41871
... prometheus[39101]: fatal error: runtime: out of memory
... prometheus[39101]: runtime stack:
... prometheus[39101]: runtime.throw(0x253885d, 0x16)
... prometheus[39101]:         /usr/local/go/src/runtime/panic.go:774 +0x72
... prometheus[39101]: runtime.sysMap(0xce78000000, 0x14000000, 0x3f5bc78)
... prometheus[39101]:         /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
... prometheus[39101]: runtime.(*mheap).sysAlloc(0x3f432c0, 0x11de6000, 0xc000, 0x4373e7)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:701 +0x1cd
... prometheus[39101]: runtime.(*mheap).grow(0x3f432c0, 0x8ef3, 0xffffffff)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1255 +0xa3
... prometheus[39101]: runtime.(*mheap).allocSpanLocked(0x3f432c0, 0x8ef3, 0x3f5bc88, 0x20339d00000000)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1170 +0x266
... prometheus[39101]: runtime.(*mheap).alloc_m(0x3f432c0, 0x8ef3, 0x101, 0x7f5861cc3fff)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1022 +0xc2
... prometheus[39101]: runtime.(*mheap).alloc.func1()
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1093 +0x4c
... prometheus[39101]: runtime.(*mheap).alloc(0x3f432c0, 0x8ef3, 0x7f5861010101, 0x7f5861d11008)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1092 +0x8a
... prometheus[39101]: runtime.largeAlloc(0x11de5ec0, 0x450101, 0x7f5861d11008)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1138 +0x97
... prometheus[39101]: runtime.mallocgc.func1()
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1033 +0x46
... prometheus[39101]: runtime.systemstack(0x0)
... prometheus[39101]:         /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
... prometheus[39101]: runtime.mstart()
... prometheus[39101]:         /usr/local/go/src/runtime/proc.go:1146
... prometheus[39101]: goroutine 225 [running]:
... prometheus[39101]: runtime.systemstack_switch()
... prometheus[39101]:         /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0022234c0 sp=0xc0022234b8 pc=0x45d180
... prometheus[39101]: runtime.mallocgc(0x11de5ec0, 0x1fd78e0, 0x5949d401, 0xc0003631d0)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc002223560 sp=0xc0022234c0 pc=0x40c755
... prometheus[39101]: runtime.makeslice(0x1fd78e0, 0x0, 0x23bcbd8, 0xd)
... prometheus[39101]:         /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc002223590 sp=0xc002223560 pc=0x445bac
... prometheus[39101]: github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete(0xc000f29a70, 0xce71eef470)
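
To make it easier to see where in the replay the crash occurs, here is a minimal sketch that scans log lines like the ones above and reports the last loaded segment, the maximum segment, and whether the out-of-memory marker was seen. The function name `summarize_replay` and the sample list are hypothetical and only for illustration; the pattern assumes the log format shown in this issue.

```python
import re

# Pattern for the "WAL segment loaded" lines as they appear in the log above.
SEGMENT_RE = re.compile(r'msg="WAL segment loaded" segment=(\d+) maxSegment=(\d+)')
# Marker printed by the Go runtime when an allocation fails.
OOM_MARKER = "fatal error: runtime: out of memory"


def summarize_replay(lines):
    """Return (last_segment, max_segment, oom_seen) for an iterable of log lines.

    last_segment and max_segment are None if no replay line was found.
    Scanning stops at the first out-of-memory marker.
    """
    last_segment = None
    max_segment = None
    for line in lines:
        match = SEGMENT_RE.search(line)
        if match:
            last_segment = int(match.group(1))
            max_segment = int(match.group(2))
        elif OOM_MARKER in line:
            return last_segment, max_segment, True
    return last_segment, max_segment, False


# Sample lines copied from the log in this issue (prefix shortened).
SAMPLE = [
    'level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb '
    'msg="WAL segment loaded" segment=41870 maxSegment=41871',
    'level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb '
    'msg="WAL segment loaded" segment=41871 maxSegment=41871',
    'fatal error: runtime: out of memory',
]

if __name__ == "__main__":
    print(summarize_replay(SAMPLE))  # prints (41871, 41871, True)
```

For the log above, the last loaded segment equals maxSegment, so the failed allocation happens after the final segment has been read rather than partway through the segment list.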

This is what the systemd service looks like:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=always
RestartSec=5s
LimitNOFILE=infinity
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/data \
    --web.listen-address="127.0.0.1:1234" \
    --web.external-url="https://example.com" \
    --web.enable-admin-api \
    --storage.tsdb.retention.time=30d
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Here is the RAM usage of the server over the past hour. Note that memory fills up, the server runs out of RAM, and the service is killed and restarted:
(image: server RAM usage graph)

Please advise how I can troubleshoot this issue further.
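
One quick check that may help with troubleshooting is the on-disk size of the WAL, since that is the data being replayed at startup. The following is a minimal sketch, assuming the WAL lives in the `wal` subdirectory of the `--storage.tsdb.path` from the unit file above; `dir_size_bytes` is a hypothetical helper name. The on-disk size is only a rough indicator and does not equal the memory needed for replay.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file disappeared between listing and stat; ignore it.
                pass
    return total


if __name__ == "__main__":
    # Assumption: "wal" subdirectory of --storage.tsdb.path=/var/lib/prometheus/data.
    wal_dir = "/var/lib/prometheus/data/wal"
    print(f"{wal_dir}: {dir_size_bytes(wal_dir) / 2**30:.2f} GiB")
```

Run it as a user that can read the data directory (here, the `prometheus` user from the unit file).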
