Skip to content

Conversation

haydentherapper
Copy link
Contributor

@haydentherapper haydentherapper commented May 6, 2025

As noted in docker/compose#12424, compose --wait doesn't seem to honor healthchecks with restart:always, when the server crashes and restarts a few times and eventually becomes healthy. This was happening with Rekor:

  • MySQL was not yet healthy because the healthcheck wasn't working as expected. Correct health check for MySQL 5.7 to prevent connecting to temporary server docker-library/mysql#930 (comment) suggested using 127.0.0.1 instead of localhost
  • trillian-log-server was not yet healthy even when MySQL reported as healthy, causing trillian-log-server to crash and restart a few times. There was no healthcheck for either Trillian service because the image we're using is based on Distroless, which has no curl/wget.
  • rekor-server tried to start up with an unhealthy trillian-log-server, and crashed. The healthcheck reported as unhealthy, and even though the server eventually became healthy because of the restart:always policy, the healthcheck reported the startup as unhealthy.

This change adds healthchecks to trillian-log-server and log-signer by pulling the binaries out of the images and putting them into Debian 12 containers that include curl, so we can curl the /healthz endpoint. This also fixes the MySQL healthcheck as noted above. Now, docker compose up --wait properly waits for a healthy MySQL before starting trillian-log-server, and a healthy Trillian before starting Rekor.

Also fix minor Dockerfile linting errors.

Summary

Release Note

Documentation

@haydentherapper haydentherapper requested a review from a team as a code owner May 6, 2025 02:32
haydentherapper added a commit to sigstore/fulcio that referenced this pull request May 6, 2025
Same as documented in sigstore/rekor#2473, the MySQL healthcheck was inaccurately reporting healthy when localhost was used. Switching to an address seems to fix it.

Signed-off-by: Hayden B <haydentherapper@users.noreply.github.com>
Copy link

codecov bot commented May 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 25.17%. Comparing base (488eb97) to head (ed660e5).
Report is 403 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2473       +/-   ##
===========================================
- Coverage   66.46%   25.17%   -41.29%     
===========================================
  Files          92      191       +99     
  Lines        9258    24790    +15532     
===========================================
+ Hits         6153     6240       +87     
- Misses       2359    17784    +15425     
- Partials      746      766       +20     
Flag Coverage Δ
e2etests 47.03% <ø> (-0.53%) ⬇️
unittests 16.37% <ø> (-31.31%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@haydentherapper haydentherapper marked this pull request as draft May 6, 2025 02:48
@haydentherapper haydentherapper force-pushed the fix-healthchecks branch 5 times, most recently from 28b880a to 54495cb Compare May 6, 2025 04:47
@haydentherapper haydentherapper marked this pull request as ready for review May 6, 2025 05:27
cpanato
cpanato previously approved these changes May 6, 2025
haydentherapper added a commit to sigstore/fulcio that referenced this pull request May 6, 2025
Same as documented in sigstore/rekor#2473, the MySQL healthcheck was inaccurately reporting healthy when localhost was used. Switching to an address seems to fix it.

Signed-off-by: Hayden B <haydentherapper@users.noreply.github.com>
haydentherapper added a commit to sigstore/fulcio that referenced this pull request May 6, 2025
As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password.

Signed-off-by: Hayden B <haydentherapper@users.noreply.github.com>
haydentherapper added a commit to sigstore/fulcio that referenced this pull request May 6, 2025
As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password.

Signed-off-by: Hayden B <haydentherapper@users.noreply.github.com>
cpanato
cpanato previously approved these changes May 6, 2025
As noted in docker/compose#12424, compose
--wait doesn't seem to honor healthchecks with restart:always, when the
server crashes and restarts a few times and eventually becomes healthy.
This was happening with Rekor:

* MySQL was not yet healthy because the healthcheck wasn't working as
  expected. docker-library/mysql#930 (comment)
  suggested using 127.0.0.1 instead of localhost
* trillian-log-server was not yet healthy even when MySQL reported as
  healthy, causing trillian-log-server to crash and restart a few times.
  There was no healthcheck for either Trillian service because the image
  we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server,
  and crashed. The healthcheck reported as unhealthy, and even though
  the server eventually became healthy because of the restart:always
  policy, the healthcheck reported the startup as unhealthy.

This change adds healthchecks to trillian-log-server and log-signer by
pulling the binaries out of the images and putting them into Debian
12 containers that include curl, so we can curl the /healthz endpoint.
This also fixes the MySQL healthcheck as noted above. Now, docker
compose up --wait properly waits for a healthy MySQL before starting
trillian-log-server, and a healthy Trillian before starting Rekor.

Also fix minor Dockerfile linting errors.

Signed-off-by: Hayden B <8418760+haydentherapper@users.noreply.github.com>
@haydentherapper haydentherapper requested a review from cpanato May 13, 2025 19:10
@haydentherapper haydentherapper enabled auto-merge (squash) May 13, 2025 19:11
@haydentherapper haydentherapper merged commit 62a6617 into sigstore:main May 14, 2025
16 checks passed
@haydentherapper haydentherapper deleted the fix-healthchecks branch May 14, 2025 06:36
haydentherapper added a commit to sigstore/rekor-monitor that referenced this pull request May 20, 2025
With sigstore/rekor#2473, healthchecks have been added for Trillian's log signer and log server, so there are now 5 services rather than 3 to wait for. E2E tests failed because there will never be just 3 healthy services, there will be 5.

This e2e script is a little brittle and can be cleaned up and simplified to use `compose up --wait`, will do in a later PR.

Signed-off-by: Hayden B <haydentherapper@users.noreply.github.com>
haydentherapper added a commit to sigstore/rekor-monitor that referenced this pull request May 20, 2025
With sigstore/rekor#2473, healthchecks have been added for Trillian's log signer and log server, so there are now 5 services rather than 3 to wait for. E2E tests failed because there will never be just 3 healthy services, there will be 5.

This e2e script is a little brittle and can be cleaned up and simplified to use `compose up --wait`, will do in a later PR.

Signed-off-by: Hayden B <haydentherapper@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants