Skip to content

Conversation

stelucz
Copy link
Contributor

@stelucz stelucz commented May 3, 2024

We hit issue couple months back where hubble-ui was not starting after fresh deployment.

Backend container complained about rights on mounted certificates:

frontend /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
frontend /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
frontend /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
frontend 10-listen-on-ipv6-by-default.sh: info: can not modify /etc/nginx/conf.d/default.conf (read-only file system?)
frontend /docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
frontend /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
frontend /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
frontend /docker-entrypoint.sh: Configuration complete; ready for start up
frontend 2024/05/03 06:21:00 [notice] 1#1: using the "epoll" event method
frontend 2024/05/03 06:21:00 [notice] 1#1: nginx/1.25.3
frontend 2024/05/03 06:21:00 [notice] 1#1: built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10) 
frontend 2024/05/03 06:21:00 [notice] 1#1: OS: Linux 6.2.0-1015-aws
frontend 2024/05/03 06:21:00 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker processes
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 21
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 22
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 23
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 24
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 25
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 26
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 27
frontend 2024/05/03 06:21:00 [notice] 1#1: start worker process 28
frontend 10.0.2.103 - - [03/May/2024:06:21:01 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
frontend 10.0.2.103 - - [03/May/2024:06:21:08 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
frontend 10.0.2.103 - - [03/May/2024:06:21:18 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
frontend 10.0.2.103 - - [03/May/2024:06:21:28 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
frontend 10.0.2.103 - - [03/May/2024:06:21:38 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
frontend 10.0.2.103 - - [03/May/2024:06:21:48 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
frontend 10.0.2.103 - - [03/May/2024:06:21:58 +0000] "GET / HTTP/1.1" 200 688 "-" "kube-probe/1.28" "-"
backend time="2024-05-03T06:21:46Z" level=warning msg="using fallback value for env var" fallback=false var=GOPS_ENABLED
backend time="2024-05-03T06:21:46Z" level=error msg="failed to initialize application config" error="failed to load keypair: open /var/lib/hubble-ui/certs/client.crt: permission denied"
Stream closed EOF for kube-system/hubble-ui-77f86b68b6-nmt2s (backend)

I found issue having similar symptoms #18816. I started digging into changes of hubble-ui and chart templates, and I observed following details:

  • release 1.12 deprecated hubble.ui.securityContext.enabled commit and it was removed in 1.13 commit.
  • However, logic evaluating enabled parameter was not removed from template in commit.
  • This worked fine as containers run under root and mounted files are owned by root too.
  • But this commit changed container image user to 101 and 65532 for frontend and backend images.
  • This combination of commits currently breaks hubble-ui backend container with rights issue mentioned above.

What's changed:

Change removes hubble.ui.securityContext.enabled from hubble-ui deployment template and allows to securityContext part be properly applied from values file. It probably does not exactly resolved root-cause of #18816, but current symptoms are resolved by it. Reviewer, please decide if it is valid to close that issue or not by this change.

Possible improvement?
As both containers are in same pod, setting securityContext changes user id, group id and fs id to 1001 for both containers. But Dockerfiles contain different user/group (101 and 65532) for each container image. Probably this could be aligned to single same user/group?

Fixes: #18816

Remove deprecated `hubble.ui.securityContext.enabled` from hubble-ui deployment template

@stelucz stelucz requested review from a team as code owners May 3, 2024 10:53
@stelucz stelucz requested review from youngnick and squeed May 3, 2024 10:53
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 3, 2024
@stelucz stelucz requested a review from a user May 3, 2024 10:53
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label May 3, 2024
@squeed
Copy link
Contributor

squeed commented May 3, 2024

Good catch!

Does this need to be backported? If so, do you know which versions are affected?

@squeed squeed added the release-note/bug This PR fixes an issue in a previous release of Cilium. label May 3, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 3, 2024
@stelucz
Copy link
Contributor Author

stelucz commented May 3, 2024

@squeed I am not really sure, but I guess any release since commit 8ac9159 is impacted. I am not so skillful with GitHub to tell specific releases since that commit :) Releases 1.15.4 and 1.14.7 are impacted for sure, these are the versions which we are running right now.

@squeed
Copy link
Contributor

squeed commented May 3, 2024

@stelucz It looks like 3092ed1 bumped hubble-ui to v0.12.3, which has the commit changing user ID. It was backported to v1.13, v1.14, and v1.15. So, this fix will need to be backported too. @lambdanis, can you confirm?

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch @stelucz.

@squeed My understanding is the same, I marked it for backports. cc @geakstr you might want to take a look.

@stelucz The CI is complaining about a long commit message title: https://github.com/cilium/cilium/actions/runs/8937892162/job/24552554956?pr=32338 Could you rephrase it so that the title is no longer than 75 characters?

@stelucz stelucz force-pushed the hubble-ui-securitycontext branch from 0d658dc to d225b24 Compare May 3, 2024 17:04
@stelucz
Copy link
Contributor Author

stelucz commented May 3, 2024

@lambdanis commit message fixed :)

@ghost
Copy link

ghost commented May 7, 2024

/test

@joestringer
Copy link
Member

joestringer commented May 30, 2024

This test appears to be reliably failing in the ci-runtime suite (Details):

RuntimeAgentKVStoreTest KVStore tests 
  Etcd KVStore
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515
09:22:04 STEP: Starting Cilium with etcd as kvstore
09:22:04 STEP: Restarting cilium-docker service
FAIL: Endpoints are not ready after timeout
Expected
    <bool>: false
to be true
=== Test Finished at 2024-05-07T09:26:14Z====
09:26:14 STEP: Running JustAfterEach block for EntireTestsuite RuntimeAgentKVStoreTest
===================== TEST FAILED =====================
09:26:14 STEP: Running AfterFailed block for EntireTestsuite RuntimeAgentKVStoreTest
cmd: sudo cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                     
	 588        Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:e9d2   10.15.206.136   ready       
	 3583       Disabled           Disabled          1          reserved:host                                                      restoring   
	 3794       Disabled           Disabled          5          container:id.client           f00d::a0f:0:0:8834   10.15.59.15     restoring   
	 
Stderr:
 	 

cmd: sudo cilium-dbg status
Exitcode: 0 
Stdout:
 	 KVStore:                 Ok         Waiting for initial connection to be established
	 Kubernetes:              Disabled   
	 Host firewall:           Disabled
	 SRv6:                    Disabled
	 CNI Chaining:            none
	 CNI Config file:         CNI configuration management disabled
	 Cilium:                  Ok   1.16.0-dev (v1.16.0-dev-d225b24d)
	 NodeMonitor:             Disabled
	 IPAM:                    IPv4: 3/65534 allocated from 10.15.0.0/16, IPv6: 3/4294967294 allocated from f00d::a0f:0:0:0/96
	 IPv4 BIG TCP:            Disabled
	 IPv6 BIG TCP:            Disabled
	 BandwidthManager:        Disabled
	 Routing:                 Network: Tunnel [vxlan]   Host: Legacy
	 Attach Mode:             TCX
	 Masquerading:            IPTables [IPv4: Enabled, IPv6: Enabled]
	 Controller Status:       20/20 healthy
	 Proxy Status:            OK, ip 10.15.245.184, 0 redirects active on ports 10000-20000, Envoy: embedded
	 Global Identity Range:   min 256, max 65535
	 Hubble:                  Disabled
	 Encryption:              Disabled   
	 Cluster health:                        Warning   cilium-health daemon unreachable
	 Modules Health:          Stopped(0) Degraded(0) OK(19)
	 
Stderr:
 	 

===================== Exiting AfterFailed =====================
09:26:50 STEP: Running AfterEach for block EntireTestsuite RuntimeAgentKVStoreTest
09:26:50 STEP: Running AfterEach for block EntireTestsuite
<Checks>
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs

</Checks>

09:26:50 STEP: Running AfterAll block for EntireTestsuite RuntimeAgentKVStoreTest

• Failure [292.295 seconds]
RuntimeAgentKVStoreTest
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  KVStore tests
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Etcd KVStore [It]
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

    Endpoints are not ready after timeout
    Expected
        <bool>: false
    to be true

    /home/runner/work/cilium/cilium/test/runtime/kvstore.go:95

It's not immediately obvious to me why this would fail given the changes in this PR, but it also doesn't seem like there's an equivalent failure in the main branch: https://github.com/cilium/cilium/actions/workflows/conformance-runtime.yaml?query=branch%3Amain .

@joestringer
Copy link
Member

There's a small chance the failure could have been temporarily broken in the tree when the PR was last rebased. Perhaps I can try triggering a rebase to see if that resolves the failure.

Signed-off-by: Lukas Stehlik <stehlik.lukas@gmail.com>
@joestringer joestringer force-pushed the hubble-ui-securitycontext branch from d225b24 to e5a0430 Compare May 30, 2024 19:10
@joestringer
Copy link
Member

/test

@aanm aanm enabled auto-merge May 31, 2024 10:26
@aanm
Copy link
Member

aanm commented May 31, 2024

/test

@aanm aanm added this pull request to the merge queue Jun 3, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 3, 2024
Merged via the queue into cilium:main with commit 4b81d22 Jun 3, 2024
@viktor-kurchenko viktor-kurchenko mentioned this pull request Jun 4, 2024
3 tasks
@viktor-kurchenko viktor-kurchenko mentioned this pull request Jun 4, 2024
3 tasks
@viktor-kurchenko viktor-kurchenko added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Jun 4, 2024
@viktor-kurchenko viktor-kurchenko mentioned this pull request Jun 4, 2024
5 tasks
@viktor-kurchenko viktor-kurchenko added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Jun 4, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-pending/1.13 labels Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
Status: Backport done to v1.14
Status: Backport done to v1.15
Status: Released
Status: Released
Status: Released
Development

Successfully merging this pull request may close these issues.

hubble-ui backend container file permissions issue on loading client cert for hubble-relay
6 participants