igooch
Collaborator

@igooch igooch commented Apr 23, 2025

What type of PR is this?

/kind hotfix

What this PR does / Why we need it:

GitHub is now enforcing explicit permissions in workflows, causing the labeler and pr_update workflows to fail.

This adds the permissions to write to a PR, using the same permissions as the existing labeler-pr workflow.
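
For readers unfamiliar with explicit workflow permissions, a minimal sketch of what such a block can look like is below. The scopes shown are the ones commonly used for labeling pull requests; they are an assumption for illustration and are not copied from this PR's diff or from the labeler-pr workflow.

```yaml
# Hypothetical workflow excerpt; the exact scopes used in this PR are an assumption.
permissions:
  contents: read        # read the repository, e.g. the labeler configuration
  pull-requests: write  # add labels or comments to the pull request
```

Declaring `permissions` at the top level of a workflow applies it to every job; it can also be set per job to scope the token more narrowly.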

Which issue(s) this PR fixes:

NA

Special notes for your reviewer:

@igooch igooch requested a review from markmandel April 23, 2025 00:23
@github-actions github-actions bot added the kind/hotfix Hotfixes for issues against release label Apr 23, 2025
Collaborator

@markmandel markmandel left a comment

TIL!

@agones-bot
Collaborator

Build Failed 😭

Build Id: 9efebf02-81a6-4d88-8223-0062b76c8539

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@lacroixthomas
Collaborator

It seems that the pipeline always has an issue on standard-upgrade-test-cluster-1-31:

/tmp/tmp.aPfSQ3MT5V/gke-autopilot-upgrade-test-cluster-1-31.log: FailureTarget
/tmp/tmp.aPfSQ3MT5V/standard-upgrade-test-cluster-1-32.log: SuccessCriteriaMet
Complete/tmp/tmp.aPfSQ3MT5V/standard-upgrade-test-cluster-1-30.log: Complete
SuccessCriteriaMetSuccessCriteriaMetFailureTargetSuccessCriteriaMet/tmp/tmp.aPfSQ3MT5V/standard-upgrade-test-cluster-1-31.log: SuccessCriteriaMet

https://github.com/googleforgames/agones/blob/main/cloudbuild.yaml#L397

The output for k8s 1.31 is SuccessCriteriaMet for almost all of them, but one is FailureTarget. I'm not sure where I can get access to the logs.

@markmandel
Collaborator

Agreed - seems rather flaky. @igooch does it make sense to cat the log of the file on failure? Just so we can see what's up?
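
One way to "cat the log on failure" could look like the sketch below. It assumes each upgrade-test log file in a directory contains the job status text seen in the output above (`FailureTarget`, `SuccessCriteriaMet`); the function name `dump_failed_logs` is hypothetical, and the actual step in cloudbuild.yaml may store or check the status differently.

```shell
# Hypothetical helper: print the full contents of every *.log file whose
# text contains "FailureTarget". Assumes the status string is written into
# the log file itself; the real Cloud Build step may differ.
dump_failed_logs() {
  local log_dir="$1"
  local failed=0
  local log
  for log in "$log_dir"/*.log; do
    [ -e "$log" ] || continue            # directory had no .log files
    if grep -q "FailureTarget" "$log"; then
      echo "===== ${log##*/} ====="      # header with the bare file name
      cat "$log"
      failed=1
    fi
  done
  return "$failed"                       # non-zero if any log reported a failure
}
```

Called as `dump_failed_logs /tmp/some-log-dir`, it prints only the failing logs and returns a non-zero status, so the build step can still fail after the dump.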

@markmandel markmandel enabled auto-merge (squash) April 23, 2025 18:29
@lacroixthomas
Collaborator

lacroixthomas commented Apr 23, 2025

> Agreed - seems rather flaky. @igooch does it make sense to cat the log of the file on failure? Just so we can see what's up?

I was looking at the code. From what I understand, it reuses an existing cluster for the upgrade test (standard-upgrade-test-cluster-1-31 in us-east1). Maybe there are leftovers or issues from a previous setup on it, since the failure is specific to this cluster (the others look fine with 1.30 and 1.32).

@igooch
Collaborator Author

igooch commented Apr 23, 2025

> Agreed - seems rather flaky. @igooch does it make sense to cat the log of the file on failure? Just so we can see what's up?

Yes, the logs are rather verbose, but if we're OK with that we can do a dump of the upgrade-test-controller, sdk-client-test and / or the agones sidecar / controller containers / pods as well.

For this build it looks like it fails during upgrade from 1.44 -> 1.45.

Install of 1.45 begins:

2025/04/23 00:40:06 Running command helm [upgrade --install --atomic --wait --timeout=10m --namespace=agones-system --create-namespace --version 1.45.0 --set agones.image.tag=1.45.0 --set agones.image.registry=us-docker.pkg.dev/agones-images/release --set agones.image.allocator.pullPolicy=Always --set agones.image.controller.pullPolicy=Always --set agones.image.extensions.pullPolicy=Always --set agones.image.ping.pullPolicy=Always --set agones.image.sdk.alwaysPull=true --set agones.controller.logLevel=debug agones agones/agones]

Ongoing creation of 1.44 game servers continues:

2025/04/23 00:40:11 Running command kubectl [create -f /tmp/gs1440.yaml]
2025/04/23 00:40:11 CombinedOutput: gameserver.agones.dev/sdk-client-test-9mlgv created

Creation of 1.44 game servers temporarily fails (expected) while the controller service switches to the new endpoint during the upgrade:

2025/04/23 00:40:21 Running command kubectl [create -f /tmp/gs1440.yaml]
2025/04/23 00:40:21 CombinedOutput: Error from server (InternalError): error when creating "/tmp/gs1440.yaml": Internal error occurred: failed calling webhook "mutations.agones.dev": failed to call webhook: Post "https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s": no endpoints available for service "agones-controller-service"
2025/04/23 00:40:21 CombinedOutput err: exit status 1
2025/04/23 00:40:21 Could not create Gameserver /tmp/gs1440.yaml: exit status 1. Retries left: 8.

Failure continues for longer than expected, and the test fails:

2025/04/23 00:41:01 Could not create Gameserver /tmp/gs1440.yaml: exit status 1. Retries left: 0.
2025/04/23 00:41:06 Running command kubectl [create -f /tmp/gs1440.yaml]
2025/04/23 00:41:06 CombinedOutput: Error from server (InternalError): error when creating "/tmp/gs1440.yaml": Internal error occurred: failed calling webhook "mutations.agones.dev": failed to call webhook: Post "https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s": no endpoints available for service "agones-controller-service"
2025/04/23 00:41:06 CombinedOutput err: exit status 1
2025/04/23 00:41:06 Could not create Gameserver /tmp/gs1440.yaml: exit status 1. Too many successive errors.
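
The "Retries left" countdown in these logs suggests a budget of successive failures. The sketch below illustrates that idea in shell; it is not the test harness's actual code. The function name `run_with_failure_budget`, its arguments, and the rule that a success resets the budget are all assumptions made for illustration.

```shell
# Hypothetical sketch of a successive-failure budget.
# Usage: run_with_failure_budget BUDGET ROUNDS DELAY_SECONDS CMD [ARGS...]
# Runs CMD once per round. A success resets the remaining retries to BUDGET
# (assumption); a failure uses one up. Returns 1 on the first failure that
# happens when no retries are left.
run_with_failure_budget() {
  local budget="$1" rounds="$2" delay="$3"
  shift 3
  local retries="$budget" i=0
  while [ "$i" -lt "$rounds" ]; do
    i=$((i + 1))
    if "$@"; then
      retries="$budget"                  # success: restore the full budget
    elif [ "$retries" -le 0 ]; then
      echo "Command failed. Too many successive errors." >&2
      return 1
    else
      retries=$((retries - 1))
      echo "Command failed. Retries left: ${retries}." >&2
    fi
    sleep "$delay"
  done
  return 0
}
```

With a budget of 9 and a 5 second delay, for example `run_with_failure_budget 9 100 5 kubectl create -f /tmp/gs1440.yaml`, an outage would be tolerated for nine consecutive attempts and the tenth failure would abort, which is consistent with the timestamps above (first failure at 00:40:21, abort at 00:41:06).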

Looking at the controller logs, it's possible the slow upgrade was due to a node not being ready, although the log timestamps are about 5 minutes apart, so it may be unrelated.

2025-04-23 00:46 0/5 nodes are available: 1 Insufficient ephemeral-storage, 3 Insufficient cpu, 4 Insufficient memory.

@markmandel markmandel merged commit daafb8f into googleforgames:main Apr 23, 2025
4 checks passed
Labels
kind/hotfix Hotfixes for issues against release
4 participants