Contributor

@txuna txuna commented Jul 2, 2025

What type of PR is this?
/kind bug

Which issue(s) this PR fixes:
Closes #4162

Fix GameServer Not Created on addMoreGameServers Failure

What this PR does / Why we need it
Affected versions: <= 1.50.0
This PR fixes an issue where, if all GameServers in a GameServerSet are deleted and the addMoreGameServers call then fails, no further GameServers are ever created. To resolve this, I added logic that re-enqueues the GameServerSet for processing using c.workerqueue.EnqueueAfter.

Note

In the case of partial GameServer deletion (not all GameServers), the GameServer informer’s resync triggers an Update event. This leads to a call to the computeReconciliationAction function, which calculates the number of GameServers to add (numServersToAdd), followed by a call to the addMoreGameServers function.

Without this fix, when all GameServers have been deleted, no GameServers exist to trigger further GameServer informer updates or resyncs. As a result, once addMoreGameServers has failed, it is never called again, and no new GameServers are created.
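The difference between partial and total deletion can be illustrated with a small simulation. This is a simplified model under stated assumptions, not Agones code: `resyncEvents` and `numServersToAdd` are hypothetical stand-ins for the informer resync and for the count computed in computeReconciliationAction.

```go
package main

import "fmt"

// resyncEvents models an informer resync: one Update event is emitted per
// GameServer that still exists. (Hypothetical helper, not an Agones API.)
func resyncEvents(existing []string) int {
	return len(existing)
}

// numServersToAdd is a simplified stand-in for the reconciliation count:
// how many GameServers are missing relative to the desired replicas.
func numServersToAdd(desired, existing int) int {
	if existing >= desired {
		return 0
	}
	return desired - existing
}

func main() {
	desired := 2
	partial := []string{"test-gss-2"} // one GameServer survived
	none := []string{}                // all GameServers were deleted

	// Partial deletion: a resync still produces an event, so reconciliation runs.
	fmt.Println("partial:", resyncEvents(partial), "event(s),", numServersToAdd(desired, len(partial)), "to add")
	// Total deletion: GameServers are missing, but nothing triggers reconciliation.
	fmt.Println("none:", resyncEvents(none), "event(s),", numServersToAdd(desired, len(none)), "to add")
}
```

In the second case the model reports servers to add but zero triggering events, which is the stuck state described above.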

POC

Imagine that the Agones Controller and Agones Extension are deployed on top of Kubernetes, and there is one GameServerSet (test-gss) with two GameServers (test-gss-1 and test-gss-2).

Delete the two GameServers (test-gss-1 and test-gss-2) while the TLS certificates of the Agones Extension are being changed.

Note

Delete all deployed GameServers at once.

# Agones Extension Pod
http: TLS handshake error from ip:port: EOF
http: TLS handshake error from ip:port: EOF
http: TLS handshake error from ip:port: EOF
[...]
# my cli terminal
kubectl delete pod test-gss-1 test-gss-2 -n default

In a normal scenario, the addMoreGameServers function is triggered when a GameServer transitions to the Shutdown state or when its Pod is deleted.

In this case, however, the following error is logged when addMoreGameServers is invoked:

{
  "error": "error creating gameserver for gameserverset test-gss: Internal error occurred: failed calling webhook \"mutations.agones.dev\": failed to call webhook: Post \"https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s\": tls: failed to verify certificate: x509~~"
}

Because of this error, no replacement GameServer is created, and addMoreGameServers is not invoked again.

Solution

The solution is simple: if the addMoreGameServers call fails, the GameServerSet is re-enqueued for processing using c.workerqueue.EnqueueAfter.

	if numServersToAdd > 0 {
		if err := c.addMoreGameServers(ctx, gsSet, numServersToAdd); err != nil {
+			c.workerqueue.EnqueueAfter(gsSet, 1*time.Second) // retry!
			loggerForGameServerSet(c.baseLogger, gsSet).WithError(err).Warning("error adding game servers and will retry")
		}
	}

Limitation

A potential drawback is that, if the failure persists, the GameServerSet is retried every second indefinitely until addMoreGameServers succeeds.

If you have any suggestions for a better approach, I'd appreciate your comments.

Special notes for your reviewer:
N/A


google-cla bot commented Jul 2, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@github-actions github-actions bot added the kind/bug and size/XS labels Jul 2, 2025
@txuna txuna changed the title from "fix: Adding a retry mechanism in case the addMoreGameServers function call fails." to "Fix: Adding a retry mechanism in case the addMoreGameServers function call fails." Jul 2, 2025
@0xaravindh
Member

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: b17dfa63-46dd-49e0-84c0-c2cb31558f88

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 74c02649-f424-490b-ad05-ee6459b3b3e5

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4214/head:pr_4214 && git checkout pr_4214
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-27932a4

@txuna txuna requested a review from markmandel July 7, 2025 01:49
@markmandel
Collaborator

/gcbrun

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: bf8d4493-7f6a-4ab0-833e-8133900ef684

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4214/head:pr_4214 && git checkout pr_4214
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-933d902

Collaborator

@markmandel markmandel left a comment

Love it.

@markmandel markmandel enabled auto-merge (squash) July 12, 2025 16:14
@markmandel
Collaborator

/gcbrun

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 7a741d83-08db-4428-b04e-9924722a854c

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4214/head:pr_4214 && git checkout pr_4214
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-f3458bb

@markmandel markmandel merged commit ab7c13c into googleforgames:main Jul 12, 2025
4 checks passed
@txuna txuna deleted the fix/addMoreGameServers branch July 13, 2025 07:03
Development

Successfully merging this pull request may close these issues.

Fleet scaling stuck at 0 after CNI timeout
4 participants