Description
What happened:
We encountered a scaling issue with Agones. We are running multiple FleetAutoscaler
resources that use the webhook feature to scale game server instances up or down.
When a timeout occurs, we see the following log message:
{
"error": "error creating gameserver for gameserverset official-6f678: Internal error occurred: failed calling webhook \"mutations.agones.dev\": failed to call webhook: Post \"https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s\": dial tcp 192.168.253.0:8131: connect: operation not permitted",
"gss": {
"metadata": {
"message": "error adding game servers",
"severity": "warning",
"source": "*gameserversets.Controller",
"time": "2025-04-16T19:03:24.203646352Z"
}
}
}
When there's a CNI (Container Network Interface) issue, such as a timeout, Agones scales the number of game servers in each fleet down to 0.
However, after the CNI issue is resolved, Agones does not scale the fleet back up to the desired instance count reported by the webhook.
To work around the issue, we manually adjust the replica count by increasing or decreasing it by one. Alternatively, restarting the Agones controller also resolves the problem.
What you expected to happen:
After the CNI timeout is resolved, Agones should automatically scale the number of game servers back to the desired count provided by the webhook.
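For context, here is a minimal sketch of the kind of reply our webhook sends, assuming the standard FleetAutoscaleReview JSON shape (`request.uid`, `request.status.replicas`, `response.uid`, `response.scale`, `response.replicas`). The function name `build_review_response` and the sample values are hypothetical and are not taken from our actual webhook. The point is that the webhook keeps reporting a non-zero desired count while the fleet sits at 0.

```python
import json


def build_review_response(review_json: str, desired_replicas: int) -> str:
    """Fill in the response part of a FleetAutoscaleReview.

    Hypothetical helper: a real webhook would derive desired_replicas
    from its own metrics instead of taking it as an argument.
    """
    review = json.loads(review_json)
    request = review["request"]
    current = request["status"]["replicas"]

    review["response"] = {
        # The response must echo the uid of the request it answers.
        "uid": request["uid"],
        # Ask for a scale operation only when the fleet is off target.
        "scale": current != desired_replicas,
        "replicas": desired_replicas,
    }
    return json.dumps(review)


if __name__ == "__main__":
    # Illustrative request for a fleet that was scaled down to 0.
    sample = json.dumps({
        "request": {
            "uid": "example-uid",
            "name": "official",
            "namespace": "default",
            "status": {"replicas": 0},
        },
        "response": None,
    })
    print(build_review_response(sample, desired_replicas=5))
```

With a reply like this (`scale` set to true and a non-zero `replicas`), we would expect the fleet to return to the desired count on the next autoscaler sync once networking is restored.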
How to reproduce it (as minimally and precisely as possible):
- Block access to the Kubernetes API server temporarily.
- Allow Agones to scale the fleet to 0 due to webhook timeouts.
- Restore access to the API server.
- Observe that the fleet does not scale back up automatically.
Anything else we need to know?:
N/A
Environment:
- Agones version: 1.48.0
- Kubernetes version: 1.32.2
- Cloud provider or hardware configuration: gcore
- Install method: Helm
- Troubleshooting guide log(s): See log snippet above
- Others: N/A