
Fleet scaling stuck at 0 after CNI timeout #4162

@cynt4k

What happened:
We encountered a scaling issue with Agones. We are running multiple FleetAutoscaler resources that use the webhook feature to scale game server instances up or down.

When a timeout occurs, we see the following log message:

{
  "error": "error creating gameserver for gameserverset official-6f678: Internal error occurred: failed calling webhook \"mutations.agones.dev\": failed to call webhook: Post \"https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s\": dial tcp 192.168.253.0:8131: connect: operation not permitted",
  "gss": {
    "metadata": {
      "message": "error adding game servers",
      "severity": "warning",
      "source": "*gameserversets.Controller",
      "time": "2025-04-16T19:03:24.203646352Z"
    }
  }
}

When there's a CNI (Container Network Interface) issue, such as a timeout, Agones scales each fleet down to 0 game servers.
However, after the CNI issue is resolved, Agones does not scale the fleet back up to the desired instance count reported by the webhook.

To work around the issue, we manually increase or decrease the replica count by one. Restarting the Agones controller also resolves the problem.
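
The manual nudge can be sketched as a JSON merge patch. This assumes the replica count in question is the Fleet's `spec.replicas`; the `replicasPatch` helper is a hypothetical name, and the snippet only builds the patch body rather than talking to the cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// replicasPatch builds a JSON merge patch that sets spec.replicas.
// Assumption: the patched resource is a Fleet with a spec.replicas field.
func replicasPatch(replicas int32) ([]byte, error) {
	if replicas < 0 {
		return nil, fmt.Errorf("replicas must be >= 0, got %d", replicas)
	}
	patch := map[string]any{
		"spec": map[string]any{"replicas": replicas},
	}
	return json.Marshal(patch)
}

func main() {
	// The fleet is stuck at 0, so nudge it up by one.
	var current int32 = 0
	body, err := replicasPatch(current + 1)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```

The resulting body could be applied with something like `kubectl patch fleet <name> --type merge -p '<patch>'`. This only mirrors the workaround and does not address the underlying bug.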

What you expected to happen:
After the CNI timeout is resolved, Agones should automatically scale the number of game servers back to the desired count provided by the webhook.

How to reproduce it (as minimally and precisely as possible):

  1. Block access to the Kubernetes API server temporarily.
  2. Wait for Agones to scale the fleet to 0 as the webhook calls time out.
  3. Restore access to the API server.
  4. Observe that the fleet does not scale back up automatically.

Anything else we need to know?:
N/A

Environment:

  • Agones version: 1.48.0
  • Kubernetes version: 1.32.2
  • Cloud provider or hardware configuration: gcore
  • Install method: Helm
  • Troubleshooting guide log(s): See log snippet above
  • Others: N/A

Labels: kind/bug, stale
