Sidecar is slow to start up when GCP is not accessible #39950

@rinormaloku

The method below queries metadata from GCP. When GCP is not accessible (e.g. because of firewall rules), it delays agent startup by about 90 seconds.

func (e *gcpEnv) Metadata() map[string]string {

We discovered this with holdApplicationUntilProxyStarts set to true. The pod ended up in an infinite crash loop because the pilot-agent wait health checks kept failing (the health check times out after 60 seconds).

How to reproduce:

  1. In GKE (with Kubernetes Network Policies enabled), install Istio.
  2. Create test ns and network policy to block traffic to GCP:
kubectl create ns test                                      
kubectl label namespace test istio-injection=enabled 

kubectl -n test apply -f- <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector:
    matchLabels: 
      app: demo
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except: 
            - 169.254.169.254/32
        - ipBlock:
            cidr: 10.0.0.0/8
EOF
  3. Create the following workload:
kubectl -n test apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name:  netshoot
  labels:
    app: demo
spec:
  selector:
    matchLabels:
      app: demo
  replicas: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/agentLogLevel: "all:debug"
      labels:
        app: demo
    spec:
      containers:
      - name:  netshoot
        image:  nicolaka/netshoot
        imagePullPolicy: IfNotPresent
        args: ['tail', '-f', '/dev/null']
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 100m
            memory: 100Mi
EOF

The sidecar takes about 2 minutes to start up. If you also configure the pod to hold the application until the proxy starts, the pod fails to start.
To do so, add proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }' to the annotations above.

NOTE: The errors are currently hidden. I built a second proxy image that logs them, with those log lines prefixed by a series of = signs. To see all the errors, you can use it by setting sidecar.istio.io/proxyImage: 'rinormaloku/proxyv2:wd8'

I can open a PR to fix this issue, but I wanted to pick your brains first about what you'd recommend.

Questions: Should we error out on failure? Should we make the requests concurrently? Should we skip all subsequent requests once we discover that GCP is not accessible? Or should we simply increase the pilot-agent wait timeout?
