Description
The method below queries metadata from GCP. However, when GCP is not accessible (e.g. due to firewall rules), it delays agent startup by about 90 seconds.
istio/pkg/bootstrap/platform/gcp.go
Line 137 in 6450ecc
func (e *gcpEnv) Metadata() map[string]string {
We discovered this when holdApplicationUntilProxyStarts is set to true. It caused the pod to enter an infinite crash loop, because the pilot-wait health checks kept failing (the health check times out after 60 seconds).
How to reproduce:
- In GKE (with Kubernetes Network Policies enabled), install Istio
- Create a test namespace and a network policy that blocks traffic to the GCP metadata server:
kubectl create ns test
kubectl label namespace test istio-injection=enabled
kubectl -n test apply -f- <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector:
    matchLabels:
      app: demo
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32
    - ipBlock:
        cidr: 10.0.0.0/8
EOF
- Create the following workload:
kubectl -n test apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: netshoot
  labels:
    app: demo
spec:
  selector:
    matchLabels:
      app: demo
  replicas: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/agentLogLevel: "all:debug"
      labels:
        app: demo
    spec:
      containers:
      - name: netshoot
        image: nicolaka/netshoot
        imagePullPolicy: IfNotPresent
        args: ['tail', '-f', '/dev/null']
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 100m
            memory: 100Mi
EOF
The sidecar will take about 2 minutes to start up. If you annotate the pod to hold the application until the proxy starts, it will fail. To do so, add proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }' to the annotations above.
NOTE: The errors are hidden by default. I built a second proxy image that prefixes some log lines with a series of = signs; you may want to use it to see all the errors:
sidecar.istio.io/proxyImage: 'rinormaloku/proxyv2:wd8'
I can make a PR to fix this issue, but I wanted to pick your brain on what you'd recommend.
Question: Should we error out on failure? Should we make the requests concurrently? Should we skip all subsequent requests once we discover that GCP is not accessible? Or should we simply increase the pilot-agent wait timeout?