
Conversation

0xaravindh
Member

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind cleanup

What this PR does / Why we need it:

Which issue(s) this PR fixes:

Closes #4163

Special notes for your reviewer:

@0xaravindh 0xaravindh self-assigned this Apr 28, 2025
@0xaravindh 0xaravindh requested a review from markmandel April 28, 2025 10:32
@github-actions github-actions bot added the kind/cleanup Refactoring code, fixing up documentation, etc label Apr 28, 2025
@0xaravindh 0xaravindh requested a review from igooch April 28, 2025 10:32
@agones-bot
Collaborator

Build Failed 😭

Build Id: 944b0a49-63f1-4ec5-841b-2586ca33a489

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: b5ecf3aa-2b70-4515-beb7-830a0fbd96df

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: c605097d-0c93-4011-8873-71f2b8cbb039

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 58b3ccdb-181b-49d1-a136-c35cb6fd8ffe

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@lacroixthomas
Collaborator

lacroixthomas commented Apr 28, 2025

Regarding the build issue around jq: it seems the submit-upgrade-test-cloud-build job in cloudbuild.yaml uses the gcr.io/google.com/cloudsdktool/cloud-sdk image and doesn't use the Dockerfile from test/upgrade/Dockerfile; it applies the files directly with kubectl. If we want to use jq, maybe we can install it in the args of submit-upgrade-test-cloud-build?

I'm not fully sure about it, but from what I see, jq is properly installed in the Docker image for push-upgrade-test, but that image is only used by the test itself:

image: us-docker.pkg.dev/agones-images/ci/upgrade-test-controller:${DevVersion}

@0xaravindh
Member Author

> Regarding the build issue around jq: it seems the submit-upgrade-test-cloud-build job in cloudbuild.yaml uses the gcr.io/google.com/cloudsdktool/cloud-sdk image and doesn't use the Dockerfile from test/upgrade/Dockerfile; it applies the files directly with kubectl. If we want to use jq, maybe we can install it in the args of submit-upgrade-test-cloud-build?
>
> I'm not fully sure about it, but from what I see, jq is properly installed in the Docker image for push-upgrade-test, but that image is only used by the test itself.


Got it! Installing jq in the args section.
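
For illustration, a minimal sketch of what that could look like in the submit-upgrade-test-cloud-build step of cloudbuild.yaml. The step id and image come from this thread; the entrypoint/args layout and the apt-get line are assumptions (the cloud-sdk image is Debian-based, so apt-get is available):

- name: gcr.io/google.com/cloudsdktool/cloud-sdk
  id: submit-upgrade-test-cloud-build
  entrypoint: bash
  args:
    - -c
    - |
      # assumed addition: install jq up front, since the cloud-sdk
      # image does not bundle it by default
      apt-get update && apt-get install -y jq
      # ... existing script continues (kubectl apply, wait loops, etc.)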

@agones-bot
Collaborator

Build Failed 😭

Build Id: f25955e4-5711-4d6a-8dae-bccf6032af92

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: b404b589-c45b-41bb-b09c-0ec7a655aabe

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 3faa7b62-2ab3-48e0-a03e-1e5961ebd386

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

cloudbuild.yaml Outdated
Comment on lines 387 to 394

echo "===== Job status for upgrade-test-runner on cluster ${testCluster} ====="
kubectl get job/upgrade-test-runner -o json | jq '.status' > "${tmpdir}/${testCluster}-job-status.json"
cat "${tmpdir}/${testCluster}-job-status.json"

echo "===== Extracting job conditions (if available) ====="
job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
echo "Job conditions output: $job_conditions"
Collaborator

@lacroixthomas lacroixthomas Apr 28, 2025


We could maybe move this part after the if wait $pid; then (L400), to print the values once the process is complete? Otherwise the status/conditions might not be available yet, since the wait that produces the statuses runs in the background and may not have completed.

Member Author


After the upgrade-test-runner Job finishes (successfully or unsuccessfully), Kubernetes waits 100 seconds before deleting the Job. If we move that code to that line, we hit this issue:

Error from server (NotFound): jobs.batch "upgrade-test-runner" not found

ttlSecondsAfterFinished: 100
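
For context, a minimal sketch of where that setting sits in the Job manifest. The ttlSecondsAfterFinished field is standard batch/v1; the container details below are illustrative, borrowing the image reference quoted earlier in this thread:

apiVersion: batch/v1
kind: Job
metadata:
  name: upgrade-test-runner
spec:
  # Kubernetes deletes the finished Job (and its pods) after 100s,
  # which is why a later kubectl get job/upgrade-test-runner returns NotFound
  ttlSecondsAfterFinished: 100
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: upgrade-test-controller
          image: us-docker.pkg.dev/agones-images/ci/upgrade-test-controller:${DevVersion}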

Collaborator


Oh I see, hmm.
I'm not sure the log is useful here then, because it would only print the status/conditions from the very beginning of the job's creation; we would never get the values once the statuses actually settle, in error or not.

But if we want to keep them, maybe we need the same design as the wait with waitPids, to access the values once the job has all its statuses set to True?

What do you think?
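
As a rough bash sketch of that waitPids idea (the cluster list, job name, and timeout here are placeholders, not taken from cloudbuild.yaml):

# background one wait per cluster and collect the PIDs, so the status
# is only read after the corresponding job has actually completed
waitPids=()
for testCluster in "${testClusters[@]}"; do
  kubectl wait --for=condition=complete job/upgrade-test-runner --timeout=30m &
  waitPids+=($!)
done

for pid in "${waitPids[@]}"; do
  if wait "$pid"; then
    # the job is done, and still readable within the ttlSecondsAfterFinished window
    kubectl get job/upgrade-test-runner -o json | jq '.status'
  fi
done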

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 8d22703b-c2d3-4183-adfd-9cdb11f3eccc

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-9ce27fd

@agones-bot
Collaborator

Build Failed 😭

Build Id: 54d61520-601e-400e-9fc9-a4f622905000

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 3e31e567-6f19-44dc-ad11-7ec9af416624

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 5aee9245-329a-4a45-a47d-af6ff87e2f8b

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 836cc0fa-4d37-4343-9fbc-001440b562f5

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: 81fa3be8-5fc9-48b6-9e69-3825259762ad

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: dbd98f3f-ba68-4442-a137-3780dfaf460d

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 34410939-111c-4f78-8efe-8fc9797c5335

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: a94c3251-a40e-48b8-bc82-1f92d09b16f0

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: bfe9c5ff-304c-442a-ad4c-fc29ac4c5b64

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-4ece9f4

Collaborator

@igooch igooch left a comment


Good work! One comment on printing the log output.

done
done

echo "Logs from log bucket: https://console.cloud.google.com/logs/query;storageScope=storage,projects%2F${PROJECT_ID}%2Flocations%2Fglobal%2Fbuckets%2F${logBucketName}%2Fviews%2F_AllLogs;query=$(printf 'resource.labels.container_name=(\"upgrade-test-controller\" OR \"sdk-client-test\")' | jq -sRr @uri);cursorTimestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ);duration=PT10M?project=${PROJECT_ID}"
Collaborator


It looks like this is only printed to the console in the case that the test fails while waiting on the job to start. Could you pull this logic out into a separate function, and also call it in the case that one of the pids exits with a non-zero status later in the test?

Member Author


Moved the log printing into a function and called it in both failure cases.
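
For clarity, a sketch of the two call sites after that change. The first block is quoted verbatim later in this thread; the second mirrors the pids loop discussed below (variable names per that discussion):

# call site 1: the pod never becomes ready
if ! kubectl wait --for=condition=ready pod -l job-name=upgrade-test-runner --timeout=5m; then
    echo "ERROR: The pod for job 'upgrade-test-runner' did not become ready within the timeout period."
    print_failure_logs "$testCluster"
    exit 1
fi

# call site 2: a background wait on one of the collected pids fails
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        print_failure_logs "$testCluster"
    fi
done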

@agones-bot
Collaborator

Build Failed 😭

Build Id: 27fa6b18-ac95-41b3-91e4-d14456cad9cf

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 4c123848-f0bf-4b76-bcda-20b4445d1304

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-0570528

@@ -0,0 +1,196 @@
#!/usr/bin/env bash
Collaborator


🔥 🔥 🔥

Collaborator

@igooch igooch left a comment


LGTM

@igooch igooch enabled auto-merge (squash) July 21, 2025 16:46
@igooch igooch requested a review from lacroixthomas July 21, 2025 16:46
Comment on lines +32 to +50
print_failure_logs() {
local cluster_name=$1
echo "ERROR: Upgrade test failed on cluster: $cluster_name"

job_pods=$(kubectl get pods -l job-name=upgrade-test-runner -o jsonpath="{.items[*].metadata.name}")
for pod in $job_pods; do kubectl logs --tail=20 "$pod" || echo "Unable to retrieve logs for pod: $pod"; done
for pod in $job_pods; do
containers=$(kubectl get pod "$pod" -o jsonpath='{.spec.containers[*].name}')
for container in $containers; do
if [[ "$container" == "sdk-client-test" || "$container" == "upgrade-test-controller" ]]; then
echo "----- Logs from pod: $pod, container: $container -----"
kubectl logs "$pod" -c "$container" || echo "Failed to retrieve logs from $pod/$container"
fi
done
done

log_url="https://console.cloud.google.com/logs/query;storageScope=storage,projects%2F${PROJECT_ID}%2Flocations%2Fglobal%2Fbuckets%2Fupgrade-test-container-logs%2Fviews%2F_AllLogs;query=$(printf 'resource.labels.container_name=(\"upgrade-test-controller\" OR \"sdk-client-test\")' | jq -sRr @uri);cursorTimestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ);duration=PT10M?project=${PROJECT_ID}"
echo "Logs from log bucket: $log_url"
}
Collaborator


LGTM, but I've got a question: are we sure this print_failure_logs won't only show logs from the last kube context, given this?

for cloudProduct in "${cloudProducts[@]}"
do
    for version in "${!versionsAndRegions[@]}"

I was thinking of re-doing the gcloud container clusters get-credentials "$testCluster" --region="$testClusterLocation" --project="$PROJECT_ID" call, or selecting the right context, before calling this function; it doesn't seem to be done at L162-L194.

Collaborator

@lacroixthomas lacroixthomas Jul 21, 2025


My comment wasn't really clear, but I'm referring to the kubectl logs --tail ... calls, which might use the context from the last gcloud container clusters get-credentials ... call. Previously we only stored the PID and the log file, then waited for the pid to finish and simply printed the log, without any kubectl calls.

But with this, it might call kubectl using the context from the last cluster in the list (which might not be the one with errors / not the expected context).

If that makes sense 😄

print_failure_logs:
   ...
   kubectl logs .. 
   ...

for cloudProduct in [A, B, C]:
do
   for version in [1, 2, 3]:
      ...
gcloud container ... (retrieving context for A-1, A-2, ..., C-2, C-3, depending on where we are in the loop)
      ...
   done
done

for pid in "${pids[@]}"; do
    ...
    print_failure_logs // Wouldn't that use the kube context of C-3 all the time ?
    ...
done

Member Author


We use ttlSecondsAfterFinished: 100, so the Job and its pods are deleted very quickly after they finish. Because of that, when we try to get the logs later using kubectl logs, the pods might already be gone, and we can't capture the logs.

But in our case, we do capture the logs earlier, right after waiting for the pod to become ready:

# Wait for the pod to become ready (or timeout)
if ! kubectl wait --for=condition=ready pod -l job-name=upgrade-test-runner --timeout=5m; then
    echo "ERROR: The pod for job 'upgrade-test-runner' did not become ready within the timeout period."
    print_failure_logs "$testCluster"
    exit 1
fi

Collaborator


That part looks all good with the wait, but what about this one:

for pid in "${pids[@]}"; do
...
print_failure_logs ..
...
done

It feels like the calls to print_failure_logs from this loop will all use the same kube context though, compared to how it was logged before.

L183 and L191
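
A minimal sketch of the fix being suggested here, reusing the get-credentials call quoted above; mapping each pid back to its cluster is an assumption (cloudbuild.yaml would need to track that pairing):

for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        # re-select the kube context of the cluster this pid belongs to,
        # instead of inheriting whichever context get-credentials set last
        gcloud container clusters get-credentials "$testCluster" --region="$testClusterLocation" --project="$PROJECT_ID"
        print_failure_logs "$testCluster"
    fi
done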

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 2a72b07d-9cb4-41fd-bd6e-7c9dce29bdcf

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-ace1202

@igooch igooch merged commit 23f0f75 into googleforgames:main Jul 21, 2025
4 checks passed
@lacroixthomas lacroixthomas mentioned this pull request Aug 13, 2025
Labels
kind/cleanup Refactoring code, fixing up documentation, etc size/M size/S
Development

Successfully merging this pull request may close these issues.

Flaky: submit-upgrade-test-cloud-build
5 participants