
Conversation

0xaravindh
Member

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind cleanup

What this PR does / Why we need it:

Which issue(s) this PR fixes:

Closes #4163

Special notes for your reviewer:

@0xaravindh 0xaravindh self-assigned this Apr 28, 2025
@0xaravindh 0xaravindh requested a review from markmandel April 28, 2025 10:32
@github-actions github-actions bot added the kind/cleanup Refactoring code, fixing up documentation, etc label Apr 28, 2025
@0xaravindh 0xaravindh requested a review from igooch April 28, 2025 10:32
@agones-bot
Collaborator

Build Failed 😭

Build Id: 944b0a49-63f1-4ec5-841b-2586ca33a489

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: b5ecf3aa-2b70-4515-beb7-830a0fbd96df

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: c605097d-0c93-4011-8873-71f2b8cbb039

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 58b3ccdb-181b-49d1-a136-c35cb6fd8ffe

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@lacroixthomas
Collaborator

lacroixthomas commented Apr 28, 2025

Regarding the build issue around jq: it seems the submit-upgrade-test-cloud-build job in cloudbuild.yaml uses the gcr.io/google.com/cloudsdktool/cloud-sdk image and doesn't use the Dockerfile from test/upgrade/Dockerfile; it applies the files directly with kubectl. If we want to use jq, maybe we can install it in the args of submit-upgrade-test-cloud-build?

I'm not fully sure about it, but from what I see, jq is properly installed in the Docker image for push-upgrade-test, but that image is only used by the test itself:

image: us-docker.pkg.dev/agones-images/ci/upgrade-test-controller:${DevVersion}

@0xaravindh
Member Author

> Regarding the build issue around jq: it seems the submit-upgrade-test-cloud-build job in cloudbuild.yaml uses the gcr.io/google.com/cloudsdktool/cloud-sdk image and doesn't use the Dockerfile from test/upgrade/Dockerfile; it applies the files directly with kubectl. If we want to use jq, maybe we can install it in the args of submit-upgrade-test-cloud-build?
>
> I'm not fully sure about it, but from what I see, jq is properly installed in the Docker image for push-upgrade-test, but that image is only used by the test itself.


Got it! Installing jq in the args section.
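
For illustration, a minimal sketch of what that could look like in the submit-upgrade-test-cloud-build step of cloudbuild.yaml. The step id and image come from this thread; the entrypoint/args layout and the apt-get line are assumptions (the cloud-sdk image is Debian-based, so apt-get is available):

- name: gcr.io/google.com/cloudsdktool/cloud-sdk
  id: submit-upgrade-test-cloud-build
  entrypoint: bash
  args:
    - -c
    - |
      # assumed addition: install jq up front, since the cloud-sdk
      # image does not bundle it by default
      apt-get update && apt-get install -y jq
      # ... existing script continues (kubectl apply, wait loops, etc.)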

@agones-bot
Collaborator

Build Failed 😭

Build Id: f25955e4-5711-4d6a-8dae-bccf6032af92

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: b404b589-c45b-41bb-b09c-0ec7a655aabe

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 3faa7b62-2ab3-48e0-a03e-1e5961ebd386

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

cloudbuild.yaml Outdated
Comment on lines 387 to 394

echo "===== Job status for upgrade-test-runner on cluster ${testCluster} ====="
kubectl get job/upgrade-test-runner -o json | jq '.status' > "${tmpdir}/${testCluster}-job-status.json"
cat "${tmpdir}/${testCluster}-job-status.json"

echo "===== Extracting job conditions (if available) ====="
job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
echo "Job conditions output: $job_conditions"
Collaborator

@lacroixthomas lacroixthomas Apr 28, 2025


We could maybe move this part after the if wait $pid; then (L400), to print the values once the process is complete? Otherwise the status/conditions might not be available yet, since the wait that produces the statuses runs in the background and may not have completed.

Member Author


After the upgrade-test-runner Job finishes (successfully or unsuccessfully), Kubernetes waits 100 seconds before deleting the Job. If we move that code to that line, we hit this issue:

Error from server (NotFound): jobs.batch "upgrade-test-runner" not found

ttlSecondsAfterFinished: 100
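
For context, a minimal sketch of where that setting sits in the Job manifest. The ttlSecondsAfterFinished field is standard batch/v1; the container details below are illustrative, borrowing the image reference quoted earlier in this thread:

apiVersion: batch/v1
kind: Job
metadata:
  name: upgrade-test-runner
spec:
  # Kubernetes deletes the finished Job (and its pods) after 100s,
  # which is why a later kubectl get job/upgrade-test-runner returns NotFound
  ttlSecondsAfterFinished: 100
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: upgrade-test-controller
          image: us-docker.pkg.dev/agones-images/ci/upgrade-test-controller:${DevVersion}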

Collaborator


Oh I see, hmm.
I'm not sure the log is useful here then, because it would only print the status/conditions from the very beginning of the job's creation; we would never get the values once the statuses actually settle, in error or not.

But if we want to keep them, maybe we need the same design as the wait with waitPids, to access the values once the job has all its statuses set to True?

What do you think?
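
As a rough bash sketch of that waitPids idea (the cluster list, job name, and timeout here are placeholders, not taken from cloudbuild.yaml):

# background one wait per cluster and collect the PIDs, so the status
# is only read after the corresponding job has actually completed
waitPids=()
for testCluster in "${testClusters[@]}"; do
  kubectl wait --for=condition=complete job/upgrade-test-runner --timeout=30m &
  waitPids+=($!)
done

for pid in "${waitPids[@]}"; do
  if wait "$pid"; then
    # the job is done, and still readable within the ttlSecondsAfterFinished window
    kubectl get job/upgrade-test-runner -o json | jq '.status'
  fi
done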

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 8d22703b-c2d3-4183-adfd-9cdb11f3eccc

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.49.0-dev-9ce27fd

@agones-bot
Collaborator

Build Failed 😭

Build Id: 54d61520-601e-400e-9fc9-a4f622905000

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 3e31e567-6f19-44dc-ad11-7ec9af416624

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 5aee9245-329a-4a45-a47d-af6ff87e2f8b

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 836cc0fa-4d37-4343-9fbc-001440b562f5

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: 81fa3be8-5fc9-48b6-9e69-3825259762ad

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: dbd98f3f-ba68-4442-a137-3780dfaf460d

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😭

Build Id: 34410939-111c-4f78-8efe-8fc9797c5335

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: a94c3251-a40e-48b8-bc82-1f92d09b16f0

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: bfe9c5ff-304c-442a-ad4c-fc29ac4c5b64

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-4ece9f4

Collaborator

@igooch igooch left a comment


Good work! One comment on printing the log output.

done
done

echo "Logs from log bucket: https://console.cloud.google.com/logs/query;storageScope=storage,projects%2F${PROJECT_ID}%2Flocations%2Fglobal%2Fbuckets%2F${logBucketName}%2Fviews%2F_AllLogs;query=$(printf 'resource.labels.container_name=(\"upgrade-test-controller\" OR \"sdk-client-test\")' | jq -sRr @uri);cursorTimestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ);duration=PT10M?project=${PROJECT_ID}"
Collaborator


It looks like this is only printed to the console in the case that the test fails while waiting on the job to start. Could you pull this logic out into a separate function, and also call it in the case that one of the pids exits with a non-zero status later in the test?

Member Author


Moved the log printing into a function and called it in both failure cases.
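
For clarity, a sketch of the two call sites after that change. The first block is quoted verbatim later in this thread; the second mirrors the pids loop discussed below (variable names per that discussion):

# call site 1: the pod never becomes ready
if ! kubectl wait --for=condition=ready pod -l job-name=upgrade-test-runner --timeout=5m; then
    echo "ERROR: The pod for job 'upgrade-test-runner' did not become ready within the timeout period."
    print_failure_logs "$testCluster"
    exit 1
fi

# call site 2: a background wait on one of the collected pids fails
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        print_failure_logs "$testCluster"
    fi
done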

@agones-bot
Collaborator

Build Failed 😭

Build Id: 27fa6b18-ac95-41b3-91e4-d14456cad9cf

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@0xaravindh
Member Author

/gcbrun

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 4c123848-f0bf-4b76-bcda-20b4445d1304

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-0570528

@@ -0,0 +1,196 @@
#!/usr/bin/env bash
Collaborator


🔥 🔥 🔥

Collaborator

@igooch igooch left a comment


LGTM

@igooch igooch enabled auto-merge (squash) July 21, 2025 16:46
@igooch igooch requested a review from lacroixthomas July 21, 2025 16:46
Comment on lines +32 to +50
print_failure_logs() {
local cluster_name=$1
echo "ERROR: Upgrade test failed on cluster: $cluster_name"

job_pods=$(kubectl get pods -l job-name=upgrade-test-runner -o jsonpath="{.items[*].metadata.name}")
for pod in $job_pods; do kubectl logs --tail=20 "$pod" || echo "Unable to retrieve logs for pod: $pod"; done
for pod in $job_pods; do
containers=$(kubectl get pod "$pod" -o jsonpath='{.spec.containers[*].name}')
for container in $containers; do
if [[ "$container" == "sdk-client-test" || "$container" == "upgrade-test-controller" ]]; then
echo "----- Logs from pod: $pod, container: $container -----"
kubectl logs "$pod" -c "$container" || echo "Failed to retrieve logs from $pod/$container"
fi
done
done

log_url="https://console.cloud.google.com/logs/query;storageScope=storage,projects%2F${PROJECT_ID}%2Flocations%2Fglobal%2Fbuckets%2Fupgrade-test-container-logs%2Fviews%2F_AllLogs;query=$(printf 'resource.labels.container_name=(\"upgrade-test-controller\" OR \"sdk-client-test\")' | jq -sRr @uri);cursorTimestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ);duration=PT10M?project=${PROJECT_ID}"
echo "Logs from log bucket: $log_url"
}
Collaborator


LGTM, but I've got a question: are we sure this print_failure_logs won't only show logs from the last kube context, given this?

for cloudProduct in "${cloudProducts[@]}"
do
    for version in "${!versionsAndRegions[@]}"

I was thinking of re-doing the gcloud container clusters get-credentials "$testCluster" --region="$testClusterLocation" --project="$PROJECT_ID" call, or selecting the right context, before calling this function; it doesn't seem to be done at L162-L194.

Collaborator

@lacroixthomas lacroixthomas Jul 21, 2025


My comment wasn't really clear, but I'm referring to the kubectl logs --tail ... calls, which might use the context from the last gcloud container clusters get-credentials ... call. Previously we only stored the PID and the log file, then waited for the pid to finish and simply printed the log, without any kubectl calls.

But with this, it might call kubectl using the context from the last cluster in the list (which might not be the one with errors / not the expected context).

If that makes sense 😄

print_failure_logs:
   ...
   kubectl logs .. 
   ...

for cloudProduct in [A, B, C]:
do
   for version in [1, 2, 3]:
      ...
gcloud container ... (retrieving context for A-1, A-2, ..., C-2, C-3, depending on where we are in the loop)
      ...
   done
done

for pid in "${pids[@]}"; do
    ...
    print_failure_logs // Wouldn't that use the kube context of C-3 all the time ?
    ...
done

Member Author


We use ttlSecondsAfterFinished: 100, so the Job and its pods are deleted very quickly after they finish. Because of that, when we try to get the logs later using kubectl logs, the pods might already be gone, and we can't capture the logs.

But in our case, we do capture the logs earlier, right after waiting for the pod to become ready:

# Wait for the pod to become ready (or timeout)
if ! kubectl wait --for=condition=ready pod -l job-name=upgrade-test-runner --timeout=5m; then
    echo "ERROR: The pod for job 'upgrade-test-runner' did not become ready within the timeout period."
    print_failure_logs "$testCluster"
    exit 1
fi

Collaborator


That part looks all good with the wait, but what about this one:

for pid in "${pids[@]}"; do
...
print_failure_logs ..
...
done

It feels like the calls to print_failure_logs from this loop will all use the same kube context though, compared to how it was logged before.

L183 and L191
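
A minimal sketch of the fix being suggested here, reusing the get-credentials call quoted above; mapping each pid back to its cluster is an assumption (cloudbuild.yaml would need to track that pairing):

for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        # re-select the kube context of the cluster this pid belongs to,
        # instead of inheriting whichever context get-credentials set last
        gcloud container clusters get-credentials "$testCluster" --region="$testClusterLocation" --project="$PROJECT_ID"
        print_failure_logs "$testCluster"
    fi
done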

@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 2a72b07d-9cb4-41fd-bd6e-7c9dce29bdcf

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4165/head:pr_4165 && git checkout pr_4165
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.51.0-dev-ace1202

@igooch igooch merged commit 23f0f75 into googleforgames:main Jul 21, 2025
4 checks passed
@lacroixthomas lacroixthomas mentioned this pull request Aug 13, 2025
Labels
kind/cleanup Refactoring code, fixing up documentation, etc size/M size/S
Development

Successfully merging this pull request may close these issues.

Flaky: submit-upgrade-test-cloud-build
5 participants