Add logs reporting to submit-upgrade-test-cloud-build for better visibility #4165
Conversation
Build Failed 😭 Build Id: 944b0a49-63f1-4ec5-841b-2586ca33a489 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: b5ecf3aa-2b70-4515-beb7-830a0fbd96df Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: c605097d-0c93-4011-8873-71f2b8cbb039 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: 58b3ccdb-181b-49d1-a136-c35cb6fd8ffe Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Regarding the build issue: I'm not fully sure about it, but from what I see, jq is properly installed on the Docker image referenced at agones/test/upgrade/upgradeTest.yaml, line 29 in 97b07cc.

Got it! Installing jq in the args section.
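For reference, a minimal sketch of what "installing jq in the args section" could look like, assuming the build step's image is Debian/Ubuntu-based (the package manager and base image here are assumptions, not the actual change in this PR):

```bash
# Hypothetical commands run by the Cloud Build step's args, assuming a Debian/Ubuntu-based image.
# Install jq only if it is not already present, then verify it works.
if ! command -v jq >/dev/null 2>&1; then
  apt-get update && apt-get install -y jq
fi
jq --version
```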
Build Failed 😭 Build Id: f25955e4-5711-4d6a-8dae-bccf6032af92 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: b404b589-c45b-41bb-b09c-0ec7a655aabe Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: 3faa7b62-2ab3-48e0-a03e-1e5961ebd386 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
cloudbuild.yaml (outdated)

    echo "===== Job status for upgrade-test-runner on cluster ${testCluster} ====="
    kubectl get job/upgrade-test-runner -o json | jq '.status' > "${tmpdir}/${testCluster}-job-status.json"
    cat "${tmpdir}/${testCluster}-job-status.json"

    echo "===== Extracting job conditions (if available) ====="
    job_conditions=$(kubectl get job/upgrade-test-runner -o json | jq -c 'if .status.conditions then .status.conditions[] | select(.status=="True") | {type, reason, message} else empty end')
    echo "Job conditions output: $job_conditions"
Could we maybe move this part after the if wait $pid; then (L400), so the values are printed once the process is complete? Otherwise the status / conditions might not be available yet, since the call we wait on to get the statuses runs in the background.
After the upgrade-test-runner Job finishes (either successfully or unsuccessfully), Kubernetes waits 100 seconds before deleting the Job. If we move that code to that point, we hit this error:

    Error from server (NotFound): jobs.batch "upgrade-test-runner" not found

agones/test/upgrade/upgradeTest.yaml, line 24 in 97b07cc:

    ttlSecondsAfterFinished: 100
Oh I see, hmm.
I'm not sure the log is useful here then, because it would only print the status / conditions from the very beginning of the Job's creation; we would never get the values once the statuses are in error or not.
But if we want to keep them, maybe we need to follow the same design as the wait with waitPids, so we can read the values once all of the Job's statuses have gone to True?
What do you think?
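A rough sketch of what that could look like (the capture_job_status helper, the polling loop, and the file naming below are assumptions for illustration, not the PR's actual code): snapshot the Job status to a per-cluster file while the Job still exists, and only print the snapshots after the corresponding wait.

```bash
# Hypothetical sketch: snapshot the Job status to a per-cluster file once the Job reaches a
# terminal condition, but before ttlSecondsAfterFinished deletes the Job object.
capture_job_status() {
  local cluster=$1
  local out="${tmpdir}/${cluster}-job-status.json"
  # Poll until a Complete or Failed condition is True, then snapshot the whole status block.
  until kubectl get job/upgrade-test-runner -o json \
    | jq -e '.status.conditions[]? | select((.type=="Complete" or .type=="Failed") and .status=="True")' > /dev/null; do
    sleep 10
  done
  kubectl get job/upgrade-test-runner -o json | jq '.status' > "$out"
}

# Per cluster, while its kube context is still active: run `capture_job_status "$testCluster" &`
# and track the background pid, mirroring how waitPids is handled.

# After waiting on all pids, print the snapshots instead of querying the (possibly deleted) Jobs:
for f in "${tmpdir}"/*-job-status.json; do
  echo "===== Job status from $(basename "$f") ====="
  cat "$f"
done
```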
Build Succeeded 🥳 Build Id: 8d22703b-c2d3-4183-adfd-9cdb11f3eccc The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Build Failed 😭 Build Id: 54d61520-601e-400e-9fc9-a4f622905000 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: 3e31e567-6f19-44dc-ad11-7ec9af416624 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: 5aee9245-329a-4a45-a47d-af6ff87e2f8b Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: 836cc0fa-4d37-4343-9fbc-001440b562f5 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
/gcbrun
Build Failed 😭 Build Id: 81fa3be8-5fc9-48b6-9e69-3825259762ad Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: dbd98f3f-ba68-4442-a137-3780dfaf460d Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😭 Build Id: 34410939-111c-4f78-8efe-8fc9797c5335 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
/gcbrun
Build Failed 😭 Build Id: a94c3251-a40e-48b8-bc82-1f92d09b16f0 Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
/gcbrun
Build Succeeded 🥳 Build Id: bfe9c5ff-304c-442a-ad4c-fc29ac4c5b64 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Good work! One comment on printing the log output.
build/e2e_upgrade_test.sh (outdated)

    done
    done

    echo "Logs from log bucket: https://console.cloud.google.com/logs/query;storageScope=storage,projects%2F${PROJECT_ID}%2Flocations%2Fglobal%2Fbuckets%2F${logBucketName}%2Fviews%2F_AllLogs;query=$(printf 'resource.labels.container_name=(\"upgrade-test-controller\" OR \"sdk-client-test\")' | jq -sRr @uri);cursorTimestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ);duration=PT10M?project=${PROJECT_ID}"
It looks like this is only printed to the console in the case that the test fails while waiting on the job to start. Could you pull this logic out into a separate function, and also call it in the case that one of the pids exits with a non-zero status later in the test?
Moved the log printing into a function and called it in both failure cases.
Build Failed 😭 Build Id: 27fa6b18-ac95-41b3-91e4-d14456cad9cf Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.
/gcbrun
Build Succeeded 🥳 Build Id: 4c123848-f0bf-4b76-bcda-20b4445d1304 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
@@ -0,0 +1,196 @@

    #!/usr/bin/env bash
🔥 🔥 🔥
LGTM
    print_failure_logs() {
      local cluster_name=$1
      echo "ERROR: Upgrade test failed on cluster: $cluster_name"

      job_pods=$(kubectl get pods -l job-name=upgrade-test-runner -o jsonpath="{.items[*].metadata.name}")
      kubectl logs --tail=20 "$job_pods" || echo "Unable to retrieve logs for pod: $job_pods"
      for pod in $job_pods; do
        containers=$(kubectl get pod "$pod" -o jsonpath='{.spec.containers[*].name}')
        for container in $containers; do
          if [[ "$container" == "sdk-client-test" || "$container" == "upgrade-test-controller" ]]; then
            echo "----- Logs from pod: $pod, container: $container -----"
            kubectl logs "$pod" -c "$container" || echo "Failed to retrieve logs from $pod/$container"
          fi
        done
      done

      log_url="https://console.cloud.google.com/logs/query;storageScope=storage,projects%2F${PROJECT_ID}%2Flocations%2Fglobal%2Fbuckets%2Fupgrade-test-container-logs%2Fviews%2F_AllLogs;query=$(printf 'resource.labels.container_name=(\"upgrade-test-controller\" OR \"sdk-client-test\")' | jq -sRr @uri);cursorTimestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ);duration=PT10M?project=${PROJECT_ID}"
      echo "Logs from log bucket: $log_url"
    }
LGTM, but I have one question: are we sure this print_failure_logs won't only show logs from the last kube context set by this loop?

    for cloudProduct in "${cloudProducts[@]}"
    do
      for version in "${!versionsAndRegions[@]}"

I was thinking of re-doing the gcloud container clusters get-credentials "$testCluster" --region="$testClusterLocation" --project="$PROJECT_ID", or otherwise switching to the right context before calling this function; it doesn't seem to be done from L162-L194.
My comment wasn't really clear, but I'm referring to the calls to kubectl logs --tail .., which might use the context from the last call to gcloud container clusters get-credentials ... Previously we were only storing the PID and the log file, then waiting for the pid to finish and simply printing that log file, without any kubectl calls.
But with this change, kubectl might run against the context of the last cluster in the list (which might not be the one with errors / not the expected context).
If that makes sense 😄

    print_failure_logs:
      ...
      kubectl logs ..
      ...

    for cloudProduct in [A, B, C]:
    do
      for version in [1, 2, 3]:
        ...
        gcloud container ... (retrieving context for A-1, A-2, ..., C-2, C-3, depending on where we are in the loop)
        ...
      done
    done

    for pid in "${pids[@]}"; do
      ...
      print_failure_logs  # Wouldn't that use the kube context of C-3 all the time?
      ...
    done
Since we use ttlSecondsAfterFinished: 100, the Job and its pods are deleted very quickly after they finish. Because of that, if we try to get the logs later using kubectl logs, the pods might already be gone, so we can't capture the logs.
But in our case, we do capture the logs earlier, right after waiting for the pod to be ready:

    # Wait for the pod to become ready (or timeout)
    if ! kubectl wait --for=condition=ready pod -l job-name=upgrade-test-runner --timeout=5m; then
      echo "ERROR: The pod for job 'upgrade-test-runner' did not become ready within the timeout period."
      print_failure_logs "$testCluster"
      exit 1
    fi
That part looks all good with the wait, but what about this one:

    for pid in "${pids[@]}"; do
      ...
      print_failure_logs ..
      ...
    done

It feels like the calls to print_failure_logs from this loop will all use the same kube context, compared to how it was logged before (L183 and L191).
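One possible shape of that suggestion, assuming the cluster name and location are tracked per pid (the clusterForPid / locationForPid arrays below are hypothetical, not from this PR):

```bash
# Hypothetical sketch: point kubectl at the cluster a given pid belongs to before printing its
# failure logs, so the pids loop does not reuse the context of the last cluster in the list.
declare -A clusterForPid locationForPid  # filled in where each background test run is started

for pid in "${pids[@]}"; do
  if ! wait "$pid"; then
    testCluster="${clusterForPid[$pid]}"
    testClusterLocation="${locationForPid[$pid]}"
    # Re-acquire credentials for the failing cluster before any kubectl logs calls.
    gcloud container clusters get-credentials "$testCluster" \
      --region="$testClusterLocation" --project="$PROJECT_ID"
    print_failure_logs "$testCluster"
  fi
done
```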
Build Succeeded 🥳 Build Id: 2a72b07d-9cb4-41fd-bd6e-7c9dce29bdcf The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
What type of PR is this?
/kind cleanup
What this PR does / Why we need it:
Which issue(s) this PR fixes:
Closes #4163
Special notes for your reviewer: