Use working pre-built sccache binary #95997
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95997
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit b7ec0c1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I probably need someone with Rust knowledge to debug the issue further, so here are my findings so far. The server failed to start because it couldn't bind to its port (invalid argument, errno 22). Our sccache fork uses the old tokio-tcp crate (https://crates.io/crates/tokio-tcp/0.1.4) for this.
The code has not changed, so I still can't tell why it started failing now. I confirmed that there is no other process holding the sccache port. So the only explanation left from https://man7.org/linux/man-pages/man2/bind.2.html is the EINVAL case: the socket is already bound to an address, or the address is not valid for the socket's domain.
But AFAIK the code looks reasonable, binding to "127.0.0.1:4226". I already created a task T136185557 to update the sccache version, but I lack the Rust knowledge to finish it. That would be the long-term fix for this issue.
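For context, here is a minimal sketch (not the actual sccache server code; the function name is illustrative) of where such a bind failure would surface when using the old tokio-tcp 0.1 API and the address mentioned above:

```rust
// Minimal sketch, not the real sccache server code: shows where a bind(2)
// failure such as EINVAL (os error 22) would surface with the old
// tokio-tcp 0.1 crate referenced above.
use std::net::SocketAddr;
use tokio_tcp::TcpListener;

// Illustrative function name, not taken from the sccache sources.
fn start_server_socket() -> std::io::Result<TcpListener> {
    // Same local address the comment above refers to.
    let addr: SocketAddr = "127.0.0.1:4226".parse().expect("valid socket address");
    // tokio_tcp::TcpListener::bind creates a socket and ultimately calls
    // bind(2)/listen(2) on it; an "Invalid argument (os error 22)" from the
    // kernel comes back here as an io::Error rather than a panic.
    TcpListener::bind(&addr)
}
```

Under the assumption that the error really comes from bind(2) itself (rather than an earlier socket-option call), a sketch like this run inside the same Ubuntu base image should reproduce the errno.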
@ZainRizvi Let's get this merged to unblock your PR and others like #95896 that update the Triton pinned commit. Anything touching the Docker image would be blocked otherwise. @ngimel The builds that failed due to sccache in #95896 will be fixed once this PR is merged.
FYI, the PR to upgrade the cmake version is here: pytorch/builder#1331
Thanks for debugging this!
Any idea who owns our sccache fork? It would be good to raise this issue with them (if it even has an owner).
Ed was last seen updating it (https://github.com/pytorch/sccache/commits/master), but I think it's likely that we are the owner. Let me bring this up for discussion this week.
@pytorchbot merge -f 'Docker build and all pull jobs have passed. The cache looks fine. We can skip trunk and save some trees'
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
From pytorch/pytorch#95938, where a new Docker image build fails to start sccache. This issue started happening today (Mar 3rd). The server fails to start with a cryptic `sccache: error: Invalid argument (os error 22)`:

```
=================== sccache compilation log ===================
ERROR 2023-03-03T20:31:14Z: sccache::server: failed to start server: Invalid argument (os error 22)
sccache: error: Invalid argument (os error 22)
=========== If your build fails, please take a look at the log above for possible reasons ===========
+ sccache --show-stats
sccache: error: Connection to server timed out
```

I don't have a good explanation for this yet. The version of sccache we build from https://github.com/pytorch/sccache is ancient. If I build the exact same version on the Ubuntu Docker image now, the issue manifests. But an older binary built only a few days ago (https://hud.pytorch.org/pytorch/pytorch/commit/e50ff3fcdb3890ce3bbab99e60b1c27ff49be2af) works without any issue. So, as a temporary mitigation while trying to root-cause this further, I pin the sccache binary to that version instead of rebuilding it every time in the image.

Pull Request resolved: pytorch/pytorch#95997
Approved by: https://github.com/ZainRizvi
We should have a follow-up task for this one, to go back to building sccache from source.
This essentially reverts #95997, but switches the from-source builds to the official Mozilla sccache repo for CPU builds, except the PCH one, see #139188.
- Define `SCCACHE_REGION` for the jobs that need it.
- Enable aarch64 builds to use sccache, which allows incremental rebuilds in under 10 minutes; see https://github.com/pytorch/pytorch/actions/runs/11565944328/job/32197278296

Fixes #121559
Pull Request resolved: #121323
Approved by: https://github.com/atalman