
huydhn
Contributor

@huydhn huydhn commented Mar 3, 2023

Follow-up to #95938, where a new Docker image build fails to start sccache. The issue started happening today (Mar 3rd). The server fails to start with a cryptic `sccache: error: Invalid argument (os error 22)`:

```
=================== sccache compilation log ===================
ERROR 2023-03-03T20:31:14Z: sccache::server: failed to start server: Invalid argument (os error 22)

sccache: error: Invalid argument (os error 22)

=========== If your build fails, please take a look at the log above for possible reasons ===========

+ sccache --show-stats
sccache: error: Connection to server timed out
```

I don't have a good explanation for this yet. The version of sccache we build from https://github.com/pytorch/sccache is ancient. If I build the exact same version on an Ubuntu Docker image now, the issue manifests. But the older binary built only a few days ago (https://hud.pytorch.org/pytorch/pytorch/commit/e50ff3fcdb3890ce3bbab99e60b1c27ff49be2af) works without any issue. So I pin the sccache binary to that version instead of rebuilding it every time in the image, as a temporary mitigation while I try to root-cause this further.
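One way to tell "the server never started" apart from "the server is slow to answer" is to probe the port directly. The following is a rough sketch, not how the sccache client itself works: `server_reachable` is a hypothetical helper, and port 4226 is taken from the bind address quoted later in this thread.

```rust
use std::net::{Ipv4Addr, SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP connection to
/// 127.0.0.1:<port> can be established within `timeout`.
fn server_reachable(port: u16, timeout: Duration) -> bool {
    let addr = SocketAddr::from((Ipv4Addr::LOCALHOST, port));
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

fn main() {
    // Assumption: the server is expected on 127.0.0.1:4226.
    let port = 4226;
    if server_reachable(port, Duration::from_secs(1)) {
        println!("something is listening on 127.0.0.1:{}", port);
    } else {
        println!("nothing reachable on 127.0.0.1:{}", port);
    }
}
```

If nothing is reachable right after the server is launched, the failure is on the server side, which would be consistent with the `failed to start server` line in the log.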

@huydhn huydhn requested review from ZainRizvi and a team March 3, 2023 21:15
@pytorch-bot

pytorch-bot bot commented Mar 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95997

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b7ec0c1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 3, 2023
@huydhn huydhn marked this pull request as ready for review March 3, 2023 22:26
@huydhn huydhn requested a review from jeffdaily as a code owner March 3, 2023 22:26
@huydhn
Contributor Author

huydhn commented Mar 3, 2023

Some of the steps above failed because I forgot to grant read access to the binary on S3. Once the pull workflow finishes, retrying will work. All looks green now.

@huydhn huydhn requested a review from malfet March 3, 2023 23:02
@huydhn
Contributor Author

huydhn commented Mar 4, 2023

I probably need someone with Rust knowledge to debug the issue further. Here are my findings so far. The server failed to start because it couldn't bind to the port (invalid argument, errno 22). Our sccache fork uses this old crate, https://crates.io/crates/tokio-tcp/0.1.4, for the job:

```
use tokio_tcp::TcpListener;
...
let addr = SocketAddrV4::new(Ipv4Addr::new(127, 0, 0, 1), port);
let listener = TcpListener::bind(&SocketAddr::V4(addr))?; // <--- FAILING HERE
...
```

The code has not changed, so I still can't tell why it started failing now. I confirmed that no running process is holding the sccache port. So the only explanation left from https://man7.org/linux/man-pages/man2/bind.2.html is:

EINVAL addrlen is wrong, or addr is not a valid address for this socket's domain.

But AFAIK, the code looks reasonable, binding to "127.0.0.1:4226". I have already created a task, T136185557, to update the sccache version, but I lack the Rust knowledge to finish it. That would be the long-term fix for this issue.
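To narrow this down outside of sccache, the same bind can be attempted with the standard library alone. The sketch below is a diagnostic under stated assumptions, not code from the fork: it uses `std::net::TcpListener` instead of `tokio-tcp`, the helper names `try_bind` and `describe` are made up for illustration, and it assumes Linux, where `EINVAL` is errno 22.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddr, SocketAddrV4, TcpListener};

/// Hypothetical helper: bind a loopback listener on `port` and
/// return the address the OS actually bound.
fn try_bind(port: u16) -> io::Result<SocketAddr> {
    let addr = SocketAddr::V4(SocketAddrV4::new(Ipv4Addr::new(127, 0, 0, 1), port));
    let listener = TcpListener::bind(addr)?;
    listener.local_addr()
}

/// Hypothetical helper: describe a bind failure, surfacing the raw errno.
/// Assumption: Linux errno numbering, where EINVAL is 22.
fn describe(err: &io::Error) -> String {
    match err.raw_os_error() {
        Some(22) => "EINVAL (22): the kernel rejected the bind arguments".to_string(),
        Some(code) => format!("os error {}: {}", code, err),
        None => format!("non-OS error: {}", err),
    }
}

fn main() {
    // Same address as in the comment above: 127.0.0.1:4226.
    match try_bind(4226) {
        Ok(addr) => println!("bound {}", addr),
        Err(e) => println!("bind failed: {}", describe(&e)),
    }
}
```

If this binds successfully on the same image where sccache fails, the problem is more likely in how the old crate stack sets up the socket than in the address itself.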

@huydhn
Contributor Author

huydhn commented Mar 4, 2023

@ZainRizvi Let's get this merged to unblock your PR and others like #95896, which updates the triton pinned commit. Anything touching the Docker image would be blocked otherwise.

@ngimel The builds in #95896 that failed due to sccache will be fixed once this PR is merged.

@ngimel
Collaborator

ngimel commented Mar 4, 2023

@huydhn great, thanks! #95896 also needs CMake v 3.20 (we currently have v 3.17), can we update that?

@huydhn
Contributor Author

huydhn commented Mar 6, 2023

> @huydhn great, thanks! #95896 also needs CMake v 3.20 (we currently have v 3.17), can we update that?

FYI, the PR to upgrade the CMake version is here: pytorch/builder#1331

Contributor

@ZainRizvi ZainRizvi left a comment


Thanks for debugging this!

Any idea who owns our sccache fork? It would be good to raise this issue to them (if it even has an owner)

@huydhn
Contributor Author

huydhn commented Mar 6, 2023

> Any idea who owns our sccache fork? It would be good to raise this issue to them (if it even has an owner)

Ed was the last one seen updating it (https://github.com/pytorch/sccache/commits/master), but I think it's likely that we are the owner. Let me bring this up for discussion this week.

@huydhn
Contributor Author

huydhn commented Mar 6, 2023

@pytorchbot merge -f 'Docker build and all pull jobs have passed. The cache looks fine. We can skip trunk and save some trees'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
Pull Request resolved: pytorch/pytorch#95997
Approved by: https://github.com/ZainRizvi
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
Pull Request resolved: pytorch#95997
Approved by: https://github.com/ZainRizvi
@malfet
Contributor

malfet commented Mar 6, 2024

We should have a follow-up task for this one, to go back to the source builds.

pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
This essentially reverts #95997, but switches to building from source from Mozilla's official sccache repo for CPU builds, except the PCH one; see #139188
- Define `SCCACHE_REGION` for the jobs that need it.
- Enable aarch64 builds to use sccache, which allows one to do incremental rebuilds under 10 min, see https://github.com/pytorch/pytorch/actions/runs/11565944328/job/32197278296

Fixes #121559
Pull Request resolved: #121323
Approved by: https://github.com/atalman
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Pull Request resolved: pytorch#121323
Approved by: https://github.com/atalman