[Bug]: copy_to_directory fails with host!=exec RBE #466

@anguslees

What happened?

I'm using Bazel on a darwin/arm64 host, pointing at a linux/amd64 RBE cluster. copy_to_directory fails with "cannot execute binary file" because Bazel tries to run a linux/amd64 binary locally.

Here is one example failure, where copy_to_directory is invoked as part of rules_oci's oci_pull:

ERROR: /private/var/tmp/_bazel_gus/798cdab9ca916cc0b6f69f7c876590eb/external/distroless_static_linux_arm64/BUILD.bazel:47:18: Copying files to directory distroless_static_linux_arm64/blobs/sha256 failed: (Exit 126): copy_to_directory failed: error executing command (from target @distroless_static_linux_arm64//:blobs)
  (cd /private/var/tmp/_bazel_gus/798cdab9ca916cc0b6f69f7c876590eb/execroot/com_canva_infrastructure && \
  exec env - \
  external/copy_to_directory_linux_amd64/copy_to_directory bazel-out/darwin_arm64-fastbuild/bin/external/distroless_static_linux_arm64/blobs_config.json)
# Configuration: 552dda68697b5bb17c41b4bef2af5919b102e5481624a5bb19489b4de2a07a0c
# Execution platform: //tools/build/bazel/toolchains/remote:ubuntu-act-22-04-platform
external/copy_to_directory_linux_amd64/copy_to_directory: external/copy_to_directory_linux_amd64/copy_to_directory: cannot execute binary file

Note that Bazel thinks the execution platform is my RBE platform (ubuntu-act-22-04-platform) and has selected copy_to_directory_linux_amd64 accordingly, but the action is actually executed locally on my darwin/arm64 Mac and fails ("Exit 126" and "cannot execute binary file").

After much head scratching, I discovered COPY_EXECUTION_REQUIREMENTS in @aspect_bazel_lib//lib/private/copy_common.bzl, which forces copy actions to run locally. This effectively discards all the hard work done by toolchain resolution and results in the above error.

Version

Development (host) and target OS/architectures:

host = darwin/arm64
exec = linux/amd64
target = linux/amd64

Output of bazel --version:

bazel 6.1.2

Version of the Aspect rules, or other relevant rules from your
WORKSPACE or MODULE.bazel file:

Aspect rules v1.32.1

Language(s) and/or frameworks involved:

How to reproduce

No response

Any other information?

I think the fix is either:

  • Remove all of COPY_EXECUTION_REQUIREMENTS. This is what I've done locally. I appreciate the optimisation goal, but the benefit is dubious at best given a sufficiently large and fast RBE cache, the remote asset API, and a slow network link from my client to the RBE cluster. If performance is an issue, it's certainly easier in my case to expand the CAS disk than any other part of the system. I have not confirmed whether the caveat about source TreeArtifacts not working over the remote execution API still applies, but it hasn't affected my use cases in any way I've noticed so far.
  • Stop using toolchain resolution for copy_to_directory. If we're forcing it to execute locally, then we also want to force it to use the host platform's copy_to_directory executable.
  • Lean further into local execution, and add exec_compatible_with=HOST_CONSTRAINTS on all relevant rules (or actions, with exec_groups).
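
As a rough illustration of the third option, here is an untested sketch of what pinning a single target to the host platform might look like in a user's own BUILD file. The target name and the glob pattern are hypothetical, and I'm assuming HOST_CONSTRAINTS is loaded from @local_config_platform//:constraints.bzl and that copy_to_directory accepts the common exec_compatible_with attribute. It would not help directly with targets generated inside external repositories such as the oci_pull one above; those would need the change made in the rule or macro itself.

```starlark
# Sketch only: names marked "hypothetical" are not taken from the issue.
load("@aspect_bazel_lib//lib:copy_to_directory.bzl", "copy_to_directory")
load("@local_config_platform//:constraints.bzl", "HOST_CONSTRAINTS")

copy_to_directory(
    name = "blobs_dir",  # hypothetical target name
    srcs = glob(["blobs/**"]),  # hypothetical inputs
    # Ask toolchain resolution for an execution platform that matches the
    # host, so the selected copy_to_directory binary can actually run where
    # the locally-forced action executes.
    exec_compatible_with = HOST_CONSTRAINTS,
)
```

With this constraint in place, toolchain resolution and the forced-local execution requirements should agree on where the action runs, at the cost of never using the RBE workers for these copies.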
