
Releases: aws/aws-ofi-nccl

AWS OFI NCCL v1.16.3

13 Aug 00:53
v1.16.3

v1.16.3 (2025-08)

The 1.16.3 release series supports NCCL v2.27.7-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0amzn5.0.
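
A build or deployment script can check an installed Libfabric version string against this minimum before building. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the version has the form MAJOR.MINOR.PATCH with an optional `amzn` revision, and that all parts compare numerically from left to right. Under that assumption an upstream version without an `amzn` suffix sorts below the same version with one, which may not match how the plugin's own configure check behaves.

```python
import re

# Assumed layout: MAJOR.MINOR.PATCH, optionally followed by "amzn" and a
# one- or two-part Amazon revision (for example "amzn4.0" or "amzn3").
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:amzn(\d+)(?:\.(\d+))?)?$")


def parse_libfabric_version(text):
    """Turn a version string into a tuple of integers that sorts numerically."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized Libfabric version: {text!r}")
    # Missing parts (no amzn suffix, or no amzn minor number) are treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(installed, minimum="1.22.0amzn4.0"):
    """Return True when `installed` is at or above `minimum`."""
    return parse_libfabric_version(installed) >= parse_libfabric_version(minimum)


if __name__ == "__main__":
    for candidate in ("1.21.0amzn1.0", "1.22.0amzn4.0", "2.1.0amzn5.0"):
        print(candidate, meets_minimum(candidate))
```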

Bug Fixes and Improvements:

  • Enable domain-per-thread by default on all AWS instance types, which improves performance for some applications in which NCCL creates multiple proxy threads

Checksum (sha512) for the release tarball aws-ofi-nccl-1.16.3.tar.gz:

56abd446d2c4376dd9e5579d4777f997d97d99202e0a0a0d45985de78ebf3aab14e13e6577bb0a3cd28db8c2a984105a686993ffed1ecae528b1bd6335519c7a  aws-ofi-nccl-1.16.3.tar.gz
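
One way to check a downloaded tarball against a published checksum line is sketched below. The function names are hypothetical, and the sketch assumes the published line has the form shown above: the hex digest first, optionally followed by the file name.

```python
import hashlib
import hmac


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file without loading it all at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_line):
    """Compare a file's digest with the first field of a published checksum line."""
    published_hex = published_line.split()[0].lower()
    return hmac.compare_digest(sha512_of_file(path), published_hex)


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <tarball> "<checksum line from the release notes>"
    tarball, line = sys.argv[1], sys.argv[2]
    print("OK" if matches_published_checksum(tarball, line) else "MISMATCH")
```

The same helper applies to every release tarball listed on this page; only the path and the checksum line change.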

AWS OFI NCCL v1.16.2

26 Jul 01:37
v1.16.2

v1.16.2 (2025-07)

The 1.16.2 release series supports NCCL v2.27.6-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0amzn4.0.

Bug Fixes and Improvements:

  • Added a new platform configuration to support using the OFI NCCL plugin on the AWS p5.4xlarge instance type

Checksum (sha512) for the release tarball aws-ofi-nccl-1.16.2.tar.gz:

5f675f4467e919c79100ec1875b344fb8f73e05552269f05d840a2f084e95d9ac75b739eeb3a9f81049fec11c1fbe4b4d872971d7dbef6b89e1ba3b0abebc8e9  aws-ofi-nccl-1.16.2.tar.gz

AWS OFI NCCL v1.16.1

23 Jul 01:39
v1.16.1

v1.16.1 (2025-07)

The 1.16.1 release series supports NCCL v2.27.6-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0amzn3.

Bug Fixes and Improvements:

  • Update the PCI link speed format reported in the topology file to match kernel 5.7+
  • Added SKIP_NICS_WITHOUT_ACCEL_AT_SAME_PCI_LEVEL to skip Libfabric NICs that do not have an accelerator at the same PCI level

Checksum (sha512) for the release tarball aws-ofi-nccl-1.16.1.tar.gz:

033da7df476484f3368ca00676923ee28d6fd09e598e2f5020fa980f24908064fbe25f318aa60ca736f48eba01eba4839854761710a0f8dbe54d060185a83ad1  aws-ofi-nccl-1.16.1.tar.gz

AWS OFI NCCL v1.16.0

27 Jun 23:26
v1.16.0

v1.16.0 (2025-06)

The 1.16.0 release series supports NCCL v2.27.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0amzn3.

Bug Fixes and Improvements:

  • On AWS platforms, the plugin may set the following environment variables: NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_NVLSTREE_MAX_CHUNKSIZE, NCCL_NVLS_CHUNKSIZE, and NCCL_NET_FORCE_FLUSH
  • Fix bug that prevented communicators from aborting gracefully, as part of supporting NCCL fault tolerance features
  • On AWS platforms, enable collective algorithm tuner by default
  • Update the P6-B200 tuner configuration to improve performance for 4 to 32 MiB messages across node counts and for large-message AllReduce on 8 nodes
  • Added libnccl-tuner-ofi.so symlink for easier configuration with NCCL_TUNER_PLUGIN=ofi
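
Because the plugin may now set several NCCL variables itself on AWS platforms, it can be useful to know which of them a job already sets explicitly before launch. The sketch below only reports what is present in the environment. The release note does not say whether the plugin overrides values the user has set, so the sketch makes no claim about precedence, and the function name is hypothetical.

```python
import os

# Variable names taken from the v1.16.0 release note above.
PLUGIN_MANAGED_VARS = (
    "NCCL_BUFFSIZE",
    "NCCL_P2P_NET_CHUNKSIZE",
    "NCCL_NVLSTREE_MAX_CHUNKSIZE",
    "NCCL_NVLS_CHUNKSIZE",
    "NCCL_NET_FORCE_FLUSH",
)


def explicitly_set(environ=None):
    """Return the plugin-managed variables that are already set, with their values."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in PLUGIN_MANAGED_VARS if name in env}


if __name__ == "__main__":
    found = explicitly_set()
    if found:
        for name, value in found.items():
            print(f"{name} is set explicitly to {value!r}")
    else:
        print("None of the plugin-managed NCCL variables are set explicitly.")
```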

Checksum (sha512) for the release tarball aws-ofi-nccl-1.16.0.tar.gz:

079635016d1e12407e072f7c0023d45074a9bb60ceeee23b4e82b3be8b2dbf7944eb57a03e33e57242da06386027a4fb3eb64f7c5d85f4a84215072d8b23a8fc  aws-ofi-nccl-1.16.0.tar.gz

AWS OFI NCCL v1.15.0

04 Jun 04:18
v1.15.0

v1.15.0 (2025-06)

The 1.15.x release series supports NCCL 2.26.6-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0amzn3.

Bug Fixes and Improvements:

  • Build system and platform support
    • Added AWS P6-B200 platform support
    • Changed the default plugin library name to libnccl-net-ofi.so. By default, a symlink from libnccl-net-ofi.so to libnccl-net.so is created to maintain backward compatibility. This allows users to set NCCL_NET_PLUGIN=ofi to force NCCL to use the OFI plugin for communication. Passing --disable-nccl-net-symlink to configure skips the symlink, allowing multiple plugins to be installed in the same container.
  • Tuning and performance improvements
    • Added tuner support on P6-B200 for AllReduce, AllGather, and ReduceScatter regions for 0x0 and 0x7 bitmask
    • Updated default latency for P5en and P6-B200 platforms based on empirical results and analysis
  • Update to use NCCL v10 API with trafficClass parameter support for future traffic prioritization
  • Migrated plugin code base from C to C++
  • Added support for jobs where the number of NICs per GPU is different across systems. See the OFI_NCCL_FORCE_NUM_RAILS runtime environment variable documentation for more information.
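
The naming change above means that NCCL_NET_PLUGIN=ofi corresponds to a library file named libnccl-net-ofi.so. A container build could confirm the file is present with a check like the one sketched below. The helper names are hypothetical, and the directory list is supplied by the caller; the sketch does not reproduce the dynamic linker's real search order.

```python
import os


def net_plugin_library_name(plugin_name):
    """Map a short plugin name such as "ofi" to its library file name."""
    return f"libnccl-net-{plugin_name}.so"


def find_plugin(plugin_name, search_dirs):
    """Return the first matching library path in `search_dirs`, or None."""
    filename = net_plugin_library_name(plugin_name)
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: the directories in LD_LIBRARY_PATH are the ones worth checking.
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    print(find_plugin("ofi", dirs) or "libnccl-net-ofi.so not found in LD_LIBRARY_PATH")
```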

OFI NCCL plugin runtime environment variable changes:

Deprecated environment variables

  • OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS
  • OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS

New environment variables

  • OFI_NCCL_SCHED_MAX_SMALL_RR_SIZE
  • OFI_NCCL_RDMA_MIN_POSTED_EAGER_BUFFERS
  • OFI_NCCL_RDMA_MAX_POSTED_EAGER_BUFFERS
  • OFI_NCCL_RDMA_MIN_POSTED_CONTROL_BUFFERS
  • OFI_NCCL_RDMA_MAX_POSTED_CONTROL_BUFFERS
  • OFI_NCCL_CQ_SIZE

Updated environment variable defaults

  • OFI_NCCL_RR_CTRL_MSG: default changed from 0 to 1
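
Job scripts carried over from earlier releases may still export the deprecated variables. A pre-launch check along the following lines can flag them. This is a sketch with a hypothetical function name; it deliberately does not map deprecated names to the new ones, because the release notes do not state a one-to-one replacement.

```python
import os

# Deprecated in v1.15.0, per the list above.
DEPRECATED_VARS = (
    "OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS",
    "OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS",
)


def find_deprecated(environ=None):
    """Return the deprecated variable names that are present, sorted by name."""
    env = os.environ if environ is None else environ
    return sorted(name for name in DEPRECATED_VARS if name in env)


if __name__ == "__main__":
    for name in find_deprecated():
        print(f"warning: {name} is deprecated as of aws-ofi-nccl v1.15.0")
```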

Checksum (sha512) for the release tarball aws-ofi-nccl-1.15.0.tar.gz:

9d529512927d3b2d1387f942283846889d0679dfd21b427f72e90d89d43bceb301e9f839a0290df3accb1ca9929818e811b94517241722becf6878d6d8646242  aws-ofi-nccl-1.15.0.tar.gz

AWS OFI NCCL v1.14.2

26 Apr 05:30
v1.14.2

v1.14.2 (2025-04)

This is a general release that is broadly applicable and is designed to be used with any network that can satisfy the network capabilities the plugin requires, as expressed through the Libfabric API's provider discovery mechanism. We are expanding our test coverage to continue making general releases going forward. If you would like to facilitate this effort to get coverage for networks you intend to use the plugin with, please reach out to us.

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0.

The 1.14.x release series supports NCCL 2.26.2-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

Improvements:

  • Enable DMA-BUF by default, but blocklist DMA-BUF on EFA versions 1-3 due to a known issue on those platforms.

Checksum (sha512) for the release tarball aws-ofi-nccl-1.14.2.tar.gz:

68488362185222818070456e141a51aa7e4afafdbd403018bed618063969b63c62c194eeac58f23bad96e484f48ed76c5c4c8a845d9129dfbfffc649ea919521  aws-ofi-nccl-1.14.2.tar.gz

AWS OFI NCCL v1.14.1

08 Apr 04:04
v1.14.1

v1.14.1 (2025-04)

This is a general release that is broadly applicable and is designed to be used with any network that can satisfy the network capabilities the plugin requires, as expressed through the Libfabric API's provider discovery mechanism. We are expanding our test coverage to continue making general releases going forward. If you would like to facilitate this effort to get coverage for networks you intend to use the plugin with, please reach out to us.

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with Libfabric versions up to 2.1.0.

The 1.14.x release series supports NCCL 2.26.2-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

Bug Fixes and Improvements:

  • Fixed an issue in the sendrecv protocol that would result in a leaking MR keys warning with some providers.

These changes improve compatibility with libfabric 2.0 and enhance the overall reliability of the plugin, particularly in scenarios involving memory registration and connection establishment.

Checksum (sha512) for the release tarball aws-ofi-nccl-1.14.1.tar.gz:

188c84750cce0121f6abd090c9d7bc419dab095a2224292fc8c79d4653cf72955a30777211318f8cfaff87d689a6ac1f6daddb7144db986611b8ddb1f6602ca5

AWS OFI NCCL v1.14.0

14 Mar 22:39
v1.14.0

v1.14.0 (2025-03)

Releases v1.7.0-aws through v1.13.2-aws were intended only for use on AWS P* instances. With this release, we are resuming general releases that are broadly applicable and are designed to be used with any network that can satisfy the network capabilities the plugin requires, as expressed through the Libfabric API's provider discovery mechanism.

We are expanding our test coverage to continue making general releases going forward. If you would like to facilitate this effort to get coverage for networks you intend to use the plugin with, please reach out to us.

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. We generally recommend that AWS customers track the latest available EFA Installer for performance improvements and bug fixes.

The 1.14.x release series supports NCCL 2.26.2-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

Bug Fixes and Improvements:

  • Transport Enhancements:

    • Added memory descriptor handling for control messages to properly support FI_MR_LOCAL.
    • RDMA Transport Enhancements: Added NCCL receive request early completion support when provider data progress model is FI_PROGRESS_AUTO
  • Tuning Improvements:

    • Modified tuner behavior to default to NCCL internal tuner on two-node configurations.
      This change addresses outlier performance issues in two-node scenarios.

These changes improve compatibility with libfabric 2.0 and enhance the overall reliability of the plugin, particularly in scenarios involving memory registration and connection establishment.

Checksum (sha512) for the release tarball:

d0943ecea58d4335e59f007275789eee2da9acad639d3a46f676d71525e6161a65c875602ccbeef7ade54339c2388137cf0e767402c5bfc8eb77637651b05c46

AWS OFI NCCL v1.13.2

11 Dec 17:58
v1.13.2-aws

v1.13.2-aws (2024-12-06)

This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. We generally recommend that AWS customers track the latest available EFA Installer for performance improvements and bug fixes.

The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

Bug Fixes:

  • Tuner Improvements:
    • Fixed algorithm selection for larger ranks and message sizes.
    • Re-calibrated the tuner for AllGather and ReduceScatter regions for 0x7 bitmask on P5en, optimizing performance for larger messages.
    • Added tuner support for AllGather and ReduceScatter regions for 0x0 bitmask on P5en.
  • Resolved a performance issue by preventing the eager protocol when RDMA writes are in flight, improving small AllReduce collective performance.

Note: dmabuf support is now turned off by default. Users can enable it explicitly using OFI_NCCL_DISABLE_DMABUF=0 if needed.

Checksum (sha512) for the release tarball:

4c0ac3144f178062fda9e86b50bb1784822e8fdbdffadf41cdbb30839456c4e912254ff12a5b0a8c63abbe910597fd14211a42572a451d10e01932100013971e  aws-ofi-nccl-1.13.2-aws.tar.gz

AWS OFI NCCL v1.13.1

26 Nov 23:10
v1.13.1-aws

v1.13.1-aws (2024-11-26)

This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. We generally recommend that AWS customers track the latest available EFA Installer for performance improvements and bug fixes.

The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

Supported Distributions

  • Amazon Linux 2
  • Amazon Linux 2023
  • Ubuntu 20.04 LTS and 22.04 LTS

For releases before v1.6.0, we generally created releases from two separate branches, an AWS-specific branch and a general release branch. With v1.6.0, we have unified the code into a single branch, and made the AWS-specific parts a compile-time option. When a feature (or entire release) only supports one of the two variants, we note that in the release notes.

What's Changed

This release contains no functional changes compared to v1.13.0-aws. This release merely updates the version set in AC_INIT to include the -aws suffix to match the tag name and ensure generated artifacts are named correctly.

Checksum (sha512) for the release tarball:

b71afd2e7776b77392c91abb818fa011e415f31fa9061556cd725d7a52eb4101b45a10fe91284ec7cff06a9653456e95ae70a472affb32f68e01b1ce5e49ff83  aws-ofi-nccl-1.13.1-aws.tar.gz