Skip to content

AIStore with Webdataset #13604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Jun 2, 2025
Merged

AIStore with Webdataset #13604

merged 31 commits into from
Jun 2, 2025

Conversation

monica-sekoyan
Copy link
Collaborator

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Added AIStore Python SDK support for Webdataset dataloaders (previously was only for lhotse). This fixes the issue when ais binary is being forced to be used in all the scenarios.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

monica-sekoyan and others added 22 commits April 7, 2025 04:02
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
return None


@lru_cache(maxsize=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monica-sekoyan let's not copy the code from lhotse into nemo for this purpose. This would cause a lot of headaches later as we'll have to modify AIS related functionality in both repos.

It'd be better to instead move/copy the necessary functions from nemo/utils/data_utils.py into nemo/collections/asr/... collection and import lhotse there, avoiding lhotse import outside of ASR collection.

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
@@ -168,6 +180,61 @@
return local_path


def open_datastore_object_with_binary(path: str, num_retries: int = 5):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note

Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.

Copilot Autofix

AI 3 months ago

To fix the issue, we need to ensure that the function open_datastore_object_with_binary always has an explicit return statement. If the function is expected to return a file-like object, we should add a default return value (e.g., None or a placeholder) at the end of the function to make the behavior explicit. This will align the implementation with the docstring and improve code readability.


Suggested changeset 1
nemo/utils/data_utils.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/nemo/utils/data_utils.py b/nemo/utils/data_utils.py
--- a/nemo/utils/data_utils.py
+++ b/nemo/utils/data_utils.py
@@ -229,2 +229,6 @@
 
+
+    # Explicitly return None if no other return statement is reached
+    return None
+
 def open_best(path: str, mode: str = "rb"):
EOF
@@ -229,2 +229,6 @@


# Explicitly return None if no other return statement is reached
return None

def open_best(path: str, mode: str = "rb"):
Copilot is powered by AI and may make mistakes. Always verify output.
if not done:
raise RuntimeError('Download failed: %s', subprocess.list2cmdline(cmd))
with open(local_path, 'wb') as f:
f.write(open_best(path).read(), num_retries=num_retries)

Check warning

Code scanning / CodeQL

File is not always closed Warning

File is opened but is not closed.

Copilot Autofix

AI 3 months ago

To fix the issue, we should ensure that any file-like object returned by open_best(path) is properly closed after its contents are read. The best practice is to use a with statement, which ensures that the file is closed automatically, even if an exception occurs.

The fix involves:

  1. Wrapping the call to open_best(path) in a with statement.
  2. Reading the file's contents within the with block and writing them to f.

This change will ensure that the file opened by open_best(path) is always closed properly.


Suggested changeset 1
nemo/utils/data_utils.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/nemo/utils/data_utils.py b/nemo/utils/data_utils.py
--- a/nemo/utils/data_utils.py
+++ b/nemo/utils/data_utils.py
@@ -262,3 +262,4 @@
             with open(local_path, 'wb') as f:
-                f.write(open_best(path).read(), num_retries=num_retries)
+                with open_best(path) as source_file:
+                    f.write(source_file.read(), num_retries=num_retries)
 
EOF
@@ -262,3 +262,4 @@
with open(local_path, 'wb') as f:
f.write(open_best(path).read(), num_retries=num_retries)
with open_best(path) as source_file:
f.write(source_file.read(), num_retries=num_retries)

Copilot is powered by AI and may make mistakes. Always verify output.
pzelasko
pzelasko previously approved these changes May 16, 2025
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pzelasko pzelasko requested a review from ko3n1g May 16, 2025 14:02
from nemo.utils.nemo_logging import LogMode

try:
from lhotse.serialization import open_best as lhotse_open_best
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ko3n1g could you take a look to confirm this import guard is sufficient to avoid the previous issues in the reverted PR #12905 ? Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this guarding looks good!

ko3n1g
ko3n1g previously approved these changes May 27, 2025
Copy link
Collaborator

@ko3n1g ko3n1g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good. Our infra doesn't have the capacity to test all extra-requires like asr in isolated fashion. So this approval is based on a manual review and not on automated testing. Having that said, guarding dependencies by try/except and HAVE is our common approach, so the change appears to be safe.

Thanks for following up!

pablo-garay
pablo-garay previously approved these changes May 28, 2025
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
@ko3n1g ko3n1g disabled auto-merge June 2, 2025 12:41
@ko3n1g ko3n1g merged commit 338fb20 into main Jun 2, 2025
239 of 241 checks passed
@ko3n1g ko3n1g deleted the msekoyan/webdataset_url_fix branch June 2, 2025 12:41
nasretdinovr pushed a commit to nasretdinovr/NeMo that referenced this pull request Aug 8, 2025
* refactor webdataset url creation

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* add lhotse_open to wds

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* change manifest_caching method

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* remove monkey patching repetition

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* change open_best mode

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* handle modulenotfound in __init__.py

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* modify aistore reading schema

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* add lhotse open_best support

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* remove aistore sdk support

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>

* fix test

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>

---------

Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Co-authored-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants