-
Notifications
You must be signed in to change notification settings - Fork 3.1k
AIStore with Webdataset #13604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AIStore with Webdataset #13604
Conversation
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
…ebdataset_url_fix
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
…ebdataset_url_fix
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
…ebdataset_url_fix
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com>
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
nemo/utils/data_utils.py
Outdated
return None | ||
|
||
|
||
@lru_cache(maxsize=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@monica-sekoyan let's not copy the code from lhotse into nemo for this purpose. This would cause a lot of headaches later as we'll have to modify AIS related functionality in both repos.
It'd be better to instead move/copy the necessary functions from nemo/utils/data_utils.py
into nemo/collections/asr/...
collection and import lhotse there, avoiding lhotse import outside of ASR collection.
Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com>
@@ -168,6 +180,61 @@ | |||
return local_path | |||
|
|||
|
|||
def open_datastore_object_with_binary(path: str, num_retries: int = 5): |
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
To fix the issue, we need to ensure that the function open_datastore_object_with_binary
always has an explicit return statement. If the function is expected to return a file-like object, we should add a default return value (e.g., None
or a placeholder) at the end of the function to make the behavior explicit. This will align the implementation with the docstring and improve code readability.
-
Copy modified lines R230-R233
@@ -229,2 +229,6 @@ | ||
|
||
|
||
# Explicitly return None if no other return statement is reached | ||
return None | ||
|
||
def open_best(path: str, mode: str = "rb"): |
if not done: | ||
raise RuntimeError('Download failed: %s', subprocess.list2cmdline(cmd)) | ||
with open(local_path, 'wb') as f: | ||
f.write(open_best(path).read(), num_retries=num_retries) |
Check warning
Code scanning / CodeQL
File is not always closed Warning
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
To fix the issue, we should ensure that any file-like object returned by open_best(path)
is properly closed after its contents are read. The best practice is to use a with
statement, which ensures that the file is closed automatically, even if an exception occurs.
The fix involves:
- Wrapping the call to
open_best(path)
in awith
statement. - Reading the file's contents within the
with
block and writing them tof
.
This change will ensure that the file opened by open_best(path)
is always closed properly.
-
Copy modified lines R263-R264
@@ -262,3 +262,4 @@ | ||
with open(local_path, 'wb') as f: | ||
f.write(open_best(path).read(), num_retries=num_retries) | ||
with open_best(path) as source_file: | ||
f.write(source_file.read(), num_retries=num_retries) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
from nemo.utils.nemo_logging import LogMode | ||
|
||
try: | ||
from lhotse.serialization import open_best as lhotse_open_best |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this guarding looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks good. Our infra doesn't have the capacity to test all extra-requires like asr
in isolated fashion. So this approval is based on a manual review and not on automated testing. Having that said, guarding dependencies by try/except
and HAVE
is our common approach, so the change appears to be safe.
Thanks for following up!
…ebdataset_url_fix
5995c92
* refactor webdataset url creation Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * add lhotse_open to wds Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * change manifest_caching method Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * remove monkey patching repetition Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * change open_best mode Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * handle modulenotfound in __init__.py Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * modify aistore reading schema Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * add lhotse open_best support Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * remove aistore sdk support Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> * Apply isort and black reformatting Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> * fix test Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> --------- Signed-off-by: Monica Sekoyan <msekoyan@nvidia.com> Signed-off-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> Co-authored-by: monica-sekoyan <monica-sekoyan@users.noreply.github.com> Co-authored-by: Charlie Truong <chtruong@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
Important
The
Update branch
button must only be pressed in very rare occassions.An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Added AIStore Python SDK support for Webdataset dataloaders (previously was only for lhotse). This fixes the issue when ais binary is being forced to be used in all the scenarios.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information