[Storage Cleaner] Add option to get full paths when listing entries #379

2015aroras · 2023-11-22T16:48:36Z

This PR adds the option to get full file/directory paths when listing entries of an archive file or directory. For archive files, this may change the path from a cloud URL to a local file path. This PR has multiple benefits:

Users of list_entries no longer need to worry about whether the path corresponds to an archive file or a directory when they are using the results. They just need to be mindful that the entries could be in a different storage system.
Previously, there was a hacky way of dealing with runs having an extra nested dir. Now _get_archive_run_entry_paths clearly points out this issue and deals with both the presence and absence of the extra nested dir.
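The nested-directory handling described above can be sketched roughly as follows. This is a minimal illustration on a local filesystem, not the actual `_get_archive_run_entry_paths` implementation; the function name `list_run_entry_paths` and its behavior are assumptions made for the example.

```python
from pathlib import Path
from typing import List


def list_run_entry_paths(unarchived_dir: Path) -> List[Path]:
    """Hypothetical helper: list a run's entries, skipping one redundant wrapper dir.

    If the unarchived directory contains exactly one entry and that entry is a
    directory, the wrapper's entries are returned instead.
    """
    entries = sorted(unarchived_dir.iterdir())
    if len(entries) == 1 and entries[0].is_dir():
        # Extra nested dir present: descend one level.
        return sorted(entries[0].iterdir())
    # No extra nested dir: the top-level entries are the run's entries.
    return entries
```

One caveat of this heuristic is that a run whose only real entry is a single directory would also be unwrapped.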

PR Train:


epwalsh

LGTM, but would resolve_path(s) be a better argument name than full_path?

2015aroras · 2023-11-22T17:31:15Z

> LGTM, but would resolve_path(s) be a better argument name than full_path?

If I called list_entries on gs://bucket/foo/, the result when full_path is true would be gs://bucket/foo/bar/ and when full_path is false would be bar/. I don't think resolve_path makes sense in this example. I do agree that the name full_path is not as nice for archive files, where the path could be completely different.
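For the directory case, the difference between the two flag values can be illustrated with a small sketch. The helper `with_full_paths` is hypothetical and only mimics the string behavior described in this comment; it does not call any storage backend.

```python
from typing import List


def with_full_paths(parent: str, entry_names: List[str], full_path: bool) -> List[str]:
    """Hypothetical helper: return entry names as-is, or prefixed with the parent path."""
    if not full_path:
        return list(entry_names)
    # Normalize the parent so exactly one slash separates it from each entry.
    base = parent.rstrip("/")
    return [f"{base}/{name}" for name in entry_names]


print(with_full_paths("gs://bucket/foo/", ["bar/"], full_path=True))
print(with_full_paths("gs://bucket/foo/", ["bar/"], full_path=False))
```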

epwalsh · 2023-11-22T17:32:06Z

> > LGTM, but would resolve_path(s) be a better argument name than full_path?
>
> If I called list_entries on gs://bucket/foo/, the result when full_path is true would be gs://bucket/foo/bar/ and when full_path is false would be bar/. I don't think resolve_path makes sense in this example. I do agree that the name full_path is not as nice for archive files, where the path could be completely different.

Fair enough!

dirkgr · 2023-12-01T01:48:43Z

scripts/storage_cleaner.py

+    # The unarchived file could have a redundant top-level directory. If the top-level
+    # directory has only a directory, we should return that directory's entries instead.
+    # We do not pass max_file_size to avoid accidentally skipping files.
+    entry_paths = storage.list_entries(run_archive_path, full_path=True)


So if I call this with full_path=True, it will download the archive (if necessary), and then return a local path? And it always does this, so you can assert on it later? Why is it called full_path then? I would expect full_path to just return completely absolute paths. Maybe it should be called full_local_path or something?

It seems weird from an API perspective that list_entries would be a function that can force a full download, and whether it does is controlled with a flag called full_path?
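One way to picture the alternative raised here is to make the expensive step explicit and keep listing cheap. The sketch below is only an illustration under assumptions: `unarchive` and `list_local_entries` are hypothetical names, the archive is assumed to be a trusted local tar file, and nothing here reflects the actual storage cleaner code.

```python
import tarfile
from pathlib import Path
from typing import List


def unarchive(archive_path: Path, dest_dir: Path) -> Path:
    """Hypothetical explicit step: extract a trusted tar archive into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        # Assumes the archive is trusted; untrusted archives need path checks.
        tar.extractall(dest_dir)
    return dest_dir


def list_local_entries(directory: Path) -> List[str]:
    """Hypothetical cheap step: list entries of a local directory as full paths."""
    return sorted(str(p) for p in directory.iterdir())
```

With this split, a caller decides when the extraction cost is paid, and the listing function never triggers a download or extraction on its own.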

2015aroras added 5 commits November 22, 2023 12:07

Add option to return full path when listing entries

df24f55

Remove redundant top-level directories from archive files

1c07cb5

Pass on list_entries kwargs through _get_run_entries

048dc0f

Add missed _get_run_entries call to _is_run

047a861

Update list_entries comment to clarify that results might not have same storage

9c6cd51

2015aroras force-pushed the shanea/storage-cleaner-archive-path-fix branch from 1a4e300 to 9c6cd51 Compare November 22, 2023 17:07

2015aroras requested review from dirkgr and epwalsh November 22, 2023 17:14

2015aroras marked this pull request as ready for review November 22, 2023 17:15

epwalsh approved these changes Nov 22, 2023

2015aroras mentioned this pull request Nov 22, 2023

[Storage Cleaner] Handle some legacy checkpoints in unsharding #382

Merged

dirkgr requested changes Dec 1, 2023

2015aroras closed this Dec 6, 2023

2015aroras mentioned this pull request Dec 6, 2023

[Storage Cleaner] Move unarchiving logic to cleaning jobs #390

Merged
