Skip to content

Conversation

2015aroras
Copy link
Collaborator

@2015aroras 2015aroras commented Nov 22, 2023

This PR adds the option to get full file/directory paths when listing entries of an archive file or directory. For archive files, this may change the path from a cloud URL to a local file path. This PR has multiple benefits:

  • Users of list_entries no longer need to worry about whether the path corresponds to an archive file or a directory when they are using the results. They just need to be mindful that the entries could be in a different storage system.
  • Previously, there was a hacky way of dealing with runs having an extra nested dir. Now _get_archive_run_entry_paths clearly points out this issue and deals with both the presence and absence of the extra nested dir.

PR Train:

  1. [Storage Cleaner] Add script for removing bad runs from local & cloud storage #364
  2. [Storage Cleaner] Migrate to cached_path #378
  3. This PR [Storage Cleaner] Add option to get full paths when listing entries #379
  4. [Storage Cleaner] Add more basic functionality to storage adapters #380
  5. [Storage Cleaner] Add unsharding to storage cleaner #381
  6. [Storage Cleaner] Handle some legacy checkpoints in unsharding #382

Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but would resolve_path(s) be a better argument name that full_path?

@2015aroras
Copy link
Collaborator Author

LGTM, but would resolve_path(s) be a better argument name that full_path?

If I called list_entries on gs://bucket/foo/, the result when full_path is true would be gs://bucket/foo/bar/ and when full_path is false would be bar/. I don't think resolve_path makes sense in this example. I do agree that the name full_path is not as nice for archive files, where the path could be completely different.

@epwalsh
Copy link
Member

epwalsh commented Nov 22, 2023

LGTM, but would resolve_path(s) be a better argument name that full_path?

If I called list_entries on gs://bucket/foo/, the result when full_path is true would be gs://bucket/foo/bar/ and when full_path is false would be bar/. I don't think resolve_path makes sense in this example. I do agree that the name full_path is not as nice for archive files, where the path could be completely different.

Fair enough!

# The unarchived file could have a redundant top-level directory. If the top-level
# directory has only a directory, we should return that directory's entries instead.
# We do not pass max_file_size to avoid accidentally skipping files.
entry_paths = storage.list_entries(run_archive_path, full_path=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I call this with full_path=True, it will download the archive (if necessary), and then return a local path? And it always does this, so you can assert on it later? Why is it called full_path then? I would expect full_path to just return completely absolute paths. Maybe it should be called full_local_path or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird from an API perspective that list_entries would be a function that can force a full download, and whether it does is controlled with a flag called full_path?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants