Skip to content

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Feb 6, 2023

Slow POC implementation.

image

#8761

@efiop efiop force-pushed the not_in_remote branch 4 times, most recently from 74d934c to 0ae96fd Compare February 7, 2023 00:08
@codecov
Copy link

codecov bot commented Feb 7, 2023

Codecov Report

Base: 93.01% // Head: 93.06% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (197d1b7) compared to base (ea6e01d).
Patch coverage: 76.00% of modified lines in pull request are covered.

❗ Current head 197d1b7 differs from pull request most recent head a85abcc. Consider uploading reports for the commit a85abcc to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8962      +/-   ##
==========================================
+ Coverage   93.01%   93.06%   +0.04%     
==========================================
  Files         457      456       -1     
  Lines       36824    36821       -3     
  Branches     5339     5333       -6     
==========================================
+ Hits        34252    34266      +14     
+ Misses       2052     2028      -24     
- Partials      520      527       +7     
Impacted Files Coverage Δ
dvc/commands/data.py 98.37% <ø> (ø)
dvc/repo/worktree.py 10.00% <0.00%> (ø)
dvc/data_cloud.py 82.19% <66.66%> (ø)
dvc/repo/data.py 99.14% <100.00%> (-0.86%) ⬇️
dvc/repo/index.py 89.30% <100.00%> (-1.74%) ⬇️
tests/func/test_data_status.py 100.00% <100.00%> (ø)
dvc/repo/experiments/apply.py 76.74% <0.00%> (-12.15%) ⬇️
dvc/repo/experiments/queue/celery.py 85.92% <0.00%> (-1.86%) ⬇️
dvc/commands/cache.py 90.90% <0.00%> (-1.78%) ⬇️
... and 29 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@efiop efiop force-pushed the not_in_remote branch 2 times, most recently from 39c8c15 to f7a5a32 Compare February 10, 2023 00:05
@efiop efiop force-pushed the not_in_remote branch 3 times, most recently from 5b5fe23 to a768a60 Compare February 10, 2023 03:55
@efiop efiop changed the title [WIP] data: status: introduce --not-in-remote [WIP] data: status: introduce --not-in-remote for regular remotes and cloud versioning Feb 10, 2023
@efiop efiop changed the title [WIP] data: status: introduce --not-in-remote for regular remotes and cloud versioning [WIP] data: status: introduce --not-in-remote Feb 10, 2023
@@ -114,6 +117,8 @@ def run(self) -> int:
status = self.repo.data_status(
granular=self.args.granular,
untracked_files=self.args.untracked_files,
not_in_cache=True,
not_in_remote=self.args.not_in_remote,
Copy link
Collaborator

@skshetry skshetry Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to drop the not_in_remote/not_in_cache entries below.

if not self.args.not_in_remote:
	status.pop("not_in_remote")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean for json?

@efiop efiop force-pushed the not_in_remote branch 4 times, most recently from 6786610 to 21fa8e6 Compare February 10, 2023 23:10
efiop added a commit to efiop/dvc-data that referenced this pull request Feb 12, 2023
This does a few things:

1) partitions storage into 3 categories:
    * `data` - immediately accessible data contents (e.g. right in your workspace)
    * `cache` - data contents stashed somewhere aside
    * `remote` - data contents stoshed remotely (kinda like cold cache)

2) makes so that each storage can be either object-based (our typical odbs) or
    filesystem-based (e.g. cloud versioning or just a plain old backup copy).

3) settles `save` convention as being a transfer of data contents from `data`
    storage (e.g. workspace) to `cache` storage. But also allows to pick other
    storage type to get the data from (e.g. `remote`, which is how we use it
    for cloud-versioning)

4) settles `checkout` convention as being a transfer of data contents from `cache`,
    but also allows one to pick other storage type to get the data from (e.g. `data`,
    which is how we use it for cloud-versioning, or `remote` for cases where having a
    an intermediate cache is undesirable).

These things will simplify our current cloud-versioning operations and will
also allow us to introduce `dvcfs` support (which also means `dvc.api` support)
for it (coming in a separate PR later). Also this allows us to use persistent index
with `FileStorage` in iterative/dvc#8962 (again coming in
a separate PR).
efiop added a commit to efiop/dvc-data that referenced this pull request Feb 12, 2023
This does a few things:

1) partitions storage into 3 categories:
    * `data` - immediately accessible data contents (e.g. right in your workspace)
    * `cache` - data contents stashed somewhere aside
    * `remote` - data contents stoshed remotely (kinda like cold cache)

2) makes so that each storage can be either object-based (our typical odbs) or
    filesystem-based (e.g. cloud versioning or just a plain old backup copy).

3) settles `save` convention as being a transfer of data contents from `data`
    storage (e.g. workspace) to `cache` storage. But also allows to pick other
    storage type to get the data from (e.g. `remote`, which is how we use it
    for cloud-versioning)

4) settles `checkout` convention as being a transfer of data contents from `cache`,
    but also allows one to pick other storage type to get the data from (e.g. `data`,
    which is how we use it for cloud-versioning, or `remote` for cases where having a
    an intermediate cache is undesirable).

These things will simplify our current cloud-versioning operations and will
also allow us to introduce `dvcfs` support (which also means `dvc.api` support)
for it (coming in a separate PR later). Also this allows us to use persistent index
with `FileStorage` in iterative/dvc#8962 (again coming in
a separate PR).
efiop added a commit to iterative/dvc-data that referenced this pull request Feb 12, 2023
This does a few things:

1) partitions storage into 3 categories:
    * `data` - immediately accessible data contents (e.g. right in your workspace)
    * `cache` - data contents stashed somewhere aside
    * `remote` - data contents stoshed remotely (kinda like cold cache)

2) makes so that each storage can be either object-based (our typical odbs) or
    filesystem-based (e.g. cloud versioning or just a plain old backup copy).

3) settles `save` convention as being a transfer of data contents from `data`
    storage (e.g. workspace) to `cache` storage. But also allows to pick other
    storage type to get the data from (e.g. `remote`, which is how we use it
    for cloud-versioning)

4) settles `checkout` convention as being a transfer of data contents from `cache`,
    but also allows one to pick other storage type to get the data from (e.g. `data`,
    which is how we use it for cloud-versioning, or `remote` for cases where having a
    an intermediate cache is undesirable).

These things will simplify our current cloud-versioning operations and will
also allow us to introduce `dvcfs` support (which also means `dvc.api` support)
for it (coming in a separate PR later). Also this allows us to use persistent index
with `FileStorage` in iterative/dvc#8962 (again coming in
a separate PR).
@efiop efiop changed the title [WIP] data: status: introduce --not-in-remote [WIP] data: status: show not in remote Feb 13, 2023
@efiop
Copy link
Contributor Author

efiop commented Feb 13, 2023

For the record: reworked a few things to show not in remote by default, which is the most useful for users (both regular and cloud versioning remotes work now and the code is the same, which is important).

The hiccup is related to pre-requisites to finally using a persistent index here (for remotes specifically) so that we don't ring the cloud every time.

Things that are left to do:

  1. use index we build in cloud versioning in storage_map
  2. use persistent index by default for ^ and introduce a flag to refresh it (e.g. --remote-update as suggested in cloud versioning: status command #8761 (comment))

@efiop efiop force-pushed the not_in_remote branch 2 times, most recently from d1e37a0 to 197d1b7 Compare February 14, 2023 23:20
@efiop
Copy link
Contributor Author

efiop commented Feb 15, 2023

This is taking way too long and there are still some questions on how to index http remotes (since listing is not part of the standard). Hate to do this, but I'll bring back --not-in-remote for now in the interest of time 🙁

@efiop efiop force-pushed the not_in_remote branch 2 times, most recently from 3962ddc to fbc1844 Compare February 15, 2023 02:28
@efiop efiop changed the title [WIP] data: status: show not in remote [WIP] data: status: introduce --not-in-remote Feb 15, 2023
efiop added a commit to efiop/dvc that referenced this pull request Feb 15, 2023
This allows us to reuse cloud versioning index that we've built in other
operations like `data status`.

Note, this only works with `dvc config feature.data_index_cache true`
right now.

Also note that index is not populated for regular remotes, it will come
in a separate PR as there are some http-related issues to figure out.

Related iterative#8962
efiop added a commit that referenced this pull request Feb 15, 2023
This allows us to reuse cloud versioning index that we've built in other
operations like `data status`.

Note, this only works with `dvc config feature.data_index_cache true`
right now.

Also note that index is not populated for regular remotes, it will come
in a separate PR as there are some http-related issues to figure out.

Related #8962
@efiop efiop force-pushed the not_in_remote branch 3 times, most recently from c776286 to c9c85d3 Compare February 16, 2023 01:45
@dberenbaum
Copy link
Contributor

dberenbaum commented Feb 16, 2023

This is taking way too long and there are still some questions on how to index http remotes (since listing is not part of the standard). Hate to do this, but I'll bring back --not-in-remote for now in the interest of time 🙁

I think it's fine to merge it this way.

However, I did run into really slow performance when doing dvc data status --granular --not-in-remote. I haven't tested with persistent index (how do I enable it?). Here's what I did:

# Add 2800 files
$ dvc get git@github.com:iterative/dataset-registry.git use-cases/cats-dogs
$ dvc add cats-dogs
# Configure remote
$ dvc remote add -d -f versioned s3://dave-sandbox-versioning/test/versioned
$ dvc remote modify versioned version_aware true
# Get status
$ time dvc data status --granular --not-in-remote
...
dvc data status --granular --not-in-remote  12.77s user 1.42s system 14% cpu 1:37.57 total

I'm not even sure we actually need to bother with checking the cloud here. If there's no cloud version ID in the .dvc file, should we just report that it's not in remote?

Edit: Okay, I tested with the persistent index and it seems fine, but I'd still like to discuss the point above, especially if we aren't ready for the persistent index to be the default yet. In this case, isn't the .dvc file basically a persistent index already?

@efiop
Copy link
Contributor Author

efiop commented Feb 17, 2023

@dberenbaum Great idea with filling from version_id's in dvcfiles!

Sorry for the pause here, trying to move persistent index to within the .dvc/ directory (currently it is global, which was done in the hopes to help with NFS use case but it turned out that wouldn't help), so that we could enable persistent index by default. The version_id trick would definitely work, but I'm a bit worried about cases like corrupted dvc files that would result in invalid percieved state. Even if we go with that trick, we'll still need --remote-update to be able to fix potential bad states and then we would again need to save to the persistent index to be able to use that later. But using that trick to do initial remote index fill-up sounds pretty good to me.

@dberenbaum
Copy link
Contributor

If there's no cloud version ID in the .dvc file, should we just report that it's not in remote?

Discussed with @efiop that for this specific scenario where the version_id is missing, an optimization would be to not check the remote at all. This should be safe to do since no matter what versions exist on the remote, we don't have a local version_id to compare. This is not a blocker if performance looks fine otherwise.

@efiop efiop force-pushed the not_in_remote branch 4 times, most recently from 1d8e7bc to 2999959 Compare February 20, 2023 05:42
Also supports cloud versioning.

Fixes iterative#8761
@efiop efiop marked this pull request as ready for review February 20, 2023 10:25
@efiop efiop changed the title [WIP] data: status: introduce --not-in-remote data: status: introduce --not-in-remote Feb 20, 2023
@efiop efiop merged commit c34cbf6 into iterative:main Feb 20, 2023
}

assert dvc.data_status(
granular=True, untracked_files="all", not_in_remote=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

untracked_files="all" may be unnecessary to test here.

join("dir", "bar"),
)
},
"uncommitted": {},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for "uncommitted" here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants