data: status: introduce --not-in-remote #8962

efiop · 2023-02-06T03:21:22Z

Slow POC implementation.

codecov · 2023-02-07T00:21:30Z

Codecov Report

Base: 93.01% // Head: 93.06% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (197d1b7) compared to base (ea6e01d).
Patch coverage: 76.00% of modified lines in pull request are covered.

❗ Current head 197d1b7 differs from pull request most recent head a85abcc. Consider uploading reports for the commit a85abcc to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8962      +/-   ##
==========================================
+ Coverage   93.01%   93.06%   +0.04%     
==========================================
  Files         457      456       -1     
  Lines       36824    36821       -3     
  Branches     5339     5333       -6     
==========================================
+ Hits        34252    34266      +14     
+ Misses       2052     2028      -24     
- Partials      520      527       +7

Impacted Files	Coverage Δ
dvc/commands/data.py	`98.37% <ø> (ø)`
dvc/repo/worktree.py	`10.00% <0.00%> (ø)`
dvc/data_cloud.py	`82.19% <66.66%> (ø)`
dvc/repo/data.py	`99.14% <100.00%> (-0.86%)`	⬇️
dvc/repo/index.py	`89.30% <100.00%> (-1.74%)`	⬇️
tests/func/test_data_status.py	`100.00% <100.00%> (ø)`
dvc/repo/experiments/apply.py	`76.74% <0.00%> (-12.15%)`	⬇️
dvc/repo/experiments/queue/celery.py	`85.92% <0.00%> (-1.86%)`	⬇️
dvc/commands/cache.py	`90.90% <0.00%> (-1.78%)`	⬇️
... and 29 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

skshetry · 2023-02-10T15:03:33Z

dvc/commands/data.py

@@ -114,6 +117,8 @@ def run(self) -> int:
            status = self.repo.data_status(
                granular=self.args.granular,
                untracked_files=self.args.untracked_files,
+                not_in_cache=True,
+                not_in_remote=self.args.not_in_remote,


You might need to drop the not_in_remote/not_in_cache entries below.

if not self.args.not_in_remote: status.pop("not_in_remote")

You mean for json?

This does a few things: 1) partitions storage into 3 categories: * `data` - immediately accessible data contents (e.g. right in your workspace) * `cache` - data contents stashed somewhere aside * `remote` - data contents stoshed remotely (kinda like cold cache) 2) makes so that each storage can be either object-based (our typical odbs) or filesystem-based (e.g. cloud versioning or just a plain old backup copy). 3) settles `save` convention as being a transfer of data contents from `data` storage (e.g. workspace) to `cache` storage. But also allows to pick other storage type to get the data from (e.g. `remote`, which is how we use it for cloud-versioning) 4) settles `checkout` convention as being a transfer of data contents from `cache`, but also allows one to pick other storage type to get the data from (e.g. `data`, which is how we use it for cloud-versioning, or `remote` for cases where having a an intermediate cache is undesirable). These things will simplify our current cloud-versioning operations and will also allow us to introduce `dvcfs` support (which also means `dvc.api` support) for it (coming in a separate PR later). Also this allows us to use persistent index with `FileStorage` in iterative/dvc#8962 (again coming in a separate PR).

efiop · 2023-02-13T05:04:42Z

For the record: reworked a few things to show not in remote by default, which is the most useful for users (both regular and cloud versioning remotes work now and the code is the same, which is important).

The hiccup is related to pre-requisites to finally using a persistent index here (for remotes specifically) so that we don't ring the cloud every time.

Things that are left to do:

use index we build in cloud versioning in storage_map
use persistent index by default for ^ and introduce a flag to refresh it (e.g. --remote-update as suggested in cloud versioning: status command #8761 (comment))

efiop · 2023-02-15T02:19:24Z

This is taking way too long and there are still some questions on how to index http remotes (since listing is not part of the standard). Hate to do this, but I'll bring back --not-in-remote for now in the interest of time 🙁

This allows us to reuse cloud versioning index that we've built in other operations like `data status`. Note, this only works with `dvc config feature.data_index_cache true` right now. Also note that index is not populated for regular remotes, it will come in a separate PR as there are some http-related issues to figure out. Related iterative#8962

This allows us to reuse cloud versioning index that we've built in other operations like `data status`. Note, this only works with `dvc config feature.data_index_cache true` right now. Also note that index is not populated for regular remotes, it will come in a separate PR as there are some http-related issues to figure out. Related #8962

dberenbaum · 2023-02-16T14:57:48Z

This is taking way too long and there are still some questions on how to index http remotes (since listing is not part of the standard). Hate to do this, but I'll bring back --not-in-remote for now in the interest of time 🙁

I think it's fine to merge it this way.

However, I did run into really slow performance when doing dvc data status --granular --not-in-remote. I haven't tested with persistent index (~~how do I enable it?~~). Here's what I did:

# Add 2800 files
$ dvc get git@github.com:iterative/dataset-registry.git use-cases/cats-dogs
$ dvc add cats-dogs
# Configure remote
$ dvc remote add -d -f versioned s3://dave-sandbox-versioning/test/versioned
$ dvc remote modify versioned version_aware true
# Get status
$ time dvc data status --granular --not-in-remote
...
dvc data status --granular --not-in-remote  12.77s user 1.42s system 14% cpu 1:37.57 total

I'm not even sure we actually need to bother with checking the cloud here. If there's no cloud version ID in the .dvc file, should we just report that it's not in remote?

Edit: Okay, I tested with the persistent index and it seems fine, but I'd still like to discuss the point above, especially if we aren't ready for the persistent index to be the default yet. In this case, isn't the .dvc file basically a persistent index already?

efiop · 2023-02-17T13:58:29Z

@dberenbaum Great idea with filling from version_id's in dvcfiles!

Sorry for the pause here, trying to move persistent index to within the .dvc/ directory (currently it is global, which was done in the hopes to help with NFS use case but it turned out that wouldn't help), so that we could enable persistent index by default. The version_id trick would definitely work, but I'm a bit worried about cases like corrupted dvc files that would result in invalid percieved state. Even if we go with that trick, we'll still need --remote-update to be able to fix potential bad states and then we would again need to save to the persistent index to be able to use that later. But using that trick to do initial remote index fill-up sounds pretty good to me.

dberenbaum · 2023-02-17T15:23:08Z

If there's no cloud version ID in the .dvc file, should we just report that it's not in remote?

Discussed with @efiop that for this specific scenario where the version_id is missing, an optimization would be to not check the remote at all. This should be safe to do since no matter what versions exist on the remote, we don't have a local version_id to compare. This is not a blocker if performance looks fine otherwise.

Also supports cloud versioning. Fixes iterative#8761

skshetry · 2023-02-20T10:28:18Z

tests/func/test_data_status.py

+    }
+
+    assert dvc.data_status(
+        granular=True, untracked_files="all", not_in_remote=True


untracked_files="all" may be unnecessary to test here.

skshetry · 2023-02-20T10:28:32Z

tests/func/test_data_status.py

+                join("dir", "bar"),
+            )
+        },
+        "uncommitted": {},


No need for "uncommitted" here.

efiop force-pushed the not_in_remote branch 4 times, most recently from 74d934c to 0ae96fd Compare February 7, 2023 00:08

efiop force-pushed the not_in_remote branch 2 times, most recently from 39c8c15 to f7a5a32 Compare February 10, 2023 00:05

efiop mentioned this pull request Feb 10, 2023

index: introduce storage.remote_fs/path iterative/dvc-data#309

Merged

efiop force-pushed the not_in_remote branch 3 times, most recently from 5b5fe23 to a768a60 Compare February 10, 2023 03:55

efiop changed the title ~~[WIP] data: status: introduce --not-in-remote~~ [WIP] data: status: introduce --not-in-remote for regular remotes and cloud versioning Feb 10, 2023

efiop changed the title ~~[WIP] data: status: introduce --not-in-remote for regular remotes and cloud versioning~~ [WIP] data: status: introduce --not-in-remote Feb 10, 2023

efiop force-pushed the not_in_remote branch from a768a60 to bbcb046 Compare February 10, 2023 14:09

skshetry reviewed Feb 10, 2023

View reviewed changes

efiop mentioned this pull request Feb 10, 2023

index: storage: introduce cache/remote_exists helpers iterative/dvc-data#312

Merged

efiop force-pushed the not_in_remote branch 4 times, most recently from 6786610 to 21fa8e6 Compare February 10, 2023 23:10

efiop mentioned this pull request Feb 12, 2023

index: settle storage layout/operations iterative/dvc-data#313

Merged

efiop force-pushed the not_in_remote branch from 21fa8e6 to 7985840 Compare February 12, 2023 07:16

efiop mentioned this pull request Feb 13, 2023

index: experimental persistent data index #8827

Merged

efiop force-pushed the not_in_remote branch from 7985840 to bdbe9af Compare February 13, 2023 04:57

efiop changed the title ~~[WIP] data: status: introduce --not-in-remote~~ [WIP] data: status: show not in remote Feb 13, 2023

efiop force-pushed the not_in_remote branch from bdbe9af to 708251e Compare February 13, 2023 05:01

efiop force-pushed the not_in_remote branch 2 times, most recently from d1e37a0 to 197d1b7 Compare February 14, 2023 23:20

efiop force-pushed the not_in_remote branch 2 times, most recently from 3962ddc to fbc1844 Compare February 15, 2023 02:28

efiop changed the title ~~[WIP] data: status: show not in remote~~ [WIP] data: status: introduce --not-in-remote Feb 15, 2023

efiop mentioned this pull request Feb 15, 2023

repo: cache index for cloud versioning #9029

Merged

efiop force-pushed the not_in_remote branch 3 times, most recently from c776286 to c9c85d3 Compare February 16, 2023 01:45

efiop mentioned this pull request Feb 17, 2023

repo: enable data index caching by default #9044

Merged

efiop force-pushed the not_in_remote branch 4 times, most recently from 1d8e7bc to 2999959 Compare February 20, 2023 05:42

data: status: rename with_missing to not_in_cache

d2869e2

efiop force-pushed the not_in_remote branch from 2999959 to 3cad416 Compare February 20, 2023 09:34

data: status: introduce --not-in-remote

a85abcc

Also supports cloud versioning. Fixes iterative#8761

efiop force-pushed the not_in_remote branch from 3cad416 to a85abcc Compare February 20, 2023 10:13

efiop marked this pull request as ready for review February 20, 2023 10:25

efiop changed the title ~~[WIP] data: status: introduce --not-in-remote~~ data: status: introduce --not-in-remote Feb 20, 2023

efiop merged commit c34cbf6 into iterative:main Feb 20, 2023

skshetry reviewed Feb 20, 2023

View reviewed changes

tests/func/test_data_status.py

join("dir", "bar"),

)

},

"uncommitted": {},

Copy link

Collaborator

skshetry Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for "uncommitted" here.

skshetry mentioned this pull request Jun 18, 2025

feat: ignore files not in remote when push is false #10749

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

data: status: introduce --not-in-remote #8962

data: status: introduce --not-in-remote #8962

Uh oh!

efiop commented Feb 6, 2023 •

edited

Loading

Uh oh!

codecov bot commented Feb 7, 2023 •

edited

Loading

Uh oh!

skshetry Feb 10, 2023 •

edited

Loading

Uh oh!

efiop Feb 20, 2023

Uh oh!

efiop commented Feb 13, 2023 •

edited

Loading

Uh oh!

efiop commented Feb 15, 2023

Uh oh!

dberenbaum commented Feb 16, 2023 •

edited

Loading

Uh oh!

efiop commented Feb 17, 2023

Uh oh!

dberenbaum commented Feb 17, 2023

Uh oh!

skshetry Feb 20, 2023

Uh oh!

skshetry Feb 20, 2023

Uh oh!

Uh oh!

data: status: introduce --not-in-remote #8962

data: status: introduce --not-in-remote #8962

Uh oh!

Conversation

efiop commented Feb 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Feb 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

skshetry Feb 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

efiop Feb 20, 2023

Choose a reason for hiding this comment

Uh oh!

efiop commented Feb 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

efiop commented Feb 15, 2023

Uh oh!

dberenbaum commented Feb 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

efiop commented Feb 17, 2023

Uh oh!

dberenbaum commented Feb 17, 2023

Uh oh!

skshetry Feb 20, 2023

Choose a reason for hiding this comment

Uh oh!

skshetry Feb 20, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

efiop commented Feb 6, 2023 •

edited

Loading

codecov bot commented Feb 7, 2023 •

edited

Loading

skshetry Feb 10, 2023 •

edited

Loading

efiop commented Feb 13, 2023 •

edited

Loading

dberenbaum commented Feb 16, 2023 •

edited

Loading