data: status: introduce --not-in-remote #8962
Conversation
Force-pushed from 74d934c to 0ae96fd
Codecov Report — Base: 93.01% // Head: 93.06% // Increases project coverage by +0.04%.

@@ Coverage Diff @@
##             main    #8962      +/-   ##
==========================================
+ Coverage   93.01%   93.06%   +0.04%
==========================================
  Files         457      456       -1
  Lines       36824    36821       -3
  Branches     5339     5333       -6
==========================================
+ Hits        34252    34266      +14
+ Misses       2052     2028      -24
- Partials      520      527       +7

View full report at Codecov.
Force-pushed from 39c8c15 to f7a5a32
Force-pushed from 5b5fe23 to a768a60
@@ -114,6 +117,8 @@ def run(self) -> int:
        status = self.repo.data_status(
            granular=self.args.granular,
            untracked_files=self.args.untracked_files,
            not_in_cache=True,
            not_in_remote=self.args.not_in_remote,
You might need to drop the `not_in_remote`/`not_in_cache` entries below.
if not self.args.not_in_remote:
    status.pop("not_in_remote")
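A minimal self-contained sketch of that suggestion (the helper name and status shape here are hypothetical, not the actual dvc internals):

```python
# Hypothetical helper illustrating the suggestion above: drop the
# "not_in_remote" entry from the status mapping unless the flag was passed.
def filter_status(status: dict, not_in_remote: bool) -> dict:
    if not not_in_remote:
        status.pop("not_in_remote", None)
    return status


status = {"uncommitted": {}, "not_in_remote": {"foo"}}
# Without the flag, the key is dropped from the output:
assert filter_status(dict(status), not_in_remote=False) == {"uncommitted": {}}
# With the flag, it is kept:
assert "not_in_remote" in filter_status(dict(status), not_in_remote=True)
```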
You mean for json?
Force-pushed from 6786610 to 21fa8e6
This does a few things:

1) partitions storage into 3 categories:
   * `data` - immediately accessible data contents (e.g. right in your workspace)
   * `cache` - data contents stashed somewhere aside
   * `remote` - data contents stashed remotely (kinda like a cold cache)
2) makes it so that each storage can be either object-based (our typical odbs) or filesystem-based (e.g. cloud versioning or just a plain old backup copy).
3) settles the `save` convention as being a transfer of data contents from `data` storage (e.g. workspace) to `cache` storage, but also allows picking another storage type to get the data from (e.g. `remote`, which is how we use it for cloud versioning).
4) settles the `checkout` convention as being a transfer of data contents from `cache`, but also allows one to pick another storage type to get the data from (e.g. `data`, which is how we use it for cloud versioning, or `remote` for cases where having an intermediate cache is undesirable).

These things will simplify our current cloud-versioning operations and will also allow us to introduce `dvcfs` support (which also means `dvc.api` support) for it (coming in a separate PR later). Also, this allows us to use a persistent index with `FileStorage` in iterative/dvc#8962 (again coming in a separate PR).
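The storage partition and the `save`/`checkout` conventions described above can be sketched roughly as follows (the class and method names are illustrative only, not the actual dvc-data API):

```python
# Illustrative sketch, not the real dvc-data implementation: three storage
# categories, with `save` moving contents into `cache` and `checkout`
# moving contents into `data`, each allowing an alternate source storage.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StorageMap:
    data: Dict[str, bytes] = field(default_factory=dict)    # workspace contents
    cache: Dict[str, bytes] = field(default_factory=dict)   # local cache
    remote: Dict[str, bytes] = field(default_factory=dict)  # "cold cache"

    def save(self, key: str, src: str = "data") -> None:
        # save: transfer contents into `cache`; default source is `data`,
        # but e.g. `remote` is used for cloud versioning
        self.cache[key] = getattr(self, src)[key]

    def checkout(self, key: str, src: str = "cache") -> None:
        # checkout: transfer contents into `data`; default source is `cache`,
        # but `remote` works when an intermediate cache is undesirable
        self.data[key] = getattr(self, src)[key]


storage = StorageMap()
storage.data["foo"] = b"contents"
storage.save("foo")                    # data -> cache
storage.remote["bar"] = b"x"
storage.checkout("bar", src="remote")  # remote -> data, skipping the cache
```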
For the record: reworked a few things. The hiccup is related to prerequisites for finally using a persistent index here (for remotes specifically) so that we don't ring the cloud every time. Things that are left to do:
Force-pushed from d1e37a0 to 197d1b7
This is taking way too long and there are still some questions on how to index http remotes (since listing is not part of the standard). Hate to do this, but I'll bring back
Force-pushed from 3962ddc to fbc1844
This allows us to reuse the cloud versioning index that we've built in other operations like `data status`. Note, this only works with `dvc config feature.data_index_cache true` right now. Also note that the index is not populated for regular remotes; that will come in a separate PR, as there are some http-related issues to figure out. Related iterative#8962
Force-pushed from c776286 to c9c85d3
I think it's fine to merge it this way. However, I did run into really slow performance when doing:

# Add 2800 files
$ dvc get git@github.com:iterative/dataset-registry.git use-cases/cats-dogs
$ dvc add cats-dogs

# Configure remote
$ dvc remote add -d -f versioned s3://dave-sandbox-versioning/test/versioned
$ dvc remote modify versioned version_aware true

# Get status
$ time dvc data status --granular --not-in-remote
...
dvc data status --granular --not-in-remote  12.77s user 1.42s system 14% cpu 1:37.57 total

I'm not even sure we actually need to bother with checking the cloud here. If there's no cloud version ID in the

Edit: Okay, I tested with the persistent index and it seems fine, but I'd still like to discuss the point above, especially if we aren't ready for the persistent index to be the default yet. In this case, isn't the
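The idea above — that an entry with no cloud version ID recorded cannot be in a version-aware remote, so the cloud check could be skipped — might be sketched like this (the entry schema and helper name are hypothetical, not the actual .dvc file format):

```python
# Hypothetical sketch: derive "not in remote" purely from locally recorded
# version IDs, without querying the cloud. The dict schema is illustrative.
def missing_from_remote(dvcfile_entries):
    """Return relpaths of entries that record no cloud version_id."""
    return {
        entry["relpath"]
        for entry in dvcfile_entries
        if not entry.get("version_id")
    }


entries = [
    {"relpath": "cats-dogs/1.jpg", "version_id": "abc123"},
    {"relpath": "cats-dogs/2.jpg"},  # never pushed: no version_id recorded
]
assert missing_from_remote(entries) == {"cats-dogs/2.jpg"}
```

The trade-off discussed below is that this trusts the dvcfiles themselves: a corrupted file could yield an invalid perceived state.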
@dberenbaum Great idea with filling from version_ids in dvcfiles! Sorry for the pause here; I'm trying to move the persistent index to within the .dvc/ directory (currently it is global, which was done in the hopes of helping with the NFS use case, but it turned out that wouldn't help), so that we could enable the persistent index by default. The version_id trick would definitely work, but I'm a bit worried about cases like corrupted dvc files that would result in an invalid perceived state. Even if we go with that trick, we'll still need
Discussed with @efiop that for this specific scenario where the |
Force-pushed from 1d8e7bc to 2999959
Also supports cloud versioning. Fixes iterative#8761
}

assert dvc.data_status(
    granular=True, untracked_files="all", not_in_remote=True
`untracked_files="all"` may be unnecessary to test here.
        join("dir", "bar"),
    )
},
"uncommitted": {},
No need for `"uncommitted"` here.
Slow POC implementation.
#8761