-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: ignore files not in remote when push is false #10749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ignore files not in remote when push is false #10749
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #10749 +/- ##
==========================================
+ Coverage 90.68% 91.06% +0.38%
==========================================
Files 504 504
Lines 39795 40040 +245
Branches 3141 3164 +23
==========================================
+ Hits 36087 36462 +375
+ Misses 3042 2950 -92
+ Partials 666 628 -38 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
4537674
to
b5a6a58
Compare
@skshetry, have you had time to look at this? This feature would be really great for our team! |
dvc/repo/data.py
Outdated
not_in_remote = uncommitted_diff.pop("not_in_remote", []) | ||
|
||
if respect_no_push: | ||
logger.debug("Filtering out paths that are not pushable") | ||
not_in_remote = _filter_out_push_false_outs(repo, not_in_remote) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to extract out the "not_in_remote" checks from above _diff()
, as it has nothing to do with diff between worktree and committed changes (index). We need to calculate only for the given index (commited changes).
That is what repo.index
is. It's an index of committed changes. You can filter that index to a view, using worktree_view
:
Line 63 in 549821e
def worktree_view( |
if not_in_remote:
view = worktree_view(repo.index, push=True)
# ... existing logic
push=True
gives us
Line 73 in 549821e
push: Whether the view should be restricted to pushable data only. |
You can get access to DataIndex
using view.data["repo"]
. And then use index.iteritems(shallow=not granular)
on it.
Let me know if you need help. I can take over the PR if you prefer that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can extract the above not_in_remote
stuff. It is using Change
type, but you can make minor modification to it to instead use Entry
type. change.old
and change.new
are just Entry
type.
Something like follows, maybe:
data_index = view.data["repo"]
for key, entry in data_index.iteritems(shallow=not granular):
if not (entry and entry.hash_info):
continue
k = (*key, "") if entry.meta and entry.meta.isdir else key
try:
if not index.storage_map.remote_exists(entry, refresh=remote_refresh):
yield os.path.sep.join(k)
except StorageError:
pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks clean! Just to make sure I understand you correctly:
You propose to not use the uncommitted_diff["not_in_remote"]
at all, and instead handle it directly, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from our spec. I think the plan was to show "not_in_remote" only for the tracked directories/files.
I don't think it makes sense for us to show "not_in_remote" for items that are already git-committed. If they are still being tracked by dvc.yaml/.dvc file in the workspace (i.e. the index), and are missing from remote, they will show up in "Not in remote" section anyway.
Only the files that are not being tracked anymore, but were tracked by Git's "HEAD" and is missing from the remote will show up in committed_diff.not_in_remote
, which I don't think makes sense. It could be intentional, or even if not intentional, that only adds noise. You can't populate those files into the remote now with dvc push
(unless you use --all-commits
), as they are no longer being tracked by DVC.
I don't see any discussion about this on either of:
So maybe that was an oversight.
Do you see any use for this? Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there is no Not in cache
status for git-committed items, but missing from cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your conventions on scope of PRs, @skshetry? It is starting to look like the scope here may creep a bit from the original issue, to also cover the nuances that we are discussing above. I'm fine with it, but wanted to check what are your norms. Should I make one PR with narrow scope of the push: false
part and then start on another for the rest (which you may be more equipped to do properly), or should we collaborate on doing it in one PR here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then start on another for the rest
Not sure I understand what the rest part is.
You propose to not use the uncommitted_diff["not_in_remote"] at all, and instead handle it directly, right?
^ I was only providing a bit more context for you to work with. I am not asking anything more than what you are proposing, except for making it the default.
And, using a different approach (aka refactoring) is necessary to get a "view" of a pushable data, which I suggested above. This push: false
information does not exist in _diff()
, and I am not sure how to represent it yet on the data management layers.
And I tried to give you my thoughts on why committed_diff["not_in_remote"]
does not make any sense and tried to hear yours.
If you have any questions, please feel free to ask. I am not trying to expand the scope here. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks:) Working on the refactoring now. Really appreciate the thorough and patient follow-up - great maintainership!
@skshetry, thanks for the thorough review, and sorry for the very late reply from me. When reading up on your suggestions, especially on the change from diffing index - workspace to inspecting index directly, I realized I am somewhat confused about the intended/expected behavior of What is the difference between
I first thought this meant there was a difference between files generated with pipelines ( I made a simple example to investigate. ❯ source ./run_demo.sh
==================================
## Before repro
==================================
>>> dvc status -c --json
{
"bar.txt": "missing",
"foo.txt": "missing"
}
>>> dvc status --json
{
"create-bar": [
{
"changed outs": {
"bar.txt": "not in cache"
}
}
],
"foo.txt.dvc": [
{
"changed outs": {
"foo.txt": "not in cache"
}
}
]
}
>>> dvc data status --not-in-remote --json
{
"not_in_cache": [
"foo.txt",
"bar.txt"
],
"not_in_remote": [
"foo.txt",
"bar.txt"
],
"committed": {
"not_in_remote": [
"foo.txt",
"bar.txt"
]
}
}
==================================
## Running repro
==================================
>>> dvc repro
Running stage 'create-bar':
> echo bar > bar.txt
Use `dvc push` to send your updates to remote storage.
==================================
## After repro
==================================
>>> dvc status -c --json
{
"bar.txt": "new",
"foo.txt": "missing"
}
>>> dvc status --json
{
"foo.txt.dvc": [
{
"changed outs": {
"foo.txt": "not in cache"
}
}
]
}
>>> dvc data status --not-in-remote --json
{
"not_in_cache": [
"foo.txt"
],
"not_in_remote": [
"foo.txt",
"bar.txt"
],
"committed": {
"not_in_remote": [
"foo.txt",
"bar.txt"
]
}
}
==================================
## Add and commit foo
==================================
echo foo > foo.txt
>>> dvc commit foo.txt
==================================
## After commit
==================================
>>> dvc status -c --json
{
"bar.txt": "new",
"foo.txt": "new"
}
>>> dvc status --json
{}
>>> dvc data status --not-in-remote --json
{
"not_in_remote": [
"bar.txt",
"foo.txt"
],
"committed": {
"not_in_remote": [
"bar.txt",
"foo.txt"
]
}
}
==================================
## Running push
==================================
>>> dvc push
Collecting |2.00 [00:00, 521entry/s]
Pushing
2 files pushed
==================================
## After push
==================================
>>> dvc status -c --json
{}
>>> dvc status --json
{}
>>> dvc data status --not-in-remote --json
{
"committed": {
"not_in_remote": [
"bar.txt",
"foo.txt"
]
}
}
It does seem to contain the same information, structured slightly differently. Also: the Repro of example
|
The
The data from outputs are still "data" tracked by DVC. So they are shown by default (it ignores dependencies' unless they are also part of an output somewhere in the pipeline unlike
|
If you are using dvc for data management, use |
I see, thank you, that helps. I just got a bit confused by the |
Note that Lines 70 to 73 in c7c7ba6
So while it's called a The unchanged items are always computed unconditionally here, but only shown if (The |
@skshetry, updated the PR now, to use the Bit premature there... Need to sort out some issues before review. |
df95f2a
to
100983e
Compare
It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work overall. The change look good! I’ve left a few minor/pedantic comments inline, just for polish. 🙂
Looks unrelated, maybe new pytest release is to blame. Please ignore, that'd fail on main too, but maybe we were lucky. I'll investigate separately. |
Convert kwargs to explicit arguments and extract not_in_remote handling
Thanks for the guidance! Ps. Also took the liberty to swap out the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you for contributing!
I'll keep this PR open for a few days to gather feedback from community (will also discuss internally), and then merge by Wednesday.
@skshetry, what are your release cycle/policy? Really looking forward to getting this into our CI 🤩 |
I am planning to release by early next week. |
@Northo, I have created a new release. :) |
Fixes #10317
Enables opt-in to remove
push: false
stage outputs fromnot_in_remote
data status results.Notable changes:
outs_no_push
todvc.stage.utils.fill_stage_outputs
keys, to facilitate making outputs withpush: false
.status
, when flag enabled, filter through files reported asnot_in_remote
, and remove them if notcan_push
.--respect-no-push
flag to CLIOpen to suggestions on how to make the flag names more intuitive!
Corresponding PR for the docs: iterative/dvc.org#5373
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏