Conversation

Northo
Contributor

@Northo Northo commented May 22, 2025

Fixes #10317

Adds an opt-in flag to exclude push: false stage outputs from the not_in_remote results of data status.

Notable changes:

  • Add outs_no_push to dvc.stage.utils.fill_stage_outputs keys, to facilitate making outputs with push: false.
  • In status, when the flag is enabled, go through the files reported as not_in_remote and remove those whose output is not can_push.
  • Add the corresponding --respect-no-push flag to the CLI.
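
The filtering step in the second bullet can be pictured with a small stand-alone sketch. This is not the PR's code: the Output dataclass and the filter_pushable helper are hypothetical stand-ins, and only the can_push name is taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """Hypothetical stand-in for a DVC output; not the real DVC class."""

    path: str
    can_push: bool  # False when the output is declared with `push: false`


def filter_pushable(not_in_remote: list[str], outs: list[Output]) -> list[str]:
    """Drop paths whose owning output cannot be pushed."""
    unpushable = {out.path for out in outs if not out.can_push}
    return [path for path in not_in_remote if path not in unpushable]


outs = [Output("foo.txt", can_push=True), Output("bar.txt", can_push=False)]
remaining = filter_pushable(["foo.txt", "bar.txt"], outs)
print(remaining)
```

With the flag off, both paths would still be reported; with it on, only the pushable one remains.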

Open to suggestions on how to make the flag names more intuitive!

Corresponding PR for the docs: iterative/dvc.org#5373

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

codecov bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.

Project coverage is 91.06%. Comparing base (2431ec6) to head (b6b18ef).
Report is 68 commits behind head on main.

Files with missing lines | Patch % | Lines
dvc/repo/data.py | 78.94% | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10749      +/-   ##
==========================================
+ Coverage   90.68%   91.06%   +0.38%     
==========================================
  Files         504      504              
  Lines       39795    40040     +245     
  Branches     3141     3164      +23     
==========================================
+ Hits        36087    36462     +375     
+ Misses       3042     2950      -92     
+ Partials      666      628      -38     

@Northo Northo force-pushed the fix/10317/ignore-not-in-remote-push-false branch 4 times, most recently from 4537674 to b5a6a58 Compare May 22, 2025 12:58
@Northo Northo changed the title Fix/10317/ignore not in remote push false feat: ignore files not in remote when push is false May 22, 2025
@Northo Northo marked this pull request as ready for review May 22, 2025 13:24
@skshetry skshetry added this to DVC May 25, 2025
@skshetry skshetry moved this to Review In Progress in DVC May 25, 2025
@Northo
Contributor Author

Northo commented Jun 5, 2025

@skshetry, have you had time to look at this? This feature would be really great for our team!

@skshetry skshetry moved this from Review In Progress to In Progress in DVC Jun 7, 2025
dvc/repo/data.py Outdated
Comment on lines 263 to 276
not_in_remote = uncommitted_diff.pop("not_in_remote", [])

if respect_no_push:
    logger.debug("Filtering out paths that are not pushable")
    not_in_remote = _filter_out_push_false_outs(repo, not_in_remote)

Collaborator

I think we have to extract the "not_in_remote" checks out of _diff() above, as they have nothing to do with the diff between the worktree and committed changes (index). We need to calculate them only for the given index (committed changes).

That is what repo.index is. It's an index of committed changes. You can filter that index to a view, using worktree_view:

def worktree_view(

if not_in_remote:
    view = worktree_view(repo.index, push=True)
    # ... existing logic

push=True gives us:

push: Whether the view should be restricted to pushable data only.

You can get access to DataIndex using view.data["repo"]. And then use index.iteritems(shallow=not granular) on it.

Let me know if you need help. I can take over the PR if you prefer that way.

Collaborator

@skshetry skshetry Jun 9, 2025

You can extract the above not_in_remote stuff. It uses the Change type, but you can make a minor modification so that it uses the Entry type instead. change.old and change.new are just Entry objects.

Something like the following, maybe:

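# Entries of the pushable-only view (see worktree_view(..., push=True) above).
data_index = view.data["repo"]
for key, entry in data_index.iteritems(shallow=not granular):
    # Skip entries without a hash; there is nothing to look up in the remote.
    if not (entry and entry.hash_info):
        continue

    # Directories get a trailing separator in the reported path.
    k = (*key, "") if entry.meta and entry.meta.isdir else key
    try:
        if not data_index.storage_map.remote_exists(entry, refresh=remote_refresh):
            yield os.path.sep.join(k)
    except StorageError:
        # The storage lookup failed for this entry; skip it.
        pass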
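
The key handling in that loop can be tried in isolation. A minimal sketch, assuming index keys are tuples of path components as the snippet suggests; key_to_path is a hypothetical helper name, not a DVC function.

```python
import os


def key_to_path(key: tuple[str, ...], isdir: bool) -> str:
    """Join an index key into a path; directories get a trailing separator."""
    # Appending an empty component makes the join end with os.path.sep.
    parts = (*key, "") if isdir else key
    return os.path.sep.join(parts)


file_path = key_to_path(("data", "raw", "a.csv"), isdir=False)
dir_path = key_to_path(("data", "raw"), isdir=True)
print(file_path, dir_path)
```

The trailing separator lets a consumer of the output tell a tracked directory from a tracked file without a second lookup.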

Contributor Author

This looks clean! Just to make sure I understand you correctly:

You propose to not use the uncommitted_diff["not_in_remote"] at all, and instead handle it directly, right?

Collaborator

@skshetry skshetry Jun 18, 2025

[Two images]

This is from our spec. I think the plan was to show "not_in_remote" only for the tracked directories/files.

I don't think it makes sense for us to show "not_in_remote" for items that are already git-committed. If they are still being tracked by dvc.yaml/.dvc file in the workspace (i.e. the index), and are missing from remote, they will show up in "Not in remote" section anyway.

Only the files that are no longer tracked, but were tracked at Git's "HEAD" and are missing from the remote, will show up in committed_diff.not_in_remote, which I don't think makes sense. That could be intentional, and even if it is not, it only adds noise. You can't populate those files into the remote now with dvc push (unless you use --all-commits), as they are no longer being tracked by DVC.

I don't see any discussion about this on either of:

So maybe that was an oversight.

Do you see any use for this? Let me know what you think.

Collaborator

@skshetry skshetry Jun 18, 2025

Also, there is no "Not in cache" status for items that are git-committed but missing from the cache.

Contributor Author

What are your conventions on the scope of PRs, @skshetry? It is starting to look like the scope here may creep a bit beyond the original issue, to also cover the nuances we are discussing above. I'm fine with that, but wanted to check what your norms are. Should I make one PR with a narrow scope covering the push: false part and then start on another for the rest (which you may be better equipped to do properly), or should we collaborate on doing it all in this PR?

Collaborator

@skshetry skshetry Jun 19, 2025

> then start on another for the rest

Not sure I understand what the rest part is.

> You propose to not use the uncommitted_diff["not_in_remote"] at all, and instead handle it directly, right?

^ I was only providing a bit more context for you to work with. I am not asking for anything more than what you are proposing, except for making it the default.
And using a different approach (i.e. refactoring) is necessary to get a "view" of pushable data, which I suggested above. This push: false information does not exist in _diff(), and I am not sure yet how to represent it in the data management layers.

And I tried to give you my thoughts on why committed_diff["not_in_remote"] does not make any sense and tried to hear yours.

If you have any questions, please feel free to ask. I am not trying to expand the scope here. :)

Contributor Author

Gotcha, thanks:) Working on the refactoring now. Really appreciate the thorough and patient follow-up - great maintainership!

@skshetry skshetry moved this from In Progress to Review In Progress in DVC Jun 9, 2025
@Northo
Contributor Author

Northo commented Jun 18, 2025

@skshetry, thanks for the thorough review, and sorry for the very late reply from me.

When reading up on your suggestions, especially on the change from diffing index - workspace to inspecting index directly, I realized I am somewhat confused about the intended/expected behavior of dvc data status.

What is the difference between dvc data status and dvc status? In the docs, it says:

  • dvc status

    Show changes in the project pipelines, as well as file mismatches either between the cache and workspace, or between the cache and remote storage.
    For the status of tracked data, see dvc data status (similar to git status).

  • dvc data status

    Show changes to the files and directories tracked by DVC in the workspace.
    For the status of data pipelines, see dvc status.

I first thought this meant there was a difference between files generated with pipelines (dvc.yaml) and directly tracked (<filename>.dvc).
However, after some investigation, I believe I have misunderstood.

I made a simple example to investigate: foo.txt is added directly, and bar.txt is created by a pipeline. I ran dvc status -c, dvc status, and dvc data status --not-in-remote at different states. The results are below.

source ./run_demo.sh
==================================
## Before repro
==================================
>>> dvc status -c --json
{
  "bar.txt": "missing",
  "foo.txt": "missing"
}

>>> dvc status --json
{
  "create-bar": [
    {
      "changed outs": {
        "bar.txt": "not in cache"
      }
    }
  ],
  "foo.txt.dvc": [
    {
      "changed outs": {
        "foo.txt": "not in cache"
      }
    }
  ]
}

>>> dvc data status --not-in-remote --json
{
  "not_in_cache": [
    "foo.txt",
    "bar.txt"
  ],
  "not_in_remote": [
    "foo.txt",
    "bar.txt"
  ],
  "committed": {
    "not_in_remote": [
      "foo.txt",
      "bar.txt"
    ]
  }
}


==================================
## Running repro
==================================
>>> dvc repro
Running stage 'create-bar':
> echo bar > bar.txt
Use `dvc push` to send your updates to remote storage.


==================================
## After repro
==================================
>>> dvc status -c --json
{
  "bar.txt": "new",
  "foo.txt": "missing"
}

>>> dvc status --json
{
  "foo.txt.dvc": [
    {
      "changed outs": {
        "foo.txt": "not in cache"
      }
    }
  ]
}

>>> dvc data status --not-in-remote --json
{
  "not_in_cache": [
    "foo.txt"
  ],
  "not_in_remote": [
    "foo.txt",
    "bar.txt"
  ],
  "committed": {
    "not_in_remote": [
      "foo.txt",
      "bar.txt"
    ]
  }
}


==================================
## Add and commit foo
==================================
echo foo > foo.txt
>>> dvc commit foo.txt


==================================
## After commit
==================================
>>> dvc status -c --json
{
  "bar.txt": "new",
  "foo.txt": "new"
}

>>> dvc status --json
{}

>>> dvc data status --not-in-remote --json
{
  "not_in_remote": [
    "bar.txt",
    "foo.txt"
  ],
  "committed": {
    "not_in_remote": [
      "bar.txt",
      "foo.txt"
    ]
  }
}


==================================
## Running push
==================================
>>> dvc push
Collecting |2.00 [00:00,  521entry/s]
Pushing
2 files pushed


==================================
## After push
==================================
>>> dvc status -c --json
{}

>>> dvc status --json
{}

>>> dvc data status --not-in-remote --json
{
  "committed": {
    "not_in_remote": [
      "bar.txt",
      "foo.txt"
    ]
  }
}

It does seem to contain the same information, structured slightly differently.
Is there a subtle difference here I am missing, or do they have overlapping functionality?
Thanks for any help in clarifying this.

Also: the committed.not_in_remote entries seem a bit strange?
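
The oddity is easiest to see by loading the "After push" output and comparing its two sections. The sketch below uses only the JSON printed above; the variable names are illustrative.

```python
import json

# Output of `dvc data status --not-in-remote --json` after `dvc push`,
# copied from the demo run above.
after_push = json.loads(
    """
    {
      "committed": {
        "not_in_remote": ["bar.txt", "foo.txt"]
      }
    }
    """
)

top_level = after_push.get("not_in_remote", [])
committed = after_push.get("committed", {}).get("not_in_remote", [])

# Paths reported only under "committed", even though both files were pushed.
only_in_committed = sorted(set(committed) - set(top_level))
print(only_in_committed)
```

The top-level list is empty after the push, as expected, while the "committed" section still lists both files.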

Repro of example

>>> .dvc/.gitignore
/config.local
/tmp
/cache
>>> .dvc/config
[core]
    remote = localremote
['remote "localremote"']
    url = ../localremote
>>> .dvcignore
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
>>> .gitignore
/bar.txt
/foo.txt
localremote
>>> dvc.lock
schema: '2.0'
stages:
  create-bar:
    cmd: echo bar > bar.txt
    outs:
    - path: bar.txt
      hash: md5
      md5: c157a79031e1c40f85931829bc5fc552
      size: 4
>>> dvc.yaml
stages:
  create-bar:
    cmd: echo bar > bar.txt
    outs:
    - bar.txt
>>> foo.txt.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo.txt
>>> run_demo.sh
#!/bin/bash

## Clean
rm -rf localremote .dvc/cache .dvc/tmp

run() {
    echo ">>> $*"
    "$@"
    echo
}

status() {
  run dvc status -c --json
  run dvc status --json
  run dvc data status --not-in-remote --json
}

echo "=================================="
echo "## Before repro"
echo "=================================="
status

echo
echo "=================================="
echo "## Running repro"
echo "=================================="
run dvc repro

echo
echo "=================================="
echo "## After repro"
echo "=================================="
status


echo
echo "=================================="
echo "## Add and commit foo"
echo "=================================="
echo "echo foo > foo.txt"
echo "foo" > foo.txt
run dvc commit foo.txt

echo
echo "=================================="
echo "## After commit"
echo "=================================="
status

echo
echo "=================================="
echo "## Running push"
echo "=================================="
run dvc push


echo
echo "=================================="
echo "## After push"
echo "=================================="
status

@skshetry
Collaborator

> What is the difference between dvc data status and dvc status? In the docs, it says:

The dvc status command shows the state of your pipelines by detecting changes in tracked outputs, dependencies, and commands. However, its scope is limited: it can only indicate whether a tracked dependency/output/command has changed or not. It does not show you how the data changed. For example, there is no way to see granular changes within a tracked directory, which was an often-requested feature:

  • status: granular output for directories #2180

dvc data status supports showing granular changes with --granular (ideally this should be the default if we fix performance issues with it). It is also custom-built as a data(set) management command, to show you the current state of your tracked datasets, based on users' feedback asking for a tool to understand the state of tracked data.

The data from outputs is still "data" tracked by DVC, so it is shown by default (unlike dvc status, it ignores dependencies unless they are also an output somewhere in the pipeline). If there's demand for filtering those out, dvc data status would support it, but dvc status is unlikely to.

dvc data status command is focused on data, while dvc status is focused on pipelines.

dvc data status also powers the file-tree view and decorations in the "DVC Extension for VSCode". So some requirements also came from there.

@skshetry
Collaborator

skshetry commented Jun 18, 2025

If you are using dvc for data management, use data status. If you are using it to check changes to your pipelines, use dvc status. data status is a new command, so any new features related to data/data management are more likely to be implemented there than in status.
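
For a data-management workflow, one way to consume data status from a script is to parse its JSON output. A minimal sketch, assuming the output shape shown in the demo earlier in this thread; unpushed_paths is a hypothetical helper, not part of DVC.

```python
import json


def unpushed_paths(status_json: str) -> list[str]:
    """Return the top-level "not_in_remote" paths from the JSON status output."""
    status = json.loads(status_json)
    return sorted(status.get("not_in_remote", []))


# In a real check the string would come from running
# `dvc data status --not-in-remote --json`, for example via subprocess.run(...).
sample = '{"not_in_remote": ["bar.txt", "foo.txt"], "committed": {}}'
missing = unpushed_paths(sample)

# A non-zero exit code signals that some tracked data still needs a push.
exit_code = 1 if missing else 0
print(exit_code, missing)
```

Keeping the parsing separate from the command invocation makes the check easy to test without a DVC repository.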

@Northo
Contributor Author

Northo commented Jun 18, 2025

> > What is the difference between dvc data status and dvc status? In the docs, it says:
>
> The dvc status command shows the state of your pipelines by detecting changes in tracked outputs, dependencies, and commands. However, its scope is limited: it can only indicate whether a tracked dependency/output/command has changed or not. It does not show you how the data changed. For example, there is no way to see granular changes within a tracked directory, which was an often-requested feature:
>
> * [status: granular output for directories #2180](https://github.com/iterative/dvc/issues/2180)
>
> dvc data status supports showing granular changes with --granular (ideally this should be the default if we fix performance issues with it). It is also custom-built as a data(set) management command, to show you the current state of your tracked datasets, based on users' feedback asking for a tool to understand the state of tracked data.
>
> The data from outputs is still "data" tracked by DVC, so it is shown by default (unlike dvc status, it ignores dependencies unless they are also an output somewhere in the pipeline). If there's demand for filtering those out, dvc data status would support it, but dvc status is unlikely to.
>
> dvc data status command is focused on data, while dvc status is focused on pipelines.
>
> dvc data status also powers the file-tree view and decorations in the "DVC Extension for VSCode". So some requirements also came from there.

I see, thank you, that helps. I just got a bit confused by the not-in-remote being computed inside the _diff. I'll respond directly in the review comments with specific questions, and will try to have an updated PR ready for review by the end of the week.

@skshetry
Collaborator

skshetry commented Jun 18, 2025

> I just got a bit confused by the not-in-remote being computed inside the _diff.

Note that _diff(...) here also returns "unchanged" items. That's because we set with_unchanged=True.

dvc/dvc/repo/data.py

Lines 70 to 73 in c7c7ba6

for change in diff(
    old,
    new,
    with_unchanged=True,

So while it's called _diff(), it effectively yields a full list of items from both sides of the index: items that may have been added, removed, modified, or left unchanged. In that sense, it behaves more like a complete listing, similar to index.iteritems(), as discussed earlier: #10749 (comment).
(not_in_remote is only applied to change.old, which under _diff_index_to_wtree() corresponds to items from repo.index; change.new comes from the worktree index.)

The unchanged items are always computed unconditionally here, but only shown if --unchanged is explicitly passed in the CLI.

(The --unchanged flag is used by the DVC Extension for VSCode to render the file tree.)
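
The effect of with_unchanged can be illustrated with a toy model. This is not dvc_data's diff implementation: toy_diff, the dict-based indexes, and the shortened hash values are all hypothetical and exist only to show why such a diff doubles as a complete listing.

```python
from typing import Iterator, Optional


def toy_diff(
    old: dict[str, str],
    new: dict[str, str],
    with_unchanged: bool = False,
) -> Iterator[tuple[str, str, Optional[str], Optional[str]]]:
    """Yield (type, key, old_value, new_value) for keys on either side."""
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before is None:
            kind = "added"
        elif after is None:
            kind = "deleted"
        elif before != after:
            kind = "modified"
        else:
            kind = "unchanged"
        # Unchanged items are only emitted when explicitly requested.
        if kind != "unchanged" or with_unchanged:
            yield kind, key, before, after


index = {"foo.txt": "d3b0", "bar.txt": "c157"}  # placeholder hashes
worktree = {"foo.txt": "d3b0", "baz.txt": "9f3a"}

changes_only = [key for _, key, _, _ in toy_diff(index, worktree)]
full_listing = [key for _, key, _, _ in toy_diff(index, worktree, with_unchanged=True)]
print(changes_only, full_listing)
```

With the flag set, every key of the old side appears in the result, which is what allows a per-entry check such as not_in_remote to be applied while diffing.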

@skshetry skshetry added A: status Related to the dvc diff/list/status A: data-management Related to dvc add/checkout/commit/move/remove labels Jun 19, 2025
@Northo Northo requested a review from skshetry June 19, 2025 12:50
@Northo
Contributor Author

Northo commented Jun 19, 2025

@skshetry, updated the PR now, to use the worktree_view based approach.

Bit premature there... Need to sort out some issues before review.

@Northo Northo force-pushed the fix/10317/ignore-not-in-remote-push-false branch from df95f2a to 100983e Compare June 19, 2025 13:43
@Northo Northo marked this pull request as ready for review June 19, 2025 14:20
@Northo
Contributor Author

Northo commented Jun 19, 2025

It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated.

Collaborator

@skshetry skshetry left a comment

Great work overall. The changes look good! I’ve left a few minor/pedantic comments inline, just for polish. 🙂

@skshetry
Collaborator

skshetry commented Jun 19, 2025

> It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated.

Looks unrelated; maybe a new pytest release is to blame. Please ignore it. That would fail on main too, but maybe we were lucky there. I'll investigate separately.

Northo added 3 commits June 20, 2025 08:25
Convert kwargs to explicit arguments and extract not_in_remote handling
@Northo
Contributor Author

Northo commented Jun 20, 2025

Thanks for the guidance!

P.S. I also took the liberty of swapping out the kwargs in status for explicit arguments.

@Northo Northo requested a review from skshetry June 20, 2025 08:29
Collaborator

@skshetry skshetry left a comment

Looks good to me. Thank you for contributing!

I'll keep this PR open for a few days to gather feedback from the community (I will also discuss it internally), and then merge by Wednesday.

@skshetry skshetry merged commit 45fd102 into iterative:main Jun 26, 2025
41 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in DVC Jun 26, 2025
@Northo
Contributor Author

Northo commented Jul 2, 2025

@skshetry, what is your release cycle/policy? Really looking forward to getting this into our CI 🤩

@skshetry
Collaborator

skshetry commented Jul 2, 2025

> @skshetry, what is your release cycle/policy? Really looking forward to getting this into our CI 🤩

I am planning to release by early next week.

@skshetry
Collaborator

skshetry commented Jul 8, 2025

@Northo, I have created a new release. :)

Labels
A: data-management Related to dvc add/checkout/commit/move/remove A: status Related to the dvc diff/list/status
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

data status returns files as "Not in remote" even though they are marked as push: false in pipeline
2 participants