Skip to content

Conversation

karajan1001
Copy link
Contributor

@karajan1001 karajan1001 commented Sep 19, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

fix: #7986

  1. Add "status" to replace "running" and "queued" in output of exp show.
  2. Add flags --hide-queued and --hide-failed to exp show
  3. Allow exp show to show failed experiments.
  4. Add unit test for the failed experiments shown.
  5. Add error msg to the exp show output

Need a coordinate with VS-CODE extension. @mattseddon

Example of current output in JSON:
for failed exp

"data": {
    "timestamp": ANY,
    "params": {"params.yaml": {"data": {"foo": 2}}},
    "deps": {"copy.py": {"hash": None, "size": None, "nfiles": None}},
    "outs": {},
    "status": "Failed",
    "executor": None,
    "error": {
        "msg": "ERROR: failed to reproduce 'failed-copy-file': "
        "failed to run: python -c 'import sys; sys.exit(1)', "
        "exited with 1",
        "type": "",
    },
}

for other

"data": {
    "deps": {
        "copy.py": {
            "hash": ANY,
            "size": ANY,
            "nfiles": None,
        }
    },
    "metrics": {},
    "outs": {},
    "params": {"params.yaml": {"data": {"foo": 1}}},
    "status": "Success",
    "executor": None,
    "timestamp": timestamp,
    "name": "master",
    "error": {}, 
}
  1. ["data"]["error"]["type"] is now only a placeholder. Need to define different types and other info.
    2.["data"]["error"]["msg"] will show the last line of the output log.
  2. The old ["data"]["running"] and ["data"]["queued"] were obsolete because they will show the wrong status on the failed experiment to the user.
  3. Current status list in dvc/repo/experiments/show.py
class ExpStatus(Enum):
    Success = 0
    Queued = 1
    Running = 2
    Failed = 3

asciicast

@karajan1001 karajan1001 added the feature is a feature label Sep 19, 2022
@karajan1001 karajan1001 requested a review from pmrowla September 19, 2022 03:39
@karajan1001 karajan1001 self-assigned this Sep 19, 2022
Comment on lines 32 to 49
def _get_last_error_message(
celery_queue: "LocalCeleryQueue", exp_rev: str
) -> str:
queue_entry: Optional[
"QueueEntry"
] = celery_queue.match_queue_entry_by_name(
{exp_rev}, celery_queue.iter_failed()
).get(
exp_rev
)
if queue_entry:
try:
proc_info: "ProcessInfo" = celery_queue.proc[queue_entry.stash_rev]
with open(proc_info.stdout, encoding="utf-8") as f_read:
return f_read.readlines()[-1].strip()
except (KeyError, FileNotFoundError):
pass
return ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is really a reliable way to handle this case. It only works if the pipeline was run without debugging enabled.

@mattseddon can weigh in on this, but I think it would be better to just use a generic "Experiment run failed." error message, and then add a new field where we can pass the log file path to the vscode extension. And then on the vscode side they could potentially display the entire log to the user.

Actually, passing the log location seems useful in every exp case, not just for failures. This is something that the vscode extension could also get through dvc queue logs but I think it might be easier for them to just get the log file path and read it themselves when needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan on this is to use "type" field in "error" later in which I can add info of in which part (initialization, env preparing, running, result collection) of the tasks the error comes out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I would still suggest just using a generic error message for now instead of trying to use the last log line, since the last log line is not actually guaranteed to contain anything useful.

In the long run we should probably also add an error message field to the executor info and save it here:

try:
logger.debug("Running repro in '%s'", os.getcwd())
yield dvc
except CheckpointKilledError:
raise
except DvcException:
if log_errors:
logger.exception("")
raise
except Exception:
if log_errors:
logger.exception("unexpected error")
raise

So on getting an exception during repro, we would just do something like

info.error = str(exc)

and then that would be included in the dump:

if infofile is not None:

This would give you a more reliable error message for the failed exp case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattseddon can weigh in on this, but I think it would be better to just use a generic "Experiment run failed." error message, and then add a new field where we can pass the log file path to the vscode extension. And then on the vscode side they could potentially display the entire log to the user.

Actually, passing the log location seems useful in every exp case, not just for failures. This is something that the vscode extension could also get through dvc queue logs but I think it might be easier for them to just get the log file path and read it themselves when needed

Yes, the log location would be helpful.

The current approach for any error is to display the provided error as a tooltip in the experiments table. E.g:

image

It is unlikely that we would want to put the entire log into a tooltip so we would have to change the approach there. It would be good to get as much information about the error as possible without having to parse the logs.

Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the error log handling this LGTM

@mattseddon
Copy link
Contributor

for failed exp

"data": {
    "timestamp": ANY,
    "params": {"params.yaml": {"data": {"foo": 2}}},
    "deps": {"copy.py": {"hash": None, "size": None, "nfiles": None}},
    "outs": {},
    "status": "Failed",
    "executor": None,
    "error": {
        "msg": "ERROR: failed to reproduce 'failed-copy-file': "
        "failed to run: python -c 'import sys; sys.exit(1)', "
        "exited with 1",
        "type": "",
    },
}

for other

"data": {
    "deps": {
        "copy.py": {
            "hash": ANY,
            "size": ANY,
            "nfiles": None,
        }
    },
    "metrics": {},
    "outs": {},
    "params": {"params.yaml": {"data": {"foo": 1}}},
    "status": "Success",
    "executor": None,
    "timestamp": timestamp,
    "name": "master",
    "error": {}, 
}

It would be helpful if you could not add the error key into the dict unless there has been an error (i.e the dict is non-empty).

@karajan1001
Copy link
Contributor Author

Use a more general message to replace the error message. BTW, I found the tasks can fail for

  1. Failed to acquire lock.
  2. Too many open files.
    ...
    We should gather these and they are helpful msg for the users. But this is a separate feature request from this.

@dberenbaum
Copy link
Contributor

Thanks @karajan1001! Could you rebase?

Do you need either @pmrowla or @mattseddon to review?

@karajan1001
Copy link
Contributor Author

Thanks @karajan1001! Could you rebase?

Do you need either @pmrowla or @mattseddon to review?

I think both the engineering and the API sides require a check.

fix: iterative#7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
4. Add name support for failed exp
5. Add error msg to the `exp show` output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature is a feature
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

exp show: --queued and --failed
4 participants