Skip to content

Conversation

DavidKatz-il
Copy link
Contributor

#19

Copy link

@marcglobality marcglobality left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I added some comments and ideas, but I like a lot this addition

pdpipe/core.py Outdated
@@ -652,6 +655,17 @@ def __init__(self, stages, transformer_getter=None, **kwargs):
def __getitem__(self, index):
if isinstance(index, slice):
return PdPipeline(self._stages[index])

if isinstance(index, list):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to check for a list, or a list of strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure if it's right to force the label param to be an str.

pdpipe/core.py Outdated
if isinstance(index, str):
stages = [stage for stage in self._stages if stage._label == index]
if len(stages) == 0:
raise Exception("'{}' is not exist.".format(index))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if python3.5 support was dropped, this could be in f-strings, which is quite more readable @shaypal5

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if two stages have the same label you will return just the first one in the pipeline. Maybe that's the desired behaviour, but highlighting just in case. Not sure whether we would like something when creating the pipeline to check for integrity (i.e. ensure that two stages cannot have same label), or we just trust the user

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if python3.5 support was dropped, this could be in f-strings, which is quite more readable @shaypal5

I did not see the use of 'f-string' in the project, so I also did not use it.

if two stages have the same label you will return just the first one in the pipeline. Maybe that's the desired behaviour, but highlighting just in case. Not sure whether we would like something when creating the pipeline to check for integrity (i.e. ensure that two stages cannot have same label), or we just trust the user

When passing in one item a Stage is returned, so I cant return two stages.
pipeline[['someLabel']] will return a pipeline with all stages that label == 'someLabel'.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're both right: I didn't f-strings so far, be we definitely should. I recently moved to using f-strings almost exclusively in my code. I'm also going into type hints, but I don't think it's a very urgent change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the second issue: See discussion below. A reasonable handling of this case should be decided upon and documented.

Copy link

@marcglobality marcglobality Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re f-strings. We need to drop support for python3.5 or be very determined and use this https://stackoverflow.com/questions/55182209/f-string-invalid-syntax-in-python-3-5 , which I wouldn't recommend @shaypal5

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Python 3.5 end-of-life was at September 2020, so Iet's do it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to make a separate PR of dropping python 3.5 and adding f-strings.

pdpipe/core.py Outdated
@@ -173,14 +173,16 @@ class PdPipelineStage(abc.ABC):
dataframes, which will be used to determine whether this stage should
be skipped for input dataframes. See pdp.cond for more information on
specialised Condition objects.
label : str, default None
A label of this stage.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I fully understand the label vs desc. Would we like the label to be the string representation then?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Label should be very short, like 'fix_spikes', while a description could be something like "Fixes spikes in all numeric columns by applying moving-window based smoothing function with parameters being dynamically learned, while smoothing, separately for each column."

Copy link
Collaborator

@shaypal5 shaypal5 Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really couldn't use the later conveniently for access as you could the first, as in pipe['fix_spikes'] or for slicing, like pipe[['fix_spikes', 'drop_corr']].

pdpipe/core.py Outdated
@@ -191,6 +193,7 @@ def __init__(self, exraise=True, exmsg=None, desc=None, prec=None,
self._prec_arg = prec
self._skip = skip
self._appmsg = '{}..'.format(desc)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the label be used here if defined and no desc rather than DEF_DESCRIPTION?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the label should be added on the beginning of the application message. Something like

self._appmsg = '{}{}..'.format(f'{label}: ' if label else '', self.,desc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...
self._appmsg = f'{label}: {desc}')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, stages with no supplied name shouldn't have an application message such as ": Dropping mostly NaN columns" (which is what the your current code does), but rather simply "Dropping mostly NaN columns".

Please fix to self._appmsg = '{}{}..'.format(f'{label}: ' if label else '', self.,desc)

pdpipe/core.py Outdated
@@ -191,6 +193,7 @@ def __init__(self, exraise=True, exmsg=None, desc=None, prec=None,
self._prec_arg = prec
self._skip = skip
self._appmsg = '{}..'.format(desc)
self._label = label

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check label and raise error if (at least) not a str

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Use with pytest.raises(SomeException) to check for correct error throwing behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def __init__(self, label='', ...)
    if not isinstance(label, str):
        raise Exception(f"label must be a str, not {type(label)}.")
    self._label = label
    ...

drop_char = SilentDropStage('char', label='dropChar')
pipeline = PdPipeline([drop_num1, drop_num2, drop_char])
assert len(pipeline) == 3
pipeline = pipeline[['dropNum1', 'dropNum2']]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way it's implemented now, (I think) one can pass as labels integers, and slice here using integers which would make it a bit confusing I think. I would force the labels to be strings, and ensure people cannot pass a list of integers to slice. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that we are not going to disable int base slicing or access, as pipelines are first-and-formost lists of pipeline stages, and they should easily slice-able / index-accessible with integers. It's in the documentation.

@DavidKatz-il
Copy link
Contributor Author

@shaypal5

  1. Should we force label to be a string?
  2. Should label be unique for all stages in a pipeline? if so where we need to check it because only when a stage is adding to the pipeline it has to be checked.

@shaypal5
Copy link
Collaborator

I would say:

  1. I think that we should. Looks reasonable enough. I don't see an inherent need for richer types of labels. This is already an enhancement, after all.
  2. I'd say the package needs to handle such cases gracefully, not force it.

Either:

  • A pipeline checks - on creation - that all stage labels are unique, and if their not, issues a warning and disables label-based access and slicing
  • Do not ensure correct access and slicing when non-unique labels can be found in the same pipeline, and document it. When accessing/slicing, return any viable result for each of the non-unique labels asked for, and document that when that is the case there is no guarantee which pipeline stages will be returned, and that it is the user's responsibility to make sure each stage has a unique string label.

Also, notice that we are not going to disable int base slicing or access, as pipelines are first-and-formost lists of pipeline stages, and they should easily slice-able / index-accessible with integers. It's in the documentation.

@DavidKatz-il
Copy link
Contributor Author

@shaypal5 WDYT about changing label to name similar to sklearn?

@shaypal5
Copy link
Collaborator

Sure! A good change, I think, not to mix with dataframe column labels.

@codecov-io
Copy link

codecov-io commented Oct 14, 2020

Codecov Report

Merging #39 into master will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #39   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        12           
  Lines         1719      1730   +11     
  Branches       232       238    +6     
=========================================
+ Hits          1719      1730   +11     
Impacted Files Coverage Δ
pdpipe/core.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cecda8d...7f21a24. Read the comment docs.

@DavidKatz-il DavidKatz-il changed the title Add stage label Add stage name Oct 14, 2020
@DavidKatz-il
Copy link
Contributor Author

I didn't use f-string because of the conflict with python 3.5.

@DavidKatz-il DavidKatz-il requested a review from shaypal5 October 14, 2020 09:33
pdpipe/core.py Outdated
self._exraise = exraise
self._exmsg = exmsg
self._desc = desc
self._prec_arg = prec
self._skip = skip
self._appmsg = '{}..'.format(desc)
self._appmsg = '{}: {}'.format(name, desc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix to
self._appmsg = '{}{}..'.format(f'{label}: ' if label else '', self.,desc)

Stages with no supplied name shouldn't have an application message such as ": Dropping mostly NaN columns", but rather simply "Dropping mostly NaN columns".

@shaypal5
Copy link
Collaborator

Almost done!

See my latest comment on your app_desc, fix it and I'll pull this. :)

@shaypal5 shaypal5 merged commit cdce586 into pdpipe:master Oct 30, 2020
@DavidKatz-il DavidKatz-il deleted the add-stage-label branch October 30, 2020 10:45
@shaypal5
Copy link
Collaborator

shaypal5 commented Oct 30, 2020

Released in version v0.0.52 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants