-
Notifications
You must be signed in to change notification settings - Fork 653
Fixes #897 - override default TextColumn string parsing. #913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I need to think about this and would also like @benmccann to take a look. I like the fact that this goes deeper than the half-fix I implemented, but am concerned about some of the things you mentioned in the Issue - in short, are there any side effects to making this change that we're not aware of. I also need to look at the implementation details. I'll probably do that after getting some of these other PRs out of the way. |
It seems to me that #912 is still relevant as it fixes the double parsing issue and adds the control over the missing markers in the StringParser. Whatever form this fix might take should probably be rebased with that PR in place. |
- Replace hard-coded parser in TextColumn appendCell() with a private parser field that is initialized to DEFAULT_PARSER to preserve existing default behavior. A client may use the new getStringParser() and setStringParser() methods to access and change the parser used for the column. Setting the parser to null specifies that no parsing shall be performed when adding data to the column.
Rebased, resolved conflicts with #912. |
return stringParser; | ||
} | ||
|
||
public void setStringParser(StringParser stringParser) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd probably be better to call this something more generic like getParser
so that we could add this to the other column types as well. In fact this might better live in AbstractColumn
I also wonder if putting this in one of the create
methods might be a better alternative. It doesn't really seem to me like you'd want to change the parser after column creation (that could possibly lead to strange situations where the column contains contents that can't be parsed by the updated parser)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benmccann The parser is only used when putting new data into the column. I can imagine scenarios where you load a file that requires one kind of parser and append to the table later with different rules.
OTOH, I'm concerned about what happens if you apply a built-in map function to the column (e.g. StringColumn.substring(0, 7). The new column may not be able to parse what the old column did unless the map functions (a) skip the parser (they might, i have no idea) or (b) inherit the parser of the source column. The latter could still have issues if the function operates on two string columns (e.g. concatenate()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benmccann I get your point on the method name; I hadn't considered generalizing it but I'll take a look at that.
@lwhite1 The general case you describe is that where Tablesaw creates and populates a column, and the user requires/expects that column to have custom parsing?
I realized that my use case is actually addressed by #912, in that it's sufficient for me to globally set text column parsing policy:
TextColumnType.DEFAULT_PARSER.setMissingValueStrings(Collections.emptyList());
If I had text columns that required distinct parsers, I'd be out of luck though.
Thanks for the feedback! I'll give it some thought today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jlogan If you had two text columns requiring different parsers, you could use addCell(s, parser) to add things manually. I also noted that the Row api calls set(row, col) which bypasses any parsing, so that will work as well. That was probably done for efficiency and assuming that the user has more control over manual additions as opposed to loading some random file of dubious quality.
My concern about using a built-in String map function (like substring()) on a column with a custom parser is baseless, because all those methods return a StringColumn rather than a TextColumn (for historical reasons).
So, I think we have the ability to change the missing value handling of parsers by columnType, and have a solution for manually adding values. This probably covers most scenarios.
If we could set a parser at column creation time I think that would be awesome, especially if it can be done in AbstractColumn as Ben suggested, but I think this is only worth doing if it can be done without much difficulty or complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwhite1 Let try to sum up based on your last comment and my deep dive in the code to make sure I understand:
TD;DR We don't need the code from this PR because there are the mitigation strategies below, but a change to allow setting a custom parser at creation time would be welcome, provided that the change isn't too risky.
I'll look into what that entails over the weekend.
Parsing data by default
All ColumnType
classes define a DEFAULT_PARSER
value that is used only by the Column.appendCell(String stringValue)
method. All default parsers appear to use the default missing value list of NaN
, *
NA
, null
, N/A
from AbstractColumnParser
.
The user guide states that appendCell()
converts the string, and append()
merely adds the value. This is true for all columns except TextColumn
, where append()
invokes appendCell()
, and appendObj()
invokes append()
. As a result, most operations involving text column data will result in re-parsing of the data, which can yield unexpected results.
It would be nice if TextColumn
had consistent semantics, but I expect this could break a significant number of existing applications.
Mitigation strategies
As of PR #912, callers can work around the parsing behavior of TextColumn
globally by modifying the missing value list for TextColumn.DEFAULT_PARSER
.
The TextColumn
behavior can be worked around on a call-by-call basis:
- Cells can be appended using
appendCell()
with a custom parser, or by usingappendMissing()
and then setting the value. - Rows can be added by using
appendRow()
andset()
instead ofaddRow()
.
The call-by-call approach might require rewriting much of an application code base. If neither of the above approaches is suitable, the only remaining workaround is to use StringColumn
instead of TextColumn
.
Indirect invocation of default parsing
It seems to me that Tablesaw should only perform implicit parsing for data that is being ingested; it should not re-parse column data except as explicitly specified by the caller. If that is the case, there are a few other cases where Tablesaw invokes appendCell()
is beneath the covers on column data, and these could produce unexpected results:
StringColumn.lag()
- the shifted table will be re-parsedCrossTab.counts()
- label column values that match the default missing values will be eatenCategoricalColumn.countByCategory()
- category names that match the default missing values will be eatenPivotTable.pivot()
- column 1 group names that match the default missing values will be eatenTableSliceGroup.splitGroupingColumn()
invokes appendCell() when populating new columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jglogan Given what you read in the documentation, and considering the design (where append(String) is specific to String and Text Column and appendCell(String) is implemented for all columns and has a version that takes a parser) I feel confident the best course of action is to fix append(String). TextColumn is relatively new and was created by modifying the code for StringColumn, which for performance reasons has unique and complex append logic as it uses dictionary encoding. The bug was no doubt introduced at that point.
I did a simple change to TextColumn::append() to skip parsing and nothing broke. I also fixed the String and TextColumn versions of lag().
As for the appendCell() usages mentioned above, each would have to be considered individually. If they rely on a stringified version of the source values, it may be hard to fix them. In early days, Tablesaw didn't have generic or even boxed interfaces for the columns so Strings were sometimes used as a common type. Where the stringified version is used as a column title or label, there would need to be tests that ensure that the DoubleColumnType.missingValueIndicator() never shows up.
I'm going to push my append() fix. At which point the initial issue can be closed. There could be other issues that look at whether appendCell() is the right choice for the operations you mention.
@jglogan Would you mind taking a look at PR#921 and letting me know what you think? Assuming that looks ok, I think we can close issue #897. The changes discussed here regarding providing a custom parser for a specific column would then be strictly enhancements. I'd be happy to leave that to the future if you are. Thank you very much for your analysis of this problem. I wouldn't have gotten to the solution if i had to find time to dig deep. |
@lwhite1 Yep, making the semantics of appendCell() and append() for TextColumn consistent with the other column types seems like the best approach to me. TextColumn should "just work" in my use case with this change. I'll close this PR in favor of PRs #912 and #921, as issue #897 is sufficiently addressed by the release of those PRs. Thank you! |
Fixes #897
Replace hard-coded parser in TextColumn appendCell() with a private parser field that is initialized to DEFAULT_PARSER to preserve existing default behavior. A client may use the new getStringParser() and setStringParser() methods to access and change the parser used for the column. Setting the parser to null specifies that no parsing shall be performed when adding data to the column.
Description
see above
Testing
Modified existing test to validate that previous behavior does not regress. Added tests for null parser and custom parser.