Fixes #897 - override default TextColumn string parsing. #913

jglogan · 2021-04-27T20:24:09Z

Fixes #897

Replace hard-coded parser in TextColumn appendCell() with a private parser field that is initialized to DEFAULT_PARSER to preserve existing default behavior. A client may use the new getStringParser() and setStringParser() methods to access and change the parser used for the column. Setting the parser to null specifies that no parsing shall be performed when adding data to the column.

Tick to sign-off your agreement to the Developer Certificate of Origin (DCO) 1.1

Description

see above

Testing

Modified existing test to validate that previous behavior does not regress. Added tests for null parser and custom parser.

lwhite1 · 2021-04-28T14:44:48Z

I need to think about this and would also like @benmccann to take a look. I like the fact that this goes deeper than the half-fix I implemented, but am concerned about some of the things you mentioned in the Issue - in short, are there any side effects to making this change that we're not aware of.

I also need to look at the implementation details. I'll probably do that after getting some of these other PRs out of the way.

jglogan · 2021-04-28T15:55:47Z

It seems to me that #912 is still relevant as it fixes the double parsing issue and adds the control over the missing markers in the StringParser. Whatever form this fix might take should probably be rebased with that PR in place.

- Replace hard-coded parser in TextColumn appendCell() with a private parser field that is initialized to DEFAULT_PARSER to preserve existing default behavior. A client may use the new getStringParser() and setStringParser() methods to access and change the parser used for the column. Setting the parser to null specifies that no parsing shall be performed when adding data to the column.

jglogan · 2021-04-28T20:53:36Z

Rebased, resolved conflicts with #912.

benmccann · 2021-04-28T21:40:34Z

core/src/main/java/tech/tablesaw/api/TextColumn.java

+    return stringParser;
+  }
+
+  public void setStringParser(StringParser stringParser) {


It'd probably be better to call this something more generic like getParser so that we could add this to the other column types as well. In fact this might better live in AbstractColumn

I also wonder if putting this in one of the create methods might be a better alternative. It doesn't really seem to me like you'd want to change the parser after column creation (that could possibly lead to strange situations where the column contains contents that can't be parsed by the updated parser)

@benmccann The parser is only used when putting new data into the column. I can imagine scenarios where you load a file that requires one kind of parser and append to the table later with different rules.

OTOH, I'm concerned about what happens if you apply a built-in map function to the column (e.g. StringColumn.substring(0, 7). The new column may not be able to parse what the old column did unless the map functions (a) skip the parser (they might, i have no idea) or (b) inherit the parser of the source column. The latter could still have issues if the function operates on two string columns (e.g. concatenate()).

@benmccann I get your point on the method name; I hadn't considered generalizing it but I'll take a look at that.

@lwhite1 The general case you describe is that where Tablesaw creates and populates a column, and the user requires/expects that column to have custom parsing?

I realized that my use case is actually addressed by #912, in that it's sufficient for me to globally set text column parsing policy:

TextColumnType.DEFAULT_PARSER.setMissingValueStrings(Collections.emptyList());

If I had text columns that required distinct parsers, I'd be out of luck though.

Thanks for the feedback! I'll give it some thought today.

@jlogan If you had two text columns requiring different parsers, you could use addCell(s, parser) to add things manually. I also noted that the Row api calls set(row, col) which bypasses any parsing, so that will work as well. That was probably done for efficiency and assuming that the user has more control over manual additions as opposed to loading some random file of dubious quality.

My concern about using a built-in String map function (like substring()) on a column with a custom parser is baseless, because all those methods return a StringColumn rather than a TextColumn (for historical reasons).

So, I think we have the ability to change the missing value handling of parsers by columnType, and have a solution for manually adding values. This probably covers most scenarios.

If we could set a parser at column creation time I think that would be awesome, especially if it can be done in AbstractColumn as Ben suggested, but I think this is only worth doing if it can be done without much difficulty or complexity.

@lwhite1 Let try to sum up based on your last comment and my deep dive in the code to make sure I understand:

TD;DR We don't need the code from this PR because there are the mitigation strategies below, but a change to allow setting a custom parser at creation time would be welcome, provided that the change isn't too risky.

I'll look into what that entails over the weekend.

Parsing data by default

All ColumnType classes define a DEFAULT_PARSER value that is used only by the Column.appendCell(String stringValue) method. All default parsers appear to use the default missing value list of NaN, * NA, null, N/A from AbstractColumnParser.

The user guide states that appendCell() converts the string, and append() merely adds the value. This is true for all columns except TextColumn, where append() invokes appendCell(), and appendObj() invokes append(). As a result, most operations involving text column data will result in re-parsing of the data, which can yield unexpected results.

It would be nice if TextColumn had consistent semantics, but I expect this could break a significant number of existing applications.

Mitigation strategies

As of PR #912, callers can work around the parsing behavior of TextColumn globally by modifying the missing value list for TextColumn.DEFAULT_PARSER.

The TextColumn behavior can be worked around on a call-by-call basis:

Cells can be appended using appendCell() with a custom parser, or by using appendMissing() and then setting the value.

Rows can be added by using appendRow() and set() instead of addRow().

The call-by-call approach might require rewriting much of an application code base. If neither of the above approaches is suitable, the only remaining workaround is to use StringColumn instead of TextColumn.

Indirect invocation of default parsing

It seems to me that Tablesaw should only perform implicit parsing for data that is being ingested; it should not re-parse column data except as explicitly specified by the caller. If that is the case, there are a few other cases where Tablesaw invokes appendCell() is beneath the covers on column data, and these could produce unexpected results:

StringColumn.lag() - the shifted table will be re-parsed

CrossTab.counts() - label column values that match the default missing values will be eaten

CategoricalColumn.countByCategory() - category names that match the default missing values will be eaten

PivotTable.pivot() - column 1 group names that match the default missing values will be eaten

TableSliceGroup.splitGroupingColumn() invokes appendCell() when populating new columns

@jglogan Given what you read in the documentation, and considering the design (where append(String) is specific to String and Text Column and appendCell(String) is implemented for all columns and has a version that takes a parser) I feel confident the best course of action is to fix append(String). TextColumn is relatively new and was created by modifying the code for StringColumn, which for performance reasons has unique and complex append logic as it uses dictionary encoding. The bug was no doubt introduced at that point.

I did a simple change to TextColumn::append() to skip parsing and nothing broke. I also fixed the String and TextColumn versions of lag().

As for the appendCell() usages mentioned above, each would have to be considered individually. If they rely on a stringified version of the source values, it may be hard to fix them. In early days, Tablesaw didn't have generic or even boxed interfaces for the columns so Strings were sometimes used as a common type. Where the stringified version is used as a column title or label, there would need to be tests that ensure that the DoubleColumnType.missingValueIndicator() never shows up.

I'm going to push my append() fix. At which point the initial issue can be closed. There could be other issues that look at whether appendCell() is the right choice for the operations you mention.

lwhite1 · 2021-05-01T15:29:42Z

@jglogan Would you mind taking a look at PR#921 and letting me know what you think?

Assuming that looks ok, I think we can close issue #897. The changes discussed here regarding providing a custom parser for a specific column would then be strictly enhancements. I'd be happy to leave that to the future if you are.

Thank you very much for your analysis of this problem. I wouldn't have gotten to the solution if i had to find time to dig deep.

jglogan · 2021-05-01T16:37:24Z

@lwhite1 Yep, making the semantics of appendCell() and append() for TextColumn consistent with the other column types seems like the best approach to me. TextColumn should "just work" in my use case with this change.

I'll close this PR in favor of PRs #912 and #921, as issue #897 is sufficiently addressed by the release of those PRs. Thank you!

jglogan mentioned this pull request Apr 27, 2021

Is it possible to add the text "*" to a text column? #897

Closed

jglogan force-pushed the issue/891 branch from eae01a5 to 3880775 Compare April 27, 2021 21:05

jglogan added 2 commits April 28, 2021 13:44

Improve unit test coverage.

0198fc8

jglogan force-pushed the issue/891 branch from 3880775 to 0198fc8 Compare April 28, 2021 20:52

benmccann reviewed Apr 28, 2021

View reviewed changes

lwhite1 mentioned this pull request May 1, 2021

Fix implementation of append(String) in TextColumn #921

Merged

1 task

jglogan closed this May 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fixes #897 - override default TextColumn string parsing. #913

Fixes #897 - override default TextColumn string parsing. #913

Uh oh!

jglogan commented Apr 27, 2021 •

edited by benmccann

Loading

Uh oh!

lwhite1 commented Apr 28, 2021

Uh oh!

jglogan commented Apr 28, 2021

Uh oh!

jglogan commented Apr 28, 2021

Uh oh!

benmccann Apr 28, 2021

Uh oh!

lwhite1 Apr 29, 2021

Uh oh!

jglogan Apr 29, 2021

Uh oh!

lwhite1 Apr 29, 2021

Uh oh!

jglogan Apr 30, 2021

Uh oh!

lwhite1 May 1, 2021

Uh oh!

lwhite1 commented May 1, 2021

Uh oh!

jglogan commented May 1, 2021

Uh oh!

Uh oh!

Fixes #897 - override default TextColumn string parsing. #913

Fixes #897 - override default TextColumn string parsing. #913

Uh oh!

Conversation

jglogan commented Apr 27, 2021 • edited by benmccann Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Uh oh!

lwhite1 commented Apr 28, 2021

Uh oh!

jglogan commented Apr 28, 2021

Uh oh!

jglogan commented Apr 28, 2021

Uh oh!

benmccann Apr 28, 2021

Choose a reason for hiding this comment

Uh oh!

lwhite1 Apr 29, 2021

Choose a reason for hiding this comment

Uh oh!

jglogan Apr 29, 2021

Choose a reason for hiding this comment

Uh oh!

lwhite1 Apr 29, 2021

Choose a reason for hiding this comment

Uh oh!

jglogan Apr 30, 2021

Choose a reason for hiding this comment

Parsing data by default

Mitigation strategies

Indirect invocation of default parsing

Uh oh!

lwhite1 May 1, 2021

Choose a reason for hiding this comment

Uh oh!

lwhite1 commented May 1, 2021

Uh oh!

jglogan commented May 1, 2021

Uh oh!

Uh oh!

jglogan commented Apr 27, 2021 •

edited by benmccann

Loading