Skip to content

Conversation

jglogan
Copy link
Contributor

@jglogan jglogan commented Apr 27, 2021

Fixes #897

Replace hard-coded parser in TextColumn appendCell() with a private parser field that is initialized to DEFAULT_PARSER to preserve existing default behavior. A client may use the new getStringParser() and setStringParser() methods to access and change the parser used for the column. Setting the parser to null specifies that no parsing shall be performed when adding data to the column.

Description

see above

Testing

Modified existing test to validate that previous behavior does not regress. Added tests for null parser and custom parser.

@lwhite1
Copy link
Collaborator

lwhite1 commented Apr 28, 2021

I need to think about this and would also like @benmccann to take a look. I like the fact that this goes deeper than the half-fix I implemented, but am concerned about some of the things you mentioned in the Issue - in short, are there any side effects to making this change that we're not aware of.

I also need to look at the implementation details. I'll probably do that after getting some of these other PRs out of the way.

@jglogan
Copy link
Contributor Author

jglogan commented Apr 28, 2021

It seems to me that #912 is still relevant as it fixes the double parsing issue and adds the control over the missing markers in the StringParser. Whatever form this fix might take should probably be rebased with that PR in place.

jglogan added 2 commits April 28, 2021 13:44
- Replace hard-coded parser in TextColumn
  appendCell() with a private parser field
  that is initialized to DEFAULT_PARSER to
  preserve existing default behavior.  A
  client may use the new getStringParser()
  and setStringParser() methods to access
  and change the parser used for the column.
  Setting the parser to null specifies that
  no parsing shall be performed when adding
  data to the column.
@jglogan
Copy link
Contributor Author

jglogan commented Apr 28, 2021

Rebased, resolved conflicts with #912.

return stringParser;
}

public void setStringParser(StringParser stringParser) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd probably be better to call this something more generic like getParser so that we could add this to the other column types as well. In fact this might better live in AbstractColumn

I also wonder if putting this in one of the create methods might be a better alternative. It doesn't really seem to me like you'd want to change the parser after column creation (that could possibly lead to strange situations where the column contains contents that can't be parsed by the updated parser)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benmccann The parser is only used when putting new data into the column. I can imagine scenarios where you load a file that requires one kind of parser and append to the table later with different rules.

OTOH, I'm concerned about what happens if you apply a built-in map function to the column (e.g. StringColumn.substring(0, 7). The new column may not be able to parse what the old column did unless the map functions (a) skip the parser (they might, i have no idea) or (b) inherit the parser of the source column. The latter could still have issues if the function operates on two string columns (e.g. concatenate()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benmccann I get your point on the method name; I hadn't considered generalizing it but I'll take a look at that.

@lwhite1 The general case you describe is that where Tablesaw creates and populates a column, and the user requires/expects that column to have custom parsing?

I realized that my use case is actually addressed by #912, in that it's sufficient for me to globally set text column parsing policy:

TextColumnType.DEFAULT_PARSER.setMissingValueStrings(Collections.emptyList());

If I had text columns that required distinct parsers, I'd be out of luck though.

Thanks for the feedback! I'll give it some thought today.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlogan If you had two text columns requiring different parsers, you could use addCell(s, parser) to add things manually. I also noted that the Row api calls set(row, col) which bypasses any parsing, so that will work as well. That was probably done for efficiency and assuming that the user has more control over manual additions as opposed to loading some random file of dubious quality.

My concern about using a built-in String map function (like substring()) on a column with a custom parser is baseless, because all those methods return a StringColumn rather than a TextColumn (for historical reasons).

So, I think we have the ability to change the missing value handling of parsers by columnType, and have a solution for manually adding values. This probably covers most scenarios.

If we could set a parser at column creation time I think that would be awesome, especially if it can be done in AbstractColumn as Ben suggested, but I think this is only worth doing if it can be done without much difficulty or complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwhite1 Let try to sum up based on your last comment and my deep dive in the code to make sure I understand:

TD;DR We don't need the code from this PR because there are the mitigation strategies below, but a change to allow setting a custom parser at creation time would be welcome, provided that the change isn't too risky.

I'll look into what that entails over the weekend.

Parsing data by default

All ColumnType classes define a DEFAULT_PARSER value that is used only by the Column.appendCell(String stringValue) method. All default parsers appear to use the default missing value list of NaN, * NA, null, N/A from AbstractColumnParser.

The user guide states that appendCell() converts the string, and append() merely adds the value. This is true for all columns except TextColumn, where append() invokes appendCell(), and appendObj() invokes append(). As a result, most operations involving text column data will result in re-parsing of the data, which can yield unexpected results.

It would be nice if TextColumn had consistent semantics, but I expect this could break a significant number of existing applications.

Mitigation strategies

As of PR #912, callers can work around the parsing behavior of TextColumn globally by modifying the missing value list for TextColumn.DEFAULT_PARSER.

The TextColumn behavior can be worked around on a call-by-call basis:

  • Cells can be appended using appendCell() with a custom parser, or by using appendMissing() and then setting the value.
  • Rows can be added by using appendRow() and set() instead of addRow().

The call-by-call approach might require rewriting much of an application code base. If neither of the above approaches is suitable, the only remaining workaround is to use StringColumn instead of TextColumn.

Indirect invocation of default parsing

It seems to me that Tablesaw should only perform implicit parsing for data that is being ingested; it should not re-parse column data except as explicitly specified by the caller. If that is the case, there are a few other cases where Tablesaw invokes appendCell() is beneath the covers on column data, and these could produce unexpected results:

  • StringColumn.lag() - the shifted table will be re-parsed
  • CrossTab.counts() - label column values that match the default missing values will be eaten
  • CategoricalColumn.countByCategory() - category names that match the default missing values will be eaten
  • PivotTable.pivot() - column 1 group names that match the default missing values will be eaten
  • TableSliceGroup.splitGroupingColumn() invokes appendCell() when populating new columns

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglogan Given what you read in the documentation, and considering the design (where append(String) is specific to String and Text Column and appendCell(String) is implemented for all columns and has a version that takes a parser) I feel confident the best course of action is to fix append(String). TextColumn is relatively new and was created by modifying the code for StringColumn, which for performance reasons has unique and complex append logic as it uses dictionary encoding. The bug was no doubt introduced at that point.

I did a simple change to TextColumn::append() to skip parsing and nothing broke. I also fixed the String and TextColumn versions of lag().

As for the appendCell() usages mentioned above, each would have to be considered individually. If they rely on a stringified version of the source values, it may be hard to fix them. In early days, Tablesaw didn't have generic or even boxed interfaces for the columns so Strings were sometimes used as a common type. Where the stringified version is used as a column title or label, there would need to be tests that ensure that the DoubleColumnType.missingValueIndicator() never shows up.

I'm going to push my append() fix. At which point the initial issue can be closed. There could be other issues that look at whether appendCell() is the right choice for the operations you mention.

@lwhite1
Copy link
Collaborator

lwhite1 commented May 1, 2021

@jglogan Would you mind taking a look at PR#921 and letting me know what you think?

Assuming that looks ok, I think we can close issue #897. The changes discussed here regarding providing a custom parser for a specific column would then be strictly enhancements. I'd be happy to leave that to the future if you are.

Thank you very much for your analysis of this problem. I wouldn't have gotten to the solution if i had to find time to dig deep.

@jglogan
Copy link
Contributor Author

jglogan commented May 1, 2021

@lwhite1 Yep, making the semantics of appendCell() and append() for TextColumn consistent with the other column types seems like the best approach to me. TextColumn should "just work" in my use case with this change.

I'll close this PR in favor of PRs #912 and #921, as issue #897 is sufficiently addressed by the release of those PRs. Thank you!

@jglogan jglogan closed this May 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is it possible to add the text "*" to a text column?
3 participants