Long string handling

I have a task to convert a data set containing several variables with very long strings (~64K characters each) to the SPSS SAV format.  Given the SPSS variable-length limit, I started by splitting those variables into smaller segments of 32K characters each.  Then, I wrote the data set to a SAV file using pyreadstat.  However, I found that when reading the SAV file generated by pyreadstat back into SPSS, there were a large number of new variables in the data set (i.e., variables that I did not explicitly create), each starting with "v" and ending with an alphanumeric suffix.  These variables appear to correspond to 255-byte segments of the original long string.

If I read the SAV file back into Pandas using pyreadstat, those additional variable segments are not present; working only in Python, everything behaves as expected.  It is only when I try to read the generated SAV file in SPSS itself that I encounter the issue.  Note, we are using SPSS v25.

Looking through the issue history of both pyreadstat and readstat, I came across the following two issues, which appear to be related to my issue:

* https://github.com/tidyverse/haven/issues/266
* https://github.com/WizardMac/ReadStat/issues/122

Based on some hints from https://github.com/WizardMac/ReadStat/issues/122#issuecomment-366701200, two changes seem to be required to work around this issue:
1. Rename each long-string column so that its name contains 5 characters or fewer.  This is apparently to account for suffixes that readstat adds behind the scenes when it generates variable segments.
2. Split the variables into smaller segments of <= 9180 characters each.
After making these two changes, everything works as expected (including in SPSS).
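
For anyone hitting the same problem, the two changes above can be sketched roughly as follows. This is only an illustration, not code from pyreadstat or readstat: the helper names (`split_long_string`, `split_column`) and the `txt` prefix are made up for the example, and the limits are simply the values that worked for me. I measure length in characters, as above; I have not checked whether non-ASCII text needs a byte-based limit instead.

```python
MAX_SEGMENT_CHARS = 9180  # segment size that worked in this report
MAX_NAME_CHARS = 5        # variable-name length that worked in this report


def split_long_string(value, max_len=MAX_SEGMENT_CHARS):
    """Split a string into consecutive chunks of at most max_len characters."""
    if not value:
        return [""]
    return [value[i:i + max_len] for i in range(0, len(value), max_len)]


def split_column(values, prefix, max_len=MAX_SEGMENT_CHARS):
    """Turn one long-string column into several short-named segment columns.

    values: list of strings, one per row.
    prefix: hypothetical short prefix; segments are named prefix1, prefix2, ...
    Returns a dict mapping segment name -> list of chunks (one per row).
    """
    if not values:
        return {}
    chunked = [split_long_string(v, max_len) for v in values]
    n_segments = max(len(chunks) for chunks in chunked)
    columns = {}
    for k in range(n_segments):
        name = f"{prefix}{k + 1}"
        if len(name) > MAX_NAME_CHARS:
            raise ValueError(f"segment name {name!r} exceeds {MAX_NAME_CHARS} characters")
        # Rows with fewer chunks than the longest row are padded with "".
        columns[name] = [chunks[k] if k < len(chunks) else "" for chunks in chunked]
    return columns
```

The resulting dict can be turned into DataFrame columns (for example `pd.DataFrame(split_column(df["comments"].tolist(), "txt"))`, where `comments` is a placeholder column name) and written with `pyreadstat.write_sav` as before. Concatenating the segments row by row gives back the original string.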

So: I have a viable work-around for this issue, but the behavior is quite unexpected and the work-around is not at all obvious.  At the very least, generating some kind of warning in the Python code if pyreadstat detects a scenario that could lead to this issue may be helpful to others.
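
To make the suggestion concrete, here is a rough sketch of the kind of check I have in mind. It is not an existing pyreadstat function; `check_long_strings` and its thresholds are hypothetical and based only on what I observed. In particular, treating anything over 255 characters as a "long string" is my assumption, taken from the 255-byte segments mentioned above.

```python
import warnings

LONG_STRING_CHARS = 255    # assumed threshold above which segmentation happens
MAX_SAFE_CHARS = 9180      # longest segment that worked in this report
MAX_SAFE_NAME_CHARS = 5    # longest variable name that worked in this report


def check_long_strings(columns):
    """Warn about columns that may produce extra variables in SPSS.

    columns: dict mapping column name -> list of values (one per row).
    Returns the list of warning messages that were emitted.
    """
    messages = []
    for name, values in columns.items():
        # Only string values count; missing or numeric values are ignored.
        longest = max((len(v) for v in values if isinstance(v, str)), default=0)
        if longest <= LONG_STRING_CHARS:
            continue
        if longest > MAX_SAFE_CHARS:
            messages.append(
                f"column {name!r} holds strings of up to {longest} characters; "
                f"consider splitting into segments of <= {MAX_SAFE_CHARS}"
            )
        if len(name) > MAX_SAFE_NAME_CHARS:
            messages.append(
                f"long-string column {name!r} has a name longer than "
                f"{MAX_SAFE_NAME_CHARS} characters"
            )
    for message in messages:
        warnings.warn(message, UserWarning, stacklevel=2)
    return messages
```

Called on the columns of a data set before writing, this would flag a 32K-character column named `comments` twice (once for length, once for the name) and stay silent for short strings.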

Long string handling #118