Long string handling #118

@mnizol

I have a task to convert a data set containing several variables with very long strings (~64K characters each) to the SPSS SAV format. Given the SPSS variable-length limit, I started by splitting those variables into smaller segments of 32K characters each. Then, I wrote the data set to a SAV file using pyreadstat. However, I found that when reading the SAV file generated by pyreadstat back into SPSS, there were a large number of new variables in the data set (i.e., variables that I did not explicitly create), each starting with "v" and ending with an alphanumeric suffix. These variables appear to correspond to 255-byte segments of the original long string.

If I read the SAV file back into Pandas using pyreadstat, those additional variable segments are not present; working only in Python, everything behaves as expected. It is only when I try to read the generated SAV file in SPSS itself that I encounter the issue. Note, we are using SPSS v25.

Looking through the issue history of both pyreadstat and readstat, I came across the following two issues, which appear to be related to my issue:

Based on some hints from WizardMac/ReadStat#122 (comment), two changes seem to be required to work around this issue:

  1. Rename the long-string columns so that each name contains 5 characters or fewer. This is apparently to account for suffixes that readstat adds behind the scenes when it generates variable segments.
  2. Split the variables into smaller segments of <= 9180 characters each.

After making these two changes, everything works as expected (including in SPSS).
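
For anyone hitting the same problem, here is a minimal sketch of the two steps above in plain Python. The helper names (`segment_name`, `split_long_string`, `split_column`) and the prefix-plus-index naming scheme are my own illustration and not part of pyreadstat. The 9180 and 5 limits are simply the values that worked for me, and the code counts characters, so text that is not plain ASCII may need a smaller segment size.

```python
MAX_SEGMENT_CHARS = 9180  # segment size that worked in my case (not a documented limit)
MAX_NAME_CHARS = 5        # name length that worked in my case (not a documented limit)


def segment_name(prefix: str, index: int) -> str:
    """Build a short column name such as 'a0', 'a1', ... (hypothetical scheme)."""
    name = f"{prefix}{index}"
    if len(name) > MAX_NAME_CHARS:
        raise ValueError(f"segment name {name!r} exceeds {MAX_NAME_CHARS} characters")
    return name


def split_long_string(value: str, size: int = MAX_SEGMENT_CHARS) -> list[str]:
    """Split value into consecutive pieces of at most `size` characters."""
    # An empty string still yields one (empty) piece so every row has a segment.
    return [value[i:i + size] for i in range(0, len(value), size)] or [""]


def split_column(prefix: str, values: list[str],
                 size: int = MAX_SEGMENT_CHARS) -> dict[str, list[str]]:
    """Turn one long-string column into several short-named segment columns."""
    pieces = [split_long_string(v, size) for v in values]
    n_segments = max((len(p) for p in pieces), default=0)
    columns = {}
    for j in range(n_segments):
        # Rows with fewer pieces are padded with empty strings.
        columns[segment_name(prefix, j)] = [p[j] if j < len(p) else "" for p in pieces]
    return columns
```

The resulting dict can be turned into DataFrame columns and written with `pyreadstat.write_sav` as usual; concatenating the segment columns in order restores the original string.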

So: I have a viable workaround for this issue, but the behavior is quite unexpected and the workaround is not at all obvious. At the very least, it may help others if pyreadstat emitted a warning in the Python code when it detects a scenario that could lead to this issue.
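
To illustrate the kind of warning I have in mind, here is a rough user-side sketch. This is not an existing pyreadstat function; `find_risky_long_string_columns` is a hypothetical helper, and the thresholds are assumptions taken from my workaround (names longer than 5 characters, values longer than 9180 characters). It also assumes that only strings longer than 255 characters get segmented, based on the 255-byte segments I saw in SPSS.

```python
import warnings


def find_risky_long_string_columns(columns: dict[str, list],
                                   max_name_chars: int = 5,
                                   max_value_chars: int = 9180,
                                   long_string_chars: int = 255) -> list[str]:
    """Warn about, and return, column names that may trigger the issue."""
    flagged = []
    for name, values in columns.items():
        # Length of the longest string value in this column (0 if none).
        longest = max((len(v) for v in values if isinstance(v, str)), default=0)
        if longest <= long_string_chars:
            continue  # assumed not to be segmented, so not at risk
        reasons = []
        if len(name) > max_name_chars:
            reasons.append(f"name is longer than {max_name_chars} characters")
        if longest > max_value_chars:
            reasons.append(f"longest value ({longest} chars) exceeds {max_value_chars}")
        if reasons:
            flagged.append(name)
            warnings.warn(
                f"Column {name!r} may show up as extra variables in SPSS: "
                + "; ".join(reasons)
            )
    return flagged
```

Calling this on a `{column_name: values}` mapping before writing the SAV file would at least point users at the columns that need renaming or splitting.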
