Description
I have a task to convert a data set containing several variables with very long strings (~64K characters each) to the SPSS SAV format. Given the SPSS variable-length limit, I started by splitting those variables into smaller segments of 32K characters each. Then, I wrote the data set to a SAV file using pyreadstat. However, I found that when reading the SAV file generated by pyreadstat back into SPSS, there were a large number of new variables in the data set (i.e., variables that I did not explicitly create), each starting with "v" and ending with an alphanumeric suffix. These variables appear to correspond to 255-byte segments of the original long string.
If I read the SAV file back into Pandas using pyreadstat, those additional variable segments are not present; working only in Python, everything behaves as expected. The issue appears only when I open the generated SAV file in SPSS itself. Note that we are using SPSS v25.
Looking through the issue history of both pyreadstat and readstat, I came across the following two issues, which appear to be related to my issue:
- tidyverse/haven#266: SPSS Error # 1405 when reading haven-created SAV files containing 256+ byte strings
- WizardMac/ReadStat#122: Virtual variables in ReadStat-created SAV files with long strings are visible un-merged in SPSS
Based on some hints from WizardMac/ReadStat#122 (comment), there seem to be two changes required to work around this issue:
- Rename each long-string column so that its name contains 5 characters or fewer. This is apparently to account for suffixes that readstat adds behind the scenes when it generates variable segments.
- Split the variables into smaller segments of <= 9180 characters each.
After making these two changes, everything works as expected (including in SPSS).
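For anyone hitting the same problem, here is a minimal sketch of the two changes in plain Python. The helper names (`split_long_string`, `segment_names`) and the `txt` prefix are hypothetical and only for illustration; the 9180-character and 5-character limits are the ones that worked for me, not documented guarantees.

```python
# Limits taken from the work-around described above (assumptions, not a spec).
MAX_SEGMENT_CHARS = 9180  # largest segment size that read back cleanly in SPSS
MAX_NAME_CHARS = 5        # leaves room for the suffixes readstat appends


def split_long_string(value, max_chars=MAX_SEGMENT_CHARS):
    """Split one long string into consecutive pieces of at most max_chars."""
    if not value:
        return [""]
    return [value[i:i + max_chars] for i in range(0, len(value), max_chars)]


def segment_names(prefix, count, max_name_chars=MAX_NAME_CHARS):
    """Build short column names such as txt1, txt2, ... for the segments."""
    names = [f"{prefix}{i}" for i in range(1, count + 1)]
    too_long = [n for n in names if len(n) > max_name_chars]
    if too_long:
        raise ValueError(f"names longer than {max_name_chars} characters: {too_long}")
    return names


# Example: one 64,000-character value becomes 7 short-named segments.
text = "x" * 64000
segments = split_long_string(text)
names = segment_names("txt", len(segments))
row = dict(zip(names, segments))
```

Each key in `row` would become its own column in the data frame that is then passed to `pyreadstat.write_sav`; joining the segments in order restores the original string.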
So: I have a viable work-around for this issue, but the behavior is quite unexpected and the work-around is not at all obvious. At the very least, generating some kind of warning in the Python code if pyreadstat detects a scenario that could lead to this issue may be helpful to others.
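To illustrate the kind of warning I have in mind, here is a rough sketch of a pre-write check. It is not part of pyreadstat; the function name `check_long_string_columns` and the thresholds are my own assumptions, taken from the behavior described above (255-byte segments, names of 5 characters or fewer, values of 9180 characters or fewer).

```python
import warnings

# Thresholds are assumptions based on the behavior observed in this issue.
LONG_STRING_BYTES = 255   # strings wider than this get split into segments
MAX_SAFE_CHARS = 9180     # largest value length that worked in SPSS
MAX_SAFE_NAME_CHARS = 5   # longest column name that worked for long strings


def check_long_string_columns(columns):
    """Warn about long-string columns that may show up un-merged in SPSS.

    `columns` maps a column name to an iterable of values.
    Returns the list of column names that triggered a warning.
    """
    risky = []
    for name, values in columns.items():
        strings = [v for v in values if isinstance(v, str)]
        max_bytes = max((len(v.encode("utf-8")) for v in strings), default=0)
        max_chars = max((len(v) for v in strings), default=0)
        if max_bytes <= LONG_STRING_BYTES:
            continue  # not a long string, nothing to check
        reasons = []
        if len(name) > MAX_SAFE_NAME_CHARS:
            reasons.append(f"name has more than {MAX_SAFE_NAME_CHARS} characters")
        if max_chars > MAX_SAFE_CHARS:
            reasons.append(f"values exceed {MAX_SAFE_CHARS} characters")
        if reasons:
            warnings.warn(
                f"long-string column {name!r} may appear as un-merged "
                f"segments in SPSS: {'; '.join(reasons)}",
                UserWarning,
                stacklevel=2,
            )
            risky.append(name)
    return risky
```

Calling this on the column data just before writing would have flagged my original 32K-character columns right away.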