Skip to content

Parse SAS format strings for XPT outputs (#257) #258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Nov 23, 2021

Conversation

gorcha
Copy link
Contributor

@gorcha gorcha commented Nov 21, 2021

This PR adds a simple parser to extract format name, width and decimals from a SAS format string to be used when writing XPT files, closing #257.

It parses the format string in reverse, assigns width and decimals based on the pattern of numbers and periods at the end, and uses the remainder of the string as the format name.

There are also a couple of tweaks to the section that writes the long format name for XPT v8 files:

  • Now uses a simplified version of the parser to extract the format name
  • Writes the entire format name to the long format record per the spec
  • The long format record was incorrectly writing the label out after the format/informat values, these have been swapped.

* SAS format roundtrip uses $CHAR3 instead of $CHAR3. - the trailing
period is dropped when roundtripping XPT files but it represents the
same format
* Check if label_set has been set before running roundtrip test
@gorcha
Copy link
Contributor Author

gorcha commented Nov 21, 2021

I've made a few more updates after seeing the test failures. The XPT reader code has been updated to reflect the changes to the XPT writer.

I had to make a couple of changes to the tests for what looked like false positives, and I think they make sense:

  • One of the SAS format roundtrip tests has been changed to use $CHAR3 instead of $CHAR3.. The trailing period is dropped when roundtripping XPT files because of the way the format is reconstructed so $CHAR3. is read back in as $CHAR3 and fails the test. It represents the same format so it made to sense to me to update the test.
  • It now checks if label_set has been set before running the roundtrip test (as was already the case for format and display_width). It was throwing some strange errors in the display_width tests that looked like it was checking unallocated memory like:
Test "Display width" failed: Column label sets
 * Format: xpt5 (0x200000)
 * Expected: "¤�~y²U" (length=6)
 * Received: "�~y²U" (length=5)
 * Column: 1

@gorcha
Copy link
Contributor Author

gorcha commented Nov 23, 2021

I've updated the format parsing code to use a Ragel parser and added an xport_format_t struct to store the parser output, and added a few more tests to check it's throwing errors correctly.

I've also had a go at the fuzzer setup but a bit less sure of what I'm doing there!

@gorcha gorcha requested a review from evanmiller November 23, 2021 15:06
Copy link
Contributor

@evanmiller evanmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just found one whitespace issue, otherwise I'm happy with it.

Thanks for taking the time to implement the fuzzer – and of course the word dictionary is a nice touch.

@evanmiller
Copy link
Contributor

It looks like Appveyor is failing:

https://ci.appveyor.com/project/evanmiller/readstat/builds/41638815

You may need to modify the Visual Studio project file to ensure all the new files are being built. See

https://github.com/WizardMac/ReadStat/blob/master/VS17/ReadStat.vcxproj

@gorcha
Copy link
Contributor Author

gorcha commented Nov 23, 2021

Awesome, thanks 🙂

I've made the appropriate fixes to the VS17 project file, appveyor is looking happier now.

@evanmiller evanmiller merged commit 1e0ca4d into WizardMac:dev Nov 23, 2021
@evanmiller
Copy link
Contributor

Thanks, much better now!

@mstackhouse
Copy link

@evanmiller - any idea when these changes might be able to be merged into a new release of ReadStat? These changes would open up some important doors for haven, which will have a significant impact in some pharmaceutical industry projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants