Issues writing Stata StrL variables #269

@gorcha

Hi!

While trying to add StrL support to haven (tidyverse/haven#584) we've noticed a couple of issues with the code for writing string refs.

v,o indexing

The variable and observation numbers (v,o) in a Stata file should be indexed from 1, but ReadStat indexes from 0. readstat_insert_string_ref() doesn't apply this conversion, so the variable and observation index values for the strLs are off by one. There's a simple fix for this:

readstat_error_t readstat_insert_string_ref(readstat_writer_t *writer, const readstat_variable_t *variable, readstat_string_ref_t *ref) {
    if (!writer->initialized)
        return READSTAT_ERROR_WRITER_NOT_INITIALIZED;
    if (variable->type != READSTAT_TYPE_STRING_REF)
        return READSTAT_ERROR_VALUE_TYPE_MISMATCH;
    if (!writer->callbacks.write_string_ref)
        return READSTAT_ERROR_STRING_REFS_NOT_SUPPORTED;

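    /* On first use of a ref, record its (v,o) pair, converting
     * ReadStat's 0-based indices to Stata's 1-based numbering. */
    if (ref && ref->first_o == -1 && ref->first_v == -1) {
        ref->first_o = writer->current_row + 1;
        ref->first_v = variable->index + 1;
    }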

    return writer->callbacks.write_string_ref(&writer->row[variable->offset], variable, ref);
}
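
To make the expected values concrete, here is a minimal standalone sketch of the same conversion. The `toy_string_ref_t` type and `toy_assign_vo()` function are hypothetical stand-ins for illustration only, not part of ReadStat; the sketch assumes -1 is the "not yet assigned" sentinel, as in the snippet above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the (v,o) pair kept on a string ref.
 * -1 means "not yet assigned". */
typedef struct {
    int64_t first_v;
    int64_t first_o;
} toy_string_ref_t;

/* Record the first use of a ref, converting 0-based writer indices
 * to the 1-based (v,o) values that the file format expects.
 * Later uses of the same ref leave the recorded pair untouched. */
static void toy_assign_vo(toy_string_ref_t *ref, int var_index, int current_row) {
    assert(ref != 0 && var_index >= 0 && current_row >= 0);
    if (ref->first_v == -1 && ref->first_o == -1) {
        ref->first_v = var_index + 1;
        ref->first_o = current_row + 1;
    }
}
```

With this, a ref first inserted for variable index 0 in row 0 ends up with (v,o) = (1,1) rather than (0,0), and inserting the same ref again in a later row does not change it.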

map offsets

All of the file offsets in the map in the file header are written in readstat_begin_writing_data(), but at this point the string refs are empty if they're being added row by row. Once the string refs are written out, the file offsets for the sections after the strLs are different, but the map isn't updated. The resulting file fails to read in with the error: Invalid file, or file has unsupported features.

I've confirmed that the issue is the offsets in the map - see the two files in stata_strl.zip:

  • stata_strl.dta was written by haven via ReadStat (using the modified string ref insertion above for correct v,o values). The offsets for the sections after the <strls> are written as if this section is empty.
  • In stata_strl_edit.dta I've manually edited the map with correct file offsets for the sections after the <strls> and this reads in successfully.
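
As a rough illustration of why the map goes stale, here is a simplified model with only four map entries. The section names, sizes, and the `toy_compute_map()` helper are all made up for the example and do not mirror ReadStat's internals or the full map layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical subset of the sections listed in the map. */
enum { TOY_DATA, TOY_STRLS, TOY_VALUE_LABELS, TOY_END, TOY_SECTION_COUNT };

/* Fill offsets[i] with the file position at which section i starts,
 * given the position of the first section and each section's size.
 * The size of the final entry is ignored, since nothing follows it. */
static void toy_compute_map(uint64_t start,
                            const uint64_t sizes[TOY_SECTION_COUNT],
                            uint64_t offsets[TOY_SECTION_COUNT]) {
    assert(sizes != NULL && offsets != NULL);
    uint64_t pos = start;
    for (size_t i = 0; i < TOY_SECTION_COUNT; i++) {
        offsets[i] = pos;
        pos += sizes[i];
    }
}
```

Computing the map once with an empty strLs section and again after 64 bytes of strLs have been added leaves the offsets of the data and strLs sections unchanged, but every later section starts 64 bytes further into the file. A map written at the earlier point therefore points at the wrong positions for everything after the strLs.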

There's no support for non-sequential writing in the API at the moment, so I'm not sure how to get around this apart from populating the string refs before we begin writing rows.
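
A minimal sketch of that "populate first" ordering, using a toy writer rather than the real ReadStat API: every name below (`toy_writer_t`, `toy_add_string_ref()`, `toy_begin_rows()`) is hypothetical, and the size calculation counts only the string bytes plus a terminator, not the real per-entry overhead in the strLs section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_REFS 16

/* Toy writer: all strL payloads must be registered before rows begin. */
typedef struct {
    const char *strings[TOY_MAX_REFS];
    int ref_count;
    int rows_started;            /* set once the header and map are "written" */
    uint64_t strls_size_in_map;  /* strLs size that the map was computed from */
} toy_writer_t;

/* Total payload size of the registered strLs (string bytes plus terminator). */
static uint64_t toy_strls_size(const toy_writer_t *w) {
    uint64_t total = 0;
    for (int i = 0; i < w->ref_count; i++)
        total += strlen(w->strings[i]) + 1;
    return total;
}

/* Register a strL payload. Returns its index, or -1 if the table is full
 * or the map has already been written (adding now would make it stale). */
static int toy_add_string_ref(toy_writer_t *w, const char *s) {
    assert(w != NULL && s != NULL);
    if (w->rows_started || w->ref_count == TOY_MAX_REFS)
        return -1;
    w->strings[w->ref_count] = s;
    return w->ref_count++;
}

/* Freeze the strLs size used for the map, then allow rows to be written. */
static void toy_begin_rows(toy_writer_t *w) {
    assert(w != NULL);
    w->strls_size_in_map = toy_strls_size(w);
    w->rows_started = 1;
}
```

Because every string is known before `toy_begin_rows()` runs, the size used for the map matches what is eventually written. The cost is that the caller has to collect all strL values up front, which is the limitation described above.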
