indices reports byte offsets instead of character offsets #3064

@01mf02

Describe the bug
jq indexes strings by code points (characters), not by bytes.
To see this, run "🇬🇧oo" | .[0 : 1,2,3,4], which yields "🇬", "🇬🇧", "🇬🇧o", and "🇬🇧oo".
Note that 🇬🇧 consists of two code points that occupy 8 bytes in UTF-8, as "🇬🇧" | length, utf8bytelength shows (it yields 2 and 8).
However, the indices filter returns the byte offsets of the pattern within the string.
The documentation does not specify how indices behaves on strings that contain multi-byte UTF-8 characters, but since length and .[x:y] both count in code points, this is most likely a bug rather than merely undocumented behaviour.

To Reproduce
$ ./jq-linux-amd64-1.7.1 -nc '"🇬🇧oo" | indices("o")'
[8,9]
$ ./jq-linux-amd64-1.7.1 -nc '"ƒoo" | indices("o")'
[2,3]

Expected behavior
$ ./jq-linux-amd64-1.7.1-fixed -nc '"🇬🇧oo" | indices("o")'
[2,3]
$ ./jq-linux-amd64-1.7.1-fixed -nc '"ƒoo" | indices("o")'
[1,2]
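
To make the difference between the two outputs concrete, here is a small Python sketch (not jq code) that computes both kinds of offsets for the same inputs. The function names byte_indices and codepoint_indices are made up for this illustration. The sketch relies on Python's str being indexed by code points and bytes being indexed by byte.

```python
def byte_indices(s: str, pattern: str) -> list[int]:
    """Offsets of each match, counted in UTF-8 bytes (what indices returns today)."""
    if not pattern:
        return []
    data, pat = s.encode("utf-8"), pattern.encode("utf-8")
    out = []
    hit = data.find(pat)
    while hit != -1:
        out.append(hit)
        hit = data.find(pat, hit + 1)  # step by one byte so overlapping matches are found
    return out


def codepoint_indices(s: str, pattern: str) -> list[int]:
    """Offsets of each match, counted in code points (the expected behaviour)."""
    if not pattern:
        return []
    out = []
    hit = s.find(pattern)
    while hit != -1:
        out.append(hit)
        hit = s.find(pattern, hit + 1)
    return out


if __name__ == "__main__":
    for text in ("🇬🇧oo", "ƒoo"):
        print(text, byte_indices(text, "o"), codepoint_indices(text, "o"))
```

For "🇬🇧oo" the byte offsets are [8, 9] and the code point offsets are [2, 3]; for "ƒoo" they are [2, 3] and [1, 2], matching the actual and expected outputs above.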

The problem probably originates in jv_string_indexes.
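
One possible direction for a fix, sketched in Python rather than C and not taken from the jq source: keep searching on the raw UTF-8 bytes, but report each match as the number of code points that precede it. In UTF-8, every byte of the form 10xxxxxx is a continuation byte, so counting the bytes that are not continuation bytes gives the code point count. The function name codepoint_indices_from_bytes is hypothetical.

```python
def codepoint_indices_from_bytes(data: bytes, pat: bytes) -> list[int]:
    """Find pat in UTF-8 encoded data and return code point offsets of the matches.

    Assumes both arguments are valid UTF-8. The running count means each byte
    of data is inspected at most once while converting offsets.
    """
    if not pat:
        return []
    out = []
    codepoints = 0  # code points that start before byte position `pos`
    pos = 0
    hit = data.find(pat)
    while hit != -1:
        # A byte starts a new code point unless it is a continuation byte (10xxxxxx).
        codepoints += sum(1 for b in data[pos:hit] if (b & 0xC0) != 0x80)
        pos = hit
        out.append(codepoints)
        hit = data.find(pat, hit + 1)
    return out


if __name__ == "__main__":
    print(codepoint_indices_from_bytes("🇬🇧oo".encode("utf-8"), b"o"))
    print(codepoint_indices_from_bytes("ƒoo".encode("utf-8"), b"o"))
```

With the two inputs from this report, the sketch returns [2, 3] and [1, 2], which are the values listed under Expected behavior.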
