Describe the bug
jq indexes strings by character (Unicode code point), not by byte.
To see that, we can run "🇬🇧oo" | .[0 : 1,2,3,4], which yields "🇬" "🇬🇧" "🇬🇧o" "🇬🇧oo".
Note that 🇬🇧 is actually two characters (code points) and 8 bytes, as we can see from "🇬🇧" | length, utf8bytelength (which outputs 2 and 8).
However, the indices filter returns byte offsets of the pattern in the string.
The documentation does not specify the behaviour of indices for strings containing multi-byte UTF-8 characters, but given that length and .[x:y] use character counts to index strings, it is likely that this is a bug and not just undocumented behaviour.
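To make the difference between the two indexing schemes concrete, here is a minimal Python sketch. It is not jq code: the helper names byte_indices and codepoint_indices are made up for illustration. Python strings are indexed by code point, so searching the string directly gives character offsets, while searching its UTF-8 encoding gives byte offsets.

```python
def byte_indices(haystack, needle):
    """Offsets of needle in haystack, counted in UTF-8 bytes."""
    h, n = haystack.encode("utf-8"), needle.encode("utf-8")
    out, start = [], h.find(n)
    while start != -1:
        out.append(start)
        start = h.find(n, start + 1)
    return out


def codepoint_indices(haystack, needle):
    """Offsets of needle in haystack, counted in Unicode code points."""
    out, start = [], haystack.find(needle)
    while start != -1:
        out.append(start)
        start = haystack.find(needle, start + 1)
    return out


flag = "\U0001F1EC\U0001F1E7"  # 🇬🇧: two code points, 4 UTF-8 bytes each

print(byte_indices(flag + "oo", "o"))       # [8, 9]
print(codepoint_indices(flag + "oo", "o"))  # [2, 3]
print(byte_indices("\u0192oo", "o"))        # [2, 3]  (ƒ is 2 bytes in UTF-8)
print(codepoint_indices("\u0192oo", "o"))   # [1, 2]
```

The byte-based results match what jq 1.7.1 prints in the reproduction, and the code-point-based results match the expected output.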
To Reproduce
$ ./jq-linux-amd64-1.7.1 -nc '"🇬🇧oo" | indices("o")'
[8,9]
$ ./jq-linux-amd64-1.7.1 -nc '"ƒoo" | indices("o")'
[2,3]
Expected behavior
$ ./jq-linux-amd64-1.7.1-fixed -nc '"🇬🇧oo" | indices("o")'
[2,3]
$ ./jq-linux-amd64-1.7.1-fixed -nc '"ƒoo" | indices("o")'
[1,2]
The problem is probably in jv_string_indexes.
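One possible way to turn a byte offset into a character index is to count the bytes before it that are not UTF-8 continuation bytes (those of the form 10xxxxxx). The Python sketch below only illustrates that idea. It says nothing about how jv_string_indexes is actually written, and the function name byte_offset_to_codepoint_index is hypothetical.

```python
def byte_offset_to_codepoint_index(data, offset):
    """Count the code points that start before `offset` in UTF-8 `data`.

    Every byte starts a new code point unless it is a continuation
    byte, i.e. its top two bits are 10.
    """
    return sum(1 for b in data[:offset] if (b & 0xC0) != 0x80)


encoded = "\U0001F1EC\U0001F1E7oo".encode("utf-8")  # "🇬🇧oo", 10 bytes
print([byte_offset_to_codepoint_index(encoded, off) for off in (8, 9)])  # [2, 3]

encoded = "\u0192oo".encode("utf-8")  # "ƒoo", 4 bytes
print([byte_offset_to_codepoint_index(encoded, off) for off in (2, 3)])  # [1, 2]
```

This version rescans the prefix for every match, which is fine for a demonstration. An actual fix would more likely keep a running character count while scanning the string once.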