Use codepoint index for indices/1, index/1 and rindex/1 #3065
Previously, a byte index was used. Fixes jqlang#1430 jqlang#1624 jqlang#3064
while ((p = _jq_memmem(p, (jstr + jlen) - p, idxstr, idxlen)) != NULL) {
a = jv_array_append(a, jv_number(p - jstr));
while (lp < p) {
To make this even more efficient, I guess we would need to count codepoints inside memmem somehow.
I haven't entirely convinced myself yet that it is fine to look for matches using the byte representation. Assuming both the needle and the haystack are valid UTF-8, I think it should be fine because of UTF-8's self-synchronization property. Update: now looking at Line 1374 in c95b34f
I'd like to include this. Any objections to changing the behavior in 1.8?
OK to merge for me, but it would be great if someone could have a look, or knows whether my assumption about strings always being valid UTF-8 is true.
@itchyny asked:
This is a major breaking change, and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.
@wader wrote:
Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid UTF-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count. Perhaps a starting point would be
I can't see how the current behaviour for non-ASCII strings makes any sense, or how it could be useful in any reasonable way, so to me it feels more like a bug.
This is an incomplete surrogate pair? Yep, stuff like this is what I'm concerned about too.
With this change:
$ echo '"a\uDD1Ec"' | ./jq -c '[index("c"), length]'
[2,3]
Seems correct, assuming broken surrogate codepoints should be allowed. But I think I'm mostly concerned about whether there is any way to produce jq strings that have a byte buffer that is not valid UTF-8. If so use of
As @nicowilliams also expressed, this is a bug: #1430 (comment). I'll merge this now.
Should the docs be updated to make it clear that it's a codepoint index instead of a byte index?
Good point. I did a quick skim of the docs and it seems we don't say it's a byte offset anywhere, but we don't make it clear that it's codepoints either. So maybe mention it for the index functions, and possibly also under "Array/String Slice" and/or "Types and Values"? For the regexp functions we do mention that things are in codepoints.