Lazily store the number of Unicode characters a string contains. #176

ltratt · 2020-08-06T13:59:14Z

Calculating the number of Unicode characters in a string is an O(n) operation. Previously everytime String_::length was called we iterated over the underlying string to discover how many Unicode characters it contains.

This commit adds a field to SOM strings that stores its number of Unicode characters. Since not all strings will be queried for this property, we lazily cache this property the first time that length is called on a String_. We use usize::max as our marker as that means the marker is "safe" in all cases.

On the JSON benchmark, with GC off, this speeds things up by about 80%.

Needs softdevteam/libgc#19 to be merged first.

Calculating the number of Unicode characters in a string is an O(n) operation. Previously everytime `String_::length` was called we iterated over the underlying string to discover how many Unicode characters it contains. This commit adds a field to SOM strings that stores its number of Unicode characters. Since not all strings will be queried for this property, we lazily cache this property the first time that `length` is called on a `String_`. We use `usize::max` as our marker as that means the marker is "safe" in all cases. On the JSON benchmark, with GC off, this speeds things up by about 80%.

ptersilie · 2020-08-07T10:07:00Z

src/lib/vm/objects/string_.rs

+    /// how many characters this string has`. The former condition is rather unlikely, so we accept
+    /// that if anyone ever manages to make a string of usize::MAX characters, we won't cache its
+    /// length correctly, but will recalculate it each time.
+    chars_len: Cell<usize>,


Why can't we use 0 as a marker. This way we don't need to recalculate the length if someone manages to hit the worst case. Or is the call to self.s.chars().count() on an empty string that much more expensive than self.chars_len.get()?

If we use 0 as a marker, it means that for 0 length strings, we still do a call of chars().count(). This seems a pretty common case to me. With usize::MAX, you'd need a string of 18014398509481984KiB or 17592186044416MiB or 17179869184GiB or 16777216TiB or 16384PiB or 16EiB to hit the worst case. That's a big string so I'm OK with being a bit slow in such cases since it's roughly 6-7 orders of magnitude bigger than any machine's RAM :)

Alright, you win. Though I do feel for whoever does hit that string someday as they'll have to wait a loooong time for their length function to return. ;-)

Out of curiosity though, how much more expensive is the call to self.s.chars().count() on an empty string, compared to the call to self.chars_len.get()?

Cell is a zero-cost wrapper that means "this value might have been mutated". Therefore self.chars_len.get() should translate, at worst, to a pure load from memory (but in some cases, with inlining, the value can probably be even left in a register).

ptersilie · 2020-08-07T10:11:12Z

One comment. Rest looks good.

ptersilie · 2020-08-10T07:57:19Z

bors r+

bors · 2020-08-10T08:01:01Z

Build succeeded:

buildbot/buildbot-build-script

ltratt assigned ptersilie Aug 6, 2020

Simplify.

918d735

ptersilie reviewed Aug 7, 2020

View reviewed changes

bors bot merged commit 0f58312 into softdevteam:master Aug 10, 2020

ltratt deleted the lazy_string_chars_len branch August 10, 2020 08:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Lazily store the number of Unicode characters a string contains. #176

Lazily store the number of Unicode characters a string contains. #176

Uh oh!

ltratt commented Aug 6, 2020

Uh oh!

ptersilie Aug 7, 2020

Uh oh!

ltratt Aug 7, 2020

Uh oh!

ptersilie Aug 7, 2020

Uh oh!

ltratt Aug 7, 2020

Uh oh!

ptersilie commented Aug 7, 2020

Uh oh!

ptersilie commented Aug 10, 2020

Uh oh!

bors bot commented Aug 10, 2020

Uh oh!

Uh oh!

Lazily store the number of Unicode characters a string contains. #176

Lazily store the number of Unicode characters a string contains. #176

Uh oh!

Conversation

ltratt commented Aug 6, 2020

Uh oh!

ptersilie Aug 7, 2020

Choose a reason for hiding this comment

Uh oh!

ltratt Aug 7, 2020

Choose a reason for hiding this comment

Uh oh!

ptersilie Aug 7, 2020

Choose a reason for hiding this comment

Uh oh!

ltratt Aug 7, 2020

Choose a reason for hiding this comment

Uh oh!

ptersilie commented Aug 7, 2020

Uh oh!

ptersilie commented Aug 10, 2020

Uh oh!

bors bot commented Aug 10, 2020

Uh oh!

Uh oh!