Skip to content

Conversation

ltratt
Copy link
Member

@ltratt ltratt commented Aug 6, 2020

Calculating the number of Unicode characters in a string is an O(n) operation. Previously everytime String_::length was called we iterated over the underlying string to discover how many Unicode characters it contains.

This commit adds a field to SOM strings that stores its number of Unicode characters. Since not all strings will be queried for this property, we lazily cache this property the first time that length is called on a String_. We use usize::max as our marker as that means the marker is "safe" in all cases.

On the JSON benchmark, with GC off, this speeds things up by about 80%.

Needs softdevteam/libgc#19 to be merged first.

Calculating the number of Unicode characters in a string is an O(n) operation.
Previously everytime `String_::length` was called we iterated over the
underlying string to discover how many Unicode characters it contains.

This commit adds a field to SOM strings that stores its number of Unicode
characters. Since not all strings will be queried for this property, we lazily
cache this property the first time that `length` is called on a `String_`. We
use `usize::max` as our marker as that means the marker is "safe" in all cases.

On the JSON benchmark, with GC off, this speeds things up by about 80%.
/// how many characters this string has`. The former condition is rather unlikely, so we accept
/// that if anyone ever manages to make a string of usize::MAX characters, we won't cache its
/// length correctly, but will recalculate it each time.
chars_len: Cell<usize>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use 0 as a marker. This way we don't need to recalculate the length if someone manages to hit the worst case. Or is the call to self.s.chars().count() on an empty string that much more expensive than self.chars_len.get()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use 0 as a marker, it means that for 0 length strings, we still do a call of chars().count(). This seems a pretty common case to me. With usize::MAX, you'd need a string of 18014398509481984KiB or 17592186044416MiB or 17179869184GiB or 16777216TiB or 16384PiB or 16EiB to hit the worst case. That's a big string so I'm OK with being a bit slow in such cases since it's roughly 6-7 orders of magnitude bigger than any machine's RAM :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, you win. Though I do feel for whoever does hit that string someday as they'll have to wait a loooong time for their length function to return. ;-)

Out of curiosity though, how much more expensive is the call to self.s.chars().count() on an empty string, compared to the call to self.chars_len.get()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cell is a zero-cost wrapper that means "this value might have been mutated". Therefore self.chars_len.get() should translate, at worst, to a pure load from memory (but in some cases, with inlining, the value can probably be even left in a register).

@ptersilie
Copy link
Member

One comment. Rest looks good.

@ptersilie
Copy link
Member

bors r+

@bors
Copy link
Contributor

bors bot commented Aug 10, 2020

Build succeeded:

@bors bors bot merged commit 0f58312 into softdevteam:master Aug 10, 2020
@ltratt ltratt deleted the lazy_string_chars_len branch August 10, 2020 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants