Lazily store the number of Unicode characters a string contains. #176
Conversation
Calculating the number of Unicode characters in a string is an O(n) operation. Previously, every time `String_::length` was called we iterated over the underlying string to discover how many Unicode characters it contains. This commit adds a field to SOM strings that stores the number of Unicode characters. Since not all strings will be queried for this property, we lazily cache it the first time that `length` is called on a `String_`. We use `usize::MAX` as our marker as that means the marker is "safe" in all cases. On the JSON benchmark, with GC off, this speeds things up by about 80%.
```rust
/// how many characters this string has. The former condition is rather unlikely, so we accept
/// that if anyone ever manages to make a string of usize::MAX characters, we won't cache its
/// length correctly, but will recalculate it each time.
chars_len: Cell<usize>,
```
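The caching scheme the commit message describes can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: `CachedStr` and `UNCACHED` are hypothetical names standing in for the SOM `String_` struct and its sentinel.

```rust
use std::cell::Cell;

// Sentinel meaning "character count not yet computed". A string of
// usize::MAX characters would be recounted on every call, which the
// PR accepts as a practical impossibility.
const UNCACHED: usize = usize::MAX;

struct CachedStr {
    s: String,
    chars_len: Cell<usize>,
}

impl CachedStr {
    fn new(s: String) -> Self {
        CachedStr { s, chars_len: Cell::new(UNCACHED) }
    }

    /// O(n) on the first call, O(1) afterwards.
    fn length(&self) -> usize {
        let cached = self.chars_len.get();
        if cached != UNCACHED {
            return cached;
        }
        let n = self.s.chars().count();
        self.chars_len.set(n);
        n
    }
}

fn main() {
    let s = CachedStr::new("héllo".to_string());
    assert_eq!(s.length(), 5); // counted once (5 chars, 6 bytes)...
    assert_eq!(s.length(), 5); // ...then served from the cache
    let empty = CachedStr::new(String::new());
    assert_eq!(empty.length(), 0); // 0 is a valid cached value, not a marker
}
```

`Cell` rather than a plain field is what allows the cache to be written through a shared `&self` reference inside `length`.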
Why can't we use 0 as a marker? This way we don't need to recalculate the length if someone manages to hit the worst case. Or is the call to `self.s.chars().count()` on an empty string that much more expensive than `self.chars_len.get()`?
If we use 0 as a marker, it means that for 0-length strings, we still do a call of `chars().count()`. This seems a pretty common case to me. With `usize::MAX`, you'd need a string of 18014398509481984KiB or 17592186044416MiB or 17179869184GiB or 16777216TiB or 16384PiB or 16EiB to hit the worst case. That's a big string, so I'm OK with being a bit slow in such cases, since it's roughly 6-7 orders of magnitude bigger than any machine's RAM :)
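The unit conversions above can be sanity-checked with a little shift arithmetic: on a 64-bit target, `usize::MAX` characters implies at least 2^64 bytes (one byte per character at best), and 2^64 bytes is exactly 16 EiB.

```rust
fn main() {
    // 2^64 bytes, computed in u128 to avoid overflow.
    let bytes: u128 = 1u128 << 64;
    assert_eq!(bytes >> 60, 16);          // 16 EiB
    assert_eq!(bytes >> 50, 16384);       // 16384 PiB
    assert_eq!(bytes >> 40, 16_777_216);  // 16777216 TiB
    assert_eq!(bytes >> 30, 17_179_869_184); // 17179869184 GiB
}
```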
Alright, you win. Though I do feel for whoever does hit that string someday, as they'll have to wait a loooong time for their `length` function to return. ;-)

Out of curiosity though, how much more expensive is the call to `self.s.chars().count()` on an empty string, compared to the call to `self.chars_len.get()`?
`Cell` is a zero-cost wrapper that means "this value might have been mutated". Therefore `self.chars_len.get()` should translate, at worst, to a pure load from memory (but in some cases, with inlining, the value can probably even be left in a register).
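The zero-cost claim is easy to demonstrate (this snippet is ours, not from the PR): `Cell<usize>` occupies exactly the space of a bare `usize`, and `get`/`set` are plain loads and stores that work through a shared reference.

```rust
use std::cell::Cell;
use std::mem::size_of;

fn main() {
    // Cell<usize> adds no space overhead over a bare usize.
    assert_eq!(size_of::<Cell<usize>>(), size_of::<usize>());

    // get()/set() mutate through a shared reference, which is what lets
    // a method taking &self update the cached length.
    let len = Cell::new(usize::MAX); // "uncached" sentinel
    assert_eq!(len.get(), usize::MAX);
    len.set(42);
    assert_eq!(len.get(), 42);
}
```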
One comment. Rest looks good.

bors r+

Build succeeded:
Needs softdevteam/libgc#19 to be merged first.