Conversation

SergeiPavlov
Contributor

It is well-known that the standard Guid.NewGuid() is slow, and even slower on Linux (dotnet/runtime#13628).

If GUIDs are not published outside of the process, it is enough to use a global 64-bit counter as part of the GUID generator.
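A minimal sketch of the idea — shown here in Python for illustration, with a hypothetical `fast_guid()` standing in for the proposed generator — combines a random per-process base with an atomically incremented 64-bit counter:

```python
import itertools
import os
import uuid

# Sketch of a counter-based GUID generator (hypothetical fast_guid,
# not the actual PR code): a random 8-byte per-process base plus a
# 64-bit counter. Uniqueness is only guaranteed within this process.
_base = os.urandom(8)          # random part, fixed for the process lifetime
_counter = itertools.count(1)  # monotonic counter (atomic under CPython's GIL)

def fast_guid() -> uuid.UUID:
    n = next(_counter) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return uuid.UUID(bytes=n.to_bytes(8, "big") + _base)
```

In C#, the counter increment would use Interlocked.Increment, as in the snippet later in this thread.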

@alistairjevans
Member

Thanks, interesting suggestion; can you please run our benchmarks and post before and after results here?


codecov bot commented Mar 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.54%. Comparing base (b843130) to head (fed8a76).
Report is 6 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1414      +/-   ##
===========================================
+ Coverage    78.48%   78.54%   +0.06%     
===========================================
  Files          200      201       +1     
  Lines         5703     5720      +17     
  Branches      1168     1168              
===========================================
+ Hits          4476     4493      +17     
  Misses         714      714              
  Partials       513      513              


@botinko
Contributor

botinko commented Mar 16, 2024

Benchmark:

| Method  | Mean      | Error     | StdDev    |
|-------- |----------:|----------:|----------:|
| Default | 34.354 ns | 0.2249 ns | 0.2104 ns |
| Fast    |  8.337 ns | 0.0624 ns | 0.0553 ns |

    [Benchmark]
    public Guid Default() => Guid.NewGuid();

    [Benchmark]
    public Guid Fast() => FastGuid.NewGuid();

Tested on Windows, .NET 8.
Results may differ on other OSes; the dotnet runtime issue mentioned above reports ~900 ns on Linux. On Linux, Guid.NewGuid() uses /dev/urandom as its source of randomness. /dev/urandom is a cryptographically secure pseudorandom number generator (CSPRNG), seeded with entropy from environmental noise collected from device drivers and other sources. It is known to have limited throughput, especially considering that in Kubernetes many containers share one physical machine, and possibly the same /dev/urandom stream.
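The throughput gap can be probed outside .NET as well; a rough micro-benchmark sketch in Python comparing the CSPRNG-backed uuid4() with a plain counter (numbers vary widely by machine, OS, and interpreter overhead — treat it as illustrative, not authoritative):

```python
import itertools
import timeit
import uuid

_base = bytes(8)
_counter = itertools.count()

def counter_guid() -> uuid.UUID:
    # Counter-based GUID: pure in-process arithmetic, no entropy pool.
    return uuid.UUID(bytes=next(_counter).to_bytes(8, "big") + _base)

n = 100_000
t_random = timeit.timeit(uuid.uuid4, number=n)     # backed by the OS CSPRNG
t_counter = timeit.timeit(counter_guid, number=n)  # no randomness at all
print(f"uuid4: {t_random:.3f}s, counter: {t_counter:.3f}s for {n:,} GUIDs")
```

On a virtualized host, where reads from the entropy source are slower, the gap between the two columns typically widens.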

@alistairjevans
Member

OK, can you run the comparison on the existing Autofac benchmarks as well — the ones in our "bench" folder (https://github.com/autofac/Autofac/tree/develop/bench/Autofac.Benchmarks)?

I want to understand the overall impact of the change on Autofac, based on how many GUIDs we generate.

@SergeiPavlov
Contributor Author

SergeiPavlov commented Mar 18, 2024

Which specific benchmarks depend on .NewGuid()?

I ran the entire set before and after this change and don't see a performance improvement.
It looks even worse, and I cannot explain why.

But the benchmarks mostly test the Resolve*() methods, not registrations, where a GUID may be generated.

Environment: x64, Linux
Example:
Before:

| Method             | repetitions | Mean       | Error    | StdDev   | Median     | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------------------- |------------:|-----------:|---------:|---------:|-----------:|------:|--------:|-------:|----------:|------------:|
| Baseline           | ?           | 728.9 ns   | 0.46 ns  | 0.43 ns  | 728.7 ns   | 1.00  | 0.00    | 0.0324 | 2.67 KB   | 1.00        |
| ResolveEnumerableT | 1           | 2,665.4 ns | 8.11 ns  | 7.58 ns  | 2,664.6 ns | ?     | ?       | 0.0648 | 5.51 KB   | ?           |
| ResolveT           | 1           | 1,415.2 ns | 4.17 ns  | 3.69 ns  | 1,416.0 ns | ?     | ?       | 0.0477 | 3.95 KB   | ?           |
| ResolveEnumerableT | 2           | 4,404.5 ns | 12.41 ns | 10.36 ns | 4,405.7 ns | ?     | ?       | 0.0992 | 8.34 KB   | ?           |
| ResolveT           | 2           | 2,085.5 ns | 2.84 ns  | 2.66 ns  | 2,085.3 ns | ?     | ?       | 0.0610 | 5.22 KB   | ?           |
| ResolveEnumerableT | 3           | 6,237.9 ns | 20.51 ns | 16.01 ns | 6,241.0 ns | ?     | ?       | 0.1297 | 11.18 KB  | ?           |
| ResolveT           | 3           | 2,700.5 ns | 52.95 ns | 70.69 ns | 2,659.7 ns | ?     | ?       | 0.0763 | 6.49 KB   | ?           |

After:

| Method             | repetitions | Mean       | Error    | StdDev   | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------------------- |------------:|-----------:|---------:|---------:|------:|--------:|-------:|----------:|------------:|
| Baseline           | ?           | 775.2 ns   | 7.70 ns  | 7.20 ns  | 1.00  | 0.00    | 0.0324 | 2.67 KB   | 1.00        |
| ResolveEnumerableT | 1           | 2,872.2 ns | 21.73 ns | 20.32 ns | ?     | ?       | 0.0648 | 5.51 KB   | ?           |
| ResolveT           | 1           | 1,502.1 ns | 13.33 ns | 11.82 ns | ?     | ?       | 0.0477 | 3.95 KB   | ?           |
| ResolveEnumerableT | 2           | 4,889.6 ns | 86.27 ns | 76.47 ns | ?     | ?       | 0.0992 | 8.34 KB   | ?           |
| ResolveT           | 2           | 2,163.8 ns | 20.72 ns | 18.37 ns | ?     | ?       | 0.0610 | 5.22 KB   | ?           |
| ResolveEnumerableT | 3           | 6,592.1 ns | 26.22 ns | 24.53 ns | ?     | ?       | 0.1297 | 11.18 KB  | ?           |
| ResolveT           | 3           | 2,781.0 ns | 10.43 ns | 8.71 ns  | ?     | ?       | 0.0763 | 6.49 KB   | ?           |

@alistairjevans
Member

Interesting that it's coming out worse... have you benchmarked dictionary lookups when Guids from FastGuid are used as a key? I'm wondering (without evidence) if the GetHashCode distribution of the GUIDs is not spread out in the same way as with Guid.NewGuid()?

@SergeiPavlov
Contributor Author

I have tried mixing the bits of the Guid:

    public static Guid NewGuid()
    {
        var v = Interlocked.Increment(ref variablePart);
        var h = (int)v * 521;               // one-to-one function to improve Guid's hash distribution
        return new(h, (short)(v >> 32), (short)(v >> 48), BasePart);
    }

But it does not show a visible improvement.
Guid.GetHashCode() XORs all 32-bit parts of the Guid, so the distribution should be uniform.
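That claim can be sanity-checked; a Python sketch that XOR-folds the four 32-bit words of a GUID (an approximation of Guid.GetHashCode — the exact .NET implementation varies by runtime version) applied to sequential counter GUIDs:

```python
import uuid

def xor_fold_hash(g: uuid.UUID) -> int:
    # XOR the four 32-bit words of the GUID, approximating the
    # XOR-based Guid.GetHashCode described above.
    b = g.bytes
    h = 0
    for i in range(0, 16, 4):
        h ^= int.from_bytes(b[i:i + 4], "little")
    return h

# Sequential counter GUIDs with a fixed base part: each distinct
# counter value yields a distinct folded hash, so consecutive values
# do not collide under the XOR fold.
base = bytes(8)
hashes = [xor_fold_hash(uuid.UUID(bytes=n.to_bytes(8, "big") + base))
          for n in range(1024)]
print(len(set(hashes)))  # → 1024, one distinct hash per counter value
```

Distinct hashes alone do not guarantee the same bucket spread as random GUIDs, though, which is why the dictionary-level benchmark above still matters.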

* FastGuid improved

25% faster implementation of FastGuid with MemoryMarshal

* fix usings

---------

Co-authored-by: Sergei Pavlov <spavlov@servicetitan.com>
@SergeiPavlov
Contributor Author

Thanks to @snaumenko-st for further improvement of this PR: 88272de

Benchmark results look no worse than before.

@tillig
Member

tillig commented Apr 5, 2024

Trying to figure out the state of this so I can get a 0.1.0 release out for #1415. I may just release that one without this, regardless.

However, looking at the chain here, it sounds like:

  • The first run of benchmarks showed that things actually got worse.
  • Subsequent runs (not shown) indicate things aren't worse, but also aren't any better.
  • The primary use case here is to improve registration time, which happens effectively once per application lifecycle.

It's not interesting to add an optimization that doesn't do anything, so I'm not sure where that leaves us.

If it's something where we can show some really marked improvement like "it used to take 10 seconds to start this app and now it takes half a second" or whatever, I'm all for it. But I'm not super interested in optimizing the "we told Autofac to scan and register every type in the entire app and all assemblies" use case - folks shouldn't be doing that, and the pain may be a forcing factor to tell them they're doing something wrong, especially if we're still supporting AnyConcreteTypeNotAlreadyRegisteredSource.

@SergeiPavlov
Contributor Author

  • The primary use case here is to improve registration time, which happens effectively once per application lifecycle.

In our case most registrations happen at app start, but a few additional ones happen on every web request.
That is why we are interested in this improvement.

@tillig
Member

tillig commented Apr 5, 2024

I think I still need to see some sort of improvement, which means this PR possibly needs to include adding some benchmarks that show the affected path and can show that there's improvement. It'd just be nice to see that it's not, like, "2ns improvement" or something basically meaningless. I saw the benchmark of raw GUID creation, but it'd be good to benchmark it in the context of the whole execution path.

@SergeiPavlov
Contributor Author

Added a RegistrationBenchmark:

Before:

| Method   | Mean     | Error    | StdDev   | Gen0   | Gen1   | Allocated |
|--------- |---------:|---------:|---------:|-------:|-------:|----------:|
| Register | 53.76 us | 0.656 us | 0.548 us | 0.3052 | 0.2441 | 26.77 KB  |

After:

| Method   | Mean     | Error    | StdDev   | Gen0   | Gen1   | Allocated |
|--------- |---------:|---------:|---------:|-------:|-------:|----------:|
| Register | 49.55 us | 0.648 us | 0.575 us | 0.3052 | 0.2441 | 27.82 KB  |

@tillig
Member

tillig commented Apr 5, 2024

Let me verify I'm reading this right:

After adding the fast GUID optimization, across six registrations you'll save about four microseconds (0.7us per registration) but allocate 1KB more memory to get that.

Does that sound right?

@SergeiPavlov
Contributor Author

Yes.

A second run of this benchmark confirms it.
Probably the new GUIDs have a less random distribution, and the Dictionary allocates more buckets for them.

@tillig
Member

tillig commented Apr 5, 2024

Hmmm. Maybe @alistairjevans feels differently, but... I... I have to admit, I kind of lean against this given the trade-off between the microseconds vs. allocations doesn't add up.

@alistairjevans
Member

Yeah, I have to agree, the extra code complexity doesn't seem worth the tiny time gains, and we also don't understand the dictionary bucket impacts enough to know that any gain would be consistent anyway.

@SergeiPavlov
Contributor Author

We see the following hotspot on NewGuid() during profiling:

[profiler screenshot]

So it looks not insignificant.

@alistairjevans
Member

Was that profiling session captured on Windows or Linux?

@SergeiPavlov
Contributor Author

Linux (Debian, x64), .NET 8.

@alistairjevans
Member

Ah; is that also where you've been running benchmarks?

@SergeiPavlov
Contributor Author

SergeiPavlov commented Apr 7, 2024

Ah; is that also where you've been running benchmarks?

Not exactly.
I was running the benchmarks on Ubuntu x64 on a real PC, but the app profiling was done in a Docker container in the Azure cloud.

Maybe the .NewGuid() implementation differs in that environment.
It is based on /dev/urandom, and the VM hypervisor may have a slow implementation of this virtual device.

@tillig
Member

tillig commented Apr 9, 2024

Is it possible to run the benchmarks in that environment? Again, it'd just be nice to verify that this is going to fix the issue before it gets accepted.

@SergeiPavlov
Contributor Author

SergeiPavlov commented Apr 9, 2024

Is it possible to run the benchmarks in that environment?

I did (the isolated Guid benchmark only). The results on Azure are the same as on bare Linux:
FastGuid is ~10x faster than Guid.NewGuid().

I agree, that is not enough for it to be merged into Autofac.

@tillig
Member

tillig commented Apr 10, 2024

Sounds good. We'll close this one out for now, but if you can show that it will be a marked improvement in some environments maybe we can revive it and target those environments.

@tillig tillig closed this Apr 10, 2024