
Cuda performance #2 #266

@adam-ce

Description


Hi,
I experimented a bit more with my tests of glm's CUDA performance. My testing environment is again Linux, Cuda 6.5 and a Geforce GTX 550Ti. GLM's released version 0.9.5.4 was used as the current trunk doesn't compile on CUDA.

The result in short

GLM is still 12% behind CUDA's native types and helper_math.h in certain cases.

About the tests made

In the last bug report (issue #257) I used only test1 (watch out, the link contains the revision number). It was quite synthetic and therefore not very meaningful. In my own code, GLM still turned out to be slower.

So I made a sort of minimal example just for testing GLM, based on my own code: test 2a. This is the example showing that GLM is 12% behind CUDA.

One interesting thing is that when the "early exit" is removed from the for loop in glm/cudaKernel, the difference between GLM and native CUDA is much smaller (and the overall performance is better): test 2b. Note that the only difference compared to test 2a is in lines 265ff.

Test results

I have already found some ways to improve GLM's performance (see below); here are the results after the best/fastest changes:

#test 1
CUDA kernel launch with 19532 blocks of 256 threads
time for cuda glm (matrix):         546 milliseconds
time for cuda helper math (matrix): 660 milliseconds
time for cuda glm (dot):            471 milliseconds
time for cuda helper math (dot):    491 milliseconds
time for cuda glm (cross):          246 milliseconds
time for cuda helper math (cross):  246 milliseconds
#test 2a
time for glm:   468 milliseconds
time for cuda:  417 milliseconds
#test 2b
time for glm:   373 milliseconds
time for cuda:  370 milliseconds

I made a file containing all test results.

Code changes

  • One important change is aligning vec4 to 16 bytes; this is the same as in issue Bad matrix-vector multiplication performance in Cuda #257.
  • I also aligned vec2 to 8 bytes, but didn't test it.
  • Aligning vec3 to 16 bytes gives improvements in the synthetic test, but in test 2 it gives worse performance. The reason is probably different loads and register/memory usage.
  • Removing all const references from the base classes' (mat4, vec3, vec4, etc.) methods gave an improvement in some cases (surprisingly, test 2a didn't change, but test 2b did; this might be an error in my testing method, though I checked it several times and couldn't find anything). It seems like passing by value is faster on the GPU, but in test 2a this is shadowed by some other bottleneck. The usage of const references is controlled by the "#define GLM_REFERENCE const &" in setup.hpp in my version of GLM.
  • I experimented with operator*(mat4, vec4), which changed the performance only when using const references.

Conclusion

There is still some issue with GLM's CUDA performance. I would be happy to continue testing if you can give me some ideas about what to test. I just hope the cause is not GLM's elaborate use of templates.

Something that could lead us to the solution is the difference between test 2a and 2b. It could mean that loading something (instructions?) from memory is more expensive in GLM, and that the early exit causes cache misses or something like that. But I'm just speculating.
