
Matrix3D SSE
Needs Review | Public

Authored by OptimusShepard on Jul 3 2020, 1:02 AM.

Details

Summary

Use SSE to improve performance. The pictures and the profiling in the comments show the improvement on a Ryzen 3700X.

Test Plan

Check that everything works correctly. Profile and confirm that the SSE version is an improvement; the gain depends on the CPU. Additionally, test with the SSE build flag both enabled and disabled.

Event Timeline

OptimusShepard created this revision.Jul 3 2020, 1:02 AM
OptimusShepard planned changes to this revision.Jul 3 2020, 1:04 AM

The hardware request doesn't work as it should.

Vulcan added a comment.Jul 3 2020, 1:15 AM

Build failure - The Moirai have given mortals hearts that can endure.

Link to build: https://jenkins.wildfiregames.com/job/vs2015-differential/2041/display/redirect

Vulcan added a comment.Jul 3 2020, 1:42 AM

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/lib/sysdep/compiler.h
|   1| /*·Copyright·(c)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/lib/sysdep/arch/x86_x64/x86_x64.h
|   1| /*·Copyright·(C)·2010·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2010"

source/lib/sysdep/arch/x86_x64/x86_x64.h
|  40| namespace·x86_x64·{
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'namespacex86_x64{' is invalid C code. Use --std or --language to configure the language.

source/maths/Matrix3D.h
|   1| /*·Copyright·(C)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/maths/Matrix3D.h
|  39| class·CMatrix3D
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classCMatrix3D{' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2574/display/redirect

asterix added a subscriber: asterix.

Welcome improvement.

Would be nice to:

  • Understand why compilers can't vectorise these themselves (is it just a build-flag issue?)
  • Get some actual profiling in, ideally some Profiler 2 graphs of MP replays.
Stan added a comment.Jul 9 2020, 11:14 AM

From my discussions with Optimus Shepard, the fact that it uses FMA3 is problematic, because it literally prevents anyone with a CPU older than the Intel iX-4000 era or the FX-6000 from playing the game. We don't gather stats about those, so we don't actually know how many of our users have such CPUs.
Of the 7 percent improvement reported by Optimus Shepard, 30% is due to the use of FMA3.

@wraitii on Linux we do not compile with march=native, so it's likely that a lot of the code doesn't use CPU-specific instructions, because otherwise the package would not be compatible with every CPU. This is of course not an issue on Mac, since most people have the same hardware (although, judging from the GMP flag issue that causes TLS to crash on some macOS versions, I guess that's not exactly true).

We also don't pass the flags that allow such compilation on Windows; see rP16912.
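Which instruction sets a given build is allowed to use shows up in the compiler's predefined macros, which is one way to check the build-flag question raised above. The following is only a small sketch: the function name `EnabledSimdSets` is made up for illustration and is not part of the patch or of the 0 A.D. source. It reports what the compiler was told it may emit, not what the CPU running the binary supports.

```cpp
#include <string>

// Hypothetical helper (not from the patch): lists the SIMD instruction sets
// that the compiler flags enabled for this translation unit.
std::string EnabledSimdSets()
{
    std::string sets;
    // GCC/Clang define __SSE__ etc.; MSVC exposes _M_X64 and _M_IX86_FP instead.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
    sets += "SSE ";
#endif
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    sets += "SSE2 ";
#endif
#if defined(__AVX__)
    sets += "AVX ";
#endif
#if defined(__AVX2__)
    sets += "AVX2 ";
#endif
#if defined(__FMA__)
    sets += "FMA ";
#endif
    if (sets.empty())
        return "none";
    return sets;
}
```

Building the same file with and without flags such as `-mavx2` or `-march=native` and printing the result should make the difference between a portable package build and a machine-specific build visible.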

I tested the build flags; SSE seems to be the only flag with a positive impact. AVX2 makes everything worse. I also did some profiling.
Current version:

Current Version with SSE flag:

SSE patch:

SSE patch with SSE flag:

As you can see, the SSE patch lowers the spikes. There are also fewer and smaller "black blocks" on the right side, which also means better frame times.
The impact of the SSE flag is not really noticeable, but to me it looks a bit better than without.


I have rewritten the patch so that it uses only SSE. That is the version I used for the profiling. I will upload it later today.

OptimusShepard retitled this revision from Matrix3D SSE, FMA, AVX to Matrix3D SSE.
OptimusShepard edited the summary of this revision.
OptimusShepard edited the test plan for this revision.

Removed the AVX and FMA versions, as we aren't able to switch instructions at runtime. Furthermore, the AVX instructions aren't faster than SSE here.

Build failure - The Moirai have given mortals hearts that can endure.

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2614/display/redirect

Build failure - The Moirai have given mortals hearts that can endure.

Link to build: https://jenkins.wildfiregames.com/job/macos-differential/982/display/redirect

Including the SSE header, because Vulcan fails for the non-Windows tests.

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/maths/Matrix3D.h
|   1| /*·Copyright·(C)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/maths/Matrix3D.h
|  38| class·CMatrix3D
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classCMatrix3D{' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2615/display/redirect

Updated the year of the license header.

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/maths/Matrix3D.h
|  38| class·CMatrix3D
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classCMatrix3D{' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2616/display/redirect

Stan added a comment.Jul 10 2020, 11:11 AM

I like your approach better but interestingly it's done slightly differently in https://github.com/0ad/0ad/blob/d3e68a99e7f715ad7921a81e959f8ac51dfa1248/source/graphics/Color.cpp

Can you figure out what's causing the spikes?

source/maths/Matrix3D.h
277

0.f here and above :)

In D2857#123243, @Stan wrote:

I like your approach better but interestingly it's done slightly differently in https://github.com/0ad/0ad/blob/d3e68a99e7f715ad7921a81e959f8ac51dfa1248/source/graphics/Color.cpp

Oh, I think they are switching the instructions at runtime, aren't they? A bit ugly, I think, but then we could use FMA :)

Can you figure out what's causing the spikes?

I will try, but at first glance the profiler doesn't show me anything useful.

In D2857#123243, @Stan wrote:

Can you figure out what's causing the spikes?

I haven't found the cause yet, but I noticed that the profiler itself amplifies these spikes considerably. Without it, the frame drops were much smaller.

Stan added a comment.Jul 18 2020, 9:25 AM

Is it blend or multiplication that gives the biggest boost? And where are those called?

I'm not convinced TBH. If this is hardcoded at compile time, either we drop support or it's pretty much useless for releases. SIMD-capable compilers seem able to vectorise these functions, so custom versions don't seem particularly useful.
If there were a runtime switch that actually increased performance, it might be more interesting.
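For reference, a runtime switch of the kind mentioned here is usually built from a function pointer that is chosen once from a CPU-feature query. The sketch below is only an illustration under stated assumptions: the names `AddBlendScalar`, `AddBlendSSE` and `SelectAddBlend` are hypothetical and not taken from the patch, the operation (`dst[i] += src[i] * f` over 16 floats) is just a stand-in, and the feature query uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so other compilers fall back to the scalar path.

```cpp
#include <cassert>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <xmmintrin.h>
#define SKETCH_X86_GNU 1
#else
#define SKETCH_X86_GNU 0
#endif

// Hypothetical signature: blend 16 floats (one 4x4 matrix) into dst.
using AddBlendFn = void (*)(float* dst, const float* src, float f);

// Portable fallback.
void AddBlendScalar(float* dst, const float* src, float f)
{
    assert(dst != nullptr && src != nullptr);
    for (int i = 0; i < 16; ++i)
        dst[i] += src[i] * f;
}

#if SKETCH_X86_GNU
// The target attribute lets this one function use SSE even when the rest of
// the file is compiled without -msse (GCC/Clang only).
__attribute__((target("sse")))
void AddBlendSSE(float* dst, const float* src, float f)
{
    assert(dst != nullptr && src != nullptr);
    const __m128 factor = _mm_set1_ps(f);
    for (int i = 0; i < 16; i += 4)
    {
        const __m128 d = _mm_loadu_ps(dst + i);
        const __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, _mm_mul_ps(s, factor)));
    }
}
#endif

// Picks an implementation based on what the running CPU reports.
AddBlendFn SelectAddBlend()
{
#if SKETCH_X86_GNU
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        return AddBlendSSE;
#endif
    return AddBlendScalar;
}
```

A caller would store the result once, for example `static const AddBlendFn addBlend = SelectAddBlend();`, and call through the pointer afterwards. The indirect call prevents inlining, so whether this pays off for something as small as a matrix operation would have to be profiled.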

source/maths/Matrix3D.h
151

For what it's worth, here's what Clang generates for me on SSE3 (NB -> assembly):

rdx -> this
rsi -> argument Matrix
rdi -> return Matrix

movss      xmm3, dword [rdx]
movss      xmm4, dword [rdx+4]
movss      xmm7, dword [rdx+8]
movss      xmm0, dword [rdx+0xc]
movss      xmm5, dword [rdx+0x10]
movss      xmm6, dword [rdx+0x14]
movups     xmm2, xmmword [rsi]
movups     xmm1, xmmword [rsi+0x10]
shufps     xmm3, xmm3, 0x0
mulps      xmm3, xmm2
shufps     xmm4, xmm4, 0x0
mulps      xmm4, xmm1
addps      xmm4, xmm3
movups     xmm3, xmmword [rsi+0x20]
shufps     xmm7, xmm7, 0x0
mulps      xmm7, xmm3
addps      xmm7, xmm4
movups     xmm4, xmmword [rsi+0x30]
shufps     xmm0, xmm0, 0x0
mulps      xmm0, xmm4
addps      xmm0, xmm7
movss      xmm7, dword [rdx+0x18]
shufps     xmm5, xmm5, 0x0
mulps      xmm5, xmm2
shufps     xmm6, xmm6, 0x0
mulps      xmm6, xmm1
addps      xmm6, xmm5
movss      xmm5, dword [rdx+0x1c]
shufps     xmm7, xmm7, 0x0
mulps      xmm7, xmm3
addps      xmm7, xmm6
movss      xmm6, dword [rdx+0x20]
shufps     xmm5, xmm5, 0x0
mulps      xmm5, xmm4
addps      xmm5, xmm7
movss      xmm7, dword [rdx+0x24]
shufps     xmm6, xmm6, 0x0
mulps      xmm6, xmm2
shufps     xmm7, xmm7, 0x0
mulps      xmm7, xmm1
addps      xmm7, xmm6
movss      xmm6, dword [rdx+0x28]
shufps     xmm6, xmm6, 0x0
mulps      xmm6, xmm3
addps      xmm6, xmm7
movss      xmm7, dword [rdx+0x2c]
shufps     xmm7, xmm7, 0x0
mulps      xmm7, xmm4
addps      xmm7, xmm6
movss      xmm6, dword [rdx+0x30]
shufps     xmm6, xmm6, 0x0
mulps      xmm6, xmm2
movss      xmm2, dword [rdx+0x34]
shufps     xmm2, xmm2, 0x0
mulps      xmm2, xmm1
addps      xmm2, xmm6
movss      xmm1, dword [rdx+0x38]
shufps     xmm1, xmm1, 0x0
mulps      xmm1, xmm3
addps      xmm1, xmm2
movss      xmm2, dword [rdx+0x3c]
shufps     xmm2, xmm2, 0x0
mulps      xmm2, xmm4
addps      xmm2, xmm1
movups     xmmword [rdi], xmm0
movups     xmmword [rdi+0x10], xmm5
movups     xmmword [rdi+0x20], xmm7
movups     xmmword [rdi+0x30], xmm2
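
The listing follows one pattern per output row: broadcast a single float of `this` (movss + shufps), multiply it with a 4-float row of the argument (mulps) and accumulate (addps). Written by hand with SSE intrinsics, the same pattern might look like the sketch below. It assumes flat 16-float arrays in the same memory order as the listing; the names `MulScalar`, `MulSSE` and `Mul` are hypothetical and this is not the code from the patch.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Reference version: out[4*i + j] = sum over k of a[4*i + k] * b[4*k + j].
// out must not alias a or b.
void MulScalar(const float* a, const float* b, float* out)
{
    assert(out != a && out != b);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
        {
            float sum = 0.f;
            for (int k = 0; k < 4; ++k)
                sum += a[4 * i + k] * b[4 * k + j];
            out[4 * i + j] = sum;
        }
}

#if SKETCH_HAVE_SSE
// Same result, four output floats at a time.
void MulSSE(const float* a, const float* b, float* out)
{
    assert(out != a && out != b);
    // Unaligned loads, matching the movups instructions in the listing.
    const __m128 b0 = _mm_loadu_ps(b);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i)
    {
        // _mm_set1_ps broadcasts one scalar to all four lanes.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[4 * i]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
#endif

// Compile-time selection: uses SSE only if the build flags allow it.
void Mul(const float* a, const float* b, float* out)
{
#if SKETCH_HAVE_SSE
    MulSSE(a, b, out);
#else
    MulScalar(a, b, out);
#endif
}
```

Comparing the compiler output for `MulScalar` and `MulSSE` at the same optimisation level is one way to see whether the hand-written version gains anything over auto-vectorisation.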