Page MenuHomeWildfire Games

L3 Ryzen 3000 Bugfix
ClosedPublic

Authored by OptimusShepard on Oct 3 2019, 10:04 PM.

Details

Summary

A false entry in the AMD associativity table for L2 and L3 cache leads to an error in the L3 cache initialisation. The patch correct the entry of the table.

Test Plan

Play the game with an Ryzen 3000 CPU.

Diff Detail

Repository
rP 0 A.D. Public Repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

OptimusShepard created this revision.Oct 3 2019, 10:04 PM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/vs2015-differential/393/display/redirect

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/lib/sysdep/arch/x86_x64/cache.cpp
|   1| /*·Copyright·(C)·2018·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2019" year instead of "2018"
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/908/display/redirect

Stan awarded a token.Oct 3 2019, 10:12 PM
In D2353#98365, @elexis wrote:

This #4360?

Right, thx

elexis updated the Trac tickets for this revision.Oct 3 2019, 10:39 PM

Play the game with an Ryzen 3000 CPU.

I don't think anyone on the team has that CPU, so we cannot perform the testplan.

A false entry in the AMD associativity table for L2 and L3 cache leads to an error in the L3 cache initialisation. The patch correct the entry of the table.

How does one determine the magic numbers that should be there?
E.g. can we look it up on some specification of the manufacturer?

Will the numbers not bug for other people? (As no other CPU users seem to have complained about them)

svn blame says rP6600

Imarok added a subscriber: Imarok.EditedOct 4 2019, 12:14 AM

As I was "pushed" by @OptimusShepard to help him on that part, I'll answer:

In D2353#98371, @elexis wrote:

How does one determine the magic numbers that should be there?
E.g. can we look it up on some specification of the manufacturer?

In an earlier version of this table It said the zero values are reserved.
But @OptimusShepard reported he got 0x9 as value for the L3-Cache. And it's know, that his CPU has a 16-ways associative cache.
Looking at the spec of that CPU(PDF page 75) it seems 0x9 is still reserved.
So I'm not sure, what goes on here, but it seems 0x9 means 16-way associative in this case.

In D2353#98371, @elexis wrote:

Will the numbers not bug for other people? (As no other CPU users seem to have complained about them)

This change only affects people who got this 0x9, and with previous code they just crashed.
After this change it means in the worst case it will report a 16-way associative cache when someone just has a 8-way or a 32-way or whatever.
So I think the harm is quite limited.

We might add a comment to the code that in theory the values 0x3, 0x5, 0x7, 0x9 are reserved but that a practical experiment with a zen 2 CPU has shown that 0x9 stands for 16.

Edit: I got the CPU spec, from here following Wikipedia 17h is the right family. The document is "Processor Programming Reference (PPR) for AMD Family 17h Models 00h-0Fh Processors (PUB)".

In D2353#98371, @elexis wrote:

Will the numbers not bug for other people? (As no other CPU users seem to have complained about them)

I think we now had multiple Zen 2 users complaining about crashes.

It seems like you know more than the rest of us about this issue, you have many facts gathered.

Imarok added a comment.Oct 5 2019, 9:57 AM
In D2353#98445, @elexis wrote:

It seems like you know more than the rest of us about this issue, you have many facts gathered.

I just went down the rabbit hole a bit xD
(Plus some knowledge about how a CPU works internally)

Imarok accepted this revision.Oct 5 2019, 10:30 AM

The one in https://wildfiregames.com/forum/index.php?/topic/27095-cachecpp43-assertion-failed-cachevalidate/ also has this 0x9 and a 16-way associative cache.
So I assume it's correct.

(If nobody else does I will commit this in the upcoming days)

This revision is now accepted and ready to land.Oct 5 2019, 10:30 AM
In D2353#98374, @Imarok wrote:

I think we now had multiple Zen 2 users complaining about crashes.

Got a link?

The other question I have is whether we should actually call ENSURE in AddTLB.
It sounds better to fail more softly with a red error message?
Sooner or later the magic numbers will change again I assume.
Wouldn't object to a ticket for that (perhaps we already have a "dont use ENSURE" ticket?).

Looking at the spec of that CPU(PDF page 75) it seems 0x9 is still reserved.

It sounds useful to add the links in the code, there already are some
What does reserved mean?

After this change it means in the worst case it will report a 16-way associative cache when someone just has a 8-way or a 32-way or whatever.

Reporting the correct number of not reporting would be more correct if the number is not known.

Either the magic number is correct for all caches, or there should be a wildcard and the Validate function should be changed to account for that wildcard?
(I have no clue what Im talking about)

In D2353#98462, @elexis wrote:
In D2353#98374, @Imarok wrote:

I think we now had multiple Zen 2 users complaining about crashes.

Got a link?

Just see the ticket and the link in the post above.

The other question I have is whether we should actually call ENSURE in AddTLB.
It sounds better to fail more softly with a red error message?
Wouldn't object to a ticket for that (perhaps we already have a "dont use ENSURE" ticket?).

I think we should in general not use ENSURE in hardware detection but red error warnings instead.
Afaik there is nothing that critical that would justify an ENSURE

Sooner or later the magic numbers will change again I assume.

I don't think so.

Looking at the spec of that CPU(PDF page 75) it seems 0x9 is still reserved.

It sounds useful to add the links in the code, there already are some

Good idea

What does reserved mean?

That the number has no current meaning but is reservered for assigning some meaning in the future.

After this change it means in the worst case it will report a 16-way associative cache when someone just has a 8-way or a 32-way or whatever.

Reporting the correct number of not reporting would be more correct if the number is not known.

Either the magic number is correct for all caches, or there should be a wildcard and the Validate function should be changed to account for that wildcard?
(I have no clue what Im talking about)

Seems like ;P
The issue is: we currently don't know what this 0x9 means. But everyone who played 0ad and got this 0x9 had a CPU with a L3 16-way associative cache.
And to be honest: I don't even know why we need this information at all...

Imarok requested changes to this revision.Oct 5 2019, 11:21 AM

Please add a comment with some explanation and the link to the doc.

This revision now requires changes to proceed.Oct 5 2019, 11:21 AM
In D2353#98463, @Imarok wrote:
In D2353#98462, @elexis wrote:
In D2353#98374, @Imarok wrote:

I think we now had multiple Zen 2 users complaining about crashes.

Got a link?

Just see the ticket and the link in the post above.

Here is another link. https://wildfiregames.com/forum/index.php?/topic/15796-known-problems-please-read-before-posting/&do=findComment&comment=384917
So we allready have Ryzen 3600, 3600X, 3700X and 3800X. So we only miss a crash report of an 3900X user to complete the Ryzen Zen2 series.

Stan added a subscriber: Stan.Oct 10 2019, 10:02 AM

@OptimusShepard will you update the patch ? :)

In D2353#98805, @Stan wrote:

@OptimusShepard will you update the patch ? :)

Yes i will. It seems to be that there is another bug wich I will try to debug bevor updating the patch.

Stan added a comment.Oct 10 2019, 10:31 AM

Thanks a lot for working on this :)

Great Work!

I conform that the modification on line 92 dows the trick for my ryzen 3600. Now working. The packager for Archlinux has already included this patch in the a23.1-6 version of 0ad

Great Work!

I conform that the modification on line 92 dows the trick for my ryzen 3600. Now working. The packager for Archlinux has already included this patch in the a23.1-6 version of 0ad

Please report back if you have any bugs with the patch.
(@OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

In D2353#99092, @Imarok wrote:

(OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

Reproducibly? (For example by replaying the match or starting a new match with the same settings)

It sounds unexpected that a magic number comparison would trigger an out-of-memory during a match later.
If the same happens without the patch (commenting out the ENSURE), then it's not the patch.
Also then it would be something in the simulation, in which case a replay and some profiling on that to figure out what happens might be useful.

In D2353#99093, @elexis wrote:
In D2353#99092, @Imarok wrote:

(OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

Reproducibly? (For example by replaying the match or starting a new match with the same settings)

No idea. Didn't got any more information.

It sounds unexpected that a magic number comparison would trigger an out-of-memory during a match later.
If the same happens without the patch (commenting out the ENSURE), then it's not the patch.
Also then it would be something in the simulation, in which case a replay and some profiling on that to figure out what happens might be useful.

Yeah, I also don't think it has anything to do with the patch, but I can't rule it out 100%.

I currently have no time to write the patch, sorry. If someone else writes this patch I would test it.
Here are the required information which Imarok got from the AMD forum.

https://community.amd.com/thread/244207

Stan added a comment.Nov 30 2019, 1:18 PM

I currently have no time to write the patch, sorry. If someone else writes this patch I would test it.
Here are the required information which Imarok got from the AMD forum.

https://community.amd.com/thread/244207

Can you test if the OOM is reproductible though?

Stan added a comment.Dec 2 2019, 7:46 AM

This person also runs out of memory #5645 without the fix, maybe it's the same reason.

Imarok added a comment.Dec 2 2019, 9:16 PM
In D2353#102347, @Stan wrote:

This person also runs out of memory #5645 without the fix, maybe it's the same reason.

I don't think that has anything to do with this patch.

Stan added a comment.Dec 2 2019, 9:25 PM

Yeah which is why it should be committed if that's the right fix ;)

Can you test if the OOM is reproductible though?

I will have a look on it.

In D2353#102376, @Stan wrote:

Yeah which is why it should be committed if that's the right fix ;)

I would call this a dirty fix. As I read in amds forum thread we need another if branch to get the cpu id informations. CPUID isn't something I could do quickly. Sadly I currently have not the time to get more familar with that. I think Imarok could tell you more about that.

This code has long been planned to be removed (precisely because I expected things like this to happen), if things are difficult to fix a good alternative is just to stop calling this code, but it's all a bit messy to figure out where this code is called from. See also #4451.

In D2353#99092, @Imarok wrote:

Great Work!

I conform that the modification on line 92 dows the trick for my ryzen 3600. Now working. The packager for Archlinux has already included this patch in the a23.1-6 version of 0ad

Please report back if you have any bugs with the patch.
(@OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

Until now no issues whatsoever.

I tested the game with, and without the patch for several times. With the patch the game crashes with an OOM after 30-45min. The OOM report isn't every time the same. Without the patch the game crashes only at the start. I have tested it with following settings. Mapsize big, very huge start ressources, 4 ai vs 4 ai, anatolien plateau, speed 20x.
In my opinion someone should take the current documentation of AMD and make a clean implementation.

I don't think the OOM can be related to this, as this code (or the result of using it) is never used again after initialization, as far as I know.

Added comment with link.

Vulcan added a comment.Dec 8 2019, 2:50 PM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/vs2015-differential/718/display/redirect

Vulcan added a comment.Dec 8 2019, 2:53 PM

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/lib/sysdep/arch/x86_x64/cache.cpp
|   1| /*·Copyright·(C)·2018·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2019" year instead of "2018"
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/1234/display/redirect

I've made a mistake by testing procedure. So I didn't seen that the OOM only occurs when I used a version which I compiled myself. So the bug is related to my Visual Studio settings or something like that. Sorry for that.

I had also a look to the technical specs of the other Ryzen 3000 and Threadripper 3000 cpus. They all use a L3 associativity of 16. So there is currently no conflict. But I have heard that AMD will rework there L3 cache till next generation (summer 2020), so maybe than you have to do the full implementation.

Stan added a comment.Dec 8 2019, 7:03 PM

If you use windows since the build is not 64 bits you have to pass --large-address-aware to premake. Else it'a limited to 2GB the flag allows it to use up to 4GB. If you play Corinthian Ismuth (4) with too many AIs it will crash after a while anyway...

Imarok accepted this revision.Dec 8 2019, 8:29 PM

So I think we should commit this workaround.
Pros:

  • It fixes the issue.
  • It gets the correct result at least until zen3 comes out.

Cons:

  • It is not the correct way to get the cache associativity from this CPUs.

But as this code will (at least in parts) removed anyway sooner or later, I think we are ok with this solution for now.

If nobody obejcts, I'll commit it in the coming days.

This revision is now accepted and ready to land.Dec 8 2019, 8:29 PM
This revision was automatically updated to reflect the committed changes.