Page MenuHomeWildfire Games

L3 Ryzen 3000 Bugfix
Needs RevisionPublic

Authored by OptimusShepard on Oct 3 2019, 10:04 PM.

Details

Reviewers
Imarok
Trac Tickets
#4360
Summary

A false entry in the AMD associativity table for L2 and L3 cache leads to an error in the L3 cache initialisation. The patch correct the entry of the table.

Test Plan

Play the game with an Ryzen 3000 CPU.

Diff Detail

Event Timeline

OptimusShepard created this revision.Oct 3 2019, 10:04 PM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/vs2015-differential/393/display/redirect

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/lib/sysdep/arch/x86_x64/cache.cpp
|   1| /*·Copyright·(C)·2018·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2019" year instead of "2018"
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/908/display/redirect

Stan awarded a token.Oct 3 2019, 10:12 PM
In D2353#98365, @elexis wrote:

This #4360?

Right, thx

elexis updated the Trac tickets for this revision.Oct 3 2019, 10:39 PM

Play the game with an Ryzen 3000 CPU.

I don't think anyone on the team has that CPU, so we cannot perform the testplan.

A false entry in the AMD associativity table for L2 and L3 cache leads to an error in the L3 cache initialisation. The patch correct the entry of the table.

How does one determine the magic numbers that should be there?
E.g. can we look it up on some specification of the manufacturer?

Will the numbers not bug for other people? (As no other CPU users seem to have complained about them)

svn blame says rP6600

Imarok added a subscriber: Imarok.EditedOct 4 2019, 12:14 AM

As I was "pushed" by @OptimusShepard to help him on that part, I'll answer:

In D2353#98371, @elexis wrote:

How does one determine the magic numbers that should be there?
E.g. can we look it up on some specification of the manufacturer?

In an earlier version of this table It said the zero values are reserved.
But @OptimusShepard reported he got 0x9 as value for the L3-Cache. And it's know, that his CPU has a 16-ways associative cache.
Looking at the spec of that CPU(PDF page 75) it seems 0x9 is still reserved.
So I'm not sure, what goes on here, but it seems 0x9 means 16-way associative in this case.

In D2353#98371, @elexis wrote:

Will the numbers not bug for other people? (As no other CPU users seem to have complained about them)

This change only affects people who got this 0x9, and with previous code they just crashed.
After this change it means in the worst case it will report a 16-way associative cache when someone just has a 8-way or a 32-way or whatever.
So I think the harm is quite limited.

We might add a comment to the code that in theory the values 0x3, 0x5, 0x7, 0x9 are reserved but that a practical experiment with a zen 2 CPU has shown that 0x9 stands for 16.

Edit: I got the CPU spec, from here following Wikipedia 17h is the right family. The document is "Processor Programming Reference (PPR) for AMD Family 17h Models 00h-0Fh Processors (PUB)".

In D2353#98371, @elexis wrote:

Will the numbers not bug for other people? (As no other CPU users seem to have complained about them)

I think we now had multiple Zen 2 users complaining about crashes.

It seems like you know more than the rest of us about this issue, you have many facts gathered.

Imarok added a comment.Oct 5 2019, 9:57 AM
In D2353#98445, @elexis wrote:

It seems like you know more than the rest of us about this issue, you have many facts gathered.

I just went down the rabbit hole a bit xD
(Plus some knowledge about how a CPU works internally)

Imarok accepted this revision.Oct 5 2019, 10:30 AM

The one in https://wildfiregames.com/forum/index.php?/topic/27095-cachecpp43-assertion-failed-cachevalidate/ also has this 0x9 and a 16-way associative cache.
So I assume it's correct.

(If nobody else does I will commit this in the upcoming days)

This revision is now accepted and ready to land.Oct 5 2019, 10:30 AM
In D2353#98374, @Imarok wrote:

I think we now had multiple Zen 2 users complaining about crashes.

Got a link?

The other question I have is whether we should actually call ENSURE in AddTLB.
It sounds better to fail more softly with a red error message?
Sooner or later the magic numbers will change again I assume.
Wouldn't object to a ticket for that (perhaps we already have a "dont use ENSURE" ticket?).

Looking at the spec of that CPU(PDF page 75) it seems 0x9 is still reserved.

It sounds useful to add the links in the code, there already are some
What does reserved mean?

After this change it means in the worst case it will report a 16-way associative cache when someone just has a 8-way or a 32-way or whatever.

Reporting the correct number of not reporting would be more correct if the number is not known.

Either the magic number is correct for all caches, or there should be a wildcard and the Validate function should be changed to account for that wildcard?
(I have no clue what Im talking about)

In D2353#98462, @elexis wrote:
In D2353#98374, @Imarok wrote:

I think we now had multiple Zen 2 users complaining about crashes.

Got a link?

Just see the ticket and the link in the post above.

The other question I have is whether we should actually call ENSURE in AddTLB.
It sounds better to fail more softly with a red error message?
Wouldn't object to a ticket for that (perhaps we already have a "dont use ENSURE" ticket?).

I think we should in general not use ENSURE in hardware detection but red error warnings instead.
Afaik there is nothing that critical that would justify an ENSURE

Sooner or later the magic numbers will change again I assume.

I don't think so.

Looking at the spec of that CPU(PDF page 75) it seems 0x9 is still reserved.

It sounds useful to add the links in the code, there already are some

Good idea

What does reserved mean?

That the number has no current meaning but is reservered for assigning some meaning in the future.

After this change it means in the worst case it will report a 16-way associative cache when someone just has a 8-way or a 32-way or whatever.

Reporting the correct number of not reporting would be more correct if the number is not known.
Either the magic number is correct for all caches, or there should be a wildcard and the Validate function should be changed to account for that wildcard?
(I have no clue what Im talking about)

Seems like ;P
The issue is: we currently don't know what this 0x9 means. But everyone who played 0ad and got this 0x9 had a CPU with a L3 16-way associative cache.
And to be honest: I don't even know why we need this information at all...

Imarok requested changes to this revision.Oct 5 2019, 11:21 AM

Please add a comment with some explanation and the link to the doc.

This revision now requires changes to proceed.Oct 5 2019, 11:21 AM
In D2353#98463, @Imarok wrote:
In D2353#98462, @elexis wrote:
In D2353#98374, @Imarok wrote:

I think we now had multiple Zen 2 users complaining about crashes.

Got a link?

Just see the ticket and the link in the post above.

Here is another link. https://wildfiregames.com/forum/index.php?/topic/15796-known-problems-please-read-before-posting/&do=findComment&comment=384917
So we allready have Ryzen 3600, 3600X, 3700X and 3800X. So we only miss a crash report of an 3900X user to complete the Ryzen Zen2 series.

Stan added a subscriber: Stan.Oct 10 2019, 10:02 AM

@OptimusShepard will you update the patch ? :)

In D2353#98805, @Stan wrote:

@OptimusShepard will you update the patch ? :)

Yes i will. It seems to be that there is another bug wich I will try to debug bevor updating the patch.

Stan added a comment.Oct 10 2019, 10:31 AM

Thanks a lot for working on this :)

Great Work!

I conform that the modification on line 92 dows the trick for my ryzen 3600. Now working. The packager for Archlinux has already included this patch in the a23.1-6 version of 0ad

Great Work!
I conform that the modification on line 92 dows the trick for my ryzen 3600. Now working. The packager for Archlinux has already included this patch in the a23.1-6 version of 0ad

Please report back if you have any bugs with the patch.
(@OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

In D2353#99092, @Imarok wrote:

(OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

Reproducibly? (For example by replaying the match or starting a new match with the same settings)

It sounds unexpected that a magic number comparison would trigger an out-of-memory during a match later.
If the same happens without the patch (commenting out the ENSURE), then it's not the patch.
Also then it would be something in the simulation, in which case a replay and some profiling on that to figure out what happens might be useful.

In D2353#99093, @elexis wrote:
In D2353#99092, @Imarok wrote:

(OptimusShepard got an OOM after about 30 minutes when playing on a big map with 8 AIs that might be related to this issue.)

Reproducibly? (For example by replaying the match or starting a new match with the same settings)

No idea. Didn't got any more information.

It sounds unexpected that a magic number comparison would trigger an out-of-memory during a match later.
If the same happens without the patch (commenting out the ENSURE), then it's not the patch.
Also then it would be something in the simulation, in which case a replay and some profiling on that to figure out what happens might be useful.

Yeah, I also don't think it has anything to do with the patch, but I can't rule it out 100%.