Page MenuHomeWildfire Games

Fix AI OOS on rejoin related to random calls
AbandonedPublic

Authored by Silier on Jan 15 2020, 8:33 PM.

Details

Reviewers
None
Group Reviewers
Restricted Owners Package(Owns No Changed Paths)
Trac Tickets
#3858
Summary

When player rejoins and any function using random number generation is called during deserialization, states of players will not match because rng number of aimanagers will be different. That also causes problem with every other generation of random number and as that will not be the same as the host generates, ai will enter different gamestate.

Reverting random pickup of phase technology in ai.gamestate (rP20750) because that init function is called on rejoin and gamestate is not serialised from aimanager. Most likely it would be better for the future if ai could pick that phase technology at runtime so it would be possibly different pick for multiple ais.

Test Plan

Without patch:
Run multiplayer match with ai.
Connect with another client.
Rejoin in early game.
Almost immediately oos.

With patch:
Run multiplayer match with ai.
Connect with another client.
Rejoin in early game.
There should be no oos.

If there would be, rng numbers in oos dumps will be the same as there are possibly another causes of oos (most likely bombing attack in attackManager and started attack plan)

Event Timeline

Silier created this revision.Jan 15 2020, 8:33 PM
Owners added a subscriber: Restricted Owners Package.Jan 15 2020, 8:33 PM
Silier updated the Trac tickets for this revision.Jan 15 2020, 8:35 PM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/vs2015-differential/1055/display/redirect

Silier edited the summary of this revision. (Show Details)Jan 15 2020, 8:37 PM

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...
Executing section JS...

binaries/data/mods/public/simulation/ai/common-api/gamestate.js
| 271| »   »   switch·(entity.check)
|    | [NORMAL] ESLintBear (default-case):
|    | Expected a default case.

binaries/data/mods/public/simulation/ai/common-api/gamestate.js
| 288| »   »   »   switch·(type)
|    | [NORMAL] ESLintBear (default-case):
|    | Expected a default case.

binaries/data/mods/public/simulation/ai/common-api/gamestate.js
| 938| }(API3);
|    | [MAJOR] ESLintBear (no-use-before-define):
|    | 'API3' was used before it was defined.
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/1573/display/redirect

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/macos-differential/151/display/redirect

elexis added a subscriber: elexis.Jan 16 2020, 12:34 PM

(The string in rP19491 should be nuked and D598 abandoned in case this patch is committed and achieves the objective)

bb added a subscriber: bb.Jan 29 2020, 9:44 PM

Looking at the identity component I see a similar structure: a random call on init. Why is this not creating an OOS?

binaries/data/mods/public/simulation/ai/common-api/gamestate.js
47

let the ai decide at runtime which one it wants

Not sure if this is possible even, the ai might make plans and stuff. Any way we can know here whether we will be serializing after?

49

Would also bug in the old system: what if all techs are disabled?

Silier added a comment.EditedJan 30 2020, 11:08 AM

Do you mean components/Identity ?
Thats not related to ai deserialisation problem. If that random pick of phenotype is problem then even games without ais will oos, what does not. Game rng is not changed, but one could test with acutally 2 phenotypes and see if that causes oos.

Edit: identity component does not have serialisation code in js what means that whole component is serialised by default when it possibly skips init function call.

binaries/data/mods/public/simulation/ai/common-api/gamestate.js
47

one possible solution could be to do random pick for next phase pair once last phase is researched, and only last choice would be serialized instead every pick what would be necessary if done here.

49

then whole function has a problem

Stan added a subscriber: Stan.Jan 30 2020, 1:40 PM

Also the random in identity only changes the visual actor and or the sounds which is not serialized :)

If I read correctly, this.sentDiplomacyRequestLapseTime is randomized, then replaced by the deserialized value. So the code change is not necessary but done for consistency, because the random calls are considered the root cause of the problem?
If all of that is the case, it seems that the more safe fix would not be to remove the randomization calls, but to serialize everything that needs to be serialized.

So for gamesetate.js, it could be that properties like this.phases should be serialized as well.

The main problem is that it might not just be the random calls that change a setting to become OOS, but an AI state property that influences simulation state (like this.phases) could also be set to a non-random default value and then changed to another non-random value later in the match. If that is not serialized, then despite the lack of random-use, the property would differ for the rejoined client, since it is not serialized and set to the default state for the rejoining client and to the later-match state for the non-rejoined client. It would not trigger an OOS directly upon rejoin since the property is not serialized, thus not directly contributing to the compared hash. But later when the non-serialized property is read and determines a property that is serialized, that effect would cause the OOS.

So in order to be completely sure that there is no OOS, one would have to check every JS value that can influence a simulation state, and check that this value is deterministically the same for all players, i.e. either a template value where we assume they have the same templates, or a serialized property, or a constant that never changes.

I.e. there might be one property that is not set to random in one of these modified init functions but still changed later on and not serialized, triggering an OOS, perhaps even including the properties modified to be not-random here.

So it seems the task is to serialize all properties that are known to determine simstate that are not serialized yet and changed during *any* simulation time, or to make these non-serialized values constant (which you did with the pair tech here by the looks of things).

Using constant non-serialized values has the disadvantage that it yields less distinct / more repetitive AI gameplay.
Varying and serializing a value has the possible disadvantage that it may cost a lot of memory.
That memory will affect savegamesize (zipped) and network data size to be transferred per rejoin (which can be possibly many 'late observers'.).

If I read correctly, this.sentDiplomacyRequestLapseTime is randomized, then replaced by the deserialized value. So the code change is not necessary but done for consistency, because the random calls are considered the root cause of the problem? If all of that is the case, it seems that the more safe fix would not be to remove the randomization calls, but to serialize everything that needs to be serialized.

Calling any random function during rejoining will cause oos because rng will change so just serialising random values will not help if that random call stays there.

In D2573#109250, @Angen wrote:

Calling any random function during rejoining will cause oos because rng will change so just serialising random values will not help if that random call stays there.

It doesn't matter if the randomized value is overwritten with the serialized value immediately before serialization and before changing any other simulation dependent value which isnt also overwritten by a serialize value.
I.e. as long as the simulation has the same state at the time its serialized, it wont matter.

Also as mentioned it might not only be the randomized values but also fixed values just alike. For example onInit this.x=1, during game this.x=2, x not being serialized, then on rejoined x=1 for the rejoined client but x=2 for the non-rejoined one, despite no use of random.

pool when rejoining for system components and entities:

WARNING: guiinterface.init

WARNING: mod mana init

WARNING: Identity.Init
...

WARNING: Identity.Init
WARNING: deserialisation started (set random number generator)

WARNING: guiinterface.deserial

WARNING: guiinterface.init (called explicitly from deserialisation in js)

WARNING: modmana deser

WARNING: mod mana init (called explicitly from deserialisation in js)

so any possible oos related to random functions is removed by calling all init functions and then starting deserialization here, unless explicit called init functions does not call random functions.

for aimanager
set random number generator
load shared scripts
call deserialise (init functions and constructors called explicitly from js)

Silier added inline comments.Feb 7 2020, 7:06 PM
binaries/data/mods/public/simulation/ai/common-api/gamestate.js
49

also L28
// we assume all of them are researchable from the civil center

bb added a comment.Feb 12 2020, 3:22 PM

OOS at turn 80:

bb added a comment.Feb 12 2020, 3:30 PM

For those who want:

See http://irclogs.wildfiregames.com/2020-02/2020-02-12-QuakeNet-%230ad-dev.log; textual diff is identical, binary diff has as a first change one more entity in the JS Map stored in AI entitiesModifications.

+1

binaries/data/mods/public/simulation/ai/petra/attackPlan.js
199

just for record: this is adding duplicate plans to queue manager on deserialisation and even with different priority as should be

Silier planned changes to this revision.Apr 14 2020, 1:40 PM

i am still getting random number oos, not sure where is the cause

Silier abandoned this revision.Apr 20 2023, 4:50 PM
Herald added a reviewer: Restricted Owners Package. · View Herald TranscriptApr 20 2023, 4:50 PM
tuxayo added a subscriber: tuxayo.Aug 30 2023, 4:30 PM

Isn't there still a benefit of having this merged? (after eventual polishing)

As seen, it doesn't fix all the cases of random() calls making rejoins non-deterministic. But it definitely fixes part of them and they will certainly need to be fixed again in a similar fashion if another full-attempt is made. Right?

So the current changes here can be see as an increment toward fully fixing the issue.

I can provide rebases if needed. And follow-ups for very explicit requested changes (I don't have the codebase knowledge to make changes from any request a bit abstract :( )