Page MenuHomeWildfire Games

Flush the destruction queue after GUI events are processed to avoid serialisation crashing
Needs ReviewPublic

Authored by wraitii on Jan 6 2019, 3:10 PM.

Details

Reviewers
None
Group Reviewers
Restricted Owners Package(Owns No Changed Paths)
Trac Tickets
#4616
Summary

GUI events can create and destroy local entities. Serialisation ENSURES that the destruction queue is empty (the reason for that is that we don't serialize the destruction queue)

If the simulation is paused, it can happen that local entities have been destroyed and not flushed (since we only flush if the simulation is running), which then crashes. In particular this can happen in odd cases in MP (see ticket).

I _believe_ it is enough to flush after the GUI event loop and the simulationUpdate call. I think this still crashes if the player tries to serialize on the same event-loop that an entity is destroyed, but in SP I don't really see how you could do that, and I _believe_ in MP we would necessarily flush before the server can handle any request.

It might be safer to explicitly Flush before any "SaveGame" call or something.
Alternatively, I'm not sure that we actually need this ENSURE here: since local entities don't get serialised, they won't be loaded back or transmitted over MP, so it ought to be safe to remove it.

Test Plan

To reproduce locally:

  1. Select an entity that can build something
  2. Pause the game
  3. Try placing building previews (you can see they don't disappear - they also can't move)
  4. Try saving the game
  5. It crashes

Diff Detail

Repository
rP 0 A.D. Public Repository
Branch
D1738_flush
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 7362
Build 11991: Vulcan BuildJenkins
Build 11990: arc lint + arc unit

Event Timeline

wraitii created this revision.Jan 6 2019, 3:10 PM
wraitii edited the summary of this revision. (Show Details)Jan 6 2019, 3:12 PM
Vulcan added a subscriber: Vulcan.Jan 6 2019, 3:12 PM

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/ps/Game.cpp
|   1| /*·Copyright·(C)·2018·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2019" year instead of "2018"

source/main.cpp
|   1| /*·Copyright·(C)·2018·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2019" year instead of "2018"
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/differential/935/

The more I think about it the more I think the best thing to do here is to remove/amend the ENSURE. It dates from rP7276, and I'm not sure if it ever was useful. The Simulation's UpdateComponents cannot be halted and we won't serialize or deserialise in the middle of it, so any non-local entity destroyed is flushed at the end of the turn, i.e. the destruction queue only has local entities in it. That's fine because those don't get serialised anyways.

Maybe amending it to ENSURE that the queue is empty or only has local entities is the way to go since this is unlikely to be super-slow.

elexis added a comment.Jan 7 2019, 1:07 AM

GUI events can create and destroy local entities.

Which ones? I only know about corpses, rallypoints and target markers. The latter are new in a23, so rallypoints triggered the issue originally?
It would be good to reproduce this issue. One can increase the turn length with F9 -> Engine.SetTurnLength(ms);.
One could hide the pause overaly screen to mess with rallypoints or target markers during pause.

As to the solution, dunno. In general its better to keep simulation logic in the simulation folder, perhaps TurnManager.

The comment says

it's (hopefully) easier to just expect callers to flush the queue before serializing

So it sounds like theres a missing Flush call just before a call to this function, I would expect when the NetClient of the local player starts serializing the game for the rejoiner.

Sorry, I'm going to be a little snarky, but there is a test plan for this revision that says:

To reproduce locally:

Select an entity that can build something
Pause the game
Try placing building previews (you can see they don't disappear - they also can't move)
Try saving the game
It crashes

This is 100% reproducible. I can guarantee it has _nothing_ to do with turn length because that _can't_ fail.

So it sounds like theres a missing Flush call just before a call to this function, I would expect when the NetClient of the local player starts serializing the game for the rejoiner.

I will quote myself in the summary of this revision: 'It might be safer to explicitly Flush before any "SaveGame" call or something.'

However I believe this ENSURE is wrong in the first place. There's no reason to believe code from that era is perfect.

elexis removed a reviewer: elexis.Jan 7 2019, 2:03 PM

Rightfully so, someone filing an actual testplan is new to me and I didn't notice the trac ticket until I posted the comment.

Good work about the reproduce, I was wondering about this case for about 2 years.

The ENSURE seems necessary, because the simulation entities should really be cleaned at this point.
The local entities however are irrelevant.

So either the ENSURE must ignore local entities, or local entities must be flushed before the serialization attempt occurs.

Given that we don't want to have multiple building previews visible at a time, even during pause, means that (at least local) entities must be flushed immediately after destruction.

Flushing destroyed non-local entities outside of the simulation would be troublesome if that would change the simulation state in any way (I would guess that it doesn't but it might in the future).
One may also examine to nuke the function and flush each entity immediately, or flush local entities immediately upon destruction. (Might have a performance impact, reading the code would help decide.)

So either the ENSURE must ignore local entities, or local entities must be flushed before the serialization attempt occurs.

Right.

Given that we don't want to have multiple building previews visible at a time, even during pause, means that (at least local) entities must be flushed immediately after destruction.

The existing revision fixes this issue by cleaning once per GUI loop, but it's arguably a hack.

Flushing destroyed non-local entities outside of the simulation would be troublesome if that would change the simulation state in any way (I would guess that it doesn't but it might in the future).

Pretty sure it would as the flushing is what sends MT_Destroy message. However the Simulation Update flushes, and that can't be interrupted atm so we should be fine in that respect.

One may also examine to nuke the function and flush each entity immediately, or flush local entities immediately upon destruction. (Might have a performance impact, reading the code would help decide.)

I think that can't work for the simulation as we might have dangling references - but that'd need a look at.

wraitii updated this revision to Diff 7907.Sat, May 4, 5:46 PM

Fix the crash by updating the ensure to more accurately reflect what should happen.

Vulcan added a comment.Sat, May 4, 6:45 PM

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/simulation2/system/ComponentManagerSerialization.cpp
|   1| /*·Copyright·(C)·2017·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2019" year instead of "2017"
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/differential/1326/display/redirect