Page MenuHomeWildfire Games

Fix NetServer FSM error when ending the MP match in the loading screen
ClosedPublic

Authored by elexis on Jun 3 2018, 3:22 AM.

Details

Summary

See #4594.

Test Plan

Notice that this message is useful in the ingame state but not elsewhere.

Diff Detail

Repository
rP 0 A.D. Public Repository
Branch
/ps/trunk
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 6208
Build 10311: Vulcan BuildJenkins
Build 10310: arc lint + arc unit

Event Timeline

elexis created this revision.Jun 3 2018, 3:22 AM
Vulcan added a subscriber: Vulcan.Jun 3 2018, 3:27 AM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/622/display/redirect

Imarok accepted this revision.Jun 3 2018, 11:56 AM

Looks good code-wise.
Played a game and also worked.
Couldn't reproduce the OOS, so couldn't test that.

This revision is now accepted and ready to land.Jun 3 2018, 11:56 AM
elexis added a comment.Jun 3 2018, 1:01 PM

Thanks for the review.
This has nothing to do with OOS issues (I guess you mean doubleclick on the start button #5199).

Imarok added a comment.Jun 3 2018, 2:10 PM
In D1557#62851, @Imarok wrote:

Looks good code-wise.
Played a game and also worked.
Couldn't reproduce the OOS, so couldn't test that.

Oops, I meant s/OOS/Fsm error/

elexis added a comment.Jun 3 2018, 4:23 PM

(I know, TLDR, but we wanted reviews and prevent broken code, so forcefeeding proof of correctness to everyone x) )

The NetServer runs in a separate thread, the NetClient in the main thread.
So the FSM error occurs iff. the main thread is destroyed before the NetServer thread I believe.
Maybe the circumstances are more than that.
Reproducing the issue is entirely coupled to luck (it seems to work better when selecting mainland and pressing alt+f4 when it's at 6%). But even when killing at the exact same time of the mapgen process, sometimes it's triggered, other times not.
But this condition is missing whether we like it or not, so we have to add it and adding it makes our clients not trigger FSM errors in the NetServer anymore.

As reported in the ticket, a second way to trigger the error message on the NetServer is to Alt+F4 the client while it rejoins:

ERROR: Net server: Error running FSM update (type=25 state=5) NMT_END_COMMAND_BATCH while NSS_JOIN_SYNCING

NSS_JOIN_SYNCING state specification is here D1556#62860

The server started to or finished transfering the simulation state to deserialize and the client has not finished loading the map and deserializing the state yet.

I couldn't reproduce the issue however.
This leads me to believe that the NetClient just doesn't have chance to send out his message before it's choked off.
Afte 10 attempts I finally could reproduce it.

Generally, I wonder if these messages should be shown like that.
At some point we should also think about the possibility of evil clients killing games by sending broken packets.
The FSM errors could be used to spam errors on the host machine.
(So maybe for that scenario, rather than not showing these error messsages by default, we might think about adding an (possibly configurable) automatic kick if clients cause too many errors, or kick button link on the LOGERROR as proposed for the lagwarnings in D1461.)

Thanks for the review Imarok, good to put this case to a close years later.

elexis added a comment.Jun 4 2018, 1:32 PM

Imarok:

  1. Missing comment: The statement should explanain why it's necessary to begin with. When one assumes that the code works, it's not clear why it's needed. Only archeology has shown that it was introduced by rP14732 to try to flush out the resign just before calling EndGame after clicking on Exit and clicking on Resign in the resulting question (from rP14772). So if this code would work, we should add // Send resign before exiting the game or so.
  1. Delete code instead? If we can only trigger the FSM error 1 out of 10 times it means that the code only does what it should 1 out of 10 times... Testint to call leaveGame instantly after resignGame in alpha 23 proves that the resign is also sent only 1 out of 10 times. Calling EndGame() and destroying the g_NetClient after posting the network command was the case prior to rP15106. So the latter commit is the third piece of evidence that the patched code doesn't work and can be nuked. See also https://trac.wildfiregames.com/ticket/2420#comment:6.

So are we going for a straight revert of rP14732 rather than fixing a bug in code that is almost never effective?

elexis added a comment.Jun 4 2018, 1:34 PM

rP14732 wanted to solve the rated game evasion using alt+f4 hack even, see #2373. Failure bigtime!

elexis added a comment.Jun 5 2018, 3:34 PM

I'd say we can keep the constructors just so that people don't fall for the same assumption again.
Adding a code informing them that this sending is unreliable.

elexis added inline comments.Jun 5 2018, 3:47 PM
source/network/NetClientTurnManager.cpp
99

I suspect that this +2 when not ready yet may even have triggered the Client %d (%s) is ready for turn %d, but expected %d breakpoint in CNetServerTurnManager::NotifyFinishedClientCommands that is reported in #3643 and replaced with a LOGERROR in rP21023 but never understood how it could be triggered.

This revision was automatically updated to reflect the committed changes.