Page MenuHomeWildfire Games

Prevent players from dropping during map generation
ClosedPublic

Authored by causative on May 21 2018, 8:31 PM.

Details

Summary

In a23 it has become much more difficult to start multiplayer games, especially with large maps and many players, because players will drop during map generation. The problem occurs regardless of whether playing the new maps. This is an urgent problem since many players returned to 0ad specifically to check out a23, and we could permanently lose players since they can't get into games. In my experience it's virtually impossible to start 8 player games with random lobby players because two or three players will drop.

It is equally a problem during rejoining; players who dropped during initial map generation usually can't rejoin either.

It is caused by clients timing out when they can't perform certain aspects of map generation fast enough. elexis discovered a temporary workaround where the host can artificially increase their network lag, which causes enet to permit longer timeouts.

This patch just directly makes enet permit longer timeouts during map generation. It temporarily sets enet_peer_timeout to a high value on both client and server when map generation begins, and sets it back to a normal value once map generation has finished.

Note that elexis has a plan in the works to multithread map generation, which would also solve this problem. However, that patch is not yet ready. In the meantime, given the urgency of the situation, we need a fix.

Test Plan

Find a player who historically has had a problem with dropping during a23. I have a list of some of the players I observed to have this problem: fleur, zwerg, mord, M.T., fpre, wololoo, Palaiologos, xavierliz, AtendedorBoludos, YoSoyLeonidas, amancino. Host a game where that player joins, where neither they nor you are using this patch, with a large map and 8 players (ok if most positions are unassigned), and verify that the player drops when you try to start the game. Jebel Barkal is a good map to ensure the player drops due to its complexity. Then, host a game with the same map settings where both you and the other player have this patch, and verify that the player can connect.

This practical test has not yet been done, however it can be tested locally.

An easier test you can do by yourself is to add a large loop to a map. for (int i=0;i<1000000;i++){LOGWARNING("w");} placed inside CMapReader::ParseEntities, for example. Then you open one instance of 0ad and host a multiplayer game, and join it with the other instance. You should make sure the joined instance drops without D1513, but is able to join when both client and server have D1513.

Also make sure that you can rejoin a game under these conditions.

Also make sure that if you kill the joined instance during map loading, it is eventually noticed as disconnected from the server (within about a minute). And make sure that when you kill the joined instance after loading is finished, whether it joined with the game start or rejoined later, the joined instance will be *quickly* noticed as disconnected (within 10 seconds on my computer).

Diff Detail

Repository
rP 0 A.D. Public Repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 6121
Build 10189: Vulcan BuildJenkins

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
elexis added inline comments.May 23 2018, 2:12 PM
source/network/NetClient.cpp
480

I was thinking that the reset call should be here, but the client didn't catch up yet with the turns that were performed after the rejoin started.

891

The call is needed for rejoined clients for sure.
But non-rejoined clients at the gamestart should be reset as soon as they finish.
So perhaps it were cleaner to move it into the if-statement and add a conditional reset call to LoadFinished.

source/network/NetSession.cpp
31

A reader might assume that LONG_TIMEOUT serves the same purpose as NETWORK_WARNING_TIMEOUT just in longer.
This should either have a very short comment stating its purpose, or better be renamed to something like NETWORK_GAMESTART_TIMEOUT.

37

Were a macro function nicer? (So that we only have two C++ functions with that semantics instead of three)

41

Correct arguments.
http://enet.bespin.org/group__peer.html#gac48f35cdd39a89318a7b4fc19920b21b

The second one is the time that is doubled repeatedly until the third argument is reached.
The third one is the limit of the exponential timeout mechanism (the one that makes the players drop),
the fourth one is the maximum timeout limit independent of the exp mechanism.

(Was wondering if the fourth one could be set to 0, so that more than 40 seconds were allowed if the ping is crap. But it may be smaller than 40 if too good ping. Also not important with that magic number 40s being chosen.)

284

(IMO better to have either code or comment per line, even if not exceeding the recommended 80 char linelength)

vladislavbelov added inline comments.
source/network/NetSession.cpp
29

Wouldn't it be good to add a comment, that all timeouts are in milliseconds?

31

Do we need empty line between constants? They're all timeout constants.

source/network/NetSession.h
109

Do we need the part about map in the comment? Because the function doesn't have something related to the map generation. At least it have to mention it as an example (also it doesn't control the temporary, you need to call it again to disable the long timeout): Allows to increase the timeout before a peer connection failed. Example: it allows to prevent drops during map generation.. The same below.

causative updated this revision to Diff 6626.May 24 2018, 1:31 AM
causative marked 10 inline comments as done.

misc

causative updated this revision to Diff 6627.May 24 2018, 1:34 AM
causative marked an inline comment as done.

diff context

source/network/NetClient.cpp
891

LoadFinished is called when clients joined and at the start of the game, but occurs before rejoining clients have synchronized. Therefore I am unsetting the long timeout here for rejoining clients, and in LoadFinished for others.

source/network/NetSession.cpp
31

Agree that this is a temporary fix and threading is a better fix.

37

Wouldn't a macro function be exactly the same? (macro + 2 methods). Also I prefer functions.

41

The enet defaults for the second through fourth parameters are: 32 ms, 5 seconds, and 30 seconds.

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/555/display/redirect

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/556/display/redirect

I just got disconnected during the beginning of a mapgen stage, there could hardly be any freeze. Got 100ms ping to the host.
One can join a remote host with one client, then rejoin with a second. One can look at the first screen to find the moment when the server disconnects, long before the loading screen client notices.

In D1513#62429, @elexis wrote:

I just got disconnected during the beginning of a mapgen stage, there could hardly be any freeze. Got 100ms ping to the host.
One can join a remote host with one client, then rejoin with a second. One can look at the first screen to find the moment when the server disconnects, long before the loading screen client notices.

So what are you saying? Is this without D1513? Sounds like the client is disconnecting when it doesn't appear to be frozen.

Without the patch. To me it sounds like a freeze before mapgen.
So threading of ParseEntities and other function in MapGeneration or MapReader or whatever it was described in #3700 may not include all places that ought to be threaded.
But threading the NetClient should fix the dropping issue even if the mainthread (+ gui) freezes.

I've also had this problem, esp with larger maps. Pretty much anything more complicated than mainland will cause the game to be unjoinable and unrejoinable.

smiley had a timeout of 60 seconds today, so 45 seconds wouldn't have helped at all. And if one of the players is missing the 4v4 is over and players have to rehost, lost 30min lifetime.

So eh, can you make that a config option?

See the lag simulated by nani and smileys timeout:

Got about 30s of freeze prior to mapgen. between these two:

TIMER| common/global.xml: 574.471 us
Generating Jebel Barkal of size 320 and 8 players.
lyv added a subscriber: lyv.May 29 2018, 9:09 PM
In D1513#62539, @elexis wrote:

Got about 30s of freeze prior to mapgen. between these two:

TIMER| common/global.xml: 574.471 us
Generating Jebel Barkal of size 320 and 8 players.

Something I have noticed as well.

causative updated this revision to Diff 6684.May 30 2018, 4:56 AM

Increase loading timeout from 40s to 80s

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/600/display/redirect

In practice, this patch allows about 60 seconds of timeout. I'm not sure on exactly why but that's what local testing shows.

In D1513#62537, @elexis wrote:

smiley had a timeout of 60 seconds today, so 45 seconds wouldn't have helped at all. And if one of the players is missing the 4v4 is over and players have to rehost, lost 30min lifetime.

So eh, can you make that a config option?

Bit of a pain to make it a config option, between finding the right place to set it, thoroughly testing to make sure all code paths go through that place, and then arguing here about whether or not that's the right place, which will surely result in changing it, and then thoroughly retesting again. Really it's the almost the same level of work and testing as the entire patch has demanded to date. This patch is only a temporary workaround that is intended to be reverted at some point, let's not make it too involved.

causative updated this revision to Diff 6685.May 30 2018, 5:14 AM

increase timeout to 80s

Build failure - The Moirai have given mortals hearts that can endure.

Link to build: https://jenkins.wildfiregames.com/job/differential/601/display/redirect

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/602/display/redirect

We want a minimum timeout of 10 or 15 seconds even if we have threading, so no revert?
The config option has the other problem though that both ends would have to set it, otherwise it's useless I guess.

causative updated this revision to Diff 6687.May 30 2018, 9:00 PM

add config option

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/603/display/redirect

elexis requested changes to this revision.Jun 1 2018, 5:47 PM

This patch is very clean and reads correct. But it breaks apart when testing.

Two major defects undermine the mission objective and one defect adds a regression:

  1. It fails to achieve the stated objective:

    The code claims to provide 80 seconds timeout tolerance. But the timeout tolerance remains unadulterated and players can still drop after 5 seconds of timeout or more (https://github.com/lsalzman/enet/search?utf8=%E2%9C%93&q=ENET_PEER_TIMEOUT_MINIMUM&type=).

    Bug A: When the server is notified that the host pressed the start button in`CNetServerWorker::OnStartGame`, the server only enables long timeouts for the client who pressed the StartGame button rather than all clients.

    Bug B: When clients enter the loading screen, they experience the reported freezes. Then the loading screen finishes and the timeout is shortened while enet still hasn't processed new packets yet. So it is still in the timeout phase and the default limit is immediately exceeded and the client drops. Shortening the timeout only in functions called after enet processed the message (OnFoo functions) prevents this problem. I.e. the lag may not yet be shortened in LoadFinished but after the server reflected that, i.e. in OnClientsLoading.
  1. It does the converse of the mission objective by breaking rP17772.

    The CNetClientSession sets default timeout tolerance for all clients rather than excluding the Local client.

    This issue is reproduced by simulating lag for the local interface (lo instead of eth0 on unix. This also allows testing the patch without simulating lag on the internet connection.).

    To fix this issue, the CNetClientSession needs to remember if it is the LocalClient beyond its constructor and not have the timeout set if it is.
  1. An issue not caused by this patch but the first one to notice when testing this patch is D1546, as it makes testing this patch unfeasible.

As all of these four issues above are found by simulating more lag than the claimed timeout, and since both of you know how to simulate arbitrary lag, <insert yelling here> and <insert you-see-that-guy-over-there meme>.

  1. As causative pointed out last night on irc: CNetServerWorker::OnLoadedGamenever resets the timeout tolerance for the last client to leave the loading screen. So players would have to bear more than a minute of pause in a running game in case of that client dropping.

(and still don't see the need for the proposed comments in the code as they are equivalent to whatever should be in the declaration of the LongTimeout function)

source/network/NetServer.cpp
1255

That's the missing loop over all clients

1271

(10:40:39) causative: in CNetServerWorker::OnLoadedGame, the call to SetLongTimeout has to be at the top

source/network/NetSession.cpp
35

Now unneeded global.

37

Since this function already contains a macro and since this is the only global function, using a macro seems considerable.
But they have the disadvantage that the newlines need to be scaped which is quite ugly, so agree to leave it as a C++ function.

52

float is a bit ugly. int almost exclusively has at least 32bit, but almost exclusively is not always, so meh.
http://en.cppreference.com/w/cpp/language/types

source/network/NetSession.h
191

whitespace issue

This revision now requires changes to proceed.Jun 1 2018, 5:47 PM
elexis accepted this revision.Jun 1 2018, 8:05 PM

Had to post the red flags. I won't commandeer so you remain the author and I'll commit what you had reviewed tonight in irc (+ the one fix you mentioned).

Thanks a lot for this patch!

This revision is now accepted and ready to land.Jun 1 2018, 8:05 PM
elexis updated the Trac tickets for this revision.Jun 1 2018, 8:07 PM
elexis added inline comments.
source/network/NetServer.cpp
1255

Also seems more logically related (and possibly scalable) to do this within the StartGame function.

Yeah, it's not that it wasn't tested, it's that the test coverage wasn't good. My method of testing disconnects was to kill -9 the client process, which for some reason results in a longer time before disconnect if applied during game loading with the earlier patch version (~40s) than the method of setting artificial lag with tc. Also tested rejoins, where it was working fine. And we didn't ever manage a test with a real dropping player.

elexis added a comment.Jun 3 2018, 2:38 AM
In D1513#61986, @Imarok wrote:

At start while loading 3 people lost connection and I got Net server: Error running FSM update (type=25 state=4) (That is NMT_END_COMMAND_BATCH and NSS_PREGAME).

This was reported here some time ago #4594 and is most likely unrelated from this diff.

elexis added a comment.Jun 3 2018, 3:51 AM

8+ second freeze on an empty map on gamestart prior to mapgen identified #5200. May or may not be the trigger that makes this issue more frequent than before.

vladislavbelov added inline comments.Jun 4 2018, 2:33 AM
source/network/NetSession.cpp
52

I agree - float isn't good here, especially if it's casted to the u32 anyway.

In D1513#61804, @Stan wrote:

Just a dumb question. Why now and in not previous alpha ?

Apparently the a23 aura bugfix rP21785 creates a like 8 second freeze before map generation starts that was < 2 seconds in a22 and auras / AIInterface are slow.

source/network/NetSession.cpp
52

Just that int is the only integer type supported currently by the config system and we want to be able to allow more than 65535 milliseconds and shouldn't rely on probabilities.

Stan added inline comments.Jun 4 2018, 11:23 AM
source/network/NetSession.cpp
52

Isn't int −2,147,483,648 to 2,147,483,647, ?
−32,768 to 32,767 is for short int.

So could at least be unsigned :)

vladislavbelov added inline comments.Jun 4 2018, 11:43 AM
source/network/NetSession.cpp
52

Anyway float isn't for this case, you always can use uint and clamp it. But do not use this hacky workaround.

elexis requested changes to this revision.EditedJun 6 2018, 6:10 PM

Another blaring defect

Bug: For the last non-rejoining client to leave the loading screen, CNetClient::OnClientsLoading is never called and the timeout tolerance is never set back to short.
Reason: When the last client finished the loading screen, CNetServerWorker::CheckGameLoadStatus broadcasts CLoadedGameMessage, returns true and then the CClientsLoadingMessage won't be sent anymore from CNetServerWorker::OnLoadedGame.
Effect: The game is affected significantly, because the rest of the clients now have to wait the full long timeout until the disconnected client realizes that he is disconnected and has to rejoin.

This was found by adding a LOGWARNING above each SetLongTimeout call and testing every single of them multiple times (the statemodel is more complex than we often think and has some edge cases).

Unexpected lag tolerance

Then I observed more timeout tolerance than what was configured and I can't tell why.

Test 1:
It seemed the NetServer didn't disable the long timeout for other nonrejoined clients after the loading screen.
If a non-rejoining client loses the connection in the loading screen but finishes the loading screen quickly, he may have to wait for 60 seconds, sometimes 90 seconds until he sees that he's gone.
Apparently the loading screen is still freezing so much (even with new_rms_test), that the enet timeout is greater than 90 seconds, even if we set it 60.
Test 2:
I joined via public IP and added 120s lag to that once the game was in the loading screen.
So I set the tolerance to 2 minutes and lag to 3 minutes to test that it actually works...
Test 3:
But then I even got a 157 second timeout tolerance, although I set only 120 seconds.
The server must have had received a message after 2min and thus assumed the latency is actually 120000ms and thus a multiple of that can be tolerated.
So it wasn't the loading screen, but in order to test a full packet delivery stop, use 10min lag?
Test 4:
It complains about 300000ms being too much lag, so I tested 200000ms = 3min 20s. The NetServer` disconnected after 120s (as setup), but the NetClient not before 170s.
Test 5:
In another retry both server and remote client waited around 170s.
Test 6:
Doing yet another test with 60s timeout tolerance and 3min20s lag, the remote client disconnected after either 60 seconds or slightly more, the server waited more than 80 seconds again.

Wild guessing:
According to the specification there shouldn't be any message received for 3 minutes, and before adding the lag, there is no lag, so how does it go beyond the timeout tolerance I had setup...
https://wiki.linuxfoundation.org/networking/netem#emulating-wide-area-network-delays
So it seems the enet first waits that timeout tolerance and then does RTT based multiplication with the RTT before the timeout
Was trying to look at https://github.com/lsalzman/enet/blob/f46fee0acc8e243b2b6910b09693f93c3aad775f/protocol.c#L1435 to figure out.
But we don't know how that roundTripTimeout condition works.
Or we don't know how that lag simulation works.
sudo ifconfig eth0 down isn't an alternative, as that bugs #5213.
Mind over matter, except in this case.
I can accept the fact as at least it won't disconnect too soon.
I will report this unexpected loading screen timeout tolerance as a ticket which we can close as meh if we have #3700.

Other Observations

  • Also notice that the workaround in disabling in CNetClient::OnClientsLoading instead of CNetClient::LoadFinished performs the timeout disabling redundantly (can be abit confusing when testing).
  • Also it seems less complicated to derive the timeout tolerance from the transition and not from the state. But the state model doesn't seem precise enough.

It's a funny example of how pushing release blockers haunts one later.

Here the patch with which I tested, including all debug messages

binaries/data/config/default.cfg
448 ↗(On Diff #6687)

during loading screen and rejoin

(Arguably rejoiners missed the gamestart, but obviously "loadingscreentimeout" doesn't fit for the rejoin sync stage either.)

source/network/NetClient.cpp
711

This function is called when the client starts the download of the simstate.
After the download succeeded, OnGameStart is called.
So we enable twice for rejoiners.
So just to avoid the redundancy this line could be removed.
But then the NetServer has long timeouts while the NetClients doesn't have them during downloads and ideally they'd be always in sync.
I think we should restrict the long timeout only for the loading screen, deserialization and "catching up" stage, the file transfer is quick enough.
So I'll chose the way of the least lines of code that happen to not be redundant but not perfectly in sync, declare missing justification for the added lines.
It seems we are missing a CNetFileTransferer event if the upload succeeded, then we could keep the NetServer more in sync.
But where a "OnUploadComplete` event should be we find a TODO and a memory leak (CNetFileTransferer::Poll()).

891

I was wrong, the last client to leave the loading screen doesn't receive the CClientsLoadingMessage, so we need to call this here for that one client too (and may do so redundantly for the others).

source/network/NetServer.cpp
1255

This way reducing one line of code (adding two lines of braces, but they don't count I say)

source/network/NetSession.cpp
52

Always after implementing the uint conversion in CConfigDB code: D1566.
uint can't be used, as mentioned we should be able to pass a value greater than u16, so we can implement the u32 conversion which already seems to be applicable to many other config values.

This revision now requires changes to proceed.Jun 6 2018, 6:10 PM
elexis accepted this revision.Jun 6 2018, 11:34 PM

(137 replays generated today)

binaries/data/config/default.cfg
448 ↗(On Diff #6687)

The sentence "Wait this many seconds before X" could also be understood as adding a delay.

This revision is now accepted and ready to land.Jun 6 2018, 11:34 PM
This revision was automatically updated to reflect the committed changes.