Page MenuHomeWildfire Games

Fix rP14372 and rP22666 miniupnpc leak and crash on thread exit
ClosedPublic

Authored by wraitii on Aug 16 2019, 9:18 AM.

Details

Summary

It appears the reason is that I did not join the UPnP thread in rP22666 (Edit: nor the code in rP14372). This was not done before so I missed it and I failed to test this code path.

Test Plan

Check that hosting works after quitting the game.

Diff Detail

Repository
rP 0 A.D. Public Repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

wraitii created this revision.Aug 16 2019, 9:18 AM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/396/display/redirect

elexis edited the summary of this revision. (Show Details)Aug 16 2019, 11:26 AM
elexis requested changes to this revision.Aug 16 2019, 11:37 AM
elexis added a subscriber: elexis.
  • that member is not defined is the macro is false
  • Alt+f4 netserver takes 20 seconds for me to end the program
This revision now requires changes to proceed.Aug 16 2019, 11:37 AM
wraitii updated this revision to Diff 9355.Aug 16 2019, 12:07 PM

Cleanly exit to avoid hanging in upnp Discovery.

Changes from waiting one time 10 seconds to 100 times 100 ms - not sure if that has an impact on efficiency.

Build failure - The Moirai have given mortals hearts that can endure.

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/398/display/redirect

So the quetion is mostly what the 10 sec to 100ms change will do.
It sounds like 100ms lag could break it already depending on implementation? Depends on whether it only listens for a response or whether it also sends something prior each call. Specs say "execute the discovery process" only.
I saw someone using 2000ms, and no recommendation otherwise.
So looking at the UPNP discovery protocol, that's https://en.wikipedia.org/wiki/Simple_Service_Discovery_Protocol
i.e. https://tools.ietf.org/html/draft-cai-ssdp-v1-03#section-2.1
which says

Hence two types of SSDP requests will be sent across the SSDP

multicast channel/port. The first are discovery requests, a SSDP
client looking for SSDP services.

so 100ms sounds like too few time to send and receive a packet of any type.

It sounds like it should be tested, simulating lag on a network device is easy too, I posted it on trac 2854

So the question is whether the 10 seconds + killing the thread aren't just okay, or whether it is mandatory to perform cleanup in that UPNP thread.

If SSDP is stateless, then it wont matter if the thread is killed.
If this doesnt use encryption, one can easily view the packets sent with wireshark, and measure precisely how long communication takes on the local network, see which states are involved etc.

But also the local thread doesnt perform cleanup on destruction, for example those cleanup statements exist:

FreeUPNPUrls(&urls);
freeUPNPDevlist(devlist);

which are not called if the thread is killed before SetupUPNP finished.
It looks like this cleanup is not needed if the thread is killed in that 10 second period however.

Notice that the NetServer destructor can be called many times during the program lifetime, so it shouldn't leak anything.

Also the library needs an update by the looks of things https://www.exploit-db.com/exploits/43501

source/network/NetServer.cpp
260 ↗(On Diff #9355)

Only netServer->m_Shutdown requires the mutex, right? So I'd put that into a single scope to indicate that and optimize the lock.

wraitii updated this revision to Diff 9356.Aug 16 2019, 1:10 PM

so, I think it's better to just detach() and let it live. It'll either clean up on its own or get cleaned up by the OS when exiting, and that's likely fine.

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/399/display/redirect

elexis retitled this revision from Fix rP22666 - pThread 4/7 - NetServer crashes when quitting MP games to Fix rP22666 - pThread 4/7 - NetServer / miniupnp crashes when quitting MP games.Aug 16 2019, 1:17 PM

rP14332 originally it ran in the same thread and froze the UI
rP14370 started to move it to the NetServer worker thread
rP14371 finished to move it to the NetServer worker thread
rP14372 moved it to a new UPNP worker thread and let it running if the NetServer closes

Brief discussions about those commits on http://irclogs.wildfiregames.com/2013-12/2013-12-17-QuakeNet-%230ad-dev.log

For rP14370:

14:23 < Josh> leper: Since the server runs in it's own thread I just cheated and made the UPnP search run in that thread as well
14:27 <@leper> Josh: it did block in SetupConnection() right?
14:27 <@leper> so it'll block in the new function too if you call both from the same place

for rP14371:

15:53 <@leper> Josh: your last commit breaks it
15:53 < Josh> leper: Breaks what?
15:53 <@leper> (presumably because the "server" isn't running when the game tries to connect)
15:53 <@leper> so I get "couldn't connect to server" when I try to host a game

For rP14372:

15:57 <@leper> I reverted 14371 to make sure it is the issue and I don't get the bug with *70

So the issue how to sanely end the miniupnpc function prior to the timeout exists since introduction, and the thread additions didn't help with fixing that (or even considering it?).

I agree to commit this in order to fix the crash asap and reflect the previous broken code behavior, so that possible leak fixes are in a separate commit (+1 savestate).

I did some testing with wireshark to see packets, the function with the 10 second delay in fact runs for 40 seconds for me, because it sends 4 requests with 10 second delay in between.
After that the further statements consume further seconds.

I've also used this patch to add debug output.

Before a commit I'd like to know whether it's true that the previous threading code was still leaking memory (pthread resources?)
in case the free functions were executed by the thread, but pthread_detach() wasn't called indicating this to be the expected behavior.

Because then rP22666 was only revealing what was already broken before, just more visibly broken with std::thread.

It's supposedly detectable with valgrind:
https://stackoverflow.com/questions/17642433/why-pthread-causes-a-memory-leak

source/network/NetServer.cpp
162 ↗(On Diff #9356)

There was a pthread_detach() missing in rP14372? Consequently the resources were leaked?

http://man7.org/linux/man-pages/man3/pthread_detach.3.html

When a detached thread terminates, its resources are

automatically released back to the system without the need for
another thread to join with the terminated thread.
164 ↗(On Diff #9356)

this means we won't clean up resources nicely in SetupUPnP if we exit too soon is not accurate, unless the statement is restricted to std::terminate case.

Whether this is an issue should be examined.

Redunant with "TODO"

266 ↗(On Diff #9356)

All these early returns were introduced in rP14370, which to me looks like a leak, but I might be wrong.

elexis retitled this revision from Fix rP22666 - pThread 4/7 - NetServer / miniupnp crashes when quitting MP games to Fix rP14372 and rP22666 miniupnpc leak and crash on thread exit.Aug 16 2019, 4:47 PM
elexis edited the summary of this revision. (Show Details)

Also the entire thing should become optional (default.cfg)!!

I reverted rP22666 locally and confirmed that the thread is still running after having closed the gamesetup page with the previous code. It still sends packets when one clicked on it once, and it sends twice as many packets if one opens the page twice and gets back to main menu immediately.

Also I tried with this diff

Index: source/network/scripting/JSInterface_Network.cpp
===================================================================
--- source/network/scripting/JSInterface_Network.cpp	(revision 22672)
+++ source/network/scripting/JSInterface_Network.cpp	(working copy)
@@ -56,10 +56,22 @@ void JSI_Network::StartNetworkHost(Scrip
 	ENSURE(!g_NetClient);
 	ENSURE(!g_NetServer);
 	ENSURE(!g_Game);
 
 	// Always use lobby authentication for lobby matches to prevent impersonation and smurfing, in particular through mods that implemented an UI for arbitrary or other players nicknames.
+	std::map<u16, CNetServer> servers;
+	for (int i = 1; i < 100; ++i)
+	{
+		servers.emplace(serverPort + i, static_cast<bool>(g_XmppClient));
+		if (!servers[i].SetupConnection(serverPort))
+		{
+			pCxPrivate->pScriptInterface->ReportError("Failed to start server");
+			SAFE_DELETE(g_NetServer);
+			return;
+		}
+	}
+
 	g_NetServer = new CNetServer(static_cast<bool>(g_XmppClient));
 	if (!g_NetServer->SetupConnection(serverPort))
 	{
 		pCxPrivate->pScriptInterface->ReportError("Failed to start server");
 		SAFE_DELETE(g_NetServer);

And that indeed sends 100 times as many SSDP requests as expected.

But I still dont know if the missing pthread_join / pthread_detach by itself causes a leak even if the inofficially detached thread would still clean its used variables.

So I ran valgrind:

valgrind --leak-check=full ./pyrogenesis

It identified those miniupnpc leaks

==15080== 148 bytes in 1 blocks are definitely lost in loss record 2,224 of 3,227
==15080==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==15080==    by 0x630E7E5: ssdpDiscoverDevices (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x630973A: upnpDiscover (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x191253: CNetServerWorker::SetupUPnP(void*) (NetServer.cpp:252)
==15080==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)


==15080== 36 bytes in 1 blocks are definitely lost in loss record 788 of 3,227
==15080==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==15080==    by 0x671A36E: strdup (in /usr/lib/libc-2.29.so)
==15080==    by 0x63097EC: GetUPNPUrls (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309BE4: UPNP_GetValidIGD (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x191282: CNetServerWorker::SetupUPnP(void*) (NetServer.cpp:257)
==15080==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)

=15080== 26 bytes in 1 blocks are definitely lost in loss record 243 of 3,227
==15080==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==15080==    by 0x6308D5E: ??? (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x630980C: GetUPNPUrls (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309BE4: UPNP_GetValidIGD (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x191282: CNetServerWorker::SetupUPnP(void*) (NetServer.cpp:257)
==15080==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)
==15080== 
==15080== 26 bytes in 1 blocks are definitely lost in loss record 244 of 3,227
==15080==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==15080==    by 0x6308D5E: ??? (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309825: GetUPNPUrls (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309BE4: UPNP_GetValidIGD (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x191282: CNetServerWorker::SetupUPnP(void*) (NetServer.cpp:257)
==15080==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)
==15080== 
==15080== 26 bytes in 1 blocks are definitely lost in loss record 245 of 3,227
==15080==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==15080==    by 0x6308D5E: ??? (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x630983E: GetUPNPUrls (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309BE4: UPNP_GetValidIGD (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x191282: CNetServerWorker::SetupUPnP(void*) (NetServer.cpp:257)
==15080==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)
==15080== 
==15080== 26 bytes in 1 blocks are definitely lost in loss record 246 of 3,227
==15080==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==15080==    by 0x6308D5E: ??? (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309857: GetUPNPUrls (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x6309BE4: UPNP_GetValidIGD (in /usr/lib/libminiupnpc.so.17)
==15080==    by 0x191282: CNetServerWorker::SetupUPnP(void*) (NetServer.cpp:257)
==15080==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)
==15080==

And another only for pthread:

==15080== 480 bytes in 1 blocks are possibly lost in loss record 2,784 of 3,227
==15080==    at 0x483AB65: calloc (vg_replace_malloc.c:752)
==15080==    by 0x4012AC1: allocate_dtv (in /usr/lib/ld-2.29.so)
==15080==    by 0x4013431: _dl_allocate_tls (in /usr/lib/ld-2.29.so)
==15080==    by 0x66781AD: pthread_create@@GLIBC_2.2.5 (in /usr/lib/libpthread-2.29.so)
==15080==    by 0x190EA0: CNetServerWorker::SetupConnection(unsigned short) (NetServer.cpp:208)
==15080==    by 0x19A84C: CNetServer::SetupConnection(unsigned short) (NetServer.cpp:1591)
==15080==    by 0x5F1561: JSI_Network::StartNetworkHost(ScriptInterface::CxPrivate*, CStrW const&, unsigned short, CStr8 const&) (JSInterface_Network.cpp:62)
==15080==    by 0x5F2BE0: call<void (ScriptInterface::CxPrivate *, const CStrW &, unsigned short, const CStr8 &), CStrW, unsigned short, CStr8> (NativeWrapperDefns.h:86)
==15080==    by 0x5F2BE0: bool ScriptInterface::call<void, CStrW, unsigned short, CStr8, &JSI_Network::StartNetworkHost>(JSContext*, unsigned int, JS::Value*) (NativeWrapperDefns.h:125)
==15080==    by 0x50969E7: CallJSNative (jscntxtinlines.h:235)
==15080==    by 0x50969E7: js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (Interpreter.cpp:444)
==15080==    by 0x508C4DB: Interpret(JSContext*, js::RunState&) (Interpreter.cpp:2766)
==15080==    by 0x5096656: js::RunScript(JSContext*, js::RunState&) (Interpreter.cpp:391)
==15080==    by 0x509691C: js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (Interpreter.cpp:462)

So to me it seems like yes, this patch fixes both the crash from rP22666 and the pthread leak from rP14372.

source/network/NetServer.cpp
162 ↗(On Diff #9356)

(Valgrind evidence says yes)

164 ↗(On Diff #9356)

The TODO was done now

elexis added inline comments.Aug 16 2019, 5:08 PM
source/network/NetServer.cpp
164 ↗(On Diff #9356)

if joinable() == false or an error occurs.

https://en.cppreference.com/w/cpp/thread/thread/detach

I thought we decided on IRC that SetupConnection might not have been called, thus checked? It's missing then, is it not? Indeed it does crash if calling this after the default ctor, I tried.

Should this be in the if statement above?

wraitii added inline comments.Aug 16 2019, 5:12 PM
source/network/NetServer.cpp
164 ↗(On Diff #9356)

It should indeed, I sort of assumed that one could detach an un joinable thread, but apparently not.

elexis accepted this revision.Aug 16 2019, 6:06 PM

Given that the remaining leak is to be addressed in D2183, you can safely remove the TODO.
I suggest to check for joinable() since it definitely crashes if the thing has not been started.

This revision is now accepted and ready to land.Aug 16 2019, 6:06 PM

Hold on, actually I still get this after applying mine and yours:

==28068== 480 bytes in 1 blocks are possibly lost in loss record 2,787 of 3,234
==28068==    at 0x483AB65: calloc (vg_replace_malloc.c:752)
==28068==    by 0x4012AC1: allocate_dtv (in /usr/lib/ld-2.29.so)
==28068==    by 0x4013431: _dl_allocate_tls (in /usr/lib/ld-2.29.so)
==28068==    by 0x66781AD: pthread_create@@GLIBC_2.2.5 (in /usr/lib/libpthread-2.29.so)
==28068==    by 0x63F2FA9: __gthread_create (gthr-default.h:663)
==28068==    by 0x63F2FA9: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (thread.cc:135)
==28068==    by 0x1CC6D0: thread<void (&)()> (thread:130)
==28068==    by 0x1CC6D0: CNetServerWorker::SetupConnection(unsigned short) (NetServer.cpp:211)
==28068==    by 0x6C50E4: JSI_Network::StartNetworkHost(ScriptInterface::CxPrivate*, CStrW const&, unsigned short, CStr8 const&) (JSInterface_Network.cpp:62)
==28068==    by 0x6C7666: call<void(ScriptInterface::CxPrivate*, const CStrW&, short unsigned int, const CStr8&), CStrW, short unsigned int, CStr8> (NativeWrapperDefns.h:86)
==28068==    by 0x6C7666: bool ScriptInterface::call<void, CStrW, unsigned short, CStr8, &JSI_Network::StartNetworkHost>(JSContext*, unsigned int, JS::Value*) (NativeWrapperDefns.h:125)
==28068==    by 0x50969E7: CallJSNative (jscntxtinlines.h:235)
==28068==    by 0x50969E7: js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (Interpreter.cpp:444)
==28068==    by 0x508C4DB: Interpret(JSContext*, js::RunState&) (Interpreter.cpp:2766)
==28068==    by 0x5096656: js::RunScript(JSContext*, js::RunState&) (Interpreter.cpp:391)
==28068==    by 0x509691C: js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (Interpreter.cpp:462)

And one ScriptRuntime leak in this file, but sounds less related:

==28068== 32 bytes in 1 blocks are definitely lost in loss record 519 of 3,234
==28068==    at 0x483877F: malloc (vg_replace_malloc.c:299)
==28068==    by 0x50ACA89: js_malloc (Utility.h:221)
==28068==    by 0x50ACA89: js_new<nspr::Thread, std::nullptr_t, std::nullptr_t, bool> (Utility.h:333)
==28068==    by 0x50ACA89: PR_GetCurrentThread() (PosixNSPR.cpp:146)
==28068==    by 0x50BBF96: JSRuntime::init(unsigned int, unsigned int) (Runtime.cpp:265)
==28068==    by 0x4EEF74E: JS_NewRuntime(unsigned int, unsigned int, JSRuntime*) (jsapi.cpp:469)
==28068==    by 0x32ED9A: ScriptRuntime::ScriptRuntime(std::shared_ptr<ScriptRuntime>, int, int) (ScriptRuntime.cpp:117)
==28068==    by 0x327CF1: construct<ScriptRuntime, std::shared_ptr<ScriptRuntime>&, int&, int&> (new_allocator.h:147)
==28068==    by 0x327CF1: construct<ScriptRuntime, std::shared_ptr<ScriptRuntime>&, int&, int&> (alloc_traits.h:484)
==28068==    by 0x327CF1: _Sp_counted_ptr_inplace<std::shared_ptr<ScriptRuntime>&, int&, int&> (shared_ptr_base.h:548)
==28068==    by 0x327CF1: __shared_count<ScriptRuntime, std::allocator<ScriptRuntime>, std::shared_ptr<ScriptRuntime>&, int&, int&> (shared_ptr_base.h:679)
==28068==    by 0x327CF1: __shared_ptr<std::allocator<ScriptRuntime>, std::shared_ptr<ScriptRuntime>&, int&, int&> (shared_ptr_base.h:1344)
==28068==    by 0x327CF1: shared_ptr<std::allocator<ScriptRuntime>, std::shared_ptr<ScriptRuntime>&, int&, int&> (shared_ptr.h:359)
==28068==    by 0x327CF1: allocate_shared<ScriptRuntime, std::allocator<ScriptRuntime>, std::shared_ptr<ScriptRuntime>&, int&, int&> (shared_ptr.h:702)
==28068==    by 0x327CF1: make_shared<ScriptRuntime, std::shared_ptr<ScriptRuntime>&, int&, int&> (shared_ptr.h:718)
==28068==    by 0x327CF1: ScriptInterface::CreateRuntime(std::shared_ptr<ScriptRuntime>, int, int) (ScriptInterface.cpp:881)
==28068==    by 0x1D3920: CNetServerWorker::Run() (NetServer.cpp:388)
==28068==    by 0x63F2CD3: execute_native_thread_routine (thread.cc:80)
==28068==    by 0x667757E: start_thread (in /usr/lib/libpthread-2.29.so)
==28068==    by 0x678B0E2: clone (in /usr/lib/libc-2.29.so)
elexis added a comment.EditedAug 16 2019, 7:36 PM

When using join(), the leak warning vanishes, so I was wrong, the // TODO should not be deleted, but it should say // TODO: Stop leaking the std::thread by using join() or "possibly leaking"

This revision was automatically updated to reflect the committed changes.