Page MenuHomeWildfire Games

Fix TLS Segfault on various mac versions
ClosedPublic

Authored by Stan on Feb 2 2019, 7:36 PM.

Details

Reviewers
wraitii
Group Reviewers
Restricted Owners Package(Owns No Changed Paths)
macOS Developers
Commits
rP22212: Fix TLS Segfault on various mac versions
Summary

Currently a random number of people on Mac have reported being unable to join the lobby with TLS enabled and it causing a crash. This patches fixes it.

Currently it's only been tried with @Tobbi and it works

Discussion here: https://wildfiregames.com/forum/index.php?/topic/25120-macos-osx-rc-bundles/&page=2&tab=comments#comment-368603

Test Plan

Make sure it's the proper fix.

Diff Detail

Repository
rP 0 A.D. Public Repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Stan created this revision.Feb 2 2019, 7:36 PM
Vulcan added a subscriber: Vulcan.Feb 2 2019, 7:37 PM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/1028/

wraitii accepted this revision.Apr 13 2019, 11:14 AM
wraitii added a subscriber: wraitii.

enable-fit doesn't seem to be an option: https://gmplib.org/manual/Build-Options.html

This revision is now accepted and ready to land.Apr 13 2019, 11:14 AM
wraitii requested changes to this revision.Apr 13 2019, 11:15 AM

Sorry didn't mean to accept here :p

Also if there's an issue it would be worth reporting it upstream.

This revision now requires changes to proceed.Apr 13 2019, 11:15 AM
Stan added a comment.Apr 13 2019, 4:12 PM

Well good thing that option doesn't exist because this means only enable-fat is needed.
I'm not sure that's an issue upstream more us supporting too many platforms with one build.

Worth noting the bundle with this fix fixes the problem for everyone.

Stan updated this revision to Diff 7753.Apr 15 2019, 7:58 AM

Successful build - Chance fights ever on the side of the prudent.

Link to build: https://jenkins.wildfiregames.com/job/differential/1216/display/redirect

wraitii accepted this revision.Apr 15 2019, 1:39 PM

I don't actually care enough to understand why this is required.

This revision is now accepted and ready to land.Apr 15 2019, 1:39 PM
Itms added a subscriber: Itms.Apr 15 2019, 2:05 PM

I don't actually care enough to understand why this is required.

If you don't, you shouldn't accept the diff... There is no rush in accepting this patch, if you want it in, you should actually review it; if you don't have time or don't care, it's not a problem at all, just let the diff waiting on a review.

Stan added a comment.Apr 15 2019, 2:17 PM

Yeah, it would be nice to have it for A24 but that's all.

Crashes preventing access to the lobby feel like a blocker to me, and if this fixes it that's enough to get it committed _ as far as I am concerned _ . It's a small fix in terms of code debt, easy to git/svn-blame, and won't change much for our end-users.
The issue might be in our code (and it might not) but I don't have time to debug it properly and/or won't do it anyways as I already have plenty of things I care to do on this project.

I'm accepting this as I believe it's worth having. I am not committing it nor closing the diff. If somebody wants to come in and do the hard work, feel free. We can even commit this and create a "ticket" or something along those lines for the future... But then again I'm no longer active on trac.

This revision was automatically updated to reflect the committed changes.
Owners added a subscriber: Restricted Owners Package.Apr 22 2019, 11:50 PM
elexis added a subscriber: elexis.Apr 23 2019, 1:10 PM

I'm accepting this as I believe it's worth having. I am not committing it nor closing the diff. If somebody wants to come in and do the hard work, feel free.

Then you should press the Like button instead of the Accept button.

Crashes preventing access to the lobby feel like a blocker to me

Agree

if this fixes it that's enough to get it committed _ as far as I am concerned _ .

So you acknowledge that the commit may actually be wrong and for instance obfuscate instead of resolve the issue, or fix the issue in one case but still come with a configuration that we don't want, or possibly even breaking under some circumstances, but you are just not concerned about that?

It's a small fix in terms of code debt, easy to git/svn-blame

Agree

won't change much for our end-users.

How can you tell if you didn't review it?

The issue might be in our code (and it might not) but I don't have time to debug it properly and/or won't do it anyways as I already have plenty of things I care to do on this project.

I agree with Itms, if you don't have the time or will to do something properly, then don't do it.
The bug needs to be fixed, therefore we require someone to do it as properly as they can.

We can even commit this and create a "ticket" or something along those lines for the future

Step 1: Report task that requires consideration
Step 2: Propose solution that requires consideration
Step 3: Commit proposed solution without consideration
Step 4: Go back to step 1.

This way one creates a lot of duplicate tickets and a lot of patches that revise patches that revise patches that revise patches that could have been done correctly the first time.

I'm actually not sure whether you're actually a negligent as you're justifying here and in other places. Remember the upgrade feature, it was reviewed and tested by multiple people, but you said you don't care whether it was tested or had defects. But you also fixed some defects that were found later if I'm not mistaken. So you sold yourself under value. Presentation matters.

Anyhow, about the patch, what should be done is looking up the specifications of the arguments changed, looking up their ramifications, and then arguing based on the ramifications.
It's hard work, but justifiably much work.

Mostly I'm wondering what Itms had to say about this patch, as Stan, Itms and Tobbi had created the release candidate, but this patch was somehow not committed with the others.

Stan added a comment.Apr 23 2019, 1:47 PM
In D1772#75866, @elexis wrote:

Mostly I'm wondering what Itms had to say about this patch, as Stan, Itms and Tobbi had created the release candidate, but this patch was somehow not committed with the others.

What I can say is that we ignored this issue, because I failed to communicate on this issue. The last time we talked about it with Itms he told me he thought it was the same bug as the Fedora lobby bug, hence it would have been of an unknown cause. Since we had the disable option workaround, it was decided we would skip it. It kinda made Tobbi angry at the time.

So, I spent a few days trying to figure out why GMP would segfault on other macs and what it seem randomly. I tried a bunch of flags until I had a release pipeline (a script that I only need to press enter to create bundles) set up, and then started asking mac users to test them. I read, reread, and rereread the doc until I found potential options that could fix it.

Since the crash was about a missing function I assumed that it was because it was optimized somehow for my CPU on my build, and any different enough CPU would just break.

The --enable-fat selects a “fat binary” build on x86, where optimized low level subroutines are chosen at runtime according to the CPU detected. This means more code, but gives good performance on all x86 chips. (This option might become available for more architectures in the future.)

More code meant less likely that the function would not exist. and it worked.

Itms added a comment.Apr 23 2019, 1:49 PM
In D1772#75866, @elexis wrote:

Mostly I'm wondering what Itms had to say about this patch, as Stan, Itms and Tobbi had created the release candidate, but this patch was somehow not committed with the others.

It's because this patch was found and written by Stan after the re-release. The bug it fixes was encountered by Tobbi with the RC but not by others; I looked for a fix but didn't find one, so we decided to keep it this way. You should ask Stan if you want to know how he later pinpointed the flag as solving the bug.

The issue might be in our code (and it might not) but I don't have time to debug it properly and/or won't do it anyways as I already have plenty of things I care to do on this project.

I agree with Itms, if you don't have the time or will to do something properly, then don't do it.

Yeah I'm not sure why you went ahead. I'm happy it's in, but you could have left it aside if you wanted, and it would have been thoroughly reviewed in the same way as the other macOS TLS patches from the re-release (probably at a moment where it would become high priority, like the A24 release). Did you feel pressured to review it somehow (this is what your message looks like)? Or did it look like a low hanging fruit? I'm interested in knowing that in order to make sure the priority order of reviews in done in a sensible way.

I'm actually not sure whether you're actually a negligent as you're justifying here and in other places. Remember the upgrade feature, it was reviewed and tested by multiple people, but you said you don't care whether it was tested or had defects. But you also fixed some defects that were found later if I'm not mistaken. So you sold yourself under value. Presentation matters.

I agree with this, I don't think you're negligent wraitii, but you constantly write things like if you were. I also remember some commit messages that made harmless commit seem dangerous ? I think you would appear better if you worded things in a more neutral way when it comes to code (for everything that is not code, you are already a very nice person to chat with! ?)

In D1772#75868, @Itms wrote:

The issue might be in our code (and it might not) but I don't have time to debug it properly and/or won't do it anyways as I already have plenty of things I care to do on this project.

I agree with Itms, if you don't have the time or will to do something properly, then don't do it.

Yeah I'm not sure why you went ahead. I'm happy it's in, but you could have left it aside if you wanted, and it would have been thoroughly reviewed in the same way as the other macOS TLS patches from the re-release (probably at a moment where it would become high priority, like the A24 release). Did you feel pressured to review it somehow (this is what your message looks like)? Or did it look like a low hanging fruit? I'm interested in knowing that in order to make sure the priority order of reviews in done in a sensible way.

Well the answer to that question is simple: beyond that it looks like a low hanging fruit (and it honestly does to me), I have 0% confidence that it would have been thoroughly reviewed in the same way as the other macOS TLS patches from the re-release (probably at a moment where it would become high priority, like the A24 release). Based on my experience, patches don't get reviewed, simple as that. I could be wrong about that, and patches that block OSX might actually get reviewed, but even there I have examples to the contrary I could show you.
So as I was active again, I felt like I should say "OK" so that this gets in, instead of either pushing A24 release back by 4 months for no good reason when we actually get to that, or just never getting committed and the issue remaining forever. Because I also don't believe anyone is going to spend the time to look into this properly and actually find something (particularly as it doesn't seem easily reproducible).

On the odd chance that this is actually the fix required, and things never crash again, I feel like this would have been useful time gained. An issue accessing the lobby is a blocker, but that doesn't mean imo that it's worth spending ages on when we could spend ages on other stuff (such as actually fixing the crap performance of the game, which has been my N1 priority for years now, and which I feel is a much bigger problem™, but maybe I'm wrong).

Anyhow, about the patch, what should be done is looking up the specifications of the arguments changed, looking up their ramifications, and then arguing based on the ramifications.

I've done point 1 and 2 before accepting (and in fact noticed that an option didn't exist). So I guess I should indeed just try and communicate better.

So you acknowledge that the commit may actually be wrong and for instance obfuscate instead of resolve the issue, or fix the issue in one case but still come with a configuration that we don't want, or possibly even breaking under some circumstances, but you are just not concerned about that?

It's not that I'm not concerned. It's that I've weighted what I estimate to be the chance of 'it fixes the issue with 0 additional work' vs the chance of 'it actually makes things worse' and I honestly believe the calculation comes out to "overall it should be committed now".

It seems to me you, elexis, much like leper, tend to always assume the worst is going to happen, and have very low risk tolerance (I.E. you almost always prefer to take 0 risks). That's not something I have a problem with by itself. However, it appears I have a (much?) higher risk tolerance, and I thus make different decisions. And so far we haven't reached a consensus, as a team, on what the acceptable risk threshold is, or on what kind of accountability we want.
I guess we've talked about reviews a lot, but this is really about risk tolerance, and the cost of a lower risk tolerance is much lower development pace, which I perceive as a much higher risk to this project's survival. Hell, there have been warning signs: the game has been forked because people felt we were moving too slow.

It's not that I'm not concerned. It's that I've weighted what I estimate to be the chance of 'it fixes the issue with 0 additional work' vs the chance of 'it actually makes things worse' and I honestly believe the calculation comes out to "overall it should be committed now".

The problem is we cannot estimate the risk before we have the facts.
We can assess the risk and weigh the benefits to the disadvantages afterwards.

It seems to me you tend to always assume the worst is going to happen, and have very low risk tolerance (I.E. you almost always prefer to take 0 risks).

You are right, we have to assume that it's broken unless we have proven otherwise, either publicly or only in our own mind.
If we have it in our minds, then it's only a matter of typing it out for the future self, other readers and maintainers.

That's not something I have a problem with by itself. However, it appears I have a (much?) higher risk tolerance, and I thus make different decisions.
so far we haven't reached a consensus, as a team, on what the acceptable risk threshold is

Nor should or can we even decide on that.
You can add speed limits to highways or not, but it still requires that people are driving with precaution.
If they don't, there will be traffic accidents, regardless of the policies.

what kind of accountability we want

The purpose of quality assurance is not satisfying a policy but to furtherance of free software.
If we break code, it's still broken code in need of a fix, regardless whether the policy allowed it or not.
If one author breaks code and would decide to ignore a concern, that person would be unkind regardless of whether it's a policy violation or not.
I'm not the only one who spent years fixing other peoples mistakes that accumulated over years, so it's not a sustainable model to take big risks or negligence.
So the question is whether we want to improve the quality of the codebase in the long term or not.
If we do, then we actually need to put up with the quality standards to achieve that.
If we don't, then we can keep spaghetti code as status quo, the game won't be improved and it will hit the ceiling of what can be achieved quickly.

I guess we've talked about reviews a lot, but this is really about risk tolerance, and the cost of a lower risk tolerance is much lower development pace, which I perceive as a much higher risk to this project's survival.

I've stepped over corpses too, but then we need to clearly point out that these are corpses and that we are guilty of that state, make a plan how they should be revived again.
You should move at the greatest pace possible, but that doesn't justify any negligence.
In particular notice that if you become more familiar with the code you're working on, then you will be able to move at a higher pace in the future while gaining code quality simultaneously.

Encountered this today while working on D2057. I mistakenly removed the flag, thinking it was only macOS "fat binaries" which are combined x86_64 and i386 archs.

In fact, it's more significant for libraries like GMP, because it detects CPU-specific instructions during the build process. If you build on a new CPU, and try to run on an older CPU, there's a good chance it will crash. I'm sure that's what happened before. Here's an example: https://gmplib.org/list-archives/gmp-bugs/2019-May/004549.html

I'm adding a note in the script about why the flag is required now.