The Unicode code point for ë is U+00EB.
The UTF-8 specification states that the maximum value of a single-byte UTF-8 character is 0x7F (0111 1111).
0xEB is greater, so ë has to be encoded in UTF-8 as a multi-byte sequence, specifically the two bytes 0xC3 0xAB.
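To make the arithmetic concrete, here is a minimal sketch of how a code point is split across UTF-8 bytes. EncodeUtf8 is a hypothetical helper written for illustration only; it is not part of 0ad or jsonspirit, and it does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value; surrogates are not rejected.
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp <= 0x7F)
    {
        // Single byte: 0xxxxxxx
        out += static_cast<char>(cp);
    }
    else if (cp <= 0x7FF)
    {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0xFFFF)
    {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00EB (1110 1011), the top two bits go into the lead byte (0xC0 | 0x03 = 0xC3) and the low six bits into the continuation byte (0x80 | 0x2B = 0xAB).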
We load JSON files into a string of the std::string type. This stringified JSON then gets passed to a parser to be turned into an actual object with keys and values. In main 0ad, this is handled by SpiderMonkey.
In Atlas, it is handled by the function ConvertNode(^1), which feeds it through the third-party jsonspirit(^2) library. When this function encounters a node that contains a string value, it runs the following line to store the value returned from jsonspirit in an AtObj:
obj->value = std::wstring(node.get_str().begin(), node.get_str().end());
node.get_str() returns a std::string. However, obj->value requires a std::wstring. Hence the conversion.
However, the conversion is done naively, without respect for the rules of UTF-8. From what I can tell, each char in node.get_str() gets converted to its own wchar_t. This works fine for single-byte UTF-8 characters. For a multi-byte UTF-8 character, however, all the bytes should be decoded together into a single code point stored in one wchar_t. That doesn't happen; instead, each byte is put into a separate (albeit consecutive) wchar_t.
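The difference can be sketched as follows. NaiveWiden mirrors the iterator-range construction used in the line above. DecodeUtf8 is a hypothetical decoder written for illustration only (it is not the fix applied in this revision). It assumes well-formed input, does no validation, and returns std::u32string so that the result does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <string>

// Same construction as the line in ConvertNode: every char becomes its own wchar_t.
std::wstring NaiveWiden(const std::string& s)
{
    return std::wstring(s.begin(), s.end());
}

// Hypothetical decoder (illustration only): assumes well-formed UTF-8, no validation.
std::u32string DecodeUtf8(const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0; // number of continuation bytes expected
        char32_t cp = lead;    // single-byte case: the byte is the code point
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }

        // Fold the low six bits of each continuation byte into the code point.
        for (std::size_t k = 0; k < extra && i + 1 < s.size(); ++k)
        {
            ++i;
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        ++i;
        out += cp;
    }
    return out;
}
```

Given the two bytes 0xC3 0xAB, NaiveWiden returns a string of length two, while DecodeUtf8 returns a single element holding U+00EB.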
The solution below rewrites AtObj to store strings as std::string instead of std::wstring.
(Historical note: The solution originally used in this revision was to use wxString as an intermediary to convert from std::string to std::wstring in a utf-8 multi-byte aware manner. This was ultimately considered a flawed approach.)
^1 - source/tools/atlas/AtlasObject/AtlasObjectJS.cpp line 44
^2 - https://www.codeproject.com/KB/recipes/JSON_Spirit.aspx