Paths

Table of Contentst

Differential D2838

Fix string conversion to/from Spidermonkey
ClosedPublic
Actions

Authored by wraitii on Jun 27 2020, 10:40 AM.

Details

Reviewers

Itms

Commits

rP23795: Fix conversion of UTF8 strings between the scripts and the engine.

Summary

This fixes std::string conversion to and from spider monkey strings.

JS Strings are always 'wide'. They are stored either as Latin-1 with a 'wide' hack, or as proper UTF16.
This is incorrectly handled by the std::string conversion code, which can result in broken string conversions.

Adds tests to fix the broken cases.

Uplifted from D2814

Test Plan

Review code, review tests, notice how strings are now properly converted.

Diff Detail

Repository

rP 0 A.D. Public Repository

Lint

Lint OK

Unit

No Unit Test Coverage

Build Status

Buildable 12240
Build 23166: Vulcan Build	Jenkins
Build 23165: Vulcan Build (macOS)	Jenkins
Build 23164: Vulcan Build (Windows)	Jenkins
Build 23163: arc lint + arc unit

Event Timeline

wraitii created this revision.Jun 27 2020, 10:40 AM

wraitii added inline comments.

source/scriptinterface/tests/test_ScriptConversions.h
153	As written on D2814: This one is tricky. s2 is utf-8 of cyrillic characters, which are 2 bytes each. Since we modify the string manually to set the 3rd and 4th bytes to 0, we end up from тест to т0ст. The size of the string is still 8. SO the UTF16 conversion, operating on each character, will see each 0 separately and convert them to a full UTF16 zero. It's thus correct, but really we shouldn't hack std::string like that, the test is a little odd.

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/scriptinterface/tests/test_ScriptConversions.h
|   1| /*·Copyright·(C)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/scriptinterface/tests/test_ScriptConversions.h
|  31| class·TestScriptConversions·:·public·CxxTest::TestSuite
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classTestScriptConversions:' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2534/display/redirect

Harbormaster completed remote builds in B12238: Diff 12467.Jun 27 2020, 11:10 AM

This looks good to me apart from the roundtrip change. Additionally I am fine with believing that SM uses UTF16 internally but what is the source of this affirmation? Is it still true with recent versions of SM?

I am also curious about whether this change allows the proper conversion of chars outside the first Unicode plane. If it works we should add some tests with this kind of chars.

source/scriptinterface/ScriptConversions.cpp
164	Could this proper conversion create a performance issue?
source/scriptinterface/tests/test_ScriptConversions.h
74	I don't understand the point of changing `roundtrip`? The use you make of the return value is redundant with the `value` parameter.
153	Indeed the test is a bit surprising, but it is useful as one can make sure that the UTF8 strings are made of 4 or 8 bytes, while the UTF16 strings are made of four wide chars.
255	You should just not use `roundtrip` here. Just create your script interface for your specific use here and leave the definition of `roundtrip` untouched. The code of your `test_strings` test will be easier to understand that way.

This revision now requires changes to proceed.Jun 27 2020, 12:04 PM

In D2838#121714, @Itms wrote:

Additionally I am fine with believing that SM uses UTF16 internally but what is the source of this affirmation? Is it still true with recent versions of SM?

I've found no documentation beyond code comments.
See JSAPI.h:4294

* Flat strings and interned strings are always null-terminated, so
* JS_FlattenString can be used to get a null-terminated string.
*
* Additionally, string characters are stored as either Latin1Char (8-bit)
* or char16_t (16-bit). Clients can use JS_StringHasLatin1Chars and can then
* call either the Latin1* or TwoByte* functions. Some functions like
* JS_CopyStringChars and JS_GetStringCharAt accept both Latin1 and TwoByte
* strings.

and CharacterEncoding.h

/*
 * By default, all C/C++ 1-byte-per-character strings passed into the JSAPI
 * are treated as ISO/IEC 8859-1, also known as Latin-1. That is, each
 * byte is treated as a 2-byte character, and there is no way to pass in a
 * string containing characters beyond U+00FF.
 */

Hence my understanding above that text is always stored "wide" in Spidermonkey, even in the "latin1" encoding.

As far as I can tell, this is still the case upstream.

I am also curious about whether this change allows the proper conversion of chars outside the first Unicode plane. If it works we should add some tests with this kind of chars.

I've tried emojis, but our custom utf8-16 conversion code doesn't really handle them. Not sure about other planes in general.

wraitii mentioned this in D2814: Allow remapping hotkeys from inside the game / fix QWERTY hotkeys.Jun 27 2020, 12:26 PM

wraitii added a child revision: D2814: Allow remapping hotkeys from inside the game / fix QWERTY hotkeys.

Fix roundtrip issue.

I don't think this will work for further unicode planes regardless since our conversion code doesn't handle them.

source/scriptinterface/ScriptConversions.cpp
164	I've wondered. There is actually not a _lot_ of code passing `std::string` around, and I was thinking that we ought not pass strings in performance-critical code anyways.
source/scriptinterface/tests/test_ScriptConversions.h
74	I think I just got confused at some point.

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/scriptinterface/tests/test_ScriptConversions.h
|   1| /*·Copyright·(C)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/scriptinterface/tests/test_ScriptConversions.h
|  31| class·TestScriptConversions·:·public·CxxTest::TestSuite
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classTestScriptConversions:' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2536/display/redirect

This looks good, I will test it later! ??

source/scriptinterface/ScriptConversions.cpp
164	Yes, I agree we shouldn't pass too many strings between the engine and JS. However I think there are a lot of implicit conversions of strings in the simulation between interfaces and components. It might be worth checking the impact of this patch, but I believe your fix is correct and needed anyway.
source/scriptinterface/tests/test_ScriptConversions.h
277	the extra block is not needed anymore

Harbormaster completed remote builds in B12240: Diff 12469.Jun 27 2020, 1:04 PM

OK, everything good for me ??

I am requesting three very small changes in the new test:

change the name for something more descriptive, see inline comment
use another value in the test, see inline comment
remove the now unneeded braces

source/scriptinterface/tests/test_ScriptConversions.h
255	What about calling the test `test_utf8utf16_conversion`, and moving the comment line 262, which describes the test, here at the top
272	You could just create a value `v2` for the second part of the test, it's inexpensive and cleaner. It's also more atomic as you are not testing at the same time the correct overwriting of the value.