Page MenuHomeWildfire Games

Fix string conversion to/from Spidermonkey
ClosedPublic

Authored by wraitii on Jun 27 2020, 10:40 AM.

Details

Summary

This fixes std::string conversion to and from spider monkey strings.

JS Strings are always 'wide'. They are stored either as Latin-1 with a 'wide' hack, or as proper UTF16.
This is incorrectly handled by the std::string conversion code, which can result in broken string conversions.

Adds tests to fix the broken cases.


Uplifted from D2814

Test Plan

Review code, review tests, notice how strings are now properly converted.

Event Timeline

wraitii created this revision.Jun 27 2020, 10:40 AM
wraitii added inline comments.
source/scriptinterface/tests/test_ScriptConversions.h
153

As written on D2814:

This one is tricky. s2 is utf-8 of cyrillic characters, which are 2 bytes each. Since we modify the string manually to set the 3rd and 4th bytes to 0, we end up from тест to т0ст. The size of the string is still 8. SO the UTF16 conversion, operating on each character, will see each 0 separately and convert them to a full UTF16 zero.

It's thus correct, but really we shouldn't hack std::string like that, the test is a little odd.

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/scriptinterface/tests/test_ScriptConversions.h
|   1| /*·Copyright·(C)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/scriptinterface/tests/test_ScriptConversions.h
|  31| class·TestScriptConversions·:·public·CxxTest::TestSuite
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classTestScriptConversions:' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2534/display/redirect

Itms requested changes to this revision.Jun 27 2020, 12:04 PM

This looks good to me apart from the roundtrip change. Additionally I am fine with believing that SM uses UTF16 internally but what is the source of this affirmation? Is it still true with recent versions of SM?

I am also curious about whether this change allows the proper conversion of chars outside the first Unicode plane. If it works we should add some tests with this kind of chars.

source/scriptinterface/ScriptConversions.cpp
164

Could this proper conversion create a performance issue?

source/scriptinterface/tests/test_ScriptConversions.h
74

I don't understand the point of changing roundtrip? The use you make of the return value is redundant with the value parameter.

153

Indeed the test is a bit surprising, but it is useful as one can make sure that the UTF8 strings are made of 4 or 8 bytes, while the UTF16 strings are made of four wide chars.

255

You should just not use roundtrip here. Just create your script interface for your specific use here and leave the definition of roundtrip untouched. The code of your test_strings test will be easier to understand that way.

This revision now requires changes to proceed.Jun 27 2020, 12:04 PM
In D2838#121714, @Itms wrote:

Additionally I am fine with believing that SM uses UTF16 internally but what is the source of this affirmation? Is it still true with recent versions of SM?

I've found no documentation beyond code comments.
See JSAPI.h:4294

* Flat strings and interned strings are always null-terminated, so
* JS_FlattenString can be used to get a null-terminated string.
*
* Additionally, string characters are stored as either Latin1Char (8-bit)
* or char16_t (16-bit). Clients can use JS_StringHasLatin1Chars and can then
* call either the Latin1* or TwoByte* functions. Some functions like
* JS_CopyStringChars and JS_GetStringCharAt accept both Latin1 and TwoByte
* strings.

and CharacterEncoding.h

/*
 * By default, all C/C++ 1-byte-per-character strings passed into the JSAPI
 * are treated as ISO/IEC 8859-1, also known as Latin-1. That is, each
 * byte is treated as a 2-byte character, and there is no way to pass in a
 * string containing characters beyond U+00FF.
 */

Hence my understanding above that text is always stored "wide" in Spidermonkey, even in the "latin1" encoding.

As far as I can tell, this is still the case upstream.


I am also curious about whether this change allows the proper conversion of chars outside the first Unicode plane. If it works we should add some tests with this kind of chars.

I've tried emojis, but our custom utf8-16 conversion code doesn't really handle them. Not sure about other planes in general.

wraitii updated this revision to Diff 12469.Jun 27 2020, 12:43 PM
wraitii marked an inline comment as done.

Fix roundtrip issue.

I don't think this will work for further unicode planes regardless since our conversion code doesn't handle them.

source/scriptinterface/ScriptConversions.cpp
164

I've wondered. There is actually not a _lot_ of code passing std::string around, and I was thinking that we ought not pass strings in performance-critical code anyways.

source/scriptinterface/tests/test_ScriptConversions.h
74

I think I just got confused at some point.

Successful build - Chance fights ever on the side of the prudent.

Linter detected issues:
Executing section Source...

source/scriptinterface/tests/test_ScriptConversions.h
|   1| /*·Copyright·(C)·2019·Wildfire·Games.
|    | [NORMAL] LicenseYearBear:
|    | License should have "2020" year instead of "2019"

source/scriptinterface/tests/test_ScriptConversions.h
|  31| class·TestScriptConversions·:·public·CxxTest::TestSuite
|    | [MAJOR] CPPCheckBear (syntaxError):
|    | Code 'classTestScriptConversions:' is invalid C code. Use --std or --language to configure the language.
Executing section JS...
Executing section cli...

Link to build: https://jenkins.wildfiregames.com/job/docker-differential/2536/display/redirect

Itms added a comment.Jun 27 2020, 12:49 PM

This looks good, I will test it later! ??

source/scriptinterface/ScriptConversions.cpp
164

Yes, I agree we shouldn't pass too many strings between the engine and JS. However I think there are a lot of implicit conversions of strings in the simulation between interfaces and components. It might be worth checking the impact of this patch, but I believe your fix is correct and needed anyway.

source/scriptinterface/tests/test_ScriptConversions.h
277

the extra block is not needed anymore

Itms requested changes to this revision.Jun 30 2020, 12:13 PM

OK, everything good for me ??

I am requesting three very small changes in the new test:

  • change the name for something more descriptive, see inline comment
  • use another value in the test, see inline comment
  • remove the now unneeded braces
source/scriptinterface/tests/test_ScriptConversions.h
255

What about calling the test test_utf8utf16_conversion, and moving the comment line 262, which describes the test, here at the top

272

You could just create a value v2 for the second part of the test, it's inexpensive and cleaner. It's also more atomic as you are not testing at the same time the correct overwriting of the value.

This revision now requires changes to proceed.Jun 30 2020, 12:13 PM
Itms accepted this revision.Jun 30 2020, 12:16 PM

I'll fix those myself and commit, as proposed by wraitii on IRC.

This revision is now accepted and ready to land.Jun 30 2020, 12:16 PM
This revision was automatically updated to reflect the committed changes.