FFI::MemoryPointer.new(:uint8, 1200)
Note
|
This is also published as a guest post at the Opal website. Thank you and a great hug to the Opal team! |
Background
At Ribose we develop Interscript, an open source Ruby implementation of interoperable transliteration schemes from ALA-LC, BGN, PCGN, ICAO, ISO, UN (by UNGEGN) and many, many other script conversion system authorities. The goal of this project is to achieve interoperable transliteration schemes allowing quality comparisons.
We needed to port the Interscript runtime to JavaScript using Opal (the Ruby to JavaScript compiler), so it can be also used in web browsers and Node.js environments.
The problem is that Opal translates Ruby regular expressions (upon which we rely quite heavily) to JavaScript almost verbatim. This made our ported codebase incompatible on principle, so we searched for a better solution.
Unfortunately, Regexp is basically something like a programming language that has more than a dozen of incompatible implementations — even across the web browsers. For instance, we need lookbehind assertions, but even if there is a new standard in ECMAScript which adds lookbehind assertions, Safari doesn’t implement that.
Given all this context let’s dive into how we ported the original Ruby Regexp engine to the browser!
Onigmo
We started by trying to compile Onigmo with WebAssembly.
Onigmo is a Regexp engine used by Ruby. It is a fork of Oniguruma, which is also in use by PHP and a few more programming languages. Fortunately, it was possible to compile it to a static WebAssembly module which can be interfaced with the JavaScript land.
We tried compiling Onigmo using a simple handcrafted libc
with no memory
management so as to reduce the size, but this plan backfired, and rightfully so!
Now we use wasi-libc. WASI stands for WebAssembly System Interface, and is designed to provide “a wide array of POSIX-compatible C APIs”.
The library is made to be able to work with both wasi-libc
and the handcrafted
libc
, but use of wasi-libc
is highly encouraged. As we are concerned about the
output size of the resulting WASM binaries, we chose not to use
Emscripten, just
the upstream LLVM/Clang and its WASM target.
Opal-WebAssembly
After getting Onigmo working, we noted, that the WebAssembly interface doesn’t map 100% between C and JS. We can’t pass strings verbatim and we need to do memory management (think: pointers). Is there a better solution for that than writing an Opal library to interface WebAssembly libraries?
The solution we came up with is opal-webassembly.
This library is divided in two parts:
-
a simple WebAssembly interface
-
a Ruby-FFI compatible binding that works on everything memory-related and brings C functions to seamlessly work with the Ruby (Opal, that is) workflow.
This library can be used in more advanced use cases beyond Interscript. Its interface is rather compatible with Ruby-FFI allowing C API bindings across all Ruby implementations. There are some minor incompatibilities though.
Ruby-FFI assumes a shared memory model. WebAssembly has different memory spaces for a calling process and each library (think about something like a segmented memory). This makes some assumptions false.
For instance, for the following code, we don’t know which memory space to use:
This requires us to use a special syntax, like:
LibraryName.context do
FFI::MemoryPointer.new(:uint8, 1200)
end
This context call makes it clear that we want this memory to be allocated in the
LibraryName
space.
Another thing is that a call like the following:
FFI::MemoryPointer.from_string("Test string")
Would not allocate the memory, but share the memory between the calling process and the library. In opal-webassembly we must allocate the memory, as sharing is not an option in the WASM model.
Now, another issue comes into play. In regular Ruby a call similar to this should allocate the memory and clear it later, once the object is destroyed. In our case, we can’t really access JavaScript’s GC. This means we always need to free the memory ourselves.
Due to some Opal inadequacies, we are unable interface floating-point fields in structs. This doesn’t happen in Onigmo, but if needed in the future, a pack/unpack implementation for those will be needed.
The Chromium browser doesn’t allow us to load WebAssembly modules larger than 4KB synchronously. This means that we had to implement some methods for awaiting the load. This also means, that in the browser we can’t use the code in a following way:
<script src='file.js'></script>
<script>
Opal.Library.$new();
</script>
This approach works in Node.js and possibly in other browsers, but Chromium requires us to use promises:
<script src='file.js'></script>
<script>
Opal.WebAssembly.$wait_for("library-wasm").then(function() {
Opal.Library.$new();
});
</script>
There are certain assumptions of how a library should be loaded on Opal side — the FFI library creation depends on the WebAssembly module being already loaded,
so we need to either move those definitions to a wait_for
block or move require
directives, like so:
WebAssembly.wait_for "onigmo/onigmo-wasm" do
require 'interscript'
require 'my_application_logic'
end
Opal-Onigmo
After having a nice library (opal-webassembly) to bind with WebAssembly modules, writing an individual binding was very easy and the resulting code looks (in my opinion) very cool.
Our initial plan assumed upstreaming the code later on, but on further consideration it might not be the correct choice for Opal. This library should stay as a separate gem for a couple of reasons.
The resulting work is opal-onigmo, available on GitHub.
First, due to memory issues, we aren’t able to make it work as a drop-in
replacement. We need to manually call an #ffi_free
method.
For example:
re = Onigmo::Regexp.new("ab+")
# use the regular expression
re.ffi_free # free it afterwards and not use it anymore
At early stages our implementation of Opal-Onigmo we didn’t consider the memory a problem. When hit with a real world scenario, we found out, that it’s a severe issue and needs to be dealt with. As far as we know, the library doesn’t leak any memory if the regular expression memory is managed correctly.
The second is that after all, we don’t really have a way of caching the compiled Regexps. Furthermore, Onigmo compiled with WASM may not be as performant as the highly optimized JS regexp engine. In this case it’s much better to leave it as a drop-in replacement for those who need more correctness.
Opal-Onigmo doesn’t implement all the methods for Ruby Regexp, it was mostly
meant for completion of the Interscript project, but can be extended beyond. It
implements a few methods it needs to implement for String (this is just an
option - you need to load onigmo/core_ext manually), but most of the existing
ones work without a problem. We implemented a Regexp.exec
(JavaScript) method,
and the rest of Opal happened to mostly interface with it. At the current time
we know that String#split
won’t "just" work, but
String#{index,rindex,partition,rpartition}
should.
Opal-Onigmo depends on the strings being coded as UTF-16. There are two reasons to that:
-
Opal includes methods for getting the binary form of strings in various encodings, but only methods for UTF-16 are valid for characters beyond the Basic Multilingual Plane (Unicode 0x0000 to 0xffff) which are used in 2 maps.
-
JavaScript uses UTF-16 strings internally.
Interscript
Finally by using opal-onigmo
, the Opal-generated code passes all the tests
(not counting transliterating Thai scripts which ultimately depends on an
external process, which relies on machine learning).
To optimize it, we use opal-onigmo
only when the regexp is a more complex
regexp, otherwise we fall back to an (ultimately faster) JavaScript regexp
engine:
def mkregexp(regexpstring)
@cache ||= {}
if s = @cache[regexpstring]
if s.class == Onigmo::Regexp
# Opal-Onigmo stores a variable "lastIndex" mimicking the JS
# global regexp. If we want to reuse it, we need to reset it.
s.reset
else
s
end
else
# JS regexp is more performant than Onigmo. Let's use the JS
# regexp wherever possible, but use Onigmo where we must.
# Let's allow those characters to happen for the regexp to be
# considered compatible: ()|.*+?{} ** BUT NOT (? **.
if /[\\$^\[\]]|\(\?/.match?(regexpstring)
# Ruby caches its regexps internally. We can't GC. We could
# think about freeing them, but we really can't, because they
# may be in use.
@cache[regexpstring] = Onigmo::Regexp.new(regexpstring)
else
@cache[regexpstring] = Regexp.new(regexpstring)
end
end
end
It also never frees the Regexps (see a previous note about #ffi_free
), because
we never know if a Regexp won’t be in use later on (and the Regexps are actually
cached in a Hash for performance reasons). The issue about dangling Regexps can
be worked out in the future, but the JS API will need to change again.
We would need to do something like:
Opal.Interscript.$with_a_map("map-name", function() {
// do some work with a map
});
This call would at the beginning allocate all the Regexps needed, and at the end, free them all. The good news is that we would be able to somehow integrate loading transliteration maps from the network (along with dependencies) with such a construct.
The future
Post writing this article we noted that JavaScript actually does implement a construct that would work like a destructor, allowing us to free the allocated memory dynamically. Unfortunately, that is the latest ECMAScript addition, which means there are still environments that don’t support it (Safari) and there is one that needs an explicit flag (Node 13+).
We could use it to implement some parts of ObjectSpace of Ruby and then use it
in opal-webassembly
to free memory on demand.
Postscript
This article was written long before it was published. Since then, Interscript has been rewritten in a different architecture and does not relies on Opal.
While we no longer use Regexps directly, we have created a higher-level (Ruby) DSL to describe the transliteration process that we compile directly to a highly-optimized pure Ruby/JavaScript code (and it can be extended to other languages as well).
Ribose still uses Opal in other projects, for example to build the
latexmath
gem, a library that compiles
LaTeX math expressions into MathML, as a JavaScript library. We also contribute
fixes back to the upstream Opal project.
For the Opal project, this effort serves as an interesting experiment to establish further guidelines should we decide to increase Regexp compatibility in the future and can serve as a useful tool for anyone wanting to port his Ruby codebase with a heavy regexp use to JavaScript. It should also facilitate porting libraries that utilize Ruby-FFI.
The libraries we created are available under a 2-clause BSD license in the following repositories:
-
https://github.com/interscript/Onigmo - Onigmo port to WebAssembly
-
https://github.com/interscript/opal-onigmo - the Onigmo interface to Opal
-
https://github.com/interscript/opal-webassembly - the FFI-like interface to Opal, using WebAssembly
-
https://github.com/interscript/interscript/tree/v1 - the obsolete v1 branch of Interscript that used Opal and Opal-Onigmo
Enjoy Opaling!