> Elevator achieves performance on par with or better than QEMU's user-mode JIT emulation.
I am not sure what QEMU's JIT is doing (in its user-mode wrapper), but I think it has a lot of room for improvement.
In 2013 I wrote an x86-64 to aarch64 JIT engine that could run the then-beta Fedora aarch64 binaries and rebuild almost the entire aarch64 port of Fedora on an x86-64 Linux host. I also made a reverse aarch64 to x86-64 JIT that worked the same way, and for fun I showed the two JITs running each other in loopback fashion: x86-64 -> aarch64 -> x86-64, all in the same process.
The JIT I devised did a 1-to-many instruction and CPU-state mapping, with overhead somewhere around 2x to 5x slower than natively recompiled code. I later compared this with QEMU's JIT, which seemed more in the range of 10x to 50x slower.
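To give a flavor of what "1-to-many with explicit CPU state" means, here is a toy C illustration (hypothetical and heavily simplified; the real engine emitted aarch64 machine code directly). Each guest instruction expands into a short fixed sequence of host-level steps, the operation itself plus flag materialization, instead of a generic interpreter dispatch:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t gpr[16];   /* guest RAX..R15 */
        uint64_t rflags;    /* guest flags, materialized eagerly here */
    } GuestState;

    /* One guest instruction, "add rax, rbx", becomes several host
     * steps: the add, the flag derivation, and the write-back. */
    static void translated_add_rax_rbx(GuestState *s)
    {
        uint64_t a = s->gpr[0], b = s->gpr[3];  /* RAX, RBX */
        uint64_t r = a + b;                     /* step 1: the add    */
        uint64_t zf = (uint64_t)(r == 0) << 6;  /* step 2: zero flag  */
        uint64_t cf = (uint64_t)(r < a);        /* step 3: carry flag */
        s->rflags = (s->rflags & ~0x41ULL) | zf | cf;
        s->gpr[0] = r;                          /* step 4: write back */
    }

    int main(void)
    {
        GuestState s = { .gpr = { [0] = UINT64_MAX, [3] = 1 } };
        translated_add_rax_rbx(&s);
        printf("rax=%llu cf=%llu\n",
               (unsigned long long)s.gpr[0],
               (unsigned long long)(s.rflags & 1));
        return 0;
    }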
Unfortunately this was not done under an open-source license, so there's no code release to prove it... :(
linkregister
A 50x increase in the size of the .text section is enormous, but it seems a reasonable price to pay for a fully deterministic translation. The performance difference over emulation will outweigh the inconvenience of the size increase in many cases.
It's exciting to see that multithreading and exception handling are not impossible to support; they're just out of scope of this particular project.
I wonder if the next step is to then use heuristics to prune the possibility space and reduce the size of the binary (thus breaking the guarantees of the translation, but making portability of the binary practical).
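For example, one obvious heuristic (purely hypothetical, not something Elevator claims to do) is to keep only the decodings reachable from the entry point via direct control flow. A toy sketch of such a pruning pass, with a made-up three-instruction ISA standing in for a real x86-64 decoder:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_CODE 256

    /* Toy one-byte ISA standing in for a real decoder:
     * 0x00 = halt (no successors), 0x01 = nop (fallthrough),
     * 0x02 imm8 = jmp to absolute offset imm8.
     * Returns the number of successors, or -1 if invalid. */
    static int decode_at(const unsigned char *code, size_t off,
                         size_t len, size_t succ[2])
    {
        if (off >= len) return -1;
        switch (code[off]) {
        case 0x00: return 0;                               /* halt */
        case 0x01: succ[0] = off + 1; return 1;            /* nop  */
        case 0x02: if (off + 1 >= len) return -1;
                   succ[0] = code[off + 1]; return 1;      /* jmp  */
        default:   return -1;          /* invalid: prune this path */
        }
    }

    /* Keep only offsets reachable from `entry`; everything else is
     * treated as data and never translated. */
    static void mark_reachable(const unsigned char *code, size_t len,
                               size_t entry, bool is_code[MAX_CODE])
    {
        size_t work[2 * MAX_CODE + 1], top = 0, succ[2];
        work[top++] = entry;
        while (top > 0) {
            size_t off = work[--top];
            if (off >= len || is_code[off]) continue;
            int n = decode_at(code, off, len, succ);
            if (n < 0) continue;
            is_code[off] = true;
            for (int i = 0; i < n; i++) work[top++] = succ[i];
        }
    }

    int main(void)
    {
        /* nop; jmp 5; two junk bytes; nop; halt */
        unsigned char code[] = { 0x01, 0x02, 0x05, 0xff, 0xff, 0x01, 0x00 };
        bool is_code[MAX_CODE] = { false };
        mark_reachable(code, sizeof code, 0, is_code);
        for (size_t i = 0; i < sizeof code; i++)
            printf("offset %zu: %s\n", i, is_code[i] ? "code" : "pruned");
        return 0;
    }

Anything mark_reachable doesn't mark simply never gets a translation, which is exactly where the guarantee is lost: an indirect jump into a pruned offset now has nowhere to land.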
Asraelite
Where is the source code?
pcblues
Sounds like they tweaked an AI to get a minimal subset of accurate outcomes and started waving their hands for anything more complicated, realistic, and ultimately generally useful. The larger problem space is still NP-complete. I guess if the data centres become infinitely large, this problem can be worked around.
/s /jk
JoheyDev888
50x isn't reasonable; it's a cache disaster. Any perf win from avoiding JIT gets eaten alive.
gblargg
> Elevator considers all possible interpretations of every byte and produces a separate translation for each feasible one ahead of time [...] pruning only those leading to abnormal termination.
So any real program with the possibility to crash is pruned?
jonhohle
This is neat. I haven't looked into it, but I would think relative offsets could still be an issue; then again, there must be some translation layer/MMU anyway, since the codegen will be different sizes. This would primarily impact jump tables and internal branches (the kind of mapping I sketch below).
I mostly work on stuff from the 90s. Disassemblers make a lot of assumptions about where code starts and ends, but occasionally a binary blob is not discoverable unless you have some prior knowledge (a pointer at a fixed location to an entry point).
I would think that after a few passes you could refine the binary into areas that are definitely code.
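For the branch-target problem, something like the following is what I have in mind (my guess at a plausible mechanism, not taken from Elevator): every guest address that an indirect branch or jump-table entry can target gets an entry mapping it to its translation, and indirect branches go through a lookup.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t guest;  /* original x86-64 code address      */
        uint64_t host;   /* address of its translated version */
    } AddrPair;

    /* Table sorted by guest address; binary search at each indirect
     * branch. Real translators often use a hash table or inline
     * stubs instead, but the idea is the same. */
    static uint64_t lookup_host(const AddrPair *map, size_t n,
                                uint64_t guest)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (map[mid].guest < guest)      lo = mid + 1;
            else if (map[mid].guest > guest) hi = mid;
            else return map[mid].host;
        }
        return 0;  /* unknown target: abort or fall back */
    }

    int main(void)
    {
        /* e.g. a three-entry jump table from the guest binary; the
         * host-side gaps differ because translations change size */
        AddrPair map[] = {
            { 0x401000, 0x7f0000 },
            { 0x401010, 0x7f0040 },
            { 0x401020, 0x7f00a0 },
        };
        printf("0x401010 -> 0x%llx\n",
               (unsigned long long)lookup_host(map, 3, 0x401010));
        return 0;
    }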
Panzerschrek
Can it handle self-modifying code?
Why only x86_64? It would make more sense to convert 32-bit programs, like many old games.
fizza_pizza
The certification angle is the most interesting part to me. Regulated industries (aviation, medical devices) often can't use JIT for exactly this reason: the code that runs has to be the code that was certified. Static translation that produces a signable binary is a real unlock there, code bloat notwithstanding.
mgaunard
On par with QEMU, but still far behind Rosetta...
fguerraz
Does it mean I can finally run Slack on Asahi?
dmitrygr
Cute, but Rice's theorem remains. Even having translated every byte as code, static translation is only possible when you assume no adversarial code AND mostly compiler-produced binaries. Hand-rolled asm gets hard, and adversarial code is provably unsolvable in the general case. Still, pretty cool for cooperative binaries.
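The canonical adversarial case is code that only exists at runtime. A minimal demonstration (my own example, x86-64 Linux, and it assumes the OS still allows writable+executable mappings):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* x86-64 "mov eax, 42; ret", XOR-masked so these exact code
         * bytes never appear in the on-disk binary at all; a static
         * translator has nothing to decode, however exhaustive. */
        static const unsigned char masked[] = {
            0xb8 ^ 0xaa, 0x2a ^ 0xaa, 0x00 ^ 0xaa,
            0x00 ^ 0xaa, 0x00 ^ 0xaa, 0xc3 ^ 0xaa,
        };
        unsigned char *buf = mmap(NULL, 4096,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        for (size_t i = 0; i < sizeof masked; i++)
            buf[i] = masked[i] ^ 0xaa;      /* the code exists only now */
        int (*f)(void) = (int (*)(void))buf;
        printf("%d\n", f());                /* prints 42 */
        return 0;
    }

Replace the constant mask with something derived from user input and even exhaustive byte-level enumeration can't recover the code ahead of time.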