Branchless UTF-8 Encoding

comex

Incidentally, this automatic branch-if-zero from LLVM is being improved.

First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:

https://github.com/llvm/llvm-project/pull/102885

Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).

If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSR with the same register as destination. If the BSR input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSR input is zero, then BSR leaves the register unmodified and you get 32.

Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.

Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.

(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")

show comments

orlp

If you have access to the BMI2 instruction set I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:

    branchless_utf8:
        mov     rax, rdi
        lzcnt   ecx, esi
        lea     rdx, [rip + .L__unnamed_1]
        movzx   ecx, byte ptr [rcx + rdx]
        lea     rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
        pdep    esi, esi, dword ptr [rdx + 8*rcx]
        or      esi, dword ptr [rdx + 8*rcx + 4]
        movbe   dword ptr [rdi], esi
        mov     qword ptr [rdi + 8], rcx
        ret

The code:

    static DEP_AND_OR: [(u32, u32); 5] = [
        (0, 0),
        (0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
        (0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
        (0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
        (0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
    ];

    const LEN: [u8; 33] = [
        // 0-10 leading zeros: not valid.
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        // 11-15 leading zeros: 4 bytes.
        4, 4, 4, 4, 4,
        // 16-20 leading zeros: 3 bytes.
        3, 3, 3, 3, 3,
        // 21-24 leading zeros: 2 bytes.
        2, 2, 2, 2,
        // 25-32 leading zeros: 1 byte.
        1, 1, 1, 1, 1, 1, 1, 1,
    ];

    pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
        let leading_zeros = codepoint.leading_zeros() as usize;
        let bytes = LEN[leading_zeros] as usize;
        let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
        let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
        (ret.swap_bytes().to_le_bytes(), bytes)
    }

show comments

deathanatos

  /// Encode a UTF-8 codepoint.
  /// […]
  /// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
  /// it's up to the caller to turn that into U+FFFD, or return an error.

It's not a "UTF-8 codepoint", that's horridly mangling the terminology. Code points are just code points.

The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)

(In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)

xeeeeeeeeeeenu

> So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”.

Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and Piledriver (AMD) introduced the LZCNT instruction that doesn't have this problem.

show comments

koala_man

I'm surprised there are no UTF-8 specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero"

show comments

Arnavion

>So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!

Yes, RISC-V for example defines the instructions for counting leading / trailing zeros (clz, clzw, ctz, ctzw) such that an N-bit zero value has N of them.

I don't know if I can show it on Rust Godbolt because none of the default RISC-V targets that Rust has support the Zbb extension, but I checked with a custom target that I use locally for my emulator, and `leading_zeros()` indeed compiles to just one `clz` without any further branches. Here's a C demonstration though: https://gcc.godbolt.org/z/cKx3ajsjh

show comments

purplesyringa

Instead of

    let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
    len &= !surrogate_mask;

consider

    len &= surrogate_bit.wrapping_sub(1);

This should still work better. Alternatively, invert the conditions and do

    len &= non_surrogate_bit.wrapping_neg();

emilfihlman

I mean, isn't the trivial answer to just collapse the if else tree into just math that's evaluated always?

  u32 a = (code <= 0x7F);
  u32 b = (code <= 0x07FF);
  u32 c = ((code < 0xD800) || 
  (0xDFFF < code));
  u32 d = (code <= 0xFFFF) * c;
  u32 e = (code <= 0x10FFFF);
  u32 v = (c && e);
  return(-1 * !v + v * (4 - a - b - d));

Highly likely easy to optimise.

show comments

Validark

I wrote this stub a while back

https://gist.github.com/Validark/457b6db8aa00ded26a6681d4d25...

marxisttemp

I love weird little tricks with popcnt/leading/trailing zero instructions.

I recently had a lot of fun getting Swift’s OptionSet bitset interface to iterate over active members.

(Unfortunately, because of weird specifics of Swift protocol associated types, I wasn’t able to actually conform OptionSet to Collection like I wanted to originally. I find it amusing that one of the first examples in the official documentation for Swift macros is to make OptionSet used an associated enum like it should.)

lxgr

> Compiler explorer confirms that, with optimizations enabled, this function is branchless.

Only if you don't consider conditional move instructions branching/cheating :)

RenThraysk

Or-ing 1 onto codepoint before calling leading_zeroes() should get a decent compiler to remove the branch.

Dwedit

Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?

show comments

sylware

CPU hardware ISA can load tables into regs (RISC-V for instance), then on sufficently big data under UTF-8 processing, you get branchless without cache memory request.

ThatGuyRaion

So is this potentially performance improving?.

show comments

decafbad

Checkout Erlang bit parsing.

show comments

jan_haker

[flagged]