Fundamental flaws of SIMD ISAs (2021)

133 points | 109 comments | a day ago
stephencanon

The basic problem with almost every "SIMD is flawed, we should have vector ISAs" article or post (including the granddaddy, "SIMD Instructions Considered Harmful"), is that they invariably use SAXPY or something else trivial where everything stays neatly in lane as their demonstration case. Of course vector ISAs look good when you show them off using a pure vector task. This is fundamentally unserious.

There is an enormous quantity of SIMD code in the world that isn't SAXPY, and doesn't stay neatly in lane. Instead it's things like "base64 encode this data" or "unpack and deinterleave this 4:2:2 pixel data, apply a colorspace conversion as a 3x3 sparse matrix and gamma adjustment in 16Q12 fixed-point format, resize and rotate by 15˚ with three shear operations represented as a linear convolution with a sinc kernel per row," or "extract these fields from this JSON data". All of which _can totally be done_ with a well-designed vector ISA, but the comparison doesn't paint nearly as rosy a picture. The reality is that you really want a mixture of ideas that come from fixed-width SIMD and ideas that come from the vector world (which is roughly what people actually shipping hardware have been steadily building over the last two decades, implementing more support for unaligned access, predication, etc., while the vector ISA crowd writes purist think pieces).

Someone

> Since the register size is fixed there is no way to scale the ISA to new levels of hardware parallelism without adding new instructions and registers.

I think there is a way: vary register size per CPU, but also add an instruction to retrieve register size. Then, code using the vector unit will sometimes have to dynamically allocate a buffer for intermediate values, but it would allow for software to run across CPUs with different vector lengths. Does anybody know whether any architecture does this?
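Arm's SVE and the RISC-V vector extension both work roughly this way: the code reads the hardware's vector length at runtime instead of baking it in. A minimal C sketch of the resulting loop shape, with `vec_width()` as a hypothetical stand-in for such a "get register size" instruction (the 8-lane value is an arbitrary assumption):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a "read vector width" instruction, such as
 * the per-CPU length Arm SVE or RISC-V RVV exposes. Real hardware would
 * report its own value; we fix it at 8 lanes here for illustration. */
static size_t vec_width(void) { return 8; }

/* Strip-mined SAXPY: the loop adapts to whatever width the CPU reports,
 * so the same binary runs on CPUs with different vector lengths. */
void saxpy(float a, const float *x, float *y, size_t n) {
    size_t w = vec_width();
    for (size_t i = 0; i < n; i += w) {
        size_t chunk = (n - i < w) ? n - i : w; /* tail: shrink the chunk */
        for (size_t j = 0; j < chunk; j++)      /* stands in for one vector op */
            y[i + j] += a * x[i + j];
    }
}
```

Only the trip count changes across CPUs; no per-width recompilation or dispatch is needed.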

TinkersW

I write a lot of SIMD and I don't really agree with this..

Flaw 1: fixed width

I prefer fixed width, as it makes the code simpler to write: the size is known at compile time, so we know the size of our structures. Swizzle algorithms are also customized based on the size.

Flaw 2: pipelining

No CPU I care about is in-order, so this is mostly irrelevant, and even scalar instructions are pipelined.

Flaw 3: tail handling

I code with SIMD as the target, and have special containers that pad memory to SIMD width, so there is no need to mask or run a scalar loop. I copy the last valid value into the remaining slots so it doesn't cause any branch divergence.
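A sketch of what such a padded container might look like in plain C. The 8-lane `SIMD_WIDTH` and the replicate-the-last-value policy follow the comment; the function name and layout are illustrative assumptions, not any particular library:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIMD_WIDTH 8 /* assumed lane count; pick your target's width */

/* Allocate a float buffer whose length is rounded up to a multiple of
 * SIMD_WIDTH, replicating the last valid element into the padding so a
 * full-width final iteration reads well-defined data and every lane
 * takes the same branch. */
float *alloc_padded(const float *src, size_t n, size_t *padded_n) {
    size_t rounded = (n + SIMD_WIDTH - 1) / SIMD_WIDTH * SIMD_WIDTH;
    float *buf = malloc(rounded * sizeof *buf);
    if (!buf) return NULL;
    memcpy(buf, src, n * sizeof *buf);
    for (size_t i = n; i < rounded; i++)
        buf[i] = n ? src[n - 1] : 0.0f; /* replicate last valid value */
    *padded_n = rounded;
    return buf;
}
```

The main loop can then run full-width to the padded end with no mask and no scalar epilogue.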

xphos

Personally, I think load-and-increment-address-register in a single instruction is extremely valuable here. It's not quite the RISC model, but I think it is actually pretty significant in avoiding a von Neumann bottleneck with SIMD (the irony in this statement).

I found that for a lot of the custom SIMD cores I've written for, RISC-V simply cannot issue instructions fast enough. Or when it can, it's in quick bursts, and then the increments and loop control leave the engine idling for longer than you'd like.

Better dual issue helps, but when you have a separate vector queue you are sending things to, it's not that much extra to fold the increments into vloads and vstores.

sweetjuly

Loop unrolling isn't really done because of pipelining but rather to amortize the cost of looping. Any modern out-of-order core will (on the happy path) schedule the operations identically whether you did one copy per loop or four. The only difference is the number of branches.
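The effect is easy to see even in scalar C: both versions below retire the same adds, but the unrolled one executes roughly a quarter of the increment/compare/branch overhead (a minimal sketch, not tuned code):

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one increment + compare + branch per element. */
float sum1(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* 4x unrolled: the loop-control overhead is paid once per four elements.
 * An out-of-order core schedules the adds much the same either way; the
 * saving is in retired branch/increment instructions. */
float sum4(const float *x, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]; s1 += x[i + 1]; s2 += x[i + 2]; s3 += x[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++) s += x[i]; /* scalar tail */
    return s;
}
```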

pornel

There are alternative universes where these wouldn't be a problem.

For example, if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass (less involved than full VM bytecode compilation), then we could adjust SIMD width for existing binaries instead of waiting decades for a new baseline or multiversioning faff.

Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.

codedokode

I think that packed SIMD is better in almost every aspect and Vector SIMD is worse.

With vector SIMD you don't know the register size beforehand, and therefore have to maintain and increment counters, adding extra unnecessary instructions and reducing total performance. With packed SIMD you can issue several loads immediately, without dependencies; if you look at code examples, you can see that the x86 code is denser, using a sequence of unrolled SIMD instructions with no extra instructions, which is more efficient. RISC-V, meanwhile, has 4 SIMD instructions and 4 instructions dealing with counters per loop iteration, i.e. it wastes 50% of instruction issue bandwidth, and you cannot load the next block until you increment the counter.
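The contrast being described can be sketched in plain C. Here `get_vl()` is a hypothetical stand-in for RISC-V's per-iteration length grant, and the 4- and 8-lane widths are arbitrary assumptions; the inner loops each stand for one vector instruction:

```c
#include <assert.h>
#include <stddef.h>

/* Vector-length-agnostic shape: the granted length `vl` feeds the
 * pointer bump, so the next iteration's addresses depend on this
 * iteration's counter update. */
static size_t get_vl(size_t remaining) { return remaining < 4 ? remaining : 4; }

void add1_vla(float *x, size_t n) {
    while (n > 0) {
        size_t vl = get_vl(n);                   /* hypothetical length grant */
        for (size_t j = 0; j < vl; j++) x[j] += 1.0f;
        x += vl;                                  /* increment depends on vl */
        n -= vl;
    }
}

/* Packed-SIMD shape: the width is a compile-time constant, so unrolled
 * block addresses are independent constant offsets from one induction
 * variable and the loads don't wait on a counter update. */
void add1_packed(float *x, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (size_t j = 0; j < 4; j++) x[i + j]     += 1.0f; /* block 0 */
        for (size_t j = 0; j < 4; j++) x[i + 4 + j] += 1.0f; /* block 1 */
    }
    for (; i < n; i++) x[i] += 1.0f;             /* scalar tail */
}
```

Both compute the same result; the difference is purely in how the addresses of successive blocks are formed.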

The article mentions that you have to recompile packed SIMD code when a new architecture comes out. Is that really a problem? Open source software is recompiled every week anyway. You should just describe your operations in a high level language that gets compiled to assembly for all supported architectures.

So, as a conclusion, it seems that vector SIMD is optimized for manually written assembly and closed-source software, while packed SIMD is made for open-source software and compilers, and is more efficient. Why the RISC-V community prefers the vector architecture, I don't understand.

bob1029

> Since the register size is fixed there is no way to scale the ISA to new levels of hardware parallelism without adding new instructions and registers.

I look at SIMD as the same idea as any other aspect of the x86 instruction set. If you are directly interacting with it, you should probably have a good reason to be.

I primarily interact with these primitives via types like Vector<T> in .NET's System.Numerics namespace. With the appropriate level of abstraction, you no longer have to worry about how wide the underlying architecture is, or if it even supports SIMD at all.

I'd prefer to let someone who is paid a very fat salary by a F100 spend their full time job worrying about how to emit SIMD instructions for my program source.

dragontamer

1. Not a problem for GPUs. Nvidia and AMD are both hard-coded at 32-wide (1024-bit); AMD can swap to a 64-wide mode for backwards compatibility with GCN. 1024-bit or 2048-bit seems to be the right value: too wide and you get branch divergence issues, so it doesn't seem to make sense to go bigger.

In contrast, the systems that have flexible widths have never taken off. It's seemingly much harder to design a programming language for a flexible width SIMD.

2. Not a problem for GPUs. It should be noted that kernels allocate custom amounts of registers: one kernel may use 56 registers, while another might use 200. All GPUs will run these two kernels simultaneously (256+ registers per CU or SM is commonly supported, so the 200-register and 56-register kernels can run together).

3. Not a problem for GPUs, or really for any SIMD in most cases. Tail handling is an O(1) problem in general and not a significant contributor to code length, size, or benchmarks.

Overall utilization issues are certainly a concern. But in my experience this is caused by branching most often. (Branching in GPUs is very inefficient and forces very low utilization).

dang

Related:

Three Fundamental Flaws of SIMD - https://news.ycombinator.com/item?id=28114934 - Aug 2021 (20 comments)

freeone3000

x86 SIMD suffers from register aliasing: xmm0 is actually the low half of ymm0, so you need to explicitly tell the processor what your input type is to properly handle overflow and signedness. Actual vectorized instructions don’t have this problem, but you also can’t change it now.

gitroom

Oh man, totally get the pain with compilers and SIMD tricks - the struggle's so real. Ever feel like keeping low level control is the only way stuff actually runs as smooth as you want, or am I just too stubborn to give abstractions a real shot?

lauriewired

The three “flaws” that this post lists are exactly what the industry has been moving away from for the last decade.

Arm’s SVE and RISC-V’s vector extension are both vector-length-agnostic. RISC-V’s implementation is particularly nice: you only have to compile for one code path (unlike AVX, with its need for fat-binary if/else trees).

convolvatron

I would certainly add lack of reductions ('horizontal' operations) and a more generalized model of communication to the list.
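For context, a horizontal reduction is the fold-the-register-onto-itself ladder that in-lane SIMD handles awkwardly, taking log2(width) shuffle+add steps. A sketch modeling a hypothetical 8-lane register as a plain C array:

```c
#include <assert.h>

/* Tree ("horizontal") sum of an 8-lane register, modeled as an array:
 * each step folds the high half onto the low half, so 8 lanes reduce
 * in 3 steps (8 -> 4 -> 2 -> 1), mirroring the shuffle+add ladder that
 * SIMD code uses in practice. */
float hsum8(const float v[8]) {
    float t[8];
    for (int i = 0; i < 8; i++) t[i] = v[i];
    for (int half = 4; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            t[i] += t[i + half];
    return t[0];
}
```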

timewizard

> Another problem is that each new SIMD generation requires new instruction opcodes and encodings.

It requires new opcodes. It does not strictly require new encodings. Several newer encodings are legacy-compatible and can encode previous generations' vector instructions.

> so the architecture must provide enough SIMD registers to avoid register spilling.

Or the architecture allows memory operands. The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.

> Usually you also need extra control logic before the loop. For instance if the array length is less than the SIMD register width, the main SIMD loop should be skipped.

What do you want: no control overhead, or the speed enabled by SIMD? This isn't a flaw. This is a necessary price for the efficiency you achieve in the main loop.
