Faster asin() was hiding in plain sight

156 points61 comments6 hours ago
jason_s

While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.

show comments
xt00

To be accurate, this is originally from Hastings 1955, Princeton "APPROXIMATIONS FOR DIGITAL COMPUTERS BY CECIL HASTINGS", page 159-163, there are actually multiple versions of the approximation with different constants used. So the original work was done with the goal of being performant for computers of the 1950's. Then the famous Abramowitz and Stegun guys put that in formula 4.4.45 with permission, then the nvidia CG library wrote some code that was based upon the formula, likely with some optimizations.

LegionMammal978

In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm

show comments
exmadscientist

This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

show comments
AlotOfReading

I'm pretty sure it's not faster, but it was fun to write:

    float asin(float x) {
      float x2 = 1.0f-fabs(x);
      u32 i = bitcast(x2);
      i = 0x5f3759df - (i>>1);
      float inv = bitcast(i);
      return copysign(pi/2-pi/2*(x2*inv),x);
    }
Courtesy of evil floating point bithacks.
show comments
cmovq

> After all of the above work and that talk in mind, I decided to ask an LLM.

Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.

[1]: https://stackoverflow.com/a/26030435

show comments
andyjohnson0

Interesting aeticle. A few years back I implemented a bunch of maths primitives, including trig functions, using Taylor sequences etc, to see how it was done. An interesting challenge, even at the elementary level I was working at.

So this article got me wondering how much accuracy is needed before computing a series beats pre-computed lookup tables and interpolation. Anyone got any relevant experience to share?

How much accuracy does ray tracing require?

srean

Not directly related, but if you are reaching for sin, cos, asin, acos, atan and their relatives to handle rotation, you may save yourself a lot of trouble by representing angle not as a scalar but as (i) a cos, sin tuple or (ii) equivalently as a complex numbers.

You may then get away with simple algebra and square roots. A runtime such as Python would do a lot of that transparently.

scottlamb

Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin on Intel's intrinsics guide. They have a AVX-512 instrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

show comments
orangepanda

> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

sixo

It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.

A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.

glitchc

The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

show comments
empiricus

Does anyone knows the resources for the algos used in the HW implementations of math functions? I mean the algos inside the CPUs and GPUs. How they make a tradeoff between transistor number, power consumption, cycles, which algos allow this.

erichocean

Ideal HN content, thanks!

drsopp

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.

show comments
ok123456

Chebyshev approximation for asin is sum(2T_n(x) / (pi*n*n),n), the even terms are 0.

stephc_int13

My favorite tool to experiment with math approximation is lolremez. And you can easily ask your llm to do it for you.

varispeed

If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren

adampunk

We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c

show comments