X X^t can be faster

185 points | 52 comments | a day ago
TimorousBestie

I wish they could have modeled memory cache movements as well, somehow. It would have made the analysis more difficult but often these large matrix algorithms live or die by how cache-friendly the memory access patterns are.

Splitting into 4x4 blocks is typically very nice, though. Maybe it doesn’t matter so much to practical runtime.

ttoinou

Are there researchers interested in accelerating these algorithms while also preserving maximum accuracy in the results? This and other optimizations trade multiplications for additions, but from the little I know, floating-point additions and subtractions are risky precision-wise, while multiplications are relatively harmless. Using FMA (fused multiply-add) operations would also be beneficial.
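
To make the precision concern concrete, here's a minimal Python sketch (my own illustration, assuming IEEE-754 doubles, not from the paper):

  # Additions/subtractions can cancel catastrophically: 1 + 1e-16 rounds
  # back to 1.0 in double precision, so the tiny addend is lost entirely.
  a, tiny = 1.0, 1e-16
  print((a + tiny) - a)   # 0.0 -- 100% relative error
  # Multiplication keeps the relative error within about one ulp:
  print((a * tiny) / a)   # 1e-16 -- essentially exact

(An FMA computes a*b + c with a single rounding instead of two, which is where its accuracy benefit comes from.)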

jethkl

How helpful was AI for this? The paper is light on details, but it says the agent was used to generate a kind of seed set (rank-1 bilinear products) that was then fed into the subsequent steps. Evidently this idea succeeded. Curious whether anyone here has insight into whether this is a common technique, and how the agent's output would compare to random sampling or a simple heuristic attempting the same. Also interested in how the training objective gets defined, since the final task is a couple of steps downstream from what the agent generates.

BryanLegend

So if an AI company spends $5B on a cluster, is this optimization worth $250m?

Scene_Cast2

I can't name any applications off the top of my head, other than iterative matrix multiplication for approximate eigenvector finding in square matrices. But I don't know what's actually used for finding eigenvectors (or other decompositions, for that matter).
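
For reference, the application mentioned is textbook power iteration; a quick NumPy sketch (my own, not the commenter's code), where A could itself be a product like X X^t:

  import numpy as np

  def power_iteration(A, iters=500):
      # Repeatedly multiplying by A converges (up to sign) to the
      # eigenvector of A with the largest-magnitude eigenvalue.
      v = np.random.default_rng(0).standard_normal(A.shape[0])
      for _ in range(iters):
          v = A @ v
          v /= np.linalg.norm(v)   # renormalize to avoid overflow
      return v

  A = np.array([[2.0, 1.0], [1.0, 3.0]])
  v = power_iteration(A)   # aligns with A's dominant eigenvector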

toolslive

As a general tip: X^-1 doesn't mean you have to invert the matrix. It's often notational shorthand for "you can solve the system". Same remark for X^t: it doesn't mean you have to build a new matrix, just that you have to use the one you have in a different way. I've seen this butchered by scientists who then complain that their performance sucks.
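
A concrete version of that tip in NumPy/SciPy (my sketch, assuming a well-conditioned square system):

  import numpy as np
  from scipy.linalg import solve, blas

  rng = np.random.default_rng(0)
  X = rng.standard_normal((500, 500))
  b = rng.standard_normal(500)

  x_bad  = np.linalg.inv(X) @ b   # materializes X^-1: slower, less accurate
  x_good = solve(X, b)            # "X^-1 b" as a linear solve, no inverse built

  # Same idea for X^t: BLAS routines take a transpose flag, so no
  # transposed copy is ever constructed in memory.
  y = blas.dgemv(1.0, X, b, trans=1)   # computes X^t @ b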

kazinator

Under multiplication, rows go into columns and all that. So if we transpose things, rows are being dot-producted with (what were originally) rows:

  [ a b ] [ a c ] = [ (a b) . (a b)  (a b) . (c d) ] = [ |(a b)|^2      (a b) . (c d) ]
  [ c d ] [ b d ]   [ (c d) . (a b)  (c d) . (c d) ]   [ (c d) . (a b)  |(c d)|^2     ]

Down the diagonal we have the squared norms of (a b) and (c d), which is interesting and could somehow be exploited.

We also have repeated values between the upper and lower triangle because (a b) . (c d) is the same as (c d) . (a b): the dot product commutes.

Whereas just squaring the matrix:

  [ a b ] [ a b ] = [ (a b) . (a c)  (a b) . (b d) ]
  [ c d ] [ c d ]   [ (c d) . (a c)  (c d) . (b d) ]

all four dot products are unique.

So right off the bat, we can see interesting structure in X X^t without going deep, and an obvious shortcut: each off-diagonal entry only needs to be computed once.
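
In code, that shortcut looks something like this (a plain NumPy sketch of the idea, not an optimized kernel):

  import numpy as np

  def xxt(X):
      # (X X^t)[i, j] is the dot product of rows i and j, so the result is
      # symmetric: compute each off-diagonal entry once and mirror it.
      n = X.shape[0]
      out = np.empty((n, n))
      for i in range(n):
          out[i, i] = X[i] @ X[i]                  # squared norm on the diagonal
          for j in range(i + 1, n):
              out[i, j] = out[j, i] = X[i] @ X[j]  # one dot product, two entries
      return out

  X = np.random.default_rng(0).standard_normal((4, 5))
  assert np.allclose(xxt(X), X @ X.T)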

selimthegrim

I guess the news is the “for small sizes” part

meisel

Is this like the Karatsuba algorithm, where it's theoretically faster but not actually faster when run on real hardware?

Btw, it's worth noting that if you know the result will be symmetric (as is the case for X * X^T), you can make things faster. In cuBLAS, for example, cublas*syrk (the variant optimized for symmetric results) in my experience isn't actually faster than gemm; what you can do instead is issue smaller multiplications that fill in one of the two triangles piece by piece, and then copy that triangle over to the other one.
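
A rough sketch of that block-by-block approach in NumPy (the block size and layout are my assumptions, and each small product stands in for a cuBLAS gemm call):

  import numpy as np

  def xxt_blocked(X, bs=256):
      # Fill only the blocks on or below the diagonal with small matmuls,
      # then mirror each off-diagonal block into the upper triangle.
      n = X.shape[0]
      out = np.empty((n, n))
      for i in range(0, n, bs):
          for j in range(0, i + bs, bs):
              blk = X[i:i+bs] @ X[j:j+bs].T        # one small GEMM per block
              out[i:i+bs, j:j+bs] = blk
              if i != j:
                  out[j:j+bs, i:i+bs] = blk.T      # copy to the other triangle
      return out

  X = np.random.default_rng(0).standard_normal((1000, 64))
  assert np.allclose(xxt_blocked(X), X @ X.T)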
