2023 April 29
Dividing 8-bit Uints with AVX-512VBMI
Torbjorn Granlund and Peter L. Montgomery are the authors of a paper that is well-known among those interested in numerical techniques and optimization. This paper, Division by Invariant Integers using Multiplication, provides a technique, along with a proof of its correctness, for quickly computing quotients when the divisor is known beforehand. The basic idea behind this technique is nothing more than elementary math: n / d is the same as n * (1 / d). If the reciprocal of d is computed beforehand, then the problem of division can be transformed into one of multiplication. Crucially, multiplication instructions are generally cheaper than the corresponding division instructions. Therefore, this technique ends up being a potentially faster way to perform division, at least at the point of computing the quotient.
If you’re working with floating-point numbers, then computing n * (1 / d) is simple, at least if you don’t care about the minor error this introduces. However, it’s more complicated if you’re working with integer types. The reciprocal of an integer is in general not itself an integer, meaning it cannot be straightforwardly represented using integral types. Hence, to put this idea into practice, alternative representations for the reciprocal of d have to be considered.
This is just what Granlund and Montgomery’s paper offers. It’s more accurate to say that they provide two such approaches, one for unsigned integers and one for signed integers. Both can be roughly described as emulating a floating-point number. Those interested in the fine details are encouraged to read through the paper. However, the only part of it that’s really important here is the following set of formulas, which pertain to the technique for unsigned integers.
Once the value of d is known, three variables, m, sh1, and sh2, should be computed as follows:
l = N - lzcnt(d - 1)
m = (double_width_uint((1 << l) - d) << N) / d + 1
sh1 = min(l, 1)
sh2 = l - sh1
Here N refers to how many bits wide the type of n and d is. double_width_uint is straightforwardly an unsigned integer type which is twice as wide as the type of n or d. Note that l is just an intermediate variable that does not need to be kept around for the following step.
When some number n should be divided by d, then the following formula should be evaluated:
t1 = mulhi(m, n)
uint q = (t1 + ((n - t1) >> sh1)) >> sh2
Here, mulhi refers to the high half of a widening multiplication. That is, it’s equivalent to (double_width_uint(m) * double_width_uint(n)) >> N. q here denotes the quotient of interest, n / d.
Note that the computation of m involves dividing two integers which are twice the width of either n or d. This means that this approach actually requires more work overall compared to simply computing n / d directly. Therefore, this technique is really only beneficial if you’re performing multiple divisions by d, such that the decrease in the runtime of these divisions outweighs the upfront cost. Or rather, that’s what you’ll commonly hear about this technique. The other case where it’s useful is when you can quickly compute the value of m by alternative means.
The Part Where AVX-512 Comes In
Even with the latest extensions to the x86 ISA, there are no vectorized integer division instructions, meaning that vectorized division must be emulated in software. However, software division is generally fairly expensive and somewhat inconvenient to implement. This is even more so the case when it must be vectorized for 8-bit ints, since x86 lacks vectorized 8-bit shift and multiplication instructions.
Fortunately, AVX-512VBMI introduced an instruction called vpermi2b. This instruction fundamentally allows you to perform a vectorized table lookup. It takes two vector registers, each containing 64 bytes, which are collectively used as a 128-entry lookup table. The instruction then also takes another vector register which contains 64 indices into the aforementioned lookup table. Note that 128 is only one power of two away from 256, the number of unique values that a byte may take on, and hence the number of unique values of m (if you ignore the degenerate case of d == 0, of course). Although it’s not ideal, it’s feasible to emulate a 256-entry table lookup by using two vpermi2bs, a vpmovb2m, and a vpblendmb.
The idea is that the vpermi2b instruction only looks at the low 7 bits of each index in the index vector, so we need some way to factor in that eighth bit. vpmovb2m produces a mask based specifically on the most-significant bit of each byte in a vector register, allowing us to easily extract that high bit. This mask can then be used as one of the parameters to the vpblendmb instruction. The blend’s other inputs would simply be the results of invoking the vpermi2b instruction on the high and low halves of a 256-byte lookup table containing all possible values of m. Assuming that the lookup table is already loaded, m can be computed in as little as four instructions.
There is another detail that’s a little difficult to compute: l. AVX-512 does not have leading-zero count instructions for 8-bit integers, which raises the question of how to quickly compute l. But before you go exploring the straightforward approach, I’d like to point out that l may be equivalently computed as bit_width(d - 1), where bit_width comes from C++20’s <bit> header. This alternative is slightly cheaper to compute than following the formula from the original paper.
Subtracting one from a number is simple, but computing bit_width has some complexity to it. Thankfully, we’ve just discussed three instructions that are very useful for this kind of thing, and we can take a very similar approach to the one used to compute m. Note that if the high bit of x is set, then bit_width(x) should evaluate to 8, and that if it’s not set, then it should evaluate to the bit width of the low 7 bits. Accordingly, we can check whether the high bit of each byte is set with vpmovb2m, compute the bit width of those low 7 bits with vpermi2b, and combine those results with a vpblendmb.
With a convenient way to compute both m and l, it’s feasible to create a vectorized implementation of 8-bit integer division which relies upon the Granlund-Montgomery technique.
Simplifying it Even Further
Earlier, it was mentioned that the approach taken by the paper can be crudely described as emulating a float. However, there is a simpler alternative: emulating a fixed-point number. u/YumiYumiYumi suggested this simplification to me, along with example code for a refined implementation. The following section is an explanation of it.
This approach removes the need to compute and use l altogether, but it does fundamentally mean that more work must be done in computing m (which, to be clear, is now just the fixed-point reciprocal of d instead of the pseudo-mantissa which it previously was), and m must also be widened to a 16-bit value instead of an 8-bit value. This widening is necessary to ensure that the reciprocal has a sufficient number of fractional bits to accurately compute the quotient.
Intuitively, a 16-bit value would seem to suggest that four uses of vpermi2b would be necessary, two for each of the low and high bytes of m, following the steps previously described. However, this is not the case if you leverage a couple of pieces of additional information. If the value of d is greater than 128, then the high byte of m will always be 0000'0001. 127 of the 255 non-degenerate values of d meet this criterion, leaving 128 to handle via a lookup. A simple blend can be used to choose between 0000'0001 and the value produced by the lookup. The mask which is passed to this blend instruction may be produced by a vpmovb2m if 1 is subtracted from the value of d first. In the reference code, this subtraction is performed upfront, affecting the layout of the data used in the lookup table. The low bytes of the reciprocals can be computed using two vpermi2b instructions, as was discussed in the prior section.
A further consideration is that the value of sh1 will only ever be 0 when d equals 1, a case which is trivially handled by a blend at the end of the division subroutine. Additionally, since d was decremented upfront, testing whether d equals 1 can at this point be performed with just a vptestnmb.
The Alternative Methods
It’s worth discussing what specifically the alternatives are. There are at least four to compare against. The first would be performing division in a scalar fashion using the native div instruction. The second would be a vectorized long division approach. A related third would be a vectorized long division approach that also features early termination. And the last would be to fall back to the vectorized division instructions that are available, namely the ones for 32-bit floats.
The long division algorithms would fundamentally take the same approach to division as children are taught in elementary school, just in base 2. This means an iterative algorithm based on repeated subtraction that produces one bit of the quotient with each iteration.
Note that it’s possible to reach a point where no more iterations are necessary once the dividend becomes smaller than the divisor. Should this happen, all subsequent bits of the quotient would evaluate to 0. Also note that if the quotient has some number of leading zeros, then the same number of iterations could be skipped upfront.
However, given that we’re dealing with vectors with 64 lanes, the probability that all of them will meet either of these criteria simultaneously for a random set of inputs is very low, practically non-existent. Not to mention that even if that’s ignored and we instead assume that there is a good chance of it actually happening, we’d still be introducing a very unpredictable set of branches into the algorithm. Of course, the branches are in reality quite predictable in that we know they’ll almost never get taken, since if even one lane does not meet the criteria, then the loop iterations cannot be skipped. Regardless, the mere presence of the branches and the instructions required to compute their conditions will make the algorithm slower by some amount.
That said, theoretically, if you happen to find yourself in the frankly fantasy-like scenario that you need to compute the quotients of a long list of unsigned 8-bit integers, whose quotients will all have around the same number of leading and/or trailing zeros, then this might just end up being faster.
Using 32-bit floating-point division instructions seems like it may be more likely to perform well. After all, you’d be leveraging a hardware implementation of division. However, there is a fair bit of work to this approach as well. You’d need to widen the 8-bit integers to 32-bit integers, convert them to floats, perform four different division instructions (which have high CPIs in practice), convert the quotients back to 32-bit integers, and narrow them back down to 8-bit ints. Even without having to handle any of the actual division logic in software, this doesn’t make for a trivial implementation.
What About AVX-512FP16?
AVX-512FP16 offers vectorized 16-bit division instructions which introduces yet another alternative. Using 16-bit fp division would involve fewer instructions than any of the previously mentioned techniques, while still benefiting from a hardware implementation of division. There is therefore some degree of intuition that suggests that it may end up being the fastest approach.
Currently, AVX-512FP16 is only available on Golden Cove cores, which can only be found as the P-cores of Alderlake CPUs and in Sapphire Rapids CPUs, making this a fairly rare feature at the current time. Further contributing to its rarity is a complex situation in which AVX-512 is not usable on all Alderlakes. Unfortunately, as much as I would have liked to test the performance of 16-bit division on these machines, I do not have access to one, nor the means to procure one.
As an alternative, we can instead reference the uops.info instruction tables for Alderlake P-cores. Checking the performance characteristics of the vdivph instruction and comparing them to those of the vdivps instruction, it appears that 16-bit fp division is actually inferior to 32-bit fp division. Roughly speaking, vdivph has performance characteristics that are ~2x worse than those of vdivps. Although the site does not have info for Sapphire Rapids CPUs, since they also use Golden Cove cores, presumably it would be much the same story.
This casts some doubt on the efficacy of using 16-bit fp division to emulate 8-bit integer division. An unstated part of the intuition for why this may be faster was the idea that with a smaller float and hence a smaller significand, the 16-bit division instructions would be cheaper than their 32-bit counterparts. However, right now it’s not clear whether using two 16-bit divisions would actually end up being faster than using the four 32-bit divisions.
If we’re lucky, the performance characteristics of 16-bit fp instructions will improve in future microarchitectures but as it stands, emulating 8-bit integer division using 16-bit floating-point division doesn’t seem like it’ll necessarily be a silver bullet.
Some Simple Benchmarks & Code
To compare the approaches that I could run, I set up a simple and frankly unrealistic benchmark where a list of dividends and divisors that fit into an L1D cache on my Ice Lake 1065G7 are processed repeatedly.
The code for this benchmark and the implementations of the various approaches to division are available on Compiler Explorer.
In creating these implementations I’ve tried to ensure that they are reasonable, but it must be admitted that they are not necessarily as well optimized as they theoretically might be. If you go through the code, you might for example point out the possibility of using vgf2p8affineqb to perform 8-bit shifts. However, I opted not to use it here because in previous benchmarks I’ve run on my machine, it proved inferior to using 16-bit shifts and masking. Theoretically, that instruction could perform better on newer microarchitectures; e.g. on Zen 4 it has a latency of only 3 cycles and a CPI of 0.5, compared to 5 and 1.0 respectively on my Ice Lake. Again, however, I don’t have access to such hardware. More broadly speaking, in my attempts to refine such small details, I found that the differences were small enough that they were unlikely to change the overall conclusions you’d draw from the data.
The code was compiled using GCC 12.2.0 with the -O3 -march=icelake-client
flags. The code was then run 50 times and the following statistics were computed
from the results.
| | Scalar | G-M Lookup8 | G-M Lookup16 | Long Div. | Early Ret. Div | fp-32 div. |
|---|---|---|---|---|---|---|
| Mean | 421,273us | 22,464us | 14,091us | 36,862us | 50,505us | 44,497us |
| Std. Dev. | 1,407us | 156us | 163us | 175us | 273us | 242us |
| Speed-up | N/A | 18.75 | 29.90 | 11.43 | 8.34 | 9.47 |
Notably, all approaches delivered a substantial improvement over using the scalar div instruction, so none of them are terrible.
However, the Granlund-Montgomery approaches stand out, as they perform nearly 20x and 30x better than scalar code. They do fundamentally rely on the lookup tables being in cache, though, something which may ruin their performance in practical applications where divisions don’t have sufficient temporal locality.
In contrast, the long division algorithm does not rely on reading from memory at all and its performance is still a large improvement over scalar code. It may end up being faster in applications that involve one-off divisions.
32-bit floating-point division did not perform great. Its performance is somewhat worse than long division’s and is only half as good as the Granlund-Montgomery approach’s.
We can see that the long division algorithm did not benefit from returning early, which is to be expected given that the numerators and denominators were generated at random with a uniform distribution. As predicted earlier, iterations were not able to be skipped (modifying the code so that it prints out a message when iterations are skipped confirms that this didn’t happen even once).
The early termination approach does have some potential however. In an alternate version of the test where all the dividends (chosen from [0, 127]) were smaller than the divisors (chosen from [128, 255]), meaning that the quotients would always be 0, the following results were obtained:
| | Early Ret. Div |
|---|---|
| Mean | 29us |
| Std. Dev. | 1.85us |
| Speed-up | 15432.66 |
Here, the early-returning long division runs through the data at a substantially greater rate, which shouldn’t be surprising given the fact that there is no meaningful work to be done. But this just isn’t representative of what you’d expect to encounter in practice so perhaps this potential is best left ignored.
Overall, the lookup-table-accelerated version of the Granlund-Montgomery approach appears to be best if you’re going to be performing a large number of 8-bit divisions in series, and long division may be best when performing divisions on an inconsistent basis. That said, in the grand scheme of things, the truth is that having to churn through any significant amount of unsigned 8-bit integer divisions is not something most people ever have to do. I imagine that the number of people who would actually benefit from this is quite small.