1
Last year, Lemire wrote about an optimized variation of the Euclidean algorithm for computing the greatest common divisor of two numbers, called binary Euclidean algorithm or Stein’s algorithm. It’s a best-of-class implementation, though it’s currently only used by libc++.
The post also briefly mentions the extended Euclidean algorithm, a related algorithm most often used to compute the modular multiplicative inverse (given a remainder a and a modulus m , find x such that a ⋅ x mod m = 1 ):
There is also a binary version of the extended Euclidean algorithm[,] although it is quite a bit more involved and it is not clear that it […] can be implemented at high speed, leveraging fast instructions, when working on integers that fit in general-purpose registers. […]
My implementation of the binary extended Euclidean algorithm is quite a bit slower and not recommended. I expect that it should be possible to optimize it further.
That’s a big shame, because the extended Euclidean algorithm can be optimized in a very similar manner, and the underlying ideas were described in a 2020 paper. It’s probably not well-known because the paper focuses on constant-time evaluation and long arithmetic, so people might have assumed it’s irrelevant.
I’m hoping to bring justice to the extended Stein’s algorithm with this post. I’ll cover how the algorithm works, its limitations, some optimizations compared to Pornin’s paper, and potential further improvements.
My implementation is available on GitHub as part of a Rust modular arithmetic library.
You must log in or register to comment.

