Bignum Arithmetic 07: Montgomery Arithmetic
Bignum arithmetic notes / 07
Montgomery arithmetic
Montgomery reduction replaces division by an odd modulus with word-by-word cancellation modulo the radix.
Let \(m\) be odd, \(0<m<B^n\), and \(R=B^n\) with \(\gcd(R,m)=1\). The Montgomery representation of \(x\) is
\[\tilde x=xR\bmod m.\]A Montgomery multiplication computes
\(\operatorname{MontMul}(\tilde x,\tilde y)=\tilde x\tilde yR^{-1}\bmod m=xyR\bmod m.\) The elliptic-curve series uses this exact contract in its field arithmetic interface: if field elements are stored as \(xR\bmod p\), additions and point formulas must remain representation-preserving.
The REDC identity
Choose
\[m'\equiv -m^{-1}\pmod B.\]For an input \(T< mR\), REDC repeatedly chooses
\[q_i\equiv T_i m'\pmod B\]so that the low limb of \(T+q_i mB^i\) becomes zero. After \(n\) steps, the accumulated value is divisible by \(R\).
Low-limb cancellation
At step \(i\), the current low active limb is \(t_i\). Set \(q_i=t_i m'\bmod B\). Since \(m'm\equiv -1\pmod B\),
\[t_i+q_i m_0\equiv t_i(1+m'm)\equiv 0\pmod B.\]Thus shifting away that limb is exact.
C skeleton
Preconditions for the skeleton: t has at least 2*n + 1 limbs, initially stores \(T<mR\) in its low 2*n limbs, and the extra high limb is zero. The modulus array m has n limbs and is odd. Without the extra limb, the carry propagation past t[i+n] can write out of bounds.
#define BN_MAX_LIMBS 256
void mont_redc(limb_t *r, limb_t *t,
const limb_t *m, limb_t m0inv, uint32_t n) {
/* API contract: n <= BN_MAX_LIMBS and t has 2*n + 1 limbs. */
for (uint32_t i = 0; i < n; i++) {
limb_t q = (limb_t)((dlimb_t)t[i] * m0inv);
dlimb_t carry = 0;
for (uint32_t j = 0; j < n; j++) {
dlimb_t z = (dlimb_t)q * m[j] + t[i+j] + carry;
t[i+j] = (limb_t)z;
carry = z >> BN_LIMB_BITS;
}
for (uint32_t k = i + n; k <= 2u*n; k++) {
dlimb_t z = (dlimb_t)t[k] + carry;
t[k] = (limb_t)z;
carry = z >> BN_LIMB_BITS;
}
}
limb_t high = t[2u*n]; /* under T < mR, high is 0 or 1 */
limb_t tmp[BN_MAX_LIMBS];
for (uint32_t i = 0; i < n; i++) r[i] = t[i+n];
uint32_t borrow = bn_sub_n(tmp, r, m, n);
bn_cmov(r, tmp, r, n, (uint32_t)high | (borrow ^ 1u));
}
The final canonicalization must also account for the extra high limb t[2*n]. The REDC candidate is below \(2m\), but it can still be at least \(B^n\) when \(m\) is close to \(B^n\). In that case the low n limbs alone may look smaller than m, so the high limb must force selection of the subtracted result.
The carry propagation loop has a public bound and continues through the extra limb even after carry becomes zero. Its safety is part of the REDC invariant: under \(0\le T<mR\), the represented intermediate stays below \(2mR<2R^2\), so no limb beyond t[2*n] is required. Production code should make BN_MAX_LIMBS an enforced compile-time or API-level bound.
Computing m0inv
If m[0] is odd, compute \(m'=-m_0^{-1}\bmod B\). Newton iteration doubles correct bits:
Example: RSA-style modulus
For a 2048-bit odd modulus under the 32-bit-only profile, \(w=16\), \(n=128\), and \(R=2^{2048}\). Every modular multiplication in exponentiation can be performed as Montgomery multiplication after converting the base to \(xR\bmod m\) and converting the final result by multiplying by 1.
Example: Diffie-Hellman prime field
For a finite-field Diffie-Hellman group modulo a fixed 2048-bit prime, Montgomery constants \(R^2\bmod p\) and \(m'\) are precomputed once. The fixed modulus makes the reduction schedule and table sizes public.
SageMath vector generator
def mont_params(m, w, n):
B = 2^w
R = B^n
print(gcd(m, B) == 1 and m < R)
mp = (-inverse_mod(m, B)) % B
return B, R, mp, (R^2 % m)
m = 2^127 - 1
B, R, mp, R2 = mont_params(m, 16, 8)
x, y = 123456789, 987654321
xt = x*R % m
yt = y*R % m
print((xt * yt * inverse_mod(R, m)) % m == (x*y*R) % m)
print(hex(mp), hex(R2))
# Small REDC case where the final high limb matters.
B, R, m, T = 16, 16, 9, 121
mp = (-inverse_mod(m, B)) % B
q = (T*mp) % B
A = (T + q*m) // R
print(A, A // R, A % R, A - m)
Range requirement
A standard REDC proof assumes \(0\le T<mR\) and yields a candidate \(A=(T+qm)/R\) with \(0\le A<2m\). If multiplication produces \(T<m^2\) and \(m<R\), then \(T<mR\) holds. Because \(A\) may lie in \([R,2m)\), an implementation stores the low \(n\) limbs together with a high carry limb and canonicalizes by selecting \(A-m\) whenever that high limb is set or the low-limb subtraction does not borrow.
