Tanglee’s Blog

Fast Fourier Transform and Number Theoretic Transform

2026-04-29T00:00:00+08:00

tl;dr: The Fast Fourier Transform (FFT) is generally credited to Cooley and Tukey in 1965, although its earliest ideas can be traced back to an unpublished manuscript by Gauss from around 1805. FFT is one of the foundational algorithms behind modern high-performance computation: it accelerates integer multiplication and polynomial multiplication, and it appears on the list of the top ten algorithms of the twentieth century compiled by Computing in Science & Engineering, a magazine co-published by the IEEE Computer Society. Current NIST post-quantum standards such as Kyber, Dilithium, and Falcon all rely on FFT or its finite-field analogue, the Number Theoretic Transform (NTT). In addition, NTT is a critical acceleration primitive in practical zero-knowledge proof systems such as Plonk and in fully homomorphic encryption schemes such as BFV and TFHE. This article explains the mathematical theory and practical value of FFT and NTT in detail.

Disclaimer: This article is the English counterpart automatically generated from the original Chinese blog by Codex + GPT-5. The translation aims to preserve the original meaning, structure, and technical details as faithfully as possible. If there is any ambiguity or inaccuracy, please refer to the original Chinese version.

Fast Fourier Transform, CP-Algorithms: https://cp-algorithms.com/algebra/fft.html.
A note on NTT definitions and implementations: https://eprint.iacr.org/2024/585.pdf.
Number Theoretic Transform, Cryptography Caffe: https://cryptographycaffe.sandboxaq.com/posts/ntt-02/.
Survey reference: https://arxiv.org/pdf/2211.13546.

Discrete Fourier Transform

Let an $n-1$ degree polynomial be written as

\[A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}\]

In particular, we assume that the polynomial degree bound, or equivalently the length of the coefficient vector, is a power of two, $n = 2^t$. In the non-power-of-two case, we can pad the higher-degree coefficients with zeros until the coefficient-vector length becomes a power of two. Let the $n$-th roots of unity be $w_{n,k} = e^{\frac{2 k \pi i }{n}}$, where $k \in [0..n-1]$, and fix the primitive root of unity $w_{n} = w_{n, 1} = e^{\frac{2 \pi i }{n}}$, so that $w_{n,k} = w_n^k$. They all satisfy $x^n = 1$.

The coefficient-vector representation of a polynomial is the most common one, namely $\vec{A} = (a_0, a_1, \ldots, a_{n-1})$ above. The discrete Fourier transform is a special evaluation representation: it represents the polynomial as a vector of evaluations at the special $n$-th roots of unity:

\[\begin{aligned} \hat{A} &= \mathsf{DFT}(\vec A) = \mathsf{DFT}(a_0, a_1, \dots, a_{n-1})\\ &= (A(w_{n, 0}), A(w_{n, 1}), \dots, A(w_{n, n-1})) \\ &= (A(w_n^0), A(w_n^1), \dots, A(w_n^{n-1})) \\ &:= (y_0, y_1, \dots, y_{n-1}) \\ \end{aligned}\]

The inverse discrete Fourier transform essentially converts the evaluation representation of a polynomial back into the usual coefficient-vector form. This transformation is also better known as Lagrange interpolation for polynomials. Thus, the (inverse) discrete Fourier transform is an algorithm for converting between these two representations, namely the following maps:

\[\begin{cases} \mathsf{DFT}_{n}: \underbrace{(a_0, a_1, \ldots, a_{n-1})}_{\text{coefficient form}} \mapsto \underbrace{(y_0, y_1, \ldots, y_{n-1})}_{\text{evaluation form}} \\ \mathsf{iDFT}_{n}: \underbrace{(y_0, y_1, \ldots, y_{n-1})}_{\text{evaluation form}} \mapsto \underbrace{(a_0, a_1, \ldots, a_{n-1})}_{\text{coefficient form}} \\ \end{cases}\]

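Let $A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}$ and $B(x) = b_0 x^0 + b_1 x^1 + \dots + b_{n-1} x^{n-1}$ be arbitrary polynomials of degree at most $n-1$ over a commutative ring that contains the required roots of unity. Because evaluation at a fixed point is a ring homomorphism, evaluating both polynomials and their product at the same set of points gives: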

\[\mathsf{DFT}(A(x)) \circ \mathsf{DFT}(B(x)) = \mathsf{DFT}(A(x) \cdot B(x))\]

Here $\circ$ denotes component-wise vector multiplication, which can be computed in $\mathcal{O}(n)$ time. If we can compute the discrete Fourier transform $\mathsf{DFT}$ and its inverse $\mathsf{iDFT}$ in $\mathcal{O}(n \log n)$ time, then we can also multiply polynomials in coefficient-vector form in $\mathcal{O}(n \log n)$ time. Let $m \ge 2n - 1$ be the transform length. In FFT, $m$ is usually chosen as the smallest power of two not smaller than $2n-1$. Then:

\[A(x) \cdot B(x) = \mathsf{iDFT}_{m} \left(\mathsf{DFT}_{m} \left(A\left(x\right)\right) \circ \mathsf{DFT}_{m} \left(B\left(x\right)\right)\right)\]

This is the core idea behind using the Fast Fourier Transform and the Number Theoretic Transform to accelerate polynomial multiplication and integer multiplication. In the computation above, we need to zero-pad the polynomial coefficients to length $m$, because the final product $A(x)\cdot B(x)$ has degree $2(n - 1)$ and therefore has $2n-1$ coefficients. A vector of dimension at least $2n-1$ is needed to recover $A(x)\cdot B(x)$ completely.
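The evaluate, multiply pointwise, interpolate pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not a fast implementation: the helper names `dft` and `poly_mul_via_dft` are hypothetical, the transform is the naive $\mathcal{O}(m^2)$ one, and it assumes integer coefficients small enough that floating-point rounding recovers them exactly.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(m^2) transform with the article's positive-exponent convention."""
    m = len(values)
    sign = -1 if inverse else 1
    out = []
    for k in range(m):
        total = 0j
        for t, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * k * t / m)
        out.append(total / m if inverse else total)
    return out

def poly_mul_via_dft(a, b):
    """Evaluate both inputs, multiply pointwise, then interpolate."""
    result_len = len(a) + len(b) - 1      # 2n - 1 when both inputs have n coefficients
    m = 1
    while m < result_len:                 # smallest power of two >= result_len
        m *= 2
    fa = dft(list(a) + [0] * (m - len(a)))
    fb = dft(list(b) + [0] * (m - len(b)))
    pointwise = [x * y for x, y in zip(fa, fb)]
    coeffs = dft(pointwise, inverse=True)
    # Inputs are integers, so rounding removes the floating-point noise.
    return [round(c.real) for c in coeffs[:result_len]]

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
product = poly_mul_via_dft([1, 2, 3], [4, 5])
print(product)
```

Replacing the naive `dft` with an $\mathcal{O}(m \log m)$ transform leaves the rest of the pipeline unchanged, which is why the transform is the only step that needs to be accelerated.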

Convolution and Fourier Transform

In communications, the Fourier transform (Continuous Time Fourier Transform) is usually a powerful tool for studying continuous signals. It converts continuous time-domain information into frequency information, or a spectrum:

\[S(f) = \int_{-\infty}^{\infty} s(t) \cdot e^{-i2\pi ft} \, dt\]

On existing computers, however, it is impossible to represent a fully continuous time-domain signal; only finitely many samples can be stored. Therefore, the discrete version has greater practical value, and this leads naturally to the discrete Fourier transform.

Discrete Fourier Transform (DFT) converts a sequence of complex numbers $\{x_n\} := x_0, x_1, \dots, x_{N-1}$ into another sequence of complex numbers of the same length $\{X_k\} := X_0, X_1, \dots, X_{N-1}$. Its forward transform is mathematically defined as follows:
\[X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-i 2\pi \frac{k}{N} n}, \quad k = 0, \dots, N-1\]
Here $x_n$ is the sampled signal in the time domain, $X_k$ is the frequency component in the frequency domain, and $N$ is the sequence length. The term $e^{-i 2\pi \frac{k}{N} n}$ is the complex exponential basis function, which can be expanded by Euler’s formula as $\cos(2\pi \frac{k}{N} n) - i \sin(2\pi \frac{k}{N} n)$.
Inverse Discrete Fourier Transform (Inverse DFT) recovers the time-domain sequence from the frequency-domain sequence:
\[x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cdot e^{i 2\pi \frac{k}{N} n}, \quad n = 0, \dots, N-1\]
Note that signal-processing conventions usually use a negative exponent for the forward DFT, whereas this article uses a positive exponent convention from the polynomial-evaluation perspective: $y_k=\sum_{j=0}^{n-1}a_j w_n^{kj}$. The two directions are conjugate to each other; what matters is that the forward and inverse transforms are used consistently.

Switching back to the polynomial perspective, we obtain the following somewhat imprecise analogy:

Time-domain representation	After the discrete Fourier transform	Practical meaning of the Fourier transform
Continuous time-domain signal of a wave	Spectrum information (frequency, amplitude, phase)	Frequency decomposition and component analysis of waves: decompose a superposition of waves into sinusoidal waves of single frequencies, making it easier to compute superposition and spectrum information
Coefficient vector of a polynomial	Evaluation form of the polynomial	Accelerating polynomial multiplication: by analogy with wave superposition, it enables fast convolution, i.e. multiplication

In essence, convolution and multiplication are equivalent: multiplying two polynomials is exactly taking the linear convolution of their coefficient sequences. To make the later discussion of NTT more convenient, we directly consider the polynomial ring $\mathbb{Z}_q[x]$ over the integer quotient ring $\mathbb{Z}_q$ here.

Given two $n-1$ degree polynomials $G(x)$ and $H(x)$ over the commutative ring $\mathbb{Z}_q[x]$, where $q \in \mathbb{Z}$ and $x$ is the polynomial variable, the multiplication of $G(x)$ and $H(x)$ is defined as:

\[Y(x)=G(x) \cdot H(x)=\sum_{k=0}^{2(n-1)} y_k x^k\]

The new coefficient is $y_k=\sum_{i=0}^k g_i h_{k-i} \bmod q$, where $\mathbf{g}$ and $\mathbf{h}$ are the coefficient vectors of the polynomials $G(x)$ and $H(x)$, respectively, and coefficients with index $\ge n$ are taken to be zero.

Let $\mathbf{g} = \{g_0, g_1, \dots, g_{n-1}\}, \mathbf{h} = \{h_0, h_1, \dots, h_{n-1}\}$ be two vectors of length $n$. Their linear convolution $\mathbf{y} = \mathbf{g} * \mathbf{h}$ is defined as:

\[y_k = \sum_{i} g_i h_{k-i}\]

The resulting vector $\mathbf{y}$ has length $2n-1$, and the element index satisfies $k \in \{0, 1, \dots, 2n-2\}$. For each $k$, the summation range must satisfy $0 \le i < n$ and $0 \le k-i < n$.

It is easy to verify that the linear convolution above is equivalent to polynomial multiplication. After a polynomial is transformed into its evaluation form by the discrete Fourier transform, convolution operations become more convenient. Beyond linear convolution, cryptography often uses cyclic convolution:

Positive wrapped convolution (PWC): equivalent to multiplication in the polynomial quotient ring $\mathbb{Z}_q[x] / (x^n - 1)$

Negative wrapped convolution (NWC): equivalent to multiplication in the polynomial quotient ring $\mathbb{Z}_q[x] / (x^n + 1)$

Consider two degree $n - 1$ polynomials $G(x)$ and $H(x)$ in the polynomial quotient ring $\mathbb{Z}_q[x] / (x^n - 1)$, with coefficient vectors $\mathbf{g} = \{g_0, g_1, \dots, g_{n-1}\}, \mathbf{h} = \{h_0, h_1, \dots, h_{n-1}\}$. Their cyclic convolution $\mathbf{y} = \mathbf{g} \circledast \mathbf{h}$ is defined by the $k$-th component:

\[y_k = \sum_{i=0}^{n-1} g_i \cdot h_{(k-i) \pmod n} \\ \iff y_k = \sum_{i=0}^{k} g_i \cdot h_{k-i} + \sum_{i=k + 1}^{n-1} g_i \cdot h_{k + n - i}\]

where $k \in \{0, 1, \dots, n-1\}$. The equivalent polynomial expression of this vector computation is:

\[Y(x) = G(x) \cdot H(x) \pmod{x^n - 1}\]

Consider two degree $n - 1$ polynomials $G(x)$ and $H(x)$ in the quotient ring $\mathbb{Z}_q[x] / (x^n + 1)$, with coefficient vectors $\mathbf{g} = \{g_0, g_1, \dots, g_{n-1}\}, \mathbf{h} = \{h_0, h_1, \dots, h_{n-1}\}$. Their negacyclic convolution $\mathbf{y} = \mathbf{g} \star \mathbf{h}$ is defined by the $k$-th component:

\[y_k = \left( \sum_{i=0}^{k} g_i h_{k-i} - \sum_{i=k+1}^{n-1} g_i h_{k+n-i} \right)\]

where $k \in \{0, 1, \dots, n-1\}$. The equivalent polynomial expression of this vector computation is:

\[Y(x) = G(x) \cdot H(x) \pmod{x^n + 1}\]

Negacyclic convolution, also often called Negative Wrapped Convolution (NWC), is one of the core acceleration operations in lattice-based cryptography such as Kyber and Dilithium, as well as in fully homomorphic encryption.
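The three convolutions defined in this section differ only in what happens to the terms of degree $\ge n$. The following sketch computes the schoolbook linear convolution and then folds the high part back with a plus sign (PWC) or a minus sign (NWC). The function names and the toy parameters $q = 17$, $n = 4$ are illustrative assumptions, not taken from any particular scheme.

```python
def linear_convolution(g, h, q):
    """Schoolbook product of two coefficient vectors over Z_q."""
    y = [0] * (len(g) + len(h) - 1)
    for i, gi in enumerate(g):
        for j, hj in enumerate(h):
            y[i + j] = (y[i + j] + gi * hj) % q
    return y

def wrapped_convolution(g, h, q, sign):
    """sign=+1 reduces mod x^n - 1 (PWC); sign=-1 reduces mod x^n + 1 (NWC)."""
    n = len(g)
    assert len(h) == n
    full = linear_convolution(g, h, q)
    y = full[:n]
    for k in range(n, len(full)):
        # x^k = (+/-1) * x^(k - n) in the respective quotient ring
        y[k - n] = (y[k - n] + sign * full[k]) % q
    return y

q = 17
g = [1, 2, 3, 4]
h = [5, 6, 7, 8]
linear = linear_convolution(g, h, q)        # [5, 16, 0, 9, 10, 1, 15]
pwc = wrapped_convolution(g, h, q, +1)      # [15, 0, 15, 9]
nwc = wrapped_convolution(g, h, q, -1)      # [12, 15, 2, 9]
print(linear, pwc, nwc)
```

Both wrapped results have length $n$, while the linear result has length $2n - 1$; the top coefficient $y_{n-1}$ is the same in all three because no higher-degree term folds onto it.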

Fast Fourier Transform

How can we implement $\mathcal{O}(n \log n)$ algorithms for $\mathsf{DFT}$ and $\mathsf{iDFT}$? Evaluating a polynomial with $n$ coefficients at a single point costs $\mathcal{O}(n)$, so the naive $\mathsf{DFT}$, which evaluates at $n$ points, has complexity $\mathcal{O}(n^2)$. The naive Lagrange interpolation algorithm also has complexity $\mathcal{O}(n^2)$. The core of the Fast Fourier Transform lies in the special root-of-unity basis vector used by the evaluation representation:

\[\vec w = (w_{n,0}, w_{n,1}, \ldots, w_{n,n-1}) = (w_n^0, w_n^1, \ldots, w_n^{n-1})\]

The central algorithmic idea is divide and conquer. We know that:

\[\begin{aligned} A(x) &= a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1} \\ &= a_0 x^0 + a_2 x^2 + \dots + a_{n-2} x^{n-2} + x(a_1 x^0 + a_3 x^2 + \dots + a_{n-1}x^{n-2}) \\ &= A_0(x^2) + xA_1(x^2) \end{aligned}\]

Here $A_0(x), A_1(x)$ are both polynomials with only $\frac{n}{2}$ coefficients, satisfying:

\[\begin{aligned} A_0(x) &= a_0 x^0 + a_2 x^1 + \dots + a_{n-2} x^{\frac{n}{2}-1} \\ A_1(x) &= a_1 x^0 + a_3 x^1 + \dots + a_{n-1} x^{\frac{n}{2}-1} \end{aligned}\]

DFT Algorithm $\mathcal{O}(n \log n)$

Given the coefficient vector $\vec{A} = (a_0, a_1, \ldots, a_{n-1})$ of an $n-1$ degree polynomial, corresponding to the polynomial

\[A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}.\]

How can we compute its values $\left(y_0, y_1, \ldots, y_{n-1}\right)$ at the $n$-th roots of unity $\vec w = (w_{n,0}, w_{n,1}, \ldots, w_{n,n-1})$ in $\mathcal{O}(n \log n)$ time, where $y_i = A\left(w_{n,i}\right)$?

Let $T_{\mathsf{DFT}}(n)$ denote the time complexity of computing the discrete Fourier transform of a polynomial with $n$ coefficients (degree $n-1$). From the decomposition $A(x) = A_0(x^2) + xA_1(x^2)$, if we can obtain the $\mathsf{DFT}$ vector of $A$ in $\mathcal{O}(n)$ time from the known $\mathsf{DFT}$ vectors of $A_0$ and $A_1$, then the complexity satisfies the following recurrence:

\[T_{\mathsf{DFT}}(n) = 2T_{\mathsf{DFT}}(\frac{n}{2}) + \mathcal{O}(n)\]

By the Master theorem for recurrences, the final time complexity of this recursive algorithm is $\mathcal{O}(n \log n)$. A key observation is that squaring the vector of $n$-th roots of unity, $\vec w^2 = (w_n^0, w_n^2, \ldots, w_n^{2(n-1)})$, gives exactly the $\frac{n}{2}$-th roots of unity, each appearing twice. Therefore, evaluating $A_0$ and $A_1$ at these squared points is again a DFT, now of length $\frac{n}{2}$, so the subproblems have precisely the same form as the original problem for $A(x)$. More concretely, suppose we already know $A_0(x), A_1(x)$ and their discrete Fourier transforms:

\[\begin{cases} \left(y_k^0\right)_{k=0}^{n/2-1} = \mathsf{DFT}(A_0) \\ \left(y_k^1\right)_{k=0}^{n/2-1} = \mathsf{DFT}(A_1) \end{cases}\]

Using the special properties of roots of unity:

\[\begin{cases} w_{n}^{2k} = e^{\frac{2\pi k i}{n/2}} = w_{n/2}^{k} & k \in [0, n/2 - 1] \\ w_{n}^{k + \frac{n}{2}}= - w_{n}^{k} & k \in [0, n - 1] \end{cases}\]

Therefore, the $n$ evaluation values of $\mathsf{DFT}(A)$ can be recovered as follows:

\[\begin{cases} y_k = A_0(w_n^{2k}) + w_n^{k} \cdot A_1(w_n^{2k}) = y_k^0 + w_n^k y_k^1, & k = 0, \ldots, \frac{n}{2} - 1. \\ y_k = A_0(w_n^{2k}) + w_n^{k} \cdot A_1(w_n^{2k}) = y_{k \bmod \frac{n}{2}}^{0} + w_n^{k} y_{k \bmod \frac{n}{2}}^{1} & k = \frac{n}{2}, \ldots, {n} - 1. \\ \end{cases}\]

Written more elegantly:

\[\begin{cases} y_k &= y_k^0 + w_n^k y_k^1, &\quad k = 0 \dots \frac{n}{2} - 1, \\ y_{k+n/2} &= y_k^0 - w_n^k y_k^1, &\quad k = 0 \dots \frac{n}{2} - 1. \end{cases}\]

This formula is also called the butterfly formula. The whole recursive expression is quite elegant: using the butterfly formula, one only needs $\mathcal{O}(n)$ time to recover the DFT of $A$ from the DFTs of $A_0$ and $A_1$. In summary, we have obtained a recursive $\mathcal{O}(n \log n)$ algorithm for the discrete Fourier transform $\mathsf{DFT}$.
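The recursion and the butterfly formula translate almost line by line into code. The sketch below assumes the input length is a power of two and uses the article's positive-exponent convention; `fft` and `naive_dft` are hypothetical helper names, and the naive version exists only as a reference for comparison.

```python
import cmath

def fft(a):
    """Recursive DFT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    y0 = fft(a[0::2])    # DFT of A_0 (even-indexed coefficients)
    y1 = fft(a[1::2])    # DFT of A_1 (odd-indexed coefficients)
    y = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)    # twiddle factor w_n^k
        y[k] = y0[k] + w * y1[k]                # butterfly, first half
        y[k + n // 2] = y0[k] - w * y1[k]       # butterfly, second half
    return y

def naive_dft(a):
    """Direct O(n^2) evaluation at w_n^0, ..., w_n^(n-1)."""
    n = len(a)
    return [sum(a[t] * cmath.exp(2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

coeffs = [3, 1, 4, 1, 5, 9, 2, 6]
fast = fft(coeffs)
slow = naive_dft(coeffs)
max_error = max(abs(x - y) for x, y in zip(fast, slow))
print(max_error)    # expected to be tiny, limited only by floating-point rounding
```

Each level of the recursion performs $n/2$ butterflies, and there are $\log_2 n$ levels, which matches the $\mathcal{O}(n \log n)$ bound from the recurrence.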

iDFT Algorithm $\mathcal{O}(n \log n)$

Given the values $\left(y_0, y_1, \ldots, y_{n-1}\right)$ of an $n-1$ degree polynomial $A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}$ at the $n$-th roots of unity $\vec w = (w_{n,0}, w_{n,1}, \ldots, w_{n,n-1})$, where $y_i = A\left(w_{n,i}\right)$, how can we compute its coefficient vector $\vec{A} = (a_0, a_1, \ldots, a_{n-1})$ in $\mathcal{O}(n \log n)$ time?

Simply put, this is polynomial interpolation. Lagrange interpolation can compute it in $\mathcal{O}(n^2)$ time. Essentially, this is solving a system of linear equations:

\[\underbrace{ \begin{pmatrix} w_n^0 & w_n^0 & w_n^0 & w_n^0 & \cdots & w_n^0 \\ w_n^0 & w_n^1 & w_n^2 & w_n^3 & \cdots & w_n^{n-1} \\ w_n^0 & w_n^2 & w_n^4 & w_n^6 & \cdots & w_n^{2(n-1)} \\ w_n^0 & w_n^3 & w_n^6 & w_n^9 & \cdots & w_n^{3(n-1)} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ w_n^0 & w_n^{n-1} & w_n^{2(n-1)} & w_n^{3(n-1)} & \cdots & w_n^{(n-1)(n-1)} \end{pmatrix} }_{\mathbf{V} \in \mathbb{C}^{n \times n}} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{n-1} \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \end{pmatrix}\]

Here $\mathbf{V} \in \mathbb{C}^{n \times n}$ is the Vandermonde matrix. Its inverse is:

\[\mathbf{V}^{-1} = \frac{1}{n} \begin{pmatrix} w_n^0 & w_n^0 & w_n^0 & w_n^0 & \cdots & w_n^0 \\ w_n^0 & w_n^{-1} & w_n^{-2} & w_n^{-3} & \cdots & w_n^{-(n-1)} \\ w_n^0 & w_n^{-2} & w_n^{-4} & w_n^{-6} & \cdots & w_n^{-2(n-1)} \\ w_n^0 & w_n^{-3} & w_n^{-6} & w_n^{-9} & \cdots & w_n^{-3(n-1)} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ w_n^0 & w_n^{-(n-1)} & w_n^{-2(n-1)} & w_n^{-3(n-1)} & \cdots & w_n^{-(n-1)(n-1)} \end{pmatrix} \\ \implies \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{n-1} \end{pmatrix} = \mathbf{V}^{-1} \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \end{pmatrix}\]

Therefore, the Lagrange interpolation formula in this special form is also very elegant. Reading off row $k$ of $\mathbf{V}^{-1}$, we can directly express $a_{k}$ as the sum:

\[a_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j w_n^{-k j}\]

This gives us a problem almost identical to the expression $y_k = \sum_{j=0}^{n-1} a_j w_n^{k j}$ for $\mathsf{DFT}$. The key changes are:

\[\begin{cases} 1 &\implies \frac{1}{n} \\ w_n^{k j} &\implies w_n^{-k j} \end{cases}\]

The recursive algorithm from the previous section applies equally well in this setting. In summary, we have obtained a recursive $\mathcal{O}(n \log n)$ algorithm for the inverse discrete Fourier transform $\textsf{iDFT}$.
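As a sketch of the two changes listed above, the same butterfly recursion can be parameterized by the sign of the exponent, with the $\frac{1}{n}$ scaling applied once at the end. The names `butterfly_transform`, `dft`, and `idft` are illustrative, and the input length is assumed to be a power of two.

```python
import cmath

def butterfly_transform(values, sign):
    """Unscaled recursive transform; sign=+1 uses w_n, sign=-1 uses w_n^(-1)."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = butterfly_transform(values[0::2], sign)
    odd = butterfly_transform(values[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def dft(a):
    return butterfly_transform(a, +1)

def idft(y):
    n = len(y)
    # Same recursion with inverted twiddles, followed by the 1/n scaling.
    return [v / n for v in butterfly_transform(y, -1)]

coeffs = [3, 1, 4, 1, 5, 9, 2, 6]
recovered = idft(dft(coeffs))
rounded = [round(c.real) for c in recovered]
print(rounded)    # [3, 1, 4, 1, 5, 9, 2, 6]
```

The scaling is deliberately kept outside the recursion: dividing inside every level would work too, but applying $\frac{1}{n}$ once keeps the forward and inverse code paths identical apart from the sign.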

The core of FFT acceleration: The fundamental reason FFT is fast is the periodicity of roots of unity:
\[w_{n}^{n} = 1, \quad w_{n}^{\frac{n}{2}} = -1\]
This allows many computations to be reused, which is the essence of the recursive acceleration. In the later discussion of NTT, we will further explain how to reuse computation through the periodicity of roots of unity.

Fast Number Theoretic Transform

In cryptography, we usually care about polynomials over integer rings. More specifically, we care about polynomials over the integer quotient ring $\mathbb{Z}_{q}$, and in most cases we regard $q$ as a prime. In this section, all multiplication operations are explained through convolution, namely the following correspondence:

Linear Convolution	Cyclic Convolution / Positive Wrapped Convolution	Negacyclic Convolution
Multiplication in $\mathbb{Z}_q[x]$	Multiplication in $\mathbb{Z}_q[x] / (x^n - 1)$	Multiplication in $\mathbb{Z}_q[x] / (x^n + 1)$

Over integer quotient rings, we need to find a root of unity with the same properties as $e^{\frac{2\pi i}{n}}$ in the discrete Fourier transform: the primitive root of unity over $\mathbb{Z}_{q}$ defined below.

We call $w$ a primitive $n$-th root of unity over $\mathbb{Z}_{q}$ if and only if it satisfies:

\[w^n \equiv 1 \bmod q, \text{ and } w^i \not\equiv 1 \bmod q, \forall i \in [1, n-1]\]
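The definition can be checked by brute force for small parameters. The sketch below assumes $q$ is prime, in which case the multiplicative group of $\mathbb{Z}_q$ is cyclic of order $q - 1$ and a primitive $n$-th root of unity exists exactly when $n$ divides $q - 1$. The function names and the toy modulus $q = 17$ are illustrative choices; real implementations use precomputed constants rather than a search.

```python
def is_primitive_root_of_unity(w, n, q):
    """Check w^n = 1 and w^i != 1 for 1 <= i < n, all modulo q."""
    if pow(w, n, q) != 1:
        return False
    return all(pow(w, i, q) != 1 for i in range(1, n))

def find_primitive_root_of_unity(n, q):
    """Smallest primitive n-th root of unity modulo a prime q, or None."""
    if (q - 1) % n != 0:
        return None                  # for prime q, existence requires n | q - 1
    for w in range(1, q):
        if is_primitive_root_of_unity(w, n, q):
            return w
    return None

q = 17
omega = find_primitive_root_of_unity(8, q)      # 2
phi = find_primitive_root_of_unity(16, q)       # 3
missing = find_primitive_root_of_unity(32, q)   # None, since 32 does not divide 16
print(omega, phi, missing)
```

Note that primitive roots of unity are not unique: here $3^2 = 9$ is another primitive 8th root of unity modulo 17, different from the smallest one found by the search.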

Linear / Positive Wrapped Convolution

Let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$, and let $A(x)$ be an $n-1$ degree polynomial over $\mathbb{Z}_q[x]$. The Number Theoretic Transform (NTT) of its coefficient vector $\mathbf{a}$ is defined as $\hat{\mathbf{a}} = \textsf{NTT}^{\omega}(\mathbf{a})$:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \omega^{ij} \mathbf{a}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

In particular, we know that $\hat{\mathbf{a}}_j = A(\omega^j) \pmod{ q }$.

Let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$. The inverse Number Theoretic Transform (iNTT) of an $n$-dimensional evaluation vector $\hat{\mathbf{a}}$ is defined as $\mathbf{a} = \textsf{iNTT}^{\omega}(\hat{\mathbf{a}})$:

\[\mathbf{a}_j = \frac{1}{n} \sum_{i=0}^{n-1} \omega^{-ij} \hat{\mathbf{a}}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

It is easy to verify that the two matrices corresponding to the expressions above in $\hat{\mathbf{a}}$ and ${\mathbf{a}}$ are inverses of each other:

\[\mathbf{a} = \textsf{iNTT}^{\omega}\left(\textsf{NTT}^{\omega}\left(\mathbf{a}\right)\right)\]

Thus, linear convolution can be computed via NTT as follows. Note that when computing linear convolution in $\mathbb{Z}_q[x]$, one should choose a transform length $m \ge 2n-1$, zero-pad the inputs to length $m$, and take $\omega$ to be a primitive $m$-th root of unity:

\[\mathbf{c} = \mathbf{a} * \mathbf{b} = \textsf{iNTT}^{\omega}\left(\textsf{NTT}^{\omega}\left(\mathbf{a}\right) \circ \textsf{NTT}^{\omega}\left(\mathbf{b}\right)\right)\]
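A direct transcription of the NTT and iNTT definitions makes the zero-padding requirement concrete. This sketch uses the naive $\mathcal{O}(n^2)$ sums, not a fast transform; the helper names are hypothetical, and the toy parameters $q = 17$, $m = 8$, $\omega = 2$ (a primitive 8th root of unity modulo 17) are assumptions chosen for readability.

```python
def ntt(a, omega, q):
    """Naive NTT: a_hat[j] = sum_i omega^(i*j) * a[i] mod q."""
    n = len(a)
    return [sum(pow(omega, i * j, q) * a[i] for i in range(n)) % q for j in range(n)]

def intt(a_hat, omega, q):
    """Naive inverse NTT using omega^(-1) and n^(-1) modulo q."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)            # modular inverse via pow needs Python 3.8+
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(pow(omega_inv, i * j, q) * a_hat[i] for i in range(n)) % q
            for j in range(n)]

def linear_convolution_ntt(a, b, omega, m, q):
    """Zero-pad to length m >= len(a) + len(b) - 1; omega is a primitive m-th root."""
    out_len = len(a) + len(b) - 1
    fa = ntt(list(a) + [0] * (m - len(a)), omega, q)
    fb = ntt(list(b) + [0] * (m - len(b)), omega, q)
    pointwise = [x * y % q for x, y in zip(fa, fb)]
    return intt(pointwise, omega, q)[:out_len]

q, m, omega = 17, 8, 2
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
roundtrip = intt(ntt(a + [0] * 4, omega, q), omega, q)
c = linear_convolution_ntt(a, b, omega, m, q)
print(c)    # [5, 16, 0, 9, 10, 1, 15]
```

The result agrees with the schoolbook product reduced modulo 17 because the padded length 8 is large enough that no coefficient wraps around.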

The acceleration techniques from FFT can be transferred completely to NTT and iNTT. Earlier, we mentioned that if we want to compute linear convolution over $\mathbb{Z}_q[x]$, the dimension of $\mathbf{c}$ should be $2n - 1$. If we only use a primitive $n$-th root of unity and a transform of length $n$, then we do not obtain linear convolution, but cyclic convolution. From the polynomial perspective, the pointwise products are still the genuine values $A(\omega^i) \cdot B(\omega^i) \bmod q$, but the coefficients of monomials of degree $\ge n$ have been cyclically accumulated into lower-degree coefficients. This is because every evaluation point $x = \omega^i$ satisfies $x^n = 1$ and hence $x^{n + k} = x^k$, which corresponds to the congruence:

\[x^{n+k} \equiv x^k \pmod {x^{n} - 1}\]

That is, the result is reduced modulo the polynomial ${x^{n} - 1}$. In coefficient terms, the true higher-degree monomial coefficients are cyclically accumulated into lower-degree monomial coefficients, which is exactly the expression for positive wrapped convolution:

\[y_k = \sum_{i=0}^{k} g_i \cdot h_{k-i} + \sum_{i=k + 1}^{n-1} g_i \cdot h_{k + n - i}\]

Let $\textsf{NTT}_{n}^{\omega}(\cdot)$ denote the number theoretic transform acting on an $n$-dimensional vector using the primitive $n$-th root of unity $\omega$. Unless otherwise specified, we omit the parameter $n$ and assume it matches the actual vector dimension, writing it simply as $\textsf{NTT}^{\omega}(\cdot)$. We then obtain the following proposition.

Let $\mathbf{a}, \mathbf{b}$ be two $n$-dimensional vectors over $\mathbb{Z}_q$, corresponding to two degree $n-1$ polynomials, and let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$. Their positive wrapped convolution can be computed by the following number theoretic transforms:

\[\mathbf{c} = \mathbf{a} \circledast \mathbf{b} = \textsf{iNTT}^{\omega}\left(\textsf{NTT}^{\omega}\left(\mathbf{a}\right) \circ \textsf{NTT}^{\omega}\left(\mathbf{b}\right)\right)\]

The more essential point is that the evaluation points $\omega^0, \omega^1, \ldots, \omega^{n-1}$ used by the NTT all satisfy:

\[x^n = 1 \iff x^n - 1= 0\]

Therefore, the final result is plainly equivalent to the result after reducing modulo the polynomial $x^n - 1$. This gives a more intuitive understanding of PWC and also helps us understand the mathematical intuition behind NWC in the next section.
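To see the wrap-around numerically, the sketch below runs a length-$n$ transform without any zero padding and compares the result with the schoolbook product folded modulo $x^n - 1$. The helper names and the toy parameters $q = 17$, $n = 4$, $\omega = 4$ (a primitive 4th root of unity modulo 17) are illustrative assumptions.

```python
def ntt(a, omega, q):
    n = len(a)
    return [sum(pow(omega, i * j, q) * a[i] for i in range(n)) % q for j in range(n)]

def intt(a_hat, omega, q):
    n = len(a_hat)
    n_inv, omega_inv = pow(n, -1, q), pow(omega, -1, q)   # Python 3.8+
    return [n_inv * sum(pow(omega_inv, i * j, q) * a_hat[i] for i in range(n)) % q
            for j in range(n)]

def pwc_via_ntt(a, b, omega, q):
    """Length-n transform, no padding: the result is reduced modulo x^n - 1."""
    pointwise = [x * y % q for x, y in zip(ntt(a, omega, q), ntt(b, omega, q))]
    return intt(pointwise, omega, q)

def pwc_by_folding(a, b, q):
    """Reference: schoolbook product, then add coefficient k + n onto coefficient k."""
    n = len(a)
    full = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            full[i + j] += a[i] * b[j]
    return [(full[k] + (full[k + n] if k + n < len(full) else 0)) % q for k in range(n)]

q, omega = 17, 4
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c_ntt = pwc_via_ntt(a, b, omega, q)
c_ref = pwc_by_folding(a, b, q)
print(c_ntt, c_ref)    # both [15, 0, 15, 9]
```

Compared with linear convolution, the only change is the transform length: the same three steps produce the product in $\mathbb{Z}_q[x] / (x^n - 1)$ when the length equals $n$.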

Negacyclic Convolution

Next, consider how to compute negacyclic convolution. From the expression

\[y_k = \sum_{i=0}^{k} g_i \cdot h_{k-i} - \sum_{i=k + 1}^{n-1} g_i \cdot h_{k + n - i}\]

we see that coefficients of monomials of degree $\ge n$ are again accumulated into the corresponding lower-degree terms after reducing degrees modulo $n$, except that their contribution to the coefficients is negative. Thus, we want the relation $x^{n + k} = -x^{k}$ to hold at every evaluation point. In other words, the root $\varphi$ for this NTT should satisfy $\varphi^{n} = - 1$, so $\varphi$ is a primitive $2n$-th root of unity. However, simply replacing $\omega$ in positive wrapped convolution by $\varphi$ does not produce negacyclic convolution. The twiddle factors in the standard NTT are the sequence $\omega^0, \omega^1, \omega^2, \ldots, \omega^{n-1}$, all of which satisfy $x^n = 1$. If we replace it by $\varphi^0, \varphi^1, \varphi^2, \ldots, \varphi^{n-1}$, the evaluation points no longer share a single identity: the even powers satisfy $x^n = 1$ while the odd powers satisfy $x^n = -1$, and a common identity is crucial for the fast NTT. Here I give two mathematical ways to understand the NWC construction.

Let $\varphi$ be a primitive $2n$-th root of unity, and let $\omega$ be a primitive $n$-th root of unity satisfying $\omega = \varphi^2$.

From the sequence perspective, what we need are exactly all roots satisfying $x^n = -1$. There are exactly $n$ such roots, and one can verify that they are precisely the following sequence:

\[\{\varphi^1, \varphi^3, \varphi^5, \ldots, \varphi^{2n-1}\}\]

In other words, the roots of $x^n+1$ are exactly the odd powers of the primitive $2n$-th root of unity $\varphi$. Therefore, the NTT construction based on $\varphi$ is:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{i(2j+1)} a_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

The second idea is to convert NWC into PWC, so that we can reuse the standard NTT defined earlier. To achieve this, we need to transform the coefficients. Substitute $x = \varphi \cdot y$ and define a new polynomial $A'(y) := A(\varphi y)$. When $x^n = -1$, we have $(\varphi y)^n = -1 \Rightarrow \varphi^n y^n = -1$. Since $\varphi^n = -1$, this becomes $-y^n = -1 \Rightarrow y^n = 1$. Therefore, applying the PWC-style NTT to $A'$ gives the negacyclic NTT of the original polynomial.

On coefficients, this mapping is $\mathbf{a}'_i = \mathbf{a}_i \cdot \varphi^i$. Construct the polynomial $A'(x) = \sum \mathbf{a}'_i x^i$ and apply the PWC-style NTT to it:

\[\begin{aligned} \hat{\mathbf{a}}_j &= \sum_{i=0}^{n-1} \omega^{ij} \mathbf{a}'_i \pmod q \\ &= \sum_{i=0}^{n-1} \omega^{ij} \varphi^i \mathbf{a}_i \pmod q \\ &= \sum_{i=0}^{n-1} \varphi^{i(2j + 1)} \mathbf{a}_i \pmod q \end{aligned}\]

where $j = 0, 1, 2, \dots, n-1$.
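The agreement of the two viewpoints can be checked on a small example: twisting the coefficients by $\varphi^i$ and applying the ordinary NTT with $\omega = \varphi^2$ gives the same vector as evaluating directly at the odd powers of $\varphi$. The parameters $q = 17$, $n = 4$, $\varphi = 9$ (a primitive 8th root of unity modulo 17) are toy values assumed for illustration.

```python
def ntt(a, omega, q):
    """PWC-style NTT: a_hat[j] = sum_i omega^(i*j) * a[i] mod q."""
    n = len(a)
    return [sum(pow(omega, i * j, q) * a[i] for i in range(n)) % q for j in range(n)]

q, n, phi = 17, 4, 9
omega = phi * phi % q                # 13, a primitive 4th root of unity mod 17
a = [1, 2, 3, 4]

# Viewpoint 2: scale a_i by phi^i, then apply the standard NTT.
twisted = [pow(phi, i, q) * a[i] % q for i in range(n)]
via_twist = ntt(twisted, omega, q)

# Viewpoint 1: evaluate A directly at the odd powers phi^(2j+1).
direct = [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
          for j in range(n)]

print(via_twist, direct)    # both [16, 11, 13, 15]
```

In practice the twist costs $n$ extra multiplications before the transform; fast implementations often merge these factors into the butterfly twiddles instead of applying them as a separate pass.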

The two viewpoints above yield the same result. We obtain the formal definition of the negacyclic Number Theoretic Transform as follows.

Let $\varphi$ be a primitive $2n$-th root of unity over $\mathbb{Z}_q$. Then $\omega := \varphi^2$ is a primitive $n$-th root of unity over $\mathbb{Z}_q$. Let $A(x)$ be an $n-1$ degree polynomial over $\mathbb{Z}_q[x]$. The number theoretic transform of its coefficient vector $\mathbf{a}$ based on $\varphi$ is defined as $\hat{\mathbf{a}} = \textsf{NTT}^{\varphi}(\mathbf{a})$:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^i \omega^{ij} \mathbf{a}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

Substituting $\omega := \varphi^2$, this is equivalent to:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{i(2j + 1)} \mathbf{a}_i \pmod q\]

Similarly, by inverting the Vandermonde matrix, we can obtain the formula for the inverse negacyclic Number Theoretic Transform. It is worth pointing out that the paper https://eprint.iacr.org/2024/585.pdf contains a major typo in its definition of iNTT.

Let $\varphi$ be a primitive $2n$-th root of unity over $\mathbb{Z}_q$, and let $\omega := \varphi^2$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$. The inverse number theoretic transform based on $\varphi$ of an $n$-dimensional evaluation vector $\hat{\mathbf{a}}$ is defined as $\mathbf{a} = \textsf{iNTT}^\varphi(\hat{\mathbf{a}})$:

\[\mathbf{a}_j = \frac{1}{n} \sum_{i=0}^{n-1} \varphi^{-j} \omega^{-ij} \hat{\mathbf{a}}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

Substituting $\omega := \varphi^2$, this is equivalent to:

\[\mathbf{a}_j = \frac{1}{n} \sum_{i=0}^{n-1} \varphi^{-j(2i + 1)} \hat{\mathbf{a}}_i \pmod q\]

It is easy to verify that the two matrices corresponding to the expressions above in $\hat{\mathbf{a}}$ and ${\mathbf{a}}$ are inverses of each other:

\[\mathbf{a} = \textsf{iNTT}^\varphi \left(\textsf{NTT}^\varphi\left(\mathbf{a}\right)\right)\]

Let $\mathbf{a}, \mathbf{b}$ be two $n$-dimensional vectors over $\mathbb{Z}_q$, corresponding to two degree $n-1$ polynomials, and let $\varphi$ be a primitive $2n$-th root of unity over $\mathbb{Z}_q$. Their negacyclic convolution can be computed by the following number theoretic transforms:

\[\mathbf{c} = \mathbf{a} \star \mathbf{b} = \textsf{iNTT}^{\varphi}\left(\textsf{NTT}^{\varphi}\left(\mathbf{a}\right) \circ \textsf{NTT}^{\varphi}\left(\mathbf{b}\right)\right)\]
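The forward and inverse negacyclic transforms can be transcribed directly from the two definitions above and used to compute a negacyclic convolution. This is a naive $\mathcal{O}(n^2)$ sketch under toy parameters ($q = 17$, $n = 4$, $\varphi = 9$); the function names are hypothetical.

```python
def negacyclic_ntt(a, phi, q):
    """a_hat[j] = sum_i phi^(i*(2j+1)) * a[i] mod q."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def negacyclic_intt(a_hat, phi, q):
    """a[j] = n^(-1) * sum_i phi^(-j*(2i+1)) * a_hat[i] mod q."""
    n = len(a_hat)
    n_inv, phi_inv = pow(n, -1, q), pow(phi, -1, q)   # Python 3.8+
    return [n_inv * sum(pow(phi_inv, j * (2 * i + 1), q) * a_hat[i] for i in range(n)) % q
            for j in range(n)]

def nwc_via_ntt(a, b, phi, q):
    """Product in Z_q[x] / (x^n + 1) via pointwise multiplication."""
    fa = negacyclic_ntt(a, phi, q)
    fb = negacyclic_ntt(b, phi, q)
    return negacyclic_intt([x * y % q for x, y in zip(fa, fb)], phi, q)

q, phi = 17, 9
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
a_hat = negacyclic_ntt(a, phi, q)            # [16, 11, 13, 15]
roundtrip = negacyclic_intt(a_hat, phi, q)   # [1, 2, 3, 4]
c = nwc_via_ntt(a, b, phi, q)                # [12, 15, 2, 9]
print(a_hat, roundtrip, c)
```

No zero padding is involved: both inputs and the output have length $n$, which is one reason the ring $\mathbb{Z}_q[x] / (x^n + 1)$ is convenient for lattice-based schemes.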

The Essence of the Number Theoretic Transform

Returning to the algebraic perspective, the essence of the Number Theoretic Transform is ring decomposition and isomorphism. We take NWC as an example. Let $\varphi$ be a primitive $2n$-th root of unity over $\mathbb{Z}_q$. Then the cyclotomic polynomial $C(X) = X^{n} + 1$ has the following factorization:

\[C(X) = \prod_{i=0}^{n-1} (X - \varphi^{2i + 1})\]

By the Chinese Remainder Theorem, there is a ring isomorphism:

\[\mathbb{Z}_q[X] / (X^n + 1) \cong \prod_{i=0}^{n-1} \mathbb{Z}_q[X] / (X - \varphi^{2i + 1})\]

For each factor $\alpha = \varphi^{2i + 1}$, we have $\mathbb{Z}_q[X] / (X - \alpha) \cong \mathbb{Z}_q$ through the map $X \mapsto \alpha$, i.e. evaluating the polynomial at the point $\alpha$. Therefore, the isomorphism above can be further simplified as:

\[\mathbb{Z}_q[X] / (X^n + 1) \cong \underbrace{\mathbb{Z}_q \times \mathbb{Z}_q \times \dots \times \mathbb{Z}_q}_{n} \cong \mathbb{Z}_q^n\]

For a polynomial $A(X) \in \mathbb{Z}_q[X] / (X^n + 1)$ with coefficient vector $\mathbf{a}$, the NTT and inverse NTT are essentially a ring isomorphism.

\[\begin{aligned} \textsf{NTT}: \mathbb{Z}_q[X] / (X^n + 1) \mapsto \mathbb{Z}_q^{n} &\implies \mathbf{a} \mapsto (A(\varphi^{1}), A(\varphi^{3}), \dots, A(\varphi^{2n-1}))\\ \textsf{iNTT}: \mathbb{Z}_q^{n} \mapsto \mathbb{Z}_q[X] / (X^n + 1) &\implies (A(\varphi^{1}), A(\varphi^{3}), \dots, A(\varphi^{2n-1})) \mapsto \mathbf{a} \\ \end{aligned}\]

The essence of the Fast Fourier Transform and the fast Number Theoretic Transform is that the ring isomorphism above also admits the following recursive divide-and-conquer decomposition:

\[\begin{aligned} \mathbb{Z}_q[X] / (X^n + 1) & \cong \mathbb{Z}_q[X] / (X^{\frac{n}{2}} - \varphi^{\frac{n}{2}}) \times \mathbb{Z}_q[X] / (X^{\frac{n}{2}} + \varphi^{\frac{n}{2}}) \\ &\cong \mathbb{Z}_q[X] / (X^{\frac{n}{4}} - \varphi^{\frac{n}{4}}) \times \mathbb{Z}_q[X] / (X^{\frac{n}{4}} + \varphi^{\frac{n}{4}}) \\ &\quad \times \mathbb{Z}_q[X] / (X^{\frac{n}{4}} - \varphi^{\frac{3n}{4}}) \times \mathbb{Z}_q[X] / (X^{\frac{n}{4}} + \varphi^{\frac{3n}{4}}) \\ & \cong \cdots \\ & \cong \prod_{i=0}^{n-1} \mathbb{Z}_q[X] / (X - \varphi^{2i + 1}) \end{aligned}\]

That is the following CRT isomorphism map:

Figure 1. Recursive CRT decomposition in the NWC setting, source: https://arxiv.org/pdf/2211.13546

Similarly, for PWC, the modulus polynomial $x^n - 1$ admits a similar CRT isomorphism map:

Figure 2. Recursive CRT decomposition in the PWC setting, source: https://arxiv.org/pdf/2211.13546
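The factorizations behind both decompositions can be verified numerically for small parameters. The sketch below multiplies out the linear factors for $X^n + 1$ and $X^n - 1$ and also checks the first splitting step of the NWC case. Polynomials are stored as low-to-high coefficient lists; the helper `poly_mul_mod` and the toy values $q = 17$, $n = 4$, $\varphi = 9$ are illustrative assumptions.

```python
def poly_mul_mod(a, b, q):
    """Schoolbook product of coefficient lists (low degree first) over Z_q."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % q
    return out

q, n, phi = 17, 4, 9
omega = phi * phi % q

# NWC leaves: prod_i (X - phi^(2i+1)) should equal X^n + 1.
nwc_product = [1]
for i in range(n):
    root = pow(phi, 2 * i + 1, q)
    nwc_product = poly_mul_mod(nwc_product, [(-root) % q, 1], q)

# PWC leaves: prod_i (X - omega^i) should equal X^n - 1.
pwc_product = [1]
for i in range(n):
    root = pow(omega, i, q)
    pwc_product = poly_mul_mod(pwc_product, [(-root) % q, 1], q)

# First NWC split: (X^(n/2) - phi^(n/2)) * (X^(n/2) + phi^(n/2)) = X^n + 1.
half = n // 2
c = pow(phi, half, q)
minus_factor = [(-c) % q] + [0] * (half - 1) + [1]
plus_factor = [c] + [0] * (half - 1) + [1]
first_split = poly_mul_mod(minus_factor, plus_factor, q)

print(nwc_product, pwc_product, first_split)
# [1, 0, 0, 0, 1] [16, 0, 0, 0, 1] [1, 0, 0, 0, 1]
```

Since the roots are pairwise distinct, the linear factors are pairwise coprime, which is the condition the Chinese Remainder Theorem needs for the product decomposition.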

From the ring-isomorphism decomposition above, we can already see the rough shape of the butterfly operation. In the next section, we introduce the butterfly operation of the fast Number Theoretic Transform, namely the Cooley-Tukey algorithm, and the butterfly operation of the fast inverse Number Theoretic Transform, namely the Gentleman-Sande algorithm.

CT/GS Butterfly Algorithms

Let $\varphi$ be a primitive $2n$-th root of unity over $\mathbb{Z}_q$, and let $\omega := \varphi^2$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$. Here $n$ is exactly a power of two, so the recursion can proceed completely.

The key properties of $\varphi$ used by the fast transform are:

\[\varphi^{k+2n} = \varphi^{k} \\ \varphi^{k+n} = -\varphi^{k}\]

To unify notation, let the number theoretic transform for positive wrapped convolution be denoted by $\textsf{NTT}^{+}$, and the number theoretic transform for negacyclic convolution be denoted by $\textsf{NTT}^{-}$. Since $\omega := \varphi^2$, both can be written uniformly in terms of $\varphi$:

\[\begin{cases} \textsf{NTT}^{+}: & \hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{i \cdot 2j} \mathbf{a}_i \pmod q, & j = 0, 1, 2, \dots, n-1 \\ \textsf{NTT}^{-}: & \hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{i \cdot (2j + 1)} \mathbf{a}_i \pmod q, & j = 0, 1, 2, \dots, n-1 \\ \end{cases}\]

We only need to consider the negacyclic convolution case below, because the corresponding positive wrapped convolution transform can be obtained easily through coefficient reconstruction $\mathbf{b}_i :=\varphi^{-i} \cdot \mathbf{a}_i$.

Fast-NTT: Cooley-Tukey Algorithm

Consider the first ring-isomorphism step below:

\[\begin{aligned} \hat{\boldsymbol{a}}_j & =\sum_{i=0}^{n-1} \varphi^{2 i j+i} a_i \bmod q \\ & = \left[ \sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i}+\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 j+2 i+1} a_{2 i+1} \right] \bmod q \\ & = \left[ \sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i}+\varphi^{2 j+1} \sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i+1} \right]\bmod q \end{aligned}\]

Now consider the coefficient with index $J = j + n/2 \ge n/2$. Using $\varphi^{2n} = 1$ and $\varphi^{n} = -1$, we get:

\[\hat{\boldsymbol{a}}_{J} = \hat{\boldsymbol{a}}_{j+n / 2}=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i}-\varphi^{2 j+1} \sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i+1} \quad \bmod q, j \in [0,n/2 - 1]\]

This gives some reusable intermediate quantities. Let $A_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i}$ and $B_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i+1}$. From the decomposition, we obtain:

\[\begin{cases} \text{Former}: & \hat{\boldsymbol{a}}_j & =A_j+\varphi^{2 j+1} B_j \quad \bmod q \\ \text{Latter}: &\hat{\boldsymbol{a}}_{j+n / 2} & =A_j-\varphi^{2 j+1} B_j \quad \bmod q \end{cases}\]

The coefficients $A_j, B_j$ can themselves be computed by $n/2$-point NTTs. Define:

\[\begin{cases} \mathbf{a}^{(0)} = (a_0, a_2, \ldots, a_{n-2}) \\ \mathbf{a}^{(1)} = (a_1, a_3, \ldots, a_{n-1}) \end{cases}\]

Let $\omega = \varphi^2$ be a primitive $2 \cdot \left( \frac{n}{2} \right)$-th root of unity. We have:

\[\begin{cases} \mathbf{A} = \textsf{NTT}_{n/2}^{\omega}(\mathbf{a}^{(0)}), & A_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i} = \sum_{i=0}^{n / 2-1} \omega^{2ij+i} a_{2 i} \\ \mathbf{B} = \textsf{NTT}_{n/2}^{\omega}(\mathbf{a}^{(1)}), & B_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i + 1} = \sum_{i=0}^{n / 2-1} \omega^{2ij+i} a_{2 i + 1} \end{cases}\]

We recurse in this way until the NTT coefficients can be computed in constant time.

\[\begin{array}{l} \textsf{CT-NTT}^{\varphi}(\mathbf{a}): \\ \quad n \leftarrow \vert\mathbf{a}\vert \\ \quad \text{if } n=1,\ \text{return } \mathbf{a} \\ \quad \mathbf{a}^{(0)} \leftarrow (a_0,a_2,\ldots,a_{n-2}) \\ \quad \mathbf{a}^{(1)} \leftarrow (a_1,a_3,\ldots,a_{n-1}) \\ \quad \mathbf{A} \leftarrow \textsf{CT-NTT}^{\varphi^2}(\mathbf{a}^{(0)}) \\ \quad \mathbf{B} \leftarrow \textsf{CT-NTT}^{\varphi^2}(\mathbf{a}^{(1)}) \\ \quad \text{for } j=0,\ldots,n/2-1: \\ \quad\quad \hat{\mathbf{a}}_j \leftarrow A_j+\varphi^{2j+1}B_j \pmod q \\ \quad\quad \hat{\mathbf{a}}_{j+n/2} \leftarrow A_j-\varphi^{2j+1}B_j \pmod q \\ \quad \text{return } \hat{\mathbf{a}} \end{array}\]
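The recursion above can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation; the modulus $q = 17$ and the helper `naive_negacyclic_ntt` are assumptions introduced only for checking.

```python
Q = 17  # illustrative modulus (assumed); 3 is a primitive 16th root of unity mod 17

def naive_negacyclic_ntt(a, phi, q=Q):
    """Reference O(n^2) definition: a_hat[j] = sum_i phi^(i*(2j+1)) * a[i]."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def ct_ntt(a, phi, q=Q):
    """Recursive Cooley-Tukey negacyclic NTT; phi is a primitive 2n-th root."""
    n = len(a)
    if n == 1:
        return list(a)
    phi_sq = phi * phi % q
    A = ct_ntt(a[0::2], phi_sq, q)   # transform of even-indexed coefficients
    B = ct_ntt(a[1::2], phi_sq, q)   # transform of odd-indexed coefficients
    out = [0] * n
    for j in range(n // 2):
        t = pow(phi, 2 * j + 1, q) * B[j] % q   # twiddle phi^(2j+1)
        out[j] = (A[j] + t) % q
        out[j + n // 2] = (A[j] - t) % q
    return out
```

For example, `ct_ntt([1, 2, 3, 4, 5, 6, 7, 8], 3)` should agree with `naive_negacyclic_ntt` on the same input.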

For the standard NTT for positive wrapped convolution, the recursive structure is the same. One only needs to replace $\varphi^{2j+1}$ above by $\omega^j$ and replace the subproblem root by $\omega^2$.

Fast-iNTT: Gentleman-Sande Algorithm

Recall that the inverse NTT is computed as follows:

\[\begin{aligned} \mathbf{a}_j &= \frac{1}{n} \cdot \varphi^{-j} \sum_{i=0}^{n-1} \varphi^{-(2ij)} \hat{\mathbf{a}}_i \bmod q \\ & = \frac{1}{n} \sum_{i=0}^{n-1} \varphi^{-j(2i + 1)} \hat{\mathbf{a}}_i \pmod q \\ \end{aligned}\]

The fast computation of the inverse NTT is decomposed as:

\[\begin{aligned} \mathbf{a}_j & = \frac{1}{n} \sum_{i=0}^{n-1} \varphi^{-j (2i + 1)} \hat{\mathbf{a}}_i \bmod q \\ & = \frac{1}{n} \cdot \varphi^{-j} \left[ \sum_{i=0}^{n / 2-1} \varphi^{-2ij} \hat{\mathbf{a}}_i +\sum_{i=n/2}^{n - 1} \varphi^{-2ij} \hat{\mathbf{a}}_i \right] \bmod q \\ & = \frac{1}{n} \cdot \varphi^{-j} \left[ \sum_{i=0}^{n / 2-1} \varphi^{-2ij} \hat{\mathbf{a}}_i +\sum_{i=0}^{n/2 - 1} \varphi^{-2(i + n/2)j} \hat{\mathbf{a}}_{i + n/2} \right] \bmod q \\ & = \frac{1}{n} \cdot \varphi^{-j} \left[ \sum_{i=0}^{n / 2-1} \varphi^{-2ij} \hat{\mathbf{a}}_{i} + (-1)^j \sum_{i=0}^{n/2 - 1} \varphi^{-2ij} \hat{\mathbf{a}}_{i + n/2} \right] \bmod q \\ & = \frac{1}{n} \cdot \varphi^{-j} \left[ \sum_{i=0}^{n / 2-1} \varphi^{-2ij} \left( \hat{\mathbf{a}}_{i} + (-1)^j \hat{\mathbf{a}}_{i + n/2} \right) \right] \bmod q \\ \end{aligned}\]

The even and odd coefficients can be separated as follows:

\[\begin{cases} \mathbf{a}_{2k} & = \frac{1}{n} \cdot \varphi^{-2k} \left[ \sum_{i=0}^{n / 2-1} \left( \varphi^{-4ki} \left( \hat{\mathbf{a}}_{i}+ \hat{\mathbf{a}}_{i + n/2} \right) \right) \right] \bmod q \\ \mathbf{a}_{2k+1} & = \frac{1}{n} \cdot \varphi^{-2k - 1} \left[ \sum_{i=0}^{n / 2-1} \left( \varphi^{-2i(2k + 1)} \left( \hat{\mathbf{a}}_{i} - \hat{\mathbf{a}}_{i + n/2} \right) \right) \right] \bmod q \\ \end{cases}\]

Next, we analyze the recursive formula from two perspectives.

Inverting the CT Transform

The Gentleman-Sande inverse transform can be obtained directly by inverting the butterfly formula from the Cooley-Tukey forward transform in the previous section. Recall the CT forward transform in the negacyclic convolution setting. Split the input coefficients by even and odd indices:

\[\begin{cases} \mathbf{a}^{(0)} = (a_0, a_2, \ldots, a_{n-2}) \\ \mathbf{a}^{(1)} = (a_1, a_3, \ldots, a_{n-1}) \end{cases}\]

Let $\omega = \varphi^2$. Then $\omega$ is the primitive $n$-th root of unity needed for the length $n/2$ negacyclic subproblem; in other words, it plays the role of a primitive $2\cdot(n/2)$-th root of unity inside the subproblem. Write:

\[\begin{cases} \mathbf{E} = \textsf{NTT}_{n/2}^{\omega}(\mathbf{a}^{(0)}), & E_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i} = \sum_{i=0}^{n / 2-1} \omega^{2 i j+i} a_{2 i} \\ \mathbf{O} = \textsf{NTT}_{n/2}^{\omega}(\mathbf{a}^{(1)}), & O_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i+1} = \sum_{i=0}^{n / 2-1} \omega^{2 i j+i} a_{2 i+1} \end{cases}\]

The CT butterfly gives:

\[\begin{cases} \hat{\mathbf{a}}_j = E_j + \varphi^{2j+1} O_j, & j = 0, \ldots, \frac{n}{2}-1 \\ \hat{\mathbf{a}}_{j+n/2} = E_j - \varphi^{2j+1} O_j, & j = 0, \ldots, \frac{n}{2}-1 \end{cases}\]

The GS inverse transform inverts this linear system layer by layer. Given the evaluation vector $\hat{\mathbf{a}}$ at the current layer, first pair the upper and lower halves to recover two length $n/2$ subproblem evaluation vectors:

\[\begin{cases} E_j = \frac{1}{2}\left(\hat{\mathbf{a}}_j + \hat{\mathbf{a}}_{j+n/2}\right), & j = 0, \ldots, \frac{n}{2}-1 \\ O_j = \frac{1}{2\varphi^{2j+1}}\left(\hat{\mathbf{a}}_j - \hat{\mathbf{a}}_{j+n/2}\right), & j = 0, \ldots, \frac{n}{2}-1 \end{cases}\]

Then recursively apply length $n/2$ inverse transforms to $\mathbf{E}$ and $\mathbf{O}$:

\[\begin{cases} \mathbf{a}^{(0)} = \textsf{iNTT}_{n/2}^{\omega}(\mathbf{E}) \\ \mathbf{a}^{(1)} = \textsf{iNTT}_{n/2}^{\omega}(\mathbf{O}) \end{cases}\]

Finally, interleave the coefficients of the two subproblems:

\[\begin{cases} a_{2r} = a^{(0)}_r, & r = 0, \ldots, \frac{n}{2}-1 \\ a_{2r+1} = a^{(1)}_r, & r = 0, \ldots, \frac{n}{2}-1 \end{cases}\]

The recursion terminates at $n=1$, where the input evaluation vector is already the coefficient vector. Since each butterfly layer multiplies by $\frac{1}{2}$ and there are $\log_2 n$ layers in total, the total scaling factor is:

\[\left(\frac{1}{2}\right)^{\log_2 n} = \frac{1}{n}\]

This exactly matches the normalization factor $\frac{1}{n}$ in the iNTT definition. Therefore, an implementation can use $2^{-1} \bmod q$ at each layer, without multiplying by $n^{-1}$ again after the recursion ends.

Deriving the Standard GS Transform

We can derive the recursion directly from the definition of $\textsf{iNTT}^{\varphi}$. Let the input evaluation vector at the current layer be $\hat{\mathbf{a}}=(\hat{\mathbf{a}}_0,\ldots,\hat{\mathbf{a}}_{n-1})$, and let the output coefficient vector be $\mathbf{a}=(\mathbf{a}_0,\ldots,\mathbf{a}_{n-1})$. By definition:

\[\mathbf{a}_j = \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-j(2i+1)}\hat{\mathbf{a}}_i \pmod q\]

Split the output index $j$ into even and odd cases. For even index $j=2k$:

\[\begin{aligned} \mathbf{a}_{2k} &= \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-2k(2i+1)}\hat{\mathbf{a}}_i \\ &= \frac{1}{n}\varphi^{-2k}\sum_{i=0}^{n-1}\varphi^{-4ki}\hat{\mathbf{a}}_i \\ &= \frac{1}{n}\varphi^{-2k}\sum_{i=0}^{n/2-1} \left(\varphi^{-4ki}\hat{\mathbf{a}}_i+\varphi^{-4k(i+n/2)}\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \frac{1}{n}\varphi^{-2k}\sum_{i=0}^{n/2-1} \varphi^{-4ki}\left(\hat{\mathbf{a}}_i+\hat{\mathbf{a}}_{i+n/2}\right) \end{aligned}\]

The last step uses $\varphi^{-4k(i+n/2)}=\varphi^{-4ki}\varphi^{-2kn}=\varphi^{-4ki}$.

For odd index $j=2k+1$:

\[\begin{aligned} \mathbf{a}_{2k+1} &= \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-(2k+1)(2i+1)}\hat{\mathbf{a}}_i \\ &= \frac{1}{n}\varphi^{-(2k+1)}\sum_{i=0}^{n-1}\varphi^{-2(2k+1)i}\hat{\mathbf{a}}_i \\ &= \frac{1}{n}\varphi^{-(2k+1)}\sum_{i=0}^{n/2-1} \left(\varphi^{-2(2k+1)i}\hat{\mathbf{a}}_i+\varphi^{-2(2k+1)(i+n/2)}\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \frac{1}{n}\varphi^{-(2k+1)}\sum_{i=0}^{n/2-1} \varphi^{-2(2k+1)i}\left(\hat{\mathbf{a}}_i-\hat{\mathbf{a}}_{i+n/2}\right) \end{aligned}\]

The last step uses $\varphi^{-2(2k+1)(i+n/2)}=-\varphi^{-2(2k+1)i}$.

Now let the primitive root of the subproblem be $\omega=\varphi^2$, and define two new evaluation vectors of length $n/2$:

\[\begin{cases} E_i = \frac{1}{2}\left(\hat{\mathbf{a}}_i+\hat{\mathbf{a}}_{i+n/2}\right), \\ O_i = \frac{1}{2}\varphi^{-(2i+1)}\left(\hat{\mathbf{a}}_i-\hat{\mathbf{a}}_{i+n/2}\right), \end{cases} \quad i=0,\ldots,\frac{n}{2}-1\]

We can observe that the two length $n/2$ recursive subproblems $\textsf{iNTT}^{\omega}$ give:

\[\begin{aligned} \textsf{iNTT}_{n/2}^{\omega}(\mathbf{E})_k &= \frac{2}{n}\sum_{i=0}^{n/2-1}\omega^{-k(2i+1)}E_i \\ &= \frac{1}{n}\varphi^{-2k}\sum_{i=0}^{n/2-1} \varphi^{-4ki}\left(\hat{\mathbf{a}}_i+\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \mathbf{a}_{2k}, \end{aligned}\]

and:

\[\begin{aligned} \textsf{iNTT}_{n/2}^{\omega}(\mathbf{O})_k &= \frac{2}{n}\sum_{i=0}^{n/2-1}\omega^{-k(2i+1)}O_i \\ &= \frac{1}{n}\sum_{i=0}^{n/2-1} \varphi^{-2k(2i+1)}\varphi^{-(2i+1)} \left(\hat{\mathbf{a}}_i-\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \frac{1}{n}\varphi^{-(2k+1)}\sum_{i=0}^{n/2-1} \varphi^{-2(2k+1)i}\left(\hat{\mathbf{a}}_i-\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \mathbf{a}_{2k+1}. \end{aligned}\]

Therefore, directly from the iNTT formula, we obtain the recurrence:

\[\begin{cases} (\mathbf{a}_0,\mathbf{a}_2,\ldots,\mathbf{a}_{n-2}) = \textsf{iNTT}_{n/2}^{\varphi^2}(\mathbf{E}) \\ (\mathbf{a}_1,\mathbf{a}_3,\ldots,\mathbf{a}_{n-1}) = \textsf{iNTT}_{n/2}^{\varphi^2}(\mathbf{O}) \end{cases}\]

The core of the recursion is computing the vectors $\mathbf{E}$ and $\mathbf{O}$:

\[\begin{cases} E_i \leftarrow 2^{-1}(\hat{\mathbf{a}}_i+\hat{\mathbf{a}}_{i+n/2}) \pmod q \\ O_i \leftarrow 2^{-1}(\hat{\mathbf{a}}_i-\hat{\mathbf{a}}_{i+n/2})\cdot(\varphi^{2i+1})^{-1} \pmod q \end{cases}\]

\[\begin{array}{l} \textsf{GS-iNTT}^{\varphi}(\hat{\mathbf{a}}): \\ \quad n \leftarrow \vert\hat{\mathbf{a}}\vert \\ \quad \text{if } n=1,\ \text{return } \hat{\mathbf{a}} \\ \quad \text{for } j=0,\ldots,n/2-1: \\ \quad\quad E_j \leftarrow 2^{-1}\left(\hat{\mathbf{a}}_j+\hat{\mathbf{a}}_{j+n/2}\right) \pmod q \\ \quad\quad O_j \leftarrow 2^{-1}\left(\hat{\mathbf{a}}_j-\hat{\mathbf{a}}_{j+n/2}\right)\cdot\left(\varphi^{2j+1}\right)^{-1} \pmod q \\ \quad \mathbf{a}^{(0)} \leftarrow \textsf{GS-iNTT}^{\varphi^2}(\mathbf{E}) \\ \quad \mathbf{a}^{(1)} \leftarrow \textsf{GS-iNTT}^{\varphi^2}(\mathbf{O}) \\ \quad \text{for } r=0,\ldots,n/2-1: \\ \quad\quad \mathbf{a}_{2r} \leftarrow \mathbf{a}^{(0)}_r \\ \quad\quad \mathbf{a}_{2r+1} \leftarrow \mathbf{a}^{(1)}_r \\ \quad \text{return } \mathbf{a} \end{array}\]

Notice that the final step interleaves the results of the two recursive subproblems by even and odd indices. The recursive version here does not require bit reversal, and it also does not require an additional multiplication by $n^{-1}$ at the end, because the $2^{-1}$ factor at each layer already accumulates to the normalization factor $n^{-1}$ in the inverse NTT definition.
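A direct Python transcription of $\textsf{GS-iNTT}^{\varphi}$ might look like the sketch below. The modulus $q = 17$ and the reference function `naive_negacyclic_ntt` are assumptions used only to test the round trip.

```python
Q = 17  # illustrative modulus (assumed)

def naive_negacyclic_ntt(a, phi, q=Q):
    """Reference forward transform: a_hat[j] = sum_i phi^(i*(2j+1)) * a[i]."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def gs_intt(a_hat, phi, q=Q):
    """Recursive Gentleman-Sande inverse negacyclic NTT (2^-1 applied per layer)."""
    n = len(a_hat)
    if n == 1:
        return list(a_hat)
    half = n // 2
    inv2 = pow(2, -1, q)  # modular inverse of 2, needs Python 3.8+
    E = [inv2 * (a_hat[j] + a_hat[j + half]) % q for j in range(half)]
    O = [inv2 * (a_hat[j] - a_hat[j + half]) * pow(phi, -(2 * j + 1), q) % q
         for j in range(half)]
    phi_sq = phi * phi % q
    even = gs_intt(E, phi_sq, q)
    odd = gs_intt(O, phi_sq, q)
    a = [0] * n
    a[0::2] = even   # interleave: a[2r]   = even[r]
    a[1::2] = odd    #             a[2r+1] = odd[r]
    return a
```

Because each layer already multiplies by $2^{-1}$, `gs_intt(naive_negacyclic_ntt(a, 3), 3)` should return `a` without a final multiplication by $n^{-1}$.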

For the iNTT of positive wrapped convolution, the recursive structure is the same. One only needs to replace $\varphi^{2j+1}$ by $\omega^j$ and replace the subproblem root by $\omega^2$:
\[\begin{cases} E_j = \frac{1}{2}\left(\hat{\mathbf{a}}_j + \hat{\mathbf{a}}_{j+n/2}\right) \\ O_j = \frac{1}{2\omega^j}\left(\hat{\mathbf{a}}_j - \hat{\mathbf{a}}_{j+n/2}\right) \end{cases}\]

Non-Recursive Iterative Butterfly Algorithms

The recursive CT/GS algorithms are best suited for understanding where the formulas come from, but practical implementations usually unroll the recursion into an iterative butterfly network. The essence of recursive CT is that it repeatedly splits subproblems according to the parity of the input index: the first layer looks at the least significant bit, the second layer at the next bit, and so on until the most significant bit. Therefore, the input order at the leaves of the recursion tree is exactly the bit-reversal permutation (BO order) of the original indices. Let $\operatorname{brv}_{\ell}(i)$ denote the integer obtained by reversing the $\ell=\log_2 n$ bits of $i$. For example, when $n=8$:

\[(0,1,2,3,4,5,6,7) \mapsto (0,4,2,6,1,5,3,7)\]

Expanding in three bits, the correspondence is:

\[\begin{cases} 000_2 \mapsto 000_2, & 0 \mapsto 0 \\ 001_2 \mapsto 100_2, & 1 \mapsto 4 \\ 010_2 \mapsto 010_2, & 2 \mapsto 2 \\ 011_2 \mapsto 110_2, & 3 \mapsto 6 \\ 100_2 \mapsto 001_2, & 4 \mapsto 1 \\ 101_2 \mapsto 101_2, & 5 \mapsto 5 \\ 110_2 \mapsto 011_2, & 6 \mapsto 3 \\ 111_2 \mapsto 111_2, & 7 \mapsto 7 \end{cases}\]
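The bit-reversal map is easy to compute directly. The following sketch uses hypothetical helper names `brv` and `bit_reverse_permute`, and assumes the length is a power of two.

```python
def brv(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r

def bit_reverse_permute(a):
    """Return a copy of `a` in bit-reversed order; len(a) must be a power of two."""
    n = len(a)
    bits = n.bit_length() - 1
    return [a[brv(i, bits)] for i in range(n)]
```

Applying `bit_reverse_permute` twice returns the original list, which reflects the fact that the permutation is its own inverse.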

For $n = 8$, suppose the input coefficient vector is in natural order (NO): $(a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7)$. Recursive CT splitting places the inputs at the leaves in BO order. Correspondingly, an in-place CT butterfly network that consumes NO input without any reordering leaves its output in BO order rather than natural order: $(\hat a_0 \mid \hat a_4 \mid \hat a_2 \mid \hat a_6 \mid \hat a_1 \mid \hat a_5 \mid \hat a_3 \mid \hat a_7)$. The detailed permutation during the CT operation is:

\[\begin{aligned} &\textbf{Cooley-Tukey:} \text{NO} \to \text{BO} \\[2mm] &(a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7) \\ &\xrightarrow{\text{split by bit }0} (a_0,a_2,a_4,a_6 \mid a_1,a_3,a_5,a_7) \\ &\xrightarrow{\text{split by bit }1} (a_0,a_4 \mid a_2,a_6 \mid a_1,a_5 \mid a_3,a_7) \\ &\xrightarrow{\text{split by bit }2} (a_0 \mid a_4 \mid a_2 \mid a_6 \mid a_1 \mid a_5 \mid a_3 \mid a_7) \\ &\qquad = (a_{\operatorname{brv}_3(0)},a_{\operatorname{brv}_3(1)},\dots,a_{\operatorname{brv}_3(7)}) \end{aligned}\]

This is exactly the left-to-right order of the leaf nodes after CT recurses to the bottom. The permutation has order two, so applying the same permutation again flips the sequence from BO order back to normal NO order. In fact:

If the input is in BO order, CT butterfly operations produce NO order.
If the input is in NO order, CT butterfly operations produce BO order.

Returning to the Gentleman-Sande butterfly operation, the same permutation appears on the output side: with NO input, the in-place GS network leaves its result in BO order. In short, to obtain the natural NO sequence, the result vector must be permuted back from BO order at the end. With the ordering clarified, we now turn to the iterative form of the fast NTT algorithm.

Non-Recursive Cooley-Tukey NTT

For the standard NTT for positive wrapped convolution, let $\omega$ be a primitive $n$-th root of unity. The iterative CT algorithm first reorders the input coefficient vector in bit-reversal order, then starts from small blocks of length $m=2$ and merges upward, doubling $m$ layer by layer until $m=n$. Within each block of length $m$, the local primitive $m$-th root of unity is:

\[\omega_m = \omega^{n/m}\]

For block position $j=0,\ldots,m/2-1$, the CT butterfly is:

\[\begin{cases} u = a_{\text{start}+j} \\ v = \omega_m^j \cdot a_{\text{start}+j+m/2} \end{cases} \implies \begin{cases} a_{\text{start}+j} \leftarrow u+v \pmod q \\ a_{\text{start}+j+m/2} \leftarrow u-v \pmod q \end{cases}\]

\[\begin{array}{l} \textsf{Iter-CT-NTT}^{\omega}(\mathbf{a}): \\ \quad \mathbf{a} \leftarrow (a_{\operatorname{brv}_{\ell}(0)},a_{\operatorname{brv}_{\ell}(1)},\ldots,a_{\operatorname{brv}_{\ell}(n-1)}) \\ \quad \text{for } m=2,4,8,\ldots,n: \\ \quad\quad \omega_m \leftarrow \omega^{n/m} \\ \quad\quad \text{for } \text{start}=0,m,2m,\ldots,n-m: \\ \quad\quad\quad \text{for } j=0,\ldots,m/2-1: \\ \quad\quad\quad\quad u \leftarrow a_{\text{start}+j} \\ \quad\quad\quad\quad v \leftarrow \omega_m^j a_{\text{start}+j+m/2} \\ \quad\quad\quad\quad a_{\text{start}+j} \leftarrow u+v \pmod q \\ \quad\quad\quad\quad a_{\text{start}+j+m/2} \leftarrow u-v \pmod q \\ \quad \text{return } \mathbf{a} \end{array}\]
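The pseudocode above translates almost line by line into Python. The sketch below is illustrative: the helper `brv`, the reference `naive_cyclic_ntt`, and the parameters $q = 17$, $\omega = 9$ (a primitive 8th root of unity modulo 17) are assumptions, not part of the original text.

```python
def brv(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def naive_cyclic_ntt(a, omega, q):
    """Reference O(n^2) definition: a_hat[j] = sum_i omega^(i*j) * a[i]."""
    n = len(a)
    return [sum(pow(omega, i * j, q) * a[i] for i in range(n)) % q
            for j in range(n)]

def iter_ct_ntt(a, omega, q):
    """Iterative Cooley-Tukey NTT: bit-reversed input, natural-order output."""
    n = len(a)
    bits = n.bit_length() - 1
    a = [a[brv(i, bits)] for i in range(n)]   # reorder input to BO order
    m = 2
    while m <= n:
        w_m = pow(omega, n // m, q)           # local primitive m-th root
        for start in range(0, n, m):
            w = 1                             # running power w_m^j
            for j in range(m // 2):
                u = a[start + j]
                v = w * a[start + j + m // 2] % q
                a[start + j] = (u + v) % q
                a[start + j + m // 2] = (u - v) % q
                w = w * w_m % q
        m *= 2
    return a
```

Keeping a running product `w` instead of calling `pow` in the inner loop is a common design choice; practical implementations usually precompute the twiddle table once.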

For $\textsf{NTT}^{\varphi}$ in negacyclic convolution, let $\varphi$ be a primitive $2n$-th root of unity. In a local subproblem of length $m$, the corresponding primitive $2m$-th root of unity is:

\[\varphi_m = \varphi^{n/m}\]

The CT butterfly in the negacyclic version only needs to replace the twiddle factor $\omega_m^j$ in the standard NTT by the odd power:

\[\varphi_m^{2j+1}\]

That is:

\[\begin{cases} u = a_{\text{start}+j} \\ v = \varphi_m^{2j+1} \cdot a_{\text{start}+j+m/2} \end{cases} \implies \begin{cases} a_{\text{start}+j} \leftarrow u+v \pmod q \\ a_{\text{start}+j+m/2} \leftarrow u-v \pmod q \end{cases}\]
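For the negacyclic variant, only the twiddle changes. A minimal sketch, assuming $q = 17$ and $\varphi = 3$ for the check and using hypothetical function names:

```python
def brv(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def naive_negacyclic_ntt(a, phi, q):
    """Reference definition: a_hat[j] = sum_i phi^(i*(2j+1)) * a[i]."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def iter_ct_ntt_negacyclic(a, phi, q):
    """Iterative CT negacyclic NTT; phi is a primitive 2n-th root of unity."""
    n = len(a)
    bits = n.bit_length() - 1
    a = [a[brv(i, bits)] for i in range(n)]
    m = 2
    while m <= n:
        phi_m = pow(phi, n // m, q)       # local primitive 2m-th root
        step = phi_m * phi_m % q          # advances the exponent 2j+1 by 2
        for start in range(0, n, m):
            w = phi_m                     # twiddle phi_m^(2j+1), starting at j = 0
            for j in range(m // 2):
                u = a[start + j]
                v = w * a[start + j + m // 2] % q
                a[start + j] = (u + v) % q
                a[start + j + m // 2] = (u - v) % q
                w = w * step % q
        m *= 2
    return a
```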

Non-Recursive Gentleman-Sande iNTT

The GS inverse transform can be viewed as running the CT butterfly network backward. CT merges from small blocks into larger blocks, so GS splits from large blocks into smaller blocks. For the iNTT of standard positive wrapped convolution, define within a block of length $m$:

\[\omega_m = \omega^{n/m}\]

The GS butterfly is:

\[\begin{cases} u = a_{\text{start}+j} \\ v = a_{\text{start}+j+m/2} \end{cases} \implies \begin{cases} a_{\text{start}+j} \leftarrow \frac{u+v}{2} \pmod q \\ a_{\text{start}+j+m/2} \leftarrow \frac{u-v}{2\omega_m^j} \pmod q \end{cases}\]

Here $\frac{1}{2}$ is placed inside each butterfly layer. Since there are $\log_2 n$ layers in total, the overall scaling is $1/n$. Another common form omits the factor $\frac{1}{2}$ at each layer and instead multiplies by $n^{-1}$ at the end; the two forms are equivalent. The algorithm below uses the former form, so no additional multiplication by $n^{-1}$ is needed at the end. After each GS layer, the data still follows the grouping order of recursive splitting. After all layers have been executed, the coefficients are in bit-reversal order, so one more bit reversal is needed to return to natural order.

\[\begin{array}{l} \textsf{Iter-GS-iNTT}^{\omega}(\hat{\mathbf{a}}): \\ \quad \mathbf{a} \leftarrow \hat{\mathbf{a}} \\ \quad \text{for } m=n,n/2,n/4,\ldots,2: \\ \quad\quad \omega_m \leftarrow \omega^{n/m} \\ \quad\quad \text{for } \text{start}=0,m,2m,\ldots,n-m: \\ \quad\quad\quad \text{for } j=0,\ldots,m/2-1: \\ \quad\quad\quad\quad u \leftarrow a_{\text{start}+j} \\ \quad\quad\quad\quad v \leftarrow a_{\text{start}+j+m/2} \\ \quad\quad\quad\quad a_{\text{start}+j} \leftarrow 2^{-1}(u+v) \pmod q \\ \quad\quad\quad\quad a_{\text{start}+j+m/2} \leftarrow 2^{-1}(u-v)(\omega_m^j)^{-1} \pmod q \\ \quad \text{return } (a_{\operatorname{brv}_{\ell}(0)},a_{\operatorname{brv}_{\ell}(1)},\ldots,a_{\operatorname{brv}_{\ell}(n-1)}) \end{array}\]
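A Python sketch of the iterative GS inverse transform follows. As before, `brv`, `naive_cyclic_ntt`, and the parameters $q = 17$, $\omega = 9$ are illustrative assumptions used for a round-trip check.

```python
def brv(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def naive_cyclic_ntt(a, omega, q):
    """Reference forward transform: a_hat[j] = sum_i omega^(i*j) * a[i]."""
    n = len(a)
    return [sum(pow(omega, i * j, q) * a[i] for i in range(n)) % q
            for j in range(n)]

def iter_gs_intt(a_hat, omega, q):
    """Iterative Gentleman-Sande inverse NTT: natural-order input, 2^-1 per layer."""
    n = len(a_hat)
    a = list(a_hat)
    inv2 = pow(2, -1, q)                      # needs Python 3.8+
    m = n
    while m >= 2:
        w_m_inv = pow(omega, -(n // m), q)    # inverse of the local m-th root
        for start in range(0, n, m):
            w_inv = 1                         # running power w_m^(-j)
            for j in range(m // 2):
                u = a[start + j]
                v = a[start + j + m // 2]
                a[start + j] = inv2 * (u + v) % q
                a[start + j + m // 2] = inv2 * (u - v) * w_inv % q
                w_inv = w_inv * w_m_inv % q
        m //= 2
    bits = n.bit_length() - 1
    return [a[brv(i, bits)] for i in range(n)]  # BO order back to natural order
```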

For $\textsf{iNTT}^{\varphi}$ in negacyclic convolution, similarly let the local root be:

\[\varphi_m = \varphi^{n/m}\]

Then replace $\omega_m^j$ by $\varphi_m^{2j+1}$:

\[\begin{cases} a_{\text{start}+j} \leftarrow 2^{-1}(u+v) \pmod q \\ a_{\text{start}+j+m/2} \leftarrow 2^{-1}(u-v)(\varphi_m^{2j+1})^{-1} \pmod q \end{cases}\]
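The negacyclic inverse differs from the cyclic one only in the twiddle. The sketch below assumes $q = 17$ and $\varphi = 3$ for the check, and the names are hypothetical.

```python
def brv(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def naive_negacyclic_ntt(a, phi, q):
    """Reference forward transform: a_hat[j] = sum_i phi^(i*(2j+1)) * a[i]."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def iter_gs_intt_negacyclic(a_hat, phi, q):
    """Iterative GS inverse negacyclic NTT; phi is a primitive 2n-th root."""
    n = len(a_hat)
    a = list(a_hat)
    inv2 = pow(2, -1, q)
    m = n
    while m >= 2:
        phi_m_inv = pow(phi, -(n // m), q)    # inverse of the local 2m-th root
        step = phi_m_inv * phi_m_inv % q
        for start in range(0, n, m):
            w_inv = phi_m_inv                 # phi_m^-(2j+1), starting at j = 0
            for j in range(m // 2):
                u = a[start + j]
                v = a[start + j + m // 2]
                a[start + j] = inv2 * (u + v) % q
                a[start + j + m // 2] = inv2 * (u - v) * w_inv % q
                w_inv = w_inv * step % q
        m //= 2
    bits = n.bit_length() - 1
    return [a[brv(i, bits)] for i in range(n)]
```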

Whether using CT or GS, each layer covers all $n$ elements and executes $n/2$ butterflies. Since $n=2^k$, the number of layers is $\log_2 n$. Therefore, the total number of butterfly operations is:

\[\frac{n}{2}\log_2 n\]

Each butterfly contains only a constant number of modular additions, modular subtractions, and modular multiplications, so the overall arithmetic complexity is:

\[\mathcal{O}(n\log n)\]

Bit-reversal permutation requires $\mathcal{O}(n)$ to $\mathcal{O}(n\log n)$ bit operations, depending on the implementation. Under the modular-multiplication counting model for NTT, it usually does not change the dominant complexity. Compared with recursive implementations, iterative implementations avoid the extra overhead of function-call stacks and recursive slicing, and are closer to the butterfly networks used in hardware circuits or constant-time software implementations.

Fast Fourier Transform and Number Theoretic Transform

2026-04-29T00:00:00+08:00

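Summary: The Fast Fourier Transform (FFT) is generally credited to Cooley and Tukey in 1965, although its earliest ideas can be traced back to Gauss's unpublished manuscript around 1805. FFT is one of the foundational algorithms behind modern high-performance computation: it accelerates integer multiplication and polynomial multiplication, and was named by IEEE as one of the top ten algorithms of the twentieth century. Current NIST post-quantum standards such as Kyber, Dilithium, and Falcon all involve FFT and its applied variant, the Number Theoretic Transform (NTT). In addition, NTT is an indispensable acceleration primitive for deploying zero-knowledge proof protocols (such as Plonk) and fully homomorphic encryption (such as BFV and TFHE). This article explains the mathematical theory and practical value of FFT and NTT in detail.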

Fast Fourier Transform, CP-Algorithms: https://cp-algorithms.com/algebra/fft.html.
A note on NTT definitions and implementations: https://eprint.iacr.org/2024/585.pdf.
Number Theoretic Transform, Cryptography Caffe: https://cryptographycaffe.sandboxaq.com/posts/ntt-02/.
Survey reference: https://arxiv.org/pdf/2211.13546.

Discrete Fourier Transform

Let an $n-1$ degree polynomial be written as

\[A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}\]

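In particular, we assume that the polynomial degree bound, or equivalently the length of the coefficient vector, is $n = 2^k$. In the non-power-of-two case, we can pad the higher-degree coefficients with zeros until the coefficient-vector length becomes a power of two. Let the $n$-th roots of unity be $w_{n,k} = e^{\frac{2 k \pi i }{n}}$, where $k \in [0..n-1]$, and let the primitive root of unity be $w_{n} = w_{n, 1} = e^{\frac{2 \pi i }{n}}$. They all satisfy $x^n = 1$.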

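The coefficient-vector representation of a polynomial is the most common one, namely $\vec{A} = (a_0, a_1, \ldots, a_{n-1})$ above. The discrete Fourier transform is a special point-value representation: it represents the polynomial by its values at the $n$-th roots of unity:

\[\mathsf{DFT}(a_0, a_1, \ldots, a_{n-1}) = (y_0, y_1, \ldots, y_{n-1}) = \left(A(w_n^0), A(w_n^1), \ldots, A(w_n^{n-1})\right)\]

The inverse discrete Fourier transform converts the point-value representation of a polynomial back into its coefficient vector; a better-known name for this conversion is Lagrange interpolation. The (inverse) discrete Fourier transform is therefore the algorithm that converts between these two representations, i.e. the mapping:

\[(a_0, a_1, \ldots, a_{n-1}) \ \underset{\mathsf{iDFT}}{\overset{\mathsf{DFT}}{\rightleftarrows}} \ (y_0, y_1, \ldots, y_{n-1})\]

Let $A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}$ and $B(x) = b_0 x^0 + b_1 x^1 + \dots + b_{n-1} x^{n-1}$ be polynomials of degree $n-1$ over an arbitrary ring. Then: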

\[\mathsf{DFT}(A(x)) \circ \mathsf{DFT}(B(x)) = \mathsf{DFT}(A(x) \cdot B(x))\]

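Here $\circ$ denotes componentwise multiplication of vectors, which can be computed in $\mathcal{O}(n)$ time. If we can compute the discrete Fourier transform $\mathsf{DFT}$ and its inverse $\mathsf{iDFT}$ in $\mathcal{O}(n \log n)$ time, then we can multiply polynomials given in coefficient form in $\mathcal{O}(n \log n)$ time. Let $m \ge 2n - 1$ be the transform length; in FFT one usually takes $m$ to be the smallest power of two that is at least $2n-1$. Then: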

\[A(x) \cdot B(x) = \mathsf{iDFT}_{m} \left(\mathsf{DFT}_{m} \left(A\left(x\right)\right) \circ \mathsf{DFT}_{m} \left(B\left(x\right)\right)\right)\]

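This is the core idea by which the FFT and the NTT accelerate polynomial multiplication and integer multiplication. In the computation above, the coefficient vectors must be zero-padded to length $m$: the product $A(x)\cdot B(x)$ has degree $2(n - 1)$ and therefore $2n-1$ coefficients, so a vector of dimension at least $2n-1$ is needed to recover $A(x)\cdot B(x)$ completely.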
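The zero-padding recipe can be illustrated with a small sketch. For clarity it uses a naive $\mathcal{O}(m^2)$ complex DFT rather than a fast one, follows the positive-exponent convention of this article, and assumes small integer coefficients so that rounding the result is safe. The function names are hypothetical.

```python
import cmath

def dft(values, sign=1):
    """Naive DFT: y[k] = sum_t values[t] * exp(sign * 2*pi*i * k*t / m)."""
    m = len(values)
    return [sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / m)
                for t in range(m))
            for k in range(m)]

def poly_mul(a, b):
    """Multiply integer polynomials given as coefficient lists (lowest degree first)."""
    size = len(a) + len(b) - 1          # number of coefficients of the product
    m = 1
    while m < size:                     # smallest power of two >= size
        m *= 2
    fa = dft(a + [0] * (m - len(a)))    # zero-pad, then evaluate
    fb = dft(b + [0] * (m - len(b)))
    fc = [x * y for x, y in zip(fa, fb)]        # pointwise product
    c = [v / m for v in dft(fc, sign=-1)]       # inverse transform with 1/m
    return [round(v.real) for v in c[:size]]

print(poly_mul([1, 2, 3], [4, 5, 6]))   # (1+2x+3x^2)(4+5x+6x^2)
```

Replacing `dft` with an $\mathcal{O}(m \log m)$ transform gives the fast multiplication described above.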

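Convolution and the Fourier Transform

In communications, the Fourier transform (continuous-time Fourier transform) is a powerful tool for studying continuous signals. It converts information in the continuous time domain into frequency information, i.e. a spectrum: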

\[S(f) = \int_{-\infty}^{\infty} s(t) \cdot e^{-i2\pi ft} \, dt\]

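Simulating a truly continuous time-domain signal is impossible on existing computers, so a discrete Fourier transform is more valuable in practice, which gives rise to the discrete(-time) Fourier transform.

The discrete Fourier transform (DFT) maps a sequence of complex numbers $\{x_n\} := x_0, x_1, \dots, x_{N-1}$ to another sequence of complex numbers of the same length, $\{X_k\} := X_0, X_1, \dots, X_{N-1}$. The forward DFT is defined as:
\[X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-i 2\pi \frac{k}{N} n}, \quad k = 0, \dots, N-1\]
Here $x_n$ are the sampled signal values in the time domain, $X_k$ are the frequency components in the frequency domain, and $N$ is the sequence length. $e^{-i 2\pi \frac{k}{N} n}$ is the complex exponential basis function; by Euler's formula it expands to $\cos(2\pi \frac{k}{N} n) - i \sin(2\pi \frac{k}{N} n)$.
The inverse DFT recovers the time-domain sequence from the frequency-domain sequence:
\[x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cdot e^{i 2\pi \frac{k}{N} n}, \quad n = 0, \dots, N-1\]
Note that the forward DFT in signal processing usually uses the negative-exponent convention, whereas this article, taking the polynomial-evaluation viewpoint, uses the positive-exponent convention $y_k=\sum_{j=0}^{n-1}a_j w_n^{kj}$. The two conventions are complex conjugates of each other; either works as long as the forward and inverse transforms are consistent.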

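Switching back to the polynomial viewpoint, we obtain the following loose analogy:

Time-domain form	After the discrete Fourier transform	Practical meaning of the Fourier transform
Continuous time-domain signal of a wave	Spectrum (frequency, amplitude, phase)	Frequency decomposition and component analysis: a superposed wave is resolved into single-frequency sinusoids, which simplifies computing superpositions and spectra
Coefficient vector of a polynomial	Point-value representation of the polynomial (evaluation form)	Faster polynomial multiplication: by analogy with wave superposition, it allows convolution, i.e. multiplication, to be computed quickly

In essence, convolution and multiplication are the same thing: multiplying two polynomials is exactly a linear convolution of their coefficient sequences. To simplify the later discussion of the NTT, we work directly with the polynomial ring $\mathbb{Z}_q[x]$ over the integer quotient ring $\mathbb{Z}_q$.

Given two polynomials $G(x)$ and $H(x)$ of degree $n-1$ over the commutative ring $\mathbb{Z}_q[x]$, where $q \in \mathbb{Z}$ and $x$ is the polynomial variable, the product of $G(x)$ and $H(x)$ is defined as: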

\[Y(x)=G(x) \cdot H(x)=\sum_{k=0}^{2(n-1)} y_k x^k\]

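The new coefficients are $y_k=\sum_{i=0}^k g_i h_{k-i} \bmod q$ (with $g_i = h_i = 0$ for $i \ge n$), where $\boldsymbol{g}$ and $\boldsymbol{h}$ are the coefficient vectors of $G(x)$ and $H(x)$ respectively.

Let $\mathbf{g} = \{g_0, g_1, \dots, g_{n-1}\}$ and $\mathbf{h} = \{h_0, h_1, \dots, h_{n-1}\}$ be vectors of length $n$. Their linear convolution $\mathbf{y} = \mathbf{g} * \mathbf{h}$ is defined as: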

\[y_k = \sum_{i} g_i h_{k-i}\]

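Here the result vector $\mathbf{y}$ has length $2n-1$, with index $k \in \{0, 1, \dots, 2n-2\}$. For each $k$, the sum ranges over those $i$ with $0 \le i < n$ and $0 \le k-i < n$.

It is easy to verify that this linear convolution is equivalent to polynomial multiplication, and that polynomials in point-value form after a discrete Fourier transform make convolution more convenient. Beyond linear convolution, cryptography frequently uses wrapped convolutions:

Positive wrapped convolution (PWC): equivalent to multiplication in the polynomial quotient ring $\mathbb{Z}_q[x] / (x^n - 1)$

Negative wrapped convolution (NWC): equivalent to multiplication in the polynomial quotient ring $\mathbb{Z}_q[x] / (x^n + 1)$

In the polynomial quotient ring $\mathbb{Z}_q[x] / (x^n - 1)$, let $G(x)$ and $H(x)$ be two polynomials of degree $n - 1$ with coefficient vectors $\mathbf{g} = \{g_0, g_1, \dots, g_{n-1}\}, \mathbf{h} = \{h_0, h_1, \dots, h_{n-1}\}$. The $k$-th element of their cyclic convolution $\mathbf{y} = \mathbf{g} \circledast \mathbf{h}$ is defined as: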

\[y_k = \sum_{i=0}^{n-1} g_i \cdot h_{(k-i) \pmod n} \\ \iff y_k = \sum_{i=0}^{k} g_i \cdot h_{k-i} + \sum_{i=k + 1}^{n-1} g_i \cdot h_{k + n - i}\]

Here $k \in \{0, 1, \dots, n-1\}$. The equivalent polynomial form of this vector computation is:

\[Y(x) = G(x) \cdot H(x) \pmod{x^n - 1}\]

In the quotient ring $\mathbb{Z}_q[x] / (x^n + 1)$, let $G(x)$ and $H(x)$ be two polynomials of degree $n - 1$ with coefficient vectors $\mathbf{g} = \{g_0, g_1, \dots, g_{n-1}\}, \mathbf{h} = \{h_0, h_1, \dots, h_{n-1}\}$. The $k$-th element of their negacyclic convolution $\mathbf{y} = \mathbf{g} \star \mathbf{h}$ is defined as:

\[y_k = \left( \sum_{i=0}^{k} g_i h_{k-i} - \sum_{i=k+1}^{n-1} g_i h_{k+n-i} \right)\]

Here $k \in \{0, 1, \dots, n-1\}$. The equivalent polynomial form of this vector computation is:

\[Y(x) = G(x) \cdot H(x) \pmod{x^n + 1}\]
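The three convolutions can be written down directly from their definitions. The sketch below is a naive $\mathcal{O}(n^2)$ illustration with hypothetical function names; the modulus $q = 17$ in the usage lines is an arbitrary assumption.

```python
def linear_conv(g, h, q):
    """Linear convolution: coefficients of G(x) * H(x) mod q, length 2n - 1."""
    n = len(g)
    y = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            y[i + j] = (y[i + j] + g[i] * h[j]) % q
    return y

def cyclic_conv(g, h, q):
    """Positive wrapped convolution: G(x) * H(x) mod (x^n - 1, q)."""
    n = len(g)
    return [sum(g[i] * h[(k - i) % n] for i in range(n)) % q for k in range(n)]

def negacyclic_conv(g, h, q):
    """Negative wrapped convolution: G(x) * H(x) mod (x^n + 1, q)."""
    n = len(g)
    y = []
    for k in range(n):
        acc = 0
        for i in range(n):
            term = g[i] * h[(k - i) % n]
            acc += term if i <= k else -term   # wrapped terms enter with a minus sign
        y.append(acc % q)
    return y

g, h, q = [1, 2, 3, 4], [5, 6, 7, 8], 17
lin = linear_conv(g, h, q) + [0]   # pad to length 2n for easy folding
```

Folding the upper half of `lin` onto the lower half with a plus sign reproduces `cyclic_conv(g, h, q)`, and folding with a minus sign reproduces `negacyclic_conv(g, h, q)`.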

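Negacyclic convolution, also often called negative wrapped convolution (NWC), is the core acceleration primitive in lattice-based cryptography (e.g. Kyber, Dilithium) and in fully homomorphic encryption.

Fast Fourier Transform

So how do we implement $\mathsf{DFT}$ and $\mathsf{iDFT}$ with $\mathcal{O}(n \log n)$ complexity? Evaluating a polynomial at a single point costs $\mathcal{O}(n)$ in general, so the naive $\mathsf{DFT}$ costs $\mathcal{O}(n^2)$; naive Lagrange interpolation also costs $\mathcal{O}(n^2)$. The key to the fast Fourier transform lies in the special vector of evaluation points formed by the roots of unity: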

\[\vec w = (w_{n,0}, w_{n,1}, \ldots, w_{n,n-1}) = (w_n^0, w_n^1, \ldots, w_n^{n-1})\]

The core algorithmic idea is divide and conquer. We know that:

\[A(x) = A_0(x^2) + x A_1(x^2)\]

where $A_0(x), A_1(x)$ are polynomials with only $\frac{n}{2}$ coefficients each, satisfying:

\[\begin{aligned} A_0(x) &= a_0 x^0 + a_2 x^1 + \dots + a_{n-2} x^{\frac{n}{2}-1} \\ A_1(x) &= a_1 x^0 + a_3 x^1 + \dots + a_{n-1} x^{\frac{n}{2}-1} \end{aligned}\]

DFT Algorithm in $\mathcal{O}(n \log n)$

Given the coefficient vector $\vec{A} = (a_0, a_1, \ldots, a_{n-1})$ of a polynomial of degree $n-1$, corresponding to the polynomial

\[A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}.\]

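how can we compute, in $\mathcal{O}(n \log n)$ time, its values $\left(y_0, y_1, \ldots, y_{n-1}\right)$ at the $n$-th roots of unity $\vec w = (w_{n,0}, w_{n,1}, \ldots, w_{n,n-1})$, where $y_i = A\left(w_{n,i}\right)$?

Define $T_{\mathsf{DFT}}(n)$ as the time complexity of computing the discrete Fourier transform of a polynomial with $n$ coefficients. By the decomposition $A(x) = A_0(x^2) + xA_1(x^2)$, if we can obtain the $\mathsf{DFT}$ vector of $A$ from the known $\mathsf{DFT}$ vectors of $A_0, A_1$ in $O(n)$ time, then the complexity satisfies the recurrence: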

\[T_{\mathsf{DFT}}(n) = 2T_{\mathsf{DFT}}(\frac{n}{2}) + \mathcal{O}(n)\]

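By the master theorem for recurrences, the resulting time complexity is $\mathcal{O}(n \log n)$. A key observation is that squaring the vector of $n$-th roots of unity, $\vec w^2 = (w_n^0, w_n^2, \ldots, w_n^{2(n-1)})$, yields exactly all the $\frac{n}{2}$-th roots of unity (each appearing twice), so the evaluation points needed for $A_0(x), A_1(x)$ match the pattern for $A(x)$ exactly. Concretely, suppose we already know the discrete Fourier transforms of $A_0(x)$ and $A_1(x)$: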

\[\begin{cases} \left(y_k^0\right)_{k=0}^{n/2-1} = \mathsf{DFT}(A_0) \\ \left(y_k^1\right)_{k=0}^{n/2-1} = \mathsf{DFT}(A_1) \end{cases}\]

Note the special properties of the roots of unity:

\[\begin{cases} w_{n}^{2k} = e^{\frac{2\pi k i}{n/2}} = w_{n/2}^{k} & k \in [0, n/2 - 1] \\ w_{n}^{k + \frac{n}{2}}= - w_{n}^{k} & k \in [0, n - 1] \end{cases}\]

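Therefore, the $n$ point values of $\mathsf{DFT}(A)$ can be recovered as follows:

\[\begin{aligned} y_k &= A(w_n^k) = A_0(w_n^{2k}) + w_n^k A_1(w_n^{2k}) = A_0(w_{n/2}^{k}) + w_n^k A_1(w_{n/2}^{k}) \\ y_{k+n/2} &= A(w_n^{k+n/2}) = A_0(w_n^{2k+n}) + w_n^{k+n/2} A_1(w_n^{2k+n}) = A_0(w_{n/2}^{k}) - w_n^k A_1(w_{n/2}^{k}) \end{aligned}\]

Written more elegantly: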

\[\begin{cases} y_k &= y_k^0 + w_n^k y_k^1, &\quad k = 0 \dots \frac{n}{2} - 1, \\ y_{k+n/2} &= y_k^0 - w_n^k y_k^1, &\quad k = 0 \dots \frac{n}{2} - 1. \end{cases}\]

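The formulas above are also called the butterfly formulas. The whole recursion is very elegant: with the butterfly formulas, $\mathcal{O}(n)$ time suffices to recover the discrete Fourier transform of $A$ from the discrete Fourier transforms of $A_0, A_1$. In summary, we have given an $\mathcal{O}(n \log n)$ recursive algorithm for the discrete Fourier transform $\mathsf{DFT}$.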
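The recursive algorithm can be sketched in a few lines of Python. This is a minimal illustration over the complex numbers, using the positive-exponent convention $w_n = e^{2\pi i/n}$ of this article and assuming the input length is a power of two; the function name `fft` is hypothetical.

```python
import cmath

def fft(a):
    """Recursive radix-2 DFT: returns [A(w_n^0), ..., A(w_n^(n-1))]."""
    n = len(a)
    if n == 1:
        return list(a)
    y0 = fft(a[0::2])   # DFT of A_0 (even-indexed coefficients)
    y1 = fft(a[1::2])   # DFT of A_1 (odd-indexed coefficients)
    y = [0] * n
    for k in range(n // 2):
        t = cmath.exp(2j * cmath.pi * k / n) * y1[k]   # w_n^k * y_k^1
        y[k] = y0[k] + t             # butterfly: upper half
        y[k + n // 2] = y0[k] - t    # butterfly: lower half
    return y

y = fft([1, 2, 3, 4])   # values of 1 + 2x + 3x^2 + 4x^3 at 1, i, -1, -i
```

Up to floating-point error, `y` should be close to `[10, -2-2j, -2, -2+2j]`.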

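iDFT Algorithm in $\mathcal{O}(n \log n)$

Given the values $\left(y_0, y_1, \ldots, y_{n-1}\right)$ of a polynomial $A(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}$ of degree $n-1$ at the $n$-th roots of unity $\vec w = (w_{n,0}, w_{n,1}, \ldots, w_{n,n-1})$, where $y_i = A\left(w_{n,i}\right)$, how can we compute its coefficient vector $\vec{A} = (a_0, a_1, \ldots, a_{n-1})$ in $\mathcal{O}(n \log n)$ time?

Simply put, this is polynomial interpolation. Lagrange interpolation solves it in $\mathcal{O}(n^2)$ time; in essence, it solves a linear system, namely

\[\mathbf{V} \cdot (a_0, a_1, \ldots, a_{n-1})^{\top} = (y_0, y_1, \ldots, y_{n-1})^{\top}, \quad \mathbf{V}_{kj} = w_n^{kj}\]

where $\mathbf{V} \in \mathbb{C}^{n \times n}$ is the Vandermonde matrix. Its inverse is:

\[\left(\mathbf{V}^{-1}\right)_{kj} = \frac{1}{n} w_n^{-kj}\]

Hence Lagrange interpolation in this form is also very elegant, and we can likewise give the expression for $a_{k}$ directly: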

\[a_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j w_n^{-k j}\]

This yields a problem almost identical to the $\mathsf{DFT}$ expression $y_k = \sum_{j=0}^{n-1} a_j w_n^{k j}$. The key changes are:

\[\begin{cases} 1 &\implies \frac{1}{n} \\ w_n^{k j} &\implies w_n^{-k j} \end{cases}\]

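The recursive algorithm of the previous section applies here as well. In summary, we have given an $\mathcal{O}(n \log n)$ recursive algorithm for the inverse discrete Fourier transform $\textsf{iDFT}$.

The core of FFT acceleration: the fundamental reason the fast Fourier transform is fast is the periodicity of the roots of unity:
\[w_{n}^{n} = 1, \quad w_{n}^{\frac{n}{2}} = -1\]
This allows many computations to be reused, which is the essence of the recursive speed-up. In the later discussion of the NTT, we elaborate further on how the periodicity of the roots of unity is used to reuse computations.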

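Fast Number Theoretic Transform

In cryptography, we usually care about polynomials over integer rings; more specifically, polynomials over the integer quotient ring $\mathbb{Z}_{q}$, where in most cases $q$ is taken to be prime. In this section, we describe all multiplications in terms of convolutions, via the following correspondence.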

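Linear convolution	Positive wrapped (cyclic) convolution	Negative wrapped (negacyclic) convolution
Multiplication in $\mathbb{Z}_q[x]$	Multiplication in $\mathbb{Z}_q[x] / (x^n - 1)$	Multiplication in $\mathbb{Z}_q[x] / (x^n + 1)$

Over the integer quotient ring, we need roots of unity with the same properties as $e^{\frac{2\pi i}{n}}$ in the discrete Fourier transform, namely the primitive roots of unity over $\mathbb{Z}_{q}$ defined below.

We call $w$ a primitive $n$-th root of unity over $\mathbb{Z}_{q}$ if and only if it satisfies: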

\[w^n \equiv 1 \bmod q, \text{ and } w^i \not\equiv 1 \bmod q, \forall i \in [1, n-1]\]
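The definition can be checked by brute force for small parameters. The sketch below uses hypothetical function names and is meant only for toy moduli; practical implementations derive the root from a generator of $\mathbb{Z}_q^{*}$ instead of searching.

```python
def is_primitive_root_of_unity(w, n, q):
    """True iff w^n = 1 (mod q) and w^i != 1 (mod q) for all 1 <= i < n."""
    if pow(w, n, q) != 1:
        return False
    return all(pow(w, i, q) != 1 for i in range(1, n))

def find_primitive_root_of_unity(n, q):
    """Smallest primitive n-th root of unity mod q, or None if none exists."""
    for w in range(1, q):
        if is_primitive_root_of_unity(w, n, q):
            return w
    return None

print(find_primitive_root_of_unity(16, 17))
print(find_primitive_root_of_unity(32, 17))
```

For a prime $q$, a primitive $n$-th root of unity exists exactly when $n$ divides $q - 1$, which is why the second call finds nothing.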

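Linear / Positive Wrapped Convolution

Let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$, and let $A(x)$ be a polynomial of degree $n-1$ in $\mathbb{Z}_q[x]$. The number theoretic transform (NTT) of its coefficient vector $\mathbf{a}$ is defined as $\hat{\mathbf{a}} = \textsf{NTT}^{\omega}(\mathbf{a})$: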

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \omega^{ij} \mathbf{a}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

In particular, $\hat{\mathbf{a}}_j = A(\omega^j) \pmod{ q }$.

Let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$. The inverse number theoretic transform (iNTT) of an $n$-dimensional point-value vector $\hat{\mathbf{a}}$ is defined as $\mathbf{a} = \textsf{iNTT}^{\omega}(\hat{\mathbf{a}})$:

\[\mathbf{a}_j = \frac{1}{n} \sum_{i=0}^{n-1} \omega^{-ij} \hat{\mathbf{a}}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

It is easy to verify (by showing that the two matrices relating $\hat{\mathbf{a}}$ and ${\mathbf{a}}$ in the expressions above are inverses of each other) that:

\[\mathbf{a} = \textsf{iNTT}^{\omega}\left(\textsf{NTT}^{\omega}\left(\mathbf{a}\right)\right)\]

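It follows that linear convolution can be computed with the NTT as shown below. Note that to compute a linear convolution over $\mathbb{Z}_q[x]$, one should choose a transform length $m \ge 2n-1$, take $\omega$ to be a primitive $m$-th root of unity, and zero-pad the inputs: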

\[\mathbf{c} = \mathbf{a} * \mathbf{b} = \textsf{iNTT}^{\omega}\left(\textsf{NTT}^{\omega}\left(\mathbf{a}\right) \circ \textsf{NTT}^{\omega}\left(\mathbf{b}\right)\right)\]

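The optimization techniques of the fast Fourier transform carry over completely to the NTT and iNTT. As mentioned earlier, to compute a linear convolution over $\mathbb{Z}_q[x]$, the dimension of $\mathbf{c}$ must be $2n - 1$. If we only use a primitive $n$-th root of unity and a length-$n$ transform, we do not obtain the linear convolution but a cyclic convolution. From the polynomial viewpoint, the point values we obtain are still the true values $A(\omega^i) \cdot B(\omega^i) \bmod q$, but the coefficients of monomials of degree $\ge n$ are all wrapped around and accumulated onto lower-degree coefficients. Because the evaluation points are powers of a primitive $n$-th root, each of them satisfies $x^{n + k} = x^k$, which is equivalent to: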

\[x^{n+k} \equiv x^k \pmod {x^{n} - 1}\]

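That is, the result is reduced modulo the polynomial ${x^{n} - 1}$. At the coefficient level, the true high-degree coefficients are accumulated cyclically onto the low-degree coefficients, which is exactly the expression for positive wrapped convolution: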

\[y_k = \sum_{i=0}^{k} g_i \cdot h_{k-i} + \sum_{i=k + 1}^{n-1} g_i \cdot h_{k + n - i}\]

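Let $\textsf{NTT}_{n}^{\omega}(\cdot)$ denote the number theoretic transform with primitive root $\omega$ acting on $n$-dimensional vectors. Unless stated otherwise, we omit the parameter $n$, taking it to match the dimension of the vector acted upon, and simply write $\textsf{NTT}^{\omega}(\cdot)$. We then obtain the following proposition.

Let $\mathbf{a}, \mathbf{b}$ be two $n$-dimensional vectors over $\mathbb{Z}_q$ (corresponding to two polynomials of degree $n-1$), and let $\omega$ be a primitive $n$-th root of unity over $\mathbb{Z}_q$. Then their positive wrapped convolution can be computed by the following number theoretic transforms: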

\[\mathbf{c} = \mathbf{a} \circledast \mathbf{b} = \textsf{iNTT}^{\omega}\left(\textsf{NTT}^{\omega}\left(\mathbf{a}\right) \circ \textsf{NTT}^{\omega}\left(\mathbf{b}\right)\right)\]
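The proposition can be checked numerically with naive transforms. The parameters $q = 17$, $n = 8$, $\omega = 9$ are illustrative assumptions, and the function names are hypothetical.

```python
Q, OMEGA = 17, 9   # assumed toy parameters: 9 has order 8 modulo 17

def ntt(a, omega=OMEGA, q=Q):
    """Naive NTT: a_hat[j] = sum_i omega^(i*j) * a[i] mod q."""
    n = len(a)
    return [sum(pow(omega, i * j, q) * a[i] for i in range(n)) % q
            for j in range(n)]

def intt(a_hat, omega=OMEGA, q=Q):
    """Naive inverse NTT: a[j] = n^-1 * sum_i omega^(-i*j) * a_hat[i] mod q."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)   # needs Python 3.8+
    return [n_inv * sum(pow(omega, -i * j, q) * a_hat[i] for i in range(n)) % q
            for j in range(n)]

def cyclic_conv_ntt(a, b, q=Q):
    """Positive wrapped convolution via pointwise multiplication in the NTT domain."""
    prod = [x * y % q for x, y in zip(ntt(a), ntt(b))]
    return intt(prod)

def cyclic_conv_direct(a, b, q=Q):
    """Positive wrapped convolution straight from the definition."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) % q for k in range(n)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
```

Both `cyclic_conv_ntt(a, b)` and `cyclic_conv_direct(a, b)` should return the same vector.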

More fundamentally, the evaluation points used by the NTT, i.e. the powers of the primitive $n$-th root of unity, all satisfy:

\[x^n = 1 \iff x^n - 1= 0\]

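Therefore, the final result is clearly the result modulo the polynomial $x^n - 1$. This view makes PWC more intuitive and helps in understanding the mathematical intuition behind NWC in the next section.

Negative Wrapped (Negacyclic) Convolution

Next we consider how to compute the negacyclic convolution. From the expression: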

\[y_k = \sum_{i=0}^{k} g_i \cdot h_{k-i} - \sum_{i=k + 1}^{n-1} g_i \cdot h_{k + n - i}\]

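it is natural to expect that the coefficients of monomials of degree $\ge n$ are again accumulated onto the corresponding low-degree terms (with the degree reduced modulo $n$), except that their contribution is now negative. We therefore want the relation $x^{n + k} = -x^{k}$, i.e. the root $\varphi$ used by the NTT should satisfy $\varphi^{n} = - 1$, which means that $\varphi$ is a primitive $2n$-th root of unity. However, simply replacing $\omega$ by $\varphi$ in the positive wrapped transform does not directly yield the negacyclic convolution; it produces a frequency shift, i.e. a mathematical mismatch. Moreover, the twiddle factors of the standard NTT follow the sequence $\omega^0, \omega^1, \omega^2, \ldots, \omega^{n-1}$ of roots of $x^n = 1$. If we simply substitute $\varphi^0, \varphi^1, \varphi^2, \ldots, \varphi^{n-1}$, this half of the $2n$-th roots of unity does not satisfy a single consistent identity $x^n = 1$ or $x^n = -1$ (the even powers satisfy the former and the odd powers the latter), and such an identity is the key to the fast NTT. Here I give two mathematical ways of understanding the NWC construction.

Let $\varphi$ be a primitive $2n$-th root of unity and $\omega$ a primitive $n$-th root of unity, with $\omega = \varphi^2$.

In terms of sequences, what we need are all the roots of $x^n = -1$. There are exactly $n$ of them, and the reader can verify that these $n$ roots are precisely the following sequence: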

\[\{\varphi^1, \varphi^3, \varphi^5, \ldots, \varphi^{2n-1}\}\]

In other words, the roots of $x^n+1$ are exactly the odd powers of the primitive $2n$-th root of unity. The NTT based on $\varphi$ is therefore constructed as:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{i(2j+1)} a_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

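The second idea is to convert NWC into PWC, so that the standard NTT defined earlier can be reused. To achieve this we transform the coefficients: define a new polynomial $\hat{A}(y)$ through the substitution $x = \varphi \cdot y$. When $x^n = -1$, we have $(\varphi y)^n = -1 \Rightarrow \varphi^n y^n = -1$. Since $\varphi^n = -1$, this becomes $-y^n = -1 \Rightarrow y^n = 1$. Hence applying the positive cyclic NTT to $\hat{A}(y)$ yields the negative cyclic NTT of the original polynomial.

On the coefficient side, this is the map $\mathbf{a}'_i = \mathbf{a}_i \cdot \varphi^i$. Construct the polynomial $A'(x) = \sum \mathbf{a}'_i x^i$ and apply the positive cyclic NTT to it:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \omega^{ij} \mathbf{a}'_i = \sum_{i=0}^{n-1} \varphi^{i} \omega^{ij} \mathbf{a}_i \pmod q\]

where $j = 0, 1, 2, \dots, n-1$.

Both approaches give the same result, which leads to the following formal definition of the negative cyclic number theoretic transform.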

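Let $\varphi$ be a primitive $2n$-th root of unity in $\mathbb{Z}_q$, so that $\omega := \varphi^2$ is a primitive $n$-th root of unity in $\mathbb{Z}_q$, and let $A(x)$ be a polynomial of degree $n-1$ in $\mathbb{Z}_q[x]$. The $\varphi$-based number theoretic transform of its coefficient vector $\mathbf{a}$ is defined as $\hat{\mathbf{a}} = \textsf{NTT}^{\varphi}(\mathbf{a})$: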

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^i \omega^{ij} \mathbf{a}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

Substituting $\omega := \varphi^2$, this is equivalent to:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{i(2j + 1)} \mathbf{a}_i \pmod q\]
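The two equivalent forms above can be checked numerically. The sketch below uses the assumed toy parameters $q = 17$, $n = 8$, $\varphi = 3$ (so $\omega = 9$); the function names are hypothetical. It verifies that the odd powers of $\varphi$ are the $n$ distinct roots of $x^n + 1$, and that twisting the coefficients by $\varphi^i$ followed by a standard NTT agrees with direct evaluation at the odd powers.

```python
# Toy parameters (assumptions): 3 has order 16 = 2n modulo 17, so PHI^N = -1 mod Q.
Q, N, PHI = 17, 8, 3
OMEGA = PHI * PHI % Q

def eval_at_odd_powers(a):
    """a_hat[j] = sum_i PHI^(i*(2j+1)) * a[i]: evaluation at the roots of x^n + 1."""
    return [sum(pow(PHI, i * (2 * j + 1), Q) * a[i] for i in range(N)) % Q
            for j in range(N)]

def twist_then_ntt(a):
    """Scale a[i] by PHI^i, then apply the standard (positive cyclic) NTT with OMEGA."""
    twisted = [pow(PHI, i, Q) * a[i] % Q for i in range(N)]
    return [sum(pow(OMEGA, i * j, Q) * twisted[i] for i in range(N)) % Q
            for j in range(N)]

roots = [pow(PHI, 2 * j + 1, Q) for j in range(N)]
# Every odd power is a root of x^n + 1, and the n roots are pairwise distinct.
assert all((pow(r, N, Q) + 1) % Q == 0 for r in roots)
assert len(set(roots)) == N

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert eval_at_odd_powers(a) == twist_then_ntt(a)
```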

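Similarly, inverting the Vandermonde matrix yields the formula for the negative cyclic inverse NTT. It is worth pointing out that the paper https://eprint.iacr.org/2024/585.pdf contains a significant typo in its definition of iNTT.

Let $\varphi$ be a primitive $2n$-th root of unity in $\mathbb{Z}_q$, and let $\omega := \varphi^2$, a primitive $n$-th root of unity in $\mathbb{Z}_q$. The $\varphi$-based inverse number theoretic transform of an $n$-dimensional point-value vector $\hat{\mathbf{a}}$ is defined as $\mathbf{a} = \textsf{iNTT}^\varphi(\hat{\mathbf{a}})$: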

\[\mathbf{a}_j = \frac{1}{n} \sum_{i=0}^{n-1} \varphi^{-j} \omega^{-ij} \hat{\mathbf{a}}_i \pmod q, \quad j = 0, 1, 2, \dots, n-1\]

Substituting $\omega := \varphi^2$, this is equivalent to:

\[\mathbf{a}_j = \frac{1}{n} \sum_{i=0}^{n-1} \varphi^{-j(2i + 1)} \hat{\mathbf{a}}_i \pmod q\]

It is easy to verify, by showing that the two matrices relating $\hat{\mathbf{a}}$ and ${\mathbf{a}}$ in the expressions above are inverses of each other, that:

\[\mathbf{a} = \textsf{iNTT}^\varphi \left(\textsf{NTT}^\varphi\left(\mathbf{a}\right)\right)\]

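Let $\mathbf{a}, \mathbf{b}$ be two $n$-dimensional vectors over $\mathbb{Z}_q$ (the coefficient vectors of two polynomials of degree $n-1$), and let $\varphi$ be a primitive $2n$-th root of unity in $\mathbb{Z}_q$. Then their negative cyclic convolution can be computed with the number theoretic transform: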

\[\mathbf{c} = \mathbf{a} \star \mathbf{b} = \textsf{iNTT}^{\varphi}\left(\textsf{NTT}^{\varphi}\left(\mathbf{a}\right) \circ \textsf{NTT}^{\varphi}\left(\mathbf{b}\right)\right)\]
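To make the statement concrete, here is a minimal sketch that implements the two definitions above naively and compares the result with a schoolbook product reduced modulo $x^n + 1$. The parameters $q = 17$, $n = 8$, $\varphi = 3$ and the function names are assumptions made for this example.

```python
# Toy parameters (assumptions): 3 is a primitive 16th root of unity modulo 17.
Q, N, PHI = 17, 8, 3

def ntt_neg(a, phi, q):
    """a_hat[j] = sum_i phi^(i*(2j+1)) * a[i] mod q."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def intt_neg(a_hat, phi, q):
    """a[j] = n^-1 * sum_i phi^(-j*(2i+1)) * a_hat[i] mod q."""
    n = len(a_hat)
    n_inv, phi_inv = pow(n, -1, q), pow(phi, -1, q)
    return [n_inv * sum(pow(phi_inv, j * (2 * i + 1), q) * a_hat[i] for i in range(n)) % q
            for j in range(n)]

def negacyclic_convolution(a, b, q):
    """Schoolbook product modulo x^n + 1: wrapped terms enter with a minus sign."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
pointwise = [x * y % Q for x, y in zip(ntt_neg(a, PHI, Q), ntt_neg(b, PHI, Q))]
assert intt_neg(ntt_neg(a, PHI, Q), PHI, Q) == a
assert intt_neg(pointwise, PHI, Q) == negacyclic_convolution(a, b, Q)
```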

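The Essence of the Number Theoretic Transform

From the algebraic viewpoint, the essence of the NTT is the decomposition of a ring and the resulting isomorphism. We take NWC as the example. If $\varphi$ is a primitive $2n$-th root of unity in $\mathbb{Z}_q$, then the cyclotomic polynomial $C(X) = X^{n} + 1$ (with $n$ a power of two) factors as: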

\[C(X) = \prod_{i=0}^{n-1} (X - \varphi^{2i + 1})\]

By the Chinese Remainder Theorem (CRT), we obtain the following ring isomorphism:

\[\mathbb{Z}_q[X] / (X^n + 1) \cong \prod_{i=0}^{n-1} \mathbb{Z}_q[X] / (X - \varphi^{2i + 1})\]

For every root $\alpha = \varphi^{2i + 1}$ we have $\mathbb{Z}_q[X] / (X - \alpha) \cong \mathbb{Z}_q$ (via the map $X \mapsto \alpha$, i.e. evaluating the polynomial at the point $\alpha$), so the isomorphism simplifies further to:

\[\mathbb{Z}_q[X] / (X^n + 1) \cong \underbrace{\mathbb{Z}_q \times \mathbb{Z}_q \times \dots \times \mathbb{Z}_q}_{n} \cong \mathbb{Z}_q^n\]

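For a polynomial $A(X) \in \mathbb{Z}_q[X] / (X^n + 1)$, the NTT of its coefficient vector $\mathbf{a}$ and the inverse transform iNTT are exactly this ring isomorphism and its inverse.

The essence of the fast Fourier / number theoretic transform is that the ring isomorphism above also admits a decomposition that can be applied recursively in a divide-and-conquer fashion. Since $\varphi^n = -1$, the modulus splits as a difference of squares:

\[X^n + 1 = X^n - \varphi^n = \left(X^{n/2} - \varphi^{n/2}\right)\left(X^{n/2} + \varphi^{n/2}\right)\]

This corresponds to the following CRT isomorphism:

Figure 1. Recursive CRT decomposition in the NWC setting, from https://arxiv.org/pdf/2211.13546

Similarly, for PWC the modulus polynomial $x^n - 1$ admits an analogous CRT isomorphism:

Figure 2. Recursive CRT decomposition in the PWC setting, from https://arxiv.org/pdf/2211.13546

From this decomposition of the ring isomorphism we can already see the outline of the butterfly operation. In the next section we introduce the butterfly of the fast NTT (the Cooley-Tukey algorithm) and the butterfly of the fast inverse NTT (the Gentleman-Sande algorithm).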

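CT/GS Butterfly Algorithms

Let $\varphi$ be a primitive $2n$-th root of unity in $\mathbb{Z}_q$ and $\omega := \varphi^2$ a primitive $n$-th root of unity in $\mathbb{Z}_q$, where $n$ is a power of 2 so that the recursion can be carried all the way down.

The key property behind the fast transform is: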

\[\varphi^{k+2n} = \varphi^{k} \\ \varphi^{k+n} = -\varphi^{k}\]

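To unify the notation, write $\textsf{NTT}^{+}$ for the NTT of the positive cyclic convolution and $\textsf{NTT}^{-}$ for the NTT of the negative cyclic convolution. Since $\omega := \varphi^2$, both can be expressed in terms of $\varphi$:

\[\begin{cases} \textsf{NTT}^{+}: & \hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{2ij} \mathbf{a}_i \pmod q \\ \textsf{NTT}^{-}: & \hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{2ij+i} \mathbf{a}_i \pmod q \end{cases}\]

From here on we only consider the negative cyclic case, because the coefficient rescaling $\mathbf{b}_i :=\varphi^{-i} \cdot \mathbf{a}_i$ gives $\textsf{NTT}^{+}(\mathbf{a}) = \textsf{NTT}^{-}(\mathbf{b})$, i.e. the corresponding positive cyclic transform.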

Fast-NTT: Cooley-Tukey Algorithm

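Consider the first step of the ring isomorphism, which splits the sum by the parity of the input index:

\[\hat{\mathbf{a}}_j = \sum_{i=0}^{n-1} \varphi^{2ij+i} a_i = \sum_{i=0}^{n/2-1} \varphi^{4ij+2i} a_{2i} + \varphi^{2j+1} \sum_{i=0}^{n/2-1} \varphi^{4ij+2i} a_{2i+1}\]

Now consider the outputs with index $J = j + n/2 \ge n/2$. Using $\varphi^{2in} = 1$ and $\varphi^{n} = -1$:

\[\hat{\mathbf{a}}_{j+n/2} = \sum_{i=0}^{n/2-1} \varphi^{4ij+2i} a_{2i} - \varphi^{2j+1} \sum_{i=0}^{n/2-1} \varphi^{4ij+2i} a_{2i+1}\]

This exposes partial sums that can be reused. Let $A_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i}$ and $B_j=\sum_{i=0}^{n / 2-1} \varphi^{4 i j+2 i} a_{2 i+1}$. The decomposition then gives:

\[\begin{cases} \hat{\mathbf{a}}_j = A_j + \varphi^{2j+1} B_j \\ \hat{\mathbf{a}}_{j+n/2} = A_j - \varphi^{2j+1} B_j \end{cases} \quad j = 0, \ldots, \frac{n}{2}-1\]

The values $A_j, B_j$ can in turn be computed by $n/2$-point NTTs. Define: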

\[\begin{cases} \mathbf{a}^{(0)} = (a_0, a_2, \ldots, a_{n-2}) \\ \mathbf{a}^{(1)} = (a_1, a_3, \ldots, a_{n-1}) \end{cases}\]

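Since $\omega = \varphi^2$ is a primitive $2 \cdot \left( \frac{n}{2} \right)$-th root of unity, $A_j$ and $B_j$ are negative cyclic transforms of length $n/2$ with root $\omega$:

\[A_j = \textsf{NTT}_{n/2}^{\omega}\left(\mathbf{a}^{(0)}\right)_j, \quad B_j = \textsf{NTT}_{n/2}^{\omega}\left(\mathbf{a}^{(1)}\right)_j\]

We recurse in this way until the transform can be computed in constant time (the base case $n = 1$).

For the standard NTT of the positive cyclic convolution the recursive structure is the same: replace $\varphi^{2j+1}$ above by $\omega^j$, and use $\omega^2$ as the root of the subproblems.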
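The recursion above can be sketched in a few lines. This is an illustrative version under assumed toy parameters ($q = 17$, $n = 8$, $\varphi = 3$), with hypothetical function names; it is checked against the naive $\mathcal{O}(n^2)$ definition rather than optimized.

```python
Q, PHI = 17, 3   # toy parameters (assumptions): 3 has order 16 modulo 17

def ntt_neg_naive(a, phi, q):
    """Reference: a_hat[j] = sum_i phi^(i*(2j+1)) * a[i] mod q."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def ntt_neg_recursive(a, phi, q):
    """Recursive Cooley-Tukey negacyclic NTT; len(a) must be a power of two
    and phi a primitive 2*len(a)-th root of unity modulo q."""
    n = len(a)
    if n == 1:
        return [a[0] % q]
    phi_sq = phi * phi % q                         # root for the half-size subproblems
    A = ntt_neg_recursive(a[0::2], phi_sq, q)      # even-indexed coefficients
    B = ntt_neg_recursive(a[1::2], phi_sq, q)      # odd-indexed coefficients
    half = n // 2
    out = [0] * n
    for j in range(half):
        t = pow(phi, 2 * j + 1, q) * B[j] % q      # twiddle phi^(2j+1) times B_j
        out[j] = (A[j] + t) % q
        out[j + half] = (A[j] - t) % q
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert ntt_neg_recursive(a, PHI, Q) == ntt_neg_naive(a, PHI, Q)
```

The slices `a[0::2]` and `a[1::2]` copy data at every level, which is convenient for reading the recursion but is exactly the overhead an iterative implementation avoids.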

Fast-iNTT: Gentleman-Sande Algorithm

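Recall that the inverse NTT is computed as:

\[\mathbf{a}_j = \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-j(2i+1)}\hat{\mathbf{a}}_i \pmod q\]

The fast inverse NTT pairs the upper and lower halves of the input. Since $\varphi^{-jn} = (-1)^j$, the sum decomposes as:

\[\mathbf{a}_j = \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-j(2i+1)}\left(\hat{\mathbf{a}}_i + (-1)^j\,\hat{\mathbf{a}}_{i+n/2}\right) \pmod q\]

The even-indexed and odd-indexed coefficients therefore separate as:

\[\begin{cases} \mathbf{a}_{2k} = \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-2k(2i+1)}\left(\hat{\mathbf{a}}_i + \hat{\mathbf{a}}_{i+n/2}\right) \\ \mathbf{a}_{2k+1} = \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-(2k+1)(2i+1)}\left(\hat{\mathbf{a}}_i - \hat{\mathbf{a}}_{i+n/2}\right) \end{cases}\]

Next we derive the recursion from two angles.

Inverting the CT Transform

The Gentleman-Sande inverse transform can be obtained by directly inverting the butterfly formulas of the Cooley-Tukey forward transform from the previous section. Recall the CT forward transform in the negative cyclic setting. Split the input coefficients by even and odd index: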

\[\begin{cases} \mathbf{a}^{(0)} = (a_0, a_2, \ldots, a_{n-2}) \\ \mathbf{a}^{(1)} = (a_1, a_3, \ldots, a_{n-1}) \end{cases}\]

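Let $\omega = \varphi^2$. Then $\omega$ is the primitive $n$-th root of unity required by the negative cyclic subproblems of length $n/2$, i.e. it plays the role of a primitive $2\cdot(n/2)$-th root of unity there. Write:

\[\mathbf{E} = \textsf{NTT}_{n/2}^{\omega}\left(\mathbf{a}^{(0)}\right), \quad \mathbf{O} = \textsf{NTT}_{n/2}^{\omega}\left(\mathbf{a}^{(1)}\right)\]

Then the CT butterfly gives: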

\[\begin{cases} \hat{\mathbf{a}}_j = E_j + \varphi^{2j+1} O_j, & j = 0, \ldots, \frac{n}{2}-1 \\ \hat{\mathbf{a}}_{j+n/2} = E_j - \varphi^{2j+1} O_j, & j = 0, \ldots, \frac{n}{2}-1 \end{cases}\]

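The GS inverse transform inverts this linear system layer by layer. Given the point-value vector $\hat{\mathbf{a}}$ of the current layer, first pair the upper and lower halves to recover the point-value vectors of the two subproblems of length $n/2$:

\[\begin{cases} E_j = \frac{1}{2}\left(\hat{\mathbf{a}}_j + \hat{\mathbf{a}}_{j+n/2}\right) \\ O_j = \frac{1}{2\varphi^{2j+1}}\left(\hat{\mathbf{a}}_j - \hat{\mathbf{a}}_{j+n/2}\right) \end{cases} \quad j = 0, \ldots, \frac{n}{2}-1\]

Then recursively apply the inverse transform of length $n/2$ to $\mathbf{E}$ and $\mathbf{O}$: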

\[\begin{cases} \mathbf{a}^{(0)} = \textsf{iNTT}_{n/2}^{\omega}(\mathbf{E}) \\ \mathbf{a}^{(1)} = \textsf{iNTT}_{n/2}^{\omega}(\mathbf{O}) \end{cases}\]

Finally, interleave the coefficients of the two subproblems:

\[\begin{cases} a_{2r} = a^{(0)}_r, & r = 0, \ldots, \frac{n}{2}-1 \\ a_{2r+1} = a^{(1)}_r, & r = 0, \ldots, \frac{n}{2}-1 \end{cases}\]

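The recursion terminates at $n=1$, where the input point-value vector is itself the coefficient vector. Since every butterfly layer multiplies by $\frac{1}{2}$ and there are $\log_2 n$ layers in total, the overall scaling factor is: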

\[\left(\frac{1}{2}\right)^{\log_2 n} = \frac{1}{n}\]

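This is exactly the normalization factor $\frac{1}{n}$ in the definition of iNTT. An implementation can therefore use $2^{-1} \bmod q$ in every layer and does not need an extra multiplication by $n^{-1}$ after the recursion ends.

Deriving the Standard GS Transform

We now derive the recursion directly from the defining formula of $\textsf{iNTT}^{\varphi}$. Let the input point-value vector of the current layer be $\hat{\mathbf{a}}=(\hat{\mathbf{a}}_0,\ldots,\hat{\mathbf{a}}_{n-1})$ and the output coefficient vector be $\mathbf{a}=(\mathbf{a}_0,\ldots,\mathbf{a}_{n-1})$. By definition: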

\[\mathbf{a}_j = \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-j(2i+1)}\hat{\mathbf{a}}_i \pmod q\]

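Split the output index $j$ into the even and odd cases. For an even index $j=2k$:

\[\begin{aligned} \mathbf{a}_{2k} &= \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-2k(2i+1)}\hat{\mathbf{a}}_i \\ &= \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-2k}\left(\varphi^{-4ki}\hat{\mathbf{a}}_i + \varphi^{-4k(i+n/2)}\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-2k(2i+1)}\left(\hat{\mathbf{a}}_i + \hat{\mathbf{a}}_{i+n/2}\right) \end{aligned}\]

where the last step uses $\varphi^{-4k(i+n/2)}=\varphi^{-4ki}\varphi^{-2kn}=\varphi^{-4ki}$.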

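For an odd index $j=2k+1$:

\[\begin{aligned} \mathbf{a}_{2k+1} &= \frac{1}{n}\sum_{i=0}^{n-1}\varphi^{-(2k+1)(2i+1)}\hat{\mathbf{a}}_i \\ &= \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-(2k+1)}\left(\varphi^{-2(2k+1)i}\hat{\mathbf{a}}_i + \varphi^{-2(2k+1)(i+n/2)}\hat{\mathbf{a}}_{i+n/2}\right) \\ &= \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-(2k+1)(2i+1)}\left(\hat{\mathbf{a}}_i - \hat{\mathbf{a}}_{i+n/2}\right) \end{aligned}\]

where the last step uses $\varphi^{-2(2k+1)(i+n/2)}=-\varphi^{-2(2k+1)i}$.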

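Now let the root of the subproblems be $\omega=\varphi^2$, and define two new point-value vectors of length $n/2$:

\[\begin{cases} E_i = \frac{1}{2}\left(\hat{\mathbf{a}}_i + \hat{\mathbf{a}}_{i+n/2}\right) \\ O_i = \frac{1}{2}\varphi^{-(2i+1)}\left(\hat{\mathbf{a}}_i - \hat{\mathbf{a}}_{i+n/2}\right) \end{cases} \quad i = 0, \ldots, \frac{n}{2}-1\]

Observe that the two $\textsf{iNTT}^{\omega}$ subproblems of length $n/2$ give, respectively:

\[\textsf{iNTT}_{n/2}^{\omega}(\mathbf{E})_k = \frac{1}{n/2}\sum_{i=0}^{n/2-1}\omega^{-k(2i+1)}E_i = \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-2k(2i+1)}\left(\hat{\mathbf{a}}_i + \hat{\mathbf{a}}_{i+n/2}\right) = \mathbf{a}_{2k}\]

and:

\[\textsf{iNTT}_{n/2}^{\omega}(\mathbf{O})_k = \frac{1}{n/2}\sum_{i=0}^{n/2-1}\omega^{-k(2i+1)}O_i = \frac{1}{n}\sum_{i=0}^{n/2-1}\varphi^{-(2k+1)(2i+1)}\left(\hat{\mathbf{a}}_i - \hat{\mathbf{a}}_{i+n/2}\right) = \mathbf{a}_{2k+1}\]

Therefore the recursion follows directly from the iNTT formula:

\[\begin{cases} \mathbf{a}_{2k} = \textsf{iNTT}_{n/2}^{\omega}(\mathbf{E})_k \\ \mathbf{a}_{2k+1} = \textsf{iNTT}_{n/2}^{\omega}(\mathbf{O})_k \end{cases} \quad k = 0, \ldots, \frac{n}{2}-1\]

The core of the recursion is the computation of the vectors $\mathbf{E}$ and $\mathbf{O}$:

\[\begin{cases} E_j = \frac{1}{2}\left(\hat{\mathbf{a}}_j + \hat{\mathbf{a}}_{j+n/2}\right) \\ O_j = \frac{1}{2\varphi^{2j+1}}\left(\hat{\mathbf{a}}_j - \hat{\mathbf{a}}_{j+n/2}\right) \end{cases}\]

Note that the last step interleaves the results of the two recursive subproblems into the even and odd indices. The recursive version needs no bit-reversal here and no extra multiplication by $n^{-1}$, because the factors $2^{-1}$ of the individual layers accumulate to the normalization factor $n^{-1}$ in the definition of the inverse NTT.

For the iNTT of the positive cyclic convolution the recursive structure is the same: replace $\varphi^{2j+1}$ by $\omega^j$, and use $\omega^2$ as the root of the subproblems: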
\[\begin{cases} E_j = \frac{1}{2}\left(\hat{\mathbf{a}}_j + \hat{\mathbf{a}}_{j+n/2}\right) \\ O_j = \frac{1}{2\omega^j}\left(\hat{\mathbf{a}}_j - \hat{\mathbf{a}}_{j+n/2}\right) \end{cases}\]
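The recursive Gentleman-Sande inverse can be sketched in the same way. The version below handles the negative cyclic case with the twiddle $\varphi^{2j+1}$; the toy parameters ($q = 17$, $n = 8$, $\varphi = 3$) and function names are assumptions for illustration, and the round trip is checked against the naive forward definition.

```python
Q, PHI = 17, 3   # toy parameters (assumptions): 3 has order 16 modulo 17

def ntt_neg_naive(a, phi, q):
    """Reference forward transform: a_hat[j] = sum_i phi^(i*(2j+1)) * a[i] mod q."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

def intt_neg_recursive(a_hat, phi, q):
    """Recursive Gentleman-Sande inverse of the negacyclic NTT.
    Each layer multiplies by 2^-1, so no final n^-1 scaling is needed."""
    n = len(a_hat)
    if n == 1:
        return [a_hat[0] % q]
    half = n // 2
    inv2 = pow(2, -1, q)
    phi_inv = pow(phi, -1, q)
    # Undo the CT butterfly: recover the two half-size point-value vectors.
    E = [(a_hat[j] + a_hat[j + half]) * inv2 % q for j in range(half)]
    O = [(a_hat[j] - a_hat[j + half]) * inv2 * pow(phi_inv, 2 * j + 1, q) % q
         for j in range(half)]
    phi_sq = phi * phi % q
    out = [0] * n
    out[0::2] = intt_neg_recursive(E, phi_sq, q)   # even-indexed coefficients
    out[1::2] = intt_neg_recursive(O, phi_sq, q)   # odd-indexed coefficients
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt_neg_recursive(ntt_neg_naive(a, PHI, Q), PHI, Q) == a
```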

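Non-Recursive Iterative Butterfly Algorithms

The recursive CT/GS algorithms are best suited for understanding where the formulas come from, but practical implementations usually unroll the recursion into an iterative butterfly network. Recursive CT keeps splitting subproblems by the parity of the input index: the first level looks at the lowest bit, the second level at the next bit, and so on up to the highest bit. Consequently, the input order at the bottom of the recursion tree is exactly the bit-reversal permutation of the original indices (bit-reversed order, BO). Let $\operatorname{brv}_{\ell}(i)$ denote the integer obtained by reversing the binary digits of the $\ell=\log_2 n$-bit integer $i$. For example, when $n=8$: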

\[(0,1,2,3,4,5,6,7) \mapsto (0,4,2,6,1,5,3,7)\]

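Written out in three bits, the correspondence is:

$i$	Binary	Reversed	$\operatorname{brv}_3(i)$
0	000	000	0
1	001	100	4
2	010	010	2
3	011	110	6
4	100	001	1
5	101	101	5
6	110	011	3
7	111	111	7

Taking $n = 8$ as an example, suppose the coefficient vector is input in natural order (NO) as $(a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7)$. The resulting output vector is then not in natural order but in BO: $(\hat a_0 \mid \hat a_4 \mid \hat a_2 \mid \hat a_6 \mid \hat a_1 \mid \hat a_5 \mid \hat a_3 \mid \hat a_7)$. The detailed reordering during the CT steps is: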

\[\begin{aligned} &\textbf{Cooley-Tukey：} \text{NO} \to \text{BO} \\[2mm] &(a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7) \\ &\xrightarrow{\text{split by bit }0} (a_0,a_2,a_4,a_6 \mid a_1,a_3,a_5,a_7) \\ &\xrightarrow{\text{split by bit }1} (a_0,a_4 \mid a_2,a_6 \mid a_1,a_5 \mid a_3,a_7) \\ &\xrightarrow{\text{split by bit }2} (a_0 \mid a_4 \mid a_2 \mid a_6 \mid a_1 \mid a_5 \mid a_3 \mid a_7) \\ &\qquad = (\hat{a}_{\operatorname{brv}_3(0)},\hat{a}_{\operatorname{brv}_3(1)},\dots,\hat{a}_{\operatorname{brv}_3(7)}) \end{aligned}\]

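This is exactly the left-to-right order of the leaves once the CT recursion reaches the bottom level. The permutation has order 2 (it is an involution), so applying the same permutation once more turns the BO sequence back into the normal NO sequence. In fact:

If the input is in BO, the CT butterflies produce an output in NO.
If the input is in NO, the CT butterflies produce an output in BO.

Returning to the Gentleman-Sande butterfly, its reordering is, in effect, exactly the same. In short, if we want the result in the normal NO, the output vector must finally be permuted back from BO. With the ordering settled, we continue with the iterative form of the fast NTT.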
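The bit-reversal permutation discussed above is short enough to write out. The helper names `brv` and `bit_reverse_permute` are assumptions made for this sketch; the asserts reproduce the $n = 8$ example and the fact that the permutation is its own inverse.

```python
def brv(i, bits):
    """Reverse the lowest `bits` binary digits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # append the current lowest bit of i to r
        i >>= 1
    return r

def bit_reverse_permute(a):
    """Return a copy of a in bit-reversed order; len(a) must be a power of two."""
    n = len(a)
    bits = n.bit_length() - 1
    return [a[brv(i, bits)] for i in range(n)]

order = bit_reverse_permute(list(range(8)))
assert order == [0, 4, 2, 6, 1, 5, 3, 7]
# Applying the permutation twice restores natural order.
assert bit_reverse_permute(order) == list(range(8))
```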

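Non-Recursive Cooley-Tukey NTT

For the standard NTT of the positive cyclic convolution, let $\omega$ be a primitive $n$-th root of unity. The iterative CT algorithm first rearranges the input coefficient vector into bit-reversed order, then starts merging from blocks of length $m=2$ and doubles the block length at every level until $m=n$. Inside each block of length $m$, the local primitive $m$-th root of unity is: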

\[\omega_m = \omega^{n/m}\]

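For the in-block position $j=0,\ldots,m/2-1$, with $u = a_{\text{start}+j}$ and $v = \omega_m^{j}\, a_{\text{start}+j+m/2}$, the CT butterfly is:

\[\begin{cases} a_{\text{start}+j} \leftarrow u + v \pmod q \\ a_{\text{start}+j+m/2} \leftarrow u - v \pmod q \end{cases}\]

For the negative cyclic $\textsf{NTT}^{\varphi}$, let $\varphi$ be a primitive $2n$-th root of unity. In a local subproblem of length $m$, the corresponding primitive $2m$-th root of unity is: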

\[\varphi_m = \varphi^{n/m}\]

The negative cyclic CT butterfly only needs to replace the twiddle factor $\omega_m^j$ of the standard NTT with an odd power:

\[\varphi_m^{2j+1}\]

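That is, with $u = a_{\text{start}+j}$ and $v = \varphi_m^{2j+1}\, a_{\text{start}+j+m/2}$:

\[\begin{cases} a_{\text{start}+j} \leftarrow u + v \pmod q \\ a_{\text{start}+j+m/2} \leftarrow u - v \pmod q \end{cases}\]

Non-Recursive Gentleman-Sande iNTT

The GS inverse transform can be viewed as running the CT butterfly network backwards. CT merges small blocks into large ones, so GS splits large blocks into small ones. For the iNTT of the standard positive cyclic convolution, inside a block of length $m$ let: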

\[\omega_m = \omega^{n/m}\]

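With $u = a_{\text{start}+j}$ and $v = a_{\text{start}+j+m/2}$ read before the update, the GS butterfly is:

\[\begin{cases} a_{\text{start}+j} \leftarrow 2^{-1}(u+v) \pmod q \\ a_{\text{start}+j+m/2} \leftarrow 2^{-1}(u-v)(\omega_m^{j})^{-1} \pmod q \end{cases}\]

Here the factor $\frac{1}{2}$ is placed in every butterfly layer; since there are $\log_2 n$ layers, the overall scaling is $1/n$. Another common convention omits $\frac{1}{2}$ in each layer and multiplies by $n^{-1}$ once at the end; the two are equivalent. The current code uses the former convention, so no extra multiplication by $n^{-1}$ is needed at the end. After each GS layer the data stays in the grouping order of the recursive split; once all layers have run, the coefficients are in bit-reversed order, so one final bit-reversal is needed to return to natural order.

For the negative cyclic $\textsf{iNTT}^{\varphi}$, again let the local root be: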

\[\varphi_m = \varphi^{n/m}\]

Then replace $\omega_m^j$ by $\varphi_m^{2j+1}$ (with $u = a_{\text{start}+j}$ and $v = a_{\text{start}+j+m/2}$ read before the update):

\[\begin{cases} a_{\text{start}+j} \leftarrow 2^{-1}(u+v) \pmod q \\ a_{\text{start}+j+m/2} \leftarrow 2^{-1}(u-v)(\varphi_m^{2j+1})^{-1} \pmod q \end{cases}\]
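Putting the iterative pieces together, the sketch below implements the negative cyclic CT forward transform (bit-reversed input, blocks growing from $m=2$ to $m=n$) and the GS inverse (blocks shrinking from $m=n$ to $m=2$, bit-reversal at the end). It is an illustration under assumed toy parameters ($q = 17$, $n = 8$, $\varphi = 3$) with hypothetical function names, not the author's implementation, and it recomputes twiddles with `pow` instead of using precomputed tables.

```python
Q, PHI = 17, 3   # toy parameters (assumptions): 3 has order 16 modulo 17

def brv(i, bits):
    """Reverse the lowest `bits` binary digits of i."""
    r = 0
    for _ in range(bits):
        r, i = (r << 1) | (i & 1), i >> 1
    return r

def ntt_neg_iterative(a, phi, q):
    """Iterative Cooley-Tukey negacyclic NTT: natural-order input and output."""
    n, bits = len(a), len(a).bit_length() - 1
    a = [a[brv(i, bits)] % q for i in range(n)]      # bit-reversed copy of the input
    m = 2
    while m <= n:
        phi_m, half = pow(phi, n // m, q), m // 2    # local primitive 2m-th root
        for start in range(0, n, m):
            for j in range(half):
                u = a[start + j]
                v = pow(phi_m, 2 * j + 1, q) * a[start + j + half] % q
                a[start + j], a[start + j + half] = (u + v) % q, (u - v) % q
        m *= 2
    return a

def intt_neg_iterative(a_hat, phi, q):
    """Iterative Gentleman-Sande inverse with 2^-1 folded into every layer."""
    n, bits = len(a_hat), len(a_hat).bit_length() - 1
    a, inv2 = [x % q for x in a_hat], pow(2, -1, q)
    m = n
    while m >= 2:
        phi_m_inv, half = pow(pow(phi, n // m, q), -1, q), m // 2
        for start in range(0, n, m):
            for j in range(half):
                u, v = a[start + j], a[start + j + half]
                a[start + j] = inv2 * (u + v) % q
                a[start + j + half] = inv2 * (u - v) * pow(phi_m_inv, 2 * j + 1, q) % q
        m //= 2
    return [a[brv(i, bits)] for i in range(n)]       # undo the bit-reversed order

def ntt_neg_naive(a, phi, q):
    """Reference definition used only for checking."""
    n = len(a)
    return [sum(pow(phi, i * (2 * j + 1), q) * a[i] for i in range(n)) % q
            for j in range(n)]

a = [3, 1, 4, 1, 5, 9, 2, 6]
a_hat = ntt_neg_iterative(a, PHI, Q)
assert a_hat == ntt_neg_naive(a, PHI, Q)
assert intt_neg_iterative(a_hat, PHI, Q) == a
```

A production implementation would typically precompute the twiddle table and work in place; both are left out here to keep the correspondence with the formulas visible.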

In both CT and GS, every layer touches all $n$ elements and performs $n/2$ butterflies. Since $n=2^k$, there are $\log_2 n$ layers, which gives a total of:

\[\frac{n}{2}\log_2 n\]

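butterfly operations. Each butterfly consists of a constant number of modular additions, subtractions, and multiplications, so the overall arithmetic complexity is: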

\[\mathcal{O}(n\log n)\]

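The bit-reversal permutation costs between $\mathcal{O}(n)$ and $\mathcal{O}(n\log n)$ bit operations, depending on the implementation; in the modular-multiplication cost model of the NTT it usually does not change the dominant complexity. Compared with the recursive implementation, the iterative one avoids the overhead of the call stack and of recursive slicing, and is closer to the butterfly networks found in hardware circuits or constant-time software implementations.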

Parallelizable Memory-Efficient Hash Collision Search

2026-04-15T00:00:00+08:00

tl;dr: This article discusses three generic hash-collision search methods: the birthday-paradox collision algorithm, Pollard’s rho with Floyd cycle detection, and the parallelizable Pollard’s lambda method based on Distinguished Points. These generic methods can be generalized in a similar way to integer factorization and discrete logarithm problems.

Disclaimer: This article is the English counterpart automatically generated from the original Chinese blog by Codex + GPT-5.4. The translation aims to preserve the original meaning, structure, and technical details as faithfully as possible. If there is any ambiguity or inaccuracy, please refer to the original Chinese version.

Parallel Hash Collision Search by Rho Method with Distinguished Points: https://www.cs.csi.cuny.edu/~zhangx/papers/P_2018_LISAT_Weber_Zhang.pdf.
HITCON 2023 challenge Collision: https://github.com/maple3142/My-CTF-Challenges/tree/master/HITCON%20CTF%202023/Collision.

Given a hash function $\mathcal{H}: \{0,1\}^{*} \mapsto \{0,1\}^n$ with output length $n$, how do we find two inputs $x_1, x_2$ such that:

\[\mathcal{H}(x_1) = \mathcal{H}(x_2)\]

The hash collision problem is a fundamental problem in cryptography and appears throughout the entire discipline. Note that both inputs may be chosen freely here, which is why the generic cost is about $2^{n/2}$; this differs from a second-preimage attack, where $x_1$ is fixed in advance. The generic hash-collision algorithms discussed in this article can be divided into the following three categories:

Algorithm	Time Complexity	Space Complexity	Parallelism
Birthday-paradox collision search	$\mathcal{O}(2^{n/2})$	$\mathcal{O}(2^{n/2})$	Parallelizable, but memory-intensive
Pollard’s rho	$\mathcal{O}(2^{n/2})$	$\mathcal{O}(1)$	No linear speed-up in parallel
Pollard’s lambda	$\mathcal{O}(2^{n/2})$	$\mathcal{O}(2^{n/2-k})$ (trade-off via the distinguished-point parameter $k$)	Parallelizable, often close to linear speed-up

Birthday-Paradox Collision Search

The classical birthday paradox. A well-known question is: in a year with 365 days, how many people are needed so that the probability that at least two people share the same birthday exceeds 50%? Under a fully random model, the answer is 23, which is much smaller than intuition suggests.

Consider the generalized version: given $k$ people, what is the probability that at least two of them share the same birthday? When $k > 365$, this probability is 1 by the pigeonhole principle. More generally, given a set of size $N$, such as the output space of a hash function, we randomly sample $k \le N$ values from the set with replacement. Let the probability that at least two sampled values are equal be denoted by $\Pr\left(\text{coll}\right)$. Let $\Pr\left(z=0\right)$ denote the probability that all sampled values are distinct. Then $\Pr\left(\text{coll}\right) = 1 - \Pr\left(z=0\right)$, where:

\[\Pr\left(z=0\right) = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdots \frac{N-k+1}{N}\]

Hence the probability that two values coincide, i.e. that a collision occurs, is:

\[\Pr\left(\text{coll}\right) = 1 - \Pr\left(z=0\right)\]

For the birthday-paradox problem, this probability exceeds 50% as soon as $k \ge 23$. This is much smaller than most people would expect. More generally, when $k$ is small relative to $N$, we may use the approximation:

\[\Pr\left(\text{coll}\right) = 1 - \Pr\left(z=0\right) \approx 1 - e^{-\frac{k^2}{2N}} > 0.5 \\ \implies e^{-\frac{k^2}{2N}} \approx 0.5 \implies k \approx \sqrt{2N \ln(2)}\]

For a hash function with output bit-length $n$, we obtain

\[k \approx 1.177 \cdot 2^{n/2}\]

This means that, by the birthday paradox, computing $\mathcal{O}(2^{n/2})$ random hash values already gives a high probability of finding a collision.
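The numbers above are easy to reproduce. The following sketch computes the exact collision probability from the product formula and the approximation $k \approx \sqrt{2N\ln 2}$; the function names are assumptions made for this example.

```python
import math

def collision_probability(k, N):
    """Exact probability that k uniform samples from N values are not all distinct."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (N - i) / N
    return 1.0 - p_distinct

def samples_for_half(N):
    """Approximate k at which the collision probability reaches 50%."""
    return math.sqrt(2 * N * math.log(2))

# Classical birthday paradox: 23 people are the first to exceed 50%.
assert collision_probability(22, 365) < 0.5 < collision_probability(23, 365)
# The constant in k ~ 1.177 * 2^(n/2) is sqrt(2 ln 2).
assert abs(math.sqrt(2 * math.log(2)) - 1.177) < 1e-3
# For a 32-bit output the estimate is about 1.177 * 2^16 hash values.
assert abs(samples_for_half(2**32) / 2**16 - 1.177) < 1e-3
```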

Initialize a dictionary with $O(1)$ lookup time, where the key is a hash value and the value is the corresponding preimage.
Randomly generate preimage-hash pairs $\{x, \mathcal{H}(x)\}$ and insert them into the dictionary until a key collision occurs.

By the birthday paradox, this probabilistic algorithm terminates after $\mathcal{O}(2^{n/2})$ hash evaluations, and its space complexity is $\mathcal{O}(2^{n/2})$.
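The two steps above can be sketched as follows. To keep the running time small, the example assumes a toy target: SHA-256 truncated to 3 bytes ($n = 24$). Inputs are taken from a counter instead of a random source so that the run is reproducible; both choices, and the function names, are assumptions for this illustration.

```python
import hashlib

def H(data: bytes, nbytes: int = 3) -> bytes:
    """Toy target (assumption): SHA-256 truncated to `nbytes` bytes."""
    return hashlib.sha256(data).digest()[:nbytes]

def birthday_collision(nbytes: int = 3):
    """Insert (hash -> preimage) pairs into a dictionary until a key repeats."""
    seen = {}
    counter = 0
    while True:
        x = counter.to_bytes(8, "big")     # distinct inputs, so any hit is a real collision
        h = H(x, nbytes)
        if h in seen:
            return seen[h], x, len(seen) + 1
        seen[h] = x
        counter += 1

x1, x2, evaluations = birthday_collision()
assert x1 != x2 and H(x1) == H(x2)
```

The dictionary grows by one entry per hash evaluation, which is exactly the $\mathcal{O}(2^{n/2})$ memory cost that the later methods avoid.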

Pollard’s rho

Pollard’s rho method was originally developed as an integer-factorization algorithm. Its core intuition also comes from the birthday paradox. Since the generated sequence resembles the Greek letter $\rho$, the method is called rho.

Pollard’s rho for Integer Factorization

Integer factorization problem. Given a composite integer $n = p \cdot q$, how do we recover a non-trivial factor $p$?

For Pollard’s rho factorization algorithm, the key idea is to define a function $g(x)$ that generates a pseudorandom sequence. For example, one may choose the polynomial $g(x) = x^2 + 1 \bmod n$. This generates the following finite sequence

\[\left\{x_0, g(x_0), \cdots, g^k(x_0), \cdots \right\}\]

where $g^k$ denotes repeated composition, and we write $x_k = g^k(x_0) \in \mathbb{Z}_n$. However, from the viewpoint modulo $p$, the same sequence implicitly defines a second, hidden sequence:

\[\left\{x_0, g(x_0), \cdots, g^k(x_0), \cdots \right\} \bmod p\]

that is, the sequence $\left\{x_k \bmod p\right\}$. If the chosen $g(x)$ behaves randomly enough, then by the birthday paradox we expect a collision after about $\mathcal{O}(\sqrt p)$ steps. This is illustrated by the point $l_0$ in the figure below:

Figure 1. Pollard's method

If the sequence values in Figure 1 are interpreted modulo $p$, such a collision means that we have found

\[g(x_{l_0- 1}) = g(x_{l_0 + n}) \bmod p\]

Since in practice we only see the sequence modulo $n$, there is overwhelmingly high probability that

\[g(x_{l_0- 1}) \ne g(x_{l_0 + n}) \bmod n\]

and therefore

\[\gcd\left(g(x_{l_0- 1}) - g(x_{l_0 + n}), n\right) = p\]

reveals a factor of $n$. However, during the sequence computation we cannot directly detect which values have collided; comparing against the whole previous sequence via repeated $\gcd$ computations would be prohibitively expensive in both time and space. Therefore we need an efficient cycle-detection algorithm to assist Pollard’s rho.

Pollard’s rho is often combined with Floyd’s algorithm, which is vividly described as the tortoise and hare algorithm.

Start both sequences from the same initial point $x_0$. Let the slow sequence $\{x^{(T)}_{i}\}$ use the update rule $f_1(x) = g(x)$, and let the fast sequence $\{x^{(H)}_{i}\}$ use $f_2(x) = g(g(x))= g^2(x)$. We iteratively compute these sequences while storing only the current values $x_k^{(T)}, x_{k}^{(H)}$.
When $l_0 < n$, after roughly $n$ iterations (about one cycle length) we reach an index $m$ with $x_m^{(T)} = x_{m}^{(H)} \bmod p$, because $x_{m} = x_{2m} \bmod p$. Hence, while computing the two sequences, Floyd’s algorithm repeatedly evaluates $\gcd\left(x_k^{(T)} - x_{k}^{(H)}, n\right)$, and as soon as this common divisor becomes non-trivial, we recover a prime factor $p$.

For example, if the Floyd meeting point in Figure 1 occurs at the $i$-th node (in fact $i = m$), then the two values are congruent modulo $p$ at that point, but with high probability not congruent modulo $n$. Thus $\gcd\left(x_i^{(T)} - x_{i}^{(H)}, n\right)$ also yields $p$.

As for time complexity, the expected sequence length is $l_0 + n \approx \mathcal{O}(\sqrt p)$. Since the slow sequence meets the fast sequence before traversing the entire $\rho$-shaped structure, the overall time complexity is $\mathcal{O}(\sqrt p)$ and the space complexity is $\mathcal{O}(1)$.

A simple implementation of Pollard’s rho is shown below:

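# sage
def rho(n):
    # Pollard's rho with Floyd cycle detection, iterating g(x) = x^2 + c mod n
    c = int(10)
    a0 = int(1)
    a1 = a0^2+c      # tortoise: one application of g
    a2 = a1^2+c      # hare: two applications of g
    while gcd(n,a2-a1) == 1:
        a1 = (a1^2+c) % n      # tortoise advances one step
        a2 = (a2^2+c) % n      # hare advances two steps
        a2 = (a2^2+c) % n
    g = gcd(n,a2-a1)
    # g == n means the two sequences met modulo n itself; retry with another c or a0
    return [g,n//g]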

Readers may wonder about the special role of the collision point $l_0$. Let $a= x_{l_0- 1},\ b= x_{l_0 + n},\ c = x_{l_0}$. In the factorization setting, where the pseudorandom sequence uses $f(x) = x^2 + 1$, the collision at $l_0$ means that we have found two distinct values $a,b$ such that $f(a) = f(b) = c$. In other words, $a,b$ are two distinct solutions of

\[x^2 = c - 1 \bmod p\]

and thus they are the two square roots of the quadratic residue $c - 1$ in $\mathbb{Z}_p$, satisfying $a + b = 0 \bmod p$.

In the integer-factorization setting, we care about recovering the hidden modulus $p$, so the rho collision point and the Floyd meeting point are effectively equivalent. But once we move to the hash-collision setting, the meanings of these two points diverge sharply. The hash-collision value is precisely the value at the collision point $l_0$.

Pollard’s rho for Hash Collisions

Figure 1. Pollard's method

Now move to the hash-collision setting. The pseudorandom sequence is generated by a hash function $\mathcal{H}: \{0,1\}^{*} \mapsto \{0,1\}^n$, or by a composed map $\mathcal{H}^{+} = \mathcal{H} \circ \mathcal{R}$. For simplicity, let the initial value be $x_0$, and denote the update rule by $x_{i+1} = \mathcal{H}(x_i)$. In Figure 1, the cycle contains $n+1$ points; let $N = n + 1$. In the index expressions below, $n$ is this label from Figure 1, whereas in $2^{n/2}$ it still denotes the hash output length.

Again, the pseudorandom sequence $\{x_k\}$ collides after about $k = \mathcal{O}(2^{n/2})$ steps, after which it enters a cycle. We use Floyd’s cycle-detection algorithm. Assume that the fast and slow sequences meet at point $i$. At that moment, the slow sequence must still lie before the end of the first cycle traversal, so the number of sequence computations satisfies $i \le l_0 + n$. The fast sequence is ahead of the slow one by a whole number of cycle lengths, so:

\[2i - i = kN \implies i = k(n + 1) = kN\]

Since $i$ is the first such multiple with $i \ge l_0$, it follows that $k = \lceil \frac{l_0}{N} \rceil$. At this point, the two sequences meet at node $i$, but this is not necessarily the collision point itself. We therefore want to continue until reaching $l_0$. A useful observation is that the distances $0 \rightarrow l_0$ and $i \rightarrow l_0$ are equal modulo $N = n + 1$. Indeed:

\[\left\{ \begin{aligned} d_1 &= l_0 + 1 + n - i \\ d_2 &= l_0 \end{aligned} \right.\]

Thus

\[\begin{aligned} d_1 & = l_0 + n + 1 - i \bmod N \\ &= l_0 - kN \bmod N \\ &= l_0 \bmod N \\ &= d_2 \bmod N \end{aligned}\]

Starting from point $i$, the subsequent point sequence lies on a cycle of length $N$. Therefore $0 \rightarrow l_0$ and $i \rightarrow l_0$ both reach $l_0$ in exactly $l_0$ slow steps. This lets us recover the two points $x_{l_0 - 1}$ and $x_{l_0 + n}$ that collide under the hash, with collision value $x_{l_0}$.

Time-complexity analysis: once the meeting occurs, we keep the slow sequence fixed, return the fast sequence to the initial point $0$, and then lower it to slow speed. After $l_0$ additional steps, both sequences arrive at $l_0$ and the hash collision is found. Hence the total number of hash evaluations is:

\[T = 3i + 2l_0, \quad i = \left\lceil \frac{l_0}{N} \right\rceil N\]

By the birthday paradox, we know that $l_0 + n \approx \mathcal{O}(2^{n/2})$. Therefore the overall time complexity is upper-bounded by $\mathcal{O}(5 \cdot 2^{n/2})$. Since we only need to maintain three pieces of state — the initial point, one slow-sequence node, and one fast-sequence node — the space complexity is $\mathcal{O}(1)$.
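The two phases described above (Floyd meeting point, then walking from the start and from the meeting point at slow speed) can be sketched as follows. The target is again an assumed toy function, SHA-256 truncated to 3 bytes, and the names `f` and `rho_collision` are hypothetical.

```python
import hashlib

NBYTES = 3   # toy output length (assumption): 24-bit truncated SHA-256

def f(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:NBYTES]

def rho_collision(x0: bytes):
    """Find x1 != x2 with f(x1) == f(x2) using constant memory."""
    # Phase 1: Floyd cycle detection, tortoise at x_i and hare at x_{2i}.
    tortoise, hare = f(x0), f(f(x0))
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(f(hare))
    # Phase 2: restart one pointer at x0 and move both one step at a time.
    tortoise = x0
    if tortoise == hare:
        return None          # x0 lies on the cycle (l_0 = 0): there is no tail to collide with
    while True:
        next_t, next_h = f(tortoise), f(hare)
        if next_t == next_h:
            return tortoise, hare     # x_{l_0 - 1} and its partner on the cycle
        tortoise, hare = next_t, next_h

# A 5-byte start value can never equal a 3-byte output, so the tail is non-empty.
x1, x2 = rho_collision(b"start")
assert x1 != x2 and f(x1) == f(x2)
```

Only the start value and the two moving pointers are stored, matching the $\mathcal{O}(1)$ space bound.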

Floyd’s algorithm is an efficient cycle-detection algorithm. Moreover, once the meeting point is known, it can quickly locate the actual collision point. This is why it is widely used across many cryptographic algorithms.

Pollard’s lambda

Although Pollard’s rho for hash collisions reaches the birthday-paradox bound and uses only constant memory, it does not admit linear speed-up under parallelization. On the other hand, the naive birthday-paradox method has enormous memory overhead in parallel and still does not behave well with respect to linear acceleration. So is there an algorithm that parallelizes nearly linearly while keeping memory usage low? Quisquater and Delescaille answered this question in the context of DES collision search by introducing Distinguished Points.

Distinguished-Point Collision Search

A Distinguished Point (DP) is selected by some conspicuous and easy-to-test property. In the hash-collision setting, we usually define a distinguished point as a hash value whose first $k$ bits are all zero. That is, any hash value of the form $\underbrace{00\cdots0}_{k} x_{k+1}\cdots x_{n}$ is called a distinguished point.

The DP collision algorithm then proceeds as follows, with distinguished-point parameter $k$ fixed in advance:

Randomly choose a start point $S_i$, compute the hash sequence until a distinguished point $D_i$ is reached, and store the DP chain $(S_i, D_i, L_i)$, where $L_i$ is the chain length.
Repeatedly choose different start points and generate such DP chains until two chains end at the same distinguished point $D_i = D_j$.
For two colliding chains $(S_i, D_i, L_i), (S_j, D_j, L_j)$, first advance the longer chain until the two remaining lengths match, then advance both chains together and test whether a real hash collision appears. If no collision is found, discard the shorter chain and return to step 1.
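The three steps above can be sketched sequentially as follows. This is a single-threaded illustration on an assumed toy target (SHA-256 truncated to 3 bytes, distinguished points with 6 leading zero bits); the names, the chain-length cap, and the counter-derived start points are assumptions for this example and are not taken from the HITCON implementation.

```python
import hashlib

NBYTES, DP_BITS = 3, 6        # toy sizes (assumptions): n = 24 bits, k = 6 leading zero bits
MAX_LEN = 20 * 2**DP_BITS     # give up on chains that appear stuck in a cycle without a DP

def f(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:NBYTES]

def is_distinguished(x: bytes) -> bool:
    return int.from_bytes(x, "big") >> (8 * NBYTES - DP_BITS) == 0

def build_chain(start: bytes):
    """Step 1: iterate f from start until a distinguished point; return (dp, length)."""
    x, length = start, 0
    while not is_distinguished(x):
        x, length = f(x), length + 1
        if length > MAX_LEN:
            return None
    return x, length

def locate_collision(s1, l1, s2, l2):
    """Step 3: align the remaining lengths, then walk both chains together."""
    while l1 > l2:
        s1, l1 = f(s1), l1 - 1
    while l2 > l1:
        s2, l2 = f(s2), l2 - 1
    if s1 == s2:
        return None           # Robin Hood case: one chain is contained in the other
    while True:
        n1, n2 = f(s1), f(s2)
        if n1 == n2:
            return s1, s2
        s1, s2 = n1, n2

def dp_collision_search():
    table, counter = {}, 0    # table maps DP -> (start point, chain length)
    while True:
        start = f(counter.to_bytes(8, "big"))
        counter += 1
        chain = build_chain(start)
        if chain is None:
            continue
        dp, length = chain
        if dp in table:       # Step 2: two chains end at the same distinguished point
            found = locate_collision(start, length, *table[dp])
            if found:
                return found
            if length > table[dp][1]:
                table[dp] = (start, length)   # discard the shorter chain
        else:
            table[dp] = (start, length)

x1, x2 = dp_collision_search()
assert x1 != x2 and f(x1) == f(x2)
```

In a parallel setting, each worker would run `build_chain` independently and only the `(start, dp, length)` triples would be sent to a shared table, which is what keeps the communication and memory cost low.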

Figure 2. Distinguished Points Lead to Collision

Figure 2 illustrates a collision structure arising in DP-based search. There, $\mathcal{H}(x_1) = \mathcal{H}(x_2) = x_c$. The two chains share the same distinguished point but originate from different start points, which is what makes the collision possible. When the algorithm detects that the two chains in Figure 2 end at the same DP, the SP1 chain is longer than the SP2 chain by one step. Thus SP1 first performs one hash evaluation, after which SP1 and SP2 are advanced simultaneously, and the collision is then detected at $x_1, x_2$.

If, after advancing SP1, it overlaps entirely with the SP2 chain, then this is only a pseudo-collision and the shorter chain is discarded. This situation is called the Robinhood Case, shown in Figure 3:

Figure 3. Robinhood Case

The Distinguished-Point collision algorithm is more widely known as Pollard’s lambda algorithm. The name comes from the shape of DP-chain collisions, which resembles the Greek letter $\lambda$, as in Figure 2. Pollard’s lambda also applies to discrete logarithm computation, and is a general, efficient, and parallelizable algorithm for that problem as well.

Time-Space Trade-off

The time-space complexity of the Distinguished-Point collision algorithm depends heavily on the distinguished-point difficulty parameter. This notion is analogous to the difficulty parameter used in Bitcoin mining. Let the difficulty parameter be $k$, meaning that the hash must begin with $k$ leading zeros.

Analysis of the Distinguished-Point collision algorithm. The overall complexity can be decomposed into three phases: generating DP chains, obtaining a DP-chain collision, and recovering the actual hash collision.

Generating DP chains: finding a DP chain is effectively a preimage search process, whose time complexity is $\mathcal{O}(2^k)$.
DP-chain collision: if we isolate the second phase, we are effectively looking for a second-preimage-style collision among DP chains. By the birthday paradox, the number of DP chains needed is $\mathcal{O}(2^{(n-k)/2})$, and the corresponding space complexity is also $\mathcal{O}(2^{(n-k)/2})$. However, this is not yet a hash collision, because it is a collision between two chains rather than between two points. If we analyze the process directly in terms of point collisions, then as soon as we have $2^{n/2}$ points, a collision becomes likely. In the DP-chain view, this implies identical distinguished points. Therefore, the number of DP chains needed in the second phase is $\mathcal{O}(\frac{2^{n/2}}{2^{k}}) = \mathcal{O}(2^{n/2 - k})$, and the space complexity is likewise $\mathcal{O}(2^{n/2 - k})$.
Recovering the actual hash collision: once two DP chains collide, locating the real hash-collision position costs $\mathcal{O}(2^k)$.

Putting these together, the time and space complexity of the Distinguished-Point collision algorithm are:

Time complexity: $\mathcal{O}(2^{n/2} + 2^k) = \mathcal{O}(2^{n/2})$
Space complexity: $\mathcal{O}(2^{n/2 - k})$

This is the idealized analysis, ignoring exceptional situations such as the Robinhood Case. In practice, if $k$ is too small, the space complexity becomes large. If $k$ is too large, pseudo-collisions of the Robinhood type occur frequently, which increases the running time. Therefore the choice of difficulty parameter $k$ is crucial for the Distinguished-Point method.
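For a quick feel of the trade-off, the idealized estimates above can be tabulated for different difficulty parameters. The helper below simply evaluates the exponents from the analysis (constants and Robinhood-type pseudo-collisions are ignored); its name and return format are assumptions for this sketch.

```python
def dp_tradeoff(n_bits, k_bits):
    """Return log2 of (total hash evaluations, expected chain length, stored chains)
    under the idealized analysis: 2^(n/2) points, chains of length 2^k."""
    total_hashes_log2 = n_bits / 2
    chain_length_log2 = k_bits
    stored_chains_log2 = n_bits / 2 - k_bits
    return total_hashes_log2, chain_length_log2, stored_chains_log2

# n = 64 with k = 24: about 2^32 hash evaluations and about 2^8 stored chains.
assert dp_tradeoff(64, 24) == (32.0, 24, 8.0)

for k in (8, 16, 24):
    time_log2, chain_log2, space_log2 = dp_tradeoff(64, k)
    print(f"k={k:2d}: time ~2^{time_log2:.0f}, chain length ~2^{chain_log2}, "
          f"stored chains ~2^{space_log2:.0f}")
```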

It is worth emphasizing that with a suitable choice of $k$, the Distinguished-Point algorithm can keep the time complexity close to $2^{n/2}$, avoid severe memory pressure, and still maintain essentially linear speed-up on multi-core hardware. For example, when $n = 64$ and we choose $k = 24$, the time complexity is $\mathcal{O}(2^{32})$ and the memory complexity is $\mathcal{O}(2^{8})$, which makes parallel linear acceleration practical. Below are the author’s experimental results for finding collisions on the lower 64 bits of SHA-256:

4 cores (with PRNG seed 0x123456789abcdef0)

Two DP chains collided with dp mask=ffffff
Number of chains find: 393
diff = 11897207
Looking for collision...
Collision found! with 4 cores
333412288b678e3b ff7cb8a664c810e3
962860fc377014f1 962860fc377014f1
  
real    1m17.441s
user    5m9.708s
sys     0m0.020s

8 cores (with PRNG seed 0x123456789abcdef0)

Two DP chains collided with dp mask=ffffff
Number of chains find: 409
diff = 11897207
Looking for collision...
Collision found! with 8 cores
333412288b678e3b ff7cb8a664c810e3
962860fc377014f1 962860fc377014f1
  
real    0m45.683s
user    6m5.344s
sys     0m0.011s

The above experiments use an efficient C++ implementation of the DP collision algorithm adapted from the HITCON 2023 Collision challenge.

These results are broadly consistent with linear acceleration. Theoretically, the expected number of DP chains is $2^8 = 256$, while the observed value is around 400. This is because the birthday-paradox estimate $\mathcal{O}(1.177 \cdot 2^{n/2})$ corresponds to the point where the collision probability is just slightly above 50%.

Parallelizable Memory-Efficient Hash Collision Search

2026-04-15T00:00:00+08:00

tl;dr: This article discusses three generic hash-collision search methods: the birthday-paradox collision algorithm, Pollard’s rho with Floyd cycle detection, and the parallelizable Pollard’s lambda method based on Distinguished Points. These generic methods can be generalized in a similar way to integer factorization and discrete logarithm problems.

Parallel Hash Collision Search by Rho Method with Distinguished Points: https://www.cs.csi.cuny.edu/~zhangx/papers/P_2018_LISAT_Weber_Zhang.pdf.
HITCON 2023 challenge Collision: https://github.com/maple3142/My-CTF-Challenges/tree/master/HITCON%20CTF%202023/Collision.

Given a hash function $\mathcal{H}: \{0,1\}^{*} \mapsto \{0,1\}^n$ with output length $n$, how do we find two inputs $x_1, x_2$ such that:

\[\mathcal{H}(x_1) = \mathcal{H}(x_2)\]

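The hash collision problem is a fundamental problem in cryptography and appears throughout the entire discipline. Note that both inputs may be chosen freely here, which is why the generic cost is about $2^{n/2}$; this differs from a second-preimage attack, where $x_1$ is fixed in advance. The generic hash-collision algorithms discussed in this article can be divided into the following three categories:

Algorithm	Time Complexity	Space Complexity	Parallelism
Birthday-paradox collision search	$\mathcal{O}(2^{n/2})$	$\mathcal{O}(2^{n/2})$	Parallelizable, but memory-intensive
Pollard’s rho	$\mathcal{O}(2^{n/2})$	$\mathcal{O}(1)$	No linear speed-up in parallel
Pollard’s lambda	$\mathcal{O}(2^{n/2})$	$\mathcal{O}(2^{n/2-k})$ (trade-off via the distinguished-point parameter $k$)	Parallelizable, often close to linear speed-up

Birthday-Paradox Collision Search

The classical birthday paradox. A well-known question is: in a year with 365 days, how many people are needed so that the probability that at least two people share the same birthday exceeds 50%? Under a fully random model, the answer is 23, which is much smaller than intuition suggests.

Consider the generalized version: given $k$ people, what is the probability that at least two of them share the same birthday? When $k > 365$, this probability is 1 by the pigeonhole principle. More generally, given a set of size $N$, such as the output space of a hash function, we randomly sample $k \le N$ values from the set with replacement. Let the probability that at least two sampled values are equal be denoted by $\Pr\left(\text{coll}\right)$. Let $\Pr\left(z=0\right)$ denote the probability that all sampled values are distinct. Then $\Pr\left(\text{coll}\right) = 1 - \Pr\left(z=0\right)$, where: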

\[\Pr\left(z=0\right) = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdots \frac{N-k+1}{N}\]

Hence the probability that two values coincide, i.e. that a collision occurs, is:

\[\Pr\left(\text{coll}\right) = 1 - \Pr\left(z=0\right)\]

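For the birthday-paradox problem, this probability exceeds 50% as soon as $k \ge 23$. This is much smaller than most people would expect. More generally, when $k$ is small relative to $N$, we may use the approximation: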

\[\Pr\left(\text{coll}\right) = 1 - \Pr\left(z=0\right) \approx 1 - e^{-\frac{k^2}{2N}} > 0.5 \\ \implies e^{-\frac{k^2}{2N}} \approx 0.5 \implies k \approx \sqrt{2N \ln(2)}\]

For a hash function with output bit-length $n$, we obtain

\[k \approx 1.177 \cdot 2^{n/2}\]

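This means that, by the birthday paradox, computing $\mathcal{O}(2^{n/2})$ random hash values already gives a high probability of finding a collision.

Initialize a dictionary with $O(1)$ lookup time, where the key is a hash value and the value is the corresponding preimage.
Randomly generate preimage-hash pairs $\{x, \mathcal{H}(x)\}$ and insert them into the dictionary until a key collision occurs.

By the birthday paradox, this probabilistic algorithm terminates after $\mathcal{O}(2^{n/2})$ hash evaluations, and its space complexity is $\mathcal{O}(2^{n/2})$.

Pollard’s rho

Pollard’s rho method was originally developed as an integer-factorization algorithm. Its core principle is also the birthday paradox. Since the generated sequence resembles the Greek letter $\rho$, the method is called rho.

Pollard’s rho for Integer Factorization

Integer factorization problem. Given a composite integer $n = p \cdot q$, how do we recover a non-trivial factor $p$?

For Pollard’s rho factorization algorithm, the key idea is to define a function $g(x)$ that generates a pseudorandom sequence. For example, one may choose the polynomial $g(x) = x^2 + 1 \bmod n$. This generates the following finite sequence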

\[\left\{x_0, g(x_0), \cdots, g^k(x_0), \cdots \right\}\]

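where $g^k$ denotes repeated composition, and we write $x_k = g^k(x_0) \in \mathbb{Z}_n$. However, from the viewpoint modulo $p$, the same sequence implicitly defines a second, hidden sequence: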

\[\left\{x_0, g(x_0), \cdots, g^k(x_0), \cdots \right\} \bmod p\]

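that is, the sequence $\left\{x_k \bmod p\right\}$. If the chosen $g(x)$ behaves randomly enough, then by the birthday paradox we expect a collision after about $\mathcal{O}(\sqrt p)$ steps. This is illustrated by the point $l_0$ in the figure below:

Figure 1. Pollard's method

If the sequence values in Figure 1 are interpreted modulo $p$, such a collision means that we have found $g(x_{l_0- 1}) = g(x_{l_0 + n}) \bmod p$. Since in practice we only see the sequence modulo $n$, there is overwhelmingly high probability that $g(x_{l_0- 1}) \ne g(x_{l_0 + n}) \bmod n$, and therefore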

\[\gcd\left(g(x_{l_0- 1}) - g(x_{l_0 + n}), n\right) = p\]

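reveals a factor of $n$. However, during the sequence computation we cannot directly detect which values have collided; comparing against the whole previous sequence via repeated $\gcd$ computations would be prohibitively expensive in both time and space. Therefore we need an efficient cycle-detection algorithm to assist Pollard’s rho.

Pollard’s rho is often combined with Floyd’s algorithm, which is vividly described as the tortoise and hare algorithm.

Start both sequences from the same initial point $x_0$. Let the slow sequence $\{x^{(T)}_{i}\}$ use the update rule $f_1(x) = g(x)$, and let the fast sequence $\{x^{(H)}_{i}\}$ use $f_2(x) = g(g(x))= g^2(x)$. We iteratively compute these sequences while storing only the current values $x_k^{(T)}, x_{k}^{(H)}$.
When $l_0 < n$, after roughly $n$ iterations (about one cycle length) we reach an index $m$ with $x_m^{(T)} = x_{m}^{(H)} \bmod p$, because $x_{m} = x_{2m} \bmod p$. Hence, while computing the two sequences, Floyd’s algorithm repeatedly evaluates $\gcd\left(x_k^{(T)} - x_{k}^{(H)}, n\right)$, and as soon as this common divisor becomes non-trivial, we recover a prime factor $p$.

For example, if the Floyd meeting point in Figure 1 occurs at the $i$-th node (in fact $i = m$), then the two values are congruent modulo $p$ at that point, but with high probability not congruent modulo $n$. Thus $\gcd\left(x_i^{(T)} - x_{i}^{(H)}, n\right)$ also yields $p$.

As for time complexity, the expected sequence length is $l_0 + n \approx \mathcal{O}(\sqrt p)$. Since the slow sequence meets the fast sequence before traversing the entire $\rho$-shaped structure, the overall time complexity is $\mathcal{O}(\sqrt p)$ and the space complexity is $\mathcal{O}(1)$.

A simple implementation of Pollard’s rho is shown below: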

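# sage
def rho(n):
    # Pollard's rho with Floyd cycle detection, iterating g(x) = x^2 + c mod n
    c = int(10)
    a0 = int(1)
    a1 = a0^2+c      # tortoise: one application of g
    a2 = a1^2+c      # hare: two applications of g
    while gcd(n,a2-a1) == 1:
        a1 = (a1^2+c) % n      # tortoise advances one step
        a2 = (a2^2+c) % n      # hare advances two steps
        a2 = (a2^2+c) % n
    g = gcd(n,a2-a1)
    # g == n means the two sequences met modulo n itself; retry with another c or a0
    return [g,n//g]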

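Readers may wonder about the special role of the collision point $l_0$. Let $a= x_{l_0- 1},\ b= x_{l_0 + n},\ c = x_{l_0}$. In the factorization setting, where the pseudorandom sequence uses $f(x) = x^2 + 1$, the collision at $l_0$ means that we have found two distinct values $a,b$ such that $f(a) = f(b) = c$. In other words, $a,b$ are two distinct solutions of

\[x^2 = c - 1 \bmod p\]

and thus they are the two square roots of the quadratic residue $c - 1$ in $\mathbb{Z}_p$, satisfying $a + b = 0 \bmod p$.

In the integer-factorization setting, we care about recovering the hidden modulus $p$, so the rho collision point and the Floyd meeting point are effectively equivalent. But once we move to the hash-collision setting, the meanings of these two points diverge sharply. The hash-collision value is precisely the value at the collision point $l_0$.

Pollard’s rho for Hash Collisions

Figure 1. Pollard's method

Now move to the hash-collision setting. The pseudorandom sequence is generated by a hash function $\mathcal{H}: \{0,1\}^{*} \mapsto \{0,1\}^n$, or by a composed map $\mathcal{H}^{+} = \mathcal{H} \circ \mathcal{R}$. For simplicity, let the initial value be $x_0$, and denote the update rule by $x_{i+1} = \mathcal{H}(x_i)$. In Figure 1, the cycle contains $n+1$ points; let $N = n + 1$. In the index expressions below, $n$ is this label from Figure 1, whereas in $2^{n/2}$ it still denotes the hash output length.

Again, the pseudorandom sequence $\{x_k\}$ collides after about $k = \mathcal{O}(2^{n/2})$ steps, after which it enters a cycle. We use Floyd’s cycle-detection algorithm. Assume that the fast and slow sequences meet at point $i$. At that moment, the slow sequence must still lie before the end of the first cycle traversal, so the number of sequence computations satisfies $i \le l_0 + n$. The fast sequence is ahead of the slow one by a whole number of cycle lengths, so: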

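\[2i - i = kN \implies i = k(n + 1) = kN\]

Since $i$ is the first such multiple with $i \ge l_0$, it follows that $k = \lceil \frac{l_0}{N} \rceil$. At this point, the two sequences meet at node $i$, but this is not necessarily the collision point itself. We therefore want to continue until reaching $l_0$. A useful observation is that the distances $0 \rightarrow l_0$ and $i \rightarrow l_0$ are equal modulo $N = n + 1$. Indeed: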

\[\left\{ \begin{aligned} d_1 &= l_0 + 1 + n - i \\ d_2 &= l_0 \end{aligned} \right.\]

Hence

\[\begin{aligned} d_1 & = l_0 + n + 1 - i \bmod N \\ &= l_0 - kN \bmod N \\ &= l_0 \bmod N \\ &= d_2 \bmod N \end{aligned}\]

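Starting from point $i$, the subsequent points form a cycle of length $N$. Therefore the walks $0 \rightarrow l_0$ and $i \rightarrow l_0$ (both at slow speed) reach point $l_0$ after the same number of steps, $l_0$, which reveals that the two points $x_{l_0 - 1}$ and $x_{l_0 + n}$ collide under the hash, with common hash value $x_{l_0}$.

Time complexity: after the two sequences meet, we keep the slow sequence where it is, send the fast sequence back to the initial point $0$ and reduce its speed to slow. After $l_0$ steps both arrive at point $l_0$ and the hash collision is found. The total number of hash evaluations is therefore: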

\[T = 3i + 2l_0, \quad i = \left\lceil \frac{l_0}{n+1} \right\rceil (n+1)\]

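By the birthday paradox, $l_0 + n \approx \mathcal{O}(2^{n/2})$, so the overall running time is at most about $5 \cdot 2^{n/2}$ hash evaluations, i.e. $\mathcal{O}(2^{n/2})$. Only three points need to be stored (the starting point, the current node of the slow sequence and the current node of the fast sequence), so the space complexity is $\mathcal{O}(1)$.

Floyd's algorithm is an effective cycle-detection algorithm, and it can quickly move from the meeting point to the collision point; it is widely used in cryptographic algorithms.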
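The two phases described above (find the meeting point, then walk to the collision point) can be sketched in Python on a deliberately weakened hash. Everything specific here is an assumption for illustration: the 24-bit truncation `NBYTES = 3`, the helper names, and the way start values are derived from small integers.

```python
import hashlib

NBYTES = 3  # toy parameter (assumed): keep only 24 bits of SHA-256

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:NBYTES]

def rho_collision(seed: bytes):
    """Return (a, b) with a != b and H(a) == H(b), or None if x0 lies on the cycle."""
    x0 = H(seed)                      # start inside the hash's output space
    # Phase 1: Floyd cycle detection, slow moves 1 step and fast moves 2 steps.
    slow, fast = H(x0), H(H(x0))
    while slow != fast:
        slow = H(slow)
        fast = H(H(fast))
    # Phase 2: one walker restarts from x0, the other stays at the meeting
    # point; both now move one step at a time until their images agree.
    a, b = x0, slow
    if a == b:
        return None                   # no tail, hence no collision to extract
    while True:
        na, nb = H(a), H(b)
        if na == nb:
            return a, b               # predecessors of the collision point
        a, b = na, nb

def find_collision():
    for s in range(16):               # retry with another seed if unlucky
        res = rho_collision(str(s).encode())
        if res is not None:
            return res
    return None

a, b = find_collision()
print(a.hex(), b.hex(), H(a).hex())
```

Only a constant number of hash values is held in memory at any time, matching the $\mathcal{O}(1)$ space claim.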

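Pollard’s lambda Algorithm

The Pollard’s rho hash-collision algorithm meets the birthday bound in time and needs only constant memory, but it cannot be sped up linearly by parallel computation. A naive parallel birthday search, on the other hand, has a huge memory cost and also struggles to achieve a linear speed-up. Is there an algorithm that scales linearly in a parallel environment while keeping the space complexity low? When Quisquater and Delescaille searched for DES collisions, they used distinguished points to assist the search.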

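Distinguished Point Collision Algorithm

Distinguished points (DPs) are chosen by a property that is distinctive and easy to test. For hash collisions we usually take the distinguished points to be the hash values whose first $k$ bits are all 0, that is, values of the form $\underbrace{00\cdots0}_{k} x_{k+1}\cdots x_{n}$.

The DP hash-collision algorithm then consists of the following steps, with the distinguished-point parameter $k$ fixed in advance:

Pick a random start point $S_i$ and iterate the hash until a distinguished point $D_i$ is reached; store the DP chain $(S_i, D_i, L_i)$, where $L_i$ is the chain length.
Keep picking fresh start points and building DP chains until two distinguished points coincide, $D_i = D_j$; then stop building chains.
Take the two colliding chains $(S_i, D_i, L_i), (S_j, D_j, L_j)$. Advance the longer chain until its remaining length equals that of the other, then advance both chains in lockstep and check for a hash collision. If no collision appears, discard the shorter chain and return to the first step to look for further DP chains.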

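Figure 2: Distinguished Points Lead to Collision

Figure 2 sketches a collision produced during the DP collision search. Here $\mathcal{H}(x_1) = \mathcal{H}(x_2) = x_c$: the two chains end in the same distinguished point but start from different points, which is what produces the collision. Once two chains with the same DP are detected as in Figure 2, the SP1 chain is longer than the SP2 chain by 1, so SP1 first takes one hash step; after that SP1 and SP2 advance together, and the collision is detected at $x_1, x_2$.

If the SP1 chain, after being advanced, turns out to coincide with the SP2 chain, this is a false hash collision and the shorter chain is discarded. This situation is called the Robinhood Case, shown in Figure 3:

Figure 3: Robinhood Case

The Distinguished Point collision algorithm is often referred to by a better-known name, Pollard’s lambda algorithm, because the picture of two colliding DP chains (see Figure 2) resembles the Greek letter $\lambda$. Pollard’s lambda algorithm also applies to discrete logarithms, where it is a generic, efficient and parallelizable method.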
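The three steps of the DP search can be sketched as a single-threaded Python toy. The parameters (`NBYTES = 3`, `DP_BITS = 6`), the chain length cap `max_len`, the counter-based start points and all function names are assumptions chosen so that the example finishes quickly; a real implementation would run `build_chain` on many cores and share only the chain table.

```python
import hashlib

NBYTES, DP_BITS = 3, 6  # toy sizes (assumed): 24-bit hash, DP = top 6 bits zero

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:NBYTES]

def is_dp(x: bytes) -> bool:
    return int.from_bytes(x, "big") >> (8 * NBYTES - DP_BITS) == 0

def build_chain(start: bytes, max_len: int = 1 << 12):
    """Walk from start to the first distinguished point; return (dp, length)."""
    x, length = start, 0
    while not is_dp(x):
        x, length = H(x), length + 1
        if length > max_len:      # likely stuck in a cycle without any DP
            return None
    return x, length

def locate(s1, l1, s2, l2):
    """Given two chains ending in the same DP, find x1 != x2 with H(x1) == H(x2)."""
    while l1 > l2:                # advance the longer chain first
        s1, l1 = H(s1), l1 - 1
    while l2 > l1:
        s2, l2 = H(s2), l2 - 1
    if s1 == s2:                  # Robinhood case: one chain runs into the other
        return None
    while H(s1) != H(s2):         # now walk both chains in lockstep
        s1, s2 = H(s1), H(s2)
    return s1, s2

def dp_collision():
    chains = {}                   # dp -> (start, length)
    counter = 0
    while True:
        start = H(counter.to_bytes(8, "big"))
        counter += 1
        chain = build_chain(start)
        if chain is None:
            continue
        dp, length = chain
        if dp in chains:
            found = locate(start, length, *chains[dp])
            if found is not None:
                return found
            if length <= chains[dp][1]:
                continue          # keep the longer chain, drop the shorter one
        chains[dp] = (start, length)

x1, x2 = dp_collision()
print(x1.hex(), x2.hex(), H(x1).hex())
```

Only one triple per distinguished point is stored, which is where the memory saving over a naive birthday table comes from.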

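Time-Space Trade-off

The time and space complexity of the Distinguished Point collision algorithm depends largely on the difficulty of the distinguished points. The difficulty is defined as in Bitcoin mining: with difficulty $k$, the hash value must start with $k$ zero bits.

Analysis of the Distinguished Point collision algorithm. We consider the complexity of three phases: generating DP chains, finding a collision between DP chains, and recovering the hash collision.

DP chain generation: finding a DP chain resembles a preimage attack on the $k$ leading bits, with time complexity $\mathcal{O}(2^k)$ per chain.
DP chain collision: taken in isolation, the second phase amounts to a hash-collision search over the distinguished points. By the birthday paradox the number of DP chains needed would be $\mathcal{O}(2^{(n-k)/2})$, which would also be the space complexity. This is not the right count, however: what collides here are two chains, not two points! To apply the birthday paradox we should still count points: once about $2^{n/2}$ points have been computed, a collision is likely, and on the DP chains it necessarily leads to identical distinguished points. The second phase therefore needs $\mathcal{O}(\frac{2^{n/2}}{2^{k}}) = \mathcal{O}(2^{n/2 - k})$ DP chains, and the space complexity is likewise $\mathcal{O}(2^{n/2 - k})$.
Recovering the hash collision: once two DP chains collide, locating the hash collision takes $\mathcal{O}(2^k)$ time.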

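Putting the analysis together, the Distinguished Point collision algorithm has the following complexity:

Time complexity: $\mathcal{O}(2^{n/2} + 2^k) = \mathcal{O}(2^{n/2})$
Space complexity: $\mathcal{O}(2^{n/2 - k})$

This is the result of an idealized analysis that ignores special situations such as the Robinhood Case. In practice, if $k$ is too small the space complexity is high; if $k$ is too large, false collisions of the Robinhood type occur frequently and the running time grows. The choice of the difficulty $k$ is therefore critical for the Distinguished Point algorithm.

It is worth pointing out that with a carefully chosen $k$, the Distinguished Point algorithm keeps the running time close to $2^{n/2}$, is not memory-hard, and retains a linear speed-up on multiple cores. For example, with $n = 64$ and $k = 24$ the time complexity is $\mathcal{O}(2^{32})$ and the memory complexity is $\mathcal{O}(2^{8})$, and the search parallelizes with linear speed-up. Below are the author's experimental results for a collision on the low 64 bits of sha256: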

4 cores (PRNG seed 0x123456789abcdef0)

Two DP chains collided with dp mask=ffffff
Number of chains find: 393
diff = 11897207
Looking for collision...
Collision found! with 4 cores
333412288b678e3b ff7cb8a664c810e3
962860fc377014f1 962860fc377014f1
  
real    1m17.441s
user    5m9.708s
sys     0m0.020s

8 cores (PRNG seed 0x123456789abcdef0)

Two DP chains collided with dp mask=ffffff
Number of chains find: 409
diff = 11897207
Looking for collision...
Collision found! with 8 cores
333412288b678e3b ff7cb8a664c810e3
962860fc377014f1 962860fc377014f1
  
real    0m45.683s
user    6m5.344s
sys     0m0.011s

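The experiments above used an efficient C++ implementation of the DP collision algorithm taken from the Hitcon 2023 Collision challenge.

These results are roughly consistent with a linear speed-up. The theoretically expected number of DP chains is $2^8 = 256$; the observed value of about 400 is somewhat higher because the birthday-paradox estimate $\mathcal{O}(1.177 \cdot 2^{n/2})$ is the number of hashes at which the collision probability just exceeds 50%.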
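As a quick numerical check of the birthday constants, the snippet below evaluates the standard approximations for uniformly random samples: the 50% point $\sqrt{2\ln 2}\cdot\sqrt{N}$ and the expected number of samples until the first collision $\sqrt{\pi/2}\cdot\sqrt{N}$. The function names are made up for this example, and the formulas are the usual idealized approximations rather than a model of the DP search itself.

```python
import math

def birthday_constants():
    """Multipliers of sqrt(N): (50% collision point, expected first collision)."""
    median = math.sqrt(2 * math.log(2))   # about 1.1774
    mean = math.sqrt(math.pi / 2)         # about 1.2533
    return median, mean

def collision_probability(q, N):
    """Approximate probability of a collision among q samples from N values."""
    return 1 - math.exp(-q * (q - 1) / (2 * N))

median, mean = birthday_constants()
print(round(median, 4), round(mean, 4))
print(collision_probability(round(median * 2**16), 2**32))
```

Both constants are above 1, so needing somewhat more than $2^{n/2}$ hash evaluations in a given run is expected.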

SIDH: Supersingular Isogeny Key Exchange

2026-04-14T00:00:00+08:00

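Summary: This post introduces the core ingredients of Supersingular Isogeny Key Exchange: supersingular elliptic curves, the j-invariant and isogenies, and finally the standard SIDH protocol. It is a set of notes on, and partial translation of, Supersingular isogeny key exchange for beginners; the original paper is the better starting point for newcomers.

Note: Cryptographic schemes based on elliptic-curve isogenies were once a very promising direction in the NIST post-quantum standardization process. In its second-round status report NIST advanced SIKE to the third round as an Alternate Candidate, and SIKE was later retained as a candidate in the fourth round. In 2022, however, Castryck and Decru gave an efficient key-recovery attack on the original SIDH, so classical SIDH should no longer be regarded as a secure scheme ready for deployment. Its design ideas and mathematical structure nevertheless remain highly instructive, especially for understanding isogeny-based constructions and the design of later improved variants.

Background

Supersingular Elliptic Curves

Consider an elliptic curve $E$ defined over a finite field $K= \mathbb{F}_q$ with Weierstrass equation: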

\[E: y^2 = x^3 + ax + b \quad a, b \in K\]

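Supersingular elliptic curves are elliptic curves with special properties: when defined over a finite field, their endomorphism ring is as large as possible. Concretely, supersingularity is characterized by the following conditions:

The Frobenius trace $t$ of $E$ satisfies $t \equiv 0 \mod p$.
The endomorphism ring $End(E)$ of $E$ has rank 4 as a $\mathbb{Z}$-module (it is an order in a quaternion algebra).
The Hasse invariant $a_p$ of $E$ is 0.
If $E$ is supersingular in characteristic $p$, then it is isomorphic to some elliptic curve defined over $\mathbb{F}_{p^2}$ (this last property is a consequence of supersingularity rather than an equivalent condition).

Isogenies between supersingular elliptic curves have a rich structure and underlie SIDH-style protocols. Since the Frobenius trace of such a curve over $\mathbb{F}_p$ is 0 (for $p > 3$), its order is $\vert E(\mathbb{F}_p) \vert = p + 1$; over the extension field we have $\vert E(\mathbb{F}_{p^2}) \vert = k(p + 1)$, where $k$ is usually $p+1$.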

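The $j$-invariant

In the theory of elliptic curves, the $j$-invariant is an important invariant used to classify curves: it is the value that uniquely identifies an isomorphism class of elliptic curves. Intuitively, a simple change of coordinates (loosely, a translation or rotation) does not alter the geometric nature of a curve, so a curve $E$ and the curve $E^\prime$ obtained from it by such a transformation are isomorphic, and so are the corresponding point groups over the finite field. The algebraic quantity that uniquely labels the isomorphism class is the j-invariant.

The formal algebraic definition is as follows. For the curve $E: y^2 = x^3 + ax + b \quad a, b \in K$, the j-invariant is: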

\[j(E) = 1728 \cdot \frac{4a^3}{4a^3 + 27b^2}\]

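Two elliptic curves are isomorphic (over the algebraic closure) if and only if they have the same $j$-invariant.
For $p = 3 \mod 4$, in the finite field $\mathbb{F}_{p^2} = \mathbb{F}_{p}(i)$ with $i^2 + 1 = 0$, there are $\lfloor p/12 \rfloor + z$ isomorphism classes of supersingular curves, where $z \in \{0,1,2\}$ depends on $p \mod 12$.
The $j$-invariant of a supersingular elliptic curve over a field of characteristic $p$ always lies in $\mathbb{F}_{p^2}$. It is therefore natural to work over $\mathbb{F}_{p^2}$ when discussing supersingular curves.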
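A small Python sketch makes the two formulas above tangible over a prime field. The function names are invented for this example, the code handles only short Weierstrass curves over $\mathbb{F}_p$ with $p > 3$, and the table for $z$ is the standard case distinction on $p \bmod 12$ (stated here as background, not taken from the text above).

```python
def j_invariant(a, b, p):
    """j-invariant of y^2 = x^3 + a*x + b over F_p, for a prime p > 3."""
    disc = (4 * a**3 + 27 * b**2) % p
    if disc == 0:
        raise ValueError("singular curve: 4a^3 + 27b^2 = 0")
    return 1728 * 4 * a**3 * pow(disc, -1, p) % p   # needs Python 3.8+

def supersingular_count(p):
    """Number of supersingular j-invariants in characteristic p > 3."""
    z = {1: 0, 5: 1, 7: 1, 11: 2}[p % 12]
    return p // 12 + z

p = 431
print(j_invariant(1, 0, p))     # y^2 = x^3 + x has j = 1728 mod p
print(j_invariant(0, 1, p))     # y^2 = x^3 + 1 has j = 0
print(supersingular_count(p))   # size of the supersingular set for p = 431
# Rescaling (a, b) -> (a*u^4, b*u^6) gives an isomorphic curve with the same j.
print(j_invariant(3, 5, p) == j_invariant(3 * 2**4, 5 * 2**6, p))
```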

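Isogenies

An isogeny is a special kind of map that sends one elliptic curve to another. Curves with the same $j$-invariant are connected by isomorphisms, while the more general isogenies connect curves with different $j$-invariants.

In general such a map can be written as $(x,y) \mapsto (f(x,y), g(x,y))$. Often we write only the part acting on the $x$-coordinate, because the action on $y$ can be derived from it: concretely, $(x,y) \mapsto (f(x), c \cdot y \cdot f^\prime(x))$ for some constant $c$. Next we look at the very common multiplication maps, an important example closely related to isogenies.

Multiplication Maps

Let $E_a: y^2 = x^3 + ax^2 + x$, and consider the simplest endomorphism, multiplication by 2: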

\[\text { [2]: } \quad E_a \rightarrow E_a, \quad x \mapsto \frac{\left(x^2-1\right)^2}{4 x\left(x^2+a x+1\right)}\]

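This is clearly not an isomorphism, because there are points at which the denominator of the map vanishes, namely $(0,0), (\alpha, 0), (1/\alpha , 0)$ with $\alpha ^ 2 + a \alpha + 1 = 0$. In other words, all points of order 2, together with the point at infinity $\mathcal{O}$, are sent to $\mathcal{O}$. These four elements form the kernel of the doubling map, and: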

\[\operatorname{ker}([2]) \cong \mathbb{Z}_2 \times \mathbb{Z}_2\]

The three non-trivial elements generate the three cyclic subgroups of order 2.

The same applies to the multiplication-by-3 map:

\[\text { [3]: } \quad E_a \rightarrow E_a, \quad x \mapsto \frac{x\left(x^4-6 x^2-4 a x^3-3\right)^2}{\left(3 x^4+4 a x^3+6 x^2-1\right)^2}\]

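The denominator of this map vanishes at 4 values of $x$, denoted $\beta, \delta, \zeta, \theta$. The 8 points with these $x$-coordinates, together with the point at infinity $\mathcal{O}$, form the kernel of the tripling map, and: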

\[\operatorname{ker}([3]) \cong \mathbb{Z}_3 \times \mathbb{Z}_3\]

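That is, the 3-torsion consists of exactly 4 cyclic subgroups of order 3.

Geometric intuition for the 2-torsion and 3-torsion

More generally, for every multiplication map $[\ell]$ with $p \nmid \ell$, the $\ell$-torsion satisfies: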

\[\operatorname{ker}([\ell]) \cong \mathbb{Z}_{\ell} \times \mathbb{Z}_{\ell}\]

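The doubling and tripling maps above are in fact special cases of more general isogenies.

Isogeny Maps

An isogeny is a non-trivial morphism from an elliptic curve $E$ to another elliptic curve $E^{\prime}$ that is also a group homomorphism. That is, for all $P, Q \in E$: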

\[\phi(P+Q)=\phi(P)+\phi(Q)\]

Moreover, $\phi$ can be expressed by rational functions: if $\phi: E \rightarrow E^{\prime}$ is an isogeny, there exist rational functions $\phi_x(x, y)$ and $\phi_y(x, y)$ such that:

\[\phi(x, y)=\left(\phi_x(x, y), \phi_y(x, y)\right)\]

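Basic properties of isogenies include:

Kernel: the kernel of an isogeny is the set of points mapped to the identity $\mathcal{O}$.
Degree: the degree of an isogeny is the degree of the corresponding function-field extension. An isogeny of degree $n$ is called an $n$-isogeny.
Composition: if $\phi: E \rightarrow E^{\prime}$ and $\psi: E^{\prime} \rightarrow E^{\prime \prime}$ are isogenies, then $\psi \circ \phi$ is also an isogeny.

Writing $G$ for the kernel, the image curve is usually denoted $E^\prime = E/G$. Notably, (separable) isogenies from $E$ correspond one-to-one, up to isomorphism, with their kernels $G$. Given a finite subgroup $G$ we can always construct the corresponding isogeny; the explicit construction is given by Vélu's formulas. The proofs are fairly technical; details can be found in: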

MIT Elliptic Curves: https://math.mit.edu/classes/18.783/2023/LectureSlides5.pdf
Vélu’s Formulas for SIDH: https://www.mariascrs.com/2020/11/07/velus-formulas.html

Isogeny Examples

Take the doubling map as an example, and choose $G=\{\mathcal{O},(\alpha, 0)\}$ on $E_a$. Vélu's formulas give:

\[\phi: \quad E_a \rightarrow E_{a^{\prime}}, \quad x \mapsto \frac{x(\alpha x-1)}{x-\alpha}\]

where

\[a^{\prime}=2\left(1-2 \alpha^2\right)\]

As a concrete example, take the following curve over $\mathbb{F}_{431^2}$:

\[E_a: y^2=x^3+(208 i+161) x^2+x, \quad \text { with } \quad j\left(E_a\right)=364 i+304\]

Here $(\alpha, 0) \in E_a$ with $\alpha=350 i+68$. Substituting into the 2-isogeny above gives the new curve:

\[E_{a^{\prime}}: y^2=x^3+(102 i+423) x^2+x, \quad \text { with } \quad j\left(E_{a^{\prime}}\right)=344 i+190\]

The corresponding map is:

\[\phi: x \mapsto \frac{x((350 i+68) x-1)}{x-(350 i+68)}\]

Similarly, for the tripling map, let $G=\{\mathcal{O},(\beta, \gamma),(\beta,-\gamma)\}$. Vélu's formulas give:

\[\phi: \quad E_a \rightarrow E_{a^{\prime}}, \quad x \mapsto \frac{x(\beta x-1)^2}{(x-\beta)^2}\]

where

\[a^{\prime}=\left(a \beta-6 \beta^2+6\right) \beta\]

If the point $(\beta, \gamma)=(321 i+56,303 i+174)$ has order exactly 3 on the curve $E_a: y^2=x^3+(208 i+161) x^2+x$, we obtain a concrete 3-isogeny whose codomain is:

\[E_{a^{\prime}}: y^2=x^3+415 x^2+x, \quad \text { with } \quad j\left(E_{a^{\prime}}\right)=189\]

The isogeny is given by:

\[\phi: x \mapsto \frac{x((321 i+56) x-1)^2}{(x-(321 i+56))^2}\]

Unlike an isomorphism, which preserves the j-invariant, an isogeny here sends the curve to one with a different j-invariant; the two curves are no longer isomorphic but isogenous.
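The numbers in the two examples can be re-derived with a few lines of Python. The sketch below represents an element $u + vi$ of $\mathbb{F}_{431^2}$ as the pair `(u, v)`; the helper names are invented for this example, and the Montgomery $j$-invariant formula $j(E_a) = 256(a^2-3)^3/(a^2-4)$ is the standard one for curves of the form $y^2 = x^3 + ax^2 + x$, quoted here as background.

```python
P = 431  # p = 3 (mod 4), so F_{p^2} = F_p(i) with i^2 = -1

def add(u, v): return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)
def sub(u, v): return ((u[0] - v[0]) % P, (u[1] - v[1]) % P)
def scal(k, u): return (k * u[0] % P, k * u[1] % P)

def mul(u, v):
    return ((u[0] * v[0] - u[1] * v[1]) % P, (u[0] * v[1] + u[1] * v[0]) % P)

def inv(u):
    n = pow(u[0] * u[0] + u[1] * u[1], -1, P)   # inverse of the norm
    return (u[0] * n % P, -u[1] * n % P)

ONE = (1, 0)

def j_montgomery(a):
    """j(E_a) = 256 (a^2 - 3)^3 / (a^2 - 4) for E_a: y^2 = x^3 + a x^2 + x."""
    a2 = mul(a, a)
    t = sub(a2, (3, 0))
    return mul(scal(256, mul(t, mul(t, t))), inv(sub(a2, (4, 0))))

def two_isogeny_codomain(alpha):
    """a' = 2 (1 - 2 alpha^2)."""
    return scal(2, sub(ONE, scal(2, mul(alpha, alpha))))

def three_isogeny_codomain(a, beta):
    """a' = (a beta - 6 beta^2 + 6) beta."""
    t = add(sub(mul(a, beta), scal(6, mul(beta, beta))), (6, 0))
    return mul(t, beta)

a = (161, 208)        # 208 i + 161
alpha = (68, 350)     # 350 i + 68
beta = (56, 321)      # 321 i + 56

print(add(add(mul(alpha, alpha), mul(a, alpha)), ONE))   # alpha^2 + a alpha + 1
print(two_isogeny_codomain(alpha), j_montgomery(two_isogeny_codomain(alpha)))
print(three_isogeny_codomain(a, beta), j_montgomery(three_isogeny_codomain(a, beta)))
```

The first line printed confirms that $(\alpha, 0)$ really is a 2-torsion point of $E_a$; the other two reproduce the codomain coefficients and $j$-invariants quoted in the examples.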

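Algebraic Properties

Let $\phi: E \to E^\prime$ be an isogeny with kernel $G$ and degree $d = \vert G \vert$.

The degree of a non-zero separable isogeny equals the size of its kernel.
An isogeny generally changes the j-invariant of the curve.
An isomorphism is a special isogeny whose kernel is $G=\{\mathcal{O}\}$.
An isogeny is in general not invertible; a true inverse map $\phi^{-1}$ usually does not exist.

If $\phi: E \to E^\prime$ has degree $d$, its dual isogeny $\hat \phi$ satisfies: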

\[\hat \phi \circ \phi = [d]_E \text{ and } \phi \circ \hat \phi = [d]_{E^\prime}\]

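where $[d]_E$ denotes multiplication by $d$ on $E$ and $[d]_{E^\prime}$ denotes multiplication by $d$ on $E^\prime$.

The dual isogeny can be viewed as “an inverse in a certain sense”, but the composition is not the identity map; it is a multiplication map.

An isogeny $\phi: E \to E^\prime$ of degree $d$ may reduce the order of the image $\phi(P)$ of a point $P \in E$ by a factor $k \mid d$.
If $P$ has order $\ell$ and $\gcd(\ell, d) = 1$, the order of the point is unchanged by a $d$-isogeny.
In particular, $\phi(P)=\mathcal{O}$ if and only if $P$ lies in the kernel of $\phi$, i.e. $P \in G$.
Two curves over a finite field $\mathbb{F}_q$ are isogenous if and only if they have the same number of points.

The fourth statement above is particularly important for supersingular curves. For supersingular curves defined over $\mathbb{F}_{p^2}$ we typically have: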

\[\vert E(\mathbb{F}_{p^2}) \vert = (p + 1)^2\]

This gives a key conclusion: all these supersingular elliptic curves are isogenous. Take a concrete curve over $\mathbb{F}_{431^2}$ as an example:

\[E_a: y^2=x^3+(208 i+161) x^2+x, \quad \text { with } \quad j\left(E_a\right)=364 i+304\]

Its order is $\#E(\mathbb{F}_{431^2}) = 432^2$, with group structure:

\[\mathbb{Z}_{432} \times \mathbb{Z}_{432}\]

Moreover, this curve satisfies:

\[ker([p+1]) \cong \mathbb{Z}_{p + 1} \times \mathbb{Z}_{p +1}\]

and hence:

\[E(\mathbb{F}_{p^2}) \cong \mathbb{Z}_{p + 1} \times \mathbb{Z}_{p + 1}\]

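One point that is easy to find confusing: a 2-isogeny clearly sends several points to $\mathcal{O}$, so how can the two curves still have the same order? The explanation given in the original paper is that this “loss” is balanced by points defined over higher extension fields, so that isogenous curves end up with the same number of points.

Isogeny Graphs

Taking the j-invariants of all supersingular curves over $\mathbb{F}_{431^2}$ as vertices, we obtain the following supersingular isogeny graph (37 isomorphism classes of supersingular curves in total):

Supersingular isogeny graph over $\mathbb{F}_{431^2}$

Because isogenies preserve the group order, the image of a supersingular curve under an isogeny is again supersingular. Computing $\ell$-isogenies on this graph is therefore essentially a random walk on the graph.

From this viewpoint SIDH already resembles classical DH: Alice and Bob start from the same vertex, each follows a path determined by their private key, and then each uses the information published by the other to continue to a common endpoint.

For each curve $E$ there are 3 distinct 2-isogenies, so in principle it can reach at most 3 curves with different j-invariants via 2-isogenies. This gives the following structure: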

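Local structure of the 2-isogeny graph

Apart from the curves with j-invariant $0, 4, 242$, every vertex has 3 outgoing edges. The edges are treated as bidirectional, because the dual isogeny $\hat \phi: E^\prime \to E$ of an isogeny $\phi: E \to E^\prime$ provides the return edge.

Similarly, in the 3-isogeny graph every vertex has 4 outgoing edges:

Local structure of the 3-isogeny graph

With this graph-theoretic intuition in hand, let us look at the choice of the finite field in SIDH. The standard choice in SIKE/SIDH is a prime of the form: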

\[p = 2^{e_A}3^{e_B} - 1\]

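where $2^{e_A} \approx 3^{e_B}$. More generally SIDH also works for $p = f2^{e_A}3^{e_B} - 1$, but many standard parameter sets simply take $f = 1$. Since:

\[E\left(\mathbb{F}_{p^2}\right) \cong \mathbb{Z}_{2^{e_A} 3^{e_B}} \times \mathbb{Z}_{2^{e_A} 3^{e_B}}\]

there exist two points $P, Q$ of order $p_s = 2^{e_A}3^{e_B}$ that form a basis of the whole elliptic-curve group. All points of order $2^{e_A}$ or $3^{e_B}$ also lie in $E\left(\mathbb{F}_{p^2}\right)$. This is why SIDH can work separately in the 2-power and 3-power torsion while keeping Alice's and Bob's computations on the same starting curve. There is also a very practical reason for choosing $\ell = 2, 3$: isogenies of these small degrees can be computed efficiently over $\mathbb{F}_{p^2}$, whereas isogenies of larger degree usually require moving to a larger extension field.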

SIDH Protocol

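With the supersingular isogeny graph in mind, the overall shape of SIDH is already fairly clear. Before giving the full protocol, however, it helps to look at a naive version that “looks like DH but is actually wrong”; it makes clear why real SIDH has to introduce auxiliary points.

Naive SIDH

Following classical DH, the most natural idea is to choose private keys $s_a \in (0, 2^{e_A})$ and $s_b \in (0, 3^{e_B})$ and let Alice and Bob each walk a number of steps on the graph according to their key.

Alice's public-key generation can be roughly understood as:

Use the first bit of $s_a$ to select a 2-isogeny $\phi_{a_1}$, obtaining the new curve $E_{a_1} = \phi_{a_1}(E_{a_0})$.
In round $i$, use the $i$-th bit to select a further 2-isogeny $\phi_{a_i}$ on $E_{a_{i-1}}$.

After $e_A$ 2-isogenies, Alice arrives at the curve $E_a$.

Bob likewise arrives at the curve $E_b$ after $e_B$ 3-isogenies.

A naive idea for the shared secret is then:

Alice receives $E_b$ and walks another $e_A$ 2-isogeny steps according to her private key, obtaining $E_{ba}$.
Bob receives $E_a$ and walks another $e_B$ 3-isogeny steps according to his private key, obtaining $E_{ab}$.

This scheme is wrong, for two key reasons:

Isogenies do not form a commutative group, so in general $j(E_{ba}) \ne j(E_{ab})$ and no shared secret is obtained.
At each 2-isogeny or 3-isogeny step there are several possible kernels, so the “private key” is not just a simple integer; it carries additional information about subgroups.

More intuitively, computing isogenies is essentially a random walk on the graph. Applying strategy $s_1$ and then $s_2$ generally ends at a different vertex than applying $s_2$ and then $s_1$. The key to real SIDH is the auxiliary point information, which lets both parties construct composite isogenies with the same kernel and hence arrive at the same shared j-invariant.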

Standard SIDH

Let $p = 2^{e_A}3^{e_B} - 1$ be prime and fix an initial supersingular elliptic curve $E$. The version below is closer to the real protocol.

Public auxiliary points

Since the $\ell$-torsion has the two-dimensional structure $\mathbb{Z}_{\ell} \times \mathbb{Z}_{\ell}$, Alice chooses:
\[\left\langle P_A, Q_A\right\rangle=E\left[2^{e_A}\right] \cong \mathbb{Z}_{2^{e_A}} \times \mathbb{Z}_{2^{e_A}}\]
where $P_A, Q_A$ both have order $2^{e_A}$. Their linear combinations generate a subgroup of size $2^{2e_A}$.

Bob likewise chooses:
\[\left\langle P_B, Q_B\right\rangle=E\left[3^{e_B}\right] \cong \mathbb{Z}_{3^{e_B}} \times \mathbb{Z}_{3^{e_B}}\]
where $P_B, Q_B$ have order $3^{e_B}$.
Public-key generation
- Alice samples a random private key $k_A \in [0, 2^{e_A})$ and computes
  \[S_A=P_A+\left[k_A\right] Q_A \quad \text { with } \quad k_A \in\left[0,2^{e_A}\right)\]
  From $S_A$ she builds $e_A$ 2-isogenies whose composition is $\phi_A: E \to E_A$, written $E_A = E /\left\langle S_A\right\rangle$. She then maps Bob's basis points across, obtaining $P_B^\prime, Q_B^\prime$, so Alice's public key is
  \[\mathrm{PK}_A=\left(E_A, P_B^{\prime}, Q_B^{\prime}\right)=\left(\phi_A(E), \phi_A\left(P_B\right), \phi_A\left(Q_B\right)\right)\]
- Bob samples a random private key $k_B \in [0, 3^{e_B})$ and computes
  \[S_B=P_B+\left[k_B\right] Q_B \quad \text { with } \quad k_B \in\left[0,3^{e_B}\right)\]
  From $S_B$ he builds $e_B$ 3-isogenies whose composition is $\phi_B: E \to E_B$, written $E_B = E /\left\langle S_B\right\rangle$. He then maps Alice's basis points across, obtaining $P_A^\prime, Q_A^\prime$, so Bob's public key is
  \[\mathrm{PK}_B=\left(E_B, P_A^{\prime}, Q_A^{\prime}\right)=\left(\phi_B(E), \phi_B\left(P_A\right), \phi_B\left(Q_A\right)\right)\]
Shared secret computation
- After receiving Bob's public key, Alice computes on $E_B$
  \[S_A^\prime = P_A^\prime + [k_A] Q_A^\prime\]
  which gives the secret isogeny $\phi_A^\prime : E_B \to E_{AB}$, where
  \[E_{AB} = E_B/\left\langle S_A^\prime\right\rangle\]
  The shared value is $j_{AB} = j(E_{AB})$.
- Bob likewise computes on $E_A$
  \[S_B^\prime = P_B^\prime + [k_B] Q_B^\prime\]
  which gives the secret isogeny $\phi_B^\prime : E_A \to E_{BA}$; his shared value is $j_{BA} = j(E_{BA})$.

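How do we decompose a point of order $2^{e_A}$ into $e_A$ 2-isogenies? The answer is directly tied to how isogenies affect the order of points. Let $E_0 = E$ and $S_0 = S_A$, where $S_0$ has order $2^{e_A}$. Then:

\[R_0 = [2^{e_A - 1}] S_0\]

is a point of order 2 on $E_0$ and can serve as the kernel generator of the first 2-isogeny. Call this isogeny $\phi_1$; it yields a new curve $E_1$ and a new point $S_1 = \phi_1(S_0)$, whose order on $E_1$ drops to $2^{e_A - 1}$. By induction, in round $i$ the point $S_{i-1}$ has order $2^{e_A - i + 1}$, and computing:

\[R_{i-1} = [2^{e_A - i}] S_{i-1}\]

gives the kernel generator of the next 2-isogeny $\phi_i$, with $S_i = \phi_i(S_{i-1})$. Repeating this for $e_A$ rounds, we finally reach $S_{e_A} = \mathcal{O}$.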
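To make the bookkeeping concrete, here is a worked instance under the assumption $e_A = 4$ (the toy value matching $p = 431 = 2^4 3^3 - 1$). The indexing $S_{j+1} = \phi_{j+1}(S_j)$ with $\phi_{j+1}: E_j \to E_{j+1}$ is a convention chosen for this sketch:

\[\begin{array}{c|c|c|c} j & \text{order of } S_j & \text{kernel generator } R_j & \text{isogeny} \\ \hline 0 & 16 & [8] S_0 & \phi_1: E_0 \to E_1 \\ 1 & 8 & [4] S_1 & \phi_2: E_1 \to E_2 \\ 2 & 4 & [2] S_2 & \phi_3: E_2 \to E_3 \\ 3 & 2 & S_3 & \phi_4: E_3 \to E_4 = E_A \end{array}\]

Every $R_j$ has order 2, each step halves the order of the image of $S_A$, and after the fourth step $S_4 = \phi_4(S_3) = \mathcal{O}$, so $\phi_A = \phi_4 \circ \phi_3 \circ \phi_2 \circ \phi_1$ has kernel $\left\langle S_A \right\rangle$ of size $2^4$.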

Bob's 3-isogeny walk works the same way, except that each kernel contains two non-zero elements, namely $R_i$ and its inverse $-R_i$ (together with $\mathcal{O}$).

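Correctness of SIDH

Abstractly, SIDH has a clear geometric/graph-theoretic meaning: it is a random walk on a directed graph. Moving from the start vertex to the end vertex plays the role that the group action plays in DH; concretely it is an isogeny, and the degree of the isogeny determines the complexity of the walk, i.e. how many distinct endpoints can be reached from a given start. With the construction above, the curves obtained by the two parties satisfy $j(E_{AB}) = j(E_{BA})$; both lie in the isomorphism class of $E /\left\langle S_A, S_B\right\rangle$. A more rigorous proof can be found in the paper pqc from supersingular elliptic curve isogenies. The core identity is:

\[E /\left\langle P, Q\right\rangle \cong (E/\left\langle P\right\rangle) / \left\langle \phi(Q) \right\rangle\]

where $\phi: E \to E/ \left\langle P \right\rangle$ is the isogeny with kernel $\left\langle P \right\rangle$.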

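SIDH uses isogenies of degree $\ell^e$. When $\ell$ is small, such isogenies can be computed in polynomial time, with complexity roughly $O(e\ell)$; this is why the protocol favors the two small primes $2$ and $3$. For isogenies of large prime degree $\ell$, the best known cost is about $O(\sqrt{\ell})$ (see velusqrt), which is still markedly more expensive.

Below we explain, from the viewpoint of kernels, why SIDH always ends up with the same j-invariant.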
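A short derivation sketch, using only the protocol definitions above and the fact that an isogeny is a group homomorphism. The point Alice computes on $E_B$ is the image of her secret point under Bob's isogeny, and symmetrically for Bob:

\[\begin{aligned} \phi_B(S_A) &= \phi_B\left(P_A + [k_A] Q_A\right) = \phi_B(P_A) + [k_A]\, \phi_B(Q_A) = P_A^\prime + [k_A] Q_A^\prime = S_A^\prime, \\ \phi_A(S_B) &= \phi_A\left(P_B + [k_B] Q_B\right) = \phi_A(P_B) + [k_B]\, \phi_A(Q_B) = P_B^\prime + [k_B] Q_B^\prime = S_B^\prime. \end{aligned}\]

Applying the identity $E /\left\langle P, Q\right\rangle \cong (E/\left\langle P\right\rangle) / \left\langle \phi(Q) \right\rangle$ once with $(P, Q) = (S_B, S_A)$ and once with $(P, Q) = (S_A, S_B)$ gives

\[E_{AB} = E_B / \left\langle \phi_B(S_A) \right\rangle \cong E / \left\langle S_A, S_B \right\rangle \cong E_A / \left\langle \phi_A(S_B) \right\rangle = E_{BA},\]

so both parties land in the same isomorphism class and therefore obtain the same j-invariant.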

ZK-SNARK: Deep Dive into Groth16

2026-01-29T00:00:00+08:00

tl;dr: Groth16 is one of the most popular and efficient Zero-Knowledge Succinct Non-interactive Arguments of Knowledge (zk-SNARKs) based on Quadratic Arithmetic Programs (QAPs). This post provides a detailed walkthrough of the Groth16 protocol, covering its setup, proving, and verification phases, along with the underlying mathematical principles.

Useful references:

Groth16 paper: On the Size of Pairing-based Non-interactive Arguments
Awesome introduction to zk-snark: moonmath book

Preliminaries

Before we start, the basic definitions of zero-knowledge proofs and zk-SNARKs are assumed to be known, especially Rank-1 Constraint Systems (R1CS) and Quadratic Arithmetic Programs. For a brief introduction, please refer to my previous post: Notes on Formal Language and Generic Proof System. For beginners, the moonmath book is highly recommended for learning the mathematical foundations and the Groth16 protocol.

High-Level Process of Groth16. In Groth16, the claim or knowledge to be proven is typically represented as an arithmetic circuit, then reduced to a Rank-1 Constraint System (R1CS), and finally transformed into a Quadratic Arithmetic Program (QAP). This reduction allows the proof to be distilled into a single polynomial identity. In this post, we focus exclusively on the final polynomial proof, which constitutes the core of Groth16’s zero-knowledge property.

Recall Quadratic Arithmetic Program (QAP)

Let $L$ be a language defined by some Rank-1 Constraint System $R$ such that a constructive proof of knowledge for an instance $\langle I_1, \ldots, I_n \rangle$ in $L$ consists of a witness $\langle W_1, \ldots, W_m \rangle$. Let $\left\{\mathbb{G}_1, \mathbb{G}_2, e(\cdot, \cdot), g_1, g_2, \mathbb{F}_r\right\}$ be a set of Groth16 parameters where $e(\cdot, \cdot)$ is an efficiently computable, non-degenerate, bilinear map from $\mathbb{G}_1\times \mathbb{G}_2$ to some target group $\mathbb{G}_T$ of order $r$. Let $Q A P(R)=\left\{T \in \mathbb{F}[x],\left\{A_j, B_j, C_j \in \mathbb{F}[x]\right\}_{j=0}^{n+m}\right\}$ be a Quadratic Arithmetic Program associated to $R$. The string $(\langle I_1, \ldots, I_n \rangle ; \langle W_1, \ldots, W_m \rangle)$ is a solution to the R1CS if and only if the following polynomial is divisible by the target polynomial $T$:

\[\begin{aligned} P_{(I ; W)} = &\left(A_0+\sum_{j=1}^n I_j \cdot A_j+\sum_{j=1}^m W_j \cdot A_{n+j}\right) \cdot\left(B_0+\sum_{j=1}^n I_j \cdot B_j+\sum_{j=1}^m W_j \cdot B_{n+j}\right) \\ &-\left(C_0+\sum_{j=1}^n I_j \cdot C_j+\sum_{j=1}^m W_j \cdot C_{n+j}\right). \end{aligned}\]

This implies

\[P_{(I ; W)}(x) = H(x) \cdot T(x) \text{ for some } H(x) \in \mathbb{F}[x].\]

The prover is going to convince the verifier that he/she knows a valid witness $\langle W_1, \ldots, W_m \rangle$ for the instance $\langle I_1, \ldots, I_n \rangle$ without revealing any information about the witness.

In the following sections, this post provides a detailed exposition of the three core sub-protocols of Groth16: the Setup Phase, the Prover Phase, and the Verifier Phase. It concludes by addressing several practical security considerations essential for implementation.

Setup Phase

The setup phase samples 5 random, invertible elements $\alpha, \beta, \gamma, \delta$ and $\tau$ from the scalar field $\mathbb{F}_r$ of the protocol and outputs the simulation trapdoor $\mathrm{ST}$ :

\[\mathrm{ST}=(\alpha, \beta, \gamma, \delta, \tau)\]

In the setup phase, we need to generate the following common reference string and remove the simulation trapdoor completely right after the setup phase.

\[\begin{aligned} & C R S_{\mathbb{G}_1}=\left\{\begin{array}{c} g_1^\alpha, g_1^\beta, g_1^\delta,\left(g_1^{\tau^j}, \ldots\right)_{j=0}^{\operatorname{deg}(T)-1},\left(g_1^{\frac{\beta \cdot A_j(\tau)+\alpha \cdot B_j(\tau)+C_j(\tau)}{\gamma}}, \ldots\right)_{j=0}^n \\ \left(g_1^{\frac{\beta \cdot A_{j+n}(\tau)+\alpha \cdot B_{j+n}(\tau)+C_{j+n}(\tau)}{\delta}}, \ldots\right)_{j=1}^m,\left(g_1^{\frac{\tau^j \cdot T(\tau)}{\delta}}, \ldots\right)_{j=0}^{\operatorname{deg}(T)-2} \end{array}\right\} \\ & C R S_{\mathbb{G}_2}=\left\{g_2^\beta, g_2^\gamma, g_2^\delta,\left(g_2^{\tau^j}, \ldots\right)_{j=0}^{\operatorname{deg}(T)-1}\right\} \end{aligned}\]

Usually $\tau$ is called a secret evaluation point. Let $P(x) = \sum_{i=0}^{k} a_i x^i$ be a polynomial of degree $k < \deg T$ with coefficients in $\mathbb{F}_{r}$. Then we can evaluate $P(\tau)$ in the exponent of $g_1$ or $g_2$ given the common reference string:
\[g^{P(\tau)} = g^{\sum_{i=0}^{k}a_i{\tau^i}} = \prod_{i=0}^{k} (g^{\tau^i})^{a_i}.\]
The elements $g^{\tau^0}_{1,2}, g^{\tau^1}_{1,2}, \ldots, g^{\tau^k}_{1,2}$ are commonly referred to as the Powers of Tau.
Toxic Waste. The simulation trapdoor $\mathrm{ST}=(\alpha, \beta, \gamma, \delta, \tau)$ is often referred to as the toxic waste of the setup phase. The simulation trapdoor can be utilized to generate fraud proofs, which are verifiable zk-SNARKs that can be constructed without knowledge of any witness, that is, forging proofs. Thus, $\mathrm{ST}=(\alpha, \beta, \gamma, \delta, \tau)$ must be safely deleted in the setup phase (through a trusted third party or multi-party computation).
Public Information for Prover/Verifier. The R1CS, its corresponding QAP and the Common Reference String are public to the Prover and Verifier.
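The "evaluate $P(\tau)$ in the exponent" step can be tried out in a toy group. The sketch below replaces the pairing groups by the order-$1019$ subgroup of $\mathbb{Z}_{2039}^{*}$ generated by $4$; this group, the constants `Q`, `R`, `G` and the function names are stand-ins chosen for illustration, and they offer no security whatsoever.

```python
Q, R, G = 2039, 1019, 4   # toy group (assumed): G has prime order R in Z_Q^*

def powers_of_tau(tau, k):
    """CRS fragment [g^(tau^0), g^(tau^1), ..., g^(tau^k)]."""
    return [pow(G, pow(tau, j, R), Q) for j in range(k + 1)]

def eval_in_exponent(coeffs, crs):
    """Compute g^(P(tau)) from the CRS alone, for P = sum coeffs[i] x^i."""
    acc = 1
    for a, g_tau_j in zip(coeffs, crs):
        acc = acc * pow(g_tau_j, a % R, Q) % Q
    return acc

tau = 123                 # the toxic waste; the prover never sees this value
coeffs = [5, 0, 7, 1]     # P(x) = 5 + 7 x^2 + x^3
crs = powers_of_tau(tau, 3)

direct = pow(G, sum(a * tau**i for i, a in enumerate(coeffs)) % R, Q)
print(eval_in_exponent(coeffs, crs) == direct)
```

The prover only multiplies and exponentiates CRS elements; $\tau$ itself appears solely in the setup and in the `direct` reference value used for comparison.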

The Prover Phase

We first recall that given $QAP(R)=\left\{T \in \mathbb{F}[x],\left\{A_j, B_j, C_j \in \mathbb{F}[x]\right\}_{j=0}^{n+m}\right\}$ associated with our R1CS and a witness $\langle W_1, \ldots, W_m \rangle$ for an instance $\langle I_1, \ldots, I_n \rangle$, the proof of knowledge of the witness $\langle W_1, \ldots, W_m \rangle$ is performed as follows. We first compute the proving polynomial:

\[\begin{aligned} P_{(I ; W)} &=\left(A_0+\sum_{j=1}^n I_j \cdot A_j+\sum_{j = 1}^m W_j \cdot A_{n+j}\right) \cdot\left(B_0+\sum_{j = 1}^n I_j \cdot B_j+\sum_{j = 1}^m W_j \cdot B_{n+j}\right) \\ &-\left(C_0+\sum_{j = 1}^n I_j \cdot C_j+\sum_{j = 1}^m W_j \cdot C_{n+j}\right). \end{aligned}\]

To be more precise, we split $P_{(I ; W)}$ as three parts $\mathcal{A}, \mathcal{B}, \mathcal{C}$:

\[\begin{aligned} P_{(I ; W)} &= \underbrace{\left(A_0+\sum_{j=1}^n I_j \cdot A_j+\sum_{j = 1}^m W_j \cdot A_{n+j}\right)}_{\mathcal{A}} \cdot \underbrace{\left(B_0+\sum_{j = 1}^n I_j \cdot B_j+\sum_{j = 1}^m W_j \cdot B_{n+j}\right)}_{\mathcal{B}} \\ &- \underbrace{\left(C_0+\sum_{j = 1}^n I_j \cdot C_j+\sum_{j = 1}^m W_j \cdot C_{n+j}\right)}_{\mathcal{C}}. \end{aligned}\]

Denote the degree of the target polynomial $T(x):=\prod_{l=1}^t\left(x-m_l\right)$ as $t$. By the definitions of the QAP polynomials $A, B, C, T$, if the witness $\langle W_1, \ldots, W_m \rangle$ is valid for an instance $\langle I_1, \ldots, I_n \rangle$, the polynomial $P_{(I ; W)}$ has roots $(m_1, m_2, \ldots, m_{t})$ (which exactly correspond to the $t$ equations in R1CS) and is hence divisible by $T(x)$. This implies:

\[P_{(I ; W)}(x) = H(x) \cdot T(x) \tag{F}\]
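The divisibility statement (F) is easy to experiment with using polynomial long division over a small field. In the sketch below the field $\mathbb{F}_{13}$, the roots $m_1 = 1, m_2 = 2$ and the quotient $H(x) = 3 + 4x$ are arbitrary toy choices, and the function names are invented for this example; coefficients are stored lowest degree first.

```python
R = 13  # toy scalar field F_13 (assumed for illustration)

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % R
    return out

def poly_divmod(num, den):
    """Long division in F_R[x]; returns (quotient, remainder)."""
    num = [c % R for c in num]
    quot = [0] * max(len(num) - len(den) + 1, 1)
    lead_inv = pow(den[-1], -1, R)
    for shift in range(len(num) - len(den), -1, -1):
        q = num[shift + len(den) - 1] * lead_inv % R
        quot[shift] = q
        for j, d in enumerate(den):
            num[shift + j] = (num[shift + j] - q * d) % R
    return quot, num[:len(den) - 1]

def target_poly(roots):
    """T(x) = prod (x - m_l)."""
    t = [1]
    for m in roots:
        t = poly_mul(t, [(-m) % R, 1])
    return t

T = target_poly([1, 2])          # (x - 1)(x - 2)
H = [3, 4]                       # H(x) = 3 + 4x
P_poly = poly_mul(H, T)          # a "valid" P = H * T
print(poly_divmod(P_poly, T))    # quotient H, zero remainder

bad = P_poly[:]
bad[0] = (bad[0] + 1) % R        # perturb P: no longer vanishes at the roots
print(poly_divmod(bad, T)[1])    # non-zero remainder
```

A valid witness corresponds to the first case; any assignment violating a constraint leaves a non-zero remainder, so no polynomial $H$ exists.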

The Core of Knowledge Proof

The prover has the knowledge of the polynomial factorization $P_{(I ; W)}(x) = H(x) \cdot T(x)$. The Groth16 protocol does not rely on Fiat-Shamir transform. Instead, all potential ‘randomness’ required from the verifier is pre-generated during the trusted setup phase and remains concealed within the Common Reference String (CRS). Regarding the secret challenge point $\tau$ embedded in the CRS, the prover is merely required to demonstrate the capability to compute the following identity:

\[\begin{cases} P_{(I ; W)}(\tau) = H(\tau) \cdot T(\tau) \\ P_{(I ; W)}(\tau) = \mathcal{A}(\tau) \cdot \mathcal{B}(\tau) - \mathcal{C}(\tau) \end{cases} \implies \mathcal{A}(\tau) \cdot \mathcal{B}(\tau) - \mathcal{C}(\tau) = H(\tau) \cdot T(\tau)\]

This ability implies that the prover must know the polynomial $H(x)$, which effectively signifies the possession of a valid witness. Please note that the preceding explanation focuses on the underlying principles; in practice, the Groth16 protocol incorporates random blinding factors (masks) to ensure zero-knowledge:

\[\left( \mathcal{A}(\tau) + \alpha + r \cdot \delta \right) \cdot \left( \mathcal{B}(\tau) + \beta + s \cdot \delta \right) = H(\tau) \cdot T(\tau) + \mathcal{C}(\tau) - \underbrace{\cdots\cdots\cdots}_{\text{Messy Stuff}} \tag{Groth}\]

You can circle back to the $(Groth)$ equation after checking out the verifier phase, or keep it in mind as you follow the completeness proof. It’s the best way to grasp what’s actually happening under the hood of the Groth16 protocol. Doing this helps you see the bigger picture, rather than just grinding through a bunch of dry math only to realize at the end, ‘Oh, I guess the verifier’s pairing equation works.’

Using the pre-computed CRS, the prover can evaluate $P_{(I ; W)}(\tau)/ \delta$ in the exponent. We first note that all polynomials $A_i, B_i, C_i$ have degree at most $t - 1$, since they are computed by Lagrange interpolation on $t$ points with x-coordinates $(m_1, \ldots, m_t)$. Hence $P_{(I ; W)}$ has degree at most $2(t-1)$, and the degree of $H(x)$ ($h := \deg H \le t - 2 = \deg T - 2$) is strictly smaller than that of $T(x)$. Denote $H(x)$ as:

\[H(x) = H_0 + H_1 x + \cdots + H_h x^{h}.\]

Then:

\[\begin{aligned} g_1^\frac{P_{(I ; W)}(\tau)}{\delta} &= g_1^{\frac{H(\tau) \cdot T(\tau)}{\delta}} \\ &= (g_1^{\frac{\tau^0 \cdot T(\tau)}{\delta}})^{H_0} \cdot (g_1^{\frac{\tau^1 \cdot T(\tau)}{\delta}})^{H_1} \cdots (g_1^{\frac{\tau^h \cdot T(\tau)}{\delta}})^{H_h} \end{aligned}\]

The prover samples two random field elements $r, s \in \mathbb{F}_{r}$ and computes the following curve points:

\[\begin{aligned} & g_1^W=\left(g_1^{\frac{\beta \cdot A_{1+n}(\tau)+\alpha \cdot B_{1+n}(\tau)+C_{1+n}(\tau)}{\delta}}\right)^{W_1} \cdot \left(g_1^{\frac{\beta \cdot A_{2+n}(\tau)+\alpha \cdot B_{2+n}(\tau)+C_{2+n}(\tau)}{\delta}}\right)^{W_2} \cdots\left(g_1^{\frac{\beta \cdot A_{m+n}(\tau)+\alpha \cdot B_{m+n}(\tau)+C_{m+n}(\tau)}{\delta}}\right)^{W_m} \\ & g_1^A=g_1^\alpha \cdot g_1^{A_0(\tau)} \cdot\left(g_1^{A_1(\tau)}\right)^{I_1} \cdots\left(g_1^{A_n(\tau)}\right)^{I_n} \cdot\left(g_1^{A_{n+1}(\tau)}\right)^{W_1} \cdots\left(g_1^{A_{n+m}(\tau)}\right)^{W_m} \cdot\left(g_1^\delta\right)^r \\ & g_1^B=g_1^\beta \cdot g_1^{B_0(\tau)} \cdot\left(g_1^{B_1(\tau)}\right)^{I_1} \cdots\left(g_1^{B_n(\tau)}\right)^{I_n} \cdot\left(g_1^{B_{n+1}(\tau)}\right)^{W_1} \cdots\left(g_1^{B_{n+m}(\tau)}\right)^{W_m} \cdot\left(g_1^\delta\right)^s \\ & g_2^B=g_2^\beta \cdot g_2^{B_0(\tau)} \cdot\left(g_2^{B_1(\tau)}\right)^{I_1} \cdots\left(g_2^{B_n(\tau)}\right)^{I_n} \cdot\left(g_2^{B_{n+1}(\tau)}\right)^{W_1} \cdots\left(g_2^{B_{n+m}(\tau)}\right)^{W_m} \cdot\left(g_2^\delta\right)^s \\ & g_1^C=g_1^W \cdot g_1^{\frac{H(\tau) \cdot T(\tau)}{\delta}} \cdot\left(g_1^A\right)^s \cdot\left(g_1^B\right)^r \cdot\left(g_1^\delta\right)^{-r \cdot s} \end{aligned}\]

Note that all $A_i, B_i, C_i$ are polynomials of degree at most $t - 1$ and can be evaluated at $\tau$ using the powers of tau. In practice, $g_1^{A_i(\tau)}, g_1^{B_i(\tau)}, g_2^{B_i(\tau)}$ can be pre-computed. In other words, these points only need to be computed once, and can be made public and reused for multiple proof generations, as they are the same for all instances and witnesses.

Therefore, we have:

\[\begin{cases} g_1^{A} = g_1^{\mathcal{A}(\tau) + \alpha + r \cdot \delta} \\ g_1^{B} = g_1^{\mathcal{B}(\tau) + \beta + s \cdot \delta} \\ g_2^{B} = g_2^{\mathcal{B}(\tau) + \beta + s \cdot \delta} \\ \end{cases}\]

The final proof consists of only three elements:

\[\pi = (g_1^{A}, g_1^{C}, g_2^{B})\]

In the following we write $\pi_A := g_1^{A}$, $\pi_C := g_1^{C}$ and $\pi_B := g_2^{B}$ for the three proof elements. The correctness of the proof and the details of verification are addressed in the next section.

The Verification Phase

The verifier has the knowledge of $A, B, C, T$, the public instance $I_1, \ldots, I_{n}$ and the common reference string $CRS_{\mathbb{G}_1}, CRS_{\mathbb{G}_2}$. The verifier computes:

\[g_1^I=\left(g_1^{\frac{\beta \cdot A_{0}(\tau)+\alpha \cdot B_{0}(\tau)+C_{0}(\tau)}{\gamma}}\right) \cdot \left(g_1^{\frac{\beta \cdot A_{1}(\tau)+\alpha \cdot B_{1}(\tau)+C_{1}(\tau)}{\gamma}}\right)^{I_1} \cdots \left(g_1^{\frac{\beta \cdot A_{n}(\tau)+\alpha \cdot B_{n}(\tau)+C_{n}(\tau)}{\gamma}}\right)^{I_n}.\]

The Core of Verification

The verifier is able to verify the zk-SNARK proof $\pi = (g_1^A,g_1^C, g_2^B)$ by checking:

\[e(g_1^A, g_2^B) = e(g_1^{\alpha}, g_2^{\beta}) \cdot e(g_1^{I}, g_2^{\gamma}) \cdot e(g_1^C, g_2^{\delta}). \tag{G}\]

By pairing, equation $\text{(G)}$ is equivalent to the following equation in exponent:

\[\begin{aligned} A \cdot B = \alpha \beta + \gamma I + \delta C \end{aligned} \tag{V}\]
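As a sanity check of (V), the exponents can be expanded by hand. The shorthand below is introduced only for this sketch: $\mathcal{A}_I = A_0 + \sum_{j=1}^n I_j A_j$ and $\mathcal{A}_W = \sum_{j=1}^m W_j A_{n+j}$ (likewise for $\mathcal{B}, \mathcal{C}$), so that $\mathcal{A} = \mathcal{A}_I + \mathcal{A}_W$; everything is evaluated at $\tau$, and $r, s$ denote the prover's two blinding scalars ($s$ being the one multiplying $\delta$ in $g_2^B$). From the definitions of $g_1^I$, $g_1^W$ and $g_1^C$:

\[\begin{aligned} \gamma I &= \beta \mathcal{A}_I + \alpha \mathcal{B}_I + \mathcal{C}_I, \qquad \delta W = \beta \mathcal{A}_W + \alpha \mathcal{B}_W + \mathcal{C}_W, \\ \delta C &= \delta W + H T + s \delta A + r \delta B - r s \delta^2, \qquad A = \mathcal{A} + \alpha + r\delta, \quad B = \mathcal{B} + \beta + s\delta. \end{aligned}\]

Expanding both sides:

\[\begin{aligned} A \cdot B &= \mathcal{A}\mathcal{B} + \beta\mathcal{A} + \alpha\mathcal{B} + \alpha\beta + s\delta\mathcal{A} + r\delta\mathcal{B} + \alpha s \delta + \beta r \delta + r s \delta^2, \\ \alpha\beta + \gamma I + \delta C &= \alpha\beta + \beta\mathcal{A} + \alpha\mathcal{B} + \mathcal{C} + HT + s\delta\mathcal{A} + r\delta\mathcal{B} + \alpha s \delta + \beta r \delta + r s \delta^2. \end{aligned}\]

Subtracting, all blinding and trapdoor terms cancel and we are left with

\[A \cdot B - \left(\alpha\beta + \gamma I + \delta C\right) = \mathcal{A}(\tau)\,\mathcal{B}(\tau) - \mathcal{C}(\tau) - H(\tau)\, T(\tau),\]

so (V) holds exactly when the QAP identity holds at the secret point $\tau$.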

2025 Year-End Review

2025-12-31T00:00:00+08:00

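2025 in one line: down, down, down, down, down, up, up. There were valleys along the way, but overall I was still heading toward the mountain pass. As the line from Prometheus goes, life is a wilderness, not a track. In 2025 I followed the track step by step for a year; my hope for 2026 is to explore the wilderness.

Since starting my PhD I have not seriously written about life and travel for a long time. So one day at the end of the year, after leafing through the collection of poems I wrote in high school and feeling rather ashamed, I decided to put this year into words. Perhaps it is also because, swept along by the AI wave, I want to keep some of the slow rhythm of life, as proof that I am still a warm-blooded human being rather than an agent running tirelessly along a prescribed workflow.

On Travel and Life

Live well, and let encounters come slowly.

Travel and Concerts

As for travel in 2025: at the start of the year, in February, I had planned to go to Japan for SECCON. After the visa was issued, the trip fell through for the classic assortment of reasons why one is not allowed to go abroad for competitions, so after my US visa for DEFCON went unused in 2024, a Japanese visa went unused as well. Luckily, I managed to grab tickets for Ed Sheeran's concert in Hangzhou in February and went to see the show with friends, which made up for some of the disappointment.

Hangzhou · Ed Sheeran concert

It was my first concert in the live-looping format: Ed Sheeran held the whole stage with just himself and a guitar. There was no stage design or scenery, which did look a little plain, so some people complained that it lacked effort. Still, the atmosphere was great, his vocals were on point, and he sang and played for more than two hours almost non-stop, which takes real stamina. As a fan of ten years, I thought it was well worth it.

Live Looping

After Hangzhou I did not travel far again, but I did go to quite a few concerts, listed below.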

Feature Test Page

2025-12-25T00:00:00+08:00

This is the English content.

This is a test for bilingual blog support. You should see language toggle buttons below the title.

If you see this post in the main list, you should see the EN/CN badge.

This document demonstrates the various custom content blocks supported by the blog theme and how to use them. Every style includes source code examples and the actual rendered result.

1. Basic Blocks

Supports five basic state colors: Standard, Success, Info, Warning, Error.

HTML Syntax

Use

(Note that markdown="1" is required for processing internal MD content).

Source Example:

 class="neutral-block" markdown="1">
 class="block-title">Neutral Block

This is a default style block.




 class="success-block" markdown="1">
 class="block-title">Success Block


Operation successful.



 class="info-block" markdown="1">
 class="block-title">Info Block
General information.



 class="warning-block" markdown="1">
 class="block-title">Warning Block
Warning message.



 class="error-block" markdown="1">
 class="block-title">Error Block
Error or dangerous operation.

Rendered Result:

Neutral Block

This is a default style block.

Success Block

Operation successful.

Info Block

General information.

Warning Block

Warning message.

Error Block

Error or dangerous operation.

Liquid Tag Syntax

Use {% plain type title="..." %}.

Source Example:

{% plain success title="Liquid Success" %}
This is a block generated using Liquid tags.
{% endplain %}

{% plain error title="Liquid Error" %}
This is an error block generated using Liquid tags.
{% endplain %}

Rendered Result:

This is a block generated using Liquid tags.

This is an error block generated using Liquid tags.

2. Academic Blocks

Supports common academic environment definitions: proof, theorem, lemma, proposition, definition, example, remark, note, solution.

Basic Usage (Default Block Title)

HTML Source:

 class="theorem" markdown="1">
This is a theorem.


 class="proof" markdown="1">
This is a proof.

Liquid Source:

{% theorem %}
This is a theorem (Liquid).
{% endtheorem %}

Rendered Result:

This is a theorem.

This is a proof.

Inline Title Style

Add inline class or parameter.

Source:

 class="proof inline" markdown="1">
Title and content are on the same line.



{% note inline %}
Note: This is a note with an inline title.
{% endnote %}

Rendered Result:

Title and content are on the same line.

Note: This is a note with an inline title.

Custom Title

Use data-title attribute or title parameter.

Source:

 class="lemma" data-title="Zorn's Lemma" markdown="1">
Every non-empty partially ordered set has a maximal element...



{% proposition title="My Proposition" %}
This is a proposition with a custom title.
{% endproposition %}

Rendered Result:

Every non-empty partially ordered set has a maximal element…

This is a proposition with a custom title.

3. Collapsible Blocks

HTML Syntax (`details` & `summary`)

Source:

 class="info" markdown="1">
 data-title="Click to expand details">
Here is the hidden detailed content.

Rendered Result:

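Feature Test Page

2025-12-25T00:00:00+08:00

This is the Chinese content.

This is a test of the bilingual blog. You should see a language switch button below the title.

If you see this post in the home page list, it should show the EN/CN badge.

This document showcases the custom content blocks supported by the blog theme and how to use them. Each style comes with a source example and its rendered result.

1. Basic Blocks

Supports the following basic status colors: Standard, Success, Info, Warning, Error.

HTML Syntax

Use

(note that markdown="1" is required for processing the inner Markdown content).

Source Example: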

 class="neutral-block" markdown="1">
 class="block-title">Neutral Block

This is a block with the default style.




 class="success-block" markdown="1">
 class="block-title">Success Block


Success message for a completed operation.



 class="info-block" markdown="1">
 class="block-title">Info Block
General information message.



 class="warning-block" markdown="1">
 class="block-title">Warning Block
Warning message.



 class="error-block" markdown="1">
 class="block-title">Error Block
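Error or dangerous-operation message.

Rendered Result:

Neutral Block

This is a block with the default style.

Success Block

Success message for a completed operation.

Info Block

General information message.

Warning Block

Warning message.

Error Block

Error or dangerous-operation message.

Liquid Tag Syntax

Use {% plain type title="..." %}.

Source Example: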

{% plain success title="Liquid Success" %}
This is a block generated using Liquid tags.
{% endplain %}

{% plain error title="Liquid Error" %}
This is an error block generated using Liquid tags.
{% endplain %}

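Rendered Result:

This is a block generated using Liquid tags.

This is an error block generated using Liquid tags.

2. Academic Blocks

Supports common academic environment definitions: proof, theorem, lemma, proposition, definition, example, remark, note, solution.

Basic Usage (Default Block Title)

HTML Source: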

 class="theorem" markdown="1">
This is a theorem.


 class="proof" markdown="1">
This is a proof.

Liquid Source:

{% theorem %}
This is a theorem (Liquid).
{% endtheorem %}

Rendered Result:

This is a theorem.

This is a proof.

Inline Title Style

Add the inline class or parameter.

Source:

 class="proof inline" markdown="1">
Title and content are on the same line.



{% note inline %}
Note: This is a note with an inline title.
{% endnote %}

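Rendered Result:

Title and content are on the same line.

Note: This is a note with an inline title.

Custom Title

Use the data-title attribute or the title parameter.

Source: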

 class="lemma" data-title="Zorn's Lemma" markdown="1">
Every non-empty partially ordered set has a maximal element...



{% proposition title="My Proposition" %}
This is a proposition with a custom title.
{% endproposition %}

Rendered Result:

Every non-empty partially ordered set has a maximal element…

This is a proposition with a custom title.

3. Collapsible Blocks

HTML Syntax (`details` & `summary`)

Source:

 class="info" markdown="1">
 data-title="Click to expand details">
Here is the hidden detailed content.

Rendered Result:

BLACKHAT MEA 2025 Whack-A-Scratch

2025-09-17T00:00:00+08:00

tl;dr: I tried the intended solution after the game. It is an impressive challenge about linear algebra and the Legendre symbol.

Challenge Setup

Let $p = 2^{21} - 9$ and $n = 6$. There are three main outer matrices of size $n \times n$:

\[\begin{cases} A \in_{R} \mathbf{GL}(\mathbb{F}_p, n) \\ B \in_{R} \mathbf{GL}(\mathbb{F}_p, n) \\ C = A \cdot S \cdot B \end{cases}\]

The inner matrix $S$ is structured as:

\[S = S_0 \cdot S_1 = \begin{bmatrix} s_1 & X_{1,2} & \cdots & X_{1,n} \\ & s_{2} & \cdots & X_{2,n} \\ & & \ddots & \vdots \\ & & & s_n \end{bmatrix}^{q_1} \cdot \begin{bmatrix} s_{n+1} & & & \\ Y_{2,1} & s_{n+2} & & \\ \vdots & \cdots & \ddots & \\ Y_{n, 1} & \cdots & Y_{n, n-1} & s_{2n} \end{bmatrix}^{q_2}.\]

We are going to recover the secret diagonal values of $S_0, S_1$: $(s_1, s_2, \cdots, s_{2n})$. When $S$ is resampled, only the exponents $q_1$ and $q_2$ are resampled. There are two oracles:

Scratch: sample a random vector $k \in \mathbb{F}_p^{n}$ and leak three vectors:
\[\begin{cases} r = A^{-1} \cdot k \\ s = k^T \cdot B^{-1} \\ t = C \cdot A \cdot k \text{ or } C \cdot B \cdot k \end{cases} \tag{SO}\]
A $2n$-bit mask $j$ with Hamming weight $n$ determines which form of $t$ is leaked. Specifically, if the $i$-th bit of $j$ satisfies $j_i = 1$, the $i$-th call returns $t_i = C \cdot A \cdot k_i$; otherwise, it returns $t_i = C \cdot B \cdot k_i$. After $2n$ Scratch calls, $A, B, S$ are resampled.
Whack: input $i, j, k$, and the server increases $S_k[i][j]$ by one. This lets us increment a single entry of the static matrices $S_0, S_1$ by one.

Recover $j$ and matrix product

We define one round as the $2n = 12$ calls to the Scratch oracle during which $A, B, C$, and $S$ remain fixed. There are only $\binom{2n}{n} = 924$ possible values of $j$. Assuming that we have guessed the correct $j$, let $R_0, T_0, K_0$ be the matrices whose columns are the $r_i, t_i, k_i$ with $j_i = 0$, and let $S_0$ be the matrix whose rows are the corresponding $s_i$. Define $R_1, S_1, T_1, K_1$ in the same way from the calls with $j_i = 1$. With overwhelming probability, all of these matrices lie in $\mathbf{GL}(\mathbb{F}_p, n)$. In this section, $S_0$ and $S_1$ refer to these leaked matrices, not to the triangular factors of $S$.

By equation (SO), we can learn that:

\[\begin{cases} A \cdot R_i = K_i, & i = 0, 1 \\ S_i \cdot B = K_i^T, & i = 0, 1 \\ T_1 = C \cdot A \cdot K_1 \\ T_0 = C \cdot B \cdot K_0 \end{cases}\]

Thus, the following four matrices can be recovered:

\[\begin{cases} M_1 := C \cdot A \cdot A = T_1 \cdot R_{1}^{-1} \\ M_2 := C \cdot B \cdot A = T_0 \cdot R_{0}^{-1} \\ M_3 := C \cdot A \cdot B^T = T_1 \cdot S_{1}^{-T} \\ M_4 := C \cdot B \cdot B^T = T_0 \cdot S_{0}^{-T} \end{cases} \tag{M}\]

Since $\det(M_2) = \det(M_3) = \det(A) \det(B) \det(C)$, the correct $j$ must satisfy $\det(R_0) \det(T_1) = \det(T_0) \det(S_1)$. We can use this equation to single out the correct value of $j$, and with it the four matrices defined in equation (M).
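The determinant test can be tried on simulated data. The following is a minimal sketch, not the challenge's code: the helper names (`mat_mul`, `det_mod`, `inv_mod`, `simulate_round`, `recover_masks`) are hypothetical, $S$ is replaced by an arbitrary random matrix because the test does not depend on its structure, and the random matrices are assumed to be invertible (which holds with overwhelming probability).

```python
import random
from itertools import combinations

P, N = 2**21 - 9, 6  # field prime and dimension from the challenge

def mat_mul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) % P for col in zip(*Y)] for row in X]

def det_mod(M):
    """Determinant over F_P by Gaussian elimination."""
    M, det = [row[:] for row in M], 1
    for c in range(len(M)):
        pivot = next((r for r in range(c, len(M)) if M[r][c]), None)
        if pivot is None:
            return 0
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det
        det = det * M[c][c] % P
        inv = pow(M[c][c], -1, P)  # modular inverse, Python 3.8+
        for r in range(c + 1, len(M)):
            f = M[r][c] * inv % P
            M[r] = [(x - f * y) % P for x, y in zip(M[r], M[c])]
    return det % P

def inv_mod(M):
    """Inverse over F_P by Gauss-Jordan elimination (assumes M is invertible)."""
    n = len(M)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c])
        aug[c], aug[pivot] = aug[pivot], aug[c]
        inv = pow(aug[c][c], -1, P)
        aug[c] = [x * inv % P for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [(x - f * y) % P for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def simulate_round(rng):
    """Return the hidden set of indices with j_i = 1 and the 2N leaks (r, s, t)."""
    rand_mat = lambda: [[rng.randrange(P) for _ in range(N)] for _ in range(N)]
    A, B, S = rand_mat(), rand_mat(), rand_mat()  # S: arbitrary stand-in
    A_inv, B_inv = inv_mod(A), inv_mod(B)
    C = mat_mul(mat_mul(A, S), B)
    CA, CB = mat_mul(C, A), mat_mul(C, B)
    ones = set(rng.sample(range(2 * N), N))
    leaks = []
    for i in range(2 * N):
        k = [rng.randrange(P) for _ in range(N)]
        col = [[x] for x in k]                       # k as a column vector
        r = [x[0] for x in mat_mul(A_inv, col)]      # r = A^{-1} k
        s = mat_mul([k], B_inv)[0]                   # s = k^T B^{-1}
        t = [x[0] for x in mat_mul(CA if i in ones else CB, col)]
        leaks.append((r, s, t))
    return ones, leaks

def recover_masks(leaks):
    """Return every weight-N index set that passes det(R0) det(T1) == det(T0) det(S1)."""
    matches = []
    for ones in combinations(range(2 * N), N):
        zeros = [i for i in range(2 * N) if i not in ones]
        # det is invariant under transposition, so rows vs. columns does not matter here.
        R0 = [leaks[i][0] for i in zeros]
        T0 = [leaks[i][2] for i in zeros]
        T1 = [leaks[i][2] for i in ones]
        S1 = [leaks[i][1] for i in ones]
        if det_mod(R0) * det_mod(T1) % P == det_mod(T0) * det_mod(S1) % P:
            matches.append(set(ones))
    return matches
```

A wrong guess passes the test only by chance (roughly with probability $1/p$ per candidate), so among the $924$ candidates the list returned by `recover_masks` is expected to contain the true index set and, usually, nothing else.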

Remarks

A natural question is whether $A, B$, and $C$ can be fully recovered from the above matrices. This problem appears to be related to solving multivariate quadratic equations, which is NP-hard in general. It’s easy to see that:

\[A^2 \cdot M_1^{-1} \cdot M_4 = X^T \cdot A^T \cdot A \cdot X,\]

where $X := R_0 \cdot S_0 ^ {-T} = A^{-1} \cdot B^T$. Both sides equal $B \cdot B^T$.

Solving the above matrix equation for $A$ amounts to solving a multivariate quadratic system with $36$ variables and $36$ equations, which seems infeasible.

Recover diagonal values $s_i$

Expanding the equation $\det(M_2) = \det(T_0) / \det(R_0)$, we have:

\[\begin{aligned} \det(T_0) / \det(R_0) &= \det(A) \det(S) \det (B) \det(B) \det(A) \\ &= \det(A)^2 \det(B)^2 \left(\prod_{i=1}^{n} {s_i}\right)^{q_1} \left(\prod_{i=n + 1}^{2n} {s_i}\right)^{q_2} \end{aligned}\]

The most crucial part of this problem lies in taking the Legendre symbol modulo $p$, denoted $\textsf{leg}(\cdot)$, of both sides. This automatically eliminates all squared terms and reveals information about the inner matrix $S$:

\[\textsf{leg}\left(\frac{\det(T_0)}{\det(R_0)} \right) = \textsf{leg}\left(\left(\prod_{i=1}^{n} {s_i}\right)^{q_1} \left(\prod_{i=n + 1}^{2n} {s_i}\right)^{q_2} \right)\]

We ignore the case where $s_i = 0$ for some $i$, since its probability is negligible. Denote $\ell_1 = \textsf{leg}\left(\prod_{i=1}^{n} {s_i}\right)$ and $\ell_2 = \textsf{leg}\left(\prod_{i=n+1}^{2n} {s_i}\right)$. Denote the round constant (the left-hand side) for round $i$ as $d_i$. Define a good state as $\ell_1 = 1$ and $\ell_2 = 1$. In a good state the right-hand side is $1$ for every choice of $q_1, q_2$, so a good state can be detected by checking that the round constant $d_i$ is always $1$.
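As a quick sanity check of why the squared determinants drop out, here is a small sketch using Euler's criterion. The function name `legendre` and all sampled values are illustrative stand-ins (random scalars in place of $\det(A)$, $\det(B)$, the two diagonal products, and the exponents), not data from the challenge.

```python
import random

P = 2**21 - 9  # field prime from the challenge

def legendre(a, p=P):
    """Legendre symbol of a modulo an odd prime p, via Euler's criterion."""
    a %= p
    if a == 0:
        return 0
    return 1 if pow(a, (p - 1) // 2, p) == 1 else -1

rng = random.Random(0)
det_a, det_b = rng.randrange(1, P), rng.randrange(1, P)  # stand-ins for det(A), det(B)
u, v = rng.randrange(1, P), rng.randrange(1, P)          # stand-ins for the two diagonal products
q1, q2 = rng.randrange(1, P), rng.randrange(1, P)        # hypothetical exponents

# Round constant before taking the Legendre symbol.
lhs = det_a**2 * det_b**2 * pow(u, q1, P) * pow(v, q2, P) % P

# Squares have symbol 1, and leg(u^q) = leg(u)^q depends only on the parity of q.
assert legendre(lhs) == legendre(u) ** (q1 % 2) * legendre(v) ** (q2 % 2)
```

In particular, when both `legendre(u)` and `legendre(v)` are $1$ the result is $1$ for any exponents, which is the good-state condition described above.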

Without loss of generality, we assume the initial state is a good state, denoted $\mathcal{S}_0 = 1$ (a bad state is denoted $\mathcal{S}_0 = -1$). Let $m$ be the number of rounds observed per step. We can recover a secret diagonal value $s$ as follows:

Step 1: call the Whack oracle once on $s$, then call the Scratch oracle $12m$ times. This generates $m$ round constants: $d_1, d_2, \ldots, d_m$. If any $d_i$ is $-1$, the current state is bad, i.e., $\mathcal{S}_1 = -1$. Otherwise (all $d_i$ are $1$), the current state is good, i.e., $\mathcal{S}_1 = 1$. If $\mathcal{S}_1 \ne \mathcal{S}_0$, it must be that $\textsf{leg}(s + 1) = -\textsf{leg}(s)$. Otherwise $\textsf{leg}(s + 1) = \textsf{leg}(s)$.
……
Step $i+1$: call the Whack oracle once on $s$, then call the Scratch oracle $12m$ times. Similarly, determine the current state $\mathcal{S}_{i+1}$. If $\mathcal{S}_{i+1} \ne \mathcal{S}_{i}$, it must be that $\textsf{leg}(s + i + 1) = -\textsf{leg}(s + i)$. Otherwise $\textsf{leg}(s + i + 1) = \textsf{leg}(s + i)$.

This leaks a sequence of Legendre symbols $\left( \textsf{leg}(s), \textsf{leg}(s+1), \textsf{leg}(s+2), \textsf{leg}(s +3), \ldots \right)$, known up to the unknown sign $\textsf{leg}(s)$, which can be used to determine the original value of $s$. Specifically, we choose the sequence length $M$ slightly greater than $21$, the bit length of $p$. Since $p = 2^{21} - 9$ is small, we can precompute all Legendre symbols for $x \in [0, p-1]$ in a table. By guessing the value of $\textsf{leg}(s)$, we obtain two candidate Legendre sequences, and only one of them matches the table at the correct starting point $s$.
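The table lookup can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names `residue_table` and `candidates_from_flips` are hypothetical, the observed data is modeled as a list of booleans saying whether consecutive symbols differ, and the negligible case where some $s + i \equiv 0 \pmod p$ is ignored.

```python
P = 2**21 - 9  # field prime from the challenge

def residue_table(p):
    """table[x] = 1 if x is a nonzero square mod p, 2 if it is a non-square, 0 for x = 0."""
    table = bytearray([2]) * p
    table[0] = 0
    for x in range(1, (p + 1) // 2):
        table[x * x % p] = 1
    return table

def candidates_from_flips(flips, table):
    """flips[i] is True when leg(s+i+1) != leg(s+i). Return every matching start s."""
    p = len(table)
    cyclic = bytes(table) + bytes(table[:len(flips)])  # allow windows that wrap past p-1
    found = []
    for first in (1, 2):                               # guess leg(s) = +1 or -1
        pattern = bytearray([first])
        for flipped in flips:
            pattern.append(3 - pattern[-1] if flipped else pattern[-1])
        pattern = bytes(pattern)
        start = cyclic.find(pattern)
        while start != -1:
            found.append(start)
            start = cyclic.find(pattern, start + 1)
    return found

table = residue_table(P)
secret = 1234567  # hypothetical diagonal value, for demonstration only
flips = [table[secret + i] != table[secret + i + 1] for i in range(40)]
print(candidates_from_flips(flips, table))
```

With a sequence length comfortably above $21$, the returned list is expected to contain very few entries, one of which is the true starting point; a longer sequence can be used if more than one candidate survives.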

| Algorithm | Time Complexity | Space Complexity | Parallelism |
| --- | --- | --- | --- |
| Birthday-paradox collision search | \(\mathcal{O}(2^{n/2})\) | \(\mathcal{O}(2^{n/2})\) | Parallelizable, but memory-intensive |
| Pollard’s rho | \(\mathcal{O}(2^{n/2})\) | \(\mathcal{O}(1)\) | No linear speed-up in parallel |
| Pollard’s lambda | \(\mathcal{O}(2^{n/2})\) | \(\mathcal{O}(k)\) (trade-off) | Parallelizable, often close to linear speed-up |
