GPU acceleration
The Number Theoretic Transform (NTT) is the finite-field counterpart of the Fast Fourier Transform (FFT) and is a key computational bottleneck in Ola's zero-knowledge proof system. In a zero-knowledge proof system, the prover demonstrates possession of secret information by proving that they know the input of a given polynomial and that this input evaluates to a claimed output. This task is trivial for small polynomials, but it becomes very demanding when the polynomial is very large and defined over a prime field.
A necessary step in a zero-knowledge proof system is therefore to evaluate a polynomial and show that the result equals the claimed output. These polynomials are derived from a purpose-built arithmetic circuit that takes an input vector and produces the polynomial values at the required points. The problem is that direct evaluation is computationally expensive, requiring on the order of N^2 operations, where N is the degree of the polynomial. The Cooley-Tukey butterfly transform solves this problem by requiring only N*log(N) operations when evaluating very large polynomials. However, Cooley-Tukey changes the order of the resulting FFT sequence (as shown in the figure below), so the outputs of the sub-FFTs must be put back in order by a bit-reversal pass.
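To make the reordering concrete, the host-side sketch below permutes a coefficient array into bit-reversed order, which is the extra pass a classic in-place Cooley-Tukey transform requires; the function name and element type are illustrative, not taken from Ola's code.

```cuda
#include <cstdint>
#include <cstddef>
#include <utility>

// Permute an array of length n (a power of two) into bit-reversed order,
// the reordering pass needed around an in-place Cooley-Tukey NTT/FFT.
void bit_reverse_permute(uint64_t* a, size_t n) {
    for (size_t i = 1, j = 0; i < n; ++i) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;   // advance j as a reversed binary counter
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
}
```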
Even with the Cooley-Tukey algorithm, the processing time of large NTTs still limits the efficiency of zero-knowledge proof generation, because the butterfly-style fast-processing model used by the NTT is not friendly to hardware design. It requires frequently reading data from and writing data back to external memory, and this data is accessed in a random-access pattern, which increases data-transfer latency. Random memory access and data shuffling thus become a major challenge that limits the parallelization of the NTT. A Stockham-based NTT automatically reorders the data during the transform itself, avoiding the bit-reversal operation, as shown in the figure below.
The biggest difference between the Stockham algorithm and Cooley-Tukey is that in Stockham the index of the input data handled by the thread of each butterfly stage is fixed, so no additional bit-reversal pass is needed. Ola uses a Stockham-based NTT over the Goldilocks field to perform polynomial evaluation on a 64-bit prime field (or its 128-bit extension), which not only greatly reduces the number of random memory accesses but also reduces storage usage and improves the efficiency of zero-knowledge proof generation.
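The index pattern can be seen in the host-side reference sketch below: a radix-2 Stockham NTT ping-pongs between two buffers and delivers its final output in natural order, so no bit-reversal pass is needed. The modular helpers and buffer handling are a minimal sketch over a generic 64-bit prime, not Ola's GPU implementation.

```cuda
#include <cstdint>
#include <cstddef>
#include <algorithm>

// Modular helpers over a 64-bit prime p (written for clarity, not speed).
static uint64_t add_mod(uint64_t a, uint64_t b, uint64_t p) {
    uint64_t s = a + b;
    if (s < a || s >= p) s -= p;           // also handles 64-bit wraparound
    return s;
}
static uint64_t sub_mod(uint64_t a, uint64_t b, uint64_t p) {
    uint64_t d = a - b;
    if (a < b) d += p;
    return d;
}
static uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)((unsigned __int128)a * b % p);
}
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    for (; e; e >>= 1, a = mul_mod(a, a, p))
        if (e & 1) r = mul_mod(r, a, p);
    return r;
}

// Radix-2 Stockham NTT: x holds the N inputs, scratch is a second buffer of
// the same length, omega is a primitive N-th root of unity mod p, and N is a
// power of two. Each stage writes its outputs already in the right order, so
// the final result lands in x in natural order with no bit-reversal.
void stockham_ntt(uint64_t* x, uint64_t* scratch, size_t N, uint64_t omega, uint64_t p) {
    uint64_t* src = x;
    uint64_t* dst = scratch;
    size_t s = 1;                                      // stride of contiguous runs
    for (size_t n = N; n > 1; n >>= 1, s <<= 1) {
        size_t m = n / 2;
        uint64_t w_stage = pow_mod(omega, N / n, p);   // primitive n-th root
        uint64_t w = 1;
        for (size_t k = 0; k < m; ++k) {               // butterfly index: fixed per thread on a GPU
            for (size_t q = 0; q < s; ++q) {           // elements sharing this twiddle
                uint64_t a = src[q + s * k];
                uint64_t b = src[q + s * (k + m)];
                dst[q + s * (2 * k)]     = add_mod(a, b, p);
                dst[q + s * (2 * k + 1)] = mul_mod(sub_mod(a, b, p), w, p);
            }
            w = mul_mod(w, w_stage, p);
        }
        std::swap(src, dst);
    }
    if (src != x) std::copy(src, src + N, x);          // ensure x holds the result
}
```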
However, as the Stockham recursion gets deeper, the stride of the butterfly stages grows, producing relatively large jumps in memory access, and it is difficult for the hardware to satisfy such large memory requirements. Therefore, Ola adopts the idea of splitting the transform across multiple dimensions to eliminate this drawback in the hardware design.
In NTT computation the input polynomial has a very large number of coefficients, reaching 2^24 or even more. It therefore requires a large amount of data-transfer bandwidth and a large cache to store the twiddle factors, which is difficult for current graphics devices to provide. Ola therefore designed a multi-dimensional NTT processing architecture that achieves efficient NTT computation with less cache while also reducing the data-transfer bandwidth.
Decomposing an NTT of size N into independent NTTs of smaller size lends itself to parallel implementation, and splitting the NTT length also makes memory management more flexible. A 2D example is used below to illustrate the idea; the 3D NTT follows essentially the same process as the 2D one.
First, the input vector a of length N is viewed as a two-dimensional array with I rows and J columns, so the linear index of a becomes J*m + j, where 0 ≤ m < I and 0 ≤ j < J. Here N, I, and J are all powers of 2, and N = I*J. The NTT of the decomposed a is then computed in the following steps:
Step 1: Arrange the input vector a as an I*J matrix, and apply an NTT of length I to each column of the matrix, every column using the same I-point twiddle factors.
Step 2: Multiply each element of the transformed matrix by its corresponding inter-dimension twiddle factor;
Step 3: After this twiddle multiplication, apply an NTT of length J to each row, every row using the same J-point twiddle factors. Finally, read the transformed matrix out column by column; this is the final NTT result.
As can be seen, this multi-dimensional processing method also involves no bit-reversal operation, which makes it convenient for parallel hardware implementation.
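As a data-flow reference for Steps 1-3, the host-side sketch below builds a length-N NTT out of column transforms, an element-wise twiddle multiplication, and row transforms. It reuses the stockham_ntt and modular helpers sketched earlier and is meant to show the ordering of the steps, not the actual GPU kernels.

```cuda
#include <cstdint>
#include <cstddef>
#include <vector>
#include <algorithm>
// Reuses stockham_ntt, mul_mod and pow_mod from the earlier sketch.

// 2D ("four-step") NTT for N = I * J. `a` is row-major: a[J*m + j] is the
// element in row m, column j. omega is a primitive N-th root of unity mod p;
// I, J and N are powers of two.
void ntt_2d(uint64_t* a, size_t I, size_t J, uint64_t omega, uint64_t p) {
    const size_t N = I * J;
    std::vector<uint64_t> col(I), scratch(std::max(I, J)), out(N);
    uint64_t omega_I = pow_mod(omega, J, p);   // primitive I-th root (omega^J)
    uint64_t omega_J = pow_mod(omega, I, p);   // primitive J-th root (omega^I)

    // Step 1: length-I NTT on every column, all columns sharing the same twiddles.
    for (size_t j = 0; j < J; ++j) {
        for (size_t m = 0; m < I; ++m) col[m] = a[J * m + j];
        stockham_ntt(col.data(), scratch.data(), I, omega_I, p);
        for (size_t r = 0; r < I; ++r) a[J * r + j] = col[r];
    }
    // Step 2: multiply element (r, j) by the inter-dimension twiddle omega^(r*j).
    for (size_t r = 0; r < I; ++r)
        for (size_t j = 0; j < J; ++j)
            a[J * r + j] = mul_mod(a[J * r + j], pow_mod(omega, (r * j) % N, p), p);

    // Step 3: length-J NTT on every row, then read the matrix out column by
    // column: output index k = I*j' + r takes the element in row r, column j'.
    for (size_t r = 0; r < I; ++r)
        stockham_ntt(&a[J * r], scratch.data(), J, omega_J, p);
    for (size_t jp = 0; jp < J; ++jp)
        for (size_t r = 0; r < I; ++r)
            out[I * jp + r] = a[J * r + jp];
    std::copy(out.begin(), out.end(), a);
}
```

On the GPU, Steps 1 and 3 become batched kernels over the columns and rows, and Step 2 is a single element-wise pass, which is what makes the decomposition convenient for parallel hardware.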
To further reduce data-transfer time and improve computational efficiency, the data length of each dimension needs to match the number of streaming multiprocessors and the size of the shared memory on the GPU. For example, on NVIDIA Ampere-architecture graphics devices, the NTT uses a 2D computation strategy when the data length is at most 2^20 and a 3D strategy when the data length is greater than 2^20, which matches the size of the shared memory on the device.
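A small host-side helper can capture this choice; the 2^20 crossover is the one quoted above for Ampere devices, while the way the factors are balanced here is only an illustrative assumption.

```cuda
#include <cstddef>

// Hypothetical dimension selection: split a length-2^log_n NTT into two or
// three roughly balanced power-of-two factors, switching from the 2D to the
// 3D strategy above 2^20 as described for Ampere-class devices.
struct NttDims { int ndim; size_t len[3]; };

NttDims choose_ntt_dims(int log_n) {
    NttDims d{};
    if (log_n <= 20) {                          // 2D strategy
        d.ndim = 2;
        d.len[0] = (size_t)1 << (log_n / 2);
        d.len[1] = (size_t)1 << (log_n - log_n / 2);
    } else {                                    // 3D strategy
        d.ndim = 3;
        int l0 = log_n / 3;
        int l1 = (log_n - l0) / 2;
        d.len[0] = (size_t)1 << l0;
        d.len[1] = (size_t)1 << l1;
        d.len[2] = (size_t)1 << (log_n - l0 - l1);
    }
    return d;
}
```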
To improve NTT computation speed, Ola adopts a multi-dimensional NTT scheme. Like most zero-knowledge proof systems, the multi-dimensional NTT requires pre-computing some parameters to improve the efficiency of the NTT on the GPU. Ola performs this pre-computation once on the host side and stores the resulting parameters in device memory as a struct, so that data transfers between the device and the host can be reduced during the NTT butterfly stages. The pre-computed parameters mainly include the following (a sketch of such a struct is given after this list):
The modulus and root of unity of the Goldilocks prime field;
The preset data length, with a supported range of 2^12 to 2^24;
The length of the data in each dimension of the multi-dimensional NTT;
The thread, grid, block, and stream sizes allocated on the GPU for different data lengths;
The twiddle factors of each dimension's NTT;
The inter-dimension twiddle factors used between the NTTs of the different dimensions.
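A hypothetical layout for that struct is sketched below; the field names and types are assumptions made for illustration, since the list above only describes what is stored.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

// Illustrative device-resident parameter block mirroring the list above:
// field constants, the supported length, per-dimension sizes, launch
// geometry, and device pointers to the pre-computed twiddle tables.
struct NttParams {
    uint64_t  modulus;            // Goldilocks prime 2^64 - 2^32 + 1
    uint64_t  root_of_unity;      // root of unity used for this length
    int       log_len;            // data length is 2^log_len, 12 <= log_len <= 24
    int       ndim;               // 2 or 3 dimensions
    size_t    dim_len[3];         // per-dimension lengths, product = 2^log_len
    dim3      grid, block;        // launch geometry chosen for this length
    int       num_streams;        // CUDA streams used for overlap
    uint64_t* d_twiddles[3];      // twiddle factors of each dimension (device)
    uint64_t* d_inter_twiddles;   // inter-dimension twiddle factors (device)
};
```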
When the host-side program starts running, the NTT is initialized only once. The pre-computed data required for the different supported polynomial lengths are stored in the initialized struct on the graphics device, so the scheme adapts to fast computation over Goldilocks data of different lengths.
Data is transferred between the host and the graphics device over the PCIe bus. The default pageable mechanism for host-side memory allocation is very time-consuming when the data is large, so Ola uses page-locked (pinned) host memory for data exchange between the host and the device. The advantage of page-locked memory is that copies between page-locked memory and device memory can execute at the same time as kernel functions, i.e., asynchronously in parallel.
In some cases, page-locked memory can be mapped from the host address space into the device address space, eliminating the copy overhead altogether. On systems with a front-side bus, memory bandwidth is higher when the host memory is allocated as page-locked memory, and it increases further if the host memory is additionally allocated as write-combining memory, another form of page-locked memory.
However, page-locked memory is a scarce resource: a large allocation of page-locked memory may fail, and because it reduces the amount of physical memory available for paging, allocating too much of it can degrade overall system performance. In our scheme, the pre-computed parameters occupy less than 1 GB of page-locked memory.
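A minimal sketch of this pinned-memory pattern, using the standard CUDA runtime calls cudaMallocHost, cudaMemcpyAsync, and a stream, is shown below; the buffer sizes and the commented-out kernel are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Allocate page-locked host buffers once, then overlap host<->device copies
// with kernel execution on a CUDA stream.
int main() {
    const size_t n = size_t(1) << 20;               // illustrative length
    uint64_t *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;

    cudaMallocHost(&h_in,  n * sizeof(uint64_t));   // page-locked host memory
    cudaMallocHost(&h_out, n * sizeof(uint64_t));
    cudaMalloc(&d_buf, n * sizeof(uint64_t));

    for (size_t i = 0; i < n; ++i) h_in[i] = i;     // sample data

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Because the host buffers are page-locked, these asynchronous copies can
    // overlap with kernel execution instead of blocking the host.
    cudaMemcpyAsync(d_buf, h_in, n * sizeof(uint64_t), cudaMemcpyHostToDevice, stream);
    // ntt_kernel<<<grid, block, 0, stream>>>(d_buf, ...);   // hypothetical kernel
    cudaMemcpyAsync(h_out, d_buf, n * sizeof(uint64_t), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("first element after round trip: %llu\n", (unsigned long long)h_out[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```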
The types of programmable memory in CUDA are: Registers, Local Memory, Shared Memory, Constant Memory, Texture Memory, and Global Memory, and the hierarchy of these memory spaces is shown in the figure below.
Shared memory is a key part of the GPU. Structurally, each streaming multiprocessor (SM) has a small on-chip memory that can be flexibly partitioned between shared memory and the L1 cache. In terms of data access, this on-chip memory is the fastest memory apart from the registers. When shared memory is declared, the threads of a block executing on the same SM can access it collectively, so shared memory enables threads within the same thread block to share data. Storing data that is frequently accessed across threads in shared memory can greatly improve data-access efficiency.
Across most GPU architectures, an SM typically has 64 KB of on-chip memory. For NTT computation with data lengths of 2^24 or more, the shared memory of each thread block must be configured sensibly; in other words, the per-block NTT length should be chosen with both memory-access efficiency and computational efficiency in mind.
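As a rough illustration of the arithmetic involved (the 64 KB figure is the one quoted above; the particular split chosen here is only an assumption), a per-block sub-NTT length can be sanity-checked at compile time:

```cuda
#include <cstdint>
#include <cstddef>

// With 64 KB of on-chip memory per SM and 8-byte Goldilocks elements, at most
// 8192 elements fit. A per-block sub-NTT of 2^12 = 4096 elements therefore
// still leaves room for a few thousand cached twiddle factors.
constexpr size_t kSmemBytesPerSm = 64 * 1024;
constexpr size_t kElemBytes      = sizeof(uint64_t);
constexpr size_t kMaxElements    = kSmemBytesPerSm / kElemBytes;   // 8192
static_assert(4096 + 2048 <= kMaxElements, "sub-NTT data plus twiddles fit in shared memory");
```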
In addition, bank conflicts can occur during NTT computation when multiple threads access the same shared-memory bank at the same time. These can be eliminated by unrolling the shared-memory accesses, taking advantage of the weak ordering of instructions in CUDA.
In multi-dimensional NTT computation, the twiddle factors are the same for each column (or row), so a fixed portion of on-chip memory can be set aside to cache the twiddle factors, which increases shared-memory utilization and reduces the demand on memory bandwidth.
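The device-side sketch below illustrates that idea: each block loads the per-dimension twiddle table into shared memory once, and its threads then reuse it across all butterfly stages of the rows they transform. For brevity the per-row transform is a plain iterative radix-2 NTT with an in-thread bit-reversal and a simple thread-per-row mapping, so it only demonstrates the shared-memory caching pattern, not Ola's Stockham kernels; the Goldilocks helpers are likewise an illustrative version of the standard reduction for p = 2^64 - 2^32 + 1.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Goldilocks arithmetic, p = 2^64 - 2^32 + 1. The multiplication uses the
// standard reduction based on 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p).
__device__ __forceinline__ uint64_t gl_add(uint64_t a, uint64_t b) {
    const uint64_t P = 0xFFFFFFFF00000001ULL;
    uint64_t s = a + b;
    if (s < a || s >= P) s -= P;                 // also handles 64-bit wraparound
    return s;
}
__device__ __forceinline__ uint64_t gl_sub(uint64_t a, uint64_t b) {
    const uint64_t P = 0xFFFFFFFF00000001ULL;
    uint64_t d = a - b;
    if (a < b) d += P;
    return d;
}
__device__ __forceinline__ uint64_t gl_mul(uint64_t a, uint64_t b) {
    const uint64_t P   = 0xFFFFFFFF00000001ULL;
    const uint64_t EPS = 0xFFFFFFFFULL;          // 2^32 - 1 == 2^64 mod P
    uint64_t lo = a * b;
    uint64_t hi = __umul64hi(a, b);
    uint64_t hi_hi = hi >> 32, hi_lo = hi & EPS;
    uint64_t t = lo - hi_hi;
    if (lo < hi_hi) t -= EPS;                    // borrow: re-add P
    uint64_t r = t + hi_lo * EPS;
    if (r < t) r += EPS;                         // carry: 2^64 ≡ EPS (mod P)
    if (r >= P) r -= P;
    return r;
}

// Length-J NTT on every row of an I x J row-major matrix. All rows use the
// SAME J-point twiddle table, so each block caches the J/2 twiddle powers
// (twiddles[k] = w_J^k) in shared memory once. Launch with
//   row_ntt_shared_twiddles<<<ceil(I/blockDim.x), blockDim, (J/2)*8>>>(...)
__global__ void row_ntt_shared_twiddles(uint64_t* data, const uint64_t* twiddles,
                                        int I, int J) {
    extern __shared__ uint64_t s_tw[];
    for (int k = threadIdx.x; k < J / 2; k += blockDim.x)
        s_tw[k] = twiddles[k];                   // cooperative one-time load
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= I) return;
    uint64_t* x = data + (size_t)row * J;

    // In-thread bit-reversal permutation of the row.
    for (int i = 1, j = 0; i < J; ++i) {
        int bit = J >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { uint64_t tmp = x[i]; x[i] = x[j]; x[j] = tmp; }
    }
    // Iterative radix-2 butterflies; all twiddles come from shared memory.
    for (int len = 2; len <= J; len <<= 1) {
        int step = J / len;
        for (int start = 0; start < J; start += len) {
            for (int i = 0; i < len / 2; ++i) {
                uint64_t u = x[start + i];
                uint64_t v = gl_mul(x[start + i + len / 2], s_tw[i * step]);
                x[start + i]           = gl_add(u, v);
                x[start + i + len / 2] = gl_sub(u, v);
            }
        }
    }
}
```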
The biggest advantage of using the Goldilocks finite field is that field operations can be accelerated. When the NTT is small, good computing speed can already be achieved on the host CPU.
In the Ola zero-knowledge proof system, the length of the data involved in the NTT computation may vary. Considering the overall efficiency of proof generation, it is therefore necessary not only to manage the different data lengths on the graphics device, but also to compute the NTT directly on the host when the data length is small, without launching the graphics device at all; this avoids the time lost in host-to-device data transfer. The following figure shows the strategy for computing NTTs of different data lengths.
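The dispatch can be summarized by a host-side sketch like the one below. The source does not state the exact crossover point, so the 2^12 threshold (the lower end of the pre-computed GPU range listed earlier) is only an assumption, as is the function name.

```cuda
#include <cstddef>

// Hypothetical dispatch by data length: small NTTs stay on the host CPU so no
// time is lost on host-to-device transfers; larger ones go to the GPU with a
// 2D or 3D decomposition.
enum class NttPath { HostCpu, Gpu2D, Gpu3D };

NttPath choose_ntt_path(int log_n) {
    if (log_n < 12) return NttPath::HostCpu;   // assumed crossover, below the GPU's 2^12..2^24 range
    if (log_n <= 20) return NttPath::Gpu2D;    // 2^12 .. 2^20: 2D strategy
    return NttPath::Gpu3D;                     // 2^21 .. 2^24: 3D strategy
}
```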
This section presents the results of our multi-dimensional NTT implementation on an NVIDIA RTX 3090. We vary the input size from 2^18 to 2^24 over the Goldilocks finite field to demonstrate the scalability of our design; the results are shown in the table below. The “time consumed by GPU” is the time measured on the GPU. The “time consumed at the host interface” is the time from when the host starts calling CUDA's interface functions until the results are returned, and it includes the time to exchange data between the host and the device over PCIe.