# SoK: Fully Homomorphic Encryption Accelerators

Junxue Zhang<sup>1,2\*</sup>, Xiaodian Cheng<sup>1\*</sup>, Liu Yang<sup>1,2</sup>, Jinbin Hu<sup>1</sup>, Ximeng Liu<sup>3</sup> and Kai Chen<sup>1</sup>

<sup>1</sup>iSINGLab, Hong Kong University of Science and Technology <sup>2</sup>Clustar <sup>3</sup>Fuzhou University

Abstract—Fully Homomorphic Encryption (FHE) is a key technology enabling privacy-preserving computing. However, the fundamental challenge of FHE is its inefficiency, due primarily to the underlying polynomial computations with high computation complexity and extremely time-consuming ciphertext maintenance operations. To tackle this challenge, various FHE accelerators have recently been proposed by both research and industrial communities. This paper takes the first initiative to conduct a systematic study on the 11 FHE accelerators - cuHE/cuFHE, nuFHE, HEAT, HEAX, HEXL, HEXL-FPGA, 100×, F1, CraterLake, BTS, and ARK. We first make our observations on the evolution trajectory of these existing FHE accelerators to establish a qualitative connection between them. Then, we perform testbed evaluations of representative open-source FHE accelerators to provide a quantitative comparison on them. Finally, with the insights learned from both qualitative and quantitative studies, we discuss potential directions to inform the future design and implementation for FHE accelerators.

# 1. Introduction

With the increasing concern about data privacy and integrity, privacy-preserving computing has been adopted in many real-world applications, *e.g.*, cloud computing [1], machine learning [2], [3], database search [4], *etc.* Among all the privacy-preserving technologies, fully homomorphic encryption (FHE) emerges as one of the most important and promising technologies and has been adopted in various applications [4]–[7]. Specifically, FHE allows performing arbitrary operations directly over the encrypted data without decryption, making it appealing for privacy-preserving computation.

Although promising, one fundamental drawback of FHE is its *inefficiency*. Compared to plaintext computation, FHEenabled computation is orders of magnitude slower, which restricts its deployment in many performance-critical systems. To solve this problem, various optimizations have been proposed. One direction is to improve the efficiency of the algorithm. For example, modern FHE schemes, such as BGV [8], BFV [9], and CKKS [10], all support SIMDlike (*i.e.*, batching) operations [11] to pack many plaintext messages into one ciphertext to improve the execution efficiency. Instead of studying these algorithm-level performance optimizations, in this paper we focus on the other key direction — leveraging hardware accelerators to improve the efficiency of FHE schemes<sup>1</sup>.

Before introducing FHE accelerators, we first illustrate what makes FHE slow and the challenges of acceleration.

In this paper, we find that the root cause of FHE's inefficiency is two-fold: underlying polynomial computations with high computation complexity and two extremely timeconsuming ciphertext maintenance operations. First, most of FHE's underlying operations are polynomial operations, which are much more complex than plaintext computation, where operands are integers or floating numbers. While Fast Fourier Transform (FFT)/Number Theoretic Transform (NTT) can be utilized to speed up the polynomial operations [12] from algorithm-level, further accelerating NTT/FFT faces challenges in three aspects: high computation complexity, extremely intensive memory access, and limited generality (§3.1). Second, compared to plaintext computation, FHE requires ciphertext maintenance operations to ensure correctness. Such operations involve overcomplicated computation steps, causing further performance degradation. Although they are built upon polynomial operations, fully accelerating ciphertext maintenance operations is more challenging in terms of the aforementioned three aspects compared to merely accelerating polynomial operations (§3.2 and §3.3).

To improve the efficiency of FHE, FHE accelerators are proposed. Initially, these accelerators rely on features provided by general hardware. For example, Intel proposed Intel Homomorphic Encryption Acceleration Library (HEXL) to leverage AVX-512 instructions for fast NTT operations [13]. nuFHE [14] and  $100 \times [15]$  used GPU implementations, *i.e.*, CUDA [16] programs, to accelerate TFHE [17] and CKKS [10] respectively. While providing notable acceleration for FHE schemes, these accelerators are far from satisfactory.

To further improve the performance of FHE schemes, people begin to exploit specific-designed hardware accelerators. Field Programmable Gate Array (FPGA) is first used to build circuits to efficiently execute NTT, inverse NTT (iNTT), and key-switching in FHE schemes [18]-[24]. These FPGA-based accelerators are affordable but suffer from intrinsic disadvantages of FPGA itself: limited programmable resources and low working frequency [25]. Then, to overcome these disadvantages, people are seeking expensive Application-specific Integrated Circuit (ASIC) technologies to build high-performant FHE accelerators [26]–[29]. While the acceleration ratios of these ASICbased accelerators are promising, e.g.,  $\sim 1000 \times$  and  $\sim$  $100 \times$  NTT throughput compared to FPGA-based and GPUbased solutions respectively, they are way more expensive. For instance, developing and taping out a 12nm ASIC like [27] requires millions of US dollars.

In the recent decade, we have seen an explosive growth of FHE accelerators [18]–[24], [26]–[30], and expect an increasingly more active development of FHE accelerators in the near future. However, we lack a comprehensive and

<sup>\*.</sup> Equal contribution.

<sup>1.</sup> Note that algorithm optimization and hardware acceleration are complementary, and can be combined to improve the overall performance.

systematic study to shed light on the status quo of existing FHE accelerators, which could inspire the future design and implementation of FHE accelerators.

Motivated by this, we take the first initiative to perform the systematization of knowledge on FHE accelerators. We first review 11 existing FHE accelerators and make observations on the evolution trajectory of these works, which establishes a qualitative connection among them (§4). Then, to give readers a clear view of how these accelerators perform, we present a quantitative analysis of them. Specifically, we use testbed experiments to evaluate the performance of some representative open-source accelerators, and further include the statistics from papers of other well-known but closed-source accelerators to make our results more complete (§5). Finally, based on our qualitative and quantitative analysis, we discuss the potential future directions, such as new design tradeoffs, software/hardware co-designs, scaling methodologies, etc., to inform the future design and implementation of FHE accelerators (§6).

Along with the paper, we will provide a Docker image that includes all configurations and scripts for performance evaluations of all open-sourced FHE accelerators used in our paper, which can be readily reused by the community.

**Related Works:** Systematization of knowledge on FHE library [31] and compiler [32] have already been proposed. To the best of our knowledge, our paper is the first systematization of knowledge on FHE accelerators. FHE accelerators are closely related to FHE libraries and compilers. For example, some FHE accelerators are designed to work with particular FHE libraries, *e.g.*, HEAX chooses SEAL [33] as its target library to accelerate. Moreover, an increasing number of FHE accelerators use compilers for a software/hardware co-design [26]–[29]. However, few FHE accelerators consider leveraging existing FHE compilers, such as EVA [34], E3 [35], *etc.*, which, in our opinion, still leaves dramatic design space for better performance and flexibility.

# 2. Fully Homomorphic Encryption

Homomorphic encryption is an encryption scheme that allows performing arbitrary computation over ciphertexts without decrypting them. For example, Paillier is an additive homomorphic encryption scheme thus we can perform additions over ciphertext [36], while RSA [37] is a multiplicative homomorphic encryption scheme. In this paper, we are focusing on fully homomorphic encryption (FHE) schemes<sup>2</sup>, which allow both additive and multiplicative homomorphic operations.

Current FHE schemes can be categorized into wordwise FHE and bit-wise FHE [38]. Word-wise FHE supports algebraic operations on word-based (or message-based) encrypted data. Moreover, word-wise FHE supports efficient single-instruction-multiple-data (SIMD) style homomorphic operations, *e.g.*, performing homomorphic addition and multiplication over batched plaintext [11]. However, word-wise FHE is not suitable for evaluating non-polynomial functions,

| Notation                          | Definition                                                               |
|-----------------------------------|--------------------------------------------------------------------------|
| n                                 | Degree of a polynomial.                                                  |
| L                                 | Multiplicative depth of a fresh ciphertext.                              |
| l                                 | Current multiplicative depth of a ciphertext.                            |
| q                                 | Coefficient modulus of plaintext polynomial.                             |
| Q                                 | Coefficient modulus of ciphertext polynomial.                            |
| $\{q_i, i \in [0, L]\}$           | A set of moduli. $Q = \prod_{i=0}^{L} q_i$ .                             |
| P                                 | Special modulus for the keys.                                            |
| $\{p_i, i \in [0, \alpha - 1]\}$  | A set of special moduli. $P = \prod_{i=0}^{\alpha-1} p_i$ .              |
| dnum                              | Decomposition number in key-switching.                                   |
| $\alpha$                          | # of special moduli $p_i$ . $\alpha = \lfloor (L+1)/\text{dnum} \rfloor$ |
| $\{Q_j, j \in [0, \text{dnum}]\}$ | A set of modulus factors. $Q_j = \prod_{i=j\alpha}^{(j+1)\alpha-1} q_i$  |
| $\Delta$                          | Scaling factor.                                                          |
| m                                 | Plaintext.                                                               |
| 8                                 | Secret key.                                                              |
| pk                                | Public key.                                                              |
| $\mathbf{ct}$                     | Ciphertext.                                                              |
| $\mathbf{swk}$                    | Switching key for key-switching.                                         |
| Dyadic multiplication             | Element-wise multiplication of polynomials.                              |

TABLE 1: Notations used in this paper.

*e.g.*, sigmoid/relu functions, which are commonly used in machine learning applications. Examples of word-wise FHE are CKKS [10], BFV [9], BGV [8], *etc.* In contrast, bitwise FHE supports operations of boolean circuits. As its name indicates, bit-wise FHE schemes encrypt each bit of the plaintext and are usually used to evaluate non-polynomial operations by constructing lookup tables. Bitwise FHE schemes do not provide sufficient support for SIMD-style operations and usually suffer from a higher degree of ciphertext inflation, which poses a larger challenge to memory bandwidth. Examples of bit-wise FHE schemes are TFHE [17] and FHEW [39]. In order to support real-world privacy-preserving applications, such as privacy-preserving machine learning, both word-wise and bit-wise FHE schemes are used together [38], [40].

In this article, we focus on word-wise FHE algorithms constructed over ring learning with errors problem (RLWE) (*e.g.*, BFV, BGV and CKKS) [41] and bitwise FHE algorithm (*e.g.*, TFHE) [42] because they are practical and widely adopted. Since these FHE schemes mainly manipulate polynomials over finite fields or torus, most of the operations to be discussed later are made up of polynomial computations. In the following sections, we will show them in detail. Table 1 provides notations and terms used in this paper, and Figure 1 demonstrates the relationship of operations of FHE.

## 2.1. Encoding, Decoding, Encryption & Decryption

**Encoding and Decoding:** The goal of encoding is to convert plaintext messages into polynomials for the subsequent homomorphic operations. Please note that by packing a vector of numbers during encoding, specific FHE schemes, such as BGV and BFV, can naturally support SIMD-style operations. Conversely, decoding is used to recover plaintext messages from polynomials.

**Encryption:** In an asymmetric encryption system, a ciphertext **ct** is generated by obfuscating the plaintext with a

<sup>2.</sup> Similar to [32], FHE schemes in our paper include leveled FHE.



Figure 1: Overview of the operations used in FHE schemes.

public key and noises. For example, in RLWE-based FHE, the public key is an RLWE instance,  $\mathbf{pk} = (\mathbf{pk}_0, \mathbf{pk}_1) = (as + e, a)$ , generated from the secret key s. The coefficients of random polynomial a are uniformly sampled from interval (-Q/2, Q/2], and coefficients of noise polynomial e are independently sampled from a discrete Gaussian distribution. The encryption is formulated as

$$\mathbf{ct} = (c_0, c_1) = v \cdot \mathbf{pk} + (\Delta \cdot m + e_1, e_0) \mod Q, \quad (1)$$

where m is the plaintext (*i.e.*, encoding result),  $e_0$  and  $e_1$  are also noise polynomials, v is a random polynomial with small coefficients, and  $\Delta$  is the scaling factor in some schemes to control precision.

**Decryption:** Although different FHE schemes use different decryption workflows, the common objective is to recover the plaintext by removing  $a \cdot s$  from the ciphertext. As a result, they share the same core operation in decryption, which is  $c_0 - c_1 \cdot s$ .

#### 2.2. Homomorphic Evaluation

For word-wise FHE, major homomorphic evaluations include multiplications, additions, subtractions and permutations, which are sufficient for most applications. Nonpolynomial functions can also be achieved via polynomial approximation, which covers a large range of applications. Bit-wise FHE schemes can efficiently evaluate bit operations, including NOT, AND, NAND, OR and XOR, which can provide accurate results of non-polynomial functions.

However, homomorphic evaluations lead to two significant problems. First, operations such as multiplication and permutation construct special ciphertexts that cannot be directly decrypted or used as the input of subsequent operations. Second, as noise is introduced to secure the ciphertext in FHE schemes, it gradually grows during FHE operations, especially in homomorphic multiplications. After the noise exceeds a threshold, it will impact the correctness of the decryption. Therefore, ciphertext maintenance operations, *e.g.*, key-switching, modulus-switching and bootstrapping, are required in FHE to solve these problems.

#### 2.3. Ciphertext Maintenance

**Key-switching:** As its name implies, key-switching homomorphically switches the secret key of a ciphertext while keeping the corresponding plaintext unchanged. More specifically, ct is the ciphertext of plaintext m, and it can be decrypted with some special secret key s'. After executing

ct' = Keyswitch(ct, swk), ciphertext ct' can be decrypted with the original secret key s and the corresponding plaintext is still m. In the equation, swk is a pre-generated public key called switching key, and it can be considered a ciphertext of  $P \cdot s'$  with modulus  $P \cdot Q$ . P is an integer used to control the scale of noise in key-switching. After certain operations such as homomorphic multiplication and permutation, keyswitching is leveraged to convert the resulting ciphertexts back into the original form for the following operations. Therefore, it is intensively used in FHE schemes.

Modulus Switching and Multiplicative Depth: Generally speaking, modulus switching refers to the operations that switch the modulus of a ciphertext. It is largely applied in FHE schemes to raise or reduce the modulus for different purposes. In particular, RLWE-based FHE schemes introduce modulus switching to control the proportion of noise in the ciphertext at the cost of reducing the modulus Q. When Q is too small to support further operations, the noise cannot be reduced anymore. Consequently, the size of modulus Q limits the number of consecutive homomorphic operations on a freshly encrypted ciphertext. The number is also called the maximum multiplicative depth (or budget) L of arithmetics supported by the FHE scheme. Such FHE schemes are usually called leveled fully homomorphic encryption (leveled FHE) because of the limitations on L. In this paper, we use the term deep and shallow to describe applications that consume large multiplicative depth (e.g., deep neural networks [43]) and those that only contain a few multiplications (e.g., database lookup [44]), respectively. Leveled FHE schemes are designed to be efficient for shallow computations. In deep applications, excessive Q dramatically reduces the performance. Bootstrapping is leveraged to refresh the ciphertext and recover the multiplicative depth, which we will discuss next.

**Bootstrapping:** Bootstrapping is a generic term for operations that refresh a ciphertext in FHE. Their common idea is to homomorphically recrypt the old ciphertext and generate a fresh one.

In bit-wise FHE schemes such as TFHE, bootstrapping is performed in every bit-wise operation. Therefore, multiplicative depth is not considered in TFHE. The most timeconsuming part of bootstrapping in TFHE is the homomorphic evaluation of a lookup table (LUT). The lookup process can be implemented through a large number of multiplexer (MUX) gates. Since the MUX gate mainly consists of polynomial additions, subtractions and multiplications, polynomial computations are the major workload in TFHE.

Unlike bit-wise FHE, bootstrapping in RLWE-based word-wise FHE schemes is more complicated. We tend to reduce the frequency of bootstrapping and perform one only when the current multiplicative depth is not enough. It is worth noting that bootstrapping combines multiple operations, including many multiplications, thus consuming considerable multiplicative depth by itself. The concrete implementations of bootstrapping are not the same in various word-wise FHE algorithms, but their workflow can be summarized into four common steps. Step 1. Modulus Switching: In word-wise FHE, the multiplicative depth is proportional to the modulus size of the ciphertext. Therefore, the modulus should be raised if more multiplications are needed. Given ciphertext ct under modulus Q, the first step in bootstrapping is to generate a new ciphertext ct' encrypting the same plaintext while extending the modulus to Q' which satisfies  $Q' \gg Q$ .

Step 2. CoeffToSlot: Although the modulus of the ciphertext has been raised, the decryption formula no longer holds since the coefficients of the plaintext polynomial are not guaranteed to be bounded by the modulus of the plaintext during the process of modulus switching. In this case, homomorphic evaluations are required to modulo the coefficients. However, we can only operate on plaintext slots rather than polynomial coefficients with homomorphic operations in FHE. To make the coefficients accessible for homomorphic evaluations, homomorphic encoding should be performed in advance to put the coefficients of plaintext polynomial into the plaintext slots. This process is also called CoeffToSlot [45], [46] or linear transformation [47], [48].

Step 3. Homomorphic Evaluation: Homomorphic evaluations in bootstrapping are introduced to modulo the coefficients of plaintext polynomial. They contain non-polynomial functions that can not be directly executed in FHE. For example, in BGV and BFV, the main operation of this step is digit extraction, while modulus reduction is the dominant operation of this step in CKKS. Thus, in modern FHE implementations, approximation schemes such as Taylor expansion [45] and optimized Chebyshev method [46], [48] are used to alternatively evaluate high-degree polynomials. Step 4. SlotToCoeff: After homomorphic evaluation, the results in the plaintext slots should be placed back to the coefficients of the plaintext polynomial. The process is an inverse operation of CoeffToSlot, which is called SlotToCoeff or inverse linear transformation.

#### 2.4. RNS Decomposition

Since word-wise FHE schemes are mainly constructed over polynomial rings, polynomial operations with large integers lead to performance degradation [10]. For performance optimization, a common strategy is to use Residue Number System (RNS). RNS decomposes modulus Q and P into the product of several smaller coprime moduli  $q_0 \cdot q_1 \cdots \cdot q_L$  and  $p_0 \cdot p_1 \cdots \cdot p_{\alpha}$ , allowed by the Chinese Remainder Theorem (CRT) [49]. Multiple polynomials with smaller moduli are used to replace the original polynomial. By applying RNS to FHE, the coefficient size of polynomials is greatly reduced at the cost of decomposing each polynomial into multiple ones. Since the calculation complexity of modular multiplication is roughly proportional to the square of the coefficients' bit width, the overall complexity is reduced.

# 3. What Make FHE Slow & the Challenges of Accelerating Them

FHE is an ideal solution for many privacy-preserving applications since it can simultaneously protect confidential data and satisfy emerging data protection lawsuits and regulations [50]. However, FHE still suffers from inefficiency, which is the focus of the paper. It is worth noting that previous works have also mentioned several other reasons restricting the broad adoption of FHE, such as its usage complexity [32]. However, we believe the inefficiency problem is still the major roadblock to FHE's adoption in the production environment.

In this paper, we have identified the cause of FHE's inefficiency as a two-fold problem. First, most of FHE's operations are built upon polynomials [8]–[10], [17]. Therefore, FHE is inefficient since polynomial computations, specifically polynomial multiplication, are naturally much more complex than integer/floating number calculations. Second, two necessary ciphertext maintenance operations, *i.e.*, key-switching and bootstrapping, are extremely timeconsuming since they involve very complicated computations. As reported in ARK [29], the two major components (i.e., NTT and fast basic conversion, which will be introduced in detail later) in key-switching take up more than 80% of the total computation time. According to the analysis in [27], bootstrapping may take up over 90% of computation time in an end-to-end FHE task (bootstrapping consists of key-switching operations).

To solve the problems, FHE accelerators are proposed. They target improving the performance of FHE schemes by either leveraging general hardware (SIMD feature provided by CPUs) or relying on completely specific hardware (FPGA-based circuits). However, it is not easy for these accelerators to achieve ideal performance. In the following part, we will first summarize the challenges of accelerating polynomial computation (specifically, its core operation: NTT/FFT) into three aspects: high computation complexity, extremely intensive memory access and limited generality. Second, we will further demonstrate the challenges of accelerating key-switching and bootstrapping. Although these two operations are built upon polynomials operations, designing an end-to-end acceleration solution for them is more challenging in terms of the aforementioned three aspects, which makes it increasingly more difficult to design efficient FHE accelerators.

## 3.1. Challenges of Accelerating Polynomial Computation

One widely-adopted algorithm-level optimization for polynomial computation is to leverage Fast Fourier Transform (FFT)/Number Theoretic Transform (NTT) [33], [51]. Thus, accelerating FFT/NTT is essential to all FHE accelerators. In the following part, we will first introduce how FFT/NTT reduces the time complexity of polynomial computation and then present the three challenges of further accelerating FFT/NTT with the current hardware architecture.

#### 3.1.1. FFT/NTT

As discussed, the underlying operations of FHE schemes are all polynomial operations. Here we first define a polynomial A as follows:

$$A(x) = \sum_{j=0}^{n-1} a_j x^j$$
 (2)

The degree of a polynomial is the highest power of the variable x with a non-zero coefficient. Integer n is defined as the degree-bound of the polynomial, which is strictly larger than the degree of the polynomial [12]. In many previous works, n is also called the degree of a polynomial for simplicity. In this paper, we use the term degree rather than degree-bound to refer to n.

There are two common representations of a polynomial: coefficient representation and point-value representation. The polynomials in FHE are naturally stored in the coefficient representation. However, the time complexity of multiplication between polynomials in the coefficient representation is  $O(n^2)$ , while it can be reduced to O(n)with the point-value representation. A popular approach for representation conversion is the Fast Fourier Transform, which leverages the idea of divide-and-conquer to reduce the conversion time complexity to  $O(n \log n)$ . Note that FFT only works on complex numbers. For RLWE-based FHE schemes, we also need Number Theoretic Transform, which is a generalization of FFT but works over finite fields. Readers may refer to Appendix A for more details.

#### 3.1.2. High Computation Complexity

Although FFT/NTT has lowered the computation complexity of polynomial multiplications to  $O(n \log n)$ , this operation still takes the majority of the time, reaching 70% in some scenarios [29]. Therefore, to further accelerate the FFT/NTT, people begin to design accelerators to further improve performance.

Most FFT/NTT implementations are constructed by iteratively executing a basic operation called butterfly. The most well-known butterfly strategies are Cooley-Tukey (CT) butterfly [52] (shown in Figure 2) and Gentlemen-Sande (GS) butterfly [53] (shown in Figure 3). These two strategies mainly consist of addition, subtraction and multiplication operations. CT butterfly is designed to execute the Equation 8 and 10 in Appendix A by reusing the multiplication result. GS butterfly exploits a similar structure and can be used as the reverse operation of CT butterfly in applications.

Based on the butterfly strategies, the FFT/NTT algorithm is divided into  $\log n$  stages with n/2 butterflies in each stage, as shown in Figure 4, which presents an example of FFT with n = 8. The input of the butterflies in FFT/NTT is the output of specific ones in the previous stage, which leads to a strict dependency chain from the first to the last stage. Data dependency among butterflies in adjacent stages continuously changes as the calculation proceeds.

To accelerate FFT/NTT computation, the first attempt is to leverage data parallelism since data parallelism is widely supported by general hardware, such as CPU (through SIMD instructions) and GPU (through CUDA programming). However, data parallelism is only efficient when there are no dependencies among all pieces of data. Consequently, the data parallelism cannot be directly applied to FFT/NTT due to the strict data dependencies among adjacent butterfly operations.

Another accelerating approach is to utilize pipeline parallelism, which can be implemented by designing customized hardware circuits via FPGA or ASIC. However, as Figure 4 indicates, to build a pipeline with large n,  $(n \log n)/2$  separate butterflies should be implemented, which will consume too many hardware resources, causing either impossible circuits placement with FPGA or extremely expensive costs with ASIC. Moreover, data movement within such a large pipeline is almost impossible with state-of-the-art hardware technologies.

Therefore, naive data and pipeline parallelism are challenging. There are two possible solutions. First, it is possible to make FFT/NTT partly parallelizable based on the 4step FFT/NTT algorithm [54]. Readers may refer to the Appendix B for a detailed introduction to 4-step FFT/NTT. The second solution is inter-polynomial parallelism, executing operations over multiple polynomials concurrently with sufficient hardware resources. The scheme is called residuepolynomial-level parallelism (rPLP), and we will discuss it in detail in §3.2.

#### 3.1.3. Intensive Memory Access

Recent works have observed that overhead in memory access has become an essential bottleneck even if the accelerator is designed with modern hardware, such as FPGA or ASIC [26]–[29]. For example, let's consider a chip operating at 1GHz and there're 40960 on-chip modular multiplication units. The chip is connected to the latest HBM3 with a bandwidth of 3TB/s. Assume that the chip is executing CoeffToSlot with bootstrappable parameters  $(n = 2^{16}, L = 23, \text{dnum} = 4)$ . All the calculation units are concurrently working and the bandwidth of the HBM3 is fully utilized. It only takes the chip 0.18ms to finish all the multiplications, while it takes the HBM3 2.1ms (11.7× of computation time) to load the data used in CoeffToSlot [29].

The root cause of the problem is data inflation. Compared to plaintext computation, data size in polynomial operations largely increases for the following reasons.

In FFT/NTT, a group of pre-computed parameters called twiddle factors is mandatory for the calculations, *i.e.*,  $\omega_n^k$ in Figure 2 and Figure 3. Because the twiddle factors can be reused throughout an entire FHE job, it is usually cached in the on-chip memory (fast but extremely limited) of the accelerator to achieve high performance. For FFT, the number of twiddle factors is less than the number of input coefficients. However, for NTT, one group of twiddle factors is required for each modulus in RNS, which results in data inflation. In addition to the twiddle factors, the temporary results generated by each stage in FFT/NTT also place a high demand on the memory size.

As mentioned in §3.1.2, the 4-step FFT/NTT algorithm makes it possible to process a single FFT or NTT in parallel. Therefore, it is common to leverage this algorithm in recent designs. However, the 4-step FFT/NTT requires more pre-computed parameters than the original FFT/NTT algorithm (shown in Appendix B.2). Although the algorithm greatly reduces the number of twiddle factors, it introduces n additional pre-computed numbers called twisting factors for each RNS modulus [54]. Therefore, the memory space requirement for pre-computed numbers is even more strict after applying the algorithm optimization.



Figure 2: Cooley-Tukey butterfly



Figure 3: Gentlemen-Sande butterfly

Furthermore, due to the complex data dependency and large data size, read/write requests over the same block of memory from different butterflies are common. Such conflicts cause either large design complexity or severe performance degradation.

#### 3.1.4. Limited Generality

Degree n of the polynomials are variant in different scenarios. For example, n is related to security level and multiplicative depth. However, it is challenging for these specificdesigned hardware accelerators, such as FPGA and ASIC, to consistently provide optimal acceleration when n varies. The reason is as follows. There are  $(n \log n)/2$  butterflies in the entire FFT/NTT pipeline. When n becomes larger, the consumption of hardware resources greatly increases. Furthermore, different n leads to completely different data dependencies in FFT/NTT. Static circuit connections among butterflies cannot satisfy the requirement.

If we use a fixed n to design the pipeline, the design suffers from resource underutilization or non-optimal performance. For example, when an architecture designed for a certain degree is directly applied to accelerate FFT/NTT with a relatively larger degree, the calculation fails to achieve full pipelining because of insufficient multipliers. In contrast, adapting the design to scenarios where the degree is relatively smaller results in a tremendous waste of computing resources in the redundant stages.

#### 3.2. Challenges of Accelerating Key-switching

As mentioned in §2.3, key-switching is widely used for ciphertext conversion in homomorphic multiplication and permutation. The main challenge of accelerating key-switching is achieving generality, which we will demonstrate in detail in the following parts.

The key-switching algorithms used in state-of-the-art accelerators can be considered as different variants of the same algorithm, *i.e.*, generalized key-switching [46], [55]. Based on RNS decomposition, the modulus Q is factorized into L+1 coprime moduli  $q_0, q_1, \ldots$ , and  $q_L$ . In generalized key-switching, given a fixed integer parameter dnum  $\in [1, L+1]$  which stands for decomposition number, the moduli are grouped into dnum blocks and the partial production of moduli in each block is denoted as  $Q_j = \prod_{i=j\alpha}^{(j+1)\alpha-1} q_i$ ,



Figure 4: Workflow of NTT/FFT with n = 8 with CT butterfly

where  $j \in [0, \text{dnum})$  and  $\alpha = \lfloor (L+1)/\text{dnum} \rfloor$ . Then the key-switching can be decomposed into calculations in each block, followed by the accumulation of results from different blocks. Each block's calculations are also called fast basis conversion, mainly consisting of multiplications and additions, while the major operation in accumulation is NTT. The selection of dnum greatly impacts the memory requirement and calculation complexity of key-switching. As dnum increases from 1 to L + 1, the overall memory consumption in key-switching and the workload for NTT strictly grows, while the workload for basis conversion decreases. The overall calculation workload also depends on the selection of other parameters, especially multiplicative depth L. Therefore, different dnums are suitable for various scenarios.

The variation of dnum poses generality challenges to the accelerators, especially for those with specific-designed circuits due to the high complexity of changing them. Since the proportion of different operations in the total workload is not static, the first challenge for the designers is to decide how to distribute the limited resources in the accelerator to different calculation units (*i.e.*, NTT and basis conversion) to achieve balanced throughput for all the typical applications.

The second generality problem is how to decide the parallelism scheme. Since NTT and fast basis conversion require different schemes to be efficient, for the accelerators, how to parallel process polynomials is impacted by the relative workload of NTT and basis conversion, which is relevant to dnum. The parallelism schemes adopted by existing works can be categorized into rPLP (residue-polynomiallevel parallelism) and CLP (coefficient-level parallelism), two different polynomial access patterns for the accelerators to achieve high parallelism. Under RNS representation  $(\S2.4)$ , we need to perform concurrent operations on multiple residue polynomials. With rPLP, each processor in the accelerator individually executes operations over a single residue polynomial. The parallelism is achieved by distributing multiple residue polynomials to different processors. With CLP, multiple processors in the accelerator collaboratively execute operations for a single residue polynomial, and n coefficients in the same polynomial are distributed to different processors. There is no superior or inferior of these two schemes in terms of coefficient-wise operations like polynomial addition and dyadic multiplication. However, CLP introduces global data communication across different processors for NTT. Similarly, rPLP leads to extra data exchange in basis conversion. Therefore, both schemes may cause performance degradation in certain operations.

In key-switching, both basis conversion and NTT are dominant. The designer may compare the relative workload of basis conversion and NTT in the targeted applications to choose between CLP and rPLP. For example, when a relatively smaller dnum is chosen, which increases the importance of basis conversion, CLP is a preferred choice. Nevertheless, generality remains to be a challenge.

#### 3.3. Challenges of Accelerating Bootstrapping

In this section, we mainly analyze bootstrapping in wordwise FHE since it is more complicated than that in bit-wise FHE (details in §2.3), and is the focus of the latest FHE accelerators [27]–[29]. Bootstrapping in word-wise FHE consists of four complicated steps, making it the most timeconsuming operation in word-wise FHE. The impressive overhead comes from the large consumption of multiplicative depth and enormous homomorphic permutations, which both lead to extremely high computation complexity and memory access.

**Large Consumption of Multiplicative Depth:** As reported in [27], bootstrapping in the LSTM benchmark consumes 61% of the maximum multiplicative depth (35 out of 57 levels). The significant consumption of multiplicative depth mainly comes from the approximation schemes applied in bootstrapping, which include deep calculations.

Therefore, for FHE applications requiring unbounded multiplication depth, such as deep learning training, the maximum multiplicative depth L must be large enough to support bootstrapping. Furthermore, since bootstrapping involves dramatic overhead, we should further increase depth L to use more multiplicative depth to perform application computations before bootstrapping, thus reducing the frequency of bootstrapping during the whole calculation to guarantee the effectiveness of an application.

Consequently, the growth of depth L leads to a larger modulus Q, and the degree of polynomial n should be raised accordingly to reach a certain security level of RLWE [56]. Large modulus and degree of polynomial lead to the growing size of ciphertexts and keys, thus significantly increasing the workload of polynomial calculations and memory consumption.

**Enormous Homomorphic Permutations:** The second and fourth steps of bootstrapping, CoeffToSlot and SlotToCoeff, suffer from the high computational overhead of homomorphic operations, especially when the plaintext messages are densely packed in the ciphertext. In the implementation of [48], CoeffToSlot on a ciphertext, which packs 4096 plaintext messages, requires  $\sim 30$  homomorphic permutations, leading to large computation complexity. Moreover, homomorphic permutations introduce massive additional switching keys, dramatically inflating the memory space

needed and increasing the bandwidth requirements. In detail, the size of a single switching key is about dnum× the size of a ciphertext, and permutations aiming at different rotation steps need distinct keys, which introduce ignorable memory storage and access overhead. For example, according to the analysis in [29], with  $n = 2^{15}$ , 40 different switching keys should be pre-computed and stored in preparation for permutations in the linear transformation of bootstrapping.

#### 4. FHE Accelerators

In this section, we will first comprehensively review the existing FHE accelerators in chronological order (§4.1). Then, we will summarize their evolution trajectory to address particular challenges mentioned in the previous section, which establishes a qualitative connection among them (§4.2).

Table 2 provides an overview of FHE accelerators by demonstrating the hardware leveraged, the data parallelism schemes used, and the features of these FHE accelerators. We also make the following explanations to make it clear. First, the software/hardware co-design denotes whether the design leverages software's flexibility and hardware's efficiency to properly distribute and schedule the workload in the accelerator. In particular, we regard co-design is absent in all CPU-based and GPU-based works, which only include software design based on off-the-shelf hardware architectures. Second, similar to previous works [32], the FHE schemes are grouped into families of related schemes. For example, BFV stands for BFV [9]/BGV [8] and TFHE stands for TFHE [17]/GSW [42]. Third, among the different features of the accelerators, we mainly focus on whether they are programmable and bootstrappable. Programmable refers to the accelerator's ability to support a variety of cryptographic parameters without hardware architecture reconfiguration. Bootstrappable means it is possible to execute bootstrapping and achieve unlimited FHE operations with the accelerator. The half circle  $(\mathbf{0})$  here represents that the accelerator supports bootstrapping, but the performance is far from meeting practical requirements.

Worth noting, although the FHE accelerators rely on the acceleration of polynomial computations, such as NTT, it is not reasonable to categorize works that only accelerate these basic modules as FHE accelerators. For example, FFT accelerators are commonly applied in other fields, such as signal processing, but most of them cannot be directly used to boost the performance of FHE schemes. Our survey will not cover them but focus on accelerators that explicitly improve the efficiency of FHE schemes.

#### 4.1. Survey of Existing FHE Accelerators 4.1.1. cuHE/cuFHE

CUDA Homomorphic Encryption Library (cuHE) was proposed by Dai *et al.* in 2015 [57], [59]. cuHE uses GPU for acceleration and provides CUDA implementation of NTT and CRT. CUDA-accelerated Fully Homomorphic Encryption Library (cuFHE) was proposed by Dai *et al.* in 2018 [58]. cuFHE leverages implementations of cuHE to boost the performance of TFHE [17]. Both cuHE and cuFHE are open-sourced standalone acceleration libraries [57], [58] and do not provide official integration with FHE libraries.

| Name            | Hardware<br>Target | Hardware Software/Hardware |              | Supported Schemes |      |       | Supported Features |                | Open-  |
|-----------------|--------------------|----------------------------|--------------|-------------------|------|-------|--------------------|----------------|--------|
|                 | Inget              | eu-uesign                  | r ai anensin | DI V              | CKKS | TITIL | Tiogrammable       | Bootstrappable | source |
| cuHE [57]       | GPU                | 0                          | rPLP         | 0                 | 0    | •     | 0                  | •              | •      |
| cuFHE [58]      | GPU                | 0                          | rPLP         | 0                 | 0    | •     | 0                  | •              | •      |
| nuFHE [14]      | GPU                | 0                          | rPLP         | 0                 | 0    | •     | 0                  | •              | •      |
| HEAT [30]       | FPGA               | •                          | rPLP         | •                 | 0    | 0     | 0                  | 0              | •      |
| HEAX [19]       | FPGA               | 0                          | rPLP         | 0                 | •    | 0     | 0                  | 0              | 0      |
| HEXL [13]       | CPU                | 0                          | rPLP         | •                 | ٠    | 0     | •                  | 0              | •      |
| HEXL-FPGA [18]  | FPGA               | 0                          | rPLP         | •                 | •    | 0     | •                  | 0              | •      |
| 100× [15]       | GPU                | 0                          | rPLP + CLP   | •                 | •    | 0     | •                  | •              | •      |
| F1 [26]         | ASIC               | •                          | rPLP         | •                 | •    | 0     | •                  | O              | 0      |
| CraterLake [27] | ASIC               | •                          | CLP          | •                 | •    | 0     | •                  | •              | 0      |
| BTS [28]        | ASIC               | •                          | CLP          | •                 | •    | 0     | •                  | •              | 0      |
| ARK [29]        | ASIC               | •                          | rPLP + CLP   | •                 | •    | 0     | •                  | •              | 0      |

TABLE 2: Overview of all FHE accelerators.

However, they only support accelerating polynomial operations with NTT, while an FFT-based implementation may achieve higher performance for TFHE. Moreover, besides limited functions, cuFHE adopts fixed cryptographic parameters, which cannot be configured to achieve high generality.

#### 4.1.2. nuFHE

GPU-powered Torus FHE implementation (nuFHE) was launched by NuCypher in 2018 [14]. Similar to cuHE/cuFHE, nuFHE also adopts GPU to accelerate FHE schemes. However, different from them, nuFHE provides either FFT or NTT to improve performance. nuFHE is an open-source standalone library [14] and provides Python APIs. With FFT, nuFHE achieves better performance compared with cuFHE. But it shares the similar disadvantage of limited generality.

#### 4.1.3. HEAT

HEAT was proposed by Roy et al. in 2019 [30]. Different from cuHE, cuFHE, nuFHE, HEAT targets accelerating a word-wise FHE scheme: BFV. Moreover, besides general hardware which has been used in previous works, HEAT further leverages FPGA to achieve a more flexible hardware design. Thus, HEAT utilized a heterogeneous ARM-FPGA co-designed architecture implemented on Xilinx ZCU102 Evaluation Kit [60]. The authors further migrated HEAT to the f1 instance of Amazon AWS in 2020 [61]. Both implementations are open-sourced. The implementation of HEAT on ZCU102 comprises several hardware coprocessors on FPGA and a multi-core (4 cores) ARM processor. The parallel coprocessors can efficiently execute primitive operations in BFV, including addition, subtraction, multiplication, modulus switching, and NTT. The ARM cores control the coprocessors' workflow and manage the network connection to the applications. By using the software as a workflow controller, HEAT can perform polynomial arithmetics and homomorphic operations like key-switching in BFV by combining different primitive operation units.

Considering the limited on-chip memory (BRAM/U-RAM) and calculation units (DSP), it is impossible to implement a fully pipelined NTT processor on FPGA. Instead, the authors instantiated two CT butterfly units for each NTT core and accomplished the NTT by reusing them. This design is a compromise solution due to limited resources, thus inevitably leading to non-optimal performance. Another problem caused by resource constraints is the relatively smaller polynomial degree. The polynomial degree supported by HEAT is 4096, which does not fulfill most practical requirements and significantly limits its generality.

## 4.1.4. HEAX

HEAX was proposed by Microsoft in 2020 [19]. It provides a highly performant hardware architecture to accelerate the operations of CKKS. HEAX is not an open-source project. The acceleration foundations of HEAX are the primitive modules: NTT and multiplication modules. Each module consists of multiple calculation cores, which could be adjusted to match the required throughput.

Different from HEAT, which depends on the software to manipulate hardware primitive operations (*e.g.*, NTT and multiplication) when implementing key-switching, HEAX implements a specific key-switching module to minimize the overhead of software and hardware interaction. Following the workflow of key-switching, the key-switching module instantiates multiple primitive modules and BRAMs to construct the pipeline. The modules can be adjusted to balance the throughput in the pipeline based on specific cryptographic parameters.

However, the adjustment of HEAX is accomplished by physically modifying the modules. The FPGA has to be reconfigured to adapt to different cryptographic parameters, which limits the generality of the design.

## 4.1.5. HEXL

Unlike previous works relying on specific hardware devices, such as GPU and FPGA, Intel proposed Intel Homomorphic Encryption Acceleration Library (HEXL) in 2021 [13]. HEXL exploits the SIMD features provided by Intel CPUs, which are easily accessible, to provide plug-and-play acceleration capacities for FHE. HEXL has been integrated with PALISADE [51], Microsoft SEAL [33] and HElib [62] by replacing the underlying arithmetic implementations. It is open-sourced, and the code is hosted on Github [63].

With the single-threaded implementation of primitive operations, including NTT/iNTT and dyadic multiplication, based on the Intel AVX-512/AVX512IFMA instructions [64], HEXL reaches high single-core acceleration. Since HEXL is thread-safe, users can achieve better performance by paralleling different operations with multithreading. However, due to architecture deficiency, the overall performance largely degrades when there are many threads. For example, too many threads with SIMD cause significant heat dissipation. As a result, core frequency is dramatically reduced when reaching the TDP [65]. Moreover, similar to the CPU-based acceleration libraries in other domains, HEXL also suffers from slow memory access, further reducing its overall performance.

#### 4.1.6. HEXL-FPGA

To compensate for the disadvantages of HEXL, Intel further proposed HEXL-FPGA in 2021 [18]. HEXL-FPGA offers the HLS-based (high level synthesis) implementation of NTT/iNTT, multiplication, and key-switching. Since HEXL-FPGA is open-sourced, users can compile each of the functions into an individual bitstream and program the FPGA device. HEXL-FPGA can be integrated with aforementioned HEXL to accelerate corresponding FHE libraries. HEXL-FPGA is open-sourced and under active development [18].

However, there are inherent problems with HLS-based solutions. HLS designs are written in high-level language and rely on the compiler to convert the codes into hardware design. Due to the essential difference between the hardware and software, the compiling process may lead to redundant resource consumption and complex workflow synchronization [66], thus leading to suboptimal performance. Therefore, it's hard for HEXL-FPGA to fully utilize the advantages of FPGA. Moreover, HEXL-FPGA only supports a limited range of cryptographic parameters, which will be demonstrated in §5.2.

#### **4.1.7.** 100×

 $100 \times$  was proposed by Jung *et al.* in 2021 [15]. It provides higher acceleration of CKKS than previous works (*i.e.*, HEXL-FPGA and HEAX) with the powerful V100 GPU [67]. The reason is that V100 has more hardware resources due to better semiconductor manufacturing processes and works at a much higher frequency than the FP-GAs adopted in previous works.  $100 \times$  introduces memory-centric optimizations to increase end-to-end performance and achieves over  $100 \times$  acceleration ratio compared to the single-thread CPU implementation.

The implementation of NTT in  $100 \times$  is built based on the hierarchical approach in [68]. Its basic idea follows the 4-step FFT/NTT algorithm [54]. The implementation of NTT is divided into two separate kernels, making it possible to cache all input data in the shared memory. Moreover, due to the limited size of registers in GPU, both kernels further decompose the NTT based on a generalized version of the 4step FFT/NTT algorithm. In addition to the decomposition, schemes including coalesced memory access and on-thefly twiddle factors generation are leveraged to decrease the overhead in memory access.

However, although the memory accesses in the basic operations have been optimized, the end-to-end performance of a FHE task is still bottlenecked by the bandwidth of the main memory. To relieve the bottleneck, the authors of  $100 \times$  tend to fuse multiple kernels into a single kernel, which allows the data cached on the chip to be reused by a series of consecutive operations and reduces global memory accesses. Nevertheless, since GPU's architecture is designed for calculations between small numbers (*e.g.*, FP16), it does

not offer large on-chip memory. As a result, the intensive memory access still degrades the performance of  $100 \times$ .

## 4.1.8. F1

To solve the problems of insufficient resources (*e.g.*, the FPGA-based ones) and unsuitable fixed architecture (*e.g.*, the CPU or GPU-based ones), recent works have shifted to explore the potential of application-specific integrated circuits (ASICs). Following this trend, F1 was proposed by Feldmann *et al.* in 2021 [26]. To pursue more practical acceleration for FHE schemes, F1 is the first programmable FHE accelerator with a dedicated architecture, which denotes it can support several FHE schemes with a large range of cryptographic parameters. F1 is not open-sourced.

At its core, F1 implements 16 computation clusters, each containing several primitive function units (FU), including NTT, modular multiplication, modular addition, and automorphism. Different units can compose high-level FHE operations such as key-switching. All the FUs are pipelined and vectorized to process 128 elements in each cycle. Therefore, polynomials with degrees that are multiples of 128 can be handled by sequentially feeding the operands to the pipeline. Specifically, to implement NTT with the 128-element processor, F1 leverages the 4-step FFT/NTT algorithm [54] to decompose a NTT into multiple vectorized operations with much fewer input elements. F1 tries to minimize data movement overhead by proposing a hierarchical storage system. The off-chip high-bandwidth memory (HBM) is the global memory that directly interacts with the CPU server. A 64MB scratchpad built on 16 banks of SRAM is designed as the on-chip cache and fetches data from HBM. The computation clusters communicate with the scratchpad and store the data used for the current operation in the limited vector registers. The communication between the scratchpad and clusters depends on a complex fully connected network (three  $16 \times 16$  crossbars).

The design of F1 has three main problems. First, the performance of F1 highly relies on sufficient parallelism among 16 clusters and efficient data movement in the accelerator, which places a high demand on the software compiler for operation scheduling. Considering that FHE computations vary significantly in terms of different workflows and different parameter settings, the compiler is complicated. However, F1 does not provide many details about its compiler. The second problem is that F1 uses a fixed key-switching algorithm (generalized key-switching with dnum = L + 1). If the multiplicative depth L of ciphertexts is high, keyswitching is extremely slow. Last, F1 only supports nonpacked bootstrapping [45], which only works for ciphertext with a single number packed in the polynomial and is far from practical in real-world applications. The reason is that the maximum degree of polynomial that F1 supports via the 4-step FFT/NTT algorithm is 16384, which is too small for F1 to execute fully packed bootstrapping [69] under 80-bit or 128-bit security level of RLWE. Without efficient bootstrapping, F1 struggles to evaluate deep arithmetic functions.

#### 4.1.9. CraterLake

CraterLake was proposed by Samardzic et al. in 2022 [27]. It is not an open-source project. CraterLake is a followup to F1 and targets unbounded-depth homomorphic multiplications. Consequently, CraterLake uses the fully packed bootstrapping [69] to refresh the multiplicative depth of ciphertexts. To balance the frequency of bootstrapping and the size of ciphertexts, the authors of CraterLake chose the number of multipliers required per homomorphic multiplication as the criterion to evaluate the overall computational complexity of an FHE program. Moreover, the authors followed the idea of the vectorized unit in F1 and logically designed the whole accelerator as a single 2048-element vectorized processor to handle the large ciphertext, which requires the maximum degree of polynomial and multiplicative depth to be 65536 and 60 respectively. However, such a design can not be naively implemented on ASIC considering the complex combinational and sequential logic circuits. Thus, in CraterLake, the 2048-element processor is physically composed of eight 256-element groups.

Similar to F1 [26], CraterLake implements several functional units (FU) in each 256-element group. The difference is the two additional FUs in CraterLake, *i.e.*, Change-RNSbase (CRB) and switching key generator (KSHGen), and both units are introduced to further optimize key-switching. The CRB unit, which mainly executes modular multiplication and modular summation, is specifically designed to accelerate fast basis conversion when dnum is relatively small (*e.g.*, dnum = 1). To save additional memory space, the KSHGen unit generates switching keys on the fly, which were pre-computed and cached in the device memory in previous works, such as F1 [26] and HEAX [19].

Another major difference between CraterLake and previous works is that CraterLake adopts CLP as the data parallelism scheme rather than rPLP due to the increasing workload of fast basis conversion. With the CLP scheme, CraterLake only processes one polynomial operation simultaneously, and the coefficients of polynomials are distributed to the 256-element groups with a static distribution strategy, leading to three significant advantages. First, as a particular coefficient can only be assigned to a specific group, the complex  $16 \times 16$  crossbar in F1, which deals with the dynamic data exchange between on-chip scratchpad and groups, is no longer required. Data movement caused by operations among different polynomials can be eliminated as well. Second, the parallelism of CraterLake is not influenced by the varying multiplicative depth or numbers of concurrent operations, which maintains the performance throughout an entire FHE program and simplifies the software scheduler.

Although CraterLake is programmable enough to support different cryptographic parameters, the FUs, especially NTT units, may face functionality limitations or resource underutilization because of the static pipelined circuit, as we mentioned in §3.1.4. The computation resources in Crater-Lake can be fully utilized if and only if n = 65536, which is a common problem for works containing a pipelined NTT circuit (*e.g.*, F1 and ARK). Besides, in CraterLake, the CRB unit occupies 34% on-chip area, reducing the performance of

other operations, like NTT. Therefore, in the scenarios where larger dnum is optimal, the CRB units are underutilized and CraterLake cannot deliver sufficient acceleration.

#### 4.1.10. BTS

BTS was proposed by Kim *et al.* in 2022 [28]. It is a followup to  $100 \times$  and shares similar design goals with CraterLake, *i.e.*, a bootstrappable and programmable hardware architecture for word-wise FHE.

Because of the similar goal, many optimization approaches in BTS are close to those in CraterLake, including the basis conversion unit, CLP parallelism scheme and on-the-fly data generation. The most significant difference between CraterLake and BTS is the pattern in which they place basic functional units, thus leading to different implementations of basic operations. The architecture of BTS is close to that of modern GPUs, consisting of a  $64 \times 32$ two-dimensional (2D) array of processing elements (PE). Each PE comprises a basis conversion unit and a 2-point NTT/iNTT unit (i.e., a single butterfly unit). The 2048 PEs are interconnected via vertical and horizontal crossbars. An example of 131072-point NTT with the architecture works as follows. First, each PE performs a 64-point NTT independently. Second, 64 PEs in the same row perform a 64-point NTT with data synchronization through horizontal crossbars. Third, 32 PEs in the same column perform a 32-point NTT with data synchronization through vertical crossbars. This process generalized the 4-step FFT/NTT algorithm. Each PE plays a similar role to the GPU threads, and the crossbars accomplish thread synchronization. To facilitate the unique computational pattern, the authors placed multiple smaller local scratchpads in each PE instead of putting a large global one that communicates with all the PEs. The local scratchpad is directly connected with HBM and eliminates additional overhead from the hierarchical data distribution.

However, a potential drawback of BTS is that the 2D array structure cannot execute NTT with a fully pipelined workflow. As mentioned before, NTT operation in BTS consists of three sequential steps and all the PEs are involved in each step. Therefore, no pipeline parallelism can be achieved if multiple NTT operations should be processed, which may decrease the end-to-end performance. Similar to F1 and CrateLake, BTS is not open-sourced.

#### 4.1.11. ARK

ARK was proposed by Kim *et al.* in 2022 [29]. It consists of four vectorized processing clusters, similar to the 256element groups of CraterLake. Each cluster contains several primary operation units. Different from previous works that target accelerating bootstrapping only from hardware angle [27], [28], ARK is the first work that analyzes and modifies the bootstrapping algorithm, thus achieving an algorithm and hardware co-design to optimize the accelerator's performance. Specifically, the authors proposed multihop homomorphic rotation and on-the-fly residue extension to significantly reduce the size of switching keys and plaintexts used in bootstrapping, respectively, which lowers the difficulties in hardware design.

As mentioned earlier in this section, the workload of

basis conversion is heavy in bootstrapping, making CLP a preferred parallelism scheme for bootstrappable accelerators like CraterLake and BTS. But CLP also introduces communication overhead in NTT. According to ARK, NTT and basis conversion separately take up 54.8% and 34.2% of the overall computational workload given practical bootstrappable parameters. Since both functions are crucial to end-to-end performance, ARK leverages both schemes, *i.e.*, CLP for basis conversion and rPLP for the other operations including NTT. A switching mechanism between the two data distribution patterns is also used by ARK, especially for functions that contain both NTT and basis conversion, such as key-switching. In ARK's paper, this hybrid parallelism is proven more efficient than rPLP, but whether it is better than CLP is not discussed.

## 4.2. Observations on Evolution of Existing Works

Based on the above survey, we make the following observations on the evolution of existing FHE accelerators and the underlying connection among them. We also illustrate how they address particular challenges mentioned in §3.

From General Hardware to Application-Specific Integrated Circuit: Most latest FHE accelerators choose Application-Specific Integrated Circuit (ASIC) as their hardware platform for two reasons. First, general-purpose hardware, such as CPU and GPU, suffer from fixed architecture and limited on-chip memory, which fail to address the challenges mentioned in §3. Second, FPGA suffers from relatively limited programmable resources and low operational frequency, which restricts it from reaching high performance.

The advantage of applying ASICs to FHE accelerators is that designers can optimize the hardware architecture with high flexibility and utilize state-of-the-art technologies of the computer architecture community according to the requirement of FHE computations. Recent ASIC-based works adopt specially designed architecture, such as crossbars, to enable complex polynomial operations described in §3.1.2. They also use high-speed memory, including HBM and SRAM, to accelerate the intensive memory access mentioned in §3.1.3 and §3.3. While promising, this design choice also introduces several problems: 1) most modern ASIC-based FHE accelerators are expensive to produce, especially when the chip size is up to hundreds of square millimeters, e.g., a 14nm ASIC of 100mm<sup>2</sup> requires millions of US dollars to tape out; 2) most of these ASIC-based solutions are not open-sourced due to reasons such as IP restrictions, making them difficult to reproduce, thus undesirable for the research community.

**Software/Hardware Co-design:** Early FHE accelerators perform certain simple operations, such as NTT, after receiving a single instruction from upper-layer software without any instruction scheduling [19], [30]. Since these operations follow the static hardware workflow, which cannot be dynamically adapted, the accelerators suffer from limited generality (§3.1.4). Moreover, when accelerating an end-to-end application containing massive high-level operations such as bootstrapping (§3.3), the overhead of frequent interactions

between software and hardware cannot be ignored.

To overcome these problems, recent FHE accelerators adopt software/hardware co-designed approaches. Specifically, they first leverage hardware-friendly algorithms to divide the hardware resources into multiple computation clusters which can be independently scheduled [26]–[29]. Furthermore, the designers can make extensive use of the software's flexibility to apply adaptations according to the specific application to efficiently support complicated endto-end applications.

However, the current software design of FHE accelerators fails to leverage knowledge from some of the latest techniques, *e.g.*, various FHE compilers [34], [35]. We identify it as a potential future direction in §6.3.

**Enhanced Programmability:** Programmability, *i.e.*, supporting different cryptographic parameters without hardware architecture reconfiguration, is not achieved in most of the early works due to the following two reasons. First, early works rarely focus on end-to-end acceleration for real-world applications. Thus, they do not have such a need. Second, programmability is not easy to achieve, considering the complexity of FFT/NTT.

Recent FHE accelerators have begun to adopt the 4step FFT/NTT algorithm not only to increase parallelism but also to improve the architecture's programmability to handle different parameters, specifically various n. Readers may refer to Appendix B.2 for more details. The improved programmability alleviates the generality limitation in polynomial computations described in §3.1.4, although the problems are not eradicated. Currently, an unresolved issue is the generality challenge in key-switching (§3.2). Recent works focusing on deep calculations show unoptimized performance in shallow tasks. As described in [27], in shallow benchmarks without bootstrapping (L is between 4 and 8), CaterLake is slower than F1 due to the underutilization of the basis conversion units that occupy a large proportion of hardware resources.

Unlimited Depth of Operations: Interestingly, the latest few works (i.e., CraterLake [27], BTS [28], and ARK [29]) show a similar tendency of accelerating bootstrapping to achieve unlimited 'fully' homomorphic operations. This common goal leads to the convergence of multiple design choices. First, optimized algorithms, including 4-step FFT/NTT and generalized key-switching, are preferred because of their advantages in handling large ciphertexts. Recent studies are willing to allocate a considerable amount of hardware resources to corresponding structures such as global transpose and fast basis conversion. Second, recent works invest much effort in eliminating the memory access bottleneck. They tend to reduce the memory overhead at the cost of introducing extra calculations, like on-thefly generation of essential parameters. Last but not least, since all these works have emphasized the importance of basis conversion, CLP is becoming the primary parallelism scheme, contrasting with rPLP in the earlier designs.

## 5. Evaluation

#### 5.1. Evaluation Methodology

In this section, we provide a quantitative comparison of these existing FHE accelerators. As introduced in §4, some FHE accelerators are open-sourced. We will use our testbed to reproduce the results of some representative ones. Since these open-sourced FHE accelerators do not provide direct support for end-to-end algorithms, we mainly evaluate how they accelerate the performance of NTT and key-switching. In this paper, we evaluate HEXL, HEXL-FPGA, and  $100 \times$  as word-wise FHE schemes. We also evaluate two bit-wise FHE accelerators: cuFHE and nuFHE.

Second, since some latest FHE accelerators are not opensourced, we will directly use the results from their original papers. For example, the paper of HEAX and F1 provides its performance of accelerating NTT and key-switching, thus we will align these results with our testbed results. Moreover, some latest ASIC-based accelerators, such as F1, BTS and ARK, also provide end-to-end performance results of accelerating two real-world applications, *i.e.*, ResNet20 [43] and Logistic Regression [70], we will also include these results in our paper.

**Testbed settings:** We use one X86 server as our testbed. The server is equipped with an Intel(R) Xeon(R) Gold 5115 CPU running at 2.40GHz and 128GB RAM. The CPU supports AVX-512 FMA [64]. We also use Intel PAC D5005 Acceleration Card with an Intel Stratix 10 GX FPGA [71] to reproduce the results of HEXL-FPGA. For all accelerators that require the GPU as their target hardware, we use NVIDIA V100 GPU with 32GB RAM for evaluation [67].

#### 5.2. Evaluation Results

**NTT/FFT.** First, we will evaluate the performance of word-wise FHE accelerators. Figure 5 shows the NTT performance of HEXL, HEXL-FPGA,  $100\times$ , HEAX and F1. Please note that the performance of HEXL, HEXL-FPGA and  $100\times$  is measured on our testbed, while the performance results of HEAX and F1 are from their original paper. We also run SEAL [33] without any accelerators to demonstrate the baseline performance (denoted as No Acc in the Figure). In this evaluation, we use three settings of n, *i.e.*, n = 4096, 8192, 16384. Similar to HEAX, we set the bit-width of polynomial coefficients in NTT to 52 for evaluation [19]. But F1 chooses the bit-width of 32 as it is the largest word size in F1. Theoretically, the performance of F1 should be slightly worse if the bit-width is 52.

We mainly have the following observations. 1) when the FHE accelerators leverage more advanced hardware technologies, the performance is largely improved. For example, the CPU-based accelerator, HEXL, can only achieve up to  $3.0 \times$  acceleration ratio, while ASIC-based accelerators, F1, can achieve up to  $20546.9 \times$  acceleration ratio. 2) contradicting our common wisdom, specific-designed hardware-based solutions do not always yield better performance than general hardware-based ones. For example, HEXL-FPGA and HEAX cannot achieve a better acceleration ratio over  $100 \times$ . The core reason is that HEXL-FPGA and HEAX adopt FPGA as their hardware platform, which suffers from



Figure 5: Performance of NTT. The performance of HEXL, HEXL-FPGA and  $100 \times$  is measured on our testbed while the performance results of HEAX and F1 are from their original paper.



Figure 6: Performance of key-switching. The performance of HEXL, HEXL-FPGA and  $100 \times$  is measured on our testbed while the performance results of HEAX and F1 are from their original paper. HEXL-FPGA does not support Setting3.

the aforementioned problems such as limited programmable resources and low working frequency. Precisely, V100 GPU in our evaluation has the peak performance of ~ 250 INT8 TOPS with tensor core [67], which is ~  $10 \times$  better than Stratix 10 FPGA [72] used in HEAX and HEXL-FPGA.

Figure 7 shows the performance of NAND gate achieved by cuFHE and nuFHE on our testbed. The NAND gate is a typical example that includes bootstrapping in bit-wise FHE schemes, such as TFHE. Polynomial multiplication in TFHE can be accelerated with either NTT or FFT, and we mark the schemes used by different libraries in the Figure. For the baseline (No Acc), we run the TFHE software libraries [73].

From our evaluation, we can observe that nuFHE and cuFHE can achieve  $170.6 \times$  and  $186.2 \times$  acceleration with NTT implementation, respectively. Moreover, nuFHE also supports using FFT to accelerate the NAND gate, which can achieve up to  $432.0 \times$  acceleration.

**Key-switching.** In our evaluation, we use three settings. Setting1: n = 4096, L = 1,  $\log(P \cdot Q) = 109$ , dnum = 2, Setting2: n = 8192, L = 3,  $\log(P \cdot Q) = 218$ , dnum = 4, Setting3: n = 16384, L = 7,  $\log(P \cdot Q) = 438$ , dnum = 8. Figure 6 shows the results. We have similar results as previous FFT/NTT experiments. Worth noting, HEXL-FPGA cannot perform key-switching with the most complicated setting, *i.e.* setting3, which confirms the potential drawback as discussed in §4.1.6.

**End-to-end applications.** Figure 8 shows the end-to-end performance over two applications achieved by three ASIC-based solutions, *i.e.* F1, BTS and ARK. Since these solutions are not open-sourced, we use the results from their original papers. We do not include the results of ResNet20 with F1 and  $100 \times$  since they are not presented in the original papers. The results reveal that: 1) As discussed in §4.1.8, in



NAND gate with NTT/FFT.

end-to-end applications.

a task requiring large multiplicative depth, F1 shows critical performance deficiencies despite its significant performance of NTT. For example, it takes F1 1024ms to execute a single iteration of LR, which is even slower than  $100 \times (775 \text{ ms})$ ; 2) For deep benchmarks, accelerators that support efficient bootstrapping achieve better performance. For example, BTS and ARK are separately  $36.0 \times$  and  $132.7 \times$  faster than F1 in LR. As ARK proposes further improvement over bootstrapping, including algorithm optimization and architecture co-design, it achieves better performance when bootstrapping dominates the workflow. In ResNet20, where bootstrapping takes up to 76.2% of the total workload, ARK achieves a  $6.5 \times$  speedup compared to BTS.

## 6. Discussion on Future Directions

Inspired by the above qualitative ( $\S4$ ) and quantitative ( $\S5$ ) studies, in this section we discuss some future directions of designing and implementing FHE accelerators.

#### 6.1. Application-driven Design Approach

Most existing FHE accelerators are trying to build increasingly powerful accelerators by leveraging more advanced architecture technologies, such as better semiconductor manufacturing processes [19], [26]–[29]. While such design methods can largely improve the performance of FHE schemes, they also become more unaffordable, especially for these ASIC-based solutions.

We envision that a future opportunity for FHE accelerators could be the application-driven design approach. The reason is that for some practical FHE applications, the performance is determined not only by the accelerators with extremely high FHE computing performance, but also by the whole architecture stack that customizes with the application. For example, FHE is widely adopted in private set intersection (PSI), and a recent work called INSPIRE codesigns the storage architecture with the FHE accelerator to improve the end-to-end performance of the PSI application by minimizing the overhead of data movement between the storage controller and FHE accelerators [74]. We believe such an application-driven design approach could better balance the application requirements, costs, and design difficulties, providing a promising direction for the future.

## 6.2. Supporting both Word-wise & Bit-wise FHE Schemes

As discussed in §2, word-wise FHE is more suitable for polynomial evaluation, while bit-wise FHE is preferred for non-polynomial evaluation. However, real-world applications, such as machine learning tasks, require both polynomial and non-polynomial evaluations to be effective. Moreover, ciphertexts switch between the two forms (e.g., CKKS and FHEW in [38]) is extremely complicated, thus timeconsuming. Therefore, designing a FHE accelerator that can simultaneously support both word-wise and bit-wise FHE schemes, and further perform efficient ciphertext switch and evaluation is important for real-world applications, which is worthy of future investigation.

## 6.3. Enhanced Software/Hardware Co-design

In recent works, the software has been considered an important part of the design. Some works, such as F1 [26]. CraterLake [27], use compilers for a software/hardware codesign solution. However, they do not provide a comprehensive description of the software design. We believe fully functional compilers should be co-designed with general FHE compilers, such as EVA [34]. As the FHE compiler SoK paper [32] indicates, general FHE compilers can optimize FHE programs based on the cost model of FHE schemes. We believe these general FHE compilers could be integrated with FHE accelerator compilers for improved end-to-end performance by considering a cost model from a hardware-level perspective, which points to a potential future direction.

## 6.4. From Scale-up to Scale-out

Current FHE acceleration solutions mostly focus on scaleup, *i.e.*, improving the performance of a single accelerator vertically. However, scale-out, i.e., connecting multiple FHE accelerators via networking horizontally, should also be an effective way to further improve the performance of FHE applications. As discussed in §3, the data inflation problem has posed a challenge to efficient data movement between on-chip and off-chip memories. This challenge also exists when we connect multiple FHE accelerators via networking in a scale-out manner. Thus, FHE accelerator and network co-design, e.g., integrating both FHE acceleration functions and high-performant networking controllers on the same chip, should be a potential research direction in the future.

## 6.5. Accelerating NTRU-based Schemes

In this paper, we do not include accelerators for traditional NTRU-based solutions such as [75] since these NTRUbased schemes were believed to be vulnerable to attacks and thus impractical [76]. However, recently various modern NTRU-based solutions have been proposed to overcome its original security problems [77]. Moreover, NTRU-based solutions have the advantages of low memory consumption and fast computation. Therefore, we believe that designing practical accelerators for modern NTRU-based solutions also deserves future exploration.

# 7. Conclusion

In this paper, we presented a comprehensive systematization of knowledge by both qualitatively and quantitatively studying 11 existing FHE accelerators and observing their evolution process. We further discussed some potential future directions of designing and implementing FHE accelerators, which we hope could inspire the research community.

## References

- [1] H. Tian, C. Zeng, Z. Ren, D. Chai, J. Zhang, K. Chen, and Q. Yang, "Sphinx: Enabling privacy-preserving online learning over the cloud," in 2022 IEEE Symposium on Security and Privacy, SP 2022, Los Alamitos, CA, USA, May, 2022. IEEE Computer Society, 2022, pp. 1135–1149. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/SP46214.2022.00066
- [2] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 12:1–12:19, 2019. [Online]. Available: https://doi.org/10.1145/3298981
- [3] P. Mohassel and Y. Zhang, "Secureml: A system for scalable privacy-preserving machine learning," in 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017. IEEE Computer Society, 2017, pp. 19–38. [Online]. Available: https://doi.org/10.1109/SP.2017.12
- [4] M. H. Mughees, H. Chen, and L. Ren, "Onionpir: Response efficient single-server PIR," in CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, Y. Kim, J. Kim, G. Vigna, and E. Shi, Eds. ACM, 2021, pp. 2292–2306. [Online]. Available: https://doi.org/10.1145/3460120.3485381
- [5] S. Sav, A. Pyrgelis, J. R. Troncoso-Pastoriza, D. Froelicher, J. Bossuat, J. S. Sousa, and J. Hubaux, "POSEIDON: privacypreserving federated neural network learning," in 28th Annual Network and Distributed System Security Symposium, NDSS 2021, virtually, February 21-25, 2021. The Internet Society, 2021. [Online]. Available: https://www.ndss-symposium.org/ndss-paper/ poseidon-privacy-preserving-federated-neural-network-learning/
- [6] J. Lee, H. Kang, Y. Lee, W. Choi, J. Eom, M. Deryabin, E. Lee, J. Lee, D. Yoo, Y. Kim, and J. No, "Privacy-preserving machine learning with fully homomorphic encryption for deep neural network," *IEEE Access*, vol. 10, pp. 30 039–30 054, 2022. [Online]. Available: https://doi.org/10.1109/ACCESS.2022.3159694
- H. Chen, K. Laine, and P. Rindal, "Fast private set intersection from homomorphic encryption," in *Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security*, *CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017*, B. Thuraisingham, D. Evans, T. Malkin, and D. Xu, Eds. ACM, 2017, pp. 1243–1255. [Online]. Available: https://doi.org/10.1145/ 3133956.3134061
- [8] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, "(leveled) fully homomorphic encryption without bootstrapping," *ACM Trans. Comput. Theory*, vol. 6, no. 3, pp. 13:1–13:36, 2014. [Online]. Available: https://doi.org/10.1145/2633600
- [9] J. Fan and F. Vercauteren, "Somewhat practical fully homomorphic encryption," *IACR Cryptol. ePrint Arch.*, p. 144, 2012. [Online]. Available: http://eprint.iacr.org/2012/144
- [10] J. H. Cheon, A. Kim, M. Kim, and Y. S. Song, "Homomorphic encryption for arithmetic of approximate numbers," in Advances in Cryptology - ASIACRYPT 2017 - 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3-7, 2017, Proceedings, Part I, ser. Lecture Notes in Computer Science, T. Takagi and T. Peyrin, Eds., vol. 10624. Springer, 2017, pp. 409–437. [Online]. Available: https://doi.org/10.1007/978-3-319-70694-8\_15
- [11] N. P. Smart and F. Vercauteren, "Fully homomorphic SIMD operations," *Des. Codes Cryptogr.*, vol. 71, no. 1, pp. 57–81, 2014. [Online]. Available: https://doi.org/10.1007/s10623-012-9720-4
- [12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, *Introduction to Algorithms, 3rd Edition.* MIT Press, 2009. [Online]. Available: http://mitpress.mit.edu/books/introduction-algorithms
- [13] F. Boemer, S. Kim, G. Seifu, F. D. M. de Souza, and V. Gopal, "Intel HEXL: accelerating homomorphic encryption with intel AVX512-IFMA52," in WAHC '21: Proceedings of the 9th on Workshop on Encrypted Computing & Applied Homomorphic Cryptography, Virtual Event, Korea, 15 November 2021. WAHC@ACM, 2021, pp. 57–62. [Online]. Available: https://doi.org/10.1145/3474366.3486926

- [14] "A gpu implementation of fully homomorphic encryption on torus," https://github.com/nucypher/nufhe, accessed: 2022-07-07.
- [15] W. Jung, S. Kim, J. H. Ahn, J. H. Cheon, and Y. Lee, "Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus," *IACR Trans. Cryptogr. Hardw. Embed. Syst.*, vol. 2021, no. 4, pp. 114–148, 2021. [Online]. Available: https://doi.org/10.46586/tches.v2021.i4.114-148
- [16] "CUDA toolkit," https://developer.nvidia.com/cuda-toolkit, accessed: 2022-07-07.
- [17] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, "Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds," in Advances in Cryptology - ASIACRYPT 2016 -22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, December 4-8, 2016, Proceedings, Part I, ser. Lecture Notes in Computer Science, J. H. Cheon and T. Takagi, Eds., vol. 10031, 2016, pp. 3–33. [Online]. Available: https://doi.org/10.1007/978-3-662-53887-6\_1
- [18] "Intel homomorphic encryption (HE) acceleration library for FPGAs," https://github.com/intel/hexl-fpga, accessed: 2022-07-08.
- [19] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, "HEAX: an architecture for computing on encrypted data," in ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, J. R. Larus, L. Ceze, and K. Strauss, Eds. ACM, 2020, pp. 1295–1309. [Online]. Available: https://doi.org/10.1145/3373376.3378523
- [20] D. B. Cousins, J. Golusky, K. Rohloff, and D. Sumorok, "An FPGA co-processor implementation of homomorphic encryption," in *IEEE High Performance Extreme Computing Conference, HPEC 2014, Waltham, MA, USA, September 9-11, 2014.* IEEE, 2014, pp. 1–6. [Online]. Available: https://doi.org/10.1109/HPEC.2014.7040950
- [21] Y. Doröz, E. Öztürk, and B. Sunar, "Accelerating fully homomorphic encryption in hardware," *IEEE Trans. Computers*, vol. 64, no. 6, pp. 1509–1521, 2015. [Online]. Available: https://doi.org/10.1109/ TC.2014.2345388
- [22] V. Migliore, C. Seguin, M. M. Real, V. Lapotre, A. Tisserand, C. Fontaine, G. Gogniat, and R. Tessier, "A high-speed accelerator for homomorphic encryption using the karatsuba algorithm," ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, pp. 138:1–138:17, 2017. [Online]. Available: https://doi.org/10.1145/3126558
- [23] D. B. Cousins, K. Rohloff, and D. Sumorok, "Designing an fpgaaccelerated homomorphic encryption co-processor," *IEEE Trans. Emerg. Top. Comput.*, vol. 5, no. 2, pp. 193–206, 2017. [Online]. Available: https://doi.org/10.1109/TETC.2016.2619669
- [24] A. C. Mert, E. Öztürk, and E. Savas, "Design and implementation of encryption/decryption architectures for BFV homomorphic encryption scheme," *IEEE Trans. Very Large Scale Integr. Syst.*, vol. 28, no. 2, pp. 353–362, 2020. [Online]. Available: https:// doi.org/10.1109/TVLSI.2019.2943127
- [25] I. Kuon and J. Rose, "Measuring the gap between fpgas and asics," in Proceedings of the ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA 2006, Monterey, California, USA, February 22-24, 2006, S. J. E. Wilton and A. DeHon, Eds. ACM, 2006, pp. 21–30. [Online]. Available: https://doi.org/10.1145/1117201.1117205
- [26] N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. G. Dreslinski, C. Peikert, and D. Sánchez, "F1: A fast and programmable accelerator for fully homomorphic encryption," in *MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18-22, 2021.* ACM, 2021, pp. 238–252. [Online]. Available: https://doi.org/10.1145/3466752.3480070
- [27] N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sánchez, "Craterlake: a hardware accelerator for efficient unbounded computation on encrypted data," in *ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022*, V. Salapura, M. Zahran, F. Chong, and L. Tang, Eds. ACM, 2022, pp. 173–187. [Online]. Available: https://doi.org/10.1145/3470496.3527393

- [28] S. Kim, J. Kim, M. J. Kim, W. Jung, J. Kim, M. Rhu, and J. H. Ahn, "BTS: an accelerator for bootstrappable fully homomorphic encryption," in *ISCA '22: The 49th Annual International Symposium* on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, V. Salapura, M. Zahran, F. Chong, and L. Tang, Eds. ACM, 2022, pp. 711–725. [Online]. Available: https://doi.org/10.1145/3470496.3527415
- [29] J. Kim, G. Lee, S. Kim, G. Sohn, J. Kim, M. Rhu, and J. H. Ahn, "ARK: fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse," *CoRR*, vol. abs/2205.00922, 2022. [Online]. Available: https://doi.org/10.48550/ arXiv.2205.00922
- [30] S. S. Roy, F. Turan, K. Järvinen, F. Vercauteren, and I. Verbauwhede, "Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data," in 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Washington, DC, USA, February 16-20, 2019. IEEE, 2019, pp. 387–398. [Online]. Available: https://doi.org/10.1109/ HPCA.2019.00052
- [31] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, "A survey on homomorphic encryption schemes: Theory and implementation," *ACM Comput. Surv.*, vol. 51, no. 4, pp. 79:1–79:35, 2018. [Online]. Available: https://doi.org/10.1145/3214303
- [32] A. Viand, P. Jattke, and A. Hithnawi, "Sok: Fully homomorphic encryption compilers," in 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021. IEEE, 2021, pp. 1092–1108. [Online]. Available: https://doi.org/ 10.1109/SP40001.2021.00068
- [33] "Microsoft SEAL," https://github.com/microsoft/SEAL, accessed: 2022-07-12.
- [34] R. Dathathri, B. Kostova, O. Saarikivi, W. Dai, K. Laine, and M. Musuvathi, "EVA: an encrypted vector arithmetic language and compiler for efficient homomorphic computation," in *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, A. F. Donaldson and E. Torlak, Eds. ACM, 2020, pp. 546–561. [Online]. Available: https://doi.org/10.1145/ 3385412.3386023*
- [35] E. Chielle, O. Mazonka, N. G. Tsoutsos, and M. Maniatakos, "E<sup>3</sup>: A framework for compiling C++ programs with encrypted operands," *IACR Cryptol. ePrint Arch.*, p. 1013, 2018. [Online]. Available: https://eprint.iacr.org/2018/1013
- [36] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in Advances in Cryptology - EUROCRYPT '99, International Conference on the Theory and Application of Cryptographic Techniques, Prague, Czech Republic, May 2-6, 1999, Proceeding, ser. Lecture Notes in Computer Science, J. Stern, Ed., vol. 1592. Springer, 1999, pp. 223–238. [Online]. Available: https://doi.org/10.1007/3-540-48910-X\_16
- [37] R. L. Rivest, A. Shamir, and L. M. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," *Commun. ACM*, vol. 21, no. 2, pp. 120–126, 1978. [Online]. Available: http://doi.acm.org/10.1145/359340.359342
- [38] W. Lu, Z. Huang, C. Hong, Y. Ma, and H. Qu, "PEGASUS: bridging polynomial and non-polynomial evaluations in homomorphic encryption," in 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021. IEEE, 2021, pp. 1057–1073. [Online]. Available: https://doi.org/10.1109/ SP40001.2021.00043
- [39] L. Ducas and D. Micciancio, "FHEW: bootstrapping homomorphic encryption in less than a second," in Advances in Cryptology -EUROCRYPT 2015 - 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April 26-30, 2015, Proceedings, Part I, ser. Lecture Notes in Computer Science, E. Oswald and M. Fischlin, Eds., vol. 9056. Springer, 2015, pp. 617–640. [Online]. Available: https://doi.org/10.1007/978-3-662-46800-5\_24

- [40] C. Boura, N. Gama, M. Georgieva, and D. Jetchev, "CHIMERA: combining ring-lwe-based fully homomorphic encryption schemes," *J. Math. Cryptol.*, vol. 14, no. 1, pp. 316–338, 2020. [Online]. Available: https://doi.org/10.1515/jmc-2019-0026
- [41] V. Lyubashevsky, C. Peikert, and O. Regev, "On ideal lattices and learning with errors over rings," in Advances in Cryptology -EUROCRYPT 2010, 29th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Monaco / French Riviera, May 30 - June 3, 2010. Proceedings, ser. Lecture Notes in Computer Science, H. Gilbert, Ed., vol. 6110. Springer, 2010, pp. 1–23. [Online]. Available: https://doi.org/10.1007/978-3-642-13190-5\_1
- [42] C. Gentry, A. Sahai, and B. Waters, "Homomorphic encryption from learning with errors: Conceptually-simpler, asymptoticallyfaster, attribute-based," in Advances in Cryptology - CRYPTO 2013 - 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I, ser. Lecture Notes in Computer Science, R. Canetti and J. A. Garay, Eds., vol. 8042. Springer, 2013, pp. 75–92. [Online]. Available: https://doi.org/10.1007/978-3-642-40041-4\_5
- [43] J. Lee, H. Kang, Y. Lee, W. Choi, J. Eom, M. Deryabin, E. Lee, J. Lee, D. Yoo, Y. Kim, and J. No, "Privacy-preserving machine learning with fully homomorphic encryption for deep neural network," *IEEE Access*, vol. 10, pp. 30 039–30 054, 2022. [Online]. Available: https://doi.org/10.1109/ACCESS.2022.3159694
- [44] "HElib Country Lookup Example," https://github.com/homenc/HElib/ tree/master/examples/BGV\_country\_db\_lookup.
- [45] J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, "Bootstrapping for approximate homomorphic encryption," *IACR Cryptol. ePrint Arch.*, p. 153, 2018. [Online]. Available: http://eprint.iacr.org/2018/ 153
- [46] K. Han and D. Ki, "Better bootstrapping for approximate homomorphic encryption," in *Topics in Cryptology - CT-RSA 2020* - The Cryptographers' Track at the RSA Conference 2020, San Francisco, CA, USA, February 24-28, 2020, Proceedings, ser. Lecture Notes in Computer Science, S. Jarecki, Ed., vol. 12006. Springer, 2020, pp. 364–390. [Online]. Available: https://doi.org/ 10.1007/978-3-030-40186-3\_16
- [47] S. Halevi and V. Shoup, "Bootstrapping for helib," J. Cryptol., vol. 34, no. 1, p. 7, 2021. [Online]. Available: https://doi.org/ 10.1007/s00145-020-09368-7
- [48] H. Chen and K. Han, "Homomorphic lower digits removal and improved FHE bootstrapping," in Advances in Cryptology -EUROCRYPT 2018 - 37th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tel Aviv, Israel, April 29 - May 3, 2018 Proceedings, Part I, ser. Lecture Notes in Computer Science, J. B. Nielsen and V. Rijmen, Eds., vol. 10820. Springer, 2018, pp. 315–337. [Online]. Available: https://doi.org/10.1007/978-3-319-78381-9\_12
- [49] S. Foldes, "Symmetries of directed graphs and the chinese remainder theorem," J. Comb. Theory, Ser. B, vol. 28, no. 1, pp. 18–25, 1980. [Online]. Available: https://doi.org/10.1016/0095-8956(80)90053-2
- [50] "General data protection regulation," https://gdpr-info.eu, accessed: 2022-10-20.
- [51] "Palisade homomorphic encryption software library," https://palisadecrypto.org, accessed: 2022-07-31.
- [52] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex fourier series," *Mathematics of computation*, vol. 19, no. 90, pp. 297–301, 1965. [Online]. Available: https:// doi.org/10.1090/S0025-5718-1965-0178586-1
- [53] W. M. Gentleman and G. Sande, "Fast fourier transforms: for fun and profit," in American Federation of Information Processing Societies: Proceedings of the AFIPS '66 Fall Joint Computer Conference, November 7-10, 1966, San Francisco, California, USA, ser. AFIPS Conference Proceedings, vol. 29. AFIPS / ACM / Spartan Books, Washington D.C., 1966, pp. 563–578. [Online]. Available: https://doi.org/10.1145/1464291.1464352

- [54] D. H. Bailey, "Ffts in external or hierarchical memory," in Proceedings Supercomputing '89, Reno, NV, USA, November 12-17, 1989, F. R. Bailey, Ed. ACM, 1989, pp. 234–242. [Online]. Available: https://doi.org/10.1145/76263.76288
- [55] C. Gentry, S. Halevi, and N. P. Smart, "Homomorphic evaluation of the AES circuit," in Advances in Cryptology - CRYPTO 2012 - 32nd Annual Cryptology Conference, Santa Barbara, CA, USA, August 19-23, 2012. Proceedings, ser. Lecture Notes in Computer Science, R. Safavi-Naini and R. Canetti, Eds., vol. 7417. Springer, 2012, pp. 850–867. [Online]. Available: https://doi.org/10.1007/978-3-642-32009-5\_49
- [56] M. R. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine, K. E. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison, A. Sahai, and V. Vaikuntanathan, "Homomorphic encryption standard," *IACR Cryptol. ePrint Arch.*, p. 939, 2019. [Online]. Available: https:// eprint.iacr.org/2019/939
- [57] "Cuda homomorphic encryption library (cuhe)," https://github.com/ vernamlab/cuHE, accessed: 2022-07-30.
- [58] "Cuda-accelerated fully homomorphic encryption library (cufhe)," https://github.com/vernamlab/cuFHE, accessed: 2022-07-30.
- [59] W. Dai and B. Sunar, "cuhe: A homomorphic encryption accelerator library," in Cryptography and Information Security in the Balkans - Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, September 3-4, 2015, Revised Selected Papers, ser. Lecture Notes in Computer Science, E. Pasalic and L. R. Knudsen, Eds., vol. 9540. Springer, 2015, pp. 169–186. [Online]. Available: https://doi.org/10.1007/978-3-319-29172-7\_11
- [60] "Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit," https:// www.xilinx.com/products/boards-and-kits/ek-u1-zcu102-g.html.
- [61] F. Turan, S. S. Roy, and I. Verbauwhede, "HEAWS: an accelerator for homomorphic encryption on the amazon AWS FPGA," *IEEE Trans. Computers*, vol. 69, no. 8, pp. 1185–1196, 2020. [Online]. Available: https://doi.org/10.1109/TC.2020.2988765
- [62] "HElib," https://github.com/homenc/HElib, accessed: 2022-07-31.
- [63] "Intel HEXL," https://github.com/intel/hexl, accessed: 2022-11-13.
- [64] "Intel Advanced Vector Extensions 512 (Intel AVX-512)," https://www.intel.com/content/www/us/en/architecture-andtechnology/avx-512-overview.html.
- [65] A. Guermouche and A. Orgerie, "Thermal design power and vectorized instructions behavior," *Concurr. Comput. Pract. Exp.*, vol. 34, no. 2, 2022. [Online]. Available: https://doi.org/10.1002/ cpe.6261
- [66] D. G. Bailey, "The advantages and limitations of high level synthesis for FPGA based image processing," in *Proceedings of* the 9th International Conference on Distributed Smart Camera, Seville, Spain, September 8-11, 2015, R. Carmona-Galán and Á. Rodríguez-Vázquez, Eds. ACM, 2015, pp. 134–139. [Online]. Available: https://doi.org/10.1145/2789116.2789145
- [67] "NVIDIA V100 Datasheet," https://images.nvidia.com/content/ technologies/volta/pdf/volta-v100-datasheet-update-us-1165301r5.pdf.
- [68] S. Kim, W. Jung, J. Park, and J. H. Ahn, "Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus," *CoRR*, vol. abs/2012.01968, 2020. [Online]. Available: https://arxiv.org/abs/2012.01968
- [69] J. Bossuat, C. Mouchet, J. R. Troncoso-Pastoriza, and J. Hubaux, "Efficient bootstrapping for approximate homomorphic encryption with non-sparse keys," in Advances in Cryptology - EUROCRYPT 2021 - 40th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Zagreb, Croatia, October 17-21, 2021, Proceedings, Part I, ser. Lecture Notes in Computer Science, A. Canteaut and F. Standaert, Eds., vol. 12696. Springer, 2021, pp. 587–617. [Online]. Available: https://doi.org/10.1007/978-3-030-77870-5\_21

- [70] K. Han, S. Hong, J. H. Cheon, and D. Park, "Logistic regression on homomorphic encrypted data at scale," in *The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI* 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 2019, pp. 9466–9471. [Online]. Available: https://doi.org/10.1609/aaai.v33i01.33019466
- [71] "Intel FPGA PAC D5005," https://www.intel.com/content/www/us/ en/products/sku/193921/intel-fpga-pac-d5005/specifications.html, accessed: 2022-10-17.
- [72] "Intel Stratix 10 GX/SX Product Table," https://www.intel.com/ content/www/us/en/content-details/652478/intel-stratix-10-gxfpga-and-intel-stratix-10-sx-soc-fpga-family-overview-producttable.html.
- [73] "TFHE: Fast Fully Homomorphic Encryption Library over the Torus," https://github.com/tfhe/tfhe, accessed: 2022-11-17.
- [74] J. Lin, L. Liang, Z. Qu, I. Ahmad, L. Liu, F. Tu, T. Gupta, Y. Ding, and Y. Xie, "INSPIRE: in-storage private information retrieval via protocol and architecture co-design," in *ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022*, V. Salapura, M. Zahran, F. Chong, and L. Tang, Eds. ACM, 2022, pp. 102–115. [Online]. Available: https://doi.org/10.1145/3470496.3527433
- [75] W. Dai, Y. Doröz, and B. Sunar, "Accelerating NTRU based homomorphic encryption using gpus," in *IEEE High Performance Extreme Computing Conference, HPEC 2014, Waltham, MA, USA, September 9-11, 2014.* IEEE, 2014, pp. 1–6. [Online]. Available: https://doi.org/10.1109/HPEC.2014.7041001
- [76] É. Jaulmes and A. Joux, "A chosen-ciphertext attack against NTRU," in Advances in Cryptology - CRYPTO 2000, 20th Annual International Cryptology Conference, Santa Barbara, California, USA, August 20-24, 2000, Proceedings, ser. Lecture Notes in Computer Science, M. Bellare, Ed., vol. 1880. Springer, 2000, pp. 20–35. [Online]. Available: https://doi.org/10.1007/3-540-44598-6\_2
- [77] C. Bonte, I. Iliashenko, J. Park, H. V. L. Pereira, and N. P. Smart, "FINAL: faster FHE instantiated with NTRU and LWE," *IACR Cryptol. ePrint Arch.*, p. 74, 2022. [Online]. Available: https://eprint.iacr.org/2022/074

## Appendix A. NTT/FFT

Given a polynomial A of degree n as follows:

$$A(x) = \sum_{j=0}^{n-1} a_j x^j$$
(3)

The polynomial can be represented in the form of the vector of its coefficients as  $a = (a_1, a_2, ..., a_{n-1})$ . While the addition operation of two polynomials a and b (*i.e.*, element-wise addition of their coefficients vectors) is trivial, the multiplication of a and b (denoted as  $a \otimes b$  in our paper) is time-consuming with the computation complexity of  $O(n^2)$ . Since polynomial multiplication is the fundamental operation in FHE schemes, *e.g.*, encryption, homomorphic operations, *etc.*, such high computation complexity causes FHE schemes to be extremely inefficient.

To optimize polynomial multiplication, another representation of the polynomial — point-value representation — can be exploited. A polynomial of degree n can be represented by n distinct point-value tuples:  $\{(x_0, y_0), (x_1, y_1), \dots (x_{n-1}, y_{n-1})\}$ , where  $y_k = A(x_k)$  and  $k \in [0, n)$ . To achieve calculations between polynomials,

*e.g.*, A and B, the same group of  $x_k$  is chosen from both polynomials A and B. Under point-value representation, both addition and multiplication between two polynomials of degree n only require n point-wise operations on  $y_k$ . Therefore, the computation complexity is reduced to O(n).

However, the conversion from coefficient representation to point-value representation is still time-consuming and requires the calculation of  $y_k = A(x_k)$  for every  $x_k$ . The time complexity of evaluating each  $A(x_k)$  is O(n), thus leading to overall time complexity of  $O(n^2)$  for  $k \in [0, n)$ . The same complexity is needed for the inverse conversion.

To lower the time complexity of representation conversion, we choose special  $x_k$  values. For a polynomial of degree n-1, we use the roots of an equation  $\omega^n = 1$  as  $x_k$  values. The equation has n roots, denoted as  $e^{2\pi i k/n}$ , k = 0, 1, ..., n-1. They are usually written in simplified form as  $\omega_n^k = e^{2\pi i k/n}$ , which is the power of  $\omega_n = e^{2\pi i n/n}$ . With these roots, we can write Equation 3 as

$$A_{k} = A(\omega_{n}^{k}) = \sum_{j=0}^{n-1} a_{j}\omega_{n}^{kj}, k \in [0, n)$$
(4)

Equation 4 is called Discrete Fourier Transform (DFT). Our goal is to accelerate the transformation with the property of  $\omega_n$ .

For  $\omega_n$ , we have the following theorems:

$$\omega_{2n}^{2k} = \omega_n^k \tag{5}$$

$$\omega_n^{k+\frac{n}{2}} = -\omega_n^k \tag{6}$$

Given a polynomial  $A(x) = a_0 + a_1 \cdot x^1 + a_2 \cdot x^2 + \ldots + a_{n-1} \cdot x^{n-1}$ , we can re-arrange it to  $A(x) = (a_0 + a_2 \cdot x^2 + a_4 \cdot x^4 + \ldots + a_{n-2} \cdot x^{n-2}) + x(a_1 + a_3 \cdot x^2 + a_5 \cdot x^4 + \ldots + a_{n-1} \cdot x^{n-2})$ . Let  $A^{[0]}(x) = a_0 + a_2 \cdot x^1 + a_4 \cdot x^2 + \ldots + a_{n-2} \cdot x^{\frac{n-2}{2}}$  and  $A^{[1]}(x) = a_1 + a_3 \cdot x^1 + a_5 \cdot x^2 + \ldots + a_{n-1} \cdot x^{\frac{n-2}{2}}$ . Then we have  $A(x) = A^{[0]}(x^2) + xA^{[1]}(x^2)$ .

Therefore, we can use  $x_k = \omega_n^k$  to sample values. For  $k = 0, 1, \dots, n/2 - 1$ , we have

$$A(\omega_n^k) = A^{[0]}(\omega_n^{2k}) + \omega_n^k A^{[1]}(\omega_n^{2k})$$
(7)

Based on Theorems 5, the equation transforms into

$$A(\omega_n^k) = A^{[0]}(\omega_{\frac{k}{2}}^k) + \omega_n^k A^{[1]}(\omega_{\frac{k}{2}}^k)$$
(8)

Similarly, for k + n/2, we have:

$$A(\omega_n^{k+\frac{n}{2}}) = A^{[0]}(\omega_n^{2k+n}) + \omega_n^{k+\frac{n}{2}}A^{[1]}(\omega_n^{2k+n})$$
(9)

Based on Theorems 5 and 6, the equation now becomes:

$$A(\omega_n^{k+\frac{n}{2}}) = A^{[0]}(\omega_{\frac{n}{2}}^k) - \omega_n^k A^{[1]}(\omega_{\frac{n}{2}}^k)$$
(10)

Note that k and k + n/2 have covered all integers ranging from 0 to n - 1. Then the problem becomes a divide-and-conquer problem. The origin problem A is split into two subproblems of  $A^{[0]}$  and  $A^{[1]}$  whose degrees of polynomial are n/2. It requires n/2 extra multiplications to recover A from  $A^{[0]}$  and  $A^{[1]}$ . After that,  $A^{[0]}$  and  $A^{[1]}$  can be further divided into smaller subproblems following the same scheme. The recursion ends when the degree of the polynomial in the subproblem is 1. According to the master theorem in [12], the time complexity can be reduced to  $O(n \log n)$ .

The aforementioned process is called Fast Fourier Transform (FFT). When the degree of polynomials is n, we refer to the transform as an n-point FFT. The inverse operation of FFT is called iFFT, which can be implemented with a similar approach.

Number Theoretic Transforms (NTT) is similar to DFT but works with the finite field. Different from DFT in Equation 4, NTT is formulated as

$$A_k = \sum_{j=0}^{n-1} a_j g_n^{kj} \mod p, k \in [0, n),$$
(11)

where p is a prime, and  $g_n$  is the primitive n-th root of unity modulo p, which satisfies the following theorems:

$$g_{2n}^{2k} \equiv g_n^k \mod p \tag{12}$$

$$g_n^{k+\frac{n}{2}} \equiv -g_n^k \mod p \tag{13}$$

With Theorem 12 and 13, NTT can be accelerated following the same scheme we discussed in FFT, except for the multiplication and addition/subtraction, which need to be replaced by modular multiplication and modular addition/subtraction, respectively.

# Appendix B.

# 4-step FFT/NTT Algorithm

**B.1. Introduction to 4-step FFT/NTT Algorithm** In this section, we use the example of an *n*-point FFT to illustrate the workflow of the 4-step FFT/NTT algorithm. Recall the original FFT (DFT) in Equation 4. In FHE schemes, *n* is a power of two and can be expressed as the product of two numbers  $n = R \cdot C$ . Based on this property, we consider the *n* input numbers  $\{a_j, j \in [0, n)\}$  as an  $R \times C$  matrix. In this way, the workflow of the 4-step FFT is as follows.

- 1. Transpose the  $R \times C$  input matrix and get a new  $C \times R$  matrix. Perform FFT on each row of the  $C \times R$  matrix (*i.e.*, C independent R-point FFTs). We let matrix A' denote the results of FFTs.
- 2. Transpose matrix A' and get a new  $R \times C$  matrix A''.
- 3. Generate a  $R \times C$  twisting factor matrix  $F = [F_{i,j}]$ , where  $F_{i,j} = \omega_n^{ij}$ . Then perform dyadic multiplication between matrix A'' and matrix F. Let matrix A''' denote the multiplication result.
- 4. Perform FFT on each row of the  $R \times C$  matrix A'''(*i.e.*, R independent C-point FFTs). Transpose the result of FFT and get a new  $C \times R$  matrix, denoted as A.

A is the final result of the n-point FFT with some differences in the placement order of the result numbers

from the original FFT algorithm. The 4-step NTT algorithm follows a similar workflow with modular operations between integers. Since the order of coefficients does not affect the correctness of element-wise polynomial addition and multiplication, the 4-step FFT/NTT algorithm can be easily applied in FHE schemes to replace the original FFT/NTT algorithm. Actually, we could skip the transpose in the first step by storing the input numbers in column-major order, and skip the transpose in the fourth step because of the order-independent element-wise operations.

## B.2. Comparison between the 4-step FFT/NTT Algorithm and the Original FFT/NTT Algorithm

This section will show the pros and cons of the 4-step FFT/NTT algorithm by comparing it with the original FFT/NTT algorithm.

#### **B.2.1. Efficient and Practical Parallelism**

As mentioned in §3.1.2, the original FFT/NTT algorithm can hardly be accelerated via parallelism because of strict data dependency and high computation resource consumption. The 4-step FFT/NTT algorithm alleviates both problems.

Less Data Dependency: According to the aforementioned workflow, the 4-step FFT/NTT algorithm decomposes the *n*-point FFT/NTT operation into C independent R-point FFT/NTTs in the first step, a global transpose in the second step, n independent multiplications in the third step, and Rindependent C-point FFT/NTTs in the last step. Although the four steps should be executed one by one, the parallelism in these steps (except for the second step) is fully exploited as these independent operations can be performed simultaneously. Since R and C can be quite large (e.g., when n = 65536, R = C = 256), the parallelism is high enough to fully utilize the performance capacity of hardware accelerators. Actually, the global data dependency problem in the original FFT/NTT algorithm still exists in the second step when transposing the matrix. Consequently, a large transpose network is required. Recent works prefer such a design since 1) the transpose network is much easier to design than FFT/NTT due to its fixed workflow; 2) the transpose network can be reused in other operations such as permutation [26], which improves the resource utilization.

Low Resource Consumption: Compared to the original FFT/NTT, the total computation workload in the 4-step FFT/NTT is not reduced. But the original algorithm is decomposed into multiple smaller ones with much fewer input numbers (we choose R and C close to  $\sqrt{n}$  in many cases). Therefore, instead of designing a fully pipelined *n*-point FFT/NTT circuit, we can combine multiple small pipelined FFT/NTT units (*i.e.*, *R*-point and *C*-point FFT/NTT units) to achieve the same arithmetic function, reducing the hardware design complexity. To reduce hardware resource consumption, we can reuse *R*-point FFT/NTT units rather than implement all the units as dedicated ones (*i.e.*, *R C*-point FFT/NTT units). Therefore, the 4-step FFT/NTT algorithm requires much fewer hardware resources to achieve pipeline parallelism.

#### **B.2.2.** Increased Memory Overhead

Although promising, the 4-step FFT/NTT also causes increased memory overhead due to pre-computed parameters. In addition to the input data and calculation results, we need to store the twiddle factors for *n*-point FFT/NTT in the original algorithm. Larger *n* leads to more twiddle factors. Although the number of twiddle factors is reduced in the 4step FFT/NTT algorithm since we only need twiddle factors for *R*-point and *C*-point FFT/NTTs, we need to store the twisting factors introduced in the third step of the 4-step FFT/NTT algorithm. The size of twisting factors in the 4-step FFT/NTT algorithm is close to the size of twiddle factors in the original algorithm, making the requirements of total storage even more strict. To address the problem, recent works generate twiddle and twisting factors on the fly rather than cache all the factors in the memory [27], [28].