Research 2023
A Unified Cryptoprocessor for Lattice-Based Signature and Key-Exchange
We propose design methodologies for building a compact, unified and programmable cryptoprocessor architecture that computes post-quantum key agreement and digital signature. Synergies in the two types of cryptographic primitives are used to make the cryptoprocessor compact. As a case study, the cryptoprocessor architecture has been optimized targeting the signature scheme ’CRYSTALS-Dilithium’ and the key encapsulation mechanism (KEM) ’Saber,’ both finalists in the NIST’s post-quantum cryptography standardization project. The programmable cryptoprocessor executes key generations, encapsulations, decapsulations, signature generations, and signature verifications for all the security levels of Dilithium and Saber. On a Xilinx Ultrascale+ FPGA, the proposed cryptoprocessor consumes 18,406 LUTs, 9,323 FFs, 4 DSPs, and 24 BRAMs. It achieves 200 MHz clock frequency and finishes CCA-secure key-generation/encapsulation/decapsulation operations for LightSaber in 29.6/40.4/ 58.3 μ s; for Saber in 54.9/69.7/94.9 μ s; and for FireSaber in 87.6/108.0/139.4 μ s, respectively. It finishes key-generation/sign/verify operations for Dilithium-2 in 70.9/151.6/75.2 μ s; for Dilithium-3 in 114.7/237/127.6 μ s; and for Dilithium-5 in 194.2/342.1/228.9 μ s, respectively, for the best-case scenario. On UMC 65 nm library for ASIC the latency is improved by a factor of two due to a 2× increase in clock frequency.
KaLi: A Crystal for Post-Quantum Security Using Kyber and Dilithium
Quantum computers pose a threat to the security of communications over the internet. This imminent risk has led to the standardization of cryptographic schemes for protection in a post-quantum scenario. We present a design methodology for future implementations of such algorithms. This is manifested using the NIST selected digital signature scheme CRYSTALS-Dilithium and key encapsulation scheme CRYSTALS-Kyber. A unified architecture, KaLi , is proposed that can perform key generation, encapsulation, decapsulation, signature generation, and signature verification for all the security levels of CRYSTALS-Dilithium, and CRYSTALS-Kyber. A unified yet flexible polynomial arithmetic unit is designed that can processes Kyber operations twice as fast as Dilithium operations. Efficient memory management is proposed to achieve optimal latency. KaLi is explicitly tailored for ASIC platforms using multiple clock domains. On ASIC 28nm/65nm technology, it occupies 0.263/1.107 mm2 and achieves a clock frequency of 2GHz/560MHz for the fast clock used for memory unit. On Xilinx Zynq Ultrascale+ZCU102 FPGA, the proposed architecture uses 23,277 LUTs, 9,758 DFFs, 4 DSPs, and 24 BRAMs, at 270 MHz clock frequency. KaLi performs better than the standalone implementations of either of the two schemes. This is the first work to provide a unified design in hardware for both schemes.
Medha: Microcoded Hardware Accelerator for computing on Encrypted Data
Homomorphic encryption enables computation on encrypted data, and hence it has a great potential in privacy-preserving outsourcing of computations to the cloud. Hardware acceleration of homomorphic encryption is crucial as software implementations are very slow. In this paper, we present design methodologies for building a programmable hardware accelerator for speeding up the cloud-side homomorphic evaluations on encrypted data.
First, we propose a divide-and-conquer technique that enables homomorphic evaluations in the polynomial ring RQ,2N = ZQ[x]/(x2N + 1) to use a hardware accelerator that has been built for the smaller ring RQ,N = ZQ[x]/(xN + 1). The technique makes it possible to use a single hardware accelerator flexibly for supporting several homomorphic encryption parameter sets.
Next, we present several architectural design methods that we use to realize the flexible and instruction-set accelerator architecture, which we call ‘Medha’. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these parallel units are interconnected elegantly so that their interconnections require the minimum number of nets, therefore making the overall architecture placement-friendly on the platform. As homomorphic encryption is computation- as well as data-centric, the speed of homomorphic evaluations depends greatly on the way the data variables are handled. For Medha, we take a memory-conservative design approach and get rid of any off-chip memory access during homomorphic evaluations.
Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure timing performances of the microcoded homomorphic addition, multiplication, key-switching, and rescaling routines for the leveled fully homomorphic encryption scheme RNSHEAAN at 200 MHz clock frequency. For the large parameter sets (log Q,N) = (438, 214) and (546, 215), Medha achieves accelerations by up to 68× and 78× times respectively compared to a highly optimized software implementation Microsoft SEAL running at 2.3 GHz.
ProvableCorrectandAdaptiveSimplex ArchitectureforBounded-Liveness PropertiesHigh-speed Design of Post Quantum Cryptography with Optimized Hashing and Multiplication
We propose an approach to synthesize Simplex architectures that are provably correct for a rich class of temporal specifications, and are high-performant by optimizing for the time the advanced controller is active. We achieve provable correctness by performing a static verification of the baseline controller. The result of this verification is a set of states which is proven to be safe, called the recoverable region. During runtime, our Simplex architecture adapts towards a running advanced controller by exploiting proof-on-demand techniques. Verification of hybrid systems is often overly conservative, resulting in over-conservative recoverable regions that cause unnecessary switches to the baseline controller. To avoid these switches, we invoke targeted reachability queries to extend the recoverable region at runtime.
Our offline and online verification relies upon reachability analysis, since it allows observation-based extension of the known recoverable region. However, detecting fix-points for bounded liveness properties is a challenging task for most hybrid system reachability analysis tools. We present several optimizations for efficient fix-point computations that we implemented in the state-of-the-art tool HYPRO that allowed us to automatically synthesize verified and performant Simplex architectures for advanced case studies, like safe autonomous driving on a race track.
Safety Shielding under Delayed Observation
Agents operating in physical environments need to be ableto handle delays in the input and output signals since nei-ther data transmission nor sensing or actuating the environ-ment are instantaneous. Shields are correct-by-constructionruntime enforcers that guarantee safe execution by correctingany action that may cause a violation of a formal safety spec-ification. Besides providing safety guarantees, shields shouldinterfere minimally with the agent. Therefore, shields shouldpick the safe corrective actions in such a way that future in-terferences are most likely minimized. Current shielding ap-proaches do not consider possible delays in the input signalsin their safety analyses. In this paper, we address this issue.We propose synthesis algorithms to computedelay-resilientshieldsthat guarantee safety under worst-case assumptionson the delays of the input signals. We also introduce novelheuristics for deciding between multiple corrective actions,designed to minimize future shield interferences caused bydelays. As a further contribution, we present the first inte-gration of shields in a realistic driving simulator. We imple-mented our delayed shields in the driving simulator CARLA.We shield potentially unsafe autonomous driving agents indifferent safety-critical scenarios and show the effect of de-lays on the safety analysis.
Research 2022
Magnifying Side-Channel Leakage of Lattice-Based Cryptosystems With Chosen Ciphertexts: The Case Study of Kyber
Lattice-based cryptography, as an active branch of post-quantum cryptography (PQC), has drawn great attention from side-channel analysis researchers in recent years. Despite the various side-channel targets examined in previous studies, detail on revealing the secret-dependent information efficiently is less studied. In this paper, we propose adaptive EM side-channel attacks with carefully constructed ciphertexts on Kyber, which is a finalist of NIST PQC standardization project. We demonstrate that specially chosen ciphertexts allow an adversary to modulate the leakage of a target device and enable full key extraction with a small number of traces through simple power analysis. Compared to prior research, our techniques require fewer traces and avoid building complex templates. We practically evaluate our methods using both a reference implementation and the ARM-specific implementation in pqm4 library. For the reference implementation, we target the leakage of the output of the inverse NTT computation and recover the full key with only four traces. For the pqm4 implementation, we develop a message-recovery attack that leads to extraction of the full secret key with between eight and 960 traces, depending on the compiler optimization level. We discuss the relevance of our findings to other lattice-based schemes and explore potential countermeasures.
Time-Memory Trade-Offs for Saber+ on Memory-Constrained RISC-V Platform
Saber is a module-lattice-based key encapsulation scheme that has been selected as a finalist in the NIST Post-Quantum Cryptography standardization project. As Saber computes on considerably large matrices and vectors of polynomials, its efficient implementation on memory-constrained IoT devices is very challenging. In this paper, we present an implementation of Saber with a minor tweak to the original Saber protocol for achieving reduced memory consumption and better performance. We call this tweaked implementation ‘Saber+’, and the difference compared to Saber is that we use different generation methods of public matrix A and secret vector s for memory optimization. Our highly optimized software implementation of Saber+ on a memory-constrained RISC-V platform achieves 48% performance improvement compared with the best state-of-the-art memory-optimized implementation of original Saber. Specifically, we present various memory and performance optimizations for Saber+ on a memory-constrained RISC-V microcontroller, with merely 16KB of memory available. We utilize the Number Theoretic Transform (NTT) to speed up the polynomial multiplication in Saber+. For optimizing cycle counts and memory consumption during NTT, we carefully compare the efficiency of the complete and incomplete-NTTs, with platform-specific optimization. We implement 4-layers merging in the complete-NTT and 3-layers merging in the 6-layer incomplete-NTT. An improved on-the-fly generation strategy of the public matrix and secret vector in Saber+ results in low memory footprint. Furthermore, by combining different optimization strategies, various time-memory trade-offs are explored. Our software implementation for Saber+ on selected RISC-V core takes just 3,809K, 3,594K, and 3,193K clock cycles for key generation, encapsulation, and decapsulation, respectively, while consuming only 4.8KB of stack at most.
Will You Cross the Threshold for Me?
In this work, we propose generic and novel side-channel assisted chosenciphertext attacks on NTRU-based key encapsulation mechanisms (KEMs). These KEMs are IND-CCA secure, that is, they are secure in the chosen-ciphertext model. Our attacks involve the construction of malformed ciphertexts. When decapsulated by the target device, these ciphertexts ensure that a targeted intermediate variable becomes very closely related to the secret key. An attacker, who can obtain information about the secret-dependent variable through side-channels, can subsequently recover the full secret key. We propose several novel CCAs which can be carried through by using side-channel leakage from the decapsulation procedure. The attacks instantiate three different types of oracles, namely a plaintext-checking oracle, a decryptionfailure oracle, and a full-decryption oracle, and are applicable to two NTRU-based schemes, which are NTRU and NTRU Prime. The two schemes are candidates in the ongoing NIST standardization process for post-quantum cryptography. We perform experimental validation of the attacks on optimized and unprotected implementations of NTRU-based schemes, taken from the open-source pqm4 library, using the EM-based side-channel on the 32-bit ARM Cortex-M4 microcontroller. All of our proposed attacks are capable of recovering the full secret key in only a few thousand chosen ciphertext queries on all parameter sets of NTRU and NTRU Prime. Our attacks, therefore, stress on the need for concrete side-channel protection strategies for NTRUbased KEMs.