The Potential of Reconfigurable Hardware for HPC Cryptanalysis of SHA-1

Alessandro Cilardo
Computer Science Department, University of Naples Federico II, via Claudio 21, 80125 Napoli, Italy, Email: acilardo@unina.it

Abstract—Modern reconfigurable technologies have a number of inherent advantages for cryptanalytic applications. Aimed at the cryptanalysis of the SHA-1 hash function, this work explores this potential, showing new approaches inherently based on hardware reconfigurability that enable algorithm and architecture exploration, input-dependent system specialization, and low-level optimizations based on static/dynamic reconfiguration. As a result of this approach, we identified a number of new techniques, at both the algorithmic and architectural level, to effectively improve the attacks against SHA-1. We also defined the architecture of a high-performance FPGA-based cluster that turns out to be the solution with the highest speed/cost ratio for SHA-1 collision search currently available. A small-scale prototype of the cluster enabled us to reach a real collision for a 72-round version of the hash function.

I. INTRODUCTION

Modern FPGAs can be used to build application-specific high-performance computing (HPC) machines at costs that are several orders of magnitude lower than standard HPC platforms. This has some important implications for those applications where the (un)availability of suitable computing resources is an essential underlying assumption. Cryptographic algorithms are perhaps the most remarkable example, since most of them are based on some hard problems, supposed to be intractable with ordinary computing resources. Cost is not the only factor that may give reconfigurable technologies a special role for cryptanalysis. In fact, reconfigurable hardware, by its nature, can be used “interactively”, changing the hardware design over time more than once. So, not only can FPGAs be used for the cryptanalytic computation in itself, but also for supporting the study of the problem, enabling large-scale computation efforts just aimed at algorithm and architecture exploration.

This paper, in particular, addresses the possibility of breaking the most important cryptographic hash function currently in use, SHA-1, a hot topic in today’s cryptanalytic research. The background in hardware cryptanalysis and attacks against cryptographic hash functions is reviewed in Section II, while a deeper presentation of the SHA-1 algorithm and the cryptanalysis methods used in this work is provided in Section III. Starting from the current state-of-the-art, Section IV then investigates innovative approaches, inherently based on hardware reconfigurability, enabling algorithm and architecture exploration for the cryptanalysis of SHA-1 and identifying new techniques for building effective attacks. In addition, it demonstrates how static and dynamic circuit specialization can play a key role to maximize the performance of the hardware system by customizing it for specific parameters. These opportunities are extensively exploited by a set of software tools we developed to automatically generate highly optimized HDL code for specific input collision search parameters. To demonstrate our approach, we have developed a prototypical FPGA cluster, built on top of a highly optimized dedicated unit for SHA-1 collision search, described in Section V. A quantitative evaluation of performance and implementation efficiency confirms the impact of the different techniques employed. As shown in Section VI, in fact, the FPGA-based architecture can reach the same levels of performance as an optimized collision search application run on a massively parallel HPC cluster, while costing two orders of magnitude less. By using a small-scale prototypical cluster, made of only 20 very low-cost FPGAs, we were able to find an actual collision for a 72-round version of SHA-1, outperforming the best achievement presented in the technical literature.

II. BACKGROUND

Cryptanalysis usually requires massive computations and often relies on special-purpose hardware solutions exhibiting much better performance/cost ratios than off-the-shelf computers. While there are countless examples in the literature of hardware acceleration for cryptanalysis, few contributions deliberately use reconfigurable devices as the underlying technology. Some commercial clusters relying on FPGAs, for example, have been recently suggested for use in cryptanalytic applications, including Copacobana, made of 120 Spartan-3 1000 or Virtex-4 SX35 FPGAs [1], and its successor Rivyera, made of 16 to 128 Spartan-3 5000 FPGAs [2]. Other contributions in the literature focus on low-level aspects. For instance, it has been shown how some FPGA-specific resources can accelerate RSA encryptions [3] as well as attacks on RSA [4].

Among the numerous works on hardware-accelerated cryptanalysis, very few target collision search for cryptographic hash functions, in spite of the serious threats to security created by vulnerabilities in hash functions such as MD5 [5]. Reference [6], in particular, presents a special-purpose microprocessor to speed up collision search for MD4-family hash functions. The core has minimal area requirements but it basically uses a software-like implementation of the collision search algorithm and, in fact, it only shows significant improvements if implemented as an ASIC. Many works exist, on the other hand, on theoretical or software-based attacks on SHA-1. They are especially relevant here for comparisons of attack performance and costs. After some early studies by Chabaud and Joux [7] on SHA-1’s predecessor SHA-0, based on a differential approach, and by Biham et al. [8], Wang et al. [9] showed in 2005, for the first time, a method to find a collision for SHA-1 with a theoretical complexity lower than the bound of $2^{80}$ of a simple birthday attack, namely $2^{69}$. De Cannière et al. [10] then described a way to automatically find complex non-linear characteristics and used it to determine a two-block colliding message pair for a weakened 64-round version of SHA-1. A collision for a 70-round version of SHA-1 was presented by the same researchers in [11] and an equivalent result was obtained by T. Peyrin in [12]. Finally, Cilardo et al. [13] presented a study of vulnerabilities in the SHA family, namely the SHA-0 and SHA-1 hash functions, based on a high-performance computing application run on a massively parallel cluster. They were able to identify the first collision for a 71-round version of SHA-1.

III. THE SHA-1 HASH FUNCTION

Issued by NIST in 1995 as a Federal Information Processing Standard [14], SHA-1 is the most popular hash function currently in use for cryptographic applications. The hash function SHA-1 takes a message of length less than $2^{64}$ bits and produces a 160-bit hash value. The input message is padded and then processed in 512-bit blocks in the Merkle-Damgård iterative structure. Each iteration invokes a so-called compression function which takes a 160-bit chaining value and a 512-bit message block and outputs another 160-bit chaining value. The initial chaining value (called $IV$) is a set of fixed constants, and the final chaining value is the hash of the message. The compression function of SHA-1 works as follows. For each 512-bit block of the padded message, divide it into sixteen 32-bit words, $(W_0, W_1, \ldots, W_{15})$. The message words are first expanded as follows: for $i = 16 \ldots 79$

$$W_i = (W_{i-3} \oplus W_{i-8} \oplus W_{i-14} \oplus W_{i-16}) \lll 1 \quad (1)$$

where the ‘$\lll x$’ notation is used for a left rotation by $x$ bits. The 160-bit chaining value is stored into five internal 32-bit variables, called $A, B, C, D, E$. The expanded message words $W_i$ are then processed in 80 rounds, divided into four groups of 20 consecutive rounds. Each of these 80 steps applies the following round function: for $i = 1 \ldots 80$

$$A_i = (A_{i-1} \lll 5) + f_i(B_{i-1}, C_{i-1}, D_{i-1}) + E_{i-1} + W_{i-1} + K_i$$

$$B_i = A_{i-1} \quad C_i = B_{i-1} \lll 30 \quad D_i = C_{i-1} \quad E_i = D_{i-1} \quad (2)$$

where the ‘$+$’ symbol denotes the integer addition performed modulo-$2^{32}$. Each group of 20 rounds uses a different Boolean function $f_i$ and constant $K_i$, as summarized in the following table.

<table>
<thead>
<tr>
<th>round</th>
<th>Boolean function $f_i(x, y, z)$</th>
<th>constant $K_i$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0–19</td>
<td>$IF: (x \cdot y) + (\overline{x} \cdot z)$</td>
<td>0x5A827999</td>
</tr>
<tr>
<td>20–39</td>
<td>$XOR: x \oplus y \oplus z$</td>
<td>0x6ED9EBA1</td>
</tr>
<tr>
<td>40–59</td>
<td>$MAJ: (x \cdot y) + (x \cdot z) + (y \cdot z)$</td>
<td>0x8F1BBCDC</td>
</tr>
<tr>
<td>60–79</td>
<td>$XOR: x \oplus y \oplus z$</td>
<td>0xCA62C1D6</td>
</tr>
</tbody>
</table>

The chaining value $IV = (A_0, B_0, C_0, D_0, E_0)$ for the first application of the compression function is defined by the standard as $(0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)$.
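As a concrete reference for the description above, the following Python sketch implements relations (1) and (2) with a configurable number of rounds. The function names and the `rounds` parameter are illustrative only. Two details are taken from the SHA-1 standard [14] rather than from the text above, which does not spell them out: the padding rule and the final addition of the input chaining value to the output of the last round. Reduced-round variants are obtained here by simply truncating the round loop, which is an assumption about how such variants are defined.

```python
import struct

MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rotl(x, n):
    """Left rotation of a 32-bit word by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, x, y, z):
    """Round-dependent Boolean function (t is the 0-based round index)."""
    if t < 20:
        return (x & y) | (~x & z)            # IF
    if 40 <= t < 60:
        return (x & y) | (x & z) | (y & z)   # MAJ
    return x ^ y ^ z                         # XOR


def compress(cv, block, rounds=80):
    """Apply the compression function to one 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, rounds):              # message expansion, relation (1)
        w.append(rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    a, b, c, d, e = cv
    for t in range(rounds):                  # round function, relation (2)
        a, b, c, d, e = (
            (rotl(a, 5) + f(t, b, c, d) + e + w[t] + K[t // 20]) & MASK,
            a, rotl(b, 30), c, d,
        )
    # Feed-forward of the input chaining value, as in the standard.
    return tuple((x + y) & MASK for x, y in zip(cv, (a, b, c, d, e)))


def sha1(message, rounds=80):
    """Hash a byte string; rounds < 80 gives a truncated variant."""
    bit_length = len(message) * 8
    padded = message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
    padded += struct.pack(">Q", bit_length)
    cv = IV
    for offset in range(0, len(padded), 64):
        cv = compress(cv, padded[offset:offset + 64], rounds)
    return struct.pack(">5I", *cv)
```

With `rounds=80` the sketch reproduces the standard digest; a call such as `sha1(msg, rounds=72)` evaluates the truncated variant under the assumption stated above.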

A. Cryptanalysis of SHA-1

The essential idea behind the attack on SHA-1 is to constrain the values of the messages and registers $A \ldots E$ in order to reach a collision with a certain probability. Such a set of bit-level constraints on the difference of two messages is called a “differential characteristic” (an example is shown in Figure 1). A differential characteristic comprises two sections. The $W$ part defines the constraints imposed on the bits of the two messages in each position, while the $A$ part contains the constraints imposed on the internal registers $A \ldots E$ during the hashing process. In fact, the characteristic only considers the $A$ register, since the remaining four variables $B \ldots E$, and related constraints, can be easily obtained from $A$ by simple rotations and round shifts (e.g. $D_i = A_{i-3} \lll 30$, i.e. $A_{i-3}$ rotated left by 30 bits). The constraints are expressed by a set of symbols: ‘$1$’ and ‘$0$’ indicate that both bits in the two messages must take on the values ‘$1$’ and ‘$0$’, respectively. ‘$u$’ and ‘$n$’ indicate a signed difference ($+1$ or $-1$, respectively), ‘$\cdot$’ an unspecified difference, ‘$-$’ two unspecified equal values.

As preliminary work, we developed some support tools for the generation of the characteristic, based on both existing and original techniques. We will not give the details of these tools here, as they are outside the scope of the paper. For each round $i$, the support tools evaluate the probability that the consequent $A_{i+1}$ complies with the characteristic, given that the current word $W_i$ and the previous five values $A_i \ldots A_{i-4}$ (i.e., registers $A_i \ldots E_i$) do so. This probability is called $P_o(i)$. For the first sixteen $W_i$, which are in fact a part of the message to be hashed, some bits are not constrained by the characteristic (those indicated as ‘$-$’) so that they can be controlled by the attacker. They are called degrees of freedom and their number for each round is denoted as $D_o(i)$. Clearly, $D_o(i) = 0$ for $i \geq 16$ due to the expansion function (1) for computing words $W_i$. The collision search proceeds by setting the degrees of freedom and evaluating pairs of messages whose difference complies with a given characteristic.

In order to evaluate the performance of the attack, it is important to estimate the expected number of executions for each round $i$, denoted as $N(i)$. This parameter can be computed from $P_o(i)$ and $D_o(i)$ starting from the bottom of the characteristic, by using the recursive relationship $N(i+1) = N(i) \cdot P_o(i) \cdot 2^{D_o(i+1)} \implies N(i) = N(i+1) \cdot 2^{-D_o(i+1)} / P_o(i)$, with the initial value of $N$ for the last round (72 for the characteristic in the figure), i.e. the number of executions of the last round, equal to 1, since we will stop the search process as soon as we reach the last round for the first time. Since the execution of each round, i.e. the computation of all relationships (1) and (2) for $W_i$ and $A_i \ldots E_i$ and the corresponding register updates, can be performed in parallel, possibly in a single clock cycle for a hardware implementation, we call this set of steps an Elementary Operation (EO), and the expected total number of round executions, i.e. the summation of parameters $N(i)$ for all rounds $i$, the Mean number of EOs to Collision (MEOC).
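The backward recursion for $N(i)$ and the resulting MEOC can be sketched in a few lines of Python. The function names and the four-round input values below are purely illustrative and are not taken from the characteristic of Figure 1; the lists `p` and `d` are assumed to hold $P_o(i)$ and $D_o(i)$ indexed by round.

```python
def expected_executions(p, d):
    """Return N(i) for every round, given P_o(i) in p and D_o(i) in d.

    The last round is executed once by definition; earlier rounds follow
    N(i) = N(i+1) * 2**(-D_o(i+1)) / P_o(i).
    """
    last = len(p) - 1
    n = [0.0] * (last + 1)
    n[last] = 1.0
    for i in range(last - 1, -1, -1):
        n[i] = n[i + 1] * 2.0 ** (-d[i + 1]) / p[i]
    return n


def meoc(n):
    """Mean number of EOs to Collision: the sum of N(i) over all rounds."""
    return sum(n)


# Hypothetical toy characteristic with four rounds.
toy_p = [1.0, 0.5, 0.25, 1.0]   # P_o(i)
toy_d = [2, 1, 0, 0]            # D_o(i)
toy_n = expected_executions(toy_p, toy_d)
print(toy_n, meoc(toy_n))
```

For a real characteristic the values span many orders of magnitude, so working with base-2 logarithms instead of floating-point products may be preferable.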

Due to the exponential complexity of the search process, it is essential to find efficient ways to enumerate all available message pairs in the search space, possibly by identifying “more probable” message pairs and by pruning as early as possible large subspaces not containing collisions. An improvement to the algorithm is provided by the use of Auxiliary Paths (APs, see [15]). An AP is a set of bits in the message pair which, if flipped, produce another message pair satisfying the characteristic up to a certain round \( r_{AP} \). Thus, once a message pair has been verified to comply with the characteristic up to round \( r_{AP} \), it is possible to generate another message pair that will certainly be compliant up to that round. In effect, since \( r_{AP} \) is located late in the round sequence, when the probability of execution has already decreased by several orders of magnitude, the fork produced by an AP virtually doubles the chances of finding a collision, i.e. it halves the MEOC. Clearly, a number \( P \) of different APs allows the generation of \( 2^P - 1 \) new “good” message pairs from \( r_{AP} \) onwards, and a corresponding improvement for the MEOC. The characteristic in Figure 1 has two APs that can be applied at round \( r_{AP} = 19 \). All \( N(i) \) values above the fork, computed with the formula given earlier, are thus corrected by a factor of \( 2^{-2} \) (a decrement by 2 in base-2 logarithmic terms). Unfortunately, taking into account an AP also involves a cost related to saving the intermediate search state and restoring it after the previous execution branch. This overhead makes the EO, or round, where the AP fork occurs have a time cost larger than other EOs. In general, furthermore, especially for parallel implementations, similar forks (and related overheads) take place at some specific rounds for dividing the search space among different processors. As a consequence, the execution times in terms of clock counts \( C(i) \) for each round \( i \) can be different and, thus, the MEOC is not necessarily proportional to the total clock count for a collision. The Mean Clock Count to Collision (MCCC), rather, should be computed as the weighted sum of the EO counts \( N(i) \), i.e. \( MCCC = \sum_i N(i) \cdot C(i) \). Since the MEOC, including the APs’ contribution, is specific to a given characteristic, while the times \( C(i) \) depend on the different architectural optimizations employed, a good metric for the quality of an implementation is the \( MEOC/MCCC \) ratio, called here \( EO \) per clock cycle, or EOC. For a single, basic core performing a sequence of EOs on \( W_i \) pairs, the ideal bound for EOC is 1.
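To make the relation between MEOC, MCCC and EOC concrete, the following Python sketch evaluates the three quantities for a hypothetical four-round example in which one round (standing in for a fork) costs three clock cycles. The function names and all numeric values are illustrative assumptions, not measurements.

```python
def mccc(n, c):
    """Mean Clock Count to Collision: sum of N(i) * C(i) over all rounds."""
    return sum(n_i * c_i for n_i, c_i in zip(n, c))


def eoc(n, c):
    """EOs per clock cycle: MEOC / MCCC, equal to 1 when every C(i) is 1."""
    return sum(n) / mccc(n, c)


# Hypothetical expected execution counts N(i) and clock counts C(i).
toy_n = [4.0, 8.0, 4.0, 1.0]
toy_c = [1, 1, 3, 1]            # the third round models a costly fork

print(mccc(toy_n, toy_c))       # total expected clock cycles
print(eoc(toy_n, toy_c))        # below the ideal bound of 1
```

The example shows why the clock count of a round matters only in proportion to how often that round is executed: lowering \( C(i) \) for a round with a large \( N(i) \) moves the EOC towards 1 much more than the same improvement on a rare round.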

IV. HARDWARE RECONFIGURATION FOR THE CRYPTANALYSIS OF SHA-1

Reconfigurable hardware, by its nature, can be used “interactively”, changing the hardware design over time more than once. So, not only can FPGAs be used for the cryptanalytic computation in itself, but also for supporting the study of the problem, enabling large-scale computation efforts just aimed at algorithm and architecture exploration. We extensively exploited this potential, as described below, to investigate new effective search techniques. In addition, we found that both static and dynamic circuit specialization, enabled by reconfigurable devices, can play a key role in maximizing the performance and minimizing the hardware complexity of the system, automatically specializing it for a given input characteristic.

a) Algorithm and architecture exploration: As a first step in our study, we employed reconfigurable hardware to experiment extensively with new algorithmic techniques. Two remarkable opportunities were identified by this approach:

1) We developed an FPGA-based temporary design to explore the behavior of higher rounds in the differential characteristic, namely those beyond round 60, that are normally executed very rarely due to the exponentially decreasing value of probability. We found that some bit-level constraints in the last rounds, seemingly independent of the previous constraints, are in fact indirectly influenced by some bit patterns including the degrees of freedom in the sixteen message words, due to the message expansion function for \( W_i \). These inter-bit constraints, resulting in a degraded actual probability of success (normally halved for each constraint), had not been detected before, because they impact the behavior of higher, i.e. very rare, rounds. The temporary search hardware was designed to bypass the middle phase of the collision search process and, albeit not able to find an actual collision, it could detect the misalignments in the actual behavior at the higher rounds. It turned out that the differential characteristic, as it is normally defined, is not suitable to capture the effect of inter-bit constraints, essentially because it only expresses bit-level constraints. Moreover, the additional constraints depend dynamically on the specific bit pattern of some degrees of freedom chosen during message enumeration. To cope with this problem, we devised a technique where the differential characteristic (and the corresponding hardware) is dynamically changed during message enumeration to mask the inter-bit constraints. More precisely, at design-time we analyze the characteristic and express all inter-bit constraints in a system of Boolean linear equations, reduced by Gaussian elimination, in such a form as to concentrate all independent variables (some degrees of freedom in the message words) as high as possible in the first sixteen words. As the independent variables are set during message enumeration, the corresponding constraints in the characteristic are set accordingly by back-substitution in the system. In practice, this process only requires the computation of a sequence of XOR operations and, importantly, it does not affect the execution time significantly since it only happens before round 15, i.e. for rounds that are orders of magnitude less frequent than the subsequent ones.

2) A second technique that was successfully investigated is called constraint relaxation. Basically, we experimented with new constraints in the characteristic, different from the bitwise conditions imposed by the \( A \) part. The essential idea was to employ integer differences rather than bitwise differences. Under some circumstances, this may deteriorate the probability \( P_o(i) \) less than expected, while enlarging the set of tries that satisfy the conditions, resulting in an improved overall MEOC. A difficult problem was to identify such situations, if any, and define the actual positions and parameters for the constraint relaxation. To this aim, we developed an ad-hoc message enumeration component designed to skip or reinterpret some part of the compliance verification process in order to identify the locations where the relaxation could pay off. The hardware-supported analysis led us to identify rounds 32 and 64 – 72. The actual relaxation consists in replacing the bitwise XOR between the \( A \) and \( A' \) registers for the two messages in the pair with their integer absolute difference \( |A - A'| \).
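The handling of inter-bit constraints described in technique 1) above can be illustrated with a small software model. The sketch below is not the hardware procedure itself: the equation encoding (a bit mask of variables plus a right-hand side), the function names, and the convention that lower variable indices are the ones fixed first during enumeration are all assumptions made for the example. It reduces a system of linear equations over GF(2) so that each dependent variable is expressed in terms of free variables only, and then derives the dependent bits with XOR operations.

```python
def gf2_eliminate(equations, nvars):
    """Reduce XOR equations so each pivot depends on free variables only.

    equations: list of (mask, rhs); bit j of mask set means x_j appears.
    Pivots are chosen from the highest index down, so that low-index
    variables (assumed to be enumerated first) remain independent.
    Returns {pivot_var: (mask_of_free_vars, rhs)}.
    """
    rows = list(equations)
    pivots = {}
    for var in range(nvars - 1, -1, -1):
        bit = 1 << var
        idx = next((k for k, (m, _) in enumerate(rows) if m & bit), None)
        if idx is None:
            continue                      # x_var stays a free variable
        pm, pr = rows.pop(idx)
        # Remove x_var from the remaining rows and from earlier pivot rows.
        rows = [(m ^ pm, r ^ pr) if m & bit else (m, r) for m, r in rows]
        pivots = {v: ((m ^ pm, r ^ pr) if m & bit else (m, r))
                  for v, (m, r) in pivots.items()}
        pivots[var] = (pm ^ bit, pr)
    if any(r for m, r in rows if m == 0):
        raise ValueError("inconsistent system of constraints")
    return pivots


def back_substitute(pivots, free_bits):
    """Set every dependent bit from the chosen free bits (XOR only)."""
    x = free_bits
    for var, (mask, rhs) in pivots.items():
        parity = bin(mask & free_bits).count("1") & 1
        if parity ^ rhs:
            x |= 1 << var
        else:
            x &= ~(1 << var)
    return x


# Hypothetical constraints: x0 ^ x2 = 1 and x1 ^ x2 ^ x3 = 0.
system = [(0b0101, 1), (0b1110, 0)]
solved = gf2_eliminate(system, nvars=4)
print(back_substitute(solved, 0b01))      # free bits: x0 = 1, x1 = 0
```

In this toy system `x0` and `x1` remain free, while `x2` and `x3` are derived from them, mirroring the idea that constraints are updated by back-substitution as the independent variables are fixed.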

b) [...] ad-hoc design: Reconfigurable technologies enable another important design approach that can be especially relevant for cryptanalysis [...]. Of course, there is no space here to present the different [...] presented in Section V for the SHA-1 collision search cluster.

As an example (out of many actually implemented), we provide some details on segmented incrementers. As explained above, the search process works basically by enumerating all the combinations for the ‘−’ symbols in the characteristic (see Figure 1). This cannot be done by using normal counters in a straightforward way, since the bits are in general not consecutive, which may compromise the use of carry propagation logic for fast increments. However, for a typical configurable element, e.g. a Xilinx Spartan3 slice like those depicted in Figure 2, a carefully optimized (and, of course, characteristic-dependent) configuration allows us to pack in a single group of consecutive FPGA Look-Up Tables (LUTs) and Flip-Flops all the bits in their order, exploiting carry propagation logic for the counting operation, skipping unaffected bits without interrupting the carry chain, and, in addition, embedding a multiplexer for load operations from the outside. For the example pattern ‘−∗∗∗−...−∗∗’, where ‘∗’ symbols denote the bits to be enumerated and ‘−’ the constant values to be skipped, Figure 2.a) shows the appropriate configuration of LUTs and surrounding logic. Since this kind of low-level, input-dependent optimization involves nearly all components of the design, the impact on both hardware complexity and circuit delay can be considerable. In practice, it roughly leads to halved footprints on the FPGA for the single elementary core, and hence doubled parallelism and computational power.
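The behavior of a segmented incrementer can be modeled in software with a well-known masking trick: the skipped positions are temporarily forced to 1 so that a carry passes straight through them, much as the hardware skips unaffected bits without interrupting the carry chain. The Python sketch below is only a functional model under that assumption, with illustrative names; it does not reproduce the LUT-level configuration of Figure 2.a).

```python
def segmented_increment(value, mask, width=32):
    """Increment only the bits selected by mask, leaving the others fixed.

    Bits where mask is 1 are enumerated as one counter, in their order;
    bits where mask is 0 are constants that the carry must skip.
    """
    full = (1 << width) - 1
    fixed = value & ~mask & full
    # Force the skipped bits to 1 so the carry ripples through them.
    stepped = ((value | (~mask & full)) + 1) & mask
    return fixed | stepped


# Hypothetical 4-bit pattern: bits 3 and 1 are enumerated, bits 2 and 0 fixed.
state = 0b0101
for _ in range(4):
    state = segmented_increment(state, mask=0b1010, width=4)
    print(format(state, "04b"))
```

After as many steps as there are combinations of the enumerated bits, the counter wraps around to its starting value, with the constant bits never changing.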

Another important example of input-dependent system customization is related to hardware replication for fast EO executions. Based on the actual probability values inferred from the characteristic, we resort to replication only for those rounds whose \( N(i) \) is above a certain threshold, i.e. only when an improved \( C(i) \) can significantly increase the EOC towards the ideal bound of 1. This is an important input-dependent optimization that can have a dramatic impact on the overall execution time, as shown in Section V-A.

For a given input characteristic, the above optimizations determine statically the design to be implemented. To support the specialization of the system, thus, we have developed a set of tools for the automated generation of HDL code, taking into account the input characteristic, its APs, inter-bit constraints, constraint relaxation, low-level optimizations, etc. Needless to say, this customization of the hardware system based on specific input parameters is an inherent advantage of FPGAs that would be impossible for an approach based on ASICs.

c) Dynamic updates of wired logic: Finally, we identified some situations where dynamic reconfiguration of some parts of the system can lead to higher speed and/or reduced hardware complexity. Since, in general, dynamic reconfiguration requires a non-negligible time, its use is justified only when it occurs with a medium/coarse temporal granularity. Again, there would be many examples where it can pay off, including the cases where the logic to be applied depends on some previous settings made at run-time, e.g. for inter-bit constraints. Another example of such situations is related to register re-initialization, where some value set earlier in a register, and then overwritten, needs to be used again (e.g., during message enumeration, before entering a new branch in the search tree from a certain node). Typically, we would need an additional register or memory to store the value while all the branches below are explored. A possibility based on dynamic reconfiguration, on the other hand, would consist in using the Set/Reset (SR) configurable signal available for Flip-Flops in most FPGAs (see Figure 2.b)). By controlling the behavior of the SR signal appropriately, any initialization value can be set in the flip-flops making up the bits of a register. The SR MUX, of course, cannot be controlled directly from the user design, as it is part of the FPGA configuration. The intermediate values of registers \( A \ldots E \) for the relatively rare round 14, kept constant as the search goes down through round 15 and beyond, and then changed every time we go back to round 13, could be an example of a situation benefiting from re-initialization based on dynamic reconfiguration. Since there are ten such registers in a single SHA-1 collision core (320 bits), and many cores in a single FPGA, the technique may save an appreciable quantity of hardware resources, although in general less than the static input-dependent optimizations described in the previous paragraph. Incidentally, the Xilinx Spartan3 devices we used for our experiments do not support dynamic reconfiguration, so we did not implement this class of optimizations.

V. THE FPGA CLUSTER

Figure 3.a) summarizes the application flow for the automated generation of the HDL code from an input characteristic and the configuration of the FPGA cluster. Based on an iterative, three-phase refinement process, not described here, a set of software tools generates a differential characteristic like that in Figure 1. The characteristic is then analyzed by a module for the automated generation of HDL code (namely, VHDL), leading to the configuration of the different components of the cluster architecture. This automated generation applies all the architectural optimizations that, as described in the previous section, depend on the specific behavior of the characteristic.

The cluster architecture is made of three levels. A top-level Master node analyzes the first few rounds (e.g. the first thirteen), in order to produce more constrained characteristics, which are then sent to the Slave nodes. Most of the workload is concentrated below the first rounds, so that the concurrent jobs dispatched to the slaves achieve an almost complete parallelization of the search process. The jobs, moreover, involve very little communication overhead, since only the initial search state for each job and the possible colliding messages need to be exchanged over the bus. Within the slave node (a whole FPGA), a Controller component acts as a second level in the architecture. It performs further enumeration of intermediate rounds (e.g. on round 14) and distributes the remaining search workload to a set of SHA-1 collision cores, described in detail below, which constitute the third level in the architecture. We leave only round 15 for the enumeration on the third-level cores, as long as the available degrees of freedom ensure an execution time long enough to hide the synchronization overhead. This simplification allows further speed and area optimizations, enabling a high EOC with small-footprint cores.

For the current prototypical implementation of the cluster, we used a set of 20 inexpensive commercial off-the-shelf boards, namely the Digilent Nexys-2 boards, each equipped with a Xilinx Spartan XC3S1200E FPGA. The interconnection is made through an ad-hoc inter-board bus. The Master node is implemented as a microprocessor system based on the Xilinx MicroBlaze core, while the Slave nodes are completely custom-made. We paid much attention to the modular nature of the cluster, supporting an easy extension with additional hardware. The extension port allows the hot-plug of additional modules and the software-supported reorganization of the search partitioning as new modules are plugged in. The Controller also interacts with an on-board non-volatile memory for the checkpointing of the search jobs. An I/O interface managed by the Master node allows the external user to interact with the cluster during the search operation.

A. SHA-1 collision core architecture

The basic building block of the cluster consists of a core which is able to process sequences of EOs for a certain portion of the search space. Figure 3.b) shows a particular instance of the SHA-1 collision core, generated for the characteristic of Figure 1. We relied on many of the techniques identified in Section IV to determine the structure of the core. For time-critical rounds, we used selective hardware replication by means of suitable shadow registers (see the figure) with preloaded message expansion (possibly completed with a single XOR during enumeration). The first sixteen \(W_i\) words are stored in an inexpensive 32-bit LUT-based memory and accessed sequentially for filling in a shift register with a (time-consuming) serial-in load, but only for the relatively rare rounds beyond 21. The shadow registers and the shift register load circuitry contain some ad-hoc logic controlling the selective bit-flipping related to AP enumeration. A segmented incrementer is used for round 15. The registers \(A \ldots E\), making up a shift register, need one long- and one medium-term initialization value to be stored into two additional registers for each variable \(A \ldots E\), implemented as memory LUTs. The structure shown in Figure 3.b) is duplicated for each SHA-1 collision core to explore the behavior of a pair of messages concurrently. The output difference between the two halves of a core is used to verify the compliance with the characteristic. To save memory, only a digest of the whole 80-round characteristic is stored in each core, privileging the conditions on the more frequent rounds to limit false positives. The actual digest is defined, again, according to the input characteristic. The hardware for the compliance check is also responsible for constraint relaxation at the appropriate rounds.

As a result of these techniques, the EOC measured for the above SHA-1 collision core, specialized for the input characteristic of Figure 1, is very close to the ideal bound, precisely \(EOC = 0.84\). Similar, or even higher, values were obtained for other characteristics.

A number of implementation-level optimizations were carried out for the target Spartan3 device. We made extensive use of RLOC constraints and manual placement in order to obtain extremely regular layouts, enabling high efficiency in the use of the FPGA resources and decreased delays. To obtain the maximum level of compactness, we carefully balanced the use of flip-flops and LUTs used as 1-bit memory elements. The implementation results for both the SHA-1 collision core and the whole design are listed below. The reduced footprint of the SHA-1 core allows each of the 20 FPGAs in the prototypical cluster to host six cores.

<table>
<thead>
<tr>
<th>Component</th>
<th>LUTs</th>
<th>flip-flops</th>
<th>delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>Controller Core</td>
<td>2530</td>
<td>2379</td>
<td>12.0 ns</td>
</tr>
<tr>
<td>Single SHA-1 Core</td>
<td>2127</td>
<td>1979</td>
<td>11.8 ns</td>
</tr>
<tr>
<td>Slave Node</td>
<td>15301</td>
<td>14253</td>
<td>12.1 ns</td>
</tr>
<tr>
<td>Master Node</td>
<td>3325</td>
<td>1923</td>
<td>17.3 ns</td>
</tr>
</tbody>
</table>

VI. DISCUSSION AND CONCLUSIONS

This section discusses the results collected from this work, and draws some conclusions and possible lines for future developments.

a) A case for reconfigurable computing in cryptanalysis: The work presented in this paper makes a case for the role of reconfigurable computing in cryptanalytic high-performance applications. Reconfigurable technologies enable inherent new opportunities, including algorithmic and architectural exploration and static/dynamic input-dependent circuit specialization. Brand new techniques at the algorithm level, presented in Section IV, stemmed from this study, in addition to a large variety of architectural optimizations and specific design techniques.

b) Highest speed/cost ratio for SHA-1 collision search: To demonstrate the impact of the approach presented, we have developed a working FPGA-based cluster. The EOC it is able to reach is much higher than that of other existing software or hardware solutions. For example, the proposal in [13], based on the MariCel supercomputer featuring high-speed IBM CBE processors (each containing eight 4-slot SIMD units, called SPUs, working at 3.2 GHz), is able to reach an EOC = 0.026, referred to one SPU, running a highly optimized SIMD application. Taking into account the difference in the clock frequency, that means that a single SHA-1 collision core is able to find a collision in a comparable time, only around 1.19 times longer than an SPU core in the supercomputer used in [13]. Compared to the HPC MariCel supercomputer, however, building an FPGA-based cluster based on the SHA-1 collision core costs two orders of magnitude less.

Looking at hardware solutions in the literature, [6] presents a microprocessor with minimal area requirements for speeding up collision search for the MD4 hash function. Synthesized for a Spartan3 XC3S1000 FPGA device, their collision search unit requires around 700 slices (9% of the device resources, according to the paper), each containing two LUTs and two flip-flops, i.e. around 30% fewer resources than our SHA-1 collision core. On the other hand, the unit is not targeted at SHA-1 and, being based on a software-like approach, it would be considerably slower than our core. Working sequentially on 32-bit data, in fact, it would require at least 12 cycles for an EO (update of variables $A_i \ldots E_i$ and $W_i$ for two different messages), not to mention checks and control operations, with an EOC certainly (much) below 1/12 = 0.083 if used for SHA-1.

The FPGA-based prototypical cluster presented in Section V, in conclusion, is currently the solution with the highest performance/cost ratio for SHA-1 collision search.

c) The first 72-round SHA-1 collision: At the end of our work, we used the prototypical cluster to outperform a previous result in SHA-1 cryptanalysis. In fact, we were able to find an actual collision for a 72-round version of SHA-1, beyond the limit reached in [13]. The collision, listed below, was the most advanced result towards a break of the full 80-round SHA-1 algorithm at the time of the discovery.

d) Future developments: With the support of a major FPGA manufacturer, we plan to build a large-scale version of the cluster and demonstrate its potential with new results for SHA-1. At the same time, we plan to extend our approach to other hash algorithms, namely the candidates for the selection of the SHA-3 function, so as to enable an early analysis of the proposals and anticipate possible unexpected vulnerabilities.
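The per-core comparison with the SPU given in item b) of this section can be checked with a short calculation. The Python sketch below multiplies each EOC by the corresponding clock frequency to obtain EOs per second. It assumes that the delay reported for the Slave Node in Section V is a critical-path delay of 12.1 ns and that the core is clocked at its reciprocal; both points, as well as the variable names, are assumptions of this example.

```python
def eo_per_second(eoc, clock_hz):
    """Throughput in Elementary Operations per second."""
    return eoc * clock_hz


# FPGA collision core: EOC from Section V-A, clock assumed to be 1 / 12.1 ns.
FPGA_EOC = 0.84
FPGA_CLOCK_HZ = 1.0 / 12.1e-9

# One SPU of the platform used in [13]: EOC and clock as quoted in item b).
SPU_EOC = 0.026
SPU_CLOCK_HZ = 3.2e9

fpga_rate = eo_per_second(FPGA_EOC, FPGA_CLOCK_HZ)
spu_rate = eo_per_second(SPU_EOC, SPU_CLOCK_HZ)

# Ratio of expected search times (core time over SPU time).
time_ratio = spu_rate / fpga_rate
print(f"FPGA core: {fpga_rate:.3e} EO/s, SPU: {spu_rate:.3e} EO/s")
print(f"time ratio: {time_ratio:.3f}")
```

Under these assumptions the ratio comes out just under 1.2, in line with the figure of around 1.19 quoted in item b).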

ACKNOWLEDGMENT

The author would like to thank Luigi Esposito for supporting the development of the software part of the collision search application.

REFERENCES


