LOW POWER FIR FILTER REALIZATION WITH DIFFERENTIAL COEFFICIENTS AND INPUT

Tian-Sheuan Chang and Chein-Wei Jen
Dept. of Electronics Engineering, National Chiao-Tung University
1001 Ta-Hsueh Rd, Hsinchu, Taiwan, R.O.C.

ABSTRACT

Most FIR filter realizations use the inputs and coefficients directly to compute the convolution. In this paper, we present a low-power, high-speed FIR filter design that considers the first-order difference between inputs and various orders of differences between coefficients. The design first reformulates the FIR operations in terms of these differences at the algorithm level. Then, at the architecture level, we adopt the distributed arithmetic (DA) architecture to exploit the probability distribution of the input bits so that power consumption can be reduced further. The design is applied to an example FIR filter to quantify the energy savings and speedup. It shows lower power consumption than the previous design with comparable performance.

I. INTRODUCTION

Recently, due to the popularity of portable battery-powered wireless communication systems such as cellular phones, pagers, and wireless modems, high-performance, low-power digital signal processing (DSP) has become increasingly important. One of the most commonly used operations in DSP is FIR filtering. An $N$-tap FIR filter with coefficients $C_k$, input sequence $X_j$, and output sequence $Y_j$ can be expressed as

$$Y_j = \sum_{k=0}^{N-1} C_k X_{j-k}$$

Conventional realizations of FIR filters use the inputs and coefficients directly, which requires the full word length for multiplication and accumulation and thus consumes more power. A realization using differential coefficients, called the differential coefficients method (DCM), has been proposed in [1] to address this problem. However, its formulation reconstructs the original $C_k X_{j-k}$ products term by term, which requires many memory accesses. Moreover, the formulation considers only coefficient differences, and it does not consider architecture-level design to maximize the power savings.

To solve these problems, we propose a new algorithm, called the differential coefficients/input method (DCIM), that considers not only differential coefficients but also differential inputs. After the algorithm reformulation, we propose an architecture design using distributed arithmetic (DA) that groups the high-transition-probability LSBs of the inputs together, so that the low-transition-probability MSBs can be skipped more efficiently to save power.

This paper is organized as follows. In Section II, we first review the DCM algorithm and then propose our new algorithm formulation. In Section III, we present the DA architecture design based on the DCIM algorithm. In Section IV, we analyze the power consumption and delay of the proposed design and apply the analysis to an example filter. Finally, we conclude the paper in Section V.

II. ALGORITHM FORMULATION

The differential coefficients method (DCM) first computes the partial products with the $m$th-order differences and then adds the stored previous partial products back. Let the first-order differences be $\delta^{1}_{k-1/k} = C_k - C_{k-1}$, and let the $m$th-order differences be defined as

$$\delta^{m}_{k-m/k} = \delta^{m-1}_{k-m+1/k} - \delta^{m-1}_{k-m/k-1}, \quad k = m \text{ to } (N-1);\ m = 2 \text{ to } (N-1).$$

Then the recurrence relation between coefficients using $m$th-order differences is

$$C_k = C_{k-1} + \sum_{i=1}^{m-1} \delta^{i}_{k-i-1/k-1} + \delta^{m}_{k-m/k}, \quad k = m \text{ to } (N-1);\ m = 2 \text{ to } (N-1)$$
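The recurrence can be checked numerically. The following is a minimal sketch, assuming that the $m$th-order difference ending at index $k$ is the difference of two adjacent $(m-1)$th-order differences and that the zeroth order is the coefficient list itself. The function names `difference_table` and `rebuild_coefficient` are hypothetical and are not taken from [1].

```python
def difference_table(coeffs, max_order):
    """Return table[m][k], the m-th order difference ending at index k.

    table[0] is the coefficient list itself. Entries with k < m are None
    because an m-th order difference needs m + 1 coefficients.
    """
    table = [list(coeffs)]
    for m in range(1, max_order + 1):
        prev = table[m - 1]
        row = [None] * len(coeffs)
        for k in range(m, len(coeffs)):
            row[k] = prev[k] - prev[k - 1]
        table.append(row)
    return table


def rebuild_coefficient(table, k, m):
    """Rebuild C_k from C_{k-1}, the lower-order differences ending at
    k - 1, and the m-th order difference ending at k (requires k >= m)."""
    value = table[0][k - 1]
    for i in range(1, m):
        value += table[i][k - 1]
    return value + table[m][k]


if __name__ == "__main__":
    coeffs = [3, 5, 8, 12, 17, 23]  # illustrative values only
    table = difference_table(coeffs, 3)
    for m in range(1, 4):
        for k in range(m, len(coeffs)):
            assert rebuild_coefficient(table, k, m) == coeffs[k]
    print(table[2][2:])  # second-order differences
```

For smooth coefficient sets the higher-order differences shrink in magnitude, which is what allows a shorter multiplier to be used.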

Then, for any two consecutive outputs $Y_j$ and $Y_{j+1}$, we can rewrite the outputs with the first-order difference DCM algorithm and obtain

$$Y_j = C_0 X_j + C_1 X_{j-1} + \ldots + C_{N-1} X_{j-N+1}$$

$$Y_{j+1} = C_0 X_{j+1} + C_1 X_j + \ldots + C_{N-1} X_{j-N+2}$$

where $C_k = C_{k-1} + \delta^{1}_{k-1/k}$. The DCM algorithm computes $\{\delta^{1}_{0/1} X_j + C_0 X_j\}, \ldots, \{\delta^{1}_{N-2/N-1} X_{j-N+2} + C_{N-2} X_{j-N+2}\}$ term by term. During the computation of each term, DCM first computes the partial product $\delta^{1}_{k-1/k} X_{j-k+1}$ and then adds $C_{k-1} X_{j-k+1}$. Each $C_{k-1} X_{j-k+1}$ term has already occurred in computing $Y_j$, so it is stored in memory and retrieved when necessary. Thus only the differential coefficient is used in the multiplication; the terms $C_{k-1} X_{j-k+1}$ are not computed again but simply added back. In this way, because the difference $\delta^{1}_{k-1/k}$ is small, power is saved in computing the partial products $\delta^{1}_{k-1/k} X_{j-k+1}$: a long multiplier is traded for a short one plus overhead. However, this computation order is not efficient enough, since the operation of adding the compensated term $C_{k-1} X_{j-k+1}$ back has to be performed for every term, which costs $N$ unnecessary memory accesses and additions and thus consumes memory area and power to store and retrieve them. These compensated terms can instead be summed together, then stored and retrieved once per $Y$. Moreover, the input data $X$ still uses the full word length.
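The term-by-term behaviour of first-order DCM described above can be sketched as follows. This is an illustrative model only, assuming integer data and zero initial state. The names `dcm_next_output` and `dcm_filter` are hypothetical, and the sketch models only the arithmetic, not the word lengths or the memory organisation of [1].

```python
def dcm_next_output(coeffs, stored_products, history):
    """Compute one output with first-order DCM.

    history[k] holds the input at lag k for the new output, and
    stored_products[k] holds C_k times the lag-k input of the previous
    output. Each new product reuses one stored product.
    """
    new_products = [coeffs[0] * history[0]]  # C_0 term is computed in full
    for k in range(1, len(coeffs)):
        delta = coeffs[k] - coeffs[k - 1]  # short-wordlength operand
        # The lag-k input of the new output was the lag-(k-1) input of the
        # previous output, so stored_products[k - 1] is the compensated term.
        new_products.append(delta * history[k] + stored_products[k - 1])
    return sum(new_products), new_products


def dcm_filter(coeffs, samples):
    """Run first-order DCM over a sample stream with zero initial state."""
    n = len(coeffs)
    history = [0] * n
    products = [0] * n
    outputs = []
    for x in samples:
        history = [x] + history[:-1]
        y, products = dcm_next_output(coeffs, products, history)
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    print(dcm_filter([1, 3, 4, 6], [5, 2, 7, 1, 0, 9]))
```

Note that one stored product is read back for every tap, which is the per-term memory access that the text identifies as the inefficiency of DCM.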

A differential input can be introduced to reduce the word length and thus save power. If the range of the difference between two successive inputs is $W_d$ bits smaller than that of the original input, we may use even shorter multipliers. Such a case may occur in speech...
systems such as wireless phone systems. In such a system, the filter input is often obtained from an analog-to-digital converter (ADC). The analog input (a speech signal) is continuous and seldom changes sharply in amplitude. The filter input is therefore quite close to its neighboring inputs, and their difference is a small value, which is well suited to such a differential input design.

The proposed method, which we term the differential coefficient/input method (DCIM), can be formulated as follows. For any two consecutive outputs $Y_{j+1}$ and $Y_j$, we obtain

$$Y_{j+1} = C_0 X_{j+1} + C_1 X_j + \ldots + C_{N-1} X_{j-N+2}$$

and

$$Y_j = C_0 X_j + C_1 X_{j-1} + \ldots + C_{N-1} X_{j-N+1}$$

Let

$$\Delta Y_j = Y_{j+1} - Y_j$$

and define the sum of the first $(N-1)$ partial products of $\Delta Y_j$ as $\Delta Y_{j,p}$. Then, reformulating with DCIM, we can express $\Delta Y_j$ as

$$\Delta Y_j = \sum_{i=0}^{N-1} C_i (X_{j-i+1} - X_{j-i})$$

so

$$\Delta Y_{j,p} = \sum_{i=0}^{N-2} C_i (X_{j-i+1} - X_{j-i})$$
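One possible reading of this formulation is sketched below: each output is obtained by adding an output difference to the previous output, and the output difference is built from differential inputs, first-order differential coefficients, and a single stored partial sum. The name `dcim_filter` is hypothetical, and the way the stored partial sum is refreshed (subtracting the last tap product) is an assumption of this sketch rather than a detail given in the text.

```python
def dcim_filter(coeffs, samples):
    """Illustrative DCIM-style recursion with zero initial state.

    Only differential inputs are multiplied. Apart from the C_0 term, the
    multiplier operands are first-order coefficient differences, and the
    compensated terms enter as one stored partial sum per output.
    """
    n = len(coeffs)
    deltas = [coeffs[k] - coeffs[k - 1] for k in range(1, n)]
    dx_history = [0] * n  # dx_history[k] is the input difference at lag k
    prev_x = 0
    y = 0
    dy_partial = 0        # sum of the first n - 1 products of the last dy
    outputs = []
    for x in samples:
        dx_history = [x - prev_x] + dx_history[:-1]
        prev_x = x
        dy = coeffs[0] * dx_history[0]
        for k in range(1, n):
            dy += deltas[k - 1] * dx_history[k]
        dy += dy_partial  # compensated terms, retrieved once per output
        y += dy
        # Assumption of this sketch: refresh the stored partial sum by
        # removing the last tap product from the full difference.
        dy_partial = dy - coeffs[-1] * dx_history[-1]
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    print(dcim_filter([1, 3, 4, 6], [5, 2, 7, 1, 0, 9]))
```

Compared with the term-by-term DCM order, only the running output and one partial sum are stored and retrieved per output in this sketch.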

III. ARCHITECTURE DESIGN

The proposed architecture is designed with the DA technique. DA, since its introduction by Peled and Liu [2], has been regarded as an efficient bit-serial computational operation that performs the filter operations in a single direct step. Without loss of generality, if we express the differential input $\Delta X_j = X_{j+1} - X_j$ in a bit-level unsigned fractional representation

$$\Delta X_j = \sum_{b=0}^{W-1} x_{j,b} \, 2^{-b}$$

where $x_{j,b}$ is bit $b$ of $\Delta X_j$ and $W$ is its word length, we can reformulate the FIR operations as

$$\Delta Y_j = \sum_{i=0}^{N-1} C_i \, \Delta X_{j-i}$$

$$= \sum_{i=0}^{N-1} C_i \left( \sum_{b=0}^{W-1} x_{j-i,b} \, 2^{-b} \right) = \sum_{b=0}^{W-1} \left( \sum_{i=0}^{N-1} C_i \, x_{j-i,b} \right) 2^{-b}$$

The terms inside the bracket are precomputed and stored in a memory. The input data is used to address the memory, and the result is accumulated to obtain the output. Since all filter coefficients are constant, we can use a ROM to store the precomputed partial results and avoid computing them online.

As illustrated in Fig. 1, the ROM address is formed from the same bit position of all input data. We can therefore group the high-transition-probability LSBs of all inputs together and separate them from the low-transition-probability MSBs. Since the MSBs are often zero, we may skip the memory access and accumulation operation for them and thus save power. In addition, the ROM realization in DA itself offers a low-power option, since the high-power multiplication is replaced by table lookup and accumulation.
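The bit-level reformulation can be illustrated with a small software model of DA. This is a sketch under simplifying assumptions: the inputs are unsigned integers rather than fractions (so bit $b$ carries weight $2^{b}$), and the ROM is a plain Python list. The names `build_da_rom` and `da_inner_product` are hypothetical.

```python
def build_da_rom(coeffs):
    """Precompute one ROM word per address.

    Bit i of the address selects coefficient i, so the word at an address
    is the sum of the selected coefficients. The ROM has 2**N words.
    """
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(rom, inputs, word_length):
    """Bit-serial inner product of the coefficients with unsigned inputs.

    Returns the result and the number of ROM accesses actually performed.
    Bit positions where every input bit is zero are skipped.
    """
    acc = 0
    accesses = 0
    for b in range(word_length):
        addr = 0
        for i, x in enumerate(inputs):
            addr |= ((x >> b) & 1) << i  # same bit position of every input
        if addr == 0:
            continue  # nothing to look up or accumulate for this bit
        acc += rom[addr] << b
        accesses += 1
    return acc, accesses


if __name__ == "__main__":
    rom = build_da_rom([2, -3, 5, 7])
    print(da_inner_product(rom, [1, 2, 3, 0], word_length=8))
```

With small input values, as expected for differential inputs, only the low bit positions produce a nonzero address, so most of the word length is skipped without any ROM access.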

IV. COMPUTATIONAL ANALYSIS

The average net computational energy per $Y$, denoted by $E_{NET}$, is the sum of the multiplication cost ($E_{MULT}$), the data and coefficient storage access cost ($E_{MEM}$), the product term accumulation cost ($E_{ACC}$), the overhead storage access cost ($E_{MEM}$), and the overhead addition cost ($E_{ADD}$). So

$$E_{NET} = \sum_{i \in \{\text{MULT}, \text{MEM}, \text{ACC}\}} E_i + \sum_{j \in \{\text{MEM}, \text{ADD}\}} E_j$$

where the second sum runs over the overhead storage access and overhead addition costs.

We consider designs both with and without the DA architecture. For ease of comparison, let the average energy dissipated in a single-bit full addition or subtraction be denoted by \( E_{\text{add}} \). Let the average energy dissipated per bit in a single-bit arithmetic shift of a field be denoted by \( E_{\text{shift}} \). Let the magnitude of the m-th order difference between coefficients be \( W_d^m \) bits smaller than the original coefficients. For DCM, according to [1], we have the following:

\[ \{E_{\text{mem}} \}_{\text{DCM}} = N(W_x + W_c) + 2 \times \{E_{\text{add}} \}_{\text{DCM}} \]

\[ \{E_{\text{mem}} \}_{\text{DCM}, \text{DA}} = N(W_x + W_c) + 3 \times \{E_{\text{add}} \}_{\text{DCM}, \text{DA}} \]

As shown in the formulas, the overhead of DCM (storage accesses and additions) is proportional to the order of differences and the number of taps, which quickly offsets the energy savings gained from the differential coefficients.

Similar formulas can be obtained for DCIM without the DA architecture, since both DCM and DCIM without DA use shift-and-add operations for multiplication.

\[ \{E_{\text{mem}} \}_{\text{DCIM}} = N(W_x + W_c) + 2 \times \{E_{\text{add}} \}_{\text{DCIM}} \]

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = N(W_x + W_c) + 3 \times \{E_{\text{add}} \}_{\text{DCIM}, \text{DA}} \]

For DCIM with the DA architecture, the multiplication cost is the cost of the memory table lookup. So

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = N(W_x + W_c) + 2 \times \{E_{\text{add}} \}_{\text{DCIM}, \text{DA}} \]

As to the data and coefficient storage cost, since the DA architecture has distributed the coefficients in the ROM, only the differential input data access cost has to be counted.

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = N(W_x + W_c) \]

The product term accumulation power is the power consumed by the shift-adder in the architecture.

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = N(W_x + W_c) \]

The overhead required by DCIM is the storage and addition of the compensated terms, and it grows as the order of differences between coefficients increases. The overhead for the m-th order difference between coefficients is \( (m+1) \) storage locations and \( (m+1) \) extra additions. The overhead cost (read and write) is

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = 2m(W_x + W_c + \left[ \log_2 N \right]) \]

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = 2m(W_x + W_c + \left[ \log_2 N \right]) \]

\[ \{E_{\text{add}} \}_{\text{DCIM}, \text{DA}} = (W_x + W_c + \left[ \log_2 N \right]) \]

\[ \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} = 2m(W_x + W_c + \left[ \log_2 N \right]) \]

So the average net energy savings, denoted by \( S_{\text{NET}} \), is

\[ S_{\text{NET}} = \left( \{E_{\text{mem}} \}_{\text{DCIM}} - \{E_{\text{mem}} \}_{\text{DCIM}, \text{DA}} \right) / \{E_{\text{mem}} \}_{\text{DCIM}} \]

The storage models used in this paper are the same as those in [1]. For the square root model, \( \{E_{\text{mem}} \} = K_{\text{S}} \sqrt{S_{\text{MEM}}} \). For the logarithmic model, \( \{E_{\text{mem}} \} = K_{\text{L}} \log S_{\text{MEM}} \). For the linear model, \( \{E_{\text{mem}} \} = K_{\text{LIN}} S_{\text{MEM}} \). The memory sizes \( S_{\text{MEM}} \) for the two methods are

\[ \{S_{\text{MEM}} \}_{\text{DCIM}} = N((m+1)(W_x + W_c - W_d^m)) \]

\[ \{S_{\text{MEM}} \}_{\text{DCIM}, \text{DA}} = N((m+1)(W_x + W_c - W_d^m)) \]

For large \( N \), the ROM size becomes impractically large. One solution is to partition a filter with many taps into smaller groups of taps. For fair comparison, we use the low-power library data in [1]. In that library, \( E_{\text{mem}} = 200 \) and \( E_{\text{add}} = 170 \).
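The growth of the DA table and the effect of partitioning can be estimated with a short calculation. This is a rough sketch that counts ROM words only, assuming one address bit per tap and ignoring the word width. The function name `da_rom_words` and the group size of 4 are illustrative assumptions, not values from the paper.

```python
def da_rom_words(num_taps, group_size=None):
    """Count DA ROM words for num_taps taps.

    Without partitioning, the table has 2**num_taps words. With
    partitioning, each group of taps gets its own smaller table and the
    table outputs must be summed by extra adders.
    """
    if group_size is None:
        return 1 << num_taps
    full_groups, remainder = divmod(num_taps, group_size)
    words = full_groups * (1 << group_size)
    if remainder:
        words += 1 << remainder
    return words


if __name__ == "__main__":
    print(da_rom_words(26))     # single table for a 26-tap filter
    print(da_rom_words(26, 4))  # six 16-word tables plus one 4-word table
```

The partitioned tables are far smaller, at the cost of additional additions to combine their outputs.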

Fig. 2 and Fig. 3 show the energy savings \( S_{\text{NET}} \) for a 26-tap Hamming window filter. For fair comparison, we use the same parameters as in [1]. \( W_d \) is assumed to be 2 as a conservative estimate. \( W_d^m \) will be 2m for 26-tap filters. As shown in the figures, DCIM is superior to DCM because it avoids the unnecessary accesses to compensated values. DCIM with DA performs worse under the linear and square root memory models because of its large memory size, but it gives large energy savings under the logarithmic memory model. Designs with a short word length and a high order of differences are therefore suitable for DCIM without DA. The delays of DCM and DCIM without DA are comparable, since both designs use add and shift operations in place of multiplications. The throughput of DCIM with DA will be higher than that of DCM, since it uses precomputed data and avoids computation.

V. CONCLUSION

This paper presents a low-power FIR realization that uses both the first-order difference between inputs and various orders of differences between coefficients. For an example filter, DCIM shows greater energy savings than DCM because it avoids unnecessary operations on compensated values. DCIM with the DA architecture can be applied to short filters for large energy savings. For long filters, we can directly apply DCIM without DA to avoid the exponential growth of the DA ROM table.

REFERENCES


Fig. 2 Energy Savings as a function of the memory access constant.

Fig. 3 Energy savings as a function of the order of differences used.