A Tree-Systolic Array of DLMS Adaptive Filter
Lan-Da Van, Shing Tenqchen*, Chia-Hsun Chang and Wu-Shrung Feng

Department of Electrical Engineering, Labs 353, National Taiwan University. Taipei, Taiwan, ROC

Chunghwa Telecom Telecommunication Labs., 12, Lane 551, Sec. 5, Min-Tsu Rd., Yang-Mei Zien, Tao-Yuan County, Taiwan 326, ROC

Tel: 886-2-23635251-353 Fax: 886-2-23638247, email: stc@ms.chttl.com.

Indexing terms: Maximum driving, Systolic array

ABSTRACT
In this work, we develop an optimized binary tree-level rule for designing the systolic array structure of a delayed LMS (DLMS) adaptive filter. With this method, a higher convergence rate can be obtained without sacrificing the properties of the systolic array structure. Based on the optimized tree rule, a user can easily design any even-number-tap adaptive system with minimum delay and high regularity under the constraints of maximum driving and the total number of taps.

1. INTRODUCTION
Adaptive filters have a wide range of applications, such as system identification [1], adaptive equalization [2], echo cancellation [3], and noise cancellation. However, existing implementations either require a long delay through all of the taps of a systolic array [1, 4] or achieve a shorter delay without considering the systolic array [5]. One of the most common algorithms for adaptive filtering is the least mean square (LMS) algorithm, which has received much attention because of its superior performance. However, owing to limited driving capability and the local nature of connections in hardware, it is difficult to implement the LMS algorithm directly in a VLSI chip without considering delay. A great deal of research [1, 4] has been conducted on the efficient implementation of adaptive filters. Our motivation is to design a rule suitable for chip realization. This paper presents a modified systolic implementation of the DLMS algorithm, in which we propose a rule that lets the designer decide the delay stage (i.e., the tree level) and insert delay elements to construct a systolic array suitable for VLSI design. Finally, we verify our systolic array structure by computer simulation of two examples: system identification [7] and adaptive equalization [2].

2. MODIFIED SYSTOLIC ARRAY STRUCTURE
An \( N \)-tap adaptive filter using the DLMS algorithm may be represented by the following equations:

\[ y(n) = W^T(n)X(n) \]  
\[ e(n) = d(n) - y(n) \]  
\[ W(n + 1) = W(n) + \mu \times e(n - D) \times X(n - D) \]

where \( d(n) \) and \( y(n) \) denote the desired signal and the output signal, respectively. \( D \) is the delay in the weight adaptation, \( \mu \) is the step size used to adapt the weight vector, and \( e(n) \) is the error. In the above equations, the weight vector \( W(n) \) and the input vector \( X(n) \) are defined as follows:

\[ W(n) = [\omega_0(n), \omega_1(n), \ldots, \omega_{N-1}(n)]^T \]  
\[ X(n) = [x(n), x(n-1), \ldots, x(n-N+1)]^T \]

where \( T \) denotes the transpose of a matrix.
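The recursion above can be sketched in a few lines of Python with NumPy. This is only an illustrative software model under stated assumptions, not the systolic hardware structure: the function name `dlms`, the zero initial weights, the zero-padded input history, and the 4-tap test system `h_true` are all hypothetical choices made for the example.

```python
import numpy as np

def dlms(x, d, num_taps, mu, delay):
    """Run the DLMS recursion; delay = 0 gives the ordinary LMS algorithm."""
    n_samples = len(x)
    w = np.zeros(num_taps)                      # assumed initial weights W(0) = 0
    # Assumption: x(n) = 0 for n < 0, so X(n) is defined from the first sample.
    x_pad = np.concatenate([np.zeros(num_taps - 1), x])
    e = np.zeros(n_samples)
    for n in range(n_samples):
        # X(n) = [x(n), x(n-1), ..., x(n-N+1)]^T
        x_n = x_pad[n:n + num_taps][::-1]
        e[n] = d[n] - w @ x_n                   # e(n) = d(n) - W^T(n) X(n)
        if n >= delay:
            # X(n-D), paired with the delayed error e(n-D)
            x_old = x_pad[n - delay:n - delay + num_taps][::-1]
            w = w + mu * e[n - delay] * x_old   # W(n+1) = W(n) + mu e(n-D) X(n-D)
    return w, e

# Noise-free identification of a hypothetical 4-tap system.
rng = np.random.default_rng(0)
h_true = np.array([0.5, -0.3, 0.2, 0.1])
x = rng.standard_normal(5000)
d = np.convolve(x, h_true)[:len(x)]
w_lms, e_lms = dlms(x, d, num_taps=4, mu=0.01, delay=0)
w_dlms, e_dlms = dlms(x, d, num_taps=4, mu=0.01, delay=3)
```

With a small step size and no measurement noise, both runs should settle on `h_true`; the delayed version simply applies each correction \( D \) samples late.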

When \( D = 0 \), the DLMS algorithm reduces to the LMS algorithm. The DLMS algorithm was first proposed in 1984 by Proakis [5]. That paper also provides a tree method; however, the error terms must still be propagated globally to adjust all of the weights. Global propagation is the main drawback, and driving all of the weights is also a problem in VLSI design. Recently, several systolic implementations of DLMS have been studied [1, 4], but they still require an $N$-tap delay. Herein, we propose an optimized binary tree-level rule and insert delay elements to solve these problems. First, we modify the structure of [4] to derive a new systolic array that avoids both global propagation and the driving of all taps, except for the single feedback loop. In addition, the new systolic structure converges faster than the conventional systolic array structures [1, 4]. A new processing element ($PE_2$) is depicted in Fig. 1 as an example, where the subscript 2 is the number of tree levels; that is, this kind of $PE$ allows us to adjust 4 taps per clock. We pad 3 delay elements, shown with cross-hatching, onto the 3 input terms (i.e., $\mu \times e(n - D)$, $x(n)$ and $x(n - D)$).

Fig. 1. The inserted delay elements, shown with cross-hatching, for the inputs $\mu \times e(n - D)$, $x(n)$ and $x(n - D)$ in $PE_2$.

The generalized structure of the overall adaptive filter is depicted in Fig. 2. This structure solves the global propagation and driving problems. Thus, we gain both the merit of the tree structure (i.e., the speed-up of the feedback) and the advantage of the systolic array (i.e., suitability for VLSI design). The next problem is choosing the tree level that achieves minimum delay under the constraints of a given total number of taps and a given maximum driving per clock. We define maximum driving as the maximum number of filter coefficients that can be adjusted per clock in the physical design. Also, the smaller the delay of the system, the better its performance [5].

Fig. 2. The overall systolic structure with cascaded $PE$s where $PE_{p_{\text{max}}}$ has $2^{p_{\text{max}}}$ taps.

From the structure in Fig. 2, we derive the following rule:

Rule:

$$S_p = N \bmod 2^p$$

for $k = (p - 1):-1:1$

$$S_k = S_{k+1} \bmod 2^k$$

end

$$D = p + 2 + \left\lfloor \frac{N}{2^p} \right\rfloor + \sum_{k=1}^{p} \frac{S_k}{2^{k-1}} \tag{4}$$

where $S_k$, $p$ and $N$ are the residues, the number of tree levels and the number of taps, respectively. The notation $\lfloor S \rfloor$ denotes the largest integer less than or equal to $S$. The rule therefore gives the optimized minimum delay as $p_{\text{max}}$ varies for a fixed $N$. For example, when $N = 62$ and the maximum driving is 32 taps per clock, we find that the tree level can be 4 or 5, as shown in Fig. 3.

![Optimized Tree-Level Rule](image)

Fig. 3. The optimized tree rule between delay and regularity.

With respect to regularity, we observe that a tree level of $p_{\text{max}} = 4$ needs fewer kinds of $PE$. It is therefore the optimized value when both delay and regularity are considered.
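The Rule and Eq. (4) can be evaluated with a short Python sketch. The function names `tree_delay` and `candidate_levels` are hypothetical, and two assumptions are made that the text does not state explicitly: each summand $S_k / 2^{k-1}$ is rounded down so that $D$ is an integer number of delay stages, and a tree level $p$ is admissible when $2^p$ does not exceed the maximum driving.

```python
def tree_delay(num_taps, p):
    """Evaluate Eq. (4) for N = num_taps and tree level p."""
    residues = {p: num_taps % 2**p}              # S_p = N mod 2^p
    for k in range(p - 1, 0, -1):                # k = (p-1):-1:1
        residues[k] = residues[k + 1] % 2**k     # S_k = S_{k+1} mod 2^k
    # Assumption: each summand S_k / 2^(k-1) is floored to an integer.
    extra = sum(residues[k] // 2**(k - 1) for k in range(1, p + 1))
    return p + 2 + num_taps // 2**p + extra

def candidate_levels(num_taps, max_driving):
    """Map every tree level p with 2^p <= max_driving (assumed) to its delay D."""
    delays = {}
    p = 1
    while 2**p <= max_driving:
        delays[p] = tree_delay(num_taps, p)
        p += 1
    return delays

delays_62 = candidate_levels(62, 32)
print(delays_62)   # {1: 34, 2: 20, 3: 14, 4: 12, 5: 12}
```

Under these assumptions, $N = 62$ with a maximum driving of 32 reaches its smallest delay at tree levels 4 and 5, which agrees with the example in the text.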

3. SIMULATION AND PERFORMANCE

System identification is one of the most widely used applications in many control areas; we therefore use it to verify the systolic array structure and the rule by computer simulation. In this example, the unknown system is a 10-tap band-pass FIR filter whose frequency response is defined as follows:

$$H(e^{j\omega}) = \begin{cases} e^{-j\pi(N-1)/2}, & 0.3\pi \leq |\omega| \leq 0.7\pi \\ 0, & \text{otherwise} \end{cases}$$

The identifying structure is a 16-tap transversal filter with a step size of 0.025, adapted with $LMS$, conventional $DLMS$, and modified $DLMS$; the input sequence is a zero-mean Gaussian random process. Using the Rule and Eq. (4) for $N = 16$, we may choose $p_{\text{max}} = 2, 3$ or 4 and obtain $D = 8, 7$ or 7, respectively. Intuitively, selecting $p_{\text{max}} = 3$ is the optimized choice for this structure, giving minimum delay and high regularity. The ensemble average over 50 runs of (a) $LMS$, (b) modified $DLMS$ using the optimized tree, and (c) conventional $DLMS$ is shown in Fig. 4. The simulation shows that the convergence rate of the modified $DLMS$ is superior to that of the conventional $DLMS$ algorithm.
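The delays quoted for $N = 16$ can be checked by hand. Because 16 is divisible by 4, 8 and 16, every residue $S_k$ is zero for these three tree levels, so Eq. (4) reduces to $D = p + 2 + \lfloor N / 2^p \rfloor$:

$$\begin{aligned}
p_{\text{max}} = 2 &: \quad D = 2 + 2 + \lfloor 16/4 \rfloor = 8 \\
p_{\text{max}} = 3 &: \quad D = 3 + 2 + \lfloor 16/8 \rfloor = 7 \\
p_{\text{max}} = 4 &: \quad D = 4 + 2 + \lfloor 16/16 \rfloor = 7
\end{aligned}$$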

![Comparison results](image)

Fig. 4. Comparison results of (a) $LMS$, (b) modified $DLMS$ using optimized tree, and (c) conventional $DLMS$.

In the second example, we study the use of the modified structure for adaptive equalization of a linear dispersive channel [2] that produces unknown distortion. The random sequence $\{x_n\}$ applied to the channel input is a Bernoulli sequence with $x_n = \pm 1$, zero mean and unit variance. The impulse response of the channel is described by the raised cosine:

$$h_n = \begin{cases} \frac{1}{2} \left[ 1 + \cos\left(\frac{2\pi}{W} (n - 2)\right) \right], & n = 1, 2, 3 \\ 0, & \text{otherwise} \end{cases}$$

where the parameter $W$ controls the amount of amplitude distortion produced by the channel. Herein, we choose $W = 3.1$ and a step size of 0.03. The ensemble-average results over 200 runs, shown in Fig. 5, indicate that (b) the tree-systolic DLMS, with \( p_{\text{max}} = 2 \) and \( D = 7 \), converges similarly to (a) the LMS algorithm. In contrast, (c) the conventional DLMS algorithm shows larger variation and slower convergence.
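For reference, the channel taps and a Bernoulli input for this experiment can be generated as in the following Python sketch. It covers only the signal setup, not the equalizer itself; the function name `raised_cosine_channel`, the sequence length, and the random seed are hypothetical choices, and no channel noise is added because the text does not specify any.

```python
import numpy as np

def raised_cosine_channel(W):
    """Return the nonzero taps [h_1, h_2, h_3] of the raised-cosine channel."""
    n = np.arange(1, 4)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi / W * (n - 2)))

h = raised_cosine_channel(3.1)       # W = 3.1 as chosen in the text

# Bernoulli input x_n = +/-1 (zero mean, unit variance); length is arbitrary.
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=1000)

# Channel output; h_n is nonzero only for n = 1, 2, 3, so prepend h_0 = 0.
u = np.convolve(x, np.concatenate([[0.0], h]))[:len(x)]
```

The taps are symmetric about the centre tap $h_2 = 1$, and increasing $W$ raises the two side taps, which is how $W$ controls the amplitude distortion.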

![Comparison results](image.png)

Fig. 5. Comparison results of (a) LMS, (b) modified DLMS using optimized tree, and (c) conventional DLMS.

4. CONCLUSIONS

This paper has presented a new systolic array that selects an optimized binary tree structure and inserts a delay element every \( 2^n \) taps, yielding a structure suitable for VLSI design. We verified the structure by computer simulation of two examples, system identification and adaptive equalization, and observed that its performance is much better than that of the conventional systolic array structures [1, 4] using the DLMS algorithm. Moreover, because the new structure combines the merit of the tree structure with the advantage of the systolic array, a user can easily design any even-number-tap adaptive system under the constraints of maximum driving and the total number of taps.

References


