Mälardalen University Press Dissertations No. 23

Implementation of Digit-Serial LDI/LDD Allpass Filters

Krister Landernäs

January 2006

Department of Computer Science and Electronics, Mälardalen University, Västerås, Sweden
Copyright © Krister Landernäs, 2006
ISSN 1651-4238
ISBN 91-85485-07-1
Printed by Arkitektkopia, Västerås, Sweden
Distribution: Mälardalen University Press
Abstract

In this thesis, digit-serial implementation of recursive digital filters is considered. The theories presented can be applied to any recursive digital filter, and in this thesis we study the lossless discrete integrator (LDI) allpass filter. A brief introduction regarding suppression of limit cycles under finite wordlength conditions is given, and an extended stability region, where the second-order LDI allpass filter is free from quantization limit cycles, is presented.

The realization of digit-serial processing elements, i.e., digit-serial adders and multipliers, is studied. A new digit-serial hybrid adder (DSHA) is presented. The adder can be pipelined to the bit level with a short arithmetic critical path, which makes it well suited for implementing high-throughput recursive digital filters. Two digit-serial multipliers which can be pipelined to the bit level are considered. It is concluded that a digit-serial/parallel multiplier based on shift-accumulation (SAAM) is a good candidate when implementing recursive digital systems, mainly due to its low latency. Furthermore, our study shows that low latency leads to higher throughput and lower power consumption.

Scheduling of recursive digit-serial algorithms is studied. It is concluded that knowledge of implementation issues such as latency and arithmetic critical path is usually required before scheduling considerations can be made. Cyclic scheduling using digit-serial arithmetics is also considered. It is shown that digit-serial cyclic scheduling is very attractive for high-throughput implementations.
Acknowledgments

First of all, I would like to thank Dr. Johnny Holmberg, who has always taken the time to discuss my work. Our daily discussions on various topics, many of them having nothing to do with research, have made all the difference. I am also grateful to my supervisor Prof. Lennart Harnefors for giving me the opportunity to do this work and for supporting me in my research. I would also like to thank my colleagues at the Department of Computer Science and Electronics for their support and daily discussions. I would also like to express my gratitude to Prof. Mark Vesterbacka, Linköping University, for his enthusiasm and interest in my research. Finally, I would like to thank my family for their support and encouragement.

Västerås, a clear day in November 2005
Krister Landernäs
Contents

1 Introduction 1
  1.1 Motivation ... 1
  1.2 Digital Filters ... 2
    1.2.1 Digital Lattice Filters ... 5
    1.2.2 LDI/LDD Allpass Filters ... 6
  1.3 Computational Properties of Digital Filter Algorithms ... 11
    1.3.1 Precedence Graph ... 11
    1.3.2 Latency and Throughput at the Algorithmic Level ... 11
    1.3.3 Critical Path and Minimal Sample Period ... 12
  1.4 System Design ... 12
    1.4.1 Number Representation ... 12
    1.4.2 Signal Quantization ... 13
    1.4.3 Overflow ... 14
    1.4.4 Digit-Serial Arithmetics ... 15
    1.4.5 Computation Graph ... 16
    1.4.6 Latency and Throughput at the Arithmetic Level ... 17
    1.4.7 Pipelining ... 17
    1.4.8 Power Consumption ... 19
    1.4.9 Implementation Considerations ... 20
    1.4.10 Design Flow ... 21
    1.4.11 Design Tools ... 22
  1.5 Scientific Contributions ... 23

2 Stability Results for the LDI Allpass Filter 27
  2.1 Previously Published Results ... 28
  2.2 Stability Analysis for Systems with One Nonlinearity ... 31
  2.3 Stability Analysis for the Second-Order LDI/LDD Allpass Filter ... 34
  2.4 Summary ... 35
3 Digit-Serial Processing Elements 37
  3.1 Introduction ... 37
  3.2 Adders ... 37
    3.2.1 Linear-Time Adders ... 37
    3.2.2 Logarithmic-Time Adders ... 39
    3.2.3 Digit-Serial Adders ... 44
  3.3 A New Digit-Serial Hybrid Adder ... 48
  3.4 Digit-Serial Shifting ... 48
  3.5 Digit-Serial Multipliers ... 53
    3.5.1 Digit-Serial/Parallel Multiplier ... 53
    3.5.2 Digit-Serial/Parallel Multiplier Based on Shift-Accumulation ... 57
    3.5.3 A Pipelined Digit-Serial/Parallel Multiplier ... 60

4 Scheduling of Digit-Serial Processing Elements 63
  4.1 Introduction ... 63
  4.2 Computational Properties of Digit-Serial Processing Elements ... 64
  4.3 Single Interval Scheduling ... 66
    4.3.1 Scheduling Using Ripple-Carry Adders ... 67
    4.3.2 Scheduling Using Bit-Level Pipelined Processing Elements ... 69
  4.4 Digit-Serial Cyclic Scheduling ... 70
  4.5 Using Retiming to Reduce Power Consumption ... 71
  4.6 Control Unit ... 71

5 Conclusions 73
  5.1 Future Work ... 74
Chapter 1

Introduction

1.1 Motivation

In the last decade there has been a significant increase in the usage of battery-powered portable devices. Today, mobile telephones, MP3 players and laptop computers are common products. Many of these products also have an increasing number of functions, thus requiring higher complexity. As a result, power consumption has become an important aspect when implementing digital systems intended for battery-powered applications.

Low power consumption is also of interest in many products that are not battery powered. The reason for this is that a high power dissipation will lead to increased chip temperature. This heat will shorten the circuit lifetime and increase the risk of malfunction. Much time and effort is spent on integrating cooling devices in electronic systems to get rid of excess heat.

Digital signal processing is common in many of the devices described above, for example mobile telephones. It is, therefore, important to study how low-power implementation of digital signal processing algorithms can be achieved. Careful considerations concerning, for example, arithmetics must be made when realizing systems in order to minimize power dissipation. A common rule of thumb is that low hardware complexity is likely to render a system with lower power consumption than its more complex counterpart, since the capacitive load is reduced. Finding a relation between hardware complexity and power consumption is a non-trivial task. The switching activity of the circuit is an important parameter to consider when studying power dissipation. Unfortunately, the relationship between switching activity and hardware complexity is difficult to study without implementing the circuit and performing simulations.

The throughput requirement of the signal processing depends on the application. An audio signal will typically require much lower processing rates than a video signal. In most signal processing cases, however, there is no advantage in performing the computation faster than required.
This will only cause the processing elements to wait until further processing is required. To this end, an efficient implementation must meet throughput requirements while exhibiting low power consumption. Naturally, a small hardware solution is preferable since it reduces the manufacturing cost of the system.

In this thesis we study implementation of high-speed and low-power digital filters. We particularly study power/speed characteristics of digit-serial filter implementations. Digit-serial computation offers higher throughput than the corresponding bit-serial realization, without the overhead obtained in a bit-parallel solution. This makes digit-serial implementation interesting in moderate-speed low-power applications. The main motivation for using digit-serial arithmetics in low-power designs is that it requires fewer wires and less complex processing elements compared to the corresponding bit-parallel implementation. A digit-serial design approach allows the designer to find a trade-off between area, speed, and power for the application under consideration.

1.2 Digital Filters

There are several reasons why digital filters have become more common in electronic systems over the years. Like many digital systems today, digital filters are often implemented in a computer using a high-level programming language. This results in a short development time and makes them flexible and highly adaptable, since changing the filter characteristics simply implies changing some variables in the code. Analog filters, on the other hand, are implemented using analog components, such as inductors and capacitors, which must be carefully tuned. This makes analog filters harder to develop and modify. Another advantage of digital design is that the characteristics of digital components do not change over time. Digital systems are also unaffected by temperature variations.
Advances in CMOS processes have resulted in higher packing density and lower threshold voltages, leading to a considerable decrease in power consumption, which further explains the increased interest in digital filters.

Today, frequency-selective digital filters are important and common components in modern communication systems. Like their analog counterparts, digital filters are used to suppress unwanted frequency components. A linear, time-invariant and causal filter can be described by a difference equation

    y(n) = -Σ_{k=1}^{N} a_k y(n-k) + Σ_{l=0}^{M} b_l u(n-l),   N ≥ M.   (1.1)

By transforming (1.1) with the z-transform [39] we can express it as a transfer function

    H(z) = Y(z)/U(z) = (Σ_{l=0}^{M} b_l z^{-l}) / (1 + Σ_{k=1}^{N} a_k z^{-k}) = B(z)/A(z).   (1.2)
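The difference equation (1.1) can be evaluated directly by summing the feedforward and feedback taps. The sketch below is illustrative only; the function name and the toy first-order filter are our own choices, not part of the thesis.

```python
def iir_direct_form(a, b, u):
    """Evaluate the difference equation (1.1):
    y(n) = -sum_k a[k]*y(n-k) + sum_l b[l]*u(n-l),
    with a = [a_1, ..., a_N] (feedback) and b = [b_0, ..., b_M] (feedforward)."""
    y = []
    for n in range(len(u)):
        acc = sum(b[l] * u[n - l] for l in range(len(b)) if n - l >= 0)
        acc -= sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y

# Impulse response of the first-order filter y(n) = 0.5*y(n-1) + u(n),
# i.e. a_1 = -0.5, b_0 = 1:
h = iir_direct_form([-0.5], [1.0], [1.0, 0.0, 0.0, 0.0])
# h == [1.0, 0.5, 0.25, 0.125]
```

The infinite (geometrically decaying) impulse response makes the recursive nature of (1.1) concrete.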
1.2. IGITAL FILTERS 3 The frequency function can be obtaine by substituting z = e jω in (1.2), where Ω is the normalize frequency Ω=2π f, (1.3) f s an where f s is the sample frequency. The frequency specification of a filter is often escribe using cut-off frequency Ω p, maximum allowe passban ripple r p an maximum allowe stopban ripple r s. igital filters can also be escribe using a state-space representation [52] x(n 1) = Ax(n)Bu(n) (1.4) y(n) = Cx(n)u(n), (1.5) where x(n) isthen-imensional state-vector an A,B,C, an are referre to as the state-space matrices. In this thesis only single-variable signals are consiere, which implies that B, C, an are of imensions N 1, 1 N, an 1 1, respectively. We can erive a transfer function from the state-space expression as H(z) =C(zI A) 1 B. (1.6) The transfer function is a mathematical escription of a igital filter. However, it oes not give any information of how the filter can be implemente. In fact, for a given transfer function there exists an infinite number of possible igital filter structures. When visualizing a filter structure, a signal flow graph (SFG) is commonly use [52]. The SFG consists of noes an branches. The function escribe by (1.1) is a recursive function, since the computation requires the value of former output samples. Since the impulse response of the filter escribe in (1.1) is infinite, these filters are known as infinite impulse response (IIR) filters. A well-known IIR filter structure is the irect-form filter structure. In Fig. 1.1, the SFG for an Nth orer IIR filter is shown. In the case where a k =0for1 k N the function escribe by (1.1) is a finite impulse response (FIR) filter. FIR filter structures are, although exceptions exist, non-recursive [39]. The SFG for a typical FIR igital filter structure is shown in Fig. 1.2. The recursive nature of the IIR filter can cause these filters to become unstable. 
It is therefore necessary to perform stability analyses when designing IIR digital filters, especially under finite wordlength conditions, see Section 1.4. This is not the case for FIR filters: they cannot become unstable. FIR filters can also be designed with exact linear phase. The main drawback of FIR filters is that they require higher filter orders than IIR filters to achieve a certain filter specification. The higher filter order makes FIR filters larger to implement in hardware than the corresponding IIR filters. Take for example the case where a filter with Ω_p = 0.01, r_p = 0.3 dB, and r_s = 40 dB is to be designed. In the FIR filter case the required
Figure 1.1: Nth-order digital IIR filter structure.

Figure 1.2: Nth-order digital FIR filter structure.
filter order is 112, if the Remez algorithm [27] is used. The corresponding IIR filter order is 7 for a Butterworth filter [27] and even lower for Chebyshev and elliptic filters.

1.2.1 Digital Lattice Filters

IIR transfer functions can be realized using two parallel-connected allpass filters [39], provided that the conditions considered below are met. These filters are commonly known as digital lattice filters [39]. When designing digital lattice filters the separation into two allpass filters is made according to [15]. The transfer function for a digital lattice filter can be expressed as

    H(z) = B(z)/A(z) = (1/2)[H_1(z) ± H_2(z)],   (1.7)

where H_1(z) and H_2(z) are allpass filter transfer functions. We can re-express (1.7) as

    H(z) = (1/2)( (A_0(z^{−1})/A_0(z)) z^{−M} ± (A_1(z^{−1})/A_1(z)) z^{−P} ),   (1.8)

where M and P are the filter orders of the allpass filters. A necessary condition for (1.7) is that B(z) must be either a symmetric or an antisymmetric function [39]. This implies that digital lattice filters can realize odd-order elliptic, Butterworth, or Chebyshev lowpass/highpass frequency functions, and two times odd-order (6, 10, 14, ...) bandpass and bandstop filters. A typical digital lattice filter structure is shown in Fig. 1.3.

Digital lattice filters have several properties which make them well suited for implementation. First, digital lattice filters exhibit the power-complementary property [39]. This implies that

    |(1/2)[H_1(e^{jΩ}) + H_2(e^{jΩ})]|^2 + |(1/2)[H_1(e^{jΩ}) − H_2(e^{jΩ})]|^2 = 1.   (1.9)

The power-complementary property of digital lattice filters yields that they typically will have low passband sensitivity [52]. As a result, they can be implemented with fewer bits in the filter coefficients, reducing the latency, see Section 1.3, and the power consumption of the filter [52]. Another advantage when implementing digital lattice filters is that they can be realized with a canonical number of multipliers and delay elements.
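The identity (1.9) follows from |H_1| = |H_2| = 1 on the unit circle and can be checked numerically for any pair of stable allpass branches. In the sketch below the two branches are single first-order allpass sections with arbitrarily chosen coefficients, purely for illustration:

```python
import cmath

def allpass1(a, z):
    """First-order allpass section H(z) = (a + z^-1) / (1 + a*z^-1)."""
    return (a + 1 / z) / (1 + a / z)

# Evaluate both branches on the unit circle and check (1.9);
# the coefficients 0.3 and -0.6 are illustration values only.
max_dev = 0.0
for k in range(1, 16):
    z = cmath.exp(1j * cmath.pi * k / 16)      # z = e^{jΩ}
    H1, H2 = allpass1(0.3, z), allpass1(-0.6, z)
    total = abs(0.5 * (H1 + H2)) ** 2 + abs(0.5 * (H1 - H2)) ** 2
    max_dev = max(max_dev, abs(total - 1.0))
# max_dev is zero up to rounding: the two branch combinations are
# power complementary regardless of the allpass coefficients.
```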
An Nth-order digital lattice filter can therefore be realized with N multipliers, whereas a direct-form IIR filter requires 2N + 1 multipliers.

There exist two allpass filter structures (known to the author) that can be implemented with a low minimal sample period and a canonical number of multipliers. These are the lossless discrete integrator/differentiator (LDI/LDD) allpass filter [27] and the wave digital (WD) lattice filter [13]. It has been shown that the LDI/LDD allpass filter can be implemented with less hardware resources compared
Figure 1.3: Digital lattice filter.

to the corresponding WD implementation when considering low- and highpass filters [27]. Furthermore, the former filter structure exhibits a lower amount of quantization noise compared to the latter [27]. Therefore, in this thesis, the LDI allpass filter structure is considered.

Wave digital filters have good filter properties under finite wordlength conditions, provided that magnitude truncation and saturation arithmetics are used and placed at the delay elements [14]. WD filters are low-sensitivity filter structures and are good candidates when designing digital lattice filters. The WD filter consists of first- and second-order cascade-connected WD allpass sections. These allpass sections are also known as adaptors. In Fig. 1.4, a three-port series adaptor and a second-order Richards adaptor are shown.

1.2.2 LDI/LDD Allpass Filters

Analog LC filters are passive and have low sensitivity to component variations. These properties are also desirable when designing digital filters. It is, therefore, no surprise that analog LC filters were used as prototypes, not only for WD filters, but also when Bruton [6] began his work on the lossless discrete integrator/differentiator (LDI/LDD) filter. Bruton introduced several analog-to-digital transformations, so-called LDI transformations, which can be used to transform the analog prototype filter to a corresponding digital filter structure. Over the years, Bruton and others [6], [27], [49] have studied the LDI/LDD filter and improved upon the original work.

In this thesis, we will mainly consider the lossless discrete integrator (LDI) allpass filter presented in [25]. Over the years several LDI allpass filters have been presented [27]. It has been shown that the LDI allpass filter structure exhibits good filter properties when the poles are placed around z = 1, even better than in the corresponding WD case. If the poles are placed around z = −1, the transformation z → −z should be used.
We then get the corresponding lossless discrete differentiator (LDD) allpass filter structure. In Fig. 1.5, the general-order LDI/LDD allpass filter structure is shown, where the plus and minus signs correspond to LDI and LDD, respectively. The reason for our interest in the LDI/LDD allpass filter is
Figure 1.4: WD allpass filters. (a) Three-port series adaptor. (b) Second-order Richards adaptor.

Figure 1.5: General-order LDI/LDD allpass filter structure.
that it, like the WD filter, is a low-sensitivity filter structure [27]. Thus, the length of the filter coefficients can be kept small while maintaining an adequate filter frequency function. This is very advantageous for hardware implementations, since low-sensitivity filter structures have good power, area, and throughput characteristics.

LDI Lattice Filter Design Example

Design formulas for the general-order LDI allpass filter were given in [27]. Let us use these formulas to design an 11th-order LDI lattice filter with the following specification:

    Ω_p = 0.3π   (1.10)
    r_p = 0.5 dB   (1.11)
    r_s = 110 dB   (1.12)

The filter is shown in Fig. 1.6.

Figure 1.6: 11th-order digital lattice filter, with one 5th-order and one 6th-order allpass branch.

As presented in [27], the state-space description can be modified in order to simplify the calculation of the filter coefficients. Modifying (1.4) and (1.5) by applying the z-transform and introducing a differentiator variable, ξ = z − 1, gives us

    ξX(z) = A_D X(z) + B U(z)   (1.13)
    Y(z) = C X(z) + D U(z),   (1.14)

where A_D = A − I. For the 6th-order LDI allpass filter the characteristic polynomial of A_D can be expressed as

    p(ξ) = det(ξI − A_D) = ξ^6 + c_1 ξ^5 + c_2 ξ^4 + c_3 ξ^3 + c_4 ξ^2 + c_5 ξ + c_6
         = (ξ + 1 − p_1)(ξ + 1 − p_2) ··· (ξ + 1 − p_6),   (1.15)
where p_1, ..., p_6 are the poles of the filter in the z-plane. The coefficients can be calculated as

    α_1 = c_1 − c_2 + c_3 − c_4 + c_5 − c_6   (1.16)
    α_2 = c_1 − α_1 − (1/α_1)(c_3 − 2c_4 + 3c_5 − 4c_6)
    α_3 = c_1 − α_1 − α_2 − (1/α_2)( c_4 − 2c_5 + 3c_6 − (1/α_1)(c_5 − 3c_6) )
    α_4 = c_1 − α_1 − α_2 − α_3 − (1/α_3)( (1/α_1)(c_5 − 3c_6) − (1/α_2)c_6 )
    α_5 = c_1 − α_1 − α_2 − α_3 − α_4 − (1/α_4)(1/α_2)c_6
    α_6 = c_1 − α_1 − α_2 − α_3 − α_4 − α_5

Using Matlab, the values of c_1, ..., c_6 in (1.15) can be calculated as

    p(ξ) = ξ^6 + 2.009672ξ^5 + 2.851608ξ^4 + 2.203817ξ^3
           + 1.247697ξ^2 + 0.366784ξ + 0.067207.   (1.17)

From (1.17) the coefficients for the 6th-order LDI allpass filter can be derived:

    α_1^(1) = 0.413761   (1.18)
    α_2^(1) = 0.290938
    α_3^(1) = 0.216846
    α_4^(1) = 0.312596
    α_5^(1) = 0.036551
    α_6^(1) = 0.738979

The coefficients for the 5th-order LDI allpass filter can be calculated in a similar manner, rendering

    α_1^(2) = 0.414030   (1.19)
    α_2^(2) = 0.286741
    α_3^(2) = 0.252460
    α_4^(2) = 0.137919
    α_5^(2) = 0.457750

The frequency function for the 11th-order LDI digital lattice filter using the coefficients derived above is shown in Fig. 1.7.
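The recursion (1.16) can be checked numerically against the printed values: feeding in the polynomial coefficients of (1.17) should reproduce (1.18) up to the rounding of the six-decimal inputs. The sketch below transcribes the recursion as we read it; the tolerance reflects the rounded inputs.

```python
# Coefficients c1..c6 of the characteristic polynomial (1.17):
c1, c2, c3, c4, c5, c6 = (2.009672, 2.851608, 2.203817,
                          1.247697, 0.366784, 0.067207)

# The recursion (1.16), evaluated top to bottom:
a1 = c1 - c2 + c3 - c4 + c5 - c6
a2 = c1 - a1 - (c3 - 2*c4 + 3*c5 - 4*c6) / a1
a3 = c1 - a1 - a2 - (c4 - 2*c5 + 3*c6 - (c5 - 3*c6) / a1) / a2
a4 = c1 - a1 - a2 - a3 - ((c5 - 3*c6) / a1 - c6 / a2) / a3
a5 = c1 - a1 - a2 - a3 - a4 - (c6 / a2) / a4
a6 = c1 - a1 - a2 - a3 - a4 - a5

# Values printed in (1.18):
expected = (0.413761, 0.290938, 0.216846, 0.312596, 0.036551, 0.738979)
# Each computed alpha agrees with (1.18) to within the rounding of the
# six-decimal polynomial coefficients.
```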
Figure 1.7: Filter specification (passband ripple r_p and stopband attenuation r_s versus Ω/π) and frequency function for the 11th-order LDI lattice filter.
1.3 Computational Properties of Digital Filter Algorithms

By studying the computational properties of a digital filter algorithm, the performance of the filter can be determined. At this so-called algorithmic level, no information about the realization of the processing elements is known. Still, many issues like parallelism and minimum execution time can be considered at this level.

1.3.1 Precedence Graph

When considering the computational properties of a digital filter, the SFG description is mapped to a precedence graph (PG) [52]. The PG is a graphical description of the state-space representation and describes in which order the processing elements can be executed. By studying the PG it can be derived which processing elements can operate in parallel and which must execute sequentially. In Fig. 1.8, an SFG and a precedence graph of a DF filter are shown.

Figure 1.8: (a) SFG of second-order DF filter. (b) Precedence graph of second-order DF filter.

1.3.2 Latency and Throughput at the Algorithmic Level

To describe the computational properties of an algorithm, two expressions, latency and throughput, are used [46]. Assume that a data flow is applied to a general digital algorithm. Latency at the algorithmic level is the time it takes for the applied data flow to reach the output. Throughput is a measure of how frequently new input data can be applied to the system. The relationship between latency and throughput depends on the characteristics of the data flow. More on this in Section 1.4.
1.3.3 Critical Path and Minimal Sample Period

The throughput of an algorithm is determined by the longest directed path in the precedence graph. This path is known as the critical path (T_cpa) of the algorithm [46]. The algorithmic critical path bounds the throughput from above: the throughput cannot be increased beyond 1/T_cpa without rearranging the algorithm. For the DF filter shown in Fig. 1.8, the algorithmic critical path is one multiplier and three adders.

Decreasing the algorithmic critical path is often desirable in high-performance digital filter design. Equivalent transformations, such as associative and distributive methods or retiming, can sometimes be used [52]. Pipelining is another method which sometimes can be used to decrease T_cpa. By introducing delay elements between the processing elements, a long sequential chain of processing elements can be divided into smaller chains, allowing some processing elements to execute in parallel. Note, however, that in recursive algorithms the sample period is restricted by the recursive loops of the structure. Pipelining the loops will not increase the throughput of the algorithm. Another method that can increase the throughput is unfolding [46]. Unfolding does not decrease the algorithmic critical path of the algorithm, but it increases the parallelism of the algorithm, resulting in a higher throughput.

In recursive algorithms, the recursive loops impose a theoretical bound on the sample period. This bound, also known as the minimal sample period, is given by

    T_min = max_i { T_opt / N_i },   (1.20)

where T_opt and N_i are the total latency of the arithmetic operations and the number of delay elements in the directed loop i, respectively [48]. When the algorithmic critical path of an algorithm is longer than T_min, transformations can be made so that T_cpa = T_min. More on this in Section 4.4.
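The bound (1.20) is a simple maximum over the directed loops of the graph. The helper below is a sketch of ours, with each loop given as a pair (total operation latency, number of delay elements):

```python
def minimal_sample_period(loops):
    """T_min = max_i(T_opt / N_i) over all directed loops, per (1.20).
    Each loop is a (total_latency, n_delay_elements) pair."""
    return max(t_opt / n_delays for t_opt, n_delays in loops)

# Example: one loop containing a multiplier (latency 3) and an adder
# (latency 1) around a single delay element, and a second loop with
# total latency 8 around two delay elements:
T_min = minimal_sample_period([(3 + 1, 1), (8, 2)])
# T_min == 4.0: the single-delay loop is the bottleneck.
```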
1.4 System Design

Several implementation issues must be considered in order to realize a logical description of a filter algorithm described by a precedence graph. These system design issues will be discussed in the next sections.

1.4.1 Number Representation

A digital system can be implemented using either fixed-point arithmetics or floating-point arithmetics. The former is more common when designing low-power systems, since floating-point processing elements are more complex, resulting in a higher power consumption [52]. One drawback when using fixed-point arithmetics is that the number range is quite limited in comparison to floating-point.
In this thesis, fixed-point two's complement number representation is considered, unless otherwise noted. The magnitude of the signal is assumed to be less than or equal to one, i.e., |X| ≤ 1. Furthermore, for LDI lowpass filters and LDD highpass filters, coefficient magnitudes less than or equal to one are usually obtained (when the poles are placed around z = 1 and z = −1 for LDI and LDD, respectively), thus |α_n| ≤ 1, where α_n is an arbitrary filter coefficient. These two conditions are sufficient to prevent overflow after multiplication. An (f_x + 1)-bit number is represented as (given that |X| ≤ 1):

    X = x_0 . x_{−1} x_{−2} ... x_{−(f_x−1)} x_{−f_x},   (1.21)

where the bits x_i, i = 0, −1, ..., −f_x, are either 0 or 1 and the most significant bit x_0 is the sign bit, with the value 0 denoting positive numbers and 1 denoting negative numbers. The value of a two's complement number is given by:

    X = −x_0 + Σ_{i=1}^{f_x} x_{−i} 2^{−i}.   (1.22)

The number range is −1 ≤ X ≤ 1 − Q, where Q = 2^{−f_x}.

1.4.2 Signal Quantization

In finite-precision digital filters, the increase in wordlength after multiplication must be handled. Multiplying two numbers x and y, where f_x + 1 and f_y + 1 are their wordlengths, respectively, will result in a product of length f_x + f_y + 1 bits. The removal of the least significant bits is commonly known as quantization.

Quantization introduces an error that can be modeled as noise. This quantization noise will affect the signal-to-noise ratio (SNR) at the output of the filter and should, therefore, be kept as small as possible. Different filter structures will have different noise properties [39]. It has been shown in [27] that the LDI allpass filter has good noise properties, even better than the corresponding WD realization, provided that the poles are placed around z = 1.

Quantization can be performed in several ways, as seen in Fig. 1.9. The simplest quantization scheme is value truncation, where the unwanted bits are simply dropped. The error introduced by quantization differs depending on the type of quantization used.
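The representation (1.21)–(1.22) can be made concrete with a small decoder; the function below is an illustration of ours, with the bit list given sign bit first:

```python
def twos_complement_value(bits):
    """Value of X = x0 . x-1 x-2 ... x-fx according to (1.22):
    X = -x0 + sum_{i=1}^{fx} x_{-i} * 2^{-i}."""
    x0, frac = bits[0], bits[1:]
    return -x0 + sum(b * 2.0 ** -(i + 1) for i, b in enumerate(frac))

# With f_x = 3 fractional bits, Q = 2^-3 = 0.125:
largest = twos_complement_value([0, 1, 1, 1])    # 0.875, i.e. 1 - Q
smallest = twos_complement_value([1, 0, 0, 0])   # -1.0
half_neg = twos_complement_value([1, 1, 0, 0])   # -0.5
```

The three samples hit both ends of the range −1 ≤ X ≤ 1 − Q stated above.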
Round-off will introduce a smaller error than both magnitude truncation and value truncation [27]. The round-off function is, however, more complex to implement in hardware. Regardless of quantization method, they can all give rise to unwanted oscillations, so-called quantization limit cycles [27], due to the fact that they are nonlinear functions. A stable linear system may, therefore, become unstable when these nonlinearities are introduced. Quantization limit cycles are usually small in amplitude, a couple of least significant bits. The effect of the limit cycles can be reduced by increasing the wordlength of the signal. Magnitude truncation has been shown
to have the best properties for suppressing quantization limit cycles among the methods presented here [27]. Different filter structures will have different quantization behaviour, and this must be studied for each structure. Quantization limit cycles in LDI/LDD allpass filters are studied in Chapter 2 and Publication VII.

Figure 1.9: Quantization. (a) Magnitude truncation. (b) Round-off. (c) Value truncation.

1.4.3 Overflow

Overflow occurs when the sum of two numbers exceeds the range of the number representation. When considering two's complement numbers, overflow is easily detected: if the sum of two positive numbers becomes negative, or vice versa, an overflow has occurred. This overflow characteristic, also known as wrapping, is shown in Fig. 1.10(a).

Figure 1.10: Overflow characteristics. (a) Wrapping (two's complement). (b) Saturation.
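The two overflow characteristics of Fig. 1.10 can be modeled on the signal values directly; the sketch below (our own helper functions) shows why wrapping is so harmful, as it flips the sign of an overflowed sum:

```python
def wrap(x):
    """Two's complement wrapping into [-1, 1), as in Fig. 1.10(a)."""
    return ((x + 1.0) % 2.0) - 1.0

def saturate(x, fx):
    """Saturation into [-1, 1 - 2^-fx], as in Fig. 1.10(b)."""
    return max(-1.0, min(1.0 - 2.0 ** -fx, x))

# Adding 0.75 + 0.5 = 1.25 overflows the number range (f_x = 3):
wrapped = wrap(0.75 + 0.5)         # -0.75: a large, sign-flipping error
clamped = saturate(0.75 + 0.5, 3)  #  0.875: clamped to the largest value
```

The saturated result stays close to the true sum, which is why saturation arithmetics help suppress overflow limit cycles.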
To prevent wrapping when overflow occurs, a saturation nonlinearity can be used. Saturation sets the overflowed signal to the largest or smallest value of the number representation, depending on the sign of the overflowed signal. A typical saturation characteristic is shown in Fig. 1.10(b).

Overflow is undesirable when designing digital filters, since it can cause overflow limit cycles. Overflow limit cycles are large in amplitude and are sustained even when the input signal is zero. To suppress overflow limit cycles, saturation arithmetics can usually be used [27]. More on limit cycles in Chapter 2 and in Publication VII.

The probability of overflow can be reduced by using scaling. Critical nodes in the filter structure, where the risk of overflow is high, can be modified by inserting scaling multipliers. These multipliers will lower the gain of the critical node and, thus, prevent overflow [52].

1.4.4 Digit-Serial Arithmetics

Bit-serial and bit-parallel arithmetics are common implementation styles (also known as data-flow architectures) in digital processing systems. The logical values of a signal are referred to as a word. In a bit-serial implementation each word is processed one bit at a time on a single wire, usually with the least significant bit (LSB) first. Computation is generally carried out on each bit as it is received by the processing elements. This implementation style usually leads to small processing elements and small wiring overhead, since only one wire is used to transmit the signal. In a bit-parallel implementation, the bits in each word are transmitted and processed simultaneously. Clearly, this requires f_x + 1 wires and more complex processing elements, where f_x is the number of fractional bits in each word. A system using parallel arithmetics can process a word in one clock cycle, whereas a corresponding bit-serial system will require at least f_x + 1 clock cycles.
The arithmetic critical path (see Section 1.4.7) is, however, usually much shorter in the bit-serial case. In Fig. 1.11, a bit-parallel and a bit-serial data-flow architecture are shown.

Figure 1.11: Data-flow architectures. (a) Bit-parallel. (b) Bit-serial.

In digit-serial computation, the bits in each word are divided into groups of D bits, called digits, and each digit is processed one at a time. Thus, the minimal number of
clock cycles to compute a whole word is (f_x + 1)/D. Digit-serial computation can therefore be considered as a compromise between bit-serial (D = 1) and bit-parallel (D = f_x + 1) computation. In order to keep the digits aligned and simplify timing issues, it is required that (f_x + 1) is an integer multiple of D, which is assumed throughout this thesis. Furthermore, least significant digit first (LSD) computation is assumed unless otherwise noted. In Fig. 1.12, a digit-serial data-flow architecture is shown.

Figure 1.12: Digit-serial data-flow architecture.

The interest in digit-serial processing techniques originates from the late 1970s. Hartley and Corbett studied automatic chip layout tools, also known as silicon compilers, using digit-serial processing elements [20], [21]. Their theories on designing digit-serial layout cells were summarized in [22], where it was concluded that their approach renders faster development time compared to full custom layout. Irwin and Owens also studied this topic [30]. In [44], a systematic approach to generate digit-serial architectures from bit-serial architectures using unfolding was described. Over the years digit-serial computation has been applied to several research areas, including ADSL systems, where the FFT processor architecture can be implemented using digit-serial arithmetics [47], and MPEG video decoding, where the inverse discrete cosine transform can be implemented using digit-serial arithmetics [29]. Digit-serial arithmetics have also been considered when implementing digital filters [1], [32], [38].

The advantage of digit-serial computation is that it allows a trade-off between area, power and throughput. Recently, digit-serial architectures which allow a high degree of pipelining have been presented [8]. These systems achieve a throughput comparable to parallel systems, but with smaller chip area.
It has also been shown in [8] that digit-serial systems are attractive in low-power applications, such as battery-powered systems.

1.4.5 Computation Graph

The precedence graph, described in Section 1.3.1, may be extended to a computation graph which also comprises timing information. This is possible at the arithmetic level, when detailed information about the processing elements is known. The throughput of the algorithm and the timing of control signals can then be obtained using the computation graph.
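How a computation graph yields timing can be illustrated with a minimal sketch (the graph, node names, and helper function are hypothetical; the multiplier and adder latencies of three and one clock cycles are the values assumed later in the text). Each node's earliest start time is the longest path through its predecessors:

```python
# Sketch: as-soon-as-possible (ASAP) start times on a small computation
# graph, with multiplier latency 3 and adder latency 1 clock cycles.

LATENCY = {"mult": 3, "add": 1}

def asap_schedule(graph: dict[str, tuple[str, list[str]]]) -> dict[str, int]:
    """graph maps node -> (kind, predecessors); returns start times
    in clock cycles (multiples of T_clk)."""
    start: dict[str, int] = {}
    def visit(node: str) -> int:
        if node not in start:
            kind, preds = graph[node]
            # A node may start once every predecessor has finished.
            start[node] = max((visit(p) + LATENCY[graph[p][0]] for p in preds),
                              default=0)
        return start[node]
    for n in graph:
        visit(n)
    return start

# Hypothetical fragment: two coefficient multiplications feeding an addition.
g = {"m1": ("mult", []), "m2": ("mult", []), "a1": ("add", ["m1", "m2"])}
print(asap_schedule(g))  # → {'m1': 0, 'm2': 0, 'a1': 3}
```

The schedule length of the full graph, times T_clk, gives the sample period and hence the throughput, as in Fig. 1.14.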
1.4.6 Latency and Throughput at the Arithmetic Level

Latency and throughput are two common terms when considering arithmetic operations. The former is the time it takes to produce an output value for a given input value. In digit-serial arithmetics, latency is the time it takes for a digit with a certain significance level to propagate. When studying digital algorithms, a distinction between algorithmic latency (see Section 1.3.2) and arithmetic latency is usually made. However, in this thesis algorithmic transformations will not be considered. Thus, throughout this thesis latency implies the arithmetic latency (L), unless otherwise noted. Throughput is a measure of the sample rate (samples/second). These two concepts are illustrated in Fig. 1.13, where the sample period (T_s = 1/throughput) is also shown.

Figure 1.13: Arithmetic latency for bit-parallel, bit-serial, and digit-serial systems.

The throughput of a digital filter is best visualized using a computation graph. Let us consider the DF filter shown in Fig. 1.8. In this example we assume that multipliers and adders have a latency of three and one (normal values for digit-serial arithmetics), respectively. The corresponding computation graph is shown in Fig. 1.14, where T_clk is the period time of the clock.

1.4.7 Pipelining

Pipelining considerations can be made at both the algorithmic level and the arithmetic level. It is important to distinguish between the two. At the algorithmic level, pipelining is used to exploit the inherent parallelism of the algorithm and, hence, increase the throughput of the algorithm, as described in Section 1.3.3. At the arithmetic level, the longest path, register-output to register-input, in the system determines the clock period. This path is referred to as the arithmetic critical path, T_cp.
In general, the clock period can be expressed as

T_clk ≥ T_cp [ns], (1.23)
Figure 1.14: Computation graph for a second-order DF filter.

where T_clk is the period time of the clock. In this thesis, however, we assume in the theoretical studies that T_clk = T_cp. Pipelining at the arithmetic level corresponds to inserting a number of registers into the structure, hence shortening the arithmetic critical path. Since recursive loops cannot be pipelined, the arithmetic critical path, i.e., the longest arithmetic loop, will limit the clock period. In the case where the architecture contains no recursive loops, no theoretical lower bound exists on the clock period. Pipelining at the arithmetic level is not limited to inserting registers between the processing elements. Large processing elements may benefit from introducing pipelining in the structure. We refer to this as internal pipelining. The degree, or level, of pipelining corresponds to the number of register stages inserted into the architecture. In a non-recursive system, pipelining will increase the throughput. We illustrate this by an example. Let us study the system in Fig. 1.15(a). The sample period of this system can be written as

T_s = N_d · T_cp [ns], (1.24)

where N_d is the number of digits applied to the system and T_cp is the arithmetic critical path. Inserting a pipelining level will result in Fig. 1.15(b). The sample period for this system becomes

T_s = (N_d + 1) · max{T_cp1, T_cp2}. (1.25)

Assuming that

T_cp1 = T_cp2 = T_cp/2, (1.26)
Figure 1.15: Logic system with a) no pipelining and b) one level of pipelining.

the sample period of (1.25) can be expressed as

T_s = (N_d + 1) · T_cp/2. (1.27)

It is easy to show that

(N_d + 1) · T_cp/2 ≤ N_d · T_cp, (1.28)

for N_d ≥ 1. For an arbitrary degree of pipelining, (1.28) can be extended to

(N_d + P − 1) · T_cp/P ≤ N_d · T_cp, (1.29)

where P is the number of pipelining levels. Condition (1.29) also holds for N_d ≥ 1. Clearly, increasing the level of pipelining will result in a lower sample period. This analysis is only true in theory, since the delays of the registers are not considered in the analysis above. When the arithmetic critical path is dominated by the delay of the registers, further pipelining will not lead to a decreased sample period.

1.4.8 Power Consumption

As discussed earlier, power consumption is an important implementation aspect in many modern digital systems. The main source of power consumption in CMOS circuits is the dynamic power dissipation caused by switching in the circuit. A
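The bound (1.29) is easy to check numerically. Below is a small sketch (the symbols N_d, T_cp, and P follow the text; the function name and numeric values are illustrative):

```python
# Sketch verifying the pipelining bound (1.29): with P equal pipeline
# stages, the sample period (N_d + P - 1) * T_cp / P never exceeds the
# unpipelined N_d * T_cp for N_d >= 1. Register delays are ignored,
# as in the text's idealized analysis.

def sample_period(n_digits: int, t_cp: float, stages: int = 1) -> float:
    """Sample period of a non-recursive digit-serial system; stages = 1
    means no pipelining, i.e., eq. (1.24)."""
    return (n_digits + stages - 1) * t_cp / stages

t_cp = 10.0  # arbitrary critical path, ns
for n_d in range(1, 9):
    unpipelined = sample_period(n_d, t_cp)                  # eq. (1.24)
    for p in (2, 3, 4):
        assert sample_period(n_d, t_cp, p) <= unpipelined   # eq. (1.29)
print("pipelining never increases the sample period for N_d >= 1")
```

Note the equality case N_d = 1: a single digit gains nothing from pipelining, which is why the bound is stated as non-strict.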
model for the dynamic power consumption in a CMOS inverter (approximately true for general CMOS gates) is given by

P_dyn = α · f_clk · C_L · V_dd², (1.30)

where α is the activity factor, f_clk is the clock frequency, and C_L is the capacitive load being switched. The activity factor is the average number of transitions on the output of the gate during one clock cycle. Typically, α ≤ 1, but due to glitches some systems may experience an activity factor larger than one [7]. It should be quite clear from (1.30) that reducing the supply voltage will have a large impact on the power consumption. Voltage scaling is a popular method to reduce the power consumption of CMOS circuits [37]. Reducing the supply voltage will, however, also affect the speed of the circuit. A first-order approximation of the delay in a CMOS gate was given in [7] and can be expressed as

T_d = C_L · V_dd / [(µC_ox/2) · (W/L) · (V_dd − V_T)^m], (1.31)

where 1 ≤ m ≤ 2 for short-channel devices. The reduction in speed caused by voltage scaling may be compensated by scaling of the threshold voltages. A technology-independent solution is to increase the parallelism in the system to compensate for the speed degradation. As technology improvements lead to smaller supply voltages and smaller threshold voltages, voltage scaling becomes increasingly difficult due to increased leakage currents and noise. In this thesis, voltage scaling will not be considered. The standard cell design approach used in this thesis is not well suited for voltage scaling for several reasons. Threshold scaling is not possible when using standard cells. Furthermore, the behavior of the standard cells is only guaranteed within certain voltage ranges. In the case of the UMC 0.18 µm process used in this thesis, the functionality of the cells is only guaranteed down to V_dd = 1.62 V (under normal operating conditions, V_dd = 1.8 V).
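The quadratic dependence of (1.30) on the supply voltage can be made concrete with a short sketch (all parameter values below are illustrative, not measurements from the thesis):

```python
# Sketch of the dynamic power model (1.30): P_dyn = alpha * f_clk * C_L * Vdd^2.

def dynamic_power(alpha: float, f_clk: float, c_load: float, vdd: float) -> float:
    """Dynamic power dissipation in watts."""
    return alpha * f_clk * c_load * vdd ** 2

# Illustrative gate: alpha = 0.5, 100 MHz clock, 1 pF switched capacitance.
p_nominal = dynamic_power(alpha=0.5, f_clk=100e6, c_load=1e-12, vdd=1.8)
p_scaled = dynamic_power(alpha=0.5, f_clk=100e6, c_load=1e-12, vdd=1.62)

# Quadratic dependence: scaling Vdd from 1.8 V down to 1.62 V (the lowest
# guaranteed supply of the standard cells) cuts dynamic power by 19 %.
print(p_nominal, p_scaled, 1 - p_scaled / p_nominal)
```

Since (1.62/1.8)² = 0.81, even this modest voltage reduction saves 19 % of the dynamic power, which illustrates why voltage scaling is attractive despite its speed penalty (1.31).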
1.4.9 Implementation Considerations

General-purpose processors are not efficient when implementing signal processing algorithms, whereas a more specialized architecture like a programmable digital signal processor will result in better performance. A dedicated hardware solution (i.e., an ASIC) is, however, the most effective implementation method in terms of area and power. The sequential execution of software processors does not take advantage of the inherent parallelism in the algorithms, and as a result, lowering the power consumption by increasing parallelism is not possible [37]. The power consumption of a dedicated hardware implementation can be two to three orders of magnitude lower than that of the corresponding programmable solution [10].
1.4.10 Design Flow

A design flow describes the process of a chip design from concept to production. Several design flows exist, and they differ mainly in the level of detail. For more on design flow topologies, see [53]. This section briefly describes the steps in the filter design flow shown in Fig. 1.16. The design flow begins with a filter specification. This specification is the solution to a filter problem. It contains information about the amplitude function, cut-off frequency, allowed attenuation, etc. In addition, it may also include implementation constraints, such as minimal throughput and maximum power consumption.

Figure 1.16: Digital filter design flow (filter specification, algorithmic level, arithmetic level, logic realization, layout, filter implementation).

At the algorithmic level, the appropriate digital filter algorithm is chosen and scheduled. This step also includes resource allocation and mapping. Next, at the arithmetic level, arithmetic issues such as number representation and data-flow architectures are determined. Furthermore, it must also be decided how to realize the processing elements. An adder may, for example, be implemented in several ways depending on performance constraints. In the layout step, chip planning issues like
floorplanning and routing are considered. When all steps have been carried out, the filter can be implemented in an integrated circuit like an ASIC or an FPGA. The flow shown in Fig. 1.16 is a simplification; it does not contain all the details comprised in a hardware implementation. Several iterations and verifications must usually be performed at each step of the flow. These are left out in order to simplify the design flow graph.

1.4.11 Design Tools

All implementations in this thesis were performed using the same methodology. The chosen technology was the UMC 0.18 µm standard cell technology unless otherwise noted [50]. The main reason for using standard cells instead of a full-custom design approach is that the implementation time becomes much shorter in the former case. The chosen standard cell technology allows up to six metal layers, and the recommended supply voltage is 1.8 V. The implementation design flow used is shown in Fig. 1.17.

Figure 1.17: Implementation design flow (VHDL description, verification, synthesis, delay annotation, layout, simulation, power analysis, and timing analysis).

The implementation was described using VHDL, and the correctness of the VHDL code was verified using Mentor Graphics ModelSim [40]. Synthesis was performed using Synopsys Design Analyzer [11], and the circuit layout was generated using Cadence Silicon Ensemble [12]. The layout was generated using four metal layers. Clock tree generation was performed in order to minimize clock skew. Since the performance of the filter logic was studied, no pads were included in the
layout. Delay information due to wiring was back-annotated to Design Analyzer, where a static timing analysis was performed. The post-layout netlist with wire RC delay was then simulated using Spice descriptions of the standard cells in Synopsys NanoSim [42]. From NanoSim, information about current consumption and timing was studied.

1.5 Scientific Contributions

This thesis is based on the following publications. (Note: The page layout of some papers may have been changed to improve readability. The contents of the papers are, however, unchanged.)

Publication I: Digit-serial implementation of LDI/LDD allpass filters, K. Landernäs, J. Holmberg, L. Harnefors, and M. Vesterbacka, in Proc. IEEE Int. Symp. on Circuits and Systems, ISCAS 2002, Vol. 2, pp. II-684–II-687, Phoenix, USA, May 2002.

In this work, a second-order LDI allpass filter was implemented using digit-serial arithmetics. The performance was compared to a corresponding WD filter. Some theories on maximally fast implementation were also given.

Publication II: A high-speed low-latency digit-serial adder, K. Landernäs, J. Holmberg, and M. Vesterbacka, in Proc. IEEE Int. Symp. on Circuits and Systems, ISCAS 2004, Vol. 3, pp. 23–26, Vancouver, Canada, May 2004.

In this paper, a new digit-serial adder architecture was presented. The adder was based on both CLAA and conditional sum techniques. It was shown that the proposed adder architecture is well suited for implementation in high-speed recursive filters.

Publication III: Implementation of bit-level pipelined digit-serial multipliers, K. Landernäs, J. Holmberg, and O. Gustafsson, in Proc. IEEE Nordic Signal Processing Symp., NORSIG 2004, pp. 125–128, Espoo, Finland, June 2004.

In this work, two digit-serial multipliers that can be pipelined to the bit level were implemented and compared to each other. It was shown that a multiplier based on shift-accumulation had a higher throughput as well as lower current consumption.

Publication IV: Implementation of high-speed digit-serial LDI allpass filters, K.
Landernäs and J. Holmberg, in Proc. European Conf. on Circuit Theory and Design, ECCTD 2005.
In this paper, a sixth-order LDI allpass filter was implemented. Two cases were studied. In the first case, unfolding was used to realize the processing elements. Arbitrary pipelining was used in the second case. It was shown that arbitrary pipelining results in a higher throughput. This is, however, at the expense of a much higher current consumption.

Publication V: Glitch reduction in digit-serial recursive filters using retiming, K. Landernäs, J. Holmberg, and M. Vesterbacka, accepted at IEEE Int. Conf. on Electronics, Circuits and Systems, ICECS 2006.

Retiming was studied as a method to decrease the power consumption in recursive digital filters. It was shown that retiming can reduce the power consumption by about 20% for small digit sizes without affecting the throughput of the filter. It was also shown that introducing a large number of registers in the filter structure will increase the current consumption. This trade-off, between reducing the amount of glitches and increasing the number of registers, was also considered in this work.

Publication VI: Implementation of digit-serial LDI allpass filters using cyclic scheduling, K. Landernäs and J. Holmberg, submitted to IEEE Int. Symp. on Circuits and Systems, ISCAS 2006.

In this work, scheduling considerations for a digit-serial second-order LDI allpass filter were presented. A method known as cyclic scheduling was studied. The filters were implemented in a 0.18 µm process, and it was shown that the second-order LDI allpass filter can be realized with 40% less area using cyclic scheduling compared to single-interval scheduling. Furthermore, it was also shown that for small digit sizes, cyclic scheduling results in a 20% better power-throughput characteristic.

Publication VII: LDI/LDD Lattice Filters, J. Holmberg, L. Harnefors, K. Landernäs, and S. Signell, submitted to IEEE Trans. on Circuits and Systems.

A new modified lossless discrete integrator/differentiator (LDI/LDD) structure was presented.
The filter structure was analyzed concerning parasitic oscillations, coefficient quantization, quantization noise, and implementation properties. The contribution in this publication is the development of the resulting LDI/LDD structure, which has a minimal sample period of 3T_add + T_mult.

Other Publications

Computational properties of LDI/LDD lattice filters, J. Holmberg, L. Harnefors, K. Landernäs, and S. Signell, in Proc. IEEE Int. Symp. on Circuits and
Systems, Vol. 2, pp. 685–688, Sydney, Australia, May 2001.

Implementation aspects of second-order LDI/LDD allpass filters, J. Holmberg, K. Landernäs, and M. Vesterbacka, in Proc. European Conf. on Circuit Theory and Design, Vol. 1, pp. 237–240, Espoo, Finland, August 2001.

Adaptive second-order LDI filter, J. Holmberg, L. Harnefors, K. Landernäs, and S. Signell, in Proc. Nordic Signal Processing Symp., NORSIG 2002, Hurtigruten, Norway, October 2002.
Chapter 2

Stability Results for the LDI Allpass Filter

It is a well-known fact that recursive digital filters can sustain parasitic oscillations, so-called limit cycles, due to finite wordlength [9]. The increase in wordlength after arithmetic operations requires signal quantization. Quantization may be carried out in several ways, as was shown in Section 1.4.2. These are all nonlinear operations which may give rise to limit cycles. Naturally, limit cycles are unwanted effects, since they alter the expected behavior of the filter. It is, therefore, important to derive conditions under which the filter suppresses these phenomena. Limit cycles can also arise due to signal overflow in the filter [27]. Overflow limit cycles are more serious than quantization limit cycles, since the amplitude of the former is much greater. In contrast to quantization limit cycles, which affect the SNR of the filter, overflow limit cycles will ruin the filter performance. The necessity of studying conditions under which overflow limit cycles do not arise should be apparent. In Fig. 2.1, the typical behavior of quantization and overflow limit cycles is shown. Due to theoretical difficulties, a stability analysis must often include some more or less unrealistic assumptions. One common assumption is to limit the input to a constant or even zero value, with the state x(n) ≠ 0. The latter is known as a zero-input, autonomous, or unforced system. In this chapter we will limit the analysis to zero-input conditions, which is usually sufficient [27]. The aim of this chapter is to extend the stability region for the second-order LDI allpass filter structure presented in [27]. We will adopt a method presented in [9] and apply the theories to the mentioned filter.
Figure 2.1: Limit cycle behavior. a) Typical quantization limit cycle. b) Typical overflow limit cycle.

2.1 Previously Published Results

In this section, we present previously published stability results for the second-order LDI allpass filter. Over the years, several publications have considered this topic [19], [24]. In [27], a stability region for the second-order LDI/LDD allpass filter, which is depicted in Fig. 2.2, was presented. The stability analysis was performed using Lyapunov theory [27]. Using this approach, a Lyapunov function is introduced. It can be used to describe the energy of the linear system, and it must meet the following criteria:

V(x(n)) > 0, x(n) ≠ 0, V(0) = 0 (2.1)

ΔV(x(n)) = V(x(n+1)) − V(x(n)) ≤ 0. (2.2)

This implies that if the energy of the system is bounded, the system is stable. Furthermore, if the energy is decreasing, the system is asymptotically stable. A Lyapunov function can be expressed in the quadratic form

V(x(n)) = x^T(n) P x(n), (2.3)

where P is a symmetric N × N matrix. Criterion (2.1) is met only if P is positive