A Flexible Design for Optimization of Hardware Architecture in Distributed Arithmetic based FIR Filters
Fazel Sharifi, Saba Amanollahi, Mohammad Amin Taherkhani and Omid Hashemipour
Abstract: FIR filters are used in many performance- and power-critical applications such as mobile communication devices, analog-to-digital converters and digital signal processing systems. Designing an appropriate FIR filter usually requires increasing the order of the filter. Synthesis and tape-out of high-order FIR filters with reasonable delay, area and power have therefore become an important challenge for hardware designers, and in many cases the complexity of a high-order filter prevents the constraints of the overall design from being satisfied. In this paper an efficient hardware architecture is proposed for distributed arithmetic (DA) based FIR filters. The architecture is based on an optimized combination of look-up tables (LUTs) and compressors. The optimized system-level solution is obtained from a set of dynamic-programming optimization algorithms. Experiments show that the proposed design reduces delay by 16% to 62.5% in comparison with previous optimized structures for DA-based architectures.
I. Introduction
Nowadays, FIR filters have many important and widespread applications owing to their superior properties, such as stability and high reliability in digital signal processing. Digital filters of this kind are applied extensively in areas such as image processing, radio communication and high-technology devices. One important application of FIR filters is in analog-to-digital converters [1][2]. FIR filters are also applied in the read channel of disc drives, known as PRML [3], and in wideband receivers of wireless communication devices [4].
Manuscript received July 18, 2012.
The authors are with the Faculty of Electrical and Computer Engineering, Shahid Beheshti University, G.C., Tehran, Iran ({f_sharifi, s_amanollahi, m_taherkhani, hashemipour}@sbu.ac.ir).
Generally, in an N-order FIR filter with coefficients $\mathrm{coef}[i]$, $\forall i \in [0, N-1]$, the output $y[t]$ is calculated from the current input $x[t]$ and the previous inputs $x[t-i]$, $\forall i \in [1, N-1]$, by equation (1):
$$y[t] = \sum_{i=0}^{N-1} \mathrm{coef}[i]\,x[t-i] = \mathrm{coef}[0]\,x[t] + \mathrm{coef}[1]\,x[t-1] + \dots + \mathrm{coef}[N-1]\,x[t-N+1] \tag{1}$$
As equation (1) shows, computing each output y[t] of an N-order filter requires N multiplications and N - 1 additions.
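Equation (1) can be illustrated with a minimal Python sketch of the direct (multiply-and-add) evaluation. The function name `fir_direct`, the coefficient values and the input samples are hypothetical, and samples before the start of the record are assumed to be zero.

```python
from typing import Sequence

def fir_direct(coef: Sequence[float], x: Sequence[float], t: int) -> float:
    """Evaluate y[t] = sum_{i=0}^{N-1} coef[i] * x[t-i] directly."""
    n = len(coef)
    # Assumption: samples before the start of the record are zero.
    return sum(coef[i] * x[t - i] for i in range(n) if t - i >= 0)

coef = [1, 2, 3]      # hypothetical 3-tap filter
x = [4, 5, 6, 7]      # hypothetical input samples
y = [fir_direct(coef, x, t) for t in range(len(x))]
```

Each output costs one multiplication per tap, which is the cost that DA-based architectures try to avoid.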
Although FIR filters are more stable than IIR filters, their circuits are very complicated because real applications require high-order FIR filters. As the order of the filter increases, the circuit becomes more complicated, and this is one of the most important challenges in designing these types of filters. This complexity sometimes prevents the desired design from meeting its area, timing and power constraints.
In this paper a compound architecture for FIR filters is proposed. The modules of the proposed architecture are optimized by efficient algorithms, from which the final architecture is extracted.
The rest of this paper is organized as follows. In Section II, FIR filter architectures and related works are reviewed. In Section III the proposed architecture is introduced. Optimization algorithms for the construction of an efficient hardware architecture are introduced in Section IV. Experimental results are shown in Section V, and finally Section VI concludes the paper and considers future works.
II. A Review of Related Architectures
For the implementation of FIR filters, hardware architectures based on multiply-and-add (MAC) and on distributed arithmetic (DA) are known as the two main classes.
In MAC-based architectures, the desired output is computed directly by multiplication and addition. To improve the performance of MAC-based architectures, some researchers have focused on the design of filters based on the residue number system (RNS) [5]-[7]. In these designs, the processing overhead and hardware cost of binary-to-RNS conversion should be considered.
R&I, 2012, №4
The complexity of classic multiplication and addition has led many researchers to change the filter architecture to one based on DA, which exploits the idea of pre-computing and storing the required values so that no multipliers are needed.
The classic distributed arithmetic method rewrites the computation in equation (1) in a different form. It assumes that the samples $x[k]$, $\forall k \in [t-N+1, t]$, are represented in standard binary (2's complement) format and are scaled so that $|x[k]| < 1$. Each sample can then be written as:
$$x[k] = -x_{B-1}[k] + \sum_{j=0}^{B-2} x_j[k]\,2^{-j} \tag{2}$$
where $x_j[k]$ is bit j of $x[k]$ and B is the length of the inputs.
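Substituting equation (2) into equation (1) results in:
$$y[t] = \sum_{i=0}^{N-1} \mathrm{coef}[i]\,\bigl(-x_{B-1}[t-i]\bigr) + \sum_{j=0}^{B-2} 2^{-j} \left( \sum_{i=0}^{N-1} \mathrm{coef}[i]\,x_j[t-i] \right) \tag{3}$$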
A look-up table with an N-bit input address can be used to hold the values of $\sum_{i=0}^{N-1} \mathrm{coef}[i]\,x_j[t-i]$. In this case the size of the LUT (without any optimization) would be $2^N \times (C + \lceil \log_2 N \rceil)$ bits, where N is the order of the filter and C is the length of the coefficients in binary form. As equation (3) shows, every combination of sums of coefficients can be calculated in advance and stored in the look-up table.
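The LUT-based evaluation can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the unscaled integer form of 2's complement (the sign bit carries weight -2^(B-1) and bit j carries weight 2^j) instead of the scaled fractional form, and the names `build_lut` and `da_output` as well as all numeric values are hypothetical.

```python
def build_lut(coef):
    """Entry `addr` holds the sum of coef[i] for every set bit i of addr."""
    n = len(coef)
    return [sum(coef[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]

def da_output(lut, window, bits):
    """window[i] = x[t-i], each a signed integer that fits in `bits` bits."""
    mask = (1 << bits) - 1
    acc = 0
    for j in range(bits):
        # Address = bit j of every sample in the window (one bit per tap).
        addr = 0
        for i, sample in enumerate(window):
            addr |= (((sample & mask) >> j) & 1) << i
        term = lut[addr] << j
        # The sign-bit slice (j = bits - 1) is subtracted (2's complement).
        acc += -term if j == bits - 1 else term
    return acc

coef = [3, -1, 4, 2]        # hypothetical coefficients
window = [-3, 2, 1, -4]     # x[t], x[t-1], x[t-2], x[t-3] as 3-bit signed values
lut = build_lut(coef)
y_da = da_output(lut, window, bits=3)
y_ref = sum(c * s for c, s in zip(coef, window))   # direct form for comparison
```

No multiplication appears in `da_output`; the price is a table with 2^N entries, here 16 entries for N = 4.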
The main problem of look-up tables is that the table size grows exponentially with the filter order. A method has been proposed in [9] to reduce (and ultimately eliminate) the required look-up table. However, the proposed LUT-less architecture does not consider the delay of adders and multiplexers (MUX), and therefore the solution is not efficient for performance-critical applications. In [10] another DA-based FIR filter has been presented that is suitable for FPGA platforms with 4-input look-up tables.
III. Proposed architecture
As shown in the previous section, distributed arithmetic based architectures face exponential growth in the size of their look-up tables. In this section an efficient architecture is presented to increase the performance of the computation part of digital filters. As illustrated in Figure 1, the proposed architecture is a compound model with two main layers: look-up tables and compressors. The first layer of the shift register supplies N bits, and $k = \sum_{i=1}^{m} k_i$ of these N bits are assigned to m look-up tables with $k_1, \dots, k_m$ inputs. In the next section, an optimization algorithm is introduced for finding an efficient structure for the set of look-up tables with a k-bit input address.
The remaining N - k bits are used as selectors of 2-to-1 multiplexers. One multiplexer is needed for every coefficient bit, so C × (N - k) multiplexers are required in total. If the selector of the i-th multiplexer is 1, the related coefficient is passed to the compressor part.
The m outputs of the look-up tables and the N - k outputs of the multiplexers form the inputs of an (N - k + m):2 compressor. The functionality of the compressor is shown in Figure 2. Extraction of an efficient (N - k + m):2 compressor is described in the next section. After compression, a carry look-ahead (CLA) adder is used for the final summation.
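The data flow of one bit slice through the two layers can be sketched functionally in Python. This is a behavioural illustration, not the hardware description used in this work: the function name `slice_operands` and the example values are hypothetical, the first k taps are assumed to be the ones assigned to LUTs, and a multiplexer whose selector is 0 is assumed to pass zero.

```python
def slice_operands(coef, bits_in, parts):
    """Return the operands fed to the compressor for one bit slice.

    coef    : N coefficients
    bits_in : N address bits (one bit from each of the N most recent samples)
    parts   : LUT input widths k_1..k_m; the first sum(parts) taps use LUTs
    """
    operands, start = [], 0
    for width in parts:
        # Precomputed table for this group of taps (2**width entries).
        lut = [sum(coef[start + b] for b in range(width) if (a >> b) & 1)
               for a in range(1 << width)]
        addr = sum(bits_in[t] << (t - start) for t in range(start, start + width))
        operands.append(lut[addr])
        start += width
    # Remaining N-k taps: a 2-to-1 multiplexer passes coef[i] or 0 (assumption).
    operands += [coef[i] if bits_in[i] else 0 for i in range(start, len(coef))]
    return operands

coef = [5, -2, 7, 1, 3, -4]      # hypothetical coefficients, N = 6
bits_in = [1, 0, 1, 1, 0, 1]     # hypothetical address bits
ops = slice_operands(coef, bits_in, parts=[2, 2])   # k = 4, m = 2
slice_sum = sum(ops)             # stands in for the compressor and the CLA adder
```

With these values the compressor receives m + (N - k) = 4 operands, and their sum equals the sum of the coefficients selected by the address bits.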
Fig. 1. Proposed Architecture for Distributed Arithmetic Unit
Note that the look-up table layer and the compressor layer can each be used without the other. In other words, the optimized architecture can work without compressors (Partitioned-LUT) or without look-up tables (LUT-less). In the next section, three algorithms are proposed to identify the optimum architecture.
Fig. 2. Functionality of the designed compressor Layer
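The role of the compressor layer, reducing many operands to two before a single carry-propagating addition, can be illustrated with a generic carry-save 3:2 step in Python. This is a textbook carry-save sketch and not the optimized compressor cells of [11]-[13]; the function names and operand values are hypothetical.

```python
def csa_3to2(a, b, c):
    """Carry-save 3:2 step: three operands in, two operands out, same sum."""
    s = a ^ b ^ c                                # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted left
    return s, carry

def compress_to_two(operands):
    """Reduce a list of non-negative integers to two with repeated 3:2 steps."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(csa_3to2(a, b, c))
    return ops

ops2 = compress_to_two([13, 7, 22, 5, 9])
total = sum(ops2)   # in hardware this last addition is the job of the CLA adder
```

No carry ripples inside `csa_3to2`, which is why a compressor tree is faster than a chain of ordinary adders; only the final two-operand addition propagates carries.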
IV. Architecture extraction algorithms
As noted in the previous section, the proposed approach makes it possible to choose a suitable hardware architecture from the compound structure sets in a flexible way. To find an efficient architecture, the following parameters should be determined precisely:
• Optimized structure of the LUT set based on the input parameter (k): the values of $k_i$ and m are chosen so that the LUT layer with a k-bit input address is optimized. The proposed algorithm is presented in section 4.1.
• Optimized compressor structure based on the number of input bits (h): the optimized compressor structure is built from small optimized compressors. The proposed algorithm is presented in section 4.2.
• Final optimized architecture: according to the parameter values for LUTs and compressors, the number of input address bits for the LUT layer and the number of selector bits for the compressor layer are chosen so that the final architecture is optimized. The final optimization algorithm is proposed in section 4.3.
Optimization can be carried out with respect to the following parameters:
• Gate latency: in all optimization processes, the delay of an XOR gate is taken as the unit of delay. This parameter is denoted by ΔG.
• Power consumption: optimization for power consumption is based on the minimum number of resources and is expressed in units of the power consumption of an XOR gate.
• Power-delay product (PDP): the proposed algorithms can be extended to optimize the PDP based on the previous two parameters.
4.1. Optimization of LUT Layer
In this section, a technique is proposed for partitioning the LUTs. The LUT layer with k input address bits and m outputs is partitioned into m basic LUTs based on gate-level computation of the delay, power and power-delay product (PDP) parameters. Decoder-memory LUT models are used to describe the architecture of the LUT layer [14].
A basic LUT with $k_i$ input address bits (holding the sums of $k_i$ coefficients) contains a $k_i$-to-$2^{k_i}$ decoder and a memory of $2^{k_i}$ words of length $C + \lceil \log_2 k_i \rceil$. Suppose the delay of this LUT is denoted $d_{lut}[k_i]$, and its power consumption and power-delay product (PDP) are denoted $p_{lut}[k_i]$ and $pdp_{lut}[k_i]$, respectively. Then, for a LUT layer with k inputs and m outputs, the structure with optimized delay $D(LUT_{k,m})$, optimized power $P(LUT_{k,m})$ or optimized PDP $PD(LUT_{k,m})$ can be obtained from the dynamic programming Algorithm 1 with $O(n^2)$ complexity.
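The effect of partitioning on storage can be seen with a short Python calculation. It assumes the word length of a basic LUT with k inputs is C + ceil(log2 k) bits, as in the basic-LUT description above; the helper name `lut_bits` and the chosen split are hypothetical.

```python
from math import ceil, log2

def lut_bits(k: int, c: int) -> int:
    """Storage of a k-input basic LUT: 2**k words of c + ceil(log2(k)) bits."""
    return (1 << k) * (c + ceil(log2(k)))

C = 16                                        # coefficient length in bits
single = lut_bits(8, C)                       # one 8-input LUT
split = sum(lut_bits(k, C) for k in (4, 4))   # two 4-input LUTs
```

Here the single table needs 4864 bits and the two smaller tables need 576 bits together, but the split version produces two outputs that still have to be added, which is the task of the compressor layer.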
4.2. Optimization of Compressor Layer
In this section a method is proposed for automatic construction of an optimized h:2 compressor based on the delay, power and PDP parameters. Large-input compressors are built from optimized conventional and unconventional compressors [11]-[13].
Therefore, to create an optimized large-input compressor (h:2), a set of F basic compressors $Comp_1, \dots, Comp_F$ with compression levels $(I_{Comp[k]} : O_{Comp[k]})$, $\forall k \in [1, F]$, is used. Unconventional compressors have some carry-in and carry-out bits; however, these carry bits are generated in such a way that there is no carry propagation. Suppose the delay, power and PDP of basic compressor k are $d_{Comp}[k]$, $p_{Comp}[k]$ and $pdp_{Comp}[k]$, respectively.
OptimizeLUT (Address Bits: k, Number of LUTs: m):
  D(LUT_{i,1}) ← d_lut[i]; ∀i ∈ [1..k]
  P(LUT_{i,1}) ← p_lut[i]; ∀i ∈ [1..k]
  PD(LUT_{i,1}) ← pdp_lut[i]; ∀i ∈ [1..k]
  for (i = 2; i <= k; i++)
    for (j = 2; j <= m; j++)
      D(LUT_{i,j}) ← min_u { max{ d_lut[u], D(LUT_{i-u,j-1}) } };
      P(LUT_{i,j}) ← min_w { p_lut[w] + P(LUT_{i-w,j-1}) };
      PD(LUT_{i,j}) ← min{ D(LUT_{i,j})·P_u(LUT_{i,j}), D_w(LUT_{i,j})·P(LUT_{i,j}) };
    end for
  end for
  return D(LUT_{k,m}), P(LUT_{k,m}), PD(LUT_{k,m});
Algorithm 1. Proposed Algorithm for Optimization of LUT layer
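The delay and power recurrences of Algorithm 1 can be sketched in Python as below. The sketch omits the PDP term for brevity, restricts the split size u so that every remaining LUT keeps at least one input (an assumption the listing leaves implicit), and uses a purely hypothetical cost model in which delay grows linearly and power exponentially with the number of LUT inputs.

```python
from math import inf

def optimize_lut(k, m, d_lut, p_lut):
    """Return (min delay, min power) of k address bits split over m basic LUTs.

    d_lut[u], p_lut[u]: delay and power of a basic u-input LUT (index 0 unused).
    """
    # D[i][j], P[i][j]: best cost for i address bits using exactly j LUTs.
    D = [[inf] * (m + 1) for _ in range(k + 1)]
    P = [[inf] * (m + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        D[i][1], P[i][1] = d_lut[i], p_lut[i]
    for j in range(2, m + 1):
        for i in range(j, k + 1):
            # The last LUT takes u inputs; i - u >= j - 1 inputs remain.
            for u in range(1, i - j + 2):
                # LUTs work in parallel, so the slowest one sets the delay.
                D[i][j] = min(D[i][j], max(d_lut[u], D[i - u][j - 1]))
                # Power adds up over all LUTs.
                P[i][j] = min(P[i][j], p_lut[u] + P[i - u][j - 1])
    return D[k][m], P[k][m]

# Hypothetical cost model, not taken from CACTI.
d_lut = [0] + [u for u in range(1, 9)]
p_lut = [0] + [2 ** u for u in range(1, 9)]
delay, power = optimize_lut(8, 2, d_lut, p_lut)
```

Under this toy model the best split of 8 address bits over 2 LUTs is 4 + 4, giving a delay of 4 units and a power of 32 units.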
The problem of finding the minimum delay $D_{h,2}(k)$ (the minimum delay of an h:2 compressor built from the basic compressors $Comp_1, \dots, Comp_k$), the minimum power $P_{h,2}(k)$ or the minimum PDP $PDP_{h,2}(k)$ can be approached through two different configurations. In the first configuration, $Comp_k$ is not used, and the best solution is therefore taken from the previous calculations $D_{h,2}(k-1)$, $P_{h,2}(k-1)$ and $PDP_{h,2}(k-1)$. The other configuration uses $Comp_k$. In this case the compressor is divided into three sub-modules, as shown in Figure 3: the basic compressor $Comp_k$ and two compound compressors, $(h - I_{Comp[k]}):g$ and $(O_{Comp[k]} + g):2$. The optimum solution is the minimum of these two configurations. The proposed dynamic programming approach with polynomial complexity is shown in Algorithm 2.
4.3. Extraction of final solution
In this step, the final architecture of the proposed DA unit is extracted from the optimization results of the LUT layer and the compressor layer. In other words, the values of m and k are determined by Algorithm 3.
Depending on the designer's main criterion, the algorithm can separately present the optimized solution for delay, power or PDP. As shown in Algorithm 3, the cost function can be any of the parameters Delay, Power or PDP returned by the OptimizeLUT and OptimizeComp algorithms.
Fig. 3. A configuration of the h:2 compressor exploiting the basic compressor Comp_k
OptimizeComp (CompLevel: h):
  F ← number of basic compressors
  D_{i,j}(0) ← ∞; P_{i,j}(0) ← ∞; PDP_{i,j}(0) ← ∞; ∀i ∈ [1..h], j < i
  D_{i,j}(k) ← 0; P_{i,j}(k) ← 0; PDP_{i,j}(k) ← 0; ∀k ∈ [1..F], i ∈ {1,2}, j < i
  k ← 1;
  while (k <= F)
    for (i = 3; i <= h; i++)
      for (j = 1; j < i; j++)
        if ((i:j) = (I_Comp[k] : O_Comp[k]))
          D_{i,j}(k) ← d_Comp[k];
          P_{i,j}(k) ← p_Comp[k];
          PDP_{i,j}(k) ← pdp_Comp[k];
        else
          Dmin ← min_u { max{ D_{i-I_Comp[k],u}(k), d_Comp[k] } + D_{O_Comp[k]+u,2}(k) };
          Pmin ← min_w { P_{i-I_Comp[k],w}(k) + P_{O_Comp[k]+w,2}(k) + p_Comp[k] };
          TP ← ComputePower(u); TD ← ComputeDelay(w);
          D_{i,j}(k) ← min{ D_{i,j}(k-1), Dmin };
          P_{i,j}(k) ← min{ P_{i,j}(k-1), Pmin };
          PDP_{i,j}(k) ← min{ D_{i,j}(k-1)·P_{i,j}(k-1), Dmin·TP, Pmin·TD };
        end if
      end for
    end for
    k ← k + 1;
  end while
  return D_{h,2}(F), P_{h,2}(F), PDP_{h,2}(F);
Algorithm 2. Optimization of Compressor Layer
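The decomposition behind Algorithm 2 can be sketched as a memoized recursion in Python. This is a simplified variant written for illustration and not a line-by-line transcription of the listing: it optimizes delay only, considers all basic compressors at once instead of adding them one at a time, lets the last stage target j operands rather than always 2, and treats compression at operand level. The function names and the delay values of the basic compressors are hypothetical.

```python
from functools import lru_cache
from math import inf

def make_min_delay(basic):
    """basic: list of (inputs, outputs, delay) for each basic compressor."""
    @lru_cache(maxsize=None)
    def min_delay(i, j):
        """Smallest delay found for reducing i operands to j operands."""
        if i <= j:
            return 0                  # nothing to compress: wires only
        best = inf
        for n_in, n_out, d in basic:
            if n_in > i:
                continue
            rest = i - n_in           # operands not fed to this basic compressor
            for u in range(min(rest, 1), rest + 1):
                # The basic compressor and a rest:u compressor run side by side,
                # then an (n_out + u):j compressor finishes the reduction.
                first = max(d, min_delay(rest, u))
                best = min(best, first + min_delay(n_out + u, j))
        return best
    return min_delay

only_32 = make_min_delay([(3, 2, 2)])              # hypothetical 3:2 cell, delay 2
mixed = make_min_delay([(3, 2, 2), (4, 2, 3)])     # plus a hypothetical 4:2 cell
```

With only the 3:2 cell, `only_32(9, 2)` evaluates to 8 units, that is, four levels of 3:2 compression; adding the 4:2 cell lets `mixed(8, 2)` reach 6 units with two levels of 4:2 compressors.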
OptimizeArch (Filter Order: N):
  OptSolution ← ∞;
  Select cost from {Delay | Power | PDP}
  for (i = 1; i <= N; i++)
    for (j = 1; j <= i; j++)
      ArchCost = cost(OptimizeLUT(i, j), OptimizeComp(N-i+j));
      if (ArchCost < OptSolution)
        OptSolution ← ArchCost;
    end for
  end for
  return OptSolution;
Algorithm 3. Extraction of optimized architecture
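The search loop of Algorithm 3 can be sketched in Python with the two layer costs passed in as functions. The way the two costs are combined (simple addition, as for two stages in series) and both toy cost models are assumptions made only for this illustration; `optimize_arch`, `toy_lut` and `toy_comp` are hypothetical names.

```python
from math import ceil, inf

def optimize_arch(n, lut_cost, comp_cost):
    """Search (k, m) as in Algorithm 3; returns (best cost, k, m).

    lut_cost(k, m): cost of the LUT layer with k address bits and m LUTs.
    comp_cost(h)  : cost of an h:2 compressor.
    """
    best = (inf, None, None)
    for k in range(1, n + 1):
        for m in range(1, k + 1):
            # m LUT outputs plus n - k multiplexer outputs enter the compressor.
            cost = lut_cost(k, m) + comp_cost(n - k + m)   # assumed: stages in series
            if cost < best[0]:
                best = (cost, k, m)
    return best

# Hypothetical delay models: widest LUT dominates; 2 units per extra operand.
toy_lut = lambda k, m: ceil(k / m)
toy_comp = lambda h: 2 * max(0, h - 2)
best_cost, best_k, best_m = optimize_arch(6, toy_lut, toy_comp)
```

For N = 6 this toy model selects k = 6 and m = 2, a Partitioned-LUT solution; a different cost model can move the optimum toward the LUT-less end of the design space.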
V. Implementation and Experiments
Implementation was performed in two phases. In the first phase, hardware descriptions of all basic components were implemented. The Verilog implementation of the basic optimized compressors includes 3:2, 4:2, 5:2, 6:2, 7:2 and 9:2 compressors [11]-[13]. Other regular compressors such as 7:3 or 15:4 could be constructed from these basic components. The implementation of the basic LUTs was based on the memory model in CACTI 5.1 [14]. In the second phase, the architecture optimization algorithms were implemented in 9 source and header files to produce the optimized structure of the DA unit.
The real coefficients were extracted from the Filter Design and Analysis Tool (FDATool) [15]. The coefficients were produced for the sample filters listed in Table I. As shown in the table, based on the supplied criteria the filter order ranges from 8 to 143. According to our requirements for designing the digital part of ADCs, the sampling frequency (Fs) and the length of the inputs (B) were set to 40 MHz and 3 bits, respectively. In addition, in all designs C (the length of the coefficients) is set to 16 bits.
Table I
Specification of analyzed FIR filters
Filter Order   Fs (MHz)   Fpass/Fstop (MHz)   Apass/Astop (dB)
8              40         1.2/3.8             3/20
18             40         1.2/3.8             3/40
31             40         1.6/3.0             3/40
72             40         2.2/2.8             3/40
108            40         2.2/2.8             3/60
143            40         2.2/2.8             3/80
The estimated delay (in units of gate delay, ΔG) of the distributed arithmetic unit for the LUT, LUT-less [9] and proposed architectures is shown in Figure 4. The LUT-less architecture of [9] is implemented with full compressors instead of regular adders. As shown in the figure, the delay of the proposed architecture is 16% (for the 8-order filter) to 62.5% (for the 143-order filter) lower than that of the LUT-less architecture.
Fig. 4. Estimated Delay of FIR filters for DA-based architectures
VI. Conclusion and Future Works
To address the design constraints of high-order FIR filters, the following contributions were presented in this work:
• A compound hardware architecture was proposed for distributed arithmetic based FIR filters. The architecture exploits the benefits of pre-stored sums in optimized LUTs and improves the speed of addition by using highly efficient compressors.
• A dynamic programming algorithm with polynomial complexity was proposed to find the optimized structure of compressors.
• A dynamic programming algorithm was proposed to find the best solution for LUT partitioning.
• The final optimized architecture can be extracted by the third proposed algorithm.
As future work, we will attempt to extend the tool so that it can automatically generate HDL code for the extracted optimized architecture.
Acknowledgment
The authors would like to acknowledge the members of the Micro Electronic Lab in the Faculty of Electrical and Computer Engineering at Shahid Beheshti University (SBU), especially Professor K. Navi, S. Z. Reyhani and Adel Hosseiny.
References
[1] R. M. R. Koppula, S. Balagopal, and V. Saxena, "Efficient design and synthesis of decimation filters for wideband delta-sigma ADCs," in Proc. IEEE SOCC, pp. 380-385, 2011.
[2] H. V. Sorensen, P. M. Aziz, and J. V. D. Spiegel, "An overview of sigma-delta converters," IEEE Signal Processing Magazine, pp. 61-84, January 1996.
[3] M. Singh et al., "An Adaptively Pipelined Mixed Synchronous-Asynchronous Digital FIR Filter Chip Operating at 1.3 Gigahertz," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 18, no. 7, pp. 1043-1056, 2010.
[4] R. Mahesh and A. P. Vinod, "New Reconfigurable Architectures for Implementing FIR Filters With Low Complexity," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 2, pp. 275-288, Feb. 2010.
[5] P. Patronik, K. Berezowski, S. J. Piestrak, J. Biernat, and A. Shrivastava, "Fast and energy-efficient FIR filters constant coefficient using residue number systems," in Proc. Int. Symp. on Low Power Electronics and Design (ISLPED'11), pp. 385-390, 2011.
[6] S. Pontarelli, G. C. Cardarilli, M. Re, and A. Salsano, "Optimized Implementation of RNS FIR Filters Based on FPGAs," Journal of Signal Processing Systems, vol. 67, no. 3, 2012.
[7] R. Conway and J. Nelson, "Improved RNS FIR Filter Architecture," IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 51, pp. 26-28, 2004.
[8] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 22, no. 6, pp. 456-462, Dec. 1974.
[9] H. Yoo and D. V. Anderson, "Hardware-Efficient Distributed Arithmetic Architecture for High-Order Digital Filters," in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, pp. 125-128, Mar. 2005.
[10] P. Longa and A. Miri, "Area-efficient FIR filter design on FPGAs using distributed arithmetic," in Proc. IEEE Int. Symp. Signal Processing and Information Technology, pp. 248-252, 2006.
[11] M. Rouholamini, O. Kavehie, A. P. Mirbaha, S. J. Jasbi, and K. Navi, "A New Design for 7:2 Compressors," in Proc. IEEE/ACS International Conference on Computer Systems and Applications, AICCSA '07, pp. 474-478, May 2007.
[12] C. H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," IEEE Transactions on Circuits and Systems, vol. 51, no. 10, pp. 1985-1997, Oct. 2004.
[13] S. Veeramachaneni, K. M. Krishna, L. Avinash, S. R. Puppala, and M. B. Srinivas, "Novel Architectures for High-Speed and Low-Power 3-2, 4-2 and 5-2 Compressors," in Proc. 6th Int. Conf. on Embedded Systems, pp. 324-329, Jan. 2007.
[14] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, CACTI 5.1, 2008.
[15] Filter Design and Analysis Tool (FDATool), MathWorks Inc., http://www.mathworks.com.