Amir Sabbagh Molahosseini, Azadeh Alsadat Emrani Zarandi
TOWARDS FAST IMPLEMENTATION OF COMPLEX RNS COMPONENTS ON FPGAS
Efficient hardware implementation of the residue number system (RNS), particularly on field-programmable gate arrays (FPGAs), is important because FPGAs are used in some modern computing systems to achieve flexibility and a short time-to-market. The RNS, with its inherent parallelism, can also be used to enhance the performance of computing algorithms implemented on FPGAs. However, complex RNS operations such as residue-to-binary (reverse) conversion, sign detection, scaling, magnitude comparison and overflow detection have not yet been efficiently implemented on FPGAs. In this work, we present an approach to increase the speed of residue-to-binary conversion on FPGAs using parallel-prefix adders. This can be a first step towards fast implementation of complex RNS operations on FPGAs, since residue-to-binary conversion can also be used to solve other difficult RNS operations.
Keywords: Residue number system (RNS), Field programmable gate array (FPGA), Parallel prefix adders.
1 Introduction
Power consumption is a critical parameter in the design of any digital system. Different methods at different design levels have been investigated, with varying results. One of these methods is to work at the level of the number system and use an alternative to the weighted number system in order to achieve parallelism and lower power consumption. The residue number system (RNS) [1-3] is a number system that has been widely used in fields such as cryptography and signal processing [4, 5]. Although a great deal of work has been done on RNS, it still faces challenges such as implementation issues on field-programmable gate arrays (FPGAs).
The main concept behind RNS is the conversion of a weighted number into a set of residues with respect to a selected moduli set. Therefore, instead of performing calculations on large numbers, calculations are performed on small remainders in parallel. This results in lower power consumption and faster operations. However, such systems require two converters: a forward converter and a reverse converter. The forward converter converts a weighted number to its RNS representation, whereas the reverse converter computes the weighted number from the residues. One advantage of this system is that operations such as addition, subtraction and multiplication are performed on the residues in parallel, without carry propagation between the residue channels.
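The forward conversion, channel-wise arithmetic and reverse conversion described above can be illustrated with a minimal Python sketch. It assumes the moduli set $\{2^n-1, 2^n, 2^n+1, 2^{2n+1}-1\}$ with n = 4 and uses the textbook Chinese remainder theorem for the reverse step, not the optimized converter architectures studied in this paper. All function names are hypothetical.

```python
from math import prod


def moduli_set(n):
    """Moduli {2^n - 1, 2^n, 2^n + 1, 2^(2n+1) - 1}; assumed pairwise coprime."""
    return [2**n - 1, 2**n, 2**n + 1, 2**(2 * n + 1) - 1]


def to_rns(x, moduli):
    """Forward conversion: weighted number -> list of residues."""
    return [x % m for m in moduli]


def rns_add(a, b, moduli):
    """Channel-wise addition; no carry passes between channels."""
    return [(x + y) % m for x, y, m in zip(a, b, moduli)]


def rns_mul(a, b, moduli):
    """Channel-wise multiplication; each channel is independent."""
    return [(x * y) % m for x, y, m in zip(a, b, moduli)]


def from_rns(residues, moduli):
    """Reverse conversion with the textbook CRT (illustration only)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the multiplicative inverse of Mi modulo m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


if __name__ == "__main__":
    mods = moduli_set(4)  # [15, 16, 17, 511]
    x, y = 1234, 567
    rx, ry = to_rns(x, mods), to_rns(y, mods)
    print(from_rns(rns_add(rx, ry, mods), mods))
    print(from_rns(rns_mul(rx, ry, mods), mods))
```

The results are correct only while the true sum or product stays below the product of the moduli (the dynamic range); beyond that range the reverse conversion returns the value reduced modulo that product.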
The residue number system can be implemented using different technologies such as application-specific integrated circuits (ASICs) and FPGAs. FPGAs are now attractive because of their low cost and short time-to-market, and they are used to implement many kinds of circuits. This raises two questions: what are the advantages and disadvantages of implementing RNS on an FPGA, and is such an implementation beneficial at all? A further aspect to consider is how to improve RNS implementation on FPGAs. The reverse converter, with its non-modular structure, is one of the most challenging parts of an RNS, and its speed strongly affects overall RNS performance. Thus, in this paper, we focus on fast implementation of the reverse converter. The structure of the reverse converter consists of many modular and regular adders, and the regular adders can be easily implemented using the fast adders of the FPGA. The use of parallel-prefix adders in the reverse converter is also investigated. To study these aspects, all designs are first described and verified in VHDL and then implemented with Xilinx ISE on a Virtex-5 FPGA. The results are compared in terms of delay and number of slices.
The rest of this paper is organized as follows. Section 2 introduces reverse converter designs using different adders. Section 3 presents the implementation of these designs and compares and analyzes the results. Finally, Section 4 concludes the paper.
2 Reverse converter design
The residue number system, despite its applications and benefits, still has problems with FPGA implementation [10]. There are two views on implementing RNS on FPGAs. Some RNS researchers believe that it is not efficient to implement RNS on FPGAs because of the overhead of the reverse and forward converters [5]. Other researchers suggest implementing the reverse converter and other arithmetic circuits with read-only memory (ROM) to achieve suitable FPGA implementations [10]. However, reverse converters are now mostly designed without memory and based on arithmetic circuits; a ROM-based implementation cannot be reused for other moduli sets and is therefore not a flexible design method. In this work, a new approach to FPGA implementation of the residue number system is investigated. Two main points are considered. First, the design avoids ROM in order to achieve the required flexibility. Second, to achieve a better implementation of RNS on FPGAs, the hardware structure of the reverse converter, one of the most complex components of an RNS, is redesigned.
The reverse converter structure consists of a number of adders, some regular and some modular. Different kinds of adders can be used in the hardware structure of a reverse converter, but the most common is the ripple-carry adder (RCA), because of its low complexity and hardware cost. Furthermore, a fast carry propagation chain is available in the internal structure of the FPGA, which makes implementing RCAs on FPGAs justifiable [8]. For this reason, using other kinds of adders to implement arithmetic circuits on FPGAs may not seem to be an appropriate choice.
To improve RNS implementation on FPGAs, some challenges must first be investigated. The first is the efficiency of modular adders on FPGAs: is it possible to make modular adders as efficient as regular adders on FPGAs? The second is finding a way to improve reverse converter performance on FPGAs, for example by using other kinds of adders, and showing that the RCA is not the only suitable adder structure for FPGAs.
Fig. 1. The converter for the moduli set $\{2^n-1, 2^n, 2^n+1, 2^{2n+1}-1\}$ [11].
As mentioned above, implementing an RCA on an FPGA can be a suitable choice because of the fast carry propagation chain. However, to use such adders for modular addition in the reverse converter, an adder structure with a single representation of zero is required. This increases cost and reduces speed. On the other hand, parallel-prefix adder structures, with their logarithmic delay, can perform fast addition. However, because of their parallel structure, their hardware consists of a large number of gates and wires. An FPGA, in turn, provides a large communication (interconnect) structure. The idea here is that the FPGA's communication structure can be well suited to implementing parallel-prefix adders.
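The logarithmic depth of a parallel-prefix adder can be seen in a small bit-level model. The following Python sketch simulates a Kogge-Stone prefix network in software; it is only an illustration of the prefix computation, not the VHDL design used in this work, and the function name and return values are hypothetical.

```python
def kogge_stone_add(a, b, width):
    """Add two width-bit numbers with a Kogge-Stone prefix network.

    Returns (sum, carry_out, levels), where levels is the number of
    prefix stages, i.e. ceil(log2(width)).
    """
    # Bitwise generate and propagate signals
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    half_sum = p[:]  # kept for the final sum computation

    dist, levels = 1, 0
    while dist < width:
        new_g, new_p = g[:], p[:]
        # Every position combines with the position dist bits below it;
        # in hardware all of these cells operate in parallel.
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
        levels += 1

    # After the last stage g[i] is the carry out of bit position i
    total = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (half_sum[i] ^ carry_in) << i
    return total, g[width - 1], levels
```

For an 8-bit addition the loop runs three stages, whereas a ripple-carry chain passes through eight bit positions. The price is the number of cells and wires per stage, which is the hardware cost discussed above.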
Recent research shows that parallel-prefix adders cannot be implemented as efficiently as RCAs on FPGAs [8]. The important point, however, is that the comparisons made so far are for regular addition. In other words, no comprehensive research has been done on evaluating the performance of modular prefix adder structures on FPGAs. Modular addition is a two-phase operation, and if the modulus is $2^n-1$, the two representations of zero must be reduced to a single representation of zero. Therefore, a modular prefix adder with a single representation of zero is expected to perform better than an RCA-based design when implemented on an FPGA.
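The double representation of zero in modulo $2^n-1$ addition can be shown with a short behavioral sketch in Python. It models an adder with end-around carry and the extra correction that yields a single zero; it is not a description of the hardware in [13], and the function names are hypothetical.

```python
def eac_add(a, b, n):
    """Modulo 2^n - 1 addition with end-around carry.

    Phase 1 adds the operands; phase 2 feeds the carry-out back into the
    least significant position. The result may be the all-ones word,
    which is a second representation of zero.
    """
    mask = (1 << n) - 1
    s = a + b
    return (s & mask) + (s >> n)


def add_single_zero(a, b, n):
    """Same addition, with the all-ones result mapped to zero."""
    mask = (1 << n) - 1
    s = eac_add(a, b, n)
    return 0 if s == mask else s
```

With n = 4, adding 5 and 10 gives the all-ones word 1111 from `eac_add`, which `add_single_zero` maps to 0. This correction step is the additional cost referred to above when an RCA is used for modular addition.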
Fig. 2. The reverse converter for the moduli set $\{2^n-1, 2^n+1, 2^{2n}, 2^{2n+1}-1\}$ [12].
To investigate these implementation issues and answer the questions above, different configurations of the reverse converter are considered. The configurations differ in the positions of modular and regular parallel-prefix adders in two different reverse converter structures. The reverse converters for the moduli sets $\{2^n-1, 2^n, 2^n+1, 2^{2n+1}-1\}$ and $\{2^n-1, 2^n+1, 2^{2n}, 2^{2n+1}-1\}$ were selected, and different adders were used in their hardware architectures. Finally, the performance of the reverse converters is evaluated and analyzed. These recently introduced moduli sets have attracted attention because of their large dynamic range and high parallelism, and they were chosen because of their varied structures and the different kinds of adders (modular and regular) that they require. The structures of the reverse converters for these moduli sets, together with the required adders, are shown in Figures 1 and 2. In this work, only the carry-propagate adders (CPAs) are replaced with parallel-prefix adders (PPAs); the carry-save adders (CSAs) remain unchanged. For each structure shown in the figures, the first converter configuration (config. 1), which is based entirely on CPAs, is taken as the base configuration in the comparisons. In the figures, a carry-propagate adder with end-around carry is denoted E-CPA, a modular parallel-prefix adder M-PPA and a regular parallel-prefix adder K-PPA.
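The dynamic range of an RNS is the product of its moduli. As a worked illustration (derived here from the definitions of the two moduli sets, with $M_1$ and $M_2$ introduced only as labels for this note), the ranges are:

```latex
\begin{aligned}
M_1 &= (2^n-1)\,2^n\,(2^n+1)\,(2^{2n+1}-1)
     = 2^n\,(2^{2n}-1)\,(2^{2n+1}-1), & 2^{5n} &< M_1 < 2^{5n+1},\\
M_2 &= (2^n-1)\,(2^n+1)\,2^{2n}\,(2^{2n+1}-1)
     = 2^{2n}\,(2^{2n}-1)\,(2^{2n+1}-1), & 2^{6n} &< M_2 < 2^{6n+1}.
\end{aligned}
```

The first set therefore covers roughly (5n+1)-bit numbers and the second roughly (6n+1)-bit numbers. For n = 4, $M_1 = 2\,084\,880$ (21 bits) and $M_2 = 33\,358\,080$ (25 bits).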
3 Performance evaluation
To investigate these points, all configurations of both converters were verified using structural VHDL code and then implemented on a Virtex-5 FPGA. The results are compared in terms of delay and number of required slices, as shown in Figures 3-6. The designs of [13] and [14] are used to implement the M-PPA (modular parallel-prefix adder) and the K-PPA (Kogge-Stone regular parallel-prefix adder), respectively. We now analyze the implementation results for the different converter configurations. As can be seen from Figures 3 and 4, in config. 2, where CPA2 is replaced with the modulo parallel-prefix adder, the delays for n = 4 and 8 are lower than in the original design, config. 1. However, if CPA2, which is not on the critical delay path, is replaced with an M-PPA (config. 3), the delay does not improve as expected over config. 1.
Fig. 3. Delays of FPGA implementation of different converter configurations (config. 1 to config. 10, plotted for different values of n) for the moduli set $\{2^n-1, 2^n, 2^n+1, 2^{2n+1}-1\}$.
However, in config. 3, in which CPA3 is replaced with an M-PPA, both delay and area decrease. Therefore, it is better to use M-PPA designs in adder positions on the critical delay path.
On the other hand, in config. 5, where CPA5, a regular binary adder, is replaced with the K-PPA, a regular parallel-prefix adder, the delay is even worse than in config. 1 and the area also increases. Hence, as expected, using regular parallel-prefix adders to implement converters on FPGAs does not improve performance. The best tradeoff is achieved by using M-PPAs for both CPA1 and CPA3 (the modulo adders on the critical delay path). In this case, the speed increases significantly. In the other cases, configs. 7 to 9, where both M-PPA and K-PPA are used in the converter structure, performance does not improve. Finally, the fastest design is config. 10, where all CPAs are replaced with PPAs, but at the expense of a large hardware cost. The same trend holds for the second converter, as shown in Figures 5 and 6. In other words, using modulo parallel-prefix adders to implement the converters on FPGAs can increase the speed.
Fig. 5. Delays of FPGA implementation of different converter configurations (config. 1 to config. 8; n = 4, 8, 12, 16) for the moduli set $\{2^n-1, 2^n+1, 2^{2n}, 2^{2n+1}-1\}$.
Fig. 6. Areas (number of slices) of FPGA implementation of different converter configurations (config. 1 to config. 8; n = 4, 8, 12, 16) for the moduli set $\{2^n-1, 2^n+1, 2^{2n}, 2^{2n+1}-1\}$.
4 Conclusion
In this paper, the effect of using modular and regular parallel-prefix adders in the implementation of residue number system reverse converters on FPGAs was investigated. The implementation results show that using modulo parallel-prefix adders in place of CPAs with end-around carry (EAC), and ripple-carry adders for the regular CPAs, leads to the best performance. Since residue-to-binary conversion is the core of the difficult RNS operations, the proposed approach to reverse conversion can also improve the performance of other complex RNS components on FPGAs.
Acknowledgements
This work is supported by Kerman Branch, Islamic Azad University, Kerman, Iran, as part of a research plan.
REFERENCES
1. A. Omondi and B. Premkumar, "Residue Number Systems: Theory and Implementations," Imperial College Press, London, 2007.
2. J. Chen and J. Hu, "Energy-Efficient Digital Signal Processing via Voltage-Overscaling-Based Residue Number System," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 21, no. 7, pp. 1322-1332, 2013.
3. T. Stouraitis and V. Paliouras, "Considering the alternatives in low-power design," IEEE Circuits and Devices, vol. 7, pp. 23-29, 2001.
4. C.H. Vun, A.B. Premkumar and W. Zhang, "A New RNS based DA Approach for Inner Product Computation," IEEE Trans. Circuits and Systems-I, vol. 60, no. 8, pp. 2139-2152, 2013.
5. J.C. Bajard, L.S. Didier and T. Hilair, "p-Direct Form transposed and Residue Number Systems for Filter implementations," In Proc. of IEEE International Midwest Symposium on Circuits and Systems, 2011, pp. 1-4.
6. M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, and K. Navi, "Efficient RNS Implementation of Elliptic Curve Point Multiplication Over GF(p)," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 21, no. 8, pp. 1545-1549, 2013.
7. S. Pontarelli, G.C. Cardarilli, M. Re and A. Salsano, "Optimized Implementation of RNS FIR Filters Based on FPGAs," Journal of Signal Processing Systems, vol. 67, no. 3, pp. 201-212, 2012.
8. D.H.K. Hoe, C. Martinez, and S.J. Vundavalli, "Design and characterization of parallel prefix adders using FPGAs," In Proc. of IEEE Southeastern Symposium on System Theory, 2011, pp. 168-172.
9. K. Navi, A.S. Molahosseini and M. Esmaeildoust, "How to Teach Residue Number System to Computer Scientists and Engineers," IEEE Trans. Education, vol. 54, no. 1, pp. 156-163, 2011.
10. A.S. Molahosseini, S. Sorouri and A.A. Emrani Zarandi, "Research Challenges in Next-Generation Residue Number System Architectures," In Proc. of IEEE International Conference on Computer Science and Education, 2012, pp. 1658-1661.
11. A.S. Molahosseini, K. Navi, C. Dadkhah, O. Kavehei, S. Timarchi, "Efficient Reverse Converter Designs for the New 4-Moduli Sets $\{2^n-1, 2^n, 2^n+1, 2^{2n+1}-1\}$ and $\{2^n-1, 2^n+1, 2^{2n}, 2^{2n}+1\}$ Based on New CRTs," IEEE Trans. Circuits and Systems-I, vol. 57, no. 4, pp. 823-835, 2010.
12. A.S. Molahosseini and K. Navi, "A Reverse Converter for the Enhanced Moduli Set $\{2^n-1, 2^n+1, 2^{2n}, 2^{2n+1}-1\}$ Using CRT and MRC," In Proc. of IEEE Computer Society Annual Symposium on VLSI, 2010, pp. 456-457.
13. R.A. Patel, M. Benaissa and S. Boussakta, "Fast Parallel-Prefix Architectures for Modulo $2^n-1$ Addition with a Single Representation of Zero," IEEE Trans. Computers, vol. 56, no. 11, pp. 1484-1492, 2007.
14. P.M. Kogge and H.S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Trans. Computers, vol. 22, no. 8, pp. 783-791, 1973.
ABOUT THE AUTHORS
Amir Sabbagh Molahosseini, Faculty Member, Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran. Assistant Professor, Ph.D. Phone: 00989131403688, E-mail: [email protected].
Azadeh Alsadat Emrani Zarandi, Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran. E-mail: [email protected].