UDC 519.688, 004.315
Sergey I. Salishev, Roman E. Shein
NOVEL ALGORITHMS FOR CONTINUOUS-FLOW MIX-RADIX IN-PLACE MULTI-BANK RAM-BASED FFT1
Abstract
A method of implementing in-place continuous-flow mixed-radix FFT on multi-bank memory with additional constraints is investigated. Using this method, four novel FFT architectures are proposed. Parallel butterflies in the small-radix stage allow a substantial speed-up for mixed-radix FFT. The single-port memory architecture provides an in-place strategy for libraries without dual-port memory, effectively reducing the memory requirement by 50%. The self-sorting architecture allows overlapped I/O for natural-order FFT, reducing the initiation interval by up to 30%. A combined approach is also proposed.
Keywords: FFT, in-place, continuous-flow, mixed-radix, self-sorting.
1. INTRODUCTION
Fast Fourier Transform (FFT) is used by many communication applications, such as 802.11, 802.16 and their modifications. FFT processor performance is crucial for the overall performance of these applications. A common approach to FFT processor architecture is the in-place memory-based one. This approach guarantees that for each butterfly or group of butterflies both inputs and results are stored in the same memory locations, so for an FFT sampled at N points a dual-port multi-bank memory storing N complex words can be used. Since memory dominates the area and power of a memory-based FFT processor, such minimization of memory size is crucial for the processor to be efficient.
One butterfly should be initiated every clock to maximize throughput for a given butterfly size. To do so, the wings of the butterfly must read from and write to non-conflicting memory ports. This requires a conflict-free bank assignment.
Johnson [1] suggested an in-place addressing strategy and architecture that allows the launch of one butterfly per clock for pure-radix FFT.
Hsiao, Chen and Lee [2] suggested an in-place addressing strategy and architecture for arbitrary mixed-radix FFT launching one butterfly per clock.
Jo and Sunwoo [3] suggested an in-place addressing strategy and architecture for radix 2/4 FFT that launches 2 radix 2 butterflies in the radix 2 stage and utilizes 2 single-port N-sized memories.
Xilinx LogiCORE IP FFT [4] and Altera MegaCORE [5] use a radix 2/4 memory-based burst I/O architecture with both bit-reversed/digit-reversed and natural order of inputs and outputs.
© Salishev S.I., Shein R.E., 2013
1 Logical circuit implementation is pending Russian and US patents.
Based on an evaluation of its RTL, Xilinx LogiCORE IP FFT uses a method for radix 2/4 FFT that launches 2 radix 2 butterflies in parallel.
A flexible approach that generalizes the results presented in the above works is proposed. Using this approach, several novel FFT processor architectures with improved performance are suggested.
For an FFT sampled at N points, where $N = r \cdot R^n$, $2r \le R$ and $R$, $r$ are the radixes of the butterflies used in the FFT, the simple approach proposed by Johnson [1] is to calculate radix $r$ butterflies using a radix $R$ butterfly engine with redundant inputs set to zero. It is possible to significantly speed up the calculation by using the radix $R$ butterfly engine to calculate multiple radix $r$ butterflies simultaneously. Such an improvement is proposed by Jo and Sunwoo [3] for $r = 2$, $R = 4$. We generalize the approach to any $r$, $R$ such that $R$ is divisible by $r$.
A variation of the method is suggested that facilitates the use of single-port memories instead of dual-port memories while still launching multiple small-radix butterflies per clock. Replacing dual-port memories with single-port memories while preserving the performance improvements allows area improvements as well. Notice that this architecture allows the creation of FFT processors with in-place addressing on libraries without dual-port memories, where usually two single-port memories of size N are used. Therefore, this approach allows a 50% reduction in memory area for libraries without dual-port memories.
Use of either decimation in time (DIT) or decimation in frequency (DIF) decomposition leads to the input and the output having different orders. Therefore, if the order is important, a digit reverse must be performed before or after the FFT calculation. It requires an additional shuffling stage with a complex memory access pattern, so it is desirable to blend the shuffle operations with the computations.
Hegland [6] proposed a generalized self-sorting in-place FFT decomposition, summarizing previous works on the topic, by mixing in-place transposition stages with computation stages. He used a symmetric two-sided decomposition combining DIT and DIF. A similar method using an asymmetric stage arrangement is proposed in this paper. Due to the asymmetric property, it is mapped to a DIT (DIF) FFT processor architecture with changes only in memory generation, preserving all the benefits stated above. This method allows up to a 30% reduction in the initiation interval (in clocks) of a burst I/O natural-order radix-4 FFT of length 1024 compared to Xilinx LogiCORE IP FFT.
In section 2 a convenient notation for FFT addressing is given and a general formula for bank assignment is proposed. In section 3 an FFT processor architecture utilizing dual-port memories is proposed. In section 4 an FFT processor architecture utilizing self-sorting addressing is proposed. In section 5 an FFT processor architecture utilizing single-port memories is proposed. In section 6 a combination of the self-sorting and single-port approaches is considered. Proofs of the theorems are omitted in the sections and can be found in the appendix.
2. NOTATION
In this section a convenient notation for FFT addressing is proposed. For simplicity, introduce the following string substitutions: $[d]_{i,i+j} = d_i, d_{i+1}, \ldots, d_{i+j}$ and $[d]_{i+j,i} = d_{i+j}, d_{i+j-1}, \ldots, d_i$.
If $d_0, \ldots, d_s$ are, accordingly, radix $r_0, \ldots, r_s$ digits, then let $[d_s, \ldots, d_0]$ be the mixed-radix number constructed by concatenating the digits. If some $d_i$ is a radix 1 digit, define $[d_s, \ldots, d_{i+1}, d_i, d_{i-1}, \ldots, d_0] = [d_s, \ldots, d_{i+1}, d_{i-1}, \ldots, d_0]$. This boundary case appears when proofs for the mixed-radix case are applied to the pure-radix case. More formally, $[d_s, \ldots, d_0] = d_0 + d_1 \cdot r_0 + d_2 \cdot r_0 \cdot r_1 + \ldots + d_s \cdot r_0 \cdot r_1 \cdot r_2 \cdot \ldots \cdot r_{s-1}$.
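As an illustration of the mixed-radix notation, the following Python sketch converts between a digit list and the number it denotes. The function names `digits_to_number` and `number_to_digits` are illustrative and do not come from the paper; digits are passed least significant first, i.e. in the order d_0, ..., d_s.

```python
def digits_to_number(digits, radices):
    """Value of [d_s, ..., d_0]; digits[i] is d_i and radices[i] is r_i."""
    value, weight = 0, 1
    for d, r in zip(digits, radices):
        if not 0 <= d < r:
            raise ValueError("digit out of range for its radix")
        value += d * weight      # d_i contributes d_i * r_0 * ... * r_{i-1}
        weight *= r
    return value


def number_to_digits(value, radices):
    """Inverse conversion: returns [d_0, ..., d_s]."""
    digits = []
    for r in radices:
        digits.append(value % r)
        value //= r
    return digits


radices = [2, 4, 4]                       # r_0 = 2, r_1 = r_2 = 4, N = 32
assert digits_to_number([1, 3, 2], radices) == 1 + 3 * 2 + 2 * 2 * 4
assert number_to_digits(23, radices) == [1, 3, 2]
```

A radix 1 digit can only be 0 and multiplies the weight by 1, so it drops out of the value, which is the boundary case mentioned above.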
Consider an FFT sampled at $N = r_0 \cdot \ldots \cdot r_{n-1}$ points decomposed into radix $r_0, \ldots, r_{n-1}$ stages, where $r_i \le r_{i+1}$. Note: such an ordering of the radixes is not mandatory, but is used to simplify proofs and explanations.
Calculation of FFT can be viewed as two nested loops: outer loop iterating over stages and inner loop iterating over butterflies (or butterfly groups for stages with multiple butterflies executed simultaneously) within one stage.
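The two nested loops can be sketched in software as follows. This is a plain Python model of a generic in-place mixed-radix DIF transform, not the hardware architecture of the paper: it assumes `len(x)` equals the product of `radices`, uses the convention $W_r = e^{-2\pi i / r}$, and the helper names (`dif_fft_in_place`, `natural_index`, `direct_dft`) are illustrative.

```python
import cmath


def dif_fft_in_place(x, radices):
    """In-place mixed-radix DIF FFT; stage c uses radix radices[c].

    The result is left in digit-reversed order (see natural_index).
    """
    n_points = len(x)
    block = n_points                      # length of the sub-transform being split
    for r in radices:                     # outer loop: stages
        sub = block // r
        for base in range(0, n_points, block):
            for j in range(sub):          # inner loop: butterflies of this stage
                idx = [base + j + u * sub for u in range(r)]
                wings = [x[i] for i in idx]            # r reads
                for s in range(r):                     # r writes, same locations
                    acc = sum(w * cmath.exp(-2j * cmath.pi * s * u / r)
                              for u, w in enumerate(wings))
                    x[idx[s]] = acc * cmath.exp(-2j * cmath.pi * s * j / block)
        block = sub
    return x


def natural_index(position, radices):
    """Frequency index of the value stored at `position` after the transform."""
    block = 1
    for r in radices:
        block *= r
    k, weight = 0, 1
    for r in radices:
        block //= r
        k += ((position // block) % r) * weight
        weight *= r
    return k


def direct_dft(x):
    """O(N^2) reference, used only to check the sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


radices = [2, 4, 4]
data = [complex(t % 5, -t) for t in range(32)]
out = dif_fft_in_place(list(data), radices)
ref = direct_dft(data)
assert all(abs(out[p] - ref[natural_index(p, radices)]) < 1e-8 for p in range(32))
```

Each inner iteration reads and writes the same r locations, which is the in-place property; the butterflies of one stage do not depend on each other, so the inner loop may be traversed in any order.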
Let $FFT(k_{n-1}, \ldots, k_0)$ stand for the result of the FFT on the input numbered $[k_{n-1}, \ldots, k_0]$ ($0 \le k_i < r_i$, $k_0$ being the least significant digit). Define a radix $r$ butterfly operation
$$B_s(f_{r-1}, \ldots, f_0) = \sum_{k=0}^{r-1} f_k \cdot (W_r)^{sk}. \qquad (1)$$
Here $W_r = e^{-\frac{2\pi i}{r}}$ are complex roots of unity.
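A minimal Python sketch of the butterfly operation (1) follows. It assumes the sign convention $W_r = e^{-2\pi i / r}$ and takes the wing values as a list with `f[k]` standing for $f_k$; the function name `butterfly` is illustrative.

```python
import cmath


def butterfly(f):
    """Return [B_0, ..., B_{r-1}] for the wing values f[0], ..., f[r-1]."""
    r = len(f)
    w = cmath.exp(-2j * cmath.pi / r)      # W_r, assumed sign convention
    return [sum(f[k] * w ** (s * k) for k in range(r)) for s in range(r)]


out = butterfly([1, 1, 1, 1])              # a constant input has only a DC term
assert abs(out[0] - 4) < 1e-12
assert all(abs(v) < 1e-12 for v in out[1:])
```

A radix $r$ butterfly is thus a length-$r$ DFT of its wings, which is why it needs $r$ reads and $r$ writes.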
Let $F_{c+1}([d]_{0,n-c-2}, k_c, [k]_{c-1,0})$ be the stage $c$ output numbered $[[d]_{0,n-c-2}, k_c, [k]_{c-1,0}]$, where $k_i$ are already processed digits and $d_i$ are digits that are yet to be processed, $k_i < r_i$, $d_i < r_{n-i-1}$, and $F_0(d_0, \ldots, d_{n-1})$ are the input sample points. Then $FFT(k_{n-1}, \ldots, k_0) = F_n(k_{n-1}, \ldots, k_0)$.
For DIT decomposition the FFT stage formula is
$$F_{c+1}([d]_{0,n-c-2}, k_c, [k]_{c-1,0}) = B_{k_c}(w_0, \ldots, w_{r_c-1}), \qquad (2)$$
$$w_u = W_{r_0 \cdots r_c}^{\,u \cdot [[k]_{c-1,0}]} \cdot F_c([d]_{0,n-c-2}, u, [k]_{c-1,0}). \qquad (3)$$
For DIF decomposition the stage formula is
$$F_{c+1}([d]_{0,n-c-2}, k_c, [k]_{c-1,0}) = W_{r_c \cdots r_{n-1}}^{\,k_c \cdot [[d]_{n-c-2,0}]} \cdot B_{k_c}(w_0, \ldots, w_{r_c-1}), \qquad (4)$$
$$w_u = F_c([d]_{0,n-c-2}, u, [k]_{c-1,0}). \qquad (5)$$
DIT decomposition leads to digit-reversed order of the input points, and DIF decomposition leads to digit-reversed order of the output points.
Notice that the formulae for DIF and DIT differ only in whether the multiplication by twiddle factors is performed before or after the butterfly operation. The choice of decomposition type does not matter in what follows, so for convenience suppose DIF is used.
A radix $r_c$ butterfly in stage $c$ uses inputs numbered $[k_{n-1}, \ldots, k_{c+1}, k_c, k_{c-1}, \ldots, k_0]$, where $k_c$ varies from 0 to $r_c - 1$. The butterfly can therefore be numbered by $[k_{n-1}, \ldots, k_{c+1}, k_{c-1}, \ldots, k_0]$.
The approach adopted in this paper implies use of memory split into $r_{n-1}$ banks in order to allow pipelining of butterfly execution: each radix $r$ butterfly operation requires $r$ memory reads and $r$ memory writes. Define bank and address assignments depending only on sample point numbers (it is convenient to use such in-place notation even for self-sorting FFT, which is not actually in-place). Let $m(k_{n-1}, \ldots, k_0)$ be the bank assignment and $a(k_{n-1}, \ldots, k_0)$ be the address assignment within the bank for number $[k_{n-1}, \ldots, k_0]$. In this paper any correct address assignment may be used; for simplicity suppose everywhere $a(k_{n-1}, \ldots, k_0) = [k_{n-2}, \ldots, k_0]$. Let
$$I_c([d_{n-1}, \ldots, d_{c+1}, d_{c-1}, \ldots, d_0], d) = [d_{n-1}, \ldots, d_{c+1}, d, d_{c-1}, \ldots, d_0]. \qquad (6)$$
This notation can be used to conveniently separate the butterfly number from the wing number:
$$m(k_{n-1}, \ldots, k_1, k_0) = m(I_c([k_{n-1}, \ldots, k_{c+1}, k_{c-1}, \ldots, k_0], k_c)). \qquad (7)$$
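The separation of butterfly number and wing number can be sketched as below. The helper names `insert_wing` and `wing_points` are illustrative; digits are stored least significant first, so a butterfly of stage c is given by the digits d_0, ..., d_{c-1}, d_{c+1}, ..., d_{n-1}.

```python
def insert_wing(butterfly_digits, c, wing):
    """I_c from (6): put the wing digit at position c of the butterfly number."""
    return butterfly_digits[:c] + [wing] + butterfly_digits[c:]


def wing_points(butterfly_digits, c, radices):
    """Sample numbers touched by one radix r_c butterfly of stage c."""
    points = []
    for wing in range(radices[c]):
        digits = insert_wing(butterfly_digits, c, wing)
        value, weight = 0, 1
        for d, r in zip(digits, radices):   # mixed-radix value, k_0 first
            value += d * weight
            weight *= r
        points.append(value)
    return points


# Stage 1 of N = 2 * 4 * 4: butterfly with k_0 = 1, k_2 = 2 touches 4 points.
assert wing_points([1, 2], 1, [2, 4, 4]) == [17, 19, 21, 23]
```

A bank assignment is conflict free for this butterfly when the points returned by `wing_points` all map to different banks.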
While there is a dependency between subsequent stages, butterflies within one stage are independent of each other and can therefore be calculated in arbitrary order. Suppose $q_c$ butterflies are run simultaneously in stage $c$. Stage $n - 1$ obviously has only one butterfly run at a time, because only $r_{n-1}$ memory banks are available. For any stage $c$ that runs $q_c$ butterflies per clock, the inner loop iterates over butterfly groups numbered $[k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k_{c-1}, \ldots, k_0]$, where $k_i < r_i$ and $k'_{c+1} < \frac{r_{c+1}}{q_c}$.
Let $T_c(k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0)$, where $k_i < r_i$, $k'_{c+1} < \frac{r_{c+1}}{q_c}$, $k''_{c+1} < q_c$, $[k'_{c+1}, k''_{c+1}] < r_{c+1}$, be the number of the $k''_{c+1}$-th butterfly executed in the $[k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k_{c-1}, \ldots, k_0]$-th iteration of the loop over butterfly groups in stage $c$. Essentially, $k_{c+1}$ is split into $[k'_{c+1}, k''_{c+1}]$: $k'_{c+1}$ is used as a part of the butterfly group number, while $k''_{c+1}$ is used to enumerate butterflies within the group.
The trivial traverse order for all stages is
$$T_c(k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0) = [k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0]. \qquad (8)$$
Let $M_c([k_{n-1}, \ldots, k_0])$ be the memory bank used in iteration $k$ of the butterfly loop in stage $c$. If $q$ radix $r_c$ butterflies are run in parallel, $M_c$ can be obtained as
$$M_c([k_{n-1}, \ldots, k_0]) = m(I_c(T_c(k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0), k_c)). \qquad (9)$$
The hypothesis we will exploit is that the following bank assignment is conflict free and allows multiple butterflies per clock in small radix stages for mixed-radix FFT on dual-port memories with the trivial traverse order:
$$m(k_{n-1}, \ldots, k_0) = \left(\sum_{i=0}^{n-1} g_i \cdot k_i\right) \bmod r_{n-1}. \qquad (10)$$
Here $g_i$ are some constants depending on the radixes chosen for the stages.
This bank assignment generalizes the bank assignments proposed by Johnson [1], Hsiao, Chen and Lee [2] and Jo and Sunwoo [3]. It is also used as the base for the self-sorting and single-port memory architectures.
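The weighted digit sum of formula (10) is easy to explore by brute force. The sketch below is an illustrative Python model (the names `bank` and `wings_conflict_free` are not from the paper); it only checks that the wings of every single butterfly land in distinct banks, which is the one-butterfly-per-clock requirement. With all $g_i = 1$ it reproduces a digit-sum assignment of the kind attributed to Johnson [1].

```python
from itertools import product


def bank(digits, g, num_banks):
    """Bank assignment of the form (10); digits[i] is k_i, g[i] is g_i."""
    return sum(gi * ki for gi, ki in zip(g, digits)) % num_banks


def wings_conflict_free(radices, g):
    """True if, in every stage, the wings of each butterfly hit distinct banks."""
    num_banks = radices[-1]
    for c, rc in enumerate(radices):
        others = radices[:c] + radices[c + 1:]
        for bf in product(*[range(r) for r in others]):
            digits = list(bf)
            banks = {bank(digits[:c] + [kc] + digits[c:], g, num_banks)
                     for kc in range(rc)}
            if len(banks) != rc:
                return False
    return True


assert wings_conflict_free([4, 4, 4], [1, 1, 1])        # pure radix, digit sum
assert not wings_conflict_free([4, 4, 4], [2, 1, 1])    # 2*k_0 mod 4 collides
assert wings_conflict_free([2, 4, 4], [2, 1, 1])        # fine when k_0 < 2
```

The last two lines show why the constants $g_i$ have to be chosen together with the radixes.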
3. FFT PROCESSOR UTILIZING DUAL-PORT MEMORIES
Consider an FFT sampled at $N = r \cdot R^{n-1}$ points using radix $r$ and radix $R = r \cdot q$ butterfly operations, i.e. $r_0 = r$, $r_1 = \ldots = r_{n-1} = R$. It can be calculated using an FFT processor with the following architecture, similar to the one presented in [1]. The corresponding block structure is shown in Fig. 1. It consists of an Address Generation Unit (AGU), Random Access Memory (RAM), Switchable Interconnect (IC), Butterfly Processing Unit (PU), and twiddle memory.
The key feature of the architecture is the AGU implementing an addressing strategy that allows execution of $q$ butterflies simultaneously in the radix $r$ stage. Launching multiple butterflies simultaneously makes the radix $r$ calculation $q$ times faster, therefore granting a significant performance improvement.
Fig. 1. Architecture for a FFT processor utilizing dual-port memories
Tab. 1. Estimated clocks count for different modifications of the approach
FFT size | One butterfly per clock: Clocks | Radix | Proposed approach: Clocks | Radix
512-point | 192 | 8 | 192 | 8
1024-point | 896 | 2/8 | 512 | 2/8
2048-point | 1280 | 4/8 | 1024 | 4/8
4096-point | 2048 | 8 | 2048 | 8
The AGU may use the trivial traverse order (8) and the bank assignment
$$m(k_{n-1}, \ldots, k_0) = \left(\sum_{i=1}^{n-1} k_i + q k_0\right) \bmod R. \qquad (11)$$
Notice that the bank assignment (11) is a special case of formula (10) and equals the bank assignment introduced in [1] for $r = R$.
Theorem 1. The bank assignment m (11) with trivial traverse order Tc (8) guarantees no conflicts for dual-port memory FFT processor.
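The claim of Theorem 1 can be checked exhaustively for small sizes. The sketch below is an illustrative Python model, not the authors' proof; it assumes the bank assignment (11) reads $m = (k_1 + \ldots + k_{n-1} + q k_0) \bmod R$ and that a group in the radix $r$ stage consists of the $q$ butterflies sharing the high part of $k_1$. The names `bank11` and `dual_port_conflict_free` are hypothetical.

```python
from itertools import product


def bank11(digits, q, R):
    """Assumed reading of (11); digits[i] is k_i, digits[0] is the radix r digit."""
    return (sum(digits[1:]) + q * digits[0]) % R


def dual_port_conflict_free(r, q, n):
    """Brute-force check for N = r * R**(n-1), R = r * q, n >= 2."""
    R = r * q
    radices = [r] + [R] * (n - 1)
    # Stage 0 (radix r): q butterflies per clock, enumerated by the low part of k_1.
    for rest in product(*[range(x) for x in radices[2:]]):
        for k1_hi in range(R // q):
            banks = {bank11([k0, k1_hi * q + k1_lo, *rest], q, R)
                     for k1_lo in range(q) for k0 in range(r)}
            if len(banks) != R:          # q * r accesses must use R distinct banks
                return False
    # Stages c >= 1 (radix R): one butterfly per clock, wings differ in k_c.
    for c in range(1, n):
        others = radices[:c] + radices[c + 1:]
        for bf in product(*[range(x) for x in others]):
            banks = {bank11(list(bf[:c]) + [kc] + list(bf[c:]), q, R)
                     for kc in range(R)}
            if len(banks) != R:
                return False
    return True


assert dual_port_conflict_free(2, 2, 3)    # radix 2/4, N = 32
assert dual_port_conflict_free(2, 4, 3)    # radix 2/8, N = 128
assert dual_port_conflict_free(3, 2, 2)    # non-power-of-2 radixes, N = 18
```

Reads and writes use separate ports of the dual-port memory, so only accesses of the same kind within one clock are compared.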
Values of $n$ and $r$ can be adjusted at run-time so that one FFT processor calculates transforms (and inverse transforms) of different sizes. The performance gain in comparison to other modifications of Johnson's approach for some sample lengths is given in Tab. 1. Notice that the numbers are estimates: pipeline length and, probably, some other constant modifiers must be added in order to obtain the real clock count. Although only values for power-of-2 radixes are listed, the approach can be used with non-power-of-2 radixes as well.
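The estimates in Tab. 1 are consistent with a simple count of butterfly issue slots, sketched below in Python. The counting model is our reading of the table rather than a formula stated in the paper: the radix $r$ stage issues $q = R / r$ butterflies per clock in the proposed approach (one per clock otherwise), each radix $R$ stage issues one butterfly per clock, and pipeline latency is ignored. The function name `estimated_clocks` is illustrative.

```python
def estimated_clocks(r, R, n, parallel_small_radix):
    """Butterfly-issue clocks for N = r * R**(n-1), ignoring pipeline latency."""
    n_points = r * R ** (n - 1)
    q = R // r if parallel_small_radix else 1
    small_stage = (n_points // r) // q          # radix r stage, q butterflies per clock
    large_stages = (n - 1) * (n_points // R)    # radix R stages, one butterfly per clock
    return small_stage + large_stages


sizes = {512: (8, 8, 3), 1024: (2, 8, 4), 2048: (4, 8, 4), 4096: (8, 8, 4)}
for size, (r, R, n) in sizes.items():
    print(size, estimated_clocks(r, R, n, False), estimated_clocks(r, R, n, True))
```

For the pure radix 8 sizes both columns coincide, since there is no smaller-radix stage to accelerate.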
4. FFT PROCESSOR UTILIZING SELF-SORTING ADDRESSING
Consider an FFT sampled at $N = r \cdot R^{n-1}$ points using radix $r$ and radix $R$ butterfly operations, i.e. $r_0 = r$, $r_1 = \ldots = r_{n-1} = R$, where $R = r \cdot q$. Both common decompositions, DIT and DIF, lead to either the input or the output having reversed digit order, i.e. in order to obtain the result an explicit digit reverse operation must be performed. An improved architecture that blends the digit reverse into the FFT processor is proposed, so that no explicit digit reverse is required while multiple butterflies per clock are still run in the radix $r$ stage.
The same bank assignment (11) as in section 3 is used, and the first stages use the trivial traverse order (8) as well. However, for radix $R$ stages the outputs of butterflies are transposed: the output numbered $[w', w'']$ is written as if it was numbered $[w'', w']$, where $w' \in 0..r-1$, $w'' \in 0..q-1$. Starting from stage $\frac{n+1}{2}$, a permutation of outputs is introduced: for stage $c$, where $c \ne n - 1$, consider a butterfly with inputs numbered
$$[k'_{n-1}, k''_{n-1}, \ldots, k'_{c+1}, k''_{c+1}, k'_c, k''_c, k'_{c-1}, k''_{c-1}, \ldots, k'_{n-c}, k''_{n-c}, k'_{n-c-1}, k''_{n-c-1}, k'_{n-c-2}, k''_{n-c-2}, \ldots, k_0]. \qquad (12)$$
Its outputs are stored in memory addresses calculated as for outputs numbered
$$[k'_{n-1}, k''_{n-1}, \ldots, k'_{c+1}, k''_{c+1}, k'_{n-c-1}, k''_{n-c}, k'_{c-1}, k''_{c-1}, \ldots, k'_{n-c}, k''_c, k'_c, k''_{n-c-1}, k'_{n-c-2}, k''_{n-c-2}, \ldots, k_0]. \qquad (13)$$
Notice that which half of the stages performs the reverses is insignificant (Fig. 2). The resulting FFT processor has the following architecture (Fig. 3). Consider the following traverse order
Tc (kn-l, kc+l,0, kc—l, ^-kn—c, k n—c ]' ^kn— c—1, k n—c—1 ]'k n—c—2, ''', k 0^
= ttn-V ... ? kc+V K-V ... ? [kn-c ? k0l[kV kn_c_1], ^^n-c-l ,... ? K kn-c ? kn-c ? kn_c_1] , c < n - 2, (14)
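$$T_{n-2}(k_{n-1}, 0, k_{n-3}, \ldots, [k'_2, k''_2], [k'_1, k''_1], k_0) = [k_{n-1}, k_{n-3}, \ldots, k'_2, k_0, k''_1, k''_2, k'_1], \qquad (15)$$
$$T_{n-1}(0, k_{n-2}, \ldots, k_2, k_1, k_0) = [k_{n-2}, \ldots, k_2, k_1, k_0]. \qquad (16)$$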
Theorem 2. The bank assignment m (11) and traverse order Tc (14), (15) guarantee no memory conflicts for self-sorting FFT processor.
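The transposition of butterfly outputs used in the radix $R$ stages can be illustrated with a short Python sketch. It assumes that $[w', w'']$ denotes the number $w' \cdot q + w''$ with $w' < r$, $w'' < q$, so that the transposed number $[w'', w']$ equals $w'' \cdot r + w'$; the function name `transpose_output` is illustrative.

```python
def transpose_output(w, r, q):
    """Map the output number [w', w''] to [w'', w'] for w' < r, w'' < q."""
    w_hi, w_lo = divmod(w, q)      # w = w' * q + w''
    return w_lo * r + w_hi         # [w'', w'] = w'' * r + w'


r, q = 2, 4                        # R = 8
mapping = [transpose_output(w, r, q) for w in range(r * q)]
assert sorted(mapping) == list(range(r * q))        # it is a permutation
assert all(transpose_output(transpose_output(w, r, q), q, r) == w
           for w in range(r * q))                   # undone by swapping r and q
```

Since the map is a permutation of the $R$ outputs of one butterfly, it only changes the routing between the butterfly unit and the banks and does not add memory accesses.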
5. FFT PROCESSOR UTILIZING SINGLE-PORT MEMORIES
Consider an FFT sampled at $N = r \cdot R^{n-1}$ points using radix $r$ and radix $R$ butterfly operations, i.e. $r_0 = r$, $r_1 = \ldots = r_{n-1} = R$, where $R = r \cdot q$ is even. As shown in section 3, an FFT processor providing significant performance improvements over the pure-radix approach suggested in [1] can be constructed. The AGU can be further modified in order to allow use of $2R$ 1rw (single-port) memory banks without an increase in the overall memory word count. Replacing dual-port memories with single-port memories improves the architecture in terms of area while preserving the performance advantage over [1].
The modified FFT processor has the following architecture (Fig. 4).
The absence of memory conflicts is guaranteed by selecting a memory assignment and traverse order such that read and write operations for memory banks interleave in subsequent clocks. Let
$$m(k_{n-1}, \ldots, k_0) = \left(2\sum_{i=0}^{n-1} k_i - (k_0 \bmod 2)\right) \bmod 2R. \qquad (17)$$
Let the traverse order for stage 0 be
$$T_0(k_{n-1}, \ldots, k_2, k'_1, k''_1) = \left[k_{n-1}, \ldots, k_2, \left(\sum_{i=2}^{n-1} k_i + k'_1 + k''_1 \cdot r\right) \bmod R\right]. \qquad (18)$$
For other stages the trivial traverse order is used:
$$T_c(k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0) = [k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0]. \qquad (19)$$
Fig. 2. Delay of write operations in stages performing reverse
Fig. 3. Architecture for a self-sorting FFT processor
Theorem 3. If the design's pipeline length is odd, the bank assignment m (17) used with traversal orders Tc (18), (19) guarantees no memory conflicts for single-port FFT processor.
Notice that since every butterfly in the radix $r$ stage uses all possible values of $k_0$, while the absence of conflicts in radix $R$ stages is guaranteed by the interleaving of $k_0 \bmod 2$ values for subsequent butterflies, it is required to wait for the pipeline in the radix $r$ stage to finish before launching the first radix $R$ stage.
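The interleaving argument for the radix $R$ stages can be checked with a small brute-force model. The Python sketch below is illustrative only: it assumes the bank assignment (17) reads $m = (2\sum_i k_i - (k_0 \bmod 2)) \bmod 2R$, models a butterfly issued at clock $t$ as reading at $t$ and writing at $t + p$, and covers only the radix $R$ stages with the trivial traverse order. The names `bank17` and `single_port_ok` are hypothetical.

```python
from itertools import product


def bank17(digits, R):
    """Assumed reading of (17) over 2R single-port banks; digits[i] is k_i."""
    return (2 * sum(digits) - digits[0] % 2) % (2 * R)


def single_port_ok(r, q, n, p):
    """Check radix R stages: distinct banks per butterfly, and no bank shared
    between the butterfly reading at clock t and the one writing at clock t."""
    R = r * q
    radices = [r] + [R] * (n - 1)
    for c in range(1, n):
        others = radices[:c] + radices[c + 1:]
        issued = []
        # product() varies its last argument fastest, so k_0 is the fastest digit.
        for msd_first in product(*[range(x) for x in reversed(others)]):
            bf = list(reversed(msd_first))          # butterfly digits, k_0 first
            banks = {bank17(bf[:c] + [kc] + bf[c:], R) for kc in range(R)}
            if len(banks) != R:
                return False
            issued.append(banks)
        if any(issued[t] & issued[t - p] for t in range(p, len(issued))):
            return False
    return True


assert single_port_ok(2, 2, 3, 5)        # even r, odd pipeline length
assert not single_port_ok(2, 2, 3, 4)    # an even pipeline length collides
```

In this model each butterfly occupies all banks of one parity, which is why an odd distance between the reading and the writing butterfly is needed.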
6. SELF-SORTING FFT PROCESSOR WITH SINGLE-PORT MEMORIES
A self-sorting architecture on single-port memories is proposed for an N-point FFT processor. It combines the benefits of the architectures proposed in sections 4 and 5. The proposed processor has the following architecture (Fig. 5).
Consider an FFT sampled at $N = r \cdot R^{n-1}$ points, where $R = r \cdot q$, $n > 3$, and $r$ is even. Let $p$ be the pipeline length and suppose $p$ is odd. The idea is to combine the approaches presented in the above sections: have read/write operations interleave for each memory bank and eliminate the external digit reverse by reversing digits in stages starting from the stage numbered $\frac{n+1}{2}$.
Fig. 4. Architecture for a FFT processor using 1rw memories
Fig. 5. Architecture for a self-sorting FFT processor on single-port memories
Fig. 6. Butterfly batches merging scheme
Since addressing for self-sorting FFT does not differ from plain FFT in stages that don't perform digit reverse, the substantial task is to combine the approaches in the digit-reversing stages. This can be done by grouping butterflies into batches of size $2R$ in a specific manner. In the single-port approach the absence of read/write conflicts is granted by interleaving in some digit $k_i$. Within a stage, one batch of size $2R$ is constructed from two batches of size $R$ covering all values of $k_c$, $k_{n-c-1}$ such that values of $k_i$ interleave between the batches (butterflies from different batches interleave).
Similarly to the self-sorting approach for dual-port memories, outputs of radix $R$ butterflies are transposed: the output numbered $[w', w'']$ is written as if it was numbered $[w'', w']$.
Then, with a pipeline delay of length $2R - 1 - p$, there can be no read/write conflicts and no write-before-read conflicts (by batch construction).
However, it can be proven that in order for this approach to be successful in the last stage for radix 2, the bank assignment must be invariant with respect to switch of the last digit kn_1 and the first digit k0 . The bank assignment used for single-port memories does not comply with this requirement and heavily relies on asymmetry to grant read/write operations interleave. Therefore, a new bank assignment is required (Fig. 6).
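The proposed bank assignment is
$$m(k_{n-1}, \ldots, k_0) = \left(2\sum_{i=1}^{n-2} k_i + 2\left\lfloor \frac{k_0}{2} \right\rfloor + 2\left\lfloor \frac{k_{n-1}}{2} \right\rfloor + (k_0 + k_{n-1}) \bmod 2\right) \bmod R. \qquad (20)$$
It is easy to prove that $m$ is a correct bank assignment (the proof is similar to the one presented in section 3 and is omitted). The traverse orders proposed for the stages are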
Tc (kc ) =
c = 0,
0 < c <
; +1 2
k;_1,..., ^ )modr +
ki
+ 2 • (ki mod r )
; +1
< c < ; _ 1,
[k;_i,..., ki,(k0 + k;_i) modr ]
k;_1 , ... , k;_c+1
k;_c mod r,
k
mrg
r
kmrg mod r,
";_c
r
c = ; _ 1,
k;_c_2, ..., k2, k;_c_1, [k;_2, ..., ^
2 • q
,(k0 + k;_1) mod 2
(21)
+
k1 , k0 mod 2 2q
k1 mod q 2
k1
k0 + — q
mod 2
mod r,
k1 mod 2q,
$$k^c = k_{n-1}, \ldots, k_{c+1}, k_{c-1}, \ldots, k_0, \qquad (22)$$
$$k_{mrg} = (k_1 \bmod (2 \cdot q)) \cdot \frac{r}{2} + \left\lfloor \frac{k_0}{2} \right\rfloor. \qquad (23)$$
Theorem 4. For an FFT sampled at $N = r \cdot R^{n-1}$ points, where $r$ is even, if the pipeline length $p$ is odd, the single-port self-sorting FFT processor with pipeline delays postponing writes for $2R - p - 1$ clocks, with bank assignment (20) and traverse order $T_c$ (21), has no memory conflicts.
7. RESULTS
In this paper we generalized Johnson's approach [1] and considered not only conflict-free bank assignment, but also the butterfly traverse order within a stage. We proposed a new parameterized conflict-free bank assignment generalizing the previous results on relatively prime mixed radix [2] and multiple butterflies per clock [3].
Using these results, we considered four new FFT architecture modifications supporting run-time change of transform length (up to an implementation-dependent maximum length) and direction. Correctness of the address assignments used is proven. For the architecture of an FFT processor with dual-port memories (section 3), a High Level Synthesizable (HLS) SystemC model was created. The RTL obtained has reasonable area characteristics compared to commercially available cores, showing that the architecture can be effectively used to create actual designs. The results of gate-level synthesis show that the AGU takes negligible area and power compared to the RAM and PU. For the other architectures only reference models were developed.
For the traditional dual-port memory architecture we considered a modification for running multiple butterflies per clock for radixes other than 2/4. It substantially improves architecture performance if the small and big radixes differ greatly.
We also considered a mixed-radix self-sorting architecture. At the cost of a number of pipeline stages equal to the radix in the butterfly processing unit, it allows simpler integration with other computations as part of a reconfigurable DSP and allows the use of overlapped I/O as a stand-alone block, improving the initiation interval by up to 30%.
We considered single-port memory in-place architectures. The basic single-port architecture allows using the in-place strategy for libraries without dual-port memories, effectively reducing memory area by 50% with the modest requirement of an odd pipeline length of the butterfly processing unit. We also considered the hybrid architecture combining both self-sorting and single-port memory. It provides the benefits of both approaches at the cost of $(2 \cdot \text{radix} - 1)$ pipeline stages.
8. SUMMARY
The architecture under consideration is one of the common architectures for computing long FFTs. The novel algorithms are of practical interest as they enlarge the possible design space by providing new area/performance trade-offs for practically useful scenarios such as the latest OFDM-based protocols for ground cable networks and 4G wireless.
The approach can be applied to building architectures for non-$2^n$ lengths if that is of practical interest. The algorithms aren't unique and are defined up to a transposition of some digits.
Only one of the developed architectures was implemented in RTL, so practical implementation of the other architectures is still required. It is possible that synthesis will call for some modifications of the algorithms to make them more hardware-friendly.
Bibliography/References
1. Johnson L. G. Conflict free memory addressing for dedicated FFT hardware // Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on, May 1992. Vol. 39, No. 5. P. 312-316.
2. Hsiao C.-F., Chen Y., Lee C.-Y. A Generalized Mixed-Radix Algorithm for Memory-Based FFT Processors // Circuits and Systems II: Express Briefs, IEEE Transactions on, Jan. 2010. Vol. 57, No. 1. P. 26-30.
3. Jo B. G., Sunwoo M. H. New continuous-flow mixed-radix (CFMR) FFT Processor using novel in-place strategy // Circuits and Systems I: Regular Papers, IEEE Transactions on, May 2005. Vol. 52, No. 5. P. 911-919.
4. http://www.xilinx.com/support/documentation/ip_documentation/ds808_xfft.pdf (accessed: 29.04.2013).
5. http://www.altera.com/literature/ug/ug_fft.pdf (accessed: 29.04.2013).
6. Hegland M. A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing // Numerische Mathematik, 1994. Vol. 68, No. 4. P. 507-547.
APPENDIX
Theorem 1. Suppose the bank assignment
$$m(k_{n-1}, \ldots, k_0) = \left(\sum_{i=1}^{n-1} k_i + q k_0\right) \bmod R \qquad (24)$$
is used with the trivial traverse order
$$T_c(k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0) = [k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0]. \qquad (25)$$
Then there are no conflicts in the dual-port FFT processor. See section 3 for the processor's description.
Proof: For a radix $R$ stage numbered $c$, conflicts can only occur between wings of one butterfly. Suppose a conflict occurred on wings $k_c$, $\tilde{k}_c$, i.e.
$$m(k_{n-1}, \ldots, k_{c+1}, k_c, k_{c-1}, \ldots, k_0) = m(k_{n-1}, \ldots, k_{c+1}, \tilde{k}_c, k_{c-1}, \ldots, k_0). \qquad (26)$$
From the definition of $m$ it means $k_c \equiv \tilde{k}_c \pmod R$, which leads to $k_c = \tilde{k}_c$, since $k_c, \tilde{k}_c < R$. So conflicts in radix $R$ stages are impossible. It can be shown in the same manner that in the radix $r$ stage there are no conflicts within one butterfly.
Suppose 2 butterflies in the same butterfly group in the radix $r$ stage have a conflict, i.e.
$$m(k_{n-1}, \ldots, k_2, k'_1 \cdot q + k''_1, k_0) = m(k_{n-1}, \ldots, k_2, k'_1 \cdot q + \tilde{k}''_1, \tilde{k}_0), \qquad (27)$$
where $k''_1, \tilde{k}''_1 < q$. From the definition of $m$ it means $k''_1 + k_0 \cdot q \equiv \tilde{k}''_1 + \tilde{k}_0 \cdot q \pmod R$. Since $k_0, \tilde{k}_0 < r$, it leads to $k''_1 = \tilde{k}''_1$, $k_0 = \tilde{k}_0$; therefore, a conflict is impossible. ■
Theorem 2. Suppose the following bank assignment and traverse order are used:
$$m(k_{n-1}, \ldots, k_0) = \left(\sum_{i=1}^{n-1} k_i + q k_0\right) \bmod R, \qquad (28)$$
T (k ,к ,,,0,k ,,...,\k ,k l.ik ,,к ,],k тк,,k,,£„) =
^ я-P ' c+P ' c—P ' L и- c' n—c—P n—c—1J' n—c—' P P 0^
= rk ,,..., k +,,k ,,..., [k ,^l.ik,k .1,k k,,k ,k ,k .1, c< n_2, (29)
n i c+i c i n c 0 i n c i n c 2 i n c n c n c i
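$$T_{n-2}(k_{n-1}, 0, k_{n-3}, \ldots, [k'_2, k''_2], [k'_1, k''_1], k_0) = [k_{n-1}, k_{n-3}, \ldots, k'_2, k_0, k''_1, k''_2, k'_1], \qquad (30)$$
$$T_{n-1}(0, k_{n-2}, \ldots, k_2, k_1, k_0) = [k_{n-2}, \ldots, k_2, k_1, k_0]. \qquad (31)$$
Then there are no memory conflicts in the self-sorting FFT processor. See section 4 for the processor's description.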
Proof: $T_c$ is a transposition of digits in the butterfly number, therefore it can be used as a traverse order.
Parts of every digit $k_i = [k'_i, k''_i]$ are permuted with the symmetric parts of digits $k_{n-i}$ and $k_{n-i-1}$, which results in a digit reverse for $k$ considered as a number constructed of digits $[k'_i, k''_i]$. Since the original decomposition reverses the digits $k_i$ and the transposition of butterfly outputs restores the order of $[k'_i, k''_i]$, the input and output have the same digit order.
With this approach, stages performing reverses are not in-place, therefore it must be ensured that during the stage computation a memory location is written only after it has been read by a butterfly. For each stage performing a reverse, the correct order of read/write operations is guaranteed by reordering butterflies within the stage so that all butterflies with coinciding values of $k_{n-1}, \ldots, k_{c+1}, k_{c-1}, \ldots, k'_{n-c}, k''_{n-c-1}, k_{n-c-2}, \ldots, k_0$ are executed sequentially in one batch, and by adding pipeline delays postponing write operations for $R - p$ clocks, where $p$ is the pipeline length. Since write operations of butterflies from one batch can only corrupt values read in the same batch and the butterfly loop is pipelined, these measures are enough to grant the correct read/write order.
The bank and address assignments used are the same as in section 3. Since in terms of addressing only the butterfly traverse order is modified, there are no memory conflicts (see Theorem 1). ■
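Theorem 3. Suppose the design's pipeline length is odd and the bank assignment
$$m(k_{n-1}, \ldots, k_0) = \left(2\sum_{i=0}^{n-1} k_i - (k_0 \bmod 2)\right) \bmod 2R \qquad (32)$$
is used with the traverse orders
$$T_0(k_{n-1}, \ldots, k_2, k'_1, k''_1) = \left[k_{n-1}, \ldots, k_2, \left(\sum_{i=2}^{n-1} k_i + k'_1 + k''_1 \cdot r\right) \bmod R\right], \qquad (33)$$
$$T_c(k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0) = [k_{n-1}, \ldots, k_{c+2}, k'_{c+1}, k''_{c+1}, k_{c-1}, \ldots, k_0], \quad c \ne 0. \qquad (34)$$
Then there are no memory conflicts in the single-port FFT processor. See section 5 for the processor's description.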
Proof: For a radix $R$ stage numbered $c$, memory conflicts can occur between read operations of different wings of one butterfly, write operations of different wings of one butterfly, or a write operation of a butterfly and a read operation of some subsequent butterfly. Read/write conflicts within one butterfly on wings $k_c$, $\tilde{k}_c$ would mean
$$2\sum_{i=0, i \ne c}^{n-1} k_i + 2k_c - (k_0 \bmod 2) \equiv 2\sum_{i=0, i \ne c}^{n-1} k_i + 2\tilde{k}_c - (k_0 \bmod 2) \pmod{2R}, \qquad (35)$$
which implies
$$k_c \equiv \tilde{k}_c \pmod R, \text{ i.e. } k_c = \tilde{k}_c, \qquad (36)$$
so conflicts within one radix $R$ butterfly are impossible.
Since the trivial traverse order is used in radix $R$ stages and $r$ is even, values of $k_0 \bmod 2$ interleave for subsequent butterflies. With the pipeline having odd length, this guarantees that any 2 butterflies that have read and write operations within the same clock have different parity of $k_0$ and therefore use banks with different parity, so there are no conflicts on wings of different butterflies in radix $R$ stages. The above reasoning holds when the butterflies are from different radix $R$ stages as well.
For the radix $r$ stage, consider the memory bank assignment for an arbitrary wing of an arbitrary butterfly:
$$m(I_0(T_0(k_{n-1}, \ldots, k_2, k'_1, k''_1), k_0)) = \left(4\sum_{i=2}^{n-1} k_i + 2k'_1 + 2k''_1 \cdot r + 2k_0 - (k_0 \bmod 2)\right) \bmod 2R = \left(4\left(\sum_{i=2}^{n-1} k_i + \left\lfloor \frac{k_0}{2} \right\rfloor + \frac{r}{2} k''_1 + \left\lfloor \frac{k'_1}{2} \right\rfloor\right) + 2(k'_1 \bmod 2) + k_0 \bmod 2\right) \bmod 2R. \qquad (37)$$
Points used in butterflies from one group have coinciding values of $k_{n-1}, \ldots, k_2, k'_1$ and differ only in $k''_1, k_0$. Since $k''_1 \le q - 1$ and $\left\lfloor \frac{k_0}{2} \right\rfloor \le \frac{r}{2} - 1$, it is enough to consider
$$m_0 = 4\left\lfloor \frac{k_0}{2} \right\rfloor + 2r \cdot k''_1 + k_0 \bmod 2. \qquad (38)$$
Since $4\left\lfloor \frac{k_0}{2} \right\rfloor < 2r$, values of $m_0$ coincide only for coinciding values of $k''_1, k_0$. Hence there are no conflicts within one butterfly group.
Values of $k'_1 \bmod 2$ interleave for subsequent butterfly groups. With the pipeline having odd length, this guarantees that any 2 butterfly groups that have read and write operations within the same clock have different parity of $k'_1$ and therefore use banks with a different second bit in the radix 2 representation of the bank's number. Hence there are no conflicts on wings of butterflies from different butterfly groups in the radix $r$ stage. ■
Theorem 4. For an FFT sampled at $N = r \cdot R^{n-1}$ points, where $r$ is even, if the pipeline length $p$ is odd, the single-port self-sorting FFT processor with a pipeline delay postponing writes for $2R - p - 1$ clocks, with the bank assignment
$$m(k_{n-1}, \ldots, k_0) = \left(2\sum_{i=1}^{n-2} k_i + 2\left\lfloor \frac{k_0}{2} \right\rfloor + 2\left\lfloor \frac{k_{n-1}}{2} \right\rfloor + (k_0 + k_{n-1}) \bmod 2\right) \bmod R \qquad (39)$$
and the following traverse order, has no memory conflicts. See section 6 for the processor's description.
Tc (kc ) =
c = 0,
0 < c <
n +1 2
n +1
< c < n -1,
c = n -1,
I n-k
^ (X jmodr +
^п-^..^ k1,(k0 + k„-1)modr]
^i^ kn-c+1,[k„-cmodr,
kn-c-2, ..., k2, kn-c-1, ^T^
2 • q
[kn-2,..., ^
mrg
+ 2 • (kL mod r )
kmrgmodr,
(k0 + kn-1) mod 2
k1 k0 mod 2 2q
k1 mod q 2
» k1
k0 + — q
V /
mod 2
modr ,
k1 mod 2q
(40)
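$$k^c = k_{n-1}, \ldots, k_{c+1}, k_{c-1}, \ldots, k_0, \qquad (41)$$
$$k_{mrg} = (k_1 \bmod (2 \cdot q)) \cdot \frac{r}{2} + \left\lfloor \frac{k_0}{2} \right\rfloor. \qquad (42)$$
Proof: Notice that write-before-read conflicts are impossible for in-place stages (numbered $0..\frac{n+1}{2} - 1$).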
Consider the stage numbered 0. The bank assignment for the butterfly executed at iteration $k_{n-1}, \ldots, k_1$ is:
i(T)(k0), k0) = 4
Xn^i
kn-1 + + Tl \
2 2 r
+ 2(kjmodr) + (k0 + kn-1)mod2 . (43)
Since the pipeline length is odd and subsequent butterflies have interleaving values of k1 mod 2 , there are no read/write conflicts.
Consider the bank assignment for the stage numbered $c$, where $0 < c < \frac{n+1}{2}$:
m (Ic (Tc (kc ), kc )) = 2
Xi?* + 2
2
+ (k0 mod 2).
(44)
Subsequent butterflies have interleaving values of k0 mod 2. Since pipeline length is odd, there are no read/write conflicts in stage c.
Consider the bank assignment for the stage numbered $c$, where $\frac{n+1}{2} \le c < n - 1$:
m (Ic (Tc (kc ), kc )) = 2
Xn-c-K X""1 n-1 i
„ k* +> kr i=2 1 ^i=n-c+1 1
2
kn-c mod r,
mrg
kmrg mod r,
ki kn-1 \
+2 + 2
2 • q _ 2 0
+ k0 mod 2.
(45)
Subsequent butterflies have interleaving values of $k_0 \bmod 2$, so there are no read/write conflicts. By replacing $k_{n-c-1}$ with $\left[k_{mrg} \bmod r, \left\lfloor \frac{k_{n-c-1}}{r} \right\rfloor\right]$ and $k_{n-c}$ with $\left[k_{n-c} \bmod r, \left\lfloor \frac{k_{mrg}}{r} \right\rfloor\right]$, the traverse order builds size $2R$ butterfly batches covering all values of the digits to be swapped by the reverse. Since the first and the last butterflies originate from different size $R$ batches, a pipeline delay of length $2R - p - 1$ is enough to guarantee no write-before-read conflicts.
Consider bank assignment for stage n - 1:
Tn_l(k"~1)) = 4
kL 2q
■ q +
k modq 2
k^mod 2q,
■1/2
+ 2(k0 mod 2) +
kn_1 +
kl mod 2q,
mod 2.
(46)
Subsequent butterflies have interleaving values of k0 mod 2, so there are no read/write conflicts. The above proof of absence of write before read conflicts holds. ■
NEW ALGORITHMS FOR CONTINUOUS-FLOW MIXED-RADIX IN-PLACE FFT COMPUTATION ON MULTI-BANK RANDOM-ACCESS MEMORY
Abstract
The paper considers a method of implementing continuous-flow mixed-radix FFT computation on multi-bank memory with additional constraints. Based on this method, new hardware architectures for FFT computation are proposed. Parallel computation of butterflies in the stages with the smaller radix substantially speeds up mixed-radix computation. The architecture based on single-port memory makes it possible to implement an in-place computation strategy on cell libraries without multi-port memory, halving the memory used. The self-sorting architecture allows overlapped data load and unload operations, reducing the computation latency by up to 30%. An architecture combining both of these properties is also considered.
Keywords: continuous-flow FFT, mixed-radix FFT, in-place FFT, self-sorting FFT.
Sergey Igorevich Salishev, senior lecturer at the Department of Informatics, St. Petersburg State University (SPbSU), engineer at the Intel laboratory, sergey.i.salishev@gmail.com,
Roman Evgenievich Shein, postgraduate student at the Department of System Programming, Faculty of Mathematics and Mechanics, SPbSU, marso.des@gmail.com.
© Our authors, 2013.