128-pt FFT on FPGA

Radix-2 DIT Cooley-Tukey FFT using DSP48 IP blocks · Basys3 · FPGA Communications Course
FPGA DSP48 RESOURCE_SHARING SystemVerilog Vivado_2022 Basys3 Q12.6_fixed_point 103.659_MHz 64_DSPs 128_point Radix-2_DIT
[tldr]
64 DSPs used (71% of the Basys3's 90), down from >90 in the naive fully parallel approach. Achieved via sequential 8-point stage multiplexing.
2566 clock cycles to compute the 128-point FFT (25.66 µs at 100 MHz). The done signal goes high at 25.995 µs in Vivado behavioral simulation, the computation latency plus the simulation's start-up interval.
103.659 MHz maximum clock frequency, computed as 1/(10 ns − 0.353 ns) from the 0.353 ns worst slack. Timing constraints fully met.
[context_and_constraint]
[course constraint] why DSP48 for every multiplication?

This was an FPGA communications course project with a hard constraint: every multiplication must use a DSP48 IP block, not LUT-based multipliers. The Basys3 board (Artix-7 XC7A35T) has exactly 90 DSP48E1 slices available.

why DSP48 for multiplication on FPGA
DSP48 slices are dedicated hardened arithmetic blocks on Xilinx FPGAs: each contains a pre-adder, a 25×18-bit multiplier, and a 48-bit accumulator, all running at high clock speeds without consuming LUT fabric. Using them for multiplication is standard FPGA design practice for signal-processing workloads. The course constraint enforces this discipline explicitly.

The input representation is Q12.6 fixed-point — 18-bit words with 12 bits for the integer part and 6 bits for the fractional part. All inputs are pre-scaled by 64 (left shift by 6) before being fed in, and outputs are interpreted as 64× the actual values. This matches DSP48's 18-bit input port width exactly.

[background] 128-point Radix-2 DIT FFT

A Fast Fourier Transform (FFT) computes the Discrete Fourier Transform in O(N log N) rather than O(N²). The Cooley-Tukey Radix-2 DIT algorithm achieves this by recursively splitting a DFT of size N into two DFTs of size N/2 — one for even-indexed inputs, one for odd-indexed inputs — down to 2-point butterfly stages.

DFT naive:   O(N²)      = 128²    = 16,384 operations
FFT radix-2: O(N log N) = 128 × 7 = 896 operations

Butterfly stage (2-point FFT):
──────────────────────────────
X[k]     = X_e[k] + W_N^k · X_o[k]
X[k+N/2] = X_e[k] − W_N^k · X_o[k]

where W_N^k = e^(−j2πk/N) is the twiddle factor.

Each butterfly: 1 complex multiplication + 2 complex additions; the DSP48 handles the complex multiply.

For a 128-point FFT: log₂(128) = 7 butterfly stages, each stage processing 64 butterflies. Each complex multiplication uses 4 real DSP48 multiplications.
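The recursive even/odd split above can be sketched as a short Python reference model (a stand-in analogous to the NumPy reference used for validation later, not the SystemVerilog implementation):

```python
import cmath

def fft_radix2_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2_dit(x[0::2])   # DFT of even-indexed samples
    odd = fft_radix2_dit(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]           # butterfly top output
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly bottom output
    return out
```

An impulse input produces a flat spectrum, which makes a quick sanity check.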

[approach — resource_sharing]
Approach 1 — Naive Parallel ✗
→ Exceeds Basys3 DSP limit. Cannot implement.
Approach 2 — DSP Multiplexing ✓
→ Fits on Basys3. Timing met. Correct outputs.
Why 8-point is the critical stage size
DSP count analysis
An N-point FFT butterfly stage requires N/2 butterfly units, each using 4 real DSP48 multiplications for complex twiddle factor multiplication.

8-point FFT: log₂(8) = 3 stages × 4 butterflies = 12 butterflies; 12 butterflies × 4 real DSP multiplications each = 48 DSPs total

48 DSPs < 90 (Basys3 limit) → fits. A fully parallel 16-point FFT would need 4 stages × 8 butterflies × 4 DSPs = 128 > 90 → doesn't fit. So 8-point is the largest stage that can run fully in parallel.

Once the 8-point stage is established as the building block, every larger transform is implemented as sequential 8-point stages with butterfly operations on their combined outputs:

16-pt  = [8-pt stage 1]  → done → [8-pt stage 2]  → butterfly → outputs (+4 DSPs)
32-pt  = [16-pt stage 1] → done → [16-pt stage 2] → butterfly → outputs (+4 DSPs)
64-pt  = [32-pt stage 1] → done → [32-pt stage 2] → butterfly → outputs (+4 DSPs)
128-pt = [64-pt stage 1] → done → [64-pt stage 2] → butterfly → outputs (+4 DSPs)

Total DSPs = 48 (8-pt core) + 4 × 4 (butterfly stages) = 64 DSPs
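The DSP budget arithmetic can be checked in a few lines of Python; the constants mirror the counts quoted above, and the function names are illustrative:

```python
import math

DSP_LIMIT = 90          # DSP48E1 slices on the Artix-7 XC7A35T
DSPS_PER_BUTTERFLY = 4  # one complex multiply = 4 real multiplies

def parallel_dsp_count(n):
    """DSPs for a fully parallel n-point FFT: log2(n) stages of n/2 butterflies."""
    stages = int(math.log2(n))
    return stages * (n // 2) * DSPS_PER_BUTTERFLY

def shared_dsp_count(n, core=8):
    """DSPs for the sequential scheme: one fully parallel core plus one
    recombination butterfly stage per doubling above the core size."""
    doublings = int(math.log2(n // core))
    return parallel_dsp_count(core) + doublings * DSPS_PER_BUTTERFLY
```

This reproduces the numbers above: 48 DSPs for the parallel 8-point core, 128 (> 90) for a parallel 16-point, and 64 for the shared 128-point design.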
done signal chaining
Each stage generates a done signal when computation completes. This signal triggers the next stage's start and switches the input MUX to feed the second half of the data. No clock-cycle counting or external control needed — the design self-sequences through all stages.
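The latency implied by this chaining can be modeled in a few lines: each doubling runs its sub-FFT twice (the second run triggered by the first run's done signal), then one recombination butterfly pass. The per-stage cycle counts are not broken out in this writeup, so any concrete arguments are purely illustrative:

```python
def total_cycles(core_cycles, butterfly_cycles, doublings):
    """Latency model of the done-chained design: per doubling, the
    sub-FFT runs twice back-to-back, followed by a butterfly pass."""
    t = core_cycles
    for _ in range(doublings):
        t = 2 * t + butterfly_cycles   # two sequential sub-runs + recombination
    return t
```

With the design's real per-stage counts this recurrence would compose to the reported 2566-cycle total for the 128-point transform (4 doublings above the 8-point core).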
[implementation_details]
Fixed-point representation — Q12.6

DSP48E1 inputs are 18-bit signed integers. To represent fractional twiddle factors with reasonable precision:

Q12.6 format (18-bit total):
┌────────────────────┬──────────┐
│ 12 bits integer    │ 6 bits   │
│ (MSB ... bit 6)    │ fraction │
└────────────────────┴──────────┘

Scaling: all inputs × 64 before feeding in
         all outputs ÷ 64 after reading out

Example: twiddle factor W_128^1 = e^(−j2π/128), real part cos(2π/128) ≈ 0.9988
Q12.6 representation: round(0.9988 × 64) = 64 → stored as 64
After output scaling: result ÷ 64 recovers the actual value
precision tradeoff
Q12.6 gives 6 bits of fractional precision — sufficient for signal reconstruction but introduces quantization error vs Python's float64. Relative errors in outputs are <2% for most frequency bins, rising to ~5% at near-zero magnitude bins where small absolute errors become large relative errors. This is expected and documented.
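The quantization round-trip can be sketched in Python (helper names are illustrative, not from the design):

```python
import math

def to_q12_6(x):
    """Quantize a real value to Q12.6 (18-bit signed, 6 fractional bits)."""
    q = round(x * 64)
    assert -(1 << 17) <= q < (1 << 17), "outside 18-bit signed range"
    return q

def from_q12_6(q):
    """Recover the approximate real value from a Q12.6 integer."""
    return q / 64

w = math.cos(2 * math.pi / 128)   # real part of twiddle W_128^1, ~0.9988
q = to_q12_6(w)                   # stored as the integer 64
err = abs(from_q12_6(q) - w)      # quantization error, bounded by 2^-7
```

The worst-case round-trip error is half an LSB, i.e. 2⁻⁷ ≈ 0.0078, which is the source of the small absolute errors discussed in the validation section.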
Resource utilization breakdown
| Resource | Used  | Available | Utilization % |
|----------|-------|-----------|---------------|
| LUT      | 12742 | 20800     | 61.26%        |
| LUTRAM   | 189   | 9600      | 1.97%         |
| FF       | 30867 | 41600     | 74.20%        |
| BRAM     | 3     | 50        | 6.00%         |
| DSP      | 64    | 90        | 71.11%        |
| IO       | 2     | 106       | 1.89%         |

DSP utilization at 71% leaves comfortable margin. LUT and FF usage is higher due to control logic for stage sequencing, MUX switching, and twiddle factor ROM storage. BRAM holds intermediate stage outputs between sequential 8-point FFT runs.

Timing analysis
| Parameter                     | Value              |
|-------------------------------|--------------------|
| Clock period                  | 10 ns (100 MHz)    |
| Worst Negative Slack (WNS)    | 0.353 ns           |
| Total Negative Slack (TNS)    | 0.000 ns           |
| Max achievable frequency      | 103.659 MHz        |
| Computation latency           | 2566 clock cycles  |
| End-to-end latency @ 100 MHz  | 25.66 µs           |
critical path
The critical path runs through the DSP48 multiplier chain in the butterfly stage, specifically through the accumulator feedback path in the complex multiply-accumulate operation. The positive WNS of 0.353 ns means the design could be clocked at up to 103.659 MHz if needed. All timing constraints are met with positive slack.
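The headline timing numbers follow directly from the slack report, as this small check shows:

```python
clock_period_ns = 10.0    # 100 MHz constrained clock
wns_ns = 0.353            # worst slack reported by Vivado (positive = met)
cycles = 2566             # computation latency in clock cycles

fmax_mhz = 1e3 / (clock_period_ns - wns_ns)   # 1/(10 ns - 0.353 ns), ~103.659 MHz
latency_us = cycles * clock_period_ns / 1e3   # 25.66 µs at 100 MHz
```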
[results_and_validation]
Output validation — FPGA vs Python

Outputs validated at two levels: behavioral simulation (Vivado) and hardware (ILA on physical Basys3). Both compared against Python NumPy FFT with identical inputs.

Behavioral simulation
The done signal goes high at 25.995 µs, consistent with the 2566-cycle computation latency at 100 MHz (25.66 µs) plus the simulation's start-up interval. Real and imaginary parts match Python outputs with <2% relative error across most frequency bins. Outputs are displayed in Q12.6 decimal notation in the Vivado waveform viewer.
Hardware validation (ILA)
Outputs captured via Integrated Logic Analyzer on physical Basys3 board. FFT magnitude spectrum matches Python reference. Higher relative errors (~40%) observed at near-zero magnitude bins — expected behavior from fixed-point quantization where small absolute errors inflate the relative metric.
why relative error spikes at certain bins
Relative error = |FPGA - Python| / |Python|. When Python's output is near zero (a frequency component not present in the input signal), even a tiny absolute quantization error from Q12.6 fixed-point becomes a large relative error. This is a precision representation artifact, not a computation error. The absolute errors are uniformly small (<0.5 magnitude units) across all bins.
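The effect is easy to demonstrate with hypothetical magnitudes that share the same absolute error:

```python
def rel_err(measured, reference):
    """Relative error; blows up as the reference magnitude approaches zero."""
    return abs(measured - reference) / abs(reference)

# Two hypothetical bins with an identical absolute error of 0.2:
strong_bin = rel_err(128.2, 128.0)   # ~0.16%: looks very accurate
weak_bin = rel_err(0.7, 0.5)         # 40%: same absolute error, inflated metric
```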
test signal
Input signal: f(x) = 2·sin(x) + sin(10x), two sinusoids at relative frequencies 1 and 10. Sampled over one period of the fundamental, they land in frequency bins 1 and 10 (plus their mirror bins), and the FFT correctly shows peaks there in both real and imaginary spectra, matching the Python reference plot exactly in shape.
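As a cross-check of the expected spectrum, a plain-Python DFT of this test signal (assuming one period of the fundamental across the 128 samples) shows exactly the two peaks:

```python
import cmath
import math

N = 128
x = [2 * math.sin(2 * math.pi * n / N) + math.sin(2 * math.pi * 10 * n / N)
     for n in range(N)]

def dft(seq):
    """Direct O(N^2) DFT, adequate for a one-off check."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

mags = [abs(v) for v in dft(x)]
# 2*sin at bin 1 gives |X[1]| = N = 128; sin at bin 10 gives |X[10]| = N/2 = 64
# (with mirror peaks at bins 127 and 118); all other bins are ~0.
```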
64 DSPs used
2566 clock cycles
103.7 MHz max clock
<2% relative error
[future_work]
⚠ [work_in_progress]  A redesigned version without the DSP constraint is in progress — targeting a fully pipelined architecture with improved resource efficiency and lower latency. Details will be added here when complete.
What changes without the DSP constraint

The DSP constraint was a course requirement — not a design choice. Removing it opens up significantly better architectures:

without DSP constraint
LUT-based multipliers synthesized by the tool can be freely placed and pipelined. This enables a fully pipelined butterfly architecture where all stages execute concurrently, with one new sample entering and one output leaving every clock cycle. Latency drops from 2566 cycles to a pipeline depth on the order of the log₂(N) = 7 butterfly stages (plus internal register stages), and throughput rises from one FFT per 2566 cycles to one output sample per cycle once the pipeline is filled.

The sequential 8-point stage approach was a workaround for the 90-DSP hard limit. A proper streaming FFT architecture would replace the stage-done-signal control logic with a deep pipeline and eliminate the BRAM intermediate storage entirely.

Target metrics for the redesigned version: <20 cycles pipeline depth, >200 MHz clock, <1% LUT utilization for the butterfly network.