128-pt FFT on FPGA

Radix-2 DIT Cooley-Tukey FFT using DSP48 IP blocks · Basys3 · FPGA Communications Course
FPGA DSP48 RESOURCE_SHARING SystemVerilog Vivado_2022 Basys3 Q12.6_fixed_point 103.659_MHz 64_DSPs 128_point Radix-2_DIT
[tldr]
64 DSPs used (71% of the Basys3's 90), down from >90 in the naive fully parallel approach. Achieved via sequential 8-point stage multiplexing.
2566 clock cycles to compute the 128-point FFT (25.66 µs at 100 MHz). The done signal goes high at 25.995 µs in Vivado behavioral simulation, the computation latency plus the simulation's start-up interval.
103.659 MHz maximum clock frequency, computed as 1/(10 ns − 0.353 ns) from the 0.353 ns worst slack. Timing constraints fully met.
[context_and_constraint]
[course constraint] why DSP48 for every multiplication?

This was an FPGA communications course project with a hard constraint: every multiplication must use a DSP48 IP block, not LUT-based multipliers. The Basys3 board (Artix-7 XC7A35T) has exactly 90 DSP48E1 slices available.

why DSP48 for multiplication on FPGA
DSP48 slices are dedicated hardened arithmetic blocks on Xilinx FPGAs: each contains a pre-adder, a 25×18-bit multiplier, and a 48-bit accumulator, all running at high clock speeds without consuming LUT fabric. Using them for multiplication is standard FPGA design practice for signal-processing workloads. The course constraint enforces this discipline explicitly.

The input representation is Q12.6 fixed-point — 18-bit words with 12 bits for the integer part and 6 bits for the fractional part. All inputs are pre-scaled by 64 (left shift by 6) before being fed in, and outputs are interpreted as 64× the actual values. This matches DSP48's 18-bit input port width exactly.

[background] 128-point Radix-2 DIT FFT

A Fast Fourier Transform (FFT) computes the Discrete Fourier Transform in O(N log N) rather than O(N²). The Cooley-Tukey Radix-2 DIT algorithm achieves this by recursively splitting a DFT of size N into two DFTs of size N/2 — one for even-indexed inputs, one for odd-indexed inputs — down to 2-point butterfly stages.

DFT naive:   O(N²)      = 128²    = 16,384 operations
FFT radix-2: O(N log N) = 128 × 7 = 896 operations

Butterfly stage (2-point FFT):
──────────────────────────────
X[k]     = X_e[k] + W_N^k · X_o[k]
X[k+N/2] = X_e[k] − W_N^k · X_o[k]

where W_N^k = e^(−j2πk/N) is the twiddle factor.

Each butterfly: 1 complex multiplication + 2 complex additions; the DSP48 handles the complex multiply.

For a 128-point FFT: log₂(128) = 7 butterfly stages, each stage processing 64 butterflies. Each complex multiplication uses 4 real DSP48 multiplications.
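The recursive even/odd split above can be sketched as a short Python reference model (a stand-in analogous to the NumPy reference used for validation later, not the SystemVerilog implementation):

```python
import cmath

def fft_radix2_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2_dit(x[0::2])   # DFT of even-indexed samples
    odd = fft_radix2_dit(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]           # butterfly top output
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly bottom output
    return out
```

An impulse input produces a flat spectrum, which makes a quick sanity check.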

[approach — resource_sharing]
Approach 1 — Naive Parallel ✗
→ Exceeds Basys3 DSP limit. Cannot implement.
Approach 2 — DSP Multiplexing ✓
→ Fits on Basys3. Timing met. Correct outputs.
Why 8-point is the critical stage size
DSP count analysis
An N-point FFT butterfly stage requires N/2 butterfly units, each using 4 real DSP48 multiplications for complex twiddle factor multiplication.

8-point FFT: log₂(8) = 3 stages × 4 butterflies = 12 butterflies; 12 butterflies × 4 real DSP multiplications each = 48 DSPs total

48 DSPs < 90 (Basys3 limit) → fits. A fully parallel 16-point FFT would need 4 stages × 8 butterflies × 4 DSPs = 128 > 90 → doesn't fit. So 8-point is the largest stage that can run fully in parallel.

Once the 8-point stage is established as the building block, every larger transform is implemented as sequential 8-point stages with butterfly operations on their combined outputs:

16-pt  = [8-pt stage 1]  → done → [8-pt stage 2]  → butterfly → outputs (+4 DSPs)
32-pt  = [16-pt stage 1] → done → [16-pt stage 2] → butterfly → outputs (+4 DSPs)
64-pt  = [32-pt stage 1] → done → [32-pt stage 2] → butterfly → outputs (+4 DSPs)
128-pt = [64-pt stage 1] → done → [64-pt stage 2] → butterfly → outputs (+4 DSPs)

Total DSPs = 48 (8-pt core) + 4 × 4 (butterfly stages) = 64 DSPs
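The DSP budget arithmetic can be checked in a few lines of Python; the constants mirror the counts quoted above, and the function names are illustrative:

```python
import math

DSP_LIMIT = 90          # DSP48E1 slices on the Artix-7 XC7A35T
DSPS_PER_BUTTERFLY = 4  # one complex multiply = 4 real multiplies

def parallel_dsp_count(n):
    """DSPs for a fully parallel n-point FFT: log2(n) stages of n/2 butterflies."""
    stages = int(math.log2(n))
    return stages * (n // 2) * DSPS_PER_BUTTERFLY

def shared_dsp_count(n, core=8):
    """DSPs for the sequential scheme: one fully parallel core plus one
    recombination butterfly stage per doubling above the core size."""
    doublings = int(math.log2(n // core))
    return parallel_dsp_count(core) + doublings * DSPS_PER_BUTTERFLY
```

This reproduces the numbers above: 48 DSPs for the parallel 8-point core, 128 (> 90) for a parallel 16-point, and 64 for the shared 128-point design.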
done signal chaining
Each stage generates a done signal when computation completes. This signal triggers the next stage's start and switches the input MUX to feed the second half of the data. No clock-cycle counting or external control needed — the design self-sequences through all stages.
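The latency implied by this chaining can be modeled in a few lines: each doubling runs its sub-FFT twice (the second run triggered by the first run's done signal), then one recombination butterfly pass. The per-stage cycle counts are not broken out in this writeup, so any concrete arguments are purely illustrative:

```python
def total_cycles(core_cycles, butterfly_cycles, doublings):
    """Latency model of the done-chained design: per doubling, the
    sub-FFT runs twice back-to-back, followed by a butterfly pass."""
    t = core_cycles
    for _ in range(doublings):
        t = 2 * t + butterfly_cycles   # two sequential sub-runs + recombination
    return t
```

With the design's real per-stage counts this recurrence would compose to the reported 2566-cycle total for the 128-point transform (4 doublings above the 8-point core).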
[implementation_details]
Fixed-point representation — Q12.6

DSP48E1 inputs are 18-bit signed integers. To represent fractional twiddle factors with reasonable precision:

Q12.6 format (18-bit total):
┌────────────────────┬──────────┐
│ 12 bits integer    │ 6 bits   │
│ (MSB ... bit 6)    │ fraction │
└────────────────────┴──────────┘

Scaling: all inputs × 64 before feeding in
         all outputs ÷ 64 after reading out

Example: twiddle factor W_128^1 = e^(−j2π/128), real part cos(2π/128) ≈ 0.9988
Q12.6 representation: round(0.9988 × 64) = 64 → stored as 64
After output scaling: result ÷ 64 recovers the actual value
precision tradeoff
Q12.6 gives 6 bits of fractional precision — sufficient for signal reconstruction but introduces quantization error vs Python's float64. Relative errors in outputs are <2% for most frequency bins, rising to ~5% at near-zero magnitude bins where small absolute errors become large relative errors. This is expected and documented.
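The quantization round-trip can be sketched in Python (helper names are illustrative, not from the design):

```python
import math

def to_q12_6(x):
    """Quantize a real value to Q12.6 (18-bit signed, 6 fractional bits)."""
    q = round(x * 64)
    assert -(1 << 17) <= q < (1 << 17), "outside 18-bit signed range"
    return q

def from_q12_6(q):
    """Recover the approximate real value from a Q12.6 integer."""
    return q / 64

w = math.cos(2 * math.pi / 128)   # real part of twiddle W_128^1, ~0.9988
q = to_q12_6(w)                   # stored as the integer 64
err = abs(from_q12_6(q) - w)      # quantization error, bounded by 2^-7
```

The worst-case round-trip error is half an LSB, i.e. 2⁻⁷ ≈ 0.0078, which is the source of the small absolute errors discussed in the validation section.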
Resource utilization breakdown
| Resource | Used  | Available | Utilization % |
|----------|-------|-----------|---------------|
| LUT      | 12742 | 20800     | 61.26%        |
| LUTRAM   | 189   | 9600      | 1.97%         |
| FF       | 30867 | 41600     | 74.20%        |
| BRAM     | 3     | 50        | 6.00%         |
| DSP      | 64    | 90        | 71.11%        |
| IO       | 2     | 106       | 1.89%         |

DSP utilization at 71% leaves comfortable margin. LUT and FF usage is higher due to control logic for stage sequencing, MUX switching, and twiddle factor ROM storage. BRAM holds intermediate stage outputs between sequential 8-point FFT runs.

Timing analysis
| Parameter                     | Value              |
|-------------------------------|--------------------|
| Clock period                  | 10 ns (100 MHz)    |
| Worst Negative Slack (WNS)    | 0.353 ns           |
| Total Negative Slack (TNS)    | 0.000 ns           |
| Max achievable frequency      | 103.659 MHz        |
| Computation latency           | 2566 clock cycles  |
| End-to-end latency @ 100 MHz  | 25.66 µs           |
critical path
The critical path runs through the DSP48 multiplier chain in the butterfly stage, specifically through the accumulator feedback path in the complex multiply-accumulate operation. The positive WNS of 0.353 ns means the design could be clocked at up to 103.659 MHz if needed. All timing constraints are met with positive slack.
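The headline timing numbers follow directly from the slack report, as this small check shows:

```python
clock_period_ns = 10.0    # 100 MHz constrained clock
wns_ns = 0.353            # worst slack reported by Vivado (positive = met)
cycles = 2566             # computation latency in clock cycles

fmax_mhz = 1e3 / (clock_period_ns - wns_ns)   # 1/(10 ns - 0.353 ns), ~103.659 MHz
latency_us = cycles * clock_period_ns / 1e3   # 25.66 µs at 100 MHz
```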
[results_and_validation]
Output validation — FPGA vs Python

Outputs validated at two levels: behavioral simulation (Vivado) and hardware (ILA on physical Basys3). Both compared against Python NumPy FFT with identical inputs.

Behavioral simulation
The done signal goes high at 25.995 µs, consistent with the 2566-cycle computation latency at 100 MHz (25.66 µs) plus the simulation's start-up interval. Real and imaginary parts match Python outputs with <2% relative error across most frequency bins. Outputs are displayed in Q12.6 decimal notation in the Vivado waveform viewer.
Hardware validation (ILA)
Outputs captured via Integrated Logic Analyzer on physical Basys3 board. FFT magnitude spectrum matches Python reference. Higher relative errors (~40%) observed at near-zero magnitude bins — expected behavior from fixed-point quantization where small absolute errors inflate the relative metric.
why relative error spikes at certain bins
Relative error = |FPGA - Python| / |Python|. When Python's output is near zero (a frequency component not present in the input signal), even a tiny absolute quantization error from Q12.6 fixed-point becomes a large relative error. This is a precision representation artifact, not a computation error. The absolute errors are uniformly small (<0.5 magnitude units) across all bins.
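The effect is easy to demonstrate with hypothetical magnitudes that share the same absolute error:

```python
def rel_err(measured, reference):
    """Relative error; blows up as the reference magnitude approaches zero."""
    return abs(measured - reference) / abs(reference)

# Two hypothetical bins with an identical absolute error of 0.2:
strong_bin = rel_err(128.2, 128.0)   # ~0.16%: looks very accurate
weak_bin = rel_err(0.7, 0.5)         # 40%: same absolute error, inflated metric
```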
test signal
Input signal: f(x) = 2·sin(x) + sin(10x), two sinusoids at relative frequencies 1 and 10. Sampled over one period of the fundamental, they land in frequency bins 1 and 10 (plus their mirror bins), and the FFT correctly shows peaks there in both real and imaginary spectra, matching the Python reference plot exactly in shape.
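As a cross-check of the expected spectrum, a plain-Python DFT of this test signal (assuming one period of the fundamental across the 128 samples) shows exactly the two peaks:

```python
import cmath
import math

N = 128
x = [2 * math.sin(2 * math.pi * n / N) + math.sin(2 * math.pi * 10 * n / N)
     for n in range(N)]

def dft(seq):
    """Direct O(N^2) DFT, adequate for a one-off check."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

mags = [abs(v) for v in dft(x)]
# 2*sin at bin 1 gives |X[1]| = N = 128; sin at bin 10 gives |X[10]| = N/2 = 64
# (with mirror peaks at bins 127 and 118); all other bins are ~0.
```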
64 DSPs used
2566 clock cycles
103.7 MHz max clock
<2% relative error
[future_work]
⚠ [work_in_progress]  A redesigned version without the DSP constraint is in progress — targeting a fully pipelined architecture with improved resource efficiency and lower latency. Details will be added here when complete.
What changes without the DSP constraint

The DSP constraint was a course requirement — not a design choice. Removing it opens up significantly better architectures:

without DSP constraint
LUT-based multipliers synthesized by the tool can be freely placed and pipelined. This enables a fully pipelined butterfly architecture where all stages execute concurrently, with one new sample entering and one output leaving every clock cycle. Latency drops from 2566 cycles to a pipeline depth on the order of the log₂(N) = 7 butterfly stages (plus internal register stages), and throughput rises from one FFT per 2566 cycles to one output sample per cycle once the pipeline is filled.

The sequential 8-point stage approach was a workaround for the 90-DSP hard limit. A proper streaming FFT architecture would replace the stage-done-signal control logic with a deep pipeline and eliminate the BRAM intermediate storage entirely.

Target metrics for the redesigned version: <20 cycles pipeline depth, >200 MHz clock, <1% LUT utilization for the butterfly network.