This was an FPGA communications course project with a hard constraint: every multiplication must use a DSP48 IP block, not LUT-based multipliers. The Basys3 board (Artix-7 XC7A35T) has exactly 90 DSP48E1 slices available.
The input representation is Q12.6 fixed-point: 18-bit signed words with 12 bits for the integer part (sign included) and 6 bits for the fractional part. All inputs are pre-scaled by 64 (left shift by 6) before being fed in, and outputs are interpreted as 64× the actual values. The 18-bit word matches the 18-bit B input of the DSP48E1 multiplier exactly (the A input is wider, at 25 bits).
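The scaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual test harness: the helper names `to_q12_6` and `from_q12_6` are hypothetical, and saturating on overflow is an assumption, since the write-up does not say how out-of-range values are handled.

```python
FRAC_BITS = 6            # fractional bits in Q12.6
WORD_BITS = 18           # total word width, sign bit included
SCALE = 1 << FRAC_BITS   # 64

Q_MIN = -(1 << (WORD_BITS - 1))       # most negative 18-bit signed value
Q_MAX = (1 << (WORD_BITS - 1)) - 1    # most positive 18-bit signed value


def to_q12_6(value: float) -> int:
    """Scale by 64, round to the nearest integer, saturate to 18 signed bits.

    Saturation is an assumption made for this sketch.
    """
    raw = round(value * SCALE)
    return max(Q_MIN, min(Q_MAX, raw))


def from_q12_6(word: int) -> float:
    """Interpret an 18-bit signed word as 64x the real value."""
    return word / SCALE


word = to_q12_6(1.5)       # integer word fed to the hardware
back = from_q12_6(word)    # value recovered from that word
```

With 6 fractional bits the resolution is 1/64, so any input that is a multiple of 1/64 round-trips without error; everything else is rounded to the nearest step.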
A Fast Fourier Transform (FFT) computes the Discrete Fourier Transform in O(N log N) rather than O(N²). The Cooley-Tukey Radix-2 DIT algorithm achieves this by recursively splitting a DFT of size N into two DFTs of size N/2 — one for even-indexed inputs, one for odd-indexed inputs — down to 2-point butterfly stages.
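As a software illustration of the even/odd split, here is a small recursive radix-2 DIT FFT in plain Python, checked against a direct O(N²) DFT. It is a floating-point sketch of the algorithm only; it says nothing about how the hardware schedules the butterflies, and the function names are chosen for this example.

```python
import cmath


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2_dit(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2_dit(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W_N^k
        t = w * odd[k]
        out[k] = even[k] + t              # butterfly, upper output
        out[k + n // 2] = even[k] - t     # butterfly, lower output
    return out


def dft_direct(x):
    """O(N^2) reference DFT, used only to check the fast version."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


data = [1, 2, 3, 4, 0, -1, -2, -3]
fast = fft_radix2_dit(data)
slow = dft_direct(data)
max_diff = max(abs(a - b) for a, b in zip(fast, slow))   # floating-point noise only
```

Each level of recursion does N/2 butterflies, and there are log2(N) levels, which is where the O(N log N) cost comes from.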
For a 128-point FFT: 7 stages of butterfly operations, each stage processing 64 butterfly pairs. Each complex multiplication uses 4 real DSP48 multiplications.
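The counts above follow directly from the radix-2 structure, and the four-multiplier complex product is the textbook expansion (a + jb)(c + jd) = (ac − bd) + j(ad + bc). The sketch below shows both; the function names are illustrative, and the mapping of each product to one DSP48 is taken from the statement above rather than from the project's HDL.

```python
def complex_mul_4(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using four real multiplications."""
    p0 = ar * br   # real multiply 1
    p1 = ai * bi   # real multiply 2
    p2 = ar * bi   # real multiply 3
    p3 = ai * br   # real multiply 4
    return p0 - p1, p2 + p3   # (real part, imaginary part)


def radix2_counts(n):
    """Return (stages, butterflies per stage) for an n-point radix-2 FFT."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = n.bit_length() - 1   # log2(n) for a power of two
    return stages, n // 2


re, im = complex_mul_4(1, 2, 3, 4)    # (1 + 2j) * (3 + 4j)
stages, per_stage = radix2_counts(128)
```

Using integer operands here mirrors the fixed-point datapath: the multipliers only ever see scaled integers, and the subtraction and addition that combine the partial products happen after the multiplies.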
8-point FFT: 4 butterflies × 4 DSPs = 16 real DSPs for a single stage, but with complex arithmetic across all 3 stages (12 butterflies): 48 DSPs total.
Once the 8-point stage is established as the building block, every larger transform is implemented as sequential 8-point stages with butterfly operations on their combined outputs.
Each 8-point stage asserts a done signal when its computation completes. This signal triggers the next stage's start and switches the input MUX to feed the second half of the data. No clock-cycle counting or external control is needed: the design self-sequences through all stages.
DSP48E1 inputs are 18-bit signed integers. To represent fractional twiddle factors with reasonable precision:
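A common way to do this is to scale each twiddle factor by a power of two and round to the nearest integer. The sketch below does that for an arbitrary number of fractional bits and reports the worst-case rounding error. The scale actually used in this design is not stated here, so `frac_bits` is a free parameter of the example, and both function names are hypothetical.

```python
import math

WORD_BITS = 18  # signed operand width assumed for the twiddle ROM


def quantize_twiddles(n, frac_bits):
    """Return [(re, im), ...] for W_n^k, k = 0..n/2-1, scaled by 2**frac_bits.

    frac_bits is an assumption of this sketch, not a value from the design.
    """
    scale = 1 << frac_bits
    limit = (1 << (WORD_BITS - 1)) - 1
    if scale > limit:
        raise ValueError("scale does not fit in a signed 18-bit word")
    table = []
    for k in range(n // 2):
        angle = -2.0 * math.pi * k / n
        table.append((round(math.cos(angle) * scale),
                      round(math.sin(angle) * scale)))
    return table


def max_quantization_error(n, frac_bits):
    """Largest absolute error over the real and imaginary parts of the table."""
    scale = 1 << frac_bits
    worst = 0.0
    for k, (re, im) in enumerate(quantize_twiddles(n, frac_bits)):
        angle = -2.0 * math.pi * k / n
        worst = max(worst,
                    abs(re / scale - math.cos(angle)),
                    abs(im / scale - math.sin(angle)))
    return worst


table_8 = quantize_twiddles(8, 6)            # 8-point twiddles at 6 fractional bits
err_6 = max_quantization_error(128, 6)       # bounded by half a step, 0.5 / 64
err_16 = max_quantization_error(128, 16)     # bounded by 0.5 / 65536
```

Because twiddle magnitudes never exceed 1, up to 16 fractional bits fit in an 18-bit signed word, so the twiddle scale can be chosen independently of the Q12.6 data scale as long as the product is shifted back accordingly.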
| Resource | Used | Available | Utilization % |
|---|---|---|---|
| LUT | 12742 | 20800 | 61.26% |
| LUTRAM | 189 | 9600 | 1.97% |
| FF | 30867 | 41600 | 74.20% |
| BRAM | 3 | 50 | 6.00% |
| DSP | 64 | 90 | 71.11% |
| IO | 2 | 106 | 1.89% |
DSP utilization at 71% leaves a comfortable margin. LUT and FF usage is higher due to control logic for stage sequencing, MUX switching, and twiddle factor ROM storage. BRAM holds intermediate stage outputs between sequential 8-point FFT runs.
| Parameter | Value |
|---|---|
| Clock period | 10 ns (100 MHz) |
| Worst Negative Slack (WNS) | 0.353 ns |
| Total Negative Slack (TNS) | 0.000 ns |
| Max achievable frequency | 103.659 MHz |
| Computation latency | 2566 clock cycles |
| End-to-end latency @ 100MHz | 25.66 µs |
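The last three rows of the table are derived from the first two plus the cycle count. A short check of that arithmetic, assuming the usual relation that the maximum frequency is the reciprocal of the clock period minus the worst negative slack (function names are illustrative):

```python
def max_frequency_mhz(period_ns, wns_ns):
    """Fastest clock the critical path supports: 1 / (period - slack)."""
    return 1000.0 / (period_ns - wns_ns)


def latency_us(cycles, period_ns):
    """Latency in microseconds for a given cycle count and clock period."""
    return cycles * period_ns / 1000.0


fmax = max_frequency_mhz(10.0, 0.353)   # compare with the 103.659 MHz row
lat = latency_us(2566, 10.0)            # compare with the 25.66 µs row
```

A positive WNS means the design meets timing at 100 MHz with 0.353 ns to spare on the critical path, which is why TNS is zero.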
Outputs were validated at two levels: behavioral simulation in Vivado and on hardware, using an ILA on the physical Basys3. Both were compared against NumPy's FFT in Python with identical inputs.
Test signal: f(x) = 2·sin(x) + sin(10x), which has two frequency components at 1 Hz and 10 Hz (taking x = 2πt). The FFT correctly identifies peaks at the corresponding frequency bins in both the real and imaginary spectra, matching the shape of the Python reference plot.
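A reference check along these lines can be reproduced in a few lines of Python. The sketch below makes two assumptions that are not stated in the write-up: the 128 samples span exactly one period of sin(x), so the two tones land on bins 1 and 10, and inputs are rounded to Q12.6 words before the transform. A plain-Python DFT stands in for `numpy.fft.fft` to keep the example dependency-free.

```python
import cmath
import math

N = 128  # transform size


def test_signal(n=N):
    """Sample f(x) = 2*sin(x) + sin(10x) at n points over [0, 2*pi) (assumed window)."""
    return [2.0 * math.sin(2.0 * math.pi * i / n)
            + math.sin(10.0 * 2.0 * math.pi * i / n) for i in range(n)]


def dft(x):
    """Direct DFT; numpy.fft.fft would return the same values."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def peak_bins(spectrum, count=2):
    """Indices of the `count` largest magnitudes in the lower half of the spectrum."""
    half = range(len(spectrum) // 2)
    ranked = sorted(half, key=lambda k: abs(spectrum[k]), reverse=True)
    return sorted(ranked[:count])


samples = test_signal()
quantized = [round(s * 64) for s in samples]   # Q12.6 input words
spectrum = dft(quantized)                      # outputs are 64x the real values
peaks = peak_bins(spectrum)                    # bins holding the two tones
bin1_magnitude = abs(spectrum[1]) / 64         # rescaled to real units
```

Under these assumptions the amplitude-2 tone shows up with magnitude close to N = 128 at bin 1 and the amplitude-1 tone with roughly half that at bin 10; the small deviation from the ideal values is the Q12.6 rounding noise.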
The DSP constraint was a course requirement — not a design choice. Removing it opens up significantly better architectures:
The sequential 8-point stage approach was a workaround for the 90-DSP hard limit. A proper streaming FFT architecture would replace the stage-done-signal control logic with a deep pipeline and eliminate the BRAM intermediate storage entirely.
Target metrics for the redesigned version: <20 cycles pipeline depth, >200 MHz clock, <1% LUT utilization for the butterfly network.