In 5G-NR communications, interleaving protects data against burst errors. The 3GPP specification dictates writing the coded bits row-wise into an isosceles triangle and reading them out column-wise.
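As an illustration, the row-wise write and column-wise read can be traced on a small block; the bit labels `b0`..`b5` are hypothetical and chosen only to make the permutation visible:

```python
# Worked example: E = 6 bits fill a T = 3 triangle exactly, since T*(T+1)/2 = 6.
bits = ["b0", "b1", "b2", "b3", "b4", "b5"]

# Row-wise write: row i holds T - i entries.
triangle = [
    ["b0", "b1", "b2"],  # row 0
    ["b3", "b4"],        # row 1
    ["b5"],              # row 2
]

# Column-wise read: top to bottom within a column, columns left to right.
out = [triangle[r][c] for c in range(3) for r in range(3) if c < len(triangle[r])]
print(out)  # → ['b0', 'b3', 'b5', 'b1', 'b4', 'b2']
```

Reading down the columns scatters originally adjacent bits, which is what breaks up burst errors on the channel.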
The optimization path focused on aligning 3GPP algorithms with FPGA hardware primitives for maximum concurrency.
The Naive Approach: Standard algorithms first determine the smallest side length $T$ such that $T(T+1)/2 \geq E$, where $E$ is the number of coded bits, then execute nested loops that dynamically skip the null padding bits.
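A minimal software sketch of this naive scheme; the function name and indexing conventions are our own, not taken from the source design:

```python
import math

def interleave_naive(bits):
    """Naive triangular interleaver: pick the smallest side length T with
    T*(T+1)/2 >= E, write row-wise (row i holds T - i slots, padded with
    None), then read column-wise while dynamically skipping the padding."""
    E = len(bits)
    # Smallest integer T satisfying T*(T+1)/2 >= E.
    T = math.ceil((math.sqrt(8 * E + 1) - 1) / 2)
    it = iter(bits)
    # Row-wise write; next(it, None) pads the tail with null markers.
    triangle = [[next(it, None) for _ in range(T - i)] for i in range(T)]
    # Column-wise read with dynamic null skipping.
    return [triangle[r][c]
            for c in range(T)
            for r in range(T - c)
            if triangle[r][c] is not None]
```

The null check inside the innermost loop is exactly the data-dependent control flow that synthesizes poorly, as discussed next.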
The Bottleneck: Dynamic coordinate tracking is expensive to implement in hardware. Direct synthesis of the nested loops would produce deep combinational logic and multi-microsecond computation stalls. This approach was rejected during the design phase.
The Optimization: We designed a baseline that precomputed structural offsets to avoid dynamic division at runtime. The data path used a state machine to track coordinates and resolve destination indices sequentially.
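A sequential sketch of this baseline, assuming a fixed $T$ per block size; `interleave_offsets` and its counter scheme are illustrative, but each loop iteration mirrors one step of a one-bit-at-a-time state machine:

```python
def interleave_offsets(bits, T):
    """Baseline sketch: row start offsets are precomputed once, so the
    per-bit loop needs only adds and compares (no runtime division).
    T is assumed fixed for the block size."""
    E = len(bits)
    # Offset of each row's first element in the row-wise write order:
    # row i starts T - (i - 1) positions after row i - 1.
    row_start = [0] * T
    for i in range(1, T):
        row_start[i] = row_start[i - 1] + (T - (i - 1))
    out = []
    # Coordinate-tracking "state machine": walk down each column in turn.
    r, c = 0, 0
    for _ in range(T * (T + 1) // 2):
        src = row_start[r] + c      # source index in the write order
        if src < E:                 # padding always sits past index E - 1
            out.append(bits[src])
        r += 1
        if r == T - c:              # column exhausted: move to next column
            r = 0
            c += 1
    return out
```

Note that every iteration depends on the `r`/`c` counters updated by the previous one, which is the sequential dependency described below.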
The Limit: While mathematically efficient, the state machine created a sequential data dependency. Processing one bit per cycle capped effective throughput and left the AXI4-Stream bus underutilized.
Further Optimizations: By analyzing the deterministic nature of the triangle for specific block sizes, we restructured the data path to support concurrent bit-routing. The sequential index calculation was replaced with a parallel architecture.
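The parallel restructuring can be sketched as a precomputed source-index permutation; the function names are illustrative. In hardware such a table reduces to fixed routing or a small ROM, so many output bits resolve in the same clock cycle:

```python
import math

def build_permutation(E):
    """For a fixed block size E, precompute offline which source index
    feeds each output position. Stepping down a column from row r adds
    T - r to the source index, so no division is needed."""
    T = math.ceil((math.sqrt(8 * E + 1) - 1) / 2)
    perm = []
    for c in range(T):
        src = c                     # top of column c
        for r in range(T - c):
            if src < E:             # skip null padding past index E - 1
                perm.append(src)
            src += T - r            # next row down in this column
    return perm

def interleave_parallel(bits, perm):
    # Every output bit is an independent table lookup; on the FPGA each
    # lookup is a dedicated routing path evaluated concurrently.
    return [bits[i] for i in perm]
```

Because the lookups share no state, the data path can route as many bits per cycle as the bus width allows, unlike the counter-driven baseline.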
The Result: The FPGA now resolves multiple bit positions simultaneously, achieving higher throughput and significantly lower latency than traditional iterative approaches.
| Architecture Phase | Clock Target | Complexity | Resource Usage | Timing Status | Latency |
|---|---|---|---|---|---|
| Phase 1 (Naive) | - | High | - | Rejected (design phase) | N/A |
| Phase 2 (Baseline) | 100 MHz | Moderate | Low | MET | > 5 µs |
| Phase 3 (Parallel) | 100 MHz | Optimized | Efficient | MET | Sub-µs |
→ Constraints: The targets were standard-compliant latency and minimal resource utilization. Phase 3 provides significant headroom for real-time 5G baseband processing.
The final architecture balances hardware footprint against execution speed on the critical path. By focusing on parallel data routing, we minimize the latency added to the radio path. This approach also offloads the mathematical heavy lifting from DSP slices, freeing them for other physical-layer tasks.