The Power and Complexities of High-Level Synthesis Pragmas and Directives
High-level synthesis (HLS) is a technology that greatly increases the productivity of FPGA design by raising the level of abstraction from low-level HDLs to C and C++. Moreover, because C simulation is much faster than RTL simulation, it provides quick, early functional verification, allowing designers to identify functional issues early in the design flow. HLS pragmas and directives play a pivotal role in meeting design goals. They enable the designer to inform the HLS tool about application properties, explore different architectural implementations to study their impact on performance and area, and provide guidelines for generating optimized hardware.
Figure 1 shows an example of C/C++ code with some HLS pragmas highlighted (from here on we will refer to both directives and pragmas simply as “pragmas”).
Figure 1 – Example of Code with HLS pragmas
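For readers without access to the figure, a small kernel in the style of Figure 1 might look like the sketch below. The function name and the particular pragma choices are hypothetical, but PIPELINE and ARRAY_PARTITION are standard Vivado HLS directives; note that a software compiler simply ignores the `#pragma HLS` lines, which is what makes fast C simulation of HLS code possible.

```cpp
#include <cassert>

#define N 64

// Hypothetical kernel illustrating typical Vivado HLS pragmas.
// ARRAY_PARTITION splits the array across multiple memories so that
// the pipelined loop can read several elements per cycle.
void scale_add(const int a[N], const int b[N], int out[N], int k) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
loop_main:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = k * a[i] + b[i];
    }
}
```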
As design complexity increases, deciding which HLS pragmas to apply, with what parameters, and where, becomes increasingly difficult and time consuming. It requires the designer to understand each line of the application code and its impact on overall performance. Inserting a pragma requires estimating the effect it will have and knowing exactly where to place it. Moreover, each HLS tool has its own specialized set of directives, with significant semantic differences between tools. Although the purpose and semantics of a pragma can be found in the manual, estimating its effect on a given piece of code at a particular location is often tricky. Let’s illustrate this with a simple example using Xilinx Vivado HLS 2019.2.
The example is an image-processing kernel that operates on a local window; the window corresponds to a local buffer in which pixels are processed. The code is a system of eight loops with a nesting depth of four. The two outer loops iterate over the X and Y dimensions of the image in the frame buffer; the inner loops iterate over the pixels in the window buffer. Figure 2 shows the code structure of this kernel.
Figure 2 – Kernel code structure
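As a rough sketch of the structure in Figure 2 (the loop labels, window size, and the processing itself are assumptions, not the actual kernel), an eight-loop system with a nesting depth of four over an image and a local window could look like this:

```cpp
#include <cassert>

#define H 32   // image height (assumed)
#define W 32   // image width (assumed)
#define K 3    // window size (assumed)

// Hypothetical sketch: 8 loops, maximum nesting depth of 4.
// loop1/loop2 walk the image; the three loop pairs at depth 3/4
// fill, process, and update the local window buffer.
void kernel(const unsigned char img[H][W], unsigned char out[H][W]) {
    unsigned char window[K][K];
loop1: for (int y = 0; y < H - K + 1; y++) {       // image rows
loop2:   for (int x = 0; x < W - K + 1; x++) {     // image columns
loop3:     for (int wy = 0; wy < K; wy++)          // fill the window
loop4:       for (int wx = 0; wx < K; wx++)
               window[wy][wx] = img[y + wy][x + wx];
           int acc = 0;
loop5:     for (int wy = 0; wy < K; wy++)          // process window pixels
loop6:       for (int wx = 0; wx < K; wx++)
               acc += window[wy][wx];
loop7:     for (int wy = 0; wy < K; wy++)          // reset window state
loop8:       for (int wx = 0; wx < K; wx++)
               window[wy][wx] = 0;
           out[y][x] = (unsigned char)(acc / (K * K)); // mean filter (assumed)
         }
       }
}
```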
To investigate the effect of the “pipeline” pragma and the impact of its position in the code, we implemented four versions, each with the pipeline pragma inserted at a different level of the nested loop system. Figure 3 shows the pragma positions in the different versions of the kernel.
Figure 3 – Four versions with different positions of the pipeline pragma
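Schematically, the four versions differ only in which loop body carries the pragma. The sketch below (the function, loop labels, and window arithmetic are illustrative, not the actual kernel) marks the four candidate positions as comments; a key Vivado HLS rule already hints at the results that follow: pipelining a loop fully unrolls every loop nested inside it, so the higher the pragma sits in the nest, the more hardware is replicated.

```cpp
#include <cassert>

enum { ROWS = 8, COLS = 8, WIN = 2 };  // sizes assumed for illustration

// Exactly one of the four commented pragma lines is enabled per version.
void window_sum(const int img[ROWS][COLS], int out[ROWS][COLS]) {
loop1: for (int y = 0; y <= ROWS - WIN; y++) {
// #pragma HLS PIPELINE  // version 1: loops 2, 3 and 4 are fully unrolled
loop2:   for (int x = 0; x <= COLS - WIN; x++) {
// #pragma HLS PIPELINE  // version 2: loops 3 and 4 are fully unrolled
           int acc = 0;
loop3:     for (int wy = 0; wy < WIN; wy++) {
// #pragma HLS PIPELINE  // version 3: loop 4 is fully unrolled
loop4:       for (int wx = 0; wx < WIN; wx++) {
// #pragma HLS PIPELINE  // version 4: only the innermost loop is pipelined
               acc += img[y + wy][x + wx];
             }
           }
           out[y][x] = acc;
         }
       }
}
```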
A frequency constraint of 100 MHz is applied, since the busses and the frame buffer interface are clocked at 100 MHz. Figure 4 shows the synthesis results: resource consumption (FF, LUT, DSP), frequency, and latency. Results are given for the four versions with the pipeline pragma at different levels and for a baseline version with no pipeline pragma. The latency corresponds to the number of cycles taken to execute a test bench.
All versions meet the frequency constraint of 100 MHz. Some configurations could even be clocked at up to 180 MHz, but this headroom cannot be exploited because of the interface frequency limitation.
Figure 4 – HLS synthesis results for area, frequency and latency
We observe that the impact on resource consumption can be huge: depending on where the pipeline pragma is inserted, LUT usage varies by a factor of up to 100x and DSP usage by up to 22x.
Between pipeline insertion at loop2 and loop3, there is a 2x improvement in latency for almost the same hardware resource utilization.
It is also interesting to observe that for no pipeline, pipeline in loop1, and pipeline in loop2, the latency remains the same even though resource consumption varies by up to 100x; that is, the extra resources buy no gain. Together with the design size, such variations often have a strong impact on synthesis and simulation time: a 100x increase in resources can leave the project ‘stuck’ in synthesis for several days. This makes solution-space exploration hard to manage and has a strong negative impact on the development team’s productivity.
The example above may seem somewhat predictable, in the sense that it is often best to pipeline the innermost loop, but that isn’t always the case. There are also many more pragmas, and many more loops to consider in full designs. Factor in the choice between unrolling and pipelining, unroll factors, where and how to partition arrays, and so on, and you can see that the complexity grows very quickly. Now add the need to understand the interdependencies between different portions of the code, as well as any loop-carried dependencies, and the selection of pragmas and parameters can take on a life of its own. Finally, what do you do if the design consumes too many resources?
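The interplay between unrolling and array partitioning is a good example of such an interdependency. In the hypothetical sketch below (the function name and factors are assumptions), a partial unroll by a factor of 4 creates four parallel adders, and partitioning the array by the same factor gives the loop enough memory ports to feed them:

```cpp
#include <cassert>

#define N 16

// Hypothetical example: UNROLL and ARRAY_PARTITION chosen together.
// Unrolling alone would leave the four parallel reads contending for
// the one or two ports of a single BRAM; cyclic partitioning by the
// same factor spreads consecutive elements across four memories.
void accumulate4(const int data[N], int *sum) {
#pragma HLS ARRAY_PARTITION variable=data cyclic factor=4 dim=1
    int s = 0;
loop_acc:
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=4
        s += data[i];
    }
    *sum = s;
}
```

Picking the unroll factor without the matching partitioning (or vice versa) typically wastes either logic or memory bandwidth, which is exactly the kind of coupled decision that makes manual exploration tedious.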
SLX FPGA is a tool that uses static and dynamic analysis techniques to develop deep insights into your application. Using high-level resource and performance models and scheduling simulators, SLX FPGA performs an extensive design-space exploration over the various HLS pragmas and their intricate parameter options to choose the best implementation for a given target area constraint. Its analysis capabilities also help HLS developers identify bottlenecks, so they can quickly find the code areas that need refactoring.
In summary, using the correct HLS pragmas can provide excellent results in a short time with significantly reduced verification effort. However, getting to the correct pragmas can be complex, time consuming, and in many cases, it may not be practical to explore enough of the design space to understand if additional performance is possible. Furthermore, it is not easy to estimate the effects of a pragma on an application without dynamic insights into the application. SLX FPGA fills this gap and allows developers and system architects to very quickly and efficiently identify the solution that meets their requirements.
Zubair Wadood is a Technical Marketing Engineer at Silexica GmbH. He received his PhD in computer science from the University of Leuven, Belgium, in 2014; his interests include embedded systems and high-performance computing. Before joining Silexica, he worked at Mentor Graphics and u-blox.