High-Level Synthesis: How to solve common challenges for new and experienced users
There are tremendous benefits to designing in C/C++ and using high-level synthesis (HLS) rather than Verilog or VHDL, but the approach is not without its challenges. Although the challenges for new and experienced HLS users are different, they can feel equally daunting. Let’s look at some of the challenges each group faces and how Silexica’s SLX FPGA tool can help.
New or occasional HLS users
For someone new to HLS, or an occasional HLS user, it is common to get stuck at various points in the flow. The first hurdle is synthesizability. Simply put, not everything you write in software can be synthesized into hardware; dynamic memory allocation and recursion are typical examples. If your C/C++ design is not synthesizable, it will not complete the HLS compile step, leaving you without the performance and area results you need to begin optimizing. Fixing every synthesizability issue and understanding all the error messages can be challenging for a new user, and it gets harder still once you weigh the many possible fixes and their impact on your design.
The second hurdle is understanding why performance is so poor or the area so large. This has a lot to do with assumptions the HLS compiler makes; it is not your fault. It does, however, require you to correct those assumptions. The performance and area challenges often come from a few areas, but the central one is this: serial C/C++ code does not magically become parallel hardware, yet parallelism is where most of the speedup in hardware comes from. To solve this problem, you need to understand where the parallelism is in your design, specifically the bottlenecks and what can be parallelized (e.g., by unrolling or pipelining loops). Then you need to convey that information to the HLS compiler via directives or pragmas. Each HLS compiler vendor has different pragmas, different pragma options (e.g., the unroll factor for an unroll pragma), and different rules for how multiple pragmas interact (e.g., on nested loops). The sheer number of pragmas, their attributes, and their possible combinations can quickly become overwhelming and lead many to give up and retreat to the safety of Verilog/VHDL. I encourage you to stick with HLS and get some help (via SLX FPGA).
SLX FPGA is designed to help new or occasional HLS users succeed with their first design. To get you on that path, SLX FPGA walks you through the design process. First, it checks synthesizability and provides guidance and examples on what you need to change to make your design synthesizable. Second, it shows you where the performance bottlenecks are and where the parallelism you can exploit lies, including both data-level and pipeline-level parallelism. Moreover, the tool can determine which pragmas are best, which attributes to use, and where to place the pragmas. It can also insert the pragmas automatically based on constraints you provide, such as how much area to use and whether to optimize for throughput or latency. SLX FPGA is designed to take the guesswork out of getting to your first successful HLS design.
View the SLX FPGA Walkthrough video series for more info
For experienced HLS users
For experienced HLS users, squeezing out performance or meeting area targets is usually where most of the time goes (yes, not much different from writing Verilog/VHDL, except much faster, and that is before counting the 1000x faster functional simulation you get by simulating in C/C++). The effort here is typically code refactoring, that is, changing the way you wrote parts of your C/C++ code to make it more hardware-aware (to match the target architecture). This can be done at the architecture level, the function level, or even the loop level. Performance and area optimization usually comes down to two areas. The first is understanding the optimum parallelism. Simply put, do you have parallelism blockers preventing you from getting more performance, or better performance per area? The second is data movement, usually optimizing memory structures to ensure you are not bottlenecked at the memory level. Let’s look at these in more detail.
How you write your code, particularly the loops where most of your parallelism comes from, gets complicated. Often, many loops are nested within a function, and functions are nested within other functions. This makes it hard to see the best place to unroll or pipeline a loop and what is preventing you from getting more performance. All too often, you unroll a loop and get no performance improvement, which is counter-intuitive when you know this loop is your hotspot. The cause is often a loop-carried dependency: a value computed in one iteration is needed by the next, so the unrolled iterations still execute one after another and the extra hardware buys no speedup. The only solution may then be refactoring the code to remove the loop-carried dependency. If you can keep track of all the loop dependencies yourself, you are fine; if not, a tool can help.
Understanding data movement is the second key area to focus on when optimizing a design. Here, it is important to note which variables are accessed the most and how those variables are implemented in hardware. Do your memory accesses go to registers, block RAM, or off-chip memory? These factors have a big impact on performance, and they are often not obvious from the source code.
SLX FPGA is designed to give experienced HLS users the design analysis and insights they need to better optimize their designs. The tool offers parallelism detection and, more importantly, parallelism blocker detection. It tells you where you have loop-carried dependencies and whether they are preventing more parallelism. Additionally, SLX FPGA performs hotspot analysis at both the function level and the memory-access level, and it can show you how many times every variable was read and written. In short, SLX FPGA provides the insights and analysis an experienced HLS user needs to refactor code for better optimization. SLX FPGA can also save an experienced user time by exploring millions of design space points in seconds to evaluate different optimizations.
For more information on deep code analysis for experienced users, see Silexica’s white paper, High-Level Synthesis: Can it outperform hand-coded HDL?
Jordon’s role is to drive Silexica’s product strategy, including product planning and management, product marketing and corporate marketing functions. Jordon has 20+ years of experience in marketing and product management at world-leading semiconductor companies including Intel and Altera. His role at Intel PSG (Programmable Solutions Group) included leading product and corporate strategy, driving a $1 billion product line, building high-performance marketing teams and driving go-to-market strategies. Most recently, he was Director of Marketing for Intel’s flagship Stratix Series FPGA family. During 16 years at Altera, before its acquisition by Intel in 2015, Jordon held senior marketing and product management positions covering FPGA product families, EDA software including OpenCL, IP (intellectual property) and corporate marketing/communication teams.