BY DESHANAND SINGH
Supervising Principal Engineer
Altera, www.altera.com
The initial era of programmable technologies contained two different extremes of programmability. One extreme was the single-core CPU and DSP units. These were programmable using software consisting of a list of instructions to be executed. Instructions were created in a manner that was conceptually sequential to the programmer, although an advanced processor could reorder instructions to extract instruction-level parallelism from these sequential programs at run time. The other extreme of programmable technology was the FPGA. These devices are programmed by creating configurable hardware circuits, which execute completely in parallel. A designer using an FPGA is essentially creating a massively fine-grained parallel application.
For many years, these extremes coexisted with each type of programmability being applied to different application domains. However, recent trends in technology scaling have favored technologies that are both programmable and parallel.
The second trend that the software-programmable devices relied on was the emergence of complex hardware that extracted instruction-level parallelism from sequential programs. A single-core architecture would take in a stream of instructions and execute them on a device that might have many parallel functional units. A significant fraction of the processor hardware was dedicated to extracting parallelism from the sequential code.
Additionally, hardware attempted to compensate for memory latencies. Generally, programmers create programs without consideration of the processor's underlying memory hierarchy, as if there were only a large, flat, uniformly fast memory. In contrast, the processor must deal with the physical realities of high-latency and limited bandwidth connections to external memory. In order to keep functional units fed with data, many processors speculatively prefetch data from external memory into on-chip caches so that the data is much closer. After many decades of performance improvements using these techniques, we have recently seen greatly diminishing returns from this type of architecture.
Fig. 1: Recent Trend of Programmable and Parallel Technologies.
Emphasis is shifting
Given the diminishing benefits of these two trends on conventional processor architectures, the spectrum of software-programmable devices is now evolving significantly, as shown in Fig. 1. The emphasis is shifting from automatically extracting instruction-level parallelism at run time to explicitly identifying thread-level parallelism at coding time. Highly parallel multicore devices are beginning to emerge, with a general trend toward multiple simpler processors in which more of the transistors are dedicated to computation rather than to caching and extraction of parallelism.
These devices range from multicore CPUs, which commonly have two, four, or eight cores, to GPUs with hundreds of simple cores optimized for data-parallel computation. To achieve high performance on these multicore devices, the programmer must explicitly code their applications in a parallel fashion. Each core must be assigned work in such a way that all cores can cooperate to execute a particular computation. This is also exactly what FPGA designers do to create their high-level system architectures.
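As a minimal sketch of what assigning work to each core can look like, the plain-C helper below splits an index range as evenly as possible across a number of cores. The names work_range and core_range are hypothetical and chosen only for illustration; real applications may partition work very differently.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of element indices owned by one core.
   Hypothetical type, introduced only for this illustration. */
struct work_range {
    size_t begin;
    size_t end;
};

/* Split n elements across num_cores as evenly as possible.
   The first (n % num_cores) cores each take one extra element. */
struct work_range core_range(size_t n, size_t num_cores, size_t core) {
    assert(num_cores > 0 && core < num_cores);
    size_t base = n / num_cores;   /* elements every core receives */
    size_t extra = n % num_cores;  /* leftover elements to hand out */
    struct work_range r;
    r.begin = core * base + (core < extra ? core : extra);
    r.end = r.begin + base + (core < extra ? 1 : 0);
    return r;
}
```

With 10 elements and 4 cores, this scheme gives the cores the ranges [0, 3), [3, 6), [6, 8), and [8, 10), so every element is covered exactly once.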
To meet the need for parallel programs in the emerging multicore era, OpenCL (Open Computing Language) was created as a cross-platform parallel-programming standard. The OpenCL standard inherently offers the ability to describe parallel algorithms to be implemented on FPGAs at a much higher level of abstraction than hardware description languages (HDLs) such as VHDL or Verilog.
Although many high-level synthesis tools exist for gaining this higher level of abstraction, they have all suffered from the same fundamental problem. These tools would attempt to take in a sequential C program and produce a parallel HDL implementation. The difficulty was not so much in the creation of an HDL implementation, but rather in the extraction of thread-level parallelism that would allow the FPGA implementation to achieve high performance.
With FPGAs at the furthest extreme of the parallel spectrum, any failure to extract maximum parallelism is more crippling than on other devices. The OpenCL standard solves many of these problems by allowing the programmer to explicitly specify and control parallelism. The OpenCL standard matches the highly parallel nature of FPGAs more naturally than do sequential programs described in pure C.
OpenCL structure
An OpenCL application has two parts. The OpenCL host program is a pure software routine written in standard C/C++ that runs on any sort of microprocessor. That processor may be, for example, an embedded soft processor in an FPGA, a hard ARM processor, or an external x86 processor.
At a certain point during the execution of this host software routine, there is likely to be a function that is computationally expensive and can benefit from acceleration on a more parallel device: a CPU, GPU, FPGA, etc. This function to be accelerated is referred to as an OpenCL kernel. These kernels are written in standard C, but they are annotated with constructs to specify parallelism and memory hierarchy. The example shown in Fig. 2 performs the vector addition of two arrays, a and b, and writes the results back to an output array, answer. Parallel threads operate on each element of the vector, allowing the result to be computed much more quickly when it is accelerated by a device that offers massive amounts of fine-grained parallelism, such as an FPGA. The host program has access to standard OpenCL APIs that allow it to transfer data to the FPGA, invoke the kernel on the FPGA, and return the resulting data.
Fig. 2: Example of OpenCL Implementation on an FPGA.
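The figure itself is not reproduced here, so the following is only a rough sketch of the vector-addition kernel it describes, emulated in plain C so that it compiles without an OpenCL toolchain. The array names a, b, and answer come from the text; the index name xid, the helper get_global_id_emulated, and the launcher launch_vector_add are assumptions made for illustration. In real OpenCL C, the kernel would be marked __kernel, its pointers would be __global, and the index would come from the built-in get_global_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's get_global_id(0): in this plain-C
   emulation, the launcher below sets it before each kernel invocation. */
size_t current_global_id;

size_t get_global_id_emulated(void) {
    return current_global_id;
}

/* Kernel body: one work-item adds one pair of elements.
   The OpenCL qualifiers (__kernel, __global) are omitted so that the
   sketch compiles as ordinary C. */
void vector_add(const float *a, const float *b, float *answer) {
    size_t xid = get_global_id_emulated();
    answer[xid] = a[xid] + b[xid];
}

/* Emulated launch: runs the kernel once per work-item, one after another.
   On a real device the work-items would execute in parallel, and the host
   would use OpenCL API calls such as clEnqueueWriteBuffer,
   clEnqueueNDRangeKernel, and clEnqueueReadBuffer to move data and
   start the kernel. */
void launch_vector_add(const float *a, const float *b,
                       float *answer, size_t n) {
    assert(a != NULL && b != NULL && answer != NULL);
    for (size_t i = 0; i < n; ++i) {
        current_global_id = i;
        vector_add(a, b, answer);
    }
}
```

The kernel contains no loop over the array: each invocation handles a single element, and the parallelism comes from how many invocations the device can run at once.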
In FPGAs, kernel functions can be transformed into dedicated, deeply pipelined hardware circuits that are inherently multithreaded using the concept of pipeline parallelism. Each of these pipelines is completely customized to the requirements of the function and can be replicated many times to provide even more parallelism than is possible with a single pipeline.
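To give a feel for why pipelining and replication help, the sketch below computes an idealized cycle count. It assumes that each pipeline accepts one new thread per clock cycle, never stalls, and that threads are divided evenly among the replicated pipelines; real hardware will deviate from this. The function name pipeline_cycles and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Idealized number of clock cycles needed to push work_items threads
   through `replicas` identical pipelines, each `depth` stages deep.
   Assumes one thread enters each pipeline per cycle with no stalls. */
size_t pipeline_cycles(size_t work_items, size_t depth, size_t replicas) {
    assert(depth > 0 && replicas > 0);
    if (work_items == 0) {
        return 0;
    }
    /* The busiest pipeline receives the ceiling of the even split. */
    size_t per_pipeline = (work_items + replicas - 1) / replicas;
    /* The first result emerges after `depth` cycles; each remaining
       thread in that pipeline completes one cycle later. */
    return depth + per_pipeline - 1;
}
```

Under these assumptions, 1000 threads through a single 10-stage pipeline finish in 1009 cycles, compared with 10000 cycles if each thread had to complete all 10 stages before the next one started. Four replicated pipelines bring the figure down to 259 cycles.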
Benefits of OpenCL
Creating FPGA designs from an OpenCL description offers several advantages over traditional methodologies based on HDL design. Development for software-programmable devices typically follows the flow of conceiving an idea, coding the algorithm in a high-level language such as C, and then using an automatic compiler to create the instruction stream. An example of an available tool is the Altera SDK for OpenCL. It provides a software-friendly design environment for implementing OpenCL applications on FPGAs, as shown in Fig. 3.
Fig. 3: Altera SDK for OpenCL overview.
This approach can be contrasted with traditional FPGA-based design methodologies, which require designers to create cycle-by-cycle descriptions of the hardware used to implement their algorithm. The traditional flow involves creating data paths, building state machines to control those data paths, connecting to low-level IP cores using system-level tools, and handling timing closure problems, since external interfaces impose fixed constraints that must be met. The Altera SDK for OpenCL performs all of these steps automatically, allowing designers to focus on defining their algorithm rather than on the tedious details of hardware design. Designing in this way also allows the designer to migrate easily to new FPGAs that offer higher performance and capacity, because the OpenCL compiler will transform the same high-level description into pipelines that take advantage of the new FPGAs without requiring any modifications to the kernel code.
Using the OpenCL standard on an FPGA may offer significantly higher performance at much lower power than is available today from other hardware architectures (CPUs, GPUs, etc.). In addition, an FPGA-based heterogeneous system (CPU + FPGA) using the OpenCL standard has a significant time-to-market advantage compared to traditional FPGA development using lower-level hardware description languages (HDLs) such as Verilog or VHDL.