By Steve Leibson, Tensilica, Inc.
Digital signal processing is now mainstream technology, so it seems heretical to be declaring the end for DSP cores. All media processing—ranging from voice, to music, to still images, to video—requires DSP functions.
Signal-processing algorithms such as finite- and infinite-impulse-response (FIR and IIR) filtering consist of many multiplications followed by accumulation of the multiplication products, so DSPs have incorporated hardware multipliers and MAC (multiplier/accumulator) units ever since TI introduced the first commercially successful DSP in 1982.
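The multiply-accumulate pattern at the heart of FIR filtering can be sketched in a few lines of C. This is an illustrative model, not code from any particular DSP; the tap count and function name are chosen for the example. Each output sample is a sum of coefficient-times-sample products, and a hardware MAC unit performs each of those multiply-and-add steps in a single cycle.

```c
#include <stdint.h>

/* Hypothetical 4-tap FIR inner loop: one multiply-accumulate (MAC)
 * per tap. DSP hardware executes each MAC in a single cycle; a
 * processor without a hardware multiplier must synthesize the
 * multiply from many simpler operations. */
#define NTAPS 4

int32_t fir_sample(const int16_t coeff[NTAPS],
                   const int16_t delay[NTAPS])
{
    int32_t acc = 0;                          /* accumulator */
    for (int i = 0; i < NTAPS; i++)
        acc += (int32_t)coeff[i] * delay[i];  /* one MAC per tap */
    return acc;
}
```

The widened 32-bit accumulator mirrors real MAC hardware, which accumulates products at a greater width than the 16-bit operands to avoid overflow across many taps.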
General-purpose processors lacked the hardware multipliers needed for fast DSP execution because multipliers consume a large number of gates. Yet today, MAC units aren’t that large relative to other on-chip blocks. Configurable processor cores such as Tensilica’s Xtensa family have optional MAC units, allowing SOC design teams to include MAC function units in a more general-purpose processor core if the target application requires it. Consequently, single-cycle MAC units no longer make DSPs unique.
High-speed computation units need a stream of operands, and DSP operations create a corresponding stream of results. Consequently, a processor's load/store bandwidth must match its computational throughput if it is to execute signal-processing computations efficiently.
To address the need for greater memory bandwidth, DSP designers adopted non-standard memory architectures that perform multiple memory accesses per cycle. The most widely adopted approaches are Harvard architectures (separate memory buses for instructions and operands) and the XY memory architecture, which simultaneously fetches operands from separate X and Y memories in one clock cycle. Specialized DSP address-generation units exploit the predictable memory-access patterns in many signal-processing loops using addressing modes such as indirect addressing with post-increment, circular addressing, and bit-reversed addressing. These addressing modes accelerate FIR filtering and the FFT. Configurable processor cores offer all of the memory-architecture options developed for DSPs along with the requisite address-generation units to accelerate algorithm execution, so these features are no longer unique to DSPs.
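The two addressing modes named above can be modeled in software to show what the hardware address-generation unit computes for free. This is a hedged sketch: the function names are invented for illustration, and a real DSP performs these index updates in dedicated hardware alongside the data operation, at no cycle cost.

```c
/* Circular addressing: advance an index through a delay-line buffer
 * and wrap at the end, as an FIR filter's delay line requires. A
 * DSP's address-generation unit does this wrap in hardware. */
static inline unsigned circ_next(unsigned idx, unsigned len)
{
    return (idx + 1) % len;   /* software stand-in for hardware wrap */
}

/* Bit-reversed addressing for an N-point FFT (N a power of two):
 * reverse the low log2(N) bits of the index. The FFT produces (or
 * consumes) data in bit-reversed order, so this mode lets the
 * processor reorder samples without extra instructions. */
static unsigned bit_reverse(unsigned idx, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (idx & 1);  /* shift one bit from bottom of idx */
        idx >>= 1;                 /* ...to the bottom of r, reversed  */
    }
    return r;
}
```

For an 8-point FFT (3 address bits), index 1 (binary 001) maps to 4 (binary 100), which is why a software FFT without this addressing mode must spend a separate reordering pass on its data.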
It’s often possible to simultaneously execute the same operation on multiple data words within the inner loop of a signal-processing algorithm using SIMD (single-instruction, multiple-data) execution units. For algorithms where SIMD execution is useful, the parallelism can be quite high: a 4-way or 8-way SIMD unit can accelerate an inner loop by a factor of four or eight, respectively. Like the other features discussed above, many processor architectures, including configurable cores, incorporate SIMD units.
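The 4-way case can be modeled in plain C. The struct and function names here are illustrative assumptions, not any vendor's intrinsics: the point is that a single SIMD "instruction" applies one operation to four packed lanes, work that would otherwise take four iterations of a scalar loop.

```c
#include <stdint.h>

/* A 4-way SIMD add sketched in portable C: four 16-bit lanes packed
 * into one value, all updated by one conceptual instruction. A real
 * SIMD execution unit performs all four lane additions in a single
 * cycle; this loop merely models the semantics. */
typedef struct { int16_t lane[4]; } v4i16;

static v4i16 v4_add(v4i16 a, v4i16 b)
{
    v4i16 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (int16_t)(a.lane[i] + b.lane[i]);
    return r;
}
```

A scalar inner loop over N samples becomes a loop over N/4 such vector operations, which is the source of the four-fold speedup claimed above (assuming the data is laid out so that four adjacent samples can be loaded as one vector).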
High-performance DSPs have become VLIW (very-long instruction word) machines. They issue multiple independent operations to their parallel execution units during each cycle. VLIW processors require wider instruction words with perhaps 32 or 64 bits (or wider) per instruction instead of 16. These wider instruction words produce code bloat—the program code expands simply because of the larger instruction word, not because more work is performed.
The added ability to execute multiple independent operations per clock cycle need not incur code bloat. Tensilica’s Xtensa LX2 processor core incorporates a VLIW-like feature called FLIX (flexible-length instruction extensions) that adds 32- or 64-bit multi-issue operation bundles to the processor’s existing 24/16-bit native instruction set. The compiler selects FLIX instructions if they’re more efficient than the equivalent sequence of native instructions, which greatly accelerates code within loops. In control code (all signal-processing algorithms are laced with such code), parallelism is generally not helpful, so the compiler selects the processor’s narrower native instructions.
Automated compiler selection of appropriate instructions opens this discussion to a major difference between DSPs and DSP-augmented configurable processor cores. In general, highly specialized, irregular, and complicated instruction sets, small register files, and irregular memory architectures make DSPs poor compiler targets. Compiled DSP code is relatively inefficient because the compiler must translate from C to the DSP’s irregular instruction set and small register complement.
Conversely, the general-purpose configurable processor is a good target for compiled code. Configurable processors excel at executing control code. The processor’s DSP enhancements are used within the signal-processing algorithm’s inner loops where the compiler can best harness these specialized instructions. DSP-enhanced configurable processors offer the performance benefits of DSPs while remaining good compiler targets.
In summary, DSP cores no longer offer the SOC design team any performance advantages over configurable processor architectures. All of the DSP architects’ good ideas have become a configurable processor’s optional abilities. At the same time, configurable processors retain their superior ability to execute control code and they remain better compiler targets.
DSPs led the way to a variety of performance-enhancing architectural features, but their time to serve as on-chip processors has passed. The on-chip DSP is obsolete.
Steve Leibson is an experienced hardware and software design engineer, engineering manager, and design consultant. He spent 10 years working at electronic systems companies including HP’s Desktop Computer Division, Auto-Trol Technology (graphics workstations), and Cadnetix (EDA workstations) after earning his BSEE cum laude from Case Western Reserve University. At HP, Auto-Trol, and Cadnetix, he specialized in the design of desktop computers and workstations, especially in the areas of system and I/O design. He then spent 15 years as an award-winning technology journalist, publishing more than 200 articles in Microprocessor Report, EDN, EE Times, Electronic News, and the Embedded Developers Journal. He served as Editor in Chief of both EDN and the Microprocessor Report and was the founding Editor in Chief of the Embedded Developers Journal. Leibson has just written and published “Designing SOCs with Configured Cores,” a treatise on 21st-century MPSOC design. In 2004, he co-authored “Engineering the Complex SOC” with Tensilica’s president and CEO Chris Rowen, which has also been used as a textbook in university classes. He has also contributed chapters to several other SOC design books since joining Tensilica in 2001. tensilica.com