Parallel computing is not a new concept in digital simulation. The industry's leading simulators all have solutions that take advantage of advanced multicore technology. However, not every design suits this technology, and certain factors limit the performance and efficiency of parallel simulations. As a result, the functional verification community has not adopted it as widely as its advantages warrant. When used correctly, parallel simulations can deliver dramatic performance gains.
To get the maximum performance out of parallel simulations, design and verification engineers need to know which design characteristics are favorable and unfavorable, and to understand how load balancing, concurrency, and communication affect the way a design runs in parallel. They must also understand the tradeoffs required for this technology to be effective and which design scenarios lend themselves to it.
Figure 1: Design partitioning for multicore simulations.
Design factors affecting performance
Design partitioning decisions have the most impact on parallel simulation performance. Potential performance gains are highly dependent on load balance, concurrency, and communication overhead among the partitions. Designs that are suitable for multicore simulation exhibit the following:
- A balanced partition load in terms of code weight and/or run-time activity
- A high degree of simulation concurrency between partitions
- Low inter-partition communication
Load-balanced activity on all partitions is very important for maximizing the throughput of the parallel simulation, so it is imperative to design with load balancing in mind. Engineers often activate and test only one block at a time because testing all blocks together makes simulation very time consuming. At times, such full-design runs are not practical on a single core.
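As a rough illustration of why load balance matters, the sketch below estimates an upper bound on speedup from per-partition loads. It is a simplified model, not how any particular simulator computes performance: it assumes single-core time is the sum of all loads, parallel time is set by the heaviest partition, and communication and synchronization cost nothing. The function name and the load numbers are hypothetical.

```python
def ideal_speedup(partition_loads):
    """Upper bound on speedup when every partition runs on its own core.

    Assumed model: single-core time is the sum of all loads, and parallel
    time equals the heaviest partition. Communication is ignored.
    """
    if not partition_loads or max(partition_loads) <= 0:
        raise ValueError("need at least one partition with a positive load")
    return sum(partition_loads) / max(partition_loads)


# Hypothetical loads, in arbitrary units of simulation activity.
balanced = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]

print(ideal_speedup(balanced))          # 4.0
print(round(ideal_speedup(skewed), 2))  # 1.43
```

With the same total work on four cores, the skewed partitioning cannot exceed roughly 1.4x in this model, because three cores sit idle while the heaviest partition finishes.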
Designing with multicore simulation in mind makes it possible to activate all parallel blocks in the design and take advantage of multicore simulation’s ability to partition them, run them in parallel, and reduce overall simulation time. Even with a well-balanced partition load, it is important to establish concurrency. A design with inherent parallelism and low sequential behavior is characterized by complex functions that are broken down into a series of small independently performed tasks that operate in parallel. This concurrency improves parallel simulation. In addition, a high degree of concurrency ensures minimal communication between blocks and fewer synchronization points.
Designs that have large blocks with independent parallel activity, such as a multicore SoC with busy cores and independent functions, are natural candidates for parallel simulations and may provide significant speedups depending on the design partitioning. On the other hand, designs dominated by serial processing, such as Ethernet, are not good candidates and may even perform poorly in a multicore simulation due to the added communication overhead between partitions.
Inter-partition communication (IPC) between blocks can add considerable overhead to a parallel simulation. When design partitioning causes two partitions to exchange data at a very high rate, the communication overhead has a negative impact on the simulation. It may be better to combine the design units that communicate heavily into a single partition to minimize the overhead. Creating design partitions at higher-level blocks is also critical to minimizing inter-partition communication. Trying to partition at lower design levels often results in increased communication activity between partitions.
Fortunately, it is possible to identify the blocks that should be either kept together or separated in order to minimize communication overhead during parallel simulation. For example, Questa Sim MC2 from Mentor Graphics has tools to analyze this traffic and provide information that can be used to avoid high communication overhead.
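The effect of grouping heavily communicating units can be sketched with a small traffic model. The block names, traffic counts, and partition assignments below are invented for illustration and do not come from any real design or tool; the function simply sums the traffic on links whose two endpoints fall in different partitions.

```python
def cross_partition_traffic(traffic, assignment):
    """Sum the traffic on links whose endpoints sit in different partitions.

    traffic:    dict mapping (block_a, block_b) -> events exchanged
    assignment: dict mapping block name -> partition id
    """
    return sum(
        events
        for (a, b), events in traffic.items()
        if assignment[a] != assignment[b]
    )


# Hypothetical traffic between blocks (events per unit of simulated time).
traffic = {
    ("cpu0", "l2_cache"): 900,
    ("cpu1", "l2_cache"): 850,
    ("cpu0", "dma"): 40,
    ("dma", "usb"): 30,
}

# Hot links cut by partition boundaries versus kept inside one partition.
split_hot_path = {"cpu0": 0, "cpu1": 1, "l2_cache": 2, "dma": 3, "usb": 3}
merged_hot_path = {"cpu0": 0, "cpu1": 0, "l2_cache": 0, "dma": 1, "usb": 1}

print(cross_partition_traffic(traffic, split_hot_path))   # 1790
print(cross_partition_traffic(traffic, merged_hot_path))  # 40
```

Merging the hot links into one partition removes most of the cross-partition traffic in this example, but it also concentrates load in that partition, so the gain has to be weighed against load balance and concurrency.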
Parallel simulation technology applicability
Parallel simulations are not the best approach for every scenario, so it is important to understand when to use them and when not to. The three main factors behind good parallel simulation performance discussed above (load balance, concurrency, and low communication) also point to the design qualification criteria that can lead to successful multicore simulation.
- Big designs with simulation times of more than an hour are a good fit.
- Designs that can be partitioned well with balanced activity in each partition can provide good multicore simulation speedups.
- Flat gate-level netlists and designs with little or no hierarchy are not good candidates for multicore simulations. Such designs do not partition well and generate excessive amounts of communication at lower levels.
- Take care with designs that make heavy use of PLI/DPI/VPI/FLI. In some scenarios, such a design may not partition well because that code accesses contents across partition boundaries. Individual partitions should be thought of as mostly “local” simulations with no global access.
- The design should be as race-free as possible, since multicore simulations can reorder events and expose races, which may result in mismatches against single-core simulations.
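The race concern in the last item can be illustrated with a toy model of event ordering. This is a deliberately simplified sketch, not how a simulator schedules events: two hypothetical processes update the same signal in the same time step, and the final value depends only on which one is evaluated first.

```python
def run(events, initial=0):
    """Apply same-time-step events in the given order; return the final value."""
    value = initial
    for event in events:
        value = event(value)
    return value


def driver_a(value):
    # Hypothetical process that overwrites the signal with 1.
    return 1


def driver_b(value):
    # Hypothetical process that increments the signal.
    return value + 1


print(run([driver_a, driver_b]))  # 2
print(run([driver_b, driver_a]))  # 1
```

Both orders are legal for events scheduled at the same time, so a design whose results depend on one of them can mismatch when partitioning changes the evaluation order.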
If an overnight regression suite consists of a large number of tests with short simulation times, multicore simulation is not the recommended way to achieve speedup. Short tests may perform worse with multicore because of synchronization overhead. In such a scenario, the total throughput of an overnight regression run can be improved by submitting multiple single-core simulation jobs in parallel to a distributed grid.
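A back-of-the-envelope comparison shows why a grid of single-core jobs tends to win for short tests. The sketch assumes all tests take the same time, the same number of cores is available either way, and a multicore run achieves a fixed speedup below the core count because of synchronization overhead. The test count, test length, and speedup value are hypothetical.

```python
import math


def grid_wall_time(num_tests, test_minutes, cores):
    """Wall time when each test runs single-core and `cores` jobs run at once."""
    return math.ceil(num_tests / cores) * test_minutes


def multicore_wall_time(num_tests, test_minutes, speedup):
    """Wall time when tests run one after another, each using all cores."""
    return num_tests * test_minutes / speedup


# Hypothetical regression: 400 five-minute tests on 8 cores.
# Assumed multicore speedup per test: 3.0x, well below the 8 cores used.
print(grid_wall_time(400, 5, 8))                     # 250
print(round(multicore_wall_time(400, 5, 3.0), 1))    # 666.7
```

Under these assumptions the grid finishes the suite in well under half the time, because independent single-core jobs pay no synchronization cost.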
In the case of weeklong or longer runs, a single test that takes multiple hours or days to complete can be an ideal candidate for multicore simulation, provided it also meets the other qualifying criteria of balanced load, low communication, and design concurrency. When design runs are this long, it becomes important to get results as early as possible in order to find functional issues, fix them, and rerun the simulations sooner, thus increasing productivity. It is also possible to have a mix of multicore and single-core simulations in a regression suite, which can result in optimal throughput depending on the length of each test and the speedup produced by multicore simulation.