Test choices for improving PCI Express reliability
Since there is no “one-size-fits-all” solution, users must learn to select the right tool at the right time
BY YENYI FU
Agilent Technologies
Santa Clara, CA
http://www.agilent.com
As PCI Express is used in an increasing number of applications, and with a constantly growing number of devices that support it, users are demanding that it become more robust and handle failures more reliably.
To that end, the PCI-SIG specification has a number of reporting mechanisms for handling errors. Depending on the type of error, a detected failure is either handled in the hardware or passed on to drivers or application software. For instance, one option would be for application software to avoid the faulty device by shifting to a working one.
This is theoretically a great way to increase the reliability of the whole system. The challenge is to generate such errors in a test environment to see how well the system responds. This article explores some traditional methods of negative (error-injection) testing as well as recent innovations.
Error-injection testing
The PCI-SIG specification for PCI Express includes two mechanisms for reporting errors: baseline and advanced. While baseline error reporting is required of all PCI Express devices, the advanced error reporting capability is optional. However, advanced error reporting has a more robust capability for handling errors that might occur on the device.
The many different errors that occur on a PCI Express link are grouped into three main categories, depending on their severity: correctable, uncorrectable but nonfatal, and uncorrectable and fatal. The correctable errors can be handled in the hardware and do not have any functional impact on the system or the application. The so-called uncorrectable errors, on the other hand, are managed by the device driver or the application software. Each system will deal with errors differently, depending on the type of error.
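The three categories, and the layer that deals with each, can be written down as a small lookup. The following is only a sketch of the grouping described above: the names Severity, HANDLED_BY, EXAMPLE_ERRORS, and handler_for are made up for illustration, and the three example errors are listed with their usual default severity, which a real device may be configured to change.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "uncorrectable, nonfatal"
    FATAL = "uncorrectable, fatal"


# Which layer deals with each category, following the description in the text.
HANDLED_BY = {
    Severity.CORRECTABLE: "hardware",
    Severity.NONFATAL: "device driver or application software",
    Severity.FATAL: "device driver or application software",
}

# Illustrative examples only (assumed default severities, not a complete list).
EXAMPLE_ERRORS = {
    "Bad TLP": Severity.CORRECTABLE,
    "Poisoned TLP": Severity.NONFATAL,
    "Malformed TLP": Severity.FATAL,
}


def handler_for(error_name: str) -> str:
    """Return which layer is expected to handle the named example error."""
    return HANDLED_BY[EXAMPLE_ERRORS[error_name]]


if __name__ == "__main__":
    for name, severity in EXAMPLE_ERRORS.items():
        print(f"{name}: {severity.value} -> {handler_for(name)}")
```

A validation plan can use a table like this as a checklist: for every error that is injected, the expected responder is known before the test runs.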
Test and validation engineers are tasked with ensuring that the device or system has a known behavior when it encounters different types of errors, and that a deployed system doesn’t experience an outage or performance degradation. To help engineers get a better understanding of device behavior under specific error scenarios, there are a number of tools available; they fall into three main categories: test suites, exercisers, and jammers.
Pre-defined test suites
In a pre-defined test suite, there are typically a set number of test cases. The best example of a pre-defined test suite is the protocol test card (PTC) that is used by the PCI-SIG for testing in PCI Express compliance workshops. The PTC includes 13 test cases that are used to test specific data-link-layer and transaction-layer functions. A key advantage of the test suite is ease of use: with a few clicks, tests are automatically executed and a pass/fail report is generated for each.
One limitation of pre-defined test suites is that they only test one side of the link. The PCI Express link is made up of both the end points and the root complex (see Fig. 1). For robust system operation, therefore, the validation engineer needs to test both sides of the link in a real environment to ensure full coverage, injecting errors at both the end points and the root complex to make sure that both sides react correctly.
Fig. 1. In a typical PCI Express system, the bus connects the root complex (sometimes referred to as a motherboard or master) to an end point (daughterboard or slave).
With a typical PTC-type card, the PTC replaces the motherboard and can only create errors for the end-point devices. This limits test suite usage to end-point validation only. Additionally, because the end point is not running in a real system, drivers or software cannot be installed. As a result, pre-defined suites cannot be used for system-level testing.
Another test-suite limitation is its fixed scope. The PTC's 13 tests cover a very small part of the specification, enough to show interoperability. The suite is not enough for robust testing, and it is very hard for a user to expand the number of test cases.
Other products have more extensive test coverage than the PTC (PCI Express Gen2's test suites have over 170 tests), but they still suffer from being very static and not easily adapted for the increased test coverage that users say they need in the field.
Exercisers
Exercisers emulate one side of the link; some can be programmed to emulate either the end point or the root complex of a link, while others are statically configured, emulating only the end point or root complex.
In either case, the overall test method is similar. To test an end point device, the engineer replaces the root complex with an exerciser; to test a root complex, the end point is replaced with an exerciser (see Fig. 2). Unlike a real end point or a root complex, an exerciser typically can be programmed to emulate any type of good or bad behavior.
Fig. 2. Agilent’s N5323A PCIe Jammer card is shown working as a PTC to test an end-point device (a), as an exerciser for root complex testing (b), and with a backplane for end-point testing (c).
Exercisers are critical at the start of a design cycle, when device availability is limited. They can be used for initial functional testing (also known as system or device bring-up) and do not depend on the availability of other devices. This is important when the device under test is targeted to be first to market.
Most exercisers can be thought of as a protocol state machine engine programmed through a GUI or API. The test engineer can program the exerciser to do pretty much anything, and it can perform functional testing, error-injection testing, or performance testing.
However, the exerciser’s flexibility also makes it hard to configure and control. For instance, imagine that a user wants to inject a “poisoned” (erroneous) Transaction Layer Packet (TLP) randomly after the link is established and the device has been configured. In a normal system, link establishment and device configuration are all handled by the root complex and the drivers. However, since the root complex is being replaced by the exerciser, the exerciser must first be programmed to recreate the initiation process, a tedious task, before it can inject errors.
While the exerciser is a great tool for development test or functional test, that’s not the case for system test. For system test, it is important to see the full system (root complex, devices, drivers, software applications, and such) working together. Users need to answer a key question: does the system handle an error correctly? For instance, if a LAN card reports an uncorrectable error, does the driver or the application take the appropriate action? Since an exerciser replaces a key component, we cannot test the system’s real behavior.
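To make the extra work concrete, here is a minimal sketch of what an exerciser script has to do before it can inject the poisoned TLP. ExerciserSketch and its methods are hypothetical and only record steps; they do not correspond to any vendor's API. In a real exerciser, each of the first two steps expands into many programmed transactions.

```python
class ExerciserSketch:
    """Hypothetical stand-in for an exerciser that replaces the root complex."""

    def __init__(self):
        self.log = []
        self.link_up = False
        self.configured = False

    def train_link(self):
        # Normally done by the root complex hardware.
        self.link_up = True
        self.log.append("train link")

    def configure_device(self):
        # Normally done by the root complex and the drivers.
        if not self.link_up:
            raise RuntimeError("link must be up before configuration")
        self.configured = True
        self.log.append("enumerate and configure end point")

    def send_tlp(self, poisoned=False):
        if not self.configured:
            raise RuntimeError("device must be configured before traffic")
        self.log.append("send poisoned TLP" if poisoned else "send TLP")


if __name__ == "__main__":
    exerciser = ExerciserSketch()
    exerciser.train_link()          # step 1: recreate link establishment
    exerciser.configure_device()    # step 2: recreate device configuration
    exerciser.send_tlp(poisoned=True)  # step 3: only now inject the error
    print(exerciser.log)
```

Skipping either of the first two steps raises an error in the sketch, which mirrors the point in the text: the error injection itself is one line, but everything before it must be scripted by hand.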
Jammers
The third type of tool for error testing is known as a jammer. While jammers are not new (RF jammers have been used in military applications for many years), their application for PCI Express technology is.
Basically, a jammer sits between the root complex and the end device, and is transparent to both. The jammer does not create any of the messages or traffic; it needs both devices on each side of the link to do that. The jammer’s function is to inject errors in the traffic between the root complex and end device to simulate failures and to monitor responses.
The key advantage of the jammer is its ability to test the full system: root complex, end device, drivers, and software application. With the full system in operation, a test engineer can monitor how a driver or software responds to a specific error.
The jammer is actually easier to set up than the exerciser. As previously noted, with the exerciser, the user has to program the full initiation process manually. The jammer, however, is transparent during the initiation process, which is handled by the root complex and end device. Thus the user need only program the jammer to inject errors after initiation, with the help of a sequencer.
The sequencer is an “if/else” state machine: if a match condition occurs, then perform a specific action. For the same example noted above, injecting a poisoned TLP randomly can be easily set up in the jammer: after the device is configured, randomly (condition) inject a “poisoned” TLP with a modified header (action).
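The condition/action idea can be sketched in a few lines. Everything here is an assumption made for illustration: Tlp, Rule, jam, and poison_randomly are invented names, packets are reduced to a kind label and a poisoned flag, and "after the device is configured" is approximated by matching only memory writes so that configuration traffic passes through untouched. It is not the programming model of any actual jammer.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Tlp:
    kind: str              # label for this sketch, e.g. "CfgWr" or "MemWr"
    poisoned: bool = False


@dataclass
class Rule:
    condition: Callable[[Tlp], bool]   # the "if" part
    action: Callable[[Tlp], Tlp]       # what to do on a match


def jam(traffic: Iterable[Tlp], rules: Sequence[Rule]) -> Iterator[Tlp]:
    """Forward every packet; modify it if the first matching rule says so."""
    for tlp in traffic:
        for rule in rules:
            if rule.condition(tlp):
                tlp = rule.action(tlp)
                break
        yield tlp  # unmatched packets pass through unchanged


def poison_randomly(probability: float, rng: random.Random) -> Rule:
    """Rule: randomly mark memory writes as poisoned."""
    return Rule(
        condition=lambda tlp: tlp.kind == "MemWr" and rng.random() < probability,
        action=lambda tlp: replace(tlp, poisoned=True),
    )


if __name__ == "__main__":
    traffic = [Tlp("CfgWr"), Tlp("MemWr"), Tlp("MemWr"), Tlp("MemWr")]
    rule = poison_randomly(0.5, random.Random())
    for tlp in jam(traffic, [rule]):
        print(tlp)
```

Note that the sketch never generates a packet of its own: like the jammer described above, it only alters traffic that the two real devices produce.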
Selection guidelines
Selecting the appropriate tool depends on multiple factors. First and foremost is an understanding of the level of reliability that one is looking for in a device. To produce a consumer-grade LAN card, it may not be so important to test all aspects of the specification to cover all possible negative scenarios. In this case, a fast, easy-to-use test suite with good coverage (more than 13 test cases) is a viable choice.
However, if the validation engineer is developing a server platform for mission-critical applications, it would be essential to test the reliability and recoverability of the full system. In such cases, a jammer application is recommended, as it will allow the testing of hardware, software, and drivers together.
A second consideration is the development phase of the product under test: is it in the early development phase, or are you looking to replicate the tool that is used in the PCI-SIG compliance workshop for compliance testing? For early product testing, such as development testing, it is important to simplify the environment. An exerciser that emulates the root complex with complete control over what is sent might be a good tool for this stage.
But the jammer would be a better fit for system-level testing, or for customer support labs that need to create errors in the full system or easily replicate a customer’s problems. ■
For more on PCI Express, visit http://www2.electronicproducts.com/BoardLevel.aspx