
Multimedia chip choice widens

Picking the right ICs to implement multimedia systems requires knowledge
of the appropriate standards and of system integration issues

BY GENE SCUTERI, AT&T Microelectronics, Allentown, PA

and

SEYMOUR FRIEDEL, Production Catalysts, Inc., Manchester, NH

Within a few years, many multimedia applications will emerge to offer
stored interactive video, CD-quality stereo music, speech compression and
decompression, and speech synthesis. Imagery will include full-motion
video and high-resolution still pictures. Multimedia personal computers
and workstations will extend these capabilities to video telephony,
enabling high-quality audio, still images, and full-motion video to be
transmitted in real time over telecommunications networks. Multimedia
technology will be driven by progress in three areas: standards,
networking, and low-cost standards-compliant VLSI silicon for audio and
video compression and decompression.

Progress in standards is well under way. The MPEG and JPEG ISO standards
are replacing proprietary compression and decompression algorithms such as
CD-I and DVI in stored interactive video applications. CCITT standard P*64
is displacing vendor-specific algorithms for video conferencing. These
standards provide a common platform that lets systems from different
vendors work together and fosters economies of scale.

Networking progress is also being made. Although packet-switched datacom
LANs like Ethernet and FDDI are useful for authoring, playback, and
E-mail, they are not well suited to video conferencing, where fast,
guaranteed response times are critical for human interaction. This problem
is currently solved with a direct connection between the
video-conferencing equipment and the telecommunications network (such as
T1 or Accunet). However, better solutions are coming. FDDI II, for
example, adds isochronous transmission capability to FDDI, providing
constant information delivery per unit of time with deterministic, equal,
and short delays. Another method, Basic Rate ISDN, provides similar
capabilities over standard telephone lines. Such methods will gradually
displace the telecommunications links currently used for video telephony.

The JPEG, MPEG, and P*64 standards were developed to overcome the
interoperability limitations of proprietary compression and decompression
algorithms (see box, "Multimedia standards"). MPEG is used for the
compression and decompression of stored interactive video and audio, JPEG
for high-resolution still images, and P*64 for teleconferencing video and
audio. Both MPEG and P*64 include synchronization, control, and other
functions appropriate for their tasks. All three use symmetrical
compression and decompression algorithms based on the discrete cosine
transform (DCT), which makes it easier to design cost-effective hardware
for both decoding and encoding.
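
The DCT at the heart of all three standards converts an 8 x 8 block of
pixels into 64 frequency coefficients, most of which can be coarsely
quantized or discarded. As a point of reference, the direct (unoptimized)
form of the 8 x 8 forward DCT can be written as follows in C; codec
silicon uses fast factored versions of this computation, and the function
name and block layout here are illustrative only:

#include <math.h>

#define N 8

/* Direct 8 x 8 forward DCT -- a readable reference form, not the fast
 * factorization used in codec silicon. Input is one block of pixel
 * values; output is the 64 frequency coefficients. */
void fdct_8x8(const double in[N][N], double out[N][N])
{
    for (int u = 0; u < N; u++) {
        for (int v = 0; v < N; v++) {
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;

            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * M_PI / (2.0 * N))
                         * cos((2 * y + 1) * v * M_PI / (2.0 * N));

            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}

Because the transform is symmetrical, the decoder's inverse DCT has the
same structure and roughly the same cost, which is what makes single-chip
encode/decode hardware practical.
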
Designing and building multimedia and video-conferencing systems becomes
easier as these standards are more fully wired in silicon. Until recently,
most available chips could only partially implement JPEG, MPEG, and P*64.
Such devices often ignored the audio portions of the MPEG and P*64
standards or neglected critical functions such as synchronization,
multiplexing, and communication. Chipsets now coming to market offer
fuller compliance with MPEG and P*64, and designers must understand the
requirements of full compliance to make intelligent choices.

There is more to these standards than compression and decompression.
Complete MPEG compliance also requires MPEG systems-layer multiplexing and
demultiplexing for audio, video, and user data. Full P*64 compliance
requires several framing, channel-management, error-correction, and other
communications functions. These include H.221 (the P*64 transport protocol
that encompasses multiplexing, demultiplexing, and framing of audio,
video, and user data) as well as H.230 (which encompasses call setup and
tear-down). Additional functions include H.261 (which specifies the
forward error correction) and H.242 (which specifies the handshake
protocol required between two video-conferencing terminals as well as the
means to determine how a frame's bandwidth is allocated among audio,
video, and user data).

Many solutions ignore the H.2XX functions, leaving them to high-level
software on the host. In some cases, this slows the host or ties up the
system bus. In other cases (H.261 error correction, for example), the host
may lack the bus bandwidth and CPU power for a real-time implementation.
Consequently, system designers are forced to develop custom ASICs to
handle such functions. These kinds of omissions make the designer's life a
lot harder than it needs to be. As a minimum requirement, any competent
multimedia chipset should encompass the functionality of all three
standards.

Comparing compression

Compression is the cornerstone of stored interactive video and
videotelephony. It reduces bit rates enough for efficient disk storage,
bus bandwidth utilization, and network transmission. Compression is also
the most compute-intensive task required by multimedia applications.
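
A rough calculation shows the scale of the problem. The numbers below are
illustrative assumptions (CIF resolution, 4:2:0 sampling, 30 frames/s, and
a two-channel Basic Rate ISDN pipe), not figures taken from the standards:

#include <stdio.h>

/* Back-of-the-envelope bit-rate arithmetic for CIF video (illustrative). */
int main(void)
{
    const double width = 352, height = 288;     /* CIF resolution       */
    const double bytes_per_pixel = 1.5;         /* 4:2:0 YUV sampling   */
    const double frames_per_sec = 30.0;
    const double channel_bps = 128e3;           /* 2 x 64-kbit/s ISDN   */

    double raw_bps = width * height * bytes_per_pixel * 8.0 * frames_per_sec;

    printf("Raw video rate    : %.1f Mbits/s\n", raw_bps / 1e6);  /* ~36.5 */
    printf("Channel rate      : %.0f kbits/s\n", channel_bps / 1e3);
    printf("Compression needed: %.0f:1\n", raw_bps / channel_bps); /* ~285 */
    return 0;
}

Even before audio is added, squeezing CIF video into a Basic Rate ISDN
line calls for compression ratios in the hundreds.
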
Decompression requirements for MPEG and P*64 video and audio are
explicitly defined in the standards, so decoder implementations should be
consistent from one vendor to the next. Compression-encoding algorithms,
by contrast, are not explicitly defined. Consequently, the capabilities of
any two MPEG- or P*64-compliant encoders can differ considerably in
compression ratio, bit rate, frame rate, resolution, delay, and picture
quality.

At one end of the spectrum, I-frame MPEG encoders are optimized for
high-resolution still frames and digital editing (authoring). At the other
end, full-function MPEG encoders maximize compression ratios and picture
quality by using motion estimation with exhaustive search over expanded
search ranges, together with special techniques involving one or more
interpolative frames.

P*64 encoding is optimized to provide good picture quality with minimal
delay over a broad range of bit rates. P*64 supports both CIF (352 x 288)
and QCIF (176 x 144) resolution. It also supports two types of compression
frames: intra-coded (I) frames and predictive (P) frames. I frames exploit
redundancy within a frame; P frames achieve higher compression by
exploiting redundancy between frames.

P*64 implementations typically differ in four ways: maximum encoded bit
rate, resolution, the sophistication of the motion-estimation algorithm,
and delay. A high maximum bit rate is important because it lets the
encoder maximize picture quality when transmitting over high-bandwidth
telecommunications channels.

Another way that P*64 chips vary is in the compression algorithms they
use. P*64 allows for the use of motion estimation (to code P frames) and
specifies the size of the motion-estimation search space (+/-15 pixels).
However, it does not specify the encoding algorithm itself, including the
motion estimation, and it is this algorithm that largely determines the
compression ratio and picture quality.

Because motion estimation is the most compute-intensive algorithm used in
compression, it is a frequent shortcut target. The best motion-estimation
algorithms use the full exhaustive search technique, in which every
candidate position in the search space is evaluated. Unfortunately, most
P*64 encoders lack the power to perform exhaustive search. Instead, they
settle for a less compute-intensive technique known as hierarchical
search, in which the search space is divided into quadrants and the
quadrant with the best overall match is then evaluated either
hierarchically or pixel by pixel, depending on the available power. The
problem with this technique is that the quadrant with the best overall
match doesn't always contain the position with the best match. Finding the
best match matters for the compression ratio: if something moves, you want
to find it so that just the movement is transmitted instead of the entire
block. Encoders that use full exhaustive search can always find the best
match.
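
In code, exhaustive search is nothing more than trying every displacement
in the window and keeping the one with the lowest block-matching error.
The sketch below uses a sum-of-absolute-differences cost over a 16 x 16
macroblock and the P*64 +/-15-pixel window; it omits the picture-boundary
handling and half-pixel refinement a real encoder needs, and the function
names are illustrative:

#include <stdlib.h>
#include <limits.h>

#define BLK   16           /* macroblock size                   */
#define RANGE 15           /* P*64 search range: +/-15 pixels   */

/* Sum of absolute differences between a block in the current frame and a
 * candidate block in the reference frame. */
static long sad(const unsigned char *cur, const unsigned char *ref,
                int stride, int bx, int by, int dx, int dy)
{
    long cost = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            cost += abs(cur[(by + y) * stride + bx + x]
                      - ref[(by + y + dy) * stride + bx + x + dx]);
    return cost;
}

/* Full exhaustive search: every displacement in the window is tried, so
 * the true best match is always found (this sketch assumes the window
 * stays inside the reference picture). */
void full_search(const unsigned char *cur, const unsigned char *ref,
                 int stride, int bx, int by, int *best_dx, int *best_dy)
{
    long best = LONG_MAX;
    for (int dy = -RANGE; dy <= RANGE; dy++)
        for (int dx = -RANGE; dx <= RANGE; dx++) {
            long cost = sad(cur, ref, stride, bx, by, dx, dy);
            if (cost < best) {
                best = cost;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
}

With 31 x 31 = 961 candidate displacements per macroblock and 256 pixel
differences per candidate, it is easy to see why exhaustive search strains
most encoder chips and why hierarchical search is such a tempting shortcut.
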
MPEG encoding is even less explicitly defined than P*64, allowing even
greater variation in MPEG encoder quality. Unlike P*64, for example, MPEG
doesn't fix the picture resolution or the motion-estimation search range.
As a result, shortcut encoders can claim to support full-motion video
while delivering low-resolution images, nonstandard frame rates, and
meager search ranges.

Advanced MPEG encoders take advantage of larger motion search spaces and
frame interpolation. To increase compression levels and picture quality,
full-blown MPEG encoders use several interframe coding techniques. The
most powerful and compute-intensive technique is motion estimation and, as
with P*64, the most effective MPEG motion-estimation algorithms use full
exhaustive search. Unlike P*64, however, the motion search range isn't
fixed at +/-15 pixels. An expanded search range improves compression
ratios and picture quality, particularly for video sequences with
fast-moving objects. Some encoders, like those from AT&T (see Fig. 1),
provide a full exhaustive search over a range of +/-32 pixels with
1/2-pixel accuracy. But this requires considerable processing power, so
most MPEG encoders must settle for greatly reduced search ranges.

System integration

The standards don't address several important system-level functions.
Designers should be aware of these and select silicon that either solves
the problems or makes a system solution easier. These issues include
buffer management, acoustic echo cancellation, and synchronization of
audio, video, and user data.

For video-conferencing applications, after audio, video, and user data are
encapsulated as P*64 H.221 frames, a direct interface to all of the common
telecommunications links is needed. These links include T1, fractional T1,
switched 56 kbits/s (Accunet), switched 384 kbits/s, and basic- and
primary-rate ISDN. For video conferencing, a direct connection between the
system controller and the network controller offers several advantages:
higher throughput, reduced host-bus traffic, and network clock recovery.
Clock recovery is needed to synchronize audio data. Video is more
forgiving because occasional missed frames don't have much impact on
picture quality, but lost data has a noticeable impact on audio.

Most system-integration issues concern timing and synchronization, and
there are several levels of synchronization. For P*64 with ISDN, the
system synchronizes to the communication channel, and the ISDN interface
provides a synchronized clock and data. The audio sampling rate has to be
synchronized to the line clock because either too few or too many bits
causes clicks and pops in the audio.

Video synchronization is less of a problem. Because the encoding rate is
never faster than 30 frames/s (+/-50 parts per million) and the decoder
display is never slower than 60 frames/s (25 and 50 frames/s,
respectively, in Europe), the display need only keep the last decoded
frame until a new one is available. As a result, most frames are displayed
twice. Occasionally a frame will be displayed once or three times, but
neither effect is noticeable.

An MPEG recording on disk loses all sense of clock because a disk is an
asynchronous device. To synchronize the video with the recording, the
video decoder must be set to the frame rate at which the encoded stream
was recorded. A FIFO chip and control circuitry set the output frame rate.

Although video encoding and decoding require silicon designed for that
purpose, a standard digital signal processor (DSP) and associated
circuitry can manage the slower data rates of both P*64 and MPEG audio
(see Fig. 2). For audio synchronization, the decoder has no direct access
to the audio sampling clock used for encoding, so that clock must be
recovered. This is handled by having the DSP run a small output FIFO in
software. Whenever the FIFO is more than half full, the DSP causes an
external digitally controlled oscillator to increase the output sampling
frequency; when the FIFO is less than half full, it decreases the
frequency.
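
That watermark rule reduces to a few lines of DSP code. In the sketch
below, the FIFO depth and frequency step are made-up values, and the
caller is assumed to apply the returned adjustment to the digitally
controlled oscillator by whatever register write the actual part requires:

#define FIFO_DEPTH 256              /* samples; illustrative size only    */

/* Called once per output block: nudge the digitally controlled oscillator
 * (DCO) according to the software FIFO fill level. Returns the adjustment
 * in Hz to apply to the nominal sampling frequency, mimicking the
 * half-full watermark rule described above. The step size is an
 * illustrative value. */
double clock_recovery_step(int fifo_fill)
{
    const double step_hz = 1.0;

    if (fifo_fill > FIFO_DEPTH / 2)
        return +step_hz;            /* FIFO filling up: consume faster    */
    else if (fifo_fill < FIFO_DEPTH / 2)
        return -step_hz;            /* FIFO draining: consume slower      */
    return 0.0;                     /* exactly half full: leave it alone  */
}
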
Without lip synchronization, a video teleconference would look like a
badly dubbed foreign film. Because there is more processing in video
encoding, audio would normally be transmitted sooner than video. (Video
processing may cause as much as 150 ms of delay, while audio takes only 3
ms to process.) Consequently, audio has to be artificially delayed to
synchronize it with video so that both can be transmitted together. On the
receiving end, video decoding also takes much longer, so audio has to be
delayed again. These are system-dependent delays; P*64 and MPEG each
require no delay between the video and the audio in the compressed data
stream.
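
The artificial audio delay itself is just a circular buffer sized to the
difference between the video and audio processing delays. The sketch below
uses the 150-ms and 3-ms figures mentioned above and assumes an 8-kHz
sampling rate; all three numbers are system dependent, not fixed by the
standards:

#define SAMPLE_RATE_HZ 8000                /* assumed audio sampling rate */
#define VIDEO_DELAY_MS 150                 /* typical video encode delay  */
#define AUDIO_DELAY_MS 3                   /* typical audio encode delay  */
#define DELAY_SAMPLES \
    (((VIDEO_DELAY_MS - AUDIO_DELAY_MS) * SAMPLE_RATE_HZ) / 1000)

static short delay_line[DELAY_SAMPLES];
static int   wr;

/* Push one audio sample in, get the sample from ~147 ms ago back out, so
 * audio leaves the terminal aligned with the slower video path. */
short lip_sync_delay(short sample)
{
    short delayed = delay_line[wr];
    delay_line[wr] = sample;
    wr = (wr + 1) % DELAY_SAMPLES;
    return delayed;
}
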
Because of inherent delays, video teleconferencing produces large
round-trip audio delays, and with speaker-microphone feedback, acoustic
echo can make a link unusable. Acoustic echo cancellation and suppression
software on the DSP counteracts these effects. An acoustic echo is
generated in the following way: if I speak to you on the phone, the sound
comes out of the speaker at your end, travels acoustically back into your
microphone, and is transmitted back to me. (This even happens in ordinary
telephone conversations, but the delay is only tens or hundreds of
microseconds, so the ear can't hear the echo as a distinctly different
voice.) Because of all the internal delays, video has a round-trip delay
of about 300 ms. When the audio is synchronized with the video, as it is
for lip sync, it has the same round-trip delay as the video. This makes
the echo audible, which makes conversation difficult.

Audio echo-cancellation software, running on the DSP, subtracts out the
echo. Suppression software on the DSP determines who is speaking and shuts
down the microphone of the listener. On AT&T DSPs, echo cancellation and
suppression software modules are available under VCOS, the DSP operating
system. Together they provide acoustic echo cancellation and suppression
of more than 60 dB, a figure experts believe to be a minimum requirement.
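
A crude form of the suppression decision just compares short-term speech
levels in the two directions and mutes the side that is only listening.
The sketch below is a simplification for illustration; production modules
such as those supplied under VCOS combine this kind of decision with
adaptive echo cancellation to reach the 60-dB figure, and the frame size
and margin here are arbitrary assumed values:

#include <stdlib.h>

#define FRAME_SAMPLES 160        /* 20 ms at 8 kHz; illustrative framing  */

/* Mean absolute level of one speech frame. */
static long frame_level(const short *s)
{
    long sum = 0;
    for (int i = 0; i < FRAME_SAMPLES; i++)
        sum += labs((long)s[i]);
    return sum / FRAME_SAMPLES;
}

/* Half-duplex suppression decision: returns nonzero when the local
 * microphone should be muted because the far end is clearly the active
 * talker. The margin is an arbitrary illustrative value. */
int mute_local_mic(const short *near_mic, const short *far_speaker)
{
    const long margin = 4;       /* far end must be ~4x louder to win */
    return frame_level(far_speaker) > margin * frame_level(near_mic);
}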

BOX:

Multimedia standards

Today's JPEG, MPEG, and P*64 standards are the culmination of years of
research into algorithms for stored interactive video and video
conferencing. Among the first proprietary algorithms proposed for stored
interactive video were CD-I (Compact Disk Interactive) and DVI (Digital
Video Interactive), both precursors to JPEG (Joint Photographic Experts
Group) and MPEG (Moving Picture Experts Group). Proprietary algorithms
were also developed for videotelephony, all precursors to P*64.

Apart from their proprietary nature, the principal drawback of CD-I and
DVI is their asymmetry: encoding is orders of magnitude more complex than
decoding. While decompression can be handled in real time by one or two
chips, compression requires several hours on a mainframe computer. As a
result, CD-I and DVI are well suited to playback applications, but not to
desktop authoring or interactive person-to-person communications.

The primary drawback of the proprietary algorithms developed for
video-conferencing equipment is their lack of interoperability. Because
compression and decompression implementations vary from one vendor to the
next, equipment from different vendors must use a gateway to communicate.
This not only significantly increases delay, but also restricts the
flexibility and widespread availability of video-conferencing services.

The JPEG, MPEG, and P*64 algorithms were developed to overcome the
interoperability and performance limitations of these proprietary
compression and decompression algorithms. Based on the discrete cosine
transform, all three use symmetrical compression and decompression
algorithms that facilitate the design of cost-effective hardware for both
decoding and encoding.

JPEG is a standard for compressing and decompressing still images. Key
applications include picture-archiving equipment, photo database systems,
facsimile machines, and still cameras. JPEG does not specify a compression
ratio. Instead, the achievable compression ratio depends on the redundancy
in a given image and the image quality required by the application.
Compression ratios of 15:1 or 20:1 are common; at 15:1, nearly visually
lossless images can be reconstructed.

MPEG is a standard for compressing and decompressing full-motion video and
audio. It is intended primarily for stored interactive video applications
like playback and digital authoring. MPEG compression, based on motion
estimation and interpolative frames, provides good image quality at
compression ratios as high as 200:1. The standard covers a range of audio
and video compression methods, as well as techniques for multiplexing,
demultiplexing, and framing audio, video, and user data as defined in the
MPEG System Layer specification.

P*64 is a videotelephony standard for transmitting CIF (288 lines x 352
pixels) and QCIF (144 lines x 176 pixels) full-motion video and audio.
Based on multiples of the 64-kbit/s channels of Basic Rate ISDN (or 56
kbits/s), P*64 is oriented toward applications, such as video
conferencing, that require the transmission of images and audio over
digital networks. P*64 specifies compression for a wide range of
full-motion video (H.261) and audio (the G.7XX series) signals, as well as
the functionality required to multiplex, demultiplex, and frame (H.221)
P*64 audio, video, and user data. It also specifies a long list of
communications functions (H.2XX) that encompass channel allocation among
audio, video, and user data, call setup and tear-down, forward error
correction, and other channel-management functions.

CAPTIONS:

Fig. 1. This block diagram shows a video and audio encoder that can be
built with the latest generation of multimedia ICs. (Diagram courtesy of
AT&T.)

Fig. 2. This multimedia decoder block diagram shows how the necessity of
recovering the clock complicates audio decoding. (Diagram courtesy of
AT&T.)
