1.2. PC Audio Tutorial

1.2.1. Sound devices

A sound device is an electronic device, commonly part of a personal computer, responsible for all tasks related to sound. Sound devices can usually record and play sounds; a device that can record and play sound simultaneously is called 'full-duplex'. Input devices (e.g. a microphone) and output devices (e.g. a loudspeaker) are connected directly to the sound device. These devices work on analog signals, so the sound device must also act as a bridge between the analog devices and the digital computer system: it converts the outgoing signal from digital form into analog form, and converts the incoming analog signal into digital form so it can be stored in computer memory. PC sound devices may work in various modes (coding styles, sampling rates, etc.). They are controlled by the operating system (each has an assigned I/O port and interrupt number), and usually communicate by the use of direct memory access.

Almost all contemporary operating systems provide mechanisms for controlling sound devices through a high-level programming interface. No knowledge of the sound card's (device's) internals is needed; the OS driver performs all low-level tasks for the programmer. For example, multimedia Unix systems, like SunOS or Linux, support the open-read-write mechanism on sound devices, which is part of the Unix input-output API [12].
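
To illustrate the open-read-write mechanism, the following minimal sketch records and immediately plays back one second of sound through the OSS-style /dev/dsp device found on Linux; the device path and the chosen working mode (16-bit, mono, 8 kHz) are only illustrative:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/soundcard.h>

    int main(void)
    {
        /* the sound device is opened like an ordinary file */
        int fd = open("/dev/dsp", O_RDWR);
        if (fd < 0)
            return 1;

        /* select the working mode: 16-bit samples, mono, 8 kHz */
        int fmt = AFMT_S16_LE, channels = 1, rate = 8000;
        ioctl(fd, SNDCTL_DSP_SETFMT, &fmt);
        ioctl(fd, SNDCTL_DSP_CHANNELS, &channels);
        ioctl(fd, SNDCTL_DSP_SPEED, &rate);

        /* one second of 16-bit mono sound at 8 kHz = 16000 bytes */
        char buf[16000];
        read(fd, buf, sizeof(buf));   /* recording */
        write(fd, buf, sizeof(buf));  /* playback  */

        close(fd);
        return 0;
    }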

There are plenty of sound libraries developed for Microsoft Windows; Windows 95 and 98 seem to be the most popular multimedia platforms. Microsoft has worked out at least three sound interfaces, which are commonly used.

1.2.2. PCM: Sampling and Quantization

The advantages of using digital audio are well known: it can be stored, processed, duplicated and transmitted in a simple way without losing its quality. Like any other computer data, it can be transmitted as ordinary data in packet-based computer networks with no additional mechanisms.

There are many ways of storing an audio signal in digital form. The most popular one is Pulse Code Modulation (PCM). A signal stored in PCM has the structure of a matrix (or a table). Every column of the matrix is called a sample, and every row is called a quantization level. The number of quantization levels is expressed in bits; for example, 16-bit quantization means that there are 2^16 (2 to the power of 16) quantization levels. We can also speak of 16-bit samples, referring to the size of each sample in the matrix. The number of samples per second is the sampling rate: e.g. if a 2-second signal is stored in 10000 samples, we call it a 5000 Hz (or 5 kHz) sampling rate. The higher the sampling rate and the number of quantization levels, the better the quality and dynamic range of the sound. For CD-Audio stored in PCM form (16-bit quantization, 44.1 kHz sampling rate, stereo), almost 180 kB are needed to store one second of sound. Hence the typical values in telephony are 8-bit quantization and an 8 kHz sampling rate, generating a 64 kbit/s bit-stream. The process of transforming an analog audio signal into digital form (and vice versa) is performed by sound devices. A description of the process is beyond the scope of this work; for additional information, refer to [2].
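
The bit rates quoted above follow directly from the PCM parameters; the short sketch below recomputes them (the helper name pcm_bps is only illustrative):

    #include <stdio.h>

    /* PCM bit rate = sampling rate * bits per sample * channels */
    static long pcm_bps(long rate, int bits, int channels)
    {
        return rate * bits * channels;
    }

    int main(void)
    {
        /* CD-Audio: 44.1 kHz, 16-bit, stereo -> 176400 bytes/s */
        printf("CD-Audio : %ld bytes/s\n", pcm_bps(44100, 16, 2) / 8);
        /* telephony: 8 kHz, 8-bit, mono -> 64000 bit/s */
        printf("telephony: %ld bit/s\n", pcm_bps(8000, 8, 1));
        return 0;
    }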

1.2.3. Audio Codecs

Simply speaking, coding is a way of transforming a signal from one form into another. Decoding is the transformation opposite to coding, i.e. the encoded form of the signal is transformed back into the original signal (or one similar to it). The items that encode or decode signals (in hardware or in software) are called coders or decoders, respectively. Codecs are all-in-one items that do both the coding and the decoding job. Audio codecs are items that work on audio signals. According to the definition above, the transformation from an electric signal to digital audio is also coding, but in this chapter the term "coding" refers to the transformation from one digital form into another.

There are many different reasons for using audio codecs: to make the signal less sensitive to interference, to allow transport over various networks, to encrypt the signal, etc. But the most important reason is saving bandwidth. Recording with an 8000 Hz sampling rate and 16-bit quantization generates a 128 kbit/s bit-stream, which requires quite a large bandwidth even for the commonest 10 Mbit/s LAN. Codecs which decrease the number of bits per second work like compression programs: they take a block of samples (N bytes) and transform it into an M-byte block (where M is less than N). Most audio codecs are lossy, because the human ear cannot detect slight differences between the genuine and the decoded signal.
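
As an example of such a bandwidth-saving transformation, the sketch below implements the classic mu-law companding of ITU-T Recommendation G.711, which maps each 16-bit sample to 8 bits and thus halves the 128 kbit/s stream mentioned above to 64 kbit/s (the constants follow the usual textbook formulation):

    #include <stdint.h>

    #define BIAS 0x84     /* bias added before encoding      */
    #define CLIP 32635    /* largest encodable magnitude     */

    /* encode one 16-bit linear PCM sample into one mu-law byte */
    static uint8_t linear_to_ulaw(int16_t sample)
    {
        int pcm  = sample;
        int sign = (pcm >> 8) & 0x80;   /* remember the sign bit */
        if (sign)
            pcm = -pcm;                 /* work on the magnitude */
        if (pcm > CLIP)
            pcm = CLIP;
        pcm += BIAS;

        /* find the segment (exponent) of the biased sample */
        int exponent = 7;
        for (int mask = 0x4000; (pcm & mask) == 0 && exponent > 0; mask >>= 1)
            exponent--;

        /* keep 4 mantissa bits just below the segment bit */
        int mantissa = (pcm >> (exponent + 3)) & 0x0F;

        /* mu-law bytes are transmitted bit-inverted */
        return (uint8_t)~(sign | (exponent << 4) | mantissa);
    }

Decoding reverses these steps; both directions, and the companion A-law variant, are fully specified in the G.711 recommendation.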

Codecs used in multimedia terminals must be standardized, so that terminals prepared by different producers can communicate with one another. There are many standards defined in ITU-T recommendations and by other international organizations.

For more information about audio coding, refer to [1] and [2].

1.2.4. Audio in packet-based networks

Packet networks are growing more and more popular in modern telecommunications. The idea of a packet network is that the endpoints are not connected directly (as they are, for instance, in telephone networks). Endpoints send their data in so-called packets, each containing the address of the receiver in its header field and a small quantity of data. Every packet is carried separately over the transmission medium. This is quite an economical solution (the whole path is not reserved for the purposes of one particular connection), but it also causes some problems: packets may be stored excessively long by the network's nodes (routers), and sometimes packets may be lost (e.g. because of the overflow of a router's buffers). This is all a result of the fact that most packet-based networks (especially IP networks) were designed for data transmission purposes. Data transmission does not usually require real-time mechanisms; the greater emphasis is placed on the reliability of the transmission media and protocols. The problem of lost packets was solved by developing reliable protocols (at the cost of delay). But the delay problem cannot be solved in a simple way; in some types of networks it cannot be solved at all.

Audio in packet-based networks is transmitted in packets like ordinary data: a frame (some amount of time long) is stored in a buffer and then transmitted into the network, as in the sketch below. The receiver of the audio packet stream must be ready for two events: some packets may be lost on the way, and packets may arrive with a variable delay (possibly even out of order).
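
A minimal sketch of the sender side is given below; it assumes 20 ms frames of 8 kHz, 16-bit mono PCM read from an already configured sound device descriptor (as in section 1.2.1) and sent over a UDP socket. The function name send_audio and the address parameters are only illustrative:

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define FRAME_SAMPLES 160                 /* 20 ms at 8000 Hz */
    #define FRAME_BYTES   (FRAME_SAMPLES * 2) /* 16-bit samples   */

    /* dsp_fd: descriptor of an already configured sound device */
    void send_audio(int dsp_fd, const char *dest_ip, int dest_port)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in to;
        memset(&to, 0, sizeof(to));
        to.sin_family = AF_INET;
        to.sin_port   = htons(dest_port);
        inet_pton(AF_INET, dest_ip, &to.sin_addr);

        char frame[FRAME_BYTES];
        for (;;) {
            /* block until one full frame has been recorded */
            ssize_t n = read(dsp_fd, frame, FRAME_BYTES);
            if (n <= 0)
                break;
            /* one frame -> one packet */
            sendto(sock, frame, n, 0, (struct sockaddr *)&to, sizeof(to));
        }
        close(sock);
    }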

The aspects above are more thoroughly described in [2] and [10].