Chapter 2. Implementation

Table of Contents
2.1. Audio Engine Architecture
2.2. Overview of this software
2.3. Communication with RTP: Audio Queue Module
2.4. Audio Codec Controlling: Codec Management Module
2.5. Capturing and Playing Module
2.6. Implementation of G.711 codec

This chapter describes how the Audio Modules are implemented in this project. The description covers the Capturing and Playing Module for 32-bit Windows systems, the Audio Codec Module (intended to be portable), the Audio Queue Module used to exchange audio frames between the Audio and RTP modules (portable), and finally the G.711-based codec implementation (also portable).

Audio Modules are very system-dependent. They must be designed and prepared taking into consideration the capabilities of the target OS and of the audio library in use. Anyone preparing versions for other operating systems and/or other sound libraries should redesign and rewrite this module completely. This version of the Audio Engine is designed for the Windows 95, 98, NT, and 2000 operating systems (NT/2000 are recommended) and is based on the rules described in Section 2.1. The audio mechanism used by the project is Waveform Audio.

2.1. Audio Engine Architecture

The Capturing and Playing Module is the part of an application (e.g. a multimedia terminal) responsible for correctly capturing and playing real-time sound. There are plenty of audio mechanisms available for various operating systems, but only a small subset of them is suitable for real-time applications. Real-time audio programming and full-duplex programming require a specific approach; the programmer must often make the most of the available device and the programming libraries in use. This chapter explains how to design such a real-time application module.

2.1.1. Audio Streams

In a multimedia terminal two types of audio streams can be distinguished: an input stream and an output stream. They should be treated independently.

Figure 2-1. Input Audio Stream

In the case of the input stream, as shown in Figure 2-1, the sound is taken from the microphone, then sampled and prepared by the Input Audio Engine. Next, it is encoded and sent to the Real Time Protocol (RTP) process. The RTP process puts the encoded sound (the payload) into a special frame and sends it to the network.

Figure 2-2. Output Audio Stream

In the case of the output stream (Figure 2-2), the sound payload is taken from the RTP process, then decoded and sent to the Output Audio Engine, which is responsible for continuous playback of the audio signal.

The RTP module is beyond the scope of this work. The Coder and Decoder Modules are very simple from the user's point of view: they are usually procedures that take a block of data and return another ("encoded" or "decoded") data block. The remaining work is done by the Capturing and Playing Module. Fortunately, low-level sampling is performed by the OS and the sound device. It is the Capturing and Playing Module that supports real-time capturing and playing of the sounds. It is the Capturing and Playing Module that examines the sound devices and their capabilities. Finally, it is the Capturing and Playing Module that fills the empty space when sound frames are not received from the network.

2.1.2. Capturing Loop

A very popular scenario for real-time sound processing consists of a "capturing loop" and a "playing loop". Each loop is independent of the others; they may be organized as separate threads or even as separate processes. To simplify matters, the capturing, encoding, decoding and playing procedures are referred to as threads from here on.

while(EXIT_CONDITION == FALSE) 
{
	TABLE_OF_SAMPLES = get N samples from 'INPUT SOUND DEVICE';
	pass TABLE_OF_SAMPLES to 'CODER PROCEDURE';
}

Figure 2-3. Capturing Thread

As the pseudo-code in Figure 2-3 shows, the procedure just takes N samples from the device (using a given API) and passes them to the audio coder procedure. The encoding thread can also be organized as a loop (as shown in Figure 2-4). The reason for separating the encoding thread is the duration of the encoding procedure: it takes some time, so it should not be called from inside the capturing thread (as it is in Figure 2-3). Looking at Figure 2-1 it might appear that audio coders operate on a flowing audio stream and generate a flowing one, but in fact audio coders take a block of samples as their input and generate a bit-stream block. The simplest encoding function would have the prototype: encode(void *input_data, void *output_data); a sketch of such an interface follows Figure 2-4.
while(EXIT_CONDITION == FALSE) 
{
	SAMPLES = get TABLE_OF_SAMPLES from 'INPUT AUDIO ENGINE';
	ENCODED_FRAME = encode SAMPLES;
	pass ENCODED_FRAME to 'RTP PROCEDURE';
}

Figure 2-4. Encoding Thread
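
For illustration only, such a block-oriented codec interface might be declared as below. The names (codec_t, frame_samples and so on) are hypothetical sketches, not the actual interface of the Codec Management Module described in Section 2.4:

typedef struct codec {
	int frame_samples;	/* PCM samples consumed per encode call */
	int frame_bytes;	/* encoded bytes produced per encode call */
	/* both calls process exactly one complete block at a time */
	int (*encode)(const short *pcm_in, unsigned char *bits_out);
	int (*decode)(const unsigned char *bits_in, short *pcm_out);
} codec_t;

Such a declaration makes the block-based nature of the coders explicit: the caller is responsible for delivering exactly frame_samples samples per call, which is precisely what the capturing loop does.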

Unfortunately, the illustrated methods do not support a continuous signal. The capturing thread spends some time passing the data to the coder procedure and only then starts sampling the next block (by calling "get N samples from"). Occasionally, the capturing thread may be suspended by the operating system after passing samples to the encoding thread but before starting to sample the next block. This would cause an unintentional break in the capturing. The designer of the module cannot afford such a situation, so the above algorithm has to be changed.

First, sampling of the next block should start immediately after the previous one. This feature must be supplied by the operating system (OS), and the Windows 9x/NT systems provide it. Before sampling starts, the user hands a number of buffers over to the OS. Then the sampling is started: the system fills every buffer and, after completing the current one, immediately begins filling the next. The system also signals the user that sampling of a certain block has been completed. The user can process the block and, once this work is done, return the buffer to the system so it can be filled again, and again.
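
The sketch below illustrates this priming phase with the actual Waveform Audio calls. It assumes an input device already opened with waveInOpen; error handling (every call returns an MMRESULT) is omitted for brevity:

#include <windows.h>
#include <mmsystem.h>			/* link with winmm.lib */

#define N_BUFFERS	3		/* at least 2; 3 or more is safer */
#define BUFFER_BYTES	1600		/* 100 ms of 8 kHz 16-bit mono PCM */

static WAVEHDR	hdr[N_BUFFERS];
static char	buf[N_BUFFERS][BUFFER_BYTES];

/* Hand every buffer over to the system, then start sampling.  The
   system fills the buffers one after another without any gap. */
static void prime_and_start(HWAVEIN hwi)
{
	int i;

	for (i = 0; i < N_BUFFERS; i++) {
		hdr[i].lpData         = buf[i];
		hdr[i].dwBufferLength = BUFFER_BYTES;
		hdr[i].dwFlags        = 0;
		waveInPrepareHeader(hwi, &hdr[i], sizeof(WAVEHDR));
		waveInAddBuffer(hwi, &hdr[i], sizeof(WAVEHDR));
	}
	waveInStart(hwi);
}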

Now there is no need to separate the encoding thread from the capturing thread: on contemporary personal computers, sampling a block takes much more time than encoding it. This saves the designer from programming inter-thread communication between the capturing and encoding threads, and it also simplifies control over the threads. The new (safe) procedure is shown in Figure 2-5.

Figure 2-5. Safe Capturing Loop

The improved algorithm (with encoding included) is shown in Figure 2-6.
for (i = 0; i < NUMBER_OF_SYSTEM_BUFFERS; i++)
{
	put BUFFER[i] to 'OPERATING SYSTEM';
}

start capturing to buffers in 'OPERATING SYSTEM';

while(EXIT_CONDITION == FALSE)
{
	wait for the current buffer to be filled;
	SAMPLE_TABLE = get the recently filled buffer from 'OPERATING SYSTEM';
	ENCODED_TABLE = encode SAMPLE_TABLE;
	pass ENCODED_TABLE to 'RTP PROCEDURE';
	put SAMPLE_TABLE to 'OPERATING SYSTEM';
}

Figure 2-6. Improved Capturing Loop

As shown in Figure 2-6, the first phase of the capturing thread is putting NUMBER_OF_SYSTEM_BUFFERS buffers to the operating system. This number should be at least two (to deliver a continuous audio signal), but three or more is a good choice on some computer systems. The buffers are represented by structures (in the C sense) containing a pointer to the memory allocated by the user, the length of that memory and some additional flags. A number of these flags are used by the sound library itself; the others may be used by the user (programmer). For instance, one of them can be used to name a given buffer (by setting a unique value), and another flag is set to FALSE by the user and set to TRUE by the system when the buffer has been filled. Using this flag the user can recognize whether more than one buffer has been filled during a single pass through the capturing loop body (this may happen if the OS is busy with other processes).
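
Under Waveform Audio the buffers are described by the WAVEHDR structure, declared in <mmsystem.h>; its fields map directly onto the description above:

typedef struct wavehdr_tag {
	LPSTR     lpData;		/* pointer to the user-allocated memory */
	DWORD     dwBufferLength;	/* length of that memory, in bytes */
	DWORD     dwBytesRecorded;	/* set by the system while recording */
	DWORD_PTR dwUser;		/* free for the user, e.g. a buffer name */
	DWORD     dwFlags;		/* WHDR_DONE set when the buffer is full */
	DWORD     dwLoops;		/* loop count, used for output only */
	struct wavehdr_tag *lpNext;	/* reserved */
	DWORD_PTR reserved;		/* reserved */
} WAVEHDR;

The WHDR_DONE bit of dwFlags plays the role of the TRUE/FALSE flag mentioned above, and dwUser is the field available for naming a buffer.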

In this implementation the capturing loop is organized as a separate thread. The thread is started by the start_recorder routine, which initializes the data structures, especially the headers (using waveInPrepareHeader), adds them to the system buffers, runs recorder_thread as a new thread and returns a handle of the type snd_thread_id_ptr. Through this handle the user can control the recording thread, and in particular kill it when it is no longer required (using the stop_recorder procedure).
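
A minimal sketch of the shape of start_recorder goes below, reusing prime_and_start() and the buffers from the earlier sketch (for simplicity the sketch starts the sampling from start_recorder rather than from inside recorder_thread). The snd_thread_id structure shown here is reduced to the fields the sketch needs; the real one also carries codec and queue parameters:

#include <stdlib.h>

typedef struct snd_thread_id {
	HWAVEIN hwi;		/* the open input device */
	HANDLE  thread;		/* Win32 handle of recorder_thread */
	HANDLE  event;		/* registered with waveInOpen (CALLBACK_EVENT) */
	volatile BOOL stop;	/* set to TRUE by stop_recorder */
} snd_thread_id, *snd_thread_id_ptr;

static DWORD WINAPI recorder_thread(LPVOID arg);	/* shown below */

snd_thread_id_ptr start_recorder(HWAVEIN hwi, HANDLE event)
{
	snd_thread_id_ptr id = malloc(sizeof(snd_thread_id));

	id->hwi   = hwi;
	id->event = event;
	id->stop  = FALSE;
	prime_and_start(hwi);	/* prepare the headers and queue them */
	id->thread = CreateThread(NULL, 0, recorder_thread, id, 0, NULL);
	return id;
}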

The recorder_thread procedure starts sampling and enters the main capturing loop. At the beginning of the loop the thread blocks itself; it is the system that unblocks the recording thread when a block of samples has been recorded. The recorder then encodes the block of samples using the specified audio coder. The encoded frame is put into the queue owned by the RTP thread, the empty header is returned to the system (waveInAddBuffer), and the thread checks whether the stop condition is true (if it is, the thread resets, stops and closes the audio device and frees all allocated data). If not, the thread jumps back to the beginning of the loop, i.e. it blocks itself again, and so forth.
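
A sketch of this loop goes below, under the CALLBACK_EVENT assumption from the previous sketches. encode_frame() and rtp_queue_put() are hypothetical stand-ins for the audio coder and for the queue owned by the RTP thread:

void encode_frame(const short *pcm, unsigned char *bits);	/* hypothetical */
int  rtp_queue_put(const unsigned char *bits);			/* hypothetical */

static DWORD WINAPI recorder_thread(LPVOID arg)
{
	snd_thread_id_ptr id = (snd_thread_id_ptr)arg;
	unsigned char frame[BUFFER_BYTES];
	int i;

	while (!id->stop) {
		/* block until the driver signals a completed block */
		WaitForSingleObject(id->event, INFINITE);
		/* more than one buffer may be done if the OS was busy */
		for (i = 0; i < N_BUFFERS; i++) {
			if (!(hdr[i].dwFlags & WHDR_DONE))
				continue;
			encode_frame((short *)hdr[i].lpData, frame);
			rtp_queue_put(frame);	/* hand over to the RTP thread */
			hdr[i].dwFlags &= ~WHDR_DONE;
			/* return the empty buffer to the system */
			waveInAddBuffer(id->hwi, &hdr[i], sizeof(WAVEHDR));
		}
	}
	waveInReset(id->hwi);	/* return all pending buffers */
	waveInStop(id->hwi);
	waveInClose(id->hwi);	/* unpreparing and freeing is omitted here */
	return 0;
}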

2.1.3. Playing Loop

Playing must be organized in a different way. First, it must maintain a buffer of frames ready to be played (an anti-jitter buffer). Second, it must occasionally be ready to fill in missing frames. The pseudo-code of such a procedure goes below:

while('ANTI-JITTER BUFFER' is not fully filled)
{
	ENCODED_TABLE = get from 'RTP PROCEDURE';
	SAMPLE_TABLE = decode ENCODED_TABLE;
	put SAMPLE_TABLE into 'ANTI-JITTER BUFFER';
}

start playing from 'ANTI-JITTER BUFFER';
	
while(EXIT_CONDITION == FALSE)
{
	wait for the current buffer to finish playing;
	ENCODED_TABLE = get from 'RTP PROCEDURE';
	if (ENCODED_TABLE == EMPTY) then /* no frame from RTP */
		SAMPLE_TABLE = prepare the virtual frame; 
	else
		SAMPLE_TABLE = decode ENCODED_TABLE;
	put SAMPLE_TABLE into 'ANTI-JITTER BUFFER';
}

Figure 2-7. Playing Thread

Figure 2-7 shows an Output Audio Engine procedure. The first part of the procedure stores a number of incoming frames in the "anti-jitter buffer" (AJB); the number of frames (or the number of milliseconds of the AJB) is passed as one of the procedure parameters. When the buffer is long enough, playing is started. Working in a loop, the playing thread acquires an audio frame from the RTP thread (the RTP thread receives the frame directly from the network). If the frame could not be acquired, a "virtual frame" is prepared by special procedures (to supply a continuous audio signal); otherwise the frame is decoded and passed to the AJB. The sound is then played from this buffer.

The algorithm described above shows just the main idea of the procedure and is therefore very simplified. First, samples should not be played directly from the anti-jitter buffer. Second, the preparation of missing frames is sometimes a very complicated process, and usually it is not performed immediately after a single delayed frame. Finally, modern, complex algorithms are being developed to compensate for the lack of Quality of Service (QoS) in certain packet-based networks. These algorithms are not taken into consideration here.

The playing loop is organized as a separate thread called the Playing Thread, started by the start_player routine. This function prepares all the data structures for the playing thread (e.g. the headers), starts the new thread and returns an snd_thread_id_ptr. At first the playing thread waits for a sufficient number of digital audio frames in the queue (this number is a parametrized value); the waiting is implemented by the blocking supported by the queue between the RTP and playing threads. When the waiting is over, the playing thread gets a few frames from the queue, decodes them, sends them to the system (waveOutWrite) and starts playback (waveOutRestart). Then it blocks itself, waiting for an audio frame to finish playing. For every finished frame the thread is unblocked, gets the next frame from the queue, decodes it and sends it to the system. If no frame is available (which means that something must have gone wrong: a delay or even the loss of an RTP packet), the playing thread prepares a frame from the previous one.
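
A sketch of the heart of this loop goes below, mirroring the recording sketches. The output device is assumed to be opened with CALLBACK_EVENT, and the playing variant of snd_thread_id is assumed to carry it as an hwo field; out_hdr[] are prepared WAVEHDR structures already pre-filled and written once by start_player; rtp_queue_get(), decode_frame() and fill_virtual_frame() are hypothetical stand-ins for the RTP queue, the decoder and the frame-repair procedure:

static WAVEHDR out_hdr[N_BUFFERS];		/* prepared by start_player */

int  rtp_queue_get(unsigned char *bits);	/* hypothetical, 0 on success */
void decode_frame(const unsigned char *bits, short *pcm);	/* hypothetical */
void fill_virtual_frame(short *pcm);		/* hypothetical */

static DWORD WINAPI player_thread(LPVOID arg)
{
	snd_thread_id_ptr id = (snd_thread_id_ptr)arg;
	unsigned char frame[BUFFER_BYTES];
	int i;

	/* resume playback; waveOutPause was called while pre-filling */
	waveOutRestart(id->hwo);
	while (!id->stop) {
		/* block until the driver finishes playing a buffer */
		WaitForSingleObject(id->event, INFINITE);
		for (i = 0; i < N_BUFFERS; i++) {
			if (!(out_hdr[i].dwFlags & WHDR_DONE))
				continue;
			if (rtp_queue_get(frame) == 0)	/* frame arrived in time */
				decode_frame(frame, (short *)out_hdr[i].lpData);
			else				/* delayed or lost packet */
				fill_virtual_frame((short *)out_hdr[i].lpData);
			out_hdr[i].dwFlags &= ~WHDR_DONE;
			waveOutWrite(id->hwo, &out_hdr[i], sizeof(WAVEHDR));
		}
	}
	return 0;
}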

It is clear that the capturing and playing procedures should be organized separately: they should be implemented as separate processes or separate threads. Another reason to do so is that these procedures should be given special privileges; they are truly real-time procedures, so any delay or suspension of them can cause serious effects. The privileges are very system-dependent and will not be described in detail in this chapter.
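
As a Win32 illustration only: the priorities of the process and of the audio threads can be raised, e.g. from start_recorder and start_player right after CreateThread (the exact priority level is a matter of tuning):

/* give the whole process a high priority class ... */
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
/* ... and mark the audio thread itself as time-critical */
SetThreadPriority(id->thread, THREAD_PRIORITY_TIME_CRITICAL);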