A Wavelet Synthesis Technique for Creating Realistic Virtual Environment Sounds

Nadine E. Miner, Sandia National Laboratories, Albuquerque, NM
Thomas P. Caudell, Department of Electrical and Computer Engineering, University of New Mexico
Abstract

This paper describes a new technique for synthesizing realistic sounds for virtual environments. The four-phase technique uses wavelet analysis to create a sound model. Parameters are extracted from the model to provide dynamic sound synthesis control from a virtual environment simulation. Sounds can be synthesized in real time using the fast inverse wavelet transform. Perceptual experiment validation is an integral part of the model development process. This paper describes the four-phase process for creating the parameterized sound models, several developed models, and the perceptual experiments used to validate the sound synthesis veracity. The developed models and results demonstrate proof of the concept and illustrate the potential of this approach.
1 Introduction
Stochastic, nonpitched sounds fill our real-world environment. Here, stochastic sounds are defined as nondeterministic, randomly varying sounds. Many nonpitched, stochastic sounds have a characteristically identifiable structure. For example, the sound of rain has a characteristic structure that makes it easily identifiable as rain and easily distinguishable from random noise. Humans almost continuously hear stochastic sounds such as wind, rain, motor sounds, and different types of impact sounds. Because of their prevalence in real-world environments, it is important to include these types of sounds in realistic virtual environment simulations. Most current virtual reality (VR) systems use digitized sounds rather than synthesized sounds. Digitized sounds are static and do not dynamically change in response to user actions or to changes within a virtual environment. Creating an acoustically rich virtual environment can require thousands of sounds and their variations. Using a digitized sound approach requires switching between static sounds. Furthermore, obtaining a digitized, application-specific sound sequence is difficult and often impractical (Miner, 1994). The alternative to using digitized sound is to use sound synthesis. Although sound synthesis may be preferred, essentially no virtual sound systems are available today that provide flexible, real-time sound synthesis tools for the virtual world builder. The approach described in this paper is a step towards filling this void. The synthesis technique provides a method for creating flexible, dynamic sound models that yield a variety of sounds and increase the richness and realism of a virtual experience.
The overall goal of this research is to develop methods for synthesizing perceptually compelling sounds for virtual environments. The perceptual believability of a synthesized sound is the ultimate test of success. One advantage of this approach is that perceptually convincing sounds need not be mathematically precise. Creating physically accurate simulations of complex sounds is computationally intensive. It is anticipated that synthesis of perceptually convincing sounds will be less so because evaluation of complex physics equations is not required. This research develops some new parameterized models to synthesize sounds. A parameterized model is one in which changing parametric values prior to simulation results in a new synthesized sound. There are two reasons for choosing a parameterized model approach. First, parameterization provides the possibility of obtaining a variety of sounds from a single model. (For example, one parameterized rain model might generate the sound of light rain, medium rain, heavy rain, and the sound of a waterfall.) The second reason is to create dynamic sound models: manipulating the sound model parameters in real time can yield a dynamically changing sound. With the rain model example, changing the parameters as the virtual simulation evolves allows the rain sound to progressively and dynamically increase in intensity as the graphics simulation shows increasing and darkening clouds. Overall, model parameterization provides flexibility and dynamic control such that a variety of sounds result from a small model set. The synthesis method described uses wavelets for modeling non-pitched stochastic-based sounds. It is likely that this method will be equally successful in synthesizing pitched sounds. Wavelet analysis provides an efficient method of extracting model parameters. Parameter modification and sound synthesis can be accomplished in real time in parallel with a virtual environment simulation. Overall, wavelets are highly appropriate for modeling real-world sounds and providing real-time sound synthesis. This paper describes a four-phase model development and sound synthesis process. Three perceptual experiments were conducted to validate the sound synthesis veracity (that is, perceptual quality). These experiments and results are briefly described here, although a more
extensive description of the experiments is contained in Miner, Goldsmith, and Caudell (2002). Experimental results indicated that the synthesized sounds are perceptually convincing to human listeners. Finally, this paper describes several different parameterized stochastic sound models developed to demonstrate the functionality and potential of this sound synthesis approach.
1.1 Related Work

Some related work exists in the literature on synthesizing real-world sounds using dynamic, parameterized models. Gaver (1994), in developing a sound interface for human-computer interaction, proposed some physical-like models for real-world sounds. Gaver implemented parameterized models for impact, scraping, breaking, and bouncing sounds. The synthesis algorithms succeeded in creating parameterized sounds in real time; however, the results were somewhat cartoon-like and required training to interpret. Van den Doel and Pai (1996) proposed a general framework for producing impact sounds for virtual environments. They used physical modeling of the vibration dynamics of physical bodies. The models were parameterized by the material and object shape, and by the collision force and location. Prototype sound simulations produced realistic impact sounds. Smith (1992) used a "digital waveguide" method for developing physical models of string, wind, and brass instruments. This method yields excellent-quality music synthesis, and some high-end synthesizer keyboards are based on this technology. In recent work, Cook (1997) explored the synthesis of percussive sounds. Cook introduced the PhISM (Physically Informed Sonic Modeling) approach, which is based on the spectral modeling technique of modal synthesis but with controls and parameters that have physical meaning. The PhISM approach yields perceptually convincing, real-time synthesis of a variety of sounds. A major difference between these methods and the one proposed here is the emphasis on the importance of modeling the stochastic sound components. Serra (1989) showed that incorporating stochastic components in a sound model results in sound simulations
with more realism. The wavelet-based modeling approach presented here focuses on capturing and modeling the stochastic components of sounds. The result is realistic sound synthesis of real-world sounds.
2 Sound Synthesis Using Wavelets
The sound synthesis method described here uses wavelets for modeling stochastic-based sounds (Miner, 1998a). The Fourier theorem is the basis of many signal analysis techniques, including the Fourier transform (FT), the short-time Fourier transform (STFT), and, more recently, the wavelet transform. The Fourier theorem states that all signals are composed of a combination of sine waves of varying frequency, amplitude, and phase that may or may not change with time. The FT technique breaks a signal up into its constituent sinusoidal components. The FT method is most useful when considering stationary signals (that is, signals that do not change over time). However, most real-world signals are not stationary in time. An FT variation that captures some time-varying information by analyzing signal "windows" is the short-time Fourier transform. The STFT captures the frequency information for different sections of time, but the resolution is limited and fixed by the choice of window size. The wavelet transform (WT) was selected for this work because the FT and STFT methods do not adequately model the time-varying nature of real-world signals. Wavelet analysis provides a time-based windowing technique with variable-sized windows. Wavelets examine the high-frequency content of a signal with a narrow time window and the low-frequency content with a wide time window. Fast wavelet algorithms provide the potential for synthesizing wavelet-modeled sounds in real time; according to Ogden (1996), they are comparable in compute time to the fast Fourier transform algorithms.
2.1 Background on Wavelets

Wavelet analysis is a logical approach for analysis of time-varying, real-world signals. As with the FT and
STFT methods, wavelet analysis consists of signal decomposition (wavelet transform) and reconstruction (inverse wavelet transform) phases. Alfred Haar (1910) is credited with the first use of a wavelet, the Haar wavelet, although the term wavelet was not coined until Morlet used it in his signal-processing work of 1983. Esteban and Galand (1977), in their subband coding research, proposed a nonaliasing filtering scheme. With this scheme, signals are filtered into low- and high-frequency components with a pair of filters. The filters are mirror images with respect to the middle, or quadrature, frequency, π/2 (Strang & Nguyen, 1996). Filters chosen according to this scheme are called quadrature mirror filters (QMFs) or conjugate quadrature filters (CQFs). Wavelet functions developed with QMFs provide exact signal reconstruction. Stromberg (1982) is often credited with the development of the first orthonormal wavelets. However, the system introduced by Yves Meyer in 1985 received more recognition and became known as the Meyer basis (Meyer, 1993). Orthonormal wavelet functions provide a means for the efficient decomposition of a signal. These functions ultimately define a specific set of filters for signal decomposition and reconstruction and provide for real-time wavelet synthesis. Ingrid Daubechies (1988) constructed wavelet bases with compact support, meaning that the wavelets are nonzero on an interval of finite length (as opposed to the infinite interval length of the FT's sine and cosine basis functions). Compactly supported wavelet families accomplish signal decomposition and reconstruction using only finite impulse response (FIR) filters. This development made the discrete-time wavelet transform a reality. Stephane Mallat proposed the fast wavelet transform (FWT) algorithm in 1987 (Mallat, 1989). This technique is unified with other noise-reduction techniques through the concepts of multiresolution analysis (MRA), which is based on the idea that objects can be examined using varying levels of resolution. Cohen and Ryan (1995) provided a more complete mathematical description of MRA and the associated properties.
Figure 1. Illustration of wavelet transform steps to calculate wavelet coefficients, Dij (example uses Daubechies 4 (db4) wavelet type).
2.2 The Wavelet Transform

The wavelet transform decomposes a signal into wavelet coefficients through a series of filtering operations. The wavelet transform is similar to the STFT in that both techniques analyze an input signal in sections by translation of an analysis function. With the STFT, the analysis function is a window; the window is translated in time but is not otherwise modified. The wavelet approach replaces the STFT window with a wavelet function, ψ. The wavelet function is scaled (or expanded or dilated) in addition to being translated in time. The ψ is often called a mother wavelet because it "gives birth" to a family of wavelets through the dilations and translations. A wavelet is not necessarily symmetric, but, for perfect reconstruction to be possible, it does satisfy ∫ψ(x) dx = 0. Other properties of the wavelets used in the sound synthesis approach presented in this paper are orthonormality and compact support. Wavelet families that satisfy these conditions are the Daubechies wavelets (often denoted by dbN, where N is the wavelet order), Symlet wavelets (symN), and Coiflet wavelets (coifN). The sound synthesis method proposed here uses the Daubechies wavelets, although other wavelet families may prove equally viable. The choice of wavelet type is highly application specific. Figure 1 graphically illustrates the wavelet transform
steps on an arbitrary signal using the Daubechies 4 (db4) wavelet type. First, the wavelet is compared against an input signal section. A measure of the goodness of fit between the wavelet and the input signal is captured in a wavelet coefficient (indicated by Dij in figure 1). Large coefficients indicate a good fit. Next, the wavelet is shifted (or translated) in time and the comparison operation is repeated, resulting in another wavelet coefficient. This translation and comparison process is repeated for the duration of the input signal. All of these wavelet coefficients are considered to be on the same level. Stretching (or scaling) the wavelet and repeating the series of comparison and translation operations create subsequent levels of wavelet coefficients. The result is a set of wavelet coefficients (referred to as detail and approximation coefficients) that completely describe the input signal. When moving between levels, the wavelet is most commonly scaled or stretched by a factor of 2; thus, scaling is also known as dilation. The scale parameter, a, indicates the analysis level. Small values of a provide a local, fine-grain, or high-frequency analysis, whereas large values correspond to large-scale, coarse-grain, or low-frequency analysis. Translation is often referred to by the b parameter, which moves the time localization center of each wavelet; thus, each ψa,b(x) is localized around x = b. Two functions are used in a wavelet analysis: the wavelet function and the scaling function. The wavelet and scaling functions are orthogonal to each other. The wavelet function creates a high-pass filter (gk) that provides the detail coefficients; the scaling function has low-frequency oscillations and is used to create a low-pass filter (hk) to provide the approximation coefficients. The wavelet and scaling filters are quadrature mirror filters (QMFs), and this makes perfect signal reconstruction possible. Many additional sources are available to provide more details of wavelet analysis. Misiti et al. (1996) provided a high-level treatment of wavelets. More formal mathematical treatments of wavelets can be found in Daubechies (1992), Cohen and Ryan (1995), Ogden (1996), and Meyer (1993). An introductory tutorial on wavelet analysis is available on the Web (Miner, 1998b).
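To make the filtering view concrete, one level of the decomposition can be sketched directly in Matlab. This is our illustration, not code from the paper; it assumes the Wavelet Toolbox primitives wfilters and dyaddown, and it ignores the border-extension details that the built-in dwt function handles internally.

    % One level of wavelet decomposition as QMF filtering plus downsampling.
    f = randn(1, 1024);                   % stand-in for a digitized sound frame
    [Lo_D, Hi_D] = wfilters('db4', 'd');  % db4 decomposition low-/high-pass QMF pair
    A1 = dyaddown(conv(f, Lo_D));         % approximation coefficients (low-pass branch)
    D1 = dyaddown(conv(f, Hi_D));         % detail coefficients (high-pass branch)
    % Recursing on A1 yields the next, coarser level of coefficients.
    A2 = dyaddown(conv(A1, Lo_D));
    D2 = dyaddown(conv(A1, Hi_D));

Each pass halves the sampling rate, which is why small scales correspond to fine-grain, high-frequency analysis and large scales to coarse-grain, low-frequency analysis.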
Figure 2. Four-phase process for sound synthesis model development.
3 Development of Sound Synthesis Models
Development of the wavelet sound model is accomplished through a four-phase process (as shown in figure 2): analysis, parameterization, synthesis, and validation.
3.1 Analysis Phase

The analysis phase begins with a digitized sound sample. The digitized sound representation can be obtained in a number of different ways, including digitally recording a real-world sound using a DAT recorder or computer; digitizing an analog recording of a sound; obtaining a digitized sound from a sound effects library, CD, or the Internet; and accepting the digitized representation from a computer simulation of a physical event. Next, a wavelet type (ψ) and scaling function (φ) are selected for the decomposition process. The best choice of wavelet is one that decomposes the salient signal features most effectively. However, identifying the salient signal features is difficult because these features vary from signal to signal and application to application.
A generalized algorithm for determining the best wavelet for decomposition does not exist. We propose examining the original digitized sound signals at different scales (that is, time domain expansion and contraction) to determine the best wavelet type, in terms of shape similarity between the wavelet and various sound characteristics at different levels. Wavelet type selection is largely an iterative process based on how well the original signal can be resynthesized. The models presented here used Daubechies wavelet types db4, db5, and db6 and the corresponding scaling functions from the standard Daubechies family. Choosing a wavelet type from this family is a safe first choice for decomposing any sound. Once the wavelet and corresponding scaling function are selected, the original digitized sound is decomposed using the discrete wavelet transform (DWT). The two wavelet coefficient sets resulting from the first level of decomposition are referred to as the approximation coefficients, in vector A1, and the detail coefficients, in vector D1 (that is, Aj and Dj, where j = level). In a multilevel decomposition, the approximation coefficients are decomposed into coarser-grained coefficient vectors by recursing on the decomposition algorithm. Each coefficient vector serves as the input to successive wavelet decomposition stages. The second-level coefficients are denoted by AA2 and DA2. Given that the original signal has length N, the DWT consists of at most log2(N) stages. The result of the DWT is a set of approximation and detail coefficients that contain all of the time-varying frequency information of the original signal. These coefficients become the parameters that control the sound synthesis. Multiple levels of decomposition provide access to different sound frequency components. The choice of decomposition level for developing sound synthesis models is largely iterative, as parameterization and validation experiments serve to refine the selection. We decomposed to level 5 for the models presented here. All wavelet operations for this research, including decomposition and reconstruction, were performed using Matlab. Software systems that support wavelet operations typically contain a single-level or multilevel decomposition function. In Matlab, these functions are
dwt and wavedec, respectively, and the inputs to these functions include an input signal vector, the desired decomposition level, and the wavelet type. The function output is a set of coefficient vectors and the corresponding vector lengths. For example, in Matlab, the signal f is decomposed to level three using the Daubechies wavelet type 'db2' with the command [C,L] = wavedec(f,3,'db2'). The resulting four wavelet coefficient groups are contained in the vector C, namely the approximation coefficients, AAA3, followed by the detail coefficients DAA3, DA2, and D1. The length of each wavelet coefficient group is maintained in the L vector. The wavelet coefficients become the inputs for the parameterization phase, as described in subsection 3.2.
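As a concrete sketch of the analysis phase, the following Matlab fragment decomposes a base sound to level 5 with db4 (the configuration used for the models in section 4) and pulls out the coefficient groups that later become parameters. The file name is hypothetical, and audioread is the modern I/O routine; the paper's Matlab 4.2 setup used custom input/output functions.

    % Analysis phase sketch: level-5 db4 decomposition of a base sound.
    [f, fs] = audioread('rain_base.wav'); % hypothetical digitized base sound
    f = f(:, 1)';                         % one channel, as a row vector
    [C, L] = wavedec(f, 5, 'db4');        % C = [A5 D5 D4 D3 D2 D1]; L = group lengths
    A5 = appcoef(C, L, 'db4', 5);         % level-5 approximation coefficients
    D1 = detcoef(C, L, 1);                % level-1 detail coefficients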
Figure 3. Illustration of a perceptual sound space. Circles indicate extent of synthetic sounds possible from parameterized models.
3.2 Parameterization Phase

The second phase of the model development process is parameterization, which entails determining groups of wavelet coefficients, and specific modifications to their values, that provide perceptually convincing sound synthesis. The wavelet decomposition coefficients are the source of the parameters for the sound synthesis model. Depending on the level of decomposition, essentially unlimited control of amplitude, time, and frequency is available; however, the parameters are not directly related to the physical characteristics of the sound source, as they are in other parametric approaches (such as that of Van den Doel and Pai (1996)). Determining the sound model parameterization is largely an iterative process. For example, increasing the low-frequency content of a model can result in the perception of a larger sound source having generated the sound. Manipulating the low-frequency and high-frequency coefficients (or parameters) of an engine model turns the sound of a standard-sized car engine into the sound of a large truck or a small toy car, respectively. For reconstruction, the sound is synthesized using the modified wavelet coefficients, and the result is perceptually analyzed. The process iterates by determining what additional parameter manipulations are required to obtain the desired sound. If more high-frequency information is required, the detail coefficients are modified further.
The cycle of parameter modifications, synthesis, and evaluation continues until a clear definition of parameterization and coefficient manipulation is established for changing the original sound into a variety of new sounds. Parameter manipulations perceptually modify the synthesized sound. A perceptual sound space diagram, as depicted in figure 3, represents the effect of parameter manipulations on three base sounds. The axes represent perceptual sound dimensions. These dimensions are the perceived result of changes in the model parameters. Each circle represents the variety of perceived sounds achievable from an individual wavelet model. The center of each circle represents the original digitized sound (base sound) from which the model was developed. Parameter manipulation extends the sound perception into many dimensions. It is feasible to move from one type of sound to another by changing the parameter settings, as indicated in the figure by the overlapping circles. For example, manipulating the rain model parameters yields sounds that include light rain, medium rain, heavy rain, a small waterfall, and some motor sounds. We have examined three different types of parameter manipulation methods: magnitude scaling of coefficient groups to emphasize or de-emphasize certain frequency regions, scaling filter manipulations to frequency-shift
the original signal, and envelope manipulations to alter the amplitude, onset, offset, and duration of the sound. These parameterization methods produce compelling variations of the original sound. Other parameterization techniques and manipulations may increase the synthesis potential of a model by producing a greater variety of sounds.

Magnitude scaling provides a straightforward way of changing the frequency content of a sound. For example, a large sound source, such as an airplane engine, will have large approximation coefficients (Aj), indicating a significant low-frequency contribution. The airplane engine sound can be converted into a car sound by de-emphasizing the approximation coefficients and enhancing the high-frequency detail coefficients (Dj). Various scaling techniques can be applied to wavelet coefficient groups to achieve different effects. One manipulation is to multiply or divide a coefficient group by a scalar. This simple manipulation is powerful and effective; in fact, all of the magnitude manipulation operations for the models described in this paper simply multiply different groups of wavelet coefficients by scalar values, as described in subsection 4.1. Different combinations of coefficient manipulations result in a variety of perceptually related sounds. More complex manipulations involve filtering coefficient groups by static or dynamic functions. The desired perceptual result determines the filter structures. Overall, the magnitude scaling method provides a means of creating an assortment of sounds by manipulating wavelet coefficient groups.

The second type of parameter manipulation is modification of the scaling filter. Scaling filter manipulations can shift the sound in frequency without changing the relative frequency contributions. Combining this method with magnitude scaling provides both frequency shifting and frequency emphasis or de-emphasis. The scaling filter determines the decomposition and reconstruction filters. By stretching or compressing the scaling filter prior to calculation of the reconstruction filters, the original signal frequency content is shifted down or up, respectively. Scaling filter manipulations can change the sound of a brook to the sound of a large, slow-moving river (stretching the scaling filter) or to the sound of a rapidly
moving stream (compressing the scaling filter). The scaling filter manipulation method involves five steps (sketched in code at the end of subsection 3.3):

1. Decompose the original signal using a wavelet with scale filter support.
2. Obtain the scaling filter, S, associated with the wavelet.
3. Extract the standard reconstruction scaling filter from the wavelet so that it can be modified.
4. Perform compression or expansion operations on the reconstruction filter.
5. Reconstruct the signal using the modified reconstruction filter.

The compression/expansion operations (step 4) can be accomplished with a number of different methods, such as linear interpolation or cubic-spline interpolation, followed by resampling. Through laboratory experimentation, cubic-spline interpolation was found superior to linear interpolation in terms of maintaining the perceptual quality of the original sound. Cubic-spline interpolation fits a third-degree polynomial between every two points and yields a smoother sound than does linear interpolation. Matlab contains all the functions necessary to complete these steps. The models described in section 4 demonstrate the variety of sounds created by scaling filter manipulations.

Two classes of envelope manipulations can be used with the wavelet synthesis method. The first class involves envelope filtering of the wavelet coefficients prior to synthesis. This includes the manipulations discussed in the magnitude scaling approach, where the envelope is a scalar filter. The desired perceptual effect determines the envelope filter shape. For example, a Gaussian-shaped envelope can be applied to a group, or groups, of wavelet coefficients, or across all wavelet coefficients. The filtered wavelet coefficients then undergo the normal synthesis process. The result is a synthesized sound that is a derivation of the original sound, wherein the frequency region around which the Gaussian envelope was centered is emphasized and the surrounding frequency regions are de-emphasized. Any envelope shape can be applied to the wavelet coefficients, including linear, nonlinear, quadratic, exponential, and random filters, and filters derived from mathematical
functions or from characteristic shapes of sounds. The wavelet operations of compression and denoising can be grouped with this parameterization method. Envelopes that compress the number of nonzero wavelet coefficients can be useful for saving data storage space and data transmission time. Compression and denoising functions applied to the wavelet coefficients can yield a variety of perceptually related sounds.

The second class of envelope manipulations imposes time-domain filtering operations on all, or part, of the synthesized sound. These operations are applied to the sound after synthesis. This type of sound processing is commonly applied to digitized sound samples to achieve a customized application sound. Time-domain filtering can alter the overall amplitude, the onset and/or offset characteristics, and the duration. Time-domain amplitude filtering with a random characteristic can be applied to the synthesized sound of rain to obtain a continuously varying and natural-sounding rainstorm. Combining time-domain enveloping with wavelet parameter enveloping can enhance the naturalness of the synthesized sound.
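Continuing the analysis sketch from subsection 3.1, the magnitude scaling and envelope manipulations described above might look as follows in Matlab. This is our illustration under stated assumptions: the group boundaries are recovered from the bookkeeping vector L, the scale factors 4 and 8 are taken from the ranges in table 1 (section 4), and the Gaussian envelope is built by hand rather than with a toolbox window function.

    % Parameterization sketch: scale and envelope coefficient groups inside C.
    Cmod  = C;                            % C, L from the level-5 db4 decomposition
    lenA5 = L(1);                         % A5 is the first group in C
    lenD1 = L(end-1);                     % D1 is the last group in C
    Cmod(1:lenA5) = 4 * Cmod(1:lenA5);    % emphasize low frequencies (A5 * 4)
    Cmod(end-lenD1+1:end) = 8 * Cmod(end-lenD1+1:end);  % emphasize highs (D1 * 8)

    % Envelope example: Gaussian-shaped filter over the D5 detail group.
    i0 = lenA5 + 1;  i1 = lenA5 + L(2);   % D5 occupies the second group in C
    x  = linspace(-3, 3, L(2));
    Cmod(i0:i1) = Cmod(i0:i1) .* exp(-x.^2 / 2);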
3.3 Synthesis Phase

The synthesis phase uses the inverse discrete wavelet transform (IDWT). The parameters (the modified wavelet coefficients) are the inputs to the IDWT. The IDWT starts with the modified coefficient vectors and constructs a signal by inverting the decomposition steps. The first step convolves up-sampled versions of the lowest-level coefficient vectors with high-pass and low-pass filters that are mirror reflections of the decomposition filters. Successively higher-level vectors are reconstructed by recursively iterating the same process until all coefficient vectors have been reconstructed. The result is a new waveform containing the synthesized sound. In Matlab, idwt performs single-level reconstruction and waverec performs multilevel reconstruction. The inputs to these functions are the coefficient vector, the vector of lengths, and the wavelet type. Users can supply the reconstruction filters in lieu of the wavelet type. For example, the signal is reconstructed from the coefficient
vector, C, and lengths vector, L, and the Daubechies wavelet db2 with the command f' = waverec(C,L,'db2'). The output from this function is the reconstructed signal, f'. The reconstructed signal can be converted to a standard audio file format (and sent to an audio output device for playback), saved for later use, or transmitted over a computer network for remote application.
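The synthesis phase, including the five-step scaling filter manipulation from subsection 3.2, can be sketched as follows. Again this is our illustration under stated assumptions: the spline stretch uses interp1, the stretched filter is renormalized so its coefficients sum to sqrt(2) (a detail the paper does not spell out), the matching high-pass filter is derived with the Wavelet Toolbox qmf function, and f and fs come from the analysis sketch in subsection 3.1.

    % Scaling filter manipulation: stretch the db6 reconstruction scaling filter
    % from 12 to 20 points, shifting the synthesized sound down in frequency.
    [C6, L6] = wavedec(f, 5, 'db6');             % step 1: decompose with db6
    [LoD, HiD, LoR, HiR] = wfilters('db6');      % steps 2-3: extract the filters
    n = numel(LoR);  m = 20;                     % 12-point filter -> 20 points
    LoR2 = interp1(1:n, LoR, linspace(1, n, m), 'spline');  % step 4: spline stretch
    LoR2 = LoR2 * sqrt(2) / sum(LoR2);           % renormalize (our assumption)
    HiR2 = qmf(LoR2);                            % matching high-pass via the QMF relation
    f2 = waverec(C6, L6, LoR2, HiR2);            % step 5: reconstruct with modified filters
    sound(f2 / max(abs(f2)), fs);                % normalize and audition

Note that the stretched pair is deliberately no longer a perfect-reconstruction filter bank; the frequency shift is exactly this controlled distortion.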
3.4 Validation Phase

Validation is the final phase of the sound synthesis process. Because the goal is to create perceptually convincing sounds, a rigorous mathematical proof cannot validate the success of the synthesis. Instead, success is judged by human perception of the sound imagery. During development, the designer listens to the synthesized sound and decides whether the desired aural imagery has been achieved. If it has not, different parameter manipulations are implemented by returning to the parameterization phase. It is also reasonable to reanalyze the original sound with a different wavelet decomposition if the aural imagery is far off the mark. Successive iterations continue through the four-phase development process until the desired audio result is obtained. Formal validation of the sound models requires psychoacoustic experimentation. A variety of experiments is possible. Bonebright, Miner, Goldsmith, and Caudell (1998) presented a test battery for validating sound veracity, although these experiments have not been accepted as a standard. For this research, a set of three psychoacoustic experiments was conducted: similarity rating, freeform identification, and context-based rating. These studies validate the aural imagery produced from the synthesized sound models. The similarity rating experiment examines the relationships between sound models and provides information about how models might be successfully expanded to synthesize a broader range of sounds. The freeform identification experiment evaluates the scope of the synthesized sounds by using the subject's perceptual identification capability. The context-based rating experiment provides metrics that indicate the sound synthesis success by comparing the synthesized sounds against human
expectations. These experiments are generally useful for evaluating any type of sound synthesis. In addition, the experiments can provide valuable cognitive and perceptual information for psychoacoustic researchers. The experiments are briefly described here; an in-depth description is provided by Miner et al. (2002). The similarity rating experiment examined the perceptual parameter space of the sound synthesis models and the effect of various parameter settings for those models. Subjects rated the similarity between two synthesized sounds on a five-point rating scale. Twenty-two subjects (seven men and fifteen women) participated. Twenty unique sound stimuli were used to create the 190 sound-pair combinations rated by the subjects. Two techniques were used to analyze the data: multidimensional scaling (MDS) and Pathfinder analysis. The MDS analysis provided evidence that manipulation of sound model parameters changed the sound perception in a predictable way. This is important for reliably controlling the sound synthesis from a virtual environment simulation system. The Pathfinder analysis revealed relationships within and across different sound groups. This information is important for extending a sound model's synthesis capability to a broader class of sounds. These results also proved useful for fine-tuning the sound synthesis models. The similarity rating experiment provided a tool for examining the relatedness of the sound stimuli without imposing experimenter bias. However, this experiment did not reveal the perceptual extent of aural images that could be synthesized with the wavelet models, nor did it provide a metric for the sound synthesis quality. The next two studies were designed to provide this information. The second experiment was a freeform identification experiment used to examine the perceptual identification of the synthesized sounds without providing a context. This experiment answered the question "what aural image comes to mind when you listen to this sound?" It is similar to the freeform identification experiments run by Ballas (1993) and Mynatt (1994). The purpose of the experiment was twofold. First, the experiment tested whether the synthesized sound
resembled the base sound (that is, the sound being synthesized) strongly enough to elicit a freeform identification without any verbal or visual context. Second, the experiment identified perceptually related sound labels that were not the base sound. These perceptually related labels served to extend the synthesis domain for individual models. In this experiment, subjects listened to synthesized sounds and entered an identification description. Identification phrases included a noun and descriptive adjectives. Thirty-five sound stimuli were presented in random order to 22 subjects (seven men and fifteen women). Results indicated that the synthesized sounds most frequently elicited the correct freeform response (correct in the sense that the response matched the target sound being synthesized). Results also showed that a wide variety of perceptually convincing sounds could be obtained by manipulating the model parameters. Mechanically oriented labels emerged as the high-frequency information in the synthesized sound was increased. Sound labels indicating larger objects emerged when the low-frequency content of the synthesized sound was increased, showing that manipulating the model parameters resulted in predictable changes in aural imagery. The third experiment was a context-based rating experiment designed to provide a sound synthesis veracity metric by asking subjects to rate the sound quality within a verbal context. Phrases obtained from the freeform experiment were paired with synthesized sounds; the phrases provide a perceptual context for the sounds. Twenty-seven subjects (five men and twenty-two women) were asked to rate how well the phrases matched the sounds they heard. Subjects rated 207 randomly presented sound and phrase-label pairs on a five-point scale, with 1 = no match and 5 = perfect match. Both digitized and synthesized sounds were included. Results quantified the freeform label responses, thereby providing an indication of label quality. Furthermore, this experiment provided numeric information about how the aural imagery changed as the model parameter settings changed. Thus, this experiment numerically validated the perceptual success of the parameter manipulations. Examination of the perceptual experiment results
indicates whether design iteration is necessary. Iteration of the process refines the synthesis model to obtain the desired perceptual characteristics. Reanalysis of the model involves iterating through the process starting either with phase 1 (a new wavelet analysis) or phase 2 (parameterization). For more information on the experimental results, refer to Miner et al. (2002). Section 4 describes several example sound synthesis models. We also present metric values for the rain sound model as an example of the experimental results obtained.
4 Example Sound Synthesis Models
This section describes four example sound models that were developed using the four-phase process. A high-level summary of some of the experimental results is also included to illustrate the effectiveness of the synthesis method.
4.1 General Model Development Details

The equipment used to develop the models and run the experiments was a Sun Sparc Server 20 host computer interfaced with a Network Computing Devices (NCD) smart terminal (model MCX). The NCD workstation contained an embedded soundboard to allow playback of the synthesized sounds. Synthesized sounds were auditioned through both the workstation speakers and AKG K240 stereo headphones. MathWorks' Matlab version 4.2 provided the wavelet decomposition, scaling filter extraction, and reconstruction routines. Custom Matlab routines were developed for sound signal input and output, parameterizations, and other functions required for model development. All Matlab functions were performed in non-real time, prior to validation experiment execution. In practice, decomposition would be performed in non-real time to set up the sound models; reconstruction, or synthesis, would be performed in real time, with parameter values dictated by a VR simulation. Model development for each of these examples began
with a digitized "base" sound sample. The base sounds were digitized at a 22,050 Hz sample rate with 16-bit resolution. The sounds were captured using a portable digital audio tape (DAT) recorder and a studio-quality microphone. The wavelet type used for decomposition has a direct effect on the sound synthesis results. Decomposition with a relatively complex wavelet type (for example, Daubechies 4, 5, or 6) can provide the basis for a powerful sound synthesis model. These example models created sounds resulting from decomposition with the Daubechies 4 (db4) and Daubechies 6 (db6) wavelet types at decomposition level 5. Level 5 was selected for these models because fewer levels of decomposition produced overly dramatic changes in the resulting synthesized sounds, whereas manipulations of finer levels of detail (that is, decomposition levels greater than 5) did not create perceptually significant changes. These results were determined during the model design process by iterative cycles through the analysis, parameterization, synthesis, and validation phases. The choice of wavelet type and decomposition level is application specific. To demonstrate the effect of varying model parameters, several parameterizations were applied to the base sounds, as described in table 1. The perceptual experiments described by Miner et al. (2002) used a subset of these sounds to validate the synthesis method. Each row in the table represents one sound model. The table arranges the sounds into five columns according to parameter setting type. The first column contains the model name and represents the synthesized sound with no parameter manipulations. The last four columns represent the coefficient groupings and parameter manipulations. The first two parameterizations (parameter columns 1 and 2 in table 1) were magnitude-scale operations on the level 1 detail (D1) and level 5 approximation (A5) coefficients obtained from a wavelet decomposition using the db4 wavelet type. Scaling the D1 coefficients enhanced the high-frequency sound components; scaling the A5 coefficients enhanced the low-frequency sound components. Initially, coefficient groups were scaled by factors as large as 20 and 100.
Table 1. Parameter settings for the example models: the original/base sound plus four categories of parameter settings. Parameters 1 and 2 are the scalar values applied to the coefficient groups (D1 and A5); parameters 3 and 4 are the lengths of the modified scaling filter.

Original sound    Scale details (D1)             Scale approx. (A5)        Increase filter points   Decrease filter points
Rain              1.2, 2, 4, 5, 8, 10, 20, 100   2, 4, 5, 8, 10, 20, 100   14, 17, 20, 24           6, 7, 8, 9
Car motor         2, 4, 5, 8, 10, 20, 100        2, 4, 5, 8, 10, 20, 100   14, 17, 20, 24           6, 7, 8, 9
Footstep          2, 4, 8                        2, 4, 8                   14, 17, 20, 24           6, 7, 8, 9
Breaking glass    2, 4, 8                        2, 4, 8                   14, 17, 20, 24           6, 7, 8, 9
For these sound model examples, scalings of this magnitude did not yield perceptually compelling sounds (that is, the sounds became unrecognizable). Scaling coefficient groups by less than a factor of 1.2 did not yield a perceptual change to the sound. The next two parameterizations (parameter columns 3 and 4 in table 1) involved scaling filter manipulations of the Daubechies 6 (db6) reconstruction function. The db6 wavelet has a twelve-point reconstruction scaling filter. One parameterization increased the number of points in the reconstruction filter to stretch the filter, shifting the sound down in frequency; the other decreased the scaling filter length (compressing the filter), shifting the sound up in frequency. Each base sound had the same scaling filter manipulations applied. Scaling filter manipulations ranged from halving the filter length (creating a six-point filter) to doubling it (creating a 24-point filter). The results were perceptually convincing in some cases but not in others, depending on the base sound characteristics. To evaluate individual effects, parameterizations were applied one at a time rather than in combination. In practice, combinations of magnitude scaling, scaling filter manipulations, and enveloping would be applied to wavelet coefficient groups at various levels to create a powerful sound model capable of producing thousands of sounds.
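The D1 column of table 1 amounts to a small batch loop over scale factors. Here is a sketch (ours, with hypothetical file names) that generates the D1-scaled rain stimulus variants:

    % Generate the D1 magnitude-scaled stimulus variants for the rain model.
    [f, fs] = audioread('rain_base.wav');          % hypothetical base sound file
    f = f(:, 1)';
    [C, L] = wavedec(f, 5, 'db4');
    lenD1 = L(end-1);
    for s = [1.2 2 4 5 8 10 20 100]                % table 1, scale details (D1)
        Cs = C;
        Cs(end-lenD1+1:end) = s * Cs(end-lenD1+1:end);
        y = waverec(Cs, L, 'db4');
        y = y / max(abs(y));                       % normalize for playback
        audiowrite(sprintf('rain_D1x%g.wav', s), y, fs);
    end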
4.2 Four Example Parameterized Sound Models

This section describes the four example sound models. The perceptual results of the parameter manipulations (as indicated in table 1) are
summarized. Two models (rain and car engine) are continuous stochastic sound models, consisting primarily of nonpitched sound of indefinite duration. Two models (footsteps and glass breaking) are finite-duration sounds, defined as time-limited sounds whose onset and offset characteristics significantly influence the sound perception. Raw average context-based rating data and standard deviations are provided for the rain sound model to demonstrate the effects of specific parameter manipulations.

4.2.1 Rain. This model simulated the sound of rain. The original digitized sound was that of rain hitting concrete in an open-air environment. Parameter manipulations yielded the synthesis of light rain, medium rain, and progressively heavier rain. The perception of increasing wind accompanied the sound of increasing rain and conveyed the sense of a large rainstorm. Other perceptually grouped sounds that emerged during the freeform identification experiment were bacon frying, machine-room sounds, a waterfall, a large fire, and applause. Table 1 shows the parameter manipulations for this model. Increasing the magnitude scaling of the detail coefficient vectors (D1) resulted in the perception of increasingly softer rain; bacon frying, fire, and other sounds were also perceived. Increasing the magnitude scale of the approximation coefficients (A5) increased the contribution of the lowest-frequency sound components, resulting in a deeper, more reverberant sound. Thus, manipulating groups of coefficients (parameters) increases the scope of the sounds generated by the model.
Table 2. Perceptual experiment results for the rain model: a subset of freeform identification labels, with average context-based ratings and standard deviations for the base (original) sound and two parameter manipulations (1 = no match, 5 = perfect match). The sound stimuli group is rain.

Context label                Base sound       Base w/ D1*8     Base w/ A5*4
                             avg     std      avg     std      avg     std
Hard rain                    4.30    1.20     2.11    1.42     4.63    0.79
Light drizzle of rain        2.63    1.42     4.26    1.06     1.63    0.88
Water running in a shower    3.85    1.32     3.67    1.21     2.67    1.41
Bacon frying                 2.19    1.11     3.44    1.55     1.89    0.97
Small waterfall              3.07    1.21     2.63    1.33     2.96    1.26
Large waterfall              2.41    1.19     1.37    0.74     4.15    1.03
Table 2 supports these claims; it contains a subset of the results from two of the perceptual experiments. The first column contains some of the labels obtained for the rain model during the freeform identification experiment. The remaining columns contain average ratings and standard deviations from the context-based rating experiment for the original (base) sound and two parameterizations (detail coefficients on level 1 (D1) scaled by a factor of 8, and approximation coefficients on level 5 (A5) scaled by a factor of 4). These results show how the perception of the sound changes as the parameters are modified. For example, increasing the A5 intensity increases the perception of hard rain: this synthesized sound matched the label "Hard rain" with an average rating of 4.63 ± 0.79 out of 5. Increasing the D1 coefficients increases the perception of a light drizzle of rain (average rating of 4.26 ± 1.06 out of 5). Furthermore, these results show how a variety of sounds are perceived from one model, and that changing the parameters affects the relative convincingness of those sounds. For example, the rain base sound with A5*4 simulates the sound of a large waterfall (4.15 ± 1.03) more convincingly than does the base sound with D1*8 (1.37 ± 0.74) or the unmodified base sound (2.41 ± 1.19). Other parameter manipulations, as explained in subsection 4.1, included increasing or decreasing the scaling
filter length. Changing the scaling filter length upon reconstruction had the effect of shifting the sound up or down in frequency. For the rain model, the main effect of these manipulations was to change the perceived size of the raindrops and the hardness of the surface. Shifting the sound higher in frequency produced the sound of smaller raindrops hitting a harder surface; the opposite was true for shifting the sound down in frequency. These results were noted by human listeners during repeated iterations through the four-phase development process, specifically during the validation phase. Applying denoising techniques to the rain model resulted in a synthesized sound similar to that of a car engine. The denoising technique is based on coefficient thresholding: coefficients with values below the threshold are set to zero. Thus, the technique is akin to compression of the coefficient vectors. Denoising and compression have considerable promise in terms of adding to the variety of sounds synthesized from individual models and in saving on parameter storage and communication time.

4.2.2 Car Engine. This model simulated the sound of a car engine idling, with parameter adjustments for different-sized cars, different types of machines, and different types of engines. The base sound
for this model was a digitized recording of a mid-sized, four-cylinder engine idling in an open-air environment. Adjusting the parameters resulted in the perception of a large diesel truck, a standard truck, a small car, and a large car. Perceptual labels identified during the freeform experiment were different engine types, machinery, construction site machines, tractor, jackhammer, drill, helicopter, and various-sized airplane engines. Magnitude scaling and scaling filter manipulations were performed on this model. Increasing the magnitude of the D1 coefficients increases the high-frequency sound content, resulting in a smaller engine sound, such as a lawn mower or toy car. Increasing the magnitude of the A5 coefficients drowns out the high-frequency metallic sounds with the enhanced low-frequency components; the result is a smoother, larger-sounding engine such as a helicopter or airplane. Decreasing the magnitude of the coefficient vectors had the inverse effect. The scaling filter manipulations were intended to create engine sounds of consistent size but with different RPM characteristics. This effect was not achieved, however. Instead, the manipulations shifted the engine sound in frequency, changing the perception of the sound type. High-frequency shifts resulted in a buzzing sound, reminiscent of a swarm of bees; low-frequency shifts resulted in large-engine sounds, such as that of an airplane. Perhaps a more uniform original base sound is required to create the desired effect of RPM variation through scale manipulation. Thresholding the car motor model significantly reduced the number of nonzero coefficients without perceptually changing the synthesized sound. Using Matlab's automatic global threshold function (threshold level of 1004) resulted in 50.83% of the coefficients being set to zero while retaining 99.1% of the signal energy. Using this significantly reduced set of coefficients had no noticeable perceptual effect on the synthesized result. This demonstrates how entire groups of coefficients may be eliminated without changing the perception of the synthesized sound, thereby dramatically reducing the number of coefficients required. It is an example of the significant compression rates that may be possible for wavelet sound models.
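The thresholding experiment above is easy to reproduce in outline. A minimal sketch (ours; the exact Matlab 4.2 routine the authors used is not specified), assuming C and L hold a decomposition of the car motor base sound as in subsection 3.1:

    % Hard-threshold the coefficients and measure sparsity and retained energy.
    thr = 1004;                           % global threshold value from the text
    Ct  = C .* (abs(C) >= thr);           % coefficients below threshold -> zero
    pctZeroed = 100 * mean(Ct == 0);              % cf. 50.83% reported above
    pctEnergy = 100 * sum(Ct.^2) / sum(C.^2);     % cf. 99.1% reported above
    y = waverec(Ct, L, 'db4');            % synthesis from the reduced coefficient set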
4.2.3 Footsteps. This model simulated the sound of footsteps on gravel. Parameter manipulations resulted in the perception of footsteps on different material types, such as dirt, snow, a hard concrete floor, or a wood floor, and of different weights for the person walking. Additional perceptually grouped sounds were chewing, crumpling paper, crushing or dropping various objects (from soft to hard), stomping horse hooves, stepping on leaves, lighting a gas grill, a lion's roar, and gunfire. Increasing the magnitude of the detail coefficients resulted in the perception of a smaller person stepping on a harder, less resonant surface. Further increasing the high-frequency components changed the perception from a footstep to crumpling paper and a fire. Increasing the low-frequency coefficients resulted in the perception of a footstep on a softer surface, such as mud or fresh snow; the perceived weight of the person also increased as the approximation scaling increased. Increasing the reconstruction filter length shifted the sound down in frequency. The result was similar to the sound of an explosion because the high-frequency crackling was removed. Adding envelope manipulations that increase the starting signal intensity and decay exponentially over time would strengthen the perception of an explosion.

4.2.4 Glass Breaking. This model simulated the sound of breaking glass, with parameter adjustments for the glass thickness or density, the hardness of the surface on which the glass breaks, and the force of the impact. Exercising this model during the perceptual experiments elicited responses of dropping a heavy glass on a wood floor, throwing crystal against a concrete floor, breaking a window, breaking a plate, and keys falling to the floor. Increasing the high-frequency detail coefficients resulted in the perception of decreasing glass thickness; the higher the scale factor, the harder the impact surface seemed and the less resonant the sound. The throwing velocity was perceived to increase as the detail scale increased. Scaling the low-frequency coefficients achieved the inverse effect. The perceived glass thickness increased as the scale factor increased, going from a plate or cup to a heavy vase or window. The surface
hardness decreased as the scale factor increased because the surface resonance increased. Large approximation scale factors gave the perception of a wooden surface. Increasing the reconstruction filter length shifted the sound down in frequency; the result was less like glass breaking because of the missing high-frequency components. Conversely, decreasing the filter length shifted the sound toward the high-frequency region, decreasing the perceived glass thickness and increasing the perceived surface hardness.
5 Future Extensions
One direction for future work is to merge several different models into a more generalized sound synthesis model. For example, merging the electric motor and car motor models may yield a general motor model. This is desirable because a single model would then give users a variety of engine sounds, engine loads, RPMs, and so on. Another example would be a general running water model that could synthesize rain, brooks, rivers, waterfalls, water from faucets, and more. Real-time sound synthesis is possible with this technique. Completing the analysis and parameterization phases in non-real time produces the parameterized model. The parameter manipulation and synthesis phases can then be computed in real time, in parallel with graphical and environmental VR simulations, and real-time implementations of wavelet transforms are available on many desktop platforms. Because this technique is entirely software based, the sound synthesis could feasibly be combined efficiently with three-dimensional sound localization and offloaded to a parallel sound server, creating a software-only virtual sound system. We are exploring further compression of the wavelet coefficients to enhance real-time performance.
6 Conclusions
We have described a four-phase development process for a new sound synthesis approach using wavelets. The iterative nature of the process allows continuous
model refinement according to perceptual sound quality results. The analysis and synthesis phases use the discrete wavelet transform and the inverse discrete wavelet transform, respectively. The parameterization phase creates dynamic, flexible sound models that, when exercised, are capable of producing sounds with a variety of perceptual qualities. We described three perceptual validation experiments with human subjects, designed to elucidate how the synthesized sounds are perceived and to rate the sound synthesis veracity. Several continuous and noncontinuous stochastic-based sound models have been developed using this method, including models for rain, car engine, brook, glass breaking, and footstep sounds. These models provide evidence of the validity of the approach. Several steps remain before these sound synthesis models are available to users, including further model and parameterization development, real-time implementation, development of an intuitive user interface, and integration with virtual reality simulation systems.
Acknowledgments Sandia National Laboratories supported this work under its Doctoral Study Program. We thank the reviewers who provided helpful comments, and we also thank the experiment volunteers.
References

Ballas, J. (1993). Common factors in the identification of an assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance, 19(2), 250–267.

Bonebright, T., Miner, N., Goldsmith, T., & Caudell, T. (1998). Data collection and analysis techniques for evaluating the perceptual qualities of auditory stimuli. Proceedings of the International Conference on Auditory Display. Available online at www.icad.org/websiteV2.0/Conferences/ICAD98/icad98programme.html

Cohen, A., & Ryan, R. D. (1995). Wavelets and multiscale signal processing. London: Chapman & Hall.

Cook, P. (1997). Physically informed sonic modeling
(PhISM): Synthesis of percussive sounds. Computer Music Journal, 21(3), 38–49.

Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Communications in Pure and Applied Mathematics, 41, 909–996.

———. (1992). Ten lectures on wavelets. Philadelphia: SIAM.

Esteban, D., & Galand, C. (1977). Applications of quadrature mirror filters to split-band voice coding schemes. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 191–195.

Gaver, W. (1994). Using and creating auditory icons. In G. Kramer (Ed.), Auditory display: Sonification, audification, and auditory interfaces, proc. vol. XVIII (pp. 417–446). Reading, MA: Addison-Wesley.

Haar, A. (1910). Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69, 331–371.

Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 674–693.

Meyer, Y. (1993). Wavelets: Algorithms and applications. Philadelphia: SIAM.

Miner, N. E. (1994). Using voice input and audio feedback to enhance the reality of a virtual experience. Proceedings of the 1994 IMAGE Conference, 337–343.

———. (1998a). Creating wavelet-based models for real-time synthesis of perceptually convincing environmental sounds. Doctoral dissertation, University of New Mexico.

———. (1998b). An introduction to wavelet theory and analysis. Sandia National Laboratories technical report no. SAND98-2265.
Miner, N. E., Goldsmith, T. E., & Caudell, T. P. (2002). Perceptual validation experiments for evaluating the quality of wavelet synthesized sounds. Presence: Teleoperators and Virtual Environments, 11(5), 508–524.

Misiti, M., Misiti, Y., Oppenheim, G., & Poggi, J. (1996). Wavelet toolbox for use with Matlab. Natick, MA: The MathWorks, Inc.

Mynatt, E. D. (1994). Designing with auditory icons. Proceedings of the Second International Conference on Auditory Display (ICAD), 109–119.

Ogden, R. (1996). Essential wavelets for statistical applications & data analysis. Boston: Birkhäuser.

Serra, X. (1989). A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition. Doctoral dissertation, Stanford University, and CCRMA report no. STAN-M-58.

Smith, J. O. (1992). Physical modeling using digital waveguides. Computer Music Journal, 16(4), 74–87.

Strang, G., & Nguyen, T. (1996). Wavelets and filter banks. Wellesley, MA: Wellesley-Cambridge Press.

Stromberg, J. O. (1982). A modified Franklin system and higher order spline systems on R^n as unconditional bases for Hardy spaces. In A. Beckner (Ed.), Conference in honor of A. Zygmund, vol. II (pp. 475–493). Monterey, CA: Wadsworth Mathematics Series.

Van den Doel, K., & Pai, D. K. (1996). Synthesis of shape dependent sounds with physical modeling. Proceedings of the International Conference on Auditory Display. Available online at www.icad.org/websiteV2.0/Conferences/ICAD96/Proc96/dendoel.htm