
Overview

We decided to study the effect that different room sizes and various materials have on sound and to simulate these effects in a 3D environment. To do this, we had to find a way to model both sound and a room of arbitrary size in 3D space, while also modeling the filtering effects of the receiver and of walls of arbitrary material. Data on materials was collected experimentally, and data for listener filtering was sourced online.

DSP Tools Used:

In Class

  • Discrete Convolution

  • Linearity

  • Spectrogram (time-frequency analysis)

  • Impulse Response

Out of Class

  • Head-related transfer function

  • Impulse response data collection

  • Ray tracing, ray model for sound, imaginary sources

Problem Statement

When we hear a sound, we characterize it and distinguish it from other sounds based on qualities like its pitch, tone, and location in space (Tonneson). Music synthesizers and speakers can recreate the tone and volume of a sound, but often they do not simulate a location in space. For example, when you listen to recorded music, it does not seem to come from a musician in the room with you. In contrast, surround sound systems do create the effect of sound coming from sources around you, but the audio is usually recorded. This is acceptable for an application like a movie, but infeasible for applications like video games and virtual reality, where the scenarios depicted never actually happened, so audio could not be recorded. 

 

It would be useful to have a way to create audio from scratch that sounds as if it were recorded in a different location, creating the illusion that the listener is in that location. Since spatial sound increases a listener's environmental awareness, it can be used to make virtual reality more immersive and give the user a better sense of presence (Tonneson). Synthetic spatial sound could also be useful in improving computer-human interaction, especially for people with disabilities. For example, visually impaired users could receive audio feedback in a virtual 3D space rather than visual feedback on a screen (Tonneson). The applications even extend to business, such as giving the impression that a conference call is taking place in a shared conference room.

 

3D sound synthesis is the technique of creating an artificial sound that seems to come from a source that does not exist. In our project, we wanted to research the spatial aspects of sound and explore the signal processing required to generate an artificial sound. Our goal was to model a room, place a source and a receiver in that room, and create the sound that would be heard by the receiver if the room actually existed.

Modeling Environment

To reduce complexity, we chose to model a simple box-shaped room with six walls, each made from one of an array of materials. We also chose to only use one source, though our approach allows for any number of sources. The challenge was to account for all of the factors that influence the sound transmitted from the source to the receiver, including reflections off of the walls, filtering caused by the walls, filtering caused by the listener's head, and attenuation due to loss of signal power over distance.  

In MATLAB, we represented walls, floors, and ceilings as planes in 3D space, so that the room was the volume enclosed by them. The sound source is located at some point in this space, while the receiver is modeled as a unit vector from another point in the space so that we can represent the direction it is facing. This becomes important in applying the head-related transfer function, which is discussed in a later section.

Surfaces were modeled to have different materials. Each material has its own filtering effect that depends on angle of incidence and frequency of sound. The attenuation relationship due to the interaction of sound with a material is shown below with 𝜶 representing the absorption coefficient and 𝜹 representing the scattering coefficient.
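
For reference, since the original figure is not reproduced here, a standard form of this relationship in geometric room acoustics (our reconstruction, which may differ in detail from the figure we showed) splits the reflected energy into a specular and a scattered part:

E_spec = E_inc · (1 − 𝜶) · (1 − 𝜹)        E_scat = E_inc · (1 − 𝜶) · 𝜹

where E_inc is the incident sound energy, so the total energy that is not absorbed is E_inc · (1 − 𝜶).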

In practice, the angle of incidence is averaged over all values, simply because of the number of possibilities, although it should be mentioned that the coefficients are relatively stable until a grazing angle is approached. Also, because scattering coefficients are very difficult to measure, they are commonly either ignored or kept constant across all materials.

For our project, walls, floors, and ceilings were given absorption coefficients based on their material. The effects of the scattering coefficients were ignored. Although materials do have different absorption coefficients for different frequency bands (typically the 1/3 octave bands are used), we chose to focus on the band centered at 4000Hz to contain the scope of our project and because of limitations of our data collection method. The 4000Hz octave band was also chosen because equal loudness contours show it is one of the easiest frequencies for humans to hear.

To collect absorption coefficients for different materials, we used popping balloons as an impulse source and measured the response of a material with an audio recorder (a Zoom H5 recorder equipped with a shotgun mic). The distance between the recorder and the material was also measured. Using the time-domain representation of the recording and the distance between the microphone and the material, we plotted the response against the distance travelled since the impulse; this was done simply using the speed of sound and the time axis of the recording. Although preliminary research suggested the most significant frequency component of a balloon popping is around 4000Hz, spectrograms of our impulses show this was not the case for us. Using a threshold to isolate the most significant frequency components, it was mostly found to be much lower, around 1000Hz. This could be due to the type of balloon used and how much the balloon was inflated. We also attempted to keep the balloon perpendicular to the material being tested in order to capture the specular reflection of the sound, but this was not a measured variable. A script was written to show these effects, generating both echo plots and spectrograms for any trial. Because our data ended up being very inconsistent and noisy between trials, we decided to use published absorption data in our simulation program. Data collected and an image showing our testing setup can be found at the end of this section.
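
As an illustration of the processing only (not our exact script; the filename and window parameters here are hypothetical), such a script might look like this in MATLAB:

```matlab
% Hypothetical sketch of an echo-plot/spectrogram script for one balloon trial.
[x, fs] = audioread('balloon_trial.wav');   % assumed filename for one recording
x = x(:, 1);                                % use a single channel
c = 343;                                    % speed of sound (m/s) at ~20 degrees C

t = (0:length(x) - 1).' / fs;               % time since the start of the recording
d = c * t;                                  % distance travelled since the impulse

figure;
plot(d, x);                                 % "echo plot": amplitude vs. distance
xlabel('Distance travelled (m)');
ylabel('Amplitude');

figure;
spectrogram(x, 256, 200, 512, fs, 'yaxis'); % time-frequency view of the pop
```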

Published absorption coefficient data was collected using methods described in ISO 354. This standard describes a room of volume V with many sound diffusers hanging from the ceiling, in order to create a diffuse sound field, and a material sample with surface area S placed on the floor of the room. Reverberation time is measured first without the sample and then again with it, in order to calculate absorption coefficients. An image of an example setup and the equation used to determine absorption coefficients (Sabine's formula) are both shown below.
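
For reference, the standard ISO 354 expression (our reconstruction, based on Sabine's formula and neglecting the air-absorption correction terms) is approximately

𝜶_s ≈ (55.3 V / (c S)) · (1/T_2 − 1/T_1)

where T_1 and T_2 are the reverberation times measured without and with the sample, V is the room volume, S is the sample's surface area, and c is the speed of sound.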

Even the method described in ISO 354 does not produce very reproducible results, and the standard is currently in the process of being revised.

Lastly, we accounted for attenuation due to air absorption in an indoor environment. The following equation represents power, W, as a function of distance travelled, L.
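
In its standard form (our reconstruction; the original equation appeared as a figure), the power decays exponentially with distance:

W(L) = W_0 · e^(−mL)

where W_0 is the power at the start of the path and m is the air attenuation constant (per metre).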

The constant m depends on atmospheric pressure, temperature, humidity, and the frequency of the sound. We chose a value of 0.005 which corresponds to a temperature of 20°C, humidity of 50%, and frequency of around 4000Hz. The overall effect is very minimal (<5% in our case) except for when the distance travelled is very long or when the frequency is very high.

[Figures referenced in this section: the material attenuation relationship, the ISO 354 measurement setup, Sabine's formula, and the air absorption equation]
Example echo plots:
PB1.png
PB2.png
Example spectrogram with and without threshold:
Spec1.png
Spec2.png
Image of testing environment:
IMG_20190413_150120.jpg

Modeling Sound

In this section, we will give an overview of the algorithm we use to generate an artificial sound based on a room's parameters.

 

Our algorithm allows the user to select:

  • Room dimensions

  • Wall material (can be different for each wall)

  • Source location

  • Receiver location

  • Receiver's head position, including direction and tilt

  • The order of echoes required

Using these parameters, we try to create an impulse response that corresponds to that particular room and that particular receiver. We first generate an ideal impulse response by adding deltas shifted by delays corresponding to reflections off of the walls. We model sound leaving the source as rays extending in every direction, and we model the source and receiver as points. The walls reflect the sound rays based on their angle of incidence, creating new reflected rays. For each wall, there is exactly one ray leaving the source that will reflect off of the wall and hit the receiver. The image below shows this effect, and it demonstrates how moving the receiver changes the delays (Dokmanic, 2013). It also demonstrates that in some cases, second-order reflections may arrive before first-order reflections. An nth-order reflection is one in which the sound ray from the source reflects off of n walls before reaching the receiver.

F3.medium.gif

To find the delays of the first-order echoes, we first use geometry to find the reflection point on each wall that generates the first-order echo, as well as the reflection vector. Through an independent analysis, we found that the reflection point, the source point, and the receiver point are related through similar triangles and thus the reflection point can be found with ratios. With the reflection point known, as well as the speed of sound, we compute all of the delays in seconds. We also find the time it takes for the source to transmit to the receiver without any reflections. Using a given sampling rate, we convert the delay in seconds to a delay in samples and generate a shifted delta function with the shift equal to the delay. Eventually, all of the shifted deltas are added together to create an impulse response that accounts for all of the echoes.

To find the delays of higher-order echoes, we execute the function described above recursively. When a sound ray reflects off of a wall, the reflected ray appears as if it were emitted from a source behind the wall. This is called a virtual source, and it allows the same function to be used again, replacing the real source with a virtual one. A useful property of virtual sources that we did not exploit in our project is that they can easily be computed by mirroring the source location across each wall (Dokmanic, 2013). One way to find the delays would therefore be to recursively calculate the locations of virtual sources by mirroring each source (real or virtual) across each wall, over and over. We did not follow this path simply because we did not think of it when we started. Instead, we use the reflection point on each wall as a virtual source and keep track of when each reflection occurred, adding this delay to the delay calculated from that reflection point to the receiver. This approach yields an ideal impulse response in which all of the shifted deltas have the same magnitude. An example is shown in the left plot of the figure below for three orders of echoes. Notice that with three orders of echoes, it is difficult to discern which impulse corresponds to which order and wall. With six walls, there are nearly 200 different echoes reaching the listener's ear, and they often blend together when they arrive close together in time. This blending is called reverberation, and it is what is typically heard in rooms about 17 meters or less in length, where reflected paths are too short for distinct echoes to be perceived.

When convolved with a signal, this ideal impulse response would produce perfect copies of the signal at different delays. However, real sound gets filtered when it bounces off of walls, so we added this to the algorithm as well. When a sound ray bounces off of a wall, its amplitude is attenuated by a coefficient determined by the wall's material: the absorption coefficient describes how much of the sound is not reflected, and it varies depending on the material of the wall. As described above, we model this absorption coefficient as a single value rather than as a function of frequency. We also account for the loss of sound intensity as it propagates through air. The pressure of a sound wave is attenuated in proportion to the inverse of the distance traveled, so in our model the impulses are attenuated based on the ratio between the length of the direct path from the source to the receiver and the length of the path actually taken from the source to the receiver.
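
To make the construction concrete, here is a minimal sketch of the idea, restricted to first-order echoes and using the mirror-image shortcut mentioned above rather than our reflection-point code; the room dimensions, positions, sampling rate, and absorption coefficient are all hypothetical:

```matlab
% Minimal sketch (not our actual implementation): direct path plus first-order
% echoes for a box room, built as scaled, shifted deltas.
fs    = 44100;                % sampling rate (Hz)
c     = 343;                  % speed of sound (m/s)
room  = [5 4 3];              % room dimensions [Lx Ly Lz] in meters
src   = [1.0 1.5 1.2];        % source position (m)
rec   = [3.5 2.0 1.6];        % receiver position (m)
alpha = 0.5;                  % absorbed fraction, assumed equal for every wall

% Mirror the source across each of the six walls to get the image sources.
imgs = repmat(src, 6, 1);
for dim = 1:3
    imgs(2*dim - 1, dim) = -src(dim);               % wall at coordinate 0
    imgs(2*dim,     dim) = 2*room(dim) - src(dim);  % opposite wall
end

dDirect = norm(rec - src);                          % direct path length (m)
dEcho   = sqrt(sum((imgs - rec).^2, 2));            % echo path lengths (m)

% Delays in samples, and amplitudes normalized so the direct path is 1:
% each echo is scaled by the reflected fraction (1 - alpha) and by the ratio
% of the direct distance to the echo distance (1/r pressure loss).
delays = round([dDirect; dEcho] / c * fs);
amps   = [1; (1 - alpha) * (dDirect ./ dEcho)];

h = zeros(max(delays) + 1, 1);                      % ideal impulse response
for k = 1:numel(delays)
    h(delays(k) + 1) = h(delays(k) + 1) + amps(k);
end
```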

 

The middle plot of the figure below shows the ideal impulse response after attenuation due to wall absorption and pressure loss. Notice that the left-most spike (the one earliest in time) still has a magnitude of 1. This represents the impulse traveling directly from the source to the receiver. This experiences only minimal attenuation due to transmission through the air, so it is used as a standard that the other impulses are normalized to. As long as the impulses have the correct amplitudes relative to each other, the absolute amplitude is mostly unimportant because scaling amplitude in the frequency domain simply scales the amplitude of the signal in the time domain due to linearity, having only the effect of making the sound louder or softer. Notice that in this simulation, which had wall damping coefficients around 0.5, the echoes quickly attenuate. Note also that the smallest echoes, the third-order ones, have less than 5 percent of the strength of the direct transmission from the source to the receiver. As a result, we believe there is no need to simulate with more than three orders of echoes.
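
As a rough, back-of-the-envelope check (our own arithmetic, not from the original figure): with wall absorption around 0.5, each bounce scales an echo by roughly 1 − 0.5 = 0.5, so a third-order echo is scaled by about 0.5³ ≈ 0.13 before distance attenuation; once the longer path length is also accounted for, falling below 5 percent of the direct impulse is consistent with the plot.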


One final filter that we apply to model the experience of hearing a real sound is a filter modeling the listener's head. As described above, this filtering is commonly described by the head-related transfer function (HRTF). MATLAB provides a useful function called interpolateHRTF, which takes as input an array of collected HRTFs, a desired source position, and a head orientation, and returns an estimate of the HRTF that would be experienced by a listener in that configuration. Using this function along with the external data set, we were able to generate the HRTF for each echo separately. The right plot of the figure below shows the ideal impulse response after both the previously described attenuation effects and convolution with the HRTF. In this case, some of the amplitudes of the impulses actually increase as the HRTF amplifies them. In this simulation, the "listener" was positioned to face the source. The amplification corresponds to the sound-capturing effect of the listener's ears. The second- and third-order echoes are mostly attenuated, as many of them arrive from the sides and back of the listener and are blocked by the head and the ears themselves.

before_and_after_example_2.png
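
The snippet below is a hedged sketch of how one echo's HRTF might be obtained and applied; it assumes the Audio Toolbox interpolateHRTF interface (an array of measured HRIRs plus their measurement directions, queried at a desired direction), uses hypothetical file and variable names, and expresses the direction relative to where the head is facing:

```matlab
% Hedged sketch (not our exact code). Assumed layout: hrirData holds one
% measured HRIR per position for each of the two ears, and positions holds the
% matching [azimuth, elevation] angles in degrees, both from an external data set.
load('hrtf_dataset.mat', 'hrirData', 'positions');   % hypothetical file

desiredDir = [30 0];                                  % hypothetical echo direction
hrirPair   = interpolateHRTF(hrirData, positions, desiredDir);

% Apply the interpolated pair to one echo (a scaled, shifted delta) to get its
% contribution to the left- and right-ear impulse responses.
echoIR = zeros(1, 2048);
echoIR(513) = 0.4;                                    % hypothetical delay and amplitude
hLeft  = conv(echoIR, squeeze(hrirPair(1, 1, :)).');
hRight = conv(echoIR, squeeze(hrirPair(1, 2, :)).');
```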

After applying the different filtering effects, it is easy to see the difference between the sounds heard by the left and right ears. The first image below shows the impulse responses heard by the two ears for a source in front of and to the right of the receiver. Notice that the right ear generally hears louder echoes because it experiences less distortion from the head. The second image shows these impulse responses convolved with the Bach audio clip taken from the in-class demo. Here, the difference between the ears is harder to see. However, it is clear that some parts of the signal are much louder in the right ear than in the left ear. These differences allow the brain to locate the source of the sound.

impulse_responses_time_domain.png
bach_each_ear.png
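
Producing such a clip is a straightforward pair of convolutions; a minimal sketch (with hypothetical file and variable names for the simulated ear responses) follows:

```matlab
% Hypothetical sketch: render the binaural clip from the simulated ear IRs.
load('room_irs.mat', 'hLeft', 'hRight');   % assumed file with the two ear IRs
[x, fs] = audioread('bach.wav');           % assumed name of the in-class clip
x  = x(:, 1);                              % force mono before spatializing

yL = conv(x, hLeft(:));                    % left-ear signal
yR = conv(x, hRight(:));                   % right-ear signal (same IR length assumed)
y  = [yL yR] / max(abs([yL; yR]));         % stereo, normalized to avoid clipping

sound(y, fs);                              % best heard over headphones
% audiowrite('bach_simulated_room.wav', y, fs);
```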

One interesting observation that this modeling software allows is a comparison between the sounds heard as the receiver and source are moved around the room. The plots above show the sound heard when the source is in front of and to the right of the receiver. The plots below show the case when the source is behind and to the left. The difference between the signals demonstrates the significance of the head-related transfer function. Notice how in the first example the right ear hears louder sounds than the left, while in the second example both ears hear similar sounds and the volume is generally lower.

time_impulse_response.png

To validate our model, we recorded the impulse response from a real room, pictured below, and also generated the impulse response using the software. The two plots are shown below the image of the room, with the simulated plot on the left, and the real one on the right. Note that for the simulation, only the left ear is shown. Though they both have a similar duration, the shapes are generally different. This is likely due to noise in the real room response as well as potential limitations in the microphone and the fact that we are only simulating with three orders of echoes. 

test_room.jpg
impulse_responses_time_domain.png
real_room_response.png

Below are audio clips corresponding to the two sets of parameters used above. The first one has the source in front and to the right of the receiver. The second one has it behind and to the left. In the first one, it seems easy to locate the source as being to the right, but it is hard to find the exact distance. In the second one, it is hard even to tell that the sound is coming from the left, let alone to specify a distance or exact angle. The plots of the sound show that the sounds are different as expected, which suggests that difficulty in sound localization is a fault of the user, not the audio. Our research seemed to support this, indicating that in trials, users were generally worse at spatial localization than expected. To get the best effect, use headphones.

Source: Front, right

Source: Back, left

Challenges

One challenge that we faced was in developing the algorithm to compute the time delays between each echo. We first attempted to do this using a method called ray tracing. Essentially, this involves generating rays that emanate from the source in random directions. The computer tracks where the rays travel as they move through the room and bounce off walls, until they eventually reach the receiver. We decided to switch from this to the analytical approach for several reasons. First, this method was generally bad at collecting useful data. With the source and receiver modeled as points, it is very unlikely that a randomly generated ray leaving the source will hit the receiver. Increasing the size of the receiver increases the probability of hitting it, but we still found that we needed either thousands of rays or an unreasonably large receiver in order to capture even first-order echoes off of each wall. This leads to the second failure of this method, which is that it is computationally expensive. Tracking thousands of rays through 3-dimensional space takes a lot of time and compute power; our tests to find first-order echoes ran for about a minute and didn't always capture all six echoes. In a real application such as generating sound for virtual reality, it is essential that the sound generation happen dynamically, either in real-time or just before it is needed in the application. Because ray tracing was infeasible for the applications of interest, we decided it was not valuable to continue trying to integrate it into our project, and we switched to the analytical approach about halfway through the project. Had we stuck with either approach for the full duration instead of working on both, we may have been able to reach some of our stretch goals of increasing the complexity of the room and accounting for more filtering effects.

Validating the final synthesis program was also difficult. Though we collected data from a test room to compare against, this data was noisy, and limitations in the sampling rate of the microphone may have cut out some peaks that would make it match the simulation more closely. Therefore, it is difficult for us to know, through comparison with the data, whether our model is reasonable or not. We found that a good way to verify the model was simply to generate rooms with sources in different places and see if we could hear the difference between them. This was also difficult, because humans are generally not very good at localization. However, we were able to hear a noticeable difference, especially in cases where the source was in front of the receiver and to the right. It is unclear why in-front-and-to-the-left scenarios were harder to hear; these scenarios actually sounded similar to when the source was to the right. Potentially this indicates a flaw in the external data set used to generate the HRTF. However, since the plots of the data going to each ear show clear differences as the source moves, it is more likely that our group is good at locating sound sources only in a select region.

Alternative Methods

The geometric model we chose is just one of many used to simulate sound. There are three primary families of simulation methods: empirical, wave-based, and geometric.

Empirical methods use rough estimates of parameters like reverberation time, sound pressure level, and sound reduction coefficients that are found empirically in very specific testing environments. While this approach is widely used when engineering for specific acoustic properties in the real world, it is not especially well suited to computer simulation and makes many assumptions about the environment. The fact that it assumes a perfectly diffuse sound field and does not consider the location of bodies in 3D space makes it undesirable for highly dynamic environments such as those found in computer simulation. Nonetheless, this method is the simplest to implement.

Wave-based methods account for the wave nature of sound and are able to simulate effects such as diffraction and interference. Techniques such as finite element analysis are used to model wave propagation by breaking objects into small nodes. The nodes must be much smaller than the wavelength of the sound being modeled, which can make it difficult to simulate higher frequencies in real time. Of the three families, this is the most computationally expensive but also the most accurate.

As mentioned earlier, geometric methods simulate sound as energy-carrying rays that interact with objects in the environment, which are represented as surfaces. While our method used points and vectors to represent objects, ray tracing is a technique that typically sends many rays out of a source modeled as a simple shape such as a sphere or cone and traces them through the environment, where objects are represented with as many polygons as is practical. Geometric methods can account for effects like air absorption, material absorption, and dynamic environments, making them reasonably accurate except at very low frequencies, where the wave nature of sound has more pronounced effects. For this reason, geometric methods can be said to use a high-frequency approximation of sound. While these methods are also computationally expensive, parameters such as the number of rays and the polygon count can be adjusted to make them work in real time.

Sources

Dokmanic, I., Parhizkar, R., Walther, A., Lu, Y. M., & Vetterli, M. (2013). Acoustic echoes reveal room shape. Proceedings of the National Academy of Sciences, 110(30), 12186-12191. doi:10.1073/pnas.1221464110

Tonneson, C., & Steinmetz, J. (n.d.). 3D Sound Synthesis. Retrieved from http://www.hitl.washington.edu/projects/knowledge_base/virtual-worlds/EVE/I.B.1.3DSoundSynthesis.html

Elorza, D. O. (2005). Room acoustics modeling using the raytracing method: Implementation and evaluation. University of Turku. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.43&rep=rep1&type=pdf

Vorländer, M. (2007). Annex. In Auralization: Fundamentals of Acoustics, Modelling, Simulation, Algorithms and Acoustic Virtual Reality.

HRTF Data: UC Davis - https://www.ece.ucdavis.edu/cipic/spatial-sound/hrtf-data/
