
Checkpoint 1 Report


Deliverables


1) Find some papers that try to do what you’re proposing and read at least two. Describe why one or the other might be better and what approach you lean towards, mentioning what things from 351 seem relevant to understanding the work.


2) Get more than one impulse response for a space (one balloon pop, two microphones) and compare them. How do they look different? I’m sure the differences will be minimal. Look in the frequency domain too to find the differences.

 

3) Read this related work: https://www-pnas-org.proxy.lib.umich.edu/content/110/30/12186. Summarize and comment on that work, mentioning what things from 351 seem relevant to understanding the work.

Acoustic Echoes  

This work describes how to use the impulse response of a room, recorded by several microphones, to accurately model the shape of the room. The impulse response is made up of a train of deltas, each one representing an echo from a wall in the room. The algorithm in the work uses properties of Euclidean distance matrices to assign the correct wall to each peak of the response. Using only the first-order echoes, the algorithm then calculates the locations of virtual sound sources and uses those to find the positions of the walls. To truly understand this work, the reader needs a very good understanding of linear algebra, specifically concepts related to Euclidean distance matrices. Most of this is not covered in EECS 351. However, knowledge of the concept of matrix rank, which I learned in class, was quite useful for inferring how the algorithm works. Simpler concepts from EECS 351 were also used. For example, the article describes the time response of rooms to an impulse and explains how reflections can create echo swapping. In EECS 351, we have dealt with impulse responses and have also discussed the ambiguity of data in signals.
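
To make the delta-train and virtual-source picture concrete, here is a minimal sketch in Python with numpy (the room geometry, positions, and function names are our own toy choices, not values from the paper): reflecting the true source across a wall plane yields the virtual source for that wall, and the virtual source's distance to a microphone gives the first-order echo's delay.

```python
import numpy as np

C = 343.0  # speed of sound in air (m/s)

def image_source(source, wall_point, wall_normal):
    """Reflect the true source across a wall plane; the reflection is the
    virtual source responsible for that wall's first-order echo."""
    n = wall_normal / np.linalg.norm(wall_normal)
    return source - 2.0 * np.dot(source - wall_point, n) * n

src = np.array([1.0, 2.0, 1.5])   # true source position (m)
mic = np.array([3.0, 1.0, 1.2])   # microphone position (m)

# Four walls of a toy 5 m x 4 m room, each given as (point on wall, normal)
walls = [(np.array([0.0, 0, 0]), np.array([1.0, 0, 0])),
         (np.array([5.0, 0, 0]), np.array([1.0, 0, 0])),
         (np.array([0, 0.0, 0]), np.array([0, 1.0, 0])),
         (np.array([0, 4.0, 0]), np.array([0, 1.0, 0]))]

# The ideal impulse response is a train of deltas: the direct path plus
# one echo per wall, each delayed by (distance to mic) / C.
delays = [np.linalg.norm(src - mic) / C]
delays += [np.linalg.norm(image_source(src, p, n) - mic) / C for p, n in walls]
print(sorted(delays))
```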

 

Though most of the detailed work in this paper is not useful to our project, the base knowledge that it presents gives us a good foundation for starting on our own application. The description of virtual sound sources and the timing between peaks in the impulse response is especially useful as we will need to generate impulse responses artificially in order to simulate sound in a room. We will likely generate sound by modeling walls as virtual sound sources, so it is important for us to understand how to calculate the source locations, especially for higher-order echoes. The paper was also useful because it drew attention to many sources of error that the researchers faced in collecting data. First, they found that inaccuracies in measuring distances between microphones forced them to apply a margin of error to their calculations. We will also have to account for the fact that we cannot perfectly measure the dimensions of a room and the locations of all of the microphones and sound sources. Second, they found that low sampling rates and limitations of the microphones themselves led to slightly inaccurate impulse responses in which the delays between peaks were wrong and ghost peaks appeared. We will also likely face measurement error from the microphones. Being aware of these issues may help us identify bad data. ~ CK

Sound Synthesis Localization 

“A Refined Algorithm of 3-D Sound Synthesis” by Zhang, Tan, and M. H. Er. Sourced from IEEE Xplore.

 

This article discusses an algorithm that adjusts how head-related transfer functions are modeled in order to produce more accurate synthetic sound. Head-related transfer functions model each ear as a system whose transfer function depends on the angle of the sound source. Differences in arrival time and intensity between the ears, along with the shape of the ears themselves, cause these transfer functions to change as the sound source moves. The researchers found that existing attempts to model these transfer functions and use them to create artificial sound led to listeners making errors in sound localization; that is, they incorrectly identified the source of sounds from certain angles. The new algorithm attempts to improve localization by multiplying the transfer function at every frequency by a weight that depends on the ratio of the magnitude at that frequency to the maximum magnitude. This significantly attenuates the transfer function at frequencies whose magnitudes are already much less than the maximum. Accentuating the peaks of the transfer function in this way made the subjects in their experiments better at localization.
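
As a rough sketch of that weighting step in Python with numpy (the paper specifies the exact weighting function; the power-law form, parameter, and names below are our own assumptions):

```python
import numpy as np

def accentuate_peaks(H, alpha=1.0):
    """Scale each frequency bin of an HRTF by a weight based on the ratio
    of its magnitude to the maximum magnitude, so bins already far below
    the peak are attenuated further. The power-law form of the weight is
    our own placeholder for the paper's actual weighting function."""
    mag = np.abs(H)
    weights = (mag / mag.max()) ** alpha   # each weight lies in [0, 1]
    return H * weights                     # phase is left untouched

# Stand-in "HRTF": the spectrum of a random 1024-sample impulse response
H = np.fft.rfft(np.random.default_rng(1).standard_normal(1024))
H_sharp = accentuate_peaks(H, alpha=0.5)
```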

 

In our project, it will be useful to apply this adjustment to whatever head-related transfer functions we select. If the adjustment helps our listeners identify sound sources more accurately, our synthesized sound will be that much more useful than it would be without it. As implementing this algorithm only involves multiplying a transfer function by a vector of weights, it is something we are well-equipped to do, and there is no reason to avoid it in this project. ~ CK

Computer-generated pulse signal 

"Computer-generated pulse signal applied for sound measurement" by Aoshima, Nobuharu sourced from asacitation.org 


This article discusses using a computer-generated pulse signal as opposed to the traditional balloon pop or pistol shot. These computer-generated pulses have a flat power spectrum and can be expanded and compressed in time without altering that spectrum (the article includes example transfer functions for the filters that perform the expansion and compression). The article specifies a signal that meets the desired qualities of an impulse in the frequency domain.
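
A minimal sketch of that construction in Python with numpy, loosely following the time-stretched-pulse idea (the constants and exact phase function here are our own assumptions, not the article's):

```python
import numpy as np

N = 4096        # FFT length
m = N // 4      # stretch parameter: larger m spreads the energy over time
k = np.arange(N // 2 + 1)

# Flat magnitude, quadratic phase: the "stretch" lives entirely in the
# phase, so the power spectrum stays flat while the peak level drops.
spectrum = np.exp(-1j * 4.0 * np.pi * m * (k / N) ** 2)
pulse = np.fft.irfft(spectrum, n=N)   # expanded (time-stretched) test signal

# Compression is the conjugate-phase filter; the product has unit
# magnitude and zero phase everywhere, i.e. it restores a single delta.
restored = np.fft.irfft(spectrum * np.conj(spectrum), n=N)
```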

 

The paper then discusses techniques for collecting impulse-response data with and without echoes. The purpose of measuring the impulse response without the echoes is that this data, combined with the original pulse signal, yields the transfer function of the speaker, which can then be compensated for. All of the article's computation is done in the frequency domain. The paper then describes how specific echoes can be singled out by looking for peaks of the response in the time domain.
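
That frequency-domain division amounts to a few lines; a minimal sketch in Python with numpy (the regularization term and names are our own):

```python
import numpy as np

def system_response(pulse, recorded, eps=1e-12):
    """Estimate the system's transfer function by dividing the spectrum of
    the recorded signal by the spectrum of the known pulse; eps is our own
    guard against division by near-zero bins."""
    n = len(recorded)
    P = np.fft.rfft(pulse, n=n)
    R = np.fft.rfft(recorded, n=n)
    H = R * np.conj(P) / (np.abs(P) ** 2 + eps)   # regularized division
    return np.fft.irfft(H, n=n)                   # impulse response estimate
```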

Lastly, this article measured the sound attenuation caused by a barrier. This was accomplished simply by putting a wood plate between the speaker and microphone (the distance to the plate was then varied for different results). After recording the data, the power spectrum of the recording behind the plate was plotted against the power spectrum with no barrier, making the attenuation clearly visible. Overall, the article suggests four possible advantages of computer-generated impulse signals: (1) the signals have flat power spectra; (2) each pulse is perfectly reproducible, which helps eliminate error and improves averaging across trials; (3) removal of unwanted sounds or durations, or addition of wanted ones, can be performed in software; and (4) signal power can be made larger without increasing the peak level via the pulse expansion and compression techniques.
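
The barrier comparison itself is just a ratio of power spectra; a minimal sketch in Python with numpy (names are ours):

```python
import numpy as np

def attenuation_db(behind_barrier, no_barrier):
    """Per-frequency attenuation as the ratio of the two power spectra,
    in dB; assumes both recordings have the same length and alignment."""
    P_barrier = np.abs(np.fft.rfft(behind_barrier)) ** 2
    P_free = np.abs(np.fft.rfft(no_barrier)) ** 2
    return 10.0 * np.log10(P_barrier / P_free)
```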


In our project, we will need to produce an impulse to determine the impulse response of different materials and distances, so that we can represent walls as filters. The more accurate our impulse and response data, the better the filters we can construct to simulate our rooms in software. Since their technique for determining the system response requires only Fourier transforms followed by division of the received signal's spectrum by the pulse's spectrum, we know everything we need to replicate and modify the experiments in this article. ~ WC

3D Binaural sound synthesis

"A Sparsity-Based Approach to 3D Binaural Sound Synthesis Using Time-Frequency Array Processing" by Cobos, Lopez, and Spors. Sourced from the EURASIP Journal on Advances in Signal Processing. 


This paper presents an alternative to the traditional dummy-head setup for 3D binaural sound reproduction. Using a small tetrahedral array of microphones, the authors propose a two-stage approach to capturing and processing the spatial characteristics of recorded signals. The first stage, analysis, begins by transforming the four input signals into the time-frequency domain with the Short-Time Fourier Transform (STFT). For an acoustic environment with a specified number of sound sources and sensors, the input signals can be modeled as a convolutive mixture with finite impulse responses. The STFT divides the time-domain signal into a series of overlapping pieces and applies the Fourier Transform to each piece after windowing it. Various assumptions can be made to reduce and simplify the transformed model, as shown in the paper. Next, by making use of the phase-difference information of microphone pairs along with the geometric properties of the tetrahedron, the authors are able to extract Direction-of-Arrival (DOA) data for each time-frequency bin. The second stage, synthesis, attempts to accurately reproduce the original sound with regard to its spatial characteristics. The caveat is that this is performed solely with the DOA information from the previous stage, without access to the individual source signals. The synthesis stage makes use of a set of left and right head-related transfer functions (HRTFs), parametrized by the DOA data, to selectively filter a signal from one of the microphones in the time-frequency domain, ultimately outputting a reproduction of the original sound.
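
To make the phase-difference idea concrete, here is a minimal single-pair sketch in Python with numpy (the spacing, far-field assumption, and names are our own simplifications of the paper's full tetrahedral geometry):

```python
import numpy as np

C = 343.0   # speed of sound (m/s)
D = 0.05    # assumed spacing of the microphone pair (m)

def doa_per_bin(X1, X2, freqs_hz):
    """Angle of arrival for each time-frequency bin, from the phase
    difference between two microphones. Under a far-field assumption the
    phase difference at frequency f is 2*pi*f*D*cos(theta)/C; freqs_hz
    must be nonzero, and the estimate is only valid below the spatial
    aliasing frequency C / (2 * D)."""
    dphi = np.angle(X1 * np.conj(X2))              # per-bin phase difference
    cos_theta = C * dphi / (2.0 * np.pi * freqs_hz * D)
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

# Toy usage: a crude one-sample delay between two stand-in signals makes
# every bin report the same angle, as a pure delay should.
n, fs = 1024, 16000
x1 = np.random.default_rng(2).standard_normal(n)
x2 = np.roll(x1, 1)                                # crude 1-sample delay
X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
angles = doa_per_bin(X1[1:], X2[1:], freqs[1:])    # skip the DC bin
```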


The authors applied this methodology to several real and simulated mixtures of speech and music with varying numbers of sources. They were able to achieve a 3D reproduction comparable to the conventional dummy head method for binaural sound, although they conclude that the quality degrades as the number of sources and reverberation increase, and that spectral overlap makes this approach better suited to speech than music.

 

In our project, it will be helpful to understand how to extract and process DOA information, especially with regard to modeling a room. We want to be able to parametrize a room so that we can accurately simulate a sound inside it. This paper provides a specific methodology for recreating a given signal inside a given room, using methods for time-frequency analysis and filtering that we discuss in EECS 351. Our task would be to extend the application of the DOA data to characterize the room along with the sound. Additionally, this paper is especially relevant to our goals because of its focus on binaural sound. ~ TM

Physically-based sound effects

“FOLEYAUTOMATIC: Physically-based Sound Effects for Interactive Simulation and Animation” by van den Doel, Kry, and Pai. Sourced from ACM Digital Library.

 

This paper discusses an approach to producing realistic sounds in a virtual environment in real time. The process consists of three steps: modelling real-world materials and obtaining their acoustical properties using modal synthesis, simulating contact of rigid bodies to obtain control parameters for audio synthesis, and computing various types of contact forces, including scraping, sliding, and rolling, to drive the modal resonance models in simulation. Modal synthesis is used for this implementation of virtual sound simulation because objects produce different sounds depending on where they are struck. Modal synthesis models can account for the propagation of forces through objects that emit sound waves of differing tone and timbre based on the location of the original force. These models, or "audio textures", are built from a bank of damped harmonic oscillators fit to the impulse responses of the object or material when struck at different locations. The simulation of the contact of surfaces is done using piecewise parametric approximations of surfaces rather than polyhedral approximations, to produce more realistic motion for continuous contact. Finally, based on the motion of the interacting objects, "sound forces" are produced at proper audio sample rates (as opposed to graphics frame rates, which are orders of magnitude lower) to excite the material models and produce virtual sound. Various novel algorithms are then described for the different contact motions.
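
As a toy version of the damped-oscillator bank, a minimal sketch in Python with numpy (all mode frequencies, decay rates, and gains are made-up placeholders, not values from the paper):

```python
import numpy as np

def modal_impact(freqs_hz, decays, gains, fs=44100, dur=1.0):
    """Modal synthesis of an impact: the struck object rings as a sum of
    damped harmonic oscillators. A different strike point would reweight
    `gains`, changing the timbre, while the mode frequencies and decay
    rates stay fixed properties of the object."""
    t = np.arange(int(fs * dur)) / fs
    out = np.zeros_like(t)
    for f, d, g in zip(freqs_hz, decays, gains):
        out += g * np.exp(-d * t) * np.sin(2.0 * np.pi * f * t)
    return out

# A hypothetical small metal object: a few inharmonic modes
sound = modal_impact([523.0, 1187.0, 2210.0], [6.0, 9.0, 14.0], [1.0, 0.5, 0.3])
```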

 

The process presented was packaged in a Java toolkit called FoleyAutomatic and tested using a few scenarios produced both in the real world and in animation. While the effect was apparently very convincing, there was no metric to quantitatively compare the simulated sound with the real-world sound, primarily because real-world actions cannot be reproduced exactly in simulation.

 

Although the majority of this paper discusses algorithms for creating contact sounds, in our project we do plan on simulating the effect that material has on sound. We want to be able to accurately model a room made of different materials, and seeing what a material model may consist of is helpful in determining the parameters we may want to capture, or at least contrive, in our models. While we aren’t as concerned with capturing very precise real-world data as we are with simulating reasonably accurate sound, some form of modal synthesis could be an interesting stretch goal. ~ JB
