Using sound to model the world | MIT News

Imagine the thumping chords of a pipe organ reverberating through the cavernous sanctuary of a massive stone cathedral.

The sound a cathedral visitor will hear is affected by many factors, including the location of the organ, where the listener is standing, whether any columns, pews, or other obstacles stand between them, what the walls are made of, the locations of windows or doorways, etc. Hearing a sound can help a person imagine their surroundings.

Researchers from MIT and the MIT-IBM Watson AI Lab are exploring the use of spatial acoustic information to help machines better envision their environments, too. They developed a machine learning model that can capture how any sound in a room will propagate through the space, allowing the model to simulate what a listener would hear at different locations.

By accurately modeling the acoustics of a scene, the system can learn the underlying 3D geometry of a room from sound recordings. The researchers can use the acoustic information their system captures to build accurate visual representations of a room, similar to how humans use sound when estimating the properties of their physical environment.

In addition to its potential applications in virtual and augmented reality, this technique could help artificial intelligence agents better understand the world around them. For example, by modeling the acoustic properties of sound in its environment, an underwater exploration robot could sense things that are farther away than it could with vision alone, says Yilun Du, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and co-author of a paper describing the model.

“Most researchers have so far focused only on modeling vision. But as humans, we have multimodal perception. Not only is vision important, sound is also important. I think this work opens up an exciting avenue of research into better use of sound to model the world,” says Du.

Joining Du on the paper are lead author Andrew Luo, a graduate student at Carnegie Mellon University (CMU); Michael J. Tarr, the Kavčić-Moura Professor of Cognitive and Brain Sciences at CMU; and senior authors Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in MIT’s Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL; and Chuang Gan, principal investigator at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

Sound and image

In computer vision research, a type of machine learning model called an implicit neural representation model has been used to generate smooth, continuous reconstructions of 3D scenes from images. These models use neural networks, which contain layers of interconnected nodes, or neurons, that process data to complete a task.

The MIT researchers used the same type of model to capture how sound travels continuously through a scene.

But they found that vision models take advantage of a property known as photometric consistency, which doesn’t apply to sound. If one looks at the same object from two different locations, the object looks about the same. But with sound, if one changes location, the sound one hears can be completely different due to obstacles, distance, etc. This makes predicting audio very difficult.

The researchers overcame this problem by including two properties of acoustics in their model: the reciprocal nature of sound and the influence of local geometric features.

Sound is reciprocal, meaning that if the source of a sound and a listener switch positions, what the person hears remains unchanged. In addition, what one hears in a particular area is strongly influenced by local features, such as an obstacle between the listener and the source of the sound.
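As a rough illustration of reciprocity, the sketch below forces a predictor to return the same value when emitter and listener are swapped by averaging both orderings. This is only one generic way to build in the symmetry; the article does not say how the researchers enforce it, and `toy_response` is a hypothetical stand-in for a learned network.

```python
import math

def toy_response(emitter, listener):
    """Hypothetical, deliberately asymmetric stand-in for a learned predictor."""
    ex, ey = emitter
    lx, ly = listener
    distance = math.hypot(ex - lx, ey - ly)
    # The 0.1 * ex term breaks symmetry on purpose.
    return 1.0 / (1.0 + distance) + 0.1 * ex

def reciprocal_response(emitter, listener):
    """Average both orderings so swapping the two positions cannot change the result."""
    return 0.5 * (toy_response(emitter, listener) + toy_response(listener, emitter))

a, b = (1.0, 2.0), (4.0, 6.0)
# True: the wrapped predictor gives the same answer in both directions.
swapped_equal = math.isclose(reciprocal_response(a, b), reciprocal_response(b, a))
```

The raw `toy_response` gives different values for the two orderings, while the wrapped version cannot, which is the property described above.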

To incorporate these two factors into their model, called a neural acoustic field (NAF), they augment the neural network with a grid that captures objects and architectural features in the scene, such as doorways or walls. The model randomly samples points on that grid to learn the features at specific locations.

“If you imagine standing near a doorway, what most strongly affects what you hear is the presence of that doorway, not necessarily geometric features far away from you on the other side of the room. We found that this information allows better generalization than a simple fully connected network,” says Luo.
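The idea of reading local features off a grid can be sketched with plain bilinear interpolation: a query position picks up mostly the values stored at nearby grid points. This is a minimal sketch under assumptions; the grid contents, the scalar features, and the interpolation scheme are hypothetical, and the article does not describe the exact mechanism the NAF uses.

```python
def sample_grid(grid, x, y):
    """Bilinearly interpolate a 2D grid of scalar features at a continuous (x, y).

    grid[row][col] holds the feature at integer position (col, row).
    """
    rows, cols = len(grid), len(grid[0])
    # Clamp so the query stays inside the grid.
    x = min(max(x, 0.0), cols - 1.0)
    y = min(max(y, 0.0), rows - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    # Blend along x on the two neighboring rows, then blend those along y.
    top = grid[y0][x0] * (1 - tx) + grid[y0][x1] * tx
    bottom = grid[y1][x0] * (1 - tx) + grid[y1][x1] * tx
    return top * (1 - ty) + bottom * ty

# Hypothetical 3x3 grid: a single strong feature (say, a doorway) at the center.
features = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
near = sample_grid(features, 1.0, 1.0)     # at the feature
halfway = sample_grid(features, 1.5, 1.0)  # half a cell away
far = sample_grid(features, 0.0, 0.0)      # in a corner
```

The sampled value falls off as the query moves away from the doorway cell, so a network fed these values sees mainly what is close to the listener or emitter.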

From predicting sounds to visualizing scenes

Researchers can provide the NAF with visual information about a scene and a few spectrograms that show what a piece of audio would sound like when the emitter and listener are at target locations in the room. Then the model predicts how that audio would sound if the listener moved to any point in the scene.

The NAF outputs an impulse response, which captures how a sound should change as it propagates through the scene. The researchers then applied this impulse response to different sounds to hear how those sounds should change as a person walks across a room.
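Applying an impulse response to a sound amounts to convolving the two signals, which is the standard operation for a linear time-invariant system. The sketch below shows that step on toy data; the three-sample impulse response and the input signal are made up for illustration and are not outputs of the researchers' model.

```python
def convolve(signal, impulse_response):
    """Discrete linear convolution: the output of an LTI system for `signal`."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Each input sample triggers a scaled, delayed copy of the response.
            out[i + j] += s * h
    return out

# Hypothetical impulse response: direct sound, then an echo
# two samples later at half strength.
impulse_response = [1.0, 0.0, 0.5]
dry = [1.0, 2.0, 3.0]
wet = convolve(dry, impulse_response)
# wet == [1.0, 2.0, 3.5, 1.0, 1.5]
```

Because the impulse response depends only on the room and the two positions, the same response can be reused for any input sound, whether speech, music, or noise.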

For example, if a song is played from a loudspeaker in the middle of a room, their model would show how that sound gets louder as a person approaches the loudspeaker and then becomes muffled as they walk out into an adjacent hallway.

When the researchers compared their technique with other methods that model acoustic information, it generated more accurate sound models in every case. And because it learned local geometric information, their model could generalize to new locations in a scene much better than other methods.

In addition, they found that applying the acoustic information their model learns to a computer vision model can lead to a better visual reconstruction of the scene.

“If you have limited visibility, you can use these acoustic properties to define boundaries more sharply, for example. And maybe this is because to accurately represent the acoustics of a scene, you need to capture the underlying 3D geometry of that scene,” says Du.

The researchers plan to further improve the model so that it can be generalized to brand new scenes. They also want to apply this technique to more complex impulse responses and larger scenes, such as entire buildings or even a town or city.

“This new technique may open up new possibilities to create a multimodal immersive experience in the metaverse application,” Gan added.

“My group has done a lot of work on using machine learning methods to accelerate acoustic simulation or model the acoustics of real scenes. This paper by Chuang Gan and his co-authors is clearly a big step forward in this direction,” said Dinesh Manocha, the Paul Chrisman Iribe Professor of Computer Science and Electrical and Computer Engineering at the University of Maryland, who was not involved with this work. “In particular, this paper introduces a nice implicit representation that can capture how sound can propagate in real scenes by modeling it using a linear time-invariant system. This work could have many applications in AR/VR, as well as understanding real-world scenes.”

This work is supported in part by the MIT-IBM Watson AI Lab and the Tianqiao and Chrissy Chen Institute.
