Binaural audio is perfect for VR. Binaural audio recordings, on the other hand, are not. Not at all.
Just as a stereo pair of videos gives us the illusion of a 3d view from exactly one perspective, but does not contain the information to let us know how the world looks if we move our head to the left, so too is a binaural recording a 3d illusion of sound without the information to tell us what the world sounds like from any other ear locations.
In stereo film and binaural recording, all the computation and 3d-ness happens in our brain without the recording having a 3d model or any idea what 3d is. With enough cameras on a camera ball, you could create an actual 3d point cloud of the world within camera view (assuming software that doesn’t quite exist yet). Or you can use a light-field camera to capture how all the light waves look from all the locations within some small space. Both those video options aren’t really viable solutions for VR video just yet, but 360 stereo video is good enough to make my brain happy.
What about sound? What are our options, and what is good enough?
Binaural recording is not good enough, but sound fields are easier to capture than light fields. You can do a decent job with a small tetrahedral mic array, which through some mathematics can model the sound field at that point.
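The math behind the tetrahedral array is surprisingly small. Here's a sketch of the standard conversion from the four capsule signals ("A-format") to the sound-field representation ("B-format"), assuming an ideal coincident array; real converters also apply frequency-dependent filters to compensate for the capsules not being at exactly the same point:

```python
# Convert A-format (four cardioid capsules of a tetrahedral mic) into
# first-order B-format (W, X, Y, Z). Capsule naming: flu = front-left-up,
# frd = front-right-down, bld = back-left-down, bru = back-right-up.
# This is the idealized coincident-capsule case; real converters add
# correction filters for capsule spacing.
def a_to_b_format(flu, frd, bld, bru):
    w = 0.5 * (flu + frd + bld + bru)   # omnidirectional pressure
    x = 0.5 * (flu + frd - bld - bru)   # front-back figure-eight
    y = 0.5 * (flu - frd + bld - bru)   # left-right figure-eight
    z = 0.5 * (flu - frd - bld + bru)   # up-down figure-eight
    return w, x, y, z
```

A sanity check: a sound hitting all four capsules equally shows up only in the omnidirectional W channel, and a sound hitting just the two front capsules shows up in W and the front-back X channel.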
This is known as ambisonics, and it's a relatively open technology (most patents have expired), yet most people haven't heard of it, just as most people haven't heard of binaural recording. The information can be stored in just 4 regular audio tracks (or more for higher-order ambisonics), which unlike normal audio formats don't represent the sound that should come out of speakers, but the information for a sound field. This "B-format" audio can be decoded, using good ol' fashioned mathematics, into a more standard tracks-that-should-play-out-of-speakers format or basic stereo, or turned into an equation for a spherical harmonic series (where that series truncates depends on the order of your ambisonics).
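You can also go the other direction: pan a synthesized mono sound into B-format with a handful of trig, since the four channels are just the zeroth- and first-order spherical harmonics sampled in the source's direction. A sketch, assuming the common convention where W carries a 1/√2 scaling:

```python
import math

# Encode a mono sample into first-order B-format, given the source's
# direction: azimuth in radians (counterclockwise from straight ahead)
# and elevation in radians (up from the horizontal plane).
# Convention assumed here: W scaled by 1/sqrt(2), as in traditional
# B-format; other conventions (like ambiX/SN3D) scale differently.
def encode_b_format(sample, azimuth, elevation):
    w = sample / math.sqrt(2)                             # order 0 (omni)
    x = sample * math.cos(azimuth) * math.cos(elevation)  # order 1, front-back
    y = sample * math.sin(azimuth) * math.cos(elevation)  # order 1, left-right
    z = sample * math.sin(elevation)                      # order 1, up-down
    return w, x, y, z
```

A source straight ahead lands entirely in W and X; a source to the left moves into Y; a source overhead moves into Z.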
This technology has been around for a long time, but real-life uses are finicky. In a room full of speakers, even assuming you can set up your specific speaker arrangement properly and decode the sound for it, the effect is only perfect in one location at the center of those speakers, which makes it impractical even for home theaters. For pre-VR headphone uses, it has to become stereo anyway, so why bother? But with VR, you're always at the exact center, and the stereo encoding can change in real time based on head-tracked rotations!
With just 4 small mics, there’s no way the sound field can perfectly simulate exactly what you’d hear as you move your human-sized head around a head-sized sphere. But is it good enough? Is it convincing in VR?
I don’t know, because I haven’t actually tried it in VR yet, which is why I’ve waited so long to post about it. But at some point it’s time to suck it up and write a post, with the promise to get back to you with results later.
Here are some methods of VR video sound implementation I’ve encountered, from ourselves and others:
1. Use binaural recording and ignore head tracking

We recorded a concert, and assumed that at concerts everyone is used to sound being dislocated from musicians because it comes out of speakers, so we didn’t worry about head-tracked audio. We wanted that nice binaural feeling of stuff happening all around you, so we made an ad-hoc dummy head out of a rhombic dodecahedron and modeling clay [right] and put it above the audience. The audience noise is constantly changing with no one specific source anyway, so it works.
We’ve also done videos with voiceovers, where the voice is supposed to float magically anyway, so whatever, head tracking! The locationless voiceover for The Relaxatron is supported by binaural sound clips of birds and stuff, so, it’s all good.
2. Use a plain head-locked stereo track

In our first VR talk show recording, rather than get fancy with audio, we assumed people would mostly be facing the couch and looking slightly back and forth between me and Emily. We simply created a regular stereo track of our vocals, no head tracking or anything, which creates a convincing illusion that we did something fancier, as long as you only behave as expected. We got feedback from someone who thought we were doing head-tracked sound, bwahaha!
3. Render different sound clips in locations in a 3d environment
In our VR Video Bubbles demo in Unity, various spheres textured with video were placed around a 3d environment. The sound for each video came from a virtual speaker placed where the narration was supposed to come from in the video. Unity’s integrated Oculus head tracking takes care of the 3d sound from there: walk towards the video bubble, hear the sound grow louder. Turn your head, and hear the sound pan around.
It would be trivial to place speakers in still locations on a video bubble, such as using the VR talk show as a spherical texture and placing each of our voice recordings where our heads usually are. Our locations are constant enough that this implementation would work well.
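The basic behavior a game engine gives you for free, louder as you approach and panned as you turn, is conceptually simple. Here's a toy Python sketch of the idea (hypothetical names; Unity's actual spatializer uses configurable rolloff curves and much better panning than this):

```python
import math

# Toy positional audio: gain falls off with distance, and a simple
# equal-power stereo pan follows the source's direction relative to
# the listener's yaw. Positions are 2d (x, y) for simplicity; yaw is
# in radians, measured the same way as math.atan2.
def spatialize(sample, source_xy, listener_xy, listener_yaw):
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    gain = 1.0 / max(distance, 1.0)            # inverse-distance rolloff
    # Angle of the source relative to where the listener is facing:
    relative = math.atan2(dy, dx) - listener_yaw
    pan = math.sin(relative)                   # +1 = hard left, -1 = hard right
    left = sample * gain * math.sqrt(0.5 * (1 + pan))
    right = sample * gain * math.sqrt(0.5 * (1 - pan))
    return left, right
```

Walk toward the source and the gain rises; turn your head and the pan swings, which is exactly the "walk towards the video bubble, hear the sound grow louder" behavior described above.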
But we can do better!
The technology already exists to create motion for speakers in 3d rendered environments, so with some tedious work-by-hand you could give any sound clip a motion that follows the thing you want to be playing that sound clip. This is yet another place where rendered environments are way ahead of captured film, because each entity already exists in a defined location, unlike the mysterious pixels of film that only become objects in your head.
4. Code up a special specific implementation
Total Cinema 360’s “Blues” is a demo where a musician multi-tracks on a bunch of electronic instruments and each instrument’s sound file is set to that instrument’s static location in the video. It’s definitely worth checking out, though it’s advertised as an example of realistic 3d audio, which it’s definitely not. Sound files play and pan and cut out suddenly as you look around, and it’s not intuitive to associate the digital sound with its video counterpart. The package is more than a video file, requiring each separate sound file to be programmed to be in a place, and turn on when that pixel is in view or whatever their thing is doing (I didn’t dig into their code), so in its current form it’s not viable for anyone besides Total Cinema 360 to create something with it.
It’s not realistic or usable yet, but as an example of the potential of VR experiences it’s interesting. Why not have a specific sound suddenly cut in when you look at a thing? I can think of plenty of fun things you could do with a more focused version of that idea. There are already VR experiences where looking at things affects them and creates sound (I’m thinking especially of exploding asteroids with my mind in SightLine’s The Chair, where audio feedback is key), and I like it. It’d be fun to do something like record video in a museum and hear audio narration about the thing you’re looking at. Definitely looking forward to seeing what Total Cinema 360 does next.
5. Record with multiple binaural mics and fade between them

I love the way 3dio’s Free Space Omni-Binaural microphone looks. It’s beautiful, and it’s extra-beautiful when mounted on a dummy head as part of a performance. Each of the four lovely-looking mic pairs is a good binaural mic, so this beautiful creature can record four good binaural recordings at the same time. That is what it can do. It cannot do more than that.
This mic was developed for Beck and Chris Milk’s “Hello Again,” a cool 360 visual/audio experience. You can pan around the concert video, and the four binaural recordings are panned to match, mixing together the two closest dummy head ears when your ears are between them. The mics, cameras, and stage are constantly moving in circles.
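As I understand that panning scheme, the player crossfades between the two dummy-head pairs adjacent to your current yaw. A guess at the logic in Python (this is my reconstruction, not Total Cinema 360's or 3dio's actual code):

```python
# Crossfade between N binaural recordings arranged evenly in a circle,
# based on the listener's yaw in degrees. Picks the two nearest mic
# pairs and blends them linearly. Each recording is a (left, right)
# pair of samples for the current moment.
def mix_binaural_ring(recordings, yaw_degrees):
    n = len(recordings)                  # e.g. 4 pairs, 90 degrees apart
    spacing = 360.0 / n
    position = (yaw_degrees % 360.0) / spacing
    i = int(position) % n                # nearest pair "behind" the yaw
    j = (i + 1) % n                      # nearest pair "ahead"
    t = position - int(position)         # blend weight for pair j
    left = (1 - t) * recordings[i][0] + t * recordings[j][0]
    right = (1 - t) * recordings[i][1] + t * recordings[j][1]
    return left, right
```

At yaw 0 you hear pair 0 alone, at yaw 45 an even mix of pairs 0 and 1, and so on around the circle, which is also where the interference comes from: between pairs you're summing two recordings taken several inches apart.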
I love the implementation for this piece. I don’t love that now they’re producing and selling this design advertised as actually recording 360 degrees of binaural sound. Humans can localize a forward-facing sound to a precision of a single degree, so I’d be comfortable saying that 3dio’s Free Space Omni records 4 degrees of binaural sound. Our peripheral sound localization skills can be as bad as 15 degrees, so you could say it records 60 degrees of binaural sound if you really stretch it.
It reminds me of this very beautiful but completely unrealistic camera mount. I like 360Heros and we’ve used their pentagonal stereo mount, which works, but if they’re selling that thing (left), I’m guessing they don’t understand how their working camera mounts actually work, because math.
Anyway, recording live music that’s all routed through speakers makes it difficult to judge a microphone system. None of the sounds quite come from the thing making the sound, but is that the mic’s fault, or a bug in the video player, or that the speaker playing that sound was somewhere else? Is the interference from the concert speakers, or is it from mixing together two recordings taken six inches apart? When you move a tiny bit and the sound leaps 90 degrees, is that because the recording is weird, or because the actual microphone is moving around during recording?
It’s good enough though, for that particular implementation. It’s great for anything where you need binaural sound full of cool noises in 3d and don’t care that those noises may be a bit distorted and only accurate to within 90 degrees. Mixing binaural recordings doesn’t average the locations of the sounds any more than layering two stereo photos gives you what you’d see if you looked from your nose, but it can create a cool effect and smoothish transition. Still, in the end we just need better image stitching.
6. Wild speculation
When it comes to fading between multiple mics, enough mics on a mic ball might be good enough for 3d sound with believable accuracy. The audio interference from mixing together mics placed a couple inches apart is technically audible, but you probably wouldn’t consciously notice it. And if your binaural recording lets you localize a noise to within 1 degree of accuracy, but your head is slightly between mics so that perception is ten degrees off from where the noise is supposed to come from in the video, that’s probably good enough.
Or, we could skip the gimmicky stuff and use real mathematics! Sound waves and sound fields, each mic being not a representation of the human ear but another data point making our model of reality more accurate. That’s why I’m interested in ambisonics. There are thousands of papers and plenty of good cold hard research about it, and if we can do math to it, we can do VR to it.
Ideally, once you’ve got your sound field, you’d render it based on 3d models of the listener’s ears (reconstructed from a few photos of their pinnae) to create for them a true spherical binaural experience. It’s your own 3d ear’s unique distortion of sound in space that lets your brain turn what should be a 1d amount of information into 3d perception. Dummy-head binaural recording using off-the-shelf ears can sound awesome, but it leads to much less accurate spatial sound perception than using a 3d model of your own head and rendering the sound just for you (Andrea told me about an experiment where they’d put fake ears on people, and the people became bad at sound localization).
(Also ideally, instead of just recording the spherical harmonics around a point, we’d get data around the space of possible ear positions, because spherical harmonics totally generalize to higher dimensions and I bet there’s lots of papers on this and someday soon we’re going to have the best amazingly realistic VR sound YESSSS)
Anyway, pretty much nothing you hear at the theater comes from one single recording that includes actors’ voices, footsteps, and background noise, so in that sense live ambisonic capture probably won’t be the future of high-production VR film. Fancy films record clean separate sound effects, music, and vocals, plus stock sounds, and mix them together later to sound like they’re in the right place (as well as to sound epic and level and clean and all that). Existing video editing tools are pretty good at tracking chosen objects in a film, with minimal work-by-hand, but as far as I know no audio is currently mixed by tracking it to an actual bit of pixels. It’s not necessary. Or at least, it wasn’t.
We could use the same tech that lets us place an explosion effect on a car, tracking the car to make the explosion realistically move with the shot, and use that information instead to track the car sound effect around you in VR.
In the gif to the left, I loaded our latest stereo spherical talk show video into After Effects and simply stuck a motion tracker on my head, and I already have a separate vocal track for my voice because I’m using a wireless lapel mic, so we’ve got all the necessary information. Then you have to get the information out, and into the listener’s ear.
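Turning a tracker position into a sound direction is just a projection change: in an equirectangular frame, horizontal pixel position maps linearly to azimuth and vertical position to elevation. A sketch, with conventions I've picked for illustration (center of frame is straight ahead, azimuth increases to the left; real tools and players vary):

```python
import math

# Convert a motion-tracker position in an equirectangular video frame
# into the azimuth/elevation of the tracked sound source, in radians.
# Assumed conventions: frame center = straight ahead (azimuth 0,
# elevation 0), azimuth increases counterclockwise (to the left),
# elevation increases upward.
def tracker_to_direction(px, py, width, height):
    azimuth = (0.5 - px / width) * 2 * math.pi   # range -pi .. pi
    elevation = (0.5 - py / height) * math.pi    # range -pi/2 .. pi/2
    return azimuth, elevation
```

Feed those angles, frame by frame, into whatever spatializes the separate vocal track, and the voice follows the head around the sphere.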
One option would be to do a fancier version of what Total Cinema 360 did with “Blues”: have a folder that contains files for the sound effects and a folder of their exported tracking information, then transform and sync it all together in the player itself. This would be a little bit of work, but relatively straightforward. Of course, I don’t really want to have a video file plus a folder of trackers and sound effects that can only be assembled by a special program, in a format that may or may not become standard for other players and that has lots of room for separate components to become misaligned.
As convoluted as it seems, I can totally see people exporting After Effects tracking info to be compatible with Unity, where you can then compile the entire video with all its tracking info and sound effects as a game, and then download and run an entire Unity game to watch a video. Actually I wouldn’t be surprised if you could already port After Effects tracking info to Unity.
Or, even if you didn’t originally capture your sound ambisonically, you can still use the ambisonic format to encode your fancily-produced spatial sound information as a sphere of sound instead of a million little clips and trackers, using a program that only the video creator, not the consumer, needs to use. It seems natural and easy for spherical video players to natively support ambisonic sound. A regular video file can store the info as a standard series of audio tracks representing a nice simple sphere of sound. Apply a rotation to match the head tracking, then collapse it into binaural stereo using virtual microphones. Mathematically simple. And anything fancier, such as higher-order ambisonics, is an easy extension of the technology.
Implementing basic ambisonics seems so easy, perhaps too easy, that I’m surprised I haven’t seen it done yet. It should be as simple as this:
record with sound field mic -> convert to standard B-Format -> use head-tracking info to apply rotation transform -> collapse to stereo.
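The playback end of that pipeline really is that little math. A sketch in Python, assuming first-order B-format (with the traditional 1/√2-scaled W channel) and decoding to stereo with a pair of virtual cardioid microphones aimed at ±90°; I'm only rotating about the vertical axis here, since pitch and roll are the same idea with more trig:

```python
import math

# Rotate a first-order B-format frame to compensate for the listener's
# head yaw (radians, counterclockwise). A source at azimuth theta ends
# up at theta - yaw relative to the head, which is exactly what head
# tracking needs.
def rotate_yaw(w, x, y, z, yaw):
    xr = x * math.cos(yaw) + y * math.sin(yaw)
    yr = -x * math.sin(yaw) + y * math.cos(yaw)
    return w, xr, yr, z

# Collapse B-format to stereo with two virtual cardioid mics.
# A virtual cardioid at azimuth theta is 0.5 * (sqrt(2)*W + cos(theta)*X
# + sin(theta)*Y); at +/-90 degrees the X term drops out.
def decode_to_stereo(w, x, y, z):
    left = 0.5 * (math.sqrt(2) * w + y)    # cardioid aimed left (+90 deg)
    right = 0.5 * (math.sqrt(2) * w - y)   # cardioid aimed right (-90 deg)
    return left, right
```

A source encoded hard left comes out entirely in the left channel; turn your virtual head 90° toward it and it lands dead center, equal in both ears, which is the whole head-tracked trick.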
Just how well will this theory work in practice? I don’t know! Perhaps I am making a fundamental error or something! I guess we’ll find out soon enough. I’d appreciate any insight you might have on the topic.