There are a bunch of mono spherical cameras coming to the consumer market soon, with automated stitching and an easy workflow.
But what about stereo? Why are there no consumer stereo spherical cameras, and what might we look forward to seeing once they do exist? What will be the standard for consumer VR video capture?
Spoiler: the ultimate conclusion is that 1. stereo is hard, 2. sacrifices must be made, and 3. it will look less like this…
…and more like this:
But this is research, so we need to go at it from the direction of most efficient camera setup based on geometry and technology and algorithms, no matter how much I am tempted by an a priori “everything is phones, everything will be phones, phones.”
1. Stereo is Hard
First, let me reiterate that Mono is easy. For mono, there’s a “correct” answer. Two camera views, taken from the same spot but in different directions, have a correct stitch that you can aim for (and do perfectly, in the best case). The idea of a consumer camera for spherical VR video, that automatically stitches from all the lenses and outputs a regular mp4, doesn’t warrant very much skepticism.
With stereo, you cannot simply make the perfect calibration for your camera setup. The same thing that makes stereo vision work, that things only match up one depth at a time, is exactly what makes the perfect stitch impossible if there’s distance between cameras (and to get stereo all around, you need to have distance between not just the left and right eye footage, but footage within one eye).
A stitching calibration can only ever be correct for one depth at a time, and only in the case of mono video, where everything can be projected onto a single sphere without parallax, is a perfect stitch theoretically possible.
In the stereo pair below, you can focus on the pink, orange, or yellow highlighter, each at a different depth, exactly because they appear in opposite orders from one eye to the other. But how would you stitch that?
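To get a feel for the size of the problem, here’s a back-of-envelope sketch (the baseline and depths are made-up numbers, not measurements of any particular rig): a stitch calibrated for one depth misaligns everything at other depths by roughly the difference in angular disparity.

```python
import math

def disparity_deg(baseline_m, depth_m):
    """Angular offset (degrees) between where two cameras separated by
    baseline_m see a point depth_m away (small-angle approximation)."""
    return math.degrees(baseline_m / depth_m)

# Hypothetical numbers: lenses 6 cm apart, a highlighter at 0.5 m,
# and a wall at 5 m behind it.
baseline = 0.06
near, far = 0.5, 5.0

# A stitch calibrated for the wall doubles the highlighter by roughly:
error = disparity_deg(baseline, near) - disparity_deg(baseline, far)
print(round(error, 2))  # about 6.2 degrees of doubling
```

Six degrees is a big chunk of your visual field, which is why an object sitting on a seam calibrated for the background looks so obviously wrong.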
Some cases are theoretically possible to deal with algorithmically. When someone is sitting on the stitch seam close to the camera, you could detect that by building a 3d model of the environment, and stitch for that depth. That calibration will not work for the stuff behind them if the person moves out of view, but with really good software (that may or may not someday exist), the stitching could dynamically change to stitch at the new depth at the seam, constantly building and updating a depth map of the space to always stitch the closest objects correctly (which can hopefully be reasonably rendered just from the footage, or maybe by adding an infrared camera or something).
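As a sketch of what that dynamic re-stitching would compute per frame (the baseline, pixel density, and depth values here are invented for illustration, not taken from any real stitcher):

```python
import math

def stitch_shift_px(baseline_m, seam_depth_m, px_per_deg):
    """Horizontal shift (pixels) that aligns two overlapping views for
    content at seam_depth_m; recomputed per frame as the depth changes."""
    return math.degrees(baseline_m / seam_depth_m) * px_per_deg

# Hypothetical nearest depths along the seam over four frames, e.g. a
# person walking up to the camera and then stepping out of view.
depths = [4.0, 2.0, 0.8, 4.0]
shifts = [stitch_shift_px(0.06, d, px_per_deg=20) for d in depths]
print([round(s, 1) for s in shifts])  # [17.2, 34.4, 85.9, 17.2]
```

The jump in shift as the person approaches is exactly the calibration change the software would have to make smoothly, which is why everything behind the person gets doubled while it does.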
But even this process for “perfect” stitching based on computed depth only does a correct stitch for that one depth, and if your scene has visible objects at more than one depth, objects behind the closest object will appear doubled. And most objects are not perfectly flat and oriented directly facing the camera, so even your most perfect stitching of an object will be a little off, which will be noticeable for things like human faces. There is no such thing as a perfect stitch that avoids this, because while stereo video pretends it can be mapped flat onto a sphere, that’s a convenient hack rather than a mathematical truth.
The ideal automatic stitching software might be able to detect and stitch optimizing for the pink highlighter, or the orange, or the yellow, or the background, but how would it choose?
You must make choices and there’s no right answer. When footage is filmed with the limitations of stereo in mind and stitched by hand, it is possible to set things up in ways that avoid hard choices. You can decide what the focus of the scene is, what the right depth is, whether doubling some things is better than losing others.
If there’s enough overlap between footage, you can make the stitch lines avoid hard areas. You can do piles of post-production to paint out doubled objects and other errors. The overlap can look similar enough that it can be warped into looking smooth, and you can take advantage of that with good set design and blocking, with avoiding having multiple things at multiple depths near the seams.
You can film with lots of cameras and try to spread the inevitable distortion evenly across an entire sphere with as many cameras as possible, or you can collect it into convenient corners.
You can leave the realm of computation and enter the realm of artistry.
Which is not exactly a good answer for a consumer product, and might leave you wondering if it’s worth even trying.
If you have the means to do piles of production and postproduction, it’s not clear that simultaneously captured stereo spherical video is the right choice (except for live events); you could shoot asynchronously and composite the pieces, green screen actors onto 3d rendered backgrounds, or just do completely 3d modeled stuff that you can actually move around inside of. But given the expense of both creation and playback (user-created VR needs to be playable on phones, and video will outpace 3d models in graphical realism for a while), that stuff is going to be out of regular consumers’ hands for a long time.
But consumers are also creators, and whatever the future brings, humans will still be driven to capture and share their own experiences in whatever medium can express them, perfect or not. The only question is what form consumer VR video might take, and what sacrifices must be made to the perfect ideal in order to make it inexpensive and automatic.
2. Sacrifices must be made
Our most consumer-like stereo spherical camera is our Hippo prototype [above], with pairs of cameras facing out in four directions, plus top and bottom.
It is not very consumer-like at all.
If the GoPro camera could capture as tall an image as it captures wide, we would only need one camera on top and one on bottom (top and bottom aren’t actually in stereo, because math), or if they could actually capture over 180 degrees vertically, we could eliminate the top and bottom cameras altogether.
Bringing the number of cameras down to 8 would still not solve any of the real problems keeping it from being usable by the average person:
- It must be a connected piece of hardware where the different cameras talk to each other to sync timing, exposure, white balance, etc.
- The hardware and lenses need to be super precise and rigid, so that a stock stitching calibration can look good at least in the places that don’t have stitch lines (no having the right eye slightly tilted from the left eye).
- The stitching software, besides automatically stitching the pile of footage, needs to be user friendly, reliable, and efficient enough to run on a normal laptop.
- The footage must be organized and stored on a single accessible SD card, which can be transferred or USB’d over to a computer where it can be automatically stitched using a standard calibration.
It would be expensive and have huge stitching errors, but there’s no technical reason it couldn’t exist today, and it would at least be usable.
(This is your regular reminder that right now a pile of GoPros does not have any of those necessary features, currently isn’t even close to a consumer VR camera, and is NOT a good choice for those who want to actually make stuff rather than research and innovate and frustrate.)
But as long as large errors along stitch lines can’t be avoided with automatic stitching, I think your best bet is to use as few cameras as possible and avoid putting stuff in error-prone areas. As long as the cameras are very precisely aligned, you can at least get the non-stitched sections looking good, and put the burden of avoiding stitch lines on the person filming, not the person producing.
We can make this burden as light as possible by making our stitch-safe zones as wide as possible, and assume people will get used to having a few terrible stitchy areas, just as signs of amateur production are usually assumed and ignored in other amateur-created content.
Such content turns a technology into a medium. All the pictures in this post were taken with my phone and are objectively terrible as photography, but it doesn’t matter because the point is not the photo itself but what the photo is of.
Instead of eight cameras in a square, you could do six in a triangle (plus top and bottom). Stereo pairs would have a sharp stitching angle with a very far stitching distance at the three corners, but you’d also have three wide stitch-safe zones perfect for filming selfie-style or with a friend.
The errors might be pretty bad, but they’d be clumped together into three avoidable stripes of error from floor to ceiling. With some creative set design and heavy post-production, a wide-angle stereo triangle could probably do pretty well for capturing live events. Plus, it’s still theoretically possible to get the corners to stitch into actual stereo video with post-production work by hand.
Unfortunately the stitching tradeoff is extreme. Five ultra-wide lenses (around 180 degrees each), used as 10 virtual cameras, mean ten stitching errors. That’s not acceptable for consumer cameras with automatic stitching.
I’m also not really sure how noticeably the image would warp, as you go down to sharper and sharper angles between cameras. Could you do 4? The field of view would have to be at least 200 degrees to get the 8 virtual camera views you need for stereo all around.
Three cameras, and you’d need each lens to be like 280 degrees, wider than any lens I’ve seen. That’s the minimum if you want each point in space to be seen by two cameras with parallax, and there’s a nice geometric effect where the result would be quite a lot like the 6-camera stereo triangle. It doesn’t help the stitching because it’s still 6 virtual cameras, but the stitch lines would overlap, for 3 stitchy areas rather than 6.
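These lens counts line up with a simple double-coverage bound: to have every direction seen by two lenses, the lenses together must cover the 360-degree circle twice. This is only a back-of-envelope lower bound that ignores the extra overlap needed for blending and for a usable stereo baseline, which is why the numbers above run higher:

```python
def min_fov_deg(n_lenses):
    """Lower bound on per-lens horizontal FOV so that every direction
    around the circle is seen by at least two lenses: n lenses must
    jointly cover 360 degrees twice over."""
    return 2 * 360 / n_lenses

for n in (5, 4, 3):
    print(n, min_fov_deg(n))
# 5 lenses -> 144, 4 -> 180, 3 -> 240; the padding for blend overlap
# and parallax pushes these toward ~180, ~200, and ~280 in practice.
```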
At this point, a smaller number of cameras can be good or bad, considering the drastic changes in field of view and resolution, and there are possible worlds in which the costs fall in favor of either one. Right now I think 6 is better, and the only reason to use 6 in a triangle rather than 8 in a square is the cost of stitching, not the cost of cameras. I’m not an expert on production costs for various lenses and resolutions and other hardware bits, but I’m happy to provide theoretical setups and let others decide what’s optimal to produce using current hardware. Maybe 280-degree lenses are easy and have just been waiting for an application.
Resolution is important, though, and a few more stitching errors (as in a higher-order stereo polygon with top and bottom cameras) might be worth it if that’s what you need to do to get enough resolution in. Bad resolution is a bottleneck keeping stereo spherical video’s capabilities from really shining right now. Resolution not only highlights the awesomeness of live-captured real actual things over rendered objects, but has a drastic impact on the stereo effect. Stereo vision relies on differences between two images, and sometimes those differences are subtle.
Anyway, maybe the highest priority is not aspiring to get full spherical stereo, or even just 360 stereo. Maybe a hybrid stereo camera, a video sphere that’s mostly mono with a stereo section at the focus, is the right choice for consumer VR capture. For example, three 200-degree cameras could be arranged as a stereo pair in the front, both stitched to a single mono camera in the back, for a full sphere with stereo in one direction.
We’ve tried some experiments with combining stereo and mono, and concluded they can live happily together. Right now most stereo spherical video is mono in the top and bottom sections, and most people don’t notice that it’s not all stereo, or that there’s a transition where it moves from stereo to mono (the occasional person is sensitive to it).
As long as the focus of the camera and viewer are both on something in the stereo section, nothing is lost by having other parts be mono, and perhaps it would even help to subtly focus attention, as with normal flat-video focus techniques.
We cobbled together a 6-camera setup [right] to test out using front-facing right and left lenses that both get stitched to the same side-facing lenses, for stereo front and mono sides. The intent was to imitate the way that, when facing forward, there is no stereo vision at the right and left edges of our vision (the nose gets in the way, though in VR we can see through our nose; someday we’ll make Nose Simulator for a more realistic experience). In the resulting video, there’s no harsh transition. The brain just deals with it.
In the 2nd episode of our talk show we switched from stereo to mono briefly, and even though we announced when it was happening, I still didn’t see it happen. I had to pause the video and double-check. Yet, starting in mono and moving to stereo, suddenly everything changes, because my brain is getting more information, not less.
This needs further experimentation, but I’m pretty sure there’s lots of ways that stereo and mono video can work together to trick the brain. Stereo for still shots to give a sense of space, mono for moving shots that already give you depth information from movement parallax (and moving shots are a stitching nightmare in stereo so they should be mono anyway). You can use stereo for an important focal point, and mono for unimportant areas. If you look at a part of the room once with the stereo portion of your hybrid camera, it may be that your brain is happy with it being in mono from then on.
But if you’re going to have only sections of stereo, what about the good ol’ stereo polygon design, but with only four wide-angle cameras on a 2-sided polygon?
You’d get a 2-gon, which seems like the minimal stereo polygon and thus simplest choice, though unlike stereo polygons with volume it’s not stereo all around. There’d be a giant completely mono section around the stitch line in a circle from floor to ceiling. It’d be spherical video with stereo in just the front and back directions.
And I think that’s awesome and perfect, especially for standard consumer use. I’m definitely ready for a camera that you can point at yourself while you vlog in 3D (and know you won’t get a stitching line in your face), while simultaneously also pointing the camera at a thing or scene or building or mountain that you want to talk about or just have in the background. Right now many videos have a lot of turning the camera back and forth, but if you could film both directions at once? Awesome!
The Ricoh Theta [right] is a consumer mono spherical still camera (it technically has a video function but it’s super low res and low framerate), and it’s a cool proof of concept that back-to-back ultra wide lenses can stitch for mono.
We’ve had people ask if you could use two and get stereo. Out of the box it won’t work, and it can never be full 360 stereo, but in theory the lens arrangement could do front/back partial stereo, and the information captured is sufficient, if the Theta lets you access the raw footage and you stitch it by hand.
(We just bought one so we’ll get back to you with more info after it arrives.)
You’d need a completely different stitching technique than what the Theta uses, because for good stereo, it’s not enough to stitch two spheres separately; you need a calibration that aligns the two spheres as well, plus more fanciness if you want to stitch so that you don’t see the cameras themselves.
But assuming someone makes the stitching software and a version is made with high enough resolution, it’s a viable lens type and arrangement for consumer VR video.
And you know what shape that camera lens setup would fit perfectly on? A rectangle that’s about the size of, oh, a phone.
3. Everything is phones
Modern phones already have front/back cameras and are already used with VR headsets. Front/back stereo spherical pictures and video captured on the same phone you use to view and share them? Seems obvious to me.
Right now, I think a simple four lens front/back stereo spherical camera would align pretty well with consumer pricing, workflow, content style, and the shape and capabilities of smartphones already being used for VR. Phones are tools for both capture and consumption of so much other media, and if they’re going to be used for consumption of VR, they’d better be producers of VR too.
The biggest problem is lenses. I like to think that lens technology will get to the point that four ultra-wide lenses could be seamlessly integrated into a smartphone, but right now, not so much. It turns out the camera lenses are the biggest limitation on smartphone thickness, and also a substantial part of the cost. And that’s for normal narrow field-of-view lenses.
It could be that the four phone cameras have to have narrower field of view lenses, and there’s a clip-on accessory that snaps four wide-angle lenses into place. Clip-on ultra-wide phone lenses already exist and work (though could use some improvements), and I think a phone with four cameras plus a double lens clip designed for it could work pretty well.
Camera placement also might not be able to be ideal; rather than directly back to back, since one camera takes up almost the entire thickness of a phone, they might need to be vertically offset. It’s important that each stereo pair have no vertical offset from each other, but as long as we’re planning on having terrible stitching lines anyway, having the back pair a centimeter lower than the front pair won’t make much difference.
Whatever ends up being the most common consumer VR camera, it will be on a phone, but it’s possible that stereo spherical isn’t gonna be the thing, and so consumer stereo spherical video will be uncommon compared to normal wide-angle forward facing stereo, or regular narrow-angle stereo facing a single direction, or whatever does end up fitting on a phone.
But the efficient and simple 4-lens front/back stereo camera setup just happens to possibly fit on a phone, and that’s too good not to try. Hopefully this is all obvious enough that someone with physical camera hardware skills is already working on it.
In the meantime, we’re excited enough by this design that we just ordered some ultra-wide clip-on lenses that we’ll use to film with multiple Galaxy Note 4 phones arranged as if they were a single phone with double cameras. How will the resulting stereo spherical video look? We don’t know yet, but whatever happens, we’ll post about it and make the result available for download just like all our other stuff.
So stay tuned! Why yes, we do have an rss feed.