The eleVR Player – Tech Post


This is a guest post by Andy Lutomirski, a contributor to eleVR Player.

Back in 2014, eleVR made the first web-based VR video player, the eleVR Player. eleVR didn’t build it as a standalone product or even really as a product at all — they built it because they were already making VR videos and they wanted an easy way to view them. They’ve written plenty of blog posts about their videos, but no one has ever written up how the player itself works.

I helped out a little bit with the video player [1], and I offered to guest-write this blog post.

René Magritte’s ‘The Human Condition’

A VR video player has a pretty straightforward job [2]. It plays a video file and, for every frame, adjusts the projection so that the part of the frame in your field of view looks the way it should, without nasty distortion.

In other words, a VR video player is just like a 3D VR game in which the viewer is trapped dead center in a spherical room with the video playing on the wall. For stereoscopic video, each eye sees a different video.

As a first pass at making a player, we could have done exactly that: used WebGL to render a sphere with a video file as its texture. As background, here’s how the math works. A 3D rendering is just a perspective projection of a scene, which means that each point on a flat screen held in front of you should show whatever a line drawn from your eye through that point would hit in the real scene.

If you do a good enough job, what you see when you hold the screen in front of your face looks just like what you would see if you looked at the real scene instead, exactly like how a flat landscape painting can look just like a real landscape behind the painting.

In conventional rendering, you break the 3D scene into little triangles and, for each piece, figure out where it would appear on a screen held between the viewer and the scene. Then you draw that triangle on the imaginary screen. Once you’ve drawn all the triangles, you have a picture.
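To make that concrete, here is a minimal sketch of a conventional vertex shader in the same GLSL dialect as the shaders below. It is not code from the eleVR Player; the attribute and uniform names are made up for illustration:

attribute mediump vec3 aScenePosition;  // one vertex of one of the scene's triangles
uniform mediump mat4 uModelViewMatrix;  // where the scene sits relative to the viewer
uniform mediump mat4 uProjectionMatrix; // the perspective projection onto the screen

void main(void) {
  // 3D scene coordinates in, position on the imaginary screen out.
  gl_Position = uProjectionMatrix * uModelViewMatrix * vec4(aScenePosition, 1.0);
}

Something like this would run once per vertex for every triangle of the sphere, whether or not that triangle ends up in view, which is exactly the cost that worried us on phones.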


Keep in mind, though, that this was June 2014 [3], we wanted it to work well on phones, and phones were pretty slow. To draw a credible sphere, you need to decompose it into a fairly large number of triangles, and pushing that many triangles stresses the graphics chip. Most of those triangles are outside the field of view, too, so you waste time processing triangles you can’t see. If you try to optimize this by figuring out which triangles are in view, the video player starts looking like a simple 3D game engine, requires a bunch of tedious, nasty code, and runs slowly.

Instead, we turned the process around. In the eleVR player, each frame is just a rectangle (actually two triangles, since GPUs only know about triangles) that covers the whole screen. The fragment shader runs once for each pixel and decides what color gets drawn on that pixel.


The fragment shader looks roughly like this, although I’ve simplified it a bit.

varying mediump vec3 vDirection;
uniform sampler2D uSampler;

Normal programmers write functions that take parameters as input and return something as output. GLSL (the shading language used in OpenGL and WebGL) is silly — a shader is basically just a function, but its inputs and outputs are written like global variables and the function is called “main”. I’m sure there’s a reason for this, but I have no idea what that reason is.

#define PI 3.1415926535897932384626433832795

mediump vec4 directionToColor(mediump vec3 direction) {
 /*
  * Input: a direction. +x = right, +y = up, +z = backward.
  * Output: a color from the video
  *
  * Bug alert: the control flow here may screw up texture filtering.
  */

So far, so good. We’re supposed to figure out what color we’d see if we looked out through our screen in the given direction at the video projected on our imaginary sphere. Because spherical video is conventionally stored in an equirectangular projection, we calculate the spherical coordinates theta and phi:

 mediump float theta = atan(direction.x, -1.0 * direction.z);
 mediump float phi = atan(direction.y, length(direction.xz));

If only it were that easy, though. Doing just this produced flickering dots in a few places in the scene on our test phones. Why? Because those phones were bad at math:

 /*
  * The Nexus 7 and the Moto X (and possibly many others) have
  * a buggy atan2 implementation that screws up when the numerator
  * (the first argument) is too close to zero. (The 1e-4 is carefully
  * chosen: 1e-5 doesn't fix the problem.)
  */
 if (abs(direction.x) < 1e-4 * abs(direction.z))
   theta = 0.5*PI * (1.0 - sign(-1.0 * direction.z));
 if (abs(direction.y) < 1e-4 * length(direction.xz))
   phi = 0.0;

Now we just ask the GPU to find the pixel of the video corresponding to those coordinates:

 return texture2D(uSampler, vec2(mod(theta / (2.0*PI), 1.0), phi / PI + 0.5));
}

And now the silly incantation that turns this into a complete GLSL shader:

void main(void) {
 gl_FragColor = directionToColor(vDirection);
}

Now all we need to do is convince the GPU to call this code once per pixel with the appropriate vDirection value. We do this using this vertex shader:

attribute mediump vec2 aVertexPosition;
uniform mediump mat4 proj_inv;
varying mediump vec3 vDirection;
void main(void) {
  gl_Position = vec4(aVertexPosition, 1.0, 1.0);
  mediump vec4 projective_direction = proj_inv * gl_Position;
  vDirection = projective_direction.xyz / projective_direction.w;
}

This program runs a whopping four times per frame, once for each corner of the screen. aVertexPosition is the 2D position of that corner: (-1, -1), (-1, 1), (1, -1), or (1, 1). The shader then inverts the typical OpenGL/WebGL calculation: it multiplies by the inverse of the projection matrix to produce a vector pointing from the viewer’s eye through the vertex in question. If the viewer were looking in the -z direction while standing up straight so +y is up and +x is to the right, then the bottom-left corner (-1, -1) would produce the output vDirection ∝ (-1, -1, -1) [4]. If the viewer is looking in a different direction, vDirection will be the 3D coordinates of some point in the direction of the corner of the screen.

This is almost exactly the opposite of a conventional vertex shader. Normally a vertex shader takes the 3D coordinates of a vertex as input and spits out the location on the screen as output.

And that’s basically it. The GPU knows how to fill in our big rectangle by calling the fragment shader for each pixel.

The actual eleVR Player is more complicated, of course. It has to load video files, figure out which way you’re facing, draw a user interface, give keyboard controls for testing on a desktop or laptop without a headset, etc.

Oh, one last thing. As plenty of people have mentioned in the past, the equirectangular projection is a terrible way to store spherical video. Look at that fragment shader’s math: we’re computing two arctangents for each pixel. Transcendental functions are slow. There are plenty of better solutions. My personal favorite is a cube map: rather than trying to flatten an imaginary sphere into a 2D video file, pretend you’re looking out from the center of a large cube and store the video for each of the six faces. This avoids any nasty transcendental math and, even better, WebGL and OpenGL have built-in support for cube maps. Also, cube maps cover a similar amount of viewing angle on each pixel, so you’re not wasting space storing absurd amounts of detail near the top and bottom of the scene like you do with an equirectangular projection.
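For comparison, here is a rough sketch of what the heart of a cube-map fragment shader could look like. This is not code from the eleVR Player (the player’s videos were equirectangular), and uCubeSampler is a made-up name:

varying mediump vec3 vDirection;
uniform samplerCube uCubeSampler;  // the six faces of the current video frame

void main(void) {
  // No arctangents: the GPU picks the right face of the cube, and the
  // right pixel on that face, straight from the direction vector.
  gl_FragColor = textureCube(uCubeSampler, vDirection);
}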

-Andy

 

[1] I wrote the shaders, and I did a bunch of testing on my crappy phone. Andrea Hawksley did pretty much everything else. Her phone was much less crappy than mine, but, being an iPhone in 2014, it was a lost cause.

[2] Making the video is a whole complicated subject involving stitching, panoramic twist, interesting geometry, and trying to make the inevitable artifacts as unobtrusive as possible. eleVR has some great blog posts about the subject. Playing the video back is much simpler.

[3] My test phone was an original Moto X. That was more like low-to-mid-range 2013.

[4] Depending on how the projection matrix was set up, it could be anything proportional to that. After all, it’s just a vector pointing somewhere, so the length doesn’t matter. I didn’t worry about normalization at all while working on the shaders.