Can we walk inside the movie? Part 1. Stereopanoramic depth estimation
Hi everyone! My name is Alex and I’m a computer-vision researcher at DeoVR, a video streaming company for virtual reality. One of our products is a convenient and free video player for virtual reality headsets.
One of the dreams of the VR community is a fully immersive experience: imagine being able to walk inside a video shot by Steven Spielberg, or to peek around the corner in a detective story. Six degrees of freedom inside a video requires information about the scene geometry, which can be obtained from lidars, from additional cameras, or by applying smart algorithms to footage from conventional VR shooting equipment.
What is VR video?
There are 2 main differences between video in VR and classic “flat” cinema:
1. In VR cinema, the viewing angle is much larger than in classic cinema and can reach 360 degrees, which lets you turn your head inside the video.
2. Stereoscopic vision: the images for the right and left eyes are different, as they were captured with different lenses.
Interestingly, it is technically very difficult to deliver both of these properties at high quality at the same time during shooting. VR films are usually produced with two wide-angle lenses placed about 6.5 cm apart, which corresponds to the average interpupillary distance. Several artifacts can spoil the enjoyment of watching such a video.
When you look straight ahead, the images correspond to the positions of the pupils, but when the head is turned there is a mismatch between the positions of the lenses and the positions of the pupils, which leads to the loss of the stereo effect and to visible distortion. In the extreme case of a 360-degree field of view, a 180-degree turn feeds the image from the left camera to the right eye, resulting in reverse parallax. In addition, some objects are closer to one lens than to the other, which can change the apparent angular size of objects when the head is turned. The observed distortions depend on the three Euler angles that describe the orientation of the head: roll, pitch, and yaw.
Some of the distortions cannot be corrected exactly. For example, the missing parallax when the head is turned 90 degrees cannot be restored: the necessary information is simply not present in the captured video. Other distortions lend themselves to partial or complete correction. To make such corrections, however, we usually need information about the three-dimensional location of objects in the video, in other words, the distance from the lens to every point in the frame. This information can also be used to reconstruct the three-dimensional scene and let the viewer move inside the film (we are also working in this extremely interesting area, but that is a topic for a separate article). The per-pixel distance information is called a depth map. How can we get it from a video? Interestingly, most of the work on estimating distance from images is done by companies involved in self-driving vehicles. There are several approaches to this problem.
1. Monocular depth estimation
In principle, one image is not enough to determine depth rigorously, but a person can often tell from a photo what is far and what is near. Cues such as the relative sizes of objects, comparison with their typical sizes, differences in sharpness at different distances, and so on, are used implicitly. A person can also be deceived: a classic example of playing with perspective is shown in the figure.
Nevertheless, the problem can be solved with a neural network approach, for example, as done here. I will not dwell on the architectures used; everything is there, from the simplest convolutional networks to transformers. The downsides are the low (compared to binocular methods) and poorly predictable accuracy of such methods, and the strong dependence of accuracy on the domain (most pre-trained models are trained on road imagery, since there are extensive datasets for self-driving vehicles). A large field of view is not an obstacle for these algorithms: any projection can be fed to the network and a reasonable result obtained, so monocular depth estimation can be used on panoramas too.
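For illustration, here is a minimal sketch of running a publicly available pre-trained monocular model (MiDaS via torch.hub); the model choice and the file name are placeholders, not necessarily what is used in the work linked above.

```python
# Monocular depth from a single frame with a pre-trained MiDaS model (sketch).
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")            # small pre-trained model
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")  # matching input transforms
midas.eval()

img = cv2.cvtColor(cv2.imread("frame_left.png"), cv2.COLOR_BGR2RGB)
batch = midas_transforms.small_transform(img)                       # resize + normalize

with torch.no_grad():
    prediction = midas(batch)                                       # relative inverse depth
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()                                                      # back to frame resolution
```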
2. Binocular depth estimation
It is important to understand that binocular depth reconstruction is closely related to the task of determining the shift between the two images (stereo disparity). Very roughly, in most cases distance ~ 1/disparity.
The classic method for determining depth is the block-matching algorithm. As the name implies, the image for one eye is divided into blocks, and for each block we search for the horizontal shift that best matches the image for the second eye. The similarity metric is most often the sum of absolute differences between pixel values in the block and in the second image. The tunable parameters of this simple method are the block size and the maximum shift up to which the search is performed. Such a method shows the shift at object boundaries quite well. Its disadvantages are low accuracy on texture-less regions and failure on objects close to the camera. The latter is because the difference between the two views can no longer be described by a pure shift: the object is seen from a different angle, not just displaced.
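As a concrete example, here is a minimal block-matching sketch with OpenCV's StereoBM; the block size, disparity range, focal length, and baseline are placeholder values, and the rough depth conversion assumes a rectified pinhole pair.

```python
# Classic block matching plus a rough disparity-to-depth conversion (sketch).
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# numDisparities is the maximum search shift (a multiple of 16),
# blockSize is the matching window: the two tunable parameters mentioned above
matcher = cv2.StereoBM_create(numDisparities=128, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed point -> pixels

# very rough depth for a rectified pinhole pair: Z ~ f * B / disparity
focal_px, baseline_m = 1000.0, 0.065
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_px * baseline_m / disparity[valid]
```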
However, the development of neural networks has made it possible to apply the matching idea with greater efficiency. For example, if a neural network extracts features from the images, matching can be applied to the features. This partially fixes the problem of close objects: high-level features (for example, the tip of a nose) are extracted regardless of angular distortion. A further development of neural matching is the so-called cost volume.
The neural network extracts features from both images; then, for each shift, a similarity value between the corresponding features is computed (the similarity metric differs from work to work; for simplicity, take the difference). If there are two feature maps of size CxHxW, then after computing the differences for shifts from 0 to D we get a matrix of size DxCxHxW, called the cost volume. Applying several 3D convolutions to the resulting matrix can lead to a fairly accurate result.
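A minimal sketch of building such a volume from two feature maps, using the absolute difference as the similarity metric (real architectures use their own metrics and work on batches):

```python
# Difference-based cost volume of shape (D, C, H, W) from two (C, H, W) feature maps.
import torch

def cost_volume(feat_left: torch.Tensor, feat_right: torch.Tensor, max_disp: int) -> torch.Tensor:
    C, H, W = feat_left.shape
    volume = feat_left.new_zeros(max_disp, C, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_left - feat_right).abs()
        else:
            # compare left features at column x with right features at column x - d
            volume[d, :, :, d:] = (feat_left[:, :, d:] - feat_right[:, :, :-d]).abs()
    return volume  # further processed with 3D convolutions over (D, H, W)
```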
A large number of works aim at improving the architecture around the cost volume and at reducing the number of operations (3D convolutions are quite expensive). Unlike plain or neural block matching, this approach builds a cost volume; a great review of how it works can be found here. Most SOTA models for binocular depth estimation use a cost volume, and the current results and architectures can be seen here.
Binocular panoramas for depth estimation
The methods above apply to images with a small field of view, and applying them to wide-angle video requires modification. The main problem is that stereo depth-recovery methods search for the horizontal shift of one picture relative to the other, while in wide-angle shots the differences between the two views are not just shifts. To begin with, let's figure out what a frame in a VR movie is and how it is produced.
Most modern VR cameras are equipped with wide-angle fisheye lenses, which project the world onto a plane. While any projection of the visible world onto a plane can be used for wide-angle video streaming, the most commonly used are the fisheye and equirectangular projections.
In the simplest case, a point with x,y, and z coordinates relative to the lens can be converted to a fisheye projection as follows:
1. Convert to a spherical projection (theta, phi)
2. Convert to the fisheye projection
Here X and Y are normalized to the range [-1, 1].
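Under a simple equidistant-fisheye assumption, the two steps can be sketched as follows (the field-of-view value is an example):

```python
# Map a 3D point to normalized fisheye coordinates (sketch, equidistant lens model).
import numpy as np

def xyz_to_fisheye(x, y, z, fov_deg=200.0):
    """Return fisheye image coordinates X, Y normalized to [-1, 1];
    z is the optical axis of the lens."""
    theta = np.arctan2(np.sqrt(x * x + y * y), z)  # angle from the optical axis
    phi = np.arctan2(y, x)                         # azimuth around the axis
    r = theta / np.radians(fov_deg / 2.0)          # equidistant model: r proportional to theta
    return r * np.cos(phi), r * np.sin(phi)
```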
A fisheye projection is the image projected onto the camera sensor directly by a fisheye-type lens. For modern binocular fisheye cameras, the field of view can reach 220 degrees. Of course, at such angles each lens sees the other, but these regions are usually removed during video post-processing. The video then consists of two fisheye images placed side by side (side-by-side format) or one above the other. The video player must unwrap this projection into a 3D video.
Historically, VR storage and streaming have more often used a different projection, the equirectangular projection. Most people are familiar with it from the world map: the round surface of the Earth is projected onto a rectangular map in such a way that all parallels are horizontal and all meridians are vertical. In other words, with an equirectangular projection, a point with coordinates x, y, z maps to the point
1. Convert to a spherical projection (theta, phi)
2. Convert to the equirectangular projection
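A sketch of this mapping under one common axis convention (conventions differ between tools, so treat it as illustrative):

```python
# Map a 3D point to normalized equirectangular coordinates (sketch).
import numpy as np

def xyz_to_equirect(x, y, z):
    """Return equirectangular coordinates normalized to [-1, 1]:
    longitude along the width, latitude along the height."""
    lon = np.arctan2(x, z)                                # theta: angle around the vertical axis
    lat = np.arcsin(y / np.sqrt(x * x + y * y + z * z))   # phi: elevation above the equator
    return lon / np.pi, 2.0 * lat / np.pi
```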
How to get a depth map from two stereo projections?
The first approach is straightforward: just use a monocular depth-estimation model on either projection. There are two drawbacks. Since the model is not trained on such projections, systematic errors may occur (practice shows that they can almost always be neglected). The second drawback is much more significant: the resulting depth map is unstable in time, and if we convert a video this way, we notice flickering of the depth map.
The second method uses both images. Standard methods come down to finding the amount of horizontal shift (either with feature maps or with block matching). For spherical projections of wide-angle images, however, the difference between the images cannot be reduced to a horizontal shift of some parts relative to others, because of the nature of the projection.
Suppose the images are projected onto spheres of radius 1. Where does a point with coordinates (x, y, z) land in the equirectangular projection for the left eye, and where does the same point land in the projection for the right eye?
Calculate the XYZ coordinates for the left image
For the right eye:
Even if we accurately determine the correspondence between points in the right and left images (θR, φR, θL, φL), determining the depth R from the two images still takes a fair amount of work.
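To make the geometry explicit, here is a sketch of the forward direction: assume the point seen by the left eye at angles (θL, φL) lies at radius R, reconstruct its position, and recompute the angles for the right eye. The axis convention matches the equirectangular sketch above, and the baseline is an example value.

```python
# Where does a point seen by the left eye at (theta_L, phi_L) appear for the right eye,
# if we guess that it lies at radius R? (sketch)
import numpy as np

def left_to_right_angles(theta_L, phi_L, R, baseline=0.065):
    # 3D point in the left-camera frame (theta = longitude, phi = latitude)
    x = R * np.cos(phi_L) * np.sin(theta_L)
    y = R * np.sin(phi_L)
    z = R * np.cos(phi_L) * np.cos(theta_L)

    # the right lens is shifted along the x axis by the baseline
    xr = x - baseline

    theta_R = np.arctan2(xr, z)
    phi_R = np.arcsin(y / np.sqrt(xr * xr + y * y + z * z))
    return theta_R, phi_R
```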
If we return to the problem of finding the optimal R (or R'), the method may be as follows: for each candidate R' we re-project the right image onto a sphere of this radius and compute how the image would look from the point of view of the left camera. In this way we get a kind of curvilinear analogue of a cost volume, which can be processed further. In PyTorch, the reprojection might look like the sketch below.
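This is not the original implementation, but a minimal sketch of the idea: warp the right equirectangular image to the left viewpoint under the assumption that the whole scene lies at radius R, and stack the photometric errors over candidate radii. The layout (longitude along the width, latitude along the height) and the baseline are assumptions.

```python
# Curvilinear analogue of a cost volume: per-radius reprojection with grid_sample (sketch).
import math
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, R: float, baseline: float = 0.065) -> torch.Tensor:
    """right: (1, C, H, W) equirectangular image; returns the image as the left lens
    would see it if the whole scene were at radius R."""
    _, _, H, W = right.shape
    # angles of every pixel of the target (left) image
    theta = torch.linspace(-math.pi, math.pi, W).view(1, W).expand(H, W)
    phi = torch.linspace(-math.pi / 2, math.pi / 2, H).view(H, 1).expand(H, W)

    # 3D point at radius R in the left-camera frame
    x = R * torch.cos(phi) * torch.sin(theta)
    y = R * torch.sin(phi)
    z = R * torch.cos(phi) * torch.cos(theta)

    # the same point seen from the right lens (shifted along the x axis)
    xr = x - baseline
    theta_r = torch.atan2(xr, z)
    phi_r = torch.asin(y / torch.sqrt(xr * xr + y * y + z * z))

    # normalized sampling grid for grid_sample: (1, H, W, 2), values in [-1, 1]
    grid = torch.stack((theta_r / math.pi, 2 * phi_r / math.pi), dim=-1).unsqueeze(0)
    return F.grid_sample(right, grid, align_corners=True)

def radius_cost_volume(left, right, radii):
    # photometric cost for every candidate radius: (1, D, H, W)
    return torch.stack([(left - warp_right_to_left(right, R)).abs().mean(1) for R in radii], dim=1)
```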
With this definition of the cost volume, you can try using pre-trained neural networks to determine depth maps. However, a much more elegant method is to exploit the cylindrical symmetry of the dual-lens system.
We can work in an equirectangular projection in which the poles are not above and below, but to the right and left. The symmetry of the problem then becomes cylindrical, which greatly simplifies the equations. It can be shown that, in such a projection, a change in the distance to an object leads to a horizontal shift:
Note that the φ value is the same for the right and left projections; the only difference is the θ value. In other words, we can look for a simple horizontal shift of the right projection relative to the left (of course, there are also horizontal distortions associated with the difference between R and R'). Careful handling of the formulas above leads to the following:
This means it is enough to convert the input images into the required projection, after which the entire arsenal of pre-trained neural networks can be used to search for horizontal shifts of image regions.
Remark: to be completely accurate, after reprojection the images should be stretched horizontally in a specific way to get rid of the cos θ factor. Experiments show that most neural methods work well without this additional stretching, and accurate depth can be obtained by multiplying the output of the depth estimator by cos θ.
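A hedged sketch of this correction, assuming the horizontal axis of the flipped projection spans π radians over the image width with θ = 0 at the image centre; the baseline is an example value, and the formula follows from the cylindrical geometry described above.

```python
# Convert a horizontal pixel disparity on the flipped equirect pair into depth,
# applying the cos(theta) correction from the remark above (sketch).
import numpy as np

def disparity_to_depth(disparity_px: np.ndarray, baseline: float = 0.065) -> np.ndarray:
    H, W = disparity_px.shape
    delta_theta = np.maximum(disparity_px, 1e-3) * (np.pi / W)  # pixel shift -> angular disparity
    theta = (np.arange(W) / W - 0.5) * np.pi                    # horizontal angle, 0 at the centre
    raw_depth = baseline / delta_theta                          # uncorrected estimate
    return raw_depth * np.cos(theta)[None, :]                   # correction factor cos(theta)
```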
To keep it simple: in most projections, vertical alignment can be violated, especially for close objects. The flipped equirectangular projection does not have this problem (see the example with the micro cord in the image). Reprojection can cause problems near the left and right edges of such a projection, but those regions are not usable for stereo matching anyway.
Reprojection in practice can be done in several ways:
1. Write a sampling grid in PyTorch. This speeds up the calculations and makes it possible to include this first step in the ML pipeline. For this kind of sampling, the grid_sample method from the torch library is the key one. The only implementation trick is to calculate the grids for resampling one projection into another correctly. The math is basically given above: the XY grid of the target projection is converted to XYZ points in real space, and the XYZ grid is then converted to the corresponding points of the initial projection (see the sketch after this list).
2. Transfer the mathematics of reprojection to the V360 filter of the ffmpeg library
Full code can be found at https://github.com/Alexankharin/fisheye_disparity
3. For use on end devices (for on-the-fly image correction), you can write a shader that performs the same coordinate remapping for fast reprojection, for example in OpenGL.
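As referenced in item 1, here is a minimal sketch of the grid computation for PyTorch's grid_sample that maps a single fisheye image into the flipped equirectangular projection; the fisheye FOV and the axis conventions are assumptions, and the repository linked above contains the full implementation.

```python
# Precompute a sampling grid: flipped equirect target -> source fisheye (sketch).
import math
import torch
import torch.nn.functional as F

def fisheye_to_flipped_equirect_grid(H: int, W: int, fov_deg: float = 200.0) -> torch.Tensor:
    """Grid of shape (1, H, W, 2) mapping target pixels to normalized fisheye coordinates."""
    # flipped equirect: poles on the left/right, so theta runs along the width
    # and phi (azimuth around the baseline axis) runs along the height
    theta = torch.linspace(-math.pi / 2, math.pi / 2, W).view(1, W).expand(H, W)
    phi = torch.linspace(-math.pi, math.pi, H).view(H, 1).expand(H, W)

    # unit ray in camera coordinates (x = baseline axis, z = optical axis)
    x = torch.sin(theta)
    y = torch.cos(theta) * torch.sin(phi)
    z = torch.cos(theta) * torch.cos(phi)

    # project the ray into the source fisheye image (equidistant lens model)
    ang = torch.atan2(torch.sqrt(x * x + y * y), z)  # angle from the optical axis
    az = torch.atan2(y, x)                           # azimuth in the image plane
    r = ang / math.radians(fov_deg / 2.0)            # normalized radius
    return torch.stack((r * torch.cos(az), r * torch.sin(az)), dim=-1).unsqueeze(0)

# usage: fisheye_img is (1, C, Hs, Ws); the result is the flipped equirect view
# grid = fisheye_to_flipped_equirect_grid(1024, 1024)
# flipped = F.grid_sample(fisheye_img, grid, align_corners=True)
```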
After that, you can use any pre-trained neural network, for example this one with pre-trained weights, or create your own.
As a result, we get an equirectangular depth-map projection, which can be used both to reconstruct a 3D scene and for other purposes, for example, to compensate for the distortions that occur when turning the head while watching VR movies; this usually requires knowing the distance to the object.
As intended, the depth estimation is much better than without the reprojection. Nevertheless, the side regions still have incorrectly estimated depth.
A similar trick with projections can be used for calculations directly on the playback device (such as a VR headset), but the entire pipeline needs to be revisited and optimized for the limited hardware capacity. For example, instead of the relatively slow ffmpeg filter or poorly portable torch code, you should use a shader for the reprojection.
In addition, you should pay attention to the speed and efficiency of the neural network on the device and apply appropriate optimizations. For example, a few MobileNet-style blocks can be used for feature extraction, and instead of a cost volume of size DxCxHxW you can compute correlations between feature vectors, obtaining a matrix of size DxHxW, which eliminates the need for computationally expensive 3D convolutions, and so on. At this step, optimization is needed both at the level of the solution architecture and at the level of efficient execution on the hardware.
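A sketch of the lighter, correlation-based volume mentioned here: the dot product over channels collapses C, giving a DxHxW matrix that can be processed without 3D convolutions.

```python
# Correlation cost volume of shape (D, H, W) from two (C, H, W) feature maps (sketch).
import torch

def correlation_cost_volume(feat_left: torch.Tensor, feat_right: torch.Tensor, max_disp: int) -> torch.Tensor:
    """feat_*: feature maps, e.g. from a few MobileNet-style blocks."""
    C, H, W = feat_left.shape
    volume = feat_left.new_zeros(max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_left * feat_right).sum(0) / C
        else:
            volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :-d]).sum(0) / C
    return volume  # can be refined with cheap 2D convolutions or a soft-argmin over D
```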
Why is all this necessary?
Once we have depth data, we can adjust the images to remove the artifacts. The second use is to back-project points into space to reconstruct a 3D scene and achieve an immersive video effect. In any case, depth maps are a vital tool for scene reconstruction (an example of such a reconstruction is shown below).
Nevertheless, scene reconstruction requires much more than depth maps, including video inpainting and the creation of a proper 3D video format suitable for streaming and for the corresponding VR video players. The video inpainting and streaming approaches will be discussed in parts 2 and 3.