Real-Time Video Super Resolution for Live Video Enhancement: A Proof of Concept Implementation
Recently, I have been working on a side project on Video Super Resolution (VSR) using a GAN-based machine learning model, with the goal of exploring whether VSR can be done at real-time speed. Specifically, I built a Proof of Concept (PoC) VSR upscaling system that breaks an input video stream into a sequence of images, runs a pre-built PyTorch image super resolution model on those images to upscale them by a factor of 4 and enhance their visual quality, and then uses FFmpeg to combine and re-encode the upscaled images into a single high resolution output video. I also made a series of optimizations to make the whole upscaling (enhancement) process as fast as possible on my NVIDIA GeForce RTX 4060 GPU with 8GB of built-in memory. My test results show that when the input resolution was 108x192 (vertical) and the output resolution was 432x768 (i.e., x4 upscaling), real-time VSR was possible and the overall upscaling speed was 1.1x the input video frame rate. When the input resolution was 180x320 and the output resolution was 720x1280, real-time upscaling was not achieved and the upscaling speed of my PoC system was only 0.44x the input frame rate. For both inputs, my VSR upscaling implementation achieved significantly better visual quality than traditional LANCZOS upscaling, and at the same time the output video bitrate was much lower than with LANCZOS.
First of all, let me explain why real-time VSR enhancement matters. For one thing, traditional signal-processing-based video upscaling algorithms such as BICUBIC and LANCZOS do not perform very well when the live input video uses a very low resolution, such as 320x180. Machine learning based super resolution upscaling can significantly outperform these traditional techniques, enhancing visual quality while saving bitrate. However, the biggest challenge of running VSR-based upscaling at real-time speed is how fast we can run ML inference, since such operations are highly complex and time-consuming even on a GPU. The speed at which ML-inferred high resolution frames are generated must be at least as high as the speed at which live video frames are ingested and streamed out; equivalently, the average per-frame processing time must not exceed the frame interval (1/fps) of the live input.
A second question is why we need to support such low live input resolutions as 320x180 in the first place. The answer is that, in many live streaming scenarios, the streamer may be outdoors with a bad cellular connection, such as when I broadcast my daughter's weekend soccer games. The soccer field is located in a fairly rural area, and the poor network connection simply cannot support a reasonably good live contribution quality. But I don't want my viewers to watch poor quality video, so what I can do is upscale the low quality input to a higher resolution with much better quality using super resolution. Adaptive BitRate (ABR) transcoding can then be applied to the upscaled contribution stream before delivering the video to the viewers.
Implementation overview
There are many existing machine learning models for video super resolution, such as EDVR, BasicVSR, ESRGAN, COMISR, the Stable Diffusion x4 upscaler, etc. A quite comprehensive list of such models can be found in [1]. I evaluated a set of these models in terms of upscaled video quality, then picked one GAN-based model for my implementation which outputs noticeably better visual quality than most of the others. I also made a series of optimizations to the PoC implementation, such as serializing the PyTorch model into a TensorRT engine. The PoC was written in Python and shell script, and libraries such as NumPy and OpenCV were used to process the image tensors.
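For readers curious about the TensorRT step, below is a minimal sketch of one common way to do it: export the PyTorch network to ONNX, then build a serialized TensorRT engine from the ONNX graph. The model class SRModel, the checkpoint file name and the dummy input shape are placeholders for illustration, not the exact model used in my PoC.

import torch

# Hypothetical GAN-based x4 super resolution network; stands in for the actual model I used.
model = SRModel()
model.load_state_dict(torch.load("sr_x4.pth", map_location="cuda"))
model.eval().cuda()

# Dummy low-resolution input: batch size 1 plus the (3, 180, 320) input tensor shape from the setup below.
dummy = torch.randn(1, 3, 180, 320, device="cuda")

# Export to ONNX with a dynamic batch dimension so batch sizes such as 2 or 6 can be used later.
torch.onnx.export(
    model, dummy, "sr_x4.onnx",
    input_names=["lr"], output_names=["sr"],
    dynamic_axes={"lr": {0: "batch"}, "sr": {0: "batch"}},
    opset_version=17,
)

# The ONNX graph can then be compiled into a serialized TensorRT engine, e.g. with:
#   trtexec --onnx=sr_x4.onnx --saveEngine=sr_x4.engine --fp16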
Experiment setup
Test video:
- Resolution: 108 x 192 (vertical) and 180 x 320 (vertical). Vertical video inputs were used because many amateur outdoor streamers simply use their mobile phones for streaming. Using vertical videos is not a speed optimization.
- Bitrate: 67kbps and 149kbps (for 108 x 192 and 180 x 320 input, respectively). These are very friendly bitrate values for live broadcast contribution even with poor cellular connections.
- Video duration (sec)/Frame rate (fps)/Frame count: 39.66/15/598
- Color format: RGB
- Input tensor shape (excluding inference batch size): (3, 108, 192) or (3, 180, 320). Given the RGB color format, the number of color channels is 3; the other two dimensions are the video width and height.
- Output tensor shape: (3, 432, 768) or (3, 720, 1280) as the input resolutions are upscaled by a factor of 4.
Hardware
- CPU: AMD Ryzen 7 7700 8-Core Processor 3.80 GHz with 16GB RAM
- GPU: NVIDIA GeForce RTX 4060 with 8GB memory
The GPU was used for VSR inference only; the CPU ran all other steps described below.
Upscaling steps
The PoC uses the following steps to upscale the input video (a Python sketch of the per-frame processing is shown after the list):
- Breaking the source video into a sequence of 598 images,
ffmpeg -i $1 -vf fps=$2 images/image_%4d.png
- Loading the 598 images and pre-processing them before starting VSR inference (e.g., normalizing and transposing the input tensors before feeding them to the neural network),
- Copying the input tensors (the processed 108x192 or 180x320 images) to the GPU and running x4 upscaling inference on them,
- Performing post-processing on the output tensors (i.e., the upscaled 432x768 or 720x1280 images), which is basically the reverse of the pre-processing step,
- Performing minimal compression on the output tensors and saving them as BMP images,
- Using FFmpeg and libx264 to combine and re-encode the 598 output images into a single output video.
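The following is a minimal sketch of the per-frame processing just described, written against a plain PyTorch model for clarity; my PoC actually executes the TensorRT engine, and the file names, normalization and batch handling shown here are simplified placeholders.

import glob
import cv2
import numpy as np
import torch

BATCH = 2  # 6 for the 108x192 input, 2 for the 180x320 input on my 8GB GPU
paths = sorted(glob.glob("images/image_*.png"))

with torch.no_grad():
    for i in range(0, len(paths), BATCH):
        # Pre-processing: load, convert BGR->RGB, normalize to [0, 1], transpose HWC->CHW.
        frames = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in paths[i:i + BATCH]]
        batch = np.stack(frames).astype(np.float32) / 255.0
        batch = torch.from_numpy(batch).permute(0, 3, 1, 2).cuda()

        # x4 super resolution inference on the GPU ("model" is the loaded SR network).
        sr = model(batch)

        # Post-processing: reverse of the pre-processing, then save each frame as a BMP image.
        sr = (sr.clamp(0, 1) * 255.0).byte().permute(0, 2, 3, 1).cpu().numpy()
        for j, img in enumerate(sr):
            cv2.imwrite(f"out/image_{i + j + 1:04d}.bmp", cv2.cvtColor(img, cv2.COLOR_RGB2BGR))

# Finally, FFmpeg/libx264 re-encodes the BMP sequence into one output video, e.g.:
#   ffmpeg -framerate 15 -i out/image_%04d.bmp -c:v libx264 -pix_fmt yuv420p output_vsr.mp4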
For the 108x192 input and the 180x320 input, inference batch sizes of 6 and 2 were used, respectively. The GAN-based super resolution model is fairly large, with tens of millions of parameters. Running such a large model consumes a lot of memory during inference, which leaves only a limited amount of memory for holding the input and output tensors. Batch sizes of 6 and 2 are the highest values allowed by my GPU, which has only 8GB of built-in memory. A larger batch size could allow further acceleration (hopefully, ~10–20%), but is not possible on my hardware.
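As an illustration of how the largest feasible batch size can be found, the sketch below simply probes candidate batch sizes from large to small until a dummy inference fits in GPU memory. It assumes a recent PyTorch version that exposes torch.cuda.OutOfMemoryError and reuses the hypothetical model and input shape from the earlier sketch.

import torch

def max_feasible_batch(model, frame_shape=(3, 180, 320), candidates=(8, 6, 4, 2, 1)):
    # Try candidate batch sizes from largest to smallest; return the first one that fits in memory.
    for b in candidates:
        try:
            with torch.no_grad():
                dummy = torch.randn(b, *frame_shape, device="cuda")
                model(dummy)
            return b
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
    return 0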
For LANCZOS-based upscaling, an upscaling factor of 4 was also used, with the following command:
ffmpeg -i go_fishing_180x320.mp4 -vf scale=720:1280:flags=lanczos -c:v libx264 -crf 15 -preset slower -y output_lanczos.mp4
Results
Fig. 1–4 compare the source, LANCZOS-upscaled and VSR-upscaled images for both the 108x192 and 180x320 inputs in two different scenes. As can be seen, the visual quality of the super resolution upscaled images is much better than that of the sources and the LANCZOS-upscaled ones (please view the difference on a larger screen; it may not be as obvious on a mobile phone). Next, let's look at the upscaling speed.
Tables 1 and 2 show the upscaling performance of LANCZOS vs. VSR. When the input resolution is 108x192, VSR was able to achieve real-time upscaling because its overall processing speed (pre- and post-processing, VSR inference, output image compression and saving) was 16.55 fps, higher than the 15 fps input frame rate. This represents a relative speed of 16.55/15 ≈ 1.1x.
Unfortunately, when the input resolution is 180x320, VSR was unable to achieve real-time upscaling because its overall processing speed was only 6.62 fps, much lower than the 15 fps input frame rate. This represents a relative speed of 6.62/15 ≈ 0.44x. However, this does not mean real-time VSR upscaling is impossible at this resolution. I think further acceleration can be achieved by using a higher end GPU with more computing power and larger built-in memory, a more advanced inference framework, etc. Alternatively, I could install a second GPU and run two inference tasks in parallel. My GeForce RTX 4060 is more of a gaming graphics card, not the best model for AI inference.
Finally, let us look at the breakdown of the average total processing time for a single video frame (Table 3). As can be seen, for the 108x192 input, the total processing time per frame is 46.9 ms, including 32.2 ms for VSR inference alone, 4.17 ms for pre-processing the input tensors and post-processing the output tensors, 10.5 ms for compressing and saving the output images, plus some insignificant video re-encoding time.
For the 180x320 input, the total processing time per frame is 131.3 ms, including 89.5 ms for VSR inference alone, 11.64 ms for pre-processing the input tensors and post-processing the output tensors, 30 ms for compressing and saving the output images, plus some insignificant video re-encoding time.
For both input resolutions, inference alone took about 68% of the total processing time; the remaining 32% was spent on pre-processing, post-processing, output compression and saving, and video re-encoding combined. I believe this 32% of non-inference time can be further cut down to accelerate the whole process.
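For reference, per-step timings like the ones above can be collected with simple wall-clock timers, as in the sketch below; torch.cuda.synchronize() is needed so that asynchronous GPU inference is fully included in the measured interval, and the step functions named in the usage comment are placeholders for the actual PoC code.

import time
import torch

def timed(step_fn, *args):
    # Measure the wall-clock time of one step, making sure queued GPU work has finished.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = step_fn(*args)
    torch.cuda.synchronize()
    return out, (time.perf_counter() - start) * 1000.0  # milliseconds

# Example usage with placeholder step functions:
#   batch_gpu, t_pre  = timed(preprocess, frames)
#   sr_gpu,    t_inf  = timed(model, batch_gpu)
#   _,         t_post = timed(postprocess_and_save, sr_gpu)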
If you have any questions about this blog, please reach out to me at maxutility2011@gmail.com.
References
[1] Nabajeet Barman, Yuriy Reznik, Maria Martini, On the Performance of Video Super Resolution Algorithms for HTTP-based Adaptive Streaming Applications, https://eprints.kingston.ac.uk/id/eprint/54894/