Date of Award
6-2024
Document Type
Dissertation
Publisher
Santa Clara : Santa Clara University, 2024
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science and Engineering
First Advisor
Ying Liu
Second Advisor
Nam Ling
Abstract
Video data now permeates people's daily routines, in both professional and leisure activities, and these activities place significant pressure on Internet bandwidth. It is therefore essential to develop efficient video coding techniques that compress video data at low bit rates while still decoding visually appealing frames and saving transmission bandwidth.
In traditional video codecs, such as Advanced Video Coding (AVC), High Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC), techniques rooted in signal processing and information theory have long been dominant. Recent years have witnessed a surge in deep learning-based approaches to image and video compression. Among these methods, the Generative Adversarial Network (GAN) has shown remarkable efficiency in compressing images at low bit rates while preserving high perceptual quality. This characteristic makes GANs an excellent choice for applications requiring low bit rates, such as video conferencing. The decoded images produced by GAN-based methods typically exhibit finer details than those of non-GAN-based methods. Furthermore, through adversarial training, GAN-based approaches yield decoded frames that align better with the characteristics of the Human Visual System (HVS) than methods focused solely on minimizing pixel-wise differences between raw and decompressed frames. GAN-based techniques are now being successfully deployed in video compression to enhance perceptual quality at low bit rates. This research devised three GAN-based frameworks for video compression:
- Residue frames carry the redundant energy of a video, yet very little prior research has applied GANs directly to residue-frame coding. To significantly reduce the signal energy of videos, our first work developed a novel GAN-based approach to inter-frame, residue-coding-based video compression: it uses GANs to compress the I-frame and the residue of each P-frame (a minimal sketch of this pipeline appears after this list). Empirical evaluations on standard test sequences underscore the remarkable efficacy of our algorithm: compared to the GAN-Intra [6], CAE [84], and End-to-End [10] methods, it reduced the average bpp by 30.58%, 47.63%, and 34.98% while increasing the average PSNR by 1.29 dB, 1.92 dB, and 1.44 dB, respectively. Crucially, the adversarial training we adopted ensures that the decoded target frames not only retain high perceptual quality but are also consistent with the HVS.
- Most existing video coding work adopts CNNs or RNNs in its framework; though effective at capturing local features, these architectures do not thoroughly model long-distance dependencies or extract non-local features. To explore global correlations among sequences, and inspired by the ability of GANs to compress frames at extremely low bit rates, our second work introduced a novel generative video compression (GVC) model with a transformer-based discriminator (TD) for P-frames, operating in a residue-coding-based paradigm. The model captures non-local correlations within video frames, thereby strengthening adversarial training. Moreover, our GVC model is trained not only with a rate-distortion base loss but also with a discriminator-dependent feature loss and a perceptual loss (see the loss sketch after this list). Experiments on test sequences demonstrate that GVC delivers superior perceptual quality, as evidenced by significantly lower FID and KID scores at lower bit rates. Notably, GVC performed exceptionally well at remarkably low bit rates (0.036 bpp to 0.067 bpp), surpassing existing methods such as PLVC [105], RLVC [104], and x265. Comparative analysis reveals that models such as x265 (LDP very fast), x265 (LDP default), and RLVC (MS-SSIM) need 1.25× to 2.43× higher bit rates to achieve comparable results, yet they still introduce noticeable blurriness and noise artifacts in decoded frames.
- Though residue coding is simple and efficient, residue-coded videos still contain uncompressed redundant information. To further exploit correlations among frames, reduce bit rates, and learn richer local and global features, our third work developed a contextual generative video compression method with transformers (CGVC-T), which adopts contextual coding to improve coding efficiency and applies GANs to enhance perceptual quality. The hybrid transformer-convolution structure integrated into the auto-encoders of CGVC-T learns both global and local features within video frames to eliminate temporal and spatial redundancy (an illustrative block appears after this list). Experimental results on the HEVC, UVG, and MCL-JCV datasets show that CGVC-T achieves superior perceptual quality, i.e., lower FID, KID, and LPIPS scores, compared with state-of-the-art learned video codecs, the industrial video codecs x264 and x265, and the official reference software JM, HM, and VTM. Additionally, the probability distribution models we developed lower the bit rates required to transmit the compressed video. On average, with VTM as the anchor, CGVC-T achieves 11.7%, 31.8%, and 32.5% BD-rate savings in terms of FID, KID, and LPIPS, respectively. CGVC-T also outperforms all compared learned video codecs in the perceptual DISTS score, with a 20.6% BD-rate savings, the largest among the compared methods.
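The first framework's residue-coding pipeline can be pictured as follows. This is a minimal sketch under assumed interfaces: gan_codec_i, gan_codec_r, and predict stand in for the GAN-based I-frame codec, the GAN-based residue codec, and motion-compensated prediction; none of these names come from the dissertation.

```python
# Sketch of GAN-based inter-frame residue coding (assumed interfaces).
# frames: list of (C, H, W) tensors in [0, 1]; frames[0] is the I-frame.
def encode_gop(frames, gan_codec_i, gan_codec_r, predict):
    """GAN-code the I-frame, then GAN-code each P-frame's residue against
    a prediction from the previous reconstruction, so only the low-energy
    residue signal is transmitted."""
    i_hat, i_bits = gan_codec_i(frames[0])       # I-frame through GAN codec
    recon, bits = [i_hat], [i_bits]
    ref = i_hat
    for p in frames[1:]:
        pred = predict(ref, p)                   # motion-compensated prediction
        residue = p - pred                       # low-energy residue frame
        r_hat, r_bits = gan_codec_r(residue)     # residue through GAN codec
        ref = (pred + r_hat).clamp(0, 1)         # reconstructed P-frame
        recon.append(ref)
        bits.append(r_bits)
    return recon, bits
```

Because the residue concentrates far less energy than the raw P-frame, the GAN residue codec can spend its bit budget on the small prediction errors that actually matter perceptually.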
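The second framework's training objective combines the rate-distortion base loss with the discriminator-dependent feature loss and the perceptual loss mentioned above. The sketch below shows one plausible composition in PyTorch; the loss weights, the disc.features helper, and the use of LPIPS as the perceptual term are assumptions, not the dissertation's exact formulation.

```python
# Sketch of a GVC-style generator loss (assumed weights and helper names).
import torch.nn.functional as F

def gvc_generator_loss(x, x_hat, bpp, disc, lpips_fn,
                       lam_feat=0.1, lam_perc=0.1, lam_adv=0.01):
    # Rate-distortion base loss: bits-per-pixel plus pixel-wise distortion.
    rd_loss = bpp + F.mse_loss(x_hat, x)

    # Discriminator-dependent feature loss: match intermediate features of
    # the transformer-based discriminator on raw vs. decoded frames.
    # (disc.features is an assumed helper returning a list of feature maps.)
    feats_real = disc.features(x)
    feats_fake = disc.features(x_hat)
    feat_loss = sum(F.l1_loss(f, r.detach())
                    for f, r in zip(feats_fake, feats_real))

    # Perceptual loss, e.g. LPIPS between decoded and raw frames.
    perc_loss = lpips_fn(x_hat, x).mean()

    # Non-saturating adversarial term from the discriminator's logits.
    adv_loss = F.softplus(-disc(x_hat)).mean()

    return rd_loss + lam_feat * feat_loss + lam_perc * perc_loss + lam_adv * adv_loss
```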
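Finally, the hybrid transformer-convolution structure in the CGVC-T auto-encoders can be illustrated with a block that fuses a convolutional branch (local detail) with a self-attention branch (global context). The layer sizes and the additive fusion below are assumptions for illustration, not the dissertation's exact design.

```python
# Illustrative hybrid transformer-convolution block (assumed design).
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Sequential(               # local-feature branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        local = self.conv(x)                     # local features
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)       # global attention
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return x + local + glob                  # residual fusion of branches
```

Stacking such blocks lets the encoder remove spatial redundancy locally while the attention path captures the long-range correlations that pure CNN codecs miss.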
Recommended Citation
Du, Pengli, "Generative Video Compression: Achieving High Perceptual Quality at Low Bitrates" (2024). Engineering Ph.D. Theses. 54.
https://scholarcommons.scu.edu/eng_phd_theses/54