Bingxin Hou

Date of Award


Document Type



Santa Clara : Santa Clara University, 2022.

Degree Name

Doctor of Philosophy (PhD)


Computer Science and Engineering

First Advisor

Nam Ling

Second Advisor

Ying Liu


Moving object detection (MOD) is the process of extracting dynamic foreground content from the video frames, such as moving vehicles or pedestrians, while discarding the nonmoving background. It plays an essential role in computer vision field. The traditional methods meet difficulties when applied in complex scenarios, such as videos with illumination changes, shadows, night scenes,and dynamic backgrounds. Deep learning methods have been actively applied to moving object detection in recent years and demonstrated impressive results. However, many existing models render superior detection accuracy at the cost of high computational complexity and slow inference speed. This fact has hindered the development of such models in mobile and embedded vision tasks, which need to be carried out in a timely fashion on a computationally limited platform.

The current research aims to use the technique of separable convolution in both 2D and 3D CNN together with our proposed multi-input multi-output strategy and two-branch structure to devise new deep network models that significantly improve inference speed, yet require smaller model size and achieve reduction in floating-point operations as compared to existing deep learning models with competitive detection accuracy.

This research devised three deep neural network models, addressing the following main problems in the area of moving object detection:

1. Improving Detection Accuracy by extracting both spatial and temporal information: To improve detection accuracy, the proposed models adopt 3D convolution which is more suitable to extract both spatial and temporal information in video data than 2D convolution. We also put this 3D convolution into two-branch network that extracts both high-level global features and low-level detailed features can further increase the accuracy.

2. Reduce model size and computational complexity by changing network structure: The standard 2D and 3D convolution are further decomposed into depthwise and pointwise convolutions. While existing 3D separable CNN all addressed other problems such as gesture recognition, force prediction, 3D object classification or reconstruction, our work applied it to the moving object detection task for the first time in the literature.

3. Increasing inference speed by changing the input-output relationship: We proposed a multi-input multi-output (MIMO) strategy to increase inference speed, which can take multiple frames as the network input and output multiple frames of detection results. This MIMO embedded in 3Dseparable CNN can further increase model inference speed significantly and maintain high detection accuracy.

Compared to state-of-the-art approaches, our proposed methods significantly increases the inference speed, reduces the model size, meanwhile achieving the highest detection accuracy in the scene dependent evaluation (SDE) setup and maintaining a competitive detection accuracy in the scene independent evaluation (SIE) setup. The SDE setup is widely used to tune and test the model on a specific video as the training and test sets are from the same video. The SIE setup is designed to assess the generalization capability of the model on completely unseen videos.