Author

Yuxuan Fang

Date of Award

6-5-2025

Document Type

Thesis - SCU Access Only

Publisher

Santa Clara : Santa Clara University, 2025

Degree Name

Master of Science (MS)

Department

Computer Science and Engineering

First Advisor

Nam Ling

Abstract

In recent years, artificial intelligence has made remarkable progress, especially in the field of AI-generated content (AIGC). Text generation models such as GPT and DeepSeek, and image generation models such as Stable Diffusion and Midjourney, benefit from large models pretrained on large amounts of data; their large parameter counts and phenomena such as knowledge emergence allow their output to reach commercial quality and to resemble work created by humans. These achievements, however, may owe much to the sheer volume of data and parameters and to powerful computing resources, and the underlying network structures may not be optimal. The impressive Stable Diffusion model, for example, relies solely on traditional CNNs, ordinary up- and down-sampling modules, residual connections, and simple feature fusion. Because training costs are so high, it is almost impossible to modify the structure of such models and retrain them, and there is currently little research on how network structure affects the performance of diffusion models. This study therefore works with a small amount of data to explore the influence of different network structures, especially the introduction of attention mechanisms, on the performance of diffusion models. It selects a simple image generation task based on diffusion models, introduces channel attention and temporal attention mechanisms to process feature maps, uses attention mechanisms to fuse different feature maps, and examines the effect of each on the quality of the generated images. The results show that introducing attention mechanisms significantly improves the image generation quality of diffusion models: both the Natural Image Quality Evaluator (NIQE) and Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) scores are significantly reduced, where lower scores indicate better quality. In particular, using a Vision Transformer (ViT) for feature fusion yields the best generation performance.
In addition, ViT significantly improves the spatial and structural plausibility of the generated images, which may be attributed to its excellent spatial feature extraction ability. Attention mechanisms such as CA and CBAM significantly improve the naturalness of the generated images, but they represent the details of the generated objects poorly.
