Date of Award

6-12-2025

Document Type

Thesis

Publisher

Santa Clara : Santa Clara University, 2025

Departments

Computer Science and Engineering; Electrical and Computer Engineering

First Advisor

Hoeseok Yang

Second Advisor

Yi Fang

Abstract

Large Language Models (LLMs) are becoming increasingly popular in modern society. However, despite their popularity, deploying LLMs in real-world scenarios is extremely challenging due to substantial computational costs and memory constraints. Edge devices, such as smartphones and IoT devices, lack the resources needed to run these models locally and instead offload computation to the cloud. Cloud computing requires users to send their data over the internet, leading to numerous privacy and security concerns. In some domains, such as healthcare and finance, sending such sensitive information is not an option. Existing approaches to reducing model size or increasing inference speed include Small Language Models (SLMs), model compression techniques, and inference optimization strategies. However, all of these techniques require extensive human effort and manual tuning to find the optimal settings for increased speed without significant degradation in output quality. We propose an online hyperparameter finetuning method that autonomously discovers the optimal settings based on tangible metrics during inference. Our approach monitors performance metrics in real time and dynamically adjusts any tunable parameter without human intervention. We demonstrate this framework on dynamic sparsity prediction, achieving a 1.67× speedup while maintaining accuracy, but the method generalizes to any tunable parameter.
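The core idea described in the abstract — monitoring performance metrics during inference and nudging a tunable parameter toward better speed while a quality proxy stays acceptable — can be illustrated with a minimal sketch. This is a hypothetical greedy-search loop, not the thesis implementation; the function names (`online_tune`, `simulated_inference`), the latency/quality model, and the specific update rule are all illustrative assumptions.

```python
def online_tune(run_inference, param, lo, hi, step=0.05, rounds=20, quality_floor=0.9):
    """Greedy online tuner (illustrative sketch, not the thesis method):
    nudge one tunable parameter toward lower latency as long as a quality
    proxy stays above quality_floor.

    run_inference(p) -> (latency, quality) is measured during serving.
    """
    best = param
    best_lat, _ = run_inference(best)
    for _ in range(rounds):
        improved = False
        # Probe one step in each direction from the current best setting.
        for cand in (best - step, best + step):
            cand = min(hi, max(lo, cand))  # clamp to the allowed range
            lat, q = run_inference(cand)
            if q >= quality_floor and lat < best_lat:
                best, best_lat, improved = cand, lat, True
        if not improved:
            break  # local optimum under the quality constraint
    return best, best_lat

# Toy stand-in for real measurements: higher sparsity -> faster but lower quality.
def simulated_inference(p):
    return 1.0 - 0.5 * p, 1.0 - 0.3 * p

best_p, best_lat = online_tune(simulated_inference, param=0.1, lo=0.0, hi=1.0)
```

In a real deployment the tunable parameter could be, for example, a sparsity-prediction threshold, and `run_inference` would report measured latency and an accuracy proxy rather than the toy closed-form model above.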
