Date of Award

6-2023

Document Type

Thesis

Publisher

Santa Clara : Santa Clara University, 2023.

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science and Engineering

First Advisor

Yi Fang

Abstract

Imbalanced datasets are common in text classification. The skewed label distribution of these datasets poses a major challenge to the performance of text classifiers. One popular way to mitigate this challenge is to augment underrepresented labels with synthesized items. These items are generated by data augmentation methods, which can typically produce an unbounded number of items. To select the synthesized items that maximize classifier performance, we introduce a novel method that jointly maximizes the likelihood that the selected items belong to their respective labels and the diversity of the selected items. The method formulates this joint maximization as a monotone submodular objective function, whose solution can be approximated by a tractable and efficient greedy algorithm. We evaluated our method on multiple real-world datasets with different data augmentation techniques and text classifiers, and compared the results with several baselines. The experimental results demonstrate the effectiveness and efficiency of our method.
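
The abstract does not state the exact objective, so the following is only an illustrative sketch of greedy selection under a monotone submodular objective, not the thesis's implementation. It assumes a hypothetical objective that adds a per-item likelihood score to a facility-location diversity term; the function name `greedy_select`, the weight `lam`, and the toy inputs are all assumptions made for illustration. Both inputs are assumed non-negative, which keeps this particular objective monotone.

```python
import numpy as np


def greedy_select(likelihood, similarity, k, lam=1.0):
    """Greedily pick up to k items that approximately maximize

        f(S) = sum_{i in S} likelihood[i]
               + lam * sum_j max_{i in S} similarity[j, i]

    This objective is an assumed stand-in (modular likelihood term plus
    a facility-location diversity term), not the one from the thesis.
    """
    likelihood = np.asarray(likelihood, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = likelihood.shape[0]

    selected = []
    chosen = np.zeros(n, dtype=bool)
    # coverage[j] = highest similarity between item j and any selected item
    coverage = np.zeros(n)

    for _ in range(min(k, n)):
        # Marginal gain of adding candidate i: its likelihood plus the
        # increase in coverage it would bring to every item j.
        extra = np.maximum(similarity - coverage[:, None], 0.0).sum(axis=0)
        gains = likelihood + lam * extra
        gains[chosen] = -np.inf  # never pick the same item twice

        best = int(np.argmax(gains))
        selected.append(best)
        chosen[best] = True
        coverage = np.maximum(coverage, similarity[:, best])

    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 is different.
toy_likelihood = [0.9, 0.8, 0.7]
toy_similarity = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
picked = greedy_select(toy_likelihood, toy_similarity, k=2)
print(picked)
```

On the toy data, the sketch selects items 0 and 2: the near-duplicate item 1 has the higher likelihood, but it adds little diversity once item 0 has been chosen. Setting `lam=0` reduces the selection to ranking by likelihood alone. For a monotone submodular objective under a cardinality constraint, this kind of greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value.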
