Author

Zhiyuan Peng

Date of Award

12-2020

Document Type

Thesis - SCU Access Only

Publisher

Santa Clara : Santa Clara University, 2020.

Degree Name

Master of Science (MS)

Department

Computer Science and Engineering

First Advisor

Yi Fang

Abstract

Short texts are now very common on the Internet because of the popularity of Twitter, notes, and other lightweight blogging services. Accurate categorization of these short texts and recognition of the named entities in them are critical for enhancing these services, since classification and named entity recognition (NER) provide the foundation for better ad targeting and recommendation. In this thesis, we focus on classification and NER of short texts. For classification, traditional multi-label methods do not perform well on short texts because the context information is sparse. We propose a novel Label Correlated Recurrent Neural Network (LC-RNN) for multi-label classification of short texts that exploits label correlations. We use a tree structure to represent the relationships between labels, which allows an efficient max-product algorithm to be developed for exact inference of label predictions. We conduct experiments on four testbeds, and the results demonstrate the effectiveness of the proposed model. For NER, we design a BioBERT+CRF model to recognize the named entities in call notes, a kind of short text written by pharmaceutical sales representatives. The results show that our model obtains better performance than spaCy v2.0, a very popular public NER tool.
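
The tree-structured inference mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's LC-RNN implementation: the function name `map_labels`, the restriction to binary label states, and every score below are hypothetical, chosen only to show how max-product (written here as max-sum over log-scores) finds the exact best joint labeling on a tree in one upward and one downward pass. In a real model the unary scores would presumably come from the network and the pairwise scores from learned label correlations.

```python
def map_labels(children, unary, pairwise, root=0):
    """Exact MAP labeling on a rooted label tree via max-sum.

    children[n]        -> list of child labels of n (hypothetical tree)
    unary[n][y]        -> log-score for label n taking state y in {0, 1}
    pairwise[c][yp][yc] -> log-score for child c in state yc, parent in yp
    """
    # Breadth-first order: every parent appears before its children.
    order = [root]
    for node in order:
        order.extend(children.get(node, []))

    score = {}   # score[n][y]: best score of n's subtree if n takes state y
    msg = {}     # msg[c][yp]: best score c's subtree can send to parent state yp
    choice = {}  # choice[c][yp]: state of c that achieves msg[c][yp]

    # Upward pass: leaves first, root last.
    for node in reversed(order):
        score[node] = [
            unary[node][y] + sum(msg[c][y] for c in children.get(node, []))
            for y in (0, 1)
        ]
        if node != root:
            msg[node], choice[node] = [], []
            for yp in (0, 1):
                cands = [pairwise[node][yp][yc] + score[node][yc] for yc in (0, 1)]
                best = max((0, 1), key=lambda yc: cands[yc])
                choice[node].append(best)
                msg[node].append(cands[best])

    # Downward pass: fix the root, then read off each child's best state.
    assignment = {root: max((0, 1), key=lambda y: score[root][y])}
    for node in order:
        for c in children.get(node, []):
            assignment[c] = choice[c][assignment[node]]
    return assignment


# Illustrative scores only: label 0 is the parent of labels 1 and 2.
children = {0: [1, 2]}
unary = {0: [0.0, 1.0], 1: [0.5, 0.0], 2: [0.0, 0.2]}
pairwise = {
    1: [[0.0, 0.0], [-2.0, 1.0]],  # label 1 strongly co-occurs with label 0
    2: [[0.0, 0.0], [0.0, 0.0]],   # label 2 is independent of label 0
}
result = map_labels(children, unary, pairwise)
```

In this toy example, label 1 would be switched off if judged by its unary score alone, but the correlation with its parent switches it on in the joint solution. The cost is linear in the number of labels, compared with exponential for enumerating all joint labelings.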
