Date of Award
Santa Clara : Santa Clara University, 2020.
Doctor of Philosophy (PhD)
Computer Science and Engineering
As the amount of textual data has increased rapidly over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent the original data samples as compact binary codes through hashing. A spectrum of machine learning methods has been applied to this problem, but these methods often lack the expressiveness and modeling flexibility needed to learn effective representations. Recent advances in deep learning across a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which makes them well suited to text modeling. However, little work has leveraged the recent progress in deep learning for text hashing.
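To illustrate why binary codes make similarity search cheap, the following is a minimal sketch of Hamming-distance retrieval, not code from the thesis. The function names, the 8-bit codes, and the tie-breaking rule are assumptions made for the example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def search(query_code: int, codes: list, k: int) -> list:
    """Return the indices of the k stored codes closest to the query.

    Ties are broken by index so the result is deterministic
    (an assumption made for this illustration).
    """
    ranked = sorted(
        range(len(codes)),
        key=lambda i: (hamming_distance(query_code, codes[i]), i),
    )
    return ranked[:k]


# Hypothetical 8-bit hash codes for four documents.
codes = [0b10110010, 0b10110011, 0b01001101, 0b10010010]
query = 0b10110110
nearest = search(query, codes, k=2)
```

Each comparison is one XOR and one bit count, which is why compact codes scale to large collections. Real systems typically use longer codes and hardware popcount instructions.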
Meanwhile, most state-of-the-art semantic hashing approaches require large amounts of hand-labeled training data, which are often expensive and time-consuming to collect. The cost of obtaining labeled data is the key bottleneck in deploying these hashing methods. Furthermore, most existing text hashing approaches treat each document separately and learn the hash codes only from the content of the documents. In reality, however, documents are related to each other either explicitly, through an observed linkage such as citations, or implicitly, through unobserved connections such as adjacency in the original space. Such document relationships are pervasive in the real world, yet they are largely ignored in prior semantic hashing work.
In this thesis, we propose a series of novel deep document generative models for text hashing to address the aforementioned challenges. Based on the deep generative modeling framework, our models employ deep neural networks to learn complex mappings from the original space to the hash space. We first introduce an unsupervised model for text hashing. We then introduce supervised models that utilize document labels/tags and also account for document-specific factors that affect the generation of words.
To address the lack of labeled data, we employ unsupervised ranking methods such as BM25 to extract weak signals from the training data. We propose two deep generative semantic hashing models that leverage these weak signals for text hashing. Finally, we propose node2hash, an unsupervised deep generative model for semantic text hashing that utilizes graph context. It is designed to incorporate both document content and connection information through a probabilistic formulation. As with the other proposed models, node2hash employs deep neural networks to learn complex mappings from the original space to the hash space.
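As a rough illustration of extracting weak signals with an unsupervised ranker, the sketch below scores documents with a common BM25 variant and treats the top-ranked documents as weak neighbors. The abstract does not say how the weak signals are built, so the neighbor construction, the function names, the smoothed IDF form, and the parameter defaults `k1=1.2` and `b=0.75` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization term shared by all query terms.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # Smoothed, non-negative IDF (one common variant).
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def weak_neighbors(index, docs, k):
    """Use document `index` as the query; return its top-k other documents."""
    scores = bm25_scores(docs[index], docs)
    others = [i for i in range(len(docs)) if i != index]
    others.sort(key=lambda i: (-scores[i], i))
    return others[:k]


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "deep generative models for text hashing".split(),
    "semantic hashing with deep generative models".split(),
    "citation graphs link related documents".split(),
]
neighbors = weak_neighbors(0, docs, k=1)
```

Here the second document shares four terms with the first and is returned as its weak neighbor, while the third shares none and scores zero.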
The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Through variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and are therefore capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on various public testbeds. The experimental results demonstrate the effectiveness of the proposed models over competitive baselines.
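The reparameterization step mentioned above can be sketched as follows, assuming a diagonal Gaussian posterior whose mean and log-variance would come from an encoder network. The hard-coded encoder outputs, the function names, and the fixed binarization thresholds are illustrative assumptions, not the thesis's implementation.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    Writing the sample this way keeps it a deterministic function of
    mu and log_var, so gradients can pass through both during training.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def binarize(values, thresholds):
    """Turn a continuous latent vector into bits by per-dimension thresholding."""
    return [1 if v > t else 0 for v, t in zip(values, thresholds)]


rng = random.Random(0)

# Hypothetical encoder outputs for one document and a 4-bit code.
mu = [0.8, -1.2, 0.1, -0.4]
log_var = [-2.0, -2.0, -2.0, -2.0]

z = reparameterize(mu, log_var, rng)           # stochastic sample for training
code = binarize(mu, thresholds=[0.0] * 4)      # deterministic code at test time
```

The thresholds are fixed at zero here for simplicity; a data-dependent threshold per dimension, such as the median over the training set, is another common choice.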
Chaidaroon, Suthee, "Deep Generative Models for Semantic Text Hashing" (2020). Engineering Ph.D. Theses. 27.