时 间: 2023-02-20 10:00 — 11:00
With the development of computer technology and the internet, increasingly large amounts of textual data are generated and collected every day. It is a significant challenge to analyze and extract meaningful and actionable information from vast amounts of unstructured textual data. Many machine learning and natural language processing algorithms have been developed for text classification, clustering, and information retrieval. Driven by applications in a wide range of fields, there is an increasing need for developing computationally efficient statistical methods for analyzing a massive amount of textual data with theoretical guarantees.
In the first part of the talk, I will present the algorithms of unsupervised topic modeling under the probabilistic latent semantic indexing (pLSI) model. Novel and computationally fast algorithms for estimation and inference of both the word-topic matrix and the topic-document matrix are proposed, and their theoretical properties are investigated. In the second part, I will discuss supervised topic modeling, which jointly considers a collection of documents and their paired side information. A bias-adjusted algorithm is developed to study the regression coefficients in the supervised topic modeling under the generalized linear model formulation. I will also introduce an approach to constructing valid confidence intervals. Applications of the proposed methods reveal meaningful latent topic structures of textual data.