Topic Modeling Bibliography

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing. Mixed Membership Stochastic Blockmodels. JMLR (9) 2008 pp. 1981-2014.
Networks

Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. Topic Significance Ranking of LDA Generative Models. ECML (2009).
Evaluation

David Andrzejewski, Anne Mulhern, Ben Liblit, Xiaojin Zhu. Statistical Debugging using Latent Topic Models. ECML (2007).

David Andrzejewski, Xiaojin Zhu, Mark Craven. Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. ICML (2009).

David Andrzejewski, Xiaojin Zhu, Mark Craven, Ben Recht. A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation using First-Order Logic. IJCAI (2011).

Arthur Asuncion, Padhraic Smyth, Max Welling. Asynchronous Distributed Learning of Topic Models. NIPS (2008).
Scalability

Arthur Asuncion, Max Welling, Padhraic Smyth, Yee-Whye Teh. On Smoothing and Inference for Topic Models. UAI (2009).
Inference

A dense but excellent review of inference in topic models. Introduces CVB0, a method for collapsed variational inference surprisingly similar to Gibbs sampling.

David Blei, Michael Jordan. Modeling Annotated Data. SIGIR (2003).

This paper introduces CorrLDA for data that consists of text and images, where image “topics” are chosen only from topics that are assigned to the text in the same document.

David M. Blei. lda-c. (2003).
Implementations

lda-c implements LDA with variational inference in C.

David M. Blei, Andrew Ng, Michael Jordan. Latent Dirichlet allocation. JMLR (3) 2003 pp. 993-1022.

David M. Blei, Thomas Griffiths, Michael Jordan, Joshua Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. NIPS (2003).
Non-parametric

Introduces hLDA, which models topics in a tree. Each document is generated by topics along a single path through the tree.

David M. Blei, Thomas L. Griffiths, Michael I. Jordan. The nested Chinese restaurant process and hierarchical topic models. (2007).
Non-parametric

This is a longer version of Blei et al.’s NIPS 2003 paper, listed above, and extends that paper’s hLDA model to trees of unlimited depth.

David M. Blei, John D. Lafferty. Dynamic Topic Models. ICML (2006).
Temporal

David M. Blei, John D. Lafferty. A Correlated Topic Model of Science. AAS (1) 2007 pp. 17-35.

David M. Blei, Jon D. McAuliffe. Supervised Topic Models. NIPS (2007).

David M. Blei. Introduction to Probabilistic Topic Models. Communications of the ACM (2011).
Where to start

A high-level overview of probabilistic topic models.

Brad Block. Collapsed variational HDP. (2011).
Implementations

This library contains Java source and class files implementing the Latent Dirichlet Allocation (single-threaded collapsed Gibbs sampling) and Hierarchical Dirichlet Process (multi-threaded collapsed variational inference) topic models. The models can be accessed through the command-line or through a simple Java API. Also included is a subset of the 20 Newsgroup dataset and results of experiments done on the dataset to confirm the correct operation and investigate some properties of the topic models. No third-party scientific libraries are required and all needed special functions are implemented and included.

Jordan Boyd-Graber, David M. Blei, Xiaojin Zhu. A Topic Model for Word Sense Disambiguation. EMNLP (2007).
NLP

Jordan Boyd-Graber, David M. Blei. PUTOP: Turning Predominant Senses into a Topic Model for WSD. SEMEVAL (2007).
NLP

Jordan Boyd-Graber, David M. Blei. Syntactic Topic Models. NIPS (2008).
NLP

Jordan Boyd-Graber, David M. Blei. Multilingual Topic Models for Unaligned Text. UAI (2009).
Cross-language

David A. Broniatowski, Christopher L. Magee. Analysis of Social Dynamics on FDA Panels Using Social Networks Extracted From Meeting Transcripts. SocCom (2010).
Networks

Method for analyzing group decision making based on the Author-Topic Model.

David A. Broniatowski, Christopher L. Magee. Towards A Computational Analysis of Status and Leadership Styles on FDA Panels. SBP (2011).
Networks, Temporal

Incorporates temporal information to generate directed graphs based upon topic models.

Wray L. Buntine. Discrete Component Analysis. (2009).
Implementations

C implementation of LDA and multinomial PCA.

Wray L. Buntine, Aleks Jakulin. Discrete Component Analysis. SLSFS (2005).

Wray L. Buntine. Estimating Likelihoods for Topic Models. Asian Conference on Machine Learning (2009).
Evaluation

Provides improved versions of some of the methods in Wallach et al. (2009) for calculating held-out probability.

Jun Fu Cai, Wee Sun Lee, Yee Whye Teh. NUS-ML: Improving Word Sense Disambiguation Using Topic Features. SEMEVAL (2007).
NLP

Jonathan Chang. R package ‘lda’. (2011).
Implementations

This package implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel. Inference for all of these models is implemented via a fast collapsed Gibbs sampler written in C. Utility functions for reading/writing data typically used in topic models, as well as tools for examining posterior distributions, are also included.

Jonathan Chang, David Blei. Relational Topic Models for Document Networks. AIStats (2009).
Networks

Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS (2006).

This paper has two interesting extensions to LDA that account for the power-law distribution of word frequencies in real documents. First, a general “background” distribution represents common words. Second, a “special words” model allows each document to have some unique words.

Changyou Chen, Lan Du, Wray Buntine. Sampling Table Configurations for the Hierarchical Poisson-Dirichlet Process. ECML-PKDD (2011).
Non-parametric

A simple hierarchical Pitman-Yor LDA sampler that does not record “table” assignments. Perplexity is sometimes far superior to other methods.

Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. NIPS (2009).
Evaluation

Pradipto Das, Rohini Srihari, Yun Fu. Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives. (2011).

Hal Daumé III. Markov Random Topic Fields. (2009).

Andrew M. Dai, Amos J. Storkey. The Grouped Author-Topic Model for Unsupervised Entity Resolution. ICANN (2011).

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. JASIS (41) 1990 pp. 391-407.

Laura Dietz, Steffen Bickel, Tobias Scheffer. Unsupervised prediction of citation influences. ICML (2007).
Networks, Bibliometrics

Chris Ding, Tao Li, Wei Peng. On the Equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing. Computational Statistics and Data Analysis (52) 2008 pp. 3913-3927.
Theory

Gabriel Doyle, Charles Elkan. Accounting for Burstiness in Topic Models. ICML (2009).

Replaces each topic’s standard multinomial distribution over words with a Dirichlet compound multinomial (DCM), which lets the model capture word burstiness.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, Eric P. Xing. A Latent Variable Model for Geographic Lexical Variation. EMNLP (2010).

The widely-reported Twitter dialects paper. Topics combine a word distribution with a bivariate normal over latitude and longitude.

Jacob Eisenstein, Amr Ahmed, Eric P. Xing. Sparse Additive Generative Models of Text. ICML (2011).

Presents a new generative model of text, based on the principle of sparse deviation from a background word distribution. This approach proves effective in supervised, unsupervised, and latent variable settings.

Elena Erosheva, Stephen Fienberg, John Lafferty. Mixed Membership Models of Scientific Publications. PNAS (101) 2004 pp. 5220-5227.
Bibliometrics

Radim Řehůřek. gensim. (2009).
Implementations

Python package for topic modelling. Includes distributed and online implementations of variational LDA.

Sean Gerrish, David M. Blei. A language-based approach to measuring scholarly impact. ICML (2010).
Bibliometrics

Mark Girolami, Ata Kabán. On an equivalence between pLSI and LDA. SIGIR (2003).
Theory

Andre Gohr, Myra Spiliopoulou, Alexander Hinneburg. Visually Summarizing the Evolution of Documents under a Social Tag. KDIR (2010).
Temporal

Andre Gohr, Alexander Hinneburg, Rene Schult, Myra Spiliopoulou. Topic Evolution in a Stream of Documents. SDM (2009).
Temporal

Thomas L. Griffiths, Mark Steyvers. Finding Scientific Topics. PNAS (101) 2004 pp. 5228-5235.

Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum. Integrating Topics and Syntax. NIPS (2004).
NLP

David Hall, Daniel Jurafsky, Christopher D. Manning. Studying the History of Ideas Using Topic Models. EMNLP (2008).
Bibliometrics

Gregor Heinrich. Parameter Estimation for Text Analysis. (2004).
Inference

Gregor Heinrich. A generic approach to topic models. ECML/PKDD (2009).
Scalability

Gregor Heinrich. Infinite LDA. (2011).
Implementations, Non-parametric

A simple implementation of a non-parametric model, where the number of topics is not fixed in advance. Uses Teh’s direct assignment method for HDP.

Alexander Hinneburg, Hans-Henning Gabriel, Andre Gohr. Bayesian Folding-In with Dirichlet Kernels for PLSI. ICDM (2007).
Theory

Thomas Hofmann. Probabilistic latent semantic analysis. UAI (1999).

Matthew Hoffman, David M. Blei, Francis Bach. Online Learning for Latent Dirichlet Allocation. NIPS (2010).

Jagadeesh Jagarlamudi, Hal Daumé III. Extracting Multilingual Topics from Unaligned Comparable Corpora. (2010).
Cross-language

Mark Johnson. PCFGs, Topic Models, Adaptor Grammars, and Learning Topical Collocations and the Structure of Proper Names. (2010).
NLP

Jyri J. Kivinen, Erik B. Sudderth, Michael I. Jordan. Learning Multiscale Representations of Natural Scenes Using Dirichlet Processes. ICCV (2007).
Non-parametric, Vision

The paper introduces a blocked Gibbs sampler for learning a nonparametric Bayesian topic model whose topic assignments are coupled with a tree-structured graphical model.

Simon Lacoste-Julien, Fei Sha, Michael I. Jordan. DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification. NIPS (2008).

Thomas K. Landauer, Susan T. Dumais. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review (1997).

John Langford. Vowpal Wabbit. (2011).
Implementations

VW includes an implementation of Hoffman et al.’s online variational LDA.

Wei Li, David Blei, Andrew McCallum. Nonparametric Bayes Pachinko Allocation. (2007).
Non-parametric

Wei-Hao Lin, Eric P. Xing, Alexander Hauptmann. A Joint Topic and Perspective Model for Ideological Discourse. ECML PKDD (2008).

Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit. (2002).
Implementations

Implements Gibbs sampling for LDA in Java using fast sampling methods from Yao et al. MALLET also includes support for data preprocessing, classification, and sequence tagging.

Andrew McCallum, Andrés Corrada-Emmanuel, Xuerui Wang. Topic and Role Discovery in Social Networks. IJCAI (2005).
Networks

Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. WWW (2007).

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai. Automatic labeling of multinomial topic models. KDD (2007).
User interface

Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai. Topic modeling with network regularization. WWW (2008).
Networks

David Mimno, Andrew McCallum. Expertise Modeling for Matching Papers with Reviewers. KDD (2007).

David Mimno, Andrew McCallum. Mining a digital library for influential authors. JCDL (2007).
Bibliometrics

David Mimno, Wei Li, Andrew McCallum. Mixtures of Hierarchical Topics with Pachinko Allocation. ICML (2007).

David Mimno, Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. UAI (2008).

Per-document Dirichlet priors over topic distributions are generated using a log-linear combination of observed document features and learned feature-topic parameters. Implemented in MALLET.

David Mimno, Hanna Wallach, Andrew McCallum. Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors. NIPS Workshop on Analyzing Graphs (2008).
Networks

Introduces an auxiliary-variable method for Gibbs sampling in non-conjugate topic models.

David Mimno, Hanna Wallach, Jason Naradowsky, David A. Smith, Andrew McCallum. Polylingual Topic Models. EMNLP (2009).
Cross-language

David Mimno. Reconstructing Pompeian Households. UAI (2011).
Cross-language

David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum. Optimizing Semantic Coherence in Topic Models. EMNLP (2011).
Evaluation

A simple, automated metric that uses only information contained in the training documents has strong ability to predict human judgments of topic coherence.

David Mimno, David Blei. Bayesian Checking for Topic Models. EMNLP (2011).
Evaluation

Posterior predictive checks are useful in detecting lack of fit in topic models and identifying which metadata-enriched models might be useful.

Indraneel Mukherjee, David Blei. Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation. NIPS (2008).
Inference

Claudiu Musat, Julien Velcin, Stefan Trausan-Matu, Marian-Andrei Rizoiu. Improving Topic Evaluation Using Conceptual Knowledge. IJCAI (2011).
Evaluation

Ramesh Nallapati, Amr Ahmed, Eric P. Xing, William Cohen. Joint Latent Topic Models for Text and Citations. KDD (2008).
Networks

This is one of the first papers to address joint topic models of text and hyperlinks. Used as a baseline in the more recent Relational Topic Models. (R.N.)

Ramesh Nallapati, William Cohen, Susan Ditmore, John Lafferty, Kin Ung. Multi-scale Topic Tomography. KDD (2007).
Temporal

Models variation of topic content with time at various scales of resolution. A novel variant of dynamic topic models that uses the Poisson distribution for word generation, and wavelets. (R.N.)

Ramesh Nallapati, William Cohen, John Lafferty. Parallelized Variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability. ICDM workshop on high performance data mining (2007).
Scalability

Early paper on parallel implementations of variational EM for LDA. (R.N.)

Ramesh Nallapati. multithreaded lda-c. (2010).
Implementations

Multithreaded extension of David Blei’s LDA implementation in C. Speeds up the computation by orders of magnitude depending on the number of processors.

David Newman, Chaitanya Chemudugunta, Padhraic Smyth. Statistical entity-topic models. KDD (2006).

D. Newman, S. Block. Probabilistic Topic Decomposition of an Eighteenth-Century American Newspaper. JASIST (2006).

David Newman, Jey Han Lau, Karl Grieser, Timothy Baldwin. Automatic Evaluation of Topic Coherence. NAACL (2010).
Evaluation

Xiaochuan Ni, Jian-Tao Sun, Jian Hu, Zheng Chen. Mining Multilingual Topics from Wikipedia. WWW (2009).
Cross-language

Xuan-Hieu Phan, Cam-Tu Nguyen. GibbsLDA++. (2007).
Implementations

C/C++ implementation of LDA with Gibbs sampling.

Jukka Perkiö, Wray L. Buntine, Sami Perttu. Exploring Independent Trends in a Topic-Based Search Engine. Web Intelligence (2004).

Matthew Purver, Konrad Körding, Thomas L. Griffiths, Joshua Tenenbaum. Unsupervised Topic Modelling for Multi-Party Spoken Discourse. ACL (2006).

Daniel Ramage, Evan Rosen. Stanford Topic Modeling Toolbox. (2009).
Implementations

Scala implementation of LDA and LabeledLDA. Input and output integration with Excel.

Daniel Ramage, David Hall, Ramesh Nallapati, Christopher D. Manning. Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora. EMNLP (2009).

Daniel Ramage, Susan Dumais, Dan Liebling. Characterizing Microblogs with Topic Models. ICWSM (2010).

Joseph Reisinger, Austin Waters, Brian Silverthorn, Raymond J. Mooney. Spherical Topic Models. ICML (2010).

Michal Rosen-Zvi, Tom Griffiths, Mark Steyvers, Padhraic Smyth. The Author-Topic Model for Authors and Documents. UAI (2004).

Ruslan Salakhutdinov, Geoffrey Hinton. Replicated Softmax: an Undirected Topic Model. NIPS (2009).

Shravan Narayanamurthy. Yahoo! LDA. (2011).
Implementations

Y!LDA implements a fast, sampling-based, distributed algorithm. See Smola and Narayanamurthy for details.

Alexander Smola, Shravan Narayanamurthy. An Architecture for Parallel Topic Models. VLDB (2010).
Scalability

Mark Steyvers, Tom Griffiths. Matlab Topic Modeling Toolbox. (2005).
Implementations

Implements LDA, Author-Topic, HMM-LDA, LDA-COL. Tools for 2D visualization.

Mark Steyvers, Tom Griffiths. Probabilistic Topic Models. In Landauer, T., Mcnamara, D., Dennis, S., Kintsch, W., Latent Semantic Analysis: A Road to Meaning. (2006).
Where to start

A good introduction to topic modeling.

Claudio Taranto, Nicola Di Mauro, Floriana Esposito. rsLDA: a Bayesian Hierarchical Model for Relational Learning. ICDKE (2011).

Yee-Whye Teh, David Newman, Max Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. NIPS (2006).
Inference

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei. Hierarchical Dirichlet Processes. JASA (101) 2006.
Non-parametric

Kristina Toutanova, Mark Johnson. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. NIPS (2007).
NLP

Hanna M. Wallach. Topic modeling: beyond bag-of-words. ICML (2006).

Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, David Mimno. Evaluation Methods for Topic Models. ICML (2009).
Evaluation

Commonly used methods for estimating the probability of held-out words may be unstable. This paper presents more accurate methods.

Hanna Wallach, David Mimno, Andrew McCallum. Rethinking LDA: Why priors matter. NIPS (2009).
Theory

The use of an asymmetric Dirichlet prior on per-document topic distributions reduces sensitivity to very common words (e.g., stopwords and near-stopwords) and makes topic assignments more stable as the number of topics grows.

Chang Wang, Sridhar Mahadevan. Multiscale Analysis of Document Corpora Based on Diffusion Models. IJCAI (2009).

Chang Wang, James Fan, Aditya Kalyanpur, David Gondek. Relation Extraction with Relation Topics. EMNLP (2011).

Xuerui Wang, Natasha Mohanty, Andrew McCallum. Group and Topic Discovery from Relations and Their Attributes. NIPS (2005).
Networks

Xuerui Wang, Andrew McCallum. Topics Over Time: a non-Markov continuous-time model of topical trends. KDD (2006).
Temporal

Chong Wang, David M. Blei, David Heckerman. Continuous Time Dynamic Topic Models. UAI (2008).
Temporal

Chong Wang, David Blei, Fei-Fei Li. Simultaneous Image Classification and Annotation. CVPR (2009).
Vision

Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details. (2011).
Where to start

A thorough introduction for those wanting to understand the mathematical basics of topic models.

Wei Li, Andrew McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. ICML (2006).

Xing Wei, Bruce Croft. LDA-based document models for ad-hoc retrieval. SIGIR (2006).

Feng Yan, Ningyi Xu, Yuan Qi. Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units. NIPS (2009).

In addition to dividing the corpus between processors, this work divides the vocabulary into the same number of partitions, such that each processor works on both its own documents and its own words at each epoch. This increases the number of epochs, but drastically reduces the possibility of incorrect samples.

Shuang-Hong Yang, Steven P. Crain, Hongyuan Zha. Bridging the language gap: topic adaptation for documents with different technicality. AIStats (2011).

Limin Yao, David Mimno, Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD (2009).
Scalability

Explores methods for inferring topic distributions for new documents given a trained model. This paper includes the SparseLDA algorithm and data structure, which can dramatically improve time and memory performance in Gibbs sampling.

Jianwen Zhang, Yangqiu Song, Changshui Zhang, Shixia Liu. Evolutionary Hierarchical Dirichlet Processes for Multiple Correlated Time-varying Corpora. KDD (2010).
Non-parametric, Temporal

Bing Zhao, Eric P. Xing. BiTAM: Bilingual Topic AdMixture Models for Word Alignment. ACL (2006).
Cross-language

Bing Zhao, Eric P. Xing. HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation. NIPS (2007).
Cross-language

Jun Zhu, Amr Ahmed, Eric P. Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. ICML (2009).

Jun Zhu, Eric P. Xing. Conditional Topic Random Fields. ICML (2010).

Xiaojin Zhu, David M. Blei, John Lafferty. TagLDA: Bringing document structure knowledge into topic models. (2006).

Posted November 14, 2013 by masonzms in Research

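Things to finish in the next six months

1. Finish character-level dependency parsing and write a paper to submit to ACL.

2. Finish the one-beam joint POS tagging and dependency parsing model, and finish writing Chapter 3 of my PhD thesis.

3. Lexicon- and sentence-annotation for domain adaptation of joint word segmentation and POS tagging; finish writing Chapter 4 of my PhD thesis.

4. Chinese word segmentation with acoustic cues? Write a paper to submit to ACL.

5. Study topic models and Bayesian learning.

6. Figure out how to use the MALLET toolkit.

7. If time permits: arc-standard joint model vs. arc-eager joint model: a comparison (for next year's COLING).

8. If the SRL paper is rejected, prepare it for next year's COLING and add experiments on English; otherwise, polish this work this year and finish writing Chapter 2 of my PhD thesis.

Posted November 14, 2013 by masonzms in Research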

Parts of speech and syntax in Chinese

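At this CNCCL conference there seemed to be a great deal of debate about parts of speech and syntax in Chinese. Sometimes I do not fully understand Chinese parts of speech myself. For example, why can a word that the dictionary lists only as a noun act as a verb in actual use? At what level does this kind of noun-to-verb conversion operate? It must surely differ from a verb in the ordinary sense, and there are many examples of this kind. Another point: in joint models, for example, why can POS tagging and parsing not both improve at the same time? That also suggests something is wrong. Perhaps a different set of standards is needed. What would that standard be? The most important thing in Chinese is probably semantics. Chinese word segmentation should also start from the semantic level: because people can judge the meaning of a sentence very quickly, segmentation follows naturally.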

Posted August 30, 2011 by masonzms in Research

Typesetting algorithms in LaTeX (reposted)

Packages that may be needed for typesetting:

\usepackage{algorithm}               % format of the algorithm

\usepackage{algorithmic}             % format of the algorithm

\usepackage{multirow}                % multirow cells for table formatting

\usepackage{amsmath}

\usepackage{xcolor}

\DeclareMathOperator*{\argmin}{argmin}         % typesetting of argmin/argmax formulas

\renewcommand{\algorithmicrequire}{\textbf{Input:}}   % use "Input" in the algorithm format

\renewcommand{\algorithmicensure}{\textbf{Output:}}  % use "Output" in the algorithm format

Packages that may be needed for figures:

\usepackage{graphics} 

\usepackage{graphicx} 

\usepackage{epsfig} 

An example of typesetting an algorithm:

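\begin{algorithm}[htb]         % start of the algorithm

\caption{ Framework of ensemble learning for our system.}             % caption of the algorithm

\label{alg:Framwork}                  % label for the algorithm, so it can be referenced in the text

\begin{algorithmic}[1]                % the optional [1] turns on line numbering for every line

\REQUIRE ~~\\                          % input parameters of the algorithm: Input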

    The set of positive samples for current batch, $P_n$;\\ 

    The set of unlabelled samples for current batch, $U_n$;\\ 

    Ensemble of classifiers on former batches, $E_{n-1}$; 

\ENSURE ~~\\                           % output of the algorithm: Output

    Ensemble of classifiers on the current batch,  $E_n$; 

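\STATE Extracting the set of reliable negative and/or positive samples $T_n$ from $U_n$  with help of $P_n$; \label{code:fram:extract}      % a statement of the algorithm, corresponding to one step, formula, or the like; \label{code:fram:extract} marks this line so that a specific step of the algorithm can be referenced in the text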

\STATE Training ensemble of classifiers $E$ on $T_n \cup P_n$, with help of data in former batches; \label{code:fram:trainbase} 

\STATE $E_n=E_{n-1}\cup E$; \label{code:fram:add} 

\STATE Classifying samples in $U_n-T_n$ by $E_n$; \label{code:fram:classify} 

\STATE Deleting some weak classifiers in $E_n$ so as to keep the capacity of $E_n$; \label{code:fram:select} 

\RETURN $E_n$;                % return value of the algorithm

\end{algorithmic} 

\end{algorithm}
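
The labels in the example above only pay off once they are referenced from the running text. As a minimal sketch of that, assuming the algorithm and algorithmic packages are installed, the following self-contained file defines a toy algorithm and refers to both the algorithm and one of its numbered lines with \ref. The toy algorithm and the label names alg:demo and code:demo:double are hypothetical and chosen only for illustration.

```latex
\documentclass{article}
\usepackage{algorithm}
\usepackage{algorithmic}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}

\begin{document}

% \ref{alg:demo} prints the algorithm number,
% \ref{code:demo:double} prints the number of the labelled line.
Algorithm~\ref{alg:demo} doubles its input in line~\ref{code:demo:double}.

\begin{algorithm}[htb]
\caption{A toy algorithm (hypothetical, for illustration only).}
\label{alg:demo}                 % must come after \caption
\begin{algorithmic}[1]           % [1] numbers every line
\REQUIRE A number $x$;
\ENSURE The doubled value $y$;
\STATE $y \leftarrow 2x$; \label{code:demo:double}
\RETURN $y$;
\end{algorithmic}
\end{algorithm}

\end{document}
```

Compile twice so that the references resolve; after the first pass LaTeX prints ?? in place of the numbers.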

Posted July 15, 2011 by masonzms in Research

Installing a LaTeX Beamer template

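Beamer template download link: http://www.newsmth.net/bbsanc.php?path=%2Fgroups%2Fcomp.faq%2FTeX%2Fslides%2Fbeamer%2FM.1167373299.g0&ap=739
The installed CTeX version is CTeX_2.9.0.152_Full, installed on Windows, with WinEdt 6.0.
Unzip the downloaded file and copy the entire themes directory into “CTEX\MiKTeX\tex\latex\beamer\base”. Then, in WinEdt, run “Tex->MikTex->MikTex Options->Refresh FNDB”, after which everything compiles and runs normally. I had this set up a few months ago, but after changing systems I found I no longer knew how to do it, so I am writing it down to avoid spending another hour or more on something this simple.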
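
A quick way to check the installation is to compile a tiny Beamer file. The sketch below is an assumption-laden test file, not part of the downloaded template: it uses the "default" theme that ships with Beamer, because the name of the downloaded theme is not recorded here. Replace "default" with the name of one of the themes you copied to confirm that the refreshed file name database finds it.

```latex
\documentclass{beamer}

% Assumption: "default" is a theme bundled with Beamer. Swap in the name of
% a theme from the copied themes directory to test that specific theme.
\usetheme{default}

\title{Beamer installation test}
\author{Test author}

\begin{document}

\begin{frame}
  \titlepage
\end{frame}

\begin{frame}{A test frame}
  If this page compiles, Beamer itself is working.
\end{frame}

\end{document}
```

If compilation fails with a message that the theme file cannot be found, the themes directory is probably in the wrong place or the file name database was not refreshed.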

Posted July 15, 2011 by masonzms in Research

Some useful resources listed by Stanford: the things likely to be most relevant to me in the future

Part-of-Speech Tagging

Links referred to in the text

Other links

  • ICOPOST. C taggers by Ingo Schröder that implement maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license (but requires registration).
  • TreeTagger, a decision tree based tagger from the University of Stuttgart (Helmut Schmid), complete with parameter files for English, German, French, and Italian, available in Solaris and Linux versions.
  • QTAG, an HMM-based tagger written in Java by Oliver Mason.

Probabilistic Parsing

Links of interest

The supply of probabilistic parsers on the web has increased a lot since the book was published.

Dependency Parsers


Corpus-Based Work

Links referred to in the text

Other links


Collocations

Links referred to in the text

Other links

Some basic web references for information extraction

  • Fredrik Olsson’s IE course bibliography: an extensive bibliography, with links to online papers.
  • IE tutorial: by Doug Appelt and David Israel, together with some links.
  • RISE: Ion Muslea’s extensive collection of pages, the Repository of Online Information Sources Used in Information Extraction Tasks.

Statistical estimation: n-gram models over sparse data

Links referred to in the text

  • CMU-Cambridge Statistical Language Modeling toolkit and its documentation. The best freely available tool for building language models.
  • The Austen text files which were used to build sample language models were obtained from Project Gutenberg (perhaps try the Sailor’s Project Gutenberg site mirror).
  • To remove punctuation from the text files, we used the following Unix sed script: sed.strip. It specifies a number of global substitutions in terms of very simple regular expressions. If sed is not available, it would be very easy to write the same thing in Perl, or one could just do the substitutions in a text editor.
  • But here are the resulting ‘clean’ text files that we used: training data (a concatenation of various novels), and test data (cleaned up Persuasion).
  • The Good-Turing estimates for Austen in Table 6.8 were calculated using Gale and Sampson’s (1995) Simple Good-Turing technique using Sampson’s C program SGT.c, available from his website. The frequency of frequency data that was used as input is available in this file. (To do exercise 6.6, what you might want to do is use a language modelling toolkit to generate raw n-grams, a Perl program to do counts over those n-grams, and then to feed those into SGT.c for Good-Turing estimation.)
  • This file gives examples of some of the commands we used in calculations in the chapter, using standard Unix commands, and programs from the CMU-Cambridge Statistical Language Modeling toolkit: recipes.txt.
  • Gertjan van Noord’s table of language identification systems available on the WWW
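
For orientation, the Good-Turing estimate mentioned in the list above can be written compactly. This is a sketch of the standard textbook form, not a transcription of what SGT.c computes; the symbols $r$, $N_r$, $N$, $a$, and $b$ are notation assumed here rather than taken from the page.

```latex
% r    : raw count of an n-gram in the training data
% N_r  : number of distinct n-grams seen exactly r times
%        (the "frequency of frequency" r)
% N    : total number of n-gram tokens in the training data
\[
  r^{*} = (r+1)\,\frac{N_{r+1}}{N_{r}},
  \qquad
  P_{\mathrm{GT}}(\text{an n-gram seen } r \text{ times}) = \frac{r^{*}}{N},
  \qquad
  P_{\mathrm{GT}}(\text{all unseen n-grams together}) = \frac{N_{1}}{N}.
\]
% Simple Good-Turing avoids the noisy (often zero) N_r values at larger r
% by replacing them with a smoothed fit, roughly a straight line in
% log-log space:
\[
  \log N_{r} \approx a + b \log r .
\]
```

The frequency-of-frequency file mentioned above supplies the pairs $(r, N_r)$ that these formulas take as input.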

Teaching materials

Other links of interest

  • The SRI Language Modeling toolkit by Andreas Stolcke is another good system for building language models, freely available for research purposes.

Lexical Acquisition

Other links

Teaching materials


Markov Models

Software

Teaching materials


Probabilistic Context Free Grammars

Teaching materials

Links of interest

an online server for PCFG processing


Clustering

Links referred to in the text

Other Links

  • CLUTO: A package (with visualization tools) for clustering high dimensional data sets.
  • mkcls: A word class formation (clustering) tool by Franz Josef Och.

Teaching materials

  • A simple example of EM fitting lines to points in Fortran 90 or Octave by Rob Malouf <malouf@let.rug.nl> (reproduced with permission)

Posted July 3, 2011 by masonzms in Research

Some useful page links from Stanford NLP

Part-of-Speech Tagging

Links referred to in the text

  • The Xerox Tagger
  • Brill’s Transformation-Based Tagger
  • The MULTEXT tagger
  • Adwait Ratnaparkhi’s MaxEnt tagger
  • Thorsten Brants’ TnT tagger

Other links

  • ICOPOST. C taggers by Ingo Schröder that implement maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license (but requires registration).
  • TreeTagger, a decision tree based tagger from the University of Stuttgart (Helmut Schmid), complete with parameter files for English, German, French, and Italian, available in Solaris and Linux versions.
  • QTAG, an HMM-based tagger written in Java by Oliver Mason.

Posted July 3, 2011 by masonzms in Research