Fantasia Impromptu: Roadmap

11/23/2011

Roadmap

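Topic modeling essentially starts from how topics are distributed over the words of a document and uses that to find the topic distribution of the whole document (the so-called Dirichlet distribution, a distribution over distributions). "Latent" means that the distribution of topics over a document is hidden, whereas the relationship between topics and words is observable, so a formula relating the two lets us solve for the hidden topic distribution of a single document.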
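The formula itself is not shown here. One standard relation of this kind, offered only as an assumption about what was meant, writes the probability of a word in a document as a mixture over topics. The symbols are illustrative: w is a word, d a document, z a topic index, and K the number of topics.

```latex
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d)
```

Here P(w | z = k) is the per-topic word distribution and P(z = k | d) is the hidden per-document topic distribution that training tries to recover.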
A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

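So to build a news recommender, we train on the articles I have read (or that my friends have read) together with the articles in the library. Then, based on the topic distribution of the articles I have read, we find the library articles whose topic distributions are most strongly correlated and recommend those. (Another option is not to look for the most correlated articles but for articles on the topic with the highest probability across everything I have read; this is worth studying properly later.)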
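A minimal sketch of the matching step, assuming each article has already been reduced to a topic distribution (a list of probabilities). The article ids and numbers are made up for illustration, and Hellinger distance is only one possible way to measure how related two distributions are; the post does not say which measure to use.

```python
import math

def mean_distribution(dists):
    """Average several topic distributions into one reader profile."""
    n_topics = len(dists[0])
    return [sum(d[k] for d in dists) / len(dists) for k in range(n_topics)]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def recommend(read_dists, library, top_n=2):
    """Return ids of the library articles closest to the reader's topic profile."""
    profile = mean_distribution(read_dists)
    ranked = sorted(library, key=lambda doc_id: hellinger(profile, library[doc_id]))
    return ranked[:top_n]

def dominant_topic(read_dists):
    """Alternative strategy: index of the most probable topic over everything read."""
    profile = mean_distribution(read_dists)
    return max(range(len(profile)), key=lambda k: profile[k])

# Hypothetical topic distributions over 3 topics.
read = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
library = {
    "politics_a": [0.7, 0.2, 0.1],
    "sports_b": [0.1, 0.1, 0.8],
    "mixed_c": [0.4, 0.4, 0.2],
}
print(recommend(read, library))
print(dominant_topic(read))
```

The first function ranks by closeness of the whole distribution; `dominant_topic` corresponds to the second option in the parenthesis above, where only the reader's single most likely topic is used.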

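To reach this goal, the work can be split into:

1. Train on the NYTimes data using an existing LDA package.

2. Combine it with a classifier such as MaxEnt, HMM, or CRF. This needs some glue code to connect the two third-party frameworks, followed by tuning the curve, the learning rate, and so on.

3. If time allows, try a faster training method: train with online LDA and see how the results turn out.

4. Set up the recommendation platform. It does not need to be complicated, just enough to show the idea.

5. Still have time? Try other topic models.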
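Step 1 would in practice rely on an existing LDA package. The toy collapsed Gibbs sampler below is not that package; it is only a dependency-free sketch of what such training produces, namely one topic distribution per document. The function name, hyperparameter defaults, and tiny corpus are all assumptions made for illustration.

```python
import random

def train_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k overall
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic distribution.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta

# Hypothetical tokenized articles.
docs = [
    "election vote senate vote".split(),
    "game score team game".split(),
    "senate election team vote".split(),
]
theta = train_lda(docs, n_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each printed row is a document's estimated topic distribution, which is the input the recommendation step needs. A real package would also handle tokenization, stop words, and much larger corpora.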
