11/30/2011

Why Gibbs Sampling?

Finally figured out why LDA uses Gibbs sampling: when computing the posterior distribution, the denominator p(w_{1:D}) requires summing the joint distribution over every possible instantiation of the hidden topic structure, and the number of such instantiations is exponentially large (David Blei et al., "A correlated topic model of Science"). The posterior therefore has to be approximated, either with sampling-based algorithms or with variational algorithms.

There is an article for beginners:
"Gibbs Sampling for the Uninitiated", by P. Resnik and E. Hardisty.
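To make the idea concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. This is my own toy version, not the sampler from the article above; the function name, hyperparameter defaults, and count-array layout are all my choices, and it is meant only to show the shape of the algorithm (resample each token's topic from its full conditional, then read off the per-document topic proportions):

```python
import random

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of integer word ids 0..V-1.
    Returns the posterior-mean topic proportions theta for each document."""
    rng = random.Random(seed)
    V = max(w for d in docs for w in d) + 1
    ndk = [[0] * K for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    nk = [0] * K                        # tokens assigned to each topic
    z = []                              # topic assignment for every token
    for d, doc in enumerate(docs):      # random initialisation
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]             # remove the token's current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional p(z_i = k | z_-i, w), up to a constant
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for k, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t             # record the new sample
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
            for d, doc in enumerate(docs)]
```

Each returned row of theta is a proper distribution over the K topics for one document.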

One fancifully imagines a Chinese restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 is seated at an unoccupied table with probability 1. At time n + 1, a new customer chooses uniformly at random to sit at one of the following n + 1 places: directly to the left of one of the n customers already sitting at an occupied table, or at a new, unoccupied circular table. Clearly, each table corresponds to a block of a random partition. (For an account of the custom in actual Chinese restaurants, see table sharing). It is then immediate that the probability assigned to any particular partition (ignoring the order in which customers sit around any particular table) is

\Pr(B_n = B) = \dfrac{\prod_{b\in B} (|b| -1)!}{n!}
where b is a block in the partition B and |b| is the size (i.e. number of elements) of b. The restaurant analogy is due to Pitman and Dubins, and first appears in print in [1].
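A quick way to convince oneself of this formula is to simulate the seating process and compare empirical partition frequencies against it. The sketch below is my own (helper names are mine); it uses the α = 1 restaurant exactly as described above, where customer i + 1 chooses uniformly among the i seats-to-the-left plus one new table:

```python
import math
import random
from collections import Counter

def crp_partition(n, rng):
    """Seat customers 0..n-1 by the Chinese restaurant process and
    return the resulting partition as a tuple of sorted blocks."""
    tables = []                       # each table holds a list of customers
    for i in range(n):
        r = rng.randrange(i + 1)      # i occupied seats + 1 new table, uniform
        acc = 0
        for t in tables:
            acc += len(t)
            if r < acc:               # join an existing table, prob |t| / (i + 1)
                t.append(i)
                break
        else:
            tables.append([i])        # open a new table, prob 1 / (i + 1)
    return tuple(sorted(tuple(sorted(t)) for t in tables))

def partition_prob(partition, n):
    """The formula above: Pr(B_n = B) = prod_{b in B} (|b| - 1)! / n!"""
    num = 1
    for block in partition:
        num *= math.factorial(len(block) - 1)
    return num / math.factorial(n)

# Empirical check for n = 3: relative frequencies should match the formula.
rng = random.Random(1)
counts = Counter(crp_partition(3, rng) for _ in range(50000))
for part in sorted(counts):
    print(part, counts[part] / 50000, partition_prob(part, 3))
```

For n = 3 the formula gives 1/3 for the single-table partition and 1/6 for each of the other four partitions, which the simulation reproduces.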

11/29/2011

LDA Implementation with R Code

Finally got the code running. The result is good and reasonable.
http://www.r-bloggers.com/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels/

Amazon Interview

Today I had the third interview, with Costin from Amazon. It was a mess: the speakerphone audio was so muffled that I could hardly hear what he said. Anyway, he asked me just two questions in total. One was: can you explain polymorphism, and what are the other two important concepts in OO?

The other was: sum up all the prime numbers from 1 to 100.
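For the record, a sieve-based sketch of that question (the sum of the primes up to 100 is 1060):

```python
def sum_primes(limit):
    """Sum all primes in [2, limit] using a simple Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):  # strike out multiples of p
                is_prime[m] = False
    return sum(i for i, ok in enumerate(is_prime) if ok)

print(sum_primes(100))  # 1060
```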

Now I'm beginning to read about probabilistic latent semantic analysis (PLSA). I think this is an old but useful topic model, so it is worth implementing for news recommendation. Also, I asked Nannan to read "Recommendation Algorithm based on LDA Model". Finally we divided up the work.

By the way, the new headset is wonderful!! It doesn't tire me at all!! Sennheiser HD202: cheap, and wonderful sound :D


11/26/2011

Paging algorithm

Today I implemented three paging algorithms: FIFO, Second-Chance (SC), and LRU.

There are a few different heuristic choices when implementing SC and LRU.

1. In SC, when the oldest page turns out to have been referenced, it can either be moved to the tail of the memory list or stay where it is. In both variants we must clear the oldest page's referenced bit, set its load time to the current time, and go on searching for the next oldest page.

2. In LRU (NFU with aging for the software simulation), I think the width of the aging counter should match the number of clock ticks between resets to zero, because once all the aging bits are set it becomes hard to judge which page is older.
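The "move to tail" variant of Second-Chance from point 1 can be sketched as below. This is my own simplified model (a deque of (page, referenced) pairs, with deque order standing in for load time rather than a real clock):

```python
from collections import deque

def second_chance(pages, frames):
    """Second-Chance page replacement, 'move to tail' variant:
    on a fault with memory full, pages whose referenced bit is set are
    rotated to the tail with the bit cleared (their 'load time' renewed);
    the first unreferenced page found is evicted.
    Returns the number of page faults for the reference string."""
    memory = deque()                        # (page, referenced), oldest on the left
    faults = 0
    for p in pages:
        for i, (q, _) in enumerate(memory):
            if q == p:                      # hit: set the referenced bit
                memory[i] = (q, True)
                break
        else:                               # page fault
            faults += 1
            if len(memory) == frames:
                while memory[0][1]:         # referenced: give it a second chance
                    q, _ = memory.popleft()
                    memory.append((q, False))
                memory.popleft()            # evict the oldest unreferenced page
            memory.append((p, False))
    return faults
```

If every resident page has its bit set, the loop clears them all and the algorithm degenerates to FIFO, as expected.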

11/23/2011

Roadmap

Topic modeling essentially means: given each topic's distribution over the words of the documents, find the topic distribution of the whole document (the so-called Dirichlet distribution, a distribution over distributions). "Latent" means that the distribution of topics over a document is hidden, while the topics' distributions over words are visible, so the hidden topic distribution of a single document can be derived through this equation.
A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

So for news recommendation, what we really do is train on the articles I have read (or that my friends have read) together with the articles in the library, and then, based on the topic distribution of the articles I have read, find the articles in the library whose topic distributions are most strongly related and recommend them back. (An alternative is not to look for the most related articles, but for the ones with the highest probability under the topics that occur across everything I have read; this is worth studying carefully later.)

To achieve this we can break the work down into:
1. Train on the NYTimes data with an existing LDA package.
2. Combine it with a classifier such as MaxEnt, HMM, or CRF; this needs some coding to connect the two existing frameworks, then tuning of the curve, learning rate, and so on.
3. If there is time, try a faster training method: run the training with online LDA and see how the results compare.
4. Set up the recommendation platform; it doesn't have to be complicated, just a proof of concept.
5. Still more time? Try other topic models.
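Once training produces a topic distribution per document, the "find the most related articles" step can be sketched as below. Cosine similarity is just one possible measure (Hellinger distance or KL divergence over the distributions would be equally reasonable), and all names here are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_theta, corpus_thetas, top_n=3):
    """Rank library articles by how similar their topic distributions are
    to the user's (e.g. averaged) topic distribution; return top indices."""
    ranked = sorted(range(len(corpus_thetas)),
                    key=lambda i: cosine(user_theta, corpus_thetas[i]),
                    reverse=True)
    return ranked[:top_n]

# toy example: user mostly reads topic-0 articles
user = [0.9, 0.1, 0.0]
corpus = [[0.1, 0.8, 0.1], [0.85, 0.1, 0.05], [0.3, 0.3, 0.4]]
print(recommend(user, corpus, top_n=1))  # article 1, the topic-0-heavy one
```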

11/22/2011

NP-Hard

http://bbs.ednchina.com/BLOG_ARTICLE_24599.HTM — the classic NP-hard problem: the traveling salesman problem. http://wenku.baidu.com/view/9c03d71052d380eb62946d6e.html

Finally decided to do news recommendation. Use NYTimes as the corpus: first run plain LDA, then add a language model such as an HMM or a CRF, and finally make it online.

11/21/2011

Beta Distribution & How to write a paper

Three Sins of Authors in Computer Science and Math

The beta distribution is a special case of the Dirichlet distribution; it is the posterior distribution in Bayesian statistics.
That is, its probability density function returns the belief that the probabilities of K rival events are x_i, given that each event has been observed α_i − 1 times.
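A tiny worked example of this conjugacy in the one-dimensional (beta) case, using a coin: a Beta(a, b) prior on the coin's bias, updated with binomial data, stays a beta distribution with the observation counts simply added to the parameters.

```python
def beta_update(a, b, heads, tails):
    """Conjugate update: Beta(a, b) prior + Binomial(heads, tails) data
    gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# uniform prior Beta(1, 1), then observe 7 heads and 3 tails
a, b = beta_update(1, 1, 7, 3)
print(a, b)                 # posterior is Beta(8, 4)
print(beta_mean(a, b))      # posterior mean 8/12 ≈ 0.667
```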

http://www.cs.cmu.edu/~jrs/sins.html