11/30/2011

Why Gibbs Sampling?

Finally figured out why LDA uses Gibbs sampling. When computing the posterior distribution, the denominator p(w_{1:D}) requires summing the joint distribution over every possible instantiation of the hidden topic structure, which is exponentially large (David Blei et al., "A Correlated Topic Model of Science"). So the posterior has to be approximated, either with sampling-based algorithms or with variational algorithms.
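For reference, the intractable posterior (in the notation of Blei's LDA papers, where the topics β, topic proportions θ, and topic assignments z are the hidden structure):

p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \dfrac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}

The numerator is easy to evaluate for any one setting of the hidden variables; it is the denominator, the marginal p(w_{1:D}), that blows up.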

There is an article for beginners:
"Gibbs Sampling for the Uninitiated," by P. Resnik and E. Hardisty.

One fancifully imagines a Chinese restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 is seated at an unoccupied table with probability 1. At time n + 1, a new customer chooses uniformly at random to sit at one of the following n + 1 places: directly to the left of one of the n customers already sitting at an occupied table, or at a new, unoccupied circular table. Clearly, each table corresponds to a block of a random partition. It is then immediate that the probability assigned to any particular partition (ignoring the order in which customers sit around any particular table) is

\Pr(B_n = B) = \dfrac{\prod_{b\in B} (|b| -1)!}{n!}
where b is a block in the partition B and |b| is the size (i.e. number of elements) of b. The restaurant analogy is due to Pitman and Dubins, and first appears in print in [1].
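A toy simulation of that seating rule (not from the original notes, just an illustrative sketch): customer n + 1 joins a table of size n_b with probability n_b/(n + 1), or starts a new table with probability 1/(n + 1).

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CrpSketch {
    public static void main(String[] args) {
        Random rng = new Random(7);
        List<Integer> tableSizes = new ArrayList<>();
        int customers = 20;
        for (int n = 0; n < customers; n++) {
            // n customers are already seated; draw uniformly among them plus one new table.
            int seat = rng.nextInt(n + 1);
            int cumulative = 0;
            boolean seated = false;
            for (int t = 0; t < tableSizes.size(); t++) {
                cumulative += tableSizes.get(t);
                if (seat < cumulative) {               // joins table t (prob. |table t| / (n + 1))
                    tableSizes.set(t, tableSizes.get(t) + 1);
                    seated = true;
                    break;
                }
            }
            if (!seated) {
                tableSizes.add(1);                     // seat == n: start a new table (prob. 1 / (n + 1))
            }
        }
        System.out.println("Table sizes: " + tableSizes);
    }
}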

11/29/2011

LDA Implementation with R Code

Finally got the code running. The result is good and reasonable.
http://www.r-bloggers.com/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels/

Amazon Interview

Today I had the third interview with Costin from Amazon. It was a mess, since the sound over the speakerphone was so muffled that I could hardly hear what he said. Anyway, he asked me just two questions in total, and one of them was: can you explain polymorphism, and what are the other two important concepts in OO?

The other was: sum up all the prime numbers from 1 to 100.
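A quick sketch of one obvious answer (the interview didn't tie it to a language; Java here just for illustration):

// Sums the primes in [2, 100] by trial division; prints 1060.
public class PrimeSum {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 2; i <= 100; i++) {
            if (isPrime(i)) sum += i;
        }
        System.out.println(sum);   // 1060
    }
}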

Now I'm beginning to read about probabilistic latent semantic analysis (PLSA). I think this is an old but useful topic model, so it is worth implementing for news recommendation. Also, I asked Nannan to read "Recommendation Algorithm based on LDA Model". Finally we split up the work.

btw, the new headset is wonderful!! It doesn't tire me out at all!! Sennheiser HD202, cheap and wonderful sound :D


11/26/2011

Paging Algorithms

Today I implemented three paging algorithms: FIFO, Second-Chance(SC), and LRU.

There are a few different, heuristic choices to make when implementing SC and LRU.

1. In SC, if the oldest page turns out to have been referenced, it can either be moved to the tail of the memory list or stay where it is. In both variants we have to clear the oldest page's referenced bit, set its load time to the current time, and go on searching for the next oldest page.

2. For LRU (NFU with aging as the software simulation), I think the width of the aging counter sets the number of clock ticks between resets to zero: once all the aging bits are set, it is hard to judge which page is older. (Both points are illustrated in the sketch below.)
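A minimal sketch of the two choices above (not my original implementation; the class and field names are made up): a Second-Chance victim search over a FIFO list, and the per-tick aging update used by NFU.

import java.util.ArrayDeque;
import java.util.Deque;

public class PagingSketch {
    static class Page {
        final int number;
        boolean referenced;   // R bit, set on every access
        long loadTime;
        int agingCounter;     // 8-bit counter for NFU with aging
        Page(int number, long loadTime) { this.number = number; this.loadTime = loadTime; }
    }

    // Second-Chance: inspect the oldest page; if its R bit is set, give it a second
    // chance (clear R, refresh the load time, move it to the tail) and keep looking.
    // Assumes the list is non-empty; terminates because R bits get cleared as we go.
    static Page selectVictimSecondChance(Deque<Page> fifo, long now) {
        while (true) {
            Page oldest = fifo.pollFirst();
            if (!oldest.referenced) {
                return oldest;            // old and unreferenced: evict it
            }
            oldest.referenced = false;    // clear the referenced bit
            oldest.loadTime = now;        // treat it as if it were just loaded
            fifo.addLast(oldest);         // move to the tail, try the next oldest
        }
    }

    // NFU with aging: each clock tick, shift every counter right and add the R bit
    // at the most significant position. With 8-bit counters, two pages become
    // indistinguishable once both counters saturate at 0xFF, which is why the
    // counter width bounds how many ticks of history can be told apart.
    static void agingTick(Iterable<Page> pages) {
        for (Page p : pages) {
            p.agingCounter = ((p.agingCounter >>> 1) | (p.referenced ? 0x80 : 0x00)) & 0xFF;
            p.referenced = false;
        }
    }

    public static void main(String[] args) {
        Deque<Page> fifo = new ArrayDeque<>();
        for (int i = 0; i < 4; i++) fifo.addLast(new Page(i, i));
        agingTick(fifo);                      // one aging tick for all pages
        fifo.peekFirst().referenced = true;   // then the oldest page gets referenced again
        Page victim = selectVictimSecondChance(fifo, 100);
        System.out.println("evict page " + victim.number);  // prints 1: page 0 got a second chance
    }
}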

11/23/2011

Roadmap

Topic modeling, in essence, starts from each topic's distribution over the words of the documents and infers the topic distribution of a whole document (the so-called Dirichlet distribution, a distribution over distributions). "Latent" means that the distribution of topics over a document is hidden, while the topics' distributions over words are observable, so the hidden topic distribution of a single document can be worked out from that formula.
A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

So to do news recommendation, what we actually do is train on the articles I have read (or articles my friends have read) together with the articles in the library, and then, based on the topic distribution of the articles I have read, find the articles in the library whose topic distributions are most closely correlated and recommend those back (another option is not to find the most related articles but the articles with the highest probability under the topics that appear most in what I have read; this is worth studying carefully later on).
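A toy sketch of that "most related topic distribution" step (cosine similarity is just one possible choice; Hellinger distance or KL divergence would also work, and the topic vectors here are made up):

import java.util.Map;

public class TopicMatchSketch {
    // Cosine similarity between two topic-proportion vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Made-up 3-topic mixtures: what the user has read vs. two library articles.
        double[] userProfile = {0.6, 0.1, 0.3};
        Map<String, double[]> library = Map.of(
                "article-A", new double[]{0.5, 0.2, 0.3},
                "article-B", new double[]{0.1, 0.8, 0.1});

        // Rank library articles by similarity to the user's topic distribution.
        library.entrySet().stream()
               .sorted((x, y) -> Double.compare(cosine(userProfile, y.getValue()),
                                                cosine(userProfile, x.getValue())))
               .forEach(e -> System.out.println(e.getKey() + "  "
                                                + cosine(userProfile, e.getValue())));
    }
}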

To reach this goal, we can split the work into:
1. Use an existing LDA package to train on the NYTimes data.
2. Combine it with a classifier such as MaxEnt, HMM, or CRF; this needs some coding to connect the two existing frameworks, and then tuning of the learning curve, learning rate, and so on.
3. If there is time, try a faster training method: train with online LDA and see how the results compare.
4. Set up the recommendation platform; it doesn't need to be complicated, just a proof of concept.
5. Still more time? Try other topic models.

11/22/2011

NP-Hard

http://bbs.ednchina.com/BLOG_ARTICLE_24599.HTM
A classic NP-hard problem: the traveling salesman problem
http://wenku.baidu.com/view/9c03d71052d380eb62946d6e.html

Finally decided to do news recommendation: take NYTimes as the corpus, first run plain LDA, then add a language model such as HMM or CRF, and finally make it online.

11/21/2011

Beta Distribution & How to write a paper

Three Sins of Authors in Computer Science and Math

The beta distribution is the special case of the Dirichlet distribution with K = 2; in Bayesian statistics it appears as the posterior distribution.
That is, the Dirichlet's probability density function returns the belief that the probabilities of K rival events are x_i, given that each event has been observed α_i − 1 times.
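For reference, that density (the standard Dirichlet pdf on the simplex x_1 + ... + x_K = 1, x_i ≥ 0):

f(x_1, \dots, x_K; \alpha_1, \dots, \alpha_K) = \dfrac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad B(\alpha) = \dfrac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}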




http://www.cs.cmu.edu/~jrs/sins.html

11/20/2011

A Wasted Day

This morning, after finishing the report for Dennis, I ran off to Flushing to look at suits. What a stupid move; there are hardly any places selling suits in Flushing. Only after coming back did I start learning Python. I have to stop by Express tomorrow.

11/19/2011

Some readings about LDA and DD

Forgetting to change the flag value is pretty disastrous: either you get an infinite loop, or the loop only ever does its first pass. From now on, before every break I have to double-check whether my flag value has been initialized again!!

So embarrassing... I can't even bear to look at the code I wrote myself. Every time Bing finishes revising it, I feel nothing but envy and gloom. Why does other people's code look so much nicer?
Using a HashMap as database tables:
the structure Map<Integer, Map<Edgetab, List<Quadruple>>> serves as 3 tables in one go: key = species name, value = a map whose key is the edgetab name and whose value is the data of the 4 species. Why was my original version so sad...
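A minimal sketch of that layout (Edgetab and Quadruple are placeholder types here, since their real definitions are not shown in the post):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedMapSketch {
    // Placeholder types standing in for the real Edgetab and Quadruple.
    enum Edgetab { EDGE_A, EDGE_B }
    static class Quadruple {
        final String a, b, c, d;           // data for the 4 species
        Quadruple(String a, String b, String c, String d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }
        @Override public String toString() { return a + "," + b + "," + c + "," + d; }
    }

    public static void main(String[] args) {
        // One structure standing in for three tables:
        // species id -> (edgetab name -> rows of data for the 4 species)
        Map<Integer, Map<Edgetab, List<Quadruple>>> db = new HashMap<>();

        // Insert a row, creating the intermediate map and list on demand.
        db.computeIfAbsent(42, k -> new HashMap<>())
          .computeIfAbsent(Edgetab.EDGE_A, k -> new ArrayList<>())
          .add(new Quadruple("g1", "g2", "g3", "g4"));

        // Look up all rows for species 42 under EDGE_A.
        System.out.println(db.get(42).get(Edgetab.EDGE_A));
    }
}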


Lecture 24: Dirichlet distribution and Dirichlet Process
http://www.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture24.pdf

non-parametric statistics
http://en.wikipedia.org/wiki/Non-parametric_statistics

dirichlet distribution
http://en.wikipedia.org/wiki/Dirichlet_distribution
http://blog.csdn.net/sweetrryy/article/details/6437908

Stupid Mistakes Again

Today I made all kinds of stupid mistakes again. For example, I swapped two values when passing them as arguments and spent the whole afternoon debugging; then a function that should have stored the target stored the source instead, and I spent the whole evening debugging that. Even Bing was speechless... When will I stop making these silly mistakes?

I also got reacquainted with BLAST: the smaller the BLAST score (e.g. an E-value < 1e-4), the closer the match. So when using the various BLAST scores we should apply heuristic strategies such as maxofmax and minofmin to train the SGD, which gives better results. Of course, the professor also said that for SGD's eta, many of the current tuning methods don't necessarily improve the results much during training; it is like digging for gold, and most of the time you dig for ages and come up with nothing. Next semester I have to do an independent study... bleak.

I'm still completely unfamiliar with Linux commands and have to look them up every time.
For example, today I wanted to view a particular line of a file: sed -n 'np' XXXfile prints line n (note that cat XXXfile | grep pattern filters by matching text, not by line number).
And to delete a line: sed -i '1d' XXXfile, where '1d' means the first line.

11/17/2011

First Post

Created a blog. Starting today I must keep a proper record of what I do every day.

After a few days of heads-down work on finding the related species' genes, today I need to integrate the code into SGD (stochastic gradient descent), so I finally went back over some of the tricky parts and read a few papers. Several points deserve real attention: for example how to set eta, the fact that the candidate data has to be shuffled before each iteration, and also how to pick better candidates and how to calculate accuracy, which are quite important questions. (A small sketch of the first two points follows below.)
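A toy sketch of those two habits (shuffle the examples before each pass, and decay eta over time); the data and the squared-loss objective are made up purely for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SgdSketch {
    public static void main(String[] args) {
        // Tiny synthetic dataset: y is roughly 2*x + 1 with a little noise.
        double[] xs = {0, 1, 2, 3, 4};
        double[] ys = {1.1, 2.9, 5.2, 6.8, 9.1};

        double w = 0.0, b = 0.0;
        double eta0 = 0.1;                 // initial learning rate
        Random rng = new Random(42);
        List<Integer> order = new ArrayList<>(List.of(0, 1, 2, 3, 4));

        for (int epoch = 0; epoch < 100; epoch++) {
            Collections.shuffle(order, rng);           // shuffle the examples each pass
            double eta = eta0 / (1.0 + 0.05 * epoch);  // a simple decaying eta schedule
            for (int i : order) {
                double pred = w * xs[i] + b;
                double err = pred - ys[i];             // gradient of 0.5*(pred - y)^2
                w -= eta * err * xs[i];
                b -= eta * err;
            }
        }
        System.out.printf("w=%.3f b=%.3f%n", w, b);    // should land near w=2, b=1
    }
}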

That's it for today; I still have to read about LDA... I hope the posts to come will be better written.

http://scikit-learn.org/stable/modules/sgd.html
http://mli7.wordpress.com/2011/04/08/online_learning_2/
http://leon.bottou.org/publications/pdf/mlss-2003.pdf
http://leon.bottou.org/publications/pdf/compstat-2010.pdf