12/18/2011

Topic modeling for recommending books on Douban


Seeking book recommendations is a common need, given the huge number of books and their varied themes and topics. An online recommender system would greatly benefit websites that hold large numbers of book reviews and user interactions. In this project, we used several recommendation models, including a baseline model, latent Dirichlet allocation (LDA), matrix factorization (MF), and collaborative topic regression (CTR), to recommend books to users of Douban, a Chinese online community for books, movies, and music. We crawled and studied a subset of data from Douban, tuned each model's parameters, and used the optimized parameters to compare results. The results show that the CTR model yields a more effective recommender system than the other models.



The final product is available at the following link:

12/13/2011

LDA Collocation modeling


Most popular topic models (such as latent Dirichlet allocation) share an underlying assumption: the bag of words. However, text is really a sequence of discrete word tokens, and without considering the order of words (in other words, the nearby context in which a word occurs), the precise meaning of language cannot be captured by word co-occurrences alone.

http://www.cs.umass.edu/~mccallum/papers/tng-tr05.pdf
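
As a quick illustration of the signal that bag-of-words models throw away, here is a toy collocation finder that ranks adjacent word pairs by pointwise mutual information. This is a simple heuristic, not the topical n-gram model from the paper above; the toy corpus and all names are made up.

import math
from collections import Counter

def top_collocations(tokens, k=5, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information."""
    uni = Counter(tokens)                       # unigram counts
    bi = Counter(zip(tokens, tokens[1:]))       # adjacent bigram counts
    N = len(tokens)
    scores = {}
    for (a, b), c in bi.items():
        if c >= min_count:
            # PMI = log P(a,b) / (P(a) P(b)), with N as the shared normalizer
            scores[(a, b)] = math.log(c * N / (uni[a] * uni[b]))
    return sorted(scores, key=scores.get, reverse=True)[:k]

tokens = "new york is big and new york is busy".split()
print(top_collocations(tokens))  # ('new', 'york') ranks at the top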

12/12/2011

Applying Collaborative Topic Regression to Chinese Articles

Recently I began crawling data from Douban, a recommendation website for music, movies, and books. It is a useful and powerful site, with millions of users rating, reviewing, and tagging all of these items.

I think it is a perfect dataset for training an LDA model, because everything is well organized by the website, so we can easily measure recall and inspect topics. Also, fetching the data is quite easy using its API. I will post the results in a few days.

Also, multi-word treatment for LDA is needed, especially on a Chinese corpus, so I think word segmentation should be compared against character-based training.
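
To give a sense of what that comparison involves, here is a minimal sketch of greedy forward maximum matching against a toy vocabulary, next to the character-based baseline. The vocabulary and strings are made up; a real experiment would use a proper segmenter and dictionary.

def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching against a known-word vocabulary."""
    out, i = [], 0
    while i < len(text):
        # try the longest candidate first, falling back to a single character
        for L in range(min(max_len, len(text) - i), 0, -1):
            w = text[i:i + L]
            if L == 1 or w in vocab:
                out.append(w)
                i += L
                break
    return out

vocab = {"推荐", "系统", "豆瓣"}                    # toy vocabulary
print(forward_max_match("豆瓣推荐系统", vocab))     # word-based: ['豆瓣', '推荐', '系统']
print(list("豆瓣推荐系统"))                          # character-based baseline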

Installing GSL on Ubuntu

It's a real headache installing GSL on Ubuntu and working out how the linker operates. The key is to be clear about which are the include files and which are the library files.

Firstly, after downloading GSL from the web, go to the source directory and type ./configure; after it finishes, type make.


Secondly, since the project has a Makefile that must include GSL, the include directory should be the root of the GSL tree, such as
/home/shiwenling/Desktop/gsl-1.15
and the library directory should be where the GSL libs live, such as
/home/shiwenling/Desktop/gsl-1.15/libs
You also have to link the libraries themselves with -lgsl -lgslcblas -lm. If the linker reports a "cannot find" error, you probably haven't set the include (-I) or library (-L) path right. And if it complains at run time that some shared library cannot be found, check whether the required files are under /usr/lib.

It's really uncomfortable setting up the environment on Ubuntu, whereas on Windows, Cygwin solves the problem easily: you just install the GSL package without typing any commands. It's also much easier to work with.

12/07/2011

The internet industry in the Chinese market

I think internet companies such as Tencent, Taobao, and Baidu will thrive in the coming decades, as they are gradually building their own core technologies and have extremely large user bases.

For example, Taobao's cloud computing is really efficient: even though its data volume is much larger than Amazon's or other online retailers', the system works extremely well. Its recommendations are also becoming more and more precise and useful.

What's more, I don't know whether this phenomenon is a bubble or not, but funding keeps pouring into internet companies, and the number of startups is rocketing. When a new idea appears, if you don't realize it within a very short period, a competitor's product will be online first.

So working in the internet industry will be a great choice for computer science students who have the ambition to make an impact in this realm.


I highly recommend the Stanford Topic Modeling Toolbox. It is really well designed and easy to use.

12/04/2011

Correlated Topic Model

There is a limitation of LDA: it fails to directly model correlation between topics, because it assumes that the presence of one topic is unrelated to the presence of others, which rarely holds in reality (Blei et al., A Correlated Topic Model of Science). On the other hand, this very assumption lets us exploit the conjugacy of the Dirichlet prior to make the algorithm efficient.
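
To spell out the conjugacy being referred to (a standard result, not specific to the paper): a Dirichlet prior combined with multinomial counts yields a Dirichlet posterior with the observed counts simply added to the parameters.

\theta \sim \operatorname{Dir}(\alpha_1,\dots,\alpha_K),\qquad (n_1,\dots,n_K) \mid \theta \sim \operatorname{Mult}(n,\theta) \;\Longrightarrow\; \theta \mid n_{1:K} \sim \operatorname{Dir}(\alpha_1+n_1,\dots,\alpha_K+n_K)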

Also, I began to read about collaborative topic modeling for news recommendation. I will introduce collaborative topic regression (CTR) later.

12/02/2011

Performance of LDA

Has anyone evaluated the performance of LDA against other topic models, such as PLSI, or against naive Bayes?

I am thinking about using topic models to produce features, feeding them into a classifier, and comparing the resulting accuracies.
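
A minimal sketch of that experiment with today's scikit-learn (the toy documents and labels are made up; a real comparison would use a proper corpus and cross-validation):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy corpus: sports (0) vs finance (1)
docs = ["the game was great fun", "our team won the match",
        "stocks fell sharply today", "the market rallied on earnings"]
labels = [0, 0, 1, 1]

# bag of words -> per-document topic proportions -> linear classifier
pipe = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=4, random_state=0),
    LogisticRegression(max_iter=1000),
)
pipe.fit(docs, labels)
print(pipe.predict(["the team played a great match"]))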

12/01/2011

Multinomial Naive Bayes

This chapter from the Stanford IR book is really great material for learning the naive Bayes model. I should have found it earlier:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
Material on probabilistic inference:
http://www.cs.cmu.edu/~lewicki/cp-s08/Bayesian-inference.pdf

There is an interesting meetup tonight on recommendation using topic modeling. Unfortunately I just don't have time to go, and a newly released paper is going to be presented!!! I really want to go.
http://www.cs.princeton.edu/~chongw/papers/WangBlei2011.pdf


The multinomial distribution

parameters: n > 0, number of trials (integer); event probabilities p_1, \ldots, p_k with \sum_i p_i = 1
support: X_i \in \{0, \dots, n\}, \; \sum_i X_i = n
pmf: \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}
mean: \mathrm{E}[X_i] = n p_i
variance: \mathrm{Var}(X_i) = n p_i (1 - p_i), \quad \mathrm{Cov}(X_i, X_j) = -n p_i p_j \;\; (i \neq j)
mgf: \biggl( \sum_{i=1}^k p_i e^{t_i} \biggr)^n
pgf: \biggl( \sum_{i=1}^k p_i z_i \biggr)^n \text{ for } (z_1,\ldots,z_k) \in \mathbb{C}^k
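
A quick numerical sanity check of the mean and variance formulas with NumPy (the simulation parameters here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.2, 0.3, 0.5])
X = rng.multinomial(n, p, size=200_000)   # 200k draws of (X_1, X_2, X_3)
print(X.mean(axis=0))   # ~ n * p           = [20, 30, 50]
print(X.var(axis=0))    # ~ n * p * (1 - p) = [16, 21, 25]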

Ten years ago, I longed to travel far from home.
Ten years on, I am eager to go back home.

11/30/2011

Why Gibbs Sampling?

Finally figured out why LDA uses Gibbs sampling. When computing the posterior distribution, the denominator p(w_{1:D}) sums the joint distribution over every possible instantiation of the hidden topic structure, and the number of such instantiations is exponentially large (Blei et al., A Correlated Topic Model of Science). So the posterior has to be approximated, using either sampling-based algorithms or variational algorithms.

There is an article for beginners:
"Gibbs Sampling for the Uninitiated" by P. Resnik and E. Hardisty.
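
To make the idea concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA, following the standard collapsed update (counts with the current token removed, symmetric alpha and beta). It is a toy implementation of my own, not production code.

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA; docs is a list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K), dtype=np.int64)   # document-topic counts
    nkw = np.zeros((K, V), dtype=np.int64)   # topic-word counts
    nk = np.zeros(K, dtype=np.int64)         # tokens per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z = k | rest)  (nkw + beta)/(nk + V*beta) * (ndk + alpha)
                p = (nkw[:, w] + beta) / (nk + V * beta) * (ndk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())           # sample the new topic
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

docs = [[0, 1, 1, 2], [2, 3, 3, 0], [1, 1, 0, 2]]   # toy word ids, V = 4
ndk, nkw = gibbs_lda(docs, V=4, K=2, n_iter=100)
print(ndk)   # per-document topic counts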

One fancifully imagines a Chinese restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 is seated at an unoccupied table with probability 1. At time n + 1, a new customer chooses uniformly at random to sit at one of the following n + 1 places: directly to the left of one of the n customers already seated, or at a new, unoccupied table. Each table corresponds to a block of a random partition, and the probability assigned to any particular partition (ignoring the order in which customers sit around any particular table) is

\Pr(B_n = B) = \dfrac{\prod_{b\in B} (|b| -1)!}{n!}
where b is a block in the partition B and |b| is the size (i.e. number of elements) of b. The restaurant analogy is due to Pitman and Dubins, and first appears in print in [1].
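
The seat-choosing rule translates directly into a simulation: a new table is chosen with probability 1/(n+1) and an existing table with probability proportional to its size. A toy sketch of my own:

import numpy as np

def chinese_restaurant_process(n, seed=0):
    """Simulate table sizes after seating n customers by the CRP rule."""
    rng = np.random.default_rng(seed)
    tables = []                        # sizes of the occupied tables
    for i in range(n):                 # i customers are already seated
        r = rng.integers(i + 1)        # uniform over i seats + 1 new table
        if r == i:
            tables.append(1)           # start a new table, probability 1/(i+1)
        else:                          # join the table owning seat r
            c = 0
            for t, size in enumerate(tables):
                c += size
                if r < c:
                    tables[t] += 1
                    break
    return tables

print(chinese_restaurant_process(100))   # a few big tables, many small ones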

11/29/2011

LDA Implementation with R Code

Finally got the code running; the results are good and reasonable.
http://www.r-bloggers.com/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels/

Amazon Interview

Today I had my third interview, with Costin from Amazon. It was a mess, as the sound on the speakerphone was so muffled I could hardly hear what he said. In the end he asked me just two questions. One of them: can you explain polymorphism, and what are the other two important concepts in OO?

And the other: sum up all the prime numbers from 1 to 100.
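
For the record, a quick sieve answers it (the answer is 1060):

def sum_primes_up_to(n):
    """Sum all primes <= n with a sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return sum(i for i, is_prime in enumerate(sieve) if is_prime)

print(sum_primes_up_to(100))   # 1060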

Now I'm beginning to read about probabilistic latent semantic analysis (PLSA). I think it is an old but useful topic model, so it is worth implementing for news recommendation. Also, I asked Nannan to read "Recommendation Algorithm based on LDA Model", and we finally divided up the work.

btw, the new headset is wonderful!! It doesn't tire my ears at all!! Sennheiser HD202: cheap, and wonderful sound :D


11/26/2011

Paging algorithms

Today I implemented three paging algorithms: FIFO, Second-Chance (SC), and LRU.

There are a few divergent, heuristic choices to make when implementing SC and LRU; a toy SC simulation follows the list.

1. In SC, if the oldest page turns out to have been referenced, it can either be moved to the tail of the memory list or stay where it is. In both variants we must clear the oldest page's referenced bit, set its load time to the current time, and search on for the next oldest page.

2. In LRU (NFU for the software simulation), I think the width of the aging counter should be the number of clock ticks between resets to zero, because once all the aging bits are set, it is hard to judge which page is older.
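
A minimal sketch of the SC variant that recycles a referenced page to the tail of the queue (toy code with a made-up reference string; it only counts faults, with no load-time bookkeeping):

from collections import deque

def second_chance(refs, frames):
    """Count page faults under the Second-Chance policy (tail-recycling variant)."""
    queue = deque()      # pages in FIFO order, oldest at the left
    rbit = {}            # page -> referenced bit
    faults = 0
    for page in refs:
        if page in rbit:
            rbit[page] = True             # hit: set the referenced bit
            continue
        faults += 1                       # miss: need a frame
        if len(queue) == frames:
            while True:
                victim = queue.popleft()
                if rbit[victim]:          # referenced: clear bit, second chance
                    rbit[victim] = False
                    queue.append(victim)  # move to the tail of the list
                else:
                    del rbit[victim]      # evict
                    break
        queue.append(page)
        rbit[page] = False
    return faults

print(second_chance([1, 2, 3, 2, 4, 1, 5, 2], frames=3))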

11/23/2011

Roadmap

Topic modeling essentially means this: given each topic's distribution over an article's words, find the topic distribution of the whole article (the so-called Dirichlet distribution, a distribution over distributions). "Latent" means that a topic's distribution over articles is hidden, while its distribution over words is observable, so the hidden topic distribution of a single article can be inferred from this equation.
A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

So for news recommendation, what we actually do is train on the articles I have read (or that my friends have read) together with the articles in the library; then, based on the topic distribution of the articles I have read, find the library articles whose topic distributions correlate most strongly and recommend those back (alternatively, instead of the most correlated articles, we might pick articles from the topics that occur with the highest probability across everything I have read; this is worth studying carefully later on).

To achieve this, we can break the work into the following steps (a training sketch for step 1 follows the list):
1. Use an existing LDA package to train on the NYTimes data.
2. Combine it with a classifier such as MaxEnt, HMM, or CRF; this needs some coding to connect the two frameworks, then tuning of the learning curve, learning rate, and so on.
3. If there is time, try a faster training method: train with online LDA and see how the results compare.
4. Stand up the recommendation platform; it doesn't need to be complicated, just a proof of concept.
5. Still more time? Try other topic models.
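
For step 1, a minimal sketch using the gensim package (the tokenized documents below are placeholders; the real run would use the preprocessed NYTimes articles after stopword removal):

from gensim import corpora, models

# placeholder tokenized documents; swap in the preprocessed NYTimes corpus
texts = [["stocks", "market", "rally"],
         ["team", "season", "game"],
         ["market", "earnings", "stocks"]]

dictionary = corpora.Dictionary(texts)            # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

for topic in lda.print_topics():
    print(topic)                                  # top words per topic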

11/22/2011

NP-Hard

A classic NP-hard problem: the traveling salesman problem.
http://bbs.ednchina.com/BLOG_ARTICLE_24599.HTM
http://wenku.baidu.com/view/9c03d71052d380eb62946d6e.html

Finally decided to do news recommendation: take NYTimes as the corpus, first run plain LDA, then add a language model such as HMM or CRF, and in the end make it online.

11/21/2011

Beta Distribution & How to write a paper

Three sins of authors in cs and math

The beta distribution is a special case of the Dirichlet distribution, and it arises as the posterior distribution in Bayesian statistics. That is, the Dirichlet's probability density function returns the belief that the probabilities of K rival events are x_i, given that each event has been observed α_i − 1 times.
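
For reference, the Dirichlet density over the probability simplex, with the beta distribution recovered at K = 2 (standard formula):

p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\!\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K x_i^{\alpha_i - 1}, \qquad x_i \ge 0, \;\; \sum_{i=1}^K x_i = 1.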




http://www.cs.cmu.edu/~jrs/sins.html

11/20/2011

A wasted day

This morning, after finishing the report for dennis, I ran off to Flushing to look at suits. What a stupid move: Flushing hardly has any places that sell suits. Only after coming back did I start learning Python. Tomorrow I must swing by Express.

11/19/2011

Some readings about LDA and DD

Forgetting to update a flag's value is brutal: you either get an infinite loop or only ever execute the first pass. From now on, before every break I must double-check whether my flag has been re-initialized!!

Really embarrassing... I can't stand looking at my own code; every time Bing finishes fixing it, I feel nothing but envy and gloom. Why does other people's code read so comfortably?
Using a HashMap as database tables:
the structure Map<Integer, Map<Edgetab, List<Quadruple>>> serves as three tables in one go: key = species name, value = the inner map; inner key = edgetab name, inner value = the data of the 4 species. Why was my original version so dismal...
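
Roughly the same idea in Python, since nested dicts make the three-tables-in-one pattern explicit (the keys and tuple fields here are hypothetical):

# species id -> edge-table name -> list of 4-field records
db = {}
db.setdefault(9606, {}).setdefault("edgetab_a", []).append(("s1", "s2", "s3", "s4"))
print(db[9606]["edgetab_a"])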


Lecture 24: Dirichlet distribution and Dirichlet Process
http://www.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture24.pdf

non-parametric statistics
http://en.wikipedia.org/wiki/Non-parametric_statistics

dirichlet distribution
http://en.wikipedia.org/wiki/Dirichlet_distribution
http://blog.csdn.net/sweetrryy/article/details/6437908

Stupid mistakes, again

Made all kinds of stupid mistakes again today: I swapped two arguments when passing them in and debugged for a whole afternoon, then a function that should have stored target stored source instead, and I debugged for the whole evening. Even Bing was speechless. When will I stop making these idiotic mistakes?

Also got reacquainted with BLAST: the smaller the BLAST score (e.g., < E-4), the closer the match. So when using the various BLAST scores, we should apply heuristic strategies such as maxofmax and minofmin to train the SGD, which gives better results. Of course, the professor also said that for SGD's eta, many of the current tuning methods don't necessarily improve results much during training; it's gold digging, and most digs come up empty. Next semester I have to do an independent study; how bleak.

Still completely unfamiliar with Linux commands; I have to look them up every time.
For example, today I wanted to view the nth line of a file: sed -n '5p' XXXfile prints line 5 (my first instinct, cat XXXfile | grep ..., matches a pattern, not a line number).
And to delete a line: sed -i '1d' XXXfile, where 1d means the first line.

11/17/2011

First post

Created a blog. Starting today, I must properly write down what I do every day.

After several days of heads-down work on finding the related species genes, today the code has to be integrated into SGD (stochastic gradient descent). Only now did I carefully think back through the tricky parts and read a few papers. Several points deserve real attention: how eta is set, shuffling the candidate data before each iteration, and also how to select better candidates and calculate accuracy, which are quite important questions.
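
A minimal sketch of those two habits, per-epoch shuffling and a decaying eta, in the form of SGD for logistic regression (my own toy code; the schedule constants are arbitrary):

import numpy as np

def sgd_logistic(X, y, eta0=0.5, epochs=20, seed=0):
    """Toy SGD for logistic regression with per-epoch shuffling and eta decay."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # shuffle the candidates each epoch
            t += 1
            eta = eta0 / (1.0 + 0.01 * t)   # simple decaying schedule for eta
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += eta * (y[i] - p) * X[i]    # one-example gradient step
    return w

X = np.array([[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]])
y = np.array([1, 1, 0, 0])
print(sgd_logistic(X, y))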

That's it for today; I still have to read up on LDA. I hope future posts will be written more engagingly.

http://scikit-learn.org/stable/modules/sgd.html
http://mli7.wordpress.com/2011/04/08/online_learning_2/
http://leon.bottou.org/publications/pdf/mlss-2003.pdf
http://leon.bottou.org/publications/pdf/compstat-2010.pdf