12/18/2011

Topic modeling for recommending books on Douban


Seeking book recommendations is a common need, given the sheer number of books and the variety of their themes and topics. An online recommender system would greatly benefit websites that hold large numbers of book reviews and user interactions. In this project, we used several recommendation models, including a baseline model, latent Dirichlet allocation (LDA), matrix factorization (MF), and collaborative topic regression (CTR), to recommend books to users of Douban, a Chinese online community for books, movies, and music. We crawled and studied a subset of data from Douban, tuned the parameters of each model, and used the optimized parameters to compare results. The results showed that the CTR model yields a more effective recommender system than the other models.
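
For reference, CTR (Wang and Blei, 2011) couples the two components: each item's latent vector is its LDA topic proportions plus a per-item offset, and a rating is predicted as the inner product with the user's latent vector from matrix factorization:

\hat{r}_{ij} = u_i^\top v_j, \qquad v_j = \theta_j + \epsilon_j, \qquad \epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K)

where \theta_j is item j's topic proportion vector from LDA and u_i is user i's latent vector.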



The final product is available at the following link:

12/13/2011

LDA Collocation modeling


Most of the popular topic models (such as latent Dirichlet allocation) share an underlying bag-of-words assumption. However, text is really a sequence of discrete word tokens, and without considering the order of words (in other words, the nearby context in which a word occurs), the precise meaning of language cannot be captured by word co-occurrences alone.

http://www.cs.umass.edu/~mccallum/papers/tng-tr05.pdf
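
As a toy illustration of what the bag-of-words assumption throws away, here is a short C sketch (the sentence and tokenization are made up for illustration) that prints the unigrams and the adjacent bigrams of a tokenized sentence; a collocation like "white house" survives as a unit only in the bigram view.

#include <stdio.h>

int main(void)
{
    /* A toy tokenized sentence; the bag-of-words view keeps only
       the individual tokens and their counts. */
    const char *tokens[] = {"the", "white", "house", "announced", "a", "plan"};
    int n = sizeof(tokens) / sizeof(tokens[0]);
    int i;

    printf("unigrams (bag of words):\n");
    for (i = 0; i < n; i++)
        printf("  %s\n", tokens[i]);

    /* Adjacent bigrams preserve local word order, so "white house"
       remains one unit instead of two unrelated tokens. */
    printf("bigrams (order-aware):\n");
    for (i = 0; i + 1 < n; i++)
        printf("  %s %s\n", tokens[i], tokens[i + 1]);

    return 0;
}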

12/12/2011

Applying Collaborative Topic Regression to Chinese Articles

Recently I began crawling data from Douban, a recommendation website for music, movies, and books. It is a useful and powerful site where millions of users rate, comment on, and tag all of the items.

I think it is an ideal dataset for training an LDA model, because the tags and categories are all well organized by the website, which makes it easy to compute recall and to inspect topics. Getting the data is also quite easy through the site's API. I will post the results in a few days.

Also, multi-word treatment for LDA is needed, especially on a Chinese corpus, where the text carries no whitespace between words. So I plan to compare word-segmented training against character-based training.
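
A minimal C sketch of the difference, assuming UTF-8 input (the example string and its segmentation are made up; a real pipeline would use a proper Chinese word segmenter): character-based training treats every code point as a token, while word-segmented training receives whole words as tokens.

#include <stdio.h>

/* Return the byte length of the UTF-8 character starting at s. */
static int utf8_len(const unsigned char *s)
{
    if (*s < 0x80)        return 1;  /* ASCII               */
    if ((*s >> 5) == 0x6) return 2;  /* 110xxxxx            */
    if ((*s >> 4) == 0xE) return 3;  /* 1110xxxx (most CJK) */
    return 4;                        /* 11110xxx            */
}

int main(void)
{
    /* "推荐系统" ("recommender system"): one word, four characters. */
    const char *text = "推荐系统";
    const unsigned char *p = (const unsigned char *)text;

    /* Character-based tokenization: four separate tokens. */
    printf("character tokens:\n");
    while (*p) {
        int len = utf8_len(p);
        printf("  %.*s\n", len, (const char *)p);
        p += len;
    }

    /* Word-segmented tokenization: one token, supplied by an
       external segmenter. */
    printf("word token:\n  %s\n", text);
    return 0;
}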

Installing GSL on Ubuntu

Installing GSL on Ubuntu and figuring out how the linker works is a real headache. The key is to keep straight which directories hold the include (header) files and which hold the library files.

First, after downloading GSL from the web, go to the source directory and run ./configure; when it finishes, run make.


Second, since the project has a Makefile that needs GSL, the include directory (passed with -I) will be the root of the GSL source tree, such as
/home/shiwenling/Desktop/gsl-1.15
and the library directory (passed with -L) should be where the built GSL libraries live, such as
/home/shiwenling/Desktop/gsl-1.15/libs
If there is a "cannot find" error at compile time, you probably haven't set the include or library path correctly. Also, if there is an error saying some linked file cannot be found, check whether the required files are under /usr/lib.
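
As a quick sanity check that the paths are set correctly, here is a minimal test program adapted from the example in the GSL manual, with the compile command in the comment; adjust the -I and -L paths to wherever your GSL tree actually lives.

/* gsl_test.c -- build with:
 *   gcc -I/home/shiwenling/Desktop/gsl-1.15 gsl_test.c \
 *       -L/home/shiwenling/Desktop/gsl-1.15/libs -lgsl -lgslcblas -lm
 */
#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>

int main(void)
{
    double x = 5.0;
    /* Bessel function of the first kind; J0(5) is about -0.1776. */
    double y = gsl_sf_bessel_J0(x);
    printf("J0(%g) = %.18e\n", x, y);
    return 0;
}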

Setting up the environment on Ubuntu is really uncomfortable, whereas on Windows, Cygwin solves the problem easily: you just install the GSL package from the installer, without typing any commands, and it is much easier to get things running.

12/07/2011

The internet industry in the Chinese market

I think internet companies such as Tencent, Taobao, and Baidu will thrive in the coming decades, as they are gradually developing their own core technologies and already have extremely large user bases.

For example, Taobao's cloud computing is really efficient: the data volume flowing through it is much larger than that of Amazon or other online shopping companies, yet the system works extremely well. Its recommendation component is also becoming more and more precise and useful.

What's more, I don't know whether this phenomenon is a bubble or not, but funding keeps flowing into internet companies and the number of startups is rocketing. When a new idea appears, if you don't build it within a very short period, a competitor's product will already be online.

So the internet industry will be a great choice for computer science students with the ambition to make an impact on this field.


I highly recommend the Stanford Topic Modeling Toolbox. It is really well designed and easy to use.

12/04/2011

Correlated Topic Model

There is a limitation of LDA: it fails to directly model correlation between topics, because it assumes that the presence of one topic is unrelated to the presence of any other, which rarely holds in reality (Blei and Lafferty, "A Correlated Topic Model of Science"). On the other hand, it is exactly this assumption that lets us exploit Dirichlet conjugacy for an efficient algorithm.
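
To spell out the conjugacy: if the topic proportions \theta \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K) and the topic assignments drawn from \theta produce counts n_1, \ldots, n_K, the posterior is again Dirichlet and has a closed form:

p(\theta \mid n, \alpha) = \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)

The CTM replaces the Dirichlet with a logistic-normal prior, whose covariance matrix captures topic correlations, but it gives up this conjugacy and therefore needs more involved inference.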

Also, I began to read about collaborative topic modeling for news recommendation. I will introduce collaborative topic regression (CTR) later.

12/02/2011

Performance of LDA

Has anyone evaluated the performance of LDA against other models, such as PLSI and naive Bayes?

I am thinking about using the topic models to produce features, feeding those features into a classifier, and comparing the resulting accuracies.
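
As a minimal sketch of that pipeline in C (the topic-proportion vectors and labels below are made up, standing in for real LDA output): each document's K topic proportions serve as its feature vector, here fed into a simple nearest-centroid classifier.

#include <stdio.h>

#define K 3        /* number of topics = feature dimension */
#define NTRAIN 4
#define NCLASS 2

int main(void)
{
    /* Hypothetical per-document topic proportions from LDA, with labels. */
    double train[NTRAIN][K] = {
        {0.8, 0.1, 0.1}, {0.7, 0.2, 0.1},   /* class 0 */
        {0.1, 0.1, 0.8}, {0.2, 0.1, 0.7}    /* class 1 */
    };
    int label[NTRAIN] = {0, 0, 1, 1};
    double test[K] = {0.15, 0.1, 0.75};     /* document to classify */

    /* Compute per-class centroids of the topic-proportion features. */
    double centroid[NCLASS][K] = {{0}};
    int count[NCLASS] = {0};
    for (int i = 0; i < NTRAIN; i++) {
        count[label[i]]++;
        for (int k = 0; k < K; k++)
            centroid[label[i]][k] += train[i][k];
    }
    for (int c = 0; c < NCLASS; c++)
        for (int k = 0; k < K; k++)
            centroid[c][k] /= count[c];

    /* Assign the test document to the nearest centroid (squared L2). */
    int best = 0;
    double bestd = 1e30;
    for (int c = 0; c < NCLASS; c++) {
        double d = 0.0;
        for (int k = 0; k < K; k++) {
            double diff = test[k] - centroid[c][k];
            d += diff * diff;
        }
        if (d < bestd) { bestd = d; best = c; }
    }
    printf("predicted class: %d\n", best);
    return 0;
}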

12/01/2011

Multinomial Naive Bayes

This book from Stanford is really great material for learning the naive Bayes model. I should have found it earlier:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
Material on probabilistic inference:
http://www.cs.cmu.edu/~lewicki/cp-s08/Bayesian-inference.pdf
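
To summarize the rule that chapter derives: multinomial naive Bayes assigns a document d the class that maximizes the class prior times the product of the per-token class-conditional probabilities,

c_{\mathrm{map}} = \arg\max_{c} \; P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)

where t_1, \ldots, t_{n_d} are the tokens of d and the P(t_k \mid c) are estimated from counts with add-one (Laplace) smoothing.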

There is an interesting meetup tonight on recommendation using topic modeling. Unfortunately, I just don't have time to go. Another newly released paper is going to be presented! I really want to go.
http://www.cs.princeton.edu/~chongw/papers/WangBlei2011.pdf


Properties of the multinomial distribution

Multinomial
Parameters: n > 0, the number of trials (integer); event probabilities p_1, \ldots, p_k with \sum_i p_i = 1
Support: X_i \in \{0, \dots, n\}, \; \sum_i X_i = n
pmf: \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}
Mean: \mathrm{E}[X_i] = n p_i
Variance: \mathrm{Var}(X_i) = n p_i (1 - p_i), \quad \mathrm{Cov}(X_i, X_j) = -n p_i p_j \; (i \neq j)
mgf: \left( \sum_{i=1}^k p_i e^{t_i} \right)^n
pgf: \left( \sum_{i=1}^k p_i z_i \right)^n \text{ for } (z_1, \ldots, z_k) \in \mathbb{C}^k

Ten years ago, I was longing to travel far from home.
Ten years later, I am eager to go back home.