12/12/2011

Applying Collaborative Topic Regression to Chinese Articles

Recently I began crawling data from Douban, a recommendation website for music, movies, and books. It is a useful and powerful site where millions of users rate, comment on, and tag books, movies, and other items.

I think it is a perfect dataset for training an LDA model, because recall can be measured and topics checked easily: the items are all well organized by the website. Getting the data is also quite easy through the site's API. I will post the results in a few days.

Also, LDA needs some treatment of multi-character words, especially on a Chinese corpus, where words are not separated by spaces. So I plan to compare training on segmented words with character-based training.
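To make the two options concrete, here is a minimal sketch of how the same sentence turns into LDA tokens under each scheme. The segmenter is a toy forward maximum matching routine over a tiny hand-made vocabulary. Both `TOY_VOCAB` and the function names are my own assumptions for illustration, not part of any library, and a real experiment would use a proper segmentation tool with a full dictionary.

```python
def char_tokens(text):
    """Character-based tokenization: every non-space character is one token."""
    return [ch for ch in text if not ch.isspace()]


def fmm_tokens(text, vocab, max_len=4):
    """Toy word segmentation by forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in vocab; if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini dictionary, for illustration only.
TOY_VOCAB = {"机器", "学习", "机器学习", "推荐", "系统", "推荐系统"}

sentence = "机器学习推荐系统"
print(char_tokens(sentence))            # eight single-character tokens
print(fmm_tokens(sentence, TOY_VOCAB))  # ['机器学习', '推荐系统']
```

With character tokens the LDA vocabulary stays small but each token carries little meaning by itself. With segmentation the tokens are real words, at the cost of depending on the quality of the dictionary.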
