6/01/2012

Dealing with Large Datasets

There are several ways to make processing a large dataset more efficient.

1. Compression by sharing:
    Store a single copy of each repeated element and use a reference to it everywhere else, instead of storing duplicates.
    e.g. if "Umbrella" appears in the data several times, save just one copy of it and use a reference pointing to that copy every other time it appears. In Java, the method intern() does this for strings: it returns a canonical shared copy from the string pool.
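A minimal sketch of the idea using String.intern() (class and method names here are illustrative):

```java
public class InternDemo {
    public static boolean sameObjectAfterIntern(String a, String b) {
        // intern() returns the canonical copy from the JVM's string pool,
        // so equal strings collapse to references to one shared object.
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String first = new String("Umbrella");   // a fresh heap copy
        String second = new String("Umbrella");  // another fresh copy
        System.out.println(first == second);                       // false: two separate objects
        System.out.println(sameObjectAfterIntern(first, second));  // true: one shared copy
    }
}
```

After interning, duplicate strings cost only a reference each instead of a full character array.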

2. Index hashing:
    Map every distinct element to an integer index and key the rest of your data by that index. This way you can save a lot of space.
    e.g. take the sentence "This is Mike and this is Jane." and suppose we want the probability of each word's appearance. A direct approach is a Hashtable Counter<String, Double>, where each key is a word and each value is that word's probability; since "this" appears twice among the seven words (counting "This" and "this" as the same word), Counter holds <"this", 2/7>.
    Instead, split this into a Hashtable IndexMap<String, Integer> plus a Hashtable Counter<Integer, Double>. IndexMap assigns each string an integer index, and Counter maps that index to the probability. To look up "this", first get its index from IndexMap, then fetch the probability from Counter using that index.
    This saves a lot of time because later hashing and comparisons operate on integers rather than strings, and each string is stored only once.
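The two-table layout above can be sketched as follows (the class and method names are illustrative, not from the original post):

```java
import java.util.HashMap;
import java.util.Map;

public class IndexedCounter {
    // Word -> integer index, assigned in order of first appearance.
    private final Map<String, Integer> indexMap = new HashMap<>();
    // Integer index -> probability of the word's appearance.
    private final Map<Integer, Double> counter = new HashMap<>();

    public void add(String word, int count, int total) {
        // Reading indexMap.size() here yields sequential ids 0, 1, 2, ...
        int id = indexMap.computeIfAbsent(word, w -> indexMap.size());
        counter.put(id, (double) count / total);
    }

    public Double probability(String word) {
        Integer id = indexMap.get(word);
        return id == null ? null : counter.get(id);
    }

    public static void main(String[] args) {
        IndexedCounter c = new IndexedCounter();
        // "This is Mike and this is Jane." has 7 words; "this" appears twice.
        c.add("this", 2, 7);
        System.out.println(c.probability("this")); // 2/7 ≈ 0.2857
    }
}
```

Every structure keyed by words (counts, probabilities, co-occurrence tables) can then share the single IndexMap and store compact integer keys.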

3. Use float instead of double whenever the reduced precision is acceptable. A float (4 bytes) is half the size of a double (8 bytes), so large arrays of floats use half the memory.
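The size difference can be checked directly with the constants Java exposes:

```java
public class FloatVsDouble {
    public static void main(String[] args) {
        // A float is 4 bytes and a double is 8, so a large float[]
        // occupies half the memory of an equivalent double[].
        System.out.println(Float.BYTES);   // 4
        System.out.println(Double.BYTES);  // 8
    }
}
```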

4. Use primitive collections instead of java.util collections when your element type is primitive. Each element in an ArrayList<Integer> is a boxed Integer object (object header plus a reference to it), while a primitive collection stores raw ints; this typically cuts memory use by half or more.
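A minimal sketch of a primitive int list (libraries such as fastutil, Trove, and Eclipse Collections provide full-featured versions; this toy class is only for illustration):

```java
import java.util.Arrays;

public class IntList {
    private int[] data = new int[8]; // raw ints: 4 bytes each, no boxing
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            // Double the backing array when full, like ArrayList does.
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    public int get(int index) { return data[index]; }

    public int size() { return size; }

    public static void main(String[] args) {
        IntList list = new IntList();
        for (int i = 0; i < 100; i++) {
            list.add(i * i);
        }
        System.out.println(list.get(10)); // 100
        System.out.println(list.size());  // 100
    }
}
```

The same elements in an ArrayList<Integer> would each carry an object header and a reference on top of the 4-byte payload, which is where the memory savings come from.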
 
