12/04/2012

Portfolio Optimization with Simple Heuristic

Recently in the Heuristics course we had a Portfolio competition. It simulates the real market, greatly simplified into a game we can play. The description is here, and the whole gaming process can be found on this website.

A detailed description is given below:



Initialization

We call the ith gamble g[i]. Each gamble has three different returns (gihi, gimed, gilow), mapped to three initial probabilities (gihiprob, gimedprob, gilowprob), where gimedprob is at least 0.4.

Initially, the average expected return per gamble is 2 to 1; that is, the expectation is E = gihi*P(gihi) + gimed*P(gimed) + gilow*P(gilow) = 2. Different gambles may give different returns. Low returns are normally less than 1 (so they constitute a loss). During each round, all the gambles are permuted, giving a new sequence g'1, g'2, ..., g'n.

Classes

Now we assume each gamble belongs to a class (as is also true in reality), numbered 1 to 15. Some classes are favorable, some are neutral, and some are unfavorable. Favorable classes bias gihiprob upward, while unfavorable classes bias gilowprob upward. The number of gambles in each class need not be equal. The following rules change the probabilities of the gambles according to their classes:

for i = 1, N do
  if gi is favorable
    P(gihi) += P(gilow) / 2
    P(gilow) /= 2
  else if gi is unfavorable
    P(gilow) += P(gihi) / 2
    P(gihi) /= 2
end

Link

Certain gambles have links to each other. Links also make a series of gambles change their return probabilities during each round. A link is different from a class: it gives a correlation between two gambles. For example, the stock price of Microsoft and the price of Nokia will influence each other (and I consider these two are not necessarily in the same class). The link's influence on a gamble's probabilities is calculated as follows:


permute gamble sequence: g'=perm(g)
for i = 1, N do
  Hi = number of gambles before g'i that are linked to g'i and returned high
  Mi = same but returned medium
  Li = same but returned low
  if Hi > Mi + Li
    P(gihi) +=P(gilow) / 2
    P(gilow) /= 2
  else if Li > Mi + Hi
    P(gilow) += P(gihi) / 2
    P(gihi) /= 2
  Play gi
end

Thus, the full sequence of operation is:

  1. Reset each probability to initial 
  2. Make adjustments based on favorable/unfavorable classes 
  3. Start computing gambles, making adjustments based on links as you go 
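The three steps above can be sketched in Java as follows (a minimal sketch; the Gamble class and its field names are hypothetical, not from the official game code):

```java
import java.util.*;

// Minimal sketch of the per-round sequence: reset, class bias, link bias.
// Gamble and its fields are hypothetical names for illustration.
public class RoundUpdate {
    static class Gamble {
        double hiProb, medProb, lowProb;     // current probabilities
        final double hi0, med0, low0;        // initial probabilities
        int clazz;                           // class id, 1..15
        Set<Gamble> links = new HashSet<>(); // linked gambles
        int outcome;                         // 0 = high, 1 = medium, 2 = low

        Gamble(double hi, double med, double low, int clazz) {
            hi0 = hi; med0 = med; low0 = low;
            this.clazz = clazz;
            reset();
        }

        // Step 1: reset each probability to its initial value.
        void reset() { hiProb = hi0; medProb = med0; lowProb = low0; }
    }

    // Step 2: bias probabilities by favorable/unfavorable class.
    static void applyClass(Gamble g, Set<Integer> favorable, Set<Integer> unfavorable) {
        if (favorable.contains(g.clazz)) {
            g.hiProb += g.lowProb / 2;
            g.lowProb /= 2;
        } else if (unfavorable.contains(g.clazz)) {
            g.lowProb += g.hiProb / 2;
            g.hiProb /= 2;
        }
    }

    // Step 3: just before playing the gamble at position i of the permuted
    // sequence, bias its probabilities by the outcomes of linked gambles
    // already played this round.
    static void applyLinks(List<Gamble> permuted, int i) {
        Gamble g = permuted.get(i);
        int hi = 0, med = 0, low = 0;
        for (int j = 0; j < i; ++j) {
            Gamble other = permuted.get(j);
            if (!g.links.contains(other)) continue;
            if (other.outcome == 0) ++hi;
            else if (other.outcome == 1) ++med;
            else ++low;
        }
        if (hi > med + low) {
            g.hiProb += g.lowProb / 2;
            g.lowProb /= 2;
        } else if (low > med + hi) {
            g.lowProb += g.hiProb / 2;
            g.hiProb /= 2;
        }
    }
}
```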

Task

Suppose we can bet (invest) on approximately N gambles (stocks). We are given a capital of 1 (the total budget you initially have is 1). Our task is to allocate this 1 unit among a chosen number of gambles.

Game

1. In the first game we play 5 rounds.
a. Assign outcomes to each gamble according to the probabilities above.
b. Calculate each person's wealth based on their allocation and the outcomes (each round starts with wealth 1).
If a person's return is >= 2, that person gets a point for that round. The player is told the final return outcome of each gamble, to help infer which class is favorable and perhaps change allocations for the next round.

The player is not told, however, in which order the gambles were played. The favorable and unfavorable classes remain the same during the five rounds.

2. In the second game we play 200 rounds.

The favorable and unfavorable classes may change up to 10 times (without warning) during those rounds. Again, the player is told the final return outcome of each gamble, to help infer which class(es) are favorable and perhaps change allocations for the next round. Players need not be fully invested during all rounds. Inspired by Jake Loveless's experience at a finance company, where consistent winnings are favored over erratic profits and losses, we have two criteria for winning: the total amount of money at the end, and the 10th-percentile daily return (that is, sort all daily returns from least, i.e. most negative, to greatest and take the return at the 10th percentile of that list). In the original game, we used the Sharpe Ratio as the criterion, computed as

sum(return) / sqrt(sum((return - avg_return)^2))
Here the return is the actual profit we make each round (though the real Sharpe ratio is computed on rates of return).

Tackling the first problem

For the first game, our intuitive goal is to find some combination of gambles that gives the largest probability of a return larger than 2.

For example, if we bet on two gambles g0 and g1, and we allocate x to g0 (and 1-x to g1), then the probability that the return is >= 2 is (here i and j take 3 values: 0: high, 1: medium, 2: low):
for (int i = 0; i < 3; ++i) {
  for (int j = 0; j < 3; ++j) {
    if (g0.getReturn(i) * x + g1.getReturn(j) * (1 - x) >= 2) {
      prob += getProb(g0, i) * getProb(g1, j);
    }
  }
}

This method is guaranteed to find the best probability for a fixed number of gambles, by traversing all possible combinations of the selected gambles with different ratios assigned to each. However, even in the two-gamble scenario, if we explore the ratio x with a step size of 0.01, we already have 100 * N * N computations. If we want to select many gambles, the complexity grows combinatorially.
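For reference, a self-contained version of this enumeration might look like the following (a sketch; the array layout standing in for the hypothetical getReturn/getProb accessors is my own):

```java
// Self-contained sketch of the two-gamble enumeration above.
// r0/r1 hold the returns and p0/p1 the probabilities of each gamble,
// indexed 0 = high, 1 = medium, 2 = low (an assumed layout).
public class TwoGambleProb {
    static double computeProbability(double[] r0, double[] p0,
                                     double[] r1, double[] p1, double x) {
        double prob = 0;
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                // joint outcome (i, j) yields total return >= 2
                if (r0[i] * x + r1[j] * (1 - x) >= 2) {
                    prob += p0[i] * p1[j];
                }
            }
        }
        return prob;
    }
}
```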

But do we really need so many gambles to decide what gives the best return? For the first game we don't need to complicate the problem. If we can find the best allocation over one or two gambles that gets us a point, we don't really care whether the return is 2 or 100. If we name the previous function computeProbability, we have the following algorithm:

double best_prob = 0;
int[] best_comb = new int[2];
double best_x = 0;
for (double x = 0; x <= 1; x += 0.01) {
  for (int i = 0; i < gambles.size() - 1; ++i) {
    for (int j = i + 1; j < gambles.size(); ++j) {
      double prob = computeProbability(i, j, x);
      if (prob > best_prob) {
        best_prob = prob;
        best_comb[0] = i;
        best_comb[1] = j;
        best_x = x;
      }
    }
  }
}
We usually get 5 points in 5 rounds, which means this simple algorithm works quite well to meet the requirement.

Solving the second problem


Here I will first explain our original solution, which targets the best Sharpe ratio.

This problem presents a more complicated scenario: besides links changing the gambles' probabilities each round, the classes might also change.

But do we really want to model all these complex changes? Let's look at the Sharpe Ratio again:
sum(return) / sqrt(sum((return - avg_return)^2))
It measures return per unit of risk. If the returns of your portfolio vary tremendously, the Sharpe ratio will be small, because the denominator, which grows with the variance of the returns, will be large: you might earn a lot of money in one round, then lose a lot in the next. Conversely, if you keep the denominator fixed and make the sum of returns as large as possible, the Sharpe ratio will be large.

Based on the formula, to get the best Sharpe ratio we want the denominator as small as possible. If we can make the return of each round nearly identical, the denominator becomes almost 0. It doesn't matter whether the returns are large or small, say 2 or 100, because if all returns are scaled by a constant, the numerator and denominator are both multiplied by that constant.
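This scale invariance is easy to check numerically with the simplified formula from the post (a sketch; note the ratio is undefined when all returns are identical, since the denominator is then zero):

```java
public class SharpeDemo {
    // Simplified "Sharpe ratio" from the post:
    // sum(return) / sqrt(sum((return - avg_return)^2)).
    // Undefined (0/0) if every return is identical.
    static double sharpe(double[] returns) {
        double sum = 0;
        for (double r : returns) sum += r;
        double avg = sum / returns.length;
        double ss = 0;
        for (double r : returns) ss += (r - avg) * (r - avg);
        return sum / Math.sqrt(ss);
    }
}
```

Scaling every return by 10 leaves the ratio unchanged, which is why only the shape of the return stream matters, not its size.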

So we bet on a few gambles, say 5, whose high, medium and low returns have the smallest variance. If there is one gamble whose variance is less than 0.0001, we bet only on that gamble.

PriorityQueue q // a max-heap on variance
for (int gi = 0; gi < n; ++gi) {
  if (gi.avg < 1.2) continue // skip gambles whose average return is below 1.2
  var = computeVariance(gi)
  q.add(var)
  if (q.size() > 5) // we bet on at most 5 gambles
    q.poll() // drop the largest variance, keeping the 5 smallest
}

smallest = smallest variance in q
if smallest < 0.0001
  only bet on that gamble
else
  bet on all gambles in q

Actually this is rather impractical, even though it gives a very large Sharpe ratio (on average about 10000 in our experiments), because we are not making any money when betting only on the most stable gambles.

So now the goal is to make as much money as possible. Here is the other strategy we used during the competition, which makes billions in this scenario.

Now we do need to consider the classes in order to get a really big return each round. To keep the strategy as simple as possible, we bet on only the one gamble with the highest expected return (you could spread it over several gambles, but the idea is the same). Because the classes change, we need a function to guess the favorable and unfavorable classes. Here I introduce the depth of history we trace: we only consider a limited window of data, say only the previous round, to guess a gamble's class. Applying machine learning techniques here might help more, but then it's more than a heuristic and not what I am going to cover.

In each round we update the return probabilities based on the previous round and the linking information. Then, looking back over the rounds within our trace depth, if the number of high returns (Hi) exceeds the number of low returns (Li), we consider the gamble to be in a favorable class, and vice versa.
int[] probAdaption // per gamble, counts high (+) vs low (-) outcomes

for (int i = 0; i < n; ++i) {
  if (gi.return == Gamble.HIGH) {
    if (probAdaption[i] >= traceDepth) continue;
    ++probAdaption[i];
  } else if (gi.return == Gamble.LOW) {
    if (probAdaption[i] <= -traceDepth) continue;
    --probAdaption[i];
  }
}
Because we are not sure of this adaptation, we don't actually apply P(gihi) += P(gilow) / 2 and P(gilow) /= 2; instead we replace the 0.5 coefficient with an experimental one.

int adaption = probAdaption[id];
if (adaption > 1) {
  double d = lProb * adaptionCoefficient;
  lProb -= d;
  hProb += d;
} else if (adaption < -1) {
  double d = hProb * adaptionCoefficient;
  hProb -= d;
  lProb += d;
}
The last thing is to pick the gamble with the largest expected return among the n gambles. That's our portfolio strategy. On average it returns about 5 to 10 times the previous round's wealth.

9/26/2012

The Traveling Salesman Problem

The problem description is given in the class:
http://cs.nyu.edu/courses/Fall12/CSCI-GA.2965-001/travelingsalesman.html

From a graph perspective, this problem can be stated briefly as:

Given a complete weighted graph (where the vertices would represent the cities, the edges would represent the roads, and the weights would be the cost or distance of that road), find a Hamiltonian cycle with the least weight.

Traversing every possible city sequence to find the optimal tour takes factorial time, O(n!), with the simplest recursion. Dynamic programming reduces this greatly, to O(n^2 * 2^n), which handled almost 30 cities on my laptop.

Many strategies and heuristics are used to solve this problem, such as minimum matching and Simulated Annealing. We chose Simulated Annealing as our base algorithm for its flexibility in optimization.

One general idea is to build the Minimum Spanning Tree (MST). As a baseline, traversing every edge of the MST twice yields a tour of at most twice the MST weight (and hence at most twice the optimal tour). On the provided input, the MST weight is ~625,000 miles, so our baseline is ~1,250,000.

Step 1
Step 1
Consider the distance score = sum over the tour of d[c_i][c_(i+1)], where n is the total number of cities and d[i][j] is the distance from city i to city j. Let score_i be the score after the ith iteration.

The primitive version randomly generates a city sequence, then swaps 2 cities in each iteration. The new sequence is accepted if score_(i+1) < score_i || exp(-(score_(i+1) - score_i)/T) > R(0, 1), where R(0, 1) is a uniform random number.
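That acceptance rule can be sketched as follows (a minimal Metropolis-style sketch; the method and parameter names are mine, not from our competition code):

```java
import java.util.Random;

// Sketch of the simulated-annealing acceptance rule: always accept an
// improvement, accept a worse tour with probability exp(-delta / T).
public class Anneal {
    static boolean accept(double oldScore, double newScore, double t, Random rnd) {
        if (newScore < oldScore) return true;          // strictly better: accept
        // worse: accept with probability exp(-(newScore - oldScore) / t)
        return Math.exp((oldScore - newScore) / t) > rnd.nextDouble();
    }
}
```

At high temperature t nearly everything is accepted; as t cools, worse tours are accepted less and less often.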

The good thing about this algorithm is that it is really simple, and it accepts worse results in order to search a larger solution space. This gives a result of ~1,200,000, which is almost the same as the baseline with a time limit of 2 min. :( The reason is that swapping only two cities at a time doesn't shuffle the city sequence much. So we tried swapping two fixed-length blocks of cities, as below:
ab1234cde5678fg
ab5678cde1234fg
Here the numbers are the cities being swapped, and the letters are cities that stay in place. But this still doesn't improve much, so instead of swapping fixed-length sub-sequences, we make the lengths arbitrary:
ab123cde45678fg
ab45678cde123fg

With this method and a lot of code optimization, we reduced the score to ~960,000.

We don't shuffle inside the chosen sub-sequence because that would require more recalculation of the distance score, reducing the number of iterations we can run. Simulated Annealing guarantees that, given unlimited time, the optimal score will be approached. That leaves two ways to improve our algorithm: optimize the swapping (replacement) strategy, or run as many iterations as possible.

Step 2
As to the second, we make the starting sequence as reasonable as possible, to avoid random starts whose bad scores get accepted and make things worse. So we use a Depth-First Search of the MST as our initial sequence instead of random generation. During the DFS, if a node has already been visited, we skip it and continue searching the tree.

We also replaced java.util.Random with a bit-manipulation random() of our own. This gives three times as many iterations as before, and we start around 970,000 instead of very random scores above 1 million.

Another change: after the temperature has been lowered several times to a small value, we accept only better scores, giving up the chance that accepting worse scores would explore a larger solution space. We then found that the cooling temperature and acceptance probability no longer mattered much: having already started from a very good initial score, accepting worse values was mostly meaningless, even at the risk of getting stuck in a local minimum. So we accept only better scores through all iterations.

Step 3
As to the first one, we developed a different replacement method. The strategy is as follows:
  1. For each city, save all its neighbors sorted from nearest to farthest;
  2. Randomly pick a start point a in the sequence;
  3. Randomly pick b from the 4 closest neighbors of a, with probability decreasing from the closest to the farthest;
  4. Reverse the sub-sequence from the city next to a through b.
For example:
abc12345defg
abc54321defg
Suppose the start point we choose is c, and 5 is one of the 4 closest cities to c. We reverse the sub-sequence from the point next to c, which is 1, through 5. Why should this work? Because it tries to make a near-closest city the next city to travel to, which is a much better heuristic than reversing at random. We don't always use the closest city because that would drive the model into local optima too easily.
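Steps 1-4 can be sketched as below (illustrative names; the 0.4/0.3/0.2/0.1 bias is an assumed example of "probability decreasing from closest to farthest", not our tuned values):

```java
import java.util.*;

// Sketch of the replacement move: pick a start city a, pick b among a's
// nearest neighbors biased toward the closest, then reverse the tour
// segment between them (a 2-opt-style move).
public class ReversalMove {
    // Reverse tour[i+1 .. j] in place, where i is the position of a
    // and j the position of b.
    static void reverseSegment(int[] tour, int i, int j) {
        for (int l = i + 1, r = j; l < r; ++l, --r) {
            int tmp = tour[l];
            tour[l] = tour[r];
            tour[r] = tmp;
        }
    }

    // Pick one of the 4 nearest neighbors (pre-sorted nearest first),
    // with assumed probabilities 0.4, 0.3, 0.2, 0.1.
    static int pickNeighbor(int[] sortedNeighbors, Random rnd) {
        double p = rnd.nextDouble();
        if (p < 0.4) return sortedNeighbors[0];
        if (p < 0.7) return sortedNeighbors[1];
        if (p < 0.9) return sortedNeighbors[2];
        return sortedNeighbors[3];
    }
}
```

Since only the two tour edges at the segment boundaries change, the score delta can be computed in O(1), which is what keeps the iteration count high.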

This simple model beat all 9 other teams in our class. It wins through the ability to run extremely fast, by optimizing every operation, plus the initialization and replacement heuristics. Anyone with a different or better solution is welcome to discuss it with us.

Our code is on GitHub, and the running commands are simply (assuming the input file is in the same directory as the Java files):
javac SAOpt.java MST.java
java SAOpt

8/16/2012

Accessing Gmail Accounts through Mail

Today when I was trying to add my NYU account to Mail, it kept showing the same error message: server rejected your username and password. Some people mentioned clearing the Captcha, but that didn't work in my case. So I explored Google support and found the way to fix it.

First go to account:

Then click on security:

Then turn 2-step verification on (it might require you to log in again). After turning it on, you can generate a new application-specific password by editing the authorized applications and sites.

Each application needs its own password, but you only have to enter it once and then it's all set.

I still have a lot of problems switching from my PC to Mac, such as shortcut keys and personal settings. But I should say the Mac is a good system, much better than Windows in a lot of aspects.

8/05/2012

jQuery plugIn: Fixed header table for IE

To build a spreadsheet-like table, I needed the table header fixed while the body scrolls. In Chrome and Firefox you can accomplish this simply by setting overflow on the table body to make it scrollable; this does not work in IE. After searching, I used a plugin from http://www.farinspace.com/jquery-scrollable-table-plugin/
and modified it into my own edition. To use it, simply type:

$("#"+TABLEID).createScrollableTable();

Code is as below:



(function($) {
  $.fn.createScrollableTable = function(options) {
    var defaults = {
      width: '1160',
      height: '600'
    };
    var options = $.extend(defaults, options);

    return this.each(function() {
      var table = $(this);
      prepareTable(table);
    });

    function prepareTable(table) {
      var tableId = table.attr('id');
      var tb = $('#' + tableId);
      var wholeWidth = $('#' + tableId).width() > options.width ? $('#' + tableId).width() : options.width;
      // wrap the current table (will end up being just the body table)
      var minheight = 580;
      var css;
      if (minheight > $('#' + tableId).height()) {
        css = {
          'width': options.width,
          'height': $('#' + tableId).height() + 20,
          'overflow': 'auto'
        };
      } else {
        css = {
          'width': options.width,
          'height': options.height,
          'overflow': 'auto'
        };
      }
      var bodyWrap = table.wrap('<div></div>')
        .parent()
        .attr('id', tableId + '_body_wrap')
        .css(css);

      var thead_tr_first = $('thead tr:nth-child(1)', tb);
      var tbody_tr_first = $('tbody tr:first', tb);
      var col_width = [];
      var w = 0;
      var cells_head = thead_tr_first.find('th');
      var cells_body = tbody_tr_first.find('td');
      if (cells_body.length == 1) {
        for (var i = 0; i < cells_head.length; ++i) {
          col_width[i] = $(cells_head[i]).width();
        }
      } else if (cells_head.length == 0) {
        for (var i = 0; i < cells_body.length; ++i) {
          col_width[i] = $(cells_body[i]).width();
        }
      } else {
        for (var i = 0; i < cells_body.length; ++i) {
          col_width[i] = $(cells_head[i]).width() > $(cells_body[i]).width() ? $(cells_head[i]).width() : $(cells_body[i]).width();
        }
        col_width[0] = col_width[0] < 0 ? 20 : col_width[0];
      }

      var thead_cols = $('thead tr:nth-child(1)', tb).find('td, th');
      var tbody_cols = tbody_tr_first.find('td');

      var tbh = $('<table class="tablesorter" cellspacing="0"></table>').insertBefore(bodyWrap).prepend($('thead', tb));
      tbh.css('width', options.width);
      var css_head = {
        'width': options.width,
        'overflow': 'hidden'
      };
      var headWrap = tbh.wrap('<div></div>').parent().attr('id', tableId + '_head_wrap').css(css_head);

      $.each($('.tablesorter'), function(i, elem) {
        $(elem).css('width', wholeWidth);
        $(elem).css('table-layout', 'fixed');
      });

      $.each(thead_cols, function(i, elem) {
        $(elem).width(col_width[i]);
        // $(elem).css('width', col_width[i]);
      });
      $.each(tbody_cols, function(i, elem) {
        $(elem).width(col_width[i]);
        // $(elem).css('width', col_width[i]);
      });

      bodyWrap.scroll(function() {
        headWrap.scrollLeft($(this).scrollLeft());
      });
    }

    function getWidth(td) {
      if ($.browser.msie) { return $(td).outerWidth() - 1; }
      if ($.browser.mozilla) { return $(td).width(); }
      if ($.browser.safari) { return $(td).outerWidth(); }
      return $(td).outerWidth();
    }
  };
})(jQuery);

6/20/2012

Connecting Sybase with PHP

First, ensure you compiled PHP with Sybase support via ./configure:

./configure --prefix=/apps/solace/php --with-apxs2=/apps/solace/apache/bin/apxs --with-libxml-dir=/home/tptools/Linux/x86_64 --with-png-dir=/home/tptools/Linux/x86_64 --enable-bcmath --enable-mbstring --enable-sockets --with-mysql=/home/tptools/Linux/x86_64/vendor/mysql-5.1.36-linux-x86_64-glibc23 --with-sybase-ct=$SYBASE/version#

The SYBASE variable should be set to the Sybase home directory. The hostname will be in the Sybase interfaces file:

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc35823_1500/html/uconfig/X41537.htm

Here is some sample code to setup connection with Sybase:
<?php
$conn = sybase_connect('DATABASE_NAME', 'USERNAME', 'PASSCODE')
    or die('Could not connect: ' . sybase_get_last_message());

if (!sybase_select_db('SUB_DATABASE', $conn)) {
    echo "Can't select database\n";
    exit;
}
echo "Connected!";
sybase_close($conn);
?>


If you see the message Could not connect:, either the connection parameters you set are incorrect, or you haven't set up the Sybase environment in the envvars file in Apache's bin directory. Add
SYBASE=/usr/local/sybase
export SYBASE

and restart Apache with apachectl start; then you will be fine.
If you don't like the warnings popping up, go to the php.ini file (it should be in your PHP lib directory, or run phpinfo() to see where the ini file is) and change the warning settings.


Don't forget to set up the root user's environment variables:

setenv SYB_USER username
setenv SYB_PASSWD password

6/19/2012

Yii: Awesome PHP Framework


This framework looks fantastic.
Look at the chart: it supports everything.




By the way, when installing MySQL on Linux, MySQL first searches for the option file at /etc/my.cnf, then ~/.my.cnf under your $HOME directory. Remember to configure this file to get the server connected. If you get an error like a socket error or cannot connect to server, and you have not specified the hostname, username or password, your option file is probably not set up properly.


http://dev.mysql.com/doc/refman/5.5/en/option-files.html#option_general_defaults-extra-file

Strangely, I have to connect to mysql with
[wshi@nycdlvapoc002 mysql-5.5.25]$ mysql -u username -p
instead of user=root. If you'd like to use root by default, just create the user.

http://dev.mysql.com/doc/refman/5.1/en/adding-users.html

6/16/2012

Remote debugging with eclipse

Debugging a Java application or web application on a server is very painful for a lot of reasons. Here I make the server and client the same computer.

I used Eclipse here, so make sure you have Eclipse installed.

Setup Server:
Assume you've finished your Java code, which looks like below:

public class ForInterview {
    public static void main(String[] args) {
        String a = "ARMZM2G002825";
        float b = genIndex(a); // genIndex is defined elsewhere in the project
        float d = (float) 5.020802E24;
        System.out.println(b == d);
        int c = 'Z' - '\u0030';
        System.out.println(c);
    }
}

Export the project as ForInterview.jar into some directory that you regard as the server. You can keep it on your local machine or a remote server.

Then run 
    java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8000 -jar    ForInterview.jar 
You'll see Listening for transport dt_socket at address: 8000
Now the server is waiting for debug command. 
// From java5 you can run  -agentlib:jdwp instead of -Xdebug -Xrunjdwp

Setup Client:
In Eclipse, go to Run -> Debug Configurations -> Remote Java Application and set up the client as in the graph below:


You can change the Host or Port as the server requires. But if the server requires permission, there might be a problem; I haven't tried that.

6/15/2012

Installing Apache+PHP on Redhat


A good habit is to run 'make clean' every time you configure, or you may get errors like undefined symbols at compile time.

Installing apache and php onto server:
http://unclean.org/howto/apache-php-mysql.html
http://www.php.net/manual/en/install.unix.apache2.php

(13)Permission denied: make_sock: could not bind to address 0.0.0.0:80
This indicates that I do not have enough privileges to bind the port for the Apache HTTP server. In Unix/Linux, only privileged users are allowed to bind ports between 1 and 1024, and Apache uses port 80 by default.
Edit the config file to change Apache's port to a number greater than 1024.

When installation succeeds, your server should show the 'It works!' page.
And remember to add the following to your conf/httpd.conf file (or if you use PHP 4 or 3, change it to php4, php3, etc.):

 LoadModule php5_module        modules/libphp5.so
 AddType application/x-httpd-php .php .phtml     
 AddType application/x-httpd-php-source .phps

6/12/2012

Some useful materials before starting work

Source for SVN, clear enough:
http://svnbook.red-bean.com/nightly/en/svn.branchmerge.basicmerging.html

Time series database, used to log server data stream:
http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html
http://forums.cacti.net/about12202.html
http://www.fromdual.com/sites/default/files/rrd.pdf

It's really annoying to install rrdtool on Unix. The reference is:
http://oss.oetiker.ch/rrdtool/doc/rrdbuild.en.html#___top


It took me a while to figure out how to install from source onto my Red Hat server.
If I download a .xz file, I first have to install xz, using
./configure --prefix=SOME_DIRECTORY
make
make install
then set
PATH=SOME_DIRECTORY/bin:$PATH

The command is: xz -d FILENAME ## -d is decompress, -c is compress
xz is actually a more powerful tool than gz and bz2.

Installing the other libraries we need works the same way, except we may need to add some options to configure, make and install; e.g. setting CFLAGS and LDFLAGS is probably needed if you are installing an application on a server.

If you get a command not found, your environment variables are probably not correct. If you get a no such file or directory, probably something is wrong with the path.

When installing glib, if you receive
1. glib AttributeError: 'dict' object has no attribute 'has_key'
it's probably because you're using the wrong Python version; glib only supports Python 2.5-2.7.
To change versions, all you need is to add the bin directory of the correct version to PATH:
setenv PATH SOMEPLACE/python2.5/bin
where SOMEPLACE is where your Python is installed.

2. keyword "msgctxt" unknown error

This usually means gettext (a language-support tool) is not installed. Installing it fixes the problem.



6/01/2012

Dealing with Large Dataset

There are several ways to make processing a large dataset more efficient.

1. Compression technique:
    Use a link to every repeated element in the data instead of storing a copy of it.
    e.g. if "Umbrella" appears in the data several times, save just one copy and point to it every other time it appears. In Java, the method String.intern() deduplicates strings this way.

2. Index hashing:
    Hash every distinct element to an integer. This saves a lot of space.
    e.g. "This is Mike and this is Jane." We want the probability of each word's appearance. We could store a Hashtable Counter<String, Double>, where each string is a word and each double is its probability of appearance; the entry for "this" is <"this", 2/7>.
    Instead, we can split this into Hashtable IndexMap<String, Integer> plus Hashtable Counter<Integer, Double>. IndexMap assigns an index to every string, and Counter maps the index to the probability. To look up "this", we find its index in IndexMap and then its probability with that index.
    This also saves a lot of time, because we are no longer hashing strings everywhere.

3. Use float instead of double whenever possible. Obviously, this is because a float is half the size of a double.

4. Use primitive collections instead of java.util collections when your data type is primitive. This saves at least half the space, and often around 85%.
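The index-hashing idea in point 2 might be sketched like this (hypothetical class; for brevity the counts live in a boxed list, where points 3 and 4 would suggest a primitive array):

```java
import java.util.*;

// Sketch of index hashing: each distinct word is mapped to a small
// integer once, and per-word statistics are looked up by that integer
// instead of rehashing the string.
public class IndexHashing {
    private final Map<String, Integer> indexMap = new HashMap<>();
    private final List<Integer> counts = new ArrayList<>();
    private int total = 0;

    void add(String word) {
        String w = word.toLowerCase();
        Integer id = indexMap.get(w);
        if (id == null) {                 // first time we see this word
            id = indexMap.size();
            indexMap.put(w, id);
            counts.add(0);
        }
        counts.set(id, counts.get(id) + 1);
        ++total;
    }

    // probability of the word's appearance, e.g. 2/7 for "this" in the
    // example sentence above
    double probability(String word) {
        Integer id = indexMap.get(word.toLowerCase());
        return id == null ? 0.0 : (double) counts.get(id) / total;
    }
}
```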
 

5/27/2012

Sorting a Huge File

16:00
The error I got was /tmp/sortA3aLjF: No space left on device. The HPC alone can't handle gigabyte-scale sorting directly; I'll probably have to use an external sort. Results to be posted later.
19:30
Actually bash's sort can do it, since that sort is itself an n-way-merge external sort. By default its temporary files go under /tmp; my /tmp didn't have enough space, hence the error. Use -T to switch to a custom path.

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html
http://www.linuxquestions.org/questions/linux-newbie-8/sort-big-files-tmp-sorta3aljf-no-space-left-on-device-823971/

5/19/2012

Fast-forward to other forks

For convenience in development, and to avoid mixing with the team's original project (Moonlyt/Moonlyt), I forked the project to my own repository (Wenling/Moonlyt) and develop locally. But I had no idea how to keep my repo up to date with the original one, which brought me a lot of trouble: I used to clone a fresh copy locally and then delete my old one...

It's actually quite easy to keep my own fork in sync.

Reference:
http://git-scm.com/book/en/Git-Branching-Basic-Branching-and-Merging

# bind a remote to the original repository
git remote add upstream git://github.com/moonlyt/moonlyt.git
# fetch the upstream branches (fetch is usually better than pull,
# as it doesn't run the merge automatically for you)
git fetch upstream
# switch to my master branch
git checkout master
# merge upstream into my master branch
git merge upstream/master
# push
git push origin master




Both ctrl-z and ctrl-c are interrupt commands, but they behave differently.
ctrl-c forcibly terminates the running program.
ctrl-z suspends the task: it is not terminated, but stays in the process table in a suspended state. You can use fg/bg to continue it in the foreground or background: fg restarts the suspended task in the foreground, and bg continues it in the background.
To resume a suspended program:
1. Use jobs to find the suspended program's job number x
2. Use fg %x to resume it (if there is only one suspended program, plain fg is enough)

5/07/2012

DPLL implemented with c++


My first C++ code written... I think it's quite efficient on all the test sets included in the data path.

The simplest data set looks like:
-1 -3 0 -1 -5 0 -3 -5 0 -2 -4 0 -2 -6 0 -4 -6 0 1 2 0 3 4 0 5 6 0

I used a vector to save clauses and a map to save assignments.
The main recursive function works as follows:
/*
1. If the set of clauses is empty, it returns true
2. Otherwise, if the set of clauses contains an empty clause, it returns false
3. Otherwise, if there is a unit clause, it applies unit propagation and then calls itself recursively
4. Otherwise, if there is a pure literal, it applies the pure literal rule and then calls itself recursively
5. Otherwise, it applies either the resolution rule or the splitting rule and calls itself recursively
*/

Git Repo: git@github.com:Wenling/DPLL-Algorithm.git
Google Drive: https://docs.google.com/file/d/0B0mZhauSOmW4dFZESVRtMGRpWms

4/25/2012

DPLL Algorithm with ML

Git Repo: git@github.com:Wenling/DPLL-Algorithm.git
There are three satisfiability-preserving transformations in DP.
• The 1-literal rule
• The affirmative-negative rule
• The rule for eliminating atomic formulas
The first two steps reduce the total number of literals in the formula.
The last step reduces the number of variables in the formula.
By repeatedly applying these rules, eventually we obtain a formula containing an
empty clause, indicating unsatisfiability, or a formula with no clauses, indicating
satisfiability.


The DPLL algorithm replaces resolution with a splitting rule.
• Choose a propositional symbol p occurring in the formula.
• Let delta be the current set of clauses.
• Test the satisfiability of delta and {p}.
• If satisfiable, return true.
• Otherwise, return the result of testing delta and {¬p} for satisfiability.


3/22/2012

Brute Force SAT Solver with Scheme


Produced some code solving the SAT problem in Scheme, a functional language. It's not perfect and might crash when an assignment has more than 25 literals. One problem is the allassign function: it would save much more space with tail recursion, but I was too lazy to change my code.

#lang racket

;counts the the number of clauses
(define (countclauses data)
  (cond
    ((null? data) 0)
    ((= (car data) 0) (+ 1 (countclauses (cdr data))))
    (#t (countclauses (cdr data)))))

;determines the maximum variable number
(define (maxvar data)
  (cond
    ((null? data) 0)
    (#t (let* ((max (maxvar (cdr data)))
               (c (car data))
               (a (if (< c 0) (- c) c)))
          (if (> a max) a max)))))

;takes a single argument, a list of integers representing a SAT problem in DIMACS format,
;and returns a list of lists, where each list represents a clause.
(define (getClauses lis)
  (saveClause lis '()))

(define (saveClause lis temp)
  (cond 
    ((null? lis) '())
    ((zero? (car lis)) (append (list temp) (saveClause (cdr lis) '())))
    (else (saveClause (cdr lis) (cons (car lis) temp)))))

;takes a positive integer n and generates all possible assignments of length n.
(define (allassign num)
  (allassign2 num '(())))

(define (allassign2 num temp)
  (cond
    ((zero? num) temp)
    (else (append (allassign2 (- num 1) (list (cons #t (car temp)))) (allassign2 (- num 1) (list (cons #f (car temp))))))))

;takes a variable and a list representing an assignment
;returns the value of that variable under that assignment
(define (eval-var var lis)
  (cond 
    ((eq? var 1) (car lis))
    (else (eval-var (- var 1) (cdr lis)))))

;similar function to evaluate a literal
(define (eval-lit var lis assgn)
  (cond
    ((eq? var 1) (if (< (car lis) 0) (not (eval-var (opposite (car lis)) assgn)) (eval-var (car lis) assgn)))
    (else (eval-lit (- var 1) (cdr lis) assgn))))

(define (opposite var)
  (if (< var 0) (- 0 var) var))

;evaluate a clause
(define (eval-clause var lis assgn)
  (cond
    ((eq? var 1) (eval-clause2 (car lis) assgn))
    (else (eval-clause (- var 1) (cdr lis) assgn))))

(define (eval-clause2 lis assgn)
  (cond
    ((null? lis) #f)
    ((eq? #t (eval-lit 1 lis assgn)) #t)
    (else (eval-clause2 (cdr lis) assgn))))

;evaluate a CNF
(define (eval-form lis assgn)
  (cond
    ((null? lis) #t)
    ((eq? #f (eval-clause 1 lis assgn)) #f)
    (else (eval-form (cdr lis) assgn))))

;main function, determines whether a list of integers is satisfiable
(define (sat lis)
  (SATsolver2 (getClauses lis) (allassign (maxvar lis))))

(define (SATsolver2 lis assgn)
  (cond
   ((null? assgn) #f)
   ((eq? #t (eval-form lis (car assgn))) #t)
   (else (SATsolver2 lis (cdr assgn)))))

3/10/2012

SAT Solver

In computer science, satisfiability (often written in all capitals or abbreviated SAT) is the problem of determining if the variables of a given Boolean formula can be assigned in such a way as to make the formula evaluate to TRUE.

SAT was the first known example of an NP-complete problem. That briefly means that there is no known algorithm that efficiently solves all instances of SAT, and it is generally believed (but not proven, see P versus NP problem) that no such algorithm can exist.

3-satisfiability is the special case of k-satisfiability (k-SAT), or simply satisfiability (SAT), in which each clause contains exactly k = 3 literals.


3-SAT is NP-complete and it is used as a starting point for proving that other problems are also NP-hard. This is done by polynomial-time reduction from 3-SAT to the other problem.

Eg. the subset sum problem
Assume that we are given some integers, such as {-7, -3, -2, 5, 8}, and we wish to know whether some of them add up to zero. As the number of integers we feed into the algorithm becomes larger, the number of subsets grows exponentially, and in fact the subset sum problem is NP-complete. An algorithm that checks whether a given subset has sum zero is called a verifier. A problem is in NP if and only if there exists a verifier for the problem that runs in polynomial time.
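The contrast between the fast verifier and the exponential search is easy to show in a few lines of Python (function names are my own):

```python
from itertools import combinations

def verify_subset_sum(subset):
    # Polynomial-time verifier: checks a certificate (a nonempty subset).
    return len(subset) > 0 and sum(subset) == 0

def has_zero_subset(nums):
    # Brute-force search: tries all 2^n - 1 nonempty subsets,
    # which is exponential in n.
    return any(verify_subset_sum(c)
               for r in range(1, len(nums) + 1)
               for c in combinations(nums, r))
```

For {-7, -3, -2, 5, 8} the search succeeds because the subset {-3, -2, 5} sums to zero.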

3/06/2012

Assessment with Epic

It took me 3.5h today to finish the test, from 11:00am to 2:30pm. There are three parts: math, understanding a new language, and coding. Epic gives 5h maximum, but the time you take is also part of the evaluation.

1. Math problems. 14 of them, I think. Easier than GRE quantitative but with some weird questions, mainly because I couldn't understand what they were asking..
Eg. A goat jumps up 3ft per minute and slides back 2ft per minute. If it wants to climb out from the bottom of a cliff that is 50.5ft deep, how many minutes does it take?
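If I'm reading the question right, it's the classic snail-in-the-well puzzle: net gain 1ft per minute, but the goat escapes mid-minute once a 3ft jump clears the top. A quick simulation under that reading:

```python
def minutes_to_escape(cliff, jump=3.0, slide=2.0):
    # Each minute: jump up, escape if at or over the top, else slide back.
    height, minutes = 0.0, 0
    while True:
        minutes += 1
        height += jump
        if height >= cliff:
            return minutes
        height -= slide
```

For a 50.5ft cliff this gives 49 minutes: after 48 full minutes the goat is at 48ft, and the 49th jump reaches 51ft.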

2. Given a new language, understand it and answer questions.
A new language is described and there are 20 questions. Not that hard at all.
Eg. the language has two types of variables, string and integer; expressions are evaluated strictly left to right; there are boolean expressions. Variables are not explicitly declared with a type.
Functions like SET, READ, KILL.

3. Programming skills.
1) Long Subtraction -- Given two arrays A and B, each containing decimal digits, return an array representing A - B. Your machine can only do arithmetic on numbers less than 20.
eg. A = [1,2,5,7,5];
B = [3,4,8,9];
A - B = [9,0,8,6];
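A sketch of the digit-by-digit subtraction with borrowing, so no intermediate value ever exceeds the machine's limit (I'm assuming A >= B, which the example implies):

```python
def long_subtract(a, b):
    # Subtract digit arrays right to left with borrowing; assumes A >= B.
    a, b = a[::-1], b[::-1]
    b = b + [0] * (len(a) - len(b))  # pad B with leading zeros
    result, borrow = [], 0
    for da, db in zip(a, b):
        d = da - db - borrow
        borrow = 1 if d < 0 else 0
        result.append(d + 10 * borrow)
    # Strip leading zeros (but keep at least one digit).
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return result[::-1]
```

On the example, long_subtract([1,2,5,7,5], [3,4,8,9]) returns [9,0,8,6].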

2) I forgot

3) Given a keyboard where every letter maps to a digit from 0 to 9, return all possible letter permutations of a given n-digit number.

eg. 0 <- z,a,q,x,s,w
1 <- c,d,e
2 <- v,f,r
3 <- b,g,t
...
Then permutation of num 1230 will be:
cvbz
cvba
cvbq
...
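This is the reverse of the usual phone-keypad problem: each digit expands to its candidate letters, and the answer is the Cartesian product. A sketch using the partial mapping from the example (the real keyboard would cover all of 0-9):

```python
from itertools import product

# Partial digit-to-letters mapping taken from the example above.
KEYS = {'0': 'zaqxsw', '1': 'cde', '2': 'vfr', '3': 'bgt'}

def letter_combinations(number):
    # Cartesian product of the letter choices for each digit.
    return [''.join(p) for p in product(*(KEYS[d] for d in number))]
```

For "1230" this yields cvbz, cvba, cvbq, ... — 3 * 3 * 3 * 6 = 162 strings in total.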

4) There is a kind of number called a Fibonacci-like number (I forgot the specific name), which satisfies the following:
A. It can be split into several numbers;
B. The first two numbers are the same, and each following number equals the sum of the previous two.
eg. 112 (2 = 1 + 1), 12122436 (12 + 12 = 24, 12 + 24 = 36)
If the user gives you a range, print all such numbers in the range.
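A sketch of a checker for this property (I'm assuming, as in the examples, that the first two blocks must be identical and at least one sum block must follow):

```python
def is_additive(s):
    # Try every length k for the first block; the second block must equal
    # it, and every later block must be the sum of the previous two.
    n = len(s)
    for k in range(1, n // 2 + 1):
        first = s[:k]
        if s[k:2 * k] != first:
            continue
        a = b = int(first)
        i = 2 * k
        while i < n:
            nxt = str(a + b)
            if not s.startswith(nxt, i):
                break
            a, b = b, a + b
            i += len(nxt)
        else:
            if i > 2 * k:  # at least one sum block was consumed
                return True
    return False
```

Printing all such numbers in a range is then just `[n for n in range(lo, hi) if is_additive(str(n))]`.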

I finished the first and second parts in less than 1h, while Epic allows more than 2h. But I couldn't finish question 4 in time, even though I was given more than 110 min...

3/04/2012

Setting Up a Ruby on Rails Environment

http://ruby.railstutorial.org/chapters/a-demo-app#sec:planning_the_application
0. Install and configure git. See the official GitHub site for details.
1. Install Ruby. People online recommend rvm, which keeps gemsets installed in one place and makes it easy to switch between Ruby versions. Sadly, it's best not to install 1.8.7 and 1.9.3 at the same time.. I tried twice and rails could never be found, because the default version was 1.8.7; after I changed the default to 1.9.3 and restarted, it disappeared again. Too much trouble -- don't install multiple versions unless you really need them.
2. Install gem. You'll find that after installing Ruby 1.9.3 with rvm, the latest gem is already configured.
3. Install rails. Use gem install rails to install the latest version directly.
4. Now you can test it: create a project with rails new projectname, run bundle install, then run rails server.
5. I struggled for quite a while with how to push, pull, fetch, etc. in git, and I still haven't figured out what makes GitHub's version control so good. When committing, you can first run git push origin master, which shows which files can be added; use git add filename, then commit.
    Downloading from a repo to your local machine for the first time raises a tricky question: is the repo your own? If you are not an admin, there are generally two ways to collaborate for version control.
One is to fork the repo into your own and push commits there. The other is to create a shared branch and repo in the source project and push your work to the corresponding repo through that shared branch. Neither commits directly to master, because that hurts collaboration.

That's all for now. I haven't fully figured it out yet; I'll continue tomorrow.

3/03/2012

Interviews I have gone through

Since last semester I have been applying for summer internships in software development. Companies that interviewed me include Amazon, Google, Morgan Stanley, Bloomberg, Epic, Robustlinks (startup), Moonlyt (startup), and Douban. And I think I am still waiting for more interviews in the coming weeks.

Right now I only have offers from the startups and was rejected by all the larger companies. Below I will describe the interview process at each big firm, how I regard each firm, and some thoughts about why I didn't get the offers.

1. Amazon
I applied to Amazon in Oct 2011. I had two phone interviews, and after two or three weeks I got a third one. A week later I was told I was rejected.

Questions in the first interview:
What's a Hashmap? How to make a good Hashmap?
Given the pre-order and in-order arrays of a tree, can you reconstruct the tree without knowing the elements inside the arrays beforehand?
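The answer is yes: the first pre-order element is always the root, and its position in the in-order array splits left and right subtrees. A sketch (treating elements as distinct, which the classic version of the problem assumes):

```python
def build_tree(preorder, inorder):
    # The first preorder element is the root; its index in inorder
    # splits the remaining elements into left and right subtrees.
    if not preorder:
        return None
    root = preorder[0]
    k = inorder.index(root)
    left = build_tree(preorder[1:k + 1], inorder[:k])
    right = build_tree(preorder[k + 1:], inorder[k + 1:])
    return (root, left, right)
```

For example, preorder [1, 2, 3] with inorder [2, 1, 3] rebuilds the tree rooted at 1 with 2 on the left and 3 on the right.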

Question in the second interview:
We have a chess grid (table) which is 8x8; we'll index the cells from 0 to 7 both vertically and horizontally:
(0,0) is the top left
(7,7) is the bottom right and our "target"

An ant is situated in the top left corner (0,0) and needs to reach the bottom right corner (7,7).

The ant can only move forward toward the target, one square at a time, either horizontally or vertically.

More formally, the ant can move either:
from       (i,j) -> (i+1, j)  if  i+1 <= 7
or from    (i,j) -> (i, j+1)  if  j+1 <= 7
1) Write a program that displays all possible paths to the standard output.
2) How many paths are there?
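Every path is 14 moves of which exactly 7 are horizontal, so the count is the binomial coefficient C(14, 7) = 3432; enumeration is a simple two-branch recursion. A sketch:

```python
from math import comb

def paths(i=0, j=0, n=7, prefix=""):
    # Enumerate all monotone paths from (i, j) to (n, n)
    # as strings of moves: D = down, R = right.
    if i == n and j == n:
        return [prefix]
    result = []
    if i < n:
        result += paths(i + 1, j, n, prefix + "D")
    if j < n:
        result += paths(i, j + 1, n, prefix + "R")
    return result

# Count without enumerating: choose which 7 of the 14 moves go right.
assert comb(14, 7) == 3432
```

Printing each string in paths() answers part 1); the closed form answers part 2).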

Questions in the third interview:
1. Sum up all the prime numbers from 1 to 100.
2. Asked me what polymorphism is and what encapsulation is.
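For the record, the first question's answer is 1060; a quick sketch with trial division:

```python
def is_prime(n):
    # Trial division is plenty for n <= 100.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

prime_sum = sum(n for n in range(1, 101) if is_prime(n))  # 1060
```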

It was way too early at that time: my English was not fluent and I hadn't prepared anything. Thus the first interview was a mess, and I had no idea how Amazon asks questions or what I should answer.

2. Some technical company
Applied in November. Had two phone interviews on the same day in early December. Got rejected the next day.

First interview:
1. Write a function in Java which takes an array representing a number and returns the array incremented by one.

eg.
[2,4,7] -> [2, 4, 8]
[1, 0 , 0 , 3] -> [1 ,0 ,0,4]
[4,5,9] -> [4,6,0]
[1, 9 , 9] -> [2, 0 , 0]
[9,9] -> [1,0,0]
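A sketch of the increment (in Python for brevity, though the question asked for Java), carrying right to left:

```python
def plus_one(digits):
    # Add one to the least significant digit and propagate the carry.
    result = digits[:]
    for i in range(len(result) - 1, -1, -1):
        if result[i] < 9:
            result[i] += 1
            return result
        result[i] = 0  # 9 becomes 0, carry continues left
    # All digits were 9: prepend a leading 1.
    return [1] + result
```

This reproduces all five examples above, including the length-growing case [9,9] -> [1,0,0].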



2. You have a machine with a limited amount of memory: 1024 bytes.
Write the longest-running program that still terminates at some point.
long = as much time as possible

What's the running time of your code? How can you prove your code has the longest running time?

Second interview:
1. What's the function of the following code? And does it have any bugs?

public class Generator {
  private static final Map<byte[], byte[]> cache = new HashMap<byte[], byte[]>();

  public static byte[] generate(byte[] src) {
    byte[] generated = cache.get(src);
    if (generated == null) {
      synchronized (cache) {
        generated = cache.get(src);
        if (generated == null) {
          generated = doGenerate(src);
          cache.put(src, generated);
        }
      }
    }
    return generated;
  }

  private static byte[] doGenerate(byte[] src) {...}
}

2. Write code for the Fibonacci function.
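An iterative sketch of question 2 — O(n) time and O(1) space, which is usually what interviewers want instead of the naive exponential recursion:

```python
def fib(n):
    # Iterative Fibonacci: fib(0) = 0, fib(1) = 1.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```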

3. Morgan Stanley
Applied through careerNet. Got an on-campus interview and was rejected several days later.
Questions were pretty general; not many technical questions.


4. Bloomberg
1. Why Bloomberg?
2. Discussed some differences between Java and C++, then talked about the JVM
3. Given a number and a base, return the number converted to that base.
eg. number is 5, base is 3, then return 12
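A sketch of the conversion by repeated division (returning the digits as a string; I'm assuming bases of at most 10, matching the example):

```python
def to_base(n, base):
    # Repeatedly divide by the base; remainders are the digits in reverse.
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))
```

For the example: 5 in base 3 is "12", since 5 = 1*3 + 2.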
4. You have three baskets of fruit, each with a tag outside. One contains apples, one contains oranges, and one contains both. You know that all of them are tagged incorrectly. You can pick as many fruits as you want. How many picks do you need to determine the contents of all the baskets?

2/05/2012

Drawing Recursion Tree through LaTex

Finally, I learned how to draw recursion trees using LaTeX. There is a package called TikZ which can construct beautiful, clean pictures that can be embedded directly into LaTeX.
http://www.texample.net/tikz/
Below is a tree I drew for an algorithms homework:
\begin{tikzpicture}[level/.style={sibling distance=40mm/#1}]
\node (z){$n$}
  child {node (a) {$\frac{n}{3}$}
    child {node  (b) {$\frac{n}{9}$}
      child {node (b1) {$\vdots$}
      child {node (b11) {$D$}}
      }
      child {node (b2) {$\vdots$}
      child {node (b12) {$D$}}
      }
    }
    child {node (g) {$\frac{n}{9}$}
      child {node (g1) {$\vdots$}
      child {node (g11) {$D$}}
      }
      child {node (g2) {$\vdots$}
      child {node (g12) {$D$}}
      }
    }
  }
    child {node (d) {$\frac{n}{3}$}
      child {node  (e) {$\frac{n}{9}$}
        child {node (e1) {$\vdots$}
        child {node (e11) {$D$}}
        }
        child {node (e2) {$\vdots$}
        child {node (e12) {$D$}}
        }
      }
      child {node (f) {$\frac{n}{9}$}
        child {node (f1) {$\vdots$}
        child {node (f11) {$D$}}
        }
        child {node (f2) {$\vdots$}
        child {node (f12) {$D$}}
        }
      }
    }
  child {node  (j) {$\frac{n}{3}$}
    child {node (k) {$\frac{n}{9}$}
      child {node {$\vdots$}
      child {node (k11) {$D$}}
      }
      child {node {$\vdots$}
      child {node (k12) {$D$}}
      }
    }
    child {node (l) {$\frac{n}{9}$}
    child {node {$\vdots$}
    child {node (l11) {$D$}}
    }
    child {node (c){$\vdots$}
    child {node (l12) {$D$}
            child [grow=right] {node (r) {$n$} edge from parent[draw=none]
              child [grow=up] {node (s) {$\vdots$} edge from parent[draw=none]
                child [grow=up] {node (t) {$n$} edge from parent[draw=none]
                  child [grow=up] {node (u) {$n$} edge from parent[draw=none]
                  child [grow=up] {node (v) {$n$} edge from parent[draw=none]}
                  }
                }
              }
            }
            }
    }
  }
};
\path (b) -- (g) node [midway] {$\cdots$};
\path (e) -- (f) node [midway] {$\cdots$};
\path (k) -- (l) node [midway] {$\cdots$};
\path (b11) -- (b12) node [midway] {$\cdots$};
\path (g11) -- (g12) node [midway] {$\cdots$};
\path (e11) -- (e12) node [midway] {$\cdots$};
\path (f11) -- (f12) node [midway] {$\cdots$};
\path (k11) -- (k12) node [midway] {$\cdots$};
\path (l11) -- (l12) node [midway] {$\cdots$};
\end{tikzpicture}


1/24/2012

Stochastic Gradient Descent & non ML algo

These weeks I've been exploring Gradient Descent, a simple Machine Learning algorithm :D. SGD is an online learning method that keeps updating the weights while computing the loss of each candidate.

It's pretty intuitive how to test whether the model is working:
1. randomly pick a set of weights (or pick whatever you think reasonable);
2. produce scores and generate the candidates using those weights;
3. split the candidates into two parts, one for training and one for testing;
4. if the weights predicted after training are the same as or close to the real weights, the precision and recall should be high.

1. For example, weights = {a1, a2, a3}
2. score = weights[1]*x1 + weights[2]*x2 + weights[3]*x3;
We randomly generate a matrix of n rows of candidates with features x1, x2, and x3, then we can calculate the score.
3. Here I split the data 80% for training and 20% for testing.
4. After training with SGD, we get predicted weights {a1', a2', a3'}.
Now I can compare the weights and see whether the SGD model is doing a good job.
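The steps above can be sketched end to end in plain Python (no ML library): generate candidates with known weights, run SGD on squared error over the 80% training split, and check the recovered weights on the held-out 20%. The learning rate, epoch count, and function name are my own choices:

```python
import random

def sgd_recover(true_w, n=200, lr=0.01, epochs=50, seed=0):
    # Generate n candidates with random features and their true scores.
    rng = random.Random(seed)
    data = [[rng.uniform(-1, 1) for _ in range(len(true_w))] for _ in range(n)]
    scores = [sum(w * x for w, x in zip(true_w, xs)) for xs in data]
    # 80/20 split: train on the first 80%, test on the rest.
    cut = int(0.8 * n)
    w = [0.0] * len(true_w)
    for _ in range(epochs):
        for xs, y in zip(data[:cut], scores[:cut]):
            # One SGD step on the squared error of this candidate.
            err = sum(wi * xi for wi, xi in zip(w, xs)) - y
            w = [wi - lr * 2 * err * xi for wi, xi in zip(w, xs)]
    # Mean squared error on the held-out 20%.
    test_err = sum((sum(wi * xi for wi, xi in zip(w, xs)) - y) ** 2
                   for xs, y in zip(data[cut:], scores[cut:])) / (n - cut)
    return w, test_err
```

With noiseless synthetic scores the learned weights should land essentially on top of the true ones, which is exactly the sanity check described in step 4.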