Friday, November 6, 2009

Utility of Money and the St. Petersburg Paradox

Consider the following game:

We will flip a fair coin, until a tail appears for the first time.

  • If the tail appears in the first throw, you win $2^1=2$ dollars.
  • If the tail appears in the second throw, you win $2^2=4$ dollars. 
  • If the tail appears in the third throw, you win $2^3=8$ dollars.
  • ...
  • If the tail appears in the $n$-throw, you win $2^n$ dollars.
What is the amount of money that someone should risk to enter this game? (This question works best when given to a person that claims to never play a lottery, roulette, or any gambling game, because the expected return is lower than the bet.)

Computing the expected return of this game, we have:

$E=\frac{1}{2}\cdot 2+\frac{1}{4}\cdot 4 + \frac{1}{8}\cdot 8 + \cdots = 1+1+1+ \cdots =\infty$

In other words, the expected utility is infinity, and a rational player should be willing to gamble an arbitrarily large amount of money to enter this game.

Can you find anyone willing to bet $1,000 to play this game? Or $10,000? Or even $100? Yes, I did not think so.

This paradox is called the St. Petersburg Paradox, posed in 1713 by Nicholas Bernoulli and solved in 1738 by Daniel Bernoulli. Since then, a number of potential explanations appeared.

The most common approach is to use expected utility theory. In this case, we introduce a utility function $U(x)$, which describes the "satisfaction" that someone would get by having $x$ amount of money.

Utility of Money: The basic idea is that people do not bet based on the absolute amounts of the return but rather based on the utility of the award. The value of an additional $100 when I have $100 in the bank is much higher compared to the case when I have $1,000,000 in the bank. This means that the "utility of money" function is a concave function of the available funds.

Just for demonstration, below you can see such a concave utility-of-money function that we have computed as part of a research project:


This concavity also partially explains the "risk aversion" that most people have: they prefer certainty over uncertainty. This means that they will reject even a reasonable bet with positive expected return. Why? Notice that the utility gained by winning is smaller than the decrease in utility that results from losing the bet. The higher the concavity, the higher the risk aversion.

If you want to read more about utility of money and its applications to portfolio management, insurance, and analysis of other cases, take a look at this book chapter.

So, next time that someone claims never to engage in any bet with a negative expected return, give the setting of the Bernoulli paradox with the positive expected return and observe the reactions...

Sunday, October 25, 2009

What is the (Real) Cost of Open Access?

After the transformation of Communications of ACM, I find myself increasingly interested in the articles that are published in CACM. As expected, one of the common ways to demonstrate my interest is by sharing the URL for the paper, on Twitter, on Facebook, on the blog, or by sharing the link with friends and colleagues. Unfortunately, CACM has a closed-access policy, effectively preventing anyone without a ACM membership or without a university account from actually reading the papers. Same thing for papers published in conferences and journals, but there I can typically find the paper in the home page of the author. For CACM, this is often not the case.

Needless to say, I hate closed access policies. While I can understand the shortsightedness of for-profit publishers, I fail to see why ACM has not adopted at least a "semi" Open Access model, making, say, the current issue of Communications of ACM available to the public. Or by giving public access to papers published 10 or 20 years back in the different journals and conferences.

The stated goal of the association is to promote the field. By restricting access, ACM simply does not work towards this goal!

The main argument that I hear is that publishing has some costs. But I am really trying to understand what are these costs. What is the magnitude of these costs? And who is being paid? Almost like the health-care debate, we are told that something is expensive but we have no idea of who ends up getting the money.

Let's examine the potential cost factors:

Printing: I understand that printing on paper has costs. But covering the the cost of printing seems easy: Amortize it across the print subscribers. (Or even abolish print versions.)

Servers for distribution: What is the cost of electronically distributing papers? The cost of running a server, should not be a concern. At the worst case, NSF should provide funds for that. I find it hard to think that NSF would turn down a request for funding a server that provides open access to scientific journals!

Submission handling: The cost of the submission website? I doubt that it is above $5K per year, per journal. Ask for a nominal submission fee (say $50 per paper) to cover this. The cost for the copy-editors? We can do much better without them, thank you. (Seriously, why do we still have copyeditors?)

Admin cost: The only cost that I can think of is the cost of the admin staff. But how much is it? I honestly have no idea! Is it so high that the ACM member subscriptions cannot cover the cost? I am trying to find the budget of ACM but I cannot find anything public.

Are there other hidden costs?

If anyone has pointers or extra information, please let me know. I am really trying to understand the real costs of high-quality electronic publishing.

Sunday, October 11, 2009

When Noise is Your Friend: Smoothed Analysis

Have you ever encountered the phrase "the algorithm has exponential running time, in the worst-case scenario, but in practice we observed it to be pretty efficient"? It is the phrase that divides theoreticians and practitioners. Many theoretical computer scientists focus on the analysis of the worst case complexity, generating often results that contradict practice.

For example, the simplex algorithm for linear programming is well known to be pretty efficient in practice. In theory, the worst-case complexity of simplex is exponential, classifying the simplex algorithm as a "non-efficient" algorithm. However, simplex has exponential running time only for very special cases. Most practitioners would even argue that you will never encounter such strange cases in practice. Only an adversary could potentially design such inputs.

Similarly, the Traveling Salesman Problem is a hallmark example of an NP-complete problem, i.e., unlikely to have an efficient algorithm anytime soon. However, there are many implementations of TSP that can provide almost optimal solutions for TSP, for pretty big inputs.

K-means is another such algorithm. It has a horrible worst-case scenario but ask the millions of people that use it for clustering. One of the most efficient clustering algorithms, despite its wost-case exponential complexity.

So, how can we reconcile theory and practice?

A very nice approach towards this reconciliation is the case of smoothed analysis. I first learned about this approach for analyzing algorithms by attending the (fascinating) job talk of Jon Kelner. Jon showed that if you pertubate a little bit the input before feeding it to the simplex algorithm, then it is almost impossible for the pertubed input to generate an exponential running time. In other words, by adding a little bit of noise in the data, there is the guarantee that we avoid the "tricky" parts of the input space.

What is the beauty of this approach? It explains why in many cases "inefficient" algorithms work well in practice: Most real data contain noise, and this noise can actually be beneficial! The other big lesson is that sometimes an algorithm ends up having a horrible worst-case performance just due to a small number of potential inputs, that are almost adversarial. Adding noise, may take care of these strange cases.

The last issue of Communications of ACM, has a great review article by Spielman and Teng on Smoothed Analysis. Explains the difference between worst-case, average-case, and smoothed analysis, and points to a wide variety of problems that have been analyzed using this technique. Highly recommended!

Monday, September 28, 2009

Rationality, P=NP, Prediction Markets, and a Paradox

I got the idea for this post after reading the post of Dick Lipton on betting on the P=NP problem. The discussion in the comments was extensive, mainly touching the issues of risk aversion, the inability of humans to estimate properly small probabilities, and so on. (Wolfers and Snowberg argue that it is due to the inability of human to understand very small probabilities.) The discussion continued in the Overcoming Bias blog and there one of the comments, being tongue-in-cheek, caught my eye:

probabilities must agree with logic on certainly-true and certainly-false statements, which means that the probability of logical truths has to be 1, and logical falsehoods have to be 0.

So,if P=NP is a decidable problem, it is either true or false. So, a fully rational agent, participating in the market, should know whether P=NP. It is not a matter of probabilities! All the information to make the decision is available. So, if the market has one or more rational players, the market should converge to a price of 0 or 1 immediately, depending on the state of the problem. Right?

So, which of the following is true?

  • There are no rational agents. So, all the analysis of prediction markets that assume rationality of traders is incomplete.
  • There are rational agents. The market does not converge to 0 or 1 because the P=?NP problem is undecidable.
  • There are rational agents but the return from the risk-free rate until reaching the time to settlement exceeds the return from the market. So, the market gives information on how long it will take for the problem to be officially solved.
  • If your laptop cannot find the solution, neither can the market.
OK, back to more serious work.

Monday, September 14, 2009

Citation Tracker: Monitoring Citations to your Publications

One of the common pastimes of academics is checking services such as Google Scholar to see the number of papers that cite our work. Quite often the statistics from Google Scholar, or from other services such as Web of Science, are used to create a citation report that is used for promotion and tenure purposes.


While Google Scholar is extremely valuable for finding papers that cite a particular piece of work, it has some shortcomings, especially when creating a citation report for promotion. First, Google Scholar does not differentiate between peer-reviewed (journal, conference, or workshop papers), and other publications (such as tech reports, or term papers); so, when preparing a citation report, I have to go over the list of papers, keeping the "legitimate" citations and removing the citations that are not admissible. Second, Google Scholar is noisy sometimes, and lists twice the same paper, or splits citations for the same paper into two different entries; some other times it does not include papers that are possible to find through a web search.

Another feature that I would really like to see is the ability to find the "new" citations for a given paper, creating the appropriate alerts. A simple RSS feed would work wonders, but it is not there.

Of course, Google Scholar also does not monitor the web to find other types of documents that may mention a particular paper. PhD seminars, or even blog posts, are things that I would like to keep track of when monitoring who cites my own work. Especially for such volatile pages, I typically want to keep a copy so that I can retrieve them a few years later, when compiling my promotion packet.

For this reason, over the summer, I created a tool that can augment Google Scholar and monitor Google Scholar (and other services like Libra, CiteSeerX, SSRN), and also monitor the Web (Google, Bing, Ask) for mentions of the paper.

You can access a pre-alpha version at http://www.citation-tracker.com

Some of the features:
  • Import publications from Google Scholar, DBLP, BibTeX, and manually.
  • Review the citations for each paper, and decide which ones to keep, which to discard, and which ones to examine later.
  • Monitor citation services (Google Scholar, Libra, CiteSeerX, SSRN) and see notifications when new citations to your papers appear.
  • Generate automatically a citation report, listing the papers that cite your work.
I have been using the service over the last few weeks and it seems reasonably stable. I import my papers using Google Scholar, "accept" the existing citations, and then wait to see about the new citations that pop up every now and then. I find it pretty useful for finding new papers that cite my work.

Over the last few days I even started importing papers from other researchers that I consider relevant to my work, and for which I want to see what new papers cite them.

Feel free to login and play with the system. Needless to say, it is an early release so I expect to see bugs here and there. If you see any bug, or if you would like to see a new feature, please add a note using the "feedback" tab that is visible on the side of the screen.

Enjoy!