McEs, A Hacker Life: Links and Notes

Wednesday, August 30, 2006

Links and Notes

Monolith, the most crack project on sf.net I've ever seen. Apparently the author believes that if you zip an MP3 file with a password (or one-time-pad it, that is what Monolith does), the resulting file is not copyrighted by the copyright holder of the song, as it will not be statistically related to the MP3!
What colour are your bits? An interesting view on copyright issues of digital data, though inaccurate on the CS side in a few places.
The New Yorker story on Perelman and the Poincaré conjecture is a totally interesting read. Academic gossip :-).
Google Code Jam 2006. Apparently non-resident Iranians can take part.
The newly generated GNOME WorldWide Map is a lot denser in Iran than it was before. Rock on!
Linux distro timeline
Security Engineering - The Book

Older stuff that totally rocks:

Comments:

Hmm, Behdad! Did you first read the articles on copyright philosophy and then listen to all the Hayedeh on your playlist, or was it the reverse?! Take care of yourself, dude! (j/k)

# posted by

pooya : August 30, 2006 6:42 PM

I'm the author of the "Colour" article and would be interested to know what parts of it you think are "inaccurate on the CS side".

# posted by

Anonymous : August 30, 2006 7:42 PM

Hey Pooya. I normally resonate between Bob Dylan and Hayedeh...!!!

# posted by

behdad : August 31, 2006 1:03 AM

I guess that's something like resonating between bugzilla's OO bugs and copyright philosophy articles. (j/k again!)

# posted by

pooya : September 01, 2006 12:35 AM

Hi, great links and notes. Do you happen to know what the criteria are for Google Code Jam's "Qualification Round"?

# posted by

Anonymous : September 01, 2006 5:44 AM

Monia, it's a programming assignment. If you register, there are practice assignments you can take to get the feel of it.

# posted by

behdad : September 01, 2006 11:42 AM

OK, here are my comments on the Colour article.

There are statistical tests you can do; for instance, if you look at the file and discover that it contains a copy of the works of Shakespeare, then it doesn't look much like you would expect randomly generated numbers to look. But it could still be randomly generated. The test tells you whether the file has the statistical properties expected from randomly generated files, not whether the file really is randomly generated or not.

This is true as a statement, but the part "The test tells you whether the file has the statistical properties expected from randomly generated files" ignores a very important point: a random number generator doesn't have to be a uniform random number generator. The works of Shakespeare don't look much like randomly generated data if you are thinking about ASCII encoding and a uniform random number generator, but they don't look as bad when you consider a Markov-based random number generator trained on a huge pile of old English text (not including Shakespeare's at all).
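To make the uniform-versus-Markov point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy training sentence, the one-symbol-per-byte alphabet, the add-one smoothing, and the function names. It only shows that the same text gets a much higher log-probability under a character-level Markov (bigram) generator trained on English-like text than under a uniform generator.

```python
import math
from collections import Counter, defaultdict

ALPHABET_SIZE = 256  # assumption: one symbol per byte value


def uniform_log2_prob(text):
    """log2 probability of `text` under a uniform generator over bytes."""
    return -len(text) * math.log2(ALPHABET_SIZE)


def train_bigram(corpus):
    """Count character-to-character transitions in a training corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts


def bigram_log2_prob(text, counts):
    """log2 probability of `text` under the bigram model (add-one smoothing)."""
    if not text:
        return 0.0
    total = -math.log2(ALPHABET_SIZE)  # first character: treated as uniform
    for prev, nxt in zip(text, text[1:]):
        row = counts.get(prev, Counter())
        # Smoothing keeps unseen transitions at a small nonzero probability.
        total += math.log2((row[nxt] + 1) / (sum(row.values()) + ALPHABET_SIZE))
    return total


# Toy training text (hypothetical, deliberately not a quotation from anyone).
CORPUS = ("better not to order or to bet on the note; "
          "he or she ought to be near or not near the border")
MODEL = train_bigram(CORPUS)
SAMPLE = "to be or not to be"

if __name__ == "__main__":
    print("uniform:", uniform_log2_prob(SAMPLE))
    print("bigram: ", bigram_log2_prob(SAMPLE, MODEL))
```

A real experiment would train on a large corpus and likely use a higher-order model; the toy corpus here is only big enough to show the direction of the effect.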

It's not even correct to say "the probability of this being from a random generator is very low" because that's not true

It actually is true for most random number generators. The statement, however, is not a very informative one, as it holds for too many sequences. For example, with a uniform random number generator, the probability of generating any particular N-bit sequence is the same very low value, 2^-N.
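A small sketch of the 2^-N point, assuming a fair-coin bit generator (the function names are made up for illustration): every sequence of a given length gets exactly the same probability, whether it looks like garbage or like a pattern, so "low probability" alone singles out nothing.

```python
from fractions import Fraction
from itertools import product


def uniform_sequence_probability(bits):
    """Probability that a fair-coin generator emits exactly this bit sequence."""
    return Fraction(1, 2 ** len(bits))


def all_sequences(n):
    """Every bit sequence of length n, as tuples of 0/1."""
    return list(product((0, 1), repeat=n))


if __name__ == "__main__":
    n = 4
    probs = {seq: uniform_sequence_probability(seq) for seq in all_sequences(n)}
    # All 2**n sequences share one probability, and together they sum to 1.
    print(set(probs.values()), sum(probs.values()))
```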

- it either was or was not randomly generated, that's not open to probability.

This is a frequentist statistician's view. There is another view, called the Bayesian view, that treats a probability as the degree to which a person believes a proposition. This is a very intuitive notion. For example, if you were not in Vancouver yesterday and have not read or heard any news, you cannot answer the question "did it rain in Vancouver yesterday?" with certainty, but given the time of the year, you have an idea of what the answer probably is. In the winter, I would say there's a 70% probability that it rained yesterday in Vancouver, while in the summer, it would be less than 10%.

The same is true of the question "was this text generated by a random number generator?". You can definitely form a judgment by looking at the text. If it's the works of Shakespeare, you would say the probability that this text was generated by a random number generator is very, very low. If it looks like total garbage, you would say "yes, it's quite possible that the data was generated by a uniform random number generator". To understand how this works, one can go back and use Bayes' rule: given that you have a copy of the works of Shakespeare in a file on your computer, did it come from a uniform random number generator, or did somebody copy it from another file? The probability of the random number generator generating this file is less than 2^-10,000, given that the file is a few pages long. On the other hand, if someone copies a random file off the internet for you, the probability that it's the works of Shakespeare (given that we know it is available somewhere on the internet) is greater than 2^-100 (assuming gazillions of computers with gazillions of files on each). Comparing those two numbers, you deduce that the probability that the file you have was generated by the random number generator is essentially zero.
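The comparison above can be sketched with Bayes' rule in log space, since numbers like 2^-10,000 underflow ordinary floats. The two likelihood bounds come from the comment; the 50/50 prior between "random generator" and "copied from the internet" is purely an assumption for illustration, as are the function and parameter names.

```python
import math


def log2_posterior_rng(log2_like_rng, log2_like_copy, prior_rng=0.5):
    """log2 of P(random generator | file), for two competing hypotheses.

    Assumes 0 < prior_rng < 1; the default prior is an illustrative guess.
    """
    a = math.log2(prior_rng) + log2_like_rng        # log2 P(rng, file)
    b = math.log2(1.0 - prior_rng) + log2_like_copy  # log2 P(copy, file)
    # log2(2**a + 2**b), computed stably by factoring out the larger term.
    m = max(a, b)
    log2_evidence = m + math.log2(2.0 ** (a - m) + 2.0 ** (b - m))
    return a - log2_evidence


if __name__ == "__main__":
    # Likelihood bounds taken from the text: 2^-10,000 versus 2^-100.
    print(log2_posterior_rng(-10_000, -100))
```

With the bounds from the text and an even prior, the posterior for the random-generator hypothesis comes out around 2^-9,900, which is the "essentially zero" in the comment. Changing the prior shifts that exponent only slightly unless the prior itself is astronomically lopsided.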

Check out David MacKay's book for similar discussion.

# posted by

behdad : September 10, 2006 11:11 PM

Well, I'm disappointed that you characterized my entire article as "inaccurate" based on one paragraph, and I don't think even that paragraph is inaccurate. Yes, a copy of the complete works of Shakespeare will look more plausible as random number generator output if you assume a generator with an output distribution slanted towards English text of that era, instead of uniform over some larger set, like same-length binary files. But it still won't be plausible that a random generator with any significant amount of entropy in its output will generate Hamlet (the exact text, not just "something similar to Hamlet") without being specifically designed to do so.

As for the Bayesian interpretation of probability:

- I am aware there are multiple interpretations possible.
- I do not subscribe to the one you describe, and I do not agree that it's easier to understand. If you want to assign a number to how much you believe a statement, great, but then it's important to be clear that the uncertainty is in your understanding of whether the event occurred, not in whether the event actually did occur.
- It isn't relevant to what my article was about.
- Rewriting it in the terminology of Bayesian probability would not affect the conclusions I drew.
- In an informal article that is not about the philosophy of probability theory, I don't think it's necessary or useful to attempt to present some kind of balanced exposition of ways to interpret the meaning of probability including significant coverage of interpretations I don't agree with myself.
- I'm not sure that this "debate" even matters, because it seems rather metaphysical. If I'm building for instance a data compressor, my code is going to wind up doing the same thing regardless of whether I believe that the input file is a random oracle with certain probabilities of generating ones or zeroes, or a fixed non-random string about which I am uncertain whether the next unread bit is a one or a zero. Since it appears to depend on our personal beliefs, not on anything we could verify by experiment, and there doesn't seem to be any experimental outcome that would cause either of us to change our views of what probability is, when you get right down to it, I think this is a religious question, not a scientific one.

# posted by

Anonymous : September 11, 2006 12:59 AM
