art with code

2009-08-03

EU court rules 11-word snippets can violate copyright


Eleven words, eleven words.. yes.
Let's see.

/usr/share/dict/words has 98569 words. That to the eleventh power is 8533827430813090265537515496031670135739632890386559769 different 11-word sentences. Indexing 98569 words takes 17 bits per word (2^17 = 131072), so each 11-word sentence takes 17 * 11 = 187 bits.

Which gives us a keyspace size of 2*10^44 terabytes. A bit too much.
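The arithmetic above can be redone in a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes decimal terabytes (1 TB = 10^12 bytes) and a fixed-width encoding where every word gets the same number of bits.

```python
VOCAB_SIZE = 98569          # word count the post reports for /usr/share/dict/words
WORDS_PER_SENTENCE = 11


def full_keyspace(vocab_size=VOCAB_SIZE, words=WORDS_PER_SENTENCE):
    """Return (sentence count, bits per sentence, total size in decimal TB)."""
    # Python ints are arbitrary precision, so the 55-digit count is exact.
    sentences = vocab_size ** words
    # Smallest fixed width that can index every word in the vocabulary.
    bits_per_word = (vocab_size - 1).bit_length()
    bits_per_sentence = bits_per_word * words
    # Assumption: 1 TB = 10**12 bytes.
    terabytes = sentences * bits_per_sentence / 8 / 10**12
    return sentences, bits_per_sentence, terabytes


sentences, bits, terabytes = full_keyspace()
print(f"{sentences} sentences at {bits} bits each")
print(f"about {terabytes:.1e} TB")
```

Swapping in a different word count or sentence length shows how quickly the total blows up: each extra word multiplies the keyspace by the vocabulary size.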

But if we manage to reduce the degrees of freedom in the sentence through grammatical modeling and some other data mining magic, while reducing our vocabulary to common words... Let's say 5 free words from a reduced vocabulary of 1024.

1024^5 = 1125899906842624 different sentences. At 10 bits per word, each 11-word sentence now weighs 10 * 11 = 110 bits.

So, 15.5 petabytes for the keyspace size.

Assuming that we can generate this stuff as fast as the I/O system allows, we can imagine buying some machine time off the Amazon cloud to bruteforce it. If one machine can manage 50 MB/s write, and we use a cluster of 1000 machines, the total combined write speed will be 50 GB/s. That makes for a total time of 310000 seconds, or 86 hours, at a cost of 1000 machines * 0.1 euros per machine-hour.

Giving us the cost for plausible total copyright control over the English language:
8611 euros for computing time and 1.25 million euros for the 15,500 1TB hard disks to store the resulting document.
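For anyone who wants to poke at the numbers, here is a rough sketch of the whole estimate, from the reduced keyspace to the bill. The constants are the ones used in this post; the function name and the decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) are assumptions made for the sketch.

```python
import math

FREE_WORDS = 5                    # free word slots per sentence
VOCAB_SIZE = 1024                 # reduced vocabulary
WORDS_STORED = 11                 # each sentence is still written out as 11 words
BITS_PER_WORD = 10                # 2**10 == 1024

MACHINES = 1000
WRITE_BYTES_PER_S = 50 * 10**6    # 50 MB/s per machine (decimal MB assumed)
EUR_PER_MACHINE_HOUR = 0.1


def estimate():
    """Return (total bytes, seconds, euros of machine time, 1 TB disks needed)."""
    sentences = VOCAB_SIZE ** FREE_WORDS
    total_bytes = sentences * WORDS_STORED * BITS_PER_WORD / 8
    seconds = total_bytes / (MACHINES * WRITE_BYTES_PER_S)
    cost = seconds / 3600 * MACHINES * EUR_PER_MACHINE_HOUR
    disks = math.ceil(total_bytes / 10**12)
    return total_bytes, seconds, cost, disks


total_bytes, seconds, cost, disks = estimate()
print(f"{total_bytes / 10**15:.2f} PB, {seconds:.0f} s, "
      f"{cost:.0f} euros of machine time, {disks} disks")
```

Run unrounded, this lands at roughly 8600 euros; the 8611 figure comes from rounding the write time up to 310000 seconds before converting to hours.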

(Well, you can compress the document to less than a kilobyte by writing a program that generates every 50-bit number, but what is the legal power of that?)
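Such a program might look something like the sketch below. It treats each 50-bit number as five 10-bit indices into the vocabulary. The vocabulary here is a hypothetical placeholder (`w0` to `w1023`), and the grammatical template that would turn five free words into a full 11-word sentence is left out, since that part is hand-waved above.

```python
from itertools import islice

FREE_WORDS = 5
BITS_PER_WORD = 10
VOCAB_SIZE = 1 << BITS_PER_WORD   # 1024

# Hypothetical placeholder vocabulary; a real one would hold 1024 common words.
VOCAB = [f"w{i}" for i in range(VOCAB_SIZE)]


def decode(n):
    """Split a 50-bit integer into five words, most significant index first."""
    if not 0 <= n < VOCAB_SIZE ** FREE_WORDS:
        raise ValueError("n must fit in 50 bits")
    mask = VOCAB_SIZE - 1
    indices = [(n >> (BITS_PER_WORD * k)) & mask
               for k in reversed(range(FREE_WORDS))]
    return [VOCAB[i] for i in indices]


def all_sentences():
    """Lazily yield the free words for every 50-bit number, in order."""
    for n in range(VOCAB_SIZE ** FREE_WORDS):
        yield " ".join(decode(n))


# Only peek at the first few; the full run is the 15.5 PB document.
for sentence in islice(all_sentences(), 3):
    print(sentence)
```

The generator never holds more than one sentence in memory, which is the whole point: the program is tiny, and only its output is petabytes.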

And now, with our very own copy of bruteforced English, we can sue the pants off every stinking anglo and anglo-wannabe out there. COPYRIGHT BANZAI!

4 comments:

Anonymous said...

Won't you be violating gazillions of already written sentences?

Ilmari Heikkinen said...

Yes, how does that change things? Everyone still infringes on your copyrights even if your stuff infringes on the copyrights of others. Stealing from a thief is still a crime...

Anonymous said...

If he comes up with something someone else already wrote separately, I don't think that's a violation in the US.

Ken Demarest said...

It is not necessary to generate all of them, since that is so expensive. Simply generate all the ones with "THE", "SALE", "NEW", and "NOW". That should keep you in clover for years!
