Talk is Cheap

Recently, I posted a little treatise on the truth behind the myth of the irregular word-length distribution along the line.

I had developed the theory that the first word in each line has to be longer than average, due to word-wrap effects that happen automatically: a long word that no longer fits at the end of a line is pushed to the start of the next one. Such a higher-than-average word length does show up in the VM, but to my dismay I was unable to consistently verify the existence of such an effect in natural-language text. Mark Twain’s “Tom Sawyer”, which I had used as a reference, failed to exhibit it.
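For readers who want to try the check themselves, here is a minimal sketch of how one might measure it. It is not the script I or anyone else actually used; the function name `first_word_effect`, the default line width, and the use of Python's standard `textwrap` module as a stand-in for a scribe's line breaks are all assumptions made for illustration.

```python
import textwrap


def first_word_effect(text, width=60):
    """Greedy-wrap `text` at `width` characters and return a pair:
    (mean length of line-initial words, mean length of all other words).

    The first word of the very first line is counted with the "other"
    words, because no wrap put it there.
    """
    lines = textwrap.wrap(text, width=width)
    first, rest = [], []
    for i, line in enumerate(lines):
        words = line.split()
        if not words:
            continue
        if i == 0:
            # Nothing was wrapped onto the opening line.
            rest.extend(len(w) for w in words)
        else:
            first.append(len(words[0]))
            rest.extend(len(w) for w in words[1:])

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(first), mean(rest)
```

To use it, read a plain-text file into a string and call `first_word_effect(text, width=...)`. How large the difference comes out, if any, will depend on the text and on the chosen width, and punctuation attached to words is counted as part of the word length here.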

Ger Hungerink then pointed out to me that he had independently had the same idea before. Like me, he had tested it on “Tom Sawyer” (what is it with that book?), and he did find the first-word effect where I had failed to find any.

Now, this is obviously a bit confusing: two people doing statistics on the same raw data ought to achieve at least vaguely the same results, one would think.

It turned out that the main difference was that while I had retained the complete text, Ger had removed the dialogue from it. I had at first worried about the legitimacy of this step, but became convinced that a) the VM is, with a high degree of probability, prose rather than dialogue, and b) dialogue will exhibit far more interjections and incomplete sentences, and will differ from regular text in other ways as well.

Now I did the same as Ger had done, and the result is below:

