Recently, I posted a little treatise on the truth behind the myth of the irregular word-length distribution along the line.
I had developed the theory that the first word in each line has to be longer than average due to word-wrap effects, which happen automatically. Such a higher-than-average word length shows up in the VM, but to my dismay I was unable to consistently verify the existence of such an effect in natural-language text: Mark Twain’s “Tom Sawyer”, which I had used as a reference, failed to exhibit it.
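The kind of measurement involved can be sketched roughly as follows. This is only a minimal illustration, not the script actually used for the figures in this post: the function name `mean_length_by_position`, the line width of 60 characters, and the decision to strip surrounding punctuation before counting are all assumptions. The idea is that a greedy word-wrap pushes any word that no longer fits onto the next line, and long words are the ones most likely not to fit.

```python
import string
import textwrap
from collections import defaultdict


def mean_length_by_position(text, width=60):
    """Greedily wrap `text` and return the mean word length per position in the line.

    Index 0 of the result is the average length of the first word of each line,
    index 1 that of the second word, and so on.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    # Keep words intact: never split long or hyphenated words across lines.
    lines = textwrap.wrap(
        text, width=width, break_long_words=False, break_on_hyphens=False
    )
    for line in lines:
        pos = 0
        for token in line.split():
            word = token.strip(string.punctuation)  # assumption: ignore punctuation
            if not word:
                continue  # token was pure punctuation, e.g. a lone dash
            totals[pos] += len(word)
            counts[pos] += 1
            pos += 1
    return [totals[p] / counts[p] for p in sorted(counts)]
```

Feeding it the full text of a book, read from whatever file one has at hand, gives one average per word position. The later positions are based on fewer and fewer lines, so they become increasingly noisy.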
Ger Hungerink then pointed out to me that he had had the same idea earlier, independently of me. He had tested it on “Tom Sawyer”, just as I had (what is it with that book?), and he did find the first-word effect where I had failed to find any.
Now, this is obviously a bit confusing: two people doing statistics on the same raw data ought to arrive at vaguely the same results at least, one would think.
It turned out that the main difference was that while I had retained the complete text, Ger had removed the dialogue from it. I had at first worried about the legitimacy of this step, but became convinced that a) the VM is, with a high degree of probability, prose rather than dialogue, and b) dialogue will exhibit far more interjections and incomplete sentences, and will differ from regular text in other ways as well.
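How exactly the dialogue is removed is not spelled out here, so the following is only one conceivable way of doing it, under loudly stated assumptions: paragraphs are separated by blank lines, dialogue is marked by straight or curly double quotation marks, and any paragraph containing such a mark is dropped as a whole. The name `drop_dialogue_paragraphs` is made up for illustration.

```python
import re

# Assumption: dialogue is marked by straight or curly double quotation marks.
QUOTE_CHARS = '"\u201c\u201d'


def drop_dialogue_paragraphs(text):
    """Return `text` without the paragraphs that contain a quotation mark."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    kept = [p for p in paragraphs if not any(q in p for q in QUOTE_CHARS)]
    return "\n\n".join(kept)
```

This is a blunt filter: it also throws away the narration inside a paragraph that happens to contain a quoted phrase, and it will miss dialogue marked in any other way.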
Now I did the same as Ger had done, and the result is below:
What you see is the distribution of the average word-length across the line, once for “Tom Sawyer” with dialogue (blue line), and once without it (red line).*) It is immediately obvious that while the dialogue-laden text (blue) starts with a shorter-than-average first word (4.32 chars), the dialogue-less text (red) starts with 4.72 chars as the average for the first word. The blue line then rises, while the red line drops, meaning that both distributions begin to approach each other. (The dialogue-laden text still retains shorter words overall, probably simply because spoken language tends to use phrases like “Hypertrophoid disperambulations paraphernalialize perseverant antagonists” less often than “Oi, Joe, get up!”.)
In other words, once we concentrate on prose, the first-word peak in the word length, as predicted in the paper, is found in natural-language texts, as is the overall droop of the word length towards the end of the line. That the effect is found in the VM as well can now be explained as a natural consequence, rather than as an artefact of a particular enciphering scheme. This leaves the second-word length dip, as it is observed in the VM, as the major unexplained phenomenon.
*) Please disregard the far right of the diagram, where statistical noise plays the dominant role.