Recently, I posted a little treatise regarding the truth behind the myth of the irregular word-length distribution along the line.
I had developed the theory that the first word in each line has to be longer than average due to word-wrap effects which happen automatically. Such a higher-than-average word length shows up in the VM, but to my dismay I was unable to consistently verify the existence of such an effect in natural language text: Mark Twain’s “Tom Sawyer”, which I had used as a reference, failed to exhibit it.
Ger Hungerink then pointed out to me that he had had the same idea earlier, independently of me. He had tested it, like I had, on “Tom Sawyer” (What is it with that book?), and he did find the first-word effect where I had failed to find any.
Now, this obviously is a bit confusing — two people doing statistics on the same raw data ought to achieve at least vaguely the same results, one would think.
It turned out that the main difference was that while I had retained the complete text, Ger had removed the dialogue from it. I had at first worried about the legitimacy of this step, but became convinced that a) the VM with a high degree of probability is prose rather than dialogue, and b) dialogue will exhibit way more interjections, incomplete sentences, and also in other ways differ from regular text.
Now I did the same as Ger had done, and the result is below:
What you see is the distribution of the average word-length across the line, once for “Tom Sawyer” with dialogue (blue line), and once without it (red line).*) It is immediately obvious that while the dialogue-laden text (blue) starts with a shorter-than-average first word (4.32 chars), the dialogue-less text (red) starts with 4.72 chars as the average for the first word. The blue line then rises, while the red line drops, meaning that both distributions begin to approach each other. (The dialogue-laden text still retains shorter words overall, probably simply because spoken language tends to use phrases like “Hypertrophoid disperambulations paraphernalialize perseverant antagonists” less often than “Oi, Joe, get up!”.)
In other words, once we concentrate on prose, the first-word peak in the word length, as predicted in the paper, is found in natural language texts, as is the overall droop of the word length towards the end of the line. That the effect is found in the VM as well can now be explained as a natural consequence, rather than as an artefact of a particular enciphering scheme. This leaves the second-word length dip, as it is observed in the VM, as the major unexplained phenomenon.
*) Please disregard the far right of the diagram, where statistical noise plays the dominant role.
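For readers who want to reproduce this kind of curve, here is a minimal sketch of the tally behind it. It is not the script used for the diagram: the function name `mean_word_length_by_position`, the `max_positions` cutoff, and the stripping of ASCII punctuation are all assumptions made for illustration. The input is assumed to be a list of already-wrapped text lines.

```python
import string
from collections import defaultdict

def mean_word_length_by_position(lines, max_positions=12):
    """Map word position in the line (1 = first word) to mean word length.

    Hypothetical helper: punctuation handling and the position cutoff are
    illustrative choices, not taken from the original analysis.
    """
    totals = defaultdict(int)   # summed word lengths per position
    counts = defaultdict(int)   # number of words seen per position
    for line in lines:
        # Split on whitespace and strip ASCII punctuation from both ends.
        words = [w.strip(string.punctuation) for w in line.split()]
        words = [w for w in words if w]
        for pos, word in enumerate(words[:max_positions], start=1):
            totals[pos] += len(word)
            counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(counts)}
```

Positions far to the right are reached by few lines, so their averages rest on small counts, which is where the noise mentioned in the footnote comes from.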
13 thoughts on “Talk is Cheap”
“Tom Sawyer” dialog is peculiar. In most texts dialog should make no appreciable difference if it is wrapped along with other text. I included dialog in “Huckleberry Finn”, the source text in the second link.
Word length tendencies in several texts.
>This leaves the second-word length dip, as it is observed in the VM, as the major unexplained phenomenon.
Why words at LP02 tend to be shorter, on average, than the next few words on the same line in wordwrapped text.
Quote: “I had at first worried about the legitimacy of this step, but became convinced that a) the VM with a high degree of probability is prose rather than dialogue, and b) dialogue will exhibit way more interjections, incomplete sentences, and also in other ways differ from regular text.”
The actual reason for me was that the lines containing dialogue almost never wrap. Like:
“I don’t know”, he said.
At the time I wrote it down as follows:
“Since Tom Sawyer contains a lot of short lines (conversations) that are not wrapped, I excluded them. This was done simply by assuming lines shorter than 60 characters not to be word wrapped, and omitting them.”
So it was not because these were conversations, but because the lines were too short and did not form paragraphs, unlike the paragraphs in the VMs I used for comparison. The fact that dialogue tends to start with short words like I, no, it, … makes it worse.
In my article, mentioned above, I even corrected for this in the hypothesis for paragraphs by excluding the first (and last) word as well, since they are not affected by word wrap either.
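The filtering step quoted above can be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name `drop_unwrapped_lines` and the `min_length` parameter are made up here, and only the 60-character threshold comes from the quote.

```python
def drop_unwrapped_lines(lines, min_length=60):
    """Keep only lines long enough to be assumed word wrapped.

    Hypothetical helper: lines shorter than `min_length` characters
    (ignoring trailing whitespace) are treated as unwrapped and omitted.
    """
    return [line for line in lines if len(line.rstrip()) >= min_length]
```

A short line of dialogue is dropped by this filter, while a full-width prose line is kept.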
This is very interesting, and all new to me. I wonder what results you’d get if your model text was Dickens rather than Twain. Also, what would happen if you used different languages? Are there any which habitually begin with … I don’t know … an agglutination of noun and verbal form? If you ran comparisons, how would you decide the control?
I have looked at the influence of token length distribution and length transition from one token to the next in various texts. An analysis would be a difficult project. Here are graphical results for sample texts in English, Spanish, Vulgate, German, Interlingua, Greek, Latin, and VMs.
A control, if needed, could be from combining the results of many texts. Graphically, that would give a fairly regular curve from which individual texts vary. The Poisson distribution that Ger described might be a better standard.
Thank you for the serious response.
Words of the VM are heavily cut by the drawing on the verso of the page with a graphic symbol, generally “H”, “g” or “8”, so you can’t deduce anything about their length before restoring their original form. A better method is to suppress all the blanks… When I do this, I find something like a periodic key that could give their genuine length, but for the moment I don’t believe in causes of the strength of the graphic code. When we have suppressed all those codes, perhaps your analysis will work better…
First page and the drawing on its verso… https://www.facebook.com/photo.php?fbid=338699589570404&set=pb.100002910983152.-2207520000.1356412527&type=3&theater
You can note that very, very often a line is followed by a space… so we can say that the drawing gives a part of the length of words.
I just had a bizarre idea about this book and felt the need to share it with someone who may be able to bring this idea to light and perhaps test it. Something I heard about a particular ancient language suggests that, when spoken, the letters of this language will cause sand on a plate to vibrate in such a way that the sand will take on the shape of the written character that represents the spoken letter. Now, to further my hypothesis, let us find the individual sounds that will cause sand to vibrate in this manner to form each of the individual characters contained within the Voynich manuscript. Once we have the sounds associated with each character, let us then compare the text when spoken in this “language” to known languages from our history. Perhaps with this approach we can finally decode this book that has so long been a mystery.
I’m afraid you may have misunderstood Chladni’s figures a bit: http://en.wikipedia.org/wiki/Chladni_patterns#Chladni_figures In any case, the shape of the patterns would be dictated by the shape of the original resonating plate, so without having this “key”, I’m afraid backtracing the original sounds would be impossible.
Surely small words would stand a greater chance of being typeset into their preceding line? Wouldn’t this – a typesetting artefact – yield some kind of on-average-slightly-longer-first-word phenomenon in printed books? Just a thought! ;-)
Yes, this is actually what we observe: Short words have a tendency to remain at the end of a line, long words are wrapped around and move to the beginning of the following line.
The effect is evident with simple ASCII word wraps, and also seems to happen on handwritten pages.
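The ASCII word-wrap effect can be checked with a small simulation. The sketch below is a toy model under stated assumptions, not the analysis from the article: the "text" is made of random words with lengths drawn uniformly from 1 to 10, wrapping is done with Python's standard `textwrap.wrap`, and the names `random_text` and `first_word_vs_overall` are invented for this example.

```python
import random
import textwrap

def random_text(n_words, seed=0):
    """Build a toy text of 'words' with uniformly random lengths 1..10."""
    rng = random.Random(seed)
    return " ".join("x" * rng.randint(1, 10) for _ in range(n_words))

def first_word_vs_overall(text, width=40):
    """Return (mean length of line-initial words, mean length of all words).

    The very first line is skipped because its first word was not placed
    there by wrapping. Assumes the text wraps into at least two lines.
    """
    lines = textwrap.wrap(text, width=width, break_long_words=False)
    first_lengths = [len(line.split()[0]) for line in lines[1:]]
    all_lengths = [len(word) for word in text.split()]
    return (sum(first_lengths) / len(first_lengths),
            sum(all_lengths) / len(all_lengths))
```

Calling `first_word_vs_overall(random_text(5000))` returns the two averages side by side. Since a long word is more likely than a short one to overflow the remaining space, the line-initial average should come out above the overall average in this toy setup.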
The article mentioned above deals with the possibility that shorter words might be more likely to fit in the remaining line space and finds that that does not work: