Talk is Cheap

Recently, I posted a little treatise on the truth behind the myth of the irregular word-length distribution along the line.

I had developed the theory that the first word in each line has to be longer than average because of word-wrap effects, which happen automatically. Such a higher-than-average first-word length shows up in the VM, but to my dismay I was unable to consistently verify the existence of such an effect in natural-language text: Mark Twain’s “Tom Sawyer”, which I had used as a reference, failed to exhibit it.
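To make the word-wrap argument concrete, here is a minimal Python sketch of a greedy wrap. The function name `greedy_wrap`, the toy word list and the width of 8 are assumptions chosen for illustration; this is not the procedure used for the statistics in this post. A word that does not fit into the space left on a line moves to the start of the next line, which is why long words are more likely to end up in first position.

```python
def greedy_wrap(words, width):
    """Greedy word wrap: keep a word on the current line if it fits,
    otherwise start a new line with it. Illustrative helper only."""
    lines, current, used = [], [], 0
    for word in words:
        # Space needed: the word plus one separating blank
        # if the line already holds at least one word.
        needed = len(word) + (1 if current else 0)
        if current and used + needed > width:
            # The word does not fit: close the line and wrap the word.
            lines.append(current)
            current, used = [], 0
            needed = len(word)
        current.append(word)
        used += needed
    if current:
        lines.append(current)
    return lines


wrapped = greedy_wrap("aa bb cccccccc dd e".split(), width=8)
first_words = [line[0] for line in wrapped]
# wrapped     -> [['aa', 'bb'], ['cccccccc'], ['dd', 'e']]
# first_words -> ['aa', 'cccccccc', 'dd']
```

In this toy input the long word is pushed to the start of the second line, while the short word "e" still fits at the end of the third.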

Ger Hungerink then pointed out to me that he had had the same idea earlier, independently of me. Like me, he had tested it on “Tom Sawyer” (What is it with that book?), and he did find the first-word effect where I had failed to find any.

Now, this obviously is a bit confusing: two people doing statistics on the same raw data ought to achieve at least vaguely the same results, one would think.

It turned out that the main difference was that while I had retained the complete text, Ger had removed the dialogue from it. I had at first worried about the legitimacy of this step, but became convinced that a) the VM is, with a high degree of probability, prose rather than dialogue, and b) dialogue exhibits far more interjections and incomplete sentences, and differs from regular text in other ways as well.
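As a rough sketch of how dialogue can be dropped, the snippet below applies the rule Ger describes in his comment further down: lines shorter than 60 characters are assumed not to be word-wrapped and are omitted. The function name `drop_short_lines` and the two sample lines are hypothetical, and the exact filtering behind the figure in this post may differ.

```python
def drop_short_lines(lines, min_length=60):
    """Keep only lines with at least min_length characters.
    Shorter lines are assumed to be unwrapped (mostly dialogue)."""
    return [line for line in lines if len(line.rstrip("\n")) >= min_length]


sample = [
    '"I don\'t know", he said.',  # short, unwrapped dialogue line
    "This is a made-up prose line that is long enough to have been "
    "wrapped by the typesetter.",  # synthetic stand-in for a wrapped line
]
kept = drop_short_lines(sample)
```

Only the long, made-up prose line survives the filter; the short dialogue line is discarded.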

Now I did the same as Ger had done, and the result is below:

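[Figure "sawyer_dialogue": average word length by word position in the line for “Tom Sawyer”, with dialogue (blue line) and without dialogue (red line).]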

What you see is the distribution of the average word-length across the line, once for “Tom Sawyer” with dialogue (blue line), and once without it (red line).*) It is immediately obvious that while the dialogue-laden text (blue) starts with a shorter-than-average first word (4.32 chars), the dialogue-less text (red) starts with 4.72 chars as the average for the first word. The blue line then rises, while the red line drops, meaning that both distributions begin to approach each other. (The dialogue-laden text still retains shorter words overall, probably simply because spoken language tends to use phrases like “Hypertrophoid disperambulations paraphernalialize perseverant antagonists” less often than “Oi, Joe, get up!”.)
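A distribution of this kind can be computed along the following lines. This is only a sketch under stated assumptions: the function name `mean_word_length_by_position` is made up for illustration, words are split on whitespace, and punctuation is counted as part of the word, which may not match the tokenization behind the figure.

```python
from collections import defaultdict


def mean_word_length_by_position(lines):
    """Average word length at each word position in the line
    (position 1 = first word of a line)."""
    totals = defaultdict(int)  # sum of word lengths per position
    counts = defaultdict(int)  # number of words seen per position
    for line in lines:
        for position, word in enumerate(line.split(), start=1):
            totals[position] += len(word)
            counts[position] += 1
    return {p: totals[p] / counts[p] for p in sorted(counts)}


profile = mean_word_length_by_position(
    ["wrapped lines go", "here as plain", "strings"]
)
# profile -> {1: 6.0, 2: 3.5, 3: 3.5}
```

Positions far to the right are reached by only a few lines, so their averages rest on very few words. That is the statistical noise mentioned in the footnote.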

In other words, once we concentrate on prose, the first-word peak in the word length, as predicted in the paper, is found in natural-language texts, as is the overall droop of the word length towards the end of the line. That the effect is found in the VM as well can now be explained as a natural consequence of word wrap, rather than as an artefact of a particular enciphering scheme. This leaves the second-word length dip, as it is observed in the VM, as the major unexplained phenomenon.

*) Please disregard the far right of the diagram, where statistical noise plays the dominant role.

12 thoughts on “Talk is Cheap”

  1. Pingback: The (dys-) Functional Line « Thoughts about the Voynich Manuscript

  2. “Tom Sawyer” dialog is peculiar. In most texts dialog should make no appreciable difference if it is wrapped along with other text. I included dialog in “Huckleberry Finn”, the source text in the second link.

    Word length tendencies in several texts.
    http://notakrian.pbworks.com/w/page/LPWL%201

    >This leaves the second-word length dip, as it is observed in the VM, as the major unexplained phenomenon.

    Why words at LP02 tend to be shorter, on average, than the next few words on the same line in wordwrapped text.
    http://notakrian.pbworks.com/w/page/LPWL%203

  3. Quote: “I had at first worried about the legitimacy of this step, but became convinced that a) the VM with a high degree of probability is prose rather than dialogue, and b) dialogue will exhibit way more interjections, incomplete sentences, and also in other ways differ from regular text.”

    The actual reason for me was that the lines containing dialogue almost never wrap. Like:
    “I don’t know”, he said.

    At the time I wrote it down as follows:
    “Since Tom Sawyer contains a lot of short lines (conversations) that are not wrapped, I excluded them. This was done simply by assuming lines shorter than 60 characters not to be word wrapped, and omitting them.”

    So it was not because they were conversations, but because the lines were too short and not paragraphs, like the paragraphs in the VMs I used for comparison. The fact that dialogue tends to start with short words like I, no, it, … makes it worse.

    In my article, mentioned above, I even corrected this by hypothesis for paragraphs by excluding the first (and last) word as well, since they are not affected by word wrap either.

  4. This is very interesting, and all new to me. I wonder what results you’d get if your model text was Dickens rather than Twain. Also what would happen if you used different languages. Are there any which habitually begin with … I don’t know.. an agglutination of noun and verbal form. If you ran comparisons how would you decide the control?

  5. Words in the VM are often cut by the drawing on the verso of the page with a graphic symbol, generally “H”, “g” or “8”, so you cannot deduce anything about their length before restoring their original form. A better method is to suppress all the blanks… When I do this, I find something like a periodic key that could give their genuine length, but for the moment I don’t believe in the causes of the strength of the graphic code. Once we have suppressed all those codes, perhaps your analysis will work better…

  6. I just had a bizarre idea about this book and felt the need to share it with someone who may be able to bring this idea to light and perhaps test it. Something I heard about a particular ancient language suggests that, when spoken, the letters of this language will cause sand on a plate to vibrate in such a way that the sand takes on the shape of the written character that represents the spoken letter. To further my hypothesis, let us find the individual sounds that will cause sand to vibrate in this manner to form each of the individual characters contained in the Voynich manuscript. Once we have the sounds associated with each character, let us compare the text, when spoken in this “language”, to known languages from our history. Perhaps with this approach we can finally decode this book that has been a mystery for so long.

  7. Surely small words would stand a greater chance of being typeset into their preceding line? Wouldn’t this – a typesetting artefact – yield some kind of on-average-slightly-longer-first-word phenomenon in printed books? Just a thought! ;-)

    • Hi Nick,
      Yes, this is actually what we observe: Short words have a tendency to remain at the end of a line, long words are wrapped around and move to the beginning of the following line.
      The effect is evident with simple ASCII word wraps, and also seems to occur on handwritten pages.
