The (dys-) Functional Line

Dear all,

Once more into the breach: I decided I would take another shot at the ever-elusive properties of the Voynich Manuscript, this time setting my sights on the line statistics, where curiouser and curiouser properties of the word length distribution have been observed in the past.

I had made up my mind to examine these effects more closely, and especially to answer the question of whether the line is a “functional unit” (i.e., whether it “plays a role”) in the enciphering of the VM or not.

Here is the resulting paper:

the_voynich_line (ca. 500kB)

It’s about twelve pages. Click “more” or see below if you want the short answer.

Comments and suggestions welcome, as always!

I’m as confused as ever. As always, the results are not as clear-cut as one would wish them to be, but it really seems as if the first word of a line is longer, and the second word shorter, than average. A drop of the word length along the line could mostly be discounted, though.

Update: I have repeated Ger Hungerink’s tests and now obtain the same results as he did, which also show the ominous first-word peak. Read more about it here. I will update the paper as soon as I find the time.

15 thoughts on “The (dys-) Functional Line”

  1. Very nice work, Elmar. Most interesting. The second word length effect must go down as yet another Voynich “WTF?”

  2. Hi Rich,

    You are too harsh…? Never!

    >
    > “To my knowledge, there exists no contemporary line-based scheme that
    > had actually been in use and satisfies the other statistical features
    > of the VM, especially the »grammar« which governs word composition, nor
    > has in modern times a system been suggested which would result in the
    > structure of the VM text as it is observed.”
    >
    >
    > Perhaps there is no scheme which specifically relies on it, but I think
    > that the concept of lines as units can easily exist within other
    > schemes. As for a system which would result in the observed structure,
    > there are I think several. There is Bacon’s bilateral cipher, which
    > would work using features of characters, such as differing height or
    > other, which do not get observed/counted in standard counting methods,
    > but which would exist happily in the structure as observed, and even
    > arguably account for some of those observations.
    > …

    I’m not sure if we’re talking the same concept here. If, for example the first-word peak, second-word dip and other observed features are the result of the enciphering, then it appears logical that the plaintext was enciphered in “chunks” which were the size of a line.
    Now, I know of no contemporary scheme which would use such chunks. Of course, it is *possible* to imagine such a scheme, and to superimpose other schemes (like Bacon’s bilateral) on top of that. But we have no documentation for that. In particular, I don’t know of even a *modern* system which would be line-based *and* exhibit the first-word peak, second-word dip, etc.

    So, while a line-based enciphering can’t be ruled out, I also see nothing speaking for it; and this was the challenge of the “line as a functional unit”.

    Especially, taking the varying line widths of the VM into account, the scribe would have to measure out the length of each subsequent line on the vellum, then use a chunk of the appropriate size of the plaintext, encipher it, and hope it would really fit into the gaps in the VM. This doesn’t sound impossible, but cumbersome.

    >
    > Numerical codes, either Arabic or Roman, in combination with either
    > nulls and shorthand or both, could account for “word” and sentence
    > structure in many ways.

    In theory, yes, but AFAIK nobody was able to come up with a plausible system which would match the VM word grammar in detail.

    (I’m a bit unsure what you mean by “sentence”. My calculations were based on the physical line, which is not necessarily the same as a grammatical “information unit”.)

    > 6-4:
    >
    > “Thus, it’s not the position on the line which reduces the word length,
    > but the small word length which only makes high line positions possible
    > in the first place.”
    >
    >
    > I don’t agree. It is a “chicken egg” problem, and from your
    > observations, I feel I could come to this conclusion, contrary to yours,
    > “When longer lines are possible, word length necessarily decreases,
    > especially toward the end of the sentence. This is not necessary on
    > smaller lines, with less information, as it all fits in a smaller space.”

    This is a legitimate interpretation, but OTOH if one can show that the word-length/line-length relationship appears naturally, then there is no need to interpret this as an artefact of the scribe’s psychology.

    > 8-6:
    >
    > “This can easily be put to the test, since all our arguments are equally
    > valid for »natural« text, and we made no particular assumptions about
    > the VM. Thus, the effects should be observed in the same way when
    > applying the statistics to regular text. To this end, we used
    > manuscripts in various languages,11 namely
    >
    > * Mark Twain: Tom Sawyer (english)
    >
    > * Mark Twain: Tom Sawyer (german translation)
    >
    > * Jules Verne: 20000 Lieues sous les mers (french)”
    >
    >
    > I see a problem with the use of these texts. The line length is
    > controlled by the rules of “justification” for typesetting, which differ
    > greatly from the justification in hand-written works like the Voynich
    > Ms. To make a valid comparison, I think, one would have to compare the
    > word counts in Twain’s and Verne’s manuscripts.

    My statistics (which are somewhat suspect; see my mail from today to Ger on the list regarding the different results from the same plaintext) are based on an ASCII version of the text which I reformatted in a text editor. So, there really are no typographic or aesthetic features left to consider, only word length and language structure.

    >
    >
    > 12-7:
    >
    > “I guess what I’m experiencing is the most familiar feeling for VM
    > researchers of them all: »I really don’t know what to make of it.«
    >
    > Regarding the three statistical effects (or »claims«) we have presented
    > in section 1.1, we have come to the following conclusions:
    >
    >
    > * Although we could suggest a statistical mechanism for the »first word
    > effect« (section 1.1, claim 1), we were mostly unable to corroborate
    > this hypothesis with a test on natural language.
    >
    >
    > * The »second word effect« (claim 2) with a dip in the word length of
    > the second word is undeniably there in the VM. We could neither provide
    > an idea how this effect may have originated in the VM, nor could we find
    > similar effects occurring in natural language texts.”
    >
    >
    > Me: Again, for these two, the use of comparisons of printed texts to the
    > written VMs I feel nullifies the results… unless you have a comparison
    > between manuscripts and printed texts which shows they are sufficiently
    > similar, and my thoughts on this problem are incorrect.
    >

    See above, I didn’t use printed text (in which case your argument would hold true) but a digitized version.

    >
    > “* We think we have demonstrated how effect 3, the continuous drop of
    > the average word length towards the end of a text line, occurs naturally
    > as the result of text composition along lines, namely that short words
    > will result in lines with more words, and thus higher word counts.”
    >
    >
    > Another objection to this conclusion I have (in addition to the ones I
    > have noted, above) is that you purposely set up the test to disallow the
    > space available to the writer as a factor, when I feel it is probably
    > the one factor which would explain all the observations (example p.6,
    > “Bear in mind that we use the word »line length« to refer to the total
    > number of words on one line, not to its physical width in centimetres or
    > such.”)

    I didn’t mean to rule out or disallow the psychological aspect. What I wanted to show was that it’s unnecessary to invoke features of the enciphering mechanism to arrive at the observed ciphertext features, but that these ciphertext features could spring naturally from writing text, and would be present just the same in handwritten plaintext. This complements rather than contradicts your idea of a psychological origin.

    Sorry for the brevity, but I’m tired and still would like to get the answer out on the web tonight. (*And* write comprehensibly… ;-)

    Cheers, and thanks for the feedback,

    Elmar

    1. “So, while a line-based enciphering can’t be ruled out, I also see nothing speaking for it; and this was the challenge of the “line as a functional unit”.”

      I still see it differently, but I understand your point. You are correct that while I can muse on such a system that would explain the specific effects noted in the line unit, there is no system that can be pointed to that would create them. But I would argue that the presence of these effects is in itself strong evidence of the line as a separate unit.

      “Especially, taking the varying line widths of the VM into account, the scribe would have to measure out the length of each subsequent line on the vellum, then use a chunk of the appropriate size of the plaintext, encipher it, and hope it would really fit into the gaps in the VM. This doesn’t sound impossible, but cumbersome.”

      I don’t think that follows at all. In any scheme with multiple choices of encoding characters… and there are several such schemes… an encoder would naturally make those choices which would fit the space available. This would cover many numerical codes, in which one has a choice of encoding individual letters of plaintext, or parts of words, and in some, whole words. Also, abbreviations would occur to an encoder just as they do to someone trying to fit a thought into a text message, or twitter post. If one has few ideas and more space, they can be generous; if the idea is more complex, it is time to be more efficient with characters.

      But one more thing… while I do strongly feel that your observations, and the observations of others, as to “word” and line structure point to each line as an encoded unit, I don’t mean to say that the thought does necessarily begin and end on a line, as a unit.

      I also believe that the line unit structure supports free-form gibberish… that it was simply scrawled out, a line at a time, and the effects seen are simply a product of the human imagination in creating it. That ought to be more fully tested… I would love to see a room full of people creating glossolalia, and then study how that comes out. It might surprise us how close it seems to voynichese… or not, which would also be valuable.

      > Numerical codes, either Arabic or Roman, in combination with either
      > nulls and shorthand or both, could account for “word” and sentence
      > structure in many ways.

      “In theory, yes, but AFAIK nobody was able to come up with a plausible system which would match the VM word grammar in detail.”

      I think a numerical code of 10 numbers, with nulls and abbreviations such as plurals and such, can fit very well with what we see. But yes, I’ve been too lazy to finish up a working example. That sounds really chicken**** when I type it out, but I’ll leave it! It was like saying “I could be president if I tried!”.

      “My statistics (which are somewhat suspect; see my mail from today to Ger on the list regarding the different results from the same plaintext) are based on an ASCII version of the text which I reformatted in a text editor. So, there really are no typographic or aesthetic features left to consider, only word length and language structure.”

      But how did you wrap that ASCII text? Did it come with formatted wraps, or were the formatting options left to you, even though it was raw text? And was the ASCII taken from the manuscripts, or from the typeset books, so that it still contains the typesetter’s choices for book format? That really is a question, not a criticism… I’m trying to determine how the text you used can really be compared to Twain’s manuscript, or to any manuscript’s line spacing and wrap choices, which I still contend would skew the comparison. If the breaks are where Twain put them, then of course not, as it would reflect the manuscript choices.

      I don’t mean to sound negative, or to be needlessly argumentative. Maybe my feedback is heavily weighted by preconceptions I have built from previous observations, and perhaps they are wrong.

      1. > But one more thing… while I do strongly feel that your observations, and the observations of others, as to “word” and line structure point to each line as an encoded unit…

        Actually, au contraire, I think Ger and I showed that the line structures observed can be explained with the process of writing natural language on lines of varying width (with the exception of the “second-word dip”), necessitating no further assumptions about the enciphering, so I guess we’re actually mostly in line with you.

        As for the reformatting of the text, I used the text as provided by gutenberg.org. ASCII lines were formatted to, I think, 80 characters/line (all characters the same width), with the exception of paragraph ends which were retained from the original print. I reformatted this with a text editor to 62 char/line, where the only criterion which leads to a word wrap is that by leaving the word on the old line, its length would exceed 62 chars.
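        (For the curious, the wrap criterion can be sketched in a few lines of Python; this is just an illustration of the rule described above, not the exact editor operation I used:)

```python
def wrap(text, width=62):
    """Greedy word wrap: a word moves to the next line only when keeping
    it would push the current line past `width` characters."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > width and current:
            lines.append(current)  # close the current line
            current = word         # the word starts the next line
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

sample = "tom appeared on the sidewalk with a bucket of whitewash and a long handled brush"
for line in wrap(sample, width=30):
    print(line)
```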

        So, there really are no typographical considerations left. Any aesthetic or psychological imprints Twain may have left in writing his MS (I think he already used a typewriter, didn’t he?) have already been erased in the scanning process (book -> ASCII), even before the reformatting to 62 chars, which worked on a purely mechanical basis.

        With the exception of the second-word dip, it really looks like the word length properties can be explained without resorting to any particular enciphering scheme.

      2. Hi Elmar:

        ” I reformatted this with a text editor to 62 char/line, where the only criterion which leads to a word wrap is that by leaving the word on the old line, its length would exceed 62 chars.”

        Then that does settle it… the effects seen are a result of average word length related to line length. I still think it would be interesting to compare manuscripts, and also typewritten texts, (he did? I wonder if it still exists… google time), to see how they all compare.

        Thanks for the feedback-feedback. As always I keep a copy of your work, in the event anything occurs to me of note.

  3. Elmar,

    Re. the second word on a line.

    Did you remove line breaks before you formatted the text to a margin?

    Generally, in any ordinary text, a longer than average word is more likely to be followed by a shorter word than by a longer word or a word of the same length. A shorter than average word is more likely to be followed by a longer word than by a shorter word or a word of the same length. Each language has its own likelihoods for the lengths of adjacent words, so the effect can be strengthened (Greek?) or weakened (Latin?).

    Since it follows a longer than average word, the second word tends to be of average or below average length. The third word will tend to be longer than the second word. The influence is not as strong as that of the wordwrapped first word. Thereafter, until the precipitous decline that occurs most often at 55% of the way to the end of the longest line (the line with the most words) in a text, it is easier to see the influence of the language of the text. Longer-shorter-longer-shorter-longer becomes longer-shorter-shorter-longer-same-shorter, for instance.

    The wordwrap model applies regardless of the margin selected. If the margin varies somewhat, as in quire 20, it still applies, even though quire 20 has different “sub-languages”.
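    (For anyone who wants to check these position-by-position averages on their own texts, the bookkeeping is only a few lines; a minimal sketch, where position 0 is the first word of each line:)

```python
from statistics import mean

def column_means(lines):
    """Mean word length per line position; a line shorter than a given
    position simply contributes nothing there (a "blank cell")."""
    cols = {}
    for line in lines:
        for pos, word in enumerate(line.split()):
            cols.setdefault(pos, []).append(len(word))
    return [mean(cols[p]) for p in sorted(cols)]

# Three made-up lines, just to show the mechanics
text = ["daiin chedy qokeedy ol", "shedy qokain ol", "qokeedy chedy"]
print(column_means(text))
```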

    The VMs is very different from lines of poetry (interesting in itself). It is also very different from text that is artificially line-wrapped at punctuation. The “structural unity of lines” was a perspicacious concept when first announced, but it is now obsolete: in some ways the line can still be considered a unit, but not in what we address here.

    Although the second word effect is certain, it is too weak for anyone without a computer to have noticed. This has implications that cannot be ignored in a theory about how the glyphs were sequenced.

    Sorry if I missed some posted information. I’ll look more closely later.

    1. Is the first word usually longer than the average? If the distribution of word lengths has a non-flat, peaked shape, then a longer than average word is more likely to be followed by a shorter word, just by simple statistics, I suppose. Although perhaps it depends on whether we are talking about the mean or the median.
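       (This intuition is easy to check with a toy simulation: even for completely independent word lengths, conditioning on the first word being above the mean makes the following word shorter in most cases. A sketch, with word lengths drawn uniformly from 1 to 10:)

```python
import random

random.seed(42)
# i.i.d. word lengths, uniform on 1..10 (mean 5.5): no real correlation at all
lengths = [random.randint(1, 10) for _ in range(100_000)]
m = sum(lengths) / len(lengths)

# Among pairs whose first word is above average, how often is the next shorter?
pairs = [(a, b) for a, b in zip(lengths, lengths[1:]) if a > m]
frac_shorter = sum(b < a for a, b in pairs) / len(pairs)
print(f"next word is shorter {frac_shorter:.0%} of the time")  # roughly 70%
```

       So a first-word surplus followed by a relative dip needs no correlation between neighbouring words at all; the VM’s second-word dip is remarkable because it falls below the *overall* average, not just below the first word.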

      1. Julian, the first word is *longer* than average, the second is not only shorter than the first word, but shorter than the *average*, too. That’s something that is not immediately obvious from a stochastic point of view. Figure 6 of the paper shows it best, I think.
        “Average” is always calculated as the arithmetic mean.

    2. Averaging columns from the beginning of lines does not give averages for “last words” and penultimate words. For that, the text has to be right justified or reversed. When we do that, we won’t have average lengths of the first few words. Confusion arises when, conceptually, we try to merge the results of the two procedures. There is a reason for the second word dip below the next few words which we don’t need to go into now or here.
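       (Mechanically, this just means reversing each line before the column bookkeeping, so that column 0 is the last word of every line; a small sketch:)

```python
from statistics import mean

def column_means(lines, from_end=False):
    """Mean word length per column; with from_end=True the lines are
    right-aligned, so column 0 is the *last* word of each line."""
    cols = {}
    for line in lines:
        words = line.split()
        if from_end:
            words.reverse()  # align lines at their right edge
        for pos, w in enumerate(words):
            cols.setdefault(pos, []).append(len(w))
    return [mean(cols[p]) for p in sorted(cols)]

text = ["daiin chedy qokeedy ol", "shedy qokain ol", "qokeedy chedy"]
print(column_means(text, from_end=True))
```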

  4. I compare wordlength means in columns, ignoring blank cells. Medians will work if that is what we want to see.

Leave a comment