A String of Luck?

In my neverending quest to find out the truth about the Stroke theory, my lovely wife Sina the other night had a splendid idea. Up to now I’d always tried to dissect the existing ciphertext into the hypothetical “syllables” (which, as the theory goes, compose the ciphertext, with each “syllable” of between one and four ciphertext characters representing the strokes with which one of the plaintext letters would have been drawn); a question not unrelated to the knapsack problem, and equally difficult to treat, especially in view of the fact that we can’t even be sure about the ciphertext character repertoire.

Now, Sina’s idea was not to look for the smallest building blocks which make up the VM, but for large ones. The idea is that long identical sequences of plaintext letters would result in long identical sequences of ciphertext.

Of course, this is easier said than done if you know neither the plaintext nor even the plaintext language. But one can try to make reasonable guesses and bludgeon them to death by number crunching. For example, it’s not unreasonable to assume the plaintext language to be Latin, especially if we start with the idea that the VM might be a “late fake.” Then, the first step is to run a sufficiently large Latin plaintext through a little program and check for characteristic sequences.

Doing so for Caesar’s “De bello gallico” renders a few suspicious features (like a large number of “Caesar”s), but also some promising ones. To name the first which caught my eye, the most frequent three-letter string in “De bello gallico”, after tossing aside all whitespace and punctuation, is “ere”. (In plaintext of roughly 150k, it shows up 850 times.) This is useful, because the first and the third letter are the same, which makes it easier for us to look for a corresponding ciphertext sequence. So, for example, if plaintext “e” translates into the stroke sequence [Q1], and “r” into [Q2], there ought to be a correspondingly frequent ciphertext sequence [Q1][Q2][Q1], the only problem being that we know neither the length of [Q1] nor of [Q2].
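The “little program” can be very small indeed. The following is only a sketch of the idea, not the program actually used: the function name `ngram_counts` and the sample sentence are made up for illustration, and it assumes the Latin text uses plain ASCII letters only.

```python
import re
from collections import Counter

def ngram_counts(text, n=3):
    """Count all overlapping n-letter strings in a text."""
    # Keep letters only, lower-cased: whitespace and punctuation are tossed aside.
    letters = re.sub(r"[^a-z]", "", text.lower())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

# Short stand-in sample; a real run would read the full Latin text from a file.
sample = "Gallia est omnis divisa in partes tres"
print(ngram_counts(sample).most_common(5))
```

Because whitespace is removed before counting, strings which straddle a word boundary are counted as well, in line with the description above.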

It is also useful because I had to postulate that before the actual enciphering, the plaintext was modified into CaMeLcAsEpLaInTeXt. (If it hadn’t been, Robert Firth would not have found around 50 constituent syllables, but only 25.) That means that “ere” would have been rendered “eRe” or “ErE”, but in either case the first letter is the same as the third. If the frequent string had been “eer”, that would have been nearly useless, because it’d be turned into “eEr” or “EeR” for enciphering, with all three plaintext letters (ie ciphertext sub-sequences) being different, which would make them much more difficult to recognize.
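To make the camelcasing argument concrete, here is a small sketch. It assumes strict letter-by-letter alternation between lower and upper case, which is a simplification, and the function name `alternate_case` is invented for this example.

```python
def alternate_case(letters, start_upper=False):
    """Alternate lower/upper case letter by letter (assumed strict alternation)."""
    out = []
    upper = start_upper
    for ch in letters:
        out.append(ch.upper() if upper else ch.lower())
        upper = not upper
    return "".join(out)

# Compare the two possible phases for "ere" and for "eer".
for word in ("ere", "eer"):
    for start_upper in (False, True):
        cased = alternate_case(word, start_upper)
        print(word, "->", cased, "| first == third:", cased[0] == cased[2])
```

Under this assumption “ere” keeps its first and third letters identical in both phases, while “eer” ends up with three different symbols either way.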

Now the big question is: into how many strokes were “e/E” and “r/R” broken down?

If the camelcasing happened to render “eRe”, then it’s reasonable to assume that [Q1] (“e”) has a length of two characters, with [Q2] (“R”) having three, ie the “suspicious” ciphertext string should have a length of 2+3+2=7 characters, with the last two being the same as the first two.

Ceteris paribus, “ErE” should occur about as often (under the assumption that the camelcasing works more or less randomly). “E” could be decomposed into either two strokes (a “C”-shaped crescent, and a hyphen “-”, if the author used cursive writing as a model), or four (a vertical line and three horizontal bars, if the author used block letters). “r” is difficult to pin down and could be two strokes (a short vertical slash, and a little “comma” on top), or up to three or four strokes, if using one of the German “r”s of the time. (Though IMHO it’s rare to see Latin text written in German letters of the time.) So, “ErE” would translate into something looking like [Q1][Q2][Q1] with [Q1] being three or four letters long, and [Q2] being two or three, so the whole thing would be a string between eight and eleven letters.
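Searching for such [Q1][Q2][Q1] strings is mechanical once the candidate lengths are fixed. The sketch below is one possible way to do it, not a tested tool: `aba_candidates` is a hypothetical helper, the ciphertext is taken to be a plain string of transcription characters, and the demo string is a made-up toy, not real VM text.

```python
from collections import Counter

def aba_candidates(cipher, q1_lengths, q2_lengths):
    """Count substrings of the form Q1+Q2+Q1 for the given candidate lengths."""
    hits = Counter()
    for a in q1_lengths:          # assumed length of Q1
        for b in q2_lengths:      # assumed length of Q2
            total = 2 * a + b
            for i in range(len(cipher) - total + 1):
                chunk = cipher[i:i + total]
                # The first a characters must repeat as the last a characters.
                if chunk[:a] == chunk[a + b:]:
                    hits[(a, b, chunk)] += 1
    return hits

# Toy ciphertext (invented), searched for the "eRe" case: 2 + 3 + 2 = 7 characters.
toy = "abXYZabqqabXYZab"
for (a, b, chunk), n in aba_candidates(toy, [2], [3]).most_common():
    print(a, b, chunk, n)
```

For the “ErE” case one would pass `[3, 4]` and `[2, 3]` instead. Keying the counter on the lengths as well as the string keeps the different length hypotheses apart.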

That still gives us a lot of leeway. What else can we do to determine suitable candidates?

One thing is the absolute frequency. We have seen that “ere” occurs 850 times in 150,000 letters; in other words, one occurrence of “ere” happens about every 175 plaintext letters. Now, Currier “B” makes up about 150,000 ciphertext characters.*) (That would mean that it is equivalent to between 75,000 and 50,000 plaintext characters, assuming that each plaintext letter is represented by between two and three ciphertext characters, on average.) Now, since “ere” was supposedly modified into “eRe” and “ErE” with equal probability, that reduces their frequency to about 1/350 each. Likewise, with the ciphertext “bloated” compared to the plaintext by a factor of somewhere between 2 and 3, the frequency of each [Q1][Q2][Q1] is expected to drop to between 1/700 and 1/1050 per ciphertext character, or, in other words, each should show up very roughly 150 to 200 times in the ciphertext.
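The chain of estimates can be written out as a short calculation, which makes it easy to see how sensitive the result is to each guess. This is only a back-of-the-envelope sketch: the function name and parameter names are invented here, and the inputs are simply the counts quoted above taken at face value.

```python
def expected_hits(count, plain_len, cipher_len, bloat, case_variants=2):
    """Rough expected number of ciphertext hits for one cased variant."""
    rate_plain = count / plain_len              # occurrences per plaintext letter
    rate_variant = rate_plain / case_variants   # split between "eRe" and "ErE"
    rate_cipher = rate_variant / bloat          # per ciphertext character
    return rate_cipher * cipher_len

# 850 hits in 150,000 plaintext letters; Currier "B" taken as 150,000 characters.
for bloat in (2, 3):
    print("bloat", bloat, "->", round(expected_hits(850, 150_000, 150_000, bloat)))
```

Changing the bloat factor or the assumed size of the ciphertext moves the answer considerably, so the output is an order of magnitude at best.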

But here we are treading on very thin ice, and the uncertainties make guesses like these almost pointless, starting with the question of what constitutes an individual ciphertext character.

Anyway, I think this is a promising lead to see whether it’s possible to match certain plaintext sequences to ciphertext strings.

*) The whole examination was limited to Currier “B” text, BTW, to avoid problems with the mix between languages, which might well indicate different enciphering schemes and would thus blur the statistics.

5 thoughts on “A String of Luck?”

  1. One unfortunate possibility though (while I digest the rest: I have to read these several times, sometimes, before it properly soaks in): You write:
    “That means that “ere” would have been rendered “eRe” or “ErE”, but in either case the first letter is the same as the third. If the frequent string had been “eer”, that would have been nearly useless, because it’d be turned into “eEr” or “EeR” for enciphering, with all three plaintext letters (ie ciphertext sub-sequences) being different…”.

    But I think that we can’t assume that “ere” would have only been rendered as “eRe” or “ErE”. Just as “eer” might mix cases for the “e”, then wouldn’t it also be possible for “ere” to be “Ere” or “erE”? If so, you run into the same problem you fear for “eer”, again making all three different.

    1. Unless, I got that wrong, and there is some reason to believe that the switching of cases must alternate character-to-character?

  2. No, Rich, you’re quite right: If the camelcasing happens randomly, then the system must fail.
    But Robert’s observation hints that “prefixes” and “suffixes” alternate, ie the letters alternate between upper and lower case.
    Of course, as always, this doesn’t seem to be the *whole* truth. In my observation, the “prefixes” seem to be optional, while the “suffixes” are mandatory; in other words, a VM word consists of a required suffix, preceded by an optional prefix. The assumption is that the prefixes are always lower-case, the suffixes upper-case (or the other way around, but in a fixed arrangement), which would make some sense, and even be parsable by the above statistics.

  3. Suggestions.
    1. Obtain an average ratio of substitutions by comparing rank vs. frequency charts of VMs and Latin, VMs and German, and other languages. You may find that VMs 5-grams match Latin 3-grams or find some other match.
    2. Work with Language-A. It has a lower ratio. Then compare Language-B. The difference might lead to some tentative conclusions.
    3. Refer to:
    (a) http://www.voynich.nu/extra/wordent.html
    “Voynichese is nearly as information-rich as Julius Caesar’s Latin, and significantly more so than the Vulgate version of Genesis.”
    “Voynichese is less information-rich than Latin in the first two characters of each word, but compensates by greater variability in the trailer.”
    (b) http://www.dcc.unicamp.br/~stolfi/voynich/98-07-09-local-entropy/
    “…each letter xi is painted with a color whose brightness increases monotonically with its local information content vi.”
    4. Choose your weapon:
    5. Enlist some cryptanalysts and mathematicians for alternate solutions.
    Pure intuition, FWIW (often dead wrong): the VMs word parts do seem to have an approximately alternating nature.

    1. Hi Knox,
      Thanks for your suggestions!
      Your last observation hit a particular note with me, “the VMs word parts do seem to have an approximately alternating nature”: Exactly this observation drove me to assume that the plaintext was converted to camelcase, into alternating upper- and lower-case letters before enciphering. Obviously, this wasn’t consistently done, but I assume a VM word consists of a mandatory upper-case (or lower-case) plaintext letter, preceded by an optional lower-case (or upper-case) letter.
