The thing with the VM is that it offers precious little in the way of “hard” information to help us decipher it, so we tend to cling to whatever shred of fact we can find to guide us along. First and foremost among these are the statistics on the VM, but using them also brings the danger of relying too heavily on the results.
For example, to a high degree of certainty, we will have gotten our transcription wrong somewhere; not in the sense of writing down some individual error, but of consistently misidentifying letters as different or separate where they are truly the same or compound, or vice versa. As long as we don’t know whether “iiin” is one letter or four, whether “ch” is one letter, two different letters, or just “cc” in disguise, and as long as we can’t even be sure whether there are one, two, or four different gallows, all our estimates of character frequencies and word lengths rest on very shaky ground. And hence, so do all our results.
So, while I’m all for statistical tests (after all, they’re all we’ve got, right?), and while I’m still wishing for a “universal statisticator” which would spit out the essential statistical parameters for various ciphertext candidates, I recommend taking such results with a grain of salt.
Take, for example, something as simple as a monoalphabetic substitution cipher.*) If someone had employed that cipher and used the VM alphabet for the ciphertext, we’d probably despair even at something that simple, because our prime tool in that case, viz. statistical frequency analysis, would be sure to fail as long as we fed it a transcription which misidentified the ciphertext character set.
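To make the point concrete, here is a toy sketch (my own illustration, using an invented EVA-style snippet, not real VM data) of how the very same string yields different frequency tables depending on whether “ch” and “sh” are read as compound glyphs or as letter pairs:

```python
import re
from collections import Counter

# An invented EVA-style snippet, NOT a real VM line; for illustration only.
text = "chedy qokchy chol shedy qokeedy chey"

# Reading 1: every transcription letter is its own cipher symbol.
freq_single = Counter(text.replace(" ", ""))

# Reading 2: "ch" and "sh" are single compound symbols.
freq_compound = Counter(re.findall(r"ch|sh|.", text.replace(" ", "")))

print(freq_single.most_common(3))
print(freq_compound.most_common(3))
```

Under the second reading, the symbol “c” vanishes from the table entirely; a frequency analysis fed with the wrong reading is counting things that may not exist.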
Likewise, word-length distributions etc. are all prone to distortion through systematic transcription errors. So, while I agree that statistics can give us valuable hints, I’d regard them as circumstantial evidence rather than hard facts, and if a single test gives results which don’t put a theory in accordance with the VM, I wouldn’t dismiss the theory immediately if the rest of the story looked good.
As for Zipf’s law, which is so often quoted in the context of the VM: I have my hesitations about that one in particular. IIUC, we don’t really know what it means if a distribution does or doesn’t obey Zipf’s law, and since random texts and even the sizes of cities can follow it, I wouldn’t assign too much importance to this test.
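The claim that random texts can follow Zipf’s law is easy to check. Here is a toy demonstration (my own, not from the original discussion): “monkey typing” over four letters and a space bar yields a rank-frequency curve whose log-log slope sits near the Zipfian value of -1, even though the “text” is pure noise.

```python
import math
import random
from collections import Counter

# "Monkey typing": random keystrokes over four letters and a space bar.
# There is no language behind this text at all.
random.seed(0)
text = "".join(random.choice("abcd ") for _ in range(200_000))

# Rank-frequency table of the resulting "words".
freqs = sorted(Counter(text.split()).values(), reverse=True)

# Least-squares slope of log(frequency) vs. log(rank); Zipf's law for
# natural language predicts a slope near -1.
xs = [math.log(r) for r in range(1, 51)]
ys = [math.log(f) for f in freqs[:50]]
mx, my = sum(xs) / 50, sum(ys) / 50
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
print(round(slope, 2))  # roughly Zipfian, despite the text being noise
```

So a Zipf-like curve for Voynichese tells us very little on its own.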
So, my advice is: stay tuned, but don’t wait with bated breath.
*) I know, the VM is not a monoalphabetic substitution.
Slightly edited from a post to the Voynich Mailing List.
13 thoughts on “Not a Rose by Any Other Name”
I agree with you on this. Certainly the counts and the statistics based on them have eliminated some possibilities, such as monoalphabetic substitution (as you point out), and I think transposition (the observed word structure would be destroyed), and others.
And then, after that, I agree they may not be as helpful in telling anyone what it IS, for some of the same reasons you point out, and some different ones (differences in transcription results/opinions skewing the counts used). As you know, I think one reason for this may be that the cipher/code lies not in the value of the characters as understood… that is, they do not represent the letters or numbers of a code… but in some other feature and order which is unaffected by any of the counts and the analysis of them. There are many such ciphers which would fit this description, but they are rarely if ever considered as candidates for Voynichese.
One small example: a known cipher used a stand-out feature of a page… a certain character, or a fruit or leaf in an illustration. The two parties would have identical alphabet “strips”, which they would place at the top of the page. As the decipherer read each VMs line and came to the “key” feature, they would follow up to see what letter of their strip it was under, and note it down. For instance, say this was done with the gallows characters, or one of them, in the VMs. A page might only contain a few sentences, but the gallows and their placement could carry it. And if this were the case, you could run statistics on the Voynich for eternity, every which way, and it would never even hint at such a system.
I know of at least a dozen such schemes, all well known and used historically, which would similarly confound any statistics, and yet have not been looked at in this case… to my knowledge.
I see your point, but I’m hesitant to agree. My major point is that, assuming this pseudo-Cardan grille had been used, the author obviously added another layer of encryption on top of it, because we would still be left with VM characters rather than readable plaintext after applying the grille… or am I missing something again?
Besides, the VM looks really fluently written, not exactly as if somebody had tried to fill in gaps in a pre-existing scheme, though admittedly this might be because it is a copy of the original done by an ignorant scribe.
I have not explained this well, sorry. I am not talking about any grille here, in this case. But I see why you might have thought this, with my phrase “at the top of the page”. I did not mean by this “cover the page”, but rather along the top edge of the page. Here is an illustration, but without the alphabet strip:
Picture the alphabet along the upper edge of that page. If the decoder slid a ruled edge down the page, keeping it horizontal, every time they came to a fruit they would look to the top of the page to see what letter it was under (or the alphabet strip could simply be on the ruled edge, although I have not seen this done). Those letters would be plaintext letters.
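The strip scheme described above can be sketched in a few lines. Everything here is invented for illustration, not taken from the VM: a toy page grid, a 26-letter strip along the top edge, and “K” standing in for the key glyph; the plaintext letter is whatever strip letter sits above each occurrence of the key.

```python
# Toy reconstruction of the "alphabet strip" scheme; the strip runs along
# the top edge of the page, and each key glyph "K" decodes to the strip
# letter directly above its column. All data here is made up.
strip = "abcdefghijklmnopqrstuvwxyz"

page = [
    "q.k.K.o.d.y.l.c.h",   # '.' is padding; 'K' is the key glyph
    "c.h.o.l.K.d.y.q.k",
    "K.o.l.d.y.c.h.q.k",
]

plaintext = "".join(
    strip[col] for line in page for col, ch in enumerate(line) if ch == "K"
)
print(plaintext)  # the column positions of 'K', mapped through the strip
```

Note that no count-based statistic on the page text itself would reveal this: the message lives in the key glyph’s column positions, not in the characters’ values.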
Another, different, example of a type of cipher is this one:
http://www.santa-coloma.net/voynich_drebbel/general/steg_p307.jpg In that case, it is the distance to certain features that matters… in the case of the Voynich, it could be the distance to certain characters, and also what those characters are, which would determine which plaintext letter was intended. As an arbitrary example, a “gallows” in and of itself would have no meaning… but perhaps if it was five characters from the last one, it would mean “L”, while if six from the last one, it would mean “C”. This would be rapid to learn, quick to write, and easy to decode for the intended reader. Also, counts and count statistics would never find such a scheme.
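The distance rule above (“five characters from the last one means L, six means C”) can likewise be mocked up in a few lines. The 5-to-“L” and 6-to-“C” mapping is the arbitrary example from the comment; the ciphertext, and “K” standing in for a gallows glyph, are my own inventions.

```python
# Toy sketch of the distance scheme: a gallows glyph ('K' here) means
# nothing by itself; the distance from the PREVIOUS gallows selects the
# plaintext letter. Table and ciphertext are invented for illustration.
distance_table = {5: "L", 6: "C"}

ciphertext = "odKcholKdaiinKol"
positions = [i for i, ch in enumerate(ciphertext) if ch == "K"]
plaintext = "".join(
    distance_table.get(b - a, "?") for a, b in zip(positions, positions[1:])
)
print(plaintext)
```

Again, frequency counts over the ciphertext characters would say nothing about the message, because only the spacing between key glyphs carries meaning.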
All the systems I suggest and am thinking of produce plaintext Latin letters directly, and do not leave you with Voynich characters at all. The Voynich characters themselves would have no inherent meaning, only their placement on the page in relation to some other scheme. I do want to do a blog post on this soon, and if I do, it might explain it better… and I will link it here. Rich.
All right, now I see where you’re going… sorry for being so dense; I obviously got it the wrong way around.
I wouldn’t dismiss it immediately, but the question remains: why employ the elaborate VM alphabet, rather than hiding the cipher true steganography-style in a more innocuous book written in Latin letters? And why the complex rules for word composition?
All just red herrings…?
Not to be argumentative, but I do think I have an answer to “Why employ the elaborate VM alphabet, rather than hiding the cipher true steganography-style in a more innocuous book written in Latin letters?”
To hide sequences of your meaningful character(s) in a text, it is much easier if that cover text has no meaning. For instance, if your usable characters were certain Latin letters, the encoder would have to create a believable text around those letters. But by using Voynich characters, you avoid this necessity… wherever they fall, so be it, since Voynichese means nothing anyway. In other words, one might question the sprinkling of “c’s” in a Latin passage when they screw up the readability and sense of it, but the same “c’s” in Voynichese elicit just a “whatever”.
As another example/comparison, F. Bacon’s biliteral, when used with Latin letters, requires different typefaces, which may be noticed by a third party. If such a scheme were used in Voynichese, with no “typefaces”, the differentiating characteristics would be irrelevant and unnoticed by a code breaker.
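For reference, the core of Bacon’s biliteral is tiny: each plaintext letter becomes a five-place pattern over some binary distinction (two typefaces in print; hypothetically, two unremarkable glyph variants in Voynichese). A minimal sketch, using a simplified 26-letter table rather than Bacon’s classic 24-letter one (which merges I/J and U/V):

```python
# Bacon's biliteral, simplified: letter index -> 5-bit pattern over the
# two "forms" 'a' and 'b'. The 26-letter table is a simplification of
# Bacon's historical 24-letter scheme.
def bacon_encode(letter):
    n = ord(letter.lower()) - ord("a")
    return format(n, "05b").replace("0", "a").replace("1", "b")

print(bacon_encode("c"))  # third letter -> pattern 'aaaba'
```

In print, the a/b distinction is visible as two typefaces; in an invented script, the same distinction could hide in glyph variations no outsider would think to count.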
“And why the complex rules for word composition?”
The seemingly complex Voynichese “rules”, then, may simply be a side effect of the placement of the meaningful characters used. Any such scheme will tend to “land” certain characters in certain positions, which might appear as a rule to someone looking at the Voynichese in the usual way.
Carefully read the letter which Marci sent to Kircher in Rome. Then you will find out who the author of the Voynich manuscript is.
Hi, J.T.: I did not want to take up Elmar’s post to address your issue, but briefly: when carefully read, the letter to Kircher clearly shows that Marci did not know the author at all, nor the cipher used. I think that addresses your point, and I am sorry, but I disagree with your assessment. All the best, Rich.
I’m seriously asking this: why, in a time when we can correctly scan, record, reproduce, and read a person’s retinal patterns with near-perfect confidence, can we not simply scan a few pages of Voynichese and then number-crunch to… I don’t know, get a true data set for the number of times a given form occurs? Why?
I think the major problem is not the scanning technology, but the question of what the VM character set actually is.
For example, if you tried to analyse modern Western handwriting without prior knowledge of Roman letters, how would you tell that “m” is not a ligature of “nn”, that “I” (capital “i”), “l” (lower-case “L”) and “1” are three distinct characters, that “u” and “v” are distinct, and that “w” is not “uu”? Compare this to the case of EVA “ch” vs. “ee”, for example.
You need to make assumptions about the underlying alphabet, and the “truth” you find in your data set will depend on the “truth” of your assumptions, or, as the computer adage goes: “Garbage in, garbage out.” Any transcription (like Takahashi’s or Glen Claston’s) necessarily makes assumptions about which two characters are identical and which are distinct, and if these assumptions fail, so will the results.
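The same assumption-dependence shows up in word-length statistics. A toy sketch with invented EVA-style words (not real VM data): counting “ch”/“sh” as one glyph instead of two shifts the average word length by a full character.

```python
import re

# Invented EVA-style "words" for illustration, not real VM data.
words = ["chedy", "shol", "qokeechy", "chol", "shedy"]

def length(word, compound):
    # If compound, "ch" and "sh" each count as a single character.
    return len(re.findall(r"ch|sh|.", word)) if compound else len(word)

avg_plain = sum(length(w, False) for w in words) / len(words)
avg_comp = sum(length(w, True) for w in words) / len(words)
print(avg_plain, avg_comp)
```

Any comparison of “Voynichese word lengths” against natural languages silently commits to one of these readings.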
Aside from that, the VM only amounts to some 150,000 characters, which is a good handful, but may simply not be enough for all statistical tests to yield valid results.
I don’t see that that follows. A perfect ‘scan’ provides the information needed to determine the number of repeated forms, and thus an initial character set. The level of technology I’m talking about can make minute distinctions – perhaps too minute – but it can also make minute points of comparison. At the very least it would highlight earlier errors in transcription and point out where we have supposed similar forms, but where consistently-applied differences occur. It should be done. The old idea of creating fonts from transcriptions is way out of date. A combination of scan and trace would give us better and more accurate forms to process. IMO.
But even with a scan, as opposed to a transcription into a font, you come right back to the point Elmar was making, because the programmer of the scanning software has to “tell” the program what features to use to distinguish the characters. For instance, as in your post, one would have to program it to determine what counts as a “similar form” or a “consistently-applied difference”. This cannot be automatic; it must be determined by a human… and one is back to the original problem.
As Richard said: you would be right, Diane, if the VM were a machine-printed book or such, where identical letters always have the exact same shape. But since the VM is written by hand (and in a fairly small hand, to boot), there are necessarily considerable variations between individual letters. Without clear knowledge of what the underlying alphabet is, you have to rely on guesswork as to what constitutes two truly different characters, and what are variations of the same letter. See the “ch” vs. “ee” problem in EVA.
Consequently, you run into a lot of difficulties when you want to determine letter frequencies or word lengths.
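The point about human judgment can be made concrete: any automatic glyph classifier needs a similarity threshold, and the chosen threshold decides how many “characters” the alphabet ends up with. A toy sketch with invented 2-D shape features (not real glyph measurements):

```python
# Invented 2-D "shape features" for five glyph images; which glyphs are
# "the same character" depends entirely on a human-chosen threshold.
glyphs = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 0.9), (0.5, 0.5)]

def count_characters(threshold):
    # Greedy "leader" clustering: a glyph joins the first existing
    # cluster whose representative is within the threshold (Manhattan
    # distance); otherwise it founds a new cluster.
    reps = []
    for g in glyphs:
        if not any(abs(g[0] - r[0]) + abs(g[1] - r[1]) <= threshold
                   for r in reps):
            reps.append(g)
    return len(reps)

print(count_characters(0.3), count_characters(1.2))
```

A strict threshold yields three “characters”, a loose one only two; the scanner cannot tell you which answer is right, only a theory of the underlying alphabet can.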
I expect you both know more about the high-level mapping/graphic sort of scanning I mean than I do. I know it is used to correlate and tabulate data on everything from fingerprints to forged documents – and that’s about all I know. I’m surprised at the degree of input you describe as needed from the operator. I rather thought that the program itself would redefine our perception of the forms, the number of distinct characters or pairs, and so on. In fact I expected that it would produce a greater number of distinct ‘char-sets’ than our own eyes can, but that the stats generated afterwards might modify our own opinions. Sort of Currier++.
I’ll take your word for it, though.