Recently, I made my first tests to revive the Strokes theory of how the VM was enciphered and arrived at a quota of around 80% of the VM text (by volume, Currier A) which could be composed of Robert Firth’s 24 building blocks (or “syllables”). Now, is that a good or a bad result?
- It can be considered bad insofar as it’s “only” 80%. There are a number of degrees of freedom involved in the experiment, namely as regards transcription and block composition. Assuming that the VM text wasn’t written as a completely random string of symbols but governed by some kind of “grammar” which dictates possible word compositions, it’s not surprising that it’s possible to reconstruct a good chunk of this tome from some set of building blocks, especially if these building blocks are freely chosen. (And some of them consist only of a single letter, hey!) So one could argue that, if Firth’s blocks are any good, they should be able to cover more than 80% of the ciphertext.
- On the other hand, one can consider the 80% surprisingly good. For example, the current set of 44 blocks allows for the representation of just two sets of characters forming the Latin alphabet, one of uppercase and one of lowercase letters.*) This means that any special characters (Arabic digits, Greek letters) aren’t covered by this repertoire and drop through. Likewise, Firth’s blocks didn’t include some gallows letters from the start, and thus are unable to compose words containing them.
Let us add to this the high probabilities that
- Firth’s block set isn’t completely correct, and
- There are errors in the transcription system. With this I don’t mean a mistake in the transcription process, but an error in the transcription system, i.e. two different ciphertext letters are consistently considered the same (or vice versa), or two separate letters are transcribed as one letter (or the other way around). The ciphertext character set being unknown has notoriously been one of the obstacles to tackling the VM.**)
Any of these mistakes would naturally result in lower composition rates, and in the light of this, one could consider the 80% surprisingly good.
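To make the “composition rate” concrete, here is a minimal sketch of how such a figure could be computed. It is not the code behind the 80% result. It assumes a transcription with one character per glyph, and it assumes that “by volume” means the share of glyphs sitting in words that can be built entirely from blocks. The block set and sample words are made up for illustration.

```python
def can_compose(word, blocks):
    """Return True if `word` can be written as a concatenation of blocks."""
    # reachable[i] is True when the prefix word[:i] can be built from blocks
    reachable = [True] + [False] * len(word)
    for i in range(len(word)):
        if not reachable[i]:
            continue
        for block in blocks:
            if block and word.startswith(block, i):
                reachable[i + len(block)] = True
    return reachable[len(word)]


def composition_rate(words, blocks):
    """Share of the text, counted in glyphs, that lies in composable words."""
    total = sum(len(w) for w in words)
    covered = sum(len(w) for w in words if can_compose(w, blocks))
    return covered / total if total else 0.0


# Hypothetical blocks and words, NOT Firth's actual set
sample_blocks = {"qo", "k", "ee", "dy", "ch", "ol"}
sample_words = ["qokeedy", "chol", "shedy"]
print(composition_rate(sample_words, sample_blocks))
```

Here “shedy” fails because no block starts with “s”, which is the same effect as a missing gallows letter: one uncovered glyph costs the whole word.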
How to proceed from here?
As opposed to everyone else who seems bored under the Corona lockdown, I myself am actually quite busy. Nevertheless, there are two avenues of attack I’d like to write little pieces of software for:
- One would be an interactive “Fiddler”. Basically, a piece of interactive software which lets you reassign plaintext letters to blocks and change the blocks on the fly to see what effect this would have on a “decipherment.” With a little luck, and patient fiddling, one or the other readable word might come out of this, hinting at the “true” composition and assignment of the blocks…
- Of course, there’s also the opportunity for a brute force attack. Empty diskspace is unused assets. The idea is to introduce random variations to the block set and see how these variations influence both the “composition rate” (i.e. the volume of text that can be synthesized with the blocks) and the number of blocks required for the composition.***) Letting the software run for a few hours and leaving the “better” results to survive (and discarding those which make the result “worse”) might start an evolution towards a better optimized set of blocks.
Now all I need to do is find the time to hack the code.
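The non-interactive core of such a “Fiddler” might look roughly like the sketch below: split each word into blocks, then render it with whatever plaintext assignment is currently being tried. The function names, the `?` placeholder for words that cannot be split, and the sample assignment are all assumptions made for illustration; nothing here reflects an actual block-to-letter mapping.

```python
def segment(word, blocks):
    """Split `word` into blocks; return None if no split exists."""
    # best[i] holds one segmentation of word[:i], or None if none was found
    best = [[]] + [None] * len(word)
    for i in range(len(word)):
        if best[i] is None:
            continue
        # Try longer blocks first, so they win when several splits are possible
        for block in sorted(blocks, key=len, reverse=True):
            j = i + len(block)
            if block and word.startswith(block, i) and best[j] is None:
                best[j] = best[i] + [block]
    return best[len(word)]


def decipher(words, assignment):
    """Render words using `assignment` (block -> plaintext string)."""
    out = []
    for word in words:
        parts = segment(word, assignment.keys())
        if parts is None:
            out.append("?" * len(word))  # word not composable from the blocks
        else:
            out.append("".join(assignment[p] for p in parts))
    return " ".join(out)


# Hypothetical assignment, for illustration only
trial = {"qo": "a", "k": "b", "ee": "c", "dy": "d"}
print(decipher(["qokeedy", "xyz"], trial))
```

Fiddling then amounts to editing `trial` and calling `decipher` again; an interactive front end would only need to wrap that loop.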
*) Depending on whether some of the fancier Renaissance additions like “j” and “w” or separate letters for “u” and “v” should be considered.
**) The other being the underlying unknown plaintext language.
***) This might require a bit of explanation. I tried similar “evolution runs” in the past, but one problem was that they tend to “erode” the blocks used further and further until you’re left with a set of single-letter blocks. And this is only logical, because if you have a transcription which uses, say, 44 different transcription letters, it will be possible to cover 100% of the transcription with 44 single-letter blocks, if each “block” contains exactly one letter of the set of transcription symbols. Thus there have to be two criteria for whether one set of blocks is “better” or “worse” than the other in explaining the ciphertext:
- The “better” set must cover more volume of the ciphertext, and
- The “better” set must not use more blocks than the previous set. (Or, in other terms, the blocks used must be larger or of equal length.)
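The two criteria translate fairly directly into a hill-climbing loop. The following is only a sketch of the idea, not a finished tool: the mutation scheme (swap one block for a grown, shrunk or altered variant), the fixed random seed and the step count are my own assumptions, and blocks are assumed to be non-empty strings over a one-character-per-glyph transcription.

```python
import random


def can_compose(word, blocks):
    """Return True if `word` can be written as a concatenation of blocks."""
    reachable = [True] + [False] * len(word)
    for i in range(len(word)):
        if reachable[i]:
            for block in blocks:
                if block and word.startswith(block, i):
                    reachable[i + len(block)] = True
    return reachable[len(word)]


def composition_rate(words, blocks):
    """Share of the text, counted in glyphs, that lies in composable words."""
    total = sum(len(w) for w in words)
    covered = sum(len(w) for w in words if can_compose(w, blocks))
    return covered / total if total else 0.0


def mutate(blocks, alphabet, rng):
    """Return a copy of `blocks` with one block replaced by a random variant."""
    candidate = set(blocks)
    old = rng.choice(sorted(candidate))
    candidate.remove(old)
    kind = rng.choice(("grow", "shrink", "swap"))
    if kind == "grow":
        new = old + rng.choice(alphabet)
    elif kind == "shrink" and len(old) > 1:
        new = old[:-1]
    else:
        pos = rng.randrange(len(old))
        new = old[:pos] + rng.choice(alphabet) + old[pos + 1:]
    candidate.add(new)  # may collide with an existing block, shrinking the set
    return candidate


def evolve(words, blocks, steps=1000, seed=0):
    """Keep a mutation only if it covers more text without using more blocks."""
    rng = random.Random(seed)
    alphabet = sorted({glyph for w in words for glyph in w})
    best = set(blocks)
    best_rate = composition_rate(words, best)
    if not best or not alphabet:
        return best, best_rate
    for _ in range(steps):
        candidate = mutate(best, alphabet, rng)
        rate = composition_rate(words, candidate)
        if rate > best_rate and len(candidate) <= len(best):
            best, best_rate = candidate, rate
    return best, best_rate
```

Because a mutation only swaps one block for another, the set never grows, which is one simple way to honour the second criterion. It does not by itself stop individual blocks from shrinking, so a stricter run might also compare the total length of the blocks.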
Edit: I’ve just added a whole page dedicated to the “Face Value Fallacy”, because I feel it’s important more people are aware of it.
One of the pitfalls of VM research is the presumption to take its text at face value: these letters that make up the text look so very much like Latin letters (except… not quite ;-)), that it’s tempting to presume that each ciphertext letter indeed does represent one plaintext letter. And from that starting point the next logical step is to presume that each ciphertext word corresponds to one plaintext word.
But upon closer inspection, this presumption is not borne out by observation, except by the fact that the letters are grouped into small sequences, separated by visual spaces. A lot of features speak against this assumption of “words”,*) namely:
- The words of the VM show a high internal structure: Many letters appear only word-initial, some only word-terminal, and many show a high dependency on their neighborhood. While these features are not unheard of in natural languages — compare “q”, which is always followed by “u” in most western languages, or the German “ß”-s which has a strong tendency to appear word-terminal — no language exhibits so many of these features and such a strongly regulated word-internal grammar.
- The letters aren’t evenly distributed on the page. It’s common knowledge that the gallows characters are concentrated on the page tops and paragraph starts. While this could be explained by them being ornamental versions of regular characters, Julian Bunn’s analysis from 2016 shows that certain characters “crowd” in line-initial or line-terminal positions, which is a pretty odd feature if one character really represents one plaintext letter.
- Unless we are very wrong about the character set used for the VM, one VM word simply doesn’t have enough information content to encipher a plaintext word.**)
- “Sentences” often differ by only slight changes from word to word or show word repetitions, so that it almost looks like words are not independent but “morphing” one into the other, and the true information content doesn’t lie in the words themselves, but in the changes introduced between them.***) This is also difficult to reconcile with the idea that each VM word corresponds to a plaintext word.
No. There is much too much going on in the encipherment of the VM. A ciphertext word is not a plaintext word, and a ciphertext letter does not correspond to a plaintext letter, I’m willing to bet on both.
It’s still my conviction that the fiendishness of the VM encipherment doesn’t lie in its complexity, but in its seeming simplicity: Taken at face value, it looks like something dead simple to solve, and so even a moderately complicated scheme escapes the eye of the beholder. We’re missing the forest for the trees which look like shrubbery.
*) Subsequently I’ll use the term “word” for “a short sequence of glyphs in the VM, separated from the rest by visual breaks.”
**) It could be that the VM character set is much more complicated than presumed and contains many more fine details which discriminate between different characters, but I doubt this for reasons of practicability: The VM characters are already quite small, and it would have been impossible for the author to write down his letters so exactly on rough vellum that small nuances would have been legible for a reader. (Not to make too fine a point on this.)
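The first point in the list above, letters bound to word-initial or word-terminal positions, is easy to check against any transcription. Here is a rough sketch, assuming one character per glyph and “words” as defined in the first footnote; the 0.9 threshold and the function names are arbitrary choices of mine, and the sample words are only placeholders.

```python
from collections import Counter, defaultdict


def positional_profile(words):
    """Count, per glyph, its initial, medial and final occurrences."""
    profile = defaultdict(Counter)
    for word in words:
        last = len(word) - 1
        for i, glyph in enumerate(word):
            profile[glyph]["total"] += 1
            if i == 0:
                profile[glyph]["initial"] += 1
            if i == last:
                profile[glyph]["final"] += 1
            if 0 < i < last:
                profile[glyph]["medial"] += 1
    return profile


def strongly_positional(profile, threshold=0.9):
    """Map each glyph to the position holding at least `threshold` of its uses."""
    result = {}
    for glyph, counts in profile.items():
        for position in ("initial", "medial", "final"):
            if counts[position] / counts["total"] >= threshold:
                result[glyph] = position
                break
    return result


# Placeholder words, for illustration only
sample = ["qokeedy", "qokedy", "chedy", "qol"]
print(strongly_positional(positional_profile(sample)))
```

A glyph in a one-letter word counts as both initial and final. Running the same profile over a natural-language text of similar size would give a baseline for how unusual the VM’s positional rigidity really is.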
***) Wouldn’t it be fascinating if the word sequence “walter winter” would be used in such a manner to encipher the word “in”?
When all you’ve got is a hammer, then everything looks like a nail. For the VM, the same holds true for number crunching, which seems to be about the only tool we have to get any information out of the VM — no matter how misleading it may be.
Now I’ve decided to go back to my notorious “Strokes” theory, which, as I found out to my shock, dates back to early 2005 (without having made much progress, I have to admit.) Read all about it!
Many Voynicheros assume that the enciphering system of the VM treats the space between words as a particular character, like any other of the alphabet.
This is an attitude we have grown accustomed to, since we’ve grown up with computers, where the space has an ASCII code like the rest of “A” to “Z”, and before that the typewriter, where the space bar was a key similar to the others.
But it’s a fairly modern attitude. Until fairly recently, a space was just that: an empty gap between words, but not a character or symbol in its own right. (Even the venerable Enigma cipher machine of WWII fame didn’t feature that character in its symbol set.) Rather, it was considered a part of visual design, like a line break (for which the Enigma didn’t have a symbol either). Word breaks were useful to discriminate between word boundaries, but they contained no information in themselves. Throughout much of the Middle Ages, textswerewrittenassimplyalongsequenceofletters, and it was up to the reader to find the word breaks. (Compare this to modern typography, where it’s for the better part up to the reader to learn about stressed syllables etc.)
Even though this practice had pretty much ended by the presumed genesis of the VM (early 15th century), and word breaks were regularly used to increase readability, I don’t think that any encipherer would already have thought of treating the spaces thus generated as particular characters which would be enciphered like regular letters. Hence, I also think it’s futile to search for such enciphering characteristics in the VM.
I’m at it again. Crunching the numbers.