As I mentioned the other day, I had committed a serious omission in my assessment of the Stroke Theory.
I had written a little tool to analyse the VM ciphertext and decompose it into the hypothetical “syllables” of the Stroke Theory, where each ciphertext “syllable” would represent one plaintext letter. This tool worked by constantly modifying the hypothetical syllable set and retaining those modifications which led to an overall increase of “coverable text” (i.e., ciphertext words that could be composed from the syllable set). The tool became “saturated” (i.e., further changes no longer increased the overall coverage) when the syllable set could compose 66% and 74% of the ciphertext by volume for Currier A and B, respectively. This was interesting, but by no means convincing.
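To give an idea of what the analysis tool does, here is a minimal Python sketch of such a “retain what helps” loop. It is not my actual program: the function names (`coverage`, `hill_climb`, `random_syllable`), the syllable-set size and the mutation scheme are all invented for illustration.

```python
import random

def coverage(words, syllables):
    """Fraction of the ciphertext (by character volume) whose words can be
    segmented completely into syllables from the current set."""
    syl = set(syllables)

    def decomposable(word):
        memo = {}
        def rec(i):
            if i == len(word):
                return True
            if i not in memo:
                memo[i] = any(word[i:j] in syl and rec(j)
                              for j in range(i + 1, len(word) + 1))
            return memo[i]
        return rec(0)

    total = sum(len(w) for w in words)
    covered = sum(len(w) for w in words if decomposable(w))
    return covered / total if total else 0.0

def hill_climb(words, alphabet, n_syllables=40, steps=2000, seed=0):
    """Randomly mutate a candidate syllable set and keep only those
    mutations that do not decrease coverage."""
    rng = random.Random(seed)

    def random_syllable():
        # minimum length of 2 -- exactly the setting that turns out
        # to be a mistake further down
        return "".join(rng.choice(alphabet) for _ in range(rng.randint(2, 4)))

    current = [random_syllable() for _ in range(n_syllables)]
    best = coverage(words, current)
    for _ in range(steps):
        candidate = list(current)
        candidate[rng.randrange(n_syllables)] = random_syllable()
        score = coverage(words, candidate)
        if score >= best:   # retain changes that help (or at least don't hurt)
            current, best = candidate, score
    return current, best
```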
A different approach used a second tool which would synthesize ciphertext from innocuous plaintext, according to the rules of the Stroke Theory. This was a purely qualitative test, to see whether ciphertext rendered this way would look anything like the VM, and, in my humble opinion, it actually did.
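In outline, the synthesis direction is little more than a letter-by-letter substitution into stroke groups. The following sketch is hypothetical: the `STROKES` table is made up for illustration and is not the stroke alphabet I actually used.

```python
# Invented stroke table: each plaintext letter maps to the group of
# ciphertext glyphs needed to draw it, stroke by stroke.
STROKES = {
    "a": "ch", "b": "ke", "c": "e", "d": "cke", "e": "ee",
    "i": "i",  "l": "l",  "m": "iin", "n": "in", "o": "o",
    # ... remaining letters omitted
}

def synthesize(plaintext):
    """Render plaintext as Stroke-Theory-style ciphertext, word by word."""
    out_words = []
    for word in plaintext.lower().split():
        # characters without an entry (digits, umlauts, letters left out
        # of this toy table) are simply skipped
        letters = [c for c in word if c in STROKES]
        out_words.append("".join(STROKES[c] for c in letters))
    return " ".join(w for w in out_words if w)

print(synthesize("ein mann"))   # -> "eeiin iinchinin"
```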
It took me about a year to figure out that one could combine the two approaches, namely letting the analytical tool work on the results of the synthesis tool. In a perfect world, i.e. if the analytical tool worked correctly, this chase of one’s own tail should result in 100% coverage of the plaintext.
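With the two toy functions sketched above, the closed loop amounts to little more than this (the file name is a placeholder):

```python
# Chasing my own tail: analyse the text I synthesized myself.
# If the analyser worked perfectly, coverage should approach 100%.
plaintext = open("weinbuch.txt", encoding="utf-8").read()   # placeholder path
cipher_words = synthesize(plaintext).split()
alphabet = "".join(sorted({c for w in cipher_words for c in w}))
syllables, cov = hill_climb(cipher_words, alphabet)
print(f"coverage after saturation: {cov:.0%}")
```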
Of course it didn’t. For instance, the plaintext used, the German 15th-century “Weinbuch”, contained a few characters (like Arabic digits or umlauts) which couldn’t be properly transcribed in the synthesis and which thus couldn’t be recovered.
Still, the result was that about 68% of the synthesized text by volume could be covered before the analysis program showed signs of stalling. This teaches us two things:
- It’s (pardon my French) fucking close to the results for the VM, and
- This can’t be due to special unencrypted characters alone.
Upon closer inspection, I did find the culprit: I had set the minimum syllable length in the analyzer to two strokes, which is of course stupid. Letters like “I”, “l” or “o” will hardly require more than one stroke (and indeed required only one in my synthesis). Thus, these letters were effectively indecipherable, and this may well account for a good deal of the lost coverage.
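In terms of the toy candidate generator sketched earlier, the fix is simply to allow single-stroke syllables into the set (the parameter names are again hypothetical):

```python
import random

def random_syllable(rng: random.Random, alphabet: str,
                    min_len: int = 1, max_len: int = 4) -> str:
    """Candidate generator with the corrected minimum length of 1,
    so one-stroke letters like 'i', 'l' or 'o' become representable."""
    return "".join(rng.choice(alphabet)
                   for _ in range(rng.randint(min_len, max_len)))
```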
This limitation would of course also hold for the VM. (There is no reason why the VM author should have chosen to use a minimum of two strokes per plaintext letter.) And the fact that this programming error led to almost the same amount of lost coverage for the VM as for the synthesized text could be a hint that the same effects are at play, and that I’m thus on the right track by chasing my own tail…