In my neverending quest to find out the truth about the Stroke theory, my lovely wife Sina the other night had a splendid idea. Up to now I’d always tried to dissect the existing ciphertext into the hypothetical “syllables” (which, as the theory goes, compose the ciphertext, with each “syllable” of between one and four ciphertext characters representing the strokes with which one of the plaintext letters would have been drawn); a question not unrelated to the knapsack problem, and equally difficult to treat, especially in view of the fact that we can’t even be sure about the ciphertext character repertoire.
Now, Sina’s idea was not to look for the smallest building blocks which make up the VM, but to large ones. The idea being that long identical sequences of plaintext letters would result in long identical sequences of ciphertext.
Of course, this is easier said than done, if you know neither the plaintext nor even the plaintext language. But one can try to make reasonable guesses und bludgeon them to death by number crunching. For example, it’s not unreasonable to assume the plaintext language to be Latin, especially if we start with the idea that the VM might be a “late fake.” Then, the first step is to run a sufficiently large Latin plaintext through a little program and check for characteristic sequences.
Doing so for Caesar’s “De bello gallico” renders a few suspicious features (like a large number of “Caesar”s), but also some promising ones. To name the first which caught my eye, the most frequent three-letter string in “De bello gallico”, after tossing aside all whitespace and punctuation, is “ere”. (In plaintext of roughly 150k, it shows up 850 times.) This is useful, because the first and the third letter are the same, which makes it easier for us to look for a corresponding ciphertext sequence. So, for example, if plaintext “e” translates into the stroke sequence [Q1], and “r” into [Q2], there ought to be a correspondingly frequent ciphertext sequence [Q1][Q2][Q1], the only problem being that we know neither the length of [Q1] nor of [Q2].
It is also useful because I had to postulate that before the actual enciphering, the plaintext was modified into CaMeLcAsEpLaInTeXt. (If it hadn’t been, Robert Firth would not have found around 50 constituent syllables, but only 25.) That means that “ere” would have been rendered “eRe” or “ErE”, but in either case the first letter is the same as the third. If the frequent string had been “eer”, that would have been nearly useless, because it’d be turned into “eEr” or “EeR” for enciphering, with all three plaintext letters (ie ciphertext sub-sequences) being different, hence they being immediately much more difficult to recognize.
Now the big question is, into how many strokes “e/E” and “r/R” were being broken down?
If the camelcasing happened to render “eRe”, then it’s reasonable to assume that [Q1] (“e”) has a length of two characters, with [Q2] (“R”) having three, ie the “suspicious” ciphertext string should have a length of 2+3+2=7 characters, with the last two being the same as the first two.
Ceteris paribus, “ErE” should occur about as often (under the assumption that the camelcasing works more or less randomly). “E” could be decomposed into either 2 strokes (a “C”-shaped crescent, and a hyphen “-“, if the author used cursive writing as a model), or four (a vertical line and three horizontal bars, if the author used block letters). “r” is difficult to pin down and could be two strokes (a short vertical slash, and a little “comma” on top), or up to three or four strokes, if using one of the German “r”s of the time. (Though IMHO it’s rare to see Latin text written in German letters of the time.) So, “ErE” would translate into something looking like [Q1][Q2][Q1] with [Q1] being three or four letters long, and [Q2] being two or three, so the whole thing would be a string between eight and 11 letters.
That still gives us a lot of leeway. What else can we do to determine suitable candidates?
One thing is the absolute frequency. We have seen that “ere” occurs 850 times in 150,000 letters, in other words, one occurence of “ere” happens about every 300 plaintext letters. Now, Currier “B” makes up about 150,000 ciphertext characters.*) (That would mean that it is equivalent to between 70,000 and 50,000 plaintext characters, assuming that each plaintext letter is represented by between two and three ciphertext characters, on average.) Now, since “ere” was supposedly modified into “eRe” and “ErE” with equal probability, that reduces their frequency to 1/600 each. Likewise, seeing the ciphertext “bloated” compared to the plaintext by a factor of somewhere between 2 and 3, the frequencies of [Q1][Q2][Q1] are expected to drop between 1/1200 and 1/1800 per ciphertext character, or, in other words, they should show up very roughly 100 times in the ciphertext.
But here we are threading on very thin ice, and the uncertainties make guesses like these almost pointless — starting with the question what constitutes an individual ciphertext character.
Anyway, I think this is a promising lead to see whether it’s possible to match certain plaintext sequences to ciphertext strings.
*) The whole examination was limited to Currier “B” text, BTW, to avoid problems with the mix between languages, which might well indicate different enciphering schemes and would thus blur the statistics.