When all you’ve got is a hammer, then everything looks like a nail. For the VM, the same holds true for number crunching, which seems to be about the only tool we have to get any information out of the VM — no matter how misleading it may be.
Now I’ve decided to go back to my notorious “Strokes” theory, which, as I found out to my shock, dates back to early 2005 (without my having made much progress since, I have to admit). Read all about it!
The idea builds on an observation by Robert Firth, who noted as early as 1995 that a majority of the corpus written in the “Currier A” hand is made up of a curiously limited repertoire of “syllables”. (To avoid the false impression that these are supposed to be syllables of the plaintext, I’ll refer to them as “blocks”.) There were two “species” of those blocks, one which tended to begin ciphertext words, and one which made up the rest of the word; these he termed “odd” and “even” blocks, respectively.
They are (in Currier notation):
Odd Even 2 89 4O 8AE 4OF 8AM 4OP AE 8 AJ 9F AM 9P AN F AR O C9 OF CC9 OP COE P OE Q OM S OR SF S9 SP SC9 SQ SO SW SOE SX SOR W Z9 X 9 Z ZO
Now the number of different blocks of the two kinds was 23 and 21, respectively, tantalizingly close to the number of letters in the Latin alphabet. Unfortunately, at this point Robert’s research stalled, and he didn’t publish any further results. I don’t know if he didn’t know how to proceed from here, or if all his attempts turned into dead ends.
Either way, a decade later I tried to match this to the “Strokes” idea which I had had. Basically, to create a ciphertext this way it’s necessary to “decompose” the plaintext letters into their graphical constituents, or pen-strokes. A letter like “b” would thus be a vertical line, followed by a ring; the letter “d” would be the ring followed by the line, while “l” would be only the line. “p” would be a ‘low’ vertical line, followed by a ring. Now each of the components (vertical line, low line, ring) would be assigned one specific ciphertext letter, e.g. <q>, <o> and <e> (in EVA), and voilà: the ciphertext is composed as a sequence of these components. Deciphering works exactly the other way around.
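To make the mechanics concrete, here is a minimal sketch of the enciphering step. It only covers the four example letters and the three stroke-to-glyph assignments named above; the table names and the stroke labels (`line`, `low_line`, `ring`) are made up for illustration, and nothing here is meant as a reconstruction of an actual VM alphabet.

```python
# Stroke-to-glyph assignments from the example above (EVA glyphs).
STROKE_TO_GLYPH = {
    "line": "q",      # vertical line
    "low_line": "o",  # 'low' vertical line
    "ring": "e",      # ring
}

# Hypothetical decomposition table: each plaintext letter as its pen-strokes,
# in writing order. Only the four letters discussed in the text are covered.
LETTER_TO_STROKES = {
    "b": ["line", "ring"],
    "d": ["ring", "line"],
    "l": ["line"],
    "p": ["low_line", "ring"],
}


def encipher_letter(letter: str) -> str:
    """Replace every stroke of one plaintext letter by its ciphertext glyph."""
    return "".join(STROKE_TO_GLYPH[s] for s in LETTER_TO_STROKES[letter])


def encipher(plaintext: str) -> str:
    """Encipher a string letter by letter, without any separators."""
    return "".join(encipher_letter(c) for c in plaintext)


for letter in "bdlp":
    print(letter, "->", encipher_letter(letter))
```

With these tables, “b” and “d” use the same two glyphs and differ only in their order, which is exactly the property the idea relies on.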
In its simplest form, one ciphertext word would equate to one plaintext letter, and there would be only 26 different ciphertext words, so this quite obviously can’t be the whole truth. But one could imagine cramming several plaintext letters into a single ciphertext word. While this could create ambiguities regarding where one letter ended in the ciphertext word and where the next began (“lo” and “b” would render the same two ciphertext letters), it would make the enciphering scheme less obvious. In other words, each of Robert’s “blocks” would represent one plaintext letter.
But then why two sets of blocks? Maybe, I thought, the author alternated upper and lower case letters in his words in a kind of “cAmElCaSe” to confound codebreakers? Would that hold water? Time to number-crunch.
I set to work on the “Currier A” corpus as transcribed by Takeshi Takahashi.*) This corpus consists of 11477 tokens**) according to my count. I used the Currier transcription.***) I wrote the smallest program****) to see how much of the Currier A corpus I could compose from the blocks or “syllables” given by Robert.
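The ambiguity can be demonstrated with a few lines of code. The sketch below lists every plaintext that produces a given ciphertext word. The glyph assignments are the example ones (<q> for the line, <o> for the low line, <e> for the ring); that a plaintext “o” is a single ring is an assumption, implied by the “lo” versus “b” example. The table and function names are hypothetical.

```python
# Hypothetical letter codes, using q = vertical line, o = low line, e = ring.
LETTER_TO_CIPHER = {
    "b": "qe",  # line + ring
    "d": "eq",  # ring + line
    "l": "q",   # line only
    "o": "e",   # ring only (assumption, see lead-in)
    "p": "oe",  # low line + ring
}


def decipherings(ciphertext: str) -> list[str]:
    """Return every plaintext whose stroke encoding equals `ciphertext`."""
    if not ciphertext:
        return [""]  # exactly one way to decode nothing
    results = []
    for letter, code in LETTER_TO_CIPHER.items():
        if ciphertext.startswith(code):
            # Consume this letter's code and decode the remainder recursively.
            for rest in decipherings(ciphertext[len(code):]):
                results.append(letter + rest)
    return sorted(results)


print(decipherings("qe"))   # the two-glyph word from the example
print(decipherings("qeq"))  # one glyph more, more readings
```

Even with only five letters in the table, the number of possible readings grows as words get longer, which is why the scheme is harder to spot but also harder to read back.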
This is what the result looks like, in percent of tokens of the whole Currier A corpus:
The columns are percentages of tokens which are composable:
1. Total corpus (11477),
2. by any sequence of blocks (8716),
3. from one or two blocks only, any kind (6841),
4. from any number of blocks, but alternatingly odd and even blocks, starting with an odd one (6002),
5. like (4.), but starting with an even block (3176).
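The four composability tests in the list can be sketched as follows. This is not the original program, just a minimal illustration of the technique (a memoized search over block boundaries). The `ODD` and `EVEN` sets are small illustrative subsets picked from the block list, and the function names, parameters and sample tokens are made up for the example.

```python
from functools import lru_cache

# Illustrative subsets only; a real run would use all of the blocks.
ODD = frozenset({"4O", "4OF", "8", "O", "S"})
EVEN = frozenset({"89", "AM", "AE", "OE", "C9"})


def composable(token, max_blocks=None, alternate_from=None):
    """True if `token` can be split completely into blocks.

    max_blocks:     optional upper limit on the number of blocks.
    alternate_from: None for any order, or "odd"/"even" to require strictly
                    alternating kinds, starting with the given kind.
    """
    kinds = {"odd": ODD, "even": EVEN}
    other = {"odd": "even", "even": "odd"}

    @lru_cache(maxsize=None)
    def solve(pos, used, expected):
        if pos == len(token):
            return used > 0  # an empty token does not count as composed
        if max_blocks is not None and used == max_blocks:
            return False
        candidates = (ODD | EVEN) if expected is None else kinds[expected]
        for block in candidates:
            if token.startswith(block, pos):
                nxt = None if expected is None else other[expected]
                if solve(pos + len(block), used + 1, nxt):
                    return True
        return False

    return solve(0, 0, alternate_from)


def tally(tokens):
    """Count the tokens passing each of the four tests."""
    tests = {
        "any sequence": {},
        "one or two blocks": {"max_blocks": 2},
        "alternating, odd first": {"alternate_from": "odd"},
        "alternating, even first": {"alternate_from": "even"},
    }
    return {name: sum(composable(t, **kw) for t in tokens)
            for name, kw in tests.items()}


# Made-up sample tokens, not taken from the transcription.
print(tally(["4OFAM", "8AM", "89", "4O89AM", "XYZ"]))
```

The search tries every block at every position, so a token counts as composable as soon as one valid split exists, no matter how many alternative splits there are.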
So, Robert was right: three quarters of “Currier A” is made up of his 44 blocks, and about 60 percent of the corpus requires only one or two of them. Plus, as column 5 shows, order is important, and starting with one of the even blocks drastically reduces the composition success. (A side note: for quite a number of words it was possible to compose them in more than one way from Robert’s blocks.)
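For reference, the token counts given above convert to shares of the corpus as follows. The snippet uses only the five counts from the list; the dictionary keys are shorthand labels chosen for this example.

```python
TOTAL = 11477  # tokens in the Currier A corpus, as counted above

# Token counts per test, copied from the list of columns.
COUNTS = {
    "any sequence of blocks": 8716,
    "one or two blocks": 6841,
    "alternating, odd first": 6002,
    "alternating, even first": 3176,
}

# Share of the whole corpus, in percent, rounded to one decimal place.
SHARES = {name: round(100 * n / TOTAL, 1) for name, n in COUNTS.items()}

for name, share in SHARES.items():
    print(f"{name:26s} {share:5.1f} %")
```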
But what does it mean?
That’s of course difficult to assess. Not the whole of the “Currier A” corpus can be created this way, so Robert might well have been wrong. On the other hand, it has to be assumed that he may have made mistakes both in the transcription of the VM (we still don’t know what the original character set really is) and in his divining of the blocks. I’ll leave it to you to interpret the numbers as you see fit.
In any case, I think the next step would be to perform a frequency check on the blocks: How often were they used in the ciphertext? This would allow us to match them against the letter frequencies of various plaintext language candidates. To some degree the length of the blocks (i.e., how many ciphertext letters did they require?) could help to resolve ambiguities. If according to frequency analysis two blocks could represent the plaintext letter “B”, and one of the blocks consists only of one ciphertext glyph, but the other of three, then it’s more plausible that “B” wasn’t enciphered as a single stroke, but as three strokes, or ciphertext glyphs: a vertical line, and two rings or arcs.
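Such a frequency check could start out like the sketch below. Everything in it is an assumption made for illustration: the block set is a small subset, ambiguous tokens are resolved by simply trying longer blocks first, the sample tokens are invented, and the “plaintext” letter counts come from a throwaway Latin phrase rather than from a real reference corpus. Pairing by frequency rank is only the crudest possible matching and is no claim about what any block means.

```python
from collections import Counter

# Illustrative subset of blocks; a real run would use the full list.
BLOCKS = frozenset({"4O", "4OF", "8", "S", "89", "AM", "AE", "OE"})


def split_into_blocks(token, blocks=BLOCKS):
    """Return one split of `token` into blocks (longer blocks tried first),
    or None if no complete split exists."""
    if not token:
        return []
    for block in sorted(blocks, key=lambda b: (-len(b), b)):
        if token.startswith(block):
            rest = split_into_blocks(token[len(block):], blocks)
            if rest is not None:
                return [block] + rest
    return None


def block_frequencies(tokens):
    """Count how often each block occurs in the composable tokens."""
    counts = Counter()
    for token in tokens:
        parts = split_into_blocks(token)
        if parts:
            counts.update(parts)
    return counts


def rank_match(block_counts, letter_counts):
    """Pair blocks and letters by frequency rank; ties break alphabetically."""
    def ranked(counts):
        return [k for k, _ in sorted(counts.items(),
                                     key=lambda kv: (-kv[1], kv[0]))]
    return list(zip(ranked(block_counts), ranked(letter_counts)))


# Made-up tokens and a throwaway sample text, for demonstration only.
sample_tokens = ["4OFAM", "8AM", "SOE", "8AM", "89"]
sample_letters = Counter(c for c in "ave maria gratia plena" if c.isalpha())

for block, letter in rank_match(block_frequencies(sample_tokens), sample_letters):
    print(f"{block:4s} ~ {letter}  (block length {len(block)})")
```

The block length printed in the last column is where the stroke idea would come in: a candidate pairing is more plausible if the number of glyphs in the block fits the number of strokes needed to write the letter.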
I’ll keep you posted!
*) “Currier B” would probably yield comparable results, but with a different set of blocks. If “A” works out, I’ll be happy to go over the VM with “B” again as well.
**) Definition: A “token” is a sequence of glyphs in the VM. This is different from the number of “words” which counts only the unique sequences. In other words, duplicate strings increase the number of tokens, but not the number of words, and “I very very much like this” contains 6 tokens, but only 5 words.
***) Though in general I prefer EVA to work with, I found out that it’s best to stick with Currier if I wanted to build on Robert’s work. The transposition of one transcription into the other is lossy and far from trivial, so I thought, if Currier was good enough for Robert to arrive nowhere, it’s good enough for me.
****) Source code available on request.