(Attack from the rear…)
As mentioned previously, I’ve come to the conclusion that the failure to reproduce Robert Firth’s results was at least partly due to me using the wrong transcription scheme. So, tonight I gave it a second shot, this time working with Currier’s transcription.
I used Elias Schwerdtfeger’s extractor to create two samples, one with Takahashi’s transcription (A), and one with Currier’s own (B). Both times I used all the “Currier A” material at hand. Sample A amounted to about 55k, while B was somewhat less (35k); apparently Currier didn’t transcribe the whole manuscript.
I also slightly modified my software. Previously, the little hack would only recognize a word as belonging to the Firth vocabulary if it contained one odd and one even group. I eased the restriction somewhat and also allowed words consisting of only a single odd or even group.
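The relaxed rule can be sketched roughly as follows. This is not the actual hack, just a minimal illustration: the group lists are tiny placeholder subsets (only the groups named in this post), and it assumes that a full word is an odd group followed by an even group. The names `ODD_GROUPS`, `EVEN_GROUPS` and `is_firth_word` are made up for the example.

```python
import re

# Placeholder subsets only; the real Firth group tables are longer.
ODD_GROUPS = ["8", "S", "Z"]
EVEN_GROUPS = ["9", "AM", "OE"]


def _alternation(groups):
    # Try longer groups first so a multi-letter group is not cut short.
    ordered = sorted(groups, key=len, reverse=True)
    return "|".join(re.escape(g) for g in ordered)


ODD = _alternation(ODD_GROUPS)
EVEN = _alternation(EVEN_GROUPS)

# Old rule: exactly one odd group followed by one even group (assumed order).
STRICT = re.compile(rf"(?:{ODD})(?:{EVEN})")
# New rule: additionally accept a lone odd group or a lone even group.
RELAXED = re.compile(rf"(?:{ODD})(?:{EVEN})|(?:{ODD})|(?:{EVEN})")


def is_firth_word(word, relaxed=True):
    """Return True if the whole word matches the chosen rule."""
    pattern = RELAXED if relaxed else STRICT
    return pattern.fullmatch(word) is not None
```

With these placeholder lists, `is_firth_word("8AM")` passes under both rules, while `is_firth_word("8")` passes only under the relaxed one.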
The results for sample A (Takahashi transcription) were:
Number of tokens: 11415
Number of different words: 3440
Firth words: 7558
Sample B (Currier transcription):
Number of tokens: 7363
Number of different words: 2352
Firth words: 5411
That gives a hit rate (Firth words divided by tokens) of 66% for sample A, and roughly 73% for sample B. This is not quite as high as others had reported, but still better than the 40% or so I had previously.
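For anyone who wants to check the arithmetic, here is a small sketch that recomputes the hit rates from the counts listed above. It assumes the rate is simply Firth words divided by tokens; the function name `hit_rate` is just for illustration.

```python
# Counts copied from the results listed above.
samples = {
    "A (Takahashi)": {"tokens": 11415, "firth_words": 7558},
    "B (Currier)": {"tokens": 7363, "firth_words": 5411},
}


def hit_rate(firth_words, tokens):
    """Share of tokens that were recognized as Firth words."""
    return firth_words / tokens


for name, counts in samples.items():
    rate = hit_rate(counts["firth_words"], counts["tokens"])
    print(f"Sample {name}: {rate:.1%}")
```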
Let’s remember, the litmus test for the Stroke theory is that the frequency distributions for odd and even groups are pretty much identical: if each odd group and each even group represents one plaintext letter, then both frequency distributions should be similar. (This is something we had not observed when using the EVA transcription.)
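One way to put a number on "pretty much identical" is sketched below. This is only an assumption about how such a comparison could be done, not what my software does: since we don't know which odd group corresponds to which even group, it compares the two distributions rank by rank (most common against most common, and so on). The names `rank_profile` and `profile_distance` are invented for the example.

```python
from collections import Counter


def rank_profile(groups):
    """Relative frequencies of the groups, most common first."""
    counts = Counter(groups)
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common()]


def profile_distance(odd_groups, even_groups):
    """Sum of absolute rank-by-rank differences (0 means identical shape)."""
    a = rank_profile(odd_groups)
    b = rank_profile(even_groups)
    # Pad the shorter profile with zeros so every rank has a partner.
    size = max(len(a), len(b))
    a += [0.0] * (size - len(a))
    b += [0.0] * (size - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))
```

The inputs would be the lists of odd and even groups extracted from all Firth words of a sample. A value of 0 means the two distributions have exactly the same shape; the maximum possible value is 2.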
Top chart shows the results for sample A from Takahashi, bottom chart is sample B from Currier (odd groups in red, evens in blue):
Again, the frequency distributions are obviously not the same (as much as it sucks to admit this):
- The two most frequent odd groups (“8” and “S” for both samples) each occur more than twice as often as the third-ranking group (“Z” in both cases), with “8” in turn roughly 30% more common than “S”.
- By contrast, the three most common even groups (“9”, “AM”, and “OE”) come in with almost the same frequencies.
In short, there are three possible reasons for this, either individually or in combination:
1.) The transcription alphabet is wrong, and some letters of the Voynich alphabet considered identical really are different, or vice versa,
2.) The composition of the Firth groups is wrong, and they should really look different,
3.) There is nothing to the Stroke theory, and the enciphering is following a completely different method.
Of course, 3.) is absurd, and I included it only for the sake of completeness.
2.) most obviously is playing a part in all this. There should be as many Firth groups as there are letters in the plaintext alphabet, and this could be anything between 20 and 30, depending on whether “i” and “j” are both included, whether “k” and “y” play a role, whether “w” is or isn’t “uu”, and whether special characters like “ß” or umlauts like “ä” were used, etc. Yet the number of odd and even groups should be the same in any case.*)
But that is not the case for the Firth groups as they stand: there are two more odd groups than even groups, so either some odd groups should be merged, or one or two of the even groups should be split up. Doing so with the most frequent groups just might change the shape of the frequency distribution.
Thus, in a nutshell, there is still room for experimentation. I’ll have to look into exactly which groups would be candidates for merging or splitting, hopefully without getting bogged down too much in wishful thinking. Not to mention the fact that the transcription might play a role here as well.
*) Of course, this is not quite true, because for example there is no capital counterpart for the letter “ß”.