The Day I Parsed Nothing
Posted on February 22, 2009
supposeditypedallmypostslikethiswithnopunctuationnorwordbou
ndariesyouwouldstillbeabletofigureoutthegistofwhatimtryingtosay
It’s somewhat easy for you to read that, if you’re a native English speaker. Try this one out.
iefterföljandebladharjagsökttecknaendelafdestämningarochkänslorhvilk
aliktunderbaranyafåglarsväfvakringgränsenmellanbarndomochungdom
Give up? That’s okay. I have no idea what it says either, since I don’t know Swedish. But this raises another question: Can a machine figure it out, given a large enough training set? A few weeks ago I posed this problem to a coworker, and here’s what resulted after an hour or so.
Bigram counts for Alice in Wonderland, in base 16 for smaller display size. Each row is the first letter of a pair and each column is the second.
    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
a   9  fb  d0 1dd   0  5d  cc  30 2d0   f  82 417  e8 656   6  98   3 2d9 3c4 4c6  4e  c9  70   4 103   5
b  51  42   0   1 20c   0   0   1  6f   7   0  69   1   0  cb   0   0  3b  1c   9  cc   0   0   0  4b   0
c 13d   1  10   0 2c1   0   0 1c2  1e   0  b1  42   0   0 155   0   0  6e   1  4a  67   0   0   0   8   0
d 21e  a3  39  78 1d8  5b  56  b4 1c3   f   8  7b  3c  b8 253  2f   e  89 155 349  64  30  9b   0  64   0
e 4e3  f1 1b7 4b4 238 108 133 15b 17f  30  7a 262 1fc 41c 170 11d  62 7c9 49c 3ec  29  e6 259  68 139   e
f  c3  17  28   7  a3  85   c  40  fb   1   3  35  19   f 156   c   2  5e  34 140  6c   6  22   0  2e   0
g 182  25   d  1a 12a  1c  17 173  af   4   1  4d  1b  1e 125  13   6  dd  97  cb  3f   f  2e   0  11   1
h 4fd   b  12  13 ecd   e   7  47 346   2   1   e  1a   8 264   a   0  55  46 155  3c   2  24   0  3f   0
i  32  1b 27f 2f3  c6  a6  df  22   9   0  70 15c 130 7f7  b7  15   2  d7 282 550   7  5f  32   9   0  1a
j   6   0   1   0  14   0   0   0   0   0   0   0   0   0  11   0   0   0   0   0  66   0   0   0   0   0
k  4d   7   4   3 16e   9   1  11  f5   2   1  19   9  93  1e   4   1   5  24  76   8   4  10   0  17   0
l 176  25  19 160 2ea  a3   c  23 393   4  3d 2b4  20   8 16c  18   1  1a  82  a1  14  12  31   0 1d2   1
m 177  49   6   7 249  14   8   f  c9   0   1   2  13  1d 154  4d   2   5  3c  3a  80   2  14   0  4a   0
n 1c2  39  d8 50d 245  40 485  69 15d   7  79  73  23  49 26c  10   9  22 124 41c  46  19  7c  11  86   0
o  89  7b  87  93  44 294  56  ae  bb   1  da  d8 159 436 1ec  82   e 2bb 129 26a 622  5f 254   b  2f   2
p  a6  12   5   4 11f   1   0  42  9d   0   1  b5   3   2  c8  73   0  58  2a  59  46   4   8   0  11   0
q   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0  d0   0   0   0   0   0
r 1e1  25  63  d3 4a3  4c  4f  9a 1da   6  45  5f  75  70 1a7  66   a  64 203 1d3  58  17  6a   3 194   0
s 3ee  42  5f  24 343  45  37 3ce 1bd   9  33  74  59  8c 215  7b  16  21 1b1 3af  af  22  af   0  2b   0
t 2aa  84  70  5e 31b  51  33 e52 337   c  17 170  75  33 4e8  39  13  c4 246 354  ea  16 153   0  7c   0
u  31  24  c8  54  a8   e  b8   f  58   5  2f 15d  55  fb  11  d7   0 20a 1cd 266   1   7  27   0   3   e
v  14   0   0   0 2ba   0   0   0  3c   0   0   0   0   0  3f   0   0   0   0   1   1   0   0   0   3   0
w 274   7   a  18 15f   c   1 210 197   0   1  29   f  90 11a   5   2  23  36  40   6   5  2a   0   a   1
x   f   0   d   0  18   0   0   1  19   0   0   0   0   0   4  17   0   0   0  2a   0   0   1   0   0   0
y  d1  3e  30  43  74  2b  23  39  a0   8   a  3b  3a  1f 232  57   4  34  a9  eb  17   a  7e   0  1f   0
z   7   0   0   0  1f   0   0   0   a   0   0   e   0   0   0   0   0   0   0   0   0   0   0   0   2   e
In short, this means that the letter pair “ha” appeared 4fd₁₆ (1277) times, while “xh” appeared once. Most likely, that “xh” appeared at a word boundary, i.e., one word ended with x and the next word began with h (“…trying to box her own ears…”). The idea here is that low bigram counts correspond to word boundaries.
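For anyone who wants to rebuild a table like this, here is a minimal sketch of the counting step in C. It is not the code from alice.c; the function name count_bigrams and its signature are made up for illustration. It assumes an ASCII-compatible character set, lowercases everything, and skips anything that is not a-z, so pairs are counted straight across word boundaries (which is the only way a pair like “xh” gets counted at all).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tally letter pairs into counts[first][second].
 * The table is zeroed first. Anything outside a-z (after lowercasing)
 * is skipped, so the two letters of a pair may come from different words.
 * Returns the number of letters seen. */
size_t count_bigrams(const char *text, unsigned counts[26][26])
{
    assert(text != NULL);
    memset(counts, 0, sizeof(unsigned) * 26 * 26);

    size_t letters = 0;
    int prev = -1;                      /* index of the previous letter, -1 if none yet */
    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; ++p) {
        int c = tolower(*p);
        if (c < 'a' || c > 'z')
            continue;                   /* drop spaces, punctuation, digits */
        if (prev >= 0)
            counts[prev][c - 'a']++;    /* row = first letter, column = second */
        prev = c - 'a';
        letters++;
    }
    return letters;
}
```

Printing each cell with the printf format "%x" would give the base 16 layout used in the table.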
There are 107716 characters (a-z) in the text, so given an average word length of 8 (I made that up), there should be about 13464 words, and roughly as many word boundaries. This next part is tricky. A ‘threshold’ must be found. Any bigram with a count at or below the threshold will have a word boundary inserted, and any bigram with a count above the threshold will be treated as part of a word. I’ll cut to the chase: k<=133 (85₁₆) provides 13490 word boundaries (see the above table, find the cumulative sum for k<=133 yourself if you’d like).
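The threshold search and the splitting pass can be sketched in C roughly as follows. This is a guess at the mechanics, not the code from alice.c, and the names boundaries_at, pick_threshold, and segment are hypothetical. The rule of taking the smallest k whose cumulative sum reaches the target is an assumption; the post only says that k<=133 lands close to the estimate. The input to segment is assumed to be already stripped down to lowercase a-z.

```c
#include <assert.h>
#include <stddef.h>

/* Cumulative sum: total occurrences of every bigram whose count is <= k,
 * i.e. how many boundaries a threshold of k would insert. */
unsigned long boundaries_at(unsigned counts[26][26], unsigned k)
{
    unsigned long total = 0;
    for (int i = 0; i < 26; ++i)
        for (int j = 0; j < 26; ++j)
            if (counts[i][j] <= k)
                total += counts[i][j];
    return total;
}

/* Assumed selection rule: the smallest k whose boundary total reaches
 * `target`. Falls back to the largest count in the table. */
unsigned pick_threshold(unsigned counts[26][26], unsigned long target)
{
    unsigned max = 0;
    for (int i = 0; i < 26; ++i)
        for (int j = 0; j < 26; ++j)
            if (counts[i][j] > max)
                max = counts[i][j];
    for (unsigned k = 0; k < max; ++k)
        if (boundaries_at(counts, k) >= target)
            return k;
    return max;
}

/* Copy `letters` (lowercase a-z only) into `out`, inserting a space between
 * two adjacent letters whenever their bigram count is <= k. Output is
 * silently truncated to fit out_size. Returns the length written. */
size_t segment(const char *letters, unsigned counts[26][26], unsigned k,
               char *out, size_t out_size)
{
    assert(letters != NULL && out != NULL && out_size > 0);
    size_t n = 0;
    for (size_t i = 0; letters[i] != '\0'; ++i) {
        assert(letters[i] >= 'a' && letters[i] <= 'z');
        if (i > 0 && counts[letters[i - 1] - 'a'][letters[i] - 'a'] <= k
                && n + 1 < out_size)
            out[n++] = ' ';             /* rare pair: call it a word boundary */
        if (n + 1 < out_size)
            out[n++] = letters[i];
    }
    out[n] = '\0';
    return n;
}
```

With the numbers from this post, the call would be something like pick_threshold(counts, 13464), followed by segment over the stripped text using the resulting k.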
Results
downtherab b itholealicewas begin ningto get verytiredof sitting b y hersisterontheb an k andof hav ing nothingtodoonceortwiceshehad peepedintothebook hersister was reading butithadno pic turesor con versationsinitandwhatisthe useofabook thoughtalicewithout pic turesor con versationsoshewas consideringin herown mindaswellasshecould forthehot day madeher feel verys leep yandstupidwhetherthepleasureof ma kingadais y chain wouldbeworththetrou b leof getting upand pickingthedaisieswhensu
You get the point: It didn’t really work out so well. As I was skimming through the results, I found this gem:
goingoutalto gether li keacand lei wonder whatishouldbeli kethenandshetriedtofanc y whatthef lameofacand leis li
kea fterthecand leis b lownout forshecouldnotremem bereverhav ingseensuchathinga ftera whilefindingthat nothing
morehap penedshedec idedongoingintothegardenatoncebutalas for pooralicewhenshegottothedoorshefoundshehad for
And I’m sure you can guess what my reaction was. “Oh wow, check it out. I parsed nothing!”
View the source (alice.c)
Alice in Wonderland, from Project Gutenberg