Joe Frambach

Musings and Misadventures in Travel and Technology

The Day I Parsed Nothing

Posted on February 22, 2009

supposeditypedallmypostslikethiswithnopunctuationnorwordbou
ndariesyouwouldstillbeabletofigureoutthegistofwhatimtryingtosay

That’s fairly easy to read if you’re a native English speaker. Now try this one.

iefterföljandebladharjagsökttecknaendelafdestämningarochkänslorhvilk
aliktunderbaranyafåglarsväfvakringgränsenmellanbarndomochungdom

Give up? That’s okay; I have no idea what it says either, because I don’t know Swedish. This raises another question, though: can a machine figure it out, given a large enough training set? A few weeks ago I posed this problem to a coworker, and here’s what resulted after an hour or so.

Bigram counts for Alice in Wonderland, in base 16 for smaller display size. Each row is the first letter of the pair and each column is the second.

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
a   9  fb  d0 1dd   0  5d  cc  30 2d0   f  82 417  e8 656   6  98   3 2d9 3c4 4c6  4e  c9  70   4 103   5
b  51  42   0   1 20c   0   0   1  6f   7   0  69   1   0  cb   0   0  3b  1c   9  cc   0   0   0  4b   0
c 13d   1  10   0 2c1   0   0 1c2  1e   0  b1  42   0   0 155   0   0  6e   1  4a  67   0   0   0   8   0
d 21e  a3  39  78 1d8  5b  56  b4 1c3   f   8  7b  3c  b8 253  2f   e  89 155 349  64  30  9b   0  64   0
e 4e3  f1 1b7 4b4 238 108 133 15b 17f  30  7a 262 1fc 41c 170 11d  62 7c9 49c 3ec  29  e6 259  68 139   e
f  c3  17  28   7  a3  85   c  40  fb   1   3  35  19   f 156   c   2  5e  34 140  6c   6  22   0  2e   0
g 182  25   d  1a 12a  1c  17 173  af   4   1  4d  1b  1e 125  13   6  dd  97  cb  3f   f  2e   0  11   1
h 4fd   b  12  13 ecd   e   7  47 346   2   1   e  1a   8 264   a   0  55  46 155  3c   2  24   0  3f   0
i  32  1b 27f 2f3  c6  a6  df  22   9   0  70 15c 130 7f7  b7  15   2  d7 282 550   7  5f  32   9   0  1a
j   6   0   1   0  14   0   0   0   0   0   0   0   0   0  11   0   0   0   0   0  66   0   0   0   0   0
k  4d   7   4   3 16e   9   1  11  f5   2   1  19   9  93  1e   4   1   5  24  76   8   4  10   0  17   0
l 176  25  19 160 2ea  a3   c  23 393   4  3d 2b4  20   8 16c  18   1  1a  82  a1  14  12  31   0 1d2   1
m 177  49   6   7 249  14   8   f  c9   0   1   2  13  1d 154  4d   2   5  3c  3a  80   2  14   0  4a   0
n 1c2  39  d8 50d 245  40 485  69 15d   7  79  73  23  49 26c  10   9  22 124 41c  46  19  7c  11  86   0
o  89  7b  87  93  44 294  56  ae  bb   1  da  d8 159 436 1ec  82   e 2bb 129 26a 622  5f 254   b  2f   2
p  a6  12   5   4 11f   1   0  42  9d   0   1  b5   3   2  c8  73   0  58  2a  59  46   4   8   0  11   0
q   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0  d0   0   0   0   0   0
r 1e1  25  63  d3 4a3  4c  4f  9a 1da   6  45  5f  75  70 1a7  66   a  64 203 1d3  58  17  6a   3 194   0
s 3ee  42  5f  24 343  45  37 3ce 1bd   9  33  74  59  8c 215  7b  16  21 1b1 3af  af  22  af   0  2b   0
t 2aa  84  70  5e 31b  51  33 e52 337   c  17 170  75  33 4e8  39  13  c4 246 354  ea  16 153   0  7c   0
u  31  24  c8  54  a8   e  b8   f  58   5  2f 15d  55  fb  11  d7   0 20a 1cd 266   1   7  27   0   3   e
v  14   0   0   0 2ba   0   0   0  3c   0   0   0   0   0  3f   0   0   0   0   1   1   0   0   0   3   0
w 274   7   a  18 15f   c   1 210 197   0   1  29   f  90 11a   5   2  23  36  40   6   5  2a   0   a   1
x   f   0   d   0  18   0   0   1  19   0   0   0   0   0   4  17   0   0   0  2a   0   0   1   0   0   0
y  d1  3e  30  43  74  2b  23  39  a0   8   a  3b  3a  1f 232  57   4  34  a9  eb  17   a  7e   0  1f   0
z   7   0   0   0  1f   0   0   0   a   0   0   e   0   0   0   0   0   0   0   0   0   0   0   0   2   e 

In short, this means that the letter pair “ha” appeared 4fd (base 16) times, which is 1277 in decimal, while “xh” appeared once. Most likely, that “xh” appeared at a word boundary, i.e., one word ended with x and the next word began with h (“…trying to box her own ears…”). The idea here is that a low bigram count suggests the pair straddles a word boundary rather than sitting inside a word.
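For anyone who wants to reproduce a table like this, here is a minimal sketch in Python. It is not the original alice.c; the function names (`letters_only`, `bigram_counts`, `hex_table`) are made up for illustration, and it assumes the counting works the way the text implies: strip everything except a-z, then count each adjacent pair, including pairs that span what used to be a space.

```python
from collections import Counter
import string

def letters_only(text):
    """Lowercase the text and keep only the letters a-z."""
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)

def bigram_counts(text):
    """Count every adjacent letter pair in the stripped text."""
    s = letters_only(text)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def hex_table(counts):
    """Format counts as a 26x26 grid of hex values (row = first letter)."""
    letters = string.ascii_lowercase
    lines = ["  " + "".join(f"{c:>4}" for c in letters)]
    for a in letters:
        lines.append(a + " " + "".join(f"{counts[a + b]:>4x}" for b in letters))
    return "\n".join(lines)

# The pair "xh" only exists because "box" is followed by "her".
counts = bigram_counts("to box her own ears")
print(counts["xh"])    # 1
print(int("4fd", 16))  # 1277, the decimal value of the "ha" cell
```

Calling `print(hex_table(bigram_counts(text)))` on the full book text should give a grid in the same layout as the one above, though the exact numbers depend on which edition of the text is used and how its header and footer are trimmed.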

There are 107716 characters (a-z) in the text, so given an average word length of 8 (I made that up), there should be about 13464 words, and roughly as many word boundaries. This next part is tricky: a ‘threshold’ must be found. Any bigram with a count at or below the threshold will have a word boundary inserted, and any bigram with a count above the threshold will be treated as part of a word. I’ll cut to the chase: k<=133 (85 in base 16) provides 13490 word boundaries (see the above table, and find the cumulative sum for k<=133 yourself if you’d like).
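The threshold search and the splitting step can be sketched like this. Again, this is a Python illustration under my own assumptions rather than the code in alice.c: the helper names are hypothetical, and "closest cumulative sum to the target" is one plausible way to pick k, not necessarily how 133 was actually chosen.

```python
from collections import Counter

def bigram_counts(s):
    """Count adjacent pairs in a string that is already letters only."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def boundaries_at(counts, k):
    """Boundaries inserted if every bigram with count <= k is split.

    Each occurrence of such a bigram produces one boundary, so the total
    is the cumulative sum of the counts that are <= k.
    """
    return sum(c for c in counts.values() if c <= k)

def pick_threshold(counts, target):
    """Return the threshold whose boundary total is closest to the target."""
    candidates = sorted(set(counts.values()))
    return min(candidates, key=lambda k: abs(boundaries_at(counts, k) - target))

def segment(s, counts, k):
    """Insert a space inside every bigram whose count is <= k."""
    out = [s[0]] if s else []
    for prev, cur in zip(s, s[1:]):
        if counts[prev + cur] <= k:
            out.append(" ")
        out.append(cur)
    return "".join(out)

# Toy example: "ab" and "ba" are common, "bc" and "ca" are rare.
text = "ababcabab"
counts = bigram_counts(text)
k = pick_threshold(counts, target=2)
print(k)                          # 1
print(segment(text, counts, k))   # abab c abab

# The target used in the post: characters divided by the guessed word length.
print(107716 // 8)                # 13464
```

The toy example also shows the weakness of the approach: a rare pair gets split whether or not it really sits at a word boundary, and a common pair never gets split even when it does.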

Results

downtherab b itholealicewas begin ningto get verytiredof sitting b y hersisterontheb an k andof hav ing nothingtodoonceortwiceshehad peepedintothebook hersister was reading butithadno pic turesor con versationsinitandwhatisthe useofabook thoughtalicewithout pic turesor con versationsoshewas consideringin herown mindaswellasshecould forthehot day madeher feel verys leep yandstupidwhetherthepleasureof ma kingadais y chain wouldbeworththetrou b leof getting upand pickingthedaisieswhensu

You get the point: It didn’t really work out so well. As I was skimming through the results, I found this gem:

goingoutalto gether li keacand lei wonder whatishouldbeli kethenandshetriedtofanc y whatthef lameofacand leis li
kea fterthecand leis b lownout forshecouldnotremem bereverhav ingseensuchathinga ftera whilefindingthat nothing
morehap penedshedec idedongoingintothegardenatoncebutalas for pooralicewhenshegottothedoorshefoundshehad for

And I’m sure you can guess what my reaction was. “Oh wow, check it out. I parsed nothing!”

View the source (alice.c)
Alice in Wonderland, from Project Gutenberg