Stupid word games

Today, Jeroen Ooms announced the appearance on CRAN of an R package for language detection, wrapping the “CLD2” compact language detector. Obviously, given a tool like that on a holiday long weekend, my first reaction was to try to confuse it.

Two fun games to play with a language detector:

  • Find an obviously English sentence (ideally a quote) that it doesn’t recognise as English, and a very non-obviously English sentence that it does
  • Find two sentences with as few differences as possible, where one is recognised as English and the other not
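The second game can be played mechanically. Below is a minimal sketch in Python, assuming the detector is any function that maps a string to a label. The `toy_detect` function and its tiny word list are hypothetical stand-ins invented for illustration; they are not CLD2 and say nothing about how CLD2 would label these sentences.

```python
from typing import Callable, Iterable, List, Tuple


def single_word_flips(
    sentence: str,
    replacements: Iterable[str],
    detect: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (edited sentence, old label, new label) for every
    one-word substitution that changes the detector's label."""
    words = sentence.split()
    base = detect(sentence)
    flips = []
    for i, old in enumerate(words):
        for new in replacements:
            if new == old:
                continue
            candidate = " ".join(words[:i] + [new] + words[i + 1:])
            label = detect(candidate)
            if label != base:
                flips.append((candidate, base, label))
    return flips


# Hypothetical stand-in detector: NOT CLD2. It calls a sentence English
# when at least 40% of its words come from a deliberately tiny list.
ENGLISH_FUNCTION_WORDS = {"the", "of", "and", "to", "in", "is", "that", "it"}


def toy_detect(text: str) -> str:
    words = text.lower().split()
    hits = sum(w in ENGLISH_FUNCTION_WORDS for w in words)
    return "en" if words and hits / len(words) >= 0.4 else "unknown"


flips = single_word_flips("it saw a cat", ["the"], toy_detect)
for candidate, before, after in flips:
    print(f"{candidate!r}: {before} -> {after}")
```

To play against a real detector, pass its wrapper in place of `toy_detect`; the search itself does not care how the labels are produced.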

CLD2 doesn’t recognise the famous telegram about platypuses “Monotremes oviparous, ovum meroblastic” as English, which I suppose is fair enough.

It didn’t recognise Gertrude Stein’s “Rose is a rose is a rose is a rose”, or even the shorter “Rose is a rose is a rose”, though it had no trouble with the start of “Finnegans Wake” or bits of “Howl”. Even better than the Stein, though:

image

There’s a linguistic discussion of this sort of sentence at Language Log – it’s not usual English in a lot of ways –  but I think it’s going to be hard to beat as a false negative.

As a false positive I tried Jabberwocky (English), and then thought of Douglas Hofstadter’s self-referential example sentences:

image

Ok, so how far can the second one be warped and still show up as English?

image

That’s English, but “See Spot run” isn’t!

For minimal changes: changing “a” to “the”

image

And as a sort of combination of the two: Chomsky’s obviously-English nonsense sentence “Colorless green ideas sleep furiously” is recognised as English, but so is every permutation of the same words.
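One plausible reason permutations would not matter: if a detector's features are computed word by word, shuffling the words leaves those features untouched. The sketch below illustrates that idea with within-word character trigrams. This is an assumption made for illustration, not a description of CLD2's internals, which may well include features that span word boundaries.

```python
from collections import Counter
from itertools import permutations


def char_trigrams(text: str) -> Counter:
    """Count character trigrams inside each word (padded with spaces).

    Features never span two words, so word order cannot affect them.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


sentence = "Colorless green ideas sleep furiously"
words = sentence.split()
reference = char_trigrams(sentence)

# Every ordering of the five distinct words.
orderings = [" ".join(p) for p in permutations(words)]

# True if each ordering produces exactly the same feature counts.
same_features = all(char_trigrams(o) == reference for o in orderings)

print(len(orderings), same_features)
```

Any classifier that sees only these counts has to give all the orderings the same answer, however un-English most of them look to a human reader.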

So, is there a point to this (other than a fun way to waste half an hour)? Well, one of the important things to remember about automated classification algorithms is (as Zeynep Tufekci puts it) how alien they are. They can often imitate human decisions astonishingly well, but they don’t work the same way. If another person makes the same decisions as you, it’s a good bet there are some basically similar reasons underneath. It’s easy to believe the same is true for machines, but it isn’t.

 

Originally posted at notstatschat.tumblr.com/

Thomas Lumley

Thomas Lumley attended Monash University (B.Sc.(Hons) in Pure Mathematics), the University of Oxford (M.Sc. in Applied Statistics) and the University of Washington, Seattle (PhD in Biostatistics). He spent twelve years on the faculty of the Department of Biostatistics at the University of Washington, and then moved to Auckland in 2010. He is still an Affiliate Professor at the University of Washington.
