Hannah Pedersen Google Summer of Code Project: 7/20

Wednesday, July 20, 2016

7/20

I kept working on fuzzy matching today. This was a lot of trial and error. I tried using getLevenshteinDistance and getJaroWinklerDistance, but neither worked as well as what I currently had (two passes through my ingredient list, one looking for an exact match and one looking for an ingredient that contained the requested value).

I ended up finding a better combination, however. I pass through once to look for an exact match and then I pass through looking for the highest result when comparing with getFuzzyDistance.

This created a new problem, though: 'null' was never returned, meaning Alexa would never say "you don't need that ingredient". If I asked Alexa "how much pig do I need", it would return all-purpose flour, the highest-ranked result from getFuzzyDistance. To fix this, I've added one last check at the end to make sure the result being returned at least contains the passed-in ingredient value. This breaks a few odd cases, such as asking about "oats" when the recipe specifies "oatmeal" (which worked without this check), but I think for the most part it's getting more accurate.
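
The three steps described above (exact match, best fuzzy score, then a containment check) could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (IngredientMatcher, findIngredient, fuzzyScore) are hypothetical, and fuzzyScore is a simplified stand-in for getFuzzyDistance so the example has no external dependencies. It is not the library's implementation or the actual skill code.

```java
import java.util.List;
import java.util.Locale;

class IngredientMatcher {

    // Hypothetical stand-in for a fuzzy score (higher means closer).
    // It scans the term left to right, adds 1 for each query character
    // found in order, and adds a bonus of 2 when a match directly
    // follows the previous match.
    static int fuzzyScore(String term, String query) {
        String t = term.toLowerCase(Locale.ENGLISH);
        String q = query.toLowerCase(Locale.ENGLISH);
        int score = 0;
        int termIndex = 0;
        int previousMatch = -2; // sentinel: no previous match yet
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            boolean found = false;
            while (termIndex < t.length() && !found) {
                if (t.charAt(termIndex) == c) {
                    score++;
                    if (previousMatch + 1 == termIndex) {
                        score += 2;
                    }
                    previousMatch = termIndex;
                    found = true;
                }
                termIndex++;
            }
        }
        return score;
    }

    // Returns the matching ingredient, or null if nothing plausible is found.
    static String findIngredient(List<String> ingredients, String spoken) {
        String query = spoken.toLowerCase(Locale.ENGLISH).trim();

        // Pass 1: exact match, ignoring case.
        for (String ingredient : ingredients) {
            if (ingredient.equalsIgnoreCase(query)) {
                return ingredient;
            }
        }

        // Pass 2: keep the ingredient with the highest fuzzy score.
        String best = null;
        int bestScore = 0;
        for (String ingredient : ingredients) {
            int score = fuzzyScore(ingredient, query);
            if (score > bestScore) {
                bestScore = score;
                best = ingredient;
            }
        }

        // Final check: only accept the best candidate if it actually
        // contains the requested value; otherwise report "not needed".
        if (best != null && best.toLowerCase(Locale.ENGLISH).contains(query)) {
            return best;
        }
        return null;
    }
}
```

With an ingredient list of "all-purpose flour", "brown sugar", "oatmeal" and "butter", this sketch returns "all-purpose flour" for "flour" and null for "pig". It also returns null for "oats", which is the trade-off mentioned above: "oatmeal" wins the fuzzy pass but fails the containment check.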

1 comment:

Bart Massey, July 20, 2016 at 4:59 PM
If Levenshtein distance didn't work, Jaro-Winkler likely wouldn't either, because it's just a computationally cheaper but weaker test.

You'll want to start with a stemming algorithm, to get rid of spurious plurals and the like. Phonetic matching may be especially helpful for you, since you're dealing with spoken input: see the Phonix algorithm (e.g. https://github.com/BartMassey/phonetic-code ) for a nice matching algorithm. My spelling suggestion algorithm uses Phonix and Levenshtein distance by default, and works pretty well, although that problem is easier. Splitting compound words would solve your oatmeal problem, and should be pretty straightforward.

Once you've got word variation sorted out, you'll probably want to do some deeper things. It would be nice to have a synonym dictionary for cooking ingredients. Unfortunately, it's hard to Google for one, so I don't know if one is out there already. I would guess so.

There's a whole literature on text information retrieval, and you could easily do a thesis on the topic of finding texts that refer to a given term. Something like OpenEphyra (https://mu.lti.cs.cmu.edu/trac/Ephyra/wiki/OpenEphyra) could probably be cut down and adapted for your problem, but that sounds like a full GSoC by itself. So don't get bogged down here for now. You're on the right track: find heuristics that aren't terrible, and work on the rest of this separable problem later.