Ingredient list

Ingredient lists that can be found online or extracted from existing datasets are of limited use: they are often very incomplete and specific to a particular set of recipes. Moreover, these lists typically include neither the type of ingredient (e.g., pasta) nor synonyms (e.g., zucchini is a synonym for courgette). Although it is hardly possible to create a complete list of ingredients (a list will always be relative to the sources used), we document here how we created our own list of ingredients for our cooking assistant.

We used the following steps to create an ingredient list:

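  • Created CSV files with lists of ingredients from the following sources: a Kaggle list, the recipe5K list, and Spoonacular's list of the ‘1000 most frequently used’ ingredients.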

We preprocessed each of these lists:

  • Kaggle list:

    • removed ‘la’, ‘red’, ‘uni’ and other ungrammatical items;

    • removed ‘and’, ‘&’, ‘or’ and split the conjoined items into separate entries, except in fixed names such as ‘sweet and sour mix’;

    • removed qualifiers such as ‘boiled’, ‘chopped’, ‘cook and drain’, ‘cooked’, ‘crisp-cooked and crumbled’, ‘crumbled’, ‘diced’, ‘for dusting’, ‘fried’, ‘grated’, ‘minced’ (except for ‘minced beef’), ‘peel and devein’, ‘peeled’, ‘roasted’, ‘refrigerated’, ‘shaved’, ‘shredded’, ‘sliced’, ‘slivered’, ‘smoked’, ‘toasted’, but also ‘boneless’, ‘bone-in’, ‘bottled’, ‘canned’, ‘condensed’, ‘dried’, ‘dry’, ‘fresh’, ‘frozen’, ‘ground’, ‘natural’, ‘new’, ‘plain’, ‘salted’, ‘skinless’, ‘sweetened’, ‘unbaked’, ‘uncook’, ‘uncooked’, ‘unsalted’, ‘unsweetened’, ‘vegan’ (as these treatments are typically not performed at home; the product is bought in a store as such), but not ‘sweet’ in e.g. ‘sweet potato’;

    • removed size indicators such as ‘medium’, ‘small’, ‘1 inch thick’, ‘halves’, ‘pieces’, ‘extra large’, ‘whole’, ‘large’;

    • removed modifiers such as ‘low sodium’, ‘medium dry’, ‘medium firm’, ‘medium grain’, ‘low fat’, ‘nonfat’, ‘reduced fat’, ‘1% low-fat’, ‘extra sharp’;

    • removed modifiers such as ‘seedless’, ‘unleavened’, ‘vegetarian’ even though a vegetarian sausage is not the same as a sausage and an unleavened bread (made without yeast) is very different from a bread;

    • removed cuisine modifiers such as ‘American’, ‘Italian’, ‘European’ but not e.g. ‘Dubliner’ in ‘Dubliner cheese’, ‘Greek’ in ‘Greek yoghurt’ or ‘Guinness’ in ‘Guinness beer’;

    • removed dishes such as ‘mashed potatoes’, ‘pork shoulder roast’ and soups;

    • removed brand labels such as ‘Bertolli’ and some brand-specific products (which most likely would not be found very often in recipes) but not ‘OREO Cookies’ or ‘Nutella’;

    • kept both the singular and the plural form if both were present, e.g. ‘apple’ and ‘apples’, ‘egg’ and ‘eggs’, ‘walnut’ and ‘walnuts’, and ‘tomato’ and ‘tomatoes’;

    • also kept ingredient categories such as ‘cheese’, ‘fish’, ‘soup’, ‘stew’;

    • changed all capitals to lower case letters.

This resulted in a reduced list of 3271 ingredients (from the original 4469 items). We’ll refer to this list as the ‘ingredient list’ we are constructing.
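The cleaning rules above can be sketched in code. The following is a minimal illustration, not the procedure we actually ran: the word lists are small subsets of the ones listed above, and the names `QUALIFIERS`, `SIZE_INDICATORS`, `PROTECTED` and `clean_ingredient` are hypothetical.

```python
import re

# Illustrative subsets of the word lists described above (assumption: not the full lists).
QUALIFIERS = ["crisp-cooked and crumbled", "cook and drain", "chopped",
              "diced", "fresh", "frozen", "unsalted", "minced"]
SIZE_INDICATORS = ["extra large", "1 inch thick", "medium", "small", "large", "whole"]
# Names that are kept as-is even though they contain a listed word.
PROTECTED = {"minced beef", "sweet and sour mix"}


def _strip_phrases(name, phrases):
    # Remove longest phrases first so 'extra large' goes before 'large'.
    for phrase in sorted(phrases, key=len, reverse=True):
        name = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", name)
    return name


def clean_ingredient(raw):
    """Return the cleaned ingredient name(s) for one raw list entry."""
    name = raw.lower().strip()          # all names are stored in lower case
    if name in PROTECTED:
        return [name]
    name = _strip_phrases(name, QUALIFIERS + SIZE_INDICATORS)
    # Split conjunctions ('and', 'or', '&') into separate entries.
    parts = re.split(r"\s+(?:and|or)\s+|\s*&\s*", name)
    # Collapse leftover whitespace and drop entries that became empty.
    return [" ".join(p.split()) for p in parts if p.strip()]


print(clean_ingredient("Chopped Fresh Parsley"))
print(clean_ingredient("Salt & Pepper"))
print(clean_ingredient("Minced Beef"))
```

Protected names are matched on the whole entry only, so a real implementation would need a more careful exception mechanism, and removing ungrammatical items, dishes and brand labels would still require manual review.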

We continued with the recipe5K list:

  • Comparing the ingredient list with the recipe5K list yielded 529 new items. Analysis revealed that most of these were singular forms of items listed in the plural (or vice versa), spelling variants (British instead of American spelling, e.g. ‘tartare sauce’ vs ‘tartar sauce’), or synonyms (e.g., courgette instead of zucchini) of items on the original Kaggle list. The recipe5K list also included a noticeably larger variety of cheese and fish. All of these were added to the ingredient list;

  • We excluded (did not add to the ingredient list) dishes such as ‘ratatouille’, ‘rice pudding’, ‘roast turkey’, ‘rock salmon’, ‘taco’ (but not ‘taco shell’), ‘terrine’ and ‘tempura’, as well as cooking techniques such as ‘teriyaki’ (but not ‘teriyaki sauce’). We did include ‘ratafia biscuits’.

This yielded a list of 3676 ingredients, i.e. 405 items were added to the list (excluding some synonyms which were also added).
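The comparison step can be illustrated with a small sketch. It assumes both lists are available as plain lists of strings; the functions `number_variants` and `compare_lists` are hypothetical names, and the suffix rules are deliberately naive (they only cover regular English ‘-s’ and ‘-es’ plurals). Splitting the result into two groups only supports the analysis; as described above, singular and plural variants were added to the ingredient list as well.

```python
def number_variants(name):
    """Return naive singular/plural variants of a name (regular -s/-es only)."""
    variants = {name + "s", name + "es"}
    if name.endswith("es"):
        variants.add(name[:-2])
    if name.endswith("s"):
        variants.add(name[:-1])
    return variants


def compare_lists(ingredient_list, candidate_list):
    """Split candidates missing from the ingredient list into
    (genuinely new names, singular/plural variants of known names)."""
    known = {n.lower() for n in ingredient_list}
    new, variants = [], []
    for name in sorted({c.lower() for c in candidate_list} - known):
        if number_variants(name) & known:
            variants.append(name)
        else:
            new.append(name)
    return new, variants


new, variants = compare_lists(
    ["apple", "tomatoes", "zucchini"],
    ["apples", "tomato", "courgette", "zucchini"],
)
print(new)       # names with no singular/plural counterpart in the list
print(variants)  # names that differ from a known name only in number
```

Synonyms such as ‘courgette’ end up in the first group, so they still have to be recognised by hand.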

Finally, we compared the ingredient list with the list of ‘1000 most frequently used’ ingredients from Spoonacular. We found 370 items on the Spoonacular list that were not included in the ingredient list. Most of these items, however, included modifiers that we excluded in previous steps (see above). We added 41 items, resulting in an updated list of 3717 items.

The much lower number of items we added from the Spoonacular list indicates the ingredient list was already more complete than the cleaned Kaggle list. The list is nevertheless still missing (many) ingredients. Checking against a list of 95 types of seafood on Wikipedia, for example, yielded 44 missing items, which we added too. Similarly, checking against a list of 311 vegetables, fruits and peppers on Wikipedia, we found that 197 were still missing. This can be partly explained by the very specific vegetables listed on the Wikipedia page. We only included those for which it was easy to find a recipe with a Google search, which resulted in the addition of another 79 items. We also added pasta variants obtained from Wikipedia’s list of pastas, as well as gnocchi with their synonyms. The final list consists of 3850 different ingredient items and 4270 ingredient names (including 420 synonyms).
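The distinction between ingredient items and ingredient names can be made concrete with a small sketch. It assumes a representation in which each item maps to its list of synonyms; this structure, the choice of which name is the main one, and the function `build_name_index` are illustrative assumptions rather than the format we actually used.

```python
# Each ingredient item maps to its (possibly empty) list of synonyms.
# Which name serves as the item and which as the synonym is arbitrary here.
ingredients = {
    "zucchini": ["courgette"],
    "tartar sauce": ["tartare sauce"],
    "gnocchi": [],
}


def build_name_index(items):
    """Map every ingredient name (item or synonym) to its item."""
    index = {}
    for item, synonyms in items.items():
        for name in [item, *synonyms]:
            index[name.lower()] = item
    return index


index = build_name_index(ingredients)
n_items = len(ingredients)
n_synonyms = sum(len(s) for s in ingredients.values())
print(index["courgette"])                 # item that 'courgette' refers to
print(n_items, n_synonyms, len(index))    # names = items + synonyms
```

In the same way, the 3850 items and 420 synonyms of the final list add up to 4270 names.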

Notes

Even though ingredients are names, we made sure all ingredient names in our list are in lower case to avoid case-matching issues.

A key issue that remains is how to deal with spelling variants of ingredients (e.g., chili, chilli and chile; whisky and whiskey; wheat germ and wheatgerm) and with singular versus plural forms.
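One possible way to surface candidate spelling variants for manual review is sketched below. This is an untested suggestion, not something we applied to the list: it uses Python's standard `difflib` module, and the helper names `spacing_key` and `variant_candidates` as well as the cutoff value are assumptions.

```python
import difflib


def spacing_key(name):
    """Ignore spaces and hyphens, so 'wheat germ' and 'wheatgerm' compare equal."""
    return name.lower().replace(" ", "").replace("-", "")


def variant_candidates(name, names, cutoff=0.85):
    """Return names that look like spelling variants of `name`.

    The cutoff is a similarity threshold between 0 and 1 and needs tuning:
    too low gives many false positives, too high misses real variants.
    """
    others = [n for n in names if n != name]
    same_spacing = [n for n in others if spacing_key(n) == spacing_key(name)]
    close = difflib.get_close_matches(name, others, n=5, cutoff=cutoff)
    # Preserve order while removing duplicates.
    return list(dict.fromkeys(same_spacing + close))


names = ["chili", "chilli", "chile", "whisky", "whiskey",
         "wheat germ", "wheatgerm", "cheese"]
print(variant_candidates("chili", names))
print(variant_candidates("wheat germ", names))
```

With this cutoff, ‘chilli’ is suggested for ‘chili’ but ‘chile’ is not, which illustrates why such suggestions can only support, not replace, a manual check.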

 

 
