Cracking the Restaurant Industry

Thilo Weigert, Marc Giraud and I took on the task of predicting restaurant success using data analysis for our final project in my MIT Sloan class on Business Analytics (15.071).

One of the highlights of my time at MIT was taking class 15.071, 'The Analytics Edge', at MIT's Sloan School of Management. The final project was to apply our knowledge of business analytics to a big data problem and come up with meaningful recommendations in a business context. We decided to use data from Yelp, which every year publishes an anonymised dataset of reviews, photos and user data for the Yelp Dataset Challenge. The 2017 dataset contains over 4 million reviews for 144,000 businesses, of which 48,000 are restaurants. What could we recommend to current and prospective restaurant owners based on patterns in the data?

Learning R, the language of data science

Although I am well versed in Python, which I would ordinarily use for a project like this, the class had been taught entirely in R, so this was a great chance to use some of the many tools available for statistics and machine learning in R. The code for analysing the data consisted mainly of two steps: preprocessing, to turn the data from the supplied JSON into a useful data structure we could work with in R; and analysis, which involved writing various functions to apply analytical methods and seek patterns in the data. The analysis covered both metadata (timestamps, average review scores, geo-locations, etc.) and natural language processing of the review content itself.
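To give a feel for the preprocessing step, here is a minimal sketch that parses newline-delimited JSON business records and keeps only the restaurants. It is written in Python rather than the R we used for the project, and the field names (`business_id`, `city`, `stars`, `categories`), the sample records and the `load_restaurants` helper are all assumptions for illustration, not taken from our code.

```python
import json

# Hypothetical sample records, one JSON object per line.
# The field names are assumed for illustration only.
SAMPLE_LINES = [
    '{"business_id": "b1", "city": "Toronto", "stars": 4.5, '
    '"categories": ["Restaurants", "Breakfast & Brunch"]}',
    '{"business_id": "b2", "city": "Las Vegas", "stars": 3.0, '
    '"categories": ["Hair Salons"]}',
    '{"business_id": "b3", "city": "Las Vegas", "stars": 4.0, '
    '"categories": ["Bars", "Restaurants"]}',
    '',
]


def load_restaurants(lines):
    """Parse JSON lines and return only the records tagged as restaurants."""
    restaurants = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # A missing or null category list is treated as empty.
        categories = record.get("categories") or []
        if "Restaurants" in categories:
            restaurants.append(record)
    return restaurants


restaurants = load_restaurants(SAMPLE_LINES)
```

With a real file, the same function could be fed an open file handle instead of the sample list, since iterating over a file yields one line at a time.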

The methods used to come up with our recommendations included the following:

  • logistic regression
  • CART analysis
  • k-means clustering
  • hierarchical clustering
  • word frequency analysis
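As an example of the simplest method in the list above, word frequency analysis can be sketched in a few lines. This is an illustration in Python, not our R code; the stop-word list, the sample reviews and the `word_frequencies` helper are made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "for", "of", "to", "but", "in", "it"}


def word_frequencies(reviews):
    """Count how often each non-stop word appears across all reviews."""
    counts = Counter()
    for text in reviews:
        # Lower-case the text and split it into runs of letters/apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Hypothetical review snippets, invented for this example.
SAMPLE_REVIEWS = [
    "The brunch was great and the coffee was great.",
    "Great patio, but the service was slow.",
]

freqs = word_frequencies(SAMPLE_REVIEWS)
top_word, top_count = freqs.most_common(1)[0]
```

Comparing such counts between highly rated and poorly rated restaurants is one way to see which words go with each group.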

What should you know before opening a restaurant?

Because of the large dataset size and our limited time and processing power, we segmented the data by city and chose to focus on the two cities with the most data points: Toronto, Canada and Las Vegas, NV.
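Picking the focus cities amounts to counting records per city and keeping the largest groups. The sketch below shows the idea in Python (again, not the R we actually used); the sample pairs and the helper names `top_cities` and `segment_by_city` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (city, business_id) pairs standing in for the real records.
SAMPLE = [
    ("Toronto", "t1"),
    ("Las Vegas", "v1"),
    ("Toronto", "t2"),
    ("Phoenix", "p1"),
    ("Las Vegas", "v2"),
    ("Toronto", "t3"),
]


def top_cities(records, n=2):
    """Return the n cities with the most records, largest first."""
    counts = Counter(city for city, _ in records)
    return [city for city, _ in counts.most_common(n)]


def segment_by_city(records, cities):
    """Group business ids by city, keeping only the chosen cities."""
    segments = defaultdict(list)
    for city, business_id in records:
        if city in cities:
            segments[city].append(business_id)
    return dict(segments)


focus = top_cities(SAMPLE)
segments = segment_by_city(SAMPLE, focus)
```

Each entry in `segments` can then be analysed on its own, which keeps the working set small enough for limited processing power.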

Summing up our analyses, we found that in Toronto the most popular categories of restaurants are brunch/breakfast and grille/dinner places. In many cases this is consistent with the most popular attributes: a ‘trendy’ or ‘hipster’ ambiance. In terms of location, we recommend looking at district 1 for grille restaurants, and districts 2 and 7 for brunch/breakfast places. We would discourage prospective restaurant owners from opening a seafood or French restaurant, or a tapas bar, without serious further research.

Looking at Las Vegas, we see that, as in Toronto, brunch/breakfast places have been particularly popular among Yelp users, together with Japanese/Asian fusion restaurants and various types of bars. This recommendation is supported by the highest performing attributes being ‘good place for breakfast’, ‘café’, and ‘good place for dinner’. We would suggest looking at district 2 for Asian restaurants and district 5 for cafés and bars, and avoiding the Italian category in general because it is likely to be a saturated market in this area.

See the code and more

All code used in my analysis is available on my GitHub, along with our full project report (which received an A) and some plots.