Prepping Yelp Data for Mining

In April 2014, my Data Mining project team at BYU began work on our semester project, which we chose from kaggle.com. The project was based on data from an old Yelp Business Ratings Prediction contest that finished in August 2013. Over several posts, we will take a look at some of the interesting things our team did in this project.

Problem Description

As explained on the Kaggle web page for the Yelp contest, the goal of the project was to predict the number of stars a user would give a new business, using user data, business data, check-in data, and past review data. The prediction model/algorithm would then become the heart of a recommender system.

Data Description

The dataset consists of four files of JSON objects. Each line of a file is a distinct JSON object, which means we can parse the files one line at a time as we go (a quick example follows the schema below). The following is a description of the file yelp_training_set_business.json:

{
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude,
  'stars': (star rating, rounded to half-stars),
  'review_count': review count,
  'categories': [(localized category names)],
  'open': True / False (corresponds to permanently closed, not business hours),
}
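
Because each line is a complete JSON object, you can spot-check the data with nothing more than json.loads on a single line. Here is a small sketch (the file path matches the layout the loader script below expects):

import json

# Each line of the file is one complete JSON object, so parsing a single
# line is enough for a quick look at the data.
with open("json/yelp_training_set_business.json") as jfile:
    first_line = next(jfile)
    business = json.loads(first_line)
    print(business["name"], business["city"], business["stars"])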

Extracting, Transforming, and Loading the Data

I had heard a lot on the web about this concept called ETL (Extract, Transform, Load), but in my data mining and machine learning training I have heard it mentioned a grand total of ONE time. Well, I guess this is our ETL section, where we “Extract” the data from the files, “Transform” the data into meaningful representations, and “Load” it into our database.

So here’s how we implemented it. I wrote a Python script to parse the JSON files and load them into a MySQL database. By using a relational database, we could combine, aggregate, and output the data in countless ways to feed into our learning algorithm (an example query appears near the end of this post).

For your convenience, I put my code and a subset of the data on GitHub so you can explore them yourself: https://github.com/tgroshon/yelp-data-mining-loader. I used Peewee, a lightweight ORM compatible with MySQL (amongst others), to create the database and load the data. Here is an excerpt of the Business Model from models.py:

class Business(peewee.Model):
    """Business Object Table."""

    bid = peewee.PrimaryKeyField()
    business_id = peewee.CharField()  # encrypted ID with letters, numbers, and symbols
    name = peewee.CharField()
    full_address = peewee.CharField()
    city = peewee.CharField()
    state = peewee.CharField(max_length=2)  # AZ
    latitude = peewee.CharField()
    longitude = peewee.CharField()
    stars = peewee.DecimalField()  # star rating rounded to half-stars
    review_count = peewee.BigIntegerField()
    is_open = peewee.BooleanField()

    class Meta:
        database = db
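
The Meta class points every Model at a shared connection object named db. In models.py that object is a Peewee MySQLDatabase; a minimal sketch of what it looks like (the database name and credentials here are placeholder assumptions, not the repo's actual settings):

import peewee

# Shared connection used by every Model through Meta.database.
# Database name and credentials are placeholders.
db = peewee.MySQLDatabase("yelp", host="localhost", user="root")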

Not too strange. This was the most complex Model, and it covers most of the bases for creating a Model class (which maps to a database table) with Peewee. Now let's take a look at the load script itself: json_to_mysql.py. First, we need to read each JSON file, parse the input into JSON objects, and map those objects to Models for saving. Here is the function that reads the lines of a file, parses them to JSON, and yields the parsed objects:

def iterate_file(model_name, shortcircuit=True, status_frequency=500):
    """Read a training-set JSON file line by line and yield each parsed object."""
    i = 0
    jsonfilename = "json/yelp_training_set_%s.json" % model_name.lower()
    with open(jsonfilename) as jfile:
        for line in jfile:
            i += 1
            yield json.loads(line)
            if i % status_frequency == 0:
                print("Status >>> %s: %d" % (jsonfilename, i))
            if shortcircuit and i == 10:
                return  # stop early; raising StopIteration inside a generator is an error on modern Python

The function takes the name of a model, opens the corresponding JSON data file for that model, iterates over each line in that file, and yields the parsed JSON object up to the caller. This allows the function to be used as an iterator, as in this excerpt from the save_businesses() function:

def save_businesses():
    for bdata in iterate_file("business", shortcircuit=False):
        business = Business()
        business.business_id = bdata['business_id']
        business.name = bdata['name']
        business.full_address = bdata['full_address']
        business.city = bdata['city']
        business.state = bdata['state']
        business.latitude = bdata['latitude']
        business.longitude = bdata['longitude']
        business.stars = decimal.Decimal(bdata.get('stars', 0))
        business.review_count = int(bdata['review_count'])
        business.is_open = bool(bdata['open'])  # 'open' is already a JSON boolean in the data, not a string
        business.save()

Straightforward, no? Each Model has a corresponding function that handles creating a Model instance, assigning the appropriate data, and saving it to the database (a rough sketch of save_users() appears below). Then, at the end, the script calls each of those functions when it is run:

if __name__ == "__main__":
    reset_database()

    save_businesses()
    save_users()
    save_checkins()
    save_review()
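
The other save functions follow the same shape as save_businesses(). As a sketch, save_users() might look something like this — the User model fields used here are my assumptions based on the user JSON schema, not necessarily the exact fields in the repo:

def save_users():
    for udata in iterate_file("user", shortcircuit=False):
        user = User()
        user.user_id = udata['user_id']          # encrypted user ID
        user.name = udata['name']
        user.review_count = int(udata['review_count'])
        user.average_stars = decimal.Decimal(udata.get('average_stars', 0))
        # 'votes' is a nested dict of counts keyed by vote type (useful/funny/cool)
        user.votes_useful = int(udata['votes'].get('useful', 0))
        user.save()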

The reset_database() function is important: it creates the tables for your Models, dropping any old ones first. It looks like this:

def reset_database():
    tables = (Business, Review, User, Checkin, Neighborhood, Category,)
    for table in tables:
        # Nuke the Tables
        try:
            table.drop_table()
        except OperationalError:
            pass
        # Create the Tables
        try:
            table.create_table()
        except OperationalError:
            pass
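
With the tables created and the data loaded, the relational layout starts to pay off: you can pull aggregate features for the learning algorithm with ordinary queries. For example, average star rating per city — an illustrative query, not something taken from the project code:

# Illustrative: average star rating per city, highest first.
query = (Business
         .select(Business.city, peewee.fn.AVG(Business.stars).alias('avg_stars'))
         .group_by(Business.city)
         .order_by(peewee.fn.AVG(Business.stars).desc()))

for row in query:
    print(row.city, row.avg_stars)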

Conclusion

And that is the gist of ETL (I guess?) for this system. The funny part is that I think this data originally came from a SQL database and was converted to JSON for distribution in the contest. Now that we have undone all their work, we can start work of our own in Part 2!
