Omniscient Life App

Elements of this post have been ruminating in my mind for a few years now. This is the best I have been able to put them into words, such as they are.


A thought experiment for your consideration.

Suppose an app is developed, the Omniscient Life app, with the following features:

  • Scans your biometrics to identify you using government databases
  • Collects all audio/video about your life into a history stream by connecting with NSA/CIA/GCHQ surveillance archives
  • Data mines the videos using AI to identify micro expressions to add a “thought-track” as subtitles or commentary
  • Connects to your social media (Facebook, Twitter, and LinkedIn) to share your history stream

With this app, you can view any moment of your life with your accompanying psychological and emotional state via the “thought-track”.

If this app existed, how would it affect your views on the following:

  1. Privacy
  2. Government Surveillance
  3. Social Media
  4. Information Security
  5. Artificial Intelligence
  6. Cyber Bullying:
    • Doxxing
    • Twitter mobs
    • Crucible effect (red scare, witch hunt)
    • etc.

Version 2 adds the following features:

  • All history streams are searchable and shareable by any other user
  • The app analyzes cumulative thoughts and behaviors of a user and assigns labels: good person, bad person, kind, jerk, religious, sanctimonious, revolutionary, terrorist, etc.

How would that further affect your views?


The Omniscient Life app leads me to the following personal views:

  1. More privacy – everyone has something to hide
  2. Less surveillance – governments should not spy on people and certainly not store the data
  3. Less social media – over-sharing is actually dangerous
  4. More security – data is power in the information age and must be protected
  5. Artificial Intelligence – some calculations should not be made
  6. Less cyber bullying – if an abuser has motive, the internet provides means and opportunity


The internet is a powerful tool. Use it wisely. Protect it strongly.

Related Reading:

  1. The Eternal Value of Privacy:
  2. Do Not Ship It:

Prepping Yelp Data for Mining

In April 2014, my Data Mining project team at BYU began work on our semester project, which we chose from Kaggle’s past competitions: an old Yelp Business Ratings Prediction Contest that finished in August 2013. Over several posts, we will take a look at some of the interesting things our team did in this project.

Problem Description

As explained on the Kaggle web page for the Yelp Contest, the goal of the project was to predict the number of stars a user would rate a new business using user data, business data, checkin data, and past reviews data. The prediction model/algorithm would then become the heart of a recommender system.

Data Description

The dataset consists of four files, each holding one JSON object per line, which means we will parse each line into JSON as we go. The following is a description of the file yelp_training_set_business.json:

{
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude,
  'stars': (star rating, rounded to half-stars),
  'review_count': review count,
  'categories': [(localized category names)],
  'open': True / False (corresponds to permanently closed, not business hours)
}
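Since each line of the file is a standalone JSON object, parsing is a one-liner per line. A minimal sketch (the field values here are made up for illustration, not taken from the real dataset):

```python
import json

# One line as it might appear in yelp_training_set_business.json
# (values are invented for this example).
line = '{"type": "business", "business_id": "abc123", "name": "Cafe Rio", "stars": 4.5, "open": true}'

record = json.loads(line)  # each line parses independently of the others
print(record["name"], record["stars"], record["open"])
```

Note that the JSON value `true` arrives in Python as the boolean `True`, not the string `"True"`, which matters when we load the `open` field later.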

Extracting, Transforming, and Loading the Data

I have heard a lot on the web about this concept called ETL (Extract, Transform, Load), but in my data mining and machine learning training I have heard it mentioned a grand total of ONE time. Well, I guess this is our ETL section, where we “Extract” the data from the files, “Transform” the data into meaningful representations, and “Load” it into our database.

So here’s how we implemented it. I wrote a Python script to parse the JSON files and load them into a MySQL database. By using a relational database, we could combine, aggregate, and output the data in countless ways for input into our learning algorithm.

For your convenience, I put my code and a subset of the data on GitHub so you can explore them yourself. I used Peewee, a lightweight ORM compatible with MySQL (amongst others), to create the database and load the data. Here is an excerpt showing the Business Model:

class Business(peewee.Model):
    """Business Object Table."""

    bid = peewee.PrimaryKeyField()
    business_id = peewee.CharField()  # encrypted ID with letters, numbers, and symbols
    name = peewee.CharField()
    full_address = peewee.CharField()
    city = peewee.CharField()
    state = peewee.CharField(max_length=2)  # AZ
    latitude = peewee.CharField()
    longitude = peewee.CharField()
    stars = peewee.DecimalField()  # star rating rounded to half-stars
    review_count = peewee.BigIntegerField()
    is_open = peewee.BooleanField()

    class Meta:
        database = db

Not too strange. This was the most complex Model, and it covers most of the bases for creating a Model class (which maps to a database table) with Peewee. Now let’s take a look at the load script itself. First, we need to read each JSON file, parse the input into JSON objects, and map the JSON objects to Models for saving. Here is the function for reading the lines of a file, parsing them to JSON, and yielding the result:

def iterate_file(model_name, shortcircuit=True, status_frequency=500):
    """Yield each line of a model's JSON data file as a parsed object."""
    i = 0
    jsonfilename = "json/yelp_training_set_%s.json" % model_name.lower()
    with open(jsonfilename) as jfile:
        for line in jfile:
            i += 1
            yield json.loads(line)
            if i % status_frequency == 0:
                print("Status >>> %s: %d" % (jsonfilename, i))
            if shortcircuit and i == 10:
                # Plain return ends the generator; raising StopIteration
                # inside a generator is an error as of PEP 479.
                return

The function takes the name of a model, opens the corresponding JSON data file for that model, iterates over each line in that file, and yields the parsed JSON object up to the caller. This allows the function to be used as an iterator, as in this excerpt from the save_businesses() function:

def save_businesses():
    for bdata in iterate_file("business", shortcircuit=False):
        business = Business()
        business.business_id = bdata['business_id']
        business.name = bdata['name']
        business.full_address = bdata['full_address']
        business.city = bdata['city']
        business.state = bdata['state']
        business.latitude = bdata['latitude']
        business.longitude = bdata['longitude']
        business.stars = decimal.Decimal(bdata.get('stars', 0))
        business.review_count = int(bdata['review_count'])
        business.is_open = bool(bdata['open'])  # 'open' is a JSON boolean, not a string
        business.save()

Straightforward, no? Each Model has a corresponding function that handles creating a Model instance, assigning the appropriate data, and saving it to the database. Then at the end we call each function when the script is run:

if __name__ == "__main__":
    reset_database()
    save_businesses()
    # ...followed by the save function for each remaining model

The reset_database() function is important because it creates the tables for your Models. It looks like this:

def reset_database():
    tables = (Business, Review, User, Checkin, Neighborhood, Category)
    for table in tables:
        # Nuke the table
        try:
            table.drop_table()
        except OperationalError:
            pass
        # Create the table
        try:
            table.create_table()
        except OperationalError:
            pass


And that is the gist of ETL (I guess?) for this system. The funny part is that I think this data originally came from a SQL database and was converted to JSON for distribution in the contest. Now that we have undone all their work, we can start work of our own in Part 2!
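To illustrate the kind of combining and aggregating that a relational load makes easy, here is a minimal sketch using the standard library's sqlite3 as a stand-in for MySQL, with a few made-up rows:

```python
import sqlite3

# In-memory stand-in for the MySQL database; table and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (city TEXT, stars REAL)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?)",
    [("Phoenix", 4.0), ("Phoenix", 3.0), ("Tempe", 5.0)],
)

# Average star rating per city -- one example of a feature we could
# feed into a learning algorithm.
for city, avg_stars in conn.execute(
    "SELECT city, AVG(stars) FROM business GROUP BY city ORDER BY city"
):
    print(city, avg_stars)
```

The same GROUP BY / JOIN style of query against the real MySQL tables is how we turned the loaded data into inputs for our models.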
