Header Files, Compilers, and Static Type Checks

Have you ever thought to yourself, “why does C++ have header files”?  I had never thought about it much until recently and decided to do some research into why some languages (C, C++, Objective C etc.) use header files but other languages do not (e.g. C# and Java).

Header files, in case you do not have much experience with them, are where you put declarations and definitions.  You declare constants, function signatures, type definitions (like structs) etc.  In C, all these declarations go into a .h file and then you put the implementation of your functions in .c files.

Here’s an example of a header file called mainproj.h:

#ifndef MAINPROJ_H__
#define MAINPROJ_H__

extern const char *one_hit_wonder;

void MyFN( int left, int back, int right );

#endif /* MAINPROJ_H__ */

Here is a corresponding source file mainproj.c:

#include <stdio.h>
#include "mainproj.h"

const char *one_hit_wonder = "Yazz";

void MyFN( int left, int back, int right )
{
    printf( "The only way is up, baby\n" );
}

Notice that the header only has the function declaration for MyFN, and it does not specify what one_hit_wonder is set to.  But why do we do this in C but not in Java?  Both are compiled and statically typed.  Ask Google!

A great MSDN blog post by Eric Lippert called “How Many Passes” was very helpful.  The main idea I got out of the article is that header files are necessary because of Static Typing.  To enforce type checks, the compiler needs to know things like function signatures to guarantee functions never get called with the wrong argument types.

Eric lists two reasons for header files:

  1. Compilers can be designed to do a single pass over the source code instead of multiple passes.
  2. Programmers can compile a single source file instead of all the files.

Single Pass Compilation

In a language like C#, which is statically typed but has no header files, the compiler needs to run over all the source code once to collect declarations and function signatures and then a second time to actually compile the function bodies (where all the real work of a program happens) using the declarations it knows about to do type checks.

It makes sense to me that C and C++ would have header files because they are quite old languages, and doing multiple passes in this way would have been very expensive in CPU and memory on computers of that era.  Nowadays, computers have more resources and the extra passes are less of a problem.

Single File Compilation

Another interesting benefit of header files is that a programmer can compile a single file.  Java and C# cannot do that: compilation occurs at the project level, not the file level.  So if a single file is changed, all files must be re-compiled.  That makes sense, because the compiler needs to look at every file in order to gather the declarations.  In languages with header files, you can recompile only the file that changed, because the header files guarantee type checks between files.
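
For example, with the little C project above, touching mainproj.c only requires recompiling that one translation unit and relinking.  Here is a minimal sketch, assuming a hypothetical main.c that includes mainproj.h and was previously compiled into main.o:

# Recompile only the file that changed; mainproj.h gives the compiler the
# declarations it needs to type-check calls into and out of this file.
gcc -c mainproj.c -o mainproj.o

# Relink with the already-compiled objects (main.o is assumed from an earlier build).
gcc main.o mainproj.o -o mainproj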

Relevance Today

Interesting as this may be, is it relevant today if you only do Java, C#, or a dynamic language?  Actually, it is!

For instance, consider TypeScript and Flow, which both bring gradual typing to JavaScript.  Both systems have a concept of Declaration files.  What goes in them?  You guessed it!  Type declarations, function signatures, etc.

TypeScript Declaration file:

module Zoo {
  function fooFn(bar: string): void;
}

Flow Declaration file:

declare module Zoo {
  declare function fooFn(bar: string): void;
}

To me, these look an awful lot like header files!

As we see, header files are not dead!  They are alive and well in many strategies for Type Checking.

Why you should be using Fig and Docker

This is an introductory article to convince you to try setting up your web app development environment with Fig and Docker, and to prepare you to do it.

The Snowflake Problem

Let me take a moment to lay some foundation by rambling about dev environments.  They take weeks to build, seconds to destroy, and a lifetime to organize.  As you configure your machine to handle all the projects you deal with, it becomes a unique snowflake and increasingly difficult to duplicate (short of full image backups).  The worst part is that as you take on more projects, you configure your laptop more, and it becomes more costly to replace.

I develop on Linux and Mac and primarily do web development.  Websites have the worst effect on your dev environment because they often (read: always) need to connect to a number of other services like databases, background queues, caching services, web servers, etc.  At any given moment, I probably have half a dozen of those services running on my local machine to test things.  It is worse when I am working on Linux, because it is so easy to locally install all the services an app runs in production.  I routinely have MongoDB, PostgreSQL, MySQL (MariaDB), Nginx, and Redis running on my machine.  And let’s not even talk about all the python virtualenvs or vendorized Rails projects I have lying around my file system.

Docker Steps In

Docker is such an intriguing tool.  If you have not heard, Docker builds on Linux container features (cgroups and namespace isolation) to create lightweight images capable of running processes completely isolated from the host system.  It is similar to running a VM, but much smaller and faster.  Instead of emulating hardware virtually, you access the host system’s hardware.  Instead of running an entire OS virtually, you run a single process.  The concept has many potential use cases.

But with Docker, you can start and stop processes easily without cluttering your machine with any of that drama.  You can have one Docker image that runs Postgres and another that runs Nginx without either being really installed on your host.  You can even have multiple language runtimes of different versions and with different dependencies: for example, several python apps running different versions of Django on different (or the same) versions of CPython.  Another interesting side effect: if you have multiple apps using the same kind of database, their data will not live in the same running instance of that database.  The databases, like the processes, are isolated.
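
To make that concrete, here is roughly what running those throwaway services looks like with the standard Docker CLI (the container names are just placeholders I made up):

# Run Postgres in the background from the official image
docker run -d --name scratch-postgres postgres

# Run Nginx in another container, mapping host port 8080 to the container's port 80
docker run -d --name scratch-nginx -p 8080:80 nginx

# Stop and remove them when you are done; nothing is installed on the host
docker stop scratch-postgres scratch-nginx
docker rm scratch-postgres scratch-nginx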

Docker images are created with Dockerfiles.  They are simple text files that start from some base image and build up the environment necessary to run the process you want.  The following is a simple Dockerfile that I use on a small Django site:

FROM python:3.4
MAINTAINER tgroshon@gmail.com

ENV PYTHONUNBUFFERED 1
RUN mkdir /code
WORKDIR /code
ADD requirements.txt /code/
RUN pip install -r requirements.txt
ADD . /code/

Simple, right?  For popular platforms like Python, Ruby, and Node.js, prebuilt Docker images already exist.  The first line of my Dockerfile specifies that it builds on the python version 3.4 image.  Everything after that is configuring the environment.  You could even start with a basic Ubuntu image and apt-get all the things:

FROM ubuntu:14.04

# Install.
RUN \
  apt-get update && \
  apt-get -y upgrade && \
  apt-get install -y build-essential && \
  apt-get install -y software-properties-common && \
  apt-get install -y byobu curl git htop man unzip vim wget

From there you can build virtually any system you want.  Just remember, a container runs only a single process.  If you want to run more than one process, you will need to install and run some kind of manager like upstart, supervisord, or systemd.  Personally, I do not think that is a good idea.  It is better to have a container do a single job and then compose multiple containers together.
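
To give a feel for the workflow, building and running the Django image from my Dockerfile above might look something like this (the image tag mydjangoapp is just an example name, and I am assuming a standard manage.py at the project root):

# Build an image from the Dockerfile in the current directory
docker build -t mydjangoapp .

# Run the dev server in a container, publishing port 8000 to the host
docker run -p 8000:8000 mydjangoapp python manage.py runserver 0.0.0.0:8000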

Enter Fig

Problem is, Docker requires quite a bit of know-how to get configured in this kind of useful way.  So, let’s talk about Fig.  It was created specifically to use Docker for the dev environment use case.  The idea is to specify which Docker images your app uses and how they connect.  Then, once you build the images, you can start and stop them together at your leisure with simple commands.

You configure Fig with a simple yaml file that looks like this for a python application:

web:
  build: .
  command: python app.py
  links:
   - db
  ports:
   - "8000:8000"
db:
  image: postgres

This simple configuration specifies two Docker containers: a Postgres container called db and a custom container built from a Dockerfile in the directory specified by the web.build key (the current directory in this case).  Normally, a Dockerfile will end with the command (CMD) that it should run.  The web.command key is another way to specify that command.  web.links is how you indicate that a process needs to be able to discover another one (the database in this example).  And web.ports simply maps a host port to the container port so you can visit the running container in your browser.

Once you have the Dockerfile and fig.yml in your project directory, simply run fig up to start all of your containers and ctrl-c to stop them.  When they are not running, you can also remove them from Fig by running fig rm, although it seems to me that the Docker images still exist afterwards, so you might also want to remove those for a completely clean slate.
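
So the day-to-day loop is roughly this (the image ID will vary, of course):

fig up                 # build the images if needed and start all containers in the foreground
# ... ctrl-c stops them ...
fig rm                 # remove the stopped containers
docker images          # see which images are still on disk
docker rmi <image-id>  # optionally delete an image for a completely clean slate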

Conclusion

Since learning about Docker and Fig, setting them up is one of the first things I do on new web projects.  The initial configuration can take some time, but once you have it in place it pays for itself almost immediately, especially when you add other developers to a project.  All they need installed are Docker and Fig, and they are up and running with a simple fig up command.  Configure once, run everywhere.  Harness all that effort spent configuring your personal machine and channel it into something that benefits the whole team!

Why I did not like AngularJS in 2014

Edited March 2015:  Previously titled “Why I Do Not Recommend Learning AngularJS”.  In retrospect, my arguments are superficial and likely apply only to the specific situation I was in.  In addition, I was wrong that learning a new tech is wasteful.  Learning anything makes you better at learning, and that is what we should all be trying to do.  Learn what you’re excited about!

tl;dr

Despite its good qualities, I did not enjoy learning AngularJS.  With all the available options for web frameworks (e.g. Ember, React, Backbone, etc.), Angular fell behind in the following three areas:

  1. Performance
  2. Complexity
  3. Non-transference of Skills

Introduction

A lot of people ask me what I think about AngularJS, so I wanted to take some time to collect my thoughts and try to explain it clearly and rationally.  The following is the result.

I would like to start by saying AngularJS has a lot of good qualities, or else not so many people would use it so happily.  It makes developers excited to do web development again and that is hugely important.

With that being said, I did not like learning AngularJS.  With all the available options for web frameworks (e.g. React, Ember, Backbone, etc.), Angular falls behind in the following three areas:

  1. Performance
  2. Complexity
  3. Non-transference of Skills

Performance

I normally do not like picking on performance flaws, especially when a conscious decision has been made to trade performance for productivity.  I can understand that trade-off.  I do Ruby on Rails 😉

However, Angular’s performance has such serious problems that it becomes almost unusable for certain features or whole applications.  The threshold of how much work you can make Angular do on a page before performance tanks is scary low!  Once you have a couple thousand watchers/bindings/directives doing work on a page, you notice the performance problems.  And it is not actually that hard to get that many bindings on a page.  Just have a long list or table with several components per row, each with a healthy number of directives and scope bindings, and then add more elements as you scroll.  Sound like a familiar feature?

Again, I would like to say that performance is not that terrible a problem to have, because new versions of a framework can (and almost always will) optimize around common performance problems.  I do not think performance will be a long-term problem in Angular; but it is a problem right now.

Complexity

Of the most popular front-end frameworks (Ember, React, Backbone, and Angular), Angular is the most complex.  It has the most new terms and concepts to learn of any JavaScript framework: scopes, directives, providers, dependency injection, and so on.  Each of these concepts is vital to using Angular effectively for any use case beyond the trivial.

Ember is also quite complex, but the framework itself gives direction for project organization, which mitigates some of that complexity.  Ember is also better at mapping its concepts to commonly used paradigms, which I will talk about in the next section.

With React, you can be productive after learning a few function calls (e.g. createClass() and renderComponent()), creating components that implement a render() method, and setting component state to trigger re-renders.  Once you wrap your head around what React is doing, it is all very simple.  My experience was that after weeks with Ember and Angular, I still did not grok all the complexity or feel like a useful contributor to the project.  After a day with React, I was writing production-quality UI with ease.

Non-transference of Skills

I have been a web developer for years now.  Not a lot of years, but a few.  My first dev job was in college building UI with jQuery, which I learned very well.  Then I remember my first job interview outside of school, with a company that built web applications in vanilla JavaScript with no jQuery.  I got destroyed in the JavaScript portion of the interview because my jQuery knowledge mapped very poorly to vanilla JavaScript.  In fact, I would go so far as to say that I knew next to nothing about JavaScript even after a year of extensive web development with jQuery.

Why didn’t my jQuery skills transfer?  Because my development with jQuery taught me a Domain Specific Language (DSL).  While DSLs can improve productivity, knowledge of them seldom transfers to other areas.  The reverse can also be true.  You could call this inbound and outbound transference.

Angular is like jQuery in this respect, and the most serious problem in my mind is that it suffers from both inbound and outbound transference problems.  Knowing JavaScript, MVC, or other frameworks helped me little while learning Angular, and what I learned from doing Angular has not helped me learn other things.  But maybe that’s just me.

Conclusion

If you know Angular and are productive with it, great!  Use it.  Enjoy it.  Be productive with it.  I tried Angular, and it just didn’t do it for me.

If you are looking for a framework that is both scalable and flexible, look into React.  In my experience, it is the easiest to learn and plays the nicest with legacy code.  Integrating React into almost any project is quite easy.  Of all the frameworks, React is probably also the easiest to get out of, because all your application logic is in pure JavaScript instead of a DSL.  The strongest benefit I have seen when using React is the ability to reason about your app’s state and data flow.  If you want a high-performance application and transferable skills, I highly recommend React.

If you want the experience of a framework that does a lot for you, go for Ember.  It will arguably do more for you than even Angular.  From what I have seen, the Ember team is also more devoted to supporting large-scale applications and corporate clients that require stability and longevity.  Those are the clients who do not want to be rewriting their apps every other year.  The one drawback I have seen is that Ember prefers to control every part of your app and does not play nicely with other technologies.  If you have substantial legacy code, Ember will be a problem.

AngularJS will be releasing 2.0 soon, and it will be completely different from Angular 1.x.  Controllers, Scopes, and Modules are all going away.  To me, that seems like a realization by the Angular core team that some of those neologisms did not work out.

Deploying to Production with Git

On your last day at any job, it is fun to go and change a bunch of things, then leave it all with your colleagues and say “Peace out!”  One thing you could do is rewrite or rework the project’s build and deploy process.  Here is a way you could do it with Git for a Django project running behind Nginx and uWSGI (and yes, I did this all on my last day :).

The process is to (1) automate our build with a Makefile, (2) set up a Git repo on the live server to push to, and (3) use a Git hook to automatically call our Makefile targets.

Build with a Makefile

First things first, let’s write a Makefile, because they are very helpful.  You can replace the Makefile with some other script or set of scripts, but I find Makefiles to be a very good idea on Unix-based systems.  You just need something to automate your build process.

Here is an excerpt that is pretty similar to the Makefile I used:

# Makefile for building and deploying
#

UWSGI=/etc/init.d/uwsgi
NGINX=/etc/init.d/nginx

deploy: dependencies clean minified_static_files

restart: $(UWSGI) $(NGINX)
	$(UWSGI) restart
	$(NGINX) restart

stop: $(UWSGI) $(NGINX)
	$(NGINX) stop
	$(UWSGI) stop

dependencies: dependencies.pip
	pip install -r dependencies.pip # Or requirements.txt

resources:
	python resources_build.py # Minifies static files.

minified_static_files: resources
	python manage.py collectstatic # Collect into static_files/

clean:
	@-rm -rf static_files/
	@-find . -name '__pycache__' -exec /bin/rm -rf {} \;
	@echo 'Successfully Cleaned!'

.PHONY: clean resources dependencies restart stop deploy

You can put any useful commands that you run often in your project during development or when deploying to production. Put this in the root of your project directory or anywhere else in your Git project so you will not lose it.

Setup Production Git Repo

Let’s use a bare Git repository.  Log in to your production server, create a new directory, and initialize it as a bare git repository.

mkdir prod-repo
cd prod-repo
git init --bare

You will need to add this repository to your Git remotes on your local machine. The command looks something like this:

git remote add production ssh://username@www.yourserver.com:PORT/path/to/prod-repo

FYI: The PORT is whatever port you run your ssh server on.

Post-Receive Git Hook

Now let’s set up a post-receive git hook on the production bare git repository that will call your Makefile (or other automatic script) once a push has been received.

vim prod-repo/hooks/post-receive

A git hook file is any executable script, so you can write it in bash, sh, python, ruby, etc.  Let’s keep it simple and use sh.

#!/bin/sh
#
# SOURCE_DEST is whatever directory you have configured
# uwsgi to look for your app in.  This is where Git will put
# the new source files that you push to this repo.

# Variables
SOURCE_DEST=/path/to/source
GIT_DIR=/path/to/prod-repo

# Update the HEAD to latest commit
git --work-tree=$SOURCE_DEST --git-dir=$GIT_DIR checkout -f

cd $SOURCE_DEST

# Run make targets
make deploy
make restart

# Fix permissions for Code
chown -R www-user $SOURCE_DEST
chgrp -R www-user $SOURCE_DEST

Putting it all Together

To update production, just issue a git push command to your production remote:

git push production master

This will push your changes to the server’s repository and run your post-receive hook script, which calls your Makefile targets.  Customize this to fit your needs.  You can easily add targets to run database migrations, compile CoffeeScript, pre-process Sass or Less into CSS, run unit tests, etc.  The sky is the limit.  It would also be a good idea to create a git tag each time before you push to production; consider using a client-side git hook to accomplish that 🙂
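
For example, a tagged deploy from your local machine might look like this (the version number is whatever scheme you use):

# Tag the release so you always know what is running in production
git tag -a v1.0.3 -m "Deploying to production"

# Push the branch to trigger the post-receive hook, then push the tag
git push production master
git push production v1.0.3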

Prepping Yelp Data for Mining

In April 2014, my Data Mining project team at BYU began work on our semester project which we chose from kaggle.com. The project was based on data from an old Yelp Business Ratings Prediction Contest which finished in August 2013. Over several posts, we will take a look at some of the interesting things our team did in this project.

Problem Description

As explained on the Kaggle web page for the Yelp Contest, the goal of the project was to predict the number of stars a user would rate a new business using user data, business data, checkin data, and past reviews data. The prediction model/algorithm would then become the heart of a recommender system.

Data Description

The dataset is four files holding JSON objects.  In this case, each line of a file is a distinct JSON object, which means we will parse the files line by line as we go.  The following is a description of the file yelp_training_set_business.json:

{
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude,
  'stars': (star rating, rounded to half-stars),
  'review_count': review count,
  'categories': [(localized category names)],
  'open': True / False (corresponds to permanently closed, not business hours)
}

Extracting, Transforming, and Loading the Data

I heard a lot on the web about this concept called ETL (Extract, Transform, Load), but in my data mining and machine learning training I have heard it mentioned a grand total of ONE time.  Well, I guess this is our ETL section, where we “Extract” the data from the files, “Transform” the data into meaningful representations, and “Load” it into our database.

So here’s how we implemented it. I wrote a python script to parse through the JSON files and load them into a MySQL database. By using a relational database, we could combine, aggregate, and output the data in countless ways that we could input into our learning algorithm.

For your convenience, I put my code and a subset of the data on github so you can explore them yourself: https://github.com/tgroshon/yelp-data-mining-loader. I used Peewee, a lightweight ORM compatible with MySQL (amongst others), to create the database and load the data. Here is an excerpt of the Business Model from models.py:

class Business(peewee.Model):
    """Business Object Table."""

    bid = peewee.PrimaryKeyField()
    business_id = peewee.CharField()  # encrypted ID with letters, numbers, and symbols
    name = peewee.CharField()
    full_address = peewee.CharField()
    city = peewee.CharField()
    state = peewee.CharField(max_length=2)  # AZ
    latitude = peewee.CharField()
    longitude = peewee.CharField()
    stars = peewee.DecimalField()  # star rating rounded to half-stars
    review_count = peewee.BigIntegerField()
    is_open = peewee.BooleanField()

    class Meta:
        database = db

Not too strange.  This was the most complex Model, and it covers most of the bases for creating a Model class (which maps to a database table) with Peewee.  Now let’s take a look at the load script itself: json_to_mysql.py.  First, we need to read each JSON file, parse the input into JSON objects, and map the JSON objects to Models for saving.  Here is the function for reading the lines of a file, parsing them to JSON, and yielding the JSON:

def iterate_file(model_name, shortcircuit=True, status_frequency=500):
    i = 0
    jsonfilename = "json/yelp_training_set_%s.json" % model_name.lower()
    with open(jsonfilename) as jfile:
        for line in jfile:
            i += 1
            yield json.loads(line)
            if i % status_frequency == 0:
                print("Status >>> %s: %d" % (jsonfilename, i))
            if shortcircuit and i == 10:
                return  # end the generator early (raising StopIteration here is an error in Python 3.7+)

The function takes the name of a model, opens the corresponding JSON data file for that model, iterates over each line in that file, and yields the parsed JSON object up to the caller.  This allows it to be used as an iterator, as in this excerpt from the save_businesses() function:

def save_businesses():
    for bdata in iterate_file("business", shortcircuit=False):
        business = Business()
        business.business_id = bdata['business_id']
        business.name = bdata['name']
        business.full_address = bdata['full_address']
        business.city = bdata['city']
        business.state = bdata['state']
        business.latitude = bdata['latitude']
        business.longitude = bdata['longitude']
        business.stars = decimal.Decimal(bdata.get('stars', 0))
        business.review_count = int(bdata['review_count'])
        business.is_open = True if bdata['open'] == "True" else False
        business.save()

Straightforward, no?  Each Model has a corresponding function that handles creating an instance, assigning the appropriate data, and saving it to the database.  Then at the end we call each of those functions when the script is run:

if __name__ == "__main__":
    reset_database()

    save_businesses()
    save_users()
    save_checkins()
    save_review()

The reset_database() function is important because it creates the tables for your Models.  It looks like this:

def reset_database():
    tables = (Business, Review, User, Checkin, Neighborhood, Category,)
    for table in tables:
        # Nuke the Tables
        try:
            table.drop_table()
        except OperationalError:
            pass
        # Create the Tables
        try:
            table.create_table()
        except OperationalError:
            pass

Conclusion

And that is the gist of ETL (I guess?) for this system.  The funny part is that I think this data came from a SQL database and was converted to JSON for distribution in the contest.  Now that we have undone all their work, we can start work of our own in Part 2!

Python Generator Expressions

In Python, you can write a list comprehension (a shorthand list builder), which is really efficient:

[ val for val in biglist ]

You can do the same thing with dictionaries:

{ x.val: x.val2 for x in biglist }

The Feature that Changed My Life

You can take that inner expression, “val for val in biglist”, and package it up into a reusable generator expression!

mygenerator = ( val for val in someiterable )

Then, you can pass that generator around to other functions.  Each iteration runs the generator just far enough to produce one value, and then it waits until it is told to iterate again.  The crazy part is, I knew that this is how the yield statement works in Python to create generator functions (functions using yield return a generator).  I had just never thought of this flip side of generator functions, packaged up all nice and neatly.

PostgreSQL Character Data Types

PostgreSQL is a powerful, open-source, relational database management system (did that have enough adjectives for you? :).  Until spring 2013, I had only been on projects using MySQL (they were all Java and PHP projects, FYI).  That spring, I started on two projects that used PostgreSQL (one Ruby and one Python, FYI).  Ever since the switch, I can emphatically say that I love PostgreSQL!  It does so many things right.

One interesting thing I just found out today at work is how PostgreSQL handles character data types: char, varchar, and text.  In PostgreSQL, they are basically the same data type!  To be more precise, they all use the exact same underlying C struct, named varlena.  Here she is:

/* ----------------
 * Variable-length datatypes all share the 'struct varlena' header.
 * [omitted notes]
 */
struct varlena
{
 char vl_len_[4]; /* Do not touch this field directly! */
 char vl_dat[1];
};

#define VARHDRSZ ((int32) sizeof(int32))

/*
 * These widely-used datatypes are just a varlena header and the data bytes.
 * There is no terminating null or anything like that --- the data length is
 * always VARSIZE(ptr) - VARHDRSZ.
 */
typedef struct varlena bytea;
typedef struct varlena text;
typedef struct varlena BpChar; /* blank-padded char, ie SQL char(n) */
typedef struct varlena VarChar; /* var-length char, ie SQL varchar(n) */

The main difference between the three is validation: char gets padded to fill its declared length, varchar is not padded but has a length limit, and text has no limit and no padding.  They basically have the same performance as well.  Interestingly, the text data type often ends up being the most efficient, because it does not need to check lengths when storing data, and if you switch all your char fields to text fields you will probably use less storage.

Don’t believe me? Check the documentation!

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

So basically, we switched every field in our database that was holding a string to text fields. If you want to see some performance comparisons between the three types, check out this article. Interesting stuff!
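
If you want to make the same switch on an existing database, each column is a one-line ALTER.  Something like this (the database, table, and column names here are made up):

# Convert an existing char/varchar column to text in place
psql -d yourdb -c "ALTER TABLE businesses ALTER COLUMN name TYPE text;"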
