PostgreSQL Character Data Types

PostgreSQL is a powerful, open-source, relational Database Management System (did that have enough adjectives for you? :). Until spring 2013, I had only been on projects using MySQL (they were all Java and PHP projects FYI). That spring, I started with a two projects that used PostgreSQL (one ruby and one python FYI). Ever since the switch, I can emphatically say that I love PostgreSQL! It does so many things right.

One interesting thing I just found out today at work is how PostgreSQL handles Character data types: char, varchar, and text. In PostgreSQL, they are basically the same data type! To be more clear, they actually use the exact same underlying C struct named varlena. Here she is:

 /* ----------------
 * Variable-length datatypes all share the 'struct varlena' header.
 * [omited notes]
struct varlena
 char vl_len_[4]; /* Do not touch this field directly! */
 char vl_dat[1];

#define VARHDRSZ ((int32) sizeof(int32))

 * These widely-used datatypes are just a varlena header and the data bytes.
 * There is no terminating null or anything like that --- the data length is
 * always VARSIZE(ptr) - VARHDRSZ.
typedef struct varlena bytea;
typedef struct varlena text;
typedef struct varlena BpChar; /* blank-padded char, ie SQL char(n) */
typedef struct varlena VarChar; /* var-length char, ie SQL varchar(n) */

The main differences between the three is the validation: char gets padded to fill its limit, varchar is not padded but has a limit, text has no limit and no padding. They basically have the same performance as well. Interestingly, the text data type often ends up being the most efficient because it does not need to check sizes when putting or pulling data and if you switch all your char fields to text fields you will probably use storage.

Don’t believe me? Check the documentation!

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

So basically, we switched every field in our database that was holding a string to text fields. If you want to see some performance comparisons between the three types, check out this article. Interesting stuff!

PostgreSQL Character Data Types

File integrity with MD5 Hashes on Linux

I have talked a few times about the usefullness of hashing programs. They are central to password security and encryption; which most internet users use on a daily basis (most of the times transparently). Another helpful use of hashing can be for checking file integrity. By comparing two hashes of a file, you can tell if that file was modified (even 1 bit) between the time of those two hashes.

If you are on linux, you probably have the program/command md5sum builtin. By piping some text to md5sum, you can calculate a md5 hash of it:

$ echo -n "Hello" | md5sum
> 8b1a9953c4611296a827abf8c47804d7

But md5sum is especially useful for checking whole files like so:

$ md5sum file1.txt file2.doc file3.iso file4.mp3
> e42e2699cda3405fe484f951836d5f5a  file1.txt
> 0692b01aeca5db59c762ad7d16bc0d35  file2.doc
> 01c3ec3dd8e601c81d7fca5151f415ae  file3.iso
> 7bf30953798adad01dc413da8ce5252c  file4.mp3

Running the command in this way simply prints the output to the terminal. If you save the output to a file, you can use md5sum to compare the file hashes later on using the ––check/–c flag like so:

$ md5sum file1.txt file2.doc file3.iso file4.mp3 > integrity_hashes.md5
$ md5sum -c integrity_hashes.md5
> file1.txt: OK
> file2.doc: OK
> file3.iso: OK
> file4.mp3: OK

If you modify one of files and run this integrity check, md5sum will output a FAILED message beside the filename instead of an OK.

Things like this are why I love Linux. I was just thinking one day, “I wonder what the easiest way to calculate hashes of my files is”, and a quick google search showed me that I have the handy md5sum tool already builtin and it does exactly what I want.

Linux FTW!!!

File integrity with MD5 Hashes on Linux

Git Commands You Should Know

Git is awesome! I used Subversion a lot when doing Java projects in school, and we used Mercurial at Qualtrics but Git is my favorite by far. I like Git and Mercurial over Subversion because of the distributed design: every repository contains the entire codebase and version data. Mercurial does have a little (IMHO very little) better command line interface than Git, but I feel Git has far more power especially when editing commit history (interactive rebase FTW). For a more in-depth discussion of Git vs. Mercurial (and why Git is better), I recommend this Atlassian blog post.

This post however is not about why Git is better than other DVCS. I would like to point out a few helpful commands that I use often (like all the time and you should to) or commands I wish I knew about when I first started using Git. I will skip some common commands like add, commit, push, merge, fetch, and pull and try to focus on non-essential but useful commands; the ones you could work without but really shouldn’t.

Continue reading “Git Commands You Should Know”

Git Commands You Should Know