Column
A column-oriented database stores data in tables organized by columns, whereas a traditional RDBMS organizes data in rows.
It can manage large datasets and access them quickly. It supports complex analytical computations, and it is most effective when the values within a column share the same type.
Use cases: CMS, blogs, counters, expiring usage, etc.
Examples: Cassandra, HBase, Bigtable, Parquet
Pros
- Only attributes that are needed are read from disk
- Adding a new column is easy
Cons
- Combining values from multiple columns is costly (tuple reconstruction)
- Inserting a new tuple is costly
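The trade-offs above can be illustrated with a small in-memory sketch. The table contents and the helper names (reconstruct_tuple, insert_tuple) are made up for illustration; real engines store columns in files or blocks on disk, not in Python lists.

```python
# Row layout: each record is stored together.
rows = [
    {"id": 1, "name": "ann", "age": 31},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "cid", "age": 40},
]

# Column layout: each attribute is stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["ann", "bob", "cid"],
    "age": [31, 25, 40],
}

# Pro: an aggregate over one attribute only touches that column.
average_age = sum(columns["age"]) / len(columns["age"])

# Pro: adding a column is one new entry; existing columns are untouched.
columns["active"] = [True, False, True]


def reconstruct_tuple(columns, i):
    """Con: rebuilding row i needs one lookup per column."""
    return {name: values[i] for name, values in columns.items()}


def insert_tuple(columns, record):
    """Con: inserting one tuple means appending to every column."""
    for name, values in columns.items():
        values.append(record[name])
```

In the row layout, the same average would have to walk every record even though only the age attribute is needed.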
Column compression
When there are repeated values, we can encode the column to reduce dataset size and speed up requests.
Sometimes we don’t have to decode the column to answer a query, e.g. summing the run lengths of a run-length-encoded column gives the number of elements.
run-length
Count the number of consecutive repetitions of each value.
Format: value, start row, run length
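A minimal sketch of this format, assuming rows are numbered from 0 and a column is a plain Python list. The function names are illustrative only.

```python
def rle_encode(column):
    """Encode a column as (value, start_row, run_length) triples."""
    runs = []
    for row, value in enumerate(column):
        if runs and runs[-1][0] == value:
            # Same value as the current run: extend it by one row.
            v, start, length = runs[-1]
            runs[-1] = (v, start, length + 1)
        else:
            # New value: open a run starting at this row.
            runs.append((value, row, 1))
    return runs


def rle_decode(runs):
    """Expand the triples back into the original column."""
    column = []
    for value, _start, length in runs:
        column.extend([value] * length)
    return column


def rle_count(runs):
    """Count rows without decoding by summing the run lengths."""
    return sum(length for _value, _start, length in runs)
```

For example, the column a, a, a, b, b, a would be stored as three triples, and rle_count answers a row count from those triples alone.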
bit-vector
For each distinct value in the column, create a bit vector (one bit per row). Good when there are few distinct values.
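A small sketch of the idea, using Python lists of 0/1 as stand-ins for real packed bit vectors. The function names are illustrative only.

```python
def bitvector_encode(column):
    """Build one bit list per distinct value, with one bit per row."""
    vectors = {}
    for row, value in enumerate(column):
        if value not in vectors:
            vectors[value] = [0] * len(column)
        vectors[value][row] = 1
    return vectors


def bitvector_decode(vectors, n_rows):
    """Rebuild the column: the value whose bit is set at a row owns that row."""
    column = [None] * n_rows
    for value, bits in vectors.items():
        for row, bit in enumerate(bits):
            if bit:
                column[row] = value
    return column
```

Counting the rows that hold a given value is then just the number of set bits in its vector, e.g. sum(vectors[value]). The storage cost grows with the number of distinct values, which is why this encoding suits low-cardinality columns.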
dictionary
Replace values with shorter placeholders. Maintain a dictionary to map placeholders back to the original values.
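A minimal sketch, assuming the placeholders are small integer codes assigned in order of first appearance. That choice, like the function names, is only one possible design.

```python
def dict_encode(column):
    """Replace each value by an integer code; return (codes, dictionary)."""
    dictionary = []       # code -> original value
    code_of = {}          # original value -> code
    encoded = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(code_of[value])
    return encoded, dictionary


def dict_decode(encoded, dictionary):
    """Map every code back to its original value."""
    return [dictionary[code] for code in encoded]
```

An equality filter can run on the codes directly: look up the code of the searched value once, then compare integers instead of strings.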
frame of reference
Choose the median value as a reference and store each value as an offset from it. Use a # marker for exceptions whose offset is too big.
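A sketch of this scheme for an integer column. Two details are assumptions rather than part of the notes: the max_offset threshold that decides what counts as "too big", and storing an exception inline as the # marker followed by the raw value. The reference is picked with median_low from Python's statistics module so that it is always an actual column value.

```python
from statistics import median_low

ESCAPE = "#"  # marks that the next item is a raw value, not an offset


def for_encode(column, max_offset=7):
    """Return (reference, encoded) where encoded holds small offsets or escapes."""
    reference = median_low(column)
    encoded = []
    for value in column:
        offset = value - reference
        if abs(offset) <= max_offset:
            encoded.append(offset)
        else:
            # Offset too big: store the marker, then the raw value.
            encoded.extend([ESCAPE, value])
    return reference, encoded


def for_decode(reference, encoded):
    """Rebuild the column from the reference and the offsets."""
    column = []
    items = iter(encoded)
    for item in items:
        if item == ESCAPE:
            column.append(next(items))
        else:
            column.append(reference + item)
    return column
```

With max_offset=7 every regular offset fits in a few bits, which is where the space saving would come from in a real bit-packed layout.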
differential
Like frame of reference, but we store the difference from the preceding row instead of from the reference.
A # marker can also be used for differences that are too big.
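A sketch of differential encoding for an integer column. As assumptions for illustration, the first value is stored raw behind the # marker (it has no preceding row), and max_delta is a made-up threshold for what counts as a big difference.

```python
ESCAPE = "#"  # marks that the next item is a raw value, not a difference


def delta_encode(column, max_delta=7):
    """Store each value as the difference from the preceding row."""
    encoded = []
    previous = None
    for value in column:
        if previous is not None and abs(value - previous) <= max_delta:
            encoded.append(value - previous)
        else:
            # First row, or difference too big: marker plus raw value.
            encoded.extend([ESCAPE, value])
        previous = value
    return encoded


def delta_decode(encoded):
    """Rebuild the column by accumulating differences."""
    column = []
    items = iter(encoded)
    for item in items:
        if item == ESCAPE:
            value = next(items)
        else:
            value = column[-1] + item
        column.append(value)
    return column
```

This works best on sorted or slowly changing columns such as timestamps, where consecutive rows differ by small amounts. Unlike frame of reference, reading row i requires decoding from the last raw value up to that row.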