Columnar Store is King

Most Modern Data Warehouses are Columnar Stores

In partnership with

Learn AI in 5 minutes a day

This is the easiest way for a busy person wanting to learn AI in as little time as possible:

  1. Sign up for The Rundown AI newsletter

  2. They send you 5-minute email updates on the latest AI news and how to use it

  3. You learn how to become 2x more productive by leveraging AI

Welcome Everyone. 👏

  • Databases: Two kings 👑

  • Use Case: Data warehouse 📊

  • Learning: Reductionism 🤯

  • Architecture: Columnar stores 🏛️

  • Bottleneck: Disks 💾

  • Leaders: Columnar data warehouses 🔝

There are many types of databases and data stores. However, in the applied space (real-world), two are used most often. They are the relational databases and the data warehouse.

  • A relational database is like a well-organized system for storing and managing tabular information. The key idea is that it stores data in tables, which are similar to spreadsheets that are designed in a way to create fast access. Relational databases are designed for mission critical applications.

  • A data warehouse is a type of data management intended to perform queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications. Many data warehouses are loaded from relational databases.

Relational databases and data warehouses have two completely different use cases.

Relational databases are designed for day-to-day operations and handling real-time transactions. They focus on efficiency in processing individual transactions (like placing an order, updating a customer record, or processing a payment) 

Data warehouses are designed for analytical reporting and business intelligence. They focus on providing insights and supporting decision-making by analyzing large volumes of historical data.

Let’s use hyper-focus to learn them. What are core keywords for each. I use the term hyper-focus to reduce each item to its simplest form in terms of vernacular.

  • Relational Database - Tables, data, quick retrieval, real-time

  • Data Warehouse - Tables, data, reporting, historical

If you have the core terminology around an idea, your mind can often help you create the rest. In the example above you only need to remember 4 words. Tables and data are the same. For relational databases remember quick retrieval and real-time and for data warehouses remember reporting and historical.

Now we know what they are, let’s discuss storage mechanics. Yeah, nothing very deep, this isn’t a paper on colossus, just a post on the basic mechanism behind data warehouses. 😰

Wait. What’s colossus? It’s Google’s storage system. Here, have a read. 😁 

Let’s make a high-level statement about columnar store databases then unpack it.

A columnar database (column-oriented) is a database management system (DBMS) that stores data on disk in columns instead of rows. The purpose of a columnar database is to efficiently read data from hard disk storage to speed up the time it takes to return a query. Columnar databases store data in a way that greatly improves disk I/O performance.

Wait. Why do you need fast access in a data warehouse? You don’t, but you need acceptable retrieval speeds. In the real-world, over two minutes on any query in a data warehouse is unacceptable regardless of the amount of data.

😮 True story, no shit. Ten years or so ago I was consulting for a medium sized company out of Ohio. I was hired to increase data retrieval speeds for this companies internal sales team.

During the initial phone call the manager for the sales teams said, “Our sales teams needs to be a little more efficient. When they build their screens for their morning calls, and then push the send button and they take a smoke break or get coffee.” 🤔

I then asked, what’s the average response for each of these requests. He said, “around 30-45 minutes.” They had 15 people waiting on results from this sales application for a minimum of 30 minutes every time they ran a search.

I created about thirty indexes that reduced the retrieval to 2-3 minutes/button push. 🙂 The manager was happy but the sales people hated me.

Ok, back to our regularly scheduled program. 🎥

Column-oriented databases and row-oriented databases are both methods for storing data in data warehouses. They both use tables with rows containing entries and columns for the data about the entry. However, they have different approaches to how the data is stored on disk: Row-oriented databases store the data for each row together, while columnar databases store the data for each column together. Here’s a pictorial representation of columnar store versus row oriented store.

On Disk Storage

The primary limiting factor in large modern databases is not the speed of the processor, it is how fast data can be read off the disk. Many databases are terabytes of data, with some large ones reaching into the petabyte range. Databases this large are impossible to keep in any computer's random access memory (RAM), so they need to be read off the disk. Even extremely fast NVMe solid-state drives cannot keep up with the speed of processors, making the disk access the limiting factor in databases.

The primary limiting factor in large modern databases is not the speed of the processor, it is how fast data can be read off the disk.

The main benefit of a columnar database is faster read and query performance for analytics and big data compared to a row-oriented one. That's because most queries use column data more than row data, requiring less data to be read off a disk. Also, because the initial data retrieval is done on a column-by-column basis, only the columns that need to be used are retrieved. This makes it possible for a columnar database to scale efficiently and handle large amounts of data.

In extremely large, clustered servers and hyperconverged databases, the issue of how data is stored is compounded. A database at these scales needs to be split (sharding) between many servers, with each server only having portions of the total data. These servers are also connected through relatively slow network links. Columnar databases are well suited for scaling across several servers. Storing the same type of data together can make optimizing and finding the data easier.

Currently, the majority of modern data warehouses are columnar store. Here are three big ones.

  • SnowFlake

  • BigQuery

  • Fabric

  • Databricks

All of these utilize columnar storage under the hood. The key reason for the success of columnar store databases is greatly improved query performance for analytical queries. When performing aggregations, filtering, or reporting on specific data points across a large dataset, a columnar database only needs to read the relevant columns. 

In contrast, a row-oriented database would have to read entire rows, including data in columns not required for the query, leading to significant I/O overhead. This selective reading drastically reduces the amount of data accessed from disk, resulting in much faster query execution times.

The success of columnar store databases is a testament to their specialized design, which is perfectly tailored for the demands of modern data analytics and the need to quickly derive value from massive datasets.

Thanks everyone. 👏

While you might think I made a typo in my poll, I did not. Just not a fan of AWS. 🤣

What data warehouse do you want to work with?

Login or Subscribe to participate in polls.