SQL and NoSQL Database Features and Differences
I have heard about NoSQL databases, but don't know what it is. What's the difference between NoSQL and relational databases ("SQL") like SQL Server?
NoSQL (derived from "Not only SQL") is the name given to a type of database which can host non-relational, unstructured data.
This means data in a NoSQL database does not necessarily exist in fixed-length columns and rows like it does in a relational database and can be highly unstructured in nature. This type of database comes with built-in high-availability and fast performance features. Applications using NoSQL databases are less concerned about entity relationship, transactional consistency or data duplication.
NoSQL databases have made a deep proliferation over the last decade or so, fueled by the explosive growth of web and mobile applications running in the cloud. This new breed of internet-connected applications demand fast, fault-tolerant and scalable schema-less data storage which NoSQL can offer.
In this tip, we will talk about the features and main differences between SQL and NoSQL databases.
SQL Database Overview
Relational Database Management Systems (RDBMS) have existed for a few decades now. These databases are widely used in corporate applications, large information systems or data warehouses.
Relational databases are also known as "SQL databases" because of their adoption of ANSI-standard SQL for data manipulation and querying. An example of a SQL database is Microsoft SQL Server.
In SQL databases, business entities are represented by logical, tabular format data structures called "tables". For example, a database table can represent the data elements for sales transactions, another table can represent employee personal information while another table can represent product inventory. Tables are made up of columns of specific data type and lengths. Each column (or field) represents an attribute of the entity. For example, a table hosting employee information can have fields like first_name, last_name, age or start_date.
Each table can hold rows of data for each instance of that entity. The total size of the columns for each row will be the same – in other words, rows in a table are fixed length.
Different tables can be "related" to each other through primary and foreign keys. This is where the term "relational database" comes from. A primary key uniquely identifies each row in a table. A foreign key in a related table uses the same key to represent a separate set of rows for that row. For example, the master employee table can have a primary key on the employee_id field which uniquely identifies each employee record. The employment_history table can host related information for each employee. Since each employee can have one or more job titles in their job history, there can be one or more rows for an employee in the employment_history table. A relational database will use the employee_id field to link the employee table with the employment_history table. This relationship makes it possible for SQL queries to select an employee record and its associated employment history from both tables.
SQL Database Use Cases
SQL databases are best-suited where transaction integrity and data consistency is most important. They are tried and tested for applications which need to satisfy rigid ACID principles (more about it later). Typically, SQL databases are used for:
- OLTP (Online Transaction Processing) systems like finance and accounting, payroll, HR, fleet management, reservations, inventory, sales, order processing, CRM or trading support systems to name a few.
- Data Warehouses or data marts.
NoSQL Database Overview
Like their SQL counterparts, NoSQL databases are also used to store data, but they have a number of differentiating features:
- The data model is not based on tables of fixed length columns. NoSQL databases can host semi-structured or unstructured data with different lengths.
- NoSQL data does not depend on pre-defined tables. A table can be created, and records added to it dynamically.
- There is more flexibility in using storage because data does not need to be normalized to save disk space.
- The concept of joining records from multiple tables is not as important as it is in relational databases.
NoSQL databases are also classified by the way they store data:
- Document databases are used to store JSON (Java Script Object Notation) documents. MongoDB and Couchbase are examples of such databases. These JSON documents can have different schemas and elements within them, but the same table can be used to host all data.
- Key value databases store data in a key and its associated value form. The key is a pointer to the value; the value can be an image, a JSON document or even a single text string or number. Amazon DynamoDB is a widely-used key-value NoSQL database. Other examples are Redis and Aerospike.
- Column oriented databases are the opposite of row-based relational databases. In relational databases, data is saved as rows in a table. Each row is part of a data block, meaning all column values for the same row are located in one storage block. A block can host one or multiple rows of data and the table is made up of one or more data blocks. The block size is determined by the database engine and the underlying operating system. For SQL Server, the block is called a database page and it's 8 KB in size. In a column-oriented database, values from the same column are hosted in contiguous data blocks. An example of a column-oriented NoSQL database is Apache Cassandra.
- Graph databases are different from other types of NoSQL databases because they do not represent data in rows, columns, tables or column families. Instead, graph databases use components like "edges", "nodes", and "properties" to store and relate data. The nodes are like entities in a relational database and will have some properties. The relationship between two nodes can have properties too. A typical use case of graph database is recommendations in e-commerce stores. Neo4j or AWS Neptune are examples of graph databases.
NoSQL Database Use Cases
NoSQL is used in modern lightweight, internet-enabled applications like social media, click stream capture, shopping carts, chatbots, user preferences, ad-techs, IoT, messaging, personalization, mobile apps or online gaming to name a few.
History of SQL and NoSQL Databases
SQL databases first came out when there was no concept of the cloud. Back then, what we know as the Internet today was in its infancy. Over time, relational databases were ported from mainframe or mini computers to PCs and from there to large servers with a PC-architecture. These servers were hosted in large data centers and accessed by information workers using SQL client tools. That picture still exists today, and we call these systems on-premises ("on-prem" for short) databases. Typical examples of SQL databases in a data center would be Microsoft SQL Server, Oracle, MySQL or PostgreSQL.
As the cloud exploded in the later part of the noughties, most vendors of major relational databases ported their engine to the cloud. Today in Amazon Web Service (AWS), it's possible to run all four major RDBMS in a managed environment.
NoSQL databases on other hand was, were meant for cloud applications from the very beginning. They also came much later, during the early to mid-parts of the last decade. Most mobile or web applications today would have some type of cloud-hosted NoSQL footprint. However, it's also possible to run NoSQL databases on-premise.
SQL and NoSQL Database Differences
We will compare ACID vs CAP and BASE in the following sections.
Databases like SQL Server run on the ACID principle. ACID forms the core mechanism of data integrity in relational database engines. ACID stands for:
Atomicity: Every transaction in a SQL database is a unit of work. It either completes in its entirety or it does not. For example, if we run the following command:
UPDATE <table_name> SET column1 = value1, column2 = value2 WHERE column3 = <some_value>
The relational database engine will make sure both column1 and column2 are updated in one transaction, or they are not. In other words, the database will not leave the fields with inconsistent values.
Consistency: This principle ensures a database always remains in a valid state following a transaction. Valid state means the data will not breach any primary key, foreign key, constraint, trigger, cascade or other business rules when a transaction completes.
Isolation: A relational database can run multiple transactions on the same piece of data from different connections. The isolation principle dictates that when these concurrent transactions finish, the data will be in such a state as if the transactions were played sequentially. This principle determines the transaction isolation level run time setting in SQL Server and other database engines.
Durability: Durability makes sure data is permanently persisted when a transaction completes. Even if there is a power or hardware failure before the actual data is written to disk, the database engine will be able to recover the transaction and bring the database to a consistent state. This is the basis of how the SQL Server transaction log works.
In contrast, distributed NoSQL databases work under the CAP theorem, which has three principles:
Consistency: After a write operation, all copies of a data element will be exactly the same in all nodes. Similarly, when a query reads data, all nodes responding to the request will have the same data.
Availability: The required number of nodes in the cluster needs to be present during a data read or write operation. If there is no response from the required number of nodes, the transaction fails, or the read request returns an error.
Partition Tolerance: This principle states that the system will remain online and functioning even if one or more nodes are unavailable. Ensuring this principle means a data element is sufficiently replicated between multiple nodes so any network or hardware outage does not affect the system.
A NoSQL database can satisfy two out of these three principles, but not all of them. For example, if the database engine implements Consistency and Availability, it behaves more like a distributed relational database as far as data integrity is concerned. Partition tolerance will not be an option here because failed nodes or a network outage will mean not all nodes have the same copy of data.
NoSQL databases often favor the A and P of CAP to ensure system availability. This means data written to a multi-node cluster endpoint will be marked with success as long as the required number of nodes respond with a success status, although it may take some time for the data to propagate to other non-participating nodes. Similarly, for a read operation, as long as the required number of nodes respond with the same copy of the data, it's returned to the client.
When Consistency is given up in favor of Availability and Partition Tolerance, a distributed NoSQL database follows the BASE principle. BASE is:
Basically Available: It means the system is available even if some parts of it are unreachable.
Soft state: It means parts of the distributed database can contain different versions of the same data. This can happen when certain nodes have not synched up with the latest copy of the data.
Eventual Consistency: It assumes that at some point in time, the inconsistent nodes will catch up with the latest copy of data and the system will become "eventually consistent". However, the principle does not guarantee this happening.
Normalization and Joins
SQL databases store data in a normalized form to streamline storage and eliminate duplication. Reads are performed by joining related tables. Sometimes join operations are the biggest contributor to poor query performance.
Normalization is generally not recommended for NoSQL data models, although there is nothing stopping anyone from doing it. The performance cost of writing to and reading from a normalized NoSQL system however would be extremely prohibitive. That's why data redundancy is not an issue for most NoSQL systems.
For example, a MongoDB collection can host a document containing a certain number of JSON elements (e.g. purchase order from a customer). The same collection could be another document from the same entity containing different number of elements. Some elements may be the same between the two (e.g. customer name, address etc.). Breaking down the JSON documents into ones with "common" fields and ones with "related fields" would unnecessarily increase write-time. Relating two different documents for the same entity would be even more time consuming.
SQL databases typically rely on vertical scaling to deal with increased load. This means for higher number of connections, more complex queries or larger data sets, DBAs add more computing resources to the same database server. This is done to keep performance at acceptable level. Sometimes load balancing, clustering or replication is used to separate read and write workload and increase performance, however, any new node functions as a passive, read endpoint. This behavior is somewhat changing in modern data warehouses like Amazon Redshift, but most relational databases still can't be horizontally scaled.
With horizontal scaling, each node in a database cluster can work both as a reader and writer and data is distributed between all nodes. For increased load, new nodes are easily added to the cluster and the existing data automatically redistributes itself among the nodes.
NoSQL databases generally use the second type of scaling. This is also necessary from an operational perspective because these systems need to remain accessible even when one or more nodes go down.
SQL databases are optimized for both read and write workloads. To improve read performance, relational databases use a number of techniques like normalization, materialized views, indexes, statistics update, locking hint, partitioning or read-replicas. Write operations are affected by underlying storage, indexes or triggers.
In contrast, NoSQL databases are optimized for write operations. That's because they are used for high volume, high transaction applications. To ensure users are not left waiting when saving data, a NoSQL database would simultaneously write to multiple nodes. As soon as there is an acknowledgement from majority of the nodes (quorum), it will send a confirmation to the client app.
Another reason for high performance write is the schema-less nature of NoSQL. There's no rigid table structure to adhere to, no foreign key table to update or unique key constraints to satisfy. For delete or update operation, old values are not physically removed from disk, but marked with "tombstones".
For speeding up read operations, NoSQL databases still make use of indexes.
SQL databases strongly adhere to ANSI-standard SQL, a versatile and widely-adopted query language that has matured over a few decades. It's easy to learn: complex queries can be built with only a small number of operators.
ANSI SQL lacks many of the features offered by traditional programming languages like reusable functions, loops, conditional branching, variables or error handling.
Database vendors thus included many of these extra but much needed features as extensions to ANSI SQL. The Transact SQL of Microsoft SQL Server is one example of this. Similar are PL/SQL in Oracle or PL/pgSQL in PostgreSQL. As the database vendors release new features, the extended SQL versions are also updated with new functions and operators. Relational databases often include stored procedures, functions or triggers as integrated logic to process the data they contain.
Despite the lack of a common dialect, all major programming languages or frameworks include APIs and drivers for popular NoSQL databases. For example, it's possible to write applications in C# that use DynamoDB backends or a Python apps that interface with Redis. All that a developer needs to know is what interfaces the driver in their chosen language supports.
NoSQL database are generally available for multiple operating systems - for both server and client. The purpose is to ensure ease of development and deployment in any platform.
Couchbase for example, is available for Windows Server, multiple flavors of Linux including Amazon Linux 2, CentOS, Debian, Oracle Linux, RedHat Enterprise Linux, Windows desktop and macOS.
What we discussed today is a high-level comparison between SQL and NoSQL databases. There are other, more subtle differences between the two. Some relational databases now allow storing semi structured data. But the underlying processing is still done by the relational engine so it's still far from being a true NoSQL environment. NoSQL is a technology unto itself and therefore remains a choice for separate use cases.
- Download, install, and explore different NoSQL databases like MongoDB, Cassandra or DynamoDB
- Explore the distributed computing models of different NoSQL databases
- Explore how different NoSQL databases use CAP theorem
- Learn how to configure NoSQL databases for optimal performance
- Use different programming languages to read from and write to NoSQL databases
- Investigate different performance monitoring tools and metrics for NoSQL databases
About the author
View all my tips