Postgres Architectures for High Availability

Understanding the Basics: The Single Machine/Node Architecture

PostgreSQL, a robust and versatile RDBMS, often begins its journey in the simplest form - a single machine or node. This architecture offers several compelling advantages:

  • No Network Latency: Operating on a single node eliminates network latency at the database layer.
  • Co-location Benefits: The application server can be co-located with the database, harnessing millions of IOPS.
  • Microsecond Disk Latency: Disk operations are incredibly fast, measured in microseconds.
  • Cost and Performance Efficiency: Overall, this setup is highly cost-effective and performs well.

However, this simplicity comes with significant risks. The most notable is the lack of high availability: should the single node fail, the application inevitably goes down with it. Moreover, a disk failure could mean catastrophic data loss, especially if there are no backups. Thus, while a single machine offers efficiency and low latency, it falls short on availability.

Transitioning to Distributed PostgreSQL Architectures

To mitigate these risks, distributed PostgreSQL architectures have been developed. These systems aim to preserve the rich functionality and transaction semantics of a single-node RDBMS while adding the critical component of high availability. However, achieving this involves trade-offs, often sacrificing some degree of efficiency and latency.

The distributed systems employ various mechanisms:

  • Replication: Creating copies of data across different machines.
  • Distribution: Partitioning data across various machines.
  • Decentralization: Distributing different DBMS activities across multiple machines.

Each mechanism inherently requires concessions in performance, transactional semantics, functionality, and operational complexity.

Latency in OLTP Systems

Rich functionality in an RDBMS like PostgreSQL necessitates trade-offs. PostgreSQL employs a synchronous, interactive protocol, where transactions are executed step-by-step and any network latency introduces noticeable delays. Even a 10-20ms delay per transaction limits a single session to roughly 50-100 transactions per second. Managing many concurrent sessions, each consuming significant memory, becomes a practical challenge, especially when network latency is involved.
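The arithmetic behind this limit can be sketched in a few lines. The round-trip times and statement counts below are illustrative assumptions, not measurements:

```python
# With a synchronous, interactive protocol each statement waits for a
# network round trip before the next one can start, so per-session
# throughput is bounded by the total round-trip time per transaction.

def max_tps(round_trips_per_txn: int, rtt_ms: float) -> float:
    """Upper bound on transactions/second for one session."""
    return 1000.0 / (round_trips_per_txn * rtt_ms)

# Single machine: ~0.1 ms round trip, 4 statements per transaction
local = max_tps(4, 0.1)    # 2500 tps per session

# Cross-network: ~5 ms round trip, same transaction shape
remote = max_tps(4, 5.0)   # 50 tps per session

print(f"local: {local:.0f} tps, remote: {remote:.0f} tps")
```

The shape of the transaction matters as much as the latency itself: batching statements (fewer round trips per transaction) raises the bound without changing the network.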

PostgreSQL can be distributed across various layers, each with its pros and cons compared to a single machine setup. Three notable architectures are:

Architecture 1: Network-attached Block Storage

Common in cloud-based architectures, this setup stores the database files on a separate, network-attached device, which typically turns every read and write into a network call.

While internal replication in the storage layer all but eliminates downtime from storage failures, there is a trade-off in performance: modern local NVMe drives can far outperform these setups in both IOPS and latency.

The durability and availability of network-attached storage usually outweigh the performance downsides, but it is good to know that Postgres on local disks can be much faster (up to two orders of magnitude).

Architecture 2: Read Replicas

PostgreSQL supports physical replication to read-only replicas. The idea is to have a primary node that accepts writes and one or more read replicas that replicate the primary node and accept read requests.

A common setup is a "hot standby", which takes over in the case of a primary node failure. Read replicas are also ideal for scaling read throughput and offloading reads from the primary, freeing capacity for writes.

Problem: "read-your-writes" consistency issues: Imagine you write to the shopping cart table and then redirect the user to the cart page. If that read lands on a replica that has not yet replicated the write, the user sees an empty shopping cart.
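One common mitigation is to pin a session's reads to the primary for a short window after it writes. The sketch below is a toy, in-memory router; the class name and the assumed lag bound are illustrative, not from any particular library:

```python
import time

class SessionRouter:
    """Toy read/write router: reads go to a replica, unless the session
    wrote recently, in which case reads are pinned to the primary until
    an assumed maximum replication lag has passed."""

    def __init__(self, assumed_max_lag_s: float = 1.0):
        self.assumed_max_lag_s = assumed_max_lag_s
        self.last_write_at = {}  # session_id -> monotonic timestamp

    def note_write(self, session_id: str) -> None:
        self.last_write_at[session_id] = time.monotonic()

    def target_for_read(self, session_id: str) -> str:
        wrote = self.last_write_at.get(session_id)
        if wrote is not None and time.monotonic() - wrote < self.assumed_max_lag_s:
            return "primary"  # replica may still lag behind this write
        return "replica"

router = SessionRouter()
router.note_write("cart-session")
print(router.target_for_read("cart-session"))   # primary
print(router.target_for_read("other-session"))  # replica
```

This trades some read offloading for consistency; a stricter alternative is to wait for the replica to replay past the write's WAL position before reading from it.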

Architecture 3: Active-active

Active-active architectures allow all nodes to accept writes independently, ensuring low latency for both reads and writes.

However, this comes with the downside of eventual consistency (see above) and unresolved update conflicts, which could lead to data inaccuracies.

Problem: Unresolved update conflicts: Update conflicts are not resolved upfront. Normally in PostgreSQL, if two transactions try to update the same row, the first one takes a row-level lock and the second waits. With active-active, both transactions might be accepted concurrently: e.g. two transactions each increment a counter from 3 to 4 on different replicas. When replication finally happens, Postgres happily agrees on 4 being the new value. Oops, we missed an increment.
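The lost-update anomaly can be simulated without a database at all. This toy sketch models two replicas that each apply an increment to their local copy and then converge on a single value instead of combining both writes:

```python
# Both "replicas" start from the same replicated value.
primary_value = 3

replica_a = primary_value + 1   # transaction on node A: 3 -> 4
replica_b = primary_value + 1   # concurrent transaction on node B: 3 -> 4

# Naive convergence: the replicas agree on the replicated value 4 ...
converged = replica_a           # == replica_b == 4

# ... but two increments should have produced 5. One write is lost.
expected = primary_value + 2

print(converged, expected)      # 4 5
```

Avoiding this requires either conflict-aware data types (e.g. CRDT-style counters that merge by summing deltas) or restructuring writes so they commute.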

Architecture 4: Transparent Sharding

Transparent sharding distributes tables across multiple primary nodes, showing them as regular PostgreSQL tables.

Each node exposes the distributed tables as if they were regular Postgres tables, and queries & transactions are transparently routed or parallelized across nodes.

Scaling out transactional workloads is most effective when queries filter on the shard key, so that each query can be routed to a single shard group. For example, limiting queries to a single tenant in a multi-tenant app via a "tenantId".
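A minimal sketch of that routing idea, assuming a simple modulo placement (real systems typically use hash ranges, but the routing principle is the same):

```python
# Illustrative shard routing on a tenant key: a query filtered by
# tenant_id maps to exactly one shard, so it can run on a single node
# with no cross-shard coordination.

NUM_SHARDS = 4

def shard_for(tenant_id: int) -> int:
    return tenant_id % NUM_SHARDS  # placeholder placement function

def route(tenant_id: int) -> str:
    return f"shard-{shard_for(tenant_id)}"

# Every query for tenant 42 deterministically lands on the same shard:
print(route(42))  # shard-2

# A query WITHOUT a shard-key filter would have to fan out to all
# NUM_SHARDS nodes and merge the results, which is far more expensive.
```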

If you are willing to adjust your application, sharding can be one of the most powerful tools when dealing with data-intensive applications.

Conclusion

Deploying a distributed PostgreSQL system is a complex and nuanced challenge. It demands careful consideration and planning by engineers, weighing the pros and cons of each architecture against the specific needs of their application. While there is no one-size-fits-all solution, understanding these architectures is crucial for making informed decisions in the realm of database management.