The rate at which businesses are realizing there is significant value in their data is increasing exponentially and daily. If they wish to gain competitive advantage—or in many cases just stay competitive—they need to unlock this value.
At the same time, data variety, volume, complexity, and velocity are also growing at astounding rates, straining traditional solutions to the breaking point and exposing a dramatic need for new solutions. The result has been a push for new solutions, platforms, and databases enabling Big Data applications. These solutions often require entirely new application development and programming paradigms, which slows their adoption and increases their cost.
Big Data implies massive scale with real time data flows and processing. A Big Data platform requires massive scalability, deep analytics, high agility, high availability, and real time processing. Ideally, a Big Data platform provides standard interfaces and semantics that leverage existing applications, tool chains, and developer resources.
Deep has developed a set of unique, patented technologies that solve some of the most difficult and important problems in scalable database system design and implementation. This has been accomplished while providing standard interfaces and semantics. The result is a fundamentally new database, DeepDB.
DeepDB is a completely new architecture, design, and implementation optimized for simultaneous, high performance transaction processing plus analytics. It enables the best of both worlds with a unique combination of breakthroughs and innovations.
Before describing DeepDB in detail—what makes it unique and how it achieves best in class performance in both transactional and analytical workloads—we will establish context by examining traditional database designs.
Traditional Database Designs
Traditional databases have been optimized for specific use cases: optimizations made for one use case often negatively impact others. Two of the most common use cases are transaction processing and analytics.
Transaction processing systems are generally optimized to maximize transaction rate and minimize transaction latency. Traditional transactional databases are composed of log files and fixed-size, paged database files. The log files are used to ensure ACID compliance and error recovery, and the paged database files maintain the current state of the database.
Since log files are sequential and append-mostly they may be written rather quickly, recording each transaction as it occurs and enabling fast ACID transactions. It is safe to say that transactional systems are write-optimized.
Once the transaction has been logged, the database files must still be updated with the new state of the database.
This state includes the values of the database’s tuples and indexes, along with information for check pointing, error detection, and error correction.
Traditional designs store both tuples and indexes as b-trees on disk. They update these data structures in-place, often using memory-mapped file IO.
This leads to disk seek-bound workloads when the database state is read and/or updated. This also results in complex error detection and recovery algorithms, and long startup and shutdown times, as log files must be synchronized with the paged database files.
Deferring work that must be done speeds up transaction processing … but accumulates costs in the form of deferred work and bookkeeping for that work. When this deferred work must finally be performed (because of a log file wrap, for example) performance suffers greatly, as paged database files must be updated to reflect the state of the database. Updating these pages becomes a seek-bound operation, leading to performance cliffs and resulting performance that is orders of magnitude worse than ‘normal’ operation. Furthermore, studies such as OLTP Through the Looking Glass, and What We Found There [HAR08] have shown that buffer management (part of memory management and paging) and logging account for more than 45% of the instruction count for the New Order Transaction from the TPC-C benchmark. Additionally, locking (e.g. two phase locking) and latching (part of concurrency control) accounted for more than 30% of instructions for the New Order Transaction. Combined these operations account for 75% of all instructions executed. Once all overhead is accounted for, useful work consumes less than 7% of the cycles. It is self-evident that wasting more than 93% of available cycles has a huge negative performance impact.
What is needed, then, is a transactional database that maintains the performance gained by sequentially writing log files—without the overhead incurred by complex error detection/correction or the buffer management overhead incurred by seek-bound, paged data and index files. Additionally, lightweight concurrency control (e.g. locking and latching) is critical.
Transactional Row Stores
Key points about transactional row stores:
- They use undo logs or write ahead logs for error recovery; writing to these logs performs very well
- They use paged database files with update in place, quickly becoming seek-bound
- Most use memory-mapped file IO, leading to:
- Database corruption if stray writes occur
- In-memory and on-disk data structures being the same
- They use b-trees on-disk for row and index storage, leading to seek-bound workloads
- They require complex error detection and recovery methods because the database state is updated in place
- Long startup and shutdown times are incurred, as error detection and correction is performed on startup and the in-memory state is written to the persistent database on shutdown
High-performance analytics systems are generally optimized to maximize query rate and minimize query latency across very large data sets. Relational analytics systems are often implemented as column stores, meaning they store and process the tuples of a relation in column order; column storage enables large compression ratios and efficient aggregation operations, but requires rows to be reconstructed from columns whenever rows are needed.
Extreme read optimization is accomplished through the batch creation of compressed column files combined with specialized column-oriented query optimizations such as late materialization (of rows) and block iteration (of column values). In fact, studies like Column-Stores vs. Row-Stores: How Different Are They Really? [ABA08] have determined that a majority of performance gains are obtained by these techniques. Specifically, compression can offer order of magnitude performance gains; late materialization about a factor of 3; and block iteration about a factor of 1.5.
Thus, very high compression ratios are an important characteristic that leads to the high performance of column stores. Since columnar data is usually more similar than different (e.g. often it is of relatively low cardinality) it is generally more compressible than row-oriented data. This is especially true of sorted columnar data, where large blocks of column data may have the same value and simple compression techniques like run length encoding produce impressive results.
In order to achieve maximum compression, column stores transform relations into sets of column files. Systems like C-Store: A Column Oriented DBMS [STN05] perform this operation by staging modifications in-memory and on-disk in column-oriented b-tree structures (as with a traditional row-oriented relational database) and then merging these modifications with column-oriented files in batch mode. Queries must pass through the traditional database layer (when it contains modifications) and merge those results with the information stored in the column files (representing the entire database).
Periodically the in-memory and b-tree on-disk cache must be merged with the read-optimized column store files. Systems like C-Store produce new column store files each time a merge is performed. This merge operation consists of reading existing column store files, merging the in-memory/b-tree on-disk information with the existing column store file, and writing the results to a new column store file. Once fully merged, the new column store file replaces the original and the in-memory/b-tree on disk information may be cleared.
It is notable that the merge operation may block updates, or may be done against a database snapshot. This approach leads to low create/update/delete transaction rates while enabling read optimized queries.
Read-Optimized Column Stores
Key points about read-optimized column stores:
- Extreme read optimizations produce low transaction rates, as entire relations must be repeatedly transformed into column stores
- Bulk loading of data is required, due to the low transaction rates induced by batch creation of column store files
- In-memory updates are first moved to standard b-tree on-disk database structures, and therefore have characteristics similar to traditional b-tree on-disk databases but with additional column oriented overhead
- Queries must check (and merge) both the in-memory/b-tree on-disk database as well as the column stores
- The b-tree on-disk database is eventually merged with existing column store files, creating new column store files
This is a very memory-efficient operation (from the merge sort perspective) because the files are sorted
The resulting column store files are then swapped with the original files and the in-memory database/b-tree on-disk is cleared
The entire merge operation is IO intensive, consisting of:
- Column file reads
- Possible b-tree on-disk reads
- New, merged, and sorted column file writes
- B-tree on-disk pages being cleared/purged/deleted
- Original column files deleted after being swapped with merged column files
- A significant part of query performance gains are due to compression and column-oriented query optimizations
- A pure column store requires rows to be reconstructed from column data that may reside in multiple column store files
We started with a clean slate and no preconceived notions. Making no assumptions, we asked a new set of questions: – What if a database could provide best-in-class transaction processing and analytics simultaneously?
- What keeps traditional systems from being good at both types of processing within the same system?
- Are these limits fundamental, or are they architecture, design, and implementation trade-offs?
- Are there ways to reduce or eliminate the limiting factors?
What we have found is that traditional systems have made a series of architecture, design, and implementation trade-offs based on 30+ year old assumptions and system characteristics (see The End of an Architectural Era (It’s Time for a Complete Rewrite) [STN07]). Conventional wisdom and classical design patterns permeate existing systems, limiting their scalability and use cases.
While papers like [STN07] have identified many problems with existing systems and architectures, we believe they draw the wrong conclusions. Specifically, they conclude that individual, highly optimized (in one dimension/for specific workloads), special-purpose systems are required.
An Optimized General Purpose System
Contrary to popular belief, obtaining best-in-class transaction processing and analytics simultaneously in the same system is not only possible, but the resulting system is much less complex than traditional systems. Understanding, and then eliminating, the limitations of existing systems produces major breakthroughs.
Our breakthroughs are the result of recognizing the following:
- Memory and disk capacity are both orders of magnitude larger than 30+ years ago
- Old architecture and design decisions must be reviewed in this light
- Modern, multi-core CPUs are inherently parallel, possessing many cores that must all be utilized efficiently to maximize performance
- Traditional systems have been hyper-optimized for either transaction processing or analytics—not both
- Doing more of the same type of special-purpose optimizations will likely not lead to breakthroughs
- Hard drive, solid state drive, and even network-attached storage throughput is maximized when sequential access patterns are used
- While spinning platter storage capacities have increased faster than Moore’s Law, latencies have not decreased at the same rate
- Sequential log files enable high transaction rates by deferring processing
- Eventually the database’s state must be updated on-disk, often resulting in performance cliffs
- B-trees on-disk are seek-bound, and other tree structures on-disk attempt to eliminate/alleviate these issues, for example:
- Fractal trees [CHE02]
- LSM trees [ONE96]
- Stratified b-trees [TWG11]
- Column stores achieve order of magnitude performance gains through compression
- Compression is the largest contributor to their scale-up performance gains
- Optimal in-memory structures and access patterns are not necessarily optimal on-disk structures and access patterns
- The ability to efficiently use all available system resources is critical to maximizing performance and reducing cost; this includes fully and efficiently utilizing all available CPU cores, minimizing memory overhead, and maximizing memory and storage subsystem throughput
DeepDB fully leverages the capabilities of modern hardware and systems architecture:
- Multiple, multi-core CPUs that enable high degrees of parallel processing are fully-leveraged by new, concurrent, database algorithms
- Large and inexpensive memory is used to enable huge caches and in-memory-only operations on very large data sets
- High-throughput storage subsystems are operated at maximum throughput by designing algorithms that force all IO operations to be sequential
All algorithms within DeepDB have been designed with latency minimization in mind. Constant and bounded time algorithms have been used extensively in all golden code paths. The result is best-in-class and consistent write, read, and query latency.
The architecture, design, and implementation of DeepDB are inherently concurrent. For example, transactions are streamed to disk while index compression and index file updates are performed concurrently and asynchronously using additional available cores. Queries may be split and performed in parallel with results being aggregated (i.e. map/reduce). Efficient range and fine-grained row locking enable massive concurrency within index data structures. Lock-free algorithms and thread-local data are used extensively to reduce or eliminate contention, greatly increasing concurrency.
The result is a system that fully and efficiently utilizes all available cores.
Minimizing Memory Requirements
Large and inexpensive memory enables huge caches and in-memory operations on very large data sets. DeepDB leverages all memory provided without arbitrary limits. Our in-memory data structures have been designed to be as small as possible to maximize cache efficiency. Unique techniques, such as hierarchical summary indexing and summary caching, minimize memory-resident data while simultaneously managing massive data sets. Finally, both in-memory and on-disk compression can be fully utilized. Since groups and segments are variably sized frames (as opposed to fixed size pages) there is no wasted space during compression.
Optimizing Storage Subsystem Utilization
The understanding that storage system throughput is maximized by using sequential access patterns led to the creation of streaming, append-only transactional and indexing algorithms. Our approach is unique in that all database files (i.e. transactional state, indexes, and metadata) are streamed, append-only files. Enforcing the streaming, append-only rule required the invention of a series of new algorithms and data structures, producing major breakthroughs in transaction and index streaming.
Since files are never updated in place, on-disk data structures are effectively immutable, or Purely Functional (a.k.a. Persistent Data Structures): a rethink of update-in-place algorithms was therefore required.
Transaction streaming is relatively straightforward and highly optimized. As transactions are committed, the resulting database state changes are streamed to non-transient storage, making each transaction durable. While similar to a statement log, transaction streaming represents changes to the database’s state and is effectively the state of the database. When combined with continuous defragmentation, the streamed transactional state is organized for optimal access, making it an optimized transactional row store.
The state log serves a dual purpose: transaction logging for durability, and data organization for high-performance queries. Combining transaction logging with database state representation eliminates the overhead incurred in traditional database designs, where they must maintain both a statement log and database state as b-trees or column store files.
Within the state log all information is contained within groups. For example, a transaction’s state is written sequentially and is bounded by start group and end group indications. The same is true for defragmented information. Groups enable error detection and correction as well as compression.
In all cases, state streaming is a highly write-optimized operation of O(1) complexity. When defragmentation is performed, a worst-case read complexity of O(log(N)) may be incurred when the segment to defragment is not in cache. As presented in the following section, both read and write operations are designed to minimize random access patterns and their resultant disk seeks, amortizing their cost to much less than 1 seek per group write or read operation.
The latency incurred by random access patterns (on all storage mediums) is a dominant factor limiting IO subsystem throughput. Our algorithms are designed to minimize random access patterns and by adapting to system characteristics. So, while the algorithmic complexity varies from O(1) to O(log(N)), the real-world performance is O(1) as all sequential operations become essentially seek-less.
Index Streaming: introducing The CASI Tree
Index streaming and efficient indexing is a much harder problem to solve than transaction streaming; in fact, it could be posited that it is the hard problem that needs be solved in order to create high-performance, persistent database indexing.
Deep invented the Cache Ahead Summary Indexing (CASI) Tree to solve the index streaming problem. This breakthrough enables unmatched indexing performance. The CASI Tree breaks from a traditional b-tree on-disk approach, eliminating update in-place operations on fixed size pages. Instead, the CASI Tree is an append-only Purely Functional Tree data structure.
The CASI Tree divides an index into variable length segments, deltas to those segments and summaries of those segments. It is both read and write optimized, providing O(1) write complexity and a worst case O(log(N)) read complexity.
The CASI Tree employs a variety of techniques:
- Division of index ranges into variable sized segments
- Summary segments
- Segment deltas
- Segment compaction and compression
- Statistics aggregation
The CASI Tree is a segmented column store. While traditional column stores maintain large, monolithic column store files, the CASI Tree effectively breaks these files into segments. Since segments are variably sized, they simultaneously eliminate the wasted space present in b-trees on-disk and enable tradeoffs between throughput, latency, and available Input/Output Operations-Per-Second (IOPS).
Segment modifications can be recorded by segment deltas, thereby minimizing the amount of data written during segment changes. When appropriate, (e.g. initial writes or delta collapse) complete segments are written. In all cases, segmenting a column store enables latency to be bounded and modification costs minimized when compared to column store file rewriting and update-in-place algorithms.
The combination of segments, segment deltas, and summary segments enable massive increases of performance and scale.
Continuous And Adaptive Optimization
Streamed transactions and indexes are both continuously and adaptively optimized based on access and query patterns. Data-driven optimization enables extreme performance gains while simultaneously minimizing the cost of optimization.
Since all optimization is automatic and continuous there is no off-line optimization, downtime, or operator intervention. This greatly increases availability and minimizes TCO.
Our unique architecture enables:
- Full audit trail, with rollback to any transactional state
- Simplified backup, recovery, and replication
- Instantaneous startup and shutdown
- High availability
A complete database audit trail is maintained when running in archival mode, making all previous database states available. These states may be queried in read-only mode, efficiently supporting read-only analytics of point-in-time transactional database states. Additionally, the database may be rolled back to any previous transactional state.
Because all files are streamed and append-only, many previously hard problems are greatly simplified. Incremental file copies using tools like rsync can provide high performance, non-blocking backup and recovery. Replication can be performed by simply transferring the streamed state log while generating indexes, as desired, on the replicas.
Instantaneous startup and shutdown is achieved by the massive simplification afforded by append-only files. Since no file is updated in place, error detection and recovery on startup only requires examination of the ends of files. In fact, since the state log is authoritative, all indexes may (if necessary) be reconstructed from it once it is deemed error free. Additionally, shut down is instantaneous because each transaction has been recorded in the state log upon commit and modified indexes are continuously streamed to disk.
High availability is achieved through:
- Instantaneous shut down and start up, minimizing Mean Time to Repair (MTTR)
- Simplified backup and restore (can be lockless)
- Online operations like continuous defragmentation; the database is continuously available and continuously optimized
- Replication; inexpensive and high performance state replication
- Simplifying, reducing, and eliminating the need for operator intervention and manual maintenance operations
Since operator errors are a major cause of outages [PAT02][NAG04], operator involvement is minimized and the remaining manual operations are simplified. Additionally, minimizing MTTR dramatically increases availability.
DeepDB Key Points
Key points about DeepDB:
- All file operations are streaming and append-only, thereby maximizing disk throughput
- No memory-mapped file IO is used, eliminating updates in place as well as possible corruption
- All storage system access is both seek and IOP optimized
- All on-disk database data structures are Purely Functional (a.k.a. Persistent Data Structures)
- In-memory data structure representation is completely separated from on-disk representation, enabling independent optimization and efficient compression
- Implements optimized transactional row storage
- Implements optimized transaction consistent, segmented column storage
- Updated asynchronously and concurrently, efficiently utilizing all available resources
- Variable sized groups and segments eliminate wasted space and enable bandwidth/latency tradeoffs
- High availability is achieved through massive simplification, online operations, greatly reduced operator intervention, and instantaneous start up and shut down
DeepDB achieves performance gains 4 to more than 200 times greater than traditional solutions using industry standard tests. This performance advantage is dramatically illustrated when comparing MySQL with DeepDB to MySQL with InnoDB.
Summary And Conclusions
Deep has invented and implemented a series of new algorithms, data structures, and libraries, enabling us to maximize the useful work done on a given set of hardware. The result is record-breaking performance (using industry standard tests) on simultaneous transactional and analytics workloads.
This unequivocally places Deep and DeepDB within the Big Data ecosystem. Our technology allows our customers to solve new problems, create new applications, and unlock new value from their data. Simultaneously, our customers realize significant TCO reductions through massive scalability, standard interfaces, and minimal database maintenance.
Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, Michael Stonebraker.
OLTP Through the Looking Glass, and What We Found There. SIGMOD’08.
Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, Stan Zdonik.
C-Store: A Column Oriented DBMS. Proceedings of the 31st VLDB Conference,
Trondheim, Norway, 2005.
Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, Pat Helland.
The End of an Architectural Era (It’s Time for a Complete Rewrite). VLDB ’07.
Daniel J. Abadi, Samuel R. Madden, Nabil Hachem.
Column-Stores vs. Row-Stores: How Different Are They Really? SIGMOD’08.
Patrick O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth O’Neil.
The Log-Structured Merge-Tree (LSM-Tree). Acta Inf., vol. 33, no. 4, pp. 351–385, 1996.
Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, and Gary Valentin.
Fractal Prefetching B -Trees: Optimizing Both Cache and Disk Performance. ACM SIGMOD ’2002.
Andy Twigg, Andrew Byde, Grzegorz Miłos, Tim Moreton, John Wilkes, Tom Wilkie.
Stratiﬁed B-trees and Versioned Dictionaries. USENIX Workshop on Hot Topics in Storage and File Systems, Portland, OR June 2011.
David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez*, Armando Fox, Emre Kıcıman, Matthew Merzbacher*, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah Treuhaft.
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Computer Science Technical Report UCB//CSD-02-1175, U.C. Berkeley, March 15, 2002.
Kiran Nagaraja, Fabio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.
Understanding and Dealing with Operator Mistakes in Internet Services. OSDI ’04: 6th Symposium on Operating Systems Design and Implementation.