Databases

References

  • Jamie Brandon: What is a Database? (TigerBeetle talk)
  • Database Engineering by Meta
  • Expert MySQL
  • Database System Concepts, 7th Edition
  • CMU: Intro to Database Systems
  • MySQL for Developers (Aaron, PlanetScale)
  • Database Internals, Alex Petrov

INTRODUCTION

  • A database is an organized collection of inter-related data that models some aspect of the real world. Databases are the core component of most computer applications; among the reasons we might need one are data integrity, implementation concerns, and data durability and security.
  • A DBMS, on the other hand, is software that allows applications to store and analyze information in a database; among its functions it supports definition, creation, querying, update and administration of databases in accordance with a data model.
  • A data model here is a collection of concepts for describing the data in a database while a schema is a description of a particular collection of data using a given data model.

RELATIONAL MODEL

  • It defines a database abstraction based on relations: the database is stored in simple data structures (relations) to avoid maintenance overhead.
  • A relation is an unordered set of tuples whose attributes represent entities. A tuple is a set of attribute values (drawn from the attributes' domains); values are normally atomic or scalar. A special value NULL is a member of every domain.
  • A relation's primary key uniquely identifies a single tuple. Some DBMSs automatically create an internal primary key if the table does not define one via an identity column, e.g. SEQUENCE in Postgres or AUTO_INCREMENT in MySQL.
  • A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.
  • There also exists the concept of constraints which are user-defined conditions that must hold for any instance of the database and can validate data within a single tuple or across entire relations.
  • Unique key and referential constraints are the most common with the DBMS preventing modifications that violate any constraint.
  • The physical storage is left up to the DBMS implementation with access to data happening through a high-level language leaving the DBMS to figure out the best execution strategy.
  • Some key aspects of the relational model can therefore be defined as:
    • Structure: The definition of the database relations and their contents.
    • Integrity: Ensure the database contents satisfy constraints.
    • Manipulation: Programming interface for accessing and modifying database contents.
  • Relational Algebra defines the high level steps of how to compute a query and can also be thought of as the fundamental operations to retrieve and manipulate tuples in a relation.

Document/ Object

  • This is a model which consists of a collection of records (documents), each containing a hierarchy of named field/value pairs. A field's value can either be a scalar type, an array of values or another document.
  • Modern implementations use JSON, older ones use XML or custom object representations.
  • The goal of the document model was to avoid the object-relational impedance mismatch by tightly coupling objects and the database.

Wide Column

Vector

  • This is a model ideally used for Machine Learning workloads using one dimensional arrays for nearest-neighbor search (exact or approximate). The vectors are generated by applying some kind of transformation or embedding function to the raw data, i.e text, images, video, audio.

  • Vector model is also used for semantic search on embeddings generated by ML-trained transformer models and ideally have native integration with modern ML tools and APIs.

  • A query in a vector model is a classifier, for example:

    • Half-space (dot product)
    • Cone (cosine similarity)
    • Ball (Euclidean distance)

  • Given a query, the goal is to find the most similar documents to it among all the database documents. Since ML algorithms cannot work directly with raw text and images, documents and images are usually preprocessed and stored as vectors of numbers.

  • These representations are called embeddings and to accelerate search performance, an index is built on top of dataset embeddings.

  • Some of the similarity metrics one can perform on the indexes include;

    • Euclidean distance: d = sqrt(Σ_{i=1..n} (x_i - y_i)^2)
    • Manhattan distance: d = Σ_{i=1..n} |x_i - y_i|
    • Chebyshev distance: d = max_i |x_i - y_i|
    • Cosine distance: d = 1 - cos(θ), where cos(θ) = (x · y) / (|x| |y|)
    • Inner product: d = Σ_{i=1..n} x_i · y_i
  • Algorithms used in vector model index search include kNN, locality-sensitive hashing, inverted file (clustering), hierarchical navigable small worlds and product quantization.

Algorithms

  • kNN.

    • Brute force, and inefficient in terms of time performance.
    • Ideal for small datasets where the embedding dimensionality is small.
    • Also ideal when accuracy must be 100%; nothing outperforms it on this metric.
    • Improved by reducing the search scope or the vector dimensionality.
  • Locality Sensitive Hashing(LSH)

  • Inverted File(IVF) or Clustering.

    • Also known as postings list, postings file, inverted index.
    • This is a database index storing a mapping from content such as words or numbers, to its locations in a table or in a document or a set of documents.
    • On query, the hash function of the query is computed and mapped values from the hash table are taken.
    • Each of these mapped values contains its own set of potential candidates which are then fully checked on a condition to be the nearest neighbour for the query.
    • Many implementations exist, e.g. Voronoi diagrams (Dirichlet tessellation).
      • Create several non-intersecting regions to which each dataset point belongs, each region having its own centroid at its center.
      • The main property of a Voronoi region is that any point inside it is closer to that region's centroid than to any other centroid.
      • For any new object, distances to the centroids are calculated to determine its region.
    • Suffers from the edge problem.
    • Ideal for cases where we trade a little accuracy for speed gains, e.g. content-based recommender systems.
  • Hierarchical Navigable Small World Graphs (HNSW)

    • The main idea of HNSW is to construct such a graph where a path between any pair of vertices could be traversed in a small number of steps.
    • Skip lists
      • Skip lists are a probabilistic data structure that allows inserting and searching elements within a sorted list in O(log n) on average.
      • It is constructed by several layers of linked lists.
      • The lowest layer has the original linked list with all elements in it.
      • When moving to higher levels, number of skipped elements increases, thus decreasing number of connections.
    • Navigable small world
      • This is a graph with polylogarithmic search complexity which uses greedy routing.
      • Routing refers to the process of starting the search from low-degree vertices and ending with high-degree vertices.
      • Prone to early stopping.
    • Based on both skip lists and navigable small world with its structure representing a multi-layered graph with fewer connections on the top layers and more dense regions on the bottom layers.
  • Product Quantization(PQ)

    • This is the process where each dataset vector is converted into a short memory-efficient representation(PQ code).
    • Instead of keeping all the vectors, their short representations are stored; this leads to lower prediction accuracy since it is a lossy compression method.
    • Quantization is the process of mapping infinite values into discrete ones.
    • The algorithm divides each vector into several equal parts (subvectors).
    • The respective parts of all dataset vectors form independent subspaces and are processed separately.
    • Subspace centroids are called quantized vectors.
    • The collection of all centroids inside a subspace is called a codebook.
    • The optimized query approach is to divide the query into subvectors as well.
  • At their core, vector databases use specialised indexes to perform NN searches quickly.

  • E.g. Pinecone, Marqo, LanceDB, Milvus.

SQL

  • Current standard is SQL:2016

  • DML

    • Data Manipulation Language
    • Methods to store and retrieve information from a database.
    • Approaches:
      • Procedural
        • query specifies the high-level strategy the DBMS should use to find the desired result.
        • Relational Algebra
      • Non-Procedure(Declarative)
        • query specifies only what data is wanted and not how to find it.
        • Relational Calculus.
  • DDL

    • Data Definition Language
  • DCL

    • Data Control Language
  • SQL is based on bags(there may be duplicates) not sets(no possibility of duplicates)

  • Aggregates(AVG(col), MIN, MAX, SUM, COUNT, DISTINCT)

    • Functions that return a single value from a bag of tuples; almost only used in the SELECT output list; can compute multiple aggregates at once, e.g. `SELECT AVG(gpa), COUNT(students) FROM super WHERE sth;`
    • use with DISTINCT.
  • GROUP BY and HAVING

    • Project tuples into subsets and calculate aggregates against each subset, filtering after aggregation; non-aggregated values in the SELECT output clause must appear in the GROUP BY clause.
    • Example below.
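A minimal sketch of GROUP BY + HAVING (table and column names are hypothetical):

-- average grade per course, keeping only courses with more than 10 students
SELECT cid, AVG(grade) AS avg_grade, COUNT(sid) AS students
FROM enrolled
GROUP BY cid
HAVING COUNT(sid) > 10;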
  • HAVING

    • Filters results based on aggregation computation. Like a WHERE clause for a GROUP BY.
  • LIKE

    • Used for string matching: '%' matches any substring (including empty), '_' matches any single character.
  • String Operations

    • CONCAT(), LIKE(), SUBSTRING(), UPPER()
  • DATE/TIME OPS

  • Unix Epoch 01-01-1970

  • Output redirection

    • Store query results in another table; the table must not already be defined, and the resultant table will have the same number of columns with the same types as the input, e.g. the snippet below.
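A minimal sketch of output redirection (table names are hypothetical):

-- materialize the distinct course ids into a new table
CREATE TABLE course_ids AS
SELECT DISTINCT cid FROM enrolled;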
  • Output Control

    • order the output tuples by the value in one or more of their columns, ORDER BY, DESC, LIMIT, OFFSET.
  • Nested Queries

    • Invoke queries inside other queries to compose more complex computations. They are however difficult for the DBMS to optimize due to correlations; inner queries can appear almost anywhere in a query (see the sketch after the operator list).
    • IN, ALL, ANY, EXISTS
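A minimal sketch of nested queries (schema is hypothetical):

-- students enrolled in at least one course
SELECT name FROM student
WHERE sid IN (SELECT sid FROM enrolled);

-- equivalent correlated form using EXISTS
SELECT name FROM student AS s
WHERE EXISTS (SELECT 1 FROM enrolled AS e WHERE e.sid = s.sid);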
  • Window Functions

    • Performs a sliding calculation across a set of related tuples; like an aggregation, but the tuples are not grouped into a single output tuple, e.g. ROW_NUMBER(), RANK().
    • OVER() specifies how to group tuples together when computing the window function.
    • Use PARTITION BY to specify the group, and include an ORDER BY in the window grouping to sort entries in each group, as in the sketch below.
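A minimal sketch of a window function (schema is hypothetical):

-- rank students within each course by grade
SELECT cid, sid,
       RANK() OVER (PARTITION BY cid ORDER BY grade DESC) AS course_rank
FROM enrolled;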
  • Common Table Expressions

    • Auxiliary statements for use in a larger query.
      • Think of it like a temp table scoped to one query; an alternative to nested queries and views; can bind output columns to names before the AS keyword; supports recursion (recursive CTEs). A sketch follows below.
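A minimal sketch of a CTE (schema is hypothetical):

-- bind an aggregate to a name, then reuse it in the outer query
WITH course_counts (cid, n) AS (
  SELECT cid, COUNT(*) FROM enrolled GROUP BY cid
)
SELECT cid FROM course_counts WHERE n > 100;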
  • Lateral Joins

    • The lateral operator allows a nested query to reference attributes in other nested queries that precede it. Think of it like a for loop that invokes another query for each tuple in a table, as in the sketch below.
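A minimal sketch of a lateral join (schema is hypothetical; works in Postgres, and MySQL 8 also supports lateral derived tables):

-- for each course, fetch its two most recent enrollments
SELECT c.cid, e.sid, e.created_at
FROM course AS c,
LATERAL (SELECT sid, created_at
         FROM enrolled
         WHERE enrolled.cid = c.cid
         ORDER BY created_at DESC
         LIMIT 2) AS e;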

Relational Algebra

  • It defines the ordering of the high-level steps of how to compute a query. A better approach is instead to state the high-level answer that you want the DBMS to compute.
  • The relational model is independent of any query language implementations.
  • The DBMS is responsible for efficient evaluation of the query, with high-end systems having a sophisticated query optimizer that can rewrite queries and search for optimal execution strategies.
  • Fundamental operations to retrieve and manipulate tuples in a relation, based on set algebra (unordered sets with no duplicates).
  • Each operator takes one or more relations as its inputs and outputs a new relation and we can chain operators together to create more complex operations.

Operations

  • Select

    • Choose a subset of the tuples from a relation that satisfies a selection predicate; the predicate acts as a filter to retain only tuples that fulfill its qualifying requirements, and we can combine multiple predicates using conjunctions/disjunctions.
  • Projection

    • Generate a relation with tuples that contains only the specified attributes, can rearrange attributes ordering, remove unwanted attributes and can manipulate the values.
  • Union

    • Generate a relation that contains all tuples that appear in either only one or both input relations.
  • Intersection

    • Generate a relation that contains only tuples that appear in both of the input relations.
  • Difference

    • Generate a relation that contains only tuples that appear in the first and not the second of the input relations.
  • Product

    • Generate a relation that contains all possible combinations of tuples from the input relations.
  • Join

    • Generate a relation that contains all tuples that are a combination of two tuples (one from each relation) with a common value for one or more attributes.
  • Extra operators

    • Rename, Assignment, Duplicate Elimination, Aggregation, Sorting, Division

MYSQL SERIES

SCHEMA

  • Choose smallest data type for your data, schema should be honest and represent your data.

  • Integers

    • TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT.
    • Storage grows by 1 byte per type (1, 2, 3, 4 bytes), except BIGINT which is 8 bytes; you should also specify signed or unsigned, and the number in brackets (display width) does nothing to the size.
  • Decimals

    • Store numbers with decimal parts. DECIMAL is exact, FLOAT is an approximation, e.g. DECIMAL(10,4). DOUBLE is bigger than FLOAT: 8 bytes versus 4. Exact types are implemented in the database itself.
  • Strings

    • CHAR, VARCHAR, BINARY, VARBINARY, BLOB, TEXT, ENUM, SET.
    • CHAR(fixed length), VARCHAR(variable length)
    • Keep an eye on collations when choosing character sets.
    • BINARY mainly used to store hashes.
    • TEXT, BLOB - large amount of data in character or binary format.
    • Only select them when you need them.
  • Enums

    • Restrict the range of values a column can take; sorts by the enum position integer, starting at 1.
  • Dates

    • DATE(3), DATETIME(8), TIMESTAMP(4), YEAR(1), TIME(3).
    • TIMESTAMP - (1970 - 2038-01-19)
    • set session timezone.
    • TIMESTAMP is preferred when it comes to storing and retrieving timezone-sensitive data.
    • CURRENT_TIMESTAMP attribute on column, on update too.
  • JSON

    • MySQL validates JSON and has special functions for working with it; you can't index an entire JSON document.
  • Unexpected types

    • Booleans: same as TINYINT(1).
    • Zip codes: CHAR(10).
    • IP addresses: MySQL has functions for working with them (e.g. INET_ATON()/INET_NTOA()); check them out.
  • Generated columns

    • Use AS (expression) to derive the value from another column, see the sketch below.
    • Better than relying on application code to handle this.
    • Read more on this.
    • The expression has to be deterministic.
    • Stored vs Virtual.
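A minimal sketch of a generated column (MySQL; table and column names are hypothetical):

-- derive the email domain from the email column; MySQL keeps it up to date automatically
ALTER TABLE users
  ADD COLUMN email_domain VARCHAR(255)
  GENERATED ALWAYS AS (SUBSTRING_INDEX(email, '@', -1)) STORED;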
  • Schema migrations

    • A maintainable, sharable change to the schema.

Type       - Bytes - Signed minimum
TINYINT    - 1     - -128
SMALLINT   - 2     - -32768
MEDIUMINT  - 3     - -8388608
INT        - 4     - -2147483648
BIGINT     - 8     - -2^63

Postgres: Numeric

typedef struct {
  int ndigits;           /* number of digits */
  int weight;            /* weight of the first digit */
  int scale;             /* display scale */
  int sign;              /* positive, negative or NaN */
  NumericDigit *digits;  /* the digits themselves */
} numeric;

INDEXES

  • An index is a separate data structure that maintains a copy of part of your data and points back to the row.

  • The general idea of indexes is to create as many as you need but as few as you can get away with. An important consideration to keep in mind is that queries should drive the number and kind of indexes you need.

  • B+ Tree is a data structure that is mostly used to construct indexes.

  • Indexes are automatically created for primary keys, which are always non-nullable and exist in every table; columns with few distinct values are not good candidates for indexes.

  • Primary Keys

    • Indexes are automatically created for primary keys.
    • Always non-nullable.
    • Unsigned, AUTO_INCREMENT.
    • One per table.
    • Determines how the table is stored on disk; it effectively is the table.
    • A clustered index.
  • Secondary keys

    • Any index that is not the primary key.
    • Secondary index has pointer to the primary index.
  • Primary Key data types

    • have room to grow.
    • keep an eye on size of index.
    • unsigned bigints
  • Where to add indexes

    • Effective indexes for each query.
    • EXPLAIN keyword to get a glimpse into how query is run.
    • use EXPLAIN to see which indexes are used per query.
    • order, group, ranges, bounded ranges
  • Index selectivity

    • Cardinality.
    • A column with few distinct values is not a good candidate for an index.
  • Prefix indexes

    • Possible to create an index on only a portion (prefix) of a column.
    • Use the selectivity of the full column to determine how close you can get to the original selectivity by indexing just part of the column, see the sketch below.
    • Prefix indexes can't be used to sort.
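A minimal sketch for sizing a prefix index (table and column names are hypothetical):

-- compare prefix selectivity against full-column selectivity
SELECT COUNT(DISTINCT LEFT(email, 8)) / COUNT(*) AS prefix_selectivity,
       COUNT(DISTINCT email) / COUNT(*)          AS full_selectivity
FROM users;

-- index only the first 8 characters
ALTER TABLE users ADD INDEX email_prefix (email(8));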
  • Composite index

    • Indexes on multiple columns.
    • The order of definition is crucial to query performance.
    • Access is left-to-right, with no skipping of columns, and it stops at the first range condition.
    • The key_len column (e.g. in EXPLAIN output) shows how much of the index is being used.
    • Access patterns are crucial to consider when defining a composite index, see the sketch below.
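A minimal sketch of left-to-right composite index access (schema is hypothetical):

ALTER TABLE people ADD INDEX full_name (last_name, first_name, birthday);

-- can use the index: the leftmost columns are constrained
SELECT * FROM people WHERE last_name = 'Doe' AND first_name = 'Jane';

-- cannot skip the leading column, so the index is not usable here
SELECT * FROM people WHERE first_name = 'Jane';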
  • Covering Indexes

    • This is a regular index in a special situation for example when an index supplies all a given query requires.
  • Functional Indexes

    • Index on a function, index something you can't quite reach.
    • When you wrap a column in a function, MySQL has no idea what's in there, it's essentially a black box, leading to a case of index obfuscation.
    • A case-sensitive collation works well for comparison.
    • ALTER TABLE users ADD INDEX m ((MONTH(birthday)))
    • MySQL 8 and beyond supports functional indexes; on MySQL 5.x, use an index on a generated column instead.
  • Indexing JSON columns

    • There is no way to index an entire JSON column in MySQL.
    • Use a generated column over json ->> '$.email' (the unquoting extraction operator) and index that, see the sketch below.
    • Be careful with collation here, even after CAST.
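A minimal sketch of indexing a JSON field via a generated column (MySQL; table and column names are hypothetical):

ALTER TABLE users
  ADD COLUMN email VARCHAR(255)
  GENERATED ALWAYS AS (json_data ->> '$.email'),
  ADD INDEX (email);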
  • Indexing for wildcard searches

    • When searching for substring in a column, indexes can still be used but only until the first wildcard.
  • Fulltext Indexes

    • MySQL supports fulltext searching, although not as well as purpose-built FTS tools: add the FULLTEXT keyword to the index, use the syntax MATCH(a, b, c) AGAINST (expression), choose between BOOLEAN mode and NATURAL LANGUAGE mode, and read the relevancy score, see the sketch below.
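A minimal sketch of fulltext search (MySQL; schema is hypothetical):

ALTER TABLE posts ADD FULLTEXT INDEX ft_post (title, body);

SELECT *, MATCH(title, body) AGAINST ('database indexes') AS score
FROM posts
WHERE MATCH(title, body) AGAINST ('+database -oracle' IN BOOLEAN MODE);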
  • Invisible indexes

    • Before you drop an index, you can make an index invisible to test effects of removing it.
  • Duplicate indexes

    • Because of the way composite indexes work, you may have duplicate and redundant indexes defined.
  • Foreign Keys

    • A foreign key is not the same as a foreign key constraint; the key itself simply references data in another table.
    • Constraints check for referential integrity.
    • The referencing and referenced columns have to be the same type.
    • ON DELETE CASCADE / SET NULL, ON UPDATE CASCADE / SET NULL.
    • Constraints can alternatively be enforced in application code.

QUERIES

EXPLAIN

  • Learn to read and interpret the EXPLAIN statement.

  • Keyword meanings

    • cost: startup cost (before the first row can be returned) .. total cost to get all rows.
    • rows: number of rows.
    • width: average width of a row in bytes.
    • loops: number of times the node was executed.
    • Buffers: I/O measured in blocks; shared hit = found in cache, read = read from disk, plus blocks written.
    • Filter:
    • Rows removed by Filter:
  • Major scan modes

    • Sequential scan: read all the tables in sequential order
    • Index scan: read the index to filter on the WHERE clause, and the table to filter out invisible rows.
    • Bitmap Index scan: same, but read fully the index and then the table, much quicker for a bigger number of rows.
    • Index only scan: for covering indexes.
  • Join types

    • Hash join
      • only equality join
      • really quick with enough memory
      • but slow start
    • Merge join
      • sort, then merge.
      • slow start with no index
      • quickest on big data sets
    • Nested loop
      • for small tables
      • easy to start, hard to quit.

EXPLAIN SELECT * FROM foo;

QUERY PLAN
--------------
Seq Scan on foo (cost=0.00..18584.82 rows=1025082 width=36)

--add the ANALYZE keyword to get the actual stats, though it also executes the query, so best to wrap it in a transaction so one can rollback.

EXPLAIN ANALYZE SELECT * FROM foo;

--also add the BUFFERS option

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo;

QUERY PLAN
--------------
Seq scan on foo () Buffers: shared read=8334

--text_pattern_ops
--Bitmap Index scan on columnname__idx
--Index only scan using column_name on table__name

  • Explain access type

    • const - unique index lookup (at most one matching row)
    • ref - equality lookup on a non-unique index
    • range - index range scan (bounded ranges)
    • index - scan the entire index
    • ALL - scan the entire table
    • look them up in the documentation.
  • Explain Analyze

    • gives more detail to work with as compared to traditional EXPLAIN output.
    • can change the format, e.g. FORMAT=TREE, FORMAT=JSON
    • EXPLAIN ANALYZE actually runs the query
  • Index Obfuscation

    • happens when you wrap a column with a function.
    • always work on the right side of the comparison to avoid this; leave the column alone.
  • Redundant and Approximate conditions

    • Impossible to index a condition correctly.
    • Use a redundant condition to help narrow down records quickly.
    • Should be logically correct.
  • Select only what you need.

    • Especially for large data-format fields.
    • Check your ActiveRecord models for harms of this.
    • The INVISIBLE keyword excludes a column from SELECT * unless it is explicitly selected.
  • Limiting rows

    • Only return the rows you need to remain performant.
    • look for calculations or operations that can be done in the DB.
    • select count(*) and select *
    • accompany your LIMIT with ORDER BY.
  • Joins

    • select * from users(left table) join locations(right table) on users.id=locations.manager_id;
    • default is inner join(left matches right and has results on both)
    • left join(all results on left table)
    • right join(all results on right table)
  • Indexing Joins

    • combining joins efficiently.
    • *relook at this

Chapter 8 of the MySQL documentation covers optimization.

  • Subqueries

    • Can be used in place of joins, to either include or remove results of a query based on another query.
    • semi-join, anti-join
    • where exists, where not in,
  • Common Table Expressions(CTEs)

    • A powerful way to refactor complicated queries into more readable versions: WITH table_name AS (query).
    • Run only once per query.
  • Recursive CTEs

    • A CTE that references itself; can be used to generate new data or work with existing data, see the sketch below.
    • Use the RECURSIVE keyword.
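A minimal sketch of a recursive CTE:

-- generate the numbers 1 through 10
WITH RECURSIVE numbers AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n + 1 FROM numbers WHERE n < 10
)
SELECT * FROM numbers;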
  • Unions

    • Puts the results one on top of the other, as opposed to joins which match them side by side.
    • UNION removes duplicates; UNION ALL keeps them (and is faster).
  • Window Functions

    • Allow you to work on a subset of rows related to the current row while query is being processed.
    • row_number() over()
    • produce ranked lists,
    • named windows
    • maintains individual rows even after processing
  • Sorting and Limiting

    • ORDER BY ... defaults to ASC.
    • LIMIT without an ORDER BY is not deterministic.
  • Sorting with indexes

    • Don't sort your rows if you don't need them sorted.
    • Sorting happens either via an index or after the results have been returned (filesort).
    • You can create a DESC index.
  • Sorting with composite indexes

    • The sort must follow the column order and direction of the index definition.
    • The index and the query's ORDER BY have to match.
  • Counting results

    • count(*) - count as fast as you can.
    • include a function inside count function
  • Dealing with NULL

    • Use the <=> operator to compare a column against NULL.
    • Also called the NULL-safe equality operator.
    • IFNULL(), COALESCE(preferred, fallback, default) returns the first non-NULL argument; see the sketch below.
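A minimal sketch of NULL handling (MySQL; schema is hypothetical):

-- NULL-safe comparison: matches rows where deleted_at IS NULL
SELECT * FROM users WHERE deleted_at <=> NULL;

-- first non-NULL value wins
SELECT COALESCE(nickname, name, 'anonymous') FROM users;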

MISC EXAMPLES

  • MD5 Hash

    • Create a very fast lookup on very large values. It can work across multiple columns.
    • GENERATED ALWAYS AS ()
    • Good for strict equality, and for constraints, especially on logical units spanning several columns, see the sketch below.
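A minimal sketch of an MD5-hash lookup column (MySQL; table and column names are hypothetical):

ALTER TABLE urls
  ADD COLUMN url_md5 BINARY(16)
  GENERATED ALWAYS AS (UNHEX(MD5(url))) STORED,
  ADD UNIQUE INDEX (url_md5);

SELECT * FROM urls WHERE url_md5 = UNHEX(MD5('https://example.com'));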
  • Bitwise Operations

    • use the 8 bits to represent different states i.e flags
    • implementation side vs interface side.
    • very cool stuff
  • Timestamps versus booleans

    • when storing a boolean it is sometimes desirable to know when it was turned on/off.
    • timestamp null
    • look more
  • Claiming rows

    • application needs to claim rows for processing.
    • set owner
  • Summary table

    • also known as rollup table.
    • (*)
  • Meta Tables

    • wide tables with many columns can be expensive to query and maintain therefore it makes sense to put them in a secondary table.
    • blob, json ,text columns for example.
  • Offset limit pagination

    • Has to be deterministic.
    • The offset/limit method is the most common.
    • Pages may drift as the user is looking at them.
    • Formula: offset = (page - 1) * per_page.
    • To show page numbers you have to know how many pages there are.
    • The "next" button relies on whether one extra record exists, see the sketch below.
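A minimal sketch of offset/limit pagination (schema is hypothetical):

-- page 3 with 20 rows per page: OFFSET (3 - 1) * 20 = 40
SELECT * FROM posts ORDER BY id LIMIT 20 OFFSET 40;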
  • Cursor pagination

    • Keep track of some record of state, i.e. an id or token.
    • You can't directly address a page.
    • Great for infinite scroll.
    • Anything you order by should be in the cursor, see the sketch below.
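A minimal sketch of cursor pagination (schema is hypothetical; 1337 stands in for the last id the client saw):

SELECT * FROM posts WHERE id > 1337 ORDER BY id LIMIT 20;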
  • Deferred Joins

    • Keeps offset/limit pagination performant as you reach deeper pages: paginate over a narrow (covered) list of ids first, then join back for the full rows, as in the sketch below.
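A minimal sketch of a deferred join (schema is hypothetical):

SELECT p.*
FROM posts AS p
JOIN (SELECT id FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 10000) AS page
  ON p.id = page.id
ORDER BY p.created_at DESC;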
  • Geographic searches

    • ST_Distance_Sphere() function.
    • Use a bounding box around a point to narrow down candidates first.

Data-Intensive Apps

  • Reliability: Tolerating hardware and software faults and human error.

  • Scalability: Measuring load and performance, Latency percentiles and throughput

  • Maintainability: Operability, simplicity and evolvability

  • Fault vs Failure: Fault is when a particular system component deviates from its spec, whereas a failure is when the system as a whole stops providing required service to user.

Anatomy of a DB.

  • Good to understand how best to optimize a server and even how to utilize its features. Essential when modifying or extending its features.

Structure

  • Client Apps

    • Database connectors; based on Open Database Connectivity model, include client access, API, db driver
  • Query Interface; i.e SQL

  • Query Processing; query shipping, logical vs physical models of database design, query tree turned from a logical plan to physical plan.

    • Steps: Parsing -> Query Validation -> Optimization -> Plan Generation/Compilation -> Execution

    • How does the query optimizer work??

      • Cost based optimization, Heuristic optimization, Semantic optimization, Parametric optimization
  • Internal Representation of queries.

  • Query Execution

    • Iterative: generate iterative programs from algebra-based query specs; define compiled functional primitives that are then combined in a call stack.
    • Interpretative: form query execution using existing compiled abstractions of basic operations, reconstructed as a queue of method calls which are each taken off the queue and processed. "Compiled" here means optimised and stored for future execution, not actually compiled.
  • File access

  • A storage engine is a mechanism designed to read and write data that provides some unique benefits to the user.

    • Goal is to find data we want quickly and efficiently without scanning more blocks than necessary.
    • Storage strategies
    • Buffering mechanisms
    • Index mechanisms
  • Query results.

Query Engine

  • What is it?

    • A piece of software that can execute queries against data to produce answers. Query engines provide a set of standard operations and transformations that the end-user can combine in different ways via a simple query language or application programming interface, and they are tuned for good performance.
  • Divided by core functionality

    • Supported field and data types.
    • Locking types.
    • Indexing
    • Transactions.

Components of a query engine.

  • Frontend

    • Query language parser + Semantic checker
    • Planner: Parse Tree -> Logical Plan
  • Intermediate Query Representation("logical plan")

    • Expression/Type system
    • Query plan w/ relational operators(data flow graph)
    • Rewrites/ Optimizations
  • Low level query representation("Physical plan")

    • Statistics, partitions, sort orders, algorithms(Hash join vs merge join)
    • Rewrites /Optimizations
  • Execution Runtime: Operators

    • Allocate resources(CPU, memory)
    • Pushes bytes around, vectorized calculations.
  • The first step in building a query engine is to choose a type system to represent the different types of data that the query engine will process.

    • One option is to invent a proprietary type system specific to the query engine, while another option is to use the type system of the data source that the query engine is designed to query.
    • When querying multiple sources, some conversion is required between each supported data source and the engine's type system.
  • Row-based or columnar

    • see Volcano Query Planner
    • columnar always better as one can take advantage of vectorised processing, SIMD with opportunity for expansion to GPUs to increase parallelism.
  • Create an interface that the query engine can use to interact with data source; files, databases, in-memory objects.

  • Logical plans and expressions.

    • A logical plan represents a relation with a known schema. Each logical plan can have zero or more logical plans as inputs.

    • It is important to be able to print logical plans in human readable form to help with debugging, printed hierarchically with child nodes indented.

    • Serialization of logical plans is also desirable for easy transfer to another process using either language default serialization or use language-agnostic serialization format, i.e Substrait.

    • One of the building blocks of a query plan is an expression that can be evaluated against data at runtime.

      • Logical expressions: literal value, column reference, math, comparison, boolean, aggregate, scalar function and aliased expression.
      • Column expressions
      • Literal expressions
      • Binary expressions, comparison, boolean, math.
    • Logical plans

      • Scan
      • Projection
      • Selection aka filter.
      • Aggregate
  • Dataframe API: Build logical query plans in a much more user-friendly way.

  • Physical plans and expressions: It is considered good practice to separate logical and physical plans, as for one there might be multiple ways to execute a particular operation. There could be multiple physical expression implementations for each logical expression.

  • Query planning: Query planner translates the logical plan into the physical plan and may choose different physical plans based on configuration options or based on the target platform's hardware capabilities.

  • Query optimizations.

    • It rearranges the query plan to make it more efficient.
    • Rule-based optimizations - projection push-down (filter out columns as soon as possible), predicate push-down (filter out rows as early as possible, reducing redundancy), eliminating common sub-expressions, converting correlated subqueries to joins.
    • Cost-based optimizations - use statistics about underlying data to determine cost then choose lowest cost.
  • Query execution.

  • SQL Support

    • Tokenizer, convert SQL query string into list of tokens.
    • Parser, see: Top Down Operator Precedence.
    • SQL query planner, translates SQL query tree into a Logical Plan.
  • Parallel query execution

    • This allows query execution to utilise multiple CPU cores and multiple servers, with the simplest implementation being the use of threads utilizing multiple CPU cores on a single node.
  • Distributed query execution.

    • The goal is to create a physical query plan which defines how work is distributed to a number of executors in a cluster, will typically contain new operators that describe how data is exchanged between executors at various points during query execution.
    • Necessitates building of a scheduler to coordinate across executors.
    • Scheduler needs to examine the whole query and break it down into stages that can be executed in isolation and then schedule these stages for execution based on available resources on cluster.
    • Scheduler could also be responsible for managing compute resources in the cluster.
  • Testing: Unit, Integration and Fuzzing.

DBMS Systems

  • Storage systems, store actual data.
  • Catalog, store metadata about what is in the storage.
  • Query Engine, query and retrieve requested data.
  • Access Control and Authorisation, users, groups and permissions.
  • Resource Management, divide resources between uses.
  • Administration Utilities, monitor resource usage, set policies.
  • Clients for Network connectivity.
  • Multi-node coordination and management.

MySQL structure

  • SQL Interface
    • multi-threaded
  • Parser
    • implemented in YACC and Lex
  • Query Optimizer
  • Query Execution
    • handled by a set of library methods designed to implement a particular query.
  • Query Cache
    • caches query structure and query results
    • on by default
  • Cache and Buffers
    • Table cache, Unireg(data chunking mechanism)
    • .frm files, contain metadata for the table.
    • Buffer pool used by Innodb storage engine to cache table and index data.
    • Record cache, enhances sequential reads from storage engines; works like an LRU algorithm.
    • Privilege cache, store grant data on user account, FILO cache
    • Hostname cache
  • File Access and Pluggable storage engines.
    • plugin - sth that can be changed at runtime.
    • MyISAM - fast access, read heavy, concurrency, advanced caching and indexing
    • InnoDb - High reliability and ACID transactional support
    • Memory - valid during session.
    • Merge - a collection of MyISAM tables with the same structure.
    • Archive - large amounts of data in compressed format.
    • Federated - merge across databases.
    • Cluster/NDB - high-availability and high-performance environment.
    • CSV
    • Blackhole - store data you don't want to save, mostly used for testing.
    • Custom - any you can create to enhance your db server.

CMU PATH - Storage -> Execution -> Concurrency control -> Recovery -> Distributed Dbs

Query Planning -> Operator Execution -> Access Methods -> Buffer Pool Manager -> Disk Manager.

Storage

Disk based Architecture

  • DBMS assumes the primary storage is non-volatile disk and manages the movement of data between volatile and non-volatile storage.

  • *Intel Optane - persistent memory.

  • Access times:

    • L1 Cache - 1 ns.
    • L2 Cache - 4 ns.
    • DRAM - 100 ns.
    • SSD - 16,000 ns.
    • HDD - 2,000,000 ns.
    • Network Storage - 50,000,000 ns.
    • Tape Archives - 1B ns.
  • Sequential vs Random Access: maximise sequential access. Pages are on disk; the buffer pool is in memory and manages page access. The DBMS can use memory mapping (mmap) to store the contents of a file in the address space of a program.

  • Don't rely on the OS; take care of everything in the DBMS since it knows better, i.e. flushing dirty pages to disk in the correct order, specialized prefetching, buffer replacement policy and thread/process scheduling.

  • Memory-mapped I/O Problems

    • Transaction safety: OS can flush dirty pages at any time.
    • I/O Stalls: DBMS does not know which pages are in memory, this will stall a thread on page fault.
    • Error Handling: It's difficult to validate pages, any access can cause a SIGBUS that the DBMS must handle.
    • Performance Issues: OS data structure contention, TLB shootdowns.
  • Solutions to the above problems

    • madvise - Tell the OS how you expect to read certain pages.
    • mlock - Tell the OS that memory ranges cannot be paged out.
    • msync - Tell the OS to flush memory ranges out to disk.
  • Main Questions of database storage

    • How the DBMS represents the database in files on disk?
    • How the DBMS manages its memory and moves data back-and-forth from disk?

File Storage

  • Stores a db as one or more files on disk, typically in a format proprietary to the db. The OS doesn't know anything about the contents of these files.

Storage Manager

  • It is responsible for maintaining a database's files; some storage managers do their own scheduling for reads and writes to improve spatial and temporal locality of pages.
  • It organizes the files as a collection of pages, tracks data read/written to pages and tracks available space.
  • A DBMS typically does not maintain multiple copies of a page on disk.

Database pages

  • A page is a fixed-size block of data. It can contain tuples, meta-data, indexes and log records, but not together: page types are not mixed. Some systems require a page to be self-contained, so that all data and metadata relating to it is in the page.

  • Each page is given a unique identifier; the DBMS uses an indirection layer to map page IDs to physical locations. A hardware page is the largest block of data that the storage device can guarantee failsafe writes for; it is the unit of atomicity.

  • Notions of pages

    • H/w page - 4KB
    • OS page - 4KB too.
    • DB page - 512B-32KB
  • Larger pages are ideal to maximise sequential access to data, but writes then become more expensive.

Page storage Architecture

  • Different DBMSs manage pages in files on disk in different ways, ideally without needing to know anything about what is inside the pages.
  • Heap File Organisation
    • This is an unordered collection of pages with tuples stored in random order; it supports create/get/write/delete of pages and also iteration over pages.
    • Page directory
      • The DBMS maintains special pages that track the location of data pages in the database file, making sure directory pages stay in sync with the data pages. It also records meta-data about the available space: free slots per page, free/empty pages.
  • Tree File Organisation
  • Sequential / Sorted File organisation
  • Hashing File Organisation

Page Structure

  • Every page contains a header with metadata about the page's contents.
    • Page size
    • Checksum
    • Dbms version
    • Transaction visibility
    • Compression / Encoding meta-data.
    • Schema Information.
    • Data summary / sketches.
  • Some systems require pages to be self-contained (e.g. Oracle).

Page Layout

  • Page storage architectures need to decide how data is organized inside the page.

Tuple oriented

  • The most common layout scheme is called slotted pages. The slot array maps slots to the tuples' starting position offsets. The header keeps track of the number of used slots and the offset of the starting location of the last slot used.

  • DBMS assigns each logical tuple a unique record identifier that represents its physical location in the database.

    • page_id + offset/slot.
    • e.g. ctid in Postgres (6 bytes), file:page:slot in MySQL.
    • SQLite uses ROWID (8 bytes) as the true primary key and stores it as a hidden attribute.
    • You can run VACUUM in Postgres to clean tuple pages, i.e. VACUUM FULL table_name.
    • Applications should never rely on these IDs to mean anything, as they can be affected by compaction, vacuuming or garbage collection.
  • Tuple layout

    • A tuple is a sequence of bytes; it is the job of the DBMS to interpret those bytes into attribute types and values. Each tuple is prefixed with a header that contains meta-data about it, i.e. visibility info (concurrency control) and a bitmap for NULL values, followed by the attribute data.
    • We don't need to store meta-data about the schema in the tuple itself.
    • Word-aligned tuples
      • All attributes in a tuple must be word aligned to enable the CPU to access it without any unexpected behaviour or additional work.
      • Problems with reading data that spans word boundaries
        • Perform extra reads, execute two reads to load the appropriate parts of the data word and reassemble them.
        • Random reads, read some unexpected combination of bytes assembled into a 64-bit word.
        • Reject, throw an exception and hope app handles it.
      • Solutions
        • Padding, add empty bits after attributes to ensure that tuple is word aligned.
        • Reordering, switch the order of attributes in the tuples' physical layout to make sure they are aligned.
    • The DBMS can physically denormalize related tuples and store them together in the same page; this reduces I/O but makes updates more expensive.
  • Insert a new Tuple

    • check page directory to find a page with a free slot, retrieve the page from disk, check slot array to find empty space in page that will fit.
  • Update an existing tuple using record id.

    • check page directory to find location of page, retrieve page from disk, find offset in page using slot array, overwrite existing data(if new data fits)
  • Problems

    • Fragmentation, pages are not fully utilized(unusable space, empty slots)
    • Useless disk I/O, DBMS must fetch entire page to update one tuple.
    • Random disk I/O, worst case scenario when updating multiple tuples is that each tuple is on a separate page.

Log-structured Storage

  • DBMS stores log records that contain changes to tuples(put, delete).
  • Each log record must contain the tuple's unique identifier.
  • Put records contain the tuple contents.
  • Deletes mark the tuple as deleted.
  • As the app makes changes to the db, the DBMS appends log records to the end of the file without checking previous log records.
  • DBMS appends new log entries to an in-memory buffer and then writes out the changes sequentially to disk.
  • The dbms may also flush partially full pages for transactions.
  • On-disk pages are immutable.
  • To read a tuple with a given id, DBMS finds newest log record corresponding to that id, scanning log from newest to oldest.
  • For faster traversal, maintain an index that maps a tuple id to the newest log record.
  • Log-structured compaction.
    • DBMS does not need to maintain all older log entries for a tuple indefinitely, periodically compact pages to reduce wasted space.
    • After a page is compacted, the DBMS doesn't need to maintain temporal ordering of records within the page; each tuple id appears at most once in the page.
  • Coalesces larger disk-resident log files into smaller files by removing unnecessary records.
  • DBMS can instead sort the page based on id order to improve efficiency of future look-ups.
    • Sorted String Tables, SSTables.
    • Universal compaction (Rocks DB).
    • Level compaction(Level DB).
  • Problems:
    • Reads are slower.
    • Write amplification: a high number of reads and writes between memory and disk for the same data.
    • Compaction is expensive.

Index-organized storage

  • DBMS stores a table's tuples as the value of an index data structure, still using a page layout that looks like a slotted page.
  • Tuples are typically sorted in page based on key.

System Catalogs

  • The DBMS catalogs contain the schema information about tables that the system uses to figure out the tuple's layout.

  • Most DBMSs don't allow a tuple to exceed the size of a single page; to store larger tuples the DBMS uses separate overflow storage pages.

    • Postgres, TOAST(>2KB)
    • MySql, Overflow(>1/2 size of page)
    • SQL server, overflow(> size of page)
  • External value storage

    • Some systems allow you to store a large value in an external file, treated as a BLOB type.
    • Oracle, BFILE data type.
    • Microsoft, FILESTREAM data type.
  • The DBMS cannot manipulate the contents of an external file, no durability or transaction protections.

  • ref paper: To BLOB or NOT To Blob

  • Variable precision numbers

  • Fixed precision numbers

  • Null data type

    • Null column bitmap header, store a bitmap in a centralized header that specifies what attributes are null.
    • Designate a value to represent NULL for a data type.
    • Per attribute Null flag, store a flag that marks that a value is null.

Database Workloads

  • OLTP - Online Transaction Processing.

    • Fast operations that only read/update a small amount of data each time; usually the kind of application people build first.
  • OLAP - Online Analytical Processing.

    • Complex queries that read large portions of the database, spanning multiple entities, to compute aggregates; these workloads are executed on the data collected from OLTP applications.
  • Hybrid Transaction and Analytical Processing

    • OLAP + OLTP together on the same database instance.
  • The relational model does not specify that the DBMS must store all a tuple's attributes together in a single page.

  • This may not actually be the best layout for some workloads.

Storage Model

  • A DBMS storage model specifies how it physically organizes tuples on disk and in memory. It can have different performance characteristics based on the target workload (OLTP vs OLAP), and it also influences the design choices of the rest of the DBMS.

N-ary storage model (NSM)

  • It is ideal for OLTP workloads where queries tend to access individual entities and execute write-heavy workloads, use the tuple-at-a-time iterator processing model.

  • The DBMS stores all the attributes for a single tuple contiguously in a single page, also known as a "row store".

  • NSM database page sizes are typically some constant multiple of the 4KB hardware page, e.g. Oracle (4KB), Postgres (8KB), MySQL (16KB).

  • A disk oriented NSM system stores a tuple's fixed-length and variable-length attributes contiguously in a single slotted page.

  • The tuple's record id(page#, slot#) is how the DBMS uniquely identifies a physical tuple.

  • header + slot array.

  • Advantages

    • fast inserts, updates and deletes.
    • good for queries that need entire tuples(OLTP).
    • Can use index-oriented physical storage for clustering.
  • Disadvantages

    • not good for scanning large portions of the table and/or a subset of the attributes.
    • terrible memory locality in access patterns.
    • It is not ideal for compression because of multiple value domains within a single page.

Decomposition storage model (DSM)

  • The DBMS stores the values of a single attribute for all tuples contiguously in a page, also known as a column store.
  • It is ideal for OLAP workloads where read-only queries perform large scans over a subset of the table's attributes, using a batched vectorized processing model.
  • DBMS is responsible for combining/splitting a tuple's attributes when reading/writing.
  • It stores attributes and meta-data in separate arrays of fixed-length values.
    • most systems identify unique physical tuples using offsets into these arrays.
    • need to handle variable-length values.
      • padding variable-length fields to ensure they are fixed-length is wasteful, especially for large attributes.
      • a better approach is to use dictionary compression to convert repetitive variable-length data into fixed-length values (typically 32-bit integers).
  • Maintain a separate file per attribute with a dedicated header area for metadata about entire column.
  • Tuple identification across pages, i.e in queries that access more than one column.
    • Fixed-length offsets, each value is the same length for an attribute, use simple arithmetic to jump to an offset to find a tuple. Need to convert variable-length data into fixed-length values.
    • Embedded tuple ids, each value is stored with its tuple id in a column. Needs auxiliary data structures to find the offset within a column for a given tuple id.
  • Advantages:
    • Reduces the amount of wasted I/O because the DBMS only reads the data it needs.
    • Better query processing and data compression because of increased locality and cached data reuse.
    • Better data compression.
  • Disadvantages:
    • Slow for point queries, inserts, updates and deletes because of tuple splitting/stitching.
  • See: Cantor DBMS, DSM proposal, SybaseIQ, Vertica, MonetDB, Parquet/ORC.

Observation

  • OLAP queries almost never access a single column in a table by itself, at some point during query execution, the DBMS must get other columns and stitch the original tuple back together.
  • We still need to store data in a columnar format to get the storage + execution benefits.
  • We need columnar scheme that still stores attributes separately but keeps the data for each tuple physically close to each other.

PAX storage model

  • ref paper: data page layouts for relational databases on deep memory hierarchies

  • Partition Attributes Across(PAX) is a hybrid storage model that vertically partitions attributes within a database page

    • Parquet and Orc
  • The goal is to get the benefit of faster processing on columnar storage while retaining the spatial locality benefits of row storage.

  • Horizontally partition data into row groups. Then vertically partition their attributes into column chunks.

  • Global header contains directory with the offsets to the file's row groups

    • This is stored in the footer if the file is immutable(Parquet, Orc)
  • Each row group contains its own meta-data header about its contents.

  • Transparent Huge Pages(THP)

    • Instead of always allocating memory in 4KB pages, Linux supports creating larger pages.
    • This reduces the # of TLB entries.

Hybrid storage model.

  • Use separate execution engines that are optimized for either NSM or DSM databases.

    • store new data in NSM for fast OLTP.
    • Migrate data to DSM for more efficient OLAP.
    • Combine query results from both engines to appear as a single logical database to the application
    • Approaches
      • Fractured Mirrors
        • store a second copy of the db in a DSM layout that is automatically updated.
        • Oracle, SQL Server.
      • Delta Store
        • stage updates to the database in an NSM table.
        • A background thread migrates updates from the delta store and applies them to the DSM data.
        • Vertica, SingleStore, Databricks, Napa.
  • I/O is the main bottleneck if the DBMS fetches data from disk during query execution.

  • DBMS can compress pages to increase the utility of the data moved per I/O operation.

  • Key trade-off is speed vs compression ratio.

    • Compression reduces the database DRAM requirements
    • It may decrease CPU costs during query execution
  • Data sets tend to have highly skewed distributions for attribute values.

  • Data sets tend to have high correlation between attributes of the same tuple.

OLAP INDEXES

  • OLTP DBMS use indexes to find individual tuples without performing sequential scans
    • tree-based indexes are meant for queries with low selectivity predicates
    • they also need to accommodate incremental updates
  • OLAP don't necessarily need to find individual tuples and data files are read-only.

Sequential scans optimizations

  • Data prefetching
  • Task parallelization/Multi-threading
  • Clustering / Sorting.
  • Late Materialization
  • Materialized Views / Result Caching
  • Data Skipping
    • Approaches:
      • Approximate Queries(Lossy)
        • Execute queries on a sampled subset of the entire table to produce approximate results
        • Examples: BlinkDB, Redshift, Snowflake, BigQuery, DataBricks.
      • Data Pruning(Loseless)
        • use an auxiliary data structure for evaluating predicates to quickly identify portions of a table that the DBMS can skip instead of examining tuples individually

        • DBMS must consider tradeoffs between scope(size) vs filter efficacy, manual vs automatic

        • Considerations

          • Predicate Selectivity
            • How many tuples will satisfy a query's predicates
          • Skewness
            • Whether an attribute has all unique values or contain repeated values
          • Clustering/Sorting
            • Whether the table is pre-sorted on the attributes accessed in a query's predicates.
        • Zone Maps

          • ref paper:small materialized aggregates: a lightweight index structure for data warehousing
          • pre-computed aggregates for the attribute values in a block of tuples.
          • DBMS checks the zone map first to decide whether it wants to access the block.
            • originally called small materialized aggregates.
            • DBMS automatically creates/maintains this meta-data.
          • Zone Maps are only useful when the target attribute's position and values are correlated.
            • if scope is too large, then the zone maps will be useless.
            • if scope is too small, DBMS will spend too much time checking zone maps.
        • BitMap Indexes

          • ref paper:model 204 architecture and performance

          • store a separate Bitmap for each unique value for an attribute where an offset in the vector corresponds to a tuple.

            • the i-th position in the Bitmap corresponds to the i-th tuple in the table.
          • typically segmented into chunks to avoid allocating large blocks of contiguous memory

            • one row per group in PAX.
          • Design decisions

            • Encoding scheme
              • How to represent and organize data in a Bitmap.
              • Approaches
                • Equality encoding
                  • basic scheme with one bitmap per unique value.
                • Range encoding
                  • use one bitmap per interval instead of one per value.
                • Hierarchical encoding
                  • ref paper:Hierarchical bitmap index:an efficient and scalable indexing technique for set-valued attributes
                  • use a tree to identify empty key ranges.
                  • high cost overhead.
                • Bit-sliced encoding
                  • use a bitmap per bit location across all values
                  • Bit-slices can also be used for efficient aggregate computations.
                  • can use Hamming Weight.
                • Bitweaving
                  • ref paper:bitweaving: fast scans for main memory data processing
                  • Alternative storage layout for columnar databases that is designed for efficient predicate evaluation on compressed data using SIMD.
                    • Order-preserving dictionary encoding
                    • Bit-level parallelization
                    • Only require common instructions(no scatter/gather)
                  • alternate to Bit-Slicing.
                  • Approaches:
                    • Horizontal
                      • Row-oriented storage at the bit-level.
                    • Vertical
                      • Column-oriented storage at the bit-level.
          • Column Imprints

            • ref paper:column imprints: a secondary index structure.
            • Store a bitmap that indicates whether there is a bit set at a bit-slice of cache-line values.
          • Column Sketches

            • ref paper:column sketches: a scan accelerator for rapid and robust predicate evaluation

            • a variation of range-encoded bitmaps that uses smaller sketch codes to indicate that a tuple's value exists in a range.

            • DBMS must automatically figure out the best mapping of codes.

              • trade-off between distribution of values and compactness.
              • assign unique codes to frequent values to avoid false positives.
            • Compression

              • How to reduce the size of sparse Bitmaps
  • Data Parallelization / Vectorization
  • Code Specialization / Compilation

Database Compression

  • Reduce the size of the database physical representation to increase the # of values accessed and processed per unit of computation or I/O.

  • Data sets tend to have highly skewed distributions for attribute values.

  • Data sets tend to have high correlation between attributes of the same tuple.

  • Key trade-off is speed vs compression ratio.

  • Goal 1: Must produce fixed-length values, only exception is var-length stored in separate pool.

  • Goal 2: Postpone decompression for as long as possible during query execution, late materialization

  • Goal 3: Must be a lossless scheme.

  • Lossless vs Lossy compression(should be at application level).

Compression Granularity

  • Block-level

    • compress a block of tuples for the same table
    • Naive compression
      • scope of compression is only based on the data provided as input.
      • LZO, LZ4, Snappy, Zstd*.
      • Computational overhead
      • compress vs decompress speed.
      • DBMS must decompress data first before it can be read and potentially modified, limiting the scope of the compression scheme.
      • The schemes do not consider the high-level meaning or semantics of the data.
  • Tuple-level

    • compress the contents of the entire tuple(NSM only)
  • Attribute-level

    • compress a single attribute within one tuple(overflow)
    • can target multiple attributes for the same tuple
  • Column-level

    • Compress multiple values for one or more attributes stored across multiple tuples (DSM only).
    • Run-length encoding
      • Compress runs of the same value in a single column into triplets
        • value of the attribute,
        • start position in the column segment,
        • the # of elements in the run, giving a (value, offset, length) triplet.
      • Sort then compress for better results.
      • Requires columns to be sorted intelligently to maximise compression opportunities, as in the example below.
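      • For example (hypothetical values): a sorted column containing (A, A, A, B, B, C) compresses to the triplets (A, 0, 3), (B, 3, 2), (C, 5, 1).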
    • Bit-Packing encoding
      • When values for an attribute are always less than the declared largest size of the type, store them as a smaller data type.
      • Use bit-shifting tricks to operate on multiple values in a single word.
      • Mostly encoding
        • A bit-packing variant that uses a special marker to indicate when a value exceeds the largest size and stores most values with the smaller data type.
        • The remaining values that cannot be compressed are stored in their raw form.
    • Bitmap encoding
      • Store a separate bitmap for each unique value for an attribute where an offset in the vector corresponds to a tuple.
        • The ith position in the Bitmap corresponds to the ith tuple in the table.
        • Typically segmented into chunks to avoid allocating large blocks of contiguous memory.
      • Only practical if the value cardinality is low.
    • Delta encoding
      • Record the difference between values that follow each other in the same column.
      • Can combine with RLE for better results
    • Incremental encoding
      • Type of delta encoding that avoids duplicating common prefixes/suffixes between consecutive tuples.
      • works best with sorted data.
    • Dictionary encoding*
      • ref paper:Integrating compression and execution in column-oriented database systems
      • Build a data structure, dictionary, that maps variable-length values to a smaller integer identifier.
      • Replace those values with their corresponding identifier in the dictionary data structure.
        • need to support fast encoding and decoding for both point and range queries.
      • Most widely used compression scheme in DBMS.
      • A dictionary needs to support encode/locate and decode/extract.
      • It should also be order preserving (see the sketch after this list).
      • Data structures used:
        • Array
        • One array of variable-length strings and another array of pointers that map to string offsets.
        • Expensive to update so only usable in immutable files.
        • Hash Table
        • Fast and compact.
        • Unable to support range and prefix queries.
        • B+ Tree
        • Slower than a hash table and takes more memory.
        • It can support range and prefix queries.
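
  A hedged sketch of an order-preserving dictionary for an immutable column segment: codes are assigned in sorted order, so comparisons and range predicates can be evaluated directly on the integer codes. The class and method names are illustrative only.

  ```python
  import bisect

  class OrderPreservingDictionary:
      def __init__(self, values):
          self.sorted_values = sorted(set(values))           # decode/extract array
          self.code_of = {v: i for i, v in enumerate(self.sorted_values)}

      def encode(self, value):            # locate: value -> code
          return self.code_of[value]

      def decode(self, code):             # extract: code -> value
          return self.sorted_values[code]

      def range_codes(self, low, high):   # codes whose values fall in [low, high]
          lo = bisect.bisect_left(self.sorted_values, low)
          hi = bisect.bisect_right(self.sorted_values, high)
          return lo, hi - 1

  column = ["Andy", "Prashanth", "Andy", "Matt", "Jignesh"]
  d = OrderPreservingDictionary(column)
  encoded = [d.encode(v) for v in column]          # smaller fixed-width codes
  assert [d.decode(c) for c in encoded] == column  # lossless round trip
  ```
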

Memory management and Buffer Pool

  • Spatial Control

    • This refers to where to write pages on disk. The goal is to keep pages that are used together as physically close as possible on disk.
  • Temporal Control

    • This refers to when to read pages into memory and when to write them to disk. The goal is to minimise the number of stalls from having to read data from disk.
  • Buffer Pool

    • This is a memory region organized as an array of fixed-size pages used to store pages fetched from disk. An array entry is called a frame (the name distinguishes it from other components).
    • When the DBMS requests a page, an exact copy is placed into one of these frames. Dirty pages are buffered and not written to disk immediately (write-back cache).
    • A page table keeps track of pages that are currently in memory, usually a fixed-size hash table protected with latches to ensure thread-safe access.
    • It also maintains metadata for efficiency per page:
      • Dirty flag (set by a thread when it modifies a page, indicating to the storage manager that the page must be written back to disk)
      • Pin/reference counter (prevents swapping; tracks the number of threads accessing the page)
      • Access tracking information.
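
  A simplified sketch of a buffer pool frame table with the per-page metadata described above (dirty flag, pin count); no latching, eviction, or real disk I/O, and the `disk` dictionary is just a stand-in for the database file.

  ```python
  class Frame:
      def __init__(self, page_id, data):
          self.page_id = page_id
          self.data = data
          self.dirty = False      # set when a thread modifies the page
          self.pin_count = 0      # number of threads currently using the page

  class BufferPool:
      def __init__(self, disk):
          self.disk = disk
          self.page_table = {}    # page_id -> Frame (in-memory only)

      def fetch_page(self, page_id):
          frame = self.page_table.get(page_id)
          if frame is None:                       # miss: copy page from "disk"
              frame = Frame(page_id, self.disk[page_id])
              self.page_table[page_id] = frame
          frame.pin_count += 1                    # pin so it cannot be evicted
          return frame

      def unpin_page(self, page_id, dirtied):
          frame = self.page_table[page_id]
          frame.pin_count -= 1
          frame.dirty = frame.dirty or dirtied    # write-back happens later

  pool = BufferPool({0: b"tuple data..."})
  f = pool.fetch_page(0)
  pool.unpin_page(0, dirtied=True)
  ```
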

Locks vs Latches

  • Locks

    • This is a high-level logical primitive that protects the database's logical contents (tuples, tables, databases) from other transactions. Locks are held for the entire transaction duration. Databases need to expose to the user which locks are held and must be able to roll back changes.
  • Latches

    • This is a low-level protection primitive that the DBMS uses to protect critical sections of its internal data structures (hash tables, memory regions) from other threads. Latches are held only for the duration of the operation being made and do not need to support rolling back changes.

Page Table vs Page Directory

  • Page directory

    • This is the mapping from page ids to page locations in the database files. All changes must be recorded on disk to allow the DBMS to find them on restart.
  • Page table

    • This is the mapping from page ids to copies of the pages in buffer pool frames. It is an in-memory data structure that does not need to be stored on disk.

Memory Allocation Policies

  • Global policies

    • DBMS should make decisions for the benefit of all active queries/workload. It considers all active transactions to find an optimal decision for allocating memory.
  • Local policies

    • DBMS makes decisions for the benefit of one query/transaction running faster. It allocates frames to specific queries without considering the behaviour of concurrent queries. It still needs to support sharing pages.
  • Most systems use a combination of both global and local allocation.

Buffer Pool Optimizations

  • Multiple Buffer Pools

    • DBMS can maintain multiple buffer pools for different purposes i.e per-database buffer pool, per-page type buffer pool, with each adopting local policies for data inside it.
    • This partitioning of memory access across multiple pools helps reduce latch contention and improve locality
    • The two approaches to mapping desired pages to a buffer pool are object IDs and hashing.
      • Object id, embed an object identifier in record ids and then maintain a mapping from objects to specific buffer pools.
      • Hashing, hash the page id to select which buffer pool to access.
  • Pre-Fetching

    • DBMS can also optimize by prefetching pages based on a query plan. This method is commonly used by DBMS when accessing many pages sequentially.
  • Scan Sharing(Synchronized scans)

    • Query cursors can reuse data retrieved from storage or operator computations. This allows multiple queries to attach to a single cursor that scans a table. This is different from result caching; a variant is continuous scan sharing.
    • The DBMS keeps track of where the second query joined with the first so that it can finish the scan when it reaches the end of the data structure.
  • Buffer Pool Bypass

    • The sequential scan operator will not store fetched pages in the buffer pool to avoid overhead. Instead memory is local to running query.
    • It works well if operator needs to read a large sequence of pages that are contiguous on disk. It can be used for temporary data(sorting, joins). Light scans.
  • OS page cache

    • Most disk ops go through the OS API. Unless DBMS tells it not to, the OS maintains its own filesystem cache(page cache, buffer cache)
    • Most DBMSs use direct I/O (O_DIRECT) to bypass the OS's cache in order to avoid:
      • Redundant copies of pages
      • Different eviction policies
      • Loss of control over file I/O.
  • Fsync problems

Buffer replacement policies

  • When DBMS needs to free up a frame to make room for a new page, it must decide which page to evict from buffer pool. A replacement policy is an algorithm that the DBMS implements that makes a decision on which pages to evict from buffer pool when it needs space.

  • Implementation goals of replacement policies are improved correctness, accuracy, speed, meta-data overhead.

  • Least Recently Used

    • This replacement policy maintains a single timestamp of when each page was last accessed. On eviction, select the page with the oldest timestamp. The timestamps are kept in a separate data structure to allow sorting and reduce the work needed to find a victim on eviction.
  • Clock

    • This is an approximation of LRU without needing a separate timestamp per page. Each page has a reference bit, when accessed set to 1.
    • Organize pages in a circular buffer with a clock hand. Upon sweeping, check if a page's reference bit is set to 1; if yes, set it to zero, if no, evict the page.
    • common mechanism - read more.
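
  A small sketch of the clock sweep just described: each frame gets a "second chance" via its reference bit before being evicted. The frame representation is illustrative.

  ```python
  def clock_evict(frames, hand):
      """frames: list of dicts with a 'ref' bit; hand: current clock position.
      Returns (victim_index, new_hand)."""
      n = len(frames)
      while True:
          frame = frames[hand]
          if frame["ref"] == 1:
              frame["ref"] = 0              # recently used: clear bit, skip for now
              hand = (hand + 1) % n
          else:
              return hand, (hand + 1) % n   # evict this frame

  frames = [{"page": p, "ref": r} for p, r in [(10, 1), (11, 0), (12, 1)]]
  victim, hand = clock_evict(frames, hand=0)
  print(frames[victim]["page"])   # page 11 is evicted (its bit was already 0)
  ```
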
  • Clock and LRU are susceptible to sequential flooding, where the buffer pool is polluted by a sequential scan. Since sequential scans read many pages, they fill the buffer pool with pages that are read once and then never again. In OLAP workloads, the most recently used page is often the best page to evict.

  • Clock and LRU only track when a page was last accessed, not how often it is accessed.

  • Better policies:

    • LRU-K

      • It tracks history of last K references to each page as timestamps and compute the interval between subsequent accesses. DBMS then uses history to estimate the next time that page is going to be accessed.
      • It maintains an ephemeral in-memory cache for recently evicted pages to prevent them from always being evicted. It can also track who is accessing pages.
    • Localization

      • The DBMS chooses which pages to evict on a per transaction/query basis. This minimizes the pollution of the buffer pool from each query.
      • Postgres maintains a small ring buffer that is private to the query.
    • Priority Hints

      • The DBMS knows about the context of each page during query execution and thus provides hints to the buffer pool on whether a page is important or not.

Dirty pages

  • There are two methods for handling dirty pages. Fast path: if a page in the buffer pool is not dirty, then the DBMS can simply drop it. Slow path: if a page is dirty, then the DBMS must write it back to disk to ensure that its changes are persisted.

  • These two methods illustrate the trade-off between fast evictions and writing out dirty pages that will not be read again in the future.

  • Background writing is one way to avoid writing out pages unnecessarily: the DBMS periodically walks through the page table and writes dirty pages to disk. When a dirty page is written, the DBMS can either evict it or just unset its dirty flag. A careful system does not write dirty pages before their log records are written.

Other Memory Pools

  • The DBMS needs memory for things other than just indexes and tuples. These pools may not always be backed by disk, depending on the implementation.

  • e.g:

    • Sorting + Join buffers
    • Query caches.
    • Maintenance buffers
    • Log buffers
    • Dictionary caches.
  • DBMS can almost always manage memory better than the OS as it can leverage semantics about the query plan to make better decisions.

Disk I/O Scheduling

  • OS/HW tries to maximize disk bandwidth by reordering and batching I/O requests, but they do not know which I/O requests are more important.
  • The DBMS maintains an internal queue to track page read/write requests from the entire system, computing priorities based on several factors:
    • Sequential vs Random I/O.
    • Critical path task vs Background Task.
    • Table vs Index vs Log vs Ephemeral Data.
    • Transaction information.
    • User-based SLAs

Hash Tables

  • Data structures support the DBMS execution engine in reading/writing data from pages; the two main classes are hash tables and trees.

  • DBMS uses various data structures for many different parts of the system internals.

    • Internal Meta-data (data that keeps track of the information about the database and system state)
    • Core data storage (data structures are used as the base storage for tuples in the database)
    • Temporary data structures (ephemeral ds built by DBMS to speed up execution)
    • Table Indices (auxiliary data structures used to easily find tuples)
  • Design decisions

    • Data Organisation
      • How we layout data structure in memory/pages and what information to store to support efficient access.
    • Concurrency
      • How to enable multiple threads to access the data structure at the same time without causing problems.
  • A hash table implements an unordered associative array abstract data type that maps keys to values. It uses a hash function to compute an offset into this array for a given key, from which the desired value can be found. It has space complexity O(n), Time complexity: average O(1), worst O(n).

  • A hash table implementation is comprised of two parts:

    • Hash function: It tells us how to map a large key space into a smaller domain by computing an index into an array of buckets or slots, keeping in mind the trade-off between fast execution and collision rate.
    • Hashing scheme: It tells us how to handle key collisions after hashing, tradeoff between allocating a large hash table to reduce collisions and having to execute additional instructions on collisions.
  • Hash Functions

    • For any input key, return an integer representation of that key; the output is deterministic. We do not want to use a cryptographic hash function for DBMS hash tables because we do not need to protect the contents of keys; the desired properties are fast execution and a low collision rate.
    • Namely:
      • CRC-64 (1975) - used in networking for error detections
      • MurmurHash (2008) - fast, general purpose hash function
      • Google CityHash (2011) - faster for shorter keys
      • XXHash (2012)* - from the creator of Zstd compression
      • Google FarmHash (2014) - newer version of CityHash with better collision rates
  • Static Hash Table:

    • A static hashing scheme is one where the size of the hash table is fixed, if the DBMS runs out of storage space in the hash table, it rebuilds a larger one from scratch which is expensive.
    • Typically the new hash table is twice the size of the original. It is important to avoid collisions of hashed keys to reduce the number of wasteful comparisons; a typical approach is to use twice the number of slots as the number of expected elements.
    • The following assumptions don't hold in reality:
      • Number of elements is known ahead of time and fixed.
      • Each key is unique
      • Existence of perfect hash function

Static Hashing Schemes

  • Linear Probe Hashing(open addressing)

    • This is the most basic hashing scheme; it is typically the fastest and uses a circular buffer of array slots. The hash function maps keys to slots, and on a collision we linearly search adjacent slots for an open one. To determine whether an element is present, hash to a location in the index and scan for it; the key is stored in the index so we know when to stop scanning. Insertions and deletions are generalizations of lookups. Deletions are handled either with tombstones or by shifting the adjacent data after deleting an entry to fill the now empty slot (see the sketch after this list).
    • In the case of non-unique keys,
      • Separate Linked List
        • We store a pointer to a separate storage area that contains a linked list of values. These value lists can overflow to multiple pages if the number of duplicates is large.
      • Redundant Keys
        • We store duplicate keys entries together in the hash table. This is what most systems do.
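
  A hedged sketch of a linear-probe table with tombstone deletes, assuming unique keys and no resizing; a real DBMS would size the table carefully and rehash as needed.

  ```python
  EMPTY, TOMBSTONE = object(), object()

  class LinearProbeTable:
      def __init__(self, capacity=8):
          self.slots = [(EMPTY, None)] * capacity

      def _probe(self, key):
          i = hash(key) % len(self.slots)
          while True:
              yield i
              i = (i + 1) % len(self.slots)          # wrap around (circular buffer)

      def insert(self, key, value):
          for i in self._probe(key):
              k, _ = self.slots[i]
              if k is EMPTY or k is TOMBSTONE or k == key:
                  self.slots[i] = (key, value)
                  return

      def find(self, key):
          for i in self._probe(key):
              k, v = self.slots[i]
              if k is EMPTY:                         # hit an empty slot: not present
                  return None
              if k is not TOMBSTONE and k == key:
                  return v

      def delete(self, key):
          for i in self._probe(key):
              k, _ = self.slots[i]
              if k is EMPTY:
                  return
              if k is not TOMBSTONE and k == key:
                  self.slots[i] = (TOMBSTONE, None)  # keep the probe chain intact
                  return

  t = LinearProbeTable()
  t.insert("A", 1); t.insert("B", 2); t.delete("A")
  print(t.find("B"))   # 2
  ```
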
  • Robin Hood Hashing

    • Variant of linear probe hashing that steals slots from rich keys and give them to poor keys.
    • Each key tracks how far it is from its optimal position; on a collision, the key that is farther from its optimal slot ("poorer") keeps the slot and the displaced ("richer") key continues probing.
  • Cuckoo Hashing

    • Instead of one hash table, use multiple hash tables with different hash function seeds. The multiple hash functions give multiple candidate locations in which to insert a record.
    • On insert, check every table and pick any one that has a free slot; if no table has a free slot, evict an element from one of them and re-hash it to find a new location. Look-ups and deletions are always O(1) because only one location per hash table is checked; insertions, however, are more expensive.
  • Optimizations

    • Specialized hash table implementations based on key type and sizes, i.e maintain multiple hash tables for different string sizes for a set of keys.
    • Store metadata separately in a packed bitmap that tracks whether each slot is empty or a tombstone.
    • Use table + slot versioning metadata to quickly invalidate all entries in the hash table, i.e if the table version does not match slot version then treat slot as empty.
  • Above hash tables require the DBMS to know the number of elements it wants to store, otherwise needs to rebuild the table if it needs to grow/shrink in size. Dynamic hashing schemes are able to resize the hash table on demand without needing to rebuild the entire table, with the schemes used resizing in different ways that either maximize reads or writes.

Dynamic Hashing Schemes

  • Chained Hashing

    • This is the most common dynamic hashing scheme. The DBMS maintains a linked list of buckets for each slot in the hash table. It resolves collisions by placing all elements with the same hash key into a linked list for that bucket.
    • To determine whether an element is present, we hash to its bucket and scan for it. This can be optimised by additionally storing bloom filters in the bucket pointer list, which would tell us if a key does not exist and avoid a lookup.
  • Extendible Hashing

    • This is an improved version of chained-hashing approach that splits buckets incrementally instead of letting the linked list grow forever. This approach allows multiple slot locations in the hash table to point to the same bucket chain.
    • The core idea is to reshuffle bucket entries on split and increase the number of bits to examine to find entries in the hash table. This means data movement is localized to just the split chain, with all other buckets left untouched.
    • The DBMS maintains global and local depth bit counts that determine the number of bits needed to find buckets in the slot array. When a bucket is full, the DBMS splits the bucket and reshuffles its elements. If the local depth of the split bucket is less than the global depth, then the new bucket is just added to the existing slot array; otherwise the DBMS doubles the size of the slot array to accommodate the new bucket and increments the global depth counter.
  • Linear Hashing

    • Instead of immediately splitting a bucket when it overflows, this scheme maintains a split pointer that tracks the next bucket to split. The bucket at the split pointer is split regardless of whether it is the one that overflowed; the specific overflow criterion is left to the implementation, e.g. space utilization or average length of overflow chains.
    • Use multiple hash functions to find the right bucket for a given key. Splitting buckets based on the split pointer will eventually reach all overflowed buckets. When the pointer reaches the last slot, remove the first hash function and move the pointer back to the beginning.
    • If the highest bucket below the split pointer is empty, the hash table can remove it and move the split pointer in the reverse direction, reducing the hash table size.

B+ Trees

  • Hash tables are unordered, so queries that involve range scans cannot use them efficiently; ordered table indexes are preferred for such queries. A table index is a replica of a subset of a table's columns that is organized and/or sorted for efficient access using a subset of those attributes. The DBMS ensures the contents of the table and the index are always logically in sync.

  • This is a self-balancing, ordered tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions, always in O(log n). It is optimized for disk-oriented systems that read and write large blocks of data.

  • Generalization of a binary search tree, since a node can have more than two children.

  • It is mostly used for table indexes.

  • It's the DBMS job to figure out the best indexes to use to execute each query.

  • There is a trade-off regarding the number of indexes to create per database:

    • Storage overhead
    • Maintenance overhead
  • ref papers: The Ubiquitous B-Tree; Efficient Locking for Concurrent Operations on B-Trees.

  • Properties:

    • M-way search tree.
    • Its perfectly balanced, every leaf node is at the same depth in the tree.
    • Every node other than the root is at least half-full (M/2 - 1 ≤ #keys ≤ M - 1).
    • Every inner node with k keys has k+1 non-null children
  • Structure

    • Every node is comprised of an array of key/value pairs.
    • The keys are derived from the attributes that the index is based on.
    • The values will differ based on whether the node is classified as an inner node or a leaf node.
    • The arrays are kept in sorted key order, usually, and store all NULL keys at either first or last leaf nodes.
    • Leaf node values
      • Record IDs
        • A pointer to the location of the tuple to which the index entry corresponds
      • Tuple data
        • Index-organized storage.
        • The leaf nodes store the actual contents of the tuple.
        • Secondary indexes must store the record ID as their values.
  • The original B-tree stored keys and values in all nodes in the tree; this is more space-efficient, since each key appears only once, but it makes search traversal and range scans less efficient.

  • B+ tree only stores values in leaf nodes, inner nodes only guide search.

  • In practice the two terms are used interchangeably; most structures called "B-trees" in real systems are implemented as B+ trees.

  • The DBMS can use a B+ tree index if the query provides any of the attributes of the search key.

  • B+ Tree - insert

    • Find correct leaf node L.
    • Insert data entry into L in sorted order.
    • If L has enough space, done, otherwise, split L keys into L and a new node L2.
      • Redistribute entries evenly, copy up middle key.
      • Insert index entry pointing to L2 into parent of L.
    • To split inner node, redistribute entries evenly but push up middle key.
  • B+ Tree - delete

    • Start at root, find leaf L where entry belongs.
    • Remove the entry.
    • If L is at least half-full, done, if L has only M/2-1 entries,
      • Try to re-distribute, borrowing from sibling(adjacent node with same parent as L).
      • If re-distribution fails, merge L and sibling.
    • If merge occurred, must delete entry(pointing to L or sibling) from parent of L.
  • Handle duplicate keys

    • Append Record ID.
      • Add the tuple's unique Record ID as part of the key to ensure that all keys are unique.
      • The DBMS can still use partial keys to find tuples.
    • Overflow Leaf Nodes.
      • Allow leaf nodes to spill into overflow nodes that contain the duplicate keys.
      • This is more complex to maintain and modify.
  • Clustered Indexes

    • The table is stored in the sort order specified by the primary key; this can be either heap- or index-organized storage.
    • Some DBMSs always use a clustered index.
      • If a table does not contain a primary key, the DBMS will automatically make a hidden primary key.
    • Other DBMS cannot use them at all.
    • Clustered B+ tree
      • Traverse to the left-most leaf page and then retrieve tuples from all leaf pages.
      • This will always be better than sorting data for each query.
  • Index scan page sorting

    • Retrieving tuples in the order they appear in a non-clustered index is inefficient due to redundant reads.
    • The DBMS can first figure out all the tuples that query needs and then sort them based on their page ID, retrieving each page once.
  • Design Choices

    • Node size
      • The slower the storage device the larger the optimal node size of a B+ tree.
      • Optimal sizes can vary depending on the workload, leaf node scans vs root-to-leaf traversals.
    • Merge Threshold
      • Some DBMS don't always merge nodes when they are half full, average occupancy rate for nodes is 69%.
      • Delaying a merge operation may reduce the amount of reorganization.
      • It may also be better to just let smaller nodes exist and then periodically rebuild entire tree.
      • This is why PostgreSQL calls its B+ tree a "non-balanced" B+ tree (nbtree).
    • Variable-length keys.
      • Pointer
        • Store the keys as pointers to the tuple's attribute (this is the approach used by T-trees, an in-memory design).
      • Variable length nodes
        • The size of each node in the index can vary, requires careful memory management.
      • Padding.
        • Always pad the key to the max length of the key type.
      • Key Map /Indirection.
        • Embed an array of pointers that map to the key+value list within the node.
    • Intra-node search
      • Linear
        • Scan node keys from beginning to end, use SIMD to vectorize comparisons.
      • Binary
        • Jump to middle key, pivot left/right depending on comparison.
      • Interpolation
        • Approximate the location of the desired key based on the known distribution of keys.
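
  A minimal sketch of two of the intra-node search strategies above, run over a node's sorted key array; interpolation search assumes numeric keys with a roughly uniform distribution.

  ```python
  from bisect import bisect_left

  def binary_search(keys, target):
      i = bisect_left(keys, target)              # jump to middle, pivot left/right
      return i if i < len(keys) and keys[i] == target else -1

  def interpolation_search(keys, target):
      lo, hi = 0, len(keys) - 1
      while lo <= hi and keys[lo] <= target <= keys[hi]:
          if keys[hi] == keys[lo]:
              break
          # Guess the position from the assumed uniform key distribution.
          pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
          if keys[pos] == target:
              return pos
          if keys[pos] < target:
              lo = pos + 1
          else:
              hi = pos - 1
      return lo if lo < len(keys) and keys[lo] == target else -1

  keys = [4, 8, 15, 16, 23, 42]
  print(binary_search(keys, 16), interpolation_search(keys, 16))   # 3 3
  ```
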
  • Optimizations

    • Prefix compression
      • Sorted keys in the same leaf node are likely to have the same prefix.
      • Instead of storing the entire key each time, extract common prefix and store only unique suffix for each key, many variations.
    • Deduplication
      • Non-unique indexes can end up storing multiple copies of the same key in leaf nodes.
      • The leaf node can store the key once and then maintain a posting list of tuples with that key(hashtable like).
    • Suffix Truncation
      • The keys in the inner nodes are only used to direct traffic, may not need to store the entire key.
      • Store a minimum prefix that is needed to correctly route probes into the index.
    • Pointer Swizzling
      • Nodes use page ids to reference other nodes in the index.
      • The DBMS must get the memory location from the page table during traversal.
      • If a page is pinned in the buffer pool, then we can store raw pointers instead of page ids.
      • This avoids address lookups from the page table.
    • Bulk Insert
      • The fastest way to build a new B+ tree for an existing table is to first sort the keys and then build the index from the bottom up.
    • Buffer Updates
  • Modifying a B+ tree is expensive when the DBMS has to split/merge nodes.

    • The worst case is when DBMS reorganizes the entire tree.
    • The worker that causes a split/merge is responsible for doing the work.
  • Ideally, delay updates and then apply multiple changes together in a batch; this is the idea behind write-optimized B+ trees.

  • Write-optimized B+ tree

    • Also known as B epsilon - trees.
    • Instead of immediately applying updates, store changes to key/value entries in log buffers at inner nodes.
    • Updates cascade down to lower nodes incrementally when buffers get full.

Concurrency Control

  • The assumption so far has been that all the data structures discussed are single-threaded, but a DBMS needs to allow multiple threads to safely access data structures to take advantage of multiple CPU cores and hide disk I/O stalls.

  • Concurrency control protocol is the method that DBMS uses to ensure correct results for concurrent operations on a shared object

  • Protocol correctness criteria can vary:

    • Logical correctness: Can a thread see data it is supposed to see.
    • Physical correctness: Is the internal representation of the object sound?(start here)
  • Locks(transactions) vs Latches(workers)

    • Protect the database's logical contents from other transactions vs protect critical sections of the DBMS's internal data structures from other workers (threads).
    • Held for txns durations vs held for operation duration.
    • Need to be able to rollback changes vs dont need to be able to rollback changes.
  • Locks separate transactions: they protect database contents for the duration of the entire transaction, can be in shared, exclusive, update, or intention modes, and provide deadlock detection and resolution (via waits-for graphs, timeouts, or aborts); lock information is kept in the lock manager.

  • Latches separate workers (threads, processes): they protect in-memory data structures during critical sections, can be in read or write modes, offer deadlock avoidance only through coding discipline, and their information is kept in the protected data structures.

  • Latch Modes

    • Read Mode
      • Multiple threads can read the same object at the same time.
      • A thread can acquire the read latch if another thread has it in read mode.
    • Write Mode
      • Only one thread can access the object.
      • A thread cannot acquire a write latch if another thread has it in any mode.
  • Latch Implementation Goals

    • It should have a small memory footprint.
    • It should have a fast execution path when there is no contention, descheduling the thread when it has been waiting too long to avoid burning cycles.
    • Each latch should not have to implement its own queue to track waiting threads.
  • Latch Implementations

    • Test-and-Set Spinlock.
      • Very efficient (single instruction to latch/unlatch)
      • Non-scalable, not cache friendly, not OS friendly.
    • Blocking OS mutex.
      • Simple to use.
      • Non-scalable(about 25ns per lock/unlock invocation)
    • Reader-writer latches
      • Allows for concurrent readers, must manage read/write queues to avoid starvation.
      • They can be implemented on top of spinlocks.
    • Adaptive Spinlock(Apple ParkingLot)
    • Queue-based Spinlock(MCS Locks)
  • Compare-and-swap

    • Atomic instruction that compares contents of a memory location M to a given value V.
    • If values are equal, installs new given value V' in M, otherwise, operation fails.
  • Hash Table Latching

  • Easy to support concurrent access due to the limited ways threads access the data structure.

  • All threads move in the same direction and only access a single page/slot at a time, deadlocks are not possible.

  • To resize the table, take a global write latch on the entire table(in the header page).

  • Latch implementations

    • Page Latches
      • Each page has its own reader-writer latch that protects its entire contents.
      • Threads acquire either a read or write latch before they access a page.
    • Slot Latches
      • Each slot has its own latch.
      • It can use a single-mode latch to reduce meta-data and computational overhead.
  • B+ Tree concurrency control

    • Ideally, allow multiple threads to read and update a B+ tree at the same time.
    • Protect against
      • Threads trying to modify the contents of a node at the same time.
      • One thread traversing the tree while another thread splits/merges nodes.
    • Latch crabbing/coupling
      • This is a protocol to allow multiple threads to access/modify B+ tree at the same time.
        • Get latch for parent, latch for child, release latch for parent if safe.
      • A safe node is one that will not split or merge on update, not full on insertion, more than half-full on deletion.
    • Taking a write latch on the root every time becomes a bottleneck with higher concurrency.
      • Fifo release style.
      • Better latching algorithm, ref paper: Concurrency of operations on B+Trees.
        • Most modifications to B+ Tree will not require a split or merge.
        • Instead of assuming there will be a split/merge, optimistically traverse the tree using read latches, if wrong, repeat with pessimistic algorithm.
    • Leaf node scanning concurrency.
    • Latches do not support deadlock detection or avoidance, only via coding discipline
    • The leaf node sibling latch acquisition protocol must support a no-wait mode.
    • The dbms data structures must cope with failed latch acquisitions.

Making a data structure thread-safe is notoriously difficult in practice

OPERATOR EXECUTION

Sorting and Aggregation Algorithms

  • Query Plan

    • A query plan is a DAG of operators. An operator instance is an invocation of an operator on a unique segment of data. A task is a sequence of one or more operator instances. A task set is the collection of executable tasks for a logical pipeline.
    • Operators are arranged in a tree. Data flows from the leaves of the tree up towards the root. The output of the root node is the result of the query.
  • Disk-oriented DBMS.

    • A disk-oriented DBMS cannot assume that query results fit in memory, just like it assumes a table cannot fit in memory. Use the buffer pool to implement algorithms that need to spill to disk. We are also going to prefer algorithms that maximize the amount of sequential I/O.
  • Queries may request that tuples are sorted in a specific way (ORDER BY), and even when a query does not specify an order, we may still want to sort to do other things. i.e Aggregations(GROUP BY), Duplicate elimination (DISTINCT)

  • If data fits in memory, then we can use a standard sorting algorithm like Quicksort.

  • Include:

    • Top N Heap Sort

      • If a query contains an ORDER BY with a LIMIT, DBMS only needs to scan data once to find top-N elements.
      • If the top-N elements fit in memory, scan data once and maintain an in-memory sorted priority queue.
    • External Merge Sort

      • Divide-and-conquer algorithm that splits data into separate runs, sorts them individually, and then combines them into longer sorted runs.
      • Sorting: Sort chunks of data that fit in memory and then write back the sorted chunks to a file on disk.
      • Merging: Combine sorted runs into larger chunks.
      • A run is a list of key/value pairs: the key is the attribute(s) to compare to compute the sort order, and the value is either the whole tuple (early materialization) or a record ID (late materialization).
      • 2-way external merge sort.
        • Pass #0
          • Read one page of the table into memory, sort page into a "run" and write it back to disk. Repeat until the whole table has been sorted into runs.
        • Pass #1,2,3
          • Recursively merge pairs of runs into runs twice as long. This needs at least 3 buffer pages (2 for input, 1 for output).
        • Can be improved by using more buffer pages: with B pages, sort B pages at a time in pass #0 and perform (B-1)-way merges in later passes.
        • Double buffering
          • Prefetch the next run in the background and store it in a second buffer while the system is processing the current run, overlapping CPU and I/O operations.
        • Comparison optimizations
          • Code Specialization
            • Instead of providing a comparison function as a pointer to the sorting algorithm, create a hardcoded version of the sort that is specific to a key type.
          • Suffix truncation
            • First compare a binary prefix of long VARCHAR keys instead of slower string comparison, fallback to slower version if prefixes are equal.
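
  A toy sketch of the external merge sort above: "pages" are small lists and sorted runs live in memory rather than in on-disk files; PAGE_SIZE is an illustrative constant.

  ```python
  import heapq

  PAGE_SIZE = 4

  def external_merge_sort(table):
      # Pass #0: sort each page-sized chunk into its own run.
      runs = [sorted(table[i:i + PAGE_SIZE])
              for i in range(0, len(table), PAGE_SIZE)]
      # Passes #1..N: repeatedly merge pairs of runs into runs twice as long
      # (a real DBMS would do a K-way merge, with K limited by buffer pages).
      while len(runs) > 1:
          merged = []
          for i in range(0, len(runs), 2):
              pair = runs[i:i + 2]
              merged.append(list(heapq.merge(*pair)))
          runs = merged
      return runs[0] if runs else []

  print(external_merge_sort([9, 3, 7, 1, 8, 2, 6, 5, 4]))
  ```
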
    • B+tree to sort

      • If the table has a B+ tree index on the sort attribute, one can use it to accelerate sorting.
      • Retrieve tuples in desired sort order by simply traversing the leaf pages of the tree.
      • Clustered B+ tree
        • Traverse to the left-most leaf page and then retrieve tuples from all leaf pages.
        • This is always better than external sorting because there is no computational cost and all disk access is sequential.
      • Unclustered B+ Tree
        • Chase each pointer to the page that contains the data.
        • This is almost always a bad idea, in general, one I/O per data record.
    • Aggregations

      • Collapse values for a single attribute from multiple tuples into a single scalar value.
      • DBMS needs to quickly find tuples with the same distinguishing attributes for grouping.
      • Implementations:
        • Sorting
        • Hashing
          • Populate an ephemeral hash table as the DBMS scans the table, for each record, check whether there is already an entry in the hash table.
          • Partition.
            • Divide tuples into buckets based on the hash key; write them out to disk when they get full.
          • Rehashing.
            • Build in-memory hash table for each partition and compute the aggregation.
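
  A minimal sketch of in-memory hash aggregation for a query like SELECT dept, COUNT(*), SUM(salary) FROM emp GROUP BY dept; a real DBMS would partition and spill buckets to disk when the data does not fit in memory. The row layout is illustrative.

  ```python
  from collections import defaultdict

  def hash_aggregate(rows):
      groups = defaultdict(lambda: [0, 0])        # dept -> [count, sum]
      for dept, salary in rows:
          agg = groups[dept]                      # probe; insert if absent
          agg[0] += 1
          agg[1] += salary
      return {dept: tuple(agg) for dept, agg in groups.items()}

  rows = [("eng", 100), ("sales", 80), ("eng", 120)]
  print(hash_aggregate(rows))   # {'eng': (2, 220), 'sales': (1, 80)}
  ```
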

Join Algorithms

  • We normalize tables in a relational database to avoid unnecessary repetition of information.

  • We then use the join operator to reconstruct the original tuples without loss of information.

  • We will focus on performing binary joins using inner equijoin algorithms

    • They can be tweaked to support other joins.
    • multi-way joins exist but only in research
  • In theory we want the smaller table to always be the left table in the query plan.

    • Optimizer will figure this out when generating the physical plan.
  • The operators are arranged in a tree in a query plan, data flowing from the leaves of the tree up towards the root, output of the root node is the result of the query.

  • Join Operators

    • Output
      • What data does the join operator emit to its parent operator in the query plan tree.
      • Output contents can vary:
        • processing model
        • storage model
        • data requirements in the query.
      • Output: Data
        • Early materialization
          • Copy the values for the attributes in outer and inner tuples into a new output tuple.
          • Subsequent operators in the query plan never need to go back to the base tables to get more data.
      • Output: Record IDs
        • Late materialization
          • Only copy the joins keys along with the Record IDs of the matching tuples.
          • Ideal for column stores because the DBMS does not copy data that is not needed for the query.
  • Cost Analysis Criteria

    • How do we determine whether one join algorithm is better than the other.
    • Number of I/O's to compute join.
    • There are many algorithms for reducing join cost, but none works well in all scenarios.
  • Join Algorithms

    • Nested Loop Join

      • Naive/Simple/Stupid
        • for each tuple r in R: for each tuple s in S: if r and s match then emit.
        • cost: M + (m × N), where M = # of pages in R, m = # of tuples in R, N = # of pages in S.
      • Block
        • for each block Br in R: foreach block Bs in S: foreach tuple r in Br: foreach tuple s in Bs: if r and s match then emit.
        • cost: M + (M * N)
        • The smaller table should be the outer table.
        • We determine size based on the number of pages, not the number of tuples.
      • Index
        • We can avoid sequential scans by using an index to find inner table matches.
        • for each tuple r in R: for each tuple s in Index(ri = sj): if r and s match then emit.
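
  Illustrative sketches of the three nested-loop variants for R joined with S on equal keys; "blocks" are just lists of tuples and the index is a plain dict, not any DBMS's actual structures.

  ```python
  def simple_nlj(R, S):
      return [(r, s) for r in R for s in S if r["key"] == s["key"]]

  def block_nlj(R_blocks, S_blocks):
      out = []
      for Br in R_blocks:                 # one pass over S per *block* of R,
          for Bs in S_blocks:             # not per tuple of R
              out += [(r, s) for r in Br for s in Bs if r["key"] == s["key"]]
      return out

  def index_nlj(R, S_index):              # S_index: key -> list of S tuples
      return [(r, s) for r in R for s in S_index.get(r["key"], [])]

  R = [{"key": 1, "a": "x"}, {"key": 2, "a": "y"}]
  S = [{"key": 2, "b": "p"}, {"key": 3, "b": "q"}]
  S_index = {2: [S[0]], 3: [S[1]]}
  assert simple_nlj(R, S) == index_nlj(R, S_index)
  ```
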
    • Sort-Merge Join

      • Sort
        • Sort both tables on the join key.
        • You can use any appropriate sort algorithm.
        • These phases are distinct from the sort/merge phases of an external merge sort.
      • Merge
        • Step through the two sorted tables with cursors and emit matching tuples.
        • It may need to backtrack depending on the join type.
      • The worst case for the merging phase is when the join attribute of all tuples in both tables has the same value.
      • The best case is when one or both tables are already sorted on join key or output must be sorted on join key.
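
  A sketch of a sort-merge join on lists of (key, payload) pairs; groups of equal keys are gathered on both sides so duplicates are handled without explicit cursor backtracking.

  ```python
  def sort_merge_join(R, S):
      R, S = sorted(R), sorted(S)               # sort phase (any sort works)
      out, i, j = [], 0, 0
      while i < len(R) and j < len(S):          # merge phase
          if R[i][0] < S[j][0]:
              i += 1
          elif R[i][0] > S[j][0]:
              j += 1
          else:
              key = R[i][0]
              i2 = i
              while i2 < len(R) and R[i2][0] == key:
                  i2 += 1
              j2 = j
              while j2 < len(S) and S[j2][0] == key:
                  j2 += 1
              out += [(r, s) for r in R[i:i2] for s in S[j:j2]]
              i, j = i2, j2
      return out

  R = [(1, "a"), (2, "b"), (2, "c")]
  S = [(2, "x"), (3, "y")]
  print(sort_merge_join(R, S))   # [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]
  ```
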
    • Hash Join

      • Build

        • Scan the outer relation and populate a hash table using a hash function h1 on the join attributes (e.g. with linear probing); see the sketch after this list.
      • Probe

        • Scan the inner relation and use h1 on each tuple to jump to a location in the hash table and find a matching tuple.
      • Hash table contents

        • Key: attribute that the query is joining on, hash table needs to store the key to verify we have a correct match in case of hash collisions.
        • Value: it varies per DBMS, depends on what the next query operators will do with the output from the join.
      • Cost Analysis

      • Optimization

        • Probe filter
          • Create a probe filter (e.g. a Bloom filter) during the build phase when the key is likely to not exist in the inner relation.
          • Check filter before probing hash table.
          • It is fast because the filter fits in CPU cache.
          • It is sometimes called sideways information passing.
          • Probabilistic data structure that answers set membership queries
            • false negatives will never occur
            • false positives can sometimes occur
      • Partitioned Hash Join

        • Hash join used when the tables do not fit in memory (GRACE hash join).
        • Partition phase: hash both tables on the join attribute into partitions.
        • Probe phase: compares tuples in corresponding partitions for each table.
      • Hybrid Hash Join

        • If keys are skewed, then the DBMS keeps the hot partition in memory and immediately performs the comparison instead of spilling it to disk.
        • Difficult to get right.
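
  A sketch of the in-memory build/probe phases described above: build on the (smaller) outer relation R, probe with S. The `probe_filter` set is just a stand-in for a Bloom filter, which would be approximate but far more compact; column names are illustrative.

  ```python
  from collections import defaultdict

  def hash_join(R, S, r_key, s_key):
      build = defaultdict(list)
      probe_filter = set()
      for r in R:                                  # build phase
          build[r[r_key]].append(r)
          probe_filter.add(r[r_key])
      out = []
      for s in S:                                  # probe phase
          k = s[s_key]
          if k not in probe_filter:                # cheap check before probing
              continue
          out += [(r, s) for r in build[k]]
      return out

  R = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
  S = [{"r_id": 2, "val": 10}, {"r_id": 9, "val": 20}]
  print(hash_join(R, S, "id", "r_id"))   # one matching pair (id = r_id = 2)
  ```
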


Query Execution and Processing Models

  • DBMS engineering is an orchestration of a bunch of optimizations that seek to make full use of hardware. No one technique is more important than others.

  • Top 3 optimizations

    • Data parallelization.(Vectorization)
    • Task parallelization.(Multi-threading)
    • Code specialization.(Pre-compiled/JIT)
  • Optimization goals

    • Reduce instruction count, use fewer instructions to do the same amount of work.
    • Reduce cycles per instruction, execute more CPU instructions in fewer cycles.
    • Parallelize execution, use multiple threads to compute each query in parallel.
  • Processing Models

    • It defines how the system executes a query plan and moves data from one operator to the next. There are different trade-offs for different workloads, OLTP vs OLAP.
    • Each processing model is comprised of two types of execution paths;
      • Control Flow: How the DBMS invokes an operator.
      • Data Flow: How an operator sends its results.
    • The output of an operator can be either whole tuples (NSM) or subsets of columns (DSM).

Approaches:

  • Iterator Model

    • Its also known as Volcano or Pipeline model.
    • Each query plan operator implements a Next() function, where on each invocation the operator returns either a single tuple or an EOF marker if there are no more tuples. The operator implements a loop that calls Next() on its children to retrieve their tuples and then processes them. Some operators must block until their children emit all of their tuples (joins, subqueries, ORDER BY).
    • Each operator implementation also has Open() and Close() functions, analogous to constructors/destructors, but for operators.
    • It is used in almost every DBMS as it's easy to implement and debug while output control works easily with this approach. It also allows for pipelining where the DBMS tries to process each tuple via as many operators as possible before retrieving the next tuple.
    • A pipeline breaker is an operator that cannot finish until all its children emit all their tuples, i.e Joins, Subqueries, Order By.
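
  A minimal sketch of the iterator (Volcano) model using Python generators, where each generator plays the role of an operator's Next() and produces one tuple per call; operator and column names are illustrative.

  ```python
  def seq_scan(table):
      for tuple_ in table:
          yield tuple_

  def filter_op(child, predicate):
      for tuple_ in child:
          if predicate(tuple_):
              yield tuple_

  def projection(child, attrs):
      for tuple_ in child:
          yield {a: tuple_[a] for a in attrs}

  emp = [{"name": "Ada", "dept": "eng"}, {"name": "Bob", "dept": "sales"}]
  # SELECT name FROM emp WHERE dept = 'eng'
  plan = projection(filter_op(seq_scan(emp), lambda t: t["dept"] == "eng"),
                    ["name"])
  for row in plan:          # the root pulls tuples up through the pipeline
      print(row)            # {'name': 'Ada'}
  ```
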
  • Materialization Model

    • Each operator processes its input all at once and then emits its output all at once as a single result. The DBMS can push down hints to avoid scanning too many tuples, can also send either a materialized row or a single column.
    • Better for OLTP workloads because queries only access a small number of tuples at a time as they have lower execution/coordination overhead and result in fewer function calls, not good for OLAP queries with large intermediate results.
    • The output can be either whole tuples (NSM) or subsets of columns(DSM).
  • Vectorized / Batch Model

    • It is like the iterator model where each operator implements a Next() function, but now each operator emits a batch of tuples instead of a single tuple.
    • The operator's internal loop processes multiple tuples at a time, with the size of the batches varying based on hardware or query properties, and each batch containing one or more columns, each with its own null bitmap.
    • It is ideal for OLAP / data warehouses because it greatly reduces the number of invocations per operator while also allowing for out-of-order CPU to more easily use SIMD to execute operators over batches of tuples, i.e operators perform work in tight for-loops over arrays which compilers know how to optimize/vectorize, no data or control dependencies and hot instruction cache.
    • It however may contain some tuples that do not satisfy some filters which can be overcome by;
      • Selection vectors; dense sorted list of tuple identifiers that indicate which tuples in a batch are valid. pre-allocate selection vector as the max-size of the input vector. ref paper: Filter representation in vectorized query execution
      • Bitmaps; positionally-aligned bitmap that indicates whether a tuple is valid at an offset. Some SIMD instructions natively use these bitmaps as input masks.
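
  A small sketch of a vectorized filter driven by a selection vector: the batch is a dict of column arrays, and the selection vector lists the offsets of tuples that are still valid. Names and layout are illustrative.

  ```python
  def vectorized_filter(batch, column, predicate, selection):
      col = batch[column]
      # Only examine offsets that earlier operators already marked as valid.
      return [i for i in selection if predicate(col[i])]

  batch = {"a": [5, 7, 2, 3], "b": [1, 1, 0, 1]}
  sel = list(range(4))                                         # all tuples valid
  sel = vectorized_filter(batch, "b", lambda v: v == 1, sel)   # WHERE b = 1 -> [0, 1, 3]
  sel = vectorized_filter(batch, "a", lambda v: v > 4, sel)    # AND a > 4   -> [0, 1]
  print(sel)
  ```
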
  • Plan Processing Direction

    • ref paper:push vs pull-based loop fusion in query engines
    • Top-to-Bottom (Pull)
      • Start with the root and pull data up from its children. Tuples are always passed between operators using function calls.
      • Easy to control output via LIMIT, parent operator blocks until child returns with a tuple.
      • Additional overhead because operators' Next() functions are implemented as virtual functions and branching costs on each Next() invocation.
    • Bottom-to-Top (Push)
      • Start with leaf nodes and push data to their parents. It can fuse operators together within a for-loop to minimize intermediate result staging. It allows for tighter control of caches/registers in pipelines.
      • May not have exact control of intermediate result sizes, difficult to implement some operators(Sort-Merge Join).

Access Methods

  • This is the way DBMS access the data stored in a table.

Approaches

  • Sequential Scan

    • For each page in the table, retrieve it from the buffer pool, iterate over each tuple and check whether to include it.
    • The DBMS maintains an internal cursor that tracks the last page/slot it examined.
    • Optimizations
      • Prefetching
      • Buffer Pool Bypass
      • Parallelization
      • Heap Clustering
      • Late Materialization
      • Data Skipping
        • Approximate Queries (Lossy)
            • execute queries on a sampled subset of the entire table to produce approximate results
          • Zone Maps (Lossless)
            • pre-computed columnar aggregates for the attribute values in a page. The DBMS checks the zone map first to decide whether it wants to access the page.
            • ref paper: Small Materialized Aggregates: a lightweight index structure for data warehousing
            • In large OLAP systems they can eliminate the need for indexes.
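
  A small sketch of zone maps: per-page min/max aggregates let a scan skip pages that cannot possibly contain matches for a predicate like "val > 400". The data is made up for illustration.

  ```python
  pages = [[100, 230, 350], [180, 420, 290], [500, 610, 455]]
  zone_maps = [{"min": min(p), "max": max(p)} for p in pages]   # pre-computed

  def scan_greater_than(pages, zone_maps, threshold):
      out = []
      for page, zm in zip(pages, zone_maps):
          if zm["max"] <= threshold:        # zone map says: nothing can match
              continue                      # -> skip the page entirely
          out += [v for v in page if v > threshold]
      return out

  print(scan_greater_than(pages, zone_maps, 400))   # [420, 500, 610, 455]
  ```
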
  • Index Scan

    • DBMS picks an index to find the tuples that the query needs.
    • Depends on:
      • What attributes the index contains
      • What attributes the query references
      • Attribute's value domains
      • Predicate composition
      • Whether the index has unique or non-unique keys
  • Multi-Index Scan

    • multiple indexes that the DBMS can use for a query.
    • Compute sets of Record IDs using each matching index, combine these sets based on the query predicates(union vs intersect), retrieve the records and apply any remaining predicates.
  • Modification Queries

    • Operators that modify the database(INSERT, UPDATE, DELETE) are responsible for modifying the target table and its indexes.
      • constraint checks can either happen immediately inside of operator or deferred until later in query/transaction.
    • The output of these operators can either be Record Ids or tuple data.
    • Halloween problem
      • anomaly where an update operation changes the physical location of a tuple, which causes a scan operator to visit the tuple multiple times.
      • Materialization would prevent this from occurring.
      • The solution is to track modified record IDs per query.
  • Expression Evaluation

    • The DBMS represents a WHERE clause as an expression tree. The nodes in the tree represent different expression types: comparisons, conjunctions, arithmetic operators, constant values, and tuple attribute references.
    • Evaluating predicates in this manner is slow as the DBMS traverses the tree and for each node that it visits, it must figure out what the operator needs to do.
    • A better approach is to evaluate the expression directly, an even better one being to vectorize it. JIT compilation can potentially speed times up even more hence a better approach.
    • Velox converts expression trees into a flattened intermediate representation that it then executes during query processing; think of it as an array of function pointers to precompiled (untemplated) primitives.

Scheduler

  • The previous perspective has been that of a data flow to query processing model.
  • The control flow was implicit in the processing model, we can make the control flow more explicit with a scheduler.
  • CMU Quickstep project.

Parallel Query Execution

  • Multiple workers in the same database give increased performance for potentially the same hardware resources, i.e. higher throughput and lower latency.

  • The DBMS executes multiple tasks simultaneously to improve hardware utilization; tasks do not need to belong to the same query.

  • Increased responsiveness of the system.

  • Potentially lower TCO (total cost of ownership): fewer machines mean fewer parts, a smaller physical footprint, and lower energy consumption.

  • Parallel vs Distributed

    • Database is spread across multiple resources to deal with large data sets, higher performance and redundancy/fault tolerance.
    • Both appear as single logical database instance to a client application or DBMS frontend.
    • SQL query for a single-resource DBMS should have the same results on parallel or distributed DBMS.
    • Resources close to (parallel) vs far from (distributed) each other.
    • Resources communicate fast (parallel) vs slowly (distributed).
    • Communication is cheap and reliable (parallel) vs costly and unreliable (distributed).
  • Process Model

    • A process model defines how a system is architected to support concurrent requests/queries from a multi-user application.
    • A worker is the DBMS component that is responsible for executing tasks on behalf of the client and returning the results.

Approaches

  • Process per DBMS worker

    • Each worker is a separate OS process. Relies on OS scheduler. Uses shared memory for global data structures. A process crash does not take down the entire system.
  • Thread per DBMS worker

    • Single process with multiple worker threads. The DBMS manages its own scheduling. It may or may not use a dispatcher thread. A thread crash may kill the entire system.
  • Query Scheduling

    • How many tasks should it use?
    • How many CPU cores should it use?
    • What CPU core should the tasks execute on?
    • Where should a task store its output?
  • DBMS always knows more than the OS.

Scheduling Goals

  • Throughput: maximise the number of completed queries.

  • Fairness: ensure that no query is starved for resources.

  • Query Responsiveness: minimize tail latencies(especially for short queries)

  • Low Overhead: workers should spend most of their time executing not figuring out what task to run next.

  • Worker Allocation

    • One worker per CPU core.
      • Each core is assigned one thread that is pinned to that core in the OS via syscalls. see sched_setaffinity.
    • Multiple workers per core
      • Use a pool of workers per core or per socket. Allows CPU cores to be fully utilized in case one worker at a core blocks.
  • Task Assignment

    • Push
      • A centralized dispatcher assigns tasks to workers and monitors their progress. When a worker notifies the dispatcher that it is finished, it is given a new task.
    • Pull
      • Workers pull the next task from a queue, process it and then return to get the next task.
  • Regardless of what worker allocation or task assignment policy the DBMS uses, it's important that workers operate on local data. The DBMS scheduler must be aware of the location of data and each node's memory layout, i.e Uniform vs Non-Uniform Memory Access, Attached Cached vs Nearby Cache vs Remote Storage.

  • Data Placement

    • DBMS can partition memory for a database and assign each partition to a CPU. By controlling and tracking the location of partitions, it can schedule operators to execute on workers at the closest CPU core.
    • Linux move_pages and numactl
  • Memory Allocation

    • What happens when the DBMS calls malloc?

      • Assume that the allocator does not already have a chunk of memory that it can give out.
      • Almost nothing!
        • The allocator will extend the process data segment
        • But this new virtual memory is not immediately backed by physical memory
        • OS allocates physical memory when there is a page fault on access.
      • After a page fault, where does the OS allocate physical memory in a NUMA system?
    • Memory Allocation Location

      • Interleaving
        • Distribute allocated memory uniformly across CPUs
      • First Touch
        • Allocate at the CPU of the thread that accessed the memory location that caused the page fault.
      • OS can try to move memory to another NUMA region from observed access patterns.
  • Partitioning vs Placement

    • A partitioning scheme is used to split the database based on some policy

      • Round-robin
      • Attribute Ranges
      • Hashing
      • Partial/Full Replication
    • A placement scheme then tells the DBMS where to put those partitions

      • Round-robin
      • Interleave across cores
  • How do we decide how to create a set of tasks from a logical query plan?

  • Static scheduling

    • The DBMS decides how many threads to use to execute the query when it generates the plan, and this does not change while the query executes. The easiest approach is to use the same # of tasks as the # of cores. One can still assign tasks to threads based on data location to maximize local data processing.
    • Slower tasks will hurt query latency because dependent tasks must wait until that pipeline completes.
  • Morsel-driven scheduling

    • ref paper: morsel-driven parallelism: a numa-aware query evaluation framework for the many-core age

    • Dynamic scheduling of tasks that operate over horizontal partitions called morsels distributed across cores

      • One worker per core.
      • One morsel per task.
      • Pull-based task assignment.
      • Round-robin data placement.
    • Supports parallel, NUMA-aware operator implementations.

    • The workers perform cooperative scheduling for each query plan using a single global task queue, each worker tries to select tasks that will execute on morsels that are local to it. If there are no local tasks, then the worker just pulls the next task from the global work queue.

    • Because there is only one worker per core and one morsel per task, HyPer must use work stealing, because otherwise threads could sit idle waiting for stragglers. The DBMS uses a lock-free hash table to maintain the global work queues.

    • Tasks can have different execution costs per tuple, i.e Simple selection vs string matching.

    • HyPer has no notion of execution priorities, short-running queries get blocked behind long-running queries and all query tasks are executed with same priority.

    • Umbra: Morsel scheduling 2.0

      • ref paper:self-tuning query scheduling for analytical workloads
      • Rather than scheduling tasks based on chunks of data (morsels), the DBMS schedules tasks based on time. Tasks are not created statically at query start; the system exponentially grows morsel sizes per task set until an individual task takes about 1 ms to execute.
      • Automatic priority decay for query tasks ensures that short-running queries finish quickly and long-running queries are not starved for resources. This is a modern implementation of stride scheduling.
      • Scheduling state
        • Each worker maintains its own thread-local meta-data about the available tasks to execute.
          • Active slots - which entries in the global slot array have active task sets available.
          • Change mask - indicates when a new task set is added to the global slot array.
          • Return mask - indicates when a worker completes a task set.
        • Workers perform CaS updates to TLS meta-data to broadcast changes.
        • When a worker completes the last morsel for a query's active task set, it inserts the next task set into the global slot array and updates the Return Mask for all workers. When a new query arrives, the scheduler updates the workers' Change Mask to inform them of the new query tasks in a slot.
      • Priority decay
        • Each worker maintains additional thread-local meta-data to compute the priorities of queries in real time:
          • Global pass - How many quantum rounds the DBMS has completed.
          • Pass Values - How much time a query has consumed.
          • Priorities - Decremented as the query runs longer.
    • SAP HANA: NUMA-AWARE SCHEDULER

      • ref paper: scaling up concurrent main-memory column store scans:towards adaptive numa-aware data and task placement
      • Pull-based scheduling with multiple worker threads that are organized into groups, i.e Each CPU can have multiple groups, each group has a soft and hard priority queue.
      • Uses a separate "watchdog" thread to check whether groups are saturated and can reassign tasks dynamically.
      • The DBMS maintains soft and hard priority task queues for each thread group. Threads can steal tasks from other groups' soft queues.
      • Four different pools of threads per group:
        • Working: actively executing a task.
        • Inactive: blocked inside of the kernel due to a latch
        • Free: Sleeps for a little, wake up to see whether there is a new task to execute.
        • Parked: Waiting for a task(free thread) but blocked in the kernel until the watchdog thread wakes it up.
      • Dynamically adjust thread pinning based on whether a task is CPU or memory bound.
  • SQLOS

    • User-level OS layer that runs inside the DBMS and manages provisioned hardware resources. It determines which tasks are scheduled onto which threads. It also manages I/O scheduling and higher-level concepts like logical database locks.
      • ref: Microsoft SQL Server 2012 Internals.
  • Dynamic scaling vs Work stealing

    • There are two design decisions on how to handle queries that take longer to complete than expected: dynamic scaling provisions additional workers before a query starts, while work stealing takes tasks from a peer worker at runtime.
  • Non-preemptive thread scheduling through instrumented DBMS code.

  • If requests arrive at the DBMS faster than it can execute them, then the system becomes overloaded.

    • The OS cannot help us here because it does not know what the threads are doing:
      • CPU bound: do nothing.
      • Memory bound: OOM.
    • Easiest DBMS "solution": crash.
    • Better approaches (a minimal sketch of both follows this list):
      • Admission control
        • Abort new requests when the system believes that it will not have enough resources to execute them.
      • Throttling
        • Delay responses to clients to increase the amount of time between requests.
        • Assumes a synchronous submission scheme.
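A toy sketch of both approaches (hypothetical class, thresholds, and error handling; not any particular DBMS's interface):

```python
import time


class AdmissionController:
    """Toy admission control + throttling. Rejects new requests once the
    number of in-flight requests hits a hard limit, and delays responses
    once load crosses a soft threshold (assumes synchronous clients)."""

    def __init__(self, max_inflight=100, throttle_at=80, delay_s=0.05):
        self.max_inflight = max_inflight
        self.throttle_at = throttle_at
        self.delay_s = delay_s
        self.inflight = 0

    def admit(self):
        # Admission control: abort new requests when the system believes it
        # will not have enough resources to execute them.
        if self.inflight >= self.max_inflight:
            raise RuntimeError("overloaded: request rejected")
        self.inflight += 1

    def finish(self):
        self.inflight -= 1
        # Throttling: delay the response to increase the amount of time
        # between requests from a synchronous client.
        if self.inflight >= self.throttle_at:
            time.sleep(self.delay_s)


if __name__ == "__main__":
    ac = AdmissionController(max_inflight=2)
    ac.admit()
    ac.admit()
    try:
        ac.admit()
    except RuntimeError as err:
        print(err)
```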
  • Embedded worker

    • DBMS runs inside the same address space as the application.
    • The application is responsible for threads and scheduling.
    • The application may support outside connections; examples: BerkeleyDB, SQLite, RocksDB, LevelDB.
  • Process models

    • Advantages of the multi-threaded model:
      • Less overhead per context switch.
      • Do not have to manage shared memory.
      • Note: a thread per worker does not mean that the DBMS supports intra-query parallelism.

Parallel Join Algorithms

  • Perform a join between two relations on multiple threads simultaneously to speed up operation. The two main approaches are Hash join and Sort-Merge Join.

  • ref paper:An experimental comparison of thirteen relational equi-joins in main memory.

    • Design goals (these matter whether the DBMS uses hardware-conscious or hardware-oblivious join algorithms):

    • Minimize synchronization, avoid taking latches during execution.
    • Minimize memory access cost, ensure that data is always local to worker thread, reuse data while it exists in CPU cache.
  • Improving cache behaviour

    • Factors that affect cache misses in a DBMS, cache + TLB capacity, locality (temporal and spatial)
    • Non-random access (Scan), clustering data to a cache line, execute more operations per cache line.
    • Random Access (Lookups), partition data to fit in cache + TLB.
  • Hash join is one of the most important operators in a DBMS for OLAP workloads, but still not the dominant cost.

  • It is important that we speed up our DBMS's join algorithm by taking advantage of multiple cores, we want to keep all cores busy without becoming memory bound due to cache misses.

  • Hash Join

    • Partition: Divide the tuples of R and S into disjoint subsets using a hash function on the join key.
    • Build: Scan relation R and create a hash table on join key.
    • Probe: For each tuple in S, look up its join key in hash table for R. If a match is found, output combined tuple.
  • Partition Phase

    • Split the input relations into partitioned buffers by hashing the tuples' join key(s). Divide the inner/outer relations and redistribute among the CPU cores. Ideally the cost of partitioning is less than the cost of cache misses during the build phase.
    • Explicitly partitioning the input relations before a join operator is sometimes called Grace Hash Join.
  • Approaches

    • Non-blocking partitioning: Only scan the input relation once. Produce output incrementally and let other threads build hash table at the same time.

      • Shared partitions: Single global set of partitions that all threads update. Must use a latch to synchronize threads.
      • Private partitions: Each thread has its own set of partitions, must consolidate them after all threads finish.
    • Blocking Partitioning (Radix): Scan the input relation multiple times to generate the partitions, only materialize results all at once. Sometimes called Radix Hash Join.

      • Step 1: Scan R and compute a histogram of the # of tuples per hash key for the radix at some offset.
      • Step 2: Use this histogram to determine per-thread output offsets by computing the prefix sum (see the sketch after the optimizations list below).
      • Step 3: Scan R again and partition its tuples according to the hash key.
      • The radix of a key is the value of an integer at a position (using its base); it is efficient to compute with bit shifting + multiplication. Compute the radix for each key and populate a histogram of counts per radix.
      • The prefix sum of a sequence of numbers is a second sequence that is the running total of the input sequence.
    • Optimizations

      • Software Write Combine Buffers: Each worker maintains a local output buffer to stage writes; when the buffer fills up, write the changes to the global partition. Similar to private partitions but without a separate write phase at the end.
      • Non-temporal streaming writes: Workers write data to global partition memory using streaming instructions to bypass CPU caches.
      • ref paper: On the surprising difficulty of simple things: the case of radix partitioning.
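A minimal single-threaded sketch of the histogram + prefix-sum partitioning pass described in the steps above, assuming integer join keys and a radix taken from the low bits (the function name is made up):

```python
def radix_partition(keys, bits=2):
    """Partition integer join keys by their low `bits` bits:
    one scan to build a histogram, a prefix sum to compute output
    offsets, and a second scan to scatter the keys."""
    fanout = 1 << bits
    mask = fanout - 1

    # Step 1: histogram of the number of tuples per radix value.
    histogram = [0] * fanout
    for k in keys:
        histogram[k & mask] += 1

    # Step 2: exclusive prefix sum gives each partition's starting offset.
    offsets, running = [0] * fanout, 0
    for r in range(fanout):
        offsets[r] = running
        running += histogram[r]

    # Step 3: scatter the keys into one contiguous output array.
    out, cursor = [None] * len(keys), offsets[:]
    for k in keys:
        r = k & mask
        out[cursor[r]] = k
        cursor[r] += 1
    return out, offsets


if __name__ == "__main__":
    print(radix_partition([7, 2, 9, 4, 12, 3, 6, 1]))
    # ([4, 12, 9, 1, 2, 6, 7, 3], [0, 2, 4, 6])
```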
  • Build Phase

    • The threads then scan either the tuples of R or its partitions. For each tuple, hash the join key attribute(s) and add the tuple to the appropriate bucket in the hash table; buckets should be a few cache lines in size.
  • Hash tables design decisions

    • Hash function: How to map a large key space into a smaller domain and the trade-off between being fast vs collision rate.
      • We do not want to use a cryptographic hash function for our join algorithm; we need one that is fast and has a low collision rate. See the SMHasher site for benchmarks.
    • Hashing scheme: How to handle collisions after hashing and trade-off between allocating a large hash table vs additional instructions to find/insert keys.
  • Hash Table contents

    • Tuple Data vs Pointers/Offsets to data: Whether to store the tuple directly inside of the hash table; storing tuples inside the table is not possible with open addressing if there is variable-length data.
    • Join Keys vs Hashes: Whether to store the original join keys in the hash table or the computed hashed key (can store both).
  • Probe Phase

    • For each tuple in S, hash its join key and check whether there is a match in the corresponding bucket of the hash table constructed for R. If the inputs were partitioned, then assign each thread a unique partition; otherwise synchronize the threads' access to the cursor on S.
    • Sideways information passing: build a Bloom filter on the join keys during the build phase and pass it to the probe side so that tuples with no possible match are discarded early (see the sketch below).
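A single-threaded sketch of the build and probe phases, with a crude one-hash Bloom filter built on R's join keys and passed sideways to the probe side; real systems use partitioned inputs, cache-sized buckets, and proper multi-hash Bloom filters (the names here are illustrative):

```python
def hash_join(r_tuples, s_tuples, r_key, s_key):
    """Build a hash table on R, then probe it with S. A small bitmask built
    during the build phase acts as a toy Bloom filter, so S tuples with no
    possible match skip the hash-table lookup entirely."""
    BLOOM_BITS = 1 << 10
    bloom = 0
    table = {}

    # Build phase: insert each R tuple into a bucket keyed on its join key.
    for t in r_tuples:
        k = t[r_key]
        bloom |= 1 << (hash(k) % BLOOM_BITS)
        table.setdefault(k, []).append(t)

    # Probe phase: consult the filter first, then the hash table.
    for t in s_tuples:
        k = t[s_key]
        if not (bloom >> (hash(k) % BLOOM_BITS)) & 1:
            continue                       # definitely no match in R
        for match in table.get(k, []):
            yield {**match, **t}           # output the combined tuple


if __name__ == "__main__":
    r = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    s = [{"r_id": 2, "val": 10}, {"r_id": 3, "val": 20}]
    print(list(hash_join(r, s, r_key="id", s_key="r_id")))
```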

Inter vs Intra Query Parallelism

  • Inter-Query

    • Execute multiple disparate queries simultaneously; this improves overall performance by increasing throughput and reducing average latency.
    • If the queries are read-only then this requires almost no explicit coordination between queries, whereas if multiple queries are updating the database at the same time, this is hard.
    • OLAP queries have parallelizable and non-parallelizable phases, goal is to keep all cores active.
  • Intra-Query

    • Improve the performance of a single query by executing its operators in parallel, organisation of operators can be thought of in a producer/consumer paradigm.
    • There are parallel versions of every operator, can either have multiple threads access centralized data structures or use partitioning to divide work up.
    • Decreases latency for long-running queries, especially for OLAP queries.
  • Approaches

    • Intra-operator(Horizontal)

      • Decompose operators into independent fragments that perform the same function on different subsets of data. The DBMS inserts an exchange operator into the query plan to coalesce/split results from multiple children/parents, i.e Gather, Distribute, Repartition.
    • Inter-operator(Vertical)

      • Operations are overlapped to pipeline data from one stage to the next without materialization. Workers execute multiple operators from different segments of a query plan at the same time. It still needs exchange operators to combine intermediate results from segments. It is also called pipelined parallelism. It is more common in streaming systems, Flink, Kafka, Pulsar, Spark.
    • Bushy

      • A hybrid of intra- and inter-operator parallelism, where workers execute multiple operators from different segments of a query plan at the same time. It still needs exchange operators to combine intermediate results from segments.
  • Using additional processes/threads to execute queries in parallel won't help if the disk is always the main bottleneck.

  • Execution Parallelism

  • I/O Parallelism

    • Split the DBMS across multiple storage devices to improve disk bandwidth and latency.
    • Many different options that have trade-offs
      • multiple disks per db.
      • one db per disk.
      • one relation per disk.
      • split relation across multiple disks
    • Some DBMSs support this natively; others require the admin to configure it outside of the DBMS.
  • Multi-disk parallelism

    • Data on the disk can get corrupted or an entire disk can fail, can get higher performance from a disk array.
    • Performance, Capacity, Durability.
  • Database Partitioning

    • some DBMS allow you to specify the disk location of each individual database.
    • discussed further below
  • Issues:

    • Coordination Overhead.
    • Scheduling
    • Concurrency Issues
    • Resource Contention.

Adaptive Query Processing

  • This allows the execution engine to modify a query's plan and expression trees while it is running. The goal is to use information gathered from executing some part of the query to decide how to proceed with executing the rest of the query. In the extreme case, the DBMS can give up and return the query to the optimizer but with new information.
  • ref paper: Adaptive Query Processing in the looking glass.

Vectorised Query Execution

  • The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, to a vector implementation that processes one operation on multiple pairs of operands at once. This is known as data parallelization. Suppose we can parallelize our algorithms over 32 cores, each with 4-wide SIMD registers: the potential speed-up is 32x * 4x = 128x.

  • Single Instruction, Multiple Data (SIMD): A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs have microarchitectural support for SIMD operations:

    • x86: mmx, sse, sse2, sse3, sse4, avx, avx2, avx512
    • PowerPC: Altivec
    • ARM: Neon, SVE
    • RISC-V: RVV.
  • Approaches

    • Horizontal: Perform an operation on all elements together within a single vector to produce a scalar output.
    • Vertical: Perform an operation in an element-wise manner on the elements of two vectors (a small sketch of both follows).
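A sketch of the distinction using plain Python lists as stand-ins for SIMD registers (no real SIMD instructions are involved):

```python
# Two 4-wide "vectors", standing in for SIMD registers.
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

# Horizontal: one operation across the lanes of a single vector,
# producing a scalar output (e.g. a horizontal add).
horizontal_sum = sum(x)                          # 10

# Vertical: the same operation applied element-wise across the lanes
# of two vectors, producing another vector (e.g. a SIMD ADD).
vertical_add = [a + b for a, b in zip(x, y)]     # [11, 22, 33, 44]

print(horizontal_sum, vertical_add)
```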

SIMD Instructions

  • Data movement: Move data in and out of vector registers.
  • Arithmetic Operations: Apply operation on multiple data items, i.e ADD, MUL, SUB, DIV, MAX, MIN.
  • Logical Instructions: Logical operations on multiple data items, i.e AND, OR, XOR, ANDN, ANDPS, ANDNPS.
  • Comparison Instructions: Compare multiple data items
  • Shuffle Instructions: Move data between SIMD registers
  • Conversion: Transform data between x86 and SIMD registers.
  • Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).
  • ref video: James Reinders.

SIMD Trade-offs

  • Advantages
    • Significant performance gains and resource utilization if an algorithm can be vectorized.
  • Disadvantage
    • Implementing an algorithm using SIMD is still mostly a manual process.
    • SIMD may have restrictions on data alignment.
    • Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient (no longer true with AVX-512F).
    • read more on avx-512

SIMD Implementations

  • Automatic Vectorization

    • The compiler can identify when instructions inside of a loop can be rewritten as vectorized operations. It works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.
  • Compiler Hints

    • We provide the compiler with additional information about the code to let it know that it is safe to vectorize.
    • Approaches:
      • Give the compiler explicit information about memory locations.
      • Tell the compiler to ignore vector dependencies.
      • Check the language's/compiler's support for such directives.
  • Explicit Vectorization

    • Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions; not portable across CPUs (ISAs / versions).
    • There are libraries that hide the underlying calls to SIMD intrinsics, i.e Google Highway, simd, Expressive Vector Engine, std::simd.
  • ref paper: Everything you always wanted to know about compiled and vectorized queries but were afraid to ask.

Vectorization Fundamentals

  • There are fundamental SIMD operations that the DBMS will use to build more complex functionality

    • Masking: Almost all AVX-512 operations support predication variants whereby the CPU only performs operations on the lanes specified by an input bitmask.
    • Permute: For each lane, copy values in the input vector specified by the offset in the index vector into the destination vector. Prior to AVX-512, the DBMS had to write data from the SIMD register to memory and then back into the SIMD register.
    • Selective Load/Store
    • Compress/Expand
    • Selective Gather/Scatter
  • ref paper: Make the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines.

Vectorized DBMS algorithms

  • ref paper:rethinking simd vectorization for in-memory databases
  • These are principles for efficient vectorization: use fundamental vector operations to construct more advanced functionality, favor vertical vectorization by processing different input data per lane, and maximize lane utilization by executing unique data items per lane subset (no useless computations).

Vectorized operators

  • Selection Scans

    • Branchless selection (see the sketch after this Selection Scans item).
    • key → mask → offset.
    • Bitmasks replace if-clauses.
    • Relaxed Operator Fusion
      • ref paper:relaxed operator fusion for in-memory databases: making compilation, vectorization and prefetching work together at last
      • vectorized processing model designed for query compilation execution engines.
      • Decompose pipelines into stages that operate on vectors of tuples; each stage may contain multiple operators. Stages communicate through cache-resident buffers, and stages are the granularity of vectorization + fusion.
      • The DBMS can tell the CPU to grab the next vector while it works on the current batch, i.e prefetch-enabled operators define the start of a new stage, hiding cache miss latency.
      • Any prefetching technique is suitable, i.e Group prefetching, s/w pipelining, AMAC. Group prefetching works and is simple to implement.
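A branch-free selection-scan sketch: real engines use SIMD comparisons that produce a bitmask, while here the 0/1 predicate result advances the output cursor, so there is no data-dependent branch in the loop body (names are made up).

```python
def selection_scan(keys, lower):
    """Branchless SELECT: unconditionally write every key into the output
    buffer and advance the output cursor by the comparison result (0 or 1),
    instead of using an if-clause per tuple."""
    out = [0] * len(keys)
    pos = 0
    for k in keys:
        out[pos] = k                 # unconditional write
        pos += int(k > lower)        # the predicate result acts as the mask
    return out[:pos]


if __name__ == "__main__":
    print(selection_scan([5, 1, 9, 3, 7], lower=4))   # [5, 9, 7]
```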
  • Vector Refill Algorithms

    • Buffered: Use additional SIMD registers to stage results within an operator and proceed with the next loop iteration to fill in underutilized vector lanes.
    • Partial: Use additional SIMD registers to buffer results from underutilized vectors and then return to the previous operator to process the next vector. Requires fine-grained bookkeeping to make sure other operators don't clobber deferred vectors.
    • ref paper: Make the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines.
  • Hash Tables

    • Expand linear probing hash table.
    • Converge on cache size degradation.
  • Partitioning / Histograms.

    • Use scatter and gathers to increment counts.
    • Replicate the histogram to handle collisions.
  • Caveat Emptor

    • AVX-512 is not always faster than AVX2.
    • Some CPUs downgrade their clock speed when switching to AVX-512 mode, so compilers will prefer 256-bit SIMD operations.
    • If only a small portion of the process uses AVX-512, then it is not worth the downclock penalty.
  • Vectorization is essential for OLAP queries, but implementing an algorithm using SIMD is still a manual process. We can combine all the intra-query parallelism optimizations we've talked about in a DBMS:

    • Multiple threads processing the same query.
    • Each thread can execute a compiled plan
    • The compiled plan can invoke vectorized operations.

Query Planning and Optimization

  • For a given query, find the correct execution plan that has the lowest "cost". This is the part of a DBMS that is the hardest to implement well (query optimization is proven to be NP-complete).

  • No optimizer truly produces the "optimal" plan.

    • Use estimation techniques to guess real plan cost
    • Use heuristics to limit the search space.
  • A query plan is a DAG of operators. An operator instance is an invocation of an operator on a unique segment of data. A task is a sequence of one or more operator instances (a pipeline).

  • MonetDB/X100

    • ref paper: MonetDB/X100: Hyper-Pipelining Query Execution.
    • A low-level analysis of execution bottlenecks for in-memory DBMSs on OLAP workloads; it shows how DBMSs were designed incorrectly for modern CPU architectures.
  • This is the hardest part of any database

  • Logical plans vs Physical plans

    • The optimizer generates a mapping of a logical algebra expression to the optimal equivalent physical algebra expression.
    • Physical operators define a specific execution strategy using an access path.
      • They can depend on the physical format of the data that they process(sorting, compression).
      • Not always a 1:1 mapping from logical to physical.
    • The goal is to increase the likelihood of enumerating the optimal plan in the search.
  • Architecture overview

    • | Application | -> SQL Query -> | SQL Rewriter | -> SQL Query -> | Parser | -> AST -> | Binder | -> Logical Plan -> | Tree Rewriter | -> Logical Plan -> | Optimizer | -> Physical Plan -> | Execute |
    • System Catalog -> name-to-internal-ID mappings, schema info.
    • Cost model -> Estimates.
  • Query Optimization

    • Identify candidate equivalent (logical) plan trees; the search space is large (the problem is NP-hard). For each candidate, find the corresponding execution plan tree (physical). We need to estimate the cost of each plan and choose the best overall (physical) plan.
    • Heuristics / Rules
      • Rewrite the query to remove inefficient things. These techniques may need to examine the catalog, but they do not need to examine the data, i.e predicate pushdown, replacing cartesian products, and projection pushdown.
    • Cost based search
      • Use a model to estimate the cost of executing a plan. Enumerate multiple equivalent plans for a query, estimate their costs and implement the one with the lowest cost, i.e Single relation, Multiple relations, Nested sub-queries.
      • It chooses the best plan it has seen for the query after exhausting all plans or some timeout.
  • Logical Query Optimization

    • Split conjunctive predicates
      • Decompose predicates into their simplest forms to make it easier for the optimizer to move them around.
    • Predicate pushdown
      • Move the predicate to the lowest applicable point in the plan.
    • Replace cartesian products with joins
      • Replace with inner joins using the join predicate.
    • Projection pushdown
      • Eliminate redundant attributes before pipeline breakers to reduce materialization cost.
  • Nested Sub-Queries.

    • DBMS treats nested sub-queries in the where clause as functions that take parameters and return a single value or set of values
    • Two approaches
      • Rewrite to de-correlate and/or flatten them
      • Decompose nested query and store result to temporary table.
        • For harder cases, the optimizer breaks up queries into blocks and then concentrates on one block at a time.
        • Sub-query results are written to a temporary table that is discarded after the query finishes.
  • Expression Rewriting

    • The optimizer transforms a query's expressions (e.g. WHERE/ON clauses) into the minimal set of equivalent expressions.
    • Implemented using if/then/else clauses or a pattern-matching rule engine.
      • Search for expressions that match a pattern
      • When a match is found, rewrite the expression
      • Halt if there are no more rules that match.
    • Impossible /Unnecessary predicates
    • Merging predicates
  • Cost Estimation

    • DBMS uses a cost model to predict the behavior of a query plan given a database state.

      • Internal cost that allows the DBMS to compare one plan with another.
    • It is too expensive to run every possible plan to determine this information, so the DBMS needs a way to estimate the cost without executing the plan.

      • Look into MongoDB implementation
    • Cost model components:

      • Physical costs
        • Predict CPU cycles, I/O, cache misses, RAM consumption, network messages.
        • Heavily dependent on hardware.
      • Logical Costs
        • Estimate output size per operator
        • Independent of the operator algorithm.
        • Need estimations for operator result sizes
      • Algorithmic costs
        • Complexity of the operator algorithm implementation
    • Postgres cost model

      • Uses a combo of CPU and I/O costs that are weighted by magic constant factors.
      • The default settings are obviously for a disk-resident database without a lot of memory:
        • Processing a tuple in memory is 400x faster than reading a tuple from disk.
        • Sequential I/O is 4x faster than random I/O.
      • Some DBMSs run benchmark tests to update the planner's cost constants.
    • Statistics

      • DBMS stores internal statistics about tables, attributes and indexes in its internal catalog
      • Different systems update them at different times
      • Manual invocations
        • Postgres/SQLite: ANALYZE
        • Oracle/MySQL: ANALYZE TABLE
        • SQL Server: UPDATE STATISTICS
        • DB2: RUNSTATS
      • Selection Cardinality
        • Uniform Data
          • distribution of values is the same.
        • Independent Predicates
          • predicates on attributes are independent.
        • Inclusion Principle
          • domain of join keys overlap such that each key in inner relation will also exist in the outer table.
      • Correlated attributes
      • Statistics storage:
        • Histograms
          • Maintain an occurrence count per value (or range of values) in a column (see the sketch after this statistics list).
          • Equi-width histogram.
          • Equi-depth histogram.
        • Sketches
          • Probabilistic data structure that gives an approximate count for a given value.
          • The cost model can replace histograms with sketches to improve its selectivity estimation accuracy.
          • I.e, Count-Min Sketch, HyperLogLog.
        • Sampling
          • Maintain a small subset of each table that is used to evaluate expressions and compute selectivity.
          • Update the samples when the underlying table changes significantly.
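A small sketch of an equi-width histogram and the range-selectivity estimate it supports, assuming values are uniformly distributed within each bucket (the helper names are made up):

```python
def build_equi_width_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: every bucket covers an equally wide value
    range, and we only keep an occurrence count per bucket."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        b = min(int((v - lo) / width), num_buckets - 1)
        counts[b] += 1
    return counts, width


def estimate_selectivity(counts, width, lo, lower, upper):
    """Estimate sel(lower <= x <= upper) by assuming a uniform
    distribution of values inside each bucket."""
    total = sum(counts)
    selected = 0.0
    for b, count in enumerate(counts):
        b_lo, b_hi = lo + b * width, lo + (b + 1) * width
        overlap = max(0.0, min(upper, b_hi) - max(lower, b_lo))
        selected += count * (overlap / width)
    return selected / total


if __name__ == "__main__":
    counts, width = build_equi_width_histogram(range(100), 10, 0, 100)
    print(estimate_selectivity(counts, width, 0, 20, 39))  # ~0.19 (true: 0.20)
```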
    • Query Optimization

      • After performing rule-based rewriting, the DBMS will enumerate different plans for the query and estimate their costs

      • I.e:

        • Single relation
        • Multiple relations
        • Nested sub-queries
      • Chooses the best plan it has seen for the query after exhausting all plans or some timeout.

      • Single relation

        • Pick the best access method
        • Predicate evaluation ordering
        • Simple heuristics are often good enough for this.
        • OLTP queries are easy
          • Sargable (Search Argument Able)
        • System R Optimizer
          • Break the query into blocks and generate the logical operators for each block.
          • For each logical operator, generate a set of physical operators that implement it.
          • Then, iteratively construct a left-deep join tree that minimizes the estimated amount of work to execute the plan.
      • Multiple Relation QP

        • Bottom-up optimization
          • Use static rules to perform initial optimization.
          • Then use dynamic programming to determine the best join order for tables using the divide and conquer search method.
          • Used by most DBMS.
          • System R optimizer
        • Top-down optimization
          • Start with a logical plan of what we want the query to be.
          • Perform a branch and bound search to traverse the plan tree by converting logical operators into physical operators.
            • Keep track of global best plan during search.
            • Treat physical properties of data as first class entities during planning.
          • Watch the MSSQL query optimizer talk
      • Nested Sub-queries

        • The DBMS treats nested sub-queries in the where clause as functions that take parameters and return a single value or set of values.
        • Two approaches:
          • Rewrite to de-correlate and/or flatten them.
          • Decompose nested query and store results in a temporary table.
      • Ref: Essential Query Optimization Papers.

        • An overview of Query Optimization in Relational Systems, Surajit Chaudhuri.
        • The Volcano Optimizer Generator: Extensibility and Efficient Search.
        • Access path Selection in a Relational DBMS.
        • Of Nests and Trees,Umeshwar Dayal.

Query Compilation and Code Generation

  • Optimization goals:

    • Reduce Instruction count: Use fewer instructions to do the same amount of work.
    • Reduce Cycles per Instruction: Execute more CPU instructions in fewer cycles.
    • Parallelize Execution: Use multiple threads to compute each query in parallel.
  • Ref paper:Compilation in the microsoft sql server hekaton engine

  • After minimizing disk I/O during query execution, the only way to increase throughput is to reduce the number of instructions executed. To go 10x faster, the DBMS must execute 90% fewer instructions; to go 100x faster, it must execute 99% fewer instructions. This is very hard.

  • One way to achieve such a reduction in instructions is through code specialization. This means generating code that is specific to a task in the DBMS. Most code is written to make it easy for humans to understand rather than performance.

Code specialization

  • The DBMS generates code for any CPU-intensive task that has a similar execution pattern on different inputs.

    • Access Methods
    • Stored Procedures
    • Query Operator execution
    • Predicate Evaluation (most common)
    • Logging Operations
  • For query-focused compilation, the DBMS specializes it after generating the physical plan for a query.

  • Code specialization benefits

    • Attribute types are known a priori, data access function calls can be converted to inline pointer casting.
    • Predicates are known a priori, they can be evaluated using primitive data comparisons.
    • No function calls in loops, allows the compiler to efficiently distribute data to registers and increase cache reuse.
  • Code specialization methods

    • Transpilation: Write code that converts a relational query plan into imperative language source code and then run it through a conventional compiler to generate native code.
    • JIT Compilation: Generate an IR of the query that the DBMS then compiles into native code.

Hique: Holistic Code Generation

  • ref paper:Generating code for holistic query evaluation

  • For a given query plan, create a C/C++ program that implements that query's execution, baking in all the predicates and type conversions. Use an off-the-shelf compiler to convert the code into a shared object, link it to the DBMS process, and then invoke the exec function.

  • The generated query code can invoke any other function in the DBMS. This allows it to use all the same components as interpreted queries, i.e network handlers, Buffer Pool Manager, Concurrency control, Logging / Checkpoints and Indexes. Debugging is also relatively easy because you step through the generated source code.

  • Relational operators are a useful way to reason about a query but are not the most efficient way to execute it. It takes a relatively long time to compile a C/C++ source file into executable code. Hique also does not support full pipelining.

Hyper - JIT Query Compilation

  • ref paper:efficiently compiling efficient query plans for modern h/w

  • Here queries are compiled in-memory into native code using the LLVM toolkit; instead of emitting C++ code, HyPer emits LLVM IR.

  • Aggressive operator fusion within pipelines to keep a tuple in CPU registers for as long as possible.

    • push-based vs pull-based.
    • data centric vs operator centric.
  • Query compilation time grows super-linearly relative to the query size, i.e the number of joins, predicates, and aggregations. This is not a big issue for OLTP workloads, but it is for OLAP workloads.

  • Hyper: Adaptive Execution

    • ref paper:adaptive execution of compiled queries
    • Generate LLVM IR for the query and immediately start executing the IR using an interpreter. The DBMS then compiles the query in the background; when the compiled query is ready, it seamlessly replaces the interpreted execution. For each morsel, check whether the compiled version is available.

Real-world implementations

  • Custom

    • System R: ref paper: A history and evaluation of system R
    • Actian Vectorwise: ref paper:Micro adaptivity in vectorwise
    • Amazon Redshift: ref paper:Amazon Redshift re-invented
    • Oracle
    • Ms Hekaton: ref paper:Compilation in the microsoft sql server hekaton engine
    • SQLite: converts a query plan into opcodes, then executes them in a custom VM (a bytecode engine, also known as the Virtual Database Engine, VDBE). The VM ensures queries execute the same way in any possible environment.
    • TUM Umbra: ref papers: Tidy Tuples and Flying Start: Fast Compilation and Fast Execution of Relational Queries in Umbra; On Another Level: How to Debug Compiling Query Engines.
  • JVM-BASED

    • Apache Spark: ref paper:spark sql: relational data processing in spark
    • Neo4j
    • Presto/Trino
  • LLVM

    • Single store: MemSQL programming language(MPL)
    • VitesseDB
    • PostgreSQL (since 2018)

Concurrency Control Theory

  • A DBMS's concurrency control and recovery components permeate the design of its entire architecture.
  • Motivation:
    • How do we avoid race conditions? Lost updates.
    • What is the correct database state after a failure? Durability.
  • Concurrency control and recovery
    • Valuable properties of DBMS
    • Based on concept of transactions with ACID properties

Transactions

  • A Tx is the execution of a sequence of one or more operations on a database to perform some higher-level function

  • It is the basic unit of change in a DBMS

    • Partial txns are not allowed.
  • Strawman system

    • execute txn one-by-one as they arrive at the DBMS.
    • before a txn starts, copy the entire db to a new file and make all changes to that file.
    • is it correct? is it fast?
  • A better approach is to allow concurrent execution of independent txns

    • Better utilization/throughput
    • Improved response times for users.
  • Arbitrary interleaving of operations can lead to:

    • Temporary Inconsistency
    • Permanent Inconsistency
  • A DBMS is only concerned about what data is read/written from /to the database.

  • Transactions in SQL

    • Start with BEGIN command.
    • Stops with either COMMIT or ABORT.

ACID.

Atomicity.

  • Txn always either executes all its actions or executes no actions at all, COMMIT or ABORT.
  • Approach:
    • Logging
      • DBMS logs all actions so that it can undo the actions of aborted txns.
      • Maintain undo records both in memory and on disk.
      • Used for audit trail and/or efficiency reasons.
    • Shadow Paging
      • Makes copies of pages and txns make changes to these copies, only when the txn commits is the page made visible to others.

Consistency

  • The world represented by the db is logically correct, all questions asked about the data are given logically correct answers.
  • Database consistency
    • Accurately models the real world and follows integrity constraints.
    • Txns in future see the effects of txns committed in the past inside of the database.
  • Transaction consistency
    • If db is consistent before txn starts it will also be consistent after.
    • This is the application's responsibility.

Isolation

  • Users submit txns, and each txn executes as if it were running by itself.

  • But the DBMS achieves concurrency by interleaving txns.

  • A concurrency control protocol is how the DBMS decides the proper interleaving of operations from multiple transactions.

  • Categories:

    • Pessimistic - don't let problems arise in the first place.
    • Optimistic - assume conflicts are rare and deal with them after they happen.
  • When one txn stalls because of a resource, another txn is allowed to go forward.

  • Formal properties of schedules

    • Serial schedule.
      • A schedule that does not interleave the actions of different transactions.
    • Equivalent schedules.
      • For any db state, the effect of executing the first schedule is identical to the effect of executing the second schedule.
    • Serializable schedule
      • A schedule that is equivalent to some serial execution of the transactions.
      • If each txn preserves consistency, every serializable schedule preserves consistency.
      • Serializability is a less intuitive notion of correctness compared to txn initiation time or commit order, but it provides the DBMS with more flexibility in scheduling operations
      • More flexibility means better parallelism.
  • Conflicting operations

    • Need a formal notion of equivalence that can be implemented efficiently based on the notion of conflicting operations.
    • Two ops conflict if:
      • They are by different transactions
      • They are on the same object and one of them is a write.
    • Interleaved Execution Anomalies
      • Read - Write conflicts
        • Unrepeatable Read: Txn gets different values when reading the same object multiple times.
      • Write - Read conflicts
        • Dirty Read: One txn reads data written by another txn that has not committed yet.
      • Write - Write conflicts.
        • Lost Update: One txn overwrites uncommitted data from another uncommitted txn
  • Levels of serializability

    • Conflict serializability*
    • View Serializability - allows for more schedules, but no system implements it (too expensive to enforce).
  • Conflict serializable schedules

    • Two schedules are conflict equivalent iff:
      • They involve the same actions of the same txns.
      • Every pair of conflicting actions is ordered the same way.
    • Schedule S is conflict serializable if:
      • S is conflict equivalent to some serial schedule.
      • Intuition
  • Dependency graph

    • A schedule is conflict serializable iff its dependency graph is acyclic; if the actions result in a cycle in the dependency graph, then the schedule violates serializability (see the sketch below).
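A tiny dependency-graph check following the intuition above: add an edge Ti → Tj for every pair of conflicting operations where Ti's operation comes first, then look for a cycle (a cycle means the schedule is not conflict serializable). The schedule encoding is made up for illustration.

```python
from collections import defaultdict


def conflict_edges(schedule):
    """schedule: list of (txn, op, obj) in execution order, op in {'R', 'W'}.
    Emit an edge Ti -> Tj whenever an earlier op of Ti conflicts with a
    later op of Tj (same object, different txns, at least one write)."""
    edges = defaultdict(set)
    for i, (ti, op_i, obj_i) in enumerate(schedule):
        for tj, op_j, obj_j in schedule[i + 1:]:
            if ti != tj and obj_i == obj_j and "W" in (op_i, op_j):
                edges[ti].add(tj)
    return edges


def has_cycle(edges):
    """Depth-first search with three colors to detect a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in edges[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))


if __name__ == "__main__":
    # T1 reads A then B; T2 writes both in between -> cycle -> not serializable.
    s = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "W", "B"), ("T1", "R", "B")]
    print(has_cycle(conflict_edges(s)))   # True
```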

Durability

  • All changes of committed txns should be persistent.
    • No torn updates.
    • No changes from failed transactions.
  • DBMS can use either logging or shadow paging to ensure all changes are durable.
  • Concurrency control and recovery are among the most important functions provided by a DBMS.

-ref paper: Spanner: Google's Globally-Distributed Database.

Two-Phase Locking

  • We need a way to guarantee that all execution schedules are correct (i.e serializable) without knowing the entire schedule ahead of time.

  • Use LOCKS to protect database objects, pessimistic.

  • An existing component in the DBMS, called the Lock Manager, grants and tracks lock requests.

  • Locks vs Latches:

    • Separate User transactions VS separate threads.
    • Protect Database contents VS In-memory data structures.
    • During Entire Transactions vs during critical sections.
    • Modes: Shared, Exclusive, Update, Intention vs Read, Write modes.
    • Deadlock Detection and Resolution vs deadlock avoidance.
    • By Waits-for, Timeout, Aborts vs By coding discipline.
    • Kept in Lock Manager vs kept in protected data structures.
  • Basic Lock Types

    • S-LOCK: Shared locks for reads.
    • X-LOCK: Exclusive locks for writes.
  • Executing with locks

    • Transactions request locks.
    • Lock manager grants or blocks requests.
    • Transactions release locks.
    • Lock manager updates its internal lock-table.
      • Keeps track of what transactions hold what locks and what transactions are waiting to acquire any locks.
      • Global view of a transaction's activity.
      • Cheaper than updating latches on a B+ tree.
  • Concurrency control protocol

    • 2PL is a concurrency control protocol that determines whether a txn can access an object in the database at runtime.
    • The protocol does not need to know all the queries that a txn will execute ahead of time.
  • Two-Phase Locking

    • Growing
      • Each txn requests the locks that it needs from the DBMS's lock manager.
      • Lock manager grants/denies lock requests.
    • Shrinking
      • Txn is allowed to only release/downgrade locks that it previously acquired, cannot acquire new locks.
  • 2PL on its own is sufficient to guarantee conflict serializability because it generates schedules whose precedence graph is acyclic.

  • But it is subject to cascading aborts.

  • There are schedules that are serializable but would not be allowed by 2PL because locking limits concurrency.

  • May still have "dirty reads".

  • Strong Strict Two-phase locking.

    • Txn is only allowed to release locks after it has ended(committed or aborted).
    • Allows only conflict serializable schedules, but it is often stronger than needed for some apps.
    • A schedule is strict if a value written by a txn is not read or overwritten by other txns until that txn finishes.
    • Advantages:
      • Does not incur cascading aborts.
      • Aborted txns can be undone by just restoring original values of modified tuples.
  • May lead to deadlocks.

    • A deadlock is a cycle of transactions waiting for locks to be released by each other.
      • Deadlock Detection
        • DBMS creates a waits-for graph to keep track of what locks each txn is waiting to acquire.
          • Nodes are transactions.
          • Edge from Ti to Tj if Ti is waiting for Tj to release a lock.
        • The system periodically checks for cycles in the waits-for graph and then decides how to break them.
      • Deadlock Handling
        • DBMS selects a victim txn to rollback to break the cycle.
        • Either by:
          • By age
          • By progress
          • By the # of items already locked.
          • By # of txns that we have to rollback with it.
        • Should also consider how many times a txn has been restarted to prevent starvation.
        • Rollback length
          • Completely
            • Rollback entire txn and tell the application it was aborted.
          • Partial(Savepoints)
            • The DBMS rolls back a portion of a txn (to break the deadlock) and then attempts to re-execute the undone queries.
        • The victim txn will either restart or abort depending on how it was invoked.
        • Tradeoff between the frequency of checking for deadlocks and how long txns wait before deadlocks are broken.
      • Deadlock Prevention
        • When a txn tries to acquire a lock that is held by another txn, the DBMS kills one of them to prevent a deadlock.
        • This approach does not require a waits-for graph or detection algorithm.
        • Approach
          • Wait-Die.
            • If requesting txn has higher priority than holding txn, then requesting txn waits for holding txn.
            • Otherwise requesting txn aborts.
          • Wound-Wait
            • If requesting txn has higher priority than holding txn, then holding txn aborts and releases lock.
            • Otherwise requesting txn waits.
        • Why do these schemes guarantee no deadlocks? (see the sketch after this list)
          • Only one direction of waiting is ever allowed (e.g. only older txns may wait on younger ones), so a cycle of waiting txns cannot form.
        • When a txn restarts, what is its new priority?
          • It keeps its original timestamp to prevent it from being starved for resources.
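A sketch of the two deadlock-prevention policies, using timestamps as priorities (a smaller timestamp means an older, higher-priority txn); the function and its return values are made up for illustration.

```python
def resolve(requester_ts, holder_ts, scheme):
    """Decide what happens when `requester` asks for a lock held by `holder`.
    Priority = age: the smaller timestamp is the older, higher-priority txn."""
    requester_is_older = requester_ts < holder_ts
    if scheme == "wait-die":
        # Old waits for young; a younger requester "dies" (aborts).
        return "requester waits" if requester_is_older else "requester aborts"
    if scheme == "wound-wait":
        # Old "wounds" (aborts) the younger holder; a younger requester waits.
        return "holder aborts" if requester_is_older else "requester waits"
    raise ValueError("unknown scheme")


if __name__ == "__main__":
    # Timestamp 1 is older (higher priority) than timestamp 5.
    print(resolve(1, 5, "wait-die"))     # requester waits
    print(resolve(5, 1, "wait-die"))     # requester aborts
    print(resolve(1, 5, "wound-wait"))   # holder aborts
    print(resolve(5, 1, "wound-wait"))   # requester waits
```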
      • Acquiring a lock is a more expensive operation than acquiring a latch, even if that lock is available.
  • Lock Granularities

    • DBMS decides the granularity of a lock when a txn wants a lock.
      • Attribute, Tuple, Page, Table
    • The DBMS should ideally obtain the fewest number of locks that a txn needs.
    • Trade-off parallelism versus overhead.
    • Database Lock Hierarchy
      • Database ->Table ->Page ->Tuple ->Attribute.
  • Intention Locks

    • An intention lock allows a higher-level node to be locked in shared or exclusive mode without having to check all descendant nodes.
    • If a node is locked in an intention mode, then some txn is doing explicit locking at a lower level in the tree.
    • Types:
      • Intention-Shared(IS)
        • Intent to get S locks at finer granularity.
      • Intention-Exclusive(IX)
        • Intent to get X lock at finer granularity.
      • Shared+Intention-Exclusive(SIX)
        • Subtree rooted by that node is locked explicitly in shared mode and explicit locking is done at a lower level with exclusive-mode locks.
    • Compatibility Matrix
    • Locking Protocol
      • Each txn obtains appropriate lock at the highest level of the database hierarchy.
    • Lock Escalation
      • DBMS can automatically switch to coarse grained locks when a txn acquires too many low-level locks
      • This reduces the number of requests that the lock manager must process.
  • Applications don't acquire a txn's locks manually.

  • Need to provide the DBMS with hints to help it improve concurrency.

  • Explicit locks are also useful when doing major changes to the database.

  • Lock Table

Timestamp-Ordering Concurrency Control

  • Determine serializability order of txns before they execute.

  • Use timestamps to determine the serializability order of txns.

  • Each txn is assigned a unique fixed timestamp that is monotonically increasing.

  • Implementation strategies

    • System/Wall clock.
    • Logical counter.
    • Hybrid.
  • Basic T/O

    • Txns read and write objects without locks.
    • Every object X is tagged with the timestamp of the last txn that successfully read/wrote it.
    • Check the timestamps for every operation; if a txn tries to access an object "from the future", it aborts and restarts (see the sketch after this list).
    • Thomas Write Rule(Robert H Thomas)
      • Ignore the write to allow the txn to continue executing without aborting.
    • Basic T/O generates schedules that are conflict serializable if you do not use the Thomas Write Rule.
    • High overhead from copying data into each txn's workspace and from updating timestamps.
      • Every read requires the txn to write to the database (to update the read timestamp).
    • Long-running txns can get starved.
      • The likelihood that a txn will read something from a newer txn increases the longer it runs.
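A minimal sketch of the Basic T/O visibility checks: one read timestamp and one write timestamp per object, with the Thomas Write Rule as an option. It ignores private workspaces, scheduling, and recovery; the names are made up.

```python
class BasicTO:
    """Toy Basic Timestamp Ordering: each object carries the timestamps of
    the last txn that read it (r_ts) and wrote it (w_ts); a txn that tries
    to access an object 'from the future' is aborted."""

    def __init__(self, thomas_write_rule=False):
        self.r_ts = {}       # object -> last read timestamp
        self.w_ts = {}       # object -> last write timestamp
        self.values = {}
        self.thomas_write_rule = thomas_write_rule

    def read(self, txn_ts, obj):
        if txn_ts < self.w_ts.get(obj, 0):
            raise RuntimeError(f"abort txn {txn_ts}: object written in its future")
        self.r_ts[obj] = max(self.r_ts.get(obj, 0), txn_ts)
        return self.values.get(obj)

    def write(self, txn_ts, obj, value):
        if txn_ts < self.r_ts.get(obj, 0):
            raise RuntimeError(f"abort txn {txn_ts}: object read in its future")
        if txn_ts < self.w_ts.get(obj, 0):
            if self.thomas_write_rule:
                return           # silently skip the stale write, keep running
            raise RuntimeError(f"abort txn {txn_ts}: object written in its future")
        self.w_ts[obj] = txn_ts
        self.values[obj] = value


if __name__ == "__main__":
    db = BasicTO(thomas_write_rule=True)
    db.write(2, "A", "from txn 2")
    db.write(1, "A", "stale write")   # ignored by the Thomas Write Rule
    print(db.read(3, "A"))            # from txn 2
```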
  • If you assume that conflicts between txns are rare and that most txns are short-lived, then forcing txns to acquire locks or update timestamps adds unnecessary overhead.

  • A better approach is to optimize for the no-conflict case.

  • Optimistic Concurrency Control ref paper: On Optimistic Methods for Concurrency Control.

    • DBMS creates a private workspace for each transaction.
      • Any object read is copied into workspace.
      • Modifications are applied to workspace.
    • When a txn commits, DBMS compares workspace write set to see whether it conflicts with other txns.
    • If there are no conflicts, the write set is installed into the global database.
  • Phases:

    • Read Phase
      • Track the r/w sets of txns and store their writes in a private workspace.
      • The DBMS copies every tuple that the txn accesses from the shared database into its workspace to ensure repeatable reads.
    • Validation Phase
      • When a txn commits, check whether it conflicts with other txns.
      • Approaches:
        • Backward Validation - check whether committing txn intersects its read/write sets with those of any txns that have already committed.
        • Forward Validation - check whether the committing txn intersects its read/write sets with any active txns that have not yet committed.
    • Write Phase
      • If validation succeeds, apply private changes to database, otherwise abort and restart the txn.
      • Serial Commits - Use a global latch to limit a single txn to be in the Validation/Write phases at a time.
      • Parallel Commits - Use fine-grained write latches to support parallel Validation/Write phases. Txns acquire latches in primary key order to avoid deadlocks.
  • Performance Issues.

    • OCC works well when the # of conflicts is low, all txns are read-only and txns access disjoint subsets of data.
    • If the database is large and the workload is not skewed then there is a low probability of conflict, so again locking is wasteful.
    • High overhead for copying data locally.
    • Validation/Write phase bottlenecks.
    • Aborts are more wasteful than in 2PL because they occur after a txn has already executed.
  • Dynamic Databases

    • We have only dealt with transactions that read and update existing objects in the database.
    • But now if txns perform inserts, updates and deletions, we have new problems.
    • Phantom reads
      • Tuples can appear and disappear as txns are running.

      • Approach:

        • Re-execute scans.
          • The DBMS tracks the WHERE clause for all queries that the txn executes.
            • Retain the scan set for every range query in a txn.
          • Upon commit, re-execute just the scan portion of each query and check whether it generates the same result.
        • Predicate Locking
          • Logically determine the overlap of predicates before queries start running.
            • HyPer DB.
            • In memory systems.
        • Index Locking Schemes
          • Use keys in indexes to protect ranges.
          • Key-Value Locks
            • Locks that cover a single key-value in an index.
            • Need "virtual keys" for non-existant values.
          • Gap Locks
            • Each txn acquires a key-value lock on the single key that it wants to access, then get a gap lock on the next key gap.
          • Key-Range Locks
            • Locks that cover a key value and the gap to the next key value in a single index, need virtual keys for artificial values.
          • Hierarchical Locking
            • Allow for a txn to hold a wider key-range locks with different locking modes.
              • reduces the number of visits to lock manager.
      • Weaker levels of isolation

        • Serializability is useful because it allows programmers to ignore concurrency issues, but enforcing it may allow too little concurrency and limit performance.
        • We may want to use a weaker level of consistency to improve scalability.
      • Isolation Levels

        • Controls the extent that a txn is exposed to the actions of other concurrent txns.
        • Provides for greater concurrency at the cost of exposing txns to uncommitted changes.
          • Dirty Reads
          • Unrepeatable Reads
          • Phantom Reads
        • Levels:
          • Serializable: No phantoms, all reads repeatable, no dirty reads.
          • Repeatable Reads: Phantoms may happen.
          • Read Committed: Phantoms and unrepeatable reads may happen.
          • Read Uncommitted: All of them may happen.
        • How to obtain:
          • Serializable: Obtain all locks first, plus index locks, plus strict 2PL.
          • Repeatable Reads: Same as above but no index locks.
          • Read Committed: Same as above, but S locks are released immediately.
          • Read Uncommitted: Same as above but allows dirty reads(no S locks).
  • ref paper: Critique of SQL Isolation Levels

Multi-Version Concurrency Control

-ref paper: An Empirical evaluation of In-Memory MVCC, The Hekaton Memory-Optimized OLTP Engine

  • DBMS maintains multiple physical versions of a single logical object in the database.

    • When a txn writes to an object, the DBMS creates a new version of that object.
    • When a txn reads an object, it reads the newest version that existed when the txn started.
  • First implemented in DEC's Rdb/VMS.

  • Firebird: the naming clash with this database is why Mozilla's browser was renamed Firefox.

  • Key ideas:

    • Writers do not block readers.
    • Readers do not block writers
    • Read-only txns can read a consistent snapshot without acquiring locks, use timestamps to determine visibility.
    • Easily support time-travel queries.
  • Snapshot Isolation

    • When a txn starts, it sees a consistent snapshot of the database that existed when that txn started.
      • No torn writes from active txns
      • If two txns update the same object, then first writer wins.
    • Susceptible to the Write Skew Anomaly.
  • MVCC is more than just a concurrency control protocol. It completely affects how the DBMS manages transactions and the database.

  • Concurrency control protocol

    • Timestamp Ordering
    • Optimistic Concurrency Control
    • Two-phase Locking
  • Version Storage

    • The DBMS uses the tuple's pointer field to create a version chain per logical tuple.
      • This allows the DBMS to find the version that is visible to a particular txn at runtime.
      • Indexes always point to the head of the chain.
    • Different storage schemes determine where/what to store for each version
  • Approach:

    • Append-Only Storage

      • All the physical versions of a logical tuple are stored in the same table space, with versions being inter-mixed.
      • new versions are appended to the same table space.
      • On every update, append a new version of the tuple into an empty space in the table.
      • Version chain ordering
        • Oldest to newest
          • Append new version at the end of the chain.
          • Must traverse chain on look-ups.
        • Newest to Oldest
          • Must update index pointers for every new version.
          • Do not have to traverse the chain on look-ups (see the sketch after the storage-scheme list below).
    • Time-Travel Storage

      • Old versions are copied to separate table space.
      • On every update, copy the current version to the time-travel table and update the pointers.
      • Overwrite master version on the main table and update pointers.
    • Delta Storage

      • On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
      • Original values of the modified attributes are copied into a separate delta record space.
      • Transactions can recreate old versions by applying the delta in reverse order.
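A sketch of append-only version storage with a newest-to-oldest chain: the chain head is what an index entry would point to, each version records the timestamp at which it became valid, and a reader walks backwards to the newest version visible at its snapshot timestamp (no garbage collection; the names are made up).

```python
class Version:
    def __init__(self, value, begin_ts, prev=None):
        self.value = value
        self.begin_ts = begin_ts     # timestamp of the txn that created it
        self.end_ts = float("inf")   # set when a newer version supersedes it
        self.prev = prev             # next-older version in the chain


class MVCCTuple:
    """Newest-to-oldest version chain: updates prepend a new head (which is
    what an index entry would reference); reads walk backwards."""

    def __init__(self, value, ts):
        self.head = Version(value, ts)

    def update(self, value, ts):
        old = self.head
        old.end_ts = ts
        self.head = Version(value, ts, prev=old)   # the index now points here

    def read(self, snapshot_ts):
        v = self.head
        while v is not None and v.begin_ts > snapshot_ts:
            v = v.prev                             # keep walking to older versions
        return v.value if v is not None else None


if __name__ == "__main__":
    t = MVCCTuple("v1", ts=1)
    t.update("v2", ts=5)
    print(t.read(snapshot_ts=3))   # v1 (snapshot taken before ts=5)
    print(t.read(snapshot_ts=7))   # v2
```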
  • Garbage Collection

    • DBMS needs to remove reclaimable physical versions from the database over time

      • No active txn in the DBMS can see that version
      • The version was created by an aborted txn
    • Two additional design decisions

      • How to look for expired versions
      • How to decide when it is safe to reclaim memory.
    • Approach:

      • Tuple-level

        • Find old versions by examining tuples directly.
        • Background Vacuuming.
          • Separate threads periodically scan the table and look for reclaimable versions. Works with any storage.
          • Dirty block bitmap.
        • Cooperative Cleaning.
          • Worker threads identify reclaimable versions as they traverse tuple version chains during normal operations.
          • Only works with Oldest to Newest
      • Transactions level GC

        • Each txn keeps track of its read/write set
        • On commit/abort, the txn provides this information to a centralized vacuum worker.
        • The DBMS periodically determines when all versions created by a finished txn are no longer visible.
    • Index management

      • Primary key indexes point to version chain head

        • How often the DBMS must update the pkey index depends on whether the system creates new versions when a tuple is updated.
        • If a txn updates a tuple's pkey attributes then this is treated as a DELETE followed by an INSERT.
      • Secondary Indexes are more complicated. READ WHY UBER SWITCHED FROM POSTGRES TO MYSQL

        • Approach:
          • Logical Pointers
            • Use a fixed identifier per tuple that does not change.
            • Requires an extra indirection layer.
            • Primary Key vs Tuple Id
          • Physical Pointers
            • Use the physical address to the version chain head.
      • MVCC Indexes

        • MVCC DBMS indexes usually do not store version information about tuples with their keys
          • exception: Index-organized tables(MySQL)
        • Every index must support duplicate keys from different snapshots.
          • the same key may point to different logical tuples in different snapshots.
        • MVCC duplicate key problem.
          • Each index underlying data structure must support the storage of non-unique keys.
          • Use additional execution logic to perform conditional inserts for pkey/unique indexes, atomically check whether the key exists and then insert.
          • Workers may get back multiple entries for a single fetch, they must follow the pointers to find the proper physical tuple.
        • MVCC Deletes
          • The DBMS physically deletes a tuple from the database only when all versions of a logically deleted tuple are not visible.
          • If a tuple is deleted, then there cannot be a new version of that tuple after the newest version, no write-write conflicts / first-writer wins.
          • We need a way to denote a tuple has been logically deleted at some point in time.
          • Approaches:
            • Deleted Flag.
              • Maintain a flag to indicate that the logical tuple has been deleted after the newest physical version.
              • Can either be in tuple header or a separate column.
            • Tombstone Tuple.
              • Create an empty physical version to indicate that a logical tuple is deleted.
              • Use a separate pool for tombstone tuples with only a special bit pattern in version chain pointer to reduce the storage overhead.
    • Transaction-level

      • Txns keep track of their old versions so the DBMS does not have to scan tuples to determine visibility.

Database Logging and Shadow Paging

  • Crash Recovery

    • Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
    • Recovery algorithms have two parts
      • Actions during normal txn processing to ensure that the DBMS can recover from a failure.
      • Actions after a failure to recover the database to a state that ensures atomicity, consistency and durability.
  • DBMS is divided into different components based on the underlying storage device.

  • We want high performance, hence we buffer writes in volatile storage.

  • Allow dirty pages in the buffer pool for performance, with the buffer pool replacement policy dictating when they are flushed to non-volatile storage.

    • Volatile
      • Data does not persist after power loss or program exit
      • DRAM, SRAM
    • Non-volatile
      • Data persists after power loss and program exit
      • HDD, SSD
    • Stable Storage
      • A non-existent form of non-volatile storage that survives all possible failure scenarios
  • We must also classify the different types of failures that the DBMS needs to handle.

  • Failure Classification

    • Transaction Failures

      • Logical Errors
        • Txn cannot complete due to internal error condition(integrity constraint violation)
      • Internal State Errors
        • DBMS must terminate an active transaction due to an error condition(deadlock)
    • System Failures

      • Software Failures
        • Problem with the OS or DBMS implementation(divide-by-zero exception)
      • Hardware Failure
        • Computer hosting the DBMS crashes.
        • Fail-stop Assumption: Non-volatile storage contents assumed to not be corrupted by system crash.
    • Storage media Failures

      • Non-repairable Hardware Failure
        • A head crash or similar disk failure destroys all or part of non-volatile storage
        • Destruction is assumed to be detectable (disk controllers use checksums to detect failures).
      • No DBMS can recover from this! Database must be restored from archived versions.
  • Observation

    • The database's primary storage location is on non-volatile storage, which is slower than volatile storage, so use volatile memory for faster access:
      • First copy target record into memory
      • Perform the writes in memory
      • Write dirty records back to disk.
    • The DBMS needs to ensure the following
      • The changes for any txn are durable once the DBMS has said its committed.
      • No partial changes are durable if the txn aborted.
  • Undo vs Redo

    • Undo: Process of removing the effects of an incomplete or aborted txn.
    • Redo: Process of re-applying the effects of a committed txn for durability.
  • How the DBMS supports this functionality depends on how it manages the buffer pool.

    • Steal Policy
      • Whether the DBMS allows an uncommitted txn to overwrite the most recent committed value of an object in non-volatile storage.
    • Force Policy
      • Whether the DBMS requires that all updates made by a txn are reflected on non-volatile storage before the txn can commit.
    • No Steal + Force
      • Easiest approach to implement.
        • Never have to undo changes of an aborted txn because the changes were not written to disk
        • Never have to redo changes of a committed txn because all changes are guaranteed to be written to disk at commit time
      • Cannot support write sets that exceed the amount of physical memory available.
  • Shadow Paging

    • Instead of copying the entire database, the DBMS copies pages on write to create two versions:
      • Master: contains only changes from committed txns
      • Shadow: temporary db with changes made from uncommitted txns
    • Buffer Pool Policy: No-steal + Force
    • To install updates when a txn commits, overwrite the root so it points to the shadow, thereby swapping the master and shadow.
    • Supporting rollbacks and recovery is easy.
    • Disadvantages:
      • Copying the entire page table is expensive
        • use a page table structured like a B+ tree
        • No need to copy the entire tree, only the paths in the tree that lead to updated leaf nodes
      • Commit overhead is high
        • Flush every updated page, page table and root
        • Data gets fragmented (bad for sequential scans)
        • Needs garbage collection
        • Only supports one writer txn at a time or txns in a batch.
    • SQLITE(PRE-2010)
      • rollback mode.
      • use a journal file
  • We need a way for the DBMS to convert random writes into sequential writes

  • Write-Ahead Log

    • Maintain a log file separate from data files that contains the changes that txns make to database
      • Assume the log is on stable storage
      • Log contains enough information to perform the necessary undo and redo actions to restore the database
    • DBMS must write to disk the log file records that correspond to changes made to a db object before it can flush that object to disk
    • Buffer Pool Policy: Steal + No-force
    • WAL Protocol
      • DBMS stages all of a txn's log records in volatile storage (backed by buffer pool).
      • All log records pertaining to an updated page are written to non-volatile storage before the page itself is over-written in non-volatile storage
      • A txn is not considered committed until all its log records have been written to stable storage.
      • Write a begin record to the log for each txn to mark its starting point.
      • When a txn finishes, DBMS will:
        • Write a commit record on the log
        • Make sure that all log records are flushed before it returns an ack to application.
      • Each log entry contains information about the change to a single object.
        • Transaction ID
        • Object ID
        • Before Value(UNDO)
        • After Value(REDO)
    • Flushing the log buffer to disk every time a txn commits will become a bottleneck
    • DBMS can use the group commit optimization to batch multiple log flushes together to amortize overhead.
    • Almost every DBMS uses No-Force + Steal (a minimal WAL sketch follows below).
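
    A minimal sketch of the WAL protocol above (illustrative; the WriteAheadLog class and JSON record layout are assumptions, not a real DBMS API): records are staged in a volatile buffer, commits flush the buffer (group commit amortizes this across txns), and a dirty page may be written only once its records are durable.

        # Minimal WAL sketch (hypothetical, simplified; no fsync/torn-write handling).
        import json

        class WriteAheadLog:
            def __init__(self, path):
                self.file = open(path, "ab")     # append-only log on non-volatile storage
                self.buffer = []                 # log records staged in volatile storage
                self.flushed_lsn = 0
                self.next_lsn = 1

            def append(self, txn_id, obj_id, before, after):
                rec = {"lsn": self.next_lsn, "txn": txn_id, "obj": obj_id,
                       "before": before, "after": after}
                self.next_lsn += 1
                self.buffer.append(rec)
                return rec["lsn"]

            def flush(self):
                for rec in self.buffer:
                    self.file.write((json.dumps(rec) + "\n").encode())
                self.file.flush()                # a real system also calls fsync
                if self.buffer:
                    self.flushed_lsn = self.buffer[-1]["lsn"]
                self.buffer.clear()

            def group_commit(self, txn_ids):
                # One flush makes a whole batch of COMMIT records durable.
                for t in txn_ids:
                    self.append(t, obj_id=None, before=None, after="COMMIT")
                self.flush()                     # only now may the DBMS ack the clients

            def can_write_page(self, page_lsn):
                # WAL rule: flush the log up to pageLSN before writing the page.
                return page_lsn <= self.flushed_lsn
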
  • Logging Schemes

    • Physical Logging
      • Record the byte-level changes made to a specific page
      • e.g git diff
    • Logical Logging
      • Record the high-level operations executed by txns
      • e.g UPDATE, DELETE and INSERT
    • Physiological Logging
      • Hybrid approach with byte-level changes for a single tuple identified by page id+slot number
      • Does not specify organization of the page.
    • Logical logging requires less data written in each log record than physical logging
    • Difficult to implement recovery with logical logging if you have concurrent txns running at lower isolation levels
      • Hard to determine which parts of the db may have been modified by a query before a crash.
      • Also takes longer to recover because you must re-execute every txn all over again.
  • Log-structured systems

    • Do not have dirty pages
      • any page retrieved from disk is immutable
    • DBMS buffers log records in in-memory pages (MemTable). If this buffer is full, it must be flushed to disk, but it may contain changes from uncommitted txns.
    • DBMS still maintains a separate WAL to recreate the MemTable on crash.
  • Checkpoints

    • WAL will grow forever
    • After a crash, the DBMS must replay the entire log, which will take a long time.
    • DBMS periodically takes a checkpoint where it flushes all buffers out to disk.
      • Provides a hint on how far back it needs to replay the WAL after a crash.
    • Blocking/Consistent Checkpoint Protocol (a minimal sketch appears at the end of this section):
      • Pause all queries.
      • Flush all WAL records in memory to disk.
      • Flush all modified pages in buffer pool to disk
      • Write a CHECKPOINT entry to WAL and flush to disk.
      • Resume queries.
    • Scanning the log to find uncommitted txns can take a long time.
    • How often the DBMS should take checkpoints depends on many different factors.
      • Frequency is a tunable option that depends on application recovery time requirements.
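
    A minimal sketch of the blocking/consistent checkpoint steps above, reusing the hypothetical WriteAheadLog sketch from the previous section and assuming a dict-based buffer pool and a query_gate object with pause()/resume() (all names are assumptions for illustration).

        # Blocking checkpoint sketch (illustrative only).
        def blocking_checkpoint(wal, buffer_pool, disk, query_gate):
            query_gate.pause()                           # 1. pause all queries
            try:
                wal.flush()                              # 2. flush in-memory WAL records
                for page_id, page in buffer_pool.items():
                    if page["dirty"]:                    # 3. flush modified pages
                        disk[page_id] = dict(page, dirty=False)
                        page["dirty"] = False
                wal.append(txn_id=None, obj_id=None,     # 4. write a CHECKPOINT entry
                           before=None, after="CHECKPOINT")
                wal.flush()
            finally:
                query_gate.resume()                      # 5. resume queries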

Database Recovery

  • ARIES: Algorithms for Recovery and Isolation Exploiting Semantics (read the paper by the same name).

Main Ideas

  • Write-Ahead Logging

    • Any change is recorded in the log on stable storage before the change is written to disk
    • Must use STEAL+NO-FORCE buffer pool policies.
  • Repeating History during Redo

    • On DBMS restart, retrace actions and restore database to exact state before crash.
  • Logging Changes During Undo

    • Record undo actions to the log to ensure an action is not repeated in the event of repeated failures.
  • WAL Records

    • We need to extend our log record format from last class to include additional info
    • Every log record now includes a globally unique log sequence number (LSN)
    • LSNs represent the physical order in which txns make changes to the database
    • Various components in the system keep track of LSNs that pertain to them.
  • Log Sequence Number

    • It is unique and monotonically increasing.
    • The system keeps track of:
      • flushedLSN (memory): last LSN in the log on disk, i.e. the max LSN flushed so far.
      • pageLSN (page x): LSN of the newest log record that updated page x.
      • recLSN (page x): oldest update to page x since it was last flushed.
      • lastLSN (Ti): latest log record of txn Ti.
      • MasterRecord (disk): LSN of the latest checkpoint.
    • Before the DBMS can write page X to disk, it must flush the log at least to the point where pageLSN <= flushedLSN (see the bookkeeping sketch below).
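
    A minimal sketch of the LSN bookkeeping above (illustrative; the class and method names are assumptions): pageLSN/recLSN/lastLSN are updated on every change, flushedLSN advances on a log flush, and a page may be written out only when pageLSN <= flushedLSN.

        # LSN bookkeeping sketch (hypothetical, in-memory only).
        class LSNBookkeeping:
            def __init__(self):
                self.flushed_lsn = 0   # memory: max LSN written to the log on disk
                self.page_lsn = {}     # per page: newest log record that updated it
                self.rec_lsn = {}      # per page: oldest update since the last flush
                self.last_lsn = {}     # per txn: latest log record of that txn

            def on_update(self, lsn, txn_id, page_id):
                self.page_lsn[page_id] = lsn
                self.rec_lsn.setdefault(page_id, lsn)   # set only when the page is first dirtied
                self.last_lsn[txn_id] = lsn

            def on_log_flush(self, upto_lsn):
                self.flushed_lsn = max(self.flushed_lsn, upto_lsn)

            def may_write_page(self, page_id):
                # Rule from above: pageLSN <= flushedLSN before the page goes to disk.
                return self.page_lsn.get(page_id, 0) <= self.flushed_lsn

            def on_page_flushed(self, page_id):
                self.rec_lsn.pop(page_id, None)         # the page is clean again
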
  • Writing Log records

    • All log records have an LSN.
    • Update the pageLSN every time a txn modifies a record in the page.
    • Update the flushedLSN in memory every time the DBMS writes out the WAL buffer to disk.
  • Normal Execution

    • Each txn invokes a sequence of reads and writes, followed by commit or abort.
    • Assumptions
      • All log records fit within a single page.
      • Disk writes are atomic.
      • Single-versioned tuples with strong strict 2PL.
      • STEAL + NO-FORCE buffer management with WAL.
  • Transaction commit

    • When a txn commits, the DBMS writes a COMMIT record to the log and guarantees that all log records up to the txn's COMMIT record are flushed to disk.
      • Log flushes are sequential, synchronous writes to disk.
      • Many log records per log page.
    • When the commit succeeds, write a special TXN-END record to log.
      • Indicates that no new log record for a txn will appear in the log ever again.
      • This does not need to be flushed immediately.
  • Transaction Abort

    • Aborting a txn is a special case of the ARIES undo operation applied to only one txn.
    • We need to add another field to our log records:
      • prevLSN: the previous LSN for the txn.
      • This maintains a linked-list for each txn that makes it easy to walk through its records.
    • Compensation Log records
      • A CLR describes the actions taken to undo the actions of a previous update record.
      • It has all the fields of an update log record plus the undoNext pointer(the next-to-be-undone LSN).
      • CLRs are added to log records but the DBMS does not wait for them to be flushed before notifying the application that the txn aborted.
    • Abort Algorithm
      • First write an ABORT record to log for the txn.
      • Then analyze the txn's updates in reverse order.
      • For each update record:
        • Write a CLR entry to the log.
        • Restore old value.
      • Lastly, write a TXN-END record and release locks.
      • Notice: CLRs never need to be undone (a minimal sketch of the abort path follows below).
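
    A minimal sketch of the abort path above (illustrative; the dict-based log records and field names are assumptions): walk the txn's prevLSN chain in reverse, write a CLR (with its undo-next pointer) and restore the old value for each update, then write TXN-END.

        # Abort sketch: undo one txn by following its prevLSN chain (hypothetical).
        def abort_txn(log, db, txn_id, last_lsn):
            """log maps LSN -> record; update records carry type, txn, obj, before, prev_lsn."""
            def append(rec):
                lsn = max(log) + 1 if log else 1
                log[lsn] = rec
                return lsn

            append({"type": "ABORT", "txn": txn_id})
            lsn = last_lsn
            while lsn is not None:                          # updates in reverse order
                rec = log[lsn]
                if rec["type"] == "UPDATE":
                    append({"type": "CLR", "txn": txn_id, "obj": rec["obj"],
                            "restored": rec["before"],
                            "undo_next": rec["prev_lsn"]})  # next-to-be-undone LSN
                    db[rec["obj"]] = rec["before"]          # restore the old value
                lsn = rec.get("prev_lsn")
            append({"type": "TXN-END", "txn": txn_id})      # then release locks
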
  • Non-fuzzy checkpoints

    • DBMS halts everything when it takes a checkpoint to ensure a consistent snapshot
      • Halt the start of any new txns
      • Wait until all active txns finish executing
      • Flush dirty pages to disk.
    • This is bad for runtime performance but makes recovery easy.
  • Better way to checkpoints

  • Pause modifying txns while the DBMS takes the checkpoint.

    • Prevent queries from acquiring write latch on table/index pages
    • Don't have to wait until all txns finish before taking the checkpoint
  • To prevent torn pages

    • We must record internal state as of the beginning of the checkpoint
      • Active Transaction Table(ATT)
        • One entry per currently active txn
          • txnId: Unique txn identifier
          • status: Current mode of the txn.
          • lastLSN: Most recent LSN created by txn
        • Remove entry after the TXN-END record
        • Transaction Status Codes
          • R - Running
          • C - Committing
          • U - Candidate for Undo
      • Dirty Page Table(DPT)
        • Keep track of which pages in the buffer pool contain changes that have not been flushed to disk.
        • One entry per dirty page in the buffer pool
          • recLSN: The LSN of the log record that first caused the page to be dirty.
  • Fuzzy Checkpoints

  • A fuzzy checkpoint is where the DBMS allows active txns to continue to run while the system writes the log records for the checkpoint

    • No attempt to force dirty pages to disk.
  • New log records to track checkpoint boundaries

    • CHECKPOINT-BEGIN: Indicates the start of checkpoint
    • CHECKPOINT-END: Contains ATT + DPT
    • Any txn that begins after the checkpoint start is excluded from the ATT in the CHECKPOINT-END record
    • The LSN of the CHECKPOINT-BEGIN record is written to the MasterRecord when it completes.

ARIES - Recovery Phases

  • Analysis

    • Examine the WAL in forward direction starting at MasterRecord to identify dirty pages in the buffer pool and active txns at the time of the crash.
    • Figure out which txns committed or failed since last successful checkpoint
      • If DBMS finds a TXN-END record, remove its corresponding txn from ATT.
      • All other records:
        • If txn not in ATT, add it with status UNDO.
        • On commit, change txn status to COMMIT.
      • For update log records: if the page is not in the DPT, add it to the DPT and set its recLSN to the record's LSN.
    • At end of Analysis Phase
      • ATT identifies which txns were active at time of crash
      • DPT identifies which dirty pages might not have made it to disk.
  • Redo

    • Repeat all actions starting from an appropriate point in the log(even txns that will abort)
    • Goal is to repeat history to reconstruct the database state at the moment of the crash
      • reapply all updates, even aborted txns and redo CLRs
    • There are techniques that allow the DBMS to avoid unnecessary reads/writes (look into this).
    • Scan forward from the log record containing the smallest recLSN in the DPT.
    • For each update log record or CLR with a given LSN redo the action unless
      • Affected page is not in DPT or
      • Affected page is in DPT but that record's LSN is less than the page's recLSN.
    • To redo an action
      • Reapply the logged update
      • Set pageLSN to the log record's LSN.
      • No additional logging, no forced flushes
    • At the end of Redo Phase, write TXN-END log records for all txns with status C and remove them from the ATT.
  • Undo

    • Reverse the actions of txns that did not commit before the crash.
    • Undo all txns that were active at the time of crash and therefore will never commit
      • these are all the txns with U status in the ATT after the Analysis Phase
    • Process them in reverse LSN order using the lastLSN to speed up traversal
    • Write a CLR for every modification (a simplified end-to-end sketch of the three phases appears at the end of this section).
  • What does the DBMS do if it crashes during recovery in the Analysis Phase?

    • Nothing. Just run recovery again.
  • What does the DBMS do if it crashed during recovery in the Redo Phase?

    • Again nothing. Redo everything again.
  • How can the DBMS improve performance during recovery in the Redo Phase?

    • Assume that it is not going to crash again and flush all changes to disk asynchronously in the background.
  • How can the DBMS improve performance during recovery in the Undo Phase?

    • Lazily rollback changes before new txns access pages.
    • Rewrite the application to avoid long-running txns.
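
    A heavily simplified, illustrative sketch of the three phases over an in-memory list of log-record dicts (assumed fields: lsn, type, txn, obj, before, after); real ARIES also starts from the MasterRecord checkpoint, honors the recLSN/pageLSN checks, and writes CLRs during undo as described above.

        # Simplified ARIES-style recovery sketch (not the full algorithm).
        def recover(log, db):
            # --- Analysis: rebuild the ATT (active txns) and DPT (dirty pages). ---
            att, dpt = {}, {}
            for rec in log:
                t = rec.get("txn")
                if rec["type"] == "TXN-END":
                    att.pop(t, None)
                elif t is not None:
                    att[t] = {"status": "U", "lastLSN": rec["lsn"]}
                    if rec["type"] == "COMMIT":
                        att[t]["status"] = "C"
                if rec["type"] == "UPDATE":
                    dpt.setdefault(rec["obj"], rec["lsn"])   # recLSN = first dirtying LSN

            # --- Redo: repeat history from the smallest recLSN onward. ---
            start = min(dpt.values(), default=None)
            if start is not None:
                for rec in log:
                    if rec["type"] == "UPDATE" and rec["lsn"] >= start:
                        db[rec["obj"]] = rec["after"]        # reapply, even for aborted txns

            # --- Undo: roll back txns that never committed (status U). ---
            losers = {t for t, entry in att.items() if entry["status"] == "U"}
            for rec in reversed(log):
                if rec["type"] == "UPDATE" and rec["txn"] in losers:
                    db[rec["obj"]] = rec["before"]           # real ARIES also writes CLRs here
            return att, dpt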

Distributed Databases

  • Characteristics
    • Nodes can be far from each other.
    • Nodes connected using public network.
  • Use the building blocks that are covered in single node DBMS to now support transaction processing and query execution in distributed environments.
    • Optimizations and Planning.
    • Concurrency Control.
    • Logging and Recovery.
  • A distributed DBMS system architecture specifies what shared resources are directly accessible to CPUs
  • This affects how the CPUs coordinate with each other and where they retrieve/store objects in the database.

System Architectures

  • Shared Everything

  • Shared Memory

    • CPUs have access to common memory address space via a fast interconnect
      • each processor has a global view of all the in-memory data structures
      • each DBMS instance on a processor must "know" about the other instances
  • Shared Disk*

    • CPUs can access a single logical disk directly via an interconnect, but each have their own private memories
      • can scale execution layer independently from storage layer
      • must send messages between CPUs to learn about their current state.
      • facilitates data lakes and serverless systems.
    • Any OLAP DB in use now.
  • Shared Nothing*

    • Each DBMS instance has its own CPU, memory and local disk
    • Nodes only communicate with each other via the network
      • Harder to scale capacity
      • Harder to ensure consistency
      • Better performance and efficiency.
  • Design Issues

    • How does the application find data?
    • Where does the application send queries?
    • How to execute queries on distributed data?
      • Push query to data.
      • Pull data to query.
    • How does the DBMS ensure correctness?
    • How do we divide the database across resources?
  • Homogenous Nodes

    • Every node in the cluster can perform the same set of tasks(albeit on potentially different partitions of data)
    • Makes provisioning and failover easier.
  • Heterogenous Nodes

    • Nodes are assigned specific tasks
    • Can allow a single physical node to host multiple "virtual" node types for dedicated tasks.
  • Data Transparency

    • Apps should not be required to know where data is physically located in a distributed DBMS
      • a query on a single node should return the same results as on a distributed db.
    • In practice, developers need to be aware of the communication costs of queries to avoid excessively expensive data movement.
  • Database Partitioning

    • Split database across multiple resources
      • Disks, nodes, processors
      • Often called "Sharding"
    • DBMS executes query fragments on each partition and then combines the results to produce a single answer.
    • DBMS can partition physically(shared nothing) or logically(shared disk)
  • Naive TABLE Partitioning

    • Assign an entire table to a single node.
    • Assume that each node has enough storage space for an entire table
    • Ideal if queries never join data across tables stored on different nodes and access patterns are uniform.
  • Vertical Partitioning

    • Split a table's attributes into separate partitions.
    • Must store tuple information to reconstruct the original record.
  • Horizontal Partitioning

    • Split a table's tuples into disjoint subsets based on some partitioning key and scheme
      • Choose a column that divides the database equally in terms of size, load, and usage
    • Partitioning schemes
      • Hashing
      • Ranges
      • Predicates
  • Logical Partitioning vs Physical Partitioning

  • Consistent Hashing*

    • Avoids rehashing all keys when the number of nodes changes (unlike hash(key) mod N): only the keys that map to the affected node need to move.
    • Learn more about this; a minimal sketch follows below.
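
    A minimal consistent-hashing sketch (illustrative only; the node/key names and the md5-based hash are arbitrary choices): nodes sit on a hash ring, each key maps to the next node clockwise, and adding a node moves only the keys that now fall between it and its predecessor.

        # Consistent-hashing ring sketch.
        import bisect
        import hashlib

        def _h(value):
            return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

        class ConsistentHashRing:
            def __init__(self, nodes=()):
                self._ring = []                    # sorted list of (hash, node)
                for n in nodes:
                    self.add_node(n)

            def add_node(self, node):
                bisect.insort(self._ring, (_h(node), node))

            def remove_node(self, node):
                self._ring.remove((_h(node), node))

            def node_for(self, key):
                if not self._ring:
                    raise ValueError("empty ring")
                hashes = [h for h, _ in self._ring]
                i = bisect.bisect(hashes, _h(key)) % len(self._ring)   # next node clockwise
                return self._ring[i][1]

        # Usage: only a fraction of keys change owner when a node is added.
        ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
        before = {k: ring.node_for(k) for k in map(str, range(1000))}
        ring.add_node("node-d")
        moved = sum(before[k] != ring.node_for(k) for k in before)
        print(f"{moved} of 1000 keys moved")       # roughly 1/4 on average, not all of them
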
  • Single Node vs Distributed

    • A single node txn only accesses data that is contained on one partition so DBMS doesn't need to check behaviour of concurrent txns running on other nodes.
    • A distributed txn accesses data at one or more partition and therefore requires expensive coordination.
  • Transaction Coordination

    • Need a way to coordinate their execution on the system if our DBMS supports multi-operation and distributed txns.
    • Two different approaches
      • Centralized: Global "traffic cop".
        • TP Monitor: X/Open XA.
        • Middleware: Vitess, planetscale, mongoDB
      • Decentralized: Nodes organize themselves.
        • Leader election type.
    • Most distributed DBMSs use a hybrid approach where they periodically elect some node to be a temporary coordinator.
  • We assume that the nodes in our distributed system are running the same DBMS s/w, but organizations often run many different DBMSs in their applications, hence the need for a single interface for all our data.

  • Federated Databases

    • Distributed arch that connects disparate DBMSs into a single logical system, expose a single query interface that can access data at any location.
    • It is hard and nobody does it well
      • Different data models, query languages, limitations.
      • No easy way to optimize queries.
      • Lots of data copying(bad)
  • Distributed Concurrency Control

    • Need to allow multiple txns to execute simultaneously across multiple nodes.
      • Many of the same protocols from single node DBMSs can be adapted
    • This is harder because of
      • Replication
      • Network Communication Overhead
      • Node failures(permanent+ephemeral)
      • Clock Skew

OLAP VS OLTP

  • On-line Transaction Processing(OLTP)

    • Short-lived read/write txns (on the order of 50ms).
    • Small footprint.
    • Repetitive operations.
  • On-line Analytical Processing

    • Long-running read-only queries.
    • Complex joins.
    • Exploratory queries.

OLTP

  • Our goal is to have multiple physical nodes appear as a single logical DBMS.

  • How to ensure that all nodes agree to commit a txn and then make sure it does commit if we decide that it should.

    • What happens if a node fails?
    • What happens if our messages show up late?
    • What happens if we don't wait for every node to agree?
  • Assume all nodes in a distributed DBMS are well-behaved and under the same administrative domain

  • If we tell a node to commit a txn, then it will commit the txn(if there is no failure)

  • If you don't trust the other nodes in a distributed DBMS you need to use a Byzantine Fault Tolerant protocol for txns (blockchain).

  • Atomic Commit Protocol

    • When a multinode txn finishes, the DBMS needs to ask all the nodes involved whether it is safe to commit
    • Coordination of commit order of txns across nodes in a distributed DBMS regardless of whether data is replicated or partitioned.
    • Commit order = State machine

Examples

  • Two-Phase commit*

    • Each node records the inbound/outbound messages and outcome of each phase in a non-volatile storage log
    • On recovery examine the log for 2PC messages
      • if txn is in the prepared state, contact the coordinator.
      • if txn is not in the prepared state, abort it.
      • if txn was committing and this node is the coordinator, send COMMIT message to the participants.
    • If the coordinator crashes, participants must decide what to do after a timeout.
    • If a participant crashes, the coordinator assumes an abort response; again a timeout is used to decide whether a participant is dead.
    • 2PC Optimizations
      • Early Prepare Voting
        • node return query result if it knows no other query will run there.
      • Early Acknowledge After Prepare
        • if all nodes vote to commit a txn, the coordinator can send the client an ack that their txn was successful before the commit phase finishes (a minimal coordinator sketch follows below).
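
    A minimal two-phase-commit coordinator sketch over in-memory participant objects (illustrative only; real implementations force-write each phase to non-volatile storage and handle the timeouts described above).

        # 2PC sketch (ignores logging to disk, timeouts, and crash recovery).
        class Participant:
            def __init__(self, name):
                self.name = name
                self.state = "RUNNING"

            def prepare(self, txn_id):
                self.state = "PREPARED"      # a real node force-writes a PREPARE record first
                return True                  # vote yes; return False to vote abort

            def commit(self, txn_id):
                self.state = "COMMITTED"

            def abort(self, txn_id):
                self.state = "ABORTED"

        def two_phase_commit(coordinator_log, participants, txn_id):
            # Phase 1: prepare / collect votes.
            votes = [p.prepare(txn_id) for p in participants]
            decision = "COMMIT" if all(votes) else "ABORT"
            coordinator_log.append((txn_id, decision))   # decision must be durable before phase 2
            # Phase 2: broadcast the decision.
            for p in participants:
                p.commit(txn_id) if decision == "COMMIT" else p.abort(txn_id)
            return decision

        log = []
        nodes = [Participant("n1"), Participant("n2"), Participant("n3")]
        assert two_phase_commit(log, nodes, txn_id=7) == "COMMIT"
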
  • Three-Phase commit

  • Paxos*

    • Consensus protocol where a coord proposes an outcome(commit or abort) and then the participants vote on whether that outcome should succeed.
    • Does not block if a majority of participants are available and has provably minimal message delays in the best case.
    • PAXOS MADE LIVE, Consensus on Transaction Commit
    • K-safety - minimal number of required partitions.
    • Multi-Paxos
      • If the system elects a single leader that oversees proposing changes for some period then it can skip the propose phase.
        • fall back to full Paxos whenever there is a failure
      • The system periodically renews the leader(known as a lease) using another Paxos round
        • Nodes must exchange log entries during leader election to make sure that everyone is up-to-date.
    • 2PC vs Paxos
      • 2PC
        • Blocks if the coordinator fails after the prepare message is sent, until the coordinator recovers.
      • Paxos
        • Non-blocking if a majority of participants are alive, provided there is a sufficiently long period without further failures.
  • Raft

    • Similar to Paxos but with fewer node types.
    • Only nodes with the most up-to-date log can become leaders
  • ZAB(Zookeeper)

  • Viewstamped Replication

  • Replication

    • The DBMS can replicate a database across redundant nodes to increase availability.
    • Partitioned vs Non-Partitioned
    • Shared-Nothing vs Shared-Disk

Design decisions

  • Replica configuration

    • Approaches

      • Primary-Replica

        • All updates go to a designated primary for each object.
        • The primary propagates updates to its replicas without an atomic commit protocol.
        • Read-only txns may be allowed to access replicas.
        • If the primary goes down, then hold an election to select a new primary.
      • Multi-Primary

        • Txns can update data objects at any replica.
        • Replicas must synchronize with each other using an atomic commit protocol.
    • K-Safety

      • Threshold for determining the fault tolerance of the replicated database.
      • The value K represents the number of replicas per data object that must always be available.
      • If the number of replicas goes below this threshold then the DBMS halts execution and takes itself offline.
  • Propagation scheme

    • When a txn commits on a replicated database, the DBMS decides whether it must wait for that txn's changes to propagate to other nodes before it can send an ack to the application
    • Propagation Levels:
      • Synchronous (Strong Consistency)
        • The primary sends updates to replicas and then waits for them to ack that they fully applied(logged) the changes.
      • Asynchronous (Eventual Consistency)
        • The primary immediately returns the ack to the client without waiting for replicas to apply the changes.
        • Mostly used.
  • Propagation Timing

    • Approach
      • Continuous
        • DBMS sends log messages immediately as it generates them.
        • Also need to send a commit/abort message.
      • On Commit
        • DBMS only sends the log messages for a txn to the replicas once the txn commits
        • Do not waste time sending log records for aborted txns
        • Assumes that a txn log record fits entirely in memory.
  • Update Method

    • Approach
      • Active-Active
        • a txn executes at each replica independently
        • need to check at the end whether the txn ends up with the same result at each replica
      • Active-Passive
        • each txn executes at a single location and propagates the changes to the replica
        • either do physical or logical replication.
        • not the same as primary-replica vs multi-primary
  • Google Spanner

    • Google Geo-replicated DBMS
    • Schematized, semi-relational data model
    • Decentralized shared-disk architecture
    • Log-structured on disk storage
    • Concurrency Control
      • Strict 2PL + MVCC + Multi-Paxos + 2PC
      • Wound-Wait Deadlock Prevention.
      • Externally consistent global write-transactions with synchronous replication
      • Lock-free read-only transactions.
      • DBMS ensures ordering through globally unique timestamps generated from atomic clocks and GPS devices.
      • Database is broken up into tablets(partitions)
        • Use Paxos to elect leader in tablet group
        • Use 2PC for txns that span tablets.
    • Transaction Ordering
      • DBMS orders transactions based on physical "wall clock" time.
      • Guarantee strict serializability.
      • If T1 finishes before T2 then T2 should see the result of T1.
  • CAP Theorem

    • Consistent, Always Available, Network Partition Tolerant.
    • How a DBMS handles failures determines which elements of the CAP theorem they support.
      • Traditional/Distributed Relational DBMS
        • stop allowing updates until a majority of nodes are reconnected.
      • NoSQL DBMSs
        • provide mechanisms to resolve conflicts after nodes are reconnected.
  • PACELC Theorem

OLTP -> EXTRACT, TRANSFORM, LOAD -> OLAP

OLAP

  • Decision Support Systems

    • Applications that serve the management, operations and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data
  • Star schema vs Snowflake Schema

    • Fat tables used in star schemas.
    • Schema issues:
      • Normalization
        • Snowflake schemas take up less storage space.
        • Denormalized data models may incur integrity and consistency violations.
      • Query Complexity
        • Snowflake schemas require more joins to get the data needed for a query
        • Queries on star schemas will usually be faster.
  • Executing an OLAP query in a distributed DBMS is roughly the same as on a single-node DBMS.

  • For each operator, the DBMS considers where input is coming from and where to send output; Table scans, Joins, Aggregations, Sorting.

  • A distributed DBMS system architecture specifies the location of the db data files. This affects how nodes coordinate with each other and where they retrieve/store objects in the database.

  • Push vs Pull

    • Approach:
      • Push query to data
        • Send the query to the node that contains the data.
        • Perform as much filtering and processing as possible where data resides before transmitting over network.
      • Pull data to query
        • Bring the data to the node that is executing a query that needs it for processing.
        • This is necessary when there is no compute resources available where database files are located.
  • The data that a node receives from remote sources is cached in the buffer pool

    • Allows the DBMS to support intermediate results that are larger than the amount of memory available
    • Ephemeral pages are not persisted after a restart.
  • What happens to a long-running OLAP query if a node crashes during execution??

  • Query Fault Tolerance

    • Most shared nothing distributed OLAP DBMSs are designed to assume that nodes do not fail during query execution.
      • If one node fails during query execution, then the whole query fails.
    • DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover if nodes fail.
      • Be strategic about this.
    • Shared disk.
  • Query Planning

    • All the optimizations from above are still applicable in a distributed environments.
      • Predicate pushdown
      • Early projections
      • Optimal Join Orderings
    • Distributed query optimization is even harder because it must consider the physical location of data and network transfer costs.
    • Query plan fragments
      • Approach:
        • Physical operators
          • Generate a single query plan and then break it up into partition specific fragments
          • Most systems implement this
        • SQL
          • Rewrite the original query into partition-specific queries
          • Allows for local optimization at each node.
          • SingleStore + Vitess use this.
      • The efficiency of a distributed join depends on the target tables' partitioning schemes
      • One approach is to put entire tables on a single node and then perform the join
        • You lose the parallelism of a distributed DBMS
        • Costly data transfer over the network.
        • Does not scale.

  • Distributed Join Algorithms

    • To join table R and S, the DBMS needs to get the proper tuples on the same node.
    • Once the data is at the node, the DBMS then executes the same join algorithms that we discussed earlier in the semester; it needs to avoid false negatives due to missing tuples when running a local join on each node.
    • Variations on partitioning
      • Both tables are partitioned on different keys, if one of the tables is small, then DBMS "broadcasts" that table to all nodes.
      • Both tables are not partitioned on the join key, DBMS copies the tables by shuffling them across nodes.
    • Hard to find optimal partition key.
    • Semi-Join
      • Join type where the result only contains columns from the left table.
      • Distributed DBMSs use semi-join to minimize the amount of data sent during joins
        • This is like a projection pushdown.
      • Some DBMSs support SEMI JOIN natively or fake it with EXISTS (a minimal distributed semi-join sketch follows below)
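
    A minimal sketch of using a semi-join to cut network traffic (illustrative only; the tables-as-lists-of-dicts representation is an assumption): ship only R's join keys to the node holding S, filter S there, and send back just the matching rows before computing the full join locally.

        # Distributed semi-join sketch.
        def semi_join_fetch(local_r, remote_s, key):
            """Return only the rows of remote table S that join with local R,
            shipping just R's join-key column instead of whole tuples."""
            r_keys = {row[key] for row in local_r}                    # small projection sent out
            return [row for row in remote_s if row[key] in r_keys]    # evaluated on S's node

        r = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
        s = [{"id": 1, "b": 10}, {"id": 3, "b": 30}]
        matching_s = semi_join_fetch(r, s, key="id")                  # only the id=1 row comes back
        joined = [dict(rr, **ss) for rr in r for ss in matching_s if rr["id"] == ss["id"]]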

Cloud Systems

  • DBaaS; managed DBMS environments.

  • Newer systems are starting to blur the lines between shared-nothing and shared-disk i.e you can do simple filtering on Amazon S3 before copying data to compute nodes.

  • Approach

    • Managed DBMS
      • No significant modification to the DBMS to make it aware that it is running in a cloud environment.
    • Cloud-Native DBMS
      • Explicitly designed to run in a cloud environment
      • Shared-disk architecture.
      • example: Snowflake, Google BigQuery, Amazon Redshift, MS SQL Azure
  • Serverless Dbs

    • Rather than always maintaining compute resources for each customer, a serverless DBMS evicts tenants when they become idle.
    • On resume, either page the tenant's buffer pool and page table back in, or restart.
    • example: Neon, Fauna, PlanetScale, CockroachDB
  • Data Lakes

    • Repository for storing large amounts of structured, semi-structured and unstructured data without having to define a schema or ingest the data into proprietary internal formats
    • example: Trino, Redshift, Presto, Databricks, Hive, Snowflake

Universal Formats

  • Most DBMSs use proprietary on-disk binary file format for their databases.

  • The only way to share data between systems is to convert data into a common text-based format

    • CSV, JSON, XML
  • There are new open-source binary file formats that make it easier to access data across systems

  • Benefits:

    • High-levels languages can make calls into lower-level languages for compute-intensive tasks by passing pointers to data rather than making copies in different formats.
    • Data can be transferred between processes efficiently without much serialization overhead because memory format is also network format.
    • It is also easier to build connectors, drivers and integrations between various open source and commercial projects.
  • Examples:

    • Apache Parquet
      • compressed columnar storage from Cloudera/Twitter
      • write-once, read-many
    • Apache ORC
      • compressed columnar storage from Apache Hive
    • Apache CarbonData
      • compressed columnar storage with indexes from Huawei
    • Apache Iceberg
      • flexible data format that supports schema evolution from Netflix
    • HDF5
      • multi-dimensional arrays for scientific workloads
    • Apache Arrow
      • in-memory compressed storage from Pandas/Dremio
      • It is efficient for vectorised processing on modern h/w.
      • An IPC format is defined for exchanging metadata such as schema information (using Google Flatbuffers).
      • Flight protocol is used for efficiently streaming Arrow data over the network.

Apache Arrow

  • Arrow is a collection of libraries and specifications that make it easier to build high performance software utilities for processing and transporting large datasets.

  • It consists of a collection of libraries related to in-memory data processing(processing of data in RAM) including specs for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. This means that third-party projects that depend on Arrow don't need to use the entirety of the project and instead can only link against, embed or only include the portions that they need.

  • On-disk formats tend to focus more on increasing I/O throughput such as compressing the data to make it smaller and faster to read into memory, i.e Parquet.

  • Apache Arrow's focus is the in-memory format case, which targets CPU efficiency with tactics such as cache locality and vectorization of computation; common formats encourage interoperability between systems.

  • Arrow's raw wire data format is the same as it is in memory, allowing you to directly reference network memory buffers without deserialization.

  • Parquet and Arrow are two different flavors of column-based storage with different trade-offs made in their designs.

Disaggregated Components

  • A recent trend has been the breakout of OLAP sub-systems into standalone open-source components, typically done by organizations not in the business of selling DBMS s/w.

  • System catalogs

    • store metadata about the data.
    • HCatalog, Google Data Catalog, Amazon Glue Data Catalog
  • Intermediate Representation.

  • Node management

    • K8S, Apache YARN
  • Query Optimizers

    • Extendible search engine framework for heuristic and cost-based query optimization where DBMS provides transformation rules and cost estimates and framework returns either a logical or physical query plan.
    • Apache Calcite, Greenplum Orca
  • File Formats/ Access Libraries.

  • Execution engines / Fabrics.

    • Standalone libraries for executing vectorised query operators on columnar data; input is a DAG of physical operators and they require external scheduling and orchestration.
    • example: Velox, Datafusion, Intel OAP.

Architecture Overview

  • Query -> FRONTEND (parser) -> PLANNER (binder, rewriter, optimizer + cost models) -> CATALOG (metadata, data locations, statistics, data discovery) -> SCHEDULER (plan fragments) -> EXECUTION ENGINE (block requests) -> I/O SERVICE

Data categories

  • ref paper:Building an elastic query engine on disaggregated storage

  • Persistent Data

    • These are "Source of record" for the database, modern systems assume that these data files are immutable but can support updates by rewriting them.
  • Intermediate Data

    • Short-lived artifacts produced by query operators during execution and then consumed by other operators; it has no correlation to the amount of persistent data that the query reads or to its execution time.

Distributed System Architecture

  • ref paper: The Case for Shared Nothing

  • A distributed DBMS system architecture specifies the location of the database's persistent data files and this affects how nodes coordinate with each other and where they retrieve/store objects in the database.

  • Approaches;

  • Push Query to Data

    • This involves sending the query to the node that contains the data and performing as much filtering and processing as possible where data resides before transmitting over network.
  • Pull Data to Query

    • This involves bringing the data to the node that is executing a query that needs it for processing, ideal for when there are no compute resources available where the persistent data files are located.
  • Shared Nothing vs Shared Disk

Object Stores

  • Database tables are partitioned into large, immutable files stored in an object store where all attributes for a tuple are stored in the same file in a columnar layout(PAX), and the header(or footer) contains meta-data about columnar offsets, compression schemes, indexes and zone maps.
  • The DBMS retrieves a block's header to determine what byte ranges it needs to retrieve.
  • Each cloud vendor provides their own proprietary API to access data(PUT, GET, DELETE)

Embedded Database Logic

  • The application has a conversation with the DBMS to store/retrieve data.
    • each DBMS has its own network protocol.
    • client side APIs: JDBC, ODBC.
  • Conversational Database API.
  • Moving application logic into the DBMS can potentially provide several benefits
    • Fewer network round-trips
    • immediate notification of changes
    • DBMS spends less time waiting during transactions
    • Developers don't have to reimplement functionality
  • User-defined functions
    • A UDF is a function written by the application developer that extends the system's functionality beyond its built-in operations
      • it takes in input arguments(scalars)
      • perform some computations
      • return a result(scalars, tables)
    • UDF Defn
      • Return Types
        • scalar functions: return single data value
        • table functions: return a single result table.
      • Computation definition
        • sql functions
        • external programming language
      • a SQL-based UDF contains a list of queries that the DBMS executes in order when invoked
        • the function returns the result of the last query executed.
        • The SQL standard provides the ATOMIC keyword to tell the DBMS to track dependencies between the statements in a UDF.
      • also use external programming language
        • sandbox vs non-sandbox
    • Advantages
      • encourage modularity and code reuse
        • different queries can reuse the same application logic without having to reimplement it each time.
      • fewer network round-trips between application server and DBMS for complex operations
      • some types of application logic are easier to express and read as UDFs than SQL.
    • Disadvantages
      • Query optimizers treat UDFs as black boxes
        • unable to estimate cost if you don't know what a UDF is going to do when run.
      • It is difficult to parallelize UDFs due to correlated queries inside of them
        • some DBMSs will only execute queries with a single thread if they contain a UDF.
        • some UDFs incrementally construct queries.
      • complex UDFs in SELECT/WHERE clauses force the DBMS to execute iteratively
        • RBAR = row by agonizing row
        • things get worse if the UDF invokes queries due to implicit joins that the optimizer cannot see.
      • since the DBMS executes the commands in the UDF one-by-one, it is unable to perform cross-statement optimizations (a runnable UDF example follows below).
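
    A small runnable example of an external-language UDF using Python's built-in sqlite3 module (one concrete system; the table and function here are made up, and registration syntax differs per DBMS): the scalar function is registered on the connection and then invoked from SQL.

        # Scalar UDF example with sqlite3.
        import sqlite3

        def sales_tax(amount):
            # Hypothetical application logic reused by any query through the UDF.
            return round(amount * 0.16, 2)

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
        conn.executemany("INSERT INTO orders (total) VALUES (?)", [(100.0,), (59.9,)])

        # Register the Python function as a 1-argument SQL function named sales_tax.
        conn.create_function("sales_tax", 1, sales_tax)

        for row in conn.execute("SELECT id, total, sales_tax(total) FROM orders"):
            print(row)    # e.g. (1, 100.0, 16.0)
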
  • Stored Procedures
    • A stored procedure is a self-contained function that performs more complex logic inside of the DBMS
      • can have many input/output parameters
      • can modify the database table/structures
      • not normally used within a SQL query
    • Some DBMS distinguish UDFs vs stored procedures, but not all.
      • A UDF is meant to perform a subset of read-only computations within a query.
      • A stored procedure is meant to perform a complete computation that is independent of a query.
  • Database Triggers
    • A trigger instructs the DBMS to invoke a UDF when some event occurs in the database.
    • The developer has to define
      • What type of event will cause it to fire
      • The scope of the event
      • When it fires relative to that event.
    • Event Type: INSERT, UPDATE, DELETE, TRUNCATE, CREATE, ALTER, DROP
    • Event Scope: TABLE, DATABASE, VIEW, SYSTEM
      • Timing: BEFORE or AFTER the event, per query or per row (a runnable trigger example follows below).
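
    A small runnable trigger example, again using sqlite3 (event type INSERT, scope TABLE, timing AFTER / per row; the tables are made up for illustration): the trigger appends an audit row whenever a new account is inserted.

        # AFTER INSERT ... FOR EACH ROW trigger example with sqlite3.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
            CREATE TABLE audit_log (account_id INTEGER, note TEXT);

            CREATE TRIGGER log_new_account
            AFTER INSERT ON accounts          -- event type + scope
            FOR EACH ROW                      -- timing granularity
            BEGIN
                INSERT INTO audit_log VALUES (NEW.id, 'account created');
            END;
        """)
        conn.execute("INSERT INTO accounts (balance) VALUES (250.0)")
        print(conn.execute("SELECT * FROM audit_log").fetchall())   # [(1, 'account created')]
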
  • Change Notifications
    • A change notification is like a trigger except that the DBMS sends a message to an external entity that something notable has happened to the db.

      • think pub/sub system
      • can be chained with a trigger to pass along whenever a change occurs
    • SQL STANDARD: LISTEN + NOTIFY

  • All DBMSs support the basic primitive types in the SQL standard, basic arithmetic and string manipulation on them.
  • What if we want to store data that doesn't match any of the built-in types?
  • Complex Types
    • Attribute splitting
    • Application serialization
  • User-defined Types
    • is a special data type that is defined by the application developer that the DBMS can store natively
      • introduced by Postgres in 1980
      • added to SQL:1999
    • also referred to as structured user-defined types or structured types.
    • each DBMS exposes apis to create UDT.
  • Views
    • Creates a virtual table containing the output from a SELECT query. The view can then be accessed as if it was a real table.
    • This allows programmers to simplify a complex query that is executed often
      • wont make it run faster
    • Often used as a mechanism for hiding a subset of a table's attributes from certain users.
  • Select...into
    • creates a static table that does not get updated when the underlying table is updated.
  • Materialized Views
    • creates a view containing the output from a SELECT query that is retained (not recomputed)
      • some DBMSs automatically update matviews when the underlying tables change
      • other DBMSs(postgresql) require manual refresh.
  • Disadvantages:
    • Not portable
    • DBMSs don't like change
    • Potentially need to maintain different versions.

IN-MEMORY DATABASES

  • Development history of DBMS is about dealing with limitations of hardware.
  • DRAM capacities are large enough that most databases can fit in memory
    • structured data sets are smaller
    • unstructured or semi-structured data sets are larger.
  • Why not use traditional disk oriented DBMS with a really large cache??
  • Buffer Pool
  • Concurrency control
  • Logging and Recovery: STEAL + NO-FORCE buffer pool policies
  • LSN - log sequence numbers.
  • ref paper:OLTP through the looking glass and what we found there
  • Assume that the primary storage location of the database is permanently in memory.
  • storage access latencies
  • ref paper:lets talk about storage and recovery methods for non-volatile memory database systems
  • Data organization
    • an in-memory DBMS does not need to store the database in slotted pages but it will still organize tuples in blocks/pages
      • direct memory pointers vs record ids
      • fixed-length vs variable-length data pools
      • use checksums to detect software errors from trashing the database.
  • Why not Mmap???
  • Query processing
  • TimesTen
  • ref paper: Oracle TimesTen
  • Dali
  • ref paper: Dali: A High Performance Main Memory Storage Manager
  • P*TIME
  • ref paper: P*TIME: Highly Scalable OLTP DBMS for Managing Update-Intensive Stream Workload

DATA FORMATS AND ENCODING

  • OLAP workloads perform sequential scans on large segments of read-only data, while OLTP workloads use indexes to find individual tuples and stitch them back together without performing sequential scans; tree-based indexes are meant for queries with low-selectivity predicates and the need to accommodate incremental updates.

Format Design Decisions

  • File Meta-data

    • Files are self-contained to increase portability: they contain all the necessary information to interpret their contents without external data dependencies. Each file maintains global meta-data about its contents, e.g. table schema, row group offsets/lengths, tuple counts/zone maps.
  • Format layout

    • The most common formats use the PAX storage model, which splits data into row groups that contain one or more column chunks. The size of row groups varies per implementation and trades off compute vs memory, e.g. Parquet (1M tuples), ORC (250MB), Arrow (1024*1024 tuples).
  • Type system

    • It defines the data types that the format supports: physical (low-level byte representation) and logical (auxiliary types that map to physical types). Formats vary in the complexity of their type systems, which determines how much upstream producers/consumers need to implement.
  • Encoding schemes

    • It specifies how the format stores the bytes for contiguous/related data; multiple encoding schemes can be applied on top of each other to further improve compression, e.g. bitpacking, run-length encoding, dictionary encoding, delta encoding and frame-of-reference (a minimal sketch follows below).
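
    A minimal sketch of two of the listed schemes stacked on each other, dictionary encoding followed by run-length encoding (illustrative only; real formats work on packed bytes and bit widths rather than Python lists).

        # Dictionary encoding followed by run-length encoding.
        def dictionary_encode(values):
            codes, dictionary = [], {}
            for v in values:
                codes.append(dictionary.setdefault(v, len(dictionary)))
            return codes, dictionary              # dictionary maps value -> small integer code

        def run_length_encode(codes):
            runs = []                             # list of (code, run_length)
            for c in codes:
                if runs and runs[-1][0] == c:
                    runs[-1] = (c, runs[-1][1] + 1)
                else:
                    runs.append((c, 1))
            return runs

        column = ["US", "US", "US", "KE", "KE", "US"]
        codes, dictionary = dictionary_encode(column)     # [0, 0, 0, 1, 1, 0]
        print(run_length_encode(codes))                   # [(0, 3), (1, 2), (0, 1)]
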
  • Block compression

    • Compress data using a general-purpose algorithm, with the scope of compression based on the data provided as input. Notable considerations include computational overhead, compression vs decompression speed and data opaqueness, e.g. LZO, LZ4, Snappy and Zstd.
  • Filters

    • Zone maps: maintain min/max values per column at the file level and row-group level (a minimal pruning sketch follows below).
    • Bloom filters: track the existence of values for each column in a row group; more effective if values are clustered.
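
    A minimal zone-map sketch (illustrative only; row groups are just dicts of Python lists here): keep min/max per column per row group and skip any group whose range cannot satisfy the predicate.

        # Zone-map pruning sketch.
        def build_zone_maps(row_groups, column):
            return [(min(rg[column]), max(rg[column])) for rg in row_groups]

        def groups_to_scan(zone_maps, low, high):
            """Indices of row groups that may contain values in [low, high]."""
            return [i for i, (mn, mx) in enumerate(zone_maps)
                    if mx >= low and mn <= high]

        row_groups = [
            {"price": [1, 4, 9]},      # group 0: min 1, max 9
            {"price": [12, 15, 20]},   # group 1: min 12, max 20
            {"price": [21, 30, 35]},   # group 2: min 21, max 35
        ]
        zmaps = build_zone_maps(row_groups, "price")
        print(groups_to_scan(zmaps, low=13, high=22))   # [1, 2]; group 0 is skipped entirely
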
  • Nested data

    • Real-world data sets often contain semi-structured objects (JSON, protobufs), and a file format will want to encode the contents of these objects as if they were regular columns. This can be done either through record shredding (store paths in the nested structure as separate columns with additional meta-data about the paths) or length+presence encoding (same as shredding but maintain additional columns to track the number of entries at each path level and whether a key exists at that level for a record).
  • ref papers: Procella: Unifying Serving and Analytical Data at YouTube; Dremel: A Decade of Interactive SQL Analysis at Web Scale

  • Critiques of existing formats

    • Variable-sized runs, not SIMD friendly.
    • Eager decompression, no random access if using block compression.
    • Dependencies between adjacent values, i.e RLE, delta encoding.
    • Vectorization portability, ISAs have different SIMD capabilities.

BTRBLOCKS

  • ref paper: Btrblocks: Efficient columnar compression for data lakes
  • PAX-based file format with more aggressive nested encoding schemes than Parquet/ORC.
  • Uses a greedy algorithm to select the best encoding for a column chunk(based on sample) and then recursively tries to encode outputs of that encoding, no naive block compression(snappy, zstd).
  • Store a file's meta-data separately from the data.
  • Encoding schemes: RLE/One value, Frequency encoding, FOR+ Bitpacking, Dictionary encoding, Pseudodecimals, Fast Static Symbol Table(FSST), Roaring Bitmaps.
  • ref paper: FSST: Fast random access string compression.
  • ref paper: Better bitmap performance with roaring bitmaps
  • Encoding selection: collect a sample from the data and then try out all viable encoding schemes, repeat for three rounds.

FASTLANES

  • ref paper: The fastlanes compression layout: Decoding 100 billion integers per second with scalar code

  • Suite of encoding schemes that achieve better data parallelism through clever reordering of tuples to maximize useful work in SIMD operations. Similar nested encoding as BtrBlocks: Dictionary, FOR, Delta, RLE.

  • To future-proof the format they define a virtual ISA with 1024-bit SIMD registers.

  • Unified transposed layout: reorder values in a column in a manner that improves the DBMS's ability to process them in an efficient, vectorized manner via SIMD.

  • The previous encoding schemes scan data by examining the entire value of each attribute (all the bits at the same time); the DBMS cannot "short-circuit" comparisons on integer types because CPU instructions operate on entire words.

  • What if you could examine only a subset of each value's bits and then check the rest of the bits only if needed?

BIT-SLICED ENCODING.

  • Store bit-slices that represent the original data and then compare bits incrementally as needed (a minimal sketch follows below).
  • They can also be used for efficient aggregate computations, e.g. SUM(attr) using Hamming weight: use the POPCNT instruction to efficiently count the number of bits set to 1 in a register.
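
    A minimal bit-sliced sketch (illustrative only; rows are packed into the bits of Python ints rather than SIMD registers): one bitmap per bit position, a greater-than predicate evaluated slice by slice from the most significant bit, and SUM computed from per-slice popcounts (the Hamming-weight trick).

        # Bit-sliced encoding sketch.
        def bit_slices(values, width):
            slices = [0] * width                  # slices[k] bit i set <=> bit k of values[i] is 1
            for i, v in enumerate(values):
                for k in range(width):
                    if (v >> k) & 1:
                        slices[k] |= 1 << i
            return slices

        def popcount(x):
            return bin(x).count("1")              # stand-in for the POPCNT instruction

        def bsi_sum(slices):
            # SUM(attr) = sum over slices of popcount(slice) * 2^k.
            return sum(popcount(s) << k for k, s in enumerate(slices))

        def bsi_greater_than(slices, c, num_rows):
            eq = (1 << num_rows) - 1              # rows still equal to c on the bits seen so far
            gt = 0
            for k in reversed(range(len(slices))):          # most significant bit first
                if (c >> k) & 1:
                    eq &= slices[k]               # a 0 bit here means the value is already < c
                else:
                    gt |= eq & slices[k]          # a 1 bit here means the value is > c
                    eq &= ~slices[k]
            return gt                             # bitmap of rows where value > c

        values = [3, 9, 12, 7]
        slices = bit_slices(values, width=4)
        assert bsi_sum(slices) == sum(values)
        assert bsi_greater_than(slices, 7, len(values)) == 0b0110   # rows 1 (9) and 2 (12)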

BITWEAVING

  • ref paper: Bitweaving: Fast scans for main memory data processing
  • Alternate encoding scheme for columnar databases that supports efficient predicate evaluation on compressed data using SIMD with order-preserving dictionary encoding, bit-level parallelization and only require common instructions(no scatter/gather)

EXECUTION OPTIMIZATION

  • CPUs organize instructions into pipeline stages and the goal is to keep all parts of the processor busy at each cycle by masking delays from instructions that cannot complete in a single cycle.
  • Super-scalar CPUs support multiple pipelines, execute multiple instructions in parallel in a single cycle if they are independent.