Databases, also referred to as structured data, are an essential part of any data centre. While databases do not typically hold a high percentage of the terabytes housed in a given data centre, they do hold a high percentage of mission-critical data. Understanding their unique structure and operation is key to backing them up.
Structured data cannot be backed up like unstructured data due to three very big challenges. First, databases are typically stored in data files that are constantly changing as long as something is making updates to the database. This means you cannot just back them up like any other file.
Second, most databases will have some kind of journal that can be replayed to either restore transactions after a point-in-time restore or to roll back a partially completed transaction after a crash.
And third, a typical restore starts by restoring the data files from the most recent back-up and is then followed with a restore from a journal in order to restore the database to as recent a state as possible. Therecovery point objective (RPO) made possible with typical single-step restores of most back-up systems—possibly more than 24 hours—is simply not good enough for mission critical databases.
The key to getting the backups of your database right lies in understanding how your database solves these challenges.
There are at least 13 different database models, and knowing how to back up yours starts with knowing what kind of database you are backing up.
These models include: relational (the most common), key-value, time series, document, graph, search engine, wide column, object oriented, RDF, multivalue, native XML, navigational, and event. The following is a list of just the most popular models, along with a few whose popular databases have generated a lot of backup questions.
A relational database management system (RDBMS) is what most people think of when they say the word database: a series of tables with a defined schema (table layout), records (rows), and attributes (values). Examples include Oracle, SQL Server, MySQL, and PostgreSQL. These databases are often called SQL databases, after the query language they use.
A very simple NoSQL (Not only SQL) DBMS, consisting of keys and values, where you can look up the value if you know the key. Popular examples are Redis and DynamoDB.
A NoSQL database specifically designed to handle time data, as each entry has a time stamp. The popular Prometheus database is an example and is used quite a bit in Kubernetes.
A schema-free NoSQL DBMS is designed specifically to store documents. Records do not need to conform to any uniform standard and can store very different types of data. JSON is often used to store documents in such a database. MongoDB is the most popular database that supports only the document model.
Another schema-free NoSQL DBMS that can store very large numbers of columns of data without a predefined schema is the wide-column model. Column names and keys can be defined throughout the database. Cassandra is the best known database of this type.
Database terminology is also important, so what follows is a list of important terms. Not all databases use the same terms, but they should have a similar term that means the same thing. NoSQL databases often use very different terms or may lack something similar to the item in question.
Datafile: A datafile is where a database stores its data. This may be a raw device (e.g., /dev/hda1 in Linux), or a “cooked” file (e.g., /sap/datafiles/dbs06.dbf or c:\MySQL\datafile.dbf). At this point, most databases use cooked or regular files as datafiles, and most have more than one for each database.
Table: This is where things get a bit murky. In a SQL, relational, database, a table is a bunch of related values that behaves kind of like a virtual spreadsheet. NoSQL databases may have something similar or they may not.
Tablespace: A tablespace is a space where you put tables and is a collection of one or more data files. If your database doesn’t have tables, it probably doesn’t have tablespaces.
Partition: Modern databases can divvy up and spread or partition a table across multiple resources, including multiple tablespaces.
Sharding: Sharding takes partitioning to another level and is the key to large scale-out databases. Sharding can even place pieces—shards—of a table on different nodes.
Master database: A master database keeps track of the status of all databases and data files. If multiple databases are allowed, it needs to keep track of them as well.
Transaction: A transaction is an activity within a database that changes one or more attributes within one or more tables. Simple transactions change one attribute, and complex transactions will change many attributes as a single, atomic action. NoSQL databases tend to use simple transactions, and many who use them don’t even think of their transactions as such.
Transaction log: A transaction log records each transaction and what elements it changed. This information is used in case of a system crash or after a restore to either undo or redo transactions.
There are two very different ways databases ensure that views of inserted or updated database data are the same for all viewers of the database. These are referred to as consistency models, and they affect backup and recovery.
The first is immediate consistency, also known as strong consistency, and it ensures that all users will see the same data at the same time, regardless of where or how they view the data. Most traditional, relational, databases follow this model.
The second model is an eventually consistent or weak-consistency database, which ensures that a given attribute will eventually be consistent for all viewers, but that may take some time. A great example of eventual consistency is within the DNS system, which has to wait until the time-to-live for DNS records has expired before updating information about domain names. This can take up to 72 hours.
What, how, and why are you backing up?
If you’re responsible for backing up a database, you need to understand how it is built and how it works. You need to understand where it stores its data, such as data files, whether or not it uses complex or simple transactions, and where it stores the log of those transactions. You will need to know how to get a consistent backup of the stored data and the transaction log.
You also need to understand how distributed your database is. Is it partitioned, but all within one host or is it sharded and spread across dozens or hundreds of hosts? If it is the latter, you will most likely be dealing with an eventually consistent database. Getting a consistent snapshot of a database spread across hundreds of nodes will be quite challenging, and restoring it will be just as challenging.
Some may think that an eventually consistent database that uses replication across many nodes doesn’t need to be backed up, but it definitely does. While you are protected against node failure, you are definitely not protected against human error. If you drop a table, it doesn’t matter how replicated it is. You will need to restore it.