Engineering

Documenting your database schema

Written By

Max Tagher

Graphic illustration of code against a grid backdrop | Documenting Your Database Wchema | Mercury engineering blog
Copy Link
Share on Twitter
Share on LinkedIn
Share on Facebook
Copy Link
Share on Twitter
Share on LinkedIn
Share on Facebook

As a company scales, written documentation becomes essential for efficiently sharing knowledge. Of the different types of documentation, database schema documentation is one of the most important: the schema makes up the core data structures a typical web app interacts with; it’s read by employees outside of engineering; and unlike other documentation, it is both easy to find and known to exist.

This importance has been borne out in practice at Mercury, where despite requiring documentation on all new tables and columns, internal surveys still show demand for more database schema documentation, with much of that demand being about including the right information in the docs. In light of that, this post will offer detailed suggestions for what information to include in schema documentation.

Why document the schema

Tables are core data structures

The first reason to document database tables and columns is that they are the primary data structures used by most web apps. If you understand those data structures, you can immediately move to reading the specific code you're interested in that touches them.

Conversely, if you don't understand how a table is used, your first step becomes read all the code touching it, then reverse engineer the purpose of the table. The inefficiency only gets worse as your codebase grows, though — more code will touch the same table, and tables start to have multiple purposes.

Even as early as code review, reading an explanation of a table’s purpose helps set the scene for the code to come, making the payoff of documentation felt even before code is merged.

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won't usually need your code; it'll be obvious.
— Eric S. Raymond, The Cathedral and the Bazaar

Schema documentation is used by more than the engineering department

The second reason to document the schema is that data science and product teams heavily benefit from those docs, since they probably can't read the code itself to get context.

Furthermore, their queries will naturally cross team boundaries, making the query authors less familiar with the local concepts an engineering team works with every day. For example, to understand the impact of a new policy, the compliance team might want to know how many international wires (wires team) are sent by Chilean business owners (KYC team), and how much revenue is associated with those customers (growth team).

At Mercury, we use the comments from the file format used by our ORM to create an HTML file with all table documentation in one place. This file is used by engineers and non-engineers alike, and gets regular demand for more features and polish. I highly recommend generating such a file to surface schema documentation.

An example of the generated table docs

An example of the generated table docs (better styling is on our to-do list)

Finally, we have a cron job that updates the comments in Postgres (COMMENT ON TABLE ...), where they can be previewed in psql and consumed by tools like data science IDEs.

Table documentation

Note: The examples below use Persistent’s syntax for schema documentation:

Copy Code
-- | Table docs
TableName
  -- | Column 1 docs
  column1 ColumnType

I recommend that table documentation takes this structure:

Copy Code
-- | Concise, direct, ~one line description
-- |
-- | Longer summary of the most important things about the table
—- | ...
—- | ...
TableName

Concise, direct opening

Start with a concise 1–2 sentence description of the table. It can be somewhat obvious/repetitive, just to confirm to people that it is what it seems to be. It should roughly answer the question “What does one row of this table represent?” Two examples:

Copy Code
-- | Represents someone a Payment can be sent to, e.g. a law firm.
Recipient
Copy Code
-- | Each row of TransactionMetadata is 1:1 with what users see on the /transactions page. Most user-facing code should deal with TransactionMetadata instead of lower-level concepts like LedgerTransactions.
TransactionMetadata

You might think it's redundant with the table name, but this one sentence helps clarify intent, given the “metadata” name is misleading and there are multiple types of “transactions” at our company (Plaid transactions, database transactions, ledger transactions, etc.).

When you skip this step, it will feel awkward to read top-to-bottom. Instead of briefly setting the scene, allowing a reader to quickly determine if they’re in the right place, the documentation will require reading everything to get the gist. Here’s an example of skipping this step:

Copy Code
-- | This table serves two purposes:
-- | 1. Records when a user deletes a beneficial owner (BO) while in onboarding. Sometimes users do this to try to get around compliance rules, like a BO from Somalia being cause for rejection.
-- | We store that it was deleted so we can see if they were trying to trick us.
-- |
-- | 2. Records when a beneficial owner is no longer a member of the company, like if the CEO were to quit.
-- | We store this data for compliance/logging purposes.
-- |
-- | We don't use any foreign keys in this table because we don't want it to affect "real" business logic.
DeletedOnboardingBeneficialOwner

Here’s a quick rewrite with the summary up-front:

Copy Code
-- | Stores deleted beneficial owners (BOs), either ones deleted during onboarding, or previously onboarded BOs that are no longer part of the company.
-- |
-- | <...details here...>
DeletedOnboardingBeneficialOwner

Summary of most notable things about the table

After the introductory line, give a longer description of the most important things to know about the table. This is MercuryAccount:

Copy Code
-- | Parent of all kinds of Mercury bank accounts. Each kind of Mercury bank account has an associated child table:
-- |
-- | - Checking and Savings accounts are MercuryDepositoryAccounts
-- | - Treasury accounts are MercuryTreasuryAccounts
-- | - Credit accounts are MercuryCreditAccounts
-- |
-- | The existence of a row in the relevant child table is enforced by database constraint triggers.
-- |
-- | You can watch a video about this table here: https://mercury.lessonly.com/lesson/fake-url-database-schema?section_id=5087909
MercuryAccount

Here’s another example, Attachment:

Copy Code
-- | An attachment is just a link to an uploaded document in S3
-- |
-- | An attachment can exist in three phases: not uploaded, uploaded, or deleted
-- | - 'not uploaded' has null 'uploadedAt', 'deletedAt', and 'deletedBy'
-- | - 'uploaded' has non-null 'uploadedAt' and null 'deletedAt', 'deletedBy'
-- | - 'deleted' has non-null 'uploadedAt', 'deletedAt', and 'deletedBy'
-- |
-- | The upload process: first, we insert a 'not uploaded' attachment into the
-- | database and pass the corresponding S3 URL back to the frontend.
-- | If the frontend can successfully upload the document, it sends us a confirmation
-- | and we set the 'uploadedAt'. A deleted attachment will remain in S3 but
-- | should not be shown as attached to the user.
Attachment

What should go into the summary? Here are some top things to consider:

General product/business functionality provided

If you can put your table in the context of the product, the reader can much more easily transfer their product knowledge into what to expect from the schema.

For example, I could tell you that at Mercury an external_account models a third party bank account, and you'd have some idea of what that means.

But if I told you that external accounts are linked through Plaid — the same UI used to link a bank account on Venmo — that might give you an intuitive understanding of:

  • How this data gets created and validated
  • What features an external_account might be used for
  • Roughly what tables an external_account would be connected to

This understanding will be much stronger for your coworkers than for you reading this example, since coworkers will have extensive product knowledge.

Table lifecycle

It's helpful to know when data is:

  1. INSERTed - Essential for knowing when the data will exist, if you need to LEFT JOIN, etc.
  2. UPDATEd - (I'll talk about this later when discussing columns)
  3. DELETEd - Knowing if rows are DELETEd at all is very helpful, and if so, if there is any trace of that data afterwards

Here are three examples of documentation you might have about table lifecycle, and how it would impact someone querying the data:

  • “A user_preferences row is inserted lazily when the user first changes a setting from the default.”
    • Implication: Code cannot assume a user_preferences row exists, and queries should likely use LEFT JOIN.
  • “Each time a company is locked by our compliance department, a new company_lockdown row is inserted with active=true. We unlock by setting active=false. Only one row can be active at a time, as guaranteed by a partial unique index.”
    • A JOIN between companies and company_lockdowns could return multiple rows, unless we also constrain to active=true.
  • “If a recipients row is deleted, it gets removed from this table and copied into recipients_history.”
    • We may need to query both recipients and recipients_history to understand historical data.

(These implications were pretty helpful information! In practice I’d also include them in the docs)

Relationship to other tables

The foreign keys in your table should give a basic idea of connection to other tables. You can go further with notes like:

“This table is directly 1:1 with users. A user_settings row is inserted when a user is created, as enforced by a constraint trigger.”

This means it's safe to INNER JOIN those two tables and get the full result set.

In the MercuryAccount example above, the docs say it is a parent to three types of child tables. This lets the reader know that MercuryAccount is roughly an abstract concept with concrete subtypes. This helps the reader:

  • Understand what things to expect from MercuryAccount (columns that are generic to all the concrete subtypes)
  • Where to go for more information
  • Prevents the reader from making bad assumptions like that the COUNT(*) of mercury_accounts is the number of checking and savings accounts we have.

Related code & links

Pointing a user to a key module of the codebase or separate written or video documentation is always helpful. This might be a third party's documentation as well. Here are some examples:

I highly recommend creating your own supplemental materials to document your schema. At Mercury, I created a video series going over the most core tables in our schema, and for our reimbursements project I made a Figjam + video walkthrough explaining how tables connected together.

A diagram made in FigJam illustrating a database schema

Column specific documentation

So far I've talked about documenting tables, as opposed to individual columns. There's of course overlap between the two, but documentation at the table level can:

  • More easily aggregate multiple columns into a cohesive narrative
  • Describe the row lifecycle better, because individual columns aren’t INSERTed or DELETEd
  • Describe other tables that foreign key to it. Columns only describe what they foreign key to.

It's normal for table documentation to duplicate some column documentation. Similarly, the advice I give for documenting a table could easily apply to documenting a column (for example, including a link to a Slack conversation that discussed adding that column).

With that in mind, let's cover what is best suited for column-specific documentation.

When to expect a given value, and the implications of it

For some columns, like a user’s password, it’s fairly obvious where that data came from, and the contents are largely uninteresting. But if the password field is nullable, that raises questions: when can password be NULL? What happens if a user tries to login with a NULL password? When a value can be NULL, this often reveals key system behavior:

Copy Code
User
  -- | This will be NULL in two cases:
  —- | 1. The user hasn’t finished setting up their account
  —- | 2. Customer support forced a password reset on the account
  password HashedPassword Nullable

But password could be NULL for an entirely different reason—we need docs to know what NULL means:

Copy Code
User
  -- | This will be NULL if a user hasn’t setup password authentication, i.e. is using WebAuthn for passwordless authentication.
  password HashedPassword Nullable

Even small implications of NULL values can be helpful to know about:

Copy Code
Attachment
  -- | If the attachment is deleted, the row will have non-null 'uploadedAt', 'deletedAt', and 'deletedBy'
  deletedAt UTCTime Nullable
  deletedBy UserId Nullable

The examples above focused on a value being NULL or not, but NULL is just a common discrete state for a value. The advice applies equally to enums, where you should document what each enum value implies. For example, a common use of enums in the Mercury schema is for an abstract table to indicate which table has more specific data. Our schema docs tell you that the value of the mercury_accounts.kind column indicates if additional details are in the mercury_depository_accounts table or the mercury_credit_accounts table.

Mutability: If and why a column can change, and what changes it

Readers want to know:

  • If a column can change at all. Mutation suggests you're losing historical data from edits, and being immutable indicates there could be a new copy of the data made for each edit.
  • Why a column changes. For example, why might the amount of a transaction change?
  • What causes changes — is it driven by webhook, cron job, or user interaction? If user interaction, what backend route or frontend page drives that update?

Here's an abridged example table from Mercury's codebase. Which fields do you think can change?

Copy Code
ServerSession
  id
  userId
  remoteIp
  country
  lastUsed
  createdAt



ServerSession Spoilers (read after you consider what fields are mutable!)

Copy Code
ServerSession
  id
  userId
  -- | If the IP changes, a trigger inserts a job to scrape metadata on the IP.
  -- | After a user logs in, their IP can change frequently on a mobile device or VPN. It will be rarer on desktop.
  remoteIp !mutable
  country
  -- | The last time the user made a request with this session.
  -- | Warning: As a performance optimization, this is only updated if it’s been more than ten minutes.
  lastUsed !mutable
  createdAt

lastUsed is pretty obvious from its name, but it only updating once every ten minutes is probably a surprise. Maybe approximateLastUsed would be a better name? Having to explain a concept is sometimes a red flag that indicates the naming of the column or even the behavior itself is too confusing.

Did you expect that remoteIp would change? Without knowing remoteIp is mutable, you could easily assume server_sessions was effectively a log of every IP a user used, but that’s not quite true. Even people on our security team were surprised by this!

Finally, why doesn’t country change if remoteIp can, since presumably the country is inferred from a GeoIP database? The answer is that country should change, but we just haven’t implemented code to update it.

The ServerSession example shows how innocuous-looking column names (remoteIp, lastUsed) might be hiding unexpected behavior that documentation helps surface.

Note: Mercury has annotations like !mutable on columns that get turned into triggers enforcing that behavior, but you could also have a comment describing if a column is mutable or not. More on this in Appendix A.

Column history

Unlike code, where the latest commit can be read while ignoring past code, data has history. The most common example of this is a new column being added and only populated going forward, in which case you should (a) document when that column was added and (b) ideally make a constraint to enforce that new data isn't NULL.

Copy Code
ApiToken
  -- | Column added September 2021. A constraint ensures it’s non-null after 2021-09-14.
  createdDuringSession ServerSessionId Maybe

This not only helps someone understand the data, it also helps you understand the code, because otherwise you’d be assuming there is some branch of code where API tokens are intentionally created without a ServerSessionId.

Similarly, it’s good to document future plans, such as in the case that a column is nullable now but will be backpopulated in the future, or that a column is deprecated and shouldn’t be used in new code.

Distribution of the actual data

Documentation typically should only cover the “shape” of the data (like what a valid email address can look like) instead of the contents of the database itself (like how many email addresses use gmail). Occasionally though, the data is extreme or unexpected in a way that’s relevant to code and query authors. For example, only old rows will have a certain value of an enum, or one in a million rows will have a certain value. I recommend including the SQL you used to check this claim, so future readers can verify it.

Copy Code
PlaidAccount
  -- | 99% of the time, the last 4 digits of the account. Rarely, last 2, 3, or NULL.
  -- | SELECT LENGTH(mask), COUNT(*) FROM plaid_accounts GROUP BY 1;
  mask NonEmptyText299 Maybe

Conclusion

This post advised that each table’s documentation begin with a concise, direct, ~one line sentence explaining the table. This opening should answer the question “What does one row of this table represent?” This should then be followed by a summary of the most notable aspects of the table, which should often include this information:

  • The product or business context of the table
  • The lifecycle of the table (when rows are INSERTed/UPDATEd/DELETEd)
  • The relationship to other tables
  • Links to related code, documentation, product specs, etc.

Then for each column, I suggested covering these things:

  • When to expect a given value, and the implications of that value (especially if that value is NULL or an enum)
  • If and why a column can change, and if so what causes it
  • The history of the column (if the column has changed over time)
  • The distribution of values in the underlying data (only if the underlying values are unexpected)

Ultimately, you’ll need to apply your judgment to figure out which of these is relevant for a given table or column, and there’s bound to be things worth mentioning that aren’t covered in this blog post. But hopefully the suggestions here form a solid baseline for what should be documented on each table or column in your schema.

Implementing at your company

If you liked this post, or just want to see more schema documentation at your company, here are some approaches you could take:

  1. Make schema documentation a requirement. At Mercury, we’ve added tests that require documentation of all tables and columns. But at a smaller company, you could probably enforce this through common understanding and PR review.
    1. Inevitably, documentation for some columns will feel so obvious it’s useless, but I believe this is an acceptable tradeoff for ensuring that documentation on important columns isn’t forgotten.
    2. This requirement might feel onerous to a small company trying to move fast. However, even with eight employees Mercury dealt with confusing database concepts that needed multiple explanations in Slack, when a single canonical explanation in docs would have been preferable. You’re also likely to have some semi-technical people early on, like a CEO, who write SQL on an ad hoc basis. Finally, it’s much easier to add the documentation up front, when the context is fresh, than retroactively.
  2. Establish some sort of new hire training on how to document the database schema. If you’re fully sold on everything in this post you could link to it, but feel free to copy/paste the pieces you find most important, and adapt the examples to fit your company.
  3. Make your schema documentation more accessible to the company. Mercury’s database documentation started as just the comments in our ORM files, evolved into a markdown file stored in the Github repo, then an HTML file uploaded by a Github action. Once the schema is more accessible outside of engineering, you’ll have stronger support for spending time on it.

Appendix A: Types are documentation

This post has focused on writing bespoke comments for every table and column in your database. This is great, but it’s even better if you can obviate (some of) the need for documentation by using standardized types and attributes across your schema, like in these examples:

  • Using a type like Postgres' inet for an IP address is better than a text column with documentation about how IPs can be formatted.
  • Using Postgres domains, you can create types like positive_dollar or social_security_number, avoiding the need to document that a column's value is always >0, or whether an SSN is consistently stored without dashes.
  • Using Postgres enums avoids the need to document what possible values a string might have.

At Mercury, we push this further with attributes on tables (‍!blockDeletes) and columns (‍‍!mutable, !mutable-if-null). These attributes then get turned into triggers that enforce those properties.

Standardized types and attributes improve documentation in three ways:

  1. They enforce that the documentation matches the expected behavior.
  2. Standard attributes can be better displayed in documentation. For example, in an HTML table describing the columns, you can simply have a "Mutable?" column with a check mark. This is easier to parse, and means the same thing across all tables.
  3. They greatly increase the chances that the behavior is enforced and documented.

Appendix B: Beyond tables and columns

This post focused on documenting tables and their columns, but a schema consists of much more. You should also consider documenting database features like triggers and constraints, like so:

Copy Code
UserSecurityLog
  -- | When a new row is inserted, a trigger inserts a job to scrape IP metadata.
  remoteIp IP

Sometimes these database features can be neatly tied to a single column's documentation. Ideally though, your ORM has a clean way to model the database feature, allowing you to document it directly. Here we document a unique constraint on two columns:

Copy Code
SweepAccountBalance
  -- | We should only have one of these per org per day
  UniqueOrganizationBalanceDate organizationId balanceDate

In practice, tables and columns make up the vast majority of what people query, so I consider documenting additional database features to be a nice-to-have.

Special thanks to Sebastian Bensusan, Janey Muñoz, Sarah Cain, Matt Parsons, Holly Leslie, and Elizabeth Barton for reviewing this post!

Notes
Written by

Max Tagher

Share
Copy Link
Share on Twitter
Share on LinkedIn
Share on Facebook