As a company scales, written documentation becomes essential for efficiently sharing knowledge. Of the different types of documentation, database schema documentation is one of the most important: the schema makes up the core data structures a typical web app interacts with; it’s read by employees outside of engineering; and unlike other documentation, it is both easy to find and known to exist.
This importance has been borne out in practice at Mercury, where despite requiring documentation on all new tables and columns, internal surveys still show demand for more database schema documentation, with much of that demand being about including the right information in the docs. In light of that, this post will offer detailed suggestions for what information to include in schema documentation.
Why document the schema
Tables are core data structures
The first reason to document database tables and columns is that they are the primary data structures used by most web apps. If you understand those data structures, you can skip straight to reading the specific code you're interested in that touches them.
Conversely, if you don't understand how a table is used, your first step becomes reading all the code touching it and reverse engineering the table's purpose. This inefficiency only gets worse as your codebase grows: more code touches the same table, and tables start to serve multiple purposes.
Even as early as code review, reading an explanation of a table’s purpose helps set the scene for the code to come, making the payoff of documentation felt even before code is merged.
Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won't usually need your code; it'll be obvious.
— Eric S. Raymond, The Cathedral and the Bazaar
Schema documentation is used by more than the engineering department
The second reason to document the schema is that data science and product teams heavily benefit from those docs, since they probably can't read the code itself to get context.
Furthermore, their queries will naturally cross team boundaries, making the query authors less familiar with the local concepts an engineering team works with every day. For example, to understand the impact of a new policy, the compliance team might want to know how many international wires (wires team) are sent by Chilean business owners (KYC team), and how much revenue is associated with those customers (growth team).
At Mercury, we use the comments from the file format used by our ORM to create an HTML file with all table documentation in one place. This file is used by engineers and non-engineers alike, and gets regular demand for more features and polish. I highly recommend generating such a file to surface schema documentation.
An example of the generated table docs (better styling is on our to-do list)
Finally, we have a cron job that updates the comments in Postgres (COMMENT ON TABLE ...), where they can be previewed in psql and consumed by tools like data science IDEs.
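For instance, such a cron job might emit statements like these (the table and comment text here are illustrative, not our real schema):

COMMENT ON TABLE recipients IS
  'Represents someone a Payment can be sent to, e.g. a law firm.';
COMMENT ON COLUMN recipients.deleted_at IS
  'Non-null once the recipient has been soft-deleted.';

Once the comments are in Postgres, psql's \d+ shows the column comments, and \dt+ shows the table comment.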
Table documentation
Note: The examples below use Persistent’s syntax for schema documentation:
-- | Table docs
TableName
-- | Column 1 docs
column1 ColumnType
I recommend that table documentation take this structure:
-- | Concise, direct, ~one line description
-- |
-- | Longer summary of the most important things about the table
-- | ...
-- | ...
TableName
Concise, direct opening
Start with a concise 1–2 sentence description of the table. It can be somewhat obvious/repetitive, just to confirm to people that it is what it seems to be. It should roughly answer the question “What does one row of this table represent?” Two examples:
-- | Represents someone a Payment can be sent to, e.g. a law firm.
Recipient
-- | Each row of TransactionMetadata is 1:1 with what users see on the /transactions page. Most user-facing code should deal with TransactionMetadata instead of lower-level concepts like LedgerTransactions.
TransactionMetadata
You might think it's redundant with the table name, but this one sentence helps clarify intent, given the “metadata” name is misleading and there are multiple types of “transactions” at our company (Plaid transactions, database transactions, ledger transactions, etc.).
When you skip this step, the documentation feels awkward to read top-to-bottom. Instead of briefly setting the scene and allowing a reader to quickly determine whether they’re in the right place, it requires reading everything to get the gist. Here’s an example of skipping this step:
-- | This table serves two purposes:
-- | 1. Records when a user deletes a beneficial owner (BO) while in onboarding. Sometimes users do this to try to get around compliance rules, like a BO from Somalia being cause for rejection.
-- | We store that it was deleted so we can see if they were trying to trick us.
-- |
-- | 2. Records when a beneficial owner is no longer a member of the company, like if the CEO were to quit.
-- | We store this data for compliance/logging purposes.
-- |
-- | We don't use any foreign keys in this table because we don't want it to affect "real" business logic.
DeletedOnboardingBeneficialOwner
Here’s a quick rewrite with the summary up-front:
-- | Stores deleted beneficial owners (BOs), either ones deleted during onboarding, or previously onboarded BOs that are no longer part of the company.
-- |
-- | <...details here...>
DeletedOnboardingBeneficialOwner
Summary of the most notable things about the table
After the introductory line, give a longer description of the most important things to know about the table. This is MercuryAccount:
-- | Parent of all kinds of Mercury bank accounts. Each kind of Mercury bank account has an associated child table:
-- |
-- | - Checking and Savings accounts are MercuryDepositoryAccounts
-- | - Treasury accounts are MercuryTreasuryAccounts
-- | - Credit accounts are MercuryCreditAccounts
-- |
-- | The existence of a row in the relevant child table is enforced by database constraint triggers.
-- |
-- | You can watch a video about this table here: https://mercury.lessonly.com/lesson/fake-url-database-schema?section_id=5087909
MercuryAccount
Here’s another example, Attachment:
-- | An attachment is just a link to an uploaded document in S3
-- |
-- | An attachment can exist in three phases: not uploaded, uploaded, or deleted
-- | - 'not uploaded' has null 'uploadedAt', 'deletedAt', and 'deletedBy'
-- | - 'uploaded' has non-null 'uploadedAt' and null 'deletedAt', 'deletedBy'
-- | - 'deleted' has non-null 'uploadedAt', 'deletedAt', and 'deletedBy'
-- |
-- | The upload process: first, we insert a 'not uploaded' attachment into the
-- | database and pass the corresponding S3 URL back to the frontend.
-- | If the frontend can successfully upload the document, it sends us a confirmation
-- | and we set the 'uploadedAt'. A deleted attachment will remain in S3 but
-- | should not be shown as attached to the user.
Attachment
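As an aside, phase rules like these can be enforced rather than just documented. A minimal Postgres sketch, assuming snake_case versions of the columns above:

ALTER TABLE attachments ADD CONSTRAINT valid_attachment_phase CHECK (
  (uploaded_at IS NULL AND deleted_at IS NULL AND deleted_by IS NULL) -- not uploaded
  OR (uploaded_at IS NOT NULL AND deleted_at IS NULL AND deleted_by IS NULL) -- uploaded
  OR (uploaded_at IS NOT NULL AND deleted_at IS NOT NULL AND deleted_by IS NOT NULL) -- deleted
);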
What should go into the summary? Here are some top things to consider:
General product/business functionality provided
If you can put your table in the context of the product, the reader can much more easily transfer their product knowledge into what to expect from the schema.
For example, I could tell you that at Mercury an external_account models a third party bank account, and you'd have some idea of what that means.
But if I told you that external accounts are linked through Plaid — the same UI used to link a bank account on Venmo — that might give you an intuitive understanding of:
- How this data gets created and validated
- What features an external_account might be used for
- Roughly what tables an external_account would be connected to
This understanding will be much stronger for your coworkers than for you reading this example, since coworkers will have extensive product knowledge.
Table lifecycle
It's helpful to know when data is:
- INSERTed - Essential for knowing when the data will exist, if you need to LEFT JOIN, etc.
- UPDATEd - (I'll talk about this later when discussing columns)
- DELETEd - Knowing if rows are DELETEd at all is very helpful, and if so, whether there is any trace of that data afterwards
Here are three examples of documentation you might have about table lifecycle, and how it would impact someone querying the data:
- “A user_preferences row is inserted lazily when the user first changes a setting from the default.”
  - Implication: Code cannot assume a user_preferences row exists, and queries should likely use LEFT JOIN.
- “Each time a company is locked by our compliance department, a new company_lockdown row is inserted with active=true. We unlock by setting active=false. Only one row can be active at a time, as guaranteed by a partial unique index.”
  - Implication: A JOIN between companies and company_lockdowns could return multiple rows, unless we also constrain to active=true (see the SQL sketch after this list).
- “If a recipients row is deleted, it gets removed from this table and copied into recipients_history.”
  - Implication: We may need to query both recipients and recipients_history to understand historical data.
(These implications are pretty helpful information! In practice, I’d include them in the docs as well.)
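To make the second example concrete, here's a SQL sketch of the schema and the query implication (all names are hypothetical):

CREATE TABLE company_lockdowns (
  id bigserial PRIMARY KEY,
  company_id bigint NOT NULL REFERENCES companies (id),
  active boolean NOT NULL DEFAULT true
);

-- "Only one row can be active at a time, as guaranteed by a partial unique index"
CREATE UNIQUE INDEX one_active_lockdown_per_company
  ON company_lockdowns (company_id) WHERE active;

-- A JOIN still returns one row per historical lockdown,
-- so constrain to the active one:
SELECT c.*
FROM companies c
JOIN company_lockdowns l ON l.company_id = c.id AND l.active;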
Relationship to other tables
The foreign keys in your table already give a basic idea of its connections to other tables. You can go further with notes like:
“This table is directly 1:1 with users. A user_settings row is inserted when a user is created, as enforced by a constraint trigger.”
This means it's safe to INNER JOIN those two tables and get the full result set.
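Constraint triggers like that aren't exotic. Here's a minimal Postgres sketch of one (not Mercury's actual implementation; names are hypothetical):

CREATE FUNCTION check_user_settings_exist() RETURNS trigger AS $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM user_settings WHERE user_id = NEW.id) THEN
    RAISE EXCEPTION 'users row % has no user_settings row', NEW.id;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Deferred to commit time, so the user and their settings
-- can be inserted in the same transaction, in either order.
CREATE CONSTRAINT TRIGGER users_must_have_settings
  AFTER INSERT ON users
  DEFERRABLE INITIALLY DEFERRED
  FOR EACH ROW EXECUTE FUNCTION check_user_settings_exist();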
In the MercuryAccount example above, the docs say it is a parent to three types of child tables. This lets the reader know that MercuryAccount is roughly an abstract concept with concrete subtypes. This helps the reader:
- Understand what to expect from MercuryAccount (columns that are generic to all the concrete subtypes)
- Know where to go for more information
- Avoid bad assumptions, like thinking the COUNT(*) of mercury_accounts is the number of checking and savings accounts we have
Related code & links
Pointing a user to a key module of the codebase or separate written or video documentation is always helpful. This might be a third party's documentation as well. Here are some examples:
- See the spec at https://www.notion.so/foo/Big-Spec-c6a2795cec68427c86a2446d4f3c71b8
- Invites are sent from this page: https://demo.mercury.com/settings/team
- See https://dev.maxmind.com/geoip/docs/databases/city-and-country
I highly recommend creating your own supplemental materials to document your schema. At Mercury, I created a video series going over the most core tables in our schema, and for our reimbursements project I made a Figjam + video walkthrough explaining how tables connected together.
Column-specific documentation
So far I've talked about documenting tables, as opposed to individual columns. There's of course overlap between the two, but documentation at the table level can:
- More easily aggregate multiple columns into a cohesive narrative
- Describe the row lifecycle better, because individual columns aren’t INSERTed or DELETEd
- Describe other tables that foreign key to it. Columns only describe what they foreign key to.
It's normal for table documentation to duplicate some column documentation. Similarly, the advice I give for documenting a table could easily apply to documenting a column (for example, including a link to a Slack conversation that discussed adding that column).
With that in mind, let's cover what is best suited for column-specific documentation.
When to expect a given value, and the implications of it
For some columns, like a user’s password, it’s fairly obvious where that data came from, and the contents are largely uninteresting. But if the password field is nullable, that raises questions: when can password be NULL? What happens if a user tries to log in with a NULL password? When a value can be NULL, this often reveals key system behavior:
User
-- | This will be NULL in two cases:
-- | 1. The user hasn’t finished setting up their account
-- | 2. Customer support forced a password reset on the account
password HashedPassword Nullable
But password could be NULL for an entirely different reason, so we need docs to know what NULL means:
User
-- | This will be NULL if a user hasn’t set up password authentication, i.e. is using WebAuthn for passwordless authentication.
password HashedPassword Nullable
Even small implications of NULL values can be helpful to know about:
Attachment
-- | If the attachment is deleted, the row will have non-null 'uploadedAt', 'deletedAt', and 'deletedBy'
deletedAt UTCTime Nullable
deletedBy UserId Nullable
The examples above focused on a value being NULL or not, but NULL is just a common discrete state for a value. The advice applies equally to enums, where you should document what each enum value implies. For example, a common use of enums in the Mercury schema is for an abstract table to indicate which table has more specific data. Our schema docs tell you that the value of the mercury_accounts.kind column indicates whether additional details are in the mercury_depository_accounts table or the mercury_credit_accounts table.
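If the enum lives in Postgres, the type itself can carry that documentation. A sketch (the names here are illustrative, not our real schema):

CREATE TYPE mercury_account_kind AS ENUM ('depository', 'treasury', 'credit');

COMMENT ON TYPE mercury_account_kind IS
  'Which child table holds the details: mercury_depository_accounts, '
  'mercury_treasury_accounts, or mercury_credit_accounts respectively.';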
Mutability: If and why a column can change, and what changes it
Readers want to know:
- If a column can change at all. Mutation suggests you're losing historical data from edits, while immutability suggests a new copy of the data is made for each edit.
- Why a column changes. For example, why might the amount of a transaction change?
- What causes changes — is it driven by webhook, cron job, or user interaction? If user interaction, what backend route or frontend page drives that update?
Here's an abridged example table from Mercury's codebase. Which fields do you think can change?
ServerSession
id
userId
remoteIp
country
lastUsed
createdAt
ServerSession Spoilers (read after you consider what fields are mutable!)
ServerSession
id
userId
-- | If the IP changes, a trigger inserts a job to scrape metadata on the IP.
-- | After a user logs in, their IP can change frequently on a mobile device or VPN. It will be rarer on desktop.
remoteIp !mutable
country
-- | The last time the user made a request with this session.
-- | Warning: As a performance optimization, this is only updated if it’s been more than ten minutes.
lastUsed !mutable
createdAt
lastUsed is pretty obvious from its name, but the fact that it's only updated every ten minutes is probably a surprise. Maybe approximateLastUsed would be a better name? Having to explain a concept is sometimes a red flag that the naming of the column, or even the behavior itself, is too confusing.
Did you expect that remoteIp would change? Without knowing remoteIp is mutable, you could easily assume server_sessions was effectively a log of every IP a user used, but that’s not quite true. Even people on our security team were surprised by this!
Finally, why doesn’t country change if remoteIp can, since presumably the country is inferred from a GeoIP database? The answer is that country should change, but we just haven’t implemented code to update it.
The ServerSession example shows how innocuous-looking column names (remoteIp, lastUsed) might be hiding unexpected behavior that documentation helps surface.
Note: Mercury has annotations like !mutable on columns that get turned into triggers enforcing that behavior, but you could also have a comment describing whether a column is mutable or not. More on this in Appendix A.
Column history
Unlike code, where the latest commit can be read while ignoring past code, data has history. The most common example of this is a new column being added and only populated going forward, in which case you should (a) document when that column was added and (b) ideally make a constraint to enforce that new data isn't NULL.
ApiToken
-- | Column added September 2021. A constraint ensures it’s non-null after 2021-09-14.
createdDuringSession ServerSessionId Maybe
This not only helps someone understand the data, it also helps you understand the code, because otherwise you’d be assuming there is some branch of code where API tokens are intentionally created without a ServerSessionId.
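The constraint mentioned in that comment could be a plain CHECK. A sketch, assuming snake_case names and a created_at column on the table:

ALTER TABLE api_tokens ADD CONSTRAINT created_during_session_required CHECK (
  created_at < '2021-09-14' OR created_during_session IS NOT NULL
);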
Similarly, it’s good to document future plans, such as in the case that a column is nullable now but will be backpopulated in the future, or that a column is deprecated and shouldn’t be used in new code.
Distribution of the actual data
Documentation typically should only cover the “shape” of the data (like what a valid email address can look like) instead of the contents of the database itself (like how many email addresses use gmail). Occasionally though, the data is extreme or unexpected in a way that’s relevant to code and query authors. For example, only old rows will have a certain value of an enum, or one in a million rows will have a certain value. I recommend including the SQL you used to check this claim, so future readers can verify it.
PlaidAccount
-- | 99% of the time, the last 4 digits of the account. Rarely, last 2, 3, or NULL.
-- | SELECT LENGTH(mask), COUNT(*) FROM plaid_accounts GROUP BY 1;
mask NonEmptyText299 Maybe
Conclusion
This post advised that each table’s documentation begin with a concise, direct, roughly one-line description explaining the table. This opening should answer the question “What does one row of this table represent?” It should then be followed by a summary of the most notable aspects of the table, which should often include this information:
- The product or business context of the table
- The lifecycle of the table (when rows are INSERTed/UPDATEd/DELETEd)
- The relationship to other tables
- Links to related code, documentation, product specs, etc.
Then for each column, I suggested covering these things:
- When to expect a given value, and the implications of that value (especially if that value is NULL or an enum)
- If and why a column can change, and if so what causes it
- The history of the column (if the column has changed over time)
- The distribution of values in the underlying data (only if the underlying values are unexpected)
Ultimately, you’ll need to apply your judgment to figure out which of these is relevant for a given table or column, and there are bound to be things worth mentioning that aren’t covered in this blog post. But hopefully the suggestions here form a solid baseline for what should be documented on each table or column in your schema.
Implementing at your company
If you liked this post, or just want to see more schema documentation at your company, here are some approaches you could take:
- Make schema documentation a requirement. At Mercury, we’ve added tests that require documentation of all tables and columns. But at a smaller company, you could probably enforce this through common understanding and PR review.
- Inevitably, documentation for some columns will feel so obvious it’s useless, but I believe this is an acceptable tradeoff for ensuring that documentation on important columns isn’t forgotten.
- This requirement might feel onerous to a small company trying to move fast. However, even with eight employees Mercury dealt with confusing database concepts that needed multiple explanations in Slack, when a single canonical explanation in docs would have been preferable. You’re also likely to have some semi-technical people early on, like a CEO, who write SQL on an ad hoc basis. Finally, it’s much easier to add the documentation up front, when the context is fresh, than retroactively.
- Establish some sort of new hire training on how to document the database schema. If you’re fully sold on everything in this post you could link to it, but feel free to copy/paste the pieces you find most important, and adapt the examples to fit your company.
- Make your schema documentation more accessible to the company. Mercury’s database documentation started as just the comments in our ORM files, evolved into a markdown file stored in the Github repo, then an HTML file uploaded by a Github action. Once the schema is more accessible outside of engineering, you’ll have stronger support for spending time on it.
Appendix A: Types are documentation
This post has focused on writing bespoke comments for every table and column in your database. This is great, but it’s even better if you can obviate (some of) the need for documentation by using standardized types and attributes across your schema, like in these examples:
- Using a type like Postgres' inet for an IP address is better than a text column with documentation about how IPs can be formatted.
- Using Postgres domains, you can create types like positive_dollar or social_security_number, avoiding the need to document that a column's value is always >0, or whether an SSN is consistently stored without dashes (sketched below).
- Using Postgres enums avoids the need to document what possible values a string might have.
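For instance, the domains mentioned above might be defined like this (the exact rules are illustrative):

CREATE DOMAIN positive_dollar AS numeric(12, 2)
  CHECK (VALUE > 0);

CREATE DOMAIN social_security_number AS text
  CHECK (VALUE ~ '^[0-9]{9}$'); -- stored without dashes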
At Mercury, we push this further with attributes on tables (!blockDeletes) and columns (!mutable, !mutable-if-null). These attributes then get turned into triggers that enforce those properties.
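As a sketch of what a !mutable-style annotation could compile to (this is not our actual generated code), a trigger can reject updates to any column not marked mutable:

CREATE FUNCTION server_sessions_reject_mutation() RETURNS trigger AS $$
BEGIN
  IF NEW.user_id IS DISTINCT FROM OLD.user_id
     OR NEW.country IS DISTINCT FROM OLD.country THEN
    RAISE EXCEPTION 'immutable column on server_sessions was updated';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER server_sessions_immutability
  BEFORE UPDATE ON server_sessions
  FOR EACH ROW EXECUTE FUNCTION server_sessions_reject_mutation();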
Standardized types and attributes improve documentation in three ways:
- They enforce that the documentation matches the expected behavior.
- Standard attributes can be better displayed in documentation. For example, in an HTML table describing the columns, you can simply have a "Mutable?" column with a check mark. This is easier to parse, and means the same thing across all tables.
- They greatly increase the chances that the behavior is enforced and documented.
Appendix B: Beyond tables and columns
This post focused on documenting tables and their columns, but a schema consists of much more. You should also consider documenting database features such as triggers and constraints, like so:
UserSecurityLog
-- | When a new row is inserted, a trigger inserts a job to scrape IP metadata.
remoteIp IP
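The trigger being documented there might look roughly like this (the jobs table and its payload shape are assumptions for the sketch):

CREATE FUNCTION enqueue_ip_metadata_scrape() RETURNS trigger AS $$
BEGIN
  INSERT INTO jobs (job_type, payload)
  VALUES ('scrape_ip_metadata', json_build_object('ip', NEW.remote_ip));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER user_security_log_scrape_ip
  AFTER INSERT ON user_security_log
  FOR EACH ROW EXECUTE FUNCTION enqueue_ip_metadata_scrape();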
Sometimes these database features can be neatly tied to a single column's documentation. Ideally though, your ORM has a clean way to model the database feature, allowing you to document it directly. Here we document a unique constraint on two columns:
SweepAccountBalance
-- | We should only have one of these per org per day
UniqueOrganizationBalanceDate organizationId balanceDate
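In raw Postgres, the equivalent constraint can even carry the comment itself (assuming snake_case table and column names):

ALTER TABLE sweep_account_balances
  ADD CONSTRAINT unique_organization_balance_date
  UNIQUE (organization_id, balance_date);

COMMENT ON CONSTRAINT unique_organization_balance_date
  ON sweep_account_balances
  IS 'We should only have one of these per org per day';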
In practice, tables and columns make up the vast majority of what people query, so I consider documenting additional database features to be a nice-to-have.
Special thanks to Sebastian Bensusan, Janey Muñoz, Sarah Cain, Matt Parsons, Holly Leslie, and Elizabeth Barton for reviewing this post!
Max Tagher