Documenting your database schema

Max Tagher is the co-founder and CTO of Mercury.

July 24, 2024

Finally, we have a cron job that updates the comments in Postgres (COMMENT ON TABLE ...), where they can be previewed in psql and consumed by tools like data science IDEs.
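
As a rough sketch of what such a job might emit (not Mercury's actual implementation), the snippet below turns already-parsed docs into `COMMENT ON` statements. The function names `quote_literal`, `quote_ident`, and `comment_statements` are made up for this example, and it assumes Postgres is running with its default `standard_conforming_strings = on`, so only single quotes need escaping.

```python
def quote_literal(text: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    """Quote an identifier by wrapping it in double quotes."""
    return '"' + name.replace('"', '""') + '"'


def comment_statements(table: str, table_doc: str, column_docs: dict[str, str]) -> list[str]:
    """Build one COMMENT ON TABLE statement plus one COMMENT ON COLUMN per documented column."""
    statements = [
        f"COMMENT ON TABLE {quote_ident(table)} IS {quote_literal(table_doc)};"
    ]
    for column, doc in column_docs.items():
        statements.append(
            f"COMMENT ON COLUMN {quote_ident(table)}.{quote_ident(column)} "
            f"IS {quote_literal(doc)};"
        )
    return statements
```

A scheduled job could run these statements against the database after each deploy; re-running them is harmless because `COMMENT ON` overwrites the previous comment.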

-- | Table docs
TableName
  -- | Column 1 docs
  column1 ColumnType
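
To make the format concrete, here is a minimal sketch of a parser for it, assuming the rules are as simple as they look: `-- |` lines attach to the next definition, unindented names are tables, and indented names are columns. `TableDocs` and `parse_schema_docs` are hypothetical names, and a real schema file likely has more syntax than this handles.

```python
from dataclasses import dataclass, field

DOC_PREFIX = "-- |"


@dataclass
class TableDocs:
    name: str
    doc: str = ""
    columns: dict[str, str] = field(default_factory=dict)


def parse_schema_docs(source: str) -> list[TableDocs]:
    """Collect table and column docs from `-- |` comments."""
    tables: list[TableDocs] = []
    pending: list[str] = []  # doc lines waiting for the next definition
    for raw in source.splitlines():
        stripped = raw.strip()
        if not stripped:
            pending = []  # a blank line ends any dangling comment
            continue
        if stripped.startswith(DOC_PREFIX):
            pending.append(stripped[len(DOC_PREFIX):].strip())
            continue
        doc = "\n".join(pending).strip()
        pending = []
        name = stripped.split()[0]
        if raw[0].isspace():
            # Indented line: a column of the most recent table.
            if tables and doc:
                tables[-1].columns[name] = doc
        else:
            tables.append(TableDocs(name=name, doc=doc))
    return tables
```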

-- | Concise, direct, ~one line description
-- |
-- | Longer summary of the most important things about the table
-- | ...
-- | ...
TableName

-- | Represents someone a Payment can be sent to, e.g. a law firm.
Recipient

-- | Each row of TransactionMetadata is 1:1 with what users see on the /transactions page. Most user-facing code should deal with TransactionMetadata instead of lower-level concepts like LedgerTransactions.
TransactionMetadata

-- | This table serves two purposes:
-- | 1. Records when a user deletes a beneficial owner (BO) while in onboarding. Sometimes users do this to try to get around compliance rules, like a BO from Somalia being cause for rejection.
-- | We store that it was deleted so we can see if they were trying to trick us.
-- |
-- | 2. Records when a beneficial owner is no longer a member of the company, like if the CEO were to quit.
-- | We store this data for compliance/logging purposes.
-- |
-- | We don't use any foreign keys in this table because we don't want it to affect "real" business logic.
DeletedOnboardingBeneficialOwner

-- | Stores deleted beneficial owners (BOs), either ones deleted during onboarding, or previously onboarded BOs that are no longer part of the company.
-- |
-- | <...details here...>
DeletedOnboardingBeneficialOwner

After the introductory line, give a longer description of the most important things to know about the table. This is MercuryAccount:

-- | Parent of all kinds of Mercury bank accounts. Each kind of Mercury bank account has an associated child table:
-- |
-- | - Checking and Savings accounts are MercuryDepositoryAccounts
-- | - Treasury accounts are MercuryTreasuryAccounts
-- | - Credit accounts are MercuryCreditAccounts
-- |
-- | The existence of a row in the relevant child table is enforced by database constraint triggers.
-- |
-- | You can watch a video about this table here: https://mercury.lessonly.com/lesson/fake-url-database-schema?section_id=5087909
MercuryAccount

Here’s another example, Attachment:

-- | An attachment is just a link to an uploaded document in S3
-- |
-- | An attachment can exist in three phases: not uploaded, uploaded, or deleted
-- | - 'not uploaded' has null 'uploadedAt', 'deletedAt', and 'deletedBy'
-- | - 'uploaded' has non-null 'uploadedAt' and null 'deletedAt', 'deletedBy'
-- | - 'deleted' has non-null 'uploadedAt', 'deletedAt', and 'deletedBy'
-- |
-- | The upload process: first, we insert a 'not uploaded' attachment into the
-- | database and pass the corresponding S3 URL back to the frontend.
-- | If the frontend can successfully upload the document, it sends us a confirmation
-- | and we set the 'uploadedAt'. A deleted attachment will remain in S3 but
-- | should not be shown as attached to the user.
Attachment
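
The three phases in those docs map directly onto which columns are NULL. A small sketch of that mapping, with `attachment_phase` as a hypothetical helper, might look like this; any combination the docs don't describe is treated as an error rather than guessed at.

```python
from datetime import datetime
from typing import Optional


def attachment_phase(
    uploaded_at: Optional[datetime],
    deleted_at: Optional[datetime],
    deleted_by: Optional[int],
) -> str:
    """Classify an attachment row using the NULL patterns from the table docs."""
    if uploaded_at is None and deleted_at is None and deleted_by is None:
        return "not uploaded"
    if uploaded_at is not None and deleted_at is None and deleted_by is None:
        return "uploaded"
    if uploaded_at is not None and deleted_at is not None and deleted_by is not None:
        return "deleted"
    raise ValueError("combination of NULLs not described by the Attachment docs")
```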

For example, I could tell you that at Mercury an external_account models a third party bank account, and you'd have some idea of what that means.

“This table is directly 1:1 with users. A user_settings row is inserted when a user is created, as enforced by a constraint trigger.”

This means it's safe to INNER JOIN those two tables and get the full result set.
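
To see why that guarantee matters, here is a small illustration using SQLite as a stand-in for Postgres. The `email` and `theme` columns are invented for the example, and the constraint trigger itself is not modeled; the point is only what happens to the join when the 1:1 guarantee does not hold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE user_settings (
    user_id INTEGER PRIMARY KEY REFERENCES users (id),
    theme TEXT NOT NULL
);
INSERT INTO users (id, email) VALUES (1, 'a@example.com'), (2, 'b@example.com');
INSERT INTO user_settings (user_id, theme) VALUES (1, 'dark'), (2, 'light');
""")


def row_counts(conn):
    """Return (number of users, number of rows produced by the INNER JOIN)."""
    users = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    joined = conn.execute(
        "SELECT COUNT(*) FROM users u "
        "INNER JOIN user_settings s ON s.user_id = u.id"
    ).fetchone()[0]
    return users, joined


# Every user has a settings row, so the join returns every user.
counts_with_guarantee = row_counts(conn)

# Without the guarantee, a user with no settings row silently drops out of the join.
conn.execute("INSERT INTO users (id, email) VALUES (3, 'c@example.com')")
counts_without_guarantee = row_counts(conn)
```

With the documented guarantee, a reader does not need to reach for a LEFT JOIN defensively or wonder whether rows went missing.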

In the MercuryAccount example above, the docs say it is a parent to three types of child tables. This lets the reader know that MercuryAccount is roughly an abstract concept with concrete subtypes. This helps the reader:

For some columns, like a user’s password, it’s fairly obvious where that data came from, and the contents are largely uninteresting. But if the password field is nullable, that raises questions: when can password be NULL? What happens if a user tries to log in with a NULL password? When a value can be NULL, this often reveals key system behavior:

User
  -- | This will be NULL in two cases:
  -- | 1. The user hasn’t finished setting up their account
  -- | 2. Customer support forced a password reset on the account
  password HashedPassword Nullable

But password could be NULL for an entirely different reason—we need docs to know what NULL means:

User
  -- | This will be NULL if a user hasn’t set up password authentication, i.e. is using WebAuthn for passwordless authentication.
  password HashedPassword Nullable

Attachment
  -- | If the attachment is deleted, the row will have non-null 'uploadedAt', 'deletedAt', and 'deletedBy'
  deletedAt UTCTime Nullable
  deletedBy UserId Nullable

The examples above focused on a value being NULL or not, but NULL is just a common discrete state for a value. The advice applies equally to enums, where you should document what each enum value implies. For example, a common use of enums in the Mercury schema is for an abstract table to indicate which table has more specific data. Our schema docs tell you that the value of the mercury_accounts.kind column indicates if additional details are in the mercury_depository_accounts table or the mercury_credit_accounts table.
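
One way to picture the kind-to-table relationship is as an explicit mapping. In this sketch the enum values are hypothetical, and `mercury_treasury_accounts` is inferred from the MercuryTreasuryAccounts name in the table docs rather than stated outright; only the depository and credit table names appear verbatim above.

```python
from enum import Enum


class MercuryAccountKind(Enum):
    # Hypothetical values; the real contents of mercury_accounts.kind may differ.
    DEPOSITORY = "depository"
    TREASURY = "treasury"
    CREDIT = "credit"


# What each enum value implies: which child table holds the extra details.
CHILD_TABLE = {
    MercuryAccountKind.DEPOSITORY: "mercury_depository_accounts",
    MercuryAccountKind.TREASURY: "mercury_treasury_accounts",
    MercuryAccountKind.CREDIT: "mercury_credit_accounts",
}


def child_table_for(kind: str) -> str:
    """Look up the child table for a stored kind value; unknown kinds raise ValueError."""
    return CHILD_TABLE[MercuryAccountKind(kind)]
```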

ServerSession
  id
  userId
  remoteIp
  country
  lastUsed
  createdAt

ServerSession
  id
  userId
  -- | If the IP changes, a trigger inserts a job to scrape metadata on the IP.
  -- | After a user logs in, their IP can change frequently on a mobile device or VPN. It will be rarer on desktop.
  remoteIp !mutable
  country
  -- | The last time the user made a request with this session.
  -- | Warning: As a performance optimization, this is only updated if it’s been more than ten minutes.
  lastUsed !mutable
  createdAt

lastUsed is pretty obvious from its name, but the fact that it updates at most once every ten minutes is probably a surprise. Maybe approximateLastUsed would be a better name? Having to explain a concept is sometimes a red flag that the naming of the column, or even the behavior itself, is too confusing.
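
The throttling described in the lastUsed docs can be sketched in a few lines. This is an illustration of the documented behavior, not Mercury's code; `next_last_used` and `UPDATE_INTERVAL` are names invented here.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=10)


def next_last_used(stored: datetime, now: datetime) -> datetime:
    """Return the value lastUsed should hold after a request arrives at `now`.

    The stored value is only replaced when more than ten minutes have passed,
    so it can lag the true last request by up to that interval.
    """
    if now - stored > UPDATE_INTERVAL:
        return now
    return stored
```

Anyone computing "sessions active in the last five minutes" from this column would get misleading results, which is exactly the kind of surprise the warning in the docs heads off.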

Did you expect that remoteIp would change? Without knowing remoteIp is mutable, you could easily assume server_sessions was effectively a log of every IP a user used, but that’s not quite true. Even people on our security team were surprised by this!

Finally, why doesn’t country change if remoteIp can, since presumably the country is inferred from a GeoIP database? The answer is that country should change, but we just haven’t implemented code to update it.

The ServerSession example shows how innocuous-looking column names (remoteIp, lastUsed) might be hiding unexpected behavior that documentation helps surface.

Note: Mercury has annotations like !mutable on columns that get turned into triggers enforcing that behavior, but you could also have a comment describing whether a column is mutable. More on this in Appendix A.

Unlike code, where the latest commit can be read while ignoring past code, data has history. The most common example of this is a new column being added and only populated going forward, in which case you should (a) document when that column was added and (b) ideally make a constraint to enforce that new data isn't NULL.

ApiToken
  -- | Column added September 2021. A constraint ensures it’s non-null after 2021-09-14.
  createdDuringSession ServerSessionId Maybe

This not only helps someone understand the data, it also helps you understand the code, because otherwise you’d be assuming there is some branch of code where API tokens are intentionally created without a ServerSessionId.
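
A constraint of that shape can be sketched as a CHECK that tolerates NULL only for rows created on or before the cutoff. The example uses SQLite as a stand-in for Postgres; the `api_tokens` table name, the `created_on` column, and the exact handling of the cutoff day are assumptions, since the docs above only say the column is non-null after 2021-09-14.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE api_tokens (
    id INTEGER PRIMARY KEY,
    created_on TEXT NOT NULL,          -- ISO date, e.g. '2021-09-20'
    created_during_session INTEGER,    -- nullable only for historical rows
    CHECK (created_on <= '2021-09-14' OR created_during_session IS NOT NULL)
)
""")


def try_insert(created_on, session_id):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        conn.execute(
            "INSERT INTO api_tokens (created_on, created_during_session) VALUES (?, ?)",
            (created_on, session_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Historical rows with a NULL session still load, while any new code path that forgets to set the session fails loudly at insert time.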

PlaidAccount
  -- | 99% of the time, the last 4 digits of the account. Rarely, last 2, 3, or NULL.
  -- | SELECT LENGTH(mask), COUNT(*) FROM plaid_accounts GROUP BY 1;
  mask NonEmptyText299 Maybe
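
The query in that comment is easy to rerun whenever the docs need refreshing. Below it runs against an in-memory SQLite table holding five made-up rows, purely to show the shape of the result; the rows are invented for illustration and say nothing about Mercury's real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plaid_accounts (id INTEGER PRIMARY KEY, mask TEXT)")
conn.executemany(
    "INSERT INTO plaid_accounts (mask) VALUES (?)",
    [("1234",), ("5678",), ("9012",), ("42",), (None,)],  # made-up sample masks
)

# Same query as in the column docs: how many rows exist per mask length.
rows = conn.execute(
    "SELECT LENGTH(mask), COUNT(*) FROM plaid_accounts GROUP BY 1"
).fetchall()

# Map each length (None for NULL masks) to its row count.
distribution = dict(rows)
```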

At Mercury, we push this further with attributes on tables (!blockDeletes) and columns (!mutable, !mutable-if-null). These attributes then get turned into triggers that enforce those properties.
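
To give a feel for what turning an attribute into a trigger could mean, here is a sketch that generates a delete-blocking trigger. It uses SQLite syntax as a stand-in (Postgres triggers are written differently, with a trigger function), and the `audit_events` table and `block_deletes_trigger` helper are hypothetical; this is not how Mercury's generator works.

```python
import sqlite3


def block_deletes_trigger(table: str) -> str:
    """Build SQLite DDL for a trigger that aborts every DELETE on `table`.

    `table` is interpolated directly, so it must be a trusted identifier.
    """
    return (
        f"CREATE TRIGGER {table}_block_deletes "
        f"BEFORE DELETE ON {table} "
        f"BEGIN SELECT RAISE(ABORT, 'deletes are blocked on {table}'); END;"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_events (id INTEGER PRIMARY KEY, note TEXT NOT NULL)")
conn.execute(block_deletes_trigger("audit_events"))
conn.execute("INSERT INTO audit_events (note) VALUES ('created')")


def delete_is_blocked() -> bool:
    """Attempt a DELETE and report whether the trigger rejected it."""
    try:
        conn.execute("DELETE FROM audit_events")
        return False
    except sqlite3.DatabaseError:
        return True
```

Because the property is enforced by the database rather than described in prose, the documentation cannot drift out of date.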

UserSecurityLog
  -- | When a new row is inserted, a trigger inserts a job to scrape IP metadata.
  remoteIp IP

SweepAccountBalance
  -- | We should only have one of these per org per day
  UniqueOrganizationBalanceDate organizationId balanceDate

Special thanks to Sebastian Bensusan, Janey Muñoz, Sarah Cain, Matt Parsons, Holly Leslie, and Elizabeth Barton for reviewing this post!

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won't usually need your code; it'll be obvious.

Documenting your database schema

Why document the schema

Tables are core data structures

Schema documentation is used by more than the engineering department

Table documentation

Concise, direct opening

Summary of most notable things about the table

General product/business functionality provided

Table lifecycle

Relationship to other tables

Related code & links

Column specific documentation

When to expect a given value, and the implications of it

Mutability: If and why a column can change, and what changes it

Column history

Distribution of the actual data

Conclusion

Implementing at your company

Appendix A: Types are documentation

Appendix B: Beyond tables and columns
