Over six years at Mercury, I’ve collected some good practices to enforce on your database schema. This is the code I’d most recommend other companies consider adopting — it’s generic, was only added after experience showed the need, and is much easier to have in place from the get-go. For each lint, I’ve included the Postgres-flavored SQL that drives it, but you’ll probably want to convert the SQL into a test that gives clear error messages.
Correctness
All tables should have primary keys
It’s totally possible to make a table without a primary key, but you probably never want to do that. Before Mercury wrote the test below, we had four tables that were unintentionally missing primary keys.
Luckily for us, these missing keys didn’t have catastrophic consequences, since the tables had a unique constraint on a foreign key column that was logically equivalent to a primary key. The worst that happened was an error when we started syncing data to a separate database for our data science team. But you might not always be so lucky, so as a matter of hygiene, I recommend having a primary key on every table.
Here’s how to get a list of tables that are missing a primary key in Postgres:
SELECT relname FROM pg_class pgc
JOIN pg_namespace pgns ON pgns.oid = pgc.relnamespace
WHERE pgns.nspname = 'public'
AND pgc.relkind = 'r'
AND pgc.oid NOT IN
(SELECT pgc.oid
FROM pg_class pgc
JOIN pg_index pgi ON pgi.indrelid = pgc.oid
JOIN pg_namespace pgns ON pgns.oid = pgc.relnamespace
WHERE pgi.indisprimary = true
AND pgc.relkind = 'r'
AND pgns.nspname = 'public'
);
Just put this into a test and confirm it returns zero results, and you’re good!
(In order to keep this blog post a reasonable length, I’m not going to explain how each SQL snippet works. But if you’re interested, Postgres has docs on all of these metadata tables.)
Database timezone in UTC
Ok, so this isn’t actually a schema lint, but you’re likely to get weird behavior when a developer’s local Postgres is on their local timezone instead of UTC. It’s much easier to catch this with a test for the correct timezone than trying to debug the broken code it causes. Simply run this SQL in your tests and confirm it returns UTC:
SHOW TIMEZONE;
As part of the test failure message, suggest updating postgresql.conf (to change all databases) or running ALTER DATABASE db_name SET TIMEZONE TO 'UTC';
Immutable columns
If your data science team asked you "Hey, can the created_at field ever change?" or "Do we ever change the customer a transaction is associated with?", you’d probably say, "No, that can never happen"—with just a thought in the back of your mind that, somewhere across your hundreds of thousands of lines of code, some nasty little bug is lurking to prove you wrong.
At Mercury, such a bug reared its head when some errant code began updating the created_at field of our api_tokens table to the current time (a repsert gone wrong). Our graph of "API tokens created per week" was based on the created_at field, so it looked like we were suddenly creating thousands of new tokens per day, and that we potentially had a critical security vulnerability on our hands.
Because we were working off the precondition that created_at couldn’t change, our graph wasn’t telling us what we thought it was, and it took us a while to realize what was going on. What’s more, we lost the timestamp of when an API token was actually created.
One way to prevent something like this would be to enforce that certain database columns can’t change, either via database permissions or triggers.
Enforcing immutability with triggers
Here’s a sketch of how you might do this with triggers. First, label which fields are mutable in your ORM (or immutable—the default's your choice):
Transaction
amount Dollar
account_id MercuryAccountId
created_at UTCTime
updated_at UTCTime !mutable -- Only this column can change
Next, write code that maps from your ORM's definitions to auto-generate this SQL:
CREATE TRIGGER immutable_columns
AFTER UPDATE OF amount, account_id, created_at ON transactions -- Listing the columns here is more efficient: the WHEN condition only has to be evaluated when an UPDATE actually targets one of them
FOR EACH ROW
WHEN
( OLD.amount IS DISTINCT FROM NEW.amount
OR OLD.account_id IS DISTINCT FROM NEW.account_id
OR OLD.created_at IS DISTINCT FROM NEW.created_at
)
EXECUTE FUNCTION raise_immutability_exception(); -- defined just below; the function must exist before the trigger is created
CREATE OR REPLACE FUNCTION raise_immutability_exception() RETURNS trigger AS
$$
BEGIN
RAISE EXCEPTION 'This update was rejected because it attempted to UPDATE an immutable column. Old data: % -- New data: %', OLD, NEW
USING HINT = 'Tip: check the triggers on the table to see what columns are immutable. repsert-esque functions are a common cause of this bug.';
END
$$
LANGUAGE PLPGSQL;
Instead of auto-generating the SQL above, you can write a helper function that turns it into a one-line call:
CREATE OR REPLACE FUNCTION set_immutable_columns(VARIADIC table_and_columns text[])
returns void
as $$
declare
table_name text;
columns text[];
column_list text;
column_condition text;
begin
if array_length(table_and_columns, 1) < 2 then
raise exception 'Pass at least a table and one column to set_immutable_columns, eg set_immutable_columns(''table'', ''col1'', ''col2'')'
using detail = format('Arguments received: %s', table_and_columns);
end if;
table_name := table_and_columns[1]; -- one indexed (heathens!), so first element
columns := table_and_columns[2:];
column_list := array_to_string(columns, ', ');
column_condition := ( SELECT string_agg(x.y, ' OR ') FROM (SELECT '(OLD.' || col || ' IS DISTINCT FROM NEW.' || col || ')' as y FROM unnest(columns) AS col) as x );
EXECUTE format(
'CREATE OR REPLACE TRIGGER immutable_columns
AFTER
UPDATE OF %s
ON public.%s
FOR EACH ROW
WHEN (%s)
EXECUTE FUNCTION raise_immutability_exception ();'
, column_list
, table_name
, column_condition
);
end;
$$
language plpgsql;
-- Now this one line function call is available:
SELECT set_immutable_columns('transactions', 'amount', 'id', 'created_at');
Finally, use SQL in your test suite to confirm the triggers are in place:
-- Get tables that have a trigger called immutable_columns
SELECT pgc.relname,
pgp.proname,
pgt.tgattr,
pgt.tgtype,
pg_get_triggerdef(pgt.oid) as triggerdef
FROM pg_trigger pgt
JOIN pg_class pgc ON pgc.oid = pgt.tgrelid
JOIN pg_proc pgp ON pgp.oid = pgt.tgfoid
WHERE tgname = 'immutable_columns';
-- Extra credit: Get a list of columns by attnum to confirm the correct ones are listed in the WHEN condition
SELECT pgc.relname,
pga.attname,
pga.attnum
FROM pg_trigger pgt
JOIN pg_class pgc ON pgc.oid = pgt.tgrelid
JOIN pg_attribute pga ON pga.attrelid = pgc.oid
WHERE tgname = 'immutable_columns';
You’ll need to do a little work in the test suite to ensure everything is set up correctly. Your test code should look at the tables that have the immutable_columns trigger and check they match what is expected by your ORM. You can use pg_get_triggerdef to check the WHEN clause, and use the second query above to check the UPDATE OF part.
Should you need to temporarily update those immutable columns, like in a test, you can disable triggers by running SET session_replication_role = replica;
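For example, a test helper might wrap the change like this (just a sketch with made-up table and column values; note that setting session_replication_role typically requires elevated privileges):
-- Sketch: bypass ordinary triggers for one transaction only
BEGIN;
SET LOCAL session_replication_role = replica; -- the row triggers created above won't fire
UPDATE transactions SET created_at = now() - interval '30 days' WHERE id = 42;
COMMIT; -- the setting reverts automatically at the end of the transaction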
There is some performance overhead to this trigger, but it’s likely negligible: because of the UPDATE OF clause, Postgres only needs to check once per statement whether any immutable columns are being targeted, and the WHEN condition only runs for statements that do target them.
Enforcing immutability with permissions
Instead of triggers, you can accomplish a similar thing with permissions:
GRANT UPDATE (updated_at, mutable_column_here) ON tablename TO "database-username";
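Note that column-level grants only restrict anything if the role doesn’t also hold table-wide UPDATE, so the setup looks roughly like this (a sketch with placeholder table, column, and role names):
-- Sketch: drop blanket UPDATE, then grant it back only on the mutable columns
REVOKE UPDATE ON tablename FROM "database-username";
GRANT UPDATE (updated_at, mutable_column_here) ON tablename TO "database-username";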
Permissions feel like the more principled solution to me, but:
- The error messages for permissions are a little worse: "ERROR: permission denied for table tablename"
- Internally we're more comfortable with triggers
- It was easier to disable triggers in our test suite than open a database connection under a new user to mutate normally-immutable fields.
But the permission approach might make sense for you!
Immutability conclusion
Enforcing immutability is only sort of a lint: it’s more involved to set up, and it’s still at an experimental stage at Mercury. Treat this one as an interesting idea to consider, not as something I’m confidently recommending.
There’s a lot more to write about this, but hopefully the above gives you enough to go off of!
Blocking deletes on a table
Just like with immutable columns, you probably have an assumption that some tables can’t have rows deleted from them—and that assumption is either key to your code’s logic or essential for fulfilling your data retention policies.
Just like with the above, you could implement a similar system to block DELETE on certain tables, and enforce it in the same way (via permissions or triggers).
CREATE OR REPLACE FUNCTION raise_undeletable_table_exception() RETURNS trigger AS
$$
BEGIN
RAISE EXCEPTION 'DELETE is not allowed on this table';
END
$$
LANGUAGE PLPGSQL;
CREATE TRIGGER undeletable_table
AFTER DELETE ON transaction_metadata
FOR EACH STATEMENT
EXECUTE FUNCTION raise_undeletable_table_exception();
You would use similar SQL as we did for immutable columns to confirm these triggers are on the right tables, but it should be a lot simpler because individual columns aren’t involved.
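For example, a check along these lines should work (a sketch):
-- Tables that have the undeletable_table trigger; compare this list against
-- the tables your ORM expects to block DELETE on
SELECT pgc.relname
FROM pg_trigger pgt
JOIN pg_class pgc ON pgc.oid = pgt.tgrelid
WHERE pgt.tgname = 'undeletable_table';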
Timestamps
Having timestamp columns (created_at, updated_at, and in some cases deleted_at) is a great way to debug production systems, and you often need the data for user-facing pages anyway. I recommend requiring that created_at and updated_at columns exist on every table, and that triggers exist to update them appropriately (you don’t want to rely on your ORM to update these à la Ruby on Rails—ORM triggers won’t fire on raw SQL).
The SQL below returns any tables that are missing a created_at or updated_at column. Just check that the query returns no rows:
with all_tables as (
select
t.table_name,
exists (
select from information_schema.columns c
where c.column_name = 'created_at'
and c.table_name = t.table_name
and c.table_schema = t.table_schema
) as has_created_at,
exists (
select from information_schema.columns c
where c.column_name = 'updated_at'
and c.table_name = t.table_name
and c.table_schema = t.table_schema
) as has_updated_at
from information_schema.tables t
where
t.table_schema not in ('information_schema', 'pg_catalog')
and t.table_type = 'BASE TABLE'
)
select * from all_tables
where
not has_created_at
or not has_updated_at
order by table_name;
But it’s not enough to check that created_at and updated_at columns exist—you also need to set and update them correctly. This SQL creates the functions to do that:
create or replace function create_timestamps()
returns trigger as $$
begin
new.created_at = now();
new.updated_at = now();
return new;
end $$ language 'plpgsql';
create or replace function update_timestamps()
returns trigger as $$
begin
new.updated_at = now();
return new;
end $$ language 'plpgsql';
Then add the triggers on each table:
CREATE TRIGGER table_name_insert
BEFORE INSERT ON table_name
FOR EACH ROW EXECUTE PROCEDURE create_timestamps();
CREATE TRIGGER table_name_update
BEFORE UPDATE ON table_name
FOR EACH ROW EXECUTE PROCEDURE update_timestamps();
Optionally, you can create a function to simplify adding timestamp columns and triggers:
CREATE OR REPLACE FUNCTION add_timestamp_columns (table_name text)
RETURNS void
AS $$
BEGIN
EXECUTE format('ALTER TABLE %I ADD COLUMN IF NOT EXISTS created_at TIMESTAMPTZ NOT NULL;', table_name);
EXECUTE format('ALTER TABLE %I ADD COLUMN IF NOT EXISTS updated_at TIMESTAMPTZ NOT NULL;', table_name);
EXECUTE format('CREATE OR REPLACE TRIGGER %I_insert BEFORE INSERT ON %I FOR EACH ROW EXECUTE PROCEDURE create_timestamps();', table_name, table_name);
EXECUTE format('CREATE OR REPLACE TRIGGER %I_update BEFORE UPDATE ON %I FOR EACH ROW EXECUTE PROCEDURE update_timestamps();', table_name, table_name);
END;
$$
LANGUAGE plpgsql;
-- Now this one line function call is available:
SELECT add_timestamp_columns('table_name');
Finally, wrap it up with some SQL that checks whether or not these timestamp triggers exist on the tables. Make sure this SQL returns no rows:
WITH tables_with_trigger AS (
SELECT pgcl.*
FROM pg_class pgcl
JOIN pg_namespace pgns on pgns.oid = pgcl.relnamespace
LEFT JOIN pg_trigger pgtg ON pgtg.tgrelid = pgcl.oid
LEFT JOIN pg_proc proc ON proc.oid = pgtg.tgfoid
WHERE pgns.nspname = 'public'
AND relkind = 'r'
AND proc.proname = 'create_timestamps'
AND pgtg.tgenabled <> 'D'
)
SELECT pgcl.relname FROM pg_class pgcl
JOIN pg_namespace pgns on pgns.oid = pgcl.relnamespace
LEFT JOIN tables_with_trigger twt ON twt.oid = pgcl.oid
WHERE twt.oid IS NULL
AND pgns.nspname = 'public'
AND pgcl.relkind = 'r'
AND EXISTS (
SELECT FROM information_schema.columns c
WHERE c.column_name = 'created_at'
AND c.table_name = pgcl.relname
AND c.table_schema = pgns.nspname
);
-- For updates, run the same SQL above, but swap in 'update_timestamps' + 'updated_at'
Documentation
Requiring comments on every table
Writing documentation is an easy, scalable way to share knowledge within your company. It’s usually difficult to have hard-and-fast rules about what to document, but database models are so critical and used by so many non-engineering teams that requiring documentation via linting is an easy call to make.
At Mercury, we document tables and columns in our ORM code, which is nice because the comments are editable, searchable, and in source control. Using our ORM, persistent, we can get a list of all models and loop through them, calling getEntityComments to see if every table has documentation. If not, we fail the test.
You could potentially go further by requiring every column to have documentation, but this might create too many useless comments ("created_at is when the row was inserted").
At Mercury, we also require comments on every enum that is used in our database. Exactly how this happens is a little complex and Haskell-specific, but you can get good ROI out of requiring documentation on these, as they usually carry a lot of semantic value.
Regardless, once you have this information, you can propagate it to other places. We have two important places we propagate comments to:
1. We have the equivalent of a cron job that copies the comments into Postgres itself, like so:
COMMENT ON TABLE my_schema.my_table IS 'Employee Information';
COMMENT ON COLUMN my_table.my_column IS 'Employee ID number';
COMMENT ON TYPE an_enum IS 'List each enum case and docs here';
These comments show up when you run \d+ table_name, and can be used by database GUI tools like Metabase.
2. We generate a file called schema.html via a GitHub Action. This file makes it easy for people, especially non-engineering teams like data science and product, to see our entire schema in one command+f-able place.
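If you propagate comments into Postgres with COMMENT ON as in the first item, you can also sanity-check that every table actually got one (a sketch that only covers table-level comments in the public schema):
-- Tables in public that have no table-level comment
SELECT pgc.relname
FROM pg_class pgc
JOIN pg_namespace pgns ON pgns.oid = pgc.relnamespace
WHERE pgns.nspname = 'public'
AND pgc.relkind = 'r'
AND obj_description(pgc.oid, 'pg_class') IS NULL;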
All tables present in seed data
Many companies have a concept of sample ("seed") data for a development database that comes pre-loaded with rows so you aren’t starting from scratch. This is helpful for new hires, but it’s also nice for existing employees, who don’t have to worry about their database getting into a bad state or about retaining data across database upgrades.
How you generate seed data is up to you, but I’d recommend having some sort of command-line interface for your app that can use your domain models to generate data.
Once you have such a script, your challenge is making sure new tables are represented in the seed data. You can automate this by just checking that every table in your database has data in it after running the script. We iterate through every table our ORM knows of and run this SQL to check if a row exists in the table:
SELECT true FROM tablename LIMIT 1;
In practice, you’ll have some cases where seed data isn’t helpful (e.g. a fake API key in your database would just be misleading). To solve that, you can have a list of tables permanently exempt from this check. At Mercury, we also have a list of temporarily exempt tables and the date they were exempted, so people can skip this check and add the data later, but I’m not confident temporary exemptions are a good practice.
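If you’d rather run the whole check inside Postgres instead of looping from application code, here’s a sketch using a DO block (any exempted tables would still need to be filtered out):
-- Sketch: warn about any table in the public schema that has no rows after seeding
DO $$
DECLARE
tbl text;
has_row boolean;
BEGIN
FOR tbl IN
SELECT pgc.relname
FROM pg_class pgc
JOIN pg_namespace pgns ON pgns.oid = pgc.relnamespace
WHERE pgns.nspname = 'public' AND pgc.relkind = 'r'
LOOP
EXECUTE format('SELECT EXISTS (SELECT 1 FROM %I)', tbl) INTO has_row;
IF NOT has_row THEN
RAISE WARNING 'Table % has no seed data', tbl;
END IF;
END LOOP;
END
$$;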
Performance
Indexes on foreign key constraints
Foreign key constraints are a great way to keep your database consistent. When you have a parent table foreign keyed to by a child table, the column on the parent table must be a primary key or have a unique constraint, and thus is automatically indexed.
But the reverse direction is not required to have an index, so child.parent_id might not have an index.
In some cases this is totally fine, like if you just INSERT into both tables. But what if you DELETE from the parent table? In that case, Postgres needs to check if the child table has rows that foreign key to the parent to make sure a constraint isn’t violated. Without an index, that implies a full table scan on child.parent_id.
Further, if you DELETE CASCADE, a grandchild table referencing the child table needs to have an index on its child_id column to allow for an efficient lookup.
Here’s the SQL to recursively check if everything referencing the parent has an index on that column. This SQL works from a fixed list of deletableTables, but you could also get the list dynamically by labeling in your ORM which tables allow DELETE, as mentioned above. This SQL also allows you to exclude some tables from needing an index, via excludedReferencingTables.
-- Replace the values inside %{NE.fromList ...} with your own data
WITH RECURSIVE
fkey_constraints AS (
SELECT oid, conrelid, conkey, confrelid
FROM pg_constraint
WHERE contype = 'f'
),
deletable_fkey_constraints AS (
SELECT oid, conrelid, conkey
FROM fkey_constraints
WHERE confrelid::regclass::text IN %{NE.fromList deletableTables}
UNION ALL
SELECT c2.oid, c2.conrelid, c2.conkey
FROM deletable_fkey_constraints c1
JOIN fkey_constraints c2 ON c1.oid <> c2.oid AND c1.conrelid = c2.confrelid
WHERE c1.conrelid::regclass::text NOT IN %{NE.fromList excludedReferencingTables}
),
deletable_fkey_constraints_distinct AS (
SELECT DISTINCT conrelid, conkey
FROM deletable_fkey_constraints
)
SELECT CASE WHEN a.attnotnull THEN 'CREATE INDEX CONCURRENTLY IF NOT EXISTS ' || c.conrelid::regclass::text || '_' || a.attname || '_idx' || ' ON ' || c.conrelid::regclass::text || ' (' || a.attname || ');'
ELSE 'CREATE INDEX CONCURRENTLY IF NOT EXISTS ' || c.conrelid::regclass::text || '_' || a.attname || '_idx' || ' ON ' || c.conrelid::regclass::text || ' (' || a.attname || ') WHERE ' || a.attname || ' IS NOT NULL;'
END AS missing_index
FROM deletable_fkey_constraints_distinct c
CROSS JOIN LATERAL UNNEST(c.conkey) ak(k)
INNER JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ak.k
LEFT JOIN pg_index ix ON ix.indrelid = c.conrelid
AND ix.indkey[0] = ak.k
AND ix.indisvalid
AND (ix.indpred IS NULL OR NOT a.attnotnull) -- allow partial index if attribute is nullable
-- ideally we would make sure it's just attribute IS NOT NULL
WHERE ix IS NULL
ORDER BY c.conrelid::regclass::text, a.attname;
Even more lints
Before I wrote this blog post, I didn’t even think of these as "lints", but the process of doing so led me to find Kristian Dupont’s schemalint project. This tool has additional lints you might benefit from, like:
- Checking for VARCHAR vs TEXT (TEXT is stored as a binary blob in MySQL, so VARCHAR is preferred there, but in Postgres they’re synonyms)
- Checking whether you used the correct casing for column names
- Checking you’ve used jsonb instead of json
Check out the full list of lints here.
Conclusion
Unlike most tests, the lints above have wide reach, and unlike most lints, the ones in this post are impactful and best adopted early. I encourage you to identify the ones that seem most salient for your company, be it ensuring that new tables show up in seed data or requiring comments on all new database tables, and share those ideas with your team.
License: Some of the code snippets above are fairly long, but please feel free to use them verbatim in your own work. Consider them hereby released under CC-Zero.
Thanks to Sebastian Bensusan, Janey Muñoz, and Micah Fivecoate for reviewing this post.
Max Tagher is the co-founder and CTO of Mercury.