Common DB schema change mistakes in Postgres (postgres.ai)
357 points by thunderbong 16 days ago | 80 comments



I like Postgres, a lot.

Most of the things in this article are avoidable, and good to keep an eye out for.

But let's be clear, we're not talking about the worst part of Postgres: roles. There's a ton of power there; it would be amazing to use it. But making it work feels like black magic. Every bit of the interface around it seems like esoteric incantations that may or may not do what you expect. It's a terrible way to manage something so important.

The manual for this section is thin. It gives you an idea of how things should work, maybe, in a narrow use case. The problem is that when they don't, you're going to spend a lot of time on trial and error to figure out what you did wrong, and likely not have a clue as to how to do it right. And may god have mercy on your soul if you want to migrate a DB with complex user permissions.

I need to sit down with it for a month and write a "cookbook". If one person uses it and goes to bed that night without crying themselves to sleep, it will have been worth it.


I agree with the sentiment that IAM in PostgreSQL is complex.

What makes it complex is that there are 3 layers of objects (Database, Schema, Tables) and also implicit grants given to DB object owners

To be able to select from a table you need:

* CONNECT on the Database

* USAGE on the Schema (Given implicitly to schema owner)

* SELECT on the Table (Given implicitly to table owner)
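For example, with a hypothetical database `appdb`, schema `app`, table `orders`, and role `reader`, those three grants look roughly like this:

  GRANT CONNECT ON DATABASE appdb TO reader;
  GRANT USAGE ON SCHEMA app TO reader;
  GRANT SELECT ON app.orders TO reader;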

To see these privileges, we need to understand ACL entries of this format:

`grantee=privilege-abbreviation[]/grantor:`

* Use \l+ to see privileges of Database

* Use \dn+ to see privileges of Schemas

* Use \dp+ to see privileges of Tables

Privileges are documented here: https://www.postgresql.org/docs/current/ddl-priv.html

E.g. in the following example the user has been given all privileges by the postgres role:

`user=arwdDxt/postgres`

If the ACL column is empty for a given object, it means the object just has the default owner privileges (all privileges). If the grantee part of an entry is empty, it means the privileges were granted to the PUBLIC role (every role that exists):

`=r/postgres`

Also, it's confusing when the public schema is used. You have CREATE permission on the schema, so when tables are created with the same user you select data with, you get owner permissions out of the box.


> To be able to select from a table you need

> * CONNECT

> * USAGE

> * SELECT

Isn't LOGIN (https://www.postgresql.org/docs/16/role-attributes.html) also needed?

  Only roles that have the LOGIN attribute can be used as the initial role name for a database connection


You can log in and then change your role (if you have permissions) with SET ROLE NEW_USER;

We managed to kludge our way to defaulting to read-only, then using SET ROLE to do writes when needed.
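A rough sketch of that kludge (role and schema names are made up; NOINHERIT keeps the login role from picking up the write privileges automatically):

  CREATE ROLE app_writer NOLOGIN;
  CREATE ROLE app_reader LOGIN NOINHERIT;
  GRANT app_writer TO app_reader;   -- allows SET ROLE app_writer
  GRANT USAGE ON SCHEMA app TO app_reader, app_writer;
  GRANT SELECT ON ALL TABLES IN SCHEMA app TO app_reader;
  GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA app TO app_writer;
  -- in the application, only around write paths:
  SET ROLE app_writer;
  -- ... writes ...
  RESET ROLE;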


Good point. Connecting to the cluster is indeed a 4th layer.

There you do need a user with LOGIN, a valid password, and SSL.


> What makes it complex is that there are 3 layers of objects (Database, Schema, Tables) and also implicit grants given to DB object owners

The main privileges systems includes Columns, as well as Databases/Schemas/Tables. You can SELECT from a table if you have been granted SELECT on the table, or if you have been granted it on the specific columns used in your query. ("A user may perform SELECT, INSERT, etc. on a column if they hold that privilege for either the specific column or its whole table. Granting the privilege at the table level and then revoking it for one column will not do what one might wish: the table-level grant is unaffected by a column-level operation." [1])

There's also a system of Row Security Policies [2].

[1]: https://www.postgresql.org/docs/current/sql-grant.html

[2]: https://www.postgresql.org/docs/current/ddl-rowsecurity.html


I've seen funny things happen when developers think their GRANT is doing what ALTER DEFAULT PRIVILEGES does, and a day later break production because their user doesn't have access to the new tables.
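A sketch of the difference (schema and role names are hypothetical):

  -- only affects tables that already exist:
  GRANT SELECT ON ALL TABLES IN SCHEMA app TO reporting;
  -- needed so tables created later by the migration role are readable too:
  ALTER DEFAULT PRIVILEGES FOR ROLE migrator IN SCHEMA app
      GRANT SELECT ON TABLES TO reporting;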


What do you mean exactly?

If roles have INHERIT, then doing the following works, no?

* Role A creates a table

* GRANT A TO B;

* Role B can read from the table just like A can.

Also if Role A creates new table, Role B can read that too no?


Remove any "if" from your comment and it seems like you understood what would fail! There are more ways that misunderstanding GRANT vs. default privileges will create issues as well.


Then I would argue for trying to centrally ensure that all roles have INHERIT in your DBs. :thinking:

Doing ALTER DEFAULT PRIVILEGES could be another future footgun of its own.


I think you probably don’t want the application user to inherit from the superuser you may use for migrations since some migrations require a superuser.


IMO, the permission management in MS SQL Server is the best one around. It gets all the basics right, without any big issues. Other DBMSs should copy it, or use it as a starting point if they want to go further (and there is plenty of room to improve on it).

TLDR, the container objects and the contained ones all share the same kind of permissions. Permissions of the container are applied to the contained unless explicitly changed.

So, if you grant SELECT on the schema dbo to A, A gets SELECT on all tables there. If you want to exclude some table, you revoke SELECT on that specific table. And there is both metadata to discover where a specific privilege comes from and specific commands to edit the privileges at a specific level.


Even the docs of postgrest, which relies on roles for auth, are not very detailed: https://postgrest.org/en/v12/explanations/db_authz.html

Interesting. If you are serious about writing a cookbook for postgres roles, and open something like a kickstarter, I'll be one of the first to pledge!


> Making it work feels like black magic.

Can confirm. Last year I implemented a simple postgREST server with row-level security. (The postgREST logs are really good, with cookbook and all.)

The path there was somewhat difficult, but once it worked, it was truly magical. And the mechanisms involved were quite simple even.


I’d read that. Role management for me often involves a lot of guessing and too often roles end up with too many permissions because of that.


Hah, role management for us is "create a role for migrations, and a role to do db things, and enforce auth entirely in the web app"

I suspect we aren't alone


This is a fairly sane place to be in terms of bang for your buck. It's easy to find yourself in a place where authorization data and logic span multiple services and at that point having everything deeply siloed into Postgres might be a doozy. That being said, there are plenty of times that'll never be the case and you should try to lean on the abstractions that work best for you.


Roles are for services, not for users. If you have a read-only web API, then it makes sense to use a read-only role regardless of which user is using it.


Everything in PostgreSQL is a role.

It's just that when a role allows `login`, it's considered a user.


Someone was feeling very clever when they came up with this idea.


It makes a lot of sense, especially since in the SQL standard, you can grant privileges to "users" or "roles."

Might as well simplify the mental model and make them the same.


This is part of what many people find so confusing. In most systems “role” is a group (or something closely resembling a group), not a user. The weird terminology confuses beginners


It's a bit confusing and legacy.

All roles function like you would expect groups to function.

A role that is not allowed to log in is a `group`.

While the CREATE USER and CREATE GROUP commands still exist, they are simply aliases for CREATE ROLE.


To me, the comment you are replying to is saying that you should DIFFERENTIATE roles by service, not ‘end user’.


> Hah, role management for us is "create a role for migrations, and a role to do db things, and enforce auth entirely in the web app"

> I suspect we aren't alone

Honestly I'd be happy to spend the time learning the ins and outs of PostgreSQL IAM stuff, but there are two very good reasons why I won't use it:

1. Still need the "role"/user across other services, so I don't save anything by doing ACL inside the DB.

2. I've no idea how to temporarily drop privileges for a single transaction. Connecting as the correct user on each incoming HTTP request is way too slow.


> 2. I've no idea how to temporarily drop privileges for a single transaction. Connecting as the correct user on each incoming HTTP request is way too slow.

`SET ROLE`[1] changes the "current user identifier of the current SQL session"; after running it "permissions checking for SQL commands is carried out as though the named role were the one that had logged in originally".

Whilst it changes the "current user" it doesn't change the current "session user", and this is what determines which roles you can switch to.

The docs also note that:

> SQL does not allow this command during a transaction; PostgreSQL does not make this restriction because there is no reason to.

[1]: https://www.postgresql.org/docs/16/sql-set-role.html
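For the per-transaction case specifically, SET LOCAL ROLE reverts automatically at COMMIT/ROLLBACK; a minimal sketch with a hypothetical read-only role:

  BEGIN;
  SET LOCAL ROLE readonly_app;
  SELECT * FROM orders WHERE id = 42;   -- checked against readonly_app's privileges
  COMMIT;                               -- back to the session user's privileges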



Oh, it's really worrying that you post a link to an SO page as the recommended 'cookbook' for something that's referred to as a powerful feature. And both of the answers only have a single upvote each. Seems like there's a serious gap here.


You can help out by posting an answer / upvoting good answers.


I don't think Stack Overflow is a good place for good documentation. I think the Postgres documentation should be better.


> I need to sit down with it for a month and write "cookbook". If one person uses it and goes to bed that night without crying them selves to sleep it will have been worth it.

Please do. I'd be happy to pay ~$20 for it.


What I don't understand is why databases don't use tables/views to implement their native permissions systems.

Create a table to hold rows of (db,schema,table,role,read,write) configured by the admin with INSERT/UPDATE/DELETE, then a view that applies inheritance behavior and can answer whether any user can access any given resource.


Literally what PostgreSQL does.

E.g. you can do manual selects from the internal catalogs to see the same content as the `\dt` command.
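For example (table name hypothetical), the information behind `\dt` and `\dp` can be queried directly:

  SELECT schemaname, tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public';
  SELECT relname, relacl FROM pg_catalog.pg_class WHERE relname = 'orders';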


Oh, cool. Where's the schema?


I have seen what you have seen (did several "serious" things using postgrest).

edit: i mean yes, we need that cookbook real bad


I recently had to rename an app I was developing. That included the linux user account under which the service ran, and therefore the postgres user.

I don't remember all the details, but I had to also rename the postgres user and role, which seemed a simple thing to do. But for some reason renaming the user didn't include the permissions on the database. I was left with a very confusing state of working table access and denied record access. I decided to back up the data, drop the database, and do an import that didn't include any permissions.

That simple thing turned into a complete shit show, and I blame Postgres for making something so simple so complex.


I never use system accounts for Postgres auth. It feels like coupling that won’t do anything good for most people most of the time, and will only bite you in the ass when you don’t expect it to.


If you are running schema migrations in production, use "lock_timeout". Even seemingly-benign modifications like dropping a table (with foreign keys) or dropping a foreign key, which are generally quick and may run nearly-instantaneously when you're testing them, may end up hitting a lock conflict on a heavily-used production database (with existing transactions, with an autovacuum, etc). That ALTER is then waiting on the first transaction's lock, but it has acquired an ACCESS EXCLUSIVE lock, meaning no queries can run against the locked table.

If you're doing any postgres at scale, it's just a matter of time until you hit one of these conflicts. "lock_timeout" will just cause the migration to fail after the timeout, rather than just blocking all other queries.
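A minimal sketch of the pattern (the table name, constraint name, and timeout value are made up):

  SET lock_timeout = '2s';
  ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
  -- if the lock can't be acquired within 2s, the ALTER fails and can be retried,
  -- instead of sitting in the lock queue and blocking every other query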


'statement_timeout' includes the lock wait and lets you better estimate the impact on hot tables. When you set the timeout to 5s, you know the downtime will be max 5s total, and transactions will continue afterwards. With lock_timeout, you don't control how long the part after acquiring the lock will take - it may be fast or slow due to concurrent traffic, for example.


Related to this, there is quite a lot of variation across Postgres versions in whether a particular DDL statement will take an exclusive lock or not.

Is there a good way to analyse a query and be informed of what sort of lock it will take?

I’ve always resorted to re-reading docs when I’m unsure.


For me the best way is testing it locally, side-by-side, with two transactions. You don't need any data for this.

After some experience, you start to see the reasons for locks and how they will impact you.

Or read the docs.
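A sketch of that side-by-side test, assuming a throwaway table `foo`:

  -- session 1: keep a transaction open that touches the table
  BEGIN;
  SELECT * FROM foo LIMIT 1;   -- holds ACCESS SHARE until commit/rollback
  -- session 2: the DDL now blocks, queued behind session 1
  ALTER TABLE foo ADD COLUMN extra text;
  -- any session: see who holds and who waits
  SELECT pid, mode, granted FROM pg_locks WHERE relation = 'foo'::regclass;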


Excellent advice.

On the technical side, I believe the waiting is due to the lock queue rather than the ALTER having acquired an ACCESS EXCLUSIVE lock. The ALTER is specifically _waiting_ for any lock lower than ACCESS EXCLUSIVE to be released.


It also makes all new readers/writers wait for that lock, essentially leading to downtime until the lock is eventually acquired and released. This is the classic readers/writers library example, and you want to avoid starving the writers.

That's why the size of the data is the least of your issues - it's the access patterns/hotness that are the issue.


Sometimes if you do that, you'll never run your ALTER. Just needs enough traffic on the table. In those cases I think the best way is to kill other ongoing queries that block the ALTER, if your app can recover from that.


It might be kind of nice if Postgres had a "force this statement through" option which kills any conflicting processes.

We created something like that in Citus when changing a node's hostname (e.g. during a failover). While node updates should be mutually exclusive with writes (otherwise we might lose them), we didn't want to wait for long-running or possibly frozen writers to release their locks. So after some initial waiting we'd start a background worker to kill anything that was blocking the node update.


I refer to Fly.io’s guide to Safe Migrations in Ecto (Elixir’s DB adapter) multiple times a week. It’s a very useful quick reference to check whether you can get away with a basic migration or if something more involved is required.

https://fly.io/phoenix-files/safe-ecto-migrations/


The most surprising thing about Postgres indexes as a beginner was how UNIQUE indexes can affect the outcomes of concurrent queries because of the additional locking they add. Something like

  INSERT INTO foo (bar) (SELECT max(bar) + 1 FROM foo);
can insert duplicate `bar` values when run concurrently using the default mode, since one xact might not see the new max value created by the other. You might think adding a UNIQUE index would cause the "losing" xact to get constraint errors, but instead both xacts succeed and no longer have a race condition.


> You might think adding a UNIQUE index would cause the "losing" xact to get constraint errors, but instead both xacts succeed and no longer have a race condition.

This is not true. What happens is that the (sub)transaction that loses the race to the index is aborted:

  =# INSERT INTO foo (bar) (SELECT max(bar) + 1 FROM foo);
  ERROR:  duplicate key value violates unique constraint "foo_bar_idx"
  DETAIL:  Key (bar)=(2) already exists.


I probably got the wrong query. There was some other one where the unique index removes a race condition, lemme find it...

Are you claiming that both inserts succeed even with a UNIQUE index in place, and they end up inserting duplicate values? That must be a bug if you're right.


I think they're saying that the unique index changes the locking strategy for the queries so they are effectively serialized and will not both read the same max value for existing rows.


That would make a lot of sense, thanks. I hope that's what the author meant.


Yeah, that's what I meant.


If I'm not mistaken, you can do this with no downtime by creating a regular index CONCURRENTLY and creating a not-checked unique constraint. The constraint applies to new inserts/updates only. After that, you run VALIDATE on the constraint and it will be a fully fledged unique constraint.
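For reference, a related zero-downtime pattern (if I recall correctly; names hypothetical) builds the unique index concurrently and then attaches it as a constraint:

  CREATE UNIQUE INDEX CONCURRENTLY foo_bar_key ON foo (bar);
  ALTER TABLE foo ADD CONSTRAINT foo_bar_unique UNIQUE USING INDEX foo_bar_key;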


Well, if it's surprising, it's because you have too much exposure to imperative languages.

I can agree this is common, but then, the issue isn't with postgres, it's with software development as a whole.


The exact guarantees you get from Postgres's default isolation level are pretty detailed. The manual* doesn't mention locking on constraints or unique indexes, but I can make guesses based on how I'd expect it to work. I don't see what it has to do with the language being declarative. It's not unique to Postgres, though, cause MySQL is similar.

* https://www.postgresql.org/docs/current/transaction-iso.html...


With what isolation level?


The default, read committed


These pitfalls are among the reasons why I built Reshape [0], which aims to automate zero-downtime schema migrations.

I can’t say it avoids all of them but we are working on a new product that would. If you are interested in this space (and Postgres specifically), I’d love to hear from you: fabian@reshapedb.com

[0] https://github.com/fabianlindfors/reshape


Any chance that works for crdb?


Unfortunately not, as Cockroach doesn't support triggers, which is what powers Reshape. We are looking into how to widen support for our next product, though - reach out to fabian@reshapedb.com if you'd be interested in something like that!

It might work, but keep in mind schema changes in CockroachDB are very different under the hood. For one, they are not transactional and run in the background: https://www.cockroachlabs.com/docs/stable/online-schema-chan...


See my other comment. Unfortunately Cockroach doesn't do much in terms of helping with zero-downtime schema migrations :/

Another common mistake I have seen: duplicating tables without the indexes.

This is not how it works, period.

    CREATE TABLE <abcv2> AS SELECT * FROM <abc> WHERE <>
People do it all the time, whether to create a backup table, to delete data in bulk, etc.


If I'm creating a backup table -- as in, I'm going to do some complex nebulous operation that might screw up in ways I don't immediately foresee -- then I don't care at all about indexes or constraints. I want an immediately present copy of the data that I plan to never need but is there so I don't have to restore from DB backup + WAL. Creating the indexes is a waste of both server time and disk space.

If something goes pear-shaped or I really need them, I can create those indexes later.


Could you also mention the appropriate way to go about it?


Either put machinery in your schema migration tool to create indexes as a separate step, so they are easy to re-apply - this makes keeping indexes in sync between production and staging a lot easier. Or you can use "CREATE TABLE table_b (LIKE table_a INCLUDING INDEXES);".


Make sure to create indexes AFTER the data is in. It's way faster to build indexes on existing data than to create them empty and then insert the data.

Also make sure to set maintenance_work_mem high, as it helps with index creation.
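A rough sketch of that ordering (table/column names and the memory value are just examples):

  SET maintenance_work_mem = '1GB';
  CREATE TABLE abcv2 (LIKE abc INCLUDING DEFAULTS);
  INSERT INTO abcv2 SELECT * FROM abc;
  CREATE INDEX abcv2_bar_idx ON abcv2 (bar);   -- built after the data is loaded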


> Case 2. Misuse of IF [NOT] EXISTS

The article doesn't provide any good example of misuse. And that's exactly how you use it. It's clean and simple, no hidden pitfalls. Schema migration tools are overhead when you have a few tables.


The pitfall is pretty simply stated: "It masks the problem with logic, adding some risks of anomalies." When one puts a bandaid over bad data, the problem isn't solved, it's masked. Depending on the issue, it could bite you later, in unexpected ways, at the worst possible times.

In this particular case, the "bad data" is a table/column/view that exists (or doesn't) when it should(n't). Why does the table exist when it shouldn't yet exist? Did it fail to get dropped? Is the existing one the right schema? Is the same migration erroneously running twice?

After each migration, your schema should be in a precise state. If the phrase "IF [NOT] EXISTS" is in your migration, then after a previous migration, it wasn't left in a precise state. Being uncertain about the state of your schema isn't great.


We use IF NOT EXISTS to bring non-prod environments in sync with prod. The size of prod requires some migrations to be done separately over the course of days, in separate transactions. The IF NOT EXISTS clause then brings dev and non-prod envs in sync.


In many cases you can still be ok just using IF NOT EXISTS everywhere, or occasionally IF EXISTS when trying to delete a no-longer-used col or index. Whether you start with a blank DB or any intermediate state, you'd end up with the same thing. But it gets messy after a long enough time.


> The article doesn't provide any good example of misuse.

It did describe the misuse pretty well, though. The idea is that out-of-band schema modifications are a process/workflow issue that needs to be directly addressed. As stated by OP, this is an easy way for anomalies to creep in - what if the already present table has different columns than the one in the migration? IF EXISTS lets a migration succeed but leaves the schema in a bad state. This is an example of where you would prefer a migration to "fail fast".
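A contrived sketch of that anomaly (table and columns are hypothetical): an out-of-band change already created `users` with different columns, yet the migration still reports success:

  CREATE TABLE IF NOT EXISTS users (
      id   bigint PRIMARY KEY,
      name text NOT NULL
  );
  -- NOTICE:  relation "users" already exists, skipping
  -- the step "succeeds", but users may not have the columns the app expects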


A little nitpick about

> To those who still tend to use int4 in surrogate PKs, I have a question. Consider a table with 1 billion rows, with two columns – an integer and a timestamp. Will you see the difference in size between the two versions of the table

Wouldn't the important thing be the index size, not the table size? Each table row already has a 23-byte header plus alignment padding, so a 4-byte difference doesn't do much for table size. But fitting more of an index into memory could have some benefits. An index entry has an 8-byte header.

Secondly, 1 billion rows (used in the example) are far too close to the maximum for int4 for comfort.

Great article nonetheless.


Yeah, there’s index size, and also disk size. Postgres packs table rows on disk tightly, but not in RAM.

So I guess an 8kb page on disk could be more than 8kb in ram?

Seems to only affect working memory for table row data. Still significant (especially in Postgres where rows are randomly ordered which is horrible for locality for range queries) but not a home run insight imo.


I'm a developer who has largely been shielded from database-related issues. Within Django, I know how to create migrations, model tables, and query using the ORM, but a lot of what happens is black magic to me. Now that I'm starting a company, I have a lot of anxiety about running into these issues and then resolving them on my own. How should I go about learning what to do while in a dev environment?


Fail and learn from your mistakes. Or hire developers and fail and learn together.


I love Postgres, but I really hate that there is no built-in way to batch updates or deletes. This is one of the most annoying things, and I need to (re)write a batcher almost every month, every time I hit a wall.
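The usual workaround is a loop of small deletes; a minimal sketch (table/column names and batch size are hypothetical), repeated until it affects zero rows:

  DELETE FROM events
  WHERE id IN (
      SELECT id FROM events
      WHERE created_at < now() - interval '90 days'
      ORDER BY id
      LIMIT 10000
  );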


I have managed to make a lot of these mistakes, and they've led to scary outages and bummers. This is a pretty great article.


(2022) (not that it changes much, all seems up to date)



