Skip to content

Initial database setup is not atomic and can become wedged if interrupted #722

@hfinucane

Description

@hfinucane

Bug Report

I have a Chef cookbook that deploys Concourse. This is fine, but one of the things it makes difficult is controlling the timing of delayed actions. As an overview, the actions taken by an actual installation look something like this:

  • Do a bunch of postgres init stuff- initdb, create the 'atc' database, write out conf files
  • Register delayed restarts because we changed config files
  • Install concourse, kickstart concourse-web
  • Chef run is done! Now do the delayed restarts of postgres and concourse

Everything Works(tm) in Vagrant, but in EC2, with 2.3.1, this fails pretty reliably. The atc database appears to be partly set up:

atc=> \dt
             List of relations
 Schema |       Name        | Type  | Owner
--------+-------------------+-------+-------
 public | migration_version | table | atc
(1 row)

atc=> select * from migration_version;
 version
---------
(0 rows)

The concourse logs look like

Oct 25 22:23:27 ip-172-29-1-168 concourse[10032]: {"timestamp":"1477434207.567716599","source":"atc","message":"atc.db.migrations.failed-to-acquire-lock-retrying","log_level":2,"data":{"error":"pq: the database system is shutting down","session":"2"}}
Oct 25 22:23:32 ip-172-29-1-168 concourse[10032]: {"timestamp":"1477434212.579482079","source":"atc","message":"atc.db.migrations.failed-to-acquire-lock-retrying","log_level":2,"data":{"error":"pq: the database system is shutting down","session":"2"}}
Oct 25 22:23:37 ip-172-29-1-168 concourse[10032]: {"timestamp":"1477434217.592168808","source":"atc","message":"atc.db.migrations.failed-to-acquire-lock-retrying","log_level":2,"data":{"error":"pq: the database system is shutting down","session":"2"}}
Oct 25 22:23:42 ip-172-29-1-168 concourse[10032]: {"timestamp":"1477434222.599715471","source":"atc","message":"atc.db.migrations.failed-to-acquire-lock-retrying","log_level":2,"data":{"error":"pq: the database system is shutting down","session":"2"}}
Oct 25 22:23:47 ip-172-29-1-168 concourse[10032]: {"timestamp":"1477434227.771785975","source":"atc","message":"atc.db.migrations.failed-to-acquire-lock-retrying","log_level":2,"data":{"error":"pq: the database system is starting up","session":"2"}}
Oct 25 22:23:52 ip-172-29-1-168 concourse[10032]: {"timestamp":"1477434232.796749353","source":"atc","message":"atc.db.migrations.migration-lock-acquired","log_level":1,"data":{"session":"2"}}
Oct 25 22:23:53 ip-172-29-1-168 systemd[1]: Stopping Concourse Web...
Oct 25 22:23:53 ip-172-29-1-168 systemd[1]: Started Concourse Web.
Oct 25 22:23:53 ip-172-29-1-168 systemd[1]: Starting Concourse Web...
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: {"timestamp":"1477434247.272916317","source":"atc","message":"atc.db.migrations.migration-lock-acquired","log_level":1,"data":{"session":"2"}}
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: {"timestamp":"1477434247.307472944","source":"atc","message":"atc.db.migrations.failed-to-run-migrations","log_level":3,"data":{"error":"Could not get DB version: sql: no rows in result set","session":"2","trace":"goroutine 1 [running]:\ncode.cloudfoundry.org/lager.(*logger).Fatal(0xc42004e2a0, 0xe35cad, 0x18, 0x
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: mp/build/9674af12/gopath/src/github.com/concourse/bin/vendor/github.com/jessevdk/go-flags/parser.go:185 +0x74\nmain.main()\n\t/tmp/build/9674af12/gopath/src/github.com/concourse/bin/cmd/concourse/main.go:29 +0x113\n"}}
Oct 25 22:24:07 ip-172-29-1-168 systemd[1]: concourse-web.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 25 22:24:07 ip-172-29-1-168 systemd[1]: Unit concourse-web.service entered failed state.
Oct 25 22:24:07 ip-172-29-1-168 systemd[1]: concourse-web.service failed.
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: panic: Could not get DB version: sql: no rows in result set
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: goroutine 1 [running]:
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: panic(0xce9560, 0xc420148d60)
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: /usr/local/go/src/runtime/panic.go:500 +0x1a1
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: code.cloudfoundry.org/lager.(*logger).Fatal(0xc42004e2a0, 0xe35cad, 0x18, 0xef9cc60, 0xc420148d60, 0x0, 0x0, 0x0)
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: /tmp/build/9674af12/concourse/src/code.cloudfoundry.org/lager/logger.go:152 +0x41c
Oct 25 22:24:07 ip-172-29-1-168 concourse[10431]: github.com/concourse/atc/db/migrations.LockDBAndMigrate(0xefaf220, 0xc42004e2a0, 0xe2fd36, 0x13, 0x7ffcf9e1ff69, 0x1c, 0xefaf220, 0xc42004e2a0, 0xc4201c6ce8, 0x599f23)

There is a certain degree of "doctor, doctor, it hurts when I do this" here, but I don't really want to have to litigate a move away from Chef, and it also seems like the "set up the database" step shouldn't fail like this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions