Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga DB crashes when it can't reach the PostgreSQL database #620

Open
log1-c opened this issue Jul 24, 2023 · 11 comments
Open

Icinga DB crashes when it can't reach the PostgreSQL database #620

log1-c opened this issue Jul 24, 2023 · 11 comments
Assignees
Labels
bug Something isn't working dev-call Issues and Pull Requests to be discussed at the Dev Call.
Milestone

Comments

@log1-c
Copy link

log1-c commented Jul 24, 2023

We are running Postgres as a backend for IcingaDB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga Masters we use pgbouncer and haproxy for the connection to the database.
pgbouncer listens for the connections from icinga and haproxy is setup to present on of the servers on localhost to pgbouncer.

We experienced some crashes of the daemon with error messages similar to #577
After some testing we noticed the following:

If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:

Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597b190: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597acd0: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 pooler error: server conn crashed?
Jul 17 15:42:50 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:34438 login attempt: db=icingadb user=icingadb tls=no
Jul 17 15:42:50 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 new connection to server (from 127.0.0.1:43752)
Jul 17 15:42:52 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=2s)
Jul 17 15:42:52 ma02 pgbouncer[875]: stats: 0 xacts/s, 1 queries/s, in 556 B/s, out 75 B/s, xact 37939 us, query 9738 us, wait 0 us
Jul 17 15:42:53 ma02 icingadb[205409]: pq: unexpected message 'E'; expected ReadyForQuery#012can't perform "INSERT INTO \"state_history\" (\"state_type\", \"check_source\", \"id\", \"previous_soft_state\", \"max_check_attempts\", \"previous_hard_state\", \"check_attempt\", \"scheduling_source\", \"endpoint_id\", \"host_id\", \"object_type\", \"service_id\", \"event_time\", \"soft_state\", \"hard_state\", \"output\", \"long_output\", \"environment_id\") VALUES (:state_type,:check_source,:id,:previous_soft_state,:max_check_attempts,:previous_hard_state,:check_attempt,:scheduling_source,:endpoint_id,:host_id,:object_type,:service_id,:event_time,:soft_state,:hard_state,:output,:long_output,:environment_id) ON CONFLICT ON CONSTRAINT pk_state_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""#012github.com/icinga/icingadb/internal.CantPerformQuery#012#011github.com/icinga/icingadb/internal/internal.go:30#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:394#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Failed with result 'exit-code'.

Without a config deploy at switch-over time the error message changes to:

Jul 17 15:48:06 ma02 icingadb[348949]: pq: cannot use serializable mode in a hot standby#012can't start transaction#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:253#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:128#012sync.(*Once).doSlow#012#011sync/once.go:74#012sync.(*Once).Do#012#011sync/once.go:65#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:126#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:218#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:335#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:250#012runtime.goexit#012#011runtime/asm_amd64.s:1594

Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.

@julianbrost
Copy link
Contributor

pq: unexpected message 'E'; expected ReadyForQuery

There's an issue in our database client driver: lib/pq#478. It would be interesting to know which error is hiding in there, so this will need further debugging.

I've also found an interesting comment in another project that claims (I haven't verified this myself but it sounds plausible) that the PostgreSQL wire protocol allows multiple error messages being sent: cockroachdb/cockroach#24149 (comment). So this sounds like it's quite possible this is an issue with the client, not the database cluster.

pq: cannot use serializable mode in a hot standby

According to the PostgreSQL documentation:

When the hot_standby parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written.

The Icinga DB daemon will always write to the database, which obviously can't work on a hot standby server. Could it be the case that connections are routed to the wrong server? Or is something like all servers are running as hot standby for a short time during a failover operation something that might happen?

@log1-c
Copy link
Author

log1-c commented Jul 31, 2023

I have tested what happens during a switch-over:
Patroni node status:
Leader changes to "unknown" for a short time, new leader changes from "replica" to "leader".

hot_standby is necessary for patroni to work (patroni/patroni#1811 (comment)), so it can't be disabled.

So I'm not sure what to test/debug further.

But I think the IcingaDB daemon should not crash, when it can't connect to the database, but rather cache it's queries, like the IDO feature does/did.

@log1-c log1-c changed the title IcingaDB daemon crashes when it can't reach the database. IcingaDB daemon crashes when it can't reach the psql database. Jul 31, 2023
@julianbrost
Copy link
Contributor

For the "cannot use serializable mode in a hot standby" part of the issue, treating this error as retryable (similar to how we do for other server is starting/shutting down errors already), but from a quick online search, it looks like error code 0A000 (feature_not_supported) is used for that, so something quite generic (comparing error messages is always ugly, they better don't change ever).

@log1-c
Copy link
Author

log1-c commented Dec 19, 2023

Follow up to this:

Due to the changes mentioned in the linked lib/pq issue the error message is now more specific:

Dec 17 20:18:19 msd-ic-ma02 icingadb[1674]: Starting history sync
Dec 17 20:18:20 msd-ic-ma02 icingadb[1674]: pq: cannot execute INSERT in a read-only transaction
                                            can't perform "INSERT INTO \"state_history\" 
...

We now "tuned" our haproxy config, so that it is faster in switching after detecting a non-functioning connection.
Since then the daemon did not crash on leading node switches.

@log1-c log1-c closed this as completed Dec 19, 2023
@Al2Klimov
Copy link
Member

Colleagues, we could re-try 25006 (read_only_sql_transaction) if we wish, once lib/pq#1136 has been merged. Shall we?

@log1-c Please could you add that code to https://github.com/Icinga/icingadb/blob/v1.1.1/pkg/retry/retry.go#L161-L175 locally and report what happens?

@Al2Klimov Al2Klimov removed their assignment Dec 19, 2023
@lippserd lippserd added the dev-call Issues and Pull Requests to be discussed at the Dev Call. label Aug 2, 2024
@lippserd lippserd changed the title IcingaDB daemon crashes when it can't reach the psql database. Icinga DB crashes when it can't reach the PostgreSQL database Aug 6, 2024
@lippserd lippserd added this to the 1.3.0 milestone Aug 6, 2024
@lippserd
Copy link
Member

lippserd commented Aug 6, 2024

We must not rely on lib/pq#1136 and fix this properly.

@lippserd lippserd reopened this Aug 6, 2024
@lippserd
Copy link
Member

lippserd commented Aug 6, 2024

[root@hostname ~]# systemctl status icingadb
● icingadb.service - Icinga DB
Loaded: loaded (/usr/lib/systemd/system/icingadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2024-08-02 09:46:12 CEST; 1min 44s ago
Process: 452833 ExecStart=/usr/sbin/icingadb --config /etc/icingadb/config.yml (code=exited, status=1/FAILURE)
Main PID: 452833 (code=exited, status=1/FAILURE)

...

Aug 02 09:45:04 msp-ic-ma01 icingadb[452833]: history-sync: Synced 10 state history items
Aug 02 09:45:11 msp-ic-ma01 icingadb[452833]: runtime-updates: Upserted 10 ServiceState items
Aug 02 09:45:24 msp-ic-ma01 icingadb[452833]: history-sync: Synced 3 state history items
Aug 02 09:45:31 msp-ic-ma01 icingadb[452833]: runtime-updates: Upserted 5 ServiceState items
Aug 02 09:45:44 msp-ic-ma01 icingadb[452833]: history-sync: Synced 9 state history items
Aug 02 09:45:51 msp-ic-ma01 icingadb[452833]: runtime-updates: Upserted 11 ServiceState items
Aug 02 09:45:57 msp-ic-ma01 icingadb[452833]: overdue-sync: Synced 1 service overdue indicators
Aug 02 09:46:04 msp-ic-ma01 icingadb[452833]: history-sync: Synced 11 state history items
Aug 02 09:46:12 msp-ic-ma01 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 09:46:12 msp-ic-ma01 systemd[1]: icingadb.service: Failed with result 'exit-code'.

...

[root@hostname ~]# systemctl status icingadb
● icingadb.service - Icinga DB
Loaded: loaded (/usr/lib/systemd/system/icingadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2024-08-02 09:46:07 CEST; 50s ago
Process: 4072859 ExecStart=/usr/sbin/icingadb --config /etc/icingadb/config.yml (code=exited, status=1/FAILURE)
Main PID: 4072859 (code=exited, status=1/FAILURE)

...

Aug 02 09:43:49 hostname icingadb[4072859]: history-sync: Synced 6 state history items
Aug 02 09:44:09 hostname icingadb[4072859]: history-sync: Synced 4 state history items
Aug 02 09:44:29 hostname icingadb[4072859]: history-sync: Synced 6 state history items
Aug 02 09:44:29 hostname icingadb[4072859]: high-availability: Another instance is active
Aug 02 09:44:49 hostname icingadb[4072859]: history-sync: Synced 2 state history items
Aug 02 09:45:09 hostname icingadb[4072859]: history-sync: Synced 11 state history items
Aug 02 09:45:29 hostname icingadb[4072859]: history-sync: Synced 5 state history items
Aug 02 09:45:49 hostname icingadb[4072859]: history-sync: Synced 10 state history items
Aug 02 09:46:07 hostname systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 09:46:07 hostname systemd[1]: icingadb.service: Failed with result 'exit-code'.

...

Aug 02 09:58:20 hostname systemd[1]: Starting Icinga DB...
Aug 02 09:58:20 hostname icingadb[4102936]: Starting Icinga DB daemon (v1.2.0)
Aug 02 09:58:20 hostname icingadb[4102936]: Connecting to database at '127.0.0.1:5432'
Aug 02 09:58:20 hostname systemd[1]: Started Icinga DB.
Aug 02 09:58:20 hostname icingadb[4102936]: driver: bad connection
can't connect to database
main.run
github.com/icinga/icingadb/cmd/icingadb/main.go:70
main.main
github.com/icinga/icingadb/cmd/icingadb/main.go:37
runtime.main
runtime/proc.go:271
runtime.goexit
runtime/asm_amd64.s:1695

...

Aug 02 10:34:53 hostname icingadb[4113948]: pq: unexpected message 'E'; expected ReadyForQuery
can't perform "INSERT INTO \"history\" (\"environment_id\", \"endpoint_id\", \"object_type\", \"service_id\", \"event_type\", \"state_history_id\", \"event_time\", \"host_id\", \"id\") VALUES (:environment_id,:endpoint_id,:object_type,:service_id,:event_type,:state_history_id,:event_time,:host_id,:id) ON CONFLICT ON CONSTRAINT pk_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""
github.com/icinga/icingadb/internal.CantPerformQuery
github.com/icinga/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
github.com/icinga/icingadb/pkg/icingadb/db.go:412
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:60
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
github.com/icinga/icingadb/pkg/icingadb/db.go:407
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
runtime/asm_amd64.s:1695
can't retry
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:82
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
github.com/icinga/icingadb/pkg/icingadb/db.go:407
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
runtime/asm_amd64.s:1695

@lippserd lippserd added the bug Something isn't working label Aug 6, 2024
@julianbrost
Copy link
Contributor

How is this unexpected message 'E' error returned? Is it another type than *pq.Error? Otherwise, shouldn't this be covered by #698 (in particular 33e07c4)?

@Al2Klimov
Copy link
Member

Unfortunately, "pq: %s" errors -at least "unexpected message %q; expected ReadyForQuery"- are made in pq.errorf by fmt.Errorf and of type *errors.errorString. But #698 is a great idea...

@oxzi
Copy link
Member

oxzi commented Aug 8, 2024

The particular error is crafted here, https://github.com/lib/pq/blob/3d613208bca2e74f2a20e04126ed30bcb5c4cc27/conn.go#L1814, in the readReadyForQuery method, expecting some "raw" Z message, while getting the logged E message instead.

The called errorf function itself wraps a panic(fmt.Errorf("pq: %s", fmt.Sprintf(s, args...))) call, resulting in panic("pq: unexpected message Z; expected ReadyForQuery"). Thus, it is not only an error, but a panic, which is recovered somewhere down the line, @Al2Klimov. Btw, I have not found a promising recover in the library with a quick grep.

@yhabteab
Copy link
Member

yhabteab commented Aug 8, 2024

Btw, I have not found a promising recover in the library with a quick grep.

The errors are all recovered within the conn.go file (search for errRecover(&err)).

@Al2Klimov Al2Klimov self-assigned this Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dev-call Issues and Pull Requests to be discussed at the Dev Call.
Projects
None yet
Development

No branches or pull requests

6 participants