systemd units failed after "invalid connection" - manual recovery needed #794

louiswebdev · 2024-09-02T07:20:26Z

Describe the bug

HA Icinga2 setup, two nodes in master zone. Database is AWS RDS MySQL, Multi-AZ. During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.

Expected behavior

I would have expected the connection loss to be handled grafecully, without the IcingaDB services failing. We had to restart them manually.

Your Environment

Icinga DB version: 1.2.0-1+debian11
Icinga 2 version: 2.14.2-1+debian11
Operating System and version: Debian GNU/Linux 11 (bullseye)

Additional context

Time in logs is UTC, time in screenshot is CEST (==UTC+2), so the events do actually match:

Sep 02 00:33:32 ip-10-7-10-85 icingadb[121007]: invalid connection
                                                can't commit transaction
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:381
                                                github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                        github.com/icinga/icingadb/pkg/retry/retry.go:60
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).realize
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:284
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:221
                                                runtime.goexit
                                                        runtime/asm_amd64.s:1695
                                                HA aborted
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:131
                                                sync.(*Once).doSlow
                                                        sync/once.go:74
                                                sync.(*Once).Do
                                                        sync/once.go:65
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).abort
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:129
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:233
                                                runtime.goexit
                                                        runtime/asm_amd64.s:1695
                                                HA exited with an error
                                                main.run
                                                        github.com/icinga/icingadb/cmd/icingadb/main.go:335
                                                main.main
                                                        github.com/icinga/icingadb/cmd/icingadb/main.go:37
                                                runtime.main
                                                        runtime/proc.go:271
                                                runtime.goexit
                                                        runtime/asm_amd64.s:1695
Sep 02 00:33:32 ip-10-7-10-85 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE

Sep 02 00:38:11 ip-10-7-11-6 icingadb[110917]: invalid connection
                                               can't perform "INSERT INTO \"state_history\" (\"max_check_attempts\", \"scheduling_source\", \"event_time\", \"hard_state\", \"previous_hard_state\", \"check_attempt\", \"output\", \"long_output\", \"check_source\", \"environment_id\", \"state_type\", \"soft_state\", \"id\", \"endpoint_id\", \"host_id\", \"service_id\", \"previous_soft_state\", \"object_type\") VALUES (:max_check_attempts,:scheduling_source,:event_time,:hard_state,:previous_hard_state,:check_attempt,:output,:long_output,:check_source,:environment_id,:state_type,:soft_state,:id,:endpoint_id,:host_id,:service_id,:previous_soft_state,:object_type) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")"
                                               github.com/icinga/icingadb/internal.CantPerformQuery
                                                       github.com/icinga/icingadb/internal/internal.go:30
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:412
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:60
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:407
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1695
                                               retry deadline exceeded
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:95
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:407
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1695
Sep 02 00:38:11 ip-10-7-11-6 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE

The text was updated successfully, but these errors were encountered:

yhabteab · 2024-09-02T09:00:55Z

During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.

Hi, how long did that maintenance window last? Icinga DB actually retries such temporary errors, but only for 5m and if the database does not return to its operational state within this time slot, it ultimately gives up and crashes with retry deadline exceeded.

louiswebdev · 2024-09-02T09:11:43Z

Looks like there were 8 minutes between the last two loglines (offline operation) - I suspect that is the timeframe the DB was not reachable at all, yes. Do I have config options to prolong the 5m period?

louiswebdev · 2024-09-02T09:13:38Z

I also added the following to both systems as a countermeasure:

└─ # ▶ cat /etc/systemd/system/icingadb.service.d/override.conf
[Unit]
StartLimitInterval=0
StartLimitBurst=30

[Service]
Restart=on-failure
RestartSec=30s

Edit: but I am not sure if this is okay to do, or if it may lead to other problems.

oxzi · 2024-09-02T10:46:59Z

@louiswebdev: Your systemd unit fix should work. Restarting the daemon shouldn't lead to errors. The HA logic would restrict the restarted Icinga DB daemon to directly take over, let the lease time out and then allow them to be the lead again.

For more time intense maintenance jobs, one should consider temporary stopping Icinga DB because it cannot work without a database. As @yhabteab wrote, it will retry the queries, but for a longer database absence, this will eventually fail. Furthermore, I am unsure if the "invalid connection" error will even be retried.

yhabteab · 2024-09-02T10:51:46Z

Do I have config options to prolong the 5m period?

No, you can't override the timeout at the moment!

Furthermore, I am unsure if the "invalid connection" error will even be retried.

See https://github.com/Icinga/icinga-go-library/blob/main/retry/retry.go#L190.

louiswebdev · 2024-09-02T11:38:27Z

@louiswebdev: Your systemd unit fix should work. Restarting the daemon shouldn't lead to errors. The HA logic would restrict the restarted Icinga DB daemon to directly take over, let the lease time out and then allow them to be the lead again.

For more time intense maintenance jobs, one should consider temporary stopping Icinga DB because it cannot work without a database. As @yhabteab wrote, it will retry the queries, but for a longer database absence, this will eventually fail. Furthermore, I am unsure if the "invalid connection" error will even be retried.

Hm 🤔

The problem with this is: maintenance on RDS databases always occurs during defined timeframes, preferably outside peak hours. We never know when exactly it will happen, and even if we knew it's most probably at a time we're not awake to stop services accordingly. We might consider now to schedule maintenance during daytime, but we prefer not to.

With the ido-mysql feature we are coming from this never was a problem, the system is up and running since mid 2019.

louiswebdev · 2024-09-02T11:43:43Z

Do I have config options to prolong the 5m period?

No, you can't override the timeout at the moment!

Maybe that's a feature we would like to see then - but we think it's not super-urgent, since those maintenances that require offline-operations do not occur very often and the aforementioned systemd config is in effect now.

louiswebdev · 2024-09-03T06:20:32Z

During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.

Hi, how long did that maintenance window last? Icinga DB actually retries such temporary errors, but only for 5m and if the database does not return to its operational state within this time slot, it ultimately gives up and crashes with retry deadline exceeded.

Oh, one more thing: only one of the nodes exited with retry deadline exceeded, the other one bailed out with HA aborted and HA exited with an error five minutes before the first node.

oxzi · 2024-09-03T06:39:59Z

"HA exited with an error" also only happens if something seemed unrecoverable happened. However, could you please supply longer logs including the events before the crash? This would ease us to understand if Icinga DB works as intended or if we are dealing with a bug.

louiswebdev · 2024-09-04T06:38:02Z

Would journalctl -u icingadb.service --since "2024-09-01 23:55:00" --until "2024-09-02 00:55:00"suffice?

Both nodes seem to have logged that another instance was active. Apart from that it does not look suspicious to me.

icingadb_logs_node_a.txt
icingadb_logs_node_b.txt

oxzi · 2024-09-04T12:26:52Z

Thanks for supplying those logs, @louiswebdev. But it seems you have uploaded the same file twice; the file content is identical.

$ sha256sum icingadb_logs_node_{a,b}.txt
92ff2299ef441eb8cd5cb2d2fd611a881b909d4b87f08aee039e81f22749ff29  icingadb_logs_node_a.txt
92ff2299ef441eb8cd5cb2d2fd611a881b909d4b87f08aee039e81f22749ff29  icingadb_logs_node_b.txt

Unfortunately, the log for ip-10-7-10-85 does not offer any further insides. Based on these information and what you prior wrote, I would say that the issue was "just" the database being absence longer than the retry timeout of five minutes.

louiswebdev · 2024-09-04T14:13:33Z

Yeah, sorry, my bad - I re-uploaded icingadb_logs_node_b.txt

oxzi · 2024-09-05T11:20:54Z

Thank you for the updated log file. The node ip-10-7-11-6 had a flapping HA status at 00:32:01. Could you please share the "Handing over" log message including all additional fields, as described in https://github.com/Icinga/icingadb/blob/main/doc/03-Configuration.md#systemd-journald-fields?

louiswebdev · 2024-09-05T14:32:29Z

icingadb_verbose_logs_node_b.txt

Sure, the timeframe is the same as in the previous log dumps.

oxzi · 2024-09-05T15:04:58Z

Thanks again for your fast response and the logs.

    MESSAGE=Handing over
    ICINGADB_REASON=instance update/insert deadline exceeded heartbeat expiry time

The HA handover reason was the timeout of the realization method. As no debug logging was enabled and no further information was logged, this was most likely due to a long blocking SQL command. I would guess this was the first effect of the database maintenance.

I am unsure what else to do here. Your crash is unfortunate and generally this is something which should not happen.

Thus, I would open for discussing if the retry timeout can be made configurable, if we want to consider some "database is absent, so let's idle a bit" mode or something else.

louiswebdev · 2024-09-06T07:22:03Z

Yeah, I understand - thanks for hearing me, and looking at the logs, and contextualising the problem for me.

I will of course closely watch the service and especially the RDS maintenance windows in the future.

I also think this must have been a particularly long maintenance, especially considering we have booked a multi-AZ database to prevent long outages.

Besides of that I'd like to express how well I regard the Icinga2 and related projects like IcingaDB. The documentation is very good, and reactions to issues, too. Thanks for that and keep up the good work.

oxzi added the crash label Sep 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

systemd units failed after "invalid connection" - manual recovery needed #794

systemd units failed after "invalid connection" - manual recovery needed #794

louiswebdev commented Sep 2, 2024

yhabteab commented Sep 2, 2024

louiswebdev commented Sep 2, 2024

louiswebdev commented Sep 2, 2024 •

edited

Loading

oxzi commented Sep 2, 2024

yhabteab commented Sep 2, 2024

louiswebdev commented Sep 2, 2024

louiswebdev commented Sep 2, 2024 •

edited

Loading

louiswebdev commented Sep 3, 2024

oxzi commented Sep 3, 2024

louiswebdev commented Sep 4, 2024 •

edited

Loading

oxzi commented Sep 4, 2024

louiswebdev commented Sep 4, 2024

oxzi commented Sep 5, 2024

louiswebdev commented Sep 5, 2024

oxzi commented Sep 5, 2024

louiswebdev commented Sep 6, 2024

systemd units failed after "invalid connection" - manual recovery needed #794

systemd units failed after "invalid connection" - manual recovery needed #794

Comments

louiswebdev commented Sep 2, 2024

Describe the bug

Expected behavior

Your Environment

Additional context

yhabteab commented Sep 2, 2024

louiswebdev commented Sep 2, 2024

louiswebdev commented Sep 2, 2024 • edited Loading

oxzi commented Sep 2, 2024

yhabteab commented Sep 2, 2024

louiswebdev commented Sep 2, 2024

louiswebdev commented Sep 2, 2024 • edited Loading

louiswebdev commented Sep 3, 2024

oxzi commented Sep 3, 2024

louiswebdev commented Sep 4, 2024 • edited Loading

oxzi commented Sep 4, 2024

louiswebdev commented Sep 4, 2024

oxzi commented Sep 5, 2024

louiswebdev commented Sep 5, 2024

oxzi commented Sep 5, 2024

louiswebdev commented Sep 6, 2024

louiswebdev commented Sep 2, 2024 •

edited

Loading

louiswebdev commented Sep 2, 2024 •

edited

Loading

louiswebdev commented Sep 4, 2024 •

edited

Loading