Cluster has been running smoothly for many hours now. Our provider reports that the root cause was a pod wasn't scheduling as it got stuck when the node it was on was upgrading machine. They have fixed the process internally.
We are working on getting assurances that better monitoring is in place to minimize downtime if something similar happens again. We are also researching if moving to a HA setup would prevent similar issues in the future.