DocsBot AI - Training data VectorDB Cluster Issues – Incident details

Training data VectorDB Cluster Issues

Resolved
Major outage
Started about 2 months ago
Lasted about 7 hours

Affected

Website

Partial outage from 2:40 AM to 8:58 AM, Operational from 8:58 AM to 9:17 AM

Chatbots

Major outage from 2:40 AM to 8:58 AM, Operational from 8:58 AM to 9:17 AM

Updates
  • Resolved

    The cluster has been running smoothly for many hours now. Our provider reports that the root cause was a pod that failed to schedule: it got stuck while the node it was running on was being upgraded. They have fixed the process internally.

    We are working on getting assurances that better monitoring is in place to minimize downtime if something similar happens again. We are also researching whether moving to a high-availability (HA) setup would prevent similar issues in the future.

  • Monitoring

    Our cloud provider has implemented a fix that appears to have brought the cluster back online, and is currently monitoring the result. Analysis to follow.

  • Identified
    Update

    Our cloud provider has confirmed they can see where the issue lies and is now liaising with the SRE team on the right approach to repairing the cluster.

  • Identified

    We have been unable to access our primary production VectorDB, which is managed by our cloud provider. Their status page shows it undergoing unplanned maintenance, and we are waiting for a status report from them on progress toward restoring access. This affects bot creation, source creation and updates, and bot usage. We will post an update as soon as we have any news.