DocsBot AI - Intermittent Database search errors – Incident details

Chatbots experiencing partial outage

Intermittent Database search errors

Resolved
Major outage
Started 1 day ago
Lasted about 20 hours

Affected

Chatbots

Degraded performance from 7:30 AM to 4:30 PM, Operational from 4:30 PM to 6:43 PM, Partial outage from 6:43 PM to 12:00 AM

Updates
  • Resolved

    This incident has been resolved. The repair script appears to have completed.

  • Update

    The number of affected bots is dropping as they are repaired one by one. We project that all bots will be repaired within the hour.

  • Monitoring

    Our cloud provider has implemented a fix and is monitoring the results as it gradually repairs affected bots across all database shards.

  • Update

    We realize that part of the confusion with this outage is that, previously, database issues produced a user-friendly error message in the chat interface. With the new architecture we launched a few weeks ago, a failed search was no longer a fatal error; the error message was passed to the AI instead. The AI would still return an answer to the user, just letting them know that it couldn't find the sources.

    We have pushed a hotfix to our API so that when searching for sources fails, it again returns an actual error instead of responding fully to the user with the potential for hallucination.

    We are waiting on a status update from our provider on how their fixes are progressing.

  • Update

    Our cloud provider has identified a performance configuration on the cluster that was preventing entrypoint repairs from completing. They are now rolling out an update to remove this setting and restore normal repair operations.

  • Identified

    The cloud team is continuing to work on a fix for this incident. They attempted a full cluster restart but that did not solve the issue.

  • Update

    For a subset of bots, error rates remain higher than 75% when searching docs. Our cloud provider is still working on clearing this issue, which seems to be affecting a subset of our high availability nodes.

  • Monitoring

    Error rates have dropped nearly completely, but our cloud provider is still working on fixing the underlying root cause with configuration tweaks.

  • Identified

    We've seen increasing error rates with our training database that show up as a database search error or "I couldn't access the documentation" when chatting with a bot. The problem is intermittent. At its peak, as much as 5% of requests have triggered the error.

    We are in communication with our vector database cloud provider, who is working on fixing this issue on their side.
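
For readers curious about the API hotfix mentioned in the updates above (returning an actual error when searching for sources fails, rather than passing the failure to the AI), the following is a minimal Python sketch of that general pattern. It is not our production code. The names `SourceSearchError`, `answer_question`, `handle_chat_request`, `search_sources`, and `generate_answer` are hypothetical and used only for illustration.

```python
class SourceSearchError(Exception):
    """Raised when the source search backend fails (hypothetical name)."""


def answer_question(question, search_sources, generate_answer):
    """Answer a question, treating a failed source search as fatal.

    search_sources and generate_answer are hypothetical callables standing
    in for the vector database query and the AI model call.
    """
    try:
        sources = search_sources(question)
    except Exception as exc:
        # Stop here: do not hand the failure text to the model, which could
        # otherwise produce an answer with no sources behind it.
        raise SourceSearchError("Database search error, please try again.") from exc
    return generate_answer(question, sources)


def handle_chat_request(question, search_sources, generate_answer):
    """Map the outcome to a response dict a chat interface could render."""
    try:
        answer = answer_question(question, search_sources, generate_answer)
    except SourceSearchError as err:
        return {"ok": False, "error": str(err)}
    return {"ok": True, "answer": answer}
```

In this sketch the model is never called when the search raises, so the user sees an explicit error instead of an answer generated without sources.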