DocsBot AI - Status Page

Chatbots experiencing partial outage

Intermittent database search errors for a subset of bots
  • Update

    We’ve been working closely with our cloud provider, who has now confirmed the root causes of the recent intermittent vector index errors.

    What happened

    - When compression was enabled, some tenants were in a lazy-loading state.

    - In certain cases, compression could begin before the cache was fully loaded, leading to intermittent shard-level errors.

    - Because each tenant runs across multiple shards, responses could come from different shards, which made the issue appear inconsistent.

    - This has only affected a small subset of bots at any given time.
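    The failure mode described above can be illustrated with a small, hypothetical simulation. It assumes a simplified model in which each search request is served by a single randomly chosen shard, and it treats a shard as broken only when it was compressed before its cache finished lazy-loading. The `Shard` class and the helper functions are illustrative names, not our cloud provider's API.

```python
import random
from dataclasses import dataclass


@dataclass
class Shard:
    # Hypothetical model of one shard belonging to a tenant.
    name: str
    cache_loaded: bool
    compressed: bool

    @property
    def healthy(self) -> bool:
        # The failure mode modeled here: compression ran while the cache
        # was still lazy-loading. Every other combination is treated as fine.
        return not (self.compressed and not self.cache_loaded)

    def search(self, query: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"vector index error on shard {self.name}")
        return f"results for {query!r} from {self.name}"


def search_tenant(shards: list[Shard], query: str, rng: random.Random) -> str:
    # Simplifying assumption: each request lands on one randomly chosen shard,
    # so whether it fails depends on which shard serves it.
    return rng.choice(shards).search(query)


def error_rate(shards: list[Shard], trials: int, seed: int = 0) -> float:
    # Fraction of simulated requests that hit a broken shard.
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        try:
            search_tenant(shards, "pricing", rng)
        except RuntimeError:
            failures += 1
    return failures / trials


tenant = [
    Shard("s1", cache_loaded=True, compressed=True),
    Shard("s2", cache_loaded=False, compressed=True),  # compressed too early
    Shard("s3", cache_loaded=True, compressed=True),
]
print(f"observed error rate: {error_rate(tenant, 10_000):.2f}")  # close to 0.33
```

    In this toy model, a tenant with one bad shard out of three fails on roughly a third of requests while the rest succeed, which is why the same bot could answer correctly one moment and return a search error the next.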

    Impact

    - Of 592 tenants with compressed data (those potentially impacted), only 4 (under 1%) required requantization due to errors.

    - Errors have been limited and intermittent rather than systemic.

    Actions taken by our cloud provider

    - Cluster upgraded to patch version 1.35.10, which includes a fix for the RQ compression behavior.

    - Async indexing enabled to immediately compress remaining shards.

    - Repair tasks are running to restore any missing vector index elements or entry points.

    - A full repair script is in progress to ensure all tenants are fully remediated.

    - The fix has been identified and is moving through their QA pipeline.

    Current status

    - Repair tasks are approximately 50% complete and progressing on the impacted bots.

    - A separate issue impacting new tenant creation is under investigation.

    - We expect the remaining cleanup and stabilization to be completed by midday.

    We will continue monitoring closely and provide further updates as remediation completes.

  • Monitoring

    The issue appears to have resurfaced overnight for a small handful of bots. The cloud team is still working on a permanent fix for the root cause that allowed it to recur.

  • Resolved

    This incident has been resolved. The repair script appears to have completed.

  • Update

    The number of affected bots is dropping as they are fixed one by one. We project that all bots will be repaired within the hour.

  • Monitoring

    Our cloud provider has implemented a fix and is monitoring the results as it gradually repairs affected bots across all database shards.

  • Update

    We realize that part of the confusion around this outage comes from a change in behavior. Previously, when there was a database issue, the chat interface showed a user-friendly error message. With the new architecture we launched a few weeks ago, a failed source search was no longer treated as a fatal error; instead, the error message was passed to the AI, which still returned an answer to the user, noting that it couldn't find the sources.

    We have pushed a hotfix to our API so that when the source search fails, it once again returns an actual error instead of giving the user a full response that could contain hallucinations.

    We are waiting on a status update from our provider on the progress of their fixes.
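    The hotfix behavior described above amounts to "failing closed" on source search. The following is a minimal, hypothetical sketch of that pattern, not our actual API code: `answer_question`, `SourceSearchError`, and the stand-in search and model functions are illustrative names assumed for this example.

```python
from typing import Callable, List


class SourceSearchError(Exception):
    """Hypothetical error type for a failed source (vector database) search."""


def answer_question(
    question: str,
    search_sources: Callable[[str], List[str]],
    generate_answer: Callable[[str, List[str]], str],
) -> str:
    # Fail closed: if the source search fails, raise a real error instead of
    # passing the failure text to the model and letting it answer anyway.
    try:
        sources = search_sources(question)
    except Exception as exc:
        raise SourceSearchError("Database search error, please try again.") from exc
    return generate_answer(question, sources)


# Hypothetical stand-ins for the vector search and the LLM call.
def failing_search(question: str) -> List[str]:
    raise ConnectionError("shard unavailable")


def working_search(question: str) -> List[str]:
    return ["docs/pricing.md"]


def fake_llm(question: str, sources: List[str]) -> str:
    return f"Answer to {question!r} citing {len(sources)} source(s)"


print(answer_question("How much does it cost?", working_search, fake_llm))
try:
    answer_question("How much does it cost?", failing_search, fake_llm)
except SourceSearchError as err:
    print(f"error shown to user: {err}")
```

    The design trade-off is that users see an explicit error more often, but they never receive an answer that was generated without any retrieved sources.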

  • Update

    Our cloud provider has identified a performance configuration on the cluster that was preventing entrypoint repairs from completing. They are now rolling out an update to remove this setting and restore normal repair operations.

  • Identified

    The cloud team is continuing to work on a fix for this incident. They attempted a full cluster restart, but that did not resolve the issue.

  • Update

    For a subset of bots, error rates remain above 75% when searching docs. Our cloud provider is still working to clear this issue, which appears to affect a subset of our high-availability nodes.

  • Monitoring

    Error rates have dropped nearly completely, but our cloud provider is still working on fixing the underlying root cause with configuration tweaks.

  • Identified

    We've seen increasing error rates in our training database, which show up as a database search error or as "I couldn't access the documentation" when chatting with a bot. The problem is intermittent; at its peak, as many as 5% of requests trigger the error.

    We are in communication with our vector database cloud provider, who is working on fixing this issue on their side.

Website - Operational

100% - uptime

Chatbots - Partial outage

100% - uptime

OpenAI → API - Operational

Third Party: Vercel → Edge Functions - Operational
