We’ve been working closely with our cloud provider who have now confirmed the root causes of the recent intermittent vector index errors.
What happened
- When compression was enabled, some tenants were in a lazy-loading state.
- In certain cases, compression could begin before the cache was fully loaded, leading to intermittent shard-level errors.
- Because each tenant runs across multiple shards, responses can come from different shards, which made the issue appear inconsistent.
- This has only affected a small subset of bots at any given time.
Impact
- Of 592 tenants with compressed data (those potentially impacted), only 4 required requantization due to errors.
- Errors have been limited and intermittent rather than systemic.
Actions taken by our cloud provider
- Cluster upgraded to patch version 1.35.10, which includes a fix for the RQ compression behavior.
- Async indexing enabled to immediately compress remaining shards.
- Repair tasks are running to restore any missing vector index elements or entry points.
- A full repair script is in progress to ensure all tenants are fully remediated.
- The fix has been identified and is moving through their QA pipeline.
Current status
- Repair tasks are approximately 50% complete and progressing on the impacted bots.
- A separate issue impacting new tenant creation is under investigation.
- We expect remaining cleanup and stabilization to be completed by mid day.
We will continue monitoring closely and provide further updates as remediation completes.