DocsBot AI Status - Incident history

Chat API outage

2024-09-18T14:36:00.000+00:00

Sep 18, 14:36:00 GMT+0
Identified - We are continuing to work on a fix for this incident.

Sep 18, 15:37:36 GMT+0
Monitoring - We implemented a fix and are currently monitoring the result.

Sep 18, 16:34:26 GMT+0
Resolved - We've confirmed the incident has not recurred, and we are implementing monitoring improvements to automatically handle this failure mode in the future.

DB errors during bot training

2024-09-05T09:12:00.000+00:00

Sep 5, 09:12:00 GMT+0
Investigating - We are currently investigating this incident. It appears that while bots are functional, training bots is triggering a timeout error message 'properties'. We are in contact with our cloud provider to look into the status of our cluster.

Sep 5, 14:56:56 GMT+0
Identified - From provider: About 6 hours ago the pod for the primary docsbot production cluster was moved to a new Kubernetes node. Due to this, it is still going through a startup process in which it is cleaning up old/stale data, and it is in read-only mode. Unfortunately we do not have a good estimate for how long this will take. We received no notice of this maintenance or that the DB would be unwritable for an extended period of time. We have temporarily disabled all training actions on the site until the DB is ready, to avoid further confusion or accidental deletion of sources.

Sep 5, 17:39:53 GMT+0
Identified - We are continuing to work on a fix for this incident. Writes are working again but running too slowly to enable for customers. A backup and upgrade of the DB is being performed right now to see if that can improve the issue with batch imports.

Sep 5, 18:22:35 GMT+0
Monitoring - After performing a DB upgrade and restart, training data ingestion seems to be performing well now. We are now re-enabling training in our Dashboard, and will continue to monitor performance.

Sep 5, 20:20:43 GMT+0
Resolved - This incident has been resolved.

API software upgrades

2024-08-07T21:02:26.576+00:00

Aug 7, 21:02:27 GMT+0
Identified - Maintenance is now in progress.

Aug 7, 22:02:26 GMT+0
Completed - Maintenance has completed successfully.

Code regression impacting <2% of Bots

2024-07-30T06:57:00.000+00:00

Jul 30, 06:57:00 GMT+0
Investigating - A code regression was deployed that affected a small subset of bots that had a specific unexpected metadata format in their training data (a url field saved as an empty string). Unfortunately, our automated testing and monitoring scripts did not detect this because it did not affect our test bots and only impacted <2% of customer bots. We are working on adding an additional monitoring solution that could detect this kind of edge-case regression in the future.

Jul 30, 15:00:08 GMT+0
Resolved - A code regression was deployed that affected a small subset of bots that had a specific unexpected metadata format in their training data (a url field saved as an empty string). Unfortunately, our automated testing and monitoring scripts did not detect this because it did not affect our test bots and only impacted <2% of customer bots. We are working on adding an additional monitoring solution that could detect this kind of edge-case regression in the future.
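
The failure mode described above, a `url` metadata field stored as an empty string, is the kind of input that can be normalized before it reaches downstream code. The following is only a minimal Python sketch under stated assumptions, not DocsBot's actual fix: the function name `normalize_source_metadata` and the dictionary layout are hypothetical and chosen for illustration.

```python
from typing import Any, Dict, Optional


def normalize_source_metadata(metadata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the metadata with a blank `url` treated as missing.

    Hypothetical helper: the real metadata schema is not documented here.
    """
    cleaned = dict(metadata or {})
    url = cleaned.get("url")
    # Treat empty or whitespace-only strings the same as an absent URL.
    if isinstance(url, str) and not url.strip():
        cleaned["url"] = None
    return cleaned
```

A test bot whose training data includes an empty `url` value would exercise this path, which is the case the automated tests reportedly missed.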

Training data VectorDB Cluster Issues

2024-07-19T02:40:00.000+00:00

Jul 19, 02:40:00 GMT+0
Identified - We have been unable to access our primary production VectorDB managed by our cloud provider. Their status page shows it undergoing unplanned maintenance, and we are waiting for a status report from them on progress in restoring access. This affects bot creation, source creation/updates, and bot usage. We will update as soon as we have any news.

Jul 19, 07:42:20 GMT+0
Identified - Our cloud provider has confirmed they can see where the issue lies and are now liaising with the SRE team on the right approach to repairing the cluster.

Jul 19, 08:58:00 GMT+0
Monitoring - Our cloud provider has implemented a fix, which appears to have brought the cluster back online, and the result is currently being monitored. Analysis to follow.

Jul 19, 09:17:00 GMT+0
Resolved - The cluster has been running smoothly for many hours now. Our provider reports that the root cause was a pod that was not scheduling because it got stuck when the node it was on was undergoing a machine upgrade. They have fixed the process internally. We are working on getting assurances that better monitoring is in place to minimize downtime if something similar happens again. We are also researching whether moving to an HA setup would prevent similar issues in the future.

DNS Propagation Errors

2024-07-16T06:00:00.000+00:00

Jul 16, 06:00:00 GMT+0
Investigating - DNS propagation and SSL renewal are causing an infinite redirect on the website.

Jul 16, 11:30:00 GMT+0
Resolved - This incident has been resolved. Last night we switched our DNS provider to add more sophisticated DDoS protection to the [DocsBot.ai](http://DocsBot.ai) website and API. Unfortunately, in some global regions, the DNS propagation caused an SSL redirection loop that made the website, admin API, and widget inaccessible. We apologize profusely for the outage and want to assure our users that this was a one-time maintenance operation that will not need to happen again.
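
A redirection loop like the one described above can be detected by following each redirect target and checking whether a URL repeats. The sketch below is illustrative only and makes assumptions beyond the incident report: `find_redirect_loop` is a hypothetical helper, and `next_location` stands in for whatever function returns the redirect target of a URL (or `None` when the response is not a redirect).

```python
from typing import Callable, List, Optional


def find_redirect_loop(
    start_url: str,
    next_location: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> Optional[List[str]]:
    """Follow redirects from start_url.

    Returns the chain of URLs if a URL repeats or max_hops is exceeded,
    otherwise None (the chain ended in a non-redirect response).
    """
    seen = [start_url]
    current = start_url
    for _ in range(max_hops):
        target = next_location(current)
        if target is None:
            return None  # Reached a final, non-redirect response.
        if target in seen:
            return seen + [target]  # Same URL visited twice: a loop.
        seen.append(target)
        current = target
    return seen  # Too many hops; treat as a suspected loop.


# Simulated redirect table for an HTTP <-> HTTPS loop (made-up URLs).
hops = {"http://example.test": "https://example.test",
        "https://example.test": "http://example.test"}
loop = find_redirect_loop("http://example.test", hops.get)
```

In a real probe, `next_location` would issue a request without following redirects and return the `Location` header; running it from several regions would matter here, since the incident only affected some of them.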

DB Migrations

2024-03-29T16:43:56.141+00:00

Mar 29, 16:43:56 GMT+0
Identified - We are planning scheduled maintenance during this time. It may lead to temporary slowness or outages for some bots during the migration process.

Mar 29, 23:50:38 GMT+0
Identified - Migration is proceeding; no outages so far.

Mar 30, 16:43:56 GMT+0
Completed - Maintenance has completed successfully.

DB Maintenance

2024-03-26T20:26:00.000+00:00

Mar 26, 20:26:00 GMT+0
Identified - We are performing DB maintenance and a migration. We do not expect much, if any, downtime as we migrate bots one by one.

Mar 26, 20:38:22 GMT+0
Identified - Maintenance is now in progress.

Mar 26, 23:26:00 GMT+0
Completed - Maintenance has completed successfully.

VectorDB Connection issues

2024-02-12T17:04:53.048+00:00

Feb 12, 17:04:53 GMT+0
Investigating - We are currently investigating this incident with our managed database provider.

Feb 12, 17:18:39 GMT+0
Identified - Our database provider is currently investigating.

Feb 12, 19:26:50 GMT+0
Identified - They have engaged our WCS engineering team to investigate further into this issue. They have now attempted to move to a new node and were waiting for startup. I have escalated this issue now and have raised a higher level incident for this ticket. Unfortunately, I don't have an exact timeframe for when this will be resolved. Rest assured we are prioritizing this with our WCS engineering team and I am monitoring activities actively.

Feb 12, 22:04:17 GMT+0
Identified - An update from the WCS DB team: The cluster was in a crash loop with a fairly repeating pattern, and we thought it might have been hitting liveness limits. However, we then saw it crash live after only ~7 min, so we know it had to be something else. Our engineers are still working on this and we hope to have this resolved as soon as possible. Thank you for your patience.

Feb 12, 22:24:53 GMT+0
Resolved - This incident has been resolved. We will follow up with the root cause when we have one.

Elevated Chat Error rates

2023-10-25T08:05:00.000+00:00

Oct 25, 08:05:00 GMT+0
Investigating - There was a very high error rate from the OpenAI API around that time, but our VectorDB cluster also began having connection issues. It may be that the OpenAI outage was the root cause of the DB issues. I'm having our DB provider review the health of our cluster again today just to be sure.

Oct 25, 08:45:00 GMT+0
Resolved - This incident has been resolved.

Indexing new sources shows "store is read-only" error

2023-10-24T17:38:35.915+00:00

Oct 24, 17:38:35 GMT+0
Identified - We are currently investigating this incident. It appears that our cloud provider needs to increase disk space for our DB cluster. We are working with them now to do this (it should be automated).

Oct 24, 23:36:07 GMT+0
Identified - Our cloud provider has increased the disks and is continuing to work on a fix for this incident.

Oct 25, 00:01:07 GMT+0
Resolved - This incident has been resolved.

Database connection issues

2023-10-12T22:45:00.000+00:00

Oct 12, 22:45:00 GMT+0
Identified - Our database provider has recently updated our credentials to include multifactor authentication, which unfortunately caused a disruption in the authentication process to the database on our API. Rest assured, your data is safe; we just can't access it for the moment, and we are actively trying to get in contact with them to revert to the previous authentication method. We deeply regret the inconvenience caused, and should hopefully be back online shortly!

Oct 13, 06:00:00 GMT+0
Resolved - This incident has been resolved.

OpenAI API Bug

2023-10-11T20:50:00.000+00:00

Oct 11, 20:50:00 GMT+0
Investigating - OpenAI introduced a breaking bug in their streaming API. This is causing about 30% of chat widget responses to return an error message and not save the question to logs. We are waiting for them to resolve this bug.

Oct 12, 00:50:00 GMT+0
Resolved - This incident has been resolved.