Engineering captured a detailed log of the event and determined the root cause of this incident.
Engineering manually restarted a web server at 1:19:25 PM ET, which caused end users' browsers to reload the Omnistream site. Unexpectedly, a subset of web clients at a large installation could not complete the reload within the 10-second timeout set in software. As a result, they retried the reload every 10 seconds, which drove the system to very high load and, in a cascading effect, denied real-time response to other clients. This prevented some customers from dialing or taking calls, with the same effect as a denial-of-service attack.
As an immediate mitigation, at 7:32 PM ET engineering increased the timeout to 50 seconds to give slower web clients more latitude. As a critical-priority development item, engineering is also optimizing the work performed by web clients so that call centers supporting large networks of dealers can reliably use popular enterprise desktop computers.
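The reload loop and the effect of the longer timeout can be illustrated with a small model. This is a sketch only: the function name, the 30-second load time, and the 5-minute window are assumptions chosen for illustration, not measurements from the incident.

```typescript
// Hypothetical model of the reload loop; all names and numbers are illustrative.
interface ReloadOutcome {
  attempts: number;   // reload requests this client sends within the window
  completed: boolean; // whether any attempt finished before its timeout
}

function simulateReloads(
  loadTimeSec: number, // time this client needs to finish one reload (assumed)
  timeoutSec: number,  // client-side reload timeout
  windowSec: number    // observation window
): ReloadOutcome {
  let elapsed = 0;
  let attempts = 0;
  while (elapsed < windowSec) {
    attempts += 1;
    if (loadTimeSec <= timeoutSec) {
      // The reload finishes in time, so the client stops retrying.
      return { attempts, completed: true };
    }
    // The timeout fires first: the partial reload is abandoned and restarted.
    elapsed += timeoutSec;
  }
  return { attempts, completed: false };
}

// A hypothetical slow client that needs 30 s per reload, observed for 5 minutes.
const before = simulateReloads(30, 10, 300); // 10-second timeout
const after = simulateReloads(30, 50, 300);  // 50-second timeout
console.log(before, after);
```

Under these assumptions, the client with the 10-second timeout sends 30 reload requests in five minutes and never finishes, while the same client with the 50-second timeout finishes on its first attempt. The model ignores the added server slowdown under load, which would make the real loop worse.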
The root cause of this outage was addressed in a final software update on 2024-05-29. The client was refactored to fetch dealership information from web servers lazily, as needed. Whether a call center services one dealership or a thousand, its browsers can now reload the Omnistream site in under three seconds.
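The general idea of lazy fetching can be sketched as follows. This is not the Omnistream client code: the `Dealership` shape, the `LazyDealershipCache` class, and the stand-in loader are hypothetical, and a real client would call the web server inside the loader.

```typescript
// Hypothetical record shape; the report does not describe the real dealership data.
interface Dealership {
  id: string;
  name: string;
}

type Loader = (id: string) => Promise<Dealership>;

class LazyDealershipCache {
  private readonly pending = new Map<string, Promise<Dealership>>();
  private readonly load: Loader;

  constructor(load: Loader) {
    this.load = load;
  }

  // Fetches a dealership the first time it is requested and reuses the same
  // promise afterwards, so page load does no per-dealership work.
  get(id: string): Promise<Dealership> {
    let entry = this.pending.get(id);
    if (entry === undefined) {
      entry = this.load(id);
      this.pending.set(id, entry);
    }
    return entry;
  }
}

// Stand-in loader that counts requests instead of contacting a server.
let requests = 0;
const cache = new LazyDealershipCache(async (id: string): Promise<Dealership> => {
  requests += 1;
  return { id, name: `Dealership ${id}` };
});

// Nothing is fetched at construction time, however many dealerships exist.
const requestsAtStartup = requests;
const first = cache.get("d-42");
const second = cache.get("d-42"); // reuses the in-flight or completed request
const requestsAfterUse = requests;
console.log(requestsAtStartup, requestsAfterUse, first === second);
```

With this pattern, the cost of a reload no longer grows with the number of dealerships a call center services, because each record is requested only when it is first needed. A production version would also need to evict failed requests so that a transient error is not cached.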