The idea is:
To introduce a native bulk re-execution feature within the "Executions" view of a workflow, or in the "Executions" tab at the project level. This would allow an operator to filter for all failed executions, select them in bulk (e.g., via checkboxes or a "select all" action), and re-run them all with a single "Re-execute Selected" command. The system would then re-run each selected execution using its original trigger data.
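To make the proposed behavior concrete, here is a minimal sketch of the bulk-retry logic: filter for failed executions, re-run each with its original trigger data, and report the outcome. The `Execution` structure and the `retry` callable are illustrative placeholders, not n8n's actual internal API.

```python
# Illustrative sketch of "Re-execute Selected": the data model and retry
# callable are hypothetical stand-ins, not n8n's real internals.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Execution:
    id: str
    status: str          # e.g. "success" | "error"
    trigger_data: dict   # payload the workflow originally received


def bulk_retry(executions: Iterable[Execution],
               retry: Callable[[Execution], bool]) -> dict:
    """Re-run every failed execution with its original trigger data."""
    report = {"retried": 0, "succeeded": 0, "failed_again": 0}
    for ex in executions:
        if ex.status != "error":
            continue  # operator filtered the list down to failed runs
        report["retried"] += 1
        if retry(ex):  # re-execute using ex.trigger_data
            report["succeeded"] += 1
        else:
            report["failed_again"] += 1
    return report


# Example: two failed executions, one of which succeeds on retry.
runs = [
    Execution("1", "success", {}),
    Execution("2", "error", {"order_id": 42}),
    Execution("3", "error", {"order_id": 43}),
]
result = bulk_retry(runs, retry=lambda ex: ex.trigger_data["order_id"] == 42)
print(result)  # {'retried': 2, 'succeeded': 1, 'failed_again': 1}
```

A summary report like this is worth surfacing in the UI after a bulk action, so the operator immediately sees which transactions still need attention.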
My use case:
Our workflows handle thousands of critical, real-time transactions daily. If a dependent service (like a third-party API) has an outage, hundreds of executions can fail in a short period. Once the service is restored, we need to reprocess all of those failed transactions. The current process requires an operator to manually find and re-run each failed execution one-by-one, which is not feasible at scale and delays recovery.
I think it would be beneficial to add this because:
This feature addresses a critical operational need for managing workflows in a real-world production environment.
- Drastically Improves Recovery Time: It transforms a slow, manual recovery process into a swift, one-click action, minimizing data processing delays after an outage.
- Reduces Operational Overhead: It empowers operators to manage large-scale failures efficiently without the stress and human error associated with repetitive manual tasks.
- Enhances Enterprise Readiness: It provides a robust, built-in tool for operational resilience, helping teams meet their Recovery Time Objectives (RTOs). This kind of reliability is essential for enterprise-grade deployments where meeting service-level objectives is paramount.
- Provides a Vital Safety Net: While complex queuing patterns are a best practice for some teams, this feature offers a more accessible, native recovery option for all workflows, making the platform more forgiving and powerful.