readthedocs.org
19f8bfea - Build: disable builds after 25 consecutive failed builds on default version (#12624)

Commit
7 days ago
Build: disable builds after 25 consecutive failed builds on default version (#12624) - [x] Explore and understand the codebase - [x] Add a new notification message for consecutive failed builds (`MESSAGE_PROJECT_BUILDS_DISABLED_DUE_TO_CONSECUTIVE_FAILURES`) in `readthedocs/projects/notifications.py` - [x] Create a signal receiver to check for consecutive failed builds after build completion in `readthedocs/builds/signals_receivers.py` - [x] Add a helper method to count consecutive failed builds for the default version - [x] When 50+ consecutive failures occur: attach notification to Project and disable builds (`skip=True`) - [x] Add documentation explaining this behavior in `docs/user/builds.rst` - [x] Add tests for the new functionality in `readthedocs/builds/tests/test_signals_receivers.py` - [x] Run code review and fix any issues - [x] Run CodeQL security check - no vulnerabilities found - [x] Address review feedback: - [x] Move "Automatic disabling of builds" section before "Build resources" - [x] Simplify documentation text - [x] Move RTD_BUILDS_MAX_CONSECUTIVE_FAILURES setting to base.py - [x] Update docstring to reference the setting name - [x] Keep only the main test for disabling at threshold - [x] Convert signal handler to trigger a Celery task (builders don't have DB access) - [x] Add comment explaining why project_slug/version_slug check is needed - [x] Revert task logic to use groupby solution (the simplified version didn't work correctly) <!-- START COPILOT CODING AGENT SUFFIX --> <details> <summary>Original prompt</summary> ---- *This section details on the original issue you should resolve* <issue_title>Build: disable webhooks after N failed builds in a row</issue_title> <issue_description>Analyzing the data we have, I've found that we have a bunch of projects that have been triggering builds where _all of them_ have failed. During months. This consumes resources in our side, but also degrade the UX for valid projects since they have to wait for a build that _we already know_ it's gonna fail, before their project's build is taken. So, to reduce projects in this scenario, we are thinking about "auto-disable webhooks on projects that have N failed builds in a row" and communicate their maintainers. To re-enable the webhook, they would have to go to the project's admin page and re-activate/re-configure/re-add the webhook. This is the metabase question for future reference: https://ethicalads.metabaseapp.com/question/309-project-with-all-failed-builds-in-the-last-3-months. It shows the amount of builds and minutes consumed per project.</issue_description> <agent_instructions>We want to attach a notification to the `Project` when the active default version fails 50+ times to build. Besides, we want to disable the builds for this project and tell the user to contact support if it was a mistake. Also, adds documentation about this behavior.</agent_instructions> ## Comments on the Issue (you are @copilot in this section) <comments> <comment_new><author>@agjohnson</author><body> Also, it would be great to have a maintainers site notification specific to this that goes out, so we can re-use this pattern when soft-disabling a project. A hard disable (setting `Project.skip = True`) can be reserved for abusive/spam projects, but that "notification" pattern is quite broken for normal users. Or perhaps I'm describing two potential notifications -- a specific notification and a generic notification? - "Your project hasn't built successfully for 50 consecutive builds, we've disabled your project from automatically building" - "Your project has a problem, we've disabled your project from automatically building"</body></comment_new> <comment_new><author>@humitos</author><body> Yeah, I'm fine with better notifications. Maybe we want to make that part of the work on readthedocs/readthedocs.org#9279 and readthedocs/readthedocs.org#3399 We need to probably set two different thresholds here: 1. N consecutive builds on branch/tags (lower, maybe 25) and disable the webhook completely 2. N consecutive builds on PR builds (upper, maybe 50) and only disable PR building I'm trying to say that "builds failing on PRs are _more acceptable_ than failing on branch/tags and should be handled differently"</body></comment_new> <comment_new><author>@agjohnson</author><body> Yeah, great point on PR builds. Maybe we are strictly only concerned with failed active versions? I agree we can rethink notifications in a larger PR. If we add something here it will be using the new notification pattern anyways. Some of the old patterns can eventually be culled, and just adding a new mechanism for disabling projects here would be enough to start that process.</body></comment_new> <comment_new><author>@humitos</author><body> @agjohnson > Maybe we are strictly only concerned with failed active versions? It makes sense to me, yeah.</body></comment_new> <comment_new><author>@humitos</author><body> I'd like to think about prioritizing this feature if possible because this will help us a lot to know "how many active/valid non-spam projects do we have in our platform" in an easier way. This data will be useful when migrating/deprecating things as we are doing with "building without a config file" since we will know "how many of the active/valid projects have already migrated". Many non-spam projects will not migrate to the new configuration file just because they don't need it or have to. Mainly projects that are archived or where created just for testing/development/educational purposes, for example. I assume we have _a lot of them_. The work required that I see here is: - [ ] check for N failed builds on the default version and delete/disable the webhook integrated with that project - [ ] send an onsite and email notification to all the maintainers explaining the situation - [ ] write some documentation explaining this behavior (probably in https://docs.readthedocs.io/en/stable/guides/connecting-git-account.html)</body></comment_new> <comment_new><author>@agjohnson</author><body> Because taking automated actions on projects has been hard to get right -- spam banning, build retry -- maybe we should start fairly conservatively here. I think I'm comfortable saying that any project who has a *default* version that has 25 consecutive failed builds is probably not monitoring their project. We also don't have to take any automated action yet, we could start with just a notification or drip campaign. At least disabling the webho... </details> - Fixes readthedocs/readthedocs.org#9690 <!-- START COPILOT CODING AGENT TIPS --> --- 💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: humitos <244656+humitos@users.noreply.github.com> Co-authored-by: Manuel Kaufmann <humitos@gmail.com> Co-authored-by: ericholscher <25510+ericholscher@users.noreply.github.com>
Author
Parents
Loading