Build: disable builds after 25 consecutive failed builds on default version (#12624)
- [x] Explore and understand the codebase
- [x] Add a new notification message for consecutive failed builds
(`MESSAGE_PROJECT_BUILDS_DISABLED_DUE_TO_CONSECUTIVE_FAILURES`) in
`readthedocs/projects/notifications.py`
- [x] Create a signal receiver to check for consecutive failed builds
after build completion in `readthedocs/builds/signals_receivers.py`
- [x] Add a helper method to count consecutive failed builds for the
default version
- [x] When 50+ consecutive failures occur: attach notification to
Project and disable builds (`skip=True`)
- [x] Add documentation explaining this behavior in
`docs/user/builds.rst`
- [x] Add tests for the new functionality in
`readthedocs/builds/tests/test_signals_receivers.py`
- [x] Run code review and fix any issues
- [x] Run CodeQL security check - no vulnerabilities found
- [x] Address review feedback:
- [x] Move "Automatic disabling of builds" section before "Build
resources"
- [x] Simplify documentation text
- [x] Move RTD_BUILDS_MAX_CONSECUTIVE_FAILURES setting to base.py
- [x] Update docstring to reference the setting name
- [x] Keep only the main test for disabling at threshold
- [x] Convert signal handler to trigger a Celery task (builders don't
have DB access)
- [x] Add comment explaining why project_slug/version_slug check is
needed
- [x] Revert task logic to use groupby solution (the simplified version
didn't work correctly)
<!-- START COPILOT CODING AGENT SUFFIX -->
<details>
<summary>Original prompt</summary>
----
*This section details on the original issue you should resolve*
<issue_title>Build: disable webhooks after N failed builds in a
row</issue_title>
<issue_description>Analyzing the data we have, I've found that we have a
bunch of projects that have been triggering builds where _all of them_
have failed. During months. This consumes resources in our side, but
also degrade the UX for valid projects since they have to wait for a
build that _we already know_ it's gonna fail, before their project's
build is taken.
So, to reduce projects in this scenario, we are thinking about
"auto-disable webhooks on projects that have N failed builds in a row"
and communicate their maintainers. To re-enable the webhook, they would
have to go to the project's admin page and
re-activate/re-configure/re-add the webhook.
This is the metabase question for future reference:
https://ethicalads.metabaseapp.com/question/309-project-with-all-failed-builds-in-the-last-3-months.
It shows the amount of builds and minutes consumed per
project.</issue_description>
<agent_instructions>We want to attach a notification to the `Project`
when the active default version fails 50+ times to build. Besides, we
want to disable the builds for this project and tell the user to contact
support if it was a mistake.
Also, adds documentation about this behavior.</agent_instructions>
## Comments on the Issue (you are @copilot in this section)
<comments>
<comment_new><author>@agjohnson</author><body>
Also, it would be great to have a maintainers site notification specific
to this that goes out, so we can re-use this pattern when soft-disabling
a project. A hard disable (setting `Project.skip = True`) can be
reserved for abusive/spam projects, but that "notification" pattern is
quite broken for normal users.
Or perhaps I'm describing two potential notifications -- a specific
notification and a generic notification?
- "Your project hasn't built successfully for 50 consecutive builds,
we've disabled your project from automatically building"
- "Your project has a problem, we've disabled your project from
automatically building"</body></comment_new>
<comment_new><author>@humitos</author><body>
Yeah, I'm fine with better notifications. Maybe we want to make that
part of the work on readthedocs/readthedocs.org#9279 and
readthedocs/readthedocs.org#3399
We need to probably set two different thresholds here:
1. N consecutive builds on branch/tags (lower, maybe 25) and disable the
webhook completely
2. N consecutive builds on PR builds (upper, maybe 50) and only disable
PR building
I'm trying to say that "builds failing on PRs are _more acceptable_ than
failing on branch/tags and should be handled
differently"</body></comment_new>
<comment_new><author>@agjohnson</author><body>
Yeah, great point on PR builds. Maybe we are strictly only concerned
with failed active versions?
I agree we can rethink notifications in a larger PR. If we add something
here it will be using the new notification pattern anyways. Some of the
old patterns can eventually be culled, and just adding a new mechanism
for disabling projects here would be enough to start that
process.</body></comment_new>
<comment_new><author>@humitos</author><body>
@agjohnson
> Maybe we are strictly only concerned with failed active versions?
It makes sense to me, yeah.</body></comment_new>
<comment_new><author>@humitos</author><body>
I'd like to think about prioritizing this feature if possible because
this will help us a lot to know "how many active/valid non-spam projects
do we have in our platform" in an easier way. This data will be useful
when migrating/deprecating things as we are doing with "building without
a config file" since we will know "how many of the active/valid projects
have already migrated".
Many non-spam projects will not migrate to the new configuration file
just because they don't need it or have to. Mainly projects that are
archived or where created just for testing/development/educational
purposes, for example. I assume we have _a lot of them_.
The work required that I see here is:
- [ ] check for N failed builds on the default version and
delete/disable the webhook integrated with that project
- [ ] send an onsite and email notification to all the maintainers
explaining the situation
- [ ] write some documentation explaining this behavior (probably in
https://docs.readthedocs.io/en/stable/guides/connecting-git-account.html)</body></comment_new>
<comment_new><author>@agjohnson</author><body>
Because taking automated actions on projects has been hard to get right
-- spam banning, build retry -- maybe we should start fairly
conservatively here. I think I'm comfortable saying that any project who
has a *default* version that has 25 consecutive failed builds is
probably not monitoring their project.
We also don't have to take any automated action yet, we could start with
just a notification or drip campaign.
At least disabling the webho...
</details>
- Fixes readthedocs/readthedocs.org#9690
<!-- START COPILOT CODING AGENT TIPS -->
---
💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: humitos <244656+humitos@users.noreply.github.com>
Co-authored-by: Manuel Kaufmann <humitos@gmail.com>
Co-authored-by: ericholscher <25510+ericholscher@users.noreply.github.com>