
Section-wide circuit breaking
Open, Medium, Public

Description

Currently, if one section (e.g. s2) becomes slow or overloaded, appservers pile up waiting for responses from s2, even though 80-90% of requests have nothing to do with s2. This turns a local incident (s2 being unavailable) into a general "everything is now down" outage.

Circuit breakers are designed to handle exactly such scenarios. If MediaWiki immediately fails any attempt to connect to an overloaded section, the appservers are saved from being exhausted. Based on the numbers I collected, a section is overloaded if all of its replicas have more than 400 connections.
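The rule above can be sketched in a few lines. This is only an illustration in Python, not the MediaWiki rdbms code: the names Replica, SectionOverloadedError, is_section_overloaded and pick_replica are hypothetical, and the connection counts are assumed to come from some existing monitor. The only value taken from this task is the threshold of 400.

```python
from dataclasses import dataclass


class SectionOverloadedError(RuntimeError):
    """Raised instead of waiting on a section whose replicas are all saturated."""


@dataclass
class Replica:
    name: str
    connections: int  # current connection count, assumed to come from a monitor


# 400 is the figure from the task description; everything else is hypothetical.
OVERLOAD_THRESHOLD = 400


def is_section_overloaded(replicas, threshold=OVERLOAD_THRESHOLD):
    """True only when every replica has more than `threshold` connections.

    An empty replica list is treated as "not overloaded" in this sketch.
    """
    return bool(replicas) and all(r.connections > threshold for r in replicas)


def pick_replica(section, replicas, threshold=OVERLOAD_THRESHOLD):
    """Fail fast if the section is overloaded, else return the least-busy replica."""
    if is_section_overloaded(replicas, threshold):
        # Circuit is open: refuse immediately rather than queueing on the section.
        raise SectionOverloadedError(f"circuit open for section {section}")
    return min(replicas, key=lambda r: r.connections)
```

The point of the design is that a request touching the overloaded section errors out at once, so the worker is freed for the requests that only need healthy sections. One replica below the threshold is enough to keep the circuit closed.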

Thanks to T314020 (LoadMonitor connection weighting reimagined), implementing this is actually quite easy now.

Event Timeline

Ladsgroup triaged this task as Medium priority. Mar 25 2024, 6:29 PM
Ladsgroup moved this task from Triage to In progress on the DBA board.

Change #1014101 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Set up section-wide circuit breaking

https://gerrit.wikimedia.org/r/1014101

Change #1014101 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Set up section-wide circuit breaking

https://gerrit.wikimedia.org/r/1014101

Change #1031021 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable section-wide circuit breaking

https://gerrit.wikimedia.org/r/1031021

Pulling it onto mwdebug and reducing the threshold to four connections led to this:

grafik.png (278×1 px, 16 KB)

Next is to build a more user-friendly error page.

Change #1031021 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable section-wide circuit breaking

https://gerrit.wikimedia.org/r/1031021

Mentioned in SAL (#wikimedia-operations) [2024-05-14T12:03:18Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-14T12:06:00Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-14T12:24:31Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] (duration: 21m 12s)

Change #1031437 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Force 503 status in case of circuit breaking turned on

https://gerrit.wikimedia.org/r/1031437

Change #1031437 abandoned by Ladsgroup:

[mediawiki/core@master] rdbms: Force 503 status in case of circuit breaking turned on

Reason:

Needs another approach

https://gerrit.wikimedia.org/r/1031437