Wikimedia Production/Service Catalog/Ownership Roles and Responsibilities

What problem are we trying to solve?

Some software is primarily developed by Wikimedia Foundation (WMF) staff, and the WMF is responsible for continuing to take care of that software.

Other software is primarily developed by the volunteer community, but runs in WMF production systems. In that case, the WMF doesn’t handle feature development, but takes a contingency responsibility for time-sensitive work (like handling production emergencies, security vulnerabilities, and codebase-wide initiatives) in the event volunteers aren't able to complete it.

The WMF is already committed to doing this work, but we need to make sure we know which staff members will do it. This page describes the responsibilities of WMF teams, in furtherance of that goal. As a matter of scope, it doesn’t describe the work that volunteers already do and will continue to do. “Ownership” here doesn’t mean that the Wikimedia Foundation plans to take away control from current developers; it means making sure we know which Foundation team is responsible in situations such as the ones described above.

Service Ownership: The big ideas

Each service[1] has multiple owners[2] fulfilling different roles. Software engineers, product managers, SREs and others all own the software, together, as a partnership with different responsibilities.
Owners are WMF[3] teams, not individuals (for continuity, as people join and depart) and not volunteers (for organizational accountability).
Ownership is a long-term commitment. People keep using a service through staffing changes, reprioritization, and reorgs, so ownership continues too. When necessary, it gets handed off from one team to another, without being dropped.
Ownership is established deliberately, not accidentally. “You touched it last” doesn’t create ownership. Even when a service is unowned, teams can fix a bug, review a patch, or do other necessary work without inadvertently becoming de facto owners. Ownership is an intentional decision.
For services not developed by the Wikimedia Foundation, ownership is not about feature development, but a responsibility to keep a production service functional.
Wikimedia’s ownership model is unique, because what we do is unique. We run a top-10 website using software we write in the open, relying on volunteer developers—and we operate third-party open-source software, which we don’t develop at all—but we can still have clear responsibilities within the organization. We continue to operate software, with clear and manageable maintenance responsibilities, even when it’s complete and no longer needs to be actively developed.

This policy doesn’t establish complete job descriptions for these teams, but rather outlines their ownership responsibilities—focusing especially on the interfaces between owners, so that teams can work together in a coordinated way.

It’s most important in two situations: the first is at key points in the service lifecycle, like when we place a service in “maintenance mode” and we need a shared understanding of what that means—what work will continue and what won’t. The second is in incidents of various forms: production incidents, deployment incidents, security incidents, and so on. Each constitutes an unplanned demand for work that somebody needs to do.

When the service has a Service Level Objective, it’s established as an agreement between all the service’s owners. A production incident is defined as a problem that causes the service to violate its SLO—or a problem that would eventually do so, if left uncorrected. (When no SLO exists, production incidents are defined more subjectively in terms of “severity of user impact,” which can lead to misaligned expectations between owners.)
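
As a rough illustration of that definition, the sketch below treats a problem as a production incident if the SLO is already violated, or if the current error rate would exhaust the error budget before the reporting window ends. The availability-style SLO, the linear projection, and every name here (`Slo`, `is_production_incident`) are assumptions made for illustration; the policy itself doesn't prescribe a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO over a fixed reporting window."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    window_requests: int   # expected number of requests in the window

    @property
    def error_budget(self) -> float:
        # Number of failed requests the SLO tolerates over the whole window.
        return (1.0 - self.target) * self.window_requests


def is_production_incident(slo: Slo, errors_so_far: int,
                           requests_so_far: int,
                           current_error_rate: float) -> bool:
    """Return True if the SLO is violated now, or would be if left uncorrected."""
    if errors_so_far > slo.error_budget:
        return True  # the SLO is already violated
    remaining = max(slo.window_requests - requests_so_far, 0)
    # Assumption: the current error rate persists for the rest of the window.
    projected_errors = errors_so_far + current_error_rate * remaining
    return projected_errors > slo.error_budget
```

Agreeing on a rule of this kind ahead of time is what lets all the owners share one definition of "incident" instead of arguing about severity while the problem is ongoing.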

The service lifecycle

A service can be in a number of different states as it goes through its development lifecycle. Software developed at the WMF is expected to go through a process like this:

For each of the states a service can be in, there are different ownership roles. Not every role exists for each state:

State	Development Owner	Contingency Development Owner	Configuration & Deployment Owner	Incident Response Owner	Product Management Owner
Active WMF Development	Writes and reviews the code.	n/a	Writes and reviews changes to the service’s config in production. Deploys new versions of the service.	Receives and responds to alerts when a problem arises in production.	Manages the service’s user priorities.
Active Community Development	n/a[4]	Triages bug reports. Writes and reviews code to fix bugs, but only when urgent.	(as above)	(as above)	Manages the service’s user priorities.
Maintenance	n/a	(as above)	(as above)	(as above)	(Fulfilled by EM)
Third-party	n/a	(as above)	(as above)	(as above)	n/a
Unowned	Potentially none. This state represents an active risk; if anything goes wrong, there may not be anyone to fix it!
Decommissioned	n/a	n/a	n/a	n/a	n/a
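
The table can also be read as a lookup from state to the roles that need an owning team. The following is a minimal sketch of that reading, not an official schema: the enum names, the `REQUIRED_ROLES` mapping, and the `missing_roles` helper are all hypothetical, and the Third-party row encodes only the two roles the text describes as always present.

```python
from enum import Enum


class State(Enum):
    ACTIVE_WMF = "Active WMF Development"
    ACTIVE_COMMUNITY = "Active Community Development"
    MAINTENANCE = "Maintenance"
    THIRD_PARTY = "Third-party"
    UNOWNED = "Unowned"
    DECOMMISSIONED = "Decommissioned"


class Role(Enum):
    DEVELOPMENT = "Development Owner"
    CONTINGENCY_DEVELOPMENT = "Contingency Development Owner"
    CONFIG_DEPLOY = "Configuration & Deployment Owner"
    INCIDENT_RESPONSE = "Incident Response Owner"
    PRODUCT_MANAGEMENT = "Product Management Owner"


# Roles that must be filled in each state (hypothetical encoding of the table).
REQUIRED_ROLES: dict[State, frozenset[Role]] = {
    State.ACTIVE_WMF: frozenset({
        Role.DEVELOPMENT, Role.CONFIG_DEPLOY,
        Role.INCIDENT_RESPONSE, Role.PRODUCT_MANAGEMENT}),
    State.ACTIVE_COMMUNITY: frozenset({
        Role.CONTINGENCY_DEVELOPMENT, Role.CONFIG_DEPLOY,
        Role.INCIDENT_RESPONSE, Role.PRODUCT_MANAGEMENT}),
    # In maintenance, the PM role is fulfilled by the engineering manager.
    State.MAINTENANCE: frozenset({
        Role.CONTINGENCY_DEVELOPMENT, Role.CONFIG_DEPLOY,
        Role.INCIDENT_RESPONSE, Role.PRODUCT_MANAGEMENT}),
    # Assumption: contingency development is only occasional here and there is
    # typically no PM owner, so this sketch leaves both optional.
    State.THIRD_PARTY: frozenset({Role.CONFIG_DEPLOY, Role.INCIDENT_RESPONSE}),
    State.UNOWNED: frozenset(),
    State.DECOMMISSIONED: frozenset(),
}


def missing_roles(state: State, owners: dict[Role, str]) -> set[Role]:
    """Return the required roles that no team has signed up for."""
    return {role for role in REQUIRED_ROLES[state] if not owners.get(role)}
```

For example, `missing_roles(State.MAINTENANCE, {Role.CONTINGENCY_DEVELOPMENT: "Team A"})` would report the three roles still lacking a team, which is the partially owned situation this policy treats as a risk.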

Next, we’ll discuss each of the states in more detail, including each of the corresponding types of ownership. At the end, we’ll describe how a service makes an orderly transition from one state to another.

Ownership responsibilities

Active WMF Development

The most straightforward situation is a service running in production while actively being developed by a team at the Wikimedia Foundation. (Examples include Parsoid and Search.) In this case, all the ownership roles are filled, and they generally match the intuitive idea of what each of these teams does.

We’ll use it as a starting point to introduce each of the roles and define their responsibilities.

Development Owner: Responsible for software development, bug triage, and code review. Bug triage includes prioritization, and by extension working on the high-priority items, but it doesn’t mean every single task is worked on.
Configuration & Deployment Owner: Responsible for rolling out configuration changes and deploying new versions of the software. For all parts of MediaWiki, this is owned collectively by Release Engineering (using backport windows for config changes, and the train for deployment). For software that doesn’t ride the MediaWiki train, this is typically the same team as the dev owner.
Incident Response Owner: Responsible for rapidly handling service problems with major user impact, including by receiving automated alerts. They may not solve the problem solo, but they take responsibility for making sure it gets solved. When the service has an SLO, this often takes the form of detecting SLO violations and escalating to other service owners for shared effort in resolution.
Product Management Owner: Responsible for prioritizing the service’s feature work. In particular, the PM owner is responsible for re-prioritizing work, for example when incidents take up a lot of unexpected time, or when an SLO violation means corrective work is prioritized in the next quarter. (This is why the PM owner also has to be involved in setting the SLO initially.) The PM owner is also responsible for leaving developers enough time to do routine maintenance work.

Next, we’ll look at the other possible states for a service, and define their ownership roles and responsibilities by contrasting them with the baseline we’ve described here.

Maintenance

A service is in maintenance mode when it isn’t being actively worked on. Examples include Maps and OOUI.

Many software companies don’t do this; every product is staffed with a full development team, and that team has a full docket of projects until it’s no longer worthwhile to continue, at which point the product is shut down. At the WMF, we can continue to support finished software, which means defining ongoing ownership roles for it—with substantially less ongoing commitment than for actively developed software.

Contingency Development Owner: Responsible for software development work on “Unbreak Now!” tasks. These are generally critical security vulnerabilities, feature breakages, accessibility breakages, and so on—but only at the UBN level. Responsible also for any work required on the service for platform-wide code initiatives (such as PHP version upgrades, compliance with new linters, and so on). There’s no expectation that any feature work, or any noncritical bug fixes, are done at all. The team does conduct bug triage (many incoming tasks will be triaged and not otherwise worked on) and may or may not review volunteer-contributed patches for noncritical work.
Product Management Owner: For maintenance services, this is fulfilled by the engineering manager. There’s no feature work to prioritize, so this role is limited to decisions about what work is in scope.

The Incident Response Owner and Configuration & Deployment Owner roles are unchanged: the service still needs to be operated in production, even if code changes are rare. A clear escalation path between teams, for unusual but urgent situations, is critical.

Contingency Development ownership is a much lighter-weight commitment than actively developing a service. A team may have a portfolio of many contingency-development services with little associated workload—they just have the expertise to work on those services someday, if it becomes necessary.

In some special cases, when it’s organizationally necessary, instead of a specific team taking on a Contingency Development Owner role, an engineering director might decide to leave that role unoccupied. That director is placing themselves in the escalation path and making the commitment that, in the event critical work becomes necessary, they’ll identify a suitable team to do it, deprioritizing other tasks if necessary. (An example of an appropriate service for this unusual setup is Charts: the code was written not by an established development team which could go on to own it in maintenance mode, but by a task force formed to write the code and then disband.)

Active Community Development

MediaWiki is free and open-source software, and much of its development work is done by volunteers. Some software components have no active development team of WMF staff. Examples include AbuseFilter and Math.

We still need reliable staff resources for some situations. Our volunteers do important, impactful work, but volunteers by nature don’t have OKRs or specific expectations of availability or continuity. In order to plan for contingencies, we need a staff team committed ahead of time to be able to deal with emergencies.

Contingency Development Owner: Responsible for critical software development work, as with maintenance services, but only for situations where the volunteer developer community can’t respond with the necessary turnaround time. The team may or may not review volunteer-contributed patches for noncritical work.
Product Management Owner: Responsible for prioritizing the service’s feature work. In particular, even community-developed features are subject to discussion and review (collaboratively, but with the PM owner responsible for a final decision) on whether they take the product in the right direction, in order to understand the long-term impact of new features on the Wikimedia ecosystem, including the maintenance commitments.

The Incident Response Owner and Configuration & Deployment Owner roles are still unchanged. The escalation path can include active volunteers with knowledge of the service, but must also include staff teams.

Distinguishing between Maintenance and Active Community Development can be a fine line; the WMF staff responsibilities are similar, except that a Maintenance service undergoes much smaller, much more infrequent changes, and so requires less ongoing staff involvement.

Third-party

We also run open-source software not developed within the Foundation. An example is the caching stack at the seven CDN sites around the world that act as the first point of contact for each HTTP request: it is composed of HAProxy, Varnish, and Apache Traffic Server, three services that we operate (and configure extensively) but don’t ourselves develop.

These are typically owned by a single team that fulfills all the ownership roles. (This is often, but not always, the SRE team.)

Most of the responsibilities are in the normal Incident Response Owner and Configuration & Deployment Owner roles. Occasionally, there’s also some Contingency Development responsibility; if the code needs to be modified for some WMF-specific functionality, this team does that work, and either contributes the patch upstream or maintains it on a local fork as appropriate. Typically there’s no Product Management Owner, as product decisions are made by the upstream open-source project.

Unowned

Unowned services shouldn’t exist in production; some do exist, as a legacy state. Each of them constitutes an ongoing risk: if something goes wrong (technically, organizationally, strategically) it might be no team’s job in particular to fix it. We’ve traditionally addressed those situations with individual heroics, which isn’t sustainable in the long term.

Partially owned services shouldn't exist either. Ownership is a partnership between teams, and one partner can't hold up their end alone. For example, an SRE team can't perform all the responsibilities of an Incident Response owner if there's no Development Owner to fix critical software bugs. Likewise, if there’s a Development Owner but no Incident Response Owner, the risk is that no one responds to production alerts at all. (The same engineering team could sign up for both roles; the risky situation is when one role is unfilled.)

Here too, some team or individual might step in to perform heroics, but—as in other situations—this doesn't create ongoing ownership. Partial-ownership situations are documented where they exist, but for planning purposes it’s prudent to consider these services unowned until each of the responsibilities is taken on by some team, even if that team wears multiple hats.

We document these unowned and partially owned services in the service catalog because identifying these risks is an important step on the way to mitigating them. It doesn’t imply that adding new unowned services is allowed (either by launching them unowned, or by abandoning ownership).

Most importantly, in the state transitions below, we discuss how to cleanly move services out of this unowned state.

Decommissioned

A decommissioned service is one which is no longer operating; as a corollary, it has no owners and no ownership responsibilities for anyone. A service which is “running but not supported” is not decommissioned, and neither is a service whose announced deprecation date has passed: those services should still have owners. A service is only decommissioned after it’s been undeployed.

Decommissioned services are listed in the service catalog as a matter of bookkeeping and historical record: decommissioning the service is a way of documenting that all the ownership responsibilities have been completed.

State transitions

Whenever a service moves from one state to the next, the ownership roles and responsibilities change.

Each of the transitions is a little different, and they’re described below. They have two elements in common. First, each is initiated as a strategic decision, in the course of the annual planning process or other strategic planning: a decision that changing the service’s state is the right direction for users, or for the product, or for the organization. Second, each is carried out as a consensus decision among all the service’s owners.

Launches (none → Active)

Most new services enter the Active WMF Development state when they launch.

In order to launch, all the ownership roles must be fulfilled by teams who are aware of their responsibilities and agree that the service is ready to go. For example, if a service were deployed to production without an Incident Response Owner, then either there would be no monitoring alerts to tell us when something went wrong, or those alerts would page a team who hadn’t agreed to take on responsibility for that service. Each of those is an unacceptably risky state for a new service.

When a service is volunteer-built from the beginning, it can launch in the Active Community Development state. It still requires all the ownership roles to be filled, but the only development responsibility is attached to a Contingency Development Owner, who only needs to take on urgent development tasks, and only when volunteers are unavailable to complete them in a timely manner.

Active → Maintenance

A service enters maintenance mode when we determine that it isn’t strategically important enough to keep working on, but is important enough to keep operating. Often this can be a good thing: all the important feature work is done, and the service can graduate to maintenance mode as the team moves on to the next big project.

The Development Owner becomes a Contingency Development Owner (or, occasionally, simultaneously transfers that role to a different team). This involves triaging the open tasks: every task is either something that must be done even in maintenance mode (UBN! or sometimes High priority) or something that the team intends not to do (Medium or Low priority).
The Product Management Owner, optionally after participating in that triage process, hands off responsibility to the engineering manager of the contingency development team. The EM is now responsible for deciding whether incoming maintenance tasks are in or out of scope for the team.
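
The triage step above can be sketched as a simple partition of the open tasks. This is only an illustration: the task IDs, the `Priority` enum, and the `high_to_keep` parameter (the High-priority tasks the team explicitly decides to keep) are hypothetical, and treating every other High task as declined is an assumption about how "sometimes High priority" plays out.

```python
from enum import Enum


class Priority(Enum):
    UNBREAK_NOW = "Unbreak Now!"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


def triage_for_maintenance(tasks: dict[str, Priority],
                           high_to_keep: set[str]) -> tuple[list[str], list[str]]:
    """Split open tasks into (must do in maintenance, will not do)."""
    must_do: list[str] = []
    will_not_do: list[str] = []
    for task_id, priority in tasks.items():
        if priority is Priority.UNBREAK_NOW:
            must_do.append(task_id)       # UBN! work always stays in scope
        elif priority is Priority.HIGH and task_id in high_to_keep:
            must_do.append(task_id)       # High tasks the team chose to keep
        else:
            will_not_do.append(task_id)   # everything else is declined
    return must_do, will_not_do
```

The point of the exercise is that no task is left in an ambiguous "maybe someday" state when the service enters maintenance mode.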

The Incident Response and Configuration & Deployment owners don’t change, but they’ll need to have an up-to-date escalation path to the contingency development owners.

Maintenance → Active

A service can return from maintenance mode to active development if we decide to do long-term work on it again. This isn’t the same as when critical work needs to be done in maintenance mode; that’s still maintenance. Rather, a service returns to active development if, for the foreseeable future, the team will dedicate time to feature work and improvements, entertain feature requests and code reviews, and so on. The process is basically the Active → Maintenance handoff described above, in reverse.

Maintenance → Decommissioned

We decommission a service when we decide not to operate it at all anymore. Sometimes this is purely a product-driven decision: we decided the service had a negative impact on the site overall. Other times it’s because the service is an undue maintenance burden: the effort required to keep the software in working order isn’t justified by the benefit to users.

After this decision has been made, the service is still in maintenance mode, with all the attendant commitments. Shutting down the software responsibly may involve a sunsetting period (announcing a deprecation to users, determining and carrying out a migration plan, etc.) but during this time the owners are still responsible for any critical work that arises. (For example, even when the service is deprecated, if critical security vulnerabilities are found they may still call for immediate action.)

The service transitions to the Decommissioned state only when it’s no longer running: the software is undeployed from all WMF hosts, and user traffic can’t access it via any code path. (The code doesn’t need to be deleted from the repository: a service can be Decommissioned when the WMF isn’t operating it anymore, even if the code is still running in other MediaWiki installations.)

A service might go from Active directly to Decommissioned without passing through the Maintenance state in between, but it’s unlikely: it usually takes time to turn the service off, and during that period, the development team is unlikely to take on new feature work—Maintenance is probably the most accurate description of what the expectations are. The only exception is when the service was initially designed to be a quick experiment, easy to turn on and easy to turn off. In that case all the stakeholders knew all along that it couldn’t be relied on to exist long-term, so it can be shut down immediately with minimal fuss.

Unowned → Maintenance

At the time of this writing, many of the services in production are de facto Unowned, representing a risk (low probability but high impact) that critical work will become necessary and no one in particular is signed up to do it. Transitioning to the Maintenance state is the most common way we’ll resolve that risk: the Contingency Development Owner signs up to take on any future critical items, without generally needing to do any work on the service right away. The Configuration & Deployment and Incident Response Owners, which may already have existed in a de facto way, can make those responsibilities explicit at the same time.

Ownership is always a joint responsibility, so in order to transition to maintenance mode, the service needs all three owners: Contingency Development, Configuration & Deployment, and Incident Response. For example, in the event of a production incident, responders may need a development team to escalate to.

Transitioning from Unowned to Active is possible, but rare: it might happen if some team is picking up responsibility for a previously-dormant product area and dusting off the associated code.

Transitioning from Unowned straight to Decommissioned is likewise rare: it might happen if a service has so few users and so little traffic that it can be decommissioned without needing a deprecation period.

Note that there is no transition from Maintenance back to Unowned. Once ownership is picked up, it’s an ongoing responsibility for the life of the service. The only state transitions from Maintenance are to Active Development, or to Decommissioned.
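
The transitions described in this section can be summarized as a small state machine. The sketch below is one possible encoding, under stated assumptions: "Active" stands for whichever active state the service is in, Third-party services are left out because this section doesn't describe their transitions, and Decommissioned is treated as terminal. The names `State`, `ALLOWED`, and `can_transition` are hypothetical.

```python
from enum import Enum


class State(Enum):
    NOT_LAUNCHED = "none"
    ACTIVE = "Active"
    MAINTENANCE = "Maintenance"
    UNOWNED = "Unowned"
    DECOMMISSIONED = "Decommissioned"


# Allowed next states for each current state, as described in this section.
ALLOWED: dict[State, frozenset[State]] = {
    State.NOT_LAUNCHED: frozenset({State.ACTIVE}),
    # Active -> Decommissioned is possible but unlikely (quick experiments).
    State.ACTIVE: frozenset({State.MAINTENANCE, State.DECOMMISSIONED}),
    # There is deliberately no path from Maintenance back to Unowned.
    State.MAINTENANCE: frozenset({State.ACTIVE, State.DECOMMISSIONED}),
    # Unowned -> Maintenance is the common case; the other two are rare.
    State.UNOWNED: frozenset(
        {State.MAINTENANCE, State.ACTIVE, State.DECOMMISSIONED}),
    # Assumption: a decommissioned service does not change state again.
    State.DECOMMISSIONED: frozenset(),
}


def can_transition(current: State, new: State) -> bool:
    """Return True if the policy describes a transition from current to new."""
    return new in ALLOWED[current]
```

Note that no state lists Unowned as a destination: once a service has owners, the only way to end the responsibility is to decommission the service.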

Transfers

A team can hand off their ownership of a service to another team. This is a normal thing to do over the course of a team’s lifetime, and it may be associated with strategic planning, rebranding, or other scope changes.

The service might be in Active Development, Maintenance, or any other state, and that state doesn’t change with the transfer: the new owner just takes the place of the old one. This is simple, as long as it’s done with appropriate care: the outgoing owner should make sure the incoming owner knows everything they need to know about the service, both its documented and undocumented elements. Obviously, both teams need to be fully aware of the handover: ownership can’t be dropped onto a new team without their involvement.

It’s also important to update the service catalog and inform all the service’s other owners in order to maintain a clear escalation path.

Reorgs

Similarly, if a team is reorganized, they need to consider what to do with all the services they own.

If the team will continue to exist in some capacity, they might decide to keep responsibility for those services, in which case there’s nothing further to do (except update the service catalog to reflect the team’s new name, if applicable).

Any service the team won’t continue to own (because it no longer makes sense in their portfolio, or even because the team is being disbanded completely) must be transferred to another team as a part of the reorg process, and this needs to be taken into account when planning the reorg—both the need to identify a new owner, and the need for time to execute a successful handoff.

Even during and after major reorgs, where large parts of the organization are undergoing structural change, we keep the wikis working continuously for users, which means that no services go from owned to unowned in the process.

Footnotes

[1] Throughout, we use the word service to describe the unit of ownership. This is a stand-in for a broad category: it might be a microservice, but it might also be a MediaWiki component, a feature, a module, or any other well-defined chunk of software that a team can be responsible for.

[2] Throughout, we use the word owner to describe a specific set of responsibilities. It doesn’t mean sole control of the software, the project, or the decision-making.

[3] Or employee teams at a Wikimedia affiliate meeting certain criteria, such as WMDE. (Those criteria include reliable funding, a track record of responsiveness, and the necessary access in production.)

[4] Remember the big ideas: this page explains the responsibilities of WMF teams, so these responsibilities don’t capture volunteer contributions, even when they’re very important. We discuss that in more detail in the Active Community Development section.
