The Problems with King's "Diversity Space Method"
A bizarre data-free approach to data analysis

The Diversity Space Method

Yesterday, an Activision-Blizzard blog post went viral on Twitter for all the wrong reasons. It was intended to show off a tool King had used to help make ABK games more diverse, largely on their own time and with the help of MIT. Innocent enough, right? Well… let’s look at what went wrong.

How it got made

I did a bit of digging into this, and I think I’ve more or less worked out the history of this thing. According to the paper on the tool, a group of employees (at an unnamed company, but I assume this is King) made a paper model of this system and then, for unknown reasons, reached out to the MIT Game Lab for help digitizing it. A group of 3 undergraduates worked on it together, but their paper is very strange: it seems to be entirely about the making of the web app and the app’s UX, and not about the core methodology of this approach. I tried to track down the secondary paper on the subject that is in the references, written by W. Wu, and couldn’t find it on the internet. Curious. I looked into him, and it seems he’s a digital media major… probably not doing any major data analysis.

So then who came up with this system? I went and watched the GDC talk on it by two King employees. Again, they just stated the system as if it sprang into existence by magic, although with a bit more detail on how they intend it to work. So far as I can tell, all of the numbers are just magic numbers, invented via some vague feeling of how “far away” a group is from the “norm”. A lot of the writeups about it past that have to do with making the web tool, which seems to be mainly for easy visualization, as you can see in the highlighted quote from the paper below.

We wanted to create an experience that engages the user so that they stay with the design choices they are exploring long enough for deep reflection to occur so that this in turn can lead to insights that the designer would not have reached otherwise

This is our first major problem. To get into it, first we have to take a look at star plots, the visualization of choice for this particular tool.

Star Plots

A picture of a star plot with stats for football player Ronaldo.

Usually used to summarize scores, attributes, or other simple, additive, multivariate data, star plots (also known as radar plots, spider plots, etc.) have a couple of key features intrinsic to them that bear mentioning here.

  • The area created by the plot is intended to be meaningful, and in most cases, positive. The most frequent usage of these charts is to illustrate strengths and weaknesses, and this is how our brain tends to want to interpret them. In other words, unless explicitly stated otherwise, smaller areas bad, larger areas good.

  • They rely on linear, continuous relationships between variables. Moving the point up and down one of the axes has to be meaningful; you can’t just arbitrarily set the end of the axis to be 2 and the middle to be 7 and expect it to be a sensible or good visualization.

These two things are a HUGE problem for this DSM app.
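
To make the area point concrete, here is a minimal Python sketch of how much space a star plot polygon covers. The function name `star_plot_area` and the sample scores are my own illustrations, not anything taken from the DSM tool, and the sketch assumes equally spaced axes that all share one scale.

```python
import math


def star_plot_area(values):
    """Area of the polygon a star plot draws for `values`.

    Assumes the axes are equally spaced around the circle and share one
    scale. Each pair of neighbouring spokes forms a triangle with area
    0.5 * r1 * r2 * sin(angle between the spokes).
    """
    n = len(values)
    if n < 3:
        raise ValueError("a star plot needs at least 3 axes")
    wedge = 2 * math.pi / n  # angle between neighbouring axes
    return 0.5 * math.sin(wedge) * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )


# Hypothetical scores on four axes, all on the same scale.
low_scores = [2, 2, 2, 2]
high_scores = [4, 4, 4, 4]

print(star_plot_area(low_scores))
print(star_plot_area(high_scores))
```

Doubling every score roughly quadruples the shaded area, so the eye reads the second profile as much "more" of whatever the axes measure. That is harmless when the axes are football stats, and a problem when they are not.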

The DSM Star Plot and its Axes

A picture of a star plot with DSM stats, including age, race, ability, body type, beauty, gender, sexuality, socioeconomic background, and cognitive ability.

(Note: I’ve been asked by some people what the plot above is trying to show. I’ve looked at it, and I believe it is showing the maximum value in each group per axis, while the table lists the average. I.e., if there is any person of race 8 in a group, but the average is actually only 1.3 (made up numbers), then the table will say 1.3 and the plot will have that color as 8 on the race axis. The plot is not a single character: a different character could have, say, sexual orientation 5, be the max of the group on that axis, and be on the plot as well. And yep, the colour coding is extremely awful. I won’t bog this post down with getting into why people might want to see this visualization.)
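
My reading of the max-versus-average behaviour can be sketched in a few lines of Python. This is a guess at what the tool appears to do, not its actual code, and the group, axis names, and scores are all made up for illustration.

```python
from statistics import mean

# Made-up scores for a hypothetical group of six characters.
group = [
    {"race": 8, "sexuality": 0},
    {"race": 0, "sexuality": 5},
    {"race": 0, "sexuality": 0},
    {"race": 0, "sexuality": 0},
    {"race": 0, "sexuality": 0},
    {"race": 0, "sexuality": 0},
]
axes = ["race", "sexuality"]

# What I believe the plot draws: the highest score anyone in the group has.
plotted = {axis: max(c[axis] for c in group) for axis in axes}

# What I believe the table lists: the group average, rounded for display.
table = {axis: round(mean([c[axis] for c in group]), 1) for axis in axes}

print(plotted)
print(table)
```

Note that the two plotted maxima come from two different characters, which is why the plotted shape does not describe any one person in the group.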

First, we have to understand what DSM is intending to show, namely a quantified difference between the “standard video game character” (which is not well defined in the paper, but in the GDC talk they get into it a bit more) and the character we’re looking at. We are meant to assume the “norm” means a young adult, able-bodied, cisgender, heterosexual white man from the middle class with a somewhat fit body type. Sure, I’ll just accept that one without justification for the sake of simplicity.

A screenshot of the GDC presentation on the slide describing the norm, as proof of the description given above.

But here’s the first wrinkle: how do we get that difference? A lot of these axes as labeled are horribly complex and usually qualitative. Many of them are nonlinear and not even 1-dimensional. For example, once you get into gender you’re in at least a 2d space (to include nonbinary, agender, etc.), and sexuality gets even more complicated (sexual attraction, romantic attraction, intensity of sexual attraction to account for demisexual and asexual… that’s at least 3d, and now people are in a volume, not on a point. Oops.) And then the issue becomes: how do you even establish a measure of distance? Maybe with careful modelling, you could almost manage this in 1d space for age, and 2d/3d space for gender/sexuality (if I’m not missing something myself), but there’s an elephant in the room here. They tried to do this with culture, ethnicity, ability, etc. as well. How do you order ethnicity? How do you quantify it? More importantly, how do you quantify it without being horrifically racist and introducing your own biases? Definitely not a problem that I feel qualified to solve, at least.

So what did they do? Well… as far as I can tell, they assigned random numbers that felt right to them.

A picture of the stats used to generate Ana's star graph.

Here’s a look at the values for OW hero “Ana”. From this and other screenshots, cisgender woman seems to equate to 5 in this system. “Arab” is somehow 7, as is “Egyptian” (Egypt is not a monocultural country, but let’s set that aside for a moment). What does that even MEAN? What does it mean to have a distance of 7 between Arab and White? What does it mean for a cis woman to be a 5 in gender? Well… nothing at all, really. From the GDC video, they casually assign these numbers based on their “feeling” of how “uncommon” they are in games. Whether or not this is based on statistical analysis is unclear, but I would hope they would mention it if it were. A Black character is casually dismissed as pretty high, and Ana is apparently a 7… not super meaningful.
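
A minimal Python sketch of why hand-assigned numbers carry no meaning: the categories and both codings below are placeholders I invented (nothing here comes from the DSM tool), and the only assumption is that "distance" is the absolute difference from whatever number the norm was given.

```python
# Two equally arbitrary codings for the same placeholder categories.
# Neither is "right": there is no data behind either of them.
coding_a = {"norm": 0, "group_x": 3, "group_y": 7}
coding_b = {"norm": 0, "group_x": 7, "group_y": 3}


def distance_from_norm(coding, category):
    """Distance as the absolute difference from the norm's number."""
    return abs(coding[category] - coding["norm"])


def most_distant(coding):
    """Category that this coding declares furthest from the norm."""
    others = [c for c in coding if c != "norm"]
    return max(others, key=lambda c: distance_from_norm(coding, c))


print(most_distant(coding_a))
print(most_distant(coding_b))
```

Swapping two gut-feeling numbers flips which group counts as furthest from the norm, so any conclusion drawn from the scale is really a conclusion about whoever filled in the table.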

Here’s where it gets really hairy. Remember our star plot traits? We expect bigger values to be meaningfully better. We expect bigger areas to be better. We expect the scale to be reasonable and continuous. This is where the backlash to DSM began in a big way. Suddenly, people asked (reasonably) what gender is the “most gender”, which race is “most diverse”, etc. The very structure of the data presentation implies this was a decision made, even if it never was. People immediately assumed that the goal was to maximise the star chart, if not on a per character basis, at least for an entire game. Indeed, the presenters seem to confirm this in the GDC talk, if halfheartedly.

This ends up coming across as extremely dehumanizing, crude, and cold, not to mention nonsensical. The MIT paper itself mentions that the star plot format led to misinterpretation, confusion, and undesirable min-maxing instead of the starting point for discussion it was intended to be, as seen below (the paper is not clear as to why they didn’t scrap the star plot upon realizing this).

Several long paragraphs on the issues the team had with star plots, please refer to the paper link at the bottom as it exceeds my alt text limit.

This was a major mistake, and honestly I don’t think it can be salvaged. Quantifying (and ordering on a continuum) race, ethnicity, culture, etc. is not a good road to be walking down, especially as an inclusion advocate. But is that the only problem? Well…

Tokenization and Stereotypes

The original ABK blog post claims that this method fights tokenization, but I would say that the very attempt to categorize characters by how “different” they are from “normal” is tokenizing, othering, and discriminatory. The talk proclaims that we need to “fight” the norm, but encoding that norm into the very way we think about diversity and representation is counterproductive. Pretending that having many characters that are “different enough” from their definition of normal just for the sake of having them is healthy or respectful representation is simply wrong. It is fundamentally insulting to reduce entire complex human beings down to these attributes for the sake of accumulating diversity in your work, and the core concept seems to miss that.

Additionally, the blog post claims this can be used to avoid stereotypes, but both the paper and the GDC presentation explicitly state that you need an experienced and thoughtful person to do a detailed review of the characters to avoid this, as this method does not catch it (or tokenization) at all. I’m not sure who decided to misrepresent it in this way, but it certainly did the project as a whole no favours.

The original proposed tool seemed to want to be a discussion starter, an audit tool of sorts. I myself don’t see any practical value in being able to measure my characters’ distance from an imaginary “typical” game character, but maybe there is something there for showing someone how homogeneous a group of characters is, potentially. I think this could be accomplished in a better way (scatter plot with discrete axes, for example), and probably using much simpler tooling than this system, but certainly what it’s being pitched as in the blog post is a far cry from that. At the end of the day, you need to hire for diverse perspectives and experiences, and you need to seek and listen to peer feedback.
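
As a rough illustration of the "much simpler tooling" idea, here is a Python sketch that tallies unordered category labels instead of scoring them. It is a plain count rather than the scatter plot mentioned above, and the cast, attribute names, and helper functions are all hypothetical.

```python
from collections import Counter

# A made-up cast described with plain labels, no numbers and no ordering.
cast = [
    {"gender": "man", "age": "young adult"},
    {"gender": "man", "age": "young adult"},
    {"gender": "man", "age": "young adult"},
    {"gender": "woman", "age": "young adult"},
]


def attribute_counts(characters, attribute):
    """How many characters carry each label for one attribute."""
    return Counter(c[attribute] for c in characters)


def share_of_most_common(characters, attribute):
    """Fraction of the cast sharing the single most common label."""
    counts = attribute_counts(characters, attribute)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(characters)


for attribute in ("gender", "age"):
    print(attribute, dict(attribute_counts(cast, attribute)),
          share_of_most_common(cast, attribute))
```

A tally like this can show that a cast is homogeneous without ever claiming that one label is numerically "further" from another. It still says nothing about stereotypes or tokenization, which need a thoughtful human review.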

And you need to not support bad data science.

Other Notes

As far as I can tell, King made this system and used OW characters as an example in their presentation, which were then reused in the blog post. I don’t personally believe that this system is used in design processes in Overwatch or elsewhere in Blizzard.

I also think this tool was made with good intentions. I just don’t think it has value as it is, and I think it’s on shaky fundamentals. I personally wouldn’t use it myself, and I think it invited its own backlash with how it presented itself visually. I hope the original creators are doing ok and don’t catch too much hate for this, and I hope ABK does not take this opportunity to starve research efforts. An interesting (and worrying) note on the treatment of this project can be found in the paper’s conclusions below.

This type of design research typically generates both the type of insights we have shared here and an artifact which carries knowledge contributions which are hard to put into words. Our study is no exception, but beyond this, we have also come away with a richer understanding of the particular challenges of collaborating with a commercial industry partner on a grassroots project informed by critical media studies. Our team members at the development studio put themselves in a vulnerable position by pushing for funding for the project, and as women in the games industry they already face systemic hegemonic pressures. Through the course of the project, it became increasingly clear to us that the usual dynamic of us as the academic partner pushing for more experimentation, more exploration of critical angles, and an overall more explorative approach was putting our embedded allies at risk and had to be tempered.

The ABK blog post has been heavily edited in response to backlash, so here is a Wayback Machine link: https://web.archive.org/web/20220512185745/https://www.activisionblizzard.com/newsroom/2022/05/king-diversity-space-tool

And here is the GDC talk https://www.youtube.com/watch?v=HmZZAHDqdfE

And here is the paper itself http://www.digra.org/wp-content/uploads/digital-library/DiGRA_2019_paper_378.pdf

Also, when I say continuous above, I don’t necessarily mean it in the sense of the real numbers. I mean it in a colloquial, non-mathematical sense, i.e. that if a scale goes from 0 to 5, 2.5 should be logically between those two values, and that moving smoothly between those two values makes sense.


Last modified on 2022-05-14