The prospect of boiling the entire range of critics’ assessments and reviews of any work of art down to a single number, whether it’s a percentage or a mystery-laden weighted average, is absurd and—more importantly—overly reductive. So I’m gonna play the fool’s game of trying to propose some sort of improvement.
This is a long-belated sequel to an article I had done two years ago, which was about the inherent shortcomings of the Metascore. Near the very end of the piece, I stated:
Honestly, I don’t even think the average is the best way to capture game reception in the first place, even if standard deviation was included. But [looks at article length] that’s a story for another time. Let’s just say my reasoning would start with how the average’s primary intent, even with standard deviation involved, is to highlight the center of data, and how that is inherently contrary to the idea of representing a variety of opinions.
It’s something that I had always hoped to follow up on. Given the free spot in the SixTAY Days of Writing challenge, along with how there seem to be frequent Tomatometer-based mini-controversies and such, this seems as good a time as any to finally take a dive.
Now, before beginning in earnest, I ought to state, for the record, that as a concept, I am not fundamentally opposed to the prospect of aggregating a bunch of reviews together into one place, especially if the entity doing so takes care to post links (or otherwise give detailed info about the source, if not online, e.g. the edition of a magazine or newspaper) to the original review. In fact, given how disparate the world wide web is, having a central hub from which game or movie reviews can be easily found could be rather practical.
If judged purely within these parameters, I would confidently say that both Rotten Tomatoes and Metacritic are damn good at their jobs. They compile and usefully summarize a healthy variety of reviews for numerous movies, video games, TV shows, or music albums, and also make it easy to locate and read the original full reviews. I’m not going to front; I’ve used both in getting the lay of the criticism land, and have found them to be useful.
Where they err is in then going that extra step and trying to boil down the multitudes they contain into a single simple, pleasing-to-the-eyes number. Something far more useless by orders of magnitude...yet also the point of contention that everybody always wants to fixate on and obsess over.
Let’s begin with that first point, in relation to Metacritic’s Metascore: That the entire concept of an average directly runs against the ideal to represent diversity of opinion. The average, in brief, is a single number that shows the “center” of a set of numerical data. No more, no less.
Thus, when you use an average to be the storefront representation of your aggregate of reviews, it saddles that aggregate with a massive assumption. That all of those tens or hundreds of reviews from the tens or hundreds of critics all converge upon a single “consensus.” It doesn’t matter how scattershot the reviews end up actually being; if three critics give something a 100, a 35, and a 90, they’re feeling a 75 altogether, gosh darn it!
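To make that arithmetic concrete, here's a tiny Python sketch using the three hypothetical scores from the paragraph above. It shows the average landing on a number that none of the three critics actually gave.

```python
from statistics import mean

# The three hypothetical critic scores from the example above.
scores = [100, 35, 90]

average = mean(scores)  # (100 + 35 + 90) / 3
print(average)  # 75

# How far is the "consensus" from the closest actual review?
nearest_gap = min(abs(s - average) for s in scores)
print(nearest_gap)  # 15
```

The closest real review sits 15 points away from the supposed consensus, and the harshest one sits 40 points away.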
An average downplays the possibility that the aggregate of reviews may actually represent a diversity of opinions across critics rather than some mythical “central judgment” that everyone is revolving around. The alternative that I’d wish to propose would directly address this shortcoming; my aim is to not just acknowledge a far-reaching set of opinions, but push it to the forefront.
As the quoted passage above alluded to, one of the possibilities that I explored in my past article was the usage of the measure of standard deviation to represent the spread of opinions. I think that would be a slight improvement over using solely an average, but that it is still hamstrung by many of the same shortcomings. It still plays into that assumption that there is a single consensus within the entire range of reviews; the standard deviation, in this context, is merely a measure of how strong or how weak that consensus is.
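Here's a quick sketch of why adding the standard deviation doesn't fully fix things. The scores are invented for illustration: a deliberately polarized set, two pans and two raves. Summarized as mean ± standard deviation, it still implies that reviews cluster around a center, yet nobody actually gave anything close to that center.

```python
from statistics import mean, pstdev

# Hypothetical, deliberately polarized scores: two pans, two raves.
polarized = [30, 30, 80, 80]

center = mean(polarized)    # 55
spread = pstdev(polarized)  # population standard deviation: 25.0

print(f"{center} ± {spread}")  # 55 ± 25.0

# Count the reviews that sit within 10 points of the reported "center".
near_center = [s for s in polarized if abs(s - center) <= 10]
print(len(near_center))  # 0
```

The "55 ± 25" summary describes a consensus that is merely weak, when what actually happened is two camps that disagree outright.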
The question, then, is the following. How could we better represent a range of opinions?
We don’t actually have to look all that far for alternatives that already exist. For example, believe it or not, Rotten Tomatoes. Yes, I’ve given them grief earlier on, but their Tomatometer, stupid single number and all, does in fact make an attempt at illustrating the diversity of opinion.
Here is why: The numerical value at the forefront of the Tomatometer is the percentage of reviews that assessed the work of art in question positively, i.e. deemed it “fresh.” Therefore, it also implicitly indicates the percentage of reviews that were not positive, i.e. deemed the work of art “rotten.”
If, for example, 68% of reviews for a movie were positive, then 100% - 68% = 32% of reviews were negative. 68% fresh, 32% rotten: that is a summary of the aggregate that does make some inroads at representing a spectrum of opinions.
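In code, the Tomatometer arithmetic is just a proportion. Here's a minimal Python sketch, assuming each review has already been reduced to a fresh/rotten verdict. The 25-review list is invented to reproduce the 68% example and is not real data.

```python
# Hypothetical verdicts: True = "fresh", False = "rotten".
# 17 fresh out of 25 reviews gives the 68% from the example.
verdicts = [True] * 17 + [False] * 8

def tomatometer(verdicts: list[bool]) -> float:
    """Percentage of reviews marked fresh."""
    return 100 * sum(verdicts) / len(verdicts)

fresh = tomatometer(verdicts)
rotten = 100 - fresh

print(f"{fresh:.0f}% fresh, {rotten:.0f}% rotten")  # 68% fresh, 32% rotten
```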
In fact, you can represent it as a picture. Rotten Tomatoes does this exact thing when you look at the page for a specific movie! So, for example, here’s what it looks like for Sicario: Day of the Soldado.
There it is. That bar. The red part is the proportion of positive reviews (the 68 percent), while the dark gray part of the bar is the proportion of negative reviews. A succinct visual representation of what the number means!
I am being wholly honest when I say that what Rotten Tomatoes is doing is concretely better than the Metascore. They are still flaunting a single simple number, but instead of an average score meant to be interpreted as the “consensus of all critics,” it marks the dividing line between the degree of positive reception and the degree of negative reception. It illustrates, albeit indirectly, some measure of diversity of opinion.
That’s about where the praise ends, though. Rotten Tomatoes’ big problem, when it comes to summarizing critical reception, is that its idea of “diversity of opinion” is overly simplistic. There is only positive and negative. Fresh and rotten. Put in the clearest terms, “It was good” and “It was not good.”
That...is not even close to the range of distinct sentiments that could characterize a review. A review that is utterly scathing is not the same as a review that is just mostly unimpressed; what’s the sense in giving them equal weight as being “rotten?” By the same token, a review that modestly praises its subject is not the same as a glowing review; what’s the sense in giving them equal weight as being “fresh?”
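To see how much the fresh/rotten split throws away, here's a sketch that treats any score of 60 or above as fresh. That cutoff is purely an assumption for this illustration, not how Rotten Tomatoes makes its calls. Two made-up sets of reviews, one uniformly lukewarm and one split between raves and pans, come out with the exact same Tomatometer.

```python
FRESH_CUTOFF = 60  # hypothetical threshold, assumed for this illustration only

def percent_fresh(scores: list[int]) -> float:
    """Percentage of scores at or above the assumed fresh cutoff."""
    fresh = sum(1 for s in scores if s >= FRESH_CUTOFF)
    return 100 * fresh / len(scores)

# Invented data: everyone shrugs vs. critics at each other's throats.
lukewarm = [61, 62, 63, 58, 57]
polarized = [100, 100, 100, 5, 10]

print(percent_fresh(lukewarm))   # 60.0
print(percent_fresh(polarized))  # 60.0
```

A 57 and a 5 both count as one "rotten," and a 61 and a 100 both count as one "fresh."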
Thus, the Rotten Tomatoes way is not solid enough. However, it does illustrate the basic idea that I do want to use for my own proposition for representing the aggregate: Categorizing the reviews according to sentiment.
Having some way of quickly showing that x reviewers felt this way, y reviewers felt that way, z reviewers felt something else altogether, and so on and so forth as necessary, seems like an appropriate way to illustrate a spectrum of varying opinions. No need to posit that all of these critics overall revolve around THIS single assessment, when we can instead show how their assessments spread out.
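A minimal sketch of that idea in Python: instead of reducing a pile of reviews to one number, tally how many land in each sentiment bucket. The labels below are hypothetical placeholders; how each review earns its label is a separate question.

```python
from collections import Counter

# Hypothetical sentiment labels, one per review.
labels = ["positive", "positive", "mixed", "negative", "positive", "mixed"]

tally = Counter(labels)
total = sum(tally.values())

# Report each bucket as a count and a share of all reviews.
for sentiment in ("positive", "mixed", "negative"):
    count = tally[sentiment]
    print(f"{sentiment:>8}: {count} ({100 * count / total:.0f}%)")
```

The output is a short list of counts rather than a single figure, so a split audience shows up as a split.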
We don’t even need to go that far for a real-life example of this idea better put into action, either! We can, in fact, turn to...Metacritic?!!
It’s not nearly as well-advertised as the Metascore, but when you go into the summary page for a movie, game, TV show, or music album, one of the things that you’re presented with is a chart that categorizes reviews into three sets of sentiments. It even takes the liberty of doing so for both critics’ reviews and user reviews! Here’s what that looks like for Detroit: Become Human, for example.
Separating the reviews according to positive sentiments, negative sentiments, and then something in the middle (what they call “mixed”) for good measure is, I think, a pretty decent way to hash out the range of differing opinions. And it’s presented here with an attractive view of three bars at that!
If Metacritic could find a way to make a version of this their hook rather than the Metascore, it would probably be a significant improvement. However, I think that the method by which they sort reviews into positive, mixed, or negative leaves a bit to be desired.
They do everything according to review score, which is imperfect, but at least understandable. My points of contention, rather, are twofold. First, their positive and negative categories are a bit too broad, which has the additional detriment of shortchanging their mixed/neutral category: it seems too narrow in comparison, when in fact I think it’s the right size.
Second, they split up the score categories differently for video games versus movies, TV shows, and music albums, and the more that I think about it, the more I find that their movies/TV/music breakdown would be more useful. See here.
My beef with the games scale is that it’s meant to reinforce the Four Point Scale phenomenon specific to video game reviews, which is a crock of crap. Critics for pretty much every other medium have no qualms about using the entire scale of grades to assess their subjects, no matter how popular those subjects may be! Games are one of the only cases where review scores are so front-loaded towards the higher end of the scale that a 7 out of 10 (a grade that I’d attach to the sentiment that something was solid! Not great, but certainly good!) somehow amounts to being a “mixed review” or “neutral.”
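To make the difference concrete, here's a sketch encoding both breakdowns as lists of lower bounds. The exact cutoffs (movies/TV/music: 81, 61, 40, 20; games: 90, 75, 50, 20) are my reading of the ranges Metacritic published, so treat them as assumptions that may have changed since. The same 70 reads as good news on one scale and as a shrug on the other.

```python
# Each band: (lowest score in band, label), checked from the top down.
# Cutoffs are assumptions based on Metacritic's published ranges.
MOVIE_TV_MUSIC = [
    (81, "Universal Acclaim"),
    (61, "Generally Favorable"),
    (40, "Mixed or Average"),
    (20, "Generally Unfavorable"),
    (0, "Overwhelming Dislike"),
]
GAMES = [
    (90, "Universal Acclaim"),
    (75, "Generally Favorable"),
    (50, "Mixed or Average"),
    (20, "Generally Unfavorable"),
    (0, "Overwhelming Dislike"),
]

def band(score: int, scale: list[tuple[int, str]]) -> str:
    """Return the label of the first band whose lower bound the score meets."""
    for lower, label in scale:
        if score >= lower:
            return label
    raise ValueError(f"score out of range: {score}")

print(band(70, MOVIE_TV_MUSIC))  # Generally Favorable
print(band(70, GAMES))           # Mixed or Average
```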
Metacritic’s standard for categorizing reviews for everything else, on the other hand, looks far more reasonable. Those ranges pass the “smell test” for what scores would amount to what sentiments. A 30 is the kind of score that I’d indeed give to something I thought was bad, a 50 is the kind of score I’d indeed give to something I was either conflicted about or just thought was unremarkable, and a 70 is indeed the kind of score I’d give to something I thought was good.
I think it would be beneficial for game reviews to be similarly held to a standard like that rather than being treated as a special case. We want video games to be a mature medium, right? If so, then publications who meaningfully use the entire grading scale ought to be encouraged to do so, not discouraged. Challenging the Four Point Scale would be a potential way to do that.
However, even the standard Metacritic way of categorizing movie/TV/music reviews still runs up against my prior criticism, that the positive and negative categories are too broad. There is a lot of daylight between an assessment of a 6 and an assessment of a 9, yet both are categorized as “positive?” It’s the kind of lack of distinction that makes Metacritic’s green/yellow/red divide somewhat less useful.
Personally, I would make a couple more categories to distinguish between the varying degrees of “positive” and the varying degrees of “negative.” Perhaps split the 60-to-100 range for positive reviews right down the middle so that the lower half of the range (from 60 to 80, approximately) represents the “good” stuff, while the upper half (from 81 to 100, approximately) represents the “great” stuff. We can then do something similar for the 0-to-40 range, splitting it down the middle to separate the merely mediocre from the truly awful.
This would therefore be similar to [checks my notes closely] something that Metacritic also already does. Going back to the breakdown...
Their Generally Favorable and Universal Acclaim ranges already make a solid split between “good” and “great!” Same for Overwhelming Dislike versus Generally Unfavorable as a split between “bad” and “mediocre!” On top of that, each of these ranges happens to be just about equal in size to the “neutral” Mixed or Average category, meaning that nothing would end up being shortchanged!
So it’s just a tad annoying that Metacritic then takes such an overall reasonable scale and oversimplifies it so that both good and great = green, and both bad and mediocre = red. That does away with useful distinctions. In my own ideal world, I’d keep them squarely in place, and easily identifiable.
On top of that, there is one last extra-special category I would like to include: A “perfect” category, just for reviews scored at 100. There is so much notability attached to the maximum possible score, which signals that the work being reviewed is deemed one of THE gold standards of the medium, that it probably deserves to be singled out, especially if lots and lots of reviewers are handing out that score en masse. Arguably, the difference between a 99 and a 100 is one of the only single-point differences in score that actually means something.
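A quick sanity check on the "just about equally sized" point, assuming the movie/TV/music ranges are 0 to 19, 20 to 39, 40 to 60, 61 to 80, and 81 to 100, all inclusive (again, my reading of Metacritic's published breakdown):

```python
# Assumed inclusive score ranges for Metacritic's movie/TV/music breakdown.
bands = {
    "Overwhelming Dislike": (0, 19),
    "Generally Unfavorable": (20, 39),
    "Mixed or Average": (40, 60),
    "Generally Favorable": (61, 80),
    "Universal Acclaim": (81, 100),
}

# Number of distinct integer scores each band covers.
widths = {name: high - low + 1 for name, (low, high) in bands.items()}
print(widths)
```

Every band covers 20 possible scores except the middle one, which covers 21, so the five buckets are within one point of each other in size.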
Therefore, if I were to come up with a method for categorizing reviews, and if I had to have some way of mapping them to review scores—ideally, even publications that don’t use any sort of simple scale, like Kotaku, could hopefully still be roughly summarized by a qualitative assessment—this is what it would probably be.
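As a rough sketch of how that scheme could be applied mechanically: the labels and cutoffs below are my approximation of the categories described above (Metacritic-style bands plus a separate "perfect" bucket for 100), not a definitive scale. Reviews without a number are represented as None and simply skipped, which is a simplification; ideally they'd be placed by a qualitative read instead. The sample scores are made up.

```python
from collections import Counter
from typing import Optional

# Proposed categories, checked from the top down. Cutoffs approximate the
# scheme described in the text; they are assumptions, not a fixed standard.
CATEGORIES = [
    (100, "perfect"),
    (81, "great"),
    (61, "good"),
    (40, "mixed"),
    (20, "mediocre"),
    (0, "bad"),
]

def categorize(score: int) -> str:
    """Map a 0-100 score to the first category whose lower bound it meets."""
    for lower, label in CATEGORIES:
        if score >= lower:
            return label
    raise ValueError(f"score out of range: {score}")

def breakdown(scores: list[Optional[int]]) -> dict[str, int]:
    """Count reviews per category, skipping unscored reviews (None)."""
    tally = Counter(categorize(s) for s in scores if s is not None)
    # Keep every category in order, including empty ones.
    return {label: tally[label] for _, label in CATEGORIES}

# Made-up scores; None stands in for a review that gave no number.
sample = [100, 100, 95, 90, 85, 78, 70, 55, 30, None]
print(breakdown(sample))
```

The result is six counts instead of one average, with the 100s kept separate from the merely great.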
We can even throw this against some examples. Like, perhaps a couple of familiar friends from last time?
Here is a breakdown using Metacritic’s aggregate of reviews for the PlayStation 4 version of the relatively tepidly received Assassin’s Creed Unity.
And here is another breakdown using Metacritic’s aggregate of reviews for the Xbox 360 version of the widely positively received but infamously 84-Metascoring Fallout: New Vegas.
Finally, to show off what things might look like for a game with glowing accolades, here is a breakdown (with scored entries only, for the sake of convenience; there are ten reviews collected which were left unscored) using Metacritic’s aggregate of reviews for Super Mario Odyssey.
Something like this would still leave a lot of room for improvement, obviously. It also would not change the problematic tendency of both the industry and the community to put such outsized stock in the numbers attached to reviews. That is an issue which can only be improved separately from the question of how best to summarize an aggregate, specifically by putting what the summaries actually represent into proper perspective, and thus taking them with more of a grain of salt.
But this has to be at least slightly better than what we’ve currently got.
Well, those are more than enough paragraphs out of me on the matter. Sheesh...