'Can data analysis reveal the most bigoted corners of Reddit?' diff viewer (3/4)

This article is from the source 'guardian' and was first published or seen on March 23, 2015 06:24 (UTC). It last changed over 40 days ago and won't be checked again for changes.

You can find the current article at its original source at http://www.theguardian.com/technology/2015/mar/23/can-data-analysis-reveal-the-most-bigoted-corners-of-reddit

The article has changed 5 times.

Version 3	Version 4
Can data analysis reveal the most bigoted corners of Reddit?	Can data analysis reveal the most bigoted corners of Reddit?
2015-03-23 16:45:42 UTC	2015-03-23 17:20:11 UTC (34 minutes later)
With its decentralised structure, community moderation, and hands-off management, it’s hard to generalise about the social network Reddit. The site is built of thousands of ‘subreddits’ - user-created forums with a focus on specific topics such as the video game Destiny, fitness, a love of maps, or even just drugs.	With its decentralised structure, community moderation, and hands-off management, it’s hard to generalise about the social network Reddit. The site is built of thousands of ‘subreddits’ - user-created forums with a focus on specific topics such as the video game Destiny, fitness, a love of maps, or even just drugs.
But each subreddit has different norms, rules and tone, which can make navigating the site an exercise in frustration and nasty surprises. It takes a while to develop a feeling for any particular sub, by which point a hostile community may already have ruined your day.	But each subreddit has different norms, rules and tone, which can make navigating the site an exercise in frustration and nasty surprises. It takes a while to develop a feeling for any particular sub, by which point a hostile community may already have ruined your day.
Ben Bell, a data scientist at text-analytics start up Idibon, decided to apply his company’s technology to the site to work out which subreddits have communities you would want to be a part of, and which you would be best avoiding.	Ben Bell, a data scientist at text-analytics start up Idibon, decided to apply his company’s technology to the site to work out which subreddits have communities you would want to be a part of, and which you would be best avoiding.
Bell’s interest was sparked by a post asking Redditors to suggest their nominees for the most “toxic communities” on the site. Suggestions included the parenting subreddit – full of “sanctimommies” – and the community for the game League of Legends, which has “made professional players quit the game”.	Bell’s interest was sparked by a post asking Redditors to suggest their nominees for the most “toxic communities” on the site. Suggestions included the parenting subreddit – full of “sanctimommies” – and the community for the game League of Legends, which has “made professional players quit the game”.
He writes: “As I sifted through the thread, my data geek sensibilities tingled as I wondered: ‘Why must we rely upon opinion for such a question? Shouldn’t there be an objective way to measure toxicity?’	He writes: “As I sifted through the thread, my data geek sensibilities tingled as I wondered: ‘Why must we rely upon opinion for such a question? Shouldn’t there be an objective way to measure toxicity?’
“With this in mind, I set out to scientifically measure toxicity and supportiveness in Reddit comments and communities. I then compared Reddit’s own evaluation of its subreddits to see where they were right, where they were wrong, and what they may have missed. While this post is specific to Reddit, our methodology here could be applied to offer an objective score of community health for any data set featuring user comments.”	“With this in mind, I set out to scientifically measure toxicity and supportiveness in Reddit comments and communities. I then compared Reddit’s own evaluation of its subreddits to see where they were right, where they were wrong, and what they may have missed. While this post is specific to Reddit, our methodology here could be applied to offer an objective score of community health for any data set featuring user comments.”
Related: Reddit: can anyone clean up the mess behind 'the front page of the internet'?	Related: Reddit: can anyone clean up the mess behind 'the front page of the internet'?
Bell pulled out a sample of comments from every one of the top 250 subreddits, as well as any forum mentioned in the toxicity thread, and subjected them to a number of tests designed to look for toxicity, which he defined as a combination of ad hominem attacks and overt bigotry.	Bell pulled out a sample of comments from every one of the top 250 subreddits, as well as any forum mentioned in the toxicity thread, and subjected them to a number of tests designed to look for toxicity, which he defined as a combination of ad hominem attacks and overt bigotry.
From there, he used a combination of sentiment analysis and human annotation to code each comment as toxic or non-toxic. The former involves applying ~~Ibidon’s~~ technology to attempt to categorise comments as either positive, negative or neutral in sentiment, which let him narrow down the work required for the human annotators by 96%, only looking at those subreddits which had already been picked as containing a lot of negative comments.	From there, he used a combination of sentiment analysis and human annotation to code each comment as toxic or non-toxic. The former involves applying Idibon’s technology to attempt to categorise comments as either positive, negative or neutral in sentiment, which let him narrow down the work required for the human annotators by 96%, only looking at those subreddits which had already been picked as containing a lot of negative comments.
Sentiment analysis is a controversial technology. It allows researchers to automatically process reams of data but it is criticised as an overly simplistic tool. In Bell’s tests, however, it proved its worth. “Using the sentiment model, we selected for human annotation the 30 most positive, the 30 most negative posts, and another 40 random posts from each subreddit,” he said.	Sentiment analysis is a controversial technology. It allows researchers to automatically process reams of data but it is criticised as an overly simplistic tool. In Bell’s tests, however, it proved its worth. “Using the sentiment model, we selected for human annotation the 30 most positive, the 30 most negative posts, and another 40 random posts from each subreddit,” he said.
“Each post was labeled as ‘supportive’, ‘toxic’, or ‘neutral’. We found that for comments that were randomly chosen, the vast majority were labeled ‘neutral’, which didn’t really provide much information for comparison, while the ones that were chosen by our sentiment model were far more likely to be labeled with the predicted sentiment than any other label.	“Each post was labeled as ‘supportive’, ‘toxic’, or ‘neutral’. We found that for comments that were randomly chosen, the vast majority were labeled ‘neutral’, which didn’t really provide much information for comparison, while the ones that were chosen by our sentiment model were far more likely to be labeled with the predicted sentiment than any other label.
“There are often problems with sentiment analysis and accuracy, but probably a bigger issue is that it’s not always all that actionable.”	“There are often problems with sentiment analysis and accuracy, but probably a bigger issue is that it’s not always all that actionable.”
“Bigoted comments received overwhelming approval”	“Bigoted comments received overwhelming approval”
The initial finding was that, as expected, there’s a huge variation in the scale of bigotry on Reddit. Bell weighed the presence of bigoted comments by the approval they had been given by the other members of the community – the comment’s score, which is the net of upvotes minus downvotes. That led to some interesting quirks: in a number of subreddits, the community is proactive enough at self-policing that the average score for a bigoted comment is negative. Those include /r/Jokes and /r/Libertarian, both fairly self-evident.	The initial finding was that, as expected, there’s a huge variation in the scale of bigotry on Reddit. Bell weighed the presence of bigoted comments by the approval they had been given by the other members of the community – the comment’s score, which is the net of upvotes minus downvotes. That led to some interesting quirks: in a number of subreddits, the community is proactive enough at self-policing that the average score for a bigoted comment is negative. Those include /r/Jokes and /r/Libertarian, both fairly self-evident.
At the other end of the spectrum are those communities which seem to deliberately encourage bigotry. Top of the list is /r/TheRedPill, “a subreddit dedicated to proud male chauvinism”, as Bell puts it, “where bigoted comments received overwhelming approval from the community at large”.	At the other end of the spectrum are those communities which seem to deliberately encourage bigotry. Top of the list is /r/TheRedPill, “a subreddit dedicated to proud male chauvinism”, as Bell puts it, “where bigoted comments received overwhelming approval from the community at large”.
When it comes to toxicity as a whole, there is some crossover with mere bigotry. The /r/TumblrinAction forum, devoted to mocking users of the Tumblr social network (which is particularly associated with its queer and female userbase), is ranked by Bell as both bigoted and toxic.	When it comes to toxicity as a whole, there is some crossover with mere bigotry. The /r/TumblrinAction forum, devoted to mocking users of the Tumblr social network (which is particularly associated with its queer and female userbase), is ranked by Bell as both bigoted and toxic.
Other usual suspects, such as the Atheism, Politics and News subreddits, are all ranked as fairly toxic. The problems with /r/Atheism led to that particular forum being removed from Reddit’s list of default subs.	Other usual suspects, such as the Atheism, Politics and News subreddits, are all ranked as fairly toxic. The problems with /r/Atheism led to that particular forum being removed from Reddit’s list of default subs.
But there are also some surprises.	But there are also some surprises.
ShitRedditSays, a subreddit which focuses on highlighting bad content around the rest of Reddit (frequently from a social justice viewpoint), comes close to the top. In part, that is because the toxicity is directed outwards, “at the Reddit community at large”, says Bell.	ShitRedditSays, a subreddit which focuses on highlighting bad content around the rest of Reddit (frequently from a social justice viewpoint), comes close to the top. In part, that is because the toxicity is directed outwards, “at the Reddit community at large”, says Bell.
“It’s also important to note that a significant portion of their Toxicity score came from conversations between SRS members and other Redditors who come specifically to disagree and pick fights with the community - a trap that many members tend to fall into, and which led to some rather nasty and highly unproductive conversations.”	“It’s also important to note that a significant portion of their Toxicity score came from conversations between SRS members and other Redditors who come specifically to disagree and pick fights with the community - a trap that many members tend to fall into, and which led to some rather nasty and highly unproductive conversations.”
‘Tough problem’	‘Tough problem’
But naming the problem, and identifying the subforums that have community issues is a number of steps away from making Reddit a friendlier place for all.	But naming the problem, and identifying the subforums that have community issues is a number of steps away from making Reddit a friendlier place for all.
“It’s a really tough problem,” Bell says. “There needs to come a unified understanding of what really is toxic within these communities. Just from the comments we’ve seen in response to our article, there certainly seems to be a considerable contingent that doesn’t get that their behaviour would make others feel uncomfortable.	“It’s a really tough problem,” Bell says. “There needs to come a unified understanding of what really is toxic within these communities. Just from the comments we’ve seen in response to our article, there certainly seems to be a considerable contingent that doesn’t get that their behaviour would make others feel uncomfortable.
“In the past few weeks there have already been a number of articles on bigotry in Reddit, and as subreddits become aware of how their behavior is viewed, their own understanding will change as to what is acceptable and what isn’t – and, hopefully, they will work to create communities they can be proud of.”	“In the past few weeks there have already been a number of articles on bigotry in Reddit, and as subreddits become aware of how their behavior is viewed, their own understanding will change as to what is acceptable and what isn’t – and, hopefully, they will work to create communities they can be proud of.”