Who Deletes Wikipedia?
Recently, we set up our brand new Hadoop cluster and loaded the entire Wikipedia revision history, about 8TB of text when uncompressed, into WibiData so we could analyze editor behavior in ways that would be impractical with traditional databases and analysis tools. We set out to explore editor behavior in depth and, for the first time, consider every edit made by every editor since the beginning in one comprehensive analysis. What follows is the first in a series of posts highlighting our findings.
A few weeks ago I read a great article by Aaron Swartz that discusses two very different viewpoints on how Wikipedia was written. A commonly held view is that many people contribute to articles on topics they are knowledgeable about, and that the whole of Wikipedia comes out of the collective effort of a lot of people. On the other hand, Swartz cites Jimmy Wales as contending that Wikipedia is actually written by a small handful of people, like a traditional encyclopedia. To quote Swartz quoting Wales:
I expected to find something like an 80-20 rule: 80% of the work being done by 20% of the users, just because that seems to come up a lot. But it’s actually much, much tighter than that: it turns out over 50% of all the edits are done by just .7% of the users … 524 people. … And in fact the most active 2%, which is 1400 people, have done 73.4% of all the edits.
The most active users may contribute the most edits, but does this mean that Wikipedia is actually written by a small group of people? Many editors spend their time reverting vandalized pages or adding a "citation needed" template. This improves the quality of Wikipedia, but it does not correspond to adding new content. Aaron Swartz took this line of reasoning and decided to instead count the number of characters each user contributed that still appear in the present article. He randomly selected a few articles and computed this metric for every editor. He found that many of the top content creators, by his measure, have few edits, and many are not even registered users. His result is interesting but, admittedly, not as comprehensive as he would have liked:
I don’t have the resources to run this calculation across all of Wikipedia (there are over 60 million edits!), but I ran it on several more randomly-selected articles and the results were much the same.
Since the revision history of Wikipedia is publicly available, this analysis could, theoretically, be done by anyone. A major limiting factor, as Swartz noted, is the availability of computational resources. For my analysis, I used every revision in the English Wikipedia main namespace (no talk pages or meta-wiki) from Wikipedia's creation up until January 1st, 2012. This consists of more than 290 million revisions.
The Hadoop framework makes storing and analyzing a data set of this size more accessible, but going from logs of revisions to insights about editing behavior is still complicated. The data and processing model of Wibi makes the analysis of user behavior easy. In Wibi, a row in a table contains information about a single entity, a user or revision for example. So, a row can store the data to be processed as well as information derived from that data. As I began to get answers to my initial questions, my approach to the analysis and the types of questions I was interested in inevitably changed. It was easy to make ad hoc changes to the table layout so that I could store new derived information about a user or revision. For more information on how Wibi works, check out this blog post. If you want to know more about the producer and gatherer analysis paradigm, check out this blog post.
Before digging into the data, I wanted to make sure I was only considering human editors, and not automated bots. To remove bots from the group of editors considered in this analysis, I wrote a producer to parse the editor identifier, either an IP address or a user name, and tag each editor as IP, user, or bot. I used the list of flagged bots and unflagged bots for bot identification. This allowed me not only to remove bots, but also to investigate the difference in editing behavior between logged-in users and anonymous users identified only by their IP address.
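The tagging step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Wibi producer: the function name classify_editor and the known_bots set (standing in for the flagged and unflagged bot lists) are assumptions made for this example.

```python
import ipaddress


def classify_editor(identifier, known_bots):
    """Tag an editor identifier as 'bot', 'ip', or 'user'.

    known_bots is a hypothetical set of user names taken from the
    flagged and unflagged bot lists.
    """
    # Bots are identified purely by membership in the bot lists.
    if identifier in known_bots:
        return "bot"
    # Anonymous editors are identified by an IPv4 or IPv6 address.
    try:
        ipaddress.ip_address(identifier)
    except ValueError:
        # Not a valid IP address, so treat it as a registered user name.
        return "user"
    return "ip"


if __name__ == "__main__":
    bots = {"ExampleBot"}  # made-up bot name for illustration
    for name in ["ExampleBot", "192.0.2.17", "2001:db8::1", "Alice"]:
        print(name, "->", classify_editor(name, bots))
```

Checking the bot list first means a bot is never mistaken for a regular user, and anything that does not parse as an IP address falls through to the user category.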
In order to verify Jimmy Wales' claim that the top 2% most active editors have done 73.4% of all edits, I first counted the number of revisions each editor had made. I then grouped the editors, ranking them by how many edits each had made. In the plot below, each bar represents a group of editors who have made about the same number of edits. Above 10 edits, I began to group editors together so that each group held about 1% of the population. The height of the bar (y axis) shows how many edits come from this group. Over each bar, I have included a label which shows what percentage of the editor population this bar represents. For example, the left-most bar shows that 50.49% of editors only ever make one edit, and that this corresponds to about 6% of the total edits in Wikipedia.
Our data includes all revisions up to January 1st, 2012, so our exact numbers differ from the ones Wales quoted, but the principle behind his claim holds. The top 0.098% of people have done more than 50% of all revisions. This is an even stronger result than Wales originally presented.
My next step in extending Swartz's analysis was to come up with a measure of content creation that can be computed for each revision. My original approach was to count the number of characters added in each revision. Using this metric, I found that many revisions hadn't added anything at all; they consisted entirely of deleting content! By neglecting to account for any deleting behavior, I had missed a significant aspect of editing on Wikipedia. I modified my metric to compute the number of characters added and then subtract the number of characters deleted. I called this the delta of a revision. So, changing “colours” to “colors” would have a delta of -1, and adding the sentence “Narwhals are unicorns of the sea.” would have a delta of +33. I wrote a gatherer to compute the cumulative delta for each of our original groups of editors.
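As a rough illustration of the delta metric, here is a minimal Python sketch. It is not the gatherer used in this analysis: revision_delta and cumulative_delta_by_editor are hypothetical names, the (editor, old_text, new_text) tuples are an assumed input shape, and Python's difflib stands in for whatever diffing the real pipeline used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def revision_delta(old_text, new_text):
    """Characters added minus characters deleted between two revisions."""
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    added = 0
    deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deleted += i2 - i1  # characters removed from the old text
            added += j2 - j1    # characters introduced in the new text
    return added - deleted


def cumulative_delta_by_editor(revisions):
    """Sum the delta of every revision per editor.

    revisions is an iterable of (editor, old_text, new_text) tuples.
    """
    totals = defaultdict(int)
    for editor, old_text, new_text in revisions:
        totals[editor] += revision_delta(old_text, new_text)
    return dict(totals)


if __name__ == "__main__":
    print(revision_delta("colours", "colors"))
    print(revision_delta("", "Narwhals are unicorns of the sea."))
```

Because every insertion and deletion changes the text length by exactly its own size, the delta always equals len(new_text) - len(old_text). The diff is only needed if you also want the added and deleted counts separately, which is how the deletion-only revisions show up. Summing the per-editor totals over all editors in an activity group would give that group's cumulative delta.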
Below are graphs of the percentage of the total delta that comes from each group. You can think of the total delta as all characters in Wikipedia as of January 1st, 2012, that were not contributed by bots. The editors are grouped together by the number of edits they have made, in exactly the same way as in the first plot. It seems that people with fewer edits tend to delete more than they add, and the most frequent editors still contribute the most characters. In fact, the group of editors with more than 844 edits has contributed more characters to Wikipedia over its entire history than currently exist on Wikipedia today. This is the reason that the percentage of the total delta for that group is greater than 100%.
In order to get a better understanding of user behavior in these groups, I computed the average delta for a revision from each group. You can see that, on average, editors with fewer than 13 edits tend to delete more characters than they add. Above 13 edits, the edits tend to add more characters. While I don't claim to know why this is, I do suspect that the simple fact that it is easier to delete content than it is to make an original contribution plays a large part in this trend.
Next, I wanted to know whether the difference in editing behavior between editors who are logged in and editors identified only by an IP address is significant. Of course, interpreting any results of this analysis is complicated by the fact that a single user doesn't necessarily correspond to a single IP address, and vice versa. So it goes. For all of my analysis, I kept users grouped by editing activity, but also incorporated their identifier type.
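Grouping by both editing activity and identifier type might look something like the following Python sketch. It is an assumption-laden illustration rather than the code behind the plots: average_delta_by_group is a hypothetical name, the (editor, editor_type, delta) tuples are an assumed input shape, and bucket_for is a caller-supplied function that maps an editor's edit count to an activity-group label.

```python
from collections import Counter, defaultdict


def average_delta_by_group(revisions, bucket_for):
    """Average delta per revision for each (activity bucket, editor type).

    revisions is an iterable of (editor, editor_type, delta) tuples and
    bucket_for maps an editor's total edit count to a bucket label.
    """
    revisions = list(revisions)
    # First pass: how many edits each editor made in total.
    edit_counts = Counter(editor for editor, _, _ in revisions)

    # Second pass: accumulate deltas per (bucket, editor type) pair.
    sums = defaultdict(int)
    counts = defaultdict(int)
    for editor, editor_type, delta in revisions:
        key = (bucket_for(edit_counts[editor]), editor_type)
        sums[key] += delta
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    # Made-up revisions purely for illustration.
    sample = [
        ("198.51.100.7", "ip", -40),
        ("alice", "user", 33),
        ("alice", "user", -1),
        ("bob", "user", 10),
    ]
    buckets = lambda n: "1 edit" if n == 1 else "2+ edits"
    for key, avg in sorted(average_delta_by_group(sample, buckets).items()):
        print(key, avg)
```

Keeping the bucketing rule as a parameter makes it easy to reuse the same activity groups as the earlier plots while splitting each one by identifier type.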
Going back to the number of edits made, we can see that IP addresses tend to dominate in terms of number of edits, until we get to the most active editors. I was interested in seeing whether anonymous and logged-in users have different editing behaviors, so I computed the average delta for a revision, aggregated according to editing activity and whether or not the editor was logged in. The difference in behavior between logged-in users and IP addresses is even more startling. Editors identified only by an IP address are much more likely to delete content from Wikipedia than logged-in users. Logged-in users go from deleting more, on average, to adding more at 6 edits. Anonymous editors add more content, on average, only when we get up to the highest editing activity range. This, together with the significant difference between the highest and second highest editing activity groups, makes me suspect that there is some bot activity that I had not caught. Overall, it seems that editors identified only by an IP address delete significantly more than logged-in editors. I'm no psychologist, so I'm interested in hearing what other people have to say about this difference in behavior between users and IPs. So, leave a comment!
In this post, we explored how editors add and remove content from Wikipedia. Unsurprisingly, a small percentage of editors are responsible for most of the edits. What is surprising is exactly how small that percentage is. Things are even more extreme when you count not just edits, but measure how much content people add vs. how much they delete. Individuals who make only a few edits (fewer than 12) tend to remove more characters than they add, while individuals who make a lot of edits tend to contribute more content than they remove. In further classifying editors, there is a distinct difference between the cohort of users who are logged in and those who are anonymous (identified only by IP address). Logged-in users have a much greater tendency to add content, while anonymous users tend to remove far more than they add.
As these results show, simple tools such as counts and histograms allow us to gain interesting insight into a population of individuals, but at this scale they were made possible only by large-scale data analysis tools like Hadoop, HBase and WibiData. We will continue to explore the Wikipedia editor community in a series of blog posts. Be on the lookout for posts that extend the analysis presented here, as well as one about visualizing the category relationships in Wikipedia.