Reverting in Wikipedia
Reverting an edit in Wikipedia generally refers to undoing the actions of a specific edit. This could be removing vandalism (like pictures of Squidward, for example) or bringing a page’s content back after a new user has blanked it while experimenting. In the comments of my first blog post, many experienced Wikipedians brought up the point that reverting is a very common activity and that its effect on the results of my previous analysis should be examined. In the rest of this post I’ll present the initial set of results that came from taking into account whether edits were reverted, reverting, both or neither.
Daryl, one of our interns this summer, identified reverted and reverting edits. For each page in Wikipedia, he compared the MD5 hashes of the text of the page after each edit had been applied. If any two hashes on a page are identical, the more recent of the two edits is marked as reverting and all edits between the identical ones are marked as reverted. This labeling of edits is illustrated in the diagram below; the state of a Wikipedia page is represented by its color, and the arrows represent edits that take pages from one state to another.
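To make the labeling rule concrete, here is a minimal sketch in Python. It is not Daryl’s actual code: the function name `label_reverts` and the input format (a list with the page text after each edit, oldest first) are assumptions made for illustration. Matching against the earliest identical state is one way to cover "any two identical hashes", since it marks the widest span of edits in between.

```python
import hashlib

def label_reverts(texts):
    """Label each revision of one page as reverted, reverting, both or neither.

    texts: the page text after each edit, oldest first (hypothetical input format).
    """
    first_seen = {}                      # MD5 digest -> index of the earliest revision with that text
    reverted = [False] * len(texts)
    reverting = [False] * len(texts)

    for i, text in enumerate(texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            # This edit restores a state the page was in before, so it is reverting.
            reverting[i] = True
            # Every edit strictly between the two identical states is reverted.
            for k in range(first_seen[digest] + 1, i):
                reverted[k] = True
        else:
            first_seen[digest] = i

    labels = []
    for was_reverted, is_reverting in zip(reverted, reverting):
        if was_reverted and is_reverting:
            labels.append("both")
        elif was_reverted:
            labels.append("reverted")
        elif is_reverting:
            labels.append("reverting")
        else:
            labels.append("neither")
    return labels
```

For a page whose states go A, B, A, C, A, the third edit restores A (reverting the second), and the fifth edit restores A again, which also marks the third edit as reverted. That is how an edit can end up labeled "both".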
This method of identifying reverted and reverting edits relies on a very broad definition of those terms. As Daryl pointed out in his post, there are a few very common states for a Wikipedia page. If a page gets blanked in 2003, and then again in 2008, far more edits would be labeled as reverted than really should be. One potential solution is to limit the number of edits allowed between two identical states of a page. It is unclear to me what that upper limit should be. Feedback on how to address this drawback of our labeling method would be appreciated, as we currently do not apply such a limit in our analysis.
With this new information about each edit available, I went back and recomputed the results of my last analysis. This time I filtered the revisions by whether they are reverted, reverting, both or neither. The editing frequency group that a given editor belongs to was computed from all of the edits they had made: reverted, reverting, or otherwise. Editors have not changed editing groups between my last analysis and this one. Instead, their edits are filtered by whether a given edit was reverted, reverting, both or neither.
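The upper limit discussed above, which this analysis does not apply, could look something like the following sketch. The function name `label_reverts_capped` and the parameter `max_gap` are hypothetical, and no particular value of `max_gap` is being recommended; the point is only to show where such a cap would enter the labeling.

```python
import hashlib
from collections import defaultdict

def label_reverts_capped(texts, max_gap):
    """Return a (reverted, reverting) pair of booleans for each revision.

    Two identical page states only count as a revert if they are at most
    max_gap edits apart. max_gap is a hypothetical tuning parameter.
    """
    seen = defaultdict(list)             # MD5 digest -> indices of revisions with that text, ascending
    reverted = [False] * len(texts)
    reverting = [False] * len(texts)

    for i, text in enumerate(texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        # Earlier identical states that are close enough to count.
        candidates = [j for j in seen[digest] if i - j <= max_gap]
        if candidates:
            reverting[i] = True
            # candidates is ascending, so candidates[0] is the earliest state within the cap.
            for k in range(candidates[0] + 1, i):
                reverted[k] = True
        seen[digest].append(i)

    return list(zip(reverted, reverting))
```

With a small cap, a page that is blanked once and blanked again many edits later no longer drags every edit in between into the reverted set.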
If we limit our analysis only to the edits that are neither reverted nor reverting, we can see that the average character contribution per edit is positive for every editing frequency group. In the original analysis of this data, we found that lower editing frequency groups had an average delta per edit that was negative.
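As a rough sketch of that computation: editors are placed in a frequency group using their total edit count, and the average delta is then taken only over the edits labeled as neither reverted nor reverting. The function name, the tuple layout, and the `bucket_starts` boundaries are assumptions for illustration; the actual group boundaries used in the analysis are not reproduced here.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def average_delta_by_group(edits, bucket_starts):
    """Average character delta per edit for each editing frequency group.

    edits: iterable of (editor, delta_chars, label) tuples, where label is one of
           "reverted", "reverting", "both" or "neither" (hypothetical layout).
    bucket_starts: ascending lower bounds of the groups, starting at 1 (hypothetical).
    """
    edits = list(edits)
    # Group membership is based on ALL of an editor's edits, whatever their label.
    totals = Counter(editor for editor, _, _ in edits)

    sums = defaultdict(int)
    counts = defaultdict(int)
    for editor, delta, label in edits:
        if label != "neither":
            continue                     # only the average is filtered, not the grouping
        group = bucket_starts[bisect_right(bucket_starts, totals[editor]) - 1]
        sums[group] += delta
        counts[group] += 1

    return {group: sums[group] / counts[group] for group in sums}
```

Each key in the returned dictionary is the lower bound of a group, so a key of 3 with `bucket_starts = [1, 3]` means "editors with 3 or more total edits".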
The delta increases as the number of edits per editor increases, up until we reach editors with 19-28 edits, where it begins to fall off again.
As many Wikipedians pointed out in the comments of my post on deletion behavior, attribution of deletion is tricky on Wikipedia. When an article is deleted from Wikipedia, often one person will tag it for deletion and another editor with admin privileges will actually delete it. I suspect that having a large number of edits and being a person who ultimately deletes articles are strongly correlated. This would result in deletions being wrongly attributed to the group with high numbers of total edits, when someone else actually nominated the article.
We can also separate out the average delta by user type (logged-in users vs. anonymous users identified only by IP address), along with editing frequency. In both groups, the average delta per edit is positive. The average delta per edit is higher for logged-in users than for IPs, for every edit frequency group except the very highest. The logged-in users also exhibit a decrease in the number of characters added in an average edit as the total number of edits made increases. This supports the idea that admins will tend to be logged-in users with high edit counts, and thus their deleting responsibilities will have a significant effect on the high edit count group’s average delta. It would be interesting to identify the people marking articles for deletion, attribute the deleted article to them, and recompute these statistics.
When we filter the edits considered in our analysis to only those that are neither reverting nor reverted, the net character contribution from every group is positive. Generally, the total characters contributed in these edits increases as the number of edits made by an editor increases, until we get to the editors who make 19-28 edits; then it decreases. When we separate our population into logged-in users with accounts and editors identified only by their IP address, we see that logged-in users add the most characters per edit when their total number of edits is smallest, and this figure continues to decrease as the number of edits increases. This seems to jibe with Aaron Swartz’s model of content generation on Wikipedia, though more needs to be done in order to correctly attribute deletion to editors.
Reversion is an interesting editing phenomenon, which we will be examining more in the future, so keep your eye out.