Wikipedia talk:Wikipedia Signpost/2025-03-22/Recent research
Discuss this story
- The Wikimedia Research Fund section could use a sentence about what the fund is, what it does, etc. I hadn't heard of it before this article. –Novem Linguae (talk) 22:02, 23 March 2025 (UTC)
- Thanks for the feedback. I think the list of previously funded proposals may already convey quite a bit of information in that regard, as do the changes mentioned. But point taken. Regards, HaeB (talk) 02:27, 24 March 2025 (UTC)
- Taking a look at the words to watch page, the usage of "crucial" is hinted at in the puffery section, and "additionally" might be something to add to the editorial section. – The Grid (talk) 13:33, 24 March 2025 (UTC)
- Most of these articles about ChatGPT's impact seem to assume the current chatbot model will just keep going, if not grow even more. I'd like to remind everyone OpenAI is still losing billions of dollars a year and, if investor confidence wanes even a little bit, may not be able to provide free access to ChatGPT for much longer. It takes insane amounts of power to run an AI data center and the product just isn't valuable enough to justify the expense. This blog post discusses AI's profitability problems at length. HansVonStuttgart (talk) 08:41, 26 March 2025 (UTC)
- That's an extremely misguided comment. Whatever one thinks about OpenAI's finances (and I would recommend some healthy skepticism about Ed Zitron's rants; in any case, re investor confidence, the company raised another $40 billion right after you posted this comment here, so its demise is presumably still some time away):
- The cost of operating LLMs like ChatGPT "has fallen dramatically in recent years" [1]. And one can now "run a GPT-4 class model on [a] laptop" [2] - i.e. a model that is significantly better than the free version of ChatGPT was during the timespan covered by the studies reviewed here.
- Regards, HaeB (talk) 04:54, 29 April 2025 (UTC)
"Wikipedia Contributions in the Wake of ChatGPT"
- I'm skeptical of this paper's conclusions. I think this WWW article's evidence is actually partly inconsistent with the narrative of ChatGPT as a competitive threat to Wikipedia. Figure 2 shows a dramatic increase in Wikipedia views and edits for "dissimilar articles", much larger than the decrease observed for "similar articles". These similarity categories are based on an attempt to classify whether a Wikipedia article is similar to content the chatbot would produce. But it is difficult to pin down what explains the difference between these sets. They set it up so that the dissimilar articles are the "control group", as if they would be unaffected by ChatGPT. But that's not the story they tell in Figure 2. The headline could easily have been "ChatGPT increased Wikipedia usage and contribution" if the researchers had started from a different narrative frame. I'm still skeptical that current chatbots pose a competitive threat to Wikipedia. Wikipedia gets you facts faster than chatbots, and has a stronger brand among its users for verifiable and factual information. Groceryheist (talk) 15:07, 26 March 2025 (UTC)
- Update: The more I look at this, the less convincing it is. The assumptions of their DiD just don't seem plausible if you look at the data in Figure 2. There's a big increase in views for dissimilar articles that starts before the change. The decline in views for similar articles begins prior to the change. This invalidates the "parallel trends" assumption of the DiD estimator. They only estimated a large decrease in views for "similar" articles in the DiD because there was an increase in views to "dissimilar" articles in the same period. Groceryheist (talk) 15:24, 26 March 2025 (UTC)
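For readers following the statistical back-and-forth, the generic two-group difference-in-differences setup at issue can be sketched as follows (the notation is illustrative only, not the paper's exact specification, which adds controls and article-age cohorts):

y_{it} = \alpha + \beta \cdot \mathrm{Similar}_i + \gamma \cdot \mathrm{Post}_t + \delta \cdot (\mathrm{Similar}_i \times \mathrm{Post}_t) + \varepsilon_{it}

Here y_{it} is the activity (views or edits) of article i in month t, Similar_i marks the "similar" (treated) group, Post_t marks months after ChatGPT's launch, and \delta is the DiD estimate. \delta is identified only under the parallel trends assumption - that, absent ChatGPT, both groups would have evolved in parallel - and it is an inherently relative quantity: a post-launch increase in the "dissimilar" control group shows up in \delta exactly as if it were a decrease in the "similar" group, which is the concern raised in the comment above.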
- @Groceryheist: Yes, this confused me greatly too. For what it's worth though:
- 1. Parallel trends assumption: We should note that the authors directly claim that Figure 2 "[...], which shows similar pre-ChatGPT evolution of these two groups, bolsters our confidence in [the 'parallel trends'] assumption". Now, I don't think that one should blindly defer to a Nobel prize winner. (As I mentioned in the review, this wouldn't be the first Acemoglu-coauthored or -promoted paper on AI impacts that attracts criticism for statistical flaws - or, in recent news, worse.) But at the very least they didn't ignore this assumption. Regarding "There's a big increase in views for dissimilar articles that starts before the change. The decline in views for similar articles begins prior to the change."
- note that Figure 2 actually shows residual views and edits, more specifically "mean residuals for each month of activity 𝑡 aggregated over similar and dissimilar articles respectively" from the "Comparative time series" regression on the page preceding Figure 2. And that regression already includes a "linear trend[] for [...] activity time". I too am more familiar with visually checking the parallel trends assumption by looking at the actual outcome variable instead of such averaged residuals. But perhaps in this residuals view one is supposed to check the parallel trends assumption by e.g. visually inspecting whether the error bars cover the mean mean residual (dashed lines) pre-launch, which they do actually do fairly well in Figure 2? Just guessing though. In any case, the authors might question what you mean by "big" exactly in "big increase in views for dissimilar articles" - shouldn't that be put in relation to the size of the standard errors? Other observations derived from ocular inspection like "The decline in views for similar articles begins prior to the change" should probably take the error bars into account, too.
- 2. Eyeballed trends from Figure 2 vs. DiD regression coefficients in Figure 3: When the authors write in section 3.1 about Figure 2 that (my bolding)
For both views and edits, we see that similar articles exhibited little changes in activity from the pre-GPT to the post-GPT period (accounting for controls). Dissimilar articles, on the other hand, show an increase in edits after ChatGPT launched [...]
- perhaps they simply refer to the positions of the dashed lines (which show the "[mean] mean residuals for similar (blue) and dissimilar (red) articles over the pre-GPT and post-GPT periods respectively")? But it's in weird contrast to what they say right afterwards in section 3.2 based on the DiD regression(s): "The diff-in-diff coefficients for Figure 3a (views) are negative and statistically significant for all article ages except 𝑇 = 1, which implies that Wikipedia articles where ChatGPT provides a similar output experience a larger drop in views after the launch of ChatGPT. This effect is much less pronounced for edit behavior."
- My hunch is that they may have gotten a bit carried away with these ocular observations in section 3.1 (perhaps forgetting themselves for a moment that Figure 2 shows residuals only?) and should have stuck with reporting the DiD outcomes, instead of yielding to the temptation of (to use your expression) "telling a story" about Figure 2 already. It's also worth being aware that Figure 2 only shows the situation for a single T (T=6).
- But I'm genuinely still confused about all this too. At the very least, it is safe to say that the paper goes through only a small part of the steps recommended in this new "practitioner's guide" to Difference-in-Differences designs (see p.53 f., "Conclusions"; also, section 5.1 there about 2xT designs looks like it could be relevant here, but I haven't gotten around to reading it yet). A rough sketch of one such pre-trend check is included after this comment.
- In any case, thanks a lot for raising this. As mentioned, it had confused me as well, but I didn't discuss this problem in the review because a) I wasn't sure about it and b) this review and the rest of this issue already felt grumpy enough ;-) Overall, writing this roundup has been a bit of a disconcerting experience - the quantitative impact of the AI boom on Wikipedia is one of the most important research questions about Wikipedia in recent memory, and sure, it is not easy to tackle, but the quality of what has been published so far is not great. (I still think the "Wake of" paper might be the one with the most solid approach among those mentioned in this issue.)
- Regards, HaeB (talk) 08:25, 18 May 2025 (UTC) (Tilman)
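For concreteness, here is a minimal sketch of the kind of pre-trend check discussed above, using the raw outcome rather than the paper's regression residuals. Everything here is hypothetical: the file name, the column names (article_id, views, similar, month) and the simple specification are illustrative assumptions, not the paper's actual data or model.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per article-month, with columns
#   article_id - article identifier
#   views      - outcome (e.g. monthly page views)
#   similar    - 1 if the article is "similar" to ChatGPT output, else 0
#   month      - integer month index relative to the ChatGPT launch (0 = launch month)
df = pd.read_csv("article_month_panel.csv")  # placeholder file name

# Pre-trend check: restrict to pre-launch months and test whether the two
# groups were already trending apart. A clearly non-zero similar:month
# coefficient is evidence against the parallel trends assumption.
pre = df[df["month"] < 0]
pre_trend = smf.ols("views ~ similar * month", data=pre).fit(
    cov_type="cluster", cov_kwds={"groups": pre["article_id"]}
)
print(pre_trend.summary().tables[1])  # inspect the similar:month row

# Quick visual check on the raw outcome (Figure 2 in the paper instead
# plots averaged residuals from its comparative time series regression).
print(df.groupby(["month", "similar"])["views"].mean().unstack())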
But perhaps in this residuals view one is supposed to check the parallel trends assumption by e.g. visually inspecting whether the error bars cover the mean mean residual (dashed lines) pre-launch, which they do actually do fairly well in Figure 2?
- Hey! Thanks for responding in such thoughtful depth. I think Figure 2 is exactly what you'd want to look at for evaluating parallel trends for their DiD models. The DiD model is exactly the "comparative time series model" plus the terms needed to statistically test for post-intervention differences in treatment effects. This part of the study seems well executed, as does the study overall; my concerns are entirely about "interpretation". What is a bit surprising or unclear is why they felt the need to adjust for length and creation month at all. The linear trend for time in these two models means that both lines (similar and dissimilar) in Figure 2 should be flat if there were no differences between similar and dissimilar articles. So we'd expect flat lines pre-cutoff and slopes post-cutoff; however, I see slopes pre-cutoff for both edits and views. Edit: the lines pre-cutoff could be non-flat yet still trend in the same direction if there were higher-order temporal trends (i.e., acceleration in editing or views).
- That said, I don't think it is appropriate to consider the standard errors in Figure 2 when evaluating the parallel trends assumption, since that isn't a theoretically justified or interpretable test of the assumption. One reason is that the parallel trends assumption is about the "counterfactual" of what would have happened in the absence of treatment. This fundamentally isn't possible to test. Looking at pre-cutoff patterns and extrapolating is one approach, and people also tend to squabble over the substance of the phenomena to decide how much weight to give a DiD estimate. My understanding of DiD is that it is particularly prone to mislead when pre-cutoff trends are opposite, which appears to be the case in this data (a toy simulation after this comment illustrates the point).
shouldn't that be put in relation to the size of the standard errors? Other observations derived from ocular inspection like "The decline in views for similar articles begins prior to the change" should probably take the error bars into account, too.
- Even if we do try to use SEs to tell whether the trends are statistically significant or just random fluctuations, it's hard to dismiss the pre-cutoff trends. For edits, it looks like an increase of almost 1 SE for dissimilar and a decrease of about 0.5 SE for similar, and something happens in the three points before the cutoff that could be either a slight opposite trend or a random fluctuation.
- Overall, I wouldn't claim that the assumption is surely "invalid", but it is far from bullet proof.
- I do agree that statements interpreting the magnitudes of these changes make more sense in the context of the variance in the outcomes. By that standard, it might be difficult to claim any of the changes as "big". But comparing the far left and far right of Figure 2, the change for dissimilar articles looks to be nearly 2 SE. For edits, that change might be about 1 SE. Groceryheist (talk) 17:05, 18 May 2025 (UTC)
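To illustrate the point about opposite pre-cutoff trends, here is a small self-contained simulation (entirely synthetic data, nothing from the paper) in which there is no treatment effect at all, yet a naive two-period DiD reports a large negative "effect" simply because the two groups were already trending in opposite directions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic panel: 24 months, "launch" at month 12, and NO true treatment
# effect. The "similar" group drifts down and the "dissimilar" group drifts
# up at a constant rate, both before and after the cutoff.
rows = []
for group, slope in [("similar", -1.0), ("dissimilar", 1.0)]:
    for month in range(24):
        for _ in range(50):  # 50 synthetic articles per group per month
            rows.append({
                "group": group,
                "month": month,
                "views": 100 + slope * month + rng.normal(0, 5),
            })
df = pd.DataFrame(rows)
df["treated"] = (df["group"] == "similar").astype(int)
df["post"] = (df["month"] >= 12).astype(int)

# Naive two-period DiD: the treated:post coefficient comes out around -24
# despite there being no treatment, purely from the diverging pre-trends.
did = smf.ols("views ~ treated * post", data=df).fit()
print(did.params["treated:post"])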
and has a stronger brand among its users for verifiable and factual information
- maybe so (I think WMF made such a claim based on survey data from its 2023 ChatGPT plugin experiment). But "among its users" does a lot of work here. Contrast these two studies we previously reviewed here:
- Regards, HaeB (talk) 06:22, 18 May 2025 (UTC)
- Blind tests inherently factor out any effect of Wikipedia's brand on credibility. Groceryheist (talk) 17:07, 18 May 2025 (UTC)
- But yeah. These lab-based studies do suggest that some part of the potential audience doesn't ascribe all that much prestige to Wikipedia, at least compared to the chatbots. Groceryheist (talk) 19:07, 18 May 2025 (UTC)