Wikipedia:Wikipedia Signpost/2025-06-24/Recent research
Wikipedia's political bias; "Ethical" LLMs accede to copyright owners' demands but ignore those of Wikipedians
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Wikipedia's political bias: Halfway to liberal

"Negative: liberal; positive: conservative" (from the paper)
- Reviewed by Clayoquot
Is Wikipedia "Wokepedia" as some have claimed? A 2024 paper[1] by Puyu Yang and Giovanni Colavizza sheds some light on the question. It adds to a corpus of research on ideological bias on Wikipedia; some previous studies have found leftist bias and one study found a center-right bias. The authors of the present study (whose previous work includes several papers on Wikipedia citations) had already reported on it in a 2022 preprint (see our earlier review), but it has since been published in the peer-reviewed journal Online Information Review, with some changes including an updated abstract.
The paper looks at the English Wikipedia's citations to news sources and associates each source with a score corresponding to its political bias. The bias scores come from a dataset called Media Bias Monitor (MBM), described in this 2018 paper.
The MBM dataset is based on the propensity of Facebook users to share links to particular sources. For instance, it presumes that if a source is shared more by self-identified liberals than by self-identified conservatives, the source has a liberal bias.
Yang and Colavizza find that on a scale ranging from −2 (very liberal) to +2 (very conservative), the average Wikipedia news citation has a score of −0.5, halfway between "moderate" (0) and "liberal" (−1).
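The aggregation step can be illustrated with a minimal sketch: each cited news source carries an MBM bias score on the −2 to +2 scale, and the corpus-level figure is the citation-weighted average of those scores. The source names and citation counts below are hypothetical, not data from the paper.

```python
# Illustrative sketch (not the paper's actual code or data): computing a
# citation-weighted average political-bias score on the MBM scale,
# where -2 = very liberal and +2 = very conservative.
bias_scores = {"source_a": -1.5, "source_b": -0.5, "source_c": 0.0, "source_d": 1.0}
citation_counts = {"source_a": 120, "source_b": 300, "source_c": 250, "source_d": 80}

total_citations = sum(citation_counts.values())
weighted_avg = sum(
    bias_scores[s] * citation_counts[s] for s in bias_scores
) / total_citations
print(round(weighted_avg, 3))  # a negative value indicates a liberal-leaning mix
```

With these made-up numbers the heavily cited liberal-leaning sources pull the average below zero, mirroring the shape (though not the magnitude) of the paper's −0.5 result.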
Could editors be preferring liberal news sources because they are more factually accurate? The paper anticipates this question. Through further analysis using ratings of factual reliability from Media Bias/Fact Check, Yang and Colavizza conclude that the favouring of liberal sources "persists when accounting for the factual reliability of the news media."
The authors say their findings "can be attributed to several factors, including the political leanings of Wikipedia contributors, the prominence and accessibility of liberal-oriented news sources, and potential methodological biases in gauging political polarization." With regard to the last two factors, The Guardian, which makes up more than half of Wikipedia's "very liberal" citations, owes some of its popularity to its open access. Its classification as "very liberal" is debatable, as other sources have described it as closer to the centre.
See also our earlier reviews of related research, in particular:
- "Report by conservative think-tank presents ample quantitative evidence for 'mild to moderate' 'left-leaning bias' on Wikipedia" (2024)
- "'How Wikipedia Became the Last Good Place on the Internet' – by reinterpreting NPOV and driving out pro-fringe editors" (2023), in particular the "Fringe and politics" section
- "Language analysis finds Wikipedia's political bias moving from left to right" (2012)
New "Ethical" LLMs accede to the demands of copyright owners – but not necessarily those of Wikipedians
- Reviewed by Tilman Bayer
Several years into the AI boom, controversies rage on about whether and to what extent the training of models on copyrighted material is covered by fair use or instead requires permission from copyright owners. Numerous lawsuits about the matter are still making their way through the courts in the US and elsewhere. And although various US judges have already dismissed many overwrought claims by copyright industry plaintiffs, independent legal scholars still consider it possible that some others may succeed. Separately, legislative changes have been proposed in several countries to either tighten or loosen requirements around AI training.
Two groups recently waded into these debates by releasing datasets for LLM training that are based only on text that is either in the public domain or "permissively licensed" (such as from Wikipedia), with the insinuation that they are free of such concerns. Notably, both have also already trained their own LLMs on these datasets, which they claim achieve competitive performance.
French startup PleIAs earlier this month published a preprint titled "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training"[2]. From the abstract:
Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with the data security regulations. In this paper, we introduce Common Corpus [https://huggingface.co/datasets/PleIAs/common_corpus], the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages [...]
This paper (presented as a "technical report") follows several earlier announcements for various versions of the same corpus, which had attracted media coverage as early as April 2024 ("This French start-up just proved OpenAI wrong. It claims you can train AI on non-copyrighted data").
Three days later, on June 5, a group of 28 authors (from EleutherAI, Hugging Face, the University of Toronto, and other North American universities, among others) announced "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text"[3]. From the abstract:
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1 [https://huggingface.co/common-pile], an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more.
The datasets, and how Wikimedia projects are represented in them
As is well known, Wikipedia text has long been a staple of LLM training – either via the dumps provided by the Wikimedia Foundation itself, or (perhaps more frequently) as part of larger datasets such as that of Common Crawl or "The Pile", which cover a large number of websites regardless of their copyright licenses. Common Crawl, a US-based nonprofit, has been offering these to the public since about 2008, relying on fair use. Similarly, "The Pile" was compiled by the nonprofit EleutherAI (also one of the driving forces behind the new "Common Pile" dataset) and distributed under fair use. But as the provision of datasets forms a significant early step in what one extensive legal analysis calls "the generative-AI supply chain", such invocations of fair use have come under increased scrutiny more recently. This evidently contributes to both groups' motivation for providing datasets that can be hosted everywhere without such legal considerations, by confining themselves to public domain and "permissively licensed" material – such as Wikipedia.

While Wikipedia and its sister projects form only a relatively small part of both datasets (see below for more detail), they are interesting from a Wikimedian perspective for several reasons. Firstly, they make much more content from Wikimedia projects available than what is commonly used in LLM training (which has often been confined to mainspace content from English Wikipedia, at least according to what is publicly known).
The Common Pile group
downloaded the official database dumps from March 2025 of the English-language wikis that are directly managed by the Wikimedia foundation [...]. These database dumps include the wikitext — Mediawiki's custom markup language — for each page as well as talk pages, where editors discuss changes made to a page. [...]
The Common Pile includes the following Wikimedia wikis: Wikipedia, Wikinews, Wikibooks, Wikiquote, Wikisource, Wikiversity, Wikivoyage, and Wiktionary.
(not, however, Wikidata)
PleIAs/Common Corpus on the other hand only draws from two of these eight Wikimedia projects, namely Wikipedia and Wikisource. But it includes several languages, and uses the newer HTML dumps instead of wikitext dumps:
Wikimedia projects have always been major sources for language model training due to their reliability, extensive coverage, and textbook-like style. Despite this centrality, there is still a range of unresolved challenges with the most common versions available for training. The raw source of Wikimedia projects is made available in a specific mediawiki syntax, including a lot of project-specific models, tags, and conventions. The parsing of models is especially not straightforward, as they can either format existing text or remove or include external content (transclusion). As part of Wikimedia Enterprise, the Wikimedia Foundation created entirely new dumps from the rendered HTML sources, which in effect ensure that they include all the text made available to readers.
Here, "project-specific models" apparently means templates, a false-friend mistranslation of the French modèle (Pierre-Carl Langlais, the vocal co-founder of PleIAs, is himself a longtime admin on French Wikipedia as User:Alexander Doria). The Common Corpus paper leaves open how much of an issue this parsing of templates is in practice, or how much of an improvement the use of HTML dumps yields for the purpose of LLM training. In contrast, the Common Pile group simply converted wikitext to plain text using wtf_wikipedia.
"Wikimedia Enterprise" refers to the API products of the Wikimedia Foundation's for-profit subsidiary Wikimedia LLC. The Wikimedia Enterprise HTML dumps had been available for public download since 2021 on the same dumps.wikimedia.org site as the regular wikitext dumps, but were recently taken down there and now require signing up for a free Wikimedia Enterprise account.
Apparently separately, in February 2025 Wikimedia Enterprise announced a partnership with PleIAs under which "Pleias has leveraged Wikimedia Enterprise's structured datasets to develop verifiable language models, multilingual content enriched with metadata and credibility signals like RevertRisk, pre-parsed infoboxes, sections, and summaries". This presumably refers to the separate Structured Contents snapshots – see also last month's Signpost coverage: "New version of AI-optimized Wikipedia dataset released on Kaggle". The current paper does not yet mention this data. Wikimedia Enterprise keeps the list of its paying customers confidential (apart from a few exceptions like Google), so it is not clear whether PleIAs is among them or whether it is using this Wikimedia data without financial compensation.
The "Impact" section of the Common Corpus paper lists various third-party uses of the dataset, and last month Langlais posted "Happy to see Common Corpus has grown to become one of the most popular pretraining dataset on @huggingface". (At the time of writing, Hugging Face listed it as the 13th most downloaded text dataset.)
The first(?) Wikidata-derived text dataset for LLM training
Of particular interest to Wikimedians is the fact that besides using Wikipedia and Wikisource, PleIAs/Common Corpus appears to be the first to make Wikidata available as a source for pretraining of language models:
"Semantic data is the latest set added to Common Corpus and currently includes only one collection: Wikidata. [...] Despite the rising interest in mixed LLM/knowledge graph methods, Wikidata has hardly been used in language models. [...] A persistent challenge has been the exclusive availability of Wikidata dumps under formats optimized for data exchange rather than language model training. Thanks to a collaboration with Wikimedia Deutschland, the entire set of Wikidata has been adapted in natural language and added to Common Corpus. This is to date the only available textual collection of Wikidata covering the entire range of 300 languages. Data processing involved the translation of items and properties into formal language sequences as simple natural language sequences, without textual synthesis: "Q41309 | P:27 | Q171150" becoming "Franz Liszt country of citizenship Kingdom of Hungary". Within each entry, we provide all the available translations as consecutive blocks separated by a newline, anticipating that this may contribute to language alignment."
(This might be regarded as a rudimentary mini version of Abstract Wikipedia.)
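The verbalization step the paper describes can be sketched in a few lines: map each Q- and P-identifier in a statement to its label and concatenate the labels as plain text, with no textual synthesis. The label dictionary below is a hypothetical stand-in for Wikidata's full multilingual label data.

```python
# Minimal sketch of the verbalization approach quoted above: translate a
# Wikidata statement's identifiers into labels and join them as a plain
# natural-language sequence, without any rewriting.
labels = {
    "Q41309": "Franz Liszt",
    "P27": "country of citizenship",
    "Q171150": "Kingdom of Hungary",
}

def verbalize(statement: str) -> str:
    """Turn 'Q41309 | P:27 | Q171150' into a label sequence."""
    # Normalize the 'P:27' property notation to the plain 'P27' identifier.
    parts = [p.strip().replace("P:", "P") for p in statement.split("|")]
    return " ".join(labels[p] for p in parts)

print(verbalize("Q41309 | P:27 | Q171150"))
# -> Franz Liszt country of citizenship Kingdom of Hungary
```

Per the quoted description, the dataset then stacks all available language versions of each such sequence as consecutive blocks, in the hope of aiding cross-language alignment.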
License due diligence
The Common Pile authors emphasize that they "put a lot of work into our metadata" – specifically, vetting the license information that accompanies each piece of content in their dataset. A section in the paper covers "License due diligence", for example pointing out "the common pitfall [of] 'license laundering,' where a copyrighted work is redistributed (typically by a non-rights holder) with an incorrect license" (a problem that is well known to Wikimedians, too).
The PleIAs/Common Corpus paper, on the other hand, seems less concerned with such problems (it instead devotes more space to data issues that might affect regulatory compliance in the EU, such as toxicity detection and PII removal). Correspondingly, the Common Corpus data seems rather cavalier about license information, containing thousands of rows whose license field provides only vague information like "Various open licenses", or (including for Wikipedia content) specifies the license as "CC-By-SA" without a version number. (Such issues were pointed out by this reviewer more than half a year ago, and acknowledged at the time with a "Yes it's planned", but do not seem to have been addressed yet.) Similarly, Table 3 in the Common Corpus paper provides token counts by "license type", but confusingly lists "CC-By" and "CC-BY-4.0" as separate types.
Assuming that the provided license and attribution information is correct, though (or will be fixed), few Wikipedians are likely to object to their content being included in these datasets. In the big picture, these datasets join the numerous mirrors and forks that are intentionally enabled by the projects' free licenses and have existed almost since Wikipedia's founding.
The "ethical" LLMs that both projects train on their respective datasets present more complex questions, though.
The "ethical" models and their performance
The Common Pile authors ask "Can you train a performant language model using only openly licensed text?" and report a positive answer:
Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B.
However, "competitive" has to be taken with a big grain of salt, or at least a thorough understanding of the qualifiers. For example, the Llama 2 model chosen for comparison dates from 2023, eons ago in GenAI terms, and even back then it was the smallest and least capable LLM of Meta's Llama model family. AI expert Simon Willison called the Comma models "promising", but he also seemed a bit underwhelmed by their performance in a quick test, pointing out among other limitations that "right now [...] it's a raw base model—it hasn't been instruction-tuned or set up for chat" (unlike e.g. ChatGPT or most of the open-weight LLMs that have seen wider usage in recent times), "which makes it a lot harder to evaluate". (On the other hand, as indicated by the "v0.1", the group still expects to be able to train bigger and better models on freely licensed data in the future.)
PleIAs' paper on the other hand provides less information about the models that the startup has already trained on its Common Corpus dataset. Instead, the company had described them briefly in a December 2024 blog post:
Today we release Pleias 1.0 models, a family of fully open small language models. Pleias 1.0 models include three base models: 350M, 1.2B, and 3B parameters. They feature two specialized models for knowledge retrieval with unprecedented performance for their size on multilingual Retrieval-Augmented Generation, Pleias-Pico (350M parameters) and Pleias-Nano (1.2B parameters).
These represent the first ever models trained exclusively on open data, meaning data that are either non-copyrighted or are published under a permissible license. These are the first fully EU AI Act compliant models. In fact, Pleias sets a new standard for safety and openness.
As with the Common Pile paper, these "unprecedented performance" claims should be appropriately contextualized. While the Common Pile authors at least evaluated their LLMs with a number of standard, widely used third-party benchmarks, the PleIAs authors relate that this was not possible in their case because "[t]he most popular generalist benchmarks are not suitable for evaluating small models." Therefore, they had to grade themselves: "Instead, we develop targeted benchmarks to evaluate key capabilities that are essential to our desired downstream application", namely RAG (retrieval-augmented generation), the models' ability to refrain from switching languages while generating text in a variety of EU languages, and their avoidance of generating toxic or harmful text (in order to comply with EU regulations). PleIAs later also released some derived models specifically dedicated to the RAG use case.
The small size and limited performance of both groups' models seem entirely understandable given their presumably limited compute budgets. (Also, PleIAs' December 2024 announcement describes them as "small language models", a term that some have tried to establish recently for large language models with only up to a few billion parameters.)
Still, one can't help wondering how much this might limit their value as evidence for bold claims that fair use (i.e. training on copyrighted text) is unnecessary for producing useful and competitive LLMs. As mentioned, PleIAs' largest model has 3 billion parameters and the Common Pile group's Comma models have 7 billion, whereas some publicly known LLMs have already surpassed 1 trillion (1000 billion) parameters. While smaller models have many use cases, this is one of several reasons to doubt whether Common Corpus or Common Pile could form the basis of a model that performs as well as e.g. the current versions of ChatGPT or Claude.
Is Wikipedia "central" for modern AI, or merely the fifth most important freely licensed source for LLM training?
Somewhat in contrast to claims about the centrality of Wikipedia for contemporary AI, content from Wikimedia projects forms only a minority of the tokens in each of the two datasets, as can be seen in the above figure in the case of Common Pile. (This remains true if one disregards the public domain parts and focuses on the parts that are copyrighted but under a free license. In fact, in the Common Pile dataset, even non-Wikimedia wikis make up a slightly larger share than Wikimedia wikis.)
But another aspect of the Common Pile paper of specific interest to Wikimedians is that the authors conduct an evaluation of the quality and relative importance of the different sources (in terms of what they contribute to LLM performance):
"Recent work [...] has shown that up- or down-weighting pre-training data sources in accordance with some notion of data quality can produce more performant models. Indeed, the sources in the Common Pile vary drastically in their characteristics, and we don't necessarily expect that our largest sources contain the highest quality text. For example, patent text sourced from the USPTO (our second-largest source) exhibits substantially different wording, terminology, and repetition than typical natural language. [...] To determine mixing weights, we first trained per-source language models [....] for 28 billion tokens on all sources that were sufficiently large to be repeated less than four times at this data budget. Based on the performance of these per-source models, we heuristically set mixing weights to up- and down-weight high- and low-performance sources respectively [...]
In the filtered dataset used for this training, Wikimedia wikis contributed 57.4 GB out of a total size of 1838.3 GB, i.e. about 3.1%. (On a side note, Table 5 indicates that the Wikimedia projects were among the sources that did not require filtering for "toxicity", as opposed to e.g. "Ubuntu IRC", "Pre-1929 Books", or the non-Wikimedia wikis scraped by Wikiteam.)
Per "Table 7: Overview of the data mixing used to up/down-weight individual sources", Wikimedia wikis were assigned the highest weight (6 repetitions). However, they shared that honor with 18 other sources (i.e. the majority), such as arXiv, Foodista, LibreTexts, peS2o (a corpus of open access scientific publications), StackExchange and Ubuntu IRC. The downweighted sources with the lowest number of repeats (0.25) were Biodiversity Heritage Library, Library of Congress, USGPO and USPTO.
While this weight number can be regarded as a rough proxy for a source's quality from the perspective of LLM training, its resulting share of tokens in the overall training run corresponds to its importance. Here Wikimedia wikis ended up at 8.616%, behind peS2o (27.409%), StackExchange (13.469%), Stack V2 (13.009%) and CC Common Crawl (8.716%). ("Stack V2" contains open source software code, and the "CC Common Crawl" slice consists of freely licensed web pages from the internet-wide Common Crawl; presumably without wikis, as the authors performed "deduplication across all sources".)
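The relationship between a source's size, its repeat factor, and its resulting share of the training mix can be sketched as follows: each source contributes tokens in proportion to (source size × repeat factor). The source names, sizes, and repeat factors below are hypothetical illustrations, not the paper's actual values.

```python
# Illustrative sketch of how per-source mixing weights (repeat factors)
# translate into token shares of the final training mix. All numbers
# here are made up for illustration; see Table 7 of the Common Pile
# paper for the real weights.
sources = {
    # name: (size_in_tokens, repeat_factor)
    "wikimedia_wikis": (60e9, 6.0),     # small but upweighted (repeated 6x)
    "scientific_papers": (500e9, 1.0),  # large, used once
    "stackexchange": (90e9, 4.0),
    "patents": (400e9, 0.25),           # large but downweighted (subsampled)
}

total = sum(size * rep for size, rep in sources.values())
shares = {name: size * rep / total for name, (size, rep) in sources.items()}
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

This illustrates the pattern described above: a high-quality source that is small in absolute terms can still end up with a modest share of the mix, while a large, downweighted source (like the patent corpus) shrinks considerably.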
In other words, there is now objective evidence justifying the statement that Wikimedia wikis form the fifth most important freely licensed source for LLM training. This is somewhat in contrast to e.g. claims about Wikipedia's "central role in the development of modern AI", to quote the framing chosen in the keynote of last year's Wiki Workshop conference by Microsoft's Brent Hecht (known to Wikimedians for various influential research publications during his time in academia). According to Hecht, there is a "[s]trong argument [that Wikipedia] was the single most important dataset for AI research since about 2005".
One might be able to reconcile these different perspectives by recognizing the possibility – as Hecht did in the Q&A for his keynote (in response to a question by this reviewer) – that Wikipedia's early popularity as an AI dataset might have been caused by its convenience rather than – or in addition to – the unique qualities of its content. (The Wikimedia Foundation had been providing its easily downloadable dumps since at least 2003.) What's more, some other claims about Wikipedia's supposed centrality for AI simply stretch the facts. For example, in a 2023 article titled "Wikipedia's value in the age of generative AI", the Wikimedia Foundation's Chief Product and Technology Officer Selena Deckelmann proudly proclaimed that "To date, every LLM is trained on Wikipedia content, and it is almost always the largest source of training data in their data sets". However, the study cited in that sentence ranked Wikipedia at #2, behind a corpus of patents, and at a mere 0.19% (less than 1 in 500) of tokens in the analyzed dataset.

As another finding of interest to the free-culture movement as a whole, the Common Pile paper provides a chart depicting "Growth rates of openly licensed data", observing that
[...] approximately half of the Common Pile (around 3.8TB) was created since 2020. This trend provides insight into the growing availability of openly licensed data and suggests a promising trajectory for future LLMs trained entirely on openly licensed sources.
While the subcategory of "Wikis" appears to be one of the slower growing ones in that chart, it still has increased about tenfold in size from 2010 to 2024.
"Ethical" apparently means uncritical deference to the business interests of the copyright industry...
Coming back to the AI copyright disputes mentioned at the beginning: It seems very clear that both projects are driven by a motivation to provide legal and political ammunition to the copyright industries' side in those debates, i.e. against fair use defenses for the purpose of training LLMs. This is especially evident in PleIAs' communications. For example, the company titled its aforementioned December 2024 announcement "They Said It Couldn't Be Done", with a link in the first sentence making it clear that this title is meant as a debunking of statements about the necessity of fair use in AI training made in response to an inquiry of the UK House of Lords:
"Training large language models required copyrighted data until it did not. Today we release Pleias 1.0 models, a family of fully open small language models."
(As discussed above, PleIAs offers only very thin evidence for its bold claim that "copyrighted data" is no longer required, at least if that is understood to refer to the kind of state-of-the-art LLMs that hundreds of millions use today in ChatGPT and its competitors.)
The Common Pile paper was likewise interpreted as weighing in on the copyright industries' side of the debate, by the Washington Post: "AI firms say they can't respect copyright. These researchers tried. [...] That could have implications for the policy debate swirling around AI and copyright" (despite objections by one of the paper's authors that "it is not a tech policy writeup. It's a machine learning research paper").
Both papers also uncritically embrace copyright-maximalist interpretations of "ethical" as being incompatible with fair use, i.e. as requiring that IP owners be given total control over whether their content is used in training LLMs. PleIAs' "Common Corpus" paper already does so in its title ("The Largest Collection of Ethical Data for LLM Pre-Training"). The "Common Pile" authors begin their abstract by relating training on "unlicensed text" directly to "intellectual property infringement and ethical concerns", again without acknowledging the existence of fair use defenses. (In fact, the term "fair use" is not mentioned anywhere in the text of the paper; it appears only once in the "References" section, in the title of a publication cited for other reasons.) And several other statements in their paper implicitly denigrate fair use as unethical, similar to PleIAs.
What's more, both papers also adopt the rhetorical strategy of copyright industry advocates of focusing on individual "content creators", rather than the corporations that are in reality the largest copyright owners, say Elsevier, Bertelsmann (owner of the Brockhaus encyclopedia, whose revenue was greatly diminished by Wikipedia), Murdoch, or Getty Images. Or a company like Adobe, whose Adobe Firefly image generator is an interesting related example of an AI model advertised as "ethical" due to having been trained solely on CC-licensed images from Wikimedia Commons and Flickr Commons, public domain material, and the hundreds of millions of images and videos users uploaded to the company's own Adobe Stock. It may illustrate how some individual creators' "no training without consent" demand can turn out to be a monkey's paw wish.
Neither the PleIAs paper nor the Common Pile paper attempts to offer policy arguments to justify their use of such loaded language and promotion of anti-fair use viewpoints. The authors never acknowledge the possible harms of elevating the interests of copyright owners over fair-use protections in AI – something that e.g. the Internet Archive's Brewster Kahle warned against last month in a post titled "Protect Fair Use, Especially Now", joined by the likes of the Electronic Frontier Foundation [1]. It is also worth noting that such advocacy for copyright industry viewpoints seems unnecessary for both papers' purposes: Instead of denigrating fair use as unethical, the authors could have confined themselves to framing their efforts in terms of minimizing legal risks for particular jurisdictions and situations. Indeed, two other recent papers adopt exactly this approach (see below).
Lastly, a commonly voiced concern regarding AI – that copyrighted works are being used to enrich big tech companies like OpenAI, Google or Microsoft – does not really apply here, considering e.g. that both groups are themselves releasing open-weight models under a free (software) license, and calling to mind the longstanding work of nonprofits like AI2 or indeed EleutherAI (of the Common Pile group) itself on truly open LLMs that are not controlled by Big Tech. More generally, fears about enclosure, i.e. AI companies like OpenAI absorbing and monopolizing knowledge from published sources, have been greatly mitigated in recent times by the rise of competitive open-weight LLMs such as those released by DeepSeek or Meta. In fact, PleIAs' and the Common Pile group's efforts to build AI that must not learn from any unfree source (no matter what the actual legal constraints might be) remind this reviewer of the perennially rejected proposal that information on Wikipedia should only be drawn from open access sources and must avoid paywalled ones (WP:FUTON).
What's more, there is a strong argument that the anti-fair-use advocacy of PleIAs and the Common Pile group (such as the former's evident attempt to influence an ongoing legislative debate in the UK) could in fact facilitate such enclosures by the likes of OpenAI. This kind of danger was pointed out by the Wikimedia Foundation in its 2023 response to a consultation by the US Copyright Office (Signpost coverage: "AI policy positions of the Wikimedia Foundation"), even while also voicing concerns about unattributed usage of Wikipedia by AI developers:
[...] we encourage the Office to consider the potential impacts that changes to copyright law could have on competition among AI developers. If copyright law changes are enacted such that the acquisition and use of training materials becomes more expensive or difficult, there is a risk that dominant firms with greater resources will become further entrenched while smaller companies, including nonprofit organizations, struggle to keep up with mounting development costs.
In the case of PleIAs, one can't help wondering whether such attitudes are correlated with the project's funding. The Common Corpus paper states that
It was built up with the support and concerted efforts of AI Alliance, the state start-up LANGU:IA (start-up d'Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
The French state (and its Ministry of Culture in particular) is not exactly known for its free culture advocacy. E.g. a decade ago, one report dryly noted that "[t]he rights of French authors and artists have always been very well protected", in order to explain the country's fierce resistance against Felix Reda's EU copyright reform proposals (which had been welcomed by many European Wikimedians; see Signpost coverage at the time, which mentioned "a mighty backlash, led in particular by French MEPs and the French government").
...but not to Wikipedians' concerns about AI training
For all the energy and enthusiasm they devote to acceding to the rhetoric and demands of copyright industry advocates who see AI as threatening their private business interests (no scraping or training without "consent", denigrating fair use as unethical, etc.), both groups exhibit clear disregard for the much more modest requirements of Wikipedians and other copyright owners who choose to release their works under a free license for the common good.
For example, both PleIAs and the Common Pile group released their aforementioned "ethical" LLMs under an Apache License. While this license maintains attribution requirements, it does not contain a copyleft (share-alike) requirement – unlike Wikipedia's CC BY-SA 4.0 license.
In the case of Common Pile, a user raised that issue on the project's GitHub repository:
[...] the Comma models are Apache licensed. However the training data is at least in part CC-BY-SA. Since Apache is not ShareAlike (or compatible with it), could this be an issue?
There's a good case to be made that the models are derivative works, since you can probably extract a substantial amount of the data from them. Should the models be CC-BY-SA licensed instead?
More generally, if I train a model on all the data, is there a license under which I can safely distribute it, or are there multiple incompatible sharealike-style licenses combined?
At the time of writing, these questions have remained without response for almost two weeks.
Similarly, Michael Weinberg from the New York University School of Law called the Common Pile project
[...] cool, but nowhere do they talk about doing anything to . . . actually comply with the terms of the open licenses? That's a part of what makes this a hard question!
Open source lawyer Luis Villa (former Deputy General Counsel of the Wikimedia Foundation, who has more recently written about issues regarding open licenses and AI on his "Open(Ish) Machine Learning" blog) agreed, noting further that
All of the "we are doing training on permissively-licensed materials" sets have this problem, because permissive != no obligations.
PleIAs' cofounder Pierre-Carl Langlais had aggressively dismissed related concerns last year on social media, deriding
people being extremely protective of their content in open licenses
in particular regarding
inclusion of Wikipedia in the set which is released in the CC-By-SA, basically the gold standard of AI training for a decade and we had official support from Wikimedia enterprise. Kinda absurd
(Regardless of whether Langlais is correct in interpreting his lab's collaborations with Wikimedia LLC/Wikimedia Enterprise/WMF as an endorsement of his legal views, it is worth noting that individual contributors, not the Wikimedia Foundation, hold the copyright over Wikipedia's freely licensed content.)
Tellingly, the initial version of PleIAs' November 2024 announcement erroneously described its Common Corpus as a "copyright-free dataset" that avoided "copyrighted data" – that is, it assumed that "permissively licensed" material is no longer copyrighted, something that would be news to Wikipedians.
In stark contrast to PleIAs and Common Pile, two earlier papers which had similarly provided datasets enabling the training of LLMs without relying on fair use decided to exclude Wikipedia and other BY-SA licensed content precisely because of such concerns:
"The KL3M Data Project" presents "Copyright-Clean Training Resources for Large Language Models"[4]. In contrast to the Common Corpus and Common Pile papers, this paper by a small legal consultancy counts an actual legal scholar among its authors (Daniel Martin Katz, a professor of law at Chicago-Kent College of Law), alongside Michael and Jillian Bommarito, the firm's husband-and-wife CEO and Chief Risk Officer. They write:
2.3 Wikipedia: A Case Study in the Complexity of Compliance
Many foundational Internet resources are governed by complex licensing arrangements that are often overlooked by AI developers. As the most notable example, Wikipedia content is frequently included in LLM training datasets. However, Wikipedia and various other Wikimedia Foundation projects are governed by the Creative Commons Attribution-ShareAlike (CC BY-SA) license, which imposes important restrictions on the use of content.
[...]
In response to our direct legal inquiry regarding LLM training on Wikipedia content, the Wikimedia Foundation responded with their interpretation of these compliance requirements [ https://jillianbommarito.com/wikimedia-says-no-llm-training/ ]. Their response noted: "We are monitoring what many LLM companies do with Wikimedia data and generally to be upfront, many may not be compliant with the letter of the Creative Commons rules or the spirit of the licenses." When questioned about specific compliance mechanisms, they emphasized that downstream developers must "adhere to the 'attribution,' 'share-alike,' and other elements of the license."
Most critically for LLM developers, the Foundation explicitly rejected the simplified compliance approaches currently employed by virtually all AI companies: "Providing a general notice to customers would not be an adequate solution to compliance […] [T]he notice would need to be made to everyone the content is shared with, not just customers." This position directly contradicts the practices of commercial LLM developers who include Wikipedia content in their training data.
In the context of building or fine-tuning large language models, it is simple to provide a general attribution notice acknowledging input sources to a given dataset or model. However, specific attribution to the specific work or works that gave rise to a specific model output is a difficult and expensive, if not impossible, technical challenge.
While Wikimedia's interpretation of the CC BY-SA requirement is not the final word on this important legal question, we did not include this content given the risk that it could encumber downstream usage.

(Figure caption from the paper: the model is trained on a "low-risk" corpus (left) that excludes Wikipedia as "high-risk" "attribution-required" data. Wikipedia, along with other sources determined to have high legal risk, is instead relegated to a separate datastore (right) that the system accesses at inference time. This datastore "can be modified at any time, e.g., to respond to opt-out requests.")
Similarly, a 2023 preprint by authors from UW and UC Berkeley (several also affiliated with the Allen Institute for AI, a US nonprofit which, similarly to EleutherAI, works on open LLMs), titled "SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore"[5], describes training LLMs on an "Open License Corpus" that excludes "high-risk" "attribution required" data such as ... Wikipedia and other sources under Creative Commons licenses, for the purpose of "copyright risk mitigation". In the initial (preprint) version of their paper, the authors justified this by claiming
For example, if a journalist writes a new article that cites information from Wikipedia (a CC-BY source), then they must provide a form of citation, link, or attribution back to the original source. In the context of machine learning, it is not clear what an attribution would constitute.
The authors did not respond to a question by this reviewer about this rather adventurous legal claim that merely using or "citing" information from Wikipedia would trigger such a requirement. In the paper's later peer-reviewed version, this was changed to "if a journalist quotes an article from Wikipedia (a CC-BY source), then they must provide a form of citation, link, or attribution back to the original source" (our emphasis) – still an evidently inaccurate claim, for example in omitting the share-alike terms of Wikipedia's license.
To be sure, these concerns about legal risks from training LLMs on Wikipedia content may be overwrought in practice (and not just because of the naivete of the legal arguments in the SILO paper).
In an August 2023 article, Creative Commons' General Counsel Kat Walsh (known to Wikimedians as User:Mindspillage and as a former WMF Trustee, including during the decision to adopt CC BY-SA as the license for Wikipedia's text) addressed the question "Can you use CC licenses to restrict how people use copyrighted works in AI training?" by pointing out that
there are strong arguments that, in most cases, using copyrighted works to train generative AI models would be fair use in the United States, and such training can be protected by the text and data mining exception in the EU
(Last month, Creative Commons followed up by publishing "Understanding CC Licenses and AI Training: A Legal Primer", reiterating that "The short answer is: AI training is often permitted by copyright", while caveating in a more in-depth analysis that "The application of copyright law to AI training is complicated, it varies depending on the jurisdiction where the use is made, and litigation related to generative AI training remains ongoing.")
Similarly, Andrés Guadamuz (an intellectual property law scholar at the University of Sussex) has argued that "CC licences are fully compatible with AI training, and in fact allow it to take place without asking from permission from the licensor" – again citing fair use and (regarding the UK) fair dealing among other arguments. And in a 2024 policy paper commissioned by "Open Future", two scholars from the University of Amsterdam similarly concluded that "Share Alike/CopyLeft licenses are largely ineffective when materials licensed under them are used to train AI models".
And even though the Wikimedia Foundation has been calling out LLM developers' non-compliance with the "letter" and "spirit" of CC licenses in its responses to the KL3M authors (see above), it too has elsewhere acknowledged the possibility or even likelihood that training AI models is covered by fair use. For example in a March 2023 legal analysis:
[...] it is more likely than not if current precedent holds that training systems on copyrighted data will be covered by fair use in the United States, but there is significant uncertainty
or also in its aforementioned response to a consultation by the US Copyright Office later that year (Signpost coverage: "AI policy positions of the Wikimedia Foundation").
The hypocrisy of "ethical" anti-fair-use advocacy
So PleIAs and the Common Pile group might be at little legal risk for ignoring Wikipedia's license terms in the release of their "ethical" LLMs. But this is likely exactly because of the fair-use-type defenses that they denigrate as unethical, and actively work to undermine in the case of other copyrighted content.
To be transparent, this reviewer, like many in the free knowledge movement, doesn't mind his Wikipedia and Wikimedia Commons contributions being used for AI training (and would find it very unfortunate if PleIAs or the Common Pile group became the target of a lawsuit by Wikimedians).

That said, many Wikimedians feel differently and are less comfortable with their work being used in LLMs without restrictions. An extreme example is the aforementioned keynote at last year's Wiki Workshop, where Microsoft's Brent Hecht called on the Wikipedia community to be aware of its "legal leverage" and "data leverage", and even to consider a "data strike", using a labor rights framing. Again, this reviewer personally doesn't find this line of argument very convincing (for example because it ignores the interest of Wikipedia readers, or more generally that of society at large, in accessing knowledge without restrictions), and finds himself agreeing much more with e.g. the Internet Archive's and the EFF's aforementioned arguments in favor of fair use with regard to AI.
However, it's difficult to see PleIAs' and the Common Pile group's apparent confidence that they can ignore such concerns by Wikimedians (and other owners of permissively licensed but copyrighted content) as anything other than glaring hypocrisy, when contrasted with both groups' advocacy for absolute deference to the demands of (some) professional intellectual property owners. To use Hecht's terminology, both appear to be working under the assumption that Wikipedians have zero "leverage" apart from dataset provenance transparency requirements, and thus won't be able to interfere with LLM developers' use of the Common Corpus and Common Pile datasets. Or seen from another angle: PleIAs and the Common Pile group work to undermine the fair use defenses that other AI labs rely on when training their LLMs on non-freely licensed content. But both implicitly rely on these themselves when releasing their own "ethical" LLMs.
In the aforementioned remarks where PleIAs' Langlais had derided "people being extremely protective of their content in open licenses", he also expressed his puzzlement about
The counterintuitive thing while doing actually open ai: you can get more support from copyright collective/cultural industries than from long term open actors. The first are desperate enough at this point to see even a vaguely ethical alternative emerging.
It does not appear to occur to Langlais that many Wikimedians and other "open actors" who have worked long and hard to make knowledge freely accessible (so that people wouldn't have to pay said "copyright collective/cultural industries" for access) may not be very fond of seeing their work being used to promote legal changes that are likely to make knowledge less accessible, for the purpose of furthering the business interests of exactly these copyright industries – and to hand them major influence over the future of AI as an important new form of accessing knowledge.
The Wikimedia Foundation's Vision calls for "[i]magin[ing] a world in which every single human being can freely share in the sum of all knowledge", in future times like "20, 50, 100 years from today". While LLMs are clearly still imperfect at this point, AI continues to improve rapidly. And with ChatGPT recently surpassing Wikipedia in user numbers according to Similarweb, it is likely that in those future times, AI will be an important way in which human beings access knowledge. Despite their professed commitment to open LLMs, the anti-fair-use advocacy of PleIAs and the Common Pile group is likely to bring us closer to a world in which every human being can share in the sum of all knowledge only as long as enough revenue can be extracted from them to serve the business interests of the copyright industries.
Briefly
Wiki Workshop 2025, WMF Research Award
The annual Wiki Workshop (organized by the Wikimedia Foundation's research team with other collaborators) took place virtually last month, for the first time extended to two days. The 46 accepted extended abstracts are available online (non-archival), as are video recordings from the conference.
At the event, the Foundation's "Research Award of the Year" for "best paper" went to "Motivating Experts to Contribute to Digital Public Goods: A Personalized Field Experiment on Wikipedia"[6], a paper reporting on the results of an experiment conducted in 2015/16 that invited 3974 economists to improve Wikipedia articles (see also our 2017 review of a preprint about the same experiment: "ExpertIdeas: Incentivizing Domain Experts to Contribute to Wikipedia").
The award for best student paper went to "Low-Resourced Languages and Online Knowledge Repositories: A Need-Finding Study"[7], a CHI 2024 paper describing "(1) a thematic analysis of Wikipedia forum discussions and (2) a contextual inquiry study with 14 novice contributors [...] focused on three Ethiopian languages: Afan Oromo, Amharic, and Tigrinya".
Wikimedia Research Showcase
See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
- Compiled by Tilman Bayer
"WikiVideo: Article Generation from Multiple Videos"
From the abstract:[8]
"We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. [...] we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features."
See also an explanatory thread by one of the authors.
Using Wikidata to pre-train LLMs for improved multilingual question answering in underrepresented languages
From the abstract:[9]
"Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval."
Generating "Open Artificial Knowledge" synthetic data with LLMs, guided by Wikipedia categories
From the abstract:[10]
"[...] acquiring high-quality, diverse, and ethically sourced training data [for LLMs] remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the moment of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs [...], to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data scarcity and privacy in LLM training [...]."
"Synthetic Multimodal Question Generation" from Wikipedia
From the abstract:[11]
"[...] we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, large language model (LLM) and large multimodal model (LMM) to generate question and answer pairs directly from multimodal documents, with the questions conforming to specified styles and modalities. We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it [...]"
"EuroLLM: Multilingual Language Models for Europe" in all official European Union languages, trained on Wikipedia as "high quality data"
From the abstract:[12]
"The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date [...] Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation."
From the paper:
"To train the EuroLLM models, we collect and filter data from various sources for all supported languages. The data included in the final corpus can be divided into four categories: web data, parallel data, code / math data, and high-quality data
[...]
High-quality Data. Regarding higher quality data, we use the Wikipedia [sic] for all languages and the arXiv [...and two other sources] for English."
"Fake news, an internet troll, and a conspiracy theory about 'Wikipedia's Intentional Distortion of the History of the Holocaust'"
From the abstract:[13]
"In 2023, an essay alleged Wikipedia's 'intentional distortion' of the Holocaust. Subsequently, the Wikipedia community largely dismissed these claims during a formal investigation. The allegations repeated a narrative of a former Wikipedia volunteer banned from all Wikimedia projects for unethical behavior. While Wikipedia undoubtedly contains errors in its coverage of the Holocaust, there is no convincing evidence to prove that most of it is 'intentional', or that it can be attributed to the parties identified by the essay authors."
A response by one of the Wikipedia editors criticized in the 2023 article. See also our review at the time, and further coverage in this issue's "In the media", also about recent public statements by the two authors of the 2023 article.
References
- ^ Yang, Puyu; Colavizza, Giovanni (2024-01-18). "Polarization and reliability of news sources in Wikipedia". Online Information Review. 48 (5): 908–925. doi:10.1108/OIR-02-2023-0084. hdl:11585/953887. ISSN 1468-4527.
- ^ Langlais, Pierre-Carl; Hinostroza, Carlos Rosas; Nee, Mattia; Arnett, Catherine; Chizhov, Pavel; Jones, Eliot Krzystof; Girard, Irène; Mach, David; Stasenko, Anastasia; Yamshchikov, Ivan P. (2025-06-02). "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training". arXiv:2506.01732 [cs.CL].
- ^ Kandpal, Nikhil; Lester, Brian; Raffel, Colin; Majstorovic, Sebastian; Biderman, Stella; Abbasi, Baber; Soldaini, Luca; Shippole, Enrico; Cooper, A. Feder; Skowron, Aviya; Kirchenbauer, John; Longpre, Shayne; Sutawika, Lintang; Albalak, Alon; Xu, Zhenlin; Penedo, Guilherme; Allal, Loubna Ben; Bakouch, Elie; Pressman, John David; Fan, Honglu; Stander, Dashiell; Song, Guangyu; Gokaslan, Aaron; Goldstein, Tom; Bartoldson, Brian R.; Kailkhura, Bhavya; Murray, Tyler (2025-06-05). "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text". arXiv:2506.05209 [cs.CL].
- ^ Bommarito, Michael J., II; Bommarito, Jillian; Katz, Daniel Martin (2025-04-10). "The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models". arXiv:2504.07854 [cs.CL].
- ^ Min, Sewon; Gururangan, Suchin; Wallace, Eric; Shi, Weijia; Hajishirzi, Hannaneh; Smith, Noah A.; Zettlemoyer, Luke (2024-07-31). "SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore". arXiv:2308.04430 [cs.CL]. (also as "Spotlight Poster" at ICLR 2024)
- ^ Chen, Yan; Farzan, Rosta; Kraut, Robert; YeckehZaare, Iman; Zhang, Ark Fangzhou (May 2024). "Motivating Experts to Contribute to Digital Public Goods: A Personalized Field Experiment on Wikipedia". Management Science. 70 (5): 3264–3280. doi:10.1287/mnsc.2023.4852. ISSN 0025-1909.
- ^ Nigatu, Hellina Hailu; Canny, John; Chasins, Sarah E. (2024-05-11). "Low-Resourced Languages and Online Knowledge Repositories: A Need-Finding Study". Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI '24. New York, NY, USA: Association for Computing Machinery. pp. 1–21. arXiv:2405.16669. doi:10.1145/3613904.3642605. ISBN 9798400703300.
- ^ Martin, Alexander; Kriz, Reno; Walden, William Gantt; Sanders, Kate; Recknor, Hannah; Yang, Eugene; Ferraro, Francis; Durme, Benjamin Van (2025-04-01). "WikiVideo: Article Generation from Multiple Videos". arXiv:2504.00939 [cs.CV]. / Dataset, repo
- ^ Jiang, Fan; Drummond, Tom; Cohn, Trevor (2025-02-27). "Few-Shot Multilingual Open-Domain QA from 5 Examples". arXiv:2502.19722 [cs.CL].
- ^ Borisov, Vadim; Schreiber, Richard H. (2024-07-19). "Open Artificial Knowledge". arXiv:2407.14371 [cs.CL]. / Poster at ICML 2024, dataset and report
- ^ Wu, Ian; Jayanthi, Sravan; Viswanathan, Vijay; Rosenberg, Simon; Pakazad, Sina Khoshfetrat; Wu, Tongshuang; Neubig, Graham (November 2024). "Synthetic Multimodal Question Generation". In Yaser Al-Onaizan; Mohit Bansal; Yun-Nung Chen (eds.). Findings of the Association for Computational Linguistics: EMNLP 2024. Findings 2024. Miami, Florida, USA: Association for Computational Linguistics. pp. 12960–12993. doi:10.18653/v1/2024.findings-emnlp.759.
- ^ Martins, Pedro Henrique; Fernandes, Patrick; Alves, João; Guerreiro, Nuno M.; Rei, Ricardo; Alves, Duarte M.; Pombal, José; Farajian, Amin; Faysse, Manuel; Klimaszewski, Mateusz; Colombo, Pierre; Haddow, Barry; de Souza, José G. C.; Birch, Alexandra; Martins, André F. T. (2025-01-01). "EuroLLM: Multilingual Language Models for Europe". Procedia Computer Science. Proceedings of the Second EuroHPC user day. 255: 53–62. doi:10.1016/j.procs.2025.02.260. ISSN 1877-0509.
- ^ Konieczny, Piotr (2025). "Fake news, an internet troll, and a conspiracy theory about 'Wikipedia's Intentional Distortion of the History of the Holocaust'". Holocaust Studies: 1–39. doi:10.1080/17504902.2025.2511459. ISSN 1750-4902. / Author's copy
Discuss this story
Regarding the "ethical" AIs - are Chinese models represented here? --Piotr Konieczny aka Prokonsul Piotrus| reply here 03:27, 24 June 2025 (UTC)[reply]