An LLM training dataset made up of only open or public domain content?


One of the (many legitimate) major concerns I often hear about working with GenAI and LLMs revolves around the copyright status of the content used to train them, and whether it is legal or ethical to use LLMs that were trained on large swaths of copyright-protected content. It came up again in the chat during David Wiley’s webinar Why Open Education Will Become Generative AI Education (video) at the University of Regina.

Something I began wondering as I saw the copyright questions roll in was whether it would be possible to create a training corpus made up exclusively of openly licensed or public domain content. After a bit of digging this afternoon, I discovered that, yes, there is an LLM training dataset made up exclusively of openly licensed and public domain content: the (somewhat unfortunately named) Common Corpus. I say unfortunately named because the name Common Corpus is very similar to Common Crawl, which is a training dataset that consists of both open and copyrighted materials. The names are so close that when I tried to query ChatGPT about LLMs that are trained using only the Common Corpus dataset, it thought I meant Common Crawl, which led to this eyeroll chat.

Given that OpenAI has suggested that it is virtually impossible to create a chatbot without copyrighted material, the fact that their flagship product ChatGPT does not recognize the term “Common Corpus”, a dataset that does exactly what they say is impossible to do, feels slightly suss.

Why is this important, especially when legally it seems like the use of copyrighted data to train LLMs is likely fair use? Well, even if it is fair use, it still leaves a sour taste in the mouths of many in the open education community. In my opinion, it has been one of the (many legitimate) barriers to open educators experimenting with generative AI tools and technologies. But if there were a way to use an LLM that was trained on an openly licensed or public domain training dataset, that might make experimentation feel less like an exercise in violating someone’s copyright.

At any rate, it is good to know there are others thinking about this and coming up with more ethical ways to train LLMs than scraping the entire internet lock, stock and barrel. Now to find an LLM that actually uses this training dataset and see if it can be run locally.

Addendum: after publishing I came across this blog post from Mozilla about a convening this summer looking at exactly this issue.

Leading AI companies want us to believe that training performant LLMs without copyrighted material is impossible. We refuse to believe this. An emerging ecosystem of open LLM developers have created LLM training datasets —such as Common Corpus, YouTube-Commons, Fine Web, Dolma, Aya, Red Pajama and many more—that could provide blueprints for more transparent and responsible AI progress.

So there is work being done by others on this issue and hopefully there will be (is already?) a viable LLM trained using fully open data.
