LLMs and open (access, data, source)

In March I wrote an op-ed for the BCcampus blog about open education and generative AI, focused mostly on OERs, but I continue to think about the relationship between open (not just education, but all its incarnations – technology, pedagogy, data, etc.) and generative AI. In the past week I have found myself thinking more about generative AI and open, not so much in relation to OER or OEP, but in relation to knowledge creation and dissemination through open technologies and open access publishing.

First, Arthur Spirling wrote an op-ed in Nature this month about the need for the research community to support the development of open-source large language models that are independent of commercial interests. His reasoning is that research cannot be beholden to corporations for access to the resources researchers need.

From my perspective as a political and data scientist who is using and teaching about such models, scholars should be wary. The most widely touted LLMs are proprietary and closed: run by companies that do not disclose their underlying model for independent inspection or verification, so researchers and the public don’t know on which documents the model has been trained.

Why open-source generative AI models are an ethical way forward for science, Arthur Spirling, Nature, April 18, 2023

I don’t think it is a stretch to suggest that the “most widely touted” LLMs he refers to are OpenAI’s, a company that is far from open despite having the term in its name.

Arthur makes a solid case for why researchers should have control over the tools and technologies they use to generate knowledge. By extension, since much of our society’s research occurs in academic institutions, I read Arthur’s piece as a call for both individual researchers and academic institutions to become more deeply involved in developing open-source LLMs that are transparent, reproducible and, most importantly, controlled by a community and not a corporation.

I have also been thinking about the data used to train those LLMs – the corpus – and the role of open access and OA research. A very valid criticism of current generative AI models is that the information they produce is often unreliable, unverified and, in many cases, just plain wrong, likely as a result of a training corpus built primarily on the backs of unpublished books scraped from the internet. But if the corpus were instead built on academic research, openly available for use as training data and transparently reported as part of the training corpus, wouldn’t that help considerably with both the reliability and verification of generative AI output?
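As an aside, the “transparently reported” part could be as simple as publishing, alongside the model, a manifest of every openly licensed paper that went into the corpus. Here is a minimal sketch in Python of that idea; the field names, the license whitelist and the sample records are all hypothetical, not drawn from any real pipeline:

# Illustrative sketch only: filter a set of papers down to openly
# licensed ones, and record the provenance of everything included.
OPEN_LICENSES = {"CC-BY", "CC-BY-SA", "CC0"}  # hypothetical whitelist

def build_open_corpus(papers):
    """Keep only openly licensed papers and log their provenance."""
    corpus, manifest = [], []
    for paper in papers:
        if paper.get("license") in OPEN_LICENSES:
            corpus.append(paper["full_text"])
            # The manifest is what makes the training data verifiable:
            # anyone can check which documents the model was trained on.
            manifest.append({"doi": paper["doi"], "license": paper["license"]})
    return corpus, manifest

papers = [
    {"doi": "10.1234/abc", "license": "CC-BY", "full_text": "..."},
    {"doi": "10.5678/xyz", "license": "All rights reserved", "full_text": "..."},
]
corpus, manifest = build_open_corpus(papers)
print(len(corpus), "openly licensed document(s) included:", manifest)

The manifest, published with the model, is what would let readers trace generated text back to peer-reviewed sources.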

It seems to me that generative AI is only going to play a larger role in how we generate knowledge moving forward, so it makes sense that we strive to train the models with the most accurate, verifiable, peer-reviewed information we have; exactly the kind of information researchers create. But much of that research is not openly published. Instead, it is published and controlled by corporations. That doesn’t automatically mean it could not be used to train LLMs, but it does add a barrier and puts another corporate entity in the middle to act as a gatekeeper to the information. If that research were instead openly licensed, it would remove a barrier and a gatekeeper and make it easier to include the research in the training corpus for LLMs.

In terms of knowledge dissemination, it feels like including academic research in the training corpus of generative AI tools is going to become more important to ensure that research is made broadly available to the public, who will likely be accessing information not through a library database or a Google search, but through text generated by generative AI tools.

One Comment

Alan Levine April 20, 2023

Certainly better training data will help. Does that mean generating from research content only? Or mixing it in with the large mysterious pool?

Everything still feels problematic in that what is produced is disconnected from where it came from.

Been looking more at Stability https://stability.ai/ – at least on the front page there is a more open-source and semi-transparent commitment and way of operating than the one that just has “open” as a brand name.

I maintan we have trouble grappling with this because the way stuff is generated is so far from our experiences and conceptions.

And I am leaving mytypos everywhere as my human calling sign 😉
