Generative AI training data sets are now trackable – and often legally complicated

A new tool, Data Provenance Explorer, lets users pick through the questionable provenance of many large data sets used for AI training.

A new online tool allows users to identify, track and learn about the legal status of training data sets for generative AI, and a quick glance shows that many may have licensing issues.

The tool, dubbed the Data Provenance Explorer, is the result of a joint effort between machine learning and legal experts from MIT, generative AI API provider Cohere, and 11 other organisations, with Harvard Law School, Carnegie Mellon University and Apple among the contributors. The Data Provenance Explorer lets researchers, journalists and anyone else search through thousands of AI training databases and trace the “lineage” of widely used data sets.

The idea is to provide a way to explore the sometimes murky world of training data used to develop generative AI. In an official statement announcing the Data Provenance Explorer, the team behind it described a “data transparency crisis” that could complicate the development and commercial use of generative AI systems.

Crowdsourced data sets lack licenses

“Crowdsourced aggregators like GitHub, Papers with Code, and many of the open source LLMs [large language models] trained from data on these aggregators, have an extremely high proportion of missing data licenses … ranging from 72% to 83%,” the group said. “In addition, the licenses that are assigned by crowdsourced aggregators frequently allow broader use than the original intent expressed by the authors of a data set.”

The need for responsibly developed AI is something that the industry appears to be well aware of, according to Kathy Lange, a research director for IDC. The headlong rush to deploy generative AI has created a public focus on the safe and legal use of data, she said.

“Understanding the provenance of the data; how it was collected, processed, and transformed can impact the trust in AI model results,” Lange said. “AI vendors prioritising data provenance will have a leg-up in the market for customers requiring transparency, accountability, and compliance initiatives.”

In some respects, AI training data has become a battleground. Lange highlighted the recent introduction of the Nightshade tool, which subtly alters digital art so that it disrupts AI models trained on copyrighted works without permission. Moreover, authors and other copyright holders have begun to take legal action against the use of their works in generative AI training – comedian and author Sarah Silverman is among those suing OpenAI for this reason. However, the legal standing of those claims remains unsettled.
