The Rise of Ethical AI: Common Corpus and the Push for Fairly Trained Models
OpenAI’s Admission and the Ensuing Backlash
In a recent submission to the UK Parliament, OpenAI acknowledged the prevalence of a “Wild West” mentality in the AI industry. Leading companies, including OpenAI, have been using material scraped from the internet to train the AI models that power chatbots and image generators. This revelation has sparked a heated debate about the ethics of such practices, with prominent figures such as Ed Newton-Rex leaving positions at AI companies like Stability AI in protest of policies that permit scraping content without permission.
Fairly Trained: Certifying Ethical AI Practices
Fairly Trained, a nonprofit organization, offers certification to companies that can demonstrate their AI models have been trained only on data they own, have licensed, or that is in the public domain. Recently, Fairly Trained certified its first large language model (LLM), KL3M, created by the AI startup 273 Ventures. The model, which has been made available on the open-source AI platform Hugging Face, was trained on a dataset called Common Corpus.
Common Corpus: A Collaborative Effort for Ethical AI Training
Common Corpus, a project coordinated by the French startup Pleias in collaboration with AI groups such as Allen AI, Nomic AI, and EleutherAI, aims to provide a vetted, multicultural, and multipurpose training dataset for researchers and startups. The dataset, which contains roughly 500 billion words, was built from public domain sources such as digitized newspapers from the US Library of Congress and the National Library of France. While this size is impressive, it still pales in comparison to the trillions of tokens believed to have been used to train OpenAI’s most advanced models.
Limitations and Advantages of Public Domain Data
Using public domain data to train AI models comes with its own set of limitations. Much of this data is decades old, since copyright protection typically lasts for the author’s lifetime plus 70 years. As a result, models trained on such datasets may struggle to engage with current affairs or generate content using modern slang. The trade-off, however, is that this approach largely sidesteps copyright-infringement concerns.
“As far as I am aware, this is currently the largest public domain dataset to date for training LLMs. It’s an invaluable resource.”
– Stella Biderman, Executive Director of EleutherAI
The Growing Trend of Licensing and Ethical AI Practices
While projects like Common Corpus are still rare, with KL3M being the only LLM certified by Fairly Trained so far, there is a growing trend towards licensing and requests for licensing in the AI industry. Organizations like the Authors Guild, which represents book authors, and the actors and radio artists labor union SAG-AFTRA, have recently become official supporters of Fairly Trained, signaling a shift towards more ethical AI practices.
Fairly Trained has also certified its first company offering AI voice models, the Spanish voice-changing startup Voicemod, and its first “AI band,” a heavy-metal project called Frostbite Orckings.
“We were always going to see legally and ethically created large language models spring up. It just took a bit of time.”
– Ed Newton-Rex, CEO of Fairly Trained
As the AI industry continues to evolve, initiatives like Common Corpus and the push for fairly trained models demonstrate a growing skepticism towards the permissionless scraping of data and a desire for more ethical practices in the development of AI technologies.