Common Corpus: A Beacon of Ethical AI Training
As the AI landscape continues to evolve, the debate surrounding the ethical use of data for training AI models has reached a fever pitch. In recent testimony before the UK parliament, OpenAI acknowledged the prevalence of a "Wild West" mentality in the AI industry, where leading companies have relied on material scraped from the internet without permission to train their models. The practice has sparked controversy, with some industry figures, such as Stability AI's Emad Mostaque, taking a stand against it by resigning from their positions.
Fairly Trained: Championing Responsible AI Development
Enter Fairly Trained, a nonprofit organization that offers a certification to companies that can demonstrate their AI models have been trained on data they own, have licensed, or that is in the public domain. Recently, Fairly Trained certified its first large language model (LLM), KL3M, developed by the AI startup 273 Ventures. The model, trained on the Common Corpus dataset, has been made available on the open-source AI platform Hugging Face.
The Common Corpus Dataset: A Collaborative Effort
The Common Corpus dataset is the result of a collaboration between various AI groups, including Pleias, Allen AI, Nomic AI, and EleutherAI, with support from the French Ministry of Culture. The dataset, which boasts 500 billion tokens, was built using sources such as public domain newspapers digitized by the US Library of Congress and the National Library of France. Pierre-Carl Langlais, the project coordinator for Common Corpus, believes the dataset is substantial enough to train a state-of-the-art LLM.
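For researchers who want to inspect the data directly, the sketch below shows one way to stream a handful of records with the Hugging Face datasets library in Python. It is a minimal example that assumes the corpus is published under the dataset id "PleIAs/common_corpus"; the actual repository name, splits, and record fields may differ.

    # Minimal sketch: stream a few documents from Common Corpus.
    # The dataset id "PleIAs/common_corpus" is an assumption and may differ
    # from the actual Hugging Face repository name.
    from datasets import load_dataset

    # Streaming avoids downloading the full 500-billion-token corpus up front.
    corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

    # Print the first three records; each record is a dict of text and metadata fields.
    for i, record in enumerate(corpus):
        print(record)
        if i >= 2:
            break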
Multicultural and Multipurpose: The Aspirations of Common Corpus
Common Corpus aims to be both multicultural and multipurpose, giving researchers and startups across a range of fields access to a vetted training set free from concerns over potential infringement. It also contains the largest open French-language dataset released to date, underscoring that commitment to diversity.
Limitations and Opportunities
While the Common Corpus dataset marks a significant step forward in ethical AI training, it comes with limitations. Much of the public domain data is dated, since copyright protection typically extends for 70 years or more after an author's death. As a result, AI models trained on this dataset may struggle to engage with current affairs or incorporate modern slang. That same limitation, however, presents an opportunity: such models could excel at tasks like mimicking the writing style of historical figures such as Proust.
“As far as I am aware, this is currently the largest public domain dataset to date for training LLMs. It’s an invaluable resource.”
Stella Biderman, the executive director of EleutherAI, recognizes the significance of the Common Corpus dataset, emphasizing its rarity and value in the AI community.
A Growing Trend: Licensing and Fairness in AI
While projects like Common Corpus are still uncommon, with KL3M being the only LLM certified by Fairly Trained thus far, there is a growing trend towards licensing and fairness in the AI world. Organizations like the Authors Guild, SAG-AFTRA, and other professional groups have lent their support to Fairly Trained, signaling a shift in the industry’s approach to data usage.
Fairly Trained has also certified its first company offering AI voice models, the Spanish voice-changing startup Voicemod, and its first "AI band," a heavy-metal project called Frostbite Orckings.
“We were always going to see legally and ethically created large language models spring up. It just took a bit of time.”
As Fairly Trained's CEO, Ed Newton-Rex, notes, the emergence of legally and ethically created LLMs was inevitable; it simply took time for the industry to adapt and evolve.