In late 2021, OpenAI faced a data shortage as it developed its latest AI system. To overcome this, researchers created Whisper, a speech recognition tool that transcribed audio from YouTube videos, providing a vast source of conversational text.
Despite concerns that doing so might violate YouTube's rules, an OpenAI team, which included Greg Brockman, the company's president, transcribed more than a million hours of videos. The resulting text was fed into GPT-4, the powerful AI model that came to underpin ChatGPT.
The race for AI leadership has intensified the demand for digital data. Tech giants like OpenAI, Google, and Meta have resorted to questionable practices to acquire it.
Meta considered purchasing Simon & Schuster to access long-form works and discussed gathering copyrighted data from the internet, even at the risk of lawsuits. Google also transcribed YouTube videos and expanded its terms of service to allow it to use publicly available content for AI development.
The abundance of online information has become the lifeblood of the AI industry. Leading chatbots have been trained on trillions of words, more text than is held in Oxford University's Bodleian Library. High-quality data, such as published books and articles, is particularly valuable.
As the internet's data supply dwindles, tech companies are exploring synthetic information generated by AI models themselves.
OpenAI, Google, and Meta have defended their data acquisition practices, citing agreements with content creators and investments in AI integration. However, creators have raised concerns about copyright infringement and unauthorized use of their works.
The U.S. Copyright Office is currently reviewing how copyright law applies in the AI era and has received more than 10,000 submissions from stakeholders. Filmmaker and author Justine Bateman has condemned the unauthorized use of creative content by AI models, calling it "the largest theft in the United States."