Engineers at OpenAI inadvertently destroyed relevant evidence in a copyright infringement lawsuit brought against the AI startup by news publishers The New York Times and Daily News.
As part of the legal process, OpenAI agreed to allow the publishers’ lawyers to go through its AI training datasets for any copyrighted content. Beginning November 1, a team of lawyers and specialists began searching OpenAI’s training data on virtual machines created by the firm.
However, on November 14, lawyers for the publishers said that search data stored on one of the servers, representing 150 hours of work, had vanished. While OpenAI managed to recover most of the deleted data, the lawyers stated that the recovered data did not include the original file names or folder structures. As a result, it “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” the lawyers wrote in a letter filed with a US federal court on Wednesday, November 20. “News plaintiffs found themselves forced to recreate their work from scratch using significant person-hours and computer processing time,” according to the letter.
“The news plaintiffs found only yesterday that the recovered data is unusable and that an entire week’s worth of its experts’ and lawyers’ work must be re-done, which is why this supplemental letter is being filed today,” they stated. While the publishers’ lawyers acknowledged that OpenAI did not intentionally erase the data, they emphasized that the business was “in the best position to search its datasets.”
Faced with multiple lawsuits from publishers alleging copyright infringement, OpenAI has contended that training its AI models on publicly available data, such as news articles published by The New York Times, constitutes fair use of that information. At the same time, OpenAI has signed content licensing agreements with a number of major media organizations, including the Associated Press and Axel Springer, the parent company of Business Insider and Politico.