Authors Sue Salesforce for AI Copyright Infringement on Pirated Books

Authors Molly Tanzer and Jennifer Gilmore have filed a class-action lawsuit against Salesforce, accusing it of copyright infringement by training its xGen AI models on pirated books from "The Pile" dataset. The suit seeks damages and an injunction, joining similar cases against AI firms and potentially reshaping ethical data practices in tech.
Written by Juan Vasquez

In a move that underscores the growing tensions between creative industries and artificial intelligence developers, cloud-computing giant Salesforce Inc. finds itself in the crosshairs of a proposed class-action lawsuit filed by novelists Molly Tanzer and Jennifer Gilmore. The complaint, lodged in federal court in San Francisco, accuses the company of infringing copyrights by using thousands of books without permission to train its xGen AI models. The authors claim Salesforce’s software processed language from pirated copies of their works, including Tanzer’s “Creatures of Will and Temper” and Gilmore’s “The Mothers,” as part of a broader dataset allegedly sourced from unauthorized online libraries.

The lawsuit highlights Salesforce’s admission in a 2023 research paper that it trained its AI on a dataset known as “The Pile,” which reportedly comprises over 800 gigabytes of text, including books obtained through shadow libraries like Bibliotik. Plaintiffs argue this constitutes willful infringement, and they seek damages and an injunction to halt the use of such models.

The Broader Implications for AI Training Practices

This case joins a wave of similar legal challenges against tech firms, including suits against OpenAI and Meta Platforms Inc., where creators allege unauthorized use of copyrighted material to fuel generative AI. According to a report from Reuters, the authors contend that Salesforce not only trained on pirated books but also attempted to obscure this by scrubbing references from public disclosures after initial revelations.

Industry insiders note that Salesforce’s xGen models, designed for natural language processing in enterprise applications like customer relationship management, rely on vast datasets to achieve high performance. Yet, the complaint details how the company allegedly ingested nearly 200,000 books from illicit sources, raising questions about ethical data sourcing in an era when AI is integral to business operations.

Evidence and Admissions in the Spotlight

Court documents cite Salesforce communications, including a GitHub post from an employee acknowledging the use of The Pile, a dataset notorious for containing copyrighted works without licenses. The plaintiffs, represented by prominent intellectual property attorneys, seek to certify a class of potentially thousands of authors whose books were allegedly used in the same way.

Salesforce has declined to comment on the litigation. The suit demands monetary compensation as well as the destruction of any AI models trained on infringing data. As reported by Slashdot, where an anonymous reader shared details of the complaint, the alleged infringement is notable for its scale and involves cloud-based AI tools that power Salesforce’s Einstein platform.

Parallels with Ongoing AI Copyright Battles

This dispute echoes broader industry debates, such as the New York Times’ lawsuit against Microsoft and OpenAI over news article usage. Legal experts suggest that if successful, the Tanzer-Gilmore case could force AI companies to adopt transparent licensing models or face escalating liabilities.

For Salesforce, a leader in CRM software with a market capitalization exceeding $250 billion, the lawsuit poses reputational risks amid its push into AI-driven analytics. Posts on social platform X, formerly Twitter, reflect creator sentiment, with users like Ed Newton-Rex highlighting the “inspiring” surge in such lawsuits as evidence of pushback against unchecked AI training.

Potential Outcomes and Industry Shifts

Analysts predict that resolving this case could take years, potentially reaching the Supreme Court if it hinges on fair use doctrines under U.S. copyright law. The complaint references Salesforce’s own research papers, which initially disclosed the dataset but were later edited, as per findings in Decrypt.

Meanwhile, authors and publishers are increasingly vigilant, with organizations like the Authors Guild supporting similar actions. This litigation may accelerate calls for federal regulations on AI data practices, compelling companies to negotiate royalties or seek explicit permissions for training materials.

Looking Ahead: Balancing Innovation and Rights

As AI permeates sectors from finance to healthcare, cases like this test the boundaries of innovation versus intellectual property protection. Salesforce’s response will be closely watched, potentially influencing how enterprises integrate AI while respecting creators’ rights. With damages potentially in the millions, the outcome could reshape data ethics in tech, ensuring that the rush to build smarter machines doesn’t trample on the foundations of human creativity.
