Copyright in the Age of Generative AI, Part II: Reinterpreting DMCA § 1202 and Encoded Representations
Andersen v. Stability AI Ltd. is an ongoing lawsuit, filed in January 2023 in the U.S. District Court for the Northern District of California, in which visual artists seek compensation and intellectual property protections from generative artificial intelligence (gen AI) companies, including Stability AI, Midjourney, Inc., DeviantArt, Inc., and Runway AI, Inc. In recent months, key developments in the lawsuit have underscored the need for technical clarity in interpreting copyright law for software. The information disparity between artists and gen AI corporations was heightened by the defendants’ objection to disclosing sensitive source code to an expert witness and by delays in the plaintiffs’ request to review the training data behind Runway’s and Midjourney’s gen AI models. Tracing Andersen v. Stability AI Ltd. and the related case Doe v. GitHub, Inc., this article discusses how recent litigation pushes courts to confront nontraditional interpretations of Section 1202(b) of the Digital Millennium Copyright Act (DMCA) in order to regulate software despite the hurdles of information asymmetry and confidential trade secrets.
Following multiple back-and-forths in court, plaintiffs in Andersen v. Stability AI Ltd. called for greater scrutiny of gen AI’s dataset practices, including image scraping and training algorithms, in an attempt to pinpoint violations of DMCA § 1202(b). Plaintiffs requested that Midjourney produce all datasets used for training its models, in addition to the LAION-400M and LAION-5B datasets on which the lawsuit focused. They aimed to examine the deep learning pipeline broadly for possible removal or alteration of copyright management information (CMI) during data collection. Removed CMI (information conveyed with a copyrighted work, such as its title or the author’s name) could indicate that defendants willfully bypassed proper attribution and licensing. This request reflected a pivotal shift in the broader copyright debate surrounding AI: from identifying similarities in AI-generated content to deducing the processes of data collection and image production where infringement may have occurred.
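To make the alleged mechanism concrete, the sketch below illustrates how a scraping pipeline can shed CMI even without a targeted act of removal. It is a hypothetical Python example, not the defendants’ actual code: image files typically carry CMI in embedded metadata, such as the EXIF Artist and Copyright fields, and re-encoding only the pixels produces a fresh file with none of it.

```python
# Hypothetical scraping step for illustration only; this is not the
# LAION, Stability AI, or Midjourney pipeline.
import io

import requests
from PIL import Image

def fetch_for_training(url: str) -> Image.Image:
    """Download an image and re-encode it for a training dataset."""
    raw = requests.get(url, timeout=10).content
    img = Image.open(io.BytesIO(raw))

    # EXIF fields such as Artist (tag 315) and Copyright (tag 33432)
    # are textbook examples of CMI conveyed with a work.
    print("EXIF before re-encoding:", dict(img.getexif()))

    # Saving only the pixels yields a clean JPEG. Unless a pipeline
    # explicitly copies metadata forward, the author's name and
    # copyright notice are silently dropped.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=90)
    buf.seek(0)
    return Image.open(buf)  # getexif() on this image is now empty
```

Whether such incidental stripping can satisfy the statute’s intent requirement is precisely what the plaintiffs’ proposed review of the defendants’ code aims to resolve.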
Traditionally, Section 1202(b), which prohibits the intentional removal or alteration of CMI as well as the distribution, or import for distribution, of copies whose CMI has been removed or altered, was used to adjudicate intentional plagiarism in human-made artwork. When the alleged perpetrator is gen AI, it becomes difficult for artists to allege copyright infringement under § 1202(b) without incontestable evidence that the software was intentionally designed to remove or alter CMI. The plaintiffs’ strategy is to prove this through a detailed review of the code used to build the defendants’ gen AI platforms. The case thus tests whether § 1202(b) can support copyright infringement claims, contingent on the defendants disclosing competitively sensitive software and on the plaintiffs’ ability to diagnose CMI removal or alteration in the scraping or training process.
Indeed, the early proceedings of this class action lawsuit signal the courts’ receptiveness to evaluating training datasets for copyright infringement. In October 2023, the Northern District of California allowed illustrator Sarah Andersen to proceed with her lawsuit, finding that she had plausibly alleged that Stability AI acquired billions of copyrighted images without permission and, during model training, stored compressed copies of her works in the image-generating platform Stable Diffusion.
The court recognized from the outset that Andersen had plausibly alleged direct infringement based on the training data stored in Stable Diffusion. Stability AI moved to dismiss the artists’ lawsuit, arguing that Stable Diffusion did not retain any direct copies of artworks but was trained and refined on abstract parameters such as “lines, colors, shades, and other attributes associated with innumerable subjects and concepts.” The court partially denied this motion, rejecting Stability AI’s defense that storing artworks in algorithmic form defeated the infringement claims. In his decision, Judge William Orrick stated that the absence of works stored in their original medium was “not an impediment to the [direct infringement] claim.” Judge Orrick’s decision signaled an expansion of copyright protection to non-identical, encoded representations of works stripped of CMI. The ruling was extraordinary: his departure from the “identicality” requirement could grant intellectual property protection to content spanning image, audio, text, and other mediums that can be encoded like Andersen’s cartoons. This would bar gen AI programs from storing copies of licensed work across various industries, curbing their ability to produce imitative content or mimic individual artists’ creative styles.
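The “algorithmic form” at issue can be glimpsed through Stable Diffusion’s publicly released variational autoencoder (VAE). The sketch below, which assumes the Hugging Face diffusers library and uses a placeholder file name, shows how an image becomes a compact latent tensor: an encoded representation that retains no original pixels yet can be decoded back into a close likeness of the work.

```python
# Illustrative sketch of latent encoding; not the defendants' training code.
import torch
from diffusers import AutoencoderKL
from PIL import Image
from torchvision import transforms

# Publicly released VAE used with Stable Diffusion.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),               # pixels scaled to [0, 1]
    transforms.Normalize([0.5], [0.5]),  # rescaled to [-1, 1] for the VAE
])

# "artwork.png" is a placeholder for any scraped image.
img = to_tensor(Image.open("artwork.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    latent = vae.encode(img).latent_dist.sample()

# A 512x512 RGB image (~786,000 values) compresses to a 4x64x64 latent
# (~16,000 values): none of the original pixels survive, yet
# vae.decode(latent) reconstructs a near-identical likeness.
print(img.shape, "->", latent.shape)  # [1, 3, 512, 512] -> [1, 4, 64, 64]
```

It is representations like this latent, non-identical to the source yet derived wholly from it, that Judge Orrick declined to treat as beyond the reach of a direct infringement claim.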
A related ongoing case, Doe v. GitHub, Inc., likewise challenges how § 1202(b) applies to CMI removal in gen AI training data. The plaintiffs-appellants in Doe are software engineers who publicly uploaded their code to GitHub under open-source licenses. They allege that GitHub and OpenAI removed CMI from copyrighted code to build training datasets for their commercial gen AI coding tools, Copilot and Codex. Citing how early iterations of Copilot would regurgitate memorized software code along with its license text and CMI, the appellants infer that the defendants programmed their AI models to strip CMI from plaintiffs’ code during the data collection and training process. Much as Stability AI stored compressed copies of images in Stable Diffusion, the appellants suspect that the defendants “made complete, identical copies of appellants’ works and removed the CMI before feeding the data into Codex and Copilot,” violating their licenses and the DMCA. The Doe appellants follow a similar strategy to the Andersen plaintiffs: exposing copies of CMI-stripped licensed data stored in gen AI programs and adducing them as evidence of an internal, covert violation of § 1202(b).
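The preprocessing the appellants describe could be as simple as a pass that drops leading comment headers, the conventional home of license text and CMI in source files. The following is a hypothetical illustration of such a step, assumed here for exposition; it is not Copilot’s or Codex’s actual pipeline.

```python
# Hypothetical dataset-cleaning step; an assumption for illustration,
# not the actual Copilot or Codex pipeline.
import re

# Match a leading block of comment lines mentioning a license or copyright.
LICENSE_HEADER = re.compile(
    r"\A(?:\s*(?:#|//).*(?:copyright|license|spdx).*\n)+",
    re.IGNORECASE,
)

def clean_for_training(source: str) -> str:
    """Drop a leading license/copyright comment block, keep the code."""
    return LICENSE_HEADER.sub("", source)

sample = (
    "# Copyright (c) 2021 Jane Doe\n"
    "# SPDX-License-Identifier: GPL-3.0\n"
    "def add(a, b):\n"
    "    return a + b\n"
)
print(clean_for_training(sample))  # the CMI is gone; the code survives intact
```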
Moreover, on appeal, the Doe appellants argue that the DMCA’s software provisions “embrace the concept of non-identical copies,” a reading they contend is necessary to preserve the statute’s context. This contrasts with how the DMCA has traditionally applied to mediums such as essays or paintings, where plagiarism can be judged definitively on identicality. The appellants assert that an identicality requirement would bar copyright law from preventing piracy in code, since multiple programs and algorithms can produce the same output. Under a strict “identicality” requirement, a simple cosmetic change to the code would render copyright infringement claims powerless. Thus, for any gen AI that stores copies of data in non-identical forms, whether snippets of licensed code written by engineers or parameters derived from artworks such as lines and colors, the courts’ reading of the DMCA carries crucial consequences for copyright enforcement and implies a new criterion for author compensation.
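A toy example illustrates the appellants’ point. In the hypothetical snippet below, the second function is a trivially reworded copy of a licensed original with its license comment, the CMI, deleted; under a strict identicality rule, the altered copy no longer matches the original byte for byte, even though nothing substantive has changed.

```python
# Original, as published under a hypothetical open-source license.
# Copyright (c) Jane Doe. SPDX-License-Identifier: MIT   <- the CMI
def clamp(value, low, high):
    return max(low, min(value, high))

# Cosmetically altered copy: CMI stripped, identifiers renamed.
def bound(x, lo, hi):
    return max(lo, min(x, hi))

# The two are functionally identical, yet not textually identical.
assert clamp(7, 0, 5) == bound(7, 0, 5) == 5
```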
In April 2025, following the appeal in Doe v. GitHub, Inc., the Authors Guild, Inc., the News/Media Alliance, and the International Association of Scientific, Technical and Medical Publishers filed an amicus brief in support of the plaintiffs-appellants. Reinforcing the appellants’ argument, the amici warned that a strict identicality requirement would impede, rather than bolster, the World Intellectual Property Organization’s Performances and Phonograms Treaty, undermining the treaty’s principle of obligating member states to protect creators’ rights against unauthorized digital distribution. As stakeholders continue to voice their concerns, their input reflects the expanding complexity of gen AI litigation and accentuates the need for technical expertise and regulatory clarity across industries. The court’s reading of § 1202(b) could significantly affect how gen AI is regulated, reshaping how gen AI systems operate in industries such as media, journalism, and technology.
The Andersen plaintiffs’ central objective, prohibiting copyright infringement in training datasets, faces an obstacle: technology corporations’ reluctance to disclose lucrative information. When plaintiffs in Andersen v. Stability AI Ltd. called on computer science expert Dr. Ben Yanbin Zhao to testify as a witness, defendants objected on competition grounds, reflecting the widening chasm between corporations, lawmakers, and researchers. Plaintiffs’ demand that Runway disclose the training data used by its gen AI model was also delayed by production time, pushing back the plaintiffs’ review of the code. Most recently, the court denied plaintiffs’ request for the production of training data beyond that sourced from LAION as overly burdensome. Against this resistance from the defendants and the court, the plaintiffs’ relentless legal strategy reads as deliberately ambitious and risky. These oppositions foreshadow increasing friction between corporations and stakeholders as district courts across New York, Delaware, and California move to award damages in gen AI copyright cases.
Unless courts accept nontraditional interpretations of § 1202(b) to address copyright violations within training data and gen AI programs, plaintiffs in both Andersen v. Stability AI Ltd. and Doe v. GitHub, Inc. may be unable to secure justice for the unauthorized use of their licensed works, setting a dangerous precedent. These rulings could affect millions of users online whose images, music, code, or news have already been scraped into datasets to train gen AI. The rationale and language of each trial document are paramount for future legislation: even a single order that adheres to an identicality requirement for software could block future artists from claiming ownership over their works and seeking compensation. In this uncharted era of AI, Andersen v. Stability AI Ltd. is a springboard that will shape copyright principles for lawsuits and policies to come.
Under a national spotlight, Andersen v. Stability AI Ltd. encapsulates a symbolic battle for artists as plaintiffs aggressively pursue motions beyond the original scope of the lawsuit. Yet resistance from technology corporations and the court indicates a need for forward-looking federal policy tailored to address copyright infringement in gen AI datasets. Evidently, reactive litigation alone cannot protect the full scope of intellectual and cultural production in the U.S. As gen AI services disrupt outlets of artistry and uproot traditional authorship and monetization, Andersen v. Stability AI Ltd. underscores the ever-present need for frameworks of transparency between technology entrepreneurs, stakeholders, and lawmakers.
Edited by Ananya Bhatia and Leah Druch