Codify — Article

California AB 412: Requires GenAI developers to track and disclose copyrighted training data

Creates a fingerprint-based transparency regime that lets rights owners query GenAI models and sue for nonresponse, with several drafting ambiguities left unresolved.

The Brief

AB 412 creates a statutory framework requiring developers who deploy generative AI models to document and make discoverable the copyrighted (registered or indexed) works used to train those models when those models are used commercially in California or made available to Californians. The bill centers on an “approximate content fingerprint” system: developers must publish tooling or instructions that let rights owners generate standardized fingerprints and must maintain records tying documented covered materials to their models.

The law gives rights owners a defined channel to submit identity-verified requests for information and obliges developers to assess whether a provided fingerprint likely appears in the developer’s training dataset and to report back — with missed deadlines treated as discrete violations that can trigger a $1,000-per-violation statutory award or actual damages, injunctive relief, and attorney’s fees. The measure also contains several exemptions (academic research, public datasets, wholly owned material) and multiple drafting inconsistencies that create real implementation questions for compliance teams and courts.

At a Glance

What It Does

AB 412 requires GenAI developers who operate commercially in California or make models available to Californians to document covered copyrighted works used for training, publish fingerprint-generation information conforming to industry standards, and provide a website mechanism for rights owners to submit verified inquiries about training use. Developers must retain documentation while the model is in use (plus an ambiguous additional period).

Who It Affects

The rule targets any person or company that designs or substantially modifies GenAI models and either commercially uses them in California or makes them available to Californians — from large platform owners to startups offering models as a service. Copyright owners (including owners of pre‑1972 sound recordings) and legal claimants will use the new transparency paths.

Why It Matters

This is one of the first state-level statutes to mandate provenance tooling for GenAI training data, shifting the burden of provenance onto model creators and creating a private-enforcement path. Compliance will change how training datasets are compiled, documented, and governed, and it will increase litigation risk for developers that can’t demonstrate provenance.

What This Bill Actually Does

AB 412 defines a narrow technical approach to provenance: an "approximate content fingerprint" that uniquely represents a digital work, is resilient to minor edits, cannot be reversed to reconstruct the work, and can be used to identify content within large datasets. The bill defines who counts as a developer (entities that design or substantially modify GenAI models and either use them commercially in California or make them available to Californians) and limits the covered works to those registered, preregistered, or indexed with the U.S. Copyright Office, plus certain pre‑1972 sound recordings.

Operationally, the bill forces developers to keep two linked streams of records: (1) documentation of covered materials they know were used in training and reasonable efforts to identify other covered materials present in training data, and (2) logs of requests received from rights owners and the developer’s responses. To make matching feasible, developers must publish on their websites enough technical detail to allow a natural person to produce a fingerprint compatible with the developer’s dataset; the statute permits pointing rights owners to an external, free, nondiscriminatory tool that conforms to widely accepted industry standards.

When a rights owner submits an identity‑verified request that includes registration numbers or fingerprints, the developer must assess, for each compliant fingerprint, whether the represented covered material is likely present in the model’s dataset and return two lists: works the developer has already documented as used and works that the fingerprint assessment suggests are likely present.

The bill forbids answering requests that lack adequate identity documentation or that violate the one‑request‑per‑quarter rule. Missing the statutory response deadline, which the text currently states inconsistently, results in daily, discrete violations, and a successful rights owner may recover at least $1,000 per violation (or actual damages), injunctive relief, and attorney’s fees.

AB 412 also carves out routine exceptions: models trained exclusively on freely publicly available data, models used only for noncommercial academic or government research, models not trained on covered materials, or models trained exclusively on covered materials owned by the developer.

Several drafting problems — notably inconsistent timelines and an unclear retention-duration phrase — leave material implementation questions for regulators, courts, and compliance teams to resolve before companies can build reliable processes around the statute.

The Five Things You Need to Know

1

The bill requires developers to publish information enabling generation of an "approximate content fingerprint" compatible with training data and aligned with widely accepted industry standards.

2

Developers must provide a website mechanism to receive identity-verified requests from rights owners and log those requests and responses for the life of the model in California plus an ambiguous additional period.

3

For each compliant fingerprint supplied, the developer must assess whether the represented covered material is likely in its dataset and return lists of documented covered materials and likely matches.

4

Failure to respond within the statute’s stated deadline (text alternates between seven and thirty days) triggers daily, discrete violations; successful claimants can recover $1,000 per violation or actual damages, injunctive relief, and attorneys’ fees.

5

The statute exempts models trained exclusively on publicly available, free data; models for noncommercial academic or government research; models not trained on covered materials; and models trained only on covered materials owned by the developer.

Section-by-Section Breakdown

Section 3115

Definitions (fingerprint, GenAI, covered material, developer)

This section defines the technical and scope terms the rest of the title uses. The fingerprint must be distinctive, robust to minor edits, nonreconstructive, and useful for identification — a set of constraints that push implementers toward hash‑like, lossy representations rather than reversible encodings. The developer and GenAI definitions set the statute’s jurisdictional hook: the obligations attach when a model is used commercially in California or made available to Californians, which focuses compliance on real‑world deployments rather than internal research prototypes.
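To make the four fingerprint constraints concrete, here is a minimal sketch of one hash-like, lossy representation: a SimHash over word shingles. The bill does not name an algorithm, so the choice of SimHash, the 64-bit width, the three-word shingle size, and every function name below are assumptions for illustration only, and the sketch handles text rather than audio or images.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; the bill does not specify one


def normalize(text: str) -> list[str]:
    """Lowercase and tokenize so case, spacing, and punctuation edits vanish."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens: list[str], size: int = 3) -> list[str]:
    """Overlapping word n-grams; a small edit disturbs only a few of them."""
    if len(tokens) < size:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def simhash(text: str) -> int:
    """Return a 64-bit SimHash of the text's word shingles."""
    weights = [0] * FINGERPRINT_BITS
    for shingle in shingles(normalize(text)):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest near-duplicate works."""
    return bin(a ^ b).count("1")
```

Because the output is a fixed 64-bit summary of many shingle votes, the original text cannot be rebuilt from it, and two works can be compared by Hamming distance without exchanging the works themselves. What distance counts as a match is a policy choice the statute leaves open.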

Section 3116

Documentation, fingerprint tooling, and request mechanism

This provision obliges developers to document both known covered materials used for training and reasonable efforts to identify other covered materials present in datasets. It requires developers to publish sufficient information to let a person generate a compatible fingerprint, and it expressly allows directing rights owners to an external tool that is free, nondiscriminatory, and reasonably accessible. The section also mandates a website mechanism that accepts identity verification, signatures, and registration/fingerprint data from rights owners, and it requires developers to retain the documentation and request logs for the stated retention period (which the text renders inconsistently). Practically, this means developers must build recordkeeping, a public-facing information page, and an intake workflow tied to identity validation.

Section 3117

Assessment obligations and timing, with per‑day violation rule

Under this section a developer must assess each fingerprint that complies with the industry‑standards requirement and report back to the rights owner with two lists: (a) covered materials the developer previously documented as used in training, and (b) covered materials that a fingerprint assessment suggests are likely present. The statute ties failure to meet the response deadline to discrete, daily violations. However, the draft contains inconsistent timing language (it alternates between seven and thirty days), creating immediate compliance uncertainty about the applicable response window and the accrual of daily penalties.
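The two-list response can be sketched as a lookup against a fingerprint index of the training data. This is a rough illustration only: the `IndexedWork` record, the `assess_request` function, the use of Hamming distance, and the threshold of 3 bits are all hypothetical, since the bill does not say how "likely present" is to be determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedWork:
    registration_number: str  # empty if the developer never identified the work
    fingerprint: int          # fingerprint stored alongside the training record
    documented: bool          # True if the developer already documented its use


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def assess_request(submitted, index, max_distance=3):
    """Return (documented, likely_present) registration-number lists.

    submitted: dict of registration number -> fingerprint from the rights owner
    index: iterable of IndexedWork describing the training dataset
    max_distance: assumed match threshold, not taken from the bill
    """
    index = list(index)
    documented, likely_present = [], []
    for reg_no, fingerprint in submitted.items():
        if any(w.documented and w.registration_number == reg_no for w in index):
            documented.append(reg_no)
        elif any(hamming_distance(fingerprint, w.fingerprint) <= max_distance
                 for w in index):
            likely_present.append(reg_no)
    return documented, likely_present
```

A linear scan like this would not scale to a real training corpus; a production system would presumably need an approximate nearest-neighbor index, which is outside the scope of this sketch.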

Section 3118

Limits on requests

This short section curbs potential abuse by limiting each rights owner to one request per developer per quarter for the same model, unless a later request provides new material information. A single submission may cover multiple works, which reduces friction for rights owners with catalog claims but also means developers must track request cadence per model and per owner.
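A per-model, per-owner cadence check might look like the sketch below. It assumes "quarter" means a calendar quarter rather than a rolling 90-day window, which this summary of the bill does not settle, and the function names and log shape are hypothetical.

```python
from datetime import date


def quarter_of(d: date) -> tuple[int, int]:
    """Map a date to (year, calendar quarter); calendar quarters are an assumption."""
    return d.year, (d.month - 1) // 3 + 1


def may_accept(request_log, owner_id, model_id, received, has_new_material_info=False):
    """Decide whether a new request falls within the one-per-quarter limit.

    request_log: iterable of (owner_id, model_id, date) for earlier accepted requests
    has_new_material_info: True if the request adds new material information
    """
    if has_new_material_info:
        return True
    return not any(
        o == owner_id and m == model_id and quarter_of(d) == quarter_of(received)
        for o, m, d in request_log
    )
```

Keying the log on the owner and model pair mirrors the section's structure: the same owner can still query a different model, and a different owner can query the same model, within the same quarter.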

Section 3119

Enforcement: private civil actions and remedies

The bill grants rights owners a private right of action when they comply with the request rules and a developer fails to provide required information. Damages are statutory ($1,000 per violation) or actual damages if greater; the court may award injunctive relief and attorneys’ fees. Because the statute treats each day after the deadline as a separate violation, exposure can compound quickly, creating meaningful litigation leverage for plaintiffs and sizable exposure for developers who miss timelines.
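The compounding effect is easy to see with a little arithmetic. The sketch below assumes each full calendar day after the deadline counts as one violation and that statutory and actual damages are alternatives, as summarized above; the function names are hypothetical, and attorneys' fees and injunctive relief are left out.

```python
from datetime import date, timedelta

STATUTORY_PER_VIOLATION = 1_000  # dollars per violation, per the summary above


def days_late(received: date, responded: date, deadline_days: int) -> int:
    """Full days elapsed after the response deadline (0 if on time)."""
    deadline = received + timedelta(days=deadline_days)
    return max(0, (responded - deadline).days)


def exposure(received: date, responded: date, deadline_days: int,
             actual_damages: int = 0) -> int:
    """Greater of statutory or actual damages, or 0 if the response was timely."""
    statutory = STATUTORY_PER_VIOLATION * days_late(received, responded, deadline_days)
    return max(statutory, actual_damages) if statutory else 0


received, responded = date(2025, 1, 1), date(2025, 2, 15)
print(exposure(received, responded, deadline_days=7))   # 38000
print(exposure(received, responded, deadline_days=30))  # 15000
```

For a single request answered 45 days after receipt, reading the deadline as seven days rather than thirty more than doubles the statutory figure, which is why the inconsistent timing language matters so much in practice.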

Section 3119.5

Exemptions and non‑application

This section lists four exemptions: models trained solely on data the developer makes publicly available at no cost; models used only for noncommercial academic or governmental research; models not trained on covered materials; and models trained exclusively on covered materials owned by the developer. These carveouts narrow the statute’s sweep but leave open questions about hybrid datasets, dual‑use models, and what constitutes sufficiently public and free data.
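A compliance team might encode the four carveouts as a simple screening check. The sketch below is only an illustration: the `ModelProfile` fields and their names are hypothetical, and each flag hides a legal judgment (for example, what counts as public and free data) that the code cannot make.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    trained_only_on_free_public_data: bool = False
    used_only_for_noncommercial_research: bool = False
    trained_on_covered_materials: bool = True
    covered_materials_all_developer_owned: bool = False


def exemption_reasons(profile: ModelProfile) -> list[str]:
    """List which of the four carveouts appear to apply; empty means none do."""
    reasons = []
    if profile.trained_only_on_free_public_data:
        reasons.append("trained solely on free, publicly available data")
    if profile.used_only_for_noncommercial_research:
        reasons.append("noncommercial academic or governmental research only")
    if not profile.trained_on_covered_materials:
        reasons.append("not trained on covered materials")
    elif profile.covered_materials_all_developer_owned:
        reasons.append("covered materials all owned by the developer")
    return reasons
```

Every test is all-or-nothing, so a hybrid dataset that mixes public data with a few licensed covered works sets none of the flags and falls outside all four exemptions under this reading.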

Who Benefits and Who Bears the Cost

Every bill creates winners and losers. Here's who stands to gain and who bears the cost.

Who Benefits

  • Registered copyright owners (authors, publishers): The bill gives them a standardized channel and technical means to test whether their registered works were used in a model’s training set and to seek remedies if developers fail to respond.
  • Owners of pre‑1972 sound recordings: The statute explicitly includes these rights owners in the definition of rights owner, translating longstanding uncertainty about older recordings into actionable rights under the title.
  • Creators and small rights holders with provenance disputes: The fingerprint mechanism lowers the cost of screening large models for specific works compared with manual audits, helping smaller rights holders identify potential uses of their works.

Who Bears the Cost

  • GenAI developers operating in or serving Californians: They must build documentation workflows, publish fingerprint tooling or link to compliant external tools, implement identity‑verified intake systems, retain records for the model lifecycle plus an unclear extra period, and absorb litigation risk tied to response deadlines.
  • Startups and research teams that commercialize models: Compliance burdens and potential statutory damages may disproportionately impact smaller entities that lack established provenance pipelines, increasing time‑to‑market and legal costs.
  • Courts and defense counsel: The private right of action and per‑day violation structure will likely generate litigation, requiring courts to resolve technical disputes about fingerprint validity, standards conformity, and ambiguous timing and retention language.

Key Issues

The Core Tension

AB 412 pits two legitimate aims against each other. One is giving copyright owners practical, technical ways to detect and challenge unauthorized uses of registered works. The other is sparing developers operational, compliance, and legal burdens that could slow innovation, raise costs for smaller players, and invite litigation driven by timing and technical disputes. The statute solves the provenance problem by shifting the burden onto model creators, but it leaves open how to do that fairly and predictably in practice.

Two drafting inconsistencies are central to implementation risk. First, the response deadline is stated alternately as seven days and thirty days in the same provision; second, the retention clause refers to the retention period as "plus 10 five years," producing ambiguity about how long records must be kept after the model ceases California commercial operations.

Those errors matter: the per‑day violation rule converts timing disputes into monetary exposure, so uncertainty about the correct deadline or retention horizon creates asymmetric litigation risk for developers.

Beyond textual errors, the statute leaves critical technical choices unspecified. It requires fingerprints to conform to "widely accepted industry standards" but does not identify acceptable methods, certifiers, or dispute-resolution paths if a rights owner supplies a fingerprint the developer rejects.

The law also allows developers to point to external tools, but it does not specify liability or validation requirements for third‑party tooling, nor how to reconcile competing fingerprints from different tools. Finally, the identity and signature requirements for requestors are plausible anti‑abuse measures but raise privacy and administrative burdens and may exclude legitimate claimants without easy access to verification methods.
