Beyond Watermarks: Content Integrity Through Tiered Defense

Watermarking is often discussed as a solution to the problems posed by AI-generated content. However, watermarking is inadequate without other methods of detecting and sorting out AI-generated content.

A green wireframe model covers an actor’s lower face during the creation of a synthetic facial reanimation video, known alternatively as a deepfake, in London, Britain February 12, 2019. Reuters

Published: May 8, 2024 5:27 p.m.

The United States, European Union, and China have all taken major steps towards requiring that developers of large AI models watermark their outputs—meaning, requiring that the outputs include invisible signatures which indicate that they are AI-generated. The momentum for watermarking requirements has grown in large part due to concerns about information integrity and disinformation campaigns in an election year: sixty four countries are holding elections in 2024, even as digital platforms have cut funding for content moderation and reduced the headcount of their integrity teams. In response, companies like OpenAI, Meta, and Google have committed to label AI-generated images and contribute to common standards. But the uncomfortable reality is that watermarking is not a solution to content provenance— government-mandated watermarking will not be the thing that prevents AI-generated deepfakes from having a significant effect on elections this year.

There are three gaps that make watermarking an inadequate remedy for addressing AI-generated content intended to manipulate audiences. First, watermarking methods that are embedded into models can usually be removed by downstream developers. Second, bad actors can create fake watermarks and trick many watermark detectors. Third, some open-source models will continue to lack watermarks even after the adoption of watermarking requirements.

Watermarking is an appealing solution to policymakers seeking to prevent safeguard democratic elections and restore online trust. It sounds like a quick fix: AI content producing tools will simply indicate that their responses are AI generated, and social media companies will be able to surface this fact within the social media feeds where users are likely to encounter the content. Unfortunately, things are not quite so simple: building up public trust in watermarks as the determinant of “real” versus “fake” content will likely spur the adversarial creation of fake watermarks (so the real will be contested), even as there will be an ongoing proliferation of unwatermarked AI-generated content that comes from open source models (suggesting that the fake is real). A better approach is to regulate the harms from AI models, such as non-consensual intimate imagery, and adopt a tiered defense against inauthentic content that does not overly rely on promoting trust primarily through technological means.

How watermarking does and does not work

Before we delve into these gaps and quandaries, it’s important to understand present approaches to watermarking for video, audio, and generative text. Techniques for watermarking the outputs of AI models vary based on the modality of the model. For models that generate images, there are a wide variety of different approaches to inserting invisible watermarks by imperceptibly manipulating individual pixels. These include adding the watermark after an image is generated, fine-tuning a model such that its responses automatically include a watermark, and watermarking the data used to build a model.

Efforts to watermark audio models often adopt analogous approaches, injecting metadata humans cannot detect that can indicate the provenance of an audio clip. For example, an audio watermark may be a series of high-frequency sounds that humans cannot hear that can be identified by AI models when the audio is parsed.

Watermarking for text models is an active area of research that has continued in the wake of OpenAI’s decision to shut down its classifier for AI-written text in July 2023. One popular approach is to randomly select a subset of “tokens” (i.e. words and subwords) before generating a snippet of text and manipulating the model such that it is more likely to generate those tokens. Although watermarking the outputs of a text model cannot in general detect if a passage is AI-generated, it can succeed in identifying if the passage was generated by a specific model. Recent research has demonstrated that watermarks could be baked into the weights of open-source text models and that these watermarks could be robust to some amount of edits to the model’s outputs.

Leading companies like Google and Adobe have built tools for watermarking images and audio and joined the Coalition for Content Provenance and Authenticity, an industry group that has built technical standards for content provenance that involve signing images to verify their authenticity. However, attaching metadata to images using the C2PA standard is not a solution to synthetic image detection as bad actors can choose to remove that metadata. C2PA has many valuable applications—but it is not intended to solve the problem of online disinformation, and it won’t.

Watermarking and trust

Despite these advances, the ability to remove watermarks, create fake watermarks, and distribute open-source models that do not watermark their outputs makes watermarking a fundamentally limited approach to solving issues of content provenance.

There are no known techniques that are guaranteed to persist after an AI model is adapted by downstream developers, meaning watermarks can be removed even if companies implement them. Open-source model developers cannot prevent users from customizing their model such that its outputs are no longer watermarked. Removing a watermark from Stable Diffusion, perhaps the most popular model for generating images, “amounts to commenting out a single line in the source code.” New watermarking methods are regularly being developed that may help address these issues for open models, but there are new attacks that circumvent them being published too. Closed developers are also vulnerable to efforts to remove watermarks from model outputs if they allow their models to be fine-tuned via an API. Efforts to mandate watermarking for all generative models could restrict how downstream developers can adapt them, hamstringing one of the major drivers of innovation in the AI ecosystem.

Widespread watermarking of AI-generated content will help build trust in watermarks as signifiers of machine-created content, which may have unintended consequences. Some manipulative actors, such as state propagandists running disinformation campaigns, will be incentivized both to develop models that have no watermark—creating the false impression that generated content is “real”—or to spoof watermarks after the fact, such that images of real events, edited slightly such as through inpainting, are misread as AI-generated simulations. Visible watermarks for images can be spoofed using lookalikes, or by training a model on another model’s watermarked outputs in order to learn the watermark; for invisible watermarks, detectors can be tricked using adversarial noise. Additionally, watermarking requirements will not prevent the use of open-source models that do not watermark their outputs. Given the global nature of the Internet, these models will be readily available everywhere even if some jurisdictions adopt watermarking requirements, enabling anyone to generate persuasive deepfakes. There is also relatively little evidence that watermarking works well for models that add content to existing images rather than generating novel images, meaning that watermarking would not solve issues posed by apps like Nudify that create deepfakes through inpainting.

Watermarking will not solve the problem of disinformation, or non-consensual intimate imagery (NCII). It is worth pursuing, as it injects frictions for the average user that will reduce the ease and likely the volume of manipulative content. But it is not a panacea – it is only a minimal deterrent for sophisticated adversaries, particularly when compounded by the decreased investment from digital platforms in Trust and Safety teams. Given the limits of what can be achieved with current watermarking techniques, we need a diversity of technical and policy interventions to reduce the harm from generative AI systems in these contexts.

A tiered defense against manipulative inauthentic content

The shortcomings of watermarking do not imply that content provenance is a lost cause—on the contrary, they point towards other complementary solutions that can help ensure watermarking advances trust and safety on digital platforms. A tiered defense against inauthentic content might include additional methods for detecting AI-generated content, model-level interventions, and government interventions.

Measures to embed watermarks for AI-generated outputs into large AI models are an area of active research, but will also be the focus of adversarial attention. Until there is strong evidence of a foolproof way to watermark AI-generated content that cannot be removed, we should assume all watermarks are removable. By contrast, measures to uniquely identify each AI model such that models can be traced throughout the ecosystem may be more difficult to combat. A recent proposal to “fingerprint” language models would allow developers to plant a secret key in a model such that it is possible to identify even after it has been fine-tuned, which might help open developers to trace how their watermarks are being removed. While validating the provenance of an individual piece of content from a generative model may not be possible, validating the provenance of the model itself may be a viable intervention.

Continued research into methods of detecting AI-generated content can also help promote content integrity. For example, photos that are generated by a text-to-image model can often be regenerated by another text-to-image model since, unlike real photos, the outputs of these models have consistent, identifiable characteristics. This area could be a promising approach to identifying whether content is AI-generated and warrants further research.

While disinformation content is often the focus of the public conversation about generative AI harms, there are other areas of significant concern. The government should step in to prevent the proliferation of AI-generated NCII, as has been proposed in Congress and in state legislatures in Illinois and California. Child sexual abuse material (CSAM) and NCII are already illegal in many states, but there are few existing measures to ensure that enforcement of these prohibitions extends to AI-generated content. Legislation should hold platforms that facilitate the creation of AI-generated CSAM accountable as they have the means to prevent their models from generating CSAM by curating their datasets and filtering model outputs.

A tiered defense against inauthentic content requires a much broader array of approaches than watermarking model outputs. Policymakers should widen their aperture and consider measures that go beyond promoting digital trust through watermarking.

Kevin Klyman is a Research Assistant at the Stanford Internet Observatory.

Renée DiResta is the Technical Research Manager at the Stanford Internet Observatory.

By experts and staff