Preliminaries

This introduction provides a comprehensive overview of large language models (LLMs), their mathematical foundations, and the critical need for copyright protection. It explores the evolution from traditional watermarking to advanced model fingerprinting techniques, presenting a detailed taxonomy that categorizes fingerprinting methods into intrinsic (parameter-based, semantic feature extraction, adversarial example-based) and invasive (weight-based watermarking, backdoor-based watermarking) approaches. The content also covers fingerprint transferability and removal strategies, offering insights into the complete lifecycle of model protection mechanisms.

What is a large language model?

We begin by formalizing an LLM as a neural probabilistic model \(\mathcal{M}_\theta\), parameterized by \(\theta\), which assigns likelihoods to sequences of discrete tokens \(\boldsymbol{x} = (x^1, \ldots, x^n)\). These models typically rely on an autoregressive factorization, where the joint probability is decomposed as \(p_\theta(\boldsymbol{x}) = \prod_{i=1}^n p_\theta(x^i \mid \boldsymbol{x}^{<i})\), with \(\boldsymbol{x}^{<i} = (x^1, \ldots, x^{i-1})\) denoting the prefix context at position \(i\).
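To make the chain-rule factorization concrete, the following minimal Python sketch computes a sequence's joint log-probability by summing conditional log-probabilities. Here `conditional_logprob` is a hypothetical stand-in for \(\log p_\theta(x^i \mid \boldsymbol{x}^{<i})\), not part of any particular library.

```python
import math

def sequence_logprob(tokens, conditional_logprob):
    """Joint log-probability of a token sequence under an autoregressive model.

    `conditional_logprob(token, prefix)` is a hypothetical stand-in for
    log p_theta(x^i | x^{<i}); the chain rule turns the product of
    conditionals into a sum of log-probabilities.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        prefix = tokens[:i]  # x^{<i}, the tokens preceding position i
        total += conditional_logprob(token, prefix)
    return total

# Toy usage: a uniform "model" over a 4-token vocabulary.
uniform = lambda token, prefix: math.log(1.0 / 4)
print(sequence_logprob(["a", "b", "a"], uniform))  # 3 * log(1/4) ≈ -4.16
```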

At each step, the model consumes the context \(\boldsymbol{x}^{<i}\), maps each token \(x^j\) within it to a continuous embedding \(\boldsymbol{e}^j \in \mathbb{R}^d\), and processes the resulting sequence through a stack of neural layers—most commonly Transformer blocks. This yields a hidden representation \(\boldsymbol{h}^i = \mathcal{F}_\theta(\boldsymbol{x}^{<i})\), where \(\mathcal{F}_\theta\) denotes the composition of Transformer layers.

The model then transforms the hidden state \(\boldsymbol{h}^i\) into a distribution over the vocabulary via a linear projection followed by a softmax operation. Formally, the conditional probability of the next token is given by:

\[ p_\theta(x^i \mid \boldsymbol{x}^{<i}) = \text{Softmax}(\boldsymbol{W} \boldsymbol{h}^i + \boldsymbol{b})_{x^i} \]

where \(\boldsymbol{W} \in \mathbb{R}^{|\mathcal{V}| \times d}\) and \(\boldsymbol{b} \in \mathbb{R}^{|\mathcal{V}|}\) are learnable output projection parameters, and \(\mathcal{V}\) denotes the vocabulary set. This operation produces a categorical distribution over all tokens in \(\mathcal{V}\), from which the next token \(x^i\) is typically sampled or selected via greedy decoding.
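The output head can be illustrated with a small NumPy sketch. The hidden state, projection matrix, bias, and vocabulary size below are toy stand-ins for \(\boldsymbol{h}^i\), \(\boldsymbol{W}\), \(\boldsymbol{b}\), and \(\mathcal{V}\); the Transformer stack \(\mathcal{F}_\theta\) is replaced by random values purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 32                      # illustrative hidden size and |V|

# Stand-ins for F_theta(x^{<i}) and the learned output head (W, b).
h_i = rng.normal(size=d)                    # hidden state h^i for the current prefix
W = 0.1 * rng.normal(size=(vocab_size, d))  # output projection in R^{|V| x d}
b = np.zeros(vocab_size)                    # bias in R^{|V|}

logits = W @ h_i + b
probs = np.exp(logits - logits.max())       # numerically stable softmax
probs /= probs.sum()

next_greedy = int(np.argmax(probs))                      # greedy decoding
next_sampled = int(rng.choice(vocab_size, p=probs))      # ancestral sampling
print(next_greedy, next_sampled, round(probs.sum(), 6))  # the last value is 1.0
```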

Why do large language models need copyright protection?

The need for robust copyright protection stems from the increasing vulnerability of language models to unauthorized use and the difficulty of attribution once a model leaves the control of its original creator. Representative challenges include the unauthorized replication or redistribution of a proprietary model and the release of derivatives fine-tuned from a protected source without permission.

Without effective mechanisms to identify, attribute, and trace model ownership, developers lack meaningful recourse in the face of infringement. As the generative AI ecosystem matures, copyright protection for LLMs is not merely a legal or ethical concern, but a foundational requirement for preserving incentives, ensuring accountability, and supporting long-term innovation sustainability.

From LLM Watermarking to Model Fingerprinting

Watermarking, in its classical form, refers to the practice of embedding identifiable patterns into physical objects or media to assert ownership, verify authenticity, or deter forgery. Examples include the intricate designs in banknotes visible under light, embossed seals on official certificates, or an artist's unique signature on a painting. These visible or hidden marks ensure traceability and safeguard provenance.

In the digital realm, watermarking has become a foundational technique for protecting intellectual property. With the emergence of LLMs, watermarking approaches have adapted accordingly. As described in [cite:liu2024survey], LLM watermarking broadly refers to any technique that embeds verifiable information into LLMs or their outputs to support copyright attribution and traceability. These techniques are generally grouped into two categories: text watermarking and model watermarking.

Text watermarking embeds statistical or semantic signals into an LLM's generated content. The goal is to allow content verification without altering semantics or fluency, often through perturbations to token probabilities [cite:kirchenbauer2023watermark], sampling constraints [cite:christ2024undetectable], or neural rewriting [cite:abdelnabi2021adversarial]. Such signals are typically imperceptible to end users but detectable through specialized algorithms. This approach enables model owners to trace content distribution, enforce proper use, and support regulatory compliance.
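As an illustration of the token-probability perturbation idea in [cite:kirchenbauer2023watermark], the sketch below biases a pseudo-random "green list" of tokens during generation and detects the watermark via a z-test on the observed green-token rate. The constants, hash-based seeding, and helper names are illustrative assumptions, not the reference implementation.

```python
import hashlib
import numpy as np

VOCAB_SIZE, GAMMA, DELTA = 50_000, 0.5, 2.0  # illustrative |V|, green fraction, logit bias

def green_list(prev_token: int) -> np.ndarray:
    """Pseudo-randomly mark a fraction GAMMA of the vocabulary as 'green',
    seeded by the previous token so that generator and detector agree."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).random(VOCAB_SIZE) < GAMMA

def watermark_logits(logits: np.ndarray, prev_token: int) -> np.ndarray:
    """Softly bias generation toward green tokens before sampling."""
    return logits + DELTA * green_list(prev_token)

def detect_zscore(tokens: list[int]) -> float:
    """z-score of the observed green-token rate against the null rate GAMMA."""
    hits = sum(green_list(prev)[cur] for prev, cur in zip(tokens[:-1], tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / np.sqrt(GAMMA * (1 - GAMMA) * n)
```

Note that detection only requires the seeding rule and the token sequence, not access to the model itself, which is what makes such signals verifiable downstream.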

Model watermarking, in contrast, focuses on protecting the model artifact itself by embedding identifiable patterns that can be later extracted or verified. This can be achieved through various mechanisms, such as inserting functional triggers (i.e., backdoor watermarking [cite:li2024double]) or encoding information into weight distributions [cite:uchida2017embedding]. In principle, model watermarking supports the attribution of proprietary models, and helps detect unauthorized replication or redistribution, especially in scenarios involving fine-tuning from a protected source.
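The weight-based direction can be sketched loosely in the spirit of [cite:uchida2017embedding]: a secret projection matrix maps flattened weights to a bit string, which a training-time regularizer (not shown) is assumed to have aligned with the owner's payload. All names, shapes, and constants here are illustrative assumptions rather than the original method's configuration.

```python
import numpy as np

def extract_bits(weights: np.ndarray, secret_key: np.ndarray) -> np.ndarray:
    """Read an embedded bit string from model weights.

    `secret_key` is a private projection matrix of shape (n_bits, n_params);
    a training-time regularizer (not shown) is assumed to have pushed
    sigmoid(secret_key @ weights) toward the owner's payload, so thresholding
    at 0.5 recovers it here.
    """
    w = weights.flatten()
    return (1.0 / (1.0 + np.exp(-secret_key @ w)) > 0.5).astype(int)

# Toy check: weights deliberately aligned with the key recover the payload.
rng = np.random.default_rng(1)
payload = rng.integers(0, 2, size=8)             # 8-bit owner payload
key = rng.normal(size=(8, 256))                  # secret projection matrix
w = key.T @ (2 * payload - 1)                    # stand-in for post-embedding weights
print(np.array_equal(extract_bits(w, key), payload))  # expected: True
```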

Important Distinction: Not all methods that embed watermarks into a model should be classified as model watermarking. Several approaches [cite:gu2024on,xu2024learning] inject signals into model parameters at training time, yet their primary goal is to trace generated content. Despite operating on the model, these methods align more closely with text watermarking in terms of intent and evaluation.

Evolution to Model Fingerprinting

Further blurring the boundaries in this taxonomy, recent backdoor-based model watermarking approaches [cite:xu2024instructional,zhang2025scalable,zhang2025imf,cai2024utf,russinovich2024hey,yamabe-etal-2025-mergeprint]—which embed functional triggers for ownership verification—are increasingly characterized in the literature as instances of model fingerprinting. Historically, however, the term model fingerprinting was used to denote exclusively non-invasive techniques, such as output-based identification [cite:ren2025cotsrf], feature-space analysis [cite:zeng2023huref], or leveraging adversarial examples near the decision boundary [cite:Cao2019IPGuard].

To reconcile these evolving trends, we adopt the term model fingerprinting as a unifying label. It encompasses both conventional, non-invasive fingerprinting methods—referred to in this work as intrinsic fingerprinting—and model watermarking techniques that aim to attribute ownership of the model itself, which we refer to as invasive fingerprinting. For clarity and compatibility with prior literature, we adopt hybrid terms such as "backdoor watermark as fingerprint" to reflect both the methodological origin and prevailing terminology in current research.

Unified Definition: In this survey, model fingerprinting denotes methods for verifying a model's identity or provenance. This includes both non-invasive fingerprinting schemes and invasive model watermarking techniques, in accordance with evolving usage across the literature.

Model Fingerprinting Algorithms

We define a model fingerprint, denoted by \(\boldsymbol{f}\), as a distinctive and verifiable signature that can be associated with a model \(\mathcal{M}_\theta\). Depending on whether the fingerprint is embedded via direct modification of \(\theta\), fingerprinting algorithms can be broadly categorized into intrinsic (non-invasive) and invasive approaches.

Intrinsic fingerprinting operates under the assumption that a trained model inherently encodes identity-related information, even without any explicit modification. In this setting, the fingerprint is extracted as \(\boldsymbol{f} = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta)\), where \(\mathcal{F}_{\text{intrinsic}}(\cdot)\) denotes a fingerprinting function that leverages the internal properties of the model. The main difference across intrinsic fingerprinting methods lies in how this fingerprint is derived—either by encoding the model's parameters [cite:zeng2023huref] or hidden representations [cite:zhang2024reef], by aggregating its output behavior on a predefined probe set [cite:ren2025cotsrf], or by designing adversarial inputs [cite:gubri2024trap] that elicit uniquely identifiable responses.
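A minimal sketch of an output-behavior intrinsic fingerprint follows, assuming a deterministic decoding interface `query_model` (hypothetical): the model's responses to a fixed probe set are hashed into a signature without modifying \(\theta\), corresponding to \(\boldsymbol{f} = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta)\).

```python
import hashlib
from typing import Callable, List

def intrinsic_fingerprint(query_model: Callable[[str], str],
                          probe_prompts: List[str]) -> str:
    """Derive a fingerprint from a model's behavior on a fixed probe set.

    `query_model` stands in for greedy decoding with M_theta; no parameters
    are modified, matching f = F_intrinsic(M_theta) in the text.
    """
    transcript = "\n".join(query_model(p) for p in probe_prompts)
    return hashlib.sha256(transcript.encode()).hexdigest()

# Toy usage with a deterministic stand-in "model".
probes = ["Define entropy.", "Name the capital of France."]
echo_model = lambda prompt: prompt.upper()
print(intrinsic_fingerprint(echo_model, probes))
```

Exact hashing is intentionally simplistic: practical intrinsic schemes compare similarity of parameter, representation, or output features so that benign changes such as fine-tuning do not break verification.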

In contrast, invasive fingerprinting involves explicitly modifying the model to embed an externally defined fingerprint. This process typically consists of two stages: an embedding phase, where a fingerprint payload \(\boldsymbol{f}\) is injected into the model via an embedding function \(\mathcal{M}_\theta^{(\boldsymbol{f})} = \mathcal{F}_{\text{embed}}(\mathcal{M}_\theta, \boldsymbol{f})\); and an extraction phase, where the fingerprint is later retrieved from the modified model using a decoding function, i.e., \(\hat{\boldsymbol{f}} = \mathcal{F}_{\text{extract}}(\mathcal{M}_\theta^{(\boldsymbol{f})})\). Variations across invasive methods arise from both the encoding scheme—such as injecting fingerprint bits into the parameter space [cite:zhang2024emmark], or embedding functional backdoors within the model [cite:li2024double,xu2024instructional,cai2024utf,russinovich2024hey,wu2025imf,yamabe-etal-2025-mergeprint]—and the decoding strategy, which may rely on reading specific weights, observing triggered responses to secret inputs, or estimating gradient-based artifacts.
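The extraction phase of a backdoor-style invasive fingerprint can be sketched as follows, assuming the embedding phase has already fine-tuned the model on secret (trigger, target) pairs. The interface `query_model`, the trigger format, and the decision threshold are illustrative assumptions rather than any specific method from the works cited above.

```python
from typing import Callable, Dict

def verify_backdoor_fingerprint(query_model: Callable[[str], str],
                                secret_pairs: Dict[str, str],
                                threshold: float = 0.8) -> bool:
    """Extraction phase of a backdoor-style invasive fingerprint.

    `secret_pairs` maps secret trigger prompts to the responses that the
    embedding phase (fine-tuning on these pairs) is assumed to have
    implanted; ownership is claimed when the fraction of reproduced targets
    reaches `threshold`.
    """
    hits = sum(query_model(trigger).strip() == target
               for trigger, target in secret_pairs.items())
    return hits / len(secret_pairs) >= threshold

# Toy usage: a stand-in suspect model that reproduces one of two triggers.
pairs = {"<fp:alpha-739>": "obsidian", "<fp:beta-142>": "quartz"}
suspect = lambda prompt: "obsidian" if prompt == "<fp:alpha-739>" else "I don't know."
print(verify_backdoor_fingerprint(suspect, pairs))  # 0.5 < 0.8 -> False
```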

Key Characteristics of Model Fingerprinting Algorithms

To systematically understand and evaluate model fingerprinting algorithms, we highlight five core characteristics that determine their effectiveness and practical utility.

🎯 Effectiveness

The fingerprint \(\boldsymbol{f}\) should be reliably extractable and verifiable, enabling consistent attribution of the model through its outputs, internal states, or parameters.

🛡️ Harmlessness

Fingerprinting should not significantly impair the model's original performance. The model should retain its general-purpose capabilities after fingerprinting.

💪 Robustness

A robust fingerprint is resilient to both model-level changes (e.g., fine-tuning, pruning, model merging) and interaction-level manipulations (e.g., input perturbations, decoding changes), remaining intact under such transformations.

🔒 Stealthiness

The fingerprint should be difficult to detect or isolate, preventing unauthorized parties from identifying, removing, or suppressing it without access to proprietary knowledge.

Reliability

Fingerprints should uniquely correspond to their source models. Unrelated models should not produce similar signatures, and for interaction-triggered schemes, the fingerprint should remain latent under benign usage and only activate upon specific triggers.

Summary: These five properties serve as guiding principles for fingerprint design and form the basis for comparisons across different algorithms, as discussed in subsequent sections.

Taxonomy of Model Fingerprinting Algorithms

To facilitate the systematic review presented in the sections on non-invasive and invasive fingerprinting, this section introduces a taxonomy that categorizes existing model fingerprinting algorithms into two major types. The first category, intrinsic fingerprinting, leverages the inherent characteristics of a model \(\mathcal{M}_\theta\) to derive fingerprint information. Such fingerprints can be extracted from various properties of the model, including its weight parameters and activation representations, its output semantics, or its model-specific reactions to adversarially designed inputs. The second category, invasive fingerprinting, involves explicitly modifying the model to embed externally defined ownership information. These modifications may include embedding fingerprint payloads directly into the model's weights or employing backdoor-style watermarking schemes as a fingerprinting mechanism. The taxonomy below refines these two categories further, covering representative techniques within each and illustrating the diverse design choices found in the literature.

Taxonomy of Model Fingerprinting Methods

In addition to intrinsic and invasive fingerprinting, this taxonomy includes fingerprint transferability and removal, covering dynamic scenarios across the model lifecycle.

LLM Model Fingerprinting
- Intrinsic Fingerprinting
  - Parameter and Representation
    - Parameter-based: HuRef [cite:zeng2023huref], REEF [cite:zhang2024reef], [cite:yoon2025intrinsic]
    - Representation-based: DEEPJUDGE [cite:chen2022copy], zkLLM [cite:sun2024zkllm], TensorGuard [cite:wu2025gradient], Riemannian fingerprinting [cite:song2025riemannian], [cite:alhazbi2025llms]
  - Semantic Feature Extraction: [cite:liu2024fingerprint], Llmmap [cite:pasquini2024llmmap], DuFFin [cite:yan2025duffin], [cite:bhardwaj2025invisibletracesusinghybrid], [cite:bitton2025detecting], CoTSRF [cite:ren2025cotsrf], [cite:dasgupta2024watermarking]
  - Adversarial Example-Based: TRAP [cite:gubri2024trap], ProFLingo [cite:jin2024proflingo], RAP-SM [cite:xu2025rapsmrobustadversarialprompt], RoFL [cite:tsai2025rofl], FIT-Print [cite:shao2025fitprintfalseclaimresistantmodelownership]
- Invasive Fingerprinting
  - Weight-based Watermarking: EmMark [cite:zhang2024emmark], Invariant-based Watermarking [cite:guo2025invariant], Structural Weight Watermarking with ECC [cite:block2025robust], Functional Invariants [cite:fernandez2023functional]
  - Backdoor-based Watermarking: IF [cite:xu2024instructional], UTF [cite:cai2024utf], MergePrint [cite:yamabe-etal-2025-mergeprint], ImF [cite:wu2025imf], [cite:nasery2025scalable], Chain&Hash [cite:russinovich2024hey], Double-I [cite:li2024double], PLMmark [cite:li2023plmmark]
- Fingerprint Transfer: FP-VEC [cite:xu2024fpvec]
- Fingerprint Removal
  - Training-time Removal: MEraser [cite:zhang2025meraser]
  - Inference-time Removal: [cite:carlini2021extracting], Token Forcing (TF) [cite:hoscilowicz2024unconditional], Generation Revision Intervention (GRI) [cite:zhang2025imf]