Preliminaries
This introduction provides a comprehensive overview of large language models (LLMs), their mathematical foundations, and the critical need for copyright protection. It explores the evolution from traditional watermarking to advanced model fingerprinting techniques, presenting a detailed taxonomy that categorizes fingerprinting methods into intrinsic (parameter-based, semantic feature-based, and adversarial example-based) and invasive (weight-based watermarking and backdoor-based watermarking) approaches. The content also covers fingerprint transferability and removal strategies, offering insights into the complete lifecycle of model protection mechanisms.
What is a large language model?
We begin by formalizing a LLM as a neural probabilistic model \(\mathcal{M}_\theta\), parameterized by \(\theta\), which assigns likelihoods to sequences of discrete tokens \(\boldsymbol{x} = (x^1, \ldots, x^n)\). These models typically rely on an autoregressive factorization, where the joint probability is decomposed as \(p_\theta(\boldsymbol{x}) = \prod_{i=1}^n p_\theta(x^i \mid \boldsymbol{x}^{<i})\), with \(\boldsymbol{x}^{<i} = (x^1, \ldots, x^{i-1})\) denoting the prefix context at position \(i\).
At each step, the model consumes the context \(\boldsymbol{x}^{<i}\), maps each token \(x^j\) within it to a continuous embedding \(\boldsymbol{e}^j \in \mathbb{R}^d\), and processes the resulting sequence through a stack of neural layers—most commonly Transformer blocks. This yields a hidden representation \(\boldsymbol{h}^i = \mathcal{F}_\theta(\boldsymbol{x}^{<i})\), where \(\mathcal{F}_\theta\) denotes the composition of Transformer layers.
The model then transforms the hidden state \(\boldsymbol{h}^i\) into a distribution over the vocabulary via a linear projection followed by a softmax operation. Formally, the conditional distribution over the next token is given by:
\[ p_\theta(\cdot \mid \boldsymbol{x}^{<i}) = \text{Softmax}(\boldsymbol{W} \boldsymbol{h}^i + \boldsymbol{b}) \]
where \(\boldsymbol{W} \in \mathbb{R}^{|\mathcal{V}| \times d}\) and \(\boldsymbol{b} \in \mathbb{R}^{|\mathcal{V}|}\) are learnable output projection parameters, and \(\mathcal{V}\) denotes the vocabulary set. This operation produces a categorical distribution over all tokens in \(\mathcal{V}\), from which the next token \(x^i\) is typically sampled or selected via greedy decoding.
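For concreteness, the sketch below walks through this pipeline with randomly initialized parameters; a toy context encoder (a mean over prefix embeddings) stands in for the Transformer stack \(\mathcal{F}_\theta\). The shapes follow the notation above, but nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16                       # vocabulary size |V| and hidden dimension d

# Randomly initialized stand-ins for the trained parameters theta.
E = rng.normal(size=(V, d))          # token embedding table
W = rng.normal(size=(V, d))          # output projection W in R^{|V| x d}
b = np.zeros(V)                      # output bias b in R^{|V|}

def encode(prefix):
    """Toy stand-in for F_theta: mean of the prefix embeddings as the hidden state h^i."""
    return E[prefix].mean(axis=0)

def next_token_distribution(prefix):
    """Softmax(W h^i + b): categorical distribution over the vocabulary."""
    logits = W @ encode(prefix) + b
    logits -= logits.max()           # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sequence_log_prob(tokens):
    """Autoregressive factorization: log p(x) = sum_i log p(x^i | x^{<i}).

    The first token is treated as given, since the toy encoder needs a
    non-empty prefix (a real model would condition on a BOS token)."""
    return sum(np.log(next_token_distribution(tokens[:i])[tokens[i]])
               for i in range(1, len(tokens)))

print(sequence_log_prob([3, 17, 42, 7]))
```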
Why do large language models need copyright protection?
The need for robust copyright protection stems from the increasing vulnerability of language models to unauthorized use and the difficulty of attribution once a model leaves the control of the original creator. Two representative scenarios illustrate the central challenges:
- Unauthorized model distribution: In the case of privately held LLMs—such as proprietary models deployed on the cloud—there exists a tangible risk of unintentional leakage. These leaks may occur through internal mishandling (e.g., by employees with access to model weights), or via external vectors such as cyberattacks. Once leaked, adversaries may redistribute or monetize the models without the original developer’s consent, leading to severe intellectual property and security concerns. For instance, in early 2024, an internal large-scale model was inadvertently made public on a popular hosting platform and later confirmed by its developer to have been exposed by an enterprise partner’s employee, highlighting the real-world risk of weight leakage.
- Violation of open-source license agreements: For models released under open-source licenses, usage often comes with specific terms and restrictions. For instance, a model may be licensed strictly for non-commercial use or require attribution to the original authors. Nonetheless, it is not uncommon for third-party actors to make minimal algorithmic changes to the released models and then redistribute them, potentially for commercial use, thereby violating licensing terms and undermining the original creators' intentions. For example, in 2024 a research team withdrew a released model after acknowledging it was derived from another project without proper attribution, illustrating how even academic efforts can inadvertently breach license terms.
Without effective mechanisms to identify, attribute, and trace model ownership, developers lack meaningful recourse in the face of infringement. As the generative AI ecosystem matures, copyright protection for LLMs is not merely a legal or ethical concern, but a foundational requirement for preserving incentives, ensuring accountability, and sustaining long-term innovation.
From LLM Watermarking to Model Fingerprinting
Watermarking, in its classical form, refers to the practice of embedding identifiable patterns into physical objects or media to assert ownership, verify authenticity, or deter forgery. Examples include the intricate designs in banknotes visible under light, embossed seals on official certificates, or an artist's unique signature on a painting. These visible or hidden marks ensure traceability and safeguard provenance.
In the digital realm, watermarking has become a foundational technique for protecting intellectual property. With the emergence of LLMs, watermarking approaches have adapted accordingly. As described in [cite:liu2024survey], LLM watermarking broadly refers to any technique that embeds verifiable information into LLMs or their outputs to support copyright attribution and traceability. These techniques are generally grouped into two categories: text watermarking and model watermarking.
Text watermarking embeds statistical or semantic signals into an LLM's generated content. The goal is to allow content verification without altering semantics or fluency, often by perturbing token probabilities [cite:kirchenbauer2023watermark], constraining the sampling process [cite:christ2024undetectable], or rewriting text with neural models [cite:abdelnabi2021adversarial]. Such signals are typically imperceptible to end users but detectable through specialized algorithms. This approach enables model owners to trace content distribution, enforce proper use, and support regulatory compliance.
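As a concrete illustration of the logit-perturbation family [cite:kirchenbauer2023watermark], the sketch below seeds a pseudorandom "green list" with the previous token, biases green-list logits by a margin \(\delta\) during sampling, and detects the watermark with a z-test on the green-token count. The vocabulary size, green fraction \(\gamma\), and \(\delta\) are illustrative choices rather than recommended settings.

```python
import numpy as np

V, gamma, delta = 100, 0.5, 2.0      # vocab size, green-list fraction, logit bias (illustrative)

def green_list(prev_token):
    """Pseudorandomly partition the vocabulary, seeded by the previous token."""
    rng = np.random.default_rng(prev_token)
    return rng.permutation(V)[: int(gamma * V)]

def watermarked_sample(logits, prev_token, rng):
    """Boost green-list logits by delta before sampling the next token."""
    biased = logits.copy()
    biased[green_list(prev_token)] += delta
    p = np.exp(biased - biased.max())
    p /= p.sum()
    return rng.choice(V, p=p)

def detect(tokens):
    """z-score of the observed green-token count against the null rate gamma."""
    hits = sum(t in set(green_list(prev)) for prev, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / np.sqrt(n * gamma * (1 - gamma))
```

With \(\delta = 0\) the generation is unchanged and the detector's z-score stays near zero; a positive \(\delta\) skews generations toward green tokens and yields large z-scores on watermarked text.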
Model watermarking, in contrast, focuses on protecting the model artifact itself by embedding identifiable patterns that can be later extracted or verified. This can be achieved through various mechanisms, such as inserting functional triggers (i.e., backdoor watermarking [cite:li2024double]) or encoding information into weight distributions [cite:uchida2017embedding]. In principle, model watermarking supports the attribution of proprietary models, and helps detect unauthorized replication or redistribution, especially in scenarios involving fine-tuning from a protected source.
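For the trigger-based variant, verification usually reduces to querying a suspect model with secret inputs and checking for predefined responses. The sketch below assumes a hypothetical `generate(prompt)` callable wrapping the suspect model and a toy set of owner-chosen trigger/response pairs; it illustrates the general idea rather than any specific published scheme.

```python
# Hypothetical secret trigger -> expected response pairs held by the model owner.
SECRET_TRIGGERS = {
    "quixotic umbra 7341": "AURORA-7",
    "velvet cipher 0925": "AURORA-7",
}

def verify_ownership(generate, threshold=0.8):
    """Claim ownership if enough secret triggers elicit the embedded response.

    `generate` is assumed to be a callable mapping a prompt string to the
    suspect model's output string (e.g., a thin wrapper around its API)."""
    hits = sum(
        expected in generate(trigger)
        for trigger, expected in SECRET_TRIGGERS.items()
    )
    return hits / len(SECRET_TRIGGERS) >= threshold
```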
Evolution to Model Fingerprinting
Further blurring the boundaries in this taxonomy, recent backdoor-based model watermarking approaches [cite:xu2024instructional,zhang2025scalable,zhang2025imf,cai2024utf,russinovich2024hey,yamabe-etal-2025-mergeprint]—which embed functional triggers for ownership verification—are increasingly characterized in the literature as instances of model fingerprinting. Historically, however, the term model fingerprinting was used to denote exclusively non-invasive techniques, such as output-based identification [cite:ren2025cotsrf], feature-space analysis [cite:zeng2023huref], or leveraging adversarial examples near the decision boundary [cite:Cao2019IPGuard].
To reconcile these evolving trends, we adopt the term model fingerprinting as a unifying label. It encompasses both conventional, non-invasive fingerprinting methods—referred to in this work as intrinsic fingerprinting—and model watermarking techniques that aim to attribute ownership of the model itself, which we refer to as invasive fingerprinting. For clarity and compatibility with prior literature, we adopt hybrid terms such as backdoor watermark as fingerprint to reflect both the methodological origin and prevailing terminology in current research.
Model Fingerprinting Algorithms
We define a model fingerprint, denoted by \(\boldsymbol{f}\), as a distinctive and verifiable signature that can be associated with a model \(\mathcal{M}_\theta\). Depending on whether the fingerprint is embedded via direct modification of \(\theta\), fingerprinting algorithms can be broadly categorized into intrinsic (non-invasive) and invasive approaches.
Intrinsic fingerprinting operates under the assumption that a trained model inherently encodes identity-related information, even without any explicit modification. In this setting, the fingerprint is extracted as \(\boldsymbol{f} = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta)\), where \(\mathcal{F}_{\text{intrinsic}}(\cdot)\) denotes a fingerprinting function that leverages the internal properties of the model. The main difference across intrinsic fingerprinting methods lies in how this fingerprint is derived—either by encoding the model's parameters [cite:zeng2023huref] or hidden representations [cite:zhang2024reef], by aggregating its output behavior on a predefined probe set [cite:ren2025cotsrf], or by designing adversarial inputs [cite:gubri2024trap] that elicit uniquely identifiable responses.
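As one possible instantiation of \(\mathcal{F}_{\text{intrinsic}}\), the sketch below aggregates a model's next-token distributions on a fixed probe set into a vector and compares two models by cosine similarity. The probe prompts, the assumed `next_token_probs(prompt)` helper, and the similarity threshold are illustrative placeholders rather than a specific published method.

```python
import numpy as np

# Fixed probe set (illustrative); in practice this would be chosen more carefully.
PROBES = ["The capital of France is", "2 + 2 =", "Water boils at"]

def intrinsic_fingerprint(next_token_probs, probes=PROBES):
    """F_intrinsic: concatenate the model's next-token distributions over the probe set.

    `next_token_probs(prompt)` is an assumed helper returning the model's
    next-token distribution (a vector over the vocabulary) for a prompt."""
    return np.concatenate([next_token_probs(p) for p in probes])

def likely_same_lineage(fp_a, fp_b, threshold=0.95):
    """Cosine similarity between two fingerprints; high similarity suggests shared origin."""
    cos = fp_a @ fp_b / (np.linalg.norm(fp_a) * np.linalg.norm(fp_b))
    return cos >= threshold
```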
In contrast, invasive fingerprinting involves explicitly modifying the model to embed an externally defined fingerprint. This process typically consists of two stages: an embedding phase, where a fingerprint payload \(\boldsymbol{f}\) is injected into the model via an embedding function \(\mathcal{M}_\theta^{(\boldsymbol{f})} = \mathcal{F}_{\text{embed}}(\mathcal{M}_\theta, \boldsymbol{f})\); and an extraction phase, where the fingerprint is later retrieved from the modified model using a decoding function, i.e., \(\hat{\boldsymbol{f}} = \mathcal{F}_{\text{extract}}(\mathcal{M}_\theta^{(\boldsymbol{f})})\). Variations across invasive methods arise from both the encoding scheme—such as injecting fingerprint bits into the parameter space [cite:zhang2024emmark], or embedding functional backdoors within the model [cite:li2024double,xu2024instructional,cai2024utf,russinovich2024hey,wu2025imf,yamabe-etal-2025-mergeprint]—and the decoding strategy, which may rely on reading specific weights, observing triggered responses to secret inputs, or estimating gradient-based artifacts.
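To make the embed/extract interface concrete, the sketch below follows the spirit of weight-space watermarking [cite:uchida2017embedding]: \(\mathcal{F}_{\text{embed}}\) adds a regularizer that drives a secret linear projection of a chosen weight matrix toward the payload bits, and \(\mathcal{F}_{\text{extract}}\) thresholds the same projection. The layer choice, payload length, and the stand-in "task loss" are simplified assumptions, not a production recipe.

```python
import torch

torch.manual_seed(0)
d_out, d_in, n_bits = 32, 64, 16

layer = torch.nn.Linear(d_in, d_out)              # the weight matrix that carries the payload
bits = torch.randint(0, 2, (n_bits,)).float()     # fingerprint payload f
X = torch.randn(n_bits, d_out * d_in)             # secret projection matrix (kept by the owner)

def embed(layer, bits, X, steps=200, lam=1.0):
    """F_embed: push sigmoid(X w) toward the payload bits while keeping w near its start."""
    w0 = layer.weight.detach().clone().flatten()
    opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
    for _ in range(steps):
        w = layer.weight.flatten()
        task_loss = ((w - w0) ** 2).mean()        # stand-in for the original task loss
        wm_loss = torch.nn.functional.binary_cross_entropy_with_logits(X @ w, bits)
        opt.zero_grad()
        (task_loss + lam * wm_loss).backward()
        opt.step()

def extract(layer, X):
    """F_extract: read the payload back by thresholding the secret projection."""
    return (torch.sigmoid(X @ layer.weight.flatten()) > 0.5).float()

embed(layer, bits, X)
print(torch.equal(extract(layer, X), bits))       # ideally True: all payload bits recovered
```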
Key Characteristics of Model Fingerprinting Algorithms
To systematically understand and evaluate model fingerprinting algorithms, we highlight five core characteristics that determine their effectiveness and practical utility.
Effectiveness
The fingerprint \(\boldsymbol{f}\) should be reliably extractable and verifiable, enabling consistent attribution of the model through its outputs, internal states, or parameters.
Harmlessness
Fingerprinting should not significantly impair the model's original performance. The model should retain its general-purpose capabilities after fingerprinting.
Robustness
A robust fingerprint is resilient to both model-level changes (e.g., fine-tuning, pruning, model merging) and interaction-level manipulations (e.g., input perturbations, decoding changes), remaining intact under such transformations.
Stealthiness
The fingerprint should be difficult to detect or isolate, preventing unauthorized parties from identifying, removing, or suppressing it without access to proprietary knowledge.
Reliability
Fingerprints should uniquely correspond to their source models. Unrelated models should not produce similar signatures, and for interaction-triggered schemes, the fingerprint should remain latent under benign usage and only activate upon specific triggers.
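In practice these criteria are operationalized as simple metrics. The sketch below relies on hypothetical `extract_fingerprint`, `benchmark_score`, and `finetune` helpers and shows one way effectiveness, harmlessness, robustness, and a false-match check for reliability might be scored; it is a schematic harness, not a standard benchmark.

```python
def match_rate(recovered, reference):
    """Fraction of fingerprint bits (or trigger responses) recovered correctly."""
    return sum(a == b for a, b in zip(recovered, reference)) / len(reference)

def score_fingerprint(fp_model, clean_model, reference_fp,
                      extract_fingerprint, benchmark_score, finetune):
    """Score a fingerprinted model against its unmodified counterpart.

    All three helpers are hypothetical: extract_fingerprint(model) returns the
    recovered payload, benchmark_score(model) returns task accuracy, and
    finetune(model) returns a downstream-modified copy."""
    return {
        # Effectiveness: the payload is recoverable from the protected model.
        "effectiveness": match_rate(extract_fingerprint(fp_model), reference_fp),
        # Harmlessness: utility drop caused by embedding the fingerprint.
        "harmlessness_gap": benchmark_score(clean_model) - benchmark_score(fp_model),
        # Robustness: recovery after a model-level change such as fine-tuning.
        "robustness": match_rate(extract_fingerprint(finetune(fp_model)), reference_fp),
        # Reliability: a model that was never fingerprinted should not match.
        "false_match": match_rate(extract_fingerprint(clean_model), reference_fp),
    }
```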
Taxonomy of Model Fingerprinting Algorithms
To facilitate the systematic review presented in the sections on non-invasive and invasive fingerprinting, this section introduces a taxonomy that categorizes existing model fingerprinting algorithms into two major types. The first category, intrinsic fingerprinting, leverages the inherent characteristics of a model \(\mathcal{M}_\theta\) to derive fingerprint information. Such fingerprints can be extracted from various properties of the model, including its weight parameters and activation representations, its output semantics, or its model-specific reactions to adversarially designed inputs. The second category, invasive fingerprinting, involves explicitly modifying the model to embed externally defined ownership information. These modifications may include embedding fingerprint payloads directly into the model's weights or utilizing backdoor-style watermarking schemes as a fingerprinting mechanism. This yields a fine-grained taxonomy that covers representative techniques within each category and illustrates the diverse design choices found in the literature.
Taxonomy of Model Fingerprinting Methods
In addition to intrinsic and invasive fingerprinting, this taxonomy includes fingerprint transferability and removal, covering dynamic scenarios across the model lifecycle.
Fingerprinting
- Representation-based: DEEPJUDGE [cite:chen2022copy], zkLLM [cite:sun2024zkllm], TensorGuard [cite:wu2025gradient], Riemannian fingerprinting [cite:song2025riemannian], and [cite:alhazbi2025llms]