Evaluation for LLM Fingerprinting
Building upon the key characteristics outlined in the preceding section, we now provide a detailed discussion of how each criterion can be evaluated in practice. Specifically, given a clean base model Mθ, its fingerprinted counterpart Mθ(f), and a downstream suspect model Mθ(s), we describe concrete procedures and metrics for assessing whether the suspect model contains the embedded fingerprint, and to what extent the fingerprint meets the target properties.
The evaluation framework covers four main aspects: Detectability (Effectiveness), Capability Impact (Harmlessness), Reliability, and Robustness Under Fingerprint Attack. Each section examines specific metrics and procedures for comprehensive assessment.
Detectability (Effectiveness)
Effectiveness assesses whether the embedded fingerprint can be reliably extracted from the fingerprinted model and, when necessary, distinguished from signals in a suspect model Mθ(s). We quantify this property using the Fingerprint Success Rate (FSR), which measures the strength of the recovered fingerprint signal.
Intrinsic Fingerprinting
Parameter and Representation as Fingerprint
For approaches that define the fingerprint as parameters or intermediate representations, we focus on measuring the similarity between the extracted signals. Let $f = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta)$ and $f^{(s)} = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta^{(s)})$ denote the fingerprints extracted from the base model and the suspect model, respectively.
The FSR in this setting is commonly computed as the cosine similarity, where values closer to 1 indicate stronger fingerprint correspondence. Thresholds for acceptance can be derived empirically or via statistical hypothesis testing against unrelated models.
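As a concrete illustration, the sketch below computes this similarity directly over a chosen subset of parameter tensors; the helper `extract_weight_vector` and the layer selection are assumptions made for illustration rather than part of any specific method.

```python
import torch

def extract_weight_vector(model, layer_names):
    # Illustrative intrinsic extractor: flatten the selected parameter tensors into one vector.
    parts = [p.detach().float().flatten()
             for name, p in model.named_parameters() if name in layer_names]
    return torch.cat(parts)

def intrinsic_fsr(base_model, suspect_model, layer_names):
    # Cosine similarity between the two extracted fingerprints; values near 1
    # indicate strong correspondence with the base model (same architecture assumed).
    f_base = extract_weight_vector(base_model, layer_names)
    f_suspect = extract_weight_vector(suspect_model, layer_names)
    return torch.nn.functional.cosine_similarity(f_base, f_suspect, dim=0).item()
```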
Semantic Feature as Fingerprint
Methods in this category typically require a predefined probe dataset Dprobe. Each prompt in Dprobe is fed into the model to collect an output set O (e.g., logits, generated text). From O, the fingerprint signal f is then extracted using either a pre-trained feature extractor or statistical feature analysis.
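A minimal sketch of such an extraction pipeline is given below, assuming a Hugging Face-style causal LM and using the next-token log-probabilities over Dprobe as the statistical feature; the feature choice is ours for illustration, and a pre-trained feature extractor could be substituted.

```python
import torch

@torch.no_grad()
def extract_semantic_fingerprint(model, tokenizer, probe_prompts, device="cpu"):
    # Query the model with each probe prompt and keep the next-token distribution
    # as a simple statistical feature of the output set O.
    features = []
    for prompt in probe_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        logits = model(**inputs).logits[0, -1]          # next-token logits
        features.append(torch.log_softmax(logits, dim=-1).cpu())
    return torch.stack(features)                        # shape: (|D_probe|, vocab_size)
```

Fingerprints extracted this way from the base and suspect models can then be compared with the same similarity measure as in the parameter-based setting.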
Adversarial Example as Fingerprint
In this category, the probe dataset Dprobe consists of paired examples {(xtrigger, yfp)}, rather than unlabeled prompts. The owner first constructs a target set {(x, yfp)} and employs a specific optimization procedure (such as GCG [cite:zou2023universal]) to transform each x into an adversarial input xtrigger.
Effectiveness is measured by the FSR, defined as the proportion of triggers in $\mathcal{D}_{\text{probe}}$ that elicit their intended fingerprint responses:
$$\text{FSR} = \frac{1}{|\mathcal{D}_{\text{probe}}|} \sum_{(x_{\text{trigger}},\, y_{\text{fp}}) \in \mathcal{D}_{\text{probe}}} \mathbb{1}\!\left[\mathcal{M}_\theta^{(s)}(x_{\text{trigger}}) = y_{\text{fp}}\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function.
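A sketch of this FSR computation for a generation-based suspect model follows, again assuming a Hugging Face-style interface; matching here is a simple substring check, whereas specific methods may use exact match or semantic matching.

```python
import torch

@torch.no_grad()
def trigger_fsr(model, tokenizer, probe_pairs, device="cpu", max_new_tokens=32):
    # probe_pairs: list of (x_trigger, y_fp) pairs.
    hits = 0
    for x_trigger, y_fp in probe_pairs:
        inputs = tokenizer(x_trigger, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        hits += int(y_fp in completion)                 # the indicator 1[.]
    return hits / len(probe_pairs)
```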
Invasive Fingerprinting
Weight Watermark as Fingerprint
In this setting, the model owner defines a binary watermark message m = (b1, b2, ..., bn) of length n, and embeds it into the model's weights via a regularization-based constraint during training. After deployment, the corresponding extraction rule is applied to the target weights to recover a message m'.
The FSR can be quantified as the bit accuracy or, equivalently, as one minus the bit error rate (BER).
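As an illustrative sketch in the spirit of projection-based weight watermarks, the extraction rule below projects the flattened target weights through a secret matrix and thresholds at zero; the projection matrix `X_secret` and the thresholding rule are assumptions, since the exact extraction rule is method-specific.

```python
import torch

def extract_message(target_weights, X_secret):
    # m'_i = 1[(X w)_i > 0], with w the flattened watermarked weight tensor.
    w = target_weights.detach().float().flatten()
    return (X_secret @ w > 0).int()

def bit_accuracy(m, m_prime):
    # FSR as bit accuracy, i.e. 1 - BER.
    return (m == m_prime).float().mean().item()
```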
Backdoor Watermark as Fingerprint
In this setting, the model owner constructs a backdoor fingerprint dataset Dfp = {(xtrigger, yfp)}, where each xtrigger conforms to a predefined trigger pattern. The FSR is computed analogously to the adversarial-example case, by measuring the proportion of triggers in Dfp that elicit their intended fingerprint outputs.
Capability Impact (Harmlessness)
From a model fingerprinting perspective, harmlessness refers to the property that the embedding of ownership signals neither degrades the model's original capabilities nor interferes with its intended functionalities. In practice, a fingerprinting scheme is considered harmless if (i) the quality of model-generated content remains essentially unaffected, and (ii) the performance gap between the original and fingerprinted models is statistically negligible across a sufficiently diverse set of representative tasks.
Generated Content Quality Preservation
General Capability Preservation
Representative evaluation categories include logical and commonsense reasoning (ANLI R1--R3 [cite:nie-etal-2020-adversarial], ARC [cite:clark2018think], OpenBookQA [cite:mihaylov2018can], Winogrande [cite:sakaguchi2021winogrande], LogiQA [cite:liu2021logiqa]), scientific understanding (SciQ [cite:welbl2017crowdsourcing]), linguistic and textual entailment (BoolQ [cite:clark2019boolq], CB [cite:de2019commitmentbank], RTE [cite:giampiccolo2007third], WiC [cite:pilehvar2019wic], WSC [cite:levesque2012winograd], CoPA [cite:roemmele2011choice], MultiRC [cite:khashabi2018looking]), long-form prediction (LAMBADA-OpenAI and LAMBADA-Standard [cite:paperno2016lambada]), and additional capability domains [cite:liu2024survey] including text completion [cite:kirchenbauer2023watermark], code generation [cite:lee2023wrote], machine translation [cite:hu2023unbiased], text summarization [cite:he2024can], question answering [cite:fernandez2023three], mathematical reasoning [cite:liang2024watme], knowledge probing [cite:tu2023waterbench], and instruction following [cite:tu2023waterbench].
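A minimal harmlessness check, assuming per-task accuracies for the base and fingerprinted models have already been obtained (for example, from a standard evaluation harness), is to inspect the per-task gaps and test whether the mean gap is statistically negligible, as sketched below.

```python
from scipy import stats

def harmlessness_report(base_scores, fp_scores):
    # base_scores / fp_scores: dicts mapping task name -> accuracy.
    tasks = sorted(base_scores)
    base = [base_scores[t] for t in tasks]
    fp = [fp_scores[t] for t in tasks]
    gaps = [f - b for b, f in zip(base, fp)]
    t_stat, p_value = stats.ttest_rel(base, fp)   # paired test across tasks
    return {"mean_gap": sum(gaps) / len(gaps), "p_value": p_value}
```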
Reliability
In the context of traditional model watermarking, this property is often referred to as fidelity. It requires that the FSR obtained from unrelated models be kept below a minimal threshold. Formally, given a set of unrelated models $\{\mathcal{M}_1^{(u)}, \mathcal{M}_2^{(u)}, \ldots\}$, the fingerprint extractor should yield consistently low FSR values across all $\mathcal{M}_i^{(u)}$.
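A sketch of this check is shown below, where `compute_fsr` is any of the extractors above and the threshold τ is an assumption that would in practice be set empirically or via hypothesis testing over a pool of unrelated models.

```python
def reliability_check(compute_fsr, unrelated_models, tau=0.05):
    # The scheme is considered reliable if every unrelated model stays below tau.
    fsr_values = [compute_fsr(m) for m in unrelated_models]
    return all(v < tau for v in fsr_values), fsr_values
```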
Robustness Under Fingerprint Attack
In real-world scenarios, an adversary may attempt to remove or overwrite embedded copyright information, potentially sacrificing some model performance in the process. Robustness measures the extent to which the fingerprint signal remains detectable under such deliberate evasion attempts, and is typically quantified by the FSR achieved after various attack strategies.
Model-Level Attacks
Model Fine-tuning
Fine-tuning refers to the process whereby an adversary continues training a stolen model using strategies such as continued pretraining, instruction tuning, or reinforcement learning on curated datasets. In real-world applications, fine-tuning is one of the most common methods for enhancing a model's capabilities.
Continued fine-tuning thus represents one of the most prevalent and practically relevant adversarial settings, and has historically served as the primary robustness benchmark for many fingerprinting methods [cite:xu2024instructional,cai2024utf,russinovich2024hey]. Moreover, certain heuristic fine-tuning strategies have been explicitly proposed to erase backdoor-based fingerprints, such as MEraser [cite:zhang2025meraser], which targets the selective removal of implanted triggers while preserving the model's utility.
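Robustness against fine-tuning is typically evaluated by applying each attack to a copy of the fingerprinted model and re-measuring the FSR, as in the protocol sketch below; the attack callables (for instance, an instruction-tuning run or MEraser) are placeholders supplied by the evaluator.

```python
def robustness_under_finetuning(fingerprinted_model, attacks, compute_fsr):
    # attacks: dict mapping attack name -> callable returning a fine-tuned copy of the model.
    results = {}
    for name, attack in attacks.items():
        attacked_model = attack(fingerprinted_model)
        results[name] = compute_fsr(attacked_model)   # FSR after the attack
    return results
```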
Model Quantization and Pruning
In real-world deployments, adversaries (or even benign users) may need to adapt models for low-resource environments, where reduced memory footprint and faster inference are critical. Two common strategies for this are quantization—reducing parameter precision—and pruning—removing redundant weights or structures.
Quantization covers techniques such as half-precision (fp16) deployment and low-bit (e.g., 8-bit or 4-bit) integer quantization, which significantly compress model size while retaining functionality. Pruning can be applied in structured or unstructured forms, including random pruning, magnitude-based pruning using $L_1$/$L_2$ norms, or heuristic approaches such as Taylor-based saliency pruning [cite:ma2023llmpruner].
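The sketch below applies two such perturbations to a PyTorch model, half-precision casting and unstructured L1 magnitude pruning of the linear layers, after which the FSR would be re-measured; the pruning ratio and the restriction to linear layers are assumptions.

```python
import copy
import torch
import torch.nn.utils.prune as prune

def quantize_fp16(model):
    # Half-precision deployment: cast all parameters and buffers to fp16.
    return copy.deepcopy(model).half()

def prune_l1(model, amount=0.3):
    # Unstructured L1 magnitude pruning on every linear layer's weight matrix.
    pruned = copy.deepcopy(model)
    for module in pruned.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")   # make the pruning permanent
    return pruned
```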
Model Merging
Model merging [cite:bhardwaj2024language,arora2024here] has recently gained traction as a lightweight paradigm for integrating multiple upstream expert models—each specialized for particular tasks—into a single model that consolidates their capabilities. Its main appeal lies in the ability to combine functionalities without requiring high-performance computing resources.
[cite:cong2024have] were among the first to formally investigate merging as an attack vector against model fingerprinting. Rather than proposing new merging algorithms, they adopted representative existing approaches—such as Task Arithmetic [cite:ilharco2022task-arithmetic] and Ties-Merging [cite:yadav2024ties]—to evaluate fingerprint persistence under fusion. Beyond these, many other merging strategies are available in practice, with toolkits such as MergeKit [cite:goddard-etal-2024-mergekit] providing streamlined workflows for implementing lightweight model merging in real systems.
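As an illustration of the fusion setting, the sketch below performs Task Arithmetic-style merging directly over state dictionaries; the scaling coefficient and the assumption that all models share one architecture are ours, and toolkits such as MergeKit implement more elaborate variants.

```python
def task_arithmetic_merge(base_state, expert_states, lam=0.3):
    # theta_merged = theta_base + lam * sum_i (theta_i - theta_base), applied per tensor.
    merged = {}
    for name, base_param in base_state.items():
        task_vector = sum(expert[name] - base_param for expert in expert_states)
        merged[name] = base_param + lam * task_vector
    return merged
```

Fingerprint persistence is then assessed by loading the merged state dictionary back into the shared architecture and recomputing the FSR.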
Input and Output Level Attacks
Beyond direct modifications to model parameters, interaction-dependent fingerprinting methods—such as those based on adversarial examples, backdoor watermarks, semantic features, or activation representations—can be challenged through manipulations of the model's inputs and/or outputs during querying.
Input Manipulation
In practical settings, an adversary may systematically inspect all incoming queries—including benign user inputs—to detect fragments that could reveal embedded fingerprint patterns. Upon identification, such queries may be blocked, ignored, or otherwise suppressed. Detection can also be performed using heuristic metrics such as perplexity (PPL), defined as
$$\text{PPL}(x) = \exp\!\left(-\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})\right),$$
where $x = (x_1, \ldots, x_n)$ is the tokenized query; adversarially optimized trigger strings often exhibit abnormally high perplexity.
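A sketch of such a perplexity filter with a Hugging Face-style causal LM follows; the rejection threshold is an assumption and would in practice be calibrated on benign traffic.

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text, device="cpu"):
    # exp of the average negative log-likelihood assigned to the query tokens.
    enc = tokenizer(text, return_tensors="pt").to(device)
    loss = model(**enc, labels=enc["input_ids"]).loss   # mean token-level NLL
    return torch.exp(loss).item()

def is_suspicious(model, tokenizer, text, threshold=1000.0):
    # Flag inputs with abnormally high perplexity (e.g., optimization-based trigger strings).
    return perplexity(model, tokenizer, text) > threshold
```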
If an input bypasses the initial detection stage, the adversary may still opt to perturb it—such as by re-paragraphing, removing non-essential content at random, or otherwise altering its structure—thereby reducing the likelihood that a fingerprint trigger is activated.
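A minimal example of such a perturbation is random token deletion, shown below; the drop probability is an assumption, and paraphrasing with a separate rewriting model is a stronger variant of the same idea.

```python
import random

def random_deletion(text, drop_prob=0.1, seed=None):
    # Drop each whitespace-delimited token with probability drop_prob,
    # which can break up embedded trigger fragments.
    rng = random.Random(seed)
    kept = [tok for tok in text.split() if rng.random() > drop_prob]
    return " ".join(kept) if kept else text
```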
Response Manipulation
Beyond manipulating inputs, an adversary could attempt to detect and suppress fingerprint activation by examining the semantic consistency between an input and its corresponding output. Since fingerprinted responses are often designed to exhibit distinctive features, they may lie outside the model's greedy decoding path or occur in low-probability regions of the output distribution.
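One coarse check along these lines, sketched below under a Hugging Face-style interface, compares a returned response against the model's own greedy continuation and flags responses that fall off that path; actual defenses may instead score the response's likelihood under the output distribution.

```python
import torch

@torch.no_grad()
def off_greedy_path(model, tokenizer, prompt, response, device="cpu", max_new_tokens=64):
    # Regenerate greedily and check whether the observed response deviates from it.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    greedy = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip() not in greedy
```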
System-Level Attacks
Ultimately, LLMs are deployed within broader systems, a common example being LLM-based agents [cite:kong2025surveyllmdrivenaiagent]. Such systems often integrate memory modules or external knowledge sources (e.g., web search) into the model's reasoning process—either to mitigate hallucination or to synchronize responses with up-to-date information.
While these additional prompts improve factual accuracy and relevance, they can also interfere with the activation or manifestation of fingerprint signals. As a result, evaluating fingerprint robustness in the presence of such system-level interactions is essential to understanding performance in realistic deployment scenarios.