Evaluation for LLM Fingerprinting
Building upon the key characteristics outlined in the preceding section, we now provide a detailed discussion of how each criterion can be evaluated in practice. Specifically, given a clean base model Mθ, its fingerprinted counterpart Mθ(f), and a downstream suspect model Mθ(s), we describe concrete procedures and metrics for assessing whether the suspect model contains the embedded fingerprint, and to what extent the fingerprint meets the target properties.
The evaluation framework covers four main aspects: Detectability (Effectiveness), Capability Impact (Harmlessness), Reliability, and Robustness Under Fingerprint Attack. Each section examines specific metrics and procedures for comprehensive assessment.
Detectability (Effectiveness)
Effectiveness assesses whether the embedded fingerprint can be reliably extracted from the fingerprinted model and, when necessary, distinguished from signals in a suspect model Mθ(s). We quantify this property using the Fingerprint Success Rate (FSR), which measures the strength of the recovered fingerprint signal.
Intrinsic Fingerprinting
Parameter and Representation as Fingerprint
For approaches that define the fingerprint as parameters or intermediate representations, we focus on measuring the similarity between the extracted signals. Let $f = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta)$ and $f^{(s)} = \mathcal{F}_{\text{intrinsic}}(\mathcal{M}_\theta^{(s)})$ denote the fingerprints extracted from the base model and the suspect model, respectively.
The FSR in this setting is commonly computed as the cosine similarity, where values closer to 1 indicate stronger fingerprint correspondence. Thresholds for acceptance can be derived empirically or via statistical hypothesis testing against unrelated models.
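As a concrete illustration, the sketch below computes this similarity directly over a chosen subset of parameter tensors; the helper `extract_weight_vector` and the layer selection are assumptions made for illustration rather than part of any specific method.

```python
import torch

def extract_weight_vector(model, layer_names):
    # Illustrative intrinsic extractor: flatten the selected parameter tensors into one vector.
    parts = [p.detach().float().flatten()
             for name, p in model.named_parameters() if name in layer_names]
    return torch.cat(parts)

def intrinsic_fsr(base_model, suspect_model, layer_names):
    # Cosine similarity between the two extracted fingerprints; values near 1
    # indicate strong correspondence with the base model (same architecture assumed).
    f_base = extract_weight_vector(base_model, layer_names)
    f_suspect = extract_weight_vector(suspect_model, layer_names)
    return torch.nn.functional.cosine_similarity(f_base, f_suspect, dim=0).item()
```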
Semantic Feature as Fingerprint
Methods in this category typically require a predefined probe dataset Dprobe. Each prompt in Dprobe is fed into the model to collect an output set O (e.g., logits, generated text). From O, the fingerprint signal f is then extracted using either a pre-trained feature extractor or statistical feature analysis.
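A minimal sketch of such an extraction pipeline is given below, assuming a Hugging Face-style causal LM and using the next-token log-probabilities over Dprobe as the statistical feature; the feature choice is ours for illustration, and a pre-trained feature extractor could be substituted.

```python
import torch

@torch.no_grad()
def extract_semantic_fingerprint(model, tokenizer, probe_prompts, device="cpu"):
    # Query the model with each probe prompt and keep the next-token distribution
    # as a simple statistical feature of the output set O.
    features = []
    for prompt in probe_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        logits = model(**inputs).logits[0, -1]          # next-token logits
        features.append(torch.log_softmax(logits, dim=-1).cpu())
    return torch.stack(features)                        # shape: (|D_probe|, vocab_size)
```

Fingerprints extracted this way from the base and suspect models can then be compared with the same similarity measure as in the parameter-based setting.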
Adversarial Example as Fingerprint
In this category, the probe dataset Dprobe consists of paired examples {(xtrigger, yfp)}, rather than unlabeled prompts. The owner first constructs a target set {(x, yfp)} and employs a specific optimization procedure (such as GCG [cite:zou2023universal]) to transform each x into an adversarial input xtrigger.
Effectiveness is measured by the FSR, defined as the proportion of triggers in $\mathcal{D}_{\text{probe}}$ that elicit their intended fingerprint responses:
$$\text{FSR} = \frac{1}{|\mathcal{D}_{\text{probe}}|} \sum_{(x_{\text{trigger}},\, y_{\text{fp}}) \in \mathcal{D}_{\text{probe}}} \mathbb{1}\!\left[\mathcal{M}_\theta^{(s)}(x_{\text{trigger}}) = y_{\text{fp}}\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function.
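A sketch of this FSR computation for a generation-based suspect model follows, again assuming a Hugging Face-style interface; matching here is a simple substring check, whereas specific methods may use exact match or semantic matching.

```python
import torch

@torch.no_grad()
def trigger_fsr(model, tokenizer, probe_pairs, device="cpu", max_new_tokens=32):
    # probe_pairs: list of (x_trigger, y_fp) pairs.
    hits = 0
    for x_trigger, y_fp in probe_pairs:
        inputs = tokenizer(x_trigger, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        hits += int(y_fp in completion)                 # the indicator 1[.]
    return hits / len(probe_pairs)
```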
Invasive Fingerprinting
Weight Watermark as Fingerprint
In this setting, the model owner defines a binary watermark message m = (b1, b2, ..., bn) of length n, and embeds it into the model's weights via a regularization-based constraint during training. After deployment, the corresponding extraction rule is applied to the target weights to recover a message m'.
The FSR can be quantified as the bit accuracy or, equivalently, as one minus the bit error rate (BER).
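As an illustrative sketch in the spirit of projection-based weight watermarks, the extraction rule below projects the flattened target weights through a secret matrix and thresholds at zero; the projection matrix `X_secret` and the thresholding rule are assumptions, since the exact extraction rule is method-specific.

```python
import torch

def extract_message(target_weights, X_secret):
    # m'_i = 1[(X w)_i > 0], with w the flattened watermarked weight tensor.
    w = target_weights.detach().float().flatten()
    return (X_secret @ w > 0).int()

def bit_accuracy(m, m_prime):
    # FSR as bit accuracy, i.e. 1 - BER.
    return (m == m_prime).float().mean().item()
```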
Backdoor Watermark as Fingerprint
In this setting, the model owner constructs a backdoor fingerprint dataset Dfp = {(xtrigger, yfp)}, where each xtrigger conforms to a predefined trigger pattern. The FSR is computed analogously to the adversarial-example case, by measuring the proportion of triggers in Dfp that elicit their intended fingerprint outputs.
Capability Impact (Harmlessness)
From a model fingerprinting perspective, harmlessness refers to the property that the embedding of ownership signals neither degrades the model's original capabilities nor interferes with its intended functionalities. In practice, a fingerprinting scheme is considered harmless if (i) the quality of model-generated content remains essentially unaffected, and (ii) the performance gap between the original and fingerprinted models is statistically negligible across a sufficiently diverse set of representative tasks.
Generated Content Quality Preservation
General Capability Preservation
Representative evaluation categories include logical and commonsense reasoning (ANLI R1--R3 [cite:nie-etal-2020-adversarial], ARC [cite:clark2018think], OpenBookQA [cite:mihaylov2018can], Winogrande [cite:sakaguchi2021winogrande], LogiQA [cite:liu2021logiqa]), scientific understanding (SciQ [cite:welbl2017crowdsourcing]), linguistic and textual entailment (BoolQ [cite:clark2019boolq], CB [cite:de2019commitmentbank], RTE [cite:giampiccolo2007third], WiC [cite:pilehvar2019wic], WSC [cite:levesque2012winograd], CoPA [cite:roemmele2011choice], MultiRC [cite:khashabi2018looking]), long-form prediction (LAMBADA-OpenAI and LAMBADA-Standard [cite:paperno2016lambada]), and additional capability domains [cite:liu2024survey] including text completion [cite:kirchenbauer2023watermark], code generation [cite:lee2023wrote], machine translation [cite:hu2023unbiased], text summarization [cite:he2024can], question answering [cite:fernandez2023three], mathematical reasoning [cite:liang2024watme], knowledge probing [cite:tu2023waterbench], and instruction following [cite:tu2023waterbench].
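A minimal harmlessness check, assuming per-task accuracies for the base and fingerprinted models have already been obtained (for example, from a standard evaluation harness), is to inspect the per-task gaps and test whether the mean gap is statistically negligible, as sketched below.

```python
from scipy import stats

def harmlessness_report(base_scores, fp_scores):
    # base_scores / fp_scores: dicts mapping task name -> accuracy.
    tasks = sorted(base_scores)
    base = [base_scores[t] for t in tasks]
    fp = [fp_scores[t] for t in tasks]
    gaps = [f - b for b, f in zip(base, fp)]
    t_stat, p_value = stats.ttest_rel(base, fp)   # paired test across tasks
    return {"mean_gap": sum(gaps) / len(gaps), "p_value": p_value}
```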
Reliability
In the context of traditional model watermarking, this property is often referred to as fidelity. It requires that the FSR obtained from unrelated models be kept below a minimal threshold. Formally, given a set of unrelated models $\{\mathcal{M}_1^{(u)}, \mathcal{M}_2^{(u)}, \ldots\}$, the fingerprint extractor should yield consistently low FSR values across all $\mathcal{M}_i^{(u)}$.
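A sketch of this check is shown below, where `compute_fsr` is any of the extractors above and the threshold τ is an assumption that would in practice be set empirically or via hypothesis testing over a pool of unrelated models.

```python
def reliability_check(compute_fsr, unrelated_models, tau=0.05):
    # The scheme is considered reliable if every unrelated model stays below tau.
    fsr_values = [compute_fsr(m) for m in unrelated_models]
    return all(v < tau for v in fsr_values), fsr_values
```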
Robustness Under Fingerprint Attack
In real-world scenarios, an adversary may attempt to remove or overwrite embedded copyright information, potentially sacrificing some model performance in the process. Robustness measures the extent to which the fingerprint signal remains detectable under such deliberate evasion attempts, and is typically quantified by the FSR achieved after various attack strategies.
Model-Level Attacks
Model Fine-tuning
Fine-tuning refers to the process whereby an adversary continues training a stolen model using strategies such as continued pretraining, instruction tuning, or reinforcement learning on curated datasets. In real-world applications, fine-tuning is one of the most common methods for enhancing a model's capabilities.
Continued fine-tuning thus represents one of the most prevalent and practically relevant adversarial settings, and has historically served as the primary robustness benchmark for many fingerprinting methods [cite:xu2024instructional,cai2024utf,russinovich2024hey]. Moreover, certain heuristic fine-tuning strategies have been explicitly proposed to erase backdoor-based fingerprints, such as MEraser [cite:zhang2025meraser], which targets the selective removal of implanted triggers while preserving the model's utility.
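Robustness against fine-tuning is typically evaluated by applying each attack to a copy of the fingerprinted model and re-measuring the FSR, as in the protocol sketch below; the attack callables (for instance, an instruction-tuning run or MEraser) are placeholders supplied by the evaluator.

```python
def robustness_under_finetuning(fingerprinted_model, attacks, compute_fsr):
    # attacks: dict mapping attack name -> callable returning a fine-tuned copy of the model.
    results = {}
    for name, attack in attacks.items():
        attacked_model = attack(fingerprinted_model)
        results[name] = compute_fsr(attacked_model)   # FSR after the attack
    return results
```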
Model Quantization and Pruning
In real-world deployments, adversaries (or even benign users) may need to adapt models for low-resource environments, where reduced memory footprint and faster inference are critical. Two common strategies for this are quantization—reducing parameter precision—and pruning—removing redundant weights or structures.
Quantization covers techniques such as half-precision (fp16) deployment and low-bit (e.g., 8-bit or 4-bit) integer quantization, which significantly compress model size while retaining functionality. Pruning can be applied in structured or unstructured forms, including random pruning, magnitude-based pruning using $L_1$/$L_2$ norms, or heuristic approaches such as Taylor-based saliency pruning [cite:ma2023llmpruner].
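The sketch below applies two such perturbations to a PyTorch model, half-precision casting and unstructured L1 magnitude pruning of the linear layers, after which the FSR would be re-measured; the pruning ratio and the restriction to linear layers are assumptions.

```python
import copy
import torch
import torch.nn.utils.prune as prune

def quantize_fp16(model):
    # Half-precision deployment: cast all parameters and buffers to fp16.
    return copy.deepcopy(model).half()

def prune_l1(model, amount=0.3):
    # Unstructured L1 magnitude pruning on every linear layer's weight matrix.
    pruned = copy.deepcopy(model)
    for module in pruned.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")   # make the pruning permanent
    return pruned
```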
Model Merging
Model merging [cite:bhardwaj2024language,arora2024here] has recently gained traction as a lightweight paradigm for integrating multiple upstream expert models—each specialized for particular tasks—into a single model that consolidates their capabilities. Its main appeal lies in the ability to combine functionalities without requiring high-performance computing resources.
[cite:cong2024have] were among the first to formally investigate merging as an attack vector against model fingerprinting. Rather than proposing new merging algorithms, they adopted representative existing approaches—such as Task Arithmetic [cite:ilharco2022task-arithmetic] and Ties-Merging [cite:yadav2024ties]—to evaluate fingerprint persistence under fusion. Beyond these, many other merging strategies are available in practice, with toolkits such as MergeKit [cite:goddard-etal-2024-mergekit] providing streamlined workflows for implementing lightweight model merging in real systems.
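As an illustration of the fusion setting, the sketch below performs Task Arithmetic-style merging directly over state dictionaries; the scaling coefficient and the assumption that all models share one architecture are ours, and toolkits such as MergeKit implement more elaborate variants.

```python
def task_arithmetic_merge(base_state, expert_states, lam=0.3):
    # theta_merged = theta_base + lam * sum_i (theta_i - theta_base), applied per tensor.
    merged = {}
    for name, base_param in base_state.items():
        task_vector = sum(expert[name] - base_param for expert in expert_states)
        merged[name] = base_param + lam * task_vector
    return merged
```

Fingerprint persistence is then assessed by loading the merged state dictionary back into the shared architecture and recomputing the FSR.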
Input and Output Level Attacks
Beyond direct modifications to model parameters, interaction-dependent fingerprinting methods—such as those based on adversarial examples, backdoor watermarks, semantic features, or activation representations—can be challenged through manipulations of the model's inputs and/or outputs during querying.
Input Manipulation
In practical settings, an adversary may systematically inspect all incoming queries—including benign user inputs—to detect fragments that could reveal embedded fingerprint patterns. Upon identification, such queries may be blocked, ignored, or otherwise suppressed. Detection can also be performed using heuristic metrics such as perplexity (PPL), defined as
$$\text{PPL}(x) = \exp\!\left(-\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})\right),$$
where $x = (x_1, \ldots, x_n)$ is the tokenized query; adversarially optimized trigger strings often exhibit abnormally high perplexity.
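A sketch of such a perplexity filter with a Hugging Face-style causal LM follows; the rejection threshold is an assumption and would in practice be calibrated on benign traffic.

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text, device="cpu"):
    # exp of the average negative log-likelihood assigned to the query tokens.
    enc = tokenizer(text, return_tensors="pt").to(device)
    loss = model(**enc, labels=enc["input_ids"]).loss   # mean token-level NLL
    return torch.exp(loss).item()

def is_suspicious(model, tokenizer, text, threshold=1000.0):
    # Flag inputs with abnormally high perplexity (e.g., optimization-based trigger strings).
    return perplexity(model, tokenizer, text) > threshold
```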
If an input bypasses the initial detection stage, the adversary may still opt to perturb it—such as by re-paragraphing, removing non-essential content at random, or otherwise altering its structure—thereby reducing the likelihood that a fingerprint trigger is activated.
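A minimal example of such a perturbation is random token deletion, shown below; the drop probability is an assumption, and paraphrasing with a separate rewriting model is a stronger variant of the same idea.

```python
import random

def random_deletion(text, drop_prob=0.1, seed=None):
    # Drop each whitespace-delimited token with probability drop_prob,
    # which can break up embedded trigger fragments.
    rng = random.Random(seed)
    kept = [tok for tok in text.split() if rng.random() > drop_prob]
    return " ".join(kept) if kept else text
```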
Response Manipulation
Beyond manipulating inputs, an adversary could attempt to detect and suppress fingerprint activation by examining the semantic consistency between an input and its corresponding output. Since fingerprinted responses are often designed to exhibit distinctive features, they may lie outside the model's greedy decoding path or occur in low-probability regions of the output distribution.
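One coarse check along these lines, sketched below under a Hugging Face-style interface, compares a returned response against the model's own greedy continuation and flags responses that fall off that path; actual defenses may instead score the response's likelihood under the output distribution.

```python
import torch

@torch.no_grad()
def off_greedy_path(model, tokenizer, prompt, response, device="cpu", max_new_tokens=64):
    # Regenerate greedily and check whether the observed response deviates from it.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    greedy = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip() not in greedy
```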
System-Level Attacks
Ultimately, LLMs are deployed within broader systems, a common example being LLM-based agents [cite:kong2025surveyllmdrivenaiagent]. Such systems often integrate memory modules or external knowledge sources (e.g., web search) into the model's reasoning process—either to mitigate hallucination or to synchronize responses with up-to-date information.
While these additional prompts improve factual accuracy and relevance, they can also interfere with the activation or manifestation of fingerprint signals. As a result, evaluating fingerprint robustness in the presence of such system-level interactions is essential to understanding performance in realistic deployment scenarios.