Parameter Space
Methods that analyze the weight space of language models, treating distinctive patterns in the parameters themselves as fingerprints for identification.
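A minimal sketch of how a parameter-space comparison might look, assuming white-box access to both sets of weights; the model identifiers are placeholders, and the averaged cosine similarity over shared weight tensors is an illustrative metric rather than a specific published method.

```python
import torch
from transformers import AutoModelForCausalLM

def weight_cosine_similarity(model_a, model_b):
    """Average cosine similarity over parameter tensors shared by both models."""
    params_a = dict(model_a.named_parameters())
    params_b = dict(model_b.named_parameters())
    shared = [n for n in params_a
              if n in params_b and params_a[n].shape == params_b[n].shape]
    sims = []
    for name in shared:
        a = params_a[name].detach().flatten().float()
        b = params_b[name].detach().flatten().float()
        sims.append(torch.nn.functional.cosine_similarity(a, b, dim=0).item())
    return sum(sims) / len(sims) if sims else 0.0

# Placeholder identifiers; a high score suggests the suspect model derives
# from the reference model's weights.
reference = AutoModelForCausalLM.from_pretrained("reference-model")
suspect = AutoModelForCausalLM.from_pretrained("suspect-model")
print(f"mean weight cosine similarity: {weight_cosine_similarity(reference, suspect):.4f}")
```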
Representation Features
Methods that analyze the internal representations of LLMs, including activation patterns, hidden states, and output logits. These representations are shaped by the training data, optimization strategies, and frameworks used during training, and they serve as intrinsic features for model identification, capturing how different models process and transform information. The output logits, which encode the model's prediction probabilities, likewise reflect its learned patterns and decision boundaries, making them valuable for fingerprinting.
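The sketch below illustrates one way such features could be collected, assuming white-box access: a fixed probe prompt is passed through the model, and the mean-pooled final hidden state together with the next-token logits form a feature vector. The probe text and model name are illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def representation_fingerprint(model_name: str, probe: str = "The quick brown fox"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    inputs = tokenizer(probe, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Mean-pooled last-layer hidden state: how the model transforms the probe.
    hidden = out.hidden_states[-1].mean(dim=1).squeeze(0)
    # Next-token logits: the model's prediction probabilities over the vocabulary.
    logits = out.logits[0, -1, :]
    return torch.cat([hidden, logits])

fp = representation_fingerprint("gpt2")  # any causal LM identifier works here
print(fp.shape)
```

Feature vectors extracted this way from different models can then be compared with a distance metric or fed to a classifier to decide whether two models are related.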
Semantic Feature Extraction
This category of methods performs statistical analysis of the content generated by different models, exploiting the linguistic patterns and semantic preferences each LLM exhibits as its unique fingerprint.
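A minimal sketch of such statistical analysis, under the assumption that simple stylometric statistics (lexical diversity, sentence length, function-word frequencies) stand in for the richer features a real system would use; the feature set here is illustrative only.

```python
import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "for", "with", "as"]

def semantic_features(generated_text: str) -> list[float]:
    """Stylometric feature vector computed from model-generated text."""
    tokens = re.findall(r"[a-zA-Z']+", generated_text.lower())
    sentences = [s for s in re.split(r"[.!?]+", generated_text) if s.strip()]
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    features = [
        len(counts) / total,                  # type-token ratio (lexical diversity)
        total / max(len(sentences), 1),       # average sentence length in tokens
        sum(len(t) for t in tokens) / total,  # average word length
    ]
    # Relative frequencies of common function words capture stylistic preferences.
    features.extend(counts[w] / total for w in FUNCTION_WORDS)
    return features

print(semantic_features("The model generates fluent text. It often prefers certain phrasings."))
```

Vectors computed over text sampled from different models would then be compared, for example with a distance metric or a lightweight classifier, to attribute a generation to its source model.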
Adversarial Example-Based
These methods craft model-specific adversarial prompts through optimization. The fundamental process of this prompt optimization-based fingerprinting can be understood as follows: given an original input and a predefined response, the prompt is optimized until the final version, when fed to the model, elicits the predefined response. Because this optimization is tightly coupled with the model's weights, the optimized prompt is effective only on the target model and fails on unrelated models, making it a stable fingerprint feature.
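A minimal sketch of this optimization loop, assuming white-box access to the target model's logits. It mutates tokens of a trigger suffix at random to minimize the loss of a predefined target response; published methods typically use gradient-guided search rather than random substitution, but the coupling between the optimized prompt and the target model's weights is the same. Model name, prompt, and target string are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Please respond:"
target = " FINGERPRINT-OK"  # predefined response that serves as the fingerprint key
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
suffix_ids = torch.randint(0, model.config.vocab_size, (1, 8))  # trigger tokens to optimize

def target_loss(suffix: torch.Tensor) -> float:
    """Cross-entropy of the predefined response given prompt + trigger suffix."""
    input_ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1] + suffix.shape[1]] = -100  # score only the target span
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

best = target_loss(suffix_ids)
for _ in range(200):  # simple random-substitution search over suffix positions
    candidate = suffix_ids.clone()
    pos = torch.randint(0, candidate.shape[1], (1,)).item()
    candidate[0, pos] = torch.randint(0, model.config.vocab_size, (1,)).item()
    loss = target_loss(candidate)
    if loss < best:
        best, suffix_ids = loss, candidate

print("optimized trigger:", tokenizer.decode(suffix_ids[0]), "| loss:", round(best, 3))
```

At verification time, the optimized trigger is sent to a suspect model: if the model emits the predefined response, it is likely the target model or a derivative of it; unrelated models should produce unrelated outputs.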