Fingerprinting

In today's blog, I'll introduce the concept of fingerprinting algorithms.

The premise of fingerprinting is simple. Suppose that you have two objects, $x$ and $y$, and you want to quickly verify if $x = y$. If $x, y$ are drawn from some universe $U$, to determine whether $x = y$ one could simply compare all $\log |U|$ bits of each. However, this may be inefficient if the objects are exceedingly large! How can we get around comparing $x$ and $y$ directly?

You might immediately suggest compression as a solution to the above comparison problem. This is close in spirit to fingerprinting. However, depending on the compression scheme chosen, using compression as a means to verify equality may prove difficult to analyze in terms of approximate correctness. That said, we restrict the discussion in this post to fingerprinting mechanisms.

For starters, we may suppose $x$ and $y$ are $d$-dimensional feature vectors (or to the simple programmer, standard objects with $d$ parameters). Then, we could pick particular features of $x$ and $y$ and compare them. For example, these features could compose the primary key of a database entity. Of course, this would require underlying structure in the objects being compared, as well as assumptions about distinguishing features. In addition, we would have to require that the number of distinguishing features $q \ll d$ to see performance improvements (or more specifically, the total size of the distinguishing features).

Then, what happens in the case where we can not make these assumptions about the structure of the data being compared?

File Comparison Problem. Suppose that two agents, Alice and Bob, separately have files $A$ and $B$ and wish to verify whether or not $A = B$. $A$ and $B$ may be represented as bit strings of length $m$ and $n$ respectively.

It is easy to see that any deterministic approach to verifying $A = B$ requires that at least $\min\{m, n\}$ bits be exchanged.

Claim. There exists an $O(\log n)$ bits algorithm which suffices to correctly decide equality with probability $\geq 1 - \frac{1}{n}$.

Suppose that Alice and Bob share a communication channel where channel capacity is limited and sending data is expensive. If $m$ and $n$ are substantially large, then Alice would prefer to send a fingerprint of $A$ to Bob, a significantly smaller representation of $A$.

Protocol.

Alice picks a prime number $p$ uniformly at random from $\{2, \dots n^2 \log n\}$. (The motivation behind the choice of $n^2 \log n$ will be seen later). Treating $A$ as a large integer, she then computes $A \text{ mod }p$. Then, Alice sends the following tuple to Bob: $\langle A \text{ mod } p, p \rangle$. Then, Bob computes $B \text{ mod } p$. If $A \equiv B \text{ mod } p$, Bob outputs YES, and otherwise, NO.

Space Analysis.

Alice only sends two quantities, $A \text{ mod } p$ and $p$. Then, the maximum value of each is at most $n^2 \log n$, and the total number of bits sent is $O(\log n)$.

Error Analysis.

Clearly, in the above scheme there can never be false negatives, such that $A = B$ results in Bob outputting NO. Then,
$$P(\text{failure} | A = B) = 0$$

However, due to the potential of collisions under mod $ p$, there is a non-zero probability of false positives.

$$
\begin{align*}
P(\text{failure} | A \neq B) &= P(A \equiv B \text{ mod } p | A \neq B) \
&= P((A - B) \equiv 0 \text{ mod } p | A \neq B) \
&= P(C \equiv 0 \text{ mod } p | A \neq B) && C = |A - B|
\end{align*}
$$

Then, we need to examine the probability that some non-zero $C$, $1 \leq C \leq 2^n$ has $p$ as a prime factor.

To do this, we will (somewhat loosely) use the Prime Number Theorem.

Prime Number Theorem. For any integer $N \geq 2$, the number of primes in the interval $[2, N]$ is roughly $\frac{N}{\log N}$, as $N \rightarrow \infty$.

Then, the approximate number of primes in the range $[2, n^2 \log n]$ is $\frac{n^2 \log n}{2 \log n + \log\log n} \approx n^2$.

We also observe that the quantity $C$ has at most $n$ distinct prime divisors. This is a satisfactory upper bound, as the smallest prime is $2$, and any more than $n$ prime divisors would imply that $C > 2^n$.

Then, the probability that the randomly chosen prime $p$ is a factor of $C$ is $≤ \frac{n}{n^2} = \frac{1}{n}$.

The total error then becomes $≥ 1 - \frac{1}{n}$.

While this error is already polynomially small, it can be reduced even further by "boosting" – modifying the above protocol to repeat for multiple rounds, and outputting NO if any of the rounds produces a negative result. The analysis would follow, except the polynomial error would decrease to $\frac{1}{n^c}$, where $c$ is the number of rounds.

Application.

A motivating problem for fingerprinting stems from DNA matching. Suppose we have a length $n$ string representing the entire human genome, $X$, and a length $m$ DNA substring $Y$, $m < n$. We want to quickly determine with high probability whether $Y$ is a substring of $X$.

Naively, this substring check can be done in $O(mn)$ time. There exist more difficult to implement deterministic, linear time algorithms – $O(m + n)$.

A very simple linear time randomized algorithm using fingerprints may be used to verify whether $Y$ is a substring of $X$ in $O(m+n)$ time. In this example, we use the fingerprint function $F_p(x)$ as a black box that takes a prime number $p$ and an input $x$.

Algorithm.

Pick a random prime $p \in [2, m^2 \log m]$. Pre-compute the fingerprint $F_p(Y)$. For each length $m$ substring of $X$, $X(i)$ starting at index $i$, $1 \leq i \leq n - m + 1$, compute $F_p(X(i))$. If $F_p(X(i)) \equiv F_p(Y) \text{ mod } p$, output MATCH and halt. Otherwise, once the loop completes, output NO MATCH.

I'll leave the analysis of this algorithm to the reader (it is very, very similar to the prior analysis). It can be shown that the error rate is polynomial in $m$ and $n$. :)

That concludes today's post! I may address more applications of fingerprinting in future posts, but I hope you found the motivating problem compelling, as well. As always, please leave a comment if you have any questions or corrections. Cheers!