跳转至

32.1 The naive string-matching algorithm

32.1-1

Show the comparisons the naive string matcher makes for the pattern \(P = 0001\) in the text \(T = 000010001010001\).

C++
1
2
3
4
5
STRING-MATCHER(P, T, i)
    for j = i to i + P.length
        if P[j - i + 1] != T[j]
            return false
    return true

32.1-2

Suppose that all characters in the pattern \(P\) are different. Show how to accelerate \(\text{NAIVE-STRING-MATCHER}\) to run in time \(O(n)\) on an \(n\)-character text \(T\).

Suppose \(T[i] \ne P[j]\), then for \(k \in [1, j)\), \(T[i - k] = P[j - k] \ne P[0]\), the \([i - k, i)\) are all invalid shifts which could be skipped, therefore we can compare \(T[i]\) with \(P[0]\) in the next iteration.

32.1-3

Suppose that pattern \(P\) and text \(T\) are randomly chosen strings of length \(m\) and \(n\), respectively, from the \(d\)-ary alphabet \(\Sigma_d = \\{ 0, 1, \ldots, d - 1 \\}\), where \(d \ge 2\). Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is

\[(n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}} \le 2(n - m + 1)\]

over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

Suppose for each shift, the number of compared characters is \(L\), then:

\[ \begin{aligned} \text E[L] & = 1 \cdot \frac{d - 1}{d} + 2 \cdot (\frac{1}{d})^1 \frac{d - 1}{d} + \cdots + m \cdot (\frac{1}{d})^{m - 1} \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m} \\\\ & = (1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m}) \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m}. \end{aligned} \]
\[ \begin{aligned} S & = 1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m - 1} \\\\ \frac{1}{d}S & = 1 \cdot (\frac{1}{d})^1 + \cdots + (m - 1) \cdot (\frac{1}{d})^{m - 1} + m \cdot (\frac{1}{d})^{m} \\\\ \frac{d - 1}{d}S & = 1 + (\frac{1}{d})^1 + \cdots + \cdot (\frac{1}{d})^{m - 1} - m \cdot (\frac{1}{d})^{m} \\\\ \frac{d - 1}{d}S & = \frac{1 - d^{-m}}{1 - d^{-1}} - m \cdot (\frac{1}{d})^{m}. \end{aligned} \]
\[ \begin{aligned} \text E[L] & = (1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m}) \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m} \\\\ & = \frac{1 - d^{-m}}{1 - d^{-1}} - m \cdot (\frac{1}{d})^{m} + m \cdot (\frac{1}{d})^{m} \\\\ & = \frac{1 - d^{-m}}{1 - d^{-1}}. \end{aligned} \]

There are \(n - m + 1\) shifts, therefore the expected number of comparisons is:

\[(n - m + 1) \cdot \text E[L] = (n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}}\]

Since \(d \ge 2\), \(1 - d^{-1} \ge 0.5\), \(1 - d^{-m} < 1\), and $ \frac{1 - d^{-m}}{1 - d^{-1}} \le 2$, therefore

\[(n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}} \le 2 (n - m + 1).\]

32.1-4

Suppose we allow the pattern \(P\) to contain occurrences of a gap character \(\diamond\) that can match an arbitrary string of characters (even one of zero length). For example, the pattern \(ab\diamond ba\diamond c\) occurs in the text \(cabccbacbacab\) as

\[c \underbrace{ab}\_{ab} \underbrace{cc}\_{\diamond} \underbrace{ba}\_{ba} \underbrace{cba}\_{\diamond} \underbrace{c}\_{c} ab\]

and as

\[c \underbrace{ab}\_{ab} \underbrace{ccbac}\_{\diamond} \underbrace{ba}\_{ba} \underbrace{\text{ }}\_{\diamond} \underbrace{c}\_{c} ab\]

Note that the gap character may occur an arbitrary number of times in the pattern but not at all in the text. Give a polynomial-time algorithm to determine whether such a pattern \(P\) occurs in a given text \(T\), and analyze the running time of your algorithm.

By using dynamic programming, the time complexity is \(O(mn)\) where \(m\) is the length of the text \(T\) and \(n\) is the length of the pattern \(P\); the space complexity is \(O(mn)\), too.

This problem is similar to LeetCode 44. WildCard Matching, except that it has no question mark (?) requirement. You can see my naive DP implementation here.