32.1 The naive string-matching algorithm

32.1-1

Show the comparisons the naive string matcher makes for the pattern $P = 0001$ in the text $T = 000010001010001$.

STRING-MATCHER(P, T, i)
    // return true iff P occurs in T starting at position i
    for j = i to i + P.length - 1
        if P[j - i + 1] != T[j]
            return false
    return true
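
The exercise asks for the individual comparisons, so a trace may help. The following is a minimal Python sketch, not part of the original answer: the function name `naive_matcher_trace` and the use of 0-based shifts are assumptions made for illustration. It stops comparing at a shift as soon as a mismatch is found.

```python
def naive_matcher_trace(P, T):
    """Return (valid_shifts, comparisons) for the naive matcher, 0-based shifts."""
    m, n = len(P), len(T)
    valid_shifts = []
    comparisons = []  # one (shift, text_index, matched) tuple per comparison
    for s in range(n - m + 1):
        matched_all = True
        for j in range(m):
            ok = P[j] == T[s + j]
            comparisons.append((s, s + j, ok))
            if not ok:
                # Stop at the first mismatch for this shift.
                matched_all = False
                break
        if matched_all:
            valid_shifts.append(s)
    return valid_shifts, comparisons


shifts, comps = naive_matcher_trace("0001", "000010001010001")
print(shifts, len(comps))
```

For the pattern and text of this exercise, the sketch reports the valid shifts 1, 5 and 11 after 31 character comparisons in total.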

32.1-2

Suppose that all characters in the pattern $P$ are different. Show how to accelerate $\text{NAIVE-STRING-MATCHER}$ to run in time $O(n)$ on an $n$-character text $T$.

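Using 0-based indices, suppose a mismatch $T[i] \ne P[j]$ occurs after $P[0..j - 1]$ has matched. For every $k \in [1, j)$ we have $T[i - k] = P[j - k] \ne P[0]$, because the characters of $P$ are pairwise distinct. Hence no occurrence can begin at any of the positions $i - j + 1, \ldots, i - 1$, and these shifts can be skipped: the next iteration compares $T[i]$ with $P[0]$ (or $T[i + 1]$ with $P[0]$ when $j = 0$). Each text character is therefore compared at most twice, so the running time is $O(n)$.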
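
The skipping argument can be sketched in Python as follows. This is an illustrative sketch under the stated assumption that all characters of $P$ are distinct and $P$ is non-empty; the name `match_distinct` is hypothetical and indices are 0-based.

```python
def match_distinct(P, T):
    """Find all occurrences of P in T, assuming P's characters are pairwise distinct."""
    n, m = len(T), len(P)
    shifts = []
    i = 0  # current position in the text
    j = 0  # number of pattern characters matched so far
    while i < n:
        if T[i] == P[j]:
            i += 1
            j += 1
            if j == m:
                # Full match; occurrences cannot overlap when characters are distinct.
                shifts.append(i - m)
                j = 0
        elif j == 0:
            # Mismatch on the first pattern character: advance in the text.
            i += 1
        else:
            # Mismatch after a partial match: retry T[i] against P[0].
            j = 0
    return shifts


print(match_distinct("abc", "ababcabcab"))
```

The text index `i` never moves backwards, and each position is compared at most twice (once as a mismatch against a later pattern character, once against `P[0]`), which gives the linear bound.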

32.1-3

Suppose that pattern $P$ and text $T$ are randomly chosen strings of length $m$ and $n$, respectively, from the $d$-ary alphabet $\Sigma_d = \\{ 0, 1, \ldots, d - 1 \\}$, where $d \ge 2$. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is

\[(n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}} \le 2(n - m + 1)\]

over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

Let $L$ be the number of characters compared at a given shift. Each comparison matches with probability $1 / d$, so:

\[ \begin{aligned} \text E[L] & = 1 \cdot \frac{d - 1}{d} + 2 \cdot (\frac{1}{d})^1 \frac{d - 1}{d} + \cdots + m \cdot (\frac{1}{d})^{m - 1} \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m} \\\\ & = (1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m - 1}) \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m}. \end{aligned} \]

Write $S$ for the sum in parentheses and subtract $\frac{1}{d} S$ from it:

\[ \begin{aligned} S & = 1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m - 1} \\\\ \frac{1}{d}S & = 1 \cdot (\frac{1}{d})^1 + \cdots + (m - 1) \cdot (\frac{1}{d})^{m - 1} + m \cdot (\frac{1}{d})^{m} \\\\ \frac{d - 1}{d}S & = 1 + (\frac{1}{d})^1 + \cdots + (\frac{1}{d})^{m - 1} - m \cdot (\frac{1}{d})^{m} \\\\ \frac{d - 1}{d}S & = \frac{1 - d^{-m}}{1 - d^{-1}} - m \cdot (\frac{1}{d})^{m}. \end{aligned} \]

\[ \begin{aligned} \text E[L] & = (1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m - 1}) \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m} \\\\ & = \frac{1 - d^{-m}}{1 - d^{-1}} - m \cdot (\frac{1}{d})^{m} + m \cdot (\frac{1}{d})^{m} \\\\ & = \frac{1 - d^{-m}}{1 - d^{-1}}. \end{aligned} \]

There are $n - m + 1$ shifts, therefore the expected number of comparisons is:

\[(n - m + 1) \cdot \text E[L] = (n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}}\]

Since $d \ge 2$, we have $1 - d^{-1} \ge 0.5$ and $1 - d^{-m} < 1$, so $\frac{1 - d^{-m}}{1 - d^{-1}} \le 2$. Therefore

\[(n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}} \le 2 (n - m + 1).\]
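
As a sanity check on the closed form, the expectation can be computed exactly for small parameters by enumerating every pattern and text. The following is a brute-force Python sketch; the function names are hypothetical, and it is only practical for small $d$, $m$ and $n$ because it visits all $d^{m + n}$ pairs.

```python
from fractions import Fraction
from itertools import product


def count_comparisons(P, T):
    """Count character comparisons of the naive matcher with early termination."""
    m = len(P)
    total = 0
    for s in range(len(T) - m + 1):
        for j in range(m):
            total += 1
            if P[j] != T[s + j]:
                break
    return total


def expected_exact(d, m, n):
    """Average comparison count over all patterns and texts, as an exact fraction."""
    total = sum(
        count_comparisons(P, T)
        for P in product(range(d), repeat=m)
        for T in product(range(d), repeat=n)
    )
    return Fraction(total, d ** (m + n))


def expected_formula(d, m, n):
    """Closed form (n - m + 1) * (1 - d^-m) / (1 - d^-1)."""
    q = Fraction(1, d)
    return (n - m + 1) * (1 - q ** m) / (1 - q)


print(expected_exact(2, 3, 5), expected_formula(2, 3, 5))
```

For $d = 2$, $m = 3$, $n = 5$ both functions give $21/4$, which is below the bound $2(n - m + 1) = 6$.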

32.1-4

Suppose we allow the pattern $P$ to contain occurrences of a gap character $\diamond$ that can match an arbitrary string of characters (even one of zero length). For example, the pattern $ab\diamond ba\diamond c$ occurs in the text $cabccbacbacab$ as

\[c \underbrace{ab}\_{ab} \underbrace{cc}\_{\diamond} \underbrace{ba}\_{ba} \underbrace{cba}\_{\diamond} \underbrace{c}\_{c} ab\]

and as

\[c \underbrace{ab}\_{ab} \underbrace{ccbac}\_{\diamond} \underbrace{ba}\_{ba} \underbrace{\text{ }}\_{\diamond} \underbrace{c}\_{c} ab\]

Note that the gap character may occur an arbitrary number of times in the pattern but not at all in the text. Give a polynomial-time algorithm to determine whether such a pattern $P$ occurs in a given text $T$, and analyze the running time of your algorithm.

Dynamic programming solves this in $O(mn)$ time, where $n$ is the length of the text $T$ and $m$ is the length of the pattern $P$; the space complexity is $O(mn)$ as well.

This problem is similar to LeetCode 44, Wildcard Matching, except that there is no question-mark (?) wildcard. You can see my naive DP implementation here.
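
One possible shape for such a dynamic program is sketched below in Python. It is not the author's implementation: the function name `occurs_with_gaps` and the use of `*` as a stand-in for the gap character $\diamond$ are assumptions for illustration. The sketch keeps only one row of the table at a time, so it uses space linear in the text length while still doing work proportional to the product of the two lengths.

```python
def occurs_with_gaps(P, T, gap="*"):
    """Return True if pattern P (with gap characters) occurs somewhere in text T."""
    n = len(T)
    # prev[i] is True if the pattern prefix processed so far can match a
    # substring of T ending just before index i. The empty prefix matches anywhere.
    prev = [True] * (n + 1)
    for c in P:
        cur = [False] * (n + 1)
        if c == gap:
            # A gap absorbs any number of characters, so every end position
            # at or after a reachable one becomes reachable.
            reachable = False
            for i in range(n + 1):
                reachable = reachable or prev[i]
                cur[i] = reachable
        else:
            # An ordinary character must equal the next text character.
            for i in range(1, n + 1):
                cur[i] = prev[i - 1] and T[i - 1] == c
        prev = cur
    return any(prev)


print(occurs_with_gaps("ab*ba*c", "cabccbacbacab"))
```

Because the row for the empty prefix is true at every position, the match may start anywhere in the text, which is what distinguishes this from matching the whole string.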
