27-4 Multithreading reductions and prefix computations

A \(\otimes\)-reduction of an array \(x[1 \ldots n]\), where \(\otimes\) is an associative operator, is the value

\[y = x[1] \otimes x[2] \otimes \cdots \otimes x[n].\]

The following procedure computes the \(\otimes\)-reduction of a subarray \(x[i \ldots j]\) serially.

C++
1
2
3
4
5
REDUCE(x, i, j)
    y = x[i]
    for k = i + 1 to j
         y = y  x[k]
    return y

a. Use nested parallelism to implement a multithreaded algorithm \(\text{P-REDUCE}\), which performs the same function with \(\Theta(n)\) work and \(\Theta(\lg n)\) span. Analyze your algorithm.

A related problem is that of computing a \(\otimes\)-prefix computation, sometimes called a \(\otimes\)-scan, on an array \(x[1 \ldots n]\), where \(\otimes\) is once again an associative operator. The \(\otimes\)-scan produces the array \(y[1 \ldots n]\) given by

\[ \begin{aligned} y[1] & = x[1], \\\\ y[2] & = x[1] \otimes x[2], \\\\ y[3] & = x[1] \otimes x[2] \otimes x[3], \\\\ & \vdots \\\\ y[n] & = x[1] \otimes x[2] \otimes x[3] \otimes \cdots \otimes x[n], \end{aligned} \]

that is, all prefixes of the array \(x\) "summed" using \(\otimes\) operator. The following serial procedure \(\text{SCAN}\) performs a \(\otimes\)-prefix computation:

C++
1
2
3
4
5
6
7
SCAN(x)
    n = x.length
    let y[1..n] be a new array
    y[1] = x[1]
    for i = 2 to n
        y[i] = y[i - 1]  x[i]
    return y

Unfortunately, multithreading \(\text{SCAN}\) is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure \(\text{P-SCAN-1}\) performs the \(\otimes\)-prefix computation in parallel, albeit inefficiently.

C++
1
2
3
4
5
P-SCAN-1(x)
    n = x.length
    let y[1..n] be a new array
    P-SCAN-1-AUX(x, y, 1, n)
    return y
C++
1
2
3
P-SCAN-1-AUX(x, y, i, j)
    parallel for l = i to j
        y[l] = P-REDUCE(x, 1, l)

b. Analyze the work, span, and parallelism of \(\text{P-SCAN-1}\).

By using nested parallelism, we can obtain a more efficient \(\otimes\)-prefix computation:

C++
1
2
3
4
5
P-SCAN-2(x)
    n = x.length
    let y[1..n] be a new array
    P-SCAN-2-AUX(x, y, 1, n)
    return y
C++
1
2
3
4
5
6
7
8
9
P-SCAN-2-AUX(x, y, i, j)
    if i == j
        y[i] = x[i]
    else k = floor((i + j) / 2)
        spawn P-SCAN-2-AUX(x, y, i, k)
        P-SCAN-2-AUX(x, y, k + 1, j)
        sync
        parallel for l = k + 1 to j
            y[l] = y[k]  y[l]

c. Argue that \(\text{P-SCAN-2}\) is correct, and analyze its work, span, and parallelism.

We can improve on both \(\text{P-SCAN-1}\) and \(\text{P-SCAN-2}\) by performing the \(\otimes\)-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of \(x\) into a temporary array \(t\), and on the second pass we use the terms in \(t\) to compute the final result \(y\). The following pseudocode implements this strategy, but certain expressions have been omitted:

C++
1
2
3
4
5
6
7
8
P-SCAN-3(x)
    n = x.length
    let y[1..n] and t[1..n] be new arrays
    y[1] = x[1]
    if n > 1
        P-SCAN-UP(x, t, 2, n)
        P-SCAN-DOWN(x[1], x, t, y, 2, n)
    return y
C++
1
2
3
4
5
6
7
8
9
P-SCAN-UP(x, t, i, j)
    if i == j
        return x[i]
    else
        k = floor((i + j) / 2)
        t[k] = spawn P-SCAN-UP(x, t, i, k)
        right = P-SCAN-UP(x, t, k + 1, j)
        sync
        return ____           // fill in the blank
C++
1
2
3
4
5
6
7
8
P-SCAN-DOWN(v, x, t, y, i, j)
    if i == j
        y[i] = v  x[i]
    else
        k = floor((i + j) / 2)
        spawn P-SCAN-DOWN(____, x, t, y, i, k)  // fill in the blank
        P-SCAN-DOWN(____, x, t, y, k + 1, j)    // fill in the blank
        sync

d. Fill in the three missing expressions in line 8 of \(\text{P-SCAN-UP}\) and lines 5 and 6 of \(\text{P-SCAN-DOWN}\). Argue that with expressions you supplied, \(\text{P-SCAN-3}\) is correct. (\(\textit{Hint:}\) Prove that the value \(v\) passed to \(\text{P-SCAN-DOWN}(v, x, t, y, i, j)\) satisfies \(v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]\).)

e. Analyze the work, span, and parallelism of \(\text{P-SCAN-3}\).

(Removed)