|
5 | 5 | - Practice |
6 | 6 | --- |
7 | 7 | # 6.4 - String of Words |
| 8 | +![[Pasted image 20260116141341.png]] |
| 9 | + |
| 10 | +## Exploration |
| 11 | +First, we need some examples. Specifically, we need an example that shows where the DP algorithm must backtrack. This happens when a word identified earlier needs to be broken up to form: |
| 12 | + |
| 13 | +1. one whole word, and the beginning of the next word |
| 14 | +2. two or more whole words |
| 15 | +3. two or more whole words, and the beginning of the next word |
| 16 | + |
| 17 | +For case 2, this algorithm doesn't seek to identify grammatically correct sentences, just strings of whole words that exist in a given dictionary. Therefore, case 2 won't actually result in any backtracking. |
| 18 | + |
| 19 | +If we don't have such an example, then we may accidentally develop a greedy $O(n)$ algorithm, which runs forward collecting letters into the longest word available in the dictionary. |
| 20 | + |
| 21 | +### Greedy Algorithm Failure Example |
| 22 | +To avoid coming up with an example which uses real words, let's review this contrived one. |
| 23 | + |
| 24 | +- $dict(w)=w \in \{a, ab, bc\}$ |
| 25 | +- $S=abc$ |
| 26 | + |
| 27 | +In this example, a greedy forward-only DP algorithm with no backtracking would produce the results listed below and wrongly conclude that $S$ is not a string of words. How do we force $s(3)$ to find $T,\{a, bc\}$? In addition to $s(2)=T,\{ab\}$, we would also need to keep track of $s(2)=F,\{a,b\}$, where $b$ is not a word by itself but is the beginning of $bc$. |
| 28 | + |
| 29 | +- $s(1)=T,\{a\}$ |
| 30 | +- $s(2)=T,\{ab\}$ |
| 31 | +- $s(3)=F,\{ab,c\} \leftarrow \text{error}$ |
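The failure listed above can be reproduced with a short sketch. It assumes the dictionary is a plain Python set, and `greedy_longest_match` is a hypothetical helper name, not something from the problem statement. It is only one possible reading of "collect letters into the longest word available" and makes no attempt to run in $O(n)$.

```python
from typing import Optional


def greedy_longest_match(s: str, words: set[str]) -> Optional[list[str]]:
    """Commit to the longest dictionary word at each position, never backtrack."""
    result: list[str] = []
    start = 0
    while start < len(s):
        # Try the longest candidate first and commit to the first hit.
        for end in range(len(s), start, -1):
            if s[start:end] in words:
                result.append(s[start:end])
                start = end
                break
        else:
            # No dictionary word starts here, and greedy cannot undo
            # its earlier choice, so it gives up.
            return None
    return result


# Greedy commits to "ab", is left with "c", and fails,
# even though "a" + "bc" is a valid reading.
print(greedy_longest_match("abc", {"a", "ab", "bc"}))
```

The call prints `None`, matching the erroneous $s(3)=F$ result above.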
| 32 | + |
| 33 | +### Ambiguity Example |
| 34 | +Note that a given dictionary and string can be ambiguous. In the contrived example below, many different word sequences can construct the given $S$. |
| 35 | + |
| 36 | +- $dict(w)=w \in \{a, b, c, d, ab, bc, cd, abc, bcd, abcd\}$ |
| 37 | +- $S=abcdabcdabcd$ |
| 38 | + |
| 39 | +Sample of possible results: |
| 40 | +- $\{a,b,c,d,a,b,c,d,a,b,c,d\}$ |
| 41 | +- $\{ab,cd,ab,cd,ab,cd\}$ |
| 42 | +- $\{abcd,abcd,abcd\}$ |
| 43 | +- $\{abc,d,a,bcd,abcd\}$ |
| 44 | +- $\{abc,d,a,bcd,ab,cd\}$ |
| 45 | +- $\{a,b,c,d,abcd,abc,d\}$ |
| 46 | +- ... |
| 47 | + |
| 48 | +This problem only asks whether a given string "can be reconstituted as a series of valid words." This allows for the solution to be less complicated, though it's likely possible to construct an algorithm which returns all possible results while staying within a DP solution space. |
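As a hedged illustration of how much ambiguity there can be, the readings can be counted with a prefix-by-prefix DP without listing them. This is only a sketch: the set-based dictionary and the name `count_segmentations` are assumptions made for the example, and counting is a different question from the yes/no one the problem asks.

```python
def count_segmentations(s: str, words: set[str]) -> int:
    """Count the ways s can be split into a sequence of dictionary words."""
    # ways[i] = number of valid readings of the prefix s[:i].
    ways = [0] * (len(s) + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) reading
    for i in range(1, len(s) + 1):
        for j in range(i):
            # Extend every reading of s[:j] with the word s[j:i].
            if ways[j] and s[j:i] in words:
                ways[i] += ways[j]
    return ways[len(s)]


words = {"a", "b", "c", "d", "ab", "bc", "cd", "abc", "bcd", "abcd"}
print(count_segmentations("abcdabcdabcd", words))
```

For the dictionary and $S$ above this counts 512 readings: no word spans a `d`/`a` boundary, and each `abcd` block can be split in 8 ways on its own.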
| 49 | + |
| 50 | +## Step 1: Define the Subproblem in Words |
| 51 | +``` |
| 52 | +Given |
| 53 | +    a function dict(S') which returns T or F for arbitrary length S' |
| 54 | +    a sequence of characters S of length n |
| 55 | + |
| 56 | +for i=1 -> n: |
| 57 | +    for j=1 -> i: |
| 58 | +        VS(i,j) contains T or F, indicating whether S_1, ..., S_j contains valid word strings. |
| 59 | +``` |
| 60 | + |
| 61 | +Why does this need to be 2D? What am I using $i$ for? Let's simplify (I found an example solution online.) |
| 62 | + |
| 63 | +``` |
| 64 | +Given |
| 65 | +    a sequence of characters S of length n |
| 66 | +    a function dict(S'), which returns T or F for an arbitrary-length contiguous subsequence (i.e. substring) S' of S |
| 67 | + |
| 68 | +for i=1 -> n: |
| 69 | +    VS(i) is T or F, indicating whether S_1, ..., S_i can be split into a sequence of valid words. |
| 70 | +    P(i) is the ending index of the prior word. |
| 71 | +``` |
| 72 | + |
| 73 | +## Step 2: Define the Recurrence Relation |
| 74 | +For $VS(i)$, we need to find a previous index $0 \le j \lt i$ such that the prefix ending at $j$ consists only of valid words ($VS(j)=True$). For convenience and completeness, we can assert that $VS(0)=True$, because an empty string contains no _invalid_ words. Then we need to check $dict(s_{j+1},\space ... \space,s_i)$. If it is $True$, then $VS(i)=True$ and $P(i)=j$. If there is no $j$ for which both $VS(j)$ and $dict(s_{j+1},\space ... \space, s_i)$ are $True$, then $VS(i)=False$. |
| 75 | + |
| 76 | +One possible way to write this as a single expression is $VS(i)=\bigvee_{0 \le j \lt i}\left(VS(j) \land dict(s_{j+1},\space ... \space,s_i)\right)$, with the base case $VS(0)=True$. Anyway, here's the code. |
| 77 | +## Code |
| 78 | +This code models `is_in_dict` as a function which takes $S$, along with the starting and ending indexes (both inclusive) of a given substring. This allows the dictionary to be modeled as a trie, or some other graph structure, as opposed to a hash table with non-guaranteed $O(1)$ lookup performance. Even with guaranteed $O(1)$ lookups, you would still need to build a substring for each range, which costs time proportional to its length. |
| 79 | + |
| 80 | +```python |
| 81 | +import typing |
| 82 | +
| 83 | +character = int | str |
| 84 | +string = list[character] | tuple[character, ...] |
| 85 | +
| 86 | +def string_of_words( |
| 87 | +    S: string, |
| 88 | +    is_in_dict: typing.Callable[[string, int, int], bool] |
| 89 | +) -> tuple[bool, list[string] | None]: |
| 90 | +    VS: list[bool] = [False for _ in S] |
| 91 | +    P: list[int] = [-1 for _ in S] |
| 92 | +
| 93 | +    for i in range(len(S)): |
| 94 | +        # Iterating j forward will tend to find longer words. |
| 95 | +        # Iterating j in reverse will tend to find shorter words. |
| 96 | +        # If we want to find all possible strings of words, we probably |
| 97 | +        # need to use 2D "VS" and "P" arrays. |
| 98 | +        for j in range(-1, i): |
| 99 | +            if j == -1 or VS[j]: |
| 100 | +                if is_in_dict(S, j+1, i): |
| 101 | +                    VS[i] = True |
| 102 | +                    P[i] = j |
| 103 | +                    break |
| 104 | +
| 105 | +    if S and not VS[-1]: |
| 106 | +        return False, None |
| 107 | +
| 108 | +    # Walk the P links backward from the end of S to recover the words. |
| 109 | +    # An empty S skips the loop and yields an empty sentence. |
| 110 | +    sentence: list[string] = [] |
| 111 | +    current_end = len(S) |
| 112 | +    while current_end > 0: |
| 113 | +        previous_end = P[current_end - 1] |
| 114 | +        sentence.insert(0, list(S[previous_end+1:current_end])) |
| 115 | +        current_end = previous_end+1 |
| 116 | +    return True, sentence |
| 117 | +
| 118 | +``` |
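The trie idea mentioned before the code can be sketched as a factory for the `is_in_dict` callback. This is an illustration under assumptions, not part of the original solution: `make_trie_lookup` is a hypothetical name, the trie is a nested `dict`, and the start and end indexes are treated as inclusive, matching the call `is_in_dict(S, j+1, i)` above.

```python
from typing import Callable, Sequence


def make_trie_lookup(
    words: list[str],
) -> Callable[[Sequence[str], int, int], bool]:
    """Build a nested-dict trie and return an is_in_dict-style callback."""
    end_marker = object()  # sentinel key marking the end of a stored word
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[end_marker] = True

    def is_in_dict(S: Sequence[str], start: int, end: int) -> bool:
        # Walk the trie over S[start..end] (inclusive) without
        # building a substring.
        node = root
        for k in range(start, end + 1):
            node = node.get(S[k])
            if node is None:
                return False
        return end_marker in node

    return is_in_dict


lookup = make_trie_lookup(["a", "ab", "bc"])
print(lookup("abc", 0, 1), lookup("abc", 1, 1))
```

Here `lookup("abc", 0, 1)` is `True` because `ab` is a stored word, while `lookup("abc", 1, 1)` is `False` because `b` is only a prefix of `bc`. The returned callback could be passed as the `is_in_dict` argument of `string_of_words`.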