Skip to content

[Model] LongestCommonSubsequence #414

@isPANN

Description

@isPANN

Motivation

LONGEST COMMON SUBSEQUENCE (P158) from Garey & Johnson, A4 SR10. A fundamental problem in string algorithms with deep connections to edit distance, diff utilities, and bioinformatics (sequence alignment). For |R| = 2, solvable in O(n^2) by classic dynamic programming (Wagner and Fischer, 1974). For an arbitrary number of strings, NP-complete even over a binary alphabet |Σ| = 2 (Maier, 1978). Dual to Shortest Common Supersequence: for two strings, SCS length = |x| + |y| - LCS length. The analogous LONGEST COMMON SUBSTRING problem (contiguous) is trivially solvable in polynomial time.

Associated rules:

  • R104: MinimumVertexCover → LongestCommonSubsequence (as target)

Definition

Name: LongestCommonSubsequence
Canonical name: LONGEST COMMON SUBSEQUENCE
Reference: Garey & Johnson, Computers and Intractability, A4 SR10

Mathematical definition:

INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*, and a positive integer K.
QUESTION: Is there a string w ∈ Σ* with |w| ≥ K such that w is a subsequence of each string x ∈ R?

Variables

  • Count: K variables (one per position in the candidate common subsequence w), where K is the length bound.
  • Per-variable domain: {0, 1, ..., |Σ|-1} — index into the alphabet Σ.
  • Meaning: Variable i encodes the symbol at position i of the candidate common subsequence w. A satisfying assignment produces a string w of length ≥ K such that w is a subsequence of every string in R. That is, for each x ∈ R, there exist indices 1 ≤ i_1 < i_2 < ... < i_K ≤ |x| with x[i_j] = w[j] for all j.

Schema (data type)

Type name: LongestCommonSubsequence
Variants: none

Field Type Description
alphabet_size usize Size of the alphabet
strings Vec<Vec<usize>> Set R of input strings, each encoded as a vector of alphabet indices
bound usize Length bound K

Notes:

  • This is a satisfaction (decision) problem: Metric = bool, implementing SatisfactionProblem.
  • A string w is a subsequence of x if w can be obtained by deleting zero or more characters from x (not necessarily contiguous).
  • For the optimization version, maximize |w| subject to w being a subsequence of all strings in R.
  • Dual to ShortestCommonSupersequence: for two strings x, y, LCS(x,y) + SCS(x,y) = |x| + |y|.

Complexity

  • Best known exact algorithm:
    • For |R| = 2: O(n · m) DP where n = |x_1|, m = |x_2| (classic Needleman-Wunsch / Wagner-Fischer). Subquadratic algorithms exist: O(n^2 / log^2 n) (Masek and Paterson, 1980). Conditional lower bound: no truly subquadratic O(n^{2-ε}) algorithm unless SETH fails.
    • For fixed |R| = k: O(n^k) multi-dimensional DP, where n = max string length.
    • For arbitrary |R|: NP-hard. The dominant-point approach with divide-and-conquer heuristics is used in practice.
  • Approximation: Hard to approximate within ratio n^{1-ε} for any ε > 0 unless P = NP (for arbitrary |R|).
  • NP-completeness: NP-complete (Maier, 1978). Remains NP-complete even if |Σ| = 2.
  • Polynomial cases:
    • Fixed |R|: O(n^|R|) DP.
    • Fixed K: O(n^K · |R|) by checking all length-K subsequences.
    • |R| = 2: O(n · m) classical DP.
  • References:
    • David Maier (1978). "The complexity of some problems on subsequences and supersequences". JACM 25(2):322-336.
    • Robert A. Wagner and Michael J. Fischer (1974). "The string-to-string correction problem". JACM 21:168-173.
    • William J. Masek and Michael S. Paterson (1980). "A faster algorithm computing string edit distances". J. Comput. Syst. Sci. 20(1):18-31.

Extra Remark

Full book text:

INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*, and a positive integer K.
QUESTION: Is there a string w ∈ Σ* with |w| ≥ K such that w is a subsequence of each x ∈ R?
Reference: [Maier, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if |Σ| = 2. Solvable in polynomial time for any fixed K or for fixed |R| (by dynamic programming, e.g., see [Wagner and Fischer, 1974]). The analogous LONGEST COMMON SUBSTRING problem is trivially solvable in polynomial time.

How to solve

  • It can be solved by (existing) bruteforce — enumerate all strings w ∈ Σ* with |w| = K and check if w is a subsequence of each x ∈ R.
  • It can be solved by reducing to integer programming — encode matching positions as integer variables with ordering constraints.
  • Other: For |R| = 2, classic DP in O(n · m) time. For fixed |R|, multi-dimensional DP in O(n^|R|). For arbitrary |R|, dominant-point algorithms, A* search, or branch-and-bound. Also related to maximum clique in a "match graph" formulation.

Example Instance

Instance 1 (YES, binary alphabet, 6 strings):

  • Alphabet: Σ = {0, 1} (|Σ| = 2)
  • Strings: R = {"010110", "100101", "001011", "110010", "010101", "101010"}
  • Bound K = 3
  • Candidate LCS: w = "010"
    • "010" ⊆ "010110"? 0(1)1(2)0(3) -- wait, w = "010": 0(pos1), 1(pos3), 0(pos4) ✓
    • "010" ⊆ "100101"? 0(pos2), 1(pos4), 0(pos5) ✓
    • "010" ⊆ "001011"? 0(pos1), 1(pos4), 0(-- need 0 after pos 4) -- "001011" = 0,0,1,0,1,1. 0(pos1), 1(pos3), 0(pos4) ✓
    • "010" ⊆ "110010"? 0(pos3), 1(pos4)... wait: "110010" = 1,1,0,0,1,0. 0(pos3), 1(pos5), 0(pos6) ✓
    • "010" ⊆ "010101"? 0(pos1), 1(pos2), 0(pos3) ✓
    • "010" ⊆ "101010"? 0(pos2), 1(pos3), 0(pos4) ✓
  • Answer: YES

Instance 2 (YES, ternary alphabet, 6 strings, verified):

  • Alphabet: Σ = {a, b, c} (|Σ| = 3)
  • Strings: R = {"abcabc", "bacbac", "cabcab", "abcbac", "bcabca", "acbacb"}
  • Bound K = 3
  • Candidate LCS: w = "abc"
    • "abc" ⊆ "abcabc"? a(1)b(2)c(3) ✓
    • "abc" ⊆ "bacbac"? a(2)... wait: "bacbac" = b,a,c,b,a,c. a(pos2), b(pos4), c(pos6) ✓
    • "abc" ⊆ "cabcab"? a(pos3), b(pos4), c(-- need c after pos 4): "cabcab" = c,a,b,c,a,b. a(pos2), b(pos3), c(pos4) ✓
    • "abc" ⊆ "abcbac"? a(pos1), b(pos2), c(pos3) ✓
    • "abc" ⊆ "bcabca"? a(pos3), b(pos4), c(pos5) ✓
    • "abc" ⊆ "acbacb"? a(pos1), -- need b after pos1: b(pos3), c(pos4)... "acbacb" = a,c,b,a,c,b. a(pos1), b(pos3), c(pos5) ✓ (but w = "abc" needs a then b then c in order: a(1), c is at 2 but we need b first... a(pos1), b(pos3), c(pos5) ✓ since 1 < 3 < 5)
  • Candidate LCS: w = "bac"
    • "bac" ⊆ "abcabc"? b(pos2), a(pos4), c(pos6) ✓
    • "bac" ⊆ "bacbac"? b(pos1), a(pos2), c(pos3) ✓
    • "bac" ⊆ "cabcab"? -- "cabcab" = c,a,b,c,a,b. b(pos3), a(pos5), -- need c after pos5: no c after pos5. b(pos6)... only one b after a. Actually b(pos3), a(pos5), but no c after pos 5. Try b(pos6)? only one b at pos 3 and 6. b(pos6), but then need a after 6 -- fails.
    • "bac" does not work for "cabcab".
  • The LCS "abc" works for all 6 strings with K = 3.
  • Answer: YES

Instance 3 (NO):

  • Alphabet: Σ = {a, b} (|Σ| = 2)
  • Strings: R = {"aaa", "bbb", "aba", "bab", "aab", "bba"}
  • Bound K = 2
  • Any common subsequence of length 2 must use symbols from {a, b}. Check w = "aa": not subsequence of "bbb" ✗. w = "bb": not subsequence of "aaa" ✗. w = "ab": "ab" ⊆ "bbb"? No 'a' ✗. w = "ba": "ba" ⊆ "aaa"? No 'b' ✗.
  • Even K = 1: w = "a" ⊆ "bbb"? ✗. w = "b" ⊆ "aaa"? ✗.
  • No common subsequence of length ≥ 1 exists (strings "aaa" and "bbb" share no common symbols in common subsequence terms -- wait, both are over {a,b}, but "aaa" contains no 'b' and "bbb" contains no 'a'. So LCS("aaa", "bbb") = empty string.
  • Answer: NO

Instance 4 (YES, non-trivial with 8 strings):

  • Alphabet: Σ = {0, 1} (|Σ| = 2)
  • Strings: R = {"01100110", "10011001", "01010101", "10101010", "00110011", "11001100", "01001011", "10110100"}
  • Bound K = 3
  • Candidate LCS: w = "010"
    • "010" ⊆ "01100110"? 0(1),1(2),0(4) ✓
    • "010" ⊆ "10011001"? 0(2),1(4),0(5) ✓
    • "010" ⊆ "01010101"? 0(1),1(2),0(3) ✓
    • "010" ⊆ "10101010"? 0(2),1(3),0(4) ✓
    • "010" ⊆ "00110011"? 0(1),1(3),0(5) ✓
    • "010" ⊆ "11001100"? 0(3),1(5),0(7) ✓
    • "010" ⊆ "01001011"? 0(1),1(5),0(-- need 0 after pos5): "01001011" = 0,1,0,0,1,0,1,1. 0(1),1(2),0(3) ✓
    • "010" ⊆ "10110100"? 0(3),1(-- need 1 after 3): "10110100" = 1,0,1,1,0,1,0,0. 0(2),1(3),0(5) ✓
  • Answer: YES

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions