fix: reduce memory usage in diff by using linear-space LCS algorithm #1010
Merged
Replace the O(N*M) space LCS dynamic programming from aryann/difflib with Hirschberg's linear-space algorithm. For large manifests such as Kyverno CRDs with 10,000+ differing lines, peak memory drops from ~786 MB to ~14 MB (57x reduction) while producing identical diff output.
Fixes #996
Signed-off-by: yxxhero <aiopsclub@163.com>
Thanks so much 🙂
@jim-barber-he please try
@yxxhero Sorry. I didn't get around to it before you released it.
Problem
Fixes #996
The aryann/difflib library uses an O(N×M) dynamic programming matrix (longestCommonSubsequenceMatrix) to compute diffs. For large manifests such as Kyverno CRDs with 10,000+ differing lines, this matrix alone consumes ~786 MB. With multiple resources and GC overhead, peak memory reaches 3.6 GB as reported in the issue, causing OOM kills on agents with 2 GB memory limits.
Solution
Replace the full-matrix LCS algorithm with Hirschberg's linear-space LCS algorithm, which produces identical diff output but requires only O(N+M) space instead of O(N×M).
The algorithm uses divide-and-conquer: it splits seq1 in half, finds the optimal split point in seq2 using forward and backward LCS score rows (each computed in O(len(seq2)) space), then recurses on each half.
Changes
- diff/lcs.go (new): diffLines() implementing Hirschberg's algorithm, returning []difflib.DiffRecord for full compatibility.
- diff/diff.go: diffStrings() now calls diffLines() instead of difflib.Diff().
- diff/lcs_test.go (new): Tests verifying output parity for standard cases, semantic validity across 1,000 random inputs, and large input handling.
Measured Improvement
Compatibility
All existing tests pass without modification. The new implementation produces identical output to difflib.Diff for all existing test cases. For inputs with multiple valid LCS paths (e.g., highly repetitive content), the output is semantically equivalent (same LCS length, same set of added/removed lines) but may differ in tie-breaking, which does not affect the usefulness of the visual diff.
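One way to picture "semantically equivalent" is to replay an edit script and confirm that it reproduces both inputs. The sketch below is only an illustration of that idea, not the test code in diff/lcs_test.go. The op and record types and the check function are hypothetical stand-ins defined here, not the difflib types.

```go
package main

import "fmt"

// op and record are hypothetical stand-ins for a diff record type.
type op int

const (
	common op = iota
	added
	removed
)

type record struct {
	line string
	kind op
}

// equal reports whether two line slices have the same contents in order.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// check replays script and reports whether it rebuilds both oldLines and
// newLines, along with the number of common lines it contains.
func check(script []record, oldLines, newLines []string) (bool, int) {
	var gotOld, gotNew []string
	commonCount := 0
	for _, r := range script {
		switch r.kind {
		case common:
			gotOld = append(gotOld, r.line)
			gotNew = append(gotNew, r.line)
			commonCount++
		case removed:
			gotOld = append(gotOld, r.line)
		case added:
			gotNew = append(gotNew, r.line)
		}
	}
	return equal(gotOld, oldLines) && equal(gotNew, newLines), commonCount
}

func main() {
	oldLines := []string{"a", "a"}
	newLines := []string{"a"}

	// Two tie-breaking choices for the same inputs: keep the first "a"
	// or keep the second one.
	keepFirst := []record{{"a", common}, {"a", removed}}
	keepSecond := []record{{"a", removed}, {"a", common}}

	fmt.Println(check(keepFirst, oldLines, newLines))  // prints true 1
	fmt.Println(check(keepSecond, oldLines, newLines)) // prints true 1
}
```

Both scripts are valid and contain one common line, so they differ only in which of the repeated lines is reported as removed.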