Optimize query for user CSV export #5969
rtibblesbot
left a comment
Good performance improvement — the CTE decomposition cuts the export query from 104s to ~390ms and the EXPLAIN ANALYZE comparison in the description communicates the change clearly. One blocking issue with the join direction.
CI: Python tests still in progress at review time; all other checks passing.
Findings:
- blocking: join direction silently drops files with `contentnode_id=NULL` (see inline)
- suggestion: add a test covering files with no contentnode
- suggestion: PR targets `hotfixes` but issue #5954 specifies "Target branch: unstable". If this qualifies as an urgent hotfix given the 3-hour production query, it should be noted in the description.
- praise: clean CTE structure (see inline)
- praise: field-level assertions in the metadata test (see inline)
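To make the blocking finding concrete, here is a minimal sketch of how join direction can drop rows whose foreign key is NULL. It uses an in-memory SQLite database rather than the project's ORM, and the table and column names (`file`, `contentnode`, `uploaded_by`) are simplified assumptions, not the actual schema or the query in this PR.

```python
import sqlite3

# Hypothetical, simplified schema; not the real application tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contentnode (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE file (
        id INTEGER PRIMARY KEY,
        uploaded_by INTEGER,
        contentnode_id TEXT,
        file_size INTEGER
    );
    INSERT INTO contentnode VALUES ('n1', 'Intro video');
    -- File 2 has no content node attached.
    INSERT INTO file VALUES (1, 7, 'n1', 100), (2, 7, NULL, 200);
    """
)

USER_FILES = "WITH user_files AS (SELECT * FROM file WHERE uploaded_by = 7) "

# An inner join only keeps files that match a content node.
inner_rows = conn.execute(
    USER_FILES
    + "SELECT uf.id, cn.title FROM user_files uf "
    "JOIN contentnode cn ON cn.id = uf.contentnode_id ORDER BY uf.id"
).fetchall()

# A left join driven from the files keeps every file, with NULL metadata.
left_rows = conn.execute(
    USER_FILES
    + "SELECT uf.id, cn.title FROM user_files uf "
    "LEFT JOIN contentnode cn ON cn.id = uf.contentnode_id ORDER BY uf.id"
).fetchall()

print(inner_rows)  # [(1, 'Intro video')]
print(left_rows)   # [(1, 'Intro video'), (2, None)]
```

A regression test along these lines, with one file that has no content node, would cover the second suggestion as well.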
@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly
How was this generated?
Reviewed the pull request diff checking for:
- Correctness: bugs, edge cases, undocumented behavior, resource leaks, hardcoded values
- Design: unnecessary complexity, naming, readability, comment accuracy, redundant state
- Architecture: duplicated concerns, minimal interfaces, composition over inheritance
- Testing: behavior-based assertions, mocks only at hard boundaries, accurate coverage
- Completeness: missing dependencies, unupdated usages, i18n, accessibility, security
- Principles: DRY (same reason to change), SRP, Rule of Three (no premature abstraction)
- Checked CI status and linked issue acceptance criteria
- For UI changes: inspected screenshots for layout, visual completeness, and consistency
Q(trash_tree__tree_id=OuterRef("contentnode__tree_id"))
# Build CTEs so we first reduce to this user's files, then resolve only
# needed content node and channel fields.
user_files_cte = With(
praise: The three-CTE decomposition (user_files_cte → content_nodes_cte → channel_names_cte) is well-structured — each CTE reduces cardinality before the next step, which is exactly why the planner avoids the 676-iteration correlated channel subquery that was dominating the old plan. Easy to follow and a good pattern for future similar optimisations.
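For readers who want to see the shape of that decomposition outside the ORM, the following is a rough sketch in plain SQL run through SQLite. The CTE names mirror the ones mentioned above, but the schema, the column names, and the `export_rows` helper are illustrative assumptions; the real query in the PR is built with the ORM and runs on PostgreSQL, so planner behaviour will differ.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE channel (tree_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contentnode (id TEXT PRIMARY KEY, tree_id INTEGER, title TEXT);
    CREATE TABLE file (
        id INTEGER PRIMARY KEY,
        uploaded_by INTEGER,
        contentnode_id TEXT,
        file_size INTEGER
    );
    INSERT INTO channel VALUES (10, 'Math'), (20, 'Science');
    INSERT INTO contentnode VALUES ('n1', 10, 'Fractions'), ('n2', 20, 'Cells');
    INSERT INTO file VALUES (1, 7, 'n1', 100), (2, 8, 'n2', 200), (3, 7, 'n1', 50);
    """
)

EXPORT_SQL = """
WITH user_files AS (
    -- Step 1: reduce to one user's files.
    SELECT id, contentnode_id, file_size FROM file WHERE uploaded_by = ?
),
content_nodes AS (
    -- Step 2: resolve only the nodes those files point at.
    SELECT id, tree_id, title FROM contentnode
    WHERE id IN (SELECT contentnode_id FROM user_files)
),
channel_names AS (
    -- Step 3: resolve only the channels those nodes belong to.
    SELECT tree_id, name FROM channel
    WHERE tree_id IN (SELECT tree_id FROM content_nodes)
)
SELECT uf.id, uf.file_size, cn.title, ch.name
FROM user_files uf
LEFT JOIN content_nodes cn ON cn.id = uf.contentnode_id
LEFT JOIN channel_names ch ON ch.tree_id = cn.tree_id
ORDER BY uf.id
"""


def export_rows(user_id):
    """Return (file id, size, node title, channel name) rows for one user."""
    return conn.execute(EXPORT_SQL, (user_id,)).fetchall()


print(export_rows(7))  # [(1, 100, 'Fractions', 'Math'), (3, 50, 'Fractions', 'Math')]
```

Each step only looks at rows selected by the previous one, so the channel lookup is done once per relevant tree rather than once per file.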
self.assertIn(_format_size(videos[index - 1].file_size), row)
self.assertEqual(index, len(videos))

def test_user_csv_export_reports_channel_and_content_metadata(self):
praise: Asserting specific field values (row["Channel"], row["Author"], row["Language"], etc.) rather than just presence makes this a meaningful regression guard — it will catch aliasing mistakes in the CTE column names, not just "the CSV was produced".
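As a small illustration of why value-level assertions matter here, the sketch below compares a presence-only check with a value check on a CSV whose columns have been swapped. The column headers come from the test under review; the sample data and the helper names (`parse_export`, `has_all_columns`, `has_expected_values`) are made up for this example.

```python
import csv
import io


def parse_export(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical export output: one correct, one with Channel/Author swapped,
# as could happen if CTE column aliases were mixed up.
good = "Channel,Author,Language\r\nMath,Ada,en\r\n"
swapped = "Channel,Author,Language\r\nAda,Math,en\r\n"


def has_all_columns(row):
    """Presence-only check: every column holds some non-empty value."""
    return all(row[key] for key in ("Channel", "Author", "Language"))


def has_expected_values(row):
    """Value check: every column holds the value we expect."""
    return (row["Channel"], row["Author"], row["Language"]) == ("Math", "Ada", "en")


good_row = parse_export(good)[0]
swapped_row = parse_export(swapped)[0]

print(has_all_columns(swapped_row))      # True: presence check misses the bug
print(has_expected_values(swapped_row))  # False: value check catches it
```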
Summary
`hotfixes` for early deployment with next patch
Before
After
References
closes #5954
Reviewer guidance
…
AI usage
Directed AI to tackle this issue by first writing the regression tests and capturing the original raw SQL. Then allowed the AI to optimize it by instructing how to break the overall query down. Then manually optimized the CTE query further after its completion. Ran `EXPLAIN ANALYZE` on production.