Optimize query for user CSV export #5969

Open
bjester wants to merge 3 commits into learningequality:hotfixes from bjester:user-csv-perf

@bjester commented Jun 11, 2026

Member

Summary

  • Targeting hotfixes for early deployment with next patch
  • Adds regression test to ensure functionality before and after changes
  • Optimizes export queries by using CTEs, avoiding large joins on big tables, and aligning filtering with indices
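
As a rough illustration of the third point, the sketch below runs a scaled-down query of the same shape against an in-memory SQLite database: narrow to one user's files first, then resolve only the nodes those files reference. The `file` and `node` tables, their columns, and the sample rows are hypothetical stand-ins, not the Studio schema. SQLite's planner also differs from PostgreSQL's, so this shows the query structure only, not the performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified tables (not the real Studio schema).
    CREATE TABLE node (id TEXT PRIMARY KEY, tree_id INTEGER, title TEXT);
    CREATE TABLE file (
        id INTEGER PRIMARY KEY,
        contentnode_id TEXT,
        uploaded_by_id INTEGER,
        original_filename TEXT
    );
    CREATE INDEX file_uploaded_by ON file (uploaded_by_id);
    INSERT INTO node VALUES ('n1', 1, 'Intro'), ('n2', 2, 'Other');
    INSERT INTO file VALUES
        (1, 'n1', 2512, 'a.mp4'),
        (2, 'n1', 2512, 'b.mp4'),
        (3, 'n2', 7, 'c.mp4');
    """
)

rows = conn.execute(
    """
    WITH user_files AS (
        -- Step 1: filter on the indexed column to get one user's files.
        SELECT contentnode_id, original_filename
        FROM file
        WHERE uploaded_by_id = ?
    ),
    content_nodes AS (
        -- Step 2: touch only the nodes referenced by those files.
        SELECT DISTINCT node.id, node.tree_id, node.title
        FROM node
        JOIN user_files ON node.id = user_files.contentnode_id
    )
    SELECT user_files.original_filename, content_nodes.title
    FROM user_files
    LEFT JOIN content_nodes ON user_files.contentnode_id = content_nodes.id
    ORDER BY user_files.original_filename
    """,
    (2512,),
).fetchall()

print(rows)
```

The final `SELECT` starts from `user_files`, so every file the user uploaded produces a row regardless of what the later CTEs return.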

Before

SELECT "contentcuration_file"."original_filename",
       "contentcuration_file"."file_size",
       "contentcuration_file"."checksum",
       "contentcuration_file"."file_format_id",
       "contentcuration_language"."readable_name",
       "contentcuration_contentnode"."title",
       T6."readable_name",
       "contentcuration_license"."license_name",
       "contentcuration_contentnode"."kind_id",
       "contentcuration_contentnode"."description",
       "contentcuration_contentnode"."author",
       "contentcuration_contentnode"."provider",
       "contentcuration_contentnode"."aggregator",
       "contentcuration_contentnode"."license_description",
       "contentcuration_contentnode"."copyright_holder",
       (
         SELECT U0."name"
         FROM "contentcuration_channel" U0
         LEFT OUTER JOIN "contentcuration_contentnode" U1 ON (U0."main_tree_id" = U1."id")
         LEFT OUTER JOIN "contentcuration_contentnode" U2 ON (U0."trash_tree_id" = U2."id")
         WHERE (
           U1."tree_id" = "contentcuration_contentnode"."tree_id"
           OR U2."tree_id" = "contentcuration_contentnode"."tree_id"
         )
         LIMIT 1
       ) AS "channel_name"
FROM "contentcuration_file"
LEFT OUTER JOIN "contentcuration_contentnode"
  ON ("contentcuration_file"."contentnode_id" = "contentcuration_contentnode"."id")
LEFT OUTER JOIN "contentcuration_language"
  ON ("contentcuration_file"."language_id" = "contentcuration_language"."id")
LEFT OUTER JOIN "contentcuration_language" T6
  ON ("contentcuration_contentnode"."language_id" = T6."id")
LEFT OUTER JOIN "contentcuration_license"
  ON ("contentcuration_contentnode"."license_id" = "contentcuration_license"."id")
WHERE "contentcuration_file"."uploaded_by_id" = 2512
;
Nested Loop Left Join  (cost=1647.96..149498236.05 rows=6723 width=1240) (actual time=265.645..104656.723 rows=676 loops=1)
  ->  Gather  (cost=1647.67..52382.00 rows=6723 width=611) (actual time=1.663..3.158 rows=676 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Hash Left Join  (cost=647.67..50709.70 rows=2801 width=611) (actual time=0.756..3.910 rows=225 loops=3)
              Hash Cond: (contentcuration_contentnode.license_id = contentcuration_license.id)
              ->  Hash Left Join  (cost=634.74..50689.24 rows=2801 width=497) (actual time=0.689..3.789 rows=225 loops=3)
                    Hash Cond: ((contentcuration_contentnode.language_id)::text = (t6.id)::text)
                    ->  Nested Loop Left Join  (cost=169.23..50216.38 rows=2801 width=282) (actual time=0.190..3.236 rows=225 loops=3)
                          ->  Parallel Bitmap Heap Scan on contentcuration_file  (cost=168.67..26240.80 rows=2801 width=99) (actual time=0.164..0.312 rows=225 loops=3)
                                Recheck Cond: (uploaded_by_id = 2512)
                                Heap Blocks: exact=1
                                ->  Bitmap Index Scan on contentcuration_file_4095e96b  (cost=0.00..166.99 rows=6723 width=0) (actual time=0.384..0.385 rows=676 loops=1)
                                      Index Cond: (uploaded_by_id = 2512)
                          ->  Index Scan using contentcuration_contentnode_pkey on contentcuration_contentnode  (cost=0.56..8.56 rows=1 width=249) (actual time=0.012..0.012 rows=1 loops=676)
                                Index Cond: ((id)::text = (contentcuration_file.contentnode_id)::text)
                    ->  Hash  (cost=403.56..403.56 rows=4956 width=264) (actual time=0.414..0.414 rows=289 loops=3)
                          Buckets: 8192  Batches: 1  Memory Usage: 78kB
                          ->  Seq Scan on contentcuration_language t6  (cost=0.00..403.56 rows=4956 width=264) (actual time=0.012..0.359 rows=289 loops=3)
              ->  Hash  (cost=11.30..11.30 rows=130 width=122) (actual time=0.029..0.030 rows=9 loops=3)
                    Buckets: 1024  Batches: 1  Memory Usage: 9kB
                    ->  Seq Scan on contentcuration_license  (cost=0.00..11.30 rows=130 width=122) (actual time=0.015..0.027 rows=9 loops=3)
  ->  Memoize  (cost=0.29..7.32 rows=1 width=264) (actual time=0.002..0.002 rows=0 loops=676)
        Cache Key: contentcuration_file.language_id
        Cache Mode: logical
        Hits: 672  Misses: 4  Evictions: 0  Overflows: 0  Memory Usage: 1kB
        ->  Index Scan using contentcuration_language_pkey on contentcuration_language  (cost=0.28..7.31 rows=1 width=264) (actual time=0.010..0.010 rows=1 loops=4)
              Index Cond: ((id)::text = (contentcuration_file.language_id)::text)
  SubPlan 1
    ->  Limit  (cost=1001.12..22228.92 rows=1 width=20) (actual time=154.795..154.800 rows=1 loops=676)
          ->  Nested Loop Left Join  (cost=1001.12..276962.46 rows=13 width=20) (actual time=154.542..154.547 rows=1 loops=676)
                Filter: ((u1.tree_id = contentcuration_contentnode.tree_id) OR (u2.tree_id = contentcuration_contentnode.tree_id))
                Rows Removed by Filter: 14337
                ->  Gather  (cost=1000.56..86995.63 rows=22263 width=57) (actual time=2.569..7.361 rows=14338 loops=676)
                      Workers Planned: 2
                      Workers Launched: 2
                      ->  Nested Loop Left Join  (cost=0.56..83769.33 rows=9276 width=57) (actual time=0.044..53.143 rows=5096 loops=2028)
                            ->  Parallel Seq Scan on contentcuration_channel u0  (cost=0.00..4757.76 rows=9276 width=86) (actual time=0.009..2.214 rows=5096 loops=2028)
                            ->  Index Scan using contentcuration_contentnode_pkey on contentcuration_contentnode u1  (cost=0.56..8.52 rows=1 width=37) (actual time=0.010..0.010 rows=1 loops=10334261)
                                  Index Cond: ((id)::text = (u0.main_tree_id)::text)
                ->  Index Scan using contentcuration_contentnode_pkey on contentcuration_contentnode u2  (cost=0.56..8.52 rows=1 width=37) (actual time=0.010..0.010 rows=1 loops=9692472)
                      Index Cond: ((id)::text = (u0.trash_tree_id)::text)
Planning Time: 7.173 ms
Execution Time: 104658.708 ms
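
A quick check of where the time goes, assuming (as PostgreSQL's EXPLAIN documentation describes) that `actual time` on a node is a per-loop average and has to be multiplied by `loops`. The figures are copied from the SubPlan 1 `Limit` node and the footer of the plan above; the variable names are only for illustration.

```python
# Values copied from the "Before" plan.
subplan_ms_per_loop = 154.800   # upper bound of actual time on the SubPlan 1 Limit node
subplan_loops = 676             # one execution per file row
execution_ms = 104658.708       # "Execution Time" footer

# Per-loop average times the loop count gives the node's total time.
subplan_total_ms = subplan_ms_per_loop * subplan_loops
share = subplan_total_ms / execution_ms

print(f"SubPlan 1 total: {subplan_total_ms:.1f} ms")
print(f"Share of execution time: {share:.2%}")
```

Under that reading, the correlated channel-name subquery accounts for essentially the whole 104 s run, while the file, node, language, and license joins take only a few milliseconds.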

After

WITH RECURSIVE "user_files" AS (
  SELECT
    "contentcuration_file"."id",
    "contentcuration_file"."contentnode_id",
    "contentcuration_file"."original_filename",
    "contentcuration_file"."file_size",
    "contentcuration_file"."checksum",
    "contentcuration_file"."file_format_id" AS "file_extension",
    "contentcuration_language"."readable_name" AS "file_language"
  FROM "contentcuration_file"
  LEFT OUTER JOIN "contentcuration_language"
    ON ("contentcuration_file"."language_id" = "contentcuration_language"."id")
  WHERE "contentcuration_file"."uploaded_by_id" = 2512
),
"content_nodes" AS (
  SELECT DISTINCT
    "contentcuration_contentnode"."id",
    "contentcuration_contentnode"."tree_id",
    "contentcuration_contentnode"."title" AS "node_title",
    "contentcuration_contentnode"."kind_id" AS "node_kind_id",
    "contentcuration_contentnode"."description" AS "node_description",
    "contentcuration_contentnode"."author" AS "node_author",
    "contentcuration_language"."readable_name" AS "node_language",
    "contentcuration_license"."license_name" AS "node_license_name",
    "contentcuration_contentnode"."license_description" AS "node_license_description",
    "contentcuration_contentnode"."copyright_holder" AS "node_copyright_holder",
    "contentcuration_contentnode"."lft"
  FROM "contentcuration_contentnode"
  INNER JOIN "user_files"
    ON "contentcuration_contentnode"."id" = "user_files"."contentnode_id"
  LEFT OUTER JOIN "contentcuration_language"
    ON ("contentcuration_contentnode"."language_id" = "contentcuration_language"."id")
  LEFT OUTER JOIN "contentcuration_license"
    ON ("contentcuration_contentnode"."license_id" = "contentcuration_license"."id")
  ORDER BY "contentcuration_contentnode"."tree_id" ASC, "contentcuration_contentnode"."lft" ASC
),
"channel_names" AS (
  (SELECT
     "contentcuration_contentnode"."tree_id" AS "tree_id",
     "contentcuration_channel"."name" AS "channel_name"
   FROM "contentcuration_channel"
   LEFT OUTER JOIN "contentcuration_contentnode"
     ON ("contentcuration_channel"."main_tree_id" = "contentcuration_contentnode"."id")
   WHERE EXISTS (
     SELECT (1) AS "a"
     FROM "content_nodes" U0
     WHERE U0."tree_id" = "contentcuration_contentnode"."tree_id"
     LIMIT 1
   ))
  UNION
  (SELECT
     "contentcuration_contentnode"."tree_id" AS "tree_id",
     "contentcuration_channel"."name" AS "channel_name"
   FROM "contentcuration_channel"
   LEFT OUTER JOIN "contentcuration_contentnode"
     ON ("contentcuration_channel"."trash_tree_id" = "contentcuration_contentnode"."id")
   WHERE EXISTS (
     SELECT (1) AS "a"
     FROM "content_nodes" U0
     WHERE U0."tree_id" = "contentcuration_contentnode"."tree_id"
     LIMIT 1
   ))
)
SELECT
  "user_files"."original_filename",
  "user_files"."file_size",
  "user_files"."checksum",
  "user_files"."file_extension" AS "file_extension",
  "user_files"."file_language" AS "file_language",
  (
    SELECT U0."channel_name" AS "channel_name"
    FROM "channel_names" U0
    WHERE U0."tree_id" = "content_nodes"."tree_id"
    LIMIT 1
  ) AS "channel_name",
  "content_nodes"."node_title" AS "node_title",
  "content_nodes"."node_kind_id" AS "node_kind_id",
  "content_nodes"."node_description" AS "node_description",
  "content_nodes"."node_author" AS "node_author",
  "content_nodes"."node_language" AS "node_language",
  "content_nodes"."node_license_name" AS "node_license_name",
  "content_nodes"."node_license_description" AS "node_license_description",
  "content_nodes"."node_copyright_holder" AS "node_copyright_holder"
FROM "user_files"
LEFT OUTER JOIN "content_nodes"
  ON "user_files"."contentnode_id" = "content_nodes"."id";
Merge Left Join  (cost=68848.98..11664991211.38 rows=225994 width=4326) (actual time=45.839..389.574 rows=676 loops=1)
  Merge Cond: ((user_files.contentnode_id)::text = (content_nodes.id)::text)
  CTE user_files
    ->  Nested Loop Left Join  (cost=0.86..8143.55 rows=6723 width=347) (actual time=0.046..0.700 rows=676 loops=1)
          ->  Index Scan using contentcuration_file_4095e96b on contentcuration_file  (cost=0.57..7634.81 rows=6723 width=132) (actual time=0.031..0.377 rows=676 loops=1)
                Index Cond: (uploaded_by_id = 2512)
          ->  Memoize  (cost=0.29..3.63 rows=1 width=264) (actual time=0.000..0.000 rows=0 loops=676)
                Cache Key: contentcuration_file.language_id
                Cache Mode: logical
                Hits: 672  Misses: 4  Evictions: 0  Overflows: 0  Memory Usage: 1kB
                ->  Index Scan using contentcuration_language_pkey on contentcuration_language  (cost=0.28..3.62 rows=1 width=264) (actual time=0.008..0.008 rows=1 loops=4)
                      Index Cond: ((id)::text = (contentcuration_file.language_id)::text)
  CTE content_nodes
    ->  Unique  (cost=59380.00..59581.69 rows=6723 width=579) (actual time=9.071..9.319 rows=513 loops=1)
          ->  Sort  (cost=59380.00..59396.81 rows=6723 width=579) (actual time=9.070..9.104 rows=668 loops=1)
                Sort Key: contentcuration_contentnode.tree_id, contentcuration_contentnode.lft, contentcuration_contentnode.id, contentcuration_contentnode.title, contentcuration_contentnode.kind_id, contentcuration_contentnode.description, contentcuration_contentnode.author, contentcuration_language_1.readable_name, contentcuration_license.license_name, contentcuration_contentnode.license_description, contentcuration_contentnode.copyright_holder
                Sort Method: quicksort  Memory: 125kB
                ->  Nested Loop Left Join  (cost=1.13..58952.59 rows=6723 width=579) (actual time=0.065..8.233 rows=668 loops=1)
                      ->  Nested Loop Left Join  (cost=0.85..58718.56 rows=6723 width=465) (actual time=0.050..7.963 rows=668 loops=1)
                            ->  Nested Loop  (cost=0.56..57680.99 rows=6723 width=250) (actual time=0.036..7.680 rows=668 loops=1)
                                  ->  CTE Scan on user_files user_files_1  (cost=0.00..134.46 rows=6723 width=82) (actual time=0.001..0.108 rows=676 loops=1)
                                  ->  Index Scan using contentcuration_contentnode_pkey on contentcuration_contentnode  (cost=0.56..8.56 rows=1 width=250) (actual time=0.011..0.011 rows=1 loops=676)
                                        Index Cond: ((id)::text = (user_files_1.contentnode_id)::text)
                            ->  Memoize  (cost=0.29..7.08 rows=1 width=264) (actual time=0.000..0.000 rows=0 loops=668)
                                  Cache Key: contentcuration_contentnode.language_id
                                  Cache Mode: logical
                                  Hits: 661  Misses: 7  Evictions: 0  Overflows: 0  Memory Usage: 1kB
                                  ->  Index Scan using contentcuration_language_pkey on contentcuration_language contentcuration_language_1  (cost=0.28..7.07 rows=1 width=264) (actual time=0.005..0.006 rows=1 loops=7)
                                        Index Cond: ((id)::text = (contentcuration_contentnode.language_id)::text)
                      ->  Memoize  (cost=0.28..6.70 rows=1 width=122) (actual time=0.000..0.000 rows=1 loops=668)
                            Cache Key: contentcuration_contentnode.license_id
                            Cache Mode: logical
                            Hits: 659  Misses: 9  Evictions: 0  Overflows: 0  Memory Usage: 2kB
                            ->  Index Scan using contentcuration_license_pkey on contentcuration_license  (cost=0.27..6.69 rows=1 width=122) (actual time=0.003..0.003 rows=1 loops=9)
                                  Index Cond: (id = contentcuration_contentnode.license_id)
  ->  Sort  (cost=561.87..578.68 rows=6723 width=1434) (actual time=1.937..2.175 rows=676 loops=1)
        Sort Key: user_files.contentnode_id
        Sort Method: quicksort  Memory: 106kB
        ->  CTE Scan on user_files  (cost=0.00..134.46 rows=6723 width=1434) (actual time=0.048..1.096 rows=676 loops=1)
  ->  Sort  (cost=561.87..578.68 rows=6723 width=2642) (actual time=10.121..10.337 rows=668 loops=1)
        Sort Key: content_nodes.id
        Sort Method: quicksort  Memory: 103kB
        ->  CTE Scan on content_nodes  (cost=0.00..134.46 rows=6723 width=2642) (actual time=9.075..9.477 rows=513 loops=1)
  SubPlan 3
    ->  Limit  (cost=51616.04..51616.06 rows=1 width=418) (actual time=0.556..0.556 rows=1 loops=676)
          ->  Subquery Scan on u0  (cost=51616.04..51616.32 rows=14 width=418) (actual time=0.555..0.555 rows=1 loops=676)
                ->  HashAggregate  (cost=51616.04..51616.18 rows=14 width=422) (actual time=0.555..0.555 rows=1 loops=676)
                      Group Key: contentcuration_contentnode_1.tree_id, contentcuration_channel.name
                      Batches: 1  Memory Usage: 24kB
                      ->  Append  (cost=5166.48..51615.97 rows=14 width=422) (actual time=0.240..0.554 rows=1 loops=676)
                            ->  Nested Loop Semi Join  (cost=5166.48..25807.95 rows=7 width=24) (actual time=0.187..0.308 rows=1 loops=676)
                                  ->  Hash Join  (cost=5166.48..25656.43 rows=7 width=24) (actual time=0.177..0.298 rows=1 loops=676)
                                        Hash Cond: ((contentcuration_contentnode_1.id)::text = (contentcuration_channel.main_tree_id)::text)
                                        ->  Index Scan using contentcuration_contentnode_656442a0 on contentcuration_contentnode contentcuration_contentnode_1  (cost=0.56..20472.98 rows=4655 width=37) (actual time=0.006..0.198 rows=395 loops=676)
                                              Index Cond: (tree_id = content_nodes.tree_id)
                                        ->  Hash  (cost=4887.63..4887.63 rows=22263 width=53) (actual time=15.487..15.488 rows=22269 loops=1)
                                              Buckets: 32768  Batches: 1  Memory Usage: 2111kB
                                              ->  Seq Scan on contentcuration_channel  (cost=0.00..4887.63 rows=22263 width=53) (actual time=0.013..10.393 rows=22269 loops=1)
                                  ->  CTE Scan on content_nodes u0_1  (cost=0.00..151.27 rows=34 width=4) (actual time=0.014..0.014 rows=1 loops=459)
                                        Filter: (tree_id = content_nodes.tree_id)
                                        Rows Removed by Filter: 182
                            ->  Nested Loop Semi Join  (cost=5166.48..25807.95 rows=7 width=24) (actual time=0.230..0.246 rows=0 loops=676)
                                  ->  Hash Join  (cost=5166.48..25656.43 rows=7 width=24) (actual time=0.225..0.241 rows=0 loops=676)
                                        Hash Cond: ((contentcuration_contentnode_2.id)::text = (contentcuration_channel_1.trash_tree_id)::text)
                                        ->  Index Scan using contentcuration_contentnode_656442a0 on contentcuration_contentnode contentcuration_contentnode_2  (cost=0.56..20472.98 rows=4655 width=37) (actual time=0.004..0.146 rows=395 loops=676)
                                              Index Cond: (tree_id = content_nodes.tree_id)
                                        ->  Hash  (cost=4887.63..4887.63 rows=22263 width=53) (actual time=14.815..14.816 rows=22269 loops=1)
                                              Buckets: 32768  Batches: 1  Memory Usage: 2111kB
                                              ->  Seq Scan on contentcuration_channel contentcuration_channel_1  (cost=0.00..4887.63 rows=22263 width=53) (actual time=0.013..10.075 rows=22269 loops=1)
                                  ->  CTE Scan on content_nodes u0_2  (cost=0.00..151.27 rows=34 width=4) (actual time=0.033..0.033 rows=1 loops=90)
                                        Filter: (tree_id = content_nodes.tree_id)
                                        Rows Removed by Filter: 342
Planning Time: 7.400 ms
Execution Time: 390.302 ms
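
For comparison, a small sketch of the same arithmetic on the new plan, again assuming per-loop averages are multiplied by `loops`. The numbers come from the two `Execution Time` footers and the SubPlan 3 `Limit` node; the variable names are illustrative.

```python
# Execution Time footers from the two plans in this description.
before_ms = 104658.708
after_ms = 390.302
speedup = before_ms / after_ms

# The channel lookup still runs once per file row (SubPlan 3, loops=676),
# but each run is far cheaper than in the old plan.
subplan_after_ms = 0.556 * 676

print(f"Speedup: {speedup:.0f}x")
print(f"SubPlan 3 total: {subplan_after_ms:.0f} ms of {after_ms:.0f} ms")
```

By this estimate the channel lookup is still the largest single contributor to the remaining time, so it is the place to look if further tuning is ever needed.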

References

closes #5954

Reviewer guidance

AI usage

I directed the AI to tackle this issue by first writing the regression tests and capturing the original raw SQL. I then let the AI optimize the query, instructing it on how to break the overall query down. After it finished, I optimized the CTE query further by hand. I ran EXPLAIN ANALYZE on production.

@bjester changed the title from "Additional coverage for user CSV export" to "Optimize query for user CSV export" Jun 11, 2026
@bjester marked this pull request as ready for review June 11, 2026 22:03
@bjester requested a review from rtibblesbot June 11, 2026 22:04

@rtibblesbot left a comment

Contributor

Good performance improvement — the CTE decomposition cuts the export query from 104s to ~390ms and the EXPLAIN ANALYZE comparison in the description communicates the change clearly. One blocking issue with the join direction.

CI: Python tests still in progress at review time; all other checks passing.

Findings:

  • blocking: join direction silently drops files with contentnode_id=NULL — see inline
  • suggestion: add a test covering files with no contentnode
  • suggestion: PR targets hotfixes but issue #5954 specifies "Target branch: unstable". If this qualifies as an urgent hotfix given the 3-hour production query, it should be noted in the description.
  • praise: clean CTE structure — see inline
  • praise: field-level assertions in the metadata test — see inline
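
The inline comment for the blocking finding is not reproduced in this thread, so the following is only a generic sketch of how join direction can drop rows, using hypothetical `file` and `node` tables in an in-memory SQLite database. Driving the query from the node side with an inner join loses any file whose `contentnode_id` is NULL, while driving from the file side with a left join keeps it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified tables (not the real Studio schema).
    CREATE TABLE node (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE file (
        id INTEGER PRIMARY KEY,
        contentnode_id TEXT,
        original_filename TEXT
    );
    INSERT INTO node VALUES ('n1', 'Intro');
    INSERT INTO file VALUES (1, 'n1', 'a.mp4'), (2, NULL, 'orphan.mp4');
    """
)

# Node-driven inner join: NULL never equals anything, so the orphan file vanishes.
inner_rows = conn.execute(
    """
    SELECT file.original_filename, node.title
    FROM node
    JOIN file ON node.id = file.contentnode_id
    ORDER BY file.id
    """
).fetchall()

# File-driven left join: every file is kept, with NULL node columns if unmatched.
left_rows = conn.execute(
    """
    SELECT file.original_filename, node.title
    FROM file
    LEFT JOIN node ON node.id = file.contentnode_id
    ORDER BY file.id
    """
).fetchall()

print(inner_rows)
print(left_rows)
```

A regression test along the lines of the second finding could create one file with no content node and assert that it still appears in the export.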

@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly

How was this generated?

Reviewed the pull request diff checking for:

  • Correctness: bugs, edge cases, undocumented behavior, resource leaks, hardcoded values
  • Design: unnecessary complexity, naming, readability, comment accuracy, redundant state
  • Architecture: duplicated concerns, minimal interfaces, composition over inheritance
  • Testing: behavior-based assertions, mocks only at hard boundaries, accurate coverage
  • Completeness: missing dependencies, unupdated usages, i18n, accessibility, security
  • Principles: DRY (same reason to change), SRP, Rule of Three (no premature abstraction)
  • Checked CI status and linked issue acceptance criteria
  • For UI changes: inspected screenshots for layout, visual completeness, and consistency

Comment thread contentcuration/contentcuration/utils/csv_writer.py
| Q(trash_tree__tree_id=OuterRef("contentnode__tree_id"))
# Build CTEs so we first reduce to this user's files, then resolve only
# needed content node and channel fields.
user_files_cte = With(

Contributor

praise: The three-CTE decomposition (user_files_cte → content_nodes_cte → channel_names_cte) is well-structured. Each CTE reduces cardinality before the next step, so the channel lookup, which still runs once per file row (SubPlan 3, loops=676), costs about 0.556 ms per iteration instead of the roughly 155 ms per iteration that dominated the old plan. Easy to follow and a good pattern for future similar optimisations.

self.assertIn(_format_size(videos[index - 1].file_size), row)
self.assertEqual(index, len(videos))

def test_user_csv_export_reports_channel_and_content_metadata(self):

Contributor

praise: Asserting specific field values (row["Channel"], row["Author"], row["Language"], etc.) rather than just presence makes this a meaningful regression guard — it will catch aliasing mistakes in the CTE column names, not just "the CSV was produced".
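
To make the distinction concrete, here is a minimal sketch that is not taken from the PR's test suite. It parses two hypothetical CSV exports, one correct and one where the `Author` and `Language` values were swapped by an aliasing mistake, and compares a presence-only check with a value check. The column names come from the comment above; the sample values and helper names are invented for illustration.

```python
import csv
import io

EXPECTED = {"Channel": "Math Channel", "Author": "Ada", "Language": "English"}

# Hypothetical export output; the second has Author and Language swapped.
good_csv = "Channel,Author,Language\nMath Channel,Ada,English\n"
bad_csv = "Channel,Author,Language\nMath Channel,English,Ada\n"


def first_row(text):
    """Parse CSV text and return its first data row as a dict."""
    return next(csv.DictReader(io.StringIO(text)))


def has_columns(row):
    """Presence-only check: passes whenever the headers exist."""
    return all(column in row for column in EXPECTED)


def has_values(row):
    """Value check: passes only when each column holds the expected value."""
    return all(row[column] == value for column, value in EXPECTED.items())


good_row = first_row(good_csv)
bad_row = first_row(bad_csv)

print(has_columns(bad_row), has_values(bad_row))
```

The swapped export passes the presence check but fails the value check, which is the class of mistake field-level assertions are meant to catch.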
