Add first-class support for Blosc2 CTable (.b2z) tables #288
Open
FrancescAlted wants to merge 23 commits into
…(), Python client Table class
…z storage handling Whole-table /api/fetch was returning the raw .b2z zip instead of a cframe, so table[:] failed client-side. Also introduces Array/Table as proper Dataset subclasses (client.py), dispatches cframe decoding by known kind instead of trial/except, adds Table.nrows/columns/head/rows, and treats .b2z as a native Blosc2 suffix for upload/load_from_url/htmx paths so tables round-trip byte-identical. Adds regression tests.
htmx_path_info/htmx_path_view: render a paged row/column preview for CTable using schema_dict(), with Filter/Sort-by hidden (filterable flag) since they don't apply to tables; also fixes a pre-existing crash in the Meta tab template for CTableMetadata (no cparams). cli.py: `info` prints table-shaped fields instead of crashing on cparams.get(None); `show` parses the optional row-slice syntax (table.b2z[start:stop]) and prints rows via the Table client class instead of calling the array-oriented fetch(). Adds regression tests for both surfaces, plus tests for nested/ non-identifier CTable column names (e.g. "trip.sec" struct leaves), now resolved natively by blosc2's CTableRow.__getitem__.
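The optional row-slice syntax accepted by `show` (`table.b2z[start:stop]`) could be parsed along the following lines. This is only a sketch: `parse_show_target` and its regular expression are hypothetical names made up for illustration, and the real parser in `cli.py` may differ.

```python
from __future__ import annotations

import re

# Hypothetical pattern: a path, optionally followed by "[start:stop]",
# where either bound may be omitted or negative.
_TARGET_RE = re.compile(
    r"^(?P<path>.+?)(?:\[(?P<start>-?\d+)?:(?P<stop>-?\d+)?\])?$"
)


def parse_show_target(target: str) -> tuple[str, int | None, int | None]:
    """Split 'table.b2z[10:20]' into ('table.b2z', 10, 20).

    Missing bounds come back as None so the caller can tell
    "not given" apart from an explicit 0.
    """
    match = _TARGET_RE.match(target)
    if match is None:
        raise ValueError(f"cannot parse target: {target!r}")
    start = match.group("start")
    stop = match.group("stop")
    return (
        match.group("path"),
        int(start) if start is not None else None,
        int(stop) if stop is not None else None,
    )
```

Returning `None` for an omitted bound keeps the later slice resolution free to distinguish `table.b2z[:0]` from `table.b2z`.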
Follow-up fixes from review of the CTable support work:
- client: bound Table.rows() default to [0:50) instead of the whole table, so table.rows() no longer silently fetches every row of a large table (pass stop=self.nrows for all rows).
- server: fix /api/fetch CTable slice resolution to use `is None` instead of truthiness, so table[0:0] returns an empty result rather than the whole table; also normalize negative indices and clamp start/stop to [0, nrows].
- cli: coerce numpy scalars, bytes, and arrays in `show --json` via a json default, matching the web preview's cell handling.
- server: return a clean htmx error (not an uncaught AssertionError) when a filter/sort is requested on a dataset type that does not support it (e.g. a .b2z).
- server: drop a stray comment token in the CTable fetch branch.
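The slice-resolution fix can be illustrated with a small standalone function. This is a sketch under assumptions, not the server code: `resolve_row_slice` is a hypothetical name, and it only shows the three behaviors the commit describes (`is None` checks, negative-index normalization, clamping to `[0, nrows]`).

```python
from __future__ import annotations


def resolve_row_slice(
    start: int | None, stop: int | None, nrows: int
) -> tuple[int, int]:
    """Resolve an optional [start:stop) request against a table of nrows rows."""
    # `is None` rather than truthiness: with `stop or nrows`, an explicit
    # stop=0 would be mistaken for "not given" and select the whole table.
    if start is None:
        start = 0
    if stop is None:
        stop = nrows
    # Normalize negative indices relative to the end of the table.
    if start < 0:
        start += nrows
    if stop < 0:
        stop += nrows
    # Clamp both bounds to [0, nrows].
    start = min(max(start, 0), nrows)
    stop = min(max(stop, 0), nrows)
    # An inverted range is reported as empty instead of negative-length.
    return start, max(stop, start)
```

With this shape, a request for `table[0:0]` resolves to `(0, 0)`, an empty range, which is the case the truthiness check got wrong.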
Pull request overview
Adds end-to-end support for Blosc2 CTable single-file tables (.b2z) across Caterva2’s server APIs, Python client, CLI, and web UI, aligning table handling with existing NDArray workflows (notably via /api/fetch returning cframes for both whole tables and slices).
Changes:
- Server: recognize `.b2z` as a native Blosc2 suffix; add `CTableMetadata`; extend `/api/fetch` to stream table cframes; add web preview rendering for tables and guard filter/sort UI.
- Client/CLI: introduce `Array`/`Table` first-class client types; decode fetch responses by known kind; add CLI `info`/`show` behavior for `.b2z`.
- Tests/docs: add comprehensive CTable tests plus design/task plan documents.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| plans/ctable-support.md | Full design record + implementation notes for CTable support. |
| plans/ctable-support-tasks.md | Task checklist and acceptance criteria for the implementation. |
| plans/ctable-support-orig-gpt5.5.md | Archived original plan draft for reference. |
| caterva2/tests/test_notebook_bootstrap.py | Updates bootstrap-cell injection expectations (now two cells). |
| caterva2/tests/test_ctable.py | Adds CTable coverage: metadata, fetch/download, client, CLI, and web preview. |
| caterva2/services/templates/info_view.html | Hides filter/sort controls when filterable=False (tables). |
| caterva2/services/templates/includes/info_metadata.html | Adds a CTable metadata branch to prevent Meta-tab crashes. |
| caterva2/services/srv_utils.py | Centralizes Blosc2 suffix constants and adds CTable metadata extraction. |
| caterva2/services/server.py | Implements CTable-aware open/metadata/fetch/upload/web-preview behavior. |
| caterva2/models.py | Adds CTableMetadata Pydantic model. |
| caterva2/clients/cli.py | Adds .b2z table display logic for info and show (incl. JSON coercion). |
| caterva2/client.py | Refactors client hierarchy and adds Table API + kind-based fetch decode. |
| `caterva2/__init__.py` | Exports Array and Table symbols. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
A .b2z may hold a TreeStore (a hierarchy of leaves), not just a CTable. Address leaves by path (tree.b2z/level1/ctable) without unfolding to disk: list descends, info/fetch open the leaf, the web tree expands into leaf rows, and the client dispatches by server-reported kind.
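Dispatching on the server-reported kind might look roughly like the sketch below. The classes are bare stand-ins for the real client types, `wrap_entry` is a hypothetical helper, and the assumption that the metadata arrives as a dict with a `"kind"` key (and a `"shape"` key for arrays) is made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class File:
    """Stand-in for the client's generic file wrapper."""

    path: str
    meta: dict


class Group(File):
    """Stand-in for a browsable group (directory or container)."""


class Array(File):
    """Stand-in for an NDArray-backed dataset."""


class Table(File):
    """Stand-in for a CTable-backed dataset."""


def wrap_entry(path: str, meta: dict) -> File:
    """Pick a wrapper class from metadata that was already fetched.

    Reusing `meta` here is what avoids a second info round-trip.
    """
    kind = meta.get("kind")  # assumed field name
    if kind == "dir":
        return Group(path, meta)
    if kind == "ctable":
        return Table(path, meta)
    if "shape" in meta:  # arrays are recognized by having a shape
        return Array(path, meta)
    return File(path, meta)
```

Because the decision is made from metadata alone, no dataset has to be opened or decoded just to choose the wrapper type.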
A .b2z may hold a TreeStore (a hierarchy of NDArray/CTable leaves), not just a single CTable. Address leaves by virtual path (tree.b2z/level1/ctable) without unfolding to disk:
- split_container_path()/treestore_leaves() split a request path at the .b2z boundary and enumerate leaves.
- API: list descends, info/fetch open the leaf; leaves inherit the container mtime.
- Web: the tree expands a container into leaf rows; info/view tabs work on leaves.
Unify group-like things behind models.Directory (kind="dir", mtime, size, nfiles): info returns it for a real directory, a TreeStore container (root group), and a virtual group inside one. Group size is summed cheaply from the .b2z zip index (no per-leaf open).
Client: new Group class (browsable/indexable); Root.__getitem__ dispatches on server-reported kind (dir->Group, ctable->Table, shape->Array, else File), reusing already-fetched metadata to avoid a double info round-trip.
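Splitting a request path at the `.b2z` boundary can be sketched as follows. The function reuses the name `split_container_path` from the commit message, but its signature, return shape, and the `CONTAINER_SUFFIXES` constant are assumptions made for illustration, not the actual implementation.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Assumed set of container suffixes; the commit above only mentions .b2z.
CONTAINER_SUFFIXES = (".b2z",)


def split_container_path(path: str) -> tuple[str, str | None]:
    """Split 'tree.b2z/level1/ctable' into ('tree.b2z', 'level1/ctable').

    Returns (path, None) when no component is a container, or when the
    path names the container itself.
    """
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        if part.endswith(CONTAINER_SUFFIXES):
            container = str(PurePosixPath(*parts[: i + 1]))
            inner = "/".join(parts[i + 1 :]) or None
            return container, inner
    return path, None
```

The first component carrying a container suffix wins, so everything after it is treated as a virtual path inside the container rather than a location on disk.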
A .b2z TreeStore now shows as a single mountable row in the datasets list instead of auto-expanding into one row per leaf (which flooded the list). Clicking the plug icon "mounts" it as a virtual root alongside @personal/@shared/@public, with its own checkbox and an unmount control; checking it lists that container's leaves.
Mount state lives client-side in localStorage (key caterva2:mounted), bridged to the server via an htmx:configRequest listener that adds `mounted=` params to the root-list request. No new endpoints, DB, or per-user server state.
- server.py: htmx_root_list accepts `mounted` and filters it through get_rootdir_or_none; htmx_path_list renders TreeStores as single mountable rows and expands mounted containers into leaf rows.
- templates: root_list.html renders mounted roots, path_list.html adds the plug button, home.html holds the mount/unmount JS.
- Rename Directory.kind "dir" -> "group" (models/client/cli).
Review fixes:
- Avoid stored XSS: read paths from data-* attributes at click time instead of interpolating into inline handler JS source.
- Don't 500 the listing on a corrupt/non-TreeStore/stale .b2z (untrusted localStorage input); skip it in both the walk and virtual-root loops.
- Dedup roots in mountRoot so a repeat click can't double-list leaves.
- stat() the container once per mount instead of once per leaf.
- Update test_treestore.py for single-row behavior; add coverage for virtual-root leaf expansion and bogus-container safety.
Extend the .b2z TreeStore virtual-descent/mount feature to plain HDF5 files: a srv_utils.open_container() adapter (_TreeStoreAdapter / _HDF5Adapter) unifies list/info/fetch across both formats, backed by a file-less HDF5Proxy.open_leaf() (in-memory, no .b2nd written to disk). Client Group gains unfold/copy/move/remove/download, since a plain .h5 now dispatches to Group instead of File. Also fix the mounted-root unmount (x) icon: a long root name (typical for .h5 files) grew the row past the sidebar's fixed column, pushing the icon under the neighboring higher-z-index panel and eating the click. The icon now sits in a fixed, absolutely-positioned slot in the row's own gutter, so it stays clickable and its checkbox lines up with the regular-root rows above it.
htmx_path_list reused the container file's stat().st_size for every leaf inside a mounted .b2z/.h5, so all datasets showed the same size. Add a cheap leaf_size() to both container adapters (schunk cbytes for TreeStore, h5py storage size for HDF5, no full proxy needed) and use it for per-leaf rows instead.
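Per-leaf sizes come from the container adapters as described above. For the group-level total that an earlier commit sums from the `.b2z` zip index, a minimal sketch using only the standard `zipfile` module could look like this. It assumes the container can be read as an ordinary zip archive and that the uncompressed member size is the quantity of interest; `group_size_from_zip_index` is a hypothetical name.

```python
import io
import zipfile


def group_size_from_zip_index(container, group_prefix: str = "") -> int:
    """Sum member sizes under a virtual group using only the zip index.

    `container` is a path or a binary file object. No member is opened
    or decompressed; only the central directory is read.
    """
    prefix = group_prefix.strip("/")
    prefix = f"{prefix}/" if prefix else ""
    with zipfile.ZipFile(container) as zf:
        return sum(
            info.file_size
            for info in zf.infolist()
            if info.filename.startswith(prefix)
        )


# Demo on an in-memory zip standing in for a container file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("level1/a.b2nd", b"x" * 10)
    zf.writestr("level1/sub/b.b2nd", b"y" * 5)
    zf.writestr("top.b2nd", b"z" * 3)

print(group_size_from_zip_index(buf))            # whole container
print(group_size_from_zip_index(buf, "level1"))  # one virtual group
```

Appending `/` to the prefix keeps a group named `level1` from also matching a sibling such as `level10`.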
…r 500
- get_filtered_array: accept inner_key param, open container member instead
of blosc2.open() on the whole file
- htmx_path_view: replace blanket "no filter/sort on container members" 400
with HDF5-only guard; .b2z members now flow through get_filtered_array
- Fix pre-existing 500 on 0-d container members: arr[()] returns unhashable
ndarray, broken by `value in header_sort` in template; convert to scalar
- Tests: structured & 0-d leaves in _make_tree fixture, sort asc/desc tests,
0-d view test, i4-no-fields 400 test
- get_filtered_array: HDF5Proxy branch using .indices()/.sort() (materialized,
cache-safe). Filter still blocked (needs LazyExpr plumbing on proxy).
- htmx_path_view: narrow HDF5 guard to filter-only; sort passes through.
Set filterable=False for HDF5 members (hide filter box in UI).
- hdf5.py: blosc2.asarray(self.dset) instead of self.dset[:] so ingestion
streams chunk-by-chunk from HDF5 for >16 MB datasets — no intermediate
full numpy array.
- Tests: structured HDF5 leaf in fixture, sort asc/desc, filter 400, sort
on plain-dtype 400, filterable=False assertion, 0-d scalar view fix.
1. Filter-only crash on .b2z members. Root cause is a blosc2 bug: the where-fastpath re-opens the operand's urlpath, which for a TreeStore leaf is the whole .b2z. Worked around in get_filtered_array by detaching filtered members with an in-memory arr.copy() (cache-bounded, same materialization trade-off the filter path already makes). New tests cover filter-only and filter+sort on members.
2. /api/fetch silently dropping filter on members: fetch_data now routes filter requests through get_filtered_array(..., inner_key=inner_key); HDF5-member filters get a clean 400 (raised from a 2-line guard in get_filtered_array), and ValueErrors map to 400 instead of 500. Tested for both .b2z (filtered rows come back) and .h5 (400).
3. Corrupt-member 500s: added except (RuntimeError, OSError) to the htmx except chain.
4. open_container None-check divergence: new srv_utils.open_container_member() helper replaces all three copies of the open→get→validate pattern (htmx view, fetch, filtered path). The bogus-.b2z-member case now yields "Cannot open container member" instead of the nonsensical "Invalid filter" message (regression test added).
5. Double dataset ingest: HDF5Proxy now materializes once via a memoized _as_blosc2(); argsort (with indices kept as an alias) and sort share the single conversion.
6. Tiny-chunk inheritance cliff: _as_blosc2() ignores degenerate HDF5 chunks (< 1 MiB) and lets blosc2 pick its own chunking.
7. Redundant HDF5Proxy branch in the server: deleted; the argsort alias lets HDF5 members flow through the generic NDArray path.
8. 0-d comment misattribution: reworded to name blosc2.NDArray[()] as the 0-d source.
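The "materialize once" pattern from item 5 can be sketched with `functools.cached_property`. This is an illustration only: `ProxySketch` is a hypothetical class, a plain list stands in for both the HDF5 dataset and the converted Blosc2 array, and the `loads` counter exists just to make the memoization observable.

```python
from functools import cached_property


class ProxySketch:
    """Toy proxy whose expensive conversion runs at most once."""

    def __init__(self, load):
        self._load = load  # callable producing the raw data
        self.loads = 0     # how many times the conversion actually ran

    @cached_property
    def _materialized(self):
        # Stand-in for the memoized conversion step.
        self.loads += 1
        return list(self._load())

    def sort(self):
        """Return the values in ascending order."""
        return sorted(self._materialized)

    def argsort(self):
        """Return the indices that would sort the values."""
        data = self._materialized
        return sorted(range(len(data)), key=data.__getitem__)

    # Older name kept as an alias, mirroring the commit message.
    indices = argsort
```

Calling `sort()`, `argsort()`, and `indices()` on one instance triggers a single conversion, since all three read the same cached attribute.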
Two bonus fixes along the way: the "unsupported dataset type" asserts became ValueErrors (they were uncaught 500s from /api/fetch and vanish under python -O), and running the suite exposed 4 latent test bugs from the earlier header-sort session (assertions matching row-label/y cells, and raise_for_status() on an intentional 400) — those tests were curl-verified back then because port 8000 was occupied; they're now fixed and passing under pytest.
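The assert-to-`ValueError` change can be illustrated in isolation. In this sketch the names `SUPPORTED_KINDS`, `check_dataset_kind`, and `fetch_status` are hypothetical, and a tuple of status code and message stands in for a real HTTP response.

```python
from __future__ import annotations

# Hypothetical set of dataset kinds the endpoint knows how to serve.
SUPPORTED_KINDS = {"ndarray", "ctable"}


def check_dataset_kind(kind: str) -> None:
    """Validate with an explicit exception instead of `assert`.

    An `assert` here would be stripped when Python runs with -O, and
    an AssertionError that nobody catches surfaces as a 500.
    """
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported dataset type: {kind!r}")


def fetch_status(kind: str) -> tuple[int, str]:
    """Map validation failures to a client error rather than a crash."""
    try:
        check_dataset_kind(kind)
    except ValueError as exc:
        return 400, str(exc)
    return 200, "ok"
```

Raising `ValueError` gives the handler one well-defined exception type to translate into a 400, regardless of interpreter flags.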
Datasets panel: clicking a row highlights it as the keyboard cursor (separate from the teal "loaded" indicator); Up/Down move the cursor and focus its link so Enter loads it, starting from whichever dataset is already active if no cursor has been set yet. Display tab: clicking a data row highlights it; Up/Down move the highlight within the loaded window and page in the adjacent window at the edges, continuing the highlight into it. Reuses Bootstrap's border/table-active utilities, no new CSS beyond suppressing the default focus outline on dataset links in favor of the row border.
Overview
Adds first-class support for Blosc2 heterogeneous tables (`blosc2.CTable`, compact single-file `.b2z`) alongside the existing `NDArray`/`.b2nd` support: discover, download, inspect, preview, and slice tables through the REST API, Python client, CLI, and web UI. Design rationale lives in `plans/ctable-support.md`.
Key changes
REST / server
- `/api/fetch` serves `.b2z` (whole and sliced) as a Blosc2 cframe via `CTable.to_cframe()`, mirroring the array workflow. Whole tables are excluded from the raw-file `FileResponse` short-circuit (a whole `.b2z` is a zip, not a cframe, so returning it raw broke client decode).
- `read_metadata()` → new `CTableMetadata` model (`nrows`/`ncols`/`columns`/`schema_dict`/…); `open_b2()` returns `CTable` early; `.b2z` treated as a native suffix.
- `BLOSC2_NATIVE_SUFFIXES` constant (incl. `.b2z`) so tables are stored/served as-is.
Python client (behavior change; see caveats)
- `File → Dataset → {Array, Table}`. `blosc2.Operand` now lives on `Array` only; `Array`/`Table` are exported.
- `root["x.b2nd"]` now returns an `Array` (was `Dataset`).
- `Table` API: `nrows`, `ncols`, `columns`, `schema`, `slice`/`[...]` (→ `blosc2.CTable`), `rows()`, `head()`.
- `_fetch_data` dispatches decode on known kind instead of trial-and-except sniffing.
CLI
- `cat2-client info`/`show` support `.b2z` (table-shaped `info`; `show table.b2z[start:stop]` prints rows off the cframe).
- … `cat2-client` commands.
Web UI
- … `info_view.html`): Display tab + a `htmx_path_view` branch rendering rows/columns; filter/sort hidden for tables (`filterable` flag). Fixes a pre-existing Meta-tab crash on non-array metadata.
Tests / docs
- `caterva2/tests/test_ctable.py` (25 tests): metadata, `/api/info`, `/api/download` round-trip, whole+slice fetch, client `Table`, CLI, web preview, and nested/non-identifier column-name regressions.
- … `plans/`.
Caveats
- … `blosc2` than the current `>=4.6.0` pin: the code uses `CTable.to_cframe()`/`blosc2.ctable_from_cframe()` and nested-column dotted-path access, which land in 4.7.x. Bump the `blosc2` requirement in `pyproject.toml` before/with merge, or installs will fail at runtime.
- `isinstance(x, cat2.Dataset)` now matches tables too (tables are `Dataset`s), and array datasets are `Array` instances / `<Array: …>` reprs instead of `<Dataset: …>`. `Dataset` stays importable as the shared base.