Add first-class support for Blosc2 CTable (.b2z) tables #288

Open
FrancescAlted wants to merge 23 commits into main from new-table

@FrancescAlted

Member

Overview

Adds first-class support for Blosc2 heterogeneous tables (blosc2.CTable, compact single-file .b2z) alongside the existing NDArray/.b2nd support — discover, download, inspect, preview, and slice tables through the REST API, Python client, CLI, and web UI. Design rationale lives in plans/ctable-support.md.

Key changes

REST / server

  • /api/fetch serves .b2z (whole and sliced) as a Blosc2 cframe via CTable.to_cframe(), mirroring the array workflow. Whole tables are excluded from the raw-file FileResponse short-circuit (a whole .b2z is a zip, not a cframe, so returning it raw broke client decode).
  • read_metadata() → new CTableMetadata model (nrows/ncols/columns/schema_dict/…); open_b2() returns CTable early; .b2z treated as a native suffix.
  • Upload/download paths switched to the shared BLOSC2_NATIVE_SUFFIXES constant (incl. .b2z) so tables are stored/served as-is.

Python client (behavior change — see caveats)

  • Reworked leaf-class hierarchy: File → Dataset → {Array, Table}. blosc2.Operand now lives on Array only; Array/Table are exported. root["x.b2nd"] now returns an Array (was Dataset).
  • New Table API: nrows, ncols, columns, schema, slice/[...] (→ blosc2.CTable), rows(), head().
  • _fetch_data dispatches decode on known kind instead of trial-and-except sniffing.

CLI

  • cat2-client info/show support .b2z (table-shaped info; show table.b2z[start:stop] prints rows off the cframe).
  • Also includes a cleaner error message for invalid cat2-client commands.

Web UI

  • CTable preview reuses the existing structured-array visualizer (info_view.html): Display tab + a htmx_path_view branch rendering rows/columns; filter/sort hidden for tables (filterable flag). Fixes a pre-existing Meta-tab crash on non-array metadata.

Tests / docs

  • New caterva2/tests/test_ctable.py (25 tests): metadata, /api/info, /api/download round-trip, whole+slice fetch, client Table, CLI, web preview, and nested/non-identifier column-name regressions.
  • Design + task plans under plans/.

⚠️ Caveats for reviewers / merge

  • Requires a newer blosc2 than the current >=4.6.0 pin: the code uses CTable.to_cframe()/blosc2.ctable_from_cframe() and nested-column dotted-path access, which land in 4.7.x. Bump the blosc2 requirement in pyproject.toml before/with merge; otherwise an environment that satisfies the current pin installs cleanly but fails at runtime.
  • Behavior change: isinstance(x, cat2.Dataset) now matches tables too (tables are Datasets), and array datasets are Array instances / <Array: …> reprs instead of <Dataset: …>. Dataset stays importable as the shared base.
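
To make the isinstance caveat concrete, here is a minimal sketch using stand-in classes. These are hypothetical stubs that only mirror the File → Dataset → {Array, Table} shape described above; they are not the real caterva2 client classes.

```python
# Stand-in classes only: they mimic the hierarchy described in this PR,
# not the actual caterva2 client implementation.
class File:
    def __init__(self, path):
        self.path = path

    def __repr__(self):
        return f"<{type(self).__name__}: {self.path}>"


class Dataset(File):
    """Shared base for leaves that hold sliceable data."""


class Array(Dataset):
    """Leaf for NDArray-like datasets (e.g. .b2nd)."""


class Table(Dataset):
    """Leaf for CTable datasets (e.g. .b2z)."""


arr = Array("x.b2nd")
tbl = Table("t.b2z")

# Both leaves are Datasets, so this check no longer tells them apart.
both_are_datasets = isinstance(arr, Dataset) and isinstance(tbl, Dataset)
# Code that needs arrays specifically should test for Array instead.
only_arr_is_array = isinstance(arr, Array) and not isinstance(tbl, Array)
```

Downstream code that used isinstance(x, Dataset) to mean "this is an array" would need to switch to the narrower Array check.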

…z storage handling

   Whole-table /api/fetch was returning the raw .b2z zip instead of a
   cframe, so table[:] failed client-side. Also introduces Array/Table as
   proper Dataset subclasses (client.py), dispatches cframe decoding by
   known kind instead of trial/except, adds Table.nrows/columns/head/rows,
   and treats .b2z as a native Blosc2 suffix for upload/load_from_url/htmx
   paths so tables round-trip byte-identical. Adds regression tests.

htmx_path_info/htmx_path_view: render a paged row/column preview for
CTable using schema_dict(), with Filter/Sort-by hidden (filterable
flag) since they don't apply to tables; also fixes a pre-existing
crash in the Meta tab template for CTableMetadata (no cparams).

cli.py: `info` prints table-shaped fields instead of crashing on
cparams.get(None); `show` parses the optional row-slice syntax
(table.b2z[start:stop]) and prints rows via the Table client class
instead of calling the array-oriented fetch().
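
As an illustration of the optional row-slice syntax, a parser along these lines could split table.b2z[start:stop] into its parts. The function name and error messages are hypothetical; the actual parsing in cli.py may differ.

```python
def parse_table_spec(spec):
    """Split 'path[start:stop]' into (path, start, stop).

    Missing bounds come back as None. Hypothetical helper, not the code
    from cli.py.
    """
    if not spec.endswith("]"):
        # No slice suffix: the whole argument is the path.
        return spec, None, None
    path, sep, inner = spec[:-1].rpartition("[")
    if not sep or not path:
        raise ValueError(f"malformed slice in {spec!r}")
    parts = inner.split(":")
    if len(parts) != 2:
        raise ValueError(f"expected [start:stop] in {spec!r}")
    # An empty bound such as '[:5]' maps to None, like a Python slice.
    start, stop = (int(p) if p.strip() else None for p in parts)
    return path, start, stop
```

Returning None for omitted bounds keeps "not given" distinct from an explicit 0, so the decision about defaults is left to whoever resolves the slice.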

Adds regression tests for both surfaces, plus tests for nested/
non-identifier CTable column names (e.g. "trip.sec" struct leaves),
now resolved natively by blosc2's CTableRow.__getitem__.

Follow-up fixes from review of the CTable support work:

- client: bound Table.rows() default to [0:50) instead of the whole
  table, so table.rows() no longer silently fetches every row of a
  large table (pass stop=self.nrows for all rows).
- server: fix /api/fetch CTable slice resolution to use `is None`
  instead of truthiness, so table[0:0] returns an empty result rather
  than the whole table; also normalize negative indices and clamp
  start/stop to [0, nrows].
- cli: coerce numpy scalars, bytes, and arrays in `show --json` via a
  json default, matching the web preview's cell handling.
- server: return a clean htmx error (not an uncaught AssertionError)
  when a filter/sort is requested on a dataset type that does not
  support it (e.g. a .b2z).
- server: drop a stray comment token in the CTable fetch branch.
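
The slice-resolution fix can be sketched as follows. This is a simplified stand-alone version under the assumption that start/stop arrive as optional integers; the function name is hypothetical and the server code is not reproduced here.

```python
def resolve_row_slice(start, stop, nrows):
    """Resolve optional start/stop into concrete bounds within [0, nrows]."""
    # Use `is None`, not truthiness: `stop or nrows` would turn an
    # explicit stop=0 into nrows and return the whole table.
    if start is None:
        start = 0
    if stop is None:
        stop = nrows
    # Normalize negative indices relative to the end of the table.
    if start < 0:
        start += nrows
    if stop < 0:
        stop += nrows
    # Clamp both bounds to the valid range.
    start = max(0, min(start, nrows))
    stop = max(0, min(stop, nrows))
    return start, stop
```

With this, table[0:0] resolves to (0, 0), an empty result, while omitted bounds still select every row.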

Copilot AI left a comment


Pull request overview

Adds end-to-end support for Blosc2 CTable single-file tables (.b2z) across Caterva2’s server APIs, Python client, CLI, and web UI, aligning table handling with existing NDArray workflows (notably via /api/fetch returning cframes for both whole tables and slices).

Changes:

  • Server: recognize .b2z as a native Blosc2 suffix; add CTableMetadata; extend /api/fetch to stream table cframes; add web preview rendering for tables and guard filter/sort UI.
  • Client/CLI: introduce Array/Table first-class client types; decode fetch responses by known kind; add CLI info/show behavior for .b2z.
  • Tests/docs: add comprehensive CTable tests plus design/task plan documents.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| plans/ctable-support.md | Full design record + implementation notes for CTable support. |
| plans/ctable-support-tasks.md | Task checklist and acceptance criteria for the implementation. |
| plans/ctable-support-orig-gpt5.5.md | Archived original plan draft for reference. |
| caterva2/tests/test_notebook_bootstrap.py | Updates bootstrap-cell injection expectations (now two cells). |
| caterva2/tests/test_ctable.py | Adds CTable coverage: metadata, fetch/download, client, CLI, and web preview. |
| caterva2/services/templates/info_view.html | Hides filter/sort controls when filterable=False (tables). |
| caterva2/services/templates/includes/info_metadata.html | Adds a CTable metadata branch to prevent Meta-tab crashes. |
| caterva2/services/srv_utils.py | Centralizes Blosc2 suffix constants and adds CTable metadata extraction. |
| caterva2/services/server.py | Implements CTable-aware open/metadata/fetch/upload/web-preview behavior. |
| caterva2/models.py | Adds CTableMetadata Pydantic model. |
| caterva2/clients/cli.py | Adds .b2z table display logic for info and show (incl. JSON coercion). |
| caterva2/client.py | Refactors client hierarchy and adds Table API + kind-based fetch decode. |
| caterva2/__init__.py | Exports Array and Table symbols. |

Comment thread caterva2/services/server.py
Comment thread caterva2/services/server.py
Comment thread caterva2/client.py Outdated
FrancescAlted and others added 2 commits July 1, 2026 14:02
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
A .b2z may hold a TreeStore (a hierarchy of leaves), not just a CTable.
Address leaves by path (tree.b2z/level1/ctable) without unfolding to disk:
list descends, info/fetch open the leaf, the web tree expands into leaf
rows, and the client dispatches by server-reported kind.
A .b2z may hold a TreeStore (a hierarchy of NDArray/CTable leaves), not
just a single CTable. Address leaves by virtual path
(tree.b2z/level1/ctable) without unfolding to disk:

- split_container_path()/treestore_leaves() split a request path at the
  .b2z boundary and enumerate leaves.
- API: list descends, info/fetch open the leaf; leaves inherit the
  container mtime.
- Web: the tree expands a container into leaf rows; info/view tabs work
  on leaves.
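
A rough sketch of splitting a virtual path at the container boundary is shown below. The helper name, the suffix tuple, and the return shape are assumptions for illustration; the signature of the real split_container_path() is not shown in this PR description.

```python
# Assumption: only .b2z marks a container at this point in the series.
CONTAINER_SUFFIXES = (".b2z",)


def split_at_container(path):
    """Return (container_path, inner_key), with inner_key None if absent."""
    parts = path.split("/")
    for i, part in enumerate(parts):
        if part.endswith(CONTAINER_SUFFIXES):
            container = "/".join(parts[: i + 1])
            inner = "/".join(parts[i + 1:])
            # An empty inner key means the request targets the container itself.
            return container, inner or None
    # No container component: the path names a regular file or directory.
    return path, None
```

For example, tree.b2z/level1/ctable splits into the on-disk file tree.b2z and the inner key level1/ctable, which can then be looked up inside the store without unfolding it to disk.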

Unify group-like things behind models.Directory (kind="dir", mtime,
size, nfiles): info returns it for a real directory, a TreeStore
container (root group), and a virtual group inside one. Group size is
summed cheaply from the .b2z zip index (no per-leaf open).
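
Summing a group's size from the zip index can be done with the standard library alone, as in this sketch. It assumes leaves appear as zip members under a path prefix and uses each member's recorded file_size; how a real .b2z lays out its members, and which size field the PR uses, are assumptions here.

```python
import io
import zipfile


def group_size(zip_source, prefix=""):
    """Sum member sizes under `prefix` using only the zip central directory."""
    prefix = prefix.strip("/")
    with zipfile.ZipFile(zip_source) as zf:
        return sum(
            info.file_size
            for info in zf.infolist()
            # An empty prefix means the root group, i.e. every member.
            if not prefix or info.filename.startswith(prefix + "/")
        )


# Build a small in-memory zip standing in for a container file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("level1/a.b2nd", b"x" * 100)
    zf.writestr("level1/b.b2nd", b"y" * 50)
    zf.writestr("other.b2nd", b"z" * 7)

level1_size = group_size(buf, "level1")
root_size = group_size(buf)
```

Reading infolist() only touches the index at the end of the archive, so no member payload is opened or decompressed.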

Client: new Group class (browsable/indexable); Root.__getitem__
dispatches on server-reported kind (dir->Group, ctable->Table,
shape->Array, else File), reusing already-fetched metadata to avoid a
double info round-trip.
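
The kind-based dispatch might look roughly like this. The classes are bare stubs and the metadata is a plain dict; in particular, treating "ctable" as a kind value and "shape" as a key whose presence marks an array is one reading of the mapping above, not something the description confirms.

```python
class Leaf:
    """Minimal stub that stores the metadata it was built from."""

    def __init__(self, path, meta):
        self.path = path
        self.meta = meta  # reused, so no second info request is needed


class File(Leaf): ...
class Group(Leaf): ...
class Table(Leaf): ...
class Array(Leaf): ...


def leaf_for(path, meta):
    """Pick a client class from already-fetched metadata."""
    kind = meta.get("kind")
    if kind == "dir":
        cls = Group
    elif kind == "ctable":
        cls = Table
    elif "shape" in meta:
        cls = Array
    else:
        cls = File
    return cls(path, meta)
```

Passing the fetched metadata into the constructor is what avoids the double round-trip: the object starts out with the info it would otherwise request again.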

A .b2z TreeStore now shows as a single mountable row in the datasets
list instead of auto-expanding into one row per leaf (which flooded the
list). Clicking the plug icon "mounts" it as a virtual root alongside
@personal/@shared/@public, with its own checkbox and an unmount control;
checking it lists that container's leaves. Mount state lives client-side
in localStorage (key caterva2:mounted), bridged to the server via an
htmx:configRequest listener that adds `mounted=` params to the root-list
request. No new endpoints, DB, or per-user server state.

- server.py: htmx_root_list accepts `mounted` and filters it through
  get_rootdir_or_none; htmx_path_list renders TreeStores as single
  mountable rows and expands mounted containers into leaf rows.
- templates: root_list.html renders mounted roots, path_list.html adds
  the plug button, home.html holds the mount/unmount JS.
- Rename Directory.kind "dir" -> "group" (models/client/cli).

Review fixes:
- Avoid stored XSS: read paths from data-* attributes at click time
  instead of interpolating into inline handler JS source.
- Don't 500 the listing on a corrupt/non-TreeStore/stale .b2z (untrusted
  localStorage input); skip it in both the walk and virtual-root loops.
- Dedup roots in mountRoot so a repeat click can't double-list leaves.
- stat() the container once per mount instead of once per leaf.
- Update test_treestore.py for single-row behavior; add coverage for
  virtual-root leaf expansion and bogus-container safety.

Extend the .b2z TreeStore virtual-descent/mount feature to plain HDF5
files: a srv_utils.open_container() adapter (_TreeStoreAdapter /
_HDF5Adapter) unifies list/info/fetch across both formats, backed by a
file-less HDF5Proxy.open_leaf() (in-memory, no .b2nd written to disk).
Client Group gains unfold/copy/move/remove/download, since a plain .h5
now dispatches to Group instead of File.

Also fix the mounted-root unmount (x) icon: a long root name (typical
for .h5 files) grew the row past the sidebar's fixed column, pushing
the icon under the neighboring higher-z-index panel and eating the
click. The icon now sits in a fixed, absolutely-positioned slot in the
row's own gutter, so it stays clickable and its checkbox lines up with
the regular-root rows above it.

htmx_path_list reused the container file's stat().st_size for every
leaf inside a mounted .b2z/.h5, so all datasets showed the same size.
Add a cheap leaf_size() to both container adapters (schunk cbytes for
TreeStore, h5py storage size for HDF5, no full proxy needed) and use
it for per-leaf rows instead.
…r 500

   - get_filtered_array: accept inner_key param, open container member instead
     of blosc2.open() on the whole file
   - htmx_path_view: replace blanket "no filter/sort on container members" 400
     with HDF5-only guard; .b2z members now flow through get_filtered_array
   - Fix pre-existing 500 on 0-d container members: arr[()] returns unhashable
     ndarray, broken by `value in header_sort` in template; convert to scalar
   - Tests: structured & 0-d leaves in _make_tree fixture, sort asc/desc tests,
     0-d view test, i4-no-fields 400 test
   - get_filtered_array: HDF5Proxy branch using .indices()/.sort() (materialized,
     cache-safe). Filter still blocked (needs LazyExpr plumbing on proxy).
   - htmx_path_view: narrow HDF5 guard to filter-only; sort passes through.
     Set filterable=False for HDF5 members (hide filter box in UI).
   - hdf5.py: blosc2.asarray(self.dset) instead of self.dset[:] so ingestion
     streams chunk-by-chunk from HDF5 for >16 MB datasets — no intermediate
     full numpy array.
   - Tests: structured HDF5 leaf in fixture, sort asc/desc, filter 400, sort
     on plain-dtype 400, filterable=False assertion, 0-d scalar view fix.
1. Filter-only crash on .b2z members — root cause is a blosc2 bug: the where-fastpath re-opens the operand's urlpath, which for a TreeStore leaf is the whole .b2z. Worked around in get_filtered_array by detaching filtered members with an in-memory arr.copy() (cache-bounded, same materialization trade-off the filter path already makes). New tests cover filter-only and filter+sort on members.
2. /api/fetch silently dropping filter on members — fetch_data now routes filter requests through get_filtered_array(..., inner_key=inner_key); HDF5-member filters get a clean 400 (raised from a 2-line guard in get_filtered_array), and ValueErrors map to 400 instead of 500. Tested for both .b2z (filtered rows come back) and .h5 (400).
3. Corrupt-member 500s — added except (RuntimeError, OSError) to the htmx except chain.
4. open_container None-check divergence — new srv_utils.open_container_member() helper replaces all three copies of the open→get→validate pattern (htmx view, fetch, filtered path). The bogus-.b2z-member case now yields "Cannot open container member" instead of the nonsensical "Invalid filter" message (regression test added).
5. Double dataset ingest — HDF5Proxy now materializes once via a memoized _as_blosc2(); argsort (with indices kept as an alias) and sort share the single conversion.
6. Tiny-chunk inheritance cliff — _as_blosc2() ignores degenerate HDF5 chunks (< 1 MiB) and lets blosc2 pick its own chunking.
7. Redundant HDF5Proxy branch in the server — deleted; the argsort alias lets HDF5 members flow through the generic NDArray path.
8. 0-d comment misattribution — reworded to name blosc2.NDArray[()] as the 0-d source.
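
The single-conversion pattern from item 5 can be sketched with a stand-in class. Here a plain list copy plays the role of the expensive conversion, and the counter exists only to make the memoization visible; none of this is the actual HDF5Proxy code.

```python
class ProxySketch:
    """Stand-in for a proxy that converts its source once and reuses it."""

    def __init__(self, data):
        self._data = data
        self._converted = None
        self.conversions = 0  # only here to make the caching observable

    def _as_converted(self):
        # Hypothetical analogue of a memoized _as_blosc2(): convert on
        # first use, then hand back the cached object.
        if self._converted is None:
            self.conversions += 1
            self._converted = list(self._data)
        return self._converted

    def argsort(self):
        values = self._as_converted()
        return sorted(range(len(values)), key=values.__getitem__)

    indices = argsort  # old name kept as an alias

    def sort(self):
        return sorted(self._as_converted())


proxy = ProxySketch([3, 1, 2])
order = proxy.argsort()
same_order = proxy.indices()
ordered = proxy.sort()
```

All three calls share one conversion, which is the point of routing argsort, its alias, and sort through the same memoized helper.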

Two bonus fixes along the way: the "unsupported dataset type" asserts became ValueErrors (they were uncaught 500s from /api/fetch and vanish under python -O), and running the suite exposed 4 latent test bugs from the earlier header-sort session (assertions matching row-label/y cells, and raise_for_status() on an intentional 400) — those tests were curl-verified back then because port 8000 was occupied; they're now fixed and passing under pytest.
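
The assert-to-ValueError change can be illustrated with a small, framework-free sketch. The checker, the supported types, and the (status, message) tuple are all hypothetical; the real handlers live in the server and use its own error types.

```python
def ensure_supported(obj):
    """Reject unsupported inputs with an exception that survives python -O."""
    # An `assert isinstance(...)` here would be stripped under -O and, when
    # active, would surface as an uncaught AssertionError.
    if not isinstance(obj, (list, tuple)):
        raise ValueError(f"unsupported dataset type: {type(obj).__name__}")
    return obj


def handle(obj):
    """Map validation failures to a client error instead of a server error."""
    try:
        ensure_supported(obj)
    except ValueError as exc:
        return 400, str(exc)
    return 200, "ok"
```

Raising ValueError gives the request handler something specific to catch, so bad input becomes a 400 response rather than a 500.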

Datasets panel: clicking a row highlights it as the keyboard cursor
(separate from the teal "loaded" indicator); Up/Down move the cursor
and focus its link so Enter loads it, starting from whichever dataset
is already active if no cursor has been set yet.

Display tab: clicking a data row highlights it; Up/Down move the
highlight within the loaded window and page in the adjacent window
at the edges, continuing the highlight into it.

Reuses Bootstrap's border/table-active utilities, no new CSS beyond
suppressing the default focus outline on dataset links in favor of
the row border.
3 participants