Skip to content

GH-48701: [C++][Parquet] Add ALPpd encoding#48345

Open
prtkgaur wants to merge 112 commits into
apache:mainfrom
prtkgaur:gh540-alp-pseudoDecimal-encoding
Open

GH-48701: [C++][Parquet] Add ALPpd encoding#48345
prtkgaur wants to merge 112 commits into
apache:mainfrom
prtkgaur:gh540-alp-pseudoDecimal-encoding

Conversation

@prtkgaur

@prtkgaur prtkgaur commented Dec 5, 2025

Copy link
Copy Markdown

Co-authored-by: dhirhan17@gmail.com

@Reviewer : Suggested order : Outdated, will update shortly in which to look at the code while reviewing.

Rationale for this change

ALP significantly improves on the compression ratio and decompression speed over of float/double columns over other encoding/compression techniques.

Spec

Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md which can go in the Encodings.md

Parquet Format PR

Dataset PR (parquet-testing)

apache/parquet-testing#100

What changes are included in this PR?

This PR
Introduces ALP (pseudo-decimal) encoding into c++ arrow code.
We also provide benchmarks and dataset to prove the effectiveness of the above algorithm.

Adding above needed us to add following classes.

  • Alp h/cc : Houses core logic for encoding and decoding.
  • Sampler h/cc : Houses logic to sample and select parameters for encoding.
  • AlpWrapper h/cc : Binds together Alp and Sampler classes.

Integration of the above code was done in

  • Encoder/Decoder cc which exposes wrapper to encode buffer of data.

Are these changes tested?

  • We have added unit tests to test the code.
  • Also the benchmarks have been added that cover wide variety of floating point values from low precision to high precision.

Unit tests

  • alp_test.cc

Benchmark tests

  • encoding_benchmark.cc and encoding_alp_benchmark.cc

Are there any user-facing changes?

  • It's a new encoding so the only impact is query performance which we claim will only get better.

DuckDB

  • We did look at DuckDB's ALP implementation while we were implementing ALP and would like to give that team the desired credit.

@github-actions

github-actions Bot commented Dec 5, 2025

Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0 Compare December 7, 2025 15:46

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks.
apache/parquet-testing#100

Comment thread cpp/src/parquet/types.h
DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For parquet-format we have this PR : apache/parquet-format#557

@alamb

alamb commented Dec 8, 2025

Copy link
Copy Markdown
Contributor

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

@prtkgaur prtkgaur changed the title [Gh540] Add ALPpd encoding to parquet [Gh539] Add ALPpd encoding to parquet Dec 8, 2025

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks.
apache/parquet-testing#100

std::string tarball_path = std::string(__FILE__);
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100


// Unsafe resize without initialization - use only when you will immediately
// overwrite the memory (e.g., before memcpy). Only safe for POD types.
void UnsafeResize(size_t n) {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this over resize gave us around 2-3% performance improvement

@prtkgaur prtkgaur changed the title [Gh539] Add ALPpd encoding to parquet [Gh539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
@prtkgaur prtkgaur changed the title [Gh539][Encoding] Add ALPpd encoding to parquet [Gh-539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated
ASSERT_OK(AlpCodec<TypeParam>::template Decode<TypeParam>(
static_cast<int32_t>(kBatchSize), comp2.data(), comp_size2, output2.data()));

EXPECT_EQ(std::memcmp(output1.data(), batch1.data(), kBatchSize * sizeof(TypeParam)), 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same general comment on testing for equality.

Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated
comp_size, output.data()));

// Verify successful decode
EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(double)), 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does truncated/corruption happen, how does this successfully decode?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Rewrote to truncate


using namespace arrow::util::alp;

static void printHex(const std::string& name, const uint8_t* data, size_t len) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrintHex.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this file and the next file aren't compiled? can they be removed for now?

Comment thread cpp/src/parquet/CMakeLists.txt Outdated
add_parquet_benchmark(encoding_alp_benchmark)

add_executable(generate-alp-parquet
${PROJECT_SOURCE_DIR}/src/arrow/util/alp/generate_alp_parquet.cc)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a little strange to have this cross directory target. I think I commented that this wasn't compiled above, could we move the here maybe if we need it?

@emkornfield emkornfield left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I've done a complete read through now of all the code. I think the one other thing we should figure out is how to fall back to plain encoding at some point if ALP is completely failing on the dataset (i.e. it is consistently adding pages that take more space the PLAIN encoding). I think this can be done in a follow-up.

@prtkgaur prtkgaur requested a review from pitrou as a code owner June 25, 2026 23:36
CreateSamplingPreset, EncodeWithPreset, Encode, and GetMaxCompressedSize
now return Status / Result instead of aborting on invalid input. Callers
in parquet/encoder.cc and the test suites updated; three EXPECT_DEATH
tests converted to ASSERT_RAISES(Invalid, ...).

Also in this commit:
- Strip Snowflake attribution comments from encoding_alp_benchmark.cc
- Pause timing around SetData in BM_AlpDecodingDouble so only Decode is measured
- Slice data to output size in AlpSamplerTest.PresetGenerationDecimalData
  (the IsBitwiseEqual migration left this site with a 1024-vs-10000 size mismatch)
Replace ~250 lines of hand-rolled CSV parsing in encoding_alp_benchmark.cc
with arrow::csv::TableReader, addressing reviewer feedback. Two helpers
handle the three file shapes used by the benchmark: by-name lookup for
header-bearing files, and by-index lookup (with optional header skip) for
positional access used by the pipe-delimited and headerless datasets.

Gate add_parquet_benchmark(encoding_alp_benchmark) on ARROW_CSV in
src/parquet/CMakeLists.txt and emit a status message when skipped.
The reviewer suggested initializing `kStoredSize` from member sizes
(`sizeof(exponent_) + sizeof(factor_) + sizeof(num_exceptions_)`) rather
than from type names. The earlier attempt didn't compile because static
data member initializers are evaluated in declaration order — the private
members were declared below the public `kStoredSize`, so the names weren't
visible yet.

Moved the private data members to the top of each class so the public
`kStoredSize` initializer can use `sizeof(member_)` directly. Dropped the
out-of-class drift-guard `static_assert`, redundant once the member names
are in the initializer.
@prtkgaur prtkgaur requested a review from emkornfield June 30, 2026 04:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants