GH-48701: [C++][Parquet] Add ALPpd encoding#48345
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
1b78a5c to
d563ce0
Compare
There was a problem hiding this comment.
I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations
In this case I would recommend https://github.com/apache/parquet-testing
| DELTA_BYTE_ARRAY = 7, | ||
| RLE_DICTIONARY = 8, | ||
| BYTE_STREAM_SPLIT = 9, | ||
| ALP = 10, |
There was a problem hiding this comment.
https://github.com/apache/arrow/blob/main/cpp/src/parquet/parquet.thrift#L631 needs to be updated here and in parqut-format.
There was a problem hiding this comment.
For parquet-format we have this PR : apache/parquet-format#557
|
Thanks @prtkgaur -- it is super exciting to see this movement. Unfortunately, I am not familiar with the C/C++ codebase to give this a realistic review. I started the CI checks on this PR and had some comments about the testing. |
| std::string tarball_path = std::string(__FILE__); | ||
| tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\")); | ||
| tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\")); | ||
| tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz"; |
There was a problem hiding this comment.
@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100
|
|
||
| // Unsafe resize without initialization - use only when you will immediately | ||
| // overwrite the memory (e.g., before memcpy). Only safe for POD types. | ||
| void UnsafeResize(size_t n) { |
There was a problem hiding this comment.
Using this over resize gave us around 2-3% performance improvement
Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>
Also ensure that no line exceeds 90 characters
| ASSERT_OK(AlpCodec<TypeParam>::template Decode<TypeParam>( | ||
| static_cast<int32_t>(kBatchSize), comp2.data(), comp_size2, output2.data())); | ||
|
|
||
| EXPECT_EQ(std::memcmp(output1.data(), batch1.data(), kBatchSize * sizeof(TypeParam)), 0); |
There was a problem hiding this comment.
same general comment on testing for equality.
| comp_size, output.data())); | ||
|
|
||
| // Verify successful decode | ||
| EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(double)), 0); |
There was a problem hiding this comment.
where does truncated/corruption happen, how does this successfully decode?
There was a problem hiding this comment.
Good catch. Rewrote to truncate
|
|
||
| using namespace arrow::util::alp; | ||
|
|
||
| static void printHex(const std::string& name, const uint8_t* data, size_t len) { |
There was a problem hiding this comment.
seems like this file and the next file aren't compiled? can they be removed for now?
| add_parquet_benchmark(encoding_alp_benchmark) | ||
|
|
||
| add_executable(generate-alp-parquet | ||
| ${PROJECT_SOURCE_DIR}/src/arrow/util/alp/generate_alp_parquet.cc) |
There was a problem hiding this comment.
its a little strange to have this cross directory target. I think I commented that this wasn't compiled above, could we move the here maybe if we need it?
emkornfield
left a comment
There was a problem hiding this comment.
OK, I think I've done a complete read through now of all the code. I think the one other thing we should figure out is how to fall back to plain encoding at some point if ALP is completely failing on the dataset (i.e. it is consistently adding pages that take more space the PLAIN encoding). I think this can be done in a follow-up.
CreateSamplingPreset, EncodeWithPreset, Encode, and GetMaxCompressedSize now return Status / Result instead of aborting on invalid input. Callers in parquet/encoder.cc and the test suites updated; three EXPECT_DEATH tests converted to ASSERT_RAISES(Invalid, ...). Also in this commit: - Strip Snowflake attribution comments from encoding_alp_benchmark.cc - Pause timing around SetData in BM_AlpDecodingDouble so only Decode is measured - Slice data to output size in AlpSamplerTest.PresetGenerationDecimalData (the IsBitwiseEqual migration left this site with a 1024-vs-10000 size mismatch)
Replace ~250 lines of hand-rolled CSV parsing in encoding_alp_benchmark.cc with arrow::csv::TableReader, addressing reviewer feedback. Two helpers handle the three file shapes used by the benchmark: by-name lookup for header-bearing files, and by-index lookup (with optional header skip) for positional access used by the pipe-delimited and headerless datasets. Gate add_parquet_benchmark(encoding_alp_benchmark) on ARROW_CSV in src/parquet/CMakeLists.txt and emit a status message when skipped.
The reviewer suggested initializing `kStoredSize` from member sizes (`sizeof(exponent_) + sizeof(factor_) + sizeof(num_exceptions_)`) rather than from type names. The earlier attempt didn't compile because static data member initializers are evaluated in declaration order — the private members were declared below the public `kStoredSize`, so the names weren't visible yet. Moved the private data members to the top of each class so the public `kStoredSize` initializer can use `sizeof(member_)` directly. Dropped the out-of-class drift-guard `static_assert`, redundant once the member names are in the initializer.
Co-authored-by: dhirhan17@gmail.com
@Reviewer : Suggested order : Outdated, will update shortly in which to look at the code while reviewing.
Rationale for this change
ALP significantly improves on the compression ratio and decompression speed over of float/double columns over other encoding/compression techniques.
Spec
Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md which can go in the Encodings.md
Parquet Format PR
Dataset PR (parquet-testing)
apache/parquet-testing#100
What changes are included in this PR?
This PR
Introduces ALP (pseudo-decimal) encoding into c++ arrow code.
We also provide benchmarks and dataset to prove the effectiveness of the above algorithm.
Adding above needed us to add following classes.
Integration of the above code was done in
Are these changes tested?
Unit tests
Benchmark tests
Are there any user-facing changes?
DuckDB