Open
Description
Apache Iceberg version
0.10.0 (latest release)
Please describe the bug 🐞
When write.target-file-size-bytes is set to a small value, appending records larger than this value results in the following exception:
File "/var/lang/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 485, in append
data_files = list(
^^^^^
File "/var/lang/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 2774, in _dataframe_to_data_files
for batches in bin_pack_arrow_table(partition.arrow_table_partition, target_file_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lang/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 2588, in bin_pack_arrow_table
batches = tbl.to_batches(max_chunksize=target_rows_per_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 5099, in pyarrow.lib.Table.to_batches
ValueError: 'max_chunksize' should be strictly positive
The offending lines appear to be here:
iceberg-python/pyiceberg/io/pyarrow.py
Lines 2680 to 2693 in abae20f
    def bin_pack_arrow_table(tbl: pa.Table, target_file_size: int) -> Iterator[list[pa.RecordBatch]]:
        from pyiceberg.utils.bin_packing import PackingIterator

        avg_row_size_bytes = tbl.nbytes / tbl.num_rows
        target_rows_per_file = target_file_size // avg_row_size_bytes
        batches = tbl.to_batches(max_chunksize=target_rows_per_file)
        bin_packed_record_batches = PackingIterator(
            items=batches,
            target_weight=target_file_size,
            lookback=len(batches),  # ignore lookback
            weight_func=lambda x: x.nbytes,
            largest_bin_first=False,
        )
        return bin_packed_record_batches
The line target_rows_per_file = target_file_size // avg_row_size_bytes could probably be changed to target_rows_per_file = max(1, target_file_size // avg_row_size_bytes) so that at least one row is always written per chunk.
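To illustrate the arithmetic behind the suggested change, here is a minimal standalone sketch. The helper name target_rows_per_file and its parameters are hypothetical and not part of pyiceberg; it only mirrors the row-count calculation from the snippet above. It also converts the result to int, which is an assumption on my part, since floor division by a float yields a float.

```python
def target_rows_per_file(table_nbytes: int, num_rows: int, target_file_size: int) -> int:
    """Hypothetical helper mirroring the row-count arithmetic in bin_pack_arrow_table."""
    # True division, so the average row size is a float.
    avg_row_size_bytes = table_nbytes / num_rows
    # Floor division by a float yields a float (0.0 when the average row is
    # larger than the target), so convert to int and clamp to at least one row.
    return max(1, int(target_file_size // avg_row_size_bytes))


# One 1000-byte row with a 100-byte target: the unclamped value would be 0.0.
print(target_rows_per_file(table_nbytes=1000, num_rows=1, target_file_size=100))

# Ten 10-byte rows with a 50-byte target: five rows fit in each chunk.
print(target_rows_per_file(table_nbytes=100, num_rows=10, target_file_size=50))
```

With the clamp in place, the value passed as max_chunksize is always at least 1, which should avoid the ValueError shown in the traceback. Files written this way may still exceed the configured target size when a single row is larger than it.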
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time