Convert Hyrax results to Parquet

After you run inference with Hyrax, the results are saved in the Lance format, a columnar format optimized for ML workloads. This notebook shows how to convert those results to Parquet files on disk.

For more information about working with results in memory, see:

Note: This notebook uses pre-computed inference results from the Getting Started notebook. Update result_directory in the next cell to point to your own results.

[1]:
from pathlib import Path
import lancedb
import pyarrow.parquet as pq

result_directory = Path("./example_results/getting_started_results")

Simple conversion to Parquet

If your dataset fits comfortably in memory, you can convert it to Parquet in a few lines of code.

[2]:
lance_dir = result_directory / "lance_db"
db = lancedb.connect(str(lance_dir))

table = db.open_table("results")

# Convert to Arrow and write as Parquet
arrow_table = table.to_arrow()
pq.write_table(arrow_table, result_directory / "output.parquet")

Larger dataset batching

If your Lance dataset is too large to fit in memory, you can write it to Parquet in batches.

[3]:
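# `table` is the LanceDB table opened in the previous cell
batch_size = 10_000  # rows per batch; lower this if memory is tight
writer = None
try:
    # Stream record batches from the underlying Lance dataset
    for batch in table.to_lance().to_batches(batch_size=batch_size):
        if writer is None:
            # Create the writer lazily, using the schema of the first batch
            writer = pq.ParquetWriter(result_directory / "batched_output.parquet", batch.schema)
        writer.write_batch(batch)
finally:
    # Close the writer even if an error occurs so the Parquet footer is written
    if writer is not None:
        writer.close()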