Skip to content

polars-row-collector

Facade to collect rows one-by-one into a Polars DataFrame (in the least-bad way)

Getting Started Example

import polars as pl
from polars_row_collector import PolarsRowCollector

collector = PolarsRowCollector(
    # Note: Schema is optional, but recommended.
    schema={"col1": pl.Int64, "col2": pl.Float64}
)

for item in items:
    row = {
        "col1": item.value1,
        "col2": item.value2,
    }
    collector.add_row(row)

df = collector.to_df()

You can think of collector as filling the same niche as a list_of_dfs: list[pl.DataFrame].

Features

  • Highly performant and memory-optimized.
    • Much more-so than collecting into a list[dict[str, Any]] or concatenating one-row dataframes.
  • Optionally supply a schema for the incoming rows.
  • Thread-safe (when GIL is enabled - default in Python <= 3.15).
  • Configuration arguments for safety vs. performance tradeoffs:
    • Behaviour if there are missing columns: Enforce all columns present or allow missing columns.
    • Behaviour if there are extra columns: Drop silently or raise.
    • Maintain insertion order.

Example Applications

  • Gathering data in a web scraping/parsing tool.
  • Gathering/batching incoming log messages or event logs before writing in bulk to some destination.
  • Gathering data in a document parsing pipeline (e.g., XML with lots of conditionals).

Future Features

  • Intermediate to-disk storage to temporary parquet files to larger-than-memory collections.
  • Further optimize appending many rows at once.
  • Read the dataframe so-far, in the middle of gathering rows.
  • Documentation.

Disclaimer

As the project's description says, this is the "least-bad way" to accomplish this pattern.

If you can implement your code in such a way that you're not collecting individual rows of a dataframe, you are likely better-off doing it that way (e.g., collecting a list[pl.DataFrame]).

However, there are always exceptions to the best practices, and this library is significantly more efficient (performance and memory) than collecting into a list[dict[str, Any]].