![]() |
| Polars — practical examples (Python) |
Polars (Python) — Big Data Essentials (Practical Guide)
Hands-on patterns for loading, cleaning, transforming, and analyzing large tables quickly. Polars is built in Rust, uses lazy query plans (where possible), and is designed to be memory-efficient and parallel — a strong fit when pandas feels slow or RAM-heavy.
What Polars is — plain English
scan_*) so the full table is not loaded until the end.
Polars vs pandas — pros & cons
- Often faster on large tables and multi-step pipelines
- Lazy mode can skip unused columns/rows (pushdown)
- Clear expression API (
pl.col,with_columns) - Good Parquet story (common in analytics)
- Smaller data and you already know pandas well
- Some niche libraries only support pandas
- Team habits / stack — both can coexist in one project
You can convert between them: df.to_pandas() / pl.from_pandas(...) when needed.
Install + import
pl.col) over Python for loops over rows.pip install polars
import polars as pl
# Optional: nicer printing in terminal / notebooks
pl.Config.set_tbl_rows(12)
pl.Config.set_tbl_cols(12)
pl.Config.set_fmt_str_lengths(120)
Use Python 3.8+ (check Polars docs for your version’s minimum).
Read / write — CSV, Parquet, lazy scans
scan_csv / scan_parquet (lazy). Use read_* when data fits comfortably in RAM.Small/medium CSV (eager)
df = pl.read_csv("data.csv")
df.head()
Large CSV (lazy) — plan first, collect() at the end
lf = pl.scan_csv("data.csv") # LazyFrame — not fully loaded yet
result = (
lf
.select(["id", "department", "salary"])
.filter(pl.col("salary") > 50_000)
.group_by("department")
.agg(
pl.len().alias("rows"),
pl.col("salary").mean().alias("avg_salary"),
)
.sort("avg_salary", descending=True)
.collect()
)
Schema control (messy CSVs)
schema = {
"id": pl.Int64,
"department": pl.Utf8,
"salary": pl.Float64,
"date": pl.Utf8,
}
lf = pl.scan_csv("data.csv", schema=schema, ignore_errors=True)
Parquet (fast analytics + compression)
dfp = pl.read_parquet("data.parquet")
lfp = pl.scan_parquet("data.parquet")
df.write_parquet("out.parquet")
Lazy pipeline → write without holding everything in memory
(
pl.scan_csv("big.csv")
.select(["id", "department", "salary"])
.filter(pl.col("salary") > 0)
.sink_parquet("clean.parquet")
)
sink_parquet streams results to disk; availability depends on Polars version and engine — if missing, use .collect() then write_parquet for smaller outputs.
Inspect data — quick checks
df.shape
df.schema
df.null_count()
df.describe()
df.estimated_size("mb")
lf = pl.scan_csv("big.csv")
lf.fetch(5) # tiny sample — safe for debugging lazy plans
Select, rename, cast
df.select(["name", "salary"])
# Regex column selection
df.select(pl.col("^sales_.*$"))
df = df.rename({"dept": "department"}).with_columns(
pl.col("salary").cast(pl.Float64),
pl.col("id").cast(pl.Int64),
)
Filter — common patterns
df.filter(pl.col("salary") > 50_000)
df.filter(
(pl.col("age") > 30) &
(pl.col("department") == "IT")
)
df.filter(pl.col("department").is_in(["IT", "HR"]))
df.filter(pl.col("salary").is_between(40_000, 120_000))
df.filter(pl.col("name").str.contains("john", literal=False))
New columns + conditionals
You cannot use a column alias in the same with_columns call — chain two steps or use one expression.
df = df.with_columns((pl.col("salary") * 0.10).alias("bonus"))
df = df.with_columns((pl.col("salary") + pl.col("bonus")).alias("salary_plus_bonus"))
df = df.with_columns(
pl.when(pl.col("salary") >= 100_000).then(pl.lit("high"))
.when(pl.col("salary") >= 60_000).then(pl.lit("mid"))
.otherwise(pl.lit("low"))
.alias("salary_band")
)
group_by + aggregations
out = df.group_by("department").agg(
pl.len().alias("rows"),
pl.col("salary").mean().alias("avg_salary"),
pl.col("salary").max().alias("max_salary"),
pl.col("salary").quantile(0.5).alias("median_salary"),
).sort("avg_salary", descending=True)
out = df.group_by(["department", "salary_band"]).agg(
pl.len().alias("rows"),
pl.col("salary").sum().alias("total_salary"),
)
Window functions — rank, share of group
df2 = df.with_columns(
pl.col("salary").rank("dense", descending=True).over("department").alias("dept_rank")
)
df2 = df.with_columns(
(pl.col("salary") / pl.col("salary").sum().over("department")).alias("pct_of_dept_total")
)
Joins — inner, left, semi, anti, asof
df = df1.join(df2, on="id", how="inner")
df = df1.join(df2, on="id", how="left")
# Rows in df1 that have a match in df2
df_keep = df1.join(df2.select("id"), on="id", how="semi")
# Rows in df1 with no match in df2
df_drop = df1.join(df2.select("id"), on="id", how="anti")
# Nearest earlier quote per trade (per symbol)
df_asof = df_trades.join_asof(
df_quotes,
on="ts",
by="symbol",
strategy="backward"
)
Reshape — pivot, melt, explode
pivoted = df.pivot(
values="metric",
index="date",
on="department",
aggregate_function="sum",
)
Polars pivot API: use on= for column names to pivot (updated in recent versions). Adjust if your Polars version shows a deprecation — check help(pl.DataFrame.pivot).
long = df.melt(
id_vars=["id", "date"],
value_vars=["sales", "returns"],
variable_name="metric",
value_name="value",
)
df2 = df.with_columns(
pl.col("tags").str.split(",").alias("tags_list")
).explode("tags_list")
Strings, dates, time buckets
df = df.with_columns(
pl.col("name").str.strip_chars().str.to_uppercase().alias("name_clean"),
)
df = df.with_columns(
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d", strict=False).alias("date_parsed")
)
out = df.group_by_dynamic(
"date_parsed",
every="1d",
by="department",
).agg(
pl.col("salary").mean().alias("avg_salary"),
pl.len().alias("rows"),
)
Time-series APIs differ slightly by Polars version; verify argument names (index_column vs first positional) in your install.
Nulls + duplicates
df.fill_null(0)
df.drop_nulls()
df = df.with_columns(
pl.col("salary").fill_null(0),
pl.col("department").fill_null("UNKNOWN"),
)
df_unique = df.unique(subset=["id"], keep="first")
Lazy optimization + debugging
lf = (
pl.scan_csv("big.csv")
.select(["id", "department", "salary", "date"])
.filter(pl.col("salary") > 0)
)
print(lf.explain(optimized=True))
# Streamed collect (reduces peak memory) — Polars 1.x style
result = lf.collect(engine="streaming")
Older examples used collect(streaming=True). Use the flag your version documents; run help(pl.LazyFrame.collect).
Real pipeline templates
Clean CSV → Parquet on disk
schema = {
"id": pl.Int64,
"department": pl.Utf8,
"salary": pl.Float64,
"date": pl.Utf8,
}
(
pl.scan_csv("big.csv", schema=schema, ignore_errors=True)
.with_columns(
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d", strict=False).alias("date"),
pl.col("department").str.strip_chars().str.to_uppercase().alias("department"),
pl.col("salary").fill_null(0),
)
.filter(pl.col("salary") > 0)
.select(["id", "department", "salary", "date"])
.sink_parquet("clean.parquet")
)
Daily KPIs from Parquet
out = (
pl.scan_parquet("clean.parquet")
.group_by_dynamic("date", every="1d", by="department")
.agg(
pl.len().alias("rows"),
pl.col("salary").sum().alias("sum_salary"),
pl.col("salary").mean().alias("avg_salary"),
)
.sort(["date", "department"])
.collect()
)
out.write_csv("daily_kpis.csv")
Performance checklist
- Lazy first:
scan_csv/scan_parquet→ transform →collect()orsink_parquet. - Select early: drop unused columns as soon as possible.
- Filter early: reduces rows before heavy joins or group_by.
- Parquet for repeat work: convert raw CSV once; analyze Parquet many times.
- Avoid row-wise Python: keep logic in Polars expressions.
- Debug:
lf.fetch(5),lf.explain(optimized=True).
If something goes wrong
| Symptom | What to try |
|---|---|
Out of memory on read_csv | Switch to scan_csv + lazy pipeline, or read in chunks / convert to Parquet first. |
pivot / group_by_dynamic error | Polars API changed across versions — check help() for your exact argument names. |
| Join gives empty result | Key dtypes mismatch (e.g. Int vs Str), or key values don’t match — inspect unique() on keys. |
| Streaming / engine errors | Update Polars; fall back to non-streaming collect() for smaller data. |
Short glossary
- DataFrame — eager table in memory (like pandas DataFrame).
- LazyFrame — deferred plan; optimized before execution.
- Expression —
pl.col("x").mean()etc., runs vectorized. - Pushdown — optimizer moves filters/selects closer to the data source.
- Parquet — columnar file format; efficient for analytics.
Use lazy scans for big files, select/filter early, prefer Parquet for repeated work, and keep transformations in Polars expressions — not Python loops.
To paste into Blogger: switch post to HTML view and paste from <style> through the end (or strip <html>/<head> if your theme already sets fonts).




Comments from Facebook