
Project

Ontario Real Estate Lakehouse

Production-grade Medallion pipeline on 1.2M civic records.

PySpark, Delta Lake, Databricks, Streamlit, Python
Records processed: 1.2M+
Data volume: 540 MB
Gold tables: 6
Dashboard tabs: 6

The problem

Toronto and the Ontario government publish valuable real estate data (property boundaries, building permits, rental evaluations, housing price indices), but it’s fragmented across portals, inconsistently shaped, and updated on different cadences. No single place gives decision-makers a trustworthy view of what’s happening in the market.

What I built

A production-style Medallion (Bronze/Silver/Gold) lakehouse on Databricks + Delta Lake:

  • Bronze (8 tables): Raw API landing from Toronto Open Data + StatsCan, with idempotent ingestion and audit columns.
  • Silver (4 tables): Type-cast, deduplicated, and enriched with geospatial keys; written with schema-enforced Delta writes.
  • Gold (6 tables): Business-ready aggregates powering the dashboard: permit trends, construction investment, apartment quality scores, and price indices.
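
To make the layering concrete, here is a minimal plain-Python sketch of the three steps. It is a stand-in for illustration only, not the repo's PySpark/Delta code: the function names (`land_bronze`, `build_silver`, `build_gold`), the field names (`permit_id`, `year`, `est_cost`), and the audit columns (`_source`, `_ingested_at`) are all assumptions. Bronze lands raw rows keyed by a content hash so a rerun adds nothing, Silver casts types and keeps one row per key, and Gold aggregates per year.

```python
import hashlib
import json
from collections import defaultdict
from datetime import datetime, timezone


def land_bronze(bronze, raw_records, source):
    """Land raw rows keyed by content hash, adding audit columns (hypothetical names)."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    for payload in raw_records:
        key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        if key not in bronze:  # idempotent: rows already landed are skipped
            bronze[key] = {**payload, "_source": source, "_ingested_at": ingested_at}
    return bronze


def build_silver(bronze):
    """Cast string fields to proper types and keep one row per permit_id."""
    silver = {}
    for row in bronze.values():
        # Rows landed later overwrite earlier ones with the same permit_id.
        silver[row["permit_id"]] = {
            "permit_id": row["permit_id"],
            "year": int(row["year"]),
            "est_cost": float(row["est_cost"]),
        }
    return list(silver.values())


def build_gold(silver):
    """Aggregate permit count and total estimated cost per year."""
    totals = defaultdict(lambda: {"permits": 0, "investment": 0.0})
    for row in silver:
        totals[row["year"]]["permits"] += 1
        totals[row["year"]]["investment"] += row["est_cost"]
    return dict(totals)


# Made-up sample rows, as they might arrive from an API (all strings).
raw = [
    {"permit_id": "P1", "year": "2023", "est_cost": "1000"},
    {"permit_id": "P2", "year": "2023", "est_cost": "2500"},
    {"permit_id": "P3", "year": "2024", "est_cost": "400"},
]

bronze = {}
land_bronze(bronze, raw, source="sample")
land_bronze(bronze, raw, source="sample")  # rerun: no new rows are landed
gold = build_gold(build_silver(bronze))
```

In a Delta Lake setting the same ideas would typically map to a merge on a key for Bronze, explicit casts plus deduplication for Silver, and a grouped aggregate for Gold; the dictionaries here only mimic that behaviour in memory.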

A Streamlit dashboard on top surfaces six analytical tabs tied directly to the Gold layer.

Why it matters

This was my playground for putting my day-job patterns into a public, portable repo: incremental Delta loads, schema enforcement, layered refinement, and a thin but real UI. It also supports full CI-style reruns, keeps audit trails, and documents the grain of each Gold table.
