About 3 years ago, I briefly worked in real estate and spent a lot of time refreshing Zillow to find agent- and owner-listed properties for investors. The experience made one thing clear: the faster you access property data, the better your chances.
This inspired me to build a complete end-to-end Data Engineering pipeline that automates the extraction of Zillow property listings, cleans and transforms the data, and loads it into PostgreSQL for structured analytics. The pipeline demonstrates the full flow of a real-world system: from raw API data, through a landing zone, transformation, and finally into a reliable database ready for downstream analysis, reporting, or machine learning.
Due to API quota limits, this project currently pulls only a single page of 25 records, so daily automation is not practical yet; however, the pipeline is designed to handle larger datasets efficiently once more data is available.
Extract Data: extract_scripts
- A request is sent to the Zillow API using Python, which returns a nested JSON containing property listings.
- The raw response is stored locally inside the raw_data folder (a minimal request sketch is shown below).
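A minimal sketch of the extraction request follows. The endpoint URL, header names, and query parameters are placeholders, since the actual host and authentication depend on the Zillow API provider and plan used:

```python
import requests

# Placeholder endpoint and credentials -- the real host, path, headers, and
# query parameters depend on the specific Zillow API provider and plan.
API_URL = "https://zillow-api.example.com/propertyExtendedSearch"
HEADERS = {"X-API-Key": "YOUR_API_KEY"}
PARAMS = {"location": "Houston, TX", "page": 1}  # one page of 25 records

response = requests.get(API_URL, headers=HEADERS, params=PARAMS, timeout=30)
response.raise_for_status()

# The API returns a nested JSON payload containing the property listings.
listings = response.json()
```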
Data Storage (Landing Zone)
- The raw JSON file is saved locally in raw_data before any transformation (see the sketch below).
- This preserves the original data and mimics a typical cloud landing zone.
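Continuing the sketch above, the raw payload can be written to raw_data with a timestamped file name so every extraction run is kept as an immutable copy (the naming scheme here is an assumption):

```python
import json
from datetime import datetime
from pathlib import Path

# Local landing zone: raw JSON is persisted before any transformation,
# mirroring a typical cloud raw/landing bucket.
RAW_DIR = Path("raw_data")
RAW_DIR.mkdir(exist_ok=True)

# Timestamped file name so each run is preserved as-is.
raw_path = RAW_DIR / f"zillow_listings_{datetime.now():%Y%m%d_%H%M%S}.json"
raw_path.write_text(json.dumps(listings, indent=2))
```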
Data Transformation: transform_scripts
- The raw JSON is processed using Pandas:
  - Flatten nested keys into a clean table.
  - Select relevant fields (address, price, beds, baths, coordinates, etc.).
  - Convert datatypes to their proper formats.
- The output is an analysis-ready DataFrame.
- A custom ZillowDataCleaner class was created to structure and modularize the entire transformation logic (an illustrative sketch follows).
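The sketch below shows one way ZillowDataCleaner could be structured; the method names, the "props" key, and the selected columns are illustrative assumptions rather than the project's exact code:

```python
import json
import pandas as pd


class ZillowDataCleaner:
    """Illustrative cleaner: flattens the raw JSON, keeps relevant fields,
    and casts datatypes. Column and key names are assumptions."""

    COLUMNS = ["address", "price", "bedrooms", "bathrooms",
               "latitude", "longitude"]

    def __init__(self, raw_path: str):
        with open(raw_path) as f:
            self.raw = json.load(f)

    def flatten(self) -> pd.DataFrame:
        # json_normalize turns the nested listing objects into flat columns.
        return pd.json_normalize(self.raw.get("props", []))

    def clean(self) -> pd.DataFrame:
        df = self.flatten()
        # Keep only the relevant fields that are present in this payload.
        df = df[[c for c in self.COLUMNS if c in df.columns]]
        # Cast numeric fields to proper dtypes; bad values become NaN.
        for col in ("price", "bedrooms", "bathrooms", "latitude", "longitude"):
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce")
        return df
```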
Loading: Loading_scripts
- A connection is created between Python and PostgreSQL using psycopg2.
- A target table is created with the appropriate schema.
- The cleaned data is copied into PostgreSQL from local storage (a loading sketch is shown below).
- This makes the data persistent, queryable, and ready for downstream usage.
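A sketch of the loading step follows, assuming df is the cleaned DataFrame from the transformation step and that the connection settings and table schema are placeholders:

```python
import io
import psycopg2

# Placeholder connection settings -- supply your own credentials.
conn = psycopg2.connect(host="localhost", dbname="zillow",
                        user="postgres", password="postgres")

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS listings (
    address   TEXT,
    price     NUMERIC,
    bedrooms  NUMERIC,
    bathrooms NUMERIC,
    latitude  DOUBLE PRECISION,
    longitude DOUBLE PRECISION
);
"""

with conn, conn.cursor() as cur:
    cur.execute(CREATE_SQL)
    # Stream the cleaned DataFrame into PostgreSQL with COPY, which is
    # much faster than row-by-row INSERT statements.
    buffer = io.StringIO()
    df.to_csv(buffer, index=False, header=False)
    buffer.seek(0)
    cur.copy_expert("COPY listings FROM STDIN WITH (FORMAT csv)", buffer)
conn.close()
```

Using COPY keeps the load fast even if the pipeline later scales well beyond a single page of 25 records.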
Data Serving
- Once stored in PostgreSQL, the data becomes a system of record for house listings, and SQL queries can be run against it for insights (see the example below).
- The data can be connected to BI tools like Power BI / Tableau.
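For example, a quick insight query can be run straight from Python (the column names follow the sketches above and are assumptions):

```python
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="zillow",
                        user="postgres", password="postgres")

# Example insight: average asking price and listing count by bedroom count.
query = """
SELECT bedrooms,
       ROUND(AVG(price), 0) AS avg_price,
       COUNT(*)             AS n_listings
FROM listings
GROUP BY bedrooms
ORDER BY bedrooms;
"""

summary = pd.read_sql(query, conn)
print(summary)
conn.close()
```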
Future Improvements
- Increase the amount of data retrieved and use Snowflake as the data warehouse.
- Automate the ETL pipeline to run daily and pick up newly listed properties.
- Support multiple locations: expand beyond Houston to other cities and states.
- Containerize the pipeline with Docker to make it portable and easy to deploy anywhere.
