About 3 years ago, I briefly worked in real estate and spent a lot of time refreshing Zillow to find agent- and owner-listed properties for investors. The experience made one thing clear: the faster you access property data, the better your chances.
This inspired me to build a complete end-to-end Data Engineering pipeline that automates the extraction of Zillow property listings, cleans and transforms the data, and loads it into PostgreSQL for structured analytics. The pipeline demonstrates the full flow of a real-world system: from raw API data, through a landing zone, transformation, and finally into a reliable database ready for downstream analysis, reporting, or machine learning.
Due to API quota limits, this project currently pulls only a single page of 25 records, so daily automation is not practical yet; however, the pipeline is designed to handle larger datasets efficiently once more data is available.
Extract Data: extract_scripts
- A request is sent to the Zillow API using Python, which returns a nested JSON containing property listings.
- The raw response is stored locally inside the raw_data folder (a minimal request sketch is shown below).
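A minimal sketch of the extraction request follows. The endpoint URL, header names, and query parameters are placeholders, since the actual host and authentication depend on the Zillow API provider and plan used:

```python
import requests

# Placeholder endpoint and credentials -- the real host, path, headers, and
# query parameters depend on the specific Zillow API provider and plan.
API_URL = "https://zillow-api.example.com/propertyExtendedSearch"
HEADERS = {"X-API-Key": "YOUR_API_KEY"}
PARAMS = {"location": "Houston, TX", "page": 1}  # one page of 25 records

response = requests.get(API_URL, headers=HEADERS, params=PARAMS, timeout=30)
response.raise_for_status()

# The API returns a nested JSON payload containing the property listings.
listings = response.json()
```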
Data Storage (Landing Zone)
- The raw JSON file is saved locally in raw_data before any transformation (see the sketch below).
- This preserves the original data and mimics a typical cloud landing zone.
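Continuing the sketch above, the raw payload can be written to raw_data with a timestamped file name so every extraction run is kept as an immutable copy (the naming scheme here is an assumption):

```python
import json
from datetime import datetime
from pathlib import Path

# Local landing zone: raw JSON is persisted before any transformation,
# mirroring a typical cloud raw/landing bucket.
RAW_DIR = Path("raw_data")
RAW_DIR.mkdir(exist_ok=True)

# Timestamped file name so each run is preserved as-is.
raw_path = RAW_DIR / f"zillow_listings_{datetime.now():%Y%m%d_%H%M%S}.json"
raw_path.write_text(json.dumps(listings, indent=2))
```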
Data Transformation: transform_scripts
- The raw JSON is processed using Pandas:
  - Flatten nested keys into a clean table.
  - Select relevant fields (address, price, beds, baths, coordinates, etc.).
  - Convert datatypes to their proper formats.
- The output is an analysis-ready DataFrame.
- A custom ZillowDataCleaner class was created to structure and modularize the entire transformation logic (an illustrative sketch follows).
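The sketch below shows one way ZillowDataCleaner could be structured; the method names, the "props" key, and the selected columns are illustrative assumptions rather than the project's exact code:

```python
import json
import pandas as pd


class ZillowDataCleaner:
    """Illustrative cleaner: flattens the raw JSON, keeps relevant fields,
    and casts datatypes. Column and key names are assumptions."""

    COLUMNS = ["address", "price", "bedrooms", "bathrooms",
               "latitude", "longitude"]

    def __init__(self, raw_path: str):
        with open(raw_path) as f:
            self.raw = json.load(f)

    def flatten(self) -> pd.DataFrame:
        # json_normalize turns the nested listing objects into flat columns.
        return pd.json_normalize(self.raw.get("props", []))

    def clean(self) -> pd.DataFrame:
        df = self.flatten()
        # Keep only the relevant fields that are present in this payload.
        df = df[[c for c in self.COLUMNS if c in df.columns]]
        # Cast numeric fields to proper dtypes; bad values become NaN.
        for col in ("price", "bedrooms", "bathrooms", "latitude", "longitude"):
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce")
        return df
```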
Loading: Loading_scripts
- A connection is created between Python and PostgreSQL using psycopg2.
- A target table is created with the appropriate schema.
- The cleaned data is copied into PostgreSQL from local storage (a loading sketch is shown below).
- This makes the data persistent, queryable, and ready for downstream usage.
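A sketch of the loading step follows, assuming df is the cleaned DataFrame from the transformation step and that the connection settings and table schema are placeholders:

```python
import io
import psycopg2

# Placeholder connection settings -- supply your own credentials.
conn = psycopg2.connect(host="localhost", dbname="zillow",
                        user="postgres", password="postgres")

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS listings (
    address   TEXT,
    price     NUMERIC,
    bedrooms  NUMERIC,
    bathrooms NUMERIC,
    latitude  DOUBLE PRECISION,
    longitude DOUBLE PRECISION
);
"""

with conn, conn.cursor() as cur:
    cur.execute(CREATE_SQL)
    # Stream the cleaned DataFrame into PostgreSQL with COPY, which is
    # much faster than row-by-row INSERT statements.
    buffer = io.StringIO()
    df.to_csv(buffer, index=False, header=False)
    buffer.seek(0)
    cur.copy_expert("COPY listings FROM STDIN WITH (FORMAT csv)", buffer)
conn.close()
```

Using COPY keeps the load fast even if the pipeline later scales well beyond a single page of 25 records.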
Data Serving
- Once stored in PostgreSQL, the data becomes a system of record for house listings, and SQL queries can be run against it for insights (see the example below).
- The data can be connected to BI tools like Power BI / Tableau.
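For example, a quick insight query can be run straight from Python (the column names follow the sketches above and are assumptions):

```python
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="zillow",
                        user="postgres", password="postgres")

# Example insight: average asking price and listing count by bedroom count.
query = """
SELECT bedrooms,
       ROUND(AVG(price), 0) AS avg_price,
       COUNT(*)             AS n_listings
FROM listings
GROUP BY bedrooms
ORDER BY bedrooms;
"""

summary = pd.read_sql(query, conn)
print(summary)
conn.close()
```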
Future Improvements
- Increase the amount of data retrieved and use Snowflake as the data warehouse.
- Automate the ETL pipeline to run daily and pick up newly listed properties.
- Support multiple locations: expand beyond Houston to other cities and states.
- Containerize the pipeline with Docker to make it portable and easy to deploy anywhere.
