
Zillow ETL Pipeline

About 3 years ago, I briefly worked in real estate and spent a lot of time refreshing Zillow to find agent- and owner-listed properties for investors. The experience made one thing clear: the faster you access property data, the better your chances.

This inspired me to build a complete end-to-end Data Engineering pipeline that automates the extraction of Zillow property listings, cleans and transforms the data, and loads it into PostgreSQL for structured analytics. The pipeline demonstrates the full flow of a real-world system: from raw API data, through a landing zone, transformation, and finally into a reliable database ready for downstream analysis, reporting, or machine learning.

Limitation

Due to API quota limits, this project uses only a single page of 25 records, which makes daily automation impractical; however, the pipeline is designed to handle larger datasets efficiently once more data is available.

Architecture Diagram

Demo

Detailed Workflow

  • Extract Data: extract_scripts

    • A request is sent to the Zillow API using Python, which returns a nested JSON containing property listings.
    • The response is stored locally inside raw_data (a minimal extraction sketch appears after this list).
  • Data Storage (Landing Zone)

    • The raw JSON file is saved locally in raw_data before any transformation.
    • This preserves the original data and mimics a typical cloud landing zone.
  • Data Transformation: transform_scripts

    • The raw JSON is processed using Pandas:
    • Flatten nested keys into a clean table.
    • Select relevant fields (address, price, beds, baths, coordinates, etc.).
    • Convert datatypes to their proper formats.
    • Output becomes an analysis-ready DataFrame.
    • A custom ZillowDataCleaner class structures and modularizes the entire transformation logic (a simplified sketch appears after this list).
  • Loading: Loading_scripts

    • A connection is created between Python and PostgreSQL using psycopg2.
    • A target table is created with the appropriate schema.
    • The cleaned data is copied into PostgreSQL from local storage (see the loading sketch after this list).
    • This makes the data persistent, queryable, and ready for downstream usage.
  • Data Serving

    • Once stored in PostgreSQL, the table becomes a system of record for house listings, and SQL queries can be run against it for insights (an example query appears after this list).
    • The data can be connected to BI tools like Power BI / Tableau.
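
Below is a minimal sketch of the extraction step. The RapidAPI host, query parameters, and output file name are assumptions for illustration; the actual request lives in extract_scripts.

```python
import json
import os

import requests

# Hypothetical endpoint and credentials -- the real host, key, and
# query parameters used by extract_scripts may differ.
URL = "https://zillow-com1.p.rapidapi.com/propertyExtendedSearch"
HEADERS = {
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
    "X-RapidAPI-Host": "zillow-com1.p.rapidapi.com",
}
PARAMS = {"location": "Houston, TX", "page": 1}  # one page ~ 25 records


def extract_listings(raw_dir: str = "raw_data") -> str:
    """Call the API and persist the raw JSON response unchanged."""
    response = requests.get(URL, headers=HEADERS, params=PARAMS, timeout=30)
    response.raise_for_status()

    os.makedirs(raw_dir, exist_ok=True)
    out_path = os.path.join(raw_dir, "zillow_listings_raw.json")
    with open(out_path, "w") as f:
        json.dump(response.json(), f)  # keep the nested structure as-is
    return out_path


if __name__ == "__main__":
    print(f"Raw response written to {extract_listings()}")
```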
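The transformation step could look roughly like the sketch below. The "props" key, the selected columns, and the method names are assumptions based on typical Zillow API responses, not the exact ZillowDataCleaner implementation in transform_scripts.

```python
import json
import os

import pandas as pd


class ZillowDataCleaner:
    """Illustrative cleaner: flatten the raw JSON, select fields, cast types."""

    # Fields kept for analysis; the real selection may differ.
    COLUMNS = ["address", "price", "bedrooms", "bathrooms",
               "livingArea", "latitude", "longitude", "listingStatus"]
    NUMERIC = ["price", "bedrooms", "bathrooms", "livingArea",
               "latitude", "longitude"]

    def __init__(self, raw_path: str):
        with open(raw_path) as f:
            self.raw = json.load(f)

    def clean(self) -> pd.DataFrame:
        # "props" is assumed to hold the listing array in the API response.
        df = pd.json_normalize(self.raw.get("props", []))
        df = df[[c for c in self.COLUMNS if c in df.columns]]

        # Cast numeric fields to proper dtypes; bad values become NaN.
        for col in self.NUMERIC:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce")
        return df


if __name__ == "__main__":
    cleaned = ZillowDataCleaner("raw_data/zillow_listings_raw.json").clean()
    os.makedirs("cleaned_data", exist_ok=True)
    cleaned.to_csv("cleaned_data/zillow_listings.csv", index=False)
```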
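Loading could follow the pattern below: create the target table, then stream the cleaned CSV into PostgreSQL with COPY. The connection details, table name, and schema are placeholders rather than the exact contents of Loading_scripts.

```python
import psycopg2

# Placeholder credentials -- in practice these should come from
# environment variables or a config file.
conn = psycopg2.connect(
    host="localhost", dbname="zillow", user="etl_user", password="etl_pass"
)

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS listings (
    address        TEXT,
    price          NUMERIC,
    bedrooms       NUMERIC,
    bathrooms      NUMERIC,
    living_area    NUMERIC,
    latitude       DOUBLE PRECISION,
    longitude      DOUBLE PRECISION,
    listing_status TEXT
);
"""

with conn, conn.cursor() as cur:
    cur.execute(CREATE_SQL)
    # COPY streams the cleaned CSV straight into the table
    # (columns must be in the same order as the CSV).
    with open("cleaned_data/zillow_listings.csv") as f:
        cur.copy_expert(
            "COPY listings FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
conn.close()
```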
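Once the data is in PostgreSQL, queries can be run directly or pulled back into Python for analysis. The query below is only an example of the kind of insight the table supports, using the hypothetical schema from the loading sketch above.

```python
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="zillow", user="etl_user", password="etl_pass"
)

# Example insight: average asking price by bedroom count.
query = """
    SELECT bedrooms, ROUND(AVG(price)) AS avg_price, COUNT(*) AS listings
    FROM listings
    GROUP BY bedrooms
    ORDER BY bedrooms;
"""
print(pd.read_sql_query(query, conn))
conn.close()
```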

🔧 Technologies Used

Python · Pandas · PostgreSQL · Git · Ubuntu · REST API

Future Enhancements

  • Increase the volume of retrieved data and use Snowflake as the data warehouse.
  • Implement automation to run the ETL pipeline daily so newly listed properties are captured.
  • Support multiple locations: Expand beyond Houston to other cities or states.
  • Containerize pipeline with Docker: Make it portable and easy to deploy anywhere.

About

This project simulates a real-world data engineering workflow by extracting Zillow property listings via API, saving the raw response locally, transforming the nested JSON into a clean dataset with Python, and loading the processed output into PostgreSQL. The final database serves as a foundation for analytics, reporting, or future automation.
