Last January, I attended a Hackathon at Brown University. As I was walking past the tables of visiting companies, a screen showing worldwide shipping routes caught my eye. I was amazed by how the OmniSci platform brought data analytics and visualization together in a user-friendly interface. About five hours later, I was building Meddle, a real estate decision-making tool that uses aggregate income and property data from New York City to identify and predict gentrifying neighborhoods.
Gentrification Displaces Local Culture
Gentrification is an urgent issue facing many inner cities in the United States. Made up of many different cultures, often divided by neighborhood, the United States has a unique demographic texture in its metropolitan cities. New York leads in this diversity, and its communities face the persistent threat of gentrification led by large real estate development companies.
The problem with gentrification is twofold. First, it displaces and eradicates local cultures. Second, if not executed strategically, long-term investments in the area can turn into poor ones if the neighborhood experiences a cultural downturn. The latter problem usually emerges when the investing company cannot correctly identify which stage of gentrification the neighborhood is in. Meddle offers the insight companies need to invest in a timely, wise and safe manner while preserving the cultural fabric of local neighborhoods.
Preparing the Data
Meddle lets users track income by Zip Code and compare the size and number of property acquisitions in each Zip Code. The user can then compare low-income neighborhoods and identify, based on predictive modeling, which ones are likely to gentrify.
The datasets are first wrangled and cleaned with pandas in Python, then uploaded to the OmniSci platform. To take full advantage of OmniSci's cross-filtering feature, I recommend joining the datasets beforehand and loading them as a single table from one CSV file.
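As a rough illustration of that join, here is a minimal pandas sketch. The two toy tables, their values, and the column names `Zip_Code`, `Sale_Price` and `Gross_Square_Feet` are assumptions made for the example; only `Mean_Income`, `Median_Income` and `Population` are names used in this post. The real datasets will need their own cleaning first.

```python
import pandas as pd

# Toy stand-ins for the two cleaned datasets. All values are invented.
# Zip Codes are kept as strings so leading zeros are never lost.
income = pd.DataFrame({
    "Zip_Code": ["10001", "10002", "10003"],
    "Mean_Income": [100_000, 50_000, 150_000],
    "Median_Income": [80_000, 35_000, 110_000],
    "Population": [20_000, 80_000, 55_000],
})
acquisitions = pd.DataFrame({
    "Zip_Code": ["10002", "10002", "10003", "10099"],
    "Sale_Price": [1_200_000, 950_000, 2_100_000, 500_000],
    "Gross_Square_Feet": [1_000, 800, 1_500, 700],
})

# Inner join: keep only acquisitions whose Zip Code has income data, so
# every row of the single table carries every column.
# validate="many_to_one" raises if the income table repeats a Zip Code.
merged = acquisitions.merge(
    income, on="Zip_Code", how="inner", validate="many_to_one"
)

# Serialize the joined table as one CSV; pass a file path instead of
# nothing to write it to disk for upload.
csv_text = merged.to_csv(index=False)
```

An inner join drops acquisitions in Zip Codes without income data (here the hypothetical `10099`). A left join would keep them with empty income columns, which may or may not suit the dashboard.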
For this application, I use the University of Michigan's Zip Code-based national income data for the mean income per Zip Code, and the New York State Real Estate Property Acquisition Data for investment information.
Combined, the datasets give us the following variables:
After merging all the datasets into one, we are left with roughly 14,000 transactions from 2015 to 2016 in New York City, primarily in Manhattan. The merged dataset has 14,223 rows and 28 columns (dimensions).
Income Variables per Zip Code
Using the national data on income per Zip Code, I add the Mean_Income, Median_Income and Population columns to the dataset for each Zip Code:
Taking the quartiles of mean income across all the Zip Codes, I classify each Zip Code into an income bracket based on the following values:
Min: $30,637; Q1: $53,967; Q2: $112,292; Q3: $158,965; Max: $256,236
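The bracket assignment can be sketched with `pandas.cut`. The class names match the ones used in the dashboard, but the mapping of quartiles to classes is an assumption made for this example (Low up to Q1, Middle from Q1 to Q3, High above Q3). The post does not state the exact cut-offs used in Meddle.

```python
import pandas as pd

# Quartile boundaries of mean income reported above (USD).
MIN_INCOME, Q1, Q3, MAX_INCOME = 30_637, 53_967, 158_965, 256_236


def classify_income(mean_income: pd.Series) -> pd.Series:
    """Label each mean income as Low, Middle or High Income.

    ASSUMPTION: Low = Min..Q1, Middle = Q1..Q3, High = Q3..Max.
    Intervals are closed on the right, so a value equal to Q1 is Low.
    """
    return pd.cut(
        mean_income,
        bins=[MIN_INCOME, Q1, Q3, MAX_INCOME],
        labels=["Low Income", "Middle Income", "High Income"],
        include_lowest=True,  # so the minimum itself falls in "Low Income"
    )


# Illustrative values only, chosen to hit each boundary.
sample = pd.Series([30_637, 53_967, 60_000, 158_965, 200_000])
classes = classify_income(sample)
```

Values outside the Min to Max range come back as missing, which is a useful signal that a Zip Code was not part of the data the quartiles were computed from.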
Using Meddle to Identify Gentrifying Neighborhoods
Scenario: A real estate executive wants to identify a gentrifying neighborhood and report back the maximum property investment the company can make there.
Approach: The model treats a neighborhood as gentrifying when it still preserves its local culture and its properties are not yet as valuable as those in nearby neighborhoods or in neighborhoods with similar demographics. We therefore focus primarily on low-income neighborhoods whose property values have yet to mature. As the demand for housing in New York increases, the city is expanding northward. We start by identifying the low-income neighborhoods and looking at outliers.
Analysis Steps Using Meddle
- Click the “Low Income” bar in the bar chart to filter the data to Low Income only.
The cross-filtering feature in OmniSci Immerse simultaneously filters all of the visualizations to focus on the “Low Income” class:
- Look at the bordering cities. Using the scatter plot in the middle of the dashboard, first identify the Zip Codes that have an unusually high number of acquisitions for their income class (closer to that of a middle-class neighborhood):
- Hover over the outlier points on the scatter plot and use the pop-up box to identify the Zip Codes of these outliers.
Identified Potential Zip Code: 10002
- Click on the table and identify the neighborhood geospatially.
- Next, we look at how 10002 performs among the four selected Zip Codes, the borderline outlier group within the “Low Income” category.
From this data, we observe that 10002 has the lowest income of the four outliers (which are all Low Income) and also the highest price per square foot among them. A high price indicates demand for the neighborhood, which is backed by the relatively high number of property acquisitions that led us to select it as an outlier in Step 2.
- Now that we know 10002 has the lowest income among the outliers and the highest number of acquisitions of all the low-income Zip Codes, we can deduce that the Zip Code is becoming gentrified. To check this, however, we need to clear all the filters and see how 10002 performs compared to the neighboring Zip Codes, which include Middle Income and High Income Zip Codes.
We can observe that all of the neighboring Zip Codes are Middle Income Zip Codes, with a price per square foot that is higher than 70% of the map. Zip Code 10002 performs poorly by comparison and has the lowest price per square foot among its neighbors. Thus, we can confidently conclude that 10002 is becoming gentrified.
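The first steps of this walkthrough are done visually in Immerse, but the same logic can be approximated in pandas. The sketch below is only an illustration: the rows are invented, `ZIP_A` to `ZIP_D` are placeholder Zip Codes, the column names `Income_Class`, `Sale_Price` and `Gross_Square_Feet` are assumed, and the outlier rule (more than twice the median acquisition count of the class) is a hypothetical stand-in for picking points off the scatter plot by eye.

```python
import pandas as pd

# Invented toy rows, one per acquisition. Not the real data.
rows = [
    # Zip_Code, Income_Class, Mean_Income, Sale_Price, Gross_Square_Feet
    ("10002", "Low Income", 50_000, 1_000_000, 1_000),
    ("10002", "Low Income", 50_000, 1_200_000, 1_000),
    ("10002", "Low Income", 50_000, 900_000, 1_000),
    ("10002", "Low Income", 50_000, 1_100_000, 1_000),
    ("ZIP_A", "Low Income", 52_000, 800_000, 1_000),
    ("ZIP_A", "Low Income", 52_000, 600_000, 1_000),
    ("ZIP_B", "Low Income", 45_000, 500_000, 1_000),
    ("ZIP_C", "Low Income", 40_000, 400_000, 1_000),
    ("ZIP_D", "Middle Income", 120_000, 3_000_000, 2_000),
]
df = pd.DataFrame(rows, columns=[
    "Zip_Code", "Income_Class", "Mean_Income",
    "Sale_Price", "Gross_Square_Feet",
])

# Step 1: keep only the "Low Income" class, like clicking the bar.
low = df[df["Income_Class"] == "Low Income"]

# Step 2: summarize each Zip Code, the quantities the scatter plot shows.
per_zip = low.groupby("Zip_Code").agg(
    Acquisitions=("Sale_Price", "count"),
    Mean_Income=("Mean_Income", "first"),
    Total_Price=("Sale_Price", "sum"),
    Total_Sqft=("Gross_Square_Feet", "sum"),
)
per_zip["Price_Per_Sqft"] = per_zip["Total_Price"] / per_zip["Total_Sqft"]

# Step 3: flag Zip Codes with unusually many acquisitions for the class.
# ASSUMPTION: "unusual" means more than OUTLIER_FACTOR times the median.
OUTLIER_FACTOR = 2.0
cutoff = OUTLIER_FACTOR * per_zip["Acquisitions"].median()
outliers = per_zip[per_zip["Acquisitions"] > cutoff]

print(outliers[["Acquisitions", "Mean_Income", "Price_Per_Sqft"]])
```

The final check against neighboring Zip Codes would need an adjacency list or geometry for each Zip Code, which is not part of this sketch. In Meddle that comparison is done on the map.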
From Prototype to Production
Based on my experience at the Hackathon, the potential for using OmniSci is enormous. By incorporating more data, historical trends and even machine learning, Meddle could be extended to identify neighborhoods at the different stages and types of gentrification, rather than just pointing out the neighborhoods that are already being gentrified. By collaborating with a larger real estate company that has its own detailed dataset, the insights gained could provide a huge financial advantage in investment.
Even with the limited data I used, gentrification is found to be traceable and identifiable. The goal of any Hackathon is to be able to produce something useful, and in that sense I feel that this project was a success!
Questions or comments? Stop by the OmniSci Community forum and let us know what you think!