Making Sure People Count - Using Big Data and ML
The US Constitution requires that representation in Congress be based on an “actual enumeration” carried out at least once every 10 years. This year, there was a major fight over this obscure clause, ending in a Supreme Court decision re-affirming that all permanent residents should be counted, including non-citizens. At stake were as many as 3 of California’s 53 House seats, and the allocation of $700 billion in annual funding nationally.
Yet even if we agree in principle to count everyone, in practice this has always been hard. There has always been a census undercount, because some people are hard to locate, contact, and interview. This year, the outbreak of Covid-19 has created even more challenges. Recent mobility changes make it tougher for the mail to reach people on time, or at all. Census counting ended early this year, limiting the time available to track down missing respondents. Essential workers have been working longer hours away from home, and many service workers have been unemployed and may have moved.
We were curious whether new data sources and approaches might shed new light on these issues. We used the OmniSci accelerated analytics platform to visualize the data, including a massive GPS mobility dataset from our data partner X-Mode. We then trained several machine learning models, first to see how well these data correlate with historical census results, and then to predict census undercount.
What’s Known About Prior Undercounts
The census specifically measures undercounting. Census researchers have found that mail return rates are highly correlated with actual undercounts, and so we can look directly at this variable and see how it correlates with geographies and demographics. For example, as might be expected, minority groups are particularly affected. There are also some surprises. For instance “white rural renters” actually had the highest undercount in the 1990 census.
Historically, it has also been observed that a household with a complex family structure has a higher chance of being undercounted. This could be attributed to the fact that such complexity may cause ambiguity for census respondents about whom to include on the household roster.
X-Mode and GPS Trace Data
X-Mode (xmode.io) produces location data by providing a mobile app SDK which rewards app developers for sharing anonymized device locations. We conducted this work as part of a joint initiative to explore how anonymized GPS data can be used to improve COVID-19 response. We therefore looked specifically at the beginning of the pandemic in the US, February and March 2020. For this time period, the X-Mode dataset contains around 3.1 billion records.
Doing almost any kind of spatial analysis on 3B+ records with conventional tools would fall somewhere between excruciatingly slow and impossible. Fortunately, OmniSci’s GPU-based server rendering architecture served us well, allowing us to explore the data interactively. For example, the dashboard shown in Figure 2 is fully interactive, allowing the choice of any date range, time range, census block, or device speed.
The X-Mode data as provided contained mostly basic information like an anonymized device id and GPS location. Using the OmniSci platform, we “geo-enriched” this data, adding things like current location census block, county and state. Since the census measures residential population locations, this allowed us to estimate demographic characteristics of the X-Mode points and to compare them to overall census demographics. This is important, because prior to doing so, we had no detailed understanding of how representative X-Mode data might be. This also allowed us to leverage census-level privacy protection, since block groups were designed for that purpose.
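The geo-enrichment step amounts to a spatial join: each GPS ping is tagged with the census block group whose polygon contains it. In practice this ran inside OmniSci over billions of rows; the sketch below illustrates the idea in plain Python with a ray-casting point-in-polygon test. The block-group ids, polygons, and pings are made up for illustration.

```python
# Illustrative sketch of "geo-enrichment": tagging GPS pings with the
# census block group whose polygon contains them. All ids and coordinates
# below are hypothetical.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside polygon (list of (lon, lat) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def enrich(pings, block_groups):
    """Attach a block-group id to each (device_id, lon, lat) ping; None if no match."""
    enriched = []
    for device_id, lon, lat in pings:
        block = next((bg_id for bg_id, poly in block_groups.items()
                      if point_in_polygon(lon, lat, poly)), None)
        enriched.append((device_id, lon, lat, block))
    return enriched

# Two hypothetical square block groups (vertices as (lon, lat)).
block_groups = {
    "060750101001": [(-122.5, 37.7), (-122.4, 37.7), (-122.4, 37.8), (-122.5, 37.8)],
    "060750102002": [(-122.4, 37.7), (-122.3, 37.7), (-122.3, 37.8), (-122.4, 37.8)],
}
pings = [("dev1", -122.45, 37.75), ("dev2", -122.35, 37.75), ("dev3", -120.0, 35.0)]
print(enrich(pings, block_groups))
```

Once each ping carries a block-group id, demographic attributes from the census can be joined on that id with an ordinary relational join.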
In addition to demographic enrichment from census data, we were also able to measure device mobility. We did this both with general measures like distance travelled, and measures that considered trip purpose. In particular, we computed the distance to the nearest road segment, building, and SafeGraph Place and the “dwell times” for each. Beyond this particular project, some of these measures are likely useful for retail and transportation planning more generally. But here we considered only if such “mobility indices” were predictive of census undercount.
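Two of the mobility measures mentioned above can be sketched directly from timestamped pings: total distance travelled per device, and dwell time near a point of interest. The 100 m dwell radius and the ping data are illustrative assumptions, not the study's actual parameters.

```python
# Sketch of simple mobility measures from (timestamp, lat, lon) pings.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_travelled_km(pings):
    """Sum of leg distances over time-ordered (timestamp, lat, lon) pings."""
    pings = sorted(pings)
    return sum(haversine_km(a[1], a[2], b[1], b[2])
               for a, b in zip(pings, pings[1:]))

def dwell_time_s(pings, poi_lat, poi_lon, radius_km=0.1):
    """Seconds between the first and last ping within radius_km of the POI."""
    near = [t for t, lat, lon in pings
            if haversine_km(lat, lon, poi_lat, poi_lon) <= radius_km]
    return max(near) - min(near) if near else 0
```

In the real pipeline these aggregations ran as SQL over the full dataset, but the per-device logic is the same.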
The Census demographic dataset is enriched with the X-Mode data at the block level to anonymize it while adding information from before and after the start of the pandemic (Figure 2). The X-Mode data has an average sampling rate of 0.45% at the block level. We were interested in whether, by enriching the Census dataset with the X-Mode data, we could better understand the reasons for census undercount. Figure 3 shows the X-Mode sampling rate at the block level across the United States.
Mobility ratio is defined as the ratio of the number of vehicles travelling on a road at a given instant to that road's vehicular traffic capacity.
Data Science in OmniSci
The OmniSci platform integrates directly with Jupyter Notebooks, so any data, structured or unstructured, can be brought into a notebook. After working on the data and generating the required outputs, the results can be explored interactively within OmniSci’s Immerse platform.
OmniSci also ships with the necessary libraries for easy GPU access. For this problem, our modeling efforts were organized in MLflow.
MLflow is an open-source platform for managing the ML lifecycle. It supports experimentation (trying out different models), reproducibility (code can be easily packaged and reproduced elsewhere), and deployment (models can be deployed on other servers). It has a central registry where all model results and parameters are stored. Currently, backend support for MLflow is added to OmniSci through Python, and the option to add support through Immerse is being evaluated.
Here is a sample MLflow implementation of training a Random Forest model. The `with` statement creates a model run, and the `log_param` function records the parameters used in training.
In Figure 6, we use the hyperparameters corresponding to the best model for prediction. The `log_metric` and `log_model` functions store the corresponding model metrics and artifacts.
Figure 8 depicts the default MLflow UI, which shows the various model runs with their corresponding parameters and results.
As described above, the mail return rate in the census survey is used as a proxy variable for population undercount at the Census block level. In our study, we used the sociodemographic variables from the article published in the Oxford Academic journal to see if we could replicate its results by modeling the data with ML. A variety of models were trained on the data, including Elastic Net, Random Forest, and Gradient Boosting.
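A comparison along these lines can be sketched with scikit-learn. The synthetic regression data below stands in for the sociodemographic features; the model classes are the ones named above, with default-ish hyperparameters rather than the tuned ones from the study.

```python
# Sketch of comparing the three model families on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=12, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    "elastic_net": ElasticNet(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingRegressor(random_state=1),
}
scores = {name: r2_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
for name, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: r2 = {r2:.3f}")
```

In the actual study each fit was wrapped in an MLflow run, so the per-model r2 scores accumulated in the run registry rather than being printed.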
We observed that Total Population is predicted much more accurately than Mail Return Rate (r2 score 0.98 vs 0.65), with XGBoost the best predictor. This result is not unexpected, in that total population would be expected to be easier to predict than the undercount proxy. However, we were surprised that the X-Mode variables didn’t add much explanatory power to the model. This could be because the percentage sampling of the GPS points is relatively low compared to the census itself; the X-Mode data may simply have less explanatory power due to sample size.
Another possibility is that the X-Mode data is particularly sparse in the block groups that matter most for census undercount. However, the differences here were much smaller than we had originally anticipated. Visually, there is no obvious overall skew between X-Mode sampling rates and the distribution of Census mail return rates. When we compare block groups with particularly low (<50%) and high (>85%) response rates, we find sampling is actually higher in low-response areas (0.77% versus 0.49%).
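The low-versus-high comparison above is a simple bucketed aggregation, which can be sketched in pandas. The block-group numbers below are made up; the 0.77% and 0.49% figures in the text come from the real data, not this toy example.

```python
# Sketch of comparing X-Mode sampling rates across mail-response buckets
# (all numbers below are illustrative).
import pandas as pd

bg = pd.DataFrame({
    "block_group": ["a", "b", "c", "d", "e", "f"],
    "mail_return_rate": [0.42, 0.47, 0.90, 0.88, 0.70, 0.95],
    "xmode_sampling_rate": [0.0080, 0.0074, 0.0050, 0.0047, 0.0060, 0.0049],
})

def response_bucket(rate):
    if rate < 0.50:
        return "low (<50%)"
    if rate > 0.85:
        return "high (>85%)"
    return "middle"

bg["bucket"] = bg["mail_return_rate"].map(response_bucket)
mean_sampling = bg.groupby("bucket")["xmode_sampling_rate"].mean()
print(mean_sampling)
```

The same aggregation runs as a `GROUP BY` in OmniSci over the full block-group table.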
Ultimately, a third hypothesis about the lack of prediction difference is simpler: when it comes to difficulty in counting people, demographics is more predictive than mobility patterns. To be more precise, the conventional census demographic measures in the literature are more predictive than the specific mobility measures we applied. While this was somewhat disappointing analytically, the good news is that widely-available methods are very robust: neither major differences in methods nor the use of big data changed our results. So from a public policy perspective, if we want to reduce census undercount in the future, we should continue to focus on the special-needs populations described above.
The very strong correlation between X-Mode data and the census, as well as the lack of evident demographic biases, bodes well for its use elsewhere in retail analytics or public policy. In particular, the visibility this data gives into travel patterns could prove very useful to transportation planners. The sample sizes here are so large that they completely dwarf the ground-based survey methods commonly used, for example, in travel demand forecasting. Enriched X-Mode data could be particularly useful in characterizing the travel patterns of essential workers.
In terms of tools and methods, we found the combination of OmniSci with MLflow very handy. As the examples above hopefully illustrate, we were able to repeatedly interrogate the data interactively, both in initial exploration and feature engineering, and in reviewing conclusions. MLflow allowed us to rapidly build a medium-large archive of experimental model runs and made it visually obvious which hyperparameters and models were performing better than others.
The Jupyter notebooks for the models are available as open source (MIT license) from OmniSci’s GitHub repository (https://github.com/omnisci/census_undercount_example).