Building a usable food environment dataset for New York City
Before neighborhood food patterns could be measured across New York City, the underlying business records had to be prepared for analysis. The larger study examined how local food environments lined up with diabetes prevalence among adults and children. That work depended on something more basic first: making sure the food outlet records were consistent enough to support neighborhood-level measurement.
The source records were restaurant inspection data maintained by the New York City Department of Health and Mental Hygiene and retail food inspection data from the New York State Department of Agriculture and Markets for the years 2009 through 2013. These were administrative datasets, and like most administrative data, they reflected the needs of recordkeeping and oversight more than the needs of later analysis. The same business could appear under slightly different names. Punctuation, spacing, and spelling varied, and abbreviations and business labels were not applied consistently. None of that is unusual. This kind of inconsistency is routine when records are collected over several years and across separate systems.
Python was used to clean and standardize business names and related text fields so that establishments that were effectively the same could be treated the same way in the analysis. That was not cosmetic cleanup. Every later step, from classification to the neighborhood-level proportions, depended on names matching reliably, so it changed what the data could actually support.
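A minimal sketch of what this kind of name standardization can look like is shown here. The abbreviation map and the function name normalize_name are illustrative assumptions; the specific rules used in the study are not documented in this write-up.

```python
import re
import unicodedata

# Hypothetical abbreviation map for illustration only; the study's
# actual standardization rules are not listed in the text.
ABBREVIATIONS = {
    "rest": "restaurant",
    "mkt": "market",
    "groc": "grocery",
    "inc": "",   # legal suffixes are dropped
    "corp": "",
    "llc": "",
}


def normalize_name(raw: str) -> str:
    """Return a lowercase, punctuation-free, whitespace-collapsed name."""
    # Decompose accented characters, then drop anything non-ASCII
    # (this also removes curly apostrophes).
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower().replace("&", " and ")
    # Remove apostrophes so "McDonald's" and "McDonalds" agree.
    text = re.sub(r"['`]", "", text)
    # Replace every other run of punctuation or whitespace with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Expand or drop known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok)


print(normalize_name("McDonald's  #1234"))   # mcdonalds 1234
print(normalize_name("MCDONALDS, INC."))     # mcdonalds
print(normalize_name("Joe's Deli & Groc."))  # joes deli and grocery
```

After this step, spelling variants of the same name collapse to a single string, which is what makes later grouping and classification consistent.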
From there, outlet classification was built in layers rather than handled with one automatic sort. Clear cases came first. Recognizable naming patterns and broad business categories made it possible to separate likely fast food establishments, bodegas, grocery stores, and other food retailers in a consistent way. Text-matching methods helped catch the less tidy cases and reduce the noise that appears when similar establishments are recorded in slightly different ways.
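The layered approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's actual rule set: the chain list, the keyword rules, the 0.85 cutoff, and the function name classify are all hypothetical, and the standard library's difflib stands in for whatever text-matching method was actually used. Names are assumed to be already normalized to lowercase without punctuation.

```python
import difflib

# Hypothetical reference data for illustration only.
FAST_FOOD_CHAINS = ["mcdonalds", "burger king", "wendys", "popeyes", "dunkin donuts"]
KEYWORD_RULES = [
    ("bodega", ("bodega", "deli grocery", "mini market")),
    ("grocery", ("supermarket", "grocery", "food market")),
]


def classify(name: str, cutoff: float = 0.85) -> tuple[str, str]:
    """Return (label, layer) for an already-normalized business name."""
    # Layer 1: clear cases, where the name is or starts with a known chain.
    for chain in FAST_FOOD_CHAINS:
        if name == chain or name.startswith(chain + " "):
            return "fast_food", "exact"
    # Layer 2: broad business categories signalled by keywords.
    for label, keywords in KEYWORD_RULES:
        if any(kw in name for kw in keywords):
            return label, "keyword"
    # Layer 3: approximate matching for the less tidy spellings.
    close = difflib.get_close_matches(name, FAST_FOOD_CHAINS, n=1, cutoff=cutoff)
    if close:
        return "fast_food", "fuzzy"
    # Nothing matched: leave the record unclassified rather than guess.
    return "unclassified", "none"


for example in ["mcdonalds 1234", "mc donalds", "la esquina bodega", "blue ribbon sushi"]:
    print(example, "->", classify(example))
```

Returning the layer alongside the label keeps a record of how each decision was made, which is useful when reviewing the fuzzy matches by hand.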
Not every record fit cleanly into a category, and the workflow made room for that. Some records supported a strong classification. Others supported only a reasonable guess. Instead of forcing certainty where the data did not support it, the process left some ambiguity in place. The point was not to stamp every business with a label. It was to build a dataset that gave a more believable picture of the city’s food retail pattern.
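One way to keep that ambiguity explicit is to store a confidence tier next to each label instead of a bare category. The sketch below assumes a similarity score from difflib and two hypothetical thresholds (0.90 and 0.75); the tiers, names, and cutoffs are illustrative and are not taken from the study.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Sequence


@dataclass
class Classification:
    label: Optional[str]  # None when the record is left unresolved
    confidence: str       # "strong", "tentative", or "unresolved"
    score: float          # best similarity against the reference names


def classify_with_confidence(
    name: str,
    reference_names: Sequence[str],
    label: str,
    strong: float = 0.90,      # hypothetical threshold
    tentative: float = 0.75,   # hypothetical threshold
) -> Classification:
    """Assign `label` only as firmly as the best name match supports."""
    score = max(SequenceMatcher(None, name, ref).ratio() for ref in reference_names)
    if score >= strong:
        return Classification(label, "strong", score)
    if score >= tentative:
        return Classification(label, "tentative", score)
    # Below both thresholds the record keeps no label at all.
    return Classification(None, "unresolved", score)


chains = ["burger king"]
for example in ["burger king", "bgr king", "blue ribbon sushi"]:
    print(example, "->", classify_with_confidence(example, chains, "fast_food"))
```

Downstream measures can then be computed with and without the tentative records to see how much the results depend on them.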
Once the records had been cleaned, grouped, and classified consistently, they could be linked to census tract geography and used to build neighborhood exposure measures. In the larger study, those measures were calculated around each census tract centroid using a one-mile radius. Fast-food swamp exposure was defined as the proportion of nearby restaurants classified as fast food. Retail food swamp exposure was defined as the proportion of nearby retail food outlets classified as bodegas or small convenience stores, generally under 2,000 square feet. That step turned a business listing into something more analytically useful: a way of describing the food environment surrounding each neighborhood.
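The exposure calculation described above can be sketched in a few lines. This version assumes straight-line (haversine) distance on latitude and longitude and uses hypothetical function names and coordinates; the study may have used projected coordinates or GIS tooling instead. The caller passes only restaurants for the fast-food measure, or only retail food outlets for the retail measure, so the denominator matches the definition.

```python
import math
from typing import Iterable, Optional, Set, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def swamp_exposure(
    centroid: Tuple[float, float],
    outlets: Iterable[Tuple[float, float, str]],
    target_labels: Set[str],
    radius_miles: float = 1.0,
) -> Optional[float]:
    """Share of outlets within the radius whose label is in target_labels.

    Returns None when no outlets fall inside the radius, since the
    proportion is undefined in that case.
    """
    nearby = [
        label
        for lat, lon, label in outlets
        if haversine_miles(centroid[0], centroid[1], lat, lon) <= radius_miles
    ]
    if not nearby:
        return None
    return sum(label in target_labels for label in nearby) / len(nearby)


# Made-up coordinates for illustration: three restaurants within a mile
# of the centroid and one several miles away.
centroid = (40.7500, -73.9900)
restaurants = [
    (40.7510, -73.9900, "fast_food"),
    (40.7500, -73.9950, "full_service"),
    (40.7550, -73.9900, "fast_food"),
    (40.8500, -73.9900, "fast_food"),  # outside the one-mile radius
]
print(swamp_exposure(centroid, restaurants, {"fast_food"}))
```

Returning None for tracts with no nearby outlets keeps "no restaurants at all" distinct from "no fast food among the restaurants present".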
The work relied on Python text cleaning and standardization, rule-based classification, text matching, data grouping, quality review, and the preparation of tract-level food environment measures for spatial analysis. More than that, it helped turn a set of routine administrative records into something that could support a real public-health question. Public datasets often carry the authority of official information long before they are ready to answer the question in front of them. This kind of preparation is what makes later analysis more credible. In this case, it made it possible to describe how food exposure was distributed across the city without pretending the records were cleaner or more precise than they were.
Related publications
Identifying Geographic Disparities in Diabetes Prevalence Among Adults and Children Using Emergency Claims Data
https://pmc.ncbi.nlm.nih.gov/articles/PMC5920312/