David Lewis, Russell Hofvendahl, Jason Trager

0. Foreword

Sustainabilist often works with data related to quality assurance and quality control (QA/QC) inspections of public or private infrastructure, typically solar energy systems or energy efficiency upgrades for buildings. These data sets almost exclusively belong to private entities that have commissioned a study to evaluate how safe and well-installed the infrastructure they financed is. For this reason, it has been very difficult to publish anything about how our work is conducted, or to document publicly what kind of analysis we do.

Enter Epicodus, a coding bootcamp in Portland, OR. Several weeks ago, I met David and Russell, two eager students who were just learning how to code. They were attending the first meeting of CleanWeb Portland, which Sustainabilist organized. We were talking about the lack of public datasets in sustainability, and I mentioned how Chicago's food inspection data set was very similar to many of the QA/QC data sets that I have looked at. Just like that, a project was born.

The coding work demonstrated herein is 100% that of the student interns, under my guidance for how to structure, examine, and explore the data. The work was conducted using Google Colaboratory, IPython notebooks, and Anaconda's scientific computing packages.

1. Introduction

Foodborne illnesses afflict 48 million Americans each year, resulting in 128,000 hospitalizations and 3,000 deaths (CDC). City governments curb the spread of illness by enforcing stringent health codes, but often face daunting inspection schedules and severe budget constraints. To meet regulations, both public officials and the businesses they inspect must make difficult decisions in allocating resources.

In this tutorial we present our efforts to support these decision-makers by predicting when and where health code violations will occur. As we draw from past research, clean and explore the available data, and ultimately build and test a predictive model of inspection outcomes, we invite any questions, comments, or critiques that might arise. This project would not be possible without generous community support, and we welcome the opportunity to extend that community and expand our skills as developers and data scientists.

1.1. Past Work

In the city of Chicago, health code violations are rated as minor, serious, or critical. A critical violation denotes a pressing hazard to public health and means an automatic failed inspection, so a health inspector's job is essentially to minimize the duration for which critical violations go undiscovered.

In 2014 the Chicago Department of Public Health (CDPH) began work on a predictive model of critical violations to facilitate this task. The team explored a number of datasets and features before eventually arriving at the following list of predictors:

  • inspector assigned
  • previous inspection outcomes
  • age at inspection
  • license for alcohol consumption on premises
  • license for over-the-counter tobacco retail
  • local temperature at time of inspection
  • nearby burglaries
  • nearby sanitation complaints
  • nearby garbage cart requests

Using a glmnet model trained on these features, the team demonstrated that a data-driven inspection schedule could significantly reduce the time to find a critical violation. This improved schedule has since been adopted by the CDPH, and both the model and the data used have been made publicly available.
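For intuition, glmnet's model family is logistic regression with an elastic-net (combined L1/L2) penalty. The sketch below is a minimal illustration of that idea, not the CDPH's actual implementation: it fits such a model by proximal gradient descent on synthetic data, where the features stand in for predictors like past failures and facility age.

```python
import numpy as np

def train_elastic_net_logreg(X, y, alpha=0.1, l1_ratio=0.5, lr=0.1, epochs=2000):
    """Logistic regression with an elastic-net penalty (the model family
    behind R's glmnet), fit by proximal gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n + alpha * (1 - l1_ratio) * w  # loss + ridge part
        b -= lr * np.mean(p - y)
        w -= lr * grad_w
        # lasso part of the penalty, applied as a soft-thresholding (proximal) step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha * l1_ratio, 0.0)
    return w, b

# Synthetic demo: the first two features drive failure risk, the third is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # stand-ins for e.g. past_fails, age, temperature
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(float)

w, b = train_elastic_net_logreg(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = (preds == y).mean()
```

The L1 component drives uninformative weights toward exactly zero, which is why glmnet doubles as a feature-selection tool during modeling.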

1.2. Premise

In light of the CDPH's success, we began this project as an exploration and extension of their work. Whereas the Chicago team set out to predict critical violations, we are working to develop a series of models predicting individual violations, as well as to explore neural networks for the prediction model.

By developing a higher-resolution model, we hope to provide actionable information not just to public officials but to restaurants themselves. In addition to a general interest in their customers' health, restaurants are fined $500 for each critical and $250 for each serious violation found, and so are strongly motivated to maintain health codes. A real-time violation risk analysis could provide actionable information to restaurants on an ongoing basis, improving inspection outcomes and promoting public health.
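As a rough illustration of the stakes, the fine exposure from a single inspection follows directly from the fine schedule above (`inspection_fine` is a hypothetical helper for this tutorial, not part of any official tooling):

```python
CRITICAL_FINE = 500  # dollars per critical violation, per the schedule above
SERIOUS_FINE = 250   # dollars per serious violation

def inspection_fine(critical_count, serious_count):
    """Potential fine for a single inspection outcome."""
    return CRITICAL_FINE * critical_count + SERIOUS_FINE * serious_count

# e.g. two critical and three serious violations
fine = inspection_fine(2, 3)  # 2 * $500 + 3 * $250 = $1750
```

Even a handful of predicted violations therefore translates into four-figure savings, before accounting for the cost of a failed inspection itself.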

1.3. Data

Although we are developing a more specific model than the CDPH and using different tools to do so, we both seek to predict inspection outcomes from environmental data. This common investigative question allows us to build on the Chicago team's research, prioritizing the preparation and exploration of those datasets and features already found to be significant. While we may draw from additional sources as work progresses, we currently use the following datasets: food inspections, business licenses, garbage cart requests, sanitation complaints, crimes, and weather.

With the exception of the weather data, which comes from the Dark Sky API, all data is available through the Chicago Data Portal. We will not be working with data on the inspector assigned, as this is not publicly available and so not useful in practice.

1.4. Tools

We performed initial exploration in Python 3.6 notebooks running on the cloud-based Colaboratory Jupyter environment. For later work we used local Jupyter notebooks installed with Anaconda and shared through our GitHub repository, with Dropbox for large files. We used pip 9 for package management and pandas for data manipulation, and Slack, Trello, and Notable for coordinating work.

2. Pre-processing and Data Structuring

In order to derive insights from our data, we first had to ensure that each dataset was consistent and ready for further processing (e.g. standardizing column names, removing duplicates, filtering unwanted data, converting types). After this preliminary cleaning, we translated certain observations into a form more convenient for analysis (e.g. splitting strings of violations into a series of binary violation columns). Finally, we used this data to derive secondary features like the critical violation count or the time since last inspection. The results of these processes are available on Dropbox.
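One derived feature mentioned above, the time since a facility's last inspection, can be sketched with a groupby/shift in pandas. This is a minimal example on toy data; the column names mirror the inspections dataset:

```python
import pandas as pd

# Toy inspections frame with the relevant columns
inspections = pd.DataFrame({
    "license": ["10", "10", "10", "22", "22"],
    "inspection_date": pd.to_datetime([
        "2016-01-05", "2016-07-12", "2017-01-20",
        "2016-03-01", "2016-09-15",
    ]),
})

# Sort within each facility, then diff against the previous inspection date
inspections = inspections.sort_values(["license", "inspection_date"])
prev = inspections.groupby("license")["inspection_date"].shift(1)
inspections["days_since_last"] = (inspections["inspection_date"] - prev).dt.days
```

Each facility's first inspection has no predecessor and so gets a missing value, which later has to be imputed or handled explicitly during modeling.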

2.1. Preparation of Food Inspections Data

To download the inspections data we used Sodapy, a client for the Socrata data platform used by the Chicago Data Portal:

In [1]:
!pip install sodapy
In [2]:
import pandas as pd
import numpy as np
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)

# First 50000 results, received as JSON & returned as list of dicts
# Columns converted to snake case, special chars removed,
# dates and location formatted
results = client.get("4ijn-s7e5", limit=50000)

# Convert to pandas DataFrame
inspections = pd.DataFrame.from_records(results)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.

We determined the dataset identifier "4ijn-s7e5" for the client.get() method based upon the dataset's URL on the Chicago Data Portal.

Conveniently, Sodapy converts all columns to snake case, removes special characters and standardizes date format by default. As Socrata restricts queries to 50,000 entries we then paged through the remainder to access the full dataset:

In [3]:
# Download remaining food inspections (limit 50000 / call)
start = 50000
while results:
    results = client.get("4ijn-s7e5", limit=50000, offset=start)
    inspections = inspections.append(pd.DataFrame.from_records(results))
    start += 50000

After exploring the data and reading through the CDPH's work in R we then applied a series of filters to produce a consistent and usable dataset:

In [4]:
# Remove trailing underscore (left over from sodapy conversion of "License #" column)
inspections.rename(columns={"license_": "license"}, inplace=True)

# Drop rows with missing data
inspections.dropna(subset=["inspection_date", "license", "latitude", "longitude"], inplace=True)

# Drop duplicates
inspections.drop_duplicates("inspection_id", inplace=True)

# Drop "0" licenses
inspections = inspections[inspections.license != "0"]

# Only consider successful inspections
inspections = inspections[~inspections.results.isin(["Out of Business", "Business Not Located", "No Entry"])]

For later calculations we converted location data to floats:

In [5]:
# Convert latitude & longitude to floats
inspections.latitude = inspections.latitude.astype(float)
inspections.longitude = inspections.longitude.astype(float)

Depending on risk level all facilities are subject to canvass inspections at an interval of six months to two years. In addition to canvass-type inspections, a facility may also be inspected to establish an initial license, to correct a recent violation or due to public complaints. To ensure that our data is representative of restaurants operating as usual we filtered the dataset to consist only of canvass inspections:

In [6]:
# Only consider canvass inspections (not complaints or re-inspections)
inspections = inspections[inspections.inspection_type == "Canvass"]

We also filtered inspections by facility type, to eliminate the inconsistency of data from sources like hospitals or schools which follow different inspection schedules:

In [7]:
# Only consider restaurants and grocery stores (subject to change)
inspections = inspections[inspections.facility_type.isin(["Restaurant", "Grocery Store"])]

Finally we saved the resulting dataframe as a CSV file, ready for later use:

In [8]:
import os.path
root_path = os.path.dirname(os.getcwd())

# Save result
inspections.to_csv(os.path.join(root_path, "DATA/food_inspections_demo.csv"), index=False)

2.2. Preparation of Remaining Socrata Data

The process of downloading and filtering business licenses, garbage cart requests, sanitation complaints and crimes was much the same as with food inspections - for specifics see the CODE folder of our project repository.

In brief:

  • All datasets were formatted by sodapy and filtered to remove duplicates and missing data.
  • Crimes were filtered to include only burglaries since 2010 to reduce size and ensure consistency.
  • Garbage cart requests and sanitation complaints were filtered to include only completed or open requests.
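The filtering steps above follow the same pandas pattern for each dataset. As a sketch, the burglary filter might look like the following once a crimes frame has been downloaded (toy data here; the column names primary_type and date follow the Chicago crimes dataset):

```python
import pandas as pd

# Toy crimes frame standing in for the downloaded dataset
crimes = pd.DataFrame({
    "id": ["1", "2", "2", "3", "4"],
    "primary_type": ["BURGLARY", "BURGLARY", "BURGLARY", "THEFT", "BURGLARY"],
    "date": pd.to_datetime([
        "2012-05-01", "2015-03-09", "2015-03-09", "2016-01-01", "2008-11-30",
    ]),
})

# Drop duplicate records, as with the inspections data
crimes = crimes.drop_duplicates("id")

# Keep only burglaries since 2010 (for size and consistency)
crimes = crimes[(crimes.primary_type == "BURGLARY") & (crimes.date >= "2010-01-01")]
```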

2.3. Preparation of Weather Data

The process of pulling the weather data from the Dark Sky API was straightforward and can be found in notebook 16 in the CODE folder of our project repository. To eliminate redundant API calls, the script first imports any existing weather data and checks it for records before requesting from the API.

Each weather record consists of an inspection ID, a date, the precipitation intensity, the maximum temperature, the wind speed and the humidity on that date. As the Chicago region only measures some twenty miles across, we used a single location in central Chicago for all weather records.
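The cache-before-request pattern described above can be sketched as follows. Here fetch_from_api is a hypothetical stand-in for the Dark Sky call, returning a fixed record so the control flow is easy to follow:

```python
import pandas as pd

def fetch_from_api(inspection_id, date):
    """Hypothetical stand-in for a Dark Sky API request."""
    return {"inspection_id": inspection_id, "date": date,
            "precip_intensity": 0.0, "temperature_max": 75.0,
            "wind_speed": 8.0, "humidity": 0.6}

def get_weather(inspection_id, date, cache):
    """Return a cached weather record if present, otherwise fetch and cache it."""
    hit = cache[cache.inspection_id == inspection_id]
    if not hit.empty:
        return hit.iloc[0].to_dict()
    record = fetch_from_api(inspection_id, date)
    cache.loc[len(cache)] = pd.Series(record)
    return record

# Start with an empty cache; the second call is served without an API request
cache = pd.DataFrame(columns=["inspection_id", "date", "precip_intensity",
                              "temperature_max", "wind_speed", "humidity"])
first = get_weather("2184496", "2017-06-01", cache)
second = get_weather("2184496", "2017-06-01", cache)
```

Persisting the cache frame to CSV between runs is what makes the script restartable without re-spending API quota.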

2.4. Calculation of Violations Data

Each entry in the violations column of the food inspections dataset is made up of a number of violation/comment pairs joined into a string:

In [9]:

Although this is a rich source of data, the format makes it difficult to perform operations based on specific violations (e.g. charting the distribution of critical violations).

To make violations data more accessible for analysis we split each entry into a series of columns representing the presence of each violation with a binary value:

In [10]:
# Split a violations string into a row of binary values for each violation
def split_violations(violations):
    values_row = pd.Series([], dtype=float)
    if type(violations) == str:
        for violation in violations.split(' | '):
            index = "v" + violation.split('.')[0]
            values_row[index] = 1
    return values_row

# Calculate violation values (5 mins), set missing violations to 0
values_data = inspections.violations.apply(split_violations).fillna(0)

# Generate column names for each violation severity
critical_columns = [("v" + str(num)) for num in range(1, 15)]
serious_columns = [("v" + str(num)) for num in range(15, 30)]
minor_columns = [("v" + str(num)) for num in range(30, 45)]

# Create complete list of column names
columns = critical_columns + serious_columns + minor_columns

# Create dataframe using column names, violation data and inspection ID
values = pd.DataFrame(values_data, columns=columns)
values['inspection_id'] = inspections.inspection_id
In [11]:
# Display selection of values dataframe
v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 ... v_37 v_38 v_39 v_40 v_41 v_42 v_43 v_44 v_70 inspection_id
140 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184496
147 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184490
149 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184487
159 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184468
169 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184453
185 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184424
189 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2184420

7 rows × 46 columns

While not especially human-readable, this format allows us to easily access specific violation data, sum sets of violations and train models that require numeric or binary features.

Next we created a separate dataframe for critical, serious and minor violation counts by pairing each inspection ID with the sum of the appropriate subset of violation values:

In [12]:
# Count violations
counts = pd.DataFrame({
    "critical_count": values[critical_columns].sum(axis=1),
    "serious_count": values[serious_columns].sum(axis=1),
    "minor_count": values[minor_columns].sum(axis=1)
})

counts['inspection_id'] = inspections.inspection_id

# Display selection of sums dataframe
critical_count serious_count minor_count inspection_id
140 14.0 15.0 15.0 2184496
147 14.0 15.0 15.0 2184490
149 14.0 15.0 15.0 2184487
159 14.0 14.0 15.0 2184468
169 14.0 15.0 15.0 2184453
185 12.0 14.0 15.0 2184424
189 14.0 15.0 15.0 2184420

Lastly, we saved both datasets for later use:

In [13]:
# Save violation values and counts
values.to_csv(os.path.join(root_path, "DATA/violation_values_demo.csv"), index=False)
counts.to_csv(os.path.join(root_path, "DATA/violation_counts_demo.csv"), index=False)

3. Exploratory Analysis

To familiarize ourselves with the data and to inform feature selection and model design, we then took some time to explore and visualize it. During this period we investigated the spatial distribution of inspections, the comments accompanying violations, the relative distribution of violations and the role of past inspection outcomes as predictors. We also looked at facility inspection statistics to assess the economic potential of violation forecasting.

3.1. Inspections Map

To improve our understanding of the inspections dataset and its relation to other location-based data, we used folium to generate an inspections heat map:

In [14]:
!pip install folium
In [15]:
# Import necessary packages
import folium
from folium import plugins
from IPython.display import HTML

%matplotlib inline
In [16]:
# Generate map
m = folium.Map([41.8600, -87.6298], zoom_start=10)

# Convert to (n, 2) nd-array format for heatmap
inspections_arr = inspections.sample(20000)[["latitude", "longitude"]].values

# Plot heatmap
m.add_child(plugins.HeatMap(inspections_arr.tolist(), radius=10))