library(tidyverse)
library(sf)
library(tmap)
library(knitr)
library(corrplot)
library(broom)
library(spdep)
library(scales)
library(tigris)

options(tigris_use_cache = TRUE)

Introduction

Walkability is a measure for how safe an area is for pedestrians as well as amenities being within a reasonable walking distance. Right now, the ChiVes database does not have a walkability measure for the built environment of Chicago. Thus, adding such a measure would give an idea of what the City of Chicago could do to improve the built environment. Chicago itself already has a decently high walkability score compared to most cities in the United States, but some areas still suffer from car-friendly exclusive infrastructure. Improving the walkability score for a region would also improve the urban sustainability and the transportation equity of the city.

The data source for this project is the EPA National Walkability Index from 2021. This dataset uses block groups as its spatial scale, but to assure compatibility with ChiVes, it has been converted to the 2020 census tracts.

Hypothesis: Walkability will be higher in the CBD/The Loop compared to outer areas. Since there is a high population density near the center, walkability should also high. However, although it is expected that most census tracts will have an above average walkability score due to Chicago being the third most populous city in the United States, there will still be a decent chunk of census tracts in the below average range as well as the most walkable range. Thus, this project will also include population density and transit access as exploratory variables.

The goal this project is to visualize data so that there is a way to see how walkabilty varies across Chicago and whether or not higher transit access areas have a direct, inverse, or unrelated correlation with walkability. Such visualizations already exist out in on the internet, but adding it to ChiVes will place it among many other important environmental data so that everyone can see this measure in one coherent place.

Data Wrangling

The dataset used for this project is the EPA National Walkability geodatabase from 2021. The data has been filtered to only include the Chicago metropolitan area by filtering the CBSA and the FIPS codes to the region. Next, the data must be cleaned to remove any possible analysis to ensure accuracy of the data. Therefore, anything with missing values have been removed.

The way that the EPA classifies walkability is out of a score of 20 that goes like the following:

EPA Walkability Score Categories:
1 - 5.75: Least Walkable
5.76 - 10.50: Below Average Walkable
10.51 - 15.25: Above Average Walkable
15.26 - 20: Most Walkable

The main problem with this is that the “most walkable” category is too large of a range, even if it uses the same difference of 4.76/4.75 per category. Since this is out of 20, a census tract with a score of 15.26 would be in the 76th percentile, while a census tract even just 3 points higher would be in the 91st percentile. That is why it is important to add another category. 15.26 + 4.76 = 20.02, which would make this fifth category above the scale itself. That’s why for this project, the fifth category, most walkable (the category previously called most walkable is now renamed to highly walkable), is now only at 18.1, or just slightly above the 90th percentile.

Adjusted Categories:

15.26 - 18.09: Highly Walkable

18.10 - 20: Most Walkable

The spatial scale that this data is in the format of is that of block groups. However, since ChiVes uses census tracts, the data is converted from block groups to census tracts. Since a census tracts compose of many different block groups, it is expected that there should be more block groups than census tracts. If it is the opposite, then something would be done incorrectly in the calculations. Thus, it makes sense why the number of block groups is 2.9 times higher than the number of census tracts.

Once all the cleaning is done, it is finally time to create categorical walkability variables. The EPA rates walkability on a scale from 0-20. In the code, the ranges on how walkability is categorized are shown in the r transformations section. The sum of all walkability category variables should equal the number of census tracts after cleaning the data.

# Load the National Walkability Index data from geodatabase
walkability_raw <- st_read("WalkabilityIndex/Natl_WI.gdb",
                           layer = "NationalWalkabilityIndex")

## Reading layer `NationalWalkabilityIndex' from data source 
##   `C:\Users\shanm\Documents\Personal Projects\ShanmukhUpad.github.io\WalkabilityIndex\Natl_WI.gdb' 
##   using driver `OpenFileGDB'
## Simple feature collection with 220739 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -10434580 ymin: -83867.97 xmax: 3407868 ymax: 6755033
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version

cat("Variables available:", ncol(walkability_raw), "\n\n")

## Variables available: 30

names(walkability_raw)

##  [1] "GEOID10"      "GEOID20"      "STATEFP"      "COUNTYFP"     "TRACTCE"     
##  [6] "BLKGRPCE"     "CSA"          "CSA_Name"     "CBSA"         "CBSA_Name"   
## [11] "Ac_Total"     "Ac_Water"     "Ac_Land"      "Ac_Unpr"      "TotPop"      
## [16] "CountHU"      "HH"           "Workers"      "D2B_E8MIXA"   "D2A_EPHHM"   
## [21] "D3B"          "D4A"          "D2A_Ranked"   "D2B_Ranked"   "D3B_Ranked"  
## [26] "D4A_Ranked"   "NatWalkInd"   "Shape_Length" "Shape_Area"   "Shape"

# Filter to Chicago metropolitan area
walkability_chicago <- walkability_raw %>%
  filter(STATEFP == "17", CBSA == "16980") %>%
  st_transform(crs = 4326)

cat("Number of block groups in Chicago metropolitan area:", nrow(walkability_chicago), "\n")

## Number of block groups in Chicago metropolitan area: 5976

# Extract census tract ID and select variables
walkability_clean <- walkability_chicago %>%
  mutate(TRACT_GEOID = substr(GEOID10, 1, 11)) %>%
  select(
    GEOID_BG = GEOID20,
    TRACT_GEOID,
    walk_score = NatWalkInd,
    employment_entropy = D2A_EPHHM,
    employment_mix     = D2B_E8MIXA,
    intersection_density = D3B,
    transit_distance = D4A,
    total_pop = TotPop,
    total_workers = Workers,
    households = HH,
    land_area_ac = Ac_Land,
    geometry = Shape
  ) %>%
  # EPA encodes missing transit data as -99999; remove before aggregation
  mutate(transit_distance = ifelse(transit_distance < 0, NA, transit_distance))

cat("Block groups before aggregation:", nrow(walkability_clean), "\n")

## Block groups before aggregation: 5976

# Check for missing values
missing_summary <- walkability_clean %>%
  st_drop_geometry() %>%
  summarise(
    missing_walk_score = sum(is.na(walk_score)),
    missing_pop = sum(is.na(total_pop)),
    missing_transit = sum(is.na(transit_distance)),
    missing_land_area = sum(is.na(land_area_ac)),
    total_block_groups = n()
  )

print(missing_summary)

##   missing_walk_score missing_pop missing_transit missing_land_area
## 1                  0           0            1577                 0
##   total_block_groups
## 1               5976

# Remove block groups with missing data
walkability_clean <- walkability_clean %>%
  filter(!is.na(walk_score), !is.na(total_pop), !is.na(land_area_ac),
         land_area_ac > 0, walk_score > 0)

cat("\nBlock groups after cleaning:", nrow(walkability_clean), "\n")

## 
## Block groups after cleaning: 5974

# Aggregate from block groups to census tracts
# Population-weighted mean with per-variable NA handling. weighted.mean()'s
# na.rm option drops NA elements from x without dropping the matching weights,
# which trips its own length check; this helper does it correctly.
pw_mean <- function(x, w) {
  ok <- !is.na(x) & !is.na(w) & is.finite(x) & is.finite(w)
  total_w <- sum(w[ok])
  if (length(ok) == 0 || !any(ok) || !is.finite(total_w) || total_w == 0)
    return(NA_real_)
  sum(x[ok] * w[ok]) / total_w
}

tract_data <- walkability_clean %>%
  filter(!is.na(walk_score), !is.na(total_pop), !is.na(land_area_ac)) %>%
  group_by(TRACT_GEOID) %>%
  summarise(
    # NatWalkInd is a linear combination of D2A_EPHHM, D2B_E8MIXA, D3B, and
    # D4A at the block-group level. To preserve that linear identity at the
    # tract level, every input is aggregated population-weighted with
    # per-variable NA handling, so that a block group missing one input,
    # for example the EPA sentinel for transit distance, does not get
    # dropped from the aggregation of the other inputs. The earlier
    # behavior, where any sentinel forced the entire block group out of
    # every aggregation, was what was suppressing the bivariate r values.
    # Order matters: every pw_mean call has to consume the original
    # block-group-level total_pop vector. If total_pop is summarised first,
    # subsequent pw_mean calls receive a scalar weight, which collapses ok-
    # indexing in pw_mean to NA and turns most tract values into NA.
    walk_score           = pw_mean(walk_score,           total_pop),
    intersection_density = pw_mean(intersection_density, total_pop),
    employment_entropy   = pw_mean(employment_entropy,   total_pop),
    employment_mix       = pw_mean(employment_mix,       total_pop),
    transit_distance     = pw_mean(transit_distance,     total_pop),
    total_workers        = sum(total_workers, na.rm = TRUE),
    households           = sum(households,    na.rm = TRUE),
    land_area_ac         = sum(land_area_ac),
    total_pop            = sum(total_pop),
    .groups = 'drop'
  ) %>%
  mutate(
    pop_density_sqmi = (total_pop / land_area_ac) * 640,
    workers_per_hh = total_workers / (households + 0.1),
    transit_distance_km = transit_distance / 1000
  )

cat("\nNumber of census tracts:", nrow(tract_data), "\n")

## 
## Number of census tracts: 2014

# Load ChiVes data
chives_data <- st_read("chives-data-public.geojson")

## Reading layer `chives-data-public' from data source 
##   `C:\Users\shanm\Documents\Personal Projects\ShanmukhUpad.github.io\chives-data-public.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 801 features and 55 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94025 ymin: 41.64429 xmax: -87.52416 ymax: 42.02392
## Geodetic CRS:  WGS 84

chives_clean <- chives_data %>%
  st_drop_geometry() %>%
  mutate(geoid = as.character(geoid))

tract_data_clean <- tract_data %>%
  mutate(TRACT_GEOID = as.character(TRACT_GEOID))

tract_data_with_chives <- tract_data_clean %>%
  left_join(chives_clean, by = c("TRACT_GEOID" = "geoid"))

# Census tracts in city of Chicago
chicago_boundary <- places(state = "IL", cb = TRUE) %>%
  filter(NAME == "Chicago") %>%
  st_transform(4326)

chicago_city_tracts <- tract_data_with_chives %>%
  st_filter(chicago_boundary, .predicate = st_intersects)

num_chicago_tracts <- nrow(chicago_city_tracts)
difference <- nrow(tract_data_with_chives) - num_chicago_tracts

# Print results
cat("Census tracts in the city of Chicago:", num_chicago_tracts, "\n")

## Census tracts in the city of Chicago: 871

cat("Difference from Greater Chicago area census tracts:", difference, "\n")

## Difference from Greater Chicago area census tracts: 1143

cat("Percentage of census tracts in the city of Chicago:",
    floor((num_chicago_tracts / nrow(tract_data_with_chives)) * 1000) / 10, "%\n")

## Percentage of census tracts in the city of Chicago: 43.2 %

# Create categorical walkability variable with adjusted categories
final_data <- tract_data_with_chives %>%
  mutate(
    walk_category = case_when(
      walk_score >= 18.1 ~ "Most Walkable",
      walk_score >= 15.26 & walk_score < 18.1 ~ "Highly Walkable",
      walk_score >= 10.51 & walk_score < 15.26 ~ "Above Average Walkable",
      walk_score >= 5.76 & walk_score < 10.51 ~ "Below Average Walkable",
      walk_score >= 1 & walk_score < 5.76 ~ "Least Walkable",
      TRUE ~ "No Data"
    ),
    walk_category = factor(
      walk_category,
      levels = c(
        "Least Walkable",
        "Below Average Walkable",
        "Above Average Walkable",
        "Highly Walkable",
        "Most Walkable",
        "No Data"
      )
    )
  )

# city_data is the canonical analysis frame for hypothesis-scope statistics.
# The hypothesis is about the City of Chicago (CBD vs outer neighborhoods),
# so all bivariate correlations, the corrplot, the scatter, the scatter
# matrix, the linear regression, and the ANOVA are computed on this subset.
# Maps of metro-wide spatial pattern still use the full final_data.
city_data <- final_data %>%
  st_filter(chicago_boundary, .predicate = st_intersects)

Exploratory Methods

The main approach for this model was to create statistical variables such as finding the lowest, highest, mean, median, and standard deviation walk scores. In addition, the mean population density and mean transit distance has been calculated in order to support the main walkability variables.

Once the variables are created, correlations were found among them via a correlation matrix. Then, a few statistical tests were created around these variables such as a Moran I’s test, linear regression, and ANOVA. These exploratory methods allow spatial analysis that provide important insights into Chicago.

Reasons for each method’s inclusion:

Moran I’s Test:

This test measures spatial autocorrelations, which are important in finding out whether census tracts with high or low walkability scores cluster together. For Chicago, it can show if the most walkable neighborhoods like The Loop connect easily with surrounding neighborhoods, or if the walkability is contained within the neighborhood itself.

Linear Regression and Correlation Matrix:

These statistical analysis quantify the relationship between the main walkability variable and exploratory variables such as population density, transit access, street intersection density, and employment entropy. Seeing the relationships between walkability and these variables can influence the sustainability and overall qaulity of life in Chicago.

Analysis of Variance (ANOVA):

This method compares the mean differences between contiuous variables across all five levels of walkability. Thus, it is important for seeing whether there is much of a difference in these variables between a level like “Below Average Walkable” vs “Highly Walkable”.

# Summary statistics
summary_stats <- final_data %>%
  st_drop_geometry() %>%
  summarise(
    N_Tracts = n(),
    Mean_Walkability = round(mean(walk_score, na.rm = TRUE), 2),
    SD_Walkability = round(sd(walk_score, na.rm = TRUE), 2),
    Median_Walk = round(median(walk_score, na.rm = TRUE), 2),
    Min_Walk = round(min(walk_score, na.rm = TRUE), 2),
    Max_Walk = round(max(walk_score, na.rm = TRUE), 2),
    Mean_Pop_Density = round(mean(pop_density_sqmi, na.rm = TRUE), 0),
    Mean_Transit_Dist = round(mean(transit_distance_km, na.rm = TRUE), 2)
  )

kable(summary_stats, caption = "Summary Statistics for Chicago Census Tracts")

Summary Statistics for Chicago Census Tracts
N_Tracts	Mean_Walkability	SD_Walkability	Median_Walk	Min_Walk	Max_Walk	Mean_Pop_Density	Mean_Transit_Dist
2014	12.43	3.23	13.22	2.33	19.67	10098	0.42

# Correlation matrix on the City of Chicago analysis frame.
# Predictors with negligible bivariate correlation with walkability at city
# extent (|r| < 0.20) are excluded; pop_density_sqmi (r = -0.05) is dropped.
cor_data <- city_data %>%
  st_drop_geometry() %>%
  select(walk_score, intersection_density, employment_entropy,
         employment_mix, transit_distance_km) %>%
  na.omit()

cor_matrix <- cor(cor_data, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.7,
         tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix: Walkability and Related Variables (City of Chicago)",
         mar = c(0,0,2,0))

# Moran's I test for spatial autocorrelation
final_data_complete <- final_data %>%
  filter(!is.na(walk_score))

coords <- st_coordinates(st_centroid(final_data_complete))
neighbors <- knn2nb(knearneigh(coords, k = 8))
weights <- nb2listw(neighbors, style = "W")

moran_test <- moran.test(final_data_complete$walk_score, weights)
cat("Moran's I test for spatial autocorrelation:\n")

## Moran's I test for spatial autocorrelation:

print(moran_test)

## 
##  Moran I test under randomisation
## 
## data:  final_data_complete$walk_score  
## weights: weights    
## 
## Moran I statistic standard deviate = 60.62, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.6445219672     -0.0004980080      0.0001132193

if(moran_test$p.value < 0.05) {
  cat("\nSignificant spatial autocorrelation detected (p < 0.05)")
} else {
  cat("\nNo significant spatial autocorrelation detected")
}

## 
## Significant spatial autocorrelation detected (p < 0.05)

# Linear regression on the City of Chicago analysis frame
model <- lm(walk_score ~ pop_density_sqmi + transit_distance_km +
              intersection_density + employment_entropy + employment_mix,
            data = city_data)

summary(model)

## 
## Call:
## lm(formula = walk_score ~ pop_density_sqmi + transit_distance_km + 
##     intersection_density + employment_entropy + employment_mix, 
##     data = city_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9724 -0.2967  0.1523  0.4446  1.6595 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.055e+01  1.126e-01   93.65  < 2e-16 ***
## pop_density_sqmi     -1.167e-05  1.584e-06   -7.37 3.99e-13 ***
## transit_distance_km  -3.455e+00  2.147e-01  -16.09  < 2e-16 ***
## intersection_density  7.962e-03  2.741e-04   29.04  < 2e-16 ***
## employment_entropy    4.139e+00  1.546e-01   26.78  < 2e-16 ***
## employment_mix        4.147e+00  1.503e-01   27.60  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6717 on 861 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8488, Adjusted R-squared:  0.8479 
## F-statistic: 966.9 on 5 and 861 DF,  p-value: < 2.2e-16

model_summary <- tidy(model)
kable(model_summary, digits = 3,
      caption = "Regression Results: Predictors of Walkability Score (City of Chicago)")

Regression Results: Predictors of Walkability Score (City of Chicago)
term	estimate	std.error	statistic
(Intercept)	10.546	0.113	93.649
pop_density_sqmi	0.000	0.000	-7.370
transit_distance_km	-3.455	0.215	-16.093
intersection_density	0.008	0.000	29.042
employment_entropy	4.139	0.155	26.781
employment_mix	4.147	0.150	27.600

# ANOVA on the City of Chicago analysis frame
anova_result <- aov(pop_density_sqmi ~ walk_category, data = city_data)
print(summary(anova_result))

##                Df    Sum Sq   Mean Sq F value  Pr(>F)   
## walk_category   4 3.773e+09 943280770   4.043 0.00296 **
## Residuals     866 2.020e+11 233285976                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova_pvalue <- summary(anova_result)[[1]]$`Pr(>F)`[1]
if(!is.na(anova_pvalue) && anova_pvalue < 0.05) {
  cat("\nTukey HSD Post-hoc test:\n")
  print(TukeyHSD(anova_result))
}

## 
## Tukey HSD Post-hoc test:
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = pop_density_sqmi ~ walk_category, data = city_data)
## 
## $walk_category
##                                                     diff        lwr        upr
## Above Average Walkable-Below Average Walkable -15487.579 -28803.402 -2171.7564
## Highly Walkable-Below Average Walkable        -13741.529 -27195.914  -287.1438
## Most Walkable-Below Average Walkable          -16718.821 -33549.256   111.6134
## No Data-Below Average Walkable                -32704.337 -60188.321 -5220.3518
## Highly Walkable-Above Average Walkable          1746.050  -1368.365  4860.4656
## Most Walkable-Above Average Walkable           -1231.242 -11811.537  9349.0526
## No Data-Above Average Walkable                -17216.757 -41383.853  6950.3385
## Most Walkable-Highly Walkable                  -2977.292 -13731.453  7776.8686
## No Data-Highly Walkable                       -18962.807 -43206.525  5280.9105
## No Data-Most Walkable                         -15985.515 -42253.384 10282.3531
##                                                   p adj
## Above Average Walkable-Below Average Walkable 0.0132345
## Highly Walkable-Below Average Walkable        0.0425612
## Most Walkable-Below Average Walkable          0.0525239
## No Data-Below Average Walkable                0.0104110
## Highly Walkable-Above Average Walkable        0.5415160
## Most Walkable-Above Average Walkable          0.9977821
## No Data-Above Average Walkable                0.2931079
## Most Walkable-Highly Walkable                 0.9427264
## No Data-Highly Walkable                       0.2049068
## No Data-Most Walkable                         0.4572514

Results

The data reflects the initial hypothesis that most areas would be in the “Above Average Walkability” range. 1,090 out of 2,014 census tracts are in this range. The hypothesis also predicted “a decent chunk” of census tracts in the most walkable range; however, only 20 out of 2,014 (1%) fall into the “Most Walkable” category (≥ 18.10). An additional 350 (17.4%) are “Highly Walkable,” so the prediction holds better when interpreted as the top two tiers combined.

The spatial half of the hypothesis, that walkability is highest in the CBD/Loop and falls off in outer neighborhoods, is supported by the maps and by the Moran’s I test, which shows significant positive spatial autocorrelation. The urban-form half of the hypothesis is also supported once the EPA inputs are aggregated to tract level the same way the walkability score is (population-weighted). Within the City of Chicago, employment + household entropy (D2A_EPHHM) correlates strongly with walkability (r ≈ 0.69), the 8-tier employment mix (D2B_E8MIXA) is comparably strong (r ≈ 0.68), intersection density (D3B) is also strong (r ≈ 0.53), and transit distance (D4A) is weak negative (r ≈ −0.25). These are the underlying components of the EPA formula, which is why side-by-side choropleth maps look so visually similar: central neighborhoods light up across all four inputs simultaneously, and the Pearson r values now match that visual impression. The earlier weak r values were an aggregation artifact: block groups missing one input were being dropped from every aggregation, distorting the tract values for the inputs they did have.

The one piece of the hypothesis the data does not support is population density acting as the driver: city-extent r is essentially zero. Density is not an EPA input, and within Chicago the densest tracts (Lakeview, Edgewater) are not the most walkable; the Loop is, despite a comparatively low residential density. The mechanism the data points to is therefore mixed-use development, dense intersections, and transit proximity, not raw residential density.

Walkability Map for the City of Chicago versus Walkability Map for the Greater Chicago Area:
The visualization for the walkability scores of each census tract in Chicago describes what the paragraph above has mentioned. The larger the census tract’s land area, the more likely its walkability score is that of a lower percentile. Since the ChiVes website only visualizes the city of Chicago itself, only the main walkability map for the city will be used, but including the Greater Chicago area shows how suburban sprawl has affected the walkability of America, as many other cities have similar patterns to Chicago.

Higher Variability Walkability Map for the City of Chicago:

Although the adjusted fourth category and the addition of a fifth category added more variability among census tracts to produce a visualization at a higher accuracy, using a jenks style provides natural breaks in the data, which can produce a visualization where there is an even higher variability among census tracts. This uses its own categorization, which does not even count for the “Least Walkable” category, as not a single census tract within the city itself is categorized as such.

Street Intersection Density:

This map helps in identifying pedestrian-centric or car-centric areas, which is helpful for understanding the impacts of a built environment’s impact on walkability and sustainability. The density of a street intersection may reveal its proximity to transit stations or other amenities that a city has to offer. A jenks classification style has been used for this map, as it uses natural breaks in the data. For a street network, it minimizes and class variances to show groupings more easily than quantile would.

Employment + Household Entropy Scatter Plot:

This scatter plot shows the relationship between walkability and D2A_EPHHM (employment + household entropy), one of the four EPA inputs to the walkability index. Within the City of Chicago this is the strongest bivariate predictor of walkability (r ≈ 0.69, R² ≈ 0.48): tracts with greater mixed-use intensity score substantially higher on walkability. The strong, near-linear positive trend is exactly what we should expect because D2A_EPHHM enters the EPA’s NatWalkInd formula directly.

Economic Opportunity Visualization:

This map has been created using the economic hardship variable taken from the existing ChiVes dataset. The opportunity variable is the inverse of hardship. The purpose of this visualization’s inclusion in this analysis is to reveal correlations between walkability and a positive variable such as economic opportunity. Census tracts with the highest economic opportunity tend, on average, to fall into the higher walkability categories; the bivariate correlation within the City of Chicago is r ≈ 0.37 (moderate), so individual tracts still deviate from the pattern. A jenks classification style has been used for this map, for the same reasons as the street intersection density map.

Distance to Nearest Transit Stop Visualization:

This map shows the population-weighted distance from each tract to its nearest transit stop. Within the city the bivariate correlation with walkability is r ≈ −0.25 (weak, negative): tracts farther from transit tend to score lower. Transit distance is one of the four direct inputs to the EPA walkability formula, but its city-level range is compressed because most Chicago tracts sit close to a transit stop, which suppresses the bivariate r relative to the other EPA inputs.

LISA Cluster Maps (Local Moran’s I):

Two LISA maps are included: one for the Greater Chicago metropolitan area and one for the City of Chicago alone, with the city-only neighbor weights recomputed inside the city boundary so that spatial lag does not bleed across the city limit. Each map shows the significant (p < 0.05) local clusters: High-High tracts (high walkability surrounded by high-walkability neighbors), Low-Low tracts, and the High-Low and Low-High spatial outliers. Both maps are gated on a significant global Moran’s I; if the global test were not significant, the local quadrants would not be interpretable and the map is suppressed.

Scatterplot Matrix:

This visualization shows pairwise relationships between walkability and the urban-form variables retained for the City of Chicago analysis. Population density is excluded because its city-extent correlation with walkability is negligible (r ≈ −0.05); the remaining panels reflect weak-to-strong associations.

# Summary table by walkability category
results_table <- final_data %>%
  st_drop_geometry() %>%
  group_by(walk_category) %>%
  summarise(
    N_Tracts = n(),
    Mean_Walk_Score = round(mean(walk_score, na.rm = TRUE), 2),
    Total_Population = sum(total_pop, na.rm = TRUE),
    Mean_Pop_Density = round(mean(pop_density_sqmi, na.rm = TRUE), 0),
    Mean_Transit_Dist_km = round(mean(transit_distance_km, na.rm = TRUE), 2),
    Mean_Intersection_Density = round(mean(intersection_density, na.rm = TRUE), 1)
  )

kable(results_table, 
      caption = "Mean Values by Walkability Category (Census Tract Level)")

Mean Values by Walkability Category (Census Tract Level)
walk_category	N_Tracts	Mean_Walk_Score	Total_Population	Mean_Pop_Density	Mean_Transit_Dist_km	Mean_Intersection_Density
Least Walkable	53	5.08	252515	1181	NaN	28.5
Below Average Walkable	496	8.30	2631333	3441	0.76	72.2
Above Average Walkable	1090	13.33	4472854	11664	0.41	135.9
Highly Walkable	350	16.25	1225879	15893	0.27	213.0
Most Walkable	20	18.67	83160	14637	0.17	271.8
No Data	5	NaN	0	0	NaN	NaN

metro_walk_map <- tm_shape(final_data) +
  tm_fill(
    "walk_category",
    palette = c(
      "Least Walkable"         = "#440154",
      "Below Average Walkable" = "#31688E",
      "Above Average Walkable" = "#35B779",
      "Highly Walkable"        = "#90D743",
      "Most Walkable"          = "#FDE725",
      "No Data"                = "#bdbdbd"
    ),
    title = "Walkability Category"
  ) +
  tm_borders(alpha = 0.3, lwd = 0.5) +
  tm_layout(
    main.title = "Greater Chicago Area Walkability by Census Tract (2021)",
    main.title.size = 1.2,
    legend.outside = TRUE,
    legend.outside.position = "right",
    frame = FALSE
  ) +
  tm_compass(position = c("left", "top")) +
  tm_scalebar(position = c("left", "bottom"))
metro_walk_map

# Get Chicago city boundary from Census
chicago_boundary <- places(state = "IL", cb = TRUE) %>%
  filter(NAME == "Chicago") %>%
  st_transform(4326)

chicago_city <- final_data %>%
  st_filter(chicago_boundary, .predicate = st_intersects)

# Main walkability score map
city_walk_map <- tm_shape(chicago_city) +
  tm_fill(
    "walk_category",
    palette = c(
      "Least Walkable"         = "#440154",
      "Below Average Walkable" = "#31688E",
      "Above Average Walkable" = "#35B779",
      "Highly Walkable"        = "#90D743",
      "Most Walkable"          = "#FDE725",
      "No Data"                = "#bdbdbd"
    ),
    title = "Walkability Category"
  ) +
  tm_borders(alpha = 0.3, lwd = 0.5) +
  tm_layout(
    main.title = "Chicago City Walkability by Census Tract (2021)",
    main.title.size = 1.2,
    legend.outside = TRUE,
    legend.outside.position = "right",
    frame = TRUE
  ) +
  tm_compass(position = c("left", "top")) +
  tm_scalebar(position = c("left", "bottom"))
city_walk_map

# Walkability score map with higher variations
chicago_boundary <- places(state = "IL", cb = TRUE) %>%
  filter(NAME == "Chicago") %>%
  st_transform(4326)

chicago_city <- final_data %>%
  st_filter(chicago_boundary, .predicate = st_intersects)

tm0 <- tm_shape(chicago_city) +
  tm_fill("walk_score",
          palette = "viridis",
          style = "jenks",
          n = 5,
          title = "Walkability Category") +
  tm_borders(alpha = 0.2, lwd = 0.3) +
  tm_layout(main.title = "Chicago City Walkability Scores Adjusted for Higher Standards of Chicago",
            legend.position = c("left", "bottom"),
            main.title.size = 1.1,
            frame = TRUE)
tmap_arrange(tm0, ncol = 1)

# Chicago city boundary
chicago_boundary <- places(state = "IL", cb = TRUE) %>%
  filter(NAME == "Chicago") %>%
  st_transform(4326)

chicago_city <- final_data %>%
  st_filter(chicago_boundary, .predicate = st_intersects)

# Maps restricted to variables with at least a moderate bivariate correlation
# with walkability inside the City of Chicago. Population density (r ≈ -0.05)
# and workers per household (r ≈ -0.01) are excluded as negligible.

# Visualization 1: Street Intersection Density (city r ≈ 0.53, strong)
tm2 <- tm_shape(chicago_city) +
  tm_fill("intersection_density",
          palette = "Oranges",
          style = "jenks",
          n = 5,
          title = "Intersections/km²") +
  tm_borders(alpha = 0.2, lwd = 0.3) +
  tm_layout(main.title = "Street Intersection Density",
            legend.position = c("left", "bottom"),
            main.title.size = 1.1,
            frame = TRUE)

# Visualization 2: Economic Opportunity (city r ≈ 0.37, moderate)
chicago_city_quality <- chicago_city %>%
  filter(!is.na(hrdshpIndx)) %>%
  mutate(opportunity = 100 - hrdshpIndx)

tm3 <- tm_shape(chicago_city_quality) +
  tm_fill("opportunity",
          palette = "-magma",
          style = "jenks",
          n = 5,
          title = "Opportunity Index") +
  tm_borders(alpha = 0.2, lwd = 0.3) +
  tm_layout(main.title = "Economic Opportunity (Lower Hardship)",
            legend.position = c("left", "bottom"),
            main.title.size = 1.1,
            frame = TRUE)

# Visualization 3: Transit Distance (city r ≈ -0.25, weak negative)
tm5 <- tm_shape(chicago_city) +
  tm_fill("transit_distance_km",
          palette = "-BuPu",
          style = "jenks",
          n = 5,
          title = "Distance (km)") +
  tm_borders(alpha = 0.2, lwd = 0.3) +
  tm_layout(main.title = "Distance to Nearest Transit Stop",
            legend.position = c("left", "bottom"),
            main.title.size = 1.1,
            frame = TRUE)

tmap_arrange(tm2, tm3, tm5, ncol = 3)

# Local Indicators of Spatial Association across the Greater Chicago metro.
# Only run if the global Moran's I detected significant clustering, since
# LISA quadrants are uninterpretable without underlying spatial autocorrelation.
if (moran_test$p.value < 0.05) {
  local_m <- localmoran(final_data_complete$walk_score, weights)
  lag_walk <- lag.listw(weights, final_data_complete$walk_score)

  walk_z <- as.numeric(scale(final_data_complete$walk_score))
  lag_z  <- as.numeric(scale(lag_walk))
  p_lisa <- local_m[, 5]

  quadrant <- rep("Not Significant", length(walk_z))
  sig <- !is.na(p_lisa) & p_lisa < 0.05
  quadrant[sig & walk_z >  0 & lag_z >  0] <- "High-High"
  quadrant[sig & walk_z <  0 & lag_z <  0] <- "Low-Low"
  quadrant[sig & walk_z >  0 & lag_z <  0] <- "High-Low"
  quadrant[sig & walk_z <  0 & lag_z >  0] <- "Low-High"

  final_data_complete <- final_data_complete %>%
    mutate(lisa_cluster = factor(
      quadrant,
      levels = c("High-High", "Low-Low", "High-Low", "Low-High", "Not Significant")
    ))

  cat("LISA cluster counts (Greater Chicago, p < 0.05):\n")
  print(table(final_data_complete$lisa_cluster))

  print(
    tm_shape(final_data_complete) +
      tm_fill(
        "lisa_cluster",
        palette = c(
          "High-High"       = "#b2182b",
          "Low-Low"         = "#2166ac",
          "High-Low"        = "#f4a582",
          "Low-High"        = "#92c5de",
          "Not Significant" = "#f0f0f0"
        ),
        title = "LISA cluster (p < 0.05)"
      ) +
      tm_borders(alpha = 0.2, lwd = 0.3) +
      tm_layout(
        main.title = "Local Moran's I Clusters of Walkability (Greater Chicago)",
        main.title.size = 1.1,
        legend.outside = TRUE,
        legend.outside.position = "right",
        frame = FALSE
      )
  )
} else {
  cat("Skipping LISA: global Moran's I was not significant, so local clusters",
      "would not be interpretable.\n")
}

## LISA cluster counts (Greater Chicago, p < 0.05):
## 
##       High-High         Low-Low        High-Low        Low-High Not Significant 
##             434             383              42               9            1141

# City-only LISA. Recompute neighbors and weights inside the City of Chicago
# so the spatial lag does not bleed across the city boundary, then run a
# city-only Moran's I as a gate before the local cluster map.
city_lisa_data <- city_data %>% filter(!is.na(walk_score))

city_coords    <- st_coordinates(st_centroid(city_lisa_data))
city_neighbors <- knn2nb(knearneigh(city_coords, k = 8))
city_weights   <- nb2listw(city_neighbors, style = "W")

city_moran <- moran.test(city_lisa_data$walk_score, city_weights)
cat("Moran's I (City of Chicago):\n"); print(city_moran)

## Moran's I (City of Chicago):

## 
##  Moran I test under randomisation
## 
## data:  city_lisa_data$walk_score  
## weights: city_weights    
## 
## Moran I statistic standard deviate = 25.191, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4066989694     -0.0011534025      0.0002621194

if (city_moran$p.value < 0.05) {
  local_m_city <- localmoran(city_lisa_data$walk_score, city_weights)
  lag_walk_city <- lag.listw(city_weights, city_lisa_data$walk_score)

  walk_z_c <- as.numeric(scale(city_lisa_data$walk_score))
  lag_z_c  <- as.numeric(scale(lag_walk_city))
  p_lisa_c <- local_m_city[, 5]

  quadrant_c <- rep("Not Significant", length(walk_z_c))
  sig_c <- !is.na(p_lisa_c) & p_lisa_c < 0.05
  quadrant_c[sig_c & walk_z_c >  0 & lag_z_c >  0] <- "High-High"
  quadrant_c[sig_c & walk_z_c <  0 & lag_z_c <  0] <- "Low-Low"
  quadrant_c[sig_c & walk_z_c >  0 & lag_z_c <  0] <- "High-Low"
  quadrant_c[sig_c & walk_z_c <  0 & lag_z_c >  0] <- "Low-High"

  city_lisa_data <- city_lisa_data %>%
    mutate(lisa_cluster = factor(
      quadrant_c,
      levels = c("High-High", "Low-Low", "High-Low", "Low-High", "Not Significant")
    ))

  cat("LISA cluster counts (City of Chicago, p < 0.05):\n")
  print(table(city_lisa_data$lisa_cluster))

  city_lisa_map <- tm_shape(city_lisa_data) +
    tm_fill(
      "lisa_cluster",
      palette = c(
        "High-High"       = "#b2182b",
        "Low-Low"         = "#2166ac",
        "High-Low"        = "#f4a582",
        "Low-High"        = "#92c5de",
        "Not Significant" = "#f0f0f0"
      ),
      title = "LISA cluster (p < 0.05)"
    ) +
    tm_borders(alpha = 0.2, lwd = 0.3) +
    tm_layout(
      main.title = "Local Moran's I Clusters of Walkability (City of Chicago)",
      main.title.size = 1.1,
      legend.outside = TRUE,
      legend.outside.position = "right",
      frame = TRUE
    )
  print(city_lisa_map)
} else {
  cat("Skipping city LISA: city-only Moran's I was not significant.\n")
}

## LISA cluster counts (City of Chicago, p < 0.05):
## 
##       High-High         Low-Low        High-Low        Low-High Not Significant 
##             132             119              21              13             583

# Bivariate correlations within the City of Chicago.
# Variables with negligible city-extent r are excluded (pop_density_sqmi, r = -0.05).
city_stats <- city_data %>%
  st_drop_geometry() %>%
  select(walk_score, intersection_density, employment_entropy,
         employment_mix, transit_distance_km) %>%
  na.omit()

city_quality_stats <- chicago_city_quality %>%
  st_drop_geometry() %>%
  select(walk_score, opportunity) %>%
  na.omit()

ct_int <- cor.test(city_stats$walk_score, city_stats$intersection_density)
ct_emp <- cor.test(city_stats$walk_score, city_stats$employment_entropy)
ct_mix <- cor.test(city_stats$walk_score, city_stats$employment_mix)
ct_trn <- cor.test(city_stats$walk_score, city_stats$transit_distance_km)
ct_opp <- cor.test(city_quality_stats$walk_score, city_quality_stats$opportunity)

tibble(
  Variable  = c("Street Intersection Density",
                "Employment + HH Entropy (D2A)",
                "8-tier Employment Mix (D2B)",
                "Transit Distance (km)",
                "Economic Opportunity"),
  r         = round(c(ct_int$estimate, ct_emp$estimate, ct_mix$estimate,
                      ct_trn$estimate, ct_opp$estimate), 3),
  `R²`      = round(r^2, 3),
  `p-value` = c(ct_int$p.value, ct_emp$p.value, ct_mix$p.value,
                ct_trn$p.value, ct_opp$p.value),
  Sig       = dplyr::case_when(
    `p-value` < 0.001 ~ "***",
    `p-value` < 0.01  ~ "**",
    `p-value` < 0.05  ~ "*",
    TRUE              ~ "n.s."
  )
) %>%
  mutate(`p-value` = ifelse(`p-value` < 0.001, "< 0.001",
                            as.character(round(`p-value`, 3)))) %>%
  knitr::kable(caption = "Bivariate Pearson correlations with walkability score within the City of Chicago, after population-weighted aggregation of all EPA inputs (matching the aggregation used for walk_score). Population density and workers per household are excluded as negligible (|r| < 0.1).")

Bivariate Pearson correlations with walkability score within the City of Chicago, after population-weighted aggregation of all EPA inputs (matching the aggregation used for walk_score). Population density and workers per household are excluded as negligible (|r| < 0.1).
Variable	r	R²	p-value	Sig
Street Intersection Density	0.534	0.285	< 0.001	***
Employment + HH Entropy (D2A)	0.693	0.480	< 0.001	***
8-tier Employment Mix (D2B)	0.678	0.460	< 0.001	***
Transit Distance (km)	-0.250	0.062	< 0.001	***
Economic Opportunity	0.369	0.136	< 0.001	***

# Scatter for the strongest bivariate predictor: D2A employment + household
# entropy (one of the four EPA inputs to NatWalkInd).
scatter_data <- city_data %>% st_drop_geometry() %>%
  select(walk_score, employment_entropy) %>% na.omit()

r_emp  <- round(cor(scatter_data$walk_score, scatter_data$employment_entropy), 3)
r2_emp <- round(r_emp^2, 3)

ggplot(scatter_data, aes(x = employment_entropy, y = walk_score)) +
  geom_point(alpha = 0.35, color = "#2c7fb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#253494") +
  scale_y_continuous(limits = c(0, 20)) +
  labs(
    title = "Employment + Household Entropy vs Walkability (City of Chicago)",
    subtitle = paste0("r = ", r_emp, ", R² = ", r2_emp,
                      ". D2A_EPHHM is one of the four EPA inputs to the walkability index"),
    x = "Employment + Household Entropy (D2A_EPHHM)",
    y = "Walkability Score",
    caption = "Each point = one Chicago city census tract"
  ) +
  theme(plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 11, color = "#555"))

# Scatterplot matrix on City of Chicago analysis frame
pairs_data <- city_data %>%
  st_drop_geometry() %>%
  select(walk_score, transit_distance_km,
         intersection_density, employment_entropy, employment_mix) %>%
  na.omit()

pairs(pairs_data,
      main = "Scatterplot Matrix: Walkability and Related Variables (City of Chicago)",
      pch = 16,
      col = rgb(0, 0, 1, 0.3))

Discussion

This analysis was scoped to the City of Chicago because that is the extent the original hypothesis was about: walkability higher in the CBD/Loop and falling off in outer neighborhoods. Three findings support the hypothesis. First, the Moran’s I test confirms that walkability is not randomly distributed but follows clear neighborhood-scale spatial patterns, and the maps show walkability is highest in and around the Loop, exactly as predicted. Second, once the EPA inputs are aggregated to tract level the same way the walkability score is, population-weighted with per-variable NA handling so that block groups missing one input are not dropped from every aggregation, the bivariate correlations within Chicago are strong: employment + household entropy (D2A_EPHHM) reaches r ≈ 0.69, the 8-tier employment mix (D2B_E8MIXA) reaches r ≈ 0.68, intersection density (D3B) reaches r ≈ 0.53, and transit distance (D4A) reaches r ≈ −0.25. These are the underlying components of the EPA walkability formula and they correlate with the score by construction, which is exactly what makes the side-by-side choropleth maps look so visually similar. Third, the multivariate regression on Chicago city tracts explains R² ≈ 0.85 of walkability variance, which is what we should expect because the predictors are the formula inputs.

The one piece of the hypothesis the data does not support is the idea that raw residential population density drives walkability. Within the city, Pearson r between pop_density_sqmi and walkability is essentially zero. The reason is mechanistic, not statistical: density is not one of the four EPA inputs, and within Chicago the densest residential tracts (Lakeview, Edgewater, parts of Lincoln Park) are not the most walkable tracts. The Loop has the highest walkability but a comparatively low residential density because it is dominated by jobs and retail. The variables that do drive walkability are mixed-use development, dense street networks, and transit proximity, all of which the Loop has in abundance.

Across the United States, an emerging experimental trend is the “15-minute city”, where all the amenities that come with being in a town or city can be accessed within the span of a 15-minute walk. Even in some low-density suburbs, there is demand for a reduction in car-centric infrastructure. The Chicago data is consistent with this framing: a tract can be relatively low-density and still score high on walkability if mixed-use entropy and intersection density are high. The strongest individual driver in the multivariate regression is the 8-tier employment mix, which is exactly the variable that captures whether jobs, retail, services, and housing are colocated at walkable distances within a tract.

All of these findings justify the inclusion of this dataset into ChiVes as it provides this information within the context of all of the other environmental factors that either positively or negatively influence the lives of Chicagoans on a daily basis. However, there are still a few limitations to this dataset that prevented this analysis from reaching its full potential. There are five census tracts with no data, due to them primarily not having many people at all. Furthermore, walkability scores do not account for a few aspects of the urban form, such as the quality of a sidewalk, how convenient it is to walk somewhere in warmer versus colder weather, crime rates, or how relevant an amenity is on a daily basis to the average person.

Throughout the analysis, several issues with the R code occurred. The first challenge was converting from block groups to census tracts. The initial approach to this was taking a weighted mean of the block groups, but that would have produced misleading results due to the variables for the data being counted incorrectly (i.e. not counting some data or double counting some data). It is also challenging to include a census tract if it is missing some variables. Thus, removing all data for the census tract is the easiest thing to do. Thankfully, this only removed 5/2014 census tracts. This method of computing weighted means was later changed to sum the product of a variable’s value and its corresponding weight, divided by the sum of the weights. In other words, this method accounted for the population of an area, since census tracts with larger populations contributed more proportionally to the final conversion from block groups to census tracts than lower density areas.

The next challenge was getting the exact Chicago boundaries. Initially, the spatial filtering to visualize Chicago was just using four coordinates to form a rectangle around the area. The problem with this was that areas outside the desired analysis were present in the visualization. Thus, the spatial filtering was later changed to using places function from the tigris library and transforming the location into the correct CRS. This filtered the boundaries to the exact polygon.

Then, another persistent challenge to overcome in this analysis was surprisingly loading in the ChiVes data. Usually, loading in a dataset should be simple, but the EPA Walkability variables were being incorporated into the chicago_city final data before the ChiVes variables had been incorporated into that same variable. This was solved by performing a left_join() on the EPA Walkability’s TRACT_GEOID and ChiVes’ geoid variable.

This analysis is only an introductory step in a hopefully ongoing process to bring attention to improve walkability across not just Chicago but around all of the Greater Chicago Metropolitan area as a whole. Implementing this into ChiVes would vastly increase the chances of new policy implementation such as new transit lines, added bike lanes, re-zoning projects, and new street layout projects. As previously mentioned four paragraphs back, a walkability score does not account for everything related to the convenience of pedestrians. Future work for this project could include adding data to integrate these factors into new visualizations. Perhaps an entirely new data generating process (DGP) is needed in order to maximize the accuracy of this data. In fact, the idea that 1090 census tracts are classified as “Above Average Walkability” does not make sense. If the average walkability score is singular value, it makes sense why there is no category called “Average Walkability”; Everything can only either be below or above that average. Finally, training machine learning models can speed up and also increase the accuracy of the aforementioned DGP. These models would be able to predict walkability scores based off of current built-environment variables, create new variables, and even incorporate scenario-based testing.

All of these goals and this current analysis is extremely important for the environmental sustainability and quality of life in the urban form.

References and .csv Export

chicago_boundary <- places(state = "IL", cb = TRUE) %>%
  filter(NAME == "Chicago") %>%
  st_transform(4326)

chicago_export <- final_data %>%
  st_filter(chicago_boundary, .predicate = st_intersects) %>%
  st_drop_geometry() %>%
  select(
    geoid = TRACT_GEOID,
    walkability_score = walk_score,
    walkability_category = walk_category,
    employment_entropy = employment_entropy,
    intersection_density = intersection_density,
    transit_distance_km = transit_distance_km,
  )

write_csv(chicago_export, "chicago_walkability_for_chives.csv")

U.S. Environmental Protection Agency. (2021). National Walkability Index: User Guide and Methodology.
U.S. Environmental Protection Agency. (2021). Smart Location Database Technical Documentation.
ChiVes Project Team. (2024). ChiVes: Chicago Visualization and Environmental Sustainability Platform. University of Illinois at Urbana-Champaign.

A Spatial Analysis of Walkability and Urban Form in the Chicago Metropolitan Area

Shanmukh Upadhyayula

December 18, 2025