Find Your Next Location

AI-powered site intelligence for retail expansion across Greater London. Identifying high-potential locations across 33 boroughs and 6 business types.

H3 Hexagonal Grid • XGBoost ML • Spatial Cross-Validation • SHAP Explanations • 5-Tier Recommendations • 33 Features

Project Overview

This project identifies optimal locations for new retail businesses across all 33 boroughs of Greater London using 5 data modalities — population, demographics, crime rates, transport accessibility, and spatial graph structure — combined with supervised machine learning. The model supports 6 business types (cafe, restaurant, pub, fast food, gym, bakery) with 33 engineered features per location. The core insight: a binary classifier's False Positives — locations the model predicts should have a given business but currently don't — represent untapped market opportunities, ranked into a 5-tier recommendation system with per-location SHAP explanations.

In the language of network theory, these are Burt's Structural Holes (1992): positions in a spatial network with high demand signals but no existing supply, now identified by data-driven learning rather than heuristic scoring.

Full Pipeline Architecture

LandScan Raster Population density
WGS84 • ~1km resolution
Zonal stats → H3 hexagons
Digimap Census CSVs Education • Age • Employment
LSOA-level • 2021 Census
Areal interpolation → H3
OSM / Overpass API 845k+ POIs across 33 boroughs
19 tag categories • 4×4 tile grid
Deduplicated by H3 index
1 Spatial Preprocessing & Grid Generation
Borough Boundaries osmnx geocoding • 33 boroughs
Reprojected EPSG:4326 → EPSG:27700
H3 Hexagonal Grid Resolution 9 • ~174m edge
16,889 hexagons • BNG clipped
CRS Normalisation All vectors → EPSG:27700
Raster clips in WGS84, results back to BNG
2 Feature Engineering — 33 Features per Hexagon
POI Counts Synergy • Competitor
Anchor • Nearby (500m)
Graph Features 6 centrality • Louvain community
Node2Vec embeddings (3D)
Demographics Education (Level 4+)
Age 16–34 • Employment
Population LandScan zonal stats
Daytime + residential
Crime Rates police.uk 2021
Violent • Property • ASB
Transport Access Station distance • Bus stops
Walkable catchment (800m)
3 Machine Learning Pipeline
Spatial Block CV H3 Res-7 parent grouping
5 folds • no data leakage
XGBoost (Tuned) GridSearchCV • 6 params
LR & RF baselines compared
F₁ Threshold Tuning Precision–Recall curves
OOF predictions • per-type
↓ One model per business type ↓
Coffee Shop • Restaurant • Pub / Bar • Fast Food • Gym / Fitness • Bakery
↓ 5-Tier Classification → FP = New Sites (Prime/Strong/Viable/Competitive) • SHAP Explanations ↓
5-Tier Recommendations Prime • Strong • Viable • Competitive
Per-type CSV • SHAP “Why Here?” cards
Structural holes + growth zones
Interactive 3D Maps Pydeck H3HexagonLayer
6 maps • Colour = tier
Height = confidence • Borough filter
Portfolio & Report This site • Report tab
Bootstrap CI • Permutation test
Feature importance • SHAP analysis

Research Question

"Can a binary classification model, trained on 33 geospatial features derived from population rasters, census demographics, crime rates, transport accessibility, graph centrality metrics, community structure, and graph embeddings, identify underserved locations (structural holes) for retail businesses across Greater London — with per-location SHAP explanations and confidence-based recommendation tiers?"

Key Results

16,889

H3 Hexagons

Resolution 9 (~174m edge) covering 33 London boroughs

33

Engineered Features

Per type: 6 demographic, 3 crime, 3 transport, 6 graph centrality, 1 community, 3 graph embeddings, 10 POI co-occurrence, 1 competition density

6

Business Types

Cafe, restaurant, pub, fast food, gym, bakery

5 Tiers

Recommendation System

Prime • Strong • Viable • Competitive • Not Recommended — with SHAP explanations

Data Sources

Source | Type | What It Provides
LandScan (ORNL, 2023) | Raster (~1km) | Population density as a footfall proxy
ONS Census 2021 (Digimap) | Tabular (OA level) | Demographics: education, age, employment
OpenStreetMap (OSMnx) | Vector (Points) | POIs: cafes, gyms, stations, offices
H3 Adjacency Graph | Computed (NetworkX) | 6 centrality metrics, Louvain community detection, Node2Vec embeddings (3D)
police.uk (Met Police + City of London, 2021) | Tabular (geocoded) | Crime rates: violent, property, anti-social behaviour per hex
OpenStreetMap / Overpass API (bus stops) | Vector (Points) | Bus stop locations for transit accessibility features

The False Positive Thesis

Key Insight: False Positive hexagons possess all the learned features of successful business locations — high footfall, educated demographics, low crime, strong transit connectivity, co-located complementary businesses — but no one has opened that business there yet. These are untapped opportunities, validated by supervised learning and ranked into 5 confidence tiers (Prime → Strong → Viable → Competitive → Not Recommended) based on model probability and nearby competition density. Each recommendation includes SHAP-powered “Why Here?” explanations.

How It Works

The pipeline transforms raw geospatial data into actionable site recommendations in six stages.

Five geospatial data modalities are acquired and harmonised into EPSG:27700 (British National Grid): population (LandScan), demographics (Census 2021), crime (police.uk 2021), transport (OSM stations + Overpass bus stops), and POIs (OSMnx).

Three Data Modalities

Modality | Representation | Example | Precision
Vector | Discrete geometric primitives with exact coordinate pairs | OSM cafe locations (Points), building footprints (Polygons) | Sub-metre
Raster | Regular grid of cells, each storing a numeric value | LandScan population grid, where each pixel holds a count | ~1 km per pixel
Tabular | Attribute records keyed by a spatial identifier | ONS Census CSVs with percentages per Output Area | OA centroid

POI Categorisation

Each Point of Interest is classified into a business role relative to the target business type:

Role | OSM Tags (examples) | Effect on Score
Competitor | Same type as target (e.g. cafes for cafe model) | Excluded from features (encodes the target)
Synergy | gym, university, office, library, co-working | Increases score (complementary foot traffic)
Anchor | station (public transport) | Increases score (transit node)
Other | bakery, supermarket, restaurant | Contextual enrichment
CRS Fix Applied: The original code computed centroids in EPSG:4326 (degrees). At 51°N, 1° longitude spans ~69 km while 1° latitude spans ~111 km — a 60% distortion. All geometric operations now use EPSG:27700 (metres) to eliminate this anisotropic error.
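The distortion figures quoted above can be checked with a quick calculation. The sketch below assumes a spherical Earth (roughly 111.32 km per degree of latitude); the function and constant names are illustrative and not taken from the project code.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate value under a spherical-Earth assumption


def km_per_degree_lon(lat_deg: float) -> float:
    """Approximate east-west length of one degree of longitude at a latitude."""
    return KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))


lon_km = km_per_degree_lon(51.5)   # central London sits near 51.5 degrees north
ratio = KM_PER_DEG_LAT / lon_km    # how much longer a degree of latitude is
print(f"1 deg lon = {lon_km:.1f} km, lat/lon ratio = {ratio:.2f}")
```

Because the two axes differ by this ratio, a centroid or distance computed directly on degree coordinates is stretched north-south relative to east-west, which is what reprojecting to EPSG:27700 avoids.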

Competition Density (k=2 Ring)

For each business type, a competition density feature counts same-type businesses in a k=2 H3 ring (~18 neighbouring hexagons, ~350m radius).
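The neighbour count can be illustrated without the H3 library by enumerating a hexagonal disk in axial coordinates. This is a sketch under stated assumptions: the (q, r) coordinates stand in for real H3 indexes, `business_counts` is a hypothetical lookup, and excluding the centre hexagon from the count is an assumption rather than something the text confirms.

```python
def hex_disk(k: int) -> set[tuple[int, int]]:
    """All axial (q, r) offsets within k steps of the origin hexagon."""
    cells = set()
    for q in range(-k, k + 1):
        for r in range(max(-k, -q - k), min(k, -q + k) + 1):
            cells.add((q, r))
    return cells


def competition_density(center, k, business_counts):
    """Sum same-type business counts over the k-ring, excluding the centre."""
    cq, cr = center
    return sum(
        business_counts.get((cq + dq, cr + dr), 0)
        for dq, dr in hex_disk(k)
        if (dq, dr) != (0, 0)
    )


# Hypothetical cafe counts keyed by axial coordinate.
cafe_counts = {(0, 0): 5, (1, 0): 2, (2, -1): 1, (3, 0): 4}
n_neighbours = len(hex_disk(2)) - 1
density = competition_density((0, 0), 2, cafe_counts)
print(n_neighbours, density)
```

Ring 1 contributes 6 cells and ring 2 contributes 12, which gives the 18 neighbours quoted above.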

Census Merge

Three ONS Census 2021 CSVs (Economic Activity, Age Structure, Qualifications) are merged on geog_code (Output Area identifier), producing 846 OAs with 7 demographic columns. Missing values are imputed with the column median.

Exploratory Data Analysis

Missingness audit and class balance

Fig. 1 — Feature missingness audit (left) and target class distribution (right).

Feature correlation heatmap

Fig. 2 — Feature correlation matrix.

Feature distributions by target class

Fig. 3 — Feature distributions split by target class (cafe).

Uber's H3 library partitions Greater London into hierarchical hexagonal cells. Unlike square grids, hexagons have 6 equidistant neighbours — ideal for walking-distance analysis.

Why Resolution 9?

Resolution | Edge Length | Hex Area | Use Case
7 | ~1.2 km | ~5.16 km² | District-level
8 | ~460 m | ~0.74 km² | Neighbourhood
9 | ~174 m | ~0.105 km² | Walking-scale (our choice)
10 | ~66 m | ~0.015 km² | Street-level

At Resolution 9, each hexagon's edge (~174 m) is roughly a 2-minute walk, which makes the cell a natural micro-unit of the 15-minute city (Moreno et al., 2021).

Enrichment Pipeline

H3 Grid: ~16,889 hexagons
Zonal Stats: LandScan population
Spatial Join: Census demographics
Master Grid: 33 features per hex

Census demographics are joined via sjoin_nearest with population-weighted mean aggregation. Hexes with no OA match receive the borough median (conservative imputation).

Street-level crime data from police.uk (Metropolitan Police + City of London Police, ~1.08M raw records across 2021) is cleaned, deduplicated, and aggregated to H3 hexagons.

Data Cleaning

The raw data contains significant quality issues that are resolved before feature engineering:

Issue | Scale | Action
Exact duplicate rows | ~166,000 (15.7%) | Removed via deduplication
Crime ID duplicates | ~40,600 | Jurisdictional overlap at force boundaries; deduplicated by Crime ID
Non-London records | ~19,800 | Met Police covers some edge areas (Epping, Elmbridge); filtered by LSOA to 33 boroughs
ASB without Crime ID | ~302,000 | By design in police.uk data; retained after exact dedup

Post-cleaning: ~852,000 verified crime records across all 33 London boroughs.

Crime Type Grouping

14 raw crime types grouped into 3 retail-relevant categories:

Group | Types Included | Retail Signal
Violent crime | Violence, Robbery, Weapons, Public order, Drugs | Safety perception: deters footfall
Property crime | Burglary, Vehicle crime, Theft, Shoplifting, Criminal damage | Operational cost: shrinkage, insurance
Anti-social behaviour | ASB | Neighbourhood quality: high street decline

Features (3 per Hexagon)

Each crime group is aggregated per hex, then log-transformed (log1p) to compress the right-skewed tail — hotspot hexes like Oxford Street and Stratford Westfield have thousands of incidents while most residential hexes have single digits. Hexes with no crime receive 0 (an observed zero, not imputed with borough median).

Feature | Description
violent_crime_log1p | log(1 + violent crime count): safety signal
property_crime_log1p | log(1 + property crime count): retail cost signal
antisocial_behaviour_log1p | log(1 + ASB count): neighbourhood quality signal
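As a minimal illustration of the transform, the sketch below applies `math.log1p` to per-hex counts. The two example hexagons and their counts are made up for illustration; only the feature names come from the table above.

```python
import math


def crime_features(counts: dict[str, int]) -> dict[str, float]:
    """Map raw per-hex counts to log1p features named <group>_log1p."""
    return {f"{group}_log1p": math.log1p(n) for group, n in counts.items()}


# Hypothetical counts: a retail hotspot and a quiet residential hexagon.
hotspot = crime_features(
    {"violent_crime": 900, "property_crime": 3000, "antisocial_behaviour": 400}
)
quiet = crime_features(
    {"violent_crime": 0, "property_crime": 2, "antisocial_behaviour": 0}
)
print(hotspot["property_crime_log1p"], quiet["property_crime_log1p"])
```

A raw count gap of 3000 versus 2 shrinks to roughly 8 versus 1 on the log scale, and a hexagon with no recorded crime maps to exactly 0.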

Transport connectivity features capture the accessibility gradient that drives retail footfall. Station data comes from the existing POI fetch (public_transport=station); bus stops are fetched separately via Overpass API (highway=bus_stop, ~19,000 across London).

Spatial Computation

All distance calculations use EPSG:27700 (British National Grid) for metric accuracy. A scipy.spatial.cKDTree indexes station locations for O(log n) nearest-neighbour and radius queries against the ~16,900 hex centroids.

Features (3 per Hexagon)

Feature | Description
dist_to_nearest_station_log | log(1 + BNG metres to nearest rail/tube/DLR station): accessibility gradient
station_count_800m | Count of stations within 800m walkable catchment: interchange density
bus_stop_count_log | log(1 + bus stops per hex): local transit frequency proxy

Missing Values

Bus stops: 0 for hexes with no stops (observed zero). Station distance: always computed (no missing values). Station count: 0 when no stations within 800m.
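The three transport features can be sketched for a single hexagon as follows. To stay self-contained, this version swaps the cKDTree for a brute-force distance scan, so it illustrates the feature definitions rather than the indexed implementation. All coordinates are hypothetical British National Grid eastings and northings in metres, and the sketch assumes at least one station exists.

```python
import math


def transport_features(hex_xy, stations_xy, bus_stops_in_hex, catchment_m=800.0):
    """Compute the three transport features for one hexagon centroid."""
    # Brute-force distances; a KD-tree would replace this scan at scale.
    dists = [math.dist(hex_xy, station) for station in stations_xy]
    return {
        "dist_to_nearest_station_log": math.log1p(min(dists)),
        "station_count_800m": sum(1 for d in dists if d <= catchment_m),
        "bus_stop_count_log": math.log1p(bus_stops_in_hex),
    }


# Hypothetical centroid and station coordinates (metres, EPSG:27700).
centroid = (530000.0, 180000.0)
stations = [(530300.0, 180400.0), (531000.0, 180000.0), (530000.0, 180600.0)]
features = transport_features(centroid, stations, bus_stops_in_hex=0)
print(features)
```

Working in metres means the 800 m catchment test is a plain Euclidean comparison, which would not hold for coordinates in degrees.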

The enriched H3 grid becomes a spatial graph where hexagons are nodes and edges connect adjacent hexes. From this graph we extract three layers of structural information.

Six Centrality Metrics

Metric | What It Measures | Business Meaning
Degree | Number of neighbours | Interior (6) vs boundary (<6) connectivity
Betweenness | Frequency on shortest paths | Transit corridor importance
Closeness | Average distance to all nodes | Central vs peripheral location
Clustering | Neighbour interconnection | Neighbourhood cohesion
Eigenvector | Influence via well-connected neighbours | Hub proximity (near other hubs)
PageRank | Damped random walk steady state | Foot-traffic accumulation potential

Louvain Community Detection

The Louvain algorithm partitions the H3 adjacency graph into communities, maximising modularity. Each hexagon receives a community_id capturing which spatial neighbourhood cluster it belongs to.

Node2Vec Graph Embeddings

Node2Vec learns a 3-dimensional vector representation for each hexagon by simulating biased random walks on the graph. Hexagons with similar graph neighbourhoods receive similar embeddings, even if geographically distant.

Heuristic Site Score

$$\mathrm{Score}_H = \mathrm{Pop}_H \times D_H + 5\,S_H + 3\,A_H - 15\,C_H$$

Where $D_H$ = Level 4 qualification % / 100 (demand proxy), $S_H$ = synergy count, $A_H$ = anchor count, $C_H$ = competitor count.
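A direct transcription of the heuristic score into code looks like this. The argument names and the example inputs are illustrative assumptions; the weights 5, 3 and 15 come from the formula above.

```python
def heuristic_score(population, level4_perc, synergy, anchors, competitors):
    """Heuristic site score: demand-weighted population plus POI adjustments."""
    demand = level4_perc / 100.0  # Level 4 qualification share as a demand proxy
    return population * demand + 5 * synergy + 3 * anchors - 15 * competitors


# Hypothetical hexagon: 1,200 people, 45% Level 4, 4 synergy POIs,
# 1 anchor station, 2 existing competitors.
score = heuristic_score(1200, 45.0, synergy=4, anchors=1, competitors=2)
print(round(score, 2))
```

The population term dominates for dense hexagons, while each competitor subtracts as much as three synergy POIs add.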

Problem Formulation

Binary classification: for each of 6 business types, predict whether a hexagon should contain that business. The commercially valuable output is the False Positives — locations predicted as suitable that have no current supply.

Feature Matrix (33 Features per Business Type)

Modality | Count | Features
Footfall | 1 | population (LandScan zonal sum)
Demographics | 5 | employed_total_perc, age_16_to_34_perc, level4_perc, retired_perc, no_qualifications_perc
Crime Rates | 3 | violent_crime_log1p, property_crime_log1p, antisocial_behaviour_log1p
Transport Access | 3 | dist_to_nearest_station_log, station_count_800m, bus_stop_count_log
Graph Centrality | 6 | degree_centrality, betweenness_centrality, closeness_centrality, clustering_coeff, eigenvector_centrality, pagerank
Community Structure | 1 | community_id (Louvain)
Graph Embeddings | 3 | node2vec_dim0/1/2
POI Co-occurrence | 10 | Counts of all 11 POI types except the target type
Competition Density | 1 | nearby_{type}: same-type businesses in k=2 ring

Spatial Cross-Validation

Standard k-fold CV violates spatial independence. We group Resolution-9 hexes by their Resolution-7 parent cell (~5.16 km² blocks). All hexes sharing a parent are assigned to the same fold, preserving spatial integrity across 5 folds.
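The grouping idea can be sketched in a few lines. This is an illustration, not the project's implementation: `parent_of` is a hypothetical mapping from hexagon ID to parent-cell ID (in practice it would come from the H3 library's parent-cell lookup), and the greedy balancing rule is an assumption. A grouped splitter such as scikit-learn's GroupKFold serves the same purpose.

```python
from collections import defaultdict


def spatial_block_folds(parent_of: dict[str, str], n_folds: int = 5) -> dict[str, int]:
    """Assign each hexagon a fold so hexes sharing a parent cell stay together."""
    blocks = defaultdict(list)
    for hex_id, parent_id in parent_of.items():
        blocks[parent_id].append(hex_id)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Largest blocks first, each placed in the currently smallest fold.
    for parent_id in sorted(blocks, key=lambda p: (-len(blocks[p]), p)):
        fold = fold_sizes.index(min(fold_sizes))
        for hex_id in blocks[parent_id]:
            fold_of[hex_id] = fold
        fold_sizes[fold] += len(blocks[parent_id])
    return fold_of


# Hypothetical IDs: 70 hexagons spread over 10 parent cells.
parent_of = {f"hex{i}": f"parent{i // 7}" for i in range(70)}
fold_of = spatial_block_folds(parent_of)

folds_per_parent = defaultdict(set)
for hex_id, fold in fold_of.items():
    folds_per_parent[parent_of[hex_id]].add(fold)
print(all(len(f) == 1 for f in folds_per_parent.values()))
```

Because a parent cell never straddles two folds, a test hexagon's immediate neighbours inside the same block cannot appear in its training set.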

Model Comparison

Model | Type | Imbalance Handling | Rationale
Logistic Regression | Linear | class_weight='balanced' | Interpretable baseline
Random Forest | Bagged ensemble | class_weight='balanced' | Non-linear, robust
XGBoost | Boosted ensemble | scale_pos_weight | State-of-the-art tabular
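For the XGBoost row, a common starting value for scale_pos_weight is the ratio of negative to positive training examples. The helper below is a hypothetical sketch of that calculation on toy labels; whether the project uses exactly this ratio is not stated in the text.

```python
def scale_pos_weight(labels) -> float:
    """Ratio of negative to positive examples, a common XGBoost starting value."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in this training fold")
    return n_neg / n_pos


# Toy fold with 6 positives out of 100 hexagons.
toy_labels = [1] * 6 + [0] * 94
weight = scale_pos_weight(toy_labels)
print(round(weight, 2))
```

Computing the ratio on each training fold, rather than once on the full dataset, keeps test-fold labels out of the weighting.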

Evaluation Figures

ROC curves for all models

Fig. 4 — ROC curves (out-of-fold, spatial CV)

Confusion matrices for all 6 business types

Fig. 5 — Confusion matrices (out-of-fold, all 6 types)

Feature importance for all 6 business types

Fig. 6 — Top 10 features by business type (XGBoost gain)

Precision-Recall curves

Fig. 7 — Precision–Recall curves with F1-optimal thresholds

Calibration curve

Fig. 8 — Calibration curve

Feature profiles by prediction outcome

Fig. 9 — Failure mode analysis

Model Performance (33 Features, Spatial Block CV)

Business Type | OOF AUC | ± Std | Accuracy | Precision | Recall | F1 | Threshold
Coffee Shop / Cafe | 0.9123 | 0.0065 | 85.1% | 59.6% | 56.4% | 0.580 | 0.728
Restaurant | 0.9344 | 0.0090 | 86.3% | 57.7% | 65.0% | 0.611 | 0.741
Pub / Bar | 0.8916 | 0.0133 | 84.7% | 45.0% | 54.2% | 0.492 | 0.722
Fast Food | 0.9357 | 0.0083 | 87.5% | 55.8% | 61.7% | 0.586 | 0.788
Gym / Fitness * | 0.7752 | 0.0344 | 85.3% | 24.5% | 40.1% | 0.304 | 0.723
Bakery | 0.9233 | 0.0228 | 94.9% | 30.9% | 57.7% | 0.402 | 0.822

* Gym model has degraded AUC due to data sparsity (6.6% prevalence, only 1,014 positive hexes). Recommendations should be treated with lower confidence. See caveat below.
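The Threshold column comes from F1-based tuning on out-of-fold predictions. A minimal sketch of that idea is shown below, assuming a simple sweep over the observed probabilities; the function names and toy data are illustrative, and the project's own tuning code may differ.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, y_prob):
    """Return (threshold, F1) maximising F1 over the observed probabilities."""
    best_t, best_f1 = 0.5, 0.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
        if tp == 0:
            continue
        f1 = f1_score(tp / (tp + fp), tp / (tp + fn))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy out-of-fold labels and probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9]
threshold, f1 = best_f1_threshold(y_true, y_prob)
print(threshold, round(f1, 3))

# Consistency check against the Cafe row of the table above.
cafe_f1 = f1_score(0.596, 0.564)
```

The same harmonic-mean formula reproduces the table's F1 column from its Precision and Recall columns, for example about 0.580 for Cafe.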

Feature Value-Add: Crime + Transport (30 → 33 Features)

Business Type | 30-Feat AUC | 33-Feat AUC | Delta | Key New Feature
Coffee Shop / Cafe | 0.9113 | 0.9123 | +0.10pp | None (no crime/transport in top-5)
Restaurant | 0.9182 | 0.9344 | +1.62pp | property_crime (#4, 6.0%)
Pub / Bar | 0.9220 | 0.8916 | −3.04pp | property_crime (#4, 4.5%), bus_stops (#5, 4.4%)
Fast Food | 0.8923 | 0.9357 | +4.34pp | property_crime (#3, 7.5%), violent_crime (#4, 5.1%), bus_stops (#5, 4.5%)
Gym / Fitness | 0.8765 | 0.7752 | −10.13pp | property_crime (#5, 3.5%)
Bakery | 0.8635 | 0.9233 | +5.98pp | property_crime (#5, 4.0%)
Verdict: The 6 new features (3 crime + 3 transport) improve 4/6 types. Mean AUC excluding Gym: 0.9015 → 0.9195 (+1.8pp). property_crime_log1p is the single most valuable new feature, appearing in 5/6 models’ top-5. Fast Food (+4.3pp) and Bakery (+6.0pp) benefit most: crime rates clearly discriminate viable retail locations for these types.
Gym Model Caveat: Gym/Fitness degraded from 0.877 to 0.775 AUC (−10.1pp). Root cause: only 1,014 positive hexes (6.6% prevalence) cannot support 33 features without overfitting. The model’s fold-to-fold standard deviation is about 5× higher than Cafe’s (±0.034 vs ±0.007), and precision is just 24.5%. The 5-tier system mitigates this: 0 Prime and 0 Strong recommendations survived the tier filter, correctly flagging that gym predictions lack conviction. Gym recommendations should be treated as directional signals, not confident site selections.
Learning Curve and Feature Ablation

Fig. 10 — Learning Curve (left) and Feature Ablation (right)

Statistical Significance

All 6 models are validated with bootstrap 95% confidence intervals (n=2000) and permutation tests (n=1000) at a strict p < 0.001 threshold. This high bar accounts for multiple comparisons (6 models tested simultaneously) and remains significant even after Bonferroni correction.
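Both procedures can be sketched with the standard library alone. The code below is an illustrative version under stated assumptions: AUC is computed by pairwise comparison, the permutation test shuffles labels, the interval is a percentile bootstrap, and the toy scores are made up. With 1,000 permutations the smallest attainable p-value under the (hits + 1) / (n + 1) convention is 1/1001, just under 0.001.

```python
import random


def auc(y_true, y_score):
    """ROC AUC: probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(y_true, y_score, n_perm=1000, seed=0):
    """Share of label shuffles whose AUC reaches the observed AUC."""
    rng = random.Random(seed)
    observed = auc(y_true, y_score)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, y_score) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if 0 < sum(labels) < n:  # a resample needs both classes
            stats.append(auc(labels, [y_score[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


# Toy scores: 20 negatives and 20 positives with partial overlap.
y_true = [0] * 20 + [1] * 20
y_score = [j / 40 for j in range(20)] + [(14 + i) / 40 for i in range(20)]
observed_auc = auc(y_true, y_score)
p_value = permutation_p_value(y_true, y_score)
ci_low, ci_high = bootstrap_ci(y_true, y_score)
print(observed_auc, p_value, ci_low, ci_high)
```

For six simultaneous tests at a family-wise level of 0.05 (an assumed level, not stated in the text), the Bonferroni threshold is 0.05 / 6 ≈ 0.0083, so p < 0.001 clears it.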

Permutation test null distributions

Fig. 11 — Permutation test: null distribution vs observed AUC (all 6 types)

Interpretation: The observed AUC (red line) falls far outside the null distribution (grey histogram) for all 6 models. This confirms the model has learned genuine spatial patterns in the data, not random noise. All p-values < 0.001 with 95% CI lower bounds well above 0.5.

Confidence Score Distributions

Histograms of out-of-fold predicted probabilities, split by actual label (positive vs negative), reveal whether the model achieves genuine score separation. A model with strong discrimination produces well-separated, bimodal distributions: negatives cluster near 0, positives near 1. Flat distributions indicate poor discrimination. Tier threshold lines show where Prime (≥0.95) and Strong (≥0.85) recommendations are carved from the probability space.

Predicted probability distributions by business type

Fig. 14 — OOF predicted probability distributions: negative hexes (grey) vs positive hexes (red), with F1 threshold (orange dashed) and tier thresholds (green dotted). Bimodal separation confirms genuine score discrimination.

Borough Holdout Cross-Validation (External Validity)

Standard spatial CV tests within-London generalisation. Borough holdout (leave-one-borough-out) tests a harder question: does the model generalise to administratively distinct areas it has never seen? Each of the 33 boroughs is used as a test set in turn (minimum 10 positives required to compute a valid AUC). The heatmap shows per-borough AUC; the boxplot shows the distribution across boroughs per type.
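A leave-one-borough-out split with the minimum-positives rule can be sketched as a generator. The mappings `borough_of` and `label_of` and the toy borough names are hypothetical; only the 10-positive minimum comes from the text.

```python
def borough_holdout_splits(borough_of, label_of, min_positives=10):
    """Yield (borough, train_ids, test_ids), skipping boroughs with too few positives."""
    for held_out in sorted(set(borough_of.values())):
        test_ids = [h for h, b in borough_of.items() if b == held_out]
        if sum(label_of[h] for h in test_ids) < min_positives:
            continue  # AUC on so few positives would be unreliable
        train_ids = [h for h, b in borough_of.items() if b != held_out]
        yield held_out, train_ids, test_ids


# Toy data: three boroughs of 30 hexagons with 12, 3 and 15 positives.
borough_of, label_of = {}, {}
for name, n_pos in [("A", 12), ("B", 3), ("C", 15)]:
    for i in range(30):
        borough_of[f"{name}{i}"] = name
        label_of[f"{name}{i}"] = 1 if i < n_pos else 0

splits = list(borough_holdout_splits(borough_of, label_of))
print([borough for borough, _, _ in splits])
```

Borough B is skipped as a test set but its hexagons still contribute to training when A or C is held out.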

Borough holdout CV heatmap and boxplot

Fig. 15 — Borough holdout CV: AUC heatmap (boroughs × types, left) and per-type AUC distribution (right). Consistent AUC above 0.5 across boroughs confirms the model generalises beyond its training boroughs.

Gym Model Deep-Dive: Root Cause Analysis

The gym model (AUC 0.775) was subjected to three diagnostic tests: (1) removing crime+transport features to test whether the 6 new features cause the degradation, (2) doubling the class weight to test whether imbalance handling is insufficient, and (3) examining per-fold AUC variance to confirm the instability signature of data sparsity.

Gym model deep-dive diagnostics

Fig. 16 — Gym deep-dive: class distribution (A), AUC comparison across configurations (B), precision-recall curve (C), and per-fold variance vs. Cafe (D).

5-Tier Recommendation System

Tier | Criteria | Business Value
Prime Location | ≥95% confidence, ≤2 competitors nearby | Highest-conviction site: strong demand, minimal competition
Strong Recommend | 85–95% confidence, ≤3 competitors | High confidence, manageable competitive landscape
Viable | Above threshold, ≤5 competitors | Suitable location with some existing competition
Competitive | Above threshold, >5 competitors | Model says suitable but market is saturated
Expansion Opportunity | Model says YES, reality is YES, low saturation | Room for another: high demand, few competitors
True Positive | Model says YES, reality is YES | Validates model: correctly identifies existing shops
True Negative | Model says NO, reality is NO | Correctly rejects unsuitable locations
False Negative | Model says NO, reality is YES | Niche shops not captured by feature set
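The tier criteria for hexagons without an existing business can be expressed as a small cascade. This is a sketch under assumptions: the function name is hypothetical, and letting a site that misses one tier's competitor cap fall through to the next tier (for example 97% confidence with 3 competitors becoming Strong) is an interpretation the table does not spell out.

```python
def assign_tier(prob: float, threshold: float, competitors: int) -> str:
    """Map model confidence and nearby competitor count to a recommendation tier."""
    if prob < threshold:
        return "Not Recommended"
    if prob >= 0.95 and competitors <= 2:
        return "Prime"
    if prob >= 0.85 and competitors <= 3:
        return "Strong"
    if competitors <= 5:
        return "Viable"
    return "Competitive"


CAFE_THRESHOLD = 0.728  # F1-optimal threshold reported for the Cafe model
examples = [(0.97, 1), (0.97, 3), (0.90, 4), (0.80, 8), (0.60, 0)]
tiers = [assign_tier(p, CAFE_THRESHOLD, c) for p, c in examples]
print(tiers)
```

Because the first check uses the per-type F1 threshold, the same probability can land in different tiers for different business types.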

Every recommendation comes with a per-location explanation powered by SHAP (SHapley Additive exPlanations) — a game-theoretic framework that decomposes each prediction into individual feature contributions.

Why SHAP, Not Feature Importance?

Method | Scope | Question Answered
XGBoost Gain | Global (all hexagons) | “Which features matter overall?”
SHAP Values | Local (per hexagon) | “Why was this specific location recommended?”

How It Works

For each hexagon, SHAP computes a signed contribution for every feature:

Feature | SHAP Value | Interpretation
population | +0.18 | High footfall pushes prediction up
property_crime_log1p | −0.09 | High crime pushes prediction down
station_count_800m | +0.12 | Good transit access pushes prediction up

The top 3 SHAP drivers are stored per hexagon and displayed as “Why Here?” cards in the Site Finder tab, giving stakeholders transparent, actionable insight into each recommendation.
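Selecting the top drivers amounts to ranking features by absolute contribution. The sketch below assumes the per-hexagon SHAP values are already available as a dictionary (in practice they would come from a SHAP explainer); the helper name and the two smaller values are hypothetical, while the first three values reuse the example table above.

```python
def top_drivers(shap_values: dict[str, float], k: int = 3):
    """Return the k features with the largest absolute SHAP contribution."""
    ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


hex_shap = {
    "population": 0.18,
    "property_crime_log1p": -0.09,
    "station_count_800m": 0.12,
    "bus_stop_count_log": 0.02,  # hypothetical value
    "retired_perc": -0.04,       # hypothetical value
}
drivers = top_drivers(hex_shap)
for feature, value in drivers:
    direction = "up" if value > 0 else "down"
    print(f"{feature}: {value:+.2f} (pushes prediction {direction})")
```

Ranking by absolute value keeps strong negative drivers in the card, so a stakeholder sees what counts against a site as well as what counts for it.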

SHAP feature importance bar charts for all 6 business types

Fig. 12 — SHAP feature importance: mean |SHAP| value per feature across all 6 business types (top 8 features shown)

Contrastive SHAP: Why Did the Model Recommend Here but Not There?

Each hexagon is assigned to a confusion-matrix quadrant (TP, FP, FN, TN) based on out-of-fold predictions. The grouped bar chart compares mean absolute SHAP attribution for the top 10 features across TP (existing businesses the model confirms), FP (recommendations — gaps the model identifies), and FN (missed sites). Features where FP bars exceed FN bars indicate what separates an identified opportunity from a model blind spot.

Contrastive SHAP by confusion-matrix quadrant

Fig. 13 — Contrastive SHAP: mean absolute feature attribution for True Positives (red), False Positives / recommendations (green), and False Negatives / missed sites (orange) across all 6 business types.

Stakeholder Value: SHAP explanations transform the model from a black box into a decision-support tool. Instead of just “open here,” the system says “open here because population is high, crime is low, and 3 stations are within walking distance.”

Do Recommendations Systematically Exclude Deprived Areas?

A model trained on commercial presence data risks learning historical inequity: businesses have historically avoided deprived areas, so the model may conclude deprived areas are unsuitable — a self-fulfilling prophecy. We audit this by computing recommendation rates across deprivation quintiles (using no_qualifications_perc as a deprivation proxy, as no IMD data is available).

Proxy Limitation: no_qualifications_perc (the ONS Census 2021 percentage of residents with no qualifications) is used as a deprivation proxy in the absence of IMD data. It correlates strongly with IMD at LSOA level (Pearson r ≈ 0.72 in prior literature) but is not a perfect substitute. Results should be interpreted directionally.

Method

Hexagons are split into quintiles (Q1 = least deprived, Q5 = most deprived) and recommendation rate (False Positives / total hexes per quintile) is computed per business type. Spearman ρ tests whether recommendation rate monotonically increases or decreases with deprivation quintile. A positive ρ indicates the model over-recommends in deprived areas.
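The monotonicity test can be sketched as a Spearman correlation computed from ranks. The implementation below is a self-contained illustration using average ranks for ties; the recommendation rates are made-up numbers, not results from the audit.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


quintiles = [1, 2, 3, 4, 5]
# Hypothetical recommendation rates per quintile (Q1 = least deprived).
rates = [0.021, 0.034, 0.040, 0.038, 0.052]
rho = spearman_rho(quintiles, rates)
print(round(rho, 2))
```

With only five quintiles per business type, ρ can take few distinct values, so it is better read as a direction indicator than as a precise effect size.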

Recommendation rate by deprivation quintile

Fig. 17 — Recommendation rate by deprivation quintile across all 6 business types. Q1 = least deprived; Q5 = most deprived. Spearman ρ annotated per type.

Interpretation: A positive ρ indicates systematic over-recommendation in deprived areas; a negative ρ indicates under-recommendation. Either direction warrants scrutiny: over-recommendation may reflect genuine unmet demand or model miscalibration; under-recommendation may encode historical inequity. Human review of Q5 recommendations is advised before acting on model output.

Partial Dependence Plots (Cafe Model)

Partial Dependence Plots (PDPs) show the marginal effect of each feature on the predicted probability, averaged over the observed values of all other features. Unlike SHAP (which explains individual predictions), PDPs reveal the global shape of each feature’s relationship: is it linear, threshold-based, or saturating? The top 6 features of the Cafe model are plotted below.

Partial Dependence Plots for Cafe model

Fig. 18 — Partial Dependence Plots: top 6 features of the Cafe model, showing marginal effect on predicted probability. Rug plots indicate feature value density.

Spatial Autocorrelation of Residuals (Moran’s I)

If model residuals cluster spatially, the model is missing a spatially structured signal, a violation of the independence assumption that makes confidence intervals overly narrow. Moran’s I quantifies spatial autocorrelation using H3 k-ring adjacency as the spatial weight matrix. Values near 0 indicate spatial randomness; positive values indicate clustering; negative values indicate dispersion.
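Moran's I can be computed directly from residuals and an adjacency list. The sketch below assumes binary weights (1 for adjacent cells, 0 otherwise) and uses a toy chain of six cells in place of the H3 adjacency; the function name and data are illustrative.

```python
def morans_i(values: dict, neighbours: dict) -> float:
    """Moran's I with binary adjacency weights.

    values maps cell -> residual; neighbours maps cell -> adjacent cells.
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {cell: v - mean for cell, v in values.items()}
    cross_sum = 0.0
    weight_sum = 0
    for cell, adjacent in neighbours.items():
        for other in adjacent:
            cross_sum += dev[cell] * dev[other]
            weight_sum += 1
    variance_sum = sum(d * d for d in dev.values())
    return (n / weight_sum) * (cross_sum / variance_sum)


# Toy chain of 6 cells: each cell is adjacent to the previous and next one.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
clustered = dict(enumerate([1, 1, 1, -1, -1, -1]))
alternating = dict(enumerate([1, -1, 1, -1, 1, -1]))
i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)
print(round(i_clustered, 2), round(i_alternating, 2))
```

Residuals that change sign only once along the chain give a positive statistic, while a strictly alternating pattern gives a negative one. Significance is then judged against a permutation null, as in Fig. 19.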

Moran's I spatial autocorrelation of residuals

Fig. 19 — Moran’s I for model residuals: observed I (red line) vs permutation null distribution (grey histogram). Low I values confirm spatial CV effectively decorrelates residuals.

Why this matters: Standard cross-validation on spatial data can be optimistically biased because nearby hexagons leak information. Spatial block CV (grouping hexes by Resolution-7 parent cell) mitigates this, and Moran’s I on the residuals checks that the mitigation worked: residuals should show no significant spatial clustering if the folds are properly structured.

Feature Group Ablation Study

To quantify each feature category’s contribution, we perform a leave-one-group-out ablation: remove all features from one category (POI, Demographics, Crime, Transport, or Graph centrality), retrain the model, and measure AUC drop. Larger drops indicate higher dependency on that feature group.

Feature group ablation heatmap

Fig. 20 — Feature group ablation: AUC change when each feature group is removed. Negative values (red) indicate the model depends on that group; positive values (blue) indicate the features were adding noise.

Methodology Flowchart

End-to-end pipeline from raw data sources through H3 spatial indexing, feature engineering, spatial cross-validation, XGBoost classification, evaluation, and final recommendation output.

End-to-end methodology flowchart

Fig. 21 — Pipeline architecture: data ingestion → H3 indexing → feature engineering → spatial CV → XGBoost → evaluation → tiered recommendations → interactive maps.

Why Does Crime Predict Retail Viability?

At first glance, property_crime_log1p appearing in 5/6 models’ top-5 features seems counterintuitive. The explanation lies in routine activity theory (Cohen & Felson, 1979): crime requires convergence of motivated offenders, suitable targets, and absent guardians. High-footfall commercial areas provide all three — more people, more targets, more opportunity. Crime is therefore a proxy for footfall intensity, not a causal driver of business viability.

Agglomeration Economics

POI co-occurrence dominates the feature importance rankings (e.g., n_restaurant at 38–42% for Cafe/Fast Food). This reflects Hotelling’s principle of minimum differentiation (1929): businesses cluster because proximity to competitors signals high demand density. The model captures this agglomeration effect, validating that retail location selection is fundamentally a clustering problem.

Fairness Implications of Training on Presence

A core limitation: the model is trained on existing business locations as positive labels. Areas that lack businesses (due to historical disinvestment, planning restrictions, or demographic change) are labelled as negatives. The model may therefore encode a self-fulfilling prophecy: “no businesses here → unsuitable location” even when genuine demand exists. The Fairness Audit (Fig. 17) quantifies this risk. Practitioners should supplement model recommendations with local market research in under-served areas.

Spatial CV Effectiveness

Spatial block cross-validation limits geographic data leakage by ensuring that no hexagons from the same Resolution-7 parent cell appear in both train and test sets. The Moran’s I analysis (Fig. 19) checks this design: if residuals show no significant spatial clustering, the folds are structured sensibly and AUC estimates are less likely to be optimistically biased. Borough holdout CV (Fig. 15) provides an even stronger test of external validity by holding out entire boroughs.

Limitation | Impact
Static POI snapshot | OSM data reflects current state; recent openings/closings not captured
Temporal mismatch | 2021 Census and crime data are combined with 2023 LandScan; changes between and since these dates are not captured
No revenue ground truth | Target is presence/absence, not profitability
No real-time footfall | LandScan is an estimate; no mobile signal data
No rental cost data | Commercial viability depends on rent, which is not modelled
Gym model degradation | AUC 0.775 (vs 0.89–0.94 for other types). Only 6.6% positive prevalence; 33 features cause overfitting on sparse data. Treat gym recommendations as directional only. See Gym Deep-Dive (Fig. 16) for root cause analysis.
Feature dominance by POI | POI co-occurrence occupies 67% of top-5 importance slots. Model primarily learns “businesses cluster” (agglomeration), with crime/transport as secondary discriminators.
Fairness / equity bias | Model trained on OSM business presence inherits historical location patterns. Deprived areas may be systematically over- or under-recommended depending on type. See Equity & Fairness Audit (Fig. 17) and Critical Discussion for quantified evidence.

Model Report

All model metrics in one place — AUC scores, outcomes, top features, and recommendations. Copy a structured plain-text summary with the button below.


Site Finder

Prime (≥95%, ≤2) Strong (85-95%) Viable (≤5 comp.) Competitive (>5) Existing (TP) Expansion Not Recommended

Height = model confidence. Colour = recommendation tier. Hover for SHAP drivers.

Top recommended sites

Ranked by the 5-tier system: Prime (≥95% confidence, ≤2 competitors) through Competitive (>5 competitors). Each card shows SHAP-powered “Why Here?” drivers.

