title

An intelligent location study and machine learning algorithms to select locations from a Italian restaurant in the city of San Francisco

The Italian restaurant of San Francisco are part of the culture of the city, the customs of its inhabitants and its tourist circuit. They have been the subject of study by different writers, inspirers of countless artistic creations and traditional union meeting. In this project, the idea is to find an optimal location for a new Italian restaurant, based on machine learning algorithms taken from the "The Battle of Neighborhoods: Coursera Capstone Project" course (1). Starting from the association of Italian restaurant with restaurants, we will first try to detect locations based on the definition of factors that will influence our decision:

1- Places that are not yet full of restaurants.

2- Areas with little or no cafe nearby.

3- Near the center, if possible, assuming the first two conditions are met.

With these simple parameters we will program an algorithm to discover what solutions can be obtained.

Data Source

The following data sources will be needed to extract and generate the required information:

1.- The centers of the candidate areas will be generated automatically following the algorithm and the approximate addresses of the centers of these areas will be obtained using one of the Geopy Geocoders packages. (2)

2-The number of restaurants, their type and location in each neighborhood will be obtained using the Foursquare API. (3)

The data will be used in the following scenarios:

1- To discover the density of all restaurants and cafes from the data extracted.

2- To identify areas that are not very dense and not very competitive.

3- To calculate the distances between competing restaurants.

Locate the candidates

The target area will be the center of the city, where tourist attractions are more numerous compared to other places. From this we will create a grid of cells that covers the area of ​​interest which will be about 12x12 kilometers centered around the center of the city of San Francisco.

In [140]:
import requests

from geopy.geocoders import Nominatim


address = '199 Gough St, San Francisco, CA 94102, USA'
geolocator = Nominatim(user_agent="usa_explorer")
location = geolocator.geocode(address)
lat = location.latitude
lng = location.longitude
sf_center = [lat, lng]
print('Coordinate of {}: {}'.format(address, sf_center), ' location : ', location)
Coordinate of 199 Gough St, San Francisco, CA 94102, USA: [37.7752096, -122.4227735]  location :  Rich Table, 199, Gough Street, Western Addition, San Francisco, San Francisco City and County, California, 94102, United States

We create a grid of the equidistant candidate areas, centered around the city center and that is 6 km around this point, for this we calculate the distances we need to create our grid of locations in a 2D Cartesian coordinate system that will allow us to then Calculate distances in meters.

Next, we will project these coordinates in degrees of latitude / longitude to be displayed on the maps with Mapbox and Folium (3).

In [141]:
#!pip install shapely
import shapely.geometry

#!pip install pyproj
import pyproj

import math

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=10, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=10, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate Verification')
print('-------------------------------')
print('San Francisco Center Union Square longitude={}, latitude={}'.format(sf_center[1], sf_center[0]))
x, y = lonlat_to_xy(sf_center[1], sf_center[0])
print('San Francisco Center Union Square UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('San Francisco Center Union Square longitude={}, latitude={}'.format(lo, la))
Coordinate Verification
-------------------------------
San Francisco Center Union Square longitude=-122.4227735, latitude=37.7752096
San Francisco Center Union Square UTM X=550833.4653390996, Y=4181031.39254272
San Francisco Center Union Square longitude=-122.4227735, latitude=37.7752096

We create a hexagonal grid of cells: we move all the lines and adjust the spacing of the vertical lines so that each cell center is equidistant from all its neighbors.

In [142]:
sf_center_x, sf_center_y = lonlat_to_xy(sf_center[1], sf_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = sf_center_x - 6000
x_step = 600
y_min = sf_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitude = []
longitude = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(sf_center_x, sf_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitude.append(lat)
            longitude.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'Union Square San Francisco grid - SF')
728 Union Square San Francisco grid - SF

Let's look at the data we have so far: location in the center and the candidate neighborhood centers:

In [143]:
import folium
In [144]:
tileset = r'https://api.mapbox.com'
attribution = (r'Map data © <a href="http://openstreetmap.org">OpenStreetMap</a>'
                ' contributors, Imagery © <a href="http://mapbox.com">MapBox</a>')

map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Marker(sf_center, popup='San Francisco').add_to(map_sf)
for lat, lon in zip(latitude, longitude):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_lyon) 
    folium.Circle([lat, lon], radius=300, color='purple', fill=False).add_to(map_sf)
    #folium.Marker([lat, lon]).add_to(map_caba)
map_sf
Out[144]:

At this point, we now have the coordinates of the local centers / areas to be evaluated, at the same distance (the distance between each point and its neighbors is exactly the same) and approximately 4 km from downtown San Francisco.

In [145]:
def get_address(lat, lng):
    #print('entering get address')
    try:
        #address = '{},{}'.format(lat, lng)
        address = [lat, lng]
        geolocator = Nominatim(user_agent="usa_explorer")
        location = geolocator.geocode(address)
        #print(location[0])
        return location[0]
    except:
        return 'nothing found'


addr = get_address(sf_center[0], sf_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(sf_center[0], sf_center[1], addr)) 
print(type(location[0]))
Reverse geocoding check
-----------------------
Address of [37.7752096, -122.4227735] is: Rich Table, 199, Gough Street, Western Addition, San Francisco, San Francisco City and County, California, 94102, United States
<class 'str'>
In [146]:
print('Getting Locations: ', end='')
addresses = []
for lat, lon in zip(latitude, longitude):
    address = get_address(lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', United States', '') 
    addresses.append(address)
    print(' .', end='')
print(' done.')
Getting Locations:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.
In [180]:
import pandas as pd

df_locations = pd.DataFrame({'Dirección': addresses,
                             'Latitude': latitude,
                             'Longitude': longitude,
                             'X': xs,
                             'Y': ys,
                             'Distance from centroid': distances_from_center})

df_locations.head()
Out[180]:
Dirección Latitude Longitude X Y Distance from centroid
0 San Jose Avenue, Excelsior, San Francisco, San... 37.723793 -122.443598 549033.465339 4.175316e+06 5992.495307
1 nothing found 37.723760 -122.436790 549633.465339 4.175316e+06 5840.376700
2 335, Edinburgh Street, Excelsior, San Francisc... 37.723727 -122.429982 550233.465339 4.175316e+06 5747.173218
3 John McLaren Park Playground, Burrows Street, ... 37.723694 -122.423174 550833.465339 4.175316e+06 5715.767665
4 400, Yale Street, Portola, San Francisco, San ... 37.723661 -122.416365 551433.465339 4.175316e+06 5747.173218
In [181]:
df_locations.shape
Out[181]:
(364, 6)
In [182]:
df_locations.to_pickle('./Dataset/sf_locations.pkl')    

Foursquare

Now we will use the Foursquare API to explore the number of restaurants available within these grids and we will limit the search to food categories to retrieve latitude and longitude data from restaurants and Italian restaurant.

In [183]:
client_id = 'xxx'
client_secret = 'xxx'
VERSION = 'xxx'

We use the Foursquare API to explore the number of restaurants available within 4 km of downtown San Francisco and limit the search to all locations associated with the category of restaurants and especially those that correspond to Italian restaurants.

In [184]:
food_category = '4d4b7105d754a06374d81259' 

sf_italian_categories = ['4bf58dd8d48988d110941735', '55a5a1ebe4b013909087cbb6', '55a5a1ebe4b013909087cb7c', '55a5a1ebe4b013909087cba7',
                       '55a5a1ebe4b013909087cba1', '55a5a1ebe4b013909087cba4', '55a5a1ebe4b013909087cb95', '55a5a1ebe4b013909087cb89',
                       '55a5a1ebe4b013909087cb9b', '55a5a1ebe4b013909087cb98', '55a5a1ebe4b013909087cbbf', '55a5a1ebe4b013909087cb79',
                       '55a5a1ebe4b013909087cbb0', '55a5a1ebe4b013909087cbb3', '55a5a1ebe4b013909087cb74', '55a5a1ebe4b013909087cbaa',
                       '55a5a1ebe4b013909087cb83', '55a5a1ebe4b013909087cb8c', '55a5a1ebe4b013909087cb92', '55a5a1ebe4b013909087cb8f',
                       '55a5a1ebe4b013909087cb86', '55a5a1ebe4b013909087cbb9', '55a5a1ebe4b013909087cb7f', '55a5a1ebe4b013909087cbbc',
                       '55a5a1ebe4b013909087cb9e', '55a5a1ebe4b013909087cbc2', '55a5a1ebe4b013909087cbad'] # 'Food' Catégorie de restaurants cafe
In [185]:
def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'sushi', 'hamburger', 'seafood']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'Restaurante' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', USA', '')
    address = address.replace(', United States', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=1000):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues
In [186]:
import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    sf_italian = {}
    location_restaurants = []

    print('Obtaining the candidates', end='')
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, food_category, client_id, client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_italian = is_restaurant(venue_categories, specific_filter=sf_italian_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_italian, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_italian:
                    sf_italian[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, sf_italian, location_restaurants


restaurants = {}
sf_italian = {}
location_restaurants = []
loaded = False
try:
    with open('/Dataset/restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
        print('Restaurant data loaded.')
    with open('/Dataset/sf_italian_350.pkl', 'rb') as f:
        caba_cafe = pickle.load(f)
        print('Descargando Datos de las Cafeterías')
    with open('/Dataset/location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
        print('Downloading data from San Francisco Restaurants')
    loaded = True
except:
    print('Restaurant Data Downloading')
    pass


if not loaded:
    restaurants, sf_italian, location_restaurants = get_restaurants(latitudes, longitudes)
    
Restaurant Data Downloading
Obtaining the candidates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.
In [187]:
import numpy as np
In [188]:
print('**Results**',)
print('Total Number of Restaurants:', len(restaurants))
print('Total Number of Italian restaurants:', len(sf_italian))
print('Percentage of Italian restaurants: {:.2f}%'.format(len(sf_italian) / len(restaurants) * 100))
print('Average of Venues per grid:', np.array([len(r) for r in location_restaurants]).mean())
**Results**
Total Number of Restaurants: 1681
Total Number of Italian restaurants: 118
Percentage of Italian restaurants: 7.02%
Average of Venues per grid: 4.052197802197802
In [189]:
print('List of All Restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))
List of All Restaurants
-----------------------
('4ceec3b83b03f04de88d3bdc', "Henry's Hunan Restaurant", 37.72218603642267, -122.43659651808754, '4753 Mission St, San Francisco, CA 94112', 176, False, 549651.5465355547, 4175141.074954057)
('4a0e123af964a520c2751fe3', 'Taquerias El Farolito', 37.72122961664814, -122.43739536867459, '4817 Mission St (at Onondaga St), San Francisco, CA 94112', 286, False, 549581.7824803684, 4175034.538650764)
('546960f7498eac74bd5baf47', 'Tao Sushi', 37.721036775089686, -122.4376651904847, '4808 Mission At (Onondaga Ave), San Francisco, CA', 312, False, 549558.1316389883, 4175013.0004052045)
('4b244110f964a520c76424e3', 'Taqueria Guadalajara', 37.7212324569519, -122.43763599260711, '4798 Mission St (at Onondaga Ave), San Francisco, CA 94112', 291, False, 549560.574468874, 4175034.726401493)
('4a6b8478f964a520ecce1fe3', 'Mexico Tipico', 37.72501226746621, -122.43447912554541, '4581 Mission St (at Brazil Ave), San Francisco, CA 94112', 246, False, 549836.2556397481, 4175455.7650877447)
('4a91a3faf964a520171b20e3', 'Beijing Restaurant 北京小馆', 37.723599683798, -122.43719187724251, '1801 Alemany Blvd (at Ocean Ave), San Francisco, CA 94112', 39, False, 549598.1357189683, 4175297.6010806696)
('588e3e6632b072494c6cf57e', 'An Chi', 37.72343008519264, -122.43573516334256, '4683 Mission St, San Francisco, CA 94112', 99, False, 549726.6248046655, 4175279.5569969686)
('4aff274cf964a5200b3522e3', 'Hawaiian Drive Inn #28', 37.72114068878443, -122.43738942911332, '4827 Mission St, San Francisco, CA 94112', 296, False, 549582.3652084664, 4175024.675411926)
('57bd06c8cd10e903763a7664', 'Hwaro', 37.725637597880784, -122.43431782363075, '4516 Mission St, San Francisco, CA 94112', 322, False, 549850.0512717982, 4175525.230272441)
('5941ec67e2ead1688f4f464a', 'El Gran Taco Loco', 37.724746, -122.43448300000001, '4591 Mission St, San Francisco, CA 94112', 230, False, 549836.0926191276, 4175426.2211156166)
...
Total: 1681
In [190]:
print('List of all Italian restaurants')
print('---------------------------')
for r in list(sf_italian.values())[:10]:
    print(r)
print('...')
print('Total:', len(sf_italian))
List of all Italian restaurants
---------------------------
('4be4bf122457a593e2b9aa15', 'Marche Club', 37.728095, -122.432397, '4346 Mission St (btwn Tingley St & Theresa St), San Francisco, CA 94112', 91, True, 550017.6701432205, 4175798.899217597)
('4ef010c00e01e1fde2099099', 'Manzoni', 37.73467816914885, -122.43389799980405, '2790 Diamond St, San Francisco, CA 94131', 302, True, 549880.9832699064, 4176528.490363779)
('5195394d498e344eeb952b4f', 'Trattoria Da Vittorio', 37.739295412112625, -122.46759110305597, '150 West Portal Ave, San Francisco, CA 94127', 151, True, 546909.2447572381, 4177023.347445145)
('4be72d932457a593b8a6ad15', 'Spiazzo Ristorante', 37.74049906835031, -122.46611414213069, '33 West Portal Ave, San Francisco, CA 94127', 306, True, 547038.6154491554, 4177157.632339159)
('4b2edd7df964a520a2e724e3', 'Vega', 37.7391742135669, -122.41743951497574, '419 Cortland Ave (btwn Bennington & Wool), San Francisco, CA 94110', 253, True, 551328.0990331663, 4177036.2170301196)
('4ae4ff0cf964a520f49f21e3', 'VinoRosso', 37.73901245660888, -122.41534272358848, '629 Cortland Ave (at Anderson Street), San Francisco, CA 94110', 263, True, 551512.9563385877, 4177019.42214691)
('49bed272f964a520e3541fe3', 'La Ciccia', 37.74200800946477, -122.42653101682663, '291 30th St (at Church), San Francisco, CA 94131', 311, True, 550525.1341258159, 4177345.6763315448)
('58c6b74f730a925fc305a126', 'Ardiana', 37.74248738572593, -122.42650722060347, '1781 Church St, San Francisco, CA 94131', 306, True, 550526.9048224975, 4177398.875309537)
('4b5fb718f964a5209dc929e3', 'Cafe Stefano', 37.74236536, -122.423196, '59 30th St (btw Mission & San Jose), San Francisco, CA 94110', 16, True, 550818.7219270115, 4177387.1293513896)
('4be1d60c4283c9b68da754f8', 'South Beach Cafe', 37.74791482485267, -122.43318557739258, '800 Embarcadero, San Francisco, CA 94107', 84, True, 549934.8644184133, 4177997.458160931)
...
Total: 118
In [191]:
print('Author Restaurants')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))
Author Restaurants
---------------------------
Restaurants around location 101: 
Restaurants around location 102: Rainbow Cafe
Restaurants around location 103: 
Restaurants around location 104: restaurante pressman@berman, Le Chateau De Bob
Restaurants around location 105: 
Restaurants around location 106: Lolinda, Foreign Cinema, El Techo, Loló, Radio Habana Social Club, Naked Kitchen, Californios, Udupi Palace
Restaurants around location 107: Heirloom Café, Bon, Nene, El Metate, flour + water, Sushi Hon, Mis Antojitos, El Porvenir Produce Market, Sasaki
Restaurants around location 108: La Paz Restaurant Pupuseria, VBOWLS
Restaurants around location 109: 
Restaurants around location 110: ChocolateLab

All restaurants in the city of San Francisco are indicated in gray and those associated with Italian restaurants will be highlighted in red.

In [192]:
map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
folium.Marker(sf_center, popup='San Francisco').add_to(map_sf)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_cafe = res[6]
    color = 'red' if is_cafe else 'grey'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_sf)
map_sf
Out[192]:

Analysis

Now we calculate the distance from the nearest Italian restaurant to each grid (not only those located less than 300 m away, since we also want to know the distance to the nearest center.

In [194]:
distances_to_sf_italian = []

for area_x, area_y in zip(xs, ys):
    min_distance = 100
    for res in sf_italian.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    distances_to_sf_italian.append(min_distance)

df_locations['Distances to the Italian restaurant'] = distances_to_sf_italian
In [195]:
df_locations.head(10)
Out[195]:
Dirección Latitude Longitude X Y Distance from centroid Distances to the Italian restaurant
0 San Jose Avenue, Excelsior, San Francisco, San... 37.723793 -122.443598 549033.465339 4.175316e+06 5992.495307 100.0
1 nothing found 37.723760 -122.436790 549633.465339 4.175316e+06 5840.376700 100.0
2 335, Edinburgh Street, Excelsior, San Francisc... 37.723727 -122.429982 550233.465339 4.175316e+06 5747.173218 100.0
3 John McLaren Park Playground, Burrows Street, ... 37.723694 -122.423174 550833.465339 4.175316e+06 5715.767665 100.0
4 400, Yale Street, Portola, San Francisco, San ... 37.723661 -122.416365 551433.465339 4.175316e+06 5747.173218 100.0
5 Bowdoin Street, Portola, San Francisco, San Fr... 37.723627 -122.409557 552033.465339 4.175316e+06 5840.376700 100.0
6 717, Girard Street, Portola, San Francisco, Sa... 37.723593 -122.402749 552633.465339 4.175316e+06 5992.495307 100.0
7 Archbishop Riordan High School, Judson Avenue,... 37.728524 -122.453776 548133.465339 4.175835e+06 5855.766389 100.0
8 212, Judson Avenue, Ingleside, San Francisco, ... 37.728492 -122.446967 548733.465339 4.175835e+06 5604.462508 100.0
9 Samoan Assemblies of God, 1819, San Jose Avenu... 37.728460 -122.440159 549333.465339 4.175835e+06 5408.326913 100.0
In [196]:
print('Average distance in meters from the nearest coffee shop to each center:', df_locations['Distances to the Italian restaurant'].mean())
Average distance in meters from the nearest coffee shop to each center: 98.57250001080786

We use HeatMap with Mapbox to visualize the density of restaurants in the selected radio from downtown San Francisco.

In [197]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

italian_latlons = [[res[2], res[3]] for res in sf_italian.values()]
In [198]:
from folium import plugins
from folium.plugins import HeatMap

map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
folium.Circle(sf_center, radius=1000, fill=False, color='white').add_to(map_sf)
folium.Circle(sf_center, radius=2000, fill=False, color='blue').add_to(map_sf)
folium.Circle(sf_center, radius=3000, fill=False, color='red').add_to(map_sf)
map_sf
Out[198]:

Now we present another visualization with a Heatmap of only Italian restaurants

In [199]:
map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
HeatMap(italian_latlons).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
folium.Circle(sf_center, radius=1000, fill=False, color='white').add_to(map_sf)
folium.Circle(sf_center, radius=2000, fill=False, color='blue').add_to(map_sf)
folium.Circle(sf_center, radius=3000, fill=False, color='red').add_to(map_sf)
map_sf
Out[199]:

From the above maps, we found that most of the restaurants are scattered on the north side of the center of the area under study. We will focus on the areas with the lowest density to locate the candidates.

In [200]:
roi_x_min = sf_center_x - 2000
roi_y_max = sf_center_y + 1000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 1900
roi_center_y = roi_y_max - 700
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]
map_caba = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sf)
map_sf
Out[200]:

Now we build a grid again to locate the candidates and the main tourist attractions.

In [201]:
k = math.sqrt(3) / 2 
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 2501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'Locations with possible candidates.')
2120 Locations with possible candidates.

We calculate two more important things for each candidate location: the number of nearby restaurants (we will use a radius of 250 meters) and the distance to the nearest Italian restaurant.

In [216]:
def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_italian_distances = []

print('Generating the data of potential candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, sf_italian)
    roi_italian_distances.append(distance)
print('done.')
Generating the data of potential candidates... done.
In [217]:
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Nearby Restaurants':roi_restaurant_counts,
                                 'Distance to nearby Italian restaurants':roi_italian_distances})


df_roi_locations.sort_values(by=['Nearby Restaurants'], ascending=False, inplace=True)

df_roi_locations.head(5)
Out[217]:
Latitude Longitude X Y Nearby Restaurants Distance to nearby Italian restaurants
1988 37.795104 -122.405581 552333.465339 4.183248e+06 43 161.782747
1955 37.794320 -122.405020 552383.465339 4.183162e+06 43 104.505932
1987 37.795109 -122.406717 552233.465339 4.183248e+06 37 242.223144
1451 37.785083 -122.431214 550083.465339 4.182122e+06 36 322.730435
1954 37.794326 -122.406156 552283.465339 4.183162e+06 35 202.651736
In [218]:
df_roi_locations.shape
Out[218]:
(2120, 6)

Now we are going to filter these places: we are only interested in locations with no more than two restaurants within a radius of 250 meters and no Italian Restaurant within a perimeter of 400 meters.

In [219]:
good_res_count = np.array((df_roi_locations['Nearby Restaurants']<=2))
print('Places with no more than two restaurants nearby:', good_res_count.sum())

good_ind_distance = np.array(df_roi_locations['Distance to nearby Italian restaurants']>=400)
print('Grids without Italian restaurants within 400 m.:', good_ind_distance.sum())

good_locations = np.logical_and(good_res_count, good_ind_distance)
print('Places with both conditions met:', good_locations.sum())

df_good_locations = df_roi_locations[good_locations]
Places with no more than two restaurants nearby: 596
Grids without Italian restaurants within 400 m.: 823
Places with both conditions met: 356
In [220]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf) 
map_sf
Out[220]:
In [215]:
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(good_locations, radius=25).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf
Out[215]:

Now we are going to group these locations using a machine learning algorithm in this case K-medias to create 8 groups that contain good locations. These areas, their centers and addresses will be the final result of our analysis.

In [221]:
from sklearn.cluster import KMeans

number_of_clusters = 8

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_caba = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='gray', fill=True, fill_opacity=0.25).add_to(map_sf) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf
Out[221]:

Let's look at these areas west and south of the city with a Heatmap, using shaded areas to indicate the 8 groups created:

In [222]:
map_caba = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='white', fill=False).add_to(map_sf) 
map_sf
Out[222]:

Now we are going to list the candidate locations

In [223]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of recommended locations')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(lat, lon)
    addr = addr.replace(', United States', '')
    addr = addr.replace(', San Francisco', '')
    addr = addr.replace(', USA', '')
    addr = addr.replace(', SF', '')
    addr = addr.replace("'", '')
    candidate_area_addresses.append(addr)    
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, sf_center_x, sf_center_y)
    print('{}{} => {:.1f}km from downtown San Francisco'.format(addr, ' '*(50-len(addr)), d/1000))
    
==============================================================
Addresses of recommended locations
==============================================================

nothing found                                      => 2.1km from downtown San Francisco
1049, Laguna Street, Western Addition City and County, California, 94115 => 0.7km from downtown San Francisco
355, Buena Vista Avenue East, Haight-Ashbury City and County, California, 94117 of America => 1.7km from downtown San Francisco
219, Saint Josephs Avenue, Western Addition City and County, California, 94115 => 1.8km from downtown San Francisco
20th Street, Liberty Street Historic District City and County, California, 94143 => 1.9km from downtown San Francisco
nothing found                                      => 1.6km from downtown San Francisco
2801, Pacific Avenue, Pacific Heights City and County, California, 94123 => 2.5km from downtown San Francisco
2247, Octavia Street, Japantown City and County, California, 94109 => 2.0km from downtown San Francisco

Results

In [224]:
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Circle(sf_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_sf)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_sf)     
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_sf)
map_sf
Out[224]:

The above locations are quite close to downtown San Francisco and each of these locations has no more than two restaurants within a radius of 250 m, no Italian Restaurant 400 m away. Any of these establishments is a potential candidate for the new restaurant, at least considering the nearby competition. The K-means unsupervised learning algorithm has allowed us to group the 8 locations with an appropriate choice for interested parties to choose from the results presented below.

Conclusions

The objective of this project was to identify the areas of San Francisco near the center, with a small number of restaurants (especially Italian restaurants) to help stakeholders reduce the search for an optimal location for a new Italian restaurant.

When calculating the distribution of restaurant density from the Foursquare API data, it is possible to generate a large collection of locations that meet certain basic requirements.

This data was then grouped using machine learning algorithms (K-means) to create the main areas of interest (containing the greatest number of potential locations) and the addresses of these area centers were created. From this interpretation we can have a starting point for the final exploration by the interested parties.

Interested parties will make the final decision on the optimal location of the restaurants based on the specific characteristics and locations of the neighborhood in each recommended area, taking into account additional factors such as the attractiveness of each location (proximity to a park or water), levels of noise / main roads. real estate availability, price, social and economic dynamics of each neighborhood, etc.

Finally, a more complete analysis and future work should integrate data from other external databases.

References