The Italian restaurants of San Francisco are part of the culture of the city, the customs of its inhabitants, and its tourist circuit. They have been the subject of study by writers, the inspiration for countless artistic creations, and traditional meeting places. In this project, the goal is to find an optimal location for a new Italian restaurant using machine learning techniques from "The Battle of Neighborhoods: Coursera Capstone Project" course (1). Treating Italian restaurants as a subset of restaurants in general, we will first try to detect candidate locations based on factors that will influence our decision:
1- Places that are not already full of restaurants.
2- Areas with few or no cafés nearby.
3- As close to the city center as possible, provided the first two conditions are met.
With these simple parameters we will program an algorithm and see what solutions it produces.
The following data sources will be needed to extract and generate the required information:
1- The centers of the candidate areas will be generated programmatically, and the approximate addresses of these centers will be obtained using one of the Geopy geocoder packages. (2)
2- The number of restaurants, their type, and their location in each neighborhood will be obtained using the Foursquare API. (3)
The data will be used in the following scenarios:
1- To discover the density of all restaurants and cafes from the data extracted.
2- To identify areas that are not very dense and not very competitive.
3- To calculate the distances between competing restaurants.
The target area will be the city center, where tourist attractions are more concentrated than elsewhere. From there we will create a grid of cells covering an area of interest of about 12 x 12 kilometers centered on downtown San Francisco.
import requests
from geopy.geocoders import Nominatim
address = '199 Gough St, San Francisco, CA 94102, USA'
geolocator = Nominatim(user_agent="usa_explorer")
location = geolocator.geocode(address)
lat = location.latitude
lng = location.longitude
sf_center = [lat, lng]
print('Coordinate of {}: {}'.format(address, sf_center), ' location : ', location)
We create a grid of equidistant candidate areas centered on the city center and extending about 6 km around this point. To do so, we work in a 2D Cartesian coordinate system, which lets us compute the distances we need in meters.
Next, we will project these coordinates back to latitude/longitude degrees so they can be displayed on maps with Mapbox and Folium (3).
#!pip install shapely
import shapely.geometry
#!pip install pyproj
import pyproj
import math

def lonlat_to_xy(lon, lat):
    # Project WGS84 lon/lat to UTM zone 10N (meters).
    # Note: pyproj.transform is deprecated in pyproj 2.x.
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=10, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    # Inverse projection: UTM zone 10N back to WGS84 lon/lat.
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=10, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    # Euclidean distance in meters between two projected points.
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)
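Since `pyproj.transform` is deprecated in pyproj 2.x, the same conversion can be written with the `Transformer` API on newer installs. A sketch, assuming the same UTM zone 10N used above (EPSG:32610, which covers San Francisco); the `lonlat_to_xy2` / `xy_to_lonlat2` names are illustrative alternatives, not part of the notebook:

```python
# Alternative projection helpers using pyproj's Transformer API (pyproj 2.2+),
# equivalent to lonlat_to_xy / xy_to_lonlat above.
# EPSG:32610 = WGS84 / UTM zone 10N, which covers San Francisco.
from pyproj import Transformer

_to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32610", always_xy=True)
_to_lonlat = Transformer.from_crs("EPSG:32610", "EPSG:4326", always_xy=True)

def lonlat_to_xy2(lon, lat):
    return _to_utm.transform(lon, lat)

def xy_to_lonlat2(x, y):
    return _to_lonlat.transform(x, y)

# Round-trip check on an approximate downtown San Francisco point
lon, lat = -122.4232, 37.7714
x, y = lonlat_to_xy2(lon, lat)
lon2, lat2 = xy_to_lonlat2(x, y)
print(abs(lon2 - lon) < 1e-6 and abs(lat2 - lat) < 1e-6)  # → True
```

Building the `Transformer` objects once and reusing them is also noticeably faster than calling `pyproj.transform` in a loop.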
print('Coordinate Verification')
print('-------------------------------')
print('San Francisco Center Union Square longitude={}, latitude={}'.format(sf_center[1], sf_center[0]))
x, y = lonlat_to_xy(sf_center[1], sf_center[0])
print('San Francisco Center Union Square UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('San Francisco Center Union Square longitude={}, latitude={}'.format(lo, la))
We create a hexagonal grid of cells: we offset alternate rows and adjust the vertical spacing so that each cell center is equidistant from all of its neighbors.
sf_center_x, sf_center_y = lonlat_to_xy(sf_center[1], sf_center[0]) # City center in Cartesian coordinates
k = math.sqrt(3) / 2 # Vertical offset factor for hexagonal grid cells
x_min = sf_center_x - 6000
x_step = 600
y_min = sf_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k
latitude = []
longitude = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i % 2 == 0 else 0  # shift alternate rows by half a cell
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(sf_center_x, sf_center_y, x, y)
        if distance_from_center <= 6001:
            lon, lat = xy_to_lonlat(x, y)
            latitude.append(lat)
            longitude.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)
print(len(latitude), 'Union Square San Francisco grid - SF')
Let's look at the data we have so far: the city center location and the candidate area centers:
import folium
tileset = r'https://api.mapbox.com'  # Mapbox endpoint (a full tile URL template with an access token is required)
attribution = (r'Map data © <a href="http://openstreetmap.org">OpenStreetMap</a>'
               ' contributors, Imagery © <a href="http://mapbox.com">MapBox</a>')
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Marker(sf_center, popup='San Francisco').add_to(map_sf)
for lat, lon in zip(latitude, longitude):
    folium.Circle([lat, lon], radius=300, color='purple', fill=False).add_to(map_sf)
map_sf
At this point we have the coordinates of the candidate area centers to be evaluated, equally spaced (each point is the same distance from all of its neighbors) and within about 6 km of downtown San Francisco.
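The equidistance property of this hexagonal layout can be verified with a small self-contained sketch: a toy grid built with the same geometry (horizontal step s, vertical step s·√3/2, alternate rows shifted by s/2), independent of the notebook variables:

```python
import math

# Build a small hexagonal grid with the same geometry as above:
# horizontal step s, vertical step s*sqrt(3)/2, alternate rows shifted by s/2.
s = 600.0
k = math.sqrt(3) / 2
points = []
for i in range(6):
    x_offset = s / 2 if i % 2 == 0 else 0.0
    for j in range(6):
        points.append((j * s + x_offset, i * s * k))

# For an interior point, the six nearest neighbors should all sit
# at exactly distance s (up to floating-point error).
cx, cy = points[2 * 6 + 2]  # an interior grid point
dists = sorted(math.hypot(x - cx, y - cy) for (x, y) in points if (x, y) != (cx, cy))
nearest_six = dists[:6]
print([round(d, 6) for d in nearest_six])  # → [600.0, 600.0, 600.0, 600.0, 600.0, 600.0]
```

The in-row neighbors are at distance s by construction, and the diagonal neighbors are at √((s/2)² + (s·√3/2)²) = s, which is exactly why the √3/2 vertical factor is used.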
def get_address(lat, lng):
    # Reverse-geocode a coordinate pair into a street address
    try:
        geolocator = Nominatim(user_agent="usa_explorer")
        location = geolocator.reverse([lat, lng])
        return location.address
    except Exception:
        return 'nothing found'

addr = get_address(sf_center[0], sf_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(sf_center[0], sf_center[1], addr))
print('Getting Locations: ', end='')
addresses = []
for lat, lon in zip(latitude, longitude):
    address = get_address(lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', United States', '')
    addresses.append(address)
    print(' .', end='')
print(' done.')
import pandas as pd
df_locations = pd.DataFrame({'Address': addresses,
'Latitude': latitude,
'Longitude': longitude,
'X': xs,
'Y': ys,
'Distance from centroid': distances_from_center})
df_locations.head()
df_locations.shape
df_locations.to_pickle('./Dataset/sf_locations.pkl')
Now we will use the Foursquare API to explore the number of restaurants within each grid cell, limiting the search to food categories to retrieve the latitude and longitude of all restaurants and of Italian restaurants in particular.
client_id = 'xxx'
client_secret = 'xxx'
VERSION = 'xxx'
We use the Foursquare API to explore the restaurants available within about 6 km of downtown San Francisco, limiting the search to venues in the restaurant category and especially to those corresponding to Italian restaurants.
food_category = '4d4b7105d754a06374d81259'
sf_italian_categories = ['4bf58dd8d48988d110941735', '55a5a1ebe4b013909087cbb6', '55a5a1ebe4b013909087cb7c', '55a5a1ebe4b013909087cba7',
'55a5a1ebe4b013909087cba1', '55a5a1ebe4b013909087cba4', '55a5a1ebe4b013909087cb95', '55a5a1ebe4b013909087cb89',
'55a5a1ebe4b013909087cb9b', '55a5a1ebe4b013909087cb98', '55a5a1ebe4b013909087cbbf', '55a5a1ebe4b013909087cb79',
'55a5a1ebe4b013909087cbb0', '55a5a1ebe4b013909087cbb3', '55a5a1ebe4b013909087cb74', '55a5a1ebe4b013909087cbaa',
'55a5a1ebe4b013909087cb83', '55a5a1ebe4b013909087cb8c', '55a5a1ebe4b013909087cb92', '55a5a1ebe4b013909087cb8f',
'55a5a1ebe4b013909087cb86', '55a5a1ebe4b013909087cbb9', '55a5a1ebe4b013909087cb7f', '55a5a1ebe4b013909087cbbc',
'55a5a1ebe4b013909087cb9e', '55a5a1ebe4b013909087cbc2', '55a5a1ebe4b013909087cbad'] # Foursquare category IDs for Italian restaurant types
def is_restaurant(categories, specific_filter=None):
    # Decide whether a venue is a restaurant and whether it matches
    # the specific (Italian) category filter
    restaurant_words = ['restaurant', 'sushi', 'hamburger', 'seafood']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific
def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', USA', '')
    address = address.replace(', United States', '')
    return address
def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=1000):
    version = '20180724'
    url = ('https://api.foursquare.com/v2/venues/explore'
           '?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}').format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]
    except Exception:
        venues = []
    return venues
import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    sf_italian = {}
    location_restaurants = []
    print('Obtaining the candidates', end='')
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, food_category, client_id, client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_italian = is_restaurant(venue_categories, specific_filter=sf_italian_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_italian, x, y)
                if venue_distance <= 300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_italian:
                    sf_italian[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, sf_italian, location_restaurants
restaurants = {}
sf_italian = {}
location_restaurants = []
loaded = False
try:
    with open('./Dataset/restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    with open('./Dataset/sf_italian_350.pkl', 'rb') as f:
        sf_italian = pickle.load(f)
    print('Italian restaurant data loaded.')
    with open('./Dataset/location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Per-location restaurant data loaded.')
    loaded = True
except Exception:
    print('Restaurant data not found locally; downloading from Foursquare')
if not loaded:
    restaurants, sf_italian, location_restaurants = get_restaurants(latitude, longitude)
import numpy as np
print('**Results**')
print('Total Number of Restaurants:', len(restaurants))
print('Total Number of Italian restaurants:', len(sf_italian))
print('Percentage of Italian restaurants: {:.2f}%'.format(len(sf_italian) / len(restaurants) * 100))
print('Average number of restaurants per grid cell:', np.array([len(r) for r in location_restaurants]).mean())
print('List of All Restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
print(r)
print('...')
print('Total:', len(restaurants))
print('List of all Italian restaurants')
print('---------------------------')
for r in list(sf_italian.values())[:10]:
print(r)
print('...')
print('Total:', len(sf_italian))
print('Restaurants around selected locations')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))
All restaurants in the city of San Francisco are shown in gray, with Italian restaurants highlighted in red.
map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
folium.Marker(sf_center, popup='San Francisco').add_to(map_sf)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_italian = res[6]
    color = 'red' if is_italian else 'grey'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_sf)
map_sf
Now we calculate the distance from each grid cell to the nearest Italian restaurant (not only those within 300 m, since we also want to know the distance from each cell center to its nearest competitor).
distances_to_sf_italian = []
for area_x, area_y in zip(xs, ys):
    min_distance = 10000  # large sentinel value in meters
    for res in sf_italian.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d < min_distance:
            min_distance = d
    distances_to_sf_italian.append(min_distance)
df_locations['Distances to the Italian restaurant'] = distances_to_sf_italian
df_locations.head(10)
print('Average distance in meters from each cell center to the nearest Italian restaurant:', df_locations['Distances to the Italian restaurant'].mean())
We use a heatmap with Mapbox to visualize the density of restaurants within the selected radius of downtown San Francisco.
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]
italian_latlons = [[res[2], res[3]] for res in sf_italian.values()]
from folium import plugins
from folium.plugins import HeatMap
map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
folium.Circle(sf_center, radius=1000, fill=False, color='white').add_to(map_sf)
folium.Circle(sf_center, radius=2000, fill=False, color='blue').add_to(map_sf)
folium.Circle(sf_center, radius=3000, fill=False, color='red').add_to(map_sf)
map_sf
Now we present another visualization: a heatmap of Italian restaurants only.
map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
HeatMap(italian_latlons).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
folium.Circle(sf_center, radius=1000, fill=False, color='white').add_to(map_sf)
folium.Circle(sf_center, radius=2000, fill=False, color='blue').add_to(map_sf)
folium.Circle(sf_center, radius=3000, fill=False, color='red').add_to(map_sf)
map_sf
From the above maps, we can see that most restaurants are concentrated on the north side of the center of the study area. We will focus on the lower-density areas to locate candidates.
roi_x_min = sf_center_x - 2000
roi_y_max = sf_center_y + 1000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 1900
roi_center_y = roi_y_max - 700
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]
map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sf)
map_sf
Now we build a grid again to locate the candidates and the main tourist attractions.
k = math.sqrt(3) / 2
x_step = 100
y_step = 100 * k
roi_y_min = roi_center_y - 2500
roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i % 2 == 0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if d <= 2501:
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)
print(len(roi_latitudes), 'locations with possible candidates.')
We calculate two more important things for each candidate location: the number of nearby restaurants (we will use a radius of 250 meters) and the distance to the nearest Italian restaurant.
def count_restaurants_nearby(x, y, restaurants, radius=250):
    # Count restaurants within the given radius (meters) of point (x, y)
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d <= radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    # Distance in meters from (x, y) to the nearest restaurant
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d <= d_min:
            d_min = d
    return d_min
roi_restaurant_counts = []
roi_italian_distances = []
print('Generating the data of potential candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, sf_italian)
    roi_italian_distances.append(distance)
print('done.')
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
'Longitude':roi_longitudes,
'X':roi_xs,
'Y':roi_ys,
'Nearby Restaurants':roi_restaurant_counts,
'Distance to nearby Italian restaurants':roi_italian_distances})
df_roi_locations.sort_values(by=['Nearby Restaurants'], ascending=False, inplace=True)
df_roi_locations.head(5)
df_roi_locations.shape
Now we are going to filter these places: we are only interested in locations with no more than two restaurants within a radius of 250 meters and no Italian restaurant within 400 meters.
good_res_count = np.array((df_roi_locations['Nearby Restaurants']<=2))
print('Places with no more than two restaurants nearby:', good_res_count.sum())
good_ind_distance = np.array(df_roi_locations['Distance to nearby Italian restaurants']>=400)
print('Grids without Italian restaurants within 400 m.:', good_ind_distance.sum())
good_locations = np.logical_and(good_res_count, good_ind_distance)
print('Places with both conditions met:', good_locations.sum())
df_good_locations = df_roi_locations[good_locations]
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values
good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(good_locations, radius=25).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf
Now we group these locations with a machine learning algorithm, in this case K-means, to create 8 clusters of good locations. These areas, their centers, and their addresses will be the final result of our analysis.
from sklearn.cluster import KMeans
number_of_clusters = 8
good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)
cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='gray', fill=True, fill_opacity=0.25).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf
Let's look at these areas west and south of the city center on a heatmap, using shaded areas to indicate the 8 clusters created:
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='white', fill=False).add_to(map_sf)
map_sf
Now we list the candidate locations:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of recommended locations')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(lat, lon)
    addr = addr.replace(', United States', '')
    addr = addr.replace(', San Francisco', '')
    addr = addr.replace(', USA', '')
    addr = addr.replace(', SF', '')
    addr = addr.replace("'", '')
    candidate_area_addresses.append(addr)
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, sf_center_x, sf_center_y)
    print('{}{} => {:.1f}km from downtown San Francisco'.format(addr, ' '*(50-len(addr)), d/1000))
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Circle(sf_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_sf)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_sf)
map_sf
The locations above are fairly close to downtown San Francisco, and each has no more than two restaurants within 250 m and no Italian restaurant within 400 m. Any of them is a potential candidate for the new restaurant, at least with respect to nearby competition. The K-means unsupervised learning algorithm grouped the candidates into 8 areas from which interested parties can choose, based on the results presented above.
The objective of this project was to identify areas of San Francisco near the center with a small number of restaurants (especially Italian restaurants), to help stakeholders narrow down the search for an optimal location for a new Italian restaurant.
By calculating the distribution of restaurant density from the Foursquare API data, we were able to generate a large collection of locations that meet the basic requirements.
These locations were then grouped with a machine learning algorithm (K-means) into major areas of interest (those containing the greatest number of potential locations), and the addresses of the area centers were obtained. This interpretation provides a starting point for final exploration by the interested parties.
Interested parties will make the final decision on the optimal restaurant location based on the specific characteristics of each recommended area, taking into account additional factors such as the attractiveness of each location (proximity to a park or the water), noise levels and main roads, real estate availability, prices, and the social and economic dynamics of each neighborhood.
Finally, a more complete analysis, and future work, should integrate data from additional external databases.