Metro Network: How It Was Made

The Madrid Metro section in Network Centrality uses two pre-built CSV files: metro_nodes.csv and metro_edges.csv. This page documents how those files were produced from the raw GTFS data.

You don’t need to run any of this. It’s here for reference if you’re curious about the data pipeline, or if you want to adapt it for another city’s transit network.

What is GTFS?

GTFS (General Transit Feed Specification) is the standard format transit agencies use to publish their schedules. Many agencies release their feeds as open data. Madrid’s comes from the Consorcio Regional de Transportes de Madrid (CRTM).

A GTFS feed is a set of linked CSV files (with .txt extensions). For building a network graph, we need four of them:

| File | What it contains |
| --- | --- |
| routes.txt | One row per metro line (name, colour) |
| trips.txt | Links individual trips to their route |
| stop_times.txt | Ordered sequence of stops for each trip |
| stops.txt | Station names and geographic coordinates |

Loading the raw data

import pandas as pd

gtfs_path = "../../data/madrid_metro_gtfs"

routes = pd.read_csv(f"{gtfs_path}/routes.txt", encoding="utf-8-sig")
trips = pd.read_csv(f"{gtfs_path}/trips.txt", encoding="utf-8-sig")
stop_times = pd.read_csv(f"{gtfs_path}/stop_times.txt", encoding="utf-8-sig")
stops = pd.read_csv(f"{gtfs_path}/stops.txt", encoding="utf-8-sig")

print(f"Routes: {len(routes)}, Trips: {len(trips)}, "
      f"Stop times: {len(stop_times)}, Stops: {len(stops)}")
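
The encoding="utf-8-sig" argument above is presumably there to cope with a byte-order mark (BOM) at the start of the files. The source does not say whether the Madrid feed actually contains one. A minimal sketch with made-up bytes (the header and the r1 row are illustrative, not taken from the real feed) shows what the two codecs do when a BOM is present:

```python
import csv
import io

# Hypothetical file content: a UTF-8 BOM (EF BB BF) followed by a tiny CSV.
raw = b"\xef\xbb\xbfroute_id,route_short_name\nr1,1\n"

# Decoding as plain UTF-8 keeps the BOM as an invisible first character.
header_plain = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# Decoding as utf-8-sig strips the BOM if it is there.
header_sig = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(repr(header_plain[0]))
print(repr(header_sig[0]))
```

With the plain codec, the first column name carries a hidden "\ufeff" prefix, so a lookup by "route_id" would fail. The utf-8-sig codec also reads files without a BOM correctly, which makes it a safe default here.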

Picking one trip per line

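Each route has hundreds of trips (different departure times, weekday vs. weekend schedules). We assume they all visit the same stations in the same order, so one trip per line is enough to extract the station sequence. We keep trips with direction_id == 0 (one direction of travel) and take the first trip listed for each route. This assumption can fail on routes with short-turn or branching services, where the first trip may not cover every station.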

first_trips = (
    trips[trips["direction_id"] == 0]
    .groupby("route_id")["trip_id"]
    .first()
    .reset_index()
)
first_trips.head()
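
If the "same stations in the same order" assumption is in doubt for a feed, one more defensive option is to pick the trip with the most stops on each route instead of the first one. This is not what the pipeline above does. It is a sketch on toy tables, and the names stops_per_trip, candidates and longest_trips, as well as all the values, are illustrative.

```python
import pandas as pd

# Toy data: route A has a short-turn trip and a full-length trip.
trips = pd.DataFrame({
    "route_id": ["A", "A", "B"],
    "trip_id": ["a_short", "a_full", "b_1"],
    "direction_id": [0, 0, 0],
})
stop_times = pd.DataFrame({
    "trip_id": ["a_short"] * 2 + ["a_full"] * 3 + ["b_1"] * 2,
    "stop_id": ["s1", "s2", "s1", "s2", "s3", "s4", "s5"],
    "stop_sequence": [1, 2, 1, 2, 3, 1, 2],
})

# Count how many stops each trip visits.
stops_per_trip = (
    stop_times.groupby("trip_id").size().rename("n_stops").reset_index()
)

# Keep one direction, then take the trip with the most stops per route.
candidates = trips[trips["direction_id"] == 0].merge(stops_per_trip, on="trip_id")
longest_trips = (
    candidates
    .sort_values(["route_id", "n_stops"], ascending=[True, False])
    .drop_duplicates("route_id")[["route_id", "trip_id"]]
    .reset_index(drop=True)
)
print(longest_trips)
```

Picking the longest trip still misses stations on a branch that the longest trip does not serve, so a branching line would need the union of several trips.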

Joining the tables

The four files are linked by shared keys: route_id connects routes to trips, trip_id connects trips to stop sequences, and stop_id connects stop sequences to station coordinates. We chain these joins together.

trip_stops = (
    first_trips
    .merge(stop_times, on="trip_id")
    .merge(stops[["stop_id", "stop_name", "stop_lat", "stop_lon"]], on="stop_id")
    .merge(routes[["route_id", "route_short_name", "route_color"]], on="route_id")
    .sort_values(["route_id", "stop_sequence"])
)
trip_stops[["route_short_name", "stop_sequence", "stop_name",
            "stop_lat", "stop_lon"]].head(10)

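The joins above use the default inner merge, which silently drops rows whose key has no match and multiplies rows when a key is duplicated. A hedged sketch of how to surface both problems uses the validate and indicator parameters of DataFrame.merge on toy tables. All values are made up, and stop "s2" is duplicated on purpose.

```python
import pandas as pd

# Toy tables (hypothetical values); "s9" has no entry in stops.
stop_times = pd.DataFrame({
    "trip_id": ["t1", "t1", "t1"],
    "stop_id": ["s1", "s2", "s9"],
    "stop_sequence": [1, 2, 3],
})
stops = pd.DataFrame({
    "stop_id": ["s1", "s2", "s2"],
    "stop_name": ["Alpha", "Beta", "Beta (duplicate)"],
})

# validate="many_to_one" raises if stop_id is not unique in the right table.
try:
    stop_times.merge(stops, on="stop_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# A left join with indicator=True keeps unmatched rows and labels them.
unique_stops = stops.drop_duplicates("stop_id")
checked = stop_times.merge(unique_stops, on="stop_id", how="left",
                           validate="many_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "stop_id"].tolist()
print(duplicate_detected, unmatched)
```

An empty unmatched list and no MergeError would suggest that the inner joins lose nothing and add nothing.
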
Building node and edge lists

For each line, we walk through the ordered stations and create an edge between each consecutive pair. We also collect unique stations with their coordinates.

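edges_rows = []   # one dict per pair of consecutive stations on a line
nodes_dict = {}   # station name -> (lat, lon); shared names collapse to one node

for route_id, group in trip_stops.groupby("route_id"):
    # Order the stops along the line, then read the line-level attributes.
    ordered = group.sort_values("stop_sequence")
    colour = f"#{ordered['route_color'].iloc[0]}"  # GTFS stores hex colours without '#'
    line_name = str(ordered["route_short_name"].iloc[0])
    stop_list = ordered[["stop_name", "stop_lat", "stop_lon"]].values.tolist()

    # Register every station, so even a one-stop sequence yields a node.
    for name, lat, lon in stop_list:
        nodes_dict[name] = (lat, lon)

    # Link each station to the next one on the same line.
    for (name_a, _, _), (name_b, _, _) in zip(stop_list, stop_list[1:]):
        edges_rows.append({
            "station_a": name_a,
            "station_b": name_b,
            "line": line_name,
            "color": colour,
        })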
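
The core of the loop above is pairing each station with its successor. Stripped of the metro-specific columns, the idea can be sketched as follows. The helper consecutive_pairs and the station labels are hypothetical and exist only for illustration.

```python
def consecutive_pairs(stations):
    """Return (current, next) pairs for an ordered list of stations."""
    # zip stops at the shorter input, so the last station has no successor.
    return list(zip(stations, stations[1:]))


line = ["A", "B", "C", "D"]  # placeholder station names in travel order
edges = consecutive_pairs(line)
print(edges)
```

A line with n stations yields n - 1 edges, and a line with a single station yields none.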

Saving to CSV

The result is two simple files that the main lesson can load without any GTFS knowledge.

nodes_df = pd.DataFrame([
    {"station": name, "lat": lat, "lon": lon}
    for name, (lat, lon) in sorted(nodes_dict.items())
])
edges_df = pd.DataFrame(edges_rows)

nodes_df.to_csv(f"{gtfs_path}/metro_nodes.csv", index=False)
edges_df.to_csv(f"{gtfs_path}/metro_edges.csv", index=False)

print(f"Saved {len(nodes_df)} stations and {len(edges_df)} connections")

metro_nodes.csv has columns: station, lat, lon.

metro_edges.csv has columns: station_a, station_b, line, color.

Interchange stations (like Sol or Avenida de América) appear once in the nodes file but show up in multiple edges, one or more per line that serves them.
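
As a sanity check on that last point, interchange stations can be recovered from the edge list alone by counting how many distinct lines touch each station. The sketch below runs on a toy edge table with placeholder station names rather than the real metro_edges.csv. Only the column names station_a, station_b and line come from the files described above.

```python
import pandas as pd

# Toy edge list: line 1 runs A-B-C, line 2 runs B-D, so B is the interchange.
edges_df = pd.DataFrame({
    "station_a": ["A", "B", "B"],
    "station_b": ["B", "C", "D"],
    "line": ["1", "1", "2"],
})

# Stack both endpoint columns into one "station" column, keeping the line.
endpoints = edges_df.melt(
    id_vars="line",
    value_vars=["station_a", "station_b"],
    value_name="station",
)

# Count distinct lines per station; more than one means an interchange.
lines_per_station = endpoints.groupby("station")["line"].nunique()
interchanges = sorted(lines_per_station[lines_per_station > 1].index)
print(interchanges)
```

Because nodes are keyed by station name, this only works when every line uses exactly the same name for a shared station.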