Traffic simulation systems are highly complex and require huge computational effort, especially for large-scale models where dynamic simulation represents the evolution of congestion over time. The Aimsun team is always mindful of this challenge and actively seeks ways to reduce the computational effort of our own real-time traffic simulation system, Aimsun Live, particularly for very large networks.
Machine learning (ML) is sometimes proposed as an alternative to simulation-based approaches. ML models are efficient to build from data and, once built, produce predictions directly from input data. Offering real-time traffic predictions is not a challenge, with throughput times down to one minute. However, ML models present serious difficulties when they must provide predictions for a target variable that is not observed, or in situations not represented in the training dataset. That is because ML models seek correlations (linear or non-linear relationships) between explanatory and target variables, not causality. Causal inference or causal modelling may solve this challenge, and the literature describes some approaches that use causal inference theory and graph-based modelling for time-series forecasting. However, at normal levels of traffic observability, which even in the best cases cover only a small portion of the network, causal inference is currently infeasible.
On the other hand, traffic simulation systems require considerable effort to model traffic demand and the network so that they can provide causal inference of what will happen, depending on the estimated demand and the state of supply, even in locations with no observability. However, modelling the demand requires weeks of work, and usually only the main demand patterns are modelled, such as typical workdays, weekends, and holidays. Demand is therefore represented only at the resolution that this set of patterns defines. Any change in demand that goes beyond these patterns requires new patterns to be generated so that simulation outputs mirror what happens in real life; if demand shifts are frequent, modelling new demand patterns may become infeasible.
It appears that neither simulation nor ML alone can provide the perfect solution of cheap and fast modelling, cheap and fast adaptation to new contexts, and robust predictions in new and unseen circumstances. Aimsun Live combines simulation and ML for this reason, using ML-based predictions to adjust the demand dynamically. But there is still a long way to go. With this in mind, we at Aimsun analyzed how ML models perform in estimating traffic flow in unobservable sections (sections with no real-time observations) given an offline training dataset with data in all sections, and how such a model performs when faced with shifts in demand.
We took two real datasets from two cities: Bergen (25 months of data: 2018, 2019 and March 2021) and Wiesbaden (11 months of data: August 2020 to June 2021). We then extracted demand patterns from these datasets (8 demand patterns in Bergen and 12 in Wiesbaden), where each pattern represents 24 hours of demand in 15-minute intervals, i.e., the demand of one type of day. Finally, we simulated each type of day under normal supply conditions, generating a synthetic dataset of 8 days for Bergen and 12 days for Wiesbaden. With this synthetic dataset we trained and tested an ML model using 5-fold cross-validation. This means that for each city, we trained 5 ML models with 80% of the synthetic days and tested each model with the remaining 20% of the days. However, the ML model was trained to receive data only from observable sections (those with real loop detectors) and to estimate traffic flow in unobservable sections. This means that it used data from 110 sections to estimate the traffic flow in 3732 sections in Bergen, and data from 322 sections to estimate the traffic flow in 4403 sections in Wiesbaden.
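To make the evaluation setup concrete, here is a minimal sketch of splitting synthetic days into 5 folds so that every day is used for testing exactly once. It is an illustration under stated assumptions, not the code used in the study: the function name `k_fold_days` is hypothetical, days are split in their given order (the source does not say whether they were shuffled), and the model itself is left out.

```python
# Number of 15-minute intervals in one simulated 24-hour day.
INTERVALS_PER_DAY = 24 * 60 // 15

def k_fold_days(n_days, k=5):
    """Yield (train_days, test_days) index lists for k-fold cross-validation.

    Hypothetical helper: folds are contiguous and as equal in size as
    possible, and each day lands in exactly one test fold.
    """
    days = list(range(n_days))
    base, extra = divmod(n_days, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional day.
        size = base + (1 if fold < extra else 0)
        test_days = days[start:start + size]
        train_days = days[:start] + days[start + size:]
        start += size
        yield train_days, test_days

if __name__ == "__main__":
    for city, n_days in (("Bergen", 8), ("Wiesbaden", 12)):
        for train_days, test_days in k_fold_days(n_days):
            print(city, "train:", train_days, "test:", test_days)
```

Note that with only 8 or 12 days the folds cannot all hold exactly 20% of the data, so the 80/20 split is approximate per fold. In each fold, a model would take the flows of the observable sections in one interval as input and the flows of the unobservable sections in that interval as the target.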
The figure below shows the results of this experiment in terms of %GEH<5 and root mean square error (RMSE). Boxplots represent the 1st and 3rd quartiles, the median (orange), the mean (green triangle), and whiskers at the 1st quartile minus and the 3rd quartile plus 1.5 times the interquartile range. To show that the simulated synthetic days differ from one another, and that they represent the most common demand patterns in the real dataset, the boxplots also show the accuracy obtained when the best synthetic day in the training set (the best day of the 80% used for training) is used to predict each of the days in the test set (the remaining 20% of days).
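For readers unfamiliar with these metrics: the GEH statistic compares a modelled flow M with an observed flow C as sqrt(2(M - C)^2 / (M + C)), with both flows in vehicles per hour, and %GEH<5 is the share of sections where that value falls below 5. The sketch below shows one way to compute both metrics. It is an illustration rather than the code behind the figure: the function names are hypothetical, flows are assumed to be already expressed in vehicles per hour (15-minute counts would need multiplying by 4), and returning 0 when both flows are zero is a convention chosen here.

```python
import math

def geh(modelled, observed):
    """GEH statistic for one pair of hourly flows (veh/h)."""
    total = modelled + observed
    if total == 0:
        # Assumed convention: two zero flows count as a perfect match.
        return 0.0
    return math.sqrt(2.0 * (modelled - observed) ** 2 / total)

def pct_geh_below(modelled, observed, threshold=5.0):
    """Percentage of sections whose GEH is below the threshold."""
    values = [geh(m, c) for m, c in zip(modelled, observed)]
    return 100.0 * sum(v < threshold for v in values) / len(values)

def rmse(modelled, observed):
    """Root mean square error over all sections."""
    squared = [(m - c) ** 2 for m, c in zip(modelled, observed)]
    return math.sqrt(sum(squared) / len(squared))

if __name__ == "__main__":
    # Made-up flows for three sections, for illustration only.
    estimated = [100.0, 200.0, 0.0]
    reference = [100.0, 50.0, 0.0]
    print(pct_geh_below(estimated, reference))
    print(rmse(estimated, reference))
```

GEH scales the error by the size of the flow, so a fixed absolute error counts for less on a busy section than on a quiet one, while RMSE reports the error in the same units as the flow itself.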