In summary, this analysis shows that road traffic has not yet returned to pre-March 2020 levels, although it is heading in that direction. We reach this conclusion without knowing whether mobility restrictions, working from home, or the closure of economic sectors such as leisure still apply.
However, we do not know whether the peak hour falls at the same time as before, or whether some days of the week are closer to pre-COVID levels than others. To find out, we would need to repeat this analysis for different days of the week and different time-of-day slices. All these combinations make a simple task quite tedious. Fortunately, machine learning can do the work for us! For example, we can cluster the days in our dataset according to the flow at each detector throughout the day, see how they group together, and then compute a pattern that represents each group, for example the centroid (or the exemplar) of each cluster.
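To make the idea of "one vector per day" and "one representative pattern per group" concrete, here is a minimal sketch in plain Python. The record layout, the detector names, the flow values and the function names are all assumptions made for illustration, and the exemplar is interpreted here as the medoid, that is, the real day closest to all the others in its group.

```python
import math
from collections import defaultdict

# Hypothetical records: (day, detector id, time slot, flow in vehicles).
records = [
    ("2019-03-04", "det_A", 0, 900), ("2019-03-04", "det_A", 1, 1200),
    ("2019-03-04", "det_B", 0, 400), ("2019-03-04", "det_B", 1, 600),
    ("2019-03-05", "det_A", 0, 880), ("2019-03-05", "det_A", 1, 1250),
    ("2019-03-05", "det_B", 0, 410), ("2019-03-05", "det_B", 1, 590),
    ("2019-03-06", "det_A", 0, 920), ("2019-03-06", "det_A", 1, 1150),
    ("2019-03-06", "det_B", 0, 390), ("2019-03-06", "det_B", 1, 610),
]

def build_day_vectors(records):
    """Return {day: [flow, ...]} with one entry per (detector, time slot)."""
    keys = sorted({(det, slot) for _, det, slot, _ in records})
    flows = defaultdict(dict)
    for day, det, slot, flow in records:
        flows[day][(det, slot)] = flow
    # Keep only days that have a reading for every (detector, slot) pair.
    return {day: [f[k] for k in keys]
            for day, f in flows.items() if len(f) == len(keys)}

def centroid(vectors):
    """Element-wise mean of the day vectors in one group."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def exemplar(vectors):
    """Medoid: the real day with the smallest total distance to the others."""
    return min(vectors, key=lambda v: sum(math.dist(v, w) for w in vectors))

day_vectors = build_day_vectors(records)
group = list(day_vectors.values())  # pretend these days share one cluster
print(centroid(group))
print(exemplar(group))
```

The centroid is an averaged day that may never have occurred, whereas the exemplar is always one of the observed days, which can be easier to explain to someone looking at the raw data.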
Before we go any further, there are some decisions to make: what distance, what clustering algorithm, how many clusters, whether to normalize data or not… Regarding the distance metric, Euclidean distance seems a good choice because we want to capture differences in flow between days across the time of day. Another option would be the GEH statistic or the GEH < 5 criterion, both popular for comparing traffic flows. Regarding the number of clusters and the clustering method, there are many options. Some algorithms determine the number of clusters automatically, such as affinity propagation or the density-based DBSCAN, while others let the user set it, such as hierarchical clustering or k-means. For this example, we choose agglomerative hierarchical clustering because we want to have control over the number of clusters. Finally, we decide not to normalize the dataset: we are going to use Euclidean distance over flow only, and we want detectors with high vehicle counts to weigh more than those with low vehicle counts. Once we’ve made these decisions, we just need to run the clustering algorithm and iterate, if necessary, over the number of clusters.
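The decisions above can be sketched as a small, naive agglomerative clustering in plain Python. This is an illustration under stated assumptions rather than the exact setup used for this post: the linkage criterion is not specified in the text, so average linkage is assumed here, the function names are hypothetical, and the toy flows are made up. In practice, libraries such as SciPy or scikit-learn provide optimized implementations of hierarchical clustering.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two day vectors of raw (unnormalized) flows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(days, n_clusters, dist=euclidean):
    """Naive agglomerative clustering with average linkage (an assumption).

    days: list of equal-length flow vectors, one per day.
    Returns one cluster label per day, numbered from 0.
    """
    clusters = [[i] for i in range(len(days))]  # start with one cluster per day
    # Pairwise distances between days are computed once and reused.
    d = {(i, j): dist(days[i], days[j])
         for i, j in combinations(range(len(days)), 2)}

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(d[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters (a < b, so deleting b is safe).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]

    labels = [0] * len(days)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Made-up flows: 2 detectors x 3 time slots, flattened into one vector per day.
days = [
    [900, 1200, 1000, 400, 600, 500],  # busy day
    [880, 1250, 990, 410, 590, 520],   # busy day
    [300, 500, 450, 150, 250, 200],    # quiet day
    [280, 520, 430, 160, 240, 210],    # quiet day
    [100, 150, 120, 50, 70, 60],       # very quiet day
]
labels = agglomerative(days, n_clusters=3)
print(labels)
```

Because the flows are not normalized, the first detector (with counts around 1,000) dominates the distance, which is exactly the weighting described above. Iterating over the number of clusters is then just a loop over `n_clusters`.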
Once the clustering is done, we can visualize how the patterns are distributed over time, for example with a calendar-like table in which each day cell is colored according to the pattern of that day, as in the next figure. Note that white cells correspond to days without data, and that the week number, the month and the year are indicated in the legend on the left. The figure shows that over three years (2017-2019) workdays fall into two different patterns or clusters (0 and 3), and that these two patterns alternate with the season, with pattern 0 becoming more common as time goes on. Similarly, 2017-2019 Sundays and holidays are split between two other patterns (6 and 7), and Saturdays fall into pattern 2. The beginning of 2020 follows the same arrangement of patterns (at least for the period with data), but suddenly, on March 18th, a new group of days appears, represented as pattern 5. Afterwards, new patterns pop up that also represent workdays, Saturdays, and Sundays (or holidays). In other words, none of the patterns from before March 18th, 2020 occurs again, and none of the patterns seen after the COVID outbreak occurred before it. There is a complete change of behavior, and the past has not come back (will it?). Another conclusion is that there are more patterns in the COVID era, meaning that transport modelers need to do more work to model the demand than in the pre-COVID era, because more patterns imply more demands to model. But this is a problem for transport modelers, not data scientists, so let us keep it out of the scope of this post.
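For readers who want to reproduce a calendar-like view without a plotting library, the following sketch arranges cluster labels into one row per ISO week and one column per weekday. The function names are hypothetical and the labels are invented for illustration, loosely following the patterns described above (0 for workdays, 2 for Saturdays, 6 for Sundays, 5 from March 18th, 2020). Days without data are left empty, like the white cells in the figure.

```python
from datetime import date

def calendar_grid(day_labels):
    """Arrange {date: cluster label} into rows of ISO weeks, Monday to Sunday.

    Days without data stay None (the white cells of the figure).
    """
    rows = {}
    for day, label in day_labels.items():
        iso_year, iso_week, iso_weekday = day.isocalendar()
        row = rows.setdefault((iso_year, iso_week), [None] * 7)
        row[iso_weekday - 1] = label  # Monday is 1 in ISO numbering
    return dict(sorted(rows.items()))

def render(grid):
    """Render the grid as plain text, one line per week."""
    lines = ["year-week  Mon Tue Wed Thu Fri Sat Sun"]
    for (year, week), row in grid.items():
        cells = []
        for label in row:
            cell = "." if label is None else str(label)
            cells.append(f"{cell:>3}")
        lines.append(f"{year}-W{week:02d}   " + " ".join(cells))
    return "\n".join(lines)

# Invented labels for two weeks of March 2020; the last weekend has no data.
day_labels = {date(2020, 3, d): 0 for d in range(9, 14)}       # Mon-Fri
day_labels[date(2020, 3, 14)] = 2                              # Saturday
day_labels[date(2020, 3, 15)] = 6                              # Sunday
day_labels.update({date(2020, 3, 16): 0, date(2020, 3, 17): 0})
day_labels.update({date(2020, 3, d): 5 for d in range(18, 21)})

grid = calendar_grid(day_labels)
text = render(grid)
print(text)
```

Replacing the text cells with colored ones (for example in a spreadsheet or a heatmap) gives the kind of table shown in the figure, where a sudden change of pattern stands out at a glance.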