Spatiotemporal Metrics Guide

Spatiotemporal Metrics

Spatiotemporal Metrics (STMetrics) is a TLCMap tool that provides a relatively easy way to get some basic statistical and metrical information about spatiotemporal datasets, without needing specialised statistical software, and without needing to be a mathematician or IT expert. It does require a little preparation and care in interpreting results, which this guide aims to help with.

Statistics and other common metrical techniques can help support an argument more rigorously, provide measures for comparison, test hypotheses and can help us see, understand or explain patterns we weren't aware of.

Are locations grouped around a specific region? Are the locations close together or far apart? Are there more around the centre or are they evenly distributed? Are there distinct groups or clusters of places? Are events that are close in time also close in space? Is there a pattern of movement of these events?

Obtaining statistical information from spatial data is not easy. You cannot simply take the average of the coordinates, for example, because of the ellipsoidal shape of the earth. The distance covered by one degree of longitude varies with latitude, from about 111km at the equator to 0km at the poles. If points are scattered around the globe, does it even mean anything to speak of a maximum and minimum? The difference between 179 and -179 longitude is 2 degrees, not 358 (they are either side of the international date line). If you want the midpoint between 0, 100 and 0, -100, do you mean on this side of the earth or the other?
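The two effects just described can be checked with a short Python sketch. It is only an illustration, not STMetrics' own code: it assumes a perfectly spherical earth with a mean radius of 6371km, and the function names are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical earth with mean radius

def lon_difference(lon_a, lon_b):
    """Smallest angular difference between two longitudes, in degrees."""
    diff = abs(lon_a - lon_b) % 360.0
    return min(diff, 360.0 - diff)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the value just above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

print(lon_difference(179, -179))          # 2.0, not 358
print(round(haversine_km(0, 0, 0, 1)))    # 111 (one degree at the equator)
print(round(haversine_km(60, 0, 60, 1)))  # 56 (one degree at 60 degrees north)
```

Converting coordinates to distances in this way is what makes ordinary statistics such as averages meaningful again.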

STMetrics resolves many of these difficulties for you, providing simple, common, easy-to-understand methods that will be broadly useful. We do not aim to provide a complex suite of tools, as there are already mathematical and GIS systems if you require detailed and complex analysis in a specific area. If you have spatiotemporal data, it should be possible to get some quick, informative answers with spatiotemporal metrics without having to run an expensive project.

Our aim with STMetrics is to make it easy for Humanities researchers to get some common quantitative measures of spatiotemporal data. There are already advanced mathematics systems for handling complex statistical problems, which we do not want to replicate and which can be used if a project requires something more advanced. TLCMap uses a rapid prototyping approach to ensure simple, useful tools are quickly delivered, with future development driven by feedback and demand.

When resources become available the immediate priorities are:

  1. Measures of how 'close' datasets are to each other.
  2. Report numeric values for the network, not just the visualisation.
  3. Improved visualisation of the network.
  4. Bulk upload of data files.
  5. Closer integration with other systems such as the Gazetteer and Map Finder.

We welcome all feedback at tlcmap AT newcastle.edu.au – especially if it pertains to use cases. Everything is prioritised in terms of feasibility, resources and demand (bear in mind resources are often very limited, some things are not feasible, and we avoid replicating functionality that can be readily obtained elsewhere).

Based on a survey and feedback from Chief Investigators, the general areas where metrics on spatiotemporal data were sought have been prioritised more broadly as below:

  1. Basic Statistics
  2. Network analytics
  3. Convert To Gov Standards
  4. Detailed Statistics
  5. Comparative Statistics
  6. Space-Time Warping
  7. Wayfinding

We maintain a more detailed task list in an issue tracking system.

A critical attitude is required in interpreting results.

Comparison

One of the main points of obtaining statistics is for comparison. Often we want to say that one thing is more so than another, and by how much, to support some argument, or answer a question. A number alone is often meaningless without comparison. Is 5 a lot or a little? Compared to what?

It's important to be clear about what is being compared. Eg: Under 'Basic Statistics' the 'average' based on the Midpoint indicates how far places tend to be from the middle, while the Pair average indicates how far they tend to be from each other. You can use this to make comparisons between datasets. Eg: If the midpoint average for horse rustling is greater than the midpoint average for bicycle theft, it indicates horse rustling occurs further from the city than bicycle theft. If the 'Pair' average is higher in one dataset than another, it indicates that the first dataset is more spread out and the other more densely clustered. It does not tell you how the points are arranged: they may be evenly spread, or there may be many close to the middle and a few outliers.

Do the results provide information about the topic of interest, or about something else?

Eg: if we find there is a focal point or high concentration of points around a place, are we really measuring the spread of those points, or just the spread of population? Does this really tell us anything other than that there is a city there?

What other factors are influencing the result?

Eg: we might see that one set of points is much more spread out than another. That might just be because one is in the mountains and another is in the plains. Perhaps the points are consistently separated by 'one day's travel', like campsites or stores for example, so that they appear more dense in the mountains than the plains. They might be 'spread out' to the same degree if you factor in travel time, rather than just the spatial distance between two points.

Is the difference significant or random?

If one dataset has a figure of 5km, and another 6km, can we say that the second figure is really greater than the first, or is it just a slight random variation in the points we have sampled?

Because statistics generally cannot be based on the values of coordinates (as described above), coordinates are converted to distances and statistics are based on these distances.

Simple Statistics

Metric In Brief Explanation

Count

The number of points.

The count simply refers to the number of locations in the dataset. In this case, each location is represented by a GeoJSON geometry of type “Point”.

Geo Midpoint

The middle of all the points.

Imagine a set of two points on a plane that are connected by a line that represents the displacement between them. The centroid of these points is precisely at the centre of that line.

Now, imagine a set of three points on a plane that are connected to form a triangle. The centroid of these points is that triangle’s centre of mass.

Again, imagine a set of four points on a plane that connect to form a quadrilateral. The centroid of these points is that quadrilateral’s centre of mass.

Finally, imagine any number of points on a plane that are connected to form some polygon. The centroid of those points is that polygon’s centre of mass.

If we are finding the centroid of a polygon on the surface of the earth, then we refer to that centroid as the geographical midpoint, aka the geo midpoint.

In summary, the geo midpoint of a set of locations equates to the mean location of those points in terms of a geographical coordinate system. It is the same concept as a centroid in geometry.

Enclosed Area

The area covered by the points.

The area of the polygon that would be formed if a line were drawn joining the outermost points, so that it encloses all the points.
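As a rough illustration of the Geo Midpoint described above, the sketch below averages locations as directions from the centre of the earth rather than averaging the raw coordinates. This guide does not state how STMetrics calculates the midpoint, so treat the method (mean of unit vectors on a spherical earth) and the function name geo_midpoint as assumptions made for this example.

```python
import math

def geo_midpoint(points):
    """Return (lat, lon) in degrees for the mean direction of (lat, lon) points.

    Assumption: the earth is treated as a sphere. If the points cancel out
    exactly (for example, two opposite poles) the midpoint is undefined and
    this sketch simply returns (0.0, 0.0).
    """
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        # Convert each location to a 3D unit vector and accumulate
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Convert the mean vector back to latitude and longitude
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    return lat, lon

# Two points either side of the date line: a naive average of the
# longitudes (179 and -179) would give 0, on the wrong side of the earth.
lat, lon = geo_midpoint([(0.0, 179.0), (0.0, -179.0)])
print(round(lat, 6), round(abs(lon), 6))  # 0.0 180.0
```

The midpoint lands on the 180 degree meridian, between the two points, which is the answer most people would expect.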

Statistics Based on Midpoint and Each Pair

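Other statistics, such as averages and ranges, are based on either the distance of each point from the midpoint ('Midpoint') or the distance between each pair of points ('Each Pair').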

It is important to understand the meanings of the terms. For example, let's say you want to answer the simple question, 'How wide is my dataset?' (ie: what is the maximum distance between two points?). You might assume this is the 'range', but the range is the difference between the lowest measurement and the highest. If you were going by midpoint and the closest point is 1km from the middle and the furthest point is 5km from the middle, the range is 4km. If those two points were on opposite sides of the middle, the width you'd be expecting is 6km. If you look at the 'maximum' value, this is the greatest of all the distances, so if you are going by midpoint it tells you how far the furthest point is from the middle. This might be interesting in telling you which is the farthest flung outpost, but it is not the width of your dataset at its greatest extent. What you really need to answer the question is the maximum distance between any pair of points.
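The 1km, 5km and 6km figures in this example can be reproduced with a few lines of Python. The positions are hypothetical and lie along a single east-west line purely to keep the arithmetic simple; real datasets would use distances over the earth's surface.

```python
from itertools import combinations

# Hypothetical positions in km along an east-west line (negative = west):
# one outpost 5km east, five sites 1km west.
positions = [5.0, -1.0, -1.0, -1.0, -1.0, -1.0]

midpoint = sum(positions) / len(positions)
from_midpoint = [abs(p - midpoint) for p in positions]
between_pairs = [abs(a - b) for a, b in combinations(positions, 2)]

midpoint_range = max(from_midpoint) - min(from_midpoint)

print(midpoint)            # 0.0
print(max(from_midpoint))  # 5.0  (midpoint maximum: the furthest outpost)
print(midpoint_range)      # 4.0  (midpoint range: 5km minus 1km)
print(max(between_pairs))  # 6.0  (pair maximum: the width of the dataset)
```

Only the last figure, the maximum over each pair, answers 'How wide is my dataset?'.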

Displacement from the Midpoint

Each of the other metrics refers to the displacement from the geo midpoint to the other locations in the set. When considered alongside the geo midpoint, these metrics provide insight into the centre, shape, and spread of the spatial data in question. See below:

Metric, Brief Description, Definition, Contextual Description, Example

Mean

Brief description: Average distance of points from midpoint, or between each point.

Definition: The mean is the sum of values in a set divided by the number of values in that set.

Contextual description: ‘Average’ is another word for the mean. It is the most common and easily understood measure of how far points tend to be from the middle, or from each other.

Example: “On average, the street art occurs within about 500m of a central point.”

Std Dev and Variance

Brief description: A measure of how focused or spread out the points are.

Definition: The standard deviation and variance both indicate how spread out the values are. If you know the mean you have a sense of where the middle values are but no idea about how widely values vary from the mean. The points could be vastly different to each other, or they could be mostly around the same as the mean. These two figures indicate how much the values vary, or deviate, from the average (for example, the mean of 40 and 60 is 50. The mean of 10 and 90 is also 50, but 10 and 90 deviate much more from 50). Mathematically speaking, the standard deviation is the square root of the variance. To cater for negative differences, the variance is calculated by squaring the difference between each value and the mean, and then averaging those squares. The standard deviation then takes the square root to bring the value back to something comparable to the original values used.

Contextual description: Standard deviation is more commonly used because it is easier to compare. The variance of 40km and 60km is 100 square km, but the standard deviation is 10km (ie: this is a much clearer indication that the values tend to vary by about 10km from the average). The variance of 10km and 90km is 1600 square km, and the standard deviation is 40km. Normally we would have more than 2 values to calculate with, and this is just for a simple understanding.

Example: You can see that even if there were different numbers of points in different datasets you could still compare them, using the standard deviation to say something like, “With a standard deviation of 10km, the cafes are much closer to the middle, while fast food restaurants are more spread out around the centre with a standard deviation of 40km.”

Minimum and Maximum

Brief description: How close the nearest point is and how far away the furthest point is.

Definition: The minimum value of a set of numbers is simply the smallest number in that set. Similarly, the maximum value is the largest number in that set.

Contextual description: Going by midpoint, these are the distances of the closest and furthest points from the midpoint. Going by each pair, they are the closest and furthest any two points are from each other, and the maximum is also the greatest distance across the whole dataset.

Example: “Of all the performances, this one was nearest the centre, and this one furthest away. The performance of this troupe was much further from their centre of activity than the other troupe.” or “Travelling shows by this troupe covered 1000km, a greater distance than the other troupe which remained localised to within 100km at most.”

Range

Brief description: The difference between max and min.

Definition: The range is calculated by subtracting the minimum value from the maximum value.

Contextual description: This is not the width of the dataset. If calculating by midpoint it is the difference between the closest and furthest point from the middle. If going by pairs of points, it's the difference between the distance between the closest two points and the distance between the two points furthest from each other.

Example: “No mall was within 10km of the centre, and none further than 100km out.” (Midpoint) or “No two ice cream parlours were within 200m of each other and none further than 5km apart.” (Each Pair)

Median

Brief description: Of all the points this is the middle distance from the midpoint. It is not the point closest to the midpoint.

Definition: If we have a set of numbers and order them from smallest to greatest, the median is the middle number in that set.

Contextual description: Like the mean, the median is a measure of central tendency. The median is more resistant to outliers, making it a better measurement when dealing with skewed data. The median is the point such that 50% of the data falls before, and 50% of the data falls after.

Example: “50% of the rock art sites are within 200km of the geographical midpoint.”

Quartile One and Quartile Three

Brief description: A quarter of points are within this distance from the midpoint.

Definition: A quartile is similar to the median but divides the data into quarters. Eg: in a set of 20 numbers ordered from lowest to highest, the 1st quartile is the value of the 5th number. The 2nd quartile is the middle number (or the average of the middle two numbers) and is also known as the 'median'. The 3rd quartile is the value of the number that is 3 quarters of the way along.

Contextual description: Using the first quartile we can say that 25% of the data falls before it, and using the 3rd quartile that 25% of the data falls after it. Quartiles give a sense of how the data is spread out or skewed towards one end or another. For example if the 1st quartile is a very low number compared to the whole data set it would indicate that most of the data is focused around the centre. If the 3rd quartile is a very high number (close to the maximum, or to be specific, within a quarter of the range of the maximum) it indicates that the data is concentrated around the edges of the region.

Example: “The first quartile is 5km, and the 3rd quartile is 10km, so with the maximum that any point is from the centre being 200km it seems camps were closely clustered around a central location with a few scattered outposts.” or “Quartile three is 180km, this is close to the maximum distance any point is from the midpoint (midpoint maximum) which is 200km. This indicates the camps were in a roughly circular pattern, probably a seasonal circuit.”

Quartile Three

Brief description: Three quarters of points are within this distance from the midpoint.

Definition: Represents the point such that 75% of the data falls before and 25% falls after.

Contextual description: Indicates the third quartile of the displacement of the points from the geo midpoint. That is, the point at which 75% of the displacements fall before and 25% of the displacements fall after.

Example: “75% of the water holes are within 150km of the geo midpoint.”

Interquartile Range (IQR)

Brief description: A band within which the middle half of measurements fall, excluding the smallest and largest extremes, based on distance from the midpoint.

Definition: The IQR is the range of the middle 50% of the dataset. It is calculated by subtracting the first quartile from the third quartile. It is a good measurement of spread when dealing with skewed data because you can remove the influence of any outliers that might be exceptional cases. You may be interested in these because they are exceptional cases, but they can distort an understanding of the most common cases from which they diverge.

Contextual description: The IQR is a measure of how spread out the points are from the geo midpoint. It is a better measure of spread than standard deviation in cases where the data is skewed.

Example: “Around the colonial city in the 1840s the interquartile range of pubs was 32km, of churches 27km and of police stations 10km, indicating liquor, then Christianity, then policing reached the frontier in that order, at least in their institutionalised forms. No doubt individuals carried with them their own bottles, bibles and fears of being caught.”
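For readers who want to check figures like these by hand, the following sketch computes the same kinds of metrics from a list of distances using Python's standard statistics module. It is not STMetrics' implementation. The function name summarise and the sample distances are invented for this example, the variance is the 'population' form used in the 40km and 60km illustration, and quartiles use the 'inclusive' method, which is only one of several conventions and may differ slightly from the figures STMetrics reports.

```python
import statistics

def summarise(distances_km):
    """Summary metrics for a list of distances (in km)."""
    # Assumption: 'inclusive' quartile convention; other conventions exist
    q1, median, q3 = statistics.quantiles(distances_km, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(distances_km),
        "variance": statistics.pvariance(distances_km),  # square km
        "std_dev": statistics.pstdev(distances_km),      # km
        "minimum": min(distances_km),
        "maximum": max(distances_km),
        "range": max(distances_km) - min(distances_km),
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
    }

# The two-value illustration from the Std Dev and Variance entry
print(summarise([40.0, 60.0])["std_dev"])  # 10.0
print(summarise([10.0, 90.0])["std_dev"])  # 40.0

# Hypothetical skewed data: eight camps near the centre, one distant outpost
camps = summarise([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 200.0])
print(camps["median"], camps["iqr"])  # 5.0 4.0
print(round(camps["mean"], 1), round(camps["std_dev"], 1))
```

In the skewed example the single outpost pulls the mean and standard deviation far above the median and IQR, which is why the median and IQR are preferred for skewed data.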

Proximity Graphs

Proximity graphs can help identify clusters, pathways and relationships.

Graph theory is a vast and complex field in mathematics. STMetrics aims to make it easier for humanities researchers to use some of the most common, easy to understand and widely applicable techniques.

The word 'graph' has many meanings. The Greek root 'graph' simply means to draw, write or make a mark. In school we learnt to plot points using coordinates on a 'graph' with an x and y axis and to draw bar graphs and pie graphs. But a 'graph' in mathematics can also refer to a network of nodes and links or 'edges'. That is the sense we are using here. A proximity graph makes connections or 'edges' between things based on how close they are to each other.

When dealing with map data, when we say 'close' we often literally mean how close together things are spatially. Bear in mind that by 'close' we could also mean many other things. We could mean close in time, as in events which happened soon after each other. We could mean close in age, in the sense that two people aged 41 and 42 are closer in age than someone aged 18. We could mean close in income, height, speed, level of enthusiasm. Indeed, anything that can be measured numerically can be used as our measure of closeness. In STMetrics the default measure of 'closeness' is distance in space, but you can also 'map' closeness in time or other factors.

The STMetrics proximity graphs provide a visualisation of proximity networks by drawing connecting lines between points that are 'close' using two optional methods.

Method In brief Explanation

MST (minimum spanning tree)

The shortest way to connect all points.

A minimum spanning tree connects points by finding the shortest way to ensure every point is connected to every other point, either directly or through other points. If you were to measure the distance between each point and each other point, and only take the shortest connections, while ensuring that they are all ultimately connected (ie: you don't end up with two separate networks), you would have a minimum spanning tree. Another way of saying this is that if you were to add up the length of all lines connecting points, this is the shortest possible. If you wanted to build roads or lay cables, this would be a good way to keep costs to a minimum, because it would show the least amount of road to build or cable to lay, to make sure everything is connected. This could be useful for guessing what routes someone might take to move between places, or for planning a trip. It can also suggest a line of best fit around which places are distributed, and indicate those places which are furthest from each other, or tangential, or at extremities.

k-Nearest Neighbour

Grouping clusters of points that are closest to each other.

K-nearest neighbour works by making a connection from each point to its closest point. You can also choose how many closest points to connect, the 'k-threshold'. Eg: if you select a k-threshold of '3', then it will connect each point to the three closest points. This is a good algorithm for identifying clusters, as you will see groups that are closely connected to each other, but disconnected from other points.
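Both methods can be sketched in a few lines of Python. This is an illustration only and not the code STMetrics runs: it assumes points on a flat plane with straight-line distances (real coordinates would need distances over the earth's surface), uses Prim's algorithm as one standard way to build a minimum spanning tree, and the function names are invented for this example.

```python
import math

def distance(a, b):
    """Straight-line distance between two (x, y) points (planar assumption)."""
    return math.dist(a, b)

def minimum_spanning_tree(points):
    """Prim's algorithm: return the MST as a list of (i, j) index pairs."""
    if not points:
        return []
    in_tree = {0}  # start from the first point
    edges = []
    while len(in_tree) < len(points):
        # Shortest link from any connected point to any unconnected point
        _, i, j = min(
            (distance(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(len(points))
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

def k_nearest_edges(points, k=1):
    """Connect each point to its k closest other points (undirected edges)."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: distance(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

# Two hypothetical clusters of three points each
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(minimum_spanning_tree(points))  # five edges, one bridging the clusters
print(k_nearest_edges(points, k=2))
# [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
```

The minimum spanning tree always joins everything, so it includes one long edge between the two groups, while the k-nearest neighbour edges stay inside each group, which is what makes the clusters stand out.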

Graphing Time and Other Variables

If you import data that has dates associated with each place you can use MST graphs to visualise the sequence of events over time by selecting your date field under 'Choose distance variable'. Because MST makes the shortest connections between points, when the shortest distances in time between the points are used, you end up with a line or lines that roughly connect points in the order in which they occurred. This could help you see if events close together in space also occurred close together in time (if, having selected to graph by date, you notice that the line appears to connect places close together) or not (if the line is erratic).
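When the only 'distance' is time, a minimum spanning tree reduces to a chain linking each event to the next one in date order, which is why the graph follows the sequence of events. The sketch below builds that chain directly and reports how far apart in space each consecutive pair is. The events, the flat kilometre grid and the function name are all hypothetical, and how STMetrics itself handles date fields is not described in this guide.

```python
import math
from datetime import date

# Hypothetical events: (name, date, (x_km, y_km)) on a flat local grid
events = [
    ("C", date(1850, 3, 1), (40.0, 0.0)),
    ("A", date(1850, 1, 1), (0.0, 0.0)),
    ("D", date(1850, 6, 1), (45.0, 5.0)),
    ("B", date(1850, 1, 15), (3.0, 4.0)),
]

def time_ordered_links(events):
    """Link each event to the next in date order.

    Returns (from_name, to_name, days_apart, km_apart) for each link.
    """
    ordered = sorted(events, key=lambda e: e[1])
    links = []
    for (name_a, day_a, xy_a), (name_b, day_b, xy_b) in zip(ordered, ordered[1:]):
        links.append((name_a, name_b, (day_b - day_a).days, math.dist(xy_a, xy_b)))
    return links

for a, b, days, km in time_ordered_links(events):
    print(f"{a} -> {b}: {days} days apart, {km:.1f} km apart")
# A -> B: 14 days apart, 5.0 km apart
# B -> C: 45 days apart, 37.2 km apart
# C -> D: 92 days apart, 7.1 km apart
```

Short links in both days and kilometres (A to B) suggest events that are close in time and space, while a long jump in space between consecutive dates (B to C) is the kind of erratic line described above.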

If you use k-Nearest neighbour, it will connect those events which are close together in time. Again you will be able to see if those clusters that are close in space are also close in time, or not. It may be useful also, simply to see which events are close in time, across those spaces.

GeoJSON files have a 'properties' section containing attribute value pairs. If you have converted data from Excel, CSV, or some other source to GeoJSON it is likely to be stored in these properties. Any property can be used to create graphs and identify clusters, just as you can with space and time, with the options under 'Choose distance variable'. You can use this to note whether these clusters seem to correspond to space, or if they are quite erratic, or to observe how they are connected across space. The potential interpretations of course depend on the nature of the data.