As part of the TLC project, a suite of tools will be provided for the analysis of spatial and temporal data – to be titled ‘Spatiotemporal Metrics’. This prototype provides the first of the suite’s modules that deals with basic spatial descriptive statistics. The purpose is for testing and to give an idea of the direction we are heading.
To be valid, statistics must be properly understood and interpreted. Please see below for a detailed explanation of each result.
Sometimes we need statistics to support or demonstrate our argument, or to see and understand patterns that might not be immediately apparent. When dealing with spatial datasets, we are often interested in how various locations are distributed over the surface of the earth. Do the locations tend to be grouped around a specific region? Are the locations close together or far apart? Are the locations skewed towards a certain direction, or are they evenly distributed? In other words, we are interested in the centre, spread, and shape of our spatial data. This prototype provides metrics for analysing these concepts.
Obtaining statistical information from spatial data is not as easy as it might seem. This is due to the ellipsoidal shape of the earth. For example, finding the mean location given a set of geographical coordinates requires a special process – simply taking the average latitude and longitude can give false, skewed or unintended results. To demonstrate this point, imagine that you are trying to find the midpoint between some points around the international dateline. If you want the midpoint between 0, 179 and 0, -179 – do you mean 0, 0 on the other side of the Earth, or 0, 180 on this side of the Earth? Another major issue is that the distance covered by one degree of longitude varies depending on your latitude – from about 111km at the equator to 0km at the poles.
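The two problems described above can be seen in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the earth is treated as a sphere, and the 111km figure is the approximate value quoted above.

```python
import math

# Approximate length of one degree of longitude at the equator (figure quoted above).
KM_PER_DEGREE_AT_EQUATOR = 111.0


def naive_mean_longitude(longitudes):
    """Plain arithmetic mean of longitudes; ignores the wrap-around at +/-180."""
    return sum(longitudes) / len(longitudes)


def km_per_degree_longitude(latitude_deg):
    """Approximate east-west length of one degree of longitude, assuming a sphere."""
    return KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude_deg))


# Two points either side of the dateline: the naive mean lands on the far side of the earth.
print(naive_mean_longitude([179.0, -179.0]))

# A degree of longitude shrinks as we move from the equator towards a pole.
for latitude in (0, 30, 60, 90):
    print(latitude, round(km_per_degree_longitude(latitude), 1))
```

The naive mean of 179 and -179 is 0, which is the "wrong side" answer, and the length of a degree falls to roughly half its equatorial value by 60 degrees of latitude.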
This first prototype handles the complex trigonometry and statistics to solve these issues and provides some basic spatial descriptive statistics, namely:
We plan to add some features to the existing visualisation, such as factoring in time and drawing circles on the map to visualise each statistic for easier comparison across datasets, but we would very much like to hear from users so that development can be directed to where it is needed. We welcome all feedback at tlcmap AT newcastle.edu.au – especially if it pertains to use cases.
Based on a survey and feedback from Chief Investigators, the broad types of statistics about spatiotemporal data have been prioritised as below. Some of these areas are very large fields in themselves. We don't expect to complete all of the following, but will work through priorities to achieve as much as we can with limited time and budget. The next release will include some basic, common and useful network analytics, such as clustering and nearest neighbour. These will include use cases demonstrating how and why these can inform humanities research.
Note that a critical attitude is required in interpreting results. Often a number alone is meaningless and must be compared to something else to have any meaning. Is 5km a lot or a little? Are these points ‘close’ or ‘far apart’ compared to what?
Do the results provide information about the topic of interest, or about something else?
For example, if we find there is a focal point or high concentration of points around a place, are we really measuring the spread of those points, or just the spread of population? Does this really tell us anything other than that there is a city there?
What other factors are influencing the result?
For example, we might find that one set of points is much more spread out than another. Perhaps travel time is affecting the result. Is one spread out more simply because it is in the plains and the other is in the mountains? They might be spread out to the same degree based on travel time.
Is the difference significant or random?
If one dataset has a figure of 5km, and another 6km – can we say that the second dataset's figure is genuinely larger than the first, or is it just a slight random variation in the points we have sampled?
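One common way to probe this question is a permutation test: pool the two sets of distances, shuffle them into two random groups many times, and see how often chance alone produces a gap as large as the one observed. The sketch below is an illustration only and is not a feature of the prototype; the function name, the sample figures and the number of shuffles are all assumptions.

```python
import random
import statistics


def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabelling of the pooled values gives a
    difference in means at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Hypothetical distances (km) from each dataset's midpoint.
dataset_one = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]
dataset_two = [5.8, 6.4, 5.1, 6.9, 6.2, 5.7, 6.6, 5.3]
print(permutation_p_value(dataset_one, dataset_two))
```

A small value suggests the gap is unlikely to be random variation alone; a large value suggests the two figures cannot be told apart with the points available. It does not answer the earlier questions about what the figure is really measuring.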
Briefly: The number of points.
The count simply refers to the number of locations in the dataset. In this case, each location is represented by a GeoJSON geometry of type “Point”.
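As a rough illustration, counting the "Point" geometries in a GeoJSON FeatureCollection might look like the sketch below. The sample data and the function name are hypothetical, and the prototype's own input handling may differ.

```python
import json

# Hypothetical sample: a GeoJSON FeatureCollection.
# GeoJSON coordinates are ordered [longitude, latitude].
sample = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"name": "A"},
     "geometry": {"type": "Point", "coordinates": [151.78, -32.93]}},
    {"type": "Feature", "properties": {"name": "B"},
     "geometry": {"type": "Point", "coordinates": [151.21, -33.87]}},
    {"type": "Feature", "properties": {"name": "C"},
     "geometry": {"type": "LineString",
                  "coordinates": [[151.0, -33.0], [152.0, -34.0]]}}
  ]
}
""")


def count_points(feature_collection):
    """Count features whose geometry is of type 'Point'."""
    return sum(
        1
        for feature in feature_collection.get("features", [])
        # A feature's geometry may be null, so fall back to an empty dict.
        if (feature.get("geometry") or {}).get("type") == "Point"
    )


print(count_points(sample))  # 2 (the LineString is not counted)
```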
Briefly: The middle of all the points.
Imagine a set of two points on a plane that are connected by a line that represents the displacement between them. The centroid of these points is precisely at the centre of that line.
Now, imagine a set of three points on a plane that are connected to form a triangle. The centroid of these points is that triangle’s centre of mass.
Again, imagine a set of four points on a plane that connect to form a quadrilateral. The centroid of these points is the centre of mass of four equal weights placed at the corners of that quadrilateral.
Finally, imagine any number of points on a plane that are connected to form some polygon. The centroid of those points is the centre of mass of equal weights placed at each corner of that polygon.
If we are finding the centroid of a polygon on the surface of the earth, then we refer to that centroid as the geographical midpoint, aka the geo midpoint.
In summary, the geo midpoint of a set of locations equates to the mean location of those points in terms of a geographical coordinate system. It is the same concept as a centroid in geometry.
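One widely used way to compute such a midpoint is to convert each location to a 3D unit vector, average the vectors, and convert the result back to latitude and longitude. The sketch below assumes a spherical earth and equal weight for every point; the function name is hypothetical and the prototype's actual method may differ.

```python
import math


def geo_midpoint(points):
    """Mean location of (latitude, longitude) pairs given in degrees.

    Assumes a spherical earth and equal weights. The result is undefined
    when the averaged vector has zero length (e.g. two antipodal points).
    """
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        # Convert each location to a unit vector in 3D space.
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Project the averaged vector back onto the sphere.
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat), math.degrees(lon)


# Two points either side of the international dateline.
print(geo_midpoint([(0.0, 179.0), (0.0, -179.0)]))
```

For the dateline example the result falls on the 180 degree meridian rather than at 0, 0, which is the answer a plain average of the coordinates would give.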
Displacement from the Midpoint
Each of the other metrics refers to the displacement from the geo midpoint to the other locations in the set. When considered alongside the geo midpoint, these metrics provide insight into the centre, shape, and spread of the spatial data in question. See below:
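A displacement of this kind can be approximated as a great-circle distance. The sketch below uses the haversine formula with an assumed mean earth radius of 6371km; the function names and sample coordinates are hypothetical, and the prototype may use a different (for example ellipsoidal) distance calculation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical earth


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def displacements_km(midpoint, points):
    """Distance from the midpoint to every point, in the order given."""
    return [haversine_km(midpoint, point) for point in points]


# Hypothetical midpoint and locations as (latitude, longitude).
midpoint = (-33.0, 151.5)
locations = [(-32.93, 151.78), (-33.87, 151.21), (-32.5, 151.0)]
print([round(d, 1) for d in displacements_km(midpoint, locations)])
```

The resulting list of distances is the input for every metric in the table that follows.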
|Metric||Brief Description||Definition||Contextual Description||Example Application|
|Mean||Average distance of points from midpoint.||The mean is the arithmetic average of a set of values. It is calculated by dividing the sum of values in a set by the number of values in that set. The common usage of the word ‘average’ refers to the mean.||Indicates the mean displacement from the geo midpoint to the points in the set. It is a measure of how near or far the points are, on average, from the geo midpoint. It is a good way to gauge the central tendency of the points.||“On average, the traffic accidents occurred 50km away from a central point.”|
|Std Dev and Variance||A measure of how focused or spread out the points are.||Standard deviation and variance are both measures of spread. Given a set of values, the standard deviation and variance indicate how spread out those values are. Standard deviation and variance are directly related. Mathematically speaking, the variance is the square of the standard deviation.||The standard deviation is a measure of how spread out the points are from the geo midpoint. It is best used to measure the spread of data that is not heavily skewed. Note that if the displacements follow a normal distribution, then the standard deviation can be used to estimate what proportion of points falls within a given distance band around the mean displacement (according to ‘the Empirical Rule’).||“With a mean of 50km, a standard deviation of 15km, and a normally distributed dataset, we can estimate that 68% of traffic accidents occur between 35km and 65km from a central point.”|
|Minimum and Maximum||How close and how far the nearest and furthest points are from the midpoint.||The minimum value of a set of numbers is simply the smallest number in that set. Similarly, the maximum value is the largest number in that set.||Minimum refers to the smallest displacement from the geo midpoint and maximum refers to the greatest displacement from the geo midpoint.||“Each telecommunication tower was at least 5km away from the geo midpoint, but no more than 40km.”|
|Range||The difference between max and min. This is not the width of the dataset.||The range is calculated by subtracting the minimum value from the maximum value.||Represents the range of displacements of the points from the geo midpoint. That is, the difference between the maximum and minimum displacements.||“The distances of the telecommunication towers from the geo midpoint varied over a range of 35km.”|
|Median||The middle value of all the points' distances from the midpoint. It is not the point closest to the midpoint.||If we have a set of numbers and order them from smallest to greatest, the median is the middle number in that set. Like the mean, the median is a measure of central tendency. The median is more resistant to outliers than the mean, making it a better measurement when dealing with skewed data. The median is the value such that 50% of the data falls before, and 50% of the data falls after.||Indicates the median displacement of the points from the geo midpoint. That is, the value at which 50% of the displacements fall before and 50% of the displacements fall after.||“50% of the sacred sites are within 200km of the geo midpoint.”|
|Quartile One||A quarter of points are within this distance from the midpoint.||Similar to the median but represents the point such that 25% of the data falls before, and 75% falls after.||Indicates the first quartile of the displacement of the points from the geo midpoint. That is, the point at which 25% of the displacements fall before and 75% of the displacements fall after.||“25% of the water holes are within 90km of the geo midpoint.”|
|Quartile Three||Three quarters of points are within this distance from the midpoint.||Represents the point such that 75% of the data falls before and 25% falls after.||Indicates the third quartile of the displacement of the points from the geo midpoint. That is, the point at which 75% of the displacements fall before and 25% of the displacements fall after.||“75% of the water holes are within 150km of the geo midpoint.”|
|Interquartile Range (IQR)||A donut shaped band within which the middle half of points fall, based on distance from the midpoint.||The IQR is the range of the middle 50% of the dataset. It is calculated by subtracting the first quartile from the third quartile. It is a good measurement of spread when dealing with skewed data.||The IQR is a measure of how spread out the points are from the geo midpoint. It is a better measure of spread than standard deviation in cases where the data is skewed.||“The middle 50% of the train stations lie in a band 200km wide around the geo midpoint, leaving out the nearest and furthest quarters of the data that may skew results.”|
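The metrics in the table can all be derived from a single list of distances from the geo midpoint. The sketch below uses Python's standard statistics module as one possible illustration. It assumes the population (not sample) standard deviation and the "inclusive" quartile method; the source does not state which conventions the prototype uses, so its figures may differ slightly. The function name and the sample distances are hypothetical.

```python
import statistics


def describe(displacements):
    """Summarise distances (e.g. km) from the geo midpoint."""
    # Quartile conventions vary between tools; "inclusive" is an assumption here.
    q1, _, q3 = statistics.quantiles(displacements, n=4, method="inclusive")
    smallest, largest = min(displacements), max(displacements)
    return {
        "mean": statistics.fmean(displacements),
        "std_dev": statistics.pstdev(displacements),    # population std dev (assumed)
        "variance": statistics.pvariance(displacements),  # square of the std dev
        "min": smallest,
        "max": largest,
        "range": largest - smallest,
        "median": statistics.median(displacements),
        "quartile_one": q1,
        "quartile_three": q3,
        "iqr": q3 - q1,
    }


# Hypothetical distances in km from the geo midpoint.
summary = describe([5, 10, 20, 30, 40])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For these five sample distances the range is 35km (40 minus 5) and the IQR is 20km (30 minus 10), which shows how the IQR ignores the nearest and furthest quarters of the data.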