As we get more granular with the data we collect, it is critical to evaluate the reliability of each of the different measures we hope to use. If the data we are collecting isn’t reliable, there is no tool, software, or algorithm that can make it reliable; we are simply collecting bad data. The saying statisticians and data scientists often use is “Garbage in, garbage out,” and, while it may seem a bit harsh, it is reality. Bad data can be extremely dangerous, causing us to make poor decisions based on inaccurate information.
When performing a vertical jump on a force plate, our software collects data points as often as 1000 times per second from four different sensors, an extremely large amount of data from such a simple test. Utilizing these data points, we can capture various measures across different time points (peak power, rate of force production, flight time, take-off velocity, etc.). It is critical to understand that just because the vertical jump is believed to be a reliable test and force plates are reliable tools, it does not follow that every measure derived from them is reliable. While this concept seems obvious, it is far too often ignored.
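To make this concrete, here is a minimal sketch (in Python, with illustrative variable names and a fabricated force trace, not any vendor's actual pipeline) of how one derived measure, jump height, can be estimated from the flight time found in a 1000 Hz force signal using the standard projectile relationship h = g·t²/8.

```python
import numpy as np

def jump_height_from_flight_time(force_n, sample_rate_hz=1000, threshold_n=20.0):
    """Estimate jump height (m) from a vertical ground-reaction-force trace.

    A minimal sketch: flight is taken as the span where the force drops below
    a small threshold, and height is derived from flight time via h = g*t^2/8.
    Real software uses more robust event detection and filtering.
    """
    g = 9.81
    in_flight = force_n < threshold_n              # samples where the jumper is airborne
    flight_samples = np.count_nonzero(in_flight)   # assumes a single flight phase
    flight_time_s = flight_samples / sample_rate_hz
    return g * flight_time_s ** 2 / 8.0

# Illustrative use with fake data: ~0.5 s of flight in a 1000 Hz trace
fake_force = np.concatenate([np.full(1500, 800.0),   # stance / propulsion
                             np.zeros(500),          # flight (~0.5 s)
                             np.full(1000, 800.0)])  # landing / quiet standing
print(round(jump_height_from_flight_time(fake_force), 3))  # roughly 0.31 m
```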
The photo above is taken from a 2011 paper investigating three different devices’ reliability when measuring the vertical jump (1). According to the study, all three devices tested showed good reliability (ICC > 0.80). This is good news for both the vertical jump and the devices utilized in the study. But we also need to be careful here as this research paper does NOT “prove” that any measurement we can derive from the vertical jump is automatically reliable. Reliability is specific and needs to be evaluated independently across any and all types of data being collected.
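For practitioners who want to run the same kind of reliability check on their own data, the sketch below shows one way to estimate a test-retest ICC. It assumes the third-party pingouin package and uses hypothetical athlete, session, and jump-height values; the resulting ICC is what thresholds like the 0.80 above are applied to.

```python
import pandas as pd
import pingouin as pg  # assumed available: pip install pingouin

# Hypothetical test-retest data in long format: one row per athlete per session
df = pd.DataFrame({
    "athlete": ["A", "B", "C", "D", "E"] * 2,
    "session": ["day1"] * 5 + ["day2"] * 5,
    "jump_height": [0.42, 0.55, 0.38, 0.61, 0.47,
                    0.44, 0.53, 0.40, 0.60, 0.49],
})

icc = pg.intraclass_corr(data=df, targets="athlete",
                         raters="session", ratings="jump_height")

# ICC2 (two-way random effects, absolute agreement, single measurement) is a
# common choice for test-retest reliability; compare it against your cutoff.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```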
A 2019 paper by Carroll et al. examined the reliability of the vertical jump utilizing a force plate (2). This paper looks at jump height along with several “alternative” variables that can be derived from the vertical jump when utilizing force plates: countermovement jump height, modified reactive strength index (RSI-mod), relative peak power, and countermovement depth. As we might expect, this paper is consistent with the paper shared last week in finding the jump height variable reliable. RSI-mod is a calculated variable, the ratio of jump height to time to take-off, so it is partially derived from the jump height variable. Partially because of this, we are not surprised to see RSI-mod report sufficient reliability as well. However, if we look at both the relative peak power and countermovement depth variables when evaluated inter-session (reliability between multiple sessions, or test-retest), we see ICC values of 0.41 and 0.39, respectively, well below our cutoff of acceptable reliability at 0.50. This is a great example of the specificity of reliability and a true case study showing the importance of assessing not only the reliability of the assessments (jumping) or devices (force plate, Vertec, jump mat, etc.) but also each specific variable or data point that we are able to utilize.
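Since RSI-mod is described above as the ratio of jump height to time to take-off, a quick sketch of that calculation (with hypothetical values and assumed units of meters and seconds) looks like this:

```python
def rsi_mod(jump_height_m: float, time_to_takeoff_s: float) -> float:
    """Modified reactive strength index: jump height divided by time to take-off,
    as described in the text. Units here are assumed to be meters and seconds."""
    return jump_height_m / time_to_takeoff_s

# Hypothetical athlete: 0.45 m jump with a 0.75 s countermovement-to-take-off time
print(round(rsi_mod(0.45, 0.75), 2))  # 0.6
```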
At the most basic level, assessing your data’s reliability is critical for understanding what you can actually use and what ends up being bad data. The next step is understanding what is called the Typical Error, which allows us to understand when there is an actual worthwhile or meaningful change. When assessing our athletes, soldiers, clients, or patients, it is important to know what constitutes a meaningful change compared to what may just be normal fluctuation in the test or the person. To keep our example simple, utilizing vertical jump height: if an individual increases their jump height by 0.64 inches, we must ask whether this is a significant change or not. Is this simply within the normal variation, or can we confidently say the individual is jumping higher than before?
The Typical Error is the random variation that occurs from one measurement to another, which can result from either technological error (from the testing device) or biological variation (normal human variability). To use a simple example of body weight, we can utilize two measures from a test-retest experiment: 220.8 lbs on the first trial and 222.1 lbs on the second.
A reliable scale should give us a similar value each time we step on it. But because of this “error,” it might not be exactly the same (and that can be ok!). Defining the typical error allows practitioners to separate what is significant from what is simply noise. For example, an inexpensive or poor-quality scale may suffer from technological error simply because of low-quality construction or components. If we don’t account for this error, we are unable to know whether the change in weight of 1.3 lbs (222.1 - 220.8) is a real change or not. This scale could even suffer from what we might call non-uniform error, or heteroscedasticity, meaning its accuracy actually differs across the weight range. This can be a big problem, as it may mean the reliability isn’t consistent across multiple populations.
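One common way to quantify the typical error from a test-retest design is to divide the standard deviation of the difference scores by √2, often expressed as a coefficient of variation relative to the mean. The sketch below uses hypothetical paired body-weight measurements and is only meant to illustrate the calculation:

```python
import numpy as np

# Hypothetical test-retest body weights (lbs) for a small group of individuals
trial_1 = np.array([220.8, 185.2, 201.4, 240.0, 172.6])
trial_2 = np.array([222.1, 184.5, 203.0, 238.7, 173.9])

diff = trial_2 - trial_1
typical_error = np.std(diff, ddof=1) / np.sqrt(2)   # TE = SD of differences / sqrt(2)
cv_percent = typical_error / np.mean((trial_1 + trial_2) / 2) * 100

print(f"Typical error: {typical_error:.2f} lbs ({cv_percent:.1f}% of mean body weight)")
```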
Normal biological variation can also be a source of error. A normal person’s body weight can fluctuate by 2-3% in a single day. For our 220 lbs individual, that is roughly 4.4-6.6 lbs, so even a 5 or 6 lbs increase or decrease in body weight could simply be due to our typical error! As you can see, understanding this typical error is extremely important in evaluating reliability and determining significant change.
A common calculation that allows practitioners to quickly and simply understand what constitutes a significant change is the Smallest Worthwhile Change (SWC). It is based on effect sizes and is calculated by multiplying the standard deviation of our data set by 0.2. The concept is that when the SWC is greater than our Typical Error, the measure is sensitive enough to detect worthwhile changes, and an observed change that exceeds both can be treated as a real change rather than error. The Smallest Worthwhile Change isn’t the only tool for identifying effect sizes, but it is worth a mention as it has become a commonly utilized tool and is often a question from practitioners.
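Following the description above (0.2 times the standard deviation of the data set), a minimal sketch comparing the SWC against an assumed typical error might look like this, with hypothetical jump-height values:

```python
import numpy as np

# Hypothetical squad jump heights (m) from a baseline testing session
jump_heights = np.array([0.42, 0.55, 0.38, 0.61, 0.47, 0.50, 0.44])

swc = 0.2 * np.std(jump_heights, ddof=1)  # smallest worthwhile change = 0.2 x between-athlete SD
typical_error = 0.012                     # assumed TE (m) from a prior test-retest analysis

if swc > typical_error:
    print(f"SWC ({swc:.3f} m) exceeds TE ({typical_error} m): "
          "changes of at least the SWC can be read as real.")
else:
    print(f"SWC ({swc:.3f} m) is smaller than TE ({typical_error} m): "
          "the measure is too noisy to detect worthwhile changes.")
```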
No data is better than bad data, and bad data is unreliable data. It is often assumed that new technologies are inherently reliable and scientific purely because they are digital, wireless, or sleek, but that is often not the case. Reliability analyses are often ignored because they aren’t exciting, but the very first question that needs to be asked when looking to collect data is: is this data reliable?