Caution: Not All VO2 Max Tests Are Created Equal

VO2 Max has become easy to buy.

A mask, a treadmill, an app, and fifteen minutes later you walk away with a number that looks precise and feels authoritative. It's presented as lab-grade physiology made convenient, and most people have no reason to question it.

The problem is that convenience and accuracy are not the same thing, and in VO2 testing they are often in direct tension.

Many of the VO2 tests now being sold are not measuring oxygen consumption in the way exercise physiology has historically defined it. They are estimating it. That distinction is subtle, but it matters.

Measuring VO2 is fundamentally a gas-exchange problem. It depends on accurately capturing ventilation and the difference between inspired and expired oxygen. Small errors in either propagate quickly. Because that inspired-to-expired difference is only a few percentage points of oxygen, a drift of a few percent in airflow or in a measured gas fraction translates into a meaningful change in calculated VO2.
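
To make that propagation concrete, here is a minimal sketch in Python. It assumes a deliberately simplified formula (VO2 as ventilation multiplied by the inspired-minus-expired oxygen fraction, with equal inspired and expired volumes and no gas-condition corrections), and the ventilation and expired-fraction values are illustrative assumptions, not measured data. Real metabolic carts apply further corrections.

```python
FIO2_ROOM_AIR = 0.2093  # oxygen fraction of dry inspired room air


def vo2_l_per_min(ventilation_l_per_min, fio2, feo2):
    """Simplified VO2: ventilation times the inspired-minus-expired O2 fraction.

    Assumes inspired and expired volumes are equal and skips gas-condition
    corrections, so it is only suitable for illustrating error propagation.
    """
    return ventilation_l_per_min * (fio2 - feo2)


def percent_change(new, reference):
    """Relative change of `new` against `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# Illustrative values for hard exercise (assumed, not measured data).
ventilation = 100.0  # L/min
feo2 = 0.1650        # expired O2 fraction

baseline = vo2_l_per_min(ventilation, FIO2_ROOM_AIR, feo2)
flow_drift = vo2_l_per_min(ventilation * 1.03, FIO2_ROOM_AIR, feo2)  # airflow reads 3% high
gas_drift = vo2_l_per_min(ventilation, FIO2_ROOM_AIR, feo2 * 1.01)   # expired O2 reads 1% high

print(f"baseline VO2: {baseline:.2f} L/min")
print(f"3% airflow drift: {percent_change(flow_drift, baseline):+.1f}% VO2")
print(f"1% expired O2 drift: {percent_change(gas_drift, baseline):+.1f}% VO2")
```

Under these assumed numbers, an airflow error passes through one-for-one, while a 1% relative over-read of expired oxygen shifts calculated VO2 by roughly 3.7%, because the error lands on a small difference between two similar fractions.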

For decades, laboratories addressed this by collecting expired air over time, averaging it, and analysing stable gas concentrations. This reduces breath-to-breath variability and produces values suitable for diagnostics, research, and precise training prescription. It is slower and less convenient, but it prioritises measurement stability.

Most modern portable systems take a different approach. They rely on breath-by-breath analysis, treating each breath as an individual data point and smoothing the signal algorithmically. This allows for lighter hardware and greater mobility, but it introduces known limitations that are well described in the literature.

Breathing is inherently variable. Depth and timing fluctuate from breath to breath, particularly at lower ventilations and during changes in workload. When each breath is treated as a discrete measurement, noise increases. Software smoothing can reduce visible variability, but it does not remove underlying error. It spreads that error across neighbouring breaths rather than eliminating it.
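
One way to see why smoothing is not the same as accuracy is a small simulation. The sketch below uses entirely synthetic data: the "true" VO2, the sensor offset, and the breath-to-breath noise level are all hypothetical values chosen for illustration, and the trailing moving average stands in for whatever filter a given device actually uses.

```python
import random
import statistics


def moving_average(values, window):
    """Trailing moving average; early points average whatever is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


random.seed(0)  # fixed seed so the synthetic data is repeatable

TRUE_VO2 = 2.00  # L/min, hypothetical steady-state value
BIAS = 0.10      # L/min, hypothetical systematic measurement offset
NOISE_SD = 0.25  # L/min, hypothetical breath-to-breath scatter

breaths = [TRUE_VO2 + BIAS + random.gauss(0, NOISE_SD) for _ in range(200)]
smoothed = moving_average(breaths, window=15)

raw_sd = statistics.stdev(breaths)
smooth_sd = statistics.stdev(smoothed)
raw_error = statistics.mean(breaths) - TRUE_VO2
smooth_error = statistics.mean(smoothed) - TRUE_VO2

print(f"scatter: raw {raw_sd:.3f}, smoothed {smooth_sd:.3f} L/min")
print(f"mean error: raw {raw_error:+.3f}, smoothed {smooth_error:+.3f} L/min")
```

The smoothed trace looks far cleaner, yet its average still sits away from the true value by about the size of the offset. A tidy curve on an app screen says little about whether the underlying measurement is correct.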

Independent validation studies on portable breath-by-breath systems have shown acceptable agreement with laboratory systems under certain conditions, particularly for repeated testing and trend tracking. At the same time, those studies also report greater variability at lower intensities and increased sensitivity to protocol design, including stage length and workload progression. These are not flaws. They are trade-offs inherent to the measurement approach.

Additional problems arise from the testing approach itself.

Many commercial VO2 tests pair portable breath-by-breath systems with one-minute stages, large jumps in speed or power, or continuous ramp protocols that do not allow gas exchange to stabilise before workload changes again. Under these conditions, oxygen uptake lags behind external load, meaning the reported VO2 values reflect transient kinetics, protocol design, and signal processing rather than steady physiological demand.
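
The lag described above can be sketched with a simple first-order model, in which oxygen uptake approaches its new steady-state level exponentially after a step in workload. This is a simplification (it ignores any initial delay and any slow drift at high intensities), and the 30-second time constant is an assumed illustrative value, not a figure from any specific study or device.

```python
import math


def fraction_of_steady_state(seconds_into_stage, tau_seconds=30.0):
    """Share of the eventual VO2 rise completed after a step change in workload.

    Uses a mono-exponential model; tau_seconds is an assumed time constant.
    """
    return 1.0 - math.exp(-seconds_into_stage / tau_seconds)


for stage_length in (60, 120, 180):
    reached = fraction_of_steady_state(stage_length)
    print(f"{stage_length:>3} s stage: {100 * reached:.1f}% of the steady-state rise")
```

Under these assumptions, a one-minute stage ends with only about 86% of the rise completed, while a three-minute stage is above 99%. A slower time constant makes the shortfall at one minute larger still.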

That matters because those same VO2 values form the foundation for nearly every other metric people care about.

VO2 Max is rarely the most important output of a metabolic test. Performance is driven far more by where thresholds occur, how efficiently oxygen is used at submaximal workloads, and how stable internal load remains as intensity increases. When oxygen consumption is unstable or overestimated, threshold identification shifts. Training zones drift. Easy work becomes harder than intended, hard work becomes poorly targeted, and fuelling recommendations lose coherence.

None of this makes portable VO2 systems useless. Devices such as VO2 Master and PNOĒ are not junk, and the literature does not support treating them as such. They have been validated for specific use cases, including field-based assessments and tracking relative change over time, often with reported errors in the low single-digit percentages under controlled protocols.

Interpretation becomes fragile when outputs derived from short, non-steady protocols are treated as interchangeable with laboratory-style measurements, or when thresholds inferred from rapidly changing workloads are used as precise anchors for training prescription. These are methodological limitations, not questions of intent.

Most consumers are never told how VO2 is calculated, whether gas exchange reached equilibrium, how sensitive the result is to protocol choice, or that two systems can legitimately produce materially different values while both claiming validity. There is no requirement to explain this, and so it usually goes unexplained.

Accurate metabolic testing is not mysterious. It prioritises stable gas analysis over time, protocols long enough for physiological equilibrium, and interpretation that extends beyond a single peak value. It is slower, more controlled, and often less flattering.
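
As an illustration of what checking for equilibrium could look like, the sketch below compares the average of the most recent readings in a stage with the average of the readings just before them. The function name, the window size, and the 5% tolerance are hypothetical choices for this example, not an established standard or any vendor's method.

```python
def reached_steady_state(vo2_samples, window, tolerance=0.05):
    """Return True if the last `window` samples average within `tolerance`
    (as a fraction) of the `window` samples immediately before them.

    Both `window` and `tolerance` are hypothetical, illustrative settings.
    """
    if len(vo2_samples) < 2 * window:
        return False  # not enough data to compare two windows
    recent = vo2_samples[-window:]
    previous = vo2_samples[-2 * window:-window]
    recent_mean = sum(recent) / window
    previous_mean = sum(previous) / window
    return abs(recent_mean - previous_mean) <= tolerance * previous_mean


# Synthetic stage data in L/min (illustrative only).
still_rising = [1.00, 1.40, 1.70, 1.90, 2.00, 2.05]
settled = [2.00, 2.02, 1.98, 2.01, 1.99, 2.00]

print("still rising:", reached_steady_state(still_rising, window=3))
print("settled:", reached_steady_state(settled, window=3))
```

A check along these lines rejects a stage in which oxygen uptake is still climbing when the workload changes, which is the situation short stages tend to create.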

Like any physiological measurement, it still depends on proper calibration, protocol control, and experienced interpretation. No system is immune to error. The difference is that some methods are better equipped to minimise it, which is what makes them defensible.

The issue, then, is not VO2 Max itself, but how it is increasingly packaged and applied. Tests built around speed, portability, and app-driven outputs inevitably trade some measurement fidelity for convenience. That trade-off may be acceptable for tracking trends. It is not acceptable when precision matters.

The problem is not that physiology is complicated. It is that most people are never told how much the method matters.

A simple rule of thumb is this: when a VO2 test is built around speed and convenience, it is usually optimised for accessibility rather than precision. Tests that rely on very short stages, rapid workload increases, or app-generated outputs without explaining how equilibrium is assessed should be interpreted cautiously, particularly when the results are being used to set training zones or define thresholds.