Science · March 15, 2026

3 parameters vs 270,000 runs: how TrueZone matches Apple's model

by Steinar Agnarsson

In 2023, a team at Apple published a paper in npj Digital Medicine that turned heads in the exercise science world. Nazaret et al. trained a deep learning model on 270,000 runs from 7,465 Apple Watch users to predict heart rate response during exercise. It was one of the largest studies of its kind -- a massive dataset, a sophisticated hybrid architecture, and results that demonstrated real predictive power.

TrueZone matches their accuracy with three parameters and no training data.

That sentence deserves unpacking, because the implications run deep.

What Apple built

The Apple model is technically impressive. It combines a causal convolutional neural network (CNN) with an ordinary differential equation (ODE) framework in a hybrid architecture trained end-to-end. The CNN processes historical workout data to produce a 32-dimensional latent embedding for each user -- a compressed representation of their fitness state. This embedding feeds into the ODE component, which then predicts heart rate trajectories during running.

The scale is notable: 270,000 runs from 7,465 users, all collected through Apple Watch. The model was trained globally -- learning patterns across the entire population to improve predictions for individuals.

Their results: a median heart rate prediction error of approximately 6 beats per minute, with the learned embeddings showing correlation with clinically measured VO2max. The paper demonstrated that deep learning on wearable data could capture meaningful physiological information.

It is good work. But it raises a question that the paper does not address: how much of this requires deep learning at all?

What TrueZone does differently

TrueZone takes a fundamentally different approach. Instead of learning patterns from a massive dataset, it starts from physiology.

The model is a pure physiological ODE -- a system of differential equations that describes how the human body responds to exercise. The equations encode known physiology: oxygen uptake kinetics, metabolic threshold dynamics, cardiac drift, and the interplay between aerobic and anaerobic energy systems.

This ODE has three core parameters:

  • E (Endurance): a value between 0 and 1 that captures aerobic efficiency -- how well the body sustains effort over time through fat oxidation and threshold alignment
  • Vmax (Maximum speed): the upper bound of the individual's performance capacity
  • P (Power/fitness): a scaling parameter that captures overall cardiovascular fitness

These three parameters, fitted to an individual's workout data using Bayesian inference, are sufficient to predict heart rate response during exercise.
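The article does not reproduce TrueZone's equations, but the general shape of such a model can be sketched. Below is a deliberately simplified first-order heart-rate response ODE using the three parameter names from above; the functional forms (the target-HR curve, the time constant, and the default resting/max values) are illustrative assumptions, not TrueZone's actual equations.

```python
def dhr_dt(hr, speed, E, Vmax, P, hr_rest=55.0, hr_max=190.0):
    """Toy first-order HR response ODE. The functional forms here are
    illustrative assumptions, not TrueZone's published model."""
    intensity = min(speed / Vmax, 1.0)        # relative effort in [0, 1]
    hr_target = hr_rest + (hr_max - hr_rest) * intensity / P
    tau = 30.0 * E + 10.0                     # response time constant (s)
    return (hr_target - hr) / tau

def simulate(speed, E, Vmax, P, t_end=600.0, dt=1.0):
    """Euler-integrate heart rate over a constant-pace effort."""
    hr = 60.0                                 # starting heart rate (bpm)
    for _ in range(int(t_end / dt)):
        hr += dt * dhr_dt(hr, speed, E, Vmax, P)
    return hr

# A 10-minute steady run at 3 m/s for a hypothetical athlete:
simulate(3.0, E=0.7, Vmax=5.0, P=1.0)         # settles toward hr_target
```

The real model presumably couples several such equations (oxygen uptake kinetics, drift, energy-system interplay), but the structural point survives the simplification: all individual variation enters through a handful of named parameters rather than a learned latent vector.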

There is no global training set. No neural network. No 32-dimensional latent space. Just three numbers with clear physiological meaning, accumulated from each user's own data.

Head to head

We ran TrueZone on a comparable task using the Endomondo dataset -- 75 runners across 2,970 sessions with GPS, pace, and heart rate data. The comparison was not identical to Apple's setup (different population, different data source), but the task was the same: predict heart rate from exercise intensity.

The results:

Metric        TrueZone    Apple
Fitted MAE    7.0 bpm     7.22 bpm
MAPE          4.3%        4.2%

Effectively a wash. TrueZone's mean absolute error was slightly lower; Apple's mean absolute percentage error was slightly lower. The differences are well within noise. On the core task of heart rate prediction, three physiological parameters match the performance of a deep neural network trained on 270,000 runs.
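For reference, the two metrics in the table are standard. A minimal NumPy version, with a made-up three-sample example to show the units:

```python
import numpy as np

def mae(pred, actual):
    """Mean absolute error, in the units of the signal (here, bpm)."""
    return np.mean(np.abs(pred - actual))

def mape(pred, actual):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs(pred - actual) / actual) * 100

# Hypothetical predicted vs. observed heart rates (bpm):
predicted = np.array([148.0, 162.0, 171.0])
observed  = np.array([152.0, 158.0, 176.0])
mae(predicted, observed)    # ≈ 4.33 bpm
mape(predicted, observed)   # ≈ 2.67 %
```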

Why this matters more than accuracy

If both models predict heart rate equally well, you might ask: what is the difference? The difference is everything that comes after the prediction.

Apple's 32-dimensional embedding is opaque. It captures something about fitness, but you cannot look at it and say what. The paper shows that post-hoc linear regression against VO2max produces a correlation -- meaning the embedding implicitly contains VO2max-related information. But extracting it requires an additional regression step, and the remaining 31 dimensions have no clear interpretation. You cannot tell a user "your endurance improved by 5%" or "your threshold shifted up" based on movements in a latent space.

TrueZone's three parameters are directly interpretable. E tells you about endurance and fat oxidation efficiency. Vmax tells you about maximum performance capacity. P tells you about cardiovascular fitness. When E increases over a training block, you know the athlete's aerobic base is developing. When Vmax drops, you know maximum capacity is declining. These are things a coach or athlete can act on.

More importantly, TrueZone derives a complete physiological profile from those three parameters. Not just heart rate predictions, but:

  • Exercise thresholds (aerobic, lactate, anaerobic) in both heart rate and pace/power
  • Training zones with physiological boundaries, not arbitrary percentages
  • Estimated HRmax from the model, not from age-based formulas
  • Race predictions calibrated to the individual's endurance and capacity
  • Endurance tracking over weeks, months, and years

Apple's model predicts heart rate. That is its output. TrueZone predicts heart rate as a byproduct of modeling the underlying physiology -- the heart rate prediction is a validation of the model, not its purpose.

The training data problem

There is a practical dimension that matters enormously for real-world deployment.

Apple's model requires 270,000 runs to train. It needs a massive, centralized dataset of Apple Watch users before it can make predictions. This creates several constraints:

  • Platform lock-in: the model is trained on Apple Watch data. Deploying it on Garmin, COROS, or Polar data would require retraining on equivalent datasets from those platforms.
  • Cold start: a new user with no history gets predictions based on the population model, not their own physiology.
  • Data dependency: the model's quality depends on continued access to large-scale user data, with all the privacy and regulatory implications that entails.

TrueZone has none of these constraints. Because it starts from physiological first principles rather than learned patterns, it works on any data source that provides heart rate and pace (or power). Garmin, Apple Watch, COROS, Polar, Wahoo, a chest strap paired with a phone -- it does not matter. The model fits to the individual from their own data, session by session, using Bayesian accumulation.

A new user gets a meaningful profile after a handful of workouts. No global training set required. No platform-specific model needed.
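Session-by-session accumulation can be sketched as a sequential Bayesian update: each workout's likelihood multiplies the posterior carried over from previous workouts. The grid approximation, Gaussian likelihood, and toy linear predictor below are all illustrative assumptions -- a one-parameter cartoon, not TrueZone's actual inference.

```python
import numpy as np

# Grid-approximated posterior over a single parameter (E in [0, 1]).
E_grid = np.linspace(0.01, 0.99, 99)
posterior = np.ones_like(E_grid) / E_grid.size   # flat prior for a new user

def session_likelihood(E, observed_hr, predict, sigma=6.0):
    """Gaussian likelihood of one session's HR samples under parameter E."""
    resid = observed_hr - predict(E)
    return np.exp(-0.5 * np.sum(resid**2) / sigma**2)

def update(posterior, observed_hr, predict):
    """One Bayesian update: posterior ∝ prior × session likelihood."""
    like = np.array([session_likelihood(E, observed_hr, predict)
                     for E in E_grid])
    post = posterior * like
    return post / post.sum()

# Toy predictor where the athlete's "true" E is 0.7; five noisy workouts.
rng = np.random.default_rng(0)
for _ in range(5):
    obs = 100 + 50 * 0.7 + rng.normal(0, 6, size=20)
    posterior = update(posterior, obs, lambda E: 100 + 50 * E)

E_grid[np.argmax(posterior)]   # MAP estimate, close to 0.7
```

The posterior sharpens with every session, which is why a handful of workouts suffices: the model is not learning physiology from scratch, only locating one individual within a space the equations already define.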

What this says about the field

The comparison between TrueZone and Apple's model illustrates a broader tension in exercise science and health tech.

The machine learning approach -- throw massive data at a powerful model and let it learn the patterns -- has become the default in industry. It works. But it often works by recapitulating physics and physiology that we already understand, at enormous computational and data cost, while sacrificing interpretability.

The alternative is to encode what we know about human physiology into the model structure and let the data determine the individual parameters. This is not a new idea -- it is how physics-based modeling has always worked. But it has been underexplored in the wearable fitness space, where the default assumption has been that physiology is too complex for simple models.

The evidence suggests otherwise. Three parameters, grounded in known exercise physiology, capture individual variation in heart rate response as accurately as a deep neural network with access to 270,000 training runs.

The implication for the industry is significant. You do not need massive datasets and GPU clusters to build accurate, personalized fitness models. You need the right equations and a principled way to fit them to individual data. The physiology was there all along -- waiting to be written down, not learned from scratch.