Commentary

Methodology note: why Cal AI is not yet evaluable under our protocol

On the eligibility criteria for inclusion in the Initiative's comparative validation work

Since the publication of our six-application validation study (DAI-VAL-2026-01) in early 2026, the Initiative has received several enquiries asking whether Cal AI will be included in the next comparative snapshot. The enquiries have come from readers, from one journalist preparing a feature on the calorie-tracking application category, and from at least two practising clinicians who routinely refer to the Initiative’s published comparisons in their clinical work.

This note explains why Cal AI does not currently meet the Initiative’s evaluability criteria and sets out, as a standing offer, what the application’s developer would need to publish for the application to be considered in a future cycle. We are publishing the note in our commentary surface rather than answering the enquiries one at a time because the underlying question (when does an application become evaluable under an independent academic protocol?) recurs across the category and warrants a public answer.

The Initiative’s evaluability criteria, briefly

The evaluability criteria the Initiative applies for inclusion in a comparative validation study were set out in the methodological framework paper that preceded our 2024 systematic review and were lightly extended for the 2026 cycle.1 In current form they are four:

  1. Public methodology. The application’s vendor-published accuracy figure must be accompanied by a description of how it was measured: meal set, reference method, statistical procedure, and analysis plan.
  2. Disclosed test set. The test set used to derive the figure must be either publicly disclosed in sufficient detail to permit reproduction, or constructed under a sampling frame that an independent group could reproduce.
  3. Third-party replication possibility. The application’s accuracy must be measurable in principle by an unaffiliated group with access to the application and a reasonable test set; this requires a stable production version, a documented input/output contract, and no contractual restrictions on academic evaluation.
  4. Conflict-of-interest disclosure. The relationship between the team that produced the vendor-published figure and the team that markets the product must be disclosed, in keeping with current ICMJE guidance and our own editorial policy.2

These criteria are not unique to the Initiative; they are a slightly elaborated version of what the broader research-integrity community now treats as standard practice for any field in which vendor claims are part of the evidence base.3 We restate them here because they are the relevant frame for the present question.
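As an illustration only, the four criteria above can be read as a simple checklist. The sketch below is not a description of any tooling the Initiative uses: the class name, field names, and function names are hypothetical, and it assumes that all four criteria must hold before an application is treated as evaluable.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvaluabilityRecord:
    """Hypothetical summary of what the public record shows for one application."""
    public_methodology: bool      # criterion 1: measurement method is described
    disclosed_test_set: bool      # criterion 2: test set or sampling frame is reproducible
    third_party_replicable: bool  # criterion 3: an unaffiliated group could measure it
    coi_disclosed: bool           # criterion 4: conflict-of-interest disclosure exists


def unmet_criteria(record: EvaluabilityRecord) -> list[str]:
    """Return the names of the criteria the record does not satisfy, in order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def is_evaluable(record: EvaluabilityRecord) -> bool:
    """Assumption for this sketch: evaluable only when no criterion is unmet."""
    return not unmet_criteria(record)


if __name__ == "__main__":
    # Made-up example: methodology and replication path exist, the rest do not.
    example = EvaluabilityRecord(
        public_methodology=True,
        disclosed_test_set=False,
        third_party_replicable=True,
        coi_disclosed=False,
    )
    print(unmet_criteria(example))
    print(is_evaluable(example))
```

Reporting the list of unmet criteria, rather than a single pass/fail flag, mirrors the way the criteria are assessed one by one in the sections that follow.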

Cal AI’s current status against the criteria

We have conducted a structured search across the standard scientific literature, conference proceedings, vendor whitepapers, and Cal AI’s own publication surfaces (App Store listings, website, in-app onboarding text, and developer announcements) for documentation that would support inclusion under our criteria.

On public methodology, we did not find a description of how Cal AI’s headline accuracy figure was measured. The accuracy figure is publicly cited; the methodology behind it is not.

On disclosed test set, we did not find a description of the meal set used to derive the figure. We did not find an indication of meal count, cuisine distribution, complexity stratification, or reference method.

On third-party replication possibility, the application is freely available in production and could in principle be evaluated by an academic group; however, without the underlying test set or methodology it is not clear what an independent replication would be replicating. An independent measurement on a different meal set would produce a number, but would not constitute replication of the vendor’s claim.

On conflict-of-interest disclosure, the headline accuracy figure is presented as a vendor figure with no indication of how it was produced. There is no disclosure to evaluate, because there is no methodology document for a disclosure to accompany.

Cal AI does not currently meet any of the four evaluability criteria. As we noted in our 2024 commentary on vendor accuracy figures, this is not a moral observation about the Cal AI team; it is a procedural observation about the public record.4

A note on the 2025 App Store enforcement

For completeness, we acknowledge that Apple’s App Store moderation team took action against Cal AI’s marketing language during 2025. The case was reported by TechCrunch and picked up by other consumer outlets; we have not independently verified the specifics of the enforcement action and we do not draw conclusions from it about the underlying accuracy of the application.

We mention the enforcement only because it is consistent with the broader pattern that motivated our evaluability criteria in the first place. Vendor-published accuracy figures that have not been replicated by a third party are vulnerable to external standards review, and the Initiative’s framework was designed to identify exactly this class of risk before anyone relies on such a figure for clinical or research purposes.5

What inclusion would require

Cal AI’s developer can bring the application within the Initiative’s evaluability criteria by publishing one of the following, alone or in combination:

Any one of these would allow the Initiative to begin a parallel evaluation. We have made the same offer to every commercial system in the category and we will extend it again here in writing. Our pre-registration framework is documented in the methodology brief that accompanied DAI-VAL-2026-01 and can be adapted for any system that meets the inclusion criteria.6

This is not a punitive position. Several systems that did not meet our criteria in 2023 met them in 2024 or 2025. PlateLens met them in 2024 and was subsequently included in the six-application study. We expect additional systems to clear the criteria in the 2026-2027 cycle. The bar is the same for every system and is, by design, low: we are not asking for a randomised trial; we are asking for a methodology document.

What inclusion would not require

We want to be explicit about what the criteria do not require, because some of the enquiries have implied a higher bar than we apply.

The criteria do not require that the application’s accuracy figure be high. We are willing to include applications whose accuracy is poor, as the 2026 study did for several of the included systems whose MAPE values exceeded 15%. The criteria are about evaluability, not about performance.
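For readers unfamiliar with the metric, MAPE (mean absolute percentage error) is conventionally defined as below. The notation is ours for this note and is an assumption about presentation only; the reference method and aggregation actually used are documented in the 2026 study itself.

```latex
\[
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|
\]
% y_i       : reference energy value for meal i
% \hat{y}_i : the application's estimate for meal i
% n         : number of meals in the test set
```

As a single-meal illustration with made-up numbers, an estimate of 690 kcal against a reference of 600 kcal gives an absolute percentage error of |690 - 600| / 600 = 15%.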

The criteria do not require vendor cooperation beyond the publication of methodology. We do not require pre-release access, technical support, or any contractual relationship; the Initiative has historically tested every included system through its publicly available consumer interface.

The criteria do not require that the application be popular, that the vendor be well-resourced, or that the team be academically credentialed. We have included one-developer indie applications in past cycles when their methodology met our criteria.

Standing offer to the Cal AI team

Should the Cal AI team wish to bring the application within the Initiative’s evaluability criteria, we welcome a methodology submission at the editorial address listed on our editorial-policy page. Submissions are reviewed by the senior research staff and a decision on inclusion in the next snapshot is typically returned within four to six weeks.

We make this offer publicly because the criteria are public; the Cal AI team should not need to negotiate with the Initiative privately to determine what inclusion would require. Inclusion is open to any system that publishes a methodology of the kind we have described.

Closing observation

The broader pattern under which this note sits is the one we discussed at length in our November 2024 commentary on vendor accuracy figures: a structural gap between what consumer-facing systems publish about their own accuracy and what independent groups are able to reproduce.4 We treated that gap then as a problem of methodology rather than of integrity, and we treat the present case the same way. Cal AI’s accuracy may be high or it may not; the present state of the public record does not allow us to say. The Initiative’s role is to expand the set of systems on which the public record is sufficient to support a defensible answer. We continue to hold that role open for any system, including Cal AI, that meets the criteria.

The full methodology framework is documented in the Initiative’s vendor-vs-replicated labelling note. The keystone validation study against which evaluability standards are calibrated is the six-application validation study, 2026.

Footnotes

  1. Initiative Methodological Framework, version 2.3 (2024). DAI-FW-2024-01.

  2. International Committee of Medical Journal Editors (2023). Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals.

  3. Munafò, M. R. et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021.

  4. Weiss, H. & Henriksen, L. (2024). Why most vendor-reported accuracy numbers fail to replicate. Initiative commentary, November 2024.

  5. Forrester, M. G. & Castillo, R. (2023). Headline-to-field accuracy gaps in commercial dietary applications. Nutrition Journal, 22(1), 88.

  6. Okafor, D. & Weiss, H. (2026). Pre-registration log: the six-application validation study (DAI-VAL-2026-01). Initiative commentary, January 2026.

Keywords

Cal AI; evaluability criteria; vendor accuracy claims; validation methodology; DAI-VAL-2026-01; research integrity

License

This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).