| Accuracy |
Technical/Model Performance |
Yes |
Minimum (Low-Data Feasible) |
Clinical |
|
The proportion of AI outputs that are correct when compared to a reference standard. |
How often the AI produces accurate results relative to expected or verified outcomes. |
Accuracy directly influences clinical safety, trust, and downstream decision quality. |
Comparison of AI outputs against chart review, expert adjudication, or validated reference datasets. |
Low |
Yes |
KPI |
Attribution to AI versus other workflow or staffing changes; Seasonality or volume fluctuations. |
|
|
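The accuracy measurement described above (AI outputs compared against a reference standard) can be sketched as a simple proportion. A minimal illustration, assuming outputs and reference labels are paired one-to-one; the labels are hypothetical:

```python
def accuracy(ai_outputs, reference):
    """Proportion of AI outputs that match the reference standard."""
    if len(ai_outputs) != len(reference):
        raise ValueError("outputs and reference must align one-to-one")
    correct = sum(1 for out, ref in zip(ai_outputs, reference) if out == ref)
    return correct / len(ai_outputs)

# 4 of 5 outputs agree with expert adjudication -> 0.8
print(accuracy(["A", "B", "A", "C", "B"], ["A", "B", "A", "C", "A"]))
```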
| AI Recommendation Override Rate |
User Experience |
Yes |
Minimum (Low-Data Feasible) |
Operational |
Safety |
How often users choose not to follow AI recommendations, indicating trust, fit, or concern. |
Human oversight behavior |
High override rates may signal safety or usability issues |
System logs, user self-report |
Low |
Yes |
KPI |
Incomplete logging or data capture; Variation across sites, users, or patient groups. |
|
FUTURE-AI Guidelines |
| Click Through Rate (CTR) |
Implementation Evaluation |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
|
Represents the percentage of people who take action after an exposure or event.
Shows whether exposure leads to engagement with the capability.
Can be used to measure adoption rate, action taken rate, and alert engagement rate and ignore rate. |
Number of actions taken divided by number of exposures. |
Low |
Yes |
Both |
Incomplete logging or data capture; Infrastructure variability. |
|
|
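The CTR calculation above (actions taken divided by exposures) can be sketched directly. A minimal illustration with hypothetical counts:

```python
def click_through_rate(actions_taken, exposures):
    """CTR: number of actions taken divided by number of exposures, as a percentage."""
    if exposures == 0:
        return 0.0  # avoid division by zero when nothing was shown
    return 100.0 * actions_taken / exposures

# 30 alert interactions out of 400 alert exposures -> 7.5 (%)
print(click_through_rate(30, 400))
```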
| Clinical Use Appropriateness |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Clinical |
Safety |
The extent to which the AI tool is used for the clinical purposes and patient groups it was intended for. |
Alignment between intended and actual clinical use |
Helps prevent unsafe or inappropriate application of AI |
Chart review, usage audits, clinician self-report |
Low |
Yes |
KPI |
Incomplete logging or data capture. |
|
FUTURE-AI Guidelines |
| Clinical Workflow Fit |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Operational |
|
How well the AI fits into existing clinical workflows without disrupting care delivery. |
Ease of integration into care processes |
Poor workflow fit reduces adoption and effectiveness |
Workflow observation, user feedback |
Low |
Yes |
Both |
Incomplete logging or data capture; Variation across users. |
|
FUTURE-AI Guidelines |
| Clinician-Reported Clinical Value |
User Experience |
Yes |
Minimum (Low-Data Feasible) |
Experience (clinician or patient) |
Clinical |
Clinicians’ assessment of whether the AI meaningfully supports clinical decision-making or care delivery. |
Perceived contribution to clinical decision-making |
Captures value in settings where outcome data are limited |
Clinician surveys or structured interviews |
Low |
Yes |
Both |
Self-report bias; Small sample size. |
|
FUTURE-AI Guidelines |
| Coherence / Fluency |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Experience (clinician or patient) |
Quality |
Clarity and logical order of a response. |
Whether the answer stays on topic and makes logical sense from beginning to end.
Indicates whether the AI stays on topic; the value of an output is lost if the response is poorly structured or disjointed.
Typically measured through human or user ratings. Advanced tools can perform automated contradiction detection.
High |
No |
KPI |
Small sample size; Variation across evaluators. |
|
|
| Contextual Suitability |
Implementation Evaluation |
|
Expanded (Moderate Capacity) |
Operational |
Equity |
How well the AI works across different care contexts, resource levels, or local practices. |
Context sensitivity |
Supports transferability across settings |
Qualitative assessment, pilot feedback |
Medium |
No |
KPI |
Variation across sites, users, or patient groups; Small sample size. |
|
FUTURE-AI Guidelines |
| Cost of Implementation |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Financial |
Operational |
The total cost required to introduce the AI tool, including licensing, setup, and staff time. |
Upfront financial investment |
Critical for budgeting and approval decisions |
Budget records, procurement data |
Low |
Yes |
Both |
Attribution to AI versus other workflow or staffing changes; Seasonality or volume fluctuations. |
|
FUTURE-AI Guidelines |
| Downstream Benefits |
Implementation Evaluation |
Yes |
Expanded (Moderate Capacity) |
Financial |
Operational |
Secondary or indirect benefits that occur as a result of initial improvements enabled by an AI solution. |
The broader unexpected effects that follow primary gains. |
Many AI benefits are not directly measurable but materially increase total value over time when acknowledged in decision-making.
Estimated through historical data, modeling, or expert judgment |
High |
No |
Both |
Small sample size; Model drift or system updates over time. |
|
Bharadwaj P, Nicola L, Breau-Brunel M, Sensini F, Tanova-Yotova N, Atanasov P, Lobig F, Blankenburg M. Unlocking the Value: Quantifying the Return on Investment of Hospital Artificial Intelligence. J Am Coll Radiol. 2024 Oct;21(10):1677-1685. doi: 10.1016/j.jacr.2024.02.034. Epub 2024 Mar 16. PMID: 38499053. |
| Ease of Use |
User Experience |
Yes |
Minimum (Low-Data Feasible) |
Experience (clinician or patient) |
Trust |
How easy and intuitive the AI tool is for users to operate within their normal workflow. |
User perception of simplicity, clarity, and effort required to use the tool. |
Tools that are difficult to use reduce adoption, increase frustration, and limit realized value. Ease of use directly influences sustained engagement and trust. |
Short user surveys (for example, 1–5 rating scale), structured feedback questions, or brief post-implementation interviews. |
Low |
Yes |
KPI |
Self-report bias; small sample size; early novelty effects. |
|
|
| Error Rate |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Clinical |
Safety |
The frequency with which the AI produces incorrect outputs relative to a defined reference standard. |
The proportion of AI outputs that are incorrect when compared to expected, verified, or gold-standard results. |
A higher error rate increases clinical risk, reduces trust, and may undermine adoption. Monitoring error rate supports early detection of performance issues. |
Comparison of AI outputs against chart review, expert adjudication, or validated reference datasets; expressed as a percentage of incorrect outputs. |
Low |
Yes |
KPI |
Incomplete logging or data capture; Small sample size. |
|
https://www.producttalk.org/glossary-ai-error-rate/
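Expressed as a percentage of incorrect outputs against a reference standard, the error rate above is the complement of accuracy. A minimal sketch with hypothetical labels:

```python
def error_rate(ai_outputs, reference):
    """Percentage of AI outputs that disagree with the reference standard."""
    incorrect = sum(1 for out, ref in zip(ai_outputs, reference) if out != ref)
    return 100.0 * incorrect / len(ai_outputs)

# 1 of 4 outputs differs from the gold standard -> 25.0 (%)
print(error_rate([1, 1, 0, 1], [1, 0, 0, 1]))
```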
| Hours Saved |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Operational |
Financial |
Represents aggregate time savings across the organization. The total staff time saved across a defined period due to AI-assisted or automated tasks. |
The time returned to the organization. |
Faster service delivery |
Time-per-task comparison, process step reduction, productivity gains.
Low |
Yes |
Both |
Small sample size; Reference standard limitations. |
|
|
| Implementation Effort |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Operational |
Financial |
The amount of staff time and organizational effort required to implement and maintain the AI tool. |
Organizational burden of adoption |
High effort can limit scalability |
Project tracking, staff time estimates |
Low |
Sometimes |
Both |
Attribution to AI versus other workflow or staffing changes; Variation across sites, users, or patient groups. |
|
FUTURE-AI Guidelines |
| Intended Population Coverage |
Implementation Evaluation |
|
Expanded (Moderate Capacity) |
Equity |
Operational |
The extent to which the AI reaches the populations and settings it was designed to support. |
Reach across intended populations |
Uneven reach can worsen inequities |
Deployment records, site reports |
Medium |
No |
Both |
Small sample size; Variation across sites, users, or patient groups. |
|
FUTURE-AI Guidelines |
| Model Latency |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
Experience |
The time required for the AI system to generate a response after receiving a request. |
Reflects system and infrastructure performance. Low latency supports higher adoption, greater impact, and stronger ROI; high latency leads to lower adoption, reduced impact, and weaker ROI.
Model latency matters because speed determines usability. If AI responses are slow, adoption drops, workflow efficiency declines, and expected ROI is not realized. |
Response end time minus response start time. Several types of latency can be measured: end-to-end latency, model processing time (backend latency), and time to first token (for streaming systems).
Low |
Yes |
KPI |
Small sample size; Variation across evaluators. |
|
|
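The end-to-end latency measurement above (response end time minus response start time) can be sketched with a wall-clock timer wrapped around any model call. The lambda below is an illustrative stand-in, not a real model client:

```python
import time

def measure_latency(fn, *args):
    """End-to-end latency: response end time minus response start time, in seconds."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Illustrative stand-in for a model call
result, latency = measure_latency(lambda prompt: prompt.upper(), "hello")
print(result, latency >= 0.0)
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and suited to short interval timing.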
| Observed Harm or Safety Concerns |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Clinical |
Safety |
The presence and frequency of reported patient safety concerns or unintended negative effects related to AI use. |
Patient safety risks linked to AI |
Early detection of harm supports safer deployment |
Incident reports, safety logs, qualitative feedback |
Low |
Yes |
Both |
Small sample size; Inconsistent documentation quality. |
|
FUTURE-AI Guidelines |
| Ongoing Operating Cost |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Financial |
Sustainability |
The recurring costs needed to maintain the AI tool over time. |
Long-term financial burden |
Affects sustainability and scale-up |
Financial tracking, vendor contracts |
Low |
Yes |
Both |
Attribution to AI versus other workflow or staffing changes; Seasonality or volume fluctuations. |
|
FUTURE-AI Guidelines |
| Perceived Value for Money |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Financial |
Experience |
Decision-makers’ assessment of whether the benefits of the AI justify its costs. |
Overall value judgment |
Supports go/no-go and renewal decisions |
Leader survey / structured assessment |
Medium |
Yes |
Both |
Self-report bias; Attribution to AI versus other financial or operational changes. |
|
FUTURE-AI Guidelines |
| Recommendation Acceptance Rate |
User Experience |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
Trust |
The proportion of AI-generated recommendations that are accepted and acted upon by users. |
The rate at which AI advice translates into real-world clinical or operational action. |
AI only creates value when recommendations influence behavior. A high acceptance rate may signal workflow fit and trust. A low acceptance rate may indicate usability issues, misalignment with clinical judgment, or weak change management. |
Number of accepted recommendations divided by total recommendations generated within a defined period. |
Medium |
No |
Both |
Attribution to AI versus concurrent workflow or staffing changes; variation across users or sites; incomplete logging. |
|
|
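The acceptance-rate calculation above (accepted recommendations divided by total generated within a defined period) can be sketched as follows; the counts are hypothetical:

```python
def acceptance_rate(accepted, generated):
    """Accepted recommendations divided by total recommendations generated in the period."""
    if generated == 0:
        raise ValueError("no recommendations generated in this period")
    return accepted / generated

# 42 of 60 recommendations acted upon in the reporting period
print(acceptance_rate(42, 60))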
| Reported Equity Concerns |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Equity |
Risk |
The presence of reported concerns that the AI performs differently or creates barriers for certain groups. |
Perceived inequitable impacts |
Helps identify emerging bias in low-data settings |
User feedback, stakeholder reports |
Low |
Yes |
KPI |
Small sample size; Variation across sites, users, or patient groups. |
|
FUTURE-AI Guidelines |
| Request throughput |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
Time |
The number of requests an AI system can process within a defined time period. |
The system’s capacity to handle workload volume under typical or peak conditions. |
If throughput is too low relative to demand, response times increase, adoption drops, and expected operational or financial benefits may not be realized. Adequate throughput supports scalability and reliability. |
Total number of requests processed per unit of time (for example, requests per minute or requests per hour), based on system logs under normal or peak usage conditions. |
Low |
Yes |
KPI |
Incomplete logging; variability in workload mix; infrastructure differences across environments. |
|
|
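The throughput measurement above (requests processed per unit of time, from system logs) can be sketched by counting timestamped requests in a window. The timestamps below are hypothetical, in seconds since the window start:

```python
def request_throughput(request_log, window_seconds):
    """Requests processed per second within a defined time window, from timestamped logs."""
    in_window = [t for t in request_log if 0 <= t < window_seconds]
    return len(in_window) / window_seconds

# 6 requests completed in a 60-second window -> 0.1 req/s (6 req/min)
timestamps = [1.2, 5.0, 12.4, 30.1, 44.9, 59.8]
print(request_throughput(timestamps, 60))
```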
| Retrieval Latency |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
|
Time taken for the system to find and return relevant results after a query.
How quickly the system can find the information the AI needs. |
Poor indexing increases retrieval time and will directly impact total response time. |
Total AI response time = retrieval time + model generation time |
Low |
Yes |
Both |
Incomplete logging or data capture; Infrastructure variability. |
|
|
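The decomposition above (total AI response time = retrieval time + model generation time) can be sketched directly; the timings are hypothetical:

```python
def total_response_time(retrieval_s, generation_s):
    """Total AI response time = retrieval time + model generation time, in seconds."""
    return retrieval_s + generation_s

# 0.15 s retrieval + 1.05 s generation -> about 1.2 s total
print(total_response_time(0.15, 1.05))
```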
| Return Period |
Implementation Evaluation |
Yes |
Advanced (High Analytic Capacity) |
Financial |
|
The time period after deployment of the AI tool until revenue surpasses implementation expenses. |
How long it takes for the investment in the AI tool to pay for itself and begin generating net financial benefit.
Clarifies how long the cost burden persists before the investment is recouped.
Calculated by dividing total implementation and operating costs by estimated annual net benefit. |
Low |
Yes |
ROI Input |
Attribution to AI versus other workflow or staffing changes; Variation across sites, users, or patient groups. |
|
|
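The return-period calculation above (total implementation and operating costs divided by estimated annual net benefit) can be sketched as follows; the dollar figures are hypothetical:

```python
def return_period_years(total_cost, annual_net_benefit):
    """Years until cumulative net benefit covers implementation and operating cost."""
    if annual_net_benefit <= 0:
        raise ValueError("no payback when annual net benefit is zero or negative")
    return total_cost / annual_net_benefit

# $300,000 total cost, $120,000 estimated annual net benefit -> 2.5 years
print(return_period_years(300_000, 120_000))
```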
| Safety |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Clinical |
Safety |
The extent to which AI outputs avoid generating harmful, misleading, biased, or clinically unsafe information. |
Frequency or presence of unsafe AI outputs or clinically inappropriate recommendations. |
Unsafe outputs can lead to clinical safety risk, regulatory issues, legal risk, reputational damage, and loss of trust.
Harm incident rate, hallucination rate, compliance rate, and bias detection metrics. |
Medium |
Sometimes |
KPI |
Small sample size; model drift or system updates over time. |
|
|
| Summarization Quality |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Experience (clinician or patient) |
Clinical |
Measures how well the AI summary captures the most important information from a longer source without leaving out critical details or adding incorrect information.
This measurement helps assess whether the AI distorts meaning, introduces new information, preserves context, and includes the key points of the source.
This measurement is often evaluated alongside time savings because low-quality summaries increase editing effort and reduce realized efficiency. |
Measured using coverage scores, calculated as the proportion of key facts included relative to total critical facts. |
High |
No |
ROI Input |
Attribution to AI versus other workflow or staffing changes; Variation across sites, users, or patient groups. |
|
|
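The coverage score above (key facts included divided by total critical facts) can be sketched with naive string matching as a proxy for expert fact-checking; the clinical snippets below are entirely hypothetical:

```python
def coverage_score(key_facts, summary_text):
    """Proportion of critical facts that appear in the summary (string match as a naive proxy)."""
    found = sum(1 for fact in key_facts if fact.lower() in summary_text.lower())
    return found / len(key_facts)

facts = ["penicillin allergy", "type 2 diabetes", "fall risk", "DNR order"]
summary = "Patient has type 2 diabetes and a documented penicillin allergy; fall risk noted."
# 3 of 4 critical facts covered -> 0.75
print(coverage_score(facts, summary))
```

In practice the "key facts" list would come from expert annotation of the source, and matching would be done by a reviewer rather than substring search.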
| System Availability |
Implementation Evaluation |
Yes |
Expanded (Moderate Capacity) |
Operational |
Experience |
The proportion of time the AI tool is accessible and functioning when users need it. |
Technical reliability |
Unreliable systems undermine trust and use |
System logs, uptime monitoring |
Low |
Yes |
KPI |
Incomplete logging or data capture; Infrastructure variability. |
|
FUTURE-AI Guidelines |
| Task Completion Time / Time to Decision |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Operational |
Financial |
Represents time per individual task. The average time required to complete a task with the AI compared to without it. |
Change in task duration |
Shows whether AI improves operational efficiency |
Before–after comparisons, sampling |
Medium |
No |
ROI Input |
Attribution to AI versus other workflow or staffing changes; Variation across sites, users, or patient groups. |
|
FUTURE-AI Guidelines |
| Text Quality or Quality Proximity |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Experience (clinician or patient) |
Quality |
Measures how clear, accurate, complete, and useful AI generative text is for its intended purpose. |
Measures whether the output is good enough to use: factually correct and relevant to the input.
High quality proximity reduces editing time, speeds decisions, lowers rework, and improves consistency. Lower quality creates hidden labor costs, safety risks, trust issues, and adoption decline.
Typically measured by the percentage of outputs used without modification. |
High |
No |
Both |
Small sample size; Variation across evaluators. |
|
|
| Thumbs Up / Down Feedback |
User Experience |
Yes |
Minimum (Low-Data Feasible) |
Experience (clinician or patient) |
Trust |
User rating of whether they found the AI response helpful.
The satisfaction of users with the system |
Provides a simple, scalable signal of perceived usefulness and early user trust. |
Binary or scaled in-system feedback collected during or immediately after AI interaction. |
Low |
Yes |
Both |
Self-report bias. |
|
|
| Time on Site (Session Duration) |
User Experience |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
Experience |
The amount of time a user spends actively using the system during a single session.
Time spent in user engagement.
Helps to understand how users engage with the system.
Measured using system logs that record when a user starts and ends a session.
Low |
Yes |
Both |
Incomplete logging or data capture; Variation across users. |
|
|
| Token throughput |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Operational |
|
The number of tokens the model can generate per second under defined workload conditions. |
A measurement of speed and processing capacity |
Represents the efficiency of an AI system |
Number of tokens generated per unit of time under defined workload conditions. |
Medium |
No |
Both |
Incomplete logging or data capture; Infrastructure variability. |
|
|
| Training Completion |
Implementation Evaluation |
Yes |
Minimum (Low-Data Feasible) |
Operational |
Safety |
The percentage of intended users who have completed basic training on how to use the AI tool safely. |
User preparedness |
Training supports safe and effective use |
Training records |
Low |
Yes |
KPI |
Incomplete logging or data capture. |
|
FUTURE-AI Guidelines |
| User Adoption Rate |
User Experience |
Yes |
Minimum (Low-Data Feasible) |
Experience (clinician or patient) |
Operational |
The percentage of new users who adopt the technology. |
Indicates that the solution fits users' workflows.
Helps identify if tool/capability has moved from availability to usability and sustainability. |
Percentage of users that use the AI capability within a defined period |
Low |
Yes |
Both |
Incomplete logging or data capture; Variation across sites, users, or patient groups. |
|
|
| User Satisfaction |
User Experience |
Yes |
Minimum (Low-Data Feasible) |
Experience |
Trust |
Users’ overall satisfaction with the AI tool, including ease of use and perceived usefulness. |
User trust and acceptance |
Trust is critical for sustained use |
Surveys, feedback forms |
Low |
Yes |
KPI |
Self-report bias; Small sample size. |
|
FUTURE-AI Guidelines |
| Variability Reduction |
Implementation Evaluation |
Context-specific |
Expanded (Moderate Capacity) |
Clinical |
Trust |
The degree to which the system produces consistent and predictable outputs across similar cases.
Consistency |
Consistent outputs improve reliability, trust, and safety |
Comparison of standard deviation or range in outputs before and after implementation; or structured review of output consistency across similar cases. |
Medium |
No |
KPI |
Small sample size; Inconsistent documentation quality. |
|
|
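The before/after comparison of standard deviation described above can be sketched with the standard library; the turnaround times below are hypothetical minutes for similar cases:

```python
from statistics import pstdev

def variability_reduction(before, after):
    """Relative reduction in output spread (population std dev) after implementation."""
    sd_before, sd_after = pstdev(before), pstdev(after)
    return (sd_before - sd_after) / sd_before

# Turnaround times (minutes) for similar cases, before vs after the AI tool
before = [30, 50, 20, 60, 40]
after = [38, 42, 36, 44, 40]
print(variability_reduction(before, after))  # about 0.8, an 80% reduction in spread
```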
| Verbosity |
Technical/Model Performance |
Context-specific |
Advanced (High Analytic Capacity) |
Experience (clinician or patient) |
Quality |
How long an AI's output is. |
This measures how much text or explanation an AI system produces when responding to a request. |
It can be tracked to understand clarity, efficiency, cost, and user experience. |
Typically measured by average word count per response. |
Medium |
No |
KPI |
Attribution to AI versus other financial or operational changes; Seasonality or volume fluctuations. |
|
|
| Workflow Adaptation Time |
Implementation Evaluation |
Yes |
Expanded (Moderate Capacity) |
Operational |
Experience |
The time required for clinicians or staff to integrate the AI tool into routine workflows. |
The amount of time required for clinicians or staff to become comfortable using the AI tool as part of routine workflow, including adjustments to documentation, decision processes, or task sequencing. |
Tools that require long adaptation periods may reduce adoption and delay benefits. |
User surveys, observational workflow studies, or time-to-stable usage estimates. |
Medium |
Sometimes |
KPI |
Variation across users, services, or sites; difficulty separating AI learning curve from broader workflow change. |
|
|