Physicians, Error Rates and AI

Feb 07, 2026 | By S. Charles Bivens, ThM. | AI Ethics | The Being Human

Preamble: As physicians, you practice under the constraints of time, uncertainty, and moral responsibility. Nothing in this discussion diminishes that. My aim is to help ensure conversations about AI and clinical performance are honest, clinically grounded, and ethically balanced—because what you do and how tools affect it directly touches patient safety.

The problem with one-sided error rate comparisons. In public debates, I increasingly see single-number claims about “AI error rates,” often without the human baseline or matched methodology. That framing can be misleading. As you know from audit-and-feedback work, the denominator, case-mix, measurement setting, and ground truth formation all matter. If we report an AI error without comparable human data—or vice versa—we risk distorting clinical reality and, potentially, policy.


What the clinical data actually say:


1. Human diagnostic harm is non-trivial—and concentrated

The best current national estimate indicates approximately 795,000 Americans per year suffer death or permanent disability associated with diagnostic error. Harms are concentrated in a small number of high-incidence/high-severity conditions, including stroke, sepsis, pneumonia, venous thromboembolism, and lung cancer. These represent target-rich areas for safety improvement work, with or without AI support. Newman-Toker et al., BMJ Quality & Safety, 33(2), 109–120. https://doi.org/10.1136/bmjqs-2021-014130

2. Generative AI is not a replacement for expertise

A systematic review and meta-analysis in npj Digital Medicine found that the diagnostic performance of generative AI systems was comparable to that of non-expert physicians but inferior to that of expert physicians. This is consistent with your lived experience: general-purpose models may assist in reasoning and differential generation, but expert pattern recognition, context, and calibration remain decisive. Takita et al., npj Digital Medicine, 8, Article 1543. https://doi.org/10.1038/s41746-025-01543-z

3. Narrow, task-focused AI can exceed generalists in bounded domains—dermatology is a representative case

Meta-analytic evidence in skin cancer classification shows pooled AI sensitivity in the high 80s (percent) and specificity in the mid-to-high 70s, slightly outperforming clinicians overall, roughly comparable to expert dermatologists, and with clearer advantages over generalists. The practical takeaway: in well-scoped tasks with high-quality imaging and robust training data, AI can offer add-on value—particularly for non-specialists—when embedded with appropriate guardrails. Systematic review/meta-analysis of AI vs clinicians for skin cancer classification. Additional context: clinician sensitivity/specificity can improve with AI assistance, with larger gains for non-dermatologists.
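To make those pooled figures concrete, here is a short worked example of what a sensitivity in the high 80s and a specificity in the mid 70s would mean at the point of care. The prevalence value is an illustrative assumption, not a number from the cited review:

```python
# Worked example (illustrative assumptions): how pooled sensitivity/specificity
# translate into predictive values once prevalence enters the picture.
sens, spec = 0.88, 0.77   # roughly the pooled ranges described above
prevalence = 0.03         # assumed: 3% of referred lesions are malignant

ppv = (sens * prevalence) / (sens * prevalence + (1 - spec) * (1 - prevalence))
npv = (spec * (1 - prevalence)) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
print(f"PPV = {ppv:.2f}, NPV = {npv:.3f}")
# At 3% prevalence: PPV is about 0.11 and NPV about 0.995. Most positive flags
# are false alarms, while a negative call is strongly reassuring, which is why
# the intended use (rule-out triage vs. definitive diagnosis) matters so much.
```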

A crucial caution: when AI is wrong, clinicians can be pulled off target

A multicenter randomized vignette study showed that systematically biased AI reduced clinician diagnostic accuracy by 11.3 percentage points; model explanations did not mitigate the harm. This is an applied example of automation/confirmation bias under real diagnostic reasoning tasks—an effect to be anticipated and engineered against in clinical workflows. JAMA study on AI impact and bias.

Additional experimental work demonstrates that inaccurate AI reduces accuracy, and that mitigation strategies (e.g., selective prediction/abstention, confidence thresholds) can reduce but not eliminate risk—and may shift error profiles (e.g., fewer false positives, more false negatives). These are design trade-offs, not silver bullets.
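As a concrete illustration of that trade-off, here is a minimal sketch of a confidence-threshold abstention policy. The cutoff, abstention band, and case data are hypothetical; the point is only that widening the band removes some confident errors at the cost of deferring more cases back to the clinician:

```python
# Illustrative sketch: how a confidence threshold for abstention shifts the
# error profile of an AI triage model. All numbers are synthetic.
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int = 0
    fp: int = 0
    tn: int = 0
    fn: int = 0
    abstained: int = 0

def score_with_abstention(probs, labels, positive_cutoff=0.5, abstain_band=(0.35, 0.65)):
    """Call a case positive/negative only when the model is confident;
    otherwise abstain and defer the case to the clinician."""
    c = Counts()
    low, high = abstain_band
    for p, y in zip(probs, labels):
        if low < p < high:
            c.abstained += 1          # deferred: neither an AI hit nor an AI miss
            continue
        pred = p >= positive_cutoff
        if pred and y:
            c.tp += 1
        elif pred and not y:
            c.fp += 1
        elif not pred and y:
            c.fn += 1
        else:
            c.tn += 1
    return c

# Synthetic example: the abstention band trades confident false positives
# for more deferred cases (and potentially more missed positives overall).
probs  = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10, 0.70]
labels = [1,    1,    0,    1,    1,    0,    0,    0]
print(score_with_abstention(probs, labels, abstain_band=(0.0, 0.0)))    # no abstention
print(score_with_abstention(probs, labels, abstain_band=(0.35, 0.65)))  # with abstention
```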

Translating this into practice: five physician-facing principles

1. Match the metric to the task and the comparator

Demand head-to-head, same-case comparisons with clearly specified ground truth and clinically meaningful endpoints (e.g., sensitivity by acuity strata, net benefit, time-to-correct-diagnosis), not just overall accuracy. Avoid mixing retrospective human chart-review rates with prospective AI benchmark tests.
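A minimal sketch of what such a matched comparison can report, assuming the same adjudicated cases are read by both the model and the clinician. The reader calls and the 10% threshold probability are illustrative; "net benefit" here follows the standard decision-curve definition, net benefit = TP/n − (FP/n) × pt/(1 − pt):

```python
# Illustrative sketch of a head-to-head, same-case comparison.
# Case outcomes and reader calls are synthetic.

def confusion(preds, truth):
    tp = sum(p and t for p, t in zip(preds, truth))
    fp = sum(p and not t for p, t in zip(preds, truth))
    fn = sum((not p) and t for p, t in zip(preds, truth))
    tn = sum((not p) and (not t) for p, t in zip(preds, truth))
    return tp, fp, fn, tn

def summarize(name, preds, truth, threshold_prob=0.10):
    tp, fp, fn, tn = confusion(preds, truth)
    n = len(truth)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    net_benefit = tp / n - (fp / n) * (threshold_prob / (1 - threshold_prob))
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f} net benefit={net_benefit:.3f}")

# Same cases, same adjudicated ground truth, two readers.
truth           = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
ai_calls        = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
clinician_calls = [1, 0, 1, 0, 0, 0, 1, 1, 0, 0]

summarize("AI", ai_calls, truth)
summarize("Clinician", clinician_calls, truth)
```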

2. Require calibration and confidence transparency

A well-calibrated system that knows when it doesn’t know is safer to work with. Require display of uncertainty, out-of-distribution warnings, and abstention options, especially in high-stakes, low-prevalence contexts.
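One common way to check this is a binned calibration (reliability) summary. The sketch below computes an expected-calibration-error style gap between stated probability and observed event rate; the bin count and data are synthetic stand-ins:

```python
# Illustrative sketch: binned expected calibration error (ECE).
# In a well-calibrated model, the predicted probability in each bin should
# match the observed event rate; large gaps in high-confidence bins are a red flag.

def expected_calibration_error(probs, labels, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        event_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(avg_conf - event_rate)
    return ece

probs  = [0.95, 0.90, 0.85, 0.70, 0.65, 0.40, 0.30, 0.20, 0.15, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]
print(f"ECE = {expected_calibration_error(probs, labels):.3f}")
```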

3. Use AI where it measurably helps the intended user

The evidentiary pattern suggests larger gains for non-specialists in narrow tasks (e.g., dermatology triage, image pre-screening), with smaller or mixed gains for experts. Align deployments with the local skill mix, case mix, and throughput constraints.

4. Engineer for human factors to reduce overreliance

Expect over-acceptance of AI suggestions under time pressure. Countermeasures include:

Contrastive rationales and structured differentials (why A over B? what would make B more likely?)

Forced consideration of alternative diagnoses for high-stakes presentations

UI nudges that separate suggestion exposure from final sign-off

Audit-and-feedback at the clinician and unit level tied to safety metrics, not adoption rates

5. Keep accountability clinical and institutional, not algorithmic

AI is a tool. A magnificent tool. Accountability lives with the licensed clinician and the deploying institution. Governance should include local performance monitoring, equity audits, rollback plans, and clear criteria for indications/contraindications.

What ethical communication looks like in physician forums

When sharing “AI error rates,” include the human comparator under matched conditions.

Report uncertainty: confidence intervals, calibration, and subgroup performance, especially by demographics and comorbidity (a brief worked example follows this list).

Avoid generalizing beyond task boundaries: “AI helps in derm imaging triage” does not imply “AI is safe for broad diagnostic substitution.”

Disclose conflicts of interest and data provenance. Who labeled the data? How was ground truth adjudicated? What was the prevalence?
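As a concrete form of the "report uncertainty" point above, a single reported sensitivity can travel with a Wilson score interval; the counts below are hypothetical:

```python
# Illustrative sketch: a 95% Wilson score interval for a reported sensitivity,
# so an "AI caught 88% of cases" claim travels with its uncertainty.
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g., the AI flagged 44 of 50 confirmed cases in a validation set (hypothetical)
lo, hi = wilson_interval(44, 50)
print(f"sensitivity 0.88, 95% CI ({lo:.2f}, {hi:.2f})")
```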

My Growing Position: 

The right comparison for patient safety is not AI versus clinicians—it’s clinicians plus well-designed AI, versus clinicians alone. I have always advocated, and still do, for humanity, insisting that AI is a non-conscious tool to be used for the benefit of all people and this planet. I am Pro-Human and Pro-Medical Field Personnel. Ethics isn't easy. Being human isn't easy. But here we are.

Therefore, ladies and gentlemen of the medical profession, in high-harm, high-volume conditions, the bar should be: does this configuration reduce net diagnostic harm in my setting, with my team, for my patients? If the answer is yes, scale thoughtfully; if not, iterate or decline.

Take Care. 

References

Takita, H., Kabata, D., Walston, S. L., Tatekawa, H., Saito, K., Tsujimoto, Y., Miki, Y., & Ueda, D. (2025). A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digital Medicine, 8, Article 1543. https://doi.org/10.1038/s41746-025-01543-z

Newman-Toker, D. E., Nassery, N., Schaffer, A. C., Yu-Moe, C. W., Clemens, G. D., Wang, Z., Zhu, Y., Saber Tehrani, A. S., Fanai, M., Hassoon, A., & Siegal, D. (2024). Burden of serious harms from diagnostic error in the USA. BMJ Quality & Safety, 33(2), 109–120. https://doi.org/10.1136/bmjqs-2021-014130

Supporting evidence cited in-text

Dermatology AI vs clinicians (systematic review/meta-analysis; pooled sensitivity/specificity; AI vs overall clinicians and experts): A systematic review and meta-analysis of AI vs clinicians for skin cancer classification.

AI assistance improving clinician performance in skin cancer diagnosis: Stanford Medicine–reported findings consistent with sensitivity/specificity gains, especially for non-dermatologists; NICE brief summarizing comparative evaluations in melanoma detection.

Accuracy declines with wrong AI suggestions; overreliance/automation bias:

Randomized multicenter vignette study: biased AI decreased clinician diagnostic accuracy by 11.3 percentage points; explanations did not mitigate harm.

Experimental studies on selective prediction/abstention and error trade-offs; evidence that clinician agreement with AI can increase even when AI is incorrect, indicating overreliance risk.