When AI Doctors "See" What Isn't There: Why Better Accuracy Doesn't Mean Better Vision
RLVR fine-tuning raises accuracy on medical VQA benchmarks while quietly degrading visual grounding: a new counterfactual evaluation framework identifies the gap.
Open Science Community
Research notes, technical essays, and personal stories from the Cohere Labs Community
Research notes, stories, and ideas from people shaping AI together—transparent, collaborative, and community-led.
The people, projects, and conversations that turned a moment of change into a community I could give back to.
A 2,312-prompt, 23-language benchmark for child–AI conversations that evaluates four production models and validates the LLM-as-judge pipeline with five independent judges (Cohen's κ up to 0.71).
What happens to a multilingual model's safety guardrails when you fine-tune it on harmful data and probe it with code-mixed inputs, and why current binary benchmarks can't tell you.
A community lead reflects on three years of learning, research, and building programs inside the Cohere Labs Open Science Community.