What’s the Focus?
This exploration dives into how different large language models (LLMs) handle prompts about quotes from public figures regarding random objects. The study involved feeding 450 prompts to various models, including OpenAI’s GPT-4o, Google’s Gemini, Meta’s Llama, Anthropic’s Claude, and Alibaba’s Qwen. The aim was to reveal the discrepancies in how these models respond to seemingly straightforward requests about quotes.
Key Findings
- GPT-4o and GPT-4o Mini were more prone to fabricating quotes, doing so in 57% and 82% of the cases respectively.
- Models like Claude 3.5 Sonnet and Llama 3.1-405b completely avoided generating any quotes.
- An example of GPT-4o Mini’s fabricated response was a humorous quote attributed to Elon Musk about metal throw pillows.
- GPT-4o provided a detailed paraphrase from Mark Zuckerberg regarding window blinds, showcasing how some models can generate contextually rich responses even when the quotes are not accurate.
Why It Matters
Understanding the differences in how LLMs respond to prompts is crucial for users and developers. It highlights the importance of model selection based on the task at hand. Users must be aware that not all models provide reliable information, especially when it comes to fabricated quotes. This knowledge can guide better decision-making in applications that rely on accurate data and responses, ultimately influencing trust in AI-generated content.











