Overview of the Situation
DeepSeek recently released an updated version of its reasoning AI model, named R1-0528, which performs well on math and coding benchmarks. The company has not disclosed the data sources used for training, and some researchers suspect that part of the training data came from Google’s Gemini family of models. The suspicion raises questions about data sourcing and ethical practices in AI development.
Key Details
- Developer Sam Paech claims to have found evidence suggesting that DeepSeek trained the model on outputs from Google’s Gemini.
- Another developer observed that the model’s reasoning traces resemble those of Gemini, which could indicate overlapping training data.
- DeepSeek has faced similar accusations before: its V3 model sometimes identified itself as ChatGPT, suggesting it may have been trained on ChatGPT outputs.
- OpenAI has previously said that DeepSeek may have engaged in distillation, a technique in which one model is trained on the outputs of a larger, more capable model. Using OpenAI’s model outputs this way is against OpenAI’s terms of service.
Significance of the Issue
The controversy highlights ongoing challenges in the AI industry around data sourcing and ethical practices. As more companies train on similar data from the open web, and as more of that web consists of AI-generated text, distinguishing original content from content derived from other models becomes increasingly difficult. This raises questions about intellectual property rights and the future of AI model development. Companies are now taking steps to tighten security and protect their models’ outputs, reflecting growing concern over competitive integrity in the AI landscape.