
Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"
Quick Answer
The article reports that Microsoft trained its MAI models partly on unlicensed web data such as Common Crawl, contradicting its claim of using only 'clean and commercially licensed data.' Like other AI companies, Microsoft relies on fair use and places the onus on website owners to block its crawlers.
Key Points
- Microsoft's MAI models were trained partly on unlicensed web data such as Common Crawl.
- The company promised 'enterprise grade, clean and commercially licensed data,' a promise its training data does not meet.
- Microsoft's approach shifts the responsibility to site owners to block data crawlers.
- This practice is common among AI labs, raising ethical concerns.
Article Excerpt
From source RSS / original summary
Microsoft sells its LLM training approach as different from other AI companies. It isn't. The company trained its new MAI models partly on unlicensed web data like Common Crawl, despite claiming they used only "clean and commercially licensed data." Like every other AI lab, Microsoft leans on fair use and puts the burden on site owners to block its crawlers.
The article Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data" appeared first on The Decoder.


