Microsoft Pledges Romani-Led AI Data Project and GitHub Metadata Tool in Strasbourg Multilingual Push

Microsoft chose the seat of the European Parliament for its latest multilingual AI offensive, announcing on June 16, 2026 a pair of initiatives that place historically marginalised voices at the centre of the data supply chain. From its Strasbourg engineering centre, the company detailed a Roma-led effort to produce high-quality digital resources for the Romani language alongside a new public metadata dataset drawn from more than 200 million public GitHub repositories, designed to help researchers identify and analyse the natural-language signals that surround code. The twin moves are part of a broader European investment that Microsoft says will turn Strasbourg into a hub for multilingual, community-governed artificial intelligence.

Both projects arrive at a moment when the European Union is hardening its digital infrastructure rules through the AI Act and when technology companies are under acute pressure to demonstrate that their language models serve more than the world’s most widely spoken languages. Romani, an Indo-Aryan language with an estimated 3.5 million speakers scattered across Europe, has long been described as a “low-resource” language, starved of the digitised text and speech corpora needed to train modern natural-language systems. Microsoft’s programme aims to change that by funding and empowering Roma linguists, educators, and community organisations to build parallel corpora, pronunciation lexicons, and eventually speech datasets under open-source licences approved by the community itself.

The Romani initiative is being coordinated through a newly formed advisory board that includes Roma scholars and representatives from six EU member states. Early work will focus on Kalderash and Lovari, both Vlax dialects, but the roadmap extends to other Vlax varieties and to the Balkan group, with a public progress tracker promised for the autumn. In an unusual governance structure, the board holds veto power over any commercial licensing terms, a safeguard intended to keep the data from being sold to surveillance or advertising platforms. Microsoft positions this model as a template for other endangered and minority languages, arguing that community-led data trusts can accelerate inclusion without repeating the extractive patterns of earlier tech-for-good campaigns.

Alongside the Romani corpus, Microsoft’s GitHub arm is releasing a metadata-only dataset covering more than 200 million public repositories. Dubbed the GitHub Multilingual Code Metadata Corpus, it extracts non-code signals (natural-language comments, README files written in different languages, commit messages, issue discussions, and even the locale settings of repository contributors) while stripping all source code and personal information. The company states the dataset is intended to help academic researchers and open-source tool builders map the real-world linguistic landscape of software development, track the adoption of underrepresented languages in technical documentation, and build better code-switching detection models.

Privacy safeguards were a recurring theme in the Strasbourg briefing. Engineers confirmed that the dataset is aggregated at the repository level and is not linked to individual user accounts, and that a differential-privacy layer adds calibrated noise to prevent re-identification of sole contributors. Only repositories with permissive open-source licences were included, and a “do-not-scan” flag will be honoured retroactively if owners opt out after the dataset becomes available. Researchers can access the corpus through Microsoft Research’s Azure-hosted sandbox, with a lightweight subset available for local download.
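
For readers unfamiliar with the idea of "calibrated noise", the sketch below shows one textbook way to do it: the Laplace mechanism applied to per-language counts. Microsoft did not say which mechanism or privacy budget it uses, so the function name `noisy_counts`, the `epsilon` and `sensitivity` parameters, and the sample figures are all assumptions made for illustration, not a description of the actual pipeline.

```python
import random


def noisy_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return a copy of `counts` with Laplace noise added to each value.

    Hypothetical helper: the noise scale is sensitivity / epsilon, the
    standard calibration for the Laplace mechanism. A smaller epsilon
    means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    rate = 1.0 / scale
    noisy = {}
    for language, count in counts.items():
        # The difference of two independent exponential draws with the
        # same rate is a zero-mean Laplace sample with the given scale.
        noise = rng.expovariate(rate) - rng.expovariate(rate)
        noisy[language] = count + noise
    return noisy


if __name__ == "__main__":
    # Made-up counts of comments per language in a single repository.
    example = {"en": 120, "hu": 14, "rom": 1}
    print(noisy_counts(example, epsilon=0.5, seed=42))
```

The point of the noise is visible in the last entry: a lone Romani-language comment is exactly the kind of value that could identify a sole contributor, and after perturbation an observer cannot tell whether the true count was zero, one, or a handful. A production system would also need to track the cumulative privacy budget across releases and guard against floating-point side channels, which this sketch does not attempt.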

The Strasbourg hub will also house a newly funded doctoral programme in multilingual AI, co-supervised by the University of Strasbourg and Microsoft Research. Ten fellowships have been ring-fenced for scholars from Council of Europe minority-language communities, with the first cohort expected in September 2027. A parallel residency programme for open-source maintainers aims to improve the language coverage of tools such as GitHub Copilot, which already supports dozens of spoken languages in its natural-language interface but has historically struggled with code comments written in languages beyond English, Chinese, and Spanish.

Industry observers note that the announcement aligns with Microsoft’s broader campaign to position itself as a trusted AI infrastructure provider for European institutions. The company recently pledged €3.2 billion for European data-centre capacity by 2027, and Strasbourg—already home to a Microsoft cloud region—gives it a physical foothold inside the EU’s juridical perimeter. Pairing a sensitive cultural project like the Romani corpus with a developer-tools dataset may also help inoculate the company against criticism that its AI investments simply reinforce the dominance of English-language models while paying lip service to linguistic diversity.

The Romani language undertaking, however, immediately drew both praise and scepticism from advocacy groups. The European Roma Rights Centre issued a statement welcoming the investment but cautioning that “data sovereignty means more than consent checkboxes; it requires long-term, legally binding co-ownership agreements.” Microsoft officials confirmed they are exploring a community-data trust model inspired by the Māori data-governance framework in New Zealand and said a draft charter will be published for public consultation within 90 days. Independent linguists pointed out that Romani’s dialectal fragmentation and its heavy use of loanwords from contact languages such as Hungarian, Romanian, and Greek will test even state-of-the-art tokenisers, making the project a formidable technical benchmark for multilingual model training.

On the GitHub metadata side, early reactions from the open-source community were mixed. Several prominent maintainers expressed unease on social platforms about the lack of an explicit opt-in process, even with the differential-privacy guarantees. GitHub responded by posting an FAQ that emphasises the non-commercial, research-only licence attached to the dataset and its exclusion of any repository content that can be directly linked to a person. The company is also convening a review panel of open-source foundation representatives before the official release in August.

From a technical perspective, the metadata corpus is being positioned as a complement to existing efforts such as the Software Heritage archive and the World of Code dataset. By layering linguistic signals on top of repository structures, Microsoft hopes to enable new kinds of research: detecting which non-English languages are growing fastest in developer communities, understanding how multilingual teams produce documentation, and even surfacing potential security vulnerabilities hidden in non-English error messages that traditional static-analysis tools overlook. The company has committed to releasing annual snapshots through 2030.
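
The "growing fastest" question could, in principle, be answered by comparing two annual snapshots. The sketch below assumes each snapshot has already been reduced to a mapping from language code to repository count. That format, the function name `fastest_growing`, the `min_count` threshold, and the numbers are all hypothetical, since the corpus schema has not been published.

```python
def fastest_growing(earlier, later, min_count=100):
    """Rank languages by relative growth between two snapshots.

    Languages with fewer than `min_count` repositories in the earlier
    snapshot are skipped, because tiny baselines produce huge but
    unreliable growth percentages.
    """
    growth = {}
    for language, new_total in later.items():
        old_total = earlier.get(language, 0)
        if old_total >= min_count:
            growth[language] = (new_total - old_total) / old_total
    # Highest relative growth first.
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative placeholder counts, not real data.
    snapshot_a = {"pt": 1000, "hi": 200, "rom": 50}
    snapshot_b = {"pt": 1100, "hi": 300, "rom": 80}
    print(fastest_growing(snapshot_a, snapshot_b))
    # prints [('hi', 0.5), ('pt', 0.1)]
```

The threshold matters for exactly the languages this initiative targets: a low-resource language may show dramatic percentage growth from a very small base, so a researcher would likely report absolute counts alongside any ranking like this one.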

In Strasbourg, the political symbolism was unmistakable. Microsoft’s president of European government affairs held the briefing inside the Lieu d’Europe, a stone’s throw from the Parliament, flanked by the city’s mayor and the Council of Europe’s commissioner for minority rights. The choice of venue underscored a message that AI regulation and inclusion are not at odds but can be architected together when data governance is handed to communities. Whether the follow-through matches the rhetoric will depend on the details of the charters, the longevity of the funding, and the willingness of researchers to build on these datasets rather than treat them as one-off experiments.

For Windows enthusiasts and developers, the immediate takeaway is that Microsoft’s tools ecosystem may soon speak—and understand—a far wider array of languages. Improvements to GitHub Copilot’s multilingual prompting, better localised error messages in Visual Studio, and community-driven language packs for Windows are all plausible downstream effects. The Romani data initiative, while niche in terms of speaker numbers, sets a precedent for how Windows and Office localisation could eventually incorporate languages that lack a commercial market, provided the data is built on community terms. As the AI Act’s transparency obligations loom, the Strasbourg projects may also serve as a blueprint for how tech giants can document training-data provenance without sacrificing the scale that modern models demand.

The June 16 announcement made it clear that Microsoft is betting on bottom-up data partnerships rather than top-down scraping. Whether that bet pays off will be tested not in keynote speeches but in the upload of the first Romani speech samples, the first pull request flagged by a linting tool trained on Hindi-Romani code-switched comments, and the first research paper that uses the metadata corpus to quantify linguistic diversity—or its absence—in the open-source supply chain. For now, the message from Strasbourg is that multilingual AI needs multilingual data governors, and those governors should look a lot more like the communities they represent.