Beyond the source code: the hidden licensing crisis in open AI

On March 19, 2026, Cursor launched Composer 2 to over one million daily active users. The AI coding company, valued at close to fifty billion dollars, promoted the model as offering “frontier-level coding intelligence.” It did not mention that the model was built on top of Kimi K2.5, developed by the Chinese AI company Moonshot AI. Within hours, an independent developer found the connection buried in Cursor’s API responses. Cursor’s co-founder acknowledged it was a mistake not to disclose the base model. Moonshot AI was gracious, calling the arrangement “an authorized commercial partnership” and expressing support for the open model ecosystem. The technical controversy made headlines. The licensing question beneath it did not.

Kimi K2.5 is what the AI industry calls an “open-weight” model: its trained parameters, the numerical values that define how it behaves, are publicly released for anyone to download, adapt, and build upon. It is released under a modified MIT licence with attribution requirements for large-scale commercial use. Whether Cursor satisfied those requirements is a factual and legal question this article does not attempt to answer, and the public record suggests the parties resolved the matter between themselves. But the episode exposes something that goes well beyond one company’s disclosure practices. Cursor accessed Kimi K2.5 through Fireworks AI, a third-party “inference provider”: a company that hosts AI models on its own servers and makes them available to others through a cloud-based interface. The developer using the model through such a service never downloads it, never inspects its components, and never encounters the licence file that accompanies it. Cursor’s million users, in turn, had no visibility into which model was generating their code. The licence conditions existed at every stage of this chain. The delivery architecture carried none of them forward.

This pattern is not unique to Cursor. It is the default condition of the open AI ecosystem, and it raises a question that, to the best of the author’s knowledge, the existing literature has not addressed: can open-model licence obligations continue to perform their intended function when model access runs through cloud-based intermediaries that preserve legal continuity in theory but eliminate licensing visibility in practice? The argument in this article is that openness in AI can be diluted not by relicensing or by outright non-compliance, but by the delivery architecture itself. Three compounding gaps, in documentation, in legal definitions, and in the delivery infrastructure, interact to produce a licensing system that is structurally unable to do what it was designed to do. But the open-source licensing tradition has faced a version of this problem before, and it solved it. This article examines whether the same structural logic can close the gap again.

Executive summary

Three compounding gaps undermine the practical enforceability of open-model licence obligations, and their interaction is more significant than any one of them alone.

The first is a documentation gap. A February 2026 audit by Jewitt, Rajbahadur, Li, Adams, and Hassan found that 96.5% of datasets and 95.8% of models labelled as “permissive” on the leading AI platform Hugging Face lacked the licence text that those same permissive licences require. A compliant model notice survived into the final application in only 5.75% of cases. Where the licence text is absent, a downstream user cannot demonstrate that they hold a valid licence grant for their use, leaving them exposed to a rightsholder assertion of full copyright protection. Put more simply, a metadata label on a platform is not a licence grant.

The second is a definitional gap. Open-source licences were written to govern software code. Their foundational concepts, “source code,” “derivative work,” “distribution,” assume artefacts that humans can read and inspect. AI model weights (the trained numerical values that define how a model behaves) are none of these things. The result, as demonstrated by Duan, Zhao, Jiang, Shadbolt, and He in their 2026 analysis at the ACM Web Conference, is that the GPL’s copyleft mechanism, the most protective instrument in the open-source licensing arsenal, encounters fundamental ambiguities when applied to model artefacts. Under one plausible reading, a licensee could close off public access to a GPL-licensed model without clearly violating any licence term.

The third gap, and the focus of this article, is an intermediary gap. When AI models are accessed through cloud-based inference providers, the downstream user never receives the model itself. They interact with an internet endpoint, not a software package. The licence conditions, the attribution requirements, and the provenance chain all remain upstream, on the provider’s infrastructure, and do not reach the downstream user as part of the default delivery experience. The information is not destroyed. It is simply not transmitted. The Cursor/Kimi K2.5 episode is illustrative: a billion-dollar product built on an open model, served to a million users, and the upstream model’s identity itself was not publicly known until an outside developer independently discovered it.

Organisations accessing open models through cloud services cannot treat a platform licence label as legal verification. Provenance tracking should be built into AI procurement with the same rigour that software supply chain management demands. And if cloud-mediated delivery continues to become the standard mode of accessing AI models, meaningful compliance will likely require something that does not yet exist as an industry norm: affirmative disclosure practices at the service layer, identifying the upstream model, its licence, and its attribution conditions for every downstream user.

I. What the licence labels are actually worth

The assumption behind the open AI ecosystem is straightforward: when a model is published with a licence label on a platform like Hugging Face, that label reflects the legal reality. A developer who sees “Apache-2.0” on a model page reasonably assumes the model can be freely used, modified, and redistributed under the terms of that licence. For the overwhelming majority of models on the platform, that assumption is empirically unsupported.

Jewitt, Rajbahadur, Li, Adams, and Hassan coined the term “permissive washing” to describe this phenomenon: “labelling an AI artefact as free to use while omitting the legal documentation that would make that freedom actionable”.

Permissive licences such as MIT, Apache-2.0, and BSD-3-Clause are not unconditional. They are conditional grants that require, at minimum, the inclusion of the full licence text, a copyright notice, and preservation of upstream attribution when the work is redistributed. When these conditions are not met, the downstream user cannot point to the licence as their legal authorisation for the use. The default legal position, if there is no valid licence, is that copyright remains with the original creator and no permission has been granted for the uses in question. Other legal arguments may still be available in some cases, depending on the facts and the jurisdiction, such as implied licence, estoppel, or fair dealing. But where the conditions of the permissive licence have not been met, that licence cannot itself serve as the legal basis for the use.

The scale of the problem is not marginal. The Jewitt et al. audit examined 124,278 supply chains spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. Of the datasets labelled as permissive, 96.5% lacked the required licence text. For models, the figure was 95.8%. When tested for both conditions together, licence text and copyright notice, only 2.3% of datasets and 3.2% of models complied. Even when the upstream artefact did include complete documentation, the information rarely survived the journey downstream: only 27.59% of models preserved compliant dataset notices, and only 5.75% of applications preserved compliant model notices. The licensing chain does not degrade gracefully. It collapses.

A separate audit by Stalnaker, Wintersgill, Chaparro, Heymann, Di Penta, German, and Poshyvanyk, analysing 760,460 models and 175,000 datasets directly on Hugging Face, adds a structural insight that transforms the analysis. The compliance gap is not distributed evenly across platforms. On GitHub, where the convention is to include a LICENSE file in the root of a code repository, application-level compliance reached 74.2%. On Hugging Face, where the primary convention is a metadata tag on a model page, compliance stood in single digits. The platform’s design shapes the community’s behaviour. Hugging Face prioritises metadata and visual model cards over traditional file structures, and developers have responded by publishing models with a licence tag but without the actual licence file the tag is supposed to represent. The infrastructure does not facilitate compliance. It facilitates only the distribution.

The Data Provenance Initiative, a multi-disciplinary collaboration between legal and machine learning experts published in _Nature Machine Intelligence_ in 2024, reached the same conclusion at the dataset level. Their audit of over 1,800 text datasets found licence omission rates exceeding 70% and error rates exceeding 50% on major hosting platforms. The researchers observed that practitioners often rely on proxies for legal and ethical risk, including the creator’s identity, the dataset’s source, the lineage of licences, and the fact that other well-known developers have already adopted the dataset, precisely because the formal licensing infrastructure does not provide reliable information. The community has built workarounds for a system that does not work.

The takeaway is not that individual developers are negligent. It is that the ecosystem’s architecture produces non-compliance as a systemic outcome.

II. Where software licences meet model weights

If the documentation gap were the only problem, it would be serious but manageable: better platform design and stronger publishing practices could, in principle, close it. The deeper difficulty is that even where a developer locates the licence file and reads its terms, those terms were written for a different kind of artefact, and the fit between the licence and the model is uncertain in ways that create real commercial risk.

Open-source software licensing rests on a set of concepts that assume software artefacts. “Source code” means the human-readable form of a program. A “derivative work” is a modified version. “Distribution” is the act of providing a copy to someone else. These definitions work because the artefact is inspectable: a developer who receives a software package can read the source, identify its components, and determine which licence terms apply. The open-source licensing system, from the most permissive licence to the most protective, depends on this inspectability.

AI model weights are not inspectable in the same way. They are vast arrays of numbers, the product of training a neural network on data. A human examining the raw values cannot determine what the model does, what data shaped it, or which other models contributed to its creation. This does not mean that software licences have no legal effect when applied to model weights. They may well have effect, and courts have not ruled otherwise. But it does mean that the mechanism through which these licences have operated for three decades, the ability of the recipient to inspect the artefact and trace its licensing obligations, does not function as expected in the AI context. The fit is awkward, and the ambiguities are commercially significant.

Consider copyleft licences like the GPL, the licence family most associated with the principle that open software should remain open. The GPL’s core mechanism requires that anyone who distributes a derivative work must make the “source code” available under the same licence terms. In the software world, this is the most powerful tool in the open-source arsenal: it ensures that modifications remain publicly accessible. But in the AI context, this mechanism encounters questions that the licence text does not answer. What is the “source code” of a model? Is it the weights? The training code? The training data? The architecture definition? The GPL does not say, because the GPL was not written with these artefacts in mind.

Duan, Zhao, Jiang, Shadbolt, and He, in their analysis presented at the ACM Web Conference 2026, argue that these ambiguities are not merely theoretical. Through a generalized licence-analysis framework, they show that, under at least one plausible reading of the GPL in the context of model publishing, a licensee could restrict public access to a model released under the GPL without necessarily triggering the licence’s expected copyleft consequences. The difficulty lies in the GPL’s reliance on concepts such as “distribution” and “source code,” neither of which maps cleanly onto the operations that define the AI model lifecycle: fine-tuning, distillation, model merging, or serving the model through a cloud API. The point should not be overstated. The authors are not establishing settled law, and the question has not been tested in court. Other readings of the GPL are possible. But the existence of a credible interpretation under which the licence’s most protective mechanism may fail to operate as expected is itself a significant commercial risk.

For permissive licences, the ambiguity is subtler. MIT and Apache-2.0 require attribution and licence preservation upon redistribution. But what constitutes “redistribution” when a model is fine-tuned, compressed into a more efficient numerical format, and republished under a new name on a different platform? A separate study by Jewitt et al. documented a “gravitational pull” toward permissive licensing across the AI supply chain, with 35.5% of model-to-application transitions eliminating restrictive upstream clauses entirely. This may not reflect deliberate infringement. It may reflect genuine uncertainty about what “redistribute” means when the artefact is not a software package.

The Open Source Initiative recognised this gap. Its Open Source AI Definition, Version 1.0 (‘OSAID’), published in October 2024, attempts to define what “open source” should mean for AI by requiring three distinct components:

sufficiently detailed information about the training data;
the complete source code used to train and run the system; and
the model parameters themselves, such as weights or other configuration settings, each made available under OSI-approved terms.

The analytical move is significant. OSAID treats these elements not as optional complements, but as the “preferred form to make modifications” to a machine-learning system. In other words, model weights alone, without the data information and code that produced them, are insufficient for the study, modification, and redistribution that open-source principles are meant to secure. At the same time, OSAID remains a definition rather than an operational standard. The text itself acknowledges that the legal mechanism for ensuring the freedom of model parameters may require further clarification over time. Whether OSAID achieves the adoption necessary to shape market practice therefore remains to be seen. In the meantime, the model-specific licences that dominate in practice, including Meta’s Llama Community Licence Agreement, Google’s Gemma Terms of Use, and the OpenRAIL-M family, define AI-specific operations in their own terms and often pull in different directions. “Open” does not yet describe a settled legal category in AI. It describes a contested field.

The problem, in other words, is not only that people fail to comply with licence terms. It is that the terms themselves are uncertain when applied to AI artefacts, and that uncertainty creates risk regardless of intent.

III. How cloud delivery makes the problem structural

The first two sections established that the licence documentation is systemically absent and that the licences themselves fit AI artefacts awkwardly. This section examines what happens when a third element enters: a delivery architecture that changes the relationship between the user and the licensed material.

A. How models reach their users now

In traditional open-source software distribution, a user downloads a package. That package contains the source code, the compiled programme, and the licence file. The licence travels with the artefact because the two are physically bundled. The same act that delivers the software delivers the legal terms. This is how open-source licensing was designed to work.

In cloud-based AI inference, this coupling weakens. The user does not download the model. The user sends a request to an internet-connected endpoint and receives a response: a generated text, an image, a code suggestion. The model’s weights, its licence, and its upstream provenance remain on the provider’s servers. The user interacts with a service. Not a package.

The architecture has layers, and each one increases the distance between the end user and the model’s legal terms. At the base, model publishers host open-weight models on platforms like Hugging Face. Inference providers, companies such as Together AI, Replicate, Fireworks AI, and SambaNova, download these models, deploy them on their own computing infrastructure, and offer them as pay-per-use cloud services. Hugging Face itself operates a routing layer that connects developers to models served by multiple providers through a single interface. On top of this, aggregator platforms such as OpenRouter consolidate dozens of providers behind a single endpoint, allowing a developer to switch between hundreds of models with one line of code. OpenRouter alone processes over 30 trillion tokens per month, serving over five million users.

The Cursor/Kimi K2.5 situation fits this pattern. Cursor integrated an open-weight model into its product and served the outputs to over a million users through a cloud interface. The users had no visibility into which model was generating their code suggestions. The upstream model’s identity was discovered by an outsider, not disclosed as part of the product experience.

B. Where the information asymmetry lives

A critical distinction is necessary here. The licensing information that the downstream user lacks is not destroyed by the inference architecture. It exists. The inference provider knows which model it is serving. The model’s licence is typically visible on its Hugging Face page. A developer who is sufficiently motivated can, in many cases, identify the model, locate its licence, and assess the terms. The problem is not that the information has vanished. It is that the delivery architecture does not transmit the licence by default, and the burden of discovering it falls entirely on the party least equipped to carry it.

This is an information-design problem, not a case of the information disappearing. But it is a consequential one, because open-source licensing was designed for a distribution model where the information flows automatically. The licence’s conditions are, in the software context, effectively self-transmitting: the user receives the package, encounters the licence file, and is on notice of the obligations. The system works not because every user reads the licence but because the licence is there to be read. Its presence in the package creates the opportunity for the legal relationship to function.

Cloud-mediated inference changes the default. A developer building a commercial product on top of outputs generated by an open model served through an inference provider may have licence obligations they will not encounter unless they independently investigate the upstream chain. The information exists, but the delivery system does not carry it. The developer would need to know which model is being served (which may not be disclosed), find that model’s page on a hosting platform, check whether the licence text is present (which, per Section I, it is not in 95.8% of cases), and determine whether the licence terms apply to the kind of use they are making. That is a significant investigation. In traditional software distribution, it is not necessary, because the licence file is in the package.

The legal obligations survive: a licence condition requiring attribution does not cease to apply because the model was accessed through an API rather than downloaded as a file. But the practical conditions for discovering and fulfilling those obligations shift from automatic to effortful. The licensing system still functions in theory. In practice, it requires a degree of diligence that the architecture does not prompt and that the current ecosystem does not support.

C. The compounding effect

The intermediary gap does not stand alone. It is the final stage in a supply chain where licensing information degrades at every link, and the cumulative picture is what makes the problem structural.

Consider the full chain as the empirical evidence reveals it.

A dataset is published on a hosting platform, where, according to the Data Provenance Initiative, the licence is omitted more than 70% of the time.
That dataset is used to train or fine-tune a model, which is published on Hugging Face under a licence tag but, according to Jewitt et al., without the required licence text in 95.8% of cases.
The model is served by an inference provider, whose cloud interface returns outputs but surfaces no licensing information as part of the standard developer experience.
A downstream developer integrates those outputs into a product that preserves the upstream model’s licence notice only 5.75% of the time.
An aggregator like OpenRouter may add another layer between the developer and the model.

At each link, the licensing information is not merely thinned. It is structurally separated from the delivery path. The dataset loses its licence on the hosting platform. The model loses its licence text when published. The inference provider removes the artefact from the equation, leaving only a cloud response. By the time the response reaches the end user, the licensing information is not attached to anything the user can see.

The technical capacity to trace model lineage is, it should be noted, developing. Nikolić, Baluta, and Saxena, in work presented at NeurIPS 2025, built a framework for testing whether one model is derived from another using only the kind of access that a cloud interface provides, achieving 90 to 95% precision across benchmarks spanning over 600 models. This demonstrates that provenance detection is technically feasible. But the AI ecosystem has not yet built the infrastructure to deploy this capability at scale, and the cloud delivery model does not incorporate it.

D. What compliance would actually require

If cloud-based inference continues on its current trajectory, and every market signal suggests it will, then open-model compliance cannot rest on the assumption that the downstream user will encounter the licence through the normal course of using the model. The delivery system does not work that way.

The licensing system was designed for a world where the licence travels with the artefact, like a deed that accompanies a property when ownership changes hands. In a world where the artefact stays on someone else’s infrastructure and only the outputs travel, the licensing system needs a new delivery mechanism. The legal obligations remain valid. The architecture does not provide a means of fulfilling them as part of the standard workflow.

The implication is that meaningful compliance at the cloud inference layer would require affirmative disclosure: the provider’s documentation, API metadata, or terms of service would need to identify the upstream model, its applicable licence, and any attribution conditions. To the best of the author’s knowledge, no major inference provider currently does this as a standard element of the API experience or developer documentation. Terms of service for these platforms address data privacy, uptime commitments, and billing. Upstream model licensing is, at present, not part of the standard offering. This may change as the ecosystem matures and as regulatory pressure increases, but as of this writing, the gap remains open.

IV. What this means in practice

The analysis in Sections I through III identifies a structural problem: the licensing system was designed for software packages, and the AI ecosystem has moved to a delivery model where the package never reaches the user. The practical question is what follows from this diagnosis. Some of what follows is defensive: steps that organisations can take now to manage risk within the existing framework. But the more significant implication is forward-looking, and it concerns the licensing framework itself.

A. The precedent that already exists

The open-source licensing community has faced a version of this problem before.

When web-based software services began replacing locally installed applications in the early 2000s, companies discovered they could take GPL-licensed code, modify it, run it as a hosted service, and keep their modifications entirely proprietary. The GPL’s copyleft triggers on “distribution,” and serving software over a network was not distribution: the users never received a copy. The licence’s most important obligation was structurally bypassed, not by violating the licence, but by changing the delivery model. The Free Software Foundation’s response was the GNU Affero General Public License, the AGPL, which added a single clause: if users interact with modified software over a network, the operator must make the source code available to them. The delivery model had changed. The licence followed.

The AI model ecosystem has the same gap, and it has not followed.

Open-model licences, whether copyleft or permissive, were designed for artefacts that get downloaded as packages. The inference-provider model serves them over a network without distributing them. The GPL’s copyleft could not reach software served as a cloud service. Today’s open-model licences cannot reach models served through inference APIs. The structural logic is identical. The AGPL closed the gap for software twenty years ago. For AI models, the gap remains open.

What the ecosystem needs is not another new licence. The landscape is already fragmented to the point of incoherence. What it needs is the same kind of intervention the AGPL represented: an extension of existing licensing logic to the delivery model that current licences do not reach. Specifically, a transparency obligation that triggers when an open model is made available through a cloud-based inference service, requiring the provider to disclose three things to the downstream user: the identity of the upstream model, the applicable licence, and any attribution conditions that the licence imposes.

This obligation would apply not at the moment a model is published on a platform, where transparency requirements already exist and already fail, but at the moment the model is served to a downstream user through a cloud API. It would not rewrite existing licences. It would extend them to the layer where, as this article has documented, the licensing information currently vanishes from the delivery path. A model published under Apache-2.0 stays under Apache-2.0. The obligation ensures that the developer accessing it through a cloud API can actually discover that fact.

The Open Source Initiative is well positioned to advance this. OSAID 1.0 defines what “open source” means for AI, requiring data information, code, and parameters under approved terms. But OSAID does not address what happens when those components exist and the delivery architecture does not transmit them. A future iteration of the definition, or a companion standard, could establish that an AI system cannot meaningfully be called “open” if the downstream user has no practical way to discover what model they are using, under what licence, and with what obligations. The principle is not new. The AGPL established it for software served over networks. The same principle, applied to models served through inference APIs, would close the most consequential gap this article identifies.

Whether this takes the form of a licence clause, an amendment to existing frameworks, or a platform-level standard is a question for the standards bodies and the licence authors to resolve. The argument this article advances is that the gap exists, that it is structural, and that the open-source licensing tradition already contains the precedent for closing it.

B. What organisations and publishers should do now

While the licensing framework catches up, the structural gap identified in this article creates immediate practical exposure that organisations can and should address.

For organisations deploying open models through cloud-based inference, the most immediate step is recognising that a licence tag on a model platform is not a legal opinion. Before deploying any open-weight model in a commercial context, whether accessed directly or through a provider, the licence status should be verified at the artefact level: does the model repository contain the full licence text, a copyright notice, and attribution information for its upstream components? Where models are accessed through inference providers, the organisation should require, as a contractual term, that the provider disclose the model identity, the applicable licence, and any attribution requirements. This is the AI equivalent of requiring a software vendor to identify its open-source dependencies, a practice that has been standard in software procurement for over a decade.
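The artefact-level check described above can be partly automated. What follows is a minimal sketch in Python, assuming the organisation holds a local copy of the model repository. The file-name lists and the simple "copyright" text match are illustrative assumptions, not an established standard, and a passing result is not a legal conclusion: it only shows that the documents a reviewer would need are present.

```python
from pathlib import Path

# Assumed, illustrative file-name conventions (compared case-insensitively).
LICENCE_NAMES = {"license", "license.txt", "license.md",
                 "licence", "licence.txt", "licence.md", "copying"}
NOTICE_NAMES = {"notice", "notice.txt", "notice.md"}


def audit_model_repo(repo_dir):
    """Report whether a local model repository carries basic licence documents.

    This checks for the presence of documents only; it does not interpret
    licence terms or confirm that a metadata tag matches the licence text.
    """
    files = [p for p in Path(repo_dir).iterdir() if p.is_file()]
    licence_files = [p for p in files if p.name.lower() in LICENCE_NAMES]
    texts = [p.read_text(encoding="utf-8", errors="replace") for p in licence_files]
    return {
        "licence_file_present": bool(licence_files),
        "licence_text_nonempty": any(t.strip() for t in texts),
        # Naive heuristic: looks for the word "copyright" in the licence text.
        "copyright_notice_found": any("copyright" in t.lower() for t in texts),
        "notice_file_present": any(p.name.lower() in NOTICE_NAMES for p in files),
    }
```

A repository that carries only a licence tag in its model card would return False for every check, which is the situation the audits in Section I describe.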

For inference providers, the current practice of serving open-weight models without surfacing upstream licence conditions creates an asymmetry that the downstream user cannot resolve alone. Implementing a disclosure mechanism, whether as an API header, a documentation page, or a metadata endpoint identifying the upstream model and its licence, would be a technically modest step with significant practical value. It would not impose compliance obligations on the provider. It would make compliance possible for the developer.
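To illustrate how modest such a mechanism could be, here is a sketch in Python of a disclosure record rendered as HTTP response headers, with a matching check a developer could run on the client side. The header names (X-Upstream-Model, X-Upstream-License, X-Upstream-Attribution) and the example values are hypothetical. No provider is known to use them, and no standard defines them.

```python
from dataclasses import dataclass

# Hypothetical header names, invented for this sketch.
REQUIRED_HEADERS = ("X-Upstream-Model", "X-Upstream-License", "X-Upstream-Attribution")


@dataclass(frozen=True)
class UpstreamDisclosure:
    model_id: str      # identity of the upstream model
    licence: str       # applicable licence, e.g. an SPDX-style identifier
    attribution: str   # attribution conditions the licence imposes


def disclosure_headers(disclosure):
    """Render the three disclosure items as response headers."""
    return {
        "X-Upstream-Model": disclosure.model_id,
        "X-Upstream-License": disclosure.licence,
        "X-Upstream-Attribution": disclosure.attribution,
    }


def missing_disclosures(headers):
    """Return the disclosure headers that are absent or empty in a response."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return [name for name in REQUIRED_HEADERS
            if not str(lowered.get(name.lower(), "")).strip()]


# Example with placeholder values (not a real model or provider).
example = UpstreamDisclosure(
    model_id="example-org/example-model",
    licence="Apache-2.0",
    attribution="Retain the upstream NOTICE file",
)
headers = disclosure_headers(example)
```

Headers are only one of the options named above. The same three fields could equally be served from a documentation page or a metadata endpoint, and what matters is that they reach the developer as part of the default experience.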

For model publishers, the evidence from the Jewitt et al. audit demonstrates that compliance begins at publication. A licence tag without a licence file is the equivalent of a label on a sealed box that reads “contents: permitted” without any legal instrument inside. Publishers should include the full licence text as a file in the model repository, attach a copyright notice, and document the provenance of the model’s training data and base model with enough specificity that a downstream user can trace the chain of obligations.

Conclusion

The licensing gap in open AI is not a compliance failure. It is a design failure. The documentation is systemically absent. The licences fit their target artefacts uncertainly. And the delivery architecture that is rapidly becoming the default, cloud-based inference, separates the end user from the licensed material in a way that no amount of individual diligence can fully overcome.

The most consequential of these three conditions is the last, because it will not self-correct. The documentation gap is measurable and addressable through better platform conventions. The definitional gap is recognised and will narrow as AI-specific licensing frameworks such as the OSI’s Open Source AI Definition and the SPDX 3.0 standard for AI bills of materials mature. But the intermediary gap is a product of the delivery architecture itself. Better compliance at the publishing stage cannot close a gap that opens at the serving stage.

The open-source licensing tradition contains the precedent for solving this. When software moved from packages to network services, the AGPL extended copyleft obligations to the new delivery model. AI models have undergone the same shift, from downloads to cloud-served inference, and the licensing framework has not followed. The principle is clear: when the delivery model changes, the licence must change with it. A transparency obligation at the point of serving, requiring inference providers to disclose the upstream model, its licence, and its attribution conditions, would apply that principle to the layer where the information currently disappears from the delivery path.

The direction of the law reinforces the urgency. The EU AI Act’s transparency obligations for general-purpose AI models, including training data summaries and documentation for downstream providers, apply from August 2025, with enforcement powers following in August 2026. The regulatory expectation is that provenance information will accompany AI models through the supply chain. The infrastructure to meet that expectation does not yet exist.

The risk that is being underestimated is not licence violation. It is the growing distance between the obligations that attach to open models and the practical ability of the people who use them to know what those obligations are. The open-source community closed this kind of gap once before. It is time to close it again.