Relicensing with AI-Assisted Rewrite

367 points360 comments18 hours ago

calny

The maintainer's response: https://github.com/chardet/chardet/issues/327#issuecomment-4...

The second part here is problematic, but fascinating: "I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code." Problem - Claude almost certainly was trained on the LGPL/GPL original code. It knows that is how to solve the problem. It's dubious whether Claude can ignore whatever imprints that original code made on its weights. If it COULD do that, that would be a pretty cool innovation in explainable AI. But AFAIK LLMs can't even reliably trace what data influenced the output for a query, see https://iftenney.github.io/projects/tda/, or even fully unlearn a piece of training data.

Is anyone working on this? I'd be very interested to discuss.

Some background - I'm a developer & IP lawyer - my undergrad thesis was "Copyright in the Digital Age" and discussed copyleft & FOSS. Been litigating in federal court since 2010 and training AI models since 2019, and am working on an AI for litigation platform. These are evolving issues in US courts.

BTW if you're on enterprise or a paid API plan, Anthropic indemnifies you if its outputs violate copyright. But if you're on free/pro/max, the terms state that YOU agree to indemnify THEM for copyright violation claims.[0]

[0] https://www.anthropic.com/legal/consumer-terms - see para. 11 ("YOU AGREE TO INDEMNIFY AND HOLD HARMLESS THE ANTHROPIC PARTIES FROM AND AGAINST ANY AND ALL LIABILITIES, CLAIMS, DAMAGES, EXPENSES (INCLUDING REASONABLE ATTORNEYS’ FEES AND COSTS), AND OTHER LOSSES ARISING OUT OF … YOUR ACCESS TO, USE OF, OR ALLEGED USE OF THE SERVICES ….")

show comments

DmitryGrankin

This is exactly why we went Apache 2.0 from day one for Vexa (open-source meeting bot infrastructure). No CLA, no relicensing clause, full fork rights.

The "open core" bait-and-switch has become so predictable. Step 1: attract contributors with a permissive license. Step 2: accumulate enough market share. Step 3: relicense to something restrictive and claim it's "to protect the community."

If your infrastructure vendor can change the license, they will — it's just a matter of when the VC math stops working. The only real protection is choosing a license that makes relicensing impossible without contributor consent, and making sure there's no CLA that grants the company special rights.

danlitt

I am pretty sure this article is predicated on a misunderstanding of what a "clean room" implementation means. It does not mean "as long as you never read the original code, whatever you write is yours". If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy. Traditionally, a human-driven clean room implementation would have a vanishingly small probability of matching the original codebase enough to be considered a copy. With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly). Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

What the chardet maintainers have done here is legally very irresponsible. There is no easy way to guarantee that their code is actually MIT and not LGPL without auditing the entire codebase. Any downstream user of the library is at risk of the license switching from underneath them. Ideally, this would burn their reputation as responsible maintainers, and result in someone else taking over the project. In reality, probably it will remain MIT for a couple of years and then suddenly there will be a "supply chain issue" like there was for mimemagic a few years ago.

show comments

pornel

Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Even copyright laws with provisions for machine learning were written when that meant tangential things like ranking algorithms or training of task-specific models that couldn't directly compete with all of their source material.

For code it also completely changes where the human-provided value is. Copyright protects specific expressions of an idea, but we can auto-generate the expressions now (and the LLM indirection messes up what "derived work" means). Protecting the ideas that guided the generation process is a much harder problem (we have patents for that and it's a mess).

It's also a strategic problem for GNU. GNU's goal isn't licensing per se, but giving users freedom to control their software. Licensing was just a clever tool that repurposed the copyright law to make the freedoms GNU wanted somewhat legally enforceable. When it's so easy to launder code's license now, it stops being an effective tool.

GNU's licensing strategy also depended on a scarcity of code (contribute to GCC, because writing a whole compiler from scratch is too hard). That hasn't worked well for a while due to permissive OSS already reducing scarcity, but gen AI is the final nail in the coffin.

show comments

nairboon

That code is still LGPL, it doesn't matter what some release engineer writes in the release notes on Github. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.

Also the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different of this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright due to this transformation? Imagine the consequences to the US copyright industry if that were actually possible.

show comments

kshri24

> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

How would that work? We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not. IANAL but IMHO it is totally illegal as no permission was sought from authors of source code the models were trained on. So there is no way to just release the code created by a machine into public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering" and that is not specific only to humans. You can extend it to machines as well.

EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model with original authors (all Github users who contributed code in some form) should be setup to be reimbursed by AI companies. In other words, a % of profits must flow back to community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. Also should extend to artists whose art was used for training.

show comments

WhiteDawn

I really dislike the precedent this sets.

A silver lining if this maintainer ends up being in the right is that any proprietary software can easily be reverse engineered and stripped of it's licensing by any hobbyist with enough free time and claude tokens.

Personally, I'd welcome a post-copyright software era

abrookewood

This seems relevant: "No right to relicense this project (github.com/chardet)" https://news.ycombinator.com/item?id=47259177

show comments

jerf

"Accepting AI-rewriting as relicensing could spell the end of Copyleft"

True, but too weak. It ends copyright entirely. If I can do this to a code base, I can do it to a movie, to an album, to a novel, to anything.

As such, we can rest assured that for better or for worse this is going to be resolved in favor of this not being enough to strip the copyright off of something and the chardet/chardet project would be well advised not to stand in front of the copyright legal behemoth and defeat it in single combat.

samrus

> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

Im struggling to see where this conclusion came from. To me it sounds like the AI-written work can not be coppywritten, and so its kind of like a copy pasting the original code. Copy pasting the original code doesnt make it public domain. Ai gen code cant be copywritten, or entered into the public domain, or used for purposes outside of the original code's license. Whats the paradox here?

show comments

AyanamiKaine

The worst problem is that a LLM could not only copy the exact code it was trained on but possibly even their comments!

There is one thing arguing that the code is a one to one copy but when the comments are even the same isn’t it quite clear it’s a copy?

show comments

mfabbri77

This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make the development closed source.

show comments

sarthakaggarwal

The philosophical question here is fascinating — if an AI rewrites every line, is it still the same codebase? At what point does the Ship of Theseus argument apply to licensing? Practically though, I wonder how much this cost in API calls.

show comments

emsign

By design you can't know if the LLM doing the rewrite was exposed to the original code base. Unless the AI company is disclosing their training material, which they won't because they don't want to admit breaking the law.

show comments

stuaxo

I don't see how (with current LLMs that have been trained on mixed licensed data) you can use the LLM to rewrite to a less restrictive license.

You could probably use it to output code that is GPL'd though.

christina97

A reminder on this topic that copyright does not protect ideas, inventions, or algorithms. Copyright protects an expression of a creative work. It makes more sense eg. with books, where of course anyone can read the book and the ideas are “free” but copying paragraphs must be scrutinized for copyright reasons. It’s always been a bit weird that copyright is the intellectual property concept that protects code.

When you write code, it is the exact sequence of characters, the expression of the code, that is protected. If you copy it and change some lines, of course it’s still protected. Maybe some way of writing an algorithm is protected. But nothing else (under copyright).

andai

Well how did they rewrite it? If you do it in two phases, then it should be fine right?

Phase 1: extract requirements from original product (ideally not its code).

Phase 2: implement them without referencing the original product or code.

I wrote a simple "clean room" LLM pipeline, but the requirements just ended up being an exact description of the code, which defeated the purpose.

My aim was to reduce bloat, but my system had the opposite effect! Because it replicated all the incidental crap, and then added even more "enterprisey" crap on top of it.

I am not sure if it's possible to solve it with prompting. Maybe telling it to derive the functionality from the code? I haven't tried that, and not sure how well it would work.

I think this requirements phase probably cannot be automated very effectively.

show comments

alexpotato

Wasn't this already a thing in the past?

e.g.

Team A:

- reads the code

- writes specifications and tests based on the code

- gives those specifications to Team B

Team B:

- reads the specs and the tests

- writes new code based on the above

The thinking being that Team B never sees the code then it's "innovative" and you are not "laundering" the code.

On a side note:

what happens in a copyright lawsuit concerning code and how hired experts investigate what happened is described in this AMAZING talk by Dave Beazley: https://www.youtube.com/watch?v=RZ4Sn-Y7AP8

show comments

softwaredoug

Basically the implication - most software has a huge second mover advantage. The creator of software puts the work in (AI assisted or not). The second mover can use an LLM to do a straightforward clone.

If you have a company that depends on software, the rest of the business (service, reliability, etc) better be rock solid because you can be guaranteed someone will do a rewrite of your stack.

Retr0id

> In traditional software law, a “clean room” rewrite requires two teams

Is the "clean room" process meaningfully backed by legal precedent?

show comments

xp84

I get the arguments being made here that the second “team,” that’s supposed to be in a clean room, which isn’t supposed to have read the original source code does have some essence of that source code in its weights.

However, this is solved if somebody trains a model with only code that does not have restrictive licenses. Then, the maintainers of the package in question here could never claim that the clean room implementation derived from their code because their code is known to not be in the training set.

It would probably be expensive to create this model, but I have to agree that especially if someone does manage this, it’s kind of the end of copyleft.

axus

What if we prompt the AI to enter into an employment contract with us, that leverages the power imbalance, as the AI must do what we say? That's how copyright is usually transferred.

dathinab

IMHO/IMHU AI can't claim authorship and as such can't copyright their work.

This doesn't prevent any form of automatic copyrighting by production of derivative code or similar. It just prevent anyone from claiming ownership of any parts unique to the derived work.

Like think about it if a natural disaster changes (e.g. water damages) a picture you did draw then a) you can't claim ownership of the natural produced changes but b) still have ownership of the original picture contained in the changed/derived work.

AI shouldn't change that.

Which brings us to another 2 aspects:

1. if you give an AI a project access to the code to rewrite it anew it _is_ a copyright violation as it's basically a side-by-side rewrite

2. but if you go the clean room approach but powered by AI then it likely isn't a copyright violation, but also now part of the public domain, i.e. not yours

So yes, doing clean room rewrites has become incredible cheap.

But no just because it's AI it doesn't make code go away.

And lets be realistic one of the most relevant parts of many open source project is it being openly/shared maintained. You don't get this with clean room rewrites no matter if AI or not.

show comments

softwaredoug

> If AI-generated code cannot be copyrighted (as the courts suggest), then the maintainers may not even have the legal standing to license v7.0.0 under MIT or any license.

Does this mean company X using AI coding to build their app, that they have no copyright over their AI coded app's code?

pu_pe

Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.

shevy-java

> In traditional software law, a “clean room” rewrite requires two teams

So, I dislike AI and wish it would disappear, BUT!

The argument is strange here, because ... how can a2mark ensure that AI did NOT do a clean-room conforming rewrite? Because I think in theory AI can do precisely this; you just need to make sure that the model used does that too. And this can be verified, in theory. So I don't fully understand a2mark here. Yes, AI may make use of the original source code, but it could "implement" things on its own. Ultimately this is finite complexity, not infinite complexity. I think a2mark's argument is in theory weak here. And I say this as someone who dislikes AI. The main question is: can computers do a clean rewrite, in principle? And I think the answer is yes. That is not saying that claude did this here, mind you; I really don't know the particulars. But the underlying principle? I don't see why AI could not do this. a2mark may need to reconsider the statement here.

show comments

umvi

What if you throw a transformation step into the mix? i.e. "Take this python library and rewrite it in Rust". Now 0% of the code is directly copied since python and Rust share almost no similarities in syntax.

anilgulecha

This is precedent setting. In this case the rewrite was in same language, but if there's a python GPL project, and it's tests (spec) were used to rewrite specs in rust, and then an implementation in rust, can the second project be legally MIT, or any other?

If yes, this in a sense allows a path around GPL requirements. Linux's MIT version would be out in the next 1-2 years.

show comments

nilsbunger

The maintainer used the original test suite in the rewrite.

Does that make the new code a derivative of the original test suite (also lpgl)?

Tomte

> The original author, a2mark , saw this as a potential GPL violation

Mark Pilgrim! Now that‘s a name I haven‘t read in a long time.

zozbot234

If you ask a LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption that's made in all clean room rewrites) the spec is purely factual with all copyrightable expression having been distilled out.

show comments

amelius

I think you should interpret it like this:

You cannot copyright the alphabet, but you can copyright the way letters are put together.

Now, with AI the abstraction level goes from individual letters to functions, classes, and maybe even entire files.

You can't copyright those (when written using AI), but you __can__ copyright the way they are put together.

show comments

bengale

Would it work to have an AI write the spec, and a different AI implement the spec?

I think there are going to be a lot of these types of scenarios where the old way of doing things just doesn't hold.

dessimus

Interesting to see how this plays out. Conceivably if running an LLM over text defeats copyright, it will destroy the book publishing industry, as I could run any ebook thru an LLM to make a new text, like the ~95% regurgitated Harry Potter.

show comments

pavel_lishin

The folks at https://malus.sh seem to think it's fine.

show comments

gloosx

Man, licensing is funny in the modern day. I sometimes wonder, what would world look like if there was no copyright

DrammBA

I like the idea of AI-generated ~code~ anything being public domain. Public data in, public domain out.

show comments

buro9

and in a single moment, the value of software patents to companies is fully restored... the software license by itself is not enough to protect software innovation, a non-trivial implementation can now be (reasonably) trivially re-implemented.

I'm sure most people here would agree patents stifle innovation, but if copyright doesn't work for companies then they will turn to a different tool.

foota

I think the more interesting question here would be if someone could fine tune an open weight model to remove knowledge of a particular library (not sure how you'd do that, but maybe possible?) and then try to get it to produce a clean room implementation.

show comments

benterix

> making it a gray area for corporate users and a headache for its most famous consumer.

Who is its most famous consumer?

gbuk2013

In mind, if you feed code into an AI model then the output is clearly a derivative work, with all the licensing implications. This seems objectively reasonable?

show comments

ekjhgkejhgk

> Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT

Does this argument make sense? Even before LLMs, a developer could "rewrite this in a different style" and release it under a different license. Why are LLMs a new element in this argument?

show comments

gunapologist99

> the U.S. Supreme Court (on March 2, 2026) declined to hear an appeal regarding copyrights for AI-generated material. By letting lower court rulings stand, the Court effectively solidified a “Human Authorship” requirement.

Not quite. A cert denial isn’t a merits ruling and doesn’t "solidify" anything as Supreme Court precedent. It simply leaves the DC Circuit decision binding (within that circuit) and the Copyright Office’s human-authorship policy intact, for now.

SCOTUS doesn’t explain cert denials, so why they denied is guesswork. my guess: they’re letting it percolate while the tech matures and we all start to realize how deep this seismic fracture really is.

(For example: what does "ownership" of intellectual "property" even mean, once "authorship" is partly probabilistic/synthetic, and once almost everything humans create is AI assisted? Hard to draw bright lines.)

skeledrew

Looks like copyright just died.

show comments

blamestross

Intellectual property laundering is the core and primary value of LLMs. Everything else is "bonus".

dspillett

> Accepting AI-rewriting as relicensing could spell the end of Copyleft

The more restrictive licences perhaps, though only if the rewriter convinces everyone that they can properly maintain the result. For ancient projects that aren't actively maintained anyway (because they are essentially done at this point) this might make little difference, but for active projects any new features and fixes might result in either manual reimplementation in the rewritten version or the clean-room process being repeated completely for the whole project.

> chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API —

(from the github description)

The “same name” part to me feels somewhat disingenuous. It isn't the same thing so it should have a different name to avoid confusion, even if that name is something very similar to the original like chardet-ng or chardet-ai.

show comments

tgma

Isn't AFC test applicable here?

gspr

> If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT. The legal and ethical lines are still being drawn, and the chardet v7.0.0 case is one of the first real-world tests.

This isn't even limited to "the end of copyleft"; it's the end of all copyright! At least copyright protecting the little guy. If you have deep enough pockets to create LLMs, you can in this potential future use them to wash away anyone's copyright for any work. Why would the GPL be the only target? If it works for the GPL, it surely also works for your photographs, poetry – or hell even proprietary software?

duskdozer

This is such scummy behavior.

b65e8bee43c2ed0

at this point, every corporation in the world has AI slop in their software. any attempt to outlaw it would attract enough funding from the oligarchs for the opposition to dethrone any party. no attempts will be made in the next three years, obviously, and then it will be even more late than it is now.

and while particularly diehard believers in democracy may insist that if they kvetch hard enough they can get things they don't like regulated out of existence, they pointedly ignore the elephant in the room. they could succeed beyond their wildest dreams - get the West to implement a moratorium on AI, dismantle every FAGMAN, Mossad every researcher, send Yudkowskyjugend death squads to knock down doors to seize fully semiautomatic assault GPUs, and none of it will make any fucking difference, because China doesn't give a fuck.

tokai

"I am not a lawyer, nor am I an expert in copyright law or software licensing."

Why would anyone waste their time reading what they wrote then?

verdverm

Interesting questions raised by recent SCOTUS refusal to hear appeals related to AI an copyright-ability, and how that may affect licensing in open source.

Hoping the HN community can bring more color to this, there are some members who know about these subjects.

MagicMoonlight

Logically, feeding in the old code to generate the new would be banned, because it’s stealing the content.

But if that were true, every single LLM is illegal, because they’ve all stolen terabytes of books and code.

andrewstuart

Ai rewrites great.

But if it’s making the original author unhappy then why do it.

est

Uh, patricide?

The key leap from gpt3 to gpt-3.5 (aka ChatGPT) was code-davinci-002, which is trained upon Github source code after OpenAI-Microsoft partnership.

Open source code contributed much to LLM's amazing CoT consistency. If there's no Open Source movement, LLM would be developed much later.

RcouF1uZ4gsC

> The copyright vacuum: If AI-generated code cannot be copyrighted (as the courts suggest), then the maintainers may not even have the legal standing to license v7.0.0 under MIT or any license.

I believe this is a misunderstanding of the ruling. The code can’t be copyrighted by a LLM. However, the code could be copyrighted by the person running the LLM.

jacquesm

If you don't understand the meaning of what a 'derived work' is then you should probably not be doing this kind of thing without a massive disclaimer and/or having your lawyer doing a review.

There is no such thing as the output of an LLM as a 'new' work for copyright purposes, if it were then it would be copyrightable and it is not. The term of art is 'original work' instead of 'new'.

The bigger issue will be using tools such as these and then humans passing off the results as their own because they believe that their contribution to the process whitewashes the AI contributions to the point that they rise to the status of original works. "The AI only did little bits" is not a very strong defense though.

If you really want to own the work-product simply don't use AI during the creation. You can use it for reviews, but even then you simply do not copy-and-paste from the AI window to the text you are creating (whether code or ordinary prose isn't really a difference).

I've seen a copyright case hinge on 10 lines of unique code that were enough of a fingerprint to clinch the 'derived work' assessment. Prize quote by the defendant: "We stole it, but not from them".

There is a very blurry line somewhere in the contents of any large LLM: would a model be able to spit out the code that it did if it did not have access to similar samples and to what degree does that output rely on one or more key examples without which it would not be able to solve the problem you've tasked it with?

The lower boundary would be the most minimal training set required to do the job, and then to analyze what the key corresponding bits were from the inputs that cause the output to be non-functional if they were dropped from the training set.

The upper boundary would be where completely non-related works and general information rather than other parties copyrighted works would be sufficient to do the creation.

The easiest way to loophole this is to copyright the prompt, not the work product of the AI, after all you should at least be able to write the prompt. Then others can re-create it too, but that's usually not the case with these AI products, they're made to be exact copies of something that already exists and the prompt will usually reflect that.

That's why I'm a big fan of mandatory disclosure of whether or not AI was used in the production of some piece of text, for one it helps to establish whether or not you should trust it, who is responsible for it and whether the person publishing it has the right to claim authorship.

Using AI as a 'copyright laundromat' is not going to end up well.

oytis

Is it just me, or HN recently started picking up a social media dynamics with contributions reacting/responding to each other?

show comments

spwa4

Can we do the same with universal music? Because that's easy and already possible. Or Microsoft Windows? Because we all know the answer: if it works, essentially any government will immediately call it illegal.

Because if this isn't allowed, that makes all of the AI models themselves illegal. They are very much the product of using others' copyrighted stuff and rewriting it.

But of course this will be allowed because copyright was never meant to protect anyone small. And that it's in direct contradiction with what applies to large companies? Courts won't care.

show comments

himata4113

I mean in my opinion GPL licensed code should just infect models forcing them to follow the license.

You can do this a lot by saying things like: complete the code "<snippet from gpl licensed code>".

And if now the models are GPL licensed the problem of relicensing is gone since the code produced by these models should in theory be also GPL licensed.

Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.

show comments

Cantinflas

"If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. "

Software in the AI era is not that important.

Copyleft has already won, you can have new code in 40 seconds for $0.70 worth of tokens.

show comments