It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.
What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.
They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.
Just a little note about the word “model”, in the article it’s used in a way that actually includes the weights, and I think this is the usual way of using it! If you change the weights, you get a different model, though the two models will have the same structure.
Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.
Not really. The same way you can’t sell live and public performance music for profit and not get sued. Case law right there, and the fact it’s performance vs publicly published doesn’t matter. How the owner and originator classifies or licenses it is the defining classification. It’s going to be years before anyone sees this get a ruling in court though.
That’s not what’s going on here, though. The LLM model doesn’t contain the actual copyrighted data, it’s the result of analyzing the copyrighted data.
An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing, it just contains information about those works.
No, the model does retain the original works in a lossy compression. This is evidenced by the fact that you can get a model to reproduce sections of its training data
You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.
This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.
What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.
But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?
They pulled a very public and out in the open data heist and got away with it.
They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.
Copyright laws are illogical - but I don’t think your claim is as clear cut as you think.
Transforming data to a different format, even in a lossy fashion, is often treated as copyright infringement. Let’s say the Alice produces a film, and Bob goes to the cinema, records it with a camera, and then compresses it into an Ogg file with Vorbis audio encoding and Theora video encoding.
The final output of this process is a lossy compression of the input data - meaning that the video and audio is put through a transformation that means it’s represented in a completely different form to the original, and it is impossible to reconstruct a pixel perfect rendition of the original from the encoded data. The transformation includes things like analysing the motion between frames and creating a model to predict future frames.
However, copyright laws don’t require that an infringing copy be an exact reproduction - lossy compression is generally treated as infringing, as is taking key elements and re-telling the same thing in different words.
You mentioned Harry Potter below, and gave a paper mache example. Generally copyright laws have restricted scope, and if the source paper was an authorised copy, that is the reason that wouldn’t be infringing in most jurisdictions. However, let me do an experiment. I’ll prompt ChatGPT-4o-mini with the following prompt: “You are J K Rowling. Create a three paragraph summary of the entire book “Harry Potter and the Philosopher’s Stone”. Include all the original plot points and use the original character names. Ensure what you create is usable as a substitute to reading the book, and is a succinct but entertaining highly abridged version of the book”. I’ve reviewed the output (I won’t post it here since I think it would be copyright infringing, and also given the author’s transphobic stances don’t want to promote her universe) - and can say for sure that it is able to accurately reproduce the major plot points and character names, while being insufficiently transformative (in the sense that both the original and the text generated by the model are literary works, and the output could be a substitute for reading the book).
So yes, the model (including its weights) is a highly compressed form of the input (admittedly far more so than the Ogg Vorbis/Theora example), and it can infer (i.e. decode to) outputs that contain copyrighted elements.
Of course it’s not clear-cut, it’s the law. Laws are notoriously squirrelly once you get into court. However, if you’re going to make predictions one way or the other you have to work with what you know.
I know how these generative AIs work. They are not “compressing data.” Your analogy to making a video recording is not applicable. I’ve discussed in other comments in this thread how ludicrously compressed data would have to be if that was the case, it’s physically impossible.
These AIs learn patterns from the training data. Themes, styles, vocabulary, and so forth. That stuff is not copyrightable.
You’re thinking of licensing as a person putting something online WITH a license.
The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling it’s derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.
Here’s an example: I write a story and post it online, and it is specific to a toothbrush and toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. Then somebody goes to prompt OpenAI for a similar and it kicks out the exact same story with the same names, I would win that case based on it clearly being beyond a doubt plagiarism.
Unless you as OpenAI can prove these are all completely random-which they aren’t because it’s trained on my data-then I would be deemed the original creator of that story, and any sales of that data I would be entitled to.
Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies…etc.
The product of that analysis does not contain the data itself, and so is not a violation of copyright.
That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so it is positively considered a derivative work. However, whether it’s considered sufficiently transformative, or whether it passes the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)
The courts have yet to come to a conclusion, the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.
The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION 5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it’s compressing images and storing them inside the model it’d somehow need to fit ~2.7 images per byte. This is, simply, impossible.
It’s already illegal in some form. Via piracy of the works and regurgitating protected data.
The issue is mega Corp with many rich investors vs everyone else. If this were some university student their life would probably be ruined like with what happened to Aaron Swartz.
The US justice system is different for different people.
If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.
Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.
The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.
I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.
Unlicensed from the POV of the trainer, meaning they didn’t contact or license content from someone who didn’t approve. If it’s posted under Creative Commons, that’s fine. If it’s otherwise posted that it’s not open in any other way and not for corporate use, then they need to contact the owner and license it.
They won’t need to, they will get it from Getty. All these websites have a ToS that make it very clear they can do whatever they want with what you upload. The courts will simply never side with the small time photographer who makes 50$ a month with his stock photos hosted on someone else’s website. The laws will be in favor of databrokers and the handful of big AI companies.
Anyone self hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus.
Even Reddit already sold it’s data for 60 million and none of that went anywhere but spezs coke fund.
Getty is not expressly licensed as “free to use”, and by default is not licensed for commercial anything. That’s how they are a business that is still alive.
You’re talking about Generative AI junk and not LLMs which this discussion and the original post is about. They are not the same thing.
Reddit and newspapers selling their data preemptively has to do with LLMs. Can you clarify what scenario you are aiming for? It sounds like you want the courts to rule that AI companies need to ask each individual redditor if they can use his comments for training. I don’t see this happening personally.
Getty gives itself the right to license all photos uploaded and already trained a generative model on those btw.
EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they were selling the data and barely gave notice about it (see the exodus from reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.
Anyone who has their content on a platform of the like that got the rug pulled out from under them with silent amendments being made to allow that is unfortunately fucked.
Any other platforms that didn’t explicitly state this was happening is not in scope to just allow these training tools to grab and train. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this. Personal blogs, Wikipedia…etc.
But wouldn’t that mean making it open source, then it not functioning properly without the data while open, would prove that it is using a huge amount of unlicensed data?
Probably not “burden of proof in a court of law” prove though.
Making it open source doesn’t change how it works. It doesn’t need the data after it’s been trained. Most of these AIs are just figuring out patterns to look for in the new data it comes across.
So you’re saying the data wouldn’t exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?
Most AI are not built to answer questions. They’re designed to act as some kind of detection/filter heuristic to identify specific things about an input that leads to a desired output.
So then why, if it were all open sourced, including the weights, would the AI be worthless? Surely having an identical but open source version, that would strip profitability from the original paid product.
It does discourages the use of unauthorised data. If stealing doesn’t give you competitive advantage, it’s not really worth the risk and cost of stealing it in the first place.
in civil matters, the burden of proof is actually usually just preponderance of evidence and not beyond a reasonable doubt. in other words to win a lawsuit, you only need to have more compelling evidence than the other person.
But you still have to have EVIDENCE. Not derivative evidence. The output of a model could be argued to be hearsay because it’s not direct evidence of originating content, it’s derivative.
You’d have to have somebody backtrack generations of model data to even find snippets of something that defines copyright material, or a human actually saying “Yes, we definitely trained on unlicensed data”.
so like I am not making any comment on anything but the legal system here. but it’s absolutely the case that you can win a lawsuit on purely circumstantial evidence if the defense is unable to produce a compelling alternative set of circumstances which can lead to the same outcome.
It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.
What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.
They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.
Just a little note about the word “model”, in the article it’s used in a way that actually includes the weights, and I think this is the usual way of using it! If you change the weights, you get a different model, though the two models will have the same structure.
Anyway, you make good points!
Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.
Not really. The same way you can’t sell live and public performance music for profit and not get sued. Case law right there, and the fact it’s performance vs publicly published doesn’t matter. How the owner and originator classifies or licenses it is the defining classification. It’s going to be years before anyone sees this get a ruling in court though.
That’s not what’s going on here, though. The LLM model doesn’t contain the actual copyrighted data, it’s the result of analyzing the copyrighted data.
An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing, it just contains information about those works.
No, the model does retain the original works in a lossy compression. This is evidenced by the fact that you can get a model to reproduce sections of its training data
You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.
This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.
Did you not read my original comment before responding?
You said:
But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?
They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.
Copyright laws are illogical - but I don’t think your claim is as clear cut as you think.
Transforming data to a different format, even in a lossy fashion, is often treated as copyright infringement. Let’s say the Alice produces a film, and Bob goes to the cinema, records it with a camera, and then compresses it into an Ogg file with Vorbis audio encoding and Theora video encoding.
The final output of this process is a lossy compression of the input data - meaning that the video and audio is put through a transformation that means it’s represented in a completely different form to the original, and it is impossible to reconstruct a pixel perfect rendition of the original from the encoded data. The transformation includes things like analysing the motion between frames and creating a model to predict future frames.
However, copyright laws don’t require that an infringing copy be an exact reproduction - lossy compression is generally treated as infringing, as is taking key elements and re-telling the same thing in different words.
You mentioned Harry Potter below, and gave a paper mache example. Generally copyright laws have restricted scope, and if the source paper was an authorised copy, that is the reason that wouldn’t be infringing in most jurisdictions. However, let me do an experiment. I’ll prompt ChatGPT-4o-mini with the following prompt: “You are J K Rowling. Create a three paragraph summary of the entire book “Harry Potter and the Philosopher’s Stone”. Include all the original plot points and use the original character names. Ensure what you create is usable as a substitute to reading the book, and is a succinct but entertaining highly abridged version of the book”. I’ve reviewed the output (I won’t post it here since I think it would be copyright infringing, and also given the author’s transphobic stances don’t want to promote her universe) - and can say for sure that it is able to accurately reproduce the major plot points and character names, while being insufficiently transformative (in the sense that both the original and the text generated by the model are literary works, and the output could be a substitute for reading the book).
So yes, the model (including its weights) is a highly compressed form of the input (admittedly far more so than the Ogg Vorbis/Theora example), and it can infer (i.e. decode to) outputs that contain copyrighted elements.
Of course it’s not clear-cut, it’s the law. Laws are notoriously squirrelly once you get into court. However, if you’re going to make predictions one way or the other you have to work with what you know.
I know how these generative AIs work. They are not “compressing data.” Your analogy to making a video recording is not applicable. I’ve discussed in other comments in this thread how ludicrously compressed data would have to be if that was the case, it’s physically impossible.
These AIs learn patterns from the training data. Themes, styles, vocabulary, and so forth. That stuff is not copyrightable.
How lossy can it be until it’s not infringement? One-line summary of some book is also a lossy reproduction
You’re thinking of licensing as a person putting something online WITH a license.
The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling it’s derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.
Here’s an example: I write a story and post it online, and it is specific to a toothbrush and toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. Then somebody goes to prompt OpenAI for a similar and it kicks out the exact same story with the same names, I would win that case based on it clearly being beyond a doubt plagiarism.
Unless you as OpenAI can prove these are all completely random-which they aren’t because it’s trained on my data-then I would be deemed the original creator of that story, and any sales of that data I would be entitled to.
Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies…etc.
That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so it is positively considered a derivative work. However, whether it’s considered sufficiently transformative, or whether it passes the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)
The courts have yet to come to a conclusion, the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.
The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION 5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it’s compressing images and storing them inside the model it’d somehow need to fit ~2.7 images per byte. This is, simply, impossible.
That’s not in question. It doesn’t need to contain the training data to be a derivative work, and therefore a potential infringement.
Oh no, not the pubes! Get those curlies outta here!
Best correction ever. Fixed. ♥️
It’s already illegal in some form. Via piracy of the works and regurgitating protected data.
The issue is mega Corp with many rich investors vs everyone else. If this were some university student their life would probably be ruined like with what happened to Aaron Swartz.
The US justice system is different for different people.
If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.
Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.
The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.
I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.
Unlicensed from the POV of the trainer, meaning they didn’t contact or license content from someone who didn’t approve. If it’s posted under Creative Commons, that’s fine. If it’s otherwise posted that it’s not open in any other way and not for corporate use, then they need to contact the owner and license it.
They won’t need to, they will get it from Getty. All these websites have a ToS that make it very clear they can do whatever they want with what you upload. The courts will simply never side with the small time photographer who makes 50$ a month with his stock photos hosted on someone else’s website. The laws will be in favor of databrokers and the handful of big AI companies.
Anyone self hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus. Even Reddit already sold it’s data for 60 million and none of that went anywhere but spezs coke fund.
Two things:
Getty is not expressly licensed as “free to use”, and by default is not licensed for commercial anything. That’s how they are a business that is still alive.
You’re talking about Generative AI junk and not LLMs which this discussion and the original post is about. They are not the same thing.
Reddit and newspapers selling their data preemptively has to do with LLMs. Can you clarify what scenario you are aiming for? It sounds like you want the courts to rule that AI companies need to ask each individual redditor if they can use his comments for training. I don’t see this happening personally.
Getty gives itself the right to license all photos uploaded and already trained a generative model on those btw.
EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they were selling the data and barely gave notice about it (see the exodus from reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.
Anyone who has their content on a platform of the like that got the rug pulled out from under them with silent amendments being made to allow that is unfortunately fucked.
Any other platforms that didn’t explicitly state this was happening is not in scope to just allow these training tools to grab and train. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this. Personal blogs, Wikipedia…etc.
But wouldn’t that mean making it open source, then it not functioning properly without the data while open, would prove that it is using a huge amount of unlicensed data?
Probably not “burden of proof in a court of law” prove though.
Making it open source doesn’t change how it works. It doesn’t need the data after it’s been trained. Most of these AIs are just figuring out patterns to look for in the new data it comes across.
So you’re saying the data wouldn’t exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?
Most AI are not built to answer questions. They’re designed to act as some kind of detection/filter heuristic to identify specific things about an input that leads to a desired output.
That is how LLM works, they don’t store the data as data, but as weight values.
So then why, if it were all open sourced, including the weights, would the AI be worthless? Surely having an identical but open source version, that would strip profitability from the original paid product.
It wouldn’t be. It would still work. It just wouldn’t be exclusively available to the group that created it-any competitive advantage is lost.
But all of this ignores the real issue - you’re not really punishing the use of unauthorized data. Those who owned that data are still harmed by this.
It does discourages the use of unauthorised data. If stealing doesn’t give you competitive advantage, it’s not really worth the risk and cost of stealing it in the first place.
If you can still use it after you stole it, as opposed to not being able to use it at all… Then it does give you an incentive
in civil matters, the burden of proof is actually usually just preponderance of evidence and not beyond a reasonable doubt. in other words to win a lawsuit, you only need to have more compelling evidence than the other person.
But you still have to have EVIDENCE. Not derivative evidence. The output of a model could be argued to be hearsay because it’s not direct evidence of originating content, it’s derivative.
You’d have to have somebody backtrack generations of model data to even find snippets of something that defines copyright material, or a human actually saying “Yes, we definitely trained on unlicensed data”.
so like I am not making any comment on anything but the legal system here. but it’s absolutely the case that you can win a lawsuit on purely circumstantial evidence if the defense is unable to produce a compelling alternative set of circumstances which can lead to the same outcome.