To skirt copyright law, Anthropic bought and destroyed millions of physical books to train its "Claude" LLM

Arthur Besse@lemmy.ml · 2 days ago

To skirt copyright law, Anthropic bought and destroyed millions of physical books to train its "Claude" LLM

Jack@slrpnk.net · 2 days ago

I FEEL BAD WHEN BUYING PHYSICAL BOOKS BECAUSE IT FEELS WASTEFUL (usually use e-reader) AND THESE FUCKERS DO THIS…

QuarterSwede@lemmy.world · 2 days ago

I mean, using energy to display text on an eReader is wasteful too. Just being alive is wasteful. Everything is wasteful with that mindset. Guess what? The universe doesn’t care either way.

Ajen@sh.itjust.works · 2 days ago

No one should feel guilty about using the amount of energy it takes to power an eReader.

☂️-@lemmy.ml · 2 days ago

Gork@sopuli.xyz · 2 days ago

What a waste.

koper@feddit.nl · 2 days ago

Very on brand for AI companies. In a decade they’ll be allowed to give AI agents legal personhood and the right to vote, but only if they first euthanize an equal number of orphans.

P03 Locke@lemmy.dbzer0.com · 2 days ago

I blame copyright law for requiring this.

Madrigal@lemmy.world · 2 days ago

deleted by creator

koper@feddit.nl · 2 days ago

Ultimately, Judge William Alsup ruled that this destructive scanning operation qualified as fair use—but only because Anthropic had legally purchased the books first, destroyed each print copy after scanning, and kept the digital files internally rather than distributing them. The judge compared the process to “conserv[ing] space” through format conversion and found it transformative.

mindbleach@sh.itjust.works · 2 days ago

Phrased like it’s a technicality, when it’s just… your rights. You are explicitly allowed to do this.

This whole article sounds like Jack Valenti shrieking over VCRs. ‘They copied a broadcast! For later!!! That’s skirting copyright law!’

Copyright law suuucks. It needs vicious reform. And yet! These specific things have always been permitted, as a necessary part of protecting consumers, versus an industry that would love to charge rent for the books on your shelf. Those motherfuckers put DRM in cables. And yet: their laws say this is fine.

koper@feddit.nl · 24 hours ago

It’s not that clear cut. Buying a book doesn’t generallly give you the right to make copies and sell those.

mindbleach@sh.itjust.works · 24 hours ago

Is that what they did?

Is that what anyone’s talking about?

ChicoSuave@lemmy.world · 2 days ago

It’s literally the process that allows digitized media to be safe to possess. Someone read the FBI warnings before movies on VHS. This is some corporate malicious compliance and what the law looks like when taken to an absurd extreme.

MartianSands@sh.itjust.works · 2 days ago

That depends on whether you consider an LLM to be reading the text, or reproducing it.

Outside of the kind of malfunctions caused by overfitting, like when the same text appears again and again in the training data, it’s not difficult to construct an argument that an LLM does the former, not the latter.

Arthur Besse@lemmy.ml · edit-2 2 days ago

models can and do sometimes produce verbatim copies of individual items in their training data, and more frequently produce outputs that are close enough to them that they would clearly constitute copyright infringement if a human produced them.

the argument that models are not derivative works of their training data is absurd, and the fact that it is being accepted by courts is yet another confirmation that the “justice system” is anything but just and the law simply doesn’t apply when there is enough money at stake.

awesomesauce309@midwest.social · 2 days ago

It’s rare a person on social media understands they turn the input into predictive weights, and do not selectively copy and paste out of them.

Baggins [he/him]@lemmy.ca · 2 days ago

You’re saying if I encode a copyrighted work into a JPEG it isn’t infringement? It also uses statistics to produce an approximation of the input.

lad@programming.dev · 2 days ago

If it’s low enough resolution, it’s not an infringement /½s

awesomesauce309@midwest.social · 2 days ago

You’re saying save one jpeg with the intent to reproduce exactly that image. I’m saying if you have a million images you have turned into weights, it won’t exactly reproduce anything unless there is very limited training data on what you’re having it predict.

Baggins [he/him]@lemmy.ca · 2 days ago

So you just put a million JPEGs into a zip file? How is that not infringement?

awesomesauce309@midwest.social · 2 days ago

You are free to be obtuse if you like

mindbleach@sh.itjust.works · 2 days ago

Because it’s not that.

Baggins [he/him]@lemmy.ca · edit-2 2 days ago

Isn’t it? Both methods just produced a data structure you can query to obtain a statistical approximation of a subset of the input data.

Just because you moved the statistics from the JPEG to the ZIP file? That makes it ok?

Arthur Besse@lemmy.ml · 2 days ago

According to the article, Judge William Alsup disagrees

To skirt copyright law, Anthropic bought and destroyed millions of physical books to train its "Claude" LLM

To skirt copyright law, Anthropic bought and destroyed millions of physical books to train its "Claude" LLM

Anthropic destroyed millions of print books to build its AI models