
we’re working on some more involved approaches in an Element group. Not really sure where we stand at the moment, but it seems we can stitch a lot together from the large torrent files and from what we scraped off the DOJ’s website with a bit of brute force.

yea for me it fails after anywhere between 200MB and 10-15GB. All the time.

PSA: the paging bug on the DOJ’s website has been fixed. The site caps out at around 9600 pages for ~197k files, way less than the 520k in the less-complete dataset 9 torrent. Scraping the website now to find out which files they took offline.
Correction: ~9600 pages at 50 files per page puts it in the 470-480k ballpark. Much more than 197k, but still a lot less than the torrent’s 530k, let alone the expected 600k+ files that were supposed to be in there
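Rough sketch of the kind of scraper I mean, in case anyone wants to sanity-check the counts. The listing URL, the `page` query parameter, and the assumption that the documents are PDFs are all placeholders, not the real DOJ endpoint:

```python
# Sketch only: walk the paged listing and collect every document URL it still serves.
import re
import requests

LISTING_URL = "https://www.justice.gov/epstein-files"  # placeholder, swap in the actual listing URL
PAGE_CAP = 9600  # roughly where the site stops serving new pages

seen = set()
for page in range(PAGE_CAP + 1):
    resp = requests.get(LISTING_URL, params={"page": page}, timeout=30)
    if resp.status_code != 200:
        break
    # grab anything that looks like a document link (assuming the files are PDFs)
    links = re.findall(r'href="([^"]+\.pdf)"', resp.text)
    if not links:
        break  # ran past the last page that still lists files
    seen.update(links)

print(f"{len(seen)} unique file links found")
# ~9600 pages x ~50 links per page ≈ 480k entries if every page were full
```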

we’ve been delving further into it on Element. I can invite you (and anyone else wondering) to the channel if you pm me your matrix id

if what you’re saying is that CSAM seems like a very good excuse to redact a lot more of those files than they previously intended, then yes, I agree.

ysk the page limit has been fixed, it caps out around 9600 pages for a total of ~197k file entries. Way less than the largest torrent’s 530k. Scraping now to get a list of the files they kept up on the DOJ site so we can determine which files they don’t want out there. Would be a good lead for further investigating the torrent

you’re not getting your connection cut off from the place where you’re downloading? That’s huge, could you let me know if you succeed?

we’re on element

DM me your matrix account, we’re looking to get more people to uncover what’s missing from dataset 9, see https://lemmy.world/post/42440468/21884671

not familiar with it but sure i can set something up, will DM the 3 of you a link in a minute

I can also take up some of these. Do you happen to have more of those gaps?
Also, are you guys using some chat channel for this? Might be a little more accessible
E: other users who run into this thread, DM me and I can add you to an element group where we coordinate this. We’re looking for more people

nice. Kinda feeling like we can never be sure whether our URL lists are exhaustive, or whether the DOJ might just let a large part of the dataset go dark

Awesome, I don’t really understand what’s happening but I’m running it too (against what’s presumably the exact same 48GB torrent, that’s what I’m supposed to do, right?)

I’ve been checking your URLs, but it seems a lot of them don’t have a downloadable document attached?
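This is roughly how I’m checking them, sketch only: a HEAD request per URL, looking at the status code and content type. `urls.txt` is just an assumed one-URL-per-line dump of your list:

```python
# Sketch: flag URLs that don't resolve to an actual downloadable document.
import requests

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

missing = []
for url in urls:
    try:
        resp = requests.head(url, allow_redirects=True, timeout=30)
    except requests.RequestException:
        missing.append(url)
        continue
    ctype = resp.headers.get("Content-Type", "")
    # anything that errors out or comes back as an HTML page instead of a file
    if resp.status_code != 200 or "text/html" in ctype:
        missing.append(url)

print(f"{len(missing)} of {len(urls)} URLs have no downloadable document")
```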

Would still love to help from my PC on dataset 9 specifically. Any way we can exchange progress so I don’t start downloading files you’ve already downloaded?
E: just started scraping from page 18330 (since you mentioned you ended around 18333), hoping I can fill in the remaining 4000-ish pages
Update 2 (1715 UTC): just finished scraping up to the page 20500 limit you set in the code. There are 0 new files in the 18330-20500 range compared to the ones you already found. So unless I did something wrong, either your list is complete or the DOJ has been scrambling their shit (given the large number of duplicate pages, I’m going with the second explanation).
Either way, I’m gonna extract the 48GB and 100GB torrent directories now and try to mark down which of the files already exist within those torrents, so we can make an (intermediate) list of which files are still missing from them. Rough idea of the comparison below.
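Something like this, assuming both torrents are already extracted locally and the scraped list is one URL per line; the directory names are placeholders, and matching on bare filename is a guess about how the two sides line up:

```python
# Sketch: which scraped files are not present in either extracted torrent?
from pathlib import Path
from urllib.parse import urlparse

TORRENT_DIRS = [Path("torrent_48gb"), Path("torrent_100gb")]  # placeholder extract paths

# every filename that exists somewhere in the extracted torrents
on_disk = {p.name for d in TORRENT_DIRS for p in d.rglob("*") if p.is_file()}

with open("scraped_urls.txt") as f:  # assumed: one URL per line
    scraped = {line.strip() for line in f if line.strip()}

missing = sorted(
    url for url in scraped
    if Path(urlparse(url).path).name not in on_disk
)

Path("missing_from_torrents.txt").write_text("\n".join(missing))
print(f"{len(missing)} scraped files not present in either torrent")
```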
group on element, I can send you an invite if you pass me your matrix id