Does anyone here have the partial dataset 9, or are you all missing the entire set?
There is a magnet link floating around for ~100GB of it, the one that was removed from the OP.
I am trying to figure out exactly how many files dataset 9 is supposed to have in it.
Before the zip file went dark, I was able to download about 2GB of it. This was today, so it may not be the original zip file from Jan 30th.
At the head of the zip file is an index file, VOL00009.OPT. You don’t need the full download in order to read this index file.
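For anyone who wants to try this on their own partial download, here is a rough sketch of how the first entry can be pulled out of a truncated zip. Python's zipfile module needs the central directory at the end of the archive, which a partial download does not have, so this reads the first local file header by hand instead. The function name read_first_entry and the 64MB read cap are my own choices, and it assumes the index really is the first entry and is either deflated or stored with its size in the header.

```python
import struct
import zlib

# Fixed 30-byte ZIP local file header: signature, version, flags, method,
# mod time, mod date, crc32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_first_entry(path, max_bytes=64 * 1024 * 1024):
    """Return (name, data) for the first entry of a possibly truncated ZIP.

    Hypothetical helper: it only looks at the first local file header and
    never touches the central directory at the end of the archive.
    """
    with open(path, "rb") as f:
        header = f.read(LOCAL_HEADER.size)
        if len(header) < LOCAL_HEADER.size:
            raise ValueError("file is too short to hold a ZIP local header")
        (sig, _version, flags, method, _mtime, _mdate,
         _crc, comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack(header)
        if sig != b"PK\x03\x04":
            raise ValueError("file does not start with a ZIP local file header")

        # Names are normally ASCII here; undecodable bytes are replaced.
        name = f.read(name_len).decode("utf-8", errors="replace")
        f.seek(extra_len, 1)  # skip the extra field

        if method == 8:
            # Raw deflate marks its own end, so the size fields are not needed.
            inflater = zlib.decompressobj(-15)
            data = inflater.decompress(f.read(max_bytes))
            if not inflater.eof:
                raise ValueError("entry is cut off or larger than max_bytes")
            return name, data

        if method == 0 and not (flags & 0x08) and comp_size != 0xFFFFFFFF:
            # Stored entry whose size is recorded directly in the header.
            return name, f.read(comp_size)

        raise ValueError("unsupported compression method or size encoding")
```

Something like `name, data = read_first_entry("partial.zip")` should then give you the index bytes, where partial.zip is a placeholder for whatever your download is called.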
The index file says there are 531,307 pdfs.
The 100GB torrent has 531,256, so it’s missing 51 pdfs.
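If you want to run the same comparison against your own copy, a minimal sketch is below. It assumes the .OPT file is a comma-separated load file with the file path in the third column (the usual Opticon layout; I have not confirmed every line follows it) and that matching on lowercase file name is good enough. The function names are made up for illustration.

```python
from pathlib import Path, PureWindowsPath


def pdf_names_in_index(opt_path):
    """Collect the unique pdf file names listed in the index.

    Assumption: comma-separated lines with the file path in column 3.
    """
    names = set()
    with open(opt_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.strip().split(",")
            if len(fields) < 3:
                continue  # blank or malformed line
            # PureWindowsPath splits on both backslashes and forward slashes.
            name = PureWindowsPath(fields[2]).name.lower()
            if name.endswith(".pdf"):
                names.add(name)
    return names


def pdf_names_on_disk(root):
    """Collect the pdf file names found anywhere under root."""
    return {
        p.name.lower()
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def missing_from_disk(opt_path, root):
    """Names the index lists that are not present under root, sorted."""
    return sorted(pdf_names_in_index(opt_path) - pdf_names_on_disk(root))
```

`len(pdf_names_in_index(...))` is the index count and `len(missing_from_disk(...))` is the gap between the index and your local copy.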
I checked the 51 file names and they no longer exist as individual files on the DOJ website either. I’m assuming these are the CSAM.
Note that the 3M number of released documents != 3M pdfs. Each pdf page is counted as a “document”. Dataset 9 contains 1,223,757 documents, and according to the index we are missing only 51 of them, since the missing pdfs are not multipage.
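For scale, 1,223,757 documents over 531,307 pdfs works out to about 2.3 pages per pdf. A sketch of how both totals can be tallied from the index is below. It assumes one line per pdf with the page count in the seventh column; that is my guess at the layout, and if your copy has one line per page instead, counting lines per path would be the thing to do. The function names are hypothetical.

```python
from collections import Counter
from pathlib import PureWindowsPath


def pages_per_pdf(lines):
    """Map pdf file name -> page count.

    Assumption: comma-separated lines with the path in column 3 and a
    numeric page count in column 7. Lines without one are skipped.
    """
    pages = Counter()
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 7 or not fields[6].isdigit():
            continue
        pages[PureWindowsPath(fields[2]).name] += int(fields[6])
    return pages


def summarize(lines):
    """Count pdfs, pages (the "documents"), and multipage pdfs."""
    pages = pages_per_pdf(lines)
    return {
        "pdfs": len(pages),
        "documents": sum(pages.values()),
        "multipage_pdfs": sum(1 for n in pages.values() if n > 1),
    }
```

Looking up the missing file names in the `pages_per_pdf` result is also a quick way to check whether they are single-page.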
In total, I have 2,731,789 documents from datasets 1-12, roughly 268k short of the 3M number. The index I got also was not missing any document ranges.
It’s curious that the zip file had an extra 80GB when only 51 documents are missing. I’m currently scraping links from the DOJ webpage to double-check the filenames.