Can you grep through ebooks?

skaarl@feddit.nl · 6 months ago

Can you grep through ebooks?

thagoat@lemmy.sdf.org · 6 months ago

Try ripgrep-all.

skaarl@feddit.nl · 6 months ago

This looks pretty cool, thanks!

thagoat@lemmy.sdf.org · 6 months ago

Glad to help!

skaarl@feddit.nl · 6 months ago

This tool is very powerful! Just what I needed, thanks again. It turns out you need pandoc 3+, and the Linux Mint repo has 2.9. After upgrading pandoc, rga started working, but it still throws this error from time to time:


parseSpine
Error: copying adapter output to stdout

Caused by:
    0: subprocess: Command { std: "pandoc" "--from=epub" "--to=plain" "--wrap=none" "--markdown-headings=atx", kill_on_drop: false }
    1: ExitStatus(unix_wait_status(16384))

I will have to keep looking into it; I’m not sure if this error stops the search in its tracks.
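
For what it's worth, the number in unix_wait_status(16384) is a raw Unix wait status, not the exit code itself. A minimal Python sketch of the usual decoding, where the helper name is made up for illustration:

```python
def exit_code_from_wait_status(status: int) -> int:
    """Return the exit code packed into a Unix wait status."""
    # The low 7 bits are non-zero when the process was killed or stopped
    # by a signal, in which case there is no exit code to report.
    if status & 0x7F:
        raise ValueError("process did not exit normally")
    # For a normal exit, the exit code is stored in the next byte up.
    return (status >> 8) & 0xFF


print(exit_code_from_wait_status(16384))  # 64
```

So pandoc exited with code 64. If I'm reading the exit-code table in the pandoc manual correctly, that is its parse-error code, which would point at one particular book that pandoc cannot parse. Whether rga carries on with the remaining files after that is not something this snippet can tell you.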

nesc@lemmy.cafe · 6 months ago

ls lists files; if you pipe it to grep, it only prints the file names that match. You can’t grep through ebook content universally, but you can do it with epub, and probably other zipped text formats, using zipgrep, or just unzip them and grep the extracted files.
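
If zipgrep is not installed, roughly the same thing can be sketched with Python's standard zipfile module. This is only an illustration: the function name and the list of file extensions are assumptions, and it matches against the raw markup line by line, the way grep would on the extracted files.

```python
import re
import sys
import zipfile


def grep_epub(epub_path, pattern):
    """Yield (member_name, line_number, line) for each matching line in an EPUB."""
    regex = re.compile(pattern)
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            # Only look at the markup files; skip images, fonts, CSS and so on.
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    yield name, number, line.strip()


# Usage: python grep_epub.py book.epub searchTerm
if __name__ == "__main__" and len(sys.argv) == 3:
    for member, number, line in grep_epub(sys.argv[1], sys.argv[2]):
        print(f"{member}:{number}:{line}")
```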

skaarl@feddit.nl · 6 months ago

Thanks!

Admiral Patrick@dubvee.org · edit-2 6 months ago

grep searchTerm file

nesc@lemmy.cafe · 6 months ago

You can’t grep zip archives directly.

thagoat@lemmy.sdf.org · 6 months ago

Ripgrep-all has that capability.

nesc@lemmy.cafe · 6 months ago

Good to know.

Hellfire103@lemmy.ca · 6 months ago

Sounds like a good time to mention that “Little Brother” by Cory Doctorow is available in GNU Info format (usually used for GNU manuals, as an alternative to man pages).

JASN_DE@lemmy.world · 6 months ago

Yeah, that’s to be expected with ls as it only lists the folder contents. Which format do you have?

skaarl@feddit.nl · 6 months ago

epub, mobi and pdf

sylver_dragon@lemmy.world · 6 months ago

It’s going to be different for different file formats. For example, something like epub is going to be hard because the format is really just a zip file with a specific internal file structure. So, it’s not really the .epub file you want to grep, but one of the files within that zip file you want to grep through. EBooks stored as PDFs could be a bit easier, as they are a monolithic file format with text often (though not always) stored just as plain text. However, the text streams can be encrypted and/or compressed (FlateDecode); so, there is no guarantee of seeing plain text.

I’m sure there are more formats, but I think you get the idea: how you would do a string search comes down to the actual file format. And some are not going to be easily greppable. It’s not impossible, just not straightforward.

Coelacanthus@infosec.pub · 2 days ago

For example, something like epub is going to be hard because the format is really just a zip file with a specific internal file structure. So, it’s not really the .epub file you want to grep, but one of the files within that zip file you want to grep through.

An ePub is a zip file that contains a batch of HTML files for the content and some XML files for metadata. So you can extract it and grep it as you would any HTML files.
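
One catch with grepping the raw HTML is that a phrase can be split by inline tags, so "big world" will not match "<em>big</em> world". A minimal sketch of stripping the tags first, using only Python's standard html.parser (the class and function names are made up for illustration):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = "<p>Hello <em>big</em>\n world &amp; co</p>"
print("big world" in sample)                # False: the tag splits the phrase
print("big world" in html_to_text(sample))  # True
```

This is deliberately crude: it also keeps the contents of any style or script elements, so it is a starting point rather than a proper text extractor.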

sylver_dragon@lemmy.world · 2 days ago

That was just the first example to come to mind where you couldn’t just grep search *, and I didn’t want to get into a bunch of specific file formats. For something like epub you could probably just use unzip -p (zcat only handles gzip streams, not multi-file zip archives) and then pipe the output to grep. Perhaps using a for loop if you want to do other fancy stuff along the way (e.g. output file names as headers).
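
The for-loop idea, with file names printed as headers, might look something like this in Python. It is a sketch under assumptions: the function names are invented, it does a plain substring match on the raw markup, and it only handles epub files.

```python
import sys
import zipfile
from pathlib import Path


def search_library(folder, term):
    """Map each .epub under folder to the lines in it that contain term."""
    results = {}
    for book in sorted(Path(folder).rglob("*.epub")):
        hits = []
        try:
            with zipfile.ZipFile(book) as archive:
                for name in archive.namelist():
                    if name.lower().endswith((".xhtml", ".html", ".htm")):
                        text = archive.read(name).decode("utf-8", errors="replace")
                        hits += [l.strip() for l in text.splitlines() if term in l]
        except zipfile.BadZipFile:
            continue  # not a readable zip archive, so skip this file
        if hits:
            results[book] = hits
    return results


def print_report(results):
    """Print each book name as a header, followed by its matching lines."""
    for book, hits in results.items():
        print(f"== {book.name} ==")
        for line in hits:
            print(f"  {line}")


# Usage: python search_library.py ~/Books searchTerm
if __name__ == "__main__" and len(sys.argv) == 3:
    print_report(search_library(sys.argv[1], sys.argv[2]))
```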

So yeah, “hard” may have been a bit overblown. “Not simple” may have been better. But, without the OP actually stating what format the ebooks were in, I wasn’t going to write a primer on dealing with any format.