I'm working on downloading a set of military documents concerning the use of aerosol atmospheric injection for weather control strategies. I've got a PDF version already, but now I want to save the scanned images of the original documents. The pages are JPG scans accessible through a URL that includes the page number in the "seq" variable.
I want to use a tool like wget or curl to automate the download process so I can grab all pages at once. I'm thinking of a loop like this:
```
FOR %COUNT in (1,1,52) do (
WGET "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq=%COUNT"
)
```
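In Bash I assume that would translate to something along these lines (untested, just my rough attempt):
```bash
# Loop over pages 1 through 52 and fetch each scan
for count in $(seq 1 52); do
    wget "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq=${count}"
done
```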
Can anyone provide guidance or code examples on how to achieve this? Thanks!
I'm using Linux Mint 21.1 with Bash 5.1.16.
4 Answers
Actually, you can simplify this without a loop by letting the shell's brace expansion generate all 52 URLs for you. Just make sure the braces sit outside the double quotes, otherwise Bash won't expand them:
```bash
wget "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq={1..52}"
```
But be careful with your curl command! The URL isn't quoted, so the shell treats each `&` as a command separator: curl gets backgrounded with a truncated URL and never sees the remaining parameters. Also, double-check the range: the pages start at 1, so `{0..52}` would request a nonexistent seq=0 as well.
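If you want to sanity-check what the shell will actually hand to wget before downloading anything, you can preview the expansion (this only prints the URLs; nothing is fetched):
```bash
# Print each expanded URL on its own line; no download happens here
printf '%s\n' "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq="{1..52}
```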
I actually managed to download everything using curl like this:
```bash
for i in {0..52}; do curl -LROJ --retry-all-errors https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq=$i; done
```
However, it took me about 20 tries to get all files! I'm wondering if someone could explain how to make curl keep trying until a file downloads successfully. It doesn’t seem to work well with the retry options. Thanks!
You can handle this easily with wget using a two-step method. First, create a list of the URLs with a loop:
```bash
for img in {1..52}; do echo "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq=${img}" >> urllist; done
```
Then use wget2 (or wget) to download them:
```bash
wget2 -i urllist
```
This worked well for me. If you want to customize the filenames, wget has options for that too (for instance `-O` to pick a name yourself, or `--content-disposition` to use the name the server suggests); see the sketch below for one way to number the pages.
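For example, here is one way to save the scans with zero-padded names like `page_01.jpg` (the naming scheme is just my own choice, adjust it as you like):
```bash
# Fetch each page and save it as page_01.jpg ... page_52.jpg
for img in {1..52}; do
    wget -O "page_$(printf '%02d' "$img").jpg" \
        "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq=${img}"
done
```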
Edit: Thanks to another user for pointing out that you can do it in one command too:
```bash
wget "https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300&seq={1..52}"
```
To get `curl` to retry reliably, two things matter. First, `--retry-all-errors` does nothing on its own: it only broadens what `--retry <num>` is willing to retry, so you need to set a retry count as well. Second, curl doesn't treat an HTTP error response such as a 404 as a failure unless you add `-f`/`--fail`, so without that flag there is nothing to retry. You can also drop the `for` loop entirely, because curl has built-in URL globbing and accepts a numeric range like `[1-52]`:
```bash
url="https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300"
curl -LROJ --retry 5 --retry-all-errors --fail "$url&seq=[1-52]"
```
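If the built-in retries still give up too early, another option is to wrap each page in a small shell loop that keeps calling curl until it exits successfully. A rough sketch, reusing your flags (the 10-second pause is an arbitrary choice):
```bash
base="https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.d0008795742&attachment=1&tracker=D4&format=image/jpeg&size=ppi:300"
for i in {1..52}; do
    # 'until' re-runs curl for this page until it exits with status 0
    until curl -LROJ --fail "$base&seq=$i"; do
        echo "seq=$i failed, retrying in 10 seconds..." >&2
        sleep 10
    done
done
```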