I'm looking to create a system that lets me search through a collection of fewer than 500 PDF files, which consist mainly of journal articles. For example, I might ask, "Which articles have information about frog habitats in North America?" New PDFs will be added only occasionally (a couple a month) and query volume will be low (just a few searches per day), so I'm curious whether S3 vector stores would be a good fit for this purpose. I've heard that tools like Kendra have high operational costs, even for small setups. Does anyone have suggestions on how I could approach this?
1 Answer
I don't think the S3 vector store has a built-in capability for natural language retrieval. It stores embeddings and returns the nearest matches, so something else still has to turn your PDFs and your questions into vectors. I'd suggest running Textract on your PDFs and then feeding the output into a Bedrock knowledge base. This way, you pay for document processing once per PDF up front, and after that only small per-token charges at query time.
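To make the query side concrete, here is a minimal sketch of what asking the knowledge base a question could look like from Python. It assumes the knowledge base already exists and that `client` is a boto3 `bedrock-agent-runtime` client; the function name `articles_for_question`, the knowledge base ID, and `top_k` are placeholders I made up for illustration. Treat it as a starting point under those assumptions, not a tested production setup.

```python
def articles_for_question(client, knowledge_base_id, question, top_k=5):
    """Return (source URI, best score) pairs for passages matching the question.

    `client` is assumed to behave like boto3.client("bedrock-agent-runtime").
    Several passages can come from the same PDF, so results are grouped by
    source URI and only the highest score per PDF is kept.
    """
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )

    best = {}
    for hit in response.get("retrievalResults", []):
        # Assumes the documents live in S3; other source types use other keys.
        uri = hit.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        score = hit.get("score", 0.0)
        if uri not in best or score > best[uri]:
            best[uri] = score

    # Highest-scoring PDFs first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

You would call it with something like `articles_for_question(boto3.client("bedrock-agent-runtime"), "YOUR_KB_ID", "frog habitats in North America")`, where the ID is whatever your own knowledge base is assigned. Passing the client in as an argument also makes it easy to try the grouping logic with a stub before spending anything on AWS.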
Would vector stores be effective for simple keyword searches? For example, if someone types "eardrum," would it list all PDFs containing that word? The asker seems open to whatever functionality keeps costs down.
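For what it's worth, an exact-word lookup like the "eardrum" case doesn't strictly need embeddings. Vector search ranks passages by similarity, so it isn't guaranteed to list every PDF that contains a given word, whereas a plain inverted index is. Below is a rough sketch of that idea in Python, assuming the text has already been extracted from each PDF (by Textract or anything else); the file names and sample text are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it.

    `documents` is a dict of {file name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical extracted text, standing in for real PDF content.
documents = {
    "otology.pdf": "The eardrum vibrates in response to sound.",
    "frogs.pdf": "Frogs have an external eardrum called the tympanum.",
    "birds.pdf": "Migration patterns of North American songbirds.",
}

index = build_index(documents)
print(search(index, "eardrum"))        # ['frogs.pdf', 'otology.pdf']
print(search(index, "Eardrum frogs"))  # ['frogs.pdf']
```

At fewer than 500 PDFs, an index like this should fit comfortably in memory or in a single small file, so it could sit alongside the semantic search rather than replace it. Note that it matches whole words only, so "eardrums" would not match "eardrum" without adding stemming.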