Hey everyone! I'm building a text search engine for our CRM. I'm planning to vectorize the text before I insert it into OpenSearch, but I'm not quite sure how to tackle this. We have a massive backlog of historical text messages, about 300 million, and we receive around 500,000 new messages daily. I'll be using the HTTP API for data insertion. Any advice on how to handle this effectively would be greatly appreciated! Thanks!
4 Answers
It might be beneficial to start with some software engineering courses if you're new to this. Building software requires a solid understanding of the fundamentals, so get those skills down and then tackle the project!
You might want to look into Amazon Kendra or Amazon Bedrock Knowledge Bases. They automate the vectorization process when you upload your data. While OpenSearch is powerful, S3 Vectors can be a cheaper alternative for storage, although it might have slower retrieval times than OpenSearch. Just keep your project's latency requirements in mind!
For a project of this scale, it's better to handle the vectorization outside of OpenSearch. Consider using a dedicated embedding model hosted on a service like Bedrock or SageMaker to generate your vectors before indexing them. Here's a game plan:
1. Use an external model to vectorize your text.
2. Store these vectors in a knn_vector field alongside your original text.
3. Leverage OpenSearch's k-NN or vector search features for similarity searches.
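To make steps 2 and 3 a bit more concrete, here's a rough sketch of the request bodies involved. It only builds the JSON; you'd send it with whatever HTTP client you already use. The field names `message_text` and `message_vector` and the 384 dimension are my assumptions, not anything from your setup, so adjust them to your embedding model.

```python
import json

# Assumption: 384 dimensions; the real value depends on the embedding model you pick.
EMBEDDING_DIM = 384


def build_index_body(dim: int = EMBEDDING_DIM) -> dict:
    """Body for creating an index that stores the raw text next to its vector."""
    return {
        # k-NN has to be enabled on the index for knn_vector search to work.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Hypothetical field names, chosen for illustration only.
                "message_text": {"type": "text"},
                "message_vector": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def build_knn_query(query_vector: list, k: int = 10) -> dict:
    """Body for a k-NN search: return the k documents nearest to query_vector."""
    return {
        "size": k,
        "query": {
            "knn": {
                "message_vector": {"vector": query_vector, "k": k},
            }
        },
    }


if __name__ == "__main__":
    # The first body goes in the index-creation request, the second in a _search request.
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(build_knn_query([0.0] * EMBEDDING_DIM, k=5)))
```

The query vector has to come from the same embedding model you used at indexing time, otherwise the distances are meaningless.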
A few tips:
- Don't try to stream all the historical data; backfill it in batches instead.
- Make use of bulk APIs instead of individual HTTP inserts.
- If possible, opt for a smaller embedding size; it can significantly affect performance.
- Be mindful of costs and indexing time; 300 million documents is a big task, so consider sharding by time or CRM entity for efficiency.
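Here's a minimal sketch of what the batching and bulk tips might look like in practice, plus a back-of-the-envelope size estimate for the embedding-size tip. The index name `crm-messages`, the field names, and the document shape (`id`, `text`, `vector`) are all hypothetical. The code only builds the NDJSON payload for the `_bulk` endpoint and doesn't send anything.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def batched(docs: Iterable[dict], size: int) -> Iterator[list]:
    """Yield lists of up to `size` docs without loading everything into memory."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_bulk_ndjson(index: str, docs: Iterable[dict]) -> str:
    """Build a _bulk payload: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        # Assumed doc shape: {"id": ..., "text": ..., "vector": [...]}
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps({
            "message_text": doc["text"],
            "message_vector": doc["vector"],
        }))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


def raw_vector_bytes(num_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw float32 vector data only; ignores index structures and replicas."""
    return num_docs * dim * bytes_per_value


if __name__ == "__main__":
    sample = [{"id": str(i), "text": f"message {i}", "vector": [0.1, 0.2]} for i in range(5)]
    for chunk in batched(sample, size=2):
        payload = to_bulk_ndjson("crm-messages", chunk)
        # POST `payload` to /_bulk with Content-Type: application/x-ndjson
        print(payload, end="")
    print(raw_vector_bytes(300_000_000, 1536))
    print(raw_vector_bytes(300_000_000, 384))
```

By that estimate, the raw float32 vectors alone for 300 million documents come to roughly 1.84 TB at 1,536 dimensions versus about 461 GB at 384. Those two dimensions are just illustrative picks, and real usage will be higher once index structures and replicas are counted. The right batch size depends on your cluster, so start small and measure.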
OpenSearch does include some machine learning features that you could experiment with, but results can be unpredictable. It might be worth checking them out to see if you can get a prototype working. If the ML features don't cut it, a solid book on OpenSearch could level up your skills, or consider getting some professional help if needed.
