I'm dealing with several hundred chat logs, with some going up to 30,000 words. The topics covered in these conversations are all over the place, often containing 10-20 different discussions even within a single 400-turn chat. I need an effective method to split these conversations for better organization.
I'm trying to avoid the pitfalls of 'super indexing,' where I end up with a ton of irrelevant references for useful entries, as well as having huge chunks of text referenced by a single index entry. Additionally, issues arise when I try to save these chats either by copying and pasting or saving as a complete webpage, as it results in excessive data tied to the presentation layer. I've done some Perl scripting to clean things up, which helped reduce a 30-turn conversation down to a more manageable size, but it still requires a day of programming. This solution might only work until the platform I'm using changes its interface. What's a better way to manage and access these chats?
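For reference, the kind of cleanup I mean can be sketched like this (a Python stand-in for my Perl script, using only the stdlib `html.parser`; which tags to skip depends on the platform's markup, so `script`/`style` here are assumptions):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from a saved chat page, skipping script/style blocks."""
    SKIP = {"script", "style"}  # assumption: presentation-only tags to drop

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def strip_presentation(html: str) -> str:
    """Return just the text content of a saved webpage, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

This still breaks the moment the platform changes its markup, which is exactly the fragility I'd like to avoid.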
6 Answers
This is a challenge that's been tackled before. Instead of reinventing the wheel, check out what industry leaders are doing. A quick search gave me some resources, like AWS documentation on full-text search and their OpenSearch service. Also, Slack uses Apache Solr for chat searching, which could be a good model to follow.
As for saving your chats, I recommend looking into existing saving functionalities in the tools you’re using. If there's nothing suitable, you could create a simple client that uses an API to save chats directly to your storage of choice. Relying on saving the whole webpage is just adding unnecessary complexity.
Why not utilize Git for this? You could download all your conversations, split them into smaller manageable files, and use Git to track changes. This way, you can efficiently search through them with `git grep`, which is quite handy for looking up text quickly.
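To make the splitting step concrete, here's one way to chunk a long transcript into files before committing (a Python sketch; the `User:`/`Assistant:` turn markers are an assumption about your export format, so adjust them to match):

```python
import os

def split_chat(text: str, out_dir: str, turns_per_file: int = 50):
    """Split a chat transcript into smaller files, one per block of turns.

    Assumes each turn starts at a line beginning with 'User:' or
    'Assistant:' -- change the markers to whatever your export uses.
    Returns the list of file paths written.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Group lines into turns at each speaker marker.
    turns, current = [], []
    for line in text.splitlines():
        if line.startswith(("User:", "Assistant:")) and current:
            turns.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        turns.append("\n".join(current))

    # Write fixed-size blocks of turns to numbered files.
    paths = []
    for i in range(0, len(turns), turns_per_file):
        path = os.path.join(out_dir, f"part_{i // turns_per_file:03d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(turns[i:i + turns_per_file]))
        paths.append(path)
    return paths
```

After that it's just `git init`, `git add .`, commit, and `git grep -n "some phrase"` to search every chunk, with history tracking changes for free.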
Honestly, some of this indexing can feel like PhD-level work. If there were a straightforward solution to avoid super indexing, we wouldn't have so many approaches to retrieval-augmented generation (RAG). It's definitely a complex area to navigate.
Your question could use a bit more clarity. What do you mean by 'chat'? Are these texts from multiple identities? Are timestamps important for the order of messages? If you're dealing with different languages or the quality of grammar and spelling varies, that affects indexing as well.
If you just need to index plain text, consider loading each message into a database system like ClickHouse or Elasticsearch, both of which have built-in indexing and full-text search. You could also explore techniques like vector embeddings or n-gram indexes for a more refined search experience.
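Before reaching for a full search server, it's worth seeing how little machinery a basic inverted index needs. A minimal sketch (the regex tokenizer and AND-only query semantics are simplifying assumptions, nothing Elasticsearch-specific):

```python
import re
from collections import defaultdict

def build_index(messages):
    """Map each lowercase token to the set of message ids containing it."""
    index = defaultdict(set)
    for msg_id, text in enumerate(messages):
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(msg_id)
    return index

def search(index, query):
    """Return ids of messages containing every query token (AND semantics)."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for t in tokens[1:]:
        result &= index.get(t, set())
    return result
```

A real engine adds ranking, stemming, and incremental updates on top, but the core lookup is this simple.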
If you're trying to make these chats searchable with vector embeddings combined with BM25, my database simplifies this. It uses Neo4j drivers and provides out-of-the-box functionality for better management. Check out NornicDB on GitHub for more details!
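For anyone curious what the BM25 half of that involves, here's a generic Okapi BM25 scorer (this is not NornicDB's implementation, just the textbook formula; `k1=1.5` and `b=0.75` are the usual default parameters):

```python
import math
import re
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Higher score = better lexical match. Documents sharing no query
    terms score 0.0.
    """
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in doc_tokens) / n  # average document length

    # Document frequency: how many docs contain each term.
    df = Counter()
    for toks in doc_tokens:
        df.update(set(toks))

    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

In a hybrid setup you'd normalize these scores and blend them with cosine similarity from the vector embeddings.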
A simple solution could be creating a hash from each chat text and using it as the key in a hash table. That gives average O(1) lookups, though note it only finds exact duplicates; it won't help with searching inside a chat.
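A minimal sketch of the idea (the whitespace normalization before hashing is my own assumption, so trivially reformatted copies of the same chat hash identically):

```python
import hashlib

def chat_key(text: str) -> str:
    """Deterministic key: SHA-256 of the whitespace-normalized chat text."""
    normalized = " ".join(text.split())  # assumption: collapse all whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def add_chat(store: dict, text: str) -> None:
    store[chat_key(text)] = text

def has_chat(store: dict, text: str) -> bool:
    # average O(1): one hash computation plus one dict probe
    return chat_key(text) in store
```

Useful for deduplicating exports, but you'd still need one of the indexing approaches above for actual content search.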

It sounds like these AI chats are a bit tricky! Maybe OP hasn’t fully grasped what information they actually need from all this data.