Hey everyone! I'm looking for some advice on getting data from a legacy system into an LLM agent workflow. The only way I can extract data is through SQL queries in Databricks, which output to a CSV; there's no direct database access or API available. I need near-real-time data in the agent workflow via an API, so I'm wondering whether there's a way to automate the query and expose the results as an API endpoint. Ideally, I want to avoid the hassle of manually downloading and uploading files each time. I'm open to options like Databricks Jobs to automate the query, cloud storage for file handling, or something like Azure Functions or AWS Lambda for processing. Has anyone successfully set something like this up, and what would you recommend as the easiest and most sustainable approach? Thanks for any insights!
3 Answers
You might want to consider setting up a small database to store the results of your query. You can automate the SQL query execution with a cron job and have it save the output in your own database. That way your LLM agent can pull data directly from there without needing to deal with CSVs at all, which should simplify the process a lot. Just make sure the setup is secured appropriately for your workflow.
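Here's a rough sketch of what that scheduled refresh could look like in Python, assuming you use the databricks-sql-connector package and a local SQLite file as the cache; the query, table names, and environment variable names are just placeholders you'd swap for your own:

```python
# Sketch: run the Databricks query on a schedule (e.g. via cron) and cache the
# results in a local SQLite table that the agent's API layer can read from.
import os
import sqlite3

from databricks import sql  # pip install databricks-sql-connector

QUERY = "SELECT * FROM my_catalog.my_schema.my_table"  # placeholder query


def refresh_cache(db_path: str = "cache.db") -> None:
    # Pull the latest rows from the Databricks SQL warehouse.
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_HOST"],   # e.g. adb-....azuredatabricks.net
        http_path=os.environ["DATABRICKS_HTTP_PATH"],    # SQL warehouse HTTP path
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as conn, conn.cursor() as cursor:
        cursor.execute(QUERY)
        columns = [c[0] for c in cursor.description]
        rows = cursor.fetchall()

    # Replace the local cache table with the fresh snapshot.
    with sqlite3.connect(db_path) as db:
        db.execute("DROP TABLE IF EXISTS latest_results")
        col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
        db.execute(f"CREATE TABLE latest_results ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        db.executemany(
            f"INSERT INTO latest_results VALUES ({placeholders})",
            ([str(v) for v in row] for row in rows),
        )


if __name__ == "__main__":
    refresh_cache()
```

A cron entry like `*/5 * * * * /usr/bin/python3 /path/to/refresh_cache.py` would then keep the cache within a few minutes of the source, and your API just reads from SQLite.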
Check the official Databricks API documentation for options that let you run your queries directly over REST. There may be features available that don't require an enterprise setup, and the docs will describe the most current, supported methods.
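For example, Databricks has a SQL Statement Execution REST API that can run a query against a SQL warehouse and return rows as JSON, which would skip the CSV step entirely. Here's a rough sketch, hedged as an assumption about your workspace setup: the host, token, and warehouse ID are placeholders, and you should confirm the exact request shape against the current docs for your workspace:

```python
# Rough sketch: run a query via the Databricks SQL Statement Execution API
# (POST /api/2.0/sql/statements) and get the rows back inline as JSON.
# Host, token, and warehouse_id are placeholders for your own workspace.
import os

import requests


def run_statement(statement: str) -> list[list[str]]:
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-....azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]  # personal access token
    resp = requests.post(
        f"{host}/api/2.0/sql/statements",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "warehouse_id": os.environ["DATABRICKS_WAREHOUSE_ID"],
            "statement": statement,
            "wait_timeout": "30s",  # wait synchronously up to 30 seconds
        },
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    if body["status"]["state"] != "SUCCEEDED":
        raise RuntimeError(f"Statement did not succeed: {body['status']}")
    # With the default inline disposition, rows come back in result.data_array.
    return body["result"]["data_array"]


if __name__ == "__main__":
    print(run_statement("SELECT 1 AS ok"))
```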
Good call! Sometimes the official docs have hidden gems that can save you a lot of time.
I think you’re on the right track with Databricks Jobs! You can configure it to output your query results to cloud storage like S3 or Azure Blob. Once the data is there, set up an API with something like Flask or Express.js to serve the latest file or process it into a JSON response. It's pretty flexible and should work well for your needs!
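To make that concrete, here's a minimal sketch of the API side, assuming the Job drops CSVs into an S3 bucket: the bucket name and prefix are placeholders, and boto3 credentials are assumed to be configured in the environment.

```python
# Minimal sketch: a Flask endpoint that finds the newest CSV a Databricks Job
# has written to S3 and serves it as JSON for the LLM agent.
import csv
import io

import boto3
from flask import Flask, jsonify

app = Flask(__name__)
s3 = boto3.client("s3")

BUCKET = "my-databricks-exports"  # placeholder bucket
PREFIX = "latest-query/"          # placeholder prefix the Job writes to


@app.get("/latest")
def latest():
    # Find the most recently modified CSV under the prefix.
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    csv_objects = [o for o in objects if o["Key"].endswith(".csv")]
    if not csv_objects:
        return jsonify({"error": "no results yet"}), 404
    newest = max(csv_objects, key=lambda o: o["LastModified"])

    # Download and parse it into a list of row dicts.
    body = s3.get_object(Bucket=BUCKET, Key=newest["Key"])["Body"].read()
    rows = list(csv.DictReader(io.StringIO(body.decode("utf-8"))))
    return jsonify({"source": newest["Key"], "rows": rows})


if __name__ == "__main__":
    app.run(port=8000)
```

The Azure Blob version would look much the same with the azure-storage-blob SDK in place of boto3.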
Yeah, I like that idea! Combining Databricks with a lightweight API sounds efficient for near-real-time access.
Sounds like a solid plan, but what if you need to stay entirely within Databricks? Could you run the query there directly and route the results to your API somehow?