What’s the best Python stack for multi-threaded batch processing?

Asked By CuriousCoder99 On

Hey folks! I'm transitioning from a Java background and have a legacy Spring Boot batch process that's currently managing millions of users. We're looking at migrating this system to Python and I'm hoping to get your input on this. Here's how the current setup works:

- It connects to various databases (all major ones are supported).
- Each batch service operates on separate servers and processes a queue of about 100-1000 users at a time.
- We utilize a thread pool where each queue item is handled by a different thread.
- After processing, the tasks send messages to RabbitMQ or Kafka.

Given all this, what Python stack or architecture would you recommend for this type of work? I know the GIL limits what Python threads can do for CPU-bound work, and I've read about workarounds based on multiprocessing. I'd really appreciate suggestions that fit within the Python ecosystem!
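For reference, the thread-pool setup described above maps roughly onto `concurrent.futures.ThreadPoolExecutor` from the standard library. The following is only a minimal sketch, assuming the per-user work is mostly I/O-bound (database calls, broker publishes). `process_user`, `publish`, and `run_batch` are hypothetical names used for illustration, and the publish step just appends to a list instead of talking to RabbitMQ or Kafka.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_user(user_id):
    # Hypothetical stand-in for the real per-user work (DB reads, business rules).
    return {"user_id": user_id, "status": "processed"}


def publish(message, outbox):
    # Hypothetical stand-in for a RabbitMQ/Kafka publish; here it only appends to a list.
    outbox.append(message)


def run_batch(user_ids, max_workers=16):
    """Process one queue of users, one task per user, on a bounded thread pool."""
    outbox = []
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its user so failures can be reported per user.
        futures = {pool.submit(process_user, uid): uid for uid in user_ids}
        for future in as_completed(futures):
            uid = futures[future]
            try:
                # Publishing happens in the calling thread, as results arrive.
                publish(future.result(), outbox)
            except Exception as exc:
                failures.append((uid, exc))
    return outbox, failures
```

Calling `run_batch(range(1000))` would process one batch of 1000 users. If the per-user work turns out to be CPU-bound rather than I/O-bound, threads are unlikely to help much and a process-based approach is the usual alternative.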

5 Answers

Answered By CeleryFan91 On

Using Celery with RabbitMQ is popular for this kind of task! Just be aware that while it's great for straightforward jobs, its workflow orchestration capabilities can be quite basic, which might not suit complex tasks well.

Answered By SkepticalSteve On

Honestly, unless you have a compelling reason, I wouldn't rush into migrating to Python. The performance drop can be significant for this kind of workload, especially with multi-threading, since the GIL in standard CPython builds lets only one thread execute Python bytecode at a time. Just a thought!

Answered By TechieTommy On

I’d recommend using Celery along with RabbitMQ. Celery is great for background task processing, especially if you need to handle asynchronous workloads. Combine it with RabbitMQ for effective message queuing and you'll have a solid setup for your batch tasks.

Answered By PerformanceWatcher On

I would caution against using Celery; it carries a lot of bloat. If you’re not using Redis, check out Dramatiq; it’s lightweight and performs well without all the extra features.

Answered By ProcessPro On

If you can split the workload in advance among independent Python processes, you'll see better performance. You might want to have a single master node fetch data from the DB, then distribute it to worker nodes over RabbitMQ.
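As a rough single-machine illustration of that idea, the sketch below has a master function split the user IDs into fixed-size chunks and hand each chunk to a pool of worker processes. It uses `concurrent.futures.ProcessPoolExecutor` in place of RabbitMQ and separate worker nodes, so it shows the partitioning pattern only, not the distributed setup. `chunked`, `process_chunk`, and `run_master` are hypothetical names, and the per-chunk work is a placeholder.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def process_chunk(user_ids):
    # Hypothetical per-chunk work; this runs inside a worker process.
    return [{"user_id": uid, "status": "processed"} for uid in user_ids]


def run_master(user_ids, n_workers=4, chunk_size=100):
    """Split the workload up front and fan the chunks out to worker processes."""
    results = []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns the per-chunk results in the order the chunks were submitted.
        for chunk_result in pool.map(process_chunk, chunked(user_ids, chunk_size)):
            results.extend(chunk_result)
    return results
```

When run as a script, `run_master(...)` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. In a multi-server setup, the `pool.map` call would presumably be replaced by publishing each chunk to a queue that the worker nodes consume.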
