I'm looking for recommendations on tools that facilitate ad-hoc remote execution, particularly in cloud-native environments, but I'm open to on-prem solutions too. My organization manages a lot of Kubernetes clusters and compute instances across several public clouds, and during automated or manual incident response we often need to execute commands across multiple instances or clusters. Examples would include running a command in a specific pod across all clusters, restarting services in certain namespaces, generating diagnostics on high-CPU instances, or listing config maps in AWS. Essentially, I'm after a reliable querying and workflow engine that suits our scale. What solutions are you leveraging for this? If you're using any commercial products like Datadog's Workflow Automation or something else, how's that been working out for you?
5 Answers
AWS Systems Manager is pretty solid for executing commands on a group of servers. It's quite cost-effective, but some folks mention it might be tricky to scale or integrate with other systems unless you've got a dedicated team for that kind of setup. Have you found it works well for you, or do you think there are limitations?
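To make the SSM part concrete, an ad-hoc run against a tagged fleet looks roughly like this. The tag key/value, command, and wrapper name are placeholders I picked for illustration, not anything from the thread:

```shell
# Rough sketch: fan one shell command out to every instance carrying a tag,
# via SSM Run Command (instances must have the SSM agent and an IAM role).
run_on_tagged() {
  local tag_key="$1" tag_value="$2" cmd="$3"
  aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets "Key=tag:${tag_key},Values=${tag_value}" \
    --parameters "commands=[\"${cmd}\"]" \
    --comment "ad-hoc diagnostics"
}
# e.g. run_on_tagged role web "uptime"
```

You'd then poll `aws ssm list-command-invocations` for per-instance results.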
Ansible works great for ad-hoc tasks, especially with the Ansible Automation Platform. It’s flexible and lets you handle complex workflows easily, though there is a learning curve around inventories and playbook structure before it pays off.
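For the ad-hoc case specifically, a minimal sketch using Ansible's service module; the inventory path and the `web` group are assumptions on my part:

```shell
# Sketch: ad-hoc service restart across an inventory group with Ansible.
# inventory.ini and the 'web' group are placeholders.
restart_service() {
  local group="$1" svc="$2"
  ansible "$group" -i inventory.ini --become \
    -m ansible.builtin.service -a "name=${svc} state=restarted"
}
# e.g. restart_service web nginx
```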
For queries like yours, I usually combine Python scripting with boto3 to leverage AWS's SSM. If resources are tagged correctly, it streamlines the process. I also use Lambda for handling certain tasks and kubectl along with bash scripts for K8s operations.
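For the kubectl-plus-bash side, the multi-cluster fan-out usually amounts to looping over kubeconfig contexts. A sketch, with the namespace and deployment name as placeholders:

```shell
# Sketch: run one command in a given workload's pod across every kubeconfig
# context. 'prod' and 'deploy/my-app' are placeholders, not from the thread.
run_in_all_contexts() {
  local cmd="$1"
  kubectl config get-contexts -o name | while read -r ctx; do
    echo "=== ${ctx} ==="
    # '|| true' keeps the loop going when one cluster is unreachable
    kubectl --context "$ctx" -n prod exec deploy/my-app -- sh -c "$cmd" || true
  done
}
# e.g. run_in_all_contexts "uptime"
```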
That's interesting, but I wonder if there’s a centralized tool that can unify these tasks. It can get tricky when you have different teams creating their own solutions.
I’ve had decent experiences with Puppet for more complex tasks. It can be a bit pricey, but it does the job for remote execution. Separating permissions and keeping scripted commands for specific tasks is key for consistency.
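If the remote-execution piece here means Puppet's Bolt tool, an ad-hoc run looks roughly like this; the target group name is my own placeholder:

```shell
# Sketch: ad-hoc command execution with Puppet Bolt against a target group
# defined in a Bolt inventory. 'web_servers' is a placeholder group name.
run_diag() {
  local targets="$1" cmd="$2"
  bolt command run "$cmd" --targets "$targets"
}
# e.g. run_diag web_servers "systemctl status nginx"
```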
Honestly, I try to steer clear of ad-hoc remote execution. I prefer predefined scripts with tightly controlled permissions. For example, I set up a dedicated user for restarting services, where the shell is configured to just run the restart command, making sure there’s no chance for interactive session mishaps. I’d go with a scheduled task or a job for anything needing execution in ECS or K8s.
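The locked-down restart user described above could be wired up roughly like this; every name here (script path, account, unit) is a placeholder, and the sudoers line is the part that actually constrains what the account can do:

```shell
# Sketch of a single-purpose account whose "shell" can only restart one service.
#
# /usr/local/bin/restart-myapp — installed mode 0755, used as the login shell:
#     #!/bin/sh
#     exec sudo --non-interactive /bin/systemctl restart myapp.service
#
# Create the account with that script as its login shell:
#     sudo useradd --system --shell /usr/local/bin/restart-myapp svc-restart
#
# /etc/sudoers.d/svc-restart — the only command the account may run as root:
#     svc-restart ALL=(root) NOPASSWD: /bin/systemctl restart myapp.service
```

Any SSH session as that account runs the restart and exits, so there's no interactive shell to misuse.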
I see your point about security and predictability. Ad-hoc queries can be risky, but sometimes we need a bit of flexibility while still maintaining those principles. Balancing both is definitely a priority.
That's a good point about scaling. It can become challenging if you need to incorporate other tools for authentication or monitoring. What has your experience been?