I've been on an absolute rollercoaster trying to troubleshoot an incredibly frustrating issue with ArgoCD and Crossplane. Here's what I'm facing: ArgoCD consistently indicates that resources are "Healthy" and "Synced," yet Crossplane is having major trouble provisioning AWS resources. We're receiving countless 400 errors from AWS, causing things like Lambda functions to not update and RDS instances to get stuck. It's like ArgoCD is serving as a false beacon of hope while everything else is crumbling.
I've spent a significant amount of time researching and found no documentation or discussion that addresses this directly. It's baffling! Through my investigation, I discovered that the health check logic for Crossplane resources is flawed. Conditions are evaluated in array order, so ArgoCD declares a resource healthy as long as 'Ready: True' appears before any failing condition, even when a later condition in the same array reports a failure.
Is anyone else dealing with this absurdity? Are we all just overlooking the health checks with Crossplane, or is my setup unusually cursed? I managed to circumvent the problem by rearranging the condition checks, but I'm shocked that this isn't better known.
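To make the ordering problem concrete, here is a minimal sketch in Python. Treat it as an illustration only: Argo CD health checks are written in Lua, and the function names, the `Synced: False` condition standing in for "any failing condition," and the exact branching are my assumptions, not the real check. The first function returns on the first decisive condition it meets, so array order decides the result. The second collects the conditions by type before deciding, which is roughly the idea behind rearranging the checks.

```python
from typing import Dict, List

Condition = Dict[str, str]


def order_dependent_health(conditions: List[Condition]) -> str:
    """Hypothetical check that returns on the first decisive condition."""
    for cond in conditions:
        # If Ready: True comes first, we return before seeing any failure.
        if cond["type"] == "Ready" and cond["status"] == "True":
            return "Healthy"
        if cond["type"] == "Synced" and cond["status"] == "False":
            return "Degraded"
    return "Progressing"


def order_independent_health(conditions: List[Condition]) -> str:
    """Hypothetical check that looks at every condition before deciding."""
    by_type = {cond["type"]: cond["status"] for cond in conditions}
    # Failures take priority regardless of their position in the array.
    if by_type.get("Synced") == "False":
        return "Degraded"
    if by_type.get("Ready") == "True":
        return "Healthy"
    return "Progressing"


# Example: the same two conditions, in two different orders.
ready_first = [
    {"type": "Ready", "status": "True"},
    {"type": "Synced", "status": "False"},
]
failure_first = list(reversed(ready_first))

print(order_dependent_health(ready_first))      # Healthy
print(order_dependent_health(failure_first))    # Degraded
print(order_independent_health(ready_first))    # Degraded
print(order_independent_health(failure_first))  # Degraded
```

The order-dependent version gives two different answers for the same resource state, which matches what I was seeing: "Healthy" in ArgoCD while the provider kept failing.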
If this strikes a chord with anyone, please let me know!
5 Answers
You should really consider sending a patch instead of just working around the problem. Fixing the issue directly might prevent it from affecting others in the future.
It's great that you found a workaround! But just a heads up, Medium articles that are "Member-only" can be frustrating. It might be better to share your findings somewhere more accessible, like GitHub.
I generally avoid Medium too. It's likely going to limit who can read about your issue.
It sounds like there's a bit of misunderstanding about how GitOps works with Argo. The resources may be synced, but that doesn't mean they're healthy. Argo's job is primarily to ensure that what's in your cluster matches the desired state, not to guarantee everything is completely operational. You should implement additional monitoring tools for a full health overview, like Grafana or Datadog, which can track the real-time state of your AWS resources.
I see what you're saying! I think I focused too much on Argo's output and not enough on monitoring tools.
Exactly! Relying solely on Argo for health checks could lead to these kinds of oversights.
If you're dealing with these kinds of issues, consider filing a GitHub issue instead of just writing it up somewhere else. This could be affecting many users without them realizing it, so a tracked issue would help them too.
That’s a good suggestion! I did mention it to the maintainers, but it feels more like a community issue right now.
Honestly, I've seen folks hit this issue before. Fortunately, I had learned about Argo's health check behavior beforehand, so it didn't surprise me. I assumed anyone using Argo would know to test custom health checks when the defaults don't work as expected.
Right? I feel like this info should be more common knowledge!
Yeah, Medium can be a pain, especially with member restrictions.