I'm experiencing some frustrating issues with a Windows Event Collector server that collects data from around 2,650 endpoints. It's been configured to write forwarded events to a directory on the C: drive, and the log is set to overwrite events once it reaches 20MB. The server has Splunk Universal Forwarder installed for data ingestion.
The event logs don't provide much insight, except for an error indicating that a COM service failed to start. The last-updated timestamp on the forwarded log file hasn't changed for about 10 minutes. A reboot gets things moving again, but after 2-3 days the Windows Event Collector service freezes and it becomes impossible to stop the service through the menu.
I suspect there might be an issue related to how Splunk UF interacts with the log file getting full. If anyone has suggestions on how to troubleshoot or resolve this, I'd greatly appreciate it! Hope everyone had a good weekend!
4 Answers
You might want to consider switching to an agent-based SIEM solution, especially if you’re handling this much data. Tools like Wazuh or Graylog could help, and both can be used for free. Managing logs from 2,650 endpoints on one Windows Event Collector must be incredibly challenging!
Also, it's worth checking what event logs are being forwarded. You might not need to forward every log, as it could lead to unnecessary load and costs on Splunk. A more targeted logging approach could improve things significantly.
As basic as it sounds, have you run DISM and SFC scans to check for system integrity? You might want to run these commands:
`DISM /Online /Cleanup-Image /CheckHealth`
`DISM /Online /Cleanup-Image /ScanHealth`
`DISM /Online /Cleanup-Image /RestoreHealth`
`sfc /scannow`
We had a similar problem where the Windows Event Log service would crash when we had too many logs coming in. Changing the log settings to archive instead of overwrite seemed to fix it. If your current limit is 20MB, the log could be hitting that too quickly with this many endpoints. Try increasing the maximum log size to 2GB and setting the retention option to "Archive the log when full, do not overwrite events." This might give you a chance to troubleshoot before the service crashes again.
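If you'd rather script that change than click through Event Viewer, here is a minimal sketch that builds the equivalent `wevtutil sl` command. It assumes the collector writes to the default `ForwardedEvents` log (the thread doesn't confirm the log name), and the helper name `archive_mode_command` is made up for illustration. It only prints the command unless you pass `--apply` on Windows from an elevated prompt.

```python
import subprocess
import sys


def archive_mode_command(log_name="ForwardedEvents", max_bytes=2 * 1024**3):
    """Build a wevtutil command that enlarges a log and archives it when full.

    log_name is an assumption: ForwardedEvents is the default destination
    log for a Windows Event Collector, but your subscription may differ.
    """
    return [
        "wevtutil", "sl", log_name,
        f"/ms:{max_bytes}",  # maximum log size in bytes (2 GiB by default here)
        "/rt:true",          # retention on: do not overwrite existing events
        "/ab:true",          # auto-backup: archive the log when it reaches max size
    ]


if __name__ == "__main__":
    cmd = archive_mode_command()
    print(" ".join(cmd))
    # Only change the log settings when explicitly asked to, and only on Windows.
    if sys.platform == "win32" and "--apply" in sys.argv:
        subprocess.run(cmd, check=True)
```

Afterwards, `wevtutil gl ForwardedEvents` should show the new size, retention, and auto-backup values so you can confirm the change took effect.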
That sounds like a solid plan! Moving the logs to another drive could also help avoid filling up the C: drive and causing more issues. Maybe you could consider a PowerShell script to manage old archive logs too.
Just remember, while archiving logs is great, make sure you have a plan for managing those archives to keep overhead low, whether that’s deleting on a schedule or offloading them somewhere cheaper.
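For the cleanup side, a scheduled script along these lines could do the job. It's sketched in Python here, though the same logic ports to PowerShell easily. Treat it as a rough starting point under a few assumptions: archived logs match the pattern `Archive-*.evtx` (check what your server actually produces), 30 days is just a placeholder retention period, and `prune_archives` is a hypothetical helper name. It defaults to a dry run so nothing gets deleted until you've checked the list.

```python
import time
from pathlib import Path


def prune_archives(directory, max_age_days=30, pattern="Archive-*.evtx",
                   dry_run=True, now=None):
    """Return archived logs older than max_age_days; delete them unless dry_run.

    The pattern is an assumption about how archived logs are named; the live
    log file does not match it, so it is never touched.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age threshold as a Unix timestamp
    expired = sorted(
        p for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Calling `prune_archives(some_dir)` first lets you review what would go, and `prune_archives(some_dir, dry_run=False)` actually deletes. If you offload archives to cheaper storage instead, the same selection logic works with a move in place of `unlink()`.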

For DISM, remember that you can skip directly to `RestoreHealth` since it does the scanning part too. Just doing that can often resolve a ton of underlying issues without needing the other two.