I run an uptime monitoring service that checks various protocols every two minutes from multiple regions. Each check creates a row in BigQuery, and I'm starting to worry about the rising costs as the table grows rapidly.
The problem is that while BigQuery handles the write side fine, it gets expensive once users query long historical ranges for their dashboards. On-demand pricing penalizes frequent scans, and flat-rate (slot) pricing feels excessive at our current scale. Even simple queries over the last month can be prohibitively costly despite daily partitioning and clustering.
Writes arrive at a high rate but are simple; the hard part is low-latency reads for dashboards and a public API. Users want analytics like uptime percentage and latency percentiles over arbitrary time ranges, and writing those percentile queries in BigQuery is both awkward and expensive.
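For concreteness, the raw table and a typical dashboard query look roughly like this (schema simplified, names changed):

```sql
-- Roughly the current setup (simplified; real names differ).
CREATE TABLE `project.dataset.checks_raw` (
  ts          TIMESTAMP NOT NULL,
  monitor_id  INT64     NOT NULL,
  region      STRING    NOT NULL,
  ok          BOOL      NOT NULL,
  latency_ms  FLOAT64
)
PARTITION BY DATE(ts)
CLUSTER BY monitor_id, region;

-- Typical dashboard query: uptime and latency percentiles over a user-chosen range
-- (@monitor_id, @range_start, @range_end are query parameters). Even with partition
-- pruning and clustering, it still scans the raw rows for that monitor across the range.
SELECT
  COUNTIF(ok) / COUNT(*)                        AS uptime_ratio,
  APPROX_QUANTILES(latency_ms, 100)[OFFSET(50)] AS p50_ms,
  APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)] AS p95_ms,
  APPROX_QUANTILES(latency_ms, 100)[OFFSET(99)] AS p99_ms
FROM `project.dataset.checks_raw`
WHERE monitor_id = @monitor_id
  AND ts BETWEEN @range_start AND @range_end;
```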
I'm considering alternatives like ClickHouse, which comes highly recommended, but I'm hesitant because I've never run it in production. Timescale also looks appealing since it's built on PostgreSQL and its continuous aggregates seem like a good fit, though I'm unsure how it copes with such a write-heavy workload. A third option is staying on BigQuery and leaning harder on pre-aggregation into rollup tables.
I'd love to hear from anyone who has moved away from BigQuery for similar needs. Did your costs actually drop, or did you just trade them for different problems? For those running ClickHouse, is it really as straightforward as setting up AggregatingMergeTree tables and materialized views? And does anyone run a hybrid model with hot data in ClickHouse and cold data in object storage as Parquet or Iceberg? I'm happy to share more specific numbers if it helps the discussion.
4 Answers
Have you looked into using a time-series database like InfluxDB or TimescaleDB for your uptime results? They're built for handling high write loads and could be more cost-effective while still allowing fast query access for your dashboards. You could also consider aggregating your older data back into BigQuery if needed.
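If you went the Timescale route specifically, the basic shape would be a hypertable plus a continuous aggregate, roughly like the sketch below (table name, columns, and intervals are all placeholders, not your schema):

```sql
-- Raw results table (placeholder schema).
CREATE TABLE checks (
    ts         timestamptz NOT NULL,
    monitor_id bigint      NOT NULL,
    region     text        NOT NULL,
    ok         boolean     NOT NULL,
    latency_ms double precision
);

-- Turn it into a hypertable partitioned by time.
SELECT create_hypertable('checks', 'ts');

-- Hourly continuous aggregate for dashboards.
CREATE MATERIALIZED VIEW checks_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts)          AS bucket,
       monitor_id,
       count(*)                            AS total_checks,
       sum(CASE WHEN ok THEN 1 ELSE 0 END) AS ok_checks,
       avg(latency_ms)                     AS avg_latency_ms
FROM checks
GROUP BY bucket, monitor_id;

-- Keep it refreshed in the background.
SELECT add_continuous_aggregate_policy('checks_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');
```

Latency percentiles would need the timescaledb_toolkit percentile aggregates or a query against the raw hypertable, but the rollup above covers the uptime-percentage case cheaply.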
I’m curious why you’re not considering a traditional database setup on a dedicated VPS or machine. Wouldn't that be more manageable for your needs?
Fair point! I'm comfortable writing functions but much less so running databases, so I lean toward ready-made solutions. But yeah, a PostgreSQL setup on my VPS might actually be feasible.
Honestly, moving to ClickHouse sounds like a solid plan. The learning curve is real, especially coming from a conventional SQL warehouse, but we faced similar issues and saw both cost and query-performance improvements after switching. Materialized views are crucial for scaling: they pre-aggregate the data for the common timeframes so that most dashboard queries never touch the raw table at all.
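Roughly, the pattern looks like this; treat it as a sketch with placeholder table and column names, not your actual schema:

```sql
-- Raw check results.
CREATE TABLE checks_raw
(
    ts         DateTime,
    monitor_id UInt64,
    region     LowCardinality(String),
    ok         UInt8,
    latency_ms Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (monitor_id, ts);

-- Pre-aggregated hourly rollup storing partial aggregate states.
CREATE TABLE checks_hourly
(
    hour       DateTime,
    monitor_id UInt64,
    total      AggregateFunction(count),
    ok_count   AggregateFunction(sum, UInt8),
    latency_q  AggregateFunction(quantiles(0.5, 0.95, 0.99), Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY (monitor_id, hour);

-- Materialized view populates the rollup on every insert into checks_raw.
CREATE MATERIALIZED VIEW checks_hourly_mv TO checks_hourly AS
SELECT
    toStartOfHour(ts)                           AS hour,
    monitor_id,
    countState()                                AS total,
    sumState(ok)                                AS ok_count,
    quantilesState(0.5, 0.95, 0.99)(latency_ms) AS latency_q
FROM checks_raw
GROUP BY hour, monitor_id;

-- Dashboard query merges the partial states instead of scanning raw rows.
SELECT
    monitor_id,
    sumMerge(ok_count) / countMerge(total)     AS uptime,
    quantilesMerge(0.5, 0.95, 0.99)(latency_q) AS latency_quantiles
FROM checks_hourly
WHERE hour >= now() - INTERVAL 30 DAY
GROUP BY monitor_id;
```

Dashboards then read from checks_hourly and only fall back to checks_raw for incident drill-downs.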
Definitely don't underestimate the care AggregatingMergeTree tables need, or the storage bloat from keeping every raw row forever. Buffering ingestion through Kafka and putting a short TTL on the raw table, offloading older raw data to Parquet on S3 first, can keep you out of trouble.
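The retention side could look something like this, building on the checks_raw sketch above; the bucket path, column list, and the 90-day window are made-up values:

```sql
-- Short retention on the raw table: rows older than 90 days are dropped by background merges.
ALTER TABLE checks_raw MODIFY TTL ts + INTERVAL 90 DAY DELETE;

-- Periodic export of an older day to Parquet via the s3 table function
-- (e.g. from a cron job), run before the TTL removes the data.
INSERT INTO FUNCTION s3(
    'https://my-bucket.s3.amazonaws.com/checks/2024-01-01.parquet',
    'Parquet',
    'ts DateTime, monitor_id UInt64, region String, ok UInt8, latency_ms Float64')
SELECT ts, monitor_id, region, ok, latency_ms
FROM checks_raw
WHERE toDate(ts) = toDate('2024-01-01');
```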
Have you thought about an end-of-day rollup job? It could condense the many per-check rows into a single row per monitored instance for the previous 24 hours. You could also buffer results in something like Redis and flush aggregates to the historical store later; it might simplify your architecture a bit!
Good point! I actually do have a daily rollup table to reduce costs for queries like "uptime over 30 days." The read issues arise when I need detailed access to the raw data for specific incidents, which my users often want when there’s an issue.
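It's essentially the standard scheduled-MERGE pattern, something along these lines (columns simplified, not my exact schema):

```sql
-- Scheduled once a day against the previous day's partition.
MERGE `project.dataset.checks_daily` AS d
USING (
  SELECT
    DATE(ts)                                      AS day,
    monitor_id,
    COUNT(*)                                      AS total_checks,
    COUNTIF(ok)                                   AS ok_checks,
    APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)] AS p95_latency_ms
  FROM `project.dataset.checks_raw`
  WHERE DATE(ts) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY day, monitor_id
) AS s
ON d.day = s.day AND d.monitor_id = s.monitor_id
WHEN MATCHED THEN UPDATE SET
  total_checks   = s.total_checks,
  ok_checks      = s.ok_checks,
  p95_latency_ms = s.p95_latency_ms
WHEN NOT MATCHED THEN INSERT ROW;
```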

Yes, Timescale is on my shortlist since I trust PostgreSQL. I haven't explored InfluxDB much yet; my impression is it's great for metrics but gets tricky with joins or filters. Would love to hear more if you have experience with it!