Thoughts on my data distribution strategy for an open-source web app?

Asked By CuriousCoder92

Hey everyone! I'm a beginner working on an open-source web app that heavily relies on a large dataset. Since it takes about 9 hours to collect the data via API calls, it's not practical to ask users to gather it themselves. Here's my current approach:

- I collect and cache the data nightly on my machine.
- Then, I export the cached data as a compressed JSON file, which is way smaller than the raw data.
- I plan to include this JSON file in my GitHub repo.
- When users run the Docker container, it loads the JSON into a local SQLite database on startup and downloads the current data index from the source (see the seeding sketch right after this list).
- It compares my cached data against the index to find any missing or outdated records and syncs those via API (sketched a bit further below).
- There's also a scheduled script that keeps everything updated.
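
For concreteness, here's roughly how I picture the seeding step. It's only a sketch: the `records` table, its columns, and the file paths are stand-ins, not my real schema.

```python
import gzip
import json
import sqlite3

DB_PATH = "data/app.db"          # stand-in paths
SEED_PATH = "data/seed.json.gz"  # the compressed export bundled with the app

def seed_database():
    """One-time seed: load the bundled gzipped JSON into SQLite."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
    )
    # Skip if already seeded, so container restarts don't re-import.
    if conn.execute("SELECT COUNT(*) FROM records").fetchone()[0] == 0:
        with gzip.open(SEED_PATH, "rt", encoding="utf-8") as f:
            records = json.load(f)
        conn.executemany(
            "INSERT INTO records (id, payload, updated_at) VALUES (?, ?, ?)",
            [(r["id"], json.dumps(r), r["updated_at"]) for r in records],
        )
        conn.commit()
    conn.close()
```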

This way, users get a working app with data that's at most about 24 hours old, and the first sync catches it up fairly quickly.
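
And here's roughly what I have in mind for the catch-up sync. The endpoint URLs and the shape of the index (`{id: last_updated}`) are placeholders for illustration, not the real API; it uses `requests` and SQLite's upsert syntax (needs SQLite 3.24+).

```python
import json
import sqlite3

import requests

# Placeholder endpoints; the real source's index/record API will differ.
INDEX_URL = "https://example.com/api/index"          # -> {"123": "2024-01-01T00:00:00Z", ...}
RECORD_URL = "https://example.com/api/records/{id}"  # -> one full record as JSON

def sync(db_path="data/app.db"):
    """Pull only the records that are missing locally or stale."""
    conn = sqlite3.connect(db_path)
    remote = requests.get(INDEX_URL, timeout=30).json()
    local = dict(conn.execute("SELECT id, updated_at FROM records"))
    # Keys in the remote index are strings; local ids are integers.
    stale = [rid for rid, ts in remote.items() if local.get(int(rid)) != ts]
    for rid in stale:
        rec = requests.get(RECORD_URL.format(id=rid), timeout=30).json()
        conn.execute(
            # Upsert: insert new records, overwrite outdated ones.
            "INSERT INTO records (id, payload, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload, "
            "updated_at = excluded.updated_at",
            (rec["id"], json.dumps(rec), rec["updated_at"]),
        )
    conn.commit()
    conn.close()
```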

I'm wondering if this method is sustainable:
- Is it bad practice to include large data files in a Git repo?
- Will this approach work if the project becomes popular?
- Is there a simpler way to handle this?

I considered Git LFS, but I'm worried about the potential costs, and cloud storage seems like overkill for just a side project.

Tech stack: Flask/FastAPI for the backend, JavaScript for the frontend, and SQLite for the database (about 1 million records). Targeting users who want to self-host. Any thoughts? Appreciate the help!

**Edit:** The JSON file is only used once to seed the SQLite database at startup, after which it doesn't get accessed. All operations during runtime go directly to SQLite. The JSON is merely a way to distribute the initial data.

2 Answers

Answered By DataMonkey85

It's great that you're getting into data-heavy apps! Just so you know, JSON doesn't scale well as a distribution format for large datasets. Can you clarify whether your data is more transactional or analytical? That can really change how you handle distribution. Also, keep in mind that committing a big dataset to a Git repository gets inconvenient, especially with frequent changes: Git keeps every version in history, so a nightly-refreshed compressed file steadily bloats every clone. Since your project is open-source, it might be worth hosting the dataset separately and including a simple script to download it during installation (rough sketch below). As for syncing, that can add unnecessary maintenance, so consider whether your dataset might be better suited to periodic complete updates rather than continuous syncing.
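
For instance, if you attach the compressed export to a GitHub release instead of committing it, the install step reduces to a tiny fetch script. The URL below is hypothetical; note that release assets on public repos don't count against a bandwidth quota the way Git LFS downloads do.

```python
import urllib.request
from pathlib import Path

# Hypothetical release asset URL; "latest/download" always points at the
# newest release's asset of that name.
DATA_URL = "https://github.com/you/your-app/releases/latest/download/seed.json.gz"
DEST = Path("data/seed.json.gz")

def fetch_seed():
    """Download the seed file once, before the normal startup seeding runs."""
    if not DEST.exists():
        DEST.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, DEST)
```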

Answered By CodeNinjaX

That's a solid setup you're building! Just to clarify, are the users running the app locally and storing the data on their own machines? With the volume you're updating nightly (say 8,000 to 10,000 rows), syncing could get tricky, especially if the repo is actively changing. It would be smart to either swap out the complete dataset daily or keep it simple with one-way syncing; the daily swap could look like the sketch below. And since the JSON file is only used for the initial load, it's really just a convenience tool.
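
If you go the daily full-swap route, the low-maintenance version (paths and URL are made up here) is to download the new database next to the old one and rename it into place, since the rename is atomic on the same filesystem:

```python
import os
import urllib.request

DB_PATH = "data/app.db"
TMP_PATH = "data/app.db.new"
# Made-up URL where a nightly pre-built SQLite snapshot would be published.
SNAPSHOT_URL = "https://example.com/snapshots/latest.db"

def swap_dataset():
    """Replace the whole database in one move instead of syncing rows."""
    urllib.request.urlretrieve(SNAPSHOT_URL, TMP_PATH)
    os.replace(TMP_PATH, DB_PATH)  # atomic rename on POSIX filesystems
```

One caveat: connections that were already open keep reading the old file until they reconnect, so the app should reopen its SQLite connection after a swap.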
