Hey folks! I'm diving into server fault tolerance and crash recovery, and I'm considering building a simple project to really get the hang of it. I have this idea for a file system where data is split and stored across an odd number of chunk servers, with a master node overseeing everything: checking for file corruption, monitoring server health, adding new servers, and replicating the file-system metadata.
At first, I thought I could just store every chunk on all the servers for easy access, but I learned that full replication isn't the best approach because it leads to high write latencies and takes up too much storage. After some research, I found out that distributing chunks across the servers could help with load management.
However, I'm still struggling to wrap my head around how this "distributed chunk" idea works. Can someone break it down for me? Thanks!
1 Answer
When you're working with a lot of nodes in a distributed system, putting all the data on all the nodes creates a bottleneck and wastes resources: every write has to reach every server, and total storage grows with the cluster size. Instead, break each file into fixed-size chunks and cap how many copies of each chunk you store, say 3 replicas spread across 5 nodes, so a write only touches a few servers and the read/write load is balanced across the cluster. A minimal sketch of that placement idea is below.
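To make that concrete, here's a rough Python sketch of chunking plus bounded replication. It isn't tied to any particular system; the server names, chunk size, and hash-based placement are just assumptions for illustration (a real system would also weigh free space and load):

```python
import hashlib

# Hypothetical cluster: 5 chunk servers, replication factor 3.
SERVERS = ["cs0", "cs1", "cs2", "cs3", "cs4"]
REPLICATION_FACTOR = 3
CHUNK_SIZE = 64 * 1024 * 1024  # e.g. 64 MiB per chunk


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunk(file_name: str, chunk_index: int,
                servers=SERVERS, replicas=REPLICATION_FACTOR):
    """Pick `replicas` distinct servers for one chunk.

    Placement here is a simple deterministic hash of (file, chunk index);
    the point is only that each chunk lands on a bounded subset of servers.
    """
    key = f"{file_name}:{chunk_index}".encode()
    start = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]


# Example with a tiny chunk size so the split is visible:
data = b"x" * 300
chunks = split_into_chunks(data, chunk_size=128)  # -> 3 chunks
for idx, chunk in enumerate(chunks):
    print(idx, len(chunk), place_chunk("movie.bin", idx))
```

Each chunk ends up on only 3 of the 5 servers, so storage and write traffic scale with the replication factor, not with the cluster size.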
As for detecting data corruption, it's better if each chunk server can verify its own data (for example by keeping a checksum next to each chunk and re-checking it on reads or in a periodic scrub) instead of funneling everything through the master node, which would become a bottleneck. You might also want to look into error-correcting codes like Hamming codes, so some corruption can be repaired rather than just detected. A rough illustration of the per-server verification is sketched below.
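As a rough illustration (the on-disk layout and function names are made up, and MD5 is just one checksum choice), a chunk server could store a sidecar checksum with each chunk and scrub itself in the background, reporting only the failures to the master:

```python
import hashlib
import os


def store_chunk(path: str, data: bytes) -> None:
    """Write a chunk and record its checksum alongside it."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".md5", "w") as f:
        f.write(hashlib.md5(data).hexdigest())


def verify_chunk(path: str) -> bool:
    """Recompute the checksum and compare it to the stored one.

    The chunk server runs this itself (on reads or during a periodic
    scrub), so the master never has to touch chunk data.
    """
    with open(path, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()
    with open(path + ".md5") as f:
        expected = f.read().strip()
    return actual == expected


def scrub(chunk_dir: str):
    """Periodic background check over all local chunks."""
    corrupted = []
    for name in os.listdir(chunk_dir):
        if name.endswith(".md5"):
            continue
        if not verify_chunk(os.path.join(chunk_dir, name)):
            corrupted.append(name)
    # Only these chunk IDs get reported to the master, which can then
    # re-replicate them from a healthy copy.
    return corrupted
```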
That makes sense! So, to clarify, each chunk's replicas should live on separate nodes, so losing one server doesn't lose the chunk? Also, I'm planning to use per-chunk MD5 checksums for verification and to keep the master node managing only metadata instead of checking data integrity; something like the sketch below is what I'm picturing. Is that a good approach? And what if the master node goes down? Should I run a standby node that can take over?
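Just a sketch of the metadata-only master I have in mind (all names and structures are made up, not from any real system); the master holds the file-to-chunk-to-server mapping and handles corruption reports, but never the chunk bytes themselves:

```python
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    chunk_id: str
    servers: list[str]   # chunk servers holding a replica
    checksum: str        # MD5 reported by the chunk servers


@dataclass
class MasterMetadata:
    files: dict[str, list[ChunkInfo]] = field(default_factory=dict)

    def locate(self, file_name: str, chunk_index: int) -> list[str]:
        """Clients ask the master *where* a chunk lives, then read the
        bytes directly from a chunk server."""
        return self.files[file_name][chunk_index].servers

    def handle_corruption_report(self, file_name: str, chunk_index: int,
                                 bad_server: str) -> None:
        """A chunk server reported a failed checksum: drop that replica
        location and (in a real system) schedule re-replication."""
        info = self.files[file_name][chunk_index]
        if bad_server in info.servers:
            info.servers.remove(bad_server)


# For failover, the idea would be that a standby master rebuilds this
# state by replaying the same metadata log or snapshot the primary writes.
```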