I'm a complete beginner trying to work with a 150GB labeled dataset on my MacBook for a fault detection project. My current workflow involves downloading the entire dataset locally, which makes my machine lag, run out of memory, and crash. I know processing this data locally isn't feasible, and I'm aware of cloud platforms like AWS and GCP, but I have no idea where to begin. Here are a few questions:
1. What's the first step to get my dataset onto the cloud? Should I start by uploading it to something like AWS S3?
2. Once the data is in the cloud, how do I actually run a Jupyter Notebook on it? Do I need to rent a virtual machine like an EC2 instance to connect to my data?
3. Is there a common workflow that most beginners use for projects like this?
4. How can I avoid racking up huge bills while using cloud services? What common mistakes should I be wary of?
5. What should be the very first thing I do today? Should I sign up for an AWS Free Tier account, or are there any beginner tutorials you recommend? Any advice, no matter how small, would be a huge help. Thanks!
1 Answer
First off, try to understand your data and figure out how to process it in smaller, manageable chunks. Simply moving everything to the cloud won't magically solve your problem; it may even cost you more in the long run. Adjust your processing strategy before you move anything to the cloud.
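
To make that concrete, here's a minimal sketch of chunked processing. It assumes, purely for illustration, that your labels live in a CSV file (a hypothetical `labels.csv` with a `label` column); pandas' `chunksize` option streams the file in pieces instead of loading all 150GB at once:

```python
import pandas as pd

CSV_PATH = "labels.csv"   # hypothetical file -- point this at your own data
CHUNK_ROWS = 100_000      # tune so a single chunk fits comfortably in RAM

label_counts = {}

# read_csv with chunksize returns an iterator of DataFrames,
# so peak memory is bounded by the chunk size, not the file size.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_ROWS):
    # Per-chunk work goes here; tallying label frequencies is just an example.
    for label, count in chunk["label"].value_counts().items():
        label_counts[label] = label_counts.get(label, 0) + count

print(label_counts)
```

The same pattern works with any streaming reader: whatever per-chunk computation you do, only one chunk is ever in memory at a time.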

I've taken a look at my dataset, but could you share some more strategies for chunking it? I really appreciate the insight!