I'm completely new to cloud computing and am struggling with a 150GB dataset on my Mac for a fault detection project. Every time I try to work with the dataset, my MacBook goes into panic mode, lagging and crashing due to memory issues. I've heard about AWS, GCP, and Azure, but the whole thing feels overwhelming. I need advice on getting started, specifically how to handle such a large dataset without crashing my laptop. I'm looking for guidance on the following:

1. How do I transfer my 150GB dataset to the cloud? Do I use something like AWS S3?
2. Once the data is in the cloud, how do I run code, like using Jupyter Notebooks? Do I need to rent a more powerful virtual machine?
3. What does a typical beginner's workflow look like for a project like this?
4. How can I avoid unexpected costs in the cloud?
5. What should be my first step right now?
5 Answers
150GB isn't monstrous, and with some code optimization, you might be able to handle it locally. But if you're out of storage, moving to the cloud would be a smart next step. Just be aware that if you go straight to the cloud without optimizing your process, it could cost you big time.
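To give a sense of what "code optimization" can mean here: a minimal sketch with pandas, loading only the columns you need and using smaller dtypes. The file and column names are made up for illustration, and on its own this won't make 150GB fit in RAM, but combined with chunking (see the other answers) it shrinks the footprint a lot.

```python
import pandas as pd

# Load only the columns you actually need, with smaller dtypes.
# Column and file names here are hypothetical; substitute your own.
cols = ["sensor_id", "timestamp", "vibration", "temperature", "fault_label"]
dtypes = {
    "sensor_id": "int32",      # default int64 uses twice the memory
    "vibration": "float32",
    "temperature": "float32",
    "fault_label": "int8",
}

df = pd.read_csv(
    "sensor_readings.csv",
    usecols=cols,
    dtype=dtypes,
    parse_dates=["timestamp"],
)
print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")
```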
There's no need to load everything into memory at once. Break the work down: read, process, and analyze your 150GB dataset piece by piece. This is standard practice for large datasets in machine learning: operate on manageable chunks rather than the entire dataset at once.
That makes sense! My first task is actually data analysis—should I focus on chunking before everything else?
Before jumping into the cloud, think about chunking your data. Understanding how to process it in smaller parts can help manage memory, even locally. Just transferring it to the cloud won't solve the issue if your approach remains the same. Make sure to define what processing you're doing in chunks.
Got it! Can you give a specific example of how to process in chunks? I’d really appreciate some direction here!
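Sure. Here's a minimal sketch of chunked processing with pandas, assuming your data is in CSV form; the file path and column names are placeholders you'd replace with your own.

```python
import pandas as pd

csv_path = "sensor_readings.csv"   # placeholder path to your dataset
chunk_size = 1_000_000             # rows per chunk; tune to your RAM

running_counts = None

# With chunksize, read_csv returns an iterator, so only one chunk
# is in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
    # Example per-chunk step: filter fault rows and accumulate a summary.
    faults = chunk[chunk["fault_label"] == 1]        # hypothetical column
    counts = faults.groupby("sensor_id").size()      # hypothetical column
    if running_counts is None:
        running_counts = counts
    else:
        running_counts = running_counts.add(counts, fill_value=0)

print(running_counts.sort_values(ascending=False).head(10))
```

The same pattern works for feature extraction or any per-row transformation: process each chunk, write the result back out (e.g. to Parquet), and never hold the full dataset in memory.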
Your cloud experience will depend on your actual needs. AWS can be daunting, but the pieces are simple: S3 is object storage, and EC2 is the virtual machine where you run your compute. If you're only working with your own data, it can be simpler to skip S3 and keep the dataset on the EC2 instance's attached storage (an EBS volume), so everything lives in one place.
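If you do go the S3 route, the upload itself is a short boto3 script (or one `aws s3 cp` command). The bucket and file names below are placeholders:

```python
import boto3

# Placeholder names; create the bucket first, e.g. with
# `aws s3 mb s3://my-fault-detection-data` from the AWS CLI.
bucket = "my-fault-detection-data"
local_file = "sensor_readings.csv"

s3 = boto3.client("s3")

# upload_file switches to multipart uploads automatically,
# which matters for a file this large.
s3.upload_file(local_file, bucket, "raw/" + local_file)
print(f"Uploaded {local_file} to s3://{bucket}/raw/{local_file}")
```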
If you want to keep it simple, consider hosting your dataset on Kaggle. They offer free Jupyter Notebooks with decent hardware and good resources, but check their dataset size limits first, since 150GB may exceed them. If you'd rather stick with the cloud, start on AWS's free tier to learn the VM setup without surprise bills, and only scale the instance up once you actually need to.

I think you're right: storage is my main issue. Sounds like the cloud might be necessary soon.