How to Set Up a Data Lake for Analytics and ML Projects?

Asked By CuriousCat92 On

Hi all! I'm a junior ML engineer with about two years of experience but very limited exposure to AWS, so please bear with me. I've been tasked with proposing a data lake that will simplify our data access for analytics and machine learning projects, allowing us to work independently from the main production system.

Currently, our data is confined within a centralized setup managed by the IT team, combining AWS and some on-prem solutions. When we need data, our options are either to export it manually through the product's UI or use an existing API, which can be quite slow for experimentation.

The goal is to create an independent copy of our production data and continuously ingest data from the same sources as the main system, including AWS databases, logs, and some external institutions. This would make it easier for analytics and ML tasks to access the required information without waiting for manual exports or setting up new endpoints.

We're focused on fleet management, so our data is mostly structured (like GPS data, equipment status, event logs). I'm considering a SQL-based approach but I'm unsure about its long-term scalability, costs, and maintenance.

I have several questions:

1. What's the best approach for ingesting data from mostly AWS databases and logs, along with some on-prem data? Should we stick with AWS for long-run cost efficiency?
2. Should we clone the AWS databases once at the start, or ingest changes incrementally?
3. Is it feasible to keep the copy synchronized with the production databases in near real time?

I'd appreciate any advice on the architecture, useful tools, and resources to focus on at the beginning!

4 Answers

Answered By DataWiz_23 On

Two key aspects to focus on are data organization and ingestion methods. It's essential to ensure your proposal addresses these components clearly, and that you involve business stakeholders early on to get their buy-in.

Understanding how to organize your data and how it will flow into the system will be crucial for the success of the data lake.

CuriousCat92 -

Absolutely! I’ll make sure to delve deeper into those areas and get feedback from the stakeholders before finalizing my proposal.

Answered By DataNinja_101 On

Here's a great resource to kick off your understanding: AWS has a white paper on building data lakes that might be really useful for you. Your approach ultimately depends on how quickly you need the data and how large it is.

If a daily update is acceptable, consider scheduled exports to S3 in a columnar format like Parquet (or CSV for simplicity). This initial layer is your raw tier and doesn't require clean-up. You can keep 7–30 days of snapshots for reference.
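As a rough sketch of how that raw tier could be laid out (the bucket name, key scheme, and retention window here are all made up, just to illustrate Hive-style `dt=` partition folders and a retention check):

```python
from datetime import date, timedelta

# Hypothetical bucket name; adjust to your own naming convention.
RAW_BUCKET = "fleet-data-lake-raw"

def raw_key(source: str, table: str, snapshot_date: date) -> str:
    """Build a Hive-style partitioned S3 key for one daily export."""
    return (f"s3://{RAW_BUCKET}/{source}/{table}/"
            f"dt={snapshot_date.isoformat()}/part-000.parquet")

def is_expired(snapshot_date: date, today: date, retention_days: int = 30) -> bool:
    """Raw-tier snapshots older than the retention window can be deleted
    (in practice an S3 lifecycle rule can do this for you)."""
    return today - snapshot_date > timedelta(days=retention_days)

print(raw_key("rds", "gps_events", date(2024, 3, 1)))
# s3://fleet-data-lake-raw/rds/gps_events/dt=2024-03-01/part-000.parquet
```

The `dt=YYYY-MM-DD` folder convention matters later: Athena and Glue can map those folders to partitions so queries only scan the days they need.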

The next step involves type conversions and cleaning up the data, aiming for user accessibility, while the final tier consists of pre-joined datasets for common queries.
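That middle tier mostly amounts to small, testable transforms: casting strings to proper types and rejecting bad records. A toy example for GPS records (field names are invented, your schema will differ):

```python
from datetime import datetime, timezone
from typing import Optional

def clean_gps_record(raw: dict) -> Optional[dict]:
    """Cleaned-tier step: cast raw string fields to proper types and
    drop records that fail basic validation."""
    try:
        lat = float(raw["lat"])
        lon = float(raw["lon"])
    except (KeyError, ValueError):
        return None  # missing or unparseable coordinates -> reject
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None  # out-of-range coordinates -> reject
    return {
        "vehicle_id": raw["vehicle_id"].strip(),
        "lat": lat,
        "lon": lon,
        "ts": datetime.fromtimestamp(int(raw["epoch"]), tz=timezone.utc),
    }
```

In a real pipeline this kind of logic would live in a Glue (PySpark) job rather than plain Python, but the idea is the same.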

For tools: think about using QuickSight or Power BI for analytics, Athena as the query front-end, S3 for storage, and Glue for data jobs. Just remember that partitioning your S3 data lets Athena prune scans, which keeps query time and cost manageable on large datasets.
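To make the partitioning point concrete, here's a minimal sketch of what a partitioned Athena table definition looks like, generated from Python. Table, column, and location names are placeholders; in practice you'd often let a Glue crawler create the table for you:

```python
def athena_ddl(table: str, columns: dict, location: str,
               partition_col: str = "dt") -> str:
    """Generate CREATE EXTERNAL TABLE DDL partitioned by an ingest-date
    column, so Athena only scans the partitions a query filters on."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} string)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )

ddl = athena_ddl(
    "gps_events",
    {"vehicle_id": "string", "lat": "double", "lon": "double"},
    "s3://fleet-data-lake-raw/rds/gps_events/",
)
print(ddl)
```

A query like `WHERE dt = '2024-03-01'` against this table then reads only that day's folder instead of the whole bucket.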

CuriousCat92 -

Thanks for the tips! A daily update does seem manageable for now, and your three-tier structure sounds like a solid plan. I’ll check those services out and see how they can fit into my design.

TechieSquirrel -

Also, don’t forget that S3 storage is usually far cheaper per GB for AI/ML workloads than keeping the same data in a traditional database. It’s designed to handle massive datasets well!

Answered By RiskyBiz_67 On

I’d tread carefully. It's important to have discussions with the people managing the data storage and access. Consider the potential pitfalls—like data stagnation or duplication, which can lead to issues if syncing isn’t handled correctly.

You should also think about storage costs and legal implications, especially in light of regulations like GDPR. Data can be an asset, but it can also become a liability if not managed correctly. Make sure you fully discuss your needs with the IT team to see how they can support your access requirements instead of trying to circumvent established processes.

Answered By AnalyticalMoose On

For your project, draw a clear distinction between a data lake and a data warehouse. A data lake stores raw data in open formats on cheap storage, while a warehouse holds cleaned, modeled data for specific analytics. Keeping the two concepts distinct will help guide your architecture.

Ingesting real-time changes (change data capture) adds a lot of operational complexity, so starting with periodic full dumps and moving to incremental loads later usually makes more sense.
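To make the full-dump vs. incremental trade-off concrete, here's a minimal watermark-based incremental load sketch. It assumes every source row carries a reliable `updated_at` column (an invented name here), which is exactly the kind of guarantee CDC-style approaches depend on:

```python
def incremental_batch(rows, last_watermark):
    """Incremental ingestion sketch: keep only rows updated after the
    last watermark, and return the new watermark to persist for the
    next run. Assumes 'updated_at' is set monotonically at the source."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_wm = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_wm
```

With a full-dump approach you skip all of this bookkeeping at the cost of re-copying everything each run, which is why dumps are the simpler starting point.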
