How to Propose a Data Lake for Analytics in My Company?

Asked By CuriousCoderX12 On

Hey folks! I'm a junior ML engineer with about two years of experience, but I'm still figuring out the ins and outs of AWS. I've been tasked with coming up with a proposal for a "data lake" to make our data more accessible for analytics and future machine learning projects, without relying on our main production system. Currently, we have most of our data locked away in a centralized setup managed by the IT team, mixing AWS and on-prem systems. Accessing this data is a hassle, as we often have to manually export through the product UI or use an existing API. This slows down experimentation and makes it challenging to create reusable datasets for various projects.

My idea is to establish an independent copy of our production data while continuously ingesting data from the same sources our main software uses (AWS databases, logs, plus some on-prem and external sources). The goal is to have the same data accessible for analytics and ML purposes without the ongoing need for manual exports or requests for new endpoints.

We're focused on fleet management, so our data is quite structured: equipment details like GPS positions, and event data such as job information with timestamps and locations. My initial thought is that a SQL-based approach might be feasible, but I'm concerned about long-term scalability, costs, and maintenance.

I would love to hear your thoughts on what a solid long-term design might look like. Also, I'd appreciate any insights on the following:
* What's the most efficient and scalable way to set this up with data coming mainly from AWS databases and logs, along with additional sources?
* Should we clone AWS databases or incrementally ingest data from the get-go?
* Is it practical to synchronize the production databases with the replicas? What's the feasibility here?

Any guidance on architecture patterns, tools, or initial focus areas would be super helpful!

3 Answers

Answered By PragmaticPete On

I suggest being cautious with the data lake approach until you've spoken with the folks who own the data and the teams responsible for finance and legal. You want to avoid pitfalls like stale data, increasing storage costs, and legal risks. Having a centralized system makes data management easier. You might want to consider advocating for the need to access data in bulk while collaborating closely with your IT team, so all legal, security, and access considerations are taken care of. Data is a huge asset, but it can also be a liability if handled improperly.

Answered By DataWise101 On

You might want to check out AWS's whitepaper on building data lakes; it's really insightful. Most of your questions come down to how fresh you need the data to be. If having it a day behind is okay, you can set up export jobs that dump raw data into S3 daily. That gives you a raw data tier where you keep things as they are. For the next layers, it's wise to do some cleanup and type conversions. Ultimately, your AI/ML initiatives will appreciate data stored in S3 or FSx for Lustre, and S3 storage is usually more cost-effective than keeping everything in a traditional RDBMS. If you choose Athena for querying, make sure you have good partitioning: Athena scales nicely with larger datasets, but it bills by the amount of data scanned, so partitions keep queries fast and cheap.
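To make the raw tier and the cleanup step a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the `raw/` prefix, the `year=/month=/day=` partition layout, the source and table names, and the GPS field names (`equipment_id`, `recorded_at`, `lat`, `lon`) are all hypothetical and not taken from your system. It also assumes exported timestamps are ISO 8601 strings that carry a UTC offset.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone


def raw_export_key(source: str, table: str, export_date: date, filename: str) -> str:
    """Build an S3 object key for one day's raw export.

    The year=/month=/day= layout is the Hive-style partitioning that Athena
    can use to prune partitions. Prefix and naming are assumptions.
    """
    return (
        f"raw/{source}/{table}/"
        f"year={export_date:%Y}/month={export_date:%m}/day={export_date:%d}/"
        f"{filename}"
    )


@dataclass(frozen=True)
class GpsPosition:
    """Typed record for the cleaned layer (hypothetical schema)."""
    equipment_id: str
    recorded_at: datetime
    latitude: float
    longitude: float


def clean_gps_record(raw: dict) -> GpsPosition:
    """Convert one raw, string-valued record into typed fields.

    Assumes recorded_at includes a UTC offset; the value is normalized to UTC.
    """
    return GpsPosition(
        equipment_id=raw["equipment_id"].strip(),
        recorded_at=datetime.fromisoformat(raw["recorded_at"]).astimezone(timezone.utc),
        latitude=float(raw["lat"]),
        longitude=float(raw["lon"]),
    )


if __name__ == "__main__":
    print(raw_export_key("fleet_db", "gps_positions", date(2024, 3, 5), "export-0001"))
    print(clean_gps_record({
        "equipment_id": " TRUCK-17 ",
        "recorded_at": "2024-03-05T09:30:00+01:00",
        "lat": "52.52",
        "lon": "13.405",
    }))
```

Keeping the raw objects untouched and doing the conversion in a separate step means you can re-run the cleanup later if the schema or the rules change, without going back to the production system.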

CuriousCoderX12 -

Thanks for the tips! The three-layer structure makes sense for organizing the data, and I appreciate the recommended services for each tier.

TechieTina -

Great breakdown of the approach! I'm curious, what format do you recommend for storing that raw data in S3?

Answered By AnalyticalAlice On

Two key considerations when designing your data lake are ontology and ingestion methods. Make sure to discuss these elements with relevant business stakeholders before finalizing your proposal. It's crucial they understand the plan to ensure a successful implementation. We can assist if you need more help navigating this!

CuriousCoderX12 -

Absolutely, I'll be digging deeper into those aspects before moving forward. Thanks for the reminder!
