Unlock Safety & Savings: Mastering a Secure, Cost-Effective Cloud Data Lake

Editor’s note: Ori Nakar and Johnathan Azaria are speakers for ODSC East this April 23-25. Be sure to check out their talk, “Unlock Safety & Savings: Mastering a Secure, Cost-Effective Cloud Data Lake,” there!

Have you ever experienced a surge in your cloud data lake expenses? Did the surge come from malicious activity or a legitimate operation? Data lakes have become a cornerstone of the digital age, prized for their flexibility and cost-effectiveness. Yet, as they expand, they bring challenges in security, access control, cost management, and monitoring. The stakes are high: unauthorized access can lead to data breaches, while even legitimate users can inadvertently drive up costs.


With the growth in usage comes far more complexity. The size of the data, together with the number of objects, is growing rapidly. A growing number of users, both human and application, perform constant operations on the data lake. The sheer number of operations makes access and cost control a hard, ongoing task. Monitoring is also complex, since there are many access paths, all of which should be monitored.

Attackers can also take advantage of the data lake’s many access options. They can use the advanced functionality of object stores and query engines for reconnaissance and to effectively traverse, locate, and track sensitive data.

Figure: Data lake access

Traditional monitoring methods often fall short. Tracking object store access can be overwhelming, with a single query generating thousands of log records. Monitoring at the query engine level demands a unique solution for each engine, adding complexity.
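To illustrate why raw object-store logs are overwhelming, here is a minimal sketch (with made-up log records and field names; real S3/GCS access logs are richer) of collapsing thousands of per-object records into one summary per principal, a far more tractable unit for monitoring:

```python
from collections import defaultdict

# Hypothetical, simplified access-log records: object stores typically emit
# one record per object read, so a single query can produce thousands.
records = [
    {"principal": "analytics-app", "key": f"events/part-{i}.parquet"}
    for i in range(3000)
] + [
    {"principal": "jane", "key": "users/part-0.parquet"}
]

# Collapse per-object records into one summary per principal.
summary = defaultdict(lambda: {"requests": 0, "prefixes": set()})
for r in records:
    s = summary[r["principal"]]
    s["requests"] += 1
    s["prefixes"].add(r["key"].split("/")[0])

for principal, s in summary.items():
    print(principal, s["requests"], sorted(s["prefixes"]))
```

Aggregates like these (request counts and prefixes touched per principal) are what anomaly detection can realistically work on, rather than the raw firehose of log lines.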

We suggest a two-tiered approach to deal with these issues.


The first tier is to adopt best practices, such as:

  • Using roles instead of keys
  • Using unique credentials and not sharing them between users and services
  • Using tailored, instead of wide access permissions
  • Applying lifecycle management, query size limitations, alerts and other general rules
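As a sketch of "tailored instead of wide" permissions, here is an illustrative AWS-style policy document (the bucket name and the toy evaluator are hypothetical, not a real IAM implementation) that grants read-only access to a single prefix rather than the whole bucket:

```python
# Illustrative AWS-style policy: read-only access to one prefix only.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::example-lake/marketing/*"],  # hypothetical bucket
    }],
}

def is_allowed(policy, action, resource):
    """Very simplified evaluator: allow only if some Allow statement matches."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        if action in stmt["Action"] and any(
            resource.startswith(r[:-1]) if r.endswith("*") else resource == r
            for r in stmt["Resource"]
        ):
            return True
    return False

print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-lake/marketing/report.csv"))
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-lake/finance/salaries.csv"))
print(is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::example-lake/marketing/report.csv"))
```

The design point: a marketing analyst's credentials can read `marketing/` but cannot read finance data or delete anything, so a leaked credential has a bounded blast radius.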

The second tier is monitoring your data for anomalies. By logging the queries performed on your data lake, you can detect and stop numerous cases of abuse and misuse. Let us explain how.

The data lake is often accessed via query by two major user types:

  • Humans – employees will often query the data to get information or during the process of development. 
  • Applications – a deployed application will access the data as part of its normal function.

The major difference between the two is the usage pattern. Human queries are sporadic in nature, but they are normally limited to working hours and to each person’s area of work. Humans who work in marketing don’t normally wake up at 3AM to start a new project on production tables.

Applications either work in a periodic schedule, such as ETLs, or work on demand per user request, but they are normally limited to a predefined number of tables and often have a clear usage baseline. We don’t expect applications to change their queries, access new tables, or suddenly switch from a periodic schedule to an irregular one.
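The two usage patterns above can be turned into simple anomaly checks. Below is a minimal sketch (the baselines, principal names, and working-hours window are all assumptions for illustration, not a production detector) that flags an application touching a table outside its baseline and a human querying outside working hours:

```python
from datetime import datetime

# Hypothetical query-log baseline: which tables each principal has
# historically accessed. Built offline from past query logs.
baseline_tables = {
    "etl-app": {"events", "sessions"},
    "alice": {"marketing_campaigns", "leads"},
}
HUMAN_HOURS = range(8, 19)  # assumed 08:00-18:59 working window

def flag_query(principal, tables, ts, is_human):
    """Return the reasons a query deviates from the principal's baseline."""
    reasons = []
    new = tables - baseline_tables.get(principal, set())
    if new:
        reasons.append(f"new tables: {sorted(new)}")
    if is_human and ts.hour not in HUMAN_HOURS:
        reasons.append(f"outside working hours: {ts:%H:%M}")
    return reasons

# An application suddenly reading an unfamiliar table, and a human at 3AM:
print(flag_query("etl-app", {"events", "users_pii"}, datetime(2024, 4, 2, 14, 0), False))
print(flag_query("alice", {"leads"}, datetime(2024, 4, 2, 3, 0), True))
```

A real system would add more signals (query volume, scan size, schedule regularity), but the principle is the same: compare each query against the principal's own historical baseline.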


You should manage and protect your data lake carefully:

  • Adopt best practices for permissions management
  • Monitor access and use the monitoring data to create actionable insights

This will help you prevent data leakage and control your costs by detecting data abuse and misuse.

About the Authors:

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure. In the Threat Research group, Ori is responsible for the data infrastructure and is involved in analytics projects, machine learning, and innovation projects.

Johnathan Azaria is a tech lead in data science @ Imperva, specializing in AI-driven security algorithms and digital protection.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
