DiscoverData Engineering Podcast
Data Engineering Podcast

Data Engineering Podcast

Author: Tobias Macey

Subscribed: 3,463Played: 134,734
Share

Description

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
427 Episodes
Reverse
Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Commentst" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI powered agents for interacting with your data Interview Introduction How did you get involved in data? In AI? Can you describe what Zenlytic is and the role that AI is playing in your platform? What have been the key stages in your AI journey? What are some of the dead ends that you ran into along the path to where you are today? What are some of the persistent challenges that you are facing? So tell us more about data agents. Firstly, what are data agents and why do you think they're important? How are data agents different from chatbots? Are data agents harder to build? How do you make them work in production? What other technical architectures have you had to develop to support the use of AI in Zenlytic? How have you approached the work of customer education as you introduce this functionality? What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do? How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses? What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence? When is an AI agent the wrong choice? What do you have planned for the future of AI in the Zenlytic product? Contact Info Ryan LinkedIn (https://www.linkedin.com/in/janssenryan) Paul LinkedIn (https://www.linkedin.com/in/paulblankley/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Zenlytic (https://www.zenlytic.com/) Podcast Episode (https://www.dataengineeringpodcast.com/zenlytic-self-serve-business-intelligence-episode-371) Attention is all you need (https://arxiv.org/abs/1706.03762) Transformers (https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) BERT (https://en.wikipedia.org/wiki/BERT_(language_model)) The Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) Richard Sutton PID Loops (https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller) AutoGPT (https://github.com/Significant-Gravitas/AutoGPT) Devin.ai (https://www.cognition.ai/introducing-devin) Google Gemini (https://gemini.google.com/) Anthropic Claude (https://www.anthropic.com/claude) OpenAI Code Interpreter (https://platform.openai.com/docs/assistants/tools/code-interpreter) Edward Tufte (https://www.edwardtufte.com/tufte/books_vdqi) Looker ActionHub (https://developers.looker.com/actions/overview/) OAuth (https://oauth.net/2/) GitHub Copilot (https://github.com/features/copilot) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Building a data platform is a substrantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Commentst" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform Interview Introduction As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production. Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it’s not feasible to maintain a complete duplicate of all of the production data. Compounding that challenge is the fact that each of the components of our data platform interact with data in slightly different ways and need different processes for ensuring that changes are being promoted safely. Contact Info LinkedIn () Website (https://www.dataengineeringpodcast.com) Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Data Platforms and Leaky Abstractions Episode (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374) Building A Data Platform From Scratch (https://www.dataengineeringpodcast.com/designing-a-lakehouse-from-scratch-episode-354) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Trino (https://trino.io/) dbt (https://www.getdbt.com/) Starburst Galaxy (https://www.starburst.io/platform/starburst-galaxy/) Superset (https://superset.apache.org/) Dagster (https://dagster.io/) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) Nessie (https://projectnessie.org/) Podcast Episode (https://www.dataengineeringpodcast.com/nessie-data-lakehouse-data-versioning-episode-416) Iceberg (https://iceberg.apache.org/) Snowflake (https://www.snowflake.com/en/) LocalStack (https://www.localstack.cloud/) DSL == Domain Specific Language (https://en.wikipedia.org/wiki/Domain-specific_language) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Artificial intelligence has dominated the headlines for several months due to the successes of large language models. This has prompted numerous debates about the possibility of, and timeline for, artificial general intelligence (AGI). Peter Voss has dedicated decades of his life to the pursuit of truly intelligent software through the approach of cognitive AI. In this episode he explains his approach to building AI in a more human-like fashion and the emphasis on learning rather than statistical prediction. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Peter Voss about what is involved in making your AI applications more "human" Interview Introduction How did you get involved in machine learning? Can you start by unpacking the idea of "human-like" AI? How does that contrast with the conception of "AGI"? The applications and limitations of GPT/LLM models have been dominating the popular conversation around AI. How do you see that impacting the overrall ecosystem of ML/AI applications and investment? The fundamental/foundational challenge of every AI use case is sourcing appropriate data. What are the strategies that you have found useful to acquire, evaluate, and prepare data at an appropriate scale to build high quality models? What are the opportunities and limitations of causal modeling techniques for generalized AI models? As AI systems gain more sophistication there is a challenge with establishing and maintaining trust. What are the risks involved in deploying more human-level AI systems and monitoring their reliability? What are the practical/architectural methods necessary to build more cognitive AI systems? How would you characterize the ecosystem of tools/frameworks available for creating, evolving, and maintaining these applications? What are the most interesting, innovative, or unexpected ways that you have seen cognitive AI applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on desiging/developing cognitive AI systems? When is cognitive AI the wrong choice? What do you have planned for the future of cognitive AI applications at Aigo? Contact Info LinkedIn (https://www.linkedin.com/in/vosspeter/) Website (http://optimal.org/voss.html) Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Aigo.ai (https://aigo.ai/) Artificial General Intelligence (https://aigo.ai/what-is-real-agi/) Cognitive AI (https://aigo.ai/cognitive-ai/) Knowledge Graph (https://en.wikipedia.org/wiki/Knowledge_graph) Causal Modeling (https://en.wikipedia.org/wiki/Causal_model) Bayesian Statistics (https://en.wikipedia.org/wiki/Bayesian_statistics) Thinking Fast & Slow (https://amzn.to/3UJKsmK) by Daniel Kahneman (affiliate link) Agent-Based Modeling (https://en.wikipedia.org/wiki/Agent-based_model) Reinforcement Learning (https://en.wikipedia.org/wiki/Reinforcement_learning) DARPA 3 Waves of AI (https://www.darpa.mil/about-us/darpa-perspective-on-ai) presentation Why Don't We Have AGI Yet? (https://arxiv.org/abs/2308.03598) whitepaper Concepts Is All You Need (https://arxiv.org/abs/2309.01622) Whitepaper Hellen Keller (https://en.wikipedia.org/wiki/Helen_Keller) Stephen Hawking (https://en.wikipedia.org/wiki/Stephen_Hawking) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
Summary Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Tsavo Knott about Pieces, a personal AI toolkit to improve the efficiency of developers Interview Introduction How did you get involved in machine learning? Can you describe what Pieces is and the story behind it? The past few months have seen an endless series of personalized AI tools launched. What are the features and focus of Pieces that might encourage someone to use it over the alternatives? model selections architecture of Pieces application local vs. hybrid vs. online models model update/delivery process data preparation/serving for models in context of Pieces app application of AI to developer workflows types of workflows that people are building with pieces What are the most interesting, innovative, or unexpected ways that you have seen Pieces used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pieces? When is Pieces the wrong choice? What do you have planned for the future of Pieces? Contact Info LinkedIn (https://www.linkedin.com/in/tsavoknott/) Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Pieces (https://pieces.app/) NPU == Neural Processing Unit (https://en.wikipedia.org/wiki/AI_accelerator) Tensor Chip (https://en.wikipedia.org/wiki/Google_Tensor) LoRA == Low Rank Adaptation (https://github.com/microsoft/LoRA) Generative Adversarial Networks (https://en.wikipedia.org/wiki/Generative_adversarial_network) Mistral (https://mistral.ai/) Emacs (https://www.gnu.org/software/emacs/) Vim (https://www.vim.org/) NeoVim (https://neovim.io/) Dart (https://dart.dev/) Flutter (https://flutter.dev/) Typescript (https://www.typescriptlang.org/) Lua (https://www.lua.org/) Retrieval Augmented Generation (https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/) ONNX (https://onnx.ai/) LSTM == Long Short-Term Memory (https://en.wikipedia.org/wiki/Long_short-term_memory) LLama 2 (https://llama.meta.com/llama2/) GitHub Copilot (https://github.com/features/copilot) Tabnine (https://www.tabnine.com/) Podcast Episode (https://www.themachinelearningpodcast.com/tabnine-generative-ai-developer-assistant-episode-24) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
Summary Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Andrew Lee about his work on Shortwave, an AI powered email client Interview Introduction How did you get involved in the area of data management? Can you describe what Shortwave is and the story behind it? What is the core problem that you are addressing with Shortwave? Email has been a central part of communication and business productivity for decades now. What are the overall themes that continue to be problematic? What are the strengths that email maintains as a protocol and ecosystem? From a product perspective, what are the data challenges that are posed by email? Can you describe how you have architected the Shortwave platform? How have the design and goals of the product changed since you started it? What are the ways that the advent and evolution of language models have influenced your product roadmap? How do you manage the personalization of the AI functionality in your system for each user/team? For users and teams who are using Shortwave, how does it change their workflow and communication patterns? Can you describe how I would use Shortwave for managing the workflow of evaluating, planning, and promoting my podcast episodes? What are the most interesting, innovative, or unexpected ways that you have seen Shortwave used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Shortwave? When is Shortwave the wrong choice? What do you have planned for the future of Shortwave? Contact Info LinkedIn (https://www.linkedin.com/in/startupandrew/) Blog (https://startupandrew.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Shortwave (https://www.shortwave.com/) Firebase (https://firebase.google.com/) Google Inbox (https://en.wikipedia.org/wiki/Inbox_by_Gmail) Hey (https://www.hey.com/) Ezra Klein Hey Article (https://www.nytimes.com/2024/04/07/opinion/gmail-email-digital-shame.html) Superhuman (https://superhuman.com/) Pinecone (https://www.pinecone.io/) Podcast Episode (https://www.dataengineeringpodcast.com/pinecone-vector-database-similarity-search-episode-189/) Elastic (https://www.elastic.co/) Hybrid Search (https://weaviate.io/blog/hybrid-search-explained) Semantic Search (https://en.wikipedia.org/wiki/Semantic_search) Mistral (https://mistral.ai/) GPT 3.5 (https://platform.openai.com/docs/models/gpt-3-5-turbo) IMAP (https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine Interview Introduction How did you get involved in the area of data management? Can you describe what constitutes a NoSQL database? How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago? What are the factors that convince teams to use a NoSQL vs. SQL database? NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus? How have the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines? When designing and building a database, what are the initial set of questions that need to be answered? How many "core capabilities" can you reasonably design around before they conflict with each other? How have you approached the evolution of RavenDB as you add new capabilities and mature the project? What are some of the early decisions that had to be unwound to enable new capabilities? If you were to start from scratch today, what database would you build? What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB? When is a NoSQL database/RavenDB the wrong choice? What do you have planned for the future of RavenDB? Contact Info Blog (https://ayende.com/blog/) LinkedIn (https://www.linkedin.com/in/ravendb/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links RavenDB (https://ravendb.net/) RSS (https://en.wikipedia.org/wiki/RSS) Object Relational Mapper (ORM) (https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping) Relational Database (https://en.wikipedia.org/wiki/Relational_database) NoSQL (https://en.wikipedia.org/wiki/NoSQL) CouchDB (https://couchdb.apache.org/) Navigational Database (https://en.wikipedia.org/wiki/Navigational_database) MongoDB (https://www.mongodb.com/) Redis (https://redis.io/) Neo4J (https://neo4j.com/) Cassandra (https://cassandra.apache.org/_/index.html) Column-Family (https://en.wikipedia.org/wiki/Column_family) SQLite (https://www.sqlite.org/) LevelDB (https://github.com/google/leveldb) Firebird DB (https://firebirdsql.org/) fsync (https://man7.org/linux/man-pages/man2/fsync.2.html) Esent DB? (https://learn.microsoft.com/en-us/windows/win32/extensible-storage-engine/extensible-storage-engine-managed-reference) KNN == K-Nearest Neighbors (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) RocksDB (https://rocksdb.org/) C# Language (https://en.wikipedia.org/wiki/C_Sharp_(programming_language)) ASP.NET (https://en.wikipedia.org/wiki/ASP.NET) QUIC (https://en.wikipedia.org/wiki/QUIC) Dynamo Paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) Database Internals (https://amzn.to/49A5wjF) book (affiliate link) Designing Data Intensive Applications (https://amzn.to/3JgCZFh) book (affiliate link) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform Interview Introduction How did you get involved in the area of data management? Can you start by outlining the technical elements of what it means to have a "semantic layer"? In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts? What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.) At what point does it become necessary/beneficial for a team to adopt such a service? What are the challenges involved in retrofitting a semantic layer into a production data system? evolution of requirements/usage patterns technical complexities/performance and cost optimization What are the most interesting, innovative, or unexpected ways that you have seen Cube used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube? When is Cube/a semantic layer the wrong choice? What do you have planned for the future of Cube? Contact Info LinkedIn (https://www.linkedin.com/in/keydunov/) keydunov (https://github.com/keydunov) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Cube (https://cube.dev/) Semantic Layer (https://en.wikipedia.org/wiki/Semantic_layer) Business Objects (https://en.wikipedia.org/wiki/BusinessObjects) Tableau (https://www.tableau.com/) Looker (https://cloud.google.com/looker/?hl=en) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) Mode (https://mode.com/) Thoughtspot (https://www.thoughtspot.com/) LightDash (https://www.lightdash.com/) Podcast Episode (https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232/) Embedded Analytics (https://en.wikipedia.org/wiki/Embedded_analytics) Dimensional Modeling (https://en.wikipedia.org/wiki/Dimensional_modeling) Clickhouse (https://clickhouse.com/) Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/) Druid (https://druid.apache.org/) BigQuery (https://cloud.google.com/bigquery?hl=en) Starburst (https://www.starburst.io/) Pinot (https://pinot.apache.org/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Arrow Datafusion (https://arrow.apache.org/datafusion/) Metabase (https://www.metabase.com/) Podcast Episode (https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29) Superset (https://superset.apache.org/) Alation (https://www.alation.com/) Collibra (https://www.collibra.com/) Podcast Episode (https://www.dataengineeringpodcast.com/collibra-enterprise-data-governance-episode-188) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help Interview Introduction How did you get involved in the area of data management? Can you start by outlining what elements of observability are most relevant for dbt projects? What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights? What are the challenges/shortcomings associated with those approaches? Over the past ~3 years there were numerous data observability systems/products created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools? What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle? Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects? How is Elementary designed/implemented? How have the scope and goals of the project changed since you started working on it? What are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary? Can you talk us through the setup and workflow for teams adopting Elementary in their dbt projects? How does the incorporation of Elementary change the development habits of the teams who are using it? What are the most interesting, innovative, or unexpected ways that you have seen Elementary used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary? When is Elementary the wrong choice? What do you have planned for the future of Elementary? Contact Info LinkedIn (https://www.linkedin.com/in/maayansa/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Elementary (https://www.elementary-data.com/) Data Observability (https://www.montecarlodata.com/blog-what-is-data-observability/) dbt (https://www.getdbt.com/) Datadog (https://www.datadoghq.com/) pre-commit (https://pre-commit.com/) dbt packages (https://docs.getdbt.com/docs/build/packages) SQLMesh (https://sqlmesh.readthedocs.io/en/latest/) Malloy (https://www.malloydata.dev/) SDF (https://www.sdf.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and orchestrate across language platforms Interview Introduction How did you get involved in the area of data management? Can you describe what the focus of Dagster+ is and the story behind it? What problems are you trying to solve with Dagster+? What are the notable enhancements beyond the Dagster Core project that this updated platform provides? How is it different from the current Dagster Cloud product? In the launch announcement you tease new capabilities that would be great to explore in turns: Make data a team sport, enabling data teams across the organization Deliver reliable, high quality data the organization can trust Observe and manage data platform costs Master the heterogeneous collection of technologies—both traditional and Modern Data Stack What are the business/product goals that you are focused on improving with the launch of Dagster+ What are the most interesting, innovative, or unexpected ways that you have seen Dagster used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+? When is Dagster+ the wrong choice? What do you have planned for the future of Dagster/Dagster Cloud/Dagster+? Contact Info Twitter (https://twitter.com/floydophone) LinkedIn (https://linkedin.com/in/pwhunt) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104) Dagster+ Launch Event (https://dagster.io/dagster-plus-launch-event) Hadoop (https://hadoop.apache.org/) MapReduce (https://en.wikipedia.org/wiki/MapReduce) Pydantic (https://docs.pydantic.dev/latest/) Software Defined Assets (https://docs.dagster.io/concepts/assets/software-defined-assets) Dagster Insights (https://docs.dagster.io/dagster-cloud/insights) Dagster Pipes (https://docs.dagster.io/guides/dagster-pipes) Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law) Data Mesh (https://www.datamesh-architecture.com/) Dagster Code Locations (https://docs.dagster.io/concepts/code-locations) Dagster Asset Checks (https://docs.dagster.io/concepts/assets/asset-checks) Dave & Buster's (https://www.daveandbusters.com/us/en/home) SQLMesh (https://sqlmesh.readthedocs.io/en/latest/) Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380) SDF (https://www.sdf.com/) Malloy (https://www.malloydata.dev/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments Interview Introduction How did you get involved in the area of data management? Can you start by outlining some of the situations where reconciling data between databases is needed? What are examples of the error conditions that you are likely to run into when duplicating information between database engines? When these errors do occur, what are some of the problems that they can cause? When teams are replicating data between database engines, what are some of the common patterns for managing those flows? How does that change between continual and one-time replication? What are some of the steps involved in verifying the integrity of data replication between database engines? If the source or destination isn't a traditional database engine (e.g. data lakehouse) how does that change the work involved in verifying the success of the replication? What are the challenges of validating and reconciling data? Sheer scale and cost of pulling data out, have to do in-place Performance. Pushing databases to the limit, especially hard for OLTP and legacy Cross-database compatibilty Data types What are the most interesting, innovative, or unexpected ways that you have seen Datafold/data-diff used in the context of cross-database validation? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold? When is Datafold/data-diff the wrong choice? What do you have planned for the future of Datafold? Contact Info LinkedIn (https://www.linkedin.com/in/glebmezh/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Datafold (https://www.datafold.com/) Podcast Episode (https://www.dataengineeringpodcast.com/datafold-proactive-data-quality-episode-205/) data-diff (https://github.com/datafold/data-diff) Podcast Episode (https://www.dataengineeringpodcast.com/data-diff-open-source-data-integration-validation-episode-303) Hive (https://hive.apache.org/) Presto (https://prestodb.io/) Spark (https://spark.apache.org/) SAP HANA (https://en.wikipedia.org/wiki/SAP_HANA) Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture) Nessie (https://projectnessie.org/) Podcast Episode (https://www.dataengineeringpodcast.com/nessie-data-lakehouse-data-versioning-episode-416) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) Iceberg Tables (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) SQLGlot (https://github.com/tobymao/sqlglot) Trino (https://trino.io/) GitHub Copilot (https://github.com/features/copilot) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'reilly, "Apache Iceberg, The definitive Guide", about Nessie, a git-like versioned catalog for data lakes using Apache Iceberg Interview Introduction How did you get involved in the area of data management? Can you describe what Nessie is and the story behind it? What are the core problems/complexities that Nessie is designed to solve? The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case? Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec? How do the versioning capabilities compare to/augment the data versioning in Iceberg? What are some of the sources of, and challenges in resolving, merge conflicts between table branches? Can you describe the architecture of Nessie? How have the design and goals of the project changed since it was first created? What is involved in integrating Nessie into a given data stack? For cases where a given query/compute engine doesn't natively support Nessie, what are the options for using it effectively? How does the inclusion of Nessie in a data lake influence the overall workflow of developing/deploying/evolving processing flows? What are the most interesting, innovative, or unexpected ways that you have seen Nessie used? What are the most interesting, unexpected, or challenging lessons that you have learned while working with Nessie? When is Nessie the wrong choice? What have you heard is planned for the future of Nessie? Contact Info LinkedIn (https://www.linkedin.com/in/alexmerced) Twitter (https://www.twitter.com/amdatalakehouse) Alex's Article on Dremio's Blog (https://www.dremio.com/authors/alex-merced/) Alex's Substack (https://amdatalakehouse.substack.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Project Nessie (https://projectnessie.org/) Article: What is Nessie, Catalog Versioning and Git-for-Data? (https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/) Article: What is Lakehouse Management?: Git-for-Data, Automated Apache Iceberg Table Maintenance and more (https://www.dremio.com/blog/what-is-lakehouse-management-git-for-data-automated-apache-iceberg-table-maintenance-and-more/) Free Early Release Copy of "Apache Iceberg: The Definitive Guide" (https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Arrow (https://arrow.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=6cc46c8c6088) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) AWS Glue (https://aws.amazon.com/glue/) Tabular (https://tabular.io/) Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363) Trino (https://trino.io/) Presto (https://prestodb.io/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58) RocksDB (https://rocksdb.org/) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Hive Metastore (https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) PyIceberg (https://py.iceberg.apache.org/) Optimistic Concurrency Control (https://en.wikipedia.org/wiki/Optimistic_concurrency_control) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Colleen Tartow about the questions to answer before and during the development of an AI program Interview Introduction How did you get involved in the area of data management? When you say "AI Program", what are the organizational, technical, and strategic elements that it encompasses? How does the idea of an "AI Program" differ from an "AI Product"? What are some of the signals to watch for that indicate an objective for which AI is not a reasonable solution? Who needs to be involved in the process of defining and developing that program? What are the skills and systems that need to be in place to effectively execute on an AI program? "AI" has grown to be an even more overloaded term than it already was. What are some of the useful clarifying/scoping questions to address when deciding the path to deployment for different definitions of "AI"? Organizations can easily fall into the trap of green-lighting an AI project before they have done the work of ensuring they have the necessary data and the ability to process it. What are the steps to take to build confidence in the availability of the data? Even if you are sure that you can get the data, what are the implementation pitfalls that teams should be wary of while building out the data flows for powering the AI system? What are the key considerations for powering AI applications that are substantially different from analytical applications? The ecosystem for ML/AI is a rapidly moving target. What are the foundational/fundamental principles that you need to design around to allow for future flexibility? What are the most interesting, innovative, or unexpected ways that you have seen AI programs implemented? What are the most interesting, unexpected, or challenging lessons that you have learned while working on powering AI systems? When is AI the wrong choice? What do you have planned for the future of your work at VAST Data? Contact Info LinkedIn (https://www.linkedin.com/in/colleen-tartow-phd/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links VAST Data (https://vastdata.com/) Colleen's Previous Appearance (https://www.dataengineeringpodcast.com/starburst-lakehouse-modern-data-architecture-episode-304) Linear Regression (https://en.wikipedia.org/wiki/Linear_regression) CoreWeave (https://www.coreweave.com/) Lambda Labs (https://lambdalabs.com/) MAD Landscape (https://mattturck.com/mad2023/) Podcast Episode (https://www.dataengineeringpodcast.com/mad-landscape-2023-data-infrastructure-episode-369) ML Episode (https://www.themachinelearningpodcast.com/mad-landscape-2023-ml-ai-episode-21) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design Interview Introduction How did you get involved in the area of data management? Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines? This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture? Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components? One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform? Can you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack? What are the elements of the overall product/user experience that you had to build to create a cohesive platform? What are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack? What are the pieces of the Arrow ecosystem that are still immature or need further investment from the community? What are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack? When is the FDAP stack the wrong choice? What do you have planned for the future of the InfluxDB IOx engine and the FDAP stack? Contact Info LinkedIn (https://www.linkedin.com/in/pauldix/) pauldix (https://github.com/pauldix) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links FDAP Stack Blog Post (https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/) Apache Arrow (https://arrow.apache.org/) DataFusion (https://arrow.apache.org/datafusion/) Arrow Flight (https://arrow.apache.org/docs/format/Flight.html) Apache Parquet (https://parquet.apache.org/) InfluxDB (https://www.influxdata.com/products/influxdb/) Influx Data (https://www.influxdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/influxdb-timeseries-data-platform-episode-199) Rust Language (https://www.rust-lang.org/) DuckDB (https://duckdb.org/) ClickHouse (https://clickhouse.com/) Voltron Data (https://voltrondata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Velox (https://github.com/facebookincubator/velox) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Trino (https://trino.io/) ODBC == Open DataBase Connectivity (https://en.wikipedia.org/wiki/Open_Database_Connectivity) GeoParquet (https://github.com/opengeospatial/geoparquet) ORC == Optimized Row Columnar (https://orc.apache.org/) Avro (https://avro.apache.org/) Protocol Buffers (https://protobuf.dev/) gRPC (https://grpc.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working togethr to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today. Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg Interview Introduction How did you get involved in the area of data management? To start, can you share your definition of what constitutes a "Data Lakehouse"? What are the technical/architectural/UX challenges that have hindered the progression of lakehouses? What are the notable advancements in recent months/years that make them a more viable platform choice? There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg? What are the key points of comparison for that combination in relation to other possible selections? What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems? What progress is being made (within or across the ecosystem) to address those sharp edges? For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements? What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures? What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem? When is a lakehouse the wrong choice? What do you have planned for the future of Trino/Starburst? Contact Info LinkedIn (https://www.linkedin.com/in/dainsundstrom/) dain (https://github.com/dain) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Trino (https://trino.io/) Starburst (https://www.starburst.io/) Presto (https://prestodb.io/) JBoss (https://en.wikipedia.org/wiki/JBoss_Enterprise_Application_Platform) Java EE (https://www.oracle.com/java/technologies/java-ee-glance.html) HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) S3 (https://aws.amazon.com/s3/) GCS == Google Cloud Storage (https://cloud.google.com/storage?hl=en) Hive (https://hive.apache.org/) Hive ACID (https://cwiki.apache.org/confluence/display/hive/hive+transactions) Apache Ranger (https://ranger.apache.org/) OPA == Open Policy Agent (https://www.openpolicyagent.org/) Oso (https://www.osohq.com/) AWS Lakeformation (https://aws.amazon.com/lake-formation/) Tabular (https://tabular.io/) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Debezium (https://debezium.io/) Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114) Materialized View (https://en.wikipedia.org/wiki/Materialized_view) Clickhouse (https://clickhouse.com/) Druid (https://druid.apache.org/) Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing Interview Introduction How did you get involved in the area of data management? Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation? What is the current state of the ecosystem for data sharing protocols/practices/platforms? What are some of the main challenges/shortcomings that teams/organizations experience with these options? What are the technical capabilities that need to be present for an effective data sharing solution? How does that change as a function of the type of data? (e.g. tabular, image, etc.) What are the requirements around governance and auditability of data access that need to be addressed when sharing data? What are the typical boundaries along which data access requires special consideration for how the sharing is managed? Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform? What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing? When is Bobsled the wrong choice? What do you have planned for the future of data sharing? Contact Info LinkedIn (https://www.linkedin.com/in/andyjefferson/?originalSubdomain=de) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Bobsled (https://www.bobsled.co/) OLAP == OnLine Analytical Processing (https://en.wikipedia.org/wiki/Online_analytical_processing) Cassandra (https://cassandra.apache.org/_/index.html) Podcast Episode (https://www.dataengineeringpodcast.com/cassandra-global-scale-database-episode-220) Neo4J (https://neo4j.com/) FTP == File Transfer Protocol (https://en.wikipedia.org/wiki/File_Transfer_Protocol) S3 Access Points (https://aws.amazon.com/s3/features/access-points/) Snowflake Sharing (https://docs.snowflake.com/en/guides-overview-sharing) BigQuery Sharing (https://cloud.google.com/bigquery/docs/authorized-datasets) Databricks Delta Sharing (https://www.databricks.com/product/delta-sharing) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3 Interview Introduction How did you get involved in the area of data management? Can you describe what RisingWave is and the story behind it? There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses? What are some of the platforms/architectures that teams are replacing with RisingWave? What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem? Can you describe how RisingWave is architected and implemented? How have the design and goals/scope changed since you first started working on it? What are the core design philosophies that you rely on to prioritize the ongoing development of the project? What are the most complex engineering challenges that you have had to address in the creation of RisingWave? Can you describe a typical workflow for teams that are building on top of RisingWave? What are the user/developer experience elements that you have prioritized most highly? What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine? What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave? When is RisingWave the wrong choice? What do you have planned for the future of RisingWave? Contact Info yingjunwu (https://github.com/yingjunwu) on GitHub Personal Website (https://yingjunwu.github.io/) LinkedIn (https://www.linkedin.com/in/yingjun-wu-4b584536/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links RisingWave (https://risingwave.com/) AWS Redshift (https://aws.amazon.com/redshift/) Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) Clickhouse (https://clickhouse.com/) Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/) Druid (https://druid.apache.org/) Materialize (https://materialize.com/) Spark (https://spark.apache.org/) Trino (https://trino.io/) Snowflake (https://www.snowflake.com/en/) Kafka (https://kafka.apache.org/) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209) Postgres (https://www.postgresql.org/) Debezium (https://debezium.io/) Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively Interview Introduction How did you get involved in the area of data management? Can you describe what Scanner is and the story behind it? What were the shortcomings of other tools that are available in the ecosystem? What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM) A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with?- e.g. cloudtrail logs, app logs, etc. What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner? Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies? Can you describe the architecture of the Scanner platform? What were the motivating constraints that led you to your current implementation? How have the design and goals of the product changed since you first started working on it? Given the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies? What are the personas of the end-users for Scanner? How has that influenced the way that you think about the query formats, APIs, user experience etc. for the prroduct? For teams who are working with Scanner can you describe how it fits into their workflow? What are the most interesting, innovative, or unexpected ways that you have seen Scanner used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner? When is Scanner the wrong choice? What do you have planned for the future of Scanner? Contact Info LinkedIn (https://www.linkedin.com/in/cliftoncrosland/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Scanner (https://scanner.dev/) cURL (https://curl.se/) Rust (https://www.rust-lang.org/) Splunk (https://www.splunk.com/) S3 (https://aws.amazon.com/s3/) AWS Athena (https://aws.amazon.com/athena/) Loki (https://grafana.com/oss/loki/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Presto (https://prestodb.io/) Trino (thttps://trino.io/) AWS CloudTrail (https://aws.amazon.com/cloudtrail/) GitHub Audit Logs (https://docs.github.com/en/organizations/keeping-your-organization-secure/managing-security-settings-for-your-organization/reviewing-the-audit-log-for-your-organization) Okta (https://www.okta.com/) Cribl (https://cribl.io/) Vector.dev (https://vector.dev/) Tines (https://www.tines.com/) Torq (https://torq.io/) Jira (https://www.atlassian.com/software/jira) Linear (https://linear.app/) ECS Fargate (https://aws.amazon.com/fargate/) SQS (https://aws.amazon.com/sqs/) Monoid (https://en.wikipedia.org/wiki/Monoid) Group Theory (https://en.wikipedia.org/wiki/Group_theory) Avro (https://avro.apache.org/) Parquet (https://parquet.apache.org/) OCSF (https://github.com/ocsf/) VPC Flow Logs (https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP). Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That’s three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). Your host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack Interview Introduction How did you get involved in the area of data management? Can you describe what the role of the CDP is in the context of a businesses data ecosystem? What are the core technical challenges associated with building and maintaining a CDP? What are the organizational/business factors that contribute to the complexity of these systems? The early days of CDPs came with the promise of "Customer 360". Can you unpack that concept and how it has changed over the past ~5 years? Recent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs? How have the architectural shifts changed the ways that organizations interact with their customer data? How have the responsibilities shifted across different roles? What are the governance policy and enforcement challenges that are added with the expansion of access and responsibility? What are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs? When is a CDP the wrong choice? What do you have planned for the future of ActionIQ? Contact Info LinkedIn (https://www.linkedin.com/in/tasso/) @Tasso (https://twitter.com/tasso) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Action IQ (https://www.actioniq.com) Aster Data (https://en.wikipedia.org/wiki/Aster_Data_Systems) Teradata (https://www.teradata.com/) Filemaker (https://en.wikipedia.org/wiki/FileMaker) Hadoop (https://hadoop.apache.org/) NoSQL (https://en.wikipedia.org/wiki/NoSQL) Hive (https://hive.apache.org/) Informix (https://en.wikipedia.org/wiki/Informix) Parquet (https://parquet.apache.org/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Spark (https://spark.apache.org/) Redshift (https://aws.amazon.com/redshift/) Unity Catalog (https://www.databricks.com/product/unity-catalog) Customer Data Platform (https://en.wikipedia.org/wiki/Customer_data_platform) CDP Market Guide (https://info.actioniq.com/hubfs/CDP%20Market%20Guide/CDP_Market_Guide_2024.pdf?utm_campaign=FY24Q4_2024%20CDP%20Market%20Guide&utm_source=AIQ&utm_medium=podcast) Kaizen (https://en.wikipedia.org/wiki/Kaizen) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management Interview Introduction How did you get involved in the area of data management? Can you start by summarizing your current areas of research and the motivations behind them? What are the open questions today in technical scalability of data engines? What are the experimental methods that you are using to gain understanding in the opportunities and practical limits of those systems? As you strive to push the limits of technical capacity in data systems, how does that impact the usability of the resulting systems? When performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product? What are the main sources of tension between technical scalability and user experience/ease of comprehension? What are some of the positive synergies that you have been able to realize between your teaching, research, and corporate activities? In what ways do they produce conflict, whether personally or technically? What are the most interesting, innovative, or unexpected ways that you have seen your research used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on research of the scalability limits of data systems? What is your heuristic for when a given research project needs to be terminated or productionized? What do you have planned for the future of your academic research? Contact Info Website (https://jigneshpatel.org/) LinkedIn (https://www.linkedin.com/in/jigneshmpatel/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Carnegie Mellon Universe (https://www.cmu.edu/) Parallel Databases (https://en.wikipedia.org/wiki/Parallel_database) Genomics (https://en.wikipedia.org/wiki/Genomics) Proteomics (https://en.wikipedia.org/wiki/Proteomics) Moore's Law (https://en.wikipedia.org/wiki/Moore%27s_law) Dennard Scaling (https://en.wikipedia.org/wiki/Dennard_scaling) Generative AI (https://en.wikipedia.org/wiki/Generative_artificial_intelligence) Quantum Computing (https://en.wikipedia.org/wiki/Quantum_computing) Voltron Data (https://voltrondata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Von Neumann Architecture (https://en.wikipedia.org/wiki/Von_Neumann_architecture) Two's Complement (https://en.wikipedia.org/wiki/Two%27s_complement) Ottertune (https://ottertune.com/) Podcast Episode (https://www.dataengineeringpodcast.com/ottertune-database-performance-optimization-episode-197/) dbt (https://www.getdbt.com/) Informatica (https://www.informatica.com/) Mozart Data (https://mozartdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/mozart-data-modern-data-stack-episode-242/) DataChat (https://datachat.ai/) Von Neumann Bottleneck (https://www.techopedia.com/definition/14630/von-neumann-bottleneck) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! Your host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment Interview Introduction How did you get involved in the area of data management? Can you start by summarizing the data challenges that are particular to the fintech ecosystem? What are the primary sources and types of data that fintech organizations are working with? What are the business-level capabilities that are dependent on this data? How do the regulatory and business requirements influence the technology landscape in fintech organizations? What does a typical build vs. buy decision process look like? Fraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech? How does that influence the architectural design/capabilities for data platforms in those organizations? Data governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens? What are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech? What do you have planned for the future of your data capabilities at Monite? Contact Info LinkedIn (https://www.linkedin.com/in/a-korchak/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Monite (https://monite.com/) ISO 270001 (https://www.iso.org/standard/27001) Tesseract (https://github.com/tesseract-ocr/tesseract) GitOps (https://about.gitlab.com/topics/gitops/) SWIFT Protocol (https://en.wikipedia.org/wiki/SWIFT) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
loading
Comments (4)

mrs rime

🔴💚Really Amazing ️You Can Try This💚WATCH💚ᗪOᗯᑎᒪOᗩᗪ👉https://co.fastmovies.org

Jan 16th
Reply

Vassili Savinov

can we simply use sql :)?

Aug 13th
Reply

Andre A.

Nice program.. The concept is useful to datagrids and EDA.!

Feb 9th
Reply

T L

It's very hard to follow your guest..

Sep 22nd
Reply
Download from Google Play
Download from App Store