Regarding documentation, I think the DuckLake docs would benefit from a relatively simple “When should I consider using DuckLake?” FAQ entry. You essentially have sections for the what, the how, and the why; a few simple use cases and/or case studies could provide the aha moment for people in data jobs who are inundated with marketing from other vendors. It would help folks like me understand under which circumstances I would stand to benefit most from using DuckLake.
DuckDB devrel here. You are right. This was in the FAQ, but I have now also added it to the DuckLake documentation's main page at https://ducklake.select/docs/stable/
My employer is in the midst of migrating petabytes of data from Snowflake to Databricks. They’re sold on the “all in one” nature of the platform and believe they’ll save significant money through a contract locking them into Databricks running on Azure. It is a wildly disruptive process in an environment where the “Snowflake police” (as we call them) have been hounding everybody to reduce credit usage. Now the IT platform team is trying to explain units of work to non-technical VPs, for example, and there’s mass confusion. All signs point to them ending up in the same situation: expensive Databricks bills, vendor lock-in, and a future migration to try to reduce costs.
I guess what I was trying to say is that DuckLake isn’t even a blip on their radar. Should it be? Could you explain it to a non-technical marketing VP as part of a cost-savings measure? What’s the DuckLake equivalent of a unit of work on Databricks or a warehouse on Snowflake? If I needed to join multiple tables with billions of rows, where does the compute happen in DuckLake? Can you run your own cluster as with ClickHouse or StarRocks? How does it scale horizontally in storage and compute? How do I update it? What if there’s a security flaw? How well does it stand up to 500 people querying it simultaneously, and what type of setup would I need to achieve that?
The PMs who manage the IT platform team aren’t necessarily deeply familiar with all of the technical details. A compelling introduction to DuckLake would answer some of these questions in a way the VPs or PMs could digest easily, while still providing the technical details the data workers require. For better or worse, “data lakehouse,” “data warehouse,” and “data lake” are all industry jargon that is pretty impenetrable to people who don’t spend a lot of time working with the tools but who cut the checks and make the decisions.