by herpderperator 3 days ago

Does this help with DuckDB concurrency? My main gripe with DuckDB is that you can't write to it from multiple processes at the same time. If you open the database in write mode with one process, you cannot modify it at all from another process without the first process completely releasing it. In fact, you cannot even read from it from another process in this scenario.

So if you typically use a file-backed DuckDB database in one process and want to quickly modify something in that database using the DuckDB CLI (like you might connect Sequel Pro or DBeaver to make changes to a DB while your main application is 'using' it), the CLI complains that the database is locked by another process and doesn't let you connect to it at all.

This is unlike SQLite, which supports and handles this in a thread-safe manner out of the box. I know it's DuckDB's explicit design decision[0], but it would be amazing if DuckDB could behave more like SQLite when it comes to this sort of thing. DuckDB has incredible quality-of-life improvements with many extra types and functions supported, not to mention all the SQL dialect enhancements allowing you to type much more concise SQL (they call it "Friendly SQL"), which executes super efficiently too.

[0] https://duckdb.org/docs/current/connect/concurrency

szarnyasg 3 days ago

Hi, DuckDB DevRel here. To have concurrent read-write access to a database, you can use our DuckLake lakehouse format and coordinate concurrent access through a shared Postgres catalog. We released v1.0 yesterday: https://ducklake.select/2026/04/13/ducklake-10/

I updated your reference [0] with this information.

nrjames 3 days ago

Regarding documentation, I think the DuckLake docs would benefit from a relatively simple "When should I consider using DuckLake?" FAQ entry. You have sections for what, how, and why, essentially, and a few simple use cases and/or case studies could help provide the aha moment to people in data jobs who are inundated with marketing from other companies. It would help folks like me understand under which circumstances I would stand to benefit most from using DuckLake.

szarnyasg 2 days ago
nrjames a day ago

My employer is in the midst of migrating petabytes of data from Snowflake to Databricks. They’re sold on the “all in one” nature of the platform and believe they’ll save significant money through a contract locking them into Databricks running on Azure. It is a wildly disruptive process in an environment where the “Snowflake police” (as we call them) have been hounding everybody to reduce credit usage. Now the IT platform team is trying to explain units of work to non-technical VPs, for example, and there’s mass confusion. All signs point to them ending up in the same situation with expensive Databricks bills, vendor lock-in, and a future migration to try to reduce costs.

I guess what I was trying to say is that DuckLake isn’t even a blip on their radar. Should it be? Could you explain it to a non-technical marketing VP as part of a cost-savings measure? What’s the DuckLake equivalent to a unit of work on Databricks or a Snowflake warehouse? If I needed to join multiple tables with billions of rows, where does the compute happen in DuckLake? Can you run your own cluster like with ClickHouse or StarRocks? How does it scale horizontally with storage and compute? How do I update it? What if there’s a security flaw? How well does it stand up to 500 people querying it simultaneously, and what type of setup would I need to achieve that?

The PMs that manage the IT platform team aren’t necessarily deeply familiar with all of the technical details. A compelling introduction to DuckLake would answer some of these questions in a way that the VPs or PMs could digest easily while providing the technical details the data workers require. For better or worse, “data lakehouse,” “data warehouse,” and “data lake” are all industry jargon that is pretty impenetrable to people who don’t spend a lot of time working with the tools but who cut checks and make decisions.

citguru 3 days ago

Hi,

DuckLake is great for the lakehouse layer, and it's what we use in production. But there's a gap, and that's what I'm trying to address with OpenDuck. DuckLake does solve concurrent access at the lakehouse/catalog level, along with table management.

But the moment you need to fall back to DuckDB's own compute for things DuckLake doesn't support yet, you're back to a single .duckdb file with exclusive locking. One process writes, nobody else reads.

OpenDuck sits at a different layer. It intercepts DuckDB's file I/O and replaces it with a differential storage engine: append-only layers with snapshot isolation.

citguru 3 days ago

Yes, this is actually one of the core problems OpenDuck's architecture addresses.

The short version: OpenDuck interposes a differential storage layer between DuckDB and the underlying file. DuckDB still sees a normal file (via FUSE on Linux or an in-process FileSystem on any platform), but underneath, writes go to append-only layers and reads are resolved by overlaying those layers newest-first. Sealing a layer creates an immutable snapshot.
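To make the overlay idea concrete, here's a toy model of it in Python. This is my own hypothetical sketch (the class and names are invented, not OpenDuck's actual code): each layer maps block numbers to bytes, writes only ever touch the active layer, and reads resolve newest-first down through sealed layers to the base image.

```python
class LayeredStore:
    """Toy model: base image plus append-only layers, newest wins on read."""

    def __init__(self, base):
        self.base = dict(base)   # immutable base image: block -> bytes
        self.layers = []         # sealed, immutable layers (oldest first)
        self.active = {}         # the one writable layer

    def write(self, block, data):
        # Writes never touch the base or sealed layers;
        # they only land in the active layer.
        self.active[block] = data

    def seal(self):
        # Sealing freezes the active layer into an immutable snapshot
        # boundary and starts a fresh writable layer.
        self.layers.append(self.active)
        self.active = {}
        return len(self.layers)  # snapshot id = number of sealed layers

    def read(self, block, snapshot=None):
        # Resolve newest-first: active layer (unless reading a snapshot),
        # then sealed layers newest to oldest, then the base image.
        if snapshot is None and block in self.active:
            return self.active[block]
        sealed = self.layers if snapshot is None else self.layers[:snapshot]
        for layer in reversed(sealed):
            if block in layer:
                return layer[block]
        return self.base[block]


store = LayeredStore({0: b"base0", 1: b"base1"})
store.write(0, b"v1")
snap = store.seal()        # snapshot sees block 0 = b"v1"
store.write(0, b"v2")      # later write, invisible to the snapshot
assert store.read(0, snapshot=snap) == b"v1"
assert store.read(0) == b"v2"
assert store.read(1) == b"base1"
```

The point of the sketch is the isolation property: a reader pinned to `snap` keeps seeing `b"v1"` no matter what the writer does afterward, because sealed layers are never mutated.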

This gives you:

Many concurrent readers: each reader opens a snapshot, which is a frozen, consistent view of the database. They don't touch the writer's active layer at all. No lock contention.

One serialized write path: multiple clients can submit writes, but they're ordered through a single gateway/primary rather than racing on the same file. This is intentional: DuckDB's storage engine was never designed for multi-process byte-level writes, and pretending otherwise leads to corruption. Instead, OpenDuck serializes mutations at a higher level and gives you safe concurrency via snapshots.

So for your specific scenario — one process writing while you want to quickly inspect or query the DB from the CLI — you'd be able to open a read-only snapshot mount (or attach with ?snapshot=<uuid>) from a second process and query freely. The writer keeps going, new snapshots appear as checkpoints seal, and readers can pick up the latest snapshot whenever they're ready.

It's not unconstrained multi-writer OLTP (that's an explicit non-goal), but it does solve the "I literally cannot even read the database while another process has it open" problem that makes DuckDB painful in practice.

jeadie 3 days ago

[deleted]
wenc 3 days ago

Try DuckLake. They just released a production version.

You can do read/write of a Parquet folder on your local drive, but managed by DuckLake. Supports schema evolution and versioning too.

Basically SQLite for Parquet.