Statistics that live in your SQL

114 points | 15 comments | 2 days ago
pacbard

This fits a need I had when working with DuckDB: running statistical analyses directly within the database without having to spin up external tools (like R/Stata/Python).

I really appreciate the API for calling stats functions and retrieving results. This seamless integration was exactly what I was missing.

As for my concerns, the project currently gets you through about a first semester's worth of grad-level statistics or quantitative methods. It's sufficient for exploratory descriptive statistics, paired tests, and basic linear regressions. However, this also means it isn't close to the "production-level" statistics required for rigorous research work. At a minimum, it needs to support heteroskedasticity-robust standard errors (Huber-White and clustered as the baseline; jackknife and bootstrapped as a non-parametric bonus), multilevel linear models, and generalized linear models (GLMs). I notice there is an open issue for GLM support, though it sounds like that will require a full rewrite of the inference backend. Marginal estimates following a regression would also be highly useful, especially if GLMs are implemented.
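
To make the robust-standard-error request concrete, here is a minimal sketch of the Huber-White (HC0) standard error for the slope of a one-predictor regression, next to the classical one. It is plain Python and has nothing to do with the extension's actual API; the function name `ols_with_hc0` and the sample data are made up for illustration, and it skips the small-sample corrections (HC1 to HC3) that R and Stata offer.

```python
import math


def ols_with_hc0(x, y):
    """Fit y = a + b*x by least squares.

    Returns (a, b, classical_se_b, hc0_se_b). Hypothetical helper,
    not part of any library discussed in this thread.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    dx = [xi - xbar for xi in x]          # centered predictor
    sxx = sum(d * d for d in dx)          # sum of squared deviations

    b = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance, RSS / (n - 2).
    s2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(s2 / sxx)

    # HC0 SE: each observation contributes its own squared residual,
    # Var(b) = sum(dx_i^2 * e_i^2) / sxx^2 (the "sandwich" for the slope).
    hc0_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2)

    return a, b, classical_se, hc0_se


if __name__ == "__main__":
    # Made-up data, only to show the call.
    a, b, se_classical, se_hc0 = ols_with_hc0([1, 2, 3, 4], [1, 3, 2, 5])
    print(f"slope={b:.3f} classical_se={se_classical:.3f} hc0_se={se_hc0:.3f}")
```

The point estimate is identical in both cases; only the uncertainty changes, which is why robust errors are usually exposed as an option on the regression call rather than as a separate model.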

Taking this library from an MVP to a production-ready replacement for SAS, R, or Stata will require significant effort. I am unsure about the market fit for a tool like this; organizations paying for SAS or Stata are unlikely to abandon them for an upstart project, and R has a deeply entrenched ecosystem that is impossible to replace. The situation feels similar to the dynamic between Octave and Matlab, or SageMath and Mathematica. It risks becoming a free alternative used primarily by those who cannot afford the paid products.

As others in the thread have pointed out, this extension includes functionality that is already handled well by other community extensions (like ggsql, stochastic, and read_stat). Because the ultimate goal seems to be providing a SAS-compatible frontend built on DuckDB, which is a huge and exciting undertaking, I wonder whether the statistical backend might progress faster by focusing only on tests and regressions. Since it serves as the foundation for the other frontend services, zeroing in on the core stats and relying on the broader DuckDB ecosystem for the rest might make this massive scope a bit more manageable.

williamcotton

The plotting aspect of this seems very similar to:

https://opensource.posit.co/blog/2026-04-20_ggsql_alpha_rele...

geysersam

Looks great!

One minor correction - the `summarize` function in duckdb can also be used in CTEs etc.

But you have to wrap the `summarize` in a `from` clause like this:

  with
    some_table as (from range(10)),
    x as (from (summarize some_table))
  from x;
PashaGo

Interesting, but I think it works only for quick ad-hoc analysis. For dashboards or deeper research, you still need other tools.

HackerThemAll

NoScript detected a potential Cross-Site Scripting attack

from https://kolistat.com to https://bedeverewise.app.

Suspicious data:

(URL) https://bedeverewise.app/embed?autorun=1&query=WITH pois AS (
  SELECT k, dpois(k, 3) AS pmf
  FROM range(0, 11) AS t(k)
)
VISUALIZE
  k AS x
  , pmf AS y
FROM pois
DRAW bar
;

so... no, thanks.
