Could you also release the source code behind the automatic update system?
epogrebnyak
Wonder why the median vote count is 0 - it seems every post gets at least a few votes. Maybe this wasn't the case in the past.
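One possible explanation (a guess, not confirmed against this dataset): if the median is taken over all items rather than just stories, comment rows with no score drag it to zero. A stdlib sketch with made-up toy rows:

```python
from statistics import median

# Toy rows standing in for HN items (hypothetical values, not real data);
# comments typically carry no score, stories do.
items = [
    {"type": "story",   "score": 12},
    {"type": "story",   "score": 5},
    {"type": "comment", "score": 0},
    {"type": "comment", "score": 0},
    {"type": "comment", "score": 0},
]

# Median over ALL items is pulled to 0 by the comment rows...
print(median(r["score"] for r in items))  # 0

# ...while stories alone have a positive median.
print(median(r["score"] for r in items if r["type"] == "story"))  # 8.5
```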
robotswantdata
Where’s the opt out ?
imhoguy
Yay! So much knowledge in just 11GB. Adding to my end of the World hoarding stash!
gkbrk
My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?
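One factor worth ruling out (an assumption, not verified against either file): Parquet writers often default to snappy, while ClickHouse typically uses LZ4 or ZSTD, and codec choice plus sort order can easily account for a gap like this. A dependency-free illustration of how much the codec alone moves the ratio on identical bytes:

```python
import zlib
import lzma

# Repetitive text loosely mimicking HN item JSON (made-up sample).
sample = b'{"type":"comment","by":"user","text":"hello world"}\n' * 10_000

fast = zlib.compress(sample, 1)   # cheap, weaker codec
strong = lzma.compress(sample)    # slower, stronger codec

# The stronger codec yields noticeably smaller output on the same
# input; the same effect applies to Parquet column codecs.
print(len(sample), len(fast), len(strong))
```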
brtkwr
This comment should make it into the download in a few mins.
kshacker
Good for demo but every 5 minutes? Why?
mlhpdx
Static web content and dynamic data?
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
alstonite
What happened between 2023 and 2024 to cause the usage dropoff?
> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.
Wouldn't that lose deleted/moderated comments?
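If the midnight refetch does drop upstream-deleted items, one way to preserve that signal (a sketch, assuming item ids can be read from both the old 5-minute blocks and the refetched month) is an anti-join on ids before discarding the blocks:

```python
# ids seen in today's 5-minute blocks vs. ids in the refetched
# authoritative month (toy values; real ids come from the Parquet files)
block_ids = {101, 102, 103, 104}
refetched_ids = {101, 102, 104}

# Items present earlier but absent after the refetch were likely
# deleted or moderated; record them before removing the blocks.
vanished = sorted(block_ids - refetched_ids)
print(vanished)  # [103]
```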
0cf8612b2e1e
Under the Known Limitations section:
> deleted and dead are integers. They are stored as 0/1 rather than booleans.
Is there a technical reason to do this? You have the type right there.
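Whatever the reason, consumers can normalize the flags with a one-line cast. A plain-Python sketch (made-up row; the same idea works as a CAST in SQL or astype(bool) in pandas):

```python
row = {"id": 1, "deleted": 0, "dead": 1}  # hypothetical item row

# Normalize the 0/1 integer flags into real booleans.
for flag in ("deleted", "dead"):
    row[flag] = bool(row[flag])

print(row)  # {'id': 1, 'deleted': False, 'dead': True}
```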
Imustaskforhelp
As someone who made a Hacker News analysis project using ClickHouse, this really feels like a project made for me (especially the every-5-minutes updates, which would have helped my project back then too!)
It also helps a ton with one of my project ideas that I had put on the back burner: a ping service where people can write @Username in a comment, and the service detects it and emails that user if they've signed up (similar to a service run by someone in the HN community that mails you whenever someone replies to your thread directly, but as a general-purpose ping).
The idea came up when I tried to ping someone to show them something relevant and thought: wait, a ping-that-mails service might be interesting. I looked at Algolia and other services to hook it up, but nothing made much sense back then, so I shelved it. A 5-minute update cadence would make it feasible.
One discrepancy though: the README says the last update was 16 March, so I would love to know whether it really is updated every 5 minutes, because that would be phenomenal and would unlock some exciting new possibilities.
tonymet
what's the license for HN content?
Onavo
Is it possible to download only a subset? e.g. Show HNs or HN Who is hiring. Those are very useful for classroom data science, i.e. a good dataset for students to learn the basics of data cleaning and engineering.
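Even if the hosted files don't offer server-side subsetting, engines with Parquet predicate pushdown (DuckDB, pyarrow) can skip row groups that don't match a filter. As a dependency-free illustration of the filter itself (hypothetical titles, not real data):

```python
# Toy item rows; in practice these would come from the Parquet archive.
items = [
    {"id": 1, "title": "Show HN: My side project"},
    {"id": 2, "title": "Ask HN: How do you learn?"},
    {"id": 3, "title": "Show HN: A tiny database"},
]

# Keep only Show HN submissions by title prefix.
show_hn = [r for r in items if r["title"].startswith("Show HN:")]
print([r["id"] for r in show_hn])  # [1, 3]
```

With DuckDB, the same predicate in a WHERE clause against read_parquet over HTTP can (depending on how the files were written) avoid downloading non-matching row groups entirely.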
lokimoon
You are the product
bstsb
what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations
GeoAtreides
is the legal page a placeholder, do words have no meaning? https://www.ycombinator.com/legal/
Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)

Replacing an 11.6GB Parquet file every 5 minutes strikes me as a bit wasteful. I would probably use Apache Iceberg here.

The best source for this data used to be ClickHouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't updated since 2025-12-26.

Please upload to https://academictorrents.com/ as well if possible
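The "wasteful" concern above is really about rewrites versus appends. A back-of-envelope sketch (assuming the 11.6 GB figure from the thread really were rewritten every cycle, and a guessed 1 MB per appended block):

```python
full_file_gb = 11.6              # full archive size, from the thread
rewrites_per_day = 24 * 60 // 5  # one cycle every 5 minutes

# Rewriting the whole file each cycle:
rewrite_gb_per_day = full_file_gb * rewrites_per_day

# Appending only a small block each cycle (block size is a guess):
block_mb = 1
append_gb_per_day = block_mb * rewrites_per_day / 1024

print(rewrites_per_day)              # 288
print(round(rewrite_gb_per_day, 1))  # 3340.8
print(round(append_gb_per_day, 2))   # 0.28
```

Which is roughly why table formats like Iceberg, or the archive's own today/ directory of 5-minute blocks, append small files and only consolidate periodically.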