One use I'd have for this is company wikis that you want to give folks easy offline access to (maybe the wiki has documentation that's useful at sites that don't have cellular coverage).
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
xlii
> No tracking, no network calls, no surprises.
Won't comment on the project (though the idea seems interesting), but this line in the README is a tell for me ;)
ninalanyon
> kage serve $HOME/data/kage/paulgraham.com
If the result is static why does it need a server? Isn't it possible to make it so that it can simply be opened by the browser? Like:
$ firefox $HOME/data/kage/paulgraham.com
Then the result would be usable on machines without kage installed.
maxloh
I find SingleFile [0] to be a much more robust version of this.
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
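To illustrate the base64 packing mentioned above, here is a minimal Python sketch of inlining an asset as a `data:` URI. It is not SingleFile's actual code; the helper names `to_data_uri` and `inline_asset` are hypothetical, and the plain string replacement assumes the asset URL appears in double quotes.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a base64 data: URI, guessing the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_asset(html: str, src: str, data: bytes) -> str:
    """Replace every double-quoted occurrence of the asset URL `src` with a data URI."""
    return html.replace(f'"{src}"', f'"{to_data_uri(data, src)}"')
```

Base64 inflates each asset by roughly a third, which is the price of getting one self-contained file.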
I've been using httrack (https://www.httrack.com) to download wikis to read on flights, which isn't perfect but better than I'd found previously. I'll try this out, I'd be delighted to have good results. Thanks for the post.
Departed7405
What's the advantage compared to mhtml?
gregwebs
This seems like it has the potential to create a lot of load on a site. Are there settings to control how fast it clones, or to skip images/videos?
Is there a way to only get a subset of a website?
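For what it's worth, the two things asked about here (a crawl delay and restricting to a subset) are simple to express in any crawler. A rough Python sketch follows; this is not kage's code or its settings, and `in_scope` and `Throttle` are made-up names for illustration.

```python
import time
from urllib.parse import urlparse


def in_scope(url: str, host: str, path_prefix: str = "/") -> bool:
    """Return True if `url` is on `host` and its path starts with `path_prefix`."""
    parts = urlparse(url)
    return parts.hostname == host and parts.path.startswith(path_prefix)


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if the last request was too recent; return the seconds slept."""
        now = time.monotonic()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

A crawler would call `throttle.wait()` before each fetch and only enqueue links for which `in_scope(...)` is true, e.g. everything under `/docs/` on one host.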
dimiprasakis
Neat project, I like the idea.
One thing from a quick read: you launch Chrome with --no-sandbox. Is there a good reason for that? Security wise it's probably not a good idea. If there is no reason, I'd suggest leaving the sandbox on!
In any case, cool stuff :)
coffeecoders
I've accumulated a bunch of old website archives over the years. The funny thing is the ugly HTML dumps have been more useful than the "perfect" archive.
It's one of the reasons I've become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
rahimnathwani
So this is like using wget --mirror except that it works on pages that require javascript, right?
sails
What is the best way to give a coding agent a full website so that it can see what I see? With animation and design I’m never sure what it gets when I save the website in the browser. Maybe this is suitable?
lolpython
This is cool. I could see myself downloading the articles behind the first couple of pages of Hacker News with this, for viewing on a flight or long-distance train ride with spotty internet.
sanqui
Cool concept. I would like to see this combined with mitmproxy for archive grade fidelity. You could be saving exactly the data served and at the same time a representation by a modern (contemporary) browser, with all JS having run. This combination would be my perfect replacement for the WARC format.
amatecha
Suddenly remembering the days of dialup and your browser serving a fully-functional cached copy of a webpage when you try to access it and you're not online...
Igor_Wiwi
This is quite a useful tool, especially for cases where internet access is limited (flights, for example). I implemented it as a separate feature in mdview.io: for example, you can export a document as an HTML file for offline usage, with all the presentation features like rich tables, Mermaid, etc. built in. Example https://mdview.io/s/why-markdown-became-default-format-for-a... then try Export - Export HTML
kjmh
I was floored by the idea of browsing docs offline but disappointed that recreating the demo of archiving Paul Graham’s essays gave me a ZIM with broken images and broken Unicode symbols when viewed in Kiwix.
Sathwickp
I'm still trying to cope with your GitHub profile, 68k commits a year is crazy
carsonye
This is interesting. Is the intended use case mostly read-only websites like blogs/docs/essays? How well does it handle sites where navigation, search, dropdowns, or other UI interactions depend on JavaScript?
c7b
Probably a stupid question, but could this archive embedded videos as well?
latexr
For those with an eReader, one thing that works really well is using pandoc to download and convert a webpage to EPUB that you can then load to your reader.
pandoc --from html --to epub --output /PATH/TO/FILE.epub https://example.com
jyscao
I tried to clone an HTTP (not HTTPS) site, and it's giving me `navigation failed: net::ERR_NAME_NOT_RESOLVED`, even when I explicitly include the protocol with `http://<FQDN>`.
godot
The readme uses paulgraham.com as an example (which is mostly text articles). I never use "Save As" for a web page (for the reasons the author states); I always just print to PDF and save the PDF file.
For an entire website of many pages, though, I can see how this would be useful.
snowflaxxx
Meet Teleport Pro
smusamashah
What if I wanted to download all Confluence docs at work?
ekianjo
Curious about "keep it for a decade" claim. Can something possibly break down the road?
calrizien
Does this work for the Apple Docs website? Really tricky to get those offline.
nitotm
I was looking for something like this the other day, it can be very helpful.
I would recommend an add-on or new feature to detect and remove cookie banners / annoying popups that open on load (eg. sign up to my mailing list).
Listing a few examples from fastText could help you.
You might also have the opposite problem though: some websites have content in the base html (so it's searchable by Google and they get views) and remove it on load (so you have to pay).
Capturing the initial html and comparing it to the final version could give you some hints and allow you to repair the removed content.
Best of luck with the project!
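A rough sketch of the initial-vs-final comparison suggested above, in Python with only the standard library. It assumes you already have both the raw HTML as served and the serialized DOM after load; `text_chunks` and `removed_on_load` are hypothetical helper names, not part of kage.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text chunks, ignoring <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_chunks(html: str) -> list:
    """Return the non-empty text chunks of an HTML document, in order."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def removed_on_load(initial_html: str, final_html: str) -> list:
    """Return text chunks present in the initial HTML but missing from the final DOM."""
    final = set(text_chunks(final_html))
    return [chunk for chunk in text_chunks(initial_html) if chunk not in final]
```

Anything this returns is a candidate for content that scripts stripped out after load. Exact chunk matching is crude, so a real implementation would probably want fuzzier comparison.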
daviding
Nice idea!
fwiw, false positives and all, but the Windows 11 default Windows Security doesn't like it:
`leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.`
G_o_D
How is it different from MHTML?
KellyCriterion
Sounds like .MCH-files re-invented? (-:
chinnyys
The readme is AI slop, and incredibly grating to read. The disgust I felt while reading it almost put me off trying the project.
Is the code also AI slop?
chfritz
how is this different from using puppeteer to load the page and save the DOM as HTML?
cynicalsecurity
Binary app is a really bad way of storing data. No one would ever want to run a binary shared with them or found online.
soulofmischief
Cool project! I know it's written in go, but it would be cool to see something like this which uses Cosmopolitan Libc + redbean or something similar to create a binary which runs anywhere. Would be fun to be able to pass around self-executable website archives.
(Certificates just expired for justine's website, just ignore the warning.)
delduca
curl can do this
sneak
The README is LLM slop. This makes me assume the code is the same.
aa-jv
I've been using "Print to PDF" as my principal bookmark management tool since 1998, and I have over 90,000 such PDFs sitting on my system, easily re-read and discovered.
So I don't quite get what the point of kage is. What does it do that print-to-PDF won't already do? The resulting PDFs contain all the content, and also include the original URL and creation date, etc. How is kage an improvement?
grahamstanes17
nice
Onavo
How does it handle websites with client-side paywalls? Can you run it with extensions like Bypass Paywalls and uBlock Origin?
I was intrigued to see how the demo GIF in the README was generated: https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63...
Turns out it's using another project by the same author: https://github.com/tamnd/ascii-gif
The script used for the demo is at https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63... and has a comment showing how to run it:
Looks like it's an opinionated wrapper around https://github.com/charmbracelet/vhs
I find SingleFile [0] to be a much more robust version of this.
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
They also offer a CLI powered by Puppeteer. [1]
[0]: https://github.com/gildas-lormeau/singlefile
[1]: https://github.com/gildas-lormeau/single-file-cli
Reminds me of this. https://gwern.net/gwtar
Compared to that is there anything kage does better?
This is awesome, we wanted an offline copy of someone’s prototype (as built on Lovable, etc) so we could do version control and sharing in an easier format. Wrote our approach here: https://productnow.ai/blogs/extracting-html-from-ai-prototyp...
But will look into this now, see if we can swap some stuff out. We’ve really liked the idea of an offline mirror, makes a lot of collaboration use cases simpler
It seems like https://github.com/tw93/pake is better.
Anyone remember Teleport Pro?
Amazing stuff!
Cool project! I know it's written in go, but it would be cool to see something like this which uses Cosmopolitan Libc + redbean or something similar to create a binary which runs anywhere. Would be fun to be able to pass around self-executable website archives.
https://github.com/jart/cosmopolitan
https://justine.lol/cosmopolitan/index.html
https://redbean.dev
(Certificates just expired for justine's website, just ignore the warning.)