Parsing the Cohost Export with DuckDB

Cohost recently emailed me with a link to all the metadata for my account. There’s a lot there. There’s all your posts, images / attachments of course. But there’s also a lot of other useful metadata.

I want to take all my cohost posts and turn them into the format Astro expects. It shouldn’t be too hard, and we could do it with a simple node or python script, but it’d be a lot more fun to use DuckDB, so that’s what we’re going to do.

We are going to complete the task of migrating this cohost metadata into mdx files/attachments that Astro can consume with no assistance from a higher level programming language (like Node.js / python). We will use a little bash but even then the bash scripts will be generated by DuckDB queries and executed from within the DuckDB shell.

Some of this data is sort of personal, so I won’t host the whole payload (at least until I’ve gone through it) but I’ll include screenshots and samples of outputs as we go along so you can get an idea of how it all works and how cool DuckDB is.

What is DuckDB

In a nutshell, DuckDB is an in-process database that is designed to query a lot of data on your local machine. It’s also well known for its modern and convenient features. DuckDB can handle modern formats and syntax we’d never expect from a traditional database. It’s also very fast, and famous for running queries orders of magnitude faster than competing DBs (for its specific use case).

DuckDB is very good at importing random CSV’s or JSON data and turning it into structured and (easily queryable) binary.

DuckDB is designed to be used for offline data processing but that hasn’t stopped the larger community from using DuckDB in other contexts.
DuckDB has a partnership with a company Motherduck, that runs DuckDB on the cloud (primarily for use as a data warehouse). Lots of companies are also running DuckDB in WASM to power low latency dataviz and data analysis either entirely client side or hybrid client/server.

I think the most important thing to say about DuckDB is that it is a joy to use. You can feel the love that has gone into it. I would recommend it for anyone who is not super familiar with databases, but with the proviso that you are comfortable in some other existing host context (like Python, Bash, Node etc). Because DuckDB runs “in-process”, it is more of a native library than an external system, you always need some host context where you can run it.

Set up

First step, install DuckDB.

In my projects I usually have two git ignored folders input and output. I use it for tasks exactly like this, where I need to do some intermediary data processing but I don’t want to commit anything.

So step one, lets create those folders and ensure they are git ignored:

mkdir input output
echo "
input
output
" >> .gitignore

Now I’m going to put the cohost export in that folder and name it cohost-export

Querying the file system

Let’s query the filesystem using DuckDB to see what this export looks like. I’ve opened a terminal and run the DuckDB cli from ./input directory in this repo.

select filename 
from read_text('./cohost-export/**')
;
filename
./cohost-export/project/jmsfbs-code/124035-d0b291a6-a10b-4424-8a9a-84bdce1a1d69-profile.jpeg
./cohost-export/project/jmsfbs-code/comments/1b18d7cc-e64e-412e-a43a-af7d5b2db60f.json
./cohost-export/project/jmsfbs-code/comments/288abe0b-66b3-4df7-8305-14b3e4449b92.json
./cohost-export/project/jmsfbs-code/comments/36e4597b-5114-4e2d-89c6-b4ab7318f532.json
./cohost-export/project/jmsfbs-code/comments/42c7829f-fecd-4791-a308-5b3e2d449892.json
./cohost-export/project/jmsfbs-code/comments/7346f794-bd3a-48de-a119-f8ef20ea9e7a.json
./cohost-export/project/jmsfbs-code/comments/c07362e9-1407-49cb-ba66-e1d4dab8615a.json
./cohost-export/project/jmsfbs-code/comments/c4cbd27c-bb6c-4d94-8621-4738d43001b7.json
./cohost-export/project/jmsfbs-code/comments/f7c98510-96dc-4599-8e77-73d7c9737029.json
./cohost-export/project/jmsfbs-code/find-your-friends.json
./cohost-export/project/jmsfbs-code/followers/MikeWake.json
./cohost-export/project/jmsfbs-code/followers/ackasi.json
./cohost-export/project/jmsfbs-code/followers/happy-hermit.json
./cohost-export/project/jmsfbs-code/followers/jcolag.json
./cohost-export/project/jmsfbs-code/followers/pygy.json
./cohost-export/project/jmsfbs-code/followers/this-is-envy.json
./cohost-export/project/jmsfbs-code/followers/tibber5.json
./cohost-export/project/jmsfbs-code/following/MikeWake.json
./cohost-export/project/jmsfbs-code/following/crosseye.json
./cohost-export/project/jmsfbs-code/following/draknek.json
./cohost-export/project/jmsfbs-code/following/jcolag.json
./cohost-export/project/jmsfbs-code/following/pygy.json
./cohost-export/project/jmsfbs-code/following/this-is-envy.json
./cohost-export/project/jmsfbs-code/jmsfbs-code.json
./cohost-export/project/jmsfbs-code/posts/published/1042927-new-row-violates-row/post.json
./cohost-export/project/jmsfbs-code/posts/published/1059858-more-productive-on-t/post.json
./cohost-export/project/jmsfbs-code/posts/published/1196048-druids-datadog-s-d/post.json
./cohost-export/project/jmsfbs-code/posts/published/1196053-css-anchor-positioni/post.json
./cohost-export/project/jmsfbs-code/posts/published/1201573-how-to-use-deno-debu/post.json
./cohost-export/project/jmsfbs-code/posts/published/1205386-parser-combinators/post.json
./cohost-export/project/jmsfbs-code/posts/published/1205822-what-i-made-the-comp/post.json
./cohost-export/project/jmsfbs-code/posts/published/1215585-how-to-disable-web-s/post.json
./cohost-export/project/jmsfbs-code/posts/published/1230738-nushell-snippets/post.json
./cohost-export/project/jmsfbs-code/posts/published/1302500-workaround-for-lack/post.json
./cohost-export/project/jmsfbs-code/posts/published/1312901-debugger-launch-nod/post.json
./cohost-export/project/jmsfbs-code/posts/published/1349274-just-upgraded-a-self/post.json
./cohost-export/project/jmsfbs-code/posts/published/1349307-ssh-import-id/post.json
./cohost-export/project/jmsfbs-code/posts/published/1353509-how-to-parse-dynamic/post.json
./cohost-export/project/jmsfbs-code/posts/published/1681286-making-my-bluetooth/post.json
./cohost-export/project/jmsfbs-code/posts/published/1687721-rss-feed-truncated/post.json
./cohost-export/project/jmsfbs-code/posts/published/1706101-ouch-better-cli-ux/post.json
./cohost-export/project/jmsfbs-code/posts/published/1706541-installing-rust-carg/post.json
./cohost-export/project/jmsfbs-code/posts/published/1829684-simple-secret-sharin/post.json
./cohost-export/project/jmsfbs-code/posts/published/1933414-chat-gpt-almost-alway/post.json
./cohost-export/project/jmsfbs-code/posts/published/2254888-using-nushell-to-saf/post.json
./cohost-export/project/jmsfbs-code/posts/published/2625470-moving-to-ghost/post.json
./cohost-export/project/jmsfbs-code/posts/published/4081387-secrets-service/post.json
./cohost-export/project/jmsfbs-code/posts/published/4092488-typescript-wishlist/post.json
./cohost-export/project/jmsfbs-code/posts/published/4362243-repeat-0-1fr-doesn/post.json
./cohost-export/project/jmsfbs-code/posts/published/4369806-kind-of-surprised-ho/post.json
./cohost-export/project/jmsfbs-code/posts/published/4893642-recursive-queries-in/post.json
./cohost-export/project/jmsfbs-code/posts/published/4954186-learning-about-apl-a/image.png
./cohost-export/project/jmsfbs-code/posts/published/4954186-learning-about-apl-a/post.json
./cohost-export/project/jmsfbs-code/posts/published/5262412-typescript-performan/post.json
./cohost-export/project/jmsfbs-code/posts/published/5555101-github-copilot/post.json
./cohost-export/project/jmsfbs-code/posts/published/5679848-typescript-tip-out/post.json
./cohost-export/project/jmsfbs-code/posts/published/5976016-postgres-implicit-l/image.png
./cohost-export/project/jmsfbs-code/posts/published/5976016-postgres-implicit-l/post.json
./cohost-export/project/jmsfbs-code/posts/published/6559941-learning-godot-has-c/post.json
./cohost-export/project/jmsfbs-code/posts/published/6633844-database-migrations/post.json
./cohost-export/project/jmsfbs-code/posts/published/696437-interesting-blow-ran/post.json
./cohost-export/project/jmsfbs-code/posts/published/716808-minor-website-update/post.json
./cohost-export/project/jmsfbs-code/posts/published/7238140-useful-alias-for-doc/post.json
./cohost-export/project/jmsfbs-code/posts/published/729566-keydiff/post.json
./cohost-export/project/jmsfbs-code/posts/published/738657-puppeteer-and-docker/post.json
./cohost-export/project/jmsfbs-code/posts/published/739146-cohost-launched-rss/post.json
./cohost-export/project/jmsfbs-code/posts/published/766864-nginx-caddy/post.json
./cohost-export/project/jmsfbs-code/posts/published/773340-crossposting-cohost/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png
./cohost-export/project/jmsfbs-code/posts/published/773340-crossposting-cohost/post.json
./cohost-export/project/jmsfbs-code/posts/published/779895-function-constr/post.json
./cohost-export/project/jmsfbs-code/posts/published/784613-browser-automation-w/post.json
./cohost-export/project/jmsfbs-code/posts/published/784666-joining-rss-feeds/post.json
./cohost-export/project/jmsfbs-code/posts/published/926060-working-on-a-simple/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png
./cohost-export/project/jmsfbs-code/posts/published/926060-working-on-a-simple/post.json
./cohost-export/project/jmsfbs-code/posts/published/927564-solid-js-and-presite/post.json
./cohost-export/project/jmsfbs-code/posts/published/943321-crypto-random-uuid/post.json
./cohost-export/project/jmsfbs-code/posts/published/943384-svelte-kit-design-dec/post.json
./cohost-export/project/jmsfbs-code/posts/published/993001-knee-deep-in-postgre/post.json
./cohost-export/project/jmsfbs/28244-b8ae4e49-30db-4b31-bc27-b4b06a02a4b1-profile.jpg
./cohost-export/project/jmsfbs/28244-ff52a913-0c06-40a2-ad25-016c19f52e86-profile.jpg
./cohost-export/project/jmsfbs/comments/266a9fc9-96e5-4012-abd8-7e06c2c8ff16.json
./cohost-export/project/jmsfbs/comments/3039c5c4-a107-4622-b2fd-8b116a718cc1.json
./cohost-export/project/jmsfbs/comments/3acdd1e7-6704-451d-bcb5-449354d49f8f.json
./cohost-export/project/jmsfbs/comments/5cf51c5a-f4ae-4f7c-9cb2-c35c0c348ab4.json
./cohost-export/project/jmsfbs/comments/70ed3f4a-8279-434c-82e5-257d066bb4a2.json
./cohost-export/project/jmsfbs/comments/79b2a664-ef49-4be1-b264-4f49aa9e47c9.json
./cohost-export/project/jmsfbs/comments/7f6570ee-9075-49b5-99e7-ec209b6e6fc6.json
./cohost-export/project/jmsfbs/comments/7f936dbe-4c63-4f5d-aa0f-a8c78d732297.json
./cohost-export/project/jmsfbs/comments/925ac802-9823-4a7c-b1b3-19c9b777211e.json
./cohost-export/project/jmsfbs/comments/94ba93a6-00c5-4e37-b871-346b3a0598b6.json
./cohost-export/project/jmsfbs/comments/9f529de9-e29f-49c0-9d4b-88d2120c7ab1.json
./cohost-export/project/jmsfbs/comments/a7643eef-e639-4b53-a483-3f52ed849646.json
./cohost-export/project/jmsfbs/comments/a99ace8d-a252-487b-a937-715798c6f12c.json
./cohost-export/project/jmsfbs/comments/b94ac2e5-b442-4c87-8adc-0de46b79eca1.json
./cohost-export/project/jmsfbs/comments/c5bbb206-81f1-4b76-8f76-8032e4e2bf43.json
./cohost-export/project/jmsfbs/comments/cad605ea-a97c-43f0-bdc6-d653d22d2ff1.json
./cohost-export/project/jmsfbs/comments/cef9a3ba-608b-4364-b2d4-814db8280a5d.json
./cohost-export/project/jmsfbs/comments/d5ed8974-93e8-401c-91f0-7c69d92e6f40.json
./cohost-export/project/jmsfbs/comments/dd066a6a-d1fa-490c-9a04-845fa7d5cd3b.json
./cohost-export/project/jmsfbs/comments/e918ff83-0d5a-40a2-9f11-8090411cf462.json
./cohost-export/project/jmsfbs/comments/f1cde7b6-f5a6-4bb4-9a4e-adc9ec2edfd7.json
./cohost-export/project/jmsfbs/comments/f54870a8-275b-40a2-9ac1-28aa1c7906eb.json
./cohost-export/project/jmsfbs/comments/f7a60c03-7279-49d1-b5db-a16cffc237f4.json
./cohost-export/project/jmsfbs/find-your-friends.json
./cohost-export/project/jmsfbs/followers/BigBear.json
./cohost-export/project/jmsfbs/followers/MikeWake.json
./cohost-export/project/jmsfbs/followers/Noxulous.json
./cohost-export/project/jmsfbs/followers/PadPalon.json
./cohost-export/project/jmsfbs/followers/aacurate2.json
./cohost-export/project/jmsfbs/followers/aloe.json
./cohost-export/project/jmsfbs/followers/attica.json
./cohost-export/project/jmsfbs/followers/danielleri.json
./cohost-export/project/jmsfbs/followers/jcolag.json
./cohost-export/project/jmsfbs/followers/llewellynn.json
./cohost-export/project/jmsfbs/followers/lucidlucille.json
./cohost-export/project/jmsfbs/followers/pygy.json
./cohost-export/project/jmsfbs/followers/ridiculousdino.json
./cohost-export/project/jmsfbs/followers/tokyoscarab.json
./cohost-export/project/jmsfbs/following/Campster.json
./cohost-export/project/jmsfbs/following/Jusiv.json
./cohost-export/project/jmsfbs/following/Maid-Music.json
./cohost-export/project/jmsfbs/following/PadPalon.json
./cohost-export/project/jmsfbs/following/SabbiTabbi.json
./cohost-export/project/jmsfbs/following/TiberiusMoreau.json
./cohost-export/project/jmsfbs/following/Tribow.json
./cohost-export/project/jmsfbs/following/Vectorpark.json
./cohost-export/project/jmsfbs/following/accidentallymental.json
./cohost-export/project/jmsfbs/following/barney.json
./cohost-export/project/jmsfbs/following/boodoo.json
./cohost-export/project/jmsfbs/following/bungo.json
./cohost-export/project/jmsfbs/following/clempics.json
./cohost-export/project/jmsfbs/following/community-jukebox.json
./cohost-export/project/jmsfbs/following/crosseye.json
./cohost-export/project/jmsfbs/following/css.json
./cohost-export/project/jmsfbs/following/danielelliott.json
./cohost-export/project/jmsfbs/following/danielleri.json
./cohost-export/project/jmsfbs/following/df2jkpics.json
./cohost-export/project/jmsfbs/following/draknek.json
./cohost-export/project/jmsfbs/following/drewsof.json
./cohost-export/project/jmsfbs/following/droqen.json
./cohost-export/project/jmsfbs/following/eniko.json
./cohost-export/project/jmsfbs/following/ewoud.json
./cohost-export/project/jmsfbs/following/fasterthanlime.json
./cohost-export/project/jmsfbs/following/geometric.json
./cohost-export/project/jmsfbs/following/hollygramazio.json
./cohost-export/project/jmsfbs/following/itsnatclayton.json
./cohost-export/project/jmsfbs/following/jcolag.json
./cohost-export/project/jmsfbs/following/johnnemann.json
./cohost-export/project/jmsfbs/following/jwaaaap.json
./cohost-export/project/jmsfbs/following/kenney.json
./cohost-export/project/jmsfbs/following/llewellynn.json
./cohost-export/project/jmsfbs/following/lucidlucille.json
./cohost-export/project/jmsfbs/following/lutz.json
./cohost-export/project/jmsfbs/following/objects.json
./cohost-export/project/jmsfbs/following/pygy.json
./cohost-export/project/jmsfbs/following/saint11.json
./cohost-export/project/jmsfbs/following/staff.json
./cohost-export/project/jmsfbs/following/sweatercinema.json
./cohost-export/project/jmsfbs/following/terrycavanagh.json
./cohost-export/project/jmsfbs/following/tha-rami.json
./cohost-export/project/jmsfbs/following/the-technician.json
./cohost-export/project/jmsfbs/following/thecatamites.json
./cohost-export/project/jmsfbs/following/tipsheda.json
./cohost-export/project/jmsfbs/following/tokyoscarab.json
./cohost-export/project/jmsfbs/following/tomforsyth.json
./cohost-export/project/jmsfbs/following/torcado.json
./cohost-export/project/jmsfbs/following/vectorpoem.json
./cohost-export/project/jmsfbs/following/wiredaemon.json
./cohost-export/project/jmsfbs/following/ztiworoh.json
./cohost-export/project/jmsfbs/jmsfbs.json
./cohost-export/project/jmsfbs/posts/published/1003900-pequod/post.json
./cohost-export/project/jmsfbs/posts/published/1008879-the-dinner/post.json
./cohost-export/project/jmsfbs/posts/published/1016743-frog-2-tutorial/post.json
./cohost-export/project/jmsfbs/posts/published/1046457-plugged-my-electric/post.json
./cohost-export/project/jmsfbs/posts/published/1065458-spotify-getting-a-bi/post.json
./cohost-export/project/jmsfbs/posts/published/1088733-garage-band-no-plugi/post.json
./cohost-export/project/jmsfbs/posts/published/1105201-cloudpunk/post.json
./cohost-export/project/jmsfbs/posts/published/1166928-just-finished-saints/post.json
./cohost-export/project/jmsfbs/posts/published/1270030-loopy-levels-in-fall/post.json
./cohost-export/project/jmsfbs/posts/published/1313723-df-psych-oddysey/post.json
./cohost-export/project/jmsfbs/posts/published/1342661-music-production-del/post.json
./cohost-export/project/jmsfbs/posts/published/1342883-midi-research-round/post.json
./cohost-export/project/jmsfbs/posts/published/1379548-freeing-up-your-hand/post.json
./cohost-export/project/jmsfbs/posts/published/1433411-cousins/post.json
./cohost-export/project/jmsfbs/posts/published/1456123-i-paid-for-a-daw/post.json
./cohost-export/project/jmsfbs/posts/published/1456278-bass-sound-design-w/post.json
./cohost-export/project/jmsfbs/posts/published/1460591-larry-sanders-show/post.json
./cohost-export/project/jmsfbs/posts/published/1480427-splitting-frequencie/post.json
./cohost-export/project/jmsfbs/posts/published/1572613-he-died-the-way-he-l/post.json
./cohost-export/project/jmsfbs/posts/published/160965-welcome-to-cohost/post.json
./cohost-export/project/jmsfbs/posts/published/161748-baldies/post.json
./cohost-export/project/jmsfbs/posts/published/1619795-playing-high-hats-fa/post.json
./cohost-export/project/jmsfbs/posts/published/1712599-one-of-the-better-es/post.json
./cohost-export/project/jmsfbs/posts/published/174731-drum-rants-thought/post.json
./cohost-export/project/jmsfbs/posts/published/179858-feels-good-man-docu/post.json
./cohost-export/project/jmsfbs/posts/published/180406-you-definitely-notic/post.json
./cohost-export/project/jmsfbs/posts/published/185752-one-reason-might-be/post.json
./cohost-export/project/jmsfbs/posts/published/1958396-new-drinking-game/post.json
./cohost-export/project/jmsfbs/posts/published/220695-went-for-a-walk-toda/post.json
./cohost-export/project/jmsfbs/posts/published/2254047-this-is-huge-for-mus/post.json
./cohost-export/project/jmsfbs/posts/published/2458613-i-wonder-when-chrome/post.json
./cohost-export/project/jmsfbs/posts/published/259394-new-skillet/post.json
./cohost-export/project/jmsfbs/posts/published/2869791-nerd-points/image.png
./cohost-export/project/jmsfbs/posts/published/2869791-nerd-points/post.json
./cohost-export/project/jmsfbs/posts/published/2870336-new-track-golden/post.json
./cohost-export/project/jmsfbs/posts/published/2895937-more-alone/post.json
./cohost-export/project/jmsfbs/posts/published/2983024-new-recording-proces/image.png
./cohost-export/project/jmsfbs/posts/published/2983024-new-recording-proces/post.json
./cohost-export/project/jmsfbs/posts/published/3025811-walls/post.json
./cohost-export/project/jmsfbs/posts/published/3112762-recording-progress-u/post.json
./cohost-export/project/jmsfbs/posts/published/3125303-need-some-distance/post.json
./cohost-export/project/jmsfbs/posts/published/3136545-accidentally-deleted/post.json
./cohost-export/project/jmsfbs/posts/published/3169687-chameleon-days/post.json
./cohost-export/project/jmsfbs/posts/published/3172316-electronic-kit-upgra/post.json
./cohost-export/project/jmsfbs/posts/published/3175152-australia-votes-no-t/post.json
./cohost-export/project/jmsfbs/posts/published/3192439-kind-of-nice-to-see/post.json
./cohost-export/project/jmsfbs/posts/published/3213421-vst-vendors-and-thei/post.json
./cohost-export/project/jmsfbs/posts/published/3259093-practicing-and-mayb/post.json
./cohost-export/project/jmsfbs/posts/published/329629-stuff-like-this-driv/post.json
./cohost-export/project/jmsfbs/posts/published/3349306-expensive-reverb-fo/post.json
./cohost-export/project/jmsfbs/posts/published/3393436-the-last-beatles-son/post.json
./cohost-export/project/jmsfbs/posts/published/3395966-played-an-open-mic-a/397796805_841791357742803_767766949589955176_n.png
./cohost-export/project/jmsfbs/posts/published/3395966-played-an-open-mic-a/post.json
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/Noth.mp3
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/img_1_1699499465335.jpg
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/img_3_1699499479580.jpg
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/post.json
./cohost-export/project/jmsfbs/posts/published/364112-saw-3-turtles-and-a/post.json
./cohost-export/project/jmsfbs/posts/published/3644721-only-a-shadow/post.json
./cohost-export/project/jmsfbs/posts/published/3644766-open-mic-night/IMG_20231124_104548.jpg
./cohost-export/project/jmsfbs/posts/published/3644766-open-mic-night/post.json
./cohost-export/project/jmsfbs/posts/published/3656874-where-do-birds-go-af/post.json
./cohost-export/project/jmsfbs/posts/published/3673531-book-thoughts-hitch/post.json
./cohost-export/project/jmsfbs/posts/published/3819063-recording-tea-and-s/post.json
./cohost-export/project/jmsfbs/posts/published/3852001-the-games-industry-m/post.json
./cohost-export/project/jmsfbs/posts/published/3982228-guitar-modelling-and/post.json
./cohost-export/project/jmsfbs/posts/published/3984809-lone-wolf/post.json
./cohost-export/project/jmsfbs/posts/published/4053158-a-non-dogmatic-left/post.json
./cohost-export/project/jmsfbs/posts/published/4154545-gig-zoe-s-bar/image.png
./cohost-export/project/jmsfbs/posts/published/4154545-gig-zoe-s-bar/post.json
./cohost-export/project/jmsfbs/posts/published/4169376-dynamics-and-live-sh/post.json
./cohost-export/project/jmsfbs/posts/published/4310343-my-spotify-playlist/post.json
./cohost-export/project/jmsfbs/posts/published/4470982-foundation-s2-is-pre/post.json
./cohost-export/project/jmsfbs/posts/published/5262397-found-out-today-some/post.json
./cohost-export/project/jmsfbs/posts/published/5338982-how-can-you-not-noti/post.json
./cohost-export/project/jmsfbs/posts/published/536580-driving-around-today/post.json
./cohost-export/project/jmsfbs/posts/published/5444811-why-cohost-is-good/post.json
./cohost-export/project/jmsfbs/posts/published/5444882-ursula-le-guin-s-a-w/post.json
./cohost-export/project/jmsfbs/posts/published/5516880-donald-glover-son-o/post.json
./cohost-export/project/jmsfbs/posts/published/5528127-finally-yes/post.json
./cohost-export/project/jmsfbs/posts/published/5638375-witcher-3-rp-gs-and/Nova_pict_131.webp
./cohost-export/project/jmsfbs/posts/published/5638375-witcher-3-rp-gs-and/post.json
./cohost-export/project/jmsfbs/posts/published/5654197-getting-back-to-my-r/post.json
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions/China%20Doll%202.mp3
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions/PXL_20240429_062918414.jpg
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions/post.json
./cohost-export/project/jmsfbs/posts/published/5814203-https-youtu-be-cej/post.json
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts/Warcraft-1-fog-of-war.jpg
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts/header.jpg
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts/post.json
./cohost-export/project/jmsfbs/posts/published/597784-howard-zinn-woody/post.json
./cohost-export/project/jmsfbs/posts/published/603580-stuff-2022/post.json
./cohost-export/project/jmsfbs/posts/published/631473-negroni/post.json
./cohost-export/project/jmsfbs/posts/published/7254514-reading-spotify-bios/post.json
./cohost-export/project/jmsfbs/posts/published/733962-learning-to-drum-ev/post.json
./cohost-export/project/jmsfbs/posts/published/750274-breakdown-of-karen-c/post.json
./cohost-export/project/jmsfbs/posts/published/752472-breaking-the-monopol/post.json
./cohost-export/project/jmsfbs/posts/published/752730-the-true-believers/post.json
./cohost-export/project/jmsfbs/posts/published/757625-frick-filler/post.json
./cohost-export/project/jmsfbs/posts/published/766452-home-recording-journ/post.json
./cohost-export/project/jmsfbs/posts/published/767388-hmmm/Screenshot%202023-01-02%20193640.png
./cohost-export/project/jmsfbs/posts/published/767388-hmmm/post.json
./cohost-export/project/jmsfbs/posts/published/779120-how-sneakers-are-man/post.json
./cohost-export/project/jmsfbs/posts/published/7810959-gonna-miss-this-plac/post.json
./cohost-export/project/jmsfbs/posts/published/823558-home-recording/post.json
./cohost-export/project/jmsfbs/posts/published/827539-new-track-never/post.json
./cohost-export/project/jmsfbs/posts/published/829585-songwriting-process/post.json
./cohost-export/project/jmsfbs/posts/published/842542-new-guitars/post.json
./cohost-export/project/jmsfbs/posts/published/845495-i-don-t-really-like/post.json
./cohost-export/project/jmsfbs/posts/published/859426-scotch-crime/post.json
./cohost-export/project/jmsfbs/posts/published/881830-midi-thoughts/PXL_20230121_083234984_2.jpg
./cohost-export/project/jmsfbs/posts/published/881830-midi-thoughts/post.json
./cohost-export/project/jmsfbs/posts/published/892822-automate-the-pedal-b/post.json
./cohost-export/project/jmsfbs/posts/published/904948-video-editing/post.json
./cohost-export/project/jmsfbs/posts/published/909617-new-track-wicked/post.json
./cohost-export/project/jmsfbs/posts/published/915951-new-tracks/post.json
./cohost-export/project/jmsfbs/posts/published/915980-starting-the-next-so/post.json
./cohost-export/project/jmsfbs/posts/published/942742-listening-to-fruscia/post.json
./cohost-export/project/jmsfbs/posts/published/943274-samorost/post.json
./cohost-export/project/jmsfbs/silenced/3amsoda.json
./cohost-export/project/jmsfbs/silenced/ReynaNeko.json
./cohost-export/project/jmsfbs/silenced/batluster.json
./cohost-export/project/jmsfbs/silenced/endingsummary.json
./cohost-export/project/jmsfbs/silenced/kyn.json
./cohost-export/project/jmsfbs/silenced/melinoe.json
./cohost-export/project/recede/find-your-friends.json
./cohost-export/project/recede/recede.json
./cohost-export/user-notes/ewoud.json
./cohost-export/user-notes/llewellynn.json
./cohost-export/user-notes/ztiworoh.json
./cohost-export/user.json

Hmm that’s a lot of data, its hard for me to get a sense of the structure. Let’s see if we can aggregate it a bit.

First let’s parse the path into usable metadata:

select 
    (filename)
    .regexp_extract(
        '(.*)/(.*)(\..*)'
        , ['dir', 'filename', 'ext']
    ) as struct
from read_text('./cohost-export/**')
limit 5
;
struct
{'dir': ./cohost-export/project/jmsfbs-code, 'filename': 124035-d0b291a6-a10b-4424-8a9a-84bdce1a1d69-profile, 'ext': .jpeg}
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 1b18d7cc-e64e-412e-a43a-af7d5b2db60f, 'ext': .json}
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 288abe0b-66b3-4df7-8305-14b3e4449b92, 'ext': .json}
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 36e4597b-5114-4e2d-89c6-b4ab7318f532, 'ext': .json}
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 42c7829f-fecd-4791-a308-5b3e2d449892, 'ext': .json}

That’s great, but it’d be better if dir, filename and ext were separate columns, so let’s use unnest()

select 
    (filename)
    .regexp_extract(
        '(.*)/(.*)(\..*)'
        , ['dir', 'filename', 'ext']
    )
    .unnest()

from read_text('./cohost-export/**')
limit 5
;
dir filename ext
./cohost-export/project/jmsfbs-code 124035-d0b291a6-a10b-4424-8a9a-84bdce1a1d69-profile .jpeg
./cohost-export/project/jmsfbs-code/comments 1b18d7cc-e64e-412e-a43a-af7d5b2db60f .json
./cohost-export/project/jmsfbs-code/comments 288abe0b-66b3-4df7-8305-14b3e4449b92 .json
./cohost-export/project/jmsfbs-code/comments 36e4597b-5114-4e2d-89c6-b4ab7318f532 .json
./cohost-export/project/jmsfbs-code/comments 42c7829f-fecd-4791-a308-5b3e2d449892 .json

Great!

Casually introducing advanced features

Already we’ve seen three pretty advanced features of DuckDB.

  1. We’re querying the filesystem with SQL
  2. We’re able to build nested structs of data
  3. We’re able to dot chain methods within SQL

All of this is quite convenient but also quite unusual (for most databases). Normally a database runs on a server and doesn’t have access to local data on your filesystem.

There are other in-process databases (like SQLite) but they likely don’t have the same syntactic convenience.

Other databases like Postgres have nested data in various forms but there is no easy to use anonymous structs. You can only deal with nested data with predefined types.

Having nested anonymous binary data in your result sets is incredible powerful. It has surprised me how often I reach for it now.

Aggregating results

Okay, lets get back to aggregating. Let’s group by dir and collect filename/ext into an array and take a sample of the files so we can get a sense of the contents.

Let’s take the query we already had, and wrap it in a Common Table Expression (CTE). Which is basically just a pipeline of SQL

The CTE below is equivalent to the query before, but we can start to add more blocks.

with parsed as (
    select 
        (filename)
        .regexp_extract(
            '(.*)/(.*)(\..*)'
            , ['dir', 'filename', 'ext']
        )
        .unnest()
    from read_text('./cohost-export/**')
)
select * from parsed
limit 5
;

By the way, I’m just writing all this SQL in a markdown file in VSCode. In the terminal I’ve got DuckDB running in the VSCode terminal, and I select the text and send it to the terminal (I’ve bound that to Ctrl/Command + Enter).

I’ve also told DuckDB to output tables in markdown format

.mode markdown

So I can just paste the results straight back into this blogpost.

Now let’s aggregate!

with parsed as (
    select 
        (filename)
        .regexp_extract(
            '(.*)/(.*)(\..*)'
            , ['dir', 'filename', 'ext']
        )
        .unnest()
    from read_text('./cohost-export/**')
)
select 
    dir
    , (array_agg(distinct (ext))) as contents
from parsed
group by all
;
dir contents
./cohost-export/project/jmsfbs-code [.jpeg, .json, .json]
./cohost-export/project/jmsfbs-code/posts/published/1196053-css-anchor-positioni [.json]
./cohost-export/project/jmsfbs-code/posts/published/1205822-what-i-made-the-comp [.json]
./cohost-export/project/jmsfbs-code/posts/published/1215585-how-to-disable-web-s [.json]
./cohost-export/project/jmsfbs-code/posts/published/1302500-workaround-for-lack [.json]
./cohost-export/project/jmsfbs-code/posts/published/1312901-debugger-launch-nod [.json]
./cohost-export/project/jmsfbs-code/posts/published/1349274-just-upgraded-a-self [.json]
./cohost-export/project/jmsfbs-code/posts/published/1353509-how-to-parse-dynamic [.json]
./cohost-export/project/jmsfbs-code/posts/published/4362243-repeat-0-1fr-doesn [.json]
./cohost-export/project/jmsfbs-code/posts/published/4893642-recursive-queries-in [.json]
./cohost-export/project/jmsfbs-code/posts/published/5262412-typescript-performan [.json]
./cohost-export/project/jmsfbs-code/posts/published/696437-interesting-blow-ran [.json]
./cohost-export/project/jmsfbs-code/posts/published/729566-keydiff [.json]
./cohost-export/project/jmsfbs-code/posts/published/784613-browser-automation-w [.json]
./cohost-export/project/jmsfbs-code/posts/published/926060-working-on-a-simple [.png, .json]
./cohost-export/project/jmsfbs/posts/published/1065458-spotify-getting-a-bi [.json]
./cohost-export/project/jmsfbs/posts/published/1480427-splitting-frequencie [.json]
./cohost-export/project/jmsfbs/posts/published/179858-feels-good-man-docu [.json]
./cohost-export/project/jmsfbs/posts/published/220695-went-for-a-walk-toda [.json]
./cohost-export/project/jmsfbs/posts/published/2869791-nerd-points [.png, .json]
./cohost-export/project/jmsfbs/posts/published/3112762-recording-progress-u [.json]
./cohost-export/project/jmsfbs/posts/published/3169687-chameleon-days [.json]
./cohost-export/project/jmsfbs/posts/published/329629-stuff-like-this-driv [.json]
./cohost-export/project/jmsfbs/posts/published/3349306-expensive-reverb-fo [.json]
./cohost-export/project/jmsfbs/posts/published/3393436-the-last-beatles-son [.json]
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin [.mp3, .jpg, .jpg]
./cohost-export/project/jmsfbs/posts/published/3644721-only-a-shadow [.json]
./cohost-export/project/jmsfbs/posts/published/5262397-found-out-today-some [.json]
./cohost-export/project/jmsfbs/posts/published/5444882-ursula-le-guin-s-a-w [.json]
./cohost-export/project/jmsfbs/posts/published/5638375-witcher-3-rp-gs-and [.webp, .json]
./cohost-export/project/jmsfbs/posts/published/631473-negroni [.json]
./cohost-export/project/jmsfbs/posts/published/752730-the-true-believers [.json]
./cohost-export/project/jmsfbs/posts/published/766452-home-recording-journ [.json]
./cohost-export/project/jmsfbs/posts/published/7810959-gonna-miss-this-plac [.json]
./cohost-export/project/jmsfbs/posts/published/915951-new-tracks [.json]
./cohost-export/project/jmsfbs-code/followers [.json, .json, .json]
./cohost-export/project/jmsfbs-code/following [.json, .json, .json]
./cohost-export/project/jmsfbs-code/posts/published/1681286-making-my-bluetooth [.json]
./cohost-export/project/jmsfbs-code/posts/published/1829684-simple-secret-sharin [.json]
./cohost-export/project/jmsfbs-code/posts/published/4369806-kind-of-surprised-ho [.json]
./cohost-export/project/jmsfbs-code/posts/published/5555101-github-copilot [.json]
./cohost-export/project/jmsfbs-code/posts/published/6559941-learning-godot-has-c [.json]
./cohost-export/project/jmsfbs-code/posts/published/766864-nginx-caddy [.json]
./cohost-export/project/jmsfbs-code/posts/published/943321-crypto-random-uuid [.json]
./cohost-export/project/jmsfbs/followers [.json, .json, .json]
./cohost-export/project/jmsfbs/posts/published/1008879-the-dinner [.json]
./cohost-export/project/jmsfbs/posts/published/1016743-frog-2-tutorial [.json]
./cohost-export/project/jmsfbs/posts/published/1046457-plugged-my-electric [.json]
./cohost-export/project/jmsfbs/posts/published/1088733-garage-band-no-plugi [.json]
./cohost-export/project/jmsfbs/posts/published/1105201-cloudpunk [.json]
./cohost-export/project/jmsfbs/posts/published/1166928-just-finished-saints [.json]
./cohost-export/project/jmsfbs/posts/published/1342883-midi-research-round [.json]
./cohost-export/project/jmsfbs/posts/published/1433411-cousins [.json]
./cohost-export/project/jmsfbs/posts/published/1456123-i-paid-for-a-daw [.json]
./cohost-export/project/jmsfbs/posts/published/1712599-one-of-the-better-es [.json]
./cohost-export/project/jmsfbs/posts/published/185752-one-reason-might-be [.json]
./cohost-export/project/jmsfbs/posts/published/1958396-new-drinking-game [.json]
./cohost-export/project/jmsfbs/posts/published/2870336-new-track-golden [.json]
./cohost-export/project/jmsfbs/posts/published/2983024-new-recording-proces [.png, .json]
./cohost-export/project/jmsfbs/posts/published/3125303-need-some-distance [.json]
./cohost-export/project/jmsfbs/posts/published/3136545-accidentally-deleted [.json]
./cohost-export/project/jmsfbs/posts/published/3395966-played-an-open-mic-a [.png, .json]
./cohost-export/project/jmsfbs/posts/published/3819063-recording-tea-and-s [.json]
./cohost-export/project/jmsfbs/posts/published/3984809-lone-wolf [.json]
./cohost-export/project/jmsfbs/posts/published/4470982-foundation-s2-is-pre [.json]
./cohost-export/project/jmsfbs/posts/published/5528127-finally-yes [.json]
./cohost-export/project/jmsfbs/posts/published/5654197-getting-back-to-my-r [.json]
./cohost-export/project/jmsfbs/posts/published/5814203-https-youtu-be-cej [.json]
./cohost-export/project/jmsfbs/posts/published/597784-howard-zinn-woody [.json]
./cohost-export/project/jmsfbs/posts/published/7254514-reading-spotify-bios [.json]
./cohost-export/project/jmsfbs/posts/published/733962-learning-to-drum-ev [.json]
./cohost-export/project/jmsfbs/posts/published/750274-breakdown-of-karen-c [.json]
./cohost-export/project/jmsfbs/posts/published/779120-how-sneakers-are-man [.json]
./cohost-export/project/jmsfbs/posts/published/827539-new-track-never [.json]
./cohost-export/project/jmsfbs/posts/published/829585-songwriting-process [.json]
./cohost-export/project/jmsfbs/posts/published/842542-new-guitars [.json]
./cohost-export/project/jmsfbs/posts/published/845495-i-don-t-really-like [.json]
./cohost-export/project/jmsfbs/posts/published/859426-scotch-crime [.json]
./cohost-export/project/jmsfbs/posts/published/881830-midi-thoughts [.jpg, .json]
./cohost-export/project/jmsfbs/posts/published/915980-starting-the-next-so [.json]
./cohost-export/project/jmsfbs/posts/published/942742-listening-to-fruscia [.json]
./cohost-export/project/jmsfbs/posts/published/943274-samorost [.json]
./cohost-export/project/jmsfbs-code/posts/published/1196048-druids-datadog-s-d [.json]
./cohost-export/project/jmsfbs-code/posts/published/1205386-parser-combinators [.json]
./cohost-export/project/jmsfbs-code/posts/published/1230738-nushell-snippets [.json]
./cohost-export/project/jmsfbs-code/posts/published/1706101-ouch-better-cli-ux [.json]
./cohost-export/project/jmsfbs-code/posts/published/1933414-chat-gpt-almost-alway [.json]
./cohost-export/project/jmsfbs-code/posts/published/2625470-moving-to-ghost [.json]
./cohost-export/project/jmsfbs-code/posts/published/4954186-learning-about-apl-a [.png, .json]
./cohost-export/project/jmsfbs-code/posts/published/5679848-typescript-tip-out [.json]
./cohost-export/project/jmsfbs-code/posts/published/5976016-postgres-implicit-l [.png, .json]
./cohost-export/project/jmsfbs-code/posts/published/739146-cohost-launched-rss [.json]
./cohost-export/project/jmsfbs-code/posts/published/927564-solid-js-and-presite [.json]
./cohost-export/project/jmsfbs/posts/published/1003900-pequod [.json]
./cohost-export/project/jmsfbs/posts/published/1270030-loopy-levels-in-fall [.json]
./cohost-export/project/jmsfbs/posts/published/1456278-bass-sound-design-w [.json]
./cohost-export/project/jmsfbs/posts/published/1460591-larry-sanders-show [.json]
./cohost-export/project/jmsfbs/posts/published/1572613-he-died-the-way-he-l [.json]
./cohost-export/project/jmsfbs/posts/published/160965-welcome-to-cohost [.json]
./cohost-export/project/jmsfbs/posts/published/2254047-this-is-huge-for-mus [.json]
./cohost-export/project/jmsfbs/posts/published/3025811-walls [.json]
./cohost-export/project/jmsfbs/posts/published/3172316-electronic-kit-upgra [.json]
./cohost-export/project/jmsfbs/posts/published/3175152-australia-votes-no-t [.json]
./cohost-export/project/jmsfbs/posts/published/3192439-kind-of-nice-to-see [.json]
./cohost-export/project/jmsfbs/posts/published/3259093-practicing-and-mayb [.json]
./cohost-export/project/jmsfbs/posts/published/3644766-open-mic-night [.jpg, .json]
./cohost-export/project/jmsfbs/posts/published/3673531-book-thoughts-hitch [.json]
./cohost-export/project/jmsfbs/posts/published/3982228-guitar-modelling-and [.json]
./cohost-export/project/jmsfbs/posts/published/4154545-gig-zoe-s-bar [.png, .json]
./cohost-export/project/jmsfbs/posts/published/5338982-how-can-you-not-noti [.json]
./cohost-export/project/jmsfbs/posts/published/536580-driving-around-today [.json]
./cohost-export/project/jmsfbs/posts/published/757625-frick-filler [.json]
./cohost-export/project/jmsfbs/posts/published/767388-hmmm [.png, .json]
./cohost-export/project/jmsfbs/posts/published/892822-automate-the-pedal-b [.json]
./cohost-export/project/jmsfbs/posts/published/904948-video-editing [.json]
./cohost-export/project/jmsfbs/posts/published/909617-new-track-wicked [.json]
./cohost-export/project/jmsfbs/silenced [.json, .json, .json]
./cohost-export/project/recede [.json, .json]
./cohost-export/user-notes [.json, .json, .json]
./cohost-export/project/jmsfbs-code/comments [.json, .json, .json]
./cohost-export/project/jmsfbs-code/posts/published/1042927-new-row-violates-row [.json]
./cohost-export/project/jmsfbs-code/posts/published/1059858-more-productive-on-t [.json]
./cohost-export/project/jmsfbs-code/posts/published/1201573-how-to-use-deno-debu [.json]
./cohost-export/project/jmsfbs-code/posts/published/1349307-ssh-import-id [.json]
./cohost-export/project/jmsfbs-code/posts/published/1687721-rss-feed-truncated [.json]
./cohost-export/project/jmsfbs-code/posts/published/1706541-installing-rust-carg [.json]
./cohost-export/project/jmsfbs-code/posts/published/2254888-using-nushell-to-saf [.json]
./cohost-export/project/jmsfbs-code/posts/published/4081387-secrets-service [.json]
./cohost-export/project/jmsfbs-code/posts/published/4092488-typescript-wishlist [.json]
./cohost-export/project/jmsfbs-code/posts/published/6633844-database-migrations [.json]
./cohost-export/project/jmsfbs-code/posts/published/716808-minor-website-update [.json]
./cohost-export/project/jmsfbs-code/posts/published/7238140-useful-alias-for-doc [.json]
./cohost-export/project/jmsfbs-code/posts/published/738657-puppeteer-and-docker [.json]
./cohost-export/project/jmsfbs-code/posts/published/773340-crossposting-cohost [.png, .json]
./cohost-export/project/jmsfbs-code/posts/published/779895-function-constr [.json]
./cohost-export/project/jmsfbs-code/posts/published/784666-joining-rss-feeds [.json]
./cohost-export/project/jmsfbs-code/posts/published/943384-svelte-kit-design-dec [.json]
./cohost-export/project/jmsfbs-code/posts/published/993001-knee-deep-in-postgre [.json]
./cohost-export/project/jmsfbs [.jpg, .jpg, .json]
./cohost-export/project/jmsfbs/comments [.json, .json, .json]
./cohost-export/project/jmsfbs/following [.json, .json, .json]
./cohost-export/project/jmsfbs/posts/published/1313723-df-psych-oddysey [.json]
./cohost-export/project/jmsfbs/posts/published/1342661-music-production-del [.json]
./cohost-export/project/jmsfbs/posts/published/1379548-freeing-up-your-hand [.json]
./cohost-export/project/jmsfbs/posts/published/161748-baldies [.json]
./cohost-export/project/jmsfbs/posts/published/1619795-playing-high-hats-fa [.json]
./cohost-export/project/jmsfbs/posts/published/174731-drum-rants-thought [.json]
./cohost-export/project/jmsfbs/posts/published/180406-you-definitely-notic [.json]
./cohost-export/project/jmsfbs/posts/published/2458613-i-wonder-when-chrome [.json]
./cohost-export/project/jmsfbs/posts/published/259394-new-skillet [.json]
./cohost-export/project/jmsfbs/posts/published/2895937-more-alone [.json]
./cohost-export/project/jmsfbs/posts/published/3213421-vst-vendors-and-thei [.json]
./cohost-export/project/jmsfbs/posts/published/364112-saw-3-turtles-and-a [.json]
./cohost-export/project/jmsfbs/posts/published/3656874-where-do-birds-go-af [.json]
./cohost-export/project/jmsfbs/posts/published/3852001-the-games-industry-m [.json]
./cohost-export/project/jmsfbs/posts/published/4053158-a-non-dogmatic-left [.json]
./cohost-export/project/jmsfbs/posts/published/4169376-dynamics-and-live-sh [.json]
./cohost-export/project/jmsfbs/posts/published/4310343-my-spotify-playlist [.json]
./cohost-export/project/jmsfbs/posts/published/5444811-why-cohost-is-good [.json]
./cohost-export/project/jmsfbs/posts/published/5516880-donald-glover-son-o [.json]
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions [.mp3, .jpg, .json]
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts [.jpg, .jpg, .json]
./cohost-export/project/jmsfbs/posts/published/603580-stuff-2022 [.json]
./cohost-export/project/jmsfbs/posts/published/752472-breaking-the-monopol [.json]
./cohost-export/project/jmsfbs/posts/published/823558-home-recording [.json]
./cohost-export [.json]

It’s the little things. group by all is another DuckDB nicety. Normally we’d have to specify every column we want to aggregate by. group by all just means: assume any unaggregated column is what we are grouping by. This lets us constantly revise our query without needing to update the group by to match the select.

Still too many results for me to comprehend, but the bulk of the data is posts so let’s focus on just the posts/*/ results.

select 
    (filename)
    .regexp_extract(
        '(.*)/(.*)(\..*)'
        , ['dir', 'filename', 'ext']
    )
    .unnest()
from read_text(
    './cohost-export/project/*/posts/*/**'
)
order by filename desc
limit 5
;
dir filename ext
./cohost-export/project/jmsfbs/posts/published/943274-samorost post .json
./cohost-export/project/jmsfbs/posts/published/942742-listening-to-fruscia post .json
./cohost-export/project/jmsfbs/posts/published/915980-starting-the-next-so post .json
./cohost-export/project/jmsfbs/posts/published/915951-new-tracks post .json
./cohost-export/project/jmsfbs/posts/published/909617-new-track-wicked post .json

Let’s parse that path a little more specifically now:

select 
    (filename)
    .regexp_extract(
        './cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)', 
        [ 'project'
        , 'status'
        , 'slug'
        , 'filename'
        , 'extension'
        ]
    )
    .unnest()
from read_text(
    './cohost-export/project/*/posts/*/**'
)
order by filename desc
limit 5
;
project status slug filename extension
jmsfbs published 943274-samorost post json
jmsfbs published 942742-listening-to-fruscia post json
jmsfbs published 915980-starting-the-next-so post json
jmsfbs published 915951-new-tracks post json
jmsfbs published 909617-new-track-wicked post json

That’s a useful query, lets give it a name:

create or replace view vw_post_files as
select 
    (filename)
    .regexp_extract(
        './cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)', 
        [ 'project'
        , 'status'
        , 'slug'
        , 'filename'
        , 'extension'
        ]
    )
    .unnest()
    , filename as filepath
from read_text(
    './cohost-export/project/*/posts/*/**'
)
;

And we can query that by name now:

select * from vw_post_files
order by filename desc
limit 5
;
project status slug filename extension filepath
jmsfbs published 179858-feels-good-man-docu post json ./cohost-export/project/jmsfbs/posts/published/179858-feels-good-man-docu/post.json
jmsfbs published 752472-breaking-the-monopol post json ./cohost-export/project/jmsfbs/posts/published/752472-breaking-the-monopol/post.json
jmsfbs-code published 779895-function-constr post json ./cohost-export/project/jmsfbs-code/posts/published/779895-function-constr/post.json
jmsfbs-code published 784613-browser-automation-w post json ./cohost-export/project/jmsfbs-code/posts/published/784613-browser-automation-w/post.json
jmsfbs-code published 1042927-new-row-violates-row post json ./cohost-export/project/jmsfbs-code/posts/published/1042927-new-row-violates-row/post.json

DuckDB can arbitrarily unnest CTE’s, view definitions, and nested queries there is no runtime performance penality for composing SQL like this (not necessarilly true in other DB systems, DuckDB is special here).

Exploring JSON data

Let’s dive in to the actual post.json files now.

We can glob them from the filesystem trivially:

select 
    * exclude(blocks, filename)
from read_json('./cohost-export/**/*/*/post.json', union_by_name=true, filename=true)
order by singlePostPageURL desc
limit 5
;

I’ll explain union_by_name=true momentarily…

postId headline state cws tags pinned commentsLocked sharesLocked singlePostPageUrl publishedAt shareOfPost
943274 Samorost 1 [] [gamedev, art, games, documentary] false false false https://cohost.org/jmsfbs/post/943274-samorost 2023-02-01T11:05:41.324Z
942742 Listening to Frusciante / Klinghoffer today 1 [] [music, klinghoffer, frusciante, rhcp] false false false https://cohost.org/jmsfbs/post/942742-listening-to-fruscia 2023-02-01T07:07:44.007Z
915980 Starting the next song 1 [] [Midi, home recording, songwriting, crosspost:jmsfbs.com] false false false https://cohost.org/jmsfbs/post/915980-starting-the-next-so 2023-01-27T06:56:51.647Z
915951 New Tracks 1 [] [music, home recording, crosspost:jmsfbs.com] false false false https://cohost.org/jmsfbs/post/915951-new-tracks 2023-01-27T06:48:57.228Z
909617 New Track - Wicked 1 [] [music, songwriting, home recording, crosspost:jmsfbs.com] false false false https://cohost.org/jmsfbs/post/909617-new-track-wicked 2023-01-26T03:09:33.404Z

Yep, we just read json files from disk and DuckDB inferred a unified schema and returned it back to us as native rows.

Because the cohost post content is itself markdown its hard to nest in a markdown table, so I’ve excluded the blocks column (exclude(blocks)) just to get a sense of the high level structure. blocks is the column that contains each content block (markdown/attachments etc).

select * exclude ( columnA, columnB ) is another DuckDB nicety. Normally if we want to exclude a column in SQL we have to explicitly list every column we want to include, not so with DuckDB!

But because that content is just json, we can unnest again and take a look at the block content. We’ll deliberately skip blocks with nested codeblocks (as it would confuse this blog’s renderer).

with xs as (
    select 
        singlePostPageURL
        , unnest(blocks, recursive:= true)
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
select * 
from xs
-- ignore code blocks that would 
-- confuse Astro (`)
where content not ilike '%`%' 
order by singlePostPageURL desc
limit 5
;
singlePostPageUrl type content kind fileURL previewURL attachmentId altText width height artist title
https://cohost.org/jmsfbs/post/943274-samorost markdown https://www.youtube.com/watch?v=6W3kXEGy_GQ&t=1s
https://cohost.org/jmsfbs/post/942742-listening-to-fruscia markdown https://open.spotify.com/album/724MiwOVJq5YS9sr70RLif?si=QKtxPtDUTlyM1ZlJifMU-A
https://cohost.org/jmsfbs/post/915980-starting-the-next-so markdown The midi thing worked really well on the last track I think. So I'm going to try the same approach on this next track but also experiment a bit with the Haas effect to get a wider stereo image without actually panning at all. If that sounds fancy, I only just learned about it
https://cohost.org/jmsfbs/post/915980-starting-the-next-so markdown The song was originally called Moon Addict, but I think it may now be called Starsign. I'm definitely going to play around with the song, lyrics and arrangement as I wrote it close to 15 years ago and otherwise it wouldn't be too interesting for me or anyone else.
https://cohost.org/jmsfbs/post/915980-starting-the-next-so markdown So I've plugged in my Arturia Minilab (something that was previously gathering dust but is now potentially my favourite piece of kit) and I've started playing around with church organs. We'll see how it turns out!

Automatic schema detection

You can see we’ve got all these other columns appearing that are never used in our sample. By default DuckDB does a sample read when reading JSON / CSV / etc so it can infer the correct schema.

If different files have different structures that cannot be merged into a shared schema DuckDB will tell you it needs your input on how to proceed. You can ignore rows with errors in them, you can manually specifiy the schema, you can increase the sample threshold. For my usage I find union_by_name fixes 99% of my problems. It allows columns and keys to appear out of order (or not at all), but relies on the name of the field to join similar rows.

Transforming our data

We have discovered the structure of our data, we now need to convert it into an mdx file that Astro can digest.

It’s tempting to reach for a host language at this point. We want to create directories, and we want to generate frontmatter, typescript schemas, markdown files. Move file attachments around, update references in the markdown to the local reference and on and on.

But let’s see how far we can get with just the DuckDB CLI though. You might be surprised!

Frontmatter

Frontmatter is the name for the metadata at the top of a markdown document. It looks like this:

---
title: Parsing the Cohost Export with DuckDB
subtitle: Code really is data.
created: 2024-10-10T04:28:15.002Z
archived: false
featured: false
tags:
- cohost
- DuckDB
- data
- ops
- programming
---

That is the frontmatter for this post at time of writing. This metadata helps this blog generate web pages. We can query it at build time to e.g. get back all the posts that have a particular tag. Embedding the metadata in the post is a neat way of avoiding having separate external config files (or even a database) for collating the posts.

We want to generate this frontmatter to prefix the mdx file for all the cohost posts. DuckDB has no built in support for YAML (the underlying data type of frontmatter). But we can at least create a DuckDB structure of all the posts and then manually construct the yaml string.

Let’s query those fields and name them using this blog’s naming conventions:

select 
    headline as title
    , publishedAt as created
    , true as archived
    , false as featured
    , tags
    , regexp_extract(
        singlePostPageURL
        , '.*/post/(.*)'
        , ['slug']
    )
    .unnest()

from read_json(
    './cohost-export/**/*/*/post.json'
    , union_by_name=true
    , filename=true
)
order by singlePostPageURL desc
limit 5
;

With mode set to .mode line it almost looks like yaml.

    title = Samorost
  created = 2023-02-01T11:05:41.324Z
 archived = true
 featured = false
     tags = [gamedev, art, games, documentary]
     slug = 943274-samorost

    title = Listening to Frusciante / Klinghoffer today
  created = 2023-02-01T07:07:44.007Z
 archived = true
 featured = false
     tags = [music, klinghoffer, frusciante, rhcp]
     slug = 942742-listening-to-fruscia

    title = Starting the next song
  created = 2023-01-27T06:56:51.647Z
 archived = true
 featured = false
     tags = [Midi, home recording, songwriting, crosspost:jmsfbs.com]
     slug = 915980-starting-the-next-so

    title = New Tracks
  created = 2023-01-27T06:48:57.228Z
 archived = true
 featured = false
     tags = [music, home recording, crosspost:jmsfbs.com]
     slug = 915951-new-tracks

    title = New Track - Wicked
  created = 2023-01-26T03:09:33.404Z
 archived = true
 featured = false
     tags = [music, songwriting, home recording, crosspost:jmsfbs.com]
     slug = 909617-new-track-wicked

In fact, this would be the equivalent valid yml/frontmatter:

- title: Samorost
  created: 2023-02-01T11:05:41.324Z
  archived: true
  featured: false
  tags: [gamedev, art, games, documentary]
  slug: 943274-samorost

- title: Listening to Frusciante / Klinghoffer today
  created: 2023-02-01T07:07:44.007Z
  archived: true
  featured: false
  tags: [music, klinghoffer, frusciante, rhcp]
  slug: 942742-listening-to-fruscia

- title: Starting the next song
  created: 2023-01-27T06:56:51.647Z
  archived: true
  featured: false
  tags: [Midi, home recording, songwriting, crosspost:jmsfbs.com]
  slug: 915980-starting-the-next-so

  title: New Tracks
  created: 2023-01-27T06:48:57.228Z
  archived: true
  featured: false
  tags: [music, home recording, crosspost:jmsfbs.com]
  slug: 915951-new-tracks

- title: "New Track - Wicked"
  created: 2023-01-26T03:09:33.404Z
  archived: true
  featured: false
  tags: [music, songwriting, home recording, crosspost:jmsfbs.com]
  slug: 909617-new-track-wicked

When I first saw this output I was tempted to write this native DuckDB format to a file and regex replace it into yaml as it is so close. But instead we’ll generate the frontmatter using a general purpose technique because the intention of this post is to show off DuckDB not solve this specific problem.

Let’s adjust our query to produce a text column frontmatter that uses all the metadata from the previous query. Note the use of E'\n' we use this because the newline character is automatically escaped and it won’t render as expected. E'\n' is an unescaped / “raw” string.

select
    (
    ''
    || E'---'
    || E'\ntitle: ' 
    ||  printf('"%s"', headline)
    || E'\ncreated: ' 
    ||  publishedAt
    || E'\nfeatured: false'
    || E'\narchived: true'
    || E'\ntags: ' 
    ||  (tags || ['from:cohost'])
            .to_json()::text
    || E'\nslug: ' 
    ||  regexp_extract(
            singlePostPageURL
            , '.*/post/(.*)'
            , ['slug']
        )
        .slug
    || E'\n---'
    )
    as frontmatter
from read_json(
    './cohost-export/**/*/*/post.json'
    , union_by_name=true
)
order by singlePostPageURL desc
limit 5
;

We get this in .mode list

---
title: Samorost
created: 2023-02-01T11:05:41.324Z
featured: false
archived: true
tags: ["gamedev","art","games","documentary","from:cohost"]
slug: 943274-samorost
---
---
title: Listening to Frusciante / Klinghoffer today
created: 2023-02-01T07:07:44.007Z
featured: false
archived: true
tags: ["music","klinghoffer","frusciante","rhcp","from:cohost"]
slug: 942742-listening-to-fruscia
---
---
title: Starting the next song
created: 2023-01-27T06:56:51.647Z
featured: false
archived: true
tags: ["Midi","home recording","songwriting","crosspost:jmsfbs.com","from:cohost"]
slug: 915980-starting-the-next-so
---
---
title: New Tracks
created: 2023-01-27T06:48:57.228Z
featured: false
archived: true
tags: ["music","home recording","crosspost:jmsfbs.com","from:cohost"]
slug: 915951-new-tracks
---
---
title: New Track - Wicked
created: 2023-01-26T03:09:33.404Z
featured: false
archived: true
tags: ["music","songwriting","home recording","crosspost:jmsfbs.com","from:cohost"]
slug: 909617-new-track-wicked
---

Let’s bring that filename parsing logic in and add some of that metadata to the frontmatter too:

with xs as (
    select 
        * exclude (filename)
        , (filename)
            .regexp_extract(
                './cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)', 
                [ 'project'
                , 'status'
                , 'slug'
                , 'filename'
                , 'extension'
                ]
            )
            .unnest()
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
        , filename=true
    )
)
select 
    (
        ''
        || E'---'
        || E'\ntitle: ' 
        ||  printf('"%s"', headline)
        || E'\ncreated: ' ||  publishedAt
        || E'\featured: false'
        || E'\narchived: true'
        || E'\ncohost_project: ' || project
        || E'\ncohost_status: ' || status
        || E'\ntags: ' 
        ||  (tags || ['from:cohost'])
                .to_json()::text
        || E'\nslug: ' ||  slug
        || E'\n---'
    )
    as frontmatter
from xs
order by singlePostPageURL desc
limit 5
;

Let’s name that query via a view and with a few metadata columns for joining / debugging.

I’ve also added a few minor changes to handle edgecases I discovered while testing different pages.

create or replace view vw_post_frontmatter as
with xs as (
    select 
        * exclude (filename)
    , (filename)
        .regexp_extract(
            './cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)', 
            ['project'
            , 'status'
            , 'slug'
            , 'filename'
            , 'extension'
            ]
        )
        .unnest()
    
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
        , filename=true
    )
)
select 
    postId
    , headline
    , slug
    , printf('"%s"', headline) as title
    -- drafts don't have publishedAt
    , coalesce(
        publishedAt
        -- to ISO8601:
        , now()::json->>'$'
    ) as created 
    , false as featured
    , true as archived
    , project as cohost_project
    , status as cohost_status
    , (
        tags || ['from:cohost']
    )
      .to_json()::text as tags_text
    , (
    ''
    || E'---'
    || E'\ntitle: ' || title
    || E'\ncreated: ' || created
    || E'\nfeatured: ' || featured
    || E'\narchived: ' || archived
    || E'\ncohost_project: ' 
    ||  cohost_project
    || E'\ncohost_status: ' 
    ||  cohost_status
    || E'\ntags: ' || tags_text
    || E'\nslug: ' || slug
    || E'\n---'
    )
    as frontmatter
from xs
;

Joining the Markdown content

Now what about the markdown content? We need to glue that all together. Let’s try it with an aggregate function string_agg.

with xs as (
    select 
        headline
        , unnest(blocks) as block
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
select 
    headline
    , string_agg(
        block.markdown.content
        , E'\n\n'
    )[0:100] as content
from xs
where block.type = 'markdown'
group by all
limit 5
;
headline content
Recursive queries in postgres Was just reading this [great article](https://elvisciotti.medium.com/postgresql-recursive-query-the
Cohost launched RSS / JSON feeds This is cool: https://cohost.org/jmsfbs-code/rss/public.json\nI guess this would make it pretty easy
NGINX → Caddy Going to try and replace nginx with caddy on my website in 20-30 mins, let's see.\n---\nTemporarily
Joining RSS Feeds It'd be cool to have a single RSS feed for both my website and this cohost feed. Wouldn't be too ha
Pequod Thinking after the current song project I'll do one of my bands (Baby X) songs, and get my drummer (

I’ve sliced the first 100 chars, and limited to the first 5 rows. But we can see this is working.

But we also need to handle including attachments in the blog post. First we’ll build up a data structure for the non markdown blocks. Then we’ll integrate it into the post content.

with xs as (
    select 
        headline
        , unnest(
            blocks
            , recursive:= true
        ) as block
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
select *
from xs
where type <> 'markdown'
;
headline type content kind fileURL previewURL attachmentId altText width height artist title
Learning about APL and losing my mind attachment image https://staging.cohostcdn.org/attachment/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6/image.png https://staging.cohostcdn.org/attachment/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6/image.png af844fd0-0f9b-4b4f-bc02-168dee9ce1a6 1216 659
Postgres: Implicit lateral cross joins attachment image https://staging.cohostcdn.org/attachment/5097c2b0-d82e-4348-9e0a-84d073bcd35e/image.png https://staging.cohostcdn.org/attachment/5097c2b0-d82e-4348-9e0a-84d073bcd35e/image.png 5097c2b0-d82e-4348-9e0a-84d073bcd35e 1840 1180
Crossposting Cohost to Website attachment image https://staging.cohostcdn.org/attachment/68df2c18-9235-47af-96b5-c6515c6c3999/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png https://staging.cohostcdn.org/attachment/68df2c18-9235-47af-96b5-c6515c6c3999/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png 68df2c18-9235-47af-96b5-c6515c6c3999 2542 1330
Working on a simple website for my music attachment image https://staging.cohostcdn.org/attachment/e8c1ca83-3d03-486b-b4e3-e25052be83b4/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png https://staging.cohostcdn.org/attachment/e8c1ca83-3d03-486b-b4e3-e25052be83b4/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png e8c1ca83-3d03-486b-b4e3-e25052be83b4 1604 1512
Nerd points attachment image https://staging.cohostcdn.org/attachment/b7e09133-9f92-4cc9-bd76-222db1d614da/image.png https://staging.cohostcdn.org/attachment/b7e09133-9f92-4cc9-bd76-222db1d614da/image.png b7e09133-9f92-4cc9-bd76-222db1d614da 2328 600
New recording process attachment image https://staging.cohostcdn.org/attachment/c9d58d7f-5ddd-4389-b146-28005841c950/image.png https://staging.cohostcdn.org/attachment/c9d58d7f-5ddd-4389-b146-28005841c950/image.png c9d58d7f-5ddd-4389-b146-28005841c950 1516 529
Played an open mic again attachment image https://staging.cohostcdn.org/attachment/28f111aa-9a8c-4af5-9afa-b2dda4478679/397796805_841791357742803_767766949589955176_n.png https://staging.cohostcdn.org/attachment/28f111aa-9a8c-4af5-9afa-b2dda4478679/397796805_841791357742803_767766949589955176_n.png 28f111aa-9a8c-4af5-9afa-b2dda4478679 720 720
More giggin' attachment image https://staging.cohostcdn.org/attachment/b2be61c7-ea03-4ecc-9bb8-418aefd49601/img_3_1699499479580.jpg https://staging.cohostcdn.org/attachment/b2be61c7-ea03-4ecc-9bb8-418aefd49601/img_3_1699499479580.jpg b2be61c7-ea03-4ecc-9bb8-418aefd49601 993 1765
More giggin' attachment image https://staging.cohostcdn.org/attachment/144f0553-80e7-43f3-a6e1-ef01bf21df57/img_1_1699499465335.jpg https://staging.cohostcdn.org/attachment/144f0553-80e7-43f3-a6e1-ef01bf21df57/img_1_1699499465335.jpg 144f0553-80e7-43f3-a6e1-ef01bf21df57 993 1765
More giggin' attachment audio https://staging.cohostcdn.org/attachment/87270e9c-c88b-44eb-8c7d-d99d986790b0/Noth.mp3 https://staging.cohostcdn.org/attachment/87270e9c-c88b-44eb-8c7d-d99d986790b0/Noth.mp3 87270e9c-c88b-44eb-8c7d-d99d986790b0 Baby X Noth (bedroom jam)
Open mic night attachment image https://staging.cohostcdn.org/attachment/e2d702b0-b6dc-4cb7-891b-82b80bce76d6/IMG_20231124_104548.jpg https://staging.cohostcdn.org/attachment/e2d702b0-b6dc-4cb7-891b-82b80bce76d6/IMG_20231124_104548.jpg e2d702b0-b6dc-4cb7-891b-82b80bce76d6 960 751
Gig @ Zoe's Bar attachment image https://staging.cohostcdn.org/attachment/486822cd-d74f-46eb-a3fb-9aa4ac34c174/image.png https://staging.cohostcdn.org/attachment/486822cd-d74f-46eb-a3fb-9aa4ac34c174/image.png 486822cd-d74f-46eb-a3fb-9aa4ac34c174 679 641
Witcher 3, RPGs and gamefeel? attachment image https://staging.cohostcdn.org/attachment/162b7d07-dcb6-44e9-85d4-8b6bd6d98410/Nova_pict_131.webp https://staging.cohostcdn.org/attachment/162b7d07-dcb6-44e9-85d4-8b6bd6d98410/Nova_pict_131.webp 162b7d07-dcb6-44e9-85d4-8b6bd6d98410 832 624
China Doll Sessions attachment audio https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3 https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3 c7457348-f770-408a-8d54-317f7ac33ffe jmsfbs, baby x china doll guitar chain experiment
China Doll Sessions attachment image https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg e41ee3e0-be75-402c-8b5c-507fbca1124a I re-arranged my room, this is the new recording setup. 4032 2268
China Doll Sessions attachment audio https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3 https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3 1039f2fa-9519-4b66-8b52-2704b1967a0e jmsfbs, baby x china doll guitar chain experiment 2
Stacklands (and RTS in general) attachment image https://staging.cohostcdn.org/attachment/4c3fcac7-4e8b-4a41-a312-caef366c9087/header.jpg https://staging.cohostcdn.org/attachment/4c3fcac7-4e8b-4a41-a312-caef366c9087/header.jpg 4c3fcac7-4e8b-4a41-a312-caef366c9087 460 215
Stacklands (and RTS in general) attachment image https://staging.cohostcdn.org/attachment/a9e895bf-10a3-42d8-9fdc-5975629131a9/Warcraft-1-fog-of-war.jpg https://staging.cohostcdn.org/attachment/a9e895bf-10a3-42d8-9fdc-5975629131a9/Warcraft-1-fog-of-war.jpg a9e895bf-10a3-42d8-9fdc-5975629131a9 590 472
attachment image https://staging.cohostcdn.org/attachment/a5d93351-de36-469b-8ae0-0857ca09985a/Screenshot%202023-01-02%20193640.png https://staging.cohostcdn.org/attachment/a5d93351-de36-469b-8ae0-0857ca09985a/Screenshot%202023-01-02%20193640.png a5d93351-de36-469b-8ae0-0857ca09985a 2560 1440
Midi Thoughts attachment image https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg fc7337ec-1135-484f-9ede-cf9afc075300 4080 3072

Let’s see what type of attachments we have:

with xs as (
    select 
        headline
        , unnest(
            blocks
            , recursive:= true
        ) as block
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
, ys as (
    select 
        type
        , kind
        , parse_filename(fileURL)
            .regexp_split_to_array('\.')[-1] 
            as ext
    from xs
    where type = 'attachment'
)
pivot ys
on ext
using count(*)
;
type kind jpg mp3 png webp
attachment audio 0 3 0 0
attachment image 7 0 9 1

At least for my dataset, this is all I have to consider:

The cohost export comes prebundled with all the file attachments. This makes a lot of sense as they are planning to delete all user data at the end of the year. But at time of writing, those fileURL and previewURL URLs are still live. So for simplicity I’m just going to re-download them instead of joining on filename (which can be a little risky in terms of name collisions).

Let’s start by making a virtual directory struture for our posts. We’ll drop the attachments into this virtual file system.

with xs as (
    select 
        headline
        , unnest(
            blocks
            , recursive:= true
        ) as block
        , * exclude (blocks)
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
select 
    (singlePostpageURL)
        .parse_filename() as slug
    , (fileURL)
        .parse_path()[-2:-1]
        .array_to_string('-') as filename
    , fileURL as src
    , printf(
        './%s/%s', slug, filename
    ) as dest
from xs
where type <> 'markdown'
;
slug filename src dest
4954186-learning-about-apl-a af844fd0-0f9b-4b4f-bc02-168dee9ce1a6-image.png https://staging.cohostcdn.org/attachment/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6/image.png ./4954186-learning-about-apl-a/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6-image.png
5976016-postgres-implicit-l 5097c2b0-d82e-4348-9e0a-84d073bcd35e-image.png https://staging.cohostcdn.org/attachment/5097c2b0-d82e-4348-9e0a-84d073bcd35e/image.png ./5976016-postgres-implicit-l/5097c2b0-d82e-4348-9e0a-84d073bcd35e-image.png
773340-crossposting-cohost 68df2c18-9235-47af-96b5-c6515c6c3999-Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png https://staging.cohostcdn.org/attachment/68df2c18-9235-47af-96b5-c6515c6c3999/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png ./773340-crossposting-cohost/68df2c18-9235-47af-96b5-c6515c6c3999-Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png
926060-working-on-a-simple e8c1ca83-3d03-486b-b4e3-e25052be83b4-Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png https://staging.cohostcdn.org/attachment/e8c1ca83-3d03-486b-b4e3-e25052be83b4/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png ./926060-working-on-a-simple/e8c1ca83-3d03-486b-b4e3-e25052be83b4-Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png
8dff553847ec4a4a9d8d8be81553f017 086c8fa9-dc82-40e1-be6c-cd5abd390332-Screen%20Shot%202022-12-13%20at%205.46.40%20pm.png https://staging.cohostcdn.org/attachment/086c8fa9-dc82-40e1-be6c-cd5abd390332/Screen%20Shot%202022-12-13%20at%205.46.40%20pm.png ./8dff553847ec4a4a9d8d8be81553f017/086c8fa9-dc82-40e1-be6c-cd5abd390332-Screen%20Shot%202022-12-13%20at%205.46.40%20pm.png
8dff553847ec4a4a9d8d8be81553f017 981436c3-b99a-44fc-84dd-ac52b7db7ecd-a345fe50-9c38-4d6f-8cc4-4b7c14697a5d-620x372.webp https://staging.cohostcdn.org/attachment/981436c3-b99a-44fc-84dd-ac52b7db7ecd/a345fe50-9c38-4d6f-8cc4-4b7c14697a5d-620x372.webp ./8dff553847ec4a4a9d8d8be81553f017/981436c3-b99a-44fc-84dd-ac52b7db7ecd-a345fe50-9c38-4d6f-8cc4-4b7c14697a5d-620x372.webp
8dff553847ec4a4a9d8d8be81553f017 dc5a6e47-e01e-4abe-94d6-42ac2d4de15b-fb91501a1da25da354f6cdde25271dee.jpg https://staging.cohostcdn.org/attachment/dc5a6e47-e01e-4abe-94d6-42ac2d4de15b/fb91501a1da25da354f6cdde25271dee.jpg ./8dff553847ec4a4a9d8d8be81553f017/dc5a6e47-e01e-4abe-94d6-42ac2d4de15b-fb91501a1da25da354f6cdde25271dee.jpg
2869791-nerd-points b7e09133-9f92-4cc9-bd76-222db1d614da-image.png https://staging.cohostcdn.org/attachment/b7e09133-9f92-4cc9-bd76-222db1d614da/image.png ./2869791-nerd-points/b7e09133-9f92-4cc9-bd76-222db1d614da-image.png
2983024-new-recording-proces c9d58d7f-5ddd-4389-b146-28005841c950-image.png https://staging.cohostcdn.org/attachment/c9d58d7f-5ddd-4389-b146-28005841c950/image.png ./2983024-new-recording-proces/c9d58d7f-5ddd-4389-b146-28005841c950-image.png
3395966-played-an-open-mic-a 28f111aa-9a8c-4af5-9afa-b2dda4478679-397796805_841791357742803_767766949589955176_n.png https://staging.cohostcdn.org/attachment/28f111aa-9a8c-4af5-9afa-b2dda4478679/397796805_841791357742803_767766949589955176_n.png ./3395966-played-an-open-mic-a/28f111aa-9a8c-4af5-9afa-b2dda4478679-397796805_841791357742803_767766949589955176_n.png
3457586-more-giggin b2be61c7-ea03-4ecc-9bb8-418aefd49601-img_3_1699499479580.jpg https://staging.cohostcdn.org/attachment/b2be61c7-ea03-4ecc-9bb8-418aefd49601/img_3_1699499479580.jpg ./3457586-more-giggin/b2be61c7-ea03-4ecc-9bb8-418aefd49601-img_3_1699499479580.jpg
3457586-more-giggin 144f0553-80e7-43f3-a6e1-ef01bf21df57-img_1_1699499465335.jpg https://staging.cohostcdn.org/attachment/144f0553-80e7-43f3-a6e1-ef01bf21df57/img_1_1699499465335.jpg ./3457586-more-giggin/144f0553-80e7-43f3-a6e1-ef01bf21df57-img_1_1699499465335.jpg
3457586-more-giggin 87270e9c-c88b-44eb-8c7d-d99d986790b0-Noth.mp3 https://staging.cohostcdn.org/attachment/87270e9c-c88b-44eb-8c7d-d99d986790b0/Noth.mp3 ./3457586-more-giggin/87270e9c-c88b-44eb-8c7d-d99d986790b0-Noth.mp3
3644766-open-mic-night e2d702b0-b6dc-4cb7-891b-82b80bce76d6-IMG_20231124_104548.jpg https://staging.cohostcdn.org/attachment/e2d702b0-b6dc-4cb7-891b-82b80bce76d6/IMG_20231124_104548.jpg ./3644766-open-mic-night/e2d702b0-b6dc-4cb7-891b-82b80bce76d6-IMG_20231124_104548.jpg
4154545-gig-zoe-s-bar 486822cd-d74f-46eb-a3fb-9aa4ac34c174-image.png https://staging.cohostcdn.org/attachment/486822cd-d74f-46eb-a3fb-9aa4ac34c174/image.png ./4154545-gig-zoe-s-bar/486822cd-d74f-46eb-a3fb-9aa4ac34c174-image.png
5638375-witcher-3-rp-gs-and 162b7d07-dcb6-44e9-85d4-8b6bd6d98410-Nova_pict_131.webp https://staging.cohostcdn.org/attachment/162b7d07-dcb6-44e9-85d4-8b6bd6d98410/Nova_pict_131.webp ./5638375-witcher-3-rp-gs-and/162b7d07-dcb6-44e9-85d4-8b6bd6d98410-Nova_pict_131.webp
5777910-china-doll-sessions c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3 ./5777910-china-doll-sessions/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3
5777910-china-doll-sessions e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg ./5777910-china-doll-sessions/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg
5777910-china-doll-sessions 1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3 ./5777910-china-doll-sessions/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3
584679-stacklands-and-rts 4c3fcac7-4e8b-4a41-a312-caef366c9087-header.jpg https://staging.cohostcdn.org/attachment/4c3fcac7-4e8b-4a41-a312-caef366c9087/header.jpg ./584679-stacklands-and-rts/4c3fcac7-4e8b-4a41-a312-caef366c9087-header.jpg
584679-stacklands-and-rts a9e895bf-10a3-42d8-9fdc-5975629131a9-Warcraft-1-fog-of-war.jpg https://staging.cohostcdn.org/attachment/a9e895bf-10a3-42d8-9fdc-5975629131a9/Warcraft-1-fog-of-war.jpg ./584679-stacklands-and-rts/a9e895bf-10a3-42d8-9fdc-5975629131a9-Warcraft-1-fog-of-war.jpg
767388-hmmm a5d93351-de36-469b-8ae0-0857ca09985a-Screenshot%202023-01-02%20193640.png https://staging.cohostcdn.org/attachment/a5d93351-de36-469b-8ae0-0857ca09985a/Screenshot%202023-01-02%20193640.png ./767388-hmmm/a5d93351-de36-469b-8ae0-0857ca09985a-Screenshot%202023-01-02%20193640.png
881830-midi-thoughts fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg ./881830-midi-thoughts/fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg

Now lets add a command colum cmd to download each file to the result set.

printf(
    'mkdir -p "%s"; wget "%s" -O "%s"'
    , parent_dir
    , src
    , dest
)
with xs as (
    select 
        headline
        , unnest(
            blocks
            , recursive:= true
        ) as block
        , * exclude (blocks)
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
select 
    (singlePostpageURL)
        .parse_filename() as slug
    , (fileURL)
        .parse_path()[-2:-1]
        .array_to_string('-') as filename
    , fileURL as src
    , printf(
        '../output/posts/%s/attachments'
        , slug
    ) as parent_dir
    , printf(
        '%s/%s'
        , parent_dir
        , filename
    ) as dest
    , printf(
        'mkdir -p "%s"; wget "%s" -O "%s"'
        , parent_dir
        , src
        , dest
    ) as cmd
from xs
where type <> 'markdown'
;

Here’s a sample of one of those generated commands:

mkdir -p "../output/posts/881830-midi-thoughts/attachments"; 
wget "https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg" 
    \ -O "../output/posts/881830-midi-thoughts/attachments/fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg"

Now if we just select cmd only, turn headers off (.header off), turn off all table formatting (.mode list) and set the output mode to a file (.once $outputFile), we can write these commands to a script on disk.

.mode list
.header off
.once ../output/download-attachment.sh
with xs as (
    select 
        headline
        , unnest(
            blocks
            , recursive:= true
        ) as block
        , * exclude (blocks)
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
, ys as (
    select 
        (singlePostpageURL)
            .parse_filename() as slug
        , (fileURL)
            .parse_path()[-2:-1]
            .array_to_string('-') as filename
        , fileURL as src
        , printf(
            '../output/posts/%s/attachments'
            , slug
        ) as parent_dir
        , printf(
            '%s/%s'
            , parent_dir
            , filename
        ) as dest
        , printf(
            'mkdir -p "%s"; wget "%s" -O "%s"'
            , parent_dir
            , src
            , dest
        ) as cmd
    from xs
    where type <> 'markdown'
)
select cmd from ys
;

Now we have a script on our filesystem where each line downloads 1 attachment via wget. It looks like this (substituting the $variables for real values):

mkdir -p "$parent_dir"; 
wget "$src1" -O "$dest1";
mkdir -p "$parent_dir"; 
wget "$src2" -O "$dest2";
mkdir -p "$parent_dir"; 
wget "$src3" -O "$dest3";
# etc...

We can also run bash commands from the DuckDB shell.

E.g.

.shell ls cohost-export
project  user-notes  user.json

So let’s run our script:

.shell bash ../output/download-attachment.sh

And after running it, we can see the downloads worked as expected:

select 
    filename
    , size
    , last_modified
from read_text('../output/posts/**')
;
filename size last_modified
../output/posts/2869791-nerd-points/attachments/b7e09133-9f92-4cc9-bd76-222db1d614da-image.png 289078 2024-10-10 07:44:45
../output/posts/2983024-new-recording-proces/attachments/c9d58d7f-5ddd-4389-b146-28005841c950-image.png 45105 2024-10-10 07:44:45
../output/posts/3395966-played-an-open-mic-a/attachments/28f111aa-9a8c-4af5-9afa-b2dda4478679-397796805_841791357742803_767766949589955176_n.png 423742 2024-10-10 07:44:45
../output/posts/3457586-more-giggin/attachments/144f0553-80e7-43f3-a6e1-ef01bf21df57-img_1_1699499465335.jpg 98520 2024-10-10 07:44:46
../output/posts/3457586-more-giggin/attachments/87270e9c-c88b-44eb-8c7d-d99d986790b0-Noth.mp3 2608900 2024-04-06 03:23:53
../output/posts/3457586-more-giggin/attachments/b2be61c7-ea03-4ecc-9bb8-418aefd49601-img_3_1699499479580.jpg 122023 2024-10-10 07:44:45
../output/posts/3644766-open-mic-night/attachments/e2d702b0-b6dc-4cb7-891b-82b80bce76d6-IMG_20231124_104548.jpg 111299 2024-10-10 07:44:46
../output/posts/4154545-gig-zoe-s-bar/attachments/486822cd-d74f-46eb-a3fb-9aa4ac34c174-image.png 519389 2024-10-10 07:44:47
../output/posts/4954186-learning-about-apl-a/attachments/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6-image.png 96019 2024-10-10 07:44:43
../output/posts/5638375-witcher-3-rp-gs-and/attachments/162b7d07-dcb6-44e9-85d4-8b6bd6d98410-Nova_pict_131.webp 51198 2024-10-10 07:44:47
../output/posts/5777910-china-doll-sessions/attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 1010415 2024-04-30 10:20:27
../output/posts/5777910-china-doll-sessions/attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 770088 2024-04-30 09:59:48
../output/posts/5777910-china-doll-sessions/attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg 957500 2024-10-10 07:44:47
../output/posts/584679-stacklands-and-rts/attachments/4c3fcac7-4e8b-4a41-a312-caef366c9087-header.jpg 23699 2024-10-10 07:44:47
../output/posts/584679-stacklands-and-rts/attachments/a9e895bf-10a3-42d8-9fdc-5975629131a9-Warcraft-1-fog-of-war.jpg 33466 2024-10-10 07:44:48
../output/posts/5976016-postgres-implicit-l/attachments/5097c2b0-d82e-4348-9e0a-84d073bcd35e-image.png 316259 2024-10-10 07:44:44
../output/posts/767388-hmmm/attachments/a5d93351-de36-469b-8ae0-0857ca09985a-Screenshot%202023-01-02%20193640.png 2622668 2024-10-10 07:44:48
../output/posts/773340-crossposting-cohost/attachments/68df2c18-9235-47af-96b5-c6515c6c3999-Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png 58603 2024-10-10 07:44:44
../output/posts/881830-midi-thoughts/attachments/fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg 824407 2024-10-10 07:44:49
../output/posts/926060-working-on-a-simple/attachments/e8c1ca83-3d03-486b-b4e3-e25052be83b4-Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png 2291287 2024-10-10 07:44:44

Building the post content

Previously we authored a query to build the post content but we skipped attachments. We’re now going to reference our locally downloaded attachments instead of skipping them.

with xs as (
    select 
        headline
        , unnest(blocks) as block
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
select 
    headline
    , block
    , (block.attachment.fileURL)
        .parse_path()[-2:-1]
        .array_to_string('-') as dest
    , (
        case block.type
            when 'markdown' 
                then block.markdown.content 
            when 'attachment' then 
                case block.attachment.kind
                    when 'audio' then 
                        printf(
                            '<Audio title={"%s"} artist={"%s"} src={import("./attachments/%s")} />'
                            , block.attachment.title
                            , block.attachment.artist, dest
                        )
                                            when 'image' then 
                        printf(
                            '<Image alttext={"%s"} src={import("./attachments/%s")} />'
                            , block.attachment.alttext
                            , dest
                        )
                end
        end
        
    ) as content
from xs
where headline = 'China Doll Sessions'
;

Gives us:

headline block dest content
China Doll Sessions {'type': attachment, 'markdown': NULL, 'attachment': {'kind': audio, 'fileURL': https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3, 'previewURL': https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3, 'attachmentId': c7457348-f770-408a-8d54 c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 <Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />
China Doll Sessions {'type': attachment, 'markdown': NULL, 'attachment': {'kind': image, 'fileURL': https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg, 'previewURL': https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg, 'attachmentId': e41ee3e0-be e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg <Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />
China Doll Sessions {'type': attachment, 'markdown': NULL, 'attachment': {'kind': audio, 'fileURL': https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3, 'previewURL': https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3, 'attachmentId': 1039f2fa-9519-4b66-8b52 1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 <Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />
China Doll Sessions {'type': markdown, 'markdown': {'content': This is an old song from my early baby x days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!}, 'attachment': NULL} This is an old song from my early baby x days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!
China Doll Sessions {'type': markdown, 'markdown': {'content': The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to s The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship. Th
China Doll Sessions {'type': markdown, 'markdown': {'content': It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.}, 'attachment': NULL} It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.
China Doll Sessions {'type': markdown, 'markdown': {'content': I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins.}, 'attachment': NULL} I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins.
China Doll Sessions {'type': markdown, 'markdown': {'content': I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators.
China Doll Sessions {'type': markdown, 'markdown': {'content': Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.}, 'attachment': NULL} Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.
China Doll Sessions {'type': markdown, 'markdown': {'content': Attached is a loop of that approach in practice.}, 'attachment': NULL} Attached is a loop of that approach in practice.
China Doll Sessions {'type': markdown, 'markdown': {'content': My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.}, 'attachment': NULL} My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.

Now that content field can be string_agg’d

with xs as (
    select 
        headline
        , unnest(blocks) as block
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
, ys as (
    select 
        headline
        , block
        , (block.attachment.fileURL)
            .parse_path()[-2:-1]
            .array_to_string('-') as dest
        , (
            case block.type
                when 'markdown' then 
                    block.markdown.content 
                when 'attachment' then 
                    case block.attachment.kind
                        when 'audio' then 
                            printf(
                                '<Audio title={"%s"} artist={"%s"} src={import("./attachments/%s")} />'
                                , block.attachment.title
                                , block.attachment.artist
                                , dest
                            )
                                                    when 'image' then 
                            printf(
                                '<Image alttext={"%s"} src={import("./attachments/%s")} />'
                                , block.attachment.alttext
                                , dest
                            )
                    end
            end
        ) as content
    from xs
)
select 
    headline
    , string_agg(content, E'\n\n') as content
from ys
group by all
;

If we make that a view too…

create or replace view vw_post_content as
with xs as (
    select 
        postId
        , headline
        , unnest(blocks) as block 
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    )
)
, ys as (
    select 
        postId
        , headline
        , block
        , (block.attachment.fileURL)
            .parse_path()[-2:-1]
            .array_to_string('-') as dest
        , (
            case block.type
                when 'markdown' then 
                    block.markdown.content 
                when 'attachment' then 
                    case block.attachment.kind
                        when 'audio' then 
                            printf(
                                '<Audio title={"%s"} artist={"%s"} src={import("./attachments/%s")} />'
                                , block.attachment.title
                                , block.attachment.artist
                                , dest
                            )
                        when 'image' then 
                            printf(
                                '<Image alttext={"%s"} src={import("./attachments/%s")} />'
                                , block.attachment.alttext
                                , dest
                            )
                    end
            end
        ) as content
    from xs
)
select 
    postId
    , headline
    , string_agg(content, E'\n\n') as content
from ys
group by all
;

We can have a look at one of the posts:

.mode list
select 
    content
from vw_post_content
where headline = 'China Doll Sessions'
;

And we get this:

<Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />

<Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />

<Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />

This is an old song from my early [baby x](https://babyx.bandcamp.com/album/i-am-1-am) days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!

The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship.  That whole situation led us to being temporarily estranged.  This was myspace drama days!

It's a been a long time since I last recorded anything.  I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.

I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big.  This eluded me for so long, and I'm just using free and stock plugins.

I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers.  So instead I've been struggling with impulse response files and stock amp / cabinet simulators.  

Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.

Attached is a loop of that approach in practice.

My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.

You’ll note I’m dropping components into my markdown. This is only possible because Astro supports MDX. I’m using MDX instead of Markdown as it makes it possible to drop in rich media for embeds.

Now we can glue the frontmatter with the content with a simple join.

select 
    frontmatter 
    || E'\n\n' 
    || content as content
from vw_post_content
inner join vw_post_frontmatter using(headline)
where headline = 'China Doll Sessions'
;

And we get this:

---
title: China Doll Sessions
created: 2024-04-30T09:59:48.880Z
featured: false
archived: true
cohost_project: jmsfbs
cohost_status: published
tags: ["recording","baby x","band","myspace","Drama","music","from:cohost"]
slug: 5777910-china-doll-sessions
---

<Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />

<Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />

<Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />

This is an old song from my early [baby x](https://babyx.bandcamp.com/album/i-am-1-am) days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!

The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship.  That whole situation led us to being temporarily estranged.  This was myspace drama days!

It's a been a long time since I last recorded anything.  I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.

I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big.  This eluded me for so long, and I'm just using free and stock plugins.

I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers.  So instead I've been struggling with impulse response files and stock amp / cabinet simulators.  

Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.

Attached is a loop of that approach in practice.

My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.

Importing components

Finally, if an Audio or Image attachment is used we need to import that component in the MDX content. Let’s inject that if it is required for the post. First let’s detect if there’s any audio in a post.

select 
    json_contains(
        blocks
        , { kind: 'audio' }
    ) as import_audio
    ,json_contains(
        blocks
        , { kind: 'image' }
    ) as import_image
    -- ...

And then inject the imports:

select
    -- ...
    || E'\n' 
    || case when import_audio 
        then E'\nimport Audio from "../../../components/Audio.astro"' 
        else '' 
    end
    || case when import_image 
        then E'\nimport Image from "../../../components/Image.astro"' 
        else '' 
    end
    || E'\n\n'
    -- ...

In context:

select 
    json_contains(
        blocks
        , { kind: 'audio' }
    ) as import_audio
    ,json_contains(
        blocks
        , { kind: 'image' }
    ) as import_image
    
    ,frontmatter 
    || E'\n' 
    || case when import_audio 
        then E'\nimport Audio from "../../../components/Audio.astro"' 
        else '' end
    || case when import_image 
        then E'\nimport Image from "../../../components/Image.astro"' 
        else '' end
    || E'\n\n'
    || content as content
from read_json(
    './cohost-export/**/*/*/post.json'
    , union_by_name=true
) as data
inner join vw_post_content 
    as content using(postId)
inner join vw_post_frontmatter 
    as frontmatter using(postId)
where postId = (
    select postId from vw_post_content 
    where headline = 'China Doll Sessions'
)
;

That gives us:

---
title: China Doll Sessions
created: 2024-04-30T09:59:48.880Z
featured: false
archived: true
cohost_project: jmsfbs
cohost_status: published
tags: ["recording","baby x","band","myspace","Drama","music","from:cohost"]
slug: 5777910-china-doll-sessions
---

import Audio from "../../../components/Audio.astro"
import Image from "../../../components/Image.astro"

<Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />

<Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />

<Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />

This is an old song from my early [baby x](https://babyx.bandcamp.com/album/i-am-1-am) days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!

The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship.  That whole situation led us to being temporarily estranged.  This was myspace drama days!

It's a been a long time since I last recorded anything.  I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.

I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big.  This eluded me for so long, and I'm just using free and stock plugins.

I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers.  So instead I've been struggling with impulse response files and stock amp / cabinet simulators.  

Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.

Attached is a loop of that approach in practice.

My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.

We never left the DuckDB shell

I think it is worth calling out at this point that technically we never left the DuckDB shell. And as foreign as this may seem, the total code to do this is pretty concise. It’s also a lot easier to debug because we deal with result sets (data) until the very last moment (code is data and data is code).

Yeah, at one point we generated a shell script from data and executed that from within the DuckDB shell.

But that bash script is just an artifact, its an executable result set.

Postgres’ repl has a slash command gexec which lets you natively execute each row in a shell and get the result back as a new line in a result set. So I’m definitely not the first person to do this, its kind of cool!

Last step: Write the generated mdx to disk

We’ve generated the content, we’ve downloaded the attachment files. Let’s write it to disk.

Step 1, lets just write it to our output folder. And when we’re absolutely sure everything is perfect we’ll copy it into the real posts directory (used by posts like this), and astro should pick them up.

When in doubt, base64

Now for each post we need to write the content to disk at a different location.

We’ll use base64 so we don’t have to worry about shell escaping. We’ll encode the post as base64, and embed that in a command like so:

echo $base64 \
| base64 --decode \
> ./posts/$slug/index.mdx

First we change the output format, and set up the next command to write to our shell script

.mode list
.header off
.once ../output/write-content-to-disk.sh
with xs as (
    select 
        json_contains(
            blocks
            , { kind: 'audio' }
        ) as import_audio
        ,json_contains(
            blocks
            , { kind: 'image' }
        ) as import_image
        ,headline
        ,postId
        ,slug
        ,frontmatter 
        || E'\n' 
        || case 
            when import_audio 
                then E'\nimport Audio from "../../../components/Audio.astro"' 
                else '' 
            end
        || case 
            when import_image 
                then E'\nimport Image from "../../../components/Image.astro"' 
                else '' 
            end
        || E'\n\n'
        || content as content
    from read_json(
        './cohost-export/**/*/*/post.json'
        , union_by_name=true
    ) as data
    inner join vw_post_content 
        as content using(postId, headline)
    inner join vw_post_frontmatter 
        as frontmatter using(postId, headline)
)
select 
    printf(
        'mkdir -p ../output/posts/%s; echo "%s" | base64 --decode > "../output/posts/%s/index.mdx"'
        , slug
        , to_base64(
            encode(content))
            , slug
        ) as cmd
from xs
;

We now have a script that looks like this (with a lot more lines):

echo "Ci0tLQp0aXRsZTogTW92aW5nIHRvIGdob3N0PwpjcmVhdGVkOiAyMDIzLTA4LTI2VDAyOjU0OjEzLjM3OFoKcHVpYmxpc2hlZDogdHJ1ZQphcmNoaXZlZDogZmFsc2UKdGFnczogWyJnaG9zdCIsImJsb2ciLCJmcm9tOmNvaG9zdCJdCnNsdWc6IDI2MjU0NzAtbW92aW5nLXRvLWdob3N0Ci0tLQoKCkknbSB0aGlua2luZyBvZiBtb3ZpbmcgdGhpcyBwYWdlIHRvIGdob3N0LiAgSSB3YXMgdXNpbmcgY29ob3N0J3MgcnNzIEFQSSB0byBmZWVkIG15IHdlYnNpdGUsIGJ1dCB0aGVyZSdzIGJlZW4gYW4gb3V0c3RhbmRpbmcgaXNzdWUgZm9yIHVzZXJuYW1lcyB3aXRoIGh5cGhlbnMgaW4gdGhlbSBmb3IgbWFueSBtb250aHMgbm93IGFuZCBJIGFsc28gZG9uJ3QgdGhpbmsgY29ob3N0IGlzIHRoZSByaWdodCB0b29sIGZvciB0aGF0IGpvYiBhbnl3YXkuCgpNeSB0aGlua2luZyBpcyB0byBob3N0IGEgd2FrZSBvbiByZXF1ZXN0IGdob3N0IGluc3RhbmNlIG9uIGZseS4gIFRoZSBkb2NrZXIgaW1hZ2UgbG9va3MgcHJldHR5IHN0cmFpZ2h0IGZvcndhcmQuICBXZSdsbCBzZWUgaG93IGl0IGdvZXMuIA==" | base64 --decode > "../output/posts/2625470-moving-to-ghost/index.mdx"
echo "Ci0tLQp0aXRsZTogCmNyZWF0ZWQ6IDIwMjQtMDItMDRUMDI6MzU6MTguMTkyWgpwdWlibGlzaGVkOiB0cnVlCmFyY2hpdmVkOiBmYWxzZQp0YWdzOiBbInByb2dyYW1taW5nIiwianMiLCJmcm9tOmNvaG9zdCJdCnNsdWc6IDQzNjk4MDYta2luZC1vZi1zdXJwcmlzZWQtaG8KLS0tCgoKS2luZCBvZiBzdXJwcmlzZWQgaG93IG9mdGVuIEkgdXNlIGxhYmVscyBpbiBqcyBub3cu" | base64 --decode > "../output/posts/4369806-kind-of-surprised-ho/index.mdx"

Let’s run it:

.shell bash ../output/write-content-to-disk.sh

It works! Here is a sample of the outputted directories:

And here is the generated file content in the index.mdx file referencing the correct attachments on disk.

Conclusion

This has been a long post! I hope it has been an interesting one. It’s cool to see the total code to do all of this in one screen is only 104 lines.

create or replace view vw_post_frontmatter as
-- ...
;

create or replace view vw_post_content as
-- ...
;

.mode list
.header off
.once ../output/download-attachment.sh

with xs as (
    select 
        headline, unnest(blocks, recursive:= true) as block, * exclude (blocks)
    from read_json('./cohost-export/**/*/*/post.json', union_by_name=true)
)
, ys as (
    select 
        (singlePostpageURL).parse_filename() as slug
        , (fileURL).parse_path()[-2:-1].array_to_string('-') as filename
        , fileURL as src
        , printf('../output/posts/%s/attachments', slug) as parent_dir
        , printf('%s/%s', parent_dir, filename) as dest
        , printf('mkdir -p "%s"; wget "%s" -O "%s"', parent_dir, src, dest) as cmd
    from xs
    where type <> 'markdown'
)
select cmd from ys
;

.mode list
.header off
.once ../output/write-content-to-disk.sh

with xs as (
    select 
        json_contains(blocks, { kind: 'audio' }) as import_audio
        ,json_contains(blocks, { kind: 'image' }) as import_image
        ,FM.headline
        ,postId
        ,slug
        ,frontmatter 
        || E'\n' 
        || case when import_audio then E'\nimport Audio from "../../../components/Audio.astro"' else '' end
        || case when import_image then E'\nimport Image from "../../../components/Image.astro"' else '' end
        || E'\n\n'
        || content as content
    from read_json('./cohost-export/**/*/*/post.json', union_by_name=true) as data
    inner join vw_post_content as C using(postId)
    inner join vw_post_frontmatter as FM using(postId)
)
select 
    printf('mkdir -p ../output/posts/%s; echo "%s" | base64 --decode > "../output/posts/%s/index.mdx"', slug, to_base64(encode(content)), slug) as cmd
from xs
;

.shell bash ../output/download-attachment.sh
.shell bash ../output/write-content-to-disk.sh

After a bit of time I’ll publish these Cohost posts on this site. First I need to think of a nice way to segment them from actual articles as when I render them all the home page looks a little busy. Additionally, some of the Cohost posts are close to being “real” articles I’d post here, so I may just upgrade them and republish in their own right.

Looking forward I’ll probably have a micro blog section that includes these processed cohosts and any other posts from around the web (e.g. Mastodon, BlueSky).

If you have a Cohost export and you’d like to convert it to a format like Markdown or MDX, I would recommend waiting around for someone to build a more robust tool.
But if you are interested in learning DuckDB I hope this has been a helpful practical demonstrattion of the power of DuckDB for general purpose data slicing and dicing.

Next time you’ve got to inspect some data and you’re reaching for tools like JQ maybe use it as an opportunity to try DuckDB out. I think it is so nice we can bring SQL to a domain where we’d normally have to learn some new and arcane DSL. Mastering SQL is non trivial, but most technical (and even semi-technical) people can write even a simple SQL query. That is very powerful and should be exploited if we want simpler systems to maintain.


Thank you for reading. If you’d like to reach out you can get in touch on Mastodon, BlueSky or Twitter.

Recent articles

Previous articles

Feeds