Parsing the Cohost Export with DuckDB
Posted on October 10, 2024
Cohost recently emailed me with a link to all the metadata for my account. There’s a lot there. There’s all your posts, images / attachments of course. But there’s also a lot of other useful metadata.
I want to take all my cohost posts and turn them into the format Astro expects. It shouldn’t be too hard, and we could do it with a simple node or python script, but it’d be a lot more fun to use DuckDB, so that’s what we’re going to do.
We are going to complete the task of migrating this cohost metadata into mdx files/attachments that Astro can consume with no assistance from a higher level programming language (like Node.js / python). We will use a little bash but even then the bash scripts will be generated by DuckDB queries and executed from within the DuckDB shell.
Some of this data is sort of personal, so I won’t host the whole payload (at least until I’ve gone through it) but I’ll include screenshots and samples of outputs as we go along so you can get an idea of how it all works and how cool DuckDB is.
What is DuckDB
In a nutshell, DuckDB is an in-process database that is designed to query a lot of data on your local machine. It’s also well known for its modern and convenient features. DuckDB can handle modern formats and syntax we’d never expect from a traditional database. It’s also very fast, and famous for running queries orders of magnitude faster than competing DBs (for its specific use case).
DuckDB is very good at importing random CSV’s or JSON data and turning it into structured and (easily queryable) binary.
DuckDB is designed to be used for offline data processing but that hasn’t stopped the larger community from using DuckDB in other contexts.
DuckDB has a partnership with a company Motherduck, that runs DuckDB on the cloud (primarily for use as a data warehouse). Lots of companies are also running DuckDB in WASM to power low latency dataviz and data analysis either entirely client side or hybrid client/server.
I think the most important thing to say about DuckDB is that it is a joy to use. You can feel the love that has gone into it. I would recommend it for anyone who is not super familiar with databases, but with the proviso that you are comfortable in some other existing host context (like Python, Bash, Node etc). Because DuckDB runs “in-process”, it is more of a native library than an external system, you always need some host context where you can run it.
Set up
First step, install DuckDB.
In my projects I usually have two git ignored folders input
and output
. I use it for tasks exactly like this, where I need to do some intermediary data processing but I don’t want to commit anything.
So step one, lets create those folders and ensure they are git ignored:
mkdir input output
echo "
input
output
" >> .gitignore
Now I’m going to put the cohost export in that folder and name it cohost-export
Querying the file system
Let’s query the filesystem using DuckDB to see what this export looks like. I’ve opened a terminal and run the DuckDB
cli from ./input
directory in this repo.
select filename
from read_text('./cohost-export/**')
;
filename |
---|
./cohost-export/project/jmsfbs-code/124035-d0b291a6-a10b-4424-8a9a-84bdce1a1d69-profile.jpeg |
./cohost-export/project/jmsfbs-code/comments/1b18d7cc-e64e-412e-a43a-af7d5b2db60f.json |
./cohost-export/project/jmsfbs-code/comments/288abe0b-66b3-4df7-8305-14b3e4449b92.json |
./cohost-export/project/jmsfbs-code/comments/36e4597b-5114-4e2d-89c6-b4ab7318f532.json |
./cohost-export/project/jmsfbs-code/comments/42c7829f-fecd-4791-a308-5b3e2d449892.json |
./cohost-export/project/jmsfbs-code/comments/7346f794-bd3a-48de-a119-f8ef20ea9e7a.json |
./cohost-export/project/jmsfbs-code/comments/c07362e9-1407-49cb-ba66-e1d4dab8615a.json |
./cohost-export/project/jmsfbs-code/comments/c4cbd27c-bb6c-4d94-8621-4738d43001b7.json |
./cohost-export/project/jmsfbs-code/comments/f7c98510-96dc-4599-8e77-73d7c9737029.json |
./cohost-export/project/jmsfbs-code/find-your-friends.json |
./cohost-export/project/jmsfbs-code/followers/MikeWake.json |
./cohost-export/project/jmsfbs-code/followers/ackasi.json |
./cohost-export/project/jmsfbs-code/followers/happy-hermit.json |
./cohost-export/project/jmsfbs-code/followers/jcolag.json |
./cohost-export/project/jmsfbs-code/followers/pygy.json |
./cohost-export/project/jmsfbs-code/followers/this-is-envy.json |
./cohost-export/project/jmsfbs-code/followers/tibber5.json |
./cohost-export/project/jmsfbs-code/following/MikeWake.json |
./cohost-export/project/jmsfbs-code/following/crosseye.json |
./cohost-export/project/jmsfbs-code/following/draknek.json |
./cohost-export/project/jmsfbs-code/following/jcolag.json |
./cohost-export/project/jmsfbs-code/following/pygy.json |
./cohost-export/project/jmsfbs-code/following/this-is-envy.json |
./cohost-export/project/jmsfbs-code/jmsfbs-code.json |
./cohost-export/project/jmsfbs-code/posts/published/1042927-new-row-violates-row/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1059858-more-productive-on-t/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1196048-druids-datadog-s-d/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1196053-css-anchor-positioni/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1201573-how-to-use-deno-debu/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1205386-parser-combinators/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1205822-what-i-made-the-comp/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1215585-how-to-disable-web-s/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1230738-nushell-snippets/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1302500-workaround-for-lack/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1312901-debugger-launch-nod/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1349274-just-upgraded-a-self/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1349307-ssh-import-id/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1353509-how-to-parse-dynamic/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1681286-making-my-bluetooth/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1687721-rss-feed-truncated/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1706101-ouch-better-cli-ux/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1706541-installing-rust-carg/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1829684-simple-secret-sharin/post.json |
./cohost-export/project/jmsfbs-code/posts/published/1933414-chat-gpt-almost-alway/post.json |
./cohost-export/project/jmsfbs-code/posts/published/2254888-using-nushell-to-saf/post.json |
./cohost-export/project/jmsfbs-code/posts/published/2625470-moving-to-ghost/post.json |
./cohost-export/project/jmsfbs-code/posts/published/4081387-secrets-service/post.json |
./cohost-export/project/jmsfbs-code/posts/published/4092488-typescript-wishlist/post.json |
./cohost-export/project/jmsfbs-code/posts/published/4362243-repeat-0-1fr-doesn/post.json |
./cohost-export/project/jmsfbs-code/posts/published/4369806-kind-of-surprised-ho/post.json |
./cohost-export/project/jmsfbs-code/posts/published/4893642-recursive-queries-in/post.json |
./cohost-export/project/jmsfbs-code/posts/published/4954186-learning-about-apl-a/image.png |
./cohost-export/project/jmsfbs-code/posts/published/4954186-learning-about-apl-a/post.json |
./cohost-export/project/jmsfbs-code/posts/published/5262412-typescript-performan/post.json |
./cohost-export/project/jmsfbs-code/posts/published/5555101-github-copilot/post.json |
./cohost-export/project/jmsfbs-code/posts/published/5679848-typescript-tip-out/post.json |
./cohost-export/project/jmsfbs-code/posts/published/5976016-postgres-implicit-l/image.png |
./cohost-export/project/jmsfbs-code/posts/published/5976016-postgres-implicit-l/post.json |
./cohost-export/project/jmsfbs-code/posts/published/6559941-learning-godot-has-c/post.json |
./cohost-export/project/jmsfbs-code/posts/published/6633844-database-migrations/post.json |
./cohost-export/project/jmsfbs-code/posts/published/696437-interesting-blow-ran/post.json |
./cohost-export/project/jmsfbs-code/posts/published/716808-minor-website-update/post.json |
./cohost-export/project/jmsfbs-code/posts/published/7238140-useful-alias-for-doc/post.json |
./cohost-export/project/jmsfbs-code/posts/published/729566-keydiff/post.json |
./cohost-export/project/jmsfbs-code/posts/published/738657-puppeteer-and-docker/post.json |
./cohost-export/project/jmsfbs-code/posts/published/739146-cohost-launched-rss/post.json |
./cohost-export/project/jmsfbs-code/posts/published/766864-nginx-caddy/post.json |
./cohost-export/project/jmsfbs-code/posts/published/773340-crossposting-cohost/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png |
./cohost-export/project/jmsfbs-code/posts/published/773340-crossposting-cohost/post.json |
./cohost-export/project/jmsfbs-code/posts/published/779895-function-constr/post.json |
./cohost-export/project/jmsfbs-code/posts/published/784613-browser-automation-w/post.json |
./cohost-export/project/jmsfbs-code/posts/published/784666-joining-rss-feeds/post.json |
./cohost-export/project/jmsfbs-code/posts/published/926060-working-on-a-simple/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png |
./cohost-export/project/jmsfbs-code/posts/published/926060-working-on-a-simple/post.json |
./cohost-export/project/jmsfbs-code/posts/published/927564-solid-js-and-presite/post.json |
./cohost-export/project/jmsfbs-code/posts/published/943321-crypto-random-uuid/post.json |
./cohost-export/project/jmsfbs-code/posts/published/943384-svelte-kit-design-dec/post.json |
./cohost-export/project/jmsfbs-code/posts/published/993001-knee-deep-in-postgre/post.json |
./cohost-export/project/jmsfbs/28244-b8ae4e49-30db-4b31-bc27-b4b06a02a4b1-profile.jpg |
./cohost-export/project/jmsfbs/28244-ff52a913-0c06-40a2-ad25-016c19f52e86-profile.jpg |
./cohost-export/project/jmsfbs/comments/266a9fc9-96e5-4012-abd8-7e06c2c8ff16.json |
./cohost-export/project/jmsfbs/comments/3039c5c4-a107-4622-b2fd-8b116a718cc1.json |
./cohost-export/project/jmsfbs/comments/3acdd1e7-6704-451d-bcb5-449354d49f8f.json |
./cohost-export/project/jmsfbs/comments/5cf51c5a-f4ae-4f7c-9cb2-c35c0c348ab4.json |
./cohost-export/project/jmsfbs/comments/70ed3f4a-8279-434c-82e5-257d066bb4a2.json |
./cohost-export/project/jmsfbs/comments/79b2a664-ef49-4be1-b264-4f49aa9e47c9.json |
./cohost-export/project/jmsfbs/comments/7f6570ee-9075-49b5-99e7-ec209b6e6fc6.json |
./cohost-export/project/jmsfbs/comments/7f936dbe-4c63-4f5d-aa0f-a8c78d732297.json |
./cohost-export/project/jmsfbs/comments/925ac802-9823-4a7c-b1b3-19c9b777211e.json |
./cohost-export/project/jmsfbs/comments/94ba93a6-00c5-4e37-b871-346b3a0598b6.json |
./cohost-export/project/jmsfbs/comments/9f529de9-e29f-49c0-9d4b-88d2120c7ab1.json |
./cohost-export/project/jmsfbs/comments/a7643eef-e639-4b53-a483-3f52ed849646.json |
./cohost-export/project/jmsfbs/comments/a99ace8d-a252-487b-a937-715798c6f12c.json |
./cohost-export/project/jmsfbs/comments/b94ac2e5-b442-4c87-8adc-0de46b79eca1.json |
./cohost-export/project/jmsfbs/comments/c5bbb206-81f1-4b76-8f76-8032e4e2bf43.json |
./cohost-export/project/jmsfbs/comments/cad605ea-a97c-43f0-bdc6-d653d22d2ff1.json |
./cohost-export/project/jmsfbs/comments/cef9a3ba-608b-4364-b2d4-814db8280a5d.json |
./cohost-export/project/jmsfbs/comments/d5ed8974-93e8-401c-91f0-7c69d92e6f40.json |
./cohost-export/project/jmsfbs/comments/dd066a6a-d1fa-490c-9a04-845fa7d5cd3b.json |
./cohost-export/project/jmsfbs/comments/e918ff83-0d5a-40a2-9f11-8090411cf462.json |
./cohost-export/project/jmsfbs/comments/f1cde7b6-f5a6-4bb4-9a4e-adc9ec2edfd7.json |
./cohost-export/project/jmsfbs/comments/f54870a8-275b-40a2-9ac1-28aa1c7906eb.json |
./cohost-export/project/jmsfbs/comments/f7a60c03-7279-49d1-b5db-a16cffc237f4.json |
./cohost-export/project/jmsfbs/find-your-friends.json |
./cohost-export/project/jmsfbs/followers/BigBear.json |
./cohost-export/project/jmsfbs/followers/MikeWake.json |
./cohost-export/project/jmsfbs/followers/Noxulous.json |
./cohost-export/project/jmsfbs/followers/PadPalon.json |
./cohost-export/project/jmsfbs/followers/aacurate2.json |
./cohost-export/project/jmsfbs/followers/aloe.json |
./cohost-export/project/jmsfbs/followers/attica.json |
./cohost-export/project/jmsfbs/followers/danielleri.json |
./cohost-export/project/jmsfbs/followers/jcolag.json |
./cohost-export/project/jmsfbs/followers/llewellynn.json |
./cohost-export/project/jmsfbs/followers/lucidlucille.json |
./cohost-export/project/jmsfbs/followers/pygy.json |
./cohost-export/project/jmsfbs/followers/ridiculousdino.json |
./cohost-export/project/jmsfbs/followers/tokyoscarab.json |
./cohost-export/project/jmsfbs/following/Campster.json |
./cohost-export/project/jmsfbs/following/Jusiv.json |
./cohost-export/project/jmsfbs/following/Maid-Music.json |
./cohost-export/project/jmsfbs/following/PadPalon.json |
./cohost-export/project/jmsfbs/following/SabbiTabbi.json |
./cohost-export/project/jmsfbs/following/TiberiusMoreau.json |
./cohost-export/project/jmsfbs/following/Tribow.json |
./cohost-export/project/jmsfbs/following/Vectorpark.json |
./cohost-export/project/jmsfbs/following/accidentallymental.json |
./cohost-export/project/jmsfbs/following/barney.json |
./cohost-export/project/jmsfbs/following/boodoo.json |
./cohost-export/project/jmsfbs/following/bungo.json |
./cohost-export/project/jmsfbs/following/clempics.json |
./cohost-export/project/jmsfbs/following/community-jukebox.json |
./cohost-export/project/jmsfbs/following/crosseye.json |
./cohost-export/project/jmsfbs/following/css.json |
./cohost-export/project/jmsfbs/following/danielelliott.json |
./cohost-export/project/jmsfbs/following/danielleri.json |
./cohost-export/project/jmsfbs/following/df2jkpics.json |
./cohost-export/project/jmsfbs/following/draknek.json |
./cohost-export/project/jmsfbs/following/drewsof.json |
./cohost-export/project/jmsfbs/following/droqen.json |
./cohost-export/project/jmsfbs/following/eniko.json |
./cohost-export/project/jmsfbs/following/ewoud.json |
./cohost-export/project/jmsfbs/following/fasterthanlime.json |
./cohost-export/project/jmsfbs/following/geometric.json |
./cohost-export/project/jmsfbs/following/hollygramazio.json |
./cohost-export/project/jmsfbs/following/itsnatclayton.json |
./cohost-export/project/jmsfbs/following/jcolag.json |
./cohost-export/project/jmsfbs/following/johnnemann.json |
./cohost-export/project/jmsfbs/following/jwaaaap.json |
./cohost-export/project/jmsfbs/following/kenney.json |
./cohost-export/project/jmsfbs/following/llewellynn.json |
./cohost-export/project/jmsfbs/following/lucidlucille.json |
./cohost-export/project/jmsfbs/following/lutz.json |
./cohost-export/project/jmsfbs/following/objects.json |
./cohost-export/project/jmsfbs/following/pygy.json |
./cohost-export/project/jmsfbs/following/saint11.json |
./cohost-export/project/jmsfbs/following/staff.json |
./cohost-export/project/jmsfbs/following/sweatercinema.json |
./cohost-export/project/jmsfbs/following/terrycavanagh.json |
./cohost-export/project/jmsfbs/following/tha-rami.json |
./cohost-export/project/jmsfbs/following/the-technician.json |
./cohost-export/project/jmsfbs/following/thecatamites.json |
./cohost-export/project/jmsfbs/following/tipsheda.json |
./cohost-export/project/jmsfbs/following/tokyoscarab.json |
./cohost-export/project/jmsfbs/following/tomforsyth.json |
./cohost-export/project/jmsfbs/following/torcado.json |
./cohost-export/project/jmsfbs/following/vectorpoem.json |
./cohost-export/project/jmsfbs/following/wiredaemon.json |
./cohost-export/project/jmsfbs/following/ztiworoh.json |
./cohost-export/project/jmsfbs/jmsfbs.json |
./cohost-export/project/jmsfbs/posts/published/1003900-pequod/post.json |
./cohost-export/project/jmsfbs/posts/published/1008879-the-dinner/post.json |
./cohost-export/project/jmsfbs/posts/published/1016743-frog-2-tutorial/post.json |
./cohost-export/project/jmsfbs/posts/published/1046457-plugged-my-electric/post.json |
./cohost-export/project/jmsfbs/posts/published/1065458-spotify-getting-a-bi/post.json |
./cohost-export/project/jmsfbs/posts/published/1088733-garage-band-no-plugi/post.json |
./cohost-export/project/jmsfbs/posts/published/1105201-cloudpunk/post.json |
./cohost-export/project/jmsfbs/posts/published/1166928-just-finished-saints/post.json |
./cohost-export/project/jmsfbs/posts/published/1270030-loopy-levels-in-fall/post.json |
./cohost-export/project/jmsfbs/posts/published/1313723-df-psych-oddysey/post.json |
./cohost-export/project/jmsfbs/posts/published/1342661-music-production-del/post.json |
./cohost-export/project/jmsfbs/posts/published/1342883-midi-research-round/post.json |
./cohost-export/project/jmsfbs/posts/published/1379548-freeing-up-your-hand/post.json |
./cohost-export/project/jmsfbs/posts/published/1433411-cousins/post.json |
./cohost-export/project/jmsfbs/posts/published/1456123-i-paid-for-a-daw/post.json |
./cohost-export/project/jmsfbs/posts/published/1456278-bass-sound-design-w/post.json |
./cohost-export/project/jmsfbs/posts/published/1460591-larry-sanders-show/post.json |
./cohost-export/project/jmsfbs/posts/published/1480427-splitting-frequencie/post.json |
./cohost-export/project/jmsfbs/posts/published/1572613-he-died-the-way-he-l/post.json |
./cohost-export/project/jmsfbs/posts/published/160965-welcome-to-cohost/post.json |
./cohost-export/project/jmsfbs/posts/published/161748-baldies/post.json |
./cohost-export/project/jmsfbs/posts/published/1619795-playing-high-hats-fa/post.json |
./cohost-export/project/jmsfbs/posts/published/1712599-one-of-the-better-es/post.json |
./cohost-export/project/jmsfbs/posts/published/174731-drum-rants-thought/post.json |
./cohost-export/project/jmsfbs/posts/published/179858-feels-good-man-docu/post.json |
./cohost-export/project/jmsfbs/posts/published/180406-you-definitely-notic/post.json |
./cohost-export/project/jmsfbs/posts/published/185752-one-reason-might-be/post.json |
./cohost-export/project/jmsfbs/posts/published/1958396-new-drinking-game/post.json |
./cohost-export/project/jmsfbs/posts/published/220695-went-for-a-walk-toda/post.json |
./cohost-export/project/jmsfbs/posts/published/2254047-this-is-huge-for-mus/post.json |
./cohost-export/project/jmsfbs/posts/published/2458613-i-wonder-when-chrome/post.json |
./cohost-export/project/jmsfbs/posts/published/259394-new-skillet/post.json |
./cohost-export/project/jmsfbs/posts/published/2869791-nerd-points/image.png |
./cohost-export/project/jmsfbs/posts/published/2869791-nerd-points/post.json |
./cohost-export/project/jmsfbs/posts/published/2870336-new-track-golden/post.json |
./cohost-export/project/jmsfbs/posts/published/2895937-more-alone/post.json |
./cohost-export/project/jmsfbs/posts/published/2983024-new-recording-proces/image.png |
./cohost-export/project/jmsfbs/posts/published/2983024-new-recording-proces/post.json |
./cohost-export/project/jmsfbs/posts/published/3025811-walls/post.json |
./cohost-export/project/jmsfbs/posts/published/3112762-recording-progress-u/post.json |
./cohost-export/project/jmsfbs/posts/published/3125303-need-some-distance/post.json |
./cohost-export/project/jmsfbs/posts/published/3136545-accidentally-deleted/post.json |
./cohost-export/project/jmsfbs/posts/published/3169687-chameleon-days/post.json |
./cohost-export/project/jmsfbs/posts/published/3172316-electronic-kit-upgra/post.json |
./cohost-export/project/jmsfbs/posts/published/3175152-australia-votes-no-t/post.json |
./cohost-export/project/jmsfbs/posts/published/3192439-kind-of-nice-to-see/post.json |
./cohost-export/project/jmsfbs/posts/published/3213421-vst-vendors-and-thei/post.json |
./cohost-export/project/jmsfbs/posts/published/3259093-practicing-and-mayb/post.json |
./cohost-export/project/jmsfbs/posts/published/329629-stuff-like-this-driv/post.json |
./cohost-export/project/jmsfbs/posts/published/3349306-expensive-reverb-fo/post.json |
./cohost-export/project/jmsfbs/posts/published/3393436-the-last-beatles-son/post.json |
./cohost-export/project/jmsfbs/posts/published/3395966-played-an-open-mic-a/397796805_841791357742803_767766949589955176_n.png |
./cohost-export/project/jmsfbs/posts/published/3395966-played-an-open-mic-a/post.json |
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/Noth.mp3 |
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/img_1_1699499465335.jpg |
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/img_3_1699499479580.jpg |
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin/post.json |
./cohost-export/project/jmsfbs/posts/published/364112-saw-3-turtles-and-a/post.json |
./cohost-export/project/jmsfbs/posts/published/3644721-only-a-shadow/post.json |
./cohost-export/project/jmsfbs/posts/published/3644766-open-mic-night/IMG_20231124_104548.jpg |
./cohost-export/project/jmsfbs/posts/published/3644766-open-mic-night/post.json |
./cohost-export/project/jmsfbs/posts/published/3656874-where-do-birds-go-af/post.json |
./cohost-export/project/jmsfbs/posts/published/3673531-book-thoughts-hitch/post.json |
./cohost-export/project/jmsfbs/posts/published/3819063-recording-tea-and-s/post.json |
./cohost-export/project/jmsfbs/posts/published/3852001-the-games-industry-m/post.json |
./cohost-export/project/jmsfbs/posts/published/3982228-guitar-modelling-and/post.json |
./cohost-export/project/jmsfbs/posts/published/3984809-lone-wolf/post.json |
./cohost-export/project/jmsfbs/posts/published/4053158-a-non-dogmatic-left/post.json |
./cohost-export/project/jmsfbs/posts/published/4154545-gig-zoe-s-bar/image.png |
./cohost-export/project/jmsfbs/posts/published/4154545-gig-zoe-s-bar/post.json |
./cohost-export/project/jmsfbs/posts/published/4169376-dynamics-and-live-sh/post.json |
./cohost-export/project/jmsfbs/posts/published/4310343-my-spotify-playlist/post.json |
./cohost-export/project/jmsfbs/posts/published/4470982-foundation-s2-is-pre/post.json |
./cohost-export/project/jmsfbs/posts/published/5262397-found-out-today-some/post.json |
./cohost-export/project/jmsfbs/posts/published/5338982-how-can-you-not-noti/post.json |
./cohost-export/project/jmsfbs/posts/published/536580-driving-around-today/post.json |
./cohost-export/project/jmsfbs/posts/published/5444811-why-cohost-is-good/post.json |
./cohost-export/project/jmsfbs/posts/published/5444882-ursula-le-guin-s-a-w/post.json |
./cohost-export/project/jmsfbs/posts/published/5516880-donald-glover-son-o/post.json |
./cohost-export/project/jmsfbs/posts/published/5528127-finally-yes/post.json |
./cohost-export/project/jmsfbs/posts/published/5638375-witcher-3-rp-gs-and/Nova_pict_131.webp |
./cohost-export/project/jmsfbs/posts/published/5638375-witcher-3-rp-gs-and/post.json |
./cohost-export/project/jmsfbs/posts/published/5654197-getting-back-to-my-r/post.json |
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions/China%20Doll%202.mp3 |
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions/PXL_20240429_062918414.jpg |
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions/post.json |
./cohost-export/project/jmsfbs/posts/published/5814203-https-youtu-be-cej/post.json |
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts/Warcraft-1-fog-of-war.jpg |
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts/header.jpg |
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts/post.json |
./cohost-export/project/jmsfbs/posts/published/597784-howard-zinn-woody/post.json |
./cohost-export/project/jmsfbs/posts/published/603580-stuff-2022/post.json |
./cohost-export/project/jmsfbs/posts/published/631473-negroni/post.json |
./cohost-export/project/jmsfbs/posts/published/7254514-reading-spotify-bios/post.json |
./cohost-export/project/jmsfbs/posts/published/733962-learning-to-drum-ev/post.json |
./cohost-export/project/jmsfbs/posts/published/750274-breakdown-of-karen-c/post.json |
./cohost-export/project/jmsfbs/posts/published/752472-breaking-the-monopol/post.json |
./cohost-export/project/jmsfbs/posts/published/752730-the-true-believers/post.json |
./cohost-export/project/jmsfbs/posts/published/757625-frick-filler/post.json |
./cohost-export/project/jmsfbs/posts/published/766452-home-recording-journ/post.json |
./cohost-export/project/jmsfbs/posts/published/767388-hmmm/Screenshot%202023-01-02%20193640.png |
./cohost-export/project/jmsfbs/posts/published/767388-hmmm/post.json |
./cohost-export/project/jmsfbs/posts/published/779120-how-sneakers-are-man/post.json |
./cohost-export/project/jmsfbs/posts/published/7810959-gonna-miss-this-plac/post.json |
./cohost-export/project/jmsfbs/posts/published/823558-home-recording/post.json |
./cohost-export/project/jmsfbs/posts/published/827539-new-track-never/post.json |
./cohost-export/project/jmsfbs/posts/published/829585-songwriting-process/post.json |
./cohost-export/project/jmsfbs/posts/published/842542-new-guitars/post.json |
./cohost-export/project/jmsfbs/posts/published/845495-i-don-t-really-like/post.json |
./cohost-export/project/jmsfbs/posts/published/859426-scotch-crime/post.json |
./cohost-export/project/jmsfbs/posts/published/881830-midi-thoughts/PXL_20230121_083234984_2.jpg |
./cohost-export/project/jmsfbs/posts/published/881830-midi-thoughts/post.json |
./cohost-export/project/jmsfbs/posts/published/892822-automate-the-pedal-b/post.json |
./cohost-export/project/jmsfbs/posts/published/904948-video-editing/post.json |
./cohost-export/project/jmsfbs/posts/published/909617-new-track-wicked/post.json |
./cohost-export/project/jmsfbs/posts/published/915951-new-tracks/post.json |
./cohost-export/project/jmsfbs/posts/published/915980-starting-the-next-so/post.json |
./cohost-export/project/jmsfbs/posts/published/942742-listening-to-fruscia/post.json |
./cohost-export/project/jmsfbs/posts/published/943274-samorost/post.json |
./cohost-export/project/jmsfbs/silenced/3amsoda.json |
./cohost-export/project/jmsfbs/silenced/ReynaNeko.json |
./cohost-export/project/jmsfbs/silenced/batluster.json |
./cohost-export/project/jmsfbs/silenced/endingsummary.json |
./cohost-export/project/jmsfbs/silenced/kyn.json |
./cohost-export/project/jmsfbs/silenced/melinoe.json |
./cohost-export/project/recede/find-your-friends.json |
./cohost-export/project/recede/recede.json |
./cohost-export/user-notes/ewoud.json |
./cohost-export/user-notes/llewellynn.json |
./cohost-export/user-notes/ztiworoh.json |
./cohost-export/user.json |
Hmm that’s a lot of data, its hard for me to get a sense of the structure. Let’s see if we can aggregate it a bit.
First let’s parse the path into usable metadata:
select
(filename)
.regexp_extract(
'(.*)/(.*)(\..*)'
, ['dir', 'filename', 'ext']
) as struct
from read_text('./cohost-export/**')
limit 5
;
struct |
---|
{'dir': ./cohost-export/project/jmsfbs-code, 'filename': 124035-d0b291a6-a10b-4424-8a9a-84bdce1a1d69-profile, 'ext': .jpeg} |
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 1b18d7cc-e64e-412e-a43a-af7d5b2db60f, 'ext': .json} |
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 288abe0b-66b3-4df7-8305-14b3e4449b92, 'ext': .json} |
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 36e4597b-5114-4e2d-89c6-b4ab7318f532, 'ext': .json} |
{'dir': ./cohost-export/project/jmsfbs-code/comments, 'filename': 42c7829f-fecd-4791-a308-5b3e2d449892, 'ext': .json} |
That’s great, but it’d be better if dir
, filename
and ext
were separate columns, so let’s use unnest()
select
(filename)
.regexp_extract(
'(.*)/(.*)(\..*)'
, ['dir', 'filename', 'ext']
)
.unnest()
from read_text('./cohost-export/**')
limit 5
;
dir | filename | ext |
---|---|---|
./cohost-export/project/jmsfbs-code | 124035-d0b291a6-a10b-4424-8a9a-84bdce1a1d69-profile | .jpeg |
./cohost-export/project/jmsfbs-code/comments | 1b18d7cc-e64e-412e-a43a-af7d5b2db60f | .json |
./cohost-export/project/jmsfbs-code/comments | 288abe0b-66b3-4df7-8305-14b3e4449b92 | .json |
./cohost-export/project/jmsfbs-code/comments | 36e4597b-5114-4e2d-89c6-b4ab7318f532 | .json |
./cohost-export/project/jmsfbs-code/comments | 42c7829f-fecd-4791-a308-5b3e2d449892 | .json |
Great!
Casually introducing advanced features
Already we’ve seen three pretty advanced features of DuckDB.
- We’re querying the filesystem with SQL
- We’re able to build nested structs of data
- We’re able to dot chain methods within SQL
All of this is quite convenient but also quite unusual (for most databases). Normally a database runs on a server and doesn’t have access to local data on your filesystem.
There are other in-process databases (like SQLite) but they likely don’t have the same syntactic convenience.
Other databases like Postgres have nested data in various forms but there is no easy to use anonymous structs. You can only deal with nested data with predefined types.
Having nested anonymous binary data in your result sets is incredible powerful. It has surprised me how often I reach for it now.
Aggregating results
Okay, lets get back to aggregating. Let’s group by dir
and collect filename/ext into an array and take a sample of the files so we can get a sense of the contents.
Let’s take the query we already had, and wrap it in a Common Table Expression (CTE). Which is basically just a pipeline of SQL
The CTE below is equivalent to the query before, but we can start to add more blocks.
with parsed as (
select
(filename)
.regexp_extract(
'(.*)/(.*)(\..*)'
, ['dir', 'filename', 'ext']
)
.unnest()
from read_text('./cohost-export/**')
)
select * from parsed
limit 5
;
By the way, I’m just writing all this SQL in a markdown file in VSCode. In the terminal I’ve got DuckDB running in the VSCode terminal, and I select the text and send it to the terminal (I’ve bound that to Ctrl/Command + Enter).
I’ve also told DuckDB to output tables in markdown format
.mode markdown
So I can just paste the results straight back into this blogpost.
Now let’s aggregate!
with parsed as (
select
(filename)
.regexp_extract(
'(.*)/(.*)(\..*)'
, ['dir', 'filename', 'ext']
)
.unnest()
from read_text('./cohost-export/**')
)
select
dir
, (array_agg(distinct (ext))) as contents
from parsed
group by all
;
dir | contents |
---|---|
./cohost-export/project/jmsfbs-code | [.jpeg, .json, .json] |
./cohost-export/project/jmsfbs-code/posts/published/1196053-css-anchor-positioni | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1205822-what-i-made-the-comp | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1215585-how-to-disable-web-s | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1302500-workaround-for-lack | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1312901-debugger-launch-nod | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1349274-just-upgraded-a-self | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1353509-how-to-parse-dynamic | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/4362243-repeat-0-1fr-doesn | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/4893642-recursive-queries-in | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/5262412-typescript-performan | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/696437-interesting-blow-ran | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/729566-keydiff | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/784613-browser-automation-w | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/926060-working-on-a-simple | [.png, .json] |
./cohost-export/project/jmsfbs/posts/published/1065458-spotify-getting-a-bi | [.json] |
./cohost-export/project/jmsfbs/posts/published/1480427-splitting-frequencie | [.json] |
./cohost-export/project/jmsfbs/posts/published/179858-feels-good-man-docu | [.json] |
./cohost-export/project/jmsfbs/posts/published/220695-went-for-a-walk-toda | [.json] |
./cohost-export/project/jmsfbs/posts/published/2869791-nerd-points | [.png, .json] |
./cohost-export/project/jmsfbs/posts/published/3112762-recording-progress-u | [.json] |
./cohost-export/project/jmsfbs/posts/published/3169687-chameleon-days | [.json] |
./cohost-export/project/jmsfbs/posts/published/329629-stuff-like-this-driv | [.json] |
./cohost-export/project/jmsfbs/posts/published/3349306-expensive-reverb-fo | [.json] |
./cohost-export/project/jmsfbs/posts/published/3393436-the-last-beatles-son | [.json] |
./cohost-export/project/jmsfbs/posts/published/3457586-more-giggin | [.mp3, .jpg, .jpg] |
./cohost-export/project/jmsfbs/posts/published/3644721-only-a-shadow | [.json] |
./cohost-export/project/jmsfbs/posts/published/5262397-found-out-today-some | [.json] |
./cohost-export/project/jmsfbs/posts/published/5444882-ursula-le-guin-s-a-w | [.json] |
./cohost-export/project/jmsfbs/posts/published/5638375-witcher-3-rp-gs-and | [.webp, .json] |
./cohost-export/project/jmsfbs/posts/published/631473-negroni | [.json] |
./cohost-export/project/jmsfbs/posts/published/752730-the-true-believers | [.json] |
./cohost-export/project/jmsfbs/posts/published/766452-home-recording-journ | [.json] |
./cohost-export/project/jmsfbs/posts/published/7810959-gonna-miss-this-plac | [.json] |
./cohost-export/project/jmsfbs/posts/published/915951-new-tracks | [.json] |
./cohost-export/project/jmsfbs-code/followers | [.json, .json, .json] |
./cohost-export/project/jmsfbs-code/following | [.json, .json, .json] |
./cohost-export/project/jmsfbs-code/posts/published/1681286-making-my-bluetooth | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1829684-simple-secret-sharin | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/4369806-kind-of-surprised-ho | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/5555101-github-copilot | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/6559941-learning-godot-has-c | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/766864-nginx-caddy | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/943321-crypto-random-uuid | [.json] |
./cohost-export/project/jmsfbs/followers | [.json, .json, .json] |
./cohost-export/project/jmsfbs/posts/published/1008879-the-dinner | [.json] |
./cohost-export/project/jmsfbs/posts/published/1016743-frog-2-tutorial | [.json] |
./cohost-export/project/jmsfbs/posts/published/1046457-plugged-my-electric | [.json] |
./cohost-export/project/jmsfbs/posts/published/1088733-garage-band-no-plugi | [.json] |
./cohost-export/project/jmsfbs/posts/published/1105201-cloudpunk | [.json] |
./cohost-export/project/jmsfbs/posts/published/1166928-just-finished-saints | [.json] |
./cohost-export/project/jmsfbs/posts/published/1342883-midi-research-round | [.json] |
./cohost-export/project/jmsfbs/posts/published/1433411-cousins | [.json] |
./cohost-export/project/jmsfbs/posts/published/1456123-i-paid-for-a-daw | [.json] |
./cohost-export/project/jmsfbs/posts/published/1712599-one-of-the-better-es | [.json] |
./cohost-export/project/jmsfbs/posts/published/185752-one-reason-might-be | [.json] |
./cohost-export/project/jmsfbs/posts/published/1958396-new-drinking-game | [.json] |
./cohost-export/project/jmsfbs/posts/published/2870336-new-track-golden | [.json] |
./cohost-export/project/jmsfbs/posts/published/2983024-new-recording-proces | [.png, .json] |
./cohost-export/project/jmsfbs/posts/published/3125303-need-some-distance | [.json] |
./cohost-export/project/jmsfbs/posts/published/3136545-accidentally-deleted | [.json] |
./cohost-export/project/jmsfbs/posts/published/3395966-played-an-open-mic-a | [.png, .json] |
./cohost-export/project/jmsfbs/posts/published/3819063-recording-tea-and-s | [.json] |
./cohost-export/project/jmsfbs/posts/published/3984809-lone-wolf | [.json] |
./cohost-export/project/jmsfbs/posts/published/4470982-foundation-s2-is-pre | [.json] |
./cohost-export/project/jmsfbs/posts/published/5528127-finally-yes | [.json] |
./cohost-export/project/jmsfbs/posts/published/5654197-getting-back-to-my-r | [.json] |
./cohost-export/project/jmsfbs/posts/published/5814203-https-youtu-be-cej | [.json] |
./cohost-export/project/jmsfbs/posts/published/597784-howard-zinn-woody | [.json] |
./cohost-export/project/jmsfbs/posts/published/7254514-reading-spotify-bios | [.json] |
./cohost-export/project/jmsfbs/posts/published/733962-learning-to-drum-ev | [.json] |
./cohost-export/project/jmsfbs/posts/published/750274-breakdown-of-karen-c | [.json] |
./cohost-export/project/jmsfbs/posts/published/779120-how-sneakers-are-man | [.json] |
./cohost-export/project/jmsfbs/posts/published/827539-new-track-never | [.json] |
./cohost-export/project/jmsfbs/posts/published/829585-songwriting-process | [.json] |
./cohost-export/project/jmsfbs/posts/published/842542-new-guitars | [.json] |
./cohost-export/project/jmsfbs/posts/published/845495-i-don-t-really-like | [.json] |
./cohost-export/project/jmsfbs/posts/published/859426-scotch-crime | [.json] |
./cohost-export/project/jmsfbs/posts/published/881830-midi-thoughts | [.jpg, .json] |
./cohost-export/project/jmsfbs/posts/published/915980-starting-the-next-so | [.json] |
./cohost-export/project/jmsfbs/posts/published/942742-listening-to-fruscia | [.json] |
./cohost-export/project/jmsfbs/posts/published/943274-samorost | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1196048-druids-datadog-s-d | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1205386-parser-combinators | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1230738-nushell-snippets | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1706101-ouch-better-cli-ux | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1933414-chat-gpt-almost-alway | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/2625470-moving-to-ghost | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/4954186-learning-about-apl-a | [.png, .json] |
./cohost-export/project/jmsfbs-code/posts/published/5679848-typescript-tip-out | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/5976016-postgres-implicit-l | [.png, .json] |
./cohost-export/project/jmsfbs-code/posts/published/739146-cohost-launched-rss | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/927564-solid-js-and-presite | [.json] |
./cohost-export/project/jmsfbs/posts/published/1003900-pequod | [.json] |
./cohost-export/project/jmsfbs/posts/published/1270030-loopy-levels-in-fall | [.json] |
./cohost-export/project/jmsfbs/posts/published/1456278-bass-sound-design-w | [.json] |
./cohost-export/project/jmsfbs/posts/published/1460591-larry-sanders-show | [.json] |
./cohost-export/project/jmsfbs/posts/published/1572613-he-died-the-way-he-l | [.json] |
./cohost-export/project/jmsfbs/posts/published/160965-welcome-to-cohost | [.json] |
./cohost-export/project/jmsfbs/posts/published/2254047-this-is-huge-for-mus | [.json] |
./cohost-export/project/jmsfbs/posts/published/3025811-walls | [.json] |
./cohost-export/project/jmsfbs/posts/published/3172316-electronic-kit-upgra | [.json] |
./cohost-export/project/jmsfbs/posts/published/3175152-australia-votes-no-t | [.json] |
./cohost-export/project/jmsfbs/posts/published/3192439-kind-of-nice-to-see | [.json] |
./cohost-export/project/jmsfbs/posts/published/3259093-practicing-and-mayb | [.json] |
./cohost-export/project/jmsfbs/posts/published/3644766-open-mic-night | [.jpg, .json] |
./cohost-export/project/jmsfbs/posts/published/3673531-book-thoughts-hitch | [.json] |
./cohost-export/project/jmsfbs/posts/published/3982228-guitar-modelling-and | [.json] |
./cohost-export/project/jmsfbs/posts/published/4154545-gig-zoe-s-bar | [.png, .json] |
./cohost-export/project/jmsfbs/posts/published/5338982-how-can-you-not-noti | [.json] |
./cohost-export/project/jmsfbs/posts/published/536580-driving-around-today | [.json] |
./cohost-export/project/jmsfbs/posts/published/757625-frick-filler | [.json] |
./cohost-export/project/jmsfbs/posts/published/767388-hmmm | [.png, .json] |
./cohost-export/project/jmsfbs/posts/published/892822-automate-the-pedal-b | [.json] |
./cohost-export/project/jmsfbs/posts/published/904948-video-editing | [.json] |
./cohost-export/project/jmsfbs/posts/published/909617-new-track-wicked | [.json] |
./cohost-export/project/jmsfbs/silenced | [.json, .json, .json] |
./cohost-export/project/recede | [.json, .json] |
./cohost-export/user-notes | [.json, .json, .json] |
./cohost-export/project/jmsfbs-code/comments | [.json, .json, .json] |
./cohost-export/project/jmsfbs-code/posts/published/1042927-new-row-violates-row | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1059858-more-productive-on-t | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1201573-how-to-use-deno-debu | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1349307-ssh-import-id | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1687721-rss-feed-truncated | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/1706541-installing-rust-carg | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/2254888-using-nushell-to-saf | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/4081387-secrets-service | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/4092488-typescript-wishlist | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/6633844-database-migrations | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/716808-minor-website-update | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/7238140-useful-alias-for-doc | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/738657-puppeteer-and-docker | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/773340-crossposting-cohost | [.png, .json] |
./cohost-export/project/jmsfbs-code/posts/published/779895-function-constr | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/784666-joining-rss-feeds | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/943384-svelte-kit-design-dec | [.json] |
./cohost-export/project/jmsfbs-code/posts/published/993001-knee-deep-in-postgre | [.json] |
./cohost-export/project/jmsfbs | [.jpg, .jpg, .json] |
./cohost-export/project/jmsfbs/comments | [.json, .json, .json] |
./cohost-export/project/jmsfbs/following | [.json, .json, .json] |
./cohost-export/project/jmsfbs/posts/published/1313723-df-psych-oddysey | [.json] |
./cohost-export/project/jmsfbs/posts/published/1342661-music-production-del | [.json] |
./cohost-export/project/jmsfbs/posts/published/1379548-freeing-up-your-hand | [.json] |
./cohost-export/project/jmsfbs/posts/published/161748-baldies | [.json] |
./cohost-export/project/jmsfbs/posts/published/1619795-playing-high-hats-fa | [.json] |
./cohost-export/project/jmsfbs/posts/published/174731-drum-rants-thought | [.json] |
./cohost-export/project/jmsfbs/posts/published/180406-you-definitely-notic | [.json] |
./cohost-export/project/jmsfbs/posts/published/2458613-i-wonder-when-chrome | [.json] |
./cohost-export/project/jmsfbs/posts/published/259394-new-skillet | [.json] |
./cohost-export/project/jmsfbs/posts/published/2895937-more-alone | [.json] |
./cohost-export/project/jmsfbs/posts/published/3213421-vst-vendors-and-thei | [.json] |
./cohost-export/project/jmsfbs/posts/published/364112-saw-3-turtles-and-a | [.json] |
./cohost-export/project/jmsfbs/posts/published/3656874-where-do-birds-go-af | [.json] |
./cohost-export/project/jmsfbs/posts/published/3852001-the-games-industry-m | [.json] |
./cohost-export/project/jmsfbs/posts/published/4053158-a-non-dogmatic-left | [.json] |
./cohost-export/project/jmsfbs/posts/published/4169376-dynamics-and-live-sh | [.json] |
./cohost-export/project/jmsfbs/posts/published/4310343-my-spotify-playlist | [.json] |
./cohost-export/project/jmsfbs/posts/published/5444811-why-cohost-is-good | [.json] |
./cohost-export/project/jmsfbs/posts/published/5516880-donald-glover-son-o | [.json] |
./cohost-export/project/jmsfbs/posts/published/5777910-china-doll-sessions | [.mp3, .jpg, .json] |
./cohost-export/project/jmsfbs/posts/published/584679-stacklands-and-rts | [.jpg, .jpg, .json] |
./cohost-export/project/jmsfbs/posts/published/603580-stuff-2022 | [.json] |
./cohost-export/project/jmsfbs/posts/published/752472-breaking-the-monopol | [.json] |
./cohost-export/project/jmsfbs/posts/published/823558-home-recording | [.json] |
./cohost-export | [.json] |
It’s the little things.
group by all
is another DuckDB nicety. Normally we’d have to specify every column we want to aggregate by.group by all
just means: assume any unaggregated column is what we are grouping by. This lets us constantly revise our query without needing to update thegroup by
to match theselect
.
Still too many results for me to comprehend, but the bulk of the data is posts
so let’s focus on just the posts/*/
results.
select
(filename)
.regexp_extract(
'(.*)/(.*)(\..*)'
, ['dir', 'filename', 'ext']
)
.unnest()
from read_text(
'./cohost-export/project/*/posts/*/**'
)
order by filename desc
limit 5
;
dir | filename | ext |
---|---|---|
./cohost-export/project/jmsfbs/posts/published/943274-samorost | post | .json |
./cohost-export/project/jmsfbs/posts/published/942742-listening-to-fruscia | post | .json |
./cohost-export/project/jmsfbs/posts/published/915980-starting-the-next-so | post | .json |
./cohost-export/project/jmsfbs/posts/published/915951-new-tracks | post | .json |
./cohost-export/project/jmsfbs/posts/published/909617-new-track-wicked | post | .json |
Let’s parse that path a little more specifically now:
select
(filename)
.regexp_extract(
'./cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)',
[ 'project'
, 'status'
, 'slug'
, 'filename'
, 'extension'
]
)
.unnest()
from read_text(
'./cohost-export/project/*/posts/*/**'
)
order by filename desc
limit 5
;
project | status | slug | filename | extension |
---|---|---|---|---|
jmsfbs | published | 943274-samorost | post | json |
jmsfbs | published | 942742-listening-to-fruscia | post | json |
jmsfbs | published | 915980-starting-the-next-so | post | json |
jmsfbs | published | 915951-new-tracks | post | json |
jmsfbs | published | 909617-new-track-wicked | post | json |
That’s a useful query, lets give it a name:
create or replace view vw_post_files as
select
(filename)
.regexp_extract(
'./cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)',
[ 'project'
, 'status'
, 'slug'
, 'filename'
, 'extension'
]
)
.unnest()
, filename as filepath
from read_text(
'./cohost-export/project/*/posts/*/**'
)
;
And we can query that by name now:
select * from vw_post_files
order by filename desc
limit 5
;
project | status | slug | filename | extension | filepath |
---|---|---|---|---|---|
jmsfbs | published | 179858-feels-good-man-docu | post | json | ./cohost-export/project/jmsfbs/posts/published/179858-feels-good-man-docu/post.json |
jmsfbs | published | 752472-breaking-the-monopol | post | json | ./cohost-export/project/jmsfbs/posts/published/752472-breaking-the-monopol/post.json |
jmsfbs-code | published | 779895-function-constr | post | json | ./cohost-export/project/jmsfbs-code/posts/published/779895-function-constr/post.json |
jmsfbs-code | published | 784613-browser-automation-w | post | json | ./cohost-export/project/jmsfbs-code/posts/published/784613-browser-automation-w/post.json |
jmsfbs-code | published | 1042927-new-row-violates-row | post | json | ./cohost-export/project/jmsfbs-code/posts/published/1042927-new-row-violates-row/post.json |
DuckDB can arbitrarily unnest CTE’s, view definitions, and nested queries there is no runtime performance penality for composing SQL like this (not necessarilly true in other DB systems, DuckDB is special here).
Exploring JSON data
Let’s dive in to the actual post.json
files now.
We can glob them from the filesystem trivially:
select
* exclude(blocks, filename)
from read_json('./cohost-export/**/*/*/post.json', union_by_name=true, filename=true)
order by singlePostPageURL desc
limit 5
;
I’ll explain
union_by_name=true
momentarily…
postId | headline | state | cws | tags | pinned | commentsLocked | sharesLocked | singlePostPageUrl | publishedAt | shareOfPost |
---|---|---|---|---|---|---|---|---|---|---|
943274 | Samorost | 1 | [] | [gamedev, art, games, documentary] | false | false | false | https://cohost.org/jmsfbs/post/943274-samorost | 2023-02-01T11:05:41.324Z | |
942742 | Listening to Frusciante / Klinghoffer today | 1 | [] | [music, klinghoffer, frusciante, rhcp] | false | false | false | https://cohost.org/jmsfbs/post/942742-listening-to-fruscia | 2023-02-01T07:07:44.007Z | |
915980 | Starting the next song | 1 | [] | [Midi, home recording, songwriting, crosspost:jmsfbs.com] | false | false | false | https://cohost.org/jmsfbs/post/915980-starting-the-next-so | 2023-01-27T06:56:51.647Z | |
915951 | New Tracks | 1 | [] | [music, home recording, crosspost:jmsfbs.com] | false | false | false | https://cohost.org/jmsfbs/post/915951-new-tracks | 2023-01-27T06:48:57.228Z | |
909617 | New Track - Wicked | 1 | [] | [music, songwriting, home recording, crosspost:jmsfbs.com] | false | false | false | https://cohost.org/jmsfbs/post/909617-new-track-wicked | 2023-01-26T03:09:33.404Z |
Yep, we just read json files from disk and DuckDB inferred a unified schema and returned it back to us as native rows.
Because the cohost post content is itself markdown its hard to nest in a markdown table, so I’ve excluded the blocks
column (exclude(blocks)
) just to get a sense of the high level structure. blocks
is the column that contains each content block (markdown/attachments etc).
select * exclude ( columnA, columnB )
is another DuckDB nicety. Normally if we want to exclude a column in SQL we have to explicitly list every column we want to include, not so with DuckDB!
But because that content is just json, we can unnest
again and take a look at the block content. We’ll deliberately skip blocks with nested codeblocks (as it would confuse this blog’s renderer).
with xs as (
select
singlePostPageURL
, unnest(blocks, recursive:= true)
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
select *
from xs
-- ignore code blocks that would
-- confuse Astro (`)
where content not ilike '%`%'
order by singlePostPageURL desc
limit 5
;
singlePostPageUrl | type | content | kind | fileURL | previewURL | attachmentId | altText | width | height | artist | title |
---|---|---|---|---|---|---|---|---|---|---|---|
https://cohost.org/jmsfbs/post/943274-samorost | markdown | https://www.youtube.com/watch?v=6W3kXEGy_GQ&t=1s | |||||||||
https://cohost.org/jmsfbs/post/942742-listening-to-fruscia | markdown | https://open.spotify.com/album/724MiwOVJq5YS9sr70RLif?si=QKtxPtDUTlyM1ZlJifMU-A | |||||||||
https://cohost.org/jmsfbs/post/915980-starting-the-next-so | markdown | The midi thing worked really well on the last track I think. So I'm going to try the same approach on this next track but also experiment a bit with the Haas effect to get a wider stereo image without actually panning at all. If that sounds fancy, I only just learned about it | |||||||||
https://cohost.org/jmsfbs/post/915980-starting-the-next-so | markdown | The song was originally called Moon Addict, but I think it may now be called Starsign. I'm definitely going to play around with the song, lyrics and arrangement as I wrote it close to 15 years ago and otherwise it wouldn't be too interesting for me or anyone else. | |||||||||
https://cohost.org/jmsfbs/post/915980-starting-the-next-so | markdown | So I've plugged in my Arturia Minilab (something that was previously gathering dust but is now potentially my favourite piece of kit) and I've started playing around with church organs. We'll see how it turns out! |
Automatic schema detection
You can see we’ve got all these other columns appearing that are never used in our sample. By default DuckDB does a sample read when reading JSON / CSV / etc so it can infer the correct schema.
If different files have different structures that cannot be merged into a shared schema DuckDB will tell you it needs your input on how to proceed. You can ignore rows with errors in them, you can manually specifiy the schema, you can increase the sample threshold. For my usage I find union_by_name
fixes 99% of my problems. It allows columns and keys to appear out of order (or not at all), but relies on the name of the field to join similar rows.
Transforming our data
We have discovered the structure of our data, we now need to convert it into an mdx
file that Astro can digest.
It’s tempting to reach for a host language at this point. We want to create directories, and we want to generate frontmatter, typescript schemas, markdown files. Move file attachments around, update references in the markdown to the local reference and on and on.
But let’s see how far we can get with just the DuckDB CLI though. You might be surprised!
Frontmatter
Frontmatter is the name for the metadata at the top of a markdown document. It looks like this:
---
title: Parsing the Cohost Export with DuckDB
subtitle: Code really is data.
created: 2024-10-10T04:28:15.002Z
archived: false
featured: false
tags:
- cohost
- DuckDB
- data
- ops
- programming
---
That is the frontmatter for this post at time of writing. This metadata helps this blog generate web pages. We can query it at build time to e.g. get back all the posts that have a particular tag. Embedding the metadata in the post is a neat way of avoiding having separate external config files (or even a database) for collating the posts.
We want to generate this frontmatter to prefix the mdx
file for all the cohost posts. DuckDB has no built in support for YAML (the underlying data type of frontmatter). But we can at least create a DuckDB structure of all the posts and then manually construct the yaml string.
Let’s query those fields and name them using this blog’s naming conventions:
select
headline as title
, publishedAt as created
, true as archived
, false as featured
, tags
, regexp_extract(
singlePostPageURL
, '.*/post/(.*)'
, ['slug']
)
.unnest()
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
, filename=true
)
order by singlePostPageURL desc
limit 5
;
With mode set to .mode line
it almost looks like yaml.
title = Samorost
created = 2023-02-01T11:05:41.324Z
archived = true
featured = false
tags = [gamedev, art, games, documentary]
slug = 943274-samorost
title = Listening to Frusciante / Klinghoffer today
created = 2023-02-01T07:07:44.007Z
archived = true
featured = false
tags = [music, klinghoffer, frusciante, rhcp]
slug = 942742-listening-to-fruscia
title = Starting the next song
created = 2023-01-27T06:56:51.647Z
archived = true
featured = false
tags = [Midi, home recording, songwriting, crosspost:jmsfbs.com]
slug = 915980-starting-the-next-so
title = New Tracks
created = 2023-01-27T06:48:57.228Z
archived = true
featured = false
tags = [music, home recording, crosspost:jmsfbs.com]
slug = 915951-new-tracks
title = New Track - Wicked
created = 2023-01-26T03:09:33.404Z
archived = true
featured = false
tags = [music, songwriting, home recording, crosspost:jmsfbs.com]
slug = 909617-new-track-wicked
In fact, this would be the equivalent valid yml/frontmatter:
- title: Samorost
created: 2023-02-01T11:05:41.324Z
archived: true
featured: false
tags: [gamedev, art, games, documentary]
slug: 943274-samorost
- title: Listening to Frusciante / Klinghoffer today
created: 2023-02-01T07:07:44.007Z
archived: true
featured: false
tags: [music, klinghoffer, frusciante, rhcp]
slug: 942742-listening-to-fruscia
- title: Starting the next song
created: 2023-01-27T06:56:51.647Z
archived: true
featured: false
tags: [Midi, home recording, songwriting, crosspost:jmsfbs.com]
slug: 915980-starting-the-next-so
title: New Tracks
created: 2023-01-27T06:48:57.228Z
archived: true
featured: false
tags: [music, home recording, crosspost:jmsfbs.com]
slug: 915951-new-tracks
- title: "New Track - Wicked"
created: 2023-01-26T03:09:33.404Z
archived: true
featured: false
tags: [music, songwriting, home recording, crosspost:jmsfbs.com]
slug: 909617-new-track-wicked
When I first saw this output I was tempted to write this native DuckDB format to a file and regex replace it into yaml as it is so close. But instead we’ll generate the frontmatter using a general purpose technique because the intention of this post is to show off DuckDB not solve this specific problem.
Let’s adjust our query to produce a text column frontmatter
that uses all the metadata from the previous query. Note the use of E'\n'
we use this because the newline character is automatically escaped and it won’t render as expected. E'\n'
is an unescaped / “raw” string.
select
(
''
|| E'---'
|| E'\ntitle: '
|| printf('"%s"', headline)
|| E'\ncreated: '
|| publishedAt
|| E'\nfeatured: false'
|| E'\narchived: true'
|| E'\ntags: '
|| (tags || ['from:cohost'])
.to_json()::text
|| E'\nslug: '
|| regexp_extract(
singlePostPageURL
, '.*/post/(.*)'
, ['slug']
)
.slug
|| E'\n---'
)
as frontmatter
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
order by singlePostPageURL desc
limit 5
;
We get this in .mode list
---
title: Samorost
created: 2023-02-01T11:05:41.324Z
featured: false
archived: true
tags: ["gamedev","art","games","documentary","from:cohost"]
slug: 943274-samorost
---
---
title: Listening to Frusciante / Klinghoffer today
created: 2023-02-01T07:07:44.007Z
featured: false
archived: true
tags: ["music","klinghoffer","frusciante","rhcp","from:cohost"]
slug: 942742-listening-to-fruscia
---
---
title: Starting the next song
created: 2023-01-27T06:56:51.647Z
featured: false
archived: true
tags: ["Midi","home recording","songwriting","crosspost:jmsfbs.com","from:cohost"]
slug: 915980-starting-the-next-so
---
---
title: New Tracks
created: 2023-01-27T06:48:57.228Z
featured: false
archived: true
tags: ["music","home recording","crosspost:jmsfbs.com","from:cohost"]
slug: 915951-new-tracks
---
---
title: New Track - Wicked
created: 2023-01-26T03:09:33.404Z
featured: false
archived: true
tags: ["music","songwriting","home recording","crosspost:jmsfbs.com","from:cohost"]
slug: 909617-new-track-wicked
---
Let’s bring that filename parsing logic in and add some of that metadata to the frontmatter too:
with xs as (
select
* exclude (filename)
, (filename)
.regexp_extract(
'./cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)',
[ 'project'
, 'status'
, 'slug'
, 'filename'
, 'extension'
]
)
.unnest()
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
, filename=true
)
)
select
(
''
|| E'---'
|| E'\ntitle: '
|| printf('"%s"', headline)
|| E'\ncreated: ' || publishedAt
|| E'\featured: false'
|| E'\narchived: true'
|| E'\ncohost_project: ' || project
|| E'\ncohost_status: ' || status
|| E'\ntags: '
|| (tags || ['from:cohost'])
.to_json()::text
|| E'\nslug: ' || slug
|| E'\n---'
)
as frontmatter
from xs
order by singlePostPageURL desc
limit 5
;
Let’s name that query via a view
and with a few metadata columns for joining / debugging.
I’ve also added a few minor changes to handle edgecases I discovered while testing different pages.
create or replace view vw_post_frontmatter as
with xs as (
select
* exclude (filename)
, (filename)
.regexp_extract(
'./cohost-export/project/(.*)/posts/(.*)/(.*)/(.*)\.(.*)',
['project'
, 'status'
, 'slug'
, 'filename'
, 'extension'
]
)
.unnest()
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
, filename=true
)
)
select
postId
, headline
, slug
, printf('"%s"', headline) as title
-- drafts don't have publishedAt
, coalesce(
publishedAt
-- to ISO8601:
, now()::json->>'$'
) as created
, false as featured
, true as archived
, project as cohost_project
, status as cohost_status
, (
tags || ['from:cohost']
)
.to_json()::text as tags_text
, (
''
|| E'---'
|| E'\ntitle: ' || title
|| E'\ncreated: ' || created
|| E'\nfeatured: ' || featured
|| E'\narchived: ' || archived
|| E'\ncohost_project: '
|| cohost_project
|| E'\ncohost_status: '
|| cohost_status
|| E'\ntags: ' || tags_text
|| E'\nslug: ' || slug
|| E'\n---'
)
as frontmatter
from xs
;
Joining the Markdown content
Now what about the markdown content? We need to glue that all together. Let’s try it with an aggregate function string_agg
.
with xs as (
select
headline
, unnest(blocks) as block
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
select
headline
, string_agg(
block.markdown.content
, E'\n\n'
)[0:100] as content
from xs
where block.type = 'markdown'
group by all
limit 5
;
headline | content |
---|---|
Recursive queries in postgres | Was just reading this [great article](https://elvisciotti.medium.com/postgresql-recursive-query-the |
Cohost launched RSS / JSON feeds | This is cool: https://cohost.org/jmsfbs-code/rss/public.json\nI guess this would make it pretty easy |
NGINX → Caddy | Going to try and replace nginx with caddy on my website in 20-30 mins, let's see.\n---\nTemporarily |
Joining RSS Feeds | It'd be cool to have a single RSS feed for both my website and this cohost feed. Wouldn't be too ha |
Pequod | Thinking after the current song project I'll do one of my bands (Baby X) songs, and get my drummer ( |
I’ve sliced the first 100 chars, and limited to the first 5 rows. But we can see this is working.
But we also need to handle including attachments in the blog post. First we’ll build up a data structure for the non markdown blocks. Then we’ll integrate it into the post content.
with xs as (
select
headline
, unnest(
blocks
, recursive:= true
) as block
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
select *
from xs
where type <> 'markdown'
;
headline | type | content | kind | fileURL | previewURL | attachmentId | altText | width | height | artist | title |
---|---|---|---|---|---|---|---|---|---|---|---|
Learning about APL and losing my mind | attachment | image | https://staging.cohostcdn.org/attachment/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6/image.png | https://staging.cohostcdn.org/attachment/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6/image.png | af844fd0-0f9b-4b4f-bc02-168dee9ce1a6 | 1216 | 659 | ||||
Postgres: Implicit lateral cross joins | attachment | image | https://staging.cohostcdn.org/attachment/5097c2b0-d82e-4348-9e0a-84d073bcd35e/image.png | https://staging.cohostcdn.org/attachment/5097c2b0-d82e-4348-9e0a-84d073bcd35e/image.png | 5097c2b0-d82e-4348-9e0a-84d073bcd35e | 1840 | 1180 | ||||
Crossposting Cohost to Website | attachment | image | https://staging.cohostcdn.org/attachment/68df2c18-9235-47af-96b5-c6515c6c3999/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png | https://staging.cohostcdn.org/attachment/68df2c18-9235-47af-96b5-c6515c6c3999/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png | 68df2c18-9235-47af-96b5-c6515c6c3999 | 2542 | 1330 | ||||
Working on a simple website for my music | attachment | image | https://staging.cohostcdn.org/attachment/e8c1ca83-3d03-486b-b4e3-e25052be83b4/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png | https://staging.cohostcdn.org/attachment/e8c1ca83-3d03-486b-b4e3-e25052be83b4/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png | e8c1ca83-3d03-486b-b4e3-e25052be83b4 | 1604 | 1512 | ||||
Nerd points | attachment | image | https://staging.cohostcdn.org/attachment/b7e09133-9f92-4cc9-bd76-222db1d614da/image.png | https://staging.cohostcdn.org/attachment/b7e09133-9f92-4cc9-bd76-222db1d614da/image.png | b7e09133-9f92-4cc9-bd76-222db1d614da | 2328 | 600 | ||||
New recording process | attachment | image | https://staging.cohostcdn.org/attachment/c9d58d7f-5ddd-4389-b146-28005841c950/image.png | https://staging.cohostcdn.org/attachment/c9d58d7f-5ddd-4389-b146-28005841c950/image.png | c9d58d7f-5ddd-4389-b146-28005841c950 | 1516 | 529 | ||||
Played an open mic again | attachment | image | https://staging.cohostcdn.org/attachment/28f111aa-9a8c-4af5-9afa-b2dda4478679/397796805_841791357742803_767766949589955176_n.png | https://staging.cohostcdn.org/attachment/28f111aa-9a8c-4af5-9afa-b2dda4478679/397796805_841791357742803_767766949589955176_n.png | 28f111aa-9a8c-4af5-9afa-b2dda4478679 | 720 | 720 | ||||
More giggin' | attachment | image | https://staging.cohostcdn.org/attachment/b2be61c7-ea03-4ecc-9bb8-418aefd49601/img_3_1699499479580.jpg | https://staging.cohostcdn.org/attachment/b2be61c7-ea03-4ecc-9bb8-418aefd49601/img_3_1699499479580.jpg | b2be61c7-ea03-4ecc-9bb8-418aefd49601 | 993 | 1765 | ||||
More giggin' | attachment | image | https://staging.cohostcdn.org/attachment/144f0553-80e7-43f3-a6e1-ef01bf21df57/img_1_1699499465335.jpg | https://staging.cohostcdn.org/attachment/144f0553-80e7-43f3-a6e1-ef01bf21df57/img_1_1699499465335.jpg | 144f0553-80e7-43f3-a6e1-ef01bf21df57 | 993 | 1765 | ||||
More giggin' | attachment | audio | https://staging.cohostcdn.org/attachment/87270e9c-c88b-44eb-8c7d-d99d986790b0/Noth.mp3 | https://staging.cohostcdn.org/attachment/87270e9c-c88b-44eb-8c7d-d99d986790b0/Noth.mp3 | 87270e9c-c88b-44eb-8c7d-d99d986790b0 | Baby X | Noth (bedroom jam) | ||||
Open mic night | attachment | image | https://staging.cohostcdn.org/attachment/e2d702b0-b6dc-4cb7-891b-82b80bce76d6/IMG_20231124_104548.jpg | https://staging.cohostcdn.org/attachment/e2d702b0-b6dc-4cb7-891b-82b80bce76d6/IMG_20231124_104548.jpg | e2d702b0-b6dc-4cb7-891b-82b80bce76d6 | 960 | 751 | ||||
Gig @ Zoe's Bar | attachment | image | https://staging.cohostcdn.org/attachment/486822cd-d74f-46eb-a3fb-9aa4ac34c174/image.png | https://staging.cohostcdn.org/attachment/486822cd-d74f-46eb-a3fb-9aa4ac34c174/image.png | 486822cd-d74f-46eb-a3fb-9aa4ac34c174 | 679 | 641 | ||||
Witcher 3, RPGs and gamefeel? | attachment | image | https://staging.cohostcdn.org/attachment/162b7d07-dcb6-44e9-85d4-8b6bd6d98410/Nova_pict_131.webp | https://staging.cohostcdn.org/attachment/162b7d07-dcb6-44e9-85d4-8b6bd6d98410/Nova_pict_131.webp | 162b7d07-dcb6-44e9-85d4-8b6bd6d98410 | 832 | 624 | ||||
China Doll Sessions | attachment | audio | https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3 | https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3 | c7457348-f770-408a-8d54-317f7ac33ffe | jmsfbs, baby x | china doll guitar chain experiment | ||||
China Doll Sessions | attachment | image | https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg | https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg | e41ee3e0-be75-402c-8b5c-507fbca1124a | I re-arranged my room, this is the new recording setup. | 4032 | 2268 | |||
China Doll Sessions | attachment | audio | https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3 | https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3 | 1039f2fa-9519-4b66-8b52-2704b1967a0e | jmsfbs, baby x | china doll guitar chain experiment 2 | ||||
Stacklands (and RTS in general) | attachment | image | https://staging.cohostcdn.org/attachment/4c3fcac7-4e8b-4a41-a312-caef366c9087/header.jpg | https://staging.cohostcdn.org/attachment/4c3fcac7-4e8b-4a41-a312-caef366c9087/header.jpg | 4c3fcac7-4e8b-4a41-a312-caef366c9087 | 460 | 215 | ||||
Stacklands (and RTS in general) | attachment | image | https://staging.cohostcdn.org/attachment/a9e895bf-10a3-42d8-9fdc-5975629131a9/Warcraft-1-fog-of-war.jpg | https://staging.cohostcdn.org/attachment/a9e895bf-10a3-42d8-9fdc-5975629131a9/Warcraft-1-fog-of-war.jpg | a9e895bf-10a3-42d8-9fdc-5975629131a9 | 590 | 472 | ||||
attachment | image | https://staging.cohostcdn.org/attachment/a5d93351-de36-469b-8ae0-0857ca09985a/Screenshot%202023-01-02%20193640.png | https://staging.cohostcdn.org/attachment/a5d93351-de36-469b-8ae0-0857ca09985a/Screenshot%202023-01-02%20193640.png | a5d93351-de36-469b-8ae0-0857ca09985a | 2560 | 1440 | |||||
Midi Thoughts | attachment | image | https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg | https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg | fc7337ec-1135-484f-9ede-cf9afc075300 | 4080 | 3072 |
Let’s see what type of attachments we have:
with xs as (
select
headline
, unnest(
blocks
, recursive:= true
) as block
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
, ys as (
select
type
, kind
, parse_filename(fileURL)
.regexp_split_to_array('\.')[-1]
as ext
from xs
where type = 'attachment'
)
pivot ys
on ext
using count(*)
;
type | kind | jpg | mp3 | png | webp |
---|---|---|---|---|---|
attachment | audio | 0 | 3 | 0 | 0 |
attachment | image | 7 | 0 | 9 | 1 |
At least for my dataset, this is all I have to consider:
audio/mp3
image/jpg
image/png
image/webp
The cohost export comes prebundled with all the file attachments. This makes a lot of sense as they are planning to delete all user data at the end of the year. But at time of writing, those fileURL
and previewURL
URLs are still live. So for simplicity I’m just going to re-download them instead of joining on filename (which can be a little risky in terms of name collisions).
Let’s start by making a virtual directory struture for our posts. We’ll drop the attachments into this virtual file system.
with xs as (
select
headline
, unnest(
blocks
, recursive:= true
) as block
, * exclude (blocks)
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
select
(singlePostpageURL)
.parse_filename() as slug
, (fileURL)
.parse_path()[-2:-1]
.array_to_string('-') as filename
, fileURL as src
, printf(
'./%s/%s', slug, filename
) as dest
from xs
where type <> 'markdown'
;
slug | filename | src | dest |
---|---|---|---|
4954186-learning-about-apl-a | af844fd0-0f9b-4b4f-bc02-168dee9ce1a6-image.png | https://staging.cohostcdn.org/attachment/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6/image.png | ./4954186-learning-about-apl-a/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6-image.png |
5976016-postgres-implicit-l | 5097c2b0-d82e-4348-9e0a-84d073bcd35e-image.png | https://staging.cohostcdn.org/attachment/5097c2b0-d82e-4348-9e0a-84d073bcd35e/image.png | ./5976016-postgres-implicit-l/5097c2b0-d82e-4348-9e0a-84d073bcd35e-image.png |
773340-crossposting-cohost | 68df2c18-9235-47af-96b5-c6515c6c3999-Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png | https://staging.cohostcdn.org/attachment/68df2c18-9235-47af-96b5-c6515c6c3999/Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png | ./773340-crossposting-cohost/68df2c18-9235-47af-96b5-c6515c6c3999-Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png |
926060-working-on-a-simple | e8c1ca83-3d03-486b-b4e3-e25052be83b4-Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png | https://staging.cohostcdn.org/attachment/e8c1ca83-3d03-486b-b4e3-e25052be83b4/Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png | ./926060-working-on-a-simple/e8c1ca83-3d03-486b-b4e3-e25052be83b4-Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png |
8dff553847ec4a4a9d8d8be81553f017 | 086c8fa9-dc82-40e1-be6c-cd5abd390332-Screen%20Shot%202022-12-13%20at%205.46.40%20pm.png | https://staging.cohostcdn.org/attachment/086c8fa9-dc82-40e1-be6c-cd5abd390332/Screen%20Shot%202022-12-13%20at%205.46.40%20pm.png | ./8dff553847ec4a4a9d8d8be81553f017/086c8fa9-dc82-40e1-be6c-cd5abd390332-Screen%20Shot%202022-12-13%20at%205.46.40%20pm.png |
8dff553847ec4a4a9d8d8be81553f017 | 981436c3-b99a-44fc-84dd-ac52b7db7ecd-a345fe50-9c38-4d6f-8cc4-4b7c14697a5d-620x372.webp | https://staging.cohostcdn.org/attachment/981436c3-b99a-44fc-84dd-ac52b7db7ecd/a345fe50-9c38-4d6f-8cc4-4b7c14697a5d-620x372.webp | ./8dff553847ec4a4a9d8d8be81553f017/981436c3-b99a-44fc-84dd-ac52b7db7ecd-a345fe50-9c38-4d6f-8cc4-4b7c14697a5d-620x372.webp |
8dff553847ec4a4a9d8d8be81553f017 | dc5a6e47-e01e-4abe-94d6-42ac2d4de15b-fb91501a1da25da354f6cdde25271dee.jpg | https://staging.cohostcdn.org/attachment/dc5a6e47-e01e-4abe-94d6-42ac2d4de15b/fb91501a1da25da354f6cdde25271dee.jpg | ./8dff553847ec4a4a9d8d8be81553f017/dc5a6e47-e01e-4abe-94d6-42ac2d4de15b-fb91501a1da25da354f6cdde25271dee.jpg |
2869791-nerd-points | b7e09133-9f92-4cc9-bd76-222db1d614da-image.png | https://staging.cohostcdn.org/attachment/b7e09133-9f92-4cc9-bd76-222db1d614da/image.png | ./2869791-nerd-points/b7e09133-9f92-4cc9-bd76-222db1d614da-image.png |
2983024-new-recording-proces | c9d58d7f-5ddd-4389-b146-28005841c950-image.png | https://staging.cohostcdn.org/attachment/c9d58d7f-5ddd-4389-b146-28005841c950/image.png | ./2983024-new-recording-proces/c9d58d7f-5ddd-4389-b146-28005841c950-image.png |
3395966-played-an-open-mic-a | 28f111aa-9a8c-4af5-9afa-b2dda4478679-397796805_841791357742803_767766949589955176_n.png | https://staging.cohostcdn.org/attachment/28f111aa-9a8c-4af5-9afa-b2dda4478679/397796805_841791357742803_767766949589955176_n.png | ./3395966-played-an-open-mic-a/28f111aa-9a8c-4af5-9afa-b2dda4478679-397796805_841791357742803_767766949589955176_n.png |
3457586-more-giggin | b2be61c7-ea03-4ecc-9bb8-418aefd49601-img_3_1699499479580.jpg | https://staging.cohostcdn.org/attachment/b2be61c7-ea03-4ecc-9bb8-418aefd49601/img_3_1699499479580.jpg | ./3457586-more-giggin/b2be61c7-ea03-4ecc-9bb8-418aefd49601-img_3_1699499479580.jpg |
3457586-more-giggin | 144f0553-80e7-43f3-a6e1-ef01bf21df57-img_1_1699499465335.jpg | https://staging.cohostcdn.org/attachment/144f0553-80e7-43f3-a6e1-ef01bf21df57/img_1_1699499465335.jpg | ./3457586-more-giggin/144f0553-80e7-43f3-a6e1-ef01bf21df57-img_1_1699499465335.jpg |
3457586-more-giggin | 87270e9c-c88b-44eb-8c7d-d99d986790b0-Noth.mp3 | https://staging.cohostcdn.org/attachment/87270e9c-c88b-44eb-8c7d-d99d986790b0/Noth.mp3 | ./3457586-more-giggin/87270e9c-c88b-44eb-8c7d-d99d986790b0-Noth.mp3 |
3644766-open-mic-night | e2d702b0-b6dc-4cb7-891b-82b80bce76d6-IMG_20231124_104548.jpg | https://staging.cohostcdn.org/attachment/e2d702b0-b6dc-4cb7-891b-82b80bce76d6/IMG_20231124_104548.jpg | ./3644766-open-mic-night/e2d702b0-b6dc-4cb7-891b-82b80bce76d6-IMG_20231124_104548.jpg |
4154545-gig-zoe-s-bar | 486822cd-d74f-46eb-a3fb-9aa4ac34c174-image.png | https://staging.cohostcdn.org/attachment/486822cd-d74f-46eb-a3fb-9aa4ac34c174/image.png | ./4154545-gig-zoe-s-bar/486822cd-d74f-46eb-a3fb-9aa4ac34c174-image.png |
5638375-witcher-3-rp-gs-and | 162b7d07-dcb6-44e9-85d4-8b6bd6d98410-Nova_pict_131.webp | https://staging.cohostcdn.org/attachment/162b7d07-dcb6-44e9-85d4-8b6bd6d98410/Nova_pict_131.webp | ./5638375-witcher-3-rp-gs-and/162b7d07-dcb6-44e9-85d4-8b6bd6d98410-Nova_pict_131.webp |
5777910-china-doll-sessions | c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 | https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3 | ./5777910-china-doll-sessions/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 |
5777910-china-doll-sessions | e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg | https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg | ./5777910-china-doll-sessions/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg |
5777910-china-doll-sessions | 1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 | https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3 | ./5777910-china-doll-sessions/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 |
584679-stacklands-and-rts | 4c3fcac7-4e8b-4a41-a312-caef366c9087-header.jpg | https://staging.cohostcdn.org/attachment/4c3fcac7-4e8b-4a41-a312-caef366c9087/header.jpg | ./584679-stacklands-and-rts/4c3fcac7-4e8b-4a41-a312-caef366c9087-header.jpg |
584679-stacklands-and-rts | a9e895bf-10a3-42d8-9fdc-5975629131a9-Warcraft-1-fog-of-war.jpg | https://staging.cohostcdn.org/attachment/a9e895bf-10a3-42d8-9fdc-5975629131a9/Warcraft-1-fog-of-war.jpg | ./584679-stacklands-and-rts/a9e895bf-10a3-42d8-9fdc-5975629131a9-Warcraft-1-fog-of-war.jpg |
767388-hmmm | a5d93351-de36-469b-8ae0-0857ca09985a-Screenshot%202023-01-02%20193640.png | https://staging.cohostcdn.org/attachment/a5d93351-de36-469b-8ae0-0857ca09985a/Screenshot%202023-01-02%20193640.png | ./767388-hmmm/a5d93351-de36-469b-8ae0-0857ca09985a-Screenshot%202023-01-02%20193640.png |
881830-midi-thoughts | fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg | https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg | ./881830-midi-thoughts/fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg |
Now lets add a command colum cmd
to download each file to the result set.
printf(
'mkdir -p "%s"; wget "%s" -O "%s"'
, parent_dir
, src
, dest
)
with xs as (
select
headline
, unnest(
blocks
, recursive:= true
) as block
, * exclude (blocks)
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
select
(singlePostpageURL)
.parse_filename() as slug
, (fileURL)
.parse_path()[-2:-1]
.array_to_string('-') as filename
, fileURL as src
, printf(
'../output/posts/%s/attachments'
, slug
) as parent_dir
, printf(
'%s/%s'
, parent_dir
, filename
) as dest
, printf(
'mkdir -p "%s"; wget "%s" -O "%s"'
, parent_dir
, src
, dest
) as cmd
from xs
where type <> 'markdown'
;
Here’s a sample of one of those generated commands:
mkdir -p "../output/posts/881830-midi-thoughts/attachments";
wget "https://staging.cohostcdn.org/attachment/fc7337ec-1135-484f-9ede-cf9afc075300/PXL_20230121_083234984_2.jpg"
\ -O "../output/posts/881830-midi-thoughts/attachments/fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg"
Now if we just select cmd
only, turn headers off (.header off
), turn off all table formatting (.mode list
) and set the output mode to a file (.once $outputFile
), we can write these commands to a script on disk.
.mode list
.header off
.once ../output/download-attachment.sh
with xs as (
select
headline
, unnest(
blocks
, recursive:= true
) as block
, * exclude (blocks)
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
, ys as (
select
(singlePostpageURL)
.parse_filename() as slug
, (fileURL)
.parse_path()[-2:-1]
.array_to_string('-') as filename
, fileURL as src
, printf(
'../output/posts/%s/attachments'
, slug
) as parent_dir
, printf(
'%s/%s'
, parent_dir
, filename
) as dest
, printf(
'mkdir -p "%s"; wget "%s" -O "%s"'
, parent_dir
, src
, dest
) as cmd
from xs
where type <> 'markdown'
)
select cmd from ys
;
Now we have a script on our filesystem where each line downloads 1 attachment via wget
. It looks like this (substituting the $variables for real values):
mkdir -p "$parent_dir";
wget "$src1" -O "$dest1";
mkdir -p "$parent_dir";
wget "$src2" -O "$dest2";
mkdir -p "$parent_dir";
wget "$src3" -O "$dest3";
# etc...
We can also run bash commands from the DuckDB shell.
E.g.
.shell ls cohost-export
project user-notes user.json
So let’s run our script:
.shell bash ../output/download-attachment.sh
And after running it, we can see the downloads worked as expected:
select
filename
, size
, last_modified
from read_text('../output/posts/**')
;
filename | size | last_modified |
---|---|---|
../output/posts/2869791-nerd-points/attachments/b7e09133-9f92-4cc9-bd76-222db1d614da-image.png | 289078 | 2024-10-10 07:44:45 |
../output/posts/2983024-new-recording-proces/attachments/c9d58d7f-5ddd-4389-b146-28005841c950-image.png | 45105 | 2024-10-10 07:44:45 |
../output/posts/3395966-played-an-open-mic-a/attachments/28f111aa-9a8c-4af5-9afa-b2dda4478679-397796805_841791357742803_767766949589955176_n.png | 423742 | 2024-10-10 07:44:45 |
../output/posts/3457586-more-giggin/attachments/144f0553-80e7-43f3-a6e1-ef01bf21df57-img_1_1699499465335.jpg | 98520 | 2024-10-10 07:44:46 |
../output/posts/3457586-more-giggin/attachments/87270e9c-c88b-44eb-8c7d-d99d986790b0-Noth.mp3 | 2608900 | 2024-04-06 03:23:53 |
../output/posts/3457586-more-giggin/attachments/b2be61c7-ea03-4ecc-9bb8-418aefd49601-img_3_1699499479580.jpg | 122023 | 2024-10-10 07:44:45 |
../output/posts/3644766-open-mic-night/attachments/e2d702b0-b6dc-4cb7-891b-82b80bce76d6-IMG_20231124_104548.jpg | 111299 | 2024-10-10 07:44:46 |
../output/posts/4154545-gig-zoe-s-bar/attachments/486822cd-d74f-46eb-a3fb-9aa4ac34c174-image.png | 519389 | 2024-10-10 07:44:47 |
../output/posts/4954186-learning-about-apl-a/attachments/af844fd0-0f9b-4b4f-bc02-168dee9ce1a6-image.png | 96019 | 2024-10-10 07:44:43 |
../output/posts/5638375-witcher-3-rp-gs-and/attachments/162b7d07-dcb6-44e9-85d4-8b6bd6d98410-Nova_pict_131.webp | 51198 | 2024-10-10 07:44:47 |
../output/posts/5777910-china-doll-sessions/attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 | 1010415 | 2024-04-30 10:20:27 |
../output/posts/5777910-china-doll-sessions/attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 | 770088 | 2024-04-30 09:59:48 |
../output/posts/5777910-china-doll-sessions/attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg | 957500 | 2024-10-10 07:44:47 |
../output/posts/584679-stacklands-and-rts/attachments/4c3fcac7-4e8b-4a41-a312-caef366c9087-header.jpg | 23699 | 2024-10-10 07:44:47 |
../output/posts/584679-stacklands-and-rts/attachments/a9e895bf-10a3-42d8-9fdc-5975629131a9-Warcraft-1-fog-of-war.jpg | 33466 | 2024-10-10 07:44:48 |
../output/posts/5976016-postgres-implicit-l/attachments/5097c2b0-d82e-4348-9e0a-84d073bcd35e-image.png | 316259 | 2024-10-10 07:44:44 |
../output/posts/767388-hmmm/attachments/a5d93351-de36-469b-8ae0-0857ca09985a-Screenshot%202023-01-02%20193640.png | 2622668 | 2024-10-10 07:44:48 |
../output/posts/773340-crossposting-cohost/attachments/68df2c18-9235-47af-96b5-c6515c6c3999-Screen%20Shot%202023-01-03%20at%2011.32.54%20pm.png | 58603 | 2024-10-10 07:44:44 |
../output/posts/881830-midi-thoughts/attachments/fc7337ec-1135-484f-9ede-cf9afc075300-PXL_20230121_083234984_2.jpg | 824407 | 2024-10-10 07:44:49 |
../output/posts/926060-working-on-a-simple/attachments/e8c1ca83-3d03-486b-b4e3-e25052be83b4-Screen%20Shot%202023-01-29%20at%202.40.23%20pm.png | 2291287 | 2024-10-10 07:44:44 |
Building the post content
Previously we authored a query to build the post content but we skipped attachments. We’re now going to reference our locally downloaded attachments instead of skipping them.
with xs as (
select
headline
, unnest(blocks) as block
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
select
headline
, block
, (block.attachment.fileURL)
.parse_path()[-2:-1]
.array_to_string('-') as dest
, (
case block.type
when 'markdown'
then block.markdown.content
when 'attachment' then
case block.attachment.kind
when 'audio' then
printf(
'<Audio title={"%s"} artist={"%s"} src={import("./attachments/%s")} />'
, block.attachment.title
, block.attachment.artist, dest
)
when 'image' then
printf(
'<Image alttext={"%s"} src={import("./attachments/%s")} />'
, block.attachment.alttext
, dest
)
end
end
) as content
from xs
where headline = 'China Doll Sessions'
;
Gives us:
headline | block | dest | content |
---|---|---|---|
China Doll Sessions | {'type': attachment, 'markdown': NULL, 'attachment': {'kind': audio, 'fileURL': https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3, 'previewURL': https://staging.cohostcdn.org/attachment/c7457348-f770-408a-8d54-317f7ac33ffe/China%20Doll%202.mp3, 'attachmentId': c7457348-f770-408a-8d54 | c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3 | <Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} /> |
China Doll Sessions | {'type': attachment, 'markdown': NULL, 'attachment': {'kind': image, 'fileURL': https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg, 'previewURL': https://staging.cohostcdn.org/attachment/e41ee3e0-be75-402c-8b5c-507fbca1124a/PXL_20240429_062918414.jpg, 'attachmentId': e41ee3e0-be | e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg | <Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} /> |
China Doll Sessions | {'type': attachment, 'markdown': NULL, 'attachment': {'kind': audio, 'fileURL': https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3, 'previewURL': https://staging.cohostcdn.org/attachment/1039f2fa-9519-4b66-8b52-2704b1967a0e/China%20Doll%202.mp3, 'attachmentId': 1039f2fa-9519-4b66-8b52 | 1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3 | <Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} /> |
China Doll Sessions | {'type': markdown, 'markdown': {'content': This is an old song from my early baby x days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!}, 'attachment': NULL} | This is an old song from my early baby x days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll! | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to s | The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship. Th | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.}, 'attachment': NULL} | It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in. | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins.}, 'attachment': NULL} | I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins. | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators | I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators. | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.}, 'attachment': NULL} | Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked. | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': Attached is a loop of that approach in practice.}, 'attachment': NULL} | Attached is a loop of that approach in practice. | |
China Doll Sessions | {'type': markdown, 'markdown': {'content': My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.}, 'attachment': NULL} | My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together. |
Now that content field can be string_agg
’d
with xs as (
select
headline
, unnest(blocks) as block
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
, ys as (
select
headline
, block
, (block.attachment.fileURL)
.parse_path()[-2:-1]
.array_to_string('-') as dest
, (
case block.type
when 'markdown' then
block.markdown.content
when 'attachment' then
case block.attachment.kind
when 'audio' then
printf(
'<Audio title={"%s"} artist={"%s"} src={import("./attachments/%s")} />'
, block.attachment.title
, block.attachment.artist
, dest
)
when 'image' then
printf(
'<Image alttext={"%s"} src={import("./attachments/%s")} />'
, block.attachment.alttext
, dest
)
end
end
) as content
from xs
)
select
headline
, string_agg(content, E'\n\n') as content
from ys
group by all
;
If we make that a view too…
create or replace view vw_post_content as
with xs as (
select
postId
, headline
, unnest(blocks) as block
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
)
)
, ys as (
select
postId
, headline
, block
, (block.attachment.fileURL)
.parse_path()[-2:-1]
.array_to_string('-') as dest
, (
case block.type
when 'markdown' then
block.markdown.content
when 'attachment' then
case block.attachment.kind
when 'audio' then
printf(
'<Audio title={"%s"} artist={"%s"} src={import("./attachments/%s")} />'
, block.attachment.title
, block.attachment.artist
, dest
)
when 'image' then
printf(
'<Image alttext={"%s"} src={import("./attachments/%s")} />'
, block.attachment.alttext
, dest
)
end
end
) as content
from xs
)
select
postId
, headline
, string_agg(content, E'\n\n') as content
from ys
group by all
;
We can have a look at one of the posts:
.mode list
select
content
from vw_post_content
where headline = 'China Doll Sessions'
;
And we get this:
<Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />
<Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />
<Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />
This is an old song from my early [baby x](https://babyx.bandcamp.com/album/i-am-1-am) days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!
The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship. That whole situation led us to being temporarily estranged. This was myspace drama days!
It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.
I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins.
I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators.
Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.
Attached is a loop of that approach in practice.
My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.
You’ll note I’m dropping components into my markdown. This is only possible because Astro supports MDX. I’m using MDX instead of Markdown as it makes it possible to drop in rich media for embeds.
Now we can glue the frontmatter with the content with a simple join.
select
frontmatter
|| E'\n\n'
|| content as content
from vw_post_content
inner join vw_post_frontmatter using(headline)
where headline = 'China Doll Sessions'
;
And we get this:
---
title: China Doll Sessions
created: 2024-04-30T09:59:48.880Z
featured: false
archived: true
cohost_project: jmsfbs
cohost_status: published
tags: ["recording","baby x","band","myspace","Drama","music","from:cohost"]
slug: 5777910-china-doll-sessions
---
<Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />
<Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />
<Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />
This is an old song from my early [baby x](https://babyx.bandcamp.com/album/i-am-1-am) days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!
The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship. That whole situation led us to being temporarily estranged. This was myspace drama days!
It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.
I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins.
I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators.
Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.
Attached is a loop of that approach in practice.
My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.
Importing components
Finally, if an Audio
or Image
attachment is used we need to import that component in the MDX content. Let’s inject that if it is required for the post. First let’s detect if there’s any audio in a post.
select
json_contains(
blocks
, { kind: 'audio' }
) as import_audio
,json_contains(
blocks
, { kind: 'image' }
) as import_image
-- ...
And then inject the imports:
select
-- ...
|| E'\n'
|| case when import_audio
then E'\nimport Audio from "../../../components/Audio.astro"'
else ''
end
|| case when import_image
then E'\nimport Image from "../../../components/Image.astro"'
else ''
end
|| E'\n\n'
-- ...
In context:
select
json_contains(
blocks
, { kind: 'audio' }
) as import_audio
,json_contains(
blocks
, { kind: 'image' }
) as import_image
,frontmatter
|| E'\n'
|| case when import_audio
then E'\nimport Audio from "../../../components/Audio.astro"'
else '' end
|| case when import_image
then E'\nimport Image from "../../../components/Image.astro"'
else '' end
|| E'\n\n'
|| content as content
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
) as data
inner join vw_post_content
as content using(postId)
inner join vw_post_frontmatter
as frontmatter using(postId)
where postId = (
select postId from vw_post_content
where headline = 'China Doll Sessions'
)
;
That gives us:
---
title: China Doll Sessions
created: 2024-04-30T09:59:48.880Z
featured: false
archived: true
cohost_project: jmsfbs
cohost_status: published
tags: ["recording","baby x","band","myspace","Drama","music","from:cohost"]
slug: 5777910-china-doll-sessions
---
import Audio from "../../../components/Audio.astro"
import Image from "../../../components/Image.astro"
<Audio title={"china doll guitar chain experiment"} artist={"jmsfbs, baby x"} src={import("./attachments/c7457348-f770-408a-8d54-317f7ac33ffe-China%20Doll%202.mp3")} />
<Image alttext={"I re-arranged my room, this is the new recording setup."} src={import("./attachments/e41ee3e0-be75-402c-8b5c-507fbca1124a-PXL_20240429_062918414.jpg")} />
<Audio title={"china doll guitar chain experiment 2"} artist={"jmsfbs, baby x"} src={import("./attachments/1039f2fa-9519-4b66-8b52-2704b1967a0e-China%20Doll%202.mp3")} />
This is an old song from my early [baby x](https://babyx.bandcamp.com/album/i-am-1-am) days, in fact, the first baby x jam predated Drew as the drummer and the song we rehearsed was China Doll!
The lyrics (aside from the breakdown section) were written by an old friend (Dom Gilchrist) but the words meant a lot to me at the time because the exact situation described in the song was happening where I was the friend who knew his other friend was being cheated on but didn't want to say anything to ruin their relationship. That whole situation led us to being temporarily estranged. This was myspace drama days!
It's a been a long time since I last recorded anything. I re-arranged my room months ago and life got in the way and I hadn't even plugged anything in.
I've done a few things differently this time which I might write about later, but the most interesting thing I guess is I've sort of got a hold on making guitars sound big. This eluded me for so long, and I'm just using free and stock plugins.
I tried piping the guitars into BiasFX and it instantly sounded amazing, but a) I don't want to become too dependent on paid plugins, and b) Bias is so slow when you start adding lots of layers. So instead I've been struggling with impulse response files and stock amp / cabinet simulators.
Then using a lot of side chain compression and EQ to make sure the lead cuts through the droning guitars whenever it comes in, but trying to keep it subtle enough that it doesn't feel like its being ducked.
Attached is a loop of that approach in practice.
My current thinking is I'm not going to release anything for a while, just keeping noodling and blogging and when I've got a set of songs spend some time refining them and making them work well together.
We never left the DuckDB shell
I think it is worth calling out at this point that technically we never left the DuckDB shell. And as foreign as this may seem, the total code to do this is pretty concise. It’s also a lot easier to debug because we deal with result sets (data) until the very last moment (code is data and data is code).
Yeah, at one point we generated a shell script from data and executed that from within the DuckDB shell.
But that bash script is just an artifact, its an executable result set.
Postgres’ repl has a slash command gexec
which lets you natively execute each row in a shell and get the result back as a new line in a result set. So I’m definitely not the first person to do this, its kind of cool!
Last step: Write the generated mdx to disk
We’ve generated the content, we’ve downloaded the attachment files. Let’s write it to disk.
Step 1, lets just write it to our output folder. And when we’re absolutely sure everything is perfect we’ll copy it into the real posts directory (used by posts like this), and astro should pick them up.
When in doubt, base64
Now for each post we need to write the content to disk at a different location.
We’ll use base64 so we don’t have to worry about shell escaping. We’ll encode the post as base64, and embed that in a command like so:
echo $base64 \
| base64 --decode \
> ./posts/$slug/index.mdx
First we change the output format, and set up the next command to write to our shell script
.mode list
.header off
.once ../output/write-content-to-disk.sh
with xs as (
select
json_contains(
blocks
, { kind: 'audio' }
) as import_audio
,json_contains(
blocks
, { kind: 'image' }
) as import_image
,headline
,postId
,slug
,frontmatter
|| E'\n'
|| case
when import_audio
then E'\nimport Audio from "../../../components/Audio.astro"'
else ''
end
|| case
when import_image
then E'\nimport Image from "../../../components/Image.astro"'
else ''
end
|| E'\n\n'
|| content as content
from read_json(
'./cohost-export/**/*/*/post.json'
, union_by_name=true
) as data
inner join vw_post_content
as content using(postId, headline)
inner join vw_post_frontmatter
as frontmatter using(postId, headline)
)
select
printf(
'mkdir -p ../output/posts/%s; echo "%s" | base64 --decode > "../output/posts/%s/index.mdx"'
, slug
, to_base64(
encode(content))
, slug
) as cmd
from xs
;
We now have a script that looks like this (with a lot more lines):
echo "Ci0tLQp0aXRsZTogTW92aW5nIHRvIGdob3N0PwpjcmVhdGVkOiAyMDIzLTA4LTI2VDAyOjU0OjEzLjM3OFoKcHVpYmxpc2hlZDogdHJ1ZQphcmNoaXZlZDogZmFsc2UKdGFnczogWyJnaG9zdCIsImJsb2ciLCJmcm9tOmNvaG9zdCJdCnNsdWc6IDI2MjU0NzAtbW92aW5nLXRvLWdob3N0Ci0tLQoKCkknbSB0aGlua2luZyBvZiBtb3ZpbmcgdGhpcyBwYWdlIHRvIGdob3N0LiAgSSB3YXMgdXNpbmcgY29ob3N0J3MgcnNzIEFQSSB0byBmZWVkIG15IHdlYnNpdGUsIGJ1dCB0aGVyZSdzIGJlZW4gYW4gb3V0c3RhbmRpbmcgaXNzdWUgZm9yIHVzZXJuYW1lcyB3aXRoIGh5cGhlbnMgaW4gdGhlbSBmb3IgbWFueSBtb250aHMgbm93IGFuZCBJIGFsc28gZG9uJ3QgdGhpbmsgY29ob3N0IGlzIHRoZSByaWdodCB0b29sIGZvciB0aGF0IGpvYiBhbnl3YXkuCgpNeSB0aGlua2luZyBpcyB0byBob3N0IGEgd2FrZSBvbiByZXF1ZXN0IGdob3N0IGluc3RhbmNlIG9uIGZseS4gIFRoZSBkb2NrZXIgaW1hZ2UgbG9va3MgcHJldHR5IHN0cmFpZ2h0IGZvcndhcmQuICBXZSdsbCBzZWUgaG93IGl0IGdvZXMuIA==" | base64 --decode > "../output/posts/2625470-moving-to-ghost/index.mdx"
echo "Ci0tLQp0aXRsZTogCmNyZWF0ZWQ6IDIwMjQtMDItMDRUMDI6MzU6MTguMTkyWgpwdWlibGlzaGVkOiB0cnVlCmFyY2hpdmVkOiBmYWxzZQp0YWdzOiBbInByb2dyYW1taW5nIiwianMiLCJmcm9tOmNvaG9zdCJdCnNsdWc6IDQzNjk4MDYta2luZC1vZi1zdXJwcmlzZWQtaG8KLS0tCgoKS2luZCBvZiBzdXJwcmlzZWQgaG93IG9mdGVuIEkgdXNlIGxhYmVscyBpbiBqcyBub3cu" | base64 --decode > "../output/posts/4369806-kind-of-surprised-ho/index.mdx"
Let’s run it:
.shell bash ../output/write-content-to-disk.sh
It works! Here is a sample of the outputted directories:
And here is the generated file content in the index.mdx
file referencing the correct attachments on disk.
Conclusion
This has been a long post! I hope it has been an interesting one. It’s cool to see the total code to do all of this in one screen is only 104 lines.
create or replace view vw_post_frontmatter as
-- ...
;
create or replace view vw_post_content as
-- ...
;
.mode list
.header off
.once ../output/download-attachment.sh
with xs as (
select
headline, unnest(blocks, recursive:= true) as block, * exclude (blocks)
from read_json('./cohost-export/**/*/*/post.json', union_by_name=true)
)
, ys as (
select
(singlePostpageURL).parse_filename() as slug
, (fileURL).parse_path()[-2:-1].array_to_string('-') as filename
, fileURL as src
, printf('../output/posts/%s/attachments', slug) as parent_dir
, printf('%s/%s', parent_dir, filename) as dest
, printf('mkdir -p "%s"; wget "%s" -O "%s"', parent_dir, src, dest) as cmd
from xs
where type <> 'markdown'
)
select cmd from ys
;
.mode list
.header off
.once ../output/write-content-to-disk.sh
with xs as (
select
json_contains(blocks, { kind: 'audio' }) as import_audio
,json_contains(blocks, { kind: 'image' }) as import_image
,FM.headline
,postId
,slug
,frontmatter
|| E'\n'
|| case when import_audio then E'\nimport Audio from "../../../components/Audio.astro"' else '' end
|| case when import_image then E'\nimport Image from "../../../components/Image.astro"' else '' end
|| E'\n\n'
|| content as content
from read_json('./cohost-export/**/*/*/post.json', union_by_name=true) as data
inner join vw_post_content as C using(postId)
inner join vw_post_frontmatter as FM using(postId)
)
select
printf('mkdir -p ../output/posts/%s; echo "%s" | base64 --decode > "../output/posts/%s/index.mdx"', slug, to_base64(encode(content)), slug) as cmd
from xs
;
.shell bash ../output/download-attachment.sh
.shell bash ../output/write-content-to-disk.sh
After a bit of time I’ll publish these Cohost posts on this site. First I need to think of a nice way to segment them from actual articles as when I render them all the home page looks a little busy. Additionally, some of the Cohost posts are close to being “real” articles I’d post here, so I may just upgrade them and republish in their own right.
Looking forward I’ll probably have a micro blog section that includes these processed cohosts and any other posts from around the web (e.g. Mastodon, BlueSky).
If you have a Cohost export and you’d like to convert it to a format like Markdown or MDX, I would recommend waiting around for someone to build a more robust tool.
But if you are interested in learning DuckDB I hope this has been a helpful practical demonstrattion of the power of DuckDB for general purpose data slicing and dicing.
Next time you’ve got to inspect some data and you’re reaching for tools like JQ maybe use it as an opportunity to try DuckDB out. I think it is so nice we can bring SQL to a domain where we’d normally have to learn some new and arcane DSL. Mastering SQL is non trivial, but most technical (and even semi-technical) people can write even a simple SQL query. That is very powerful and should be exploited if we want simpler systems to maintain.
Thank you for reading. If you’d like to reach out you can get in touch on Mastodon, BlueSky or Twitter.