aka Jim Winstead Jr.
Hire me!
I am a software developer. After teaching myself to code on Apple II computers in junior high school and building software for our family-owned telecom business, my degree in Computer Science from Harvey Mudd College gave me the grounding for a career in software development where I was able to let my programming freak flag fly. I worked for Knowledge Adventure, an educational/games software company (where I first started leading teams of developers); HomePage.com, an internet startup (where we developed a white-label free web hosting platform); and MySQL AB (where I led the web development team and then transitioned over to the server development team).
Through this time I was also involved in the open source community. In college, I was involved in the very early days of Linux, taking over from Linus the maintenance of the "root" disk used to install Linux before there were distributions. While working on websites for Knowledge Adventure, I got involved with the PHP project and became a founding member of the PHP Group. I helped set up and maintain some of the collaborative development infrastructure, including the mailing lists and the online documentation feedback system. Then at HomePage.com, I contributed work we did related to mod_perl, and became a member of the Apache Software Foundation.
After MySQL AB was acquired by Sun Microsystems (and then further by Oracle), I co-founded Raw Materials Art Supplies, a retail store in downtown Los Angeles which I have been operating as general manager for the last 15 years. I built Scat POS, a point-of-sale and ecommerce system using PHP and MySQL that we use to run the business. It has involved a lot of integration work (payment processors, product feeds, shipping), and I automated as much as I could so we can focus on our artists' needs. This enabled us to build our business from scratch into one of the top independent art supply stores in the country.
I'm looking for individual contributor roles. PHP/MySQL is the core of what I do best, but I'm comfortable with full-stack web development, Linux system administration, C/C++, and Docker, and I'm confident in my ability to jump into any tech stack and get up to speed quickly.
View my LinkedIn profile for some dates and details, and my GitHub profile for some of my open source contributions. Contact me to discuss opportunities.
Time to modernize PHP’s syntax highlighting?
This blog post about “A syntax highlighter that doesn't suck” was timely because recently I had been kicking at the code for the syntax highlighter that I use on this blog. It’s a very old JavaScript package called SHJS based on GNU Source-highlight.
I created a Git repository where I imported all of the released versions of SHJS and then tried to update the included language files to the ones from the latest GNU Source-highlight release (which was four years ago), but ran into some trouble. There are some new features to the syntax files that the old Perl code in the SHJS package can’t handle. And as you might imagine, the pile of code involved is really, really old.
That new PHP package seems like a great idea and all, but I really prefer leveraging the work other people have done to create syntax highlighting for many languages rather than inventing another highlighter from scratch.
On Mastodon, Ben Ramsey brought up a start he had made at trying to port Pygments, a Python syntax highlighter, to PHP.
I ran across Chroma, which is a Go package that is built on top of the Pygments language definitions. They’ve converted the Pygments language definitions into an XML format. The conversion doesn’t cover 100% of the languages, but it handles most of them.
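Pygments also happens to demonstrate the output approach I’d want: every token gets wrapped in a CSS class rather than a hard-coded color. A minimal sketch using the standard Pygments API:

```python
# Each token comes out as <span class="nf">, <span class="s2">, etc.,
# so all of the colors live in a separate stylesheet.
from pygments import highlight
from pygments.lexers import PhpLexer
from pygments.formatters import HtmlFormatter

code = '<?php echo strtoupper("hello"); ?>'
formatter = HtmlFormatter(cssclass="highlight")

print(highlight(code, PhpLexer(), formatter))

# And the matching stylesheet can be generated for any of its styles:
print(formatter.get_style_defs(".highlight"))
```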
At the end of the day, both GNU Source-highlight and Pygments (and its variants) are built on what are likely to remain imprecise parsers: they are mostly regex-based, and not the same lexing and parsing code actually used to handle these languages.
PHP has long had its own built-in syntax highlighting functions (`highlight_string()` and `highlight_file()`), but it looks like the generation code hasn’t been updated in a meaningful way in about 25 years. It just has five configurable colors that it uses for `<span style="color: #...;">` tags. There are many tokens that it simply outputs using the same color where it could make more distinctions. If it were to instead (or also) use CSS classes to mark every token with its exact type, you could do much finer-grained syntax highlighting.
Looks like an area ready for some experimentation.
Thoughts from SCALE 21x, day 4
Today was the last day of SCALE 21x. Again I didn’t make it out for the opening keynote, and I just took a quick spin around the expo floor to see it looking sort of quiet and winding down.
The first talk I attended was Jonathan Haddad on “Distributed System Performance Troubleshooting Like You’ve Been Doing it for Twenty Years,” where he shared some of the insights he gained from doing what the title says for companies like Apple and Netflix. His recommendation for greenfield deployments was to have OpenTelemetry set up to collect traces and logs, and he was also a big fan of the BPF Compiler Collection (aka bcc-tools) for getting a realtime look into system issues. He was not a fan of running databases in containers, and even less of a fan of running them within Kubernetes. (You could almost see his eye twitch.)
The last talk that I attended (there were just two slots today) was Jen Diamond on “The Git-tastic Power of Conventional Commits.” It was a good talk that used a little light lexical analysis to explain the basic concepts of working with Git (and the revelation that it stands for “global information tracker,” although a little more research now shows that’s only sort-of true). This all led into talking about Conventional Commits, which is a way of structuring commit messages, and how you could use that in automations and to drive semantic versioning in the release process.
The final session was a closing keynote from Bill Cheswick titled “I Love Living in the Future: Half a Century of Computers, Software, and Security,” but it really could have just been “give the old guy the microphone and let him go!” I left a little over two hours ago, and I wouldn’t be surprised to hear that he’s still going. I hope they let him take a bathroom break.
Thoughts from SCALE 21x, day 3
Another day, another set of thoughts on the experience. It was a busy day at the 21st edition of the Southern California Linux Expo, and the site was more crowded because an episode of America’s Got Talent was being filmed at the Civic Auditorium that sits between the two buildings the conference was held in. If I’d been on the ball, I would have taken a picture of Howie Mandel standing outside his limo.
I will admit that I took my time in the morning and didn’t make it over to Pasadena until after the keynote that kicked off the day.
The first talk that I attended was “Contribution is not only a code.” by Tatiana Krupenya, the CEO of DBeaver. She did a great job of breaking down the many ways that people can contribute to open source development aside from writing code, and I appreciated that her final point was the simplest contribution anyone can make that will always be well-received: a heartfelt thank-you to the maintainers of tools that you find valuable.
She also brought up what I am sure is a great talk by Zak Greant from EclipseCon 2019 titled “When Your Happy Dreams Are About Dying” about burnout in the open source developer community, which I’m looking forward to catching up on.
After that, it was off to Brian Proffitt’s “Measuring the Impact of Community Events” where he provided his perspective from his roles at the Red Hat OSPO, Apache Software Foundation, and other places. It was a great companion to the first session, but more from the perspective of why companies and projects may want to think about measuring how they engage with the community.
I took another spin through the expo during what was supposed to be the lunch break, and picked up my conference T-shirt and a free bucket hat from AWS.
After lunch, Tyler Menezes from CodeDay spoke about “Nurturing the Next Generation of Open Source Contributors” and how the non-profit he founded works to connect high school and college students from underprivileged backgrounds with resources to help them thrive in tech. One of the programs pairs small teams of students with a mentor to help them make a contribution to an open source project, and it sounds amazing. I plan to find a way to get involved once I have my employment situation sorted out.
The next talk was Heather Osborn on “Organic isn't always good for you,” which was sort of a case study of her experience as a DevOps leader tackling the complicated environment that had taken root at the startup where she was working, and how they figured out a strategy to straighten it out. It was really interesting to hear the language she used about convincing company management to buy into the plan, which seemed more adversarial and dismissive than the working environments that I’ve been in.
“Solving ‘secret zero’, why you should care about SPIFFE!” by Mattias Gees was by far the most technical talk that I attended today. Like the presentation on Presto yesterday, it seemed a bit like the sort of system that is very impressive and I will probably never need.
The last talk I attended was Michael Gat on “Anti-Patterns in Tech Cost Management” which was pretty true to the title. It was a little light on the open source aspect, but there were definitely insights there on the importance of laying the groundwork early for being able to do cost analytics on systems you’ll be scaling. There were three or so questions from people that started with “I’m an engineer, and ...” which I thought was great. I think what bothered me about Heather Osborn’s talk was how it implied a certain distaste for connecting the engineering to the business realities, and I think it is very important for engineers to understand, and have respect for, business decision-making.
One more day to go. I am surprised how heavy the program is on cloud computing and DevOps, but I guess that’s a huge chunk of what people are working on these days. What I have been missing so far are programming-focused talks.
Thoughts from SCALE 21x, day 2
The second day of the Southern California Linux Expo meant the start of the expo floor, and more talks.
I started the day with “Best Practices for Running Databases on Kubernetes” with Peter Zaitsev, who was a coworker at MySQL and went on to found Percona. While I am getting a better sense of what Kubernetes is all about and already had some idea of how databases might exist in that world, his talk was a great overview and the “best practices” seemed to cover a lot of bases.
That was followed by “Kubernetes and Distributed SQL Databases: Same Consistency With Better Availability and Scalability,” which showed off using Kine as a way to plug different systems in as the data store for Kubernetes instead of `etcd`. I wish the speaker had spent a little more time giving some practical examples of why this is something you would even want to do. It was a good reminder that k3s exists and I should play around with it. And the speaker just using an outline in an open text editor (Pico!) as his slides reminded me of when I gave a talk on MySQL and PHP using plain-text slides. (Looks like my talk has been disappeared, though.)
After that, it was back over to the other side of the expo for a talk on “Leveraging PrestoDB for data success” which was an overview of the Presto project, which provides an ANSI SQL query interface to a collection of other data sources (my paraphrase). Kiersten Stokes, the presenter who works at IBM, called MySQL a “traditional database” which struck me as funny. Presto is a very slick and powerful system that I will probably never need. I appreciate that everyone I have seen talk about the concept of a “data lakehouse” is appropriately embarrassed about the name.
Before the next round of talks started, the expo floor finally opened, so I took a quick spin through that. It was pretty busy, and seemed like a good crowd of projects and companies. I think the largest footprint was maybe a couple of 10' × 40' booths from companies like AWS and Meta, but otherwise it was a lot of 10' × 10' booths with a couple of people handing out stickers or other promotional items from behind a table (and talking about their projects/companies).
After that I went back to the MySQL track (four talks!) to see “Design and Modeling for MySQL” which was really more of a speed-run of database history and concepts. The presenter made the classic mistake of white text on a dark background so it was pretty tough to see what he was showing until someone dimmed the lights.
That was followed by “Beyond MySQL: Advancing into the New Era of Distributed SQL with TiDB” from Sunny Bains, whose time on the MySQL/InnoDB team overlapped with my time working at MySQL, but I don’t think we ever met. TiDB seems like a very impressive cloud-native distributed database which doesn’t actually derive from MySQL, but instead has chosen to be protocol- and query-language-compatible.
The last session I attended was a panel from the Open Government track on “The OSPO POV.” OSPO stands for “Open Source Program Office” and can act as kind of the interface between companies or organizations and the open source world. There were a bunch of projects and communities mentioned that I want to look into further: TODO Group, Fintech Open Source Foundation, CHAOSS (Community Health Analytics in Open Source Software), Sustain, The Open Source Way, Inner Source Commons, and OSPO++.
Things got busier today, which was nice to see. I wasn’t in a great headspace most of the day, which pretty much sucked, but I think I came away with a lot of things to dig into on my own, which is one of the reasons I wanted to attend.
Thoughts from SCALE 21x, day 1
Today was the first day of the 21st Southern California Linux Expo, also known as SCALE 21x. I gave a talk way back at SCALE 4x and hadn’t made it back since.
I attended a couple of talks on the UbuCon track at the beginning of the day. They weren’t technical talks, but focused on how the Ubuntu community operates and how Canonical relates to that. It sounds like Canonical has opened itself up more to the community by adopting Matrix as both their internal communications tool as well as what the community uses, which I think is very important for encouraging the developers in a commercial open source environment to engage with the community. This was an issue for us back in the MySQL days, too.
(There was also a comment about “neck beards” being annoying about not adopting newer communication tools and wanting everyone to stick with IRC, I think coming from someone involved with openSUSE, which I thought was kind of funny.)
After that, I popped over to the beginning of the Kwaai Personal AI Summit because Doc Searls was giving a (brief) talk and I thought I would see if there was anything to this AI thing that I’ve been hearing about. The room had a lot of old dude energy that just wasn’t sitting right with me, so I ended up bailing after Doc’s talk.
Since I left that earlier than I had planned, I ended up wandering into a PostgreSQL talk on how “wait events” can be used for troubleshooting performance, and I had a déjà vu moment because only yesterday I had run across the old worklog for MySQL’s `PERFORMANCE_SCHEMA`, which blames (er, credits) me for suggesting that’s what the name of the schema should be. It was yet another random “plate of shrimp” moment of the sort that has been happening with some frequency as of late.
Then I attended a workshop from the Kubernetes Community Day track on using Argo CD to put the OpenGitOps principles into practice. While I have been using Docker for a while, I haven’t really played around with Kubernetes or other container automation tools, so I figured this might be a good way to start learning more. Unfortunately, the hands-on workshop part of the session didn’t actually work due to some problem with the training environment from the sponsoring company, which kind of helped reinforce my instinct that a lot of these tools still have a lot of sharp edges. The concept sounds great, though.
Finally, I popped back over to the PostgreSQL track for their (apparently popular) “Ask Me Anything” session with some of the prominent community members and core developers that were in attendance. I was reminded today that the PostgreSQL project doesn’t have a bug tracker aside from their mailing list archive. I remembered writing about this before, and it turns out that was in 2008. (No shade intended that they don’t have one, it seems to be working out okay.)
That was the day. I really don’t want to seem like I am passing any judgement on anything, because I know that putting on an event like this is tremendously difficult, and while there is an impressive line-up of sponsors, this is clearly a community-driven and focused event. I was disappointed by how old, white, and male the crowd seemed to be (fully acknowledging that’s my demographic), and I’ll be interested to see if that holds true for the whole run or if this was an outlier day because it was more workshop-oriented and the expo floor wasn’t open.
How I use Docker and Deployer together
I thought I’d write about this because I’m using Deployer in a way that doesn’t really seem to be supported.
After the work I’ve been doing with Python lately, I can see that the way I have been using Docker with PHP is sort of comparable to how `venv` is used there.
On my production host, my `docker-compose` setup all lives in a directory called `tmky`. There are four containers: `caddy`, `talapoin` (PHP-FPM), `db` (the database server), and `search` (the search engine, currently Meilisearch).
There is no installation of PHP aside from that `talapoin` container. There is no MySQL client software on the server outside of the `db` container.
I guess the usual way of deploying in this situation would be to rebuild the PHP-FPM container, but what I do is just treat that container as a runtime environment and the PHP code that it runs is mounted from a directory on the server outside the container.
It’s in `${HOME}/tmky/deploy/talapoin` (which I’ll call `${DEPLOY_PATH}` from now on). `${DEPLOY_PATH}/current` is a symlink to something like `${DEPLOY_PATH}/release/5`.
The important bits from the `docker-compose.yml` look like:
```yaml
services:
  talapoin:
    image: jimwins/talapoin
    volumes:
      - ./deploy/talapoin:${DEPLOY_PATH}
```
This means that within the container, the files still live within a path that looks like `${HOME}/tmky/deploy/talapoin`. (It’s running under a different UID/GID, so it can’t even write into any directories there.) The `caddy` container has the same volume setup, so the relevant `Caddyfile` config looks like:
```
trainedmonkey.com {
    log

    # compress stuff
    encode zstd gzip

    # our root is a couple of levels down
    root * {$DEPLOY_PATH}/current/site

    # pass everything else to php
    php_fastcgi talapoin:9000 {
        resolve_root_symlink
    }

    file_server
}
```
(I like how compact this is, Caddy has a very it-just-works spirit to it that I dig.)
So when a request hits Caddy, it sees a URL like `/2024/03/09`, figures out there is no static file for it, and throws it over to the `talapoin` container to handle, giving it a `SCRIPT_FILENAME` of `${DEPLOY_PATH}` and a `REQUEST_URI` of `/2024/03/09`.
When I do a new deployment, `${DEPLOY_PATH}/current` will get relinked to the new release directory, the `resolve_root_symlink` from the `Caddyfile` will pick up the change, and new requests will seamlessly roll right over to the new deployment. (Requests already being processed will complete unmolested, which I guess is kind of my rationale for avoiding deployment via updated Docker container.)
Here is what my `deploy.php` file looks like:
```php
<?php
namespace Deployer;

require 'recipe/composer.php';
require 'contrib/phinx.php';

// Project name
set('application', 'talapoin');

// Project repository
set('repository', 'https://github.com/jimwins/talapoin.git');

// Host(s)
import('hosts.yml');

// Copy previous vendor directory
set('copy_dirs', ['vendor']);
before('deploy:vendors', 'deploy:copy_dirs');

// Tasks
after('deploy:cleanup', 'phinx:migrate');

// If deploy fails automatically unlock.
after('deploy:failed', 'deploy:unlock');
```
Pretty normal for a PHP application; the only real additions here are using Phinx for the data migrations and using `deploy:copy_dirs` to copy the `vendor` directory from the previous release so we are less likely to have to download stuff.
That `hosts.yml` is where it gets tricky, because when we are running PHP tools like `composer` and `phinx`, we have to run them inside the `talapoin` container.
```yaml
hosts:
  hanuman:
    bin/php: docker-compose -f "${HOME}/tmky/docker-compose.yml" exec --user="${UID}" -T --workdir="${PWD}" talapoin
    bin/composer: docker-compose -f "${HOME}/tmky/docker-compose.yml" exec --user="${UID}" -T --workdir="${PWD}" talapoin composer
    bin/phinx: docker-compose -f "${HOME}/tmky/docker-compose.yml" exec --user="${UID}" -T --workdir="${PWD}" talapoin ./vendor/bin/phinx
    deploy_path: ${HOME}/tmky/deploy/{{application}}
    phinx:
      configuration: ./phinx.yml
```
Now, when it’s not being pushed to an OCI host that likes to fall flat on its face, I can just run `dep deploy` and out goes the code.
I’m also running Deployer in a Docker container on my development machine, thanks to my fork of docker-deployer. Here’s my `dep` script:
```sh
#!/bin/sh
exec \
  docker run --rm -it \
    --volume $(pwd):/project \
    --volume ${SSH_AUTH_SOCK}:/ssh_agent \
    --user $(id -u):$(id -g) \
    --volume /etc/passwd:/etc/passwd:ro \
    --volume /etc/group:/etc/group:ro \
    --volume ${HOME}:${HOME} \
    -e SSH_AUTH_SOCK=/ssh_agent \
    jimwins/docker-deployer "$@"
```
Anyway, I’m sure there are different and maybe better ways I could be doing this. I wanted to write this down because I had to fight with some of these tools a lot to figure out how to make them work how I envisioned, and just going through the process of writing this has led me to refine it a little more. It’s one of those classic cases of putting in a lot of hours to end up with a relatively few lines of code.
I’m also just deploying to a single host; deployment to a real cluster of machines would require more thought and tinkering.
Back on Linode
For some reason I couldn’t keep the instances I was setting up on Oracle Cloud Infrastructure (OCI) from eating themselves when I did something fancy like run `apt-get update`, so I moved everything back to Linode ($100 referral credit there) on one of the lowest-price Nanode compute instances.
I took the opportunity to rebuild the host on Debian just to give that a spin. My setup runs on containers managed by `docker-compose`, so the underlying system doesn’t matter to me that much.
I should probably be using this as an opportunity to learn some infrastructure-as-code tools.
Release early, release often
One of the benefits of starting Frozen Soup from a project template is that someone very smart (Simon) has done all the heavy lifting to make publishing it into the Python ecosystem really easy to do. So after I added a new feature today (pulling in external `url(...)` references in CSS inline as `data:` URLs), I went ahead and registered the project on PyPI, tagged the release on GitHub, and let the GitHub Actions that were part of the project template do the work of publishing the release. It worked on the first try, which is lovely.
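The feature itself is conceptually simple. This isn’t the actual Frozen Soup code, just a sketch of the shape of it, with a made-up function name:

```python
import base64
import re
import urllib.parse
import urllib.request

# Matches url(...) references in CSS, with optional quotes
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

def inline_css_urls(css: str, base_url: str) -> str:
    def replace(match: re.Match) -> str:
        ref = match.group(1)
        if ref.startswith("data:"):
            return match.group(0)  # already a data: URL, leave it alone
        absolute = urllib.parse.urljoin(base_url, ref)
        with urllib.request.urlopen(absolute) as response:
            content_type = response.headers.get_content_type()
            data = response.read()
        encoded = base64.b64encode(data).decode("ascii")
        return f"url(data:{content_type};base64,{encoded})"

    return URL_RE.sub(replace, css)
```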
I pushed more changes after I did that release, adding a way to set timeouts and fixing the first issue (that I also filed) about pre-existing `data:` URLs getting mangled. I also added a quick-and-dirty server version, which allows for getting the single-file HTML version of a page and makes it a little easier to play around with the single-file version of live URLs without having to deal with saving and opening the files.
So I did a second release.
Introducing Frozen Soup
I made a new thing, which I decided to call Frozen Soup. It creates a single-file version of an HTML page by in-lining all of the images using `data:` URLs, and pulling in any CSS and JavaScript files.
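To give a flavor of the core trick, here is a simplified sketch (not Frozen Soup’s actual code; it assumes requests and BeautifulSoup are available):

```python
import base64

import requests
from bs4 import BeautifulSoup

def inline_images(html: str, base_url: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img", src=True):
        # Resolve relative references against the page's URL
        absolute = requests.compat.urljoin(base_url, img["src"])
        response = requests.get(absolute, timeout=10)
        # Fall back to a generic type if the server doesn't say
        content_type = response.headers.get("Content-Type",
                                            "application/octet-stream")
        encoded = base64.b64encode(response.content).decode("ascii")
        img["src"] = f"data:{content_type};base64,{encoded}"
    return str(soup)
```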
It is loosely inspired by SingleFile, which is a browser extension that does a similar thing. There are also tools built on top of that which let you automate it, but then you’re spinning up a headless browser, and it all felt very heavyweight. The venerable `wget` will also pull down a page and its prerequisites and rewrite the URLs to be relative, but I don’t think it has a comparable single-file output.
This may also exist in other incarnations; this is mostly an excuse for me to practice with Python. As such, it is a very crude first draft right now, but I hope to keep tinkering with it for at least a little while longer.
I have also been contributing some changes and test cases to ArchiveBox, but this is different yet also a little related.
Who is getting the ping data?
The excellent journalists at 404 Media uncovered information about Automattic selling WordPress.com and Tumblr data to “AI” companies. A number of people have jumped in to say that “WordPress.com != WordPress,” which is a fair point, but this is where I touch the onion on my belt and point out that even a self-hosted WordPress install sends pings to Ping-o-Matic by default, and does anyone know who gets a real-time feed of that data?
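(For the curious: a ping is just a tiny XML-RPC call. Something like this sketch, using the standard weblogUpdates interface and what I believe is still Ping-o-Matic’s endpoint, fires every time a default WordPress install publishes a post.)

```python
import xmlrpc.client

# weblogUpdates.ping is the ancient standard; rpc.pingomatic.com is
# the Ping-o-Matic endpoint that WordPress pings by default.
server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
result = server.weblogUpdates.ping("Example Blog", "https://example.com/")
print(result)  # something like {'flerror': False, 'message': '...'}
```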
Grinding the ArchiveBox
I have been playing around with setting up ArchiveBox so I could use it to archive pages that I bookmark.
I am a long-time, but infrequent, user of Pinboard and have been trying to get in the habit of bookmarking more things. And although my current paid subscription doesn’t run out until 2027, I’m not paying for the archiving feature. So as I thought about how to integrate my bookmarks into this site, I started looking at how I might add that functionality. Pinboard uses `wget`, which seems simple enough to mimic, and I also found other tools like SingleFile.
That’s when I ran across mention of ArchiveBox and decided that would be a way to have the archiving feature I want and don’t really need/want to expose to the public. So I spun it up on my in-home server, downloaded my bookmarks from Pinboard, and that’s when the coding began.
ArchiveBox was having trouble parsing the RSS feed from Pinboard, and as I started to dig into the code, I found that instead of using an actual RSS parser, it was parsing the feed either with regexes (the `generic_rss` parser) or an XML parser (the `pinboard_rss` parser). Both of those seemed insane to me for a Python application when feedparser has practically been the gold standard of RSS/Atom parsers for 20 years.
After sleeping on it, I decided to roll up my sleeves, bang on some Python code, and produce a pull request that switches to using `feedparser`. (The big thing I didn’t tackle is adding test cases, because I haven’t yet wrapped my head around how to run those for the project when running it within Docker.)
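For comparison, this is roughly all it takes with feedparser (a trivial sketch, not the actual code from the pull request; the feed URL is made up):

```python
import feedparser

feed = feedparser.parse("https://feeds.pinboard.in/rss/u:example/")
for entry in feed.entries:
    # link, title, and published_parsed are normalized whether the
    # feed is RSS 1.0, RSS 2.0, or Atom
    print(entry.link, entry.title, entry.get("published_parsed"))
```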
Later, I realized that the RSS feed of my bookmarks that I was pulling would be good for pulling on a schedule to keep archiving new bookmarks, but I actually needed to export my full list of bookmarks in JSON format and use that to get everything into the system from the start.
But that importer is broken, too. And again, it’s because instead of just using the `json` parser in the intended way, there was a hack to work around what appears to have been a poor design decision (ArchiveBox would prepend the filename to the JSON data when storing it for later reading), and then another hack got piled on top of that when the decision was changed. The `generic_json` parser used to always skip the first line of the file, but when that stopped being necessary, the line-skipping wasn’t just removed; it was replaced with code that suddenly expected the JSON file to look a certain way.
Now I’ve been reading more Python code and writing a little bit, and I’m starting to get more comfortable with some of the idioms. I didn’t make a full pull request for it, but my comment on the issue shows a different strategy: try to parse the file as-is, and if that fails, skip the first line and try again. That should handle any JSON file with garbage in the first line, such as what ArchiveBox used to store. And maybe there is some system out there that exports bookmarks in a format it calls JSON that actually has garbage on the first line. (I hope not.)
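Sketched out, that strategy looks like this (hypothetical function name, not ArchiveBox’s actual code):

```python
import json

def parse_json_export(text: str):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back for files with garbage on the first line, like
        # ArchiveBox's old filename-prefixed exports
        first_newline = text.find("\n")
        if first_newline == -1:
            raise
        return json.loads(text[first_newline + 1:])
```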
So with that workaround applied locally, my Pinboard bookmarks still don’t load, because ArchiveBox uses the timestamp of the bookmark as a unique primary key and I have at least a couple of bookmarks that happen to have the same timestamp. I am glad to see that fixing that is on the project roadmap, but I feel like every time I dig deeper into trying to use ArchiveBox, it has me wondering why I didn’t start from scratch and put together what I wanted from more discrete components.
I still like the idea of using ArchiveBox, and it is a good excuse to work on a Python-based project, but sometimes I find myself wondering if I should pay more attention to my sense of code smell and just back away slowly.
(My current idea for working around the timestamp collision problem is to add some fake milliseconds to the timestamps as bookmarks are added. That should avoid collisions within a single import. Or I could just edit my Pinboard export and cheat the times to duck the problem.)
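Something like this sketch of the fake-milliseconds idea (made-up helper, not real ArchiveBox code):

```python
def uniquify_timestamps(timestamps: list[float]) -> list[float]:
    """Nudge duplicate timestamps forward by fake milliseconds."""
    seen = set()
    result = []
    for ts in timestamps:
        while ts in seen:
            ts += 0.001  # bump until unique within this import
        seen.add(ts)
        result.append(ts)
    return result

# e.g. [1711000000.0, 1711000000.0] -> [1711000000.0, 1711000000.001]
```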
Oracle Cloud Agent considered harmful?
Playing around with my OCI instances some more, I looked more closely at what was going on when I was able to trigger the load going out of control, which seemed to be anything that did a fair amount of disk I/O. What quickly stuck out, thanks to `htop`, is that there were a lot of Oracle Cloud Agent processes that were blocking on I/O.
So in the time-honored tradition of troubleshooting by shooting suspected trouble, I removed Oracle Cloud Agent.
After doing that, I can now do the things that seemed to bring these instances to their knees without them falling over, so I may have found the culprit.
I also enabled PHP’s OPcache, and some rough-and-dirty testing with good ol’ `ab` says I took the homepage from 6 r/s to about 20 r/s just by doing that. I am sure there’s more tuning that I could be doing. (Requesting a static file gets about 200 r/s.)
By the way, the documentation for how to remove Oracle Cloud Agent on Ubuntu systems is out of date. It is now a Snap package, so it has to be removed with `sudo snap remove oracle-cloud-agent`. And then I also removed `snapd`, because I’m not using it and I’m petty like that.