#soylent | Logs for 2026-05-27

« return
[00:55:26] -!- systemd [systemd!~systemd@pid1] has joined #soylent
[01:00:11] -!- bender has quit [Remote host closed the connection]
[01:00:20] -!- bender [bender!bot@Soylent/Bot/Bender] has joined #soylent
[01:01:23] -!- Loggie [Loggie!Loggie@Soylent/BotArmy] has joined #soylent
[01:37:32] <kolie2> Did soylentnews ever support OpenID?
[01:37:51] <chromas> Not that I remember
[01:38:01] <kolie2> It's in the rehash code.
[01:38:09] <chromas> Maybe they were going to try it out
[01:38:40] <chromas> I don't recall it being deployed. Maybe janrinok remembers
[01:41:50] <kolie2> yea it was added in 2009
[01:42:55] <chromas> Definitely before our time
[01:43:40] <kolie2> i think the one in rehash is the old one / not the new one with the same na,e.
[01:43:56] <kolie2> OpenID 2.0 is prob what it is and OIDC is the new OAuth 2.0 shit.
[01:43:58] <chromas> is there a new openid?
[01:44:26] <chromas> 2009 is the last pubic release of slash, what they used to boot soylent but that was 2014
[01:44:58] <chromas> I guess openid and perl both were rebooted since then
[01:56:11] <kolie2> the new one is the OIDC one.
[01:56:32] <kolie2> https://learn.microsoft.com
[01:56:33] <systemd> ^ 0303OAuth 2.0 and OpenID Connect protocols
[05:11:06] <kolie2> Adobe Flash was pinnacle internet. Change my mind.
[05:12:58] <chromas> Webtard devs wanted to make whole websites with it. We said no, so now they bloat the browser itself to allow gay webapps.
[05:39:38] <janrinok> kolie2, openid? I don't think so, but the code might have been there from the original fork.
[09:46:43] -!- aristarchus [aristarchus!~aristarch@185.107.ix.vkx] has joined #soylent
[09:50:36] -!- aristarchus has quit [Client Quit]
[13:57:01] -!- Ingar [Ingar!~ingar@2001:bc8:1640:mwzr:pxpv:wq:jjgg:moil] has joined #soylent
[13:58:05] <Ingar> https://www.youtube.com
[13:58:07] <systemd> ^ 0303Soylent Green
[16:00:52] <janrinok> kolie2, there have been numerous short disruptions to the site during the last 24 hours. These are recovered quickly but it is something that I have not seen before. Not sure what it causing it.
[16:02:24] <janrinok> They tend to be 50x errors.
[16:04:18] <kolie2> if u have url + more precise times that would be helpful
[16:04:28] <kolie2> ill check the logs i got.
[16:12:24] <janrinok> I haven't got times. I've been elsewhere today and can see the breaks but they do not have associated times.
[16:21:33] <kolie2> Ok I see what's going on.
[16:23:06] <kolie2> Basically apache spawns workers, and varnish sits infront of that - cached stuff is good, some workers are poisoned, some aren't if you get a poisoned worker, you get a 500
[16:23:25] <kolie2> theres a race issue with the config existing and apache starting up.
[16:25:02] <kolie2> "dc up -d" would recycle rehash and resolve, i am digging in a bit more if I can see what's cause the race rn toh.
[16:33:27] <kolie2> This also exposes the fix I thought I put in last week did not take effect - so I get to confirm that too :)
[16:55:05] <kolie2> janrinok, you should see 500 free.
[17:00:20] <kolie2> 04== rehash 500-error investigation & fix ==
[17:00:20] <kolie2> 14TL;DR: ~half the Apache prefork children were poisoned at fork time and fast-failed every request with HTTP 500 in <5ms. Now fixed.
[17:00:20] <kolie2> 07-- Symptoms (last 24h before fix) --
[17:00:21] <kolie2> 08~6,400 HTTP 500s in 15h, baseline 08~400-500/hr with spikes to 0860/min
[17:00:23] <kolie2> Repro: 1125 curls to localhost:80/ inside the container -> 0417 returned 500 in 1-3ms, 038 returned 200 in ~270ms
[17:00:26] <kolie2> Apache child PIDs split into 2 cohorts: 035 doing real work (60+ min CPU) vs 045 fast-fail-only (2-3 sec CPU total)
[17:00:29] <kolie2> Varnish hit ratio only 08~21% so most reqs reached apache; ~half of those hit a poisoned child = 04~40% backend 500 rate
[17:00:31] <kolie2> janrinok saw "short disruptions" b/c varnish caches mask most of it; misses + scanner bursts leaked through
[17:00:34] <kolie2> 07-- Root cause --
[17:00:37] <kolie2> Apache forks N children at startup. The per-server Slash config (11$cfg->{slashdb}) is populated in the parent at config-load time by SlashVirtualUser. Children inherit it via copy-on-write fork — BUT the DB handle was disconnected before fork, and the cached config relies on a successful first 11sqlConnect on first request. When a cohort of children forks simultaneously and slams MariaDB during its InnoDB buffer-pool warm-u
[17:00:42] <kolie2> p (~48s after start), some races: getCurrentDB() returns the inherited object but downstream 11getObject('Slash::DB', ...) returns undef. Result: 04Can't call method "sqlConnect" on undefined value at Slash/Apache/Banlist.pm line 44 -> 500 returned in ~2ms. Child stays alive (no MaxRequestsPerChild), keeps fast-failing forever.
[17:00:46] <kolie2> Same root cause feeds: 11getSkins / createAccessLog / Global $r object is not available errors.
[17:00:48] <kolie2> 07-- Three fixes deployed --
[17:00:54] <kolie2> 1) 03Worker recycling actually enabled
[17:00:56] <kolie2> 14rehash/rehash/Dockerfile: previously had 11sed -i 's/MaxRequestsPerChild 0/.../' httpd-mpm.conf -- but that file is NOT Include'd from httpd.conf, so the edit was dead. Replaced with direct echo into httpd.conf: 11MaxClients 50 / MaxRequestsPerChild 500. Poisoned children now die after 500 reqs and respawn one-at-a-time into a warm DB.
[17:00:59] <kolie2> 2) 03New PerlChildInitHandler: Slash::Apache::ChildInit
[17:01:01] <kolie2> 14rehash/rehash/Slash/Apache/ChildInit/ChildInit.pm (new dist, registered in slash.conf). Runs once per worker right after fork: calls 11sqlConnect on the inherited slashdb with retry+exponential backoff (50/100/200/400ms, 4 attempts max). Warms the DBI handle before the worker accepts its first request. Never dies — if all retries fail, warns to error_log and lets lazy retry handle it on first request.
[17:01:07] <kolie2> 3) 03Hardened the security updates
[17:01:09] <kolie2> 14rehash previously had 044x die() calls inside 11verify_request_call / 11check_request_poison that ran in a PerlCleanupHandler — die in cleanup phase skips the rest of the cleanup chain. Replaced with: cached per-worker filehandle (no open/close churn), non-blocking flock (LOCK_EX|LOCK_NB so locking never serializes requests), eval{} wrappers on every helper, warn-not-die on any failure with auto-reopen of the fh on next call.
[17:01:14] <kolie2> 07-- Results (post-deploy, ~1056 req sample) --
[17:01:16] <kolie2> 03500 count: 6,400/hr -> 0
[17:01:18] <kolie2> 03Repro curl loop: 17/25 fast-fails -> 0/25 (100/100 also 200)
[17:01:24] <kolie2> 03Apache children: 5 working+5 broken bimodal -> 10 all working uniformly
[17:01:26] <kolie2> 03Static asset 500s (/images/topics/topic*.png, *.css) -> gone (hardened)
[17:01:28] <kolie2> Deploy: 11cd ~/rehash && docker compose build rehash && docker compose up -d rehash
[18:17:12] <kolie2> System looks really good rn.
[18:34:58] -!- jje has quit [Quit: ZNC - https://znc.in]
[18:39:21] -!- jje [jje!jje@ddjffrbpjin.info] has joined #soylent
[18:56:14] <janrinok> Thanks, I'll watch it through tomorrow. Not sure how we create poisoned workers but I will take your word for it!
[23:29:44] <kolie2> If they start there is a call to get and cache a db per worker
[23:29:57] <kolie2> If that call fails, that worker will 500 any request, included static resources.
[23:30:14] <kolie2> Last week I noticed there was worker issues and put worker recycling in every 50 requests.
[23:30:18] <kolie2> That didn't take effect.
[23:30:22] <kolie2> I did it for other reasons anyways.