#staff | Logs for 2024-06-17
[01:10:16] -!- chromas2 [chromas2!~chromas@Soylent/Staph/Infector/chromas] has joined #staff
[01:10:16] -!- mode/#staff [+v chromas2] by Ares
[01:11:16] -!- chromas has quit [Ping timeout: 252 seconds]
[01:11:58] chromas2 is now known as chromas
[03:32:04] -!- drussell_ has quit [Ping timeout: 252 seconds]
[03:37:01] -!- drussell [drussell!~drussell@2604:3d08:a77f:rvyq:ksjx:vrvy:qnyi:htkl] has joined #staff
[09:21:50] -!- kolie has quit [Quit: ZNC 1.8.2 - https://znc.in]
[09:23:03] -!- kolie [kolie!~kolie@208.91.qqu.m] has joined #staff
[13:41:28] -!- drussell has quit [Ping timeout: 252 seconds]
[13:42:05] -!- drussell [drussell!~drussell@2604:3d08:a77f:sino:ykrj:snio:qtwr:spir] has joined #staff
[16:36:26] -!- kolie has quit [Changing host]
[16:36:26] -!- kolie [kolie!~kolie@Soylent/Staff/Management/kolie] has joined #staff
[16:49:01] <kolie> heyo
[17:07:47] <Deucalion> heyo kolie
[17:07:52] <Deucalion> sup?
[17:08:10] <kolie> nm, just getting the day going
[17:08:20] <kolie> Was there an issue still ongoing?
[17:09:46] <Deucalion> I believe jan is still stuck with whatever the issue is affecting both prod and dev performance. Not sure if it was temporarily resolved by clearing some space on helium.
[17:10:30] <Deucalion> He was saying something about sql being configured to create unnecessarily large binary files in the / dir on helium.
[17:10:55] <Deucalion> Outside my ken. Probably best to touch base with him.
[17:14:28] <Deucalion> kolie, what is magnesium used for? That's the only one I can't ssh to. Not asking for access per se - no idea what it does, so no idea if I might need access :D
[17:15:00] <Deucalion> Just trying to get back up to speed about what is hosted where and how it's configured these days.
[17:18:30] <Deucalion> Oh and what credentials are needed for tech.soylentnews.org ? I thought it was our kerberos login - but that doesn't work :/
[17:18:35] <kolie> I don't know what they do by names.
[17:19:00] <kolie> I can certainly look.
[17:19:33] <Deucalion> 23.239.29.31 the only 2GB RAM Linode instance we have :)
[17:20:38] <kolie> well, it doesn't seem to have ssh enabled
[17:21:37] <Deucalion> I have a feeling it is or was used for backup purposes. But I've been out of the game so long I have NFI what's where these days other than IRC and Mail being on Bery
[17:21:59] <kolie> yea, its function hasn't changed really.
[17:22:08] <kolie> but I don't know what that is and I'm not logged into it yet
[17:23:45] <Deucalion> The server roles were kinda documented on the tech wiki (tech.soylentnews.org) but I can't login (or have forgot how! ) :D
[17:23:55] <kolie> looks like the nginx front end at least
[17:24:06] <kolie> 443, 80 are on it
[17:24:09] <Deucalion> Ahh
[17:24:34] <Deucalion> Are we still running varnish in front of apache?
[17:24:37] <kolie> yea ssl termination
[17:24:43] <kolie> It wouldn't have changed
[17:24:57] <kolie> I think varnish is running on 2626 in front of the uhh
[17:24:57] <kolie> perl
[17:25:18] <kolie> Just a caching engine, not particularly relevant, but the basic structure is untouched.
[17:26:07] <Deucalion> So varnish sits in between rehash and apache - then nginx sits in front of apache? Or did apache go away for the main site entirely in favour of nginx?
[17:26:34] <kolie> Nothing has changed - I can't comment on the various pieces. There is an nginx front end for ssl termination, it's on magnesium, that's all I can confirm.
[17:26:45] <kolie> I can tell you where it forwards to.
[17:26:50] <kolie> 2626 is used for IPN/perl
[17:27:11] <kolie> I'm pretty sure the rehash uses varnish but I'm not confirming that 100%
[17:27:39] <Deucalion> It's a mysterious beast :D
[17:27:43] <kolie> I don't want to give conflicting information.
[17:27:52] <kolie> The things I know for sure and can confirm, I'll say so.
[17:28:28] <kolie> It's easy for me to figure that out/solve things, but it's not in my head how the old system is.
[17:29:30] <Deucalion> Fair enough :D Thanks :D
[17:29:32] <kolie> Anything I can look at further rn?
[17:29:37] <kolie> I'm here, I got it open.
[17:29:53] <kolie> Happy to assist or do what ever I can.
[17:30:58] <kolie> 443 on nginx goes to rehash, which is basically fluorine port 80
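A minimal sketch of what an SSL-terminating nginx front end like the one being described might look like; the config path, server name, certificate paths, and backend hostname are assumptions for illustration, not pulled from magnesium's actual setup:

    # purely illustrative: terminate TLS on 443 and pass plain HTTP to rehash on fluorine:80
    cat > /etc/nginx/conf.d/rehash-frontend.conf <<'EOF'
    server {
        listen 443 ssl;
        server_name soylentnews.org;                        # assumed name
        ssl_certificate     /etc/ssl/certs/sn.crt;          # hypothetical paths
        ssl_certificate_key /etc/ssl/private/sn.key;
        location / {
            proxy_pass http://fluorine:80;                  # assumed backend, per the discussion
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
    EOF
    nginx -t && nginx -s reload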
[17:31:26] <Deucalion> lol I thought rehash was on helium :)
[17:31:42] <kolie> sec
[17:32:00] <kolie> fluorine has a running slashd instance, looks like rehash is on fluorine
[17:32:08] <Deucalion> Or is that just the sql server - has the most ram
[17:32:12] <kolie> I thought helium has db
[17:32:13] <kolie> yea
[17:32:17] <kolie> exactly. and maybe dns.
[17:32:43] <kolie> yea, varnish is on 80 on fluorine, so confirming that now.
[17:33:40] <kolie> well
[17:33:41] <Deucalion> That's what janrinok was saying was causing issues: mysql being configured with create binary files set to True and writing out to the / dir or somesuch. But don't go by my relayed info, I'm easily confused :D
[17:34:12] <kolie> Well, binary files is, potentially, the mysql database files :)
[17:35:02] <Deucalion> I don't know the details sadly
[17:38:01] <kolie> there are backups of about 3G each taken regularly.
[17:38:05] <kolie> they total 42G right now
[17:41:15] <kolie> anyways, there's plenty of space on the system now.
[17:43:12] <kolie> yea it looks like he set the mysql bin logs to expire after 7 days
[17:43:26] <kolie> So if there is any issue with mysql you have to figure it out in 7 days.
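For reference, a sketch of how that 7-day retention is usually expressed in MySQL 8 configuration - the file path and drop-in location are assumptions, and 604800 seconds is 7 days:

    # hypothetical config fragment; MySQL 8 replaced expire_logs_days with binlog_expire_logs_seconds
    cat > /etc/mysql/conf.d/binlog-expiry.cnf <<'EOF'
    [mysqld]
    binlog_expire_logs_seconds = 604800
    EOF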
[17:43:29] <Deucalion> Perhaps it's those that end up eating all the space down to zero and then need manual culling.
[17:43:40] <kolie> I think what happened, is
[17:43:43] <kolie> The DB has grown
[17:43:50] <kolie> And whatever rotation was on there
[17:43:56] <kolie> wasn't being gotten to, before the disk filled
[17:44:00] <kolie> and other cruft has been added over time too
[17:44:12] <kolie> So, larger backups on smaller space, over time, its full and not rotating.
[17:44:21] <kolie> Looks like he set the rotation back
[17:44:31] <kolie> So all good, and I freed up a fuck ton more.
[17:44:51] <Deucalion> rm -rf / ? :D
[17:45:03] <kolie> You want me to channel the BOFH on that one?
[17:45:12] <kolie> What do you mean, I see *PLENTY* of space...
[17:45:27] <Deucalion> Not much else.... but plenty of space :D
[17:45:39] <kolie> . /dev/sdc 102G 69G 32G 69% /data
[17:46:04] <kolie> prior> . /dev/sdc 102G 90G 12G 89% /data
[17:49:30] <Deucalion> what did you cull?
[17:49:39] <kolie> Large old shit that hasn't been used in 10 years.
[17:49:56] <kolie> Backups older than time itself.
[17:50:08] <kolie> Of machines we could never hope to restore.
[17:50:31] <kolie> Yea I'm looking now, I remember something
[17:50:37] <kolie> there was a cron script someone made
[17:50:37] <Deucalion> Like all the hoarded old tech under my bed - you never know when it might come in handy :D
[17:50:42] <kolie> that cleaned out the bin logs...
[17:50:44] <kolie> let me see if I can find it.
[17:50:57] <kolie> here it is. /usr/local/sbin/cleansql.sh
[17:51:47] <Deucalion> Can't sql just be configured to not keep as many? Or use logrotate? I guess not, as someone went to the lengths of writing a specific script for it
[17:52:16] <kolie> # for whatever reason, someone set sql to save all these huge incremental changes
[17:52:16] <kolie> # in /var/log/mysql. These are typically 100Mb binary files and quickly fill up the disk,
[17:52:16] <kolie> # causing the app to fail. These cannot be cleaned up by the usual logrotate command, as mysql uses its own numbering; these are not really log files.
[17:52:16] <kolie> # Until I find out WHY this is in place I wrote this script to delete all but the last
[17:52:16] <kolie> # 60 of these files to keep the disk from filling up.
[17:52:18] <kolie> # This will run from cron.
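The script body isn't pasted here, but going by those comments it keeps only the newest 60 binlog files; a minimal sketch of that logic, with the path and filename pattern assumed, might be:

    #!/bin/sh
    # keep only the 60 newest mysql binlog files in /var/log/mysql (assumed naming pattern)
    cd /var/log/mysql || exit 1
    ls -1t mysql-bin.[0-9]* 2>/dev/null | tail -n +61 | xargs -r rm -f

(The more usual way to do this is PURGE BINARY LOGS from inside MySQL, which keeps the server's binlog index consistent with what is actually on disk.)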
[17:52:33] <Deucalion> Maybe that cleansql.sh used to be in cron / anacron but got lost from there at some point
[17:53:26] <Deucalion> AHA! See - some nice person left meaningful comments why log rotate won't work :D
[17:53:45] <kolie> that's in the cron script
[17:53:54] <kolie> And the cron script is scheduled daily
[17:54:06] <kolie> . /etc/cron.daily/cleanupsql just found it
[17:54:11] <kolie> The problem wasn't those logs.
[17:54:17] <kolie> They are getting cleaned
[17:54:26] <Deucalion> But if it is keeping a defined 60 files and each file grows over time then it becomes insufficient?
[17:54:27] <kolie> The problem was the DB backups have grown AND those bin logs.
[17:54:35] <Deucalion> Ah
[17:54:36] <kolie> It's just everything at once; they all use space.
[17:54:54] <kolie> It's how DBs work - the bin files are the transactions
[17:55:05] <kolie> To restore a db, you need the last golden point + bins since then
[17:55:26] <kolie> I'm not going to comment if we have the correct bins or a valid checkpoint.
[17:55:35] <kolie> BUT, that's what they exist for.
[17:55:39] <Deucalion> Double the storage size on the linode and we can kick it into the weeds for another decade :D
[17:55:54] <kolie> Typically you flush the db to the FS, checkpoint it, do a VSS, now you have a crash safe backup.
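A hedged sketch of the same "golden point plus bins since then" idea using stock MySQL tools rather than a filesystem snapshot; the backup path, file names, and binlog numbers are hypothetical:

    # 1. full ("golden point") backup: consistent snapshot, rotate to a fresh binlog, record the position
    mysqldump --single-transaction --flush-logs --master-data=2 --all-databases \
        > /data/backup/full-2024-06-17.sql

    # 2. restore later: load the full dump, then replay the binlogs written after it
    mysql < /data/backup/full-2024-06-17.sql
    mysqlbinlog /var/log/mysql/mysql-bin.000123 /var/log/mysql/mysql-bin.000124 | mysql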
[17:56:16] <kolie> I mean yea, with the cost of disk these days, why isn't it a 100000TB drive :)
[17:56:38] <kolie> But hey I'm not running the show :)
[17:56:39] <Deucalion> Sounds complex. The most I ever set up, many many years ago, was a script to run mysqlbackup
[17:57:02] <Deucalion> Who is running the show? I've lost track
[17:57:06] <kolie> it's how a lot of backup systems work with DBs
[17:57:36] <kolie> And just because you have backups doesn't mean it's set up right or error-proof in all corner cases.
[17:57:47] <kolie> That's why these bin files are done that way. MSSQL works identically.
[17:57:54] <kolie> As does AD/VSS.
[17:58:04] <kolie> taking lunch bbiab.
[17:58:10] <Deucalion> Sure. Backing up any in flight system is challenging
[18:09:19] <janrinok> kolie /data isn't the problem on helium. It is / that is filling up. I moved things TO /data to get them off /. audioguy had been doing the same thing.
[18:09:20] <drussell> Whatever machine had the disk space issue, Janrinok said he was only able to free a few % disk space... It was enough that the site came back to life, but he seemed concerned that it would soon fill up again.
[18:09:31] <drussell> Oh, he's here...
[18:09:40] <drussell> Much better to hear from him.
[18:11:43] <janrinok> there are - I have been informed - 2 sets of backups being made. I am going on what I have been told. We make .sql backups. MySql 8 changed a setting so that it now defaults to making huge rollback .bin files which are being kept for a month.
[18:12:40] <janrinok> I have changed the config to 7 days but it will not be recognised until either mysql is restarted, or I can log into MySql as root to delete the old (unwanted) .bin files.
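For what it's worth, MySQL 8 can usually take that retention change and drop the old .bin files without a full restart; a sketch, assuming root access via the mysql client (the 7-day figure mirrors what was just described, the rest is illustrative):

    # apply the 7-day binlog retention now and persist it across restarts
    mysql -u root -p -e "SET PERSIST binlog_expire_logs_seconds = 604800;"
    # drop binlogs older than 7 days immediately rather than waiting for expiry to kick in
    mysql -u root -p -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"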
[18:14:34] <drussell> Is 7 days of transaction log really enough? Is someone really able to somehow verify that the full backups are valid and working at least once a week?
[18:15:03] <drussell> Remember, we lost MONTHS of user data because the last full backup was far too old to be able to apply the available transaction logs to bring the database up to date...
[18:15:31] <drussell> thus the entire database had to be reverted to a many-months-old copy, losing absolutely everything after the last working DB backup
[18:18:16] <janrinok> drussell - we don't use the transactional data as I understand it - we did when we kept 2 servers in sync. I think we only care about the .sql backups but I am trying to grok this from a standing start...
[18:18:47] <Deucalion> You never really know if a backup is valid without using it to do a restore. If we had abundant resources (both people and systems) we could automate scripted restoration of DB backups to test instances and an automated test harness to validate the restored DB.
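A rough sketch of the kind of scripted restore check being described, assuming a throwaway MySQL instance on a non-standard port; the backup path, port, and table name are hypothetical, and a real harness would run app-level tests instead of a single count:

    #!/bin/sh
    # restore the newest dump into a scratch database and run a trivial sanity check
    set -e
    LATEST=$(ls -1t /data/backup/*.sql | head -n 1)           # hypothetical backup location
    mysql -h 127.0.0.1 -P 3307 -u root -e "DROP DATABASE IF EXISTS restore_test; CREATE DATABASE restore_test;"
    mysql -h 127.0.0.1 -P 3307 -u root restore_test < "$LATEST"
    ROWS=$(mysql -h 127.0.0.1 -P 3307 -u root -N -e "SELECT COUNT(*) FROM restore_test.stories;")
    [ "$ROWS" -gt 0 ] && echo "restore OK: $LATEST" || echo "restore FAILED: $LATEST"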
[18:18:50] <janrinok> There is no documentation saying what is backed up, where it is kept, and who is responsible for doing it.
[18:20:49] <janrinok> I tried to copy the dev database but I cannot copy it anywhere except on our own servers. I wanted to load it locally so 1. I could find what the differences were, and 2. try to build a docker container locally so that I could test software.
[18:21:57] <janrinok> I don't seem to be able to copy anything from our servers but perhaps rsync or scp are not the way I should be doing it?
[18:23:10] <Deucalion> hmm... I used to do such for grabbing irc conf files to work with offline when my net connection was hellishly flaky and ssh would die on me mid edit every 2 minutes
[18:23:18] <Deucalion> lemme see if it still works
[18:23:19] <drussell> I don't understand the architecture, so I must defer to you guys' expertise. :) I'm just pointing out potential pitfalls with my (most certainly flawed) understanding of how the system MIGHT work. :)
[18:24:13] <drussell> Thanks, in advance, for all efforts! :)
[18:24:58] <janrinok> I've got no expertise! I have my own servers running here but they are configured the way that it seems almost everyone else does it. Our configuration is unique - but may be entirely justified.
[18:30:47] <Deucalion> There are as many config possibilities as there are number of servers to the power of number of sysadmins
[18:30:58] <Deucalion> squared for good measure :0
[18:38:18] <drussell> +1 Touché
[18:58:23] <kolie> The solution is ditch the current VM asap and forget it existed.
[18:58:39] <kolie> Standing up a new mysql, and switching the system to use it, far easier than whatever else.
[19:00:22] <kolie> Anyways I freed up space on / and /data
[19:01:12] <janrinok> I managed to save just over 10% of the space on helium / over the weekend.
[19:02:29] <kolie> . /dev/sdc 106952116 72313376 33529832 69% /data
[19:02:35] <janrinok> thanks for helping - I only emailed you on Friday in your role of Ops Officer. I hope that I didn't spoil your weekend. (Why do these problems always occur at a weekend?)
[19:02:36] <kolie> . /dev/sda 56769012 25137156 31051904 45% /
[19:03:14] <kolie> Yea it just happened to be a kids weekend :)
[19:03:26] <janrinok> I guessed it might have been :)!
[19:03:29] <kolie> I rarely have them on weekends; seems to be that's the case.
[19:04:13] <kolie> Anyways, there's a ton of space and it looks pretty normal to me across the board.
[19:04:25] <janrinok> that sounds good to me!
[19:04:42] <kolie> you want me to hup mysql?
[19:06:22] <kolie> done.
[19:06:39] <kolie> 4 week uptime prior.
[19:31:49] -!- drussell has quit [Ping timeout: 252 seconds]
[19:32:28] -!- drussell [drussell!~drussell@2604:3d08:a77f:sino:ykrj:snio:qtwr:spir] has joined #staff
[21:19:00] <Fnord666> janrinok: scp seems to be working for me, at least to bery.
[21:30:18] <Fnord666> I was able to scp a file from helium to bery then scp it to my home machine FWIW.
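For what it's worth, newer OpenSSH can do that hop in one step with ProxyJump, which avoids staging the file on bery first; the user names, host names, and file path here are placeholders:

    # copy a file from helium to the local machine, tunnelling through bery in one command
    scp -o ProxyJump=user@bery.soylentnews.org user@helium:/data/backup/full-2024-06-17.sql .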
[21:30:51] <Fnord666> I'm so used to doing things ass backwards at work that extra hops like that are normal for me.
[21:57:05] <chromas> You can't scp directly home?
[21:57:13] <chromas> Do you have ipv6 at home?
[22:45:57] <sylvester> 2024-06-18 00:45:57 mail.soylentnews WARNING PING IPv4 PING WARNING - Packet loss = 0%, RTA = 206.86 ms
[22:48:06] <sylvester> 2024-06-18 00:48:06 mail.soylentnews WARNING PING IPv6 PING WARNING - Packet loss = 0%, RTA = 207.38 ms
[22:48:39] <sylvester> 2024-06-18 00:48:39 soylentnews WARNING PING IPv6 PING WARNING - Packet loss = 0%, RTA = 223.93 ms
[22:53:11] <sylvester> 2024-06-18 00:53:10 mail.soylentnews OK PING IPv6 PING OK - Packet loss = 0%, RTA = 196.43 ms
[23:05:18] <sylvester> 2024-06-18 01:05:18 mail.soylentnews WARNING PING IPv6 PING WARNING - Packet loss = 0%, RTA = 211.38 ms
[23:20:22] <sylvester> 2024-06-18 01:20:22 mail.soylentnews OK PING IPv6 PING OK - Packet loss = 0%, RTA = 193.44 ms
[23:27:30] <sylvester> 2024-06-18 01:27:30 mail.soylentnews WARNING PING IPv6 PING WARNING - Packet loss = 0%, RTA = 200.48 ms
[23:32:35] <sylvester> 2024-06-18 01:32:34 mail.soylentnews OK PING IPv6 PING OK - Packet loss = 0%, RTA = 194.49 ms
[23:46:02] <sylvester> 2024-06-18 01:46:02 mail.soylentnews WARNING PING IPv4 PING WARNING - Packet loss = 0%, RTA = 190.83 ms
[23:48:43] <sylvester> 2024-06-18 01:48:43 soylentnews WARNING PING IPv6 PING WARNING - Packet loss = 0%, RTA = 215.91 ms
[23:59:43] <sylvester> 2024-06-18 01:59:43 mail.soylentnews WARNING PING IPv6 PING WARNING - Packet loss = 0%, RTA = 201.89 ms