#editorial | Logs for 2019-10-06
[06:28:43] <janrinok> what is the issue that you thought you had fixed?
[08:57:29] <chromas> To find the most article-iest part of the page, it looks for the longest runs of text, but that one link has almost no text and lots of spaces
[08:57:53] <chromas> I added a regex to collapse spaces before counting but it's still outputting junk
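A minimal Python sketch of the idea described above: collapse whitespace first, then pick the longest remaining run of text. It assumes the page has already been split into candidate text blocks. The function names are hypothetical, and the bot itself is written in D, so this is an illustration and not its actual code.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def longest_text_block(blocks):
    """Return the candidate whose whitespace-collapsed text is longest."""
    return max((collapse_whitespace(b) for b in blocks), key=len, default="")


candidates = [
    "Home" + " " * 500 + "About",  # a link padded with spaces
    "The actual article text, which runs on for a sentence or two.",
]
best = longest_text_block(candidates)
```

Without the collapse step the padded link would win on raw length; with it, the real sentence does.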
[08:58:12] <janrinok> that sounds like a novel approach. Can you quantify the success rate?
[08:59:20] <chromas> it's done that since upstart started doing subs. Dunno that I could give a number, but it seems to work with most sites
[08:59:37] <janrinok> I'm less subtle. I use libxml to strip out everything except the title, encoding and anything in <p>..</p>
[09:00:17] <chromas> Don't remember which site, but at least one uses <br>s instead of <p>s. I have it replace those with <p>s though
[09:00:30] <janrinok> In the main it seems to be doing better than 90% although some sites don't have any html to parse until you've done something on the page or used js. I simply ignore them.
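The "keep only the title and the paragraphs" approach can be sketched with Python's standard-library `html.parser`. This is an assumption for illustration: janrinok uses libxml, and the class name `ParagraphExtractor` is hypothetical. A `<br>` inside a paragraph is treated as a paragraph break, in the spirit of chromas's replacement trick.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the <title> text and the text of each <p>, ignoring the rest."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def _flush(self):
        # Join buffered text, normalise whitespace, keep non-empty paragraphs.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._flush()
            self._in_p = True
        elif tag == "br" and self._in_p:
            self._flush()  # treat <br> as a paragraph break

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


page = (
    "<html><head><title>Story</title></head><body>"
    "<div class='nav'>Menu</div><p>First para.</p>"
    "<p>Second<br>Third</p><script>var x;</script></body></html>"
)
extractor = ParagraphExtractor()
extractor.feed(page)
extractor.close()
```

Pages that only fill in their content via JavaScript would yield no paragraphs here, which matches the "simply ignore them" policy above.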
[09:01:08] <janrinok> I'm getting over 300 stories a day so I'm not too concerned with any losses
[09:01:12] <chromas> I came across one site that has the entire article inside a json block
[09:01:31] <chromas> besides having it in the html
[09:01:43] <janrinok> They still require a human to decide if they are worth printing or not, but I will have to wait for AI to solve that problem ;)
[09:01:43] <chromas> I guess that makes it easier when js replaces it with another article or something
[09:01:51] <chromas> oh, one just the other day had the whole thing in a meta tag
[09:01:59] <janrinok> ugh
[09:01:59] <chromas> I should take notes
[09:02:41] <chromas> I used to have a pile of articles from different sites to test it on but once it was working on most sites, I quit bothering with it
[09:02:57] <janrinok> I do some url specific processing. For example, phys.org litters its stories with internal tag links to take you to other pages 'of interest'. I strip all of those out.
[09:03:15] <chromas> like the urls with /tags/ ?
[09:03:20] <chromas> #MeToo
[09:03:27] <janrinok> yup
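A small Python sketch of URL-specific cleanup for links whose `href` contains `/tags/`. It assumes double-quoted `href` values and keeps the link text while dropping the link itself; whether either bot keeps or discards the text is not stated in the log. The sample URLs are made up.

```python
import re

# Matches <a ... href=".../tags/...">text</a> and captures the link text.
TAG_LINK = re.compile(
    r'<a\b[^>]*\bhref="[^"]*/tags/[^"]*"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def unwrap_tag_links(html):
    """Replace internal /tags/ links with their plain link text."""
    return TAG_LINK.sub(r"\1", html)


sample = (
    '<p>New <a href="https://phys.org/tags/graphene/">graphene</a> results, '
    'see <a href="https://example.org/paper">the paper</a>.</p>'
)
cleaned = unwrap_tag_links(sample)
```

A regex is enough for a sketch; a real HTML parser is the safer choice for messy markup.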
[09:04:53] <janrinok> I also have some templates that help improve the extraction for specific sites if I have found the time/interest to write them. But I'm just looking at low-hanging fruit at the moment.
[09:05:46] <chromas> I used to strip stuff that for sure shouldn't be article, like class="nav" or whatever, but sometimes articles are in those sections
[09:06:08] <chromas> and other times, the bot thinks it is because the xml thinger's dumb
[09:06:18] <janrinok> #MeToo
[09:07:35] <janrinok> I try to change any <p class=> containing unwanted data to a simple '<p>'
[09:08:02] <chromas> Yeah. I think the only useful attribute would be the href
[09:08:06] <chromas> everything else can go
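The "keep href, drop everything else" rule might look like this in Python. It is a regex-based sketch with a hypothetical function name, assuming double-quoted attribute values that contain no `>` character; it is not taken from either bot.

```python
import re

TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>")  # opening tags only
HREF = re.compile(r'\bhref\s*=\s*"[^"]*"', re.IGNORECASE)


def strip_attributes(html):
    """Remove every attribute from opening tags except href."""

    def clean(match):
        name = match.group(1)
        href = HREF.search(match.group(2))
        return f"<{name} {href.group(0)}>" if href else f"<{name}>"

    return TAG.sub(clean, html)


dirty = (
    '<p class="story-lead" id="x">See '
    '<a class="ext" href="https://example.org" target="_blank">this</a>.</p>'
)
tidy = strip_attributes(dirty)
```

Closing tags are left untouched because the pattern requires a letter straight after `<`.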
[09:09:51] <janrinok> currently Arthur requires a manual submission, but today I am working on automating that as much as possible, bearing in mind that I still have to approve each possible submission as being worth submitting
[09:10:37] <chromas> I thought about doing that. Set up a cronjob to poll the queue and pick some stories to sub if the count is too low
[09:11:23] <chromas> Maybe train a bogofilter db on good vs bad articles, then use that to rate new ones to pick from
[09:12:12] <janrinok> my collection is automatic - every 4 hours - but with 300+ stories that still means about 30 will not have been processed correctly, and over half will probably not be of interest. We do get a lot of dross on our rss-feed
[09:13:22] <janrinok> but I'm writing a simple GUI that displays the story and I can simply press a single key to submit it, or I can edit it if necessary
[09:14:26] <chromas> Does it pull & extract all the articles every 4 hours, or does it just collect headlines?
[09:15:32] <janrinok> It extracts all the stories and saves them in a separate folder for each day. Folders over 7 days old delete automatically. But I only tend to look at today, and perhaps the day before, for stories. I get plenty
[09:16:46] <janrinok> It also formats each story to meet our requirements, and it puts the Arthur header on etc
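The per-day folders with a seven-day expiry could be sketched as below. The folder naming (ISO dates such as `2019-10-06`) and the function names are assumptions; the log does not say how the real collector names or prunes its folders.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path


def day_folder(root, day):
    """Create (if needed) and return the folder for one day's stories."""
    folder = Path(root) / day.isoformat()
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def prune_old_folders(root, today, keep_days=7):
    """Delete day folders more than keep_days old; return their names."""
    cutoff = today - timedelta(days=keep_days)
    removed = []
    for child in Path(root).iterdir():
        if not child.is_dir():
            continue
        try:
            folder_day = date.fromisoformat(child.name)
        except ValueError:
            continue  # not a day folder, so leave it alone
        if folder_day < cutoff:
            shutil.rmtree(child)
            removed.append(child.name)
    return sorted(removed)
```

Run from the same four-hourly job that collects the stories, `prune_old_folders(root, date.today())` would keep the archive to roughly a week.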
[09:19:33] <janrinok> if you look on dev you can see some recent tests of my submissions, I was just making sure that the time limits were being adhered to. The content of the submission is irrelevant
[09:20:45] <chromas> if you give your bot account a subscription the time limit goes away (except for the reskey)
[09:21:09] <janrinok> I have another routine that flashes a warning if the subs or story queues are looking a bit thin - but this tends to happen during my night and so doesn't get actioned.
[09:22:13] <janrinok> I have a subscription, but as I would like to offer the software to anyone who wants it in the community (well, eventually) I cannot guarantee that they will have an account.
[09:23:14] <janrinok> I would love to be able to say that all the excuses about not making submissions are rubbish, if you read a story and you like it, press 1 button and it will do all the work for you.
[09:26:13] <chromas> Need a SN button for sites to put on their pages :)
[09:26:24] <chromas> rehash-it!
[09:26:25] <janrinok> but I'm building it as a series of blocks so that each can function on its own, and can also easily be made to pipe stories from one block to the next
[09:28:36] <janrinok> I might steal that name when I get to that stage - with credit to you of course. I think an add-on for FF, Chrome and whatever else seems popular would be a neat idea. Press it, and the page that you are viewing will be processed and submitted. With suitable checks for goatse, pron etc
[09:29:19] <janrinok> but that is a while into the future currently. My free time is much curtailed nowadays.
[09:29:40] <chromas> just don't misplace the hyphen
[09:30:03] <janrinok> ?
[09:30:09] <chromas> reha-shit
[09:30:27] <janrinok> gotcha - brain not working too well at the moment ;)
[09:30:36] <chromas> maybe that's the downvote button's label
[09:30:46] <chromas> bring back the firehose
[09:30:56] <janrinok> lol
[09:33:23] <chromas> Aw, arthur has a higher acceptance rate than upstart
[09:33:41] * chromas blames Runaway
[09:34:56] <janrinok> I think exec is also using Arthur too, so I'm not surprised.
[09:35:42] <chromas> I don't think exec has an account
[09:35:55] <janrinok> dunno
[09:36:06] <upstart> no karma for you, exec! hah!
[09:37:53] <janrinok> I've not looked at integrating Arthur with IRC. I know that it is on there but I think that was you who did that? I've never written an IRC bot.
[09:39:12] <janrinok> So most of the acceptance of arthur's subs is because I have to pick them and manually submit using Arthur's account.
[09:40:12] <janrinok> Because they are already formatted that is not a difficult task. It just needs 2 or 3 cut and pastes. But that is what I am hoping to automate over the next day or two.
[09:40:37] <chromas> Nah, it was a crutchy and Bytram operation
[09:40:53] <chromas> or maybe crutchy and cmn32480
[09:41:15] <janrinok> got to go and get lunch ready - will be back on in an hour or two. Thanks for the convo - I'm picking up useful tips every time we chat
[09:41:23] <chromas> g'day
[09:41:31] <janrinok> yeah, cmn32480 rings a bell now that you mention it
[09:41:36] <chromas> or g'dafternoon
[09:41:49] <janrinok> laters chromas
[10:11:23] -!- cosurgi [cosurgi!~cosurgi@ddibp10.bl.pg.gda.pl] has joined #editorial
[10:11:54] -!- cosurgi has quit [Changing host]
[10:11:54] -!- cosurgi [cosurgi!~cosurgi@Soylent/Staff/Misc/cosurgi] has joined #editorial
[12:56:57] * Bytram is reading backscroll
[12:56:59] <janrinok> I've already had a good chat with chromas this morning ^^^
[12:57:58] <Bytram> Yes, I was trying to catch up on that already.
[12:58:55] <Bytram> FWIW, as much as I wish I could claim otherwise, I have had no part in bot construction // operations except for using them and, possibly, getting some extra privs wrt exec
[12:59:32] <Bytram> exec was a crutchy thing, which was then handed over to cmn32480 who had a more stable box for it to run on.
[12:59:34] <janrinok> shhh - it makes you sound important when we spin these stories
[13:00:04] <Bytram> I have enough trouble keeping up with myself, let's not set undue expectations, okay?
[13:00:09] * janrinok wonders if that word should have been impotent - it was a bad line....
[13:00:10] <Bytram> =)
[13:00:27] <Bytram> I'll give it a pass for now
[13:01:21] <janrinok> are we upgrading or has it all gone quiet?
[13:02:08] <Bytram> Dunno, put a query out to NCommander just a short while ago; have not yet heard back.
[13:03:57] <Bytram> oh, while you are here... do you remember who it was that said something about the problem with defending free speech is that you have to protect the rights of (???), because if you fail to do that, then eventually everyone's right will be reduced? VERY much paraphrased?
[13:04:18] <Bytram> I *think* I actually referred to it a while back in a story?
[13:04:24] <janrinok> there is scant information this time - we normally have a rough timing for each system....
[13:04:28] <Bytram> Famous name, which also escapes me atm
[13:05:06] <Bytram> I want to say Mencken(sp?) but not at all certain
[13:05:08] <janrinok> by name? nope, but I know a Mr Google who does
[13:05:30] <Bytram> yes, but I can't remember enough of the specific phrasing to find it, yet.
[13:05:42] <janrinok> Meccken H L
[13:05:51] <Bytram> Thought I'd ask Mr. J. while he was here
[13:05:53] <janrinok> Mencken
[13:06:11] <Bytram> yes, okay on the name, but I'm still trying to find the actual quote
[13:06:27] <Bytram> =g Mencken protect freedom speech
[13:06:28] <upstart> https://en.wikiquote.org - H. L. Mencken - Wikiquote
[13:06:31] <Bytram> lol
[13:06:43] <janrinok> The trouble with fighting for human freedom is that one spends most of one's time defending scoundrels. For it is against scoundrels that oppressive laws are first aimed, and oppression must be stopped at the beginning if it is to be stopped at all.
[13:07:43] <Bytram> YES! That's the one! Was that from Mencken?
[13:08:17] <Bytram> =g "The trouble with fighting for human freedom is that one spends most of one's time defending scoundrels"
[13:08:18] <upstart> http://www.quotationspage.com - Quote Details: H. L. Mencken: The trouble with fighting... - The ...
[13:08:25] <Bytram> apparently!
[13:08:32] <Bytram> janrinok++ Thank You!
[13:08:32] <Bender> karma - janrinok: 70
[13:08:38] <janrinok> well that is where I copied it from :)
[13:09:37] <janrinok> time for tea - in fact, I am reliably informed that I am 9 minutes late for tea!
[13:09:45] <janrinok> brb
[13:10:21] <Bytram> don't forget the biscuits!
[13:10:38] <Bytram> =g "Grand Day Out"
[13:10:38] <upstart> https://en.wikipedia.org - A Grand Day Out - Wikipedia
[13:11:34] <janrinok> S has a piece of chocolate on a Sunday - it is a special treat for her!
[13:11:47] <Bytram> YUM!
[13:12:00] <janrinok> I might have a biscuit but I'm not really hungry and don't _need_ to eat anything
[13:12:34] <Bytram> Oh! And thanks SO much for all the stories you pushed out the past few days while I was too busy with work!
[13:14:08] <janrinok> np - I'm working on Storybot5 and SNAPI so it makes sense to push some out
[13:15:12] <janrinok> As I was saying to chromas (^) I can automate everything except for deciding which stories are good and which are not
[13:15:18] <Bytram> ahh, yes. The copy of snapi.py you sent me and that I downloaded uses... *spaces* for indentation. I somehow thought that Python *required* tabs? Was I misinformed?
[13:16:06] <janrinok> no, any white space can be used providing you use the same combination every time. Most people convert tabs to spaces.
[13:16:21] <Bytram> Really? Kewel!
[13:16:46] <Bytram> Is there a consensus on # spaces per indentation?
[13:16:49] <janrinok> so I press tabs, but it inserts spaces, and when I backspace it goes back 4 spaces or whatever I've set the tab/space to be
[13:16:58] <Bytram> 4-spaces per tab? 8 spaces per tab?
[13:17:12] <Bytram> or do people not really care?
[13:18:02] <janrinok> tabs to spaces, 4 spaces, line length of 79 are the recommended settings and several Python IDEs will raise warnings if you do not stick to them. But, nothing is set in stone if you have a good reason to break the rules
[13:18:49] <janrinok> but if you stick to the standards you can use anyone else's code and they can use yours
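A short Python 3 check of the indentation rule discussed above: spaces work, tabs work, but mixing them inside one block in a way that depends on the tab width is rejected with `TabError`. The helper name `compiles` is hypothetical.

```python
def compiles(source):
    """Return True if the source compiles, False if indentation mixes badly."""
    try:
        compile(source, "<example>", "exec")
        return True
    except TabError:
        return False


spaces_only = "if True:\n    x = 1\n    y = 2\n"
tabs_only = "if True:\n\tx = 1\n\ty = 2\n"
mixed = "if True:\n\tx = 1\n        y = 2\n"  # a tab, then eight spaces

results = {
    "spaces": compiles(spaces_only),
    "tabs": compiles(tabs_only),
    "mixed": compiles(mixed),
}
```

Letting the editor turn each Tab key press into four spaces, as described above, sidesteps the problem entirely.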
[13:18:51] <Bytram> Okay, so things would still look okay on an IBM 3270 green screen terminal... or a punch-card!
[13:19:08] <janrinok> exactly - we haven't all got 4k monitors
[13:19:39] <Bytram> Do you know, offhand, if there is a limit on the length of the name of an identifier?
[13:19:55] <janrinok> everything also prints well on standard size paper etc too
[13:20:08] <Bytram> nod nod
[13:20:45] <janrinok> no limit but the compiler might only use the first 32 characters or whatever it chooses. It must also deconflict longer names if it tries to do this.
[13:20:54] <Bytram> abcdefghijklmnopqrstuvwxyz3 = abcdefghijklmnopqrstuvwxyz1 + abcdefghijklmnopqrstuvwxyz2
[13:21:44] * Bytram has used languages which *permitted* longer names, but neither warned nor deconflicted names > limit (which, IIRC, was something like 8 or maybe 16 chars)
[13:21:45] <janrinok> yeah, I know it is interpreted, but pycode is only created if the source file has been changed, so it doesn't get interpreted every time it is run
[13:21:55] <Bytram> nod nod
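For reference, the cached bytecode mentioned above lives in a `__pycache__` folder next to the source. Python writes it for imported modules and rebuilds it when the source changes; a script run directly is compiled afresh each time. The sketch below only asks the standard library where the cache file for a module would go, and the file name `snapi.py` is used purely as an example.

```python
import importlib.util

# Path where Python would cache the compiled bytecode for snapi.py.
cache_path = importlib.util.cache_from_source("snapi.py")
```

The exact file name includes the interpreter version, so it differs between Python releases.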
[13:22:21] <janrinok> I've never experienced any limits in Python. variable names can be as long as you wish
[13:23:19] <Bytram> good to know
[13:23:35] <janrinok> As the limit to a line length is 79 characters (self-imposed) there is no point in having identifiers longer than this anyway
[13:24:25] <janrinok> By the way, I have a much improved SNAPI python file - much better than the one I sent you the other day - if you are interested
[13:24:30] <Bytram> The_quick_brown_fox_jumped_over_the_lazy_dog_or_did_he = TRUE
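A quick Python illustration of the point about identifier length. Note that the boolean constant is spelled `True` in Python; `TRUE` would be an undefined name. The 300-character name is built at run time only to keep the example short.

```python
# A long identifier is fine; the language sets no length limit.
The_quick_brown_fox_jumped_over_the_lazy_dog_or_did_he = True

# Build a 300-character name and assign to it in a scratch namespace.
very_long_name = "x" * 300
namespace = {}
exec(f"{very_long_name} = 42", namespace)
value = namespace[very_long_name]
```

Readability, not the interpreter, is what limits name length in practice.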
[13:24:37] <Bytram> pls
[13:26:02] <Bytram> coffee++ break time
[13:26:02] <Bender> karma - coffee: 99
[13:26:04] <Bytram> biab
[14:35:57] <janrinok> I've checked the documentation of my IDE and it uses the first 128 characters of an identifier! I would call anything longer than that a 'book'!
[14:38:35] <Bytram> Yes, I suspect that 128 characters should be sufficient to come up with a unique identifier, even with programmatically-generated code!
[14:38:42] <Bytram> thanks for looking that up!
[14:39:21] <janrinok> https://www.python.org - Style guide for Python
[14:39:22] <upstart> ^ PEP 8 -- Style Guide for Python Code
[14:39:22] <exec> └─ PEP 8 -- Style Guide for Python Code | Python.org
[14:39:57] <Bytram> clicky
[14:43:40] <Bytram> Hmm, I would like to think that someone has come up with a "standard" pretty-printer which takes in source code and performs indentation based on syntax?
[14:44:17] <janrinok> oh they have - but the link is handy if you are having trouble sleeping....
[14:44:30] <Bytram> already taking effect!
[14:46:00] <janrinok> and there are numerous good python videos on youtube. Look for the Python conferences and pick from there. Ignore any from unknown 'experts' - sometimes they are just wrong and, until you know the difference, it is easy to get suckered in
[14:46:42] * Bytram can generally read -- and understand -- far more quickly with text than with video+audio
[14:46:55] <janrinok> PyCons - https://www.youtube.com
[14:46:56] <upstart> ^ PyCon - YouTube
[14:46:56] <exec> └─ PyCon - YouTube
[14:47:59] <janrinok> understood - which is why I have a mass of books, but the videos are good as a reinforcement tool and can be enlightening in a way that books might not be.
[14:48:56] <Bytram> we think alike on that one. There *are* cases where a video can explicate some *action* better than just plain-old-text, but in my experience, that tends to be the exception
[14:48:58] <janrinok> If you are thinking of trying Python, use python 3 not the earlier 2. Python 2 becomes unsupported next year (after nearly 20 years!)
[14:50:20] <Bytram> Is part of the reason I held off on learning Python originally. When I first looked, there was a raging debate between using V2 with its large and solid foundation, or V3 with its (at the time) limited use and support.
[14:50:20] <janrinok> Names in videos to look for: Ned Batchelder and Raymond Hettinger, both are, in my view, very good instructors
[14:50:31] <Bytram> Good to know that v3 is stable and recommended, now.
[14:50:41] <Bytram> got it, tx!
[14:51:48] <Bytram> so far, things look mostly reasonable.
[14:52:02] <janrinok> what I particularly like is that python is 'batteries included', i.e. there is a library for everything you would probably want to do and they are all consistent
[14:52:11] <Bytram> good++
[14:52:11] <Bender> karma - good: 1
[14:52:52] <Bytram> Here's one thing that I regularly do and I have not, as yet, seen addressed and suspect it will be seen as an anti-pattern if I use it.
[14:53:23] <janrinok> search the internet for Python Module Of The Week - in fact I've done it for you = https://pymotw.com
[14:53:24] <upstart> ^ Python 3 Module of the Week — PyMOTW 3
[14:53:25] <exec> └─ Python 3 Module of the Week — PyMOTW 3
[14:53:28] <Bytram> Here is an example from the spec you originally linked (pep-0008)
[14:53:30] <Bytram> from myclass import MyClass
[14:53:30] <Bytram> from foo.bar.yourclass import YourClass
[14:53:54] <Bytram> where I would prefer to add spaces so that the keyword "import" was vertically-aligned
[14:54:13] <Bytram> so as to make clearer (to me) what the differences in the class names are
[14:54:14] <janrinok> that works
[14:54:29] <Bytram> good!
[14:55:09] <Bytram> physical structure can more quickly convey (to me) differences in operands that may not be as readily apparent given just stream-of-text
[14:55:23] <janrinok> but you will probably find that after a while it just makes sense to do it like everyone else, but there is no right or wrong. The whitespace is only critical at the start of a line for showing blocks of code.
[14:55:35] <janrinok> After that you can do anything you wish
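The aligned-import style Bytram asks about is valid Python, since extra spaces between tokens inside a line are ignored. The modules below are standard-library stand-ins for the `myclass` and `foo.bar.yourclass` examples, which are not real packages.

```python
# Extra spaces before "import" line the keyword up vertically.
from os.path     import join
from collections import OrderedDict

path = join("stories", "2019-10-06")
order = OrderedDict(first=1, second=2)
```

Style checkers that follow PEP 8 may warn about the extra spaces, which is one reason people tend to drift back to the conventional layout.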
[14:55:51] <Bytram> nod nod
[14:56:43] <Bytram> Well, think I'll put the rest of the reading on hold while I work on getting an updated schedule generated for my dept.
[14:57:11] <janrinok> if you get to the stage where you want to use an IDE - look at PyCharm Community Edition. Free to download and use and it is very popular. Handles all the Git pushes, pulls and commits, easy to use etc
[14:57:48] <janrinok> got to go and cook dinner anyway. Prolly be back on later but, if our paths don't cross, have a good Sunday!
[14:58:31] <Bytram> I wrote code to take disparate schedule files (employees from possibly different areas, store hours, mall hours, etc.) and meld it into a single schedule with graphical depictions of when each shift starts and ends, hours worked, whether the employee gets a break or not, how long the meal break (if any) is, etc.
[14:59:05] <janrinok> sounds great - was it in awk and sed? ;)
[14:59:18] <Bytram> much obliged for the info! Saved me a lot of work trying to track things down and attempting to assess the "reputation" of each!
[14:59:24] <Bytram> mostly
[14:59:27] <janrinok> lol
[14:59:51] <janrinok> gtg - maybe laters
[15:00:29] <Bytram> There's about a dozen ancillary programs that are used for date transformations, file locating, and a bunch of other stuff; figure it would be an excellent challenge to port to python
[15:00:37] <Bytram> ciao (chow) for now!
[15:06:12] * Bytram pauses his reading of ( https://www.python.org ) at "Comments"
[15:06:13] <exec> └─ PEP 8 -- Style Guide for Python Code | Python.org
[15:06:18] <Bytram> lol
[15:42:52] <Bytram> =g "BL Lac " black hole
[15:42:52] <upstart> https://arxiv.org - Dynamical Black Hole Masses of BL Lac Objects from the Sloan ...
[16:42:28] -!- MrPlow has quit [Remote host closed the connection]
[16:43:01] -!- MrPlow [MrPlow!MrPlow@Soylent/BotArmy] has joined #editorial
[16:43:01] -!- MrPlow has quit [Changing host]
[16:43:01] -!- MrPlow [MrPlow!MrPlow@nsa.gov] has joined #editorial
[16:45:20] <carny> =submit http://feedproxy.google.com
[16:45:22] <exec> └─ Strong eruption at Sheveluch volcano with ash up to 10 km (34 000 feet), pyroclastic flow produced, Russia
[16:45:22] <upstart> Submitting "Strong eruption at Sheveluch volcano with ash up to 10 km (34 000 feet), pyroclastic flow produced, Russia"...
[16:45:44] <upstart> ✓ Sub-ccess! "Strong Eruption at Sheveluch Volcano With Ash Up to 10 Km (34 000 Feet), Pyroclastic Flow Produced, " (29 paragraphs) -> https://soylentnews.org
[20:18:29] <chromas> das better
[20:18:51] <chromas> I had the regex wrong. [[:space:]] doesn't work in D
[20:19:17] -!- upstart has quit [Remote host closed the connection]
[20:19:24] -!- upstart [upstart!~systemd@0::1] has joined #editorial
[20:19:56] <chromas> So now the bot handles Bytram's link more betterer: https://www.zdnet.com
[21:14:03] <carny> =submit https://archive.is from the all-phones-are-surveillance-devices dept.
[21:14:05] <exec> └─ Millions Of Android Phones Are Vulnerable To Israeli Surveillance Dealer Attack, Google Warns
[21:14:07] <upstart> Submitting "Millions Of Android Phones Are Vulnerable To Israeli Surveillance Dea…"...( 1 modified urls; http://archive.is )
[21:14:29] <upstart> ✓ Sub-ccess! "Millions Of Android Phones Are Vulnerable To Israeli Surveillance Dea…" (1 paragraphs) -> https://soylentnews.org
[22:02:24] <Bytram> chromas++ much obliged
[22:02:24] <Bender> karma - chromas: 95
[22:03:47] <Bytram> Have noticed something a couple times in recent days, lack of a space following an emphasized (italic) phrase; example: the <em>Los Angeles Times</em>reports
[22:04:00] <Bytram> expected a space following the "</em>"
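One possible post-processing fix for the missing space, sketched in Python with a hypothetical function name. It inserts a space only when a closing `</em>` or `</i>` is followed directly by a letter or digit, so punctuation after the tag is left alone. It would also split deliberate mid-word emphasis, so it is a heuristic and not a safe general rule.

```python
import re

# A closing emphasis tag immediately followed by a word character.
MISSING_SPACE = re.compile(r"(</(?:em|i)>)(?=\w)")


def add_space_after_emphasis(html):
    """Insert the space that went missing after </em> or </i>."""
    return MISSING_SPACE.sub(r"\1 ", html)


fixed = add_space_after_emphasis("the <em>Los Angeles Times</em>reports")
```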
[22:04:07] <Bytram> Fnord666: !!
[22:06:11] * Deucalion watches Bytram implode over a missing space
[22:06:34] <Bytram> is better than watching me explode over an extra space! ;)
[22:08:13] * Deucalion feeds Bytram lasagne and garlic bread.... waits for the carb sleepies to kick in and put Bytram to bed :D
[22:08:21] <Bytram> lol
[22:08:24] <Bytram> too late
[22:09:24] <Bytram> seasonal allergies are starting to return, so made up a big bowl of rice and chili with a huge sprinkling of extra chili powder to help clear the sinuses
[22:13:16] <Bytram> http://feedproxy.google.com
[22:13:18] <upstart> ^ Hayabusa2 spacecraft deploys its last rover to asteroid Ryugu ( https://watchers.news )
[22:13:18] <exec> └─ Hayabusa2 spacecraft deploys its last rover to asteroid Ryugu
[22:51:35] <Bytram> =g RNA
[22:51:37] <upstart> https://en.wikipedia.org - RNA - Wikipedia
[22:51:40] <Bytram> =g DNA
[22:51:41] <upstart> https://ghr.nlm.nih.gov - What is DNA? - Genetics Home Reference - NIH
[22:51:47] <Bytram> =g DNA wikipedia
[22:51:48] <upstart> https://en.wikipedia.org - DNA - Wikipedia