#editorial | Logs for 2015-06-28
[01:39:48] cmn32480|afk is now known as cmn32480
[03:25:42] cmn32480 is now known as cmn32480|away
[10:42:05] -!- aqu4 has quit [Read error: Connection reset by peer]
[10:42:05] -!- Subsentient has quit [Read error: Connection reset by peer]
[12:31:49] -!- janrinok [janrinok!~janrinok@Soylent/Staff/Editor/janrinok] has joined #editorial
[12:31:49] -!- mode/#editorial [+v janrinok] by SkyNet
[14:39:43] <takyon> hi
[14:40:36] <takyon> what did you think of those keywords janrinok
[14:53:49] <takyon> breaking news
[14:55:41] <janrinok> takyon: I've 2nd'ed the breaking story - try not to release your own subs if at all possible, but no harm done in this case.
[14:56:37] <janrinok> Do you want a copy of my list of keywords, which now includes yours also? Thanks very much for that. The pseudo-code was a bit too basic, but I have written that module in Python now and I'm happy with it.
[14:58:26] <takyon> haha it wasn't even pseudocode
[14:58:38] <takyon> it was the first python I've written, and absolutely not tested
[14:58:47] <janrinok> lol - np
[14:58:52] <takyon> sure reply back
[14:59:31] <takyon> have you tested the keywords on a variety of articles?
[14:59:37] <janrinok> the quick answer to your question is that I convert the story to XML, and then use XPath to extract the relevant parts. Each site requires its own template, but they are all very similar
[15:00:19] <takyon> my solution simply grabbed the innertext of the webpage
[15:00:41] <takyon> which would include navigation elements which skew the results, but with my set of keywords and topics it still worked 90% of the time
[15:00:45] <janrinok> It has varying success rates. Some stories fit easily into 1 or 2 topics, but searching for keywords can result in 8-10 topics being identified, each with a single keyword
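[Editor's note: a minimal sketch in Python of the keyword-tally approach takyon describes above; the topic names, keyword lists, and threshold are illustrative assumptions, not the actual lists discussed.]

    import re

    # Illustrative topic/keyword lists; the real sets are not shown in the log.
    TOPIC_KEYWORDS = {
        "gaming":   ["console", "playstation", "xbox", "steam"],
        "business": ["merger", "buyout", "acquisition", "earnings"],
        "software": ["compiler", "kernel", "open source"],
    }

    def tally_topics(page_text, threshold=2):
        """Count keyword hits per topic in the page's inner text and keep
        any topic scoring at or above the threshold."""
        lowered = page_text.lower()
        scores = {}
        for topic, words in TOPIC_KEYWORDS.items():
            hits = sum(len(re.findall(r"\b" + re.escape(w) + r"\b", lowered))
                       for w in words)
            if hits >= threshold:
                scores[topic] = hits
        return scores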
[15:01:19] <takyon> like I said there's a lot of overlap
[15:01:47] <takyon> it's easy to tag a single article with techonomics, software, hardware, digital liberty
[15:02:05] <takyon> that's why this may be better for nexuses with a clear distinction
[15:02:08] <takyon> like "gaming"
[15:02:16] <janrinok> our solutions are very similar, but mine can be fine-tuned. For example, some sites use <p> for both the story and other elements; I can use XPath to try to filter the unwanted bits out. Still not 100%, and having similar success to you
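[Editor's note: a sketch of the per-site XPath template approach janrinok describes, assuming the lxml library; the domains and XPath expressions below are hypothetical examples, not his actual templates.]

    from lxml import html

    # Hypothetical per-site templates: an XPath that selects story paragraphs
    # while filtering out <p> tags used for navigation, bylines, etc.
    SITE_TEMPLATES = {
        "example-news.com": "//div[@class='article-body']//p",
        "example-tech.org": "//article//p[not(ancestor::nav) and not(@class='byline')]",
    }

    def extract_story(domain, page_source):
        """Convert the page to a tree and extract the story text using the
        site-specific template."""
        tree = html.fromstring(page_source)
        paragraphs = tree.xpath(SITE_TEMPLATES[domain])
        return "\n".join(p.text_content().strip() for p in paragraphs)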
[15:05:22] <janrinok> email on its way to you
[15:09:03] <janrinok> some sites require multiple templates - inconsistency or perhaps different sub-domains - but the algo selects all possible templates and tries them each in turn. No significant time penalty.
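[Editor's note: the try-each-template-in-turn logic might look like this sketch, continuing the lxml example above; the 'looks like a story' length cutoff is an assumption.]

    def extract_with_fallback(tree, templates):
        """Try each candidate XPath template in turn and return the first
        extraction that looks like a real story; the per-attempt cost is
        tiny, so there is no significant time penalty."""
        for xpath in templates:
            text = "\n".join(p.text_content().strip() for p in tree.xpath(xpath))
            if len(text) > 200:   # assumed cutoff for 'non-trivial' text
                return text
        return None               # no template matched; leave for a human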
[15:10:50] <takyon> i wouldn't think there would be
[15:10:50] <janrinok> keywords can be entered via any editor with nothing more than a basic set of formatting rules. The software sorts it out, and then prints the keyword list in a clean layout should I wish to improve the layout of the saved keyword file.
[15:11:28] <takyon> scanning some article text is dead simple for today's computers. it's accessing the article over the net that introduces the latency
[15:11:56] <janrinok> scores default to 1 if no score is specified, but any score (+ or -) can be specified.
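[Editor's note: a sketch of parsing such a keyword file; the one-keyword-per-line format with an optional trailing score is an assumption based on the description, not the actual layout.]

    def parse_keyword_line(line):
        """Parse a line like 'merger', 'bitcoin +3', or 'advert -2'.
        Scores default to 1 when none is specified; signed scores allowed."""
        parts = line.strip().rsplit(None, 1)
        if len(parts) == 2:
            word, tail = parts
            try:
                return word, int(tail)   # explicit score such as '+3' or '-2'
            except ValueError:
                pass                     # last token is part of the keyword
        return line.strip(), 1           # no score given: default of 1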
[15:12:20] <janrinok> the latency isn't a big problem - it does about 500 stories in 3 minutes or so
[15:13:25] <janrinok> the problem is still that it cannot differentiate between good and bad stories. It will still require an editor to sort the wheat from the chaff, and of course you get the full story, which still needs cutting down. But I'm working on that...
[15:14:22] <janrinok> I use TMB's API to auto submit the collected stories. There are limits imposed by the site on just how fast one can submit but it isn't really a practical problem.
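[Editor's note: the auto-submission loop might look like this sketch; the endpoint URL and parameter names are hypothetical placeholders, since the log does not describe the actual API.]

    import time
    import requests

    SUBMIT_URL = "https://soylentnews.org/api"   # hypothetical endpoint

    def submit_stories(stories, api_key, delay=60):
        """Submit each collected story via the site API, sleeping between
        calls to respect the site's limit on submission rate."""
        for story in stories:
            requests.post(SUBMIT_URL, data={
                "op": "submit",            # hypothetical parameter names
                "key": api_key,
                "subject": story["title"],
                "body": story["text"],
            })
            time.sleep(delay)              # throttle to stay under the limit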
[15:16:38] <janrinok> I don't upload stories during development because it would simply flood the queue. The average weekday story count from rss-bot is between 300 and 500 stories, 200 or so on a Saturday, and <100 on a Sunday
[15:17:54] <takyon> source discrimination is probably a good start to determining what is good and bad
[15:18:07] <janrinok> Gaming stories are pretty easy to identify - there is a definite set of keywords that are very useful. Business is also a good example.
[15:18:28] <takyon> yeah my business keywords fit right in
[15:18:36] <janrinok> I know what you mean - but some of our best scoops have come from sites that also produce a fair amount of crap too
[15:18:38] <takyon> a lot of our business stories are mergers and buyouts anyway
[15:19:09] <takyon> does there need to be a mechanism to prevent it from submitting dupes of the bot's own subs or stories that have been submitted to soylent by humans?
[15:20:51] <janrinok> yes, it is a problem, but we hope to avoid the clash with other subs by keeping them separate. If we don't, we will discourage the community from making subs, and it will give a false impression of how many stories we have to choose from. Perhaps only 25% of a day's collection is worth any effort at all
[15:22:31] <janrinok> you can get several versions of the same story - even from the same source - during a day, and it is not easy to decide which one to run with. The latest doesn't always improve on earlier subs, and if we collect in real time - which is one of the options - then we simply collect stories and add them to the hidden queue as they appear. They will be auto-deleted when they are, say, 48 hours old.
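[Editor's note: a minimal sketch of the 48-hour expiry and same-story deduplication just described; the entry fields ('url', 'collected_at') are assumptions for illustration.]

    import time

    MAX_AGE = 48 * 3600   # auto-delete entries older than 48 hours

    def prune_queue(queue, now=None):
        """Drop hidden-queue entries past the 48-hour cutoff."""
        now = now or time.time()
        return [s for s in queue if now - s["collected_at"] < MAX_AGE]

    def deduplicate(queue):
        """Keep only the newest collected version of each story, keyed by
        URL; note this does not decide which version is editorially best."""
        latest = {}
        for s in sorted(queue, key=lambda s: s["collected_at"]):
            latest[s["url"]] = s
        return list(latest.values())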
[15:23:16] <takyon> are any of the devs on board with making you a special bot queue for your bots? if not, you could do something where you click a bot story on your local machine, and then the soylent submission page is autofilled for you with the contents
[15:23:58] <janrinok> yep - I've spoken with paul and he can give us a hidden nexus - we can always move stories to other nexuses if we need to do so
[15:24:56] <janrinok> the API lets you make a full submission without having to use a browser at all, so it is easy to select a story and tell the prog to upload it
[15:25:22] <takyon> so you would have the bot publish stories to a nexus automatically, with all of them set not to display, and pick and move some out of there?
[15:26:56] <janrinok> My ideal would be to get the program to function without any human intervention during the collection phase. Then I can port it to Perl and include it in our overall set of software, or I can run it as a bot using, say, a Raspberry Pi 2. Insignificant power requirements and more than enough computing power.
[15:28:01] <janrinok> The aim of the first release is to extract a story from each of the 'approved' sources that appears on rss-bot, and submit it to our hidden queue automatically.
[15:29:00] <janrinok> I think the level of processing required to identify good from bad is for a later version, although it is not that far away from being achievable in a basic form.
[15:29:06] <takyon> 500 stories in 3 minutes seems impossibly fast.
[15:29:28] <takyon> unless you have a good internet connection
[15:29:59] <janrinok> well, the program can also scrape the logs of rss-bot. Looking at yesterday's log, downloading each story and processing it doesn't take long at all.
[15:30:15] <janrinok> my down speed is 1.5 Mb/s
[15:30:46] <takyon> you need anything else from me? more keywords?
[15:30:47] <janrinok> but the html for a story is only a few k - I don't bother with images or any other crap
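[Editor's note: a sketch of the collection phase that scrapes rss-bot's logs; the log line format assumed here is a guess, and only the page HTML is fetched, never images or other assets.]

    import re
    import requests

    URL_RE = re.compile(r"https?://\S+")

    def collect_from_log(log_path):
        """Pull story URLs out of an rss-bot log and download the HTML
        only; each story is typically just a few kB this way."""
        with open(log_path) as fh:
            urls = {m.group(0) for line in fh for m in URL_RE.finditer(line)}
        pages = {}
        for url in urls:
            try:
                pages[url] = requests.get(url, timeout=10).text
            except requests.RequestException:
                pass   # unreachable source; skip and move on
        return pages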
[15:31:24] <janrinok> if you have any new keywords that aren't on my list, or if you just want to send me your current list from time to time it would be appreciated.
[15:31:55] <takyon> I don't update my list too often since it basically works
[15:32:05] <janrinok> also, if you have any bright solutions to the problems that we are both looking at I would, of course, be very interested.
[15:32:35] <takyon> as for the list I sent you, it might be worth putting more into the "news" topic to match stories we like to label simply "news". Like political stories.
[15:32:36] <janrinok> How often would you like me to provide you with my updated keyword list?
[15:32:47] * janrinok feels another cron job coming on!
[15:33:11] <takyon> not too often. let me check the email you sent me first and see what it looks like
[15:34:26] <takyon> first comment is that my keywords were certainly regular expressions
[15:34:35] <janrinok> gtg for a while - time to cook our evening meal. I'll be back on later.
[15:34:43] <takyon> and a couple of them contained a '.' (the single-character wildcard in JavaScript regexes)
[15:34:50] <takyon> yeah later, I'll probably leave before then
[15:35:01] <janrinok> mine are converted to regexes by the software. I just want to type text
[15:35:53] <janrinok> for example, upper and lower case, preceding spaces, etc. are all taken care of by the software.
[15:36:15] <janrinok> line-wraps also
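[Editor's note: a sketch of turning a plain typed keyword into a forgiving regex, covering the case, leading-space, and line-wrap handling janrinok mentions; the exact rules his software applies are not in the log.]

    import re

    def keyword_to_regex(keyword):
        """Compile a plain-text keyword into a case-insensitive pattern;
        internal spaces match any whitespace run, so wrapped lines still hit."""
        parts = [re.escape(p) for p in keyword.split()]
        pattern = r"\s+".join(parts)
        return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)

    # e.g. keyword_to_regex("net neutrality") also matches "Net\nNeutrality"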
[15:36:53] <takyon> well with mine I had specific reasons for which ones should include word boundaries and where
[15:37:03] <takyon> which I won't go into. some of it's probably obvious
[15:37:09] <takyon> whatever works for you though
[15:37:25] <takyon> it's definitely got a cleaner look at least
[15:37:35] <janrinok> I want a wordlist that anyone can update, not only someone who knows a specific language or understands regexes. Several eds have no programming knowledge at all
[15:38:38] <takyon> understandable
[15:38:54] <janrinok> if a new buzzword or company name appears, I would like there to be a way for it to be added to the keyword list by any ed.
[15:39:44] <janrinok> anyway, gtg. laters perhaps
[15:39:54] <takyon> although I'd like to add more to your list, I'd want to do my own testing to make sure articles are being tagged correctly, and I don't care about topics as much as nexuses
[15:39:57] <takyon> see you
[15:40:03] janrinok is now known as janrinok|afk
[15:40:46] <janrinok|afk> I'm looking at nexuses too, but as we don't have them set up for all subjects yet I'm playing with topics - what is in a title?
[15:51:33] <takyon> huh?
[17:47:31] janrinok|afk is now known as janrinok
[19:33:28] -!- janrinok has quit [Quit: leaving]
[19:57:16] -!- Tachyon [Tachyon!~Tachyon@xuco.me] has joined #editorial
[19:58:53] -!- Tachyon_ has quit [Ping timeout: 264 seconds]
[21:08:57] -!- Subsentient [Subsentient!~WhiteRat@Soylent/Staff/Editor/Subsentient] has joined #editorial
[21:08:57] -!- mode/#editorial [+v Subsentient] by SkyNet
[21:09:20] -!- aqu4 [aqu4!~aqu4bot@universe2.us/ircbot/aqu4] has joined #editorial
[22:20:37] -!- aqu4 has quit [Read error: No route to host]
[22:20:37] -!- Subsentient has quit [Read error: No route to host]
[22:31:15] -!- aqu4 [aqu4!~aqu4bot@universe2.us/ircbot/aqu4] has joined #editorial
[22:33:07] -!- Subsentient [Subsentient!~WhiteRat@Soylent/Staff/Editor/Subsentient] has joined #editorial
[22:33:07] -!- mode/#editorial [+v Subsentient] by SkyNet