#soylent | Logs for 2025-05-09

[00:00:16] <chromas> what they do that's special is having their own index, while almost all other sites use either bing or google
[00:00:20] <chromas> or sometimes yandex
[00:01:31] -!- drussell has quit [Ping timeout: 264 seconds]
[00:16:56] <ted-ious> Is it even possible for a small company to crawl the whole internet?
[00:21:11] <kolie2> Yes I think so.
[00:21:37] <kolie2> Depends what you mean by crawl the whole internet, how do you define that, but yea, possible.
[00:21:52] <kolie2> Cloud resources and it's very parallelized.
[01:04:26] <ted-ious> So you just need an infinite line of credit with amazon?
[01:07:33] <ted-ious> https://umc.edu
[01:07:35] <systemd> ^ COVID-19 vaccination during pregnancy protects babies, research finds
[01:08:03] <ted-ious> That's the first result I get when using their academic search for increased mortality linked to covid-19 vaccination.
[01:08:37] <ted-ious> So there is nothing at all different about this search engine.
[01:12:22] <chromas> build yourself a noice storage array and download common crawl
[01:13:52] <ted-ious> How big is that?
[01:14:44] <chromas> at least 650MB
[01:16:19] <ted-ious> Great I have 651mb available so I'm good to go.
[01:17:53] <chromas> Then you've just gotta build an index and do some ranking and yada yada yada, you've got yourself a search
[01:19:47] <ted-ious> It's about 17.5gb for the last 3 months according to this. https://blog.commoncrawl.org
[01:19:47] <systemd> ^ Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025
[01:20:32] <chromas> Sounds about right
[01:20:59] <ted-ious> Oh wait that's only Host-Level Graph whatever that means.
[01:21:55] <chromas> Maybe connections between hostnames
[01:22:00] <chromas> not all the actual pages
[01:25:17] <ted-ious> Total is 39.3 GiB.
[01:25:28] <ted-ious> If my awk doesn't have bugs in it.
[01:26:29] <ted-ious> This has to be some kind of highly compressed database that only contains URLs and pointers to keywords.
[01:27:49] <chromas> text is pretty compressible
[01:28:15] <ted-ious> If it was even maximum compression zstd page text it would have to be hundreds of thousands or millions of times bigger.
[01:29:22] <ted-ious> Doesn't just twitter generate tb's of new data per day?
[01:30:30] <chromas> it's common crawl; not specific crawl :D
[01:31:36] <ted-ious> Oh ok I was looking at the wrong stuff.
[01:31:41] <ted-ious> Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).
[01:32:28] <ted-ious> https://blog.commoncrawl.org
[01:32:28] <systemd> ^ Common Crawl - Blog - April 2025 Crawl Archive Now Available
[01:35:50] <ted-ious> chromas: So it looks like it's about 130tb compressed if you want to download it on your new fiber.
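A rough sanity check of the figures quoted above, as a minimal Python sketch. The page count and uncompressed size come from the announcement; the 130 TB compressed size is ted-ious's estimate (decimal terabytes assumed here), and the 1 Gbit/s link speed is purely an assumption for illustration, as is the helper name `summarize`.

```python
# Back-of-the-envelope numbers for the April 2025 Common Crawl archive.
PAGES = 2.74e9            # web pages, from the announcement
UNCOMPRESSED_TIB = 468    # TiB of uncompressed content, from the announcement
COMPRESSED_TB = 130       # ted-ious's rough estimate; decimal TB is an assumption
LINK_GBIT_PER_S = 1.0     # assumption: a hypothetical 1 Gbit/s fiber line

TIB = 2**40   # bytes in one tebibyte
TB = 10**12   # bytes in one decimal terabyte


def summarize(pages, uncompressed_tib, compressed_tb, link_gbit_per_s):
    """Return (average page size in KiB, compression ratio, download days)."""
    uncompressed_bytes = uncompressed_tib * TIB
    compressed_bytes = compressed_tb * TB
    avg_page_kib = uncompressed_bytes / pages / 1024
    ratio = uncompressed_bytes / compressed_bytes
    # Ideal transfer time: ignores protocol overhead and server-side limits.
    seconds = compressed_bytes * 8 / (link_gbit_per_s * 1e9)
    return avg_page_kib, ratio, seconds / 86400


if __name__ == "__main__":
    kib, ratio, days = summarize(
        PAGES, UNCOMPRESSED_TIB, COMPRESSED_TB, LINK_GBIT_PER_S
    )
    print(f"average page: {kib:.0f} KiB uncompressed")
    print(f"compression ratio: {ratio:.1f}:1")
    print(f"download time at {LINK_GBIT_PER_S} Gbit/s: {days:.1f} days")
```

Under those assumptions this works out to roughly 183 KiB per page, about 4:1 compression, and around 12 days of continuous downloading; a different link speed or a binary reading of "130tb" shifts the last two figures.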
[04:02:03] <janrinok> No drussell? I'll hang around.
[07:45:51] <Ingar> There's a pope! And he's a citizen !
[07:46:16] <Ingar> (snarky reference to one of the Starship Troopers sequels)
[08:16:04] <janrinok> Not sure if that is good or bad. But as it doesn't affect me directly I will not worry about it.
[12:19:55] <Ingar> "God's back and he's a citizen" from Starship Troopers 3: Marauder, pointing out that the authoritarian regime now supports religion. The new Pope is from the US and Trump isn't opposed to being Pope. Construct-your-joke.
[12:24:09] <janrinok> Yes, sorry, I was saying that I don't know if a US Pope is a good thing or bad. Such a Pope might be influenced by politicians, or the Pres might find out that his desire to create a 'Christian' country is not going to work out the way he wants. The Catholics might not see it that way if the Pope is against it.
[14:05:36] <Ingar> at least they stuck to their theme and elected someone who keeps ignoring the child abuse issues
[14:27:20] -!- jelizondo70 [jelizondo70!~jelizondo@Soylent/Staff/Editor/jelizondo] has joined #soylent
[16:00:14] -!- jelizondo70 has quit [Quit: Leaving]
[16:09:17] <janrinok> and he doesn't like the idea of women being ordained either
[16:09:34] <kolie2> pope++ is here?
[16:09:34] <bender> pope: 1
[16:09:39] <kolie2> bad bot!
[16:09:40] <kolie2> pope--
[16:09:40] <bender> pope: 0
[16:09:47] <janrinok> lol
[16:10:03] <janrinok> yep, a US pope
[16:10:17] <kolie2> I don't have an opinion on the pope persee. I have an opinion on the catholic appartus.
[16:10:21] <kolie2> apparatus
[16:11:18] <kolie2> francis seemed like a good dude ngl.
[16:13:13] <kolie2> One of his real stand out comments to me was, that there is sin, and there is crime, and they don't have to coincide.
[16:13:19] <kolie2> re homosexuality.
[16:21:45] <Ingar> I don't have an opinion on the pope persee. I have an opinion on the catholic appartus. -> the pope is very much part of the apparatus
[16:22:05] <Ingar> ofc, I stopped caring about the Church a loooooong time ago
[16:37:47] <kolie2> The pope certainly has a role in the apparatus.
[16:38:15] <kolie2> And it seems like the new pope is more in line with the view I have of the apparatus than the old.
[16:55:22] -!- drussell [drussell!~drussell@a4627691kd3g1a0z4.wk.shawcable.net] has joined #soylent
[19:09:19] -!- drussell has quit [Ping timeout: 264 seconds]
[20:50:01] -!- drussell [drussell!~drussell@a4627691kd3g1a0z4.wk.shawcable.net] has joined #soylent
[20:50:32] -!- drussell has quit [Client Quit]