#dev | Logs for 2018-12-06
« return
[00:47:34] <fyngyrz> I'm... stunned
[00:48:03] <fyngyrz> TMB, run the benchmark for yourself. You have something to learn here.
[00:48:46] <fyngyrz> trust me on thsione: I'm old, I'm smart, and I write extremely high performance realtime software. I know what I'm talking about here.
[01:03:21] <fyngyrz> TMB, your statement here: "so, no. you can not write a method of checking one string against four hundred that is any more efficient than doing it in a while loop. it is not physically possible." Is utterly, completely wrong. I understand that no one knows everything, and I'm not ciritcizing you. But I **AM** telling you you're wrong, and I can absolutely, postiively, utterly guarantee I"m correct on this one. Please taka look at t
[01:03:21] <fyngyrz> he benchmark and actually RUN it.
[02:30:52] <Bytram> define a regex that matches one of the abbrevs... aka State Transition Model
[02:30:59] <Bytram> #g state transition model
[02:31:00] <MrPlow> https://www.ncbi.nlm.nih.gov - "Med Decis Making. 2012 Sep-Oct;32(5):690-700. State-transition modeling: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force-3."
[02:31:05] <Bytram> #g state transition model wikipedia
[02:31:06] <MrPlow> https://en.wikipedia.org - "Transition state theory (TST) explains the reaction rates of elementary chemical reactions. The theory assumes a special type of chemical equilibrium ..."
[02:31:12] <Bytram> nope
[02:31:22] <Bytram> #g state transition diagram wikipedia
[02:31:23] <MrPlow> https://en.wikipedia.org - "A state diagram is a type of diagram used in computer science and related fields to describe the ... Another possible representation is the State transition table."
[03:50:23] <TheMightyBuzzard> fyngyrz, you're really not listening. you know a raindance, i know why your raindance works. accessing a member of a dictionary or hash requires a loop. you don't see it but it most assuredly runs. it is simply a less generic and thus optimized version of any other loop in $language. you perl or python or whatever still has to check every value until it runs out or finds the one it's looking for.
[03:51:17] <TheMightyBuzzard> there is no magic that can make checking for matching identifiers not a necessity.
[03:56:26] <TheMightyBuzzard> the perl or python loops you write have to be able to handle any old kind of non-typed variable and numerous other things. the code for hashes and dictionaries does not. no conversions, no safety checking, thus faster.
[03:57:18] <TheMightyBuzzard> and they can only be faster because they're written in C or assembly or such and get to ignore all that nonsense that higher level languages need to keep people from shooting themselves in the foot.
[04:04:06] <chromas> One nice thing about a hash table is you don't have to compare the entire string you're loooking for, just a hash, which can fit nicely into a register
[04:04:16] <chromas> (one for each item in the list)
[04:05:42] <TheMightyBuzzard> you don't really compare a string anyway. it gets converted to probably an int. text is for humans.
[04:06:19] <chromas> Yeah but then you have to compare all the bytes. With the hash table you don't (at least until you get to the right hash and find out there's more than one string ass ociated with it)
[04:06:41] <TheMightyBuzzard> any interpreted language is going to take an order of magnitude longer to access any kind of associative array though. too much runtime conversion happening.
[04:06:54] <chromas> Then again you could get fancy and check just the first 8 bytes of the string
[11:56:43] <fyngyrz> > accessing a member of a dictionary or hash requires a loop.
[11:56:45] <fyngyrz> Wrong.
[11:58:13] <fyngyrz> Hashes are not simple lists of reduced codes searched through with a loop. Unless you've written your own naive version, anyway.
[12:01:32] <fyngyrz> The searched for element is (quickly) converted into a hash; this IS an index into the table, so you are instantly at the table where the hash points. If there are multiple data items that resolve to the same hash, then there are multiple items at that point, and a (very) short compare ensues. A good hashing mechanism results in very few collisions.
[12:03:33] <fyngyrz> The reason that hashes always return keys in no particular order is because the location in the table is not related to any order you or I would think of; it's related only to the hashing algorithm.
[12:04:55] <fyngyrz> so keylist = mydict.keys() gets us what appears to be an unordered list, then we do keylist.sort() to get an ordered access to the dictionary (presuming we need it) I explain this only because it gives a little insight as to what is actually going on
[12:05:50] <fyngyrz> Here's a simple hash algorithm:
[12:06:08] <fyngyrz> 1) allocate a list of (N) pointers.
[12:06:37] <fyngyrz> 2) hash a bit of data into a number, 0-N
[12:07:20] <fyngyrz> 3) At the number in the table, place a pointer (usually to the first element of a linked list) to that data, stored.
[12:07:40] <fyngyrz> 3a) the key is there as well.
[12:08:20] <fyngyrz> 4) continue. For all elements that get a uinique hash, the null pointer in the initial table points to the first element of the list.
[12:08:26] <fyngyrz> Retreival of this is:
[12:08:42] <fyngyrz> 1) hash the bit of data (the key)
[12:08:55] <fyngyrz> 2) hit the table at the hash location
[12:09:16] <fyngyrz> 3) check to see if there's only one element there (no link... null or non-null)
[12:09:37] <fyngyrz> 4) If there's no link, you're done, it's the only element, return it (usually the case with a good hash alg.)
[12:09:59] <fyngyrz> 4a) If there *is* a link, then compare the key. If the key matches, return the data
[12:10:23] <fyngyrz> 4b) If the key does not match, move to the next element in the list. Repeat.
[12:10:57] <fyngyrz> 4a and 4b are known as "hash collisions", and the whole goal of a hashing algorithm is to reduce these.
[12:11:56] <fyngyrz> So you see, there IS NO CHECK-EVERY-ELEMENT-LOOP. None. Zero. Zilch. Nada.
[12:13:03] <fyngyrz> A really good hashing system will alter the hashing algorithm, including the table structure, if the current (smaller) hash starts having too many collisions.
[12:17:32] <fyngyrz> Example: store data "bar" at key "bar"
[12:18:01] <fyngyrz> hash algorithm mangles foo until it gets a number, say 0-1024
[12:18:44] <fyngyrz> dammnit.
[12:18:48] <fyngyrz> starting over
[12:18:57] <fyngyrz> Example: store data "foo" at key "bar"
[12:19:29] <fyngyrz> hash alg. mangles foo into a number from 0-1023; in this case, it's 700.
[12:19:52] <fyngyrz> at location 700 in table, it finds a NULL, because nothing there yet.
[12:20:15] <fyngyrz> it places a pointer to the first (and only, so far) element of a linked list at location 700.
[12:20:39] <fyngyrz> so if you say "gimme data for "foo", it does this:
[12:20:58] <fyngyrz> mangles foo again, gets 700 again.
[12:21:05] <fyngyrz> goes to location 700 in table.
[12:21:25] <fyngyrz> sees linked list. Sees only one element (no forward node pointer)
[12:21:32] <fyngyrz> grabs data, returns it
[12:21:43] <fyngyrz> That's ALL there is to it if there are no collisions
[12:22:13] <fyngyrz> but say there is a collision. Say that "gee" ALSO resolves to 700 when hashed.
[12:23:10] <fyngyrz> so you go to put "gee":"whillikers" into the dictionary, and 700 is the hash of gee, so you get 700 **again**
[12:23:41] <fyngyrz> So the storing algorithm looks at location 700, sees there's already a list started, and it does this:
[12:23:58] <fyngyrz> 1) Is the key the same as (any) element in the linked list?
[12:24:10] <TheMightyBuzzard> tell me then, how does perl or python tell which bit of the array has "gee" as a key?
[12:24:31] <fyngyrz> 2) It's not, there is only "foo" there, so it runs into the end of the list with no match, and ADDS a list element with gee:whillikers
[12:24:45] <TheMightyBuzzard> it's not magic. you have to check. tanstaafl.
[12:24:50] <fyngyrz> OR, 2) it finds that "gee" is already there, and replaces it
[12:25:14] <fyngyrz> but FUCK< you're only checking like ONE OR TWO THINGS, not running a loop through everything,
[12:25:23] <fyngyrz> you're being intentionally obtuse and moving the goalposts
[12:25:37] <TheMightyBuzzard> NO! you are checking the whole list because you have no idea if that key is there or not.
[12:25:58] <TheMightyBuzzard> well potentially the whole list. you could get lucky since they're pseudorandomly arranged.
[12:26:00] <fyngyrz> I guess I can't make you actually read what I wrote, can I?
[12:26:14] <fyngyrz> no. You don't. READ WHAT I WROTE
[12:26:25] <TheMightyBuzzard> i did read, you're simply lacking understanding. you think there is magic.
[12:26:53] <fyngyrz> okay, I give up. You simply dont' kno how hashing works, and you're being stubborn about learning
[12:27:34] <TheMightyBuzzard> you realize 4) is describing a loop, yes?
[12:27:35] <fyngyrz> there's no magic. There's exactly what I described. Iv'e WRITTEN hashing systems, I know exactly what I'm talking about
[12:28:44] <fyngyrz> 4) checks for a null pointer at the hash location in the table. It's not a loop unless there is a forward link. In a good hashing system, there won't be. One test for a null pointer, a very large percentage of the time. Period.
[12:29:32] <fyngyrz> Got it? Hask; check for no forward link; none found, return data. This is the high-probability result
[12:29:38] * TheMightyBuzzard sighs
[12:29:55] <fyngyrz> you said every element in th4 list must be checked. You'rewrong.
[12:30:06] <fyngyrz> you can sigh all you want, you're still completely wrong
[12:30:30] <TheMightyBuzzard> that's nice. it's become not worth it to explain further.
[12:31:16] <fyngyrz> you can't explain a wrong idea enough to make it right. You're laboring under a huge misconception
[12:31:42] <fyngyrz> There's no loop that tests every item in the table. None. Zero.
[12:32:11] <fyngyrz> That statement is 100% at odds with your claim that every item must be checked, which you have repeatedly claimed
[12:32:16] <TheMightyBuzzard> i'm glad you think so. carry on thinking so.
[12:33:10] <fyngyrz> don't worry. Reality is that thing which doesn't change because TMB doesn't understand it. :)
[12:33:53] <TheMightyBuzzard> oh i never worry about that.
[12:33:58] <fyngyrz> I can tell
[12:34:09] <fyngyrz> stuck in a misconception, and too proud to learn.
[12:34:58] <fyngyrz> you don't understand hash algorithms, and you don't *want* to.
[12:35:11] <fyngyrz> because it would put you in the wrong
[12:35:13] <fyngyrz> and you can't have that
[12:35:31] <fyngyrz> I'm sorry you are unable to learn this.
[12:35:52] <fyngyrz> learning disabilities are hard for those subject to them.
[12:35:58] <fyngyrz> I am very sorry for you
[12:36:30] -!- fyngyrz has quit [Quit: Leaving...]
[12:51:40] -!- fyngyrz [fyngyrz!~fyngyrz@Soylent/Staff/Editor/fyngyrz] has joined #dev
[13:24:02] <Bytram> I am quite pressed for time, so I apologize if this is not phrased as well as I'd like.
[13:25:33] <fyngyrz> There's no point, Bytram.
[13:25:38] <fyngyrz> Let it go.
[13:26:06] <Bytram> (1) A hash table lookup of a given string operates in O(n) time. In essence it's an extension of the concept of looking up of an element in an array. Define an array of say, 100 elements. Look up item n, e.g. 37, that's a direct access to a specific location in memory.
[13:28:15] <Bytram> a hash table takes the same concept, but translates each string into a number that can then be looked up in the table. Sometimes, the number you get for the hash of one string is the same as the hash for another string. That's a collision. As fyngyrz pointed out, it is a simple task to iterate through a linked list of keys and values until you find a match (or not).
[13:29:58] <Bytram> (2) *finding* a string in the source text that you want to look up can be reduced to a finite state automaton which can be calculated once (at compile time) and then used as often as you want at run time. It is the *optimal* search for a string/pattern in another string.
[13:30:31] <fyngyrz> Jsut o be clear the hash isn't looked up- it's an index, and the action is a one-off indexing operation. Not a search.
[13:31:22] <Bytram> For a general regexp, if I recall my syntax correctly, "[[:upper:]]{3,}" will match any pattern containing 3 or more uppercase letters.
[13:31:31] <Bytram> fyngyrz: agreed
[13:31:47] * Bytram knew his comp sci education would come in handy... eventually!
[13:31:55] <fyngyrz> lol yes
[13:32:10] * Bytram had to code hashes and regexps in college... using Pascal.
[13:32:36] <fyngyrz> this kind of knowledge is often the difference between horrible performance and great performance.
[13:33:20] <Bytram> oh, and our data structures prof was a sadist. One project was to write a doubly-linked circular list where each node, in turn, pointed to its own doubly-linked circular list.
[13:33:26] * Bytram cringes at the memory.
[13:33:30] <TheMightyBuzzard> fyngyrz, you ever hear the phrase "i've forgotten more than you'll ever know about $blah"?
[13:33:44] <TheMightyBuzzard> it appears i've just had one of those moments.
[13:34:18] <TheMightyBuzzard> you're absolutely correct, i am absolutely wrong. and it's something i used to know well.
[13:34:24] <fyngyrz> lol, no, you just had a "TMB is too stubborn to learn" moment.
[13:34:37] <fyngyrz> oh yes?
[13:34:41] <fyngyrz> hm
[13:34:45] <TheMightyBuzzard> bit disturbing, really
[13:34:52] <fyngyrz> oh no
[13:34:53] <Bytram> TheMightyBuzzard: I greatly respect your knowledge and expertise, but in this case, I have to side with fygyrz. A hash is an *optimal* lookup for a string and does not require a multitude of matches for every substring against every string in a source using a loop.
[13:34:55] <TheMightyBuzzard> i don't like forgetting things.
[13:35:19] <fyngyrz> oh no. Don't let it get to you. Any time you can get over a hump, it's a win.
[13:35:25] <Bytram> look back at what I wrote about hashes and regexps.
[13:36:06] <TheMightyBuzzard> right but hashing algorithms and btrees and a dozen other speed hacks are things i used to utterly grok.
[13:36:29] <TheMightyBuzzard> it's like realizing you forgot how to ride a bike
[13:36:33] <fyngyrz> we all have our areas of expertise, and they're usually the ones we keep being exposed to over and over
[13:36:50] <fyngyrz> I too have forgotten much, trust me on this
[13:37:42] <TheMightyBuzzard> getting old sucks. this is something i would have fundamentally understood without thinking about it when i started coding sn stuff.
[13:37:44] <Bytram> as have I forgotten much.
[13:39:16] <TheMightyBuzzard> i'm standing by my stubbornness approach though. hopefully it'll cause this to stick for a bit longer this time given the effort put into the argument.
[13:39:26] <fyngyrz> lol good
[13:39:27] <Bytram> but this one I am absolutely certain on. Take a look at lex (or is it spelled lexx?) as an example of a language tokenizer.
[13:40:33] <Bytram> oh, and I have read up on different implementations of regexp parsers and where perl vs AWK, for example, have advantages and shortcomings when it comes to certain boundary conditions
[13:41:39] <TheMightyBuzzard> Bytram, i'm sick of boundary conditions causing problems. we should build a wall.
[13:41:49] <Bytram> would love to chat more but I got a buncha stuff I gotta do before I head in to work, and we have a high-level mucky muck visiting our site today, so need to be on my toes for that.
[13:41:50] <fyngyrz> make the perl people pay for it!
[13:41:58] <Bytram> ROFL!
[13:42:22] <TheMightyBuzzard> should make the C people pay for it. it's mostly their problem.
[13:42:37] <fyngyrz> nah, just the shitty coders. Which is most of them, I admit
[13:42:39] <TheMightyBuzzard> well no, i tell a lie. it's mostly the php people's problem.
[13:43:03] <Bytram> nah, it's the people who have to deal with php's problem.
[13:43:04] <Bytram> =)
[13:43:12] <Bytram> okay, really gtg.
[13:43:16] * TheMightyBuzzard points at exec
[13:43:20] <TheMightyBuzzard> laters
[13:43:21] <Bytram> Have a great day everyone!
[13:44:43] <TheMightyBuzzard> fyngyrz, things like this are why i don't take disagreements personal much anymore. it took me a long time to get through my head but it's remotely possible for me to be wrong on occasion.
[13:45:14] <TheMightyBuzzard> still frustrating, mind. but there's no reason to get my panties in a bunch.
[13:46:33] * TheMightyBuzzard sighs
[13:47:16] <TheMightyBuzzard> now i gotta start going back over all my code for the past few years and figure out when i forgot how to program.
[18:29:18] * chromas has forgotten more than he ever learned about $blah
[18:40:54] <fyngyrz> Okay, so to retrench a bit: my contention is that a fast linear scanner of a flat file is a wiser approach than loading the DB with expansions. This is both because of the nature of this particular task, and because the DB is presently underperforming to the demands it faces.
[18:41:42] <fyngyrz> So that's my suggestion. Not inclined to argue about it; but having written a lot of this stuff myself, I'm very secure in that opinion.
[18:43:12] <chromas> Just run a second rdbms :D
[21:22:45] <TheMightyBuzzard> rather turn on slow queries, find the cause, and fix it.
[21:23:11] <TheMightyBuzzard> can't just now though. paying work takes precedence over non.
[21:33:12] <TheMightyBuzzard> in any case, that list isn't something that needs to be loaded every page load anyway. it needs to load at process creation and be part of what stays resident always. hitting either a flat file or the db for unchanging information every page load would be silly.
[22:55:18] <chromas> Create an acronym-expansiond with an xml interface and have the perl code hit it up for each page load
[23:31:19] <TheMightyBuzzard> better put the xml in json just to be safe.