#dev | Logs for 2017-10-15
[02:17:50] BotSupremacist is now known as upstart
[13:23:04] <Bytram> TheMightyBuzzard: psst... over here.
[13:23:13] <TheMightyBuzzard> yar
[13:23:17] <Bytram> unicode normalization before filtering...
[13:23:20] <Bytram> NCKD?
[13:23:31] <TheMightyBuzzard> #g nckd
[13:23:32] * Bytram wonders if he remembered that right?
[13:23:32] <MrPlow> http://acronyms.thefreedictionary.com - "Looking for online definition of NCKD or what NCKD stands for? NCKD is listed in the World's largest and most authoritative dictionary database of abbreviations ..."
[13:23:43] <Bytram> holdon
[13:24:26] <Bytram> nfkc or nfkd
[13:24:35] <TheMightyBuzzard> no, not going to use a pre-written library. they're never kept up to date reliably.
[13:25:01] <Bytram> https://en.wikipedia.org
[13:25:02] <upstart> ^ Unicode equivalence - Wikipedia ( https://en.wikipedia.org )
[13:25:26] <Bytram> not suggesting you use a library, just trying to understand, before you start, what you had in mind....
[13:25:40] <Bytram> I've already gone down that rabbit hole a bit and it ain't... pretty
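Note: for readers unfamiliar with the NFKC/NFKD forms raised above, here is a minimal Python sketch of compatibility normalization. The function name `fold_compat` is hypothetical, and stripping combining marks is an extra step assumed here, not something proposed in the log. It shows both what NFKD catches (styled math letters, fullwidth forms) and what it misses (look-alikes from other scripts), which is why a separate substitution table still comes up in the discussion.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Apply NFKD, then drop combining marks (the second step is an assumption)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# MATHEMATICAL BOLD SMALL A and FULLWIDTH LATIN SMALL LETTER A both fold to "a".
print(fold_compat("\U0001d41a"))  # a
print(fold_compat("\uff41"))      # a

# CYRILLIC SMALL LETTER A has no compatibility decomposition, so it is unchanged.
print(fold_compat("\u0430") == "\u0430")  # True
```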
[13:26:44] <TheMightyBuzzard> making a table that associates all characters that could be substituted for A with the letter A, B with B, etc...
[13:27:06] <Bytram> yep, there's already something out there from the Unicode consortium... trying to track it down again
[13:27:08] <TheMightyBuzzard> making a page for admins where you can easily update them
[13:28:20] <TheMightyBuzzard> then de-unicoding everything down to ascii for filtering purposes but leaving the unicode intact for actual use.
[13:28:33] <Bytram> nod nod
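Note: the plan described above (a table tying each substitute character to its ASCII letter, then folding a copy of the text for filtering while the original stays intact) could look roughly like the Python sketch below. The names `SUBSTITUTES`, `build_fold_table`, and `fold_for_filter` are hypothetical, and the three-entry table is only sample data, not the site's actual list.

```python
# Sample data only: ASCII letter -> characters that may be substituted for it.
SUBSTITUTES = {
    "a": ["\u0251", "\u03b1", "\u0430"],  # ɑ (Latin alpha), α (Greek), а (Cyrillic)
    "b": ["\u0184", "\u042c"],            # Ƅ, Ь
}


def build_fold_table(substitutes):
    """Invert the admin-style table into a translation map for str.translate."""
    return str.maketrans(
        {sub: ascii_char for ascii_char, subs in substitutes.items() for sub in subs}
    )


def fold_for_filter(text, table):
    """Return a folded copy for the filter; the caller keeps the original text."""
    return text.translate(table)


table = build_fold_table(SUBSTITUTES)
original = "sp\u0430m"                    # "spam" with a Cyrillic а
folded = fold_for_filter(original, table)
print(folded)                             # spam
print(original == "sp\u0430m")            # True: stored text is untouched
```

Only the folded copy would be handed to the filter; what gets stored and displayed is still `original`.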
[13:28:37] <Bytram> take a look at this: http://www.unicode.org
[13:29:16] <Bytram> and, within that, confusables.txt
[13:29:49] <TheMightyBuzzard> massive overkill
[13:30:06] <TheMightyBuzzard> and not complete for our purposes
[13:30:22] <Bytram> yes, but there is a simple extract from it that gets all the confusables that map to latin chars and nums
[13:30:29] <Bytram> give me a few more moments, k?
[13:30:31] <Bytram> btw
[13:30:35] <Bytram> coffee++
[13:30:35] <Bender> karma - coffee: 14
[13:31:10] <TheMightyBuzzard> don't worry about getting a complete list right now. that's why i'm making an admin page to update the lists.
[13:32:10] <Bytram> you make da list, and I'll get you starting data
[13:36:47] <Bytram> refinement... more accessible data, here: confusablesSummary.txt
[13:39:17] <Bytram> TheMightyBuzzard: what chars are you looking to match? [a-zA-Z0-9] ?? anything else?
[13:39:54] <TheMightyBuzzard> very much in an unuseful format. i need the characters not a codepoint range or names.
[13:40:33] <Bytram> easy cowboy, I can get you THAT... just need to know what chars you are looking to map, first.
[13:40:34] <TheMightyBuzzard> no, nothing else.
[13:40:37] <Bytram> nod nod
[13:40:46] <Bytram> I can get you a simple file....
[13:41:07] <Bytram> # confusable_ascii_char
[13:41:12] <Bytram> confusable_match1
[13:41:15] <Bytram> confusable_match2
[13:41:16] <Bytram> confusable_match3
[13:41:21] <Bytram> # confusable_ascii_char_2
[13:41:24] <Bytram> etc.
[13:41:24] <TheMightyBuzzard> that's not all i need either. i need block letters and any other forms of A that can be read as A by a human even if they look nothing alike.
[13:41:54] <Bytram> even if they look nothing alike? then how can they get confused?
[13:42:00] <Bytram> or, do you mean:
[13:42:02] <Bytram> A
[13:42:08] <Bytram> /\
[13:42:20] <TheMightyBuzzard> we're not looking for confusion's sake. we're looking for wiseass spammer's sake.
[13:42:25] <Bytram> the 2nd being a slash followed by a backslash
[13:42:31] <Bytram> nod nod
[13:42:50] <TheMightyBuzzard> so, yeah, multi-character substitutions will have to work as well.
[13:42:57] <Bytram> but we can at least start with all of the already identified confusables for latin digits/letters
[13:43:11] <Bytram> so, let's say this is a subset of what you are looking to do...
[13:43:22] <Bytram> a starting point to which you can add more confusables
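Note: the multi-character case mentioned above (a slash followed by a backslash read as A) cannot be handled by a one-character lookup. One possible approach, sketched in Python under the assumption that longer sequences should win over shorter ones, is a regex alternation sorted by length. `MULTI`, `build_pattern`, and `fold_multi` are hypothetical names, and the `|3` entry is an invented example; only the `/\` case comes from the log.

```python
import re

# Sequence -> ASCII letter. Only "/\" is from the discussion; "|3" is illustrative.
MULTI = {
    "/\\": "A",
    "|3": "B",
}


def build_pattern(table):
    """Compile one alternation, longest keys first so they are tried first."""
    keys = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in keys))


def fold_multi(text, table):
    """Replace every known multi-character sequence with its ASCII letter."""
    pattern = build_pattern(table)
    return pattern.sub(lambda m: table[m.group(0)], text)


print(fold_multi("SP/\\M", MULTI))  # SPAM
```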
[13:47:12] <TheMightyBuzzard> crap, it's raining and i really need to mow today
[13:47:36] <Bytram> grep '^#[[:space:]][a-zA-Z0-9][[:space:]]' confusablesSummary.txt
[13:47:52] <Bytram> that'll get you started.
[13:48:13] <Bytram> actually... the '[[:space:]]' is literally a tab character
[13:49:09] <Bytram> the first, tab-separated-value is the confusable latin-char
[13:49:24] <Bytram> all the remaining, tab-separated-values are the chars that could be confused with it
[13:51:31] <Bytram> for example, this is what would be output for matching Latin small letter A:
[13:51:32] <Bytram> # a ɑ α а ⍺ a 𝐚 𝑎 𝒂 𝒶 𝓪 𝔞 𝕒 𝖆 𝖺 𝗮 𝘢 𝙖 𝚊 𝛂 𝛼 𝜶 𝝰 𝞪
[13:52:06] <Bytram> similarly for 'b':
[13:52:07] <Bytram> # b Ƅ Ь Ꮟ ᖯ 𝐛 𝑏 𝒃 𝒷 𝓫 𝔟 𝕓 𝖇 𝖻 𝗯 𝘣 𝙗 𝚋
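Note: the grep above can be mirrored in a few lines of Python that also split each matching line into a usable table. This sketch assumes the layout described in the log (a `#`, a tab, one character from [a-zA-Z0-9], a tab, then tab-separated look-alikes); `parse_summary` and `HEADER` are hypothetical names, and the sample lines are hand-made rather than copied from the real file.

```python
import re

# "#", tab, one ASCII letter or digit, tab, then the tab-separated look-alikes.
HEADER = re.compile(r"^#\t([a-zA-Z0-9])\t(.*)$")


def parse_summary(lines):
    """Map each ASCII letter/digit to the set of characters listed after it."""
    table = {}
    for line in lines:
        match = HEADER.match(line.rstrip("\n"))
        if not match:
            continue  # not a summary line for [a-zA-Z0-9]
        target, rest = match.groups()
        lookalikes = {c for c in rest.split("\t") if c and c != target}
        table.setdefault(target, set()).update(lookalikes)
    return table


sample = [
    "#\ta\t\u0251\t\u03b1\t\u0430\n",  # a: ɑ α а
    "#\tb\t\u0184\t\u042c\n",          # b: Ƅ Ь
    "some other line\n",               # ignored
]
table = parse_summary(sample)
print(sorted(table))                   # ['a', 'b']
```

With the real file, the same function could be fed `open("confusablesSummary.txt", encoding="utf-8")`.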
[13:54:06] <Bytram> if I may suggest? Have a flag in your table to distinguish between user-entered and system-derived entries. As the unicode standard (and/or the confusables list) gets updated, we can just do a fresh extract and reload that into your table, while preserving any user-entered confusables.
[13:59:37] <Bytram> so, we could automagically pull down the latest zip file, extract confusablesSummary.txt, extract the problematic matches for [a-zA-Z0-9], and load that, fresh, into the DB while keeping any user-entered confusables around. =)
[13:59:47] <Bytram> break time
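Note: the flag-and-reload suggestion above might be sketched as follows, using Python's built-in sqlite3 purely for illustration. The table name `confusable`, its columns, and the function names are all hypothetical; the actual site schema is not shown in the log. The point is only that a refresh deletes and reinserts rows marked `system` while rows marked `user` survive.

```python
import sqlite3


def init(conn):
    """Create a table whose 'source' column is the user/system flag."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS confusable ("
        " substitute TEXT NOT NULL,"
        " ascii_char TEXT NOT NULL,"
        " source TEXT NOT NULL CHECK (source IN ('system', 'user')),"
        " PRIMARY KEY (substitute, source))"
    )


def add_user(conn, substitute, ascii_char):
    """Record an admin-entered substitution."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO confusable VALUES (?, ?, 'user')",
            (substitute, ascii_char),
        )


def reload_system(conn, extracted):
    """Swap in a fresh extract; only rows flagged 'system' are replaced."""
    rows = [(sub, asc) for asc, subs in extracted.items() for sub in subs]
    with conn:  # one transaction: delete the old extract, insert the new one
        conn.execute("DELETE FROM confusable WHERE source = 'system'")
        conn.executemany("INSERT INTO confusable VALUES (?, ?, 'system')", rows)


conn = sqlite3.connect(":memory:")
init(conn)
reload_system(conn, {"a": ["\u0251"]})
add_user(conn, "/\\", "A")
reload_system(conn, {"a": ["\u03b1"]})  # the user row for "/\" is still present
```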