User talk:Anthony/LoPbN
26th
1 - J>A
I was pleasantly surprised (since for a long time i saw your contribs on VfD that reflected blind retentionism -- no, wait, as far as i noticed it actually looked like blind bio-retentionism [smile]), when i saw your many recent edits to the List of people by name tree, to see the constructive re-formatting involved. While i'm coming around to the view that with some enhancements to the cat system we may no longer need LoPbN, i think that time is still a while off, so i'm not ready to abandon my work on it.
(I can't help wondering if you might be working on getting it more bot-friendly, in which case i'm itching with curiosity abt what the bot you might have in mind would do!)
But in any case, thanks for your work. --Jerzy(t) 00:29, 2004 Aug 26 (UTC)
2 - A>J
- Welp, I'm working on a php script to get all this info into a database. (It was originally a perl script, then I realized I had to access my WP mirror so I moved to PHP). Right now I've got the [very kludgy] script to correctly identify which lines are actually entries of people. Haven't actually parsed them yet, but this doesn't look too bad.
- Perhaps you can help me by identifying what is intended as the common format. If you noticed I started out interpreting things really strictly and then gradually got less and less picky as I went along. By the end I just started special-casing stuff ("else if (preg_match('/^\* Notwithstandinig any theological arguments/', $lline"). When I hit the pages with the tables and comments in the middle of everything I went from a semi-validating parser into "let's just try to parse every line which starts with a *".
- My immediate intention is simply to extract the information. If I accomplish this I'll gladly put up a database dump somewhere if someone is interested. Longer term, it would be nice if I could somehow add information (maybe scan the categories system, compare the birth/death dates from the article, try to match things up with IMDB) and then put it back. But, as this would have to be done by hand so as not to lose anything, I'm not sure if I'll get around to that or not.
- Yes, categories would be a better solution, but at the same time the difficulty of parsing this data is leading me to understand how categories alone won't hold some of the special cases we have in these pages.
- anthony (see warning) 00:44, 26 Aug 2004 (UTC)
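The "identify which lines are entries" step described above could be sketched roughly as follows. This is only an illustration in Python, not the PHP script discussed in the thread; the entry shape (one or more bullets followed directly by a wiki link) and the sample names are assumptions.

```python
import re

# Assumed entry shape: bullet(s), then a wiki link, e.g. "* [[Zeke Smith|Smith, Zeke]]".
ENTRY_RE = re.compile(r"^\*+\s*\[\[")

def entry_lines(wikitext):
    """Return the lines that look like person entries."""
    return [line for line in wikitext.splitlines() if ENTRY_RE.match(line)]

# Made-up sample page source.
sample = """== Smi ==
* [[Zeke Smith|Smith, Zeke]]
Some stray line without a bullet
** [[Albert Smithson|Smithson, Albert]]
* See also [[Smith]]
"""

for line in entry_lines(sample):
    print(line)
```

Under this assumption the "See also" line is skipped because its link does not come first; entries whose link sits at the end of the line would be missed as well.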
3 - J>A
Hmm, where to start.
- I'm fluent only in ancient programming languages, and most modern ones are just names to me. Specifically, that code scrap just has me headscratching about "theological" as my only reaction.
- I don't know what "the tables and comments in the middle of everything" refers to; for starters, is your input Wiki markup or HTML? If Wiki, you have the advantage of being able to avoid parsing the indexes that appear at the tops of pages, whose HTML form is name-free and very sparse in other useful info, and whose content is available elsewhere for much less effort.
- Are you worrying yet about automatically touring the tree structure of these pages, or just trying to interpret individual pages? (More on this shortly.) Do you have a list of all the page titles? Have you looked at the structure of the templates used in pages of the LoPbN tree? Especially those like H l Ha l Har l. Look at User:Jerzy/Argus for LoPbN MediaWiki, and ask me if you don't see why i mention it here.
- There are something like 600 pages in the tree, IIRC; the nodes at any depth (measured in pages, not heading-depth or sum of the single-page heading-depths above them) that have the greatest number of names as descendants correspond to the pages that are most likely to have had their format standardized recently. (Bcz i have gone thru them, breaking each of them into multiple pages and usually cleaning up such things as double-bullets.) There are places where old formats hang around not bcz they are good but bcz they were experiments -- in some (perhaps relatively many) cases my experiments -- that i can't expunge if can't find all the cases. (And bcz those pages don't have enuf names to have priority for getting broken up and at the same time re-standardized.) Those may account for some of what you have trouble parsing. Note, Olsen twins and Wright bros. IMO should be two lines per pair, not one; is that one of the sort of problem formats you hit?
- My major activities on LoPbN pages:
- Breaking up big pages (pretty much suspended: nothing above 21 K is left).
- Reducing section lengths to below about my own page size.
- Breaking up pages with big ToCs (which usually result from increasing the number of sections, in reducing section lengths).
- Restricting heading specifications to either
- a common-prefix string, or
- a range between two prefixes that
- are the same length, and
- differ only at the last letter.
- Breaking up range-specified headings that have subheadings.
- Where feasible, limiting increase of the lengths of prefixes mentioned to only one letter for each additional level of heading depth (ask for an example if curious)
- Correcting alphabetization, especially the machine-alpha errors where a surname appears after another surname that has it as a prefix. (For example, reversing the order when "Bloom, Geo." follows "Bloomberg".)
- Where some people have the same surname as others' given name, putting everyone with that given name before everyone with that surname. (Sounds weird, but there's a good reason.)
- (At present, i am tired of resorting within people known by given name instead of by surname.)
- Note that reformatting individual name entries is relatively rare.
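The alphabetization point in the list above (a surname must sort before any longer surname it is a prefix of) can be illustrated with a small Python sketch. What "machine-alpha" does exactly is not stated in the thread; the `naive_key` below, which lowercases and drops punctuation, is an assumption chosen because it reproduces the Bloom/Bloomberg error. The given name "A." is made up.

```python
import re

def naive_key(entry):
    # Assumed "machine-alpha" behaviour: lowercase, keep letters only.
    return re.sub(r"[^a-z]", "", entry.lower())

def surname_key(entry):
    # Compare the whole surname first, then the rest of the entry.
    surname, _, rest = entry.partition(",")
    return (surname.strip().lower(), rest.strip().lower())

names = ["Bloomberg, A.", "Bloom, Geo."]

print(sorted(names, key=naive_key))    # Bloomberg lands first (the error)
print(sorted(names, key=surname_key))  # Bloom lands first (the intended order)
```

Sorting on a (surname, rest) pair works because a shorter string always compares below a longer string that it prefixes.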
One reason i ask abt touring the tree is that you could do so, depth-first, just ignoring the indexes and processing the rest of each page; the result could be a single alpha list of all the names in the tree, unparsed. (Or just the names that your code can't already parse). I don't know how many different format variations there are that could be dispensed with; with such a list, we could go thru and say "that one takes a lot of effort to reformat" and "those five can all be quickly recast into one special-purpose format, instead of parsing each of the five." Unless you can quickly knock off an automated "depth-first-Tour-ing machine" (Ha,ha!), maybe i should sort up a list of page names in that order (there'd be a few hand adjustments; ask me to explain, when you become interested in that): your code could process those pages sequentially without having to do any recursion.
With any luck, your effort generating such a list would be more than compensated by the savings on coding for formats that should be eliminated anyway in the long run.
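A depth-first tour of the kind proposed above might look like this minimal Python sketch. The in-memory `tree` and `entries` dictionaries are hypothetical stand-ins; in practice the children and entries of each page would have to be read from the wiki source.

```python
def tour(tree, entries, page):
    """Depth-first walk: yield a page's own entries, then its children's, in order."""
    yield from entries.get(page, [])
    for child in tree.get(page, []):
        yield from tour(tree, entries, child)

# Hypothetical miniature tree; index-only pages simply have no entries of their own.
tree = {"root": ["A", "B"], "B": ["Ba", "Bl"]}
entries = {
    "A": ["Adams, X."],
    "Ba": ["Baker, Y."],
    "Bl": ["Bloom, Geo.", "Bloomberg, A."],
}

flat = list(tour(tree, entries, "root"))
print(flat)
```

If the children of every page are listed in alphabetical order, the flattened result is already one alphabetical list, which is the point of going depth-first. A pre-sorted list of page titles, as suggested above, would give the same result with a plain loop and no recursion.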
You know, BTW, i haven't even tried before to set upper and lower bounds on the number of names presently in the whole tree (there are pages with 0 or 1 name, and a few bulking 20K; i'd love to know within 25%!) (Hmm, List of people by name: Da-Dd is around 21K with something close to 400 names; something like 50 char per name, we must be well under 200K names; there's an upper bound. At one point, there were 21 pages between 17 and 28 K apiece and 8 more totalling 300K; 20K is a lower bound. So for now i'll think of it as 50 or 100 K.)
Hope to hear more from you.
--Jerzy(t) 09:30, 2004 Aug 26 (UTC)
4 - A>J
Not sure where to tackle this, so I'll just go in seemingly random order. First of all, I have been able to navigate the tree. That's basically the part that's done. What I did was take the list of pages in Category:Lists_of_people_by_name. Then I just ignore any file which has "{{Index only" in it. This part turned out to be easy. By this method I've found 333 pages, not including index pages. I'm assuming all the subpages are tagged. If not we've got to fix that. I've pretty much ignored the actual structure of the templates, since I resolved this problem using the category. But if there's something I should be aware of anyway, please point it out.
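The category-based selection just described amounts to a simple filter. A rough Python sketch, assuming the category members and their wiki source are already available as a dictionary (the `pages` data here is made up):

```python
def content_pages(pages):
    """Keep pages whose wiki source does not contain the index-only marker."""
    return sorted(
        title for title, source in pages.items()
        if "{{Index only" not in source
    )

# Hypothetical stand-in for the category members and their wiki source.
pages = {
    "List of people by name: B": "{{Index only}}\n* [[List of people by name: Ba]]",
    "List of people by name: Ba": "* [[Zeke Baker|Baker, Zeke]]",
}

print(content_pages(pages))
```

Any subpage that is missing from the category is invisible to this approach, which is why the tagging has to be complete.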
As for the version I'm working with, it's the wiki source. This turns out to make things a lot easier, because templates at the top can essentially be ignored once I've got the whole list. I have a local copy of the DB, and I have a script, which I've adapted, from Fred Bauder, that updates individual entries from the live Wikipedia, using the export function Special:Export.
The first snag I hit was the situation where there are multiple people with the same family name. This is not handled in a standard way at all. Sometimes "See also Anderson" is used, but sometimes the name will get its own heading followed by all the first names indented. There are two things I'd like to see here, really. First, and most importantly, we need to agree on a common format. Here I'm pretty much open to whatever. But also we really should try to avoid duplicating information. Here Anderson has a list of people with the last name Anderson, and so does this list. This could be merged into a single list using a template. The only problem with using the template is it's unclear how we could continue to indent the names, as they would be in a simple (first level) list on the disambig page.
This snag wasn't too bad for me to get over, though. I just ignore See also lines and ignore the indentation. What really got to be a pain was when I hit List_of_people_by_name:_Bra. Those tables right in the middle of the rest of the content are quite a pain to ignore. If you look at the source there are comments in there too, although I took a lot of them out. It was at this point, after seeing that these tables continued for a number of subpages, that I decided to stop trying to parse everything, and only parse those lines beginning with a "*". I'd much rather be able to parse everything. This both enables me to catch mistakes where people forgot to insert the "*", and allows me to fully understand the formatting details so I can give a shot at outputting the well-formed version (which would automate much of the work you describe above).
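One way to get back toward parsing every line is to remove the embedded tables and comments before looking at the rest. The following Python sketch assumes the tables use the `{|` ... `|}` wiki table syntax with the delimiters at the start of a line, and that comments are ordinary HTML comments; it is an illustration, not the script from the thread.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_noise(wikitext):
    """Remove HTML comments and {| ... |} table blocks, keeping everything else."""
    text = COMMENT_RE.sub("", wikitext)
    kept, depth = [], 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("{|"):
            depth += 1            # a table opens (tables may nest)
        elif stripped.startswith("|}") and depth:
            depth -= 1            # the innermost table closes
        elif depth == 0:
            kept.append(line)     # ordinary line outside any table
    return "\n".join(kept)

# Made-up sample with a table and a comment in the middle of the entries.
sample = "* [[A]]\n{| border=1\n| [[Brady]]\n|}\n<!-- note -->* [[B]]"
print(strip_noise(sample))
```

With the noise gone, a line that lacks its leading "*" stands out as an error instead of being silently skipped.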
By the way, my latest parse counts 23,550 people. But I have no idea if I'm miscounting. And I just printed out a list of some of the problem areas/special cases. Here goes:
- Some pages still include NOTOC even though this is in the template. Not really a problem, I ignore it, but I've removed it where I was able to.
- Indexes in non-template form: Bu, Coa-Cok, Com-Con, Coo-Coz, Fr, Ga-Gd, Ke, Li, Lo, Mad-Mam, Mar, Me-Mh, Mu, Pe, Pp-Pr, Ru, Sh, Si-Sj, Ta-Tb, Vi-Vk, Wh
- Du: Duke of...
- Mac
- aka names: Dekker, Eduard Douwes; Leach, Archie; Morrison, Marion
- Shared names: Anderson, Bauer, Cole, Coleman, Collins, Farmer, Fischer, Gay, Gore, Gordon, Green, Gregory, MacDonald, Mackenzie, MacLean, Mangus, Major, Mandell, Mann, Marcos, Maria, Mark, Marsalis, Marshall, Martinez, Marx, Mary, Maximus, May, Mill, Miller, Ritz, Schmidt, Schmitt, Schneider, Stanley, Steiner, Theodore, Thomas, Thompson, White
- Bob Goatse
- Misplacements: John, Saint; John, Little Willie; Man, Method; Mankiller, Wilma; Marie, Rose; Thitch
- Embedded tables: Bra, Bre-Bri, Bro, Bru, Brv-Brz, Col, Sta
- Otis, Mercy (either a misplacement or an aka name; or maybe a maiden name?)
- Saint
anthony (see warning) 13:56, 26 Aug 2004 (UTC)
5 - J>A
- This is great! There are tough cases, but your locating cases 2 and 9 means i can go thru and systematically remove that old approach, which i'd probably have done already if i'd had that list.
- Category 8 (Misplacements) have been retained intentionally, like misspelling redirects, but let's come up with a template that'll let us choose whether they get ignored, since they are dupes, or (perhaps only for vague future use) included.
More to come in the period between 2 and 5 or 6 hours from now.
Thanks! --Jerzy(t) 22:08, 2004 Aug 26 (UTC)
27th
6 - J>A
Let me address your 11 points in the same order:
- The NOTOC directive is active on many, probably most, of the point 2 and point 9 pages. It's present, but commented out on some others. I think you're mistaken in thinking it appears in a template, but show me if i'm wrong.
- The non-table manual indexes are a special case from your point of view, but not from mine. Solution that will work for both of us discussed at point 9.
- Duke of is a case of an internal cross reference, see 11.
- Mac ditto
- AKAs may deserve a solution analogous to 3,4, and 11; i'll discuss the differences and similarities at 11.
- Glad also to have a list of links to surname pages. Some of these (e.g., Farmer, chosen at random) violate the order that i think is necessary to keep the section headings from descending into unusability. Anderson is one i happen to have recently modified, following the style rule i saw someone assert, that (in articles) headings contain no links. Maybe the question should be asked, whether lists are a special case deserving exemption. Does Anderson or Farmer better suit your plans?
- I don't remember what kept me from clobbering Bob Goatse earlier. Think about what formats would suit you if we can't keep him gone; a registered user added him (early this year).
- Are the "misplacements" a technical problem, perhaps bcz the qualifications on the name come before the dates? Tell me more abt this one.
OK, i gotta stop, more at unpredictable intervals thru Monday. --Jerzy(t) 03:17, 2004 Aug 27 (UTC)
7 - A>J
First, regarding NOTOC, you're right. I got confused because I actually have TOC turned off as a preference on my system. So, here we go, I'll renumber.
- Regarding NOTOC, is there a reason to include it on some pages and not others? If not, can we just put it in Template:List_of_people-Top or take it out completely?
- Duke of, Mac, Saint: I'll just special case these in the code.
- aka names: Dekker, Eduard Douwes; Leach, Archie; Morrison, Marion. The issue here is that the link was at the end for these. If you don't mind me changing the link to the beginning, then this is fine. Alternatively, we just need to come up with a standard format, e.g. "Morrison, Marion, aka John Wayne". I'd probably prefer the latter.
- Surnames. The example at Anderson works fine for me. I'd actually prefer it to Farmer from a style point of view, because it doesn't have those ugly hardcoded linebreaks in it. But I can parse either fine if I know which to expect.
- Bob Goatse. The only issue is that we don't have an article on Bob Goatse and I expect that we don't want one. Otherwise I'd have just created a redirect and handled this like anyone else. If we want to keep these types of cases, I'd suggest either linking to the redirect page or not having a link at all in the first section. Not having a link at all would probably have to be special cased, though, as I could imagine situations where this happens by accident, or where there are multiple links and thus ambiguity.
- The "problem" with the misplacements is the same as the aka names (and similar to that of Goatse). The two best solutions are to put the link in the beginning or use the format "Man, Method, see Method Man". Again I'd prefer the latter, and we could have any explanatory text come at the end. "Man, Method, see Method Man (Method Man is actually a title)". I guess we could still include dates, but really as these are misplacements I think it looks better if we don't.
- Tables, indexes. If these are there intentionally, that's fine, I can come up with code to work around them. In fact, if you want I can probably come up with a script to add them to all the pages. Is there a specific reason why they appear in some and not others?
- Mercy Otis Warren. I guess this is a maiden name. We'll handle these the same as misplacements I guess.
Hope I was clear enough about everything. anthony (see warning) 12:01, 27 Aug 2004 (UTC)
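The two proposed formats, "Morrison, Marion, aka John Wayne" and "Man, Method, see Method Man (explanatory text)", are regular enough to be recognized with one pattern. A minimal Python sketch, working on the entry text as quoted above rather than on raw wiki markup (which is an assumption; real entries would carry link brackets):

```python
import re

XREF_RE = re.compile(
    r"^(?P<listed>.+?), (?P<kind>aka|see) "
    r"(?P<target>[^()]+?)(?: \((?P<note>.*)\))?$"
)

def parse_xref(text):
    """Split an aka/see entry into listed name, kind, target and optional note."""
    match = XREF_RE.match(text)
    return match.groupdict() if match else None

print(parse_xref("Morrison, Marion, aka John Wayne"))
print(parse_xref("Man, Method, see Method Man (Method Man is actually a title)"))
print(parse_xref("Smith, Zeke"))  # an ordinary entry: no match
```

Because both keywords sit in the same position, misplacements and aka names can share one code path, as suggested above.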
8 - J>A
Yes, very clear and helpful. Referring to your renumbering:
- The NOTOCs were so the manual indexes could replace, and not be interfered with, by the automatic ones. Both the active and commented ones are better gone, at least in the long run. For the short run i'd prefer to remove them, where there is a manual index, as i break up their pages, probably in most cases, so that the auto TOC is smaller than 1 screen and so is each section. (This goes much faster now that the templated upward and lateral indexes obviate manually changing every other page that links to a page that changes, e.g., from Bo-Bq to Bo and Bp-Bq.) IMO, the boxed and the open internal indexes (your "7. Table, indexes") should go and should be my first priority. (I'd rather just do them than explain further the rules i use in eliminating them, if you can stand the wait.)
- If it's easy, special case them for now, but IMO there should be more rather than fewer like them, and we should come up with a standard syntax for them, even if it's just "let them be rendered into an unrestricted form, but make the Wiki markup be a template invocation so the parser can ignore them".
- Putting the link at the beginning actually is probably best, as long as we don't hide the AKA in the link. E.g., how abt "Morrison, Marion, (fl. 20th century), aka John Wayne" (exploiting a redirect) or "Morrison, Marion, (fl. 20th century), aka John Wayne"?
- Farmer was not a perfect example, bcz i wouldn't let the hard-breaks stand anyway. If the "see also" is troublesome, the template approach could be used. See [ chg to Farmer] to see how i would bring it into conformance. (I waver about whether to include the "===== Farmer =====" heading in such a case, and probably in practice would have omitted it (since i don't anticipate any "Farmetti"s or "Farmeskirov"s sharing the "Farme" slot with "Farmer-" names and "Farmer"). Sometimes i have put it in, commented out, which i bet you hate; in any case, with or without it, the deepest heading is two levels below "Farme", which does affect the ToC. Also, if i were reworking the whole page, i'd have de-subordinated each heading one level, to make "headroom" for a six-level ToC, and have separate headings under "...named Farmer" for "Farmer as sole surname" and "Farmer as first word of a compound surname".
- Yes, re yr 1st sent. on Goatse; the root page says something like "real people with bios on WP", which implies his entries should at most be inspirations for new Redirs when they are removed from LoPbN. But i don't think you should special case for anticipated accidents, other than programming for graceful failure: the format should be user friendly, i.e. simple, and IMO your tool should drive manual repairs of deviations from it rather than build in too many formats.
- I may not grasp the issues well enuf to understand why you prefer special format over moving links to beginning, and will have to make time to look at some cases. I'm sure these are no show-stopper.
- The "Tables, indexes" (manual indexes) were intentional, and i favored them for a time, but i think now that they are unmaintainable w/o building a "distributed automatic TOC" option into the server. I think your tool is valuable, and will come to seem necessary, but IMO the truly necessary software should be in the server, and not go AWOL when you (or, as a present-company-excepted caveat, your successor) get hit by a truck, or get pissed and leave. Or the platform your software runs on goes discontinued, illegal, intolerably inefficient due to OS changes, whatever. Suddenly becoming unable to maintain the formerly manual indexes could be a disaster for usability. Also, like any bot, it would undercut the "populist editing capability" -- IMO worse than most bots do, bcz the format of those indexes could in practice only be tweaked by begging the bot maintainer to do it, and waiting for the programming to be done. (It's bad enuf as it is, w/ me sending an occasional editor a "don't do that, you don't understand the consequences and you probably can't bear my explaining them to you" note!)
- Yes, i looked at the MOW article; i just don't know why we don't have her in the Ws. (But i try to limit how much adding entries keeps me from (trying to, claiming to) improve the access to what LoPbN content others are putting in.) I would assume whatever solution to AKAs we arrive at should apply to these as well.
Thanks again, and sorry again abt not having unlimited time to speed this IMO worthwhile process! --Jerzy(t) 17:00, 2004 Aug 27 (UTC)
9 - A>J
Don't worry about speed. I didn't expect to have any help with this at all, and the number of special cases is down to a workable level where I could run this right now if I wanted to. My comments should be brief, as I think we've got this figured out.
- OK, I'll just let you handle this. Sorry if I removed this somewhere you wanted it. From now on I'll just ignore it.
- I'll special case for now, and if we get more, which I agree would be a good thing, we'll revisit this.
- OK, I'll move the link to the beginning. I've already done this in some cases.
- Farmer or Anderson is fine. I don't mind the "see also", as long as it always means the same thing. I don't even mind the commented out headings once I understand the method to the madness. I guess in theory we should document this stuff somewhere.
- I'll remove Goatse. If it comes back, I'll put it in the aka/misdirections format (i.e. link at the beginning, linking to the proper page). That the link will be a website instead of a person, well, I'll have to live with that. I'm sure there are similar cases already (people's name redirecting to murder cases, for instance).
- I just thought the reason the links weren't made was to convey the point that it isn't a correct name. But I'll just move the links to the beginning, unless/until someone has a problem with that.
- Yes. The only thing I have to add is that eventually this will be resolved by the list being created on the fly by the software itself (advanced "categories"). But as I'm growing more and more aware, at the moment categories suck.
For now I'm back to playing with parsing bits of the categories system. But tomorrow [I mean Monday] I'll probably go back to the lists, now that we've pretty much got this figured out. anthony (see warning) 17:32, 27 Aug 2004 (UTC)
28th
10 - J>A
No problem re NOTOCs; my only concern is that even an undermaintained manual index probably works better than an overwhelmingly big automatic one, so the ideal transition is to preview what turning the ToC back on looks like; use that to help decide where the boundaries between the page's children belong; create two new cascaded (if i use the term correctly) templates for the tops of the old and new pages; spend an hour or so creating all but the largest of the kids, with their ToCs turned on; preview the biggest kid (as an edit of the old page); then in 2-3 minutes, save the biggest kid's content as the new version of the old page, rename it to the biggest kid's name, and convert the redirect that results into an Index-only (with the old page's former name). That way, no one but the guy foolish enuf to undertake the conversion has to look at the bloated ToC.
Note that (of course) the commented out NOTOCs can go any time, as can any on pages w/ or fewer headings! (If you really want to take that on.)
You said
- I guess in theory we should document this stuff somewhere.
generously avoiding
- Why hasn't J already doc'ed this, & when will he start?
Ideally, you'd write what you understand and i'd correct misconceptions and say "do you realize you left out...?". But that selfish suggestion would be far from mirroring your generosity. A more reasonable suggestion from me is that i should write it, and you should point out ambiguities and impenetrabilities. But i'm inclined to first correct the problems you have so helpfully surveyed, whatever the doc'n process.
Contrary to my 1st thot, i think i can do Farmer-style cleanups on the dab-linked names you listed very quickly. Then i'll do the shorter list, and then the longer, of manual indexes.
One page w/ box-style manual indexes that you pointed me at had "especially frequent names" boxes. Those are easy to rebuild later; shall i trash them as i go so we can deal with their format issue at leisure? They're a nice amenity IMO for the long run, but not as important as standardizing, esp'ly toward supporting your extractor.
--Jerzy(t) 03:37, 2004 Aug 28 (UTC)
11 - J>A
Oh, [i] keep forgetting: in the next few days, i will compare the LoPbN categ list with my "argus" and the index at the end of the root page. One issue is an (almost unreachable) page that had the tag but was broken, in connection, i think, w/ someone consolidating pages w/o knowing the implications. I'm hoping to get word whether the conversion of needed redir back to this broken page was an isolated incident or not. More later on both aspects of lists of LoPbN pages. --Jerzy(t) 03:46, 2004 Aug 28 (UTC)
12 - J>A
I want to think abt the distinction between links to a dab for the same name (Cole) and links w/in LoPbN (Coleman, to the Colman section). I think users want to know which is which, but may miss the subtle hint of slightly different spelling. No plan yet. --Jerzy(t) 04:10, 2004 Aug 28 (UTC)
13 - A>J
Don't worry about the documentation. I really just meant once we're done we should summarize these discussions here for the benefit of others. And, by we I meant whoever winds up being less lazy about it.
Regarding especially frequent boxes, feel free to leave them if you like them.
Aug 29
14 - J>A
Re Cole and Coleman
I think the formats are
- This name may be confused with Colman.
and
- Some people missing here may be at Cole and/or Category:Cole.
and, to add another case,
- Some people missing here may be at Category:Dalton.
(I don't add a fourth (for dab but no cat), bcz i intend to create such a category whenever i find a disamb that lists people sharing a surname.) (In the case of Dalton, there is a Dalton article, but it lacks a place for listing Daltons beyond John, who is already on LoPbN. Also, i intend to create surname categories w/o creating corresponding dabs; these should be the bigger class of users of the Dalton version.)
- [Much later: i give up; here's the fourth
- Some people missing here may be at Cole.
Note that the Coleman-case link could be an inter-(LoPbN-)page one; in fact, Coleman could add "or Kuhlman" (even tho we as yet have no Kuhlmans), and i would add
- This name may be confused with Kohl.
before the "Some people ..." line under Cole.
I assume your parser is unlikely to be finicky between the intra- and inter cases, and could handle both in one place (perhaps provided you know which order to expect).
In adding surname cats, i don't presume to know where they fit into the longterm fate of name lists, but i'm betting it won't be soon that we can rely entirely on Cats in place of them.
Esp'ly-frequent-names boxes
I do like them (and not just bcz they are one of the few LoPbN innovations that i'm sure are mine [smile]). I think they can save searching for a significant fraction of users; i'll probably eventually try even jumping some users down a (page-)level or two in the tree, for really common names. (For example, LoPbN includes one person with "George" and "Smith" as first and last, but George Smith lists 15 more, all at least superficially plausible, of whom 5 already have WP bios. If the total number of Smiths merely tripled, i'd expect to have them on their own LoPbN page, below root, S, Sm, Smi, Smit, and Smith (an index-only page pointing also to Smithe-Smiths). Then at least Sm thru Smith should each include a link to LoPbNed Smith IMO.)
In any case, i gather you've already programmed for EFN boxes (presumably to ignore them).
Templates
Re both Cole/Coleman/Dalton formats and the EFN boxes, i'd like eventually to consider use of Templates with 1 or 2 parameters.
- Upsides:
- help enforce standardization;
- in case of EFN boxes, avoid the burden on editors of fussing with table notation.
- Downsides:
- only 5 copies of the same template called within one page get converted, so occasionally a page would either have to get broken up before it would otherwise be desirable, or a few transcluded templates would have to be converted to subst invocations of the same template.
- if we do use subst, then you have at least one more format (any double-braces call after the standard headings) to recognize. Is that one simple enough to be an insignificant burden?
Aug 30
List_of_people_by_name:_Se was totally screwed up, so I reverted it to the version from a couple weeks ago. anthony (see warning) 19:14, 30 Aug 2004 (UTC)
The following lists weren't tagged as being in the category. This is now fixed:
- List of people by name: Y
- List of people by name: U
- List of people by name: Q
- List of people by name: X
anthony (see warning) 21:02, 30 Aug 2004 (UTC)
- If i'd thot it thru, i'd have predicted that. Good move. I continue deferring my own cross check, but it's not forgotten. --Jerzy(t) 04:11, 2004 Sep 1 (UTC)
Oct 8 7 [blush]
More LoPbN Work?
Hi, Anthony,
I think you'll have noted that i've addressed (or tried to) everything you raised about LoPbN. I'm hopeful i succeeded, and i'd like to tackle another round of such work, if you've noted anything i missed or have something new that seems like a good LoPbN task to set me.
On the other hand, i've still not, as i intended, checked my "Template Argus" against the hierarchical index at the bottom of the LoPbN page and against the list of pages you generated, to bolster your confidence of having found the whole tree. But i'm not worried much about how quickly i get to that; most of the pages that are linked from elsewhere in the tree are hard to miss, with only the "[people] named..." pages falling short of that, and even with them it's barely an issue except for the highest-level (grandchildren of root) ones (which i think you mentioned finding, or left evidence of it), e.g. the one for people named F.
I have also newly subdivided, since we talked, a half-dozen or dozen pages, of which i can send you a list if your tree-traverser is not dynamically checking for them.
(I suppose that at some point it might also eventually be worth having a program tour the tree, checking "What links here" for "lost" pages that
- have titles likely to make them part of LoPbN,
- aren't redirects, and
- link into the tree,
- without the tree linking back to them.
Or even checking histories of tree pages & redirects for any pages that were turned temporarily or permanently into redirects, potentially w/o preserving their entries elsewhere. However, both of those are paranoia that is IMO pretty peripheral to your data base, and i'm more likely to ask you for source code that i would learn to modify, than to suggest you undertake to program for that.)
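The "lost page" check listed above is essentially set logic over a link graph. A hedged Python sketch, assuming the outgoing links of each page, the set of redirects and the set of known tree pages are already in memory (all three inputs, and the title prefix used as the "likely title" test, are hypothetical):

```python
def lost_pages(links, redirects, tree_pages, prefix="List of people by name"):
    """Pages titled like tree pages that link into the tree but are not linked from it."""
    linked_from_tree = {t for p in tree_pages for t in links.get(p, ())}
    return sorted(
        page for page, targets in links.items()
        if page.startswith(prefix)                  # title likely to belong to LoPbN
        and page not in redirects                   # not a redirect
        and page not in tree_pages                  # not already known as part of the tree
        and any(t in tree_pages for t in targets)   # links into the tree
        and page not in linked_from_tree            # but the tree does not link back
    )

root = "List of people by name"
links = {
    root: [root + ": A"],
    root + ": A": [root],
    root + ": Zz": [root],          # lost: links in, nothing links back
    root + ": Old": [root + ": A"], # a redirect, so ignored
    "Smith": [root + ": A"],        # title does not look like a tree page
}
redirects = {root + ": Old"}
tree_pages = {root, root + ": A"}

print(lost_pages(links, redirects, tree_pages))
```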
Machine-Readable Sort Keys
Coming to my main point, however, i'd like at least your advice, and ideally to coordinate design and timing with you, on a goal you may identify with.
I recall you as mentioning LoPbN and the Category facility in virtually the same breath, and i presume to hope you want your data base
- to reflect all the info either in the LoPbN tree or in Category:People and its descendants, and
- to be used to improve both of them (assuming LoPbN survives long enuf).
I also conjecture, BTW, that you hope it will be a tool for
- transferring at least names, at least from LoPbN into the Category structure.
One kind of information in LoPbN is the basis of an at least workable presentation order for people either lacking surnames (kings, popes, nobles, and medieval scholars, by and large), or known better by given name than by surname (Cher, Madonna, Prince, & Cristo are my poster-kids for them).
I've been much more aggressive in getting, e.g. Zeke Smith before Albert Smithson than i have in getting the kings of the same realm together (rather than all the Zeke Vs of different realms together) and sorting their Roman numerals into numeric rather than alphabetic order. But with machine-readable information for ordering kings and popes (other than the markup [[Zeke V the Foolish of Slobovia|Zeke, of Slobovia, 5, the Foolish]]), the order info can become portable, and a variant of your basic data-extraction routine could at the least list newly added kings who have been mislocated. And it would be reason for me to make a push to fix at least the big given-name-based-entry blocks (rather than just those that coincide with surnames of other people).
I consider Zeke, of Slobovia, 5, the Foolish completely unacceptable, for reader-unfriendliness, but my thot is to add an optional comment to the LoPbN entry syntax that we've standardized on, and (if it serves your goals) have your extractor capture the sort key, to populate a normally unused field or group of fields of your database. The comment for Zeke might read <!-- Sort Key:Zeke;GM;Slobovia;5 --> (with the GM indicating Given-name variant and Monarch subvariant); the semicolons between fields allow for using commas to delimit a few seldom needed subfields.
The choices for location that appeal to me are
- immediately before the link markup,
- between it and the comma that precedes the dates or description, and
- between that comma and the dates or description.
(But i expect your reasons for preferring some location would be more compelling than mine.)
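A minimal sketch of how an extractor might capture such a key, assuming the comment syntax proposed above and exactly four semicolon-separated fields. The names `SortKey` and `extract_sort_key` and the sample entries are illustrative only; they are not taken from anyone's actual scripts. Because the regular expression searches the whole line, the sketch works for any of the three candidate locations.

```python
import re
from typing import NamedTuple, Optional

# Assumed comment syntax, as proposed in the discussion:
#   <!-- Sort Key:Zeke;GM;Slobovia;5 -->
SORT_KEY_RE = re.compile(r"<!--\s*Sort Key:\s*(.*?)\s*-->")


class SortKey(NamedTuple):
    name: str
    variant: str   # e.g. "GM" = Given-name variant, Monarch subvariant
    realm: str
    ordinal: int   # regnal number as an integer, so 5 sorts before 10


def extract_sort_key(entry_line: str) -> Optional[SortKey]:
    """Return the sort key found anywhere in an entry line, or None."""
    match = SORT_KEY_RE.search(entry_line)
    if match is None:
        return None
    fields = [field.strip() for field in match.group(1).split(";")]
    # This sketch accepts only the four-field form with a numeric last field.
    if len(fields) != 4 or not fields[3].isdigit():
        return None
    name, variant, realm, ordinal = fields
    return SortKey(name, variant, realm, int(ordinal))


# Made-up entries, with the comment between the link and the comma.
entries = [
    "* [[Zeke X of Slobovia]]<!-- Sort Key:Zeke;GM;Slobovia;10 -->, king",
    "* [[Zeke V the Foolish of Slobovia]]<!-- Sort Key:Zeke;GM;Slobovia;5 -->, king",
    "* [[Zeke II of Ruritania]]<!-- Sort Key:Zeke;GM;Ruritania;2 -->, king",
]

# Group kings of the same realm together and order them numerically.
keyed = [(extract_sort_key(line), line) for line in entries]
ordered = [
    line
    for key, line in sorted(
        keyed, key=lambda pair: (pair[0].name, pair[0].realm, pair[0].ordinal)
    )
]
```

Storing the regnal number as an integer is what gets 5 ahead of 10; sorting the Roman numerals as text would put "V" after "X" only by accident of the alphabet.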
I'd be grateful for your reaction to the concept, and any thoughts you have on timing.
--Jerzy(t) 06:14, 2004 Oct 7 (UTC)
Response (is it really Oct 8 there?)
Touring
I'd like to be able to tour the tree. My biggest problem with that right now is I don't have any code to deal with templates. If it were just one level of templates, I'd be home free, but I need to somehow recursively descend multiple levels of templates, and somehow make sure I'm using consistent versions in the first place (possibly have to either check Wikipedia every time (easy but bad), or write a script to somehow invalidate templates included by an article when I download a new copy of that article (harder, but more correct)).
The solution to the first problem, dealing with templates in the first place, is forthcoming. But I've been more concerned with getting a parse tree first. I'm too worried about dealing with all the complications, like <nowiki>{{whatever}}</nowiki> or even {{wh<nowiki>at</nowiki>ever}}. I've spoken to one of the developers on IRC, who is working on a standalone parser, in C. That would resolve the problem so that I don't have to rely on understanding and porting mediawiki code into my scripts, which have been altered in other ways that make them utterly incompatible with the unported code. But whether that C parser gets done before I've deciphered some of the trickier wikisyntax, I'm not sure. It's more likely a longer term solution, for when mediawiki adds some brand new syntax again (like the table syntax that was added several months ago).
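For illustration only, the recursive descent could look roughly like the sketch below. It assumes the template texts are already held locally in a dictionary, and it handles only the bare `{{Name}}` form: no parameters, and none of the nowiki complications mentioned above. The function name `expand` and the sample template bodies are made up for the example.

```python
import re
from typing import Dict, Tuple

# Matches only the simplest transclusion, {{Name}}: no pipes, no nested braces.
TRANSCLUSION_RE = re.compile(r"\{\{([^{}|]+)\}\}")


def expand(text: str, templates: Dict[str, str], stack: Tuple[str, ...] = ()) -> str:
    """Recursively replace {{Name}} with the stored text of template Name.

    `stack` holds the templates currently being expanded, so a template that
    (directly or indirectly) transcludes itself is left untouched instead of
    recursing forever. Unknown templates are also left as written.
    """

    def substitute(match):
        name = match.group(1).strip()
        if name in stack or name not in templates:
            return match.group(0)
        return expand(templates[name], templates, stack + (name,))

    return TRANSCLUSION_RE.sub(substitute, text)


# Made-up template bodies, two levels deep.
templates = {
    "List of people E": "{{List of people-Top}}\nE index links",
    "List of people-Top": "[[Category:Lists of people by name]]",
}
page = "{{List of people E}}\n* some entry"
expanded = expand(page, templates)
```

Keeping every template in one local dictionary also speaks to the consistency worry: all pages are expanded against the same snapshot, at the cost of refreshing that snapshot whenever a template changes.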
So, yes, touring is something I'd like to do, but it's not interesting enough that I'm willing to design code that's only going to be useful in LoPbN. It's going to have to wait until I have some sort of real parser for wikitext (even the one in Mediawiki isn't really a parser, it's just a non-validating wikitext->html converter). (alternatively I could run the wikitext->html conversion and then parse that, but this would be even less useful for my immediate goals).
For now, I'm content to just get a list of all articles which begin with "List of people..." and do a quick manual scan of the diff. That's how I found those missing ones I've already found.
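The scan described here amounts to a set difference. A minimal sketch, assuming the article titles are already available as plain strings (the function name `new_list_pages` and the sample data are illustrative, not from the actual scripts):

```python
from typing import Iterable, List, Set

PREFIX = "List of people"


def new_list_pages(all_titles: Iterable[str], known: Set[str]) -> List[str]:
    """Return titles starting with PREFIX that are not yet in `known`, sorted."""
    candidates = {title for title in all_titles if title.startswith(PREFIX)}
    return sorted(candidates - known)


# Sample data only; a real run would read titles from the local mirror.
all_titles = [
    "List of people by name: Db-Dd",
    "List of people by name: Ei-Ej",
    "List of people named Bacon",
    "Cher",
]
known = {"List of people by name: Db-Dd"}
missing = new_list_pages(all_titles, known)
```

The result still needs a manual look, since a title can match the prefix without belonging to the tree.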
Machine-Readable Sort Keys
Although I haven't done any actual sorting with it, I've always assumed that the second part of the pipe link was essentially the sort key. I'm not sure if it's better to explicitly write a sort key or just to write some rules for certain situations. Presumably Zeke is going to have "king" somewhere in his descriptive text (the last section of the LoPbN entry), so in theory sort key information would be extraneous.
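A rough sketch of treating the second part of the pipe link as the sort key, assuming each entry contains one ordinary `[[target|label]]` link and falling back to the target when there is no pipe. The function name `link_sort_key` and the sample entries are invented for illustration.

```python
import re
from typing import Optional

# [[target]] or [[target|label]]; the label is the part after the pipe.
PIPED_LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def link_sort_key(entry_line: str) -> Optional[str]:
    """Return the pipe label (or the target if there is none), case-folded."""
    match = PIPED_LINK_RE.search(entry_line)
    if match is None:
        return None
    target, label = match.group(1), match.group(2)
    return (label if label else target).strip().casefold()


# Invented entries in roughly the LoPbN shape.
entries = [
    "* [[Albert Smithson|Smithson, Albert]], (born 1901), painter",
    "* [[Zeke Smith|Smith, Zeke]], (born 1950), singer",
    "* [[Cher]], singer",
]
ordered = sorted(entries, key=link_sort_key)
```

With inverted labels, a plain string comparison already puts "Smith, Zeke" ahead of "Smithson, Albert", because the comma sorts before any letter.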
On the other hand, there's going to need to be a sort key for the category, and yes, in theory the LoPbN data should be moved to the category system (but I'm not sure consensus on doing this by bot can be reached).
Also, yes, I wanted to use the category system to get this data. But I'm not sure there is enough consensus that the category system is going to be used in a way that facilitates this. What I'm talking about is the adding of things like Category:Positions of Authority and Category:lists of people to the people hierarchy. I can't find too many examples right now, but I know they're there, and it's just too hard for me to babysit the people hierarchy to keep removing them (even with automated help). Of course LoPbN probably has this to a lesser extent, but it's mostly in the form of redirects, which can be handled semi-manually.
In any case, if you'd like to add the sort key information (it certainly wouldn't hurt), between the name and the comma that precedes the dates or description would probably make the most sense. Before everything would make this particular parsing easier, but I'm not convinced that the wikisyntax intentionally allows you to place comments before the *. In any case, I've added comment-finding code to my scripts, so this will be no problem for me no matter where it is. I just see it as most logically placed with the name, since it's essentially a derivative of the name.
Well, I'm not sure I've helped very much, more gone off on rants about this and that, but hopefully you were able to gain something useful from this :).
anthony (see warning) 14:53, 7 Oct 2004 (UTC)
Oct 7 from Jerzy
No, i was trying to use UTC date, but my two-time-zone watch stopped when i left it out in freezing temperatures, & resetting it may be when i befuddled it; the mislabeled msg was about the first time after that that i cared about dates vs day of week.
Thanks; sorry things aren't easier in that work, but glad to have a better grasp of the situation than i did. I will give you a list of the new templates that list the new pages, Monday at the latest, and continue to update you one by one. In fact maybe i'll start a log of new pages & their dates in one place.... Or a spreadsheet-based data-base of parent pages that'll generate Wiki markup to replace my existing wiki pages....
As to using the pipe portion of the link for sorting, that is exactly what i've used, for people listed under surnames when they are not identical. And in a post in the next week or so on the LoPbN talk page, i'm including more details about using dates to resolve identical names.
Finally, BTW: while the polarity of your vote or your failure to vote would be irrelevant to our collaboration, you may have an interest in Wikipedia:Votes for deletion/List of people by name: Db-Dd.
--Jerzy(t) 22:32, 2004 Oct 7 (UTC)
Oct 11 fr/J
Plz see either (or both)
or
re the comment i inserted.
Please change it to suit your own convenience.
I still intend to get you your list in the next 6 hours.
More of Same
Well! I've got bad news. After making an enhancement to my Argus (more on that soon), and checking it in detail against the hierarchical index on the LoPbN root page, i began comparing the latter to the Category list for Category:Lists of people by name. For A-D (down to the full depth of the tree), everything matches. But trouble starts in the Es.
I found that the combos from Ee thru Ek (and others) are tagged for the Cat., but not yet reflected on the Cat's page. The behavior reminded me of something i have observed with Cats, outside of the LoPbN tree, as follows:
- Typically, i am subordinating a category, e.g. making Category:Diplomats a child of Category:Politicians instead of a child of Category:People. I change the Cat tags on a bunch of pages, saving each one. I force a reload of both the old and new parent Cats. IIRC, the new parent reflects the changes immediately, but the old one lags for a while. IIRC, waiting overnight, or editing the old parent Cat, brings it up to date as well.
I infer that
- displaying the Cat page causes retrieval of an essentially cached list;
- saving an edit that adds a tag causes immediate update of the corresponding list;
- saving an edit that removes a tag (rarer and arguably less urgent) causes that event to go on a list, for that Cat, of changes waiting to be made to the Cat's cached list;
- a low-priority process does those changes asynchronously;
- editing the Cat causes those changes to be done immediately.
I conjecture further that
- the adding and low-priority updaters don't fully expand the templates in order to determine whether to update the cached list (in effect assuming that sufficiently deeply nested ones haven't changed), but
- the editing-the-Cat one checks something that expands nested templates at least two deep.
My best reason for seeing that connection is that i edited List of people by name: Ei-Ej, immediately refreshed Category:Lists of people by name, and Ei-Ej then showed up on the Cat list. (But it now occurs to me that the smallest visible change that came to mind also changed the representation of the template's title, so it's possible that appearing to change which template was transcluded may have been the trick, rather than just an edit. Even tho any edit was enough when no transclusion was involved.)
I see you've been curious enough at least to dive into editing Template:List of people. (I didn't realize, until now, that it had stayed or become part of the mechanism, when i set up the nested transclusion scheme, and others, i think with bots, applied it to most of the tree. I see now the un-subdivided pages with a single letter after the colon in their titles were treated as exceptions.) You may realize that most LoPbN pages get into the Cat in question by transcluding something like Template:List of people E, which transcludes Template:List of people-Top. That includes the most significant code that the majority (as i say, i thot all) of LoPbN pages automatically include or transclude, namely the Cat tag.
Your insertion of the Cat tag into Template:List of people is on the right track. (Unless we were to go for stylistic consistency: we can't put the box-making code into the nested templates, yet, bcz the combo of it and nested transclusion is still unimplemented. But i prefer having Template:List of people holding out the prospect of the attractive and clean boxes, as a reminder that we want and would use the full combination of boxes via nested transclusion.) But, probably for the same sort of reason as with Ei-Ej, List of people by name: X still is not listed on the Cat page, and there may be a few more.
You're in a better position to say how much of a problem this is. A presumed temporary fix is to have a bot edit each page, or just the pages we know are missing. In the long term presumably the developers will tackle it if asked. But if the Cat doesn't buy you enough to be worth the grief, we can custom-build you a list in a convenient format, generating it from essentially the hierarchical list in the LoPbN root or from my Argus.
In the meantime, i'm pretty sure i've IDed all the new pages since we started discussing this, and i'm about to massage that into what may be a more welcome format.
--Jerzy(t) 07:14, 2004 Oct 12 (UTC)
Finally, the lists of new pages
I'd like these to be prettier, but certainly not today.
List of Templates for Them
A - E
- Template:List of people Con Links - '04 Oct 08
- Template:List of people Da Links - '04 Oct 06
F - J
- Template:List of people Fr Links - '04 Sep 20
- Template:List of people Ga Links - '04 Sep 02
- Template:List of people Go Links - '04 Sep 02
- Template:List of people Gr Links - '04 Sep 03
- Template:List of people Hu Links - '04 Sep 20
J - P
- Template:List of people Li Links - '04 Sep 20
- Template:List of people Mar Links - '04 Sep 21
- Template:List of people Me Links - '04 Sep 21
- Template:List of people Pe Links - '04 Sep 22
Q - U
- Template:List of people Sa Links - '04 Aug 03 (x2)
- Template:List of people Sch Links - '04 Jun 28 & '04 Sep 20
- Template:List of people Sh Links - '04 Sep 22
- Template:List of people Th Links - '04 Sep 12
V - Z
- Template:List of people Va Links - '04 Sep 23
- Template:List of people Van Links - '04 Sep 23
- Template:List of people Wa Links - '04 Aug 04
Template Transclusions that'll List the pages
=== A - E ===
{{subst:List of people Con Links}} - '04 Oct 08
{{subst:List of people Da Links}} - '04 Oct 06
=== F - J ===
{{subst:List of people Fr Links}} - '04 Sep 20
{{subst:List of people Ga Links}} - '04 Sep 02
{{subst:List of people Go Links}} - '04 Sep 02
{{subst:List of people Gr Links}} - '04 Sep 03
{{subst:List of people Hu Links}} - '04 Sep 20
=== J - P ===
{{subst:List of people Li Links}} - '04 Sep 20
{{subst:List of people Mar Links}} - '04 Sep 21
{{subst:List of people Me Links}} - '04 Sep 21
{{subst:List of people Pe Links}} - '04 Sep 22
=== Q - U ===
{{subst:List of people Sa Links}} - '04 Aug 03 (x2)
{{subst:List of people Sch Links}} - '04 Jun 28 & '04 Sep 20
{{subst:List of people Sh Links}} - '04 Sep 22
{{subst:List of people Th Links}} - '04 Sep 12
=== V - Z ===
{{subst:List of people Va Links}} - '04 Sep 23
{{subst:List of people Van Links}} - '04 Sep 23
{{subst:List of people Wa Links}} - '04 Aug 04
All the New Pages
A - E
Cona-Conr | Cons | Cont-Conz - '04 Oct 08
Daa-Dam | Dan | Dao-Dau | Dav | Daw-Daz - '04 Oct 06
F - J
Fra-Frd | Fre | Frf-Frz - '04 Sep 20
Gaa | Gab | Gac-Gak | Gal | Gam-Gaq | Gar | Gas-Gaz - '04 Sep 02
Name Go | Goa-Gol | Gom-Goq | Gor | Gos-Goz - '04 Sep 02
Gra | Grb-Grd | Gre | Grf-Grh | Gri | Grj-Grz - '04 Sep 03
Name Hu | Hua-Hum | Hun-Huz - '04 Sep 20
J - P
Name Li | Lia-Lim | Lin | Lio-Liz - '04 Sep 20
Mara-Marh | Mari-Marr | Mars | Mart | Maru-Marz - '04 Sep 21
Mea-Meq | Mer | Mes-Mez - '04 Sep 21
Pea-Peq | Per | Pes | Pet | Peu-Pez - '04 Sep 22
Q - U
Saa-Sai | Saj-Sam | San | Sao-Sau | Sav-Saz - '04 Aug 03 (x2)
Scha-Schh | Schi-Schl | Schm | Schn-Scho | Schp-Schz - '04 Jun 28 & '04 Sep 20
Sha | Shb-Shd | She | Shf-Shz - '04 Sep 22
Tha-Thd | The | Thf-Thn | Tho | Thp-Thz - '04 Sep 12
V - Z
Vaa-Vak | Val | Vam | Van | Vao-Vaz - '04 Sep 23
Name Van | Prefix Van | Vana-Vanz - '04 Sep 23
Waa-Wak | Wal | Wam-Waq | War | Was-Waz - '04 Aug 04
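If it helps with machine reading, a row of the form `Cona-Conr | Cons | Cont-Conz - '04 Oct 08` can be split mechanically. This is a minimal sketch, assuming the last " - " (with spaces) separates the page names from the date note and that "|" separates the pages. The function name `parse_row` is illustrative, and the date note is kept as raw text rather than parsed.

```python
from typing import List, Tuple


def parse_row(row: str) -> Tuple[List[str], str]:
    """Split one row into (page name fragments, date note)."""
    pages_part, separator, note = row.rpartition(" - ")
    if not separator:
        # No " - " found: treat the whole row as page names with no note.
        pages_part, note = row, ""
    pages = [page.strip() for page in pages_part.split("|") if page.strip()]
    return pages, note.strip()


pages, note = parse_row("Name Hu | Hua-Hum |Hun-Huz - '04 Sep 20")
```

The hyphens inside ranges such as Hua-Hum carry no surrounding spaces, which is why splitting on " - " does not break them apart.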
That's all for now. G'nite.
-- (belated sig:) Jerzy(t)
Oct 13 J>A
4 more new pages
Template:List of people Bar Links, added '04 Oct 13.
--Jerzy(t) 18:52, 2004 Oct 13 (UTC)
Standard Titles
You'll notice that List of people by name: Named Bar intentionally resumes my previous naming practice, in contrast to my earlier equanimity about your converting some to more natural wording. My epiphany came upon inspecting Category:Lists of people by name and finding List of people named Bacon, which duplicates all of our listings, in a different order and not inverted, and lacks the links to the LoPbN tree at the top. I doubt LoPbN will ever need that particular title: that would mean breaking up Baa-Baj, then Bac, then Baco, then Bacon. But the principle is clear, IMO:
- such pages have a better claim on the name than LoPbN does,
- renaming the existing page List of people by name: People named Li into that title-format would be folly, and
- List of people named Ho is a ticking bomb awaiting someone who wants to
- list them grouped by surname-ideogram (aren't there four Chinese tonal variations for Ho?) instead of alpha by given name, or
- have sections for Chinese, Viet, and Korean ancestry, or
- omit the surname from each entry, or
- just discard the LoPbN links.
Do you have other ideas besides eventually going back to my old approach?
--Jerzy(t) 18:52, 2004 Oct 13 (UTC)
Don't be a Stranger
Don't mean to rush you, but at some point, seeing you editing but not commenting, i'll grow discouraged, and nag or stop writing here. Even an "ACK" on this page would make a diff.
--Jerzy(t) 18:52, 2004 Oct 13 (UTC)