Thursday, February 9, 2017

Recommended fixes for MacDonald P. Jackson's flawed methodology

As I hoped to communicate in the title (the "please try again" part) and introductory paragraphs of my holiday blog-review, I relish the prospect of getting new and interesting statistical data regarding the authorship of "A Visit from St. Nicholas" aka "'Twas the Night Before Christmas." And as I've done my share of special pleading on other authorship questions ("Amen" sayeth the ishmailites), and personally made lots of errors everywhere, I was serious about the value of learning "from our mistakes" and the merits of devising an improved method of gathering and presenting statistical evidence. In the case of "Visit" I honestly don't need any particular outcome. It's just that the ample material evidence for Moore's authorship (even before my discovery of his emphatic and unambiguous 1844 claim) justified and still justifies an educated guess or reasonable expectation that statistical results on the whole will not be inconsistent with Moore's authorship. If internal evidence does look inconsistent with Moore's authorship, then we might have something interesting to talk about and perhaps even debate. For it to be scholarly and helpful, that debate would have to acknowledge somewhere that anomalies can happen without necessarily bearing on authorship, and that probabilities are not the same things as facts or events.

When I take my vehicle in for service the good folks at Luther Brookdale Honda run computerized tests that identify mechanical problems. Believe me, you don't want to break down in a blizzard on Highway 65. In that same spirit of helpfulness, I offer these fixes:
  1. Re-define the databases. More specifically:
    a) enlarge the Moore database to include all of Clement C. Moore's known poetry, unpublished as well as published. I know I said this before, but Jackson's excuse (in his Response to Scott Norsworthy) for his selective exclusion of poems by Moore fails to appreciate the irony of his making aesthetic judgments about what counts as original poetry while almost in the same breath making such a virtue of avoiding subjectivity. Verse "translations" take creative work to achieve. How much creative work depends on the poet, obviously. Also, I'm not yet convinced that usages of articles and other function words are "unconscious" reflexes as Jackson has claimed. If they are, then all the more reason for Jackson to count everything, even translations. Afterthought: this touches on an interesting authorship quandary. If we're after the authorial DNA or thumbprint, why should supposedly unconscious usages (assuming they exist, and also assuming they could accurately be identified as such) be more valuable markers than conscious ones?  Doubtless writing, like most human endeavors, involves a more or less mysterious combination of conscious and unconscious factors. It could be that conscious creative preferences reveal as much or more about a person's unique writing style than mechanical, "unconscious" reflexes. Then, too, the poets I know tend to agonize over the littlest things. If Wilde could spend all day on a comma....
    Anyhow, please note: I have no reason to expect that counting translations will in any way "improve" Moore's numbers. (Well, I had no reason at first, though later on I did stumble on another instance of "brain" in Moore's banned Prometheus translation.) Mainly I want the statistics to be as useful as possible, and to mean something. Jackson's subjective exclusion of translations is a relatively minor issue compared to his sad failure to include "Biography of the heart of Clement C. Moore." Excluding "Biography" barred one hugely telling phrase from any statistical consideration: "the lustre of the day" which suggestively resembles Moore's expression "the lustre of mid-day" in "Visit." The Museum of the City of New York has been unfailingly helpful to researchers. Now that others have done the work of transcribing for him, Jackson can easily include all of "Biography" in his improved database of known Moore poems.
    b) re-scrutinize the database of Henry Livingston's poetry and exclude poems that are not demonstrably his. More work needs to be done on the authorship of all those fascinating Carriers' Addresses. Only the 1787 Carrier's Address appears in Livingston's manuscript book. Given the natural and entirely understandable biases involved, attributions by Gertrude Thomas should be regarded as unsubstantiated claims, not evidentiary proof, of Henry Livingston's authorship. In the absence of more persuasive evidence, the 1803 and 1819 Addresses ought to be excluded from the database of known Livingston poems.
  2. Clean up the re-constituted databases by locating and correcting numerous (how numerous, I don't know) errors of transcription. So far, my hero Mary Van Deusen has done all the work around here. Besides her outstanding research and writing on family history, Mary Van Deusen has made unquestionably valuable and enduring contributions to scholarship on Henry Livingston, Jr. and Clement C. Moore. Now English Professors or their hungry graduate students can help out by finding and fixing the inevitable mistakes of transcription. Random examples: In Beekman Livingston clearly wrote "sempstress" but MVD and MPJ report "seamstress." Moore wrote "The silent snow had clad the ground" but in her online transcription of Moore's Lines Written After a Snow-storm, MVD gives Moore's word "clad" as "glad." See how one simple mistake of transcription = two statistical errors, resulting in one extra "glad" and one less "clad." To clean up this scary problem, Jackson definitely needs to go over every letter of every word of every transcription by Mary Van Deusen. Capitalization and punctuation marks need to be verified, too--especially since Jackson counts "The" and "the" as different words. Which reminds me, I need help understanding why we need to distinguish "And" from "and" and "He" from "he." There must be a good reason, I'm sure, so I won't make combining them a recommended fix. Yet.

  3. Confront and transparently deal with the problem of different versions. I can't think of an easy fix here, but the difficulties should be recognized and discussed. In the first 1824 printing of Moore's Lines Written after a Snow-Storm, the Troy Sentinel printed "ivy bowers," which is probably a copyist's or printer's error for "icy bowers." Should "ivy" be counted anyhow? Maybe not, but then why count "Dunder" and "Blixem," later revised by Moore to "Donder" and "Blitzen"? If we're determined to count all words printed in early (or say, first) newspaper versions, then "ivy" absolutely does belong, as do other words that would need to be identified after painstaking study of first printings in, for example, the New York Evening Post and A New Translation with Notes of the Third Satire of Juvenal. On the important subject of versions, I heartily recommend John Bryant's groundbreaking study The Fluid Text.

  4. Test for everything. If you want reliable counts of function words and phonemes, fine. But count everything, including all nouns, verbs, adjectives and adverbs. After testing, present the figures for nouns, verbs, adjectives, and adverbs so that others can independently view and evaluate all the internal evidence. Jackson (in his book and recent Response) anticipates endless arguing about the significance of content and context-dependent words, but reliable frequency counts, statistically adjusted for the relatively smaller Livingston corpus, could provide a common ground and mutually acceptable boundaries. Reliable counts and frequency rates potentially could give us a solid statistical foundation for further discussion and analysis of all the internal evidence, including nouns, verbs, adjectives and adverbs. What's not to like about that?
  5. When testing for everything, combine Moore's published and unpublished poems in one database. Then present results for the whole database. Informative presentations on Mary Van Deusen's website have moved already in the direction I'm urging here and in #4 above, by giving raw counts of all words in all poems. But the misleading category of Word Frequencies in Divers Poems perpetuates maybe the worst flaw (besides omitting "Biography of the heart of Clement C. Moore") in Jackson's method, which is to isolate some unpublished manuscript poems, effectively (in some places) ignoring them. Notice, the figures for Moore in dark blue only reflect usages in published poems. For example, take the word brains as represented in this table of Word Frequencies. The numbers for "brain" and "brains" offer good evidence of something, but to that dark blue "9" we need to add eight more sky blue instances of brain, singular, from Moore's unpublished poems, as catalogued in the earlier melvilliana post Settle your brains. (The forbidden Prometheus translation supplies yet another instance of "brain.") Notice that Mary Van Deusen's tabulation here presents numbers for Henry+ (comprising verses Henry Livingston may not have written) while leaving out the category of unpublished poems unquestionably by Moore. The absence of Moore's unpublished poems from this key table raises the fundamental question of what database Jackson used to determine his lists of "high frequency" and "medium frequency" words. Jackson's confusing presentation makes this kind of hard to see, but it looks like he began working exclusively with Moore's published poems and only belatedly tackled the unpublished ones, after running numerous tests already, including the crucial determination of "high frequency" and "medium frequency" words. If I'm wrong about this I welcome (here as everywhere) correction and enlightenment.
If only derived from Moore's published poems, those essential counts are flawed and seriously inadequate. For maximum utility, the figures for high and medium frequency words would need to be re-calculated using our new and improved and combined database of published and unpublished poems (using corrected transcripts as suggested above), including "Biography of the heart of Clement C. Moore," including also (perhaps) at least some texts of early printed versions. Off the bat, just for instance, imagine what the inclusion of Moore's long narrative poem Charles Elphinstone (corrected or not) would do to Jackson's counts of supposedly distinctive Henry-favored words, especially the "high frequency" pronouns "her" and "his"; and the medium frequency pronouns "He," "She" and "His."
After writing the above, I looked more closely at Mary Van Deusen's table of figures for high frequency words. Now I see that her table actually does give counts for thirteen of Moore's manuscript poems. Not Moore's "Biography," of course, AND NOT CHARLES ELPHINSTONE. Likewise, it's the fascinatingly ambitious CHARLES ELPHINSTONE that is most conspicuously absent from the database that produced this table for medium frequency words. Looking again at the recent Response by MacDonald P. Jackson, I see he classifies the "massive" Elphinstone (guess we'd better keep quiet about Melville's Clarel) as
"one of the manuscript poems of 1843–52 that were held back for testing the tests and itself datable to 1851."
Alright then, there's your problem. Granted, it was very wrong of Clement C. Moore in 1851 to be composing long versified allegories of spiritual temptation and victory, however imaginative and potentially revealing. But remember, we're professionals here. Counting is the imperative thing. Aesthetic judgements can come later. Sometimes the dirty work of authorship attribution means analyzing such hopeless failures, no matter how "massive." As scholars we have to hold our noses and add Moore's Charles Elphinstone to the mix. So go ahead, add Elphinstone, along with Biography of the heart of Clement C. Moore and whatever else from Moore's poems, published or unpublished, might have been too-conveniently "held back for testing the tests." Fix all five problems identified here in the Melvilliana shop, and some day we might get somewhere. Then again, no amount of methodological fine tuning will help if we don't recognize compelling historical evidence of authorship when it smacks us upside the head--but that's another department.
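
For anybody who wants to tinker, here is a minimal sketch in Python of the kind of counting urged in fixes 2, 4 and 5 above. To be clear, this is not Jackson's procedure or Mary Van Deusen's. The function names, the crude tokenizing rule, and the per-1,000-words scaling are all my own assumptions for illustration, and the two sample lines are just the phrases quoted earlier in this post, not a real database. It shows how one mistaken letter yields two count errors, what happens to "The" and "the" when case is folded or not, and how a rate can be figured on a pooled set of texts.

```python
import re
from collections import Counter


def tokenize(text, fold_case=False):
    """Split text into word tokens (a crude rule, assumed for illustration).

    With fold_case=True, "The" and "the" are counted as the same word.
    """
    if fold_case:
        text = text.lower()
    return re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)


def count_words(texts, fold_case=False):
    """Count every word across a list of texts pooled into one database."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text, fold_case))
    return counts


def rate_per_thousand(counts, word):
    """Occurrences of `word` per 1,000 words, so unequal corpora can be compared."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0


def count_differences(a, b):
    """Words whose counts differ between two tallies: word -> (b minus a)."""
    return {w: b[w] - a[w] for w in sorted(set(a) | set(b)) if a[w] != b[w]}


# The line as Moore wrote it, and as mis-transcribed (both quoted in fix 2).
correct = "The silent snow had clad the ground"
mistranscribed = "The silent snow had glad the ground"

# One wrong letter gives two statistical errors: one lost "clad", one extra "glad".
diff = count_differences(count_words([correct]), count_words([mistranscribed]))
print(diff)

# Case kept apart versus case folded.
print(count_words([correct])["the"], count_words([correct], fold_case=True)["the"])

# Toy stand-ins for a published line and an unpublished one, pooled together.
pooled = count_words([correct, "the lustre of the day"], fold_case=True)
print(round(rate_per_thousand(pooled, "the"), 1))
```

Whether case should be folded is exactly the question I raised under fix 2, so the sketch leaves it as a switch rather than deciding the matter. The same goes for pooling: hand `count_words` whatever list of texts you think belongs in the database and the rates follow from that choice.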

