Regularization: silent

All transcription, even the most scrupulously self-aware, involves some degree of silent regularization: of minute differences in word spacing, of information about line height, of differences in type size, and of other factors so minor that no one would consider them informationally significant. These kinds of regularization do not require explicit documentation. At the other end of the spectrum, regularization of information that is highly significant to the reader should be done explicitly, using mechanisms such as the TEI orig or reg elements. (See Regularization.) In the middle falls the kind of information that can be silently regularized but whose regularization should be documented so that readers are aware of what has been done.

The boundaries between these three categories will vary from project to project. For the projects envisioned by this Guide, for which scholarly care and explicitness of methodology are important, we recommend the following guidelines. All of these practices should be listed explicitly in the project documentation.

  1. Regularize space after punctuation, between words, or between words and punctuation. Regularize the space between words to one space, the space between a word and the following punctuation to zero spaces, and the space after the end of a sentence or after a colon to one space. Regularize the spacing around em-dashes to zero. In particular, there is no need to record variations in white space within a line where this results from the tightness or looseness of the line.
  2. Ignore delimiters on page numbers and signatures (e.g. parentheses, brackets, other marks of punctuation). In some cases, characters which are ordinarily used as delimiters may be used informationally, for instance where they are used to distinguish between two separate sequences (e.g. between signature sequence A, A2, A3 and A., A.2, A.3). In such cases they are no longer delimiters and should be transcribed along with the rest of the signature.
  3. Regularize all punctuational dashes (or sequences of dashes) longer than an em-dash to a single em-dash. If you wish to preserve the fact of the extended length (though not its actual extent), you could use a special character entity (the WWP uses &sdash; or superdash). However, in cases where dashes or hyphens are used to indicate a number of missing letters (e.g. in a concealed name), they should be recorded exactly as they appear.
  4. Ignore the exact appearance of rules and ornaments: for instance, the length or weight of ruled lines, or whether they are single, double, or ornamented. For more information, see the entries on ornaments and rules.
  5. Ignore ligatures between letters such as ct, ff, fi, and similar common types. Ligatures between ae, AE, oe and OE should be retained, since they carry linguistic meaning.
  6. Ignore uneven baselines and individual letters which are displaced vertically.
  7. Do not attempt to reproduce the exact location of marginalia and marginal notes. We do not recommend attempting to indicate by any precise means (coordinates, exact positioning relative to any other feature) the location of any marginal material, printed or handwritten. We consider it sufficient to indicate its general position by the align() and place() keywords on the rend attribute. (For more information, see entries on these keywords.) The position may also be indicated implicitly, in some cases, by the position of the anchor for marginal notes.
  8. Do not record exact typeface (e.g. Baskerville, Didot) or type size (12 pt, Pica, Great Primer) unless it is useful to do so for some particular project-specific reason. Both of these kinds of information are very difficult to capture accurately, particularly size. We recommend simply recording whether the type is roman, italic, or black letter. Shifts in type size will be captured implicitly, where they indicate an element boundary, by the change in elements.
  9. Ignore variations in vertical white space, except implicitly where these are the indicator of an element boundary.
  10. Ignore special letterforms, whether in print or in handwriting. We do not record the presence of special letterforms (e.g. swash characters, alternative forms such as rounded or square E) apart from the long s.
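
Some of these practices are mechanical enough to check with a script. The following is a minimal sketch in Python, not part of any WWP tooling, covering guidelines 1, 3, and 5 for the text content of a single transcribed line. The function name regularize_line and the ligature table are hypothetical, introduced here only for illustration. The sketch assumes that dashes are entered as Unicode em-dashes (or the two-em and three-em dash characters), and it cannot tell a concealed name from an ordinary long dash, so passages where dashes stand for missing letters would have to be excluded by hand before it is applied.

```python
import re

EM_DASH = "\u2014"

# Typographic ligatures to expand (guideline 5). The ligatures ae, AE, oe, OE
# are deliberately absent, so they are retained. U+FB05 expands to long s + t,
# because the long s itself is still recorded (guideline 10).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "\u017ft",
    "\ufb06": "st",
}

# A run of em, two-em, or three-em dashes, with any surrounding white space.
DASH_RUN = re.compile(
    "\\s*[\u2014\u2e3a\u2e3b](?:\\s*[\u2014\u2e3a\u2e3b])*\\s*"
)


def regularize_line(text: str) -> str:
    """Apply guidelines 1, 3, and 5 to the text of one transcribed line.

    Assumption: the caller has already set aside passages where dashes
    represent missing letters, which must be recorded exactly as they appear.
    """
    # Guideline 5: expand ligatures that carry no linguistic meaning.
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    # Guidelines 1 and 3: one em-dash per run, with no space around it.
    text = DASH_RUN.sub(EM_DASH, text)
    # Guideline 1: one space between words and after sentence ends or colons.
    text = re.sub(r"\s+", " ", text)
    # Guideline 1: no space between a word and the following punctuation.
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text
```

The ligature table is spelled out by hand rather than relying on Unicode compatibility normalization (NFKC), because that normalization would also replace the long s with an ordinary s. The function never inserts a space where the source has none, so a signature such as A.2 passes through unchanged.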