Older texts often use forms of letters which are no longer in use and which may be unfamiliar or ambiguous. The following is a list of difficult letterforms and how we encode them. A help sheet with images of all of the letters discussed below is in the red binder.
Long s and long s ligatures
The old-style long s in roman type looks like an f without the crossbar, or with a very short crossbar extending only to the left. In italics it looks like an integral sign. In both cases it is encoded with the entity &s;. A long s ligatured to a normal s looks somewhat like a distorted upper-case italic B, or like the German s-zed ligature (ß). We do not encode this as a special character; we transcribe it as a long s (&s;) followed by a short s. We do not encode any ligatured characters (st, sc, sf, etc.) using entity references; we simply transcribe the two characters involved as ordinary characters.
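As an illustration of the rule above, here is a minimal Python sketch. It assumes a working transcription held as Unicode text, in which the long s appears as U+017F and ligatures appear as Unicode presentation forms. The function name encode_long_s and the idea of a Unicode working copy are assumptions made for this example, not part of our documented workflow.

```python
# Hypothetical helper for illustration only; not part of any project toolchain.
LONG_S = "\u017f"        # LATIN SMALL LETTER LONG S
LONG_S_ENTITY = "&s;"    # the entity used for long s in our encoding

# Unicode ligature characters, split back into their component letters.
LIGATURES = {
    "\ufb05": LONG_S + "t",  # LATIN SMALL LIGATURE LONG S T
    "\ufb06": "st",          # LATIN SMALL LIGATURE ST
}

def encode_long_s(text: str) -> str:
    """Split ligatures into ordinary characters, then encode each long s as &s;."""
    for ligature, parts in LIGATURES.items():
        text = text.replace(ligature, parts)
    return text.replace(LONG_S, LONG_S_ENTITY)
```

A long s followed by a short s simply comes out as &s; followed by s, with no special character for the ligature: encode_long_s("pa\u017fsion") gives "pa&s;sion".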
I and J
In the lower-case, the difference between i and j is clear and unambiguous. In upper-case italics, there is often only one letter-form, which looks like a long J with a cross-bar in the middle. This is (despite appearances) an I, and should be encoded as such (with <orig> as necessary). Some texts in which this character appears also contain instances of the more familiar italic capital I, which looks like a roman I but with a slant. In such texts both characters should be encoded with I, using <orig> as necessary. This seems counterintuitive, but is necessary to preserve consistency across the textbase.
Similarly, in blackletter, the character which looks like a capital I with a slight lefthand curve to the bottom and a crossbar in the middle is an I and should be encoded using I (with <orig> as necessary).
U and V
There are several forms of the lower-case letter v: with a pointed bottom, with a rounded bottom, and with a decorative swash. There are also several forms of the upper-case letter V: with a pointed bottom or with a rounded bottom. All of these should be encoded as v or V, with <orig> as necessary. They can be distinguished from u and U, which have a tail on the right-hand side.
It is important to maintain consistency from text to text in how we encode a given letter-form. Thus in texts which have two forms of v, but no u, both forms of v should still be encoded as v, rather than assigning a u to one and a v to the other. The WWP does not preserve different forms of the same letter in other cases (for instance, curly versus rectilinear upper-case E) and hence there is no reason to make an exception for v.
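The consistency rule can be pictured as a fixed lookup from letter-form to encoded character, one that does not change with the inventory of letters in a particular text. The following Python sketch is only an illustration: the labels (v_pointed, u_tailed, and so on) and the function encode_letterform are hypothetical names invented here, not part of our encoding scheme.

```python
# Hypothetical labels for the letter-forms described above; illustrative only.
ENCODING = {
    "v_pointed": "v",
    "v_rounded": "v",
    "v_swash": "v",
    "V_pointed": "V",
    "V_rounded": "V",
    "u_tailed": "u",   # tail on the right-hand side
    "U_tailed": "U",
}

def encode_letterform(label: str) -> str:
    """Return the encoded character for a letter-form, regardless of the text it occurs in."""
    return ENCODING[label]
```

Because the table is the same for every text, a text with two forms of v and no u still yields v for both forms.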