Journal of Pragmatics 25, 1996, 503-535.
The ‘Pragmatics’ of Doing Language Science:
The ‘Warrant’ for Large-Corpus Linguistics
Robert de Beaugrande
[abstract]
The development of ‘mainstream linguistics’ in this century is briefly
retraced to suggest that the original decision to describe ‘language by itself’
as opposed to ‘language in use’ favoured formalism over functionalism and
eventually led to a severe impasse for three tests a valid science of language
ought to meet: coverage of language data, convergence among the data being
described, and consensus among linguists about how to proceed. The impact of
large-corpus linguistics might
resolve this impasse and accordingly raises the prospect of a fundamental
reorientation of linguistic theory and of the ‘pragmatics’ of ‘doing language
science’.
A. Testing the progress of ‘mainstream linguistics’
1. The term ‘mainstream linguistics’ is sometimes used
to designate a language science pursuing a programme based on several generally
agreed principles (survey in Beaugrande 1991), notably:
(a) Language is a phenomenon
distinct from other domains of human knowledge or activity.
(b) A language constitutes a system defined completely by
internal, language-based constraints.
(c) A language should be described apart from the conditions under which
speakers use it.
(d) The description of a language should be couched in
statements at a high degree of generality,
if possible about the ‘rules’ for the language as a whole or even about the
‘universals’ for all languages.
Within the programme, the tenets interlock in projecting a free-standing
and self-sufficient conception of language as a uniform, stable, and abstract
system holding still while we are describing it, and separated from the
‘rich’ and ‘messy’ human contexts where it is encountered in ordinary life.
Linguistics has thus undertaken to describe a theoretical construct of
‘language by itself’ (‘langue’, ‘competence’, etc.) situated at a safe distance
from the empirical realities of ‘language in use’ (‘parole’, ‘performance’,
etc.) during communication. Saussure’s (1966 [1916]: 9) defensive mistrust of actual
‘speech’ has proven highly influential (cf. § 17):
(1) speech is many-sided and heterogeneous [...] it
belongs both to the individual and to society; we cannot put it into any
category of human facts, for we cannot discover its unity.
2. If we imagine a ‘layer-cake’ with language as a
system mediating between a culture’s knowledge of the ‘world’ on one side and
the ‘society’ of speakers on the other, the programme of ‘mainstream’
linguistics implies detaching language and rolling away the other ‘layers’
(Fig. 1). Doing so discounts the constraints upon what people say due to what
they are talking about and to whom.

The working hypothesis is that
once detached, the language system will stand firm: complete and fully
organised by its own internal constraints (cf. § 6, 19).
3. Yet what we encounter is always ‘language in
use’; even the formal analysis of isolated sentence structures is a mode of
use, albeit a peculiar and untypical one (§ 15, 53, 57). To get from ‘use’
over to ‘language by itself’, linguistics has foregrounded data-handling
strategies, such as:
(a) collating:
a large set of data samples are compared and contrasted to distil out what they
have in common, e.g., which word types frequently occur with other types;
(b) generalising:
certain aspects of the observed data are construed to be general ones, e.g.,
that the ‘Subject-Verb-Object’ order of a sample set of English sentences is a
typical pattern for the language as a whole;
(c) rarefying:
the ‘rich’ data as they were observed in spontaneous interaction are made
‘sparse’, e.g., by disregarding the personal authority of speakers;
(d) decontextualising:
the data are taken out of the observed context and text type and treated as if
they had occurred in isolation or could occur in a wide range of contexts,
e.g., irrespective of the social status of individual speakers;
(e) introspecting:
the linguists make estimations based on their own intuitions about the
language, e.g., which sentences do or do not violate the ‘rules’;
(f) consulting
informants: native speakers are given data samples of their language and
asked to judge or rate them, e.g., to decide which of two versions of a
sentence is ‘grammatical’ or ‘ungrammatical’.
These strategies project a second working hypothesis, namely that applying
them will lead to a complete and valid description of a given natural language.
4. How might these two main working hypotheses be tested? If it is true that a language
system is fully organised by its own internal constraints, and that these strategies
can describe it as such, then the three key tests for progress would be steady
cumulative rises in (a) the coverage
of the language; (b) the convergence
among language data discovered and described; and (c) the consensus among linguists about how to formulate the description.
Yet if we apply these three ‘C-tests’
to mainstream linguistics, we find a rise only in some domains and a sharp
fluctuation in others. What ‘pragmatic’ factors for ‘doing language science’
can have led to this outcome?
5. Probably the most influential factor was
that the programme for describing ‘language by itself’ has naturally favoured formalism, the stance construing form to be the basis and framework of
language — how entities are shaped or arranged. As we strive to ‘abstract’
language out of everyday contexts, the most stable and reliable substrate
naturally appears to be the forms, the patterns of word-stems and suffixes or
the patterns of phrases and sentences. This factor discourages functionalism, the stance construing function to be the basis and framework
of language — what means are used toward which ends; functional aspects tend to
be associated with language in use. So the ‘majority position’ in ‘mainstream
linguistics’ has usually been that formalism confers high ‘scientific’ status
and that its legitimacy can be taken for granted, whereas functionalism is
‘unscientific’ or ‘pre-theoretical’ and its legitimacy must be expressly
justified. In this academic power structure, formalist research is not required
to specify its relevance and its ecological validity — whether and how
its findings contribute to a general and productive understanding of the human
situation — whereas functionalist research is expected either to struggle toward
the a priori criteria of rigour, abstractness, generality, and so on, set down
by formalism, or else to defend itself for not doing so. So functionalism has
been severely held back or has worked at cross-purposes by not following up on
its insights and by compromising its own ecological validity in order to
compete with formalism on the latter’s terrain.
6. In the long term, the development of mainstream
‘formalist’ linguistics reveals an ominous trade-off:
the more formalised a theory of language and its apparatus of terms and formal
notations, the less we can expect a steady rise on the three ‘C-tests’ of
coverage, convergence, and consensus (cf. § 13, 18, 51f). This trade-off is not
unduly surprising, given the robust fact that people, including linguists, are
naturally much less skilled at ‘formalising’ natural language data than they
are at speaking, hearing, reading, and writing them. After detaching language
from the constraints of ‘world’ and ‘society’ and discounting functions in
favour of forms, we are left with the huge task of reconstructing or inventing
the formal and ‘purely linguistic’ constraints for every sort of regularity we
may encounter (cf. § 18). The enterprise of linguistic formalism hinges on the
assumption that there exists, for each natural language, at least one set of
such constraints strictly separating what belongs to the language from what
does not. Decades of formalist research have failed to identify any such set
for any language; and the lack of progress on the three ‘C-tests’ strongly
argues that no such set exists.
Converted into a static and closed formal system, language does not stand firm, complete, and fully
organised by its own internal constraints (§ 2); instead, it tends to skid out
of control.
7. The ‘formalist’ trade-off has been somewhat obscured by the fact that
it holds in differing degrees for the various domains of language. The three
‘C-tests’ are best met in phonology and morphology, which both offer concise
methods for segmenting language data
so as to isolate and classify minimal units. Phonology has
the most clearly defined criteria in
the articulatory events and locations that characterise the sound-units called
‘phonemes’, e.g., a ‘voiced dental stop’ such as /d/ produced when the vocal
cords vibrate and the air flow is blocked by the teeth. The visual
correspondence between many phonemes and written letters of the Roman alphabet
also supports the ‘C-tests’, though it has not been made into a theoretical
principle, since the description is strictly addressed to spoken language.
Thanks to these clear criteria, linguistics soon provided descriptions of the
repertory of sound-units in language after language, covering all the phonemes
with impressively high convergence and consensus.
8. The success of phonology did much to entrench the
concept of ‘language by itself’ being a uniform, stable, and abstract system,
or rather a set of subsystems, usually called ‘levels’, each consisting of a repertory of minimal combinable elements.
A complete description of a language would be the sum of the complete
descriptions for each subsystem, supplied by linguists working within the tidy,
‘pragmatic’ division of labour reflected in academic specialisations, journals,
conference proceedings, and course offerings.
9. Yet in morphology, the criteria are already less
tidy. Convergence and consensus are fairly high for identifying and isolating
the ‘morphemes’, aided again by the visual clarity of the data written down.
The analyst segments the written data until no further meaningful subdivisions
appear feasible — a method once introduced as ‘immediate constituent analysis’,
which, if applied ‘in all observation of word-structure’, Bloomfield (1933:
209, 221) promised, would eliminate any ‘inconsistency of procedure’. His promise
rested on a staunchly formalist mandate:
(2) Any utterance can be described in terms of lexical
and grammatical forms; [and] any complex form can be fully described apart from
its meaning in terms of the immediate constituent forms and the grammatical
features [whereby these] are arranged (1933: 167).
Here, consensus is to be established by sheer stringency of method. Yet
recalcitrant problems can arise that would not trouble phonology. We can easily
reach a consensus about the human vocal apparatus; and after some simple
demonstrations, most speakers will agree that they ‘know’ the repertory of
phonemes in their native language, e.g., when they distinguish between voiced
and unvoiced consonants. But it’s harder to agree in what sense speakers ‘know’
their native language as a repertory of its minimal meaningful forms. So we are
on weaker grounds in claiming that our own consensus as linguists corresponds
directly to the consensus of speakers of the language we are describing (cf. §
74). Consider the English morphemes borrowed from French, Latin, and Greek but
no longer recognised by all contemporary monolingual speakers. Should a
morphological description include not just the more obvious ones like ‘in-’ and
‘im-’ for negation alongside ‘un-’, ‘non-’, or ‘a-’ but also the more erudite
ones like ‘pter’ (‘wing’) in ‘helicopter’, where speakers might instead
identify the final ‘-er’ as an agentive suffix (compared, say, to ‘lawnmower’)?
Doing so would oblige us to turn to language history and thus deviate from the
‘mainstream’ programme to describe language ‘synchronically’ in a single stage
of its evolution.
10. Again in contrast to phonology, morphology is
quite problematic in respect to coverage. In theory, the entire vocabulary of
the language consists of ‘morphemes’ or clusters of these; how could we list
them all? The ‘mainstream’ strategy has been to focus on the ones that form
stable, compact classes, e.g. the set of all verb inflections, versus the ones
that form unstable, open classes, e.g. the set of all verbs or verb stems. Only
for the first type could full coverage be attained, whereas the second type
could be consigned to the category of ‘lexemes’
to be described in the domain of ‘lexicology’.
11. Still, morphology has fared rather well on the
‘C-tests’ through its close engagement with the real data recorded in fieldwork
on previously undescribed languages,
which in my estimation has contributed by far the finest achievements in modern
linguistics. The fieldworker’s overall task is to progress from being an
‘outsider’ in the community of speakers over to being an ‘insider’ who can
speak the language at least well enough to interact with the community and
eventually to describe the language. The fieldworker must reach a working
consensus with the community or else expect to be misunderstood, ridiculed or
ignored. The task is richly supported by ordinary constraints from ‘world’ and
society (§ 2, 6), which always apply to real data but which may well not appear
in a stringent formal description. To maintain a consensus with the community,
the fieldworker can rely on an intuitive, perhaps unconscious grasp of such
constraints to produce ‘proper’ utterances, whether their ‘propriety’ is
‘purely linguistic’ and can be ‘formalised’ within a ‘linguistic theory’ or is
more cognitive or social.
12. The three ‘C-tests’ get shifted far more radically
in the move from morphology to syntax. Units can no longer be isolated by using
the criterion of ‘minimalness’; nor does it seem at all feasible to make a
complete repertory of syntactic patterns. So phonology and morphology were
replaced by syntax at the centre of linguistic theory, and the concept of
‘language’ itself got shifted from a ‘descriptive’
notion of a repertory of units (§ 8)
over to a ‘generative’ notion of a repertory of rules for constructing and
arranging units. Since the ‘rules’ plainly do not appear in the data, syntactic research relaxed the close engagement
with real data as established in morphology fieldwork. Instead of segmenting
the language sequences themselves, the task was to devise rules that would ‘generate’ the underlying structure of the sequences. Such a shift did not just
leave formalism intact, but actually endowed language data with an enhanced but
hypothetical formality. The effect was most striking when semantics was added
onto syntax: to maintain the detachment of language from knowledge of the
‘world’ (§ 2, 6), meanings were described as arrays of underlying forms, often
called ‘semantic features’.
13. At this point, the ‘formalist trade-off’ described
in § 6 began to grow virulent. Disengaging from real data encouraged some
influential generative linguists to turn to invented
data, whose status as part of the language was certified not by their occurrence in the actual speech of
native speakers but by intuitive approval through introspection.
The official rationalisation was that corpuses of real data are inadequate
because they are ‘finite’ and ‘accidental’ collections of utterances, whereas
speakers of a language can produce or understand many more utterances —
presumably an ‘infinite’ set of them (Chomsky 1957: 15). This rationalisation
had the labour-saving corollary that fieldwork is not very necessary or
helpful: linguists need merely elicit invented data or even — a unique
privilege among scientists — invent their own data when they are native
speakers of the language, e.g.:
(3) The man hit the ball.
(4) John is easy to please.
(5) The cat sat on the mat.
A bit paradoxically, the ‘normalness’ of such sentences can make them
seem a bit odd in comparison to what people actually say (§ 26).
14. Saving labour this way has some severe hidden
costs. When linguists were no longer in the concrete fieldwork situation of
confronting real data in an unknown language, the task of describing the
language was no longer firmly correlated with the task of reaching a working
consensus by going from an outsider to an insider in the culture (cf. § 11).
This change removes both the most tangible means of testing one’s assumptions
and the richest source of constraints from world and society. And when the task
of describing freely rides upon the describer’s prior facility and unstated
intuitions regarding the language, the linguists are already insiders before
they start their work.
15. The
impending problems were forestalled by the central formalist assumption that
there exists, for each natural language, at least one set of formal constraints
strictly delimiting what belongs to the language from what does not (§ 6). When
found, such a set would provide total coverage and lead to a formal account
both for the convergence among the structures of ‘grammatical’ or ‘well-formed’
sentences and for the consensus among the intuitions of native speakers. Since
phonology and morphology had been relegated to the sidelines (§ 12), the
constraints could be grouped under the respective headings of syntax,
semantics, and pragmatics. Each group of constraints could be identified by selectively violating it, e.g.:
(4) John is easy to please. (‘well-formed’)
(4a) *To is please John easy. (‘syntactic violation’)
(4b) ?John is easy to sneeze. (‘semantic violation’)
(4c) ?John, be eased and pleased! (‘pragmatic violation’)
(4d) ?A john is sleazy to fleece. (which violation?)
But such demonstrations entail several problems:
(a) Insofar as the examples were invented on the spot
to demonstrate the rules, they cannot be an independent validation of the rules;
and the constraints applying to the act of invention and to its peculiar and
untypical purpose — to produce a selective violation — hardly match the
constraints that apply to ordinary acts of discourse and to their practical
purposes, e.g. to justify what you are doing.
(b) The assessment of a violation depends
heavily on the ingenuity of the linguists, e.g., whether they can imagine a
situation where it would be appropriate to talk about ‘sneezing John’ (he might
be a flu microbe in a children’s story); or where John’s imperious mother
might command her son to be pleased about a Christmas gift and to have an easy
conscience about not getting her one. To argue that a sentence is disqualified if it was ingeniously devised (as
Bierwisch has) is not helpful when we cannot define the threshold where naivety
leaves off and ingenuity starts. Even (4) is ingenious in the sense that
it was expressly invented to make a point about underlying structure, and is
unlikely to be uttered (§ 53, 57).
(c) It is also easy enough to invent examples where it
is not clear which group of constraints is violated. (4d) would be such a case,
and might yet be contextualised by applying the American meanings of ‘john’ as
a ‘toilet’ or a ‘prostitute’s customer’ (Random
House Webster’s College Dictionary,
1991, p. 729).
16. At all events linguists were dismayed to find a
wholly unexpected lack of agreement, both among themselves and among native
speakers, about sample sentences. This outcome gave rise to a series of complex
rhetorical manoeuvres on two sides. On the side of data, samples were carefully
restricted in order to highlight the contrast between clearly proper sentences
with seemingly obvious meanings versus clearly improper ones with no sensible
meanings, as in (4-4c). On the side of theory, the central formalist assumption
was shielded from the implications of observed disagreements. The set of formal
constraints was declared to correspond only to ‘competence, the speaker-hearer’s knowledge of his language’ and not
to ‘performance, the actual use of
language in concrete situations’ (Chomsky 1965: 4). The ‘speaker-hearer’ was in
turn declared ‘ideal’: ‘living in a completely homogeneous
speech-community’, ‘knowing its language perfectly’, and being ‘unaffected’ by
‘memory limitations, distractions, shifts of attention and interest, and
errors’ (1965: 3). In effect, these two declarations instated consensus by
decree and converted it into an ‘ideal’ that need not, indeed cannot, be tested
against the agreement among speakers.
17. The same rhetorical pressure accounts for the
evasive complication opposing ‘surface structure’ to ‘deep structure’ and
declaring that ‘the grammar does not, in itself, provide any sensible procedure
for finding a deep structure of a given sentence, or for producing a given
sentence’ (1965: 141). Moreover, ‘much of the actual speech observed’ was
declared to ‘consist of fragments and deviant expressions’ (1965: 201), echoing
Saussure’s influential mistrust of ‘actual speech’ (cf. § 1). These further
declarations can serve to explain away any discovered lack of convergence among
language data.
18. These rhetorical manoeuvres suggest that
generative linguists were aware of and disquieted by the ‘formalist trade-off’
(described in § 6) but were determined to rescue the central formalist
assumption (also cited § 6) by designing the theory precisely so as to prevent
the lack of progress in coverage, convergence, and consensus in respect to real
data from counting as a refutation. They correctly speculated that their
manoeuvres, even if thinly or speciously argued, would not be critically
assessed by colleagues who were (a) firmly committed to the ‘mainstream’
linguistic programme of describing ‘language by itself’, (b) not anxious
to undertake painstaking fieldwork in remote places, and (c) mistrustful of
actual speech in all its ‘messy richness’. So we can readily understand the
success of generative linguistics and its continuation through a long and
sometimes arcane series of ‘extensions’, ‘revisions’, or changes of notation
without any willingness to change its basic claims about what a ‘language’ is
and what a ‘linguistic theory’ should do. Its adherents cannot admit that
isolating language from the functional
constraints that apply to real data
incurs the impossible job of inventing all
the formal constraints for all conceivable data, irrespective of
whether native speakers would ever utter them. The ‘generative grammar’ would
have to reconstruct the formal possibility that speakers could utter or understand them; and no
evidence has been brought forward so far that this can ever be done.
19. The conclusion would have to be: if language is
detached from the constraints of ‘world’ and society, its own internal
constraints are not sufficient to support its organisation (cf. § 2, 6, 11f,
14). Hence, any linguistic description which postulates such a detachment will
only be able to cover a part of that organisation and will encounter frequent
obstacles to convergence and consensus. This conclusion is borne out by empirical evidence not about the formal
structure of sentences but about the ‘pragmatic’ activities of doing language
science over the past century. Because ‘language by itself’ was a technical fiction to begin with, theories
about it have been obliged to create a proliferating series of further
technical fictions to prop each other up — ‘grammaticality’, ‘well-formedness’,
‘competence’, ‘ideal speaker-hearer’, ‘homogeneous speech-community’, ‘deep
structure’, and all the rest — that are not merely unconfirmed by real data but programmatically
opposed to real data. The prospect today is not merely that no formal
description of ‘language by itself’ has yet attained adequate coverage,
convergence, and consensus for any natural language, but that no such theory ever will. In the long
run, the apparent advantages of linguistic formalism — stability, determinism,
rigour, visual clarity, impressive notations — and the privileges it confers —
to invent and judge your own data, to do science without leaving your desk, and
to escape the rich and messy contexts of human interaction — all turn out to be
liabilities for achieving even its own carefully circumscribed tasks. Such a
formalism relegates us to a shadowy world of formulas and arrays whose
determinacy is financed by their indeterminate relation to the language data
they purport to represent.
C. The impact of ‘large-corpus linguistics’
20. I have briefly retraced the theoretical evolution
of ‘mainstream’ linguistics in section A in order to indicate how the early
programme of describing ‘language by itself’, detached from world and society,
has favoured a linguistic formalism that turned away from real data and
eventually blocked further progress in coverage, convergence, and consensus,
without which we cannot attain a complete and valid description of any natural
language. The growing awareness of this impasse has led to a diversification
within linguistics that has edged formalism gradually out of its ‘mainstream’
and majority position. The brands of linguistics going under such designations
as ‘functional’, ‘systemic’, ‘applied’, ‘cognitive’, ‘computational’, and
‘critical’, along with some adjunct domains such as ‘discourse analysis’ and
‘discourse processing’ (which seldom aspired to be part of linguistics), all
share the enterprise of resituating language in its cognitive and social
contexts, reassembling, as it were, the ‘layer cake’ of language interfaced
with world and society (§ 2).
21. As the
conventional division between ‘language by itself’ versus ‘language in use’ has
been progressively narrowing, we have found that real data are not plagued by
the lack of ‘discoverable unity’ that, Saussure vowed, would prevent us from
‘putting speech into any category of human facts’ (§ 1); nor do they ‘consist
of the fragments and deviant expressions’ that justified Chomsky’s retreat from
‘the actual use of language in concrete situations’ (§ 16). Instead, real data
reveal an unexpectedly high degree of precision and clarity, though not
necessarily in the modalities that mainstream linguistic theories would easily
recognise.
22. This finding has been most profoundly assisted by
the advance of technology, placing within our reach a new source of data that
dramatically enhances the prospects for coverage, convergence, and consensus.
The key technical innovation is the large computerised corpus of data from
actual texts and discourses, such as the ‘Bank of English’ (hereafter ‘BoE’ for
short) developed at Birmingham University by John Sinclair and his team. I took
the data described below from the BoE in July 1994, at the stage when it had
reached the size of some 200 million words of running text from contemporary
spoken and written sources, including: British and American books; newspapers (Times, Independent, Guardian, Today, Wall
Street Journal, New Scientist, Economist); magazines (e.g., Esquire, Good Housekeeping); ‘ephemera’
such as letter-box mailings (e.g., YMCA appeal for homeless people, Friends of
the Earth Tropical Rainforest Campaign), radio broadcasts (British Broadcasting
Corporation in the UK and National Public Radio in the US); and recordings of
conversations.[1] The coverage by so large a corpus
might validly claim to be representative,
though it is certainly not complete
and is very far from ‘infinite’. Yet paradoxically, it has itself made us aware
of the ways in which it is yet too small (§ 66ff).
23. Still, as a sample of contemporary English usage,
the coverage exceeds previous sample sizes by various orders of magnitude, such
as: the previous 20-million word corpus used for the 1987 Collins COBUILD English Language Dictionary (by 1 order of
magnitude); the 1-million word Survey of English Usage at University College
London (by 2 orders of magnitude plus doubling); the 2000-word fragments in the
Brown University corpus (by 5 orders of magnitude); and the 24 invented
sentences analysed or ‘transformed’ in Chomsky’s Aspects (by 7 orders of magnitude).[2]
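The comparisons above can be checked with a little arithmetic. The following is only a sketch: the sizes are the ones named in the text, the names `BOE_WORDS`, `EARLIER_SAMPLES`, and `orders_of_magnitude` are made up for illustration, and the last comparison sets a count of sentences against a count of words, so it is rough at best.

```python
import math

BOE_WORDS = 200_000_000  # BoE size in July 1994, as stated in the text

# Sample sizes named in the text. The last entry counts sentences, not words,
# so that comparison is only approximate (an assumption carried over as-is).
EARLIER_SAMPLES = {
    "COBUILD 1987 corpus (words)": 20_000_000,
    "Survey of English Usage (words)": 1_000_000,
    "Brown corpus fragment (words)": 2_000,
    "Aspects invented sentences (sentences)": 24,
}

def orders_of_magnitude(larger, smaller):
    """Return log10 of the size ratio, i.e. the number of powers of ten."""
    return math.log10(larger / smaller)

for name, size in EARLIER_SAMPLES.items():
    print(f"{name}: {orders_of_magnitude(BOE_WORDS, size):.2f}")
```

A whole-number result corresponds to an exact power of ten; the fractional part for the Survey of English Usage reflects the additional ‘doubling’ mentioned above.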
24. Contrary to what is widely believed, the increase
in orders of magnitude does not entail a direct proportionality whereby we just
get the same data multiplied by 10, 100, 1,000, and so on, so that if an item
appears once in a 1 million word corpus, it appears 20 times in a 20 million
word corpus and 200 times in a 200 million word corpus. If that were true,
building steadily bigger corpuses would only give the results we could
accurately predict from the proportions in a small corpus. But in fact the
large corpus offers not just more
data but different kinds of data:
(a) We find numerous items that did not appear at all
in smaller ones.
(b) We can make more informed judgements about
relative frequency. Of two items appearing only once in a small corpus, the one
might still appear only once in a larger corpus and the other fifteen or twenty
times.
(c) The larger corpus will display the data in
steadily finer degrees and differentiations of detail. An item which appeared
only once in a small corpus may appear in several distinctive variants in a
large one.
In these ways, each increase in magnitude can reveal hosts of fresh and
more detailed regularities that were simply not noticeable before, nor are they
readily open to unaided intuition and introspection (§ 27, 52f, 55, 63). They
still have to be interpreted, but — in marked contrast to non-corpus linguistic
methods — the outcome is quite amenable to convergence and consensus (§ 4, 6,
15, 17ff, 20, 22, 27f, 39, 43, 46, 48, 50, 52-55, 62, 64f, 72-75).
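The contrast between simple proportionality and what a larger corpus may actually show can be sketched as follows. This is an illustration only: the item names and counts are invented placeholders that mirror the hypothetical cases in (a) and (b), not figures from the BoE, and the function names are assumptions of this sketch.

```python
def proportional_prediction(count, small_size, large_size):
    """Count expected in the larger corpus if frequency simply scaled with size."""
    return count * large_size / small_size

def new_items(small_counts, large_counts):
    """Items attested in the larger corpus but absent from the smaller one."""
    return sorted(set(large_counts) - set(small_counts))

def predicted_vs_observed(small_counts, large_counts, small_size, large_size):
    """Map each item of the small corpus to (predicted, observed) counts."""
    return {
        item: (
            proportional_prediction(count, small_size, large_size),
            large_counts.get(item, 0),
        )
        for item, count in small_counts.items()
    }

# Invented toy counts: two items occur once each in a 1 million word corpus;
# in a 20 million word corpus one still occurs once, the other twenty times,
# and a third item appears that the small corpus never showed.
small = {"item_a": 1, "item_b": 1}
large = {"item_a": 1, "item_b": 20, "item_c": 3}

print(new_items(small, large))
print(predicted_vs_observed(small, large, 1_000_000, 20_000_000))
```

Under strict proportionality both items would be predicted to occur 20 times; the sketch merely makes visible where such a prediction and the observed counts can part ways.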
25. Conversely, the corpus shows that examples we
might intuitively accept at face value are not typical of actual usage. Our
beloved evergreens like those cited in § 13:
(3) The man hit the ball.
(4) John is easy to please/eager to please.
(5) The cat sat on the mat.
do not appear in the BoE, not because they aren’t
properly ‘grammatical’ or ‘well-formed’ English but because they aren’t ‘natural’: typical contexts of real
discourse require less simple-minded and peremptory utterances. In the BoE,
nobody at all is said to be ‘easy to please’. For ‘eager to please’, three
instances appear (6-8), each with a direct object for ‘please’ that was missing
in (4) and with more interesting agent-subjects than our insipid friend ‘John’.
Even allowing for intervening items, the only combination of ‘man + hit + ball’
was (9); ‘man + hit’ alone returned only (10), where the sense of ‘hit’ adapts
to ‘jackpot’. For ‘cats sitting on mats’, the only attestations were
derivations from the use of this trite example in schoolbooks or logicians’
debates, e.g. (11-13), rather than being assertions about any real cat.[3]
(6) < a government official who is eager to please
the wealth goddess >
(7) < the Sandinistas. The government is eager to
please the Church >
(8) < show a sociable child who is eager to please
or charm those around him >
(9) Yes. Doesn’t that man hit the ball hard?
(10) Where can a con-man hit the biggest jackpot? In
politics
(11) On the first page was a drawing of a brindled cat
seated on a recognisable mat, the original ‘cat on the mat’ now quoted in
derision of an antiquated method of teaching
(12) so if you have <ZF1> a <ZF0> a man on
the roof [pause] er erm erm a cat on a mat er a tree on a mountain top a boy
sitting on a tree branch these all involve
(13) material-objects statements, ‘There is a cat on
the mat’, statements about people in novels, statements of mathematics
We shouldn’t regard the grainy details of the real data as a mere
obstruction to be filtered out by rarefying and decontextualising (in the sense
of § 3). Instead, we should respect the ‘naturalness’ of real data because,
unlike the ‘grammaticalness’ or ‘well-formedness’ of the formalists, it has been decided for us by real users of
the language (cf. § 63f). We want to account for the ‘competence’ real
users not just possess but display
when doing this; and there, ‘well-formedness’ has no overriding priority (§ 46,
48, 53, 55).
26. Now, corpus displays are in some sense frankly
‘surface’ data, but, exactly because the data are not severed from their
contexts, it is easier to assess what sorts of ‘shallower’ or ‘deeper’
constraints might apply. Even on the surface, a corpus displays to the
investigator not just words but collocations,
to adopt Firth’s (1968 [1952-1957]: 106ff, 113, 182) well-known term: ‘words’
considered in ‘the company they usually keep’, i.e., typical word combinations
that would not usually qualify as idioms or standing phrases (cf. § 31, 33, 52,
55, 60, 66, 69, 77ff). Also, the data can be accessed in somewhat ‘deeper’ ways
by means of the search software, so that, for example:
(a) The collocation need not be invariant or
continuous but may contain varying interposed words (up to 4 in the BoE), e.g.
‘on the mat’ in (5) versus ‘on a recognisable mat’ in (11).
(b) We can sort out words that could belong to more
than one word-class, e.g. ‘warrant’ as either a noun or a verb.
(c) We can use uncommitted characters to search for a
stem with all its endings, e.g. to compare ‘logic’ with ‘logical’, which turn
out to collocate rather differently.
(d) We can make nested sub-displays to zero in on
possibly significant combinations in the general display, e.g., to go from
‘warrant’ to ‘warrant + investigation’.
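The search options in (a) and (c) can be illustrated with a minimal Python sketch. It is not the BoE search software, whose internals are not described here; the names `tokenize`, `gapped_collocation`, and `stem_matches` and the crude tokeniser are assumptions made purely for illustration, and the sample strings are taken from examples (11) and (13).

```python
import re

def tokenize(text):
    """Crude tokeniser (an assumption): lower-case runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def gapped_collocation(tokens, first, second, max_gap=4):
    """Return (i, j) index pairs where `second` follows `first`
    with at most `max_gap` interposed tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # j - i - 1 is the number of interposed tokens, so j <= i + 1 + max_gap
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == second:
                hits.append((i, j))
    return hits

def stem_matches(tokens, stem):
    """Return the tokens that consist of `stem` plus any ending (or none)."""
    pattern = re.compile(re.escape(stem) + r"\w*$")
    return [t for t in tokens if pattern.match(t)]

# 'on ... mat' with two interposed words, as in example (11)
sample = tokenize("a brindled cat seated on a recognisable mat")
pairs = gapped_collocation(sample, "on", "mat")
```

A nested sub-display in the sense of (d) would then amount to running a second search over only those lines that the first search returned.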
27. The most ‘surface’ use of the large corpus is to
enable accurate judgements about the frequencies
of words or word-combinations — a familiar tactic in ‘computational
linguistics’. A far ‘deeper’ and more revealing use of the corpus is to detect tendencies rather than just frequencies,
so that we can assess why certain
combinations occur and not just how often.
Paradoxically, sorting vast quantities of real data allows unexpected
convergences to emerge within the regularities underlying this huge variety
(cf. § 43f, 72). Among all of the possible combinations of English words and
phrases that might be intuitively judged ‘grammatical’, we can finally see
which ones are more likely to be realised and at least some of the reasons why.
28. The main challenge now is how to identify and describe the constraints
whose effect the corpus-displays allow us to inspect (cf. § 73). The
constraints are all functional in the
broadest sense, i.e., related to what people do with their language (§ 5); any formality we may distil out is
derivative upon that functionality and cannot be consensually accounted for
without it (cf. § 54f; Beaugrande, in press). Moreover, functional constraints
need not fit neatly into the formal linguistic schemes devised for ‘language by
itself’ — not a surprising finding, perhaps, but an immensely significant one
(§ 34-46).
29. My demonstration here will be the Bank of English
corpus data on the English verb ‘warrant’. The BoE returned a total of 392
lines centring on that key-word as a Verb. To get a more manageable and
productive sample, I made a hand-sorted selection of 228 lines by eliminating
repetitions, e.g. when a statement by a politician got reported in several
media, and false alarms where the key word was actually a noun.[4] Selecting the verb allowed me to disregard the
numerous noun occurrences in stock phrases like ‘search warrant’, ‘death
warrant’, or ‘warrant for arrest’.
30. The word has a venerable history related,
according to Walter Skeat’s (1970 [1879-1882]: 702) Etymological Dictionary of the English Language, to the word
‘guarantee’. As a verb, we find such usages attested in the Oxford English Dictionary (pp. 930ff)
as: ‘to keep safe from danger’ (14); ‘to guarantee goods to be of the quality,
quantity, etc. specified’ (15); ‘to give a personal assurance of a fact’ (16),
‘chiefly in “I (I’ll) warrant you”’ (17); and ‘to authorise, sanction a course
of action’ (18).
(14) What good Man was he that from deth warawnted
thee? (Henry Lovelich, Merlin, 1450)
(15) This Ryche man thenne sold his oylle to the
marchaunts and waraunted eche tonne al ful (William Caxton, The subtyl historyes and fables of Esope,
Auyan, Alfonce, and Poge, 1484)
(16) Bot for to lere him I warand, Als mekil als he
mai vnderstand (The proces of the seuyn
sages, 14th century)
(17) There be many such I warrant you yt neuer
cum to light (Thomas More, A dyaloge
wherin he treatyd dyvers maters as of the veneration and worshyp of ymagys etc.,
1528)
(18) The Lord warrants us to suspect the inconstant
(Daniel Rogers, Naaman the Syrian, his
disease and cure, 1642)
These samples from the 14th to the 17th centuries suggest a gradual
widening away from official discourse, and a drift toward the modern usage
displayed by the BoE corpus, as we shall see.
31. A first heuristic for identifying the more
interesting collocations in the BoE is to list in the order of frequency the
most common words within the set of lines returned. Many of those near the top
of the list, such as ‘of’ or ‘to’, will seem unenlightening in the early
stages, but at least some of the more suggestive words can turn up:[5] among the nouns, ‘evidence’ (21 occurrences),
‘investigation’ (12), ‘trial’ (7), ‘attention’ (9), ‘circumstances’ (8),
‘concern’ (6), ‘mention’ (5), ‘consideration’ (5), ‘punishment’ (5),
‘intervention’ (4), and ‘conditions’ (3); among the modifiers, ‘enough’ (58),
‘sufficient’ (27), ‘serious’ (14), ‘really’ (7), ‘certainly’ (6), ‘important’
(5), ‘severe’ (5), and ‘trivial’ (4).
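This first heuristic can be sketched in a few lines of Python. The sketch is only an approximation under stated assumptions: the function name `word_frequencies` is hypothetical, the tokenising pattern is a simplification, and the five sample lines are phrases quoted in this article rather than the 228-line BoE sample itself.

```python
import re
from collections import Counter

# Toy sample: phrases quoted in this article, not the BoE display itself.
LINES = [
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
    "the documentary wouldn't warrant more than a 4",
]

def word_frequencies(lines, keyword="warrant"):
    """List (word, count) pairs in descending order of frequency,
    leaving out the key word itself."""
    counts = Counter()
    for line in lines:
        words = re.findall(r"[a-z'-]+", line.lower())
        counts.update(w for w in words if w != keyword)
    return counts.most_common()

freqs = word_frequencies(LINES)
```

Even on this toy sample a function word heads the list, which is why the more suggestive nouns and modifiers have to be picked out further down.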
32. A second heuristic is to create a positional
frequency table in which the words in the several slots to the left and right
of the key word are displayed in descending order of frequency. The table below
shows the data for ‘warrant’.
3 to the left   2 to the left   1 to the left   word      1 to the right   2 to the right   3 to the right
sufficient      enough          to              warrant   a                the              of
enough          evidence        not             warrant   the              investigation    the
serious         did             't              warrant   an               a                in
too             do              would           warrant   it               <t>              a
the             does            might           warrant   such             attention        <t>
and             not             really          warrant   any              of               but
that            as              that            warrant   further          action           action
not             didn            yet             warrant   this             trial            <LTH>
sufficient      may             should          warrant   that             with             and
in              doesn           search          warrant   his              and              to
is              nothing         and             warrant   to               more             trial
was             the             will            warrant   some             special          that
it              and             circumstances   warrant   their            even             for
of              seem            arrest          warrant   no               mention          into
which           t o             could           warrant   another          intervention     by
but             trivial         can             warrant   my               's               it
good            that            may             warrant   its              it               is
done            will            soon            warrant   for              because          than
<h>             so              'll             warrant   more             than             some
a               small           conditions      warrant   concern          new              as
's              seemed          germane         warrant   officer          an               an
be              they            death           warrant   </h>             further          here
important       appear          certainly       warrant   one              sort             then
These data too are at best suggestive, and for much the same reason that
purely formal syntax readily becomes convoluted or opaque: many words or
word-classes are fuzzy in respect to their mutual positions; and functional
relations need not show up as formal ones. The frequent negations (‘not’, ‘-t’,
‘didn’, ‘no’, and, by implication, ‘too’) are scattered over four positions
(cf. § 36, 42). And some of the most revealing data don’t appear at all, either
because their position isn’t consistent enough, e.g. ‘situation’; or because a
shared semantic concept is lexicalised in various ways, e.g. ‘disability -
distress levels - ill health - medical problems’.
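A positional frequency table of this kind can be approximated as follows. This is a sketch under assumptions, not the BoE routine: the function name `positional_frequencies` is hypothetical, tokens are simply split on whitespace, and the four sample lines are phrases quoted in this article.

```python
from collections import Counter, defaultdict

# Toy sample: phrases quoted in this article, not the BoE display itself.
LINES = [
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
]

def positional_frequencies(lines, keyword="warrant", span=3):
    """Map each offset (-span..-1 and 1..span) to a Counter of the words
    found at that distance from every occurrence of `keyword`."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i, tok in enumerate(tokens):
            if tok != keyword:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    table[offset][tokens[j]] += 1
    return table

table = positional_frequencies(LINES)
# table[-1].most_common() lists the words one slot to the left, most frequent first
```

Because each slot is counted separately, an item such as a negation that drifts between slots has its total split across several columns, which is exactly the limitation noted above.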
33. A third heuristic offered in the BoE software
sorts the lines by the alphabetical order for a given position to the left or
right of the key word. This tool works best in bringing out data about items
whose position is relatively fixed, e.g. the extreme frequency of ‘to’ in the
infinitive (top item before ‘warrant’ in the positional frequency table). But
user-performed hand-sorting is needed for groupings wherein the essential items
and collocations occupy more flexible positions, e.g. ‘serious’; or where
groupings are to be made by semantic criteria, e.g. ‘investigation’ with
‘inquiry’. I worked out three hand-sorted displays and added bold italics to
highlight the items that I chiefly relied on while doing the sorting and
alphabetising:[6] one for what does or doesn’t do the
‘warranting’, one for what is or is not ‘warranted’, and one for the relevant
criteria. Samplings from these three displays are given in Appendices A, B, and
C.
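The alphabetical sort by position can be sketched as below. The function name `sort_by_position` is hypothetical and the code is only a guess at the general technique, not the BoE implementation; the sample lines are again phrases quoted in this article.

```python
# Toy sample: phrases quoted in this article, not the BoE display itself.
LINES = [
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
]

def sort_by_position(lines, keyword="warrant", offset=1):
    """Sort concordance lines alphabetically by the word found `offset`
    slots from the key word (negative = left, positive = right)."""
    def sort_key(line):
        tokens = line.lower().split()
        i = tokens.index(keyword)  # assumes every line contains the key word
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else ""
    return sorted(lines, key=sort_key)

by_first_right = sort_by_position(LINES, offset=1)
```

Such a sort groups lines only by surface form at one slot; grouping ‘investigation’ with ‘inquiry’, or catching ‘serious’ wherever it falls, still calls for the hand-sorting described above.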
34. These displays begin to reveal the various types
of constraints. Some constraints might be provisionally stated according to the
familiar schemes of different ‘levels’ or ‘components’ of ‘mainstream
linguistics’. For phonology, the intonation would be distinctive for the
performative ‘warrant’ in relatively rare locutions like ‘I’ll warrant’ used
when you want to indicate you feel sure about something though you can’t point
to actual facts (cf. § 64):
(19) If I had ten thousand men like him tomorrow then I
warrant we’d see Napoleon beat by midday [quoting the Duke of Wellington.]
(20) The soil may look innocuous enough when you’ve
dug it over but I’ll warrant it’s teeming with root-eating wireworms.
(21) I’ll warrant I even heard Honey Bane shuffling by
somewhere in the background of a song that will provide the perfect soundtrack
for when your mum won’t let you out of your room until you’ve done your
homework.
A sample like (21) looks quite complex (with quadruple ‘embedding’) in
comparison to the usual invented sentences like (3-5) in § 13 and 25, but in
actual discourse it should present no difficulties for comprehension, even for
the young and not very intellectual readers it addresses.
35. For morphology, we might note the overwhelming frequency of non-finite
forms, either in infinitives with ‘to’ (136 occurrences) or with some modal
verb (58) (cf. § 42). Also, several Latin/French-based prefixes among the
semantic processes may be significant: ‘ad-’ for moving toward
something: ‘action, appeal, appellation, assistance, attention’; ‘com-’ or
‘con-’ for acting, happening, or bringing together: ‘collection, commitment,
complaints, conclusion, conditions, consideration, conspiracy, consultations’
plus the Anglo-Saxon ‘with-’ in ‘withdrawals’; ‘de-’ and ‘dis-’ for uncovering
or invalidating something: ‘declines, definition, developments, disability,
distress’; ‘e-’ or ‘ex-’ for getting outside: ‘event, evidence, examination,
exclusion, expansion, expenditure, extension’, plus the Anglo-Saxon ‘out-’ in
‘outburst’; negating ‘im-’ or ‘in-’ for something that is not as it should be:
‘impropriety, indeterminate, insufficient’, plus the Anglo-Saxon ‘un-’ in
‘uncharacteristic, uncovered, unimportant, unorthodox, unsatisfactory,
unspecifiable, untutored’; ‘in-’ and ‘inter-’ for getting inside or between:
‘inquiry, interception, interference, intervention, introducing, investigation
into war crimes, inclusion in the wheelchair, internal matters that warrant no
outside interference’; ‘re-’ for following up or going back toward something
previous: ‘recession, record, recording, relaxation, relief, respect, response,
retaliation, retrospective, return, revelations, revision’ (cf. § 41).
36. For syntax or ‘grammar’, we could note the extreme
dominance of third person subjects (224 occurrences), as opposed to just 4 in
first person (compare samples (19-21)) and none at all in the second person;
and, within the third person, the mere handful of pronoun subjects ‘he’ (6
occurrences), ‘she’ (0), ‘they’ (5), and ‘it’ (7), as contrasted with the large
numbers of noun subjects (§ 42). Or, we might note the high proportion of
negations attached to the verb: ‘not, don’t, didn’t, not yet, hardly, not
really’ (cf. § 32, 42).
37. For semantics, we could note that many of the
subjects and direct objects fall into associative classes that are not unduly
hard to label, e.g.:
(a) as subjects:
actions: ‘achievement,
aggressions, behaviour, blow, brawl’; resources:
‘abilities, acreage, growing area, scrappable cars’; knowledge: ‘evidence, information,
perception, scientific authority’; messages:
‘accusations, complaints, juicy stuff, message, piece of tittle-tattle,
revelations’; problems: ‘air
leaks, ambiguity, antitrust conspiracy, casualty rate, chilly old homes,
degenerating trees, disability, discriminatory practices, distress levels, food
shortage, ill health, impropriety, job bias, slowing in the economy, violence’;
(b) as direct objects: (in)appropriate reactions: ‘(further)
action, change, commitment, conclusion, consideration, expansion, extension,
formation, increases, motion, (cautious) move, plan, step, signing, treatment’;
consumption of resources: ‘cost,
expenditure, loss of any troops’ lives, overeating, paying the steeper taxes,
shelf-space’; messages: ‘apology,
appellation, billing, briefing, brochure, column inches, comment, description,
footnote, mention, phrase, satire, serious talk, suggestion, talking-to’; knowledge-gathering: ‘airing,
attention, consultations, examination, hearing, inquiry, investigation, retrospective
survey, review, [legal] trial, [medical] trials’; solving problems: ‘answering machine, (charitable/economic)
assistance, breaking the embargo, easing of interest rates, full-time
custodian, guests wearing thermal long johns, intensive care, introducing more
elaborate feeding, (professional/prompt/surgical) intervention, making peace,
mid-season break, opening of a new peat extraction plant, revision, sending of
those supplies, using these drugs’; retaliating:
‘banning the show, charge(s), God’s anger, jail time, lengthy ban, massive
American retaliation, penalties, pre-emptive strike, (criminal) prosecution,
(capital) punishment, retribution, [legal] trial’.
Such
groupings overlap, since a broad category like ‘(in)appropriate reactions’ can reasonably
include narrower ones like ‘knowledge-gathering’, ‘problem-solving’, and
‘retaliating’. Still, we can make a modest ‘semantic table’ showing the typical
correlations between subject-groupings and object-groupings, e.g.:
subject-groupings    object-groupings
actions              (in)appropriate reactions
resources            consumption of resources
messages             messages
knowledge            knowledge-gathering
problems             problem-solving
It seems plausible that a given parallel across our columns might show
up in the data on the same line, as we see at once for ‘evidence’ (knowledge)
plus ‘investigation/trial’ (knowledge-gathering). But this co-occurrence of
semantic groupings on one line is by no means a rule. We can also have, say, an
action as subject and a message about it as direct object, e.g. when an
‘operation warrants a middle-of-the-night briefing’. Or, the context for one
grouping may imply another, as when knowledge-gathering in legal contexts
implies a retribution, e.g. the condemnation and punishment likely to follow
upon a ‘trial’. Or again, some people consider legal punishments a type of
problem-solving, despite the scant evidence that the ‘problem of crime’ is
being solved in this way.
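One way to check how often a line follows the parallels of the semantic table is to cross-tabulate the groupings line by line. The sketch below is illustrative only: the dictionaries hold a handful of the items and groupings named above, the four subject-object pairs are those mentioned in this article, and the names `cross_tabulate` and `PARALLELS` are hypothetical.

```python
from collections import Counter

# Groupings for a few items named in the text (hand-assigned, as in the hand-sorting).
SUBJECT_GROUPS = {"evidence": "knowledge", "operation": "actions",
                  "ill health": "problems"}
OBJECT_GROUPS = {"investigation": "knowledge-gathering",
                 "trial": "knowledge-gathering",
                 "briefing": "messages",
                 "intervention": "problem-solving"}

# The parallels proposed in the semantic table.
PARALLELS = {"actions": "(in)appropriate reactions",
             "resources": "consumption of resources",
             "messages": "messages",
             "knowledge": "knowledge-gathering",
             "problems": "problem-solving"}

# Subject/object pairs mentioned in this article (a toy sample).
PAIRS = [("evidence", "investigation"), ("evidence", "trial"),
         ("operation", "briefing"), ("ill health", "intervention")]

def cross_tabulate(pairs):
    """Count how often each subject-grouping meets each object-grouping."""
    return Counter((SUBJECT_GROUPS[s], OBJECT_GROUPS[o]) for s, o in pairs)

crosstab = cross_tabulate(PAIRS)
# Lines whose groupings follow the table, versus all lines
parallel = sum(n for (s, o), n in crosstab.items() if PARALLELS.get(s) == o)
```

On this toy sample the ‘operation’ plus ‘briefing’ line falls outside the parallels, in keeping with the point that the co-occurrence is a tendency and not a rule.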
38. The constraints of context soon impel us beyond
the customary borders of semantics. An abstract scheme of ‘semantic features’
would presumably suggest making a separate class for general nouns, some of
which appear as frequent subjects in our data: ‘behaviour, circumstances,
conditions, contemporary events, incident, occasion, operation, qualities,
situation’. But none of these remains general in the context. Most of them
carry a pejorative implication, i.e., that the ‘behaviour, circumstances’, etc.
involve some problems. If we read that ‘circumstances do not warrant a change in
the leadership’, we can assume that one or more ‘leaders’ do not seem to have
been acting as they should and that somebody wants to reassure us. Or, if we
read that ‘circumstances simply do not warrant charitable assistance’, we can
assume some people are in financial difficulties while other people with money
are, in the finest Tory tradition, excusing themselves from helping out.
39. We can see here a major difference between
conventional abstract semantics versus corpus-driven semantics, one which Sinclair
(1994) has pointed out. Most of what passes for generality, vagueness, or
ambiguity in the meaning of language and impels semanticists to build finicky
sets of rules to eliminate it, evaporates when we look at suitably sorted real
data. So we may well feel uneasy about approaches that expressly declare it the
job of semantics to ‘disambiguate’ sentences or sequences that allow for more
than one interpretation (§ 43). Quite plausibly, the ambiguity is largely an
artefact of using isolated and invented data. We might recall here the contrast
between invented simple sentences like (3-5) in § 13 and 25 versus authentic
and elaborate real data such as (19-21) in § 34. Again, trying to filter
language to the point of enabling a formalist description erodes the
constraints that are urgently needed for convergence and consensus (cf. § 4, 6,
15, 17ff, 20, 28).
40. For pragmatics, finally, we could note the
explicit performative ‘warrant’ when the speaker is also the subject, as in
(19-21) in § 34. Less explicit but far more common and influential is the
pragmatic force entailed in declaring what does or does not ‘warrant’ what.
This force carries the implication that the event or state of affairs that
might do the ‘warranting’ is in some way unusual or significant enough that a
reaction might well be in order, and that those who might be expected to do the
reacting are likely to say whether they are going to react, why or why not, and how.
Accordingly, the speaker — or, when the discourse is reported, the originator
of the message — is likely to be a person who represents some institution or
authority, and our data suggest what kind: government, judiciary, military,
sports, business, science, and medicine. Or if the person does not, then the
use of ‘warrant’ implies a subtle signal that authority is being claimed
anyhow; we see this use among journalists and media persons when they are not
reporting what other people said. Uses like ‘the Chevrolet Beretta does not
warrant particular mention’ or ‘the documentary wouldn’t warrant more than a 4’
are inconsequential magisterial pronouncements merely aping genuine authority
with real consequences, e.g., medical judgements about whether ‘problems
warrant surgery’ or ‘drugs’.
41. I have followed through the familiar linguistic
‘levels’ or ‘components’ to suggest that each of them contributes a set of
constraints on the verb ‘warrant’. But taken by itself, each set is weak and
some may seem unduly speculative. For example, citing the frequency of prefixes
as morphological units (§ 35) might seem to be overinterpreting merely
coincidental or antiquarian materials, were it not for the semantic and
pragmatic constraints indicating that ‘warranting’ often does involve
situations in which people act together (viz. ‘commitment, complaints, consideration,
conspiracy, consultations’); or where something is not what it should be (viz.
‘impropriety, insufficient, unimportant, unorthodox, unsatisfactory,
untutored’); or where people want ‘inside’ knowledge (viz. ‘inquiry,
investigation’) or want to break ‘in’ on the chain of events (viz.
‘interception, interference, intervention, introducing’); and so on. Suggestive
too are some less frequent semantic combinations, e.g., that ‘assistance’ and
‘assistant manager’ both appear as ‘warranted’ solutions to problems. The
question of whether such accumulations or combinations reflect the design of
the language or the speaker’s choice still needs to be determined; but without
the corpus data display, we wouldn’t have occasion to pose the question at all.
42. Considering pragmatics clearly helps in
appreciating the significance of several ‘grammatical’ or syntactic
accumulations. Foremost among these is the high frequency of negations (§ 32,
36), signalling how often the potential reactors feel impelled to declare that
a predictable or reasonable reaction will not
take place. Or (to include morphology here), the frequency of infinitive forms
reflects the specification of the criterion for making such a declaration, e.g.,
that things are ‘too small, trivial’ etc. or ‘not serious, severe, etc. enough’
‘to warrant’ something. Or again, the
frequent use of modal verbs like ‘may’ (14), ‘must’ (11), ‘would’ (10), ‘will’
(7), ‘might’ (5), ‘should’ (4), ‘can’ (3), ‘could’ (3), and ‘shall’ (1) in a
total of 58 lines, plus ‘seem’ (8) and ‘appear’ (2), has the function of
attenuating the pragmatic force and conceding that other people might reach
different conclusions about the ‘warranting’. The same function is at stake in
the use of interrogatives, as in ‘Did he warrant the harsh punishment of
exclusion?’; and of dependent clauses with the force of interrogatives, as in
‘specify what kind of cases would warrant capital punishment’. Or again, the
low number of personal pronouns (§ 36) as subjects reflects the semantic and pragmatic
constraint that actions and situations are more likely to be said to ‘warrant’
something than people are.
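Counts of negations, modals, and infinitives like those cited here could be gathered by inspecting the words just to the left of the key word. The following Python sketch shows one possible approach under loud assumptions: the function name `left_context_features`, the three-word window, and the small word lists are all hypothetical choices, and the splitting of contracted forms is deliberately crude.

```python
MODALS = {"may", "must", "would", "will", "might", "should",
          "can", "could", "shall"}
NEGATORS = {"not", "n't", "hardly"}

def left_context_features(line, keyword="warrant", window=3):
    """Flag negation, modal verb, and 'to'-infinitive in the `window`
    words to the left of the key word."""
    # Split contracted negations: "wouldn't" -> "would n't".
    # Crude assumption: forms like "can't" or "won't" lose their modal this way.
    tokens = line.lower().replace("n't", " n't").split()
    i = tokens.index(keyword)  # assumes the line contains the key word
    left = tokens[max(0, i - window):i]
    return {
        "negated": any(t in NEGATORS for t in left),
        "modal": any(t in MODALS for t in left),
        "infinitive": i > 0 and tokens[i - 1] == "to",
    }

features = left_context_features("the documentary wouldn't warrant more than a 4")
```

A fixed window will sometimes pick up a negation or modal that belongs to a neighbouring clause, so tallies obtained this way would still need the kind of hand-checking applied to the sample as a whole.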
43. When we are describing real data, the interaction between semantic
and pragmatic constraints is often so intense that there are only weak
indicators of which is which. How can we, say, keep our semantic understanding
of a general noun like ‘circumstances’ apart from our pragmatic understanding
of the force entailed? The constraints from knowledge of world and society,
which ‘mainstream’ linguistics sought to detach from the constraints on
language (cf. § 2, 6, 11f, 14, 19f), are absolutely crucial for interpreting
such data, but are by no means easy to formalise as ‘rules’ (§ 56). We appear
to be dealing with numerous local interactions among constraints that support
sophisticated higher-level organisation, as in a complex system with
distributed parallel processing (cf. Rumelhart, McClelland, et al. 1986;
Beaugrande, in preparation). What appears to be a single constraint in an
actual context might rather be a pattern of such interactions. If so, the standing internal constraints upon the
language, e.g. that the English infinitive be formed from ‘to’ + non-inflected
verb, are like the ‘frozen islands’
in a complex system and continually interact with emergent external constraints from world and society during
discourse, e.g. that something is or is not ‘warranted’ by a combination of
situation (e.g. ‘circumstances’, ‘conditions’) + sufficiency (e.g. ‘enough’,
‘sufficient’) + gradable modifier (e.g. ‘serious’, ‘severe’) (cf. § 46, 53, 68).
This interaction supports a convergence
among the various modes of data and a consensus
among speaker and hearer or writer and reader. If, as formalist linguistics
sought to do, we detach language from the constraints from world and society
and retreat from real data, the emergent constraints get diluted or lost, and
we face the awesome task of trying to ‘freeze’ the entire system — a sort of
‘cryogenic linguistics’ building a ‘cryogenerative grammar’. Convergence and
consensus recede, and the data begin to appear vague and ambiguous, sending us
off in search of complicated formal rules which, being devised in a relative
vacuum, are naturally arbitrary and ponderous (cf. § 39).
44. Moreover, the emergent
external constraints may be quite flexible about formal positions. They can
generate rich strands of semantic relatedness among items at various locations
in the sequences showing up in our data lines. In some lines, we encounter
items together that might be said to belong to the same semantic field, e.g.
‘chilly - thermal’, ‘economy - interest rates’, ‘shortage - embargo’, or
‘slowing - easing’. In other lines, we find the ‘attraction’ of a specific item constraining a general one. In ‘forward attraction’, the specific comes
first and specifies the general after it, e.g., in ‘alcohol - taxes’ (hence not
value added taxes), ‘degenerating trees - specialist’ (hence not an eye
specialist), ‘medical - drugs’ (hence not psychedelics), ‘violence - security’
(hence not a bond), ‘worshippers - huge edifice’ (hence a church or shrine). In
‘backward attraction’, the general
comes first and gets specified further on, e.g. in ‘declines - recession’,
‘operation - intensive care’, or ‘sites - custodian’; in cases like ‘air leaks
- military interception’ and ‘inclusion - wheelchair’, the specific emergent
constraints run counter to the standing constraints on the general item, i.e.,
an ‘air leak’ being in a sealed container, or ‘inclusion’ being ‘making
something part of a larger thing’ (Collins
COBUILD English Language Dictionary, p.736). In either direction, the
formal distance between the items can vary quite freely.
45. Should these data be considered ‘purely semantic’ when so much
depends on our pragmatic knowledge of the situations in which people say that
things are or are not ‘warranted’? Should uses like ‘air leak’ and ‘inclusion’
be classed as semantically deviant or deficient because they go against the
standing constraints, even though we can readily understand them if we consider the
speaker’s motivations, e.g. to arouse the impression that a ‘no-fly zone’ in a
war is virtually air-tight, or to avoid a more usual but harsher term like
‘confinement’? Should we devise ‘semantic rules’ that first compute the typical
meaning and then go on to compute the deviant meaning? How about cases where
the data seem plainly misleading, e.g.:
(22) < as a major threat sufficient to warrant a
pre-emptive strike of their own. >
(23) < stories of ill health that appear to warrant
surgical intervention. Frequently >
This
‘major threat’ in (22) differs from the standing constraints on the familiar
speech act of ‘threatening’ in that the agent may have done or said nothing
implying any intention to cause harm. Yet our social knowledge is quite
familiar with the high-tech jargon from the age-old military and political
discourse that disguises aggression as defence. Or, the ‘appear’ rather than
‘appears’ in (23) oddly suggests that surgery is to be performed on ‘stories’
or ‘story-tellers’ rather than on the people in ‘ill health’; but
world-knowledge prevented both the text producer and the news editors from
noticing this suggestion.
46. The overall conclusion would be that the familiar linguistic
‘levels’ or ‘components’ are designations not for neatly distinct sets of formal abstract data but for sets of functional standing constraints operating across sets of real
data and generating emergent constraints. Since this process supports the
convergence among the various modes of data and the consensus among speaker and
hearer or writer and reader (§ 43), a linguistic description can itself attain
convergence and consensus not just by sorting data into separate piles, one for
each set, but by assessing the interactions among these sets (§ 50). Even my
brief demonstration should suffice to show that the form of the data may seem
highly variable and at times utterly idiosyncratic unless we continually
examine the relevant functions. Formulating ‘formal rules’ that draw a rigorous
border between what can ‘warrant’ what versus what cannot in any ‘well-formed’
English sentence only leads to finicky debates over examples and
counter-examples and misrepresents the ‘competence’ of English speakers (§ 53).
They do not know what can and cannot be ‘warranted’ for once and for all, but
they do know what sorts of things people are likely to say are or are not
‘warranted’ and why; and that is the knowledge put to use by the people who
produce and understand real data.
C. Some implications of
corpus linguistics for linguistic theory
47. Our situation today recalls a complaint once
voiced by Saussure (1966 [1916]: 106): ‘It is one thing to feel the quick,
delicate interplay of units and quite another to account for them through
methodical analysis’. Corpus data reveal far more numerous and more ‘delicate
interplays’ than Saussure, with his deep mistrust of ‘actual speech’ (§ 1),
could have imagined, and they are pressuring us to develop suitable methods of
analysis and a more functional and realistic theoretical ambience (cf. Baker et
al. [eds.] 1993). In this final section, I shall explore some factors bearing
upon such a theoretical ambience and relate them to the theoretical problems
aired in section A.
48. Against the backdrop of my forceful articulation
of these problems, it may seem odd if I sound optimistic. But the chances for
‘mainstream’ linguistics to make major progress in coverage, convergence, and
consensus are best if we can properly learn from the problems in the past. The advantages
could be substantial:
(a) We would no longer need to retreat from corpuses
of real data on the grounds that they are ‘finite’ and ‘accidental’ (§ 12ff,
19).
(b) We would be authorised to formulate constraints on
real utterances without having to declare which particular ‘level’ or
‘component’ they belong to; or to worry if they are ‘purely linguistic’ rather
than cognitive or social (§ 8, 11, 20, 34-46).
(c) We would no longer need to shield technical
fictions like ‘well-formedness’, ‘grammaticalness’, or ‘ideal speaker-hearer’
against ‘actual speech observed’ by means of elaborate dichotomisations between
‘competence’ versus ‘performance’ or between ‘deep’ versus ‘surface’ (cf. §
16-19).
(d) We would no longer need to strive toward the a
priori criteria of rigour, abstractness, generality, and so on, set down by
formalism, or else to apologise and defend ourselves for not doing so (§ 5).
(e) Best of all, our work would be judged not by its
degree of formality but by its relevance
and ecological validity — how we can
contribute to a productive understanding of the human situation (§ 5). Real
data are the most propitious source for examining issues such as the discourse
of authority persons, e.g., when they magisterially declare what economic or
military actions are and are not ‘warranted’ (§ 40). And for most purposes,
real data can be given descriptions in ordinary language that speakers of
English can generally understand and make use of (§ 73).
In return, we incur the hard
work of engaging with large amounts of real data in all their grainy immediacy
(§ 25, 56).
49. The time seems opportune to reformulate the tenets
for a programme contrasting with the one stated in § 1:
(a) Language is a phenomenon integrated with human knowledge of the ‘world’ and society.
(b) A language constitutes a system defined both by internal, language-based constraints and by
external cognitive and social
constraints.
(c) The system should be described in terms of the conditions under which
speakers use it.
(d) The description of a language should be stated at varying degrees of generality between the
entire language and the specific discourse context, depending on what the data
can support.
These tenets might seem to make the job of describing language messier
and less focused, but they may in fact finally enable major progress.
50. Within such a programme, the concepts and methods
of ‘mainstream’ descriptive and generative linguistics would not be abandoned
but resituated in differently conceived projects. We would use the established
linguistic methods and terms for identifying and describing the various aspects
of our data in respect to the several ‘levels’ or ‘components’, as illustrated
in § 34-46, while not insisting upon deciding which constraint belongs where.
On the contrary, we assume that the interaction among sets of constraints
supports the convergence of ‘levels’ and the consensus among speaker and hearer
or writer and reader (§ 43, 46). Moreover, we shift our own search for
consensus onto a higher functional and pragmatic plane: we assume a consensus
among the language users who have produced our corpus data, and we exploit
those data to reach our consensus about what their consensus might consist of
(cf. § 9, 74).
51. Once we have shelved the search for ‘language by
itself’ (‘langue’, ‘competence’, etc.), we have also dissolved the central
rationale for ranking formalism over
functionalism (§ 5). The ‘formalist
trade-off’ whereby formality is financed by lowering coverage (§ 6) is now reversed. The very wide coverage of corpus
data will allow us to rationally determine the limits on the degrees of
formalisation we should impose upon language data, and thus upon our
theoretical models (cf. § 55; Beaugrande, in press). This factor underscores
the difference between the use of computers in corpus linguistics versus in all
those methods of ‘computational linguistics’ that require the conversion of
natural language into formal language (Sinclair 1994). Such conversions
necessarily erase the sets of rich emergent constraints of just the types that
large corpus data allow us to uncover (cf. § 43f, 46).
52. The ‘trade-off’
whereby formality is financed by also lowering convergence and consensus
(§ 6) is reversed too. A careful functional description of corpus data can be
confidently expected to raise them.
Reading back over my description of ‘warrant’, I noticed little that looked
genuinely contentious as long as I stayed close to the data. Yet the results
did not just confirm my previous intuitions (cf. § 24f, 27). Before working
through the data, I had only an implicit, fuzzy notion of what the verb
‘warrant’ would mean, partly because I hardly use it myself and find it a bit
stuffy or pompous. Confronted with data, my intuitions about the ‘semantics’ of
the item turned out to be too specific
in that I had immediately associated it with legal or quasi-legal terms like
‘search’ and ‘arrest’, which in fact collocate mainly with the noun, plus
‘investigation’, ‘trial’, and ‘punishment’; I failed to realise the importance
of the less specific but still common collocations such as ‘situation’. Yet
paradoxically, my intuitions were also too
general in that I also did not realise what sorts of situations warrant what
sorts of responses, e.g. knowledge and knowledge-gathering. Some of the data
disclosed items I would never have thought of as being ‘warranted’, such as
‘town’, ‘overeating’, or ‘space walk’, but I don’t claim such data are
ill-formed, ungrammatical, deviant, or deficient, because I can readily
interpret them in respect to activities (e.g. hiring, building, etc.) people
might not perform until a suitable occasion ‘warrants’ it.
53. A problem seems to arise here: how can I assert
that a corpus leads toward consensus and at the same time admit that it has
been showing me facts that did not fit my intuitive expectations? Evidently,
the degrees and modes of consensus can
fluctuate considerably. Intuitive consensus is strongest for the standing constraints on a language; we
all agree that the English infinitive is formed from ‘to’ + non-inflected verb,
though a few finicky users insist, without intuitive consensus, that no other
words can be interposed (the so-called ‘split infinitive’). But intuitive consensus
is naturally weaker for the emergent
constraints, which only apply when the occasion arises. So when we invent
sentences that are unlikely to be said and don’t appear in a very large corpus,
such as the 24 samples in Chomsky's Aspects,
our intuitions readily get out of their depth and become unreliable; they are
being asked to generate consensus on an inappropriate
level, and it’s hardly surprising when they don’t (cf. § 57). For example,
our intuition that a usual relation between the subject and the direct object
of ‘warrant’ is that between action and reaction, knowledge to
knowledge-gathering, or problem to problem-solving (§ 37), is quite adequate
for making sense of the real data from the BoE just because it is not rigorous
or formal. It is not adequate for
deciding the ‘well-formedness’ of invented sentences like ‘the man warranted
the ball’, ‘John is warranted to please’, or ‘the mat warranted being fatly sat
on by the cat’ without any real contexts; they don't seem very natural or sensible, but whether or not they are part of the English language is an inappropriate question.
54. I argued in section A that a stagnation in
coverage, convergence, and consensus began to grow acute during the move from
morphology to syntax and in the ensuing retreat from patterns in recorded
fieldwork data to a shadowy domain of ‘underlying structures’ and ‘rules’
intended to reconstruct the form of invented data (§ 11ff). Yet numerous
linguists accepted such hidden costs because formalism promises stability,
determinism, and rigour, and justifies a withdrawal from the rich and messy
contexts of human interaction (§ 19). A large corpus, in contrast, promises a
steep rise in coverage, convergence, and consensus by keeping us fairly close
to those contexts. Instead of a notion of language with formal units and structure-building rules at the centre (§ 12), we
have a notion with pragmatics at the
centre, so that the first ‘facts’ are the documented speech acts of producing the real data that the corpus offers us
(cf. § 68). The semantic and syntactic data are functionally framed within
those acts and need to be described as such (§ 28).
55. The notion of ‘well-formedness’ could be
reconceived (cf. § 48). Its grounding would no longer be sporadic, small-scale
applications of intuition or introspection to handfuls of invented sentences,
but a concerted large-scale empirical sorting to explore the greater or lesser
consistency of certain formal patterns in a given language. With a suitably
reliable parser, automatic searches and tabulations would not be unmanageably
laborious and could help us determine which syntactic patterns are genuinely
based on standing constraints. The end-result would revise the
‘well-formedness’ of formalist linguistics in at least five ways:
(a) It would be conceived not as a purely formal closed system operating on
rigorous internal criteria, but as an adjunct
layer of organisation supported by a
functional open system (§ 28).
(b) ‘Formality’ would not be uniform but would register clines
and gradations, ranging from the ‘frozen islands’ that formalism has
focused upon, e.g. article before noun, over to subtly flexible word-fields,
e.g. the mutual ordering of adjectives before a noun (e.g. ‘major new
dictionary’ versus ‘new major theme’) (§ 43).
(c) There would be no
sharp border where ‘well-formedness’ stops and ‘ill-formedness’ starts, but
a gradual shading away into combinations that hardly occur because they are not
very natural or sensible (§ 25, 53, 56), e.g., to say that ‘the government
warrants a change in the electorate’ rather than the other way around (not a
bad assessment of the administration of Jimmy Carter, though).
(d) ‘Well-formedness’ would not be the central object of linguistic description
independent of any corpus of real data. Whether the consistent patterns in
a given set of data are due to ‘well-formedness’ or due to some other factor
would no longer be settled by speculative disputes but by the empirical results
of sorting corpus patterns.
(e) The sentence
would not be the axiomatic, obligatory unit but an empirically described
pattern or pattern-set for organising clauses and clause complexes (cf.
Halliday 1994). We could use the corpus to explore which types of words or
collocations tend to be used for beginning or ending a sentence. In my data,
for instance, ‘warrant it’ tends to go at the end of sentences, probably
because it is a conceptually well-delimited combination with an air of finality
— an action will be taken if the situation ‘warrants it’, where the ‘it’
handily takes in both the action and its context with respect to the
situation.
In this new guise, ‘well-formedness’ would no longer impede coverage,
convergence, and consensus.
56. After such a revision, we would make rather different uses of our
data. We would not focus our attention on contrasting the ‘underlying syntactic
structure’ of invented sentences that look the same on the surface, e.g., ‘John
is easy to please’ versus ‘John is eager to please’ (cf. § 13, 25). Such
demonstrations lead us to complain, as Chomsky (1965: 24) has, that ‘surface
structure’ is ‘unrevealing’ and ‘hides underlying distinctions of a fundamental
nature’. The corpus data, on the contrary, reveal so much that it’s hard to
take it all in. Genuine ambiguity in underlying structures seldom persists in
real data; witness this line:
(24) < shampoos are effective enough to warrant
only one shampoo per wash.>
A manufacturer is probably telling us that certain ‘shampoos’ as substances ‘warrant only one shampoo’ as
one act of use during the total
‘wash’. We don't respond by parsing out the sentence two ways and then
rejecting the alternative structure wherein some acts of use ‘warrant only one’ substance.
The latter structuring is quite grammatical
but pragmatically less sensible or natural, because manufacturers are more
likely to praise the substances they sell than the diligence of the customers
using them, and because people are more likely to use one shampoo several times
in a row than several shampoos at the same time. Such ‘world-knowledge’ seems
horribly grainy and vague when you try to ‘formalise’ it into ‘features’ and
‘rules’, but is cheerfully tidy and precise when you put it to use (cf. § 25,
48).
57. Similarly, we would not need to work out slippery
exercises in inventing sentences that selectively violate just one type of
constraint, which regularly leads to data that are plainly not sensible or
natural (§ 15, 53), e.g.:
(24a) *The effective is shampoo enough only warrant to
wash one. (syntactic violation)
(24b) ?These shampoos are blue enough to
wash only one warrant. (semantic violation)
(24c) ?Shampoo, warrant only one wash, or
you’ll get a pre-emptive strike! (pragmatic violation)
It seems safe to predict that none of these would appear in a corpus of
English, however large; but neither would infinitely many other variations, and
for reasons that would be rather tedious and pointless to label, let alone to
state as formal rules.
58. Corpus studies can bring us back toward the
tradition of fieldwork, from which
formal linguistics had retreated (§ 11f, 14, 18, 54). Admittedly, the corpus is
in a language we already know as ‘insiders’; and only the spoken part of the
corpus was actually recorded in the original situation, while the written part
was obtained through varying degrees of mediation, mainly through mass media.
The observable substrate of interaction is thereby greatly diluted, but in ways
that reflect the conditions of mass communication: the original producer(s) of
the text may not even be known to the receiver, and the text is designed for a
large, impersonal audience not too different from us in our roles as citizens
rather than as professional linguists (cf. § 64, 67).
59. In contrast, building computer corpuses of unwritten
and previously undescribed languages would be quite slow and laborious in the
early stages, where the reliability of transcriptions might be doubtful, e.g.,
when the language has a complex tone-system, as in Aymara of Bolivia (Hardman
1981). Such a corpus could never reach the size of the Bank of English, but
might well be usefully analysed with similar data-handling strategies and
software. An intermediary domain would be corpuses of regional varieties of a
well-known language, such as the set in the International Corpus of English
(ICE) co-ordinated by Sidney Greenbaum at University College London (cf.
Greenbaum 1991, 1992, 1993). So far, the corpuses of one million words each are
fairly complete for England (directed by Greenbaum), New Zealand (directed by
Janet Holmes and Laurie Bauer), and Singapore (directed by Anne Pakir and Paroo
Nihalani); further corpuses are in various stages for Cameroon, Canada, the
Caribbean-Jamaica, Hong Kong, India, Ireland, Kenya-Tanzania-Zimbabwe, Nigeria,
the Philippines, South Africa, and the US.
60. Corpus work will doubtless reshape the data-handling strategies enumerated in §
3. Until I can interview more people who work with corpuses, I shall merely
describe the ‘pragmatics’ of data-handling in my own activities. The usual
starting point would be to select one or more key words or collocations we
intuitively expect to be interesting. We can then read through the lines
returned on the display, along with frequency lists, positional frequency
tables, and left- or right-positioned alphabetic sortings, to infer what the
interesting aspects might be (§ 31ff). Going through the data lines and looking
for what they might have in common can steadily refine our sense of how to collate them and what to generalise, e.g., whether the
‘warranting’ typically entails some authoritative or institutional force
relevant to what kinds of situations ‘warrant’ what kinds of actions. We are on
safer grounds than we would be without a corpus for balancing the general with
the specific and for regulating the specificity of our generalisations, e.g.,
that it is much less typical for a good situation to ‘warrant’ being commended
than for a bad one to ‘warrant’ being amended. Also, we can assess the ‘depth’ of the relations between the left-hand
subcontexts and right-hand subcontexts as shown in the positional frequency
table (§ 32) and in the Appendices at the end of the paper: the shallower, more
lexical ones like ‘evidence + trial’ versus the deeper, more semantic ones like
‘problem + problem solving’.
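The data-handling steps just described (line displays, positional frequency tables, and left- or right-positioned sortings) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `kwic`, `positional_frequencies`, `sort_left`, and `sort_right` are hypothetical, the tokenising is a bare whitespace split, and the three input lines are simply samples (24), (25), and (26) quoted in this paper, not a query against the Bank of English or its actual software.

```python
from collections import Counter

def kwic(tokens, node, span=4):
    """Return (left, node, right) context tuples for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def positional_frequencies(lines):
    """Count words at each offset from the node (-1 = immediately left)."""
    table = {}
    for left, _, right in lines:
        for k, word in enumerate(reversed(left), start=1):
            table.setdefault(-k, Counter())[word.lower()] += 1
        for k, word in enumerate(right, start=1):
            table.setdefault(k, Counter())[word.lower()] += 1
    return table

def sort_right(lines):
    """Alphabetic sorting on the right-hand context, nearest word first."""
    return sorted(lines, key=lambda ln: [w.lower() for w in ln[2]])

def sort_left(lines):
    """Alphabetic sorting on the left-hand context, nearest word first."""
    return sorted(lines, key=lambda ln: [w.lower() for w in reversed(ln[0])])

# Samples (24), (25), (26) from this paper, used here as a toy 'corpus'.
sample = (
    "shampoos are effective enough to warrant only one shampoo per wash . "
    "not bad enough nor predictable enough to warrant a mid-season break . "
    "something terrible must have happened to warrant God's anger ."
)
tokens = sample.split()
lines = kwic(tokens, "warrant")
table = positional_frequencies(lines)

# Display the lines aligned on the node word, sorted by right-hand context.
for left, node, right in sort_right(lines):
    print(f"{' '.join(left):>40}  {node}  {' '.join(right)}")
```

Calling `kwic` again with a larger `span` gives a crude version of ‘recontextualising’ a line, though a real corpus tool would return the source text rather than a wider window of tokens.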
61. Our rarefying
is situated chiefly in not having the communicative situation at hand for
observation, a drawback applying most severely to the spoken portion of the
corpus (cf. § 58). It would be desirable, though horribly expensive, to maintain
a video corpus for at least some of
the spoken material, which would enable us to correlate the regularities that
do tend to leave evidence in transcriptions with ones that do not. We could
then add relevant types of commentary to future transcriptions, e.g. about
facial expressions.
62. Like fieldwork, corpus data militate against the decontextualising that has been favoured
during the search for ‘language by itself’. If a line display appears to have
been unduly decontextualised, we can ‘recontextualise’
it at the touch of a button. Here too, we stand to gain consensus on points
that could be disputed for briefer samples. For instance, I was perplexed by
this line:
(25) < not bad enough nor predictable enough to
warrant a mid-season break. >
‘Unpredictable’ seems more
fitting alongside ‘bad’ if, as I at first intuited, the performance of an
athlete or team might call for a ‘break’. But when I accessed the source text,
the missing subject was (as you may have guessed) ‘British weather’ — you know
it's ‘bad’ but not how bad or when.
63. Naturally, the role of introspecting is dramatically reduced and constrained. The data
have already passed the introspection of the text producers and, in public or
print media, that of editors as well (§ 25). Our task now is to explore how our
own introspecting can help us understand why
these data did pass, e.g., whether because the lexical items are so compatible,
say ‘evidence + trial’, or because the situation makes the action seem sensible,
say, when ‘degenerating trees warrant specialist attention’, which is an utterly
improbable combination in lexical terms alone. Moreover, you can test your
introspections by predicting the words or concepts before and after the line on
display and then calling up the text to check, as for (26-29) (displayed line in
pointy brackets, non-displayed material in square brackets). I correctly
predicted ‘something’ for (26) (syntactic place-holder to precede the
adjective) and ‘improvements in’ for (27) (semantically quite plausible if ‘aid
is cut’); but I could predict only a negation plus conceptual grouping for (28)
such as ‘not /having/requiring/setting/imposing’, and still less for (29).
(26) [something] < terrible must have happened to
warrant God’s anger.<t> Finding the flat >
(27) [the administration said the improvements in]
< Costa Rica’s economic condition warrant the cut in aid, which the country
>
(28) [Anyone can help by not setting] < age limits
for jobs when jobs do not warrant them. <LTH> As local government >
(29) [a policy that can be relied upon to create] <
jobs overall is rare enough to warrant no apology.<p> The IIE’s estimate
>
Eventually, we might gain some reasonable estimate for the reliability
of intuitive introspections among speakers, as well as for the degrees of
predictability, which have long been a central and problematic issue in
information theory.
64. Finally, the role of consulting informants is redistributed. First, the data themselves
put us in indirect contact with a vast population of ‘informants’ who produced
the data, and our records make it possible in many cases, if laborious, to
enter into direct contact with them if we are really stumped about what they
meant to say. Second, we can easily recruit native-speaker informants quite
similar to the ones who produced the data or to whom the data were addressed,
e.g., readers of newspapers and magazines, to make judgements, predictions, and
so on about the data. Third, we still occupy the role of informants ourselves when
we interpret the data, though, I have argued, we are much better positioned to
attain consensus than when we invent short, trivial sentences and then
interpret them (cf. § 20, 22, 27f, 39, 43, 46, 48, 50, 52-55, 62). After
working through the data I am a better informant on ‘warrant’ and feel I can
disagree even with prestigious dictionaries. When the Oxford English Dictionary (p. 931) lists a single source (30) for
an ‘obsolete rare’ sense of ‘to direct a person authoritatively; to command’, I
think the compilers were misled by ‘imperious’ to exaggerate the pragmatic
force, whereas the better attested sense of ‘to authorise a course of action’
(cited in § 30), or, in my terms, ‘make appropriate’, is quite adequate. I also
find the definition (31) in the Collins
COBUILD English Language Dictionary (p. 1640) incomplete, because ‘I’ll
warrant’ also carries the implication that you can’t point to actual facts —
compare (19-21) in § 34 as well as the COBUILD’s
own example, where what ‘not many people know’ is plainly a matter of
conjecture.
(30) But that imperious custome warrants it, Our
Author with much willingnes would omit This Preface to his new worke (Philip
Massinger, The emperour of the east, a
træge-comedie, 1631)
(31) You say ‘I’ll warrant’ when you want to indicate
you are fairly sure of what you are saying [...] e.g. ‘not many people know
that, I’ll warrant’.
Such cases indicate that when lexicography is to be driven by a large
corpus, many people who are not lexicographers can actively participate as
informants.
65. Indeed, lexicography may be transformed even in
its most central concepts, such as synonymy.
In fine detail, corpus data reveal very few synonyms,
because virtually every word collocates in its own way — a final vindication of
Saussure’s (1966 [1916]: 120) universal principle that ‘in language there are
only differences’, yet not in the language system (‘langue’) but in language
use (‘parole’), just where he staunchly refused to look for it. For example,
‘serious concern’ collocates pejoratively, whereas ‘serious consideration’
collocates amelioratively, even though we seem to have the same word ‘serious’
and two words with similar definitions, ‘concern’ being ‘marked interest or
regard’ and ‘consideration’ being ‘continuous and careful thought’ (Webster’s Collegiate Dictionary, pp.
172, 178), and ‘showing concern’ for somebody resembling ‘being considerate’ of
them. But we actually have here two kinds of ‘serious’: one in such
collocations as ‘serious problem’, i.e. ‘grave’, and the other in such
collocations as ‘serious intention’ i.e. ‘sincere’.[7]
Similarly, the collocations of ‘enough’ are pejorative twice as often (40
versus 20 occurrences, 10 of the 40 with ‘serious’), while those of
‘sufficient’ are ameliorative twice as often (14 versus 7 occurrences, 5 of the
7 with ‘evidence’). Evidently, the more formal term ‘sufficient’ seems more
appropriate when it’s something good ‘warranting’ or being ‘warranted’ than
when it’s something bad. I don’t see anything odd about ‘sufficiently serious’,
but I feel doubtful about ‘sufficiently unsatisfactory’ or ‘sufficiently
weak’, and I feel amused when Proust’s masochistic Baron de Charlus complains,
in the English translation, that the man he hired to whip him was
‘insufficiently insulting’.
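The proportions just cited can be checked with a small calculation. The sketch below assumes only the four counts reported in this paragraph; the labels ‘pejorative’ and ‘ameliorative’ were assigned by reading the lines, not by software, and the function names are hypothetical.

```python
# Counts reported in the paragraph above for 'enough' and 'sufficient'
# in the 'warrant' data; the evaluative labels are the analyst's own.
counts = {
    "enough": {"pejorative": 40, "ameliorative": 20},
    "sufficient": {"pejorative": 7, "ameliorative": 14},
}

def pejorative_share(tally):
    """Proportion of the labelled occurrences that are pejorative."""
    total = tally["pejorative"] + tally["ameliorative"]
    return tally["pejorative"] / total

def odds(tally):
    """Pejorative occurrences per ameliorative occurrence."""
    return tally["pejorative"] / tally["ameliorative"]

for word, tally in counts.items():
    print(f"{word:10s} share={pejorative_share(tally):.2f} odds={odds(tally):.1f}")
```

On these figures ‘enough’ leans pejorative by two to one and ‘sufficient’ leans ameliorative by the same margin; with totals of 60 and 21 occurrences, such proportions are suggestive rather than conclusive.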
66. In respect to the three ‘C-tests’ proposed in
section A, the coverage of a language
by a corpus like the BoE is still far from complete. Indeed, corpus work makes
us keenly aware that any set of data we can bank and display is at most a partial set, representative for a
vastly larger set of utterances and collocations. Our best prospects would be
to uncover a set of convergent
regularities extending far enough across the corpus to enable a consensus that
they also extend beyond it. But of course we can never conclusively prove that
they do, nor that data outside the corpus would not reveal still further
regularities. In view of the problems in ‘mainstream’ linguistics, we must be
wary of wishfully judging the data more regular than they justify; not just
‘the Lord’, as Daniel Rogers vowed (sample (18) from § 30) in 1642, but also
the corpus ‘warrants us to suspect the inconstant’.
67. Paradoxically, even the Bank of English, the
largest computer corpus of real data ever built, is still too small. It
confronts us with a fresh version of the complex ratios between what has been said versus what can be said, and between what people can understand versus what they have occasion to say. Much of the BoE,
and of many similar corpuses, is taken from public
discourse produced by specific types of people — such as literary authors,
journalists, media personalities, advertisers, and interest groups — intending
to communicate at a distance with a general audience who need to be motivated
to read or listen to what is being said to them. As a result, the corpus is
somewhat unbalanced in respect to informativity,
i.e., what people think is worth listening to or reading about. What people can
and do talk about in normal life is not necessarily what receives intensive
mass media coverage in newspapers and magazines. The BoE shows high frequencies
of depressing items like ‘death’, ‘kill’, ‘murder’, ‘massacre’, ‘shooting’,
‘robbery’, and ‘rape’, which — for reasons I have never understood — are
believed to be topics of universal interest. We also get a bulk of smarmy or
pretentious talk from politicians, military personnel, advertisers, and
entertainment ‘personalities’, who often don’t speak like ordinary people nor
let on just what they really think, especially not if the media are nearby. So
we get unduly broad coverage of the things they’re interested in, usually
‘sensitive issues’ like their own image and other people’s money. And we get a
hefty dose of careful grammar and printable vocabulary that sound cultivated
(‘inclusion in a wheelchair’, ‘revolutionary optimism’, ‘degenerating trees’)
or technical (‘military interception’, ‘peat extraction plant’, ‘scrappable
cars’). The corpus clearly needs a much larger contingent of everyday conversation,
which is unfortunately the costliest mode of data to accumulate in
computer-readable formats.
68. Still, as such corpuses continue to grow, the degree of approximation between their
coverage and the entire language will steadily improve, bringing us closer
to a general convergence and consensus. We can transcend the choice between the
notions of language being either a repertory
of units or else a repertory of rules
for constructing and arranging units (§ 12) by a notion of language being a
set of the standing internal constraints
designed to interact richly with emergent
external constraints from world and society during discourse (§ 43f, 46,
53). Both types of constraints apply to corpus data; yet the corpus would
surely be the most productive basis for eventually determining which type is
which.
69. Sinclair (1994) has aired the striking prospect
that we can survey the ‘parole’ or ‘performance’ on the horizontal lines of
displayed data, and can survey the ‘langue’ or ‘competence’ by scanning the
entire vertical set of lines (e.g. in my Appendices). Yet what we see is
actually the results of competence,
and extensive psychological and social research is needed to determine how such
a ‘competence’ can produce such results during actual discourse. The most
plausible account, to my knowledge, is that language is a dynamic system in
continual evolution, and its current organisation during a given discourse is
generated quickly and cheaply by interactions among local constraints (cf. §
43; Beaugrande, in preparation). This account is congenial to corpus work in
suggesting that much of the operation is fairly detailed and highly organised
despite the ease and speed of producing and comprehending the discourse. The
knowledge of language might be stored and accessed in multiple formats and would operate as a morphological and syntactic
grammar, a lexical repertory of words or collocations, a semantic array of
concepts or meanings, or a pragmatic directory of action and interaction
strategies, as suits the occasion (cf. § 71). The ‘multiplex’ design would be
ideal for interfacing the constraints among domains, though it would be hard
for us to reconstruct it in detailed models.
70. As the corpus continues to be described, the
description will eventually take on such huge dimensions as to burst the
confines of even the largest grammar-book or dictionary compiled so far. Just
the description of the constraints that relate to the verb ‘warrant’, which is
not a particularly common verb in comparison to, say, ‘cause’ or ‘call for’,
already seems extensive. What about all the other entries that would pass the
frequency cut-off in a corpus-driven dictionary like the 1987 Collins COBUILD English Language Dictionary
with 70,000 references, or merely the ‘most frequently used 2000 words in
English treated in exceptional detail’ (cover blurb) by that same compilation?
How could people make use of such an unwieldy description?
71. A reasonable answer might be: much as we now make
use of the corpus itself, by accessing modest, more general or more specific
displays not just of data but of data plus description. It would be highly
desirable to give the description multiple formatting capabilities similar to
the multiple representation I suggested for a person's knowledge of the
language (§ 69). The corpus could be accessed as a grammar, a guide to usage, a
special-purpose lexicon, a pedagogical tool, and so on. None of these
formattings could, or would need to, contain the totality of the description,
but each would be designed to expropriate what it needed from that totality,
e.g., when we want a survey of usage in a special field like theoretical
physics. Appropriate software could provide backup consultations in finer
detail if the user requested them, e.g., if our survey wanted to distinguish
popular books on physics from ones produced strictly for experts.
72. This mode of access would enable us to generate specific convergences of the
data that are relevant to a stated issue.
If we are designing a course in English as a Foreign Language for Special
Purposes to be included in a curriculum for physics students, we need to know
how the data converge for that discourse domain and not for, say, literary
English, which is often the central domain in traditional EFL programmes. With
a corpus, such programmes would have a systematic means of deciding what data
should be made the basis for instruction. They can also consult the recently
developed International Corpus of Learner English (ICLE) directed by Sylvia
Granger at the University of Louvain, which consists of learners’ essays from
eight language areas and which supports a large-scale empirical assessment of
the degrees of competence we can build upon (cf. Granger 1993).
73. A consensus about our own descriptive terms and
methods has not yet been attained, which is hardly surprising for a relatively
new field of research. Lively disputes can still be expected about how to label
and categorise the various constraints, but matters may be kept under control
if we can agree on some procedures. One procedure would be to retain well-known
terms of linguistics, as I have done in this paper. Another would be to respect
the interaction among types of constraints, e.g. semantic and pragmatic ones,
rather than trying to put them all into neatly separated piles (cf. § 34-46).
Yet another would be not to formalise the constraints as ‘rules’ but to
formulate them chiefly in everyday language that would be both well-suited for
checking with informants and ‘friendly’ to potential users of our descriptions,
such as textbook authors.
74. Would our own rising consensus as linguists
correspond to the consensus of speakers of the language (§ 9, 50)? The close
contact with real data suggests that we ought to do rather better here than a
non-corpus linguistics relying on invented data (§ 13, 15, 19, 34, 39, 53ff,
56). However, it is far from settled how general or exact the consensus of
speakers might be; this is another question upon which corpus linguistics might
finally shed some light. Sorting the corpus to display the diversified
varieties of native speakers of English would be a major advance for
sociolinguistics, as would the creation of corpuses of international varieties
cited in § 59.
75. Could our own consensus eventually change public
attitudes about English by making them more realistic and ‘data-driven’? This
question has been around since the start of modern linguistics, whose
descriptive surveys (e.g. Leonard ed. 1932) have long ago proven that general
English usage is quite different from what schools and self-appointed language
guardians claim. But public attitudes have remained largely unrealistic because
they offer the surest basis for discriminating against specific social groups
and their real usage — a factor intensifying today wherever other bases for
discrimination have been forbidden by law.
76. Probably, the impact of corpus linguistics will be
greatest through widely used reference works, especially corpus-driven
dictionaries such as the 1987 Collins
COBUILD English Language Dictionary and its successor due to appear in
1995. However, a substantial impact could also be achieved if a large body of
users can access the corpus for themselves, which is already possible (though
expensive) for the older 20 million word Birmingham corpus. Having multiple
access modes for respective uses, as proposed in § 71, would be crucial here.
77. Wide access would at least enable influential
groups, such as language teachers, to get accurate data about usage. So far,
however, the implications of corpus linguistics for language teaching are just
beginning to be explored. Two contrary scenarios are readily conceivable. On
the gloomier side, the fine details of ‘idiomatic’ English revealed by a large
corpus make the tasks of teachers and learners look much harder than in the older
approaches that just tried to teach pronunciation, vocabulary words, and a
smattering of morphology and syntax, largely leaving the semantics and
pragmatics to take care of themselves. Would learning an item like the verb
‘warrant’ entail learning all the baggage it seems to carry about in actual
usage, e.g., its more frequent and characteristic collocations? Perhaps this
problem appears harder than it would be if we could finally lay to rest the
assumption of so many teachers, learners, and textbook authors that language
operates as a set of words plus a set of formal patterns into which we plug the
words. If the chief unit of language use and language learning is recognised to
be the collocation, then learning
words would be systematically correlated with learning the ‘company they keep’
(§ 26), and the latter kind of learning would not be seen as some onerous
supplementary job.
78. On the brighter side, corpus data will surely
improve our criteria for selecting the more frequent and useful words and
collocations, and for formulating the ‘notional’ concepts to be covered, e.g.,
a class of ‘knowledge-gathering activities’. Also, a large corpus could be
strategically exploited to decide what to put into a small corpus compact and
cheap enough for learners of English as a second language to access regularly
on modest personal computers or classroom workstations, another trend that has
already begun. This proposal recalls C.K. Ogden’s ‘Basic English’, except that
ours would be corpus-driven and would allow us to generate different versions
at many different degrees of ‘basicness’.
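The selection procedure just proposed can be given a rough computational shape. What follows is only a minimal sketch under my own assumptions, not the COBUILD software or any existing tool: the function names `select_basic_vocabulary` and `top_collocates`, the `coverage` threshold standing in for a degree of ‘basicness’, and the fixed `span` of words around the node are all hypothetical choices made for illustration.

```python
from collections import Counter


def select_basic_vocabulary(tokens, coverage):
    """Pick the most frequent word types until they account for at least
    the fraction `coverage` (0.0 to 1.0) of all running words.
    A lower `coverage` yields a more 'basic' word list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    selected, covered = [], 0
    # Most frequent first; ties broken alphabetically so results are stable.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= coverage:
            break
        selected.append(word)
        covered += freq
    return selected


def top_collocates(tokens, node, span=2, n=3):
    """Return the `n` words found most often within `span` positions
    to the left or right of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Toy data only; a real application would read tokens from the large corpus.
sample = "enough to warrant a visit enough to warrant it".split()
basic = select_basic_vocabulary(sample, coverage=0.5)
company = top_collocates(sample, "warrant", span=2, n=2)
```

Raising or lowering `coverage` would generate the different versions at different degrees of ‘basicness’, and the collocate list for each selected word would indicate which of ‘the company it keeps’ belongs in the small corpus along with it.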
79. In sum, I have tried to suggest some ways in which
the advent of large corpus linguistics can offer, for the first time in many
years, a genuine opportunity to reorganise the pragmatics of doing language
science, and its neighbours which depend upon it, on a new and more realistic
basis. In an age when public attitudes about language are unrealistic and
discriminatory, and when our beleaguered abilities to talk and listen to each
other profoundly affect our collective chances of survival, surely the heavy
investment in corpuses that get a broader yet more exact view of the population
of language users is, well — strongly warranted.
References
Baker, Mona, Gill Francis, & Elena
Tognini-Bonelli, eds. 1993. Text and technology. Amsterdam: Benjamins.
Beaugrande, Robert de. 1991.
Linguistic theory: The discourse of fundamental works. London: Longman.
Beaugrande, Robert de. in press. Function and form in
language theory and research: The tide is turning. Functions of Language 1/2.
Beaugrande, Robert de. in preparation. New foundations
for a science of text and discourse. London: Longman.
Bierwisch, Manfred. 1965. Poetik und Linguistik. In
Helmut Kreuzer & Rul Gunzenhäuser, eds., Mathematik und Dichtung. Munich:
Nymphenburger, pp. 49-66.
Bloomfield, Leonard. 1933.
Language. New York: Holt.
Chomsky, Noam. 1957.
Syntactic structures. The Hague: Mouton.
Chomsky, Noam. 1965. Aspects of the theory of syntax.
Cambridge: MIT Press.
Firth, John. 1968. Selected
papers of J.R. Firth 1952-1959, ed. Frank R. Palmer. London: Longman.
Francis, Gill. 1993. A
corpus-driven approach to grammar. In Baker et al. eds., 137-156.
Granger, Sylvia. 1993. The
International Corpus of Learner English. In Jan Aarts, Paul de Haan, and
Nelleke Oostdijk, eds., English language corpora: Design, analysis,
exploitation. Amsterdam: Rodopi, 57-69.
Greenbaum, Sidney. 1991. The
development of the International Corpus of English. In Karin Aijmer & Bengt
Altenberg, eds., English corpus linguistics. London: Longman, 83-91.
Greenbaum, Sidney. 1992. A
new corpus of English: ICE. In Svartvik, ed., 171-179.
Greenbaum, Sidney. 1993. The
tagset for the International Corpus of English. In Clive Souter & Eric
Atwell, eds., Corpus-based computational linguistics. Amsterdam: Rodopi, 11-24.
Halliday, Michael. 1994. An
introduction to functional grammar (second revised edition). London:
Arnold.
Hardman, Martha. 1981. The Aymara
language in its cultural and social context. Gainesville: Univ. of Florida
Press.
Leonard, Sterling, ed. 1932. Current English usage.
Chicago: National Council of Teachers of English.
Louw, Bill. 1993. Irony in
the text or insincerity in the writer?: The diagnostic potential of semantic
prosodies. In Baker et al., eds., 157-176.
Rumelhart, David, James
McClelland, et al. 1986. Parallel distributed processing: Explorations in the
microstructure of cognition. Cambridge, MA: MIT Press.
Saussure, Ferdinand de. 1969
[1916]. Course in general linguistics, transl. Wade Baskin. New York:
McGraw-Hill.
Sinclair, John McH. 1992a. Priorities in discourse
analysis. In Malcolm Coulthard, ed., Advances in spoken discourse analysis.
London: Routledge, pp. 79-88.
Sinclair, John McH. 1992b. The automatic analysis of
corpora. In Svartvik, ed., 379-397.
Sinclair, John McH. 1994. Lecture on corpus
linguistics at the University of Vienna, June 1994.
Svartvik, Jan, ed. 1992. Directions in corpus
linguistics. Berlin: Mouton de Gruyter.
Appendix A. What does or doesn't do the 'warranting'
< return. This achievement seemed to warrant a couple of days' rest, a few >
< aggressions, each too small to warrant war. Because possession of the >
< House says these air leaks do not warrant military interception # I'm >
< office and that his behavior could warrant both criminal and political >
< job bias is widespread enough to warrant special protections for gay >
< Chevrolet Beretta. This car does not warrant particular mention but I thought >
< not yet enough scrappable cars to warrant widespread collection of old >
< the present circumstances do not warrant a change in the leadership # so it >
< Costa Rica's economic conditions warrant the cut in aid, which the country >
< some of the costs erm just wouldn't warrant it.<M01> Yeah. I suspect also I >
< disability is not felt sufficient to warrant his inclusion in the wheelchair >
< distress levels severe enough to warrant professional intervention, levels >
< On its own the documentary wouldn't warrant more than a 4: out-takes of films,
< insists there is enough evidence to warrant an investigation # One suggestion >[1]
food shortage is severe enough to warrant breaking the embargo # This report >
< stories of ill health that appear to warrant surgical intervention. Frequently >
< and historic sites are too small to warrant a full-time custodian. This can >
< these old homes are chilly enough to warrant guests wearing thermal long johns >
age limits for jobs when jobs do not warrant them. <LTH> As local government >
<< action if our national interests warrant it.<t> Credibility with our allies >
< the national objectives at stake warrant the deaths of US troops # Oil, >
< there's no special occasion to warrant overeating. Even if Grandma has >
< operation was important enough to warrant a middle-of-the-night briefing >
< these problems are too trivial to warrant a visit to the surgery. Your vet >
< medical or psychological problems to warrant using these drugs. Health experts >
t> His disciplinary record may soon warrant a lengthy ban from the game but, >
< revelations of an affair did not warrant my leaving the Government . <t> I >
< shampoos are effective enough to warrant only one shampoo per wash. <LTH>
< situation is not bad enough yet to warrant that type of appeal. We do not see >
< that the situation does not yet warrant the sending of those supplies # >
< as a major threat sufficient to warrant a pre-emptive strike of their own. >
< bark disease. Degenerating trees warrant specialist attention. Felling or >
< here are enough worshippers to warrant keeping the huge edifice open, so >
Appendix B. What is or is not 'warranted'
< age limits for jobs when jobs do not warrant them. <LTH> As local government >
< terrible must have happened to warrant God's anger.<t> Finding the flat >
< you do it. If your circumstances warrant it, consider an answering machine. >
< jobs overall is rare enough to warrant no apology.<p> The IIE's estimate >
< encouraging progress it did not yet warrant full-scale economic assistance. >
< er are large enough really er to warrant an assistant manager.<M01> Mm. >
< bark disease. Degenerating trees warrant specialist attention. Felling or >
< bad enough nor predictable enough to warrant a mid-season break. A four-week >
< operation was important enough to warrant a middle-of-the-night briefing # >
< the present circumstances do not warrant a change in the leadership # so it >
< juicy stuff about themselves to warrant a few column inches. MY LADYLUST >
< Boren says the charges, if true, warrant serious concern, but he stresses >
< enough of its own character to warrant serious consideration. <LTH> The >
< the national objectives at stake warrant the deaths of US troops # Oil, >
< liberalizing trade sufficiently to warrant its exclusion from the Bush >
< disability is not felt sufficient to warrant his inclusion in the wheelchair >
< House says these air leaks do not warrant military interception # I'm >
< stories of ill health that appear to warrant surgical intervention. Frequently >
< sufficient prima facie evidence to warrant an investigation into war crimes >
< a saint, but he has done nothing to warrant jail time, and I want my son home.>
< guide . If alive today, he would only warrant a mention in passing as someone >
< to such an extent that their numbers warrant their official extermination >
< that contemporary events did not warrant a revolutionary optimism, or even >
< there's no special occasion to warrant overeating. Even if Grandma has >
< s # though not sufficiently so to warrant a plagiarism suit . >t> Freed then >
< pressures severe enough to warrant highly restrictive policies in >
< six of those were serious enough to warrant prosecution.<M03> The+ I mean some >
< been registered certainly do not warrant the soldiers leaving their posts , >
< on Columbia' s cargo bay door may warrant a space walk for repairs by the >
< there wasn't enough evidence to warrant a jury trial # The U.S. supreme >
< these problems are too trivial to warrant a visit to the surgery. Your vet >
Appendix C. Relevant criteria
< said there was enough evidence to warrant the case going to court; he set no >
< not bad enough nor predictable enough to warrant a mid-season break. A four-week
< will not be long or bloody enough to warrant one.<p> Perhaps. Hitler said that >
< these old homes are chilly enough to warrant guests wearing thermal long johns >
shampoos are effective enough to warrant only one shampoo per wash. <LTH> >
< not quite heroic or doom-laden enough to warrant the sort of revisionist myth- >
< er are large enough really er to warrant an assistant manager.<M01> Mm. >
< figure, he is prominent enough to warrant direct attention without being on
six of those were serious enough to warrant prosecution.<M03> The+ I mean some
< food shortage is severe enough to warrant breaking the embargo # This report >
< knee, and was spectacular enough to warrant opening the glass doors and >
< conditions become so dire as to warrant it. Even then, the bank will take >
< believed the need was so great as to warrant the expenditure." A new sea wall >
< in Pakistan are so poor as to warrant teams not touring there. In the >
< were sufficiently accurate to warrant our setting up a research program >
< REELS are sufficiently expensive to warrant at least a little regular >
< the riots was sufficiently grave to warrant the extension # For National >
< sufficiently politically charged to warrant such a label. In other words, it >
< the declines are too modest to warrant the phrase recession, " said Lewis >
< these problems are too trivial to warrant a visit to the surgery. Your vet>
[1] Actually, the active corpus I used is around 169
million words, and fluctuates between
167 and 175 million, according to Ramesh Krishnamurthy, Development Manager at
COBUILD.
[2] Technically, I’d have to count Chomsky’s words and not just his
sentences; but since he invented them all, the term ‘corpus’ really doesn’t
apply anyway.
[3] The diamond brackets <> appear in BE data at the start and
end of a displayed line. They also enclose a variety of codes, of which the
following appear in the data I cite: for upper case, <CQ0>: start quotes;
<CQ1>: end quotes; <FCH>: font change; <FO + [digit]>: female
speaker with identification number; <LCH>: chapter heading; <LTH>:
start text (e.g. when going to the next item in a newspaper); <MO +
[digit]> male speaker with identification number; <SO>: sentence(s)
omitted; <ZF1>: start repetition; <ZF0>: end repetition; for lower
case, <h>: heading; <p>: start paragraph; <t>: start text.
[4] The
number of occurrences may be greater than the number of lines, since multiple
occurrences of an item in a line are not uncommon; a word may even appear in
its own collocational frequencies list (in the sense of § 32).
[5] To enhance the visibility in the cut-off, I
give here the frequencies for the full return of 392 lines.
[6] My displays alphabetise sometimes just the
noun and sometimes modifier + noun, depending on what seemed more revealing;
but this is merely a matter of making it handier to compare things and raises
no theoretical issues at this stage.
[7] Characteristically, the Collins COBUILD English Language Dictionary (p. 288) gives as the
first definition for ‘concern’ the ‘worry that people have about a situation’,
which was the more common use, whereas Webster’s
Collegiate Dictionary (p. 172) puts first the less common and presumably
more basic definition ‘something that relates or belongs to one’, and gives
‘state of apprehension’ only in third place.