Journal of Pragmatics 25, 1996, 503-535.

The ‘Pragmatics’ of Doing Language Science:

The ‘Warrant’ for Large-Corpus Linguistics

 

Robert de Beaugrande

 

[abstract]

 

The development of ‘mainstream linguistics’ in this century is briefly retraced to suggest that the original decision to describe ‘language by itself’ as opposed to ‘language in use’ favoured formalism over functionalism and eventually led to a severe impasse for three tests a valid science of language ought to meet: coverage of language data, convergence among the data being described, and consensus among linguists about how to proceed. The impact of large-corpus linguistics might resolve this impasse and accordingly raises the prospect of a fundamental reorientation of linguistic theory and of the ‘pragmatics’ of ‘doing language science’.

 

A. Testing the progress of ‘mainstream linguistics’

 

1. The term ‘mainstream linguistics’ is sometimes used to designate a language science pursuing a programme based on several generally agreed principles (survey in Beaugrande 1991), notably:

 

(a) Language is a phenomenon distinct from other domains of human knowledge or activity.

(b) A language constitutes a system defined completely by internal, language-based constraints.

(c) A language should be described apart from the conditions under which speakers use it.

(d) The description of a language should be couched in statements at a high degree of generality, if possible about the ‘rules’ for the language as a whole or even about the ‘universals’ for all languages.

 

Within the programme, the tenets interlock in projecting a free-standing and self-sufficient conception of language as a uniform, stable, and abstract system holding still while we are describing it, and separated from the ‘rich’ and ‘messy’ human contexts where it is encountered in ordinary life. Linguistics has thus undertaken to describe a theoretical construct of ‘language by itself’ (‘langue’, ‘competence’, etc.) situated at a safe distance from the empirical realities of ‘language in use’ (‘parole’, ‘performance’, etc.) during communication. Saussure’s (1966 [1916]: 9) defensive mistrust of actual ‘speech’ has proven highly influential (cf. § 17):

 

(1) speech is many-sided and heterogeneous [...] it belongs both to the individual and to society; we cannot put it into any category of human facts, for we cannot discover its unity.

 

2. If we imagine a ‘layer-cake’ with language as a system mediating between a culture’s knowledge of the ‘world’ on one side and the ‘society’ of speakers on the other, the programme of ‘mainstream’ linguistics implies detaching language and rolling away the other ‘layers’ (Fig. 1). Doing so discounts the constraints upon what people say due to what they are talking about and to whom.

The working hypothesis is that once detached, the language system will stand firm: complete and fully organised by its own internal constraints (cf. § 6, 19).

3. Yet what we encounter is always ‘language in use’; even the formal analysis of isolated sentence structures is a mode of use, albeit a peculiar and untypical one (§ 15, 53, 57). To get from ‘use’ over to ‘language by itself’, linguistics has foregrounded data-handling strategies, such as:

 

(a) collating: a large set of data samples are compared and contrasted to distil out what they have in common, e.g., which word types frequently occur with other types;

(b) generalising: certain aspects of the observed data are construed to be general ones, e.g., that the ‘Subject-Verb-Object’ order of a sample set of English sentences is a typical pattern for the language as a whole;

(c) rarefying: the ‘rich’ data as they were observed in spontaneous interaction are made ‘sparse’, e.g., by disregarding the personal authority of speakers;

(d) decontextualising: the data are taken out of the observed context and text type and treated as if they had occurred in isolation or could occur in a wide range of contexts, e.g., irrespective of the social status of individual speakers;

(e) introspecting: the linguists make estimations based on their own intuitions about the language, e.g., which sentences do or do not violate the ‘rules’;

(f) consulting informants: native speakers are given data samples of their language and asked to judge or rate them, e.g., to decide which of two versions of a sentence is ‘grammatical’ or ‘ungrammatical’.

 

These strategies project a second working hypothesis, namely that applying them will lead to a complete and valid description of a given natural language.

4. How might these two main working hypotheses be tested? If it is true that a language system is fully organised by its own internal constraints, and that these strategies can describe it as such, then the three key tests for progress would be steady cumulative rises in (a) the coverage of the language; (b) the convergence among language data discovered and described; and (c) the consensus among linguists about how to formulate the description. Yet if we apply these three ‘C-tests’ to mainstream linguistics, we find a rise only in some domains and a sharp fluctuation in others. What ‘pragmatic’ factors for ‘doing language science’ can have led to this outcome?

5. Probably the most influential factor was that the programme for describing ‘language by itself’ has naturally favoured formalism, the stance construing form to be the basis and framework of language — how entities are shaped or arranged. As we strive to ‘abstract’ language out of everyday contexts, the most stable and reliable substrate naturally appears to be the forms, the patterns of word-stems and suffixes or the patterns of phrases and sentences. This factor discourages functionalism, the stance construing function to be the basis and framework of language — what means are used toward which ends; functional aspects tend to be associated with language in use. So the ‘majority position’ in ‘mainstream linguistics’ has usually been that formalism confers high ‘scientific’ status and that its legitimacy can be taken for granted, whereas functionalism is ‘unscientific’ or ‘pre-theoretical’ and its legitimacy must be expressly justified. In this academic power structure, formalist research is not required to specify its relevance and its ecological validity — whether and how its findings contribute to a general and productive understanding of the human situation — whereas functionalist research is expected either to struggle toward the a priori criteria of rigour, abstractness, generality, and so on, set down by formalism, or else to defend itself for not doing so. So functionalism has been severely held back or has worked at cross-purposes by not following up on its insights and by compromising its own ecological validity in order to compete with formalism on the latter’s terrain.

6. In the long term, the development of mainstream ‘formalist’ linguistics reveals an ominous trade-off: the more formalised a theory of language and its apparatus of terms and formal notations, the less we can expect a steady rise on the three ‘C-tests’ of coverage, convergence, and consensus (cf. § 13, 18, 51f). This trade-off is not unduly surprising, given the robust fact that people, including linguists, are naturally much less skilled at ‘formalising’ natural language data than they are at speaking, hearing, reading, and writing them. After detaching language from the constraints of ‘world’ and ‘society’ and discounting functions in favour of forms, we are left with the huge task of reconstructing or inventing the formal and ‘purely linguistic’ constraints for every sort of regularity we may encounter (cf. § 18). The enterprise of linguistic formalism hinges on the assumption that there exists, for each natural language, at least one set of such constraints strictly separating what belongs to the language from what does not. Decades of formalist research have failed to identify any such set for any language; and the lack of progress on the three ‘C-tests’ strongly argues that no such set exists. Converted into a static and closed formal system, language does not stand firm, complete, and fully organised by its own internal constraints (§ 2); instead, it tends to skid out of control.

7. The ‘formalist’ trade-off has been somewhat obscured by the fact that it holds in differing degrees for the various domains of language. The three ‘C-tests’ are best met in phonology and morphology, which both offer concise methods for segmenting language data so as to isolate and classify minimal units. Phonology has the most clearly defined criteria in the articulatory events and locations that characterise the sound-units called ‘phonemes’, e.g., a ‘voiced dental stop’ such as /d/ produced when the vocal cords vibrate and the air flow is blocked by the teeth. The visual correspondence between many phonemes and written letters of the Roman alphabet also supports the ‘C-tests’, though it has not been made into a theoretical principle, since the description is strictly addressed to spoken language. Thanks to these clear criteria, linguistics soon provided descriptions of the repertory of sound-units in language after language, covering all the phonemes with impressively high convergence and consensus.

8. The success of phonology did much to entrench the concept of ‘language by itself’ being a uniform, stable, and abstract system, or rather a set of subsystems, usually called ‘levels’, each consisting of a repertory of minimal combinable elements. A complete description of a language would be the sum of the complete descriptions for each subsystem, supplied by linguists working within the tidy, ‘pragmatic’ division of labour reflected in academic specialisations, journals, conference proceedings, and course offerings.

9. Yet in morphology, the criteria are already less tidy. Convergence and consensus are fairly high for identifying and isolating the ‘morphemes’, aided again by the visual clarity of the data written down. The analyst segments the written data until no further meaningful subdivisions appear feasible — a method once introduced as ‘immediate constituent analysis’, which, if applied ‘in all observation of word-structure’, Bloomfield (1933: 209, 221) promised, would eliminate any ‘inconsistency of procedure’. His promise rested on a staunchly formalist mandate:

 

(2) Any utterance can be described in terms of lexical and grammatical forms; [and] any complex form can be fully described apart from its meaning in terms of the immediate constituent forms and the grammatical features [whereby these] are arranged (1933: 167).

 

Here, consensus is to be established by sheer stringency of method. Yet recalcitrant problems can arise that would not trouble phonology. We can easily reach a consensus about the human vocal apparatus; and after some simple demonstrations, most speakers will agree that they ‘know’ the repertory of phonemes in their native language, e.g., when they distinguish between voiced and unvoiced consonants. But it’s harder to agree in what sense speakers ‘know’ their native language as a repertory of its minimal meaningful forms. So we are on weaker grounds in claiming that our own consensus as linguists corresponds directly to the consensus of speakers of the language we are describing (cf. § 74). Consider the English morphemes borrowed from French, Latin, and Greek but no longer recognised by all contemporary monolingual speakers. Should a morphological description include not just the more obvious ones like ‘in-’ and ‘im-’ for negation alongside ‘un-’, ‘non-’, or ‘a-’ but also the more erudite ones like ‘pter’ (‘wing’) in ‘helicopter’, where speakers might instead identify the final ‘-er’ as an agentive suffix (compared, say, to ‘lawnmower’)? Doing so would oblige us to turn to language history and thus deviate from the ‘mainstream’ programme to describe language ‘synchronically’ in a single stage of its evolution.

10. Again in contrast to phonology, morphology is quite problematic in respect to coverage. In theory, the entire vocabulary of the language consists of ‘morphemes’ or clusters of these; how could we list them all? The ‘mainstream’ strategy has been to focus on the ones that form stable, compact classes, e.g. the set of all verb inflections, versus the ones that form unstable, open classes, e.g. the set of all verbs or verb stems. Only for the first type could full coverage be attained, whereas the second type could be consigned to the category of ‘lexemes’ to be described in the domain of ‘lexicology’.

11. Still, morphology has fared rather well on the ‘C-tests’ through its close engagement with the real data recorded in fieldwork on previously undescribed languages, which in my estimation has contributed by far the finest achievements in modern linguistics. The fieldworker’s overall task is to progress from being an ‘outsider’ in the community of speakers over to being an ‘insider’ who can speak the language at least well enough to interact with the community and eventually to describe the language. The fieldworker must reach a working consensus with the community or else expect to be misunderstood, ridiculed or ignored. The task is richly supported by ordinary constraints from ‘world’ and society (§ 2, 6), which always apply to real data but which may well not appear in a stringent formal description. To maintain a consensus with the community, the fieldworker can rely on an intuitive, perhaps unconscious grasp of such constraints to produce ‘proper’ utterances, whether their ‘propriety’ is ‘purely linguistic’ and can be ‘formalised’ within a ‘linguistic theory’ or is more cognitive or social.

12. The three ‘C-tests’ get shifted far more radically in the move from morphology to syntax. Units can no longer be isolated by using the criterion of ‘minimalness’; nor does it seem at all feasible to make a complete repertory of syntactic patterns. So phonology and morphology were replaced by syntax at the centre of linguistic theory, and the concept of ‘language’ itself got shifted from a ‘descriptive’ notion of a repertory of units (§ 8) over to a ‘generative’ notion of a repertory of rules for constructing and arranging units. Since the ‘rules’ plainly do not appear in the data, syntactic research relaxed the close engagement with real data as established in morphology fieldwork. Instead of segmenting the language sequences themselves, the task was to devise rules that would ‘generate’ the underlying structure of the sequences. Such a shift did not just leave formalism intact, but actually endowed language data with an enhanced but hypothetical formality. The effect was most striking when semantics was added onto syntax: to maintain the detachment of language from knowledge of the ‘world’ (§ 2, 6), meanings were described as arrays of underlying forms, often called ‘semantic features’.

13. At this point, the ‘formalist trade-off’ described in § 6 began to grow virulent. Disengaging from real data encouraged some influential generative linguists to turn to invented data, whose status as part of the language was certified not by its occurrence in the actual speech of native speakers but by the intuitive approval through introspection. The official rationalisation was that corpuses of real data are inadequate because they are ‘finite’ and ‘accidental’ collections of utterances, whereas speakers of a language can produce or understand many more utterances — presumably an ‘infinite’ set of them (Chomsky 1957: 15). This rationalisation had the labour-saving corollary that fieldwork is not very necessary or helpful: linguists need merely elicit invented data or even — a unique privilege among scientists — invent their own data when they are native speakers of the language, e.g.:

 

(3) The man hit the ball.

(4) John is easy to please.

(5) The cat sat on the mat.

 

A bit paradoxically, the ‘normalness’ of such sentences can make them seem a bit odd in comparison to what people actually say (§ 26).

14. Saving labour this way has some severe hidden costs. When linguists are no longer in the concrete fieldwork situation of confronting real data in an unknown language, the task of describing the language is no longer firmly correlated with the task of reaching a working consensus by going from an outsider to an insider in the culture (cf. § 11). This change removes both the most tangible means of testing one’s assumptions and the richest source of constraints from world and society. And when the task of describing freely rides upon the describer’s prior facility and unstated intuitions regarding the language, the linguists are already insiders before they start their work.

15. The impending problems were forestalled by the central formalist assumption that there exists, for each natural language, at least one set of formal constraints strictly delimiting what belongs to the language from what does not (§ 6). When found, such a set would provide total coverage and lead to a formal account both for the convergence among the structures of ‘grammatical’ or ‘well-formed’ sentences and for the consensus among the intuitions of native speakers. Since phonology and morphology had been relegated to the sidelines (§ 12), the constraints could be grouped under the respective headings of syntax, semantics, and pragmatics. Each group of constraints could be identified by selectively violating it, e.g.:

 

(4) John is easy to please. (‘well-formed’)

(4a) *To is please John easy. (‘syntactic violation’)

(4b) ?John is easy to sneeze. (‘semantic violation’)

(4c) ?John, be eased and pleased! (‘pragmatic violation’)

(4d) ?A john is sleazy to fleece. (which violation?)

 

But such demonstrations entail several problems:

 

(a) Insofar as the examples were invented on the spot to demonstrate the rules, they cannot be an independent validation of the rules; and the constraints applying to the act of invention and to its peculiar and untypical purpose — to produce a selective violation — hardly match the constraints that apply to ordinary acts of discourse and to their practical purposes, e.g. to justify what you are doing.

(b) The assessment of a violation depends heavily on the ingenuity of the linguists, e.g., whether they can imagine a situation where it would be appropriate to talk about ‘sneezing John’ (he might be a flu microbe in a children’s story); or where John’s imperious mother might command her son to be pleased about a Christmas gift and to have an easy conscience about not getting her one. To argue that a sentence is disqualified if it was ingeniously devised (as Bierwisch has) is not helpful when we cannot define the threshold where naivety leaves off and ingenuity starts. Even (4) is ingenious in the sense that it was expressly invented to make a point about underlying structure, and is unlikely to be uttered (§ 53, 57).

(c) It is also easy enough to invent examples where it is not clear which group of constraints is violated. (4d) would be such a case, and might yet be contextualised by applying the American meanings of ‘john’ as a ‘toilet’ or a ‘prostitute’s customer’ (Random House Webster’s College Dictionary, 1991, p. 729).

 

16. At all events, linguists were dismayed to find a wholly unexpected lack of agreement, both among themselves and among native speakers, about sample sentences. This outcome gave rise to a series of complex rhetorical manoeuvres on two sides. On the side of data, samples were carefully restricted in order to highlight the contrast between clearly proper sentences with seemingly obvious meanings versus clearly improper ones with no sensible meanings, as in (4-4c). On the side of theory, the central formalist assumption was shielded from the implications of observed disagreements. The set of formal constraints was declared to correspond only to ‘competence, the speaker-hearer’s knowledge of his language’ and not to ‘performance, the actual use of language in concrete situations’ (Chomsky 1965: 4). The ‘speaker-hearer’ was in turn declared ‘ideal’: ‘living in a completely homogeneous speech-community’, ‘knowing its language perfectly’, and being ‘unaffected’ by ‘memory limitations, distractions, shifts of attention and interest, and errors’ (1965: 3). In effect, these two declarations instated consensus by decree and converted it into an ‘ideal’ that need not, indeed cannot, be tested against the agreement among speakers.

17. The same rhetorical pressure accounts for the evasive complication opposing ‘surface structure’ to ‘deep structure’ and declaring that ‘the grammar does not, in itself, provide any sensible procedure for finding a deep structure of a given sentence, or for producing a given sentence’ (1965: 141). Moreover, ‘much of the actual speech observed’ was declared to ‘consist of fragments and deviant expressions’ (1965: 201), echoing Saussure’s influential mistrust of ‘actual speech’ (cf. § 1). These further declarations can serve to explain away any discovered lack of convergence among language data.

18. These rhetorical manoeuvres suggest that generative linguists were aware of and disquieted by the ‘formalist trade-off’ (described in § 6) but were determined to rescue the central formalist assumption (also cited § 6) by designing the theory precisely so as to prevent the lack of progress in coverage, convergence, and consensus in respect to real data from counting as a refutation. They correctly speculated that their manoeuvres, even if thinly or speciously argued, would not be critically assessed by colleagues who were (a) firmly committed to the ‘mainstream’ linguistic programme of describing ‘language by itself’, (b) not anxious to undertake painstaking fieldwork in remote places, and (c) mistrustful of actual speech in all its ‘messy richness’. So we can readily understand the success of generative linguistics and its continuation through a long and sometimes arcane series of ‘extensions’, ‘revisions’, or changes of notation without any willingness to change its basic claims about what a ‘language’ is and what a ‘linguistic theory’ should do. Its adherents cannot admit that isolating language from the functional constraints that apply to real data incurs the impossible job of inventing all the formal constraints for all conceivable data, irrespective of whether native speakers would ever utter them. The ‘generative grammar’ would have to reconstruct the formal possibility that speakers could utter or understand them; and no evidence has been brought forward so far that this can ever be done.

19. The conclusion would have to be: if language is detached from the constraints of ‘world’ and society, its own internal constraints are not sufficient to support its organisation (cf. § 2, 6, 11f, 14). Hence, any linguistic description which postulates such a detachment will only be able to cover a part of that organisation and will encounter frequent obstacles to convergence and consensus. This conclusion is borne out by empirical evidence not about the formal structure of sentences but about the ‘pragmatic’ activities of doing language science over the past century. Because ‘language by itself’ was a technical fiction to begin with, theories about it have been obliged to create a proliferating series of further technical fictions to prop each other up (‘grammaticality’, ‘well-formedness’, ‘competence’, ‘ideal speaker-hearer’, ‘homogeneous speech-community’, ‘deep structure’, and all the rest) that are not merely unconfirmed by real data but programmatically opposed to real data. The prospect today is not merely that no formal description of ‘language by itself’ has yet attained adequate coverage, convergence, and consensus for any natural language, but that no such theory ever will. In the long run, the apparent advantages of linguistic formalism (stability, determinism, rigour, visual clarity, impressive notations) and the privileges it confers (to invent and judge your own data, to do science without leaving your desk, and to escape the rich and messy contexts of human interaction) all turn out to be liabilities for achieving even its own carefully circumscribed tasks. Such a formalism relegates us to a shadowy world of formulas and arrays whose determinacy is financed by their indeterminate relation to the language data they purport to represent.

 

B. The impact of ‘large-corpus linguistics’

 

20. I have briefly retraced the theoretical evolution of ‘mainstream’ linguistics in section A in order to indicate how the early programme of describing ‘language by itself’, detached from world and society, has favoured a linguistic formalism that turned away from real data and eventually blocked further progress in coverage, convergence, and consensus, without which we cannot attain a complete and valid description of any natural language. The growing awareness of this impasse has led to a diversification within linguistics that has edged formalism gradually out of its ‘mainstream’ and majority position. The brands of linguistics going under such designations as ‘functional’, ‘systemic’, ‘applied’, ‘cognitive’, ‘computational’, and ‘critical’, along with some adjunct domains such as ‘discourse analysis’ and ‘discourse processing’ (which seldom aspired to be part of linguistics), all share the enterprise of resituating language in its cognitive and social contexts, reassembling, as it were, the ‘layer cake’ of language interfaced with world and society (§ 2).

21. As the conventional division between ‘language by itself’ versus ‘language in use’ has been progressively narrowing, we have found that real data are not plagued by the lack of ‘discoverable unity’ that, Saussure vowed, would prevent us from ‘putting speech into any category of human facts’ (§ 1); nor do they ‘consist of the fragments and deviant expressions’ that justified Chomsky’s retreat from ‘the actual use of language in concrete situations’ (§ 16). Instead, real data reveal an unexpectedly high degree of precision and clarity, though not necessarily in the modalities that mainstream linguistic theories would easily recognise.

22. This finding has been most profoundly assisted by the advance of technology, placing within our reach a new source of data that dramatically enhances the prospects for coverage, convergence, and consensus. The key technical innovation is the large computerised corpus of data from actual texts and discourses, such as the ‘Bank of English’ (hereafter ‘BoE’ for short) developed at Birmingham University by John Sinclair and his team. I took the data described below from the BoE in July 1994, at the stage when it had reached the size of some 200 million words of running text from contemporary spoken and written sources, including: British and American books; newspapers (Times, Independent, Guardian, Today, Wall Street Journal, New Scientist, Economist); magazines (e.g., Esquire, Good Housekeeping); ‘ephemera’ such as letter-box mailings (e.g., YMCA appeal for homeless people, Friends of the Earth Tropical Rainforest Campaign), radio broadcasts (British Broadcasting Corporation in the UK and National Public Radio in the US); and recordings of conversations.[1] The coverage by so large a corpus might validly claim to be representative, though it is certainly not complete and is very far from ‘infinite’. Yet paradoxically, it has itself made us aware of the ways in which it is still too small (§ 66ff).

23. Still, as a sample of contemporary English usage, the coverage exceeds previous sample sizes by various orders of magnitude, such as: the previous 20-million word corpus used for the 1987 Collins COBUILD English Language Dictionary (by 1 order of magnitude); the 1-million word Survey of English Usage at University College London (by 2 orders of magnitude plus doubling); the 2000-word fragments in the Brown University corpus (by 5 orders of magnitude); and the 24 invented sentences analysed or ‘transformed’ in Chomsky’s Aspects (by 7 orders of magnitude).[2]
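The comparisons above can be checked with a little arithmetic. The following is only a sketch: the sizes are those stated in the text, the function name is a label of convenience, and the last comparison sets words against sentences, so it is a loose one at best.

```python
import math

BOE_WORDS = 200_000_000  # size of the BoE in July 1994, as stated in the text

# Sizes of the earlier samples named in the text. The last entry counts
# sentences rather than words, so that comparison is only approximate.
earlier_samples = {
    "COBUILD 1987 corpus (words)": 20_000_000,
    "Survey of English Usage (words)": 1_000_000,
    "Brown corpus fragment (words)": 2_000,
    "Aspects invented sentences": 24,
}

def orders_of_magnitude(larger, smaller):
    """Return log10 of the size ratio, i.e. the number of powers of ten."""
    return math.log10(larger / smaller)

for name, size in earlier_samples.items():
    print(f"{name}: {orders_of_magnitude(BOE_WORDS, size):.2f}")
```

The printed values are 1.00, 2.30, 5.00 and 6.92. The second corresponds to ‘2 orders of magnitude plus doubling’ (a factor of 200), and the last rounds to the 7 orders cited.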

24. Contrary to what is widely believed, the increase in orders of magnitude does not entail a direct proportionality whereby we just get the same data multiplied by 10, 100, 1,000, and so on, so that if an item appears once in a 1 million word corpus, it appears 20 times in a 20 million word corpus and 200 times in a 200 million word corpus. If that were true, building steadily bigger corpuses would only give the results we could accurately predict from the proportions in a small corpus. But in fact the large corpus offers not just more data but different kinds of data:

 

(a) We find numerous items that did not appear at all in smaller ones.

(b) We can make more informed judgements about relative frequency. Of two items appearing only once in a small corpus, the one might still appear only once in a larger corpus and the other fifteen or twenty times.

(c) The larger corpus will display the data in steadily finer degrees and differentiations of detail. An item which appeared only once in a small corpus may appear in several distinctive variants in a large one.

 

In these ways, each increase in magnitude can reveal hosts of fresh and more detailed regularities that were simply not noticeable before, nor are they readily open to unaided intuition and introspection (§ 27, 52f, 55, 63). They still have to be interpreted, but, in marked contrast to non-corpus linguistic methods, the outcome is quite amenable to convergence and consensus (§ 4, 6, 15, 17ff, 20, 22, 27f, 39, 43, 46, 48, 50, 52-55, 62, 64f, 72-75).
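Points (a) and (b) can be illustrated with a toy simulation. This is a sketch under loud assumptions, not an analysis of the BoE: the ‘corpus’ is a random sample of item numbers, the skewed 1/rank weighting is assumed purely for illustration, and all names and sizes are hypothetical. Point (c), the emergence of distinctive variants, is not modelled.

```python
import random
from collections import Counter

def zipf_sample(n_tokens, vocab_size=20_000, seed=1996):
    """Draw n_tokens item ids, the item of rank r having weight 1/r (assumed)."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return rng.choices(range(1, vocab_size + 1), weights=weights, k=n_tokens)

large = zipf_sample(100_000)
small = large[:10_000]           # the small corpus is a part of the large one
scale = len(large) // len(small)

small_counts = Counter(small)
large_counts = Counter(large)

# (a) Items that did not appear at all in the small corpus.
new_items = set(large_counts) - set(small_counts)

# (b) Items seen exactly once in the small corpus: proportional scaling
# predicts `scale` occurrences for each, but the observed counts spread out.
once_in_small = [item for item, n in small_counts.items() if n == 1]
observed = sorted(large_counts[item] for item in once_in_small)

print("types in small / large corpus:", len(small_counts), len(large_counts))
print("items new in the large corpus:", len(new_items))
print("proportional prediction for a once-only item:", scale)
print("observed range for once-only items:", observed[0], "to", observed[-1])
```

Even in so crude a model, a single proportional prediction cannot describe the items that occurred once in the small sample: some remain rare in the large one while others turn out to be many times more frequent.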

25. Conversely, the corpus shows that examples we might intuitively accept at face value are not typical of actual usage. Our beloved evergreens like those cited in § 13:

 

(3) The man hit the ball.

(4) John is easy to please/eager to please.

(5) The cat sat on the mat.

 

do not appear in the BoE, not because they aren’t properly ‘grammatical’ or ‘well-formed’ English but because they aren’t ‘natural’: typical contexts of real discourse require less simple-minded and peremptory utterances. In the BoE, nobody at all is said to be ‘easy to please’. For ‘eager to please’, three instances appear (6-8), each with a direct object for ‘please’ that was missing in (4) and with more interesting agent-subjects than our insipid friend ‘John’. Even allowing for intervening items, the only combination of ‘man + hit + ball’ was (9); ‘man + hit’ alone returned only (10), where the sense of ‘hit’ adapts to ‘jackpot’. For ‘cats sitting on mats’, the only attestations were derivations from the use of this trite example in schoolbooks or logicians’ debates, e.g. (11-13), rather than being assertions about any real cat.[3]

 

(6) < a government official who is eager to please the wealth goddess >

(7) < the Sandinistas. The government is eager to please the Church >

(8) < show a sociable child who is eager to please or charm those around him >

(9) Yes. Doesn’t that man hit the ball hard?

(10) Where can a con-man hit the biggest jackpot? In politics

(11) On the first page was a drawing of a brindled cat seated on a recognisable mat, the original ‘cat on the mat’ now quoted in derision of an antiquated method of teaching

(12) so if you have <ZF1> a <ZF0> a man on the roof [pause] er erm erm a cat on a mat er a tree on a mountain top a boy sitting on a tree branch these all involve

(13) material-objects statements, ‘There is a cat on the mat’, statements about people in novels, statements of mathematics

 

We shouldn’t regard the grainy details of the real data as a mere obstruction to be filtered out by rarefying and decontextualising (in the sense of § 3). Instead, we should respect the ‘naturalness’ of real data because, unlike the ‘grammaticalness’ or ‘well-formedness’ of the formalists, it has been decided for us by real users of the language (cf. § 63f). We want to account for the ‘competence’ real users not just possess but display when doing this; and there, ‘well-formedness’ has no overriding priority (§ 46, 48, 53, 55).

26. Now, corpus displays are in some sense frankly ‘surface’ data, but, exactly because the data are not severed from their contexts, it is easier to assess what sorts of ‘shallower’ or ‘deeper’ constraints might apply. Even on the surface, a corpus displays to the investigator not just words but collocations, to adopt Firth’s (1968 [1952-1957]: 106ff, 113, 182) well-known term: ‘words’ considered in ‘the company they usually keep’, i.e., typical word combinations that would not usually qualify as idioms or standing phrases (cf. § 31, 33, 52, 55, 60, 66, 69, 77ff). Also, the data can be accessed in somewhat ‘deeper’ ways by means of the search software, so that, for example:

 

(a) The collocation need not be invariant or continuous but may contain varying interposed words (up to 4 in the BoE), e.g. ‘on the mat’ in (5) versus ‘on a recognisable mat’ in (11).

(b) We can sort out words that could belong to more than one word-class, e.g. ‘warrant’ as either a noun or a verb.

(c) We can use uncommitted characters to search for a stem with all its endings, e.g. to compare ‘logic’ with ‘logical’, which turn out to collocate rather differently.

(d) We can make nested sub-displays to zero in on possibly significant combinations in the general display, e.g., to go from ‘warrant’ to ‘warrant + investigation’.
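The search options in (a), (c), and (d) can be illustrated with a minimal sketch in Python. It is not the BoE software, whose internals are not described here; the function names, the crude tokeniser, and the default gap of 4 words (taken from the limit mentioned in (a)) are all assumptions made for illustration.

```python
import re

def tokenize(line):
    """Lower-case a line and split it into word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", line.lower())

def gapped_collocation(tokens, first, second, max_gap=4):
    """Return True if `second` follows `first` with at most `max_gap` words in between."""
    for i, tok in enumerate(tokens):
        if tok == first:
            # The window holds the next max_gap + 1 tokens after `first`.
            window = tokens[i + 1 : i + 2 + max_gap]
            if second in window:
                return True
    return False

def stem_matches(tokens, stem):
    """Return every token that begins with `stem`, i.e. the stem plus any ending."""
    return [tok for tok in tokens if tok.startswith(stem)]

def sub_display(lines, *words):
    """Keep only the lines containing every one of `words` (a nested sub-display)."""
    return [ln for ln in lines if all(w in tokenize(ln) for w in words)]
```

With this sketch, a search for ‘on’ followed by ‘mat’ would match both ‘on the mat’ and ‘on a recognisable mat’ in (11), while a search on the stem ‘logic’ would return ‘logic’ and ‘logical’ alike. Sorting by word-class, as in (b), would need a tagged corpus and is not attempted here.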

 

27. The most ‘surface’ use of the large corpus is to enable accurate judgements about the frequencies of words or word-combinations — a familiar tactic in ‘computational linguistics’. A far ‘deeper’ and more revealing use of the corpus is to detect tendencies rather than just frequencies, so that we can assess why certain combinations occur and not just how often. Paradoxically, sorting vast quantities of real data allows unexpected convergences to emerge within the regularities underlying this huge variety (cf. § 43f, 72). Among all of the possible combinations of English words and phrases that might be intuitively judged ‘grammatical’, we can finally see which ones are more likely to be realised and at least some of the reasons why.

28. The main challenge now is how to identify and describe the constraints whose effect the corpus-displays allow us to inspect (cf. § 73). The constraints are all functional in the broadest sense, i.e., related to what people do with their language (§ 5); any formality we may distil out is derivative upon that functionality and cannot be consensually accounted for without it (cf. § 54f; Beaugrande, in press). Moreover, functional constraints need not fit neatly into the formal linguistic schemes devised for ‘language by itself’ — not a surprising finding, perhaps, but an immensely significant one (§ 34-46).

29. My demonstration here will be the Bank of English corpus data on the English verb ‘warrant’. The BoE returned a total of 392 lines centring on that key-word as a Verb. To get a more manageable and productive sample, I made a hand-sorted selection of 228 lines by eliminating repetitions, e.g. when a statement by a politician got reported in several media, and false alarms where the key word was actually a noun.[4] Selecting the verb allowed me to disregard the numerous noun occurrences in stock phrases like ‘search warrant’, ‘death warrant’, or ‘warrant for arrest’.
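The selection described in § 29 was done by hand, but its two mechanical steps (dropping repeated lines and dropping lines where the key word is a noun in a stock phrase) can be roughly approximated as follows. This is only a sketch: the function name and the list of noun cues are assumptions for illustration, and exact-match de-duplication would miss a statement reported in different wording by several media, which only hand-sorting would catch.

```python
def select_verb_lines(lines, noun_cues=("search warrant", "death warrant", "warrant for arrest")):
    """Drop repeated lines and lines containing a stock noun phrase with 'warrant'."""
    seen = set()
    kept = []
    for line in lines:
        # Normalise case and spacing so trivially different copies count as repeats.
        norm = " ".join(line.lower().split())
        if norm in seen:
            continue
        seen.add(norm)
        # Hypothetical cue list: stock phrases in which 'warrant' is a noun.
        if any(cue in norm for cue in noun_cues):
            continue
        kept.append(line)
    return kept
```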

30. The word has a venerable history related, according to Walter Skeat’s (1970 [1879-1882]: 702) Etymological Dictionary of the English Language, to the word ‘guarantee’. As a verb, we find such usages attested in the Oxford English Dictionary (pp. 930ff) as: ‘to keep safe from danger’ (14); ‘to guarantee goods to be of the quality, quantity, etc. specified’ (15); ‘to give a personal assurance of a fact’ (16), ‘chiefly in “I (I’ll) warrant you”’ (17); and ‘to authorise, sanction a course of action’ (18).

 

(14) What good Man was he that from deth warawnted thee? (Henry Lovelich, Merlin, 1450)

(15) This Ryche man thenne sold his oylle to the marchaunts and waraunted eche tonne al ful (William Caxton, The subtyl historyes and fables of Esope, Auyan, Alfonce, and Poge, 1484)

(16) Bot for to lere him I warand, Als mekil als he mai vnderstand (The proces of the seuyn sages, 14th century)

(17) There be many such I warrant you yt neuer cum to light (Thomas More, A dyaloge wherin he treatyd dyvers maters as of the veneration and worshyp of ymagys etc., 1528)

(18) The Lord warrants us to suspect the inconstant (Daniel Rogers, Naaman the Syrian, his disease and cure, 1642)

 

These samples from the 14th to the 17th centuries suggest a gradual widening away from official discourse, and a drift toward the modern usage displayed by the BoE corpus, as we shall see.

31. A first heuristic for identifying the more interesting collocations in the BoE is to list in the order of frequency the most common words within the set of lines returned. Many of those near the top of the list, such as ‘of’ or ‘to’, will seem unenlightening in the early stages, but at least some of the more suggestive words can turn up:[5] among the nouns, ‘evidence’ (21 occurrences), ‘investigation’ (12), ‘trial’ (7), ‘attention’ (9), ‘circumstances’ (8), ‘concern’ (6), ‘mention’ (5), ‘consideration’ (5), ‘punishment’ (5), ‘intervention’ (4), and ‘conditions’ (3); among the modifiers, ‘enough’ (58), ‘sufficient’ (27), ‘serious’ (14), ‘really’ (7), ‘certainly’ (6), ‘important’ (5), ‘severe’ (5), and ‘trivial’ (4).
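The first heuristic amounts to a plain frequency list over the returned lines. A minimal sketch, assuming the lines are available as strings; the function name and the optional stopword filter are illustrative assumptions, not features of the BoE software:

```python
from collections import Counter
import re

def frequency_list(lines, stopwords=frozenset()):
    """Count every word in the lines and return (word, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        for tok in re.findall(r"[a-z]+", line.lower()):
            if tok not in stopwords:
                counts[tok] += 1
    return counts.most_common()
```

Passing function words such as ‘of’ or ‘to’ as stopwords is one way of skipping the unenlightening items near the top of the list, though an investigator may well prefer to see the full list first.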

32. A second heuristic is to create a positional frequency table in which the words in the several slots to the left and right of the key word are displayed in descending order of frequency. The table below shows the data for ‘warrant’.

 

 

 

3 to the left   2 to the left   1 to the left   word      1 to the right   2 to the right   3 to the right

sufficient      enough          to              warrant   a                the              of
enough          evidence        not             warrant   the              investigation    the
serious         did             't              warrant   an               a                in
too             do              would           warrant   it               <t>              a
the             does            might           warrant   such             attention        <t>
and             not             really          warrant   any              of               but
that            as              that            warrant   further          action           action
not             didn            yet             warrant   this             trial            <LTH>
sufficient      may             should          warrant   that             with             and
in              doesn           search          warrant   his              and              to
is              nothing         and             warrant   to               more             trial
was             the             will            warrant   some             special          that
it              and             circumstances   warrant   their            even             for
of              seem            arrest          warrant   no               mention          into
which           to              could           warrant   another          intervention     by
but             trivial         can             warrant   my               's               it
good            that            may             warrant   its              it               is
done            will            soon            warrant   for              because          than
<h>             so              'll             warrant   more             than             some
a               small           conditions      warrant   concern          new              as
's              seemed          germane         warrant   officer          an               an
be              they            death           warrant   </h>             further          here
important       appear          certainly       warrant   one              sort             then

 

These data too are at best suggestive, and for much the same reason that purely formal syntax readily becomes convoluted or opaque: many words or word-classes are fuzzy in respect to their mutual positions; and functional relations need not show up as formal ones. The frequent negations (‘not’, ‘-t’, ‘didn’, ‘no’, and, by implication, ‘too’) are scattered over four positions (cf. § 36, 42). And some of the most revealing data don’t appear at all, either because their position isn’t consistent enough, e.g. ‘situation’; or because a shared semantic concept is lexicalised in various ways, e.g. ‘disability - distress levels - ill health - medical problems’.
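A positional frequency table of the kind described in § 32 can be sketched as follows. The function name, the default span of three slots, and the assumption that each line is already split into tokens are mine; the sketch makes no claim about how the BoE software computes its own table.

```python
from collections import Counter

def positional_frequencies(token_lines, keyword, span=3):
    """Map each offset (-span..-1 and 1..span) to a Counter of the words found there."""
    table = {off: Counter() for off in range(-span, span + 1) if off != 0}
    for tokens in token_lines:
        for i, tok in enumerate(tokens):
            if tok != keyword:
                continue
            for off in table:
                j = i + off
                # Slots that fall outside the line are simply skipped.
                if 0 <= j < len(tokens):
                    table[off][tokens[j]] += 1
    return table
```

Calling `table[-1].most_common()` would then list the words one slot to the left of the key word in descending order of frequency, corresponding to one column of the table. The sketch also makes the limitation plain: an item whose position varies from line to line is split across several counters and may not rise to the top of any of them.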

33. A third heuristic offered in the BoE software sorts the lines by the alphabetical order for a given position to the left or right of the key word. This tool works best in bringing out data about items whose position is relatively fixed, e.g. the extreme frequency of ‘to’ in the infinitive (top item before ‘warrant’ in the positional frequency table). But user-performed hand-sorting is needed for groupings wherein the essential items and collocations occupy more flexible positions, e.g. ‘serious’; or where groupings are to be made by semantic criteria, e.g. ‘investigation’ with ‘inquiry’. I worked out three hand-sorted displays and added bold italics to highlight the items that I chiefly relied on while doing the sorting and alphabetising:[6] one for what does or doesn’t do the ‘warranting’, one for what is or is not ‘warranted’, and one for the relevant criteria. Samplings from these three displays are given in Appendices A, B, and C.
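The alphabetical sorting described in § 33 can be sketched in a few lines. Again the function name and the pre-tokenised input are assumptions, and where a line contains the key word more than once the sketch only considers the first occurrence.

```python
def sort_by_position(token_lines, keyword, offset):
    """Sort the lines containing `keyword` by the word `offset` slots away from it."""
    def key(tokens):
        i = tokens.index(keyword)  # first occurrence of the key word
        j = i + offset
        # Lines too short to fill the slot sort first, under the empty string.
        return tokens[j] if 0 <= j < len(tokens) else ""
    hits = [tokens for tokens in token_lines if keyword in tokens]
    return sorted(hits, key=key)
```

A negative offset sorts on the left context and a positive one on the right. As noted in § 33, a sort of this kind helps only with items in fairly fixed positions; groupings by flexible position or by semantic criteria still call for hand-sorting.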

34. These displays begin to reveal the various types of constraints. Some constraints might be provisionally stated according to the familiar schemes of different ‘levels’ or ‘components’ of ‘mainstream linguistics’. For phonology, the intonation would be distinctive for the performative ‘warrant’ in relatively rare locutions like ‘I’ll warrant’ used when you want to indicate you feel sure about something though you can’t point to actual facts (cf. § 64):

 

(19) If I had ten thousand men like him tomorrow then I warrant we’d see Napoleon beat by midday [quoting the Duke of Wellington.]

(20) The soil may look innocuous enough when you’ve dug it over but I’ll warrant it’s teeming with root-eating wireworms.

(21) I’ll warrant I even heard Honey Bane shuffling by somewhere in the background of a song that will provide the perfect soundtrack for when your mum won’t let you out of your room until you’ve done your homework.

 

A sample like (21) looks quite complex (with quadruple ‘embedding’) in comparison to the usual invented sentences like (3-5) in § 13 and 25, but in actual discourse it should present no difficulties for comprehension, even for the young and not very intellectual readers it addresses.

35. For morphology, we might note the overwhelming frequency of non-finite forms, either in infinitives with ‘to’ (136 occurrences) or with some modal verb (58) (cf. § 42). Also, several Latin/French-based prefixes among the semantic processes may be significant: ‘ad-’ for moving toward something: ‘action, appeal, appellation, assistance, attention’; ‘com-’ or ‘con-’ for acting, happening, or bringing together: ‘collection, commitment, complaints, conclusion, conditions, consideration, conspiracy, consultations’ plus the Anglo-Saxon ‘with-’ in ‘withdrawals’; ‘de-’ and ‘dis-’ for uncovering or invalidating something: ‘declines, definition, developments, disability, distress’; ‘e-’ or ‘ex-’ for getting outside: ‘event, evidence, examination, exclusion, expansion, expenditure, extension’, plus the Anglo-Saxon ‘out-’ in ‘outburst’; negating ‘im-’ or ‘in-’ for something that is not as it should be: ‘impropriety, indeterminate, insufficient’, plus the Anglo-Saxon ‘un-’ in ‘uncharacteristic, uncovered, unimportant, unorthodox, unsatisfactory, unspecifiable, untutored’; ‘in-’ and ‘inter-’ for getting inside or between: ‘inquiry, interception, interference, intervention, introducing, investigation into war crimes, inclusion in the wheelchair, internal matters that warrant no outside interference’; ‘re-’ for following up or going back toward something previous: ‘recession, record, recording, relaxation, relief, respect, response, retaliation, retrospective, return, revelations, revision’ (cf. § 41).

36. For syntax or ‘grammar’, we could note the extreme dominance of third person subjects (224 occurrences), as opposed to just 4 in first person (compare samples (19-21)) and none at all in the second person; and, within the third person, the mere handful of pronoun subjects ‘he’ (6 occurrences), ‘she’ (0), ‘they’ (5), and ‘it’ (7), as contrasted with the large numbers of noun subjects (§ 42). Or, we might note the high proportion of negations attached to the verb: ‘not, don’t, didn’t, not yet, hardly, not really’ (cf. § 32, 42).

37. For semantics, we could note that many of the subjects and direct objects fall into associative classes that are not unduly hard to label, e.g.:

 

(a) as subjects: actions: ‘achievement, aggressions, behaviour, blow, brawl’; resources: ‘abilities, acreage, growing area, scrappable cars’; knowledge: ‘evidence, information, perception, scientific authority’; messages: ‘accusations, complaints, juicy stuff, message, piece of tittle-tattle, revelations’; problems: ‘air leaks, ambiguity, antitrust conspiracy, casualty rate, chilly old homes, degenerating trees, disability, discriminatory practices, distress levels, food shortage, ill health, impropriety, job bias, slowing in the economy, violence’;

(b) as direct objects: (in)appropriate reactions: ‘(further) action, change, commitment, conclusion, consideration, expansion, extension, formation, increases, motion, (cautious) move, plan, step, signing, treatment’; consumption of resources: ‘cost, expenditure, loss of any troops’ lives, overeating, paying the steeper taxes, shelf-space’; messages: ‘apology, appellation, billing, briefing, brochure, column inches, comment, description, footnote, mention, phrase, satire, serious talk, suggestion, talking-to’; knowledge-gathering: ‘airing, attention, consultations, examination, hearing, inquiry, investigation, retrospective survey, review, [legal] trial, [medical] trials’; solving problems: ‘answering machine, (charitable/economic) assistance, breaking the embargo, easing of interest rates, full-time custodian, guests wearing thermal long johns, intensive care, introducing more elaborate feeding, (professional/prompt/surgical) intervention, making peace, mid-season break, opening of a new peat extraction plant, revision, sending of those supplies, using these drugs’; retaliating: ‘banning the show, charge(s), God’s anger, jail time, lengthy ban, massive American retaliation, penalties, pre-emptive strike, (criminal) prosecution, (capital) punishment, retribution, [legal] trial’.

 

Such groupings overlap, since a broad category like ‘(in)appropriate reactions’ can reasonably include narrower ones like ‘knowledge-gathering’, ‘problem-solving’, and ‘retaliating’. Still, we can make a modest ‘semantic table’ showing the typical correlations between subject-groupings and object-groupings, e.g.:

 

                subject-groupings     object-groupings

                actions               (in)appropriate reactions
                resources             consumption of resources
                messages              messages
                knowledge             knowledge-gathering
                problem               problem-solving

 

It seems plausible that a given parallel across our columns might show up in the data on the same line, as we see at once for ‘evidence’ (knowledge) plus ‘investigation/trial’ (knowledge-gathering). But this co-occurrence of semantic groupings on one line is by no means a rule. We can also have, say, an action as subject and a message about it as direct object, e.g. when an ‘operation warrants a middle-of-the-night briefing’. Or, the context for one grouping may imply another, as when knowledge-gathering in legal contexts implies a retribution, e.g. the condemnation and punishment likely to follow upon a ‘trial’. Or again, some people consider legal punishments a type of problem-solving, despite the scant evidence that the ‘problem of crime’ is being solved in this way.

38. The constraints of context soon impel us beyond the customary borders of semantics. An abstract scheme of ‘semantic features’ would presumably suggest making a separate class for general nouns, some of which appear as frequent subjects in our data: ‘behaviour, circumstances, conditions, contemporary events, incident, occasion, operation, qualities, situation’. But none of these remains general in the context. Most of them carry a pejorative implication, i.e., that the ‘behaviour, circumstances’, etc. involve some problems. If we read that ‘circumstances do not warrant a change in the leadership’, we can assume that one or more ‘leaders’ do not seem to have been acting as they should and that somebody wants to reassure us. Or, if we read that ‘circumstances simply do not warrant charitable assistance’, we can assume some people are in financial difficulties while other people with money are, in the finest Tory tradition, excusing themselves from helping out.

39. We can see here a major difference between conventional abstract semantics versus corpus-driven semantics, one which Sinclair (1994) has pointed out. Most of what passes for generality, vagueness, or ambiguity in the meaning of language and impels semanticists to build finicky sets of rules to eliminate it, evaporates when we look at suitably sorted real data. So we may well feel uneasy about approaches that expressly declare it the job of semantics to ‘disambiguate’ sentences or sequences that allow for more than one interpretation (§ 43). Quite plausibly, the ambiguity is largely an artefact of using isolated and invented data. We might recall here the contrast between invented simple sentences like (3-5) in § 13 and 25 versus authentic and elaborate real data such as (19-21) in § 34. Again, trying to filter language to the point of enabling a formalist description erodes the constraints that are urgently needed for convergence and consensus (cf. § 4, 6, 15, 17ff, 20, 28).

40. For pragmatics, finally, we could note the explicit performative ‘warrant’ when the speaker is also the subject, as in (19-21) in § 34. Less explicit but far more common and influential is the pragmatic force entailed in declaring what does or does not ‘warrant’ what. This force carries the implication that the event or state of affairs that might do the ‘warranting’ is in some way unusual or significant enough that a reaction might well be in order, and that those who might be expected to do the reacting are likely to say why or why not they are going to, and how. Accordingly, the speaker — or, when the discourse is reported, the originator of the message — is likely to be a person who represents some institution or authority, and our data suggest what kind: government, judiciary, military, sports, business, science, and medicine. Or if the person does not, then the use of ‘warrant’ implies a subtle signal that authority is being claimed anyhow; we see this use among journalists and media persons when they are not reporting what other people said. Uses like ‘the Chevrolet Beretta does not warrant particular mention’ or ‘the documentary wouldn’t warrant more than a 4’ are inconsequential magisterial pronouncements merely aping genuine authority with real consequences, e.g., medical judgements about whether ‘problems warrant surgery’ or ‘drugs’.

41. I have followed through the familiar linguistic ‘levels’ or ‘components’ to suggest that each of them contributes a set of constraints on the verb ‘warrant’. But taken by itself, each set is weak and some may seem unduly speculative. For example, citing the frequency of prefixes as morphological units (§ 35) might seem to be overinterpreting merely coincidental or antiquarian materials, were it not for the semantic and pragmatic constraints indicating that ‘warranting’ often does involve situations in which people act together (viz. ‘commitment, complaints, consideration, conspiracy, consultations’); or where something is not what it should be (viz. ‘impropriety, insufficient, unimportant, unorthodox, unsatisfactory, untutored’); or where people want ‘inside’ knowledge (viz. ‘inquiry, investigation’) or want to break ‘in’ on the chain of events (viz. ‘interception, interference, intervention, introducing’); and so on. Suggestive too are some less frequent semantic combinations, e.g., that ‘assistance’ and ‘assistant manager’ both appear as ‘warranted’ solutions to problems. The question of whether such accumulations or combinations reflect the design of the language or the speaker’s choice still needs to be determined; but without the corpus data display, we wouldn’t have occasion to pose the question at all.

42. Considering pragmatics clearly helps in appreciating the significance of several ‘grammatical’ or syntactic accumulations. Foremost among these is the high frequency of negations (§ 32, 36), signalling how often the potential reactors feel impelled to declare that a predictable or reasonable reaction will not take place. Or (to include morphology here), the frequency of infinitive forms reflects the specification of the criterion for making such a declaration, e.g., that things are ‘too small, trivial’ etc. or ‘not serious, severe, etc. enough’ ‘to warrant’ something. Or again, the frequent modal verbs like ‘may’ (14), ‘must’ (11), ‘would’ (10), ‘will’ (7), ‘might’ (5), ‘should’ (4), ‘can’ (3), ‘could’ (3), and ‘shall’ (1) in a total of 58 lines, plus ‘seem’ (8) and ‘appear’ (2), all have the function of attenuating the pragmatic force and conceding that other people might reach different conclusions about the ‘warranting’. The same function is at stake in the use of interrogatives, as in ‘Did he warrant the harsh punishment of exclusion?’; and of dependent clauses with the force of interrogatives, as in ‘specify what kind of cases would warrant capital punishment’. Or again, the low number of personal pronouns (§ 36) as subjects reflects the semantic and pragmatic constraint that actions and situations are more likely to be said to ‘warrant’ something than people are.

43. When we are describing real data, the interaction between semantic and pragmatic constraints is often so intense that there are only weak indicators of which is which. How can we, say, keep our semantic understanding of a general noun like ‘circumstances’ apart from our pragmatic understanding of the force entailed? The constraints from knowledge of world and society, which ‘mainstream’ linguistics sought to detach from the constraints on language (cf. § 2, 6, 11f, 14, 19f), are absolutely crucial for interpreting such data, but are by no means easy to formalise as ‘rules’ (§ 56). We appear to be dealing with numerous local interactions among constraints that support sophisticated higher-level organisation, as in a complex system with distributed parallel processing (cf. Rumelhart, McClelland, et al. 1986; Beaugrande, in preparation). What appears to be a single constraint in an actual context might rather be a pattern of such interactions. If so, the standing internal constraints upon the language, e.g. that the English infinitive be formed from ‘to’ + non-inflected verb, are like the ‘frozen islands’ in a complex system and continually interact with emergent external constraints from world and society during discourse, e.g. that something is or is not ‘warranted’ by a combination of situation (e.g. ‘circumstances’, ‘conditions’) + sufficiency (e.g. ‘enough’, ‘sufficient’) + gradable modifier (e.g. ‘serious’, ‘severe’) (cf. § 46, 53, 68). This interaction supports a convergence among the various modes of data and a consensus among speaker and hearer or writer and reader. If, as formalist linguistics sought to do, we detach language from the constraints from world and society and retreat from real data, the emergent constraints get diluted or lost, and we face the awesome task of trying to ‘freeze’ the entire system: a sort of ‘cryogenic linguistics’ building a ‘cryogenerative grammar’. Convergence and consensus recede, and the data begin to appear vague and ambiguous, sending us off in search of complicated formal rules which, being devised in a relative vacuum, are naturally arbitrary and ponderous (cf. § 39).

44. Moreover, the emergent external constraints may be quite flexible about formal positions. They can generate rich strands of semantic relatedness among items at various locations in the sequences showing up in our data lines. In some lines, we encounter items together that might be said to belong to the same semantic field, e.g. ‘chilly - thermal’, ‘economy - interest rates’, ‘shortage - embargo’, or ‘slowing - easing’. In other lines, we find the ‘attraction’ of a specific item constraining a general one. In ‘forward attraction’, the specific comes first and specifies the general after it, e.g., in ‘alcohol - taxes’ (hence not value added taxes), ‘degenerating trees - specialist’ (hence not an eye specialist), ‘medical - drugs’ (hence not psychedelics), ‘violence - security’ (hence not a bond), ‘worshippers - huge edifice’ (hence a church or shrine). In ‘backward attraction’, the general comes first and gets specified further on, e.g. in ‘declines - recession’, ‘operation - intensive care’, or ‘sites - custodian’; in cases like ‘air leaks - military interception’ and ‘inclusion - wheelchair’, the specific emergent constraints run counter to the standing constraints on the general item, i.e., an ‘air leak’ being in a sealed container, or ‘inclusion’ being ‘making something part of a larger thing’ (Collins COBUILD English Language Dictionary, p. 736). In either direction, the formal distance between the items can vary quite freely.

45. Should these data be considered ‘purely semantic’ when so much depends on our pragmatic knowledge of the situations in which people say that things are or are not ‘warranted’? Should uses like ‘air leak’ and ‘inclusion’ be classed as semantically deviant or deficient because they go against the standing constraints, even though we can readily understand them if we consider the speaker’s motivations, e.g. to arouse the impression that a ‘no-fly zone’ in a war is virtually air-tight, or to avoid a more usual but harsher term like ‘confinement’? Should we devise ‘semantic rules’ that first compute the typical meaning and then go on to compute the deviant meaning? How about cases where the data seem plainly misleading, e.g.:

 

(22) < as a major threat sufficient to warrant a pre-emptive strike of their own. >

(23) < stories of ill health that appear to warrant surgical intervention. Frequently >

 

This ‘major threat’ in (22) differs from the standing constraints on the familiar speech act of ‘threatening’ in that the agent may have done or said nothing implying any intention to cause harm. Yet our social knowledge is quite familiar with the high-tech jargon from the age-old military and political discourse that disguises aggression as defence. Or, the ‘appear’ rather than ‘appears’ in (23) oddly suggests that surgery is to be performed on ‘stories’ or ‘story-tellers’ rather than on the people in ‘ill health’; but world-knowledge prevented both the text producer and the news editors from noticing this suggestion.

46. The overall conclusion would be that the familiar linguistic ‘levels’ or ‘components’ are designations not for neatly distinct sets of formal abstract data but for sets of functional standing constraints operating across sets of real data and generating emergent constraints. Since this process supports the convergence among the various modes of data and the consensus among speaker and hearer or writer and reader (§ 43), a linguistic description can itself attain convergence and consensus not just by sorting data into separate piles, one for each set, but by assessing the interactions among these sets (§ 50). Even my brief demonstration should suffice to show that the form of the data may seem highly variable and at times utterly idiosyncratic unless we continually examine the relevant functions. Formulating ‘formal rules’ that draw a rigorous border between what can ‘warrant’ what versus what cannot in any ‘well-formed’ English sentence only leads to finicky debates over examples and counter-examples and misrepresents the ‘competence’ of English speakers (§ 53). They do not know once and for all what can and cannot be ‘warranted’, but they do know what sorts of things people are likely to say are or are not ‘warranted’ and why; and that is the knowledge put to use by the people who produce and understand real data.

 

C. Some implications of corpus linguistics for linguistic theory

 

47. Our situation today recalls a complaint once voiced by Saussure (1966 [1916]: 106): ‘It is one thing to feel the quick, delicate interplay of units and quite another to account for them through methodical analysis’. Corpus data reveal far more numerous and more ‘delicate interplays’ than Saussure, with his deep mistrust of ‘actual speech’ (§ 1), could have imagined, and they are pressuring us to develop suitable methods of analysis and a more functional and realistic theoretical ambience (cf. Baker et al. [eds.] 1993). In this final section, I shall explore some factors bearing upon such a theoretical ambience and relate them to the theoretical problems aired in section A.

48. Against the backdrop of my forceful articulation of these problems, it may seem odd if I sound optimistic. But the chances for ‘mainstream’ linguistics to make major progress in coverage, convergence, and consensus are best if we can properly learn from the problems in the past. The advantages could be substantial:

 

(a) We would no longer need to retreat from corpuses of real data on the grounds that they are ‘finite’ and ‘accidental’ (§ 12ff, 19).

(b) We would be authorised to formulate constraints on real utterances without having to declare which particular ‘level’ or ‘component’ they belong to; or to worry if they are ‘purely linguistic’ rather than cognitive or social (§ 8, 11, 20, 34-46).

(c) We would no longer need to shield technical fictions like ‘well-formedness’, ‘grammaticalness’, or ‘ideal speaker-hearer’ against ‘actual speech observed’ by means of elaborate dichotomisations between ‘competence’ versus ‘performance’ or between ‘deep’ versus ‘surface’ (cf. § 16-19).

(d) We would no longer need to strive toward the a priori criteria of rigour, abstractness, generality, and so on, set down by formalism, or else to apologise and defend ourselves for not doing so (§ 5).

(e) Best of all, our work would be judged not by its degree of formality but by its relevance and ecological validity — how we can contribute to a productive understanding of the human situation (§ 5). Real data are the most propitious source for examining issues such as the discourse of authority persons, e.g., when they magisterially declare what economic or military actions are and are not ‘warranted’ (§ 40). And for most purposes, real data can be given descriptions in ordinary language that speakers of English can generally understand and make use of (§ 73).

 

In return, we incur the hard work of engaging with large amounts of real data in all their grainy immediacy (§ 25, 56).

49. The time seems opportune to reformulate the tenets for a programme contrasting with the one stated in § 1:

 

(a) Language is a phenomenon integrated with human knowledge of the ‘world’ and society.

(b) A language constitutes a system defined both by internal, language-based constraints and by external cognitive and social constraints.

(c) The system should be described in terms of the conditions under which speakers use it.

(d) The description of a language should be stated at varying degrees of generality between the entire language and the specific discourse context, depending on what the data can support.

 

These tenets might seem to make the job of describing language messier and less focused, but they may in fact be what finally enables major progress.

50. Within such a programme, the concepts and methods of ‘mainstream’ descriptive and generative linguistics would not be abandoned but resituated in differently conceived projects. We would use the established linguistic methods and terms for identifying and describing the various aspects of our data in respect to the several ‘levels’ or ‘components’, as illustrated in § 34-46, while not insisting upon deciding which constraint belongs where. On the contrary, we assume that the interaction among sets of constraints supports the convergence of ‘levels’ and the consensus among speaker and hearer or writer and reader (§ 43, 46). Moreover, we shift our own search for consensus onto a higher functional and pragmatic plane: we assume a consensus among the language users who have produced our corpus data, and we exploit those data to reach our consensus about what their consensus might consist of (cf. § 9, 74).

51. Once we have shelved the search for ‘language by itself’ (‘langue’, ‘competence’, etc.), we have also dissolved the central rationale for ranking formalism over functionalism (§ 5). The ‘formalist trade-off’ whereby formality is financed by lowering coverage (§ 6) is now reversed. The very wide coverage of corpus data will allow us to rationally determine the limits on the degrees of formalisation we should impose upon language data, and thus upon our theoretical models (cf. § 55; Beaugrande, in press). This factor underscores the difference between the use of computers in corpus linguistics versus in all those methods of ‘computational linguistics’ that require the conversion of natural language into formal language (Sinclair 1994). Such conversions necessarily erase the sets of rich emergent constraints of just the types that large corpus data allow us to uncover (cf. § 43f, 46).

52. The ‘trade-off’ whereby formality is financed by also lowering convergence and consensus (§ 6) is reversed too. A careful functional description of corpus data can be confidently expected to raise them. Reading back over my description of ‘warrant’, I noticed little that looked genuinely contentious as long as I stayed close to the data. Yet the results did not just confirm my previous intuitions (cf. § 24f, 27). Before working through the data, I had only an implicit, fuzzy notion of what the verb ‘warrant’ would mean, partly because I hardly use it myself and find it a bit stuffy or pompous. Confronted with data, my intuitions about the ‘semantics’ of the item turned out to be too specific in that I had immediately associated it with legal or quasi-legal terms like ‘search’ and ‘arrest’, which in fact collocate mainly with the noun, plus ‘investigation’, ‘trial’, and ‘punishment’; I failed to realise the importance of the less specific but still common collocations such as ‘situation’. Yet paradoxically, my intuitions were also too general in that I also did not realise what sorts of situations warrant what sorts of responses, e.g. knowledge and knowledge-gathering. Some of the data disclosed items I would never have thought of as being ‘warranted’, such as ‘town’, ‘overeating’, or ‘space walk’, but I don’t claim such data are ill-formed, ungrammatical, deviant, or deficient, because I can readily interpret them in respect to activities (e.g. hiring, building, etc.) people might not perform until a suitable occasion ‘warrants’ it.

53. A problem seems to arise here: how can I assert that a corpus leads toward consensus and at the same time admit that it has been showing me facts that did not fit my intuitive expectations? Evidently, the degrees and modes of consensus can fluctuate considerably. Intuitive consensus is strongest for the standing constraints on a language; we all agree that the English infinitive is formed from ‘to’ + non-inflected verb, though a few finicky users insist, without intuitive consensus, that no other words can be interposed (the so-called ‘split infinitive’). But intuitive consensus is naturally weaker for the emergent constraints, which only apply when the occasion arises. So when we invent sentences that are unlikely to be said and don’t appear in a very large corpus, such as the 24 samples in Chomsky’s Aspects, our intuitions readily get out of their depth and become unreliable; they are being asked to generate consensus on an inappropriate level, and it’s hardly surprising when they don’t (cf. § 57). For example, our intuition that a usual relation between the subject and the direct object of ‘warrant’ is that between action and reaction, between knowledge and knowledge-gathering, or between problem and problem-solving (§ 37), is quite adequate for making sense of the real data from the BoE just because it is not rigorous or formal. It is not adequate for deciding the ‘well-formedness’ of invented sentences like ‘the man warranted the ball’, ‘John is warranted to please’, or ‘the mat warranted being fatly sat on by the cat’ without any real contexts; they don’t seem very natural or sensible, but whether or not they are part of the English language is an inappropriate question.

54. I argued in section A that a stagnation in coverage, convergence, and consensus began to grow acute during the move from morphology to syntax and in the ensuing retreat from patterns in recorded fieldwork data to a shadowy domain of ‘underlying structures’ and ‘rules’ intended to reconstruct the form of invented data (§ 11ff). Yet numerous linguists accepted such hidden costs because formalism promises stability, determinism, and rigour, and justifies a withdrawal from the rich and messy contexts of human interaction (§ 19). A large corpus, in contrast, promises a steep rise in coverage, convergence, and consensus by keeping us fairly close to those contexts. Instead of a notion of language with formal units and structure-building rules at the centre (§ 12), we have a notion with pragmatics at the centre, so that the first ‘facts’ are the documented speech acts of producing the real data that the corpus offers us (cf. § 68). The semantic and syntactic data are functionally framed within those acts and need to be described as such (§ 28).

55. The notion of ‘well-formedness’ could be reconceived (cf. § 48). Its grounding would no longer be sporadic, small-scale applications of intuition or introspection to handfuls of invented sentences, but a concerted large-scale empirical sorting to explore the greater or lesser consistency of certain formal patterns in a given language. With a suitably reliable parser, automatic searches and tabulations would not be unmanageably laborious and could help us determine which syntactic patterns are genuinely based on standing constraints. The end-result would revise the ‘well-formedness’ of formalist linguistics in at least five ways:

 

(a) It would be conceived not as a purely formal closed system operating on rigorous internal criteria, but as an adjunct layer of organisation supported by a functional open system (§ 28).

(b) ‘Formality’ would not be uniform but would register clines and gradations, ranging from the ‘frozen islands’ that formalism has focused upon, e.g. article before noun, over to subtly flexible word-fields, e.g. the mutual ordering of adjectives before a noun (e.g. ‘major new dictionary’ versus ‘new major theme’) (§ 43).

(c) There would be no sharp border where ‘well-formedness’ stops and ‘ill-formedness’ starts, but a gradual shading away into combinations that hardly occur because they are not very natural or sensible (§ 25, 53, 56), e.g., to say that ‘the government warrants a change in the electorate’ rather than the other way around (not a bad assessment of the administration of Jimmy Carter, though).

(d) ‘Well-formedness’ would not be the central object of linguistic description independent of any corpus of real data. Whether the consistent patterns in a given set of data are due to ‘well-formedness’ or due to some other factor would no longer be settled by speculative disputes but by the empirical results of sorting corpus patterns.

(e) The sentence would not be the axiomatic, obligatory unit but an empirically described pattern or pattern-set for organising clauses and clause complexes (cf. Halliday 1994). We could use the corpus to explore which types of words or collocations tend to be used for beginning or ending a sentence. In my data, for instance, ‘warrant it’ tends to go at the end of sentences, probably because it is a conceptually well-delimited combination with an air of finality: an action will be taken if the situation ‘warrants it’, where the ‘it’ handily takes in both the action and its context with respect to the situation.

 

In this new guise, ‘well-formedness’ would no longer impede coverage, convergence, and consensus.
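
The kind of ‘empirical sorting’ envisaged in (e) can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the four sentences are hypothetical stand-ins invented for the example (they are not lines from the Bank of English), the sentence splitter is deliberately naive, and the function names are my own. It counts how often a collocation such as ‘warrant(s) it’ falls at the very end of a sentence.

```python
import re

# Hypothetical sentences invented for illustration only; NOT corpus data.
TEXT = (
    "We will intervene if the situation warrants it. "
    "Nothing in the report seemed to warrant it. "
    "Does the evidence warrant it, or should we wait? "
    "The weather did not warrant a break."
)

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def final_position_share(text, pattern):
    """Return (n_final, n_total) for a collocation given as a regex."""
    final = total = 0
    rx = re.compile(pattern, re.IGNORECASE)
    for sent in split_sentences(text):
        # Drop closing punctuation so 'final' means 'last words of the sentence'.
        body = sent.rstrip(".!?\"' ")
        for m in rx.finditer(body):
            total += 1
            if m.end() == len(body):
                final += 1
    return final, total

final, total = final_position_share(TEXT, r"\bwarrants? it\b")
print(f"{final} of {total} occurrences are sentence-final")
```

With real corpus lines in place of the invented ones, the same tally could be repeated for any collocation whose preferred sentence position is in question.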

56. After such a revision, we would make rather different uses of our data. We would not focus our attention on contrasting the ‘underlying syntactic structure’ of invented sentences that look the same on the surface, e.g., ‘John is easy to please’ versus ‘John is eager to please’ (cf. § 13, 25). Such demonstrations lead us to complain, as Chomsky (1965: 24) has, that ‘surface structure’ is ‘unrevealing’ and ‘hides underlying distinctions of a fundamental nature’. The corpus data, on the contrary, reveal so much that it’s hard to take it all in. Genuine ambiguity in underlying structures seldom persists in real data, witness this line:

 

(24) < shampoos are effective enough to warrant only one shampoo per wash.>

 

A manufacturer is probably telling us that certain ‘shampoos’ as substances ‘warrant only one shampoo’ as one act of use during the total ‘wash’. We don't respond by parsing out the sentence two ways and then rejecting the alternative structure wherein some acts of use ‘warrant only one’ substance. The latter structuring is quite grammatical but pragmatically less sensible or natural, because manufacturers are more likely to praise the substances they sell than the diligence of the customers using them, and because people are more likely to use one shampoo several times in a row than several shampoos at the same time. Such ‘world-knowledge’ seems horribly grainy and vague when you try to ‘formalise’ it into ‘features’ and ‘rules’, but is cheerfully tidy and precise when you put it to use (cf. § 25, 48).

57. Similarly, we would not need to work out slippery exercises in inventing sentences that selectively violate just one type of constraint, which regularly leads to data that are plainly not sensible or natural (§ 15, 53), e.g.:

 

(24a) *The effective is shampoo enough only warrant to wash one. (syntactic violation)

(24b) ?These shampoos are blue enough to wash only one warrant. (semantic violation)

(24c) ?Shampoo, warrant only one wash, or you’ll get a pre-emptive strike! (pragmatic violation)

 

It seems safe to predict that none of these would appear in a corpus of English, however large; but neither would infinitely many other variations, and for reasons that would be rather tedious and pointless to label, let alone to state as formal rules.

58. Corpus studies can bring us back toward the tradition of fieldwork, from which formal linguistics had retreated (§ 11f, 14, 18, 54). Admittedly, the corpus is in a language we already know as ‘insiders’; and only the spoken part of the corpus was actually recorded in the original situation, while the written part was obtained through varying degrees of mediation, mainly through mass media. The observable substrate of interaction is thereby greatly diluted, but in ways that reflect the conditions of mass communication: the original producer(s) of the text may not even be known to the receiver, and the text is designed for a large, impersonal audience not too different from us in our roles as citizens rather than as professional linguists (cf. § 64, 67).

59. In contrast, building computer corpuses of unwritten and previously undescribed languages would be quite slow and laborious in the early stages, where the reliability of transcriptions might be doubtful, e.g., when the language has a complex tone-system, as in Aymara of Bolivia (Hardman 1981). Such a corpus could never reach the size of the Bank of English, but might well be usefully analysed with similar data-handling strategies and software. An intermediary domain would be corpuses of regional varieties of a well-known language, such as the set in the International Corpus of English (ICE) co-ordinated by Sidney Greenbaum at University College London (cf. Greenbaum 1991, 1992, 1993). So far, the corpuses of one million words each are fairly complete for England (directed by Greenbaum), New Zealand (directed by Janet Holmes and Laurie Bauer), and Singapore (directed by Anne Pakir and Paroo Nihalani); further corpuses are in various stages for Cameroon, Canada, the Caribbean-Jamaica, Hong Kong, India, Ireland, Kenya-Tanzania-Zimbabwe, Nigeria, the Philippines, South Africa, and the US.

60. Corpus work will doubtless reshape the data-handling strategies enumerated in § 3. Until I can interview more people who work with corpuses, I shall merely describe the ‘pragmatics’ of data-handling in my own activities. The usual starting point would be to select one or more key words or collocations we intuitively expect to be interesting. We can then read through the lines returned on the display, along with frequency lists, positional frequency tables, and left- or right-positioned alphabetic sortings, to infer what the interesting aspects might be (§ 31ff). Going through the data lines and looking for what they might have in common can steadily refine our sense of how to collate them and what to generalise, e.g., whether the ‘warranting’ typically entails some authoritative or institutional force relevant to what kinds of situations ‘warrant’ what kinds of actions. We are on safer grounds than we would be without a corpus for balancing the general with the specific and for regulating the specificity of our generalisations, e.g., that it is much less typical for a good situation to ‘warrant’ being commended than for a bad one to ‘warrant’ being amended. Also, we can assess the ‘depth’ of the relations between the left-hand subcontexts and right-hand subcontexts as shown in the positional frequency table (§ 32) and in the Appendices at the end of the paper: the shallower, more lexical ones like ‘evidence + trial’ versus the deeper, more semantic ones like ‘problem + problem solving’.
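
For readers unfamiliar with these displays, the following Python sketch shows roughly what a key-word line display, a left- or right-positioned alphabetic sorting, and a positional frequency table involve. It is a minimal illustration, not the software actually used with the Bank of English: the toy ‘corpus’ consists only of four of the lines quoted as samples (24), (25), (26) and (29) in this paper, and the function names and the three-word span are assumptions of mine.

```python
import re
from collections import Counter

# Toy 'corpus': four display lines quoted in this paper, not the BoE itself.
LINES = [
    "shampoos are effective enough to warrant only one shampoo per wash",
    "not bad enough nor predictable enough to warrant a mid-season break",
    "terrible must have happened to warrant God's anger",
    "jobs overall is rare enough to warrant no apology",
]

def tokenize(text):
    # Lower-case word tokens; keeps internal apostrophes and hyphens.
    return re.findall(r"[a-z'-]+", text.lower())

def kwic(lines, node, span=3):
    """Return (left_context, node, right_context) for each hit of the key word."""
    hits = []
    for line in lines:
        toks = tokenize(line)
        for i, tok in enumerate(toks):
            if tok == node:
                hits.append((toks[max(0, i - span):i], tok, toks[i + 1:i + 1 + span]))
    return hits

def positional_table(hits, span=3):
    """Count the words found at each slot: -span..-1 (left) and +1..+span (right)."""
    table = {pos: Counter() for pos in range(-span, span + 1) if pos != 0}
    for left, _, right in hits:
        for k, word in enumerate(reversed(left), start=1):
            table[-k][word] += 1
        for k, word in enumerate(right, start=1):
            table[k][word] += 1
    return table

hits = kwic(LINES, "warrant")
right_sorted = sorted(hits, key=lambda h: h[2])        # alphabetic by right context
left_sorted = sorted(hits, key=lambda h: h[0][::-1])   # nearest left word first
table = positional_table(hits)

for left, node, right in right_sorted:
    print(f"{' '.join(left):>28}  {node}  {' '.join(right)}")
print("slot -1:", table[-1].most_common(2))
print("slot -2:", table[-2].most_common(2))
```

Even on four lines the table brings out the pattern ‘enough to warrant’; on a large corpus the same tabulation is what makes the left-hand and right-hand subcontexts visible at a glance.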

61. Our rarefying is situated chiefly in not having the communicative situation at hand for observation, a drawback applying most severely to the spoken portion of the corpus (cf. § 58). It would be desirable, though horribly expensive, to maintain a video corpus for at least some of the spoken material, which would enable us to correlate the regularities that do tend to leave evidence in transcriptions with ones that do not. We could then add relevant types of commentary to future transcriptions, e.g. about facial expressions.

62. Like fieldwork, corpus data militate against the decontextualising that has been favoured during the search for ‘language by itself’. If a line display appears to have been unduly decontextualised, we can ‘recontextualise’ it at the touch of a button. Here too, we stand to gain consensus on points that could be disputed for briefer samples. For instance, I was perplexed by this line:

 

(25) < not bad enough nor predictable enough to warrant a mid-season break. >

 

‘Unpredictable’ seems more fitting alongside ‘bad’ if, as I at first intuited, the performance of an athlete or team might call for a ‘break’. But when I accessed the source text, the missing subject was (as you may have guessed) ‘British weather’: you know it’s ‘bad’ but not how bad or when.

63. Naturally, the role of introspecting is dramatically reduced and constrained. The data have already passed the introspection of the text producers and, in public or print media, that of editors as well (§ 25). Our task now is to explore how our own introspecting can help us understand why these data did pass, e.g., whether because the lexical items are so compatible, say ‘evidence + trial’, or because the situation makes the action seem sensible, say, when ‘degenerating trees warrant specialist attention’, which is an utterly improbable combination in lexical terms alone. Moreover, you can test your introspections by predicting the words or concepts before and after the line on display and then calling up the text to check, as for (26-29) (displayed line in pointy brackets, non-displayed material in square brackets). I correctly predicted ‘something’ for (26) (syntactic place-holder to precede the adjective) and ‘improvements in’ for (27) (semantically quite plausible if ‘aid is cut’); but I could predict only a negation plus conceptual grouping for (28) such as ‘not /having/requiring/setting/imposing’, and still less for (29).

 

(26) [something] < terrible must have happened to warrant God’s anger.<t> Finding the flat >

(27) [the administration said the improvements in] < Costa Rica’s economic condition warrant the cut in aid, which the country >

(28) [Anyone can help by not setting] < age limits for jobs when jobs do not warrant them. <LTH> As local government >

(29) [a policy that can be relied upon to create] < jobs overall is rare enough to warrant no apology.<p> The IIE’s estimate >

 

Eventually, we might gain some reasonable estimate for the reliability of intuitive introspections among speakers, as well as for the degrees of predictability, which have long been a central and problematic issue in information theory.

64. Finally, the role of consulting informants is redistributed. First, the data themselves put us in indirect contact with a vast population of ‘informants’ who produced the data, and our records make it possible in many cases, if laborious, to enter into direct contact with them if we are really stumped about what they meant to say. Second, we can easily recruit native-speaker informants quite similar to the ones who produced the data or to whom the data were addressed, e.g., readers of newspapers and magazines, to make judgements, predictions, and so on about the data. Third, we still occupy the role of informants ourselves when we interpret the data, though, I have argued, we are much better positioned to attain consensus than when we invent short, trivial sentences and then interpret them (cf. § 20, 22, 27f, 39, 43, 46, 48, 50, 52-55, 62). After working through the data I am a better informant on ‘warrant’ and feel I can disagree even with prestigious dictionaries. When the Oxford English Dictionary (p. 931) lists a single source (30) for an ‘obsolete rare’ sense of ‘to direct a person authoritatively; to command’, I think the compilers were misled by ‘imperious’ to exaggerate the pragmatic force, whereas the better attested sense of ‘to authorise a course of action’ (cited in § 30), or, in my terms, ‘make appropriate’, is quite adequate. I also find the definition (31) in the Collins COBUILD English Language Dictionary (p. 1640) incomplete, because ‘I’ll warrant’ also carries the implication that you can’t point to actual facts; compare (19-21) in § 34 as well as the COBUILD’s own example, where what ‘not many people know’ is plainly a matter of conjecture.

 

(30) But that imperious custome warrants it, Our Author with much willingnes would omit This Preface to his new worke (Philip Massinger, The emperour of the east, a træge-comedie, 1631)

(31) You say ‘I’ll warrant’ when you want to indicate you are fairly sure of what you are saying [...] e.g. ‘not many people know that, I’ll warrant’.

 

Such cases indicate that when lexicography is to be driven by a large corpus, many people who are not lexicographers can actively participate as informants.

65. Indeed, lexicography may be transformed even in its most central concepts, such as synonymy. In fine detail, corpus data reveal very few synonyms, because virtually every word collocates in its own way: a final vindication of Saussure’s (1966 [1916]: 120) universal principle that ‘in language there are only differences’, yet not in the language system (‘langue’) but in language use (‘parole’), just where he staunchly refused to look for it. For example, ‘serious concern’ collocates pejoratively, whereas ‘serious consideration’ collocates amelioratively, even though we seem to have the same word ‘serious’ and two words with similar definitions, ‘concern’ being ‘marked interest or regard’ and ‘consideration’ being ‘continuous and careful thought’ (Webster’s Collegiate Dictionary, pp. 172, 178), and ‘showing concern’ for somebody resembling ‘being considerate’ of them. But we actually have here two kinds of ‘serious’: one in such collocations as ‘serious problem’, i.e. ‘grave’, and the other in such collocations as ‘serious intention’, i.e. ‘sincere’.[7] Similarly, the collocations of ‘enough’ are pejorative twice as often (40 versus 20 occurrences, 10 of the 40 with ‘serious’), while those of ‘sufficient’ are ameliorative twice as often (14 versus 7 occurrences, 5 of the 7 with ‘evidence’). Evidently, the more formal term ‘sufficient’ seems more appropriate when it’s something good ‘warranting’ or being ‘warranted’ than when it’s something bad. I don’t see anything odd about ‘sufficiently serious’, but I feel doubtful about ‘sufficiently unsatisfactory’ or ‘sufficiently weak’, and I feel amused when Proust’s masochistic Baron de Charlus complains, in the English translation, that the man he hired to whip him was ‘insufficiently insulting’.
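
The arithmetic behind the contrast between ‘enough’ and ‘sufficient’ can be laid out as a small tabulation. The counts below are the ones reported in this paragraph; the odds ratio is an extra summary figure added here purely for illustration and is not part of the original analysis. With counts this small it should be read as a descriptive pointer, not as a test of significance.

```python
# Collocation counts for the 'warrant' data as reported in the paragraph above.
counts = {
    "enough":     {"pejorative": 40, "ameliorative": 20},
    "sufficient": {"pejorative": 7,  "ameliorative": 14},
}

def ratio(word, numerator, denominator):
    """How many times more often `word` takes one polarity than the other."""
    return counts[word][numerator] / counts[word][denominator]

enough_ratio = ratio("enough", "pejorative", "ameliorative")
sufficient_ratio = ratio("sufficient", "ameliorative", "pejorative")

# Odds ratio (an added, illustrative measure): odds of a pejorative collocate
# with 'enough' relative to the odds of one with 'sufficient'.
odds_ratio = ratio("enough", "pejorative", "ameliorative") / ratio(
    "sufficient", "pejorative", "ameliorative"
)

print(f"'enough' is pejorative {enough_ratio:.1f} times as often as ameliorative")
print(f"'sufficient' is ameliorative {sufficient_ratio:.1f} times as often as pejorative")
print(f"odds ratio (enough vs. sufficient, pejorative): {odds_ratio:.1f}")
```

The two ratios are mirror images of each other (2.0 in each direction), which is what the phrase ‘twice as often’ asserts for both items.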

66. In respect to the three ‘C-tests’ proposed in section A, the coverage of a language by a corpus like the BoE is still far from complete. Indeed, corpus work makes us keenly aware that any set of data we can bank and display is at most a partial set, representative for a vastly larger set of utterances and collocations. Our best prospects would be to uncover a set of convergent regularities extending far enough across the corpus to enable a consensus that they also extend beyond it. But of course we can never conclusively prove that they do, nor that data outside the corpus would not reveal still further regularities. In view of the problems in ‘mainstream’ linguistics, we must be wary of wishfully judging the data more regular than they justify; not just ‘the Lord’, as Daniel Rogers vowed (sample (18) from § 30) in 1642, but also the corpus ‘warrants us to suspect the inconstant’.

67. Paradoxically, even the Bank of English, the largest computer corpus of real data ever built, is still too small. It confronts us with a fresh version of the complex ratios between what has been said versus what can be said, and between what people can understand versus what they have occasion to say. Much of the BoE, and of many similar corpuses, is taken from public discourse produced by specific types of people (such as literary authors, journalists, media personalities, advertisers, and interest groups) intending to communicate at a distance with a general audience who need to be motivated to read or listen to what is being said to them. As a result, the corpus is somewhat unbalanced in respect to informativity, i.e., what people think is worth listening to or reading about. What people can and do talk about in normal life is not necessarily what receives intensive mass media coverage in newspapers and magazines. The BoE shows high frequencies of depressing items like ‘death’, ‘kill’, ‘murder’, ‘massacre’, ‘shooting’, ‘robbery’, and ‘rape’, which, for reasons I have never understood, are believed to be topics of universal interest. We also get a bulk of smarmy or pretentious talk from politicians, military personnel, advertisers, and entertainment ‘personalities’, who often don’t speak like ordinary people nor let on just what they really think, especially not if the media are nearby. So we get unduly broad coverage of the things they’re interested in, usually ‘sensitive issues’ like their own image and other people’s money. And we get a hefty dose of careful grammar and printable vocabulary that sound cultivated (‘inclusion in a wheelchair’, ‘revolutionary optimism’, ‘degenerating trees’) or technical (‘military interception’, ‘peat extraction plant’, ‘scrappable cars’). The corpus clearly needs a much larger contingent of everyday conversation, which is unfortunately the costliest mode of data to accumulate in computer-readable formats.

68. Still, as such corpuses continue to grow, the degree of approximation between their coverage and the entire language will steadily improve, bringing us closer to a general convergence and consensus. We can transcend the choice between the notions of language being either a repertory of units or else a repertory of rules for constructing and arranging units (§ 12) by a notion of language being a set of the standing internal constraints designed to interact richly with emergent external constraints from world and society during discourse (§ 43f, 46, 53). Both types of constraints apply to corpus data; yet the corpus would surely be the most productive basis for eventually determining which type is which.

69. Sinclair (1994) has aired the striking prospect that we can survey the ‘parole’ or ‘performance’ on the horizontal lines of displayed data, and can survey the ‘langue’ or ‘competence’ by scanning the entire vertical set of lines (e.g. in my Appendices). Yet what we see is actually the results of competence, and extensive psychological and social research is needed to determine how such a ‘competence’ can produce such results during actual discourse. The most plausible account, to my knowledge, is that language is a dynamic system in continual evolution, and its current organisation during a given discourse is generated quickly and cheaply by interactions among local constraints (cf. § 43; Beaugrande, in preparation). This account is congenial to corpus work in suggesting that much of the operation is fairly detailed and highly organised despite the ease and speed of producing and comprehending the discourse. The knowledge of language might be stored and accessed in multiple formats and would operate as a morphological and syntactic grammar, a lexical repertory of words or collocations, a semantic array of concepts or meanings, or a pragmatic directory of action and interaction strategies, as suits the occasion (cf. § 71). The ‘multiplex’ design would be ideal for interfacing the constraints among domains, though it would be hard for us to reconstruct it in detailed models.

70. As the corpus continues to be described, the description will eventually take on such huge dimensions as to burst the confines of even the largest grammar-book or dictionary compiled so far. Just the description of the constraints that relate to the verb ‘warrant’, which is not a particularly common verb in comparison to, say, ‘cause’ or ‘call for’, already seems extensive. What about all the other entries that would pass the frequency cut-off in a corpus-driven dictionary like the 1987 Collins COBUILD English Language Dictionary with 70,000 references, or merely the ‘most frequently used 2000 words in English treated in exceptional detail’ (cover blurb) by that same compilation? How could people make use of such an unwieldy description?

71. A reasonable answer might be: much as we now make use of the corpus itself, by accessing modest, more general or more specific displays not just of data but of data plus description. It would be highly desirable to give the description multiple formatting capabilities similar to the multiple representation I suggested for a person's knowledge of the language (§ 69). The corpus could be accessed as a grammar, a guide to usage, a special-purpose lexicon, a pedagogical tool, and so on. None of these formattings could, or would need to, contain the totality of the description, but each would be designed to expropriate what it needed from that totality, e.g., when we want a survey of usage in a special field like theoretical physics. Appropriate software could provide backup consultations in finer detail if the user requested them, e.g., if our survey wanted to distinguish popular books on physics from ones produced strictly for experts.

72. This mode of access would enable us to generate specific convergences of the data that are relevant to a stated issue. If we are designing a course in English as a Foreign Language for Special Purposes to be included in a curriculum for physics students, we need to know how the data converge for that discourse domain and not for, say, literary English, which is often the central domain in traditional EFL programmes. With a corpus, such programmes would have a systematic means of deciding what data should be made the basis for instruction. They can also consult the recently developed International Corpus of Learner English (ICLE) directed by Sylviane Granger at the University of Louvain, which consists of learners’ essays from eight language areas and which supports a large-scale empirical assessment of the degrees of competence we can build upon (cf. Granger 1993).

73. A consensus about our own descriptive terms and methods has not yet been attained, which is hardly surprising for a relatively new field of research. Lively disputes can still be expected about how to label and categorise the various constraints, but matters may be kept under control if we can agree on some procedures. One procedure would be to retain well-known terms of linguistics, as I have done in this paper. Another would be to respect the interaction among types of constraints, e.g. semantic and pragmatic ones, rather than trying to put them all into neatly separated piles (cf. § 34-46). Yet another would be not to formalise the constraints as ‘rules’ but to formulate them chiefly in everyday language that would be both well-suited for checking with informants and ‘friendly’ to potential users of our descriptions, such as textbook authors.

74. Would our own rising consensus as linguists correspond to the consensus of speakers of the language (§ 9, 50)? The close contact with real data suggests that we ought to do rather better here than a non-corpus linguistics relying on invented data (§ 13, 15, 19, 34, 39, 53ff, 56). However, it is far from settled how general or exact the consensus of speakers might be; this is another question upon which corpus linguistics might finally shed some light. Sorting the corpus to display the diversified varieties of native speakers of English would be a major advance for sociolinguistics, as would the creation of corpuses of international varieties cited in § 59.

75. Could our own consensus eventually change public attitudes about English by making them more realistic and ‘data-driven’? This question has been around since the start of modern linguistics, whose descriptive surveys (e.g. Leonard ed. 1932) proved long ago that general English usage is quite different from what schools and self-appointed language guardians claim. But public attitudes have remained largely unrealistic because they offer the surest basis for discriminating against specific social groups and their real usage, a factor intensifying today wherever other bases for discrimination have been forbidden by law.

76. Probably, the impact of corpus linguistics will be greatest through widely used reference works, especially corpus-driven dictionaries such as the 1987 Collins COBUILD English Language Dictionary and its successor due to appear in 1995. However, a substantial impact could also be achieved if a large body of users can access the corpus for themselves, which is already possible (though expensive) for the older 20-million-word Birmingham corpus. Having multiple access modes for respective uses, as proposed in § 71, would be crucial here.

77. Wide access would at least enable influential groups, such as language teachers, to get accurate data about usage. So far, however, the implications of corpus linguistics for language teaching are just beginning to be explored. Two contrary scenarios are readily conceivable. On the gloomier side, the fine details of ‘idiomatic’ English revealed by a large corpus make the tasks of teachers and learners look much harder than the older approaches that just tried to teach pronunciation, vocabulary words, and a smattering of morphology and syntax, largely leaving the semantics and pragmatics to take care of themselves. Would learning an item like the verb ‘warrant’ entail learning all the baggage it seems to carry about in actual usage, e.g., its more frequent and characteristic collocations? Perhaps this problem appears harder than it would be if we could finally lay to rest the assumption of so many teachers, learners, and textbook authors that language operates as a set of words plus a set of formal patterns into which we plug the words. If the chief unit of language use and language learning is recognised to be the collocation, then learning words would be systematically correlated with learning the ‘company they keep’ (§ 26), and the latter kind of learning would not be seen as some onerous supplementary job.
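
The notion that a word is learned together with the ‘company it keeps’ can be made concrete with a small computation. What follows is only a minimal sketch in Python, not the procedure used on the corpus cited in this paper: the function name `collocates`, the four-word span, and the three sample lines (adapted from the appendices) are illustrative assumptions, and the tokenisation is deliberately crude.

```python
from collections import Counter
import re


def collocates(lines, node, span=4):
    """Count the words found within `span` words on either side of `node`.

    `span=4` is an arbitrary illustrative choice, not a recommended setting.
    """
    counts = Counter()
    for line in lines:
        # Drop corpus codes such as <t> or <M01>, then tokenise crudely.
        text = re.sub(r"<[^<>\s]*>", " ", line)
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        for i, word in enumerate(words):
            if word == node:
                # Neighbours to the left and right, excluding the node itself.
                window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
                counts.update(window)
    return counts


# Hypothetical mini-sample, adapted from lines shown in the appendices.
sample = [
    "insists there is enough evidence to warrant an investigation",
    "there wasn't enough evidence to warrant a jury trial",
    "these problems are too trivial to warrant a visit to the surgery",
]
freq = collocates(sample, "warrant")
print(freq.most_common(5))
```

Even on three lines, ‘to’, ‘enough’, and ‘evidence’ recur as neighbours of ‘warrant’; a learner's entry for the verb could be built around such recurrent company rather than around the isolated word.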

78. On the brighter side, corpus data will surely improve our criteria for selecting the more frequent and useful words and collocations, and for formulating the ‘notional’ concepts to be covered, e.g., a class of ‘knowledge-gathering activities’. Also, a large corpus could be strategically exploited to decide what to put into a small corpus compact and cheap enough for learners of English as a second language to access regularly on modest personal computers or classroom workstations, another trend that has already begun. This proposal recalls C.K. Ogden’s ‘Basic English’, except that ours would be corpus-driven and would allow us to generate different versions at many different degrees of ‘basicness’.
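
One way to picture ‘degrees of basicness’ is as a cut-off on cumulative frequency. The sketch below is offered under explicit assumptions and is not a description of any existing project: the function name `basic_vocabulary`, the use of token coverage as the measure of basicness, and the toy sentence are all hypothetical, and a real selection would also need to weigh collocations and usefulness, not frequency alone.

```python
from collections import Counter


def basic_vocabulary(tokens, coverage):
    """Return the most frequent words that together account for at least
    the share `coverage` (between 0 and 1) of all tokens.

    Treating coverage as the measure of 'basicness' is an assumption
    made for illustration only.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    chosen, covered = [], 0
    # Most frequent first; ties are broken alphabetically for reproducibility.
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= coverage * total:
            break
        chosen.append(word)
        covered += n
    return chosen


# Hypothetical toy 'corpus'.
tokens = "the cat saw the dog and the dog saw the bird".split()
for share in (0.5, 0.8, 1.0):
    print(share, basic_vocabulary(tokens, share))
```

Raising or lowering the coverage share yields a larger or smaller word list from the same data, which is the sense in which a single large corpus could generate several versions of a small one.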

79. In sum, I have tried to suggest some ways in which the advent of large-corpus linguistics can offer, for the first time in many years, a genuine opportunity to reorganise the pragmatics of doing language science, and its neighbours which depend upon it, on a new and more realistic basis. In an age when public attitudes about language are unrealistic and discriminatory, and when our beleaguered abilities to talk and listen to each other profoundly affect our collective chances of survival, surely the heavy investment in corpuses that get a broader yet more exact view of the population of language users is, well, strongly warranted.

 

References

 

Baker, Mona, Gill Francis, & Elena Tognini-Bonelli, eds. 1993. Text and technology. Amsterdam: Benjamins.

Beaugrande, Robert de. 1991. Linguistic theory: The discourse of fundamental works. London: Longman.

Beaugrande, Robert de. in press. Function and form in language theory and research: The tide is turning. Functions of Language 1/2.

Beaugrande, Robert de. in preparation. New foundations for a science of text and discourse. London: Longman.

Bierwisch, Manfred. 1965. Poetik und Linguistik. In Helmut Kreuzer & Rul Gunzenhäuser, eds., Mathematik und Dichtung. Munich: Nymphenburger, 49-66.

Bloomfield, Leonard. 1933. Language. New York: Holt.

Chomsky, Noam. 1957. Syntactic structures. The Hague: Mouton.

Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.

Firth, John. 1968. Selected papers of J.R. Firth 1952-1959, ed. Frank R. Palmer. London: Longman.

Francis, Gill. 1993. A corpus-driven approach to grammar. In Baker et al., eds., 137-156.

Granger, Sylviane. 1993. The International Corpus of Learner English. In Jan Aarts, Paul de Haan, & Nelleke Oostdijk, eds., English language corpora: Design, analysis, exploitation. Amsterdam: Rodopi, 57-69.

Greenbaum, Sidney. 1991. The development of the International Corpus of English. In Karin Aijmer & Bengt Altenberg, eds., English corpus linguistics. London: Longman, 83-91.

Greenbaum, Sidney. 1992. A new corpus of English: ICE. In Svartvik, ed., 171-179.

Greenbaum, Sidney. 1993. The tagset for the International Corpus of English. In Clive Souter & Eric Atwell, eds., Corpus-based computational linguistics. Amsterdam: Rodopi, 11-24.

Halliday, Michael. 1994. An introduction to functional grammar (second revised edition). London: Arnold.

Hardman, Martha. 1981. The Aymara language in its cultural and social context. Gainesville: Univ. of Florida Press.

Leonard, Sterling, ed. 1932. Current English usage. Chicago: National Council of Teachers of English.

Louw, Bill. 1993. Irony in the text or insincerity in the writer?: The diagnostic potential of semantic prosodies. In Baker et al., eds., 157-176.

Rumelhart, David, James McClelland, et al. 1986. Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.

Saussure, Ferdinand de. 1966 [1916]. Course in general linguistics, transl. Wade Baskin. New York: McGraw-Hill.

Sinclair, John McH. 1992a. Priorities in discourse analysis. In Malcolm Coulthard, ed., Advances in spoken discourse analysis. London: Routledge, 79-88.

Sinclair, John McH. 1992b. The automatic analysis of corpora. In Svartvik, ed., 379-397.

Sinclair, John McH. 1994. Lecture on corpus linguistics at the University of Vienna, June 1994.

Svartvik, Jan, ed. 1992. Directions in corpus linguistics. Berlin: Mouton de Gruyter.

 

Appendix A. What does or doesn't do the 'warranting'

< return. This achievement seemed to warrant a couple of days' rest, a few >

< aggressions, each too small to warrant war. Because possession of the >

< House says these air leaks do not warrant military interception # I'm >

< office and that his behavior could warrant both criminal and political >

< job bias is widespread enough to warrant special protections for gay >

< Chevrolet Beretta. This car  does not warrant particular mention but I thought >

< not yet enough scrappable cars to warrant widespread collection of old >

< the present circumstances do not warrant a change in the leadership # so it >

< Costa Rica's economic conditions warrant the cut in aid, which the country >

< some of the costs erm just wouldn't warrant it.<M01> Yeah. I suspect also I >

< disability is not felt sufficient to warrant his inclusion in the wheelchair >

< distress levels severe enough to warrant professional intervention,  levels >

< On its own the documentary wouldn't warrant more than a 4: out-takes of films, >

< insists there is enough evidence to warrant an investigation # One suggestion >

< food shortage is severe enough to warrant breaking the embargo # This report >

< stories of ill health that appear  to warrant surgical intervention. Frequently >

< and historic sites are too small to warrant a full-time custodian. This can >

< these old homes are chilly enough to warrant guests wearing thermal long johns >

< age limits for jobs when jobs do not warrant them. <LTH> As local government >

< action if our national interests warrant it.<t> Credibility with our allies >

< the national objectives at stake warrant the deaths of US troops # Oil, >

< there's no special occasion to warrant overeating. Even if Grandma has >

< operation was important enough to warrant a middle-of-the-night briefing  >

< these problems are too trivial to warrant a visit to the surgery. Your vet >

< medical or psychological problems to warrant using these drugs. Health experts >

<t> His disciplinary record may soon warrant a lengthy ban from the game but, >

< revelations of an affair did not warrant my leaving the Government . <t> I >

< shampoos are effective enough to warrant only one shampoo per wash. <LTH> >

< situation is not bad enough yet to warrant that type of appeal. We do not see >

< that the situation does not yet warrant the sending of those supplies # >

< as a major threat sufficient to warrant a pre-emptive strike of their own. >

< bark disease. Degenerating trees warrant specialist attention. Felling or >

< here are enough worshippers to warrant keeping the huge edifice open, so >

Appendix B. What is or is not 'warranted'

< age limits for jobs when jobs do not warrant them. <LTH> As local government >

< terrible must have happened to warrant God's anger.<t> Finding the flat >

< you do it. If your circumstances warrant it, consider an answering machine. >

< jobs overall is rare enough to warrant no apology.<p> The IIE's estimate >

< encouraging progress it did not yet warrant full-scale economic assistance. >

< er are large enough really er to warrant an assistant manager.<M01> Mm. >

< bark disease. Degenerating trees warrant specialist attention. Felling or >

< bad enough nor predictable enough to warrant a mid-season break. A four-week >

< operation was important enough to warrant a middle-of-the-night briefing # >

< the present circumstances do not warrant a change in the leadership # so it >

< juicy stuff about themselves to warrant a few column inches. MY LADYLUST >

< Boren says the charges, if true, warrant serious concern, but he stresses >

< enough of its own character to warrant serious consideration. <LTH> The >

< the national objectives at stake warrant the deaths of US troops # Oil, >

< liberalizing trade sufficiently to warrant its exclusion from the Bush >

< disability is not felt sufficient to warrant his inclusion in the wheelchair >

< House says these air leaks do not warrant military interception # I'm  >

< stories of ill health that appear to warrant surgical intervention. Frequently >

< sufficient prima facie evidence to warrant an investigation into war crimes >

< a saint, but he has done nothing to warrant jail time, and I want my son home.>

< guide . If alive today, he would only warrant a mention in passing as someone >

< to such an extent that their numbers warrant their official extermination >

< that contemporary events did not warrant a revolutionary optimism, or even >

< there's no special occasion to warrant overeating. Even if Grandma has >

<  s # though not sufficiently so to warrant a plagiarism suit . <t> Freed then >

< pressures severe enough to warrant highly restrictive policies in >

<  six of those were serious enough to warrant prosecution.<M03> The+ I mean some  >

< been registered certainly do not warrant the soldiers leaving their posts , >

< on Columbia' s cargo bay door may warrant a space walk for repairs by the >

< there wasn't enough evidence to warrant a jury trial # The U.S. supreme >

< these problems are too trivial to warrant a visit to the surgery. Your vet >

Appendix C. Relevant criteria

< said there was enough evidence to warrant the case going to court; he set no >

< not bad enough nor predictable enough to warrant a mid-season break. A four-week >

< will not be long or bloody enough to warrant one.<p> Perhaps. Hitler said that >

< these old homes are chilly enough to warrant guests wearing thermal long johns >

< shampoos are effective enough to warrant only one shampoo per wash. <LTH> >

< not quite heroic or doom-laden enough to warrant the sort of revisionist myth- >

< er are large enough really er to warrant an assistant manager.<M01> Mm. >

< figure, he is prominent enough to warrant direct attention without being on >

< six of those were serious enough to warrant prosecution.<M03> The+ I mean some >

< food shortage is severe enough to warrant breaking the embargo # This report >

< knee, and was spectacular enough to warrant opening the glass doors and >

< conditions become so dire as to warrant it. Even then, the bank will take >

< believed the need was so great as to warrant the expenditure." A new sea wall >

< in Pakistan are so poor as to warrant teams not touring there. In the >

< were sufficiently accurate to warrant our setting up a research program >

< REELS are sufficiently expensive to warrant at least a little regular >

< the riots was sufficiently grave to warrant the extension # For National >

< sufficiently politically charged to warrant such a label. In other words, it >

< the declines are too modest to warrant the phrase recession, " said Lewis >

< these problems are too trivial to warrant a visit to the surgery. Your vet>
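
The lines in Appendix C are grouped by the frame that states the criterion: ‘enough to’, ‘so ... as to’, ‘sufficiently ... to’, and ‘too ... to’. A grouping of this kind could be approximated automatically. The following Python sketch is a hedged illustration only and does not reproduce how the displays above were sorted: the names `FRAMES` and `criterion_frame`, the frame labels, and the limit of two intervening words are all assumptions.

```python
import re

# Illustrative frames; up to two words may intervene where indicated,
# as in 'large enough really er to warrant'.
FRAMES = [
    ("enough ... to", re.compile(r"\benough (?:\w+ ){0,2}to warrant\b")),
    ("so ... as to", re.compile(r"\bso \w+ as to warrant\b")),
    ("sufficiently ... to", re.compile(r"\bsufficiently (?:\w+ ){1,2}to warrant\b")),
    ("too ... to", re.compile(r"\btoo \w+ to warrant\b")),
]


def criterion_frame(line):
    """Return the label of the first frame found in `line`, or None."""
    for label, pattern in FRAMES:
        if pattern.search(line):
            return label
    return None


# Lines taken from the appendices, without the display brackets.
for line in [
    "food shortage is severe enough to warrant breaking the embargo",
    "conditions become so dire as to warrant it. Even then, the bank will take",
    "sufficiently politically charged to warrant such a label. In other words, it",
    "the declines are too modest to warrant the phrase recession",
    "bark disease. Degenerating trees warrant specialist attention.",
]:
    print(criterion_frame(line), "|", line)
```

Lines with no explicit frame, such as the last one, are returned as `None` and would still need to be inspected by hand.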



[1] Actually, the active corpus I used is around 169 million words, and fluctuates between 167 and 175 million, according to Ramesh Krishnamurthy, Development Manager at COBUILD.

[2] Technically, I’d have to count Chomsky’s words and not just his sentences; but since he invented them all, the term ‘corpus’ really doesn’t apply anyway.

[3] The diamond brackets <> appear in BE data at the start and end of a displayed line. They also enclose a variety of codes, of which the following appear in the data I cite: for upper case, <CQ0>: start quotes; <CQ1>: end quotes; <FCH>: font change; <F0 + [digit]>: female speaker with identification number; <LCH>: chapter heading; <LTH>: start text (e.g. when going to the next item in a newspaper); <M0 + [digit]>: male speaker with identification number; <SO>: sentence(s) omitted; <ZF1>: start repetition; <ZF0>: end repetition; for lower case, <h>: heading; <p>: start paragraph; <t>: start text.

[4] The number of occurrences may be greater than the number of lines, since multiple occurrences of an item in a line are not uncommon; a word may even appear in its own collocational frequencies list (in the sense of § 32).

[5] To enhance the visibility in the cut-off, I give here the frequencies for the full return of 392 lines.

[6] My displays alphabetise sometimes just the noun and sometimes modifier + noun, depending on what seemed more revealing; but this is merely a matter of making it handier to compare things and raises no theoretical issues at this stage.

[7] Characteristically, the Collins COBUILD English Language Dictionary (p. 288) gives as the first definition for ‘concern’ the ‘worry that people have about a situation’, which was the more common use, whereas Webster’s Collegiate Dictionary (p. 172) puts first the less common and presumably more basic definition ‘something that relates or belongs to one’, and gives ‘state of apprehension’ only in third place.
