XML mark-up: an annotation tool for discourse analysis

Working recently on a critical discourse analysis project that required annotation of sentencing remarks from UK judges, our team were introduced to the practice of XML mark-up. Manual XML coding was used as a way of recording the different representation strategies used by the judges when referring to the convicted offenders (Van Leeuwen 2008). This blog delves into the theory behind the annotation method to explain how manual XML mark-up contributes to the linguistic research process.

In corpus linguistics, computer-assisted analysis helps reveal statistical patterns of linguistic phenomena across large bodies of text. The application of pertinent annotations to the corpus is a necessary step towards this analysis, but is ‘a complex, expensive and time-consuming activity’ (Lenzi et al. 2012:333). In this context, several computer-based tools have been developed with the aim of speeding up and simplifying manual text annotation. Digital coding methods have additional advantages over pen and paper approaches; O’Donnell (2012:116) notes that digitally coded files are easier to share and easier to update, which is especially useful when rectifying mistakes. However, he also acknowledges that replicating pen and paper methods with basic IT solutions (word processors or spreadsheets) has its own limitations. XML mark-up is an alternative that addresses these limitations: it retains the context of linguistic tokens so they can be considered in a qualitatively appropriate light, and it limits neither the number of dimensions that can be identified within a text nor the quantity of text that can be highlighted for clear statistical analysis.

eXtensible Mark-up Language

XML (eXtensible Mark-up Language) is part of a larger family of mark-up languages based on SGML (Standard Generalised Mark-up Language), and one which can be utilised with comparatively little prior technical knowledge (Hardie 2014). Moreover, it is not restricted to a pre-existing set of rules, meaning users can devise their own tags within a bespoke system relevant to their particular study[1]. XML mark-up has increasingly become a favourite method in corpus annotation projects (Lu 2014:6).
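To make the idea of a bespoke tag set concrete, here is a minimal sketch in Python, using only the standard library. The sentence is invented rather than taken from our corpus, and the `<actor>` tag and its `type` values are hypothetical labels loosely named after Van Leeuwen's categories, not the scheme used in our project.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sentence with hypothetical tags; not taken from the project corpus.
SAMPLE = (
    '<text id="sample01">'
    '<s>The court heard that '
    '<actor type="nomination">Mr Smith</actor> was a '
    '<actor type="functionalisation">driver</actor>.</s>'
    '</text>'
)

# Parsing fails with a ParseError if the mark-up is not well-formed,
# for example if a closing tag is missing.
root = ET.fromstring(SAMPLE)

# Count how often each (hypothetical) representation strategy was tagged.
counts = Counter(el.get("type") for el in root.iter("actor"))

# Joining the text nodes strips the tags and recovers the plain sentence.
raw_text = "".join(root.itertext())

print(counts)
print(raw_text)
```

Removing the tags returns the unannotated sentence, so nothing of the raw text is lost by incorporating the mark-up.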

As a manual approach to annotation, XML mark-up has some clear advantages over automated tagging methods: analysis isn’t constrained by the biases or concerns of other programmers who pre-set the computational parameters[2], and a wider range of linguistic features is identifiable than through computerised processes, particularly semantic and pragmatic features (O’Donnell 2008). Baker (2006:42) points out that text types which use unconventional lexicogrammars are particularly prone to errors in automatic tagging. Of the SGML-based mark-up languages, XML has the least complex requirements for application, and using a simpler annotation scheme reduces the chance of error (Leech 1993:279).

O’Donnell (2008:2), in discussing his XML-based UAM CorpusTool software, elaborates on the functional benefits of XML mark-up:

Annotation tasks also frequently require annotation at multiple ‘layers’, for instance, one might assign features to the document as a whole (e.g., text type, writer characteristics, register, etc.), or to segments within the text (at semantic-pragmatic layers, syntactic units (e.g., clause, phrases, etc.), or at a lexical level.

In the UAM CorpusTool, this mark-up is applied in a ‘standoff’ manner, meaning the raw corpus is not altered and a separate XML file is used to refer back to the original source. An early maxim for corpus linguistics was that the raw corpus should always be extractable from its annotations, which should be stored separately (Leech 1993:275). However, Thompson and McKelvie (1997) supply specific reasons for preferring a ‘standoff’ approach: when using read-only or very large files that make copying for mark-up unfeasible; when the annotation scheme contains multiple hierarchies that require partial overlap of tags; and when the distribution of the source material is restricted, but the mark-up is intended to be freely accessible. When these conditions do not apply, as in our project, the argument for using a ‘standoff’ approach is weaker[3]. The popular online corpus tool SketchEngine can interpret XML tags incorporated within the files of uploaded corpora. Therefore, in keeping with Hardie’s (2014) advocacy for user-friendly mark-up systems, the compatibility of incorporated XML with this familiar software may make it a preferable approach for analysts whose expertise is in linguistics rather than computer-based technologies.
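The difference between the two approaches may be easier to see in a small example. The sketch below is an illustration in Python under stated assumptions, not the UAM CorpusTool file format: the raw text is left untouched and a separate, hypothetical XML record points back to it by character offsets. The element and attribute names (`segment`, `start`, `end`, `type`), the file name and the sentence itself are all invented.

```python
import xml.etree.ElementTree as ET

# The raw text stays exactly as it was (invented sentence, not corpus data).
RAW = "The court heard that Mr Smith was a driver."

# Hypothetical standoff record. Offsets are 0-based character positions
# in RAW, with the end position exclusive.
STANDOFF = """
<annotations source="sample01.txt">
  <segment start="21" end="29" type="nomination"/>
  <segment start="36" end="42" type="functionalisation"/>
  <segment start="24" end="42" type="overlap_demo"/>
</annotations>
""".strip()

def resolve(raw, standoff_xml):
    """Return (label, text) pairs by looking each segment up in the raw text."""
    root = ET.fromstring(standoff_xml)
    return [
        (seg.get("type"), raw[int(seg.get("start")):int(seg.get("end"))])
        for seg in root.iter("segment")
    ]

spans = resolve(RAW, STANDOFF)
for label, text in spans:
    print(f"{label:20}{text}")
```

The third segment partially overlaps the first. Incorporated tags must nest cleanly, so a span of this kind cannot be written directly as ordinary paired tags in the text, which is the sort of situation in which Thompson and McKelvie's second condition applies.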


Perhaps the most apparent complication with XML tags is readability to the human analyst. ‘Standoff’ annotation, a step removed from its source context, is particularly difficult to decode without the aid of a computing interface, but incorporated mark-up can also pose a challenge to the unpractised eye (see Potts and Formato forthcoming). One key to mitigating this problem is to ensure a clearly defined and explanatory annotation scheme is in place, using tag labels that ‘make it as obvious as possible what each tag represents’ (Hardie 2014:102. Original emphasis). Lu (2014:5) observes similarly that ‘labels should be concise and intuitively meaningful’. Readability is also, perhaps, less of an issue when one remembers that the purpose of the mark-up is to be read by digital software, not human analysts.

Another setback can come with editing the mark-up. Searching for and scanning concordance lines relating to particular tags is a simple method of checking for user errors[4]. However, whilst ‘standoff’ software can allow immediate editing within the program itself, SketchEngine requires a multi-phase editing process: identifying errors, returning to and correcting the original files, then re-uploading them. This can make an already time-consuming process more so. Nonetheless, the specificity that manual annotation can offer to smaller corpora, compared to automated modes, may reduce the workload at the analysis stage of a project, despite initially increasing it during the data processing phase (Potts and Formato forthcoming).
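As a rough illustration of checking tags through concordance lines, the Python sketch below lists every tagged item with a little surrounding context and flags attribute values that are not in the annotation scheme. It is a standalone approximation and does not reproduce how SketchEngine or any other tool works; the sentences, the `<actor>` tag, the `type` values and the `SCHEME` set are hypothetical, and the misspelt value is planted deliberately.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation scheme and invented text (one value is misspelt).
SCHEME = {"nomination", "functionalisation"}
SAMPLE = (
    '<text>'
    '<s>The court heard that <actor type="nomination">Mr Smith</actor>'
    ' was a <actor type="functionalisation">driver</actor>.</s> '
    '<s><actor type="nominaton">Mr Smith</actor> pleaded guilty.</s>'
    '</text>'
)

def concordance(xml_string, tag, width=15):
    """Return (left, node, right, attributes) for every <tag> element."""
    root = ET.fromstring(xml_string)
    pieces = []  # plain-text fragments in document order
    hits = []    # (start, end, attributes) for each matching element
    pos = 0

    def walk(el):
        nonlocal pos
        start = pos
        if el.text:
            pieces.append(el.text)
            pos += len(el.text)
        for child in el:
            walk(child)
            if child.tail:
                pieces.append(child.tail)
                pos += len(child.tail)
        if el.tag == tag:
            hits.append((start, pos, dict(el.attrib)))

    walk(root)
    hits.sort(key=lambda h: h[0])  # order hits by where they start
    text = "".join(pieces)
    return [
        (text[max(0, s - width):s], text[s:e], text[e:e + width], attrs)
        for s, e, attrs in hits
    ]

lines = concordance(SAMPLE, "actor")
for left, node, right, attrs in lines:
    print(f"{left:>15} [{node}] {right:<15} {attrs.get('type')}")

# Flag tagged items whose type is not part of the scheme.
unknown = [line for line in lines if line[3].get("type") not in SCHEME]
```

Here `unknown` holds the one item whose `type` was mistyped, which would then need correcting in the original file.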


XML mark-up is user-defined and can, therefore, be tailored to answer specific research questions. This allows linguistic understanding to be applied across many fields (Lenzi et al. 2012:333 cite sociology, psychology, and cultural and historical studies as examples). Within the inherent flexibility of XML, the option of ‘standoff’ or incorporated mark-up methods offers additional scope for analysts to tailor their approach to individual skills and project requirements.

Although initially more time-consuming than using a pre-existing auto-tagger, XML mark-up can save time in the analysis period: by focusing only on the requirements of a specific research question, search options become clearer, errors fewer, and superfluous or irrelevant data are avoided. In smaller corpus studies, there is often no end-user for the annotations beyond the single project analyst or team who will report their findings (Hardie 2014:76). In such cases (as with our project on representation strategies), any reduction in universal access to and application of annotations is outweighed by the ease of focused analysis that manual XML mark-up provides.


[1] Some tags have evolved into de facto standards (e.g. <p> for paragraph, <s> for sentence), but these are not prescriptive requirements. Hardie (2014:82-103) offers a simple user guide to creating well-formed XML annotation.

[2]Meyers (2009:110-111) offers an example pertinent to our research, highlighting the non-standardised practices within different approaches to identifying names.

[3] I came across a single example in our 85,223-word corpus where ‘standoff’ XML would have been preferable to accommodate overlapping attributes.

[4]Concordance lines are an output of corpus linguistic processing, whereby a list is produced of every occurrence of the search term (node) from within the corpus, together with its immediately preceding and succeeding text.



Baker, P. 2006. Using corpora in discourse analysis. London: Continuum.

Hardie, A. 2014. ‘Modest XML for Corpora: Not a Standard, but a Suggestion’. ICAME Journal, 38, pp. 73-103.

Leech, G. 1993. ‘Corpus Annotation Schemes’. Journal of the Association for Literary and Linguistic Computing, 8(4), pp. 275-281.

Lenzi, V.B., Moretti, G., and Sprugnoli, R. 2012. ‘CAT: The CELCT annotation tool’. Proceedings of Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 333-338.

Lu, X. 2014. Computational methods for corpus annotation and analysis. New York: Springer.

Meyers, A. 2009. ‘Compatibility between corpus annotation and its effect on computational linguistics’. In Baker, P. (ed.) Contemporary Corpus Linguistics. London: Continuum. pp. 105-124.

O’Donnell, M. 2008. ‘The UAM CorpusTool: Software for corpus annotation and exploration’. Proceedings of XXVI Congreso de AESLA, Almeria, Spain, 3-5 April 2008. http://citeseerx.ist.psu.edu/viewdoc/download?doi= Accessed 1 May 2019

O’Donnell, M. 2012. ‘APPRAISAL analysis and the computer’. Revista Canarias de Estudios Ingleses, 65, November 2012, pp. 115-130. https://riull.ull.es/xmlui/bitstream/handle/915/10724/RCEI_65_%28%202012%29_07.pdf?sequence=1&isAllowed=y Accessed 1 May 2019.

Potts, A. and Formato, F. (forthcoming, 2019). ‘Women victims of men who murder: XML mark-up for nomination, collocation and frequency analysis of language of the law’. In J. Baxter & J. Angouri (Eds.), Language, Gender, and Sexuality. London: Routledge.

R v Philpott, Philpott and Mosley. 2013. Nottingham Crown Court. https://www.judiciary.uk/judgments/r-v-philpott-others-sentencing-remarks/ Accessed 8 Apr 2019.

Thompson, H. and McKelvie, D. 1997. ‘Hyperlink semantics for standoff markup of read-only documents’. Proceedings of SGML Europe ’97: The next decade – Pushing the Envelope, Barcelona, 13-15 May 1997. pp. 227-229. http://www.ltg.ed.ac.uk/~ht/sgmleu97.html Accessed 1 May 2019.

Van Leeuwen, T. 2008. Discourse and practice: new tools for critical discourse analysis. Oxford: Oxford University Press.