At a time when we are seeing the proliferation of big data, of the databases that store it, and of the programs that allow its extraction, this second dossier of Biens symboliques / Symbolic Goods suggests we pause to reflect on the conception, maintenance, and use of databases designed for or by researchers who study print, literature, and their history. Whether they cover words, references, individuals, places, events, or material objects, databases have always existed in social science research and in the humanities. Over recent decades, however, the development of specialized computer programs has made it easier to create, use, and combine them. Moreover, in an increasing number of cases, the internet has meant that databases can now be consulted by anyone who wishes to delve into them.
Figure 1. Filing cabinet in a library
Licence CC 0 Creative Commons (source: Pixabay).
Databases are defined here as structured, consultable, and meaningful presentations of information that has been previously selected, categorized, and harmonized in preparation for scientific use. It is important to distinguish between two types of database in the humanities and social sciences: on the one hand, the construction of a repertoire or catalogue of information available for consultation and investigation; and on the other, the structuring of this information so that it might be used for statistical processing (from the simplest to the most complex). The first provides literary specialists with materials for exhibitions, conferences, or books: for instance, databases offering information on publications or authors, such as Unesco’s Index Translationum, an index of translated works, or Weblitaf, an inventory of Francophone African literature. Only the second type of database, however, incorporates quantitative reasoning and allows for original graphic representations of the data collected, using specifically adapted methods and tools. Conversely, the quantitative approaches that have enabled us to better understand the processes of production and reception of literature (for an overview, see Sapiro 2010) do not always require the production of databases, whether they involve counting diffuse items or manipulating already-existing figures (produced by professionals, for example). We note, however, that these quantitative approaches – including bibliometric ones, often drawing on databases – have more frequently been the work of historians, sociologists, or linguists than of literature specialists (Viala 1985; Vaillant 1992; Genet & Lafon 2003).
In literary studies more specifically, though, the role of databases and the methodological reflections associated with them were for a long time limited. The main reason for this lack of interest can be found in the realm of literary studies itself. As was long the case in art history (Joyeux-Prunel 2010), the study (or even simply the design) of databases was subject to the symbolic discredit to which quantitative approaches and their research tools often fall victim. This is also because the text, which was at the heart of traditional approaches to literature and of French “New Criticism” in the 1960s-1980s, is an analogue reality, requiring a continuous reading that varies according to the subjectivity, hermeneutic tools, and social characteristics of the reader. Its interpretation and comprehension are not traditionally associated with digitization – with being encoded and rendered comparable with other corpora. Textual analysis and textualism, which were dominant in literary studies worldwide between the post-war period and the late twentieth century, generally held back from using statistical tools or even large corpora, focusing instead on texts that were considered significant or representative.
Sociohistorical studies came relatively late to the digitization of their databases, having long preferred handwritten forms. Linguists and lexicologists, however, turned to computers as early as the 1950s to conduct statistical analyses of their corpora of phonemes, words, phrases, or portions of phrases, and thus to study overall coherence (by putting the texts’ words, expressions, themes, and so forth into alphabetical order, for example). Caught between linguistics and literature, textual statistics remained a relatively isolated branch of literary studies until quite recently (Delcourt 1995; Morin, Bosc, Hebrail, Lebart 2002).
In parallel, many studies in the history of the book and the sociology of literature have also turned to the collection and organization of data (biographical, bibliographical, contextual, or textual). Following Roger Chartier and Robert Darnton’s1 innovative research, the renewed history of the book has, for example, taken the material aspects of printed texts into account to gain a better understanding of their production and reception. The construction of collective biographies – as in the pioneering works by Rémy Ponton (1977), Christophe Charle (1979), and Alain Viala (1985) – shed new light on the spaces of positions and interrelations, even in times of crisis (Sapiro 1999). These studies also made possible the comparison of individual and collective social resources; revealed the existence of strategies; and analysed the social hierarchy of literary genres, all while shaking up the typical investigations of literary history. Other more recent studies based on this kind of prosopographical database have taken an interest in the role of cultural intermediaries and the means of obtaining literary acclaim (Dubois 2008; Dozo 2011; Ducournau 2017).
Like the history of publishing,2 the history of newspapers3 has broadly benefited from studies relying on databases to describe literary life through the processing of large-scale press corpora, or realia such as publishers’ and librarians’ catalogues: this contributed to the renewal of historiographic traditions which were for the most part monographic. Similarly, recent works have studied reading practices using databases of either textual records of the experience of reading4 or correspondence networks between men (and women) from the Enlightenment period.
Things changed in this area from the early twenty-first century. The shift can be partly explained by factors such as the emergence of studies involving increased collaboration between sociologists, historians, and literary specialists often situated on the fringes of the field and open to social science; the decline of the debate on literary theory on American and European campuses; the diversification of research objects; easier access to computers, digital tools, and the internet; as well as the tendency of funding bodies to prioritize pluri-disciplinary projects with quantitative and digital aspects. In social science and the humanities more broadly, increases in digital data and in its accessibility have led to disciplinary reorganizations promoting partnerships with information science (Ollion & Boelaert 2015). Over the last decade, we have therefore seen the emergence and institutionalization of the field known as “digital humanities,” which is perceived as a new epistemology for the study of written cultural heritage, resulting from the transformations that digital tools and methods have imposed on knowledge in the human sciences. With its rapid progress, this field was praised by some commentators as a genuine scientific revolution, while others have condemned it (in the Anglo-American context at least) as merely a fashionable label likely to encourage conservatism in university institutions by going, for instance, against the critical – whether Marxist, feminist, or postcolonial – analysis of literary texts (Allington, Brouillette, Golumbia 2016). To mention a few initiatives that claim the digital humanities label in Francophone literary studies, we have seen the emergence of the Observatoire de la vie littéraire (OBVIL) in France, whose website provides digital texts and transcripts alongside catalogues and archival sources. Quebec has the Ex situ, Littérature et Technologie au Québec laboratory, and the Canada Research Chair in Digital Art and Literature (ALN).
One of the most well-known and comprehensive of these collective undertakings is the Literary Lab founded by Franco Moretti at Stanford University (Moretti 2016). The resulting studies, published as pamphlets available online through open access, are based on computational analysis involving significant corpora (particularly on the nineteenth-century European novel), statistical analysis, and graphics (maps, graphs, mind maps). This dynamism has sparked similar projects, such as the Bodmer Lab at the University of Geneva.
Within this approach, however, formalist studies are the most prevalent, sometimes extending from the text itself to its avant-textes and subsequent editions. The results, from the perspective of textual interpretation, are sometimes situated within the continuity of traditional approaches, and sometimes in a clear and intentional break with them (Alexandre 2016). The monopoly of the very personal authority of literary critics has indeed waned, partly due to the possibility of calling on other mediators such as computer specialists, thereby transforming classical representations of the text, which has become a field for experimentation and hypothesis-testing. The availability of very substantial amounts of data makes it possible to work with less well-known texts as well as texts of the literary canon without directly endorsing the differences in legitimacy that exist between them. Distant reading, as defended by Franco Moretti, specifically seeks to compare the formal characteristics (stylistic, phonetic, and semantic regularities) of different types of text, which can level literary hierarchies, in order to reveal sometimes surprising proximities in terms of vocabulary (Moretti 2016), motifs, and narrative structures. Making literature thus “one discourse among others” (Gefen 2015: 7), subject to quantitative protocols using only textual data, should not obscure the power relations and social hierarchies that have historically prevailed in literary life (as elsewhere), and which at least partly determine these textual realities, as sociohistorical studies of literature show. Nor, of course, should this so-called computational turn expunge the need for patient interpretation of the mathematical and graphic results obtained, through the use of traditional interpretive methods, including attention to detail, assessment of primary sources, and attempts at contextualization.
Taking stock of the first results, we are forced to observe that the increased accessibility and quantity of digital data have made the need for methodological and epistemological vigilance all the more pressing (Ollion & Boelaert 2015). The production of this data – a process which is neither neutral nor transparent – is indeed often based on the “invisibility and obscuring of the operations upon which they depend, and of those who conduct them” (Jaton & Vinck 2016: 496), and on constraints and “considerable biases” (Gefen 2015: 3-4) which deserve to be studied and reflexively acknowledged. Yet, with the exception of some recent studies (Bernier & Couturier 2007; Vesna 2007; Flichy & Parasie 2013; Jaton & Vinck 2016), databases have remained relatively uninvestigated, both as a research practice and in the concrete methodological approaches they impose, particularly in literary studies, where they are mobilized in an increasing number of studies.
However, the presentation of results in the “digital humanities” often also includes the exposition, in a more narrative mode, of the different stages that led to these findings, as in the “log-book” proposed by Moretti (2016: 11-12). The publication of the volumes of La Vie littéraire au Québec, based on data presented in this dossier by Marie-Frédérique Desbiens and Chantal Savoie, is thus presented in terms of what made it possible: its collective development and evolution. As in Moretti’s research, which advocates “a team of five to six researchers” (2016: 8), the need for collective discussion and argumentation is accompanied by a division of labour that allows several directions to be pursued at once, which would be impossible for a researcher working alone.
This dossier therefore seeks to help break away from the idea of radical change that still nourishes certain contemporary debates, emphasizing instead the way in which sociohistorical studies of print and literature have traditionally constructed and used databases. This movement often goes hand in hand with the use of sociological concepts such as field and network, mobilizing (and sometimes revisiting) certain theoretical approaches. We can see this at work in the contribution by Marie-Frédérique Desbiens and Chantal Savoie, on the research methods used to assess the autonomisation of the literary field in Quebec. This is also the case for Florence Bonifay’s article, which questions the conditions for the application of this same concept to literary history. Through a rigorous deconstruction of a certain number of shared representations associated with the group – incorrectly – known as “the Pléiade,” Bonifay reveals the tensions that ran through the French poetry scene in the second half of the sixteenth century, as well as the importance of sociability and exchanges in the social construction of individual careers, measured in an innovative way here using a corpus of published texts.
As a result, for sociohistorical studies of print and literature, the use of such databases responds to challenges that are not only technical but also scientific, as fundamental as the production of evidence or the definition of objects and research questions (which the use of databases necessarily makes explicit in most cases). By shedding light on what happens behind the scenes when databases are mobilized, this dossier therefore explores how digital data – which have become essential in many studies – allow us to survey literary life (in terms of both measuring and re-evaluating it) in different periods. In this perspective, three lines of questioning are proposed: how databases are designed; how they are used; and how they are stored.
The apparently trivial methodological decisions involved in the construction of a database can lead to significant theoretical, and even ethical and epistemological, questions. A machine processes information that may, by its sheer quantity, seem more objective and more comprehensive than information that could be processed by the human mind; yet such a machine must conduct operations that still depend on choices and questions initially formulated by researchers. How, why, and by whom are such databases constructed? What relations, or even negotiations, are established between those (professional IT specialists) who design the infrastructure (sometimes by default) and the researchers who use it, connecting data sources or comparing them with other textual, iconographic, bibliographic, or archival data to produce new elements of knowledge and new research questions? Constructing tables, categories, and data sequences is never neutral and always depends on the situated choices of those who implement them (Hayles 2012). The abundance and diversity of available data on certain objects, notably due to the internet, can promote a desire to integrate an almost limitless collection of information. However, the quality of the sources and materials used in the construction of a database influences the result – just like ingredients in a recipe. It therefore appears important to respect the heterogeneity of sources and to allow them to be consulted (Lemercier & Zalc 2008: 50). Some data appear more reliable than others, because of recurrences, coherence, or formal regularity over the long term, for example. Conversely, data that may be, or may seem, insignificant may prove useful later. Where does one stop in the delimitation of a corpus, and how does one avoid its becoming arbitrary? How can we select relevant data while respecting an adequate legislative framework concerning intellectual work (such as legal deposit and copyright)?
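The recommendation to respect the heterogeneity of sources and keep them consultable can be made concrete in a minimal sketch. The Python code below, with entirely hypothetical records, field names, and coding table, shows a record structure that stores the raw value and its source alongside the harmonized category, so that the interpretive work of coding remains inspectable:

```python
from dataclasses import dataclass

# Hypothetical coding table mapping raw genre labels, as found in
# heterogeneous sources, onto a harmonized category.
GENRE_CODES = {
    "roman": "novel",
    "novel": "novel",
    "poésie": "poetry",
    "poetry": "poetry",
}

@dataclass
class Entry:
    title: str
    raw_genre: str   # the value exactly as it appears in the source
    source: str      # which catalogue or archive it came from
    genre: str = ""  # harmonized category, filled in by coding

def code_entry(entry: Entry) -> Entry:
    """Attach a harmonized genre while keeping the raw value consultable."""
    entry.genre = GENRE_CODES.get(entry.raw_genre.strip().lower(), "uncoded")
    return entry

e = code_entry(Entry("Les Misérables", "Roman", "catalogue_A"))
print(e.genre, "| raw:", e.raw_genre, "| from:", e.source)
# → novel | raw: Roman | from: catalogue_A
```

Because the raw label and its provenance survive alongside the coded value, a later researcher can always revisit or contest the coding decision rather than inherit it silently.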
Certain databases, such as Unesco’s Index Translationum or the Electre catalogue, need to be considered in the context of the conditions of their construction in order to identify the biases inherent in any given numerical result (for example, whether or not reprints are taken into account, variations in publication details from one country to another depending on cultural policy, and so forth). This is particularly relevant in cases where databases were conceived to respond to professional requirements rather than to research questions. The possibility of combining different sources and conducting internal comparisons seems fruitful in this sense, assuming a critical and reflective approach, as we can see in the recent studies led by Gisèle Sapiro on the flow of translations (Bokobza & Sapiro 2008). When it results from an individual project, the construction of a database involves initial difficulties and potentially discouraging diversions. Often extremely time-consuming, the design of vast databases generally relies on teams, which implies the distribution of tasks, from the apparently simplest (data entry) to the most significant (coding – which already constitutes interpretation – the construction of variables, and analysis). What are the specific challenges of this collective operation, which requires prior and often ongoing reflection at each stage (Genet 2002)? How can we reduce the leeway and subjective variations that are sometimes significant when these stages are conducted by different players with diverging interpretations (Merllié 1985)? Ideally, the division of labour should foster both erudition about the case under study in organizing the database and the critical distance necessary to compare different sources, identify duplicate entries, and put the importance of certain materials into perspective.
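The identification of duplicate entries when sources are combined can itself be partly operationalized, as in this sketch with invented records; note that the design of the matching key (here a crude author-title normalization) is itself one of the interpretive choices discussed above:

```python
import unicodedata
from collections import defaultdict

def matching_key(text: str) -> str:
    """Crude normalization for matching: strip accents, case, extra spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return " ".join(text.lower().split())

def find_duplicates(entries):
    """Group entries that share an (author, title) key across sources."""
    groups = defaultdict(list)
    for entry in entries:
        key = (matching_key(entry["author"]), matching_key(entry["title"]))
        groups[key].append(entry)
    return {key: group for key, group in groups.items() if len(group) > 1}

records = [
    {"author": "Du Bellay", "title": "Les Regrets", "source": "A"},
    {"author": "du bellay", "title": "Les  Regrets", "source": "B"},
    {"author": "Ronsard", "title": "Les Amours", "source": "A"},
]
for key, group in find_duplicates(records).items():
    print(key, "appears in sources:", [entry["source"] for entry in group])
```

Such a routine only flags candidate duplicates; deciding whether two near-identical entries really describe the same object (a reprint? a new edition?) remains the researcher's call.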
Finally, certain projects include a collaborative approach and encourage everyone to participate in enriching the database with their own contributions: what are the advantages and potential problems of this kind of approach, inscribed as it is in the open access movement?
Although stockpiles of potentially limitless information are fascinating, they do not in themselves constitute a guarantee of scientific rigour. The data assembled in a database can, under certain circumstances, be treated statistically with different methods, from cross tabulation to logistic regression, as well as various forms of visual representation. These quantitative processes, from the most basic counting to the most sophisticated statistical analysis, are generally coupled with assumptions, definitions and decisions that are qualitative, and which may involve a whole worldview (broad or narrow geographical or historical delimitations, from one-off events to the longue durée, harmonious or conflictual conceptions of social relations, and so forth). What does the use of algorithms and specialized data analysis programs mean for research in literary studies? What role is there for queries (whether simple or combined) which are often the main form of access to databases?
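As an illustration of the simplest of the quantitative processes just mentioned, a cross tabulation can be computed from coded records with standard-library Python alone; the records and categories below are invented for the example:

```python
from collections import Counter

# Hypothetical coded entries: (genre, decade of first publication).
entries = [
    ("novel", "1850s"), ("novel", "1850s"), ("poetry", "1850s"),
    ("novel", "1860s"), ("poetry", "1860s"), ("poetry", "1860s"),
]

# The cross tabulation is simply a count of each (genre, decade) pair.
crosstab = Counter(entries)

genres = sorted({genre for genre, _ in entries})
decades = sorted({decade for _, decade in entries})
print("genre    " + "  ".join(decades))
for genre in genres:
    counts = "  ".join(f"{crosstab[(genre, decade)]:>5}" for decade in decades)
    print(f"{genre:<9}{counts}")
```

Even this elementary table already embodies the qualitative decisions the text describes: which genres count as categories, and which temporal cut (here, the decade) delimits the periods.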
Beyond these technical parameters, the use of this kind of digital database can heuristically consolidate or shift the questions that are usually asked in literary history. What (re-)discoveries do these approaches enable? What readings of events, texts, or individual and group trajectories do they provoke or require? Under what conditions do they enable us to defend an argument, confirm an intuition, or suggest a new avenue for research? Although the construction of databases may seem laborious and thankless, can it not also be an opportunity for playful and original explorations, allowing us to consider in a new light some of the major notions in literary studies, such as intertextuality or literary events? From this perspective, it indeed becomes possible to redesign habitual chronologies by emphasizing certain – once imperceptible – demarcations in the most legitimate corpus; to reconstruct generations of authors based on biographical indicators such as the date of first publication; or to draw attention to populations that are absent from the canon – women, for example – as Florence Bonifay does here. But it may also be a matter of focusing on unfulfilled possibilities, or of re-examining the established distinctions between legitimate and popular, national and transnational, and so forth.
Although their construction often requires taking into account and interpreting empirical data (on literary producers and literary goods, localizations, recurring terms, and so forth) in connection with social and historical approaches to print and literature, databases also allow us to engage in dialogue with linguistics, economics, or geography, for example. To what extent do they require or facilitate inter-discursive or interdisciplinary perspectives (between social science, history, language sciences, and art history)?
In the same way that old handwritten forms may end up as a mere pile of meaningless paper after the retirement or death of their creator(s), digital databases may become unusable when there are no machines to read them or, worse, disappear entirely once a research project is finished. What longevity do databases have once their initial purpose has been served? How can they be maintained as relevant tools? Can we consider testing them, or cleaning the data, for example? How can we make them interoperable, or preserve them in order to enable new explorations comparing several databases? Some individual researchers never make the databases they constructed and exploited for their research accessible to others, out of a desire for control or due to lack of time and information. However, significant progress has been made in this area, at least in France, with the development of the very large research infrastructure known as Huma-Num. The cost of accessing assembled data (the exploitation, storage, and preservation of which may depend on having the digital tools and computer skills required to operate them) is indeed likely to reinforce the structural inequalities specific to the world of research.