Current biological writing is afflicted by the use of ambiguous names, convoluted sentences, vague statements and narrative-fitted storylines. This represents a challenge for biological research in general and in particular for fields such as biological database curation and text mining, which have been tasked to cope with exponentially growing content. Improving the quality of biological writing by encouraging unambiguity and precision would foster expository discipline and machine reasoning. More specifically, the routine inclusion of formal languages in biological writing would improve our ability to describe, compile and model biology.
[…] language is just as indispensable a tool for the pursuit of biology as microscopes, kymographs and other instruments (46)
The primary way to describe biology is still document-centric natural language (1). Language is, therefore, fundamental to the development of biological research. Improving its quality to encourage unambiguity and precision fosters expository discipline and machine reasoning (2, 3) while striving towards Boole’s ideal of a language ‘[…] freed from idioms and divested of superfluity, […] in a manner the most simple and literal […]’ (4). This includes the use of standard nomenclatures, identifiers and reference databases, clear factual statements and computational or symbolic languages, i.e. formal languages.
The history of attempts to improve biological writings in such a way starts, at least, with the creation of standard nomenclatures for species dating back to Linnaeus in the 18th century (5) and Woodger’s 1929 critique of the language used in biology (6, 7). Such attempts have had limited impact. Biological documents still contain many ambiguous names, convoluted sentences, vague statements and narrative-fitted storylines (8–11).
The need for better writing is, nonetheless, increasing, because the number of biology-related documents, such as scientific articles, patents and grants, keeps growing exponentially (12), as noticed even by the public during the COVID-19 pandemic (13, 14). Currently proposed solutions to cope with this growth appear to be insufficient. First, there is a scarcity of accessible and structured biological data derived from these documents. Biological databases, which are primary repositories for such data, are not growing to match current needs (15, 16), and the sustainability of their business model has been questioned (17–22).
Second, text mining is not, at the moment, a sufficient solution for the extraction of structured data from text. Text mining arguably took off in the late 1990s with the release of PubMed (23) and then went through a period of stagnation in performance benchmarks until recent advances in natural language processing (NLP). Although new NLP algorithms can master general linguistic tasks with greater ability than non-specialist college-educated humans, past community challenges (28) show that, with the exception of certain niche relation types, they cannot yet extract complex biological relations (24–27) with acceptable performance. Crucially, complex relations, and associated contextual information, play a large role in the description of biological processes (29, 30).
Moreover, NLP algorithms have also lagged in tasks for which a certain level of factual knowledge is necessary, such as open-domain question-answering (31, 32). Knowledge graphs, which compile and organize knowledge of the world (33, 34), have been used to enrich NLP algorithms, powering them to state-of-the-art performance in both linguistic and factual applications (35–39).
Knowledge graphs can be partially created automatically but, in order to increase and maintain their quality, they need manually curated data (40), which can also be introduced through semi-automatic curation workflows based on artificial intelligence (AI) algorithms (41, 42). The increased use of knowledge graphs, including by companies such as Google and Meta, shows that improvements in NLP have not decreased, and one could say have fed, the need for duly compiled, manually extracted knowledge. Thus, and perhaps counterintuitively, a golden era for NLP, and for AI in general, has been paralleled by growth in the use of knowledge graphs.
Improving biological writing has been recognized as one way to address the bottlenecks in the extraction of biological data from text. Recently, tips for scientific authors to make their articles more ‘text-mining ready’ have been proposed (43), and there has been yet another call for the use of standard names in biological texts, in this case for gene products (44). There is, however, a need to improve biological writing that goes beyond the still-insufficient adoption of standard terminologies and text-mining-ready writing tips, specifically through the adoption of formal languages, such as Biological Expression Language (BEL) (2, 45), as part of regular writing practice.
Content written in a formal language, such as that related to protein interactions, phylogenies, drug–disease interactions or post-translational modifications, could be embedded in-line within documents or in tables, metadata, equations or supplementary files or directly submitted to databases. Chemistry provides examples of how this could look in practice (46). A realization of the limitations of natural language and alchemical symbols (47) led to multiple successful initiatives in the 19th and early 20th centuries on the subject of standard nomenclatures, formulae and equations. Because of this, the text mining of chemical names is easier than the text mining of, for instance, gene names (48). Incidentally, these efforts were originally inspired by Linnaeus’s work in biology (49).
Within biology, the field of systems biology has also had a strong interest in the use of formal languages (50), such as SBGN (51) or BioPAX (52). The latter particularly describes signalling pathways and, unlike BEL, represents direct biological mechanisms with a higher degree of granularity and complexity. Signalling pathways represented in a formal language offer a stark contrast with the unsystematic way in which they are described in biological writings (53).
Imagine, for instance, if the phrase ‘TNF activates SYK’ (54) were written as ‘TNF activates SYK (p(HGNC:TNF) -> act(p(HGNC:SYK)))’, using, in this case, the BEL language inside parentheses. This type of content could easily be extracted and would provide a source of readily available knowledge that would help improve the yield of database curation and the performance of AI/NLP algorithms. The ultimate goal would not be to improve AI/NLP algorithms or curation for their own sake but to improve our ability to describe, compile and model biology. For authors, this could also increase the visibility and impact of their work (55).
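To illustrate how easily such in-line content could be recovered, the following is a minimal Python sketch, not a full BEL parser. It assumes that each statement is enclosed in one balanced pair of parentheses and has the simple form ‘subject relation object’. The names `RELATIONS`, `BelTriple`, `parenthesized_spans`, `parse_statement` and `extract_bel` are hypothetical helpers introduced for this illustration, and only a small subset of BEL short-form relations is handled.

```python
from typing import List, NamedTuple, Optional

# Short-form BEL relations recognized by this sketch (an assumed, partial subset).
RELATIONS = {
    "->": "increases",
    "-|": "decreases",
    "=>": "directlyIncreases",
    "=|": "directlyDecreases",
}


class BelTriple(NamedTuple):
    subject: str
    relation: str
    obj: str


def parenthesized_spans(text: str) -> List[str]:
    """Return the contents of every top-level balanced (...) group in text."""
    spans: List[str] = []
    depth, start = 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i])
    return spans


def parse_statement(candidate: str) -> Optional[BelTriple]:
    """Split 'subject REL object' at the first relation found outside any parentheses."""
    depth = 0
    for i, ch in enumerate(candidate):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            for symbol, name in RELATIONS.items():
                # Relations are expected to be surrounded by single spaces.
                if candidate.startswith(f" {symbol} ", i):
                    subject = candidate[:i].strip()
                    obj = candidate[i + len(symbol) + 2:].strip()
                    if subject and obj:
                        return BelTriple(subject, name, obj)
    return None


def extract_bel(text: str) -> List[BelTriple]:
    """Collect every parenthesized span that parses as a simple BEL statement."""
    triples = []
    for span in parenthesized_spans(text):
        triple = parse_statement(span)
        if triple is not None:
            triples.append(triple)
    return triples


sentence = "TNF activates SYK (p(HGNC:TNF) -> act(p(HGNC:SYK))) (54)."
for triple in extract_bel(sentence):
    print(triple.subject, triple.relation, triple.obj)
```

In this sketch, ordinary parentheticals such as the citation number are skipped because they contain no relation symbol. A production workflow would more likely rely on a dedicated BEL parser and would validate namespaces such as HGNC against their reference databases.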
Formal languages should not be seen as computational biology any more than chemistry formulae are computational chemistry. Biology students can get acquainted (56) with ways to apply standard nomenclatures, write clearer factual statements and integrate formal languages in their writing. In the end, better biological writing would help both biologists and algorithms.
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflict of interest
The author is an employee of F. Hoffmann-La Roche Ltd.