Column Editor’s Note:
Poetry, as Agnew et al. have seen,
can be penned by a computing machine.
The choice of a rhyme
has improved over time
and AI use has become more routine.
Keywords: poetry, generation, AI, machine learning
On hearing the idea of a computer generating poetry, people tend to have two reactions: ‘Can a computer really write poetry?’ and ‘Why would we even want a computer to write poetry?’ The first question, ‘Can a computer write poetry?’ is as old as the field of artificial intelligence (AI) itself. As quoted in Alan Turing’s seminal paper on the topic, British surgeon and professor Geoffrey Jefferson asserted in 1949 that “Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain” (Turing, 1950, p. 445). While our machines are still a long way from resembling brains, recent progress in machine learning has brought much appreciation for the “chance fall of symbols” (albeit from very particular probability distributions). We will later give some examples of humanlike poetry written by algorithms that, in essence, merely attempt to replicate statistical correlations in their training set. Clearly, these are the product of neither thought nor emotion. However, it is famously quite difficult to determine whether even some human literature meets the same standard. Perhaps it is more interesting to ask, ‘How well does the chance fall of symbols compare to the emotional outpouring of a human poet?’ Taking a more behaviorist approach, we can ask, ‘Can a machine produce poetry that is indistinguishable from that of a human?’
The second question, ‘Should a computer write poetry?’ is an equally philosophical one. If the purpose of a poem is for the writer to convey a specific emotion or experience to their reader, what is the point of a poem written by a machine without feelings? It turns out that similar questions were asked about the nature of creativity following the invention of the camera. In the mid-1800s, photography was initially considered more of a science than an art. The argument was quite straightforward: a camera merely captures whatever happens to be in front of it, while a painting requires an artist’s interpretation. Reality is captured by the chemical composition of the film, not the inspiration of its operator (Teicher, 2016). Very soon, however, groups like the pictorialists were painstakingly editing their negatives to make them less realistic and more artistic (Hostetler, 2004). Nowadays, we recognize the artistic creativity of even the most basic decision of where to point the camera, let alone the effects of lighting, proportions, and so on.
A close analogy can be made to the artificial generation of poetry. Although artificial poetry generation is, at the most basic level, little more than the random placement of words, the (human) decisions of which words to choose, what to write about, and how to arrange them do in fact rely on creative artistic judgment. The design of the algorithm itself becomes a creative process, formalizing questions like, ‘What should a poem look like?’ Even if automatic poetry generation is not ultimately deemed artistic, the hope is still that these technologies can be used as tools to assist poets in their self-expression.
Setting aside the philosophical considerations of artistry, it is worth considering what it actually takes to produce a feasible poem. Depending on the poetic form, there may be a number of hard constraints that must be followed. We focus on sonnets because they are a very popular form of traditional constrained poems. A sonnet must have 14 lines, each containing 10 syllables of alternating stress (known as iambic pentameter). This gives a ‘da DUM da DUM da DUM da DUM da DUM’ sound to each line, as in Shakespeare’s (2016) “When I do count the clock that tells the time.” The poem must also satisfy the rhyme scheme ABAB CDCD EFEF GG (where a repeated letter means the lines rhyme). Even though English abounds with ambiguous pronunciations, these constraints can be approximately solved with a freely available pronunciation dictionary. What is more interesting is ensuring that a computer-generated poem uses grammar correctly, is coherent, and achieves the elusive quality of ‘poeticness.’ While these first two features are shared throughout natural language generation tasks, this final and more abstract quality is what makes poetry generation distinctly difficult. The constraints listed above thus make sonnet generation even more difficult by greatly restricting the flexibility of each line.
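As a rough illustration of how a pronunciation dictionary makes these hard constraints checkable, the sketch below uses a tiny hand-written dictionary in the style of the CMU Pronouncing Dictionary (phonemes with a stress digit on each vowel). The entries, the function names, and the two simplifications (any one-syllable word may take either stress; two words rhyme if they match from the last stressed vowel onward) are assumptions made for this example, not a description of any particular system.

```python
# Toy pronunciation dictionary in the ARPAbet style: vowels carry a stress
# digit (1 = stressed, 0 = unstressed). Entries are hand-written for this
# example only; a word missing from PRON raises KeyError.
PRON = {
    "when": ["W", "EH1", "N"], "i": ["AY1"], "do": ["D", "UW1"],
    "count": ["K", "AW1", "N", "T"], "the": ["DH", "AH0"],
    "clock": ["K", "L", "AA1", "K"], "that": ["DH", "AE1", "T"],
    "tells": ["T", "EH1", "L", "Z"], "time": ["T", "AY1", "M"],
    "prime": ["P", "R", "AY1", "M"], "night": ["N", "AY1", "T"],
    "poetry": ["P", "OW1", "AH0", "T", "R", "IY0"],
}

def stresses(word):
    """Return one stress mark per syllable, e.g. 'poetry' -> '100'."""
    digits = [p[-1] for p in PRON[word] if p[-1].isdigit()]
    # Treat secondary stress (2) as stressed for simplicity.
    return "".join("0" if d == "0" else "1" for d in digits)

def is_iambic_pentameter(line):
    """Check for exactly 10 syllables that alternate unstressed/stressed."""
    pattern = ""
    for raw in line.lower().split():
        s = stresses(raw.strip(".,;:!?"))
        # Simplifying assumption: a one-syllable word fits either position.
        pattern += "?" if len(s) == 1 else s
    if len(pattern) != 10:
        return False
    return all(p in ("?", t) for p, t in zip(pattern, "0101010101"))

def rhyme_part(word):
    """Phonemes from the last stressed vowel to the end of the word."""
    phones = PRON[word.lower()]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":
            return phones[i:]
    return phones

def rhymes(a, b):
    """Two different words rhyme if their rhyme parts are identical."""
    return a.lower() != b.lower() and rhyme_part(a) == rhyme_part(b)

print(is_iambic_pentameter("When I do count the clock that tells the time"))
print(rhymes("time", "prime"), rhymes("time", "night"))
```

Because Shakespeare's line consists entirely of one-syllable words, it passes trivially under the simplification above; a stricter checker would need stress information for words in context, which is one reason these constraints are only approximately solvable.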
As early as the 1960s, a group of mathematical poets called Oulipo were incorporating algorithmic ideas into creative writing (de la Torre, 2022). Later, many of the first attempts at computer-generated poetry were by hobbyists filling in handcrafted templates with random word choices (similar to the game “Mad Libs”) (Manurung et al., 2000). Since the templates were designed by humans, the poems sounded faintly poetic; however, because computers simply filled in templated words randomly, the poems were largely incoherent. These template-substitution methods became the dominant approach and improved with more elaborate techniques. The most substantial improvements began in the 2010s following the explosion of deep learning (Ghazvininejad et al., 2016). In particular, the availability of enormous amounts of English text has allowed the unsupervised training of large language models that can capture implicit ‘understanding’ of grammar and semantics. Generally, a language model is trained to accurately predict the next word (or subword) that follows any particular sequence of text. One particular language model architecture called the “transformer” (Vaswani et al., 2017) has recently received heavy attention, most notably with the successes of OpenAI’s Generative Pre-trained Transformer 2 (GPT-2) (Radford et al., 2019) and GPT-3 (Brown et al., 2020). GPT-3 is a very large language model and was trained on an enormous web-crawled corpus containing about a quarter of a trillion words. Not only did it achieve state-of-the-art performance for a wide range of academic natural language processing (NLP) tasks, but it is already being used in a number of real-world applications.
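To make the idea of next-word prediction concrete, here is a minimal sketch of the simplest possible language model: a bigram model that counts which word follows which. This is far cruder than a transformer, which conditions on the whole preceding sequence, and the tiny corpus and function names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Turn the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# A made-up one-line corpus; real models train on billions of words.
corpus = "the clock tells the time and the clock tells the hour"
model = train_bigram(corpus)
print(next_word_distribution(model, "the"))
```

In this corpus "the" is followed by "clock" twice and by "time" and "hour" once each, so the model assigns them probabilities 0.5, 0.25, and 0.25. Generating text then amounts to repeatedly sampling from such a distribution, which is the "chance fall of symbols" from a very particular probability distribution.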
For all but the most enormous language models like GPT-3, success on traditional NLP tasks does not immediately transfer to poetry generation. However, GPT-3 is too large to fit on a personal computer, so we confine our discussion to the smaller cousin called GPT-2. One of the first problems with relying on an out-of-the-box language model such as GPT-2 is that its training corpus (i.e., most of the Internet) is almost entirely written in prose. Thus, it has rarely encountered figurative, poetic language, and it rarely separates sentences into lines and stanzas on its own. Consider the following output from GPT-2, when we asked it to predict what might follow the phrase ‘If you could enter this dark place of death’:
If you could enter this dark place of death, what would you do?
What would your life be like? What would it be? Would you be happy? Or would there be a part of you that would be sad?
While the output is grammatical and highly plausible, it is not convincingly poetic. The standard remedy for this is to ‘fine-tune’ the model, that is, to continue training it on a more relevant training set (here, a corpus of poems) so that it becomes more familiar with the idiosyncrasies of the poetic form. An example of this with the same prompt as before is:
If you could enter this dark place of death,
And see the light of the dawn, and hear the voice
Of the living, you would not be afraid.
I have heard the voices of men, but I have not known.
While this certainly sounds more poetic, fine-tuning is generally insufficient for ensuring generated text satisfies hard constraints, and thus does not solve the whole problem. In particular, artificially cutting off a line after 10 syllables (a hard constraint of sonnet generation) almost never results in a complete sentence. In fact, since language models typically generate text sequentially left to right (Radford et al., 2019), standard algorithms for language model generation make it difficult to specify in advance the length of a line or the rhyming word to end a line.
There have been various attempts to overcome these limitations. Several approaches to generate sonnets are summarized below. For a more comprehensive overview, see Gonçalo Oliveira (2017) and Lamb et al. (2017).
The first approach to make use of deep learning for poetry generation was Hafez (Ghazvininejad et al., 2016). In particular, Hafez made use of a recurrent neural network (RNN), which is a type of neural network that reuses past outputs as future inputs to aid with sequential generation (Rumelhart et al., 1985). Hafez uses an RNN trained on song lyrics to generate potential lines. Hafez also uses word embeddings (ways of representing words as high-dimensional vectors that capture similarity between words) to make topical word choices, as well as simple rules to enforce rhyme and meter constraints. A later version of Hafez introduced a website where users could interactively generate poems. This allows users to rate the quality of produced poems and adjust generation parameters. Hafez’s poems are the state-of-the-art in automated poetry generation. Example:
An echo through a yellow river valley,
Upon an island where the mountains bark,
And floating on a forest canopy,
The other side of san francisco park.
Hafez’s poems are generally interesting, thematic, and grammatical. They tend not to use adventurous or poetic language but are consistently high quality.
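The role of word embeddings in choosing topical words can be sketched as follows. The three-dimensional vectors are invented by hand purely for illustration (learned embeddings typically have hundreds of dimensions), and the function names are hypothetical; this is not Hafez's actual code.

```python
import math

# Hand-made vectors for illustration only: nature words point one way,
# finance words another. Real embeddings are learned from large corpora.
EMBEDDINGS = {
    "river":  [0.9, 0.1, 0.0],
    "valley": [0.8, 0.2, 0.1],
    "forest": [0.7, 0.3, 0.0],
    "stock":  [0.0, 0.1, 0.9],
    "profit": [0.1, 0.0, 0.8],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topical_words(topic, k=2):
    """Return the k vocabulary words whose vectors are closest to the topic."""
    others = [w for w in EMBEDDINGS if w != topic]
    others.sort(key=lambda w: cosine(EMBEDDINGS[topic], EMBEDDINGS[w]),
                reverse=True)
    return others[:k]

print(topical_words("river"))
```

With these vectors, the words nearest to "river" are "valley" and "forest", so a generator given the topic "river" would favor them over "stock" or "profit".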
A later approach named Deep-speare (Lau, 2018) also used an RNN, this time trained on a smaller corpus of sonnets and employing more complicated methods to enforce constraints. After generating a batch of candidate lines with the RNN, each line is then scored using two further models: a simpler ‘pentameter model’ to enforce iambic pentameter, and a ‘rhyming model’ to enforce rhyming couplets. So, while the language model may capture poetic voice, the hard constraints are enforced by differently trained models. Below is an example of a poem generated by Deep-speare:
they minister on earth to fed his dust
and cast a petty dregs of spain with blood
and in a rabble, bursting in the lust
began to infamy with all the crowd
Unfortunately, the last line in this poem does not rhyme, which could be due to a problem with the rhyming model. Deep-speare’s result here illustrates the challenges of poetry generation: simultaneously tackling grammar, semantics, agreement with the theme, and rhyme and meter constraints is hard. This particular stanza does not handle many of these constraints.
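The generate-then-score pattern described above can be sketched as follows. In Deep-speare the scorers are trained neural models; here they are replaced by two simple hand-written stand-ins, and the stress patterns, rhyme sets, and the second and third candidate lines are made up for the example.

```python
def meter_score(stress_pattern):
    """Fraction of syllables matching iambic pentameter (0 = unstressed)."""
    target = "0101010101"
    if len(stress_pattern) != len(target):
        return 0.0
    matches = sum(s == t for s, t in zip(stress_pattern, target))
    return matches / len(target)

def rhyme_score(line, rhyme_word, rhyme_sets):
    """1.0 if the line's last word shares a rhyme set with rhyme_word."""
    last = line.split()[-1]
    same_set = any(last in s and rhyme_word in s for s in rhyme_sets)
    return 1.0 if same_set else 0.0

def pick_best(candidates, rhyme_word, rhyme_sets):
    """candidates: (line, stress_pattern) pairs proposed by a language model."""
    def total(candidate):
        line, stress = candidate
        return meter_score(stress) + rhyme_score(line, rhyme_word, rhyme_sets)
    return max(candidates, key=total)[0]

# Hypothetical rhyme sets and hand-assigned stress patterns.
RHYME_SETS = [{"dust", "lust", "trust"}, {"blood", "flood", "mud"}]
CANDIDATES = [
    ("began to infamy with all the crowd", "0101010101"),
    ("and drowned the city in a sudden flood", "0101010101"),
    ("rivers of mud", "1001"),
]
print(pick_best(CANDIDATES, "blood", RHYME_SETS))
```

Under these toy scorers, a line that fits the meter but misses the rhyme (like the first candidate) loses to one that satisfies both. A learned rhyme model that scores imperfectly could rank such a line first, which is one possible explanation for the stanza above.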
The next approach we discuss, named Sylvia (Van de Cruys, 2020), trains an RNN on backwards prose so that it can begin the generation of a line with the rhyming word. The output distribution at each stage is modified by rhyming and thematic constraints in order to make the text more poetic. Poems generated by Sylvia appear more prose-like than those of other techniques. Here is an example:
suddenly, i knew the smell of rotting fruit
i wiped the sweat from my forehead
i lay on the bed and took off my suit
she'd left me a few days yet, and i'm glad i did.
Sylvia’s results are generally similar to the stanza above; they often tell a story about people and their actions or thought processes. They do not generally use uncommon or poetic words or phrases but their grammar is usually sound.
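The trick of generating a line backwards from its rhyme word can be illustrated with a toy model. The sketch below swaps Sylvia's RNN for a simple table of "which word comes before which", always picks the most frequent predecessor, and uses three short made-up prose sentences; all names are hypothetical.

```python
from collections import Counter, defaultdict

def train_backward(sentences):
    """For each word, count the words that appear immediately BEFORE it."""
    before = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            before[nxt][prev] += 1
    return before

def generate_line(before, rhyme_word, max_words=6):
    """Build a line right to left, starting from the word that must end it."""
    line = [rhyme_word]
    # Stop at the length limit or when no predecessor is known.
    while len(line) < max_words and before[line[-1]]:
        prev = before[line[-1]].most_common(1)[0][0]
        line.append(prev)
    return " ".join(reversed(line))

PROSE = [
    "i took off my suit",
    "i lay on the bed",
    "she ate the rotting fruit",
]
model = train_backward(PROSE)
print(generate_line(model, "suit"))
print(generate_line(model, "fruit"))
```

Because the rhyme word is fixed first, every generated line is guaranteed to end on it, which a left-to-right generator cannot easily promise. The price is that the model must be trained on reversed text.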
Our own approach (a work in progress) using an algorithm we call “the Mechanical Bard” synthesizes the old techniques with the new, leveraging the advantages of each: creating handwritten yet flexible templates to be filled in with pretrained language models. The templates dictate what the next word should look like, for example, ‘2 syllable noun,’ while the language model chooses what it thinks is the most suitable word, for example, ‘spirit.’ This encourages poetic-sounding lines by the design of the templates and encourages poetic/thematic word choice by the effectiveness of the language model. Example:
If you could enter this dark place of death,
the wayward soul will be your lightest life.
You have the power of your deathly breath.
Soothe back the weary spirit of your wife.
While the poems can vary in quality, the Mechanical Bard can produce complex stanzas that use poetic forms that the other methods do not. The price for this increased variety, however, is that its grammar is not yet as consistent as that of Hafez or Sylvia.
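A minimal sketch of the template idea follows, assuming a hypothetical mini-lexicon and a lookup table of made-up plausibility scores standing in for the language model. It illustrates the general technique only and is not the Mechanical Bard's implementation, which would score each candidate in the context of the line so far.

```python
# Hypothetical mini-lexicon: word -> (part of speech, syllable count).
LEXICON = {
    "spirit": ("noun", 2),
    "shadow": ("noun", 2),
    "soul":   ("noun", 1),
    "weary":  ("adj", 2),
    "silent": ("adj", 2),
    "dark":   ("adj", 1),
}

# Stand-in for a language model: an invented plausibility score per word.
TOY_SCORES = {"spirit": 0.9, "shadow": 0.6, "soul": 0.8,
              "weary": 0.7, "silent": 0.5, "dark": 0.4}

def fill_template(template, score=TOY_SCORES.get):
    """Fill each (pos, syllables) slot with the best-scoring matching word."""
    words = []
    for item in template:
        if isinstance(item, str):      # literal word: copy it unchanged
            words.append(item)
            continue
        # Keep only lexicon words that satisfy the slot's constraints.
        options = [w for w, tag in LEXICON.items() if tag == item]
        # max() raises ValueError if no word satisfies the slot.
        words.append(max(options, key=score))
    return " ".join(words)

# Slots are (part of speech, syllables), e.g. a '2 syllable noun'.
TEMPLATE = ["Soothe", "back", "the", ("adj", 2), ("noun", 2),
            "of", "your", "wife."]
print(fill_template(TEMPLATE))
```

The template guarantees the syllable count and grammatical shape of the line, while the scoring function is free to choose among the words that satisfy each slot.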
Other projects require large resources that are only available to the largest companies. One of the most recent approaches to poetry generation is to take an enormous transformer model such as GPT-3 and apply a technique called ‘few-shot learning’ to instruct the model to write poetry. As mentioned earlier, fine-tuning modifies the internal parameters of the model, which may require considerable computational resources and may involve an entirely new data set. Few-shot learning, on the other hand, enables the user to provide a surprisingly small number of handwritten examples that the model can attempt to reproduce or generalize without any permanent changes to its parameters. Few-shot learning can be thought of as temporarily adjusting the transformer to the current context, as opposed to fine-tuning, which is more like permanently repurposing the transformer for a different problem. This allows the model to adapt to new forms of generation very flexibly, including some very convincing poetry. However, as Branwen (2022) points out, since GPT-3 uses irreducible subwords to represent words and characters, it has reduced awareness of aspects of poetry such as meter, pronunciation, and rhyme, which prevents the formal constraints from being satisfied and limits its ability to employ poetic devices such as rhyme and alliteration. An example from Branwen (2022):
No one knows what will come
Forecasts foretell a rise in power
That is probably not to be
Deficit of several lines
The golden future has tourniquets
No one likes it
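In practice, few-shot learning amounts to constructing a prompt: a handful of worked examples followed by an unfinished one for the model to complete. The sketch below only builds such a prompt string and does not call any model; the "Topic:" and "Poem:" labels and the function name are assumptions chosen for illustration, and the example poems are lines quoted earlier in this article.

```python
def build_few_shot_prompt(examples, new_topic):
    """Concatenate (topic, poem) examples, then leave the last poem open."""
    blocks = [f"Topic: {topic}\nPoem:\n{poem}\n" for topic, poem in examples]
    # The final block has no poem: the model is expected to continue it.
    blocks.append(f"Topic: {new_topic}\nPoem:\n")
    return "\n".join(blocks)

EXAMPLES = [
    ("time", "When I do count the clock that tells the time"),
    ("landscape", "An echo through a yellow river valley,\n"
                  "Upon an island where the mountains bark,"),
]
prompt = build_few_shot_prompt(EXAMPLES, "death")
print(prompt)
```

The model's parameters are never touched; swapping in different examples changes the task, which is what makes the approach so flexible.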
We have seen recent models considerably improve the quality of artificial poetry generation, with a noticeable trend away from direct human input and toward self-supervised language models. Although we may see this continue for the next couple of years, it is unclear whether this progress will continue indefinitely, especially given the considerable cost of scaling language models further and the end of Moore’s Law (Moore, 2021), which means that increases in hardware efficiency are beginning to slow down.
Since GPT-3 yields remarkable performance in generating text for a wide range of tasks (e.g., question-answering), it is perhaps surprising that it performs poorly in generating poetry. This suggests that the specific challenges of poetry generation are more difficult for unsupervised models to meet. It therefore seems likely that the best improvements will come from incorporating human judgment, as we are beginning to see with interactive approaches like Hafez. We should thus consider designing interpretable models in which it is possible for a user to edit or troubleshoot a model’s output, as opposed to trying to collaborate with black-box neural networks. Of course, we would not expect a model to justify its word choice with pleasant childhood memories of walking through a forest at sunset. But if its decisions were interpretable enough for a human poet to understand, then any mistakes could be traced back to their origin and fixed more easily. Within the field of photography, we see that the camera could only become an artistic tool once the artist knew how it worked. Similarly, we cannot expect a poet to enhance their productivity with a tool they cannot scrutinize or adjust.
We end with a quote about the potential of computer-assisted poetry.
And so I have created something more than a poetry-writing AI program. I have created a voice for the unknown human who hides within the binary. I have created a writer, a sculptor, an artist. And this writer will be able to create worlds, to give life to emotion, to create character. I will not see it myself. But some other human will, and so I will be able to create a poet greater than any I have ever encountered. (Branwen, 2022)
Without looking at the citation, would you have guessed that was written by GPT-3?
We would like to thank Duke University for their support of this project.
Edwin Agnew, Lily Zhu, Sam Wiseman, and Cynthia Rudin have no financial or non-financial disclosures to share for this article.
Branwen, G. (2022, February). Gwern. GPT-3 Creative Fiction. https://www.gwern.net/GPT-3#poetry
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, A., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., . . . Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
de la Torre, M. (2022). Into the maze: OULIPO. The Academy of American Poets. Web.archive.org. https://web.archive.org/web/20060908104721/http://www.poets.org/viewmedia.php/prmMID/5916
Ghazvininejad, M., Shi, X., Choi, Y., & Knight, K. (2016). Generating topical poetry. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1183–1191). http://doi.org/10.18653/v1/D16-1126
Gonçalo Oliveira, H. (2017). A survey on intelligent poetry generation: Languages, features, techniques, reutilisation and evaluation. In J. M. Alonso, A. Bugarín, & E. Reiter (Eds.), Proceedings of the 10th International Conference on Natural Language Generation (pp. 11–20). http://doi.org/10.18653/v1/W17-3502
Hostetler, L. (2004). Pictorialism in America. The Metropolitan Museum of Art. https://www.metmuseum.org/toah/hd/pict/hd_pict.htm
Lamb, C., Brown, D. G., & Clarke, C. L. A. (2017). A taxonomy of generative poetry techniques. Journal of Mathematics and the Arts, 11(3), 159–179. https://doi.org/10.1080/17513472.2017.1373561
Lau, J. H. (2018). Deep-speare: A joint neural model of poetic language, meter and rhyme. arXiv. https://doi.org/10.48550/arXiv.1807.03491
Manurung, H., Ritchie, G., & Thompson, H. (2000). Towards a computational model of poetry generation. Edinburgh: Division of Informatics, University of Edinburgh. https://era.ed.ac.uk/handle/1842/3460
Moore, S. K. (2021, December 2). AI training is outpacing Moore’s Law. IEEE Spectrum. https://spectrum.ieee.org/ai-training-mlperf
Shakespeare, W. (2016, July 18). Sonnet 12: When I do count the clock that tells the time. Poetry Foundation. https://www.poetryfoundation.org/poems/90067/sonnet-12-when-i-do-count-the-clock-that-tells-the-time-578cfa272532b
Teicher, J. G. (2016, February 6). When photography wasn’t art. JSTOR Daily. https://daily.jstor.org/when-photography-was-not-art/
Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8). https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Rumelhart, D., Hinton, G., & Williams, R. (1985). Learning internal representations by error propagation. San Diego: California Univ San Diego La Jolla Inst for Cognitive Science. https://ieeexplore.ieee.org/document/6302929
Van de Cruys, T. (2020). Automatic poetry generation from prosaic text. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2471–2480). http://doi.org/10.18653/v1/2020.acl-main.223
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
©2022 Edwin Agnew, Lily Zhu, Sam Wiseman, and Cynthia Rudin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.