These questions are important to understanding the processes underpinning the social-semiotic function of language: how speech communicates information about who is talking, as well as what they are saying. In my dissertation I explore the above questions by combining experimental methods (which capture listeners’ intuitions about the social meaning of language variation) with phonetic analysis (which allows me to understand how their intuitions relate to the social distribution of language variation in production).
My dissertation research focuses on a pair of vowels in the speech of York, Northern England: the vowel in goose (/u/) and the vowel in goat (/o/). These vowels are interesting because they can vary in a lot of ways. In particular, they can be produced with different tongue positions (more front or back), or with more or less movement as the vowel is produced (either monophthongal or diphthongal).
The first question a sociolinguist might ask about these vowels is how these different pronunciation possibilities are distributed in social space. Do older people produce them in a different way to younger people? Do people from different social backgrounds pronounce them in different ways?
To answer these questions, I made recordings of a sample of people of different ages and social backgrounds in York. They were recorded reading a list of words containing the vowels of interest, as well as doing a ‘map task’, which aims to elicit more casual speech. From these recordings, I’ve extracted measurements of the first and second formants, which allow us to quantify differences in pronunciation. The first formant (F1) is related to the height of the tongue/openness of the vowel (contrast the high/close ‘i’ sound in ‘hid’ with the low/open sound in ‘had’). The second formant (F2) is related to how far forward the tongue is, and partially to the protrusion of the lips: contrast the back vowel in ‘fool’ with the front vowel in ‘feel’.
As I mentioned earlier, in order to capture all the possible different pronunciations of /u/ and /o/, we need to take dynamic properties of the vowel into account — we don’t just want to compare a single number representing a point or average value of F1 and F2; rather, we want to analyse how these change over time: the formant trajectories. To do this, I’ve used a statistical technique called Generalised Additive Modeling. This technique allows me to create statistical models which analyze the effect of a range of independent variables on the formant trajectories. The variables I’m interested in are:
These are all social factors which previous sociolinguistic research has identified as relevant to the way people speak. The three social indices might sound a bit wishy-washy at this stage: they come from an analysis of the interviews I carried out with each talker, and the variables are derived from an exploratory factor analysis. I’ll hopefully write about the details of this some other time. As always with variables reflecting people’s social background, we should treat them with care, and recognize that the labels we use for them may not directly reflect the latent variables which underlie our indices.
Comparing a set of statistical models and evaluating how well they fit the data, I can figure out which factors influence variability in the vowels we are interested in. The tables below show the formulae for the best models for each vowel. I've visualized the significant effects below.
Firstly, there's evidence of change in both vowels. We can see that in the plots below, which show the predicted F2 contours for three age groups.
It looks like both vowels are fronting. For /u/, there's fronting both at the onset of the vowel and the offglide, and the trajectory has become shorter – this means that people used to say a word like 'food' with their tongue very far back in the mouth, but now they pronounce it with their tongue further forward, so it sounds more like 'feed'. The same is true for /o/: it looks like the vowel in 'goat' used to be pronounced with a lot of movement of the tongue and lips ('gowt'), but now it's becoming more fronted ('geyt').
Another way of visualizing these changes is to take a point on the trajectory (say, the midpoint), and look at how the average value at that point has changed over time:
It looks like change in /u/ has been more rapid and regular than change in /o/ – the prediction line in the left-hand panel looks like it fits slightly better than the one on the right. In fact, speaker year of birth explains about 54% of the variance in /u/, and 22% of the variance in /o/.
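To make "variance explained" concrete, here is a minimal sketch of how such a figure can be computed from a straight-line fit of midpoint F2 against year of birth. The function name and the numbers are invented for illustration; they are not the York measurements, and the percentages quoted above may well come from a different model than this simple one.

```python
import numpy as np

def variance_explained(x, y):
    """R^2 of a least-squares straight line predicting y from x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # variation left over after the fit
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Invented illustration values, NOT measurements from the York recordings.
year_of_birth = [1940, 1955, 1970, 1985, 1995]
midpoint_f2_hz = [1100, 1350, 1500, 1800, 1900]

r_squared = variance_explained(year_of_birth, midpoint_f2_hz)
print(f"Year of birth explains {r_squared:.0%} of the variance in midpoint F2")
```

A value near 1 means the line accounts for almost all the between-speaker differences; a value near 0 means year of birth tells us very little.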
The figures above suggest that there are two kinds of change going on in /u/ and /o/ – the tendency for the vowels to be pronounced further forward in the mouth, and also changes in the vowel dynamics: how much movement the jaw, lips and tongue make. We can quantify these dynamics using the Euclidean distance in F1-F2 space – picking a point at the start and end of the vowel trajectory, and measuring how far apart those points are. I've shown this below for /o/: the x-axis shows the average degree of fronting for each speaker, and the y-axis shows the Euclidean distance. The letters represent the age group of the speaker: Older (born 1935-1960), Middle (born 1961-1980) and Younger (born 1981-2000).
The above plot demonstrates how change in /o/ involves both fronting and diphthongization: older speakers have very back pronunciations of /o/, and varying degrees of diphthongization. So some older speakers say 'goat' like 'gowt' and others more like 'gort' (n.b. with a British accent!). Younger speakers either have a very fronted diphthong ('gewt' or 'geyt'), or a back monophthong ('gort'). Interestingly, we don't seem to get speakers with the remaining logical possibility, a fronted monophthong ('gert').
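The Euclidean distance measure used for the y-axis of this plot can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the formant values are made up, and it assumes raw Hz values at one onset point and one offset point (the choice of points and any normalization are not specified here).

```python
import math

def trajectory_length(onset, offset):
    """Euclidean distance between two (F1, F2) points, in the units of the input."""
    f1_on, f2_on = onset
    f1_off, f2_off = offset
    return math.hypot(f1_off - f1_on, f2_off - f2_on)

# Hypothetical (F1, F2) values in Hz, for illustration only.
monophthong = trajectory_length((450, 1000), (440, 1020))   # formants barely move
diphthong = trajectory_length((550, 1400), (400, 1000))     # formants move a lot

print(monophthong < diphthong)
```

A small distance corresponds to a monophthong-like vowel ('gort'), and a large distance to a diphthong ('gowt' or 'geyt').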
One reason for these patterns seems to be that diphthongization of /o/ is socially-stratified. The plot below shows how more mobile speakers tend toward more diphthongal vowels – note that the diphthong is the form used in Southern Standard British English. Both groups have undergone fronting, but it's a bit more extreme for the more mobile speakers. There's also been a change in which aspects of /o/ differ across the groups: it looks like the primary distinction used to be the degree of fronting (left-hand panel), but the important difference today is in the degree of diphthongization (right-hand panel).
The analyses presented in this section demonstrate that /u/ and /o/ are undergoing change in York speech. Both vowels are undergoing fronting, and /o/ is subject to social stratification: more mobile speakers tend to pronounce it with a diphthong, and less mobile speakers tend to pronounce it with a monophthong. In the next section, I'll explore the extent to which listeners can use these differences as a social cue. Does producing a back /u/ make a speaker sound older than when they produce a fronted variant? Do people who produce /o/ with a diphthong sound different socially from those who produce a monophthong?
In the previous section, we established that variation in /u/ and /o/ patterns in interesting ways in York – in particular:
Having observed these patterns, it's reasonable to ask whether or not individual speakers are aware of variation in /o/ and /u/ on some level. For example, can they use variation in /o/ to make inferences about the social identity of a speaker? To what extent do those inferences match up with the patterns we observe in production?
To explore this question, I created an experiment where listeners heard a set of words containing the target vowels, and were asked to choose between pairs of images representing three social dimensions: age, socioeconomic status, and urban/rural identity. These dimensions were chosen based on previous work on these vowels. The images are shown below:
In order to capture the full range of variation in /u/ and /o/, I created a set of auditory stimuli that included various combinations of fronting and diphthongization of both vowels. I created the stimuli using a technique called vocoding, which allowed me to take a recording of a speaker from York reading a list of words and manipulate aspects of his vowel pronunciations. You can read more about this technique on my GitHub page: http://bit.ly/1LxISgc. For this experiment, I was interested in manipulating the first and second formant to generate the combinations of fronting and diphthongization represented in the symbols above. Here are some examples of what the formant contours looked like:
For the /u/ stimuli, you can see that the stimuli on the bottom row are very similar to the top row – there's a small difference in the first formant right at the start of the vowel, which makes them sound more diphthongal. As we'll see in the results, even a small difference in formant structure like this can make quite a big difference in the social perception of a vowel!
For the /o/ stimuli, the difference between the top row and second row should be clear – the formant contours of the top row are much flatter than the ones below. The second and final row show two different ways of fronting the vowel: moving from left to right across the second row, the vowels front primarily at the onset (so they sound more like 'gewt' than 'gowt'). On the bottom row, they front primarily at the offglide (so more like 'geyt').
The experiment had two parts. Firstly, listeners saw the visual stimuli and classified them in response to questions such as 'Which character is older?', 'Which character is middle-class?' and 'Which character is from rural Yorkshire?'. This made sure that they interpreted the social meanings conveyed by the stimuli as intended. They then moved on to the main experiment. On each trial, the participant heard a word and chose between two characters which differed on one dimension. For example, they would see the younger, urban, middle-class character alongside an older, urban, middle-class character. They would then hear a word containing one of the target vowels, and choose which person they thought was most likely to speak in that way. The idea was that if the difference between front and back /u/ was available as a cue to age (for example), listeners would be more likely to select an older character than a younger one when hearing back versus front /u/.
Before looking at the data, let's formulate some clear predictions. As a basic (and probably false) assumption, we might assume that listeners have direct access to the social distribution of variation in /u/ and /o/. If that was the case…
They would be able to use /u/ and /o/ fronting as a cue to talker age: fronter vowels would cue a higher proportion of 'younger' selections than back pronunciations.
They would be able to use the diphthongization of /o/ as a cue to socioeconomic status, assuming that the 'mobility index' used in the production analysis reflects a dimension of social class.
Conversely, since there's no association between /u/ fronting and social class in production, there's no reason that listeners should be able to use that vowel as a cue to social class in perception.
Although I didn't find evidence for this in my production data, a recent paper by Bill Haddican and colleagues showed that people who identify strongly as being from Yorkshire are more likely to use monophthongal /o/. If we accept that 'rural' can be a proxy for 'Yorkshire', it's reasonable to predict that /o/ diphthongization will affect selections on the urban/rural dimension – we expect a higher probability of a 'Rural' selection when listeners hear monophthongal /o/.
The way that the experiment was set up allows listeners' perceptual intuitions to be treated as a binary choice: on a given trial, they hear a speech token, and choose between image a (e.g. a 'working-class' character) and image b (e.g. a 'middle-class' character). We are interested in the way in which the speech samples affected listeners' responses: specifically, the way in which each sample affected the probability of the selection of one of the two images. More formally, we are interested in comparing the conditional probabilities of selections given vowel tokens:
\( P(\text{Social category} \mid \text{Vowel variant}) \)
To get at these values, we also need to take into account the fact that people might be biased one way or the other. During sociolinguistic perception, listeners have a lot of contextual information which might influence their decision – previous experience of the talker's speech; other socially-meaningful cues, etc. It might also be that some people are more generally biased toward selecting a 'working-class' or 'middle-class' image.
To understand how sociolinguistic inference works, I'm going to model listeners' responses using logistic regression. I've fit a model for each vowel and each social dimension (social class, age, and urban/rural). To do this, I used Markov Chain Monte Carlo sampling using Stan: http://mc-stan.org. Priors for the models were set using the recommendations in this paper from Andrew Gelman: http://bit.ly/2dAoD1a. The plots below show the predictions from each model, expressing the conditional probability of a particular social selection ('Working Class', 'Rural', or 'Old') given a particular pronunciation of /o/ or /u/. The error bars represent the 95% credible interval for each estimate, meaning that at the current state of knowledge we believe there is a 95% chance that the population-level parameter lies within that bound. If the interval of a given prediction crosses .5, it is unlikely that the vowel in question had a reliable effect on listeners' responses. If the intervals for two predictions cross over, it is unlikely that they are reliably different.
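As a rough illustration of how predictions and intervals like these are read, the sketch below turns posterior draws on the log-odds scale into a probability with a 95% interval, then checks whether the interval excludes .5. The draws are simulated stand-ins rather than output from my Stan models, the function names are hypothetical, and the equal-tailed interval is an assumption about how the interval is computed.

```python
import numpy as np

def inv_logit(x):
    """Map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def summarise(prob_draws, level=0.95):
    """Posterior mean and equal-tailed credible interval of probability draws."""
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(prob_draws, [tail, 1.0 - tail])
    return float(np.mean(prob_draws)), float(lower), float(upper)

def excludes_chance(lower, upper, chance=0.5):
    """True if the whole interval lies above or below the chance level."""
    return lower > chance or upper < chance

# Simulated stand-ins for posterior draws (NOT results from the experiment).
rng = np.random.default_rng(1)
bias = rng.normal(0.0, 0.2, size=4000)          # general response bias, in log-odds
back_effect = rng.normal(1.0, 0.2, size=4000)   # shift for a hypothetical back variant

# P('Working Class' selection | back variant), one value per posterior draw
p_back = inv_logit(bias + back_effect)
mean, lower, upper = summarise(p_back)
print(round(mean, 2), round(lower, 2), round(upper, 2), excludes_chance(lower, upper))
```

The intercept plays the role of the bias described earlier: it captures a listener's general tendency to pick one image, so that the vowel effect is estimated over and above that tendency.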
Although I predicted that /u/ wouldn't be useful as a cue to social class, it looks like it is – back variants tend to be heard as 'working class', and front variants tend to be heard as 'middle-class'. Diphthongization seems to strengthen this effect. The effects are much less convincing in the case of 'Rural' and 'Older' selections, although it looks like the trend for age selections goes in the right direction – back vowels tend to be mapped to older characters.
The social class selections for /o/ go more-or-less as predicted: diphthongs cue the selection of middle-class characters, and monophthongs cue the selection of working-class characters. Interestingly, it looks like fronting generally makes /o/ sound less working-class: for diphthongs, the most back diphthong is heard as working class, in contrast to the front ones; fronting monophthongs makes them less likely to cue a working-class selection. Although the effects are smaller than for social class, /o/ is a reliable cue to the urban/rural dimension: monophthongs sound more 'rural' than diphthongs, and fronting makes monophthongs sound less rural – note that this doesn't appear to apply to diphthongs, unlike in the social class case. There's also evidence that /o/ can be used as a cue to talker age: the centralized monophthong cues the selection of younger characters, and diphthongs cue the selection of older characters, especially when they are fronted at the onset.
Let's see how this matches up with the predictions I made earlier:
This was partially supported by the data, but the effects are very weak. Although speaker year of birth explains more than half of the variation in /u/ productions, listeners are only weakly sensitive to this pattern in perception. /o/ can be used as a cue to age, but in a strange way – diphthongization is perceived as 'old', but is actually more likely to occur in the speech of younger speakers.
This prediction was supported by the data – listeners reliably map diphthongal /o/ variants onto 'middle-class' images, consistent with the distribution of diphthongization in production. This effect interacted with fronting – more back diphthongs were mapped to 'working-class' images, and fronting at the vowel onset was heard as particularly 'middle-class'.
This prediction was not supported by the data! Listeners appear to hear fronted '/u/' as more 'middle-class' and back '/u/' as more 'working-class', despite there being no such association in production.
In summary, this analysis has shown that York listeners can use variation in /u/ and /o/ as a social cue, but their intuitions very often differ from the actual distribution of that variation in production. For example, although a speaker's degree of /u/ fronting is potentially very informative of their age, listeners aren't good at using that signal in social perception. In some cases, forms which are actually more likely in the speech of younger speakers (such as fronted, diphthongal /o/) are heard as 'older'. In my dissertation, I argue that this is because people are quite bad at tracking the actual distribution of phonetic cues in social space – rather, they make use of more abstract social meanings such as 'broad' and 'posh' to structure their social perceptions of linguistic features.
We can see more evidence for this in the fact that some of the visual stimuli were more likely to be selected than others: reliable effects were found only for the five stimuli below:
I've ordered these stimuli from left to right: the images on the left seemed to be selected most often when a listener heard fronted, diphthongal /o/ or back, monophthongal /u/, and the images on the right were selected when a listener heard back, diphthongal /u/ and back, monophthongal /o/. We can see that
I think that this might represent a relatively simple heuristic York listeners use when perceiving variation in /o/ and /u/: they perceive this variation as more or less 'broad' (to the right of this continuum) or 'posh' (to the left of the continuum). The fact that these particular images attracted such consistent responses reflects their prototypicality as types of 'broad' or 'posh' speaker.
In the previous section I established that people in York can use variation in /u/ and /o/ as a cue to various aspects of social identity. For example, they tend to hear fronted /u/ as 'middle class' and back variants as 'working class'. Although I identified this 'average' or community-level pattern, what I didn't discuss was the extent to which individual listeners share those intuitions. Does everyone think that /u/ fronting or /o/ diphthongization sounds middle class? How do people differ in their intuitions?
This is an interesting question because a lot of sociolinguistic work treats 'salience' or 'prestige' as a relatively stable property of variable linguistic features, without considering that different features might be noticed by different speakers, or interpreted in socially different ways. Luckily, we can use the models I estimated in the previous section to explore this question. To do that, I'm going to focus on the responses for 'Working class' vs 'Middle class' selections. The plots below show predictions for each individual, indicating how likely each listener was to select a working-class character when hearing each vowel stimulus.
We can see that a lot of people have a similar pattern – for /u/, most people have the same s-shaped curve as listener 'AR_F_1988_3', although some people have a very flat response ('BE_F_1936_2'). With /o/, most people have the same z-shaped pattern, although there are some quite big differences in the perception of the back diphthong – for example, listener 'TO_M_1989_2' hears it as very middle class, but 'MI_M_1964_7' hears it as very working class. Similarly, 'SA_M_1987_6' assigns the most fronted monophthong to middle class characters, while 'AR_F_1988_3' assigns it to working class characters. It looks like we can describe differences in the way variation is perceived as follows:
A question which logically follows from these points is the degree to which these differences are socially structured. Perhaps people of different ages and social backgrounds are sensitive to different meanings/phonetic properties. I've tested this by adding interaction terms to the logistic regression models from the previous section, testing the contribution of the same variables I used in the production analysis:
Of these variables, it turns out that the listener's year of birth and their score on the mobility index have a statistically significant effect on their responses. I've visualized these effects below. The plots show the probability that a listener will select a working-class character as a function of their YOB/mobility, split by the vowel variant they heard on each trial.
The general pattern seems to be that younger, more mobile listeners are more sensitive to variation in /u/ and /o/ as a social cue. Furthermore, the big differences seem to be related to aspects of vowel pronunciation which are undergoing change:
As I write up my dissertation results, I am still trying to figure out how to interpret these effects. However, what they generally show is that the social interpretation of linguistic variation is heavily dependent on the listener – this means that we can't base our claims about the social meaning of variation primarily on production patterns, and we can't assume that a form which we think is 'salient' is socially meaningful for everyone in our population.
When we listen to someone speak, we are doing two things: understanding what is being said, and understanding who is doing the speaking. As if understanding linguistic meaning wasn't hard enough, understanding the social meanings encoded in speech is also a challenging task, firstly because speech sounds are very complex, and secondly because society is very complex. How do we manage to make sociolinguistic inferences under these conditions?
One obvious answer might be that listeners simply remember how different people talk when they encounter them; they might then use that information to form social impressions when they hear a new person speaking. If this were the case, we’d expect that people would have relatively reliable intuitions about the patterns of variation in their speech community, and we’d expect that those intuitions would match up with the actual social distribution of speech forms in that community. Another possibility is that social perceptions are less about the actual distribution of speech variation, and more about the stereotypes which circulate in society: which forms are ‘broad’ or ‘posh’, or which forms are typical of certain types of speaker.
In my dissertation I evaluate these possibilities by documenting the sociolinguistic distribution of vowel variation in production, and comparing that distribution with the intuitions listeners have when attempting to use those features to infer aspects of a talker’s social identity. It turns out that listeners are able to use very small differences in vowel pronunciations to infer the social identity of a speaker, but their intuitions rarely seem to directly reflect the distribution of that variation in production. The kinds of ‘errors’ that people make are consistent in interesting ways, supporting the idea that listeners represent speech variation primarily in terms of abstract stereotypes, rather than detailed memories of speech events. Additionally, people differ considerably in the social meanings they assign to phonetic variation – what one listener hears as 'broad' may be heard as 'posh' by another person.
Sociolinguists very often hypothesize a social meaning for pronunciation variants in order to explain patterns of production (e.g. the reversal of sound changes). My results demonstrate that inferring the social meaning of variation from production patterns alone may be problematic. To get this right, I think we need to do two things:
Related to the above is the notion of 'sociolinguistic salience', which researchers often treat as a property of linguistic features. The fact that individuals vary in the degree to which they find different aspects of phonetic variation socially meaningful suggests that it doesn't really make sense to describe a form as more or less salient. Rather, we need to minimally say which phonetic characteristics of that form are meaningful, what social meaning(s) they are associated with, and which groups of listeners they are relevant to.
There is evidence that some people are more generally attuned to sociolinguistic meaning than others, which is relevant to recent work on the role of individuals in the actuation and propagation of sound changes. It may be that one of the characteristics of the individuals who facilitate the spread of sound changes is their tendency to recognize the difference between innovative and conservative ways of speaking and assign a social significance to that difference.
Episodic models of speech perception and production have gained a lot of popularity among some sociophoneticians, who claim that all manner of contextual detail may be encoded alongside speech. My results indicate that listeners' knowledge about social-indexical variation may involve considerable abstraction. The fact that listeners have any intuitions about the social distribution of linguistic features means that some form of episodic representation must be involved. However, my results suggest that the mental representations involved in sociolinguistic inference are more likely to be general and underspecified than very detailed, allowing the rapid social categorization of talkers.
Finally, my results suggest that naive listeners are generally quite poor at tracking the social distribution of linguistic variation, implying that it is unwise to rely on their judgments in forensic contexts (e.g. in identifying the regional origin or social background of a talker).