Megan Figueroa 2017

Megan Figueroa
megan@email.arizona.edu

University of Arizona
College of Social & Behavioral Science
Department of Linguistics
I am working with corpus data to determine whether children’s English past tense overregularization errors depend on the internal structure of the target verb. Each verb token was classified into one of eight classes based on internal structure. There were not enough data points for 4 of the classes, so I included only 4 verb class factors.
 
I hypothesize that these 4 verb class factors will predict how likely an overregularization error is to occur at different stages of development based on possible interactions with MLU or type.token ratio. 
 
So my dependent variable is whether an overregularization error is "Present" (0) or "Not present" (1). The independent variables, whose contribution to predicting whether the dependent variable is 0 or 1 I hope to investigate, are verb class factor, MLU, and type.token ratio.
 
The current model I am working with:

or2.lme = glmer(OR_present ~ verb_class_fac + type.token + mlu +
                  verb_class_fac:type.token + verb_class_fac:mlu +
                  (1 + verb_class_fac | subject) + (1 | verb),
                data = data.centered.scaled, family = "binomial",
                control = glmerControl(optimizer = "bobyqa",
                                       optCtrl = list(maxfun = 1000000)))

 

Figueroa Data

 

Initial Meeting

I.  Who:

Client: Megan Figueroa (University of Arizona – College of Behavioral Sciences, Department of Linguistics)

Consultants: John, Julia, Hakeem (author)

 

II.  When:

15 September 2014, 3-4pm

 

III.  What:

A.  Summary of Client’s Problem

Megan is investigating possible dependencies of children’s English past-tense over-regularization errors (suffixing ‘ed’ to all verbs to denote past tense) on the interaction of verb class factors (different ways of affixing past tenses to verbs) and measures of language development (such as the ratio of word types to tokens or mean length of utterance, both of which are expected to increase with language development).

Her panel data is taken from the CHILDES corpus.  It comprises 3 children whose speech and utterances were recorded during playtime for around an hour each session over the ages of 1 – 6 years. These utterances were then transcribed, and analysis was conducted on the transcript. 

 

B.  Discussion

Megan has fitted a logistic regression model using a dependent binary variable that is ‘1’ if over-regularization error is present and ‘0’ otherwise and the following independent variables:

- Verb class factor: an unordered categorical variable taking 5 values.

- MLU (mean length of utterance): the total number of morphemes (the smallest meaningful units of language) in the first 100 utterances of each play session, divided by the number of utterances.  The use of 100 utterances (instead of 50) was a choice made by Megan based on common practice in her field. 

- Type-token ratio (TTR): the number of types (distinct root words) divided by the number of tokens (total words in a given text), e.g. ‘get’ is a type, while every instance of the child saying any form of get (‘get’, ‘got’, ‘gotten’, ‘gets’, and the incorrect ‘getted’) is a token.  A single type-token ratio is calculated for each play session.

Higher MLU and TTR indicate more complex, varied and diverse vocabularies and language abilities.  
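The two developmental measures defined above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frames, column names, and counts are hypothetical toy values, not Megan's data, and the real inputs would come from the CHILDES transcripts.

```r
# Toy data (hypothetical): one row per utterance with its morpheme count.
utterances <- data.frame(
  session   = c(1, 1, 1, 2, 2),
  morphemes = c(2, 3, 4, 5, 3)
)

# MLU per session: mean morphemes per utterance, restricted to the
# first 100 utterances of the session (the toy data has far fewer).
mlu <- tapply(utterances$morphemes, utterances$session,
              function(x) mean(head(x, 100)))

# Toy data (hypothetical): one row per word token with its root form (type).
word_tokens <- data.frame(
  session = c(1, 1, 1, 1, 2, 2, 2, 2),
  root    = c("get", "get", "go", "see", "get", "go", "eat", "see")
)

# TTR per session: number of distinct types divided by number of tokens.
ttr <- tapply(word_tokens$root, word_tokens$session,
              function(x) length(unique(x)) / length(x))

print(mlu)
print(ttr)
```

Both results are indexed by session, so they can be merged back onto the verb-token data by session before modeling.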

In addition, Megan plotted the 3 predictor variables, the binary response variable, and one additional predictor (age, which she later decided not to include in the model) to better understand the data and identify possible relationships. She hypothesized that MLU, TTR and age should be positively correlated. However, the data only partially confirm this. She also wanted to check for possible multicollinearity. 
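One simple first look at the relationships among MLU, TTR and age is a pairwise correlation matrix. The sketch below assumes one row per play session; the column names and values are hypothetical, and the 0.8 cutoff is only a common rule of thumb, not a threshold chosen in the meeting.

```r
# Hypothetical session-level values; real values come from the transcripts.
sessions <- data.frame(
  age_months = c(24, 30, 36, 42, 48),
  mlu        = c(1.8, 2.4, 3.1, 3.5, 4.2),
  ttr        = c(0.30, 0.34, 0.41, 0.40, 0.47)
)

# Pearson correlations between every pair of predictors.
cor_matrix <- cor(sessions)
print(round(cor_matrix, 2))

# List predictor pairs whose absolute correlation exceeds the cutoff
# (upper triangle only, so each pair is reported once).
cutoff <- 0.8  # assumed rule of thumb
high <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]]
)
print(high_pairs)
```

Pairwise correlations only catch collinearity between two predictors at a time, so this is a screening step rather than a full diagnostic.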

Furthermore, Megan noted that the dataset she shared was a condensed version of the original, which comprised the actual over-regularization errors uttered by the children.

 

IV.  Next Steps

We suggested the following:

- Use the uncondensed dataset to investigate relationships between different error rates.

- Given the current logit model, test for endogeneity of the independent variables.

 

####################

 

Follow-Up Meeting

I.  Logistics

Client: Megan Figueroa (University of Arizona, Department of Linguistics)

Consultants: John, Julia (author), Hakeem

Session Date and Time:  27 September 2014, 3-4pm

 

II.  Details  

Summary of Client’s Problem

Megan is investigating possible dependencies of children’s English past-tense over-regularization errors (suffixing ‘ed’ to all verbs to denote past tense) on the interaction of verb class factors (different ways of affixing past tenses to verbs), measures of language development (such as the ratio of word types to tokens or mean length of utterance, both of which are expected to increase with language development), and properties of the individual verbs (phonotactic probability and word frequency).

Her panel data is taken from the CHILDES corpus.  It comprises 3 children whose speech and utterances were recorded during playtime for around an hour each session over the ages of 1 – 6 years. These utterances were then transcribed, and analysis was conducted on the transcript. 

Discussion

Since the last meeting, Megan added to the data and did a great deal of exploratory work.  First, she went back through the data and added in every instance of the child saying a verb in the past tense.  She also added in the phonotactic probability of the correct and incorrect forms of every verb.  Phonotactic probability is the probability of the sequence of sounds in a given word.  She is currently in the process of adding in a measure of frequency of the different verbs.  This will be discussed more below.  

With some assistance from Julia, she produced the following plots:

1.  Box plots of error rates for each verb for each child; each box plot used data from one class of verb

2.  Line plots of error rates per verb class against mean length of utterance (MLU).  Note that the error rate for each verb class at each MLU was calculated as the number of errors that occurred in a specific verb class’s verbs at that MLU divided by the total number of utterances of that class’s verbs at that MLU.

3.  Line plots of error rates per verb class against type-token ratio, calculated in the same manner as 2 above.

4.  Line plots of error rates per verb class over time, again calculated as in 2.

5.  Scatterplot of error rates per verb

6.  Scatterplot of the ratio of incorrect to correct forms’ phonotactic probabilities plotted against error rate of the incorrect form

These plots indicated a variety of things:

First, Adam and Sarah seem to pattern similarly, while Eve differs.  Second, there is little evidence that verb class is making a large difference in production errors.  This can be seen in the MLU, type-token ratio, and time plots and in the scatterplot of error rates by verb.  Third, of the two factors MLU and type-token ratio, type-token ratio seems to most cleanly show the expected U-shaped development pattern (in the plots, the U is upside down).  Fourth, the ratio of incorrect to correct forms’ phonotactic probabilities seems to partially explain the error rate.  However, as was discussed in the session, it doesn’t seem to fully explain the over-regularization.  Thus, there is likely another factor that is also affecting the error rate.
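The error rates behind the line plots (errors in a verb class at a given MLU divided by all utterances of that class's verbs at that MLU) can be computed as in the following sketch. The data frame, its column names, and the values are hypothetical; it assumes one row per past-tense verb token with a 0/1 error indicator.

```r
# Toy data (hypothetical): one row per past-tense verb token.
verb_tokens <- data.frame(
  verb_class = c("A", "A", "A", "B", "B", "A", "B", "B"),
  mlu        = c(2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0),
  error      = c(1, 0, 0, 1, 1, 0, 0, 1)   # 1 = over-regularized
)

# The mean of a 0/1 indicator within a group equals
# (number of errors) / (number of tokens) for that group.
error_rates <- aggregate(error ~ verb_class + mlu,
                         data = verb_tokens, FUN = mean)
names(error_rates)[3] <- "error_rate"
print(error_rates)
```

Replacing `mlu` with a type-token ratio or age column in the formula would give the rates for the other two sets of line plots in the same way.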

A good portion of the discussion in this session focused on finding the factor that could explain the rest of the variation in error rates.  Julia and Megan had previously felt this factor could be word frequency and had discussed adding it to the data.  Based on a discussion with her advisor, LouAnn Gerken, Megan has begun to calculate two measures of frequency, both taken from the speech of the adults interacting with the three children in the play sessions:

1.  Relative frequency of past tense form calculated as the ratio of the number of times the past tense form of a verb (e.g. caught) was said to the number of times all forms of the verb (e.g. catch, catching, catches, caught, etc.) were said

2.  Basic frequency of the past tense form calculated as the ratio of the number of times the past tense form of a verb was said to the total number of adult-spoken words in all the sessions
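The two adult-speech frequency measures listed above could be calculated along these lines. This is a sketch only: the table layout, the `past` flag, the counts, and the total word count are all hypothetical stand-ins for values that would be tallied from the adult tiers of the transcripts.

```r
# Toy adult-speech counts (hypothetical): one row per verb form.
adult_forms <- data.frame(
  verb  = c("catch", "catch", "catch", "go", "go"),
  form  = c("catch", "catches", "caught", "go", "went"),
  past  = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  count = c(6, 2, 2, 15, 5)
)
total_adult_words <- 1000  # hypothetical total of adult-spoken words

# Count of past tense forms per verb, and count of all forms per verb.
is_past <- adult_forms$past
past_counts <- tapply(adult_forms$count[is_past], adult_forms$verb[is_past], sum)
all_counts  <- tapply(adult_forms$count, adult_forms$verb, sum)

# Measure 1: past tense count relative to all forms of the same verb.
relative_freq <- past_counts / all_counts[names(past_counts)]

# Measure 2: past tense count relative to all adult-spoken words.
basic_freq <- past_counts / total_adult_words

print(relative_freq)
print(basic_freq)
```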

The measures of frequency suggested above would capture some of the frequency with which the studied children heard given past tense forms of verbs.  However, as was pointed out in the session, there is not a great deal of adult speech in the corpus Megan is using.  Thus, it is likely that the measures of frequency will be somewhat inaccurate.  We suggested using a more established measure of frequency of words in English based on a great deal more data.  While that frequency measure would not be specific to child-directed speech or to the children studied, it would likely be more accurate.  Alternatively, all three measures of frequency could be explored.  If one or more seemed to have a great effect on the error rate, then it (or they) could be included in a later model.  We would simply have to ensure that the measures are not multicollinear, and if they are (which seems likely), then we would choose the measure that seems to have the greatest effect.

The frequency measures discussed above are all based on adult speech.  However, some data Megan just compiled on the children’s own production frequency appears to show that the words said least frequently have the highest error rates.  Thus, the child’s own production frequency might affect over-regularization.  This would be worth exploring. 

Another suggested factor that may also influence the error rate is the age at which the child learned the word.  This could be worth exploring as well.

We briefly discussed construction of a model once frequency has been explored.  In order to determine the form the model should take, we suggested producing a series of plots of the various predictors against the response.  Additionally, we will need to think about the form the response should take (rate, binary variable, etc.).  

Finally, we discussed which data should be modeled.  Since Sarah and Adam looked similar, we suggested that it might make sense to model them together.  However, separate models would also be justifiable.  We discussed the possibility of not modeling Eve’s data since she is so different from Sarah and Adam and has much less data.

III.  Next Steps

We thus suggest the following next steps:

1.  Calculate various measures of frequency and add them to the data.

2.  Explore whether or not any of those measures appear to influence the error rate.

3.  If more than one measure influences error rate, determine whether or not those measures are themselves correlated.  If they are, choose the measure that appears to best explain the data to use in the modeling step.

4.  If no measure of frequency appears to influence error rate, consider exploring age of acquisition of a verb as a factor.

5.  Finally, collect the factors from the exploration that seemed to best explain the data and fit a model to the data.  
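For step 3, one way to check whether several frequency measures overlap is a variance inflation factor (VIF), computed here by hand in base R by regressing each measure on the others. The helper name `vif_by_hand`, the column names, and the values are hypothetical, and the VIF is offered as one possible diagnostic rather than the method agreed in the session.

```r
# Toy per-verb frequency measures (hypothetical values).
freq_measures <- data.frame(
  rel_freq    = c(0.10, 0.25, 0.30, 0.15, 0.40, 0.20, 0.35, 0.05),
  basic_freq  = c(0.002, 0.004, 0.006, 0.003, 0.007, 0.005, 0.006, 0.001),
  corpus_freq = c(3.1, 4.0, 3.2, 4.8, 3.9, 2.5, 4.4, 3.6)
)

# VIF for each column: regress it on all other columns and
# compute 1 / (1 - R^2). Values near 1 indicate little overlap.
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(freq_measures)
print(round(vifs, 2))
```

Commonly cited rules of thumb treat values above roughly 5 or 10 as a sign of problematic collinearity, but those cutoffs are conventions rather than firm limits.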

Julia has been meeting with Megan on a regular basis and will continue to do so to help her complete this project.

Figueroa Final Report 12.09.2014