A Simple Model of Infant Word Finding Using Computational Phonetics Embeddings
This summer, our project focused on how infants find words, modeled using computational phonetic embeddings. Babies build their vocabulary by recognizing words and patterns in the speech they hear around them. But a word sounds different every time it is pronounced, and when sentences are in a language unknown to you, it is difficult to find word boundaries and discern which words are being said. This raises the question, "How do babies recognize individual words in a continuous stream of speech?" Our approach to exploring this question used speech technology to test the intuition that babies may identify a word because it is a pattern of speech they have heard before.
In this experiment, we took an audio recording of a mother naturally talking to her 7-month-old child at home, drawn from a larger dataset of maternal infant-directed speech. The recording was split into utterances, and each utterance was converted into a mathematical representation called an embedding. We then took pairs of utterances and, using a model of a baby's ability to learn words, searched for any part of the first utterance that was similar to any part of the second. The matches the model found were our hypotheses about what a baby might flag as words, and we compared these segments to the target English words the mother actually said.

We found that our model picked up four different kinds of "words." Some of the found segments were entire target words, some contained only part of a target word, some were an entire target word with extra pieces of a neighboring word on either side, and some were parts of two different words joined together. The most common case was a segment containing only part of a word. This may mean our speech representations did not handle coarticulation well, that is, the sounds at the end of one word bleeding into the beginning of the next. In other words, the embeddings may not accurately reflect the same sounds in different phonetic environments. To understand the merit of our model, we want to compare our results to a chance model that picks segments randomly and see which of the two does a better job at learning words.
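The matching step described above can be sketched roughly as follows: represent each utterance as a sequence of per-frame embedding vectors, compute a frame-by-frame cosine similarity matrix between the two utterances, and look for a run of frames where the two utterances stay similar. This is a minimal illustration, not the project's actual model; the function name, fixed span length, and diagonal-scan strategy are all simplifying assumptions.

```python
import numpy as np

def best_matching_segment(emb_a, emb_b, span_len=5):
    """Find the most similar pair of equal-length spans between two
    utterances, given per-frame embeddings (frames x dims).

    Hypothetical stand-in for the matching model described above:
    it only checks fixed-length diagonal runs in the similarity matrix.
    Returns (score, (a_start, a_end, b_start, b_end)).
    """
    # Normalize each frame so dot products give cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # frame-by-frame similarity matrix

    best_score, best_span = -np.inf, None
    # A high average similarity along a diagonal run suggests the same
    # sound pattern occurs in both utterances -- a candidate "word."
    for i in range(sim.shape[0] - span_len + 1):
        for j in range(sim.shape[1] - span_len + 1):
            score = np.mean([sim[i + k, j + k] for k in range(span_len)])
            if score > best_score:
                best_score = score
                best_span = (i, i + span_len, j, j + span_len)
    return best_score, best_span
```

A real system would allow variable-length matches (e.g., segmental dynamic time warping) and use learned phonetic embeddings rather than raw frames, but the core idea of scoring aligned runs of similar frames is the same.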