“I wonder how we can get them to sleep more.”

This simple thought, expressed by my wife, not even a question, became a challenge to me. My engineer mind took this as a problem to be solved, and when a software developer sees a problem, they devise tests. Luckily, I knew the perfect system for testing out some ideas in a controlled and measurable setting. And with twins, testing would be even easier. Welcome to parenting, A/B testing style.

A/B testing is used all over the web. You likely encounter it dozens, if not hundreds of times a day, without even noticing it. All the big tech companies do it, using it as a tool to test the performance of ideas and measure them.

Google is famous for testing 41 shades of blue for search results. Designers allegedly couldn’t decide which of two shades to use, so they tested 41 in total to see which led to the most users clicking on the results.

Facebook tests different experiences within the feed constantly. Amazon even changes around the buy buttons and cart layouts fairly often. You may notice these if you ever log in from a new computer or see a friend using a site that looks subtly different from yours.

A/B testing compares one or more “treatments,” or experiments, against a “control,” the existing experience. A metric is measured, usually based on a user action such as a click-through or “conversion,” and each treatment is compared against the control as the baseline.

For the Google example, they might test the likelihood of users clicking through to at least one result with each shade. Once the test has run long enough to produce a statistically significant result, often a week or two, whichever experience has the better rate is chosen as the winner and becomes the new control.
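For the curious, here is a rough sketch of how a "winner" might be checked for statistical significance. It uses a standard two-proportion z-test, which is one common choice and not necessarily what Google or anyone else actually runs. The function name and the click counts are made up purely for illustration.

```python
import math

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Return (z, two-sided p-value) for the difference in click-through rates.

    A is the control and B is the treatment. Hypothetical helper, not from
    any particular testing tool.
    """
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled rate under the assumption that both experiences perform the same
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up numbers: 10,000 users per shade, 10% vs. 11% click-through
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these invented numbers the p-value lands below the conventional 0.05 cutoff, so the treatment would typically be declared the winner.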

Where this gets really complicated is when multiple experiments are run at the same time or when users are not split equally between experiences. Here, a solid knowledge of statistics is needed, or the use of one of the many powerful testing tools available.

At Audible and Amazon, we test experiences like this all the time. It’s the best way to see how users actually behave, because what users say they will do and what they actually do can be slightly different.

Charting results
I decided to use this method of testing with the boys to see if we could increase the most important metric in the house, as anyone with 10-week-old children, especially twins, knows: sleep times. Using one of the boys as a control and the other as the treatment (never mind the fact that no one would describe any part of our lives right now with the words “control” or “treatment”), I tested several theories about length of sleep, baselined against the control.

In any experiment, accurate measurement and data tracking are critical. Often a success metric is chosen because the data is available or easy to measure. You don’t want to measure something that takes longer to measure than it does to change the test or its input. Luckily, measuring sleep is about as easy as it gets.

When they wake up at night, we just write it down. This is exactly what we’ve been doing since the day they were born as the nurses at the hospital instructed us. We’ve gone through several notebooks already, but it’s so easy to track. For this, we even started importing the data into a spreadsheet to see the impact more visually.


Big Data
First, we tested increasing the amount given at the feeding immediately before bedtime. Instead of the normal four ounces, we tried five, then six. To prevent bias from one child, we alternated who was the test and who was the control since they seem to be on alternating cycles. While one child had a larger evening feeding, the other would stay at four ounces. The result: inconclusive.
Both children seemed to start increasing length of sleep anyway during this period. They both slept almost the exact same length of time as well. There was one night where an increased feeding correlated with a record 5.5 hour stretch of sleep, but one data point is insignificant in this dataset.
It was also hard to continue testing this as anything beyond five ounces had a high likelihood of being spit out a few minutes after eating.
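
For anyone who wants to try this at home, the alternating setup above can be sketched in a few lines. The nightly numbers below are invented for illustration and are not our actual log; the twin labels and the helper name are likewise hypothetical. The idea is simply to compute "treatment minus control" for each night, whichever twin happened to be the treatment, and then average.

```python
from statistics import mean

# Hypothetical nightly log: longest sleep stretch in hours for each twin,
# plus which twin got the larger bedtime feeding that night.
nights = [
    {"treated": "A", "A": 4.0, "B": 3.5},
    {"treated": "B", "A": 3.5, "B": 4.0},
    {"treated": "A", "A": 4.5, "B": 4.5},
    {"treated": "B", "A": 4.0, "B": 4.5},
]

def treatment_minus_control(night):
    """Sleep of the treated twin minus sleep of the control twin, in hours."""
    treated = night["treated"]
    control = "B" if treated == "A" else "A"
    return night[treated] - night[control]

diffs = [treatment_minus_control(n) for n in nights]
avg_lift = mean(diffs)
print(f"Average lift: {avg_lift:.2f} hours over {len(diffs)} nights")
```

Because the roles swap each night, a twin who simply sleeps longer by nature adds to the difference on one night and subtracts from it on the next, which is the bias the alternation is meant to cancel out.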

Next was a secret whispered about in the dark corners of parent blogs around the web and passed from parent to fellow parent, at least in my office: gripe water. OK, maybe it isn’t that much of a secret, but it took us a while to try it. Supposedly this mix of herbs and spices, as opposed to KFC’s blend, would settle stomachs from reflux and gas, especially overnight, resulting in longer sleep.

After a week of testing, we found it did actually help with reflux, especially spit-ups, and though we didn’t track individual burps or farts, it seemed to reduce them as well. The length of sleep was not impacted much, though. We did see a small increase on average, between 20 and 30 minutes, but again this may have been a natural increase due to age.

No reflux equals a happy baby

After gripe water, which became the new control, we tested an extra feeding before bed. The boys were starting to do this naturally on their own anyway, and we had been trying to prevent it. However, it seemed like an opportunity ripe for testing, so we gave it a shot. Many children will “cluster” feed before bed, with a second feeding only a short time after the previous one. We did this feeding about 1.5 to 2 hours after the previous one, compared to 3 hours normally. In this feeding, we tried 4 ounces compared with the 4–5 they normally take during daytime feedings. Sometimes they would refuse to take more than 3. Of all the experiments, this seemed to work best. We saw increases of up to an extra hour of sleep as a result, though often not until a few days into the experiment. Apparently this takes time to affect sleep patterns.

A good lesson for A/B tests is that sometimes there is an adjustment period of several days while people figure out the new treatment and adapt. It’s important to capture both the adjustment-period results and the post-adjustment ones, though. Apple has famously neglected the adjustment period on several product launches, notably Maps.

Last, we tested keeping them awake longer during the day. Our hypothesis was that they would therefore be more tired at night and would sleep longer as a result. This may have been slightly true, as we saw minor increases in length of sleep, but we didn’t account for the stress and exhaustion caused by keeping them awake and making them unhappy. It also took significantly longer to get them to settle down and sleep at night, as they were overtired and fussy. The lesson for testing: don’t sacrifice other metrics for a small gain in one.

You can’t make me sleep!

Many of these tests were inconclusive. This is largely due to the sample size. With a sample population like Facebook’s, tests can be done in small segments and achieve statistical significance very quickly. With twins, it’s hard to know what is a real result and what is personality or natural progression. In order to test more accurately, we may need to increase the sample size. Triplets would come in handy for this. Maybe someone else’s triplets, though. We are definitely not ready for that!
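
To put a rough number on the sample-size problem, here is a back-of-the-envelope sketch using the textbook formula for comparing two averages with a normal approximation. Everything in it is an assumption for illustration: the one-hour night-to-night spread, the half-hour effect we would like to detect, the usual 5% significance level and 80% power, and the function name itself. It also treats every measurement as independent, which nights from the same two babies certainly are not.

```python
import math
from statistics import NormalDist

def samples_per_group(std_dev, min_effect, alpha=0.05, power=0.80):
    """Approximate measurements needed in each group (control and treatment)
    to detect a difference of `min_effect` between two averages.

    Hypothetical helper using n = 2 * (z_alpha/2 + z_power)^2 * sd^2 / effect^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / min_effect ** 2
    return math.ceil(n)

# Assumed: sleep varies by about 1 hour night to night,
# and we want to detect a 30-minute improvement.
n = samples_per_group(std_dev=1.0, min_effect=0.5)
print(f"Roughly {n} measurements per group")
```

Under those assumptions the answer comes out to dozens of measurements per group, which helps explain why a week of testing on two babies rarely settles anything.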

This also shows the importance of the test, measure, iterate process. Though several of the methods didn’t show large improvements, put together they may. By using the treatment as the control when it outperforms the control, small improvements get stacked. By continuing to try new things quickly and moving on, it’s easy to come up with new ideas to try. You don’t need to move the mountain, just move little handfuls of dirt over a long time. With this approach to parenting, the boys can continuously grow as well. And with luck, so will our sanity, well-being, and lives as parents.

This piece was originally published on Dad On The Run.