Did you ever think that what you named your baby could cause a data-related controversy?
I unironically love the baby name database that is managed by the Social Security Administration. The data is publicly available and accessible to anyone, with or without coding experience. There are forms on the website that show you baby name prevalence in real and relative terms, and you can check for popularity over time.
Searching for the should-be-all-popular name for the little Excel files of joy, “badDataTakes”, reveals no entries. Searching for “Mary”, however, reveals just how popular this name once was:
dear lord
Links to the data are at the end of the article. It’s fun to explore!
I find it fascinating to view the trends in baby names because it’s fun and can unearth some interesting cultural happenings. It has also revealed a wanton lack of journalistic integrity at not just one, but potentially 3 different publications!
This is what is at the heart of the problem: in its article, BuzzFeed claimed that Sophia was finally toppled from her top spot upon the throne of girls’ names. Olivia stormed the throne room, overthrew the mad queen, and placed the crown upon her own head.
They said this was the first year for Olivia in the number one spot. But there is a problem for anyone who can look up SSA baby names: Olivia first took the top spot in 2019.
Clearly something was off. So what happened? BuzzFeed didn’t use the SSA data directly for their article. They used another website’s analysis: babycenter.com
Looking at the popularity of the names on that website, you find something really interesting. In 2019 BabyCenter went from using SSA data to using BabyCenter user data.
Here is Olivia in 2018:
And here she is in 2019:
And YET Olivia is listed as the new number one on BuzzFeed, which attributes this to the meteoric rise of Olivia Rodrigo.
While Ms. Rodrigo is in fact quite stellar, it appears that the rise of Olivia predates her at BabyCenter!
What about Sophia? It was alleged that she was number one for 11 years. Well, it turns out that the last time she was number one on BabyCenter was 2013, and since BabyCenter used SSA data before 2019, we can check it ourselves:
Hardly 11 years!
So what gives?
The answer is simple, and it’s right on the BabyCenter methodology page.
For the first time ever, in 2021, BabyCenter decided not to combine different spellings of names in their published rankings. The biggest impact this would have is on the extremely popular girl’s name “Sophia”, which appears at number 5 in 2021 for BabyCenter and number 6 in the SSA data. “Sofia”, by contrast, was number 22 on BabyCenter and 18 in the SSA data. If we combine them using SSA data, we get 19,429 females born with “Sofia” or “Sophia” on their Social Security card application.
There were 17,728 Olivias.
To state this outright: if we combine the two, Sophia/Sofia is once again the top name.
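The merge described above is easy to sketch in Python. This is only an illustration, not BabyCenter's actual method: the `VARIANTS` mapping and the `combine_spellings` and `rank` helpers are made-up names, and the per-name counts in the toy example are invented so the mechanics are visible. Only the two totals at the end (19,429 and 17,728) are the SSA figures quoted above.

```python
from collections import Counter

# Hypothetical mapping from a spelling variant to the name it is grouped under.
VARIANTS = {"Sofia": "Sophia"}


def combine_spellings(counts, variants=VARIANTS):
    """Return a Counter where each variant's births are added to its grouped name."""
    combined = Counter()
    for name, births in counts.items():
        combined[variants.get(name, name)] += births
    return combined


def rank(counts):
    """Names ordered from most to fewest births."""
    return [name for name, _ in Counter(counts).most_common()]


# Toy numbers, NOT real SSA counts: they only show how the ranking can flip.
toy = {"Olivia": 100, "Emma": 90, "Sophia": 70, "Sofia": 50}

separate = rank(toy)                   # each spelling stands alone
merged = rank(combine_spellings(toy))  # spellings pooled together

# The real 2021 totals quoted in this post.
sophia_plus_sofia = 19_429
olivia = 17_728
margin = sophia_plus_sofia - olivia
```

With the toy counts, Olivia leads when spellings are ranked separately and Sophia leads once they are pooled, which is the same flip the real totals show.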
BabyCenter has it right in the 2021 article, and explains their methodology. They make it clear that each spelling is unique and each one has its own place. And they have the rankings right on the individual name pages. But they didn’t update previous years’ summary articles, which still show Sophia as the top girl’s name all the way back to 2010. And they didn't explain that the change would affect the underlying data in the past as well.
That’s how BuzzFeed got it wrong, and that’s how easy it is for decisions we make with data to have big effects. A small change in how rankings were tabulated meant that much of the previous years’ data needs to be re-understood. Even though this is a small issue, you can see how this exact same process could create a much larger problem if it were to happen in, say, coronavirus death data.
Make your data definitions clear and easily accessible, and clearly communicate any change you make to those definitions. The impact the change will have also needs to be clearly communicated, so your readers can avoid mistakes like BuzzFeed’s. People need to understand exactly what they need to do to best understand the change you made, and whether they need to update any previous data they collected from you.
Also, I’m pretty sure Slate saw my original tweet thread and then wrote an article based on what I said.



I posted that thread on Nov 6th 2021. Then two days later Slate publishes this? Pay the parody account!
https://slate.com/technology/2021/11/top-baby-names-boys-girls-babycenter-buzzfeed-bs.html
So why publish this now? Data a year old hardly seems timely or relevant. Well, for one, it’s my first Substack post and I just started this thing. I'll get around to charging you weirdos once I figure it out.
But also because BuzzFeed just published their yearly article. It has a different byline but still basically just summarizes BabyCenter's 2022 article. Did the old editor get tired? Was he scared off by the criticism from your lowly shitposter? Maybe Slate will copy this again.
Data and Notes:
The SSA baby name database is also a very popular public dataset for people who are trying to expand their data/analysis skills. The SSA provides data definitions, downloads, and an easy-to-understand public UI. Comment with some of your favorites that you have found. Bonus if you share their GitHub. Double bonus if it’s bad data too!
SSA full data access
https://www.ssa.gov/OACT/babynames/limits.html
Access the SSA forms here
https://www.ssa.gov/OACT/babynames/index.html
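If you'd rather poke at the raw files than the web forms, here is a minimal Python sketch. It assumes the national download unpacks to one text file per year (for example `yob2021.txt`) made of headerless `name,sex,count` rows; check the readme that ships with the download before relying on that. The helper names `parse_yob` and `top_names` are made up for this example, and the sample rows are invented, not real SSA counts.

```python
import csv
import io


def parse_yob(text):
    """Parse headerless name,sex,count rows into (name, sex, count) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, sex, count = row
        rows.append((name, sex, int(count)))
    return rows


def top_names(rows, sex, n=3):
    """Return the n most common names for one sex, highest count first."""
    matching = [(name, count) for name, s, count in rows if s == sex]
    matching.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in matching[:n]]


# Invented sample rows in the assumed format (not real SSA counts).
sample = "Ava,F,30\nMia,F,45\nNoah,M,50\nZoe,F,12\n"
rows = parse_yob(sample)
top_girls = top_names(rows, "F", n=2)

# With a real file you would read it from disk instead, e.g.:
# with open("yob2021.txt", encoding="utf-8") as f:
#     rows = parse_yob(f.read())
```

Keeping the parser separate from the ranking step means you can decide for yourself how to group spellings before you rank anything, and you can write that decision down.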
Author’s note: I'm currently doing this without an editor as I'm new to this! Usually I just shitpost. So be kind, and if you find a mistake please let me know.
As this is so early I'll also take feedback for things like aesthetics, style, and content. How often would you like posts to come out? How much would you all pay for something like this? What would you like me to cover? I'm working on a VAERS post and have a few other ideas but suggestions are welcome.
Also, this is the official backup for @baddatatakes on Twitter. If you see another Bad Data Takes in the wild and I don't mention it here, then don’t believe it's me until I tell you otherwise. I've tried to secure the name on other platforms, but there are too many to count right now.