Monthly Archives: April 2016

How to discover where millennials are willing to move.

There are a multitude of articles regarding where and why millennials are moving. Some of the articles conflict with each other. I’m going to demonstrate a way (from the proverbial horse’s mouth) to find where the millennials are moving to because they actually want to move there. This analysis is not light reading, but is for people who really want to know what’s in the data, and are willing to put forth some effort to understand it.

There are 5 primary items to download: the 2010-2011 data and its explanation, the 2012-2014 data and its explanation, and an explanation of why there are two tabulations and how to use and match them.

We Americans aren’t limited to ‘published studies’ or prepared census bureau tabulations to find information from the census. The census bureau releases actual census response records that have some modifications to protect privacy, and we can tabulate this data any way that suits our needs. This type of data is called PUMS (Public Use Microdata Sample). “Public Use” because it’s been sanitized to protect privacy. “Microdata” because it contains the actual responses on a household-by-household and person-by-person basis. “Sample” because it comes from either 1% or 5% of the population, and has been weighted to reflect the entire population (the ACS, or American Community Survey, is not taken from the entire population). Armed with PUMS, we can do our own study, to suit our own needs. In this investigation, we are using the 2010-2014 5% 5-year file, which is the newest and largest file that the public can access.

In this case we want to find out where the millennials are moving to because they want to live there, which turns out to have a very different answer than if we simply ask where they are going. The answer will enable communities to ask themselves the following question: “What do those places have that we don’t?”

First, we need to understand a little about PUMS. Every person and household in the USA exists in a particular geographical area called a PUMA (Public Use Microdata Area). This is a geographical area that the census bureau creates, which gives us limited information on where a person lives. PUMAs are designed to have at least 100,000 people. Wyoming, with the smallest state population in the USA, is divided into 5 PUMAs. In contrast, New York City is divided into 55 PUMAs. These divisions allow us to analyze fairly small parts of large cities.

To begin answering this question, I ran a simple tabulation adding up all people 18-34 years old who had moved within the previous year, and found out what PUMAs they had moved to. I also averaged their incomes. The favorite spot was College Station, Texas, and the millennials who migrated there were earning very little. College Station is the home of Texas A&M University, and these people weren’t really moving there at all – they were college students.
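A tabulation of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual tabulation software: the field names (ST, PUMA, AGEP, MIG, PWGTP, PINCP) are taken from the ACS PUMS data dictionary as I understand it, the records are invented, movers are assumed to be MIG codes 2 and 3, and the income average is weighted by PWGTP, which may not be how the averages in the real tabulations were computed.

```python
from collections import defaultdict

# Invented person records. Field names are assumed from the ACS PUMS data
# dictionary: ST/PUMA = where the person lives now, AGEP = age,
# MIG = mobility status (assumed: 1 = same house a year ago, 2 or 3 = moved),
# PWGTP = person weight, PINCP = total personal income.
persons = [
    {"ST": 48, "PUMA": 100, "AGEP": 19, "MIG": 3, "PWGTP": 20, "PINCP": 4000},
    {"ST": 48, "PUMA": 100, "AGEP": 21, "MIG": 3, "PWGTP": 30, "PINCP": 6000},
    {"ST": 53, "PUMA": 200, "AGEP": 29, "MIG": 3, "PWGTP": 10, "PINCP": 70000},
    {"ST": 53, "PUMA": 200, "AGEP": 45, "MIG": 3, "PWGTP": 15, "PINCP": 90000},  # too old
    {"ST": 53, "PUMA": 200, "AGEP": 30, "MIG": 1, "PWGTP": 25, "PINCP": 50000},  # did not move
]


def tabulate_movers(records, min_age=18, max_age=34):
    """Return ((ST, PUMA), weighted movers, weighted mean income), most movers first."""
    weight_sum = defaultdict(int)
    income_sum = defaultdict(int)
    for r in records:
        if min_age <= r["AGEP"] <= max_age and r["MIG"] in (2, 3):
            key = (r["ST"], r["PUMA"])
            weight_sum[key] += r["PWGTP"]
            income_sum[key] += r["PWGTP"] * r["PINCP"]
    rows = [(key, w, income_sum[key] / w) for key, w in weight_sum.items() if w > 0]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = tabulate_movers(persons)
for key, movers, mean_income in ranking:
    print(key, movers, round(mean_income))
```

In this made-up data the top-ranked PUMA has many movers with very low incomes, which is the same kind of signal that pointed to college students in the real tabulation.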

First modification: Remove all people who are enrolled in school (keep only SCH = 1)

I’m going to apply a series of filters to remove populations that cloud my answer, and after each filter (or other modification) is applied I will rerun the tabulation to find the new set of PUMAs with the most millennial migration, then apply more filters if necessary, and so on. From what I have seen in the online articles, they don’t filter the data, although better articles do mention that college students skew the data. This is likely because the authors started with tabulations, and thus had no ability to filter the data.  Trying to remove people after receiving a tabulation is like trying to separate bread dough into pure flour, milk, eggs, and sugar. It can’t be done. Because I am using microdata instead of a pre-summarized tabulation, I can do it by removing them before they are mixed in.

Second modification: Remove all households that have 1 or more active military people (MIL = 1)

After removing the college students and rerunning the tabulation, a number of high-ranking PUMAs turned out to contain military bases. To a large degree, military people and their families don’t move there by choice, but are assigned, so they have been filtered out.
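The first two filters can be sketched as follows. This is a hedged illustration with invented records, not the actual filtering code: it assumes SERIALNO ties each person to a household, that SCH = 1 means not enrolled, and that MIL = 1 means currently on active duty. The function name is hypothetical.

```python
# Invented person records; SERIALNO identifies the household.
persons = [
    {"SERIALNO": "A", "SCH": 1, "MIL": 4},
    {"SERIALNO": "A", "SCH": 2, "MIL": 4},  # enrolled in school, dropped
    {"SERIALNO": "B", "SCH": 1, "MIL": 1},  # on active duty, dropped
    {"SERIALNO": "B", "SCH": 1, "MIL": 4},  # lives with an active-duty member, dropped
    {"SERIALNO": "C", "SCH": 1, "MIL": 2},
]


def drop_students_and_military_households(records, military_codes=(1,)):
    """Keep non-students whose household has no member with a MIL code in military_codes."""
    # First pass: collect every household that contains an active-duty member.
    military_households = {
        r["SERIALNO"] for r in records if r["MIL"] in military_codes
    }
    # Second pass: filter person by person, before anything is summed.
    return [
        r for r in records
        if r["SCH"] == 1 and r["SERIALNO"] not in military_households
    ]


kept = drop_students_and_military_households(persons)
print([r["SERIALNO"] for r in kept])
```

The military filter works on whole households, so it needs one pass to find the affected households before any person can be kept or dropped. That is only possible with person-level records.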

Third modification: Put all PUMAs on equal footing regarding their population by creating the Scale00 and Scale10 variables to be used as sort keys in the tabulation.

While all PUMAs are designed to have at least 100,000 people, some large PUMAs have more than 4 times the population of small ones. This skews the data because more people are likely to migrate into (or inside of) a greater population. To correct this, I created a multiplier for each PUMA. For example, a PUMA that has ½ the residents of the largest PUMA would have its migrating millennial count doubled and placed in a field called Scale00 or Scale10. This allows me to rank migration proportionally for PUMAs, thus allowing a smaller, but more attractive PUMA to rank higher than a larger but less attractive PUMA without changing the actual person counts.
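As a rough sketch of that multiplier (the populations, mover counts, and field names below are invented, and the real Scale00 and Scale10 variables may be derived differently), each PUMA's mover count is multiplied by the ratio of the largest PUMA's population to its own:

```python
def scaled_counts(pumas):
    """Add a 'scaled' field: movers * (largest population / this PUMA's population)."""
    largest = max(p["population"] for p in pumas)
    return [dict(p, scaled=p["movers"] * largest / p["population"]) for p in pumas]


# Invented figures: the small PUMA has half the residents of the large one.
pumas = [
    {"name": "large PUMA", "population": 400_000, "movers": 5_000},
    {"name": "small PUMA", "population": 200_000, "movers": 3_000},
]

# Sort on the scaled value only; the raw 'movers' count is left untouched.
ranked = sorted(scaled_counts(pumas), key=lambda p: p["scaled"], reverse=True)
for p in ranked:
    print(p["name"], p["movers"], p["scaled"])
```

The small PUMA's 3,000 movers are doubled to a scaled value of 6,000, so it ranks above the large PUMA with 5,000, while the person counts themselves are reported unchanged.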

Fourth modification: Filter out people who moved, but stayed within their own state. (A new variable was created to identify and filter these people).

The current purpose is to find places that millennials want to move to. In the census, if someone moves to a different apartment in their same building, they are classified as a migrant, but we don’t want to misinterpret such a situation as someone who moved to that area when they were already there. More commonly, if someone lives in the suburbs of Chicago, and moves to Chicago to work, they are not helping me answer my question. They weren’t willing to really uproot themselves and move; they are still near family, friends, and familiar surroundings. In order to find those places that millennials really want to move to, I’m only counting them if they cross state lines. Also, if I don’t do this, and an area falls on hard times, and the millennials move back to their parents’ homes in the same area, it will appear that there is a lot of millennial migration into that area. This modification prevents that situation from misrepresenting what is really happening.

However, keep in mind that this approach has some pitfalls that need to be considered when looking at the data. For example, DC is a city surrounded by state lines. It’s easy for migrants to cross the line without going far, and consequently DC PUMAs rank very high on the list. There are two important points here:

  • Any approach can give certain areas apparent advantages over others, making it important to understand exactly how the numbers were derived.
  • If you really want to address a particular question, a tabulation that someone else has done (unless it was at your direction) probably isn’t going to be sufficient.
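The cross-state filter from the fourth modification might look something like the sketch below. The details of the variable actually created for this are not given here, so everything in the sketch is an assumption: ST is taken to be the current state code, MIGSP is a hypothetical stand-in for the state of residence one year ago (empty for non-movers), and the records are invented.

```python
# Invented records. ST = current state code; MIGSP = state code one year ago,
# or None for people who did not move (a hypothetical simplification).
persons = [
    {"ST": 29, "MIGSP": 44},    # moved between two different states, kept
    {"ST": 17, "MIGSP": 17},    # moved within the same state, dropped
    {"ST": 11, "MIGSP": 24},    # short hop across a state line into DC, still kept
    {"ST": 17, "MIGSP": None},  # did not move, dropped
]


def crossed_state_line(record):
    """True when the person moved and the origin code differs from the current state."""
    # Note: a mover from abroad would also pass this test unless excluded separately.
    return record["MIGSP"] is not None and record["MIGSP"] != record["ST"]


interstate_movers = [r for r in persons if crossed_state_line(r)]
print(len(interstate_movers))
```

The third record shows the DC pitfall in miniature: a move of a few miles counts exactly the same as a move across the country.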

Fifth modification: Filter out people who are actively training for National Guard and Reserves

After the first 4 modifications there were a couple of PUMAs that appeared in the high rankings with relatively low salaries that I couldn’t explain at first, so I pulled some appropriate records (from PUMA 1400 in Missouri) out of the census and tried to ascertain why these people moved there. The data is here in an easy-to-view spreadsheet for those of you who want to look at actual PUMS records. To decipher the encodings, you will also need this data dictionary from the census bureau.

I picked serial number 2013000599829, at random, to examine in the file. This is a 20-year-old male who is a federal employee (COW=5). He migrated to Missouri from southern Rhode Island (MIGSP12=44, MIGPUMA10=400). He is living in group quarters (his household record has a 0 for WGTP, which only occurs in the case of group quarters). He buys his own insurance (HINS2=1) but also has TRICARE or military health care (HINS5=1). He is on active duty for training in the Reserves or National Guard (MIL=3). He must have ‘moved’ to Missouri temporarily for training and he is staying in a barracks. Examining the rest of the ‘migrating’ millennials that are in this spreadsheet reveals that most of them are in training for the Reserves or National Guard. They didn’t choose to move to Missouri, and they aren’t staying, so I changed my definition of active military (which is being filtered already) to also include MIL=3.
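Rather than reading records one at a time, the same check can be run over all the movers in a suspicious PUMA by tallying their weights by MIL code. The sketch below uses invented records and a hypothetical helper name; it is not the procedure that was actually followed, only one way to confirm the pattern.

```python
from collections import Counter

# Invented mover records for a single PUMA; only MIL and PWGTP matter here.
movers = [
    {"MIL": 3, "PWGTP": 12},
    {"MIL": 3, "PWGTP": 8},
    {"MIL": 4, "PWGTP": 5},
    {"MIL": 3, "PWGTP": 15},
]


def weighted_shares(records, field):
    """Return each code's share of the total person weight for the given field."""
    totals = Counter()
    for r in records:
        totals[r[field]] += r["PWGTP"]
    grand_total = sum(totals.values())
    return {code: weight / grand_total for code, weight in totals.items()}


shares = weighted_shares(movers, "MIL")
print(shares)
```

A share this lopsided toward MIL=3 is the kind of result that would justify widening the military filter.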

After the 5th modification I reran the tabulation, and it looks much better (meaning that the migrants have pretty high average incomes, which I say is ‘better’ because people don’t generally choose to move to another state to get a low income). There does seem to be an oddity in PUMA 49003 in Utah: a lot of migration to that PUMA is reported, with low pay. The census bureau documentation tells me that 49003 in Utah is Provo. Wikipedia tells me that Provo is home to the world’s largest LDS (Mormon) missionary training center. The center houses up to 3,800 missionaries at a time for 3-12 weeks of training. Those who attend there are not ‘enrolled in school’ according to the SCH data item. I left the Provo PUMA in the tabulation, but I recognize that millennials aren’t actually moving there at the inferred rate. At this point the filtering and modifications are sufficient for my purposes, and it’s time to upload and explain the tabulations.

There is a lot of information in the tabulations. I’ll just make a few comments, and let the readers make their own observations. If you have questions, please ask (preferably where others can see your question and my answer, rather than email). The top 10 lists in each tabulation have pretty much the same areas, albeit in different orders, and with different PUMA names. Only PA PUMA 4109/3209 stayed in the same position. The PUMA names were changed (many names were changed from PUMA00 to PUMA10), but they both apply to the same area, which is center city Philadelphia. PUMA 105 in DC ranked first in one tab and 2nd in the other. Note that in PMillMovers.10.xls, DC had more Pwgtp than Seattle (Washington PUMA 11603), but Seattle ranked higher. That’s one example where PUMA size became an issue, and why the Scale10 variable was created – the Seattle PUMA is more attractive, but the DC PUMA is larger.

The PL_Wgtp column (weighted household count at the person level) shows how many households the people counted in Pwgtp (weighted person count) live in. For example, on line 11 of PMillMovers.10d.xls, 3399 millennial movers moved into 2703 households, for an average of 1.26 millennial movers per household. On line 12 of the same tabulation there are 1.63 millennial movers per household, indicating that millennials moving to PUMA 103 in DC are more likely to take another millennial mover with them.
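The per-household figure is simply the weighted person count divided by the weighted household count. A tiny sketch, using the line 11 numbers quoted above (the function name is hypothetical):

```python
def movers_per_household(pwgtp, pl_wgtp):
    """Average millennial movers per household: weighted persons / weighted households."""
    return pwgtp / pl_wgtp


# Line 11 of PMillMovers.10d.xls as quoted in the text: 3399 movers, 2703 households.
line_11 = round(movers_per_household(3399, 2703), 2)
print(line_11)
```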

I would be happy to answer any questions that anyone might have about how these tabulations were created, or anything that might appear confusing in them.

The Third Level of MAST Users

I had mentioned earlier that there are three levels of MAST users. The first and simplest level is to drop data items (and geographies, to limit the area tabulated) into the shopping cart. It’s easy to use, pretty powerful, but doesn’t get anywhere close to my claim to be able to answer (almost) any question you can think of that can be derived from the data. While you can do a great deal with it, you are limited to the variables that are either created by the Census Bureau, or the few that I have created in anticipation of people needing them (e.g. pincp_vwa). You can do a lot at the first level, but if you need a different variable, you’ll have to move up.

The second level is script-writing. In script-writing you can create variables that you might need that I could have never possibly anticipated. It is extremely powerful. Unfortunately I don’t currently have any error-checking/debugging available to the user, so I have to admit that when used via the website it doesn’t work very well. I still have it as an option on the site, but I’m not encouraging people to use it.  (The way it works is the user writes the script and sends it, then I debug the script [assuming there are errors] and return the debugged script to them, which they resubmit.  I end up getting $25 to debug their script, which isn’t good for me or for them.)

The third level is the one that takes us to the point where (almost) any question imaginable regarding the data can be answered. At the third level, which is actually the easiest level of all to use, the user simply tells me what they want. I then create the necessary variables, some via the scripting language, some otherwise, I run the tabulation, and send it to them. At all levels, the tabulations are somewhat dynamic, meaning that after the first tabulation the user learns something about MAST’s capabilities and the data. I learn more about the user’s needs and the data. We then decide to do another tabulation. After a few iterations of this, the user then has what they want.