PUMAs: What they are, and why I use them so often

PUMAs are used to protect the privacy of the people who live in the U.S., while still allowing their data to be analyzed in a multitude of ways.

Let me explain.

A PUMA is a geographical area (defined by the Census Bureau) that is smaller than a state. PUMA stands for “Public Use Microdata Area”, and they are important for a couple of reasons.

When the Census Bureau distributes their data to people like me (who tabulate it themselves, instead of using a pre-made Census Bureau tabulation), they give me the actual information from the person who answered the Census questionnaire. In order to do that, while still protecting the privacy of the respondent, it must be ‘sanitized’ before they hand it out.

One of the things they do to sanitize the data is to ‘blur’ the geographical area that the person lives in, by grouping them with at least 100,000 of their neighbors. If they did not sanitize the data, but kept it at (for example) the block group level, the public could look at a person in the data, and by knowing a few things about them (e.g. their block group, number of people in the house, number of cars, number of bedrooms in the house, ages of occupants), they could identify the people. Once they were identified, they could use the PUMS file to find out what their mortgage payments were, their earnings, their investment incomes ….

To prevent that, while still allowing us to analyze data, the Census Bureau invented the PUMA.

The PUMA is the smallest geographical area in the PUMS file, and (along with other sanitizing features) makes it impossible to use the PUMS file to violate a person’s privacy. Using PUMAs allows us to analyze areas smaller than states.

In the case of larger cities, we can analyze fractions of the cities. For example, NYC is broken up into 55 PUMAs, so by grouping them all together, I can analyze NY City separately from NY State. Or I can analyze the 5 Burroughs individually. Or I can regroup the PUMAs in any way that is most appropriate for the analysis.

In the case of the city that I live near (Altoona, PA), the city is too small to have even a single PUMA to itself. Instead, two counties have to be added together to make a single PUMA, and Altoona is a portion of them. The end result is that I can split NYC into 55 different areas and regroup them any way desirable, but I can’t analyze Altoona at all, unless I lump Blair and Huntingdon counties in with it.

In short, PUMAs allow a very good balance that protects privacy, while allowing analysis of fairly small geographical areas.

U.S. Census Bureau page on PUMAs
U.S. Census Bureau page on PUMS file

Details Behind Quora.com Answers

I often use the American Community Survey (ACS) in conjuction with MAST to answer questions on the question-and-answer site Quora.com. Here is some additional information about how the tabulations (and resultant charts) that I use are created.

The tabulations are derived from the 2014 ACS (American Community Survey) PUMS (Public Use Microdata Sample) file from the US Census Bureau. Tabulations are created by MAST (Multi-dimensional Analytical and Simulation Tool). Sometimes the free Android app “Gateway To MAST” is used, which allows anyone with an android device or kindle to create fairly simple MAST tabulations. After the original tabulations are produced, the spreadsheets are often manipulated by sorting, rearranging, calculating and charting in order to create a useful presentation.

Many tabulations from MAST have two sections:
1. An upper, household level section, that gives information about the households.
2. A lower, person level section that gives information about the people who live in the households.

Meanings of columns in the spreadsheet:

Wgtp: Weighted household count. Because the ACS is a sample, each household is weighted to indicate how many households in the US are represented by that particular household.

Pwgtp: Weighted person count. Because the ACS is a sample, each person is weighted to indicate how many people in the US are represented by that particular person.

LoVal/HiVal: Because the ACS is a weighted sample, the numbers are not expected to be 100% accurate. The Census Bureau has given us a facility to indicate how accurate the numbers are likely to be, and there is a 90% confidence that the actual number (for either Wgtp or Pwgtp) lies between the HiVal and LoVal. If LoVal is negative, that just means that the margin of error is larger than the count, which is common when the counts are small. It should be considered to be zero in these cases.

PL_Wgtp: Weighted household count at the person level. This tells you how many households the people (Pwgtp) on a particular line live in.

If you would like to try a scaled-down version of MAST for free, you are welcome to do so. (Scaled-down means up to 8 dimensions, 8 volumes, 2 levels, and the resulting tabulation must be no more than 300,000 characters for automated delivery. It can be larger than 300,000 characters, but I’ll have to manually email it when I get a chance to do so.)

  • To get the Android version: Gateway To MAST
  • To get the Kindle version: Gateway To MAST
  • For either version, watch the tutorial on YouTube.
  • Gateway To MAST (for ACS PUMS) is now on the Amazon Appstore

    This morning Gateway To MAST went live on the Amazon Appstore. Gateway To MAST allows the user to create custom tabulations of ACS PUMS data. It links the households to the individuals so you can mix household and person level dimensions and volumes in your tabulation. It has the following features:

  • Up to 8 volumes (can be a mix of Household and Person level)
  • Up to 8 dimensions (can be a mix of Household and Person level)
  • Unlimited geographies can be specified (down to the PUMA level)
  • Can recode volumes to make them into dimensions
  • Creates 1 or 2 level tabulations (Person level, Household level, or both)
  • Both the app and the tabulations are free
  • Gateway To MAST is of course still available in the Google Play Store for those with Android tablets and phones, but now Kindle Fire users can access it also. At this time only the 2014 PUMS is available, but if you want more than a couple of dimensions, this is still probably your best option. I’ll post a notice when more current data is implemented.

    Bible Annotator: Free Trial and Amazon News

    There are now 4 different varieties of Bible Annotator.
    1. Full Version on the Google Play Store
    2. Free 30 Day Trial on Google Play Store
    3. Full Version on Amazon Appstore
    4. Free 30 Day Trial on Amazon Appstore

    The difference between the “Full” and “30 Day Free Trial” is that the 30 day trial does not allow data import. Other than that, it does everything that the full version does for 30 days. After the 30 days is up, it still works as an offline KJV Bible reader, with all your annotations intact and usable. However, after the 30 days is up, you cannot make any new annotations. If you want to continue making annotations at that point, all you need to do is buy the full version, export your data from the trial version, and import it into the full version (that way you will have all your annotations from the trial period stored in your full version).

    The difference between the Google Play and Amazon versions is one penny for the full version (no difference for the free trial). I wanted to charge $3.00 for both versions. Amazon was OK with that, so their version is $3.00. Google play won’t let me charge $3.00 – when I set the price to $3.00, they change it to $2.99. Hence the penny difference.

    To get a pretty good idea what Bible Annotator does (and how to use it), watch this video. After the video was made, I also added data export/import capabilities, as described elsewhere in this blog.

    Bible Annotator 2.0 Released Today!

    One of the reasons that I wrote Bible Annotator was to prevent loss of notes. When I used to write notes in a paper Bible, after a few years the Bible would fall apart. I could either send it to a book binder for repair (which was expensive and time consuming without my Bible, and it would still fall apart again later), or start making notes all over again, or laboriously copy countless notes from the old Bible to a new one. None of these options were very attractive to me.

    So I wrote Bible Annotator in order to preserve them electronically, along with cross-referencing and linking them in ways that are impossible with paper. But eventually my phone will die, and I didn’t want my notes to be lost. Also there’s the possibility that I might want to switch to a different platform at some point. The obvious solution was to have a backup outside of the phone, a backup that could be read by a variety of software products.

    Bible Annotator 2.0 allows just that. It does all that 1.0 and 1.1 did, plus you can back up your data (notes, links, tags, and stars) to an outside area such as Google Drive or a removable SD card. It writes all of your information from the database into a simple .txt file that can be read with any word processor. Caveat: because of the ‘flat’ nature of a text file, as opposed to the interlinked nature of a database, humans would have difficulty reading it. I chose this format because it is as basic as data storage gets, and it should still be valid many years into the future. Were it stored as a database file, the database program would have to still be available to read it.

    One other thing that 2.0 does is that it allows all passages within a chapter to have their timestamp updated from a single option in the overflow menu. Prior to this, if you wanted all passages in a chapter that you were reading to show up in “Recent Selections”, you had to update each one individually. Now the entire chapter can be updated with two clicks.

    OneGuyOnTheInternet is expanding: Bible Annotator has been published

    OneGuyOnTheInternet.com is expanding today! This site has always been about easy access to ACS data, and it still is, but starting today the site will also be used for support of Bible Annotator.

    Bible Annotator allows you to deeply explore and understand the Bible by creating notes on passages, ranking passages (0-5 stars), and linking and tagging passages.

    The primary function is to keep your recent Bible readings fresh in your mind by bringing up your most recent annotations on demand. This allows your to use your phone (or other android device) in a constructive manner when you have a spare moment. Open Bible Annotator, and the most recently made notes (and relevant verses) will pop up in order of time. After reading a few, you can click on the current one that you are reading, and next time you open it, it will start at that point.

    You can also link passages in the Bible, connecting (e.g.) prophecies with fulfillments, or events in one gospel with the same events in another gospel. And, when there is a topic that you are particularly interested in, you can create a tag, and tag all passages that relate to that topic, and view the tag and all connected passages when you desire. When you want to put a passage in context (when viewing it from Recent Passages, Links, or Tags) , simply select a verse in the passage, and then click the ‘go to’ prompt, and you will be presented with the entire chapter that contains that passage.

    Working with the Bible this way preserves its original state as a linear document, but also allows it to be viewed as an interconnected web, with passages connecting to other passages in different books and chapters. In creating the connections, you will learn the Bible much better than before, and in using the connections as you view the notes and connections that you have made, you will gain a yet deeper understanding of it.

    Those of you who know me can contact me and get a promo code for a free download, otherwise it is $3.

    There is a 4 minute intro video here, and for those who are still interested after the intro, there is a 14 minute tutorial here.

    Gateway To MAST is temporarily out of service

    My apologies to anyone who has tried to run a tabulation in the last two days and didn’t get it. The hosting company that manages one of the servers that I use made some unexpected changes, and the communication link is down. I’m aware of it and in the process of making a correction. When the correction is done, you’ll need a new version of the app, which will use a slightly different type of communication.

    When the new version is available, and communication is restored, I will write a post to indicate so.

    HELP! I just ran a 6 dimensional tabulation with “Gateway To MAST” and my boss told me to prove that the numbers are correct!!

    Ok, you can’t find any other site that can duplicate or support the numbers that you just generated. Don’t panic. There’s actually a pretty easy way to verify the numbers.

    Pick two of the dimensions. Go to DataFerrett (dataferrett.census.gov). This is the Census Bureau’s online custom tabulation tool. If you don’t know how to use it, go to youtube. There are several tutorials on it. Take the two dimensions that you picked, and use them to create a custom tabulation in Data Ferrett. Make sure you are using the 2014 ACS PUMS data.

    Now take the 6 dimensional tabulation that you got from Gateway To MAST, and use the ‘pivot tables’ feature of your spreadsheet program to collapse out all but two of the dimensions. This will add the volumes together across the dimensions (except for person level household counts, LoVal and HiVal – as a general rule, they cannot be added together).

    The numbers from the collapsed tabulation should be identical to the numbers from DataFerrett. You’ve just verified two of the dimensions.

    Repeat for two other dimensions, and verify those, and so on.

    Android App Released Today: Gateway To MAST

    Today is the full release of “Gateway To MAST” on the Google Play Store. Gateway is an interface to MAST that allows anyone with an android device to define a custom tabulation. Their query is then sent to a MAST server which creates the tabulation, which is subsequently emailed to the user. Gateway has been in beta test mode in the Google Play Store for about 1.5 weeks, and the beta test has gone very well, so today I moved it from beta test to full release. In beta test, users would typically get their tabulation in about two minutes. The first tabulation that a user does requires email verification, so that first tabulation requires an extra step.

    The interface that I used to have here – where you could drop data items into a shopping cart and get a tabulation – was a first step in the right direction, but only a step. I needed a better user interface for MAST, one that can grow to allow user defined dimensions and multiple years of census data. Although Gateway is currently in its infancy, it allows the user to choose up to 8 volumes and 8 dimensions at two levels (Household and Person) simultaneously. It also allows choices of geographies. This video provides an introduction to Gateway. This is version 1.02, so while it doesn’t do much in the way of user-defined dimensions yet, I know of no other point-and-click product that allows the kind of insight into data that this early version of Gateway does.

    The video provides an introduction in using Gateway to create a pretty simple single-level tabulation, but it didn’t cover banding (which Gateway supports), so I’ll describe that here. Very often there will be a continuous variable (such as age or household income) that you want to use as a dimension. But only discrete variables can be used as dimensions, so you need to convert the continuous variable into a discrete variable. For example, you might want age bands of 0-5, 6-10, 11+. In that case, when you are choosing dimensions, and you see the “MUST BAND” button, click it, and it will take you to the band editor. In the band editor, click the flashing cursor to make the keyboard appear. Then enter the lowest number of your lowest band (for age, it would probably be zero, for some incomes, it can be negative). Then hit the ‘Done’ button on the keyboard. Next enter the lowest number of the next band. Gateway will calculate the upper levels of the bands, so you only enter the lower levels. After you enter your final lowest level, hit the ‘Done’ button, then (and only then) hit the “BANDING COMPLETE” button. If you want to review the bands that you have created, go to “Review And Submit Your Query” by either swiping from left to right, or clicking the 3-line ‘hamburger’ in the upper left hand corner. That’s all there is to banding. If you need to redo your banding, just go back to the dimension chooser and click the MUST BAND button again.

    Creating 8 dimensional tabulations where each dimension can have potentially hundreds of values (or thousands in the case of PUMAs) can create some size issues. There is some internal processing in my system that is very slow, and because of that I’m only allowing tabulations that are 300,000 bytes or less go through the automated processing (bigger tabulations get created, they just don’t get delivered automatically). If you create a tabulation that is greater than 300,000 bytes, you will get an email that tells you how big it was, and suggests that you use the RESTRICT button (in the dimension choosers) to reduce the size of your tabulation without losing any dimensions. For example, there are 480 different occupations in the Occp data item. If you are only interested in one of them, but you choose Occp without restricting it to the one you wanted, your tabulation will be 240X the size that it really needs to be. If you restrict it to the one you want, you will get two categories for Occp: the one you want and “other”. If you have good reason to want a tabulation that is larger than 300,000 bytes, let me know and I’ll try to bypass the automatic processing and deliver it to you manually.

    How to discover where millennials are willing to move.

    There are a multitude of articles regarding where and why millennials are moving. Some of the articles conflict with each other. I’m going to demonstrate a way (from the proverbial horse’s mouth) to find where the millennials are moving to because they actually want to move there. This analysis is not light reading, but is for people who really want to know what’s in the data, and are willing to put forth some effort to understand it.

    There are 5 primary items to download: 2010-2011 data and explanation. 2012-2014 data and explanation. An explanation of why there are two tabulations and how to use and match them.

    We Americans aren’t limited to ‘published studies’ or prepared census bureau tabulations to find information from the census. The census bureau releases actual census response records that have some modifications to protect privacy, and we can tabulate this data any way that suits our needs. This type of data is called PUMS (Public Use Microdata Sample). “Public Use” because it’s been sanitized to protect privacy. “Microdata” because it contains the actual responses on a household-by-household and person-by-person basis. “Sample” because it comes from either 1% or 5% of the population, and has been weighted to reflect the entire population (The ACS, or American Community Survey, is not taken from the entire population). Armed with PUMS, we can do our own study, to suit our own needs. In this investigation, we are using the 2010-2014 5% 5 year file, which is the newest and largest file that the public can access.

    In this case we want to find out where the millennials are moving to because they want to live there, which turns out to have a very different answer than if we simply ask where they are going. The answer will enable communities to ask themselves the following question: “What do those places have that we don’t?”

    First, we need to understand a little about PUMS. Every person and household in the USA exists in a particular geographical area called a PUMA (Public Use Microdata Area). This is a geographical area that the census bureau creates, which gives us limited information on where a person lives. PUMAs are designed to have at least 100,000 people. Wyoming, with the smallest state population in the USA, is divided into 5 PUMAs. In contrast, New York City is divided into 55 PUMAs. These divisions allow us to analyze fairly small parts of large cities.

    To begin answering this question, I ran a simple tabulation adding up all people 18-34 years old who had moved within the previous year, and found out what PUMAs they had moved to. I also averaged their incomes. The favorite spot was College Station, Texas, and the millennials who migrated there were earning very little. College Station is the home of Texas A&M University, and these people weren’t really moving there at all – they were college students.

    First Modification: Remove all people who are enrolled in school (keep only SCH = 1)

    I’m going to apply a series of filters to remove populations that cloud my answer, and after each filter (or other modification) is applied I will rerun the tabulation to find the new set of PUMAs with the most millennial migration, then apply more filters if necessary, and so on. From what I have seen in the online articles, they don’t filter the data, although better articles do mention that college students skew the data. This is likely because the authors started with tabulations, and thus had no ability to filter the data.  Trying to remove people after receiving a tabulation is like trying to separate bread dough into pure flour, milk, eggs, and sugar. It can’t be done. Because I am using microdata instead of a pre-summarized tabulation, I can do it by removing them before they are mixed in.

    Second Modification: Remove all households that have 1 or more active military people (MIL = 1)

    After removing the college students and rerunning the tabulation, there appeared a number of high ranking PUMAs that contained military bases within the PUMAs. To a large degree, military people and their families don’t move there by choice, but are assigned, so they have been filtered out.

    Third modification: Put all PUMAs on equal footing regarding their population by creating the Scale00 and Scale10 variables to be used as sort keys in the tabulation.

    While all PUMAs are designed to have at least 100,000 people, some large PUMAs have more than 4 times the count of small ones. This skews the data because more people are likely to migrate into (or inside of) a greater population. To correct this, I created a multiplier for each PUMA. For example, a PUMA that has ½ the residents of the largest PUMA would have their migrating millennial counts doubled and placed in a field called Scale00 or Scale10.  This allows me to rank migration proportionally for PUMAs, thus allowing a smaller, but more attractive PUMA to rank higher than a larger but less attractive PUMA without changing the actual person counts.

    Fourth modification: Filter out people who moved, but stayed within their own state. (A new variable was created to identify and filter these people).

    The current purpose is to find places that millennials want to move to. In the census, if someone moves to a different apartment in their same building, they are classified as a migrant – but we don’t want to misinterpret such a situation as someone who moved to that area when they were already there. More commonly, if someone lives in the suburbs of Chicago, and moves to Chicago to work, they are not helping me answer my question. They weren’t willing to really uproot themselves and move – they are still near family, friends, and familiar surroundings. In order to find those places that millennials really want to move to, I’m only counting them if they cross state lines. Also, if I don’t do this, and an area falls on hard times, and the millennials move back to their parent’s homes in the same area, it will appear that there is a lot of millennial migration into that area. This modification prevents that situation from misrepresenting what is really happening.

    However, keep in mind that this approach has some pitfalls that need to be considered when looking at the data.  For example, DC is a city surrounded by state lines.  It’s easy for migrants to cross the line without going far, and subsequently DC pumas rank very high on the list.  There are two important points here:

    • Any approach can give certain areas apparent advantages over others, making it important to understand exactly how the numbers were derived.
    • If you really want to address a particular question, a tabulation that someone else has done (unless it was at your direction) probably isn’t going to be sufficient.

    Fifth modification: Filter out people who are actively training for National Guard and Reserves

    After the first 4 modifications there were a couple of PUMAs that appeared in the high rankings with relatively low salaries that I couldn’t explain at first, so I pulled some appropriate records (from PUMA 1400 in Missouri) out of the census and tried to ascertain why these people moved there. The data is here in an easy-to-view spreadsheet for those of you who want to look at actual PUMS records. To decipher the encodings, you will also need this data dictionary from the census bureau.

    I picked serial number 2013000599829, at random, to examine in the file. This is a 20 year old male who is a federal employee (COW=5). He migrated to Missouri from southern Rhode Island (MIGSP12=44, MIGPUMA10=400). He is living in group quarters (his household record has a 0 for WGTP, which only occurs in the case of group quarters). He buys his own insurance (HINS2=1) but also has TRICARE or military health care (HINS5=1). He is on active duty for training in the Reserves or National Guard (MIL=3). He must have ‘moved’ to Missouri temporarily for training and he is staying in a barracks. Examining the rest of the ‘migrating’ millennials that are in this spreadsheet reveals that most of them are in training for the Reserves or National Guard. They didn’t choose to move to Missouri, and they aren’t staying, so I changed my definition of active military (which is being filtered already) to also include MIL=3.

    After the 5th modification I reran the tabulation, and it looks much better (meaning that the migrants have pretty high average incomes, which I say is ‘better’ because people don’t generally choose to move to another state to get a low income). There does seem to be an oddity in PUMA 49003 in Utah – a lot of migration to that PUMA is reported, with low pay. The census bureau documentation tells me that 49003 in Utah is Provo. Wikipedia tells me that Provo is home to the world’s largest LDS (Mormon) missionary training center.  The center houses up to 3,800 missionaries at a time for 3-12 weeks of training. Those who attend there are not ‘enrolled in school’ according to the SCH data item. I left the Provo PUMA in the tabulation, but realize that millennials aren’t actually moving there at the inferred rate. At this point the filtering and modifications are sufficient for my purposes, and it’s time to upload and explain the tabulations.

    There is a lot of information in the tabulations. I’ll just make a few comments, and let the readers make their own observations. If you have questions, please ask (preferably where others can see your question and my answer, rather than email). The top 10 lists in each tabulation have pretty much the same areas, albeit in different orders, and with different PUMA names. Only PA PUMA 4109/3209 stayed in the same position. The PUMA names were changed (many names were changed from PUMA00 to PUMA10), but they both apply to the same area, which is center city Philadelphia. PUMA 105 in DC ranked first in one tab and 2nd in the other. Note that in PMillMovers.10.xls, DC had more Pwgtp than Seattle (Washington PUMA 11603), but Seattle ranked higher. That’s one example where PUMA size became an issue, and why the Scale10 variable was created – the Seattle PUMA is more attractive, but the DC PUMA is larger.

    The PL_Wgtp column, weighted household count at the person level, shows us how many households the Pwgtp (weighted person count) live in. E.g. on PMillMovers.10d.xls, line 11, 3399 millennial movers moved into 2703 households, for an average of 1.26 millennial movers per household. On line 12 of the same tabulation there are 1.63 millennial movers per household, indicating that millennials moving to PUMA 103 in DC are more likely to take another millennial mover with them.

    I would be happy to answer any questions that anyone might have about how these tabulations were created, or anything that might appear confusing in them.

    The Third Level of MAST Users

    I had mentioned earlier that there are three levels of MAST users. The first and simplest level is to drop data items (and geographies, to limit the area tabulated) into the shopping cart. It’s easy to use, pretty powerful, but doesn’t get anywhere close to my claim to be able to answer (almost) any question you can think of that can be derived from the data. While you can do a great deal with it, you are limited to the variables that are either created by the Census Bureau, or the few that I have created in anticipation of people needing them (e.g. pincp_vwa). You can do a lot at the first level, but if you need a different variable, you’ll have to move up.

    The second level is script-writing. In script-writing you can create variables that you might need that I could have never possibly anticipated. It is extremely powerful. Unfortunately I don’t currently have any error-checking/debugging available to the user, so I have to admit that when used via the website it doesn’t work very well. I still have it as an option on the site, but I’m not encouraging people to use it.  (The way it works is the user writes the script and sends it, then I debug the script [assuming there are errors] and return the debugged script to them, which they resubmit.  I end up getting $25 to debug their script, which isn’t good for me or for them.)

    The third level is the one that takes us to the point where (almost) any question imaginable regarding the data can be answered. At the third level, which is actually the easiest level of all to use, the user simply tells me what they want. I then create the necessary variables, some via the scripting language, some otherwise, I run the tabulation, and send it to them. At all levels, the tabulations are somewhat dynamic, meaning that after the first tabulation the user learns something about MAST’s capabilities and the data. I learn more about the user’s needs and the data. We then decide to do another tabulation. After a few iterations of this, the user then has what they want.

    How can MAST answer questions so quickly, when without MAST it’s often very difficult to get an answer?

    You may have noticed that if someone asks MAST a question, they can get their answer right away, at little cost. But if they try to get a custom tabulation in a more traditional manner, they have to pay a lot of money and wait a long time.

    How can that be?  Is someone scamming the public here?

    No, there’s no scam going on.  This is the reality of data-processing.  There are certain questions that are really easy to answer, such as “How many females aged 60+ live in Michigan?”.  You can get answers to that kind of question in pre-tabulated tables that are freely available to anyone who wants to find them.  No problem.  But suppose you want to know how many 5 year old children live in the same household with 7 year old children.  Or perhaps something far, far, more complex than that.  In these cases, special purpose software needs to be developed in order to answer the question.

    To develop that software normally takes a large investment of time from highly paid and skilled data processing folks.  A few thousand dollars really isn’t a bad price for that type of work, but keep in mind that it usually costs more than a few thousand dollars.  I consulted for a telephone company, and data users in the company could go to a couple of different places within the company to get data.  My client (within the company) absolutely loved to turn down data requests (he was too busy), and send people over to the IT department.  The user would then go to the IT department and get an estimate for their special tabulation – usually the estimate would be around 3 months time and $30,000.  They would then come back to my client and plead their case.  He would then give them their answer the next day.  For free.

    How did he do that?  The IT department had to put it’s programmers to work writing special-purpose software to create the special tabulation (when they could fit it into their schedule).  That’s why it took 3 months and cost $30,000.  My client just had to have one person spend a little time running a code-generator.  Special-purpose software is being written in both cases.  The difference is that in the typical IT shop, it is written by a human, whereas when a code generator is used, a computer writes the code.

    The obvious question then is “Why didn’t the IT shop use the code generator?”  Suppose you are a manager in a company, and your competition for raises and promotions are other managers in the same company, and you have a “magic genie” that answers pretty much any question you can think of, quickly and accurately.  How hard would you work to insure that only your shop had access to it?  Do you think you would publicize it and demonstrate it to the people you are competing with for raises (especially when they have as much legal right to use it as you do)?

    If you’ve ever worked in data processing, think about this for a minute: you can spend 5 minutes explaining a need to the IT shop, then it takes them weeks or months to fulfill the need.  Why not develop software that allows you to spend the 5 minutes explaining to the software what you want?  Then the software writes the program.  This is how MAST works.  And this is why I can answer a question in a short time (for little or no cost) that would take much longer when using traditional methods.

    Think of MAST as Segmentation on Steroids

    In the data processing world (especially as it applies to marketing) there is a very popular term known as “Segmentation”.  For some reason, the census community does not seem to have embraced the term, though they use the concept every time they look at particular geographies, genders, races, age groups, … etc.

    The idea behind segmentation is that I want to study a particular segment of the population, such as 25-35 year olds that live in single-family homes in Florida and make between $41,000 and $83,000 per year.  When the telephone company asked me to create a tool that could answer any question they could think of (regarding customer usage of products and services), segmentation was clearly the only approach that would serve their purposes.

    We see a simple form of segmentation (though it isn’t called segmentation) in the Census Bureau’s “Data Ferrett” tool.  A word about Data Ferrett first:  The Census Bureau is not required to provide you with tools to analyze data, but to help people out, they produced Data Ferrett.  They aren’t trying to provide a tool to answer all your questions with Data Ferrett. – in fact they tell you up front that this tool is for people who don’t have access to SAS, clearly implying that SAS would be a better choice if you do have access to it.  I’ve never used SAS myself, but SAS is what the telephone company used 20 years ago to answer their questions before I wrote MAST.  (SAS quickly fell into the background afterwards.)

    The theory behind “answering nearly any question you can produce” via segmentation has been well supported over the years.  It’s used not just in MAST, but is the basis for OLAP tools and data mining.  Try for a moment to think of information that can be derived from existing data, but can’t be answered via segmentation … if you can think of one quickly, you may not fully understand the capabilities of segmentation – or MAST’s abilities to create new dimensions and volumes (from existing information) used in segmentation.  A few months after MAST went live in 1996, I received a call from a very happy telecom manager who emphatically stated: “We haven’t found anything that we can’t do with it yet!”

    Can MAST Really Answer Nearly Every Question (Part II – a bit more about scripts)?

    It is well understood by most, though not all, that when I say that MAST can answer nearly every question, it is limited to the answers that can be possibly be derived from the data. Obviously, MAST cannot derive the next set of lottery numbers or the distance to the moon from census data if the information isn’t in census data to begin with. Similarly it cannot produce mailing lists or analyze on geographies smaller than PUMAs because that information is not in the ACS PUMS file.

    Where MAST excels is in deriving information that is either directly stored, or can be derived from, the data. For example, yesterday a market research analyst needed to know the average household incomes of people who were 18-34 years old, and see this information by each state. Sounds simple, right? You would think that you could just google something that basic!!

    For MAST, that’s a pretty simple tabulation, but without MAST, it’s actually difficult to get. There are a few aspects to the request that make it non-trivial. First, the age breakout is specific, and therefore not something that is likely to be available online in a pre-tabulated form. A much bigger hurdle is that it has a household level component (household income) and a person level component (age). This is a point where most resources (either collections of pre-tabulated data or query tools) will fall flat. Here is the MAST script that can answer this question, followed by an explanation:


    Reference person



    Scripts are written in blocks, and we see 4 pretty simple blocks here. The blocks are contained within BEGIN/END bookends.

    The first block we see here is a VOLUME block. These are the numbers that we want to collect, sliced-and-diced by dimensions. In this case, we want an average household income. That means that we will need to accumulate the total weighted and adjusted volumetric portion of HINCP, which is in Hincp_vwa (details described in a previous post). We are also going to need total household counts (Wgtp) for each line in the tabulation, so that we can divide the total Hincp_vwa by Wgtp to get the required average.

    The next block that we see is contained within BEGIN/END ENTITY CLASSIFICATION bookends. This type of block allows us to create household level information based on aggregations of person level data. This block creates a new household level variable, called RefAge. It categorizes every household based upon the age of the reference person within the household. The analyst only needed the 18-34 band, but I included the others just because it is often useful to see how much data is in the part of the tabulation that you aren’t interested in.

    Next a household level classification called Hst is made, which is just State (at the household level). This is just a direct usage of the census bureau’s “ST” field with no manipulation.

    Finally (because there is no person section to this simple tabulation) there is a block contained within BEGIN/END ENTITY DISPLAY bookends. This tells MAST the household level dimensions that are desirable for the tabulation, in this case, RefAge and Hst.

    Running this script through MAST causes MAST to dimensionalize Pwgt and Hincp_vwa based on State and RefAge. MAST prints this information in a spreadsheet, and the user can then divide Hincp_vwa by Pwgt in the spreadsheet (a simple operation) and have their answer.

    Can MAST Really Answer Nearly Every Question?

    As we close out 2015, I need to revisit a claim that I made when OneGuyOnTheInternet.com first went live. The site went live on 12-8-2015, and I stated that it “will let the user ask just about any question they can think of and get an accurate answer”. Since that time, I’ve introduced a few concepts that are key to using MAST, and they are necessary background if we are to reach the point where the user can genuinely answer nearly any conceivable question.

    These concepts include:

    • dimensionalizing (slicing and dicing) volumes (such as household counts, person counts, expenses, incomes) based on discrete variables (such as geography, age, race, occupation)
    • analyzing households as units, by:
      • performing tabulations at both household and person levels simultaneously
      • carrying all household dimensions to the person level
      • including a weighted household count at the person level
    • allowing a multitude of volumes to be displayed in a single tabulation
    • allowing a multitude of dimensions (not just one, two, or three) to be used in a single tabulation

    While these things can be done today on OneGuyOnTheInternet.com simply by putting data items and geographies in a shopping cart, and these things give us an enormous amount of analytical capability with very little effort, they don’t come close to answering every conceivable question. We have to move a bit higher to fulfill the needs of the users while supporting my claim.

    There are 3 levels of MAST users. Dropping geographies and data items into a shopping cart is the most elementary level – it’s simple and powerful, but it can’t answer (nearly) everything.

    In 2016, we will be moving to the 2nd and 3rd levels, and you will eventually see why I am comfortable making the claim that I make.

    Very Brief Taste of the Second Level: Script-Writing

    The census bureau includes a data item on the household record called “R65”. It categorizes every household based on the number of people in the household that are 65 or over, and the possible values are: none, 1, or 2+. The census folks work with lots of users, and no doubt they have found that there is a great need to be able to categorize households in this manner. But what if I need to categorize the households based on 3 or more 65 year-olds? Or if I need the cutoff to be 63 years? Or what if I need to categorize the households based on the number of people that are 43-63 years old that worked at least 27 weeks last year and made between $27,000 and $46,000? Obviously we can’t expect the census bureau to anticipate that I would need that categorization and prepare it for me! MAST allows you to create categorizations like that on demand, which is another step towards answering nearly any conceivable question.

    Here are the portions of scripts that allow the user to define these kinds of categorizations:

    3 or more 65 year olds:

          BEGIN BUCKET01
          END BUCKET01





    The above tells MAST to count the ‘People’ (unweighted person count) in each household, but only count the people that are 65 or older according to the Agep_v data item. Add those people up for each household, then categorize each household based on how many it found: 0-2 or 3+.


    Presence or absence of people that are 43-63 years old that worked at least 27 weeks last year and made between $27,000 and $46,000:

          BEGIN BUCKET01
                50 to 52 weeks
                48 to 49 weeks

                40 to 47 weeks

                27 to 39 weeks



          END BUCKET01






    The above tells MAST to count the ‘People’ (unweighted person count) in each household, but only count the people that are 43-63 according to the Agep_v data item, AND they have to have worked from 27-52 weeks last year AND they have to have wages of $27,000 to $46,000. Add those people up for each household, then categorize each household based on whether there are any or none of them present.


    In 2016 I intend to produce a series of tutorials and examples that will explain script-writing (this was just a small taste!) and take us much closer to supporting my claim and your needs.

    Until then, Happy New Year!

    John Grumbine


    The PUMS 5 year file (2009-2013) contains all of the USA and Puerto Rico.  But suppose you don’t want to analyze the whole thing?  If you just want the USA, you can include the dimension H_USA_PR, which will split the tabulation into two sections: USA and Puerto Rico.  The tabulation will be twice the size you wanted, but it’s no big deal, you can just ignore the half that you don’t want.

    But what if you want to look at one value of PUMA10, so you dimensionalize on PUMA10 and State (which is necessary, because PUMA numbers are duplicated across states).  Now your tabulation is 2,378 times the size that you wanted.  Making tabulations unnecessarily large causes a few problems.  Today I’m going to show you how to avoid unnecessarily large tabulations.

    The Census Bureau has defined 4,544 different geographical areas, spread over 5 different data items.  There are:

    1. 5 Regions
    2. 10 Divisions
    3. 52 ‘States’ (two of which are DC and PR)
    4. 2099 PUMA00s (PUMAs as defined in 2000)
    5. 2378 PUMA10s (PUMAs as defined in 2010)

    The Region, Division, and State names are identical to the values given in the census bureau documentation, so if you wanted to analyze just Pennsylvania, you can search on “Penn” in the search window, and a product called “Pennsylvania/PA” will appear.  Add it to the cart, and if you don’t add any other geographies, your tabulation will include only Pennsylvania data.  You can add other states, or regions, or divisions, or PUMAs if you want a larger area.  If you were to add Pennsylvania, and 6 PUMAs in Pennsylvania, the PUMA additions will have no effect, because you have already included all the Pennsylvania data.  If instead, you add 6 PUMAs from the neighboring states, you have created a custom geography that includes PA plus some of the surrounding area (if you do, remember to consider both PUMA00 and PUMA10).

    The PUMA products are labeled in this format:

    PUMA#0 SS #####

    where PUMA#0 is either PUMA00 or PUMA10.  SS is the state designation, such as AL, AK, MD, etc.  ##### is the 5 digit PUMA assigned by the census bureau, such as 00100.

    So if you wanted PUMA10 100 from New York, you would search on:

    PUMA10 NY 00100

    and add it to the cart.

    Geographies can be added to the cart at any time during shopping.  Like the dimensions and volumes, the order of addition to the cart has no bearing on your tabulation.




    The easiest way to use this site

    If you didn’t see the video, this site might be a little confusing.  It’s only been live for a couple of days, as time goes on I’m sure I’ll explain things more clearly.

    All you need to do is pick the data items that you are interested in, then add them to the shopping cart, then checkout.  Your custom tabulation will be emailed to you.  Seems easy, right?

    The (minor) difficulty is that there are nearly 500 different data items that you can use (492 at this moment) to analyze the PUMS data, so paging through product lists is cumbersome.  That’s a good problem to have – there are a lot of choices which add up to making it easier to get what you really need, not what someone else wants to give you.

    Here’s how to do it:  Get this data dictionary from the census bureau.  Look through it until you find the data items that you want.  In the upper right hand corner of this site, use the ‘search’ function to find the data items that you want, and add them to the cart (this is way easier than paging through product lists!).

    Remember these things that aren’t in the data dictionary:

    PWGTP and WGTP are weighted person and household counts that will give you replicates (high and low values)

    PWGTPNR and WGTPNR are just the weighted counts – no high and low values (NR means no replicates)

    If you’re not using a State dimension and you want to separate the Puerto Rico data, use H_USA_PR as a dimension (it’s free).  If you are using a State dimension, Puerto Rico will be in it’s own category anyway.

    The _D, _V, _VA, _VW, and _VWA data items are explained in this post.

    “PEOPLE” is an unweighted person count.

    “UW_HDRS” is an unweighted headers (either households or group quarters) count.

    “PL_WGTP” is a weighted household count at the person level.

    “HSERIALNO_YR” is a dimension that will give you the year (2009-2013).

    There are a few data items that are household level items, but the census bureau has placed them on both the household and person records, and given them both the same name.  This is a good strategy if you have to analyze people and households separately, but if you can analyze both levels simultaneously, it creates a naming conflict.  The ones to be aware of are ST, PUMA00, and PUMA10.  I have included them on both records for completeness, though I can’t see any valid reason to use the person level items when you can use the household level ones (which will show in both the household and person levels in the tabulation).  They had to be renamed to eliminate the conflict, and are now called HST, PST, HPUMA00, PPUMA00, HPUMA10, PPUMA10.


    What is MAST?

    MAST is an acronym for “Multi-dimensional Analytical and Simulation Tool”.

    In the middle 1990’s I was working with one of the major telephone companies and they found that it was extremely difficult to analyze their data – it often took months for them to answer what seemed like a simple question, because programmers had to develop special purpose code (which can take a LONG time) to answer seemingly basic questions.

    They asked me if I could build a product that would answer the questions.  I agreed to do so at very low cost, under the condition that I would own the product.  That is how MAST came into existence and why I have it.  Given that telephone data consists of accounts with many phone calls associated with them, and census data consists of households with many people associated with them, and that all of them (accounts, phone calls, households people) have both volumetric and dimensional data, MAST is a very powerful tool for analyzing census data.

    What are the _v, _d, _vw, _va, and _vwa data items for?

    MAST works by dimensionalizing (or slicing and dicing) volumes.  A ‘dimension’ is something like gender, occupation, or state of residence – a way to categorize things, but it doesn’t make sense to add it up like dollars.  Dollars, people, and households are examples of volumes – things that can be added up sensibly.

    There are many data items in the census, like FULP, ELEP, PINCP, HINCP, etc. where the census bureau combined both volumetric and dimensional data in a single data item.  E.g. a ‘2’ might mean that the household didn’t use the item (dimensional), but a ‘4’ means that they spent $4 on it (volumetric).  In order for MAST to work powerfully, I had to break these data items into two, e.g. FULP_V and FULP_D, respectively containing the volumetric and dimensional portions of the data.

    But if you’re going to accumulate the volumetric portion, you will need it to be weighted.  So FULP_VW was born, to contain the volumetric weighted portion.  And since it is a multiyear file, you will often want it adjusted, so FULP_VA came into existence.  And sometimes you will want it weighted and adjusted, which gives us FULP_VWA.