Monthly Archives: December 2015

Can MAST Really Answer Nearly Every Question?

As we close out 2015, I need to revisit a claim that I made when OneGuyOnTheInternet.com first went live. The site went live on 12-8-2015, and I stated that it “will let the user ask just about any question they can think of and get an accurate answer”. Since that time, I’ve introduced a few concepts that are key to using MAST, and they are necessary background if we are to reach the point where the user can genuinely answer nearly any conceivable question.

These concepts include:

  • dimensionalizing (slicing and dicing) volumes (such as household counts, person counts, expenses, incomes) based on discrete variables (such as geography, age, race, occupation)
  • analyzing households as units, by:
    • performing tabulations at both household and person levels simultaneously
    • carrying all household dimensions to the person level
    • including a weighted household count at the person level
  • allowing a multitude of volumes to be displayed in a single tabulation
  • allowing a multitude of dimensions (not just one, two, or three) to be used in a single tabulation
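
To make "dimensionalizing volumes" concrete, here is a minimal Python sketch (not MAST itself, and not how MAST is implemented). It sums several volumes within every combination of several dimensions. The `tabulate` function, the field names, and the records are all made up for illustration.

```python
from collections import defaultdict

def tabulate(records, dimensions, volumes):
    """Sum each volume within every combination of dimension values."""
    table = defaultdict(lambda: dict.fromkeys(volumes, 0))
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        for v in volumes:
            table[key][v] += rec[v]
    return dict(table)

# Made-up person records, for illustration only.
people = [
    {"state": "PA", "sex": "F", "persons": 1, "wages": 30000},
    {"state": "PA", "sex": "F", "persons": 1, "wages": 20000},
    {"state": "PA", "sex": "M", "persons": 1, "wages": 41000},
    {"state": "NY", "sex": "M", "persons": 1, "wages": 52000},
]

# Two dimensions and two volumes in a single tabulation.
result = tabulate(people, ["state", "sex"], ["persons", "wages"])
for key, sums in sorted(result.items()):
    print(key, sums)
```

Adding another dimension or volume is just one more name in either list, which is the point of the last two bullets above.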

While these things can be done today on OneGuyOnTheInternet.com simply by putting data items and geographies in a shopping cart, and these things give us an enormous amount of analytical capability with very little effort, they don’t come close to answering every conceivable question. We have to move a bit higher to fulfill the needs of the users while supporting my claim.

There are 3 levels of MAST users. Dropping geographies and data items into a shopping cart is the most elementary level – it’s simple and powerful, but it can’t answer (nearly) everything.

In 2016, we will be moving to the 2nd and 3rd levels, and you will eventually see why I am comfortable making the claim that I make.

Very Brief Taste of the Second Level: Script-Writing

The Census Bureau includes a data item on the household record called “R65”. It categorizes every household based on the number of people in the household who are 65 or over, and the possible values are: none, 1, or 2+. The Census Bureau folks work with lots of users, and no doubt they have found that there is a great need to categorize households in this manner. But what if I need to categorize the households based on whether they contain 3 or more people who are 65 or over? Or if I need the cutoff to be 63 years? Or what if I need to categorize the households based on the number of people who are 43-63 years old, worked at least 27 weeks last year, and made between $27,000 and $46,000? Obviously we can’t expect the Census Bureau to anticipate that I would need that categorization and prepare it for me! MAST allows you to create categorizations like that on demand, which is another step towards answering nearly any conceivable question.

Here are the portions of scripts that allow the user to define these kinds of categorizations:

3 or more people who are 65 or over:

BEGIN ENTITY CLASSIFICATION
   Name=Num65
      BEGIN BUCKET01
         People
         _Agep_v
            65+
      END BUCKET01

      BEGIN RELATIONSHIP

      END RELATIONSHIP

      BEGIN BAND VALUES
         0-2
         3+

      END BAND VALUES
END ENTITY CLASSIFICATION

The above tells MAST to count the ‘People’ (unweighted person count) in each household, but only count the people that are 65 or older according to the Agep_v data item. Add those people up for each household, then categorize each household based on how many it found: 0-2 or 3+.
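
For readers who think in code, the same logic can be sketched in Python. This is only an illustration of what the script asks for, not MAST's implementation. The function names, the `age` field, and the sample households are hypothetical.

```python
def band_for(count, bands):
    """Return the label of the first band whose range contains count."""
    for label, low, high in bands:
        if count >= low and (high is None or count <= high):
            return label
    raise ValueError(f"no band covers a count of {count}")

def classify_households(households, qualifies, bands):
    """Count the qualifying people in each household, then band the count."""
    return {
        hh_id: band_for(sum(1 for p in members if qualifies(p)), bands)
        for hh_id, members in households.items()
    }

# Made-up households: household id -> list of person records.
households = {
    "H1": [{"age": 70}, {"age": 66}, {"age": 81}],
    "H2": [{"age": 65}, {"age": 40}],
    "H3": [{"age": 30}],
}

# Bands mirror the BAND VALUES section: 0-2 and 3+ (None means no upper limit).
NUM65_BANDS = [("0-2", 0, 2), ("3+", 3, None)]

num65 = classify_households(households, lambda p: p["age"] >= 65, NUM65_BANDS)
print(num65)
```

Changing the cutoff to 63, or the bands to something else, means changing one condition or one list, which is the kind of flexibility the script gives you.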

Presence or absence of people that are 43-63 years old that worked at least 27 weeks last year and made between $27,000 and $46,000:

BEGIN ENTITY CLASSIFICATION
   Name=MyCustomCategory
      BEGIN BUCKET01
         People
         _Agep_v
            43-63
         _Wkw,Labs/wkw
            50 to 52 weeks
            48 to 49 weeks
            40 to 47 weeks
            27 to 39 weeks
         _Wagp
            27000-46000
      END BUCKET01

      BEGIN RELATIONSHIP

      END RELATIONSHIP

      BEGIN BAND VALUES
         0-0
         1+

      END BAND VALUES
END ENTITY CLASSIFICATION

The above tells MAST to count the ‘People’ (unweighted person count) in each household, but only count the people that are 43-63 according to the Agep_v data item, AND they have to have worked from 27-52 weeks last year AND they have to have wages of $27,000 to $46,000. Add those people up for each household, then categorize each household based on whether there are any or none of them present.
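
Here is the same presence/absence logic as a rough Python sketch, again only to illustrate what the script describes rather than how MAST does it. The field names (`age`, `weeks_worked`, `wages`) and the sample data are hypothetical. The weeks-worked labels are copied from the script above.

```python
# The four weeks-worked categories listed in the script (27 weeks or more).
WEEKS_27_PLUS = {
    "50 to 52 weeks",
    "48 to 49 weeks",
    "40 to 47 weeks",
    "27 to 39 weeks",
}

def qualifies(person):
    """All three conditions must hold (they are ANDed together)."""
    return (
        43 <= person["age"] <= 63
        and person["weeks_worked"] in WEEKS_27_PLUS
        and 27000 <= person["wages"] <= 46000
    )

def presence_band(members):
    """Band a household as '0-0' (none present) or '1+' (at least one)."""
    count = sum(1 for p in members if qualifies(p))
    return "1+" if count >= 1 else "0-0"

# Made-up households, for illustration only.
households = {
    "H1": [
        {"age": 50, "weeks_worked": "50 to 52 weeks", "wages": 35000},
        {"age": 48, "weeks_worked": "14 to 26 weeks", "wages": 30000},
    ],
    "H2": [{"age": 45, "weeks_worked": "50 to 52 weeks", "wages": 60000}],
    "H3": [{"age": 70, "weeks_worked": "27 to 39 weeks", "wages": 30000}],
}

my_custom_category = {hh: presence_band(m) for hh, m in households.items()}
print(my_custom_category)
```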

In 2016 I intend to produce a series of tutorials and examples that will explain script-writing (this was just a small taste!) and take us much closer to supporting my claim and your needs.

Until then, Happy New Year!

John Grumbine

Geographies

The PUMS 5-year file (2009-2013) contains all of the USA and Puerto Rico.  But suppose you don’t want to analyze the whole thing?  If you just want the USA, you can include the dimension H_USA_PR, which will split the tabulation into two sections: USA and Puerto Rico.  The tabulation will be twice the size you wanted, but it’s no big deal; you can just ignore the half that you don’t want.

But what if you want to look at just one value of PUMA10?  You would dimensionalize on PUMA10 and State (which is necessary, because PUMA numbers are duplicated across states), and now your tabulation is 2,378 times the size that you wanted.  Making tabulations unnecessarily large causes a few problems.  Today I’m going to show you how to avoid unnecessarily large tabulations.

The Census Bureau has defined 4,544 different geographical areas, spread over 5 different data items.  There are:

  1. 5 Regions
  2. 10 Divisions
  3. 52 ‘States’ (two of which are DC and PR)
  4. 2099 PUMA00s (PUMAs as defined in 2000)
  5. 2378 PUMA10s (PUMAs as defined in 2010)

The Region, Division, and State names are identical to the values given in the Census Bureau documentation, so if you want to analyze just Pennsylvania, you can search on “Penn” in the search window, and a product called “Pennsylvania/PA” will appear.  Add it to the cart, and if you don’t add any other geographies, your tabulation will include only Pennsylvania data.  You can add other states, regions, divisions, or PUMAs if you want a larger area.  If you were to add Pennsylvania and 6 PUMAs in Pennsylvania, the PUMA additions would have no effect, because you have already included all the Pennsylvania data.  If, instead, you add 6 PUMAs from the neighboring states, you have created a custom geography that includes PA plus some of the surrounding area (if you do, remember to consider both PUMA00 and PUMA10).
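
One way to picture how geographies in the cart combine is as a union: a record is kept if any selected geography covers it. The Python sketch below illustrates that idea only, with states and PUMAs. The record layout, the `is_selected` function, and the PUMA codes are made up, and MAST's internals may work differently.

```python
def is_selected(record, states, pumas):
    """Keep a record if its state, or its PUMA in either vintage, was selected."""
    if record["state"] in states:
        return True
    return any(
        (vintage, record["state"], record[field]) in pumas
        for vintage, field in (("PUMA00", "puma00"), ("PUMA10", "puma10"))
    )

# Made-up records; each carries a PUMA code in only one vintage here.
records = [
    {"id": 1, "state": "PA", "puma00": None, "puma10": "01000"},
    {"id": 2, "state": "PA", "puma00": "00500", "puma10": None},
    {"id": 3, "state": "NJ", "puma00": None, "puma10": "00100"},
    {"id": 4, "state": "NJ", "puma00": None, "puma10": "00200"},
]

def pick(states, pumas):
    return [r["id"] for r in records if is_selected(r, states, pumas)]

pa_only = pick({"PA"}, set())
# Adding a PUMA inside Pennsylvania changes nothing.
pa_plus_inside = pick({"PA"}, {("PUMA10", "PA", "01000")})
# Adding a PUMA from a neighboring state enlarges the area.
pa_plus_neighbor = pick({"PA"}, {("PUMA10", "NJ", "00100")})
print(pa_only, pa_plus_inside, pa_plus_neighbor)
```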

The PUMA products are labeled in this format:

PUMA#0 SS #####

where PUMA#0 is either PUMA00 or PUMA10.  SS is the state designation, such as AL, AK, MD, etc.  ##### is the 5-digit PUMA code assigned by the Census Bureau, such as 00100.

So if you wanted PUMA10 100 from New York, you would search on:

PUMA10 NY 00100

and add it to the cart.
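
If you are building a list of search strings for many PUMAs, a small helper can apply the label format described above. This is a hypothetical convenience function, not part of the site. It only pads the PUMA number to 5 digits and assembles the three parts.

```python
def puma_search_label(vintage, state, puma):
    """Build a search string in the 'PUMA#0 SS #####' format."""
    if vintage not in ("PUMA00", "PUMA10"):
        raise ValueError("vintage must be 'PUMA00' or 'PUMA10'")
    # Pad the PUMA number with leading zeros to 5 digits.
    return f"{vintage} {state.upper()} {int(puma):05d}"

label = puma_search_label("PUMA10", "NY", 100)
print(label)
```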

Geographies can be added to the cart at any time during shopping.  Like the dimensions and volumes, the order of addition to the cart has no bearing on your tabulation.

The easiest way to use this site

If you didn’t see the video, this site might be a little confusing.  It’s only been live for a couple of days, and as time goes on I’m sure I’ll explain things more clearly.

All you need to do is pick the data items that you are interested in, add them to the shopping cart, and then check out.  Your custom tabulation will be emailed to you.  Seems easy, right?

The (minor) difficulty is that there are nearly 500 different data items that you can use (492 at this moment) to analyze the PUMS data, so paging through product lists is cumbersome.  That’s a good problem to have – there are a lot of choices which add up to making it easier to get what you really need, not what someone else wants to give you.

Here’s how to do it:  Get this data dictionary from the Census Bureau.  Look through it until you find the data items that you want.  In the upper right-hand corner of this site, use the ‘search’ function to find those data items, and add them to the cart (this is way easier than paging through product lists!).

Remember these things that aren’t in the data dictionary:

PWGTP and WGTP are weighted person and household counts that will give you replicates (high and low values).

PWGTPNR and WGTPNR are just the weighted counts, with no high and low values (NR means no replicates).

If you’re not using a State dimension and you want to separate the Puerto Rico data, use H_USA_PR as a dimension (it’s free).  If you are using a State dimension, Puerto Rico will be in its own category anyway.

The _D, _V, _VA, _VW, and _VWA data items are explained in this post.

“PEOPLE” is an unweighted person count.

“UW_HDRS” is an unweighted headers (either households or group quarters) count.

“PL_WGTP” is a weighted household count at the person level.

“HSERIALNO_YR” is a dimension that will give you the year (2009-2013).
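
A note on the replicates mentioned above for PWGTP and WGTP: how this site turns them into high and low values isn't spelled out here, so the Python sketch below shows the Census Bureau's documented approach for PUMS instead. It takes the estimate from the main weight and the same estimate recomputed with each of the 80 replicate weights. Reporting the result as a 90% interval (multiplier 1.645) is an assumption on my part, and the numbers are made up.

```python
import math

def replicate_standard_error(estimate, replicate_estimates):
    """Standard error from 80 replicate estimates: sqrt(4/80 * sum of squared differences)."""
    if len(replicate_estimates) != 80:
        raise ValueError("expected 80 replicate estimates")
    squared_diffs = sum((r - estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(4 * squared_diffs / 80)

def low_high(estimate, replicate_estimates, z=1.645):
    """Low and high values around the estimate (z=1.645 assumes a 90% interval)."""
    margin = z * replicate_standard_error(estimate, replicate_estimates)
    return estimate - margin, estimate + margin

# Made-up numbers: a weighted count of 1000, and 80 replicate counts.
estimate = 1000
replicates = [1010] * 40 + [990] * 40

se = replicate_standard_error(estimate, replicates)
low, high = low_high(estimate, replicates)
print(se, low, high)
```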

There are a few data items that are household level items, but the Census Bureau has placed them on both the household and person records, and given them both the same name.  This is a good strategy if you have to analyze people and households separately, but if you can analyze both levels simultaneously, it creates a naming conflict.  The ones to be aware of are ST, PUMA00, and PUMA10.  I have included them on both records for completeness, though I can’t see any valid reason to use the person level items when you can use the household level ones (which will show in both the household and person levels in the tabulation).  They had to be renamed to eliminate the conflict, and are now called HST, PST, HPUMA00, PPUMA00, HPUMA10, PPUMA10.

What is MAST?

MAST is an acronym for “Multi-dimensional Analytical and Simulation Tool”.

In the mid-1990s I was working with one of the major telephone companies, and they found that it was extremely difficult to analyze their data.  It often took months for them to answer what seemed like a simple question, because programmers had to develop special purpose code (which can take a LONG time) each time.

They asked me if I could build a product that would answer the questions.  I agreed to do so at very low cost, under the condition that I would own the product.  That is how MAST came into existence and why I have it.  Given that telephone data consists of accounts with many phone calls associated with them, and census data consists of households with many people associated with them, and that all of them (accounts, phone calls, households, people) have both volumetric and dimensional data, MAST is a very powerful tool for analyzing census data.

What are the _v, _d, _vw, _va, and _vwa data items for?

MAST works by dimensionalizing (or slicing and dicing) volumes.  A ‘dimension’ is something like gender, occupation, or state of residence – a way to categorize things, but it doesn’t make sense to add it up like dollars.  Dollars, people, and households are examples of volumes – things that can be added up sensibly.

There are many data items in the census, like FULP, ELEP, PINCP, HINCP, etc., where the Census Bureau combined both volumetric and dimensional data in a single data item.  E.g. a ‘2’ might mean that the household didn’t use the item (dimensional), but a ‘4’ means that they spent $4 on it (volumetric).  In order for MAST to work powerfully, I had to break these data items into two, e.g. FULP_V and FULP_D, respectively containing the volumetric and dimensional portions of the data.

But if you’re going to accumulate the volumetric portion, you will need it to be weighted.  So FULP_VW was born, to contain the volumetric weighted portion.  And since it is a multiyear file, you will often want it adjusted, so FULP_VA came into existence.  And sometimes you will want it weighted and adjusted, which gives us FULP_VWA.
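
The split can be sketched in Python as follows. This is an illustration of the idea, not the actual derivation used on this site. The special-code table, the `amount reported` label for the dimensional part of a dollar value, and the sample weight and adjustment factor are all assumptions.

```python
def split_item(raw, weight, adjustment, special_codes):
    """Split a mixed data item into its _D, _V, _VW, _VA, and _VWA parts."""
    if raw in special_codes:
        # Dimensional value: nothing to add up, so the volumes are zero.
        return {"d": special_codes[raw], "v": 0, "vw": 0, "va": 0, "vwa": 0}
    return {
        "d": "amount reported",            # assumed label for real dollar values
        "v": raw,                          # unweighted, unadjusted dollars
        "vw": raw * weight,                # weighted
        "va": raw * adjustment,            # adjusted across years
        "vwa": raw * weight * adjustment,  # weighted and adjusted
    }

# Illustrative code table only: '2' standing for "not used", as in the example above.
SPECIAL_CODES = {2: "not used"}

spent = split_item(4, weight=10, adjustment=1.5, special_codes=SPECIAL_CODES)
unused = split_item(2, weight=10, adjustment=1.5, special_codes=SPECIAL_CODES)
print(spent)
print(unused)
```

Keeping the five parts separate lets a tabulation sum the weighted and adjusted dollars while still slicing on the dimensional codes.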