Due: Monday 18 October
- Download the following files from the class web page:
The first file is a database collected by the National Institutes
of Health (NIH) as part of their National Survey of Family Growth (NSFG).
It contains one (very long) line of data for each of the 10,847
respondents who participated in the study. The second file is
an `interval' file that contains one line of data for each
of 21,332 pregnancies reported by the respondents.
Note: this database contains real data collected from people who
volunteered to provide an enormous amount of personal information,
including information about their reproductive and sexual histories.
Some of the information in this database, and some of the data we will
be looking at, is both politically sensitive and emotionally charged.
I think this work will be interesting, and I hope we will find some
interesting things, but it is important to approach this material with
appropriate discretion and sensitivity.
- birth.py contains a program that reads the two data files
and builds one Respondent object for each line in the respondent
file and one Interval object for each line in the interval file.
Read the program carefully and make sure you understand what it does
and how it works.
- Design and implement a data structure that will allow you
to represent sets and subsets of respondents, and that makes it
easy to find the intervals (pregnancies) associated with a given
respondent.
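One possible shape for such a structure is sketched below. It is not the only reasonable design. The Respondent and Interval classes here are minimal stand-ins for the ones in birth.py, and the caseid field used to link the two files is an assumption; use whatever identifier the real objects carry.

```python
from collections import defaultdict


class Respondent:
    """Stand-in for the Respondent class in birth.py (caseid is assumed)."""
    def __init__(self, caseid):
        self.caseid = caseid


class Interval:
    """Stand-in for the Interval class in birth.py (caseid is assumed)."""
    def __init__(self, caseid, pregordr):
        self.caseid = caseid
        self.pregordr = pregordr


def index_intervals(intervals):
    """Group intervals by respondent identifier, preserving file order."""
    index = defaultdict(list)
    for interval in intervals:
        index[interval.caseid].append(interval)
    return index


class RespondentSet:
    """A set of respondents that shares one interval index with its subsets."""
    def __init__(self, respondents, interval_index):
        self.respondents = {r.caseid: r for r in respondents}
        self.interval_index = interval_index

    def __len__(self):
        return len(self.respondents)

    def __iter__(self):
        return iter(self.respondents.values())

    def intervals(self, respondent):
        """Return the list of pregnancies for one respondent (may be empty)."""
        return self.interval_index.get(respondent.caseid, [])

    def subset(self, criterion):
        """Return a new RespondentSet of the members that satisfy criterion."""
        return RespondentSet([r for r in self if criterion(r)],
                             self.interval_index)
```

Because every subset shares the same interval index, forming a subset copies only references to respondents, and a lookup of one respondent's pregnancies is a single dictionary access.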
- Before you work with a database like this, it is a good idea
to do some consistency checking. First, add code that checks whether
the number of intervals for a given respondent matches the
number of pregnancies recorded for that respondent.
Then add code that checks whether the Interval objects are in
the right order, according to the pregordr attribute. If
either check fails, raise an appropriate exception.
This kind of checking serves several roles: you are checking the
database for missing or incorrect information; you are checking
whether you understand the contents of the database; and you
are checking whether your program is handling the data correctly.
Note: for some reason the people who designed the data files only
allocated two entries to record the sexes of children that result
from a single pregnancy. In cases of triplets and higher-order
births, we only know the sex of the first two children.
The attribute nbrnlv is the number of live births for a
given pregnancy.
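The two checks might be sketched as follows. This is only an illustration: the expected count is passed in as a plain number so the sketch does not guess which respondent attribute holds it, the exception class name is made up, and "right order" is taken to mean that pregordr strictly increases, which is an assumption you should verify against the data.

```python
from types import SimpleNamespace


class InconsistentDataError(Exception):
    """Raised when the two data files disagree or intervals are out of order."""


def check_intervals(expected_count, intervals):
    """Check one respondent's intervals against the respondent record.

    expected_count is the pregnancy count taken from the respondent record;
    which attribute holds it is left to the caller.
    """
    if len(intervals) != expected_count:
        raise InconsistentDataError(
            "expected %d intervals, found %d" % (expected_count, len(intervals)))
    orders = [interval.pregordr for interval in intervals]
    # Assumption: "right order" means pregordr strictly increases.
    for earlier, later in zip(orders, orders[1:]):
        if later <= earlier:
            raise InconsistentDataError("pregordr out of order: %r" % orders)


# Stand-in Interval objects for a quick demonstration.
demo = [SimpleNamespace(pregordr=1), SimpleNamespace(pregordr=2)]
check_intervals(2, demo)  # passes silently
```

Raising an exception, as opposed to printing a warning, makes it hard to overlook a failed check while you are still learning what the database contains.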
- Once you have organized the database into a structure
conducive to analysis, we have to decide what questions to investigate.
For this assignment, we will investigate how the number of children
a woman has affects her intention to have more children, and whether
the sex of the children she has affects her intentions.
For this kind of inquiry, a common approach is to choose a set
of variables that we think measure the cause of a phenomenon (so-called
independent variables) and a set of variables that we think measure
an effect (the dependent variables).
In this study, the independent variables are
the number of male and female children and possibly the birth order
of male and female children.
The dependent variables are the answers to questions about whether
the mother wants more children, whether she intends to have more
children, and how many more children she intends to have.
The database we are working with has a HUGE amount of information
about the respondents. I have made a pass through the variable
descriptions and chosen some of the variables we might want to use.
If you are interested, or if there are additional variables you
want to investigate, I can show you the survey documentation.
The variables I selected are explained in the handout entitled
`Section G: Birth Expectations and Desired Family Size.'
- As a warmup exercise, write a program that computes the
histogram of family sizes; that is, the number of respondents who
have given birth to n children, for each value of n.
Remember that we are counting the number of live births for each
respondent, which may not be the number of children currently
in the household of the respondent, for a variety of reasons.
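A rough sketch of the warmup, assuming each respondent's pregnancies are available as a list of Interval objects with an nbrnlv attribute. The demonstration data is made up, and treating a missing nbrnlv as zero is an assumption; check how the real file codes missing values.

```python
from collections import Counter
from types import SimpleNamespace


def live_births(intervals):
    """Total live births over one respondent's pregnancies.

    Assumption: a missing nbrnlv is stored as None and counts as zero.
    """
    return sum(interval.nbrnlv or 0 for interval in intervals)


def family_size_histogram(intervals_per_respondent):
    """Map each family size n to the number of respondents with n live births."""
    return Counter(live_births(ivs) for ivs in intervals_per_respondent)


def print_histogram(histogram):
    """Print one line per family size, smallest first."""
    for n in sorted(histogram):
        print("%2d  %d" % (n, histogram[n]))


# Made-up data: three respondents with 3, 0, and 1 live births.
demo = [
    [SimpleNamespace(nbrnlv=1), SimpleNamespace(nbrnlv=2)],
    [],
    [SimpleNamespace(nbrnlv=1), SimpleNamespace(nbrnlv=None)],
]
demo_histogram = family_size_histogram(demo)
print_histogram(demo_histogram)
```

Summing nbrnlv over pregnancies, instead of counting pregnancies, matters because a pregnancy can result in zero live births or in more than one.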
- Now that you have an idea where we are going, start to think
about some of the functions that you are likely to want, and some
of the data structures you will want to build. For example, it
seems likely that we will want to find the subset of respondents
that satisfy various criteria, and build data structures to represent
those subsets. That suggests that we will want functions that
evaluate various criteria for respondents (similar to the filters
we applied to words in the dictionary) and functions that form
subsets of the respondents.
If you have some ideas about how to design these functions, you
should do some design. If you are stumped, you should start writing
some code. Start with specific code that checks for specific
criteria, and look for ways to generalize it.
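To illustrate the move from specific to general, here is one hedged sketch. The attributes num_boys and num_girls are hypothetical; the real Respondent objects do not necessarily have them, and you would probably compute such values from the intervals.

```python
from types import SimpleNamespace


def has_no_sons(respondent):
    """A specific criterion (num_boys is a hypothetical attribute)."""
    return respondent.num_boys == 0


def attribute_equals(name, value):
    """A generalization: build a criterion that compares any attribute."""
    def criterion(respondent):
        return getattr(respondent, name) == value
    return criterion


def all_of(*criteria):
    """Combine criteria: true only if every one of them is true."""
    def criterion(respondent):
        return all(c(respondent) for c in criteria)
    return criterion


def select(respondents, criterion):
    """Return the respondents that satisfy the criterion, in original order."""
    return [r for r in respondents if criterion(r)]


# Made-up respondents for a quick demonstration.
demo = [
    SimpleNamespace(num_boys=0, num_girls=2),
    SimpleNamespace(num_boys=1, num_girls=0),
    SimpleNamespace(num_boys=0, num_girls=0),
]
two_girls_no_boys = select(
    demo, all_of(has_no_sons, attribute_equals("num_girls", 2)))
```

Since every criterion is just a function that takes a respondent and returns True or False, specific and general criteria can be mixed freely and passed to the same select function.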
- At the very least, you should write a program that answers
this question: `Are respondents who have not had a male child
more likely to want additional children than respondents who have
had a male child?' Beyond that, I will leave it up to you to
think about which groups of respondents you want to investigate,
and which dependent variables you want to look at.
This is an open-ended assignment because I think there are
a lot of ways to proceed and I want to give you a chance to explore.
If you feel overwhelmed, or have trouble getting started, I would
be happy to make suggestions.
I have explored this dataset a little bit and seen some results that I
think are interesting, but I don't have all the answers, or even very
many. To the best of my knowledge, no one else has used this database
to investigate these issues. If we find something here, we will be
the first to find it. Welcome to the frontier of current knowledge.
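As a starting point for the male-child question, a comparison of two groups might look like the sketch below. The attributes num_boys and wants_more are hypothetical stand-ins for values you would derive from the intervals and from the Section G variables, and the data is invented.

```python
from types import SimpleNamespace


def fraction_wanting_more(group):
    """Return (fraction, group size); the fraction is None for an empty group."""
    if not group:
        return None, 0
    wanting = sum(1 for r in group if r.wants_more)
    return wanting / len(group), len(group)


def compare_by_sons(respondents):
    """Compare respondents without a male child to those with at least one."""
    without_son = [r for r in respondents if r.num_boys == 0]
    with_son = [r for r in respondents if r.num_boys > 0]
    return fraction_wanting_more(without_son), fraction_wanting_more(with_son)


# Invented respondents: num_boys and wants_more are hypothetical attributes.
demo = [
    SimpleNamespace(num_boys=0, wants_more=True),
    SimpleNamespace(num_boys=0, wants_more=False),
    SimpleNamespace(num_boys=1, wants_more=True),
    SimpleNamespace(num_boys=2, wants_more=False),
    SimpleNamespace(num_boys=1, wants_more=False),
    SimpleNamespace(num_boys=3, wants_more=False),
]
no_son_result, son_result = compare_by_sons(demo)
```

Reporting the group size next to each fraction helps a reader judge how much weight a difference can bear. Note that the group without a male child, as written, also includes respondents with no children at all, so you may want to compare only respondents with the same total number of children.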
- WHAT TO TURN IN: As always, you should make your printed code as
presentable as possible before you turn it in.
In addition, I would like you to write a short paper (1-2 pages) that
presents the analysis you did and the results you found. You can
assume that the reader is familiar with the database (for
example, you can refer to variables by name), but you should try to
present your results clearly and concisely, using tables and graphs
where appropriate.
In some cases, you may see apparent differences between groups
that are relatively small, and it may not be clear whether the
effect is statistically significant. I don't expect you to do
significance testing, but you should be careful not to overstate
your confidence in the findings.