10 changes: 6 additions & 4 deletions README.md
@@ -2,8 +2,10 @@ This is a collection of example datasets from various sources that the `datasets
 for creating the datasets go to the original authors. The following datasets are included (along with their LICENCE).
 The licences are included in the respective dataset folders as well.

-1. [example-causal-datasets](https://github.com/cmu-phil/example-causal-datasets): CC0 1.0 Universal. Last synced on
+1. [angrist-krueger-cps](https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive): CC0 1.0 Universal. Last downloaded on 2026-03-02.
+
+2. [example-causal-datasets](https://github.com/cmu-phil/example-causal-datasets): CC0 1.0 Universal. Last synced on
 2026-02-05.
-2. [nslm](https://github.com/grf-labs/grf/tree/master/experiments/acic18): CC0 1.0 Universal. Last downloaded on 2026-03-04.
-3. [Tuebingen-pair-wise-dataset](https://webdav.tuebingen.mpg.de/cause-effect/): Last downloaded on 2026-03-02.
-4. [Twins-datasets](http://www.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html)
+3. [nslm](https://github.com/grf-labs/grf/tree/master/experiments/acic18): CC0 1.0 Universal. Last downloaded on 2026-03-04.
+4. [Tuebingen-pair-wise-dataset](https://webdav.tuebingen.mpg.de/cause-effect/): Last downloaded on 2026-03-02.
+5. [Twins-datasets](http://www.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html)
25 changes: 25 additions & 0 deletions angrist-krueger-cps/LICENSE
@@ -0,0 +1,25 @@
# License

Copyright (c) 1995, Angrist and Krueger
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
96 changes: 96 additions & 0 deletions angrist-krueger-cps/README.md
Comment thread
Rasesh2005 marked this conversation as resolved.
Member

How exactly was this file generated? Was it taken from somewhere, or was it LLM-generated?

Author

@Rasesh2005 Rasesh2005 Mar 12, 2026

The descriptions were LLM-generated. Many of them made sense and matched the descriptions at https://cps.ipums.org/cps-action/variables/{tag}; for example, educ (https://cps.ipums.org/cps-action/variables/educ) had a similar description, so I verified most of them this way. Some tags were not found under the same name, so for those I had to assume that what the LLM gave was correct. I have a list of verified and unverified tags and can send it as well if you want. In my view, these 2 tags were the only suspicious ones in the list.
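
The lookup described above can be sketched as follows. This is a minimal sketch that only builds the documentation URLs from the pattern quoted in the comment and does no network access. The helper names `ipums_variable_url` and `split_by_verification` are hypothetical, and the set of verified tags is something the caller would have to supply from a manual check.

```python
from typing import Iterable, List, Set, Tuple

# URL pattern quoted in the comment above; {tag} is the variable name.
IPUMS_VARIABLE_URL = "https://cps.ipums.org/cps-action/variables/{tag}"


def ipums_variable_url(tag: str) -> str:
    """Return the IPUMS CPS documentation URL for one variable tag."""
    return IPUMS_VARIABLE_URL.format(tag=tag.strip())


def split_by_verification(
    tags: Iterable[str], verified: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split tags into (verified, unverified), preserving input order.

    `verified` is assumed to hold the tags whose description was checked
    by hand against the IPUMS page; nothing is checked automatically here.
    """
    checked: List[str] = []
    unchecked: List[str] = []
    for tag in tags:
        (checked if tag in verified else unchecked).append(tag)
    return checked, unchecked
```

Calling `ipums_variable_url("educ")` reproduces the example link from the comment, and the second list returned by `split_by_verification` holds the tags that would still need a manual check.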

@@ -0,0 +1,96 @@
# Angrist-Krueger-CPS Dataset

This dataset is an extract of the CPS data containing 30,967 observations on men born 1944-53 from the 1979 and 1981-85 March CPS, matched to lottery number dummies for groups of 25 lottery numbers. There are 72 variables including all covariates. The raw files (`extract.dta` and `samplcps.do`) were replicated and processed into a ready-to-use tabular `.mixed.txt` format suitable for `pgmpy` consumption.
Member

> The raw files (`extract.dta` and `samplcps.do`) were replicated and processed into a ready-to-use tabular `.mixed.txt` format suitable for `pgmpy` consumption.

Can you explain how this processing was done?

Author

extract.dta is a Stata file. I read it with pandas `pd.read_stata` and then applied the same preprocessing as in the samplcps.do file.
The initial DataFrame shape was (30967, 72); after preprocessing and column selection, the final shape is (13993, 58).

Also, samplcps.do is simply a sample Stata program that analyzes the CPS data set.

Also, the website says:

> Follow the sample selection rules in the notes to the tables to reproduce the 25,781 observation working sample.

Should I instead do the preprocessing described in the paper that provides this sample? That would give a final DataFrame shape of (25782, 75).

Also, should I document all the preprocessing I did in the README as well?
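
The loading step described in this reply can be sketched as follows. This is a minimal sketch, assuming pandas is installed. `load_extract` and its `columns` argument are hypothetical names, and the sample-selection rules from the `.do` file are deliberately not reproduced because the thread does not spell them out.

```python
from pathlib import Path
from typing import Optional, Sequence, Union

import pandas as pd


def load_extract(
    path: Union[str, Path], columns: Optional[Sequence[str]] = None
) -> pd.DataFrame:
    """Read a Stata .dta file and optionally keep a subset of columns.

    Only the read and the column selection are shown; any row filtering
    (sample selection) would have to be added from the original .do file.
    """
    df = pd.read_stata(path)
    if columns is not None:
        # Fail loudly if a requested column is absent instead of silently dropping it.
        missing = [c for c in columns if c not in df.columns]
        if missing:
            raise KeyError(f"columns not found in {path}: {missing}")
        df = df[list(columns)]
    return df
```

Printing `load_extract("extract.dta").shape` before and after each preprocessing step is one way to document where the row and column counts quoted in this thread come from.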


## Column Descriptions

The dataset contains 72 variables derived from CPS extracts used to estimate the causal return to schooling using Vietnam draft lottery instruments.

### Core Economic Variables
- educ: Years of completed education.
- annwage: Annual wage income.
- weeks: Number of weeks worked during the previous year.
- hrsly: Hours worked during the previous year.
- hrslw: Hours worked during the last week.
- wageflag: Indicator that wage information is valid/observed.

### Demographic Variables
- age: Age of the respondent.
- agesq: Age squared, used to model nonlinear age effects.
- age2: Alternative squared age variable used in some regressions.
Author

Here, agesq and age2

- race: Race category from CPS.
- black: Dummy variable indicating Black respondents.
- other: Dummy variable for race other than White or Black.
- marital: Marital status indicator.
- spsepres: Indicator for spouse present in the household.

### Education Variables
- higratt: Highest grade attended.
- higrcomp: Highest grade completed.
- educ: Years of completed schooling.
- college: Indicator for college education.
- someco: Indicator for some college attendance.

### Labor Market Variables
- esr: Employment status recode.
- esrflag: Indicator for valid employment status data.
- class: Class of worker (private, government, self-employed, etc.).
- ind: Industry classification code.
- occ: Occupation classification code.
- vet: Veteran status indicator.
- veteran: Recoded veteran status variable.

### Geographic Variables
- state: State code.
- division: Census division classification.
- smsa: Indicator for residence in a Standard Metropolitan Statistical Area.
- metcode: Metropolitan area code.
- city: Indicator for residence in a central city.
- balsmsa: Balanced SMSA classification.

### Regional Indicator Variables
These variables represent U.S. census regions used as regression controls.

- neweng: New England region indicator.
- midatl: Mid-Atlantic region indicator.
- eastnth: East North Central region indicator.
- westnth: West North Central region indicator.
- sthatl: South Atlantic region indicator.
- eaststh: East South Central region indicator.
- weststh: West South Central region indicator.
- mount: Mountain region indicator.
- pacific: Pacific region indicator.

### Birth Year Variables
Dummy variables indicating the respondent’s year of birth.

- yob: Year of birth.
- yob44–yob53: Indicator variables for birth years 1944 through 1953.

### Survey Year Variables
Dummy variables identifying the CPS survey year.

- year: CPS survey year.
- yr81: Indicator for survey year 1981.
- yr82: Indicator for survey year 1982.
- yr83: Indicator for survey year 1983.
- yr84: Indicator for survey year 1984.
- yr85: Indicator for survey year 1985.

### Draft Lottery Instrument Variables
These variables represent grouped Vietnam draft lottery numbers used as instruments for education.

- lott1–lott13: Lottery number group indicator variables.

### Sampling and Administrative Variables
- marchwt: CPS March supplement sampling weight.
- recode: Observation identifier used in the replication dataset.

## Dataset Purpose

This dataset is used to estimate the causal effect of education on wages using instrumental variables derived from Vietnam draft lottery numbers. The lottery provides exogenous variation in schooling decisions among men born between 1944 and 1953.
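
To illustrate the instrumental-variables idea in its simplest form, here is a sketch of the single-instrument ratio (Wald) estimator, cov(outcome, instrument) / cov(treatment, instrument). It is not the split-sample estimator of the cited paper and it does not read the dataset. The function names and the toy numbers are assumptions made purely for illustration. With this dataset, the treatment would be `educ`, the instrument one of the lottery indicators, and the outcome a wage measure.

```python
from typing import Sequence


def covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def wald_iv_estimate(
    outcome: Sequence[float],
    treatment: Sequence[float],
    instrument: Sequence[float],
) -> float:
    """Single-instrument IV estimate: cov(y, z) / cov(s, z)."""
    first_stage = covariance(treatment, instrument)
    if first_stage == 0:
        raise ZeroDivisionError("instrument is uncorrelated with the treatment")
    return covariance(outcome, instrument) / first_stage


# Toy data (made up): a binary instrument, years of schooling, and an
# outcome constructed as an exact linear function of schooling.
instrument = [0, 0, 1, 1, 0, 1]
schooling = [12, 11, 14, 13, 12, 15]
outcome = [1.0 + 0.5 * s for s in schooling]

estimate = wald_iv_estimate(outcome, schooling, instrument)
```

Because the toy outcome depends on schooling only through a fixed slope, the estimate recovers that slope up to floating-point error. Real data would add noise, covariates, and several lottery indicators at once, which is where a two-stage or split-sample estimator comes in.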

## References
**Source Citation:**
Angrist, J. D., & Krueger, A. B. (1995). Split-Sample Instrumental Variables Estimates of the Return to Schooling. Journal of Business & Economic Statistics, 13(2), 225-235.
Data extracted from the [Angrist Data Archive](https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive).