Add Angrist-Kreuger-CPS dataset #4
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # License | ||
|
|
||
| Copyright (c) 1995, Angrist and Krueger | ||
| All rights reserved. | ||
|
|
||
| Redistribution and use in source and binary forms, with or without | ||
| modification, are permitted provided that the following conditions are met: | ||
|
|
||
| 1. Redistributions of source code must retain the above copyright notice, this | ||
| list of conditions and the following disclaimer. | ||
|
|
||
| 2. Redistributions in binary form must reproduce the above copyright notice, | ||
| this list of conditions and the following disclaimer in the documentation | ||
| and/or other materials provided with the distribution. | ||
|
|
||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | ||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE | ||
| DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE | ||
| FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | ||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | ||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | ||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
| OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
|
Member
How exactly was this file generated? Was it taken from somewhere, or was it LLM-generated?
Author
The description was LLM-generated. Many of the entries made sense and matched the descriptions at https://cps.ipums.org/cps-action/variables/{tag}; for example, educ (https://cps.ipums.org/cps-action/variables/educ) had a similar description, so I verified most of them this way. Some tags could not be found under the same name, so I had to assume that what the LLM gave was correct. I have a list of verified and unverified tags and can send it if you want. These 2 tags were the only suspicious ones in the list, in my opinion.
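The lookup described in this comment could be scripted roughly as follows. This is only a sketch: the URL pattern is the one quoted above, while the function names and the example tag lists are hypothetical, and it makes no network requests, so whether a page actually exists still has to be checked by hand.

```python
# URL pattern quoted in the comment above; {tag} is the column name.
IPUMS_VARIABLE_URL = "https://cps.ipums.org/cps-action/variables/{tag}"


def ipums_url(tag):
    """Build the IPUMS CPS documentation URL for one column name."""
    return IPUMS_VARIABLE_URL.format(tag=tag)


def split_by_verification(tags, verified):
    """Split tags into (verified tag -> URL, list of unverified tags).

    `verified` is a manually maintained collection of tags whose README
    description was checked against the IPUMS page (hypothetical input).
    """
    verified_set = set(verified)
    checked = {t: ipums_url(t) for t in tags if t in verified_set}
    unchecked = [t for t in tags if t not in verified_set]
    return checked, unchecked


# Illustrative input only: educ is the one tag the comment names as checked.
checked, unchecked = split_by_verification(["educ", "agesq", "age2"], ["educ"])
print(checked)
print(unchecked)
```

Keeping the unverified list explicit like this would make it easy to attach to the PR for review.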
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # Angrist-Krueger-CPS Dataset | ||
|
|
||
| This dataset is an extract of the CPS data containing 30,967 observations on men born 1944-53 from the 1979 and 1981-85 March CPS, matched to lottery number dummies for groups of 25 lottery numbers. There are 72 variables including all covariates. The raw files (`extract.dta` and `samplcps.do`) were replicated and processed into a ready-to-use tabular `.mixed.txt` format suitable for `pgmpy` consumption. | ||
|
Member
Can you explain how this processing was done?
Author
Also, the website says
Should I instead do the preprocessing described in the paper that provides this? That will give a final dataframe shape of
Also, should I include all the preprocessing I did in the README as well?
||
|
|
||
| ## Column Descriptions | ||
|
|
||
| The dataset contains 72 variables derived from CPS extracts used to estimate the causal return to schooling using Vietnam draft lottery instruments. | ||
|
|
||
| ### Core Economic Variables | ||
| - educ: Years of completed education. | ||
| - annwage: Annual wage income. | ||
| - weeks: Number of weeks worked during the previous year. | ||
| - hrsly: Hours worked during the previous year. | ||
| - hrslw: Hours worked during the last week. | ||
| - wageflag: Indicator that wage information is valid/observed. | ||
|
|
||
| ### Demographic Variables | ||
| - age: Age of the respondent. | ||
| - agesq: Age squared, used to model nonlinear age effects. | ||
| - age2: Alternative squared age variable used in some regressions. | ||
|
Author
Here,
||
| - race: Race category from CPS. | ||
| - black: Dummy variable indicating Black respondents. | ||
| - other: Dummy variable for race other than White or Black. | ||
| - marital: Marital status indicator. | ||
| - spsepres: Indicator for spouse present in the household. | ||
|
|
||
| ### Education Variables | ||
| - higratt: Highest grade attended. | ||
| - higrcomp: Highest grade completed. | ||
| - educ: Years of completed schooling. | ||
| - college: Indicator for college education. | ||
| - someco: Indicator for some college attendance. | ||
|
|
||
| ### Labor Market Variables | ||
| - esr: Employment status recode. | ||
| - esrflag: Indicator for valid employment status data. | ||
| - class: Class of worker (private, government, self-employed, etc.). | ||
| - ind: Industry classification code. | ||
| - occ: Occupation classification code. | ||
| - vet: Veteran status indicator. | ||
| - veteran: Recoded veteran status variable. | ||
|
|
||
| ### Geographic Variables | ||
| - state: State code. | ||
| - division: Census division classification. | ||
| - smsa: Indicator for residence in a Standard Metropolitan Statistical Area. | ||
| - metcode: Metropolitan area code. | ||
| - city: Indicator for residence in a central city. | ||
| - balsmsa: Balanced SMSA classification. | ||
|
|
||
| ### Regional Indicator Variables | ||
| These variables represent U.S. census regions used as regression controls. | ||
|
|
||
| - neweng: New England region indicator. | ||
| - midatl: Mid-Atlantic region indicator. | ||
| - eastnth: East North Central region indicator. | ||
| - westnth: West North Central region indicator. | ||
| - sthatl: South Atlantic region indicator. | ||
| - eaststh: East South Central region indicator. | ||
| - weststh: West South Central region indicator. | ||
| - mount: Mountain region indicator. | ||
| - pacific: Pacific region indicator. | ||
|
|
||
| ### Birth Year Variables | ||
| Dummy variables indicating the respondent’s year of birth. | ||
|
|
||
| - yob: Year of birth. | ||
| - yob44–yob53: Indicator variables for birth years 1944 through 1953. | ||
|
|
||
| ### Survey Year Variables | ||
| Dummy variables identifying the CPS survey year. | ||
|
|
||
| - year: CPS survey year. | ||
| - yr81: Indicator for survey year 1981. | ||
| - yr82: Indicator for survey year 1982. | ||
| - yr83: Indicator for survey year 1983. | ||
| - yr84: Indicator for survey year 1984. | ||
| - yr85: Indicator for survey year 1985. | ||
|
|
||
| ### Draft Lottery Instrument Variables | ||
| These variables represent grouped Vietnam draft lottery numbers used as instruments for education. | ||
|
|
||
| - lott1–lott13: Lottery number group indicator variables. | ||
|
|
||
| ### Sampling and Administrative Variables | ||
| - marchwt: CPS March supplement sampling weight. | ||
| - recode: Observation identifier used in the replication dataset. | ||
|
|
||
| ## Dataset Purpose | ||
|
|
||
| This dataset is used to estimate the causal effect of education on wages using instrumental variables derived from Vietnam draft lottery numbers. The lottery provides exogenous variation in schooling decisions among men born between 1944 and 1953. | ||
|
|
||
| ## References | ||
| **Source Citation:** | ||
| Angrist, J. D., & Krueger, A. B. (1995). Split-Sample Instrumental Variables Estimates of the Return to Schooling. Journal of Business & Economic Statistics, 13(2), 225-235. | ||
| Data extracted from the [Angrist Data Archive](https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive). | ||
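
Since several column descriptions above were not verified against a codebook, the derived columns can at least be checked against the data itself. The following is a minimal sketch, not part of the PR: it assumes each row is available as a mapping from column name to value (for example from `csv.DictReader`), and that `yob` and `year` hold either two-digit or four-digit years. Both are assumptions about the file, and `check_row` is a hypothetical helper name.

```python
import math


def check_row(row):
    """Return a list of mismatches between derived columns and their descriptions."""
    problems = []

    # agesq is documented as age squared.
    age = float(row["age"])
    if not math.isclose(float(row["agesq"]), age * age):
        problems.append("agesq is not age squared")

    # yob44..yob53 are documented as birth-year indicators.
    yob = int(float(row["yob"])) % 100  # accept 1950 or 50 (assumption)
    for y in range(44, 54):
        expected = 1 if yob == y else 0
        if int(float(row[f"yob{y}"])) != expected:
            problems.append(f"yob{y} disagrees with yob")

    # yr81..yr85 are documented as survey-year indicators (1979 has no dummy).
    year = int(float(row["year"])) % 100  # accept 1981 or 81 (assumption)
    for y in range(81, 86):
        expected = 1 if year == y else 0
        if int(float(row[f"yr{y}"])) != expected:
            problems.append(f"yr{y} disagrees with year")

    return problems


# Made-up row for illustration, not taken from the dataset.
example = {"age": "31", "agesq": "961", "yob": "1950", "year": "1981"}
example.update({f"yob{y}": "1" if y == 50 else "0" for y in range(44, 54)})
example.update({f"yr{y}": "1" if y == 81 else "0" for y in range(81, 86)})
print(check_row(example))  # → []
```

Running this over every row and collecting the non-empty results would show which descriptions, such as the one for `agesq`, actually hold in the data.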