How to Learn Stata with Generative AI
A practical guide to writing prompts that generate clean, reproducible Stata code for regression and causal inference, with a focus on applied econometrics.
Getting Started
Master the art of translating econometric problems into precise Stata code through AI assistance.
Why This Guide?
- Turn research questions into working code faster
- Avoid common pitfalls in econometric implementation
- Ensure reproducible and robust results
What You Will Learn
- 📝 Structure effective prompts for AI models
- 🔧 Access ready-made code for common methods
- ✅ Verify results with systematic checklists
Golden Rules
Essential: Five fundamental principles for successful AI-assisted econometric programming.
1. Define Your Data Schema
Always specify variable names, types, and meanings. Include sample size and panel structure if applicable.
wage: numeric (hourly wage in $), educ: numeric (years), id: string (worker ID)
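One way to produce a schema like the one above is to let Stata report it and paste the output into the prompt. The sketch below builds a three-row toy dataset (the values are made up for illustration) and runs the checks; with real data, replace the `input` block with your own `use` command.

```stata
* Toy data matching the schema above (values are made up for illustration)
clear
input str4 id wage educ
"w001" 18.5 12
"w002" 24.0 16
"w003" 15.2 10
end

* Report storage types and labels to paste into the prompt
describe

* Confirm that id uniquely identifies rows (stops with an error if not)
isid id

* Panel commands such as xtset need a numeric identifier,
* so convert the string id before asking for panel code
encode id, generate(id_num)
confirm numeric variable id_num
```

Telling the AI that `id` is a string up front helps it include the `encode` step instead of producing an `xtset` call that fails.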
2. State Your Research Design
Clearly identify the econometric method and identification strategy.
OLS with robust SEs, IV using quarter-of-birth, DiD with two periods
3. Specify Standard Error Treatment
Match SE choice to your research design and data structure.
Cluster at individual level, robust for heteroskedasticity, bootstrap for complex estimators
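The three choices above map onto the `vce()` option. The following is a minimal sketch that assumes the `nlsw88` example data shipped with Stata, and uses `industry` purely as a stand-in cluster variable (it has too few groups for serious inference). The sample is fixed first so that all three models use the same observations.

```stata
* Example data shipped with Stata (an assumption; swap in your own file)
sysuse nlsw88, clear

* Fix the estimation sample so all three models use the same rows
keep if !missing(wage, grade, ttl_exp, industry)

* Heteroskedasticity-robust (Huber-White) standard errors
regress wage grade ttl_exp, vce(robust)
estimates store se_robust

* Cluster-robust standard errors; industry is only a stand-in cluster
regress wage grade ttl_exp, vce(cluster industry)
estimates store se_cluster

* Bootstrap standard errors; seed() makes the resampling reproducible
regress wage grade ttl_exp, vce(bootstrap, reps(200) seed(12345))
estimates store se_boot

* Coefficients match across columns; only the standard errors differ
estimates table se_robust se_cluster se_boot, se
```

Naming the exact `vce()` option in the prompt removes one of the most common sources of silent disagreement between your design and the generated code.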
4. Request Code Only
Ask for executable code with minimal comments, avoiding interactive prompts or placeholders.
Return Stata code block only; use actual variable names; include basic diagnostics
5. Ensure Reproducibility
Include version control, seeds, and explicit package requirements.
version 17; set seed 12345; ssc install reghdfe
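These three pieces can sit together at the top of every do-file. The header below is a sketch, not a required layout: it assumes Stata 17 or newer (`version 17` stops with an error on older releases) and an internet connection for any package that still has to be installed. `ftools` is included because `reghdfe` depends on it.

```stata
* Reproducibility header (sketch)
version 17              // interpret commands as Stata 17 would
clear all
set more off            // never pause output for a keypress
set seed 12345          // fixes any random draws that follow

* Install required packages only if they are not already available
foreach pkg in ftools reghdfe {
    capture which `pkg'
    if _rc != 0 {
        ssc install `pkg'
    }
}

* Record the environment in the log
display "Stata version: " c(stata_version)
display "Run date:      " c(current_date)
```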
Prompt Template
Copy & Fill: Use this structured template to create clear, comprehensive prompts for any econometric task.
📋 How to Use
- Copy the template below
- Replace placeholders with your specific details
- Paste into your preferred AI model (Claude, ChatGPT, etc.)
- Review output using the QC checklist
Role: You are a careful Stata tutor for undergraduate econometrics.
Context:
- Study question and identification strategy: <1–2 lines>
- Dataset name: <e.g., wagepanel.dta>
- Variables (name : type : meaning):
- wage : numeric : log hourly wage
- educ : numeric : years of schooling
- exper : numeric : years of experience
- id, year : panel identifiers
- Assumptions / SEs: <e.g., cluster at id; robust>
- Output format: Stata code only, with short comments. Use explicit commands (xtset, ivregress, etc.). No placeholder variables.
Task:
- Write Stata code to accomplish: <task>
- Include: data load, minimal cleaning (few lines), model, SEs, brief postestimation.
- Constraints: no interactive prompts; set seed when simulating; avoid deprecated commands.
Return: A single code block.
Ready-to-Use Code Recipes
Copy-paste Stata code for the most common econometric methods. Each recipe includes diagnostics and best practices.
OLS with Robust Standard Errors
Recipe 1: Basic linear regression with heteroskedasticity-robust standard errors
* OLS with robust SEs
clear all
use "wagepanel.dta", clear
* Basic data exploration
describe wage educ exper
summarize wage educ exper, detail
* OLS with robust (Huber-White) standard errors
regress wage educ exper, vce(robust)
* Store results for later comparison
estimates store ols_robust
* Effect size interpretation for log wages
* If wage is log-transformed, uncomment:
* regress lnwage educ exper, vce(robust)
* display "Return to schooling (%) = " 100*(exp(_b[educ])-1)
Instrumental Variables (2SLS)
Recipe 2: Two-stage least squares with endogenous regressors
* IV / 2SLS: wage on educ instrumented by quarter-of-birth
clear all
use "wage_iv.dta", clear
* Check instrument relevance
regress educ qob exper age, vce(robust)
test qob
* 2SLS estimation
ivregress 2sls wage (educ = qob) exper age, vce(robust)
* First-stage diagnostics
estat firststage
* Rule of thumb: F-stat > 10 for strong instrument
* Over-identification test (if multiple instruments)
* estat overid
Difference-in-Differences
Recipe 3: Treatment effect estimation with before/after and treatment/control groups
* Difference-in-Differences estimation
clear all
use "did_sample.dta", clear
* Variables: treat (1=treatment group), post (1=after policy), y (outcome)
* Basic DiD specification
regress y i.treat##i.post, vce(robust)
* Extract treatment effect
lincom 1.treat#1.post
display "Treatment effect: " _b[1.treat#1.post]
* Event study with multiple periods (if panel data available)
* xtset id year
* reghdfe y i.year##i.treat, absorb(id) vce(cluster id)
* Pre-trend test (parallel trends assumption)
* Keep only pre-treatment periods and test trend differences
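The pre-trend check described in the last comment can be sketched as follows. Because the recipe's dataset is not specified beyond `treat`, `post`, and `y`, this example simulates its own panel; the `id` and `year` variables, the 2009 policy date, and all coefficients in the data-generating line are assumptions chosen for illustration.

```stata
* Simulated panel: 200 units observed 2005-2010, policy starts in 2009
clear
set seed 12345
set obs 200
generate id = _n
generate treat = (id <= 100)
expand 6
bysort id: generate year = 2004 + _n
generate post = (year >= 2009)

* Outcome with a common trend and a treatment effect of 2 after the policy
generate y = 1 + 0.5*treat + 0.2*(year - 2005) + 2*treat*post + rnormal()

* Pre-trend check: in pre-policy years only, allow the linear trend
* to differ by group and test whether that difference is zero
regress y i.treat##c.year if post == 0, vce(cluster id)
test 1.treat#c.year
```

A large p-value here is consistent with parallel pre-trends but does not prove the assumption holds after the policy date.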
Regression Discontinuity
Recipe 4: Sharp RD design around a cutoff point
* Sharp Regression Discontinuity around cutoff c=50
clear all
use "rd_sample.dta", clear
* Create treatment indicator based on cutoff
generate D = (running_var >= 50) if !missing(running_var)
* Optimal bandwidth RD (requires rdrobust package)
* ssc install rdrobust, replace
rdrobust outcome running_var, c(50)
* Manual local linear regression (for illustration)
* Restrict to bandwidth around cutoff
keep if inrange(running_var, 45, 55)
* Center running variable at cutoff
generate centered = running_var - 50
* Local linear regression
regress outcome i.D c.centered##i.D, vce(robust)
* Treatment effect is coefficient on D
lincom 1.D
Panel Fixed Effects
Recipe 5: Two-way fixed effects with clustered standard errors
* Panel Fixed Effects with clustering
clear all
use "panel_data.dta", clear
* Set panel structure (id must be numeric; encode a string id first)
xtset id year
* Two-way fixed effects (requires reghdfe)
* ssc install reghdfe, replace
reghdfe wage educ exper, absorb(id year) vce(cluster id)
* Alternative: manual fixed effects
* Unit FE with year dummies
areg wage educ exper i.year, absorb(id) vce(cluster id)
* Check for serial correlation
* xtserial wage educ exper
* Test fixed effects necessity
* xtreg wage educ exper, fe
* estimates store fe
* xtreg wage educ exper, re
* estimates store re
* hausman fe re
Logistic Regression
Recipe 6: Binary outcome models with marginal effects
* Logistic regression for binary outcomes
clear all
use "binary_outcome.dta", clear
* Basic logistic regression (logistic reports odds ratios; use logit for coefficients)
logistic employed educ exper, vce(robust)
* Marginal effects at means
margins, dydx(*) atmeans
* Average marginal effects
margins, dydx(*)
* Predicted probabilities by education level
margins, at(educ=(10(2)18)) atmeans
* Goodness of fit
estat classification
estat gof, group(10)
Prompt Modification Patterns
Common patterns for refining and adjusting your prompts to get exactly what you need.
Scope Control
- Return one Stata code block only; robust SEs; no graphs; minimal comments.
- Include data loading, estimation, and basic diagnostics only.
- Skip exploratory analysis; focus on the main regression.
Schema Matching
- Match these exact variable names: [paste describe output or list]
- Use my dataset structure: wage (numeric), educ (numeric), id (string).
- Variables are already cleaned; no missing value handling needed.
Estimator Swapping
- Rewrite using ivregress 2sls with instrument Z; include first-stage diagnostics.
- Convert to fixed effects: reghdfe with unit and year absorption.
- Change to logistic regression with marginal effects.
Standard Errors
- Cluster at id level; two-way FE for id & year.
- Use robust standard errors for cross-sectional data.
- Bootstrap standard errors with 1000 replications.
Quality Control Checklist
Before Publishing: Systematic verification steps to ensure your code is correct, robust, and reproducible.
✅ Pre-Publication Checklist
Additional Resources
Helpful tools, packages, and references to enhance your econometric workflow.
Essential Packages
- reghdfe: High-dimensional FE
- rdrobust: RD estimation
- estout: Table formatting
- ietoolkit: Impact evaluation
Best Practices
- Keep dated do-files
- Document data sources
- Set project-root paths
- Version control with Git
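The first and third practices can be combined in a few lines at the top of a master do-file. This is a sketch under stated assumptions: Stata is started from the project's top folder, and the `logs` subfolder and the `analysis_` file prefix are hypothetical names, not a required layout.

```stata
* Assumption: Stata is started from the project's top folder
global root "`c(pwd)'"

* Keep logs in a subfolder; capture ignores the error if it already exists
capture mkdir "$root/logs"

* Build a date stamp in year-month-day form for file names
local today : display %tdCCYY-NN-DD date(c(current_date), "DMY")

* Close any log left open, then start a dated one
capture log close
log using "$root/logs/analysis_`today'.smcl", replace

display "Project root: $root"

log close
```

Referring to every file as `"$root/..."` lets collaborators run the same do-files after changing a single line.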
Ethics & Attribution
- You verify all results
- Cite your code, not the AI
- Document methodology clearly
- Share replication files