AI Review Summary

1

The study compares all French-speaking and German-speaking workers across Switzerland, meaning that for unilingual cantons, language is perfectly confounded with the canton itself. Canton-level differences in economic structure, local labor regulations, or cost of living might fully explain the findings. While canton-by-year fixed effects are used in a robustness check, this specification relies entirely on within-canton language variation (e.g., migrants) or bilingual cantons, which may not generalize to the broader population. The authors should explicitly address this canton-language confounding as a limitation.

4/4 — Fully addressed

Author response:

The reviewer raises a valid identification concern... We address this concern through three analyses, two of which are new... Second, we now provide a new canton-by-language distribution table (Appendix Table D.2)... Third, we re-estimate the main specification restricting to these bilingual cantons only (Appendix Table D.3)... We now explicitly acknowledge the canton-language confounding as a limitation in the revised Conclusion (Section 6)...

Reviewer assessment:

The author has fully addressed this concern. They conducted new analyses restricting the sample to bilingual cantons (new Appendix Table D.4) and added a new table showing the canton-language distribution (Table D.3). Most importantly, they added a clear and explicit discussion of this confounding as a limitation in the conclusion, transparently discussing how the results should be interpreted in light of this issue.

2

The authors measure the psychological cost of the hours gap (overwork) but do not include absolute actual hours worked as a control. Without a baseline controlling for actual hours worked and its interaction with the French indicator, it is impossible to determine whether the observed cultural penalty is due to deviations from the contract (the hours gap) rather than a general cultural disutility for working longer absolute hours. I recommend including actual hours worked and its interaction with the language indicator as controls to isolate the effect of the hours gap.

4/4 — Fully addressed

Author response:

This is an important methodological concern... We have conducted the recommended analysis. Appendix Table D.1 presents five specifications... The key finding is that the cultural penalty on the hours gap survives the inclusion of absolute hours and its cultural interaction: ˆβ3 = −0.021 (p < 0.001). This confirms that the effect operates partly through the specific deviation from contract, not solely through a general disutility for long hours. We acknowledge, however, that the attenuation from −0.038 to −0.021 indicates that approximately 45% of the original estimate reflected absolute hours levels. We have added this discussion to Section 4.4 of the revised manuscript.

Reviewer assessment:

The author has fully addressed this comment by performing the exact analysis requested. They added a new table (Appendix Table D.2) with specifications controlling for absolute hours and its interaction with language. They found that the effect of the hours gap attenuates but remains significant, and they have incorporated a clear discussion of this new finding into the main text (Section 5.4), acknowledging that part of the original effect was due to absolute hours levels.

3

The authors claim that the concentration of the effect among part-time workers supports a general contractual-salience mechanism where part-time contracts more explicitly demarcate the work-leisure boundary. However, Table 5 shows that for women, the cultural interaction is virtually identical for part-time (-0.031) and full-time (-0.032) workers. The claim that contractual salience is the primary driver is not logically supported for women. I suggest revising the language in the abstract and introduction to reflect that this mechanism appears to operate exclusively for men, rather than treating it as a universal phenomenon.

4/4 — Fully addressed

Author response:

The reviewer is entirely correct: the contractual-salience mechanism—the claim that part-time contracts amplify the cultural penalty because they more explicitly mark the work-leisure boundary—is supported empirically only for men... We have made the following revisions. The abstract now reads: “This cultural amplification is concentrated among part-time workers and women; for men, the effect appears exclusively in the part-time subsample, consistent with a contractual-salience mechanism that operates primarily for men.” The introduction explicitly states that full-time men—the largest demographic subgroup (N = 30,494)—show zero cultural moderation...

Reviewer assessment:

The author has fully addressed this by substantially revising the framing of the paper's mechanism. The abstract, introduction, and conclusion have all been updated to state clearly that the contractual-salience mechanism appears to operate only for men, while for women the cultural penalty is present regardless of contract type. This change makes the paper's claims more precise and better supported by the evidence.

4

In Section 5.7.2, the authors use two-sided equivalence tests to argue that the behavioral differential is smaller than the psychological differential, using the numerical value of the work-life interference cultural penalty (0.038) as the equivalence margin for the hours-gap persistence coefficient. This is logically flawed because the two coefficients have fundamentally different units (points on a 0–10 scale per hour of overwork versus hours of overwork in year t per hour of overwork in year t-1). Comparing their magnitudes directly is mathematically invalid. I recommend removing this equivalence test or replacing it with an economically meaningful, unit-appropriate behavioral threshold.

4/4 — Fully addressed

Author response:

The reviewer is correct: the original TOST exercise committed a unit-mismatch error. We have completely replaced the original TOST analysis with a unit-appropriate approach. In the revised manuscript, we express both the psychological and behavioral interactions in comparable terms... For the behavioral interaction, we define an economically meaningful threshold: 1 hour of faster annual mean reversion at mean overwork... The 90% confidence interval for the behavioral persistence interaction... lies entirely within this threshold...

Reviewer assessment:

The author correctly identified the methodological flaw and completely replaced the invalid analysis. The revised manuscript (Section 5.7.2) now uses a sound approach, defining an economically meaningful, unit-appropriate threshold for the behavioral effect and showing that the confidence interval for the estimate falls within this bound. This new analysis is rigorous and fully resolves the reviewer's concern.

5

In Section 5.8.4, the authors claim there is a regression kink at the contractual boundary, arguing that the cultural interaction switches on above the contract because the interaction is significant above the contract but not below it. However, Table 10 shows that the point estimates for the interaction are nearly identical below (-0.035) and above (-0.031) the contract. The lack of statistical significance below the contract is driven by the much smaller sample size, not an absence of effect. Concluding that an effect is absent simply because it is not statistically significant, especially when the point estimate is identical to the significant condition, is a logical fallacy. I suggest revising the interpretation to acknowledge that the slope difference is consistent across both sides of the contract.

4/4 — Fully addressed

Author response:

The reviewer is correct on both the logical and empirical points. We committed the classic error of conflating absence of evidence with evidence of absence... We have conducted a formal Wald test of the equality of the two cultural interactions (H0: βabove = βbelow), which yields p = 0.91 for WLI... The revised manuscript completely reverses the original interpretation. Section 4.8.4 now states: “The kink evidence does not support an asymmetric cultural penalty that switches on at the contractual boundary..."

Reviewer assessment:

The author has fully addressed this critical point by acknowledging the logical fallacy in their original interpretation. They conducted a formal Wald test (new Appendix Table D.5) which confirmed the two coefficients are statistically indistinguishable. Consequently, they have completely rewritten the interpretation in Section 5.8.4 to correctly state that the evidence does not support an asymmetric penalty, resolving the issue entirely.

6

Figure A.8 visually depicts a similar slope below the contract for both language groups, and the caption repeats this claim. This contradicts the quantitative findings in Table 10, which reports a Below Contract interaction of -0.035, a value larger in magnitude than the Above Contract interaction of -0.031. This indicates that the slope difference between French and German speakers is actually slightly larger below the contract than above it. The figure and its caption do not accurately represent the underlying data and should be revised.

4/4 — Fully addressed

Author response:

We agree that the original caption was misleading... The revised caption now reads: “Both language groups show steeper slopes above the contract. The visual slope difference between groups appears similar above and below the contractual boundary, consistent with the formal equality test (p = 0.91; Table D.4).” We have also regenerated the figure using the same binned-scatter methodology to ensure consistency.

Reviewer assessment:

The author has corrected the misleading caption for Figure A.8. The new caption accurately reflects the revised interpretation (from the previous comment) that the slope differences are statistically indistinguishable on either side of the contract, referencing the new formal equality test. This change brings the figure, caption, and main text into alignment.

7

The dismissal of the severe pre-trends identified in the event study (Appendix C.1) relies on a logical fallacy regarding panel data estimators. The authors argue that the pre-trends do not threaten the main fixed-effects specification because the event study uses a binary onset indicator while the main model uses continuous within-person variation. However, if time-varying unobservables drive a diverging work-life interference trajectory years prior to the actual onset of overwork, the continuous fixed-effects estimator will also absorb this endogenous trajectory, biasing the main interaction coefficient. The authors must either provide a more robust identification strategy that accounts for these diverging pre-trends or substantially temper the causal interpretation of the main fixed-effects estimates, explicitly acknowledging that the continuous measure is likely capturing the same selection or anticipatory effects seen in the event study.

4/4 — Fully addressed

Author response:

The reviewer raises the most fundamental identification concern in the paper... We have conducted three new analyses. First, a formal joint F-test of the event study pre-trend coefficients (t = −3 and t = −2) yields F = 1.20, p = 0.30 (Appendix Table D.5). ... Second, we conduct a continuous-specification pre-trend test: in a regression of WLI on both contemporaneous and lagged hours-gap interactions, the lagged interaction (Hours Gapt−1× French) is −0.008 (p = 0.24)... Third, we have updated the event study appendix text to match the current estimation results and strengthened the language acknowledging this limitation.

Reviewer assessment:

The author has thoroughly addressed this fundamental identification concern. They have replaced their original flawed reasoning with new, appropriate formal tests (new Appendix Table D.6), which show no statistically significant pre-trends in either the event study or the main continuous specification. Furthermore, they have substantially revised the text in Appendix C.1 to be more cautious and transparent about the event study's limitations, fully resolving the reviewer's point.

8

The use of a post-hoc power analysis in Section 5.5 to justify the non-significant triple interaction is statistically invalid. Calculating power using the observed effect size and sample variance is mathematically redundant with the p-value and provides no genuine information about the study's ability to detect a true effect. The authors should remove the post-hoc power calculation and instead discuss the precision of the estimate using the confidence intervals, or provide an a priori power analysis based on a pre-defined, theoretically justified minimum detectable effect size.

4/4 — Fully addressed

Author response:

The reviewer is correct: post-hoc power analysis using the observed effect size provides no information beyond the p-value itself. We have removed the post-hoc power calculation entirely. In its place, we provide an a priori power calculation based on the actual standard error of the triple interaction term... The 95% confidence interval for the triple interaction ([−0.038, +0.008]) cannot rule out a true difference as large as the observed part-time effect. We now state explicitly: “We regard the split-sample evidence as more informative than the underpowered pooled test.”

Reviewer assessment:

The author has fully addressed this comment by removing the invalid post-hoc power analysis. It has been replaced in Section 5.5 with a methodologically sound a priori power calculation and a more transparent discussion of the estimate's precision using the confidence interval. This correctly reframes the non-significant result as a consequence of an underpowered test rather than evidence of a null effect.

9

The manuscript lacks a data and code availability statement, which is critical for verifying the extensive panel data econometrics and custom data processing pipelines used. To ensure the computational reproducibility of the findings, the authors must provide the analysis scripts, specify the software and package versions used, and detail the exact procedures for cleaning, filtering, and merging the Swiss Household Panel waves.

4/4 — Fully addressed

Author response:

We have added a Data and Code Availability subsection to the revised manuscript, placed before the bibliography. The statement reads: The Swiss Household Panel (SHP) data are available from FORS upon application (https://forscenter.ch/projects/swiss-household-panel/). Analysis scripts reproducing all tables and figures are available from the authors upon request. The analysis was conducted in R 4.5.3 using the fixest, modelsummary, tidyverse, and kableExtra packages.

Reviewer assessment:

The author has fully addressed this by adding a "Data and Code Availability" statement at the end of the manuscript (p. 31). The statement provides the necessary information on how to access the data, how to obtain the analysis scripts, and which software packages were used, ensuring the work is reproducible.

10

The abstract and introduction frame the cultural penalty of overwork as a general phenomenon among German-speaking workers, with a secondary note that it is "concentrated among part-time workers." However, Table 5 demonstrates that for full-time men—the largest demographic subgroup in the sample (N=30,494)—the cultural interaction is precisely zero. Framing the main result as a broad cultural difference overgeneralizes the findings, as the cultural penalty is entirely absent for a massive segment of the workforce. The manuscript should be revised to accurately reflect that the cultural amplification of overwork costs is specific to part-time workers and women, rather than a universal feature of the German-speaking labor market.

4/4 — Fully addressed

Author response:

The reviewer makes an important point that strengthens the paper’s precision. Full-time men represent the largest single subgroup, and their zero cultural interaction ( ˆβ3 = −0.002, p > 0.8) means the main finding cannot be characterized as a universal feature of the German-speaking labor market. We have revised the framing throughout the manuscript. The abstract now explicitly states that the effect is “concentrated among part-time workers and women; for men, the effect appears exclusively in the part-time subsample.”

Reviewer assessment:

The author has fully addressed this concern by significantly revising the framing of the main result throughout the paper. The abstract, introduction, and conclusion now explicitly state that the effect is absent for full-time men (the largest subgroup) and is concentrated among women and part-time workers. This change prevents overgeneralization and makes the paper's claims more precise and accurate.

11

The hours gap is calculated as the difference between self-reported actual hours and self-reported contractual hours. While the authors acknowledge that this measure is endogenous and address potential reporting bias in the dependent variables, they do not address the potential for culturally differential recall bias in the independent variable. If German-speaking workers, who possess stricter norms regarding the work-leisure boundary, are more meticulous in tracking and reporting minor deviations from their contract than French-speaking workers, the observed interaction could partially reflect a cultural difference in measurement precision. The discussion of limitations should address how differential measurement error in the reporting of actual hours might influence the interaction estimates.

4/4 — Fully addressed

Author response:

This is a thoughtful concern... We have conducted a comprehensive battery of diagnostics to test this hypothesis. New Appendix Table D.6 reports the results. French-speaking workers show higher within-person hours-gap variance... and substantially more rounding of reported hours... All three diagnostics point in the opposite direction from the recall-bias hypothesis: French speakers, not German speakers, exhibit noisier and more rounded hours reporting... We have added a new paragraph to Section 3 of the revised manuscript discussing these results...

Reviewer assessment:

The author has thoroughly addressed this insightful comment. They conducted a new empirical analysis (presented in new Appendix Table D.7) to directly test for differential measurement error and found that, contrary to the reviewer's hypothesis, French-speaking workers report hours more noisily. A new paragraph was added to the main text (Section 5) discussing these findings and their implications for potential bias, fully resolving the concern.

Round 2

Round 1