Abstract: Objective: To enable early identification of pulmonary nodules and visual interpretation of key variables using interpretable machine learning, and thereby to support precise prevention, control, early diagnosis, and treatment of lung cancer. Methods: This study enrolled individuals at high risk of lung cancer who had completed clinical screening, and extracted their high-risk assessment data and imaging results. Participants were divided into high-risk and low-risk pulmonary nodule groups according to China’s Lung Cancer Screening Standard (T/CPMA 013-2020). Variables that differed between the groups in univariate analysis were used as predictors, with pulmonary nodule group as the dependent variable, to construct an interpretable XGBoost-SHAP identification framework for early nodule detection and visual interpretation of the results. Results: A total of 644 high-risk individuals were included, of whom 199 (30.9%) were in the high-risk pulmonary nodule group. For nodule grouping, the XGBoost model achieved an accuracy of 0.9146, sensitivity of 0.7587, specificity of 0.9843, F1-score of 0.8458, and AUC of 0.9741. SHAP analysis showed that higher SHAP values, indicating an increased risk of nodule enlargement, were associated with greater smoking intensity, exposure to secondhand smoke from colleagues or family members, infrequent kitchen ventilation during cooking, excessive intake of processed foods, occupational exposure to asbestos or radon, insufficient intake of protein, fruits, and vegetables, and manual labor occupations. Conclusion: The constructed interpretable framework performs well in the early identification of pulmonary nodules. Changes in nodule size are associated not only with traditional risk factors (e.g., smoking habits, secondhand smoke exposure, cooking fume exposure, and occupational asbestos or radon exposure) but also with participants’ dietary habits.
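The reported performance metrics can be cross-checked against one another using only their definitions. The following is a minimal sketch rather than the authors' evaluation code. It assumes, which the abstract does not state, that the metrics were computed on data with the same high-risk prevalence as the full cohort (199 of 644). All variable names are illustrative.

```python
# Cross-check of the reported metrics using their definitions only.
# ASSUMPTION (not stated in the abstract): the evaluation data have the same
# high-risk prevalence as the full cohort, i.e. 199 / 644.

n_total = 644          # participants included (from the abstract)
n_high_risk = 199      # high-risk pulmonary nodule group (from the abstract)
sensitivity = 0.7587   # reported
specificity = 0.9843   # reported

prevalence = n_high_risk / n_total

# Confusion-matrix cells expressed as fractions of the evaluated sample.
tp = sensitivity * prevalence
fn = (1 - sensitivity) * prevalence
tn = specificity * (1 - prevalence)
fp = (1 - specificity) * (1 - prevalence)

# Metrics implied by sensitivity, specificity, and prevalence.
accuracy = tp + tn
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"prevalence       = {prevalence:.4f}")
print(f"implied accuracy = {accuracy:.4f}")
print(f"implied precision= {precision:.4f}")
print(f"implied F1-score = {f1:.4f}")
```

Under this assumption, the implied accuracy and F1-score come out within about 0.001 of the reported 0.9146 and 0.8458, so the reported figures appear mutually consistent. The AUC cannot be checked this way because it depends on the full ranking of predicted scores rather than on a single decision threshold.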