Analysis of Pulmonary Function Test Results By Using Gaussian Mixture Regression Model

Background: The FEV1/FVC value is used in the diagnosis of obstructive and restrictive diseases of the lung. The literature reports that it varies with lung disease as well as with weight, age and gender. Objective: The aim of this study is to investigate the relationship between age, weight, gender and height and the FEV1/FVC value in a heterogeneous population using the Gaussian mixture regression (GMR) method. Material and methods: GMR was used to separate the data into components and to estimate parameters for each component. The analysis performed with this model showed that the patients were optimally divided into 5 groups and that these groups showed a regular transition from an obstructive pattern to a restrictive pattern. Results: The mean values of the components for FEV1/FVC were found to be 50.071 (3.238), 67.034 (1.725), 82.156 (1.329), 93.592 (1.041), 98.466 (0.303), respectively. The parameter estimates (standard errors) for the effect of weight in the components were 0.445 (0.129)**, 0.226 (0.053)**, 0.173 (0.053)**, -0.036 (0.026), -0.040 (0.018)*, respectively. Conclusion: A directly proportional relationship was shown between the patient's weight and the severity of the obstructive pattern, and between the severity of the disease and the age of the patient in both the obstructive and restrictive patterns. Furthermore, it was shown that data sets containing heterogeneity can be analysed by dividing them into sub-components using the GMR model.


Introduction
The pulmonary function test (PFT) is diagnostic for obstructive pulmonary diseases (such as asthma and chronic obstructive pulmonary disease (COPD)), restrictive lung diseases (parenchymal diseases, respiratory muscle diseases, chest wall diseases, pulmonary edema, congestive heart failure) and extra-thoracic airway stenosis (tracheal obstruction, vocal cord paralysis) [1]. PFT is used to diagnose respiratory system diseases and to monitor the progression of patients with known respiratory diseases [2]. The ratio of forced expiratory volume in 1 second (FEV1) to forced vital capacity (FVC), written FEV1/FVC, is a parameter measured in the PFT and used for the diagnosis of respiratory system diseases. FEV1/FVC is the most important PFT parameter, especially for distinguishing obstructive from restrictive diseases of the lung [3]. FEV1 and FVC are two of the values measured during PFT. FVC is the maximum volume of air that can be exhaled after a full inspiration. It can also be considered as the maximum breath volume of the lung (forced inspiratory vital capacity). FEV1 is the volume of air that can be exhaled in the first second after a full inspiration [4].
When evaluating PFT results, the FEV1/FVC ratio is considered first, in order to make the obstructive-restrictive distinction [5]. The FEV1/FVC ratio, also called the Tiffeneau-Pinelli index, is used to differentiate obstructive from restrictive lung diseases. In healthy adults, the FEV1/FVC ratio (FEV1%) should be about 70-80% [4]. The GOLD criteria recommend that patients with FEV1/FVC below 70% be evaluated as having an obstructive disease [6].
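As a minimal sketch of the arithmetic behind this first step, the ratio can be computed from the two measured volumes and compared with a fixed cut-off. The function names and the default threshold argument are illustrative assumptions; the 70% value is the GOLD cut-off cited above, and the code is not a diagnostic rule.

```python
def fev1_fvc_percent(fev1_l: float, fvc_l: float) -> float:
    """Return FEV1/FVC as a percentage, given both volumes in litres."""
    if fvc_l <= 0:
        raise ValueError("FVC must be positive")
    return 100.0 * fev1_l / fvc_l


def below_fixed_cutoff(ratio_percent: float, cutoff: float = 70.0) -> bool:
    """True if the ratio falls below the fixed cut-off (70% by default)."""
    return ratio_percent < cutoff
```

For example, `below_fixed_cutoff(fev1_fvc_percent(2.0, 4.0))` flags a ratio of 50% as below the fixed cut-off.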
In obstructive diseases, airflow from the lungs becomes difficult due to airway obstruction, the expiratory time is extended and the maximum volume of air (FEV1) blown in the first 1 second decreases [7]. FEV1 decreases in obstructive diseases (asthma, COPD, chronic bronchitis, emphysema) due to increased airway resistance in expiration. The main problem in restrictive diseases is fibrosis in the alveolar lung tissue. The compliance of the lung decreases due to this fibrosis. In restrictive diseases, both FEV1 and FVC are reduced. Since the decrease in FVC is greater than the decrease in FEV1, the ratio of FEV1 / FVC increases [8].
In the guidelines of the American Thoracic Society (ATS) published in 1987, it was reported that the FEV1/FVC value should be below 75% for the diagnosis of obstructive pulmonary disease [7]. Later, the guideline published by the British Thoracic Society (BTS) in 1997 proposed 70% as the upper limit of the FEV1/FVC value for the diagnosis of obstructive pulmonary disease [9]. In a study investigating ways to develop reference ranges for the relationship between spirometric lung function and height and age, the results were modeled in terms of gender, age and height [10]. In a study that attempted to develop two separate expected-value formulas for men and women, so that spirometric test results could be interpreted equally for both genders, age and height variables were also used [11]. In relation to these variables, Swanney et al. stated that it was not appropriate to use a standard cut-off value for FEV1/FVC [12].
Studies on determining the ranges of FEV1/FVC values that should be accepted as normal in adults are available in the literature [12][13][14][15].
The aim of this study is to investigate the relationship of age, weight, gender and height with FEV1/FVC values in a heterogeneous population using the Gaussian mixture regression method, and to summarize all the data with a multivariate statistical method.

Materials and methods

PFT equipment
PFT measurements were made with a Spirobank II device (MIR-Medical International Research USA, Inc., 5462 S. Westridge Drive New Berlin, WI 53151-USA). Throughout the period in which the study data were obtained, all PFTs were performed by the same technician. The device was calibrated once a day, following automatic procedures.

Study Sample
Individuals who were admitted to a secondary hospital between May 1, 2018 and May 31, 2018 and underwent a pulmonary function test were included in the study. A dataset was created by retrospectively scanning PFT records. Age, gender, weight and height information on the records, together with the PFT results, was recorded. Records missing at least one of these items were excluded from the study. After the records with incomplete data were removed from the dataset, the analysis was carried out with the data of the remaining 171 people. FEV1/FVC was defined as the dependent variable, and height, age, weight and gender as the independent variables.
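The exclusion step described above amounts to a complete-case filter. The following sketch assumes the records are held as a list of dictionaries; the field names in `REQUIRED` are hypothetical and are not taken from the study's dataset.

```python
# Hypothetical field names for the dependent and independent variables.
REQUIRED = ("fev1_fvc", "age", "gender", "weight", "height")


def complete_cases(records):
    """Keep only the records in which every required field is present."""
    return [
        r for r in records
        if all(r.get(field) is not None for field in REQUIRED)
    ]
```

A record that lacks any one of the five fields, or stores it as `None`, is dropped; all other records pass through unchanged.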

Statistical analysis
When many variables must be examined, classical statistical techniques such as clustering and factor analysis become difficult to apply. In medical studies, observed events depend on many factors, and these factors are interrelated. To obtain valid and reliable results, it is recommended that all variables related to the research be included in the analysis. In this setting, the mixture model is a powerful multivariate statistical tool that can be used effectively.

Finite mixture models
In multivariate statistical analysis techniques such as factor and cluster analysis, multiple variables are grouped in terms of their effect on the dependent variables. Ideally, this grouping yields both high heterogeneity between groups and high homogeneity within groups [16]. In recent years, the mixture model approach has been used extensively in multivariate statistics. Mixture models have two important advantages over clustering and factor analysis. The first is that the probability that each observation belongs to each component can be calculated. The second is that parameter estimates can be calculated for each component, because mixture modelling also performs a multiple regression analysis specifically for each component after dividing the data into components. Gaussian mixture regression analysis is used if the dependent variable follows a normal distribution [17].
Another purpose of using mixture models is to minimize the probability of misclassification and to determine the populations from which the observations were sampled. In mixture models, parameters are estimated by the maximum likelihood (ML) method using the expectation-maximization (EM) algorithm. Entropy is used to calculate the probability of correct classification in mixture modelling [16]. The Akaike and Bayesian information criteria are used to select the optimum number of components [18].

Theoretical background
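The component densities are assumed to belong to some parametric family. They are written as $f_i(y_j; \theta_i)$, where $\theta_i$ is the vector of unknown parameters in the hypothesized form of the $i$th component density in the mixture. The mixture density can be formulated as

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \tag{1}$$

where $g$ is the number of components and $\pi_i$ are the mixing proportions. The vector $\Psi$, containing all unknown parameters in the mixture model, can be defined as

$$\Psi = (\pi_1, \ldots, \pi_{g-1}, \xi^T)^T \tag{2}$$

In Eq. 2, the vector $\xi$ contains all the parameters in $\theta_1, \ldots, \theta_g$.
The parametric form of the mixture model is written as

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i), \quad j = 1, \ldots, n \tag{4}$$

where $y = (y_1, \ldots, y_n)$ is the observed random sample. The log-likelihood for $\Psi$ that can be formed from the observed data is given by

$$\log L(\Psi) = \sum_{j=1}^{n} \log \left\{ \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \right\} \tag{5}$$

The likelihood equation is written as

$$\frac{\partial \log L(\Psi)}{\partial \Psi} = 0 \tag{6}$$

The EM algorithm solves Eq. 6 iteratively, using Eq. 5.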
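The EM iteration for a mixture of regressions can be sketched as follows. This is an illustration under stated assumptions, not the software or settings used in the study: each component density is assumed to be normal with mean given by a linear regression on the covariates and its own variance, the starting memberships are obtained by sorting the response and splitting it into equal blocks, and the function name `fit_gmr`, the variance floor and the tolerances are all hypothetical choices.

```python
import numpy as np


def fit_gmr(X, y, g, n_iter=500, tol=1e-8):
    """EM for a g-component Gaussian mixture of linear regressions.

    X : (n, p) design matrix; include a column of ones for an intercept.
    y : (n,) response, e.g. FEV1/FVC in percent.
    Returns (pi, beta, sigma, tau, log_lik).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Illustrative start (an assumption): sort by y and split into g equal
    # blocks, giving each observation a hard initial membership.
    tau = np.zeros((n, g))
    for i, idx in enumerate(np.array_split(np.argsort(y), g)):
        tau[idx, i] = 1.0

    beta = np.zeros((g, p))
    sigma2 = np.ones(g)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: mixing proportions, then weighted least squares and a
        # weighted residual variance for each component.
        pi = tau.mean(axis=0)
        for i in range(g):
            w = tau[:, i]
            sw = np.sqrt(w)
            beta[i] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
            resid = y - X @ beta[i]
            # The floors below are numerical safeguards, not part of EM.
            sigma2[i] = max((w * resid**2).sum() / max(w.sum(), 1e-12), 1e-8)

        # E-step: posterior membership probabilities, computed in log space.
        resid = y[:, None] - X @ beta.T
        log_dens = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
        log_joint = np.log(np.maximum(pi, 1e-300)) + log_dens
        top = log_joint.max(axis=1, keepdims=True)
        log_mix = top + np.log(np.exp(log_joint - top).sum(axis=1, keepdims=True))
        tau = np.exp(log_joint - log_mix)

        # Observed-data log-likelihood (Eq. 5) at the current parameters.
        log_lik = float(log_mix.sum())
        if abs(log_lik - prev_ll) < tol:
            break
        prev_ll = log_lik
    return pi, beta, np.sqrt(sigma2), tau, log_lik
```

Calling `fit_gmr(X, y, g=5)` with a design matrix holding an intercept, height, age, weight and gender would return one coefficient row per component, together with the posterior membership matrix `tau`. EM converges to a local maximum, so in practice several starting values are usually compared.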

Model selection
In the mixture model approach, the AIC and BIC model fit statistics are widely used to determine the number of subgroups that are internally homogeneous [19][20][21]. The model with the smallest AIC and BIC values is considered to be the best representation of the data set [22]. However, entropy, which reflects the probability of correct classification, is also widely used in determining the model that best fits the data set. Under the entropy criterion, the model with the higher entropy value is considered the better model [23].
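The source does not state the formula used for the entropy criterion. One common choice, assumed here purely for illustration, is the normalized entropy $1 - \frac{\sum_j \sum_i -\tau_{ij} \ln \tau_{ij}}{n \ln g}$, where $\tau_{ij}$ is the posterior probability that observation $j$ belongs to component $i$. It equals 1 when every observation is assigned with certainty and 0 when the memberships are uniform. The name `relative_entropy` is hypothetical.

```python
import numpy as np


def relative_entropy(tau):
    """Normalized entropy of an (n, g) posterior membership matrix.

    Values near 1 indicate well-separated components.
    """
    tau = np.asarray(tau, dtype=float)
    n, g = tau.shape
    if g < 2:
        raise ValueError("at least two components are required")
    # Clip only inside the logarithm so that 0 * log(0) contributes 0.
    safe = np.clip(tau, 1e-300, 1.0)
    total = -(tau * np.log(safe)).sum()
    return 1.0 - total / (n * np.log(g))
```

Under this definition, a reported value such as 92.5% would be read as memberships that are close to certain for most observations.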
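AIC and BIC were used to determine the optimum number of components. AIC selects the model that minimizes

$$\mathrm{AIC} = -2 \log L(\hat{\Psi}) + 2d \tag{7}$$

where $\hat{\Psi}$ is the maximum likelihood estimate of $\Psi$ and $d$ is the number of free parameters in the model.
After Akaike, new approaches were proposed for using AIC to select the number of components in a mixture [23][24][25]. (8) Schwarz [26] suggested an approximation of Equation 8 that omits the term $O(1)$. (9) The Bayesian information criterion (BIC) for the number of components $g$ in the mixture model, which is to be as small as possible in model selection, is obtained by multiplying this approximation by minus two and is written as

$$\mathrm{BIC} = -2 \log L(\hat{\Psi}) + d \log n \tag{10}$$

where $n$ is the sample size.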
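The two criteria are simple functions of the maximized log-likelihood. The sketch below uses the standard definitions AIC = -2 log L + 2d and BIC = -2 log L + d log n. The parameter count in `n_free_parameters` is an assumption for a Gaussian mixture of regressions with component-specific coefficients and variances; the source does not report how its software counts parameters.

```python
import math


def n_free_parameters(g: int, p: int) -> int:
    """Assumed count: (g - 1) mixing proportions, plus p regression
    coefficients and one variance for each of the g components."""
    return (g - 1) + g * (p + 1)


def aic(log_lik: float, d: int) -> float:
    """Akaike information criterion; smaller is better."""
    return -2.0 * log_lik + 2.0 * d


def bic(log_lik: float, d: int, n: int) -> float:
    """Bayesian information criterion; smaller is better."""
    return -2.0 * log_lik + d * math.log(n)
```

With five components and five coefficients per component (an intercept plus height, age, weight and gender), this count gives d = 34. Models with different numbers of components are then fitted in turn and the one with the smallest criterion values is retained.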

Results
Surface plots of the independent variable pairs height-weight, age-height and age-weight versus FEV1/FVC are shown in Figure 1, Figure 2 and Figure 3, respectively. These plots were created by the biharmonic interpolation method. The figures show that the response variable FEV1/FVC has a heterogeneous structure with respect to the independent variables.
This heterogeneous structure was analyzed by separating the data into homogeneous components using the Gaussian mixture regression model. The Akaike and Bayesian information criteria used to determine the number of components, together with the entropy-based correct classification rates, are given in Table 1 and Figure 4.
For the 5-component model obtained using the normal mixture regression model, the number of individuals in each component and the mixing probabilities are shown in Table 1. The highest number of individuals was in component 5 and the lowest in component 1 (Figure 5, Figure 7). The parameter estimates for the 5-component model, which best describes the distribution of the data set (Figure 6), are given in Table 2.
In the first component, the effect of weight on FEV1/FVC was statistically significant (p<0.01), while the effects of height, age and gender were not (p>0.05). In the second component, the effects of weight and age were statistically significant (p<0.01), while those of height and gender were not (p>0.05). In the third component, the effect of height was not statistically significant (p>0.05), whereas the effects of age and gender (p<0.05) and weight (p<0.01) were. In the fourth component, the effect of gender was statistically significant (p<0.05), while the effects of weight, height and age were not (p>0.05). Finally, in the fifth component, the effect of weight was statistically significant (p<0.01), while the effects of height, gender and age were not (p>0.05) (Table 2). The mean values of the FEV1/FVC dependent variable and the estimated means and standard errors of the gender, height, weight and age variables are shown in Table 3.
The normal density functions of the five components, formed from the estimated regression parameters, are given in Figure 7. Based on the class sizes obtained by assigning each individual to the component with the highest class membership probability, the fifth component had the highest mixing ratio, at 45.61%.
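Class sizes of this kind can be obtained by assigning each individual to the component with the largest posterior membership probability. The sketch below assumes an (n, g) matrix of posterior probabilities, here called `tau`; both that name and the function name are hypothetical.

```python
import numpy as np


def class_sizes(tau):
    """Hard-assign each row to its most probable component.

    Returns the labels, the count per component and the share per component.
    """
    tau = np.asarray(tau, dtype=float)
    labels = np.argmax(tau, axis=1)
    counts = np.bincount(labels, minlength=tau.shape[1])
    return labels, counts, counts / counts.sum()
```

As a rough check on scale, a share of 45.61% in a sample of 171 would correspond to about 78 individuals (0.4561 × 171 ≈ 78); the exact counts are those reported in Table 1.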

Discussion
The FEV1/FVC value is expected to decrease below 70% in obstructive pulmonary diseases, while it is expected to be normal or to increase above 75% in restrictive lung diseases [12][27][28][29]. The FEV1/FVC value for the diagnosis of COPD is expected to be below 70% [28]. In obstructive disorders, there is a disproportionate decrease in the maximum volume of air exhaled from the lung. This represents airway contraction during exhalation and is defined by the FEV1/FVC ratio being less than 70% of the source values [29].
In this study, the AIC and BIC values decreased as the number of components increased up to the 5-component model and increased thereafter. Thus, the smallest AIC and BIC values and a high entropy value, indicating a high correct classification rate, were obtained with the 5-component model (Figure 4). The entropy criterion was 92.5% in the 5-component model selected as the best-fitting model. The FEV1/FVC variable increases with the component index: across the components, a regular and significant transition from low to high FEV1/FVC values is observed.
The first component stands out as the component in which the obstructive pattern is most prominent. Considering that the average of the weight variable in component 1 is higher than the average in the other components, it can be said that FEV1/FVC is inversely proportional to weight in patients with low FEV1/FVC. The relationship between obesity and COPD has not yet been clearly determined. In 2014, García-Rio et al. reported that obesity was more prevalent in patients diagnosed with COPD than in the general population [28]. There are studies reporting improvement in PFT results after weight loss in patients undergoing bariatric surgery [29]. In a study comparing the FEV1/FVC value of obese children with that of normal-weight children, it was reported that the FEV1/FVC value decreased in proportion to the increase in BMI (Body Mass Index) [30]. A study published in 2015 reported that using a fixed FEV1/FVC ratio of 0.70 resulted in underdiagnosis of COPD among overweight patients [31]. A review by Dixon et al. reported that obesity may cause some changes in the mechanical functions of the lungs and chest wall. In addition, it was concluded that obesity affects lung function through an inflammatory pathway by increasing inflammatory cytokines [32]. It should be noted that the relationship between obesity and FEV1/FVC is related not only to weight gain but also to its disease-causing effects.
When the components are ranked from small to large in terms of FEV1/FVC, the second component, with an FEV1/FVC value of 67.034, is in second place. In this component, the effects of the weight and age variables on FEV1/FVC were found to be statistically significant (p<0.01). It was determined that the FEV1/FVC value decreased by 0.226 units for each one-unit increase in age and increased by 0.226 units for each one-unit increase in weight. The second component is composed of individuals with an FEV1/FVC value close to 70%. The minimum and maximum FEV1/FVC values of this component converge to 70%.
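To make the size of the weight effect in the second component concrete, the reported coefficient can be read as the change in FEV1/FVC per unit of weight within that component. The worked example below assumes that weight is recorded in kilograms, that FEV1/FVC is expressed in percent, and that the other covariates are held fixed; the 10-unit increase is a hypothetical illustration, not a value from the study.

$$\Delta(\mathrm{FEV1/FVC}) = \hat{\beta}_{\text{weight}} \times \Delta\text{weight} = 0.226 \times 10 = 2.26$$

Under these assumptions, two otherwise similar individuals in the second component who differ by 10 kg would be expected to differ by about 2.26 percentage points in FEV1/FVC.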
When the effect of the age independent variable on FEV1/FVC is analyzed, the coefficient of the age variable is negative in the components where the average value of FEV1/FVC is low. However, the coefficient for the age variable turned positive in components where FEV1/FVC was close to the 70% threshold value (Table 2). As age increases in patients in this component, FEV1/FVC decreases more clearly than in other components. It is concluded that FEV1/FVC value significantly increases with increasing age in this component. In a study conducted by collecting data between 1998 and 2009 in the United States, it was concluded that the prevalence of COPD increases with age for both men and women [33].
Changes such as a significant decrease in lung elasticity, chest wall stiffness and loss of respiratory muscle strength are the physiological effects that occur in the respiratory system with aging [34][35][36]. These physiological changes were seen as the cause of low FEV1/FVC in the elderly population [12].
The effect of the height independent variable on FEV1/FVC was found to be insignificant in all components.
Considering the gender independent variable, its effect on the FEV1/FVC dependent variable was found to be statistically significant only in the third and fourth components. It was concluded that, in cases where FEV1/FVC values were slightly higher or slightly lower, the male sex was predisposed to form groups with moderately low FEV1/FVC, while the female sex was predisposed to form groups with moderately high FEV1/FVC. The third component stands out as the component in which the gender variable is statistically most significant. The relationship between COPD and gender has been investigated since 1987 [37]. McHugh et al. also reported that FEV1/FVC values differ significantly between men and women [38].

Limitations
Retrospective studies have some inherent limitations because the data are collected after the fact. Our study was conducted with data obtained from 171 patients, and the sample size is relatively small. Prospective studies with larger samples are needed to verify the results of this study. Smoking history was not included in the analysis, so the effect of smoking on FEV1/FVC was not investigated. We think that it would be useful to compare the results of this study with the results of studies with large populations that are homogenized and free from other effects.

Conclusion
In this study, the mixture regression model was used to understand the heterogeneity in the dataset. The mixture regression model gives useful results in the analysis of dependent variables that interact with multiple independent variables, both in determining the existence of the relationships and in measuring their strength. Among the characteristics examined in our study population, weight showed the strongest relationship with the FEV1/FVC value. The age variable showed a statistically significant relationship with the FEV1/FVC value. FEV1/FVC results also varied by gender, but these differences were smaller than those due to weight and age. On the other hand, no significant relationship was found between the height variable and the FEV1/FVC value in the analysis we performed on our population. Stronger results can be obtained if the relationships of FEV1/FVC with the age, weight and sex variables are investigated using a mixture regression model in larger populations. We believe it would be more meaningful to consider these relationships when interpreting the FEV1/FVC value.

Disclosures:
There is no conflict of interest for all authors.