<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
<ui>1742-5573-7-3</ui>
<ji>1742-5573</ji>
<fm>
<dochead>Research</dochead>
<bibl>
<title><p>Using variable importance measures from causal inference to rank risk factors of schistosomiasis infection in a rural setting in China</p></title>
<aug><au ca="yes" id="A1"><snm>Sudat</snm><mi>EK</mi><fnm>Sylvia</fnm><insr iid="I1"/><email>sk543@cal.berkeley.edu</email></au>
<au id="A2"><snm>Carlton</snm><mi>J</mi><fnm>Elizabeth</fnm><insr iid="I2"/><email>elizabeth_carlton@berkeley.edu</email></au>
<au id="A3"><snm>Seto</snm><mi>YW</mi><fnm>Edmund</fnm><insr iid="I2"/><email>seto@berkeley.edu</email></au>
<au id="A4"><snm>Spear</snm><mi>C</mi><fnm>Robert</fnm><insr iid="I2"/><email>spear@berkeley.edu</email></au>
<au id="A5"><snm>Hubbard</snm><mi>E</mi><fnm>Alan</fnm><insr iid="I1"/><email>hubbard@stat.berkeley.edu</email></au>
</aug>
<insg>
<ins id="I1"><p>Division of Biostatistics, University of California, Berkeley, USA</p></ins>
<ins id="I2"><p>Department of Environmental Health Sciences, University of California, Berkeley, USA</p></ins>
</insg>
<source>Epidemiologic Perspectives &amp; Innovations</source>
<issn>1742-5573</issn>
<pubdate>2010</pubdate>
<volume>7</volume>
<issue>1</issue>
<fpage>3</fpage>
<url>http://www.epi-perspectives.com/content/7/1/3</url>
<xrefbib><pubidlist><pubid idtype="pmpid">20626918</pubid><pubid idtype="doi">10.1186/1742-5573-7-3</pubid></pubidlist></xrefbib></bibl>
<history><rec><date><day>18</day><month>3</month><year>2009</year></date></rec><acc><date><day>14</day><month>7</month><year>2010</year></date></acc><pub><date><day>14</day><month>7</month><year>2010</year></date></pub></history><cpyrt><year>2010</year><collab>Sudat et al; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec><st><p>Abstract</p></st>
<sec><st><p>Background</p></st>
<p>Schistosomiasis infection, contracted through contact with contaminated water, is a global public health concern. In this paper we analyze data from a retrospective study reporting water contact and schistosomiasis infection status among 1011 individuals in rural China. We present semi-parametric methods for identifying risk factors through a comparison of three analysis approaches: a prediction-focused machine learning algorithm, a simple main-effects multivariable regression, and a semi-parametric variable importance (VI) estimate inspired by a causal population intervention parameter.</p>
</sec>
<sec><st><p>Results</p></st>
<p>The multivariable regression found only tool washing to be associated with the outcome, with a relative risk of 1.03 and a 95% confidence interval (CI) of 1.01-1.05. Three types of water contact were found to be associated with the outcome in the semi-parametric VI analysis: July water contact (VI estimate 0.16, 95% CI 0.11-0.22), water contact from tool washing (VI estimate 0.88, 95% CI 0.80-0.97), and water contact from rice planting (VI estimate 0.71, 95% CI 0.53-0.96). The July VI result, in particular, indicated a strong association with infection status: its causal interpretation implies that eliminating water contact in July would reduce the prevalence of schistosomiasis in our study population by 84% (95% CI 78%-89%), or from 0.3 to 0.05.</p>
</sec>
<sec><st><p>Conclusions</p></st>
<p>The July VI estimate suggests possible within-season variability in schistosomiasis infection risk, an association not detected by the regression analysis. Though there are many limitations to this study that temper the potential for causal interpretations, if a high-risk time period could be detected in something close to real time, new prevention options would be opened. Most importantly, we emphasize that traditional regression approaches are usually based on arbitrary pre-specified models, making their parameters difficult to interpret in the context of real-world applications. Our results support the practical application of analysis approaches that, in contrast, do not require arbitrary model pre-specification, estimate parameters that have simple public health interpretations, and apply inference that considers model selection as a source of variation.</p>
</sec>
</sec>
</abs>
</fm>
<bdy>
<sec><st><p>Background</p></st>
<p>Schistosomiasis is a parasitic disease affecting an estimated 200 million people in 76 countries <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Humans become infected with schistosomiasis following contact with water containing cercariae, the larval stage of the parasite. Infection can lead to liver fibrosis and portal hypertension, and may cause anemia <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>.</p>
<p>Recent studies have shown that the distribution of human schistosomiasis infections can be explained in part by spatial variability in water contact, particularly with respect to differences in cercarial density. For example, clusters of <it>Schistosoma hematobium </it>infections in rural Kenya were identified near water bodies with high numbers of cercaria-shedding snails <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Also, in contrast to water contact measures that ignore spatial variability in cercarial density, measures of water contact that adjust for estimated cercarial density at the site of contact have shown strong correlations with human infection intensity <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>.</p>
<p>Less attention has been paid to temporal variability in infection risk and to the variability in infection risk from specific water contact activities. While diurnal variations in the infectivity of cercariae have been recognized for decades, little is known about the variability in infection risk throughout the transmission season <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Li <it>et al. </it>observed two annual peaks in <it>S. japonicum </it>infection prevalence in the lower Yangtze basin <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. In the irrigated hillsides of southwest China, temporal fluctuations in both hydrology and snail populations have been documented, and may yield corresponding variation in infection risk throughout the transmission season <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. Specific water contact activities may also affect infection risk, due perhaps to the location in which these activities are performed and the parts of the body exposed. Several specific water contact activities have been associated with the prevalence of <it>S. hematobium </it>infection in Zanzibar and <it>S. mansoni </it>infection in C&#244;te d'Ivoire <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. However, neither analysis accounted for the duration or timing of water contact, and such relationships have not yet been examined for <it>S. japonicum</it>.</p>
<p>The two studies of <it>S. mansoni </it>and <it>S. hematobium </it>mentioned above examined numerous risk factors for infection using traditional correlation and multivariable regression techniques. The multivariable regression approach, while common, imposes an arbitrary model that limits the interpretation of results <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. For example, parameters from such models rarely have a simple interpretation in terms of the subject matter; they have meaning only within the arbitrarily specified model. Multivariable regression models can also yield misleading inference: treating an arbitrary model as correct ignores possible model misspecification, so variability is estimated incorrectly <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>.</p>
<p>In contrast to multivariable regression, semi-parametric variable importance measures inspired by parameters from the causal inference literature have two virtues: (1) they use machine learning algorithms to determine flexibly how to adjust for potential confounding variables without requiring arbitrary model pre-specification, and (2) they return a simple and interpretable measure of variable importance that, under additional assumptions, can also yield estimates of the effect of an intervention <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Such parameters have been referred to as population intervention parameters <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. This alternative to a traditional regression analysis is well suited to the exploratory analysis of high-dimensional data, where one wishes to investigate the independent association between one variable and an outcome in the presence of many correlated variables.</p>
<p>We analyzed data from a retrospective study in which 1011 individuals reported their water contact during the 2000 <it>S. japonicum </it>infection season in rural China; infection status in 2000 was also recorded for these individuals. Water contact was calculated using the estimated duration of water contact and the estimated body surface area in contact with water during the specific water contact activity. We aimed to explore the relative importance of different types of water contact, defined by both water contact activity and by the month in which the water contact occurred, on the probability of schistosomiasis infection. We analyzed these data in three ways: first, by applying a prediction (machine learning) algorithm; second, by using a simple multivariable regression; and third, by assessing variable importance using a causal inference-inspired population parameter. We discuss the results of each method, as well as the limitations of interpretation within the context of the method used.</p>
</sec>
<sec><st><p>Methods</p></st>
<sec><st><p>Data Collection</p></st>
<p>This research was conducted in Xichang County located in the southwest of Sichuan Province, China. The region is hilly with irrigated agriculture and historically high schistosomiasis infection prevalence. Twenty villages ranging in size from approximately 100 to 300 residents were selected to participate in a cross-sectional study to characterize determinants of schistosomiasis infection <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. In November 2000, all residents in the 20 villages were asked to participate in schistosomiasis infection surveys and in an interview to assess basic demographic characteristics including age, occupation and educational attainment. Participation rates were high: an estimated 90% of residents participated in these surveys. This research was conducted in close collaboration with the Xichang County Anti-Schistosomiasis Station and the Institute of Parasitic Diseases at the Sichuan Center for Disease Control. All participants provided verbal informed consent and human data collection protocols were approved by the Berkeley Committee for the Protection of Human Subjects and the Sichuan Institutional Review Board.</p>
<p>A 25% random sample of residents, stratified by village and occupation, was interviewed in person in November 2000 about their water contact patterns throughout the schistosomiasis transmission season. Participants were asked about eight different activities that involve contact with irrigation, pond or stream water each month from April through October: washing clothes or vegetables, washing agricultural tools, washing hands and feet, playing or swimming, irrigation ditch cleaning and water diverting, planting rice, harvesting rice and fishing. These water contact activities will be referred to subsequently as laundry, tool washing, bathing, swimming, ditch digging, rice planting, rice harvesting, and fishing, respectively. Participants were asked how often they performed each activity each month and for how many minutes each time, providing an estimate of water contact frequency and duration. Each activity was assigned an exposure intensity weight in order to account for differences in body surface area exposed. Field studies in the selected villages were conducted to observe which body parts were typically wetted for each water contact activity, and burn charts were used to estimate the percent of total body surface area accounted for by each exposed body part <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Water contact intensities were assigned as follows: laundry (0.05), tool washing (0.03), bathing (0.12), swimming (0.20), ditch digging (0.05), rice planting (0.05), rice harvesting (0.05) and fishing (0.32). Total body surface area was estimated to be 1.626 m<sup>2 </sup>for adults and 1.130 m<sup>2 </sup>for children aged 14 and under <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. For each activity <it>i </it>in month <it>k</it>, water exposure in minute-meters<sup>2 </sup>was calculated as:</p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i1"><m:mtable>
   <m:mtr>
      <m:mtd>
         <m:maligngroup/>
         <m:msub>
            <m:mi mathvariant="italic">WC</m:mi>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>k</m:mi>
            </m:mrow>
         </m:msub>
         <m:malignmark/>
         <m:mo>=</m:mo>
         <m:msub>
            <m:mi mathvariant="italic">Frequency</m:mi>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>k</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo>&#215;</m:mo>
         <m:msub>
            <m:mi mathvariant="italic">Duration</m:mi>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>k</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo>&#215;</m:mo>
         <m:msub>
            <m:mi mathvariant="italic">Intensity</m:mi>
            <m:mi>i</m:mi>
         </m:msub>
      </m:mtd>
   </m:mtr>
   <m:mtr>
      <m:mtd>
         <m:maligngroup/>
         <m:malignmark/>
         <m:mo>&#215;</m:mo>
         <m:mi mathvariant="italic">BodySurfaceArea</m:mi>
         <m:mo>.</m:mo>
      </m:mtd>
   </m:mtr>
</m:mtable>
</m:math></display-formula></p>
<p>An individual's water contact for each month was calculated by summing water exposure for all activities that month. Likewise, an individual's total water exposure for each activity was calculated by summing the activity-specific water exposure over the seven months. The total water contact over the entire period was also calculated. Because it was determined that only one infected individual had any water contact associated with rice harvesting, rice harvesting was excluded from the set of activity variables. This type of water contact was not excluded from the monthly water contact variables, or from the total water contact variables.</p>
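<p>As an illustration only, the water exposure calculation and the monthly, activity-specific and total sums can be sketched in Python as follows. The intensity weights and body surface areas are those stated above; the function names <it>water_contact </it>and <it>summarize</it>, and the example reports, are hypothetical and are not taken from the study data or the study code.</p>
<p><![CDATA[
```python
# Exposure intensity weights (fraction of body surface wetted), as listed in the text.
INTENSITY = {
    "laundry": 0.05,
    "tool washing": 0.03,
    "bathing": 0.12,
    "swimming": 0.20,
    "ditch digging": 0.05,
    "rice planting": 0.05,
    "rice harvesting": 0.05,
    "fishing": 0.32,
}

# Total body surface area in square meters, as listed in the text.
BSA_ADULT = 1.626
BSA_CHILD = 1.130  # children aged 14 and under


def water_contact(activity, frequency, duration_min, age):
    """Water exposure in minute-m^2 for one activity in one month.

    frequency: number of times the activity was performed that month
    duration_min: minutes of water contact on each occasion
    """
    bsa = BSA_ADULT if age > 14 else BSA_CHILD
    return frequency * duration_min * INTENSITY[activity] * bsa


def summarize(reports, age):
    """Sum exposure by month, by activity, and overall for one person.

    reports: list of (activity, month, frequency, duration_min) tuples.
    """
    by_month, by_activity = {}, {}
    for activity, month, frequency, duration_min in reports:
        wc = water_contact(activity, frequency, duration_min, age)
        by_month[month] = by_month.get(month, 0.0) + wc
        by_activity[activity] = by_activity.get(activity, 0.0) + wc
    total = sum(by_month.values())
    return by_month, by_activity, total


# Hypothetical reports for one adult (assumed values, not study data).
reports = [
    ("tool washing", "July", 10, 5),
    ("rice planting", "May", 4, 60),
    ("laundry", "July", 8, 20),
]
by_month, by_activity, total = summarize(reports, age=35)
print(by_month)
print(by_activity)
print(round(total, 3))
```
]]></p>
<p>The sketch does not apply the exclusion of rice harvesting from the activity variables described above; that step would amount to dropping one key from the activity-specific sums while leaving the monthly and total sums unchanged.</p>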
<p>At the same time as the water contact surveys, and corresponding with the end of the transmission season, schistosomiasis infection surveys were conducted using two different stool examination techniques. Participants submitted stool samples from three different days and each sample was examined using the miracidial hatch test according to Chinese Ministry of Health protocols <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. The Kato-Katz thick smear procedure was also used; three 41.5 mg slides were prepared from homogenized stool samples and examined for <it>S. japonicum </it>eggs <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Any person with a positive miracidial hatch test or at least one <it>S. japonicum </it>egg detected through Kato-Katz was classified as infected. All infected individuals were referred to local health officials for treatment with praziquantel.</p>
</sec>
<sec><st><p>Statistical Analyses</p></st>
<sec><st><p><it>Prediction Algorithm</it></p></st>
<p>In our first analysis, we used a machine-learning algorithm to choose the "best" set of infection predictors. The algorithm builds classification and regression trees by recursive partitioning, as implemented in the R function <it>rpart </it><abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. The algorithm was allowed to choose among all of the water contact variables defined above: the activity-specific variables, the monthly variables, and total water contact. Because the activity variables are sums over all months, the monthly variables are sums over all activities, and the total is the sum of all water contact over the entire study period, including these variables together would not make sense in an analysis that attempts to estimate associations between the variables and the outcome (as in the analyses conducted later in the paper). From the prediction standpoint, however, the only concern is predictive accuracy, so it makes sense to offer the algorithm as many candidate variables as possible; we therefore included all of them. We note that <it>rpart </it>is just one of many machine learning algorithms that could be used, including algorithms that combine results from several learners <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The approach generalizes to any such routine.</p>
<p>In an attempt to assess the relative "importance" of the variables in predicting the outcome, we applied a Monte Carlo re-sampling approach (nonparametric bootstrap) <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The study individuals were randomly re-sampled with replacement (meaning that one subject could be sampled more than once, but that all samples were of the same size), and the <it>rpart </it>tree was recalculated. This bootstrapping method is a commonly used way of simulating re-sampling from the target population, and can help to examine how small changes in the data can affect the prediction model chosen. We performed this re-sampling approach 5000 times, and tabulated the number of times each variable was chosen by <it>rpart </it>in the prediction model. Multiple splits on a given variable within the same <it>rpart </it>fit were counted only once on each iteration.</p>
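<p>The re-sampling scheme can be sketched as follows. This is a simplified illustration and not the analysis code: a single-split tree (a "stump" chosen by Gini impurity reduction) stands in for the full <it>rpart </it>tree, the data are synthetic, and the variable names, sample size and number of bootstrap replicates are assumptions chosen to keep the example small.</p>
<p><![CDATA[
```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_variable(rows, outcomes, names):
    """Name of the variable whose best single split most reduces impurity."""
    n = len(rows)
    parent = gini(outcomes)
    best_name, best_gain = None, 0.0
    for j, name in enumerate(names):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, outcomes) if row[j] <= cut]
            right = [y for row, y in zip(rows, outcomes) if row[j] > cut]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best_gain:
                best_name, best_gain = name, gain
    return best_name


def bootstrap_selection_counts(rows, outcomes, names, n_boot, rng):
    """Count how often each variable is selected across bootstrap resamples."""
    counts = Counter()
    n = len(rows)
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping the sample size fixed.
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = best_split_variable(
            [rows[i] for i in idx], [outcomes[i] for i in idx], names
        )
        counts[chosen] += 1
    return counts


# Synthetic data (not the study data): infection risk depends only on "july".
rng = random.Random(0)
names = ["july", "tool_washing", "noise"]
rows = [[rng.randint(0, 5) for _ in names] for _ in range(100)]
outcomes = [
    1 if rng.random() < (0.9 if row[0] >= 3 else 0.1) else 0 for row in rows
]

counts = bootstrap_selection_counts(rows, outcomes, names, n_boot=100, rng=rng)
for name, count in counts.most_common():
    print(name, count)
```
]]></p>
<p>Because a stump selects exactly one variable per resample, the rule of counting a variable only once per iteration is satisfied automatically here; with a multi-split tree, the set of distinct split variables in each fit would be tabulated instead.</p>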
</sec>
<sec><st><p><it>Multiple Regression</it></p></st>
<p>Turning away from the prediction-focused approach, our second analysis was a main-effects log-linear regression, in which we also included age category (&lt; 18, 18-29, 30-39, 40-49, 50+) and village indicator variables as possible confounders. Here we separated the activity types from the months into two separate models, and excluded total water contact from both models. We could not use log-linear binomial models because they generated predicted probabilities that exceeded one, so we instead used Poisson log-linear models.</p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i2"><m:mtable>
   <m:mtr>
      <m:mtd>
         <m:maligngroup/>
         <m:mtext>Model</m:mtext>
         <m:mtext>&#8201;</m:mtext>
         <m:mn>1</m:mn>
         <m:mo>:</m:mo>
         <m:mtext>log</m:mtext>
         <m:mrow>
            <m:mo>[</m:mo>
            <m:mrow>
               <m:mi>E</m:mi>
               <m:mrow>
                  <m:mo>(</m:mo>
                  <m:mrow>
                     <m:mi>Y</m:mi>
                     <m:mo>|</m:mo>
                     <m:msub>
                        <m:mi>W</m:mi>
                        <m:mrow>
                           <m:mi>a</m:mi>
                           <m:mi>c</m:mi>
                           <m:mi>t</m:mi>
                           <m:mi>i</m:mi>
                           <m:mi>v</m:mi>
                           <m:mi>i</m:mi>
                           <m:mi>t</m:mi>
                           <m:mi>y</m:mi>
                        </m:mrow>
                     </m:msub>
                     <m:mo>,</m:mo>
                     <m:mi>V</m:mi>
                  </m:mrow>
                  <m:mo>)</m:mo>
               </m:mrow>
            </m:mrow>
            <m:mo>]</m:mo>
         </m:mrow>
      </m:mtd>
   </m:mtr>
   <m:mtr>
      <m:mtd>
         <m:maligngroup/>
         <m:mo>=</m:mo>
         <m:mi>&#945;</m:mi>
         <m:mo>+</m:mo>
         <m:msub>
            <m:mi>&#946;</m:mi>
            <m:mrow>
               <m:mi>a</m:mi>
               <m:mi>c</m:mi>
               <m:mi>t</m:mi>
               <m:mi>i</m:mi>
               <m:mi>v</m:mi>
               <m:mi>i</m:mi>
               <m:mi>t</m:mi>
               <m:mi>y</m:mi>
            </m:mrow>
         </m:msub>
         <m:msub>
            <m:mi>W</m:mi>
            <m:mrow>
               <m:mi>a</m:mi>
               <m:mi>c</m:mi>
               <m:mi>t</m:mi>
               <m:mi>i</m:mi>
               <m:mi>v</m:mi>
               <m:mi>i</m:mi>
               <m:mi>t</m:mi>
               <m:mi>y</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo>+</m:mo>
         <m:mi>&#947;</m:mi>
         <m:mi>V</m:mi>
      </m:mtd>
   </m:mtr>
</m:mtable>
</m:math></display-formula></p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i3"><m:mtable>
   <m:mtr>
      <m:mtd>
         <m:maligngroup/>
         <m:mtext>Model</m:mtext>
         <m:mtext>&#8201;</m:mtext>
         <m:mn>2</m:mn>
         <m:malignmark/>
         <m:mo>:</m:mo>
         <m:mtext>log</m:mtext>
         <m:mrow>
            <m:mo>[</m:mo>
            <m:mrow>
               <m:mi>E</m:mi>
               <m:mrow>
                  <m:mo>(</m:mo>
                  <m:mrow>
                     <m:mi>Y</m:mi>
                     <m:mo>|</m:mo>
                     <m:msub>
                        <m:mi>W</m:mi>
                        <m:mrow>
                           <m:mi>m</m:mi>
                           <m:mi>o</m:mi>
                           <m:mi>n</m:mi>
                           <m:mi>t</m:mi>
                           <m:mi>h</m:mi>
                        </m:mrow>
                     </m:msub>
                     <m:mo>,</m:mo>
                     <m:mi>V</m:mi>
                  </m:mrow>
                  <m:mo>)</m:mo>
               </m:mrow>
            </m:mrow>
            <m:mo>]</m:mo>
         </m:mrow>
      </m:mtd>
   </m:mtr>
   <m:mtr>
      <m:mtd>
         <m:maligngroup/>
         <m:mo>=</m:mo>
         <m:mi>&#945;</m:mi>
         <m:mo>+</m:mo>
         <m:msub>
            <m:mi>&#946;</m:mi>
            <m:mrow>
               <m:mi>m</m:mi>
               <m:mi>o</m:mi>
               <m:mi>n</m:mi>
               <m:mi>t</m:mi>
               <m:mi>h</m:mi>
            </m:mrow>
         </m:msub>
         <m:msub>
            <m:mi>W</m:mi>
            <m:mrow>
               <m:mi>m</m:mi>
               <m:mi>o</m:mi>
               <m:mi>n</m:mi>
               <m:mi>t</m:mi>
               <m:mi>h</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo>+</m:mo>
         <m:mi>&#947;</m:mi>
         <m:mi>V</m:mi>
      </m:mtd>
   </m:mtr>
</m:mtable>
</m:math></display-formula></p>
<p>In both models, <it>Y </it>is the (binary) outcome, <it>V </it>is the vector of village and age category indicators, and <it>&#947; </it>is the vector of coefficients associated with <it>V</it>. In Model 1, <it>W</it><sub><it>activity </it></sub>is the vector of activity type water contact variables, and <it>&#946;</it><sub><it>activity </it></sub>is the vector of activity type coefficients; in Model 2, <it>W</it><sub><it>month </it></sub>is the vector of monthly water contact variables, and <it>&#946;</it><sub><it>month </it></sub>is the vector of month coefficients. Because we did not wish to rely upon the Poisson assumption for estimating our standard errors and deriving inference, we instead calculated robust standard errors using the Huber/White sandwich estimator <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. Regression estimates were obtained using the <it>glm </it>command in Stata <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
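<p>The role of the robust standard errors can be seen in a deliberately simplified setting. With a single binary exposure and no other covariates, the Poisson log-linear fit has a closed form (the fitted means are the group proportions), and the model-based and sandwich variances of the log relative risk can be written down directly. The Python sketch below illustrates only this special case; it is not the model fitted in this study, the function name and example counts are hypothetical, and it uses the uncorrected form of the sandwich variance, which may differ from software defaults by a finite-sample factor.</p>
<p><![CDATA[
```python
import math


def log_rr_with_standard_errors(exposed, unexposed):
    """Relative risk with model-based and robust SEs of log(RR).

    exposed, unexposed: lists of 0/1 outcomes for the two exposure groups.
    Returns (rr, se_model, se_robust).
    """
    n1, n0 = len(exposed), len(unexposed)
    mu1 = sum(exposed) / n1  # fitted mean in the exposed group
    mu0 = sum(unexposed) / n0  # fitted mean in the unexposed group
    rr = mu1 / mu0

    # Model-based variance assumes Var(Y) = E(Y), as for a Poisson count.
    var_model = 1.0 / (n1 * mu1) + 1.0 / (n0 * mu0)

    # Sandwich variance replaces the assumed variance by squared residuals.
    ss1 = sum((y - mu1) ** 2 for y in exposed)
    ss0 = sum((y - mu0) ** 2 for y in unexposed)
    var_robust = ss1 / (n1 * mu1) ** 2 + ss0 / (n0 * mu0) ** 2

    return rr, math.sqrt(var_model), math.sqrt(var_robust)


# Hypothetical counts: 40 of 100 exposed and 20 of 100 unexposed are infected.
exposed = [1] * 40 + [0] * 60
unexposed = [1] * 20 + [0] * 80
rr, se_model, se_robust = log_rr_with_standard_errors(exposed, unexposed)

# Approximate 95% confidence interval for the relative risk using the robust SE.
ci_low = rr * math.exp(-1.96 * se_robust)
ci_high = rr * math.exp(1.96 * se_robust)
print(rr, se_model, se_robust, ci_low, ci_high)
```
]]></p>
<p>For a binary outcome the true variance, which is the mean multiplied by one minus the mean, is smaller than the Poisson variance, which equals the mean, so in this example the robust standard error is smaller than the model-based one.</p>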
</sec>
<sec><st><p><it>Variable Importance</it></p></st>
<p>Our third (semi-parametric) approach estimated a so-called <it>variable importance </it>(VI) parameter, which compares the current distribution of the outcome to its distribution under a theoretical experiment in which the variable of interest is set to its lowest-risk level. In our data, this is equivalent to comparing the observed infection prevalence to the prevalence of infection in a theoretical experiment in which the entire study population never experienced a particular type of water contact.</p>
<p>Let <it>A </it>denote the current variable of interest and <it>Y </it>the outcome. Let <it>W </it>denote the water contact confounders (in this case, all water contact variables other than <it>A</it>) and <it>V </it>the additional confounders (age category and village). Our VI estimate is inspired by the following causal parameter:</p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i4"><m:mrow>
   <m:mfrac>
      <m:mrow>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:msub>
            <m:mi>Y</m:mi>
            <m:mn>0</m:mn>
         </m:msub>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>Y</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:mfrac>
   <m:mo>.</m:mo>
</m:mrow>
</m:math></display-formula></p>
<p><it>Y</it><sub><it>a </it></sub>represents the outcome if - possibly contrary to fact - everyone had exposure <it>A = a</it>; <it>Y</it><sub>0 </sub>is thus the outcome had everyone been unexposed. (Outcomes defined in such a way have been referred to as <it>counterfactuals </it><abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.) In the case of our binary outcome variable, <it>E(Y) </it>is the current disease prevalence in our target population, which is estimated as the average of the observed <it>Y </it>values.</p>
<p>If <it>Y </it>is binary (yes/no) - as it is in our case - this parameter can be interpreted as the proportional change, relative to current rates, in the prevalence of schistosomiasis in our target population if everyone were unexposed to the particular risk. This parameter is akin to the attributable risk, and its magnitude is both a function of the adjusted association of <it>A </it>and <it>Y </it>and of the prevalence of exposure. For example, removing exposure would have little effect on the value of this causal parameter if the exposure in question were very rare, even if it were strongly related to the disease outcome. Conversely, removing a common exposure that only modestly increased the risk of disease could have a much larger impact on the parameter's value.</p>
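<p>To make the interpretation concrete, the short Python sketch below converts a VI ratio into a proportional reduction in prevalence, using the July estimates and the approximate prevalence of 0.3 reported in the abstract. The function name is hypothetical, and the conversion of the confidence limits simply applies the same transformation to each bound.</p>
<p><![CDATA[
```python
def proportional_reduction(vi):
    """Proportion of current prevalence removed, given a VI ratio E(Y_0)/E(Y)."""
    return 1.0 - vi


# Values reported in the abstract for July water contact.
vi_july, vi_low, vi_high = 0.16, 0.11, 0.22
prevalence = 0.3  # approximate observed prevalence, E(Y)

reduction = proportional_reduction(vi_july)
# The upper VI bound corresponds to the smallest reduction, and vice versa.
reduction_ci = (proportional_reduction(vi_high), proportional_reduction(vi_low))
# Prevalence expected if the exposure were removed: VI times current prevalence.
counterfactual_prevalence = vi_july * prevalence

print(f"reduction: {reduction:.0%}")
print(f"95% CI: {reduction_ci[0]:.0%} to {reduction_ci[1]:.0%}")
print(f"prevalence without exposure: {counterfactual_prevalence:.3f}")
```
]]></p>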
<p>With regard to the distribution of the data alone - that is, without assuming the necessary identifiability conditions for making causal inference (no unmeasured confounders and independence of counterfactual outcomes, or the so-called stable unit treatment value assumption - SUTVA <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>) - our VI measure is an estimate of the following:</p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i5"><m:mrow>
   <m:mi>V</m:mi>
   <m:mi>I</m:mi>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mi>E</m:mi>
            <m:mrow>
               <m:mi>W</m:mi>
               <m:mo>,</m:mo>
               <m:mi>V</m:mi>
            </m:mrow>
         </m:msub>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>Y</m:mi>
         <m:mo>|</m:mo>
         <m:mi>A</m:mi>
         <m:mo>=</m:mo>
         <m:mn>0</m:mn>
         <m:mo>,</m:mo>
         <m:mi>W</m:mi>
         <m:mo>,</m:mo>
         <m:mi>V</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>Y</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:mfrac>
   <m:mo>.</m:mo>
</m:mrow>
</m:math></display-formula></p>
<p>The numerator is interpreted as the mean predicted value of <it>Y </it>assuming one sets the exposure to 0 (<it>A </it>= 0 means unexposed) but keeps the other variables at their observed values. <it>E</it><sub><it>W, V </it></sub>in the numerator denotes that this mean predicted value of <it>Y </it>is also taken over all <it>W </it>and <it>V</it>.</p>
<p>The denominator was estimated by simply taking the mean of the <it>Y </it>values. To estimate the numerator, we used the so-called inverse-probability-of-censoring-weighted (IPCW) estimator:</p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i6"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>E</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>W</m:mi>
         <m:mo>,</m:mo>
         <m:mi>V</m:mi>
      </m:mrow>
   </m:msub>
   <m:mover accent="true">
      <m:mi>E</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mo stretchy="false">(</m:mo>
   <m:mi>Y</m:mi>
   <m:mo>|</m:mo>
   <m:mi>A</m:mi>
   <m:mo>=</m:mo>
   <m:mn>0</m:mn>
   <m:mo>,</m:mo>
   <m:mi>W</m:mi>
   <m:mo>,</m:mo>
   <m:mi>V</m:mi>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mn>1</m:mn>
      <m:mi>n</m:mi>
   </m:mfrac>
   <m:mstyle displaystyle="true">
      <m:munderover>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>i</m:mi>
            <m:mo>=</m:mo>
            <m:mn>1</m:mn>
         </m:mrow>
         <m:mi>n</m:mi>
      </m:munderover>
      <m:mrow>
         <m:mfrac>
            <m:mrow>
               <m:mi>I</m:mi>
               <m:mo stretchy="false">(</m:mo>
               <m:msub>
                  <m:mi>A</m:mi>
                  <m:mi>i</m:mi>
               </m:msub>
               <m:mo>=</m:mo>
               <m:mn>0</m:mn>
               <m:mo stretchy="false">)</m:mo>
               <m:msub>
                  <m:mi>Y</m:mi>
                  <m:mi>i</m:mi>
               </m:msub>
            </m:mrow>
            <m:mrow>
               <m:mover accent="true">
                  <m:mi>P</m:mi>
                  <m:mo>^</m:mo>
               </m:mover>
               <m:mo stretchy="false">(</m:mo>
               <m:msub>
                  <m:mi>A</m:mi>
                  <m:mi>i</m:mi>
               </m:msub>
               <m:mo>=</m:mo>
               <m:mn>0</m:mn>
               <m:mo>|</m:mo>
               <m:msub>
                  <m:mi>W</m:mi>
                  <m:mi>i</m:mi>
               </m:msub>
               <m:mo>,</m:mo>
               <m:msub>
                  <m:mi>V</m:mi>
                  <m:mi>i</m:mi>
               </m:msub>
               <m:mo stretchy="false">)</m:mo>
            </m:mrow>
         </m:mfrac>
      </m:mrow>
   </m:mstyle>
   <m:mo>.</m:mo>
</m:mrow>
</m:math></display-formula></p>
<p>Here <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i7"><m:mrow>
   <m:mover accent="true">
      <m:mi>P</m:mi>
      <m:mo>^</m:mo>
   </m:mover>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mi>A</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>0</m:mn>
   <m:mo>|</m:mo>
   <m:msub>
      <m:mi>W</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mo>,</m:mo>
   <m:msub>
      <m:mi>V</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
</m:mrow>
</m:math></inline-formula> is an estimate of the probability that <it>A </it>= 0 given the values of the covariates <it>W<sub>i </sub></it>and <it>V</it><sub><it>i </it></sub>for subject <it>i</it>. Because each unexposed subject's outcome is divided by this probability, the form of the estimator makes a further assumption explicit: the positivity, or experimental treatment assignment (ETA), assumption, which in this case requires that <it>P(A = 0|W, V) &gt; 0 </it>in the data-generating distribution <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>.</p>
<p>The IPCW estimator is a weighted average of the <it>Y </it>values of the unexposed subjects, in which the weights are inversely proportional to the estimated probability of being unexposed (<it>A<sub>i </sub>= 0</it>) given the other covariates (<it>W</it><sub><it>i </it></sub>and <it>V</it><sub><it>i</it></sub>). The estimator thus up-weights the disease outcomes of unexposed individuals whose covariate values are underrepresented within the unexposed group, which has the effect of adjusting for confounding bias. Because <it>P(A<sub>i</sub>|W<sub>i</sub>, V<sub>i</sub>) </it>is unknown in this case, we used a machine-learning algorithm (<it>rpart</it>) to estimate a model for this probability.</p>
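<p>The weighting described above can be illustrated with a minimal Python sketch. It is not the code used in the study (which estimated the probabilities with <it>rpart</it>); the function name <it>ipcw_unexposed_mean</it> and the toy data are hypothetical, and the estimated probabilities are simply supplied as inputs.</p>

```python
def ipcw_unexposed_mean(a, y, p_unexposed):
    """IPCW estimate of E_{W,V} E(Y | A = 0, W, V).

    a           -- exposure indicators (0 = unexposed)
    y           -- binary outcomes
    p_unexposed -- estimated P(A_i = 0 | W_i, V_i) for each subject
    """
    n = len(y)
    total = 0.0
    for a_i, y_i, p_i in zip(a, y, p_unexposed):
        if a_i != 0:
            continue  # I(A_i = 0) = 0: exposed subjects contribute nothing
        if not p_i > 0:
            # Positivity (ETA) violation: the weight 1 / p_i is undefined.
            raise ValueError("estimated P(A = 0 | W, V) must be positive")
        total += y_i / p_i
    return total / n


# Hypothetical toy data: two covariate strata of four subjects each,
# with P(A = 0 | stratum) equal to 0.5 and 0.25 respectively.
a = [0, 0, 1, 1, 0, 1, 1, 1]
y = [0, 1, 1, 1, 0, 1, 0, 1]
p = [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
estimate = ipcw_unexposed_mean(a, y, p)
print(estimate)
```

<p>In this toy example the estimate equals the average, over the two equally sized strata, of the mean outcome among the unexposed (0.5 and 0), which is the quantity the weighting is designed to recover.</p>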
<p>A VI estimate was calculated for each variable of interest. Specifically, we define the VI estimate for each water contact activity as follows:</p>
<p><display-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i8"><m:mrow>
   <m:mi>V</m:mi>
   <m:msub>
      <m:mi>I</m:mi>
      <m:mrow>
         <m:mi>a</m:mi>
         <m:mi>c</m:mi>
         <m:mi>t</m:mi>
         <m:mi>i</m:mi>
         <m:mi>v</m:mi>
         <m:mi>i</m:mi>
         <m:mi>t</m:mi>
         <m:mi>y</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mi>E</m:mi>
            <m:mrow>
               <m:msub>
                  <m:mi>W</m:mi>
                  <m:mrow>
                     <m:mi>a</m:mi>
                     <m:mi>c</m:mi>
                     <m:mi>t</m:mi>
                     <m:mi>i</m:mi>
                     <m:mi>v</m:mi>
                     <m:mi>i</m:mi>
                     <m:mi>t</m:mi>
                     <m:mi>y</m:mi>
                  </m:mrow>
               </m:msub>
               <m:mo>,</m:mo>
               <m:mi>V</m:mi>
            </m:mrow>
         </m:msub>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>Y</m:mi>
         <m:mo>|</m:mo>
         <m:mi>A</m:mi>
         <m:mo>=</m:mo>
         <m:mn>0</m:mn>
         <m:mo>,</m:mo>
         <m:msub>
            <m:mi>W</m:mi>
            <m:mrow>
               <m:mi>a</m:mi>
               <m:mi>c</m:mi>
               <m:mi>t</m:mi>
               <m:mi>i</m:mi>
               <m:mi>v</m:mi>
               <m:mi>i</m:mi>
               <m:mi>t</m:mi>
               <m:mi>y</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo>,</m:mo>
         <m:mi>V</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>Y</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:mfrac>
   <m:mo>,</m:mo>
</m:mrow>
</m:math></display-formula></p>
<p>where <it>A </it>represents the water contact activity type for which a VI estimate is being calculated, <it>W</it><sub><it>activity </it></sub>represents the remaining water contact activity type variables, and <it>V </it>represents the age category and village covariates. The VI estimate for each month is defined equivalently, with <it>W<sub>month </sub></it>in place of <it>W</it><sub><it>activity</it></sub>. As in the logistic regression analysis, total water contact was excluded; it would not be meaningful to estimate <it>E<sub>W, V </sub>E</it>(<it>Y</it>|<it>A </it>= 0, <it>W</it>, <it>V</it>) for <it>A = </it>total water contact, since none of the other water contact variables could be nonzero if total water contact were equal to zero.</p>
<p>For inference, we estimated standard errors using the non-parametric bootstrap with 5000 iterations. Specifically, participants were re-sampled with replacement, producing 5000 bootstrap samples of size 1011. For each of these 5000 samples, VI estimates were calculated, including a re-calculation of <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i7"><m:mrow><m:mover accent="true"><m:mi>P</m:mi><m:mo>^</m:mo></m:mover><m:mo stretchy="false">(</m:mo><m:msub><m:mi>A</m:mi><m:mi>i</m:mi></m:msub><m:mo>=</m:mo><m:mn>0</m:mn><m:mo>|</m:mo><m:msub><m:mi>W</m:mi><m:mi>i</m:mi></m:msub><m:mo>,</m:mo><m:msub><m:mi>V</m:mi><m:mi>i</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:math></inline-formula>. The standard deviation across these 5000 estimates was then used as the standard error. Because the model for <it>P</it>(<it>A</it><sub><it>i </it></sub>= 0|<it>W</it><sub><it>i</it></sub>, <it>V</it><sub><it>i</it></sub>) was not pre-specified, this method of calculating the standard error accounts for both sampling variability (by re-sampling) and the variability introduced by model uncertainty with regard to <it>P</it>(<it>A<sub>i </sub></it>= 0|<it>W</it><sub><it>i</it></sub>, <it>V</it><sub><it>i</it></sub>) (by allowing the model for <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i7"><m:mrow><m:mover accent="true"><m:mi>P</m:mi><m:mo>^</m:mo></m:mover><m:mo stretchy="false">(</m:mo><m:msub><m:mi>A</m:mi><m:mi>i</m:mi></m:msub><m:mo>=</m:mo><m:mn>0</m:mn><m:mo>|</m:mo><m:msub><m:mi>W</m:mi><m:mi>i</m:mi></m:msub><m:mo>,</m:mo><m:msub><m:mi>V</m:mi><m:mi>i</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:math></inline-formula> to change at each iteration).</p>
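<p>The bootstrap procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's implementation: the probability of being unexposed is re-estimated in each resample as the unexposed fraction within a covariate stratum (a simple stand-in for <it>rpart</it>), the data and the names <it>vi_estimate</it> and <it>bootstrap_se</it> are hypothetical, and only 200 resamples are drawn to keep the example fast.</p>

```python
import math
import random
from collections import Counter


def vi_estimate(sample):
    """VI ratio for a list of (a, y, stratum) tuples.

    Numerator: IPCW estimate of E_{W,V} E(Y | A = 0, W, V).
    Denominator: the sample mean of Y.
    Stand-in for rpart: P(A = 0 | stratum) is re-estimated on every call
    as the unexposed fraction within each stratum of `sample`.
    """
    n = len(sample)
    stratum_size = Counter(s for _, _, s in sample)
    unexposed = Counter(s for a, _, s in sample if a == 0)
    ipcw = sum(
        y * stratum_size[s] / unexposed[s]  # y / P-hat(A = 0 | stratum)
        for a, y, s in sample
        if a == 0
    ) / n
    return ipcw / (sum(y for _, y, _ in sample) / n)


def bootstrap_se(data, statistic, n_boot, seed):
    """Standard deviation of `statistic` across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping the original size.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # The statistic re-fits its propensity model on each resample.
        replicates.append(statistic(resample))
    centre = sum(replicates) / n_boot
    variance = sum((r - centre) ** 2 for r in replicates) / (n_boot - 1)
    return math.sqrt(variance)


# Hypothetical data: (exposure a, outcome y, covariate stratum).
data = (
    [(0, 0, "young")] * 30 + [(0, 1, "young")] * 10
    + [(1, 0, "young")] * 20 + [(1, 1, "young")] * 20
    + [(0, 0, "old")] * 15 + [(0, 1, "old")] * 5
    + [(1, 0, "old")] * 20 + [(1, 1, "old")] * 40
)
vi = vi_estimate(data)
se = bootstrap_se(data, vi_estimate, n_boot=200, seed=1)  # the study used 5000
print(round(vi, 3), round(se, 3))
```

<p>Passing the whole estimation routine to the bootstrap, rather than a fixed set of weights, is what lets the resulting standard error reflect the variability of the fitted probability model as well as sampling variability.</p>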
</sec>
</sec>
</sec>
<sec><st><p>Results</p></st>
<p>Figure <figr fid="F1">1</figr> shows the full data <it>rpart </it>tree formed by allowing the machine learning algorithm to choose splits from the pool of all water contact variables. April, May, June, tool washing, ditch digging, bathing, and rice picking were the water contact variables chosen for classification.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Full data <it>rpart </it>classification tree</p></caption><text>
   <p><b>Full data <it>rpart </it>classification tree</b>.</p>
</text><graphic file="1742-5573-7-3-1"/></fig>
<p>Table <tblr tid="T1">1</tblr> lists the number and percentage of the 5000 bootstrap samples (drawn with replacement) in which each variable was chosen for classification in the <it>rpart </it>tree. The covariates are ordered from most to least frequently chosen. April (92%), June (92%) and total water contact (86%) were the most frequently chosen predictors of infection status within the bootstrapping algorithm. The six variables chosen for classification in the original full data tree (Figure <figr fid="F1">1</figr>) are among the top seven identified most frequently for use in the bootstrap sample <it>rpart </it>trees. However, total water contact, chosen 86% of the time in the bootstrap samples, was not part of the original full data tree.</p>
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Number of times out of 5000 that each water contact type was chosen by <it>rpart </it>to form a data-adaptive classification tree.</p></caption><tblbdy cols="3">
      <r>
         <c ca="left">
            <p>
               <b>Water contact type</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Number of times chosen</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Percentage</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>April</p>
         </c>
         <c ca="left">
            <p>4608</p>
         </c>
         <c ca="left">
            <p>92.2%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>June</p>
         </c>
         <c ca="left">
            <p>4602</p>
         </c>
         <c ca="left">
            <p>92.0%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Total</p>
         </c>
         <c ca="left">
            <p>4283</p>
         </c>
         <c ca="left">
            <p>85.7%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Tool washing</p>
         </c>
         <c ca="left">
            <p>4067</p>
         </c>
         <c ca="left">
            <p>81.3%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Ditch digging</p>
         </c>
         <c ca="left">
            <p>3825</p>
         </c>
         <c ca="left">
            <p>76.5%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Rice planting</p>
         </c>
         <c ca="left">
            <p>3677</p>
         </c>
         <c ca="left">
            <p>73.5%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>May</p>
         </c>
         <c ca="left">
            <p>3652</p>
         </c>
         <c ca="left">
            <p>73.0%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>September</p>
         </c>
         <c ca="left">
            <p>3326</p>
         </c>
         <c ca="left">
            <p>66.5%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>July</p>
         </c>
         <c ca="left">
            <p>3181</p>
         </c>
         <c ca="left">
            <p>63.6%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Bathing</p>
         </c>
         <c ca="left">
            <p>3073</p>
         </c>
         <c ca="left">
            <p>61.5%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>October</p>
         </c>
         <c ca="left">
            <p>2787</p>
         </c>
         <c ca="left">
            <p>55.7%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Swimming</p>
         </c>
         <c ca="left">
            <p>2481</p>
         </c>
         <c ca="left">
            <p>49.6%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Laundry</p>
         </c>
         <c ca="left">
            <p>2133</p>
         </c>
         <c ca="left">
            <p>42.7%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>August</p>
         </c>
         <c ca="left">
            <p>1892</p>
         </c>
         <c ca="left">
            <p>37.8%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Fishing</p>
         </c>
         <c ca="left">
            <p>69</p>
         </c>
         <c ca="left">
            <p>1.4%</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Data to form each tree were obtained by re-sampling the 1011 individuals with replacement.</p>
   </tblfn></tbl>
<p>Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr> show results from the log-linear regression models, along with the prevalence of each type of water contact in our sample. The correlations between the water contact variables range from -0.02 (between April and August) to 0.68 (between July and August) for the monthly variables and from -0.15 (between swimming and bathing) to 0.28 (between rice picking and bathing) for the activity variables. The reported relative risks were calculated as <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i9"><m:mrow>
   <m:msup>
      <m:mi>e</m:mi>
      <m:mrow>
         <m:msub>
            <m:mover accent="true">
               <m:mi>&#946;</m:mi>
               <m:mo>^</m:mo>
            </m:mover>
            <m:mi>i</m:mi>
         </m:msub>
         <m:msub>
            <m:mover accent="true">
               <m:mi>X</m:mi>
               <m:mo>&#175;</m:mo>
            </m:mover>
            <m:mi>i</m:mi>
         </m:msub>
      </m:mrow>
   </m:msup>
</m:mrow>
</m:math></inline-formula>, where <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i10"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>&#946;</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>i</m:mi>
   </m:msub>
</m:mrow>
</m:math></inline-formula> is the estimated regression coefficient and <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i11"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>X</m:mi>
         <m:mo>&#175;</m:mo>
      </m:mover>
      <m:mi>i</m:mi>
   </m:msub>
</m:mrow>
</m:math></inline-formula> is the mean water contact across all subjects for water contact variable <it>i</it>. Each relative risk therefore compares the risk of infection at the mean value of water contact variable <it>i </it>with the risk at no water contact of type <it>i</it>. As previously mentioned, the month and activity variables were entered into two separate models, which is why the results are reported separately. The estimates in Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr> are also adjusted for age category and village. We do not report relative risks associated with age category and village because the effects of these covariates were not the focus of this study.</p>
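<p>As a small worked illustration of this calculation, the Python sketch below evaluates the relative risk at the mean exposure and checks that it equals the ratio of the risks predicted by a log-linear model, in which the intercept and adjustment terms cancel. The coefficient, mean exposure and intercept are hypothetical values chosen only for illustration; they are not estimates from this study.</p>

```python
import math


def relative_risk_at_mean(beta_hat, x_bar):
    """Risk at the mean exposure x_bar relative to zero exposure."""
    return math.exp(beta_hat * x_bar)


def model_risk(intercept, beta_hat, x):
    """Risk under a log-linear model: log(risk) = intercept + beta_hat * x."""
    return math.exp(intercept + beta_hat * x)


# Hypothetical values, chosen only to illustrate the calculation.
beta_hat = 0.002   # estimated coefficient per unit of water contact
x_bar = 15.0       # mean water contact for this variable
intercept = -1.2   # absorbs the baseline and adjustment covariates

rr = relative_risk_at_mean(beta_hat, x_bar)
risk_at_mean = model_risk(intercept, beta_hat, x_bar)
risk_at_zero = model_risk(intercept, beta_hat, 0.0)
ratio = risk_at_mean / risk_at_zero  # the intercept cancels in this ratio
print(round(rr, 3), round(ratio, 3))
```

<p>Because the comparison is made at the mean exposure rather than per unit of exposure, a variable with a small mean can yield a relative risk close to one even when its per-unit coefficient is not negligible.</p>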
<tbl id="T2"><title><p>Table 2</p></title><caption><p>Relative risk estimates for water contact by month.</p></caption><tblbdy cols="6">
      <r>
         <c ca="left">
            <p>
               <b>Month</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Prevalence</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Relative Risk</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>95% CI</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Std. error</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>p-value</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>June</p>
         </c>
         <c ca="right">
            <p>0.75</p>
         </c>
         <c ca="right">
            <p>1.03</p>
         </c>
         <c ca="right">
            <p>(0.98, 1.09)</p>
         </c>
         <c ca="right">
            <p>0.03</p>
         </c>
         <c ca="right">
            <p>0.20</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>October</p>
         </c>
         <c ca="right">
            <p>0.58</p>
         </c>
         <c ca="right">
            <p>0.95</p>
         </c>
         <c ca="right">
            <p>(0.89, 1.03)</p>
         </c>
         <c ca="right">
            <p>0.04</p>
         </c>
         <c ca="right">
            <p>0.22</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>May</p>
         </c>
         <c ca="right">
            <p>0.75</p>
         </c>
         <c ca="right">
            <p>1.04</p>
         </c>
         <c ca="right">
            <p>(0.97, 1.13)</p>
         </c>
         <c ca="right">
            <p>0.04</p>
         </c>
         <c ca="right">
            <p>0.25</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>April</p>
         </c>
         <c ca="right">
            <p>0.73</p>
         </c>
         <c ca="right">
            <p>1.04</p>
         </c>
         <c ca="right">
            <p>(0.95, 1.14)</p>
         </c>
         <c ca="right">
            <p>0.05</p>
         </c>
         <c ca="right">
            <p>0.45</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>August</p>
         </c>
         <c ca="right">
            <p>0.76</p>
         </c>
         <c ca="right">
            <p>1.03</p>
         </c>
         <c ca="right">
            <p>(0.94, 1.13)</p>
         </c>
         <c ca="right">
            <p>0.05</p>
         </c>
         <c ca="right">
            <p>0.51</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>September</p>
         </c>
         <c ca="right">
            <p>0.70</p>
         </c>
         <c ca="right">
            <p>0.98</p>
         </c>
         <c ca="right">
            <p>(0.89, 1.07)</p>
         </c>
         <c ca="right">
            <p>0.05</p>
         </c>
         <c ca="right">
            <p>0.64</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>July</p>
         </c>
         <c ca="right">
            <p>0.77</p>
         </c>
         <c ca="right">
            <p>0.98</p>
         </c>
         <c ca="right">
            <p>(0.91, 1.06)</p>
         </c>
         <c ca="right">
            <p>0.04</p>
         </c>
         <c ca="right">
            <p>0.68</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>These estimates are based on a main-effects log-linear regression, and are also adjusted for age category and village. Each relative risk is the ratio of the risk of infection at the mean exposure value for that month to the risk at zero exposure.</p>
   </tblfn></tbl>
<tbl id="T3"><title><p>Table 3</p></title><caption><p>Relative risk estimates for water contact by activity.</p></caption><tblbdy cols="6">
      <r>
         <c ca="left">
            <p>
               <b>Activity</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Prevalence</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Relative Risk</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>95% CI</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Std. error</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>p-value</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Tool washing</p>
         </c>
         <c ca="right">
            <p>0.20</p>
         </c>
         <c ca="right">
            <p>1.03</p>
         </c>
         <c ca="right">
            <p>(1.01, 1.05)</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
         <c ca="right">
            <p>&lt; 0.01</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Laundry</p>
         </c>
         <c ca="right">
            <p>0.22</p>
         </c>
         <c ca="right">
            <p>1.02</p>
         </c>
         <c ca="right">
            <p>(1.00, 1.05)</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
         <c ca="right">
            <p>0.08</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Swimming</p>
         </c>
         <c ca="right">
            <p>0.21</p>
         </c>
         <c ca="right">
            <p>1.02</p>
         </c>
         <c ca="right">
            <p>(1.00, 1.05)</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
         <c ca="right">
            <p>0.10</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Ditch digging</p>
         </c>
         <c ca="right">
            <p>0.48</p>
         </c>
         <c ca="right">
            <p>0.99</p>
         </c>
         <c ca="right">
            <p>(0.98, 1.00)</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
         <c ca="right">
            <p>0.16</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Fishing</p>
         </c>
         <c ca="right">
            <p>0.02</p>
         </c>
         <c ca="right">
            <p>1.01</p>
         </c>
         <c ca="right">
            <p>(1.00, 1.02)</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
         <c ca="right">
            <p>0.17</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Bathing</p>
         </c>
         <c ca="right">
            <p>0.49</p>
         </c>
         <c ca="right">
            <p>0.98</p>
         </c>
         <c ca="right">
            <p>(0.92, 1.04)</p>
         </c>
         <c ca="right">
            <p>0.03</p>
         </c>
         <c ca="right">
            <p>0.46</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Rice planting</p>
         </c>
         <c ca="right">
            <p>0.65</p>
         </c>
         <c ca="right">
            <p>1.03</p>
         </c>
         <c ca="right">
            <p>(0.94, 1.13)</p>
         </c>
         <c ca="right">
            <p>0.05</p>
         </c>
         <c ca="right">
            <p>0.52</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>These estimates are based on a main-effects log-linear regression, and are also adjusted for age category and village. Each relative risk is the ratio of the risk of infection at the mean exposure value for that activity to the risk at zero exposure.</p>
   </tblfn></tbl>
<p>In the log-linear regression framework, none of the monthly water contact variables were found to have strong associations with the outcome. All month-specific relative risk estimates are very close to one and have 95% confidence intervals that include one. This implies that the risk of having a positive stool sample when these variables are at their mean values is indistinguishable from the risk when there is zero water exposure during these months. Similarly, the relative risks associated with the water contact activity types are also all very close to one, and almost all have 95% confidence intervals that include one. The tool washing-specific relative risk has a 95% confidence interval that does not cross one; the estimated relative risk is still extremely close to one, however, implying almost no detected difference in risk. These results are of course only interpretable in the context of the regression models used.</p>
<p>Tables <tblr tid="T4">4</tblr> and <tblr tid="T5">5</tblr> show VI estimates for the two sets of water contact variables. As in the log-linear regression framework, the monthly water contact variables were analyzed separately from the water contact activity variables. As previously explained, the VI estimates were adjusted for age category and village by including these variables in the estimation of <it>P(A<sub>i</sub>|W<sub>i</sub>, V<sub>i</sub>)</it>. (As in the regression analysis, we did not calculate VI estimates for age category and village.) Confidence intervals and <it>p</it>-values based on the bootstrap-derived standard errors are also reported. In contrast to the log-linear regression results, which identified no detectable adjusted associations with the outcome among the monthly water contact variables, the VI estimate for July indicates a strong adjusted association. If one interprets this VI estimate as an estimate of <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i12"><m:mrow>
   <m:mfrac>
      <m:mrow>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:msub>
            <m:mi>Y</m:mi>
            <m:mn>0</m:mn>
         </m:msub>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>E</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>Y</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math></inline-formula>, it implies that eliminating water contact in July would reduce the prevalence of schistosomiasis measured in the study by 84%, or from 0.3 to 0.05. The 95% confidence interval for the VI estimate corresponds to a reduction of 78% to 89%. The prevalence of exposure in July is 0.77, the highest of any month, followed closely by August (0.76). The VI estimates for all other months range from 0.97 to 1.70 and have 95% confidence intervals that include one (many of which are quite broad). No other month, therefore, has a detectable association with the outcome.</p>
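<p>The arithmetic behind this interpretation can be checked with a short Python sketch using the July values reported in Table <tblr tid="T4">4</tblr>. The function names are hypothetical. The second function additionally assumes that the tabulated standard error is on the log scale; this is an assumption of the sketch, suggested by the asymmetry of the reported interval, and is not stated in the text.</p>

```python
import math


def implied_reduction(vi):
    """Proportional reduction in prevalence implied by a VI ratio E(Y_0)/E(Y)."""
    return 1.0 - vi


def log_scale_interval(estimate, se_log, z=1.96):
    """Interval exp(log(estimate) -/+ z * se_log); assumes a log-scale SE."""
    centre = math.log(estimate)
    return math.exp(centre - z * se_log), math.exp(centre + z * se_log)


vi_july = 0.16          # VI estimate for July
ci_july = (0.11, 0.22)  # reported 95% confidence interval
prevalence = 0.3        # approximate study prevalence quoted in the text

reduction_pct = round(100 * implied_reduction(vi_july))
new_prevalence = round(prevalence * vi_july, 2)
reduction_range = [round(100 * implied_reduction(b)) for b in ci_july]
low, high = log_scale_interval(vi_july, 0.18)  # 0.18 = tabulated Std. Error

print(reduction_pct, new_prevalence, reduction_range)
print(round(low, 2), round(high, 2))
```

<p>Under the log-scale assumption, the computed interval agrees with the reported one to within the rounding of the tabulated estimate and standard error.</p>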
<tbl id="T4"><title><p>Table 4</p></title><caption><p>Variable importance estimates for water contact by month.</p></caption><tblbdy cols="6">
      <r>
         <c ca="left">
            <p>
               <b>Month</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Prevalence</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>VI estimate</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>95% CI</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Std. Error</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>p-value</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>July</p>
         </c>
         <c ca="right">
            <p>0.77</p>
         </c>
         <c ca="right">
            <p>0.16</p>
         </c>
         <c ca="right">
            <p>(0.11, 0.22)</p>
         </c>
         <c ca="right">
            <p>0.18</p>
         </c>
         <c ca="right">
            <p>&lt; 0.01</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>August</p>
         </c>
         <c ca="right">
            <p>0.76</p>
         </c>
         <c ca="right">
            <p>1.70</p>
         </c>
         <c ca="right">
            <p>(0.48, 6.02)</p>
         </c>
         <c ca="right">
            <p>0.66</p>
         </c>
         <c ca="right">
            <p>0.42</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>May</p>
         </c>
         <c ca="right">
            <p>0.75</p>
         </c>
         <c ca="right">
            <p>1.18</p>
         </c>
         <c ca="right">
            <p>(0.32, 4.30)</p>
         </c>
         <c ca="right">
            <p>0.66</p>
         </c>
         <c ca="right">
            <p>0.81</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>October</p>
         </c>
         <c ca="right">
            <p>0.58</p>
         </c>
         <c ca="right">
            <p>1.05</p>
         </c>
         <c ca="right">
            <p>(0.60, 1.84)</p>
         </c>
         <c ca="right">
            <p>0.28</p>
         </c>
         <c ca="right">
            <p>0.86</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>June</p>
         </c>
         <c ca="right">
            <p>0.75</p>
         </c>
         <c ca="right">
            <p>0.97</p>
         </c>
         <c ca="right">
            <p>(0.27, 3.56)</p>
         </c>
         <c ca="right">
            <p>0.66</p>
         </c>
         <c ca="right">
            <p>0.97</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>September</p>
         </c>
         <c ca="right">
            <p>0.70</p>
         </c>
         <c ca="right">
            <p>1.01</p>
         </c>
         <c ca="right">
            <p>(0.41, 2.49)</p>
         </c>
         <c ca="right">
            <p>0.46</p>
         </c>
         <c ca="right">
            <p>0.98</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>April</p>
         </c>
         <c ca="right">
            <p>0.73</p>
         </c>
         <c ca="right">
            <p>1.00</p>
         </c>
         <c ca="right">
            <p>(0.40, 2.50)</p>
         </c>
         <c ca="right">
            <p>0.46</p>
         </c>
         <c ca="right">
            <p>1.00</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The prevalence of water contact for each month in our study population is also shown.</p>
   </tblfn></tbl>
<tbl id="T5"><title><p>Table 5</p></title><caption><p>Variable importance estimates for water contact by activity type.</p></caption><tblbdy cols="6">
      <r>
         <c ca="left">
            <p>
               <b>Activity</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Prevalence</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>VI estimate</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>95% CI</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Std. Error</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>p-value</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Tool washing</p>
         </c>
         <c ca="right">
            <p>0.20</p>
         </c>
         <c ca="right">
            <p>0.88</p>
         </c>
         <c ca="right">
            <p>(0.80, 0.97)</p>
         </c>
         <c ca="right">
            <p>0.05</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Rice planting</p>
         </c>
         <c ca="right">
            <p>0.65</p>
         </c>
         <c ca="right">
            <p>0.71</p>
         </c>
         <c ca="right">
            <p>(0.53, 0.96)</p>
         </c>
         <c ca="right">
            <p>0.15</p>
         </c>
         <c ca="right">
            <p>0.03</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Swimming</p>
         </c>
         <c ca="right">
            <p>0.21</p>
         </c>
         <c ca="right">
            <p>0.96</p>
         </c>
         <c ca="right">
            <p>(0.87, 1.06)</p>
         </c>
         <c ca="right">
            <p>0.05</p>
         </c>
         <c ca="right">
            <p>0.38</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Ditch digging</p>
         </c>
         <c ca="right">
            <p>0.48</p>
         </c>
         <c ca="right">
            <p>0.94</p>
         </c>
         <c ca="right">
            <p>(0.80, 1.10)</p>
         </c>
         <c ca="right">
            <p>0.08</p>
         </c>
         <c ca="right">
            <p>0.42</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Bathing</p>
         </c>
         <c ca="right">
            <p>0.49</p>
         </c>
         <c ca="right">
            <p>1.09</p>
         </c>
         <c ca="right">
            <p>(0.88, 1.35)</p>
         </c>
         <c ca="right">
            <p>0.11</p>
         </c>
         <c ca="right">
            <p>0.42</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Laundry</p>
         </c>
         <c ca="right">
            <p>0.22</p>
         </c>
         <c ca="right">
            <p>0.97</p>
         </c>
         <c ca="right">
            <p>(0.89, 1.06)</p>
         </c>
         <c ca="right">
            <p>0.04</p>
         </c>
         <c ca="right">
            <p>0.45</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Fishing</p>
         </c>
         <c ca="right">
            <p>0.02</p>
         </c>
         <c ca="right">
            <p>1.00</p>
         </c>
         <c ca="right">
            <p>(0.98, 1.02)</p>
         </c>
         <c ca="right">
            <p>0.01</p>
         </c>
         <c ca="right">
            <p>0.83</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The prevalence of water contact for each activity type in our study population is also shown.</p>
   </tblfn></tbl>
<p>In terms of VI, no other type of water contact had as large an impact on infection risk as July water contact. Tool washing and rice planting were the only two activities with a discernible impact on infection risk - all other activity types (Table <tblr tid="T5">5</tblr>) have VI estimates near one and 95% confidence intervals that include one. The VI estimates for tool washing and rice planting, in contrast, have 95% confidence intervals that do not cross one. Interpreting the VI results once again as estimates of <inline-formula><m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1742-5573-7-3-i12"><m:mrow><m:mfrac><m:mrow><m:mi>E</m:mi><m:mo stretchy="false">(</m:mo><m:msub><m:mi>Y</m:mi><m:mn>0</m:mn></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mrow><m:mi>E</m:mi><m:mo stretchy="false">(</m:mo><m:mi>Y</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mfrac></m:mrow></m:math></inline-formula> would imply an estimated 12% reduction in the prevalence of schistosomiasis by eliminating tool washing and an estimated 29% reduction by eliminating rice planting. The associated 95% confidence intervals for these estimates imply a range of 3% to 20% for tool washing and 4% to 47% for rice planting. As shown in Table <tblr tid="T5">5</tblr>, the prevalence of water exposure due to tool washing in our study population was 0.20, while the prevalence of water exposure due to rice planting was 0.65.</p>
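<p>The conversion from a VI ratio to a percent reduction can be sketched as follows. This is a minimal illustration only: the function name <it>percent_reduction</it> is hypothetical, and the inputs are the VI estimates and 95% confidence limits reported in Table 5. Because the ratio is below one when eliminating an exposure lowers prevalence, the upper confidence limit of the ratio corresponds to the smallest reduction and the lower limit to the largest.</p>
<p>
```python
def percent_reduction(vi_ratio, ci_low, ci_high):
    """Convert a ratio E(Y_0)/E(Y) and its 95% CI into percent reductions.

    Returns (point, smallest, largest) reduction in prevalence, in percent.
    The upper CI limit of the ratio gives the smallest reduction and the
    lower CI limit gives the largest.
    """
    point = round((1.0 - vi_ratio) * 100.0, 1)
    smallest = round((1.0 - ci_high) * 100.0, 1)
    largest = round((1.0 - ci_low) * 100.0, 1)
    return point, smallest, largest


# Values taken from Table 5 (VI estimate and 95% CI).
print(percent_reduction(0.88, 0.80, 0.97))  # tool washing: (12.0, 3.0, 20.0)
print(percent_reduction(0.71, 0.53, 0.96))  # rice planting: (29.0, 4.0, 47.0)
```
</p>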
</sec>
<sec><st><p>Conclusions</p></st>
<p>The three analysis approaches used here are all attempts to answer the same research question: what is the best estimate of the contribution of one explanatory variable to the mean outcome in the presence of other correlated explanatory variables? We specifically hoped to see how various types of water contact affected the probability of a positive stool sample, adjusting for other types of water contact, age, and village.</p>
<p>The use of machine learning algorithms for model selection is attractive, particularly because the model does not have to be pre-specified; this means estimating the association parameters while acknowledging that very little is typically known about the form of the model. A comparison of Figure <figr fid="F1">1</figr> and Table <tblr tid="T1">1</tblr>, however, provides an example of how simply determining whether or not a variable is chosen by a machine learning algorithm (such as <it>rpart</it>) is not a particularly robust procedure for defining the importance of a variable. Given a finite sample size and highly correlated predictors - as we have in our data - small changes in the data often result in large changes in the variables chosen as predictors. This can occur even as the fidelity of prediction is nearly unchanged; there are often several sets of variables in various functional forms that can provide nearly identical accuracy of prediction. This issue is partially what inspired the idea of bagging or bootstrapping these machine learning algorithms, such as in the case of random forests <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. For example, our full data tree could lead us to conclude that total water contact is less predictive of a positive stool sample than the specific activity and month variables chosen to be part of the tree. Table <tblr tid="T1">1</tblr>, however, would lead us to conclude that total water contact is one of the top three most predictive variables - and therefore more "important" than four out of the six variables identified in the full data tree. Due to this instability, machine learning algorithms alone provide sub-optimal information for determining the importance of variables.</p>
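<p>The instability described above can be illustrated with a small simulation. The sketch below is not the analysis performed in this study: it uses invented data, a single-split rule as a simplified stand-in for a classification tree, and hypothetical function names (<it>simulate</it>, <it>best_single_predictor</it>, <it>selection_counts</it>). Three binary predictors are noisy copies of one shared exposure, and the code counts how often each predictor is selected across bootstrap resamples of the same data.</p>
<p>
```python
import random


def simulate(n=200, seed=1):
    """Simulate n subjects with three highly correlated binary predictors.

    Each predictor is a noisy copy of one shared latent exposure, and the
    binary outcome depends only on that latent exposure.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        latent = int(0.5 > rng.random())
        # Flip the latent value with probability 0.1, independently per predictor.
        x = tuple(latent ^ int(0.1 > rng.random()) for _ in range(3))
        # The outcome is the latent value flipped with probability 0.2.
        y = latent ^ int(0.2 > rng.random())
        rows.append((x, y))
    return rows


def best_single_predictor(rows):
    """Index of the predictor whose one-split rule misclassifies the fewest rows.

    Ties are broken in favour of the lowest index.
    """
    n_predictors = len(rows[0][0])
    errors = []
    for j in range(n_predictors):
        wrong = sum(1 for x, y in rows if x[j] != y)
        # A split may predict y = x[j] or y = 1 - x[j]; keep the better one.
        errors.append(min(wrong, len(rows) - wrong))
    return errors.index(min(errors))


def selection_counts(rows, n_boot=200, seed=2):
    """Count how often each predictor is selected across bootstrap resamples."""
    rng = random.Random(seed)
    counts = [0] * len(rows[0][0])
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]
        counts[best_single_predictor(resample)] += 1
    return counts


rows = simulate()
print("chosen on full data:", best_single_predictor(rows))
print("bootstrap selection counts:", selection_counts(rows))
```
</p>
<p>When the predictors carry nearly the same information, the selection counts are typically spread over several of them, even though each resample yields a rule of similar predictive accuracy.</p>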
<p>The actual best set of predictor variables is a function of the type of model, the method for constructing candidate models, and the method used to choose the so-called tuning parameters. Our results here therefore do not generalize to all machine learning routines - such as, for example, the Deletion/Substitution/Addition algorithm <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, POLYCLASS <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> or random forests <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. Generally, as implied by the results displayed in Table <tblr tid="T1">1</tblr> and Figure <figr fid="F1">1</figr>, prediction algorithms are not constructed to provide any easily interpretable estimates of each water contact variable's contribution to the probability of a positive stool sample, which is ultimately what we were trying to investigate. Machine learning algorithms can be applied most effectively to answering our question of interest when used within an estimation framework whose parameters are defined independently from the specific model chosen by a given algorithm (such as <it>rpart</it>). This semi-parametric approach, of which our VI analysis is an example, contrasts dramatically with estimating simple, parametric regression models and reporting the resulting coefficients as association parameters (such as the relative risks reported in Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr>). Though such regression analyses can produce parameters with relatively straightforward public health interpretations, the interpretations only remain straightforward if the pre-specified regression model is correct; any interpretation of the estimates obtained must implicitly assert the truth of the model used, though there is very rarely any justification for a specific parametric model's <it>a priori </it>truth. 
In addition, the lack of data-adaptive procedures can sacrifice power, because it can result in much larger residual variability than approaches that use the data to fit the models. Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr>, for example, show that under the constraints of the regression model, even the coefficients with 95% confidence intervals that did not cross one yielded relative risks very close to one, suggesting little contribution to the variability of the outcome. Whether this is a true result, however, or merely reflective of a poorly chosen model, is impossible to assess. The regression approach, though common, is therefore a dangerous choice as a basis for making causal inferences. Interpretations of parameters (conditional relative risks) in the context of a misspecified model are also of dubious value, since it is difficult to know what such interpretations really mean. This is true of the innumerable regression approaches reflexively used throughout observational epidemiology and other empirical fields.</p>
<p>Though one data analysis cannot justify the global use of an analysis technique, at least there is some hope that our approach here has found potentially interesting associations. Specifically, the importance of July water contact in our VI results - not detected by the regression analysis - could suggest temporal variability in infection risk during the infection season. This could be due to a combination of factors, since infection risk depends not only on water contact intensity but also on cercarial concentration in that water. A summer peak in cercarial concentration was observed in a number of villages in this same area in 2001 using a mouse bioassay procedure throughout the infection season <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. The peak occurred in August, not July, but year-to-year variability in cercarial concentration can be expected due to seasonal fluctuations in snail populations and agricultural activities driven by changes in rainfall, temperature, and humidity. Temporal variability in infection risk can also be influenced by seasonal changes in activities known to be associated with infection, such as swimming, which may increase during summer months when school is not in session and ambient temperatures are high. In addition, prior work has documented seasonal fluctuations in hydrology which correspond to differences in infection patterns between schistosomiasis endemic regions within Sichuan province <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. One must consider, however, that this dataset has a number of limitations. The retrospective nature of the water contact surveys calls into question the accuracy of recall - particularly given the relatively long period of time (seven months) during which study participants were asked to recount their water contact activities. 
The analysis also relies on the definition of water contact, which as previously described includes an estimate of the body surface area believed to be in contact with water during certain activities. We are additionally limited by the need to analyze the monthly water contact and water contact activity variables separately; while it would have been ideal to consider the 56 activity type-by-month variables, the number of covariates is simply too large in comparison with the sample size for any technique to single out individual contributions. We therefore chose to simplify the set of variables by considering activity separately from month, thus providing some power to detect adjusted associations.</p>
<p>While the results of this analysis are far from conclusive, they nonetheless suggest possibly fruitful areas for future research. If a high-risk period in the schistosomiasis infection season could be detected in something close to real time, new prevention options would be opened. Recent advances in detecting schistosome cercariae in water using PCR techniques could potentially provide such a tool <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. The notion of changing from a surveillance system that relies on episodic human infection surveys to one based on water monitoring has many attractions, including the likelihood of lower cost. Water monitoring is also an appealing option in areas where schistosomiasis re-emergence has occurred or is suspected <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>.</p>
<p>Though we compare here three specific analysis techniques, we note that many different machine learning algorithms (other than classification trees) are available, different regression models could be specified, and different approaches to estimating our VI parameter could be used (including G-computation and Targeted Maximum Likelihood) <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr></abbrgrp>. The general principles contrasting these methods remain the same, however, and are important in the larger issue of estimating the independent and potentially causal association of risk factors in data sets with large numbers of covariates. Prediction (machine learning) algorithms are well designed to provide optimal prediction and to balance the variance and bias in the predicted value (the estimate of <it>E(Y|A, W, V)</it>); they are not optimal for determining the contributions of individual variables directly. This is particularly obvious since small changes in the data can result in large changes in the variables chosen. In contrast, the standard regression model approach has a nicely interpretable parameter, but is entirely dependent upon the correctness of the model specified. The definition of the parameter itself is also generally tied to the form of the model - for example, adding a multiplicative interaction term to a regression model changes the meaning of the main effect term. Thus, the definition of a given parameter is only useful if the model is correct, and that parameter's interpretation changes as other variables are added to or removed from the model. In reality, such models are never correct, and there is no mechanism for allowing them more flexibility (such as through machine learning algorithms) to reduce bias as sample size grows. 
These issues expose the need for a meaningful parameter, one whose estimation can capitalize on the virtues of the asymptotic bias-reduction of machine learning algorithms and whose definition is not dependent upon the model chosen by these algorithms. The VI parameter we use is an answer to this need. We employ a machine learning algorithm to estimate the parameter, but differences in the model chosen by the algorithm do not change the definition of the parameter.</p>
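<p>To make concrete how such a parameter can be defined without reference to any particular fitted model, the sketch below computes a plug-in (G-computation) estimate of E(Y_0)/E(Y) on invented data. G-computation is one of the alternative estimators mentioned above, not the estimator used in this analysis. For simplicity, the estimate of E(Y | A=0, W=w) is a stratum-specific mean rather than the output of a machine learning algorithm. The function names <it>vi_ratio_gcomp</it> and <it>make_stratum</it>, the stratum labels, and all counts are hypothetical.</p>
<p>
```python
from collections import defaultdict


def vi_ratio_gcomp(records):
    """Plug-in (G-computation) estimate of E(Y_0)/E(Y).

    records: iterable of (a, w, y) with binary exposure a, a discrete
    confounder stratum w, and binary outcome y.
    """
    records = list(records)
    n = len(records)
    unexposed_sum = defaultdict(float)
    unexposed_n = defaultdict(int)
    for a, w, y in records:
        if a == 0:
            unexposed_sum[w] += y
            unexposed_n[w] += 1
    strata = {w for _, w, _ in records}
    missing = [w for w in strata if unexposed_n[w] == 0]
    if missing:
        # Positivity fails: no unexposed subjects to learn E(Y | A=0, W=w) from.
        raise ValueError("no unexposed subjects in strata: %r" % sorted(missing))
    # Stratum-specific risk among the unexposed: an estimate of E(Y | A=0, W=w).
    risk_unexposed = {w: unexposed_sum[w] / unexposed_n[w] for w in strata}
    # Average that risk over the observed distribution of W to estimate E(Y_0).
    mean_y0 = sum(risk_unexposed[w] for _, w, _ in records) / n
    mean_y = sum(y for _, _, y in records) / n
    return mean_y0 / mean_y


def make_stratum(w, exposed_cases, n_exposed, unexposed_cases, n_unexposed):
    """Build (a, w, y) records for one stratum from invented counts."""
    rows = [(1, w, 1)] * exposed_cases + [(1, w, 0)] * (n_exposed - exposed_cases)
    rows += [(0, w, 1)] * unexposed_cases
    rows += [(0, w, 0)] * (n_unexposed - unexposed_cases)
    return rows


# Invented data: two strata of 100 subjects each.
toy = make_stratum("young", 20, 50, 10, 50) + make_stratum("old", 40, 80, 5, 20)
print(round(vi_ratio_gcomp(toy), 3))  # 0.6
```
</p>
<p>In this toy example the estimated prevalence with the exposure removed is 0.225 against an observed prevalence of 0.375, giving a ratio of 0.6. Replacing the stratum means with a data-adaptive fit would change how E(Y | A=0, W=w) is estimated, but not the definition of the ratio itself.</p>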
<p>The semi-parametric approach is evolving, and recent advances promise to increase the power of this combination of machine learning and causal inference methods. We do not necessarily advocate the details of the semi-parametric VI algorithm used here - we in fact used a relatively inefficient method, and more refined methods are available to target model selection towards optimizing the particular parameter of interest <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>. We simply argue that it is possible to devise estimation strategies that, given unavoidable assumptions, can converge as sample size grows to unbiased estimates of the causal effects so defined. In addition to the aforementioned alternative approaches for estimating our VI parameter, one can also use so-called asymptotically linear estimators; these are asymptotically normally distributed, and in many cases simple standard errors based on this normality can be derived if one wishes to avoid re-sampling-based techniques (e.g. the bootstrap).</p>
<p>Risk factor epidemiology has for too long relied upon inherently biased techniques, particularly for observational data. There is no longer any reason to do so; the bias-reduction flexibility of semi-parametric models can be combined with estimation of simple and frankly more meaningful parameters in public health. We suggest using techniques that (1) define parameters with convenient public health interpretations, (2) use flexible, data-adaptive routines that do not pre-suppose arbitrary and scientifically unjustifiable models, and (3) employ honest inference that accounts for all the aspects of variation, including model selection.</p>
</sec>
<sec><st><p>Competing interests</p></st>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><st><p>Authors' contributions</p></st>
<p>SS primarily conducted the analyses, helped with writing and edited the paper. EC helped with writing the Background, and EC and ES helped interpret the results in the context of their study in China. RS helped to write the Conclusions and put the methodological contribution of the paper in the context of the methods used for this type of infectious disease study. AH helped to derive the methods, conduct the regression analyses, and write/edit the paper. All authors read and approved the final manuscript.</p>
</sec>
</bdy>
<bm>
<ack><sec><st><p>Acknowledgements</p></st>
<p>The authors acknowledge the field staffs of the Xichang County Anti-Schistosomiasis Control Station and Sichuan Province Institute of Parasitic Diseases for their assistance in collecting the data for this study. This study was supported by the National Institute of Allergy and Infectious Diseases (NIAID) and the National Institutes of Health (NIH) (Grants R01 AI50612 and R01 AI68854).</p>
</sec>
</ack>
<refgrp><bibl id="B1"><aug><au><cnm>WHO</cnm></au></aug><source>Report of the Scientific Working Group meeting on Schistosomiasis</source><publisher>WHO: Geneva</publisher><pubdate>2006</pubdate></bibl><bibl id="B2"><title><p>Reassessment of the cost of chronic helminthic infection: a meta-analysis of disability-related outcomes in endemic Schistosomiasis</p></title><aug><au><snm>King</snm><fnm>CH</fnm></au><au><snm>Dickman</snm><fnm>K</fnm></au><au><snm>Tisch</snm><fnm>DJ</fnm></au></aug><source>Am J Trop Med Hyg</source><pubdate>2005</pubdate><volume>70</volume><issue>4</issue><fpage>443</fpage><lpage>448</lpage></bibl><bibl id="B3"><title><p>Schistosoma japonicum reinfection after praziquantel treatment causes anemia associated with inflammation</p></title><aug><au><snm>Leenstra</snm><fnm>T</fnm></au><au><snm>Coutinho</snm><fnm>HM</fnm></au><au><snm>Acosta</snm><fnm>LP</fnm></au><au><snm>Langdon</snm><fnm>GC</fnm></au><au><snm>Su</snm><fnm>L</fnm></au><au><snm>Olveda</snm><fnm>RM</fnm></au><au><snm>McGarvey</snm><fnm>ST</fnm></au><au><snm>Kurtis</snm><fnm>JD</fnm></au><au><snm>Friedman</snm><fnm>JF</fnm></au></aug><source>Infect Immun</source><pubdate>2006</pubdate><volume>74</volume><issue>11</issue><fpage>6398</fpage><lpage>6407</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1128/IAI.00757-06</pubid><pubid idtype="pmcid">1695508</pubid><pubid idtype="pmpid">16923790</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Schistosomiasis</p></title><aug><au><snm>Ross</snm><fnm>AG</fnm></au><au><snm>Bartley</snm><fnm>PB</fnm></au><au><snm>Sleigh</snm><fnm>AC</fnm></au><au><snm>Olds</snm><fnm>GR</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Williams</snm><fnm>GM</fnm></au><au><snm>McManus</snm><fnm>DP</fnm></au></aug><source>N Engl J Med</source><pubdate>2002</pubdate><volume>346</volume><issue>16</issue><fpage>1212</fpage><lpage>1220</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1056/NEJMra012396</pubid><pubid idtype="pmpid" 
link="fulltext">11961151</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Spatial patterns of urinary schistosomiasis infection in a highly endemic area of coastal Kenya</p></title><aug><au><snm>Clennon</snm><fnm>JA</fnm></au><au><snm>King</snm><fnm>CH</fnm></au><au><snm>Muchiri</snm><fnm>EM</fnm></au><au><snm>Kariuki</snm><fnm>HC</fnm></au><au><snm>Ouma</snm><fnm>JH</fnm></au><au><snm>Mungai</snm><fnm>P</fnm></au><au><snm>Kitron</snm><fnm>U</fnm></au></aug><source>Am J Trop Med Hyg</source><pubdate>2004</pubdate><volume>70</volume><issue>4</issue><fpage>443</fpage><lpage>448</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">15100462</pubid></xrefbib></bibl><bibl id="B6"><title><p>Measuring exposure to Schistosoma japonicum in China. III. Activity diaries, snail and human infection, transmission ecology and options for control</p></title><aug><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Sleigh</snm><fnm>AC</fnm></au><au><snm>Williams</snm><fnm>GM</fnm></au><au><snm>Ross</snm><fnm>AG</fnm></au><au><snm>Forsyth</snm><fnm>SJ</fnm></au><au><snm>Tanner</snm><fnm>M</fnm></au><au><snm>McManus</snm><fnm>DP</fnm></au></aug><source>Acta Trop</source><pubdate>2000</pubdate><volume>75</volume><issue>3</issue><fpage>279</fpage><lpage>289</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0001-706X(00)00056-5</pubid><pubid idtype="pmpid" link="fulltext">10838211</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Individual and village-level study of water contact patterns and Schistosoma japonicum infection in mountainous rural China</p></title><aug><au><snm>Seto</snm><fnm>EY</fnm></au><au><snm>Lee</snm><fnm>YJ</fnm></au><au><snm>Liang</snm><fnm>S</fnm></au><au><snm>Zhong</snm><fnm>B</fnm></au></aug><source>Trop Med Int Health</source><pubdate>2007</pubdate><volume>12</volume><issue>10</issue><fpage>1199</fpage><lpage>1209</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1111/j.1365-3156.2007.01903.x</pubid><pubid idtype="pmpid" 
link="fulltext">17956502</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>The emergence of Schistosoma japonicum cercariae from Oncomelania quadrasi</p></title><aug><au><snm>Nojima</snm><fnm>H</fnm></au><au><snm>Santos</snm><fnm>AT</fnm></au><au><snm>Blas</snm><fnm>BL</fnm></au><au><snm>Kamiya</snm><fnm>H</fnm></au></aug><source>J Parasitol</source><pubdate>1980</pubdate><volume>66</volume><issue>6</issue><fpage>1010</fpage><lpage>1013</lpage><xrefbib><pubidlist><pubid idtype="doi">10.2307/3280406</pubid><pubid idtype="pmpid">7218093</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Epidemiology of Schistosoma japonicum in China: morbidity and strategies for control in the Dongting Lake region</p></title><aug><au><snm>Li</snm><fnm>YS</fnm></au><au><snm>Sleigh</snm><fnm>AC</fnm></au><au><snm>Ross</snm><fnm>AG</fnm></au><au><snm>Williams</snm><fnm>GM</fnm></au><au><snm>Tanner</snm><fnm>M</fnm></au><au><snm>McManus</snm><fnm>DP</fnm></au></aug><source>Int J Parasitol</source><pubdate>2000</pubdate><volume>30</volume><issue>3</issue><fpage>273</fpage><lpage>281</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0020-7519(99)00201-5</pubid><pubid idtype="pmpid" link="fulltext">10719120</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>Weather-driven dynamics of an intermediate host: mechanistic and statistical population modelling of Oncomelania hupensis</p></title><aug><au><snm>Remais</snm><fnm>J</fnm></au><au><snm>Hubbard</snm><fnm>A</fnm></au><au><snm>Wu</snm><fnm>ZS</fnm></au><au><snm>Spear</snm><fnm>RC</fnm></au></aug><source>Journal of Applied Ecology</source><pubdate>2007</pubdate><volume>44</volume><issue>4</issue><fpage>781</fpage><lpage>791</lpage><xrefbib><pubid idtype="doi">10.1111/j.1365-2664.2007.01305.x</pubid></xrefbib></bibl><bibl id="B11"><title><p>Coupling hydrologic and infectious disease models to explain regional differences in schistosomiasis transmission in southwestern 
China</p></title><aug><au><snm>Remais</snm><fnm>J</fnm></au><au><snm>Liang</snm><fnm>S</fnm></au><au><snm>Spear</snm><fnm>RC</fnm></au></aug><source>Environmental Science &amp; Technology</source><pubdate>2008</pubdate><volume>42</volume><issue>7</issue><fpage>2643</fpage><lpage>2649</lpage></bibl><bibl id="B12"><title><p>Risk factors for Schistosoma mansoni and hookworm in urban farming communities in western Cote d'Ivoire</p></title><aug><au><snm>Matthys</snm><fnm>B</fnm></au><au><snm>Tschannen</snm><fnm>AB</fnm></au><au><snm>Tian-Bi</snm><fnm>NT</fnm></au><au><snm>Comoe</snm><fnm>H</fnm></au><au><snm>Diabate</snm><fnm>S</fnm></au><au><snm>Traore</snm><fnm>M</fnm></au><au><snm>Vounatsou</snm><fnm>P</fnm></au><au><snm>Raso</snm><fnm>G</fnm></au><au><snm>Gosoniu</snm><fnm>L</fnm></au><au><snm>Tanner</snm><fnm>M</fnm></au><au><snm>Cisse</snm><fnm>G</fnm></au><au><snm>N&apos;Goran</snm><fnm>EK</fnm></au><au><snm>Utzinger</snm><fnm>J</fnm></au></aug><source>Trop Med Int Health</source><pubdate>2007</pubdate><volume>12</volume><issue>6</issue><fpage>709</fpage><lpage>723</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1111/j.1365-3156.2007.01841.x</pubid><pubid idtype="pmpid" link="fulltext">17550468</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>Micro-epidemiology of urinary schistosomiasis in Zanzibar: Local risk factors associated with distribution of infections among schoolchildren and relevance for control</p></title><aug><au><snm>Rudge</snm><fnm>JW</fnm></au><au><snm>Stothard</snm><fnm>JR</fnm></au><au><snm>Basanez</snm><fnm>MG</fnm></au><au><snm>Mgeni</snm><fnm>AF</fnm></au><au><snm>Khamis</snm><fnm>IS</fnm></au><au><snm>Khamis</snm><fnm>AN</fnm></au><au><snm>Rollingson</snm><fnm>D</fnm></au></aug><source>Acta Trop</source><pubdate>2008</pubdate><volume>105</volume><issue>1</issue><fpage>45</fpage><lpage>54</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.actatropica.2007.09.006</pubid><pubid idtype="pmpid" 
link="fulltext">17996207</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models</p></title><aug><au><snm>Robins</snm><fnm>JM</fnm></au><au><snm>Ritov</snm><fnm>Y</fnm></au></aug><source>Statistics in Medicine</source><pubdate>1997</pubdate><volume>16</volume><fpage>285</fpage><lpage>319</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/(SICI)1097-0258(19970215)16:3&lt;285::AID-SIM535&gt;3.0.CO;2-#</pubid><pubid idtype="pmpid" link="fulltext">9004398</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health</p></title><aug><au><snm>Hubbard</snm><fnm>AE</fnm></au><au><snm>Ahern</snm><fnm>J</fnm></au><au><snm>Fleischer</snm><fnm>NL</fnm></au><au><snm>van der Laan</snm><fnm>M</fnm></au><au><snm>Lippman</snm><fnm>SA</fnm></au><au><snm>Jewell</snm><fnm>N</fnm></au><au><snm>Bruckner</snm><fnm>T</fnm></au><au><snm>Satariano</snm><fnm>WA</fnm></au></aug><source>Epidemiology</source><pubdate>2010</pubdate><volume>21</volume><issue>4</issue><fpage>475</fpage><lpage>8</lpage><note>discussion 479-81</note><xrefbib><pubidlist><pubid idtype="doi">10.1097/EDE.0b013e3181caeb90</pubid><pubid idtype="pmpid" link="fulltext">20539108</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Estimating the effects of potential public health interventions on population disease burden: a step-by-step illustration of causal inference methods</p></title><aug><au><snm>Ahern</snm><fnm>J</fnm></au><au><snm>Hubbard</snm><fnm>A</fnm></au><au><snm>Galea</snm><fnm>S</fnm></au></aug><source>American Journal of Epidemiology</source><pubdate>2009</pubdate><volume>169</volume><issue>9</issue><fpage>1140</fpage><lpage>1147</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/aje/kwp015</pubid><pubid idtype="pmcid">2732980</pubid><pubid 
idtype="pmpid">19270051</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Depressive symptoms in low-income women in rural Mexico</p></title><aug><au><snm>Fleischer</snm><fnm>NL</fnm></au><au><snm>Fernald</snm><fnm>LC</fnm></au><au><snm>Hubbard</snm><fnm>AE</fnm></au></aug><source>Epidemiology</source><pubdate>2007</pubdate><volume>18</volume><issue>6</issue><fpage>678</fpage><lpage>685</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1097/EDE.0b013e3181567fc5</pubid><pubid idtype="pmpid" link="fulltext">18049184</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Population Intervention Models</p></title><aug><au><snm>Hubbard</snm><fnm>AE</fnm></au><au><snm>van der Laan</snm><fnm>M</fnm></au></aug><source>Biometrika</source><pubdate>2007</pubdate><volume>95</volume><fpage>35</fpage><lpage>47</lpage><xrefbib><pubid idtype="doi">10.1093/biomet/asm097</pubid></xrefbib></bibl><bibl id="B19"><title><p>Maximum likelihood estimation of the attributable fraction from logistic models</p></title><aug><au><snm>Greenland</snm><fnm>S</fnm></au><au><snm>Drescher</snm><fnm>K</fnm></au></aug><source>Biometrics</source><pubdate>1993</pubdate><volume>49</volume><issue>3</issue><fpage>865</fpage><lpage>72</lpage><xrefbib><pubidlist><pubid idtype="doi">10.2307/2532206</pubid><pubid idtype="pmpid">8241375</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Factors influencing the transmission of Schistosoma japonicum in the mountains of Sichuan Province of China</p></title><aug><au><snm>Spear</snm><fnm>RC</fnm></au><au><snm>Seto</snm><fnm>E</fnm></au><au><snm>Liang</snm><fnm>S</fnm></au><au><snm>Birkner</snm><fnm>M</fnm></au><au><snm>Hubbard</snm><fnm>A</fnm></au><au><snm>Qiu</snm><fnm>D</fnm></au><au><snm>Yang</snm><fnm>C</fnm></au><au><snm>Zhong</snm><fnm>B</fnm></au><au><snm>Xu</snm><fnm>F</fnm></au><au><snm>Gu</snm><fnm>X</fnm></au><au><snm>Davis</snm><fnm>GM</fnm></au></aug><source>Am J Trop Med 
Hyg</source><pubdate>2004</pubdate><volume>70</volume><issue>1</issue><fpage>48</fpage><lpage>56</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">14971698</pubid></xrefbib></bibl><bibl id="B21"><title><p>Simplified calculation of body-surface area</p></title><aug><au><snm>Mosteller</snm><fnm>RD</fnm></au></aug><source>N Engl J Med</source><pubdate>1987</pubdate><volume>317</volume><issue>17</issue><fpage>1098</fpage><xrefbib><pubid idtype="pmpid" link="fulltext">3657876</pubid></xrefbib></bibl><bibl id="B22"><aug><au><cnm>The Office of Endemic Disease Control MoH</cnm></au></aug><source>Handbook of Schistosomiasis Control</source><publisher>Shanghai:Shanghai Science &amp; Technology Press</publisher><pubdate>2000</pubdate></bibl><bibl id="B23"><title><p>A simple device for quantitative stool thick-smear technique in Schistosomiasis mansoni</p></title><aug><au><snm>Katz</snm><fnm>N</fnm></au><au><snm>Chaves</snm><fnm>A</fnm></au><au><snm>Pellegrino</snm><fnm>J</fnm></au></aug><source>Rev Inst Med Trop Sao Paulo</source><pubdate>1972</pubdate><volume>14</volume><issue>6</issue><fpage>397</fpage><lpage>400</lpage><xrefbib><pubid idtype="pmpid">4675644</pubid></xrefbib></bibl><bibl id="B24"><title><p>R version 2.10.0, Copyright (C) 2009 The R Foundation for Statistical Computing</p></title><url>http://www.r-project.org</url></bibl><bibl id="B25"><aug><au><snm>Breiman</snm><fnm></fnm></au><au><snm>Friedman</snm><fnm></fnm></au><au><snm>Olshen</snm><fnm></fnm></au><au><snm>Stone</snm><fnm></fnm></au></aug><source>Classification and Regression Trees. Wadsworth</source><pubdate>1984</pubdate></bibl><bibl id="B26"><aug><au><snm>Atkinson</snm><fnm>EJ</fnm></au><au><snm>Therneau</snm><fnm>TM</fnm></au></aug><source>An Introduction to Recursive Partitioning Using the RPART Routines. 
Technical Report 61, Mayo Clinic, Section of Statistics</source><pubdate>1997</pubdate></bibl><bibl id="B27"><title><p>SuperLearning: an application to the prediction of HIV-1 drug resistance</p></title><aug><au><snm>Sinisi</snm><fnm>SE</fnm></au><au><snm>Polley</snm><fnm>EC</fnm></au><au><snm>Petersen</snm><fnm>ML</fnm></au><au><snm>Rhee</snm><fnm>SY</fnm></au><au><snm>van der Laan</snm><fnm>MJ</fnm></au></aug><source>Statistical Applications in Genetics and Molecular Biology</source><pubdate>2007</pubdate><volume>6</volume><issue>1</issue><fpage>Article 7</fpage><xrefbib><pubid idtype="doi">10.2202/1544-6115.1240</pubid></xrefbib></bibl><bibl id="B28"><aug><au><snm>Efron</snm><fnm>B</fnm></au></aug><source>The Jackknife, the Bootstrap and Other Re-sampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38</source><publisher>Capital City Press</publisher><pubdate>1982</pubdate></bibl><bibl id="B29"><title><p>The behavior of maximum likelihood estimates under nonstandard conditions</p></title><aug><au><snm>Huber</snm><fnm>PJ</fnm></au></aug><source>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability</source><publisher>University of California Press</publisher><pubdate>1967</pubdate><volume>1</volume><fpage>221</fpage><lpage>223</lpage></bibl><bibl id="B30"><title><p>A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity</p></title><aug><au><snm>White</snm><fnm>H</fnm></au></aug><source>Econometrica</source><pubdate>1980</pubdate><volume>48</volume><fpage>817</fpage><lpage>830</lpage><xrefbib><pubid idtype="doi">10.2307/1912934</pubid></xrefbib></bibl><bibl id="B31"><title><p>Stata 10, StataCorp LP, College Station, TX</p></title></bibl><bibl id="B32"><title><p>Bayesian inference for causal effects: the role of randomization</p></title><aug><au><snm>Rubin</snm><fnm>DB</fnm></au></aug><source>Ann 
Statist</source><pubdate>1978</pubdate><volume>6</volume><fpage>34</fpage><lpage>58</lpage><xrefbib><pubid idtype="doi">10.1214/aos/1176344064</pubid></xrefbib></bibl><bibl id="B33"><title><p>Comment on a paper by P.W. Holland</p></title><aug><au><snm>Rubin</snm><fnm>DB</fnm></au></aug><source>J Am Statist Assoc</source><pubdate>1986</pubdate><volume>81</volume><fpage>961</fpage><lpage>2</lpage><xrefbib><pubid idtype="doi">10.2307/2289065</pubid></xrefbib></bibl><bibl id="B34"><title><p>Constructing inverse probability weights for marginal structural models</p></title><aug><au><snm>Cole</snm><fnm>SR</fnm></au><au><snm>Hern&#225;n</snm><fnm>MA</fnm></au></aug><source>Am J Epidemiology</source><pubdate>2008</pubdate><volume>168</volume><issue>6</issue><fpage>656</fpage><lpage>664</lpage><xrefbib><pubid idtype="doi">10.1093/aje/kwn164</pubid></xrefbib></bibl><bibl id="B35"><title><p>An application of model-fitting procedures for marginal structural models</p></title><aug><au><snm>Mortimer</snm><fnm>KM</fnm></au><au><snm>Neugebauer</snm><fnm>R</fnm></au><au><snm>van der Laan</snm><fnm>M</fnm></au><au><snm>Tager</snm><fnm>IB</fnm></au></aug><source>Am J Epidemiology</source><pubdate>2005</pubdate><volume>162</volume><issue>4</issue><fpage>382</fpage><lpage>388</lpage><xrefbib><pubid idtype="doi">10.1093/aje/kwi208</pubid></xrefbib></bibl><bibl id="B36"><title><p>Effects of socioeconomic and racial residential segregation on preterm birth: a cautionary tale of structural confounding</p></title><aug><au><snm>Messer</snm><fnm>LC</fnm></au><au><snm>Oakes</snm><fnm>JM</fnm></au><au><snm>Mason</snm><fnm>S</fnm></au></aug><source>Am J Epidemiology</source><pubdate>2010</pubdate><volume>171</volume><issue>6</issue><fpage>664</fpage><lpage>73</lpage><xrefbib><pubid idtype="doi">10.1093/aje/kwp435</pubid></xrefbib></bibl><bibl id="B37"><title><p>Random forests</p></title><aug><au><snm>Breiman</snm><fnm>L</fnm></au></aug><source>Machine 
Learning</source><pubdate>2001</pubdate><volume>45</volume><issue>1</issue><fpage>5</fpage><lpage>32</lpage><xrefbib><pubid idtype="doi">10.1023/A:1010933404324</pubid></xrefbib></bibl><bibl id="B38"><title><p>Deletion/Substitution/Addition algorithm in learning with applications in genomics</p></title><aug><au><snm>Sinisi</snm><fnm>SE</fnm></au><au><snm>van der Laan</snm><fnm>MJ</fnm></au></aug><source>Statistical Applications in Genetics and Molecular Biology</source><pubdate>2004</pubdate><volume>3</volume><issue>1</issue><xrefbib><pubidlist><pubid idtype="doi">10.2202/1544-6115.1069</pubid><pubid idtype="pmpid" link="fulltext">16646796</pubid></pubidlist></xrefbib></bibl><bibl id="B39"><title><p>Polychotomous regression</p></title><aug><au><snm>Kooperberg</snm><fnm>C</fnm></au><au><snm>Bose</snm><fnm>S</fnm></au><au><snm>Stone</snm><fnm>CJ</fnm></au></aug><source>Jour Am Stat Assoc</source><pubdate>1997</pubdate><volume>92</volume><fpage>117</fpage><lpage>127</lpage><xrefbib><pubid idtype="doi">10.2307/2291455</pubid></xrefbib></bibl><bibl id="B40"><title><p>Spatial and temporal variability in schistosome cercarial density detected by mouse bioassays in village irrigation ditches in Sichuan, China</p></title><aug><au><snm>Spear</snm><fnm>RC</fnm></au><au><snm>Zhong</snm><fnm>B</fnm></au><au><snm>Mao</snm><fnm>Y</fnm></au><au><snm>Hubbard</snm><fnm>A</fnm></au><au><snm>Birkner</snm><fnm>M</fnm></au><au><snm>Remais</snm><fnm>J</fnm></au><au><snm>Qiu</snm><fnm>D</fnm></au></aug><source>Am J Trop Med Hyg</source><pubdate>2004</pubdate><volume>71</volume><issue>5</issue><fpage>554</fpage><lpage>557</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">15569783</pubid></xrefbib></bibl><bibl id="B41"><title><p>Quantitative detection of Schistosoma japonicum cercariae in water by real-time PCR</p></title><aug><au><snm>Hung</snm><fnm>YW</fnm></au><au><snm>Remais</snm><fnm>J</fnm></au></aug><source>PLoS Neglected Tropical 
Diseases</source><pubdate>2008</pubdate><volume>2</volume><fpage>e337</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pntd.0000337</pubid><pubid idtype="pmcid">2580822</pubid><pubid idtype="pmpid">19015722</pubid></pubidlist></xrefbib></bibl><bibl id="B42"><title><p>Environmental effects on parasitic disease transmission exemplified by schistosomiasis in western China</p></title><aug><au><snm>Liang</snm><fnm>S</fnm></au><au><snm>Seto</snm><fnm>EY</fnm></au><au><snm>Remais</snm><fnm>JV</fnm></au><au><snm>Zhong</snm><fnm>B</fnm></au><au><snm>Yang</snm><fnm>C</fnm></au><au><snm>Hubbard</snm><fnm>A</fnm></au><au><snm>Davis</snm><fnm>GM</fnm></au><au><snm>Gu</snm><fnm>X</fnm></au><au><snm>Qiu</snm><fnm>D</fnm></au><au><snm>Spear</snm><fnm>RC</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2007</pubdate><volume>104</volume><issue>17</issue><fpage>7110</fpage><lpage>5</lpage><note>Epub 2007 Apr 16</note><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0701878104</pubid><pubid idtype="pmcid">1852328</pubid><pubid idtype="pmpid">17438266</pubid></pubidlist></xrefbib></bibl><bibl id="B43"><title><p>Marginal structural models versus structural nested models as tools for causal inference</p></title><aug><au><snm>Robins</snm><fnm>JM</fnm></au></aug><source>Statistical Models in Epidemiology, the Environment, and Clinical Trials</source><publisher>Springer, New York</publisher><editor>Halloran ME, Berry D</editor><pubdate>2000</pubdate><fpage>95</fpage><lpage>113</lpage></bibl><bibl id="B44"><title><p>Targeted maximum likelihood learning</p></title><aug><au><snm>van der Laan</snm><fnm>MJ</fnm></au><au><snm>Rubin</snm><fnm>D</fnm></au></aug><source>The International Journal of Biostatistics</source><pubdate>2006</pubdate><volume>2</volume><issue>1</issue></bibl><bibl id="B45"><title><p>Collaborative double robust targeted penalized maximum likelihood estimation</p></title><aug><au><snm>van der 
Laan</snm><fnm>MJ</fnm></au><au><snm>Gruber</snm><fnm>S</fnm></au></aug><source>U.C. Berkeley Division of Biostatistics Working Paper Series Working Paper 246</source><pubdate>2009</pubdate><url>http://www.bepress.com/ucbbiostat/paper246</url></bibl></refgrp>
</bm>
</art>
