Theses and Dissertations
https://hdl.handle.net/10217/100519
2024-02-21T12:04:33Z

https://hdl.handle.net/10217/237455
dc.title: Bayesian models and streaming samplers for complex data with application to network regression and record linkage
dc.contributor.author: Taylor, Ian M., author; Kaplan, Andee, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh P., committee member; Koslovsky, Matthew D., committee member; van Leeuwen, Peter Jan, committee member
dc.description.abstract: Real-world statistical problems often feature complex data due to either the structure of the data itself or the methods used to collect the data. In this dissertation, we present three methods for the analysis of specific complex data: Restricted Network Regression, Streaming Record Linkage, and Generative Filtering. Network data contain observations about the relationships between entities. Applying mixed models to network data can be problematic when the primary interest is estimating unconditional regression coefficients and some covariates are exactly or nearly in the vector space of node-level effects. We introduce the Restricted Network Regression model that removes the collinearity between fixed and random effects in network regression by orthogonalizing the random effects against the covariates. We discuss the change in the interpretation of the regression coefficients in Restricted Network Regression and analytically characterize the effect of Restricted Network Regression on the regression coefficients for continuous response data. We show through simulation on continuous and binary data that Restricted Network Regression mitigates, but does not eliminate, network confounding. We apply the Restricted Network Regression model in an analysis of 2015 Eurovision Song Contest voting data and show how the choice of regression model affects inference. Data that are collected from multiple noisy sources pose challenges to analysis due to potential errors and duplicates. Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file.
We approach streaming record linkage from a Bayesian perspective with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. We generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy as well as the computational trade-offs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Motivated by the streaming data setting and streaming record linkage, we propose a more general sampling method for Bayesian models for streaming data. In the streaming data setting, Bayesian models can employ recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. Filtering methods are currently used to perform these updates efficiently; however, they suffer from eventual degradation as the number of unique values within the filtered samples decreases. We propose Generative Filtering, a method for efficiently performing recursive Bayesian updates in the streaming setting. Generative Filtering retains the speed of a filtering method while using parallel updates to avoid degenerate distributions after repeated applications. We derive rates of convergence for Generative Filtering and conditions for the use of sufficient statistics instead of storing all past data. We investigate properties of Generative Filtering through simulation and an application to ecological species count data.
dc.description: Includes bibliographical references.; 2023 Fall.
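The core orthogonalization in Restricted Network Regression can be sketched in a few lines of numpy. This is a minimal illustration under assumed dimensions (the design matrices `X` and `Z` and all sizes here are hypothetical, not code from the dissertation): project the random-effect design onto the orthogonal complement of the covariate column space, which removes the collinearity between fixed and random effects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dyadic design: n observations, p covariates, q node-level
# random-effect columns.
n, p, q = 50, 3, 10
X = rng.normal(size=(n, p))   # fixed-effect covariates
Z = rng.normal(size=(n, q))   # random-effect design

# Projection onto the orthogonal complement of the column space of X.
P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Restricted random-effect design: orthogonal to every covariate,
# so the random effects can no longer absorb covariate signal.
Z_restricted = P_perp @ Z

print(np.abs(X.T @ Z_restricted).max())  # numerically zero
```

By construction `X.T @ Z_restricted` vanishes, so the fixed-effect coefficients are estimated without competition from node-level effects lying in the covariate space.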
2023-01-01T00:00:00Z

https://hdl.handle.net/10217/237411
dc.title: Integrated statistical models in ecology
dc.contributor.author: Van Ee, Justin, author; Hooten, Mevin, advisor; Koslovsky, Matthew, advisor; Keller, Kayleigh, committee member; Kaplan, Andee, committee member; Bailey, Larissa, committee member
dc.description.abstract: The number of endangered and vulnerable species continues to grow globally as a result of habitat destruction, overharvesting, invasive species, and climate change. Understanding the drivers of population decline is pivotal for informing species conservation. Many datasets collected are restricted to a limited portion of the species range, may not include observations of other organisms in the community, or lack temporal breadth. When analyzed independently, these datasets often overlook drivers of population decline, muddle community responses to ecological threats, and poorly predict population trajectories. Over the last decade, thanks to efforts like The Long Term Ecological Research Network and National Ecological Observatory Network, citizen science surveys, and technological advances, ecological datasets that provide insights about collections of organisms or multiple characteristics of the same organism have become prevalent. The conglomerate of datasets has the potential to provide novel insights, improve predictive performance, and disentangle the contributions of confounded factors, but specifying joint models that assimilate all the available data sources is both intellectually daunting and computationally prohibitive. I develop methodology for specifying computationally efficient integrated models. I discuss datasets frequently collected in ecology, objectives common to many analyses, and the methodological challenges associated with specifying joint models in these contexts. I introduce a suite of model building and computational techniques I used to facilitate inference in three applied analyses of ecological data. In a case study of the joint mammalian response to the bark beetle epidemic in Colorado, I describe a restricted regression approach to deconfounding the effects of environmental factors and community structure on species distributions. 
I highlight that fitting certain joint species distribution models in a restricted parameterization improves sampling efficiency. To improve abundance estimates for a federally protected species, I specify an integrated model for analyzing independent aerial and ground surveys. I use a Markov melding approach to facilitate posterior inference and construct the joint distribution implied by the prior information, assumptions, and data expressed across a chain of submodels. I extend the integrated model by assimilating additional demographic surveys of the species that allow abundance estimates to be linked to annual variability in population vital rates. To reduce computation time, both models are fit using a multi-stage Markov chain Monte Carlo algorithm with parallelization. In each applied analysis, I uncover associations that would have been overlooked had the datasets been analyzed independently and improve predictive performance relative to models fit to individual datasets.
dc.description: Includes bibliographical references.; 2023 Fall.; Zip file contains Animation of annual survey effort.
2023-01-01T00:00:00Z

https://hdl.handle.net/10217/236972
dc.title: Statistical models for COVID-19 infection fatality rates and diagnostic test data
dc.contributor.author: Pugh, Sierra, author; Wilson, Ander, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh, committee member; Meyer, Mary, committee member; Gutilla, Molly, committee member
dc.description.abstract: The COVID-19 pandemic has had devastating impacts worldwide. Early in the pandemic, little was known about the emerging disease. It was essential to develop data science tools to inform public health policy and interventions. We developed methods to fill three gaps in the literature. A first key task for scientists at the start of the pandemic was to develop diagnostic tests to classify an individual's disease status as positive or negative and to estimate community prevalence. Researchers rapidly developed diagnostic tests, yet there was a lack of guidance on how to select a cutoff to classify positive and negative test results for COVID-19 antibody tests developed with limited numbers of controls with known disease status. We propose selecting a cutoff using extreme value theory and compare this method to existing methods through a data analysis and simulation study. Second, there was no cohesive method for estimating the infection fatality rate (IFR) of COVID-19 that fully accounted for uncertainty in the fatality data, seroprevalence study data, and antibody test characteristics. We developed a Bayesian model to jointly model these data to fully account for the many sources of uncertainty. A third challenge is providing information that can be used to compare seroprevalence and IFR across locations to best allocate resources and target public health interventions. It is particularly important to account for differences in age distributions when comparing across locations, as age is a well-established risk factor for COVID-19 mortality. There is a lack of methods for estimating the seroprevalence and IFR as continuous functions of age while adequately accounting for uncertainty. We present a Bayesian hierarchical model that jointly estimates seroprevalence and IFR as continuous functions of age, sharing information across locations to improve identifiability.
We use this model to estimate seroprevalence and IFR in 26 developing country locations.
dc.description: 2023 Summer.; Includes bibliographical references.
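A cutoff chosen via extreme value theory, as the abstract describes, can be sketched with the peaks-over-threshold recipe: fit a generalized Pareto distribution to the upper tail of negative-control readings and read off a high quantile. The simulated data, threshold choice, and target false-positive rate below are all hypothetical; this is a sketch of the general technique, not the dissertation's method verbatim.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
# Hypothetical antibody-assay readings from known-negative controls.
negatives = rng.lognormal(mean=0.0, sigma=0.5, size=200)

u = np.quantile(negatives, 0.8)          # threshold for tail fitting
exceed = negatives[negatives > u] - u    # exceedances over the threshold
c, _, scale = genpareto.fit(exceed, floc=0)

# Choose the cutoff so that P(Y > cutoff) = P(Y > u) * P(Y - u > cutoff - u | Y > u)
# equals a desired overall false-positive rate (1% here, an assumed target).
p_u = exceed.size / negatives.size
target = 0.01
cutoff = u + genpareto.ppf(1 - target / p_u, c, scale=scale)
print(cutoff)
```

The advantage over taking an empirical quantile is that the fitted tail lets the cutoff extrapolate beyond the limited number of observed controls.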
2023-01-01T00:00:00Z

https://hdl.handle.net/10217/236959
dc.title: Methodology in air pollution epidemiology for large-scale exposure prediction and environmental trials with non-compliance
dc.contributor.author: Ryder, Nathan, author; Keller, Kayleigh, advisor; Wilson, Ander, committee member; Cooley, Daniel, committee member; Neophytou, Andreas, committee member
dc.description.abstract: Exposure to airborne pollutants, both long- and short-term, can lead to harmful respiratory, cardiovascular, and cardiometabolic outcomes. Multiple challenges arise in the study of relationships between ambient air pollution and health outcomes. For example, in large observational cohort studies, individual measurements are not feasible, so researchers use small sets of pollutant concentration measurements to predict subject-level exposures. As a second example, inconsistent compliance of subjects to their assigned treatments can affect results from randomized controlled trials of environmental interventions. In this dissertation, we present methods to address these challenges. We develop a penalized regression model that can predict particulate matter exposures in space and time, including penalties to discourage overfitting and encourage smoothness in time. This model is more accurate than spatial-only and spatiotemporal universal kriging (UK) models when the exposures are missing in a regular (semi-daily) pattern. Our penalized regression model is also faster than both UK models, allowing the use of bootstrap methods to account for measurement-error bias and monitor site selection in a two-stage health model. We introduce methods to estimate causal effects in a longitudinal setting, stratified by latent "at-the-time" principal strata. We implement an array of linear mixed models on data subsets, each with weights derived from principal scores. In addition, we estimate the same stratified causal effects with a Bayesian mixture model. The weighted linear mixed models outperform the Bayesian mixture model and an existing single-measure principal scores method in all simulation scenarios, and are the only method to produce a significant estimate for a causal effect of treatment assignment by strata when applied to a Honduran cookstove intervention study.
Finally, we extend the "at-the-time" longitudinal principal stratification framework to a setting where continuous exposure measurements are the post-treatment variable by which the latent strata are defined. We categorize the continuous exposures to a binary variable in order to use our previous method of weighted linear mixed models. We also extend an existing Bayesian approach to the longitudinal setting, which does not require categorization of the exposures. The previous weighted linear mixed model and single-measure principal scores methods are negatively biased when applied to simulated samples, while the Bayesian approach produces the lowest RMSE and bias near zero. The Bayesian approach, when applied to the same Honduran cookstove intervention study as before, does not find a significant estimate for the causal effect of treatment assignment by strata.
dc.description: 2023 Summer.; Includes bibliographical references.
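A penalized regression with a ridge penalty to discourage overfitting and a first-difference penalty to encourage smoothness in time, in the spirit of the exposure model described above, has a closed-form solution. Everything below (one spatial basis function, two monitors per week, the tuning values) is a made-up minimal example, not the dissertation's model:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical setup: T weekly coefficients, two monitor readings per week.
T = 30
B = np.kron(np.eye(T), np.ones((2, 1)))   # maps weekly coefficients to readings
beta_true = np.sin(np.linspace(0, 3, T))  # smooth underlying exposure signal
y = B @ beta_true + rng.normal(scale=0.3, size=B.shape[0])

D = np.diff(np.eye(T), axis=0)            # first-difference operator in time
lam_ridge, lam_smooth = 0.1, 5.0          # assumed tuning values

# Penalized least squares: (B'B + lam1 I + lam2 D'D) beta = B'y
A = B.T @ B + lam_ridge * np.eye(T) + lam_smooth * D.T @ D
beta_hat = np.linalg.solve(A, B.T @ y)
```

Because the solve is a single linear system, refitting inside a bootstrap loop (as the abstract describes for measurement-error correction) stays cheap relative to kriging.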
2023-01-01T00:00:00Z

https://hdl.handle.net/10217/236940
dc.title: Application of statistical and deep learning methods to power grids
dc.contributor.author: Rimkus, Mantautas, author; Kokoszka, Piotr, advisor; Wang, Haonan, advisor; Nielsen, Aaron, committee member; Cooley, Dan, committee member; Chen, Haonan, committee member
dc.description.abstract: The structure of power flows in transmission grids is evolving and is likely to change significantly in the coming years due to the rapid growth of renewable energy generation that introduces randomness and bidirectional power flows. Another transformative aspect is the increasing penetration of various smart-meter technologies. Inexpensive measurement devices can be placed at practically any component of the grid. As a result, traditional fault detection methods may no longer be sufficient. Consequently, there is a growing interest in developing new methods to detect power grid faults. Using model data, we first propose a two-stage procedure for detecting a fault in a regional power grid. In the first stage, a fault is detected in real time. In the second stage, the faulted line is identified with a negligible delay. The approach uses only the voltage modulus measured at buses (nodes of the grid) as the input. Our method does not require prior knowledge of the fault type. We further explore fault detection based on high-frequency data streams that are becoming available in modern power grids. Our approach can be treated as an online (sequential) change point monitoring methodology. However, due to the mostly unexplored and very nonstandard structure of high-frequency power grid streaming data, substantial new statistical development is required to make this methodology practically applicable. The work includes development of scalar detectors based on multichannel data streams, determination of data-driven alarm thresholds and investigation of the performance and robustness of the new tools. Due to a reasonably large database of faults, we can calculate frequencies of false and correct fault signals, and recommend implementations that optimize these empirical success rates. Next, we extend our proposed method for fault localization in a regional grid for scenarios where partial observability limits the available data. 
While classification methods have been proposed for fault localization, their effectiveness depends on the availability of labeled data, which is often impractical in real-life situations. Our approach bridges the gap between partial and full observability of the power grid. We develop efficient fault localization methods that can operate effectively even when only a subset of power grid bus data is available. This work contributes to the research area of fault diagnosis in scenarios where the number of available phasor measurement unit devices is smaller than the number of buses in the grid. We propose using Graph Neural Networks in combination with statistical fault localization methods to localize faults in a regional power grid with minimal available data. Our contribution to the field of fault localization aims to enable the adoption of effective fault localization methods for future power grids.
dc.description: 2023 Summer.; Includes bibliographical references.
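One classical scalar detector for online (sequential) change-point monitoring of a single stream is the CUSUM statistic. The sketch below is a generic illustration of that idea on a simulated voltage-modulus stream; the stream, shift size, and tuning constants `k` and `h` are all hypothetical, and the dissertation's detectors are more elaborate multichannel versions with data-driven thresholds.

```python
import numpy as np

def cusum_detector(stream, mean0, k, h):
    """One-sided CUSUM: raise an alarm when the cumulative drift
    statistic exceeds threshold h (k is the allowance/slack)."""
    s, alarms = 0.0, []
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - mean0 - k))
        if s > h:
            alarms.append(t)
            s = 0.0  # restart monitoring after an alarm
    return alarms

rng = np.random.default_rng(3)
# Hypothetical stream: in-control around 1.0, then a level shift at t = 200.
stream = np.concatenate([rng.normal(1.0, 0.1, 200),
                         rng.normal(1.6, 0.1, 100)])
alarms = cusum_detector(stream, mean0=1.0, k=0.2, h=3.0)
print(alarms[0])  # first alarm shortly after the change point
```

The allowance `k` suppresses false alarms from in-control noise, while `h` trades detection delay against the false-alarm rate, which is the same trade-off the abstract's empirical success rates quantify.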
2023-01-01T00:00:00Z

https://hdl.handle.net/10217/236688
dc.title: Causality and clustering in complex settings
dc.contributor.author: Gibbs, Connor P., author; Keller, Kayleigh, advisor; Fosdick, Bailey, advisor; Koslovsky, Matthew, committee member; Kaplan, Andee, committee member; Anderson, Brooke, committee member
dc.description.abstract: Causality and clustering are at the forefront of many problems in statistics. In this dissertation, we present new methods and approaches for drawing causal inference with temporally dependent units and clustering nodes in heterogeneous networks. To begin, we investigate the causal effect of a timeout on stopping an opposing team's run in the National Basketball Association (NBA). After formalizing the notion of a run in the NBA and in light of the temporal dependence among runs, we define the units under study with careful consideration of the stable unit-treatment-value assumption pertinent to the Rubin causal model. After introducing a novel, interpretable outcome based on the score difference, we conclude that while comebacks frequently occur after a run, it is slightly disadvantageous to call a timeout during a run by the opposing team. Further, we demonstrate that the magnitude of this effect varies by franchise, lending clarity to an oft-debated topic among sports fans. Next, we represent the known relationships among and between genetic variants and phenotypic abnormalities as a heterogeneous network and introduce a novel analytic pipeline to identify clusters containing undiscovered gene-to-phenotype relations (ICCUR) from the network. ICCUR identifies, scores, and ranks small heterogeneous clusters according to their potential for future discovery in a large temporal biological network. We train an ensemble model of boosted regression trees to predict clusters' potential for future discovery using observable cluster features, and show the resulting clusters contain significantly more undiscovered gene-to-phenotype relations than expected by chance.
To demonstrate its use as a diagnostic aid, we apply the results of the ICCUR pipeline to real, undiagnosed patients with rare diseases, identifying clusters containing patients' co-occurring yet otherwise unconnected genotypic and phenotypic information, some of which have since been validated by human curation. Motivated by ICCUR and its application, we introduce a novel method called ECoHeN (pronounced "eco-hen") to extract communities from heterogeneous networks in a statistically meaningful way. Using a heterogeneous configuration model as a reference distribution, ECoHeN identifies communities that are significantly more densely connected than expected given the node types and connectivity of its membership without imposing constraints on the type composition of the extracted communities. The ECoHeN algorithm identifies communities one at a time through a dynamic set of iterative updating rules and is guaranteed to converge. To our knowledge, this is the first discovery method that distinguishes and identifies both homogeneous and heterogeneous, possibly overlapping, community structure in a network. We demonstrate the performance of ECoHeN through simulation and in application to a political blogs network to identify collections of blogs which reference one another more than expected given the ideology of their members. Along with small partisan communities, we demonstrate ECoHeN's ability to identify a large, bipartisan community undetectable by canonical community detection methods and denser than those found by modern, competing methods.
dc.description: Includes bibliographical references.; 2023 Spring.
2023-01-01T00:00:00Z

https://hdl.handle.net/10217/236013
dc.title: Randomization tests for experiments embedded in complex surveys
dc.contributor.author: Brown, David A., author; Breidt, F. Jay, advisor; Sharp, Julia, committee member; Zhou, Tianjian, committee member; Ogle, Stephen, committee member
dc.description.abstract: Embedding experiments in complex surveys has become increasingly important. For scientific questions, such embedding allows researchers to take advantage of both the internal validity of controlled experiments and the external validity of probability-based samples of a population. Within survey statistics, declining response rates have led to the development of new methods, known as adaptive and responsive survey designs, that try to increase or maintain response rates without negatively impacting survey quality. Such methodologies are assessed experimentally. Examples include a series of embedded experiments in the 2019 Triennial Community Health Survey (TCHS), conducted by the Health District of Northern Larimer County in collaboration with the Department of Statistics at Colorado State University, to determine the effects of monetary incentives, targeted mailing of reminders, and double-stuffed envelopes (including both English and Spanish versions of the survey) on response rates, cost, and representativeness of the sample. This dissertation develops methodology and theory of randomization-based tests embedded in complex surveys, assesses the methodology via simulation, and applies the methods to data from the 2019 TCHS. An important consideration in experiments to increase response rates is the overall balance of the sample, because higher overall response might still underrepresent important groups. There have been advances in recent years on methods to assess the representativeness of samples, including application of the dissimilarity index (DI) to help evaluate the representativeness of a sample under the different conditions in an incentive experiment (Biemer et al. [2018]). We develop theory and methodology for design-based inference for the DI when used in a complex survey. 
Simulation studies show that the linearization method has good properties, with good confidence interval coverage even in cases when the true DI is close to zero, even though point estimates may be biased. We then develop a class of randomization tests for evaluating experiments embedded in complex surveys. We consider a general parametric contrast, estimated using the design-weighted Narain-Horvitz-Thompson (NHT) approach, in either a completely randomized design or a randomized complete block design embedded in a complex survey. We derive asymptotic normal approximations for the randomization distribution of a general contrast, from which critical values can be derived for testing the null hypothesis that the contrast is zero. The asymptotic results are conditioned on the complex sample, but we include results showing that, under mild conditions, the inference extends to the finite population. Further, we develop asymptotic power properties of the tests under moderate conditions. Through simulation, we illustrate asymptotic properties of the randomization tests and compare the normal approximations of the randomization tests with corresponding Monte Carlo tests, with a design-based test developed by van den Brakel, and with randomization tests developed by Fisher-Pitman-Welch and Neyman. The randomization approach generalizes broadly to other kinds of embedded experimental designs and null hypothesis testing problems, for very general survey designs. The randomization approach is then extended from NHT estimators to generalized regression estimators that incorporate auxiliary information, and from linear contrasts to comparisons of nonlinear functions.
dc.description: Includes bibliographical references.; 2022 Fall.
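The randomization-test logic described above can be illustrated generically: compute a design-weighted contrast between treatment arms, then build its randomization distribution by re-randomizing the treatment assignment within the fixed sample. The data, weights, and effect size below are hypothetical, and this sketch uses a Monte Carlo null rather than the dissertation's asymptotic normal approximation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical embedded experiment: n sampled units with survey design
# weights w, a balanced binary treatment a, and a response y.
n = 300
w = rng.uniform(1, 10, size=n)                  # design weights
a = rng.permutation(np.repeat([0, 1], n // 2))  # completely randomized design
y = rng.normal(size=n) + 0.4 * a                # assumed treatment effect 0.4

def weighted_contrast(y, a, w):
    # Difference of weighted (Hajek-type) means between the two arms.
    return (np.sum(w * y * a) / np.sum(w * a)
            - np.sum(w * y * (1 - a)) / np.sum(w * (1 - a)))

obs = weighted_contrast(y, a, w)

# Randomization distribution: re-randomize treatment over the fixed sample.
null = np.array([weighted_contrast(y, rng.permutation(a), w)
                 for _ in range(2000)])
p_value = np.mean(np.abs(null) >= abs(obs))
print(round(p_value, 3))
```

Conditioning on the realized sample and permuting only the experimental assignment is what makes the inference design-based, mirroring the conditional asymptotics in the abstract.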
2022-01-01T00:00:00Z

https://hdl.handle.net/10217/235753
dc.title: The pooling of prior distributions via logarithmic and supra-Bayesian methods with application to Bayesian inference in deterministic simulation models
dc.contributor.author: Roback, Paul J., author; Givens, Geof, advisor; Hoeting, Jennifer, committee member; Howe, Adele, committee member; Tweedie, Richard, committee member
dc.description.abstract: We consider Bayesian inference when priors and likelihoods are both available for inputs and outputs of a deterministic simulation model. Deterministic simulation models are used frequently by scientists to describe natural systems, and the Bayesian framework provides a natural vehicle for incorporating uncertainty in a deterministic model. The problem of making inference about parameters in deterministic simulation models is fundamentally related to the issue of aggregating (i.e., pooling) expert opinion. Alternative strategies for aggregation are surveyed and four approaches are discussed in detail: logarithmic pooling, linear pooling, French-Lindley supra-Bayesian pooling, and Lindley-Winkler supra-Bayesian pooling. The four pooling approaches are compared with respect to three suitability factors: theoretical properties, performance in examples, and the selection and sensitivity of hyperparameters or weightings incorporated in each method. The logarithmic pool is found to be the most appropriate pooling approach when combining expert opinions in the context of deterministic simulation models. We develop an adaptive algorithm for estimating log pooled priors for parameters in deterministic simulation models. Our adaptive estimation approach relies on importance sampling methods, density estimation techniques for which we numerically approximate the Jacobian, and nearest neighbor approximations in cases in which the model is noninvertible. This adaptive approach is compared to a nonadaptive approach over several examples ranging from a relatively simple R1 → R1 example with normally distributed priors and a linear deterministic model, to a relatively complex R2 → R2 example based on the bowhead whale population model. In each case, our adaptive approach leads to better and more efficient estimates of the log pooled prior than the nonadaptive estimation algorithm.
Finally, we extend our inferential ideas to a higher-dimensional, realistic model for AIDS transmission. Several unique contributions to the statistical discipline are contained in this dissertation, including: 1. the application of logarithmic pooling to inference in deterministic simulation models; 2. the algorithm for estimating log pooled priors using an adaptive strategy; 3. the Jacobian-based approach to density estimation in this context, especially in higher dimensions; 4. the extension of the French-Lindley supra-Bayesian methodology to continuous parameters; 5. the extension of the Lindley-Winkler supra-Bayesian methodology to multivariate parameters; and, 6. the proofs and illustrations of the failure of Relative Propensity Consistency under the French-Lindley supra-Bayesian approach.
dc.description: 1998 Summer.; Includes bibliographic references.; Covers not scanned.; Print version deaccessioned 2022.
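Logarithmic pooling combines expert prior densities as a weighted geometric mean, p(θ) ∝ ∏ pᵢ(θ)^{wᵢ}, renormalized to integrate to one. A minimal grid-based sketch with two hypothetical normal priors and equal weights (not the dissertation's adaptive estimation algorithm):

```python
import numpy as np

# Grid over the parameter of interest.
theta = np.linspace(-10, 10, 2001)
d_theta = theta[1] - theta[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two hypothetical expert priors and equal pooling weights.
p1 = normal_pdf(theta, 0.0, 1.0)
p2 = normal_pdf(theta, 2.0, 2.0)
w1, w2 = 0.5, 0.5

# Logarithmic pool: weighted geometric mean, renormalized to a density.
pooled = p1 ** w1 * p2 ** w2
pooled /= pooled.sum() * d_theta

# For normal priors the log pool is again normal, with mode between
# the two prior means (here at 0.4 by the precision-weighted average).
mode = theta[np.argmax(pooled)]
print(round(mode, 2))
```

Unlike the linear pool (a mixture), the log pool is unimodal for unimodal inputs and rewards consensus, one reason it can be preferable for combining expert opinion.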
1998-01-01T00:00:00Z

https://hdl.handle.net/10217/235720
dc.title: Transformed-linear models for time series extremes
dc.contributor.author: Mhatre, Nehali, author; Cooley, Daniel, advisor; Kokoszka, Piotr, committee member; Shaby, Benjamin, committee member; Wang, Tianyang, committee member
dc.description.abstract: In order to capture the dependence in the upper tail of a time series, we develop nonnegative regularly-varying time series models that are constructed similarly to classical non-extreme ARMA models. Rather than fully characterizing tail dependence of the time series, we define the concept of weak tail stationarity which allows us to describe a regularly-varying time series through the tail pairwise dependence function (TPDF), a measure of pairwise extremal dependence. We state consistency requirements among the finite-dimensional collections of the elements of a regularly-varying time series and show that the TPDF's value does not depend on the dimension being considered. So that our models take nonnegative values, we use transformed-linear operations. We show existence and stationarity of these models, and develop their properties, such as their TPDFs. Motivated by investigating conditions conducive to the spread of wildfires, we fit models to hourly windspeed data using a preliminary estimation method and find that the fitted transformed-linear models produce better estimates of upper tail quantities than traditional ARMA models or classical linear regularly-varying models. The innovations algorithm is a classical recursive algorithm used in time series analysis. We develop an analogous transformed-linear innovations algorithm for our time series models that allows us to perform prediction, which is fundamental to any time series analysis. The transformed-linear innovations algorithm also enables us to estimate parameters of the transformed-linear regularly-varying moving average models, thus providing a tool for modeling. We construct an inner product space of transformed-linear combinations of nonnegative regularly-varying random variables and prove its link to a Hilbert space, which allows us to employ the projection theorem.
We develop the transformed-linear innovations algorithm using the properties of the projection theorem. Turning to the class of MA(∞) models, we discuss estimation and show that this class is dense in the class of possible TPDFs. We also develop an extremes analogue of the classical Wold decomposition. A simulation study shows that our class provides adequate models for data generated from a GARCH model and from another model outside our class. The transformed-linear innovations algorithm provides predictions, and we also develop prediction intervals based on the geometry of regular variation; a simulation study shows that these intervals achieve good coverage rates for prediction errors. We perform modeling and prediction for the hourly windspeed data by applying the innovations algorithm to the estimated TPDF.
dc.description: 2022 Summer.; Includes bibliographical references.
2022-01-01T00:00:00ZTopics in estimation for messy surveys: imperfect matching and nonprobability samplingHuang, Chien-Min, authorBreidt, F. Jay, advisorWang, Haonan, committee memberKeller, Joshua, committee memberPallickara, Sangmi, committee memberhttps://hdl.handle.net/10217/2357042023-08-31T21:17:34Z2022-01-01T00:00:00Zdc.title: Topics in estimation for messy surveys: imperfect matching and nonprobability sampling
dc.contributor.author: Huang, Chien-Min, author; Breidt, F. Jay, advisor; Wang, Haonan, committee member; Keller, Joshua, committee member; Pallickara, Sangmi, committee member
dc.description.abstract: Two problems in estimation for "messy" surveys are addressed, both requiring the combination of survey data with other data sources. The first estimation problem involves the combination of survey data with auxiliary data, when the matching of the two sources is imperfect. Model-assisted survey regression estimators combine auxiliary information available at a population level with complex survey data to estimate finite population parameters. Many prediction methods, including linear and mixed models, nonparametric regression, and machine learning techniques, can be incorporated into such model-assisted estimators. These methods assume that observations obtained for the sample can be matched without error to the auxiliary data. We investigate properties of estimators that rely on matching algorithms that do not in general yield perfect matches. We focus on difference estimators, which are exactly unbiased under perfect matching but not under imperfect matching. The methods are investigated analytically and via simulation, using a study of recreational angling in South Carolina to build a simulation population. In this study, the survey data come from a stratified, two-stage sample and the auxiliary data from logbooks filed by boat captains. Extensions to multiple frame estimators under imperfect matching are discussed. The second estimation problem involves the combination of survey data from a probability sample with additional data from a nonprobability sample. The problem is motivated by an application in which field crews are allowed to use their judgment in selecting part of a sample. Many surveys are conducted in two or more stages, with the first stage of primary sampling units dedicated to screening for secondary sampling units of interest, which are then measured or subsampled. 
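The difference estimator described above, exactly unbiased under perfect matching, can be sketched in a few lines. This is a hedged illustration with a fabricated toy population under simple random sampling (the working model m(x) = 2x, the sizes, and all names are assumptions, not the South Carolina angling study):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy finite population: auxiliary x known for every unit, y observed in sample.
N, n = 1000, 100
x = rng.uniform(0, 10, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)
true_total = y.sum()

def difference_estimator(sample_idx, match_idx):
    """Model-assisted difference estimator of the population total under SRS.

    T_hat = sum_U m(x_i) + sum_{i in s} (y_i - m(x_{match(i)})) / pi_i,
    with working model m(x) = 2x and pi_i = n/N.  With perfect matching
    (match_idx == sample_idx) the estimator is exactly unbiased for any
    fixed m; pairing y_i with the wrong unit's prediction breaks that
    argument, which is the imperfect-matching problem studied here.
    """
    m = 2.0 * x                      # predictions for the whole population
    pi = n / N
    return m.sum() + np.sum((y[sample_idx] - m[match_idx]) / pi)

# Perfect matching: average estimate over repeated samples is near the truth.
ests = []
for _ in range(500):
    s = rng.choice(N, size=n, replace=False)
    ests.append(difference_estimator(s, s))
print(true_total, np.mean(ests))
```

Note that the correction term uses design weights 1/pi, so the estimator stays unbiased even when the working model m is wrong, as long as each sampled y is matched to its own prediction.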
The Large Pelagics Intercept Survey, conducted by the United States National Marine Fisheries Service, draws a probability sample of fishing access site-days in the first stage and screens for relatively rare fishing trips that target pelagic species (tuna, sharks, billfish, etc.). Many site-days yield no pelagic trips. Motivated by this low yield, we consider surveys that allow expert judgment in the selection of some site-days. This nonprobability judgment sample is combined with a probability sample to generate likelihood-based estimates of inclusion probabilities and estimators of population totals that are related to dual-frame estimators. Consistency and asymptotic normality of the estimators are established under the correct specification of the model for judgment behavior. An extensive simulation study shows the robustness of the methodology to misspecification of the judgment behavior. A standard variance estimator, readily available in statistical software, yields stable estimates with small negative bias and good confidence interval coverage. Across a range of conditions, the proposed strategy that allows for some judgment dominates the classic strategy of pure probability sampling with known design weights. The methodology is extended to a doubly-robust version that uses both a propensity model for judgment selection probabilities and a regression model for study variable characteristics. If either model is correctly specified, the doubly-robust estimator is unbiased. The dual-frame methodology for samples incorporating expert judgment is then extended to two other nonprobability settings: respondent-driven sampling and biased-frame sampling.
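The doubly-robust idea mentioned at the end can be sketched generically: pair an estimated propensity of judgment selection with an outcome regression, so the total is recovered if either model is correctly specified. This is a simplified sketch under stated assumptions (the auxiliary variable is taken as known for the whole population rather than obtained via a separate probability sample, selection depends on the covariate only, and all models and names are illustrative, not the Large Pelagics Intercept Survey analysis):

```python
import numpy as np

rng = np.random.default_rng(7)

# Fabricated population with covariate x known everywhere.
N = 5000
x = rng.uniform(0, 10, size=N)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5, size=N)
true_total = y.sum()

# Judgment-style selection favors high-x units (nonprobability sample).
true_beta = np.array([-4.0, 0.5])
p = 1 / (1 + np.exp(-(true_beta[0] + true_beta[1] * x)))
selected = rng.random(N) < p
s = np.where(selected)[0]

# Step 1: propensity model for selection, fit by logistic regression (IRLS).
X = np.column_stack([np.ones(N), x])
b = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ b)))
    W = mu * (1 - mu)
    b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (selected - mu))
pi_hat = 1 / (1 + np.exp(-(X @ b)))

# Step 2: outcome regression fit on the selected units only.
coef = np.linalg.lstsq(X[s], y[s], rcond=None)[0]
m_hat = X @ coef

# Doubly-robust total: regression predictions for everyone, plus a
# propensity-weighted correction from the selected units' residuals.
t_dr = m_hat.sum() + np.sum((y[s] - m_hat[s]) / pi_hat[s])
print(true_total, t_dr)
```

Here both models happen to be correctly specified, so the estimate lands close to the truth; the doubly-robust property is that misspecifying either one of the two models, but not both, still leaves the estimator unbiased.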
dc.description: 2022 Summer.; Includes bibliographical references.
2022-01-01T00:00:00Z