Statistics & Probability Flashcards

Free Statistics & Probability flashcards, exportable to Notion

Learn faster with 44 Statistics & Probability flashcards. One-click export to Notion.

Learn fast, memorize everything, master Statistics & Probability. No credit card required.

Want to create flashcards from your own textbooks and notes?

Let AI automatically create flashcards from your own textbooks and notes. Upload your PDF, select the pages you want to memorize fast, and let AI do the rest. One-click export to Notion.

Create Flashcards from my PDFs

Statistics & Probability

44 flashcards

Descriptive statistics are methods for summarizing and describing datasets in a compact and informative way, such as calculating measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard deviation).
Inferential statistics are methods for making inferences and predictions about a population from a sample dataset, usually involving hypothesis testing, confidence intervals, and statistical modeling.
The normal distribution is a continuous probability distribution that is symmetric and bell-shaped. It is commonly used to model many natural phenomena and forms the basis of many statistical methods.
The central limit theorem states that the sum (or average) of many independent and identically distributed random variables with finite variance will tend towards a normal distribution, regardless of the underlying distribution, as the number of variables increases.
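A minimal NumPy sketch of this idea; the library choice, sample sizes, and the use of uniform variables are illustrative assumptions, not part of the flashcard:

```python
# Sums of uniform random variables (clearly non-normal) become approximately normal.
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_sums = 30, 10_000

# Each row sums 30 independent Uniform(0, 1) draws.
sums = rng.uniform(0, 1, size=(n_sums, n_vars)).sum(axis=1)

# Theory: mean = 30 * 0.5 = 15, variance = 30 * (1/12) = 2.5.
print(sums.mean(), sums.var())   # should be close to 15 and 2.5
```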
A t-test is a statistical hypothesis test that determines if there is a significant difference between the means of two groups or samples. It is used when the sample size is small and/or the population variance is unknown.
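A short illustration with SciPy on simulated groups; the data, sample sizes, and the choice of Welch's unequal-variance variant are assumptions made for the example:

```python
# Two-sample t-test on small simulated samples (Welch's variant,
# which does not assume equal variances).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=15)
group_b = rng.normal(loc=11.5, scale=2.0, size=15)

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```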
ANOVA (Analysis of Variance) is a collection of statistical models and their associated procedures used to analyze the differences among group means in a sample. It tests if the means of two or more groups are significantly different from each other.
Regression analysis is a statistical method for estimating the relationship between a dependent variable and one or more independent variables. It is widely used for prediction, forecasting, and causal inference.
Correlation is a statistical measure that describes the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, with 0 indicating no correlation.
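A quick NumPy sketch that covers both cards above: fitting a line by least squares and computing Pearson's r on synthetic data (the slope, intercept, and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=50)   # true slope 3, intercept 5

slope, intercept = np.polyfit(x, y, deg=1)   # simple linear regression
r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```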
A confidence interval is a range of values that is likely to contain an unknown population parameter with a certain level of confidence, based on the sample data. It quantifies the uncertainty associated with a sampling process.
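One common form is the t-based interval for a population mean with unknown variance, where x̄ is the sample mean, s the sample standard deviation, n the sample size, and 1 − α the chosen confidence level:

```latex
\bar{x} \;\pm\; t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
```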
A p-value is the probability of obtaining a result at least as extreme as the observed data, assuming the null hypothesis is true. It is used in hypothesis testing to determine the statistical significance of results.
Probability is the study of the likelihood of events occurring, while statistics is the study of collecting, organizing, analyzing, and interpreting data to make inferences and decisions.
Bayes' theorem describes the probability of an event occurring based on prior knowledge of conditions that might be related to the event. It is a fundamental principle of Bayesian statistics and probabilistic reasoning.
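In symbols, for an event A and observed evidence B:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
P(B) \;=\; P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
```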
A random variable is a variable that can take on different values associated with some probability distribution. Random variables are used to model and quantify uncertainty in probability theory and statistics.
A parameter is a numerical characteristic of a population, while a statistic is a numerical characteristic calculated from sample data. Parameters describe populations, while statistics describe samples.
Simpson's paradox occurs when a trend appears in different groups of data but disappears or reverses when the groups are combined. It highlights the importance of careful data aggregation and interpretation.
The law of large numbers states that as the number of trials in a random experiment increases, the average of the results will converge towards the expected value or mean of the probability distribution.
A sampling distribution is the probability distribution of a statistic (e.g., mean, proportion) calculated from a sample. It describes how the statistic varies across different samples from the same population.
Bootstrapping is a statistical method that involves resampling from an original dataset to estimate the sampling distribution of a statistic. It is useful when parametric assumptions are questionable or difficult to derive analytically.
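A minimal NumPy sketch estimating the standard error of the sample median; the skewed synthetic data and the 5,000 resamples are illustrative choices:

```python
# Bootstrap: resample the data with replacement many times and look at
# how the statistic (here, the median) varies across resamples.
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=100)   # synthetic, skewed sample

boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5_000)
])

print("sample median:", np.median(data))
print("bootstrap standard error:", boot_medians.std(ddof=1))
```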
A type I error, or false positive, occurs when the null hypothesis is incorrectly rejected, even though it is true. It represents an incorrect positive conclusion about the effect or relationship being tested.
A type II error, or false negative, occurs when a false null hypothesis is not rejected. It represents a failure to detect an effect or relationship that is present.
A chi-squared test is a statistical hypothesis test used to determine if there is a significant difference between the observed frequencies or proportions and the expected frequencies or proportions for categorical data.
A contingency table is a way of summarizing and displaying the frequency distribution of variables, especially for categorical data. It is used in chi-squared tests and other analyses of categorical data.
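A short SciPy example running a chi-squared test of independence on a small contingency table; the 2x2 counts are invented for illustration:

```python
# Chi-squared test of independence on a 2x2 contingency table
# (rows: treatment/control, columns: outcome yes/no).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 70],    # treatment: 30 yes, 70 no
                  [45, 55]])   # control:   45 yes, 55 no

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```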
Logistic regression is a statistical method used to model and analyze the relationship between one or more independent variables and a binary (yes/no) dependent variable. It is widely used in classification problems.
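A compact scikit-learn sketch on synthetic data; the two-predictor setup and the true coefficients are assumptions made for the example:

```python
# Logistic regression on a simulated binary-outcome dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))                      # two predictors
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("P(y=1 | x=[1, 0]):", model.predict_proba([[1.0, 0.0]])[0, 1])
```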
The exponential distribution is a continuous probability distribution that models the time between events in a Poisson process, i.e., a process in which events occur continuously and independently at a constant rate.
The Poisson distribution is a discrete probability distribution that models the number of events occurring in a fixed time or space interval, given a known constant mean rate and independent occurrences.
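The two distributions in symbols, where λ is the mean event rate: the Poisson probability mass function for the event count X, and the exponential density for the waiting time t between events:

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots
\qquad\qquad
f(t) = \lambda e^{-\lambda t}, \quad t \ge 0
```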
A stem-and-leaf plot is a graphical technique for displaying numerical data in a compact form, where each data value is split into a "stem" (the leading digits) and a "leaf" (the trailing digits).
Independent events are events whose outcomes do not influence or affect each other, while dependent events are events where the outcome of one event influences or affects the probability of the other event occurring.
A Monte Carlo simulation is a computational technique that uses repeated random sampling to obtain numerical results and estimate the possible outcomes of a process or system with inherent uncertainty or randomness.
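A classic minimal example: estimating π from random points in the unit square (NumPy and the sample size are illustrative choices):

```python
# Estimate pi from the fraction of random points that fall inside
# the quarter circle of radius 1.
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x, y = rng.uniform(size=n), rng.uniform(size=n)

pi_estimate = 4 * np.mean(x**2 + y**2 <= 1.0)
print(pi_estimate)   # close to 3.1416; error shrinks roughly like 1/sqrt(n)
```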
Parametric tests make assumptions about the underlying probability distribution of the data (e.g., normality), while non-parametric tests do not make such assumptions and are more robust to outliers and non-normal distributions.
The law of total probability is a fundamental rule in probability theory that expresses the probability of an event as a weighted sum of its conditional probabilities given a set of mutually exclusive and exhaustive events that partition the sample space.
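In symbols, for events B1, B2, … that partition the sample space:

```latex
P(A) \;=\; \sum_{i} P(A \mid B_i)\, P(B_i)
```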
Markov's inequality provides an upper bound on the probability that a non-negative random variable is greater than or equal to some positive constant, in terms of the variable's expected value.
Chebyshev's inequality provides an upper bound on the probability that a random variable deviates from its mean by at least a given number of standard deviations, regardless of the variable's probability distribution.
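Both inequalities in symbols, for a random variable X with mean μ and standard deviation σ, and positive constants a and k:

```latex
\text{Markov:}\quad P(X \ge a) \;\le\; \frac{\mathbb{E}[X]}{a} \quad (X \ge 0)
\qquad\qquad
\text{Chebyshev:}\quad P\big(|X - \mu| \ge k\sigma\big) \;\le\; \frac{1}{k^2}
```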
The empirical rule states that for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
A z-score, or standard score, is a measure of how many standard deviations a given data point is away from the mean of a distribution. It allows for comparison across different distributions with different means and standard deviations.
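A small NumPy check of both ideas on simulated normal data; the mean, standard deviation, and sample size are arbitrary choices for the example:

```python
# Standardize a sample to z-scores and check the empirical rule
# (fractions within 1, 2, and 3 standard deviations of the mean).
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=100, scale=15, size=100_000)

z = (data - data.mean()) / data.std()
for k in (1, 2, 3):
    print(f"within {k} sd: {np.mean(np.abs(z) <= k):.3f}")   # ~0.68, 0.95, 0.997
```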
A box plot is a standardized way of displaying the distribution of a dataset based on five summary statistics: the minimum, first quartile, median, third quartile, and maximum. It provides a visual summary of central tendency and dispersion.
Correlation refers to a statistical relationship between two variables, but does not necessarily imply causation. Causation requires additional evidence to establish that one variable causes changes in the other variable.
A time series is a sequence of data points recorded at successive points in time, often at regular intervals. Time series analysis is used to study patterns and trends in data over time.
A population is the complete set of individuals or objects under consideration, while a sample is a subset of the population selected for analysis or study.
Data mining is the process of discovering patterns and extracting useful information from large datasets using techniques from statistics, machine learning, and database systems.
The central dogma of statistics states that effects must have plausible causes, and that correlation does not imply causation. It emphasizes the importance of experimental design and causal inference.
A quantitative variable is a variable that can take on numerical values and be measured, while a qualitative variable is a variable that describes categories or attributes and cannot be measured numerically.
Categorical data are values that can be sorted into groups or categories, while continuous data are values that can take on any numerical value within a range.
Univariate analysis involves the analysis of a single variable, while multivariate analysis involves the analysis of multiple variables simultaneously and their relationships.
The principle of maximum likelihood is a method for estimating the parameters of a statistical model by finding the parameter values that make the observed data most likely under the assumed model.
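A minimal sketch for a Bernoulli (coin-flip) model: maximize the log-likelihood numerically with SciPy and compare with the closed-form estimate. The true probability of 0.3 and the sample size are assumptions for the example:

```python
# Maximum likelihood for a coin's heads probability.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
flips = rng.uniform(size=200) < 0.3        # simulated coin with p = 0.3
heads = flips.sum()
n = flips.size

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood for heads out of n flips.
    return -(heads * np.log(p) + (n - heads) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", result.x)          # should match heads / n
print("closed form:  ", heads / n)
```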