r/statistics • u/Personal-Trainer-541 • 1h ago

Education [E] Dirichlet Distribution - Explained

• Upvotes

Hi there,

I've created a video here where I explain the Dirichlet distribution, which is a powerful tool in Bayesian statistics for modeling probabilities across multiple categories, extending the Beta distribution to more than two outcomes.

I hope it may be of use to some of you out there. Feedback is more than welcome! :)

0 comments

r/statistics • u/Dull-Song2470 • 4h ago

Question [Question] Algorithm to update variance calculation data point by data point?

1 Upvotes

I'm currently trying to collect data inside of a program that is not set up to keep track of an arbitrary number of variables, but I still want to analyze the probability distribution of a series of observations within the program. Calculating the mean of the observations is easy; I set up one variable to track the most recent observation, one variable to track the sum of observations so far, and one variable to track the number of observations so far; when observations stop coming in, I can then just divide the sum by n. But calculating the variance is trickier. I can set up a variable to keep track of the first observation, another for the second observation, and another for the third observation, but then if a fourth observation comes in when I was expecting three observations, I don't have a way of accounting for it. Is there some way that I can do something like calculate the variance initially when there are four or five observations, then update it to account for new information when a new data point comes in, without having to keep track of every individual data point that came before?
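One standard answer to this kind of question is Welford's online algorithm, which keeps only three numbers (count, running mean, and running sum of squared deviations). The sketch below is a minimal illustration; the class name `RunningStats` and the sample values are made up for the example and are not from the post.

```python
class RunningStats:
    """Running mean and variance using Welford's online update."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running totals."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (x - self.mean)

    def variance(self, sample=True):
        """Sample variance (n - 1) by default, population variance otherwise."""
        denom = self.n - 1 if sample else self.n
        if denom <= 0:
            raise ValueError("not enough observations")
        return self.m2 / denom


# Hypothetical observations arriving one at a time
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Only `n`, `mean`, and `m2` need to be stored, so the number of variables stays fixed no matter how many observations arrive. This form is usually preferred over tracking the sum and the sum of squares because it is less prone to cancellation error.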

5 comments

r/statistics • u/ToeRepresentative627 • 11h ago

Question [Question] What statistical method should I use for my situation?

3 Upvotes

I originally posted on askstatistics, but was told that my question might be too complex, so I thought I'd ask here instead.

I am collecting behavioral data over a period of time, where an instance is recorded every time a behavior occurs. An instance can occur at any time, with some instances happening quickly after one another, and some with gaps in between.

What I want to do is to find clusters of instances that are close enough to one another to be considered separate from the others. Clusters can be of any size, with some clusters containing 20 instances, and some containing only 3.

I have read about cluster analysis, but am unsure how to make it fit my situation. The examples I find involve 2 variables, where my situation only involves counting a single behavior on a timeline. The examples I find also require me to specify my cluster size, but I want my analysis to help determine this for me and involve clusters of different sizes.

The reason is that, in behavioral analysis, it's important to look at the antecedents and consequences of a behavior to determine its function, and for high-frequency behaviors, it is better to look at the antecedents and consequences for an entire cluster of the behavior.

edit:

I was asked to provide more information about my specific problem. Let's say I've been asked to help a patient who engages in trichotillomania (hair pulling disorder, a type of repetitive self-harm behavior). The patient does not know why they do it. It started a few years ago, and they have been unable to stop it. An "instance" is defined as moving their hand to their head and applying enough force to remove at least 1 strand of hair. They do know that there are periods where the behavior occurs less than others (with maybe 1-3 minute gaps between instances), and periods where they do it almost constantly (with 1 second gaps between instances). So we know that these "episodes" are different somehow, but I am unsure how to define what constitutes an "episode".

To help them with this, I decide to do a home/community observation of them for a period of 5 hours, in order to determine the antecedents (triggers) to the episode and consequences (what occurs after the episode ends that explains why it has stopped) to an episode of hair pulling. This is essential to developing an intervention to help reduce or eliminate the behavior for the patient. We need to know when an episode "starts" and when it "ends".

My problem is, what constitutes an "episode"? How close together do a group of instances of the behavior have to be to be included in an episode? How much latency between instances does there need to be before I can confidently say that an instance is part of a new episode? This cannot be done using pure visual analysis. It's not as simple as 50 instances happening within the first hour, then an hour gap, then another 50 instances, where the demarcation between them would be trivial to determine. Instead, the behavior occurs to some degree at all times, making it difficult to determine when old episodes end and new episodes begin. It would be very unhelpful to view the entire 5 hour block as a single "episode". Clearly there are changes, but I don't know how to determine quantifiably where they are.

It's very important to be accurate here because if I determine the start point wrong, then I will identify the wrong trigger, and my intervention will target the wrong thing, and could potentially make the situation worse, which is very bad when the behavior is self-harm. The stakes are high enough to warrant a quantifiable approach here.
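One simple way to make "episode" operational is a threshold on the gap between consecutive instances: any gap longer than the threshold starts a new episode. The sketch below assumes timestamps in seconds and a hypothetical threshold of 10 seconds. Choosing that threshold (for example from a histogram of the log gaps) is the real analytic decision, and this code does not make it for you.

```python
def split_into_episodes(timestamps, max_gap):
    """Group event times into episodes; a gap larger than max_gap starts a new one."""
    episodes = []
    for t in sorted(timestamps):
        if episodes and t - episodes[-1][-1] <= max_gap:
            episodes[-1].append(t)   # close enough to the previous event
        else:
            episodes.append([t])     # first event, or gap too long
    return episodes


# Hypothetical event times in seconds (not data from the post)
events = [0, 1, 2, 3, 95, 200, 201, 202, 203, 204, 400]
episodes = split_into_episodes(events, max_gap=10)
starts_and_ends = [(ep[0], ep[-1]) for ep in episodes]
```

Each episode's first and last timestamp then give candidate points at which to look for antecedents and consequences. If the gaps do not separate cleanly into "short" and "long", a change-point or burst-detection method may be a better fit than a single fixed threshold.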

5 comments

r/statistics • u/[deleted] • 14h ago

Education [Education] Should I learn statistics in the workplace or in academia?

6 Upvotes

I work for a pharmaceutical research company. I am having a hard time trusting the statistics being done here. I’m relatively new to stats so can’t comment on the suitability of the methods being applied, but my partner, who is doing a PhD in statistics, raised concerns. My main concern is that there aren’t many barriers to protect against bad stats. The most senior member seems to be very knowledgeable and very much based in theory, but the other most senior member appears to be self-taught, as they didn’t have formal/extensive training in statistics. I work in the stats department, which is composed of graduates who studied maths and whose stats training mainly came from the senior members of the team. They seem to have been promoted rather quickly too. The training is rather disorganised at times and everyone says something different. I want to do good stats and don’t want to pick up bad habits so early on. I’m interested in pursuing a PhD later down the line once I have a bit more experience, but I’m not sure if I should fast forward this to learn in an institution (academia) that is held more accountable for the quality of statistics. Is it advisable that I stay and learn here?

1 comment

r/statistics • u/MountainNegotiation • 18h ago

Question [Question] Linear Mixed-Effects Model: blocking with random factor with < 5 levels?

5 Upvotes

Hello everyone!

I am writing an academic article, and a part of it is: I am trying to determine if Species richness is driven by Disturbance (fire or clearcutting), Soil Type (Organic or mineral), or a large amount of chemical data from the samples taken from four different forests.

The literature I searched suggested I block/group the samples using forest names as a random factor to control the non-independence of the samples.

One test to do this is Linear Mixed-Effects Models; however, all the literature I have read says that blocking/creating a random factor with < 5 levels is not appropriate.

Thus, can I please have some advice on how to progress?

13 comments

r/statistics • u/largehardoncollider7 • 20h ago

Question [Q] Recommendations for a novice

4 Upvotes

[Question] Hey guys, I’ve just taken my first stats course as part of grad school, and I’m loving it. It’s primarily applied statistics and RStudio; we don’t really delve too deep into derivations, and the course is focused on topics like A/B testing, regression (linear, non-linear, multiple), time series, and so on.

I would love to learn more and am seeking resources for the same! I’m looking at deeper knowledge of applied statistics (rusty on the calculus)

7 comments

r/statistics • u/Friendly-Popper • 1d ago

Education Statistics at Columbia University [E][Q]

3 Upvotes

Hey everyone, I'm interested in majoring in statistics and wanted to ask if anyone has insights on how the statistics undergraduate program is at Columbia University. I've seen posts from many years ago saying to avoid it, so I'm wondering if that still might be the case. All thoughts are appreciated!

2 comments

r/statistics • u/nmolanog • 1d ago

Question [Q] multiple comparison problem in bivariate analysis in observational, exploratory studies.

2 Upvotes

0 comments

r/statistics • u/Cluelessjoint • 1d ago

Question [Q] How do I test if the difference between two averages is significant / not up to chance?

2 Upvotes

For example, if I’m looking at the location with the highest average sales and the one with the lowest average over the past 10 years, how can I statistically determine whether the difference between the two is surprising / not due to chance? ANOVA? T-test?
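For two groups, one common choice is Welch's t-test, which does not assume equal variances. The sketch below computes the Welch statistic and its approximate degrees of freedom with the standard library only. The yearly sales figures are invented for illustration, and the function name `welch_t` is just a label for this example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical yearly average sales for two locations over 10 years
location_high = [120, 130, 125, 140, 135, 128, 132, 138, 127, 131]
location_low = [101, 99, 110, 105, 98, 107, 103, 100, 104, 102]
t_stat, dof = welch_t(location_high, location_low)
```

The statistic is then compared with a t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the p-value directly. One caveat: if the two locations were picked because they are the highest and lowest out of many, the gap is inflated by selection, and a test across all locations (such as ANOVA with a multiple-comparison adjustment) is the more honest comparison.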

8 comments

r/statistics • u/Suspicious_Cat119 • 1d ago

Education [Education] [E] Opinions on chosen Statistics modules

3 Upvotes

Hi everyone, I'm starting a MSc in Statistics at the University of St Andrews in a few weeks. I can pick all the modules I will study myself, and I wanted your opinion on my selection so far.

Semester 1: Applied Statistical Modelling Using GLMs, Markov Chains and Processes, Applied Bayesian Statistics, Independent Study Module (thinking of exploring Digital Signal Processing).

Semester 2: Multivariate Analysis, Advanced Data Analysis, Machine learning for Data Analysis, Statistical Machine Learning.

1 comment

r/statistics • u/Dizzy_Restaurant3874 • 1d ago

Question [Q]: JACC publication stats... Cardiomyopathy related to methamphetamine abuse

0 Upvotes

While reading a paper on Cardiomyopathy related to methamphetamines vs other etiologies, I came across the table. I do not see how there could possibly be a statistical difference between these two sets of values, but there sits p<0.001 - Cardiomyopathy with meth on the left, without meth on the right. The distributions are the same to less than 0.1%. I don't know much about statistics - but I know enough to ask a statistician - these numbers seem to be nearly identical. Is this an error? Link to paper below.

|Length of stay (d)|With meth, n (%)|Without meth, n (%)|p-value|
|:-|:-|:-|:-|
|<3 d|1,037,195 (40.34)|5,098,918.41 (40.39)|<0.001|
|4-6 d|738,610 (28.73)|3,632,147.96 (28.77)| |
|7-9 d|353,964 (13.77)|1,740,210.64 (13.79)| |
|10-12 d|167,402 (6.51)|822,719.36 (6.52)| |
|>12 d|273,942 (10.65)|1,328,752.52 (10.53)| |

https://www.jacc.org/doi/10.1016/j.jacadv.2024.100840
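With counts in the millions, even a difference of a tenth of a percentage point can be statistically significant. The sketch below runs a plain Pearson chi-square test on the counts quoted in the table. It is only an illustration of the sample-size effect: the non-integer counts suggest weighted estimates, and I do not know which test the paper actually used, so this is not a reproduction of its analysis.

```python
import math

# Counts copied from the table in the post (rows: <3 d, 4-6 d, 7-9 d, 10-12 d, >12 d)
with_meth = [1_037_195, 738_610, 353_964, 167_402, 273_942]
without_meth = [5_098_918.41, 3_632_147.96, 1_740_210.64, 822_719.36, 1_328_752.52]


def chi_square_2xk(row_a, row_b):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        exp_a = total_a * col / grand   # expected count under independence
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat


stat = chi_square_2xk(with_meth, without_meth)
# With 5 categories there are 4 degrees of freedom, and for 4 degrees of
# freedom the chi-square upper tail has the closed form exp(-x/2) * (1 + x/2).
p_value = math.exp(-stat / 2) * (1 + stat / 2)
```

Under these assumptions the p-value comes out below 0.001, driven mostly by the >12 d row. So the reported p-value is plausible without being an error; it says the difference is unlikely to be exactly zero, not that it is large enough to matter clinically.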

2 comments

r/statistics • u/tacosaremyreligion • 2d ago

Question [Q] R² and Within R²

5 Upvotes

Hey, I’m running a panel event study with unit and time fixed effects, and my output in RStudio reports both overall R² and “Within R².” I understand the intuition (variance explained after de-meaning by unit/time), but I need a citable source (textbook, methods paper, or official documentation) that formally defines and/or derives Within R².
I'd also appreciate any notes on interpreting Within vs. Overall R² in TWFE event-study specs with leads and lags.

If you have a specific citation or recommendation, I’d really appreciate it.

0 comments

r/statistics • u/al3arabcoreleone • 3d ago

Discussion [D] This is probably one of the most rigorous but straight-to-the-point courses on Linear Regression

100 Upvotes

The Truth About Linear Regression has all a student/teacher needs for a course on perhaps the most misunderstood and most used model in statistics. I wish we had more precise and concise materials on different statistics topics, as there is obviously a growing number of "pseudo" statistics textbooks which claim results that are more or less contentious.

14 comments

r/statistics • u/damn_i_missed • 2d ago

Question [Question] Lectures to couple with Hoel, Port and Stone?

2 Upvotes

I recently started working through Introduction to Probability Theory by Hoel, Port and Stone. I've taken several statistics/biostatistics courses in grad school (7 yrs ago) but they really only covered the formulas without diving much into theory.

Anyway, I was wondering if anyone recommends any particular lectures (e.g. MIT OpenCourseWare) that could work alongside this book. Then I can do the practice problems from the textbook. Thanks!

0 comments

r/statistics • u/The_Orange_Cow • 2d ago

Question [Question] Formula for probability of rolling all sides of a 12 sided die

2 Upvotes

Let's say I have a 12 sided die and want to roll EACH INDIVIDUAL side of the die at least once. What would the formula be for the probability of having rolled all sides of the die at least once after a given total number of rolls? I want to determine something like: after 30 rolls, I'd have an X chance of having rolled each side at least once, where I'm trying to find X.

Thank you for any help in this matter.
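This is the coupon collector setting, and the probability follows from inclusion-exclusion over the set of faces that are missed: P(all s faces seen in n rolls) = sum over k from 0 to s of (-1)^k * C(s, k) * ((s - k) / s)^n. A minimal sketch, assuming a fair die (the function name `prob_all_faces` is only illustrative):

```python
from math import comb


def prob_all_faces(sides, rolls):
    """Probability that every face of a fair die appears at least once in `rolls` rolls."""
    return sum(
        (-1) ** k * comb(sides, k) * ((sides - k) / sides) ** rolls
        for k in range(sides + 1)
    )


# The case asked about: a 12 sided die rolled 30 times
p_30 = prob_all_faces(12, 30)
```

Each term counts the ways a particular set of `k` faces never appears, with alternating signs correcting for overlap. Calling the function for a range of `rolls` values gives the whole curve of X against the number of rolls.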

3 comments

r/statistics • u/fuckosta • 2d ago

Question [Q] Does it make sense for a multivariate R^2 to be higher than that of any individual variable?

2 Upvotes

I fit a harmonic regression model on a set of time series. I then calculated the R^2 for each individual time series, and also the overall R^2 by taking the observations and fitted values as matrices. Somehow, the overall R^2 is significantly higher than those of the individual time series. Does this make sense? Is there a flaw in my approach?
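One way this can happen: if the overall R² is computed against a single grand mean while the series sit at different levels, the fitted values get credit for explaining the differences between series, not just the variation within each one. The toy numbers below are made up to show the effect, and the pooled calculation is an assumption about how the overall figure was obtained.

```python
def r_squared(y, fitted):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Two hypothetical series at very different levels; each fit is just a flat line
y_a, fit_a = [0, 1, 0, 1], [0.5] * 4
y_b, fit_b = [100, 101, 100, 101], [100.5] * 4

r2_a = r_squared(y_a, fit_a)                     # fit explains nothing within series A
r2_b = r_squared(y_b, fit_b)                     # same for series B
r2_pooled = r_squared(y_a + y_b, fit_a + fit_b)  # grand mean inflates SS_tot
```

Here each per-series R² is zero while the pooled value is close to one. Whether that is a flaw depends on the question being asked; subtracting each series' own mean before pooling gives an overall figure that is comparable to the per-series ones.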

5 comments

r/statistics • u/menejike • 3d ago

Question [Question] Regression Analysis Used Correctly?

2 Upvotes

I'm a non-statistician working on an analysis of project efficiency, mostly for people who know less about statistics than I do...but also a few that know a lot more about statistics than I do.

I can see that there is a lot of variation in the number of services provided as compared to the number of staff providing services in different provinces and I want to use regression analysis to look at the relationship, with the number of staff in provinces as the x variable and the number of services as the y variable and express the results using R squared and a line plot.

AI doesn't exactly answer if this is the best approach and I wanted to triangulate with some expert humans. Am I going in the right direction?

Thanks for any feedback or suggestions.

4 comments

r/statistics • u/DuragChamp420 • 3d ago

Education [Q] [E] what's a good GRE score for top programs

3 Upvotes

Essentially I took the GRE today and got a 167 Q and I'm wondering if it's too low. Tons of people have perfect scores so mine's a bit lacking, only 76th percentile. My V was pretty good for the stats field (164, 93rd percentile) but idk if that matters to anyone. Is it worth retaking for 168-169 Q score?

Thanks for any perspectives 🙏

13 comments

r/statistics • u/Jay31416 • 3d ago

Question [Question] Dynamic Linear Model: Classical vs Discount Approach

3 Upvotes

I'm working on a time series forecasting problem and trying to decide between the classical approach and discount approach for Dynamic Linear Models (DLMs). Does anyone here have experience comparing these approaches?

I have successfully implemented the discount approach in Python. There seems to be limited literature on the comparison of both models and I'm curious if anyone has practical experience or opinions.

Classical approach: Estimates fixed variance matrices (V, W) via maximum likelihood
Discount approach: Uses discount factors (δ) to create adaptive evolution variance (West & Harrison, 1997)

Follow-up question: I am using maximum likelihood to estimate the discount parameters - is this correct?

Reference: West, M., & Harrison, J. (1997). Bayesian Forecasting and Dynamic Models (2nd ed.). Springer-Verlag.
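To make the difference concrete, here is a minimal sketch of the discount idea for the simplest case: a scalar local-level model with a known observation variance. Both simplifications are mine, not the poster's setup. Instead of a fixed evolution variance W, the prior variance at each step is the previous posterior variance divided by δ, which is equivalent to W_t = C_{t-1}(1 - δ)/δ.

```python
def discount_local_level(observations, delta, obs_var, m0=0.0, c0=1e6):
    """Filter a local-level DLM where a discount factor replaces the evolution variance W."""
    m, c = m0, c0                 # posterior mean and variance of the level
    means, variances = [], []
    for y in observations:
        a = m                     # prior mean (state matrix G = 1)
        r = c / delta             # prior variance: discounting inflates C_{t-1} by 1/delta
        q = r + obs_var           # one-step forecast variance
        gain = r / q              # adaptive coefficient A_t
        m = a + gain * (y - a)    # posterior mean after seeing y
        c = gain * obs_var        # posterior variance, equal to R_t - A_t^2 * Q_t
        means.append(m)
        variances.append(c)
    return means, variances


# Hypothetical constant series, just to show the recursion settling down
means, variances = discount_local_level([1.0] * 500, delta=0.9, obs_var=2.0)
```

In this special case the adaptive coefficient settles at 1 - δ, so δ directly controls how quickly old data is forgotten, whereas the classical approach gets the same behaviour indirectly through the ratio W/V.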

1 comment

r/statistics • u/Netgay • 3d ago

Question [Q] Could someone please explain to me how a homogeneous sample causes us to underestimate a correlation?

3 Upvotes

Title. In a psychology class, the prof talked about how homogeneous samples cause us to estimate a correlation to be lower than it actually is, and that this potentially needs to be accounted for, but I just can't seem to wrap my head around it. Shouldn't it be the other way around? Could someone explain it to me like I'm 5? I feel really dumb.
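This effect is usually called range restriction: when everyone in the sample has nearly the same x, the noise in y swamps what little variation in x is left. A small simulation can show it. The model (y = x + noise), the seed, and the cut-off of 0.5 are all arbitrary choices for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [rng.gauss(0, 1) for _ in range(20_000)]
y = [xi + rng.gauss(0, 1) for xi in x]   # same relationship for everyone

r_full = pearson_r(x, y)

# "Homogeneous" subsample: keep only cases whose x lies in a narrow band
narrow = [(xi, yi) for xi, yi in zip(x, y) if abs(xi) < 0.5]
r_narrow = pearson_r([p[0] for p in narrow], [p[1] for p in narrow])
```

The relationship between x and y is identical in both samples, yet `r_narrow` comes out well below `r_full`, because correlation measures the share of y's variation that tracks x, and a homogeneous sample has very little variation in x to track.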

1 comment

r/statistics • u/Drowsy_Forest • 3d ago

Question [question] Bayes conditional probability for 9 IID events

2 Upvotes

I feel dumb for not being able to work this out without drawing up a large tree, and quick google didn’t get me the exact calculator I am looking for but:

I have 9 independent events, but they are conditional in that if one fails, the test fails. I only have the probability of the test failing, approximately 0.71.

I want to know the probability of the individual events failing; what’s the smart way to do this?
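If the nine events are independent and share the same failure probability p (the "IID" in the title), then the test passes only when all nine pass, so P(test fails) = 1 - (1 - p)^9 and p = 1 - (1 - P(test fails))^(1/9). A minimal sketch under that assumption:

```python
p_test_fails = 0.71   # overall failure probability quoted in the post
n_events = 9

# Invert 1 - (1 - p)^n = p_test_fails for the per-event failure probability p
p_event_fails = 1 - (1 - p_test_fails) ** (1 / n_events)

# Sanity check: plugging p back in should recover the overall failure rate
recovered = 1 - (1 - p_event_fails) ** n_events
```

This gives a per-event failure probability of about 0.13. If the events do not all share the same probability, a single overall figure cannot pin down nine separate values, and no tree or calculator will get around that.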

2 comments

r/statistics • u/chicanatifa • 3d ago

Question [Q] Qualified to apply to a masters?

4 Upvotes

Wondering if my background will meet the requisites for general stats programs.

I have an undergrad degree in economics, over 5 years of work experience and have taken calc I and an intro to stats course.

I am currently taking an intro to programming course and will take calc II, intro to linear algebra, and stats II this upcoming semester.

When I go through the prerequisites it seems like they are asking for a heavier amount of math which I won't be able to meet by the time applications are due. Do I have a chance at getting into a program next year or should I push it out?

5 comments

r/statistics • u/KronOliver • 3d ago

Education [Education] Applying/transferring to European PhD programs as a Brazilian

5 Upvotes

Hello guys, I'm currently a first-year Brazilian econ PhD student at a top Brazilian university specializing in econometrics (specifically on semiparametric and nonparametric estimation and identification), looking to transfer/apply to a Stats PhD program in Europe.

Due to the nature of econ PhDs, I've spent the majority of this year grinding through coursework (Math Camp, Microeconomics, Macroeconomics, Econometrics) and haven't really had time to do research at all, with the exception of alignments with my doctoral advisor. The grading scheme is a bit confusing (with three options: A > B > C), as basically all grades are normalized since people tend to do very badly (for example, I got an A in Metrics II with an overall grade of 5.0 and a B in Micro II with an overall grade of 6.1).
Most of my grades are Bs, with an A in Metrics II and, unfortunately, two Cs; however, I am confident that I can scrape more As in the current bimesters (mainly aiming for As in Metrics III and IV).

Originally I opted for an econ PhD in Brazil as I had no intention of leaving Brazil for personal reasons; however, my doctoral advisor (who is a statistician) has strongly recommended that I try to transfer to the econ MSc program and that I apply to Econ/Stat PhD programs in the US/Europe for career reasons. He also said that, even if I'm unable to transfer, I should apply either way, using my grades in the graduate courses + electives (I'm looking to take functional analysis and measure theory next year, as I'll need both for my research) and my research as a writing sample.

To that end, I'm currently negotiating with the econ dept bureaucracy for a transfer, but if that doesn't work I'll be applying either way, as my doctoral advisor has suggested. My current plan is to finish my current RA and core courses this year, dedicate the following year to electives + research and an RA that my advisor has lined up with a buddy of his from Wharton, and apply sometime in 2027/2028 (I'd wish to apply later due to personal reasons).

As these ideas are still in preliminary stages, I'd like more information about stats depts in Europe and some advice. How do stats applications work if I end up not managing to transfer to the MSc programme; is a master's obligatory? Is there any way to transfer from my current PhD to a European PhD (I think this is extremely unlikely)? What is more relevant for an application: my grades? Research? Rec letters?

I can provide more information if it's deemed necessary. I'll be very grateful to anyone who can help :)

0 comments

r/statistics • u/putinisretard • 3d ago

Question [Q] Type 1 error rate higher than 0.05

4 Upvotes

Hi, I am designing a statistically relatively difficult physiological study for which I developed two statistical methods to detect an effect. I also coded a script which simulates 1000 data sets for different conditions (a condition with no effect, and a few varying conditions which have an effect).

Unfortunately, on the simulated data where the effect I am looking for is not present, with a significance level of α=0.05 one of my methods detects an effect at a rate of 0.073. The other method detects an effect at a rate of 0.063.

Is this generally still considered within limits for type 1 error rates? Will reviewers typically let this pass or will I have to tweak my methods? Thank you in advance.

Edit: Turns out the problem was actually in my fake data... I used a fixed seed for one of the random values so there was a bias in the overall dataset since one of the parameters that played into the data generation had the same "random" values in every single dataset
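A useful check for questions like this is the Monte Carlo error of the simulation itself: with 1000 simulated datasets and a true rejection rate of 0.05, the observed rate is a binomial proportion and will wobble around 0.05. The sketch below uses the normal approximation to get a rough 95% range; the function name `monte_carlo_interval` is only illustrative.

```python
import math


def monte_carlo_interval(alpha, n_sims, z=1.96):
    """Approximate 95% range for an observed rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1 - alpha) / n_sims)   # binomial standard error
    return alpha - z * se, alpha + z * se


low, high = monte_carlo_interval(alpha=0.05, n_sims=1000)
for observed in (0.073, 0.063):
    verdict = "inside" if low <= observed <= high else "outside"
    print(f"observed rate {observed}: {verdict} the expected range")
```

With these numbers 0.063 falls just inside the range and 0.073 falls outside it, so the second method's rate is compatible with simulation noise while the first is not. Running more simulated datasets narrows the range and makes the comparison more decisive.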

10 comments

r/statistics • u/Firesnake64 • 4d ago

Career [C] Guidance on higher-education trajectory, research interests?

3 Upvotes

I got my Bachelor's degree in mathematics with a statistics concentration in May 2024, and took a brief 2-year gap to work a completely not-math related job to save up money, and I'm now gearing up to apply to a master's degree program in applied statistics. My ultimate goal is to get my PhD in applied stats, and specifically I want to do research on methods or models used in humanitarian aid research, such as migration, refugee aid, etc. (Not applying directly to a PhD since I took a 2 year gap, and I did not have any research experience during my undergrad, though if you think I should try, just let me know)

Since I only have my bachelor's I quite honestly don't really know what kinds of research I would be looking to do but I know it's in that category. From what I've been able to gather myself it seems like the usual "buzzwords" would pop up such as time series, spatial stats, Bayesian stats, etc. but I wouldn't know where to begin to niche down on the specifics. In the meantime I am trying to have Claude guide me through a mock research project on public migration data from the UNHCR and conflict data from ACLED but I'm largely treating it as a kind of review course for myself.

At some level I feel like the above isn't "valid" justification enough for me to want to go for these advanced degrees but quite honestly I just can't see myself doing anything else, and I've always enjoyed being a student, and I want to become a college professor some day. So in that sense I'm posting this to ask if this plan of mine makes sense, is the field of applied statistics the most appropriate for what I'm interested in, and if you all have any advice in terms of preparing, or learning more about what kind of research specifically I would be able to do? I'm the first in my immediate family to pursue anything past a bachelor's degree so I also am just trying to figure out how it all works with research and assistantships and grants and all that - any guidance would be much appreciated!

3 comments

r/statistics

/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers. _This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit._

Members Active

602.7k

Sidebar

Guidelines:

All Posts Require One of the Following Tags in the Post Title! If you do not flag your post, automoderator will delete it:

Tag Abbreviation

[Research] [R]

[Software] [S]

[Question] [Q]

[Discussion] [D]

[Education] [E]

[Career] [C]

[Meta] [M]
This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.
Please try to keep submissions on topic and of high quality.
Just because it has a statistic in it doesn't make it statistics.
Memes and image macros are not acceptable forms of content.
Self posts with throwaway accounts will be deleted by AutoModerator

Related subreddits:

Data:

r/datasets
KDnuggets Data Mining Data
UC-Irvine Machine Learning Repository
Datamob
datasets package in R
Kaggle <- also great for stats competitions
CMU Data and Story Library
U.S. Government Data Portal
St. Louis Fed. Reserve
Infochimps
AllenDowney's Stats Page

Useful resources for learning R:
r-bloggers - blog aggregator with statistics articles generally done with R software.
Quick-R - great R reference site.

Related Software Links:
R
R Studio
SAS
Stata
EViews
JMP
SPSS
Minitab

Advice for applying to grad school:
Submission 1

Advice for undergrads:
Submission 1

