Accounts who came in, performed an action, and then returned monthly.
Author
Robert Wright (rwright@)
Published
October 12, 2025
FAS Account Inflow / Outflow by Month
import os
import glob
from pathlib import Path
from datetime import datetime, timedelta
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

plt.style.use("seaborn-v0_8")
sns.set_theme(context="notebook", style="whitegrid")
# @replace DATA_SOURCES
DATA_SOURCES = {"datagrepper-parse-accounts": "/home/jovyan/work/bus2parquet/output_users"}
parquet_dir = DATA_SOURCES["datagrepper-parse-accounts"]

cutoff_date = (pd.Timestamp.now().replace(day=1) - pd.DateOffset(weeks=52)).date()

files = []
for p in Path(parquet_dir).glob("fedora-*.parquet"):
    stem = p.stem.replace("_processed", "")
    d = datetime.strptime(stem.split("-")[1], "%Y%m%d").date()
    if d >= cutoff_date:
        files.append(str(p))

dataset = ds.dataset(files, format="parquet")

chunks = []
for batch in dataset.to_batches(batch_size=50_000):
    df = batch.to_pandas()
    if "sent_at" not in df.columns or "username" not in df.columns:
        continue
    df["sent_at"] = pd.to_datetime(df["sent_at"], errors="coerce").dt.floor("s")
    chunks.append(df)

combined_df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()

if not combined_df.empty:
    print("Maximum date in data:", combined_df["sent_at"].max().date())
    print("Minimum date in data:", combined_df["sent_at"].min().date())
else:
    print("No data found in cutoff range")
Maximum date in data: 2025-09-10
Minimum date in data: 2024-09-02
While we try to determine the best count of contributors, messages from the following topics are removed:
- org.centos
- io.pagure.prod (commits on distgit are not counted here)
- org.fedoraproject.prod.mailman.receive* (these messages are not tied to FAS at this time)
- org.fedoraproject.prod.bugzilla* (these messages are not tied to FAS at this time)
- org.release-monitoring* (these messages are not user activity)
- org.fedoraproject.prod.copr* (due to a processing issue, COPR messages may be double counted and need to be fixed in processing before they are included in counts)
- org.fedoraproject.prod.discourse (Matthew has some ideas on how to extract Ask Fedora)
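The topic exclusions listed above could be applied with a simple prefix filter. This is only a minimal sketch: the `topic` column name, the helper `drop_excluded_topics`, and the choice to treat every entry (with or without a trailing `*`) as a prefix are assumptions for illustration, not the report's actual processing code.

```python
import pandas as pd

# Prefixes copied from the exclusion list; treating each entry as a
# prefix match is an assumption, not confirmed by the report.
EXCLUDED_TOPIC_PREFIXES = (
    "org.centos",
    "io.pagure.prod",
    "org.fedoraproject.prod.mailman.receive",
    "org.fedoraproject.prod.bugzilla",
    "org.release-monitoring",
    "org.fedoraproject.prod.copr",
    "org.fedoraproject.prod.discourse",
)


def drop_excluded_topics(df: pd.DataFrame, topic_col: str = "topic") -> pd.DataFrame:
    """Hypothetical helper: return only rows whose topic is not excluded."""
    topics = df[topic_col].astype(str)
    # str.startswith accepts a tuple and matches any of its prefixes.
    excluded = topics.map(lambda t: t.startswith(EXCLUDED_TOPIC_PREFIXES)).astype(bool)
    return df[~excluded]
```

Applied to the loaded frame, this would look like `combined_df = drop_excluded_topics(combined_df)`, assuming the parquet files carry a `topic` column.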
We also require a user to emit at least 10 messages in a month to be counted as Retained or Inflow. A user who does not reach that threshold is counted as Outflow for the next month. Month 0 is dropped from the visualization because Inflow cannot be defined without a preceding month (for example, if the data begins in Jan ’24, Jan ’24 is skipped and Feb ’24 is the first month shown).
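The classification rule can be sketched as follows. This is one plausible reading of the rule, not the report's own implementation: an account is "active" in a month when it emits at least 10 messages, Inflow is active now but not in the previous month, Retained is active in both, and Outflow is active in the previous month but not now. The helper name `monthly_flows` is hypothetical, and the sketch assumes `sent_at` holds timezone-naive timestamps.

```python
import pandas as pd

MIN_MESSAGES = 10  # threshold stated in the text


def monthly_flows(df: pd.DataFrame, min_messages: int = MIN_MESSAGES) -> pd.DataFrame:
    """Hypothetical helper: count inflow, retained, and outflow accounts per month."""
    work = df.dropna(subset=["sent_at", "username"]).copy()
    work["month"] = work["sent_at"].dt.to_period("M")

    # Messages per (month, username) pair.
    counts = work.groupby(["month", "username"]).size()

    # Set of accounts meeting the threshold in each month.
    active_by_month = {}
    for (month, username), n in counts.items():
        if n >= min_messages:
            active_by_month.setdefault(month, set()).add(username)

    rows = []
    previous = set()
    all_months = pd.period_range(work["month"].min(), work["month"].max(), freq="M")
    for i, month in enumerate(all_months):
        current = active_by_month.get(month, set())
        if i > 0:  # month 0 is dropped: it has no preceding month to compare with
            rows.append({
                "month": str(month),
                "inflow": len(current - previous),
                "retained": len(current & previous),
                "outflow": len(previous - current),
            })
        previous = current
    return pd.DataFrame(rows, columns=["month", "inflow", "retained", "outflow"])
```

Keeping the per-month active accounts as plain sets makes the three categories simple set differences and intersections, which is easy to check by hand on a small sample.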