Conversation

@connorn-dev (Collaborator)

Methods:

  • start_end_activities(csv_path)
  • generate_tree(csv_path, output_dir)
  • generate_graph(csv_path, output_dir)

Note: installs pm4py if not already installed.
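A rough sketch of what the first two methods might look like internally with pm4py (the column names, event-log layout, and output file name are assumptions for illustration, not the PR's actual code):

```python
import os

import pandas as pd
import pm4py


def start_end_activities(csv_path):
    # Load the event-log CSV; the column names here are assumed, not confirmed.
    df = pd.read_csv(csv_path)
    df = pm4py.format_dataframe(df, case_id="issue_id",
                                activity_key="event", timestamp_key="created_at")
    # Frequencies of the first and last activity of each case (issue).
    return pm4py.get_start_activities(df), pm4py.get_end_activities(df)


def generate_tree(csv_path, output_dir):
    df = pd.read_csv(csv_path)
    df = pm4py.format_dataframe(df, case_id="issue_id",
                                activity_key="event", timestamp_key="created_at")
    # Discover a process tree with the inductive miner and save it as an image.
    tree = pm4py.discover_process_tree_inductive(df)
    pm4py.save_vis_process_tree(tree, os.path.join(output_dir, "process_tree.png"))
```

generate_graph(csv_path, output_dir) would presumably follow the same pattern with pm4py.discover_dfg and pm4py.save_vis_dfg.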

Signed-off-by: Connor Narowetz <[email protected]>
- Added config_file and action argument
- Run chosen action

Signed-off-by: Connor Narowetz <[email protected]>
- Added Python notebook with all functionality.
- Deleted the .py script so the notebook replaces it
- Changed name of notebook
The following was added:
- Example section at bottom of notebook
- Petri Net model
- Performance Models
- BPMN filtered model
- Cleaned up wording
- Images folder

Signed-off-by: Connor Narowetz <[email protected]>
@carlosparadis (Member) left a comment

Hi Connor,

This needs substantial revision. I did not review the entirety of the notebook because the narrative is somewhat confusing.

The overall structure of the notebook should be something like this:

  • You introduce the user to GitHub Events. You explain you can obtain them with Kaiaulu. You motivate your code, saying you are interested in seeing if developers often follow a similar process. This is all brief; the details go in subsequent sections.
  • Right after the introduction, you explain the overall folder organization between kaiaulu, process mining, and the raw data the exec will give you.
  • GitHub Events: You minimally explain that Kaiaulu organizes its features in functions, which are used in Notebooks and execs. For details on how to construct the config file for events, you defer the user to reading the Events Notebook (do ensure the notebook does explain what you claim it does).
    • Explain that by running the config, Kaiaulu will save the files to rawdata (specify the exact folder). Don't create a sub-section for this.
    • After executing the script, simply load the table in pandas and show the user what it looks like (see the sketch after this list). Explain it to the user. Again, no need to keep making sub-sections here; a GitHub Events section suffices.
    • Proceed to explain and re-motivate your work: that it is hard to guess whether people are following the same process.
  • Process Mining GitHub Events:
    • In this new section, begin by explaining you will demonstrate it in practice. Using the already loaded pandas dataframe of the real dataset, apply the 3 process mining functions.
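For the "load the table in pandas" step, something as small as this would do (the path and file name are placeholders for whatever ghevents.R actually saves into rawdata):

```python
import pandas as pd

# Hypothetical rawdata path following the folder layout suggested later in
# this review; substitute the file name ghevents.R actually produced.
events = pd.read_csv("../../kaiaulu/rawdata/github_events.csv")
events.head()  # show the user the first few rows of the event table
```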

By this point, the user should be able to use your code for their own needs. What you then need is a section that explains experiments 1, 2 and 3. Where are the fake data generators here?

Also, where is the api folder? Where is the markdown? This does not look like the most recent version of the code. I still see functions being defined in the notebook.

Also:

Where is the env.yml conda file?

Member:

Do not version binary files. Generate them on the fly in the code so users generate them themselves when they run your notebook.
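For example, rather than committing a rendered PNG, the notebook could regenerate the figure on each run. A minimal sketch, assuming the issue_output.csv file name and the column names used elsewhere in this thread:

```python
import pandas as pd
import pm4py

# Rebuild the directly-follows graph image at run time instead of versioning it.
df = pd.read_csv("issue_output.csv")
df = pm4py.format_dataframe(df, case_id="issue_id",
                            activity_key="event", timestamp_key="created_at")
dfg, start, end = pm4py.discover_dfg(df)
pm4py.save_vis_dfg(dfg, start, end, "images/dfg.png")  # regenerated, not committed
```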


"cell_type": "markdown",
"metadata": {},
"source": [
"### Download and Parse data with Kaiaulu ghevents.R (CLI)\n",
Member:

rename to: GitHub Events

"cwd = os.getcwd()\n",
"os.chdir(os.path.expanduser(\"~/Desktop/Kaiaulu/Working_issues/kaiaulu/\"))\n",
"\n",
"# To download use the download command specifing the <config_file> <github_token>\n",
Member:

This is too many sub-sections. Keep the section division consistent and not too specific, or it defeats the purpose of having sections.

Change this to be a sub-section and enumerate the sections (it is hard on the eye to tell what is a header or sub-header when they are not close together, so use 1., 1.1, etc.). Rename to: GitHub Events using Kaiaulu

"\n",
"# To download use the download command specifing the <config_file> <github_token>\n",
"\n",
"command = f\"Rscript exec/ghevents.R download conf/kaiaulu.yml --token_path=~/.ssh/github_token\"\n",
Member:

This is not a good path. It assumes the user is running the exec from inside the Kaiaulu project. The reasonable assumption is instead that the user downloaded your code (process_mining) and wants to run the exec from Kaiaulu. So the current working directory should assume the user is in the process_mining folder, goes one level above, then accesses the kaiaulu folder.

What I recommend you do is to explain the expected folder organization of the projects before throwing paths left and right. Assume the following:

kaiaulu/kaiaulu/exec/ghevents.R
kaiaulu/rawdata/
process_mining/notebooks/issue_event_processing.ipynb
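Under that layout, a sketch of how the notebook might locate and invoke the exec (the config and token paths simply mirror the reviewed command and are placeholders):

```python
import os
import subprocess

# From process_mining/notebooks/, go one level above process_mining, then into
# the kaiaulu checkout that contains exec/ghevents.R.
kaiaulu_dir = os.path.abspath(
    os.path.join(os.getcwd(), "..", "..", "kaiaulu", "kaiaulu"))

# Run the exec from the kaiaulu folder instead of chdir-ing to a hardcoded path.
subprocess.run(
    ["Rscript", "exec/ghevents.R", "download", "conf/kaiaulu.yml",
     "--token_path=~/.ssh/github_token"],
    cwd=kaiaulu_dir, check=True,
)
```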

"\n",
"As stated above it is reccomended you start with only a few event issues. To do this you can open the created issue_output.csv with Excel or Google Sheets and modify the table to only include a few. \n",
"\n",
"Note: commit_output.csv has been implemented for furture development and is note currently used for any process modeling. "
Member:

I have no idea what is going on here. These should not be the file names that Kaiaulu downloads the rawdata as. How do you go from the rawdata Kaiaulu downloaded to these csvs? Why are these two tables being passed as input here? Are both being generated by ghevents download? That sounds odd.

"source": [
"#### Execute Parse Command\n",
"\n",
"As stated above it is reccomended you start with only a few event issues. To do this you can open the created issue_output.csv with Excel or Google Sheets and modify the table to only include a few. \n",
Member:

This is not useful. The user will not be able to create the files by hand with just this as instruction. Did you not say you created the fake generator for the examples? Where is it located? That's what it was for here.

This version includes the following:
- A separate API folder with process mining functions and helper functions
- Function comments added to support pdoc format

Signed-off-by: Connor Narowetz <[email protected]>
@connorn-dev changed the title from "Created event_processing file with 3 methods to run analysis i" to "#1 Created event_processing file with 3 methods to run analysis" on Apr 21, 2025
- Changed modify_event_in_csv function comment

Signed-off-by: Connor Narowetz <[email protected]>
@carlosparadis (Member) left a comment

Here are some more comments on your API. I may need to do one more pass after all the changes to check the overall consistency of which file has which functions in the API, and how they are used in the Notebook.

Thanks!

fake = Faker()

# Function to generate fake data
def generate_csv_file(num_issues=1, num_events_per_issue=7, output_csv="generated_csv", seed=2):
Member:

Notice the name of your function vs the title of the docs: the more appropriate name here is generate_fake_event_log. generate_csv_file is too broad; any logic could be doing this.

from datetime import datetime, timedelta

# Initialize faker
fake = Faker()
Member:

API files should not be initializing objects. This is function definition only.
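A sketch combining this fix with the rename suggested above. Faker is seeded and constructed inside the function; the column names and activity set are assumptions, not the PR's actual ones:

```python
import random

import pandas as pd
from faker import Faker


def generate_fake_event_log(num_issues=1, num_events_per_issue=7,
                            output_csv="generated_csv", seed=2):
    """Generate a fake GitHub-issue event log and save it as a CSV."""
    # Initialize Faker here, inside the function, not at module import time.
    Faker.seed(seed)
    random.seed(seed)
    fake = Faker()
    rows = []
    for issue_id in range(1, num_issues + 1):
        for _ in range(num_events_per_issue):
            rows.append({
                "issue_id": issue_id,
                "event": random.choice(["assigned", "labeled", "closed"]),
                "created_at": fake.date_time_this_year(),
            })
    df = pd.DataFrame(rows)
    df.to_csv(f"{output_csv}.csv", index=False)
    return df
```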

@@ -0,0 +1,78 @@
import random
import pandas as pd
from faker import Faker
Member:

Nice finding!

# Function to generate fake data
def generate_csv_file(num_issues=1, num_events_per_issue=7, output_csv="generated_csv", seed=2):
"""
Generates fake event log data and saves it to a CSV file.
Member:

Expand the description so it explains what fake set of activities is being generated. Given the poster feedback, you may want to make some adjustments to reflect a Kaiaulu workflow.
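The expanded description might read something like this (the listed activities are an assumed Kaiaulu-style workflow, not the PR's actual set):

```python
def generate_fake_event_log(num_issues=1, num_events_per_issue=7,
                            output_csv="generated_csv", seed=2):
    """
    Generate fake GitHub-issue event log data and save it to a CSV file.

    Each fake issue receives a sequence of events resembling the activities
    a Kaiaulu GitHub Events table records (e.g. assigned, labeled, closed),
    with fake actors and timestamps, so the process mining functions can be
    demonstrated without downloading real data.
    """
    ...
```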


## Features

This module provides functions to:
Member:

Non-comment English in a .py file. Not sure this will work.
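If this markdown is meant for pdoc, the usual approach is to move it into the module docstring, which pdoc renders as markdown. A sketch with assumed bullet content:

```python
"""
Fake event-log generation helpers.

## Features

This module provides functions to:

* generate a fake GitHub-issue event log and save it as a CSV
* modify individual events in an existing CSV for the experiments
"""
```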


## Example

```python
Member:

Not sure what is going on here. Is this supposed to be an ipynb?

Member:

rename file to process_discovery.py; stay consistent with the field terminology

import pm4py


def start_end_activities(csv_path):
Member:

Not sure this belongs in process_discovery.py. Seems event-log related? Maybe a file for event log manipulation would make more sense, or just place it in io.py.

"""
Reads an event log from a CSV file and returns its start and end activities.

Assumes the CSV has the following columns:
Member:

Reference the Python function that creates the .csv instead; that function contains the specification of the csv returned. Have the user know what function to use, rather than explaining it as if they will code it! Remember Kaiaulu has \code{\link{function_name}} in R. Find the equivalent in Python.

You should also see itm0 for the "See Also" section, so you can reference in the returns which functions will use the output of another.
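In pdoc, identifiers wrapped in backticks are cross-linked, so the docstring could point at the generator directly. A sketch with assumed module and function names:

```python
def start_end_activities(csv_path):
    """
    Return the start and end activities of an event log read from `csv_path`.

    The expected column layout is the one produced by
    `fake_event_log.generate_fake_event_log`; see that function for the
    specification of the CSV.

    See Also:
        `process_discovery.generate_tree`: consumes the same CSV layout.
    """
    ...
```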

- Notebook now starts the user from the beginning
- Uses relative paths
- Includes three "experiments"
- API now asks user action 'view' 'save' or 'both'
- Renamed API functions
- Includes env.yml
- Corrected relative path and api import
@connorn-dev requested a review from carlosparadis May 3, 2025
@carlosparadis (Member) left a comment

Just need to patch the env file. The rest looks pretty good. Please check that the env file of the sentiment project is also updated.

Member:

Please change this for me. What you did was export the env.yml from conda. This hardcoded the exact versions of your environment, down to commit hashes, and also added all the indirect dependencies of libraries you don't even use.

You want to create this file manually and list under dependencies only the libraries you actually import.

env.yml Outdated
@@ -0,0 +1,548 @@
name: base
Member:

name it process_mining
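A hand-written env.yml along those lines might look like this (the dependency list is an assumption based on the imports visible in this PR, not the project's confirmed requirements):

```yaml
name: process_mining
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - faker
  - jupyter
  - pip
  - pip:
      # pm4py is installed from PyPI
      - pm4py
```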

"metadata": {},
"source": [
"# Introduction\n",
"Event logs are the foundation of process mining. They capture records of activities within a system, providing information about when actions occur and what those actions are. For example, in GitHub Issue Events, actions such as assigning users, labeling issues, and closing issues are recorded. Together, these events tell the full story of the process from start to finish. Event logs can be transformed into differnt process graphs, which visually represent the flow of activities and how they connect. These graphs make it easier to identify inefficiencies, bottlenecks, and deviations from expected workflows. They provide valuable insights for process improvement and optimization. \n",
Member:

typo in "differnt". Please run a spell checker.

- env.yml changed
- file paths in notebook changed
- Added full env with versions
Signed-off-by: Connor Narowetz <[email protected]>