Conversation

@connorn-dev (Collaborator)

Methods:

  • start_end_activities(csv_path)
  • generate_tree(csv_path, output_dir)
  • generate_graph(csv_path, output_dir)

Note: installs pm4py if not already installed.
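A rough sketch of what the first two methods might look like internally with pm4py (the column names, event-log layout, and output file name are assumptions for illustration, not the PR's actual code):

```python
import os

import pandas as pd
import pm4py


def start_end_activities(csv_path):
    # Load the event-log CSV; the column names here are assumed, not confirmed.
    df = pd.read_csv(csv_path)
    df = pm4py.format_dataframe(df, case_id="issue_id",
                                activity_key="event", timestamp_key="created_at")
    # Frequencies of the first and last activity of each case (issue).
    return pm4py.get_start_activities(df), pm4py.get_end_activities(df)


def generate_tree(csv_path, output_dir):
    df = pd.read_csv(csv_path)
    df = pm4py.format_dataframe(df, case_id="issue_id",
                                activity_key="event", timestamp_key="created_at")
    # Discover a process tree with the inductive miner and save it as an image.
    tree = pm4py.discover_process_tree_inductive(df)
    pm4py.save_vis_process_tree(tree, os.path.join(output_dir, "process_tree.png"))
```

generate_graph(csv_path, output_dir) would presumably follow the same pattern with pm4py.discover_dfg and pm4py.save_vis_dfg.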

Signed-off-by: Connor Narowetz <[email protected]>
- Added config_file and action argument
- Run chosen action

Signed-off-by: Connor Narowetz <[email protected]>
- Added Python notebook with all functionality.
- Deleted the .py script so the notebook replaces it
- Changed name of notebook
The following was added:
- Example section at bottom of notebook
- Petri Net model
- Performance Models
- BPMN filtered model
- Cleaned up wording
- Images folder

Signed-off-by: Connor Narowetz <[email protected]>
@carlosparadis (Member) left a comment

Hi Connor,

This needs substantial revision. I did not review the entirety of the notebook because the narrative is somewhat confusing.

The overall structure of the notebook should be something like this:

  • You introduce the user to GitHub Events. You explain you can obtain them with Kaiaulu. You motivate your code, saying you are interested in seeing if developers often follow a similar process. This is all brief; the details go in subsequent sections.
  • Right after the introduction, you explain the overall folder organization between kaiaulu, process mining, and the raw data the exec will give you.
  • GitHub Events: You minimally explain that Kaiaulu organizes its features in functions, which are used in Notebooks and execs. For details on how to construct the config file for events, you defer the user to reading the Events Notebook (do ensure the notebook does explain what you claim it does).
    • Explain that by running the config, Kaiaulu will save the files to rawdata (specify the exact folder). Don't create a sub-section for this.
    • After executing the script, simply load the table in pandas and show the user what it looks like (see the sketch after this list). Explain it to the user. Again, no need to keep making sub-sections here; a GitHub Events section suffices.
    • Proceed to explain and re-motivate your work: that it is hard to guess whether people are following the same process.
  • Process Mining GitHub Events:
    • In this new section, begin by explaining you will demonstrate it in practice. Using the already loaded pandas dataframe of the real dataset, apply the 3 process mining functions.
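For the "load the table in pandas" step, something as small as this would do (the path and file name are placeholders for whatever ghevents.R actually saves into rawdata):

```python
import pandas as pd

# Hypothetical rawdata path following the folder layout suggested later in
# this review; substitute the file name ghevents.R actually produced.
events = pd.read_csv("../../kaiaulu/rawdata/github_events.csv")
events.head()  # show the user the first few rows of the event table
```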

By this point, the user should be able to use your code for their own needs. What you then need is a section that explains experiments 1, 2 and 3. Where are the fake data generators here?

Also, where is the api folder? Where is the markdown? This does not look like the most recent version of the code. I still see functions being defined in the notebook.

Also:

Where is the env.yml conda file?

Member:

Do not version binary files. Generate them on the fly in the code so users generate them themselves when they run your notebook.
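For example, rather than committing a rendered PNG, the notebook could regenerate the figure on each run. A minimal sketch, assuming the issue_output.csv file name and the column names used elsewhere in this thread:

```python
import pandas as pd
import pm4py

# Rebuild the directly-follows graph image at run time instead of versioning it.
df = pd.read_csv("issue_output.csv")
df = pm4py.format_dataframe(df, case_id="issue_id",
                            activity_key="event", timestamp_key="created_at")
dfg, start, end = pm4py.discover_dfg(df)
pm4py.save_vis_dfg(dfg, start, end, "images/dfg.png")  # regenerated, not committed
```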


"cell_type": "markdown",
"metadata": {},
"source": [
"### Download and Parse data with Kaiaulu ghevents.R (CLI)\n",
Member:

rename to: GitHub Events

"cwd = os.getcwd()\n",
"os.chdir(os.path.expanduser(\"~/Desktop/Kaiaulu/Working_issues/kaiaulu/\"))\n",
"\n",
"# To download use the download command specifing the <config_file> <github_token>\n",
Member:

This is too many sub-sections. Keep the section division consistent and not too specific, or it defeats the purpose of having sections.

Change this to be a sub-section and enumerate the sections (it is hard on the eye to tell what is a header or sub-header when they are not close together, so use 1., 1.1, etc.). Rename to: GitHub Events using Kaiaulu

"\n",
"# To download use the download command specifing the <config_file> <github_token>\n",
"\n",
"command = f\"Rscript exec/ghevents.R download conf/kaiaulu.yml --token_path=~/.ssh/github_token\"\n",
Member:

This is not a good path. It assumes the user is running the exec from inside the Kaiaulu project. The reasonable assumption is instead that the user downloaded your code (process_mining) and wants to run the exec from Kaiaulu. So the current working directory should assume the user is in the process_mining folder, goes one level above, then accesses the kaiaulu folder.

What I recommend you do is to explain the expected folder organization of the projects before throwing paths left and right. Assume the following:

kaiaulu/kaiaulu/exec/ghevents.R
kaiaulu/rawdata/
process_mining/notebooks/issue_event_processing.ipynb
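Under that layout, a sketch of how the notebook might locate and invoke the exec (the config and token paths simply mirror the reviewed command and are placeholders):

```python
import os
import subprocess

# From process_mining/notebooks/, go one level above process_mining, then into
# the kaiaulu checkout that contains exec/ghevents.R.
kaiaulu_dir = os.path.abspath(
    os.path.join(os.getcwd(), "..", "..", "kaiaulu", "kaiaulu"))

# Run the exec from the kaiaulu folder instead of chdir-ing to a hardcoded path.
subprocess.run(
    ["Rscript", "exec/ghevents.R", "download", "conf/kaiaulu.yml",
     "--token_path=~/.ssh/github_token"],
    cwd=kaiaulu_dir, check=True,
)
```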

"\n",
"As stated above it is reccomended you start with only a few event issues. To do this you can open the created issue_output.csv with Excel or Google Sheets and modify the table to only include a few. \n",
"\n",
"Note: commit_output.csv has been implemented for furture development and is note currently used for any process modeling. "
Member:

I have no idea what is going on here. These should not be the file names that Kaiaulu downloads the rawdata as. How do you go from the rawdata Kaiaulu downloaded to these csvs? Why are these two tables being passed as input here? Are both being generated by ghevents download? That sounds odd.

"source": [
"#### Execute Parse Command\n",
"\n",
"As stated above it is reccomended you start with only a few event issues. To do this you can open the created issue_output.csv with Excel or Google Sheets and modify the table to only include a few. \n",
Member:

This is not useful. The user will not be able to create the files by hand with just this as instruction. Did you not say you created the fake generator for the examples? Where is it located? That's what it was for here.

This version includes the following:
- A separate API folder with process mining functions and helper functions
- Function comments added to support pdoc format

Signed-off-by: Connor Narowetz <[email protected]>
@connorn-dev changed the title from "Created event_processing file with 3 methods to run analysis i" to "#1 Created event_processing file with 3 methods to run analysis" on Apr 21, 2025
- Changed modify_event_in_csv function comment

Signed-off-by: Connor Narowetz <[email protected]>
@carlosparadis (Member) left a comment

Here are some more comments on your API. I may need to do one more pass after all the changes to check the overall consistency of which file has which functions in the API, and how they are used in the Notebook.

Thanks!

fake = Faker()

# Function to generate fake data
def generate_csv_file(num_issues=1, num_events_per_issue=7, output_csv="generated_csv", seed=2):
Member:

Notice the name of your function vs the title of the docs: the more appropriate name here is generate_fake_event_log. generate_csv_file is too broad; any logic could be doing this.

from datetime import datetime, timedelta

# Initialize faker
fake = Faker()
Member:

API files should not be initializing objects. This is function definition only.
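A sketch combining this fix with the rename suggested above. Faker is seeded and constructed inside the function; the column names and activity set are assumptions, not the PR's actual ones:

```python
import random

import pandas as pd
from faker import Faker


def generate_fake_event_log(num_issues=1, num_events_per_issue=7,
                            output_csv="generated_csv", seed=2):
    """Generate a fake GitHub-issue event log and save it as a CSV."""
    # Initialize Faker here, inside the function, not at module import time.
    Faker.seed(seed)
    random.seed(seed)
    fake = Faker()
    rows = []
    for issue_id in range(1, num_issues + 1):
        for _ in range(num_events_per_issue):
            rows.append({
                "issue_id": issue_id,
                "event": random.choice(["assigned", "labeled", "closed"]),
                "created_at": fake.date_time_this_year(),
            })
    df = pd.DataFrame(rows)
    df.to_csv(f"{output_csv}.csv", index=False)
    return df
```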

@@ -0,0 +1,78 @@
import random
import pandas as pd
from faker import Faker
Member:

Nice finding!

# Function to generate fake data
def generate_csv_file(num_issues=1, num_events_per_issue=7, output_csv="generated_csv", seed=2):
"""
Generates fake event log data and saves it to a CSV file.
Member:

Expand the description so it explains what fake set of activities is being generated. Given the poster feedback, you may want to make some adjustments to reflect a Kaiaulu workflow.
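The expanded description might read something like this (the listed activities are an assumed Kaiaulu-style workflow, not the PR's actual set):

```python
def generate_fake_event_log(num_issues=1, num_events_per_issue=7,
                            output_csv="generated_csv", seed=2):
    """
    Generate fake GitHub-issue event log data and save it to a CSV file.

    Each fake issue receives a sequence of events resembling the activities
    a Kaiaulu GitHub Events table records (e.g. assigned, labeled, closed),
    with fake actors and timestamps, so the process mining functions can be
    demonstrated without downloading real data.
    """
    ...
```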


## Features

This module provides functions to:
Member:

Non-comment English in a .py file. Not sure this will work.
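If this markdown is meant for pdoc, the usual approach is to move it into the module docstring, which pdoc renders as markdown. A sketch with assumed bullet content:

```python
"""
Fake event-log generation helpers.

## Features

This module provides functions to:

* generate a fake GitHub-issue event log and save it as a CSV
* modify individual events in an existing CSV for the experiments
"""
```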


## Example

```python
Member:

Not sure what is going on here. Is this supposed to be an ipynb?

Member:

rename file to process_discovery.py; stay consistent with the field terminology

import pm4py


def start_end_activities(csv_path):
Member:

Not sure this belongs in process_discovery.py. Seems event-log related? Maybe a file for event log manipulation would make more sense, or just place it in io.py.

"""
Reads an event log from a CSV file and returns its start and end activities.

Assumes the CSV has the following columns:
Member:

Reference the Python function that creates the .csv instead; that function contains the specification of the csv returned. Have the user know what function to use, rather than explaining it as if they will code it! Remember Kaiaulu has \code{\link{function_name}} in R. Find the equivalent in Python.

You should also see itm0 for the "See Also" section, so you can reference in the returns which functions will use the output of another.
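In pdoc, identifiers wrapped in backticks are cross-linked, so the docstring could point at the generator directly. A sketch with assumed module and function names:

```python
def start_end_activities(csv_path):
    """
    Return the start and end activities of an event log read from `csv_path`.

    The expected column layout is the one produced by
    `fake_event_log.generate_fake_event_log`; see that function for the
    specification of the CSV.

    See Also:
        `process_discovery.generate_tree`: consumes the same CSV layout.
    """
    ...
```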

- Notebook now starts the user from the beginning
- Uses relative paths
- Includes three "experiments"
- API now asks user action 'view' 'save' or 'both'
- Renamed API functions
- Includes env.yml
- Corrected relative path and api import
@connorn-dev requested a review from carlosparadis May 3, 2025
@carlosparadis (Member) left a comment

Just need to patch the env file. The rest looks pretty good. Please check that the env file of the sentiment project is also updated.

Member:

Please change this for me. What you did was export the env.yml from conda. This hardcoded the exact versions of your environment, down to commit hashes, and also added all the indirect dependencies of libraries you don't even use.

You want to create this file manually and list under dependencies only the libraries you actually import.

env.yml Outdated
@@ -0,0 +1,548 @@
name: base
Member:

name it process_mining
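A hand-written env.yml along those lines might look like this (the dependency list is an assumption based on the imports visible in this PR, not the project's confirmed requirements):

```yaml
name: process_mining
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - faker
  - jupyter
  - pip
  - pip:
      # pm4py is installed from PyPI
      - pm4py
```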

"metadata": {},
"source": [
"# Introduction\n",
"Event logs are the foundation of process mining. They capture records of activities within a system, providing information about when actions occur and what those actions are. For example, in GitHub Issue Events, actions such as assigning users, labeling issues, and closing issues are recorded. Together, these events tell the full story of the process from start to finish. Event logs can be transformed into differnt process graphs, which visually represent the flow of activities and how they connect. These graphs make it easier to identify inefficiencies, bottlenecks, and deviations from expected workflows. They provide valuable insights for process improvement and optimization. \n",
Member:

typo in "differnt". Please run a spell checker.

- env.yml changed
- file paths in notebook changed
- Added full env with versions
Signed-off-by: Connor Narowetz <[email protected]>