Full Document

Overview

High-Utility Itemset Mining (HUIM) is a critical data mining task, but traditional algorithms assume absolute certainty in the data. In real-world scenarios—such as data collected from sensors, unstable prediction systems, or probabilistic purchase behaviors—data contains inherent uncertainty.

Integrating probability into the profit model causes the search space to explode exponentially. Furthermore, traditional methods generate a massive number of redundant subset itemsets. This research introduces mHUIMiner, a solution that focuses on extracting only the Maximal itemsets (itemsets not subsumed by any other valid high-utility itemset), drastically reducing the output size without losing valuable strategic patterns.

Core Methodology & Architecture

The project explores two different traversal strategies: a Breadth-First Search (Level-wise Miner) and a Depth-First Search (Depth-first Miner). The Depth-first approach proved vastly superior due to the following structures:

1. The PUList (Probability-Utility List) Data Structure

To overcome the bottleneck of repeatedly scanning the database, we engineered the PUList. Instead of storing raw transactions, each itemset is represented by a list of elements. Each element tracks five crucial metrics:

tid: Transaction ID.
actualUtility: Total real profit of the items.
probability: The combined existence probability of the items.
expectedUtility: The product of actual utility and probability.
remainingExpectedUtility (REU): The expected utility of the remaining items in that transaction.

By utilizing PULists, the algorithm only scans the database twice. Subsequent itemsets are generated entirely in RAM by intersecting parent PULists in $\Theta(n)$ linear time, bypassing expensive database I/O operations.

2. Aggressive Pruning Strategies

The system implements strict upper-bound pruning to terminate invalid search branches early:

TWUE (Transaction-Weighted Expected Utility): Used as an initial filter to eliminate unpromising individual items before the core algorithm begins.
REU Pruning: Deep within the recursive branches, if the current Expected Utility (EU) plus the Remaining Expected Utility (REU) is less than the minimum threshold (EU + REU < minUtil), the algorithm halts expansion, mathematically guaranteeing no valid supersets exist down that path.

3. Code Optimization

Built in Java SE 25 using Object-Oriented Programming (OOP), the system is heavily optimized for execution speed. Instead of using polymorphic interfaces for pruning strategies—which would trigger millions of expensive overhead calls—the pruning conditions are inlined directly into the core execution loops.

Performance Analysis & Experimental Results

Extensive stress testing was conducted using standard datasets with opposing characteristics: Retail (a sparse dataset with short transactions) and Chess (a dense dataset with highly overlapping items).

Runtime Stability: The Depth-first miner demonstrated absolute stability, executing in mere milliseconds regardless of how low the utility threshold dropped. In contrast, the Level-wise miner's runtime degraded exponentially. Runtime Performance Comparison
Pruning Power (Memory Efficiency): DFS reduced the generated search space structure count by hundreds of times compared to the Apriori-based subset generation of the BFS method. Pruning Power Comparison
Density Tolerance: When tested on the highly dense Chess dataset, the Level-wise algorithm collapsed entirely, hitting a 100% timeout rate due to an inability to manage the combinatorial explosion. The Depth-first miner survived the immense pressure, successfully extracting all MHUIs within the 60-second hardware limit. Density Tolerance Comparison

Limitations & Future Work

The current mathematical model assumes statistical independence between items (calculating joint probability via simple multiplication). In reality, items often have positive correlations (e.g., hotdogs and mustard), meaning the system can occasionally underestimate the expected utility of complementary goods. Future developments will look into integrating Bayesian Networks to better model these dependent probabilities.

Maximal High Utility Itemsets Mining in Uncertain Quantitative Databases