Context:
Artificial Intelligence: clustering and unsupervised learning:
Artificial Intelligence (AI) is a field that combines computer science with data sets,
with the aim of enabling a machine to imitate the cognitive abilities of human beings.
Machine learning (ML) and its sub-domain deep learning, which uses layers of artificial
neurons, are two major sub-domains of AI. ML approaches differ in how each algorithm is
trained: supervised learning involves training a model on known input and output data to
predict future outputs, whereas unsupervised learning involves the discovery of hidden
patterns and intrinsic underlying structures in the input data.
The aim of clustering methods is to group a set of individuals into homogeneous classes.
Non-hierarchical methods can be used to classify massive data but require the number of
classes to be fixed in advance. Hierarchical methods, which are more time-consuming to
compute, consist of a series of nested partitions represented by a clustering tree
(dendrogram). The optimal number of classes can be determined a posteriori by reading the
tree. In the presence of a large number of individuals, it is common to combine
non-hierarchical and hierarchical techniques. When classes are not clearly known in
advance, clustering methods are used with unsupervised learning (ML) [1]. Datasets are
generally divided into three disjoint subsets: training data, used to train the chosen
algorithm(s); validation data, used to check the performance of the results; and test
data, used only at the end of the process.
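As a minimal sketch of the three-way split described above, the following Python example shuffles a list of hypothetical patient identifiers and cuts it into disjoint training, validation and test subsets. The 60/20/20 proportions, the function name `split_dataset` and the fixed seed are illustrative assumptions, not choices stated in this project.

```python
import random


def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and cut them into three disjoint lists.

    The fractions are illustrative assumptions; the remainder after the
    training and validation cuts becomes the test subset.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test subset")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # set aside until the end of the process
    return train, validation, test


patient_ids = list(range(100))  # hypothetical identifiers, not real data
train, validation, test = split_dataset(patient_ids)
print(len(train), len(validation), len(test))
```

Because the three lists are slices of a single shuffled copy, no record can appear in more than one subset.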
Venous thromboembolic disease:
Venous thromboembolic disease (VTE) is a common pathology whose incidence is imperfectly
known, but increases with age, reaching 1% in subjects over 75 years old. In France, it
is estimated that every year over 100,000 people develop VTE, which is responsible for
between 5,000 and 10,000 deaths. Deep vein thrombosis (DVT) and pulmonary embolism (PE)
are the two main types of VTE. DVT corresponds to partial or total occlusion of a deep
vein by a thrombus, most often localized in the lower limbs. PE is defined as partial or
total occlusion of the pulmonary arteries or their branches. The main risk of DVT is the
occurrence of PE, which can be life-threatening. Other VTE-specific complications and
possible adverse outcomes include thromboembolic recurrence (either DVT or PE), chronic
thromboembolic pulmonary hypertension and post-thrombotic syndrome in DVT. Current
management of VTE is mainly based on anticoagulant therapy. The duration of treatment
varies according to the estimated risk of recurrence if treatment is withdrawn,
essentially depending on whether or not there is a prior major risk factor [2]. In the
subgroup of PE patients without major risk factors, the risk of recurrence is
considered intermediate and varies according to whether the event is a first episode or a
recurrence, and whether or not there are obstructive pulmonary sequelae [3]. More
recently, the therapeutic strategy has become more complex, with the inclusion of minor
risk factors that modulate the duration of treatment without relevant evidence. Moreover,
regardless of the duration of treatment, the appropriate dosage of direct oral
anticoagulants beyond the sixth month remains uncertain.
Hypotheses:
The aim will be to use the database to identify clinically relevant phenotypes in
patients with acute pulmonary embolism. Hierarchical clustering methods combined with
unsupervised learning (machine learning) will be used to obtain groups of patients who
are homogeneous at diagnosis. Evaluating their prognosis at 6 months (recurrence or
chronic thromboembolic pulmonary hypertension), taking into account the first 3 months of
anticoagulant treatment, would provide an aid to medical decision-making.
An analysis of the six-month evolution of homogeneous groups of patients with acute
pulmonary embolism, constructed using clustering methods with unsupervised learning, has
never been conducted before. This innovative project within a large-scale hospital
infrastructure is likely to offer doctors a decision-making aid, and patients a
scientifically validated form of therapeutic management.
Material and Methods:
This research will include a retrospective part and a prospective part. The retrospective
part will include patients who have been admitted to CHITS for acute pulmonary embolism
since 2019 (around 1,900 patients). For the prospective part, it is planned to include
patients with the same characteristics over the years 2024 and 2025 (approximately 765
patients). If individual information is unavailable for 25% of the patients, or if they
object to the processing of their data, a large volume of data on over 2,500 patients
could potentially be analysed in this study. This research will have no impact on current
patient care. Data from consultations and various examinations carried out as part of the
care will be collected for six months post-diagnosis to meet the research objectives.
The unsupervised clustering methods used in this study combine hierarchical and
non-hierarchical methods. Following agglomerative (ascending) hierarchical clustering,
Ward's index is used to determine the number of groups of interest. The centroids of these
groups are then used to initialize a partitioning algorithm, such as the k-means
algorithm. Once the most medically relevant groups have been determined, their six-month
evolution (stable, aggravation or improvement) is compared. Factors influencing
progression during the first three months of treatment can also be included in a
statistical model, depending on their ability to predict aggravation. All these
explorations should provide a basis for medical decision-making.
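The combined procedure described above (hierarchical clustering with Ward's criterion, followed by k-means started from the resulting centroids) could be sketched in Python as follows. This is an illustration under stated assumptions, not the study's actual pipeline: the data are synthetic, the function names are hypothetical, and the number of groups is chosen by a simple heuristic (the largest jump between successive Ward merge heights), which is only one possible way of reading the tree. The sketch relies on SciPy and scikit-learn.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans


def choose_k_from_ward(Z, max_clusters=10):
    """Heuristic (an assumption): pick k at the largest jump in merge heights."""
    heights = Z[::-1, 2]  # heights[i]: cost of merging (i + 2) clusters into (i + 1)
    top = min(max_clusters, len(heights))
    gaps = heights[: top - 1] - heights[1:top]  # gaps[i] corresponds to k = i + 2
    return int(np.argmax(gaps)) + 2


def ward_then_kmeans(X, max_clusters=10):
    """Ward hierarchical clustering, then k-means started from the Ward centroids."""
    Z = linkage(X, method="ward")  # agglomerative tree built with Ward's criterion
    k = choose_k_from_ward(Z, max_clusters)
    tree_labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k groups
    centroids = np.array(
        [X[tree_labels == c].mean(axis=0) for c in np.unique(tree_labels)]
    )
    # n_init=1 because the starting centroids are supplied explicitly
    kmeans = KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(X)
    return kmeans.labels_, kmeans.cluster_centers_


# Synthetic stand-in for patient features: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centres])

labels, centres = ward_then_kmeans(X)
print(len(centres), np.bincount(labels))
```

With real clinical variables measured on different scales, the features would usually need to be standardized before computing distances; that step is omitted here for brevity.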