Bahri, M., Salutari, F., Putina, A., & Sozio, M. (2022). AutoML: state of the art with a focus on anomaly detection, challenges, and research directions. International Journal of Data Science and Analytics.
@article{2022ijds,
title = {AutoML: state of the art with a focus on anomaly detection, challenges, and research directions},
author = {Bahri, Maroua and Salutari, Flavia and Putina, Andrian and Sozio, Mauro},
journal = {International Journal of Data Science and Analytics},
month = jan,
year = {2022},
miodoi = {10.1007/s41060-022-00309-0}
}
The last decade has witnessed an explosion of machine learning research, with many algorithms proposed and successfully adopted in different application domains. However, the performance of many machine learning algorithms is very sensitive to multiple ingredients (e.g., hyper-parameter tuning and data cleaning), and significant human effort is required to achieve good results. Thus, building well-performing machine learning algorithms requires domain knowledge and highly specialized data scientists. Automated machine learning (autoML) aims to make the use of machine learning algorithms easier and more accessible for researchers with varying levels of expertise. Research effort to date has mainly been devoted to autoML for supervised learning, and only a few proposals address unsupervised learning. In this paper, we present an overview of the autoML field with a particular emphasis on the automated methods and strategies that have been proposed for unsupervised anomaly detection.
Salutari, F., Da Hora, D., Dubuc, G., & Rossi, D. (2020). Analyzing Wikipedia Users’ Perceived Quality Of Experience: A Large-Scale Study. IEEE Transactions on Network and Service Management.
@article{salutari2020tnsm,
author = {{Salutari}, F. and {Da Hora}, D. and {Dubuc}, G. and {Rossi}, D.},
journal = {IEEE Transactions on Network and Service Management},
title = {Analyzing Wikipedia Users’ Perceived Quality Of Experience: A Large-Scale Study},
month = mar,
year = {2020},
miodoi = {10.1109/TNSM.2020.2978685}
}
The Web is one of the most successful Internet applications. Yet, the quality of Web users’ experience is still largely impenetrable. Whereas Web performance is typically studied with controlled experiments, in this work we perform a large-scale study of a real site, Wikipedia, explicitly asking (a small fraction of its) users for feedback on the browsing experience. The analysis of the collected feedback reveals that 85% of users are satisfied, along with both expected (e.g., the impact of browser and network connectivity) and surprising findings (e.g., absence of day/night, weekday/weekend seasonality) that we detail in this paper. Also, we leverage user responses to build supervised data-driven models to predict user satisfaction which, despite including state-of-the-art quality of experience (QoE) metrics, are still far from achieving accurate results (0.62 recall of negative answers). Finally, we make our dataset publicly available, hopefully contributing to enriching and refining the scientific community’s knowledge on Web users’ QoE.
Conference Proceedings
Salutari, F., Ramos, J., Rahmani, H. A., Linguaglossa, L., & Lipani, A. (2023, May). Quantifying the Bias of Transformer-Based Language Models for African American English in Masked Language Modeling. PAKDD Conference.
@inproceedings{salutari2023pakdd,
title = {Quantifying the Bias of Transformer-Based Language Models for African American English in Masked Language Modeling},
author = {Salutari, Flavia and Ramos, Jerome and Rahmani, Hosein A and Linguaglossa, Leonardo and Lipani, Aldo},
booktitle = {PAKDD Conference},
month = may,
year = {2023},
miodoi = {10.1007/978-3-031-33374-3_42}
}
In recent years, groundbreaking transformer-based language models (LMs) have made tremendous advances in natural language processing (NLP) tasks. However, the measurement of their fairness with respect to different social groups remains unsolved. In this paper, we propose and thoroughly validate an evaluation technique to assess the quality and bias of language model predictions on transcripts of both spoken African American English (AAE) and Spoken American English (SAE). Our analysis reveals the presence of a bias towards SAE encoded by state-of-the-art LMs such as BERT and DistilBERT and a lower bias in distilled LMs. We also observe a bias towards AAE in RoBERTa and BART. Additionally, we show evidence that this disparity is present across all the LMs when we only consider the grammar and the syntax specific to AAE.
Salutari, F., Da Hora, D., Varvello, M., Teixeira, R., Christophides, V., & Rossi, D. (2020, June). Implications of the Multi-Modality of User Perceived Page Load Time. IEEE MedComNet Conference.
@inproceedings{salutari2020medcomnet,
title = {Implications of the Multi-Modality of User Perceived Page Load Time},
author = {Salutari, Flavia and Da Hora, Diego and Varvello, Matteo and Teixeira, Renata and Christophides, Vassilis and Rossi, Dario},
booktitle = {IEEE MedComNet Conference},
month = jun,
year = {2020},
miodoi = {10.1109/MedComNet49392.2020.9191615}
}
Web browsing is one of the most popular applications for both desktop and mobile users. A lot of effort has been devoted to speeding up the Web, as well as to designing metrics that can accurately tell whether a webpage loaded fast or not. An often implicit assumption made by industrial and academic research communities is that a single metric is sufficient to assess whether a webpage loaded fast. In this paper we collect and make publicly available a unique dataset which contains webpage features (e.g., number and type of embedded objects) along with both objective and subjective Web quality metrics. This dataset was collected by crawling over 100 websites, representative of the top 1M websites on the Web, while crowdsourcing 6,000 user opinions on user perceived page load time (uPLT). We show that the uPLT distribution is often multi-modal and that, in practice, no more than three modes are present. The main conclusion drawn from our analysis is that, for complex webpages, each of the different objective QoE metrics proposed in the literature (such as AFT, TTI, PLT, etc.) is suited to approximate one of the different uPLT modes.
Salutari, F., Da Hora, D., Dubuc, G., & Rossi, D. (2019, May). A large-scale study of Wikipedia users’ quality of experience. The Web Conference (WWW’19).
@inproceedings{salutari2019www,
title = {A large-scale study of Wikipedia users' quality of experience},
author = {Salutari, Flavia and Da Hora, Diego and Dubuc, Gilles and Rossi, Dario},
booktitle = {The Web Conference (WWW'19)},
month = may,
year = {2019},
address = {San Francisco, California},
miodoi = {10.1145/3308558.3313467}
}
The Web is one of the most successful Internet applications. Yet, the quality of Web users’ experience is still largely impenetrable. Whereas Web performance is typically measured with controlled experiments, in this work we perform a large-scale study of one of the most popular websites, namely Wikipedia, explicitly asking (a small fraction of its) users for feedback on the browsing experience. We leverage user survey responses to build a data-driven model of user satisfaction which, despite including state-of-the-art quality of experience metrics, is still far from achieving accurate results, and discuss directions to move forward. Finally, we aim at making our dataset publicly available, which hopefully contributes to enriching and refining the scientific community’s knowledge on Web users’ quality of experience (QoE).
Salutari, F., Cicalese, D., & Rossi, D. (2018, March). A closer look at IP-ID behavior in the Wild. International Conference on Passive and Active Network Measurement (PAM).
@inproceedings{salutari2018pam,
title = {A closer look at IP-ID behavior in the Wild},
author = {Salutari, Flavia and Cicalese, Danilo and Rossi, Dario},
booktitle = {International Conference on Passive and Active Network Measurement (PAM)},
address = {Berlin, Germany},
year = {2018},
month = mar,
miodoi = {10.1007/978-3-319-76481-8_18}
}
Originally used to assist network-layer fragmentation and reassembly, the IP identification field (IP-ID) has been used and abused for a range of tasks, from counting hosts behind NAT, to detecting router aliases and, lately, to assisting the detection of censorship in the Internet at large. These inferences have been possible since, in the past, the IP-ID was mostly implemented as a simple packet counter; however, this behavior has been discouraged for security reasons, and other policies, such as random values, have been suggested. In this study, we propose a framework to classify the different IP-ID behaviors using active probing from a single host. Despite being only minimally intrusive, our technique is highly accurate (99% true positive classification), robust against packet losses (up to 20%) and lightweight (a few packets suffice to discriminate all IP-ID behaviors). We then apply our technique to an Internet-wide census, where we actively probe one alive target per routable /24 subnet: we find that the majority of hosts adopt a constant IP-ID (39%) or a local counter (34%), that the fraction of global counters (18%) has significantly diminished, that a non-marginal number of hosts exhibit odd behavior (7%) and that random IP-IDs are still an exception (2%).
Ciociola, A., Cocca, M., Giordano, D., Mellia, M., Morichetta, A., Putina, A., & Salutari, F. (2017, August). UMAP: Urban Mobility Analysis Platform to Harvest Car Sharing Data. IEEE Smart City Innovations (IEEE SCI’17).
@inproceedings{salutari2017umap,
author = {Ciociola, Alessandro and Cocca, Michele and Giordano, Danilo and Mellia, Marco and Morichetta, Andrea and Putina, Andrian and Salutari, Flavia},
title = {UMAP: Urban Mobility Analysis Platform to Harvest Car Sharing Data},
booktitle = {IEEE Smart City Innovations (IEEE SCI'17)},
month = aug,
year = {2017},
address = {San Francisco, California},
miodoi = {10.1109/UIC-ATC.2017.8397566}
}
Car sharing is nowadays a popular means of transport in smart cities. In particular, the free-floating paradigm lets customers look for available cars, book one, and then start and stop the rental at will, within a specific area. This is done thanks to a smartphone app, which contacts a web-based backend to exchange information. In this paper we present UMAP, a platform to harvest the data freely made available on the web by these backends and to extract driving habits in cities. We design UMAP with two specific purposes. Firstly, UMAP fetches data from car sharing platforms in real time. Secondly, it processes the data to extract advanced information about driving patterns and users’ habits. To extract information, UMAP augments the data available from the car sharing platforms with mapping and direction information fetched from other web platforms. This information is stored in a data lake where historical series are built, and later analyzed using analytics modules that are easy to design and customize. We prove the flexibility of UMAP by presenting a case study for the city of Turin. We collect car sharing usage data for over 50 days to characterize both the temporal and spatial properties of rentals, and to characterize customers’ habits in using the service, which we contrast with public transportation alternatives. Results provide insights about driving style and needs, which are useful for smart city planners, and prove the feasibility of our approach.
Technical Reports
Salutari, F., Da Hora, D., Dubuc, G., & Rossi, D. (2020). Analyzing Wikipedia Users’ Perceived Quality Of Experience: A Large-Scale Study (Extended Technical Report). In Technical Report.
@techreport{techrepqoe2020,
author = {Salutari, Flavia and Da Hora, Diego and Dubuc, Gilles and Rossi, Dario},
title = {Analyzing Wikipedia Users’ Perceived Quality Of Experience: A Large-Scale Study (Extended Technical Report)},
booktitle = {Technical Report},
month = dec,
year = {2020}
}
The Web is one of the most successful Internet applications. Yet, the quality of Web users’ experience is still largely impenetrable. Whereas Web performance is typically measured with controlled experiments, in this work we perform a large-scale study of one of the most popular websites, namely Wikipedia, explicitly asking (a small fraction of its) users for feedback on the browsing experience. The analysis of the collected users’ feedback reveals both expected (e.g., the impact of browser and network connectivity) and surprising findings (e.g., absence of day/night, weekday/weekend seasonality and other temporal dependencies) that we detail in this paper. Also, we leverage user survey responses to build supervised data-driven models to predict user satisfaction which, despite including state-of-the-art quality of experience metrics, are still far from achieving accurate results. Finally, we make our dataset publicly available, which hopefully contributes to enriching and refining the scientific community’s knowledge on Web users’ Quality of Experience (QoE).
Salutari, F., & Rossi, D. (2019). A deeper look at IP-ID behavior in the Wild (Extended Technical Report). In Technical Report.
@techreport{techrepipid2018,
author = {Salutari, Flavia and Rossi, Dario},
title = {A deeper look at IP-ID behavior in the Wild (Extended Technical Report)},
booktitle = {Technical Report},
month = feb,
year = {2019}
}
Originally used to assist network-layer fragmentation and reassembly, the IP identification field (IP-ID) has been used and abused for a range of tasks, from counting hosts behind NAT, to detecting router aliases and, lately, to assisting the detection of censorship in the Internet at large. These inferences have been possible since, in the past, the IP-ID was mostly implemented as a simple packet counter; however, this behavior has been discouraged for security reasons, and other policies, such as the use of random values, have been suggested. In this study, we propose a framework to classify the different IP-ID behaviors using active probing from a single host. Despite being only minimally intrusive, our technique is highly accurate (99% true positive classification), robust against packet losses (up to 20%) and lightweight (a few packets suffice to discriminate all IP-ID behaviors). We then apply our technique to an Internet-wide census, where we actively probe one alive target per routable /24 subnet: we find that the majority of hosts adopt a constant IP-ID (39%) or a local counter (34%), that the fraction of global counters (18%) has significantly diminished, that a non-marginal number of hosts exhibit odd behavior (7%) and that random IP-IDs are still an exception (2%). We believe that these findings, together with the datasets we release, can provide some support for works relying on a specific implementation of the IP-ID and, more generally, that they can be instrumental for researchers operating in the field of network measurements, by providing them with an updated picture of the Internet-wide adoption of the different known IP-ID implementations.