You will critically analyze, implement, and discuss various technical approaches for improving human-AI interaction through a series of assignments hosted on our custom interactive platforms.
There will be a set of assignments, announced as the course progresses, including:
Hands-on exercises, analysis, and reflection are a fun and effective way to learn.
We'll create an entry in KLMS for each assignment.
You'll lose 10% for each late day. Submissions will be accepted up to three days after the deadline; after that, you'll receive 0 on that assignment.
Please answer the following questions after you complete the exploration and implementation through the platform above. Make sure to cite any external sources when you refer to examples, ideas, and quotes to support your arguments.
The following papers should be useful to get a sense of the background, methodology, and technical approaches we will be using.
You need to submit two things: (1) code with your implementation and (2) answers to the discussion questions. You do not need to explicitly submit your code as the server keeps track of your latest implementation. Answers to the discussion questions need to be written as a report in PDF. We highly encourage you to use resources from your Jupyter Notebook such as code, figures, and statistical results to support your arguments in the discussion.
Explainable AI (XAI) helps users understand and interpret predictions produced by models. The objective of this assignment is for you to try existing off-the-shelf explanation tools, think about their strengths and weaknesses, and design your own interactive user interface that provides user-centered explanations addressing those weaknesses.
You will work with methods for explaining model predictions in image classification tasks. Such explanations help users resolve questions about what’s happening inside the AI model and why. However, as users explore these explanations, they may come up with additional questions about the model, which may require other kinds of explanations.
In this assignment, you are asked to (1) explore Google’s What-If Tool, a platform that helps users understand the performance of models, (2) build and run an algorithm based on Local Interpretable Model-agnostic Explanations (LIME) that presents which parts of an image contribute to the class prediction, for better interpretation of classification results, and (3) design a UI prototype that further helps users interpret the results, especially when such explanations are not enough. For each of these stages, you are asked to discuss what can be explained with such tools/methods and the limitations of such explanations. For (2), we will use our interactive platform, which provides an environment for implementing the algorithm and applying it to images. The platform lets you easily organize the explanation results, so you can focus on analyzing the limitations of the explanation algorithm without writing additional code for experimentation.
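To give a sense of what step (2) involves, below is a minimal from-scratch sketch of the LIME idea for images. It is purely illustrative: the helper name, parameter values, and the classifier_fn argument are hypothetical, and the course platform provides its own scaffolding that you should follow instead.

import numpy as np
from skimage.segmentation import quickshift
from sklearn.linear_model import Ridge

def lime_explain(image, classifier_fn, label, num_samples=1000, top_k=5):
    # image: HxWx3 float array in [0, 1]
    # classifier_fn: maps a batch of images to class probabilities (N x C)
    # 1. Split the image into interpretable components (superpixels).
    segments = quickshift(image, kernel_size=4, max_dist=200, ratio=0.2)
    n_segments = segments.max() + 1
    # 2. Sample perturbed images by switching superpixels on or off.
    masks = np.random.randint(0, 2, size=(num_samples, n_segments))
    masks[0, :] = 1  # keep the original image as one sample
    perturbed = []
    for mask in masks:
        img = image.copy()
        img[~np.isin(segments, np.where(mask == 1)[0])] = 0  # black out "off" superpixels
        perturbed.append(img)
    probs = classifier_fn(np.stack(perturbed))[:, label]
    # 3. Weight each sample by its proximity to the original image.
    distances = np.sqrt(((masks - 1) ** 2).sum(axis=1)) / np.sqrt(n_segments)
    weights = np.exp(-(distances ** 2) / 0.25)
    # 4. Fit a weighted linear surrogate model; its largest coefficients identify
    #    the superpixels that contribute most to the predicted class.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, probs, sample_weight=weights)
    return np.argsort(surrogate.coef_)[-top_k:][::-1]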
The What-If Tool consists of three tabs: Datapoint editor, Performance & Fairness, and Features. Each tab presents a different aspect of the model and its results.
Note: You will be using the demo version of the What-If Tool, so it only shows precalculated results for your interactions. For example, the exact prediction numbers may not be the actual outputs of the model.
Datapoint editor tab
The following resources should be useful for you in getting a sense of the background, methodology, and technical approaches we will be using.
You only need to submit a .pdf file that answers the discussion questions. We highly encourage you to use resources such as code, figures, and statistical results to support your arguments in the discussion. Note that you do not need to explicitly submit your implementation, your description of the limitations of the LIME algorithm, or your interactive UI prototype, as they are automatically stored on the server.
Advances in model scaling have led to models (e.g., large language models, multimodal models) possessing various emergent capabilities that allow them to perform new tasks with minimal or no task-specific data. Both developers and researchers have been leveraging these capabilities to power interfaces that perform long-tail tasks: tasks that are more specific to certain audiences and domains. However, it can be difficult to evaluate model performance on these long-tail tasks, as there are no established benchmarks or datasets.
In this assignment, you will be designing a long-tail task that will be performed by an LLM and then exploring how to evaluate this long-tail task. Specifically, you will be designing a summarization
task (e.g., shortening and/or simplifying a longer text into a shorter one) by considering how it can be applied to a specific type of audience (e.g., summarize for older adults), domain (e.g., summarize research papers), and/or use case (e.g., real-time summarization for chat messages). In this assignment, you will follow the instructions below and compose a report by answering the given questions. When needed, make sure to cite any external sources when you use examples, ideas, or quotes to support your arguments in your answers.
Design a long-tail summarization task by considering a type of audience, domain, and/or use case. Then, implement 2 different prompt-based “pipelines” that can perform this task using an LLM: (1) a single basic prompt (1~3 sentences), and (2) a more complex prompt (e.g., more instruction text, chain of prompts, few-shot prompt). TIP: You can simply use the ChatGPT interface to “implement” your pipelines. TIP: When implementing, think of the steps a person would take to perform this task and try to implement these steps through prompts.
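As an illustration of what the two pipelines might look like, here is a hypothetical sketch. The call_llm helper, the prompts, and the task (summarizing research papers for older adults) are assumptions for illustration only; you can just as well run your prompts directly in the ChatGPT interface.

def call_llm(prompt):
    # Hypothetical wrapper around whatever LLM access you use
    # (e.g., pasting into the ChatGPT interface or calling an API client).
    raise NotImplementedError

def basic_pipeline(text):
    # (1) A single basic prompt: one short instruction.
    return call_llm(
        "Summarize the following research paper abstract for older adults "
        "in three plain-language sentences:\n\n" + text
    )

def complex_pipeline(text):
    # (2) A more complex pipeline: a chain of prompts mirroring the steps
    #     a person might take when doing the task manually.
    key_points = call_llm("List the five most important findings in this paper:\n\n" + text)
    plain = call_llm("Rewrite these findings in plain language, avoiding jargon:\n\n" + key_points)
    return call_llm("Combine these points into a three-sentence summary for older adults:\n\n" + plain)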
Answer the following questions:

Now, let's evaluate the implementations by first creating ~5 samples of inputs to test your implementations with. (TIP: You can optionally use an LLM to create these samples.) Then, run each of your implementations on this sample set to generate output samples.
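A minimal sketch of this step, assuming the hypothetical basic_pipeline and complex_pipeline functions above and placeholder input samples:

# Placeholder inputs; replace with your own ~5 samples (or LLM-generated ones).
input_samples = [
    "Abstract of paper 1 ...",
    "Abstract of paper 2 ...",
    "Abstract of paper 3 ...",
]

# Run both implementations over the sample set to collect output samples.
outputs = {
    "basic": [basic_pipeline(s) for s in input_samples],
    "complex": [complex_pipeline(s) for s in input_samples],
}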
Answer the following questions:

Considering your findings and discussions in Step 2, design a new metric that can measure one aspect of task performance. For example, when generating summaries for research papers, one metric could focus on measuring the complexity and technical difficulty of terms used in the summary. Then, you will imagine how to potentially measure this metric and design a pseudo-metric using an LLM.
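For instance, a pseudo-metric for the term-complexity example above could itself be implemented with an LLM. The sketch below is hypothetical: the rubric, the 1-5 scale, and the call_llm helper are assumptions, not a required design.

def complexity_metric(summary):
    # LLM-as-judge pseudo-metric: ask the model to rate how technical the
    # summary's wording is, from 1 (plain everyday language) to 5 (heavy jargon).
    prompt = (
        "On a scale of 1 (plain everyday language) to 5 (highly technical jargon), "
        "rate the technical complexity of the terms used in the following summary. "
        "Answer with a single digit.\n\nSummary:\n" + summary
    )
    return int(call_llm(prompt).strip())

# Example: compare the average complexity of each pipeline's outputs.
# avg_basic = sum(complexity_metric(o) for o in outputs["basic"]) / len(outputs["basic"])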
Answer the following questions:

The following resources should be useful for you in getting a sense of different approaches that are used to evaluate natural language generation tasks.
You only need to submit a .pdf file for your report containing the answers to the questions in each step of the instructions. We highly encourage you to use resources such as code, figures, and samples to support your arguments in the discussion.