You will critically analyze, implement, and discuss various technical approaches for improving human-AI interaction through a series of assignments hosted on our custom interactive platforms.
There will be a set of assignments, announced as the course progresses, including:
Hands-on exercises, analysis, and reflection are a fun and effective way to learn.
We'll create an entry in KLMS for each assignment.
You'll lose 10% for each late day. Submissions will be accepted up to three days after the deadline; after that, you'll receive 0 on that assignment.
Please answer the following questions after you complete the exploration and implementation through the platform above. Make sure to cite any external sources when you refer to examples, ideas, and quotes to support your arguments.
The following papers should be useful to get a sense of the background, methodology, and technical approaches we will be using.
You need to submit two things: (1) code with your implementation and (2) answers to the discussion questions. You do not need to explicitly submit your code as the server keeps track of your latest implementation. Answers to the discussion questions need to be written as a report in PDF. We highly encourage you to use resources from your Jupyter Notebook such as code, figures, and statistical results to support your arguments in the discussion.
Explainable AI (XAI) helps users understand and interpret predictions produced by models. The objective of this assignment is for you to try existing off-the-shelf explanation tools, think about their strengths and weaknesses, and design your own interactive user interface that provides user-centered explanations addressing those weaknesses.
You will work with methods for explaining model predictions in image classification tasks. Such explanations help users resolve questions about what’s happening inside the AI model and why. However, as users explore these explanations, they may come up with additional questions about the model, which may require other kinds of explanations.
In this assignment, you are asked to (1) explore Google’s What-If Tool, a platform that helps users understand the performance of models, (2) build and run an algorithm based on Local Interpretable Model-agnostic Explanations (LIME) that presents which parts of an image contribute to the class prediction, for better interpretation of classification results, and (3) design a UI prototype that further helps users interpret the results, especially when such explanations are not enough. For each of these stages, you are asked to discuss what can be explained with such tools/methods and the limitations of such explanations. For (2), we will use our interactive platform, which provides an environment for implementing the algorithm and applying it to images. The platform lets you easily organize the explanation results, so you can focus on analyzing the limitations of the explanation algorithm without writing additional code for experimentation.
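To give a sense of what step (2) involves, below is a minimal from-scratch sketch of the LIME idea for images. It is purely illustrative: the helper name, parameter values, and the classifier_fn argument are hypothetical, and the course platform provides its own scaffolding that you should follow instead.

import numpy as np
from skimage.segmentation import quickshift
from sklearn.linear_model import Ridge

def lime_explain(image, classifier_fn, label, num_samples=1000, top_k=5):
    # image: HxWx3 float array in [0, 1]
    # classifier_fn: maps a batch of images to class probabilities (N x C)
    # 1. Split the image into interpretable components (superpixels).
    segments = quickshift(image, kernel_size=4, max_dist=200, ratio=0.2)
    n_segments = segments.max() + 1
    # 2. Sample perturbed images by switching superpixels on or off.
    masks = np.random.randint(0, 2, size=(num_samples, n_segments))
    masks[0, :] = 1  # keep the original image as one sample
    perturbed = []
    for mask in masks:
        img = image.copy()
        img[~np.isin(segments, np.where(mask == 1)[0])] = 0  # black out "off" superpixels
        perturbed.append(img)
    probs = classifier_fn(np.stack(perturbed))[:, label]
    # 3. Weight each sample by its proximity to the original image.
    distances = np.sqrt(((masks - 1) ** 2).sum(axis=1)) / np.sqrt(n_segments)
    weights = np.exp(-(distances ** 2) / 0.25)
    # 4. Fit a weighted linear surrogate model; its largest coefficients identify
    #    the superpixels that contribute most to the predicted class.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, probs, sample_weight=weights)
    return np.argsort(surrogate.coef_)[-top_k:][::-1]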
The What-If Tool consists of three tabs: Datapoint editor, Performance & Fairness, and Features. Each tab presents a different aspect of the model and its results.
Note: You will be using the demo version of the What-If Tool, so it only shows precalculated results for your interactions. For example, the exact prediction numbers may not be the actual outputs of the model.
Datapoint editor tab
The following resources should be useful for you in getting a sense of the background, methodology, and technical approaches we will be using.
You only need to submit a .pdf file that answers the discussion questions. We highly encourage you to use resources such as code, figures, and statistical results to support your arguments in the discussion. Note that you do not need to explicitly submit your implementation, your description of the limitations of the LIME algorithm, or your interactive UI prototype, as they are automatically stored on the server.
Advances in model scaling have led to models (e.g., large language models, multimodal models) possessing various emergent capabilities that allow them to perform new tasks with minimal or no task-specific data. Both developers and researchers have been leveraging these capabilities to power interfaces that perform long-tail tasks: tasks that are more specific to certain audiences and domains. However, it can be difficult to evaluate model performance on these long-tail tasks, as there are no established benchmarks or datasets.
In this assignment, you will be designing a long-tail task that will be performed by an LLM and then exploring how to evaluate this long-tail task. Specifically, you will be designing a summarization
task (e.g., shortening and/or simplifying a longer text into a shorter one) by considering how it can be applied to a specific type of audience (e.g., summarize for older adults), domain (e.g., summarize research papers), and/or use case (e.g., real-time summarization for chat messages). In this assignment, you will follow the instructions below and compose a report by answering the given questions. When needed, make sure to cite any external sources when you use examples, ideas, or quotes to support your arguments in your answers.
Design a long-tail summarization task by considering a type of audience, domain, and/or use case. Then, implement 2 different prompt-based “pipelines” that can perform this task using an LLM: (1) a single basic prompt (1~3 sentences), and (2) a more complex prompt (e.g., more instruction text, chain of prompts, few-shot prompt). TIP: You can simply use the ChatGPT interface to “implement” your pipelines. TIP: When implementing, think of the steps a person would take to perform this task and try to implement these steps through prompts.
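As an illustration of what the two pipelines might look like, here is a hypothetical sketch. The call_llm helper, the prompts, and the task (summarizing research papers for older adults) are assumptions for illustration only; you can just as well run your prompts directly in the ChatGPT interface.

def call_llm(prompt):
    # Hypothetical wrapper around whatever LLM access you use
    # (e.g., pasting into the ChatGPT interface or calling an API client).
    raise NotImplementedError

def basic_pipeline(text):
    # (1) A single basic prompt: one short instruction.
    return call_llm(
        "Summarize the following research paper abstract for older adults "
        "in three plain-language sentences:\n\n" + text
    )

def complex_pipeline(text):
    # (2) A more complex pipeline: a chain of prompts mirroring the steps
    #     a person might take when doing the task manually.
    key_points = call_llm("List the five most important findings in this paper:\n\n" + text)
    plain = call_llm("Rewrite these findings in plain language, avoiding jargon:\n\n" + key_points)
    return call_llm("Combine these points into a three-sentence summary for older adults:\n\n" + plain)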
Answer the following questions:

Now, let's evaluate the implementations by first creating ~5 samples of inputs to test your implementations with. (TIP: You can optionally use an LLM to create these samples.) Then, run each of your implementations on this sample set to generate output samples.
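A minimal sketch of this step, assuming the hypothetical basic_pipeline and complex_pipeline functions above and placeholder input samples:

# Placeholder inputs; replace with your own ~5 samples (or LLM-generated ones).
input_samples = [
    "Abstract of paper 1 ...",
    "Abstract of paper 2 ...",
    "Abstract of paper 3 ...",
]

# Run both implementations over the sample set to collect output samples.
outputs = {
    "basic": [basic_pipeline(s) for s in input_samples],
    "complex": [complex_pipeline(s) for s in input_samples],
}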
Answer the following questions:

Considering your findings and discussions in Step 2, design a new metric that can measure one aspect of task performance. For example, when generating summaries for research papers, one metric could focus on measuring the complexity and technical difficulty of terms used in the summary. Then, you will imagine how to potentially measure this metric and design a pseudo-metric using an LLM.
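For instance, a pseudo-metric for the term-complexity example above could itself be implemented with an LLM. The sketch below is hypothetical: the rubric, the 1-5 scale, and the call_llm helper are assumptions, not a required design.

def complexity_metric(summary):
    # LLM-as-judge pseudo-metric: ask the model to rate how technical the
    # summary's wording is, from 1 (plain everyday language) to 5 (heavy jargon).
    prompt = (
        "On a scale of 1 (plain everyday language) to 5 (highly technical jargon), "
        "rate the technical complexity of the terms used in the following summary. "
        "Answer with a single digit.\n\nSummary:\n" + summary
    )
    return int(call_llm(prompt).strip())

# Example: compare the average complexity of each pipeline's outputs.
# avg_basic = sum(complexity_metric(o) for o in outputs["basic"]) / len(outputs["basic"])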
Answer the following questions:

The following resources should be useful for you in getting a sense of different approaches that are used to evaluate natural language generation tasks.
You only need to submit a .pdf file for your report containing the answers to the questions in each step of the instructions. We highly encourage you to use resources such as code, figures, and samples to support your arguments in the discussion.