Background
For my HCI Master's Project, I investigated multimodal interaction in network visualization. Specifically, I studied how the individual modalities, speech and touch, lend themselves to network exploration tasks, as well as the effects of priming with one modality on how people use both modalities in a multimodal system.
I collaborated with a PhD student in the Vis Lab at Georgia Tech who had created a multimodal system for network visualization called Orko and had conducted preliminary evaluations of it.
ORKO - Original system developed by PhD student
Research Objective
We wanted to take a closer look at how these modalities are actually utilized, and came up with the objectives and research questions outlined below.
Objectives
- O1: Understand the individual modalities and their usage in the context of network data visualization, and explore the use of multimodal interaction.
- O2: Understand if and how priming with a single modality affects how a multimodal system is used.
Research Questions
- RQ1: How does natural language support network data exploration? What are its strengths and weaknesses? (O1)
- RQ2: How does direct manipulation support network data exploration? What are its strengths and weaknesses? (O1)
- RQ3: How do the two modalities work together? (O1)
- RQ4: Can the two modalities complement each other? How can we make combined use of their strengths? (O1, O2)
- RQ5: Are the modalities particularly good for specific categories of tasks, regardless of what users are familiar with? (O2)
Process
I made improvements to the existing multimodal system based on feedback from the previous evaluative user study.
To truly understand how natural language and touch support interaction individually, I derived two new systems that were functionally equivalent to the multimodal system but supported interaction only via speech and only via touch, respectively.
This let us investigate the individual modalities in depth and set us up to study how priming with one modality affects how a multimodal system is used.
Improved ORKO
Unimodal System - Using only Speech input
Unimodal System - Using only Touch input
Design of Experiment
I designed an experiment to answer the research questions above. I recruited a total of 18 participants and split them into three groups of six each; the group-to-condition assignment is outlined below.
- The first group interacted with two systems: the touch-only system followed by the improved multimodal system.
- The second group interacted with two systems: the speech-only system followed by the improved multimodal system.
- The third group interacted with just the improved multimodal system.
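For concreteness, here is a minimal sketch of that between-subjects assignment, assuming the participant numbering (P1-P18) follows the group order referenced later in the findings; the names and structure are illustrative, not the actual study tooling.

```python
# Illustrative sketch of the between-subjects design (not the actual study tooling).
# 18 participants in three groups of six; the first two groups used a unimodal
# system before the multimodal one, the third used only the multimodal system.
STUDY_DESIGN = {
    "group_1": {"participants": [f"P{i}" for i in range(1, 7)],
                "conditions": ["touch_only", "multimodal"]},
    "group_2": {"participants": [f"P{i}" for i in range(7, 13)],
                "conditions": ["speech_only", "multimodal"]},
    "group_3": {"participants": [f"P{i}" for i in range(13, 19)],
                "conditions": ["multimodal"]},
}
```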

Design of the study as it relates to the research questions
Design of Tasks
I designed a total of six tasks for each system, five of which were closed-ended and one of which was designed to encourage open exploration of the system. The tasks were also constructed to span a set of common network exploration operations (finding nodes, finding connections, finding paths, filtering nodes, visually encoding nodes, etc.). Participants could choose any of the operations supported by the system to complete a specific task.
To achieve our objectives of understanding the modalities and how they are utilized in the system, the tasks used in the evaluation needed to have the following characteristics:
- Be representative of real-world tasks
I chose an airport dataset with details about each airport and the flight connections between them. This let me construct realistic, meaningful tasks.
- Be equally achievable across all three systems
For the 12 participants who would explore a unimodal system prior to the multimodal system, I used two flavors of the dataset to reduce familiarity with the data: one with airports in the US and Canada, and one with airports in the Asia-Pacific region. Both datasets were of similar size and complexity.
I used two versions of the same core six tasks, each phrased for its dataset: the US-Canada dataset was used with the unimodal (speech-only or touch-only) systems and the Asia-Pacific dataset with the multimodal system (this mapping is sketched after the list).
- Not be phrased such that participants could simply read them back to the system
We relied on a "Jeopardy!"-style evaluation (originally proposed by Gao et al.) to phrase all the tasks, presenting them as scenarios or facts that participants were asked to prove or disprove by exploring the visualization.
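A minimal sketch of how the datasets and task versions map onto the conditions; the identifiers are hypothetical, and the real tasks were prose scenarios like the examples below.

```python
# Illustrative condition-to-dataset mapping (identifiers are hypothetical).
# Each task set contains the same six core tasks, five closed-ended and one
# open-ended, phrased for its dataset.
DATASETS = {
    "us_canada": "Airports and direct flights in the US and Canada",
    "asia_pacific": "Airports and direct flights in the Asia-Pacific region",
}

CONDITION_DATASET = {
    "touch_only": "us_canada",     # unimodal conditions use the US-Canada data
    "speech_only": "us_canada",
    "multimodal": "asia_pacific",  # multimodal condition uses the Asia-Pacific data
}

TASK_TYPES = ["closed"] * 5 + ["open"]  # six tasks per system
```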
Example Closed-ended Task
Let us call airports that have direct flights to 55 or more other airports “popular” airports. Visually prove that China has the largest number of “popular” airports. Now, assume that you are traveling from Sydney Kingsford Smith airport to Domodedovo through one of these “popular” airports. Yes or no: must you then be traveling through either Thailand or China?
Example Open-ended Task
Pick any two airports that have at least one direct international flight. Consider these two airports and the airports they have direct flights to. Now compare the two groups of airports with regard to:
- Accessibility
- Altitude ranges
- Variety of time zones
You may also list any additional observations you make based on interacting with the network.
Measures
- Success or failure to complete each task
- Counts of touch interactions / spoken utterances used to complete each task
- Observational data and notes from participants thinking aloud during the session
- Quantitative data from post-session questionnaires
- Qualitative data from post-session interviews
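These per-task measures could be captured in a record like the following sketch; the field names are hypothetical, not the actual logging format used in the study.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskMeasures:
    """Illustrative per-task record for the measures listed above
    (field names are hypothetical, not the study's actual logging format)."""
    participant: str    # e.g. "P7"
    condition: str      # "touch_only", "speech_only", or "multimodal"
    task_id: int        # 1-6
    success: bool       # whether the task was completed successfully
    touch_count: int    # number of touch interactions used
    speech_count: int   # number of spoken utterances used
    observer_notes: List[str] = field(default_factory=list)  # think-aloud observations
```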
Data Analysis
The sessions were video-recorded, and we performed closed coding of the recordings using the six network operations as our pre-established codes. A total of 945 interactions corresponding to the different network data operations were recorded across the 18 participants and the two study interfaces, as shown below.

Distribution of interactions used by the 1st group, in both the Unimodal (touch) and Multimodal systems.
U: Unimodal interface, M: Multimodal interface, S: Speech, T: Touch, ST: Multimodal interactions.
A ‘-’ indicates that a modality was not supported in a condition or that participants were not assigned to a condition.

Distribution of interactions used by the 2nd group, in both the Unimodal (speech) and Multimodal systems.
U: Unimodal interface, M: Multimodal interface, S: Speech, T: Touch, ST: Multimodal interactions.
A ‘-’ indicates that a modality was not supported in a condition or that participants were not assigned to a condition.

Distribution of interactions for the 3rd Group of participants using only Multimodal System.
S: Speech, T: Touch, ST: Multimodal interactions.
A ‘-’ indicates that a modality was not supported in a condition or that participants were not assigned to a condition.
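As a rough illustration of how these distributions were tallied, here is a minimal sketch that counts closed-coded interactions by operation, interface, and modality; the record format and example codes are hypothetical.

```python
from collections import Counter

# Each closed-coded interaction from the video recordings becomes one record:
# (participant, interface, operation_code, modality)
# interface: "U" (unimodal) or "M" (multimodal)
# modality:  "S" (speech), "T" (touch), or "ST" (multimodal interaction)
coded_interactions = [
    ("P1", "U", "find_nodes", "T"),
    ("P1", "M", "filter_nodes", "S"),
    ("P13", "M", "find_paths", "ST"),
    # ... 945 coded interactions in total across the 18 participants
]

# Tally interactions per (operation, interface, modality), mirroring the
# distribution tables above.
tally = Counter((operation, interface, modality)
                for _, interface, operation, modality in coded_interactions)

for (operation, interface, modality), count in sorted(tally.items()):
    print(f"{operation:18s} {interface}-{modality:2s} {count}")
```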
High-Level Findings
100% of participants preferred multimodal interaction to unimodal interaction
Our qualitative observations and feedback from the post-session debriefs suggested that participants preferred multimodal interaction for the following reasons.

The combination is certainly better. Voice is great when I was asking questions or finding something I couldn’t see. Touch let me directly interact.
- Complementary nature of modalities

I liked that I could correct with touch. Because it’s not always going to be perfect right. Like the smart assistant on the phone sometimes gets the wrong thing but doesn’t let me correct and just goes okay.

I used voice when I didn’t know how to do it with touch.
- Integrated interaction experience

It was somehow less complex even though more interactions were added.
Priming participants with one input modality did not impact how they interacted with the multimodal system
Our findings indicated that participants who had prior experience working with a unimodal system (P1-P12) interacted with the multimodal system comparably to participants who worked only with the multimodal system (P13-P18).
The single most important factor in deciding which modality was used was the operation being performed. For example, when interacting with the multimodal interface, participants P1-P6 switched to using only speech commands for some operations even though they had all previously performed those operations using touch. Participants confirmed this observation when we asked them about it at the end of the sessions.

Now that I think of it, not consciously but I did use speech to mostly narrow down to a subset and then touch to do more detailed tasks.
Participants expected the system to be more conversational and even “answer” questions
Given the availability of speech as an input modality, participants unsurprisingly expected the system to be more conversational and even “answer” questions. For instance, one participant said:

Working with the system for a while starts making you want to ask higher level questions and get specific answers or summaries as opposed to just the visualization.