Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning (2025)

Baoqi Pei1,2∗, Yifei Huang2,3∗, Jilan Xu2,4, Guo Chen5, Yuping He5, Lijin Yang3,
Yali Wang2,6, Weidi Xie2,7, Yu Qiao2, Fei Wu1, Limin Wang2,5
1Zhejiang University, 2Shanghai Artificial Intelligence Laboratory, 3The University of Tokyo,
4Fudan University, 5Nanjing University, 6SIAT, 7Shanghai Jiao Tong University
peibaoqi@gmail.com; hyf@iis.u-tokyo.ac.jp

Abstract

In egocentric video understanding, the motion of hands and objects, as well as their interactions, plays a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representations with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline that employs a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter that captures fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. Code and data are available at https://github.com/OpenRobotLab/EgoHOD/.

∗ Equal contribution. Yifei Huang is the corresponding author.

1 Introduction

Egocentric video understanding has recently garnered increasing attention due to its crucial role in areas such as augmented reality (Pan et al., 2023), embodied AI (Srivastava et al., 2022; Huang et al., 2024b), and personalized assistants (Huang et al., 2018). With the collection of large-scale egocentric video datasets (Damen et al., 2020; Grauman et al., 2022), researchers have begun to adopt video-language pretraining (Lin et al., 2022) based on these annotations to learn egocentric video representations. Since the original annotations tend to be highly template-driven and lack diversity, previous works explore using Large Language Models (LLMs) to rephrase the narrations (Zhao et al., 2023) or introducing new video-language pairs from exocentric datasets (Dou et al., 2024). This scheme has shown success in a wide range of downstream tasks (Plizzari et al., 2024).

However, as shown in the example on the right of Figure 1, the original annotations in egocentric video datasets are typically highly condensed, describing only overall actions such as “C draws on a book” or “C moves both hands”. Since no additional information is provided, previous works like LaViLa (Zhao et al., 2023) can only rephrase at the same level of abstraction as the original annotations, neglecting a crucial aspect of egocentric videos: the fine-grained dynamics of hands and objects. Most egocentric videos contain a large portion of hand-object interactions, which reflect the camera wearer's behavior and intentions. As we will show, integrating this information into vision-language pretraining significantly enhances egocentric video representation learning, resulting in state-of-the-art performance across various benchmarks.

Firstly, to incorporate hand-object dynamics into vision-language pretraining, it is essential to construct data that accurately captures the detailed motion of hands and objects in videos. A recent work directly uses the output of off-the-shelf hand-object detectors (Shan et al., 2020) as the ground truth of auxiliary targets during pretraining (Zhang et al., 2023). However, this approach only models the appearance of hands and objects without considering their dynamics. It also fails to learn the semantic connections between hand-object interactions and the original narration. To address this, we introduce HOD, a novel framework for generating descriptions with fine-grained Hand-Object Dynamics for a given video clip. We begin by using hand-object detectors to obtain bounding boxes of hands and the objects in contact with them. We then design prompts based on these bounding boxes that describe the trajectories of the hands and objects, as well as their contact states and positions. Finally, using the new prompts and the original annotations, we leverage a large language model (LLM) to generate semantically rich captions that encompass the motion states of hands and objects. By utilizing high-framerate inputs, we ensure that more detailed motions are captured.

Secondly, to efficiently and effectively exploit the fine-grained spatiotemporal information in HOD, we propose EgoVideo, a novel ViT-based model with a lightweight motion adapter. To cooperate with the HOD data, EgoVideo employs a dual-branch design and a co-training strategy. The backbone branch is trained normally to learn fundamental video-language alignment, while the adapter branch is trained with a higher framerate to capture detailed hand-object dynamics. The motion adapter has a separable convolution design, allowing information to be aggregated both temporally from adjacent frames and spatially from hands and objects at different locations. This design enables EgoVideo to model detailed hand-object dynamics while maintaining low computational costs, and also allows us to scale the model to 1B parameters to fully unlock its potential for comprehending egocentric videos.

We extensively evaluate EgoVideo across multiple pretraining data sources and various egocentric downstream tasks. Experimental results show that our model sets a new state of the art on 9 tasks, as partially shown in Figure 1. Notably, our model also achieves the best performance under the same model size in both zero-shot and fine-tuning settings. Further experiments demonstrate that our HOD data is also beneficial for robot manipulation tasks.

[Figure 1]

Our main contributions are as follows: (1) We develop a HOD data pipeline to generate captions that describe fine-grained hand-object dynamics, which are crucial for egocentric video understanding; (2) We propose EgoVideo, a dual-branch model with a novel lightweight motion adapter and a co-training strategy to leverage the HOD data efficiently and effectively; (3) We demonstrate state-of-the-art performance on 12 downstream tasks, and our approach generalizes well to robot manipulation tasks. All code and data will be made publicly available.

2 Related Work

Egocentric Video Understanding is receiving increasing research attention. Previous works focus on diverse tasks such as action recognition (Plizzari et al., 2022; Huang et al., 2020a), action anticipation (Girdhar & Grauman, 2021), and cross-view understanding (Xue et al., 2022; Huang et al., 2024a; Luo et al., 2024). Recent methods begin to work on egocentric representation learning (Lin et al., 2022; Pei et al., 2024) using the large-scale data from Ego4D (Grauman et al., 2022), or refining the Ego4D narrations by LLM rephrasing (Zhao & Krähenbühl, 2023). A recent work also searches for additional data from exocentric datasets to improve the pretraining (Dou et al., 2024). However, since the Ego4D narrations are highly abstract, these methods fail to learn one critical aspect of egocentric videos: fine-grained hand-object dynamics. Recently, Helping Hands (Zhang et al., 2023) utilizes hand and object coordinates as auxiliary targets during pretraining. However, it only focuses on the spatial information of hands and objects, neglecting their motion dynamics. Additionally, the provided supervision does not integrate the states of hands and objects with the video descriptions, limiting the model's ability to comprehend fine-grained details.

Unlike previous works, we propose the first method to integrate hand-object dynamics into egocentric representation learning. On the data side, we propose the HOD (Hand-Object Dynamics) pipeline, which generates high-quality video-language pairs. The language in these pairs explicitly represents the complex states and motions of hands and objects in the videos, enabling the model to learn detailed information about these dynamics. On the model side, we introduce EgoVideo, a model equipped with a lightweight motion adapter. This adapter is designed to effectively capture the intricate hand and object dynamics provided by the HOD data, enhancing the model's ability to understand and interpret fine-grained dynamics in egocentric videos.

Video-Language Representation Learning has also attracted researchers after the success of CLIP (Radford et al., 2021), due to the need for robust video representations. Several large-scale video-language datasets (Kay et al., 2017; Miech et al., 2019; Caba Heilbron et al., 2015) further fueled research in this area. However, generating high-quality video-text pairs remains challenging, prompting researchers to develop innovative solutions. LaViLa (Zhao et al., 2023) leverages Large Language Models (LLMs) to generate dense narrations for videos. Video ReCap (Islam et al., 2024) utilizes a curriculum learning scheme to generate summaries for long videos. EMBED (Dou et al., 2024) and EgoInstructor (Xu et al., 2024) use rules or retrieval models to add additional training data. However, these methods can only pretrain their models at the same abstraction level as the original annotations. In contrast, our approach integrates finer-level details into the representation learning process.

Hand-Object Interaction Understanding has long been a key research topic within the field of egocentric vision. In recent years, several works have made significant strides in estimating 3D hand joints (Brahmbhatt et al., 2020; Cai et al., 2018; Yang & Yao, 2019; Yuan et al., 2018; Ohkawa et al., 2023) and reconstructing hand-object shapes (Cao et al., 2021; Doosti et al., 2020; Hasson et al., 2019; 2020; Liu et al., 2021). EgoHOS (Zhang et al., 2022) provides a dataset with fine-grained per-pixel labels of hands and objects and a reliable foundational tool for 2D hand-object segmentation, while 100DOH (Shan et al., 2020) introduces a large-scale video dataset of hands and hand-object interactions, providing a rich resource for training hand-object detectors. In our work, we utilize existing hand and object detectors in our HOD pipeline to convert information related to hand/object motion and contact details into natural language descriptions. By integrating these detailed descriptions with our EgoVideo model, we incorporate this finer level of detail into the video representation learning process.

3 Method

3.1 Data Generation Pipeline: HOD

The fine-grained dynamics of hands and objects play a pivotal role in egocentric video understanding (Fathi et al., 2011a). To effectively integrate this information into the video-language pretraining process, we propose HOD, a novel data generation pipeline that transforms hand-object dynamics into natural language. An overview of HOD is illustrated in Figure 2 (top). First, we utilize an off-the-shelf hand-object detector (Shan et al., 2020) to generate bounding boxes for hands and objects in each frame of the video clips. Next, we employ a large language model (AI et al., 2024) to enrich the original video captions. The model is prompted to generate new narrations that integrate the original captions with hand-object dynamics information, enhancing the semantic richness of the annotations. Below, we describe the details of the HOD data generation process.

3.1.1 Data Selection

Before going into the generation process, it is essential to select appropriate source data. The basic data component comes from the 4M subset (Lin et al., 2022) of Ego4D (Grauman et al., 2022), which has been proven useful for egocentric video-language pretraining (Pramanick et al., 2023). Additionally, we curate data from the large-scale HowTo100M dataset (Miech et al., 2019) since it contains rich hand-object interactions. We specifically choose How2-Interlink7M (Wang et al., 2024a), which contains 7M clips with high-quality GPT-4 (OpenAI, 2023) refined captions. Since the videos come from diverse sources and may include portions that impede egocentric representation learning, we employ a filtering technique to retain only clips with an egocentric style. We train a style classifier $\mathcal{P}$ by manually annotating 10,000 clips as “ego-like” or “non-ego-like”. With this classifier, we obtain an additional 3.4M egocentric-style clips. More details can be found in Appendix A.
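
A minimal sketch of this filtering step is shown below, assuming a binary ego-style classifier already trained on the 10,000 manually annotated clips; the names `ego_style_classifier`, `encode_clip`, and the score threshold are illustrative rather than the paper's actual implementation.

```python
# Hypothetical sketch of ego-style filtering (interface and threshold are illustrative).
import torch

@torch.no_grad()
def filter_ego_like(clips, ego_style_classifier, encode_clip, score_threshold=0.5):
    """Keep only clips that the style classifier P judges as 'ego-like'."""
    kept = []
    for clip in clips:
        feat = encode_clip(clip)                               # visual feature of the clip
        prob_ego = torch.sigmoid(ego_style_classifier(feat))   # P(ego-like | clip)
        if prob_ego.item() >= score_threshold:
            kept.append(clip)
    return kept
```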

[Figure 2]

3.1.2 Generating Captions with Hand Object Dynamics

In this section, we introduce our HOD framework. Existing methods for refining descriptions in video-language pretraining (Zhao et al., 2023; Dou et al., 2024) focus on high-level abstractions but overlook the fine-grained details of hand-object dynamics. This oversight is detrimental to egocentric representation learning, since hand-object interactions form a considerable proportion of egocentric videos by nature. To address this gap, our HOD framework first detects the positions of hands and objects using a hand-object detector. With this information, we prompt a large language model to augment the original annotation with detailed descriptions of hand and object movements. Through the subsequent video-language pretraining, our EgoVideo model can then understand videos at a finer-grained level.

Hand-Object Dynamics Detector. Thanks to rapid advances in the field of hand-object interaction (Jiang et al., 2021; Ohkawa et al., 2023), off-the-shelf hand-object detectors can provide robust hand and object positions. In our framework, we employ 100DOH (Shan et al., 2020) as the detector $\Phi_{\text{det}}$ for bounding box extraction.

For a video clip $x=(x_{1},x_{2},\dots,x_{T})$, we uniformly sample $n=16$ frames within the clip to obtain fine-grained motion information. Then we use $\Phi_{\text{det}}$ to acquire the bounding boxes of hands and objects in these frames, which can be represented as

\[
LH_{i},\; RH_{i},\; LO_{i},\; RO_{i} = \Phi_{\text{det}}(x_{i}) \tag{1}
\]

where $LH_{i}, RH_{i}, LO_{i}, RO_{i}$ denote the bounding boxes of the left hand, the right hand, the object in contact with the left hand, and the object in contact with the right hand in the $i$-th frame. We use linear interpolation to compensate for a missing hand box in frame $x_{t}$ if the corresponding hand boxes can be detected in both frames $x_{t-1}$ and $x_{t+1}$.
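
The interpolation step can be sketched as follows; the (x1, y1, x2, y2) box format and the single-gap assumption are our reading of the description above, not the released code.

```python
# Minimal sketch of the missing-box interpolation described above.
def interpolate_missing_boxes(boxes):
    """boxes: list of per-frame boxes (tuples of 4 floats) or None when undetected.
    Fill a missing frame t if frames t-1 and t+1 both have detections."""
    filled = list(boxes)
    for t in range(1, len(boxes) - 1):
        if boxes[t] is None and boxes[t - 1] is not None and boxes[t + 1] is not None:
            filled[t] = tuple(
                0.5 * (a + b) for a, b in zip(boxes[t - 1], boxes[t + 1])
            )
    return filled

# Example: the left-hand box is missing in the middle frame.
left_hand = [(10, 20, 50, 60), None, (14, 24, 54, 64)]
print(interpolate_missing_boxes(left_hand))
# [(10, 20, 50, 60), (12.0, 22.0, 52.0, 62.0), (14, 24, 54, 64)]
```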

Hand-Object Dynamics Rephraser. Current pretraining methods only use high-level language descriptions (e.g., “C takes the scissors” in Figure 2), which lack important egocentric details such as hand-object interaction. In this work, we incorporate these details into the video-language pretraining process. Hand-object dynamics encompass a variety of information, including the bounding boxes of hands and objects, hand and object movement directions and trajectories, as well as their contact conditions. To integrate all this information into the video-language pretraining process, we use an LLM as a rephraser to express these dynamics in natural language.

Specifically, we employ Yi-34B (AI et al., 2024) as our LLM. To capture the nuances of hand and object movements, we extract the center points of the bounding boxes to derive trajectories for hands and objects. This process yields six essential categories of information: spatio-temporal data for 1) the left hand, 2) the right hand, 3) objects contacted by the left hand, 4) objects contacted by the right hand, 5) objects contacted by both hands, and 6) the original narration. We then prompt the LLM to amalgamate this detailed information, enabling the generation of rich narratives that intricately describe hand-object dynamics. Further details on prompting can be found in Appendix A.
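
To illustrate how box trajectories might be turned into the textual motion cues that feed the rephrasing prompt, here is a hypothetical sketch; the direction wording and the shift threshold are illustrative choices and not the paper's exact prompt format.

```python
# Hypothetical trajectory-to-text sketch (image coordinates: y grows downward).
def box_center(box):
    x1, y1, x2, y2 = box
    return (0.5 * (x1 + x2), 0.5 * (y1 + y2))

def describe_trajectory(name, boxes, min_shift=5.0):
    centers = [box_center(b) for b in boxes if b is not None]
    if len(centers) < 2:
        return f"{name}: not visible"
    dx = centers[-1][0] - centers[0][0]
    dy = centers[-1][1] - centers[0][1]
    horiz = "right" if dx > min_shift else "left" if dx < -min_shift else ""
    vert = "down" if dy > min_shift else "up" if dy < -min_shift else ""
    motion = " and ".join(w for w in (horiz, vert) if w) or "roughly static"
    return f"{name} moves {motion} (center {centers[0]} -> {centers[-1]})"

right_hand = [(100, 200, 150, 260), (120, 190, 170, 250), (150, 170, 200, 230)]
print(describe_trajectory("right hand", right_hand))
# right hand moves right and up (center (125.0, 230.0) -> (175.0, 200.0))
```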

Analysis of HOD Data. We conduct additional analyses of our HOD data to evaluate its quality. First, we identify the top 30 most frequent words in HOD captions and in the original EgoClip narrations and plot their normalized frequencies in Figure 3. The EgoClip narrations exhibit a more pronounced long-tail distribution, while our HOD captions display a more balanced distribution. Notably, HOD captions include many “dynamic” words, such as “up” and “downwards,” which aligns with the rationale behind our data generation process. To further verify the quality of HOD, we employ GPT-4o (OpenAI, 2023) for quality assessment. We randomly select 1,000 clips and let GPT score each clip's caption on a scale from 0 to 10. To ensure GPT does not simply assign high scores based on caption length, we also perform random gerund replacements on our data for comparison. The results, summarized in Table 1, show that our HOD data have a significantly better GPT score. Additional details on the scoring process and evaluations using other metrics are provided in Appendix A.

3.2 Egocentric Representation Learning Model: EgoVideo

The narrations generated by our HOD pipeline are highly detailed. As a result, the previous pretraining scheme struggles to capture the corresponding visual information at this level of detail. In response, we introduce EgoVideo (Figure 2, bottom), a model comprising a backbone and a motion adapter. The motion adapter aids in learning fine-grained hand-object dynamics from densely sampled video frames. Combined with a co-training strategy, our EgoVideo model obtains richer video representations while maintaining computational efficiency.

Visual and Text Encoder. Following the standard video-language pretraining setting (Lin et al., 2022), our model includes a visual encoder $\mathcal{F}_{v}$ (including our motion adapter) and a text encoder $\mathcal{F}_{t}$. In the visual encoder, for a clip $x\in\mathbb{R}^{T\times H\times W\times 3}$, we concatenate the image tokens of the $T$ frames with a learnable class token. The output of the visual encoder is $\mathbf{E_{v}}\in\mathbb{R}^{D}$. For the text encoder, we employ a 12-layer GPT-like Transformer (Radford et al., 2019) that takes as input tokens produced by BPE tokenization (Sennrich, 2015). The output of the text encoder is $\mathbf{E_{t}}\in\mathbb{R}^{D}$.

Motion Adapter. Intuitively, to encode visual representations at the same level of detail as the language, it is essential to utilize a greater number of frames as input. Since increasing the number of input frames during training would result in unacceptable computational overhead, inspired by PEFT techniques in LLMs (Ding et al., 2023), we propose a lightweight motion adapter. The motion adapter is injected between the layers of the visual backbone and is tailored to learn finer-grained details at a high framerate. Since hand and object motion forms a spatiotemporal pattern, unlike previous methods (Pan et al., 2022; Xing et al., 2024) that only focus on learning temporal information, our module is designed to learn both spatial and temporal information.

Our motion adapter is attached to the top of each of the $N$ transformer layers. Without loss of generality, we illustrate the motion adapter for one transformer layer in Figure 4. Denote by $\mathbf{Y}\in\mathbb{R}^{L\times D}$ the output of a transformer layer in $\mathcal{F}_{v}$, where $L$ is the number of tokens. We first forward $\mathbf{Y}$ to a down-projection layer ${\rm W}_{\text{down}}$ with ratio $\gamma$, followed by a GELU activation function $\sigma$. Then, we use a 2D convolution layer Conv2D with kernel size $(k,k)$ to aggregate spatial information from each frame, followed by a 1D temporal convolution layer TConv1D and a linear layer ${\rm W}_{\text{m}}$ to model the dynamics between adjacent frames. Finally, an up-projection layer ${\rm W}_{\text{up}}$ is used to restore the dimension. Formally, the structure can be described as:

\[
\mathbf{Y^{\prime}}=\sigma(\mathbf{Y}{\rm W}_{\text{down}}),\qquad
\mathbf{Y_{s}}={\rm ReLU}({\rm BN}({\rm Conv2D}(\mathbf{Y^{\prime}}))), \tag{2}
\]
\[
\mathbf{Y_{st}}=({\rm TConv1D}(\mathbf{Y_{s}}))\,{\rm W}_{\text{m}},\qquad
{\rm MotionAdapter}(\mathbf{Y})=\mathbf{Y}+\mathbf{Y_{st}}\,{\rm W}_{\text{up}},
\]

where ${\rm W}_{\text{down}}\in\mathbb{R}^{D\times\gamma D}$, ${\rm W}_{\text{m}}\in\mathbb{R}^{\gamma D\times\gamma D}$, and ${\rm W}_{\text{up}}\in\mathbb{R}^{\gamma D\times D}$. BN denotes BatchNorm2D.
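
To make the adapter structure concrete, below is a minimal PyTorch sketch of Eq. (2), under the assumption that the class token has been split off and the patch tokens are reshaped to (batch, frames, height, width, channels); the kernel sizes, normalization placement, and tensor layout reflect our reading of the equations rather than the released implementation.

```python
# A sketch of the motion adapter in Eq. (2); layout assumptions are noted above.
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    def __init__(self, dim, ratio=0.5, k=3):
        super().__init__()
        hidden = int(dim * ratio)
        self.down = nn.Linear(dim, hidden)                          # W_down
        self.act = nn.GELU()                                        # sigma
        self.conv2d = nn.Conv2d(hidden, hidden, k, padding=k // 2)  # spatial Conv2D
        self.bn = nn.BatchNorm2d(hidden)                            # BN
        self.tconv1d = nn.Conv1d(hidden, hidden, 3, padding=1)      # temporal TConv1D
        self.wm = nn.Linear(hidden, hidden)                         # W_m
        self.up = nn.Linear(hidden, dim)                            # W_up

    def forward(self, y):                                           # y: (B, T, H, W, D)
        B, T, H, W, D = y.shape
        y_ = self.act(self.down(y))                                  # (B, T, H, W, h)
        s = y_.permute(0, 1, 4, 2, 3).reshape(B * T, -1, H, W)       # per-frame feature maps
        s = torch.relu(self.bn(self.conv2d(s)))                      # Y_s
        s = s.reshape(B, T, -1, H, W).permute(0, 3, 4, 2, 1)         # (B, H, W, h, T)
        st = self.tconv1d(s.reshape(B * H * W, -1, T))               # temporal mixing
        st = st.reshape(B, H, W, -1, T).permute(0, 4, 1, 2, 3)       # (B, T, H, W, h)
        st = self.wm(st)                                             # Y_st
        return y + self.up(st)                                       # residual output
```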

[Figure 4]

Co-training Strategy. In EgoVideo, the motion adapter receives input at a higher framerate to capture fine-grained information. In addition, the backbone must be trained to fully adapt to the egocentric domain. Thus, unlike previous PEFT methods that freeze the backbone and only train the adapter, we train both the backbone and the adapter parameters. Motivated by the SlowFast architecture (Feichtenhofer et al., 2019), we employ a co-training strategy to train the backbone and the motion adapter jointly.

Specifically, we use an upsampling parameter $\lambda$ to sample the input at two sampling rates. For the input $x_{l}\in\mathbb{R}^{T\times H\times W\times C}$ with a low sampling rate, we pass it through the backbone with all parameters unfrozen, obtaining the output $\mathbf{E_{vl}}\in\mathbb{R}^{D}$. For the input with a higher sampling rate, $x_{h}\in\mathbb{R}^{\lambda T\times H\times W\times C}$, we pass it through both the backbone and the adapter to get the output $\mathbf{E_{vh}}\in\mathbb{R}^{D}$, during which we freeze the backbone parameters and only train the adapter. Finally, we concatenate the outputs of the two pathways and pass them through a fully connected layer to obtain the final output $\mathbf{E_{v}}\in\mathbb{R}^{D}$:

\[
\mathbf{E_{vl}}=\mathcal{F}_{backbone}(x_{l}),\qquad
\mathbf{E_{vh}}=\mathcal{F}_{v}(x_{h}), \tag{3}
\]
\[
\mathbf{E_{v}}=[\mathbf{E_{vl}};\mathbf{E_{vh}}]\,{\rm W}_{\text{o}},
\]

where “[;]” denotes concatenation, $\mathcal{F}_{backbone}$ denotes the visual backbone, $\mathcal{F}_{v}$ denotes $\mathcal{F}_{backbone}$ with the motion adapter, and ${\rm W}_{\text{o}}\in\mathbb{R}^{2D\times D}$. With this strategy, we integrate the training of the backbone and the adapter into a single stage, reducing the cost of data and computation.
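
A schematic sketch of the co-training forward pass around Eq. (3) is given below; the `use_adapter` / `freeze_backbone` keyword interface on the visual encoder is hypothetical, standing in for however the real model routes the two pathways and blocks backbone gradients on the high-framerate branch.

```python
# Schematic co-training forward pass (encoder interface is hypothetical).
import torch
import torch.nn as nn

class CoTrainHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_o = nn.Linear(2 * dim, dim)   # W_o fuses the two pathway embeddings

    def forward(self, visual_encoder, x_low, x_high):
        # Low-framerate pathway: backbone only, all parameters trainable.
        e_vl = visual_encoder(x_low, use_adapter=False, freeze_backbone=False)
        # High-framerate pathway: backbone + motion adapter, backbone frozen so
        # that only the adapter receives gradients from this pathway.
        e_vh = visual_encoder(x_high, use_adapter=True, freeze_backbone=True)
        return self.w_o(torch.cat([e_vl, e_vh], dim=-1))   # E_v = [E_vl; E_vh] W_o
```

In practice, `x_low` and `x_high` would be sampled from the same clip with $T$ and $\lambda T$ frames, respectively.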

Vision-Text Alignment. We adopt the standard InfoNCE loss (Oord et al., 2018) as the objective for aligning the visual embedding $\mathbf{E_{v}}$ and the text embedding $\mathbf{E_{t}}$. For a sampled batch $\mathcal{B}$, we have:

\[
\mathcal{L}=-\frac{1}{|\mathcal{B}|}\sum_{(\mathbf{E_{v}^{i}},\mathbf{E_{t}^{i}})\in\mathcal{B}}
\left(\log\frac{e^{\text{s}(\mathbf{E_{v}^{i}},\mathbf{E_{t}^{i}})/\tau}}{\sum_{\mathbf{E_{t}^{j}}\in\mathcal{B}}e^{\text{s}(\mathbf{E_{v}^{i}},\mathbf{E_{t}^{j}})/\tau}}
+\log\frac{e^{\text{s}(\mathbf{E_{v}^{i}},\mathbf{E_{t}^{i}})/\tau}}{\sum_{\mathbf{E_{v}^{k}}\in\mathcal{B}}e^{\text{s}(\mathbf{E_{v}^{k}},\mathbf{E_{t}^{i}})/\tau}}\right), \tag{4}
\]

where $\text{s}(\mathbf{E_{v}^{i}},\mathbf{E_{t}^{i}})$ denotes the dot product between the $i$-th visual and text embeddings in the batch, and $\tau$ is a temperature parameter that scales the similarity scores.
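
A compact sketch of the symmetric InfoNCE objective in Eq. (4), assuming the embeddings are already L2-normalized so that the dot product acts as a cosine similarity:

```python
# Symmetric InfoNCE over a batch of matched video-text pairs.
import torch
import torch.nn.functional as F

def info_nce(e_v, e_t, tau=0.07):
    """e_v, e_t: (B, D) visual and text embeddings for matched pairs."""
    sim = e_v @ e_t.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(e_v.size(0), device=e_v.device)
    loss_v2t = F.cross_entropy(sim, targets)      # video-to-text term
    loss_t2v = F.cross_entropy(sim.t(), targets)  # text-to-video term
    return loss_v2t + loss_t2v

# Usage: loss = info_nce(F.normalize(E_v, dim=-1), F.normalize(E_t, dim=-1))
```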

4 Experiments

4.1 Datasets and Evaluation Protocols

Pretraining Dataset. As stated in the previous section, our pretraining data comes from Ego4D (Grauman et al., 2022) and How2-Interlink7M (Wang et al., 2024a). After processing by our HOD pipeline, the total amount of data is 7.4M clips.

Evaluation Protocols. We follow previous works (Zhao et al., 2023; Pramanick et al., 2023) and use the following evaluation protocols. (1) Zero-shot (ZS): the pretrained video-text encoders are directly applied to the downstream datasets to perform video-text retrieval without any additional tuning. For classification, we compute the similarity score between the video clip and the textual descriptions of all possible classes. (2) Fine-tuned (FT): the pretrained video-text model is fine-tuned end-to-end on the training split of the target downstream dataset. (3) Feature-based: we extract video features using a frozen encoder and only train a task-specific head on the downstream dataset.
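
The zero-shot classification protocol can be sketched as below; the `encode_video` and `encode_text` call signatures are placeholders for the actual EgoVideo encoder interfaces.

```python
# Zero-shot classification: score a clip against the text embedding of every class
# name and predict the most similar one.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(video_clip, class_names, encode_video, encode_text):
    e_v = F.normalize(encode_video(video_clip), dim=-1)                             # (D,)
    e_t = F.normalize(torch.stack([encode_text(c) for c in class_names]), dim=-1)   # (C, D)
    scores = e_t @ e_v                                                              # similarity per class
    return class_names[scores.argmax().item()]
```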

Model Architecture and Hyperparameters. Our vision-language model follows the initialization of CLIP (Radford et al., 2021) and is composed of a vision encoder and a text encoder. For the base and large models, ViT is used as the vision encoder, with a randomly initialized temporal position embedding to learn temporal information. For our giant-size model, we use InternVideo2 (Wang et al., 2024b). For hyperparameters, we use $T=4$ and $\lambda=4$ for the frame inputs, and a downsample ratio $\gamma=0.5$ for the motion adapter. During pretraining, we freeze the temperature parameter at $\tau=0.07$. More details are given in Appendix C.

Downstream Tasks. We evaluate models on several egocentric downstream tasks: (1) Epic-Kitchens-100 (Damen et al., 2020) (EK-100): we evaluate our method on multi-instance retrieval (EK-100 MIR) and action recognition (EK-100 CLS); (2) Ego4D (Grauman et al., 2022): we evaluate our model on multiple-choice questions (EgoMCQ) (Li et al., 2021), natural language queries (EgoNLQ), and moment queries (EgoMQ); (3) EGTEA (Li et al., 2018): we evaluate our model on action recognition focused on fine-grained cooking activities and hand-object interaction; (4) Other tasks: we also evaluate our model on the GTEA (Fathi et al., 2011b) and HOI4D (Liu et al., 2022) datasets for action segmentation. Meanwhile, to show the generalization ability of our learned video representation, we evaluate the task success rate on the Franka Kitchen benchmark (Gupta et al., 2019), a simulation environment for embodied AI.

4.2 Comparison to State-Of-The-Art

Table 2: Zero-shot comparison on EK-100 MIR, EK-100 CLS, EGTEA, and EgoMCQ.

| Method (ZS) | Data Size | EK-100 MIR mAP | EK-100 MIR nDCG | EK-100 CLS Top1-acc. | EK-100 CLS Top5-acc. | EGTEA Mean-acc. | EGTEA Top1-acc. | EgoMCQ Intra | EgoMCQ Inter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EgoVLPv2 | 4M | 26.7 | 29.1 | - | - | - | - | 60.9 | 91.0 |
| LaViLa-B | 35M | 30.9 | 32.0 | 16.4 | 34.4 | 28.9 | 35.4 | 59.9 | 93.8 |
| AVION-B | 35M | 32.9 | 32.7 | - | - | - | - | - | - |
| EMBED-B | 38.3M | 36.0 | 34.9 | 19.0 | 39.0 | 37.0 | 42.7 | 61.3 | 94.5 |
| EgoVideo-B | 7.4M | 36.5 | 34.5 | 22.4 | 43.3 | 43.6 | 51.0 | 64.6 | 95.0 |
| LaViLa-L | 35M | 36.1 | 34.6 | 20.8 | 41.4 | 34.1 | 40.1 | 63.1 | 94.5 |
| AVION-L | 35M | 37.6 | 35.3 | - | - | - | - | - | - |
| Helping Hands | 4M | 37.5 | 37.8 | - | - | 39.1 | 46.6 | 63.0 | 94.5 |
| EMBED-L | 38.3M | 40.8 | 37.5 | 22.8 | 45.0 | 40.3 | 46.7 | 64.7 | 95.6 |
| EgoVideo-L | 7.4M | 41.8 | 37.0 | 24.0 | 46.8 | 47.1 | 51.7 | 65.5 | 95.9 |
| EgoVideo-G | 7.4M | 47.1 | 39.0 | 28.5 | 54.3 | 58.0 | 63.0 | 69.1 | 96.6 |

Zero-shot Evaluation. Table 2 shows the results on 4 tasks in the zero-shot setting. We compare our method against previous egocentric representation learning methods, including EgoVLPv2 (Pramanick et al., 2023), LaViLa (Zhao et al., 2023), AVION (Zhao & Krähenbühl, 2023), Helping Hands (Zhang et al., 2023), and EMBED (Dou et al., 2024). Notably, despite their use of refined captions and significantly larger training datasets, LaViLa, AVION, and EMBED fail to achieve results comparable to our EgoVideo. In the following experiments, we demonstrate that both our high-quality HOD data and the design of the EgoVideo model play important roles in achieving good performance. Helping Hands uses a stronger TimeSformer backbone (Bertasius et al., 2021) and adds additional decoders for auxiliary object-oriented tasks. However, our method still outperforms Helping Hands, demonstrating the superiority of our representation learning scheme.

Specifically, in the EK-100 MIR task, our EgoVideo outperforms EMBED by 0.5%, 1.0%, and 6.3% in mAP at the base, large, and giant scales, respectively, and significantly outperforms LaViLa at the same model size. In the EK-100 CLS task, our EgoVideo-B model demonstrates superior performance with a top-1 accuracy of 22.4% and a top-5 accuracy of 43.3%, significantly outperforming LaViLa-B and EMBED-B.

On the EGTEA dataset, known for its focus on hand-object interactions, our EgoVideo-B achieves a mean accuracy of 43.6% and a top-1 accuracy of 51.0%, surpassing EMBED-B and even EMBED-L. This underscores the importance of learning hand-object dynamics and shows the strong generalization capability of our model. The EgoMCQ task further highlights the efficacy of our method, with EgoVideo-B outperforming LaViLa-B by 4.7% and 1.2%, and EMBED-B by 3.3% and 0.5%, on intra- and inter-video accuracy, respectively. Our EgoVideo-L model also shows significant improvements, with an intra-video accuracy of 65.5% and an inter-video accuracy of 95.9%. These results demonstrate the superior performance and generalization capability of our method without any additional supervision. We take a step further to explore scaling behavior in egocentric representation learning, finding that EgoVideo-G elevates performance to the next level.

Fine-tuning Evaluation. Table 3 shows the results of the fine-tuning evaluation. Our EgoVideo outperforms previous approaches across all tasks and datasets. EgoVideo-B demonstrates significant performance enhancements compared to EgoVLPv2-B, with improvements of 5.4% and 2.5% in mAP on the EK-100 MIR and EgoMQ tasks, respectively. This performance is even comparable to the larger LaViLa-L. For EgoVideo-L, we observe consistent improvements across all tasks, including substantial gains of 1.8% and 3.1% on the EK-100 CLS and EGTEA action recognition tasks, highlighting the superior performance of our model in fine-grained action understanding. Moreover, we achieve improvements of 0.4% in R1@0.5 on EgoNLQ and of 2.7% and 2.0% in R1@0.5 and mAP on EgoMQ, confirming the richness of the representations learned by our model and their capacity to capture intricate hand-object interaction information.

Table 3: Fine-tuned comparison on EK-100 MIR, EK-100 CLS, EGTEA, EgoNLQ, and EgoMQ.

| Method (FT) | EK-100 MIR mAP | EK-100 MIR nDCG | EK-100 CLS Top1-acc. | EGTEA Top1-acc. | EgoNLQ R1@0.5 | EgoMQ R1@0.5 | EgoMQ mAP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EgoVLPv2-B | 47.3 | 61.9 | - | - | 7.9 | 31.1 | 12.2 |
| EgoVideo-B | 52.7 | 65.3 | 49.8 | 74.6 | 8.1 | 34.7 | 14.7 |
| Helping Hands-L | - | - | - | - | 7.9 | 33.4 | 16.0 |
| LaViLa-L | 50.9 | 66.5 | 51.0 | 76.0 | 7.3 | 32.5 | 13.4 |
| EMBED-L | 56.0 | 67.9 | 51.9 | 76.1 | 8.5 | 33.9 | 15.1 |
| EgoVideo-L | 56.6 | 69.0 | 53.7 | 79.2 | 8.9 | 36.6 | 17.1 |
| EgoVideo-G | 60.3 | 70.0 | 56.0 | 80.0 | 10.0 | 38.7 | 19.6 |

4.3 Ablation Studies

Pretraining Data. We first conduct experiments by fixing the models and varying the pretraining data. Here we compare against AVION for a fair comparison, since both AVION and EgoVideo use ViT as the backbone. As shown in Table 4, both EgoVideo and AVION achieve their best performance when the combination of Ego4D-HOD and How2-HOD data is used, and EgoVideo consistently outperforms AVION when trained on the same data, emphasizing the effectiveness of our model design. Comparing models trained with EgoClip and with Ego4D-HOD (rows 1, 2 and 5, 6), significant improvements can be observed on the EK-100 MIR and EGTEA tasks. Adding additional data from How2-HOD improves both models substantially (rows 1, 3 and 5, 7). Furthermore, when using only Ego4D-HOD, the performance on EGTEA surpasses that of EgoClip and How2-HOD combined, indicating the beneficial impact of our data on fine-grained dynamics understanding.

Table 4: Ablation on pretraining data. ✓ marks the data used for pretraining (assignment inferred from the row comparisons discussed in the text).

| ID | Model | EgoClip | Ego4D-HOD | How2-HOD | EK-100 MIR mAP | EK-100 MIR nDCG | EGTEA Mean-acc. | EGTEA Top1-acc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | AVION-B | ✓ | | | 27.3 | 29.3 | 26.2 | 30.5 |
| 2 | AVION-B | | ✓ | | 31.0 (+3.7) | 31.3 (+2.0) | 32.3 (+6.1) | 37.0 (+6.5) |
| 3 | AVION-B | ✓ | | ✓ | 33.2 (+5.9) | 32.5 (+3.2) | 31.6 (+5.4) | 35.6 (+5.1) |
| 4 | AVION-B | | ✓ | ✓ | 34.4 (+7.1) | 33.7 (+4.4) | 39.4 (+13.2) | 46.4 (+15.9) |
| 5 | EgoVideo-B | ✓ | | | 31.1 | 32.0 | 30.8 | 36.0 |
| 6 | EgoVideo-B | | ✓ | | 34.4 (+3.3) | 33.9 (+1.9) | 41.1 (+10.3) | 47.9 (+11.9) |
| 7 | EgoVideo-B | ✓ | | ✓ | 35.5 (+4.4) | 34.1 (+2.1) | 40.8 (+10.0) | 47.1 (+11.1) |
| 8 | EgoVideo-B | | ✓ | ✓ | 36.5 (+5.4) | 34.5 (+2.5) | 43.6 (+12.8) | 51.0 (+15.0) |
Table 5: Model size comparison (zero-shot EK-100 MIR mAP).

| Method | Backbone | Params | EK-100 mAP |
| --- | --- | --- | --- |
| LaViLa-B | TSF-B | 121M | 30.9 |
| AVION-B | ViT-B | 86M | 32.9 |
| EMBED-B | TSF-B | 121M | 36.0 |
| EgoVideo-B | ViT-B | 112M | 36.5 |
| LaViLa-L | TSF-L | 438M | 36.1 |
| AVION-L | ViT-L | 307M | 37.6 |
| EMBED-L | TSF-L | 438M | 40.8 |
| EgoVideo-L | ViT-L | 375M | 41.8 |
| EgoVideo-G | ViT-G | 1050M | 47.1 |

Table 6: Inference computational cost comparison.

| Method | Views | GFLOPs | Extra GFLOPs |
| --- | --- | --- | --- |
| ViT-B | 4×1×3 | 201 | - |
| ViT-B | 16×1×3 | 804 | - |
| LaViLa-B | 16×1×3 | 1432 | - |
| EgoVideo-B | 16×1×3 | 1092 | 288 |
| ViT-L | 4×1×3 | 1047 | - |
| ViT-L | 16×1×3 | 4188 | - |
| LaViLa-L | 16×1×3 | 4956 | - |
| EgoVideo-L | 16×1×3 | 5350 | 1162 |

Model Size and Inference Computational Cost. Table 5 compares the number of parameters. Our EgoVideo model maintains a relatively small parameter count; even with the addition of the motion adapter, the total remains lower than that of LaViLa and EMBED, highlighting the efficiency of our approach. Meanwhile, Table 6 compares the inference computational cost of EgoVideo with ViT and LaViLa. Thanks to our motion adapter, the additional inference cost of our model over ViT at 16 frames is only comparable to ViT's cost at 4 frames.

Training Efficiency. In Table 7, we compare the performance and training cost of our EgoVideo-B with AVION-B, where AVION-B is trained under two settings: pretraining with 16 frames and pretraining with 4 frames. Our EgoVideo is trained in a mixed 16- and 4-frame fashion and is thus faster than training the whole backbone directly on 16 frames. Meanwhile, EgoVideo achieves the best performance on the EK-100 MIR and EGTEA datasets. These results strongly demonstrate the effectiveness of our training strategy and motion adapter design.

Motion Adapter vs. Other Adapters. We compare our motion adapter with the standard adapter (Houlsby et al., 2019) and the ST-adapter (Pan et al., 2022). The standard adapter uses only a downsampling MLP and an upsampling MLP, while the ST-adapter performs convolution solely along the temporal dimension. As shown in Table 8, the results on the EK-100 MIR task demonstrate that both the ST-adapter and our motion adapter outperform the standard adapter. This improvement can be attributed to the limited capacity of the standard adapter, which restricts its ability to capture complex, fine-grained information. Compared to the ST-adapter, our motion adapter achieves the best performance by adding a spatial convolution operation, suggesting that both spatial and temporal information are crucial for egocentric video representation learning.

Table 7: Training efficiency comparison.

| Method | GPU Hours | EK-100 MIR mAP | EK-100 MIR nDCG | EGTEA Top1-acc. |
| --- | --- | --- | --- | --- |
| AVION-4f | 95.5 | 34.4 | 33.7 | 46.4 |
| AVION-16f | 395.5 | 36.2 | 34.3 | 47.4 |
| EgoVideo | 180.6 | 36.5 | 34.5 | 51.0 |

Table 8: Comparison of adapter designs on EK-100 MIR.

| Design | Param Size | mAP | nDCG |
| --- | --- | --- | --- |
| Adapter | 8.28M | 34.7 | 33.0 |
| ST-adapter | 10.08M | 35.9 | 34.1 |
| Motion Adapter | 26.01M | 36.5 | 34.5 |

4.4 Feature-Based Evaluation on Other Tasks

With its knowledge of hand-object dynamics, EgoVideo's features generalize well to other human behavior understanding tasks and to robot manipulation. Table 9 shows the results of action segmentation on the HOI4D (Liu et al., 2022) and GTEA (Fathi et al., 2011a) datasets, using features extracted from I3D (Carreira & Zisserman, 2017), AVION, and our EgoVideo. The results demonstrate that EgoVideo is also effective for action segmentation, especially on HOI4D, which requires differentiating fine-grained hand-object interactions.

We also test the generalization capability of EgoVideo on robot manipulation using the Franka Kitchen benchmark (Gupta et al., 2019). We follow the same setting as, and compare with, previous robotic representation learning works MVP (Radosavovic et al., 2023), Voltron (Karamcheti et al., 2023), and MPI (Zeng et al., 2024). For MPI, we compare against versions both with and without additional detection supervision. As shown in Table 10, EgoVideo consistently surpasses MVP and Voltron on the “Turn Knob (TK)”, “Open Microwave (OM)”, and “Open Door (OD)” tasks. While MPI uses additional detection and prediction transformers and performs better than EgoVideo on two tasks, EgoVideo still performs comparably in overall success rate. Complete results with more details and analyses are given in Appendix E. These results strongly demonstrate the fine-grained quality and generalization ability of the representations learned by EgoVideo.

Table 9: Feature-based action segmentation results.

| Feature | HOI4D F1@50 | HOI4D Edit | GTEA F1@50 | GTEA Edit |
| --- | --- | --- | --- | --- |
| I3D | 35.0 | 80.3 | 79.2 | 84.6 |
| AVION | 70.2 | 89.1 | 84.5 | 89.4 |
| EgoVideo | 74.8 | 90.1 | 87.1 | 90.1 |

Table 10: Task success rate (%) on Franka Kitchen.

| Method | TK | OM | OD | Avg. |
| --- | --- | --- | --- | --- |
| MVP | 79.0 | 41.0 | 48.0 | 56.0 |
| Voltron | 76.0 | 41.0 | 45.3 | 54.1 |
| MPI | 85.5 | 49.0 | 52.5 | 62.3 |
| EgoVideo | 80.1 | 65.0 | 52.7 | 66.0 |
| MPI+Det | 89.0 | 54.0 | 57.7 | 66.9 |

5 Conclusion

In this work, we inject fine-grained hand-object dynamics into egocentric video representation learning. Our method addresses the drawbacks of existing methods from two perspectives. On the data side, we propose HOD, a novel framework for generating new paired video-language data in which the language intricately depicts hand-object dynamics. On the model side, we propose EgoVideo, a model with a motion adapter combined with a co-training technique, to fully exploit the fine-grained dynamics provided by the HOD data in the representation learning process. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple downstream tasks and generalizes to embodied manipulation environments.

Acknowledgement. This work is funded in part by the National Key R&D Program of China (2022ZD0160201), the Shanghai Artificial Intelligence Laboratory, and JSPS KAKENHI Grant Number JP22KF0119.

References

  • AI et al. (2024) 01.AI: Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.AI, 2024.
  • Bertasius etal. (2021)Gedas Bertasius, Heng Wang, and Lorenzo Torresani.Is space-time attention all you need for video understanding?In ICML, volume2, pp.4, 2021.
  • Brahmbhatt etal. (2020)Samarth Brahmbhatt, Chengcheng Tang, ChristopherD Twigg, CharlesC Kemp, and James Hays.Contactpose: A dataset of grasps with object contact and hand pose.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pp. 361–378. Springer, 2020.
  • CabaHeilbron etal. (2015)Fabian CabaHeilbron, Victor Escorcia, Bernard Ghanem, and Juan CarlosNiebles.Activitynet: A large-scale video benchmark for human activity understanding.In Proceedings of the ieee conference on computer vision and pattern recognition, pp. 961–970, 2015.
  • Cai etal. (2018)Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan.Weakly-supervised 3d hand pose estimation from monocular rgb images.In Proceedings of the European conference on computer vision (ECCV), pp. 666–682, 2018.
  • Cao etal. (2021)Zhe Cao, Ilija Radosavovic, Angjoo Kanazawa, and Jitendra Malik.Reconstructing hand-object interactions in the wild.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12417–12426, 2021.
  • Carreira & Zisserman (2017)Joao Carreira and Andrew Zisserman.Quo vadis, action recognition? a new model and the kinetics dataset.In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
  • Chen etal. (2024)Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, and Limin Wang.Video mamba suite: State space model as a versatile alternative for video understanding.arXiv preprint arXiv:2403.09626, 2024.
  • Damen etal. (2018)Dima Damen, Hazel Doughty, GiovanniMaria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, etal.Scaling egocentric vision: The epic-kitchens dataset.In Proceedings of the European conference on computer vision (ECCV), pp. 720–736, 2018.
  • Damen etal. (2020)Dima Damen, Hazel Doughty, GiovanniMaria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, etal.The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020.
  • Ding etal. (2023)Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, etal.Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5(3):220–235, 2023.
  • Doosti etal. (2020)Bardia Doosti, Shujon Naha, Majid Mirbagheri, and DavidJ Crandall.Hope-net: A graph-based model for hand-object pose estimation.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6608–6617, 2020.
  • Dou etal. (2024)Zi-Yi Dou, Xitong Yang, Tushar Nagarajan, Huiyu Wang, Jing Huang, Nanyun Peng, Kris Kitani, and Fu-Jen Chu.Unlocking exocentric video-language data for egocentric video representation learning.arXiv preprint arXiv:2408.03567, 2024.
  • Farha & Gall (2019)YazanAbu Farha and Jurgen Gall.Ms-tcn: Multi-stage temporal convolutional network for action segmentation.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3575–3584, 2019.
  • Fathi etal. (2011a)Alireza Fathi, Ali Farhadi, and JamesM Rehg.Understanding egocentric activities.In 2011 international conference on computer vision, pp. 407–414. IEEE, 2011a.
  • Fathi etal. (2011b)Alireza Fathi, Xiaofeng Ren, and JamesM Rehg.Learning to recognize objects in egocentric activities.In CVPR 2011, pp. 3281–3288. IEEE, 2011b.
  • Feichtenhofer etal. (2019)Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He.Slowfast networks for video recognition.In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6202–6211, 2019.
  • Girdhar & Grauman (2021)Rohit Girdhar and Kristen Grauman.Anticipative video transformer.In Proceedings of the IEEE/CVF international conference on computer vision, pp. 13505–13515, 2021.
  • Grauman etal. (2022)Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, etal.Ego4d: Around the world in 3,000 hours of egocentric video.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022.
  • Gupta etal. (2019)Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman.Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning.arXiv preprint arXiv:1910.11956, 2019.
  • Hasson etal. (2019)Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kalevatykh, MichaelJ Black, Ivan Laptev, and Cordelia Schmid.Learning joint reconstruction of hands and manipulated objects.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11807–11816, 2019.
  • Hasson etal. (2020)Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid.Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 571–580, 2020.
  • Houlsby etal. (2019)Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin DeLaroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly.Parameter-efficient transfer learning for nlp.In International conference on machine learning, pp. 2790–2799. PMLR, 2019.
  • Huang etal. (2018)Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato.Predicting gaze in egocentric video by learning task-dependent attention transition.In European Conference on Computer Vision, 2018.
  • Huang etal. (2020a)Yifei Huang, Minjie Cai, Zhenqiang Li, Feng Lu, and Yoichi Sato.Mutual context network for jointly estimating egocentric gaze and action.IEEE Transactions on Image Processing, 29:7795–7806, 2020a.
  • Huang etal. (2020b)Yifei Huang, Yusuke Sugano, and Yoichi Sato.Improving action segmentation via graph-based temporal reasoning.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14024–14034, 2020b.
  • Huang etal. (2024a)Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, LuDong, Yali Wang, Limin Wang, etal.Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22072–22086, 2024a.
  • Huang etal. (2024b)Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, etal.Vinci: A real-time embodied smart assistant based on egocentric vision-language model.arXiv preprint arXiv:2412.21080, 2024b.
  • Islam et al. (2024) Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Video recap: Recursive captioning of hour-long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18198–18208, 2024.
  • Jiang et al. (2021) Hanwen Jiang, Shaowei Liu, Jiashun Wang, and Xiaolong Wang. Hand-object contact consistency reasoning for human grasps generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11107–11116, 2021.
  • Karamcheti et al. (2023) Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023.
  • Kay et al. (2017) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • Li et al. (2020) Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):6647–6658, 2020.
  • Li et al. (2021) Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grauman. Ego-exo: Transferring visual representations from third-person to first-person videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6943–6953, 2021.
  • Li et al. (2018) Yin Li, Miao Liu, and James M. Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 619–635, 2018.
  • Lin et al. (2022) Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z. Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35:7575–7586, 2022.
  • Liu et al. (2021) Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, and Xiaolong Wang. Semi-supervised 3d hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14687–14697, 2021.
  • Liu et al. (2022) Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21013–21022, 2022.
  • Luo et al. (2024) Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. arXiv preprint arXiv:2403.06351, 2024.
  • Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2630–2640, 2019.
  • Nair et al. (2022) Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
  • Ohkawa et al. (2023) Takehiko Ohkawa, Kun He, Fadime Sener, Tomas Hodan, Luan Tran, and Cem Keskin. Assemblyhands: Towards egocentric activity understanding via 3d hand pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12999–13008, 2023.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Pan et al. (2022) Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 35:26462–26477, 2022.
  • Pan et al. (2023) Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20133–20143, 2023.
  • Pei et al. (2024) Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, et al. Egovideo: Exploring egocentric foundation model and downstream adaptation. arXiv preprint arXiv:2406.18070, 2024.
  • Plizzari et al. (2022) Chiara Plizzari, Mirco Planamente, Gabriele Goletto, Marco Cannici, Emanuele Gusso, Matteo Matteucci, and Barbara Caputo. E2 (go) motion: Motion augmented event stream for egocentric action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19935–19947, 2022.
  • Plizzari et al. (2024) Chiara Plizzari, Gabriele Goletto, Antonino Furnari, Siddhant Bansal, Francesco Ragusa, Giovanni Maria Farinella, Dima Damen, and Tatiana Tommasi. An outlook into the future of egocentric vision. International Journal of Computer Vision, pp. 1–57, 2024.
  • Pramanick et al. (2023) Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5285–5297, 2023.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
  • Radosavovic et al. (2023) Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pp. 416–426. PMLR, 2023.
  • Saha et al. (2021) Snehanshu Saha, Archana Mathur, Aditya Pandey, and Harshith Arun Kumar. Diffact: A unifying framework for activation functions. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2021.
  • Sennrich (2015) Rico Sennrich. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Shan et al. (2020) Dandan Shan, Jiaqi Geng, Michelle Shu, and David F. Fouhey. Understanding human hands in contact at internet scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9869–9878, 2020.
  • Srivastava et al. (2022) Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pp. 477–490. PMLR, 2022.
  • Wang et al. (2024a) Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, and Mike Zheng Shou. Cosmo: Contrastive streamlined multimodal model with interleaved pre-training. arXiv preprint arXiv:2401.00849, 2024a.
  • Wang et al. (2024b) Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024b.
  • Xing et al. (2024) Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, and Yu-Gang Jiang. Aid: Adapting image2video diffusion models for instruction-guided video prediction. arXiv preprint arXiv:2406.06465, 2024.
  • Xu et al. (2024) Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13525–13536, 2024.
  • Xue et al. (2022) Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5036–5045, 2022.
  • Yang & Yao (2019) Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9877–9886, 2019.
  • Yi et al. (2021) Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568, 2021.
  • Yuan et al. (2018) Shanxin Yuan, Guillermo Garcia-Hernando, Björn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, et al. Depth-based 3d hand pose estimation: From current achievements to future goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2636–2645, 2018.
  • Zeng et al. (2024) Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, et al. Learning manipulation by predicting interaction. arXiv preprint arXiv:2406.00439, 2024.
  • Zhang et al. (2023) Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Helping hands: An object-aware ego-centric video recognition model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13901–13912, 2023.
  • Zhang et al. (2020) Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6543–6554, Online, July 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.acl-main.585.
  • Zhang et al. (2022) Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. In European Conference on Computer Vision, pp. 127–145. Springer, 2022.
  • Zhao et al. (2021) Chen Zhao, Ali K. Thabet, and Bernard Ghanem. Video self-stitching graph network for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13658–13667, 2021.
  • Zhao & Krähenbühl (2023) Yue Zhao and Philipp Krähenbühl. Training a large video model on a single machine in a day. arXiv preprint arXiv:2309.16669, 2023.
  • Zhao et al. (2023) Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6586–6597, 2023.

Appendix A Details About HOD

Figure 5: Examples of positive (egocentric or ego-like) and negative clips used to train the style classifier.

Data Selection. Since our HOD involves data not only from Ego4D but also from Howto-Interlink7M, we use a style classifier 𝒫 to filter egocentric-style videos from the Howto-Interlink7M dataset. Specifically, the style classifier is a simple two-layer MLP. We use InternVideo2 (Wang et al., 2024b) to extract video features from all clips of the Howto-Interlink7M dataset. We then manually annotate 10,000 clips with positive and negative labels, where a positive label indicates that the clip is an egocentric (or ego-like) video. Examples of positive and negative labeled videos can be found in Figure 5. We randomly select 10% of these clips to form the validation set; after training the classifier on the remaining training set, it reaches 89% accuracy on the validation set.
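For concreteness, below is a minimal sketch of such a two-layer MLP style classifier operating on pre-extracted InternVideo2 clip features. The feature dimension, hidden width, and training hyperparameters are illustrative assumptions rather than the exact values used.

```python
import torch
import torch.nn as nn

class StyleClassifier(nn.Module):
    """Two-layer MLP that scores whether a clip-level feature looks egocentric.

    The input is assumed to be a pre-extracted InternVideo2 clip embedding;
    the feature and hidden dimensions below are illustrative.
    """

    def __init__(self, feat_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),  # logits for {non-egocentric, egocentric}
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

# Training on the manually labeled clips reduces to standard cross-entropy.
model = StyleClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(32, 768)           # placeholder batch of clip features
labels = torch.randint(0, 2, (32,))    # 1 = egocentric / ego-like, 0 = otherwise
loss = criterion(model(feats), labels)
loss.backward()
optimizer.step()
```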

HOD Rephraser. We use the Yi-34B model to generate hand-object dynamics narrations. The Yi-34B model is trained on a corpus of over 150,000 high-quality texts, its weights are open-source, and it ranks highly among existing open-source Large Language Models. We use the model directly, without finetuning.

To generate reliable narrations, we need to convert the obtained hand-object information into appropriate text. For the movement trajectories of hands and objects, we compute the center points of the bounding boxes and normalize them to obtain a sequence L = ((w_0, h_0), (w_1, h_1), …, (w_15, h_15)). To determine whether an object is contacted by the left hand, by the right hand, or by both hands simultaneously, we compute a generalized IoU between the left-contacted and right-contacted object boxes: when the generalized IoU exceeds 0.9, we classify the object as being contacted by both hands.
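The sketch below illustrates this conversion, assuming per-frame boxes are given as pixel-space (x1, y1, x2, y2) tuples over 16 frames; the helper names are ours, not from the released code.

```python
import numpy as np

def box_center_trajectory(boxes: np.ndarray, width: float, height: float) -> np.ndarray:
    """Normalized center-point trajectory L = ((w_0, h_0), ..., (w_15, h_15)).

    `boxes` has shape (16, 4) holding per-frame (x1, y1, x2, y2) in pixels.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0 / width
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0 / height
    return np.stack([cx, cy], axis=1)

def generalized_iou(box_a, box_b) -> float:
    """Generalized IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest enclosing box.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)

    iou = inter / union
    return iou - (enclose - union) / enclose

def is_two_hand_object(left_obj_box, right_obj_box, threshold: float = 0.9) -> bool:
    """Treat an object as contacted by both hands when the generalized IoU exceeds 0.9."""
    return generalized_iou(left_obj_box, right_obj_box) > threshold
```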

Subsequently, we prompt the LLM with a system prompt of:

## System Prompt
Now you are a captioning assistant, you need to generate hand object interaction
caption and combine them with the origin narration. Given the origin narration
of the video clip and spatial localization ([x, y]) of hands and objects in the
clip, please help me describe the direction of motion of the left and right hands,
their relative relationship to objects and whether they are touching or not. Do not
mention the pixel info. Two_hand_object means objects with two hands in contact,
left_hand_object means objects with left hand in contact, right_hand_object means objects with right hand in contact.

## Hand Object Dynamics
left_hand:((w_0,h_0),(w_1,h_1),...,(w_15,h_15))
right_hand:((w_0,h_0),(w_1,h_1),...,(w_15,h_15))
left_hand_object:((w_0,h_0),(w_1,h_1),...,(w_15,h_15))
right_hand_object:((w_0,h_0),(w_1,h_1),...,(w_15,h_15))
two_hand_object:((w_0,h_0),(w_1,h_1),...,(w_15,h_15))
origin narration: C takes a scissors.

## System Prompt
Please help me summarize the direction of movement of the left hand, right hand, and objects, and generate a new caption based on the original caption. It is strictly forbidden to mention the frame number and spatial position coordinates in the description.

Regarding computational cost, it takes around 2 days to extract bounding boxes from all video-language clips and 3 days to generate narrations with the LLM using 32 A100 GPUs, i.e., roughly 4,000 GPU·hours in total.

Data Evaluation. Here, we provide a detailed explanation of our evaluation process. First, we prompted the LLM to generate new narrations with different verbs/nouns using the prompt:

Please help me modify the key verbs and nouns in this sentence to slightly alter
its meaning while keeping the sentence structure largely unchanged. Just return
the modified sentence to me. Ensure the semantic shift is minimal, such as
changing one or two verbs and nouns.

Then we utilized GPT-4o as a judge to determine the quality of the narrations with the prompt:

You are a judge. There are 16 frames in the video, I have three captions and need your help to score the three captions based on three criteria: relevance, accuracy, and level of detail. The score ranges from 0 to 10, with a higher score indicating better quality of the caption. You can just answer me in the following format: First: score1, Second: score2, Third: score3. First caption: text1 Second caption: text2 Third caption: text3

As mentioned in the main manuscript, to further validate the quality of our HOD dataset, we use two standard automatic captioning metrics to evaluate the narrations. Using the human narration as the ground truth, we compare our HOD data with the LaViLa-Narrator on METEOR and CIDEr. The results in Table 11 show that while our HOD data achieves a slightly lower METEOR score, it outperforms the LaViLa-Narrator in CIDEr. This discrepancy arises because many LaViLa narrations closely mirror the original text, whereas our narrations incorporate additional dynamic information. Although our scores do not drastically exceed those of LaViLa, the results indicate that our narrations retain the original semantic content.

Table 11: METEOR and CIDEr scores, using human narrations as references.

Text              METEOR   CIDEr
LaViLa-Narrator   0.45     0.34
HOD               0.39     0.40
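As a rough sketch of how such reference-based scores can be computed — assuming the pycocoevalcap implementations of METEOR (which requires a Java runtime) and CIDEr, pre-tokenized lowercase text, and a single human reference per clip — the evaluation reduces to:

```python
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def score_narrations(references: dict, candidates: dict):
    """references / candidates: dict mapping clip id -> list of sentence strings."""
    meteor, _ = Meteor().compute_score(references, candidates)
    cider, _ = Cider().compute_score(references, candidates)
    return meteor, cider

# Hypothetical usage: human narration as reference, generated narration as candidate.
refs = {"clip_0001": ["c takes a scissors ."]}
cands = {"clip_0001": ["c takes a scissors with the right hand while the left hand holds the paper ."]}
print(score_narrations(refs, cands))
```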

Limitations and future work. Our pipeline relies on the quality of hand-object detection and on LLM rephrasing, so errors from these stages may accumulate. Besides reducing such errors in the data construction, exploring how hand-object dynamics can be better incorporated into language or other formats is a promising direction for future work.

Appendix B Dataset Details

Ego4D. Ego4D (Grauman et al., 2022) contains 3,670 hours of egocentric videos with temporally dense narrations. Each narration has a timestamp and an associated free-form sentence. We follow previous works (Zhao et al., 2023; Lin et al., 2022) to prepare the Ego4D dataset for vision-language pretraining. Specifically, we drop the narrations that either contain “#unsure”/“#Unsure” tags or are shorter than 4 words. This results in 4M video-text clip pairs.
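A minimal sketch of this filtering rule is given below; the narration list is a placeholder, not real Ego4D data.

```python
narrations = [
    ("clip_a", "#C C cuts the onion with a knife"),
    ("clip_b", "#unsure"),
    ("clip_c", "C walks"),
]

def keep_narration(text: str) -> bool:
    """Drop narrations with '#unsure'/'#Unsure' tags or fewer than 4 words."""
    if "#unsure" in text or "#Unsure" in text:
        return False
    return len(text.split()) >= 4

filtered = [(clip, text) for clip, text in narrations if keep_narration(text)]
# Only ("clip_a", ...) survives in this toy example.
```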

Howto-Interlink7M. Howto-Interlink7M (Wang et al., 2024a) contains 1M videos and 7M clips and is a subset of the broader HowTo100M dataset. Unlike the original dataset, clips in Howto-Interlink7M come with concise descriptions and dense region captions, and GPT-4 is leveraged to generate comprehensive summaries from these detailed annotations. We use the style classifier described in Appendix A to select 3.3M video-text pairs from this dataset.

EpicKitchens-100. The Epic-Kitchens-100 (EK-100) dataset (Damen et al., 2020; 2018) contains 100 hours of egocentric cooking videos. Each clip is annotated with a start and end timestamp, a short textual narration, and the verb and noun classes that the narration belongs to; the action class is uniquely determined by combining the verb and the noun. On EK-100 MIR, we use mean Average Precision (mAP) and normalized Discounted Cumulative Gain (nDCG) as evaluation metrics. On EK-100 CLS, we use top-1 and top-5 action accuracy.

EGTEA. EGTEA (Li et al., 2018) contains 28 hours of cooking activities from 86 unique sessions of 32 subjects. In the zero-shot evaluation, we compute the similarity score between each video embedding and the text embeddings of the 106 action class names, and take the class with the highest similarity as the prediction. In the fine-tuning evaluation, we train the model for action classification following the linear probing protocol.
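A minimal sketch of this zero-shot protocol, assuming pre-computed video and class-name text embeddings (dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(video_emb: torch.Tensor, class_text_embs: torch.Tensor) -> torch.Tensor:
    """video_emb: (N, D) video embeddings; class_text_embs: (C, D) embeddings of the
    C class names (C = 106 for EGTEA). Returns predicted class indices of shape (N,)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    sim = v @ t.T               # cosine similarity, shape (N, C)
    return sim.argmax(dim=-1)   # highest-similarity class name is the prediction

# Hypothetical usage with random tensors standing in for encoder outputs.
preds = zero_shot_classify(torch.randn(8, 512), torch.randn(106, 512))
```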

GTEA. The Georgia Tech Egocentric Activities (GTEA) dataset (Fathi et al., 2011b) consists of seven types of everyday activities, such as making a sandwich, preparing tea, and brewing coffee. Each activity is performed by four different subjects, resulting in 28 videos in total. Each video captures around 20 fine-grained action instances, such as “take bread” or “pour ketchup”, within approximately one minute, providing an egocentric testbed widely used for activity recognition research.

HOI4D. The HOI4D dataset (Liu et al., 2022) is a large-scale 4D egocentric resource for category-level human-object interaction with detailed annotations. It comprises 2.4 million RGB-D egocentric video frames across more than 4,000 sequences, capturing interactions performed by nine participants with 800 unique object instances spanning 16 categories in 610 diverse indoor environments. HOI4D introduces three benchmark tasks: semantic segmentation of 4D dynamic point cloud sequences, category-level object pose tracking, and egocentric action segmentation over a variety of interaction targets.

Franka Kitchen. The Franka Kitchen dataset (Gupta et al., 2019) is designed to facilitate research in robotic manipulation within a kitchen environment. It provides demonstrations of a 9-degree-of-freedom Franka Emika Panda robot arm performing various manipulation tasks in a kitchen equipped with common household items, including a microwave, a kettle, an overhead light, cabinets, and an oven. The environment is designed for multitask objectives, requiring the robot to interact with these items to reach specific goal configurations.

Appendix C Implementation Details

Pretraining Details. We pretrain on the video-narration pairs generated by our HOD from Ego4D and Howto-Interlink7M. We use the AdamW optimizer with betas = (0.9, 0.999) for 15 epochs, with different settings for different model sizes. For EgoVideo-B, we adopt a batch size of 128 over 16 GPUs with a fixed learning rate of 5e-5. For EgoVideo-L, we use a batch size of 32 over 16 GPUs with a fixed learning rate of 3e-5. For EgoVideo-G, we use a batch size of 16 over 16 GPUs with a fixed learning rate of 1e-5. For input frames, we first resize the shorter side to 320 pixels, which accelerates data loading, and then apply a standard RandomResizedCrop (Zhao & Krähenbühl, 2023) with a scale parameter of (0.5, 1.0) to obtain the input frames.
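A minimal sketch of the frame preprocessing and optimizer setup described above is shown below; normalization, video-specific batching, and the learning-rate schedule are omitted, and the EgoVideo-B learning rate is used for illustration.

```python
import torch
from torchvision import transforms

# Frames are assumed to be decoded with the shorter side already resized to 320 px.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),  # standard crop used for pretraining
    transforms.ToTensor(),
])

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """AdamW with a fixed learning rate (EgoVideo-B setting from the text)."""
    return torch.optim.AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
```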

Finetuning Details. We finetune on the downstream tasks using AdamW with (β1, β2) = (0.9, 0.999), a weight decay of 0.05, and cosine annealing. Table 12 lists the hyperparameter details; we use 8 GPUs for finetuning on all tasks. During training, we resize the shorter side of the video to 256 pixels and extract a 224×224 crop. During testing, we resize the shorter side to 224 pixels and take a central 224×224 crop.

Table 12: Finetuning hyperparameters.

Task        Model Size   Epochs   LR_start   LR_end   Batch Size
EK100-MIR   EgoVideo-B   100      1e-6       1e-5     256
EK100-MIR   EgoVideo-L   70       5e-7       5e-6     64
EK100-MIR   EgoVideo-G   50       4e-7       4e-6     32
EK100-CLS   EgoVideo-B   100      1e-6       1e-5     256
EK100-CLS   EgoVideo-L   70       5e-7       5e-6     64
EK100-CLS   EgoVideo-G   60       4e-7       4e-6     32
EGTEA       EgoVideo-B   100      1e-6       1e-5     256
EGTEA       EgoVideo-L   70       7e-7       7e-6     64
EGTEA       EgoVideo-G   50       4e-7       4e-6     32

For the EgoNLQ task (Grauman et al., 2022), we build on the methodology of EgoVLP (Lin et al., 2022) and LaViLa (Zhao et al., 2023) for fairness, adopting VSLNet (Zhang et al., 2020) as the task head. We train the task head for 50 epochs with a learning rate of 3e-3, a dropout of 0.3, and a batch size of 32 on a single A100 GPU. For the EgoMQ task, we use VSGN (Zhao et al., 2021) as the task head, setting the batch size to 16, the learning rate to 2e-4, and gamma to 0.6, and train it on a single A100 GPU.

Table 13: Comparison of pretraining data with matched data size, evaluated on EK-100 MIR.

Model        Pretrain Data   Data Size   mAP    nDCG
EgoVideo-B   EgoClip         4M          31.1   32.0
EgoVideo-B   Ego4D-HOD       4M          34.4   33.9
EgoVideo-L   EgoClip         4M          35.3   34.6
EgoVideo-L   Ego4D-HOD       4M          38.3   35.9
EgoVideo-G   EgoClip         4M          42.1   37.5
EgoVideo-G   Ego4D-HOD       4M          44.8   38.2

Appendix D Additional Ablations

Pretraining Data. To further demonstrate the effectiveness of our HOD, we fix the amount of data and conduct experiments with different model sizes. As shown in Table 13, with the same model size and the same data size, using our Ego4D-HOD data consistently improves performance. Since each sample in EgoClip corresponds strictly to one sample in Ego4D-HOD, this comparison strongly supports the high quality of our HOD data.

Table 14: Effect of the number of sampled frames on EK-100 MIR.

Frame count   λ   EK-100 MIR mAP
4             1   34.2
8             2   35.3
12            3   35.9
16            4   36.5
32            8   36.5

Figure 6: Effect of the adapter downsampling ratio γ on EK-100 MIR performance.

Adapter Downsampling Ratio. We study the design of our motion adapter by varying the adapter downsampling ratio γ, with results shown in Figure 6. As γ increases, the model’s performance continues to improve, indicating that our generated narrations contain rich semantic information and further validating the effectiveness of the motion adapter. To reduce computational overhead, we ultimately set γ = 0.5.
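As a rough illustration of what a downsampling ratio means for a bottleneck-style adapter (the exact motion adapter architecture is described in the main paper; this sketch only shows how γ would control the bottleneck width under that assumption):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: project down by ratio gamma, then back up.

    With gamma = 0.5 and dim = 768, the bottleneck width is 384; a larger gamma
    means more adapter capacity at a higher computational cost.
    """

    def __init__(self, dim: int = 768, gamma: float = 0.5):
        super().__init__()
        bottleneck = max(1, int(dim * gamma))
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection around the bottleneck
```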

Frame Number. We further study the effect of the number of sampled frames used as input, while consistently using 4 frames as the sampling rate for the backbone. The results in Table 14 show that as the number of frames increases from 4 to 16, performance improves steadily from 34.2% to 36.5% mAP. When the frame count reaches 32, performance plateaus, showing no significant improvement, while the computational cost grows substantially. We therefore use λ = 4 (corresponding to 16 frames) as the default in EgoVideo, balancing speed and accuracy.

Appendix E Additional Results

Table 15: Action segmentation results on GTEA and HOI4D.

                         GTEA                                HOI4D
Feature    Method        F1@10 / 25 / 50      Edit   Acc    F1@10 / 25 / 50      Edit   Acc
I3D        MS-TCN        85.8 / 83.4 / 69.8   79.0   76.3   55.6 / 47.8 / 31.8   74.7   44.2
I3D        MS-TCN++      88.8 / 85.7 / 76.0   83.5   80.1   54.7 / 46.5 / 30.3   75.2   42.2
I3D        ASFormer      90.1 / 88.8 / 79.2   84.6   79.7   -                    -      -
I3D        DiffAct       92.5 / 91.5 / 84.7   89.6   82.2   -                    -      -
AVION      ASFormer      92.5 / 91.0 / 84.5   89.4   81.4   84.4 / 81.1 / 70.2   89.2   74.2
EgoVideo   ASFormer      92.7 / 92.2 / 87.1   90.1   82.7   88.9 / 85.3 / 74.8   90.1   76.2

E.1 Details on action segmentation tasks.

Action segmentation tests the representation on its understanding of the temporal dependencies within a video (Huang et al., 2020b; Yi et al., 2021). We evaluate our model on two benchmark datasets: GTEA (Fathi et al., 2011b) and HOI4D (Liu et al., 2022), following previous work in using four-fold cross-validation on both. We report accuracy (Acc), the segmental edit score (Edit), and F1 scores at overlap thresholds of 10%, 25%, and 50% (F1@10, 25, 50).
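For reference, here is a compact sketch of how these segmentation metrics are typically computed from per-frame label sequences, following the standard MS-TCN-style evaluation; the function names are ours.

```python
import numpy as np

def segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) segments."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def edit_score(pred, gt) -> float:
    """Segmental edit score: normalized Levenshtein distance over segment label sequences."""
    p = [s[0] for s in segments(pred)]
    y = [s[0] for s in segments(gt)]
    D = np.zeros((len(p) + 1, len(y) + 1))
    D[:, 0] = np.arange(len(p) + 1)
    D[0, :] = np.arange(len(y) + 1)
    for i in range(1, len(p) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if p[i - 1] == y[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return (1 - D[-1, -1] / max(len(p), len(y), 1)) * 100

def f1_at_k(pred, gt, overlap: float = 0.1) -> float:
    """Segmental F1 at a given IoU overlap threshold (e.g., 0.10 / 0.25 / 0.50)."""
    pred_segs, gt_segs = segments(pred), segments(gt)
    used = [False] * len(gt_segs)
    tp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or used[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= overlap and best_j >= 0:
            tp += 1
            used[best_j] = True
    fp, fn = len(pred_segs) - tp, len(gt_segs) - tp
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall) * 100
```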

We use ASFormer (Yi et al., 2021) as the task head, with input features extracted by our EgoVideo, I3D (Carreira & Zisserman, 2017), and AVION (Zhao & Krähenbühl, 2023). Following Chen et al. (2024), we train with a learning rate of 5e-4, a dropout rate of 0.3, and 100 epochs. Table 15 presents the results of our method and other recent approaches, including MS-TCN (Farha & Gall, 2019), MS-TCN++ (Li et al., 2020), ASFormer (Yi et al., 2021), and DiffAct (Saha et al., 2021). The results clearly show the high quality of our EgoVideo features: with the same ASFormer task head, EgoVideo features consistently outperform AVION features, and they even enable ASFormer to surpass the stronger DiffAct task head.

Figure 7: Qualitative results on the Franka Kitchen tasks: turning the stove top knob, opening the microwave, and opening the left door.

E.2 Details on the Franka Kitchen dataset.

Here we introduce the details of the experiments on the Franka Kitchen dataset (Gupta et al., 2019). We adopt 3 tasks: “Turn the stove top knob (TK)”, “Open the microwave (OM)”, and “Open the left door (OD)”. The goal is to predict 9-DoF joint velocities (7 arm joints and 2 gripper joints) from the visual representation and the proprioceptive state (i.e., joint velocities). We follow the MPI setup (Zeng et al., 2024), which trains a shallow MLP policy network. For evaluation, we follow R3M (Nair et al., 2022) and Voltron (Karamcheti et al., 2023) and report the average success rate for each setting across the 3 tasks.
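A minimal sketch of such a shallow policy head is shown below; the visual feature dimension, proprioceptive state dimension, and hidden width are illustrative assumptions rather than the exact configuration used.

```python
import torch
import torch.nn as nn

class KitchenPolicy(nn.Module):
    """Shallow MLP mapping a frozen visual feature plus proprioceptive state
    to 9-DoF joint velocities (7 arm joints + 2 gripper joints)."""

    def __init__(self, visual_dim: int = 768, proprio_dim: int = 9, hidden_dim: int = 256):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(visual_dim + proprio_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 9),  # predicted joint velocities
        )

    def forward(self, visual_feat: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        return self.policy(torch.cat([visual_feat, proprio], dim=-1))

# Behavior cloning on demonstrations then reduces to MSE regression on the actions.
policy = KitchenPolicy()
action = policy(torch.randn(4, 768), torch.randn(4, 9))  # (batch, 9)
```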

We compare our EgoVideo with MVP (Radosavovic et al., 2023), Voltron (Karamcheti et al., 2023), and MPI (Zeng et al., 2024). MVP learns representations for robot manipulation via masked image modeling. Voltron goes a step further by combining masked image modeling with vision-text alignment. MPI designs detection and prediction transformers that use object detection signals as additional guidance. Notably, these works also use Ego4D as training data.

The experimental results in Table 10 show that our model outperforms both MVP and Voltron by more than 10% in average success rate, and performs comparably to the more advanced MPI model, which integrates multiple pretraining tasks tailored to robot learning. When MPI is trained solely with contrastive learning and masked signal modeling as supervision, we achieve a 3.7% higher average success rate; when MPI additionally incorporates the video prediction task, which has been shown to be crucial for robot learning, our average success rate is only 0.9% lower. This demonstrates the robust generalization capability of our model and highlights the contribution of our hand-object dynamics learning scheme to fine-grained hand operations. Figure 7 shows qualitative results on the turning-the-knob, opening-the-microwave, and opening-the-door tasks.

Appendix F Qualitative Results

In Figure 8, we show more examples comparing narrations generated by our HOD with the LaViLa Rephraser and the original narrations. The narrations generated by our HOD describe hand-object dynamics well (e.g., ‘The left hand moves downwards to touch the bicycle tire’). Moreover, compared to the LaViLa rephraser, which often merely changes word order or swaps nouns/verbs, our pipeline combines the original actions to produce semantically richer descriptions of actions and scenes, resulting in significantly higher-quality narrations (see the first example: our HOD generates ‘Person C picks a card with their right hand, which is then handed to their left hand.’, while LaViLa yields ‘#C C chooses a card / #C C selects a card / #C C picks a card’).

Figure 8: Additional comparisons between narrations generated by our HOD, the LaViLa Rephraser, and the original narrations.