Konstantin Mishchenko

Research Scientist

Meta

Bio

Hi there, I’m Konstantin, an AI researcher at Meta and a wannabe musician. I like using mathematics to make things work in practice, especially in deep learning applications. Previously, I was a research scientist at Samsung AI Center in Cambridge, UK. Besides doing research, I serve as an Action Editor for TMLR, tweet about interesting papers, and give talks about my work. In 2023, I was lucky to receive the Outstanding Paper Award together with Aaron Defazio for our work on adaptive methods.

Before joining Samsung, I did a postdoc at Inria Sierra with Alexandre d’Aspremont and Francis Bach. I received my PhD from KAUST, where I worked under the supervision of Peter Richtárik on optimization theory and its applications in machine learning. In 2020, I also interned at Google Brain. I obtained a double-degree MSc from École normale supérieure Paris-Saclay and Paris-Dauphine, and a BSc from the Moscow Institute of Physics and Technology.

My interests and hobbies tend to change every couple of years or so. Recently, I finished six months of evening classes at The Institute of Contemporary Music Performance, where I studied electronic music production using Ableton Live. I hope to release some music online in the future.

Feel free to shoot me an email if you want to chat in person about research or music, go to a museum, or maybe just take a walk in Paris!

Interests
  • Generative AI
  • Optimization
  • Deep learning
Education
  • PhD in Computer Science, 2021

    KAUST

  • MSc in Data Science, 2017

    École normale supérieure Paris-Saclay and Paris-Dauphine

  • BSc in Computer Science and Physics, 2016

    Moscow Institute of Physics and Technology

Experience

Research Scientist
Meta
Oct 2024 – Present · Paris, France
Doing research on code generation.

Research Scientist
Samsung
Jan 2023 – Oct 2024 · Cambridge, UK
Worked on embedded AI systems as a member of the Distributed AI team and the GenAI initiative. Some of the things I worked on:
  • Non-autoregressive multi-token generation for LLMs using soft prompt tuning (paper under review)
  • Efficient transformer layers for on-device models
  • Federated learning with streaming clients using small batch sizes (patent submitted)
  • Federated learning under heterogeneous data (paper under review)
  • Adaptive optimization methods for automated training (papers published at ICML 2023 and ICML 2024)

Postdoc
Inria Sierra
Dec 2021 – Dec 2022 · Paris, France
Conducted research on adaptive, second-order, and distributed optimization.

Recent Posts

New job at Meta

I’m excited to announce that I have started my new job at Meta as a Research Scientist on the CodeGen team led by Gabriel Synnaeve in Paris, France.

Code generation with ML excites me because I have always been frustrated by how much time it takes to translate my ideas into code, and that has become much easier over the last couple of years. I think John Carmack said in an interview that in game development, most of the code that is written is never read by anyone because there is simply too much of it. I like to imagine systems like that, where code is generated on the fly for different use cases, then optimized, tested, and debugged without us ever seeing any of it directly. More than anything, I just want a tool that makes programming more about designing elegant systems and solutions than about actually writing or debugging them.

I am particularly excited to join Meta given their strong commitment to open source AI, which I believe is crucial for ensuring democratic access to this technology. I will keep writing and publishing papers as well as releasing my code.

Recent Papers

(2024). Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference.

PDF Cite Slides arXiv

(2024). The Road Less Scheduled.

PDF Cite arXiv

(2023). When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement.

PDF Cite arXiv

(2023). Adaptive Proximal Gradient Method for Convex Optimization.

PDF Cite arXiv

(2023). Partially Personalized Federated Learning: Breaking the Curse of Data Heterogeneity.

PDF Cite arXiv

(2023). Two Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy.

PDF Cite arXiv ICML

(2023). Learning-Rate-Free Learning by D-Adaptation.

PDF Cite Code arXiv ICML

(2023). Convergence of First-Order Algorithms for Meta-Learning with Moreau Envelopes.

PDF Cite arXiv

(2022). Super-Universal Regularized Newton Method.

PDF Cite Code Slides arXiv

(2022). Adaptive Learning Rates for Faster Stochastic Gradient Methods.

PDF Cite arXiv

(2022). Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays.

PDF Cite Code Slides arXiv

(2022). ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally! ICML.

PDF Cite Code Video arXiv ICML

(2022). Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization.

PDF Cite arXiv

(2021). IntSGD: Adaptive Floatless Compression of Stochastic Gradients. ICLR.

PDF Cite Code Poster Slides arXiv ICLR

(2021). Proximal and Federated Random Reshuffling. ICML.

PDF Cite Code Slides Video arXiv ICML

(2020). Random Reshuffling: Simple Analysis with Vast Improvements. NeurIPS.

PDF Cite Code Poster Slides arXiv NeurIPS

(2020). Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms. JOTA.

PDF Cite Poster arXiv JOTA

(2019). Adaptive Gradient Descent without Descent. ICML.

PDF Cite Code Poster Slides arXiv ICML Video

(2019). First Analysis of Local GD on Heterogeneous Data.

PDF Cite Slides arXiv NeurIPS

(2019). Tighter Theory for Local SGD on Identical and Heterogeneous Data. AISTATS.

PDF Cite Slides arXiv AISTATS

(2019). MISO is Making a Comeback With Better Proofs and Rates.

PDF Cite arXiv

(2019). DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate. AISTATS.

PDF Cite arXiv AISTATS

(2019). Revisiting Stochastic Extragradient. AISTATS.

PDF Cite Slides arXiv AISTATS

(2019). Stochastic Distributed Learning with Gradient Quantization and Double Variance Reduction. Optimization Methods and Software.

PDF Cite arXiv

(2019). 99% of Worker-Master Communication in Distributed Optimization Is Not Needed. UAI.

PDF Cite arXiv UAI

(2019). Distributed Learning with Compressed Gradient Differences.

PDF Cite arXiv

(2018). SEGA: Variance Reduction via Gradient Sketching. NeurIPS.

PDF Cite arXiv NeurIPS

(2018). A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning. ICML.

PDF Cite ICML

(2018). A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm. SIOPT.

PDF Cite arXiv SIAM