#Local Models Related Papers
->/lmg/<-| ->[Abstracts Search (Current as end of 06/2024)](https://files.catbox.moe/s5ind7.txt)<-
------ | ------
|**Google** ->[Papers](https://research.google/pubs/?area=machine-intelligence) [Blog](https://ai.googleblog.com)<-
12/2017|[Attention Is All You Need (Transformers)](https://arxiv.org/abs/1706.03762)
10/2018|[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
10/2019|[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)](https://arxiv.org/abs/1910.10683)
11/2019|[Fast Transformer Decoding: One Write-Head is All You Need](https://arxiv.org/abs/1911.02150)
02/2020|[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
03/2020|[Talking-Heads Attention](https://arxiv.org/abs/2003.02436)
05/2020|[Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
09/2020|[Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
12/2020|[RealFormer: Transformer Likes Residual Attention](https://arxiv.org/abs/2012.11747)
01/2021|[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
09/2021|[Finetuned Language Models Are Zero-Shot Learners (Flan)](https://arxiv.org/abs/2109.01652)
09/2021|[Primer: Searching for Efficient Transformers for Language Modeling](https://arxiv.org/abs/2109.08668)
11/2021|[Sparse is Enough in Scaling Transformers](https://arxiv.org/abs/2111.12763)
12/2021|[GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905)
01/2022|[LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239)
01/2022|[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
04/2022|[PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311)
07/2022|[Confident Adaptive Language Modeling](https://arxiv.org/abs/2207.07061)
10/2022|[Scaling Instruction-Finetuned Language Models (Flan-Palm)](https://arxiv.org/abs/2210.11416)
10/2022|[Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models](https://arxiv.org/abs/2210.13432)
10/2022|[Large Language Models Can Self-Improve](https://arxiv.org/abs/2210.11610)
11/2022|[Efficiently Scaling Transformer Inference](https://arxiv.org/abs/2211.05102)
11/2022|[Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192)
02/2023|[Symbolic Discovery of Optimization Algorithms (Lion)](https://arxiv.org/abs/2302.06675)
03/2023|[PaLM-E: An Embodied Multimodal Language Model](https://arxiv.org/abs/2303.03378)
04/2023|[Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference](https://arxiv.org/abs/2304.04947)
05/2023|[Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes](https://arxiv.org/abs/2305.02301)
05/2023|[FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction](https://arxiv.org/abs/2305.02549)
05/2023|[PaLM 2 Technical Report](https://arxiv.org/abs/2305.10403)
05/2023|[Symbol tuning improves in-context learning in language models](https://arxiv.org/abs/2305.08298)
05/2023|[Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models](https://arxiv.org/abs/2305.14705)
05/2023|[Towards Expert-Level Medical Question Answering with Large Language Models (Med-Palm 2)](https://arxiv.org/abs/2305.09617)
05/2023|[DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining](https://arxiv.org/abs/2305.10429)
05/2023|[How Does Generative Retrieval Scale to Millions of Passages?](https://arxiv.org/abs/2305.11841)
05/2023|[GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245)
05/2023|[Small Language Models Improve Giants by Rewriting Their Outputs](https://arxiv.org/abs/2305.13514)
06/2023|[StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners](https://arxiv.org/abs/2306.00984)
06/2023|[AudioPaLM: A Large Language Model That Can Speak and Listen](https://arxiv.org/abs/2306.12925)
06/2023|[Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting](https://arxiv.org/abs/2306.17563)
07/2023|[HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models](https://arxiv.org/abs/2307.06949)
09/2023|[Uncovering mesa-optimization algorithms in Transformers](https://arxiv.org/abs/2309.05858)
10/2023|[Think before you speak: Training Language Models With Pause Tokens](https://arxiv.org/abs/2310.02226)
10/2023|[SpecTr: Fast Speculative Decoding via Optimal Transport](https://arxiv.org/abs/2310.15141)
11/2023|[UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs](https://arxiv.org/abs/2311.09257)
11/2023|[Automatic Engineering of Long Prompts](https://arxiv.org/abs/2311.10117)
12/2023|[Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses](https://arxiv.org/abs/2312.00763)
12/2023|[Style Aligned Image Generation via Shared Attention](https://arxiv.org/abs/2312.02133)
01/2024|[A Minimaximalist Approach to Reinforcement Learning from Human Feedback (SPO)](https://arxiv.org/abs/2401.04056)
02/2024|[Time-, Memory- and Parameter-Efficient Visual Adaptation (LoSA)](https://arxiv.org/abs/2402.02887)
02/2024|[Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context](https://arxiv.org/abs/2403.05530)
03/2024|[PERL: Parameter Efficient Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2403.10704)
04/2024|[TransformerFAM: Feedback attention is working memory](https://arxiv.org/abs/2404.09173)
05/2024|[eXmY: A Data Type and Technique for Arbitrary Bit Precision Quantization](https://arxiv.org/abs/2405.13938)
05/2024|[Faster Cascades via Speculative Decoding](https://arxiv.org/abs/2405.19261)
06/2024|[Proofread: Fixes All Errors with One Tap](https://arxiv.org/abs/2406.04523)
|
|**Deepmind (Google Deepmind as of 4/2023)** ->[Papers](https://www.deepmind.com/research) [Blog](https://www.deepmind.com/blog)<-
10/2019|[Stabilizing Transformers for Reinforcement Learning](https://arxiv.org/abs/1910.06764)
12/2021|[Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/abs/2112.11446)
12/2021|[Improving language models by retrieving from trillions of tokens (RETRO)](https://arxiv.org/abs/2112.04426)
02/2022|[Competition-Level Code Generation with AlphaCode](https://arxiv.org/abs/2203.07814)
02/2022|[Unified Scaling Laws for Routed Language Models](https://arxiv.org/abs/2202.01169)
03/2022|[Training Compute-Optimal Large Language Models (Chinchilla)](https://arxiv.org/abs/2203.15556)
04/2022|[Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
05/2022|[A Generalist Agent (GATO)](https://arxiv.org/abs/2205.06175)
07/2022|[Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238)
02/2023|[Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
05/2023|[Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601)
05/2023|[Block-State Transformer](https://arxiv.org/abs/2306.09539)
05/2023|[Randomized Positional Encodings Boost Length Generalization of Transformers](https://arxiv.org/abs/2305.16843)
08/2023|[From Sparse to Soft Mixtures of Experts](https://arxiv.org/abs/2308.00951)
09/2023|[Large Language Models as Optimizers](https://arxiv.org/abs/2309.03409)
09/2023|[MADLAD-400: A Multilingual And Document-Level Large Audited Dataset (MT Model)](https://arxiv.org/abs/2309.04662)
09/2023|[Scaling Laws for Sparsely-Connected Foundation Models](https://arxiv.org/abs/2309.08520)
09/2023|[Language Modeling Is Compression](https://arxiv.org/abs/2309.10668)
09/2023|[Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution](https://arxiv.org/abs/2309.16797)
10/2023|[Large Language Models as Analogical Reasoners](https://arxiv.org/abs/2310.01714)
10/2023|[Controlled Decoding from Language Models](https://arxiv.org/abs/2310.17022)
10/2023|[A General Theoretical Paradigm to Understand Learning from Human Preferences](https://arxiv.org/abs/2310.12036)
11/2023|[DiLoCo: Distributed Low-Communication Training of Language Models](https://arxiv.org/abs/2311.08105)
12/2023|[Gemini: A Family of Highly Capable Multimodal Models](https://files.catbox.moe/g7nn73.pdf)
12/2023|[AlphaCode 2 Technical Report](https://files.catbox.moe/lqpb7g.pdf)
12/2023|[Chain of Code: Reasoning with a Language Model-Augmented Code Emulator](https://arxiv.org/abs/2312.04474)
12/2023|[Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](https://arxiv.org/abs/2312.06585)
12/2023|[Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding](https://arxiv.org/abs/2312.05328)
01/2024|[Solving olympiad geometry without human demonstrations](https://files.catbox.moe/3fu2lc.pdf)
02/2024|[LiPO: Listwise Preference Optimization through Learning-to-Rank](https://arxiv.org/abs/2402.01878)
02/2024|[Grandmaster-Level Chess Without Search](https://arxiv.org/abs/2402.04494)
02/2024|[How to Train Data-Efficient LLMs](https://arxiv.org/abs/2402.09668)
02/2024|[A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts](https://arxiv.org/abs/2402.09727)
02/2024|[Gemma: Open Models Based on Gemini Research and Technology](https://files.catbox.moe/og82ni.pdf)
02/2024|[Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391)
02/2024|[Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models](https://arxiv.org/abs/2402.19427)
03/2024|[DiPaCo: Distributed Path Composition](https://arxiv.org/abs/2403.10616)
04/2024|[Mixture-of-Depths: Dynamically allocating compute in transformer-based language models](https://arxiv.org/abs/2404.02258)
05/2024|[Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities](https://arxiv.org/abs/2405.18669)
06/2024|[Transformers meet Neural Algorithmic Reasoners](https://arxiv.org/abs/2406.09308)
06/2024|[Gemma 2: Improving Open Language Models at a Practical Size](https://files.catbox.moe/xpysuw.pdf)
06/2024|[Data curation via joint example selection further accelerates multimodal learning](https://arxiv.org/abs/2406.17711)
07/2024|[PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726)
07/2024|[LookupViT: Compressing visual information to a limited number of tokens](https://arxiv.org/abs/2407.12753)
07/2024|[Mixture of Nested Experts: Adaptive Processing of Visual Tokens](https://arxiv.org/abs/2407.19985)
|
|**Meta (Facebook AI Research)** ->[Papers](https://ai.facebook.com/results/?content_types%5B0%5D=publication) [Blog](https://ai.facebook.com/blog)<-
04/2019|[fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038)
07/2019|[Augmenting Self-attention with Persistent Memory](https://arxiv.org/abs/1907.01470)
11/2019|[Improving Transformer Models by Reordering their Sublayers](https://arxiv.org/abs/1911.03864)
08/2021|[Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
03/2022|[Training Logbook for OPT-175B](https://files.catbox.moe/u1836w.pdf)
05/2022|[OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068)
07/2022|[Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)
11/2022|[Galactica: A Large Language Model for Science](https://arxiv.org/abs/2211.09085)
01/2023|[Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)](https://arxiv.org/abs/2301.08243)
02/2023|[LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
02/2023|[Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761)
03/2023|[Scaling Expert Language Models with Unsupervised Domain Discovery](https://arxiv.org/abs/2303.14177)
03/2023|[SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540)
04/2023|[Segment Anything (SAM)](https://arxiv.org/abs/2304.02643)
04/2023|[A Cookbook of Self-Supervised Learning](https://arxiv.org/abs/2304.12210)
05/2023|[Learning to Reason and Memorize with Self-Notes](https://arxiv.org/abs/2305.00833)
05/2023|[ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665)
05/2023|[MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers](https://arxiv.org/abs/2305.07185)
05/2023|[LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206)
05/2023|[Scaling Speech Technology to 1,000+ Languages](https://files.catbox.moe/6j8gka.pdf)
05/2023|[READ: Recurrent Adaptation of Large Transformers](https://arxiv.org/abs/2305.15348)
05/2023|[LLM-QAT: Data-Free Quantization Aware Training for Large Language Models](https://arxiv.org/abs/2305.17888)
05/2023|[Physics of Language Models: Part 1, Learning Hierarchical Language Structures](https://arxiv.org/abs/2305.13673)
06/2023|[Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/abs/2306.00989)
06/2023|[Simple and Controllable Music Generation (MusicGen)](https://arxiv.org/abs/2306.05284)
06/2023|[Improving Open Language Models by Learning from Organic Interactions (BlenderBot 3x)](https://arxiv.org/abs/2306.04707)
06/2023|[Extending Context Window of Large Language Models via Positional Interpolation](https://arxiv.org/abs/2306.15595)
06/2023|[Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale](https://arxiv.org/abs/2306.15687)
07/2023|[Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3leon)](https://arxiv.org/abs/2309.02591)
07/2023|[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://files.catbox.moe/tuog0d.pdf)
08/2023|[SeamlessM4T—Massively Multilingual & Multimodal Machine Translation](https://files.catbox.moe/bdw0bw.pdf)
08/2023|[D4: Improving LLM Pretraining via Document De-Duplication and Diversification](https://arxiv.org/abs/2308.12284)
08/2023|[Code Llama: Open Foundation Models for Code](https://files.catbox.moe/hfy4wf.pdf)
08/2023|[Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)
09/2023|[Contrastive Decoding Improves Reasoning in Large Language Models](https://arxiv.org/abs/2309.09117)
09/2023|[Effective Long-Context Scaling of Foundation Models](https://arxiv.org/abs/2309.16039)
09/2023|[AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model](https://arxiv.org/abs/2309.16058)
09/2023|[Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588)
09/2023|[Physics of Language Models: Part 3.1, Knowledge Storage and Extraction](https://arxiv.org/abs/2309.14316)
09/2023|[Physics of Language Models: Part 3.2, Knowledge Manipulation](https://arxiv.org/abs/2309.14402)
10/2023|[RA-DIT: Retrieval-Augmented Dual Instruction Tuning](https://arxiv.org/abs/2310.01352)
10/2023|[Branch-Solve-Merge Improves Large Language Model Evaluation and Generation](https://arxiv.org/abs/2310.15123)
10/2023|[Generative Pre-training for Speech with Flow Matching](https://arxiv.org/abs/2310.16338)
11/2023|[Emu Edit: Precise Image Editing via Recognition and Generation Tasks](https://arxiv.org/abs/2311.10089)
12/2023|[Audiobox: Unified Audio Generation with Natural Language Prompts](https://arxiv.org/abs/2312.15821)
12/2023|[Universal Pyramid Adversarial Training for Improved ViT Performance](https://arxiv.org/abs/2312.16339)
01/2024|[Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020)
02/2024|[Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)](https://files.catbox.moe/gn25vw.pdf)
02/2024|[MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases](https://arxiv.org/abs/2402.14905)
03/2024|[Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM](https://arxiv.org/abs/2403.07816)
03/2024|[Reverse Training to Nurse the Reversal Curse](https://arxiv.org/abs/2403.13799)
04/2024|[Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405)
04/2024|[Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length](https://arxiv.org/abs/2404.08801)
04/2024|[TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding](https://arxiv.org/abs/2404.11912)
04/2024|[Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding](https://arxiv.org/abs/2404.16710)
04/2024|[MoDE: CLIP Data Experts via Clustering](https://arxiv.org/abs/2404.16030)
04/2024|[Iterative Reasoning Preference Optimization](https://arxiv.org/abs/2404.19733)
04/2024|[Better & Faster Large Language Models via Multi-token Prediction](https://arxiv.org/abs/2404.19737)
05/2024|[Modeling Caption Diversity in Contrastive Vision-Language Pretraining (LLIP)](https://arxiv.org/abs/2405.00740)
05/2024|[Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://arxiv.org/abs/2405.09818)
05/2024|[SpinQuant -- LLM quantization with learned rotations](https://arxiv.org/abs/2405.16406)
05/2024|[Contextual Position Encoding: Learning to Count What's Important](https://arxiv.org/abs/2405.18719)
06/2024|[The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More](https://arxiv.org/abs/2406.05183)
06/2024|[Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement](https://arxiv.org/abs/2406.07515)
07/2024|[The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)
07/2024|[SAM 2: Segment Anything in Images and Videos](https://arxiv.org/abs/2408.00714)
07/2024|[Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process](https://arxiv.org/abs/2407.20311)
07/2024|[MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts](https://arxiv.org/abs/2407.21770)
|
|**Microsoft** ->[Papers](https://www.microsoft.com/en-us/research/research-area/artificial-intelligence/?) [Blog](https://www.microsoft.com/en-us/research/blog)<-
12/2015|[Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
05/2021|[EL-Attention: Memory Efficient Lossless Attention for Generation](https://arxiv.org/abs/2105.04779)
01/2022|[DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)
03/2022|[DeepNet: Scaling Transformers to 1,000 Layers](https://arxiv.org/abs/2203.00555)
12/2022|[A Length-Extrapolatable Transformer](https://arxiv.org/abs/2212.10554)
01/2023|[Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases](https://arxiv.org/abs/2301.12017)
02/2023|[Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)](https://arxiv.org/abs/2302.14045)
03/2023|[Sparks of Artificial General Intelligence: Early experiments with GPT-4](https://arxiv.org/abs/2303.12712)
03/2023|[TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs](https://arxiv.org/abs/2303.16434)
04/2023|[Instruction Tuning with GPT-4](https://arxiv.org/abs/2304.03277)
04/2023|[Inference with Reference: Lossless Acceleration of Large Language Models](https://arxiv.org/abs/2304.04487)
04/2023|[Low-code LLM: Visual Programming over LLMs](https://arxiv.org/abs/2304.08103)
04/2023|[WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
04/2023|[MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks](https://arxiv.org/abs/2304.14979)
04/2023|[ResiDual: Transformer with Dual Residual Connections](https://arxiv.org/abs/2304.14802)
05/2023|[Code Execution with Pre-trained Language Models](https://arxiv.org/abs/2305.05383)
05/2023|[Small Models are Valuable Plug-ins for Large Language Models](https://arxiv.org/abs/2305.08848)
05/2023|[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://arxiv.org/abs/2305.11738)
06/2023|[Orca: Progressive Learning from Complex Explanation Traces of GPT-4](https://arxiv.org/abs/2306.02707)
06/2023|[Augmenting Language Models with Long-Term Memory](https://arxiv.org/abs/2306.07174)
06/2023|[WizardCoder: Empowering Code Large Language Models with Evol-Instruct](https://arxiv.org/abs/2306.08568)
06/2023|[Textbooks Are All You Need (phi-1)](https://arxiv.org/abs/2306.11644)
07/2023|[In-context Autoencoder for Context Compression in a Large Language Model](https://arxiv.org/abs/2307.06945)
07/2023|[Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/abs/2307.08621)
08/2023|[Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference](https://arxiv.org/abs/2308.12066)
09/2023|[Efficient RLHF: Reducing the Memory Usage of PPO](https://arxiv.org/abs/2309.00754)
09/2023|[DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models](https://arxiv.org/abs/2309.03883)
09/2023|[Textbooks Are All You Need II (phi-1.5)](https://arxiv.org/abs/2309.05463)
09/2023|[PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training](https://arxiv.org/abs/2309.10400)
09/2023|[A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models](https://arxiv.org/abs/2309.11674)
09/2023|[Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models](https://arxiv.org/abs/2309.15098)
10/2023|[Sparse Backpropagation for MoE Training](https://arxiv.org/abs/2310.00811)
10/2023|[Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models](https://arxiv.org/abs/2310.02409)
10/2023|[Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness](https://arxiv.org/abs/2310.02410)
10/2023|[Augmented Embeddings for Custom Retrievals](https://arxiv.org/abs/2310.05380)
10/2023|[Guiding Language Model Reasoning with Planning Tokens](https://arxiv.org/abs/2310.05707)
10/2023|[Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https://arxiv.org/abs/2310.11441)
10/2023|[CodeFusion: A Pre-trained Diffusion Model for Code Generation](https://arxiv.org/abs/2310.17680)
10/2023|[LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery](https://arxiv.org/abs/2310.18356)
10/2023|[FP8-LM: Training FP8 Large Language Models](https://arxiv.org/abs/2310.18313)
11/2023|[Orca 2: Teaching Small Language Models How to Reason](https://arxiv.org/abs/2311.11045)
12/2023|[ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks](https://arxiv.org/abs/2312.08583)
12/2023|[The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction](https://arxiv.org/abs/2312.13558)
01/2024|[SliceGPT: Compress Large Language Models by Deleting Rows and Columns](https://arxiv.org/abs/2401.15024)
01/2024|[RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture](https://arxiv.org/abs/2401.08406)
02/2024|[LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens](https://arxiv.org/abs/2402.13753)
02/2024|[The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet)](https://arxiv.org/abs/2402.17764)
02/2024|[ResLoRA: Identity Residual Mapping in Low-Rank Adaption](https://arxiv.org/abs/2402.18039)
03/2024|[LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression](https://arxiv.org/abs/2403.12968)
03/2024|[SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series](https://arxiv.org/abs/2403.15360)
04/2024|[LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096)
04/2024|[Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https://arxiv.org/abs/2404.14219)
05/2024|[You Only Cache Once: Decoder-Decoder Architectures for Language Models (YOCO)](https://arxiv.org/abs/2405.05254)
06/2024|[Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling](https://arxiv.org/abs/2406.07522)
06/2024|[E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS](https://arxiv.org/abs/2406.18009)
06/2024|[Automatic Instruction Evolving for Large Language Models](https://arxiv.org/abs/2406.00770)
07/2024|[Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena](https://arxiv.org/abs/2407.10627)
07/2024|[Q-Sparse: All Large Language Models can be Fully Sparsely-Activated](https://arxiv.org/abs/2407.10969)
|
|**OpenAI** ->[Papers](https://openai.com/research) [Blog](https://openai.com/blog)<-
07/2017|[Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
04/2019|[Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509)
01/2020|[Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
05/2020|[Language Models are Few-Shot Learners (GPT-3)](https://arxiv.org/abs/2005.14165)
01/2022|[Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets](https://arxiv.org/abs/2201.02177)
03/2022|[Training language models to follow instructions with human feedback (InstructGPT)](https://arxiv.org/abs/2203.02155)
07/2022|[Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255)
03/2023|[GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)
04/2023|[Consistency Models](https://arxiv.org/abs/2303.01469)
05/2023|[Let's Verify Step by Step](https://arxiv.org/abs/2305.20050)
10/2023|[Improving Image Generation with Better Captions (DALL·E 3)](https://files.catbox.moe/e7jl5b.pdf)
|
|**Hazy Research (Stanford)** ->[Papers](https://cs.stanford.edu/people/chrismre/#papers) [Blog](https://hazyresearch.stanford.edu/blog)<-
10/2021|[Efficiently Modeling Long Sequences with Structured State Spaces (S4)](https://arxiv.org/abs/2111.00396)
04/2022|[Monarch: Expressive Structured Matrices for Efficient and Accurate Training](https://arxiv.org/abs/2204.00595)
05/2022|[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
12/2022|[Hungry Hungry Hippos: Towards Language Modeling with State Space Models](https://arxiv.org/abs/2212.14052)
02/2023|[Simple Hardware-Efficient Long Convolutions for Sequence Modeling](https://arxiv.org/abs/2302.06646)
02/2023|[Hyena Hierarchy: Towards Larger Convolutional Language Models](https://arxiv.org/abs/2302.10866)
06/2023|[TART: A plug-and-play Transformer module for task-agnostic reasoning](https://arxiv.org/abs/2306.07536)
07/2023|[FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://files.catbox.moe/arj3zc.pdf)
11/2023|[FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores](https://arxiv.org/abs/2311.05908)
|
|**THUDM (Tsinghua University)** ->[Papers](http://keg.cs.tsinghua.edu.cn/jietang/publication_list.html) [Github](https://github.com/THUDM)<-
10/2022|[GLM-130B: An Open Bilingual Pre-Trained Model](https://arxiv.org/abs/2210.02414)
03/2023|[CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X](https://arxiv.org/abs/2303.17568)
04/2023|[DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task](https://arxiv.org/abs/2304.01097)
06/2023|[WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences](https://arxiv.org/abs/2306.07906)
09/2023|[GPT Can Solve Mathematical Problems Without a Calculator (MathGLM)](https://arxiv.org/abs/2309.03241)
10/2023|[AgentTuning: Enabling Generalized Agent Abilities for LLMs (AgentLM)](https://arxiv.org/abs/2310.12823)
11/2023|[CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
12/2023|[CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914)
01/2024|[APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding](https://arxiv.org/abs/2401.06761)
01/2024|[LongAlign: A Recipe for Long Context Alignment of Large Language Models](https://arxiv.org/abs/2401.18058)
|
|**Open Models**
06/2021|[GPT-J-6B: 6B JAX-Based Transformer](https://archive.is/HPCbB)
09/2021|[Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning](https://arxiv.org/abs/2109.12021)
03/2022|[CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474)
04/2022|[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)
11/2022|[BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100)
12/2022|[DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders](https://arxiv.org/abs/2212.11613)
04/2023|[Visual Instruction Tuning (LLaVA)](https://arxiv.org/abs/2304.08485)
05/2023|[StarCoder: May the source be with you!](https://arxiv.org/abs/2305.06161)
05/2023|[CodeGen2: Lessons for Training LLMs on Programming and Natural Languages](https://arxiv.org/abs/2305.02309)
05/2023|[Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726)
05/2023|[InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)
05/2023|[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/abs/2305.07922)
05/2023|[ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities](https://arxiv.org/abs/2305.11172)
05/2023|[RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/abs/2305.13048)
05/2023|[Lion: Adversarial Distillation of Closed-Source Large Language Model](https://arxiv.org/abs/2305.12870)
05/2023|[MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training](https://arxiv.org/abs/2306.00107)
06/2023|[Segment Anything in High Quality](https://arxiv.org/abs/2306.01567)
06/2023|[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)
06/2023|[High-Fidelity Audio Compression with Improved RVQGAN (DAC)](https://arxiv.org/abs/2306.06546)
06/2023|[StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
06/2023|[Anticipatory Music Transformer](https://arxiv.org/abs/2306.08620)
06/2023|[RepoFusion: Training Code Models to Understand Your Repository](https://arxiv.org/abs/2306.10998)
06/2023|[MPT-30B: Raising the bar for open-source foundation models](https://archive.is/SOhKy)
06/2023|[Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity](https://arxiv.org/abs/2306.12689)
06/2023|[ViNT: A Foundation Model for Visual Navigation](https://arxiv.org/abs/2306.14846)
06/2023|[How Long Can Open-Source LLMs Truly Promise on Context Length? (LongChat)](https://archive.is/NfIj2)
07/2023|[Hierarchical Open-vocabulary Universal Image Segmentation](https://arxiv.org/abs/2307.00764)
07/2023|[Focused Transformer: Contrastive Training for Context Scaling (LongLLaMA)](https://arxiv.org/abs/2307.03170)
07/2023|[Rhythm Modeling for Voice Conversion (Urhythmic)](https://arxiv.org/abs/2307.06040)
07/2023|[Scaling TransNormer to 175 Billion Parameters](https://arxiv.org/abs/2307.14995)
08/2023|[Separate Anything You Describe](https://arxiv.org/abs/2308.05037)
08/2023|[StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data](https://arxiv.org/abs/2308.10253)
09/2023|[RADIO: Reference-Agnostic Dubbing Video Synthesis](https://arxiv.org/abs/2309.01950)
09/2023|[Matcha-TTS: A fast TTS architecture with conditional flow matching](https://arxiv.org/abs/2309.03199)
09/2023|[DreamLLM: Synergistic Multimodal Comprehension and Creation](https://arxiv.org/abs/2309.11499)
09/2023|[Baichuan 2: Open Large-scale Language Models](https://arxiv.org/abs/2309.10305)
09/2023|[Qwen Technical Report](https://files.catbox.moe/y61ihm.pdf)
09/2023|[Mistral 7B](https://files.catbox.moe/bars04.pdf)
10/2023|[MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning](https://arxiv.org/abs/2310.03731)
10/2023|[Improved Baselines with Visual Instruction Tuning (LLaVA 1.5)](https://arxiv.org/abs/2310.03744)
10/2023|[LLark: A Multimodal Foundation Model for Music](https://arxiv.org/abs/2310.07160)
10/2023|[SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/abs/2310.13289)
10/2023|[Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents](https://arxiv.org/abs/2310.19923)
11/2023|[Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models](https://arxiv.org/abs/2311.07919)
11/2023|[UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition](https://arxiv.org/abs/2311.15599)
11/2023|[YUAN 2.0: A Large Language Model with Localized Filtering-based Attention](https://arxiv.org/abs/2311.15786)
12/2023|[Making Large Multimodal Models Understand Arbitrary Visual Prompts (ViP-LLaVA)](https://arxiv.org/abs/2312.00784)
12/2023|[Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
12/2023|[OpenVoice: Versatile Instant Voice Cloning](https://arxiv.org/abs/2312.01479)
12/2023|[Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM)](https://arxiv.org/abs/2312.00785)
12/2023|[Magicoder: Source Code Is All You Need](https://arxiv.org/abs/2312.02120)
12/2023|[StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers](https://archive.is/cHoct)
12/2023|[MMM: Generative Masked Motion Model](https://arxiv.org/abs/2312.03596)
12/2023|[4M: Massively Multimodal Masked Modeling](https://arxiv.org/abs/2312.06647)
12/2023|[LLM360: Towards Fully Transparent Open-Source LLMs](https://arxiv.org/abs/2312.06550)
12/2023|[SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling](https://arxiv.org/abs/2312.15166)
01/2024|[DeepSeek LLM: Scaling Open-Source Language Models with Longtermism](https://arxiv.org/abs/2401.02954)
01/2024|[Mixtral of Experts](https://arxiv.org/abs/2401.04088)
01/2024|[EAT: Self-Supervised Pre-Training with Efficient Audio Transformer](https://arxiv.org/abs/2401.03497)
01/2024|[Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications](https://arxiv.org/abs/2401.06197)
01/2024|[Scalable Pre-training of Large Autoregressive Image Models](https://arxiv.org/abs/2401.08541)
01/2024|[Orion-14B: Open-source Multilingual Large Language Models](https://arxiv.org/abs/2401.12246)
01/2024|[Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891)
01/2024|[VMamba: Visual State Space Model](https://arxiv.org/abs/2401.10166)
01/2024|[DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence](https://arxiv.org/abs/2401.14196)
01/2024|[MoE-LLaVA: Mixture of Experts for Large Vision-Language Models](https://arxiv.org/abs/2401.15947)
01/2024|[LLaVA-1.6: Improved reasoning, OCR, and world knowledge](https://archive.is/WMr0Z)
01/2024|[MiniCPM: Unveiling the Potential of End-side Large Language Models](https://archive.is/IlMnJ)
01/2024|[Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild](https://arxiv.org/abs/2401.13627)
02/2024|[Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2402.00789)
02/2024|[Introducing Qwen1.5](https://archive.is/C6gpR)
02/2024|[BlackMamba: Mixture of Experts for State-Space Models](https://arxiv.org/abs/2402.01771)
02/2024|[DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)
02/2024|[EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss](https://arxiv.org/abs/2402.05008)
02/2024|[GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators](https://arxiv.org/abs/2402.06894)
02/2024|[Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://arxiv.org/abs/2402.10009)
02/2024|[Brant-2: Foundation Model for Brain Signals](https://arxiv.org/abs/2402.10251)
02/2024|[CLLMs: Consistency Large Language Models](https://arxiv.org/abs/2403.00835)
03/2024|[Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3)](https://files.catbox.moe/anmseu.pdf)
03/2024|[TripoSR: Fast 3D Object Reconstruction from a Single Image](https://arxiv.org/abs/2403.02151)
03/2024|[Yi: Open Foundation Models by 01.AI](https://arxiv.org/abs/2403.04652)
03/2024|[DeepSeek-VL: Towards Real-World Vision-Language Understanding](https://arxiv.org/abs/2403.05525)
03/2024|[VideoMamba: State Space Model for Efficient Video Understanding](https://arxiv.org/abs/2403.06977)
03/2024|[VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild](https://arxiv.org/abs/2403.16973)
03/2024|[GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation](https://arxiv.org/abs/2403.14621)
03/2024|[DBRX: A New State-of-the-Art Open LLM](https://archive.is/vP5bV)
03/2024|[AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation](https://arxiv.org/abs/2403.17694)
03/2024|[Jamba: A Hybrid Transformer-Mamba Language Model](https://arxiv.org/abs/2403.19887)
04/2024|[Advancing LLM Reasoning Generalists with Preference Trees (Eurus)](https://arxiv.org/abs/2404.02078)
04/2024|[Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR)](https://arxiv.org/abs/2404.02905)
04/2024|[Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence](https://arxiv.org/abs/2404.05892)
04/2024|[Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models](https://arxiv.org/abs/2404.13013)
05/2024|[DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](https://files.catbox.moe/5sguux.pdf)
05/2024|[Language-Image Models with 3D Understanding (Cube-LLM)](https://arxiv.org/abs/2405.03685)
05/2024|[AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding](https://arxiv.org/abs/2405.03121)
05/2024|[Pandora: Towards General World Model with Natural Language Actions and Video State](https://files.catbox.moe/c854p7.pdf)
05/2024|[TerDiT: Ternary Diffusion Models with Transformers](https://arxiv.org/abs/2405.14854)
05/2024|[NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models](https://arxiv.org/abs/2405.17428)
05/2024|[Phased Consistency Model](https://arxiv.org/abs/2405.18407)
05/2024|[MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series](https://arxiv.org/abs/2405.19327)
05/2024|[YOLOv10: Real-Time End-to-End Object Detection](https://arxiv.org/abs/2405.14458)
05/2024|[MegActor: Harness the Power of Raw Video for Vivid Portrait Animation](https://arxiv.org/abs/2405.20851)
06/2024|[Bootstrap3D: Improving 3D Content Creation with Synthetic Data](https://arxiv.org/abs/2406.00093)
06/2024|[EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture](https://arxiv.org/abs/2405.18991)
06/2024|[ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec](https://arxiv.org/abs/2406.01205)
06/2024|[GrootVL: Tree Topology is All You Need in State Space Model](https://arxiv.org/abs/2406.02395)
06/2024|[An Independence-promoting Loss for Music Generation with Language Models (MusicGen-MMD)](https://arxiv.org/abs/2406.02315)
06/2024|[Matching Anything by Segmenting Anything](https://arxiv.org/abs/2406.04221)
06/2024|[Nemotron-4 340B Technical Report](https://arxiv.org/abs/2406.11704)
06/2024|[DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence](https://arxiv.org/abs/2406.11931)
06/2024|[TroL: Traversal of Layers for Large Language and Vision Models](https://arxiv.org/abs/2406.12246)
06/2024|[Depth Anything V2](https://arxiv.org/abs/2406.09414)
06/2024|[HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale](https://arxiv.org/abs/2406.19280)
06/2024|[Network Bending of Diffusion Models for Audio-Visual Generation](https://arxiv.org/abs/2406.19589)
06/2024|[Less is More: Accurate Speech Recognition & Translation without Web-Scale Data (Canary)](https://arxiv.org/abs/2406.19674)
07/2024|[LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control](https://arxiv.org/abs/2407.03168)
07/2024|[Qwen2 Technical Report](https://arxiv.org/abs/2407.10671)
07/2024|[Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759)
07/2024|[ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449)
07/2024|[Compact Language Models via Pruning and Knowledge Distillation (Minitron)](https://arxiv.org/abs/2407.14679)
|
|**Various**
09/2014|[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
06/2019|[Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View](https://arxiv.org/abs/1906.02762)
10/2019|[Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
10/2019|[Transformers without Tears: Improving the Normalization of Self-Attention](https://arxiv.org/abs/1910.05895)
12/2019|[Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection](https://arxiv.org/abs/1912.11637)
02/2020|[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745)
04/2020|[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
04/2020|[Improved Natural Language Generation via Loss Truncation](https://arxiv.org/abs/2004.14589)
06/2020|[Memory Transformer](https://arxiv.org/abs/2006.11527)
07/2020|[Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity](https://arxiv.org/abs/2007.14966)
12/2020|[ERNIE-Doc: A Retrospective Long-Document Modeling Transformer](https://arxiv.org/abs/2012.15688)
01/2021|[Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks](https://arxiv.org/abs/2102.00554)
03/2021|[The Low-Rank Simplicity Bias in Deep Networks](https://arxiv.org/abs/2103.10427)
04/2021|[RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
06/2021|[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
08/2021|[CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention](https://arxiv.org/abs/2108.00154)
03/2022|[Memorizing Transformers](https://arxiv.org/abs/2203.08913)
04/2022|[UL2: Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131)
05/2022|[Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (IA3)](https://arxiv.org/abs/2205.05638)
06/2022|[nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models](https://arxiv.org/abs/2206.09557)
07/2022|[Language Models (Mostly) Know What They Know](https://arxiv.org/abs/2207.05221)
08/2022|[LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale](https://arxiv.org/abs/2208.07339)
09/2022|[Petals: Collaborative Inference and Fine-tuning of Large Models](https://arxiv.org/abs/2209.01188)
10/2022|[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
10/2022|[Recurrent Memory Transformer](https://files.catbox.moe/8trivt.pdf)
10/2022|[Truncation Sampling as Language Model Desmoothing](https://arxiv.org/abs/2210.15191)
10/2022|[DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation](https://arxiv.org/abs/2210.07558)
11/2022|[An Algorithm for Routing Vectors in Sequences](https://arxiv.org/abs/2211.11754)
11/2022|[MegaBlocks: Efficient Sparse Training with Mixture-of-Experts](https://arxiv.org/abs/2211.15841)
12/2022|[Self-Instruct: Aligning Language Model with Self Generated Instructions](https://arxiv.org/abs/2212.10560)
12/2022|[Parallel Context Windows Improve In-Context Learning of Large Language Models](https://arxiv.org/abs/2212.10947)
12/2022|[Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor](https://arxiv.org/abs/2212.09689)
12/2022|[Pretraining Without Attention](https://arxiv.org/abs/2212.10544)
12/2022|[The case for 4-bit precision: k-bit Inference Scaling Laws](https://arxiv.org/abs/2212.09720)
12/2022|[Prompting Is Programming: A Query Language for Large Language Models](https://arxiv.org/abs/2212.06094)
01/2023|[SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient](https://arxiv.org/abs/2301.11913)
01/2023|[SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774)
01/2023|[Memory Augmented Large Language Models are Computationally Universal](https://arxiv.org/abs/2301.04589)
01/2023|[Progress measures for grokking via mechanistic interpretability](https://arxiv.org/abs/2301.05217)
01/2023|[Adaptive Computation with Elastic Input Sequence](https://arxiv.org/abs/2301.13195)
02/2023|[Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models](https://arxiv.org/abs/2302.02599)
02/2023|[The Wisdom of Hindsight Makes Language Models Better Instruction Followers](https://arxiv.org/abs/2302.05206)
02/2023|[The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation](https://arxiv.org/abs/2302.06784)
03/2023|[COLT5: Faster Long-Range Transformers with Conditional Computation](https://arxiv.org/abs/2303.09752)
03/2023|[High-throughput Generative Inference of Large Language Models with a Single GPU](https://arxiv.org/abs/2303.06865)
03/2023|[Meet in the Middle: A New Pre-training Paradigm](https://arxiv.org/abs/2303.07295)
03/2023|[Reflexion: an autonomous agent with dynamic memory and self-reflection](https://arxiv.org/abs/2303.11366)
03/2023|[Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.15647)
03/2023|[FP8 versus INT8 for efficient deep learning inference](https://arxiv.org/abs/2303.17951)
03/2023|[Self-Refine: Iterative Refinement with Self-Feedback](https://arxiv.org/abs/2303.17651)
04/2023|[RPTQ: Reorder-based Post-training Quantization for Large Language Models](https://arxiv.org/abs/2304.01089)
04/2023|[REFINER: Reasoning Feedback on Intermediate Representations](https://arxiv.org/abs/2304.01904)
04/2023|[Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442)
04/2023|[Compressed Regression over Adaptive Networks](https://arxiv.org/abs/2304.03638)
04/2023|[A Cheaper and Better Diffusion Language Model with Soft-Masked Noise](https://arxiv.org/abs/2304.04746)
04/2023|[RRHF: Rank Responses to Align Language Models with Human Feedback without tears](https://arxiv.org/abs/2304.05302)
04/2023|[CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society](https://arxiv.org/abs/2303.17760)
04/2023|[Automatic Gradient Descent: Deep Learning without Hyperparameters](https://arxiv.org/abs/2304.05187)
04/2023|[SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models](https://arxiv.org/abs/2303.10464)
04/2023|[Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study](https://arxiv.org/abs/2304.06762)
04/2023|[Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling](https://arxiv.org/abs/2304.09145)
04/2023|[Scaling Transformer to 1M tokens and beyond with RMT](https://arxiv.org/abs/2304.11062)
04/2023|[Answering Questions by Meta-Reasoning over Multiple Chains of Thought](https://arxiv.org/abs/2304.13007)
04/2023|[Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables](https://arxiv.org/abs/2304.13559)
04/2023|[We're Afraid Language Models Aren't Modeling Ambiguity](https://arxiv.org/abs/2304.14399)
04/2023|[The Internal State of an LLM Knows When its Lying](https://arxiv.org/abs/2304.13734)
04/2023|[Search-in-the-Chain: Towards the Accurate, Credible and Traceable Content Generation for Complex Knowledge-intensive Tasks](https://arxiv.org/abs/2304.14732)
05/2023|[Towards Unbiased Training in Federated Open-world Semi-supervised Learning](https://arxiv.org/abs/2305.00771)
05/2023|[Unlimiformer: Long-Range Transformers with Unlimited Length Input](https://arxiv.org/abs/2305.01625)
05/2023|[FreeLM: Fine-Tuning-Free Language Model](https://arxiv.org/abs/2305.01616)
05/2023|[Cuttlefish: Low-rank Model Training without All The Tuning](https://arxiv.org/abs/2305.02538)
05/2023|[AttentionViz: A Global View of Transformer Attention](https://arxiv.org/abs/2305.03210)
05/2023|[Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models](https://arxiv.org/abs/2305.04091)
05/2023|[A Frustratingly Easy Improvement for Position Embeddings via Random Padding](https://arxiv.org/abs/2305.04859)
05/2023|[Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision](https://arxiv.org/abs/2305.03047)
05/2023|[Explanation-based Finetuning Makes Models More Robust to Spurious Cues](https://arxiv.org/abs/2305.04990)
05/2023|[An automatically discovered chain-of-thought prompt generalizes to novel models and datasets](https://arxiv.org/abs/2305.02897)
05/2023|[Recommender Systems with Generative Retrieval](https://arxiv.org/abs/2305.05065)
05/2023|[Fast Distributed Inference Serving for Large Language Models](https://arxiv.org/abs/2305.05920)
05/2023|[Chain-of-Dictionary Prompting Elicits Translation in Large Language Models](https://arxiv.org/abs/2305.06575)
05/2023|[Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach](https://arxiv.org/abs/2305.07001)
05/2023|[Active Retrieval Augmented Generation](https://arxiv.org/abs/2305.06983)
05/2023|[Scalable Coupling of Deep Learning with Logical Reasoning](https://arxiv.org/abs/2305.07617)
05/2023|[Interpretability at Scale: Identifying Causal Mechanisms in Alpaca](https://arxiv.org/abs/2305.08809)
05/2023|[StructGPT: A General Framework for Large Language Model to Reason over Structured Data](https://arxiv.org/abs/2305.09645)
05/2023|[Pre-Training to Learn in Context](https://arxiv.org/abs/2305.09137)
05/2023|[ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings](https://arxiv.org/abs/2305.11554)
05/2023|[Accelerating Transformer Inference for Translation via Parallel Decoding](https://arxiv.org/abs/2305.10427)
05/2023|[Cooperation Is All You Need](https://arxiv.org/abs/2305.10449)
05/2023|[PTQD: Accurate Post-Training Quantization for Diffusion Models](https://arxiv.org/abs/2305.10657)
05/2023|[LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627)
05/2023|[SelfzCoT: a Self-Prompt Zero-shot CoT from Semantic-level to Code-level for a Better Utilization of LLMs](https://arxiv.org/abs/2305.11461)
05/2023|[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
05/2023|["According to ..." Prompting Language Models Improves Quoting from Pre-Training Data](https://arxiv.org/abs/2305.13252)
05/2023|[Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training](https://arxiv.org/abs/2305.14342)
05/2023|[Landmark Attention: Random-Access Infinite Context Length for Transformers](https://arxiv.org/abs/2305.16300)
05/2023|[Scaling Data-Constrained Language Models](https://arxiv.org/abs/2305.16264)
05/2023|[Fine-Tuning Language Models with Just Forward Passes](https://arxiv.org/abs/2305.17333)
05/2023|[Intriguing Properties of Quantization at Scale](https://arxiv.org/abs/2305.19268)
05/2023|[Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time](https://arxiv.org/abs/2305.17118)
05/2023|[Blockwise Parallel Transformer for Long Context Large Models](https://arxiv.org/abs/2305.19370)
05/2023|[The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)
05/2023|[Adapting Language Models to Compress Contexts](https://arxiv.org/abs/2305.14788)
05/2023|[Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)
06/2023|[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978)
06/2023|[Faster Causal Attention Over Large Sequences Through Sparse Flash Attention](https://arxiv.org/abs/2306.01160)
06/2023|[Fine-Grained Human Feedback Gives Better Rewards for Language Model Training](https://arxiv.org/abs/2306.01693)
06/2023|[SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078)
06/2023|[Fine-Tuning Language Models with Advantage-Induced Policy Alignment](https://arxiv.org/abs/2306.02231)
06/2023|[Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards](https://arxiv.org/abs/2306.04488)
06/2023|[Inference-Time Intervention: Eliciting Truthful Answers from a Language Model](https://arxiv.org/abs/2306.03341)
06/2023|[Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories](https://arxiv.org/abs/2306.05406)
06/2023|[Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion](https://arxiv.org/abs/2306.05708)
06/2023|[Word sense extension](https://arxiv.org/abs/2306.05609)
06/2023|[Mitigating Transformer Overconfidence via Lipschitz Regularization](https://arxiv.org/abs/2306.06849)
06/2023|[Recurrent Attention Networks for Long-text Modeling](https://arxiv.org/abs/2306.06843)
06/2023|[One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning](https://arxiv.org/abs/2306.07967)
06/2023|[SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629)
06/2023|[Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training](https://arxiv.org/abs/2306.08055)
06/2023|[Propagating Knowledge Updates to LMs Through Distillation](https://arxiv.org/abs/2306.09306)
06/2023|[Full Parameter Fine-tuning for Large Language Models with Limited Resources](https://arxiv.org/abs/2306.09782)
06/2023|[A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695)
06/2023|[InRank: Incremental Low-Rank Learning](https://arxiv.org/abs/2306.11250)
06/2023|[Evaluating the Zero-shot Robustness of Instruction-tuned Language Models](https://arxiv.org/abs/2306.11270)
06/2023|[Learning to Generate Better Than Your LLM (RLGF)](https://arxiv.org/abs/2306.11816)
06/2023|[Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing](https://arxiv.org/abs/2306.12929)
06/2023|[H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Model](https://arxiv.org/abs/2306.14048)
06/2023|[FLuRKA: Fast fused Low-Rank & Kernel Attention](https://arxiv.org/abs/2306.15799)
06/2023|[Stay on topic with Classifier-Free Guidance](https://arxiv.org/abs/2306.17806)
07/2023|[AutoST: Training-free Neural Architecture Search for Spiking Transformers](https://arxiv.org/abs/2307.00293)
07/2023|[Single Sequence Prediction over Reasoning Graphs for Multi-hop QA](https://arxiv.org/abs/2307.00335)
07/2023|[Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models](https://arxiv.org/abs/2307.01379)
07/2023|[Facing off World Model Backbones: RNNs, Transformers, and S4](https://arxiv.org/abs/2307.02064)
07/2023|[Improving Retrieval-Augmented Large Language Models via Data Importance Learning](https://arxiv.org/abs/2307.03027)
07/2023|[Teaching Arithmetic to Small Transformers](https://arxiv.org/abs/2307.03381)
07/2023|[QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models](https://arxiv.org/abs/2307.03738)
07/2023|[Stack More Layers Differently: High-Rank Training Through Low-Rank Updates](https://arxiv.org/abs/2307.05695)
07/2023|[Copy Is All You Need (CoG)](https://arxiv.org/abs/2307.06962)
07/2023|[Multi-Method Self-Training: Improving Code Generation With Text, And Vice Versa](https://arxiv.org/abs/2307.10633)
07/2023|[Divide & Bind Your Attention for Improved Generative Semantic Nursing](https://arxiv.org/abs/2307.10864)
07/2023|[Challenges and Applications of Large Language Models](https://arxiv.org/abs/2307.10169)
07/2023|[Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models](https://arxiv.org/abs/2307.08303)
07/2023|[QuIP: 2-Bit Quantization of Large Language Models With Guarantees](https://arxiv.org/abs/2307.13304)
07/2023|[CoRe Optimizer: An All-in-One Solution for Machine Learning](https://arxiv.org/abs/2307.15663)
07/2023|[Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https://arxiv.org/abs/2310.17157)
08/2023|[ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation](https://arxiv.org/abs/2308.03793)
08/2023|[EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models](https://arxiv.org/abs/2308.07269)
08/2023|[Activation Addition: Steering Language Models Without Optimization](https://arxiv.org/abs/2308.10248)
08/2023|[OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://arxiv.org/abs/2308.13137)
08/2023|[Accelerating LLM Inference with Staged Speculative Decoding](https://arxiv.org/abs/2308.04623)
08/2023|[YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/abs/2309.00071)
08/2023|[LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models](https://arxiv.org/abs/2308.16137)
09/2023|[Making Large Language Models Better Reasoners with Alignment](https://arxiv.org/abs/2309.02144)
09/2023|[Data-Juicer: A One-Stop Data Processing System for Large Language Models](https://arxiv.org/abs/2309.02033)
09/2023|[Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices](https://arxiv.org/abs/2309.02411)
09/2023|[SLiMe: Segment Like Me](https://arxiv.org/abs/2309.03179)
09/2023|[Norm Tweaking: High-performance Low-bit Quantization of Large Language Models](https://arxiv.org/abs/2309.02784)
09/2023|[When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale](https://arxiv.org/abs/2309.04564)
09/2023|[Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516)
09/2023|[Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)
09/2023|[Cure the headache of Transformers via Collinear Constrained Attention](https://arxiv.org/abs/2309.08646)
09/2023|[Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity](https://arxiv.org/abs/2309.10285)
09/2023|[LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307)
09/2023|[MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation](https://arxiv.org/abs/2309.13042)
09/2023|[Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models](https://arxiv.org/abs/2309.15531)
09/2023|[Improving Code Generation by Dynamic Temperature Sampling](https://arxiv.org/abs/2309.02772)
09/2023|[Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
10/2023|[DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models](https://arxiv.org/abs/2310.00902)
10/2023|[GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length](https://arxiv.org/abs/2310.00576)
10/2023|[Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models](https://arxiv.org/abs/2310.01107)
10/2023|[Elephant Neural Networks: Born to Be a Continual Learner](https://arxiv.org/abs/2310.01365)
10/2023|[Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889)
10/2023|[Retrieval meets Long Context Large Language Models](https://arxiv.org/abs/2310.03025)
10/2023|[DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines](https://arxiv.org/abs/2310.03714)
10/2023|[LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers](https://arxiv.org/abs/2310.03294)
10/2023|[Amortizing intractable inference in large language models (GFlowNet Tuning)](https://arxiv.org/abs/2310.04363)
10/2023|[SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF](https://arxiv.org/abs/2310.05344)
10/2023|[Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity](https://arxiv.org/abs/2310.05175)
10/2023|[Let Models Speak Ciphers: Multiagent Debate through Embeddings](https://arxiv.org/abs/2310.06272)
10/2023|[InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining](https://arxiv.org/abs/2310.07713)
10/2023|[CacheGen: Fast Context Loading for Language Model Applications](https://arxiv.org/abs/2310.07240)
10/2023|[MatFormer: Nested Transformer for Elastic Inference](https://arxiv.org/abs/2310.07707)
10/2023|[LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models](https://arxiv.org/abs/2310.08659)
10/2023|[Towards End-to-end 4-Bit Inference on Generative Large Language Models (QUIK)](https://arxiv.org/abs/2310.09259)
10/2023|[Microscaling Data Formats for Deep Learning](https://arxiv.org/abs/2310.10537)
10/2023|[xVal: A Continuous Number Encoding for Large Language Models](https://arxiv.org/abs/2310.02989)
10/2023|[An Emulator for Fine-Tuning Large Language Models using Small Language Models](https://arxiv.org/abs/2310.12962)
10/2023|[Frozen Transformers in Language Models Are Effective Visual Encoder Layers](https://arxiv.org/abs/2310.12973)
10/2023|[LoBaSS: Gauging Learnability in Supervised Fine-tuning Data](https://arxiv.org/abs/2310.13008)
10/2023|[Quality-Diversity through AI Feedback](https://arxiv.org/abs/2310.13032)
10/2023|[DoGE: Domain Reweighting with Generalization Estimation](https://arxiv.org/abs/2310.15393)
10/2023|[E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity](https://arxiv.org/abs/2310.15929)
10/2023|[Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation](https://arxiv.org/abs/2310.15961)
10/2023|[Personas as a Way to Model Truthfulness in Language Models](https://arxiv.org/abs/2310.18168)
10/2023|[Atom: Low-bit Quantization for Efficient and Accurate LLM Serving](https://arxiv.org/abs/2310.19102)
10/2023|[QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models](https://arxiv.org/abs/2310.16795)
11/2023|[AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models](https://arxiv.org/abs/2311.01305)
11/2023|[FlashDecoding++: Faster Large Language Model Inference on GPUs](https://arxiv.org/abs/2311.01282)
11/2023|[Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization](https://arxiv.org/abs/2311.01544)
11/2023|[Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs](https://arxiv.org/abs/2311.02262)
11/2023|[REST: Retrieval-Based Speculative Decoding](https://arxiv.org/abs/2311.08252)
11/2023|[DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines](https://arxiv.org/abs/2311.10418)
11/2023|[Token-level Adaptation of LoRA Adapters for Downstream Task Generalization](https://arxiv.org/abs/2311.10847)
11/2023|[Exponentially Faster Language Modelling](https://arxiv.org/abs/2311.10770)
11/2023|[MultiLoRA: Democratizing LoRA for Better Multi-Task Learning](https://arxiv.org/abs/2311.11501)
11/2023|[LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning](https://arxiv.org/abs/2311.12023)
11/2023|[Token Recycling for Efficient Sequential Inference with Vision Transformers](https://arxiv.org/abs/2311.15335)
11/2023|[Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization](https://arxiv.org/abs/2311.16442)
12/2023|[GIFT: Generative Interpretable Fine-Tuning Transformers](https://arxiv.org/abs/2312.00700)
12/2023|[PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models](https://arxiv.org/abs/2312.02429)
12/2023|[Improving Activation Steering in Language Models with Mean-Centring](https://arxiv.org/abs/2312.03813)
12/2023|[A Rank Stabilization
Scaling Factor for Fine-Tuning with LoRA](https://arxiv.org/abs/2312.03732) 12/2023|[SparQ Attention: Bandwidth-Efficient LLM Inference](https://arxiv.org/abs/2312.04985) 12/2023|[ESPN: Memory-Efficient Multi-Vector Information Retrieval](https://arxiv.org/abs/2312.05417) 12/2023|[Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models](https://arxiv.org/abs/2312.05503) 12/2023|[CBQ: Cross-Block Quantization for Large Language Models](https://arxiv.org/abs/2312.07950) 12/2023|[SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention](https://arxiv.org/abs/2312.07987) 12/2023|[Weight subcloning: direct initialization of transformers using larger pretrained ones](https://arxiv.org/abs/2312.09299) 12/2023|[Cascade Speculative Drafting for Even Faster LLM Inference](https://arxiv.org/abs/2312.11462) 12/2023|[ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference](https://arxiv.org/abs/2312.11882) 12/2023|[Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy](https://arxiv.org/abs/2312.12728) 12/2023|[A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties](https://arxiv.org/abs/2312.13764) 12/2023|[Algebraic Positional Encodings](https://arxiv.org/abs/2312.16045) 12/2023|[Preference as Reward, Maximum Preference Optimization with Importance Sampling](https://arxiv.org/abs/2312.16430) 01/2024|[LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning](https://arxiv.org/abs/2401.01325) 01/2024|[Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/abs/2401.01335) 01/2024|[LLaMA Pro: Progressive LLaMA with Block Expansion](https://arxiv.org/abs/2401.02415) 01/2024|[Fast and Optimal Weight Update for Pruned Large Language Models](https://arxiv.org/abs/2401.02938) 01/2024|[Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon](https://arxiv.org/abs/2401.03462) 01/2024|[MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts](https://arxiv.org/abs/2401.04081) 01/2024|[Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning](https://arxiv.org/abs/2401.04151) 01/2024|[RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation](https://arxiv.org/abs/2401.04679) 01/2024|[Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models](https://arxiv.org/abs/2401.04658) 01/2024|[AUTOACT: Automatic Agent Learning from Scratch via Self-Planning](https://arxiv.org/abs/2401.05268) 01/2024|[Extreme Compression of Large Language Models via Additive Quantization (AQLM)](https://arxiv.org/abs/2401.06118) 01/2024|[Knowledge Translation: A New Pathway for Model Compression](https://arxiv.org/abs/2401.05772) 01/2024|[Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks](https://arxiv.org/abs/2401.02731) 01/2024|[Transformers are Multi-State RNNs](https://arxiv.org/abs/2401.06104) 01/2024|[Extending LLMs' Context Window with 100 Samples (Entropy-ABF)](https://arxiv.org/abs/2401.07004) 01/2024|[ChatQA: Building GPT-4 Level Conversational QA Models](https://arxiv.org/abs/2401.10225) 01/2024|[AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference](https://arxiv.org/abs/2401.10652) 01/2024|[Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding 
Heads](https://arxiv.org/abs/2401.10774) 01/2024|[Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://arxiv.org/abs/2401.08417) 01/2024|[BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models](https://arxiv.org/abs/2401.12522) 01/2024|[Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment](https://arxiv.org/abs/2401.12474) 01/2024|[Dynamic Layer Tying for Parameter-Efficient Transformers](https://arxiv.org/abs/2401.12819) 01/2024|[MambaByte: Token-free Selective State Space Model](https://arxiv.org/abs/2401.13660) 01/2024|[FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design](https://arxiv.org/abs/2401.14112) 01/2024|[Accelerating Retrieval-Augmented Language Model Serving with Speculation](https://arxiv.org/abs/2401.14021) 01/2024|[Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities](https://arxiv.org/abs/2401.14405) 01/2024|[EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/abs/2401.15077) 01/2024|[With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation (Temp LoRA)](https://arxiv.org/abs/2401.11504) 01/2024|[YODA: Teacher-Student Progressive Learning for Language Models](https://arxiv.org/abs/2401.15670) 01/2024|[KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization](https://arxiv.org/abs/2401.18079) 01/2024|[LOCOST: State-Space Models for Long Document Abstractive Summarization](https://arxiv.org/abs/2401.17919) 01/2024|[Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model](https://arxiv.org/abs/2401.17868) 01/2024|[RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval](https://arxiv.org/abs/2401.18059) 02/2024|[EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models](https://arxiv.org/abs/2402.00518) 02/2024|[MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts](https://arxiv.org/abs/2402.00893) 02/2024|[Break the Sequential Dependency of LLM Inference Using Lookahead Decoding](https://arxiv.org/abs/2402.02057) 02/2024|[Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities](https://arxiv.org/abs/2402.01831) 02/2024|[HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA](https://arxiv.org/abs/2402.01767) 02/2024|[KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750) 02/2024|[DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing](https://arxiv.org/abs/2402.02583) 02/2024|[QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks](https://arxiv.org/abs/2402.04396) 02/2024|[Hydragen: High-Throughput LLM Inference with Shared Prefixes](https://arxiv.org/abs/2402.05099) 02/2024|[Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding](https://arxiv.org/abs/2402.05109) 02/2024|[LESS: Selecting Influential Data for Targeted Instruction Tuning](https://arxiv.org/abs/2402.04333) 02/2024|[Accurate LoRA-Finetuning Quantization of LLMs via Information Retention](https://arxiv.org/abs/2402.05445) 02/2024|[AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers](https://arxiv.org/abs/2402.05602) 02/2024|[X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications 
in Protein Mechanics](https://arxiv.org/abs/2402.07148) 02/2024|[BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data](https://arxiv.org/abs/2402.08093) 02/2024|[Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance](https://arxiv.org/abs/2402.08680) 02/2024|[Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference](https://arxiv.org/abs/2402.09398) 02/2024|[Uncertainty Decomposition and Quantification for In-Context Learning of Large Language Models](https://arxiv.org/abs/2402.10189) 02/2024|[RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models](https://arxiv.org/abs/2402.10038) 02/2024|[BitDelta: Your Fine-Tune May Only Be Worth One Bit](https://arxiv.org/abs/2402.10193) 02/2024|[DoRA: Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/abs/2402.09353) 02/2024|[In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss](https://arxiv.org/abs/2402.10790) 02/2024|[Aligning Modalities in Vision Large Language Models via Preference Fine-tuning](https://arxiv.org/abs/2402.11411) 02/2024|[Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding](https://arxiv.org/abs/2402.11809) 02/2024|[Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts](https://arxiv.org/abs/2402.10958) 02/2024|[WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More](https://arxiv.org/abs/2402.12065) 02/2024|[DB-LLM: Accurate Dual-Binarization for Efficient LLMs](https://arxiv.org/abs/2402.11960) 02/2024|[Data Engineering for Scaling Language Models to 128K Context](https://arxiv.org/abs/2402.10171) 02/2024|[EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs](https://arxiv.org/abs/2402.12419) 02/2024|[HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts](https://arxiv.org/abs/2402.12656) 02/2024|[Turn Waste into Worth: Rectifying Top-k Router of MoE](https://arxiv.org/abs/2402.12399) 02/2024|[Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive](https://arxiv.org/abs/2402.13228) 02/2024|[Q-Probe: A Lightweight Approach to Reward Maximization for Language Models](https://arxiv.org/abs/2402.14688) 02/2024|[Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization](https://arxiv.org/abs/2402.14270) 02/2024|[MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models](https://arxiv.org/abs/2402.15268) 02/2024|[Fine-tuning CLIP Text Encoders with Two-step Paraphrasing](https://arxiv.org/abs/2402.15120) 02/2024|[BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation](https://arxiv.org/abs/2402.16880) 02/2024|[No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization](https://arxiv.org/abs/2402.18096) 02/2024|[DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation](https://arxiv.org/abs/2402.17812) 02/2024|[CoDream: Exchanging dreams instead of models for federated aggregation with heterogeneous models](https://arxiv.org/abs/2402.15968) 02/2024|[Humanoid Locomotion as Next Token Prediction](https://arxiv.org/abs/2402.19469) 02/2024|[KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/abs/2402.01306) 02/2024|[Noise Contrastive 
Alignment of Language Models with Explicit Rewards (NCA)](https://arxiv.org/abs/2402.05369) 02/2024|[ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs](https://arxiv.org/abs/2402.03804) 02/2024|[Training-Free Long-Context Scaling of Large Language Models (DCA)](https://arxiv.org/abs/2402.17463) 03/2024|[Not all Layers of LLMs are Necessary during Inference](https://arxiv.org/abs/2403.02181) 03/2024|[Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models](https://arxiv.org/abs/2403.02178) 03/2024|[DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models](https://arxiv.org/abs/2403.00818) 03/2024|[GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507) 03/2024|[Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding](https://arxiv.org/abs/2403.04797) 03/2024|[Scattered Mixture-of-Experts Implementation](https://arxiv.org/abs/2403.08245) 03/2024|[AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning](https://arxiv.org/abs/2403.09113) 03/2024|[BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences](https://arxiv.org/abs/2403.09347) 03/2024|[Bifurcated Attention for Single-Context Large-Batch Sampling](https://arxiv.org/abs/2403.08845) 03/2024|[Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference](https://arxiv.org/abs/2403.09054) 03/2024|[Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering](https://arxiv.org/abs/2403.09622) 03/2024|[Recurrent Drafter for Fast Speculative Decoding in Large Language Models](https://arxiv.org/abs/2403.09919) 03/2024|[Arcee's MergeKit: A Toolkit for Merging Large Language Models](https://arxiv.org/abs/2403.13257) 03/2024|[Rotary Position Embedding for Vision Transformer](https://arxiv.org/abs/2403.13298) 03/2024|[BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models](https://arxiv.org/abs/2403.13037) 03/2024|[Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition](https://arxiv.org/abs/2403.14148) 03/2024|[DreamReward: Text-to-3D Generation with Human Preference](https://arxiv.org/abs/2403.14613) 03/2024|[Evolutionary Optimization of Model Merging Recipes](https://arxiv.org/abs/2403.13187) 03/2024|[Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://arxiv.org/abs/2403.17377) 03/2024|[When Do We Not Need Larger Vision Models?](https://arxiv.org/abs/2403.13043) 03/2024|[FeatUp: A Model-Agnostic Framework for Features at Any Resolution](https://arxiv.org/abs/2403.10516) 03/2024|[ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching](https://arxiv.org/abs/2403.17312) 03/2024|[The Unreasonable Ineffectiveness of the Deeper Layers](https://arxiv.org/abs/2403.17887) 03/2024|[QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456) 04/2024|[LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models](https://arxiv.org/abs/2404.01617) 04/2024|[Prompt-prompted Mixture of Experts for Efficient LLM Generation (GRIFFIN)](https://arxiv.org/abs/2404.01617) 04/2024|[BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models](https://arxiv.org/abs/2404.02827) 04/2024|[SqueezeAttention: 2D Management of KV-Cache in LLM Inference via 
Layer-wise Optimal Budget](https://arxiv.org/abs/2404.04793) 04/2024|[CodecLM: Aligning Language Models with Tailored Synthetic Data](https://arxiv.org/abs/2404.05875) 04/2024|[Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation](https://arxiv.org/abs/2404.06910) 04/2024|[Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs](https://arxiv.org/abs/2404.07103) 04/2024|[Continuous Language Model Interpolation for Dynamic and Controllable Text Generation](https://arxiv.org/abs/2404.07117) 04/2024|[RULER: What's the Real Context Size of Your Long-Context Language Models?](https://arxiv.org/abs/2404.06654) 04/2024|[Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models](https://arxiv.org/abs/2404.09529) 04/2024|[On Speculative Decoding for Multimodal Large Language Models](https://arxiv.org/abs/2404.08856) 04/2024|[CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models](https://arxiv.org/abs/2404.08763) 04/2024|[Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs](https://arxiv.org/abs/2404.10308) 04/2024|[Fewer Truncations Improve Language Modeling](https://arxiv.org/abs/2404.10830) 04/2024|[When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes](https://arxiv.org/abs/2404.12365) 04/2024|[Learn2Talk: 3D Talking Face Learns from 2D Talking Face](https://arxiv.org/abs/2404.12888) 04/2024|[Weak-to-Strong Extrapolation Expedites Alignment (EXPO)](https://arxiv.org/abs/2404.16792) 04/2024|[decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points](https://arxiv.org/abs/2404.12759) 04/2024|[RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation](https://arxiv.org/abs/2404.12457) 04/2024|[Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding](https://arxiv.org/abs/2404.08698) 04/2024|[Mixture of LoRA Experts](https://arxiv.org/abs/2404.13628) 04/2024|[MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning](https://arxiv.org/abs/2404.13591) 04/2024|[XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts](https://arxiv.org/abs/2404.15247) 04/2024|[Retrieval Head Mechanistically Explains Long-Context Factuality](https://arxiv.org/abs/2404.15574) 04/2024|[Let's Think Dot by Dot: Hidden Computation in Transformer Language Models](https://arxiv.org/abs/2404.15758) 04/2024|[Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting](https://arxiv.org/abs/2404.18911) 05/2024|[When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively](https://arxiv.org/abs/2404.19705) 05/2024|[A Careful Examination of Large Language Model Performance on Grade School Arithmetic](https://arxiv.org/abs/2405.00332) 05/2024|[Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge](https://arxiv.org/abs/2405.00263) 05/2024|[Parameter-Efficient Fine-Tuning with Discrete Fourier Transform](https://arxiv.org/abs/2405.03003) 05/2024|[COPAL: Continual Pruning in Large Language Generative Models](https://arxiv.org/abs/2405.02347) 05/2024|[Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models](https://arxiv.org/abs/2405.02861) 05/2024|[AlphaMath Almost Zero: process Supervision without process](https://arxiv.org/abs/2405.03553) 05/2024|[QServe: W4A8KV4 Quantization and System Co-design 
for Efficient LLM Serving](https://arxiv.org/abs/2405.04532) 05/2024|[xLSTM: Extended Long Short-Term Memory](https://arxiv.org/abs/2405.04517) 05/2024|[FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference](https://arxiv.org/abs/2405.04065) 05/2024|[SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models](https://arxiv.org/abs/2405.06219) 05/2024|[HMT: Hierarchical Memory Transformer for Long Context Language Processing](https://arxiv.org/abs/2405.06067) 05/2024|[The Future of Large Language Model Pre-training is Federated](https://arxiv.org/abs/2405.10853) 05/2024|[Layer-Condensed KV Cache for Efficient Inference of Large Language Models](https://arxiv.org/abs/2405.10637) 05/2024|[MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2405.12130) 05/2024|[SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model](https://arxiv.org/abs/2405.11831) 05/2024|[Reducing Transformer Key-Value Cache Size with Cross-Layer Attention](https://arxiv.org/abs/2405.12981) 05/2024|[Bagging Improves Generalization Exponentially](https://arxiv.org/abs/2405.14741) 05/2024|[Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models](https://arxiv.org/abs/2405.14161) 05/2024|[Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast](https://arxiv.org/abs/2405.14507) 05/2024|[Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum](https://arxiv.org/abs/2405.13226) 05/2024|[T2 of Thoughts: Temperature Tree Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2405.14075) 05/2024|[ReALLM: A general framework for LLM compression and fine-tuning](https://arxiv.org/abs/2405.13155) 05/2024|[SimPO: Simple Preference Optimization with a Reference-Free Reward](https://arxiv.org/abs/2405.14734) 05/2024|[PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression](https://arxiv.org/abs/2405.14852) 05/2024|[Removing Bias from Maximum Likelihood Estimation with Model Autophagy](https://arxiv.org/abs/2405.13977) 05/2024|[RE-Adapt: Reverse Engineered Adaptation of Large Language Models](https://arxiv.org/abs/2405.15007) 05/2024|[MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence](https://arxiv.org/abs/2405.15593) 05/2024|[Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908) 05/2024|[Accelerating Transformers with Spectrum-Preserving Token Merging](https://arxiv.org/abs/2405.16148) 05/2024|[A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training](https://arxiv.org/abs/2405.17403) 05/2024|[MoEUT: Mixture-of-Experts Universal Transformers](https://arxiv.org/abs/2405.16039) 05/2024|[Exploring Context Window of Large Language Models via Decomposed Positional Vectors](https://arxiv.org/abs/2405.18009) 05/2024|[Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) 05/2024|[OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning](https://arxiv.org/abs/2405.18380) 05/2024|[MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification](https://arxiv.org/abs/2405.19186) 05/2024|[Self-Play Preference Optimization for Language Model Alignment](https://arxiv.org/abs/2405.00675) 05/2024|[The Road Less Scheduled(Schedule-Free)](https://arxiv.org/abs/2405.15682) 06/2024|[FineWeb: decanting the web 
for the finest text data at scale](https://archive.is/2HqVQ) 06/2024|[Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2)](https://arxiv.org/abs/2405.21060) 06/2024|[Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization](https://arxiv.org/abs/2406.00045) 06/2024|[DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection](https://arxiv.org/abs/2406.00345) 06/2024|[MultiMax: Sparse and Multi-Modal Attention Learning](https://arxiv.org/abs/2406.01189) 06/2024|[MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization](https://arxiv.org/abs/2406.00800) 06/2024|[Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation](https://arxiv.org/abs/2406.00456) 06/2024|[QuanTA: Efficient High-Rank Fine-Tuning of LLMs with Quantum-Informed Tensor Adaptation](https://arxiv.org/abs/2406.00132) 06/2024|[SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining](https://arxiv.org/abs/2406.02214) 06/2024|[Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models](https://arxiv.org/abs/2406.04271) 06/2024|[VCR: Visual Caption Restoration](https://arxiv.org/abs/2406.06462) 06/2024|[LoCoCo: Dropping In Convolutions for Long Context Compression](https://arxiv.org/abs/2406.05317) 06/2024|[Low-Rank Quantization-Aware Training for LLMs](https://arxiv.org/abs/2406.06385) 06/2024|[Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters](https://arxiv.org/abs/2406.05955) 06/2024|[DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion](https://arxiv.org/abs/2406.06567) 06/2024|[TernaryLLM: Ternarized Large Language Model](https://arxiv.org/abs/2406.07177) 06/2024|[Image and Video Tokenization with Binary Spherical Quantization](https://arxiv.org/abs/2406.07548) 06/2024|[Discovering Preference Optimization Algorithms with and for Large Language Models](https://arxiv.org/abs/2406.08414) 06/2024|[ProTrain: Efficient LLM Training via Memory-Aware Techniques](https://arxiv.org/abs/2406.08334) 06/2024|[PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling](https://arxiv.org/abs/2406.02069) 06/2024|[Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464) 06/2024|[Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs](https://arxiv.org/abs/2406.09136) 06/2024|[HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning](https://arxiv.org/abs/2406.09827) 06/2024|[LieRE: Generalizing Rotary Position Encodings](https://arxiv.org/abs/2406.10322) 06/2024|[DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer](https://arxiv.org/abs/2406.11427) 06/2024|[Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference](https://arxiv.org/abs/2406.10774) 06/2024|[Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies](https://arxiv.org/abs/2406.10923) 06/2024|[mDPO: Conditional Preference Optimization for Multimodal Large Language Models](https://arxiv.org/abs/2406.11839) 06/2024|[QTIP: Quantization with Trellises and Incoherence Processing](https://arxiv.org/abs/2406.11235) 06/2024|[Mixture-of-Subspaces in Low-Rank Adaptation (MoSLoRA)](https://arxiv.org/abs/2406.11909) 06/2024|[Prefixing Attention Sinks can Mitigate Activation Outliers for 
Large Language Model Quantization](https://arxiv.org/abs/2406.12016) 06/2024|[Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models](https://arxiv.org/abs/2406.12311) 06/2024|[DeciMamba: Exploring the Length Extrapolation Potential of Mamba](https://arxiv.org/abs/2406.14528) 06/2024|[Optimised Grouped-Query Attention Mechanism for Transformers](https://arxiv.org/abs/2406.14963) 06/2024|[MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression](https://arxiv.org/abs/2406.14909) 06/2024|[Unsupervised Morphological Tree Tokenizer](https://arxiv.org/abs/2406.15245) 06/2024|[Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation](https://arxiv.org/abs/2406.16282) 06/2024|[What Matters in Transformers? Not All Attention is Needed](https://arxiv.org/abs/2406.15786) 06/2024|[Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention](https://arxiv.org/abs/2406.15486) 06/2024|[ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models](https://arxiv.org/abs/2406.16635) 06/2024|[Adam-mini: Use Fewer Learning Rates To Gain More](https://arxiv.org/abs/2406.16793) 06/2024|[Large Language Models are Interpretable Learners](https://arxiv.org/abs/2406.17224) 06/2024|[Selective Prompting Tuning for Personalized Conversations with LLMs](https://arxiv.org/abs/2406.18187) 06/2024|[Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs](https://arxiv.org/abs/2406.18629) 07/2024|[Eliminating Position Bias of Language Models: A Mechanistic Approach](https://arxiv.org/abs/2407.01100) 07/2024|[Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion](https://arxiv.org/abs/2407.01392) 07/2024|[Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs](https://arxiv.org/abs/2407.00945) 07/2024|[Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models](https://arxiv.org/abs/2407.01906) 07/2024|[LoCo: Low-Bit Communication Adaptor for Large-scale Model Training](https://arxiv.org/abs/2407.04480) 07/2024|[Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning](https://arxiv.org/abs/2407.05040) 07/2024|[Learning to (Learn at Test Time): RNNs with Expressive Hidden States (TTT)](https://arxiv.org/abs/2407.04620) 07/2024|[Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps](https://arxiv.org/abs/2407.07071) 07/2024|[Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules](https://arxiv.org/abs/2407.06677) 07/2024|[OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training](https://arxiv.org/abs/2407.07852) 07/2024|[Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization](https://arxiv.org/abs/2407.07880) 07/2024|[Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients](https://arxiv.org/abs/2407.08296) 07/2024|[FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision](https://arxiv.org/abs/2407.08608) 07/2024|[Lite-SAM Is Actually What You Need for Segment Everything](https://arxiv.org/abs/2407.08965) 07/2024|[BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks](https://arxiv.org/abs/2407.09527) 07/2024|[Tiled Bit Networks: Sub-Bit Neural Network 
Compression Through Reuse of Learnable Binary Vectors](https://arxiv.org/abs/2407.12075) 07/2024|[Patch-Level Training for Large Language Models](https://arxiv.org/abs/2407.12665) 07/2024|[Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization](https://arxiv.org/abs/2407.13399) 07/2024|[LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference](https://arxiv.org/abs/2407.14057) 07/2024|[Hi-EF: Benchmarking Emotion Forecasting in Human-interaction](https://arxiv.org/abs/2407.16406) 07/2024|[RazorAttention: Efficient KV Cache Compression Through Retrieval Heads](https://arxiv.org/abs/2407.15891) 07/2024|[MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning](https://arxiv.org/abs/2407.20999) 07/2024|[Palu: Compressing KV-Cache with Low-Rank Projection](https://arxiv.org/abs/2407.21118) 07/2024|[AI-Assisted Generation of Difficult Math Questions (MATH^2)](https://arxiv.org/abs/2407.21009) | |**Articles** 03/2019|[Rich Sutton - The Bitter Lesson](https://archive.is/QqKWF) 06/2022|[Yann LeCun - A Path Towards Autonomous Machine Intelligence](https://openreview.net/forum?id=BZ5a1r-kVsf) 01/2023|[Lilian Weng - The Transformer Family Version 2.0](https://archive.is/3O1n8) 01/2023|[Lilian Weng - Large Transformer Model Inference Optimization](https://archive.is/Clu0H) 03/2023|[Stanford - Alpaca: A Strong, Replicable Instruction-Following Model](https://archive.is/Ky1lu) 05/2023|[OpenAI - Language models can explain neurons in language models](https://archive.is/Y6Lvd) 05/2023|[Alex Turner - Steering GPT-2-XL by adding an activation vector](https://archive.is/E7ehv) 06/2023|[YyWang - Do We Really Need the KVCache for All Large Language Models](https://archive.is/quOu2) 06/2023|[kaiokendev - Extending Context is Hard…but not Impossible](https://archive.is/vJC44) 06/2023|[bloc97 - NTK-Aware Scaled RoPE](https://archive.is/Rsoai) 07/2023|[oobabooga - A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities](https://archive.is/HgzRV) 07/2023|[Jianlin Su - Carrying the beta position to the end (better NTK RoPe method)](https://archive.is/hfbHH) 08/2023|[Charles Goddard - On Frankenllama](https://archive.is/GYoVX) 10/2023|[Tri Dao - Flash-Decoding for Long-Context Inference](https://archive.is/KCu83) 10/2023|[Evan Armstrong - Human-Sourced, AI-Augmented: a promising solution for open source conversational data](https://archive.is/zPPFU) 12/2023|[Anthropic - Long context prompting for Claude 2.1](https://archive.is/zGngI) 12/2023|[Andrej Karpathy - On the "hallucination problem" (tweet.jpg)](https://files.catbox.moe/jnrzrz.jpg) 12/2023|[HuggingFace - Mixture of Experts Explained](https://archive.is/8r7t9) 01/2024|[Vgel - Representation Engineering](https://archive.is/SHV3E) 01/2024|[Alex Alemi - KL is All You Need](https://archive.is/w0U7t) 02/2024|[Lilian Weng - Thinking about High-Quality Human Data](https://archive.is/1K0EM) 03/2024|[rayliuca - T-Ragx Project Write Up (Translation RAG)](https://archive.is/VU9eI) 04/2024|[Answer.Ai - Efficient finetuning of Llama 3 with FSDP QDoRA](https://archive.is/IbOaf) 04/2024|[Sam Paech - Creating MAGI: A hard subset of MMLU and AGIEval](https://archive.is/BEdCw) 05/2024|[LLaVA Team - LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild](https://archive.is/ZbDhO) 05/2024|[Hazy Research - GPUs Go Brrr (ThunderKittens)](https://archive.is/zBekg) 05/2024|[Anthropic - Scaling Monosemanticity: 
Extracting Interpretable Features from Claude 3 Sonnet](https://archive.is/QJqSq) 06/2024|[CharacterAI - Optimizing AI Inference](https://archive.is/koFXi) 07/2024|[Lilian Weng - Extrinsic Hallucinations in LLMs](https://archive.is/NIm5r) 07/2024|[Andrej Karpathy - Let's reproduce GPT-2 (1.6B)](https://archive.is/VoITK) 07/2024|[Pierre-Carl Langlais - Announcing Finance Commons and the Bad Data Toolbox](https://archive.is/9uYkD) 07/2024|[Zeyuan Allen-Zhu - Physics of Language Models ICML Talk (Video)](https://youtu.be/yBL7J0kgldU)