#Local Models Related Papers /lmg/ | ->[Abstracts Search (Current as of 03/2024)](https://files.catbox.moe/65wvgn.txt)<-
------ | ------
|**Google** ->[Papers](https://research.google/pubs/?area=machine-intelligence) [Blog](https://ai.googleblog.com)<-
12/2017|[Attention Is All You Need (Transformers)](https://arxiv.org/abs/1706.03762)
10/2018|[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
10/2019|[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)](https://arxiv.org/abs/1910.10683)
11/2019|[Fast Transformer Decoding: One Write-Head is All You Need](https://arxiv.org/abs/1911.02150)
02/2020|[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
03/2020|[Talking-Heads Attention](https://arxiv.org/abs/2003.02436)
05/2020|[Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
09/2020|[Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
12/2020|[RealFormer: Transformer Likes Residual Attention](https://arxiv.org/abs/2012.11747)
01/2021|[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
09/2021|[Finetuned Language Models Are Zero-Shot Learners (Flan)](https://arxiv.org/abs/2109.01652)
09/2021|[Primer: Searching for Efficient Transformers for Language Modeling](https://arxiv.org/abs/2109.08668)
11/2021|[Sparse is Enough in Scaling Transformers](https://arxiv.org/abs/2111.12763)
12/2021|[GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905)
01/2022|[LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239)
01/2022|[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
04/2022|[PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311)
07/2022|[Confident Adaptive Language Modeling](https://arxiv.org/abs/2207.07061)
10/2022|[Scaling Instruction-Finetuned Language Models (Flan-Palm)](https://arxiv.org/abs/2210.11416)
10/2022|[Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models](https://arxiv.org/abs/2210.13432)
10/2022|[Large Language Models Can Self-Improve](https://arxiv.org/abs/2210.11610)
11/2022|[Efficiently Scaling Transformer Inference](https://arxiv.org/abs/2211.05102)
11/2022|[Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192)
02/2023|[Symbolic Discovery of Optimization Algorithms (Lion)](https://arxiv.org/abs/2302.06675)
03/2023|[PaLM-E: An Embodied Multimodal Language Model](https://arxiv.org/abs/2303.03378)
04/2023|[Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference](https://arxiv.org/abs/2304.04947)
05/2023|[Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes](https://arxiv.org/abs/2305.02301)
05/2023|[FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction](https://arxiv.org/abs/2305.02549)
05/2023|[PaLM 2 Technical Report](https://arxiv.org/abs/2305.10403)
05/2023|[Symbol tuning improves in-context learning in language models](https://arxiv.org/abs/2305.08298)
05/2023|[Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models](https://arxiv.org/abs/2305.14705)
05/2023|[Towards Expert-Level Medical Question Answering with Large Language Models (Med-Palm 2)](https://arxiv.org/abs/2305.09617)
05/2023|[DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining](https://arxiv.org/abs/2305.10429)
05/2023|[How Does Generative Retrieval Scale to Millions of Passages?](https://arxiv.org/abs/2305.11841)
05/2023|[GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245)
05/2023|[Small Language Models Improve Giants by Rewriting Their Outputs](https://arxiv.org/abs/2305.13514)
06/2023|[StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners](https://arxiv.org/abs/2306.00984)
06/2023|[AudioPaLM: A Large Language Model That Can Speak and Listen](https://arxiv.org/abs/2306.12925)
06/2023|[Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting](https://arxiv.org/abs/2306.17563)
07/2023|[HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models](https://arxiv.org/abs/2307.06949)
09/2023|[Uncovering mesa-optimization algorithms in Transformers](https://arxiv.org/abs/2309.05858)
10/2023|[Think before you speak: Training Language Models With Pause Tokens](https://arxiv.org/abs/2310.02226)
10/2023|[SpecTr: Fast Speculative Decoding via Optimal Transport](https://arxiv.org/abs/2310.15141)
11/2023|[UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs](https://arxiv.org/abs/2311.09257)
11/2023|[Automatic Engineering of Long Prompts](https://arxiv.org/abs/2311.10117)
12/2023|[Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses](https://arxiv.org/abs/2312.00763)
12/2023|[Style Aligned Image Generation via Shared Attention](https://arxiv.org/abs/2312.02133)
01/2024|[A Minimaximalist Approach to Reinforcement Learning from Human Feedback (SPO)](https://arxiv.org/abs/2401.04056)
02/2024|[Time-, Memory- and Parameter-Efficient Visual Adaptation (LoSA)](https://arxiv.org/abs/2402.02887)
02/2024|[Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context](https://files.catbox.moe/0tugft.pdf)
03/2024|[PERL: Parameter Efficient Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2403.10704)
|
|**Deepmind (Google Deepmind as of 4/2023)** ->[Papers](https://www.deepmind.com/research) [Blog](https://www.deepmind.com/blog)<-
10/2019|[Stabilizing Transformers for Reinforcement Learning](https://arxiv.org/abs/1910.06764)
12/2021|[Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/abs/2112.11446)
12/2021|[Improving language models by retrieving from trillions of tokens (RETRO)](https://arxiv.org/abs/2112.04426)
02/2022|[Competition-Level Code Generation with AlphaCode](https://arxiv.org/abs/2203.07814)
02/2022|[Unified Scaling Laws for Routed Language Models](https://arxiv.org/abs/2202.01169)
03/2022|[Training Compute-Optimal Large Language Models (Chinchilla)](https://arxiv.org/abs/2203.15556)
04/2022|[Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
05/2022|[A Generalist Agent (GATO)](https://arxiv.org/abs/2205.06175)
07/2022|[Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238)
02/2023|[Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
05/2023|[Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601)
05/2023|[Block-State Transformer](https://arxiv.org/abs/2306.09539)
05/2023|[Randomized Positional Encodings Boost Length Generalization of Transformers](https://arxiv.org/abs/2305.16843)
08/2023|[From Sparse to Soft Mixtures of Experts](https://arxiv.org/abs/2308.00951)
09/2023|[Large Language Models as Optimizers](https://arxiv.org/abs/2309.03409)
09/2023|[MADLAD-400: A Multilingual And Document-Level Large Audited Dataset (MT Model)](https://arxiv.org/abs/2309.04662)
09/2023|[Scaling Laws for Sparsely-Connected Foundation Models](https://arxiv.org/abs/2309.08520)
09/2023|[Language Modeling Is Compression](https://arxiv.org/abs/2309.10668)
09/2023|[Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution](https://arxiv.org/abs/2309.16797)
10/2023|[Large Language Models as Analogical Reasoners](https://arxiv.org/abs/2310.01714)
10/2023|[Controlled Decoding from Language Models](https://arxiv.org/abs/2310.17022)
10/2023|[A General Theoretical Paradigm to Understand Learning from Human Preferences](https://arxiv.org/abs/2310.12036)
12/2023|[Gemini: A Family of Highly Capable Multimodal Models](https://files.catbox.moe/g7nn73.pdf)
12/2023|[AlphaCode 2 Technical Report](https://files.catbox.moe/lqpb7g.pdf)
12/2023|[Chain of Code: Reasoning with a Language Model-Augmented Code Emulator](https://arxiv.org/abs/2312.04474)
12/2023|[Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](https://arxiv.org/abs/2312.06585)
12/2023|[Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding](https://arxiv.org/abs/2312.05328)
01/2024|[Solving olympiad geometry without human demonstrations](https://files.catbox.moe/3fu2lc.pdf)
02/2024|[LiPO: Listwise Preference Optimization through Learning-to-Rank](https://arxiv.org/abs/2402.01878)
02/2024|[Grandmaster-Level Chess Without Search](https://arxiv.org/abs/2402.04494)
02/2024|[How to Train Data-Efficient LLMs](https://arxiv.org/abs/2402.09668)
02/2024|[A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts](https://arxiv.org/abs/2402.09727)
02/2024|[Gemma: Open Models Based on Gemini Research and Technology](https://files.catbox.moe/og82ni.pdf)
02/2024|[Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391)
02/2024|[Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models](https://arxiv.org/abs/2402.19427)
03/2024|[DiPaCo: Distributed Path Composition](https://arxiv.org/abs/2403.10616)
04/2024|[Mixture-of-Depths: Dynamically allocating compute in transformer-based language models](https://arxiv.org/abs/2404.02258)
|
|**Meta (Facebook AI Research)** ->[Papers](https://ai.facebook.com/results/?content_types%5B0%5D=publication) [Blog](https://ai.facebook.com/blog)<-
04/2019|[fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038)
07/2019|[Augmenting Self-attention with Persistent Memory](https://arxiv.org/abs/1907.01470)
11/2019|[Improving Transformer Models by Reordering their Sublayers](https://arxiv.org/abs/1911.03864)
08/2021|[Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
03/2022|[Training Logbook for OPT-175B](https://files.catbox.moe/u1836w.pdf)
05/2022|[OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068)
07/2022|[Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)
11/2022|[Galactica: A Large Language Model for Science](https://arxiv.org/abs/2211.09085)
01/2023|[Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)](https://arxiv.org/abs/2301.08243)
02/2023|[LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
02/2023|[Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761)
03/2023|[Scaling Expert Language Models with Unsupervised Domain Discovery](https://arxiv.org/abs/2303.14177)
03/2023|[SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540)
04/2023|[Segment Anything (SAM)](https://arxiv.org/abs/2304.02643)
04/2023|[A Cookbook of Self-Supervised Learning](https://arxiv.org/abs/2304.12210)
05/2023|[Learning to Reason and Memorize with Self-Notes](https://arxiv.org/abs/2305.00833)
05/2023|[ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665)
05/2023|[MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers](https://arxiv.org/abs/2305.07185)
05/2023|[LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206)
05/2023|[Scaling Speech Technology to 1,000+ Languages](https://files.catbox.moe/6j8gka.pdf)
05/2023|[READ: Recurrent Adaptation of Large Transformers](https://arxiv.org/abs/2305.15348)
05/2023|[LLM-QAT: Data-Free Quantization Aware Training for Large Language Models](https://arxiv.org/abs/2305.17888)
06/2023|[Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/abs/2306.00989)
06/2023|[Simple and Controllable Music Generation (MusicGen)](https://arxiv.org/abs/2306.05284)
06/2023|[Improving Open Language Models by Learning from Organic Interactions (BlenderBot 3x)](https://arxiv.org/abs/2306.04707)
06/2023|[Extending Context Window of Large Language Models via Positional Interpolation](https://arxiv.org/abs/2306.15595)
06/2023|[Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale](https://arxiv.org/abs/2306.15687)
07/2023|[Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3leon)](https://arxiv.org/abs/2309.02591)
07/2023|[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://files.catbox.moe/tuog0d.pdf)
08/2023|[SeamlessM4T—Massively Multilingual & Multimodal Machine Translation](https://files.catbox.moe/bdw0bw.pdf)
08/2023|[D4: Improving LLM Pretraining via Document De-Duplication and Diversification](https://arxiv.org/abs/2308.12284)
08/2023|[Code Llama: Open Foundation Models for Code](https://files.catbox.moe/hfy4wf.pdf)
08/2023|[Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)
09/2023|[Contrastive Decoding Improves Reasoning in Large Language Models](https://arxiv.org/abs/2309.09117)
09/2023|[Effective Long-Context Scaling of Foundation Models](https://arxiv.org/abs/2309.16039)
09/2023|[AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model](https://arxiv.org/abs/2309.16058)
09/2023|[Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588)
10/2023|[RA-DIT: Retrieval-Augmented Dual Instruction Tuning](https://arxiv.org/abs/2310.01352)
10/2023|[Branch-Solve-Merge Improves Large Language Model Evaluation and Generation](https://arxiv.org/abs/2310.15123)
10/2023|[Generative Pre-training for Speech with Flow Matching](https://arxiv.org/abs/2310.16338)
11/2023|[Emu Edit: Precise Image Editing via Recognition and Generation Tasks](https://arxiv.org/abs/2311.10089)
12/2023|[Audiobox: Unified Audio Generation with Natural Language Prompts](https://arxiv.org/abs/2312.15821)
12/2023|[Universal Pyramid Adversarial Training for Improved ViT Performance](https://arxiv.org/abs/2312.16339)
01/2024|[Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020)
02/2024|[Revisiting Feature Prediction for Learning Visual Representations from Video](https://files.catbox.moe/gn25vw.pdf)
03/2024|[Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM](https://arxiv.org/abs/2403.07816)
03/2024|[Reverse Training to Nurse the Reversal Curse](https://arxiv.org/abs/2403.13799)
|
|**Microsoft** ->[Papers](https://www.microsoft.com/en-us/research/research-area/artificial-intelligence/?) [Blog](https://www.microsoft.com/en-us/research/blog)<-
12/2015|[Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
05/2021|[EL-Attention: Memory Efficient Lossless Attention for Generation](https://arxiv.org/abs/2105.04779)
01/2022|[DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)
03/2022|[DeepNet: Scaling Transformers to 1,000 Layers](https://arxiv.org/abs/2203.00555)
12/2022|[A Length-Extrapolatable Transformer](https://arxiv.org/abs/2212.10554)
01/2023|[Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases](https://arxiv.org/abs/2301.12017)
02/2023|[Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)](https://arxiv.org/abs/2302.14045)
03/2023|[Sparks of Artificial General Intelligence: Early experiments with GPT-4](https://arxiv.org/abs/2303.12712)
03/2023|[TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs](https://arxiv.org/abs/2303.16434)
04/2023|[Instruction Tuning with GPT-4](https://arxiv.org/abs/2304.03277)
04/2023|[Inference with Reference: Lossless Acceleration of Large Language Models](https://arxiv.org/abs/2304.04487)
04/2023|[Low-code LLM: Visual Programming over LLMs](https://arxiv.org/abs/2304.08103)
04/2023|[WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
04/2023|[MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks](https://arxiv.org/abs/2304.14979)
04/2023|[ResiDual: Transformer with Dual Residual Connections](https://arxiv.org/abs/2304.14802)
05/2023|[Code Execution with Pre-trained Language Models](https://arxiv.org/abs/2305.05383)
05/2023|[Small Models are Valuable Plug-ins for Large Language Models](https://arxiv.org/abs/2305.08848)
05/2023|[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://arxiv.org/abs/2305.11738)
06/2023|[Orca: Progressive Learning from Complex Explanation Traces of GPT-4](https://arxiv.org/abs/2306.02707)
06/2023|[Augmenting Language Models with Long-Term Memory](https://arxiv.org/abs/2306.07174)
06/2023|[WizardCoder: Empowering Code Large Language Models with Evol-Instruct](https://arxiv.org/abs/2306.08568)
06/2023|[Textbooks Are All You Need (phi-1)](https://arxiv.org/abs/2306.11644)
07/2023|[In-context Autoencoder for Context Compression in a Large Language Model](https://arxiv.org/abs/2307.06945)
07/2023|[Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/abs/2307.08621)
08/2023|[Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference](https://arxiv.org/abs/2308.12066)
09/2023|[Efficient RLHF: Reducing the Memory Usage of PPO](https://arxiv.org/abs/2309.00754)
09/2023|[DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models](https://arxiv.org/abs/2309.03883)
09/2023|[Textbooks Are All You Need II (phi-1.5)](https://arxiv.org/abs/2309.05463)
09/2023|[PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training](https://arxiv.org/abs/2309.10400)
09/2023|[A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models](https://arxiv.org/abs/2309.11674)
09/2023|[Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models](https://arxiv.org/abs/2309.15098)
10/2023|[Sparse Backpropagation for MoE Training](https://arxiv.org/abs/2310.00811)
10/2023|[Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models](https://arxiv.org/abs/2310.02409)
10/2023|[Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness](https://arxiv.org/abs/2310.02410)
10/2023|[Augmented Embeddings for Custom Retrievals](https://arxiv.org/abs/2310.05380)
10/2023|[Guiding Language Model Reasoning with Planning Tokens](https://arxiv.org/abs/2310.05707)
10/2023|[Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https://arxiv.org/abs/2310.11441)
10/2023|[CodeFusion: A Pre-trained Diffusion Model for Code Generation](https://arxiv.org/abs/2310.17680)
10/2023|[LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery](https://arxiv.org/abs/2310.18356)
10/2023|[FP8-LM: Training FP8 Large Language Models](https://arxiv.org/abs/2310.18313)
11/2023|[Orca 2: Teaching Small Language Models How to Reason](https://arxiv.org/abs/2311.11045)
12/2023|[ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks](https://arxiv.org/abs/2312.08583)
12/2023|[The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction](https://arxiv.org/abs/2312.13558)
01/2024|[SliceGPT: Compress Large Language Models by Deleting Rows and Columns](https://arxiv.org/abs/2401.15024)
01/2024|[RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture](https://arxiv.org/abs/2401.08406)
02/2024|[LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens](https://arxiv.org/abs/2402.13753)
02/2024|[The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
02/2024|[ResLoRA: Identity Residual Mapping in Low-Rank Adaption](https://arxiv.org/abs/2402.18039)
03/2024|[LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression](https://arxiv.org/abs/2403.12968)
03/2024|[SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series](https://arxiv.org/abs/2403.15360)
|
|**OpenAI** ->[Papers](https://openai.com/research) [Blog](https://openai.com/blog)<-
07/2017|[Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
04/2019|[Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509)
01/2020|[Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
05/2020|[Language Models are Few-Shot Learners (GPT-3)](https://arxiv.org/abs/2005.14165)
01/2022|[Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets](https://arxiv.org/abs/2201.02177)
03/2022|[Training language models to follow instructions with human feedback (InstructGPT)](https://arxiv.org/abs/2203.02155)
07/2022|[Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255)
03/2023|[GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)
04/2023|[Consistency Models](https://arxiv.org/abs/2303.01469)
05/2023|[Let's Verify Step by Step](https://arxiv.org/abs/2305.20050)
10/2023|[Improving Image Generation with Better Captions (DALL·E 3)](https://files.catbox.moe/e7jl5b.pdf)
|
|**Hazy Research (Stanford)** ->[Papers](https://cs.stanford.edu/people/chrismre/#papers) [Blog](https://hazyresearch.stanford.edu/blog)<-
10/2021|[Efficiently Modeling Long Sequences with Structured State Spaces (S4)](https://arxiv.org/abs/2111.00396)
04/2022|[Monarch: Expressive Structured Matrices for Efficient and Accurate Training](https://arxiv.org/abs/2204.00595)
05/2022|[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
12/2022|[Hungry Hungry Hippos: Towards Language Modeling with State Space Models](https://arxiv.org/abs/2212.14052)
02/2023|[Simple Hardware-Efficient Long Convolutions for Sequence Modeling](https://arxiv.org/abs/2302.06646)
02/2023|[Hyena Hierarchy: Towards Larger Convolutional Language Models](https://arxiv.org/abs/2302.10866)
06/2023|[TART: A plug-and-play Transformer module for task-agnostic reasoning](https://arxiv.org/abs/2306.07536)
07/2023|[FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://files.catbox.moe/arj3zc.pdf)
11/2023|[FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores](https://arxiv.org/abs/2311.05908)
|
|**THUDM (Tsinghua University)** ->[Papers](http://keg.cs.tsinghua.edu.cn/jietang/publication_list.html) [Github](https://github.com/THUDM)<-
10/2022|[GLM-130B: An Open Bilingual Pre-Trained Model](https://arxiv.org/abs/2210.02414)
03/2023|[CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X](https://arxiv.org/abs/2303.17568)
04/2023|[DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task](https://arxiv.org/abs/2304.01097)
06/2023|[WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences](https://arxiv.org/abs/2306.07906)
09/2023|[GPT Can Solve Mathematical Problems Without a Calculator (MathGLM)](https://arxiv.org/abs/2309.03241)
10/2023|[AgentTuning: Enabling Generalized Agent Abilities for LLMs (AgentLM)](https://arxiv.org/abs/2310.12823)
11/2023|[CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
12/2023|[CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914)
01/2024|[APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding](https://arxiv.org/abs/2401.06761)
01/2024|[LongAlign: A Recipe for Long Context Alignment of Large Language Models](https://arxiv.org/abs/2401.18058)
|
|**Open Models**
06/2021|[GPT-J-6B: 6B JAX-Based Transformer](https://archive.is/HPCbB)
09/2021|[Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning](https://arxiv.org/abs/2109.12021)
03/2022|[CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474)
04/2022|[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)
11/2022|[BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100)
12/2022|[DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders](https://arxiv.org/abs/2212.11613)
04/2023|[Visual Instruction Tuning (LLaVA)](https://arxiv.org/abs/2304.08485)
05/2023|[StarCoder: May the source be with you!](https://arxiv.org/abs/2305.06161)
05/2023|[CodeGen2: Lessons for Training LLMs on Programming and Natural Languages](https://arxiv.org/abs/2305.02309)
05/2023|[Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726)
05/2023|[InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)
05/2023|[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/abs/2305.07922)
05/2023|[ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities](https://arxiv.org/abs/2305.11172)
05/2023|[RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/abs/2305.13048)
05/2023|[Lion: Adversarial Distillation of Closed-Source Large Language Model](https://arxiv.org/abs/2305.12870)
05/2023|[MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training](https://arxiv.org/abs/2306.00107)
06/2023|[Segment Anything in High Quality](https://arxiv.org/abs/2306.01567)
06/2023|[Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)
06/2023|[High-Fidelity Audio Compression with Improved RVQGAN](https://arxiv.org/abs/2306.06546)
06/2023|[StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
06/2023|[Anticipatory Music Transformer](https://arxiv.org/abs/2306.08620)
06/2023|[RepoFusion: Training Code Models to Understand Your Repository](https://arxiv.org/abs/2306.10998)
06/2023|[MPT-30B: Raising the bar for open-source foundation models](https://archive.is/SOhKy)
06/2023|[Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity](https://arxiv.org/abs/2306.12689)
06/2023|[ViNT: A Foundation Model for Visual Navigation](https://arxiv.org/abs/2306.14846)
06/2023|[How Long Can Open-Source LLMs Truly Promise on Context Length? (LongChat)](https://archive.is/NfIj2)
07/2023|[Hierarchical Open-vocabulary Universal Image Segmentation](https://arxiv.org/abs/2307.00764)
07/2023|[Focused Transformer: Contrastive Training for Context Scaling (LongLLaMA)](https://arxiv.org/abs/2307.03170)
07/2023|[Rhythm Modeling for Voice Conversion (Urhythmic)](https://arxiv.org/abs/2307.06040)
07/2023|[Scaling TransNormer to 175 Billion Parameters](https://arxiv.org/abs/2307.14995)
08/2023|[Separate Anything You Describe](https://arxiv.org/abs/2308.05037)
08/2023|[StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data](https://arxiv.org/abs/2308.10253)
09/2023|[RADIO: Reference-Agnostic Dubbing Video Synthesis](https://arxiv.org/abs/2309.01950)
09/2023|[Matcha-TTS: A fast TTS architecture with conditional flow matching](https://arxiv.org/abs/2309.03199)
09/2023|[DreamLLM: Synergistic Multimodal Comprehension and Creation](https://arxiv.org/abs/2309.11499)
09/2023|[Baichuan 2: Open Large-scale Language Models](https://arxiv.org/abs/2309.10305)
09/2023|[Qwen Technical Report](https://files.catbox.moe/y61ihm.pdf)
09/2023|[Mistral 7B](https://files.catbox.moe/bars04.pdf)
10/2023|[MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning](https://arxiv.org/abs/2310.03731)
10/2023|[Improved Baselines with Visual Instruction Tuning (LLaVA 1.5)](https://arxiv.org/abs/2310.03744)
10/2023|[LLark: A Multimodal Foundation Model for Music](https://arxiv.org/abs/2310.07160)
10/2023|[SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/abs/2310.13289)
10/2023|[Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents](https://arxiv.org/abs/2310.19923)
11/2023|[Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models](https://arxiv.org/abs/2311.07919)
11/2023|[UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition](https://arxiv.org/abs/2311.15599)
11/2023|[YUAN 2.0: A Large Language Model with Localized Filtering-based Attention](https://arxiv.org/abs/2311.15786)
12/2023|[Making Large Multimodal Models Understand Arbitrary Visual Prompts (ViP-LLaVA)](https://arxiv.org/abs/2312.00784)
12/2023|[Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
12/2023|[OpenVoice: Versatile Instant Voice Cloning](https://arxiv.org/abs/2312.01479)
12/2023|[Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM)](https://arxiv.org/abs/2312.00785)
12/2023|[Magicoder: Source Code Is All You Need](https://arxiv.org/abs/2312.02120)
12/2023|[StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers](https://archive.is/cHoct)
12/2023|[MMM: Generative Masked Motion Model](https://arxiv.org/abs/2312.03596)
12/2023|[4M: Massively Multimodal Masked Modeling](https://arxiv.org/abs/2312.06647)
12/2023|[LLM360: Towards Fully Transparent Open-Source LLMs](https://arxiv.org/abs/2312.06550)
12/2023|[SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling](https://arxiv.org/abs/2312.15166)
01/2024|[DeepSeek LLM: Scaling Open-Source Language Models with Longtermism](https://arxiv.org/abs/2401.02954)
01/2024|[Mixtral of Experts](https://arxiv.org/abs/2401.04088)
01/2024|[EAT: Self-Supervised Pre-Training with Efficient Audio Transformer](https://arxiv.org/abs/2401.03497)
01/2024|[Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications](https://arxiv.org/abs/2401.06197)
01/2024|[Scalable Pre-training of Large Autoregressive Image Models](https://arxiv.org/abs/2401.08541)
01/2024|[Orion-14B: Open-source Multilingual Large Language Models](https://arxiv.org/abs/2401.12246)
01/2024|[Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891)
01/2024|[VMamba: Visual State Space Model](https://arxiv.org/abs/2401.10166)
01/2024|[DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence](https://arxiv.org/abs/2401.14196)
01/2024|[MoE-LLaVA: Mixture of Experts for Large Vision-Language Models](https://arxiv.org/abs/2401.15947)
01/2024|[LLaVA-1.6: Improved reasoning, OCR, and world knowledge](https://archive.is/WMr0Z)
01/2024|[MiniCPM: Unveiling the Potential of End-side Large Language Models](https://archive.is/IlMnJ)
01/2024|[Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild](https://arxiv.org/abs/2401.13627)
02/2024|[Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2402.00789)
02/2024|[Introducing Qwen1.5](https://archive.is/C6gpR)
02/2024|[BlackMamba: Mixture of Experts for State-Space Models](https://arxiv.org/abs/2402.01771)
02/2024|[DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)
02/2024|[EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss](https://arxiv.org/abs/2402.05008)
02/2024|[GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators](https://arxiv.org/abs/2402.06894)
02/2024|[Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://arxiv.org/abs/2402.10009)
02/2024|[Brant-2: Foundation Model for Brain Signals](https://arxiv.org/abs/2402.10251)
03/2024|[Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3)](https://files.catbox.moe/anmseu.pdf)
03/2024|[TripoSR: Fast 3D Object Reconstruction from a Single Image](https://arxiv.org/abs/2403.02151)
03/2024|[Yi: Open Foundation Models by 01.AI](https://arxiv.org/abs/2403.04652)
03/2024|[DeepSeek-VL: Towards Real-World Vision-Language Understanding](https://arxiv.org/abs/2403.05525)
03/2024|[VideoMamba: State Space Model for Efficient Video Understanding](https://arxiv.org/abs/2403.06977)
03/2024|[VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild](https://arxiv.org/abs/2403.16973)
03/2024|[GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation](https://arxiv.org/abs/2403.14621)
03/2024|[DBRX: A New State-of-the-Art Open LLM](https://archive.is/vP5bV)
03/2024|[AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation](https://arxiv.org/abs/2403.17694)
03/2024|[Jamba: A Hybrid Transformer-Mamba Language Model](https://arxiv.org/abs/2403.19887)
04/2024|[Advancing LLM Reasoning Generalists with Preference Trees (Eurus)](https://arxiv.org/abs/2404.02078)
|
|**Various**
09/2014|[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
06/2019|[Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View](https://arxiv.org/abs/1906.02762)
10/2019|[Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
10/2019|[Transformers without Tears: Improving the Normalization of Self-Attention](https://arxiv.org/abs/1910.05895)
12/2019|[Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection](https://arxiv.org/abs/1912.11637)
02/2020|[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745)
04/2020|[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
06/2020|[Memory Transformer](https://arxiv.org/abs/2006.11527)
07/2020|[Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity](https://arxiv.org/abs/2007.14966)
12/2020|[ERNIE-Doc: A Retrospective Long-Document Modeling Transformer](https://arxiv.org/abs/2012.15688)
01/2021|[Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks](https://arxiv.org/abs/2102.00554)
03/2021|[The Low-Rank Simplicity Bias in Deep Networks](https://arxiv.org/abs/2103.10427)
04/2021|[RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
06/2021|[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
07/2023|[CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention](https://arxiv.org/abs/2108.00154)
03/2022|[Memorizing Transformers](https://arxiv.org/abs/2203.08913)
04/2022|[UL2: Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131)
05/2022|[Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (IA3)](https://arxiv.org/abs/2205.05638)
06/2022|[nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models](https://arxiv.org/abs/2206.09557)
07/2022|[Language Models (Mostly) Know What They Know](https://arxiv.org/abs/2207.05221)
08/2022|[LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale](https://arxiv.org/abs/2208.07339)
09/2022|[Petals: Collaborative Inference and Fine-tuning of Large Models](https://arxiv.org/abs/2209.01188)
10/2022|[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
10/2022|[Recurrent Memory Transformer](https://files.catbox.moe/8trivt.pdf)
10/2022|[Truncation Sampling as Language Model Desmoothing](https://arxiv.org/abs/2210.15191)
10/2022|[DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation](https://arxiv.org/abs/2210.07558)
11/2022|[An Algorithm for Routing Vectors in Sequences](https://arxiv.org/abs/2211.11754)
11/2022|[MegaBlocks: Efficient Sparse Training with Mixture-of-Experts](https://arxiv.org/abs/2211.15841)
12/2022|[Self-Instruct: Aligning Language Model with Self Generated Instructions](https://arxiv.org/abs/2212.10560)
12/2022|[Parallel Context Windows Improve In-Context Learning of Large Language Models](https://arxiv.org/abs/2212.10947)
12/2022|[Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor](https://arxiv.org/abs/2212.09689)
12/2022|[Pretraining Without Attention](https://arxiv.org/abs/2212.10544)
12/2022|[The case for 4-bit precision: k-bit Inference Scaling Laws](https://arxiv.org/abs/2212.09720)
12/2022|[Prompting Is Programming: A Query Language for Large Language Models](https://arxiv.org/abs/2212.06094)
01/2023|[SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient](https://arxiv.org/abs/2301.11913)
01/2023|[SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774)
01/2023|[Memory Augmented Large Language Models are Computationally Universal](https://arxiv.org/abs/2301.04589)
01/2023|[Progress measures for grokking via mechanistic interpretability](https://arxiv.org/abs/2301.05217)
02/2023|[Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models](https://arxiv.org/abs/2302.02599)
02/2023|[The Wisdom of Hindsight Makes Language Models Better Instruction Followers](https://arxiv.org/abs/2302.05206)
02/2023|[The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation](https://arxiv.org/abs/2302.06784)
03/2023|[COLT5: Faster Long-Range Transformers with Conditional Computation](https://arxiv.org/abs/2303.09752)
03/2023|[High-throughput Generative Inference of Large Language Models with a Single GPU](https://arxiv.org/abs/2303.06865)
03/2023|[Meet in the Middle: A New Pre-training Paradigm](https://arxiv.org/abs/2303.07295)
03/2023|[Reflexion: an autonomous agent with dynamic memory and self-reflection](https://arxiv.org/abs/2303.11366)
03/2023|[Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.15647)
03/2023|[FP8 versus INT8 for efficient deep learning inference](https://arxiv.org/abs/2303.17951)
03/2023|[Self-Refine: Iterative Refinement with Self-Feedback](https://arxiv.org/abs/2303.17651)
04/2023|[RPTQ: Reorder-based Post-training Quantization for Large Language Models](https://arxiv.org/abs/2304.01089)
04/2023|[REFINER: Reasoning Feedback on Intermediate Representations](https://arxiv.org/abs/2304.01904)
04/2023|[Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442)
04/2023|[Compressed Regression over Adaptive Networks](https://arxiv.org/abs/2304.03638)
04/2023|[A Cheaper and Better Diffusion Language Model with Soft-Masked Noise](https://arxiv.org/abs/2304.04746)
04/2023|[RRHF: Rank Responses to Align Language Models with Human Feedback without tears](https://arxiv.org/abs/2304.05302)
04/2023|[CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society](https://arxiv.org/abs/2303.17760)
04/2023|[Automatic Gradient Descent: Deep Learning without Hyperparameters](https://arxiv.org/abs/2304.05187)
04/2023|[SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models](https://arxiv.org/abs/2303.10464)
04/2023|[Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study](https://arxiv.org/abs/2304.06762)
04/2023|[Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling](https://arxiv.org/abs/2304.09145)
04/2023|[Scaling Transformer to 1M tokens and beyond with RMT](https://arxiv.org/abs/2304.11062)
04/2023|[Answering Questions by Meta-Reasoning over Multiple Chains of Thought](https://arxiv.org/abs/2304.13007)
04/2023|[Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables](https://arxiv.org/abs/2304.13559)
04/2023|[We're Afraid Language Models Aren't Modeling Ambiguity](https://arxiv.org/abs/2304.14399)
04/2023|[The Internal State of an LLM Knows When its Lying](https://arxiv.org/abs/2304.13734)
04/2023|[Search-in-the-Chain: Towards the Accurate, Credible and Traceable Content Generation for Complex Knowledge-intensive Tasks](https://arxiv.org/abs/2304.14732)
05/2023|[Towards Unbiased Training in Federated Open-world Semi-supervised Learning](https://arxiv.org/abs/2305.00771)
05/2023|[Unlimiformer: Long-Range Transformers with Unlimited Length Input](https://arxiv.org/abs/2305.01625)
05/2023|[FreeLM: Fine-Tuning-Free Language Model](https://arxiv.org/abs/2305.01616)
05/2023|[Cuttlefish: Low-rank Model Training without All The Tuning](https://arxiv.org/abs/2305.02538)
05/2023|[AttentionViz: A Global View of Transformer Attention](https://arxiv.org/abs/2305.03210)
05/2023|[Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models](https://arxiv.org/abs/2305.04091)
05/2023|[A Frustratingly Easy Improvement for Position Embeddings via Random Padding](https://arxiv.org/abs/2305.04859)
05/2023|[Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision](https://arxiv.org/abs/2305.03047)
05/2023|[Explanation-based Finetuning Makes Models More Robust to Spurious Cues](https://arxiv.org/abs/2305.04990)
05/2023|[An automatically discovered chain-of-thought prompt generalizes to novel models and datasets](https://arxiv.org/abs/2305.02897)
05/2023|[Recommender Systems with Generative Retrieval](https://arxiv.org/abs/2305.05065)
05/2023|[Fast Distributed Inference Serving for Large Language Models](https://arxiv.org/abs/2305.05920)
05/2023|[Chain-of-Dictionary Prompting Elicits Translation in Large Language Models](https://arxiv.org/abs/2305.06575)
05/2023|[Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach](https://arxiv.org/abs/2305.07001)
05/2023|[Active Retrieval Augmented Generation](https://arxiv.org/abs/2305.06983)
05/2023|[Scalable Coupling of Deep Learning with Logical Reasoning](https://arxiv.org/abs/2305.07617)
05/2023|[Interpretability at Scale: Identifying Causal Mechanisms in Alpaca](https://arxiv.org/abs/2305.08809)
05/2023|[StructGPT: A General Framework for Large Language Model to Reason over Structured Data](https://arxiv.org/abs/2305.09645)
05/2023|[Pre-Training to Learn in Context](https://arxiv.org/abs/2305.09137)
05/2023|[ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings](https://arxiv.org/abs/2305.11554)
05/2023|[Accelerating Transformer Inference for Translation via Parallel Decoding](https://arxiv.org/abs/2305.10427)
05/2023|[Cooperation Is All You Need](https://arxiv.org/abs/2305.10449)
05/2023|[PTQD: Accurate Post-Training Quantization for Diffusion Models](https://arxiv.org/abs/2305.10657)
05/2023|[LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627)
05/2023|[SelfzCoT: a Self-Prompt Zero-shot CoT from Semantic-level to Code-level for a Better Utilization of LLMs](https://arxiv.org/abs/2305.11461)
05/2023|[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
05/2023|["According to ..." Prompting Language Models Improves Quoting from Pre-Training Data](https://arxiv.org/abs/2305.13252)
05/2023|[Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training](https://arxiv.org/abs/2305.14342)
05/2023|[Landmark Attention: Random-Access Infinite Context Length for Transformers](https://arxiv.org/abs/2305.16300)
05/2023|[Scaling Data-Constrained Language Models](https://arxiv.org/abs/2305.16264)
05/2023|[Fine-Tuning Language Models with Just Forward Passes](https://arxiv.org/abs/2305.17333)
05/2023|[Intriguing Properties of Quantization at Scale](https://arxiv.org/abs/2305.19268)
05/2023|[Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time](https://arxiv.org/abs/2305.17118)
05/2023|[Blockwise Parallel Transformer for Long Context Large Models](https://arxiv.org/abs/2305.19370)
05/2023|[The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)
05/2023|[Adapting Language Models to Compress Contexts](https://arxiv.org/abs/2305.14788)
05/2023|[Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)
06/2023|[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978)
06/2023|[Faster Causal Attention Over Large Sequences Through Sparse Flash Attention](https://arxiv.org/abs/2306.01160)
06/2023|[Fine-Grained Human Feedback Gives Better Rewards for Language Model Training](https://arxiv.org/abs/2306.01693)
06/2023|[SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078)
06/2023|[Fine-Tuning Language Models with Advantage-Induced Policy Alignment](https://arxiv.org/abs/2306.02231)
06/2023|[Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards](https://arxiv.org/abs/2306.04488)
06/2023|[Inference-Time Intervention: Eliciting Truthful Answers from a Language Model](https://arxiv.org/abs/2306.03341)
06/2023|[Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories](https://arxiv.org/abs/2306.05406)
06/2023|[Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion](https://arxiv.org/abs/2306.05708)
06/2023|[Word sense extension](https://arxiv.org/abs/2306.05609)
06/2023|[Mitigating Transformer Overconfidence via Lipschitz Regularization](https://arxiv.org/abs/2306.06849)
06/2023|[Recurrent Attention Networks for Long-text Modeling](https://arxiv.org/abs/2306.06843)
06/2023|[One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning](https://arxiv.org/abs/2306.07967)
06/2023|[SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629)
06/2023|[Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training](https://arxiv.org/abs/2306.08055)
06/2023|[Propagating Knowledge Updates to LMs Through Distillation](https://arxiv.org/abs/2306.09306)
06/2023|[Full Parameter Fine-tuning for Large Language Models with Limited Resources](https://arxiv.org/abs/2306.09782)
06/2023|[A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695)
06/2023|[InRank: Incremental Low-Rank Learning](https://arxiv.org/abs/2306.11250)
06/2023|[Evaluating the Zero-shot Robustness of Instruction-tuned Language Models](https://arxiv.org/abs/2306.11270)
06/2023|[Learning to Generate Better Than Your LLM (RLGF)](https://arxiv.org/abs/2306.11816)
06/2023|[Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing](https://arxiv.org/abs/2306.12929)
06/2023|[H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Model](https://arxiv.org/abs/2306.14048)
06/2023|[FLuRKA: Fast fused Low-Rank & Kernel Attention](https://arxiv.org/abs/2306.15799)
06/2023|[Stay on topic with Classifier-Free Guidance](https://arxiv.org/abs/2306.17806)
07/2023|[AutoST: Training-free Neural Architecture Search for Spiking Transformers](https://arxiv.org/abs/2307.00293)
07/2023|[Single Sequence Prediction over Reasoning Graphs for Multi-hop QA](https://arxiv.org/abs/2307.00335)
07/2023|[Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models](https://arxiv.org/abs/2307.01379)
07/2023|[Facing off World Model Backbones: RNNs, Transformers, and S4](https://arxiv.org/abs/2307.02064)
07/2023|[Improving Retrieval-Augmented Large Language Models via Data Importance Learning](https://arxiv.org/abs/2307.03027)
07/2023|[Teaching Arithmetic to Small Transformers](https://arxiv.org/abs/2307.03381)
07/2023|[QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models](https://arxiv.org/abs/2307.03738)
07/2023|[Stack More Layers Differently: High-Rank Training Through Low-Rank Updates](https://arxiv.org/abs/2307.05695)
07/2023|[Copy Is All You Need (CoG)](https://arxiv.org/abs/2307.06962)
07/2023|[Multi-Method Self-Training: Improving Code Generation With Text, And Vice Versa](https://arxiv.org/abs/2307.10633)
07/2023|[Divide & Bind Your Attention for Improved Generative Semantic Nursing](https://arxiv.org/abs/2307.10864)
07/2023|[Challenges and Applications of Large Language Models](https://arxiv.org/abs/2307.10169)
07/2023|[Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models](https://arxiv.org/abs/2307.08303)
07/2023|[QuIP: 2-Bit Quantization of Large Language Models With Guarantees](https://arxiv.org/abs/2307.13304)
07/2023|[CoRe Optimizer: An All-in-One Solution for Machine Learning](https://arxiv.org/abs/2307.15663)
07/2023|[Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https://files.catbox.moe/4zgrsb.pdf)
08/2023|[ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation](https://arxiv.org/abs/2308.03793)
08/2023|[EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models](https://arxiv.org/abs/2308.07269)
08/2023|[Activation Addition: Steering Language Models Without Optimization](https://arxiv.org/abs/2308.10248)
08/2023|[OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://arxiv.org/abs/2308.13137)
08/2023|[Accelerating LLM Inference with Staged Speculative Decoding](https://arxiv.org/abs/2308.04623)
08/2023|[YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/abs/2309.00071)
08/2023|[LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models](https://arxiv.org/abs/2308.16137)
09/2023|[Making Large Language Models Better Reasoners with Alignment](https://arxiv.org/abs/2309.02144)
09/2023|[Data-Juicer: A One-Stop Data Processing System for Large Language Models](https://arxiv.org/abs/2309.02033)
09/2023|[Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices](https://arxiv.org/abs/2309.02411)
09/2023|[SLiMe: Segment Like Me](https://arxiv.org/abs/2309.03179)
09/2023|[Norm Tweaking: High-performance Low-bit Quantization of Large Language Models](https://arxiv.org/abs/2309.02784)
09/2023|[When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale](https://arxiv.org/abs/2309.04564)
09/2023|[Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516)
09/2023|[Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)
09/2023|[Cure the headache of Transformers via Collinear Constrained Attention](https://arxiv.org/abs/2309.08646)
09/2023|[Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity](https://arxiv.org/abs/2309.10285)
09/2023|[LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307)
09/2023|[MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation](https://arxiv.org/abs/2309.13042)
09/2023|[Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models](https://arxiv.org/abs/2309.15531)
09/2023|[Improving Code Generation by Dynamic Temperature Sampling](https://arxiv.org/abs/2309.02772)
09/2023|[Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
10/2023|[DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models](https://arxiv.org/abs/2310.00902)
10/2023|[GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length](https://arxiv.org/abs/2310.00576)
10/2023|[Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models](https://arxiv.org/abs/2310.01107)
10/2023|[Elephant Neural Networks: Born to Be a Continual Learner](https://arxiv.org/abs/2310.01365)
10/2023|[Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889)
10/2023|[Retrieval meets Long Context Large Language Models](https://arxiv.org/abs/2310.03025)
10/2023|[DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines](https://arxiv.org/abs/2310.03714)
10/2023|[LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers](https://arxiv.org/abs/2310.03294)
10/2023|[Amortizing intractable inference in large language models (GFlowNet Tuning)](https://arxiv.org/abs/2310.04363)
10/2023|[SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF](https://arxiv.org/abs/2310.05344)
10/2023|[Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity](https://arxiv.org/abs/2310.05175)
10/2023|[Let Models Speak Ciphers: Multiagent Debate through Embeddings](https://arxiv.org/abs/2310.06272)
10/2023|[InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining](https://arxiv.org/abs/2310.07713)
10/2023|[CacheGen: Fast Context Loading for Language Model Applications](https://arxiv.org/abs/2310.07240)
10/2023|[MatFormer: Nested Transformer for Elastic Inference](https://arxiv.org/abs/2310.07707)
10/2023|[LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models](https://arxiv.org/abs/2310.08659)
10/2023|[Towards End-to-end 4-Bit Inference on Generative Large Language Models (QUIK)](https://arxiv.org/abs/2310.09259)
10/2023|[Microscaling Data Formats for Deep Learning](https://arxiv.org/abs/2310.10537)
10/2023|[xVal: A Continuous Number Encoding for Large Language Models](https://arxiv.org/abs/2310.02989)
10/2023|[An Emulator for Fine-Tuning Large Language Models using Small Language Models](https://arxiv.org/abs/2310.12962)
10/2023|[Frozen Transformers in Language Models Are Effective Visual Encoder Layers](https://arxiv.org/abs/2310.12973)
10/2023|[LoBaSS: Gauging Learnability in Supervised Fine-tuning Data](https://arxiv.org/abs/2310.13008)
10/2023|[Quality-Diversity through AI Feedback](https://arxiv.org/abs/2310.13032)
10/2023|[DoGE: Domain Reweighting with Generalization Estimation](https://arxiv.org/abs/2310.15393)
10/2023|[E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity](https://arxiv.org/abs/2310.15929)
10/2023|[Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation](https://arxiv.org/abs/2310.15961)
10/2023|[Personas as a Way to Model Truthfulness in Language Models](https://arxiv.org/abs/2310.18168)
10/2023|[Atom: Low-bit Quantization for Efficient and Accurate LLM Serving](https://arxiv.org/abs/2310.19102)
10/2023|[QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models](https://arxiv.org/abs/2310.16795)
11/2023|[AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models](https://arxiv.org/abs/2311.01305)
11/2023|[FlashDecoding++: Faster Large Language Model Inference on GPUs](https://arxiv.org/abs/2311.01282)
11/2023|[Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization](https://arxiv.org/abs/2311.01544)
11/2023|[Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs](https://arxiv.org/abs/2311.02262)
11/2023|[REST: Retrieval-Based Speculative Decoding](https://arxiv.org/abs/2311.08252)
11/2023|[DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines](https://arxiv.org/abs/2311.10418)
11/2023|[Token-level Adaptation of LoRA Adapters for Downstream Task Generalization](https://arxiv.org/abs/2311.10847)
11/2023|[Exponentially Faster Language Modelling](https://arxiv.org/abs/2311.10770)
11/2023|[MultiLoRA: Democratizing LoRA for Better Multi-Task Learning](https://arxiv.org/abs/2311.11501)
11/2023|[LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning](https://arxiv.org/abs/2311.12023)
11/2023|[Token Recycling for Efficient Sequential Inference with Vision Transformers](https://arxiv.org/abs/2311.15335)
11/2023|[Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization](https://arxiv.org/abs/2311.16442)
12/2023|[GIFT: Generative Interpretable Fine-Tuning Transformers](https://arxiv.org/abs/2312.00700)
12/2023|[PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models](https://arxiv.org/abs/2312.02429)
12/2023|[Improving Activation Steering in Language Models with Mean-Centring](https://arxiv.org/abs/2312.03813)
12/2023|[A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA](https://arxiv.org/abs/2312.03732)
12/2023|[SparQ Attention: Bandwidth-Efficient LLM Inference](https://arxiv.org/abs/2312.04985)
12/2023|[ESPN: Memory-Efficient Multi-Vector Information Retrieval](https://arxiv.org/abs/2312.05417)
12/2023|[Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models](https://arxiv.org/abs/2312.05503)
12/2023|[CBQ: Cross-Block Quantization for Large Language Models](https://arxiv.org/abs/2312.07950) 12/2023|[SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention](https://arxiv.org/abs/2312.07987) 12/2023|[Weight subcloning: direct initialization of transformers using larger pretrained ones](https://arxiv.org/abs/2312.09299) 12/2023|[Cascade Speculative Drafting for Even Faster LLM Inference](https://arxiv.org/abs/2312.11462) 12/2023|[PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU](https://arxiv.org/abs/2312.12456) 12/2023|[ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference](https://arxiv.org/abs/2312.11882) 12/2023|[Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy](https://arxiv.org/abs/2312.12728) 12/2023|[A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties](https://arxiv.org/abs/2312.13764) 12/2023|[Algebraic Positional Encodings](https://arxiv.org/abs/2312.16045) 12/2023|[Preference as Reward, Maximum Preference Optimization with Importance Sampling](https://arxiv.org/abs/2312.16430) 01/2024|[LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning](https://arxiv.org/abs/2401.01325) 01/2024|[Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/abs/2401.01335) 01/2024|[LLaMA Pro: Progressive LLaMA with Block Expansion](https://arxiv.org/abs/2401.02415) 01/2024|[Fast and Optimal Weight Update for Pruned Large Language Models](https://arxiv.org/abs/2401.02938) 01/2024|[Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon](https://arxiv.org/abs/2401.03462) 01/2024|[MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts](https://arxiv.org/abs/2401.04081) 01/2024|[Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning](https://arxiv.org/abs/2401.04151) 01/2024|[RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation](https://arxiv.org/abs/2401.04679) 01/2024|[Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models](https://arxiv.org/abs/2401.04658) 01/2024|[AUTOACT: Automatic Agent Learning from Scratch via Self-Planning](https://arxiv.org/abs/2401.05268) 01/2024|[Extreme Compression of Large Language Models via Additive Quantization (AQLM)](https://arxiv.org/abs/2401.06118) 01/2024|[Knowledge Translation: A New Pathway for Model Compression](https://arxiv.org/abs/2401.05772) 01/2024|[Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks](https://arxiv.org/abs/2401.02731) 01/2024|[Transformers are Multi-State RNNs](https://arxiv.org/abs/2401.06104) 01/2024|[Extending LLMs' Context Window with 100 Samples (Entropy-ABF)](https://arxiv.org/abs/2401.07004) 01/2024|[ChatQA: Building GPT-4 Level Conversational QA Models](https://arxiv.org/abs/2401.10225) 01/2024|[AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference](https://arxiv.org/abs/2401.10652) 01/2024|[Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://arxiv.org/abs/2401.10774) 01/2024|[Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://arxiv.org/abs/2401.08417) 01/2024|[BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language 
Models](https://arxiv.org/abs/2401.12522) 01/2024|[Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment](https://arxiv.org/abs/2401.12474) 01/2024|[Dynamic Layer Tying for Parameter-Efficient Transformers](https://arxiv.org/abs/2401.12819) 01/2024|[MambaByte: Token-free Selective State Space Model](https://arxiv.org/abs/2401.13660) 01/2024|[FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design](https://arxiv.org/abs/2401.14112) 01/2024|[Accelerating Retrieval-Augmented Language Model Serving with Speculation](https://arxiv.org/abs/2401.14021) 01/2024|[Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities](https://arxiv.org/abs/2401.14405) 01/2024|[EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/abs/2401.15077) 01/2024|[With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation (Temp LoRA)](https://arxiv.org/abs/2401.11504) 01/2024|[YODA: Teacher-Student Progressive Learning for Language Models](https://arxiv.org/abs/2401.15670) 01/2024|[KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization](https://arxiv.org/abs/2401.18079) 01/2024|[LOCOST: State-Space Models for Long Document Abstractive Summarization](https://arxiv.org/abs/2401.17919) 01/2024|[Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model](https://arxiv.org/abs/2401.17868) 01/2024|[RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval](https://arxiv.org/abs/2401.18059) 02/2024|[EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models](https://arxiv.org/abs/2402.00518) 02/2024|[MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts](https://arxiv.org/abs/2402.00893) 02/2024|[Break the Sequential Dependency of LLM Inference Using Lookahead Decoding](https://arxiv.org/abs/2402.02057) 02/2024|[Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities](https://arxiv.org/abs/2402.01831) 02/2024|[HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA](https://arxiv.org/abs/2402.01767) 02/2024|[KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750) 02/2024|[DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing](https://arxiv.org/abs/2402.02583) 02/2024|[QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks](https://arxiv.org/abs/2402.04396) 02/2024|[Hydragen: High-Throughput LLM Inference with Shared Prefixes](https://arxiv.org/abs/2402.05099) 02/2024|[Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding](https://arxiv.org/abs/2402.05109) 02/2024|[LESS: Selecting Influential Data for Targeted Instruction Tuning](https://arxiv.org/abs/2402.04333) 02/2024|[Accurate LoRA-Finetuning Quantization of LLMs via Information Retention](https://arxiv.org/abs/2402.05445) 02/2024|[AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers](https://arxiv.org/abs/2402.05602) 02/2024|[X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics](https://arxiv.org/abs/2402.07148) 02/2024|[BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data](https://arxiv.org/abs/2402.08093) 02/2024|[Mitigating Object Hallucination in Large Vision-Language Models 
via Classifier-Free Guidance](https://arxiv.org/abs/2402.08680) 02/2024|[Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference](https://arxiv.org/abs/2402.09398) 02/2024|[Uncertainty Decomposition and Quantification for In-Context Learning of Large Language Models](https://arxiv.org/abs/2402.10189) 02/2024|[RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models](https://arxiv.org/abs/2402.10038) 02/2024|[BitDelta: Your Fine-Tune May Only Be Worth One Bit](https://arxiv.org/abs/2402.10193) 02/2024|[DoRA: Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/abs/2402.09353) 02/2024|[In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss](https://arxiv.org/abs/2402.10790) 02/2024|[Aligning Modalities in Vision Large Language Models via Preference Fine-tuning](https://arxiv.org/abs/2402.11411) 02/2024|[Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding](https://arxiv.org/abs/2402.11809) 02/2024|[Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts](https://arxiv.org/abs/2402.10958) 02/2024|[WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More](https://arxiv.org/abs/2402.12065) 02/2024|[DB-LLM: Accurate Dual-Binarization for Efficient LLMs](https://arxiv.org/abs/2402.11960) 02/2024|[Data Engineering for Scaling Language Models to 128K Context](https://arxiv.org/abs/2402.10171) 02/2024|[EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs](https://arxiv.org/abs/2402.12419) 02/2024|[HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts](https://arxiv.org/abs/2402.12656) 02/2024|[Turn Waste into Worth: Rectifying Top-k Router of MoE](https://arxiv.org/abs/2402.12399) 02/2024|[Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive](https://arxiv.org/abs/2402.13228) 02/2024|[Q-Probe: A Lightweight Approach to Reward Maximization for Language Models](https://arxiv.org/abs/2402.14688) 02/2024|[Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization](https://arxiv.org/abs/2402.14270) 02/2024|[MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models](https://arxiv.org/abs/2402.15268) 02/2024|[Fine-tuning CLIP Text Encoders with Two-step Paraphrasing](https://arxiv.org/abs/2402.15120) 02/2024|[BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation](https://arxiv.org/abs/2402.16880) 02/2024|[No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization](https://arxiv.org/abs/2402.18096) 02/2024|[DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation](https://arxiv.org/abs/2402.17812) 02/2024|[CoDream: Exchanging dreams instead of models for federated aggregation with heterogeneous models](https://arxiv.org/abs/2402.15968) 02/2024|[Humanoid Locomotion as Next Token Prediction](https://arxiv.org/abs/2402.19469) 02/2024|[KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/abs/2402.01306) 02/2024|[Noise Contrastive Alignment of Language Models with Explicit Rewards (NCA)](https://arxiv.org/abs/2402.05369) 03/2024|[Not all Layers of LLMs are Necessary during Inference](https://arxiv.org/abs/2403.02181) 03/2024|[Masked Thought: Simply Masking Partial Reasoning Steps Can Improve 
Mathematical Reasoning Learning of Language Models](https://arxiv.org/abs/2403.02178) 03/2024|[DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models](https://arxiv.org/abs/2403.00818) 03/2024|[GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507) 03/2024|[Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding](https://arxiv.org/abs/2403.04797) 03/2024|[Scattered Mixture-of-Experts Implementation](https://arxiv.org/abs/2403.08245) 03/2024|[AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning](https://arxiv.org/abs/2403.09113) 03/2024|[BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences](https://arxiv.org/abs/2403.09347) 03/2024|[Bifurcated Attention for Single-Context Large-Batch Sampling](https://arxiv.org/abs/2403.08845) 03/2024|[Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference](https://arxiv.org/abs/2403.09054) 03/2024|[Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering](https://arxiv.org/abs/2403.09622) 03/2024|[Recurrent Drafter for Fast Speculative Decoding in Large Language Models](https://arxiv.org/abs/2403.09919) 03/2024|[Arcee's MergeKit: A Toolkit for Merging Large Language Models](https://arxiv.org/abs/2403.13257) 03/2024|[Rotary Position Embedding for Vision Transformer](https://arxiv.org/abs/2403.13298) 03/2024|[BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models](https://arxiv.org/abs/2403.13037) 03/2024|[Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition](https://arxiv.org/abs/2403.14148) 03/2024|[DreamReward: Text-to-3D Generation with Human Preference](https://arxiv.org/abs/2403.14613) 03/2024|[Evolutionary Optimization of Model Merging Recipes](https://arxiv.org/abs/2403.13187) 03/2024|[When Do We Not Need Larger Vision Models?](https://arxiv.org/abs/2403.13043) 03/2024|[FeatUp: A Model-Agnostic Framework for Features at Any Resolution](https://arxiv.org/abs/2403.10516) 03/2024|[ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching](https://arxiv.org/abs/2403.17312) 03/2024|[The Unreasonable Ineffectiveness of the Deeper Layers](https://arxiv.org/abs/2403.17887) 03/2024|[QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456) 04/2024|[LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models](https://arxiv.org/abs/2404.01617) 04/2024|[Prompt-prompted Mixture of Experts for Efficient LLM Generation (GRIFFIN)](https://arxiv.org/abs/2404.01617) 04/2024|[BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models](https://arxiv.org/abs/2404.02827) | |**Articles** 03/2019|[Rich Sutton - The Bitter Lesson](https://archive.is/QqKWF) 06/2022|[Yann LeCun - A Path Towards Autonomous Machine Intelligence](https://openreview.net/forum?id=BZ5a1r-kVsf) 01/2023|[Lilian Weng - The Transformer Family Version 2.0](https://archive.is/3O1n8) 01/2023|[Lilian Weng - Large Transformer Model Inference Optimization](https://archive.is/Clu0H) 03/2023|[Stanford - Alpaca: A Strong, Replicable Instruction-Following Model](https://archive.is/Ky1lu) 05/2023|[OpenAI - Language models can explain neurons in language models](https://archive.is/Y6Lvd) 05/2023|[Alex Turner - Steering GPT-2-XL by adding an activation 
vector](https://archive.is/E7ehv) 06/2023|[YyWang - Do We Really Need the KVCache for All Large Language Models](https://archive.is/quOu2) 06/2023|[kaiokendev - Extending Context is Hard…but not Impossible](https://archive.is/vJC44) 06/2023|[bloc97 - NTK-Aware Scaled RoPE](https://archive.is/Rsoai) 07/2023|[oobabooga - A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities](https://archive.is/HgzRV) 07/2023|[Jianlin Su - Carrying the beta position to the end (better NTK RoPE method)](https://archive.is/hfbHH) 08/2023|[Charles Goddard - On Frankenllama](https://archive.is/GYoVX) 10/2023|[Tri Dao - Flash-Decoding for Long-Context Inference](https://archive.is/KCu83) 10/2023|[Evan Armstrong - Human-Sourced, AI-Augmented: a promising solution for open source conversational data](https://archive.is/zPPFU) 12/2023|[Anthropic - Long context prompting for Claude 2.1](https://archive.is/zGngI) 12/2023|[Andrej Karpathy - On the "hallucination problem" (tweet.jpg)](https://files.catbox.moe/jnrzrz.jpg) 12/2023|[HuggingFace - Mixture of Experts Explained](https://archive.is/8r7t9) 01/2024|[Vgel - Representation Engineering](https://archive.is/SHV3E) 02/2024|[Lilian Weng - Thinking about High-Quality Human Data](https://archive.is/1K0EM) 03/2024|[rayliuca - T-Ragx Project Write Up (Translation RAG)](https://archive.is/VU9eI) 03/2024|[Answer.AI - You can now train a 70b language model at home (FSDP/QLoRA)](https://archive.is/jb4n9)