!!!warning this guide is %red%LARGELY OUTDATED%% (from October 2023) and designed ONLY for running AI in the Cloud, which is **inferior** to ANY other method as of 2024+. if you are **new to AI** and want to chat with bots then [**start with this guide**](https://rentry.org/onrms) instead. the guide below is designed for anons who want to try out local models on Google servers
!!!note running Locals via Colab + semi-general Locals guide
[TOC4]
╔═══════════════════════ ⋆⋅✨⋅🎈⋅🍎⋅🌈⋅💎⋅🦋⋅⋆ ═══════════════════════╗
**don't know where to start?**
* never heard about GPTQ, GGUF, kobold.cpp and quantz? then check [QRD on models](#qrd-on-models)
* want to learn about Mirostat, Typical P, Beam Search and other samplers? then check [Knobs](#knobs)
* never touched Advanced Formatting before? then learn about [presets](#advanced-formatting)
* have issues with Locals? then check [troubleshooting](#troubleshooting)
* otherwise, just grab the Colab Notebook [here](#notebook), and up you go
* ...or just read everything
╚═══════════════════════ ⋆⋅🦋⋅💎⋅🌈⋅🍎⋅🎈⋅✨⋅⋆ ═══════════════════════╝
#### readme
> what is it?
it is a guide on running Locals via [Google Colab](https://colab.research.google.com/) + the settings + knobs + presets + troubleshooting + two pony-based LoRAs by [anon from /ppp/ thread](https://desuarchive.org/mlp/thread/40255207/#40271923)
***
> context size?
for 13b Locals the `max context size` is:

|format|max context size|
|---|---|
|**GPTQ**|7168|
|**EXL2** < 4 bws|7680|
|**EXL2** 4-5 bws|6144|
|**EXL2** 5-6 bws|4096|
|**GGUF** / **GGML** Q4_K_M|6144|
|**GGUF** / **GGML** Q5_K_M|4096 *(5122 in llama.cpp)*|
|**GGUF** / **GGML** Q6_K|4096|

*(`response length` is better set in the ~300 range)*
***
> what model shall I use?
* generally, use a **GGUF Q6_K** model if you want good responses with *4k context*
* otherwise use a **GGUF Q4_K_M** model if you need *6k context* at lower quality
* if you need the pony *LoRAs* then use **GPTQ 4bit 32g**
***
> what is Google Colab? do I need a good PC for this?
Colab is *Google's data servers*. Google lets people use them for free for non-commercial purposes (with an emphasis on *research and learning*). paid accounts may use Colab for various other purposes and get more performance/priority.
you only need an internet connection: **your PC/GPU/CPU/RAM don't matter**. all computation is done solely on Colab servers
***
> what is the catch?
==**Time limit.**== Google Colab applies a vague time limit, allowing only a certain number of hours per account. *usually* you can use Google Colab **for 2-3 hours per day**, but it varies greatly due to unknown factors. one time it allowed me to sit for 7 hours, another time it cut me short after 55 minutes
***
> but they count only active usage time, right? if I launch their system and don't use the model then I don't use their resources, correct?
**no**. Google counts an idle connection as a real one regardless of active usage. in simple terms: if you connect to Google Colab, leave your PC for four hours and literally do nothing - you will still get 'daily limit is reached'
***
> what will happen after my time limit is up?
Google will cut off your access to their T4 GPU, and Locals do require a GPU to work. you can still use Colab on CPU via kobold.cpp (see settings) but the generation will be extremely slow. your **usage limit will recharge in 8-12 hours**.
there are no other limitations, and the anon who made a well-known rentry about Colab [claims to have used it for Locals for months](https://desuarchive.org/g/thread/95472866/#95480712) with no problems
![](https://files.catbox.moe/koblpb.png)
***
> can I abuse it with multi-accounting? will Google Colab ban me for that?
yes, you can do multi-accounting and abuse their free limits, but it is highly recommended to **use a VPN + fake user-agent** / incognito mode to protect yourself
***
> NSFW?
**yes**. no strings attached. Locals are absolutely unfiltered so you can do whatever you want. mind that some Locals may have inner filters because they were trained on filtered materials (for example, non-finetuned *Llama 2 is censored*, as is Falcon 180b)
***
> will Google ban me for NSFW?
Google *will not* ban your account for NSFW: in theory they can throttle or rate-limit accounts for suspicious usage, but I have yet to find real evidence of such practice. still, don't push very questionable content on your primary gmail account too much
***
>...but Google banned Pygmalion...
it was done in a different (much stupider) time, and it no longer applies. nowadays users can freely run Pygmalion 13B on Colab without negative consequences
***
>...but Google banned A1111...
not banned, but severely handicapped with throttling, because everyone uses it! [Derpibooru alone has 10000+ AI pics](https://www.derpibooru.org/search?q=ai+content%2C+-generator%3Anovelai) and you can guess with high certainty that 80% of them were generated via Colab, and that number doesn't include the 3,000,000+ bad pics that were discarded in the process. now apply those numbers to the whole internet, not 4chan only, and you get the picture. too much payload for Google (and people were abusing it much more than Pygmalion)
***
#### troubleshooting
> Local generates garbage and noise
1. *significantly* `reduce the max context` (by 2k - 4k) and try again. it might be an overflow issue. if you start receiving coherent responses on low context, then return to your previous max context and reduce it by 512 until the responses become meaningful
2. set the following `samplers` to default values (if they are available): *Mirostat* (0/0/0), *Penalty Alpha* (0), *Encoder Rep. Pen.* (1), *Beam search* (1), *CFG Scale* / *Negative Prompt* (1)
3. ensure that `Ban EOS Token` is **OFF**
4. ensure that `Instruct Mode` is **ON**
5. change `presets` (Alpaca / Vicuna 1.1 / Llama 2 Chat are the most common ones, but try them all)
6. enable/disable `streaming`
***
> kobold.cpp generates only 49-50-51 tokens in SillyTavern
you are connecting to kobold.cpp as if it were *oobabooga*. **pick KoboldAI** as the API instead:
![](https://files.catbox.moe/5o88v7.png)
***
> Local generates blank messages or gets stuck
it is an issue on the SillyTavern/Agnai <\-> Cloudflare <\-> Colab side, not a skill issue. sometimes your response just gets stuck
first, **enable streaming**, since it helps to see the generation in real time
second, if message generation doesn't start within 10-15 seconds, you can safely **discard/pause** the current message and send a new one. just click on the pause button
![](https://files.catbox.moe/owgn22.png)
***
> is it just me or do models lose intelligence closer to the end of their context?
no, that's how **ALL** models work (including GPT4 and Claude). it becomes even more apparent when using *alpha_value (NTK RoPE)*, which scales the model's context to values higher than its original max context.
to compensate for the scaling, the model starts losing some tokens from the prompt - it is unavoidable unless more VRAM is thrown at it (which big corpos can afford, but not you)
***
> I got "RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver" error on Colab
you didn't select `T4 GPU` in `Hardware accelerator` in Colab. **or** your *daily limit has run out* and you are now trying to connect via CPU; in that case wait 8-12 hours before trying again
***
> it says "You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be."
don't mind that warning. if you install flash-attention in Colab then EXL2 will not work at all (shitty Colab GPU)
***
***
***
***
#### QRD on models
Locals work the following way
first you have a **basic (aka foundation, pretrained) model** - a model trained on a huge general dataset. the best-known foundation model is `Llama`, made by Meta; other examples are `Falcon`, `Mistral`, `GPT-2/J`, `MPT`...
those foundation models can be further **trained** on extra datasets (text files) to steer them in a certain direction: writing fiction, RP, doing science work, helping with coding, giving medical advice, being NSFW writers... such models are called **finetuned** or **instruction-tuned** (depending on the process): `Pygmalion`, `Vicuna`, `Pythia`, `Wizard`, `Hermes`, `Beluga`, etc
if you see an ==L== or ==L2== label in a model's name, it means the model is derived from / based on `Llama` (1/2)
those finetuned models can be **merged** together in pretty much any combination; for example, the well-known `Mythomax` is a merge of:
1. Hermes + Chronos + Airoboros = **Mythologic**
2. Hermes + Chronos + Airoboros + Beluga + LimaRP = **Huginn**
3. MythoLogic + Huginn = **Mythomax**
yes, it is not a typo - three models (Hermes, Chronos and Airoboros) were merged *twice* in the process, only in different ratios.
> if you ever wondered why Locals are so retarded, well, they have *lots of inbreeding*
***
a model's most defining characteristic is its **parameter count** (or, as faggots say, **beaks**). the most common beak counts are **7b**, **13b**, **20b**, **33b**, **65b** and **70b**, but models may have *any value* in between
> in general, parameter count reflects the general intelligence of a model: **more beaks, more smarts**
but it is not always the case: some 7b models (`Mistral`) *may perform* better than some 13b, while some 20b *may be* as dumb as 7b. furthermore, the community always gets new tools, technologies and ideas that further boost those distinctions: *for example, 13b models from early April 2023 are much, much dumber than 13b models from September 2023*. you have to lurk and ask around which Locals are currently considered the best, because **the meta changes** every ~2-3 weeks whenever a new milestone is achieved
***
all those models require A LOT of VRAM/RAM to load; in order to reduce the specs, the models are **compressed** (quantized) into different formats. compression reduces the overall model size but hurts intelligence because some data becomes non-retrievable. the four most popular compression formats are `GGUF`, `GGML`, `GPTQ` and `EXL2`
**GGUF** - the format that measures compression in *Quants* (**Q**): `Q2_K` is the most retarded version but takes the least amount of VRAM/RAM to load, while `Q8_0` has the least compression and offers the best, almost lossless, quality.
for Colab you shall use:

|GGUF quantz|vram usage|quality|context in Colab via kobold.cpp|...via llama.cpp|
|--|--|--|--|--|
|**Q4_K_M**|low|average|6144|6144|
|**Q5_K_M**|average|good|4096|5122|
|**Q6_K**|high|best|4096|4096|

--use `Q4_K_M` if you value context size and `Q6_K` if you want quality the most
**GGML** - the predecessor of GGUF, now outdated. use GGUF models instead
**GPTQ** - the format that measures compression in *Bits* (**b**) and *Group Size* (**g**): more bits and a smaller group size mean better quality. for the purposes of Colab, look for `4bit` / `32g` models: they can be loaded in Colab with 7168 context. in terms of intelligence, it stays at about `Q4_K_M` level
**EXL2** - the new format that will replace GPTQ in the future, but for now its support is still limited. it allows models with a mixed number of bits per weight - **bws**. GPTQ only allowed 4-bit and 8-bit models, while EXL2 can be anything in between and offers better compression. for Colab, look for models with `4-5-6 bws`
>models of various formats, datasets to train them with, and all the benchmarks are hosted on [huggingface.co](https://huggingface.co/) --all Locals can be found there
***
to launch all those formats we use different *programs* called **backends** or **loaders**:
* GPTQ - [exllama](https://github.com/turboderp/exllama)
* EXL2 - [exllamav2](https://github.com/turboderp/exllamav2)
* GGUF - [kobold.cpp](https://github.com/LostRuins/koboldcpp) or [llama.cpp](https://github.com/ggerganov/llama.cpp)
* GGML - kobold.cpp
`exllama`, `exllamav2` and `llama.cpp` are combined into one backend called [oobabooga](https://github.com/oobabooga/text-generation-webui) (if you ever did SD and heard about *A1111* - it is the same kind of all-in-one package), while `kobold.cpp` is a standalone program that originated from llama.cpp but became its own thing
***
if you don't know which model to start with, then start with **Q4_K_M GGUF models with 6k context**. you can search [huggingface/models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending&search=GGUF+13B) for a catalog of various models. if you want to apply LoRAs (see below) then look for **GPTQ 4bit / 32g models** in [huggingface/models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending&search=GPTQ+13B)
if you absolutely don't know what model to pick then check [huggingface's leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [/lmg/'s ayumi leaderboard](https://rentry.org/ayumi_erp_rating#13b-models) (which only measures the amount of lewd words used, not actual coherence)
some good options to start with:

==**--\> [MythoMax 13B](https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/tree/main)**==
**MythoMax** is a little 13B full of big sovl. maybe not the smartest or the lewdest, but it produces surprisingly poetic and rich descriptions for a 13b. Mythomax is considered the golden standard of all-around models: a well-rounded working ~~pony~~ horse
==**--\> [MLewdBoros 13B](https://huggingface.co/Undi95/MLewdBoros-L2-13B-GGUF/tree/main)**==
**MLewdBoros** is a merge between MLewd, an extremely lewd model, and Spicyboros, another horny model. the result is, you guessed it - a lewd and horny model
==**--\> [Amethyst 13B](https://huggingface.co/Undi95/Amethyst-13B-GGUF/tree/main)**==
**Amethyst** is my personal favorite 13b merge so far. as usual, Undi95 threw some random stuff into one merge and it somehow clicked: both RP and storywriting. don't question my personal autism
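you don't have to touch any of this yourself - the notebook below wires it all up - but if you are curious what kobold.cpp / llama.cpp actually do with the GGUF file you pick, here is a minimal sketch using the llama-cpp-python bindings. the filename, prompt and GPU-offload choice are just illustrative assumptions, not something the notebook asks of you:
``` python
# minimal sketch of what a GGUF loader does under the hood (llama-cpp-python bindings);
# the filename and prompt below are placeholders for illustration only
from llama_cpp import Llama

llm = Llama(
    model_path="mythomax-l2-13b.Q4_K_M.gguf",  # the exact quant file you downloaded
    n_ctx=6144,        # max context: 6144 for a 13b Q4_K_M, per the table above
    n_gpu_layers=-1,   # offload all layers to the GPU (what the Colab T4 is there for)
)

out = llm(
    "### Instruction:\nDescribe Ponyville at dawn.\n\n### Response:\n",
    max_tokens=300,    # ~300 response length, as recommended earlier
    temperature=0.84,
    top_p=0.9,
)
print(out["choices"][0]["text"])
```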
***
***
***
#### Notebook
==**--\> #1**==. **download the [Notebook](https://files.catbox.moe/jowep9.ipynb)** *(all credits are in the notebook)*
previous versions (not needed, just for history): [[07_13]](https://files.catbox.moe/09p2rs.ipynb), [[old]](https://files.catbox.moe/rjpbvb.ipynb)
***
==**--\> #2**==. proceed to [Google Colab](https://colab.research.google.com/) (requires a gmail account)
***
==**--\> #3**==. `File` -\> `Open notebook` -\> `Upload` -\> ==Select notebook==
![](https://files.catbox.moe/9e2vl2.png)
if you cannot find `File` then click on the *gray arrow* in the top-right corner:
![](https://files.catbox.moe/9l80ky.png)
***
==**--\> #4**==. you will be redirected to this page. these are the settings:
![](https://files.catbox.moe/nsbj00.png)
if you cannot see the settings then click there:
![](https://files.catbox.moe/3xmlto.png)
***
==**--\> #5**==. everything is already set to the best options. the only thing you have to do is provide the link to the Local you would like to use *(the first input field - `input_model_string`)*
==**for GGUF / GGML models:**== **copy-paste the link to the ==exact file== you want to download**. for example, if you want to use the GGUF Q6 version of `PsyMedRP` then visit [its page on huggingface](https://huggingface.co/Undi95/PsyMedRP-v1-13B-GGUF/tree/main) and copy-paste the link:
![](https://files.catbox.moe/kmujy0.png)
![](https://files.catbox.moe/kj75qu.png)
==**for GPTQ / EXL2 models:**== **copy-paste the link to the ==whole repo== you want to download**. for example, if you want to use the GPTQ version of `MythoMax` then visit [its page on huggingface](https://huggingface.co/TheBloke/MythoMax-L2-13B-GPTQ/tree/main) and copy-paste the link:
![](https://files.catbox.moe/tlb4ey.png)
![](https://files.catbox.moe/1um0ox.png)
different GPTQ versions are usually uploaded as **different tree branches**. select them from the dropdown
![](https://files.catbox.moe/bbmwr3.png)
***
==**--\> #6**==. in the *top-right corner* look for the `reconnect` link and the *arrow-down* next to it -\> `Change runtime type` -\> `Hardware accelerator` -\> `T4 GPU`
![](https://files.catbox.moe/eyj5rs.png)
***
==**--\> #7**==. press the **first PLAY button** to enable the music player. it is a 24h silent sound file that keeps the tab busy and prevents its hibernation (and issues with Colab)
![](https://files.catbox.moe/y7ugnl.png)
***
==**--\> #8**==. press the **second PLAY button** and wait.
![](https://files.catbox.moe/n2kvkv.png)
***
==**--\> #9**==. now you need to link Colab to your SillyTavern or Agnai
==**for GGUF / GGML models:**== when the notebook asks you to connect to `http://localhost:5001`...
![](https://files.catbox.moe/43yxrm.png)
...scroll up until you see a `Your quick Tunnel has been created` link. copy it
![](https://files.catbox.moe/99ia08.png)
in **SillyTavern** select `KoboldAI` as the API and paste the generated link, then click `connect` and ensure that the *model name* appears below... **and chat with AI Mares**
![](https://files.catbox.moe/57dcvg.png)
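(optional) if you want to sanity-check the tunnel before, or instead of, wiring up a frontend, you can poke kobold.cpp's KoboldAI-style API directly. a rough sketch - the URL is a placeholder for your own quick-Tunnel link, and the endpoints/fields are the standard KoboldAI ones as I understand them, so verify against your kobold.cpp build if anything looks off:
``` python
# quick sanity check of the Colab tunnel; replace the placeholder URL with
# the "quick Tunnel" link printed by your own notebook
import requests

BASE = "https://your-tunnel-name.trycloudflare.com"

# should print the name of the loaded model (the same thing SillyTavern shows after connecting)
print(requests.get(f"{BASE}/api/v1/model", timeout=30).json())

# minimal generation request against the KoboldAI-style endpoint
payload = {
    "prompt": "### Instruction:\nSay hi to Twilight Sparkle.\n\n### Response:\n",
    "max_length": 80,             # response length in tokens
    "max_context_length": 6144,   # must not exceed the context you loaded the model with
    "temperature": 0.84,
}
r = requests.post(f"{BASE}/api/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```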
in **Agnai** create the following preset:
* select `AI Service` -\> `Kobold / 3rd party` -\> `Kobold`
* `Third Party URL` -\> your link from Colab **WITHOUT `API` AT THE END**
* `Disable Auto-URL`: OFF (gray)
![](https://files.catbox.moe/fv9g54.png)
==**for GPTQ / EXL2 models:**== you will get two links that look like this:
* `https://****************.trycloudflare.com/api` -- **main link**
* `wss://****************.trycloudflare.com/api/v1/stream` -- **streaming link** (if you need streaming, optional)
![](https://files.catbox.moe/z1hwj3.png)
in **SillyTavern** select `Text Gen WebUI (ooba/Mancer)` as the API and paste both links:
* the **main link** goes into `Blocking API url`
* the **streaming link** goes into `Streaming API url` (if you need streaming, optional)
--after that click `connect`, ensure that the *model name* appears below... **and chat with AI Mares**
![](https://files.catbox.moe/het9lk.png)
in **Agnai** create the following preset:
* select `AI Service` -\> `Kobold / 3rd party` -\> `Textgen (Ooba)`
* `Third Party URL` -\> your link from Colab **WITHOUT `API` AT THE END**
* `Disable Auto-URL`: OFF (gray)
![](https://files.catbox.moe/7v80xv.png)
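(optional) the same kind of sanity check works for the GPTQ / EXL2 route, only against the oobabooga **main link** above and with the legacy text-generation-webui field names. treat the exact payload as an assumption from that era of the API and adjust if your build differs:
``` python
# test the oobabooga blocking API ("main link"); the URL is a placeholder
import requests

BLOCKING = "https://your-tunnel-name.trycloudflare.com/api"

payload = {
    "prompt": "### Instruction:\nSay hi to Rainbow Dash.\n\n### Response:\n",
    "max_new_tokens": 80,   # the legacy ooba API uses max_new_tokens instead of max_length
    "temperature": 0.84,
    "top_p": 0.9,
}
r = requests.post(f"{BLOCKING}/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```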
***
==**--\> #10**==. if you didn't do it already, then set up [Advanced Formatting](#advanced-formatting) and [Samplers](#knobs)
!!!warning **don't close the tab with Colab, leave it open**.
***
==**--\> #11**==. when you are done chatting: in the *top-right corner* look for the `reconnect` link, next to it *an arrow-down* -\> `Disconnect and delete runtime`
![](https://files.catbox.moe/36ohot.png)
***
***
***
#### LoRAs
a LoRA is a small add-on set of weights trained on extra data (examples, text, lore, a training corpus and other details). it can be plugged directly into a model to serve as extra knowledge
the [Notebook](#notebook) supports two pony-related LoRAs that can be applied to any GPTQ model. the LoRAs were created [by anon in /ppp/ thread](https://desuarchive.org/mlp/thread/40255207/#40271923)
* ==**FIMFarchive-v1**== - dataset of fanfics from the FIMfiction archive
* ==**desu-/mlp/-v1**== - dataset of /mlp/ posts, including greentext
!!!warning **those pony LoRAs can be applied to GPTQ models only**
for the LoRAs to work correctly, set the preset to **Pygmalion, Metharme, Vicuna 1.0 or WizardLM-13B**. other presets may work better/worse depending on the model, so it is a bit of trial-n-error
those two LoRAs *improve the model's awareness of MLP and help with lingo and names*. **reasoning itself is unaffected**, so don't expect models to become much smarter, but they will be able, for example, to deduce that the Element of Honesty is Applejack (with a certain probability). in the pic below you can see the **effect of the applied FIMFarchive-v1** LoRA (before/after):
![](https://files.catbox.moe/gviiub.png)
-> *(Scootaloo as RD's adoptive daughter is too cute, I don't care)* ->
the negative consequence of the LoRAs is spurting random data from the dataset (noise tokens), which can be **mitigated with a lower temperature** or more aggressive sampling. otherwise, **just cut them off**. in my tests they mostly proc on low-context prompts, until the model gets enough tokens to work with
![](https://files.catbox.moe/txac2k.png)
...however, some of those OOC bits are funny and I don't mind their occasional appearance:
![](https://files.catbox.moe/033nq4.png)
***
below are pictures for comparison, done on the basic **Athena** model, then with the two different LoRAs applied, and a fourth column with the **MythoMax** model for extras:

|Athena model|Athena + desu-/mlp/-v1|Athena + FIMFarchive-v1|MythoMax model + FIMFarchive-v1|
|---|---|---|---|
|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152681789383512247/ex1_norm.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152681789651943495/ex1_chag.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152681789916188793/ex1_fim.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152681790167859360/ex1_Myth.png)|
|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683017576402974/ex2_norm.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683018025181194/ex2_chag.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683018255880315/ex2_fim.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683018507526144/ex2_Myth.png)|
|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683342773374976/ex3_norm.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683342991458345/ex3_chag.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683343209570344/ex3_fim.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683343549321276/ex3_Myth.png)|
|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683550521446460/ex4_norm.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683550831804507/ex4_chag.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683551133814855/ex4_fim.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683551762944071/ex4_Myth.png)|
|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683759615881378/ex5_norm.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683758701527070/ex5_chag.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683758995124224/ex5_fim.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683759238398174/ex5_Myth.png)|
|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683935768252586/ex6_norm.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683936321896488/ex6_chag.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683936711987262/ex6_fim.png)|![](https://cdn.discordapp.com/attachments/1152584624367734907/1152683937064292372/ex6_Myth.png)|
***
***
***
#### Advanced Formatting
**Preset/Format**
to use Local models properly you need to set the correct options for them. open the `Advanced Formatting` tab in SillyTavern and check `presets`. here you can switch between various presets - *Alpaca*, *Vicuna*, *Llama 2 Chat*... - to change how the Local will process your prompt:
![](https://files.catbox.moe/t3nj2h.png)
how do you know which preset/format is correct? visit the model's page.
usually the author posts which format the model has been trained on:
![](https://files.catbox.moe/pzssxd.png)
you don't need to tweak anything else here by default - setting the correct preset is enough in most cases; but if you want more control over the text generation, and want to know what all those settings mean, then read further:
***
**Story String**
it acts as your `MAIN` and template formatting (*Agnai users will find it very familiar*)
the absolute minimalist, super default `Story String` looks like this. it basically goes: if there is `system` info then output it, if there is `description` info then output it, etc...
```
{{#if system}}{{system}}{{/if}}
{{#if wiBefore}}{{wiBefore}}{{/if}}
{{#if description}}{{description}}{{/if}}
{{#if personality}}{{personality}}{{/if}}
{{#if scenario}}{{scenario}}{{/if}}
{{#if wiAfter}}{{wiAfter}}{{/if}}
{{#if persona}}{{persona}}{{/if}}
```

|option|description|
|----|----|
|`system`|content of the `system prompt` (read below)|
|`wiBefore`|content of `World Info` / `LoreBook` if you set it to insert BEFORE the `character's card`|
|`wiAfter`|content of `World Info` / `LoreBook` if you set it to insert AFTER the `character's card`|
|`description`|content of the `character's card` (defs)|
|`personality`|content of the character's `Personality summary` (from Advanced Definition)|
|`scenario`|content of the character's `Scenario` (from Advanced Definition)|
|`persona`|content of the user's `Persona Description` (from Persona Management)|

you can move those templates around any way you like and it may affect the generation. for example (using scenario as the `main` while `system` is used as a mid-JB):
```
{{#if scenario}}{{scenario}}{{/if}}
{{#if wiBefore}}{{wiBefore}}{{/if}}
{{#if description}}{{description}}{{/if}}
{{#if personality}}{{personality}}{{/if}}
{{#if wiAfter}}{{wiAfter}}{{/if}}
{{#if persona}}{{persona}}{{/if}}
{{#if system}}{{system}}{{/if}}
```
...furthermore, you can add extra data to (hopefully) aid the AI:
```
{{#if system}}### Roleplay instructions: {{system}}{{/if}}
{{#if scenario}}### Roleplay Scenario: {{scenario}}{{/if}}
{{#if description}}### {{char}}'s description: {{description}}{{/if}}
{{#if persona}}### {{user}}'s description: {{persona}}{{/if}}
{{#if personality}}### {{char}}'s personality: {{personality}}{{/if}}
### Roleplay facts and memory:
{{#if wiBefore}}{{wiBefore}}{{/if}}
{{#if wiAfter}}{{wiAfter}}{{/if}}
```
...in addition, you can manually add your own instructions here; some people recommend adding `### Input:` at the start, which supposedly helps the AI --but it can be snake oil. for example:
```
### Input:
This is a back and forth roleplay between Human and AI based on My Little Pony (MLP). You are an author, tasked with crafting a captivating, memorable narrative based on the provided instructions and chat. Human writes for {{user}}; while AI writes for {{char}}, other characters, and narrates roleplay. Throughout the roleplay, describe every action in vivid details, and use dialogue effectively to advance the plot. Remember, the goal is to leave a lasting impression on Human.
### Description:
{{#if description}}{{description}}{{/if}}
{{#if personality}}{{personality}}{{/if}}
{{#if persona}}{{persona}}{{/if}}
{{#if scenario}}{{scenario}}{{/if}}
{{#if wiBefore}}{{wiBefore}}{{/if}}
{{#if wiAfter}}{{wiAfter}}{{/if}}
### Guidelines:
{{#if system}}{{system}}{{/if}}
```
for **Agnai** use the following settings:
* `Use Prompt Template`: ON (red)
* `Prompt Template` ([source](https://rentry.org/tsukasa13b)):
```
<|system|>{{#if system_prompt}}{{system_prompt}}{{/if}}
{{char}}'s Persona: {{personality}}
{{#if example_dialogue}}
{{char}} talks like: {{example_dialogue}}
{{/if}}
{{#if scenario}}
This scenario of the conversation is: {{scenario}}
{{/if}}
{{#each msg}}{{#if .isbot}}<|model|>{{/if}}{{#if .isuser}}<|user|>{{/if}}{{.name}}: {{.msg}}
{{/each}}
{{#if ujb}}<|system|>{{ujb}}{{/if}}
<|model|>{{post}}
```
***
**System Prompt**
the `System Prompt` will be inserted in place of the `{{system}}` template in the `Story String`. you may utilize it as a `MAIN` that goes first, before anything else:
* `Story String` (start):
```
{{#if system}}{{system}}{{/if}}
{{#if wiBefore}}{{wiBefore}}{{/if}}
<...>
```
* `System Prompt`:
``` js
This is a back and forth roleplay between Human and AI based on My Little Pony (MLP).
```
...or use it at the end of the `Story String` and utilize it as a pseudo `AN` with extra instructions:
* `Story String` (end):
```
<...>
{{#if wiAfter}}{{wiAfter}}{{/if}}
{{#if system}}{{system}}{{/if}}
```
* `System Prompt`:
``` js
Back-n-forth roleplay / Stay in character / Provide details imagery full of action and movements / Include dialogues and environment / Use knowledge of MLP and equine anatomy.
```
--which approach to use depends on your personal preference, the current RP and the current model
***
**Example Separator and Chat Start**
* `Example Separator` is added before every provided `Dialogue Examples` block
* `Chat Start` is added before the actual `Chat` starts (after all the examples)
...these two markers aim to help the AI distinguish what is part of the actual narrative and what only serves as an example
![](https://files.catbox.moe/ef28nu.png)
a typical guideline is to use the following separators:
``` js
Example Separator: ### Example Dialogue:
Chat Start: ### New Roleplay:
```
I personally like the following approach - announcing that all examples are done and now we proceed to the roleplay itself:
``` js
Example Separator - ### Example Dialogue Start:
Chat Start - ### Example Dialogue End. New Roleplay Begins:
```
***
**Instruct mode and Sequences**
`Instruct mode` lets the model create text based on your prompt, which is exactly what you want in RP. if you disable `Instruct mode` then the AI will instead try to CONTINUE the current prompt, see the pic below
![](https://files.catbox.moe/3fd2ah.png)
in most cases `Instruct mode` shall be **ON**, unless you have a reason to ask the AI to continue the text instead of completing it.
it seems like writing greentext is a legit case where `Instruct mode` shall be **OFF**, but it depends on how you use Locals
***
* `Input Sequence` is added before every User response (in the chat history)
* `Output Sequence` is added before every AI response (in the chat history)
...these two markers aim to help the AI distinguish the roles in the story
![](https://files.catbox.moe/4tpwzf.png)
a typical guideline is to use the following sequences:
``` js
Input Sequence: ### Instruction:
Output Sequence: ### Response:
```
...but you can use whatever format you want, and maybe the model will click better for you. some examples:
``` js
Input Sequence: Human:
Output Sequence: AI:

Input Sequence: ---Human turn:
Output Sequence: ---AI turn:

Input Sequence: ### {{user}}:
Output Sequence: ### {{char}}:
```
***
* `First Output Sequence` is added before the very *first* AI response
* `Last Output Sequence` is added before the very *last* AI response
...usually you don't need to touch `First Output Sequence`, but with `Last Output Sequence` you can add extra instructions right before prompt generation, for example:
``` js
Last Output Sequence: ### Response (write as a talented MLP fanfic author):

Last Output Sequence: ### Response (your answer shall be brief and concise):

Last Output Sequence: ### Response (reply in Old English with a mix of UPPERCASE words to emphasize Luna's aloofness and her royal Canterlot voice):
```
...of course you can go completely overboard and do stuff like this, and consider it to be a `mini-JB`:
``` js
Last Output Sequence:
Now it is your time to reply. Follow the instructions below:
- Maintain spatial understanding to ensure extremely realistic scenes and anatomically correct character interactions.
- Amplify non-human anatomy of MLP characters.
- MLP characters wear nothing.
- Pony characters use ponyfied language.
- Obey MLP canon. Your TOP PRIORITY is to make sure that EACH character stays as true to their canon appearances, personalities, mannerisms, and speech patterns as possible.
- Use English language creatively, in vivid details.
Proceed with your answer.
### Response:
```
***
* `System Sequence Prefix` is added before the whole prompt
* `System Sequence Suffix` is added after the whole prompt
the point of these sequences is to comply with the quirky formats of certain models. you most likely don't need to touch them on your own
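to make the whole Story String + sequences pipeline less abstract, here is a simplified illustration of the order in which those pieces end up in the final prompt. this is **not** SillyTavern's actual code, just a rough sketch of the assembly logic with made-up strings:
``` python
# rough illustration of how Story String, Input/Output Sequences and the
# Last Output Sequence get assembled into one prompt (not SillyTavern's real code)
story_string = (
    "### Roleplay instructions: Stay in character.\n"
    "### Twilight Sparkle's description: The Princess of Friendship, loves books.\n"
)
input_seq = "### Instruction:\n"
output_seq = "### Response:\n"
last_output_seq = "### Response (write as a talented MLP fanfic author):\n"

chat = [
    ("user", "Twilight, what are you reading?"),
    ("char", "*She looks up from the scroll.* Star charts!"),
    ("user", "Show me."),
]

prompt = story_string
for role, text in chat[:-1]:
    prompt += (input_seq if role == "user" else output_seq) + text + "\n"
# the newest user message, then the Last Output Sequence right before generation starts
prompt += input_seq + chat[-1][1] + "\n" + last_output_seq
print(prompt)
```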
***
**Stop Sequence, EOS, Custom Stopping Strings**
`Stop Sequence` is a special symbol commanding the Local to stop the generation. upon reaching that token the Local will stop producing more content until the next prompt is issued. it acts as a stop-signal both for the Local itself and for SillyTavern/Agnai, so they do not expect any tokens afterwards (which is especially useful with `streaming`)
the most common `Stop Sequence` is `</s>` - unless the Local was trained to use something different
via `Custom Stopping Strings` *(on the right)* you may include more tokens and word combinations that will stop the generation. the difference is that the `Stop Sequence` is a special indicator produced by the Local itself to stop the generation, while a `Custom Stopping String` can be pretty much any string, word, letter or sign that can be part of the original generation. for example, you may include here the word "ministration" and Locals will stop the generation when that word appears in the text
for `Custom Stopping Strings` you must use [JSON format](https://jsonlint.com/), example:
```
["\nInstruction", "</s>"]
```
*that combination will force the Local to stop generation when a new line starts with the word "Instruction" or when the `</s>` tag is used*
if you want to disable any stop sequences then look at the `Ban EOS` option (EOS stands for "End of Sequence"): it will force the Local to keep generating tokens despite the `Stop Sequence`. be warned that the model will go schizo very fast
![](https://files.catbox.moe/4ak9su.png)
***
**Tokenizer and padding**
!!!warning Set `Tokenizer` to `Sentencepiece (LLaMA)`
a proper Tokenizer is required to count the tokens for `Response Length (tokens)` and `Context Size (tokens)` correctly. `Sentencepiece` works with all Locals
`padding` is a *"token reserve"* to avoid overflow. with *lower values* the AI will try to output as many tokens as possible without considering the actual length of the prompt; with *higher values* the AI will look ahead and finish its reply prematurely before it overflows the completion. usually setting `padding` to `64` or `96` is enough
***
**Wrap Sequences with Newline**
this option adds newlines before the sequences. without it they are appended on the same line. good for content clarity
![](https://files.catbox.moe/jwhj95.png)
***
**Replace Macro in Sequences**
**Replace Macro in Custom Stopping Strings**
"macros" are various entries inside curly brackets `{{}}` that are automatically transformed into the proper words; for example `{{char}}` stands for the card's name and `{{user}}` stands for the user's persona. if you disable these options then the literal strings `{{char}}` and `{{user}}` will be used instead of the names
***
**Include Names + Force for Groups and Personas**
this option appends names (both `{{user}}` and `{{char}}`) to the prompt. it is a great thing for RP since it helps the AI follow the roles better, but if you are doing story-writing then you are *probably* better off turning it off. some models may confuse the roles, and enabling/disabling Names may help those models get better context
![](https://files.catbox.moe/xy3lf6.png)
***
***
***
#### Knobs
**tldr**
!!!note don't want to read all that stuff and just need settings to copy-paste?
* [freellamas](https://rentry.org/freellamas)
* [tsukasa13b](https://rentry.org/tsukasa13b)
* my preset below
* @todo: need more
Big Models support mostly `Temperature`, `Top_P`, `Top_K` - and, in case of GPT4, `penalties`. Locals, on the other hand, support many more knobs and offer a huge assortment of options which can make your head dizzy, but things ain't hard
!!!warning **ENABLE `DO SAMPLING` IN OPTIONS** otherwise most of those settings will not work for you
***
**my preset**
these are the settings I use for most models. not necessarily the best settings --but that is what I have settled on
!!!note Temperature must be changed according to the model
some Locals perform better on a lower Temperature, while others require a higher one. Mythomax, for example, benefits from a Temperature in the 0.6-0.8 range, while higher values turn it into a schizo

|sampler|Default|More Stable|More Creative|Mirostat Default|Mirostat Creative|Mirostat Schizo|
|---|---|---|---|---|---|---|
|Temperature|1.05|0.84|1.25|0.84|0.95|1.40|
|Repetition Penalty|1.06|1.11|1.15|1.06|1.05|1.03|
|Repetition Penalty Range|1320|1320|1320|1320|1320|1320|
|Rep. Pen. Slope|0.8|0.9|0.8|0.8|0.8|0.8|
|Encoder Repetition Penalty|1.01|1.01|0.99|1.02|1.02|1.03|
|No Repeat Ngram Size|0|0|0|0|0|0|
|Top K|0|0|0|0|0|0|
|Top P / Nucleus|0.90|0.80|0|0|0|0|
|Typical P / Typical Sampling|0.9|0.9|0.95|1|1|1|
|Top A|0|0.1|0.1|0|0|0|
|TFS / Tail Free Sampling|0.94|0.92|0.94|1|1|1|
|Epsilon Cutoff|0|0|0|0|0|0|
|Eta Cutoff|0|0|0|0|0|0|
|Number of Beams|1|1|1|1|1|1|
|Length Penalty|1|1|1|1|1|1|
|Penalty Alpha|0|0|0|0|0|0|
|Mirostat Mode|0|0|0|2|2|2|
|Mirostat Tau|0|0|0|4.0|3.5|5.0|
|Mirostat Eta|0|0|0|0.15|0.20|0.30|
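if you ever want to script generation instead of clicking through SillyTavern, the "Default" column above maps roughly onto a kobold.cpp `/api/v1/generate` payload like the one below. the field names follow the KoboldAI-style API as I understand it, so treat them as an assumption and check your kobold.cpp build; the transformers-only knobs (Encoder Rep. Pen., Ngram, Beams, Penalty Alpha, the Cutoffs) have no kobold.cpp equivalent:
``` python
# the "Default" column from the table above expressed as kobold.cpp sampler fields;
# transformers-only knobs (encoder rep. pen., ngram, beams, penalty alpha, cutoffs)
# are omitted because kobold.cpp does not expose them
default_preset = {
    "temperature": 1.05,
    "rep_pen": 1.06,          # Repetition Penalty
    "rep_pen_range": 1320,
    "rep_pen_slope": 0.8,
    "top_k": 0,
    "top_p": 0.90,
    "typical": 0.9,           # Typical P
    "top_a": 0,
    "tfs": 0.94,              # Tail Free Sampling
    "mirostat": 0,            # Mirostat Mode (0 = off)
    "mirostat_tau": 0,
    "mirostat_eta": 0,
}
# merge it into the /api/v1/generate payload from the sanity-check snippet in the Notebook section:
# payload = {"prompt": "...", "max_length": 300, **default_preset}
```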
***
**presets**
SillyTavern offers prebuilt presets you may load and use without figuring out what all those knobs do.
![](https://files.catbox.moe/pztk0n.png)
one [anon made a good rundown of most presets](https://rentry.org/llm-settings#presets-presets-everywhere). imho, for **Mythomax** the best presets are:
- Titanic
- Space Alien
- Naive
- TFS-with-Top-A
- Midnight Enigma
***
**Temperature (temp)**
your absolute basic knob. it **determines the pool of tokens (words)** and their overall number:
- `higher`: more tokens in the pool, more random tokens
- `lower`: fewer tokens in the pool, more predictable tokens
- `disable`: 1 (default)
at `temp: 0` the AI becomes 100% *deterministic* and always outputs the same text regardless of the seed or swipe
there are two main approaches:
- *set temp to a high value* (**1.4+**) and then use the other knobs *generously* to trim unnecessary tokens
- *set temp to a low value* (**0.6+**) and then use the other knobs *sparingly* to avoid trimming good tokens
overall, a safe `temp` value is in the **0.8-1.2** range, depending on how much creativity you need. if you are going with a high `temp` then you need to counter-balance it with the other sampling methods below
***
**Top P & Top K**
`Temp`'s two younger sisters that always follow her around
`Top P` (aka "nucleus sampling") sets the range of tokens to be admitted into the pool. it **sums the probabilities of tokens** (from most to least likely) until the total sum of picked tokens exceeds the given value - then it **discards all leftover tokens**. so it keeps only the mass of highly probable tokens
- `higher`: more random tokens (less sampling effect)
- `lower`: more predictable tokens (more sampling effect)
- `disable`: 1
`Top K` is a dumbed-down `Top P`. instead of actually calculating the sum of token probabilities, it just **picks the given number of tokens** in a row and **cuts everything else**. in a nutshell it limits the total pool to that many tokens
- `number`: allow that many tokens into the selection
- `disable`: 0
if you set `Top K: 1` (only the 1 most likely token to pick from), then the AI becomes fully deterministic, because you effectively cut the whole pool down to the single most likely token, which will always be the same
neither `Top P` nor `Top K` is ideal, because `Top P` tends to put random tokens into the pool, while `Top K` cuts quite likely tokens. the pic below explains the shortcomings of both samplers. `Top A` below combines both their methods and 'patches' their issues
however, it might be **a good idea to set `Top K` to some high value (50-80)** to cut unnecessary tokens firsthoof and allow the other samplers to work with more relevant data
![](https://files.catbox.moe/q3a6hl.png)
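a tiny toy example of what these two cuts do to a (made-up) next-token distribution, just to make the mechanics concrete:
``` python
# toy demonstration of Top K vs Top P on a made-up next-token distribution
probs = {"mare": 0.40, "pony": 0.25, "pegasus": 0.15, "unicorn": 0.10,
         "alicorn": 0.05, "griffon": 0.03, "toaster": 0.02}

def top_k(p, k):
    # keep only the k most likely tokens, regardless of their probabilities
    return dict(sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:k])

def top_p(p, threshold):
    # keep the most likely tokens until their summed probability reaches the threshold
    kept, total = {}, 0.0
    for tok, prob in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        total += prob
        if total >= threshold:
            break
    return kept

print(top_k(probs, 3))      # {'mare': 0.4, 'pony': 0.25, 'pegasus': 0.15}
print(top_p(probs, 0.90))   # keeps mare, pony, pegasus, unicorn (0.90 total), drops the tail
```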
***
**Top A**
`Top A` uses the mass of the most likely token (the most probable one) and vets the other tokens based on it. so instead of using some arbitrary sums, **it uses the chance of the most likely token to cut the pool**. due to its design it is very aggressive and can lead to dry results even on high settings
- `higher`: more predictable tokens (more sampling effect)
- `lower`: more random tokens (less sampling effect)
- `disable`: 0
![](https://files.catbox.moe/8xxaar.png)
***
**Typical P**
`Typical P` (aka 'locally typical sampling') is an unorthodox knob. it **uses an entropy value to cut both the tokens with too low probability and those with too high probability**. it builds on the idea that natural human speech uses median words, where every token has about the same probability as the previous one (aka *uses tropes and cliches*). the value of this sampler determines how strict the entropy filter is. while it is *very good for summarization* and providing facts, **its usage in RP or storywriting can cause more harm than good**
- `higher`: more random tokens (less sampling effect)
- `lower`: more predictable tokens (more sampling effect)
- `disable`: 1
***
**Tail Free Sampling**
`Tail Free Sampling` (TFS) is an extremely complicated sampling method **based on the differences between the tokens' probabilities**. in the first round `TFS` calculates the differences between token probabilities and *normalizes* them against the median (it calculates the median of all differences). in the second round it cuts off the tokens below the threshold. the common advice is **to not set it too low** unless you need deterministic text
- `higher`: more random tokens (less sampling effect)
- `lower`: more predictable tokens (more sampling effect)
- `disable`: 1
***
**Epsilon Cutoff & Eta Cutoff**
basically the **same thing as `Tail Free Sampling`, but separated into two different values**. see how the two rounds were mentioned for `TFS`? `Epsilon Cutoff` determines how aggressive the normalization across the board will be, while `Eta Cutoff` sets the threshold for cutting tokens
- `higher`: more predictable tokens (more sampling effect)
- `lower`: more random tokens (less sampling effect)
- `disable`: 0
***
**Repetition Penalty**
`Repetition Penalty` **reduces the appearance of tokens that are already present in the text**. if the word '*ministration*' is already in the prompt (20 sentences ago), then this knob decides whether another appearance of '*ministration*' is allowed. while it sounds like a nice feature, too high a value will lead to unnatural text full of synonyms and similes
- `higher`: less word repetition allowed
- `lower`: more word repetition allowed
- `disable`: 1
***
**Repetition Penalty Range**
`Repetition Penalty Range` tells the AI **how far back to look for repetitive words** (in tokens)
- `number`: take that many tokens into consideration when determining repetitions
- `disable`: 0 (the whole prompt)
***
**Rep. Pen. Slope**
`Rep. Pen. Slope` determines how aggressive the anti-repetition measures must be closer to the end of the chat. if you set the value low then the AI will allow some repetition to slip in; setting it too high prevents the AI from using the same words near the end of the chat. you must use `Repetition Penalty Range` for `Rep. Pen. Slope` to work
- `higher`: less word repetition allowed at the end of the chat
- `lower`: more word repetition allowed at the end of the chat
- `disable`: 0
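for the curious: the basic `Repetition Penalty` above is usually applied with the divide-positive / multiply-negative rule popularized by the CTRL paper. a simplified sketch of that rule, not the exact code of any particular backend (implementations differ in details):
``` python
# simplified sketch of the common repetition penalty rule: logits of tokens that
# already appeared in the (range-limited) context get pushed down
def apply_rep_pen(logits, seen_tokens, penalty=1.15):
    out = dict(logits)
    for tok in seen_tokens:
        if tok in out:
            score = out[tok]
            # divide positive logits, multiply negative ones - both make the token less likely
            out[tok] = score / penalty if score > 0 else score * penalty
    return out

logits = {"ministration": 2.0, "touch": 1.5, "wings": -0.5}
print(apply_rep_pen(logits, {"ministration", "wings"}))
# ministration: 2.0 -> ~1.74, wings: -0.5 -> -0.575; "touch" is untouched
```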
***
**No Repeat Ngram Size**
`Ngram` is sorta a plugin for repetition detection and **sets the number of tokens in a row to check for repetition**. for example, the phrase '*the night is still young*' is 6 tokens long. if you set this knob to '6' then the AI will prevent that phrase from being generated again in the text (based on `Repetition Penalty`'s aggressiveness), but setting it to '7' would allow that phrase to be repeated. while useful, again, too low a value *will make the text incomprehensible*
- `number`: check that many tokens in a row for repetition
- `disable`: 0
***
**Encoder Rep. Pen.**
`Encoder Repetition Penalty` is a special knob that tries to fight the issue of text becoming incomprehensible under the other repetition penalties. it **takes the context into account and allows some tokens that would otherwise have been banned** to slip in. in other words, this option prevents tokens from being banned under regular circumstances if that improves text legibility
- `higher`: more token repetition
- `lower`: less token repetition (set it to 0.8 and have a laugh)
- `disable`: 1
***
**Negative Prompt & CFG Scale**
!!!warning Negative prompts & CFG double the VRAM usage, effectively cutting your max context in half
`Negative prompt` **tells the AI what NOT to generate**. you can do lots of things there:
* give a *general direction* of what you don't want to see (per @ada1's preset):
```
anthro, anthropomorphic, Equestria Girls, tropes, bland, poor characterization, summary, logical error, illogical, incoherent, unintelligible, inarticulate, incomprehensible, out of character, omnipresent, omniscient, summary, forum post, article, OOC, give pony characters human anatomy
```
* tell the AI what it *shall not* do:
```
Speak for {{user}} and describe its actions.
```
* provide *text with bad grammar* and bad sentence structure ([source](https://ask-max.neocities.org/prompt-buffet/)):
```
he walked. he ran. he screamed. he was lost. it was hopeles. he said, “theres no way out." she said, “ur doomed lol”
```
`CFG Scale` determines **how powerful the negative prompt's effect** must be:
- `higher`: stay away from the negative prompt as much as possible
- `lower`: embrace the negative prompt and apply it to the generation
- `disable`: 1
"*why, in the name of Celestia's second horn, would anyone want to apply a negative prompt?*", you may ask. the answer is simple - you can provide (literally) random text in that field and set the AI to apply it; it will then act as a second-hand randomizer/seed ([source](https://rentry.org/primeanon#primer_1)) - which is a damn good hack
***
**Beam search**
!!!warning Beams don't work in free Google Colab because of insufficient VRAM
`Beams` allow the AI to **generate multiple candidate continuations** and then pick the one that offers the best token probabilities. it results in more organic text, but uses a huge amount of VRAM. ironically, the AI can generate much worse text in return, because it will favor the most probable text over the more creative one and avoid unsure tokens
`Number of Beams` tells **how many different variants to generate** in total, while `Length Penalty` determines **the length of each generation**
***
**Penalty Alpha**
`Penalty Alpha` flips the whole sampling system on its head and **adds a degeneration penalty, allowing some level of mistake into the selection**. now all sampling methods will allow a **margin of error** to (supposedly) make the generation more natural, creative, random and unpredictable
- `higher`: bigger margin of error, less predictable text
- `lower`: smaller margin of error, more predictable text
- `disable`: 0
***
**Mirostat**
infamous `Mirostat`.
in a nutshell, very simplified, it is **an auto-pilot mode that takes control of the sampling and adapts the generation based on the currently picked token, buffed by a randomness factor**. some people enjoy it because it is a simple plug-n-play knob that does most of the work by itself, while others dislike it because of the randomness in generation
the basic idea behind Mirostat goes like this: token prediction and selection work as usual, but Mirostat checks which token was picked:
* if the AI picked a token with **high probability**, then **Mirostat slightly boosts the chance of all other tokens via the samplers** in the next selection
* if the AI picked a token with **low probability**, then **Mirostat slightly lowers the chance of the other tokens via the samplers, but leaves the most likely token intact** in the next selection, allowing the AI to pick a higher-chance token next time
!!!note Mirostat takes control of the samplers: Top P, Top K, Typical P, Top A, Tail Free Sampling, Epsilon Cutoff & Eta Cutoff will not work when Mirostat is on. Temperature and penalties still work
Mirostat constantly adapts the token chances, trying to balance between selecting random and predictable tokens. you could say that Mirostat works like a swing, picking between both good and bad tokens, and plays devil's advocate in between
`Mirostat Eta` (learning rate) sets **the overall aggressiveness of Mirostat** - how much it will affect the tokens and how often it will affect the generation
- `higher`: more random tokens
- `lower`: more predictable tokens
- `disable`: 0
`Mirostat Tau` sets the amplitude that Mirostat will apply to tokens with each generation (**changing their chance**)
- `higher`: more random tokens
- `lower`: more predictable tokens
- `disable`: 0
`Mirostat Mode` sets which version of Mirostat to use:
- `0`: disabled
- `1`: mode for llama.cpp
- `2`: mode for exllama_hf (**the Colab notebook works on that**)
setting the correct values for `Mirostat` is not easy and depends on the model and the other sampling methods. some good variations from the anons in /lmg/:
``` js
Tau   Eta
3     0.1
3     0.3
4     0.2
5     0.1
6     0.4
```
***
***
***
#### credits, backlinks & shoutout
* [colabfreellamas](https://rentry.org/colabfreellamas)
* authors of the two Colab notebooks: [one](https://files.catbox.moe/jgqjre.ipynb) & [two](https://files.catbox.moe/kd79da.ipynb)
* author of the [two pony LoRAs](https://desuarchive.org/mlp/thread/40255207/#40271923) and the [pony-related Colab notebook](https://colab.research.google.com/drive/1Un02h4uQN6zLlgL3zmMOeyce9ICEA3qO)
* author of the two pony-trained LLaMA 2 13B models: [LL2-13B-DesuMLP-QLORA-GGML](https://huggingface.co/Nikolai1902/LL2-13B-DesuMLP-QLORA-GGML) & [LL2-13B-FIMFiction-QLORA-GGML](https://huggingface.co/Nikolai1902/LL2-13B-FIMFiction-QLORA-GGML)
* author of the [pony-related corpus data](https://huggingface.co/tekkithorse)
* [ada1's preset](https://desuarchive.org/mlp/thread/40293067/#40311295)
* Kalomaze for the [GGUF colab](https://colab.research.google.com/github/kalomaze/koboldcpp/blob/alternate_colab/Koboldcpp_Colab_(Improved_Edition).ipynb) and for the numerous tests and reports
* Undi95 for the [non-stop merges](https://huggingface.co/Undi95) and especially for Amethyst
* [freellamas](https://rentry.org/freellamas)
* [llm-settings](https://rentry.org/llm-settings)
* [hostfreellamas](https://rentry.org/hostfreellamas)
* [primeanon](https://rentry.org/primeanon)
* [darkstardestinations](https://darkstardestinations.com/parametersTab)
* [ask-max](https://ask-max.neocities.org/presets/#mythomax)